Is it a Tree or a Bush? How K-NN Decides When Nature Won’t
This blog is more for me than anyone else. It is an AI-generated summary of material from the machine learning course I am taking at Central Piedmont Community College. It helped me grapple with how mechanical processes produce apparently thoughtful results.
You’re hiking. You spot a woody plant: about chest-high, lots of stems, kind of tree-ish, kind of bush-ish. Your brain hesitates—tree or bush? That hesitation is exactly the kind of problem machine learning tackles: how to label things that sit on the boundary between categories.
Below is a friendly walk-through of how we might define the label, and then how k-Nearest Neighbors (K-NN) would assign it.
Step 1: What do we even mean by “tree” vs. “bush”?
Before algorithms, there’s ontology—the rules of what counts as what.
Possible labeling schemes:
Binary, mutually exclusive: tree or bush.
Multiclass with an "either" class: tree, bush, either.
Multilabel: it can be both (tree=1, bush=1)—useful when categories overlap.
Soft labels / probabilities: "70% tree, 30% bush," reflecting uncertainty or expert disagreement.
If you’re building a model, choose your labeling scheme first. Your modeling choice (K-NN, trees, neural nets) should reflect the nature of the label you want, not the other way around.
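To make those four schemes concrete, here's a tiny Python sketch (with invented values for one ambiguous specimen) of how each label might be stored; none of this is tied to any particular library.

```python
# Four ways to label the same ambiguous woody plant (illustrative values only).

# 1. Binary, mutually exclusive: one string, forced choice.
binary_label = "tree"

# 2. Multiclass with an explicit "either" class.
multiclass_label = "either"

# 3. Multilabel: each category gets its own 0/1 flag; both can be on.
multilabel = {"tree": 1, "bush": 1}

# 4. Soft labels: class probabilities that sum to 1, capturing uncertainty.
soft_label = {"tree": 0.7, "bush": 0.3}

print(binary_label, multiclass_label, multilabel, soft_label)
```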
Step 2: Turn botany into numbers (features)
K-NN doesn’t read Latin names; it measures distances in feature space. Examples:
Height (m)
Dominant stem count (1 vs many)
Trunk diameter (cm at 10–50 cm above ground)
Branching height (how high the first major branches start)
Canopy shape (e.g., width/height ratio)
Leaf persistence (evergreen vs deciduous)
Growth habit score (expert-rated 0–1 “treeness”)
Age (if known; some plants start bushy, become tree-like)
Good features shrink ambiguity; bad features blur boundaries.
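Here's a rough sketch of what "turning botany into numbers" can look like in Python. The measurements are invented, the categorical features get simple 0/1 encodings, and the mean/std values used for standardization are placeholders you would really compute from labeled training data.

```python
import numpy as np

# Raw measurements for our ambiguous plant (made-up values).
plant = {
    "height_m": 1.4,             # about chest-high
    "dominant_stems": 5,         # many stems -> bush-ish
    "trunk_diameter_cm": 6.0,    # measured 10-50 cm above ground
    "branching_height_m": 0.3,   # first major branches start low
    "canopy_width_height": 1.1,  # width/height ratio
    "evergreen": 1,              # 1 = evergreen, 0 = deciduous
    "treeness_score": 0.55,      # expert-rated 0-1 growth habit
}

# Feature vector in a fixed column order.
x = np.array([list(plant.values())], dtype=float)

# Standardize using statistics from a labeled reference set
# (placeholder numbers here; compute them from your own training data).
train_mean = np.array([3.0, 2.0, 15.0, 1.0, 0.8, 0.5, 0.5])
train_std = np.array([2.5, 2.0, 12.0, 0.8, 0.4, 0.5, 0.3])
x_scaled = (x - train_mean) / train_std
print(x_scaled.round(2))
```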
Step 3: How K-NN labels the plant
K-NN in one line: find the k closest labeled examples; vote.
Choose a distance: Euclidean is common; for mixed numeric/categorical features, you might use Gower or encode categoricals carefully.
Pick k: Small k (e.g., 1–3) is sensitive to local quirks; larger k smooths noise but can wash out minority classes.
Weight neighbors (optional): closer neighbors count more (e.g., 1/distance).
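Putting those three choices together, a minimal from-scratch sketch of distance-weighted K-NN might look like this (Euclidean distance, 1/distance weights; the function name and toy data are mine, not from a library):

```python
import numpy as np

def knn_vote(query, X, y, k=5, weighted=True):
    """Return per-class scores for `query` from its k nearest rows of X."""
    dists = np.linalg.norm(X - query, axis=1)   # Euclidean distance to every labeled example
    nearest = np.argsort(dists)[:k]             # indices of the k closest
    scores = {}
    for i in nearest:
        w = 1.0 / (dists[i] + 1e-9) if weighted else 1.0  # closer neighbors count more
        scores[y[i]] = scores.get(y[i], 0.0) + w
    return scores

# Toy usage (scaled, invented features): the hard label is the class with the
# largest score; dividing each score by the total gives soft labels.
X = np.array([[0.9, 0.8], [0.8, 0.7], [0.2, 0.1], [0.3, 0.2], [0.6, 0.5]])
y = ["tree", "tree", "bush", "bush", "tree"]
print(knn_vote(np.array([0.5, 0.45]), X, y, k=3))
```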
Outcome types with K-NN:
Hard label (majority vote): tree if most of the neighbors are trees.
Soft label (class probabilities): proportion of neighbors per class—great for "70% tree, 30% bush."
Multilabel vote: for each label, tally neighbors that have it; thresholds decide inclusion.
Example:
k = 7 neighbors → 4 labeled tree, 3 labeled bush
Hard label: tree
Soft label: P(tree)=0.57, P(bush)=0.43
That’s it: no training phase beyond storing the data, just measuring and voting.
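If you want to check that arithmetic, scikit-learn's KNeighborsClassifier reproduces it on a toy dataset of exactly those seven neighbors (the coordinates are invented; only the 4-vs-3 split matters):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Seven labeled neighbors in a 2-D feature space
# (trunk diameter, branching height), already scaled; values are invented.
X = np.array([
    [0.9, 0.8], [0.8, 0.9], [1.0, 0.7], [0.7, 0.6],   # 4 trees
    [0.2, 0.1], [0.3, 0.2], [0.1, 0.3],               # 3 bushes
])
y = ["tree", "tree", "tree", "tree", "bush", "bush", "bush"]

clf = KNeighborsClassifier(n_neighbors=7)  # k = 7: every point gets a vote
clf.fit(X, y)

query = np.array([[0.5, 0.45]])            # our mid-slope plant
print(clf.predict(query))                  # hard label: ['tree']
print(clf.predict_proba(query))            # soft label: [[3/7, 4/7]], about [0.43, 0.57]
print(clf.classes_)                        # column order of the probabilities: ['bush' 'tree']
```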
When K-NN gets it wrong (or wobbly)
Shifty boundaries: Many species are truly in-between (think coppiced trees or arborescent shrubs). If your ontology is binary, K-NN will force a choice even when nature doesn’t.
Feature scaling: If trunk diameter ranges 0–80 cm but canopy ratio is 0–2, diameter will dominate Euclidean distance unless you standardize (see the sketch after this list).
Class imbalance: If 90% of your samples are bushes, a borderline tree may get labeled bush. Use class-balanced weighting or tuned k.
Local bias: Atypical but nearby neighbors can sway the vote. Distance weighting helps.
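For the scaling problem in particular, a common fix is to standardize every feature before measuring distance. A sketch using a scikit-learn Pipeline on a tiny invented dataset, so the same scaling is applied at fit and predict time:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Tiny synthetic dataset: trunk diameter in cm (0-80) and canopy width/height ratio (0-2).
# Without scaling, diameter's larger range would dominate the Euclidean distance.
X = np.array([[40.0, 0.6], [55.0, 0.5], [70.0, 0.7],   # trees
              [ 3.0, 1.6], [ 5.0, 1.8], [ 8.0, 1.4]])  # bushes
y = ["tree", "tree", "tree", "bush", "bush", "bush"]

# The pipeline standardizes each feature to mean 0, std 1, then runs K-NN.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[12.0, 1.2]]))  # thin trunk, wide canopy -> ['bush']
```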
What about Decision Trees (and “forests”)?
Decision Trees carve space with if-then rules:
If dominant stems > 1 and branching height < 0.5 m → bush
Else if trunk diameter > 10 cm and height > 3 m → tree
Else …
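Those rules translate almost word-for-word into code. A sketch of the same if-then logic as a plain Python function; the final "Else …" branch is left open, as in the prose, so it flags the case instead of guessing:

```python
def classify_plant(dominant_stems, branching_height_m, trunk_diameter_cm, height_m):
    """Hand-written decision-tree rules from the text; not a fitted model."""
    if dominant_stems > 1 and branching_height_m < 0.5:
        return "bush"
    elif trunk_diameter_cm > 10 and height_m > 3:
        return "tree"
    else:
        return None  # the "Else ..." case: ambiguous, hand off to a human or another model

print(classify_plant(dominant_stems=5, branching_height_m=0.3,
                     trunk_diameter_cm=6, height_m=1.4))  # -> 'bush'
```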
They’re interpretable, which is helpful when you need a policy you can explain: “We call it a tree if it’s taller than 3 m and has a single main stem,” etc.
Random Forests / Gradient-Boosted Trees average many trees, giving robust probabilities. They can still reflect ambiguity (e.g., 0.52 vs 0.48), but with fewer odd edge cases than a single tree.
K-NN vs Trees, quick compare:
Interpretability: Trees win (explicit rules).
Local nuance: K-NN wins (uses real nearby examples).
Training cost: K-NN “trains” instantly; trees need fitting.
Prediction cost: Trees are fast; K-NN can be slow on big datasets (needs nearest-neighbor search; see the sketch after this list).
Feature scaling sensitivity: K-NN is sensitive; trees less so.
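On that prediction-cost point: in practice the neighbor search is usually accelerated with a spatial index. A sketch using scikit-learn's `algorithm` parameter on synthetic data (exact speedups depend on your data and machine):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 4))                 # synthetic 4-feature plant data
y = rng.choice(["tree", "bush"], size=50_000)

# "auto" lets the library pick a strategy; "kd_tree" or "ball_tree" build a spatial
# index at fit time so each query examines far fewer points than a brute-force scan.
knn = KNeighborsClassifier(n_neighbors=7, algorithm="kd_tree")
knn.fit(X, y)
print(knn.predict(rng.normal(size=(1, 4))))
```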
How should we label the “could-be-either” plant?
Pick the scheme that serves your use case:
Compliance / inventory systems: need one box? Use hard labels, but log the probability so borderline cases are reviewable.
Field guides / education: use soft labels to convey uncertainty; show top-2 classes with confidences.
Ecology research: prefer multilabel or soft labels; plants evolve forms across age and environment—don’t force a false binary.
Operations (e.g., pruning crews): pair a simple tree model for rules (“if height > X and single trunk → tree”) with a K-NN backup to flag ambiguous specimens for human review.
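For that last operations setup (and the earlier advice to log probabilities so borderline cases are reviewable), one possible shape of the code: a hand-written rule fires first, K-NN handles everything else, and anything below a confidence threshold gets flagged for a human. The function name, threshold, and data are all illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

REVIEW_THRESHOLD = 0.7  # below this confidence, a human takes a look

def label_specimen(features, height_m, single_trunk, knn):
    """Simple rule first; K-NN backup for everything the rule doesn't cover."""
    if height_m > 3 and single_trunk:
        return "tree", 1.0, False                   # rule fired: no review needed
    proba = knn.predict_proba([features])[0]
    best = int(np.argmax(proba))
    label, confidence = knn.classes_[best], proba[best]
    return label, confidence, confidence < REVIEW_THRESHOLD  # log it; maybe review it

# Toy usage with invented training data (2 scaled features).
X = np.array([[0.9, 0.8], [0.8, 0.7], [0.2, 0.1], [0.3, 0.2]])
y = ["tree", "tree", "bush", "bush"]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(label_specimen([0.5, 0.45], height_m=1.4, single_trunk=False, knn=knn))
# -> ('bush', ~0.67, True): low confidence, flagged for review
```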
Make the model better (and fairer)
Active learning: When K-NN is uncertain (neighbors split), surface those cases to a botanist; add the new labeled example back in (sketched after this list).
Metric learning: Learn a distance that better reflects “treeness” vs “bushness.”
Stratified sampling: Balance your dataset so minority forms (stunted trees, tree-like shrubs) are well represented.
Temporal features: Young vs mature forms; some “bushes” are baby trees.
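A sketch of the active-learning idea from the first bullet, assuming a two-class setup: score unlabeled specimens by how evenly their neighbors split, hand the most evenly split ones to a botanist, then refit with the new labels. The helper name and cutoff are mine:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def most_uncertain(knn, X_unlabeled, n=5):
    """Indices of the specimens whose two-class neighbor vote is closest to a tie."""
    proba = knn.predict_proba(X_unlabeled)
    margin = np.abs(proba[:, 0] - proba[:, 1])   # 0 = perfect split, 1 = unanimous
    return np.argsort(margin)[:n]

# Toy data: fit on a few labeled plants, then pick the most split unlabeled ones.
X = np.array([[0.9, 0.8], [0.85, 0.75], [0.95, 0.85],   # trees
              [0.2, 0.1], [0.25, 0.2], [0.15, 0.15]])   # bushes
y = ["tree", "tree", "tree", "bush", "bush", "bush"]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

X_new = np.array([[0.55, 0.45], [0.95, 0.9], [0.1, 0.15]])
print(most_uncertain(knn, X_new, n=1))  # -> [0], the borderline specimen
# A botanist labels those rows; append them to X/y and refit.
```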
A tiny mental picture
Imagine a scatterplot:
X-axis: trunk diameter
Y-axis: branching height
Trees cluster in the top-right (thicker trunk, higher branching). Bushes cluster bottom-left (thin stems, low branching). Our plant sits mid-slope. K-NN looks around: if its nearest neighbors are 60/40 trees/bushes, that’s the label—and the confidence.
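If you want to actually draw that picture, a few lines of matplotlib with synthetic points standing in for real measurements will do it (this is just for intuition, not a real dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
trees = rng.normal(loc=[30, 2.0], scale=[8, 0.5], size=(40, 2))   # thicker trunks, higher branching
bushes = rng.normal(loc=[5, 0.4], scale=[2, 0.2], size=(40, 2))   # thin stems, low branching

plt.scatter(trees[:, 0], trees[:, 1], label="tree")
plt.scatter(bushes[:, 0], bushes[:, 1], label="bush")
plt.scatter([15], [1.1], marker="*", s=200, label="our plant")    # the mid-slope specimen
plt.xlabel("trunk diameter (cm)")
plt.ylabel("branching height (m)")
plt.legend()
plt.show()
```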