Module 1 · Foundations · ISLR Ch 1 + 2

From Zero to Your First ML Model

A hands-on crash course built for people who want to use machine learning at work — not write papers about it. Every concept maps to a real Lincoln Industries problem.

Planning Cost Savings Operations
Lesson 1 · ~15 minutes

Welcome to the catalogue

The 3 flavors of ML and how each one shows up at Lincoln.

Watch this first — 85-second primer for what's below.

Most of what people call "AI" today is statistical learning — finding patterns in data and using them to make decisions. You don't need to write code to use it. You need to know what to ask for, what's possible, and what's snake oil. By the end of this catalogue, you'll know.

What this whole catalogue does

Every chapter of the textbook Introduction to Statistical Learning becomes a short, interactive crash course. We strip the math down to what you actually need, anchor every concept in a Lincoln Industries problem, and end with a working model you can defend in a meeting.

You'll travel from "speak the language" (this module) to "build a real model on your own data" (final modules) — without writing a single line of code along the way.

The three flavors of ML at Lincoln

Every machine learning problem on earth is one of three shapes. Once you can see the shape, you know which tool to reach for.

Regression — predicting a number. "What will the scrap rate be on Line 4 next week?"

Classification — predicting a label. "Will this part pass final QC, yes or no?"

Clustering — finding hidden groups in data. "Are there families of plating-line failures we keep seeing?"

Figure: three flavors → three Lincoln problems. REGRESSION — predict a number ("scrap rate next week?"). CLASSIFICATION — predict a label, PASS or FAIL ("will this part pass QC?"). CLUSTERING — find hidden groups ("failure families?").

Same techniques, three shapes. The shape of your question decides which one you need.

Why this catalogue exists

Most ML resources assume you'll write code. Most non-coders bounce off in the first hour. This catalogue assumes the opposite: you have a job, you're busy, you want to know what's actually possible — and what's actually doable for you.

The path is simple: Foundations now → Linear Models → Resampling → Trees → Deep Learning → Unsupervised. Each module sharpens your judgment. The final modules ship you to an AutoML tool (Vertex AI, H2O, SageMaker Autopilot) where you push your own data through and ship a working model.

Same idea elsewhere

The AI agent monitoring shops you've heard about — Arize, Helicone, LangSmith — use exactly these three flavors. Regression to predict tool-call latency. Classification to flag whether an LLM response is hallucinating. Clustering to group similar failure modes across thousands of agent runs. Manufacturing and AI agents are using the same playbook.

Lesson 1 recap

  • ML at Lincoln boils down to three shapes: regression (numbers), classification (labels), clustering (hidden groups).
  • This catalogue is concept-first. By the end you'll be able to push real Lincoln data through an AutoML tool — no coding required.
  • Pick one painful prediction problem at work and carry it with you through every lesson.
Explain it to a coworker

In your own words — how would you explain this lesson's main concept to a peer at Lincoln? Save locally; have AI review for honest critique.

Foundational
Lesson 2 · ~15 minutes

The data language

n, p, X, y — just enough notation to read papers and books.

Watch this first — 70-second primer for what's below.

Every ML conversation uses four letters: n, p, X, y. Once you know what they mean, you can read any ML paper, blog post, or AutoML output without bouncing. This is a 10-minute investment that pays back forever.

What the letters mean

Imagine a spreadsheet of Lincoln plating jobs. Every row is a job. Every column is something you measured about that job. There's one column you care about predicting. That's all four letters.

n — the number of rows. How many jobs are in your dataset. If you have 50 plating jobs from last quarter, n = 50.

p — the number of columns (other than the target). How many things you measured per job. Bath temperature, line speed, surface prep score, plating thickness, age of solution, operator skill, customer tier, part complexity. p = 8.

X — the whole table of inputs. Capital X = matrix. n rows by p columns.

y — the column you want to predict. Lower-case y = a single column. For each row, one value.

Lincoln plating jobs as X and y (n = 50 jobs · p = 8 features · X is 50 × 8 · y is one column of 50 values):

JOB   BATH °C   SPEED   PREP   THICK   SOLN AGE   SKILL   CMPLX   y (PASS?)
001   82        14      0.92   28      120        4       3       PASS
002   79        22      0.71   19      340        2       5       FAIL
003   85        18      0.94   31      90         5       2       PASS
…     …         …       …      …       …          …       …       …
050   81        15      0.88   27      160        4       3       PASS

X — the input matrix (50 rows × 8 columns). y — what we predict. "Given X (the eight setup parameters), can we predict y (will it pass QC)?"

A row is a job. Columns are what you measured. The rightmost is what you want to predict. ML in one diagram.
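This course is no-code, but if you're ever curious what those four letters look like in code, here's a tiny sketch. The numbers are invented for illustration, not real Lincoln jobs; the only point is that n and p are just the shape of the table.

```python
# n, p, X, y as a numpy array — invented numbers, for illustration only.
import numpy as np

# X: n = 4 jobs (rows) x p = 3 features (columns): bath temp, line speed, prep score
X = np.array([
    [82, 14, 0.92],
    [79, 22, 0.71],
    [85, 18, 0.94],
    [81, 15, 0.88],
])

# y: one target value per row — the label we want to predict
y = np.array(["PASS", "FAIL", "PASS", "PASS"])

n, p = X.shape
print(n, p)   # 4 rows, 3 features
```

Every AutoML tool you'll meet later is asking you for exactly these two objects: a table X and a column y.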

Quantitative vs categorical

One more fork before you can read anything in the field. The type of y decides which family of algorithms you need.

Quantitative — y is a number. Plating yield (87%), thickness (28 microns), cycles since service (120). Use regression methods.

Categorical — y is a label. PASS or FAIL. Customer tier A, B, or C. Use classification methods.

You'll see "regression vs classification" mentioned constantly. This is what they mean.

Same idea elsewhere

When an AI agent shop talks about "feature space," they mean the same X matrix. Each agent run is one row; each thing they measured (latency, tokens, success-flag) is a column. When they say "target variable," they mean y. Same letters, same shape.

Lesson 2 recap

  • n = rows (observations). p = columns (features). X = input matrix. y = target column.
  • If y is a number, you have a regression problem. If y is a label, it's classification.
  • Every ML problem reduces to: "given X, predict y." Everything else is technique.
Explain it to a coworker

In your own words — how would you explain n, p, X, and y to a peer at Lincoln? Save locally; have AI review for honest critique.

Cost Savings Operations
Lesson 3 · ~20 minutes

Y = f(X) + ε

The equation every model is solving.

Watch this first — 75-second primer for what's below.

Every model on earth is solving the same equation: Y = f(X) + ε. Real outcome equals the true pattern plus noise. ML's job is to find f. The ε is the part you can never predict — and accepting that is half the battle.

Reducible vs irreducible error

Your model's total error has two parts.

Reducible error — improves when you pick a better model or add better features. This is the part you fight against. Most of ML practice is shrinking it.

Irreducible error — locked in by the world. Random measurement noise, things you can't observe, factors that vary day-to-day. Lincoln framing: even a perfect model can't predict scrap rate to the decimal — vibration, humidity, operator focus, micro-power-fluctuations all carry noise no spreadsheet captures.

Knowing the difference saves your sanity. When your model isn't getting better, you have to ask: am I fighting reducible error (try harder) or irreducible error (this is the floor)?

Figure: the noise floor — what irreducible error looks like. Black curve = true f · cyan dots = observed Y · red lines = ε (irreducible noise) — the floor you can't beat.

Even with the perfect model, individual outcomes scatter. That scatter is ε. It's a floor — find it, accept it, move on.
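You can watch the floor appear in a simulation. This sketch invents a true pattern f and a noise level sigma (both assumptions, chosen for illustration), then scores the PERFECT model — predictions that are exactly f(x). Even then, MSE can't drop below the noise variance.

```python
# Y = f(X) + eps in miniature: scoring with the TRUE f still leaves
# error ~ sigma**2. Synthetic data; f and sigma are assumptions.
import random

random.seed(42)

def f(x):                      # the "true" pattern (invented for this demo)
    return 2.0 * x + 5.0

sigma = 1.0                    # irreducible noise level (assumption)
xs = [i / 10 for i in range(200)]
ys = [f(x) + random.gauss(0, sigma) for x in xs]

# Score the perfect model — predictions are exactly f(x), no fitting at all
mse = sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(round(mse, 2))           # hovers near sigma**2 = 1.0 — the floor
```

No amount of modeling skill pushes that number toward zero; only reducing the real-world noise (better sensors, tighter process control) moves the floor itself.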

Two reasons we estimate f

Same equation, two very different goals. Knowing which goal you have changes which model you should choose.

Prediction

"Just give me Ŷ"

You don't care why the model says what it says. You just want accurate predictions. Black-box models are fine. Lincoln framing: "will this batch pass QC?" — it's fine if the model can't explain why, as long as it's right.

Inference

"Tell me which X drives Y"

You want to understand the relationship. Which lever moves the outcome? You need a clear, interpretable model. Lincoln framing: "which setup parameter has the biggest impact on yield?" — answers like this change how the floor runs.

Try it

Fit a curve through Lincoln's defect data

The blue dots are real defect measurements — they don't move. The black curve is YOUR model. Drag the four amber handles up and down to make the curve pass through the cloud of dots as closely as possible. Your MSE drops as the line gets closer to the dots — but it can never reach zero, because real data has noise around the true pattern.


The amber dashed band is the noise floor — the gap between dots and the true pattern that no model can fight (that's the ε in Y = f(X) + ε). The closer your MSE gets to 1.0, the closer your line is to the best any model could possibly do. You're done when MSE bottoms out — not when dots sit exactly on the line (they never will).

Same idea elsewhere

When you ask "why did this AI agent fail?" you're doing inference. When you ask "which agent will succeed on this prompt?" you're doing prediction. The same X, the same Y, the same ε — the goal changes the model.

Lesson 3 recap

  • Every model is hunting for f in Y = f(X) + ε. The ε is irreducible — accept it.
  • Reducible error you can fight (better model, better features). Irreducible error is the floor.
  • Two goals split the world: prediction (just be accurate) vs inference (explain the relationship).
Explain it to a coworker

In your own words — how would you explain Y = f(X) + ε and the noise floor? Save locally; have AI review for honest critique.

Operations
Lesson 4 · ~20 minutes

How to find f

Parametric vs non-parametric. Flexibility vs interpretability.

Watch this first — 70-second primer for what's below.

There are two ways to find f. Pick a shape and let math fill in the constants (parametric), or let the data shape itself (non-parametric). Each comes with a tax. Knowing which to pay matters more than knowing the algorithm names.

Parametric — pick the shape, fit the numbers

Linear regression is the classic example. You assume f is a straight line. Now you only have two numbers to find: slope and intercept. Cheap, fast, easy to explain.

The tax: if the truth is curved and you assumed a line, you'll always be a little wrong. Doesn't matter how much data you throw at it. Your assumption is the ceiling.

Non-parametric — let the data lead

No assumed shape. The fit can wiggle as much as it needs to follow the data. KNN, decision trees, splines — all non-parametric.

The tax: you need a lot more data to get a stable answer. With 30 jobs you can fit a straight line confidently. With 30 jobs and a wiggly non-parametric model, you're chasing noise.

Figure: same data, three shapes for f — LINEAR (parametric · 2 numbers), POLYNOMIAL (parametric · ~5 numbers), FLEXIBLE (non-parametric · 10+ numbers). More flexibility = more wiggle = more parameters = more data needed.

Same scatter, three different commitments. Pick what your problem calls for — and what your data can support.
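Here's the "pick a shape, fit the numbers" step in a short sketch (synthetic data — a curved truth plus noise, both invented for this demo). The same points are fit as a rigid degree-1 polynomial and a wiggly degree-10 one; the wigglier shape always scores better on the data it was fit to.

```python
# Parametric fitting: commit to a shape, let least squares fill in the numbers.
# Synthetic data for illustration only.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 3, 30)
y = np.sin(x) + rng.normal(0, 0.2, size=30)   # curved truth + noise

results = {}
for degree in (1, 10):
    coeffs = np.polyfit(x, y, degree)          # "fit the numbers" for this shape
    pred = np.polyval(coeffs, x)
    results[degree] = float(np.mean((pred - y) ** 2))

print(results)   # the degree-10 fit wins on TRAINING error — more wiggle, lower error
```

Note what this does NOT show: whether the wiggly fit is better on new data. That's Lesson 5's question.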

The flexibility / interpretability tradeoff

Here's the rule that matters in practice: more flexible models are harder to explain.

A linear regression gives you a one-line answer: "yield rises 0.4% for every degree increase in bath temp." Your boss can defend it. A wiggly non-parametric fit gives you "the model says so." Your boss is unlikely to defend that.

Choose your model based on who's asking and what they need. Sometimes accuracy wins, sometimes explainability wins. There's no universal right answer.

Try it

Pick a shape for f

Same Lincoln bath-temperature data. Three different shapes. Click each one and watch what happens.


Linear is one straight line. Cheap, clear, but can't bend.

Same idea elsewhere

A logistic regression for AI agent routing is parametric — clean rules you can defend in a postmortem. A neural network sits at the far end of the flexibility scale — it assumes almost nothing about the shape, eats data for breakfast, but good luck explaining a single decision.

Knowledge Return · From Lesson 2
Quick — what was the y in Lesson 2's plating-jobs table?
PASS / FAIL — the column we want to predict. Everything else (bath temp, speed, prep, thickness) is X.

Lesson 4 recap

  • Parametric = pick a shape (linear, polynomial), fit the numbers. Cheap, clear, can be wrong if the shape is wrong.
  • Non-parametric = let the data shape itself. Flexible, hungry for data, harder to explain.
  • More flexible = more accurate (sometimes) but less interpretable. The tradeoff is the whole game.
Explain it to a coworker

In your own words — how would you explain parametric vs non-parametric models, and the defensibility tradeoff? Save locally; have AI review for honest critique.

Planning Cost Savings
Lesson 5 · ~20 minutes

Did it actually work?

MSE and the train/test split. The non-negotiable pre-flight.

Watch this first — 65-second primer for what's below.

The single most common ML mistake is shipping a model that looks great on the data it was trained on — and falls apart in production. The fix is brutally simple: hide some data from the model, then test on it.

MSE — Mean Squared Error

The standard score for a regression model. For each prediction, take (prediction − actual), square it, then average across all predictions.

Why squared? So big misses count more than small ones. One prediction that's off by 10 contributes 100 to the average; ten predictions each off by 1 contribute only 1 apiece. The square makes the big miss bite.

Lower MSE = better. Period. But: which data did you measure it on? That's the whole question.
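MSE is one line of arithmetic. Here it is as a tiny function, with the off-by-10 vs off-by-1 comparison worked out (toy numbers, no real data):

```python
# MSE in one function — plain Python, no libraries.
def mse(preds, actuals):
    errors = [(p - a) ** 2 for p, a in zip(preds, actuals)]
    return sum(errors) / len(errors)

# One miss of 10 vs ten misses of 1 — squaring makes the big miss dominate
print(mse([10], [0]))            # 100.0
print(mse([1] * 10, [0] * 10))   # 1.0
```

Same total "distance from truth" either way; the square punishes the single big miss a hundred times harder.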

The train/test split

Hold out 20% of your data before you train. Train the model on 80%. Score it on the 20% it has never seen. That score is your honest estimate of how it'll do in the real world.

Anything else is cheating. If the model has seen the data during training, asking it to predict that data is like giving a student the answer key during the exam. They'll ace it. They'll teach you nothing.

Figure: the train/test split. All your data (e.g., 50 plating jobs) → TRAINING SET (80% — 40 jobs the model learns from) + TEST SET (20% — the model never sees these; your honest score comes from here).

Hide 20% before training. Score on it after. Anything else is fiction.
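The split itself is a few lines. This sketch uses 50 stand-in jobs and a fixed shuffle seed (both assumptions for the demo); shuffling first matters so the split isn't biased by whatever order the data arrived in.

```python
# An 80/20 train/test split in plain Python. jobs is a stand-in dataset.
import random

jobs = list(range(50))            # pretend: 50 plating jobs
random.seed(7)
random.shuffle(jobs)              # shuffle so the split isn't biased by order

cut = int(len(jobs) * 0.8)
train, test = jobs[:cut], jobs[cut:]
print(len(train), len(test))      # 40 10
```

The one rule that matters: nothing in `test` may touch the model until scoring time.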

Overfitting — the trap

A super-flexible model can memorize the training data. It can fit every noisy point exactly. Train MSE → near zero. The model looks brilliant. You ship it.

Then production data arrives. The same model is suddenly catastrophic. Why? Because it learned the noise, not the signal. Train MSE was a lie.

The only honest metric is the one measured on data the model hasn't seen. That's the rule. Memorize it.

Try it

Drag your predictions to fit Lincoln's daily yield

Then flip the test-set switch and see what you actually built.


A tight fit on training data isn't proof of anything. Holdout is. Drag every handle right onto its cyan dot — train MSE goes to ~0. Then reveal the test set and watch what happens to test MSE.

Same idea elsewhere

When an AI shop says "we evaluated on a held-out set," this is what they mean. If they don't say that, ignore the accuracy number. There's a 90% chance they trained and tested on the same data — which tells you nothing.

Lesson 5 recap

  • MSE = average squared prediction error. Lower is better — but only on the right data.
  • Train/test split = hide 20%, train on 80%, score on the hidden 20%. Non-negotiable.
  • Overfitting happens when a flexible model memorizes the training noise. Train MSE looks great, test MSE blows up.
Explain it to a coworker

In your own words — how would you explain MSE and the train/test split? Save locally; have AI review for honest critique.

Cost Savings Operations
Lesson 6 · ~25 minutes · Flagship

Bias vs Variance

The central tradeoff. Move the slider. Feel it click.

Watch this first — 2-minute primer for what's below. The flagship.

This is the lesson that makes everything else click. Once you feel the bias-variance tradeoff in your hands, you'll never look at a model the same way. Every modeling decision in your career will be a flavor of this.

Bias — error from being too simple

Bias is what happens when your model can't bend enough to follow the truth. You assume defect rate is a flat line, but it's actually a curve. No matter how much data you give it, the line will always be wrong in the middle. That's bias.

High bias = the model is too rigid. It misses real patterns. The error is built into the assumption.

Variance — error from being too clingy

Variance is what happens when your model bends too eagerly. It hugs every data point, including the noisy ones. Re-train on a slightly different sample and the fit changes wildly. That instability is variance.

High variance = the model is too sensitive. It's chasing noise. Predictions on new data are unstable.

Figure: three model fits on the same data. UNDERFIT — high bias · low variance ("too rigid to follow the truth"). JUST RIGHT — low bias · low variance ("follows the trend, ignores noise"). OVERFIT — low bias · high variance ("chases every data point, including noise").

Bias is a fence. Variance is a windsock. Just-right is the curve in the middle that ignores both extremes.

The U-shape

As you increase a model's flexibility, two things happen at once. Bias drops (more flexible models can follow the truth). Variance rises (more flexible models chase noise).

Total error = bias² + variance + irreducible noise. Add it up across flexibility levels and you get a U-shape. The bottom of the U is the sweet spot — flexible enough to capture the pattern, not so flexible that it's chasing noise.

Every modeling decision is just trying to find the bottom of that U.
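You can trace the U-shape numerically. This sketch invents a curved truth plus noise, splits the points into train/test, and fits three flexibility levels (the data, degrees, and seed are all chosen for illustration). Degree 1 pays a bias penalty; with this little data, the degree-12 fit usually pays a variance penalty; degree 3 tends to sit near the dip.

```python
# Sketch of the U-shape: rising flexibility, scored on held-out points.
# Synthetic data; degrees and seed are assumptions for the demo.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 3, 32)
y = np.sin(2 * x) + rng.normal(0, 0.3, size=32)   # curved truth + noise

train_x, test_x = x[::2], x[1::2]     # alternate points: 16 train, 16 test
train_y, test_y = y[::2], y[1::2]

test_mse = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(train_x, train_y, degree)
    pred = np.polyval(coeffs, test_x)
    test_mse[degree] = float(np.mean((pred - test_y) ** 2))

print(test_mse)   # rigid degree 1 scores worst; the middle degree sits near the dip
```

The exact numbers vary with the noise draw; the rank of the rigid model at the high-bias end is the stable part of the story.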

Flagship · Try it

The Bias–Variance Playground

Move the slider. Watch the curve, the bias, and the variance fight each other.

Model fit on 30 training points
Train MSE · Test MSE · Bias² · Variance — across all flexibility levels

The dotted vertical line on the right chart marks where you are on the U-shape. The green dashed line marks the sweet spot — the flexibility level where test MSE bottoms out. Watch test MSE drop, bottom, then climb again as variance takes over.

Same idea elsewhere

When an AI agent shop says "we hyperparameter-tuned for the lowest validation loss," they're standing on the U-shape, looking for the dip. Same playground, different vocabulary.

Knowledge Return · From Lesson 5
From Lesson 5: what's the rule about evaluating a model's accuracy?
Always score on data the model has never seen — the train/test split. A model that memorizes history will score perfectly on history and badly in production. Test set is the only honest number.

Lesson 6 recap

  • Bias = error from too rigid a model. Misses real patterns.
  • Variance = error from too flexible a model. Chases noise.
  • Total error = bias² + variance + irreducible noise. As flexibility grows, bias drops, variance rises.
  • The sweet spot is the bottom of the U. Every modeling decision is hunting for it.
Explain it to a coworker

In your own words — how would you explain bias, variance, and the U-shape? Save locally; have AI review for honest critique.

Operations
Lesson 7 · ~20 minutes

KNN — your first algorithm

Classify by neighbors. The simplest serious ML there is.

Watch this first — 75-second primer for what's below.

K-Nearest Neighbors is the simplest serious ML algorithm there is. There's no math to memorize. The model is the data. And you can draw it in your head.

How it works in one sentence

Given a new point, find the K closest points in your training data and take a majority vote. That's it.

Lincoln framing: a new plating job arrives with bath temp 82°C and line speed 14 parts/min. Look up the 10 most similar past jobs. If 7 passed and 3 failed, predict PASS. If 3 passed and 7 failed, predict FAIL. Done.

That's a real, defensible model. No coefficients. No training. The "model" is just your past data.
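The whole algorithm fits in about fifteen lines of plain Python. The jobs below are invented for illustration; a real version would also scale the features first, since this one mixes °C and parts/min in the same distance.

```python
# KNN from scratch — made-up past jobs, for illustration only.
# Caveat: real KNN should scale features before computing distances.
from collections import Counter

past_jobs = [
    # (bath_temp_C, line_speed, label)
    (80, 15, "PASS"), (81, 14, "PASS"), (83, 13, "PASS"),
    (79, 22, "FAIL"), (77, 24, "FAIL"), (84, 16, "PASS"),
    (76, 25, "FAIL"),
]

def knn_predict(temp, speed, k):
    # 1. sort past jobs by (squared) distance to the new job
    nearest = sorted(past_jobs,
                     key=lambda j: (j[0] - temp) ** 2 + (j[1] - speed) ** 2)
    # 2. majority vote among the k closest
    votes = Counter(label for _, _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_predict(82, 14, k=3))   # the 3 nearest jobs all passed -> "PASS"
```

No training step, no coefficients — `past_jobs` IS the model, exactly as the lesson says.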

Picking K

K = 1 means you trust whoever's standing closest, even if they're an outlier. One weird past job dominates every prediction near it. Wiggly decision boundary. High variance.

K = 100 means you average over basically your whole dataset. Too smooth to catch local patterns. Blurry decision boundary. High bias.

The right K is somewhere in between, and yes — it's a bias-variance tradeoff. The U-shape from Lesson 6 shows up here too. Sound familiar?

Figure: KNN — different K, different boundary. The same new point classified at K = 3, K = 10, and K = 25: each K gives a different set of neighbors a vote (at K = 3, the tally is 2 blue vs 1 orange → BLUE).

The black cross is the new job. Find K nearest. Take a vote. That's the entire algorithm.

Try it

Click to add data. Slide K.

You're predicting "will this plating job pass QC?" from two setup parameters.


Background color = what KNN would predict at every point. Click anywhere to add a labeled job and watch the boundary react. Shift-click or right-click to remove a point. Try the K=1 preset, then K=25 — same data, totally different model.

Same idea elsewhere

Vector databases doing semantic search are KNN at scale. "Find the K embeddings closest to my query" is literally KNN with cosine distance. Every RAG pipeline, every recommendation engine, every AI agent's memory lookup — KNN under the hood.

Lesson 7 recap

  • KNN: find the K nearest past examples, take a majority vote. The model is the data.
  • Small K = wiggly boundary, high variance. Large K = smooth boundary, high bias. Same U-shape.
  • Perfect first algorithm for small Lincoln-sized datasets. Defensible, simple, surprisingly accurate.
Explain it to a coworker

In your own words — how would you explain KNN to someone who's never seen ML? Save locally; have AI review for honest critique.

Planning Cost Savings Operations
★ Capstone · ~30 minutes

Predictive Maintenance for Lincoln plating lines

The thing you can carry into work tomorrow.

Watch this first — 65-second primer for what's below.

Time to put it all together. You're going to build a predictive maintenance model for Lincoln's plating lines using everything from the last seven lessons. No coding. Just judgment.

The problem

50 machines on the floor. Some will fail in the next 7 days, some won't. We have 8 features per machine — age, cycles since service, vibration, bath temp drift, operator skill, part complexity, customer tier, and one decoy that looks important but isn't. Build a KNN classifier that predicts failure on a held-out test set.

What you'll do

Step 1. Pick which features go into the model. Some help. One is a trap.

Step 2. Pick K. (Remember Lessons 6 and 7 — find the U-shape sweet spot.)

Step 3. Watch test accuracy and the confusion matrix update live as you tweak.

Step 4. When you're happy, generate the spec sheet — your shippable artifact.

Why it matters

If you can flag a failure 7 days before it happens, you schedule maintenance during a planned slot instead of an emergency stop. That's literally Planning + Cost Savings + Operations all at once. It's the kind of model leadership notices.

Capstone · Build it

Predict which Lincoln plating lines fail in 7 days

Pick features. Pick K. See test accuracy. Then ship a spec sheet.

              Pred Fail   Pred OK
Actual Fail       –          –
Actual OK         –          –

High True Positive + low False Negative = catching failures. That's the ballgame.

Aim for accuracy ≥ 0.80 on the test set. Try removing the decoy. Try different Ks. The best feature set + K combo isn't always obvious.
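The two numbers the confusion matrix hides in plain sight are accuracy and recall. Here's the arithmetic on hypothetical counts (invented for illustration — not your workbench's numbers):

```python
# Reading a confusion matrix — all counts are hypothetical.
tp = 9    # predicted FAIL, actually failed  (caught it)
fn = 3    # predicted OK,   actually failed  (missed it — the costly one)
fp = 4    # predicted FAIL, actually fine    (unneeded maintenance)
tn = 34   # predicted OK,   actually fine

accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all calls that were right
recall   = tp / (tp + fn)                    # share of real failures we caught
print(round(accuracy, 2), round(recall, 2))  # 0.86 0.75
```

Notice a model can clear the 0.80 accuracy bar while still missing a quarter of real failures — which is why the lesson points you at False Negatives, not just accuracy.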

Same idea elsewhere

Every observability platform that "predicts incidents before they happen" is doing this exact loop. Pick features. Pick a model. Validate on past data. Ship. The flavor differs; the playbook doesn't.

Knowledge Return · From Lesson 6
From Lesson 6: what shape does test MSE make as you crank flexibility?
A U-shape. Bias falls fast, variance climbs, total error drops, bottoms out, then climbs back up. Tuning K in this capstone is hunting for the bottom of the U.

Capstone recap — what you just did

  • You picked features deliberately (and learned to spot decoys that hurt accuracy).
  • You tuned K for the bias-variance sweet spot — same U-shape from Lesson 6.
  • You evaluated on a held-out test set — the only honest metric.
  • You generated a printable spec sheet — your shippable artifact for leadership.
Explain it to a coworker

In your own words — how would you explain the predictive maintenance loop you just built (pick features, pick K, score, ship)? Save locally; have AI review for honest critique.

Planning Operations
Lesson 1 · Tree Methods · ~18 minutes

Decision trees — yes / no, all the way down

A model you can sketch on a napkin and your boss can actually follow.

A decision tree is a flowchart. "Is lot purity below 97%? Yes → check humidity. Above 75%? Yes → predict FAIL." That's it. No equations, no abstract math, just a series of yes/no splits ending in a prediction. Trees are the most defensible model you'll ever build — and they happen to be very accurate too.

The shape of a tree

Every node asks one yes/no question about one feature. Yes goes left, no goes right. You keep splitting until you hit a leaf — that leaf is your prediction.

For your sourcing data, a tree might learn this:

if lot_purity_pct < 97.0:
    if humidity_pct > 75:
        predict FAIL    ← leaf
    else:
        predict PASS    ← leaf
else:
    if bath_temp_f < 163:
        predict FAIL    ← leaf
    else:
        predict PASS    ← leaf

You can read this. If the model predicted FAIL on a new job, you can trace the exact path. That's defensibility you literally cannot get from a neural network.

How does a tree decide where to split?

The algorithm tries every possible split (every feature × every threshold) and picks the one that separates passes from fails the most. The mathematical name for "how mixed up are these groups" is Gini impurity (classification) or residual sum of squares (regression). You don't need the math; you need the intuition: a good split makes the two resulting groups as pure as possible.

Then it does the same thing on each resulting group. Recursively. Until you say stop — or every leaf is pure.
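The "try every threshold, keep the purest split" search can be sketched on one feature. The purity values and labels below are invented so the answer is easy to check by eye:

```python
# One-feature split search: weighted Gini impurity over every threshold.
# Purity numbers and labels are made up for illustration.
def gini(labels):
    if not labels:
        return 0.0
    p = labels.count("PASS") / len(labels)
    return 1.0 - p ** 2 - (1.0 - p) ** 2      # 0 = pure, 0.5 = 50/50 mix

purity = [95.0, 96.0, 96.5, 97.5, 98.0, 99.0]
label  = ["FAIL", "FAIL", "FAIL", "PASS", "PASS", "PASS"]

best = None
for threshold in purity[1:]:                   # candidate split points
    left  = [l for x, l in zip(purity, label) if x <  threshold]
    right = [l for x, l in zip(purity, label) if x >= threshold]
    # weighted impurity of the two groups this split creates
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(label)
    if best is None or score < best[0]:
        best = (score, threshold)

print(best)   # (0.0, 97.5): splitting at purity < 97.5 separates perfectly
```

A real tree runs this same loop over every feature at every node, then recurses on each side — nothing deeper than this going on.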

Same idea elsewhere

Every fraud-detection vendor's "explainable AI" is just a tree (or a forest of them). When they tell you "we flagged this transaction because amount > $5000 AND device_age < 7 days," that's literally a tree path. The "AI" buzzword is doing a lot of work.

Knowledge Return · From Lesson 7
In KNN, what was the model itself?
The data. KNN doesn't learn anything — it just stores every past job and votes among the K nearest at prediction time. A tree is different: it learns a SET OF RULES from the data and throws the data away. Predictions become fast.

Lesson 8 recap

  • A decision tree = a flowchart of yes/no splits ending in predictions. Each leaf is a prediction.
  • The algorithm picks the split that separates the classes (pass/fail) most cleanly. Greedy, top-down.
  • Trees are the most defensible model in ML — you can trace every prediction along a literal path.
Explain it to a coworker

In your own words — how would you explain a decision tree to a teammate who's never seen one? Save locally; have AI review for honest critique.

Planning Cost Savings
Lesson 2 · Tree Methods · ~16 minutes

Pruning — when smaller trees beat bigger ones

A deep tree memorizes history. A pruned tree generalizes to the future.

Let a tree split until every leaf has exactly one job in it and you've built a perfect lookup table for the past — and a terrible predictor for the future. Bigger trees aren't better trees. The art is in knowing when to stop.

The overfitting trap

A tree with no depth limit will keep splitting until each leaf contains a single training job. Training accuracy: 100%. Test accuracy on new jobs: bad. The model memorized the noise instead of the signal — same overfitting story you saw in Lesson 6.

Your purity-vs-thickness scatter has noise around the true linear relationship. A deep tree will draw boundaries that zig-zag around every noisy point. New jobs land between those zig-zags and get predicted wrong.

Two ways to control tree size

Pre-pruning — stop early. Set a max_depth (e.g., 4 levels) or min_samples_leaf (e.g., at least 10 jobs per leaf) before training. Simple and fast.

Post-pruning — let the tree grow huge, then prune back nodes that don't help test accuracy. The textbook calls this cost-complexity pruning (ISLR2 §8.1.2). More principled but more compute.

For your Lincoln-sized data (hundreds of jobs, not millions), pre-pruning with max_depth = 3 to 5 usually beats the more elaborate methods. Don't over-engineer.
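To make pre-pruning concrete, here's a toy recursive tree on one feature with a max_depth stop (data invented; the point labeled 99 plays the role of a noisy job). The deep tree adds an extra split just to memorize that one point; the shallow tree ignores it.

```python
# A toy tree with pre-pruning (max_depth). One feature, made-up data.
def majority(labels):
    return max(set(labels), key=labels.count)

def gini(labels):
    p = labels.count("PASS") / len(labels)
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def grow(xs, labels, depth, max_depth):
    # Pre-pruning rule: become a leaf at max_depth (or when already pure)
    if depth >= max_depth or len(set(labels)) == 1 or len(set(xs)) == 1:
        return majority(labels)                        # leaf = majority label
    best = None
    for t in sorted(set(xs))[1:]:                      # candidate thresholds
        left  = [l for x, l in zip(xs, labels) if x < t]
        right = [l for x, l in zip(xs, labels) if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, t)
    t = best[1]
    return (t,
            grow([x for x in xs if x < t],
                 [l for x, l in zip(xs, labels) if x < t], depth + 1, max_depth),
            grow([x for x in xs if x >= t],
                 [l for x, l in zip(xs, labels) if x >= t], depth + 1, max_depth))

xs     = [95, 96, 97, 98, 99]
labels = ["FAIL", "FAIL", "PASS", "PASS", "FAIL"]   # 99 is the noisy point

deep    = grow(xs, labels, 0, max_depth=5)   # grows an extra split to chase 99
shallow = grow(xs, labels, 0, max_depth=1)   # one split, noise ignored
print(shallow)                               # purity < 97 -> FAIL, else PASS
```

The shallow tree misclassifies the noisy 99 on purpose — and that's exactly the behavior you want on the jobs that arrive next week.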

⚠️ The line speed change in Month 6 is a trap. A deep tree trained across Months 5-8 will memorize the old line_speed_ppm baseline and predict badly on jobs after the equipment change. A shallower tree (max_depth=3) won't learn this brittle pattern. Same generalization story, more honest model.
Knowledge Return · From Lesson 6
What does test error look like as model flexibility goes up?
A U-shape. Bias falls fast, variance climbs, total error drops, bottoms out, then climbs back up. Tuning max_depth on a tree is hunting for the bottom of the same U-shape — just with "depth" instead of "K" on the x-axis.

Lesson 9 recap

  • An unpruned tree overfits — it memorizes noise instead of signal.
  • Pre-pruning (set max_depth or min_samples_leaf upfront) is the simple, defensible default for small datasets.
  • Tuning depth is the same U-shape story from Lesson 6 — find the sweet spot, don't reach for the most complex model.
Explain it to a coworker

Why does a smaller tree often beat a bigger one in production? Save locally; have AI review.

Operations Cost Savings
Lesson 3 · Tree Methods · ~14 minutes

Regression trees — predicting numbers, not labels

Same flowchart, different leaves.

Everything you just learned about decision trees also predicts numbers. Instead of a leaf saying "PASS" or "FAIL," it says "thickness ≈ 26.4 microns." Same algorithm, same defensibility, swapped output.

What's different (and what isn't)

The structure is identical: yes/no splits, leaves at the bottom. The only thing that changes is what the leaves contain and how the splits are chosen.

  • Leaves contain numbers (the average of the training jobs in that leaf), not class labels.
  • Splits minimize variance within each leaf, not Gini impurity. The math: reduce the squared error.

That's it. ISLR2 §8.1.1 covers this in 3 pages because there isn't much more to say.
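The regression version of the split search is the same loop with squared error in place of Gini. The thickness numbers below are invented so the step-change is obvious:

```python
# Regression split search: leaves predict the group MEAN, splits
# minimize total squared error. Made-up data for illustration.
def rss(values):
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

purity    = [95, 96, 97, 98, 99]
thickness = [19.0, 20.0, 26.0, 27.0, 28.0]   # jumps once purity crosses ~97

best = None
for t in purity[1:]:
    left  = [y for x, y in zip(purity, thickness) if x < t]
    right = [y for x, y in zip(purity, thickness) if x >= t]
    score = rss(left) + rss(right)            # total squared error after split
    if best is None or score < best[0]:
        best = (score, t)

score, t = best
left_vals = [y for x, y in zip(purity, thickness) if x < t]
print(t, sum(left_vals) / len(left_vals))     # 97 19.5 — left leaf predicts the mean
```

Swap `gini` for `rss` and labels for numbers — that really is the entire difference between classification and regression trees.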

Trees vs linear regression — head to head

If your data is genuinely linear, linear regression wins. It's the simplest model that fits a line.

If your data has thresholds, interactions, or non-linear chunks ("things change once temperature exceeds 167°F"), a tree wins. Trees handle these natively because every split IS a threshold.

Your sourcing data has both — purity → thickness is mostly linear, but humidity has a threshold effect on QC, and Supplier D introduces a step-change. A tree captures this without you specifying the form.

Same idea elsewhere

Pricing engines, ad bidding, churn prediction, lead scoring β€” all use regression trees (and their cousins, gradient boosting) when the target is a number. They're the workhorses of "tabular ML" everywhere.

Lesson 10 recap

  • Regression trees predict numbers. Same structure as classification trees, different leaves.
  • Splits minimize within-leaf variance instead of impurity. The intuition is the same: separate the data as cleanly as possible.
  • Trees beat linear regression when the data has thresholds or step-changes; linear wins when it's just a line.
Explain it to a coworker

When would you reach for a regression tree over linear regression? Save locally; have AI review.

Planning Operations Cost Savings
Lesson 4 · Tree Methods · ~20 minutes

Random forests — many trees, one vote

The model that runs the small-data world.

Take your single decision tree. Now build 500 of them, each trained on a slightly different random slice of your data. Predict by majority vote. That's a random forest — and it's often the most accurate model you'll fit to a small tabular dataset like yours. It's also boring, which is exactly why it works.

Why 500 trees beat 1 tree

A single decision tree is twitchy — change one data point and it can split differently. High variance.

If you train 500 trees, each on a slightly different sample (with replacement), each one will overfit in a slightly different way. Average their predictions and the noise cancels out. The signal survives. The variance drops dramatically.

This trick is called bootstrap aggregating ("bagging" — ISLR2 §8.2.1). Random forests add one more twist: at each split, only a random subset of features is considered. This forces the trees to be diverse — without it they'd all look identical.
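Here's the bagging half of that story as a toy pure-Python sketch (nothing like production scikit-learn code). Each "tree" is a one-threshold stump, the data is a made-up single-feature pass/fail rule, and with only one feature there's no random feature subset to draw:

```python
import random
from collections import Counter

random.seed(0)  # reproducible bootstrap samples

def fit_stump(sample):
    """Pick the threshold on x that classifies this sample best."""
    best = None
    for t, _ in sample:
        correct = sum((x > t) == label for x, label in sample)
        if best is None or correct > best[0]:
            best = (correct, t)
    return best[1]

# Made-up rule: jobs with x above 5 pass (True), the rest fail (False).
data = [(x, x > 5.0) for x in [1, 2, 3, 4, 6, 7, 8, 9]]

stumps = []
for _ in range(500):
    boot = [random.choice(data) for _ in data]  # resample with replacement
    stumps.append(fit_stump(boot))

def forest_predict(x):
    """Majority vote across all 500 stumps."""
    votes = Counter(x > t for t in stumps)
    return votes.most_common(1)[0][0]

print(forest_predict(7.5), forest_predict(2.0))
```

A few bootstrap samples will produce odd stumps, but the majority vote washes them out — which is exactly the variance-cancellation argument above.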

Feature importance — the sourcing engineer's bonus

Random forests give you something a single tree can't: a clean answer to "which features actually matter?"

For every feature, the forest adds up how much that feature's splits improved predictions across all 500 trees. The result is a ranked list. This is the answer to "where should I focus as a sourcing engineer?" If lot_purity_pct ranks #1 and operator ranks #11, you stop blaming operators and start tightening supplier contracts.
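The aggregation itself is simple to picture. A hypothetical sketch with made-up per-tree gain records (scikit-learn, for example, exposes the finished ranking as a feature_importances_ attribute):

```python
from collections import defaultdict

# Made-up (feature, error_reduction) records from three toy trees:
tree_splits = [
    [("lot_purity_pct", 0.40), ("bath_temp_f", 0.10)],
    [("lot_purity_pct", 0.35), ("humidity_pct", 0.15)],
    [("lot_purity_pct", 0.30), ("bath_temp_f", 0.05), ("operator", 0.01)],
]

totals = defaultdict(float)
for tree in tree_splits:
    for feature, gain in tree:
        totals[feature] += gain        # sum each feature's gains over all trees

grand_total = sum(totals.values())
ranking = sorted(((f, g / grand_total) for f, g in totals.items()),
                 key=lambda pair: -pair[1])   # normalize and rank

for feature, share in ranking:
    print(f"{feature:16s} {share:.2f}")
```

The shares sum to 1, so the ranking reads directly as "fraction of the forest's total improvement attributable to each feature."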

Same idea elsewhere

When a Kaggle competition ends and the winner posts their solution, it's almost always XGBoost or LightGBM (gradient-boosted trees) — the close cousins of random forests we'll see next lesson. Tree ensembles dominate small-to-medium tabular data. Period.

Knowledge Return · From Lesson 8
What's the one tradeoff you accept when you move from one tree to 500?
Defensibility. One tree you can sketch on a napkin. Five hundred trees you can't. You can still report feature importance and pull individual trees for inspection, but you've traded "I can explain every prediction" for "I'm more accurate on average."

Lesson 11 recap

  • Random forest = 500 (or however many) trees, each trained on a random sample, voting on the final prediction.
  • The randomness gives the trees enough disagreement that averaging cancels their individual mistakes.
  • Bonus: feature importance ranking tells you which sourcing variables actually drive outcomes.
Explain it to a coworker

Why is averaging 500 noisy trees better than one careful tree? Save locally; have AI review.

Cost Savings Operations
Lesson 5 · Tree Methods · ~18 minutes

Boosting — trees that fix each other's mistakes

The current world champion of tabular data.

Random forests train every tree in parallel and average them. Boosting does the opposite — trees go one at a time, and each new tree focuses on the jobs the previous trees got wrong. Sequential, not parallel. The result: XGBoost, LightGBM, CatBoost — the names you've seen on every Kaggle leaderboard.

Bagging vs boosting in one sentence each

Bagging (random forest) — train many independent trees on different samples in parallel, average the predictions. Reduces variance.

Boosting — train one shallow tree. Look at what it got wrong. Train the next tree to fix those mistakes. Repeat. Reduces both bias and variance — but can overfit if you let it run too long.
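The boosting loop is short enough to sketch in pure Python. This toy version does regression with one-split stumps on made-up data with a clean step at x = 3; real gradient boosting generalizes the same loop to arbitrary differentiable losses:

```python
def fit_stump(xs, residuals):
    """One threshold split; each side predicts its residual mean."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = (sum((r - lm) ** 2 for r in left)
                + sum((r - rm) ** 2 for r in right))
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]     # clean step change after x = 3

preds = [sum(ys) / len(ys)] * len(xs)    # round 0: predict the overall mean
lr = 0.5                                 # learning rate shrinks each fix
for _ in range(20):                      # each round fits what's left over
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    preds = [p + lr * stump(x) for p, x in zip(preds, xs)]

print([round(p, 2) for p in preds])      # converges toward [1.0, ..., 3.0]
```

Note that each round trains on residuals (what's still wrong), not on the original targets — that's the whole difference from bagging.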

Why everyone reaches for gradient boosting

On tabular data with hundreds to millions of rows, gradient boosting (the modern incarnation of the idea) usually wins. Why:

  • Handles mixed features natively. Numeric, categorical, missing values — boosters eat them all.
  • Fast inference. Predictions are still just walking a tree (many trees, but each is shallow).
  • Strong out-of-the-box. Default settings often beat heavily-tuned linear models on real tabular datasets.
  • Built-in feature importance. Same bonus you got from random forests.

The trade: harder to defend than a single tree, more knobs to tune than a random forest, can overfit silently if you don't validate honestly. ISLR2 §8.2.3 covers the math.

⚠️ Boosting has more knobs. Number of trees (n_estimators), learning rate, max depth, subsample ratio, regularization β€” easily 6-8 hyperparameters. You can't tune them all by hand. We'll cover this in Lesson 13.

Lesson 12 recap

  • Boosting = sequential trees, each correcting the previous trees' mistakes. Different from random forest's parallel-and-average.
  • Gradient boosting (XGBoost, LightGBM) is the dominant tabular-data model today.
  • More accurate, more tunable, harder to defend. The tradeoff is real.
Explain it to a coworker

Bagging vs boosting — in your own words. Save locally; have AI review.

Planning Cost Savings
Lesson 6 · Tree Methods · ~15 minutes

Tuning — finding the right knob settings

A boring lesson that prevents 80% of model disasters.

Tree models have knobs: max_depth, n_estimators, learning_rate, min_samples_leaf. The right settings depend on YOUR data — there's no universal answer. Tuning is the difference between a 0.72 model and a 0.91 model on the same data.

Grid search vs random search

Grid search = try every combination of every knob you care about. Exhaustive, slow, but thorough. Fine for 2-3 hyperparameters on a small dataset like yours.

Random search = sample N random combinations from the hyperparameter space. Surprisingly often beats grid search per unit of compute because most knobs don't matter much; sampling lets you waste less time on the unimportant ones.

Bayesian optimization (Optuna, scikit-optimize) = smarter random search that learns from each trial. Overkill for Lincoln-sized data; use it once you're past 100K rows.
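A sketch of random search under a fake scoring function. The search space, the 10-trial budget, and the "depth 4 is best" peak are all invented for illustration; in practice the scorer would be cross-validated accuracy:

```python
import random

random.seed(1)  # reproducible draws

space = {
    "max_depth": [2, 3, 4, 5, 6],
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
}

def fake_cv_score(params):
    """Stand-in for a real cross-validated score; pretends depth 4 is best."""
    return 0.9 - 0.02 * abs(params["max_depth"] - 4)

trials = []
for _ in range(10):                     # 10 random draws vs. 45 grid cells
    params = {k: random.choice(v) for k, v in space.items()}
    trials.append((fake_cv_score(params), params))

best_score, best_params = max(trials, key=lambda t: t[0])
print(best_score, best_params)
```

Ten draws cover the one knob that matters (max_depth) almost as well as the full 45-cell grid would, which is the whole case for random search.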

Cross-validation — the only honest tuning method

You CAN'T pick hyperparameters by looking at training accuracy — it always rewards the most complex setting, so you'd pick the one that overfits hardest. You CAN'T pick them by looking at your test set either — you'd be using the "honest" set to make a choice, contaminating it.

The fix: k-fold cross-validation on your TRAINING data only. Split the train set into k chunks (commonly 5). For each candidate setting: train on 4 chunks, validate on the 5th. Rotate. Average the 5 validation scores. Pick the setting with the best average. THEN evaluate the winner once on the held-out test set. ISLR2 §5.1 covers it.

This sounds elaborate; it's two lines of code in scikit-learn. The discipline matters more than the syntax.
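The rotation is worth seeing once. A stdlib-only sketch (the scoring callback here is a placeholder; scikit-learn's cross_val_score wraps this same idea):

```python
def k_fold_scores(rows, k, train_and_score):
    """Split rows into k chunks; each chunk takes one turn as validation."""
    folds = [rows[i::k] for i in range(k)]          # k roughly equal chunks
    scores = []
    for i in range(k):
        valid = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(train_and_score(train, valid))
    return sum(scores) / k

# Toy check: a "model" that just reports how big its validation fold was.
rows = list(range(10))
avg = k_fold_scores(rows, 5, lambda train, valid: len(valid))
print(avg)  # 2.0: each of the 5 folds holds 2 of the 10 rows
```

Every row validates exactly once and trains k−1 times, which is why the averaged score is an honest estimate.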

Knowledge Return · From Lesson 5
Why can't you tune hyperparameters using your test set?
Because once you use the test set to make a decision, it stops being held-out. You'd be "learning from" the test data through your hyperparameter choices, and your reported accuracy becomes optimistic — exactly the kind of dishonest number you saw in Lesson 5. Cross-validation on TRAIN only gives you honest tuning.

Lesson 13 recap

  • Tree models have knobs that materially affect accuracy. Tune them.
  • Grid search for small spaces; random search for bigger ones; Bayesian when you're rich in data and compute.
  • Always tune on cross-validated train data — never touch the test set until you've picked your final model.
Explain it to a coworker

Why is using your test set to tune hyperparameters dishonest? Save locally; have AI review.

Operations Planning
Lesson 7 · Tree Methods · ~15 minutes

Production — ship it, then watch it

A model is never "done." It's "deployed and monitored."

Every model degrades. Suppliers change. Equipment changes. Seasons change. The model trained on Months 1-4 will be wrong by Month 9 unless someone retrains it. Production ML is not "build and forget" — it's a maintenance discipline.

Drift — the slow killer

Data drift — your inputs change shape. New supplier joins, humidity baseline rises, line speed target shifts (Month 6, remember?). The model still works mechanically but its accuracy decays because the world moved.

Concept drift — the relationship between inputs and outputs changes. Maybe Supplier E's purity number is now measured differently, or the QC spec was tightened by Quality. Same inputs, different "right" answer.

You only catch either if you're watching. The discipline: log every prediction + every eventual outcome. Compare weekly.
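That logging discipline fits in a few lines. A hypothetical sketch with made-up prediction/outcome pairs and an illustrative 5-point trigger:

```python
def accuracy(pairs):
    """Fraction of logged (prediction, outcome) pairs that matched."""
    return sum(pred == outcome for pred, outcome in pairs) / len(pairs)

# Made-up log: the model was right early on, then the world moved.
log = [(True, True)] * 90 + [(True, False)] * 10

baseline = accuracy(log[:50])        # the weeks right after deployment
recent = accuracy(log[-20:])         # rolling window of the latest jobs
drifting = baseline - recent > 0.05  # illustrative 5-point threshold

print(round(baseline, 2), round(recent, 2), drifting)  # 1.0 0.5 True
```

The catch in real life is that outcomes arrive late (QC results lag predictions), so the comparison has to wait for them — which is exactly why the logging has to be automatic.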

When to retrain — three triggers

  • Time-based. "Retrain monthly" — simple, defensible, no surprises. Default for stable processes.
  • Drift-based. Set a threshold on your monitoring metric (e.g., "rolling 30-day accuracy drops 5 percentage points below baseline"). Retrain when triggered. More efficient but needs monitoring infrastructure.
  • Event-based. Known regime change: new supplier, new equipment, new spec. Retrain on the day of the event. Sourcing engineer's instinct says this matters — trust it.

In practice: use all three. Time-based as the default, drift-based as the safety net, event-based when you know something changed.
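"Use all three" can literally be one function. The thresholds below are illustrative, not recommendations — tune them to your process:

```python
def should_retrain(days_since_train, accuracy_drop, regime_event):
    """Any one trigger firing is enough to schedule a retrain."""
    return (days_since_train >= 30      # time-based: monthly default
            or accuracy_drop > 0.05     # drift-based: 5-point decay
            or regime_event)            # event-based: new supplier, spec, etc.

print(should_retrain(12, 0.01, False))  # False: nothing fired
print(should_retrain(12, 0.08, False))  # True: drift trigger
```

Run the check on the same cadence as your monitoring comparison (weekly) so no trigger can fire silently.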

Same idea elsewhere

Every "AI system" that quietly broke in production — credit scoring during COVID, fraud detection after Apple Pay launched, demand forecasting during supply chain shocks — failed because nobody was watching for drift. Production ML is 30% modeling, 70% monitoring.

Lesson 14 recap

  • Every model degrades. Log every prediction, log every outcome, compare weekly.
  • Data drift = inputs shift. Concept drift = the input-output relationship shifts. Both are silent killers.
  • Retrain on a schedule + on drift triggers + on known events. Belt, suspenders, and a backup belt.
Explain it to a coworker

What's the one thing you'd build into your monitoring before deploying any model? Save locally; have AI review.

Planning Cost Savings Operations
★ Tree Methods Capstone · ~25 minutes

Build a tree model on 8 months of Lincoln data

Pick your model. Tune it. Defend it.

Eight months of data. 480 jobs. Five suppliers. Two equipment regimes. One QC spec. Your turn to build a real tree-based predictor for incoming-lot QC pass rate — and write the spec sheet you'd hand to leadership.

What you're building

A model that, given a new incoming plating job's features (supplier, purity, bath conditions), predicts pass_qc. You'll pick from three model families and tune two key hyperparameters:

  • Single decision tree — most defensible. Tune max_depth.
  • Random forest — most accurate "default." Tune n_estimators + max_depth.
  • Gradient boosting — current world champion. Tune n_estimators + max_depth + learning_rate.

The workbench below lets you swap between models and see test accuracy on a held-out 96-job test set (last 20% of the data). Find the best honest accuracy you can. Then write your spec.

What you just built

  • A real tree-based QC predictor on 480 jobs of evolving Lincoln data.
  • Honest test-set evaluation with a held-out 96 jobs the model never saw during training/tuning.
  • A defense-ready spec sheet: which features matter, which model won, what to monitor in production.