Welcome to the catalogue
The three flavors of ML and how each one shows up at Lincoln.
Watch this first – 85-second primer for what's below.
What this whole catalogue does
Every chapter of the textbook Introduction to Statistical Learning becomes a short, interactive crash course. We strip the math down to what you actually need, anchor every concept in a Lincoln Industries problem, and end with a working model you can defend in a meeting.
You'll travel from "speak the language" (this module) to "build a real model on your own data" (final modules) – without writing a single line of code along the way.
The three flavors of ML at Lincoln
Every machine learning problem on earth is one of three shapes. Once you can see the shape, you know which tool to reach for.
Regression – predicting a number. "What will the scrap rate be on Line 4 next week?"
Classification – predicting a label. "Will this part pass final QC, yes or no?"
Clustering – finding hidden groups in data. "Are there families of plating-line failures we keep seeing?"
Same techniques, three shapes. The shape of your question decides which one you need.
Why this catalogue exists
Most ML resources assume you'll write code. Most non-coders bounce off in the first hour. This catalogue assumes the opposite: you have a job, you're busy, you want to know what's actually possible – and what's actually doable for you.
The path is simple: Foundations (now) → Linear Models → Resampling → Trees → Deep Learning → Unsupervised. Each module sharpens your judgment. The final modules hand you off to an AutoML tool (Vertex AI, H2O, SageMaker Autopilot) where you push your own data through and ship a working model.
The AI agent monitoring shops you've heard about – Arize, Helicone, LangSmith – use exactly these three flavors. Regression to predict tool-call latency. Classification to flag whether an LLM response is hallucinating. Clustering to group similar failure modes across thousands of agent runs. Manufacturing and AI agents are using the same playbook.
Lesson 1 recap
- ML at Lincoln boils down to three shapes: regression (numbers), classification (labels), clustering (hidden groups).
- This catalogue is concept-first. By the end you'll be able to push real Lincoln data through an AutoML tool β no coding required.
- Pick one painful prediction problem at work and carry it with you through every lesson.
In your own words – how would you explain this lesson's main concept to a peer at Lincoln? Save locally; have AI review for honest critique.
The data language
n, p, X, y – just enough notation to read papers and books.
Watch this first – 70-second primer for what's below.
What the letters mean
Imagine a spreadsheet of Lincoln plating jobs. Every row is a job. Every column is something you measured about that job. There's one column you care about predicting. That's all four letters.
n – the number of rows. How many jobs are in your dataset. If you have 50 plating jobs from last quarter, n = 50.
p – the number of columns (other than the target). How many things you measured per job. Bath temperature, line speed, surface prep score, plating thickness, age of solution, operator skill, customer tier, part complexity. p = 8.
X – the whole table of inputs. Capital X = matrix. n rows by p columns.
y – the column you want to predict. Lower-case y = a single column. For each row, one value.
A row is a job. Columns are what you measured. The rightmost is what you want to predict. ML in one diagram.
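If you ever open this in code, the four letters map one-to-one onto a pandas table. A minimal sketch – the column names are made-up stand-ins, not Lincoln's real schema:

```python
import pandas as pd

# Hypothetical plating-job table: one row per job, one column per measurement.
jobs = pd.DataFrame({
    "bath_temp_c": [81.5, 82.0, 79.8],        # features...
    "line_speed":  [14, 12, 15],
    "prep_score":  [8, 6, 9],
    "pass_qc":     ["PASS", "FAIL", "PASS"],  # ...and the target
})

X = jobs.drop(columns="pass_qc")  # the input matrix: n rows x p columns
y = jobs["pass_qc"]               # the column you want to predict

print(X.shape)  # (3, 3) -> n = 3 jobs, p = 3 features
```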
Quantitative vs categorical
One more fork before you can read anything in the field. The type of y decides which family of algorithms you need.
Quantitative – y is a number. Plating yield (87%), thickness (28 microns), cycles since service (120). Use regression methods.
Categorical – y is a label. PASS or FAIL. Customer tier A, B, or C. Use classification methods.
You'll see "regression vs classification" mentioned constantly. This is what they mean.
When an AI agent shop talks about "feature space," they mean the same X matrix. Each agent run is one row; each thing they measured (latency, tokens, success-flag) is a column. When they say "target variable," they mean y. Same letters, same shape.
Lesson 2 recap
- n = rows (observations). p = columns (features). X = input matrix. y = target column.
- If y is a number, you have a regression problem. If y is a label, it's classification.
- Every ML problem reduces to: "given X, predict y." Everything else is technique.
In your own words – how would you explain n, p, X, and y to a peer at Lincoln? Save locally; have AI review for honest critique.
Y = f(X) + ε
The equation every model is solving.
Watch this first – 75-second primer for what's below.
Reducible vs irreducible error
Your model's total error has two parts.
Reducible error – improves when you pick a better model or add better features. This is the part you fight against. Most of ML practice is shrinking it.
Irreducible error – locked in by the world. Random measurement noise, things you can't observe, factors that vary day-to-day. Lincoln framing: even a perfect model can't predict scrap rate to the decimal – vibration, humidity, operator focus, micro-power-fluctuations all carry noise no spreadsheet captures.
Knowing the difference saves your sanity. When your model isn't getting better, you have to ask: am I fighting reducible error (try harder) or irreducible error (this is the floor)?
Even with the perfect model, individual outcomes scatter. That scatter is ε. It's a floor – find it, accept it, move on.
Two reasons we estimate f
Same equation, two very different goals. Knowing which goal you have changes which model you should choose.
"Just give me ΕΆ"
You don't care why the model says what it says. You just want accurate predictions. Black-box models are fine. Lincoln framing: "will this batch pass QC?" β it's fine if the model can't explain why, as long as it's right.
"Tell me which X drives Y"
You want to understand the relationship. Which lever moves the outcome? You need a clear, interpretable model. Lincoln framing: "which setup parameter has the biggest impact on yield?" – answers like this change how the floor runs.
Fit a curve through Lincoln's defect data
The blue dots are real defect measurements – they don't move. The black curve is YOUR model. Drag the four amber handles up and down to make the curve pass through the cloud of dots as closely as possible. Your MSE drops as the line gets closer to the dots – but it can never reach zero, because real data has noise around the true pattern.
The amber dashed band is the noise floor – the gap between dots and the true pattern that no model can fight (that's the ε in Y = f(X) + ε). The closer your MSE gets to 1.0, the closer your line is to the best any model could possibly do. You're done when MSE bottoms out – not when dots sit exactly on the line (they never will).
When you ask "why did this AI agent fail?" you're doing inference. When you ask "which agent will succeed on this prompt?" you're doing prediction. The same X, the same Y, the same ε – the goal changes the model.
Lesson 3 recap
- Every model is hunting for f in Y = f(X) + ε. The ε is irreducible – accept it.
- Reducible error you can fight (better model, better features). Irreducible error is the floor.
- Two goals split the world: prediction (just be accurate) vs inference (explain the relationship).
In your own words – how would you explain Y = f(X) + ε and the noise floor? Save locally; have AI review for honest critique.
How to find f
Parametric vs non-parametric. Flexibility vs interpretability.
Watch this first – 70-second primer for what's below.
Parametric – pick the shape, fit the numbers
Linear regression is the classic example. You assume f is a straight line. Now you only have two numbers to find: slope and intercept. Cheap, fast, easy to explain.
The tax: if the truth is curved and you assumed a line, you'll always be a little wrong. Doesn't matter how much data you throw at it. Your assumption is the ceiling.
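Here's what "only two numbers to find" looks like in practice – a minimal scikit-learn sketch with made-up temperature/yield readings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

bath_temp = np.array([[78], [80], [82], [84], [86]])   # X: one feature
yield_pct = np.array([84.1, 85.0, 85.7, 86.9, 87.4])   # y: the target

line = LinearRegression().fit(bath_temp, yield_pct)
print(line.coef_[0], line.intercept_)  # slope and intercept: the entire model
```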
Non-parametric – let the data lead
No assumed shape. The fit can wiggle as much as it needs to follow the data. KNN, decision trees, splines – all non-parametric.
The tax: you need a lot more data to get a stable answer. With 30 jobs you can fit a straight line confidently. With 30 jobs and a wiggly non-parametric model, you're chasing noise.
Same scatter, three different commitments. Pick what your problem calls for – and what your data can support.
The flexibility / interpretability tradeoff
Here's the rule that matters in practice: more flexible models are harder to explain.
A linear regression gives you a one-line answer: "yield rises 0.4% for every degree increase in bath temp." Your boss can defend it. A wiggly non-parametric fit gives you "the model says so." Your boss is unlikely to defend that.
Choose your model based on who's asking and what they need. Sometimes accuracy wins, sometimes explainability wins. There's no universal right answer.
Pick a shape for f
Same Lincoln bath-temperature data. Three different shapes. Click each one and watch what happens.
Linear is one straight line. Cheap, clear, but can't bend.
A logistic regression for AI agent routing is parametric – clean rules you can defend in a postmortem. A neural network sits at the flexible extreme – it assumes almost no shape, eats data for breakfast, but good luck explaining a single decision.
Lesson 4 recap
- Parametric = pick a shape (linear, polynomial), fit the numbers. Cheap, clear, can be wrong if the shape is wrong.
- Non-parametric = let the data shape itself. Flexible, hungry for data, harder to explain.
- More flexible = more accurate (sometimes) but less interpretable. The tradeoff is the whole game.
In your own words – how would you explain parametric vs non-parametric models, and the defensibility tradeoff? Save locally; have AI review for honest critique.
Did it actually work?
MSE and the train/test split. The non-negotiable pre-flight.
Watch this first – 65-second primer for what's below.
MSE – Mean Squared Error
The standard score for a regression model. For each prediction, take (prediction − actual), square it, then average across all predictions.
Why squared? So big misses count more than small ones. A prediction that's off by 10 is much worse than ten predictions off by 1. The square makes that bite.
Lower MSE = better. Period. But: which data did you measure it on? That's the whole question.
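The whole formula fits in three lines of numpy. A sketch with made-up yields – watch how the single 10-point miss dominates:

```python
import numpy as np

actual    = np.array([87.0, 85.0, 90.0, 88.0])  # made-up yields
predicted = np.array([86.0, 88.0, 89.5, 78.0])  # one big miss at the end

mse = np.mean((predicted - actual) ** 2)
print(mse)  # 27.5625 -- the 10-point miss alone contributes 100/4 = 25 of it
```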
The train/test split
Hold out 20% of your data before you train. Train the model on 80%. Score it on the 20% it has never seen. That score is your honest estimate of how it'll do in the real world.
Anything else is cheating. If the model has seen the data during training, asking it to predict that data is like giving a student the answer key during the exam. They'll ace it. They'll teach you nothing.
Hide 20% before training. Score on it after. Anything else is fiction.
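In scikit-learn the split is one call. A sketch, reusing the X and y from the Lesson 2 sketch:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% BEFORE any training; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
# Train on X_train/y_train only. Report the score from X_test/y_test only.
```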
Overfitting β the trap
A super-flexible model can memorize the training data. It can fit every noisy point exactly. Train MSE: near zero. The model looks brilliant. You ship it.
Then production data arrives. The same model is suddenly catastrophic. Why? Because it learned the noise, not the signal. Train MSE was a lie.
The only honest metric is the one measured on data the model hasn't seen. That's the rule. Memorize it.
Drag your predictions to fit Lincoln's daily yield
Then flip the test-set switch and see what you actually built.
A tight fit on training data isn't proof of anything. Holdout is. Drag every handle right onto its cyan dot – train MSE goes to ~0. Then reveal the test set and watch what happens to test MSE.
When an AI shop says "we evaluated on a held-out set," this is what they mean. If they don't say that, ignore the accuracy number. There's a 90% chance they trained and tested on the same data – which tells you nothing.
Lesson 5 recap
- MSE = average squared prediction error. Lower is better – but only on the right data.
- Train/test split = hide 20%, train on 80%, score on the hidden 20%. Non-negotiable.
- Overfitting happens when a flexible model memorizes the training noise. Train MSE looks great, test MSE blows up.
In your own words – how would you explain MSE and the train/test split? Save locally; have AI review for honest critique.
Bias vs Variance
The central tradeoff. Move the slider. Feel it click.
Watch this first – 2-minute primer for what's below. The flagship.
Bias β error from being too simple
Bias is what happens when your model can't bend enough to follow the truth. You assume defect rate is a flat line, but it's actually a curve. No matter how much data you give it, the line will always be wrong in the middle. That's bias.
High bias = the model is too rigid. It misses real patterns. The error is built into the assumption.
Variance β error from being too clingy
Variance is what happens when your model bends too eagerly. It hugs every data point, including the noisy ones. Re-train on a slightly different sample and the fit changes wildly. That instability is variance.
High variance = the model is too sensitive. It's chasing noise. Predictions on new data are unstable.
Bias is a fence. Variance is a windsock. Just-right is the curve in the middle that ignores both extremes.
The U-shape
As you increase a model's flexibility, two things happen at once. Bias drops (more flexible models can follow the truth). Variance rises (more flexible models chase noise).
Total error = bias² + variance + irreducible noise. Add it up across flexibility levels and you get a U-shape. The bottom of the U is the sweet spot – flexible enough to capture the pattern, not so flexible that it's chasing noise.
Every modeling decision is just trying to find the bottom of that U.
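The same sentence in symbols – this is the decomposition ISLR2 gives in §2.2.2 for the expected test error at a point x₀:

$$
\mathbb{E}\Big[\big(y_0 - \hat{f}(x_0)\big)^2\Big]
= \underbrace{\mathrm{Bias}\big(\hat{f}(x_0)\big)^2}_{\text{too simple}}
+ \underbrace{\mathrm{Var}\big(\hat{f}(x_0)\big)}_{\text{too clingy}}
+ \underbrace{\mathrm{Var}(\varepsilon)}_{\text{the floor}}
$$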
The BiasβVariance Playground
Move the slider. Watch the curve, the bias, and the variance fight each other.
The dotted vertical line on the right chart marks where you are on the U-shape. The green dashed line marks the sweet spot – the flexibility level where test MSE bottoms out. Watch test MSE drop, bottom, then climb again as variance takes over.
When an AI agent shop says "we hyperparameter-tuned for the lowest validation loss," they're standing on the U-shape, looking for the dip. Same playground, different vocabulary.
Lesson 6 recap
- Bias = error from too rigid a model. Misses real patterns.
- Variance = error from too flexible a model. Chases noise.
- Total error = bias² + variance + irreducible noise. As flexibility grows, bias drops, variance rises.
- The sweet spot is the bottom of the U. Every modeling decision is hunting for it.
In your own words – how would you explain bias, variance, and the U-shape? Save locally; have AI review for honest critique.
KNN – your first algorithm
Classify by neighbors. The simplest serious ML there is.
Watch this first – 75-second primer for what's below.
How it works in one sentence
Given a new point, find the K closest points in your training data, take a majority vote. That's it.
Lincoln framing: a new plating job arrives with bath temp 82°C and line speed 14 parts/min. Look up the 10 most similar past jobs. If 7 passed and 3 failed, predict PASS. If 3 passed and 7 failed, predict FAIL. Done.
That's a real, defensible model. No coefficients. No training. The "model" is just your past data.
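For the curious, the entire algorithm is a few lines of scikit-learn. A sketch with made-up past jobs (two features: bath temp, line speed):

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up past jobs: [bath_temp_c, line_speed] and their QC outcomes.
X_past = [[80, 12], [81, 13], [83, 15], [79, 11], [84, 16], [82, 12]]
y_past = ["PASS", "PASS", "FAIL", "PASS", "FAIL", "PASS"]

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3
knn.fit(X_past, y_past)  # "training" = storing the data, nothing more

print(knn.predict([[82, 14]]))  # majority vote among the 3 nearest past jobs
```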
Picking K
K = 1 means you trust whoever's standing closest, even if they're an outlier. One weird past job dominates every prediction near it. Wiggly decision boundary. High variance.
K = 100 means you average over basically your whole dataset. Too smooth to catch local patterns. Blurry decision boundary. High bias.
The right K is somewhere in between, and yes β it's a bias-variance tradeoff. The U-shape from Lesson 6 shows up here too. Sound familiar?
The black cross is the new job. Find K nearest. Take a vote. That's the entire algorithm.
Click to add data. Slide K.
You're predicting "will this plating job pass QC?" from two setup parameters.
Background color = what KNN would predict at every point. Click anywhere to add a labeled job and watch the boundary react. Shift-click or right-click to remove a point. Try the K=1 preset, then K=25 – same data, totally different model.
Vector databases doing semantic search are KNN at scale. "Find the K embeddings closest to my query" is literally KNN with cosine distance. Every RAG pipeline, every recommendation engine, every AI agent's memory lookup – KNN under the hood.
Lesson 7 recap
- KNN: find the K nearest past examples, take a majority vote. The model is the data.
- Small K = wiggly boundary, high variance. Large K = smooth boundary, high bias. Same U-shape.
- Perfect first algorithm for small Lincoln-sized datasets. Defensible, simple, surprisingly accurate.
In your own words – how would you explain KNN to someone who's never seen ML? Save locally; have AI review for honest critique.
Predictive Maintenance for Lincoln plating lines
The thing you can carry into work tomorrow.
Watch this first – 65-second primer for what's below.
The problem
50 machines on the floor. Some will fail in the next 7 days, some won't. We have 8 features per machine – age, cycles since service, vibration, bath temp drift, operator skill, part complexity, customer tier, and one decoy that looks important but isn't. Build a KNN classifier that predicts failure on a held-out test set.
What you'll do
Step 1. Pick which features go into the model. Some help. One is a trap.
Step 2. Pick K. (Remember Lessons 6 and 7 – find the U-shape sweet spot.)
Step 3. Watch test accuracy and the confusion matrix update live as you tweak.
Step 4. When you're happy, generate the spec sheet – your shippable artifact. (The same loop, sketched in code below.)
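If you later want to reproduce the four steps outside the workbench, here's a scikit-learn sketch – machines, the feature names, and fails_in_7d are hypothetical stand-ins for the widget's dataset:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

features = ["age", "cycles_since_service", "vibration"]  # Step 1: your pick
X = machines[features]       # hypothetical DataFrame of the 50 machines
y = machines["fails_in_7d"]  # hypothetical 0/1 target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)  # Step 2: pick K
pred = knn.predict(X_te)
print(accuracy_score(y_te, pred))    # Step 3: honest test accuracy
print(confusion_matrix(y_te, pred))  # rows = actual, columns = predicted
```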
Why it matters
If you can flag a failure 7 days before it happens, you schedule maintenance during a planned slot instead of an emergency stop. That's literally Planning + Cost Savings + Operations all at once. It's the kind of model leadership notices.
Predict which Lincoln plating lines fail in 7 days
Pick features. Pick K. See test accuracy. Then ship a spec sheet.
| | Pred Fail | Pred OK |
|---|---|---|
| Actual Fail | True Positive | False Negative |
| Actual OK | False Positive | True Negative |
High True Positive + low False Negative = catching failures. That's the ballgame.
Aim for accuracy ≥ 0.80 on the test set. Try removing the decoy. Try different Ks. The best feature set + K combo isn't always obvious.
Every observability platform that "predicts incidents before they happen" is doing this exact loop. Pick features. Pick a model. Validate on past data. Ship. The flavor differs; the playbook doesn't.
Capstone recap β what you just did
- You picked features deliberately (and learned to spot decoys that hurt accuracy).
- You tuned K for the bias-variance sweet spot – same U-shape from Lesson 6.
- You evaluated on a held-out test set – the only honest metric.
- You generated a printable spec sheet – your shippable artifact for leadership.
In your own words – how would you explain the predictive maintenance loop you just built (pick features, pick K, score, ship)? Save locally; have AI review for honest critique.
Decision trees – yes/no, all the way down
A model you can sketch on a napkin and your boss can actually follow.
The shape of a tree
Every node asks one yes/no question about one feature. Yes goes left, no goes right. You keep splitting until you hit a leaf β that leaf is your prediction.
For your sourcing data, a tree might learn this:
    def predict_qc(job):
        # One yes/no question per node; follow the answers down to a leaf.
        if job["lot_purity_pct"] < 97.0:
            if job["humidity_pct"] > 75:
                return "FAIL"  # leaf
            else:
                return "PASS"  # leaf
        else:
            if job["bath_temp_f"] < 163:
                return "FAIL"  # leaf
            else:
                return "PASS"  # leaf
You can read this. If the model predicted FAIL on a new job, you can trace the exact path. That's defensibility you literally cannot get from a neural network.
How does a tree decide where to split?
The algorithm tries every possible split (every feature × every threshold) and picks the one that best separates passes from fails. The mathematical name for "how mixed up are these groups" is Gini impurity (classification) or residual sum of squares (regression). You don't need the math; you need the intuition: a good split makes the two resulting groups as pure as possible.
Then it does the same thing on each resulting group. Recursively. Until you say stop – or every leaf is pure.
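"How mixed up" has a two-line formula. A minimal illustration for the two-class case (Gini impurity is 1 minus the sum of squared class proportions):

```python
def gini(labels):
    """Gini impurity for PASS/FAIL labels: 0.0 = pure, 0.5 = maximally mixed."""
    if not labels:
        return 0.0
    p_pass = labels.count("PASS") / len(labels)
    return 1 - (p_pass**2 + (1 - p_pass)**2)

print(gini(["PASS"] * 10))                # 0.0 -- a perfect leaf
print(gini(["PASS"] * 5 + ["FAIL"] * 5))  # 0.5 -- a useless 50/50 split
```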
Every fraud-detection vendor's "explainable AI" is just a tree (or a forest of them). When they tell you "we flagged this transaction because amount > $5000 AND device_age < 7 days," that's literally a tree path. The "AI" buzzword is doing a lot of work.
Lesson 8 recap
- A decision tree = a flowchart of yes/no splits. Each leaf is a prediction.
- The algorithm picks the split that separates the classes (pass/fail) most cleanly. Greedy, top-down.
- Trees are the most defensible model in ML – you can trace every prediction along a literal path.
In your own words – how would you explain a decision tree to a teammate who's never seen one? Save locally; have AI review for honest critique.
Pruning – when smaller trees beat bigger ones
A deep tree memorizes history. A pruned tree generalizes to the future.
The overfitting trap
A tree with no depth limit will keep splitting until each leaf contains a single training job. Training accuracy: 100%. Test accuracy on new jobs: bad. The model memorized the noise instead of the signal – the same overfitting story from Lesson 5.
Your purity-vs-thickness scatter has noise around the true linear relationship. A deep tree will draw boundaries that zig-zag around every noisy point. New jobs land between those zig-zags and get predicted wrong.
Two ways to control tree size
Pre-pruning – stop early. Set a max_depth (e.g., 4 levels) or min_samples_leaf (e.g., at least 10 jobs per leaf) before training. Simple and fast.
Post-pruning – let the tree grow huge, then prune back the branches that don't earn their keep on validation data. The textbook calls this cost-complexity pruning (ISLR2 §8.1.2). More principled but more compute.
For your Lincoln-sized data (hundreds of jobs, not millions), pre-pruning with max_depth = 3 to 5 usually beats the more elaborate methods. Don't over-engineer.
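In scikit-learn, pre-pruning is just two constructor arguments. A sketch, assuming the train/test split from Lesson 5:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,          # at most 4 levels of yes/no questions
    min_samples_leaf=10,  # every leaf must keep at least 10 jobs
    random_state=42,
)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))  # honest accuracy on held-out jobs
```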
Lesson 9 recap
- An unpruned tree overfits β it memorizes noise instead of signal.
- Pre-pruning (set max_depth or min_samples_leaf upfront) is the simple, defensible default for small datasets.
- Tuning depth is the same U-shape story from Lesson 6 – find the sweet spot, don't reach for the most complex model.
Why does a smaller tree often beat a bigger one in production? Save locally; have AI review.
Regression trees – predicting numbers, not labels
Same flowchart, different leaves.
What's different (and what isn't)
The structure is identical: yes/no splits, leaves at the bottom. The only thing that changes is what the leaves contain and how the splits are chosen.
- Leaves contain numbers (the average of the training jobs in that leaf), not class labels.
- Splits minimize variance within each leaf, not Gini impurity. The math: reduce the squared error.
That's it. ISLR2 §8.1.1 covers this in 3 pages because there isn't much more to say.
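The code change is equally small – swap the classifier for a regressor. A sketch with a hypothetical numeric target (plating thickness):

```python
from sklearn.tree import DecisionTreeRegressor

reg_tree = DecisionTreeRegressor(max_depth=3, random_state=42)
reg_tree.fit(X_train, thickness_train)  # thickness_train: hypothetical microns
print(reg_tree.predict(X_test)[:5])     # each prediction = a leaf's average
```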
Trees vs linear regression – head to head
If your data is genuinely linear, linear regression wins. It's the simplest model that fits a line.
If your data has thresholds, interactions, or non-linear chunks ("things change once temperature exceeds 167Β°F"), a tree wins. Trees handle these natively because every split IS a threshold.
Your sourcing data has both – the purity → thickness relationship is mostly linear, but humidity has a threshold effect on QC, and Supplier D introduces a step-change. A tree captures this without you specifying the form.
Pricing engines, ad bidding, churn prediction, lead scoring β all use regression trees (and their cousins, gradient boosting) when the target is a number. They're the workhorses of "tabular ML" everywhere.
Lesson 10 recap
- Regression trees predict numbers. Same structure as classification trees, different leaves.
- Splits minimize within-leaf variance instead of impurity. The intuition is the same: separate the data as cleanly as possible.
- Trees beat linear regression when the data has thresholds or step-changes; linear wins when it's just a line.
When would you reach for a regression tree over linear regression? Save locally; have AI review.
Random forests – many trees, one vote
The model that runs the small-data world.
Why 500 trees beat 1 tree
A single decision tree is twitchy – change one data point and it can split differently. High variance.
If you train 500 trees, each on a slightly different sample (with replacement), each one will overfit in a slightly different way. Average their predictions and the noise cancels out. The signal survives. The variance drops dramatically.
This trick is called bootstrap aggregating ("bagging" – ISLR2 §8.2.1). Random forests add one more twist: at each split, only a random subset of features is considered. This forces the trees to be diverse – without it they'd all look identical.
Feature importance – the sourcing engineer's bonus
Random forests give you something a single tree can't: a clean answer to "which features actually matter?"
For every feature, the forest counts how much it improved predictions across all 500 trees. The result is a ranked list. This is the answer to "where should I focus as a sourcing engineer?" If lot_purity_pct ranks #1 and operator ranks #11, you stop blaming operators and start tightening supplier contracts.
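Both the forest and its ranking are a few lines. A sketch, assuming X_train is a pandas DataFrame so the features carry names:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=500, random_state=42)
forest.fit(X_train, y_train)  # 500 trees, each on a bootstrap sample

# One importance score per feature, aggregated across all 500 trees.
ranking = pd.Series(forest.feature_importances_, index=X_train.columns)
print(ranking.sort_values(ascending=False))
```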
When a Kaggle competition ends and the winner posts their solution, it's almost always XGBoost or LightGBM (gradient-boosted trees) – the close cousins of random forests we'll see next lesson. Tree ensembles dominate small-to-medium tabular data. Period.
Lesson 11 recap
- Random forest = 500 (or however many) trees, each trained on a random sample, voting on the final prediction.
- The randomness gives the trees enough disagreement that averaging cancels their individual mistakes.
- Bonus: feature importance ranking tells you which sourcing variables actually drive outcomes.
Why is averaging 500 noisy trees better than one careful tree? Save locally; have AI review.
Boosting – trees that fix each other's mistakes
The current world champion of tabular data.
Bagging vs boosting in one sentence each
Bagging (random forest) – train many independent trees on different samples in parallel, average the predictions. Reduces variance.
Boosting – train one shallow tree. Look at what it got wrong. Train the next tree to fix those mistakes. Repeat. Reduces both bias and variance – but can overfit if you let it run too long.
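A sketch of the sequential version in scikit-learn (the same idea XGBoost and LightGBM industrialize), assuming the Lesson 5 split:

```python
from sklearn.ensemble import GradientBoostingClassifier

boost = GradientBoostingClassifier(
    n_estimators=200,   # 200 rounds of "fix the previous trees' mistakes"
    max_depth=3,        # each corrective tree stays shallow
    learning_rate=0.1,  # how big a step each correction takes
    random_state=42,
)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))  # validate honestly -- boosting can overfit
```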
Why everyone reaches for gradient boosting
On tabular data with hundreds to millions of rows, gradient boosting (the modern incarnation of the idea) usually wins. Why:
- Handles mixed features natively. Numeric, categorical, missing values β boosters eat them all.
- Fast inference. Predictions are still just walking a tree (many trees, but each is shallow).
- Strong out-of-the-box. Default settings beat heavily-tuned linear models on most real datasets.
- Built-in feature importance. Same bonus you got from random forests.
The trade: harder to defend than a single tree, more knobs to tune than a random forest, can overfit silently if you don't validate honestly. ISLR2 §8.2.3 covers the math.
Lesson 12 recap
- Boosting = sequential trees, each correcting the previous trees' mistakes. Different from random forest's parallel-and-average.
- Gradient boosting (XGBoost, LightGBM) is the dominant tabular-data model today.
- More accurate, more tunable, harder to defend. The tradeoff is real.
Bagging vs boosting – in your own words. Save locally; have AI review.
Tuning β finding the right knob settings
A boring lesson that prevents 80% of model disasters.
Grid search vs random search
Grid search = try every combination of every knob you care about. Exhaustive, slow, but thorough. Fine for 2-3 hyperparameters on a small dataset like yours.
Random search = sample N random combinations from the hyperparameter space. Surprisingly often beats grid search per unit of compute because most knobs don't matter much; sampling lets you waste less time on the unimportant ones.
Bayesian optimization (Optuna, scikit-optimize) = smarter random search that learns from each trial. Overkill for Lincoln-sized data; use it once you're past 100K rows.
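Grid and random search are near-identical calls in scikit-learn. A sketch tuning a random forest with 5-fold CV:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

grid = GridSearchCV(                      # exhaustive: all 3 x 3 combinations
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, 7], "n_estimators": [100, 300, 500]},
    cv=5,
)

rand = RandomizedSearchCV(                # sample 20 combinations at random
    RandomForestClassifier(random_state=42),
    param_distributions={"max_depth": randint(2, 10),
                         "n_estimators": randint(50, 600)},
    n_iter=20, cv=5, random_state=42,
)

grid.fit(X_train, y_train)                # rand.fit(...) works the same way
print(grid.best_params_, grid.best_score_)
```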
Cross-validation β the only honest tuning method
You CAN'T pick hyperparameters by looking at training accuracy – the most flexible settings always win there, precisely because they overfit. You CAN'T pick them by looking at your test set either – you'd be using the "honest" set to make a choice, contaminating it.
The fix: k-fold cross-validation on your TRAINING data only. Split the train set into k chunks (commonly 5). For each candidate setting: train on 4 chunks, validate on the 5th. Rotate. Average the 5 validation scores. Pick the setting with the best average. THEN evaluate the winner once on the held-out test set. ISLR2 §5.1 covers it.
This sounds elaborate; it's two lines of code in scikit-learn. The discipline matters more than the syntax.
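Here are the promised two lines (plus the final, once-only test-set check). A sketch:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 5-fold CV on TRAINING data only: five rehearsal scores per candidate setting.
candidate = DecisionTreeClassifier(max_depth=4, random_state=42)
print(cross_val_score(candidate, X_train, y_train, cv=5).mean())

# Only after picking the winner does the test set get its single look:
final = candidate.fit(X_train, y_train)
print(final.score(X_test, y_test))
```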
Lesson 13 recap
- Tree models have knobs that materially affect accuracy. Tune them.
- Grid search for small spaces; random search for bigger ones; Bayesian when you're rich in data and compute.
- Always tune on cross-validated train data β never touch the test set until you've picked your final model.
Why is using your test set to tune hyperparameters dishonest? Save locally; have AI review.
Production β ship it, then watch it
A model is never "done." It's "deployed and monitored."
Drift β the slow killer
Data drift – your inputs change shape. New supplier joins, humidity baseline rises, line speed target shifts (Month 6, remember?). The model still works mechanically but its accuracy decays because the world moved.
Concept drift – the relationship between inputs and outputs changes. Maybe Supplier E's purity number is now measured differently, or the QC spec was tightened by Quality. Same inputs, different "right" answer.
You only catch either if you're watching. The discipline: log every prediction + every eventual outcome. Compare weekly.
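The weekly comparison can be a tiny pandas script. A sketch – the log file and column names are hypothetical:

```python
import pandas as pd

# Hypothetical log: one row per prediction, outcome back-filled when known.
log = pd.read_csv("prediction_log.csv", parse_dates=["date"])
log["correct"] = (log["predicted"] == log["actual"]).astype(int)

# Rolling 30-day accuracy: the number your drift threshold watches.
rolling = log.set_index("date").sort_index()["correct"].rolling("30D").mean()
print(rolling.tail())  # a sustained drop below baseline = investigate
```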
When to retrain β three triggers
- Time-based. "Retrain monthly" – simple, defensible, no surprises. Default for stable processes.
- Drift-based. Set a threshold on your monitoring metric (e.g., "rolling 30-day accuracy drops 5 percentage points below baseline"). Retrain when triggered. More efficient but needs monitoring infrastructure.
- Event-based. Known regime change: new supplier, new equipment, new spec. Retrain on the day of the event. Sourcing engineer's instinct says this matters – trust it.
In practice: use all three. Time-based as the default, drift-based as the safety net, event-based when you know something changed.
Every "AI system" that quietly broke in production β credit scoring during COVID, fraud detection after Apple Pay launched, demand forecasting during supply chain shocks β failed because nobody was watching for drift. Production ML is 30% modeling, 70% monitoring.
Lesson 14 recap
- Every model degrades. Log every prediction, log every outcome, compare weekly.
- Data drift = inputs shift. Concept drift = the input-output relationship shifts. Both are silent killers.
- Retrain on a schedule + on drift triggers + on known events. Belt, suspenders, and a backup belt.
What's the one thing you'd build into your monitoring before deploying any model? Save locally; have AI review.
Build a tree model on 8 months of Lincoln data
Pick your model. Tune it. Defend it.
What you're building
A model that, given a new incoming plating job's features (supplier, purity, bath conditions), predicts pass_qc. You'll pick from three model families and tune two key hyperparameters:
- Single decision tree – most defensible. Tune max_depth.
- Random forest – most accurate "default." Tune n_estimators + max_depth.
- Gradient boosting – current world champion. Tune n_estimators + max_depth + learning_rate.
The workbench below lets you swap between models and see test accuracy on a held-out 96-job test set (last 20% of the data). Find the best honest accuracy you can. Then write your spec.
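Off the workbench, the same bake-off is a short loop. A sketch, assuming X_train/X_test hold the time-ordered 384/96 split described above:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

candidates = {
    "single tree":       DecisionTreeClassifier(max_depth=4, random_state=42),
    "random forest":     RandomForestClassifier(n_estimators=300, max_depth=7,
                                                random_state=42),
    "gradient boosting": GradientBoostingClassifier(n_estimators=200,
                                                    learning_rate=0.1,
                                                    max_depth=3,
                                                    random_state=42),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)               # first 384 jobs
    print(name, model.score(X_test, y_test))  # held-out last 96 jobs
```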
What you just built
- A real tree-based QC predictor on 480 jobs of evolving Lincoln data.
- Honest test-set evaluation with a held-out 96 jobs the model never saw during training/tuning.
- A defense-ready spec sheet: which features matter, which model won, what to monitor in production.