Module 1 · Foundations · ISLR Ch 1 + 2

From Zero to Your First ML Model

A hands-on crash course built for people who want to use machine learning at work — not write papers about it. Every concept maps to a real Lincoln Industries problem.

Lesson 1 · ~15 minutes

Welcome to the catalogue

The 3 flavors of ML and how each one shows up at Lincoln.

Watch this first — 85-second primer for what's below.

Most of what people call "AI" today is statistical learning — finding patterns in data and using them to make decisions. You don't need to write code to use it. You need to know what to ask for, what's possible, and what's snake oil. By the end of this catalogue, you'll know.

What this whole catalogue does

Every chapter of the textbook Introduction to Statistical Learning becomes a short, interactive crash course. We strip the math down to what you actually need, anchor every concept in a Lincoln Industries problem, and end with a working model you can defend in a meeting.

You'll travel from "speak the language" (this module) to "build a real model on your own data" (final modules) — without writing a single line of code along the way.

The three flavors of ML at Lincoln

Every machine learning problem on earth is one of three shapes. Once you can see the shape, you know which tool to reach for.

Regression — predicting a number. "What will the scrap rate be on Line 4 next week?"

Classification — predicting a label. "Will this part pass final QC, yes or no?"

Clustering — finding hidden groups in data. "Are there families of plating-line failures we keep seeing?"

[Diagram: three flavors, three Lincoln problems. REGRESSION predicts a number ("scrap rate next week?"). CLASSIFICATION predicts a label, PASS/FAIL ("will this part pass QC?"). CLUSTERING finds hidden groups ("failure families?").]

Same techniques, three shapes. The shape of your question decides which one you need.

Why this catalogue exists

Most ML resources assume you'll write code. Most non-coders bounce off in the first hour. This catalogue assumes the opposite: you have a job, you're busy, you want to know what's actually possible — and what's actually doable for you.

The path is simple: Foundations now → Linear Models → Resampling → Trees → Deep Learning → Unsupervised. Each module sharpens your judgment. The final modules ship you to an AutoML tool (Vertex AI, H2O, SageMaker Autopilot) where you push your own data through and ship a working model.

Same idea elsewhere

The AI agent monitoring shops you've heard about — Arize, Helicone, LangSmith — use exactly these three flavors. Regression to predict tool-call latency. Classification to flag whether an LLM response is hallucinating. Clustering to group similar failure modes across thousands of agent runs. Manufacturing and AI agents are using the same playbook.

On Monday at Lincoln

Pick the most painful prediction problem on your floor. "Which jobs will run over schedule?" "Which baths need refreshing this week?" "Which suppliers slip first?"

Hold that one problem in your head through the rest of this module. Every concept we hit, ask yourself: "how would I apply this to my problem?" By the capstone you'll have an answer.

Lesson 1 recap

  • ML at Lincoln boils down to three shapes: regression (numbers), classification (labels), clustering (hidden groups).
  • This catalogue is concept-first. By the end you'll be able to push real Lincoln data through an AutoML tool β€” no coding required.
  • Pick one painful prediction problem at work and carry it with you through every lesson.
Explain it to a coworker

In your own words — how would you explain this lesson's main concept to a peer at Lincoln? Save locally; have AI review for honest critique.

Lesson 2 · ~15 minutes

The data language

n, p, X, y — just enough notation to read papers and books.

Watch this first — 70-second primer for what's below.

Every ML conversation uses four letters: n, p, X, y. Once you know what they mean, you can read any ML paper, blog post, or AutoML output without bouncing. This is a 10-minute investment that pays back forever.

What the letters mean

Imagine a spreadsheet of Lincoln plating jobs. Every row is a job. Every column is something you measured about that job. There's one column you care about predicting. That's all four letters.

n — the number of rows. How many jobs are in your dataset. If you have 50 plating jobs from last quarter, n = 50.

p — the number of columns (other than the target). How many things you measured per job. Bath temperature, line speed, surface prep score, plating thickness, age of solution, operator skill, customer tier, part complexity. p = 8.

X — the whole table of inputs. Capital X = matrix. n rows by p columns.

y — the column you want to predict. Lower-case y = a single column. For each row, one value.

Lincoln plating jobs as X and y

JOB | BATH °C | SPEED | PREP | THICK | SOLN AGE | SKILL | CMPLX | y (PASS?)
001 |   82    |  14   | 0.92 |  28   |   120    |   4   |   3   | PASS
002 |   79    |  22   | 0.71 |  19   |   340    |   2   |   5   | FAIL
003 |   85    |  18   | 0.94 |  31   |    90    |   5   |   2   | PASS
 …  |   …     |  …    |  …   |  …    |    …     |   …   |   …   |  …
050 |   81    |  15   | 0.88 |  27   |   160    |   4   |   3   | PASS

X — the input matrix (50 rows × 8 columns) · y — what we predict · n = 50 jobs
n = 50 · p = 8 · X is 50 × 8 · y is one column of 50 values
"Given X (the eight setup parameters), can we predict y (will it pass QC)?"

A row is a job. Columns are what you measured. The rightmost is what you want to predict. ML in one diagram.
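The four letters map straight onto plain data structures. Here is a minimal Python sketch using the first three rows of the table above (the diagram shows seven of the eight feature columns, so this toy slice has p = 7):

```python
# Each inner list is one plating job (a row of X); each value is one feature.
# Columns: bath °C, speed, prep score, thickness, solution age, skill, complexity
X = [
    [82, 14, 0.92, 28, 120, 4, 3],  # job 001
    [79, 22, 0.71, 19, 340, 2, 5],  # job 002
    [85, 18, 0.94, 31, 90, 5, 2],   # job 003
]
y = ["PASS", "FAIL", "PASS"]  # the target column: one value per row

n = len(X)     # rows = observations
p = len(X[0])  # feature columns
print(n, p)    # 3 jobs, 7 features in this toy slice
```

Swap in your own spreadsheet and the letters still mean the same thing: rows of X, one y per row.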

Quantitative vs categorical

One more fork before you can read anything in the field. The type of y decides which family of algorithms you need.

Quantitative — y is a number. Plating yield (87%), thickness (28 microns), cycles since service (120). Use regression methods.

Categorical — y is a label. PASS or FAIL. Customer tier A, B, or C. Use classification methods.

You'll see "regression vs classification" mentioned constantly. This is what they mean.

Same idea elsewhere

When an AI agent shop talks about "feature space," they mean the same X matrix. Each agent run is one row; each thing they measured (latency, tokens, success-flag) is a column. When they say "target variable," they mean y. Same letters, same shape.

On Monday at Lincoln

Open any spreadsheet on your machine — production schedule, QC log, supplier scorecard. Count the rows — that's your n. Pick the column you'd want to predict — that's your y. Count the remaining columns — that's your p.

Congratulations. You now have a structured ML problem. The rest is choosing the right model for it.

Lesson 2 recap

  • n = rows (observations). p = columns (features). X = input matrix. y = target column.
  • If y is a number, you have a regression problem. If y is a label, it's classification.
  • Every ML problem reduces to: "given X, predict y." Everything else is technique.
Explain it to a coworker

In your own words — how would you explain n, p, X, and y to a peer at Lincoln? Save locally; have AI review for honest critique.

Lesson 3 · ~20 minutes

Y = f(X) + ε

The equation every model is solving.

Watch this first — 75-second primer for what's below.

Every model on earth is solving the same equation: Y = f(X) + ε. Real outcome equals the true pattern plus noise. ML's job is to find f. The ε is the part you can never predict — and accepting that is half the battle.

Reducible vs irreducible error

Your model's total error has two parts.

Reducible error — improves when you pick a better model or add better features. This is the part you fight against. Most of ML practice is shrinking it.

Irreducible error — locked in by the world. Random measurement noise, things you can't observe, factors that vary day-to-day. Lincoln framing: even a perfect model can't predict scrap rate to the decimal — vibration, humidity, operator focus, micro-power-fluctuations all carry noise no spreadsheet captures.

Knowing the difference saves your sanity. When your model isn't getting better, you have to ask: am I fighting reducible error (try harder) or irreducible error (this is the floor)?

[Chart: the noise floor, i.e., what irreducible error looks like. Black curve = true f · cyan dots = observed Y · red lines = ε (irreducible noise). The noise floor: can't beat this.]

Even with the perfect model, individual outcomes scatter. That scatter is ε. It's a floor — find it, accept it, move on.
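You can feel the floor in a few lines of Python. This sketch invents a true f and noise with variance about 1.0 (both made-up illustration values), then scores a model that knows f perfectly:

```python
import random
import statistics

random.seed(0)

# A made-up "true" pattern f, and irreducible noise ε with sd 1 (variance ≈ 1.0)
def f(x):
    return 2.0 + 0.5 * x

xs = [random.uniform(0, 10) for _ in range(5000)]
ys = [f(x) + random.gauss(0, 1.0) for x in xs]   # Y = f(X) + ε

# Score the PERFECT model: we predict f(x) exactly, yet MSE is not zero
mse = statistics.fmean((f(x) - y) ** 2 for x, y in zip(xs, ys))
print(round(mse, 2))   # hovers around 1.0, the variance of ε
```

Even knowing f exactly, the score sits near the variance of ε. No modeling effort pushes below that floor.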

Two reasons we estimate f

Same equation, two very different goals. Knowing which goal you have changes which model you should choose.

Prediction

"Just give me ΕΆ"

You don't care why the model says what it says. You just want accurate predictions. Black-box models are fine. Lincoln framing: "will this batch pass QC?" β€” it's fine if the model can't explain why, as long as it's right.

Inference

"Tell me which X drives Y"

You want to understand the relationship. Which lever moves the outcome? You need a clear, interpretable model. Lincoln framing: "which setup parameter has the biggest impact on yield?" — answers like this change how the floor runs.

Try it

Fit a curve through Lincoln's defect data

The blue dots are real defect measurements — they don't move. The black curve is YOUR model. Drag the four amber handles up and down to make the curve pass through the cloud of dots as closely as possible. Your MSE drops as the line gets closer to the dots — but it can never reach zero, because real data has noise around the true pattern.


The amber dashed band is the noise floor — the gap between dots and the true pattern that no model can fight (that's the ε in Y = f(X) + ε). The closer your MSE gets to 1.0, the closer your line is to the best any model could possibly do. You're done when MSE bottoms out — not when dots sit exactly on the line (they never will).

Same idea elsewhere

When you ask "why did this AI agent fail?" you're doing inference. When you ask "which agent will succeed on this prompt?" you're doing prediction. The same X, the same Y, the same ε — the goal changes the model.

On Monday at Lincoln

Take the prediction problem you picked in Lesson 1. Ask yourself: do I want a number or a class (prediction), or do I want to know which lever to pull (inference)?

If your boss is going to ask "why?" — choose an inference-friendly model later (linear regression, decision tree). If they only need a number on the dashboard — anything goes.

Lesson 3 recap

  • Every model is hunting for f in Y = f(X) + ε. The ε is irreducible — accept it.
  • Reducible error you can fight (better model, better features). Irreducible error is the floor.
  • Two goals split the world: prediction (just be accurate) vs inference (explain the relationship).
Explain it to a coworker

In your own words — how would you explain Y = f(X) + ε and the noise floor? Save locally; have AI review for honest critique.

Lesson 4 · ~20 minutes

How to find f

Parametric vs non-parametric. Flexibility vs interpretability.

Watch this first — 70-second primer for what's below.

There are two ways to find f. Pick a shape and let math fill in the constants (parametric), or let the data shape itself (non-parametric). Each comes with a tax. Knowing which to pay matters more than knowing the algorithm names.

Parametric — pick the shape, fit the numbers

Linear regression is the classic example. You assume f is a straight line. Now you only have two numbers to find: slope and intercept. Cheap, fast, easy to explain.

The tax: if the truth is curved and you assumed a line, you'll always be a little wrong. Doesn't matter how much data you throw at it. Your assumption is the ceiling.

Non-parametric — let the data lead

No assumed shape. The fit can wiggle as much as it needs to follow the data. KNN, decision trees, splines — all non-parametric.

The tax: you need a lot more data to get a stable answer. With 30 jobs you can fit a straight line confidently. With 30 jobs and a wiggly non-parametric model, you're chasing noise.

[Diagram: same data, three shapes for f. LINEAR (parametric · 2 numbers), POLYNOMIAL (parametric · ~5 numbers), FLEXIBLE (non-parametric · 10+ numbers). More flexibility = more wiggle = more parameters = more data needed.]

Same scatter, three different commitments. Pick what your problem calls for — and what your data can support.
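Here is the same commitment in runnable form: one invented curved dataset, fit two ways. The parametric line has only two numbers to find; the tiny non-parametric neighbour-average has no assumed shape at all. The sine-shaped "truth", the noise level, and the helper names are all illustration choices, not the lesson's widget:

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical bath-temperature data: a curved true pattern plus noise
xs = [i / 10 for i in range(100)]
ys = [math.sin(x) + random.gauss(0, 0.2) for x in xs]

# Parametric: assume f is a straight line, so there are only two numbers
# to find, slope and intercept (ordinary least squares, closed form)
xbar = statistics.fmean(xs)
ybar = statistics.fmean(ys)
num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
den = sum((x - xbar) ** 2 for x in xs)
slope = num / den
intercept = ybar - slope * xbar

def line(x):
    return intercept + slope * x

# Non-parametric: no assumed shape. Predict with the average y of the
# k nearest past points (a tiny nearest-neighbour regressor).
def bendy(x0, k=7):
    nearest = sorted(zip(xs, ys), key=lambda row: abs(row[0] - x0))[:k]
    return statistics.fmean(y for _, y in nearest)

def train_mse(model):
    return statistics.fmean((model(x) - y) ** 2 for x, y in zip(xs, ys))

mse_line, mse_knn = train_mse(line), train_mse(bendy)
print(mse_line > mse_knn)   # True: the rigid line can't follow the curve
```

On curved data the line's assumption is its ceiling: its error stays high no matter how much data arrives, while the wiggly fit follows the bend. The flip side, as above, is that the wiggly fit needs far more data to stay stable.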

The flexibility / interpretability tradeoff

Here's the rule that matters in practice: more flexible models are harder to explain.

A linear regression gives you a one-line answer: "yield rises 0.4% for every degree increase in bath temp." Your boss can defend it. A wiggly non-parametric fit gives you "the model says so." Your boss is unlikely to defend that.

Choose your model based on who's asking and what they need. Sometimes accuracy wins, sometimes explainability wins. There's no universal right answer.

Try it

Pick a shape for f

Same Lincoln bath-temperature data. Three different shapes. Click each one and watch what happens.

Training MSE–
Interpretable?–

Linear is one straight line. Cheap, clear, but can't bend.

Same idea elsewhere

A logistic regression for AI agent routing is parametric — clean rules you can defend in a postmortem. A neural network is non-parametric — no assumed shape, eats data for breakfast, but good luck explaining a single decision.

Knowledge Return · From Lesson 2
Q: Quick — what was the y in Lesson 2's plating-jobs table?
A: PASS / FAIL — the column we want to predict. Everything else (bath temp, speed, prep, thickness) is X.
On Monday at Lincoln

When leadership asks "why is the model predicting that?" — the parametric model gives you a one-line answer. The non-parametric model gives you "because patterns." Choose based on who's listening.

Default first move: try the simplest parametric model. If it's good enough, ship it. If it's not, level up.

Lesson 4 recap

  • Parametric = pick a shape (linear, polynomial), fit the numbers. Cheap, clear, can be wrong if the shape is wrong.
  • Non-parametric = let the data shape itself. Flexible, hungry for data, harder to explain.
  • More flexible = more accurate (sometimes) but less interpretable. The tradeoff is the whole game.
Explain it to a coworker

In your own words — how would you explain parametric vs non-parametric models, and the defensibility tradeoff? Save locally; have AI review for honest critique.

Lesson 5 · ~20 minutes

Did it actually work?

MSE and the train/test split. The non-negotiable pre-flight.

Watch this first — 65-second primer for what's below.

The single most common ML mistake is shipping a model that looks great on the data it was trained on — and falls apart in production. The fix is brutally simple: hide some data from the model, then test on it.

MSE — Mean Squared Error

The standard score for a regression model. For each prediction, take (prediction − actual), square it, then average across all predictions.

Why squared? So big misses count more than small ones. A prediction that's off by 10 is much worse than ten predictions off by 1. The square makes that bite.

Lower MSE = better. Period. But: which data did you measure it on? That's the whole question.
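The arithmetic is small enough to do by hand. A sketch with four hypothetical predictions:

```python
preds   = [12.0, 15.5, 9.0, 11.2]   # model predictions (made-up values)
actuals = [11.0, 18.0, 9.5, 11.0]   # what really happened (made-up values)

# (prediction − actual), squared, averaged
mse = sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds)
print(mse)   # ≈ 1.885: the single 2.5-unit miss dominates after squaring
```

Note how the one big miss (off by 2.5, squared to 6.25) contributes more than the other three errors combined. That's the squaring doing its job.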

The train/test split

Hold out 20% of your data before you train. Train the model on 80%. Score it on the 20% it has never seen. That score is your honest estimate of how it'll do in the real world.

Anything else is cheating. If the model has seen the data during training, asking it to predict that data is like giving a student the answer key during the exam. They'll ace it. They'll teach you nothing.

The train/test split All your data (e.g., 50 plating jobs) TRAINING SET (80%) β€” 40 jobs the model learns from TEST (20%) model never sees these β†’ honest score from here

Hide 20% before training. Score on it after. Anything else is fiction.

Overfitting — the trap

A super-flexible model can memorize the training data. It can fit every noisy point exactly. Train MSE → near zero. The model looks brilliant. You ship it.

Then production data arrives. The same model is suddenly catastrophic. Why? Because it learned the noise, not the signal. Train MSE was a lie.

The only honest metric is the one measured on data the model hasn't seen. That's the rule. Memorize it.
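The trap is easy to reproduce. This sketch (hypothetical daily-yield numbers, a made-up linear truth) builds a "memorizer" that recalls training answers exactly. It scores a perfect train MSE and falls apart on the holdout:

```python
import random
import statistics

random.seed(7)

# Hypothetical daily-yield data: a simple true pattern plus noise
data = [(x, 0.4 * x + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)
train, test = data[:80], data[80:]        # hide 20% before training

# A "memorizer": perfect recall on anything it has seen, clueless otherwise
lookup = dict(train)
fallback = statistics.fmean(y for _, y in train)

def memorizer(x):
    return lookup.get(x, fallback)

def mse(model, rows):
    return statistics.fmean((model(x) - y) ** 2 for x, y in rows)

print(mse(memorizer, train))   # 0.0, looks brilliant
print(mse(memorizer, test))    # huge, it learned a lookup table, not the signal
```

The train number is a lie in exactly the sense the lesson describes: the only honest score is the one on the 20 rows the model never saw.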

Try it

Drag your predictions to fit Lincoln's daily yield

Then flip the test-set switch and see what you actually built.


A tight fit on training data isn't proof of anything. Holdout is. Drag every handle right onto its cyan dot — train MSE goes to ~0. Then reveal the test set and watch what happens to test MSE.

Same idea elsewhere

When an AI shop says "we evaluated on a held-out set," this is what they mean. If they don't say that, ignore the accuracy number — odds are they trained and tested on the same data, which tells you nothing.

On Monday at Lincoln

Never trust an accuracy number that doesn't tell you what it was tested on. Always ask: "was this measured on data the model has seen, or data it hadn't seen?"

If a vendor pitches you a model with 95% accuracy and can't answer that question — they don't know what they built. Run.

Lesson 5 recap

  • MSE = average squared prediction error. Lower is better — but only on the right data.
  • Train/test split = hide 20%, train on 80%, score on the hidden 20%. Non-negotiable.
  • Overfitting happens when a flexible model memorizes the training noise. Train MSE looks great, test MSE blows up.
Explain it to a coworker

In your own words — how would you explain MSE and the train/test split? Save locally; have AI review for honest critique.

Lesson 6 · ~25 minutes · Flagship

Bias vs Variance

The central tradeoff. Move the slider. Feel it click.

Watch this first — 2-minute primer for what's below. The flagship.

This is the lesson that makes everything else click. Once you feel the bias-variance tradeoff in your hands, you'll never look at a model the same way. Every modeling decision in your career will be a flavor of this.

Bias — error from being too simple

Bias is what happens when your model can't bend enough to follow the truth. You assume defect rate is a flat line, but it's actually a curve. No matter how much data you give it, the line will always be wrong in the middle. That's bias.

High bias = the model is too rigid. It misses real patterns. The error is built into the assumption.

Variance — error from being too clingy

Variance is what happens when your model bends too eagerly. It hugs every data point, including the noisy ones. Re-train on a slightly different sample and the fit changes wildly. That instability is variance.

High variance = the model is too sensitive. It's chasing noise. Predictions on new data are unstable.

[Diagram: three model fits on the same data. UNDERFIT: high bias · low variance, "too rigid to follow the truth". JUST RIGHT: low bias · low variance, "follows the trend, ignores noise". OVERFIT: low bias · high variance, "chases every data point, including noise".]

Bias is a fence. Variance is a windsock. Just-right is the curve in the middle that ignores both extremes.

The U-shape

As you increase a model's flexibility, two things happen at once. Bias drops (more flexible models can follow the truth). Variance rises (more flexible models chase noise).

Total error = bias² + variance + irreducible noise. Add it up across flexibility levels and you get a U-shape. The bottom of the U is the sweet spot — flexible enough to capture the pattern, not so flexible that it's chasing noise.

Every modeling decision is just trying to find the bottom of that U.

Flagship · Try it

The Bias–Variance Playground

Move the slider. Watch the curve, the bias, and the variance fight each other.

[Charts: model fit on 30 training points; Train MSE · Test MSE · Bias² · Variance tracked across all flexibility levels.]

The dotted vertical line on the right chart marks where you are on the U-shape. The green dashed line marks the sweet spot — the flexibility level where test MSE bottoms out. Watch test MSE drop, bottom, then climb again as variance takes over.

Same idea elsewhere

When an AI agent shop says "we hyperparameter-tuned for the lowest validation loss," they're standing on the U-shape, looking for the dip. Same playground, different vocabulary.

Knowledge Return · From Lesson 5
Q: What's the rule about evaluating a model's accuracy?
A: Always score on data the model has never seen — the train/test split. A model that memorizes history will score perfectly on history and badly in production. The test set is the only honest number.
On Monday at Lincoln

When a model performs perfectly on past data, ask one question: how does it do on a held-out month? If the gap is large, that's variance speaking. The model memorized history. Ship a simpler one.

And when leadership says "make it more accurate" — push back. Past a point, more flexibility makes accuracy worse, not better. The U-shape is real.

Lesson 6 recap

  • Bias = error from too rigid a model. Misses real patterns.
  • Variance = error from too flexible a model. Chases noise.
  • Total error = bias² + variance + irreducible noise. As flexibility grows, bias drops, variance rises.
  • The sweet spot is the bottom of the U. Every modeling decision is hunting for it.
Explain it to a coworker

In your own words — how would you explain bias, variance, and the U-shape? Save locally; have AI review for honest critique.

Lesson 7 · ~20 minutes

KNN — your first algorithm

Classify by neighbors. The simplest serious ML there is.

Watch this first — 75-second primer for what's below.

K-Nearest Neighbors is the simplest serious ML algorithm there is. There's no math to memorize. The model is the data. And you can draw it in your head.

How it works in one sentence

Given a new point, find the K closest points in your training data, take a majority vote. That's it.

Lincoln framing: a new plating job arrives with bath temp 82°C and line speed 14 parts/min. Look up the 10 most similar past jobs. If 7 passed and 3 failed, predict PASS. If 3 passed and 7 failed, predict FAIL. Done.

That's a real, defensible model. No coefficients. No training. The "model" is just your past data.
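The whole algorithm fits in a dozen lines of Python. A sketch of the plating-job example above, with a hypothetical ten-job history rigged so 7 of the 10 neighbours passed:

```python
import math
from collections import Counter

# Hypothetical past plating jobs: (bath °C, line speed, outcome)
history = [
    (81, 13, "PASS"), (83, 15, "PASS"), (80, 14, "PASS"), (84, 12, "PASS"),
    (82, 16, "PASS"), (79, 14, "PASS"), (85, 13, "PASS"),
    (78, 22, "FAIL"), (88, 20, "FAIL"), (77, 21, "FAIL"),
]

def knn_predict(temp, speed, k=10):
    # 1. find the k closest past jobs (plain Euclidean distance)
    nearest = sorted(history,
                     key=lambda job: math.dist((temp, speed), (job[0], job[1])))[:k]
    # 2. majority vote among their outcomes. That's the whole algorithm.
    votes = Counter(outcome for *_, outcome in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(82, 14))   # 7 PASS vs 3 FAIL among the 10 neighbours → "PASS"
```

Notice there is nothing to train and no coefficients to store: prediction is a lookup plus a vote over `history`.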

Picking K

K = 1 means you trust whoever's standing closest, even if they're an outlier. One weird past job dominates every prediction near it. Wiggly decision boundary. High variance.

K = 100 means you average over basically your whole dataset. Too smooth to catch local patterns. Blurry decision boundary. High bias.

The right K is somewhere in between, and yes — it's a bias-variance tradeoff. The U-shape from Lesson 6 shows up here too. Sound familiar?

[Diagram: KNN with different K, different boundary. K = 3: 2 blue, 1 orange → BLUE. K = 10: 5 blue, 4 orange → BLUE. K = 25: 5 blue, 4 orange → BLUE. Same point, three Ks; different neighbors get a vote.]

The black cross is the new job. Find K nearest. Take a vote. That's the entire algorithm.

Try it

Click to add data. Slide K.

You're predicting "will this plating job pass QC?" from two setup parameters.


Background color = what KNN would predict at every point. Click anywhere to add a labeled job and watch the boundary react. Try the K=1 preset, then K=25 — same data, totally different model.

Same idea elsewhere

Vector databases doing semantic search are KNN at scale. "Find the K embeddings closest to my query" is literally KNN with cosine distance. Every RAG pipeline, every recommendation engine, every AI agent's memory lookup — KNN under the hood.

On Monday at Lincoln

KNN works great for small datasets. If you have 200 plating jobs and want to predict which will fail QC, KNN is a perfectly defensible first pass. No fancy infrastructure needed.

The catch: KNN gets slow with millions of rows (it has to search everything every time). For Lincoln-scale data, that's a non-issue.

Lesson 7 recap

  • KNN: find the K nearest past examples, take a majority vote. The model is the data.
  • Small K = wiggly boundary, high variance. Large K = smooth boundary, high bias. Same U-shape.
  • Perfect first algorithm for small Lincoln-sized datasets. Defensible, simple, surprisingly accurate.
Explain it to a coworker

In your own words — how would you explain KNN to someone who's never seen ML? Save locally; have AI review for honest critique.

★ Capstone · ~30 minutes

Predictive Maintenance for Lincoln plating lines

The thing you can carry into work tomorrow.

Watch this first — 65-second primer for what's below.

Time to put it all together. You're going to build a predictive maintenance model for Lincoln's plating lines using everything from the last seven lessons. No coding. Just judgment.

The problem

50 machines on the floor. Some will fail in the next 7 days, some won't. We have 8 features per machine — age, cycles since service, vibration, bath temp drift, operator skill, part complexity, customer tier, and one decoy that looks important but isn't. Build a KNN classifier that predicts failure on a held-out test set.

What you'll do

Step 1. Pick which features go into the model. Some help. One is a trap.

Step 2. Pick K. (Remember Lessons 6 and 7 — find the U-shape sweet spot.)

Step 3. Watch test accuracy and the confusion matrix update live as you tweak.

Step 4. When you're happy, generate the spec sheet — your shippable artifact.

Why it matters

If you can flag a failure 7 days before it happens, you schedule maintenance during a planned slot instead of an emergency stop. That's literally Planning + Cost Savings + Operations all at once. It's the kind of model leadership notices.

Capstone · Build it

Predict which Lincoln plating lines fail in 7 days

Pick features. Pick K. See test accuracy. Then ship a spec sheet.

              Pred Fail | Pred OK
Actual Fail      –      |   –
Actual OK        –      |   –

High True Positive + low False Negative = catching failures. That's the ballgame.

Aim for accuracy ≥ 0.80 on the test set. Try removing the decoy. Try different Ks. The best feature set + K combo isn't always obvious.
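If you want the same loop outside the widget, here is a stdlib-only sketch. Everything in it is invented: 50 synthetic machines with three meaningful features plus one pure-noise decoy, a KNN vote, and a held-out score. It is a stand-in for the capstone data, not the capstone itself:

```python
import math
import random
from collections import Counter

random.seed(9)

# Hypothetical stand-in data: 50 machines, three real features
# (age, cycles, vibration) plus one pure-noise decoy column.
# Failure here is driven by vibration and cycles; the decoy carries no signal.
def machine():
    age, cyc, vib, decoy = (random.random() for _ in range(4))
    label = "FAIL" if vib + 0.2 * cyc > 0.8 else "OK"
    return [age, cyc, vib, decoy], label

rows = [machine() for _ in range(50)]
train, test = rows[:40], rows[40:]        # hide 20%: the honest holdout

def accuracy(features, k):
    def predict(x):
        def dist(row):
            return math.dist([x[i] for i in features],
                             [row[0][i] for i in features])
        # KNN: vote among the k nearest training machines
        votes = Counter(lbl for _, lbl in sorted(train, key=dist)[:k])
        return votes.most_common(1)[0][0]
    return sum(predict(x) == lbl for x, lbl in test) / len(test)

# Steps 1 and 2: choose features and K. Step 3: score on the held-out machines.
print(accuracy([0, 1, 2, 3], k=5))   # with the decoy included
print(accuracy([1, 2], k=5))         # decoy (and irrelevant age) dropped
```

Swapping feature subsets and k, then re-scoring on the holdout, is exactly the capstone loop: steps 1 through 3 with no infrastructure at all.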

Same idea elsewhere

Every observability platform that "predicts incidents before they happen" is doing this exact loop. Pick features. Pick a model. Validate on past data. Ship. The flavor differs; the playbook doesn't.

Knowledge Return · From Lesson 6
Q: What shape does test MSE make as you crank flexibility?
A: A U-shape. Bias falls fast, variance climbs, total error drops, bottoms out, then climbs back up. Tuning K in this capstone is hunting for the bottom of the U.
After this — the real version

Pull a real spreadsheet from work. Same 50ish rows, your real features, your real failed_within_7d column.

Run it through any AutoML tool — Vertex AI, H2O, SageMaker Autopilot, DataRobot — using what you learned about features and bias-variance. You can pick the right model. You can defend it. You're ready.

This is what the rest of the catalogue ships you toward.

Capstone recap — what you just did

  • You picked features deliberately (and learned to spot decoys that hurt accuracy).
  • You tuned K for the bias-variance sweet spot — same U-shape from Lesson 6.
  • You evaluated on a held-out test set — the only honest metric.
  • You generated a printable spec sheet — your shippable artifact for leadership.
Explain it to a coworker

In your own words β€” how would you explain the predictive maintenance loop you just built (pick features, pick K, score, ship)? Save locally; have AI review for honest critique.
