
How a Data Scientist Thinks in the Real World and Why Most People Fail at Machine Learning

Based on 2 real-world projects applied in industry

15 min read · Undergraduate · Graduate

Starting point · The problem nobody teaches you

If you've studied machine learning, you probably know the basics: regression, classification, a few models, maybe some Python.

Knowing algorithms doesn't mean knowing how to do machine learning.

In real projects — in industry, business, or tech — the problem never arrives as: "apply a classification model here."

It arrives as something far more ambiguous:

  • "we want to reduce losses"
  • "we need to predict failures"
  • "we want to improve sales"

And that's where most people fail.

The real challenge isn't training models. The real challenge is understanding the right problem.

In real ML projects, the most expensive mistakes don't come from picking the wrong algorithm. They come from defining the problem incorrectly from the start.

In this lesson, based on 2 real-world industry projects, I'll show you how a data scientist thinks — and how you can start thinking that way too.

Context · Base projects

The examples in this lesson draw from two real-world projects.

Project 1 — Detecting losses from irregular consumption (energy fraud / theft)

  • ML translation: supervised classification (suspicious behavior detection)
  • Target: binary indicator — 0 = normal, 1 = fraud
  • Metric: Recall
  • Critical error: false negative — failing to detect fraud

Project 2 — Preventing failures in critical equipment (predictive maintenance)

  • ML translation: classification (will fail / won't) or regression (time to failure — TTF)
  • Target: probability of failure or remaining time
  • Metric: Recall for the classification framing; MAE or RMSE for the regression framing
  • Critical error: false negative — failing to anticipate a breakdown

Framework · The most common mistake when implementing Machine Learning

After advising on these projects, the most frequent mistake follows this pattern:

"I have data → I'm going to apply a model"

The problem is that this approach is completely backwards.

The correct framework is:

  1. What decision needs to be made?
  2. What variable am I predicting?
  3. What type of problem is this?
  4. What metric actually matters?
  5. Which errors are acceptable?

If you start with the model, you've already started wrong.

Because if you frame the problem incorrectly, you'll choose the wrong model, optimize for the wrong metric, and make useless decisions — even if your model appears to "work."
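One way to keep yourself honest about the five questions is to write the answers down before touching a model. The sketch below is only an illustration: the `ProblemFraming` class and its field names are hypothetical, not part of either project.

```python
from dataclasses import dataclass, fields


@dataclass
class ProblemFraming:
    """Answers to the five framing questions (hypothetical helper)."""
    decision: str
    target: str
    problem_type: str
    metric: str
    acceptable_errors: str

    def missing(self):
        # Names of the questions that were left blank.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Framing for Project 1, using the answers listed in this lesson.
energy = ProblemFraming(
    decision="Prioritize which service accounts to inspect first",
    target="theft (1 = fraud, 0 = normal)",
    problem_type="supervised classification",
    metric="recall",
    acceptable_errors="false negatives critical; false positives tolerable up to a point",
)

# A typical "I have data, I'm going to apply a model" draft.
draft = ProblemFraming(
    decision="",
    target="consumption",
    problem_type="",
    metric="accuracy",
    acceptable_errors="",
)

print(energy.missing())  # []
print(draft.missing())   # ['decision', 'problem_type', 'acceptable_errors']
```

If `missing()` returns anything, the framing isn't finished and it's too early to pick a model.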

Objectives · What you'll learn in this lesson

By the end, you should be able to:

  • Translate a business problem into a machine learning problem
  • Correctly identify the type of problem (classification, regression, etc.)
  • Define the target variable
  • Choose a metric that makes sense in the real world

The approach is: first I show you how I solved the first problem, then I leave the second one as an exercise for you.


Guided Exercise · Translating a real problem into Machine Learning

Exercise · 10–15 min

Problem 1: Energy Company

Context: An energy company is facing financial losses due to illegal connections and meter tampering. The leadership team wants to use data to identify these cases proactively, but there's still no clear ML framing in place.

Your goal: take an ambiguous problem and turn it into something you can model. That takes four skills: translating from business to ML, structured thinking, decision-making, and impact assessment.

Step 1 — What decision needs to be made?

The decision is not "predict fraud."

The real decision is: determine which cases are worth reviewing or acting on first.

A data scientist thinks this way: the model isn't the goal — the goal is to take an action.

The question to ask is: if the model gives me a prediction, what do I do with it?

  • If a case looks suspicious → it gets reviewed
  • If it doesn't → it's deprioritized

Conclusion: the decision is to prioritize which cases to inspect, using data to focus resources where the probability of a problem is highest.
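To make "prioritize which cases to inspect" concrete, here is a minimal sketch. It assumes the model has already produced a fraud score per account and that the field team can only handle a fixed number of inspections. The account ids, scores, and the `prioritize` function are made up for illustration.

```python
def prioritize(scores, capacity):
    """Return the account ids with the highest scores, up to the inspection capacity."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [account for account, _ in ranked[:capacity]]


# Hypothetical model scores: higher means more suspicious.
scores = {"A-101": 0.12, "A-102": 0.87, "A-103": 0.45, "A-104": 0.91, "A-105": 0.05}

# The team can inspect two accounts this week.
to_inspect = prioritize(scores, capacity=2)
print(to_inspect)  # ['A-104', 'A-102']
```

Notice that the output is a work list, not a prediction. That is the difference between "predict fraud" and "decide what to review first."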

Step 2 — What variable am I predicting?

This is where many people go wrong. Someone might jump to say: "the target is consumption." That's incorrect.

The data scientist asks: what decision does the business want to make?

Answer: identify whether a service account is suspicious or not.

The correct target is:

Theft  (1 = fraud,  0 = normal)

Key insight: the target doesn't always come pre-packaged in the data. It's defined based on the business decision.
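As a sketch of what "defining the target" can look like, suppose the company keeps a log of past field inspections. The record format and outcome names below are assumptions for illustration; the real data may look quite different.

```python
# Hypothetical inspection log: one record per inspected service account.
inspections = [
    {"account": "A-101", "outcome": "no_issue"},
    {"account": "A-102", "outcome": "meter_tampering"},
    {"account": "A-103", "outcome": "illegal_connection"},
    {"account": "A-104", "outcome": "no_issue"},
]

# Outcomes the business counts as theft (an assumption based on the context above).
FRAUD_OUTCOMES = {"meter_tampering", "illegal_connection"}


def label_theft(record):
    """Map an inspection outcome to the binary target: 1 = fraud, 0 = normal."""
    return 1 if record["outcome"] in FRAUD_OUTCOMES else 0


targets = {record["account"]: label_theft(record) for record in inspections}
print(targets)  # {'A-101': 0, 'A-102': 1, 'A-103': 1, 'A-104': 0}
```

The list of outcomes that count as fraud is a business definition, which is why the target can't simply be read off a column.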

Step 3 — What type of problem is this?

At first glance someone might say: "it's a classification problem." But an experienced data scientist doesn't start there.

They start by asking: what is the business actually trying to solve?

  • There are financial losses that aren't directly visible
  • There's irregular behavior that can be detected

This is an anomaly / fraud detection problem.

Since we have historical examples of fraud and normal behavior, it becomes supervised classification.

3.1 — What model should I use?

Most students start with: "I'm going to use X model." That's a mistake.

First, define the context:

  • Data type: tabular
  • Size: moderate
  • Noise: high
  • Features: aggregated

A reasonable first choice is RandomForestClassifier, not because it's trendy, but because the data is tabular and noisy, the relationships are non-linear, and interpretability matters (feature importances show which signals the model relies on).

This is a context-driven decision, not a theory-driven one.
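Here is a minimal sketch of that choice using scikit-learn. The data is synthetic (generated with `make_classification` as a stand-in for noisy, imbalanced tabular data), and the settings such as `n_estimators=100` and `class_weight="balanced"` are assumptions for illustration, not the configuration used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 5 aggregated features, about 10% fraud, some label noise.
X, y = make_classification(
    n_samples=600,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    weights=[0.9, 0.1],
    flip_y=0.05,
    random_state=0,
)

# Stratify so that the rare fraud class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" gives the rare class more weight during training.
model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)

recall = recall_score(y_test, model.predict(X_test))
importances = model.feature_importances_

print(f"recall on held-out data: {recall:.2f}")
print("feature importances:", importances.round(2))
```

The feature importances are a rough, global view of which signals the forest relies on. They don't explain individual predictions.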

3.2 — What variables (features) should I use?

A data scientist doesn't use columns "as-is."

They ask: what signals indicate suspicious behavior? Then they engineer variables like:

  • Average consumption
  • Variability
  • Minimum and maximum values
  • Abrupt changes
  • Trend

Key insight: we're not modeling consumption. We're modeling behavior.
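The signals listed above can be sketched as a small function that turns a series of monthly readings into behavior features. The consumption numbers and the `behavior_features` function are invented for illustration, and the exact features used in the project may differ.

```python
from statistics import mean, pstdev


def behavior_features(monthly_kwh):
    """Summarize a consumption history as behavior signals."""
    n = len(monthly_kwh)
    changes = [later - earlier for earlier, later in zip(monthly_kwh, monthly_kwh[1:])]

    # Trend: least-squares slope of consumption against the month index.
    x_mean = (n - 1) / 2
    y_mean = mean(monthly_kwh)
    slope = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(monthly_kwh)) / sum(
        (i - x_mean) ** 2 for i in range(n)
    )

    return {
        "mean": y_mean,
        "std": pstdev(monthly_kwh),
        "min": min(monthly_kwh),
        "max": max(monthly_kwh),
        "largest_drop": min(changes),  # most negative month-to-month change
        "trend": slope,
    }


steady = [300, 310, 295, 305, 300, 298]
suspicious = [300, 310, 295, 120, 110, 105]  # abrupt, sustained drop

steady_features = behavior_features(steady)
suspicious_features = behavior_features(suspicious)

print(steady_features["largest_drop"])      # -15
print(suspicious_features["largest_drop"])  # -175
```

Both accounts start at the same consumption level. What separates them is the abrupt change and the trend, which is behavior rather than raw consumption.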

Step 4 — What metric should I use?

This is the most important point.

A data scientist asks: which error is more costly?

  • Missing fraud → financial loss
  • Flagging a false positive → unnecessary inspection

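Recall should be prioritized: of all the real fraud cases, what fraction does the model catch? Precision still needs watching, because every false positive costs an inspection.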

The metric isn't chosen based on the model. It's chosen based on business impact.
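A small worked example shows why accuracy can hide the error that matters. The labels below are invented: 2 fraud cases out of 10 accounts, scored against two hypothetical strategies.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

always_normal = [0] * 10                       # never flags anything
over_flagging = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # catches both, plus 2 false alarms

print(accuracy(y_true, always_normal), recall(y_true, always_normal))  # 0.8 0.0
print(accuracy(y_true, over_flagging), recall(y_true, over_flagging))  # 0.8 1.0
print(precision(y_true, over_flagging))                                # 0.5
```

Both strategies score the same accuracy, yet one misses every fraud case and the other catches all of them. Only recall reveals the difference.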

Step 5 — Which errors are acceptable?

This rarely gets taught, but it's critical for any real-world project.

False negative — Critical

You miss fraud → you lose money → operational risk. Not acceptable.

False positive — Tolerable

Unnecessary inspection → minor cost. Acceptable up to a point.
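"Acceptable up to a point" can be made concrete by attaching a cost to each error and comparing decision thresholds. The costs, scores, and thresholds below are hypothetical numbers chosen only to illustrate the trade-off.

```python
def total_cost(y_true, scores, threshold, cost_missed, cost_inspection):
    """Cost of missed fraud plus cost of unnecessary inspections at a threshold."""
    missed = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    wasted = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return missed * cost_missed + wasted * cost_inspection


# Hypothetical labels (1 = fraud) and model scores for 8 accounts.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.35, 0.5, 0.4, 0.2, 0.1, 0.05]

# Assumed costs: a missed fraud is 20 times more expensive than a wasted visit.
COST_MISSED_FRAUD = 1000
COST_INSPECTION = 50

thresholds = [0.0, 0.3, 0.5, 0.7]
costs = {
    th: total_cost(y_true, scores, th, COST_MISSED_FRAUD, COST_INSPECTION)
    for th in thresholds
}
best = min(costs, key=costs.get)

print(costs)  # {0.0: 250, 0.3: 100, 0.5: 1050, 0.7: 2000}
print(best)   # 0.3
```

With these numbers, a low threshold wins because missed fraud dominates the cost. Flagging everyone (threshold 0.0) is still worse than 0.3, which is where false positives stop being tolerable.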

"I'd rather over-flag than miss a fraud case."

This problem isn't about models. It's about: understanding the business, defining the problem clearly, building useful signals, and choosing the right metrics.


Deep Work · Full reframing of a real problem

Exercise · 20–25 min

Predictive Maintenance of a SAG Mill

Context: A mining company wants to anticipate failures in critical equipment to avoid operational losses. Historical operational data, sensor readings, and records of past failures are available.

Available data: periodic measurements of wear on critical components over time, with variations in operating conditions and progressive changes in system state.

Important: don't start by thinking about models. First understand what you want to predict and what decision you'll make with it.

Question 1 — Business-level problem

What is the goal? What decision do you want to make with the data?

Strong answer:

Understand how wear evolves over time in order to anticipate critical conditions and make timely decisions about maintenance or intervention.

Weak answer:

"Analyze the wear in the data" — describes data, not decisions.

Evaluation criteria:

  • High: you clearly define what decision needs to be made
  • Medium: you mention the goal but without a clear decision
  • Low: you describe the data instead of the decision to be taken

Question 2 — Type of problem

What are you trying to do? Prediction, detection, estimation, segmentation, behavior analysis...

Strong answer:

Prediction / estimation of continuous behavior (degradation over time).

Weak answer:

"It's a machine learning problem" — too generic, says nothing.

Evaluation criteria:

  • High: you describe the correct type plus the temporal context
  • Medium: you pick the right category but without clarity
  • Low: you choose the wrong category

Question 3 — Machine Learning problem

What are you modeling? What are you trying to predict or estimate?

Strong answer:

Model how wear evolves over time in order to estimate its future behavior.

Weak answer:

"Predict data" — vague, no context.

Question 4 — Target variable

How is it represented in data? Is it a category or a numerical value?

Strong answer:

A numerical variable representing wear or degradation rate (e.g., height loss or wear rate per unit of time).

Weak answer:

"Wear" — undefined and not measurable.

Question 5 — Type of approach

Classification, regression, or something else? Justify your answer.

Strong answer:

Regression, because the goal is to estimate a continuous variable (wear level or rate), not a category.
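A minimal sketch of the regression framing: fit a straight line to wear measurements over time and estimate when a wear limit would be reached. The measurements, the limit, and the assumption that wear is linear in time are all made up for illustration. Real degradation is often non-linear and depends on operating conditions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical periodic measurements: days in service vs. height loss in mm.
days = [0, 10, 20, 30, 40]
wear_mm = [0.0, 2.1, 3.9, 6.0, 8.0]

slope, intercept = fit_line(days, wear_mm)

# Assumed intervention limit (hypothetical value).
WEAR_LIMIT_MM = 20.0
days_to_limit = (WEAR_LIMIT_MM - intercept) / slope

print(f"wear rate: {slope:.3f} mm/day")
print(f"estimated day the limit is reached: {days_to_limit:.0f}")
```

The target here is a number (mm of wear), and the output feeds a decision about when to schedule maintenance. A category like "will fail / won't fail" would throw that timing information away.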

Weak answer:

"Classification" — without analyzing the type of target variable.

Question 6 — Metric

What metric would you use and why?

Strong answer:

MAE or RMSE, because the goal is to measure how far predictions are from the actual value in terms of magnitude.
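Both metrics are easy to compute by hand, which helps show how they differ. The wear values below are invented for illustration.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical wear in mm: three small misses and one large one.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.0, 11.0]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)

print(mae_value)              # 1.0
print(round(rmse_value, 3))   # 1.541
```

Both are expressed in the same unit as the target (mm here). RMSE comes out higher because the single 3 mm miss is squared, so prefer it when large errors are disproportionately costly.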

Weak answer:

"Accuracy" — doesn't apply to this type of problem.

Question 7 — Critical errors

Which error is more serious? What consequences would it have?

Strong answer:

Underestimating wear, because it can lead to delayed decisions and increase the risk of unanticipated failures.
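MAE and RMSE treat over- and underestimates the same way. One way to reflect that underestimating wear is worse is an asymmetric error that weights under-predictions more heavily. This is a sketch of one possible approach, and the weight of 3 is an arbitrary assumption.

```python
def asymmetric_error(y_true, y_pred, under_weight=3.0):
    """Mean absolute error where under-predictions count `under_weight` times more."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = p - t
        if diff < 0:
            # Predicted less wear than there really was: the dangerous direction.
            total += under_weight * (-diff)
        else:
            total += diff
    return total / len(y_true)


# Hypothetical wear in mm. Both predictions miss by exactly 1 mm.
actual = [5.0, 5.0, 5.0]
overestimate = [6.0, 6.0, 6.0]
underestimate = [4.0, 4.0, 4.0]

over_error = asymmetric_error(actual, overestimate)
under_error = asymmetric_error(actual, underestimate)

print(over_error)   # 1.0
print(under_error)  # 3.0
```

Both predictions have the same MAE of 1.0, but the asymmetric version flags the underestimate as three times worse, which matches the business view that a late intervention is the costly mistake.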

Weak answer:

"All errors are equal" — completely disconnected from reality.

Question 8 — Problem risks

Name at least 2–3 risks.

Strong answer:

  • Noisy or inconsistent data
  • Variability in operating conditions
  • Measurement errors
  • Incomplete data

Weak answer:

"There are no risks" — completely disconnected from reality.

Evaluation criteria:

  • High: realistic risks relevant to the context
  • Medium: generic risks with no connection to the project
  • Low: no risks identified

Optional Exercise · Feynman Technique: Explain it without technical terms

Exercise · 5–10 min

Explain what you just did

Explain your solution as if you were talking to someone with no technical background. Use these guiding questions:

  • What's happening at the company?
  • What are you trying to predict?
  • How does that help with decision-making?

Important rule: avoid unnecessary technical terms, formulas, and words like "model" or "algorithm" without explaining them first.

If you can explain it simply, clearly, and coherently, you truly understand it.

Wrap-up · Active Recall

Exercise · 5 min

Active retrieval — no notes allowed

Question 1 — What are the 5 steps a Data Scientist follows?

Write them in order.

Question 2 — What's the difference between a business problem and an ML problem?

Explain it in one clear sentence.

Question 3 — How do you choose the right metric?

Give a concrete criterion.