How to Define the Problem in a Machine Learning Project
The first step when you don't know where to start
Most people go straight to the model. And that's why they get stuck. Defining the problem well is what gives coherence to everything that follows.
If you've been spending hours looking for data, models, or examples and still don't know where to start, it's because you haven't yet defined what you want to solve.
When you define your problem well, you lay the structural foundations for everything that follows: what you will predict, what data you will need, what model you can use, and how you will evaluate whether the result actually makes sense.
Defining the problem is the first thing we do when designing machine learning projects, because it gives the entire project coherence, reduces the risk of failure, and prevents wasted time.
I've seen this in both industry and academic projects. For example, in a project for the cement industry related to raw mix design, the team told me they wanted to "optimize the amount of gypsum" to "optimize cost." The objective was not wrong, but the problem was not sufficiently defined.
The team had started with code, without knowing which model was most appropriate and without a global success metric for the project. As a result, they were just trying things without reaching a useful result.
That's why I suggested taking a step back: before modeling, we had to work directly with the quality engineer to understand which variable actually represented "optimize," and under what conditions. That process completely changed how the problem needed to be framed as a machine learning task.
How to define a machine learning problem
To define the problem we use the problem framing method. This method translates a general project idea into an ML task with a target variable, available data, and a clear metric.
The steps are:
- General project idea: it can come from an article you've read, a business context, or something you've heard about.
- Deep understanding of context: who has the difficulty? What solutions have been proposed? What environment surrounds the problem?
- Problem definition: the statement that turns your idea into something technically solvable. This is where you define what it means to solve the problem. When this definition is clear, the rest of the project flows with coherence.
- Hypothesis and proposed solution: once the problem is defined, it makes sense to propose a model. Not before.
How to identify the general idea of your machine learning project
This is the first step, and apparently the simplest.
Project ideas usually arise from many sources: your own reasoning, experience, curiosity, or even necessity. In an academic context, ideas can come from scientific articles or other students' work. In a professional environment, they emerge from trying to improve a process, solve a problem, or reduce inefficiencies.
But even though this step seems simple, it is not trivial. Good ideas must have impact. Whether you aim to meet the standards of an academic committee or generate real value in a business, the project needs to justify its relevance.
If you don't yet have clarity about that impact, that's fine. The process of defining the problem will help you ground and refine the idea. And something important: this process is not linear. It is normal to move forward, step back, and adjust as you better understand the problem. A good idea is not born perfect — it is built.
Deep understanding of context for defining your machine learning problem
This is where the real work begins.
Deep understanding of context is what gives us the tools to correctly define the problem. Without context, there is no clarity. And without clarity, the model, the data, or the architecture end up answering poorly framed questions.
That context depends on the type of project. In a business environment, it means understanding the user, their tools, and their objectives. As Jeremy Jordan suggests, start from the user's perspective (what they are doing, what they want to achieve, and what difficulties they encounter) before trying to define the problem itself.
In the cement industry project, the user was non-technical and worked primarily in Excel to design the first mix prototypes. From the business side, the objective was to optimize gypsum usage to reduce costs without compromising strength. With that context, it was easier to define the problem clearly.
In an academic setting, context is built differently. Here you need to review the literature, identify the research gap, and rely on your advisors. What matters is being clear about what has been done and what remains unresolved.
How to write your machine learning problem definition
A good definition does not describe an idea. It describes a clear action. An actionable statement makes clear what you want to predict, with what data, and how you will measure whether it works.
In simple terms, your definition must include three things:
- The target variable.
- The available data.
- The evaluation metric.
Without these three elements, the problem remains ambiguous. According to the Google Developers guide, framing the problem well means translating a need into a concrete task that a model can learn. It's not about saying "I want to predict something"; it's about defining exactly what output you expect and in what context it will be used.
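As a minimal sketch of that checklist, the three elements can be written down as a small structure that reports what is still undefined. The class name `ProblemDefinition` and its fields are hypothetical, introduced here only for illustration, and the example values come from the cement case discussed in this article.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemDefinition:
    """Hypothetical container for the three elements of a problem definition."""

    target_variable: str = ""
    available_data: list[str] = field(default_factory=list)
    evaluation_metric: str = ""

    def missing_elements(self) -> list[str]:
        """Return the names of the elements that are still undefined."""
        missing = []
        if not self.target_variable.strip():
            missing.append("target_variable")
        if not self.available_data:
            missing.append("available_data")
        if not self.evaluation_metric.strip():
            missing.append("evaluation_metric")
        return missing

    def is_actionable(self) -> bool:
        """A definition is actionable only when nothing is missing."""
        return not self.missing_elements()


# "I want to predict something": nothing is pinned down yet.
vague = ProblemDefinition()

# A framed version of the cement example.
framed = ProblemDefinition(
    target_variable="28-day compressive strength",
    available_data=["clinker", "gypsum", "limestone", "pozzolan"],
    evaluation_metric="MAE",
)

print(vague.missing_elements())  # ['target_variable', 'available_data', 'evaluation_metric']
print(framed.is_actionable())    # True
```

Writing the definition down this way is only a convenience. The point is that an empty field is visible before any modeling starts.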
The business objective is not the same as the ML problem
Business speaks in terms of impact: reduce costs, improve quality, avoid waste. The ML problem speaks in terms of prediction: estimate a variable, classify a result, detect a pattern. Translating one into the other is the real work.
Following the cement industry case: "reduce costs" is not an ML problem. But "estimate the minimum amount of gypsum needed to meet a target strength" is.
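One way to read that formulation in code, purely as a sketch: assume some strength predictor already exists, then scan candidate gypsum amounts from low to high and keep the first one that reaches the target. The function `toy_strength_model` is an invented placeholder and not a real cement relationship. The candidate values, units, and target are also assumptions.

```python
from typing import Callable, Optional, Sequence


def minimum_gypsum(
    predict_strength: Callable[[float], float],
    candidates: Sequence[float],
    target_strength: float,
) -> Optional[float]:
    """Return the smallest candidate whose predicted strength meets the target.

    Returns None when no candidate reaches the target.
    """
    for gypsum in sorted(candidates):
        if predict_strength(gypsum) >= target_strength:
            return gypsum
    return None


def toy_strength_model(gypsum_pct: float) -> float:
    """Invented placeholder predictor; NOT a real cement relationship."""
    return 30.0 + 4.0 * gypsum_pct


# Hypothetical candidate gypsum percentages and target strength.
candidates = [2.0, 2.5, 3.0, 3.5, 4.0]
best = minimum_gypsum(toy_strength_model, candidates, target_strength=43.0)
unreachable = minimum_gypsum(toy_strength_model, candidates, target_strength=100.0)

print(best)         # 3.5
print(unreachable)  # None
```

Notice that "reduce costs" never appears in the code. What appears is a target strength, a quantity to estimate, and a rule for choosing it, which is what makes the statement solvable.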
Choosing the target variable is the most important decision
Choosing the target variable is not a technical detail. It is a design decision that defines everything else. For example, in the cement industry context you can define the problem in different ways:
- Predict compressive strength from cement composition. That is regression.
- Predict whether a mix meets a minimum strength threshold or not. That is classification.
The business objective does not change. But the ML problem does. Amazon Web Services documentation shows exactly this: a single objective can have multiple formulations, and each one leads to different decisions about data, model, and evaluation.
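The two formulations above can be illustrated with the same records. In this sketch the measurements, the MPa unit, and the 42.5 threshold are invented for illustration. Only the choice of target differs between the two lists.

```python
# Hypothetical lab records: (mix id, measured 28-day strength in MPa).
# The values are invented for illustration only.
records = [("mix_a", 44.1), ("mix_b", 39.5), ("mix_c", 42.5), ("mix_d", 47.8)]

MIN_STRENGTH_MPA = 42.5  # assumed minimum strength threshold

# Regression target: the strength value itself.
regression_targets = [strength for _, strength in records]

# Classification target: 1 if the mix meets the threshold, 0 otherwise.
classification_targets = [
    int(strength >= MIN_STRENGTH_MPA) for _, strength in records
]

print(regression_targets)      # [44.1, 39.5, 42.5, 47.8]
print(classification_targets)  # [1, 0, 1, 1]
```

The data did not change. The decision about what to predict did, and with it the kind of model and metric that make sense.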
Business example
| Element | Formulation A | Formulation B |
|---|---|---|
| Business objective | Reduce costs in cement production | Reduce costs in cement production |
| ML problem | Estimate the minimum amount of gypsum for a target strength | Predict 28-day compressive strength |
| Target variable | Gypsum quantity | Compressive strength |
| Data | Mix composition, production conditions, lab results | Clinker, gypsum, limestone, pozzolan, and other components |
| Metric | Deviation from target strength | RMSE or MAE in strength prediction |
In both cases the context is the same. What changes is what you decide to predict. A good problem statement is not long. It is clear. And above all, it forces decisions to be made.
Hypothesis and initial solution in an ML project
Once the problem is well defined, it now makes sense to talk about models. Before that, any attempt is guesswork.
This is where the technical hypothesis comes in: an informed assumption about how to solve the problem with machine learning. It is not the final model. It is a reasonable first bet that connects what you want to predict with the data you have.
A well-formulated technical hypothesis answers something like: with the available data, we can predict this variable using this approach and evaluate the result with this metric.
The initial solution must be bounded. It does not aim to be perfect. It aims to validate whether you're heading in the right direction.
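A bounded first version can be as small as the sketch below: a trivial baseline that always predicts the training mean, compared against a one-feature least-squares line using the same metric. All numbers are invented, the single feature is a stand-in for whatever data you have, and the least-squares line is just one possible first bet, not a recommended model.

```python
# Invented toy data: one input feature and one target.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.2, 7.8]
test_x = [5.0, 6.0]
test_y = [10.1, 11.9]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Baseline: always predict the mean of the training target.
mean_y = sum(train_y) / len(train_y)
baseline_predictions = [mean_y for _ in test_x]

# First hypothesis: a straight line fitted by ordinary least squares.
mean_x = sum(train_x) / len(train_x)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
s_xx = sum((x - mean_x) ** 2 for x in train_x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
model_predictions = [intercept + slope * x for x in test_x]

baseline_mae = mean_absolute_error(test_y, baseline_predictions)
model_mae = mean_absolute_error(test_y, model_predictions)

print(f"baseline MAE: {baseline_mae:.2f}")
print(f"model MAE:    {model_mae:.2f}")
print("hypothesis worth pursuing:", model_mae < baseline_mae)
```

If the first bet cannot beat a baseline this simple, that is a signal to revisit the data or the problem definition before trying a more complex model.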
Example: the cement industry case
With the business objective already clear (reduce costs by optimizing gypsum usage without losing strength), you can formulate different hypotheses:
- Regression: directly predict compressive strength from the cement's chemical composition.
- Classification: predict whether a mix meets a minimum strength level or not.
- Clustering: explore patterns in the mixes without using labels, to understand similar behaviors between formulations.
The context is the same. The business is the same. But the machine learning problem changes completely depending on which hypothesis you choose.
You don't need to solve everything from the start. You need a first version that allows you to learn quickly. The paper "Matching Problems to Solutions: An Explainable Way of Solving Machine Learning Problems" proposes a clear sequence: you understand the problem, define how to represent it, and then choose a solution consistent with that formulation. That order avoids building models disconnected from the real objective.
Why your model is not improving even though the code is correct
If your model is not improving, the problem is often not in the code. It's in how you defined the problem.
Metric misaligned with the real objective
One of the most common mistakes is optimizing something that does not represent the real objective. In the cement industry case, you might minimize the average prediction error for strength. But if you fail on the cases where strength falls below the acceptable minimum, the model is useless in operation. The technical metric was not aligned with what the business actually needed.
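The gap between the two views can be made concrete with a small sketch. The strength values, the MPa unit, and the 42.5 minimum are invented for illustration. The code computes the average error and then checks separately how the model behaves on the mixes that actually fall below the minimum.

```python
MIN_STRENGTH_MPA = 42.5  # assumed acceptable minimum

# Invented measurements and predictions, in MPa.
actual = [45.0, 46.0, 44.0, 41.5, 47.0]
predicted = [45.5, 45.5, 44.5, 43.0, 46.5]

# Technical metric: mean absolute error over all mixes.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Operational view: mixes that really fail the minimum...
failing = [i for i, a in enumerate(actual) if a < MIN_STRENGTH_MPA]

# ...and the subset of those that the model predicted as passing.
missed = [i for i in failing if predicted[i] >= MIN_STRENGTH_MPA]

miss_rate = len(missed) / len(failing) if failing else 0.0

print(f"MAE: {mae:.2f} MPa")
print(f"failing mixes: {failing}, missed by the model: {missed}")
print(f"share of failing mixes missed: {miss_rate:.0%}")
```

In this toy data the average error looks small, yet the only mix below the minimum is predicted as acceptable. A metric that counts those misses is closer to what the operation needs.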
Poorly defined target
If the target variable does not properly capture what you want to solve, no model will fix that. In the same example: if you define the target as 28-day strength but the critical decision happens earlier, you are optimizing something that arrives too late. The problem was the target definition, not the algorithm.
A documented case in "Problem Formulation and Fairness" shows this clearly: a system with poor results was not failing because of the model, but because of how the target had been defined. Changing that definition changed the outcome without touching the algorithm.
This reinforces a key idea: model performance is limited by the quality of the problem formulation. If that fails, everything else fails too. And here we return to the starting point: everything begins with context and a good definition.
Frequently asked questions about defining the problem in machine learning
Why is it important to define the problem before choosing a machine learning model?
Because the problem definition determines what variable you will predict, what data you need, and how you will measure success. Without a clear definition, even a well-implemented model can produce results that are useless in practice.
What is problem framing in machine learning?
It is the method that translates a general project idea into an ML task with a target variable, available data, and a clear metric. Its steps are: general idea, deep context understanding, problem definition, and solution hypothesis.
What is the difference between a business objective and a machine learning problem?
The business objective speaks in terms of impact: reduce costs, improve quality. The ML problem speaks in terms of prediction: estimate a variable, classify a result. Translating one into the other is the real work of problem definition.
What should a good machine learning problem definition include?
A good definition must include three elements: the target variable you want to predict, the data available for training, and the evaluation metric that defines success. Without these three elements the problem remains ambiguous.
Why is my machine learning model not improving even though the code is correct?
Often the problem is not in the code but in how the problem was defined. The most common mistakes are a metric misaligned with the real objective and a poorly defined target that does not properly capture what you want to solve.
If you have a project underway and are not sure whether the problem is well defined, you can reach out to me directly. Often that is the adjustment that changes the entire result.