How to Define the Problem in a Machine Learning Project

Q: What is problem framing in machine learning?

Problem framing is the method that translates a general project idea into an ML task with a target variable, available data, and a clear metric. Its steps are: general idea, deep context understanding, problem definition, and solution hypothesis.

If you've been spending hours looking for data, models, or examples and still don't know where to start, it's because you haven't yet defined what you want to solve.

When you define your problem well, you lay the structural foundations for everything that follows: what you will predict, what data you will need, what model you can use, and how you will evaluate whether the result actually makes sense.

Defining the problem is the first thing we do when designing machine learning projects, because it gives the entire project coherence, reduces the risk of failure, and prevents wasted time.

Figure: **The problem as the starting point.** Before choosing data, models, or metrics, you need to know exactly what you want to solve.

I've seen this in both industry and academic projects. For example, in a project for the cement industry related to raw mix design, the team told me they wanted to "optimize the amount of gypsum" to "optimize cost." The objective was not wrong, but the problem was not sufficiently defined.

The team had started with code, without knowing which model was most appropriate and without a project-level success metric. As a result, they were trying things without reaching a useful result.

That's why I suggested taking a step back: before modeling, we had to work directly with the quality engineer to understand which variable actually represented "optimize," and under what conditions. That process completely changed how the problem needed to be framed as a machine learning task.

How to define a machine learning problem

To define the problem we use the problem framing method. This method translates a general project idea into an ML task with a target variable, available data, and a clear metric.

The steps are:

General project idea: it can come from an article you've read, a business context, or something you've heard about.
Deep understanding of context: who has the difficulty? What solutions have been proposed? What environment surrounds the problem?
Problem definition: the statement that turns your idea into something technically solvable. This is where you define what it means to solve the problem. When this definition is clear, the rest of the project flows with coherence.
Hypothesis and proposed solution: once the problem is defined, it makes sense to propose a model. Not before.

Figure: **The problem framing process.** How to go from a general idea to an actionable problem definition before modeling.

How to identify the general idea of your machine learning project

This is the first step, and also the simplest in appearance.

Project ideas usually arise from many sources: your own reasoning, experience, curiosity, or even necessity. In an academic context, ideas can come from scientific articles or other students' work. In a professional environment, they emerge from trying to improve a process, solve a problem, or reduce inefficiencies.

But even though this step seems simple, it is not trivial. Good ideas must have impact. Whether you aim to meet the standards of an academic committee or generate real value in a business, the project needs to justify its relevance.

If you don't yet have clarity about that impact, that's fine. The process of defining the problem will help you ground and refine the idea. And something important: this process is not linear. It is normal to move forward, step back, and adjust as you better understand the problem. A good idea is not born perfect — it is built.

Deep understanding of context for defining your machine learning problem

This is where the real work begins.

Deep understanding of context is what gives us the tools to correctly define the problem. Without context, there is no clarity. And without clarity, the model, the data, or the architecture end up answering poorly framed questions.

That context depends on the type of project. In a business environment, it means understanding the user, their tools, and their objectives. As Jeremy Jordan suggests, the problem should be defined from the user's perspective: what they are doing, what they want to achieve, and what difficulties they encounter — before trying to define the problem itself.

In the cement industry project, the user was non-technical and worked primarily in Excel to design the first mix prototypes. From the business side, the objective was to optimize gypsum usage to reduce costs without compromising strength. With that context, it was easier to define the problem clearly.

In an academic setting, context is built differently. Here you need to review the literature, identify the research gap, and rely on your advisors. What matters is being clear about what has been done and what remains unresolved.

How to write your machine learning problem definition

A good definition does not describe an idea. It describes a clear action. An actionable statement makes clear what you want to predict, with what data, and how you will measure whether it works.

In simple terms, your definition must include three things:

The target variable.
The available data.
The evaluation metric.

Without these three elements, the problem remains ambiguous. According to the Google Developers guide, framing the problem well means translating a need into a concrete task that a model can learn. It's not about saying "I want to predict something"; it's about defining exactly what output you expect and in what context it will be used.
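One way to keep yourself honest about these three elements is to write them down as a small structure and check that none is empty. The sketch below is only an illustration: the `ProblemDefinition` class and its method names are hypothetical, and the example values are taken loosely from the cement case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemDefinition:
    """Hypothetical container for the three elements of a problem definition."""
    target: str   # what you want to predict
    data: tuple   # the inputs you actually have available
    metric: str   # how you will measure whether it works

    def missing_elements(self):
        """Return the names of the elements that are still empty."""
        missing = []
        if not self.target.strip():
            missing.append("target")
        if not self.data:
            missing.append("data")
        if not self.metric.strip():
            missing.append("metric")
        return missing

    def is_actionable(self):
        """A definition is actionable only when all three elements are present."""
        return not self.missing_elements()


# "I want to predict something": nothing has been decided yet.
vague = ProblemDefinition(target="", data=(), metric="")

# A framed version of the cement example (values are illustrative).
framed = ProblemDefinition(
    target="28-day compressive strength",
    data=("clinker", "gypsum", "limestone", "pozzolan"),
    metric="MAE",
)

print(vague.missing_elements())
print(framed.is_actionable())
```

The check is deliberately trivial. Its value is that it forces you to write each decision down before you open a notebook.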

The business objective is not the same as the ML problem

Business speaks in terms of impact: reduce costs, improve quality, avoid waste. The ML problem speaks in terms of prediction: estimate a variable, classify a result, detect a pattern. Translating one into the other is the real work.

Continuing with the cement industry case: "reduce costs" is not an ML problem. But "estimate the minimum amount of gypsum needed to meet a target strength" is.

Choosing the target variable is the most important decision

Choosing the target variable is not a technical detail. It is a design decision that defines everything else. For example, in the cement industry context you can define the problem in different ways:

Predict compressive strength from cement composition. That is regression.
Predict whether a mix meets a minimum strength threshold or not. That is classification.

The business objective does not change. But the ML problem does. Amazon Web Services documentation shows exactly this: a single objective can have multiple formulations, and each one leads to different decisions about data, model, and evaluation.
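A short sketch can show how the same records give rise to two different ML problems. Everything here is invented for illustration: the records, the gypsum shares, the strength values, and the `MIN_STRENGTH` threshold are assumptions, not data from the project.

```python
# Hypothetical lab records: (gypsum share of the mix, 28-day strength).
# Values and units are invented for illustration only.
records = [
    (0.03, 38.5),
    (0.04, 42.1),
    (0.05, 44.0),
    (0.06, 41.2),
]

MIN_STRENGTH = 42.0  # assumed acceptance threshold, not from the project

# Regression formulation: the target is the strength value itself.
regression_targets = [strength for _, strength in records]

# Classification formulation: the target is pass/fail against the threshold.
classification_targets = [strength >= MIN_STRENGTH for _, strength in records]

print(regression_targets)
print(classification_targets)
```

The inputs are identical in both cases. Only the target column changes, and with it the family of models and the metrics that make sense.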

Business example

Element	Formulation A	Formulation B
Business objective	Reduce costs in cement production	Reduce costs in cement production
ML problem	Estimate the minimum amount of gypsum for a target strength	Predict 28-day compressive strength
Target variable	Gypsum quantity	Compressive strength
Data	Mix composition, production conditions, lab results	Clinker, gypsum, limestone, pozzolan, and other components
Metric	Deviation from target strength	RMSE or MAE in strength prediction

In both cases the context is the same. What changes is what you decide to predict. A good problem statement is not long. It is clear. And above all, it forces decisions to be made.

Hypothesis and initial solution in an ML project

Only once the problem is well defined does it make sense to talk about models. Before that, any attempt is guesswork.

This is where the technical hypothesis comes in: an informed assumption about how to solve the problem with machine learning. It is not the final model. It is a reasonable first bet that connects what you want to predict with the data you have.

A well-formulated technical hypothesis answers something like: with the available data, we can predict this variable using this approach and evaluate the result with this metric.

The initial solution must be bounded. It does not aim to be perfect. It aims to validate whether you're heading in the right direction.
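One bounded first version, offered here as an assumption rather than as the method used in the project, is a baseline that always predicts the average of the training targets. Any later model has to beat it on the metric you chose. The strength values below are invented, and `mean_absolute_error` is a small helper written for this sketch.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Invented strength measurements, split into training and test sets.
train_strength = [38.0, 42.0, 44.0, 40.0]
test_strength = [39.0, 43.0]

# Baseline hypothesis: always predict the training average.
baseline_prediction = mean(train_strength)
baseline_mae = mean_absolute_error(
    test_strength, [baseline_prediction] * len(test_strength)
)

print(baseline_prediction)
print(baseline_mae)
```

If a more elaborate model cannot improve on this number, that is a sign to revisit the problem definition or the data before adding complexity.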

Example: the cement industry case

With the context already clear (reducing costs by optimizing gypsum usage without losing strength), you can formulate different hypotheses:

Regression: directly predict compressive strength from the cement's chemical composition.
Classification: predict whether a mix meets a minimum strength level or not.
Clustering: explore patterns in the mixes without using labels, to understand similar behaviors between formulations.

The context is the same. The business is the same. But the machine learning problem changes completely depending on which hypothesis you choose.

You don't need to solve everything from the start. You need a first version that allows you to learn quickly. The paper "Matching Problems to Solutions: An Explainable Way of Solving Machine Learning Problems" proposes a clear sequence: you understand the problem, define how to represent it, and then choose a solution consistent with that formulation. That order avoids building models disconnected from the real objective.

Why your model is not improving even though the code is correct

If your model is not improving, the problem is often not in the code. It's in how you defined the problem.

Metric misaligned with the real objective

One of the most common mistakes is optimizing something that does not represent the real objective. In the cement industry case, you might minimize the average prediction error for strength. But if you fail on the cases where strength falls below the acceptable minimum, the model is useless in operation. The technical metric was not aligned with what the business actually needed.
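To make the misalignment concrete, here is a small sketch with invented numbers. The strength values, the minimum of 40, and both sets of predictions are assumptions for illustration, and `missed_failures` is a hypothetical helper rather than a standard metric.

```python
from statistics import mean

MIN_STRENGTH = 40.0  # assumed acceptable minimum

# Invented data: the last mix is the only one that fails the minimum.
actual = [45.0, 44.0, 43.0, 39.0]
model_a = [45.0, 44.0, 43.0, 42.0]  # very accurate on average
model_b = [43.0, 42.0, 41.0, 38.0]  # less accurate, but conservative


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def missed_failures(y_true, y_pred, minimum):
    """Count mixes that fail the minimum but are predicted to pass."""
    return sum(1 for t, p in zip(y_true, y_pred) if t < minimum <= p)


for name, predictions in (("A", model_a), ("B", model_b)):
    print(
        name,
        mean_absolute_error(actual, predictions),
        missed_failures(actual, predictions, MIN_STRENGTH),
    )
```

In this toy case, model A has the lower average error but approves the one mix that actually fails, while model B has the higher error and misses none. Which model is "better" depends entirely on which metric reflects the real objective.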

Poorly defined target

If the target variable does not properly capture what you want to solve, no model will fix that. In the same example: if you define the target as 28-day strength but the critical decision happens earlier, you are optimizing something that arrives too late. The problem was the target definition, not the algorithm.

A documented case in "Problem Formulation and Fairness" shows this clearly: a system with poor results was not failing because of the model, but because of how the target had been defined. Changing that definition changed the outcome without touching the algorithm.

This reinforces a key idea: model performance is limited by the quality of the problem formulation. If that fails, everything else fails too. And here we return to the starting point: everything begins with context and a good definition.

Frequently asked questions about defining the problem in machine learning

Why is it important to define the problem before choosing a machine learning model?

Because the problem definition determines what variable you will predict, what data you need, and how you will measure success. Without a clear definition, even a well-implemented model can produce results that are useless in practice.

What is the difference between a business objective and a machine learning problem?

The business objective speaks in terms of impact: reduce costs, improve quality. The ML problem speaks in terms of prediction: estimate a variable, classify a result. Translating one into the other is the real work of problem definition.

What should a good machine learning problem definition include?

A good definition must include three elements: the target variable you want to predict, the data available for training, and the evaluation metric that defines success. Without these three elements the problem remains ambiguous.

Why is my machine learning model not improving even though the code is correct?

Often the problem is not in the code but in how the problem was defined. The most common mistakes are a metric misaligned with the real objective and a poorly defined target that does not properly capture what you want to solve.

If you have a project underway and are not sure whether the problem is well defined, you can reach out to me directly. Often that is the adjustment that changes the entire result.
