
How to interpret and explain the results of your machine learning model

Knowing your model has 85% F1 is not the same as knowing what that 85% means to the person who will make decisions with it.

20 min read · Metrics · Visualization · Stakeholders · SHAP · April 23, 2026

A data scientist announces "our model achieved an ROC AUC of 0.85" and the executive across the table thinks silently: "Should I be impressed?" This is not a hypothetical situation. It is the most frequent pattern in real projects where technical results are disconnected from the language in which decisions are made.

The problem is not the metric. The problem is that the metric has not been translated. And that translation is not optional: it is what determines whether the work generates impact or gets filed away.

This article does not explain how to calculate metrics — that is already covered in the model selection and evaluation article. It explains what to do with the metrics once you have them, depending on who you are talking to.



The most common mistake: presenting numbers without context

The mistake that repeats most often in real projects is assuming that stakeholders know what to do with the model's outputs. Delivering tables of probabilities, confusion matrices, or RMSE values without context generates an inevitable question: "What does this mean for us?" In projects where I have made that mistake, users ended up making decisions with the wrong number because no one had explained which one was relevant.

A study on arXiv about ML explainability for external stakeholders documents that even experts in different disciplines — academics, lawyers, regulators — have difficulty building a shared language around ML model results, and that this communication gap is one of the main obstacles to real adoption of these systems (Bhatt et al., 2020, arXiv 2007.05408).

The solution is not to simplify results to the point of losing rigor. It is to structure communication in layers: first the impact in domain language, then the metrics that support it, and finally the technical details for those who need them.


Need help presenting your results?

If you already have the model and the metrics but don't know how to communicate them to your committee, client, or team, I can review your case and help you structure the presentation correctly for your audience.

I want help with my results →



Before interpreting: who are you explaining this to?

The same metric needs a different translation depending on the audience. Defining this before preparing the presentation saves complete rewrites.

Your advisor or academic committee

They expect methodological rigor. They need to see the comparison with baselines, the variance across folds, the justification of the chosen metrics, and the analysis of the model's limitations. The most frequent mistake with academic committees is reporting only the best result without showing stability across different splits. A model that has 0.92 F1 on average but 0.68 on the worst fold has a real problem that the committee will detect.
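A minimal sketch of what that stability report can look like, assuming scikit-learn; the dataset and model below are illustrative placeholders:

```python
# Sketch: report per-fold scores, not just the best run.
# The dataset and model are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
model = RandomForestClassifier(random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

# The committee wants the spread and the worst fold, not only the average.
print(f"F1 per fold: {np.round(scores, 3)}")
print(f"Mean ± std:  {scores.mean():.3f} ± {scores.std():.3f}")
print(f"Worst fold:  {scores.min():.3f}")
```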

A business stakeholder or client

They expect measurable impact on their KPIs, not technical jargon. What they need to see is the translation of the result into concrete decisions: what the business can do with this model that it couldn't do before, how much better it is compared to the previous situation, and what kinds of errors the model makes that the business must account for. In real projects I've learned that the question "who do we act on with this result?" is more useful than any metrics table for aligning expectations. Google's guidance on ML projects notes that communication of experimental results should be aligned with each stakeholder's expectations at each phase of the project (Google Developers, ML Projects).

A technical interviewer

They expect you to be able to defend the decisions you made. They are not looking for a perfect model — they want to see that you understand why you chose those metrics, what their limitations imply, and what you would have done differently with more time or more data. Fluency in talking about the project, more than the numerical result, is what convinces in technical interviews.


How to interpret classification metrics according to context

The accuracy trap with imbalanced classes

With imbalanced classes, accuracy lies. A model that always predicts the majority class in an 80/20 dataset achieves 80% accuracy and zero utility. In those cases, MCC and PR AUC are more honest because both explicitly account for how the model performs on the minority class. When presenting results of an imbalanced classification model, always include MCC or PR AUC as the primary metric, and explicitly explain why accuracy is not enough.
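A minimal sketch of that gap, using synthetic labels with a roughly 80/20 split and a dummy predictor that always outputs the majority class; everything here is illustrative:

```python
# Sketch: accuracy flatters a useless majority-class predictor; MCC and PR AUC do not.
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, matthews_corrcoef

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.2).astype(int)      # ~20% positive class

y_pred = np.zeros_like(y_true)                     # always predicts the majority class
y_score = np.zeros(len(y_true), dtype=float)       # zero confidence in the positive class

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")            # ~0.80, looks fine
print(f"MCC:      {matthews_corrcoef(y_true, y_pred):.2f}")          # 0.00, the honest signal
print(f"PR AUC:   {average_precision_score(y_true, y_score):.2f}")   # ~0.20, no better than the base rate
```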

How to read a confusion matrix to communicate where the model fails

The confusion matrix is not just a number: it is a map of the model's errors. To communicate it effectively to non-technical audiences, the most useful approach is not the numerical table but the translation into real consequences: how many positive cases did the model miss? How many false alerts does it generate? In a domestic violence detection project, a false negative and a false positive carried completely different legal consequences, and that was what the evaluating committee needed to understand, not the matrix numbers.
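One way to do that translation programmatically is to unpack the matrix into its error counts and phrase them as consequences. A sketch, with y_true and y_pred as placeholders for your own labels and predictions:

```python
# Sketch: phrasing confusion-matrix cells as consequences instead of showing the raw table.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # placeholder labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]   # placeholder predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"Real cases the model missed (false negatives): {fn}")
print(f"False alerts the model raised (false positives): {fp}")
print(f"Out of {tp + fp} alerts, {tp} correspond to real cases.")
```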

When to use F1, MCC, or PR AUC

  • F1: when there is no clear reason to prioritize precision over recall and the classes are reasonably balanced.
  • MCC: when the classes are imbalanced and you need a metric that is not inflated by the majority class.
  • PR AUC: when the positive class is the minority and the system needs to find it with high precision — recommendations, anomaly detection, alert systems.

How to interpret regression metrics

MAE, RMSE, and residual analysis

MAE gives the average error in the same units as the problem, which makes it the easiest metric to communicate to non-technical audiences. RMSE penalizes large errors quadratically, which makes it more informative when large errors have serious operational consequences. The choice between the two is not a technical decision: it depends on how costly large errors are in the domain.

Residual analysis is what MAE and RMSE do not show: whether the model makes errors systematically in certain ranges of the target. A grocery price prediction model can have an acceptable global MAE and at the same time systematically predict poorly during periods of high inflation — precisely when accuracy matters most. In real projects, plotting the error as a function of the predicted value has revealed problems that the global number would never have detected.
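A sketch of that check, with synthetic values standing in for your own targets and predictions:

```python
# Sketch: MAE, RMSE, and a residual plot against the predicted value.
# The synthetic data includes a mild systematic drift to show what the plot reveals.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
y_true = rng.uniform(50, 500, size=300)
y_pred = y_true + rng.normal(0, 15, size=300) + 0.05 * (y_true - 275)

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"MAE:  {mae:.1f} units")
print(f"RMSE: {rmse:.1f} units")

# Residuals vs predicted value: a visible trend here is the problem the global number hides.
residuals = y_true - y_pred
plt.scatter(y_pred, residuals, alpha=0.4)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Predicted value")
plt.ylabel("Residual (true - predicted)")
plt.title("Residuals vs predicted value")
plt.show()
```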


SHAP and local explainability: when it adds value and when it's decorative

SHAP (SHapley Additive exPlanations) allows you to explain why the model made a specific decision for a specific instance. In projects with stakeholders who need to understand or audit individual decisions — credit, healthcare, justice — SHAP is practically mandatory. In projects where only aggregate performance matters, it may be dispensable.
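A minimal local-explanation sketch, assuming the shap package and a tree-based scikit-learn classifier; the data and feature names are placeholders:

```python
# Sketch: explaining one specific prediction with SHAP.
# The model, data, and feature names are illustrative placeholders.
import pandas as pd
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # tree models get fast, exact explanations
shap_values = explainer(X)              # one explanation per instance

# Why did the model decide what it decided for this particular row?
shap.plots.waterfall(shap_values[0])
```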

The most frequent mistake with SHAP is not using it incorrectly in a technical sense, but showing the chart without explaining it. A feature importance chart does not explain itself. What it communicates — that passenger gender explains almost half of Titanic survival predictions, for example — needs to be translated into domain language for it to make sense to the audience.

When I have presented SHAP results to academic committees, what generated useful conversation was not the chart itself, but the question of whether that feature importance makes sense given domain knowledge. A model that assigns high importance to a variable that should not be relevant is a signal of data leakage or bias that is worth discussing more than any global metric.


How to translate technical results into impact language

The most effective technique for non-technical audiences is replacing the abstract number with a concrete consequence. Not "the model has a precision of 0.72", but "out of every 10 alerts the model generates, 7 correspond to real cases." Not "RMSE of 14.57 units", but "the model is off by an average of 15 bikes per hour of prediction, which during peak hours represents less than 5% of real demand."

In real projects, converting model outputs into three actionable categories — high, medium, and low risk, tied to available resources — has consistently been more useful than presenting raw probabilities. The audience does not need the number: they need to know what to do with it.
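A sketch of that conversion with illustrative cut-offs; in practice the thresholds come from how many cases the owners of each tier can actually handle:

```python
# Sketch: binning model probabilities into three actionable risk tiers.
# Thresholds are illustrative; calibrate them against available resources.
import numpy as np
import pandas as pd

probabilities = np.array([0.05, 0.12, 0.35, 0.48, 0.71, 0.88, 0.93])

tiers = pd.cut(
    probabilities,
    bins=[0.0, 0.3, 0.7, 1.0],
    labels=["low risk", "medium risk", "high risk"],
    include_lowest=True,
)

for p, tier in zip(probabilities, tiers):
    print(f"score {p:.2f} -> {tier}")
```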

To compare the model against the previous situation, the most effective chart is not the ROC curve: it is a simple bar chart showing the result before and after the model in terms of business KPIs. The lift over the baseline, expressed as a percentage improvement or in monetary units when the domain allows, is what converts a technical result into a value argument.
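A sketch of that before/after chart with matplotlib; the KPI and the numbers are placeholders for your own baseline and model results:

```python
# Sketch: a before/after bar chart expressed in a business KPI, not a model metric.
import matplotlib.pyplot as plt

labels = ["Current process", "With the model"]
pct_cases_identified = [52, 78]   # placeholder KPI values

plt.bar(labels, pct_cases_identified)
plt.ylabel("% of relevant cases identified")
plt.title("Lift over the current process")
for i, value in enumerate(pct_cases_identified):
    plt.text(i, value + 1, f"{value}%", ha="center")
plt.show()
```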

When explaining uncertainty, transparency builds credibility rather than undermining it. Admitting "this forecast has a variation margin of ±X% in this segment" and making the model's assumptions explicit demonstrates rigor. Stakeholders who work with uncertainty in their own domains appreciate that the model does not claim a certainty it does not have.
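One simple way to produce that "±X% in this segment" statement is to summarize the validation residuals per segment. A sketch, with column names, segments, and data that are purely illustrative:

```python
# Sketch: turning validation residuals into a per-segment error margin you can state out loud.
# Segments, columns, and the synthetic data are illustrative placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
validation = pd.DataFrame({
    "segment": ["weekday"] * 100 + ["weekend"] * 100,
    "y_true": rng.uniform(100, 300, 200),
})
validation["y_pred"] = validation["y_true"] * (1 + rng.normal(0, 0.08, 200))

validation["pct_error"] = (
    (validation["y_pred"] - validation["y_true"]) / validation["y_true"] * 100
)

# The 10th-90th percentile band per segment reads as "forecasts in this segment vary within roughly ±X%".
margins = validation.groupby("segment")["pct_error"].quantile([0.1, 0.9]).unstack()
print(margins.round(1))
```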


Visualizations that communicate vs visualizations that confuse

Not all visualizations work for all audiences. There is a set of charts that systematically generate confusion rather than clarity, and that are worth avoiding regardless of how widely they are used in tutorials.

What to avoid

3D pie charts distort real proportions due to the perspective effect. Dual-axis line charts without explicit scale create misleading comparisons. A dashboard with more than five or six series on the same line chart becomes unreadable for any audience. Every time I have simplified a dashboard — separating charts or reducing series — stakeholder comprehension improved without exception.

What works

The before/after or model/baseline chart is the most impactful visualization for communicating value. A bar showing the current cost and the estimated cost with the model, or the percentage of correctly identified cases before and after, immediately makes the added value visible. For classification, the normalized confusion matrix (in percentages, not absolute values) is more readable and allows comparison between classes of different sizes. For feature importance, a horizontal bar chart with the five or ten most relevant features communicates more than a complex tree chart.
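A sketch of both charts with scikit-learn and matplotlib, assuming a fitted model that exposes feature_importances_; everything below is placeholder data:

```python
# Sketch: a row-normalized confusion matrix and a horizontal bar chart of top features.
# Dataset, model, and feature names are illustrative placeholders.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=10, weights=[0.75, 0.25], random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Normalized confusion matrix: percentages per true class, comparable across class sizes.
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, normalize="true", values_format=".0%")
plt.title("Confusion matrix (normalized by true class)")
plt.show()

# Horizontal bars for the most relevant features only.
importances = pd.Series(model.feature_importances_, index=X.columns).nlargest(8).sort_values()
importances.plot.barh()
plt.xlabel("Feature importance")
plt.title("Top features")
plt.show()
```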

Interactive filters in dashboards add value when the stakeholder needs to explore scenarios — optimistic, pessimistic, by segment — without that exploration requiring technical intervention. I have seen this format completely change the dynamic of a results meeting: from a one-way presentation to a conversation about real decisions.


What to do when results are not good

A model with mediocre but well-interpreted results is more valuable than a model with excellent results that are poorly explained. This is not a consolation — it is a practical observation. Stakeholders who understand the model's limitations make better decisions with it than stakeholders who believe the model is infallible.

How to report limitations without destroying the credibility of the work: frame them as findings, not failures. "The model has difficulty predicting correctly during periods of high market volatility" is a useful finding. "The model fails" is not. The difference is that the first includes the context that explains the limitation and allows the stakeholder to know when to trust the model and when not to.

In academic theses, negative results are perfectly acceptable and sometimes expected. A project that demonstrates that a method does not work for a specific problem, and that documents why, contributes to the field's knowledge in the same way as one that reports positive results.


Frequently asked questions about interpreting ML results

How do I know if my results are good enough?

It depends on the domain, not the number. The right question is whether the model is better than the current alternative and whether its errors are acceptable given the real cost of being wrong in your context. An F1 of 0.78 can be excellent in one problem and completely insufficient in another.

Do I have to use SHAP to defend an academic project?

It is not mandatory, but it is the most robust tool for explaining why the model makes certain decisions. It only adds value if you can interpret its results and connect them to domain knowledge. Showing the chart without being able to explain it raises more questions than it answers.

How do I present negative results without making it look like the project failed?

By framing them as findings. Document what you learned about the problem, why the model has the limitations it has, and what conditions would be needed to improve results. In academic theses, well-documented negative results are valuable.

What metrics should I present if I don't know who will read the report?

The most honest metric for your type of problem. For imbalanced classification, MCC and PR AUC. For balanced classification, accuracy and F1. For regression, MAE with a minimal residual analysis. Then add a translation to impact language for non-technical readers.



If you already have your metrics but don't know how to present them to your committee, client, or interviewer, I can help you structure the communication for your specific audience.

I want help interpreting my results →
