Selection, Training and Evaluation of Machine Learning Models
A model with excellent metrics can be completely useless in production. The problem is almost never the algorithm: it's how it is evaluated, which metrics are used, and whether the validation reflects the real conditions of the system.
A Random Forest model on a bank credit dataset can reach 81.5% accuracy —and have an MCC of just 0.38. That means the metric that appears in the report says the model works well, while the metric that actually matters says it barely beats chance on the class where being wrong is most costly.
That is not a hypothetical case. It is one of the experimental results from Evaluating Supervised Machine Learning Models: Principles, Pitfalls, and Metric Selection (Liu et al., 2026), a recent study that used 15 real datasets to document exactly when and why model rankings change depending on which metric you use. The central conclusion is uncomfortable: model evaluation is not a routine step. It is a methodological decision that can make a system appear ready when it is not.
This article is a complete map of the problem: how to select models, how to design an evaluation that does not mislead you, which metrics to use depending on the domain, and how to know when a model is truly ready for production.
Table of contents:
- How to select the right model for your dataset
- How to split data without contaminating the evaluation
- Classification metrics: when each one matters
- The accuracy trap and class imbalance
- Regression metrics: MAE, RMSE and R² are not interchangeable
- How to detect overfitting when metrics seem correct
- How to compare models robustly
- Hyperparameter tuning without overfitting the evaluation
- Data leakage: the silent error that destroys models in production
- When is a model ready for production?
- How to monitor a model after deployment
- Frequently asked questions
🔵 How to select the best machine learning model for your dataset?
The right question is not "which model is the best?" It is: which model fails the least under my real conditions? The difference is not semantic. Liu et al. (2026) demonstrate this empirically: the same set of models produces different rankings on the same data depending exclusively on which metric is used to rank them. There is no universal winner.
What signals from the dataset determine which algorithm to use?
In real projects, selection decisions do not start with the algorithm but with the dataset analysis. The most determining signals are:
- Data volume: with little data, complex models (neural networks, gradient boosting with many estimators) have high variance and tend to overfit. Simple models such as logistic regression or shallow trees generalize better when the dataset does not exceed a few thousand records.
- Feature type: for well-preprocessed tabular data, tree-based models (Random Forest, XGBoost, LightGBM) consistently dominate benchmarks. For sequential or time-dependent data, this advantage disappears and models with explicit memory are needed.
- Target structure: a target with a very skewed distribution (fraud, anomaly detection, rare diseases) completely changes which model is viable, not just which metric to use.
- Required interpretability: in regulated sectors —health, credit, insurance— the ability to explain each prediction can eliminate technically superior options. This is not a technical constraint: it is a domain constraint.
Is there a universally better model?
No. And recent experimental evidence supports this with concrete data. In the Liu et al. study, the same Random Forest model achieved 90.3% accuracy on the Bank Marketing dataset and only 0.446 MCC on the same dataset. Those two numbers correspond to the same model, the same data, the same validation fold. The conclusion is not that the model is good or bad: it is that the answer depends on which aspect of performance you want to measure.
This also occurs between models. A Decision Tree can slightly outperform a Random Forest in accuracy (91.67% vs 90.39% on Car Evaluation) while at the same time producing a Log Loss of 3.00 compared to 0.33 from the Random Forest. The tree is right almost as often, but when it fails it does so with extreme confidence that the system penalizes heavily. Which model you would choose depends on whether your system needs frequent correct decisions or reliable probability predictions.
🔵 How to correctly split data into train, validation and test?
The data split is where the most costly and hardest-to-detect errors are made. A poorly designed split produces optimistic estimates that are only discovered when the model fails in production.
Why you should not use the test set to make decisions
This error is more common than it seems, and it happens in subtle ways. The obvious case is adjusting hyperparameters by evaluating directly on test. But it also happens when you select the best model among ten candidates using their test metrics: every time you use the test set to make a decision, you introduce a selection bias. In the end, the "best model on test" often wins because it got lucky with that partition, not because it generalizes better.
The paper describes this as one of the most common sources of irreproducible results: Kapoor and Narayanan document that data leakage through the test set has contaminated published results across multiple scientific disciplines that use ML. The structural solution is to use a separate validation set for all development decisions and reserve the test exclusively for the final report of the already-chosen model.
When to use cross-validation vs holdout?
- Cross-validation (k-fold): reduces the dependence of the result on a single partition. With 5 folds, each observation appears exactly once in the held-out fold, and the average of the five results is a more stable estimator. It is the standard for moderately sized datasets.
- Simple holdout: valid for large datasets where the computational cost of training k models is prohibitive. The key is that the partition is representative: stratified split in classification to preserve class proportions.
- Nested cross-validation: when you need to both select hyperparameters and estimate generalization performance, the outer loop evaluates the model and the inner loop does the tuning. This avoids overestimating the performance of the selected model; a sketch follows this list.
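As a rough illustration, here is how a nested cross-validation might look with scikit-learn. The dataset, model and hyperparameter grid are placeholders chosen for the sketch, not the study's actual setup:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# Inner loop: hyperparameter tuning inside each outer training fold
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
tuned_model = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="f1",
    cv=inner_cv,
)

# Outer loop: estimates the generalization of the already-tuned model
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(tuned_model, X, y, cv=outer_cv, scoring="f1")
print(f"F1 per outer fold: {np.round(scores, 3)}")
print(f"Mean ± std: {scores.mean():.3f} ± {scores.std():.3f}")
```

The outer scores are the honest estimate; the best hyperparameters reported by the inner search are only a by-product.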
The special case of time series and geospatial data
A random shuffle in time series is a serious error: observations close in time share information, and a model can learn to "see the future" if the training data contains observations later than the test ones. The correct validation is always temporal: train on the past and evaluate on the future.
The same principle applies to geospatial data. Meyer et al. show that standard cross-validation significantly overestimates performance in spatial prediction because nearby observations in space share autocorrelation. The correct strategy is validation by geographic regions, not by individual observations.
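For the temporal case, a minimal sketch with scikit-learn's TimeSeriesSplit, assuming the rows are already sorted chronologically (the data and the Ridge model are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))            # illustrative features, ordered by time
y = X[:, 0].cumsum() + rng.normal(size=500)

tscv = TimeSeriesSplit(n_splits=5)       # each fold trains on the past, evaluates on the future
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = Ridge().fit(X[train_idx], y[train_idx])
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    print(f"Fold {fold}: train ends at row {train_idx[-1]}, MAE = {mae:.2f}")
```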
🔵 Which classification metrics to use and when?
Choosing the wrong metric is one of the most costly errors in ML because its consequences are invisible until the model reaches production. Each metric has a different penalty structure, and that structure interacts differently with the data distribution.
The metric map by problem type
Accuracy
The proportion of correctly classified instances. Simple and intuitive, but misleading when classes are not balanced. With 95% negative class, a classifier that always predicts "negative" has 95% accuracy and zero usefulness. The paper documents this experimentally: with the Credit Default dataset (approximately 78/22 distribution), the Random Forest reaches 81.5% accuracy while its MCC is only 0.385, a value far closer to random prediction than the accuracy suggests.
Precision and Recall
Precision measures what fraction of positive predictions are correct. Recall measures what fraction of actual positives were detected. They are complementary and generally inversely related: increasing one reduces the other. The decision depends on which error is more costly in your domain:
- If the cost of a false positive is high (spam that deletes an important email, a fraud alert that blocks a legitimate purchase), prioritize Precision.
- If the cost of a false negative is high (failing to detect a tumor, failing to identify a fraudulent transaction), prioritize Recall.
In the breast cancer diagnosis experiment, the model achieved 96.48% accuracy and 95.80% Precision. But Recall dropped to 94.81%, and in a specific fold fell to 92.86%. In clinical terms: one in fourteen malignant cases was not detected. The metric that appears in the report says the model is excellent. The metric that matters says there are patients whose tumors go undetected.
F1 Score and F2 Score
The F1 Score is the harmonic mean of Precision and Recall: it penalizes when either is low. Useful when there is no clear reason to prioritize one over the other. The F2 Score weights Recall twice as much as Precision, and is more appropriate in medical or security contexts where false negatives are more serious. In the study, the F2 Score of the breast cancer model was 0.9498, a number that better reflects clinical reality than the F1 of 0.9526.
MCC (Matthews Correlation Coefficient)
The MCC incorporates all four elements of the confusion matrix (TP, TN, FP, FN) into a single measure. Its range goes from -1 to +1, where 0 equals random prediction. It is especially informative in imbalanced datasets because it cannot be artificially inflated by the majority class. The paper identifies it as one of the most robust metrics for imbalanced binary classification: an MCC of 0.38 on a financial dataset is an honest evaluation where the 81.5% accuracy is an illusion.
ROC AUC and PR AUC
ROC AUC measures the model's ability to separate classes across all possible thresholds. It is robust but has an important weakness: it normalizes the false positive rate by the total number of negatives, which in very imbalanced datasets can result in a high ROC AUC even though the model produces many false positives in absolute terms.
PR AUC works directly with Precision and Recall, making it more informative when the positive class is the minority. In the bank marketing dataset, the Random Forest achieved a ROC AUC of 0.916 —which sounds excellent— but a PR AUC of only 0.586. The difference between those two numbers is the difference between a marketing system that appears to work and one that actually identifies customers who will convert.
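To see how these metrics can diverge on the very same predictions, a sketch on a synthetic imbalanced dataset (the numbers will not match the paper's, since the data here is artificial, and average precision is used as the PR AUC estimate):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             fbeta_score, matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced binary problem
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]

print(f"Accuracy : {accuracy_score(y_te, pred):.3f}")
print(f"Precision: {precision_score(y_te, pred):.3f}")
print(f"Recall   : {recall_score(y_te, pred):.3f}")
print(f"F1 / F2  : {f1_score(y_te, pred):.3f} / {fbeta_score(y_te, pred, beta=2):.3f}")
print(f"MCC      : {matthews_corrcoef(y_te, pred):.3f}")
print(f"ROC AUC  : {roc_auc_score(y_te, proba):.3f}")
print(f"PR AUC   : {average_precision_score(y_te, proba):.3f}")  # average precision
```

On imbalanced data like this, accuracy and ROC AUC typically look much friendlier than MCC and PR AUC computed from the same predictions.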
🔵 The accuracy trap: imbalanced data and misleading metrics
This is the most documented error in ML evaluation and remains one of the most frequent. The paper calls it "the accuracy trap" and devotes a complete experimental scenario to documenting it with two real financial datasets.
The experiment that demonstrates it
With the Default of Credit Card Clients dataset (30,000 observations, approximately 78/22 distribution between no-default and default), the Random Forest achieved:
- Accuracy: 81.50% — suggests a competent model
- ROC AUC: 75.65% — suggests reasonable separation capacity
- PR AUC: 52.30% — reveals that, averaged across decision thresholds, only about half of the instances the model flags as positive actually are
- MCC: 0.385 — indicates the model barely surpasses the level of a simple classifier
With Bank Marketing (45,211 observations, heavily imbalanced toward "did not subscribe"), the pattern is even more pronounced: ROC AUC of 91.58% coexisting with PR AUC of 58.55%. A marketing system built on that model and evaluated only with ROC AUC would appear excellent, yet roughly half of the customers it flags as likely subscribers would never convert.
Why class imbalance creates this illusion
Accuracy and ROC AUC become inflated because the volume of true negatives is so large that performance on them dominates the metric computation. The model can almost completely ignore the minority class and still obtain apparently good numbers. MCC and PR AUC are less susceptible to this because their denominators directly include performance on the positive class.
Real strategies when classes are imbalanced
- Change the evaluation metric before adjusting the model: using F1, MCC or PR AUC from the start changes which models and which hyperparameters look favorable.
- Adjust class weights: most frameworks allow assigning higher penalties to errors in the minority class. The model optimizes its loss function with that penalty, which redistributes its attention toward the class that matters.
- Adjust the decision threshold: the default threshold of 0.5 is not optimal with imbalanced classes. Moving it downward captures more positives at the cost of more false positives; the right decision depends on the relative cost of each type of error in the domain. A sketch combining class weights and a tuned threshold follows this list.
- Oversampling or undersampling: SMOTE generates synthetic instances of the minority class; undersampling reduces the majority. Both strategies should be applied only on training data, never before the split.
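A minimal sketch of the class-weight and threshold strategies on synthetic data; the model, the 92/8 imbalance and the choice of F1 as the tuning criterion are illustrative assumptions, and the threshold is tuned on a validation split, never on test:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.92, 0.08], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Class weights: errors on the minority class cost more in the loss function
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# Threshold adjustment: pick the cutoff that maximizes F1 on the validation set
thresholds = np.linspace(0.1, 0.9, 81)
f1s = [f1_score(y_val, (proba >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(f"Default threshold 0.50 -> F1 = {f1_score(y_val, (proba >= 0.5).astype(int)):.3f}")
print(f"Tuned threshold   {best:.2f} -> F1 = {max(f1s):.3f}")
```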
🟡 Regression metrics: MAE, RMSE and R² are not interchangeable
In regression, the equivalent of the accuracy trap is using R² as the sole quality indicator. A high R² can coexist with systematic error patterns that are only visible in residual analysis. And MAE and RMSE can lead to opposite conclusions about which model is better, depending on whether the data has outliers.
MAE vs RMSE: when the difference matters
The mathematical difference between MAE and RMSE is that RMSE squares each error before averaging, causing large errors to contribute disproportionately. With well-behaved distributions, both metrics behave similarly. With outliers, the gap can be enormous.
The paper documents this with two real datasets. In California Housing, the Random Forest obtained an average MAE of 0.331 and an RMSE of 0.509. In Individual Household Electric Power Consumption (over two million observations), the MAE was 0.0216 and the RMSE 0.0401. In both cases, RMSE is significantly larger than MAE, reflecting that the model makes small errors frequently and large errors infrequently. The relevant question for metric selection is whether those large errors are inconsequential anomalies or costly operational failures.
- Use MAE when you want a robust report of the average error and outliers in the data are noise, not a critical signal. MAE is easier to interpret: an MAE of 0.33 in normalized prices is directly the expected error in a typical prediction.
- Use RMSE when large errors are operationally dangerous and should be penalized more strongly: electricity consumption prediction where peaks cause network failures, inventory estimation where large errors generate costly stockouts. The sketch after this list shows how a handful of large errors pulls RMSE away from MAE.
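A quick sketch of the effect on fabricated data: the predictions are accurate almost everywhere, and 1% of them are badly wrong.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
y_true = rng.normal(loc=100, scale=10, size=1000)

# Small errors everywhere, plus ten large misses
y_pred = y_true + rng.normal(scale=2, size=1000)
y_pred[:10] += 60

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"MAE  = {mae:.2f}")   # barely moved by the outliers
print(f"RMSE = {rmse:.2f}")  # dominated by the ten large errors
```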
The illusion of high R²
In the Parkinson Telemonitoring dataset, the Random Forest achieved an average R² of 0.967 and an MAE of 0.840. At first glance it appears to be an almost perfect model. But a high R² indicates that the model explains the overall variance of the target; it does not guarantee that the error is uniform across the prediction range.
Residual analysis —plotting the error as a function of the predicted value— can reveal that the model is systematically worse for patients with severe symptoms (high values on the UPDRS scale), precisely the subgroup where accuracy matters most. That pattern does not appear in R² or global MAE. It only appears when you look at disaggregated residuals.
The practical rule: R² and MAE/RMSE are necessary but not sufficient metrics. Residual analysis is not optional in applications where the error in certain target ranges has different consequences than the error in other ranges.
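One way to operationalize this is to bucket the predictions and look at the error per bucket. The sketch below uses fabricated data whose error grows with the predicted value; the bin edges and the UPDRS-like scale are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y_true = rng.uniform(0, 60, size=2000)                          # UPDRS-like target
y_pred = y_true + rng.normal(scale=1 + y_true / 20, size=2000)  # error grows with severity

df = pd.DataFrame({"pred": y_pred, "abs_error": np.abs(y_true - y_pred)})
df["range"] = pd.cut(df["pred"], bins=[0, 15, 30, 45, 60])      # disaggregate by prediction range
print(df.groupby("range", observed=True)["abs_error"].agg(["mean", "max"]).round(2))
```

A table like this makes the pattern visible: the global MAE hides the fact that the worst errors concentrate in the highest range.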
The problem with MAPE for data with low values
MAPE (Mean Absolute Percentage Error) has a structural problem: when the actual value approaches zero, the denominator makes absolutely small errors produce enormous percentages. In Seoul Bike Sharing, the model obtained a MAE of 14.57 bicycles —which at peak demand hours is a marginal error— but a MAPE of 16.62%, dominated by nighttime periods of near-zero demand where any prediction produces a high percentage. Using MAPE in that context would lead to rejecting a model that works well when it matters.
🟡 How to detect overfitting when metrics seem correct?
Classic overfitting —high performance on train, low on test— is easy to detect. Modern overfitting is more subtle and is detected with other signals.
Overfitting signals that metrics do not show directly
- Very good but hard-to-reproduce results: if the model varies significantly between runs with different random seeds, the estimator variance is high even if the average is good. Stability and reproducibility are necessary conditions for a useful model.
- High variability between cross-validation folds: if the five folds give 0.92, 0.93, 0.91, 0.94 and 0.90 F1, the model is stable. If they give 0.95, 0.88, 0.93, 0.72, 0.91, there is structural instability that the average of 0.878 conceals.
- Performance that drops with minimal data perturbations: adding 1% noise to features, slightly changing the test distribution, or introducing values outside the training range should produce gradual degradations. Abrupt collapses indicate the model memorized the training set more than it learned the problem.
- Benchmark overfitting: the model continuously improves on the evaluation dataset without improving on truly new data. This occurs when the validation set is used repeatedly to make design decisions.
Learning curves as diagnostics
Learning curves plot performance on train and validation as a function of training dataset size. The diagnostic patterns are:
- Train high, validation low, constant gap when adding data → clear overfitting: the model needs regularization or simplification.
- Both low and converging → underfitting: the model is too simple for the problem.
- Both improve when adding data and the gap narrows → the model is learning; more data will help.
- Unstable performance (zigzag) on validation → possible problem with the data, not the model.
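scikit-learn can compute these curves directly. A minimal sketch, with a placeholder dataset and model:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="f1",
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:4d} samples -> train F1 = {tr:.3f}, validation F1 = {va:.3f}")
```

Plotting train and validation scores against the number of samples reveals which of the patterns above you are facing.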
🟡 How to compare machine learning models robustly?
Comparing models with a single split produces random decisions. The paper demonstrates this directly: evaluation results depend on the split, and with a single holdout the "best model" may simply be the one that got lucky with the partition.
Why rankings change with the split
The variance of a performance estimator calculated on a single test set is high. Two models with a 0.5% difference in F1 can reverse their ranking on another split of the same dataset. This is not a theoretical problem: it is one of the reasons why ML results are hard to reproduce between teams working with the same data but with different seeds or partitions.
The solution is to spread the evaluation across folds: 5-fold or 10-fold cross-validation produces five or ten estimates instead of one. The right decision is not just to look at the average, but also the spread: a model with an average of 0.88 F1 and standard deviation of 0.01 is preferable to one with an average of 0.89 and standard deviation of 0.06, because the second is less reliable.
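A sketch of that comparison on synthetic data, evaluating two candidate models on exactly the same folds so their per-fold scores are comparable (the models and the imbalance are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=3000, weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # same folds for both models

for name, model in [("RandomForest", RandomForestClassifier(random_state=0)),
                    ("GradientBoosting", GradientBoostingClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name:17s} F1 = {scores.mean():.3f} ± {scores.std():.3f}  folds: {np.round(scores, 3)}")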
Micro F1 vs Macro F1 in multiclass classification
The study includes four multiclass datasets and systematically documents when Micro F1 and Macro F1 diverge. The conclusion is structural: Micro F1 is mathematically identical to overall Accuracy in multiclass classification, which means it is dominated by the most frequent classes. Macro F1 gives equal weight to all classes, which exposes weak performance in minority classes.
In the Covertype dataset (markedly unequal class distribution), the Random Forest obtained a Micro F1 of 0.892 and a Macro F1 of 0.843 —a gap of 4.9 percentage points indicating significantly worse performance on the least frequent forest cover types. In Letter Recognition (nearly uniform distribution of 26 letters), the two values converged: 0.9615 vs 0.9614. The divergence between the two is, in itself, information about the problem distribution.
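Computing both values is a one-line change of the `average` argument in scikit-learn. The sketch below uses a synthetic four-class problem with one dominant class; the class weights are assumptions for illustration, not the Covertype distribution:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Illustrative multiclass dataset with one dominant class and two rare ones
X, y = make_classification(n_samples=6000, n_classes=4, n_informative=8,
                           weights=[0.7, 0.2, 0.06, 0.04], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pred = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
print(f"Micro F1 (== accuracy)       : {f1_score(y_te, pred, average='micro'):.3f}")
print(f"Macro F1 (equal class weight): {f1_score(y_te, pred, average='macro'):.3f}")
```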
Statistical tests to compare models
With k cross-validation results per model, it is possible to run statistical tests to determine whether the difference between models is significant or just noise. Rainio et al. (cited in the paper) document appropriate testing procedures for ML evaluation. The recommended practice is not to declare a model the winner if the difference is not statistically significant, especially when the improvement is marginal.
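As a rough sketch of how such a comparison can be run with SciPy; the fold scores below are invented for illustration, and since cross-validation folds share training data they are not strictly independent, so the p-values should be read as indicative rather than exact:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# F1 of two models on the same 10 cross-validation folds (illustrative numbers)
model_a = np.array([0.912, 0.894, 0.921, 0.903, 0.885, 0.930, 0.901, 0.917, 0.889, 0.924])
model_b = np.array([0.905, 0.890, 0.918, 0.897, 0.887, 0.921, 0.896, 0.909, 0.8835, 0.914])

t_stat, p_t = ttest_rel(model_a, model_b)   # paired t-test on per-fold differences
w_stat, p_w = wilcoxon(model_a, model_b)    # non-parametric alternative
print(f"Paired t-test p-value        : {p_t:.4f}")
print(f"Wilcoxon signed-rank p-value : {p_w:.4f}")
```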
🟡 How to do hyperparameter tuning without overfitting the evaluation?
Hyperparameter tuning is one of the most frequent sources of hidden overfitting. Every time you adjust a hyperparameter based on validation performance, you are using that information to make a decision. If you do it enough times with the same validation set, you end up overfitting the validation set even though you never trained with it.
Grid search, random search and bayesian optimization
- Grid search: exhaustive but exponentially costly. For a model with 4 hyperparameters and 5 values each, that is 625 evaluations. Each evaluation in cross-validation multiplies that by k. Reserved for small search spaces with lightweight models.
- Random search: randomly samples the hyperparameter space. Bergstra and Bengio showed that in practice it covers the relevant space more efficiently than grid search when many hyperparameters have little impact. It is the standard for most practical cases.
- Bayesian optimization: uses the history of previous evaluations to decide which region of the space to explore next. More efficient than random search when the space is large and each evaluation is costly.
The key insight: more tuning does not always improve generalization. At some point, the model starts to fit the particularities of the validation set. Nested cross-validation is the structural tool to avoid this: the inner loop does the tuning and the outer one evaluates the generalization of the already-tuned model.
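A sketch of a random search with scikit-learn; the dataset, the sampled ranges and the budget of 20 configurations are placeholder assumptions. To report generalization honestly, this search object can be dropped into the outer loop of the nested cross-validation sketch shown earlier.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(100, 500),
        "max_depth": [None, 5, 10, 20],
        "min_samples_leaf": randint(1, 10),
    },
    n_iter=20,           # 20 sampled configurations instead of an exhaustive grid
    scoring="f1",
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, f"best CV F1 = {search.best_score_:.3f}")
```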
Probability calibration
The paper dedicates a complete experimental scenario to calibration, and the results are striking. The Decision Tree marginally outperformed the Random Forest in accuracy on two datasets (91.67% vs 90.39% on Car Evaluation, similar advantage on Adult Census). But its Log Loss was 3.00 compared to 0.33 for the Random Forest. That means the tree, when it is wrong, does so with 95-99% confidence in the incorrect prediction.
In any system that uses predicted probabilities to make decisions —ranking cases by risk, generating alerts with an adjustable threshold, calculating the expected value of an action— a poorly calibrated model is operationally dangerous even if its accuracy is acceptable. Techniques such as Platt scaling and isotonic regression allow recalibrating the probabilities after training.
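A minimal sketch of recalibration with scikit-learn's CalibratedClassifierCV, comparing a raw decision tree against its isotonic-calibrated version on synthetic data (the dataset and model are illustrative, not the paper's experiment):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(DecisionTreeClassifier(random_state=0),
                                    method="isotonic", cv=5).fit(X_tr, y_tr)

for name, model in [("raw tree", tree), ("calibrated tree", calibrated)]:
    proba = model.predict_proba(X_te)
    print(f"{name:16s} accuracy = {accuracy_score(y_te, model.predict(X_te)):.3f}, "
          f"log loss = {log_loss(y_te, proba):.3f}")
```

The typical outcome is that accuracy barely changes while log loss drops sharply, which is exactly the gap the paper documents between the Decision Tree and the Random Forest.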
🔴 Data leakage: the silent error that destroys models in production
Data leakage is the only error in ML that produces better-than-real results during development. Everything else —overfitting, incorrect metrics, poorly designed splits— generates problems visible during evaluation. Leakage, by contrast, stays invisible until production.
The most common sources of leakage
- Preprocessing before splitting: normalizing, imputing or applying any transformation to the full dataset before splitting it between train and test. The learned parameters (mean, deviation, ranges) include information from the test set and the model can exploit them indirectly. The solution is to encapsulate all transformations in a pipeline that is fitted only with training data, as sketched after this list.
- Features that encapsulate the target: including variables that only exist because you already know the outcome of the event you are predicting. In default prediction, including whether the customer was contacted by the collections department is leakage: that contact occurred precisely because they had already defaulted.
- Future temporal information: in time series, any feature calculated over a window that includes observations after the prediction moment. A 7-day moving average centered at t includes observations from t+1 to t+3 that would not exist in production.
- Duplicates that cross the split: if the same record appears in both train and test with small variations, the model learns to memorize those instances. Deduplication must be done before splitting.
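A minimal sketch of the pipeline approach with scikit-learn; the imputer, scaler and model are placeholder choices, and the point is that everything is refitted inside each training fold, so no statistic ever sees the held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, random_state=0)

# Every transformation lives inside the pipeline, so during cross-validation
# the imputer and the scaler are fitted only on each fold's training portion.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print(f"F1 without preprocessing leakage: {scores.mean():.3f} ± {scores.std():.3f}")
```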
Why it is hard to detect
Kapoor and Narayanan document cases of leakage in studies published in peer-reviewed scientific journals. The problem is not ignorance but that leakage often has no obvious signal: the pipeline runs without errors, the results are good and consistent. The only signal is that the model performs better than the state of the art in the domain would suggest is possible.
A practical heuristic: if your metrics are significantly better than the best published results for the same type of problem, investigate leakage before celebrating.
🔴 When is a machine learning model truly ready for production?
A model is ready for production when it is reproducible, stable, has metrics aligned to the operational objective and can be monitored. Maximum accuracy is irrelevant if the model does not meet those conditions.
The difference between generalizing and working in production
The paper defines generalization as consistent performance on unseen data. But production adds conditions that offline evaluation experiments do not reproduce: the volume of predictions is higher, input data changes gradually over time, there is required latency, there may be regulatory constraints on individual predictions, and the system consuming the model may have specific requirements about the output format.
Many models that pass offline validation never reach production due to operational, not technical, problems. The cost of deployment, maintenance of the data pipeline, the need to audit predictions, or the computational cost of inference can make a technically superior model unviable.
Concrete criteria to declare a model ready
- Stability across folds: the variance of the performance estimator must be low. A model that fluctuates greatly between folds has unpredictable behavior on real data.
- Alignment of metrics with the business objective: the metrics that guided development must correspond to the real cost of errors in the domain. A model optimized for macro F1 on a problem where only the minority class matters is not aligned even if the F1 is high.
- Reproducibility: the same pipeline, the same data, the same seed must produce exactly the same results. This is an audit requirement in regulated sectors and a debugging condition in any system.
- Defined monitoring plan: which metrics will be monitored in production, how frequently, and which thresholds trigger a review or retraining. A model without a monitoring plan has an unknown expiry date.
Should you retrain with the full dataset before deployment?
After selecting the model and hyperparameters with the training and validation set, it is common to retrain with all available data (train + validation) before final deployment. The idea is to maximize the information available to the model going into production. The test set must never be included in this retraining: it remains the honest estimate of expected performance in production.
🔴 How to monitor a model after deployment?
Evaluation does not end when the model reaches production. Data changes, patterns evolve and the model that was accurate six months ago may be producing systematically incorrect predictions today without any alarm flagging it.
Data drift and concept drift
- Feature drift: the distribution of input variables changes. The model receives data that is outside the range it saw during training, and its predictions in that territory are unvalidated extrapolations.
- Label drift: the distribution of the target itself changes (for example, fraud becomes markedly more or less frequent), even if the relationship between features and target stays the same. Class proportions in production no longer match those the model was trained and calibrated on.
- Concept drift: the relationship between features and target changes. What predicted fraud a year ago no longer predicts it because fraud patterns have evolved; the model is still technically correct with respect to what it learned, but what it learned is no longer valid. Models trained with pre-pandemic data failed in demand, logistics and health applications during 2020–2021 for exactly this reason: there was no development error, the world simply stopped behaving like the training set.
What to monitor in production
- Distribution of input features: compare the current distribution with that of the training set using statistical tests (KS test, Population Stability Index). A significant drift is a signal that predictions may be in unvalidated territory. A sketch of both checks follows this list.
- Distribution of predictions: if the model starts predicting almost always the same class, or if the probability distribution changes, something is changing in the data or the model.
- Performance on ground truth when available: when the actual results of predictions become available (fraud is confirmed, churn occurs or does not), recalculate the model's metrics on that sample.
- Latency and operational errors: a model that starts taking longer than expected or that produces inference errors has infrastructure problems that affect both availability and prediction quality.
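A sketch of both drift checks on one feature. The `psi` helper is a hypothetical implementation written here for illustration (it is not a library function), the data is synthetic, and the 0.25 threshold is only a commonly cited rule of thumb:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a current sample."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))     # deciles of the reference
    e = np.histogram(expected, cuts)[0] / len(expected)
    a = np.histogram(np.clip(actual, cuts[0], cuts[-1]), cuts)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)          # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)  # distribution at training time
live_feature = rng.normal(loc=0.4, scale=1.2, size=10_000)   # distribution observed in production

stat, p_value = ks_2samp(train_feature, live_feature)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.2e}")
print(f"PSI = {psi(train_feature, live_feature):.3f}")       # > 0.25 is often treated as major drift
```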
When to retrain
Retraining should not be a manual process triggered by intuition. The ideal is to define thresholds: if the estimated performance drops more than X% from the baseline, or if feature drift exceeds a statistical threshold, the system triggers an automatic retraining process or alerts a responsible party to review. The specific thresholds depend on the domain and the cost of incorrect predictions versus the cost of retraining.
Frequently asked questions about ML model selection and evaluation
How to select the best machine learning model?
There is no universally better model. Selection depends on the data type, volume, interpretability constraints and the cost of errors in your domain. The critical step is to first define which metric reflects real success before comparing models.
Why can a model with good accuracy be useless?
Because accuracy ignores class distribution. With imbalanced data, a model can always predict the majority class, get 80–95% accuracy and completely fail to detect the class that matters. MCC and PR AUC reveal this; accuracy hides it.
What is the difference between MAE and RMSE?
MAE penalizes all errors linearly; RMSE penalizes large errors quadratically. With outliers, RMSE can be substantially higher than MAE for the same model. If large errors are operationally dangerous, RMSE is more appropriate. If they are inconsequential anomalies, MAE gives a more honest estimate of the typical error.
What is data leakage and how to avoid it?
Data leakage occurs when test information contaminates training, producing artificially good metrics that do not reproduce in production. The most robust way to prevent it is to use pipelines where all transformations are fitted exclusively with training data and applied to test without refitting.
When to use cross-validation?
When the dataset is of moderate size and you need a stable performance estimator. With time series or spatial datasets, using standard cross-validation is an error: the split must respect the temporal or geographic structure of the data to avoid overestimating generalization.
How to know if a model is ready for production?
When it is reproducible, stable across folds, has metrics aligned to the real operational objective and there is a monitoring plan to detect degradation after deployment. Maximum accuracy is not the criterion: reliability and operability are.
If you have a model that works well in evaluation but does not generate the expected impact when used in practice, the problem is usually in the evaluation design, not in the model. You can reach out to me directly: often reviewing the metric and the split is enough to diagnose the problem.