Data Cleaning in Machine Learning: A Complete Guide to Data Wrangling and Preprocessing in Python
At what point do you decide you've cleaned a dataset "enough"? Data cleaning in machine learning is not just a preliminary step before modeling. In practice, it's the factor that most influences whether a model works or fails in real-world scenarios.
The problem is usually not technical, but one of decision: what to clean, when to do it, and how far to go. The literature agrees: works like MLClean: A Data Cleaning System for Machine Learning show that data quality directly impacts model accuracy, robustness, and fairness. Other studies confirm that preprocessing is not a single phase, but a complete system of decisions.
This article is a pillar page: a complete map of the problem that lets you understand what decisions exist, how they connect, and when to go deeper on each one.
Table of contents:
- What data cleaning is and why it's critical
- Types of problems that appear when cleaning data
- How to decide between imputing or dropping missing values
- How to detect and handle outliers in a pipeline
- When to apply normalization or standardization
- How to handle imbalanced datasets
- How to avoid data leakage during preprocessing
- How to build reproducible data wrangling pipelines
- How to validate that cleaning improves the model
- How to avoid bias introduced by cleaning
- Advanced problems in real data beyond basic cleaning
- Real experience: how decisions are made in projects
- Frequently asked questions
🟢 What is data cleaning in machine learning and why is it critical in real models?
Data cleaning in machine learning (in Python, the focus of this guide) is the process of transforming raw data into data that models can use. But in real projects, this translates into something more concrete: deciding what information to keep, what to transform, and what to discard.
What does data wrangling in ML include?
In terms of ML data wrangling, the process covers four major tasks:
- Error detection: identifying inconsistencies, corrupted entries, and anomalies in raw data before any transformation.
- Handling missing values: deciding whether to impute, drop, or keep records with missing data based on the problem context.
- Outlier treatment: evaluating whether extreme values are noise or signal, and acting accordingly.
- Consistency validation: ensuring the data follows domain rules (valid ranges, correct types, relationships between variables).
According to the Machine Learning Data Preprocessing Review (MDPI), this process is critical for two fundamental reasons:
- Models don't distinguish between signal and noise: they learn from whatever you give them, regardless of whether it's correct.
- Any error in the data is amplified in the model: a small bias in training data can become a systematic bias in predictions.
What problems does data cleaning in machine learning actually solve?
In practice, the most common problems are:
- Missing values: data that wasn't recorded, was lost, or simply doesn't exist for certain records.
- Outliers (extreme values): observations that deviate significantly from the rest, whether due to error or being valid rare cases.
- Inconsistent data: entries that contradict other variables or domain rules.
- Duplicates: repeated records that can artificially inflate certain patterns.
The key point is this: the challenge is not detecting them — it's deciding what to do with them.
What is the difference between data cleaning, data wrangling, and preprocessing?
In real projects, the three terms are used differently but are related:
- Data cleaning: fixing specific errors in the dataset (nulls, duplicates, inconsistencies).
- Preprocessing: preparing data so the model can consume it (scaling, encoding, transformations).
- Data wrangling: the complete process that includes both, from raw data to model-ready data.
🟢 What types of problems appear when cleaning data in machine learning?
How do you identify missing values in a dataset and why do they occur?
Missing values in machine learning are not just absent data. They can indicate three very different things, and treating them without understanding which case applies can introduce serious biases:
- Capture errors: the data existed but wasn't recorded correctly (a form that failed, a sensor that didn't respond).
- Incomplete processes: the information hadn't been generated yet at the time of recording (for example, the result of an exam not yet taken).
- Real system behavior: the absence of data is itself a signal (a customer who bought nothing has a null value in "purchase amount," but that null carries meaning).
This is where judgment comes in: not all missing values should be treated the same way.
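Before choosing a treatment, it helps to quantify where the nulls are and whether they cluster in a subgroup. A minimal sketch in pandas, with an invented toy dataset and column names that are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "purchase_amount" is missing for some customers
df = pd.DataFrame({
    "customer_type": ["new", "new", "returning", "returning", "returning"],
    "purchase_amount": [np.nan, 120.0, 80.0, np.nan, 95.0],
})

# Share of missing values per column
print(df.isna().mean().round(2))

# Is the missingness concentrated in one subgroup, or spread evenly?
print(df.groupby("customer_type")["purchase_amount"].apply(lambda s: s.isna().mean()))
```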
What are outliers in machine learning and how do they affect the model?
Outliers in machine learning fall into two categories with completely different implications:
- Noise (errors): values that shouldn't exist, generated by measurement errors, incorrect capture, or system failures. Removing them improves the model.
- Signal (important rare events): extreme values that are real and represent cases the model must learn to detect (fraud, critical failures, system anomalies). Removing them destroys the model.
Removing outliers without context can eliminate valuable patterns. This is widely documented in the preprocessing literature (Frontiers, ScienceDirect).
How do you detect duplicate or inconsistent data in a dataset?
This is one of the most underestimated problems. From real experience, it has consequences that aren't always obvious:
- Duplicates: can artificially inflate certain patterns, making the model learn to "repeat" answers for records it already saw, which leads to overestimating performance on validation.
- Inconsistencies: can simulate false distributions. For example, if the same entity appears with different values for the same variable at different times without logical justification, the model learns a distribution that doesn't exist in reality.
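A quick way to surface both issues with pandas, on a small hypothetical example (the columns and domain rule are made up for illustration):

```python
import pandas as pd

# Hypothetical records: the same order appears twice, and one row
# violates a domain rule (negative quantity)
df = pd.DataFrame({
    "order_id": [101, 101, 102, 103],
    "quantity": [2, 2, -1, 5],
})

# Exact duplicates (identical values in every column)
print(df[df.duplicated(keep=False)])

# Duplicates on a key column only, even if other columns differ
print(df[df.duplicated(subset=["order_id"], keep=False)])

# Consistency check against a domain rule: quantities must be positive
print(df[df["quantity"] <= 0])
```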
🟡 How do you decide between imputing or dropping missing values in machine learning?
This is where real value starts to be added. The decision is a trade-off between losing information or introducing bias. There is no universal rule; context defines which option is best.
When should you drop rows with missing values?
- When you have enough data volume and losing those records doesn't affect the representativeness of the dataset.
- When the percentage of nulls is low (typically less than 5%) and records with nulls don't follow a systematic pattern.
- When the nulls correspond to clear capture errors and have no informational value.
When should you impute missing values?
- When you have little data and dropping records would significantly reduce the dataset.
- When the nulls follow a pattern (for example, they're always missing for the same type of user), since dropping those records would introduce bias.
- When the variable is important for the model and losing that information affects prediction quality.
What the technique doesn't tell you
The key is not the imputation technique (mean, median, predictive model), but this: you must interpret the results accounting for the assumption you made. If you imputed with the mean, you assumed the missing values are similar to the average. If that's not true in your domain, the model will learn an incorrect distribution.
This type of decision is exactly what the literature (MLClean) highlights as critical: it's not enough to apply the technique; you have to justify it and understand its implications.
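As an illustration of that discipline, here is a minimal imputation sketch with scikit-learn's SimpleImputer; the data and the choice of the median are assumptions for the example, not a recommendation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature with missing values: the first four rows act as
# "train", the last two as "test"
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_test = np.array([[np.nan], [100.0]])

# The median is a common choice for skewed distributions; the choice itself
# is an assumption you must be able to justify in your domain
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)  # statistic learned on train only
X_test_imp = imputer.transform(X_test)        # applied to test, never refitted

print(imputer.statistics_)  # the value assumed for every missing entry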
🟡 How do you detect and handle outliers in a machine learning pipeline?
Most common detection methods
In theory, statistical methods for detecting outliers are well known:
- IQR (Interquartile Range): flags as an outlier any value more than 1.5 times the interquartile range above the third quartile or below the first. Works well for approximately symmetric distributions.
- Z-score: identifies values more than 2 or 3 standard deviations from the mean. Assumes a normal distribution, which limits its applicability to real data.
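A minimal sketch of both rules on a hypothetical numeric column (thresholds follow the conventions above):

```python
import pandas as pd

# Hypothetical column with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 14, 11, 300])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
print(s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)])

# Z-score rule: more than 3 standard deviations from the mean.
# Note that a single extreme value inflates the standard deviation and can
# mask itself, one reason this rule is fragile on real data
z = (s - s.mean()) / s.std()
print(s[z.abs() > 3])
```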
Why context defines the decision
In practice, context defines whether an outlier is error or signal. Statistical methods only tell you something is unusual; they don't tell you whether to remove it. A common mistake in projects:
- Automatically removing outliers without reviewing their nature in the domain.
- Losing rare but important patterns (fraud, critical failures, extreme cases of interest).
- Not documenting which outliers were removed and why, making the pipeline impossible to audit.
The right question is not "is this value an outlier?" but "should this value be in my training data?"
🟡 When should you apply normalization or standardization in machine learning?
Which models need feature scaling?
Feature scaling in machine learning depends on the model you use — it's not a universal decision:
- Linear and gradient-based models (logistic regression, SVM, neural networks): sensitive to scale. Without scaling, variables with larger ranges dominate learning.
- Decision trees and Random Forest: less sensitive to scale because they make binary splits; scaling doesn't change the relative order of values.
- K-Nearest Neighbors and distance-based algorithms: highly sensitive to scaling; without normalization, higher-magnitude variables dominate the calculated distances.
The most important practical rule
Never scale before treating outliers. If you have an extreme outlier and apply min-max normalization, that extreme value compresses all other values into a very small range. The result is a transformation that distorts more than it helps.
The correct order in a pipeline is:
- First: handle missing values.
- Second: detect and treat outliers based on context.
- Third: apply scaling or normalization based on the model.
- Fourth (critical): apply all transformations using only training data, then transform the test set with the learned parameters.
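Two small sketches illustrate why this order matters (the numbers are invented for the example): first, how a single untreated outlier distorts min-max normalization; second, how scaling parameters are learned from the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# 1) An untreated outlier compresses every other value into a tiny range
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
print(MinMaxScaler().fit_transform(x).ravel())
# -> approximately [0.000, 0.001, 0.002, 0.003, 1.000]

# 2) Scaling parameters come from the training split only
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])
scaler = StandardScaler().fit(X_train)    # mean/std learned on train
print(scaler.transform(X_test))           # test transformed with those parameters
```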
🟡 How do you handle imbalanced datasets in machine learning?
Class imbalance is one of the most common and most underestimated problems in classification projects.
The most common mistake: using accuracy as the metric
With a dataset where 95% of examples are class A and 5% are class B, a model that always predicts A achieves 95% accuracy. But it fails 100% of the time at detecting class B, which is often the most important one (fraud, disease, failures).
Strategies for imbalanced datasets
- Oversampling the minority class: generating new examples of the underrepresented class (SMOTE is the most widely used technique). Useful when you have few examples of that class.
- Undersampling the majority class: reducing examples of the dominant class to balance the dataset. Useful when you have plenty of data and can afford to lose some.
- Adjusting model weights: many algorithms (Random Forest, XGBoost, logistic regression) allow assigning greater weight to the minority class during training without modifying the data.
- Changing the evaluation metric: using F1-score, AUC-ROC, or precision/recall instead of accuracy. The right metric depends on the actual cost of each type of error in your domain.
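As an example of the weighting strategy (with a synthetic imbalanced dataset standing in for real data), scikit-learn lets you shift the class weights without touching the records:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real problem: ~95% class 0, ~5% class 1
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" penalizes mistakes on the minority class more,
# without generating or removing any records
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate per-class precision and recall, not overall accuracy
print(classification_report(y_test, clf.predict(X_test)))
```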
🔴 How do you avoid data leakage during data preprocessing?
This is one of the most serious mistakes in machine learning, and the hardest to detect because the model appears to work perfectly during training and validation.
What data leakage is and why it destroys a model
Data leakage occurs when information from the test set "leaks" into the model during training. The model learns to use information that wouldn't exist in production, generating artificially good evaluation metrics that don't reproduce in the real world.
The most common leakage cases in preprocessing
- Scaling with statistics from the full dataset: if you calculate the mean and standard deviation of the entire dataset (including test) to normalize, the model has seen information from the test data. Always calculate scaling parameters using only training data.
- Imputing before the split: if you fill missing values using the mean of the full dataset before splitting into train and test, you're introducing test information into the training set.
- Variables with future information: including variables that only exist because you already know the outcome (for example, including the final diagnosis as an input when the goal is to predict the diagnosis).
- Duplicates that cross the split: if the same record appears in both train and test, the model has already "seen" that data and the metrics are unreliable.
The literature (MLClean) identifies this as a critical source of error because it's silent: the pipeline runs without errors, the metrics look good, and the problem is only detected when the model fails in production.
The structural solution: use pipelines
The most robust way to avoid leakage is to build pipelines where all transformations are fitted exclusively on training data and then applied to the test set without refitting. Frameworks like scikit-learn Pipeline are designed specifically for this.
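A minimal sketch of that idea on synthetic data: inside cross-validation, each fold refits the imputer and scaler on its own training portion, so no statistics from the held-out fold leak into training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
X[::20, 0] = np.nan  # hypothetical missing values in one feature

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Imputer and scaler are fitted inside each training fold only
print(cross_val_score(pipe, X, y, cv=5, scoring="roc_auc"))
```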
🔴 How do you build reproducible data wrangling pipelines in machine learning?
In production, data cleaning is not manual: it's a system. A poorly built preprocessing pipeline is as serious as a poorly trained model, because if you can't reproduce the exact same transformation on new data, the model fails.
What a reproducible pipeline involves
- Automated pipelines: each transformation (imputation, scaling, encoding) is encapsulated in an object that can be serialized, saved, and loaded. Scikit-learn Pipeline is the standard in Python.
- Parameters learned only from training data: the statistics used for transforming (mean, deviation, ranges) are calculated once with training data and applied consistently to any new data.
- Continuous validation: the pipeline includes checks that detect when input data deviates from the expected distribution (data drift), before the model makes a prediction with data that no longer represents what it learned.
- Documentation of every decision: what was imputed, with what strategy, which outliers were removed and why. Without this documentation, the pipeline is not auditable.
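As a sketch of what "serializable" means in practice (file name and data are made up), a fitted pipeline can be saved with joblib and reloaded so that new data goes through exactly the same learned parameters:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# The scaler's learned mean/std are stored inside the serialized object,
# so production data receives exactly the same transformation
joblib.dump(pipe, "cleaning_pipeline.joblib")
loaded = joblib.load("cleaning_pipeline.joblib")
print(loaded.predict(X[:5]))
```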
The difference between a research pipeline and a production pipeline
In a research notebook, cleaning data manually is sufficient. In production, that same process has to run identically every time new data arrives, without human intervention and with error handling. The distance between the two is greater than it seems, and it's one of the reasons why many models that work in research fail when deployed.
🔴 How do you validate that data cleaning actually improves the model?
Here's a key point from experience: you don't clean data for the sake of cleaning — you clean to improve the model. Every cleaning decision must be measurable in terms of its impact on performance.
How to measure the impact of each cleaning decision
- Establish a baseline before cleaning: train a model on the raw data (or with minimal cleaning) and record its metrics. That's your reference point.
- Apply one cleaning decision at a time: if you change multiple things simultaneously, you can't know what improved the model and what made it worse.
- Evaluate on the validation set, not on train: a cleaning step that improves training metrics but not validation metrics is probably introducing bias.
- Check if results are stable: use cross-validation. If the model consistently improves across all folds, the cleaning decision is robust.
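A sketch of that workflow on synthetic data, where the two pipelines differ in exactly one cleaning decision (the imputation strategy):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, random_state=1)
X[::15, 2] = np.nan  # hypothetical missing values

# Baseline: minimal cleaning (fill missing values with a constant)
baseline = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Candidate: the single cleaning decision being tested (median imputation)
candidate = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", LogisticRegression(max_iter=1000)),
])

# Compare on the same folds: keep the decision only if the gain is stable
print(cross_val_score(baseline, X, y, cv=5, scoring="f1").mean())
print(cross_val_score(candidate, X, y, cv=5, scoring="f1").mean())
```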
When is the dataset "clean enough"?
A practical heuristic: the dataset is clean enough when the model produces interpretable and stable results. Not when the data is perfect, but when you can no longer improve the model by cleaning further.
This aligns with the literature: the goal of preprocessing is to improve performance, not perfect the dataset. Continuing to clean when the model no longer improves is a waste of resources and can introduce new biases.
🔴 How do you avoid bias introduced by data cleaning?
Dropping or transforming data can introduce biases that didn't exist in the original data. This is one of the most invisible effects of preprocessing.
Sources of bias in data cleaning
- Removing outliers that represent minorities: if extreme values correspond to infrequent population groups (patients with rare diseases, atypical customer transactions), removing them makes the model learn to ignore those groups.
- Imputing with the mean when the distribution is not uniform: if nulls are not random but concentrated in a specific subgroup, imputing with the overall mean introduces a systematic distortion for that subgroup.
- Dropping records with nulls that follow a pattern: if missing data is not random (MNAR: Missing Not At Random), dropping those records changes the dataset distribution in a non-representative way.
- Applying different transformations to subgroups: if data from different demographic groups is processed with different parameters, the model can learn artificial differences between groups.
This connects to work like MLClean, where fairness is studied: data cleaning is not a neutral process. Every decision has consequences for which groups are well represented — and which are not — in the final model.
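One concrete check that catches the second and third cases (the data and group labels are invented for the example): measure whether missingness is concentrated in a subgroup before deciding to impute or drop.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: income is missing far more often for group B
df = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 4,
    "income": [50, 52, np.nan, 49, 51, 48, np.nan, np.nan, np.nan, 60],
})

# Missing-value rate per subgroup: if this is very uneven, the nulls are not
# random and a global-mean imputation will distort that subgroup
print(df.groupby("group")["income"].apply(lambda s: s.isna().mean()))
print("global mean a naive imputation would use:", df["income"].mean())
```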
🔴 What advanced problems exist in real data beyond basic cleaning?
In production, problems appear that don't exist in academic datasets or Kaggle competitions. These are the problems that separate those who work with ML in real environments from those who don't.
Data drift: when the world changes and the model doesn't
Data drift occurs when the distribution of input data changes over time, but the model keeps operating on the assumptions learned during training. A real example: training with summer data and predicting in winter generates clear biases because behavioral or consumption patterns are different.
- Feature drift: the distribution of input variables changes (users now behave differently than they did a year ago).
- Label drift: the relationship between input variables and the target variable changes (what used to predict fraud no longer does).
- Concept drift: the concept the model learned is no longer valid in the current context.
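A minimal sketch of a feature-drift check, with synthetic distributions standing in for the training data and the data arriving months later, using a two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins: the feature at training time vs. months later
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
recent_feature = rng.normal(loc=0.6, scale=1.2, size=1000)

# A very small p-value suggests the feature's distribution has shifted
# since training, and the model's assumptions may no longer hold
stat, p_value = ks_2samp(train_feature, recent_feature)
print(f"KS statistic={stat:.3f}, p-value={p_value:.4g}")
```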
Non-representative data
One of the most frequent problems in real projects is training with data that doesn't faithfully represent the problem you want to solve. This can occur due to:
- Selection bias in data collection (only certain types of cases were recorded).
- Changes in the business process that make historical data no longer reflect current operations.
- Differences between the environment where data was collected and the environment where the model will be deployed.
🧠 Real experience: how decisions are made in projects
This is where real ML judgment is built. Cleaning decisions are not made by an algorithm: they're made by a person with context.
How do you decide which data problems to solve first?
In real projects, the process has three steps that cannot be skipped:
- Identify problems: initial exploration of the dataset (distributions, missing values, duplicates, data types, inconsistencies). This stage is descriptive, not prescriptive.
- Assess impact: for each problem found, estimate how large its potential effect on the model could be. A 0.1% rate of duplicates has a different impact than 30% missing values in the target variable.
- Prioritize: tackle the highest-impact problems first, not the easiest ones to fix. Not everything needs to be cleaned; cleaning everything can be worse than cleaning only what matters.
When should you stop cleaning data?
This breaks with intuition: not when the data is "perfect," but when the model works well. Cleaning beyond that point can introduce biases, eliminate useful signal, or simply add no measurable value.
A practical indicator: if the last three cleaning decisions you made didn't improve validation metrics, you've probably finished.
How do you detect problems that don't show up in exploratory analysis?
The short answer: by talking to the data owner. Context is not in the dataset; it's in the business or system that generates it. A field that looks numeric may have a sentinel value (like -1 or 9999) that means "data not available" — something no statistical analysis will automatically detect. A date range that looks correct may include a period when the capture system had a known bug.
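For example (with a made-up column and sentinel convention): once the data owner confirms what the sentinel means, the fix is trivial, but finding it required the conversation.

```python
import numpy as np
import pandas as pd

# Hypothetical column where -1 and 9999 mean "not available"
df = pd.DataFrame({"age": [34, -1, 52, 9999, 41]})

print(df["age"].describe())  # the sentinels silently distort mean, min and max

# Convert sentinels to real missing values only once the convention is confirmed
df["age"] = df["age"].replace({-1: np.nan, 9999: np.nan})
print(df["age"].describe())
```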
What's the difference between academic and real datasets?
Contrary to what many think, both can be equally messy. The key difference is this:
- With real datasets you can ask: you have access to the person or team that generated the data and can understand the context behind each anomaly.
- With academic datasets there's nobody to ask: you have to infer context from available documentation, which is sometimes incomplete.
That difference completely changes how a real dataset is cleaned: you're not guessing what a strange value means — you can actually know.
Frequently asked questions about data cleaning in machine learning
What is data cleaning in machine learning?
It's the process of transforming raw data into usable data for models. It includes handling missing values, treating outliers, detecting duplicates and inconsistencies, and validating that the data faithfully represents the problem you want to solve.
When should you impute or drop missing values in machine learning?
It depends on data volume and context. With plenty of data and nulls that don't follow a systematic pattern, dropping records is usually safe; with little data or patterned nulls, imputation is the better option. The most important thing is to interpret results accounting for the assumption you made when handling the missing values.
What is data leakage and why is it so serious?
Data leakage occurs when the model accesses test set information during training. The most common example is scaling data before the train/test split. The result is a model that appears to work well in evaluation but fails in production.
When should you apply normalization or standardization?
It depends on the model. Linear and distance-based models are sensitive to scale; decision trees are not. The most important rule: never scale before treating outliers, as they distort the scaling parameters.
How do you know when the dataset is clean enough?
When the model produces interpretable and stable results. Not when the data is perfect, but when cleaning further no longer improves validation metrics. The goal of preprocessing is to improve model performance, not to achieve a perfect dataset.
How do you build reproducible data wrangling pipelines?
Using tools like scikit-learn Pipeline where each transformation is fitted only on training data and applied identically to any new data. The pipeline should include continuous validation to detect data drift when input data changes over time.
Conclusion: data cleaning is a problem of decisions, not techniques
If both experience and the literature make one thing clear, it's this: there is no single way to clean data. Techniques (IQR, mean imputation, SMOTE) are tools. The decisions are yours.
Everything comes down to three factors that are always present:
- Context: where the data comes from, how it was generated, and what each anomaly means in the problem domain.
- Objective: what you want to predict, what errors are most costly, and what metrics reflect real success in your use case.
- Trade-offs: every cleaning decision has a cost. Dropping data reduces volume. Imputing introduces assumptions. Keeping outliers adds noise. No decision is without consequence.
That is exactly what differentiates applying techniques from doing real machine learning.
If you have a project underway and data cleaning is becoming a blocker, the problem is often in the pipeline design, not in the data itself. You can reach out directly and we'll review together where the critical point is.