Identify future information leaking into your dataset before wasting weeks training an invalid model
⚡ Use it when: You build aggregated features, work with temporal data, or your training accuracy is suspiciously high
Leakage = using information that wouldn't exist at real prediction time. It's not overfitting — it's contaminating your training set with future signals you won't have in production.
Common causes: normalizing with statistics from the entire dataset, including post-event variables as features, using random splits on temporal data, and computing aggregates without a strict temporal cutoff.
What you get instead: strict temporal validation, a feature pipeline without contamination, and models that actually work in production without dramatic performance drops.
📄 Detect a Poorly Defined Target (Python)

from sklearn.metrics import cohen_kappa_score

# Validate label consistency
def validate_target_consistency(labels_expert1, labels_expert2):
    """
    Measures inter-rater agreement to detect ambiguity in the target definition.
    """
    kappa = cohen_kappa_score(labels_expert1, labels_expert2)
    if kappa < 0.6:
        print(f"⚠️ WARNING: Kappa = {kappa:.3f}")
        print("Target definition is conceptually ambiguous")
    else:
        print(f"✓ Good agreement: Kappa = {kappa:.3f}")
    return kappa

# Example usage
expert1 = [1, 0, 1, 1, 0, 1]
expert2 = [1, 1, 1, 0, 0, 1]
kappa_score = validate_target_consistency(expert1, expert2)
🔧 Complete Target Validation Suite (Python)

from scipy.stats import ks_2samp, pearsonr

# Target validation
def check_target_distribution(df, target_col):
    """Summarizes the target distribution and flags extreme imbalance."""
    distribution = df[target_col].value_counts(normalize=True)
    print("Target Distribution:")
    print(distribution)
    # Check for extreme class imbalance
    if distribution.min() < 0.05:
        print("⚠️ Extreme class imbalance detected")
    return distribution

# Target monitoring in production
def monitor_target_drift(baseline_dist, current_dist):
    """Compares current target values against the baseline (two-sample KS test)."""
    statistic, p_value = ks_2samp(baseline_dist, current_dist)
    if p_value < 0.05:
        print(f"🚨 DRIFT DETECTED: p-value = {p_value:.4f}")
        print("Consider retraining or redefining the target")
    else:
        print(f"✓ No significant drift: p-value = {p_value:.4f}")
    return p_value

# A/B test validation
def validate_ab_alignment(model_predictions, business_outcomes):
    """Validates that predictions correlate with real business outcomes."""
    corr, p_value = pearsonr(model_predictions, business_outcomes)
    if corr < 0.7:
        print(f"⚠️ Low correlation: {corr:.3f}")
        print("Target may not align with business goals")
    else:
        print(f"✓ Good alignment: r = {corr:.3f}")
    return corr
Visual Example: Temporal Split vs Random Split
Notice how random splits contaminate validation with future information,
while temporal splits maintain the real prediction scenario.
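The contrast above can be sketched on a toy dataset (the `timestamp` column name and dates are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy temporal dataset: ten daily observations
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "feature": range(10),
})

# Random split: rows are shuffled, so future dates can land in training
train_rand, test_rand = train_test_split(df, test_size=0.3, random_state=42)

# Temporal split: everything strictly before the cutoff is training data
cutoff = pd.Timestamp("2024-01-08")
train_tmp = df[df["timestamp"] < cutoff]
test_tmp = df[df["timestamp"] >= cutoff]

# With the temporal split, the latest training date always precedes
# the earliest validation date; the random split gives no such guarantee
print(train_tmp["timestamp"].max(), "<", test_tmp["timestamp"].min())
```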
Key Formula: Temporal Validation Split
Formula 1
Train = { Xᵢ | tᵢ < T_cutoff }
The training set includes only observations whose timestamp is strictly
before the cutoff date — no future information allowed.
Formula 2
μ_train = (1 / n_train) · Σ_{i ∈ Train} Xᵢ
Normalization parameters (mean, std) must be computed ONLY on training data,
never on the full dataset.
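In code, Formula 2 corresponds to fitting the scaler on the training rows only. A minimal sketch using scikit-learn's `StandardScaler` (the synthetic data and the 70/30 cutoff are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Temporal cutoff: first 70 rows are "past" (train), last 30 are "future" (test)
X_train, X_test = X[:70], X[70:]

# Correct: mu_train and sigma_train come from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # same parameters applied to test

# Leaky: statistics computed on the full dataset include test information
leaky_scaler = StandardScaler().fit(X)

# The parameter sets differ: the leaky ones "know" the future rows
print(scaler.mean_)
print(leaky_scaler.mean_)
```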
🔍 Deep Dive
5 Red Flags for Data Leakage
Training accuracy > 95% — Suspiciously perfect performance often indicates future information in features
Feature importance inconsistent with domain — If random IDs rank highest, you're likely leaking identity information
Validation metrics don't match production — Classic sign that validation set was contaminated
Time-based features without lag — Using t+1 information to predict at time t
Preprocessing on full dataset — Normalizing or encoding before train/test split leaks statistics
Understanding Data Leakage in Machine Learning
What is Data Leakage and Why It Matters
Data leakage is one of the most insidious problems in machine learning.
It occurs when information from outside the training dataset is used to
create the model, leading to overly optimistic performance metrics
that don't translate to production.
Unlike overfitting, which happens during model training, leakage happens
during data preparation. The model learns patterns that won't exist when
making real predictions, resulting in dramatic performance drops in production.
Types of Data Leakage
There are two main categories of leakage you need to watch out for:
Target leakage occurs when your
features contain information about the target that wouldn't be available
at prediction time. For example, using a "refund_issued" feature to
predict customer churn when refunds only happen after churn.
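One quick screen for this kind of leak, sketched here on synthetic data (the feature names `usage_trend` and `refund_issued` are hypothetical), is to compute each feature's single-feature AUC against the target and flag anything near-perfect:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 500
y = rng.integers(0, 2, size=n)               # churn label (toy data)
usage_trend = rng.normal(size=n) + 0.5 * y   # honestly, weakly predictive
refund_issued = y.astype(float)              # post-event flag: a perfect leak

for name, feature in [("usage_trend", usage_trend),
                      ("refund_issued", refund_issued)]:
    # AUC of the raw feature used as a ranking score (flip if anti-correlated)
    auc = max(roc_auc_score(y, feature), 1 - roc_auc_score(y, feature))
    flag = "🚨 possible leak" if auc > 0.95 else "ok"
    print(f"{name}: single-feature AUC = {auc:.2f} ({flag})")
```

A legitimate feature rarely separates the classes on its own; one that does deserves a hard look at when its value is actually recorded.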
Train-test contamination happens
when information from the test set influences the training process, most
commonly through preprocessing steps applied to the entire dataset.
"If your model is too good to be true, it probably is. Data leakage is
the most common reason for unrealistic performance in ML projects."
— Applied ML Best Practices
Common Scenarios Where Leakage Occurs
The following situations are red flags
that should trigger immediate investigation:
💡Pro tip: Always perform a temporal split validation
when working with time-series data. Random splits will almost always
introduce leakage.
⚠️Critical mistake: Normalizing your entire dataset
before splitting into train/test sets leaks statistical information
from the test set into training.
Key indicators include: training accuracy
exceeding 95%, perfect ROC-AUC
scores, features that shouldn't
be predictive ranking highest, and
dramatic performance drops in production.
For example, if you use df.fillna(df.mean())
before splitting, you're computing the mean using future data. Instead,
use train_mean = X_train.mean()
and then apply it to both sets.
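A sketch of that fix on a toy DataFrame (values chosen so the leak is visible):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 50.0, np.nan]})

# Temporal split: first four rows are training data, last two are test data
train, test = df.iloc[:4].copy(), df.iloc[4:].copy()

# Leaky: the fill value is computed on all rows, including the test outlier 50.0
leaky_fill = df["x"].mean()       # (1 + 2 + 4 + 50) / 4 = 14.25

# Correct: compute the statistic on the training set only...
train_mean = train["x"].mean()    # (1 + 2 + 4) / 3 ≈ 2.33
# ...then apply that same value to both sets
train["x"] = train["x"].fillna(train_mean)
test["x"] = test["x"].fillna(train_mean)

print(f"leaky fill value: {leaky_fill}, correct fill value: {train_mean:.2f}")
```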
In production, models with leakage typically see performance degradation
of 30-70% compared to validation metrics,
rendering them completely useless
for business applications.
Prevention Strategy
The key to preventing leakage is establishing a strict temporal boundary
and ensuring every preprocessing step respects that boundary.
This means computing statistics only on training data, using proper time-based
splits, and thoroughly auditing your feature engineering pipeline for any
operations that could introduce future information.
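One way to make that boundary hard to violate, sketched here with scikit-learn on synthetic data, is to wrap preprocessing and model in a `Pipeline`: each cross-validation fold then fits the scaler on its own training rows only, and `TimeSeriesSplit` keeps every validation fold strictly after its training fold.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Scaler and model are fitted together on each training fold only,
# so the scaler never sees the validation fold's statistics
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# TimeSeriesSplit: validation indices always come after training indices
scores = cross_val_score(pipe, X, y, cv=TimeSeriesSplit(n_splits=5))
print(scores.round(3))
```

Scaling inside the pipeline, rather than once on the full dataset before splitting, is exactly the discipline the formulas above demand — and it holds automatically for any estimator you swap in.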