
How to detect data leakage before training

Identify future information leaking into your dataset before wasting weeks training an invalid model

⚡ Use it when: You build aggregated features, work with temporal data, or your training accuracy is suspiciously high

Leakage = using information that wouldn't exist at real prediction time. It's not overfitting — it's contaminating your training set with future signals you won't have in production.

Common culprits: normalizing with statistics computed over the entire dataset, including post-event variables as features, using random splits on temporal data, and computing aggregates without a strict temporal cutoff.

What you gain: strict temporal validation, a contamination-free feature pipeline, and models that hold up in production without dramatic performance drops.

📄 Detect a Poorly Defined Target (Python)
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Validate label consistency between annotators
def validate_target_consistency(labels_expert1, labels_expert2):
    """
    Measures inter-rater agreement to detect ambiguity in the target definition
    """
    kappa = cohen_kappa_score(labels_expert1, labels_expert2)

    if kappa < 0.6:
        print(f"⚠️ WARNING: Kappa = {kappa:.3f}")
        print("Target is conceptually ambiguous")
    else:
        print(f"✓ Good agreement: Kappa = {kappa:.3f}")

    return kappa

# Example usage
expert1 = [1, 0, 1, 1, 0, 1]
expert2 = [1, 1, 1, 0, 0, 1]
kappa_score = validate_target_consistency(expert1, expert2)

🔧 Complete Target Validation Suite (Python)
from scipy.stats import ks_2samp, pearsonr

# Target Validation
def check_target_distribution(df, target_col):
    """Detects extreme imbalance in the target distribution"""

    distribution = df[target_col].value_counts(normalize=True)

    print("Target Distribution:")
    print(distribution)

    # Check for extreme imbalance
    if distribution.min() < 0.05:
        print("⚠️ Extreme class imbalance detected")

    return distribution

# Target Monitoring in Production
def monitor_target_drift(baseline_dist, current_dist):
    """Compares current target values against the baseline via a two-sample
    KS test (expects raw samples, not binned distributions)"""

    statistic, p_value = ks_2samp(baseline_dist, current_dist)

    if p_value < 0.05:
        print(f"🚨 DRIFT DETECTED: p-value = {p_value:.4f}")
        print("Consider retraining or target redefinition")
    else:
        print(f"✓ No significant drift: p-value = {p_value:.4f}")

    return p_value

# A/B Test Validation
def validate_ab_alignment(model_predictions, business_outcomes):
    """Validates that predictions correlate with real business outcomes"""

    corr, p_value = pearsonr(model_predictions, business_outcomes)

    if corr < 0.7:
        print(f"⚠️ Low correlation: {corr:.3f}")
        print("Target may not align with business goals")
    else:
        print(f"✓ Good alignment: r = {corr:.3f}")

    return corr

Visual Example: Temporal Split vs Random Split

Notice how random splits contaminate validation with future information, while temporal splits maintain the real prediction scenario.
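A minimal sketch of the difference, using a hypothetical timestamped dataset (column names and dates are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 100 daily observations with a timestamp
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=100, freq="D"),
    "feature": rng.normal(size=100),
})

# Temporal split: everything before the cutoff trains, the rest validates
cutoff = pd.Timestamp("2023-03-01")
train_t = df[df["timestamp"] < cutoff]
test_t = df[df["timestamp"] >= cutoff]

# Random split: rows are shuffled with no regard for time
train_r, test_r = train_test_split(df, test_size=0.25, random_state=0)

# The temporal split preserves the real prediction scenario...
print(train_t["timestamp"].max() < test_t["timestamp"].min())  # True
# ...while the random split puts future rows in the training set
print(train_r["timestamp"].max() > test_r["timestamp"].min())  # True
```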

Key Formula: Temporal Validation Split

Formula 1: Train = { X_i | t_i < T_cutoff }

Training set includes only observations where timestamp is strictly before the cutoff date — no future information allowed.

Formula 2: μ_train = (1/n_train) Σ_{i ∈ Train} X_i

Normalization parameters (mean, std) must be computed ONLY on training data, never on the full dataset.
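Both formulas can be applied together in a few lines; the tiny frame and cutoff below are hypothetical:

```python
import pandas as pd

# Hypothetical frame: timestamps t and one feature x
df = pd.DataFrame({
    "t": pd.date_range("2024-01-01", periods=8, freq="D"),
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Formula 1: the training set holds only rows strictly before the cutoff
T_cutoff = pd.Timestamp("2024-01-06")
train = df[df["t"] < T_cutoff]
test = df[df["t"] >= T_cutoff]

# Formula 2: the mean (and std) come from the training rows only
mu_train = train["x"].mean()          # mean of 1..5 -> 3.0, not of 1..8
sigma_train = train["x"].std(ddof=0)

# The same train-derived parameters normalize both sets
train_z = (train["x"] - mu_train) / sigma_train
test_z = (test["x"] - mu_train) / sigma_train
print(mu_train)  # 3.0
```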

🔍 Deep Dive

5 Red Flags for Data Leakage

  • Training accuracy > 95% — Suspiciously perfect performance often indicates future information in features
  • Feature importance inconsistent with domain — If random IDs rank highest, you're likely leaking identity information
  • Validation metrics don't match production — Classic sign that validation set was contaminated
  • Time-based features without lag — Using t+1 information to predict at time t
  • Preprocessing on full dataset — Normalizing or encoding before train/test split leaks statistics

Understanding Data Leakage in Machine Learning

What is Data Leakage and Why It Matters

Data leakage is one of the most insidious problems in machine learning. It occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance metrics that don't translate to production.

Unlike overfitting, which happens during model training, leakage happens during data preparation. The model learns patterns that won't exist when making real predictions, resulting in dramatic performance drops in production.

Types of Data Leakage

There are two main categories of leakage you need to watch out for:

Target leakage occurs when your features contain information about the target that wouldn't be available at prediction time. For example, using a "refund_issued" feature to predict customer churn when refunds only happen after churn.
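A quick way to catch this kind of target leakage is to compare event timestamps; the table and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical churn table: refunds are stamped AFTER the churn event,
# so any feature derived from refund_issued_at leaks the target
events = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "churned_at": pd.to_datetime(["2024-02-01", "2024-03-10", "2024-01-20"]),
    "refund_issued_at": pd.to_datetime(["2024-02-05", None, "2024-01-25"]),
})

# Audit: a candidate feature timestamped after the outcome is leakage
leaks = events["refund_issued_at"] > events["churned_at"]
print(events.loc[leaks, "customer_id"].tolist())  # [1, 3]
```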

Train-test contamination happens when information from the test set influences the training process, most commonly through preprocessing steps applied to the entire dataset.
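One standard guard against this kind of contamination, sketched here with scikit-learn on synthetic data, is to keep preprocessing inside a Pipeline so that cross-validation refits it per fold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# The scaler lives inside the pipeline, so during cross-validation it is
# fit on each training fold only -- test-fold statistics never leak in
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```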

"If your model is too good to be true, it probably is. Data leakage is the most common reason for unrealistic performance in ML projects." — Applied ML Best Practices

Common Scenarios Where Leakage Occurs

The following situations are red flags that should trigger immediate investigation:

💡 Pro tip: Always perform a temporal split validation when working with time-series data. Random splits will almost always introduce leakage.

⚠️ Critical mistake: Normalizing your entire dataset before splitting into train/test sets leaks statistical information from the test set into training.

Key indicators include: training accuracy exceeding 95%, perfect ROC-AUC scores, features that shouldn't be predictive ranking highest, and dramatic performance drops in production.

For example, if you use df.fillna(df.mean()) before splitting, you're computing the mean using future data. Instead, use train_mean = X_train.mean() and then apply it to both sets.
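That pattern looks like this in practice (the column name is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical feature with missing values
X_train = pd.DataFrame({"income": [30.0, np.nan, 50.0, 40.0]})
X_test = pd.DataFrame({"income": [np.nan, 60.0]})

# Compute the imputation value on training data only...
train_mean = X_train["income"].mean()  # (30 + 50 + 40) / 3 = 40.0

# ...then apply that same value to both sets
X_train = X_train.fillna({"income": train_mean})
X_test = X_test.fillna({"income": train_mean})
print(X_test["income"].tolist())  # [40.0, 60.0]
```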

In production, models with leakage typically see performance degradation of 30-70% compared to validation metrics, rendering them completely useless for business applications.

Prevention Strategy

The key to preventing leakage is establishing a strict temporal boundary and ensuring every preprocessing step respects that boundary.

This means computing statistics only on training data, using proper time-based splits, and thoroughly auditing your feature engineering pipeline for any operations that could introduce future information.
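One simple audit heuristic, offered here as an illustration rather than a complete method: a feature that nearly separates the target on its own is a prime leakage suspect, and a univariate ROC-AUC scan can surface it:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def audit_single_feature_auc(X, y, threshold=0.95):
    """Flag columns whose univariate ROC-AUC exceeds the threshold."""
    suspects = []
    for j in range(X.shape[1]):
        auc = roc_auc_score(y, X[:, j])
        auc = max(auc, 1 - auc)  # direction does not matter for the audit
        if auc > threshold:
            suspects.append(j)
    return suspects

# Synthetic demo: column 2 is a (leaky) near-copy of the target
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 3))
X[:, 2] = y + rng.normal(scale=0.01, size=300)
print(audit_single_feature_auc(X, y))  # [2]
```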
