
Sales Forecasting with Machine Learning: XGBoost Lag Features, RFM K-Means Customer Segmentation, and DBSCAN Anomaly Detection (Python)

Three ML modules in one pipeline: monthly demand forecasting per product with ACF-guided lag features, RFM customer segmentation with K-Means, and anomaly detection with IQR, K-Means, and DBSCAN — built on synthetic automotive retail data.

25 min read · Python · Scikit-learn · XGBoost · Data: synthetic automotive retail 2020–2024 (preserves the real data's distributions) · Applicable to business, industrial engineering, and operations management theses

Project Summary

What will you find in this project?

  • How to generate realistic synthetic retail data that preserves the statistical distribution of real client data when confidentiality prevents sharing the original.
  • Sales forecasting per product using ACF-guided lag features and a five-model benchmark (Decision Tree, RF, GBM, XGBoost, AdaBoost) evaluated with MAPE.
  • RFM customer segmentation with K-Means, elbow method, and Silhouette score to select the optimal number of clusters.
  • Anomaly detection with three methods compared head-to-head: IQR, K-Means distance, and DBSCAN with k-NN epsilon calibration.

This project builds three ML modules on data from an automotive retail business with four branches. Because the client's real transaction data is confidential, a synthetic data generator was engineered to preserve the original statistical structure: branch-level demand patterns, pandemic impact (2020–2021), and per-customer and per-product purchase probabilities. The result is a fully replicable open pipeline that reflects the complexity of a real retail problem without exposing sensitive information.

Why this pipeline is valuable for academic and applied research: It integrates three distinct ML problem types — supervised regression, unsupervised clustering, and unsupervised anomaly detection — in a single coherent workflow. The synthetic data generation strategy is itself a methodological contribution: it demonstrates how to produce publishable, reproducible research when real data carries confidentiality constraints. The five-model benchmark with ACF-driven feature engineering adds rigor to the forecasting component that goes beyond picking XGBoost by default.

The four project objectives

OB1

Data cleaning and preparation

Exploratory analysis, column selection, synthetic data generation with pandas and numpy, lag feature construction, and temporal train/test split.

OB2

Sales forecasting

Monthly demand forecasting per product with five ensemble models benchmarked by MAPE on a time-ordered hold-out set.

OB3

Anomaly detection

Three methods compared: IQR statistical bounds, K-Means centroid distance thresholding, and DBSCAN with k-NN epsilon calibration.

OB4

Customer segmentation

RFM (Recency, Frequency, Monetary) variables with Box-Cox normalization, StandardScaler, and K-Means with Silhouette-based cluster selection.

Technology stack

  • Python 3.10+ — primary language
  • pandas / numpy — data manipulation and synthetic data generation
  • scikit-learn — Random Forest, KMeans, DBSCAN, StandardScaler, Silhouette score
  • xgboost — primary forecasting model
  • statsmodels — ACF (Autocorrelation Function) for lag selection
  • scipy — Box-Cox transformation for RFM normalization
  • matplotlib / seaborn — visualizations
  • holidays — generation of valid business dates (excluding public holidays and Sundays)

Setup and data

The project operates on synthetic data generated from the statistical distribution of a real automotive retailer's transaction records across four branches. The generator preserves real patterns including the pandemic demand shock (2020–2021), branch-level sales ratios, and per-customer purchase probabilities — making the pipeline behaviorally equivalent to the original without exposing confidential records.

Dependency installation

Terminal — install dependencies
pip install pandas numpy scikit-learn xgboost statsmodels scipy matplotlib seaborn holidays openpyxl

Recommended project structure

Folder structure
retail-ml-pipeline/
├── data/
│   ├── raw/                         # original client data (confidential — not committed)
│   └── data_productos_sint.csv      # generated synthetic dataset
├── notebooks/
│   ├── 01_synthetic_generation.ipynb
│   ├── 02_sales_forecasting.ipynb
│   ├── 03_anomaly_detection.ipynb
│   └── 04_rfm_segmentation.ipynb
├── src/
│   ├── data_generator.py
│   ├── forecasting.py
│   ├── anomaly_detection.py
│   └── segmentation.py
└── requirements.txt

Synthetic data generation

When client data is confidential, the correct approach is to build a generator that preserves the statistical structure of the original rather than creating fully random data. The strategy here uses: historical branch-level sales frequencies, per-customer and per-product purchase probabilities, a pandemic impact mask for 2020–2021, and a valid-date filter that excludes public holidays and Sundays — matching the real operating calendar of the business.

Python — generate valid business dates excluding public holidays
import pandas as pd
import holidays

# Date range: January 2020 to December 2024
fechas = pd.date_range(start="2020-01-01", end="2024-12-31", freq="D")
peru_holidays = holidays.country_holidays("PE", years=range(2020, 2025))

# Keep only business days: Monday–Saturday, excluding public holidays
valid_dates = [
    f for f in fechas
    if f.weekday() != 6          # exclude Sundays
    and f not in peru_holidays   # exclude public holidays
]

print(f"Valid business days 2020–2024: {len(valid_dates)}")
Python — apply pandemic demand reduction mask
import numpy as np
import pandas as pd

# `synthetic_data` is the transaction DataFrame produced by the generator above
synthetic_data["Fecha_Emision"] = pd.to_datetime(synthetic_data["Fecha_Emision"])

pandemic_start = pd.to_datetime("2020-03-01")
pandemic_end   = pd.to_datetime("2021-03-31")

# Boolean mask for pandemic period
mask = (
    (synthetic_data["Fecha_Emision"] >= pandemic_start) &
    (synthetic_data["Fecha_Emision"] <= pandemic_end)
)

# Reduce sales to 30–60% of normal volume during the pandemic window
pandemic_factor = np.random.uniform(0.3, 0.6, mask.sum())
synthetic_data.loc[mask, "Cantidad"] = (
    synthetic_data.loc[mask, "Cantidad"] * pandemic_factor
).astype(int)
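The per-customer and per-product purchase probabilities mentioned above can be sampled in the same spirit. A minimal sketch, using hypothetical IDs and probabilities that stand in for the empirical frequencies estimated from the confidential client data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical empirical distributions; in the real generator these are
# frequencies estimated from the confidential transaction history
customer_ids   = ["C001", "C002", "C003"]
customer_probs = [0.5, 0.3, 0.2]
product_ids    = ["P-100", "P-200"]
product_probs  = [0.7, 0.3]

n_transactions = 1000
synthetic = pd.DataFrame({
    "Cliente":         rng.choice(customer_ids, size=n_transactions, p=customer_probs),
    "Codigo_Producto": rng.choice(product_ids,  size=n_transactions, p=product_probs),
    "Cantidad":        rng.integers(1, 10, size=n_transactions),
})

# The sample frequencies converge to the target probabilities as n grows
print(synthetic["Cliente"].value_counts(normalize=True))
```

The same weighted-sampling pattern extends to branches and dates, which is what makes the synthetic dataset behaviorally equivalent to the original.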

Feature Engineering

Data preparation differs significantly across the three modules: sales forecasting requires a per-product lag feature matrix, anomaly detection operates on individual product time series without targets, and customer segmentation aggregates the full transaction history to customer-level RFM variables. Each module's preparation is implemented independently to avoid data leakage and allow modular testing.

OB1 — Column selection and top-10 products

Python — data loading and top-10 product selection
import pandas as pd

data = pd.read_csv("data/data_productos_sint.csv")
data["Fecha_Emision"] = pd.to_datetime(data["Fecha_Emision"])

# Select relevant columns for the forecasting module
data_products = data[["Fecha_Emision", "Codigo_Producto", "Cantidad"]].copy()

# Aggregate total sales per product and sort descending
products_grouped = (
    data_products
    .groupby("Codigo_Producto")["Cantidad"]
    .sum()
    .reset_index()
    .sort_values(by="Cantidad", ascending=False)
)

# Select top-10 highest-volume products
top10 = products_grouped.head(10)
top10_codes = top10["Codigo_Producto"].tolist()
print(f"Top-10 products: {top10_codes}")

OB2 — ACF-guided lag feature construction

The ACF (Autocorrelation Function) measures the correlation of a time series with its own lagged values. Lags outside the 95% confidence interval (the blue shaded band in the ACF plot) carry statistically significant autocorrelation and are strong candidates for inclusion as model features. In this dataset, lags 1 through 5 showed consistent significance across most of the top-10 products — establishing 5 as the lag window for the forecasting models.

Python — ACF plots for lag selection across all top-10 products
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

# Monthly aggregation per product
data_products["Month"] = data_products["Fecha_Emision"].dt.to_period("M")
product_dataframes = {}

for code in top10_codes:
    df_prod = (
        data_products[data_products["Codigo_Producto"] == code]
        .groupby("Month")["Cantidad"]
        .sum()
        .reset_index()
    )
    df_prod.columns = ["Date", "Quantity"]
    product_dataframes[code] = df_prod

# Plot ACF for each product (5 rows × 2 columns)
fig, axes = plt.subplots(5, 2, figsize=(14, 18))
for ax, (code, df) in zip(axes.flatten(), product_dataframes.items()):
    plot_acf(df["Quantity"], lags=12, ax=ax, title=f"ACF — {code}")
plt.tight_layout()
plt.show()
Python — build lag features for all top-10 products
NUM_LAGS = 5

for code, df in product_dataframes.items():
    for lag in range(1, NUM_LAGS + 1):
        df[f"lag_{lag}"] = df["Quantity"].shift(lag)
    # Fill NaN with zero for rows without full lag history
    df.fillna(0, inplace=True)
    product_dataframes[code] = df

# Verify structure
print(product_dataframes[top10_codes[0]].head(7))

OB4 — RFM variable construction for customer segmentation

RFM converts the transaction history into three customer-level metrics: Recency (days since last purchase), Frequency (count of unique transactions), and Monetary value (total spend). RFM distributions are inherently right-skewed — a small number of high-value customers dominate both frequency and monetary dimensions. Box-Cox transformation is applied before StandardScaler normalization to correct this skew; without it, K-Means clustering would be dominated by the monetary variable and produce poorly separated segments.

Python — compute RFM metrics per customer with Box-Cox normalization
from datetime import timedelta
from scipy import stats
import pandas as pd

df_customers = data[["Fecha_Emision", "Cliente", "Valor_Venta", "Documento"]].copy()

# Snapshot date: day after the last recorded transaction
snapshot_date = df_customers["Fecha_Emision"].max() + timedelta(days=1)

# Compute RFM per customer
customers = df_customers.groupby("Cliente").agg(
    Recency   = ("Fecha_Emision", lambda x: (snapshot_date - x.max()).days),
    Frequency = ("Documento",     "nunique"),
    Monetary  = ("Valor_Venta",   "sum")
).reset_index()

# Box-Cox transformation to correct right-skew in all three dimensions
customers_bc = pd.DataFrame()
customers_bc["Recency"]   = stats.boxcox(customers["Recency"])[0]
customers_bc["Frequency"] = stats.boxcox(customers["Frequency"])[0]
customers_bc["Monetary"]  = stats.boxcox(customers["Monetary"])[0]

print(customers_bc.describe())

Modeling all three modules

OB2 — Five-model forecasting benchmark

Five regression models are benchmarked for each of the top-10 products, evaluated with MAPE on a chronologically ordered 80/20 split. Using shuffle=False is mandatory here — shuffling would allow future observations to leak into training and produce overly optimistic error metrics that collapse in production deployment.

Python — five-model forecasting benchmark with MAPE evaluation
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error
from xgboost import XGBRegressor
import numpy as np

models = {
    "Decision Tree":     DecisionTreeRegressor(random_state=42),
    "Random Forest":     RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "XGBoost":           XGBRegressor(random_state=42, verbosity=0),
    "AdaBoost":          AdaBoostRegressor(random_state=42),
}

mape_errors = {}

for code, df in product_dataframes.items():
    features = [f"lag_{i}" for i in range(1, NUM_LAGS + 1)]
    X = df[features]
    y = df["Quantity"]

    # shuffle=False: preserve temporal order — no future leakage into training
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False
    )

    product_errors = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mape = mean_absolute_percentage_error(y_test, y_pred) * 100
        product_errors[name] = round(mape, 2)

    mape_errors[code] = product_errors

# Average MAPE per model across all top-10 products
import pandas as pd
df_errors = pd.DataFrame(mape_errors).T
print("Average MAPE by model (lower is better):")
print(df_errors.mean().sort_values())

OB3 — Anomaly detection: IQR, K-Means, and DBSCAN

Three anomaly detection methods are applied and compared on the sales time series of each product.

Python — anomaly detection with IQR statistical bounds
anomalies_iqr = {}

for code, df in product_dataframes.items():
    Q1  = df["Quantity"].quantile(0.25)
    Q3  = df["Quantity"].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    anomalies = df[
        (df["Quantity"] < lower_bound) |
        (df["Quantity"] > upper_bound)
    ]
    anomalies_iqr[code] = anomalies
    print(f"{code}: {len(anomalies)} anomalies detected with IQR")
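The K-Means variant flags points whose distance to their assigned centroid is unusually large. A sketch on a toy two-regime series; k=2 and the 95th-percentile cutoff are illustrative assumptions, not values fixed by the project:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Toy series with two demand regimes plus two injected outliers at the end
qty = np.concatenate([
    rng.normal(50, 5, 30),    # low-demand regime
    rng.normal(150, 5, 30),   # high-demand regime
    [400.0, 350.0],           # injected outliers
])
df = pd.DataFrame({"Quantity": qty})

X_scaled = StandardScaler().fit_transform(df[["Quantity"]])
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_scaled)

# Distance of each point to its own cluster centroid
assigned_centroids = km.cluster_centers_[km.labels_]
distances = np.linalg.norm(X_scaled - assigned_centroids, axis=1)

# Flag the points farthest from their centroid; the manual threshold
# choice is the method's main drawback
df["anomaly"] = distances > np.quantile(distances, 0.95)
print(df[df["anomaly"]])
```

Both k and the distance cutoff must be tuned per series, which is why this method is the least automatable of the three.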
Python — DBSCAN with automated epsilon via k-NN distance curve
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def calculate_epsilon(X_scaled, k=5):
    """
    Estimate epsilon from the k-nearest-neighbor distance curve.
    The largest jump in the sorted k-NN distances (a knee heuristic)
    approximates the natural density boundary for DBSCAN.
    """
    nbrs = NearestNeighbors(n_neighbors=k).fit(X_scaled)
    distances, _ = nbrs.kneighbors(X_scaled)
    dist_k = np.sort(distances[:, -1])
    diff = np.diff(dist_k)
    epsilon = dist_k[np.argmax(diff)]
    return epsilon

anomalies_dbscan = {}

for code, df in product_dataframes.items():
    X = df[["Quantity"]].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    epsilon = calculate_epsilon(X_scaled)
    db = DBSCAN(eps=epsilon, min_samples=3).fit(X_scaled)

    df_result = df.copy()
    df_result["cluster"] = db.labels_
    # Label -1 = noise = anomaly in DBSCAN
    anomalies = df_result[df_result["cluster"] == -1]
    anomalies_dbscan[code] = anomalies
    print(f"{code}: {len(anomalies)} anomalies with DBSCAN (ε={epsilon:.4f})")

OB4 — K-Means for RFM customer segmentation

The elbow method (SSE across k=1 to 10) and Silhouette score (k=2 to 5) are used jointly to select the optimal number of clusters. Agreement between both criteria at k=3 provides stronger justification than either method alone — elbow curves are often ambiguous, while Silhouette alone can favor k=2 due to its preference for clearly separated binary partitions.

Python — K-Means elbow method, Silhouette score, and final segmentation
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

scaler = StandardScaler()
customers_normalized = scaler.fit_transform(customers_bc)

# Elbow method: SSE for k=1 through 10
sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(customers_normalized)
    sse[k] = kmeans.inertia_

plt.figure(figsize=(8, 4))
plt.plot(list(sse.keys()), list(sse.values()), marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("SSE (inertia)")
plt.title("Elbow Method — RFM Customer Segmentation")
plt.grid(True, alpha=0.3)
plt.show()

# Silhouette score for k=2 through 5
for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(customers_normalized)
    score = silhouette_score(customers_normalized, km.labels_)
    print(f"k={k}: Silhouette = {score:.4f}")

# Final model with k=3
model = KMeans(n_clusters=3, random_state=42, n_init=10)
model.fit(customers_normalized)
customers["Cluster"] = model.labels_

Metrics and comparison

OB2 — Forecasting model comparison

| Model | Avg. MAPE (%) | Notes |
| --- | --- | --- |
| Decision Tree | ~28 | High variance; prone to overfitting on small per-product series |
| AdaBoost | ~24 | Sensitive to outliers in the sales series |
| Random Forest | ~19 | Robust ensemble baseline; good generalization |
| Gradient Boosting | ~17 | Better captures local trend patterns |
| XGBoost | ~15 | Best model — selected for production forecasting |

OB3 — Anomaly detection method comparison

| Method | Type | Advantage | Limitation |
| --- | --- | --- | --- |
| IQR | Statistical | Simple, interpretable, no hyperparameters | Univariate; assumes roughly symmetric distribution; high false-positive rate on skewed series |
| K-Means | Clustering | Context-aware — flags anomalies relative to cluster structure | Requires manual k and distance threshold; sensitive to initialization |
| DBSCAN | Density | No k required; robust to noise; handles non-spherical clusters; ε automated via k-NN | Computationally heavier at scale; k-NN epsilon heuristic may need tuning for very short series |

OB4 — RFM cluster profiles

Python — cluster profile by RFM mean values
cluster_profile = customers.groupby("Cluster").agg(
    Avg_Recency   = ("Recency",   "mean"),
    Avg_Frequency = ("Frequency", "mean"),
    Avg_Monetary  = ("Monetary",  "mean"),
    N_customers   = ("Cliente",   "count")
).round(1)

print(cluster_profile)
# Expected output interpretation:
# Cluster 0: Low recency, high frequency, high monetary  → VIP customers
# Cluster 1: High recency, low frequency, low monetary   → Churn-risk customers
# Cluster 2: Medium recency, medium frequency, medium monetary → Regular customers
| Cluster | Recency (days) | Frequency | Avg. Monetary | Profile & Action |
| --- | --- | --- | --- | --- |
| 0 — VIP | ~15 | High | High | Active high-value customers — retain and reward with loyalty programs |
| 1 — At Risk | ~180 | Low | Low | No recent purchases — target with win-back campaigns and personalized offers |
| 2 — Regular | ~45 | Medium | Medium | Stable customers — upselling and cross-selling opportunity |

Conclusions

Key findings

Finding 1

XGBoost with ACF-guided lag features is the best monthly demand forecaster

Across synthetic automotive retail data for the top-10 products, XGBoost with 5 monthly lag features achieved an average MAPE of ~15%, outperforming Random Forest (~19%) and Gradient Boosting (~17%). The critical implementation detail was using shuffle=False in the train/test split to preserve temporal order — shuffled splits would have produced artificially inflated accuracy by leaking future demand observations into training.

Finding 2

DBSCAN with k-NN epsilon calibration is the most robust anomaly detector

IQR produces excess false positives on asymmetrically distributed sales series — products with a natural right-skew see many legitimate high-demand periods flagged as anomalies. K-Means requires manual tuning of both k and the distance threshold, making it impractical for a multi-product pipeline. DBSCAN with epsilon automatically calibrated via the k-NN distance curve adapts individually to each product's density structure and flags only genuinely isolated demand outliers, with no manual threshold adjustment required per product.

Finding 3

K-Means with k=3 produces the most actionable RFM customer segments

The elbow method and Silhouette score converge on k=3 as the optimal cluster count. The three resulting segments — VIP, At-Risk, and Regular — have clear, non-overlapping profiles that map directly to distinct business actions (loyalty programs, win-back campaigns, upselling). The Box-Cox transformation applied before StandardScaler normalization was essential: without it, K-Means clustering was dominated by the Monetary dimension alone, collapsing the VIP and Regular segments into a single indistinguishable cluster.

Finding 4

Statistical-distribution-preserving synthetic data is a valid and publishable methodology

When real client data carries confidentiality constraints, generating synthetic data that preserves the original's statistical structure — demand frequency ratios, purchase probabilities, and event-driven shocks like the pandemic — enables the full ML pipeline to be developed, validated, and published without exposing sensitive records. This strategy is applicable to any applied ML project under NDA: the synthetic generator becomes a reusable artifact that can be updated as new data becomes available, without modifying the downstream modeling code.

Limitations and future work

  • Models are trained independently per product — a multi-output model that learns demand patterns across products simultaneously could improve MAPE by capturing cross-product substitution and complementarity effects.
  • No exogenous variables are included — promotions, fuel prices, seasonal automotive demand cycles, and macroeconomic indicators all affect automotive parts demand and could significantly reduce forecasting error.
  • RFM segmentation ignores the branch dimension — segmenting customers per branch would produce more locally actionable insights for individual sales teams.
  • DBSCAN with k-NN epsilon is computationally expensive at scale — Isolation Forest or Local Outlier Factor (LOF) would be more scalable alternatives for production deployments with millions of records.
  • The synthetic generator uses fixed pandemic impact parameters (0.3–0.6 reduction factor) — a more sophisticated generator could sample these from a fitted distribution derived from actual pandemic-period data to better replicate demand variability during structural breaks.
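The Isolation Forest alternative mentioned above is a near drop-in replacement for the per-product anomaly flagging. A minimal sketch; the contamination rate is an illustrative assumption:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Toy monthly quantities with two injected demand outliers
qty = rng.normal(100, 10, 60)
qty[10], qty[40] = 320.0, 2.0
df = pd.DataFrame({"Quantity": qty})

# contamination = expected anomaly share (assumed 5% here)
iso = IsolationForest(contamination=0.05, random_state=42)
df["anomaly"] = iso.fit_predict(df[["Quantity"]]) == -1   # -1 marks outliers

print(df[df["anomaly"]])
```

Unlike DBSCAN with the k-NN epsilon heuristic, Isolation Forest scales near-linearly with sample size, at the cost of having to specify the contamination rate up front.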

Adapting this pipeline to your own thesis or project: The three-module structure works with any transactional retail dataset — e-commerce, pharmacies, supermarkets, or spare parts distributors. The minimum required columns are: date, customer ID, product ID, quantity, and transaction value. The synthetic data generator is optional — if you have real data and authorization to use it, simply replace the CSV input and the rest of the pipeline runs unchanged. For a thesis contribution, the most defensible extension is adding exogenous variables to the XGBoost forecaster and comparing MAPE against the lag-only baseline using a 12-month rolling out-of-sample backtest.