
Sales Forecasting with Machine Learning: XGBoost Lag Features, RFM K-Means Customer Segmentation, and DBSCAN Anomaly Detection (Python)

Three ML modules in one pipeline: monthly demand forecasting per product with ACF-guided lag features, RFM customer segmentation with K-Means, and anomaly detection with IQR, K-Means, and DBSCAN — built on synthetic automotive retail data.

25 min read · Python · Scikit-learn · XGBoost · Data: synthetic automotive retail 2020–2024 (preserves the real data's distributions) · Applicable to business, industrial engineering, and operations management theses

Project Summary

What will you find in this project?

  • How to generate realistic synthetic retail data that preserves the statistical distribution of real client data when confidentiality prevents sharing the original.
  • Sales forecasting per product using ACF-guided lag features and a five-model benchmark (Decision Tree, RF, GBM, XGBoost, AdaBoost) evaluated with MAPE.
  • RFM customer segmentation with K-Means, elbow method, and Silhouette score to select the optimal number of clusters.
  • Anomaly detection with three methods compared head-to-head: IQR, K-Means distance, and DBSCAN with k-NN epsilon calibration.

This project builds three ML modules on data from an automotive retail business with four branches. Because the client's real transaction data is confidential, a synthetic data generator was engineered to preserve the original statistical structure: branch-level demand patterns, pandemic impact (2020–2021), and per-customer and per-product purchase probabilities. The result is a fully replicable open pipeline that reflects the complexity of a real retail problem without exposing sensitive information.

Why this pipeline is valuable for academic and applied research: It integrates three distinct ML problem types — supervised regression, unsupervised clustering, and unsupervised anomaly detection — in a single coherent workflow. The synthetic data generation strategy is itself a methodological contribution: it demonstrates how to produce publishable, reproducible research when real data carries confidentiality constraints. The five-model benchmark with ACF-driven feature engineering adds rigor to the forecasting component that goes beyond picking XGBoost by default.

The four project objectives

OB1

Data cleaning and preparation

Exploratory analysis, column selection, synthetic data generation with pandas and numpy, lag feature construction, and temporal train/test split.

OB2

Sales forecasting

Monthly demand forecasting per product with five ensemble models benchmarked by MAPE on a time-ordered hold-out set.

OB3

Anomaly detection

Three methods compared: IQR statistical bounds, K-Means centroid distance thresholding, and DBSCAN with k-NN epsilon calibration.

OB4

Customer segmentation

RFM (Recency, Frequency, Monetary) variables with Box-Cox normalization, StandardScaler, and K-Means with Silhouette-based cluster selection.

Technology stack

  • Python 3.10+ — primary language
  • pandas / numpy — data manipulation and synthetic data generation
  • scikit-learn — Random Forest, KMeans, DBSCAN, StandardScaler, Silhouette score
  • xgboost — primary forecasting model
  • statsmodels — ACF (Autocorrelation Function) for lag selection
  • scipy — Box-Cox transformation for RFM normalization
  • matplotlib / seaborn — visualizations
  • holidays — generation of valid business dates (excluding public holidays and Sundays)

Setup and data

The project operates on synthetic data generated from the statistical distribution of a real automotive retailer's transaction records across four branches. The generator preserves real patterns including the pandemic demand shock (2020–2021), branch-level sales ratios, and per-customer purchase probabilities — making the pipeline behaviorally equivalent to the original without exposing confidential records.

Dependency installation

Terminal — install dependencies
pip install pandas numpy scikit-learn xgboost statsmodels scipy matplotlib seaborn holidays openpyxl

Recommended project structure

Folder structure
retail-ml-pipeline/
├── data/
│   ├── raw/                         # original client data (confidential — not committed)
│   └── data_productos_sint.csv      # generated synthetic dataset
├── notebooks/
│   ├── 01_synthetic_generation.ipynb
│   ├── 02_sales_forecasting.ipynb
│   ├── 03_anomaly_detection.ipynb
│   └── 04_rfm_segmentation.ipynb
├── src/
│   ├── data_generator.py
│   ├── forecasting.py
│   ├── anomaly_detection.py
│   └── segmentation.py
└── requirements.txt

Synthetic data generation

When client data is confidential, the correct approach is to build a generator that preserves the statistical structure of the original rather than creating fully random data. The strategy here uses: historical branch-level sales frequencies, per-customer and per-product purchase probabilities, a pandemic impact mask for 2020–2021, and a valid-date filter that excludes public holidays and Sundays — matching the real operating calendar of the business.

Python — generate valid business dates excluding public holidays
import pandas as pd
import holidays

# Date range: January 2020 to December 2024
fechas = pd.date_range(start="2020-01-01", end="2024-12-31", freq="D")
peru_holidays = holidays.country_holidays("PE", years=range(2020, 2025))

# Keep only business days: Monday–Saturday, excluding public holidays
valid_dates = [
    f for f in fechas
    if f.weekday() != 6          # exclude Sundays
    and f not in peru_holidays   # exclude public holidays
]

print(f"Valid business days 2020–2024: {len(valid_dates)}")
Python — apply pandemic demand reduction mask
import numpy as np
import pandas as pd

# `synthetic_data` is the transaction DataFrame produced by the generator above
synthetic_data["Fecha_Emision"] = pd.to_datetime(synthetic_data["Fecha_Emision"])

pandemic_start = pd.to_datetime("2020-03-01")
pandemic_end   = pd.to_datetime("2021-03-31")

# Boolean mask for pandemic period
mask = (
    (synthetic_data["Fecha_Emision"] >= pandemic_start) &
    (synthetic_data["Fecha_Emision"] <= pandemic_end)
)

# Reduce sales to 30–60% of normal volume during the pandemic window
pandemic_factor = np.random.uniform(0.3, 0.6, mask.sum())
synthetic_data.loc[mask, "Cantidad"] = (
    synthetic_data.loc[mask, "Cantidad"] * pandemic_factor
).astype(int)
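The per-customer and per-product purchase probabilities mentioned above can be sampled in the same spirit. A minimal sketch, using hypothetical IDs and probabilities that stand in for the empirical frequencies estimated from the confidential client data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical empirical distributions; in the real generator these are
# frequencies estimated from the confidential transaction history
customer_ids   = ["C001", "C002", "C003"]
customer_probs = [0.5, 0.3, 0.2]
product_ids    = ["P-100", "P-200"]
product_probs  = [0.7, 0.3]

n_transactions = 1000
synthetic = pd.DataFrame({
    "Cliente":         rng.choice(customer_ids, size=n_transactions, p=customer_probs),
    "Codigo_Producto": rng.choice(product_ids,  size=n_transactions, p=product_probs),
    "Cantidad":        rng.integers(1, 10, size=n_transactions),
})

# The sample frequencies converge to the target probabilities as n grows
print(synthetic["Cliente"].value_counts(normalize=True))
```

The same weighted-sampling pattern extends to branches and dates, which is what makes the synthetic dataset behaviorally equivalent to the original.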

Feature Engineering

Data preparation differs significantly across the three modules: sales forecasting requires a per-product lag feature matrix, anomaly detection operates on individual product time series without targets, and customer segmentation aggregates the full transaction history to customer-level RFM variables. Each module's preparation is implemented independently to avoid data leakage and allow modular testing.

OB1 — Column selection and top-10 products

Python — data loading and top-10 product selection
import pandas as pd

data = pd.read_csv("data/data_productos_sint.csv")
data["Fecha_Emision"] = pd.to_datetime(data["Fecha_Emision"])

# Select relevant columns for the forecasting module
data_products = data[["Fecha_Emision", "Codigo_Producto", "Cantidad"]].copy()

# Aggregate total sales per product and sort descending
products_grouped = (
    data_products
    .groupby("Codigo_Producto")["Cantidad"]
    .sum()
    .reset_index()
    .sort_values(by="Cantidad", ascending=False)
)

# Select top-10 highest-volume products
top10 = products_grouped.head(10)
top10_codes = top10["Codigo_Producto"].tolist()
print(f"Top-10 products: {top10_codes}")

OB2 — ACF-guided lag feature construction

The ACF (Autocorrelation Function) measures the correlation of a time series with its own lagged values. Lags outside the 95% confidence interval (the blue shaded band in the ACF plot) carry statistically significant autocorrelation and are strong candidates for inclusion as model features. In this dataset, lags 1 through 5 showed consistent significance across most of the top-10 products — establishing 5 as the lag window for the forecasting models.

Python — ACF plots for lag selection across all top-10 products
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

# Monthly aggregation per product
data_products["Month"] = data_products["Fecha_Emision"].dt.to_period("M")
product_dataframes = {}

for code in top10_codes:
    df_prod = (
        data_products[data_products["Codigo_Producto"] == code]
        .groupby("Month")["Cantidad"]
        .sum()
        .reset_index()
    )
    df_prod.columns = ["Date", "Quantity"]
    product_dataframes[code] = df_prod

# Plot ACF for each product (5 rows × 2 columns)
fig, axes = plt.subplots(5, 2, figsize=(14, 18))
for ax, (code, df) in zip(axes.flatten(), product_dataframes.items()):
    plot_acf(df["Quantity"], lags=12, ax=ax, title=f"ACF — {code}")
plt.tight_layout()
plt.show()
Python — build lag features for all top-10 products
NUM_LAGS = 5

for code, df in product_dataframes.items():
    for lag in range(1, NUM_LAGS + 1):
        df[f"lag_{lag}"] = df["Quantity"].shift(lag)
    # Fill NaN with zero for rows without full lag history
    df.fillna(0, inplace=True)
    product_dataframes[code] = df

# Verify structure
print(product_dataframes[top10_codes[0]].head(7))

OB4 — RFM variable construction for customer segmentation

RFM converts the transaction history into three customer-level metrics: Recency (days since last purchase), Frequency (count of unique transactions), and Monetary value (total spend). RFM distributions are inherently right-skewed — a small number of high-value customers dominate both frequency and monetary dimensions. Box-Cox transformation is applied before StandardScaler normalization to correct this skew; without it, K-Means clustering would be dominated by the monetary variable and produce poorly separated segments.

Python — compute RFM metrics per customer with Box-Cox normalization
from datetime import timedelta
from scipy import stats
import pandas as pd

df_customers = data[["Fecha_Emision", "Cliente", "Valor_Venta", "Documento"]].copy()

# Snapshot date: day after the last recorded transaction
snapshot_date = df_customers["Fecha_Emision"].max() + timedelta(days=1)

# Compute RFM per customer
customers = df_customers.groupby("Cliente").agg(
    Recency   = ("Fecha_Emision", lambda x: (snapshot_date - x.max()).days),
    Frequency = ("Documento",     "nunique"),
    Monetary  = ("Valor_Venta",   "sum")
).reset_index()

# Box-Cox transformation to correct right-skew in all three dimensions
customers_bc = pd.DataFrame()
customers_bc["Recency"]   = stats.boxcox(customers["Recency"])[0]
customers_bc["Frequency"] = stats.boxcox(customers["Frequency"])[0]
customers_bc["Monetary"]  = stats.boxcox(customers["Monetary"])[0]

print(customers_bc.describe())

Modeling all three modules

OB2 — Five-model forecasting benchmark

Five regression models are benchmarked for each of the top-10 products, evaluated with MAPE on a chronologically ordered 80/20 split. Using shuffle=False is mandatory here — shuffling would allow future observations to leak into training and produce overly optimistic error metrics that collapse in production deployment.

Python — five-model forecasting benchmark with MAPE evaluation
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error
from xgboost import XGBRegressor
import numpy as np

models = {
    "Decision Tree":     DecisionTreeRegressor(random_state=42),
    "Random Forest":     RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "XGBoost":           XGBRegressor(random_state=42, verbosity=0),
    "AdaBoost":          AdaBoostRegressor(random_state=42),
}

mape_errors = {}

for code, df in product_dataframes.items():
    features = [f"lag_{i}" for i in range(1, NUM_LAGS + 1)]
    X = df[features]
    y = df["Quantity"]

    # shuffle=False: preserve temporal order — no future leakage into training
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False
    )

    product_errors = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mape = mean_absolute_percentage_error(y_test, y_pred) * 100
        product_errors[name] = round(mape, 2)

    mape_errors[code] = product_errors

# Average MAPE per model across all top-10 products
import pandas as pd
df_errors = pd.DataFrame(mape_errors).T
print("Average MAPE by model (lower is better):")
print(df_errors.mean().sort_values())

OB3 — Anomaly detection: IQR, K-Means, and DBSCAN

Three anomaly detection methods are applied and compared on the sales time series of each product.

Python — anomaly detection with IQR statistical bounds
anomalies_iqr = {}

for code, df in product_dataframes.items():
    Q1  = df["Quantity"].quantile(0.25)
    Q3  = df["Quantity"].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    anomalies = df[
        (df["Quantity"] < lower_bound) |
        (df["Quantity"] > upper_bound)
    ]
    anomalies_iqr[code] = anomalies
    print(f"{code}: {len(anomalies)} anomalies detected with IQR")
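The K-Means variant flags points whose distance to their assigned centroid is unusually large. A sketch on a toy two-regime series; k=2 and the 95th-percentile cutoff are illustrative assumptions, not values fixed by the project:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Toy series with two demand regimes plus two injected outliers at the end
qty = np.concatenate([
    rng.normal(50, 5, 30),    # low-demand regime
    rng.normal(150, 5, 30),   # high-demand regime
    [400.0, 350.0],           # injected outliers
])
df = pd.DataFrame({"Quantity": qty})

X_scaled = StandardScaler().fit_transform(df[["Quantity"]])
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_scaled)

# Distance of each point to its own cluster centroid
assigned_centroids = km.cluster_centers_[km.labels_]
distances = np.linalg.norm(X_scaled - assigned_centroids, axis=1)

# Flag the points farthest from their centroid; the manual threshold
# choice is the method's main drawback
df["anomaly"] = distances > np.quantile(distances, 0.95)
print(df[df["anomaly"]])
```

Both k and the distance cutoff must be tuned per series, which is why this method is the least automatable of the three.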
Python — DBSCAN with automated epsilon via k-NN distance curve
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def calculate_epsilon(X_scaled, k=5):
    """
    Estimate epsilon from the k-nearest-neighbor distance curve.
    The largest jump in the sorted k-NN distances (a knee heuristic)
    approximates the natural density boundary for DBSCAN.
    """
    nbrs = NearestNeighbors(n_neighbors=k).fit(X_scaled)
    distances, _ = nbrs.kneighbors(X_scaled)
    dist_k = np.sort(distances[:, -1])
    diff = np.diff(dist_k)
    epsilon = dist_k[np.argmax(diff)]
    return epsilon

anomalies_dbscan = {}

for code, df in product_dataframes.items():
    X = df[["Quantity"]].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    epsilon = calculate_epsilon(X_scaled)
    db = DBSCAN(eps=epsilon, min_samples=3).fit(X_scaled)

    df_result = df.copy()
    df_result["cluster"] = db.labels_
    # Label -1 = noise = anomaly in DBSCAN
    anomalies = df_result[df_result["cluster"] == -1]
    anomalies_dbscan[code] = anomalies
    print(f"{code}: {len(anomalies)} anomalies with DBSCAN (ε={epsilon:.4f})")

OB4 — K-Means for RFM customer segmentation

The elbow method (SSE across k=1 to 10) and Silhouette score (k=2 to 5) are used jointly to select the optimal number of clusters. Agreement between both criteria at k=3 provides stronger justification than either method alone — elbow curves are often ambiguous, while Silhouette alone can favor k=2 due to its preference for clearly separated binary partitions.

Python — K-Means elbow method, Silhouette score, and final segmentation
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

scaler = StandardScaler()
customers_normalized = scaler.fit_transform(customers_bc)

# Elbow method: SSE for k=1 through 10
sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(customers_normalized)
    sse[k] = kmeans.inertia_

plt.figure(figsize=(8, 4))
plt.plot(list(sse.keys()), list(sse.values()), marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("SSE (inertia)")
plt.title("Elbow Method — RFM Customer Segmentation")
plt.grid(True, alpha=0.3)
plt.show()

# Silhouette score for k=2 through 5
for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(customers_normalized)
    score = silhouette_score(customers_normalized, km.labels_)
    print(f"k={k}: Silhouette = {score:.4f}")

# Final model with k=3
model = KMeans(n_clusters=3, random_state=42, n_init=10)
model.fit(customers_normalized)
customers["Cluster"] = model.labels_

Metrics and comparison

OB2 — Forecasting model comparison

| Model | Avg. MAPE (%) | Notes |
| --- | --- | --- |
| Decision Tree | ~28 | High variance; prone to overfitting on small per-product series |
| AdaBoost | ~24 | Sensitive to outliers in the sales series |
| Random Forest | ~19 | Robust ensemble baseline; good generalization |
| Gradient Boosting | ~17 | Better captures local trend patterns |
| XGBoost | ~15 | Best model — selected for production forecasting |

OB3 — Anomaly detection method comparison

| Method | Type | Advantage | Limitation |
| --- | --- | --- | --- |
| IQR | Statistical | Simple, interpretable, no hyperparameters | Univariate; assumes roughly symmetric distribution; high false-positive rate on skewed series |
| K-Means | Clustering | Context-aware — flags anomalies relative to cluster structure | Requires manual k and distance threshold; sensitive to initialization |
| DBSCAN | Density | No k required; robust to noise; handles non-spherical clusters; ε automated via k-NN | Computationally heavier at scale; k-NN epsilon heuristic may need tuning for very short series |

OB4 — RFM cluster profiles

Python — cluster profile by RFM mean values
cluster_profile = customers.groupby("Cluster").agg(
    Avg_Recency   = ("Recency",   "mean"),
    Avg_Frequency = ("Frequency", "mean"),
    Avg_Monetary  = ("Monetary",  "mean"),
    N_customers   = ("Cliente",   "count")
).round(1)

print(cluster_profile)
# Expected output interpretation:
# Cluster 0: Low recency, high frequency, high monetary  → VIP customers
# Cluster 1: High recency, low frequency, low monetary   → Churn-risk customers
# Cluster 2: Medium recency, medium frequency, medium monetary → Regular customers
| Cluster | Recency (days) | Frequency | Avg. Monetary | Profile & Action |
| --- | --- | --- | --- | --- |
| 0 — VIP | ~15 | High | High | Active high-value customers — retain and reward with loyalty programs |
| 1 — At Risk | ~180 | Low | Low | No recent purchases — target with win-back campaigns and personalized offers |
| 2 — Regular | ~45 | Medium | Medium | Stable customers — upselling and cross-selling opportunity |

Conclusions

Key findings

Finding 1

XGBoost with ACF-guided lag features is the best monthly demand forecaster

Across synthetic automotive retail data for the top-10 products, XGBoost with 5 monthly lag features achieved an average MAPE of ~15%, outperforming Random Forest (~19%) and Gradient Boosting (~17%). The critical implementation detail was using shuffle=False in the train/test split to preserve temporal order — shuffled splits would have produced artificially inflated accuracy by leaking future demand observations into training.

Finding 2

DBSCAN with k-NN epsilon calibration is the most robust anomaly detector

IQR produces excess false positives on asymmetrically distributed sales series — products with a natural right-skew see many legitimate high-demand periods flagged as anomalies. K-Means requires manual tuning of both k and the distance threshold, making it impractical for a multi-product pipeline. DBSCAN with epsilon automatically calibrated via the k-NN distance curve adapts individually to each product's density structure and flags only genuinely isolated demand outliers, with no manual threshold adjustment required per product.

Finding 3

K-Means with k=3 produces the most actionable RFM customer segments

The elbow method and Silhouette score converge on k=3 as the optimal cluster count. The three resulting segments — VIP, At-Risk, and Regular — have clear, non-overlapping profiles that map directly to distinct business actions (loyalty programs, win-back campaigns, upselling). The Box-Cox transformation applied before StandardScaler normalization was essential: without it, K-Means clustering was dominated by the Monetary dimension alone, collapsing the VIP and Regular segments into a single indistinguishable cluster.

Finding 4

Statistical-distribution-preserving synthetic data is a valid and publishable methodology

When real client data carries confidentiality constraints, generating synthetic data that preserves the original's statistical structure — demand frequency ratios, purchase probabilities, and event-driven shocks like the pandemic — enables the full ML pipeline to be developed, validated, and published without exposing sensitive records. This strategy is applicable to any applied ML project under NDA: the synthetic generator becomes a reusable artifact that can be updated as new data becomes available, without modifying the downstream modeling code.

Limitations and future work

  • Models are trained independently per product — a multi-output model that learns demand patterns across products simultaneously could improve MAPE by capturing cross-product substitution and complementarity effects.
  • No exogenous variables are included — promotions, fuel prices, seasonal automotive demand cycles, and macroeconomic indicators all affect automotive parts demand and could significantly reduce forecasting error.
  • RFM segmentation ignores the branch dimension — segmenting customers per branch would produce more locally actionable insights for individual sales teams.
  • DBSCAN with k-NN epsilon is computationally expensive at scale — Isolation Forest or Local Outlier Factor (LOF) would be more scalable alternatives for production deployments with millions of records.
  • The synthetic generator uses fixed pandemic impact parameters (0.3–0.6 reduction factor) — a more sophisticated generator could sample these from a fitted distribution derived from actual pandemic-period data to better replicate demand variability during structural breaks.
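The Isolation Forest alternative mentioned above is a near drop-in replacement for the per-product anomaly flagging. A minimal sketch; the contamination rate is an illustrative assumption:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Toy monthly quantities with two injected demand outliers
qty = rng.normal(100, 10, 60)
qty[10], qty[40] = 320.0, 2.0
df = pd.DataFrame({"Quantity": qty})

# contamination = expected anomaly share (assumed 5% here)
iso = IsolationForest(contamination=0.05, random_state=42)
df["anomaly"] = iso.fit_predict(df[["Quantity"]]) == -1   # -1 marks outliers

print(df[df["anomaly"]])
```

Unlike DBSCAN with the k-NN epsilon heuristic, Isolation Forest scales near-linearly with sample size, at the cost of having to specify the contamination rate up front.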

Adapting this pipeline to your own thesis or project: The three-module structure works with any transactional retail dataset — e-commerce, pharmacies, supermarkets, or spare parts distributors. The minimum required columns are: date, customer ID, product ID, quantity, and transaction value. The synthetic data generator is optional — if you have real data and authorization to use it, simply replace the CSV input and the rest of the pipeline runs unchanged. For a thesis contribution, the most defensible extension is adding exogenous variables to the XGBoost forecaster and comparing MAPE against the lag-only baseline using a 12-month rolling out-of-sample backtest.