
LSTM Grocery Price Forecasting with Seasonal Decomposition and Budget-Constrained Nutritional Optimization

Full ML pipeline: from raw wholesale market data to automated weekly menu recommendation. STL decomposition, cyclic encoding, lag features, LSTM-MSNet architecture, and PuLP optimization — with real Python code.

25 min read · Python · TensorFlow · statsmodels · Data: Corabastos wholesale market, Bogotá · Applicable to master's theses in agriculture, food science, and public health

Project overview

What you'll find in this project

  • How to forecast food prices with LSTM on real wholesale market data.
  • STL seasonal decomposition, cyclic encoding, and lag feature construction.
  • How to formulate and solve a budget-constrained nutritional optimization problem.
  • The LSTM-MSNet architecture explained with real Python code.

This project combines two problems that are rarely treated together in the same pipeline: food price forecasting using time-series models, and automated diet optimization under budget and nutritional constraints.

The real use case: given predicted prices for next week, what set of recipes meets a family's nutritional requirements at the lowest possible cost? Neither module works in isolation — the forecasting feeds the optimization, and the optimization only makes sense if the prices are accurate.

The data source is Corabastos, the largest wholesale market in Bogotá, Colombia. Prices exhibit strong weekly seasonality, market-event spikes, and long-term trends — making this an ideal problem for multi-scale time-series modeling.

Why is this problem strong for a thesis? It integrates time-series forecasting and combinatorial optimization into a complete operational pipeline. It has direct social impact on food security. The data is publicly accessible and replicable in other markets. And the research gap is clear: very few papers combine both objectives in a single deployable workflow.

Project objectives

  • Build a grocery price forecasting model with LSTM on historical Corabastos data.
  • Engineer features using STL seasonal decomposition, cyclic temporal encoding, and lag construction.
  • Formulate the recipe selection problem as integer linear programming with nutritional constraints.
  • Integrate both modules into a pipeline that, given a weekly budget, recommends the most economical nutritionally complete menu.

Tech stack

  • Python 3.10+ — main pipeline language
  • TensorFlow / Keras — LSTM and LSTM-MSNet architecture
  • statsmodels — STL seasonal decomposition and time-series analysis
  • PuLP — combinatorial nutritional optimization (linear programming)
  • pandas / numpy — data manipulation and feature engineering
  • matplotlib / plotly — forecast visualization and results dashboards

Environment setup

Before writing a single line of model code, the environment must be correctly configured. This includes dependencies, folder structure, and data source access. Using a virtual environment is strongly recommended: TensorFlow and statsmodels have conflicting sub-dependency requirements that can silently break each other in a shared environment.

Install dependencies

Terminal — create environment and install dependencies
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

pip install tensorflow pandas numpy statsmodels scikit-learn
pip install pulp matplotlib plotly openpyxl

Project structure

A clear folder structure from the start prevents confusion as the project scales. The key principle: raw data is never modified in place — all transformations produce new files in processed/.

Recommended folder structure
grocery-price-optimization/
├── data/
│   ├── raw/           # original Corabastos files — never modified
│   ├── processed/     # data after cleaning and feature engineering
│   └── external/      # certified nutritional reference tables
├── models/
│   ├── lstm_msnet.h5  # saved trained model
│   └── scaler.pkl     # saved normalizer for inference
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_feature_engineering.ipynb
│   ├── 03_lstm_model.ipynb
│   └── 04_nutritional_optimization.ipynb
├── src/
│   ├── preprocessing.py
│   ├── model.py
│   └── optimizer.py
└── requirements.txt

Data source: Corabastos wholesale market

Corabastos publishes historical wholesale food prices as open data. Prices are recorded per kilogram by product, date, and origin — providing the longitudinal series needed for time-series modeling.

  • Frequency: daily or weekly depending on the product
  • Available variables: date, product, price_kg, unit, origin
  • Historical coverage: approximately from 2015 onward
  • Format: downloadable Excel (.xlsx) by period
Python — initial data loading
import pandas as pd

df = pd.read_excel("data/raw/corabastos_2023_2024.xlsx")
print(df.head())
print(df.dtypes)
print(df["product"].value_counts().head(20))

Feature engineering

Data preparation is the most critical stage in this pipeline. A well-architected LSTM trained on poorly prepared data will still produce useless forecasts. Market data has specific noise patterns that generic cleaning tutorials don't address: price spikes from data entry errors, missing dates on public holidays, and the same product listed under multiple name variants across export files.
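The name-variant problem can be handled with an explicit mapping applied before any per-product filtering. A minimal sketch; the entries in `NAME_MAP` are hypothetical examples, and the real variants depend on your own export files:

Python — normalizing product name variants (illustrative)

```python
import pandas as pd

# Hypothetical variant map: keys are raw names, values are canonical names.
# Build this by inspecting df["product"].value_counts() on your own exports.
NAME_MAP = {
    "PAPA PASTUSA SUPERIOR": "PAPA PASTUSA",
    "CEBOLLA CABEZONA BLANCA": "CEBOLLA CABEZONA",
}

def normalize_product_names(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Uppercase, collapse repeated whitespace, and strip edges before mapping
    out["product"] = (out["product"].str.upper()
                                    .str.replace(r"\s+", " ", regex=True)
                                    .str.strip())
    out["product"] = out["product"].replace(NAME_MAP)
    return out
```

Running this before filtering by product name ensures that all rows for one product actually land in the same series.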

Initial cleaning

Three cleaning steps matter most for wholesale market data: filling holiday gaps with forward-fill (not interpolation — the last recorded price is the best proxy for a non-trading day), reindexing to a continuous daily frequency, and removing outliers caused by data entry errors using IQR filtering.

Python — basic cleaning
product = "PAPA PASTUSA"
df_prod = df[df["product"] == product].copy()

df_prod["date"] = pd.to_datetime(df_prod["date"])
df_prod = df_prod.sort_values("date").reset_index(drop=True)

# Fill missing dates — holidays have no price entries in Corabastos exports
# Forward-fill is correct here: the last traded price holds during market closure
idx = pd.date_range(df_prod["date"].min(), df_prod["date"].max(), freq="D")
df_prod = df_prod.set_index("date").reindex(idx).ffill()
df_prod = df_prod.reset_index().rename(columns={"index": "date"})

# Remove outliers from data entry errors (IQR filter)
Q1 = df_prod["price_kg"].quantile(0.25)
Q3 = df_prod["price_kg"].quantile(0.75)
IQR = Q3 - Q1
df_prod = df_prod[
    (df_prod["price_kg"] >= Q1 - 1.5 * IQR) &
    (df_prod["price_kg"] <= Q3 + 1.5 * IQR)
]

Seasonal decomposition with STL

Food prices exhibit multiple simultaneous seasonality patterns: weekly (market supply cycles), monthly (wage payment effects on demand), and agricultural seasonal (harvest/scarcity cycles). STL (Seasonal and Trend decomposition using Loess) separates trend, seasonality, and residuals robustly — even in the presence of outliers — allowing the LSTM to focus on learning real price patterns rather than decomposing seasonality itself.

Python — STL seasonal decomposition
from statsmodels.tsa.seasonal import STL

stl = STL(df_prod["price_kg"], period=7)  # weekly seasonality
result = stl.fit()

df_prod["trend"]    = result.trend
df_prod["seasonal"] = result.seasonal
df_prod["residual"] = result.resid

result.plot()

Cyclic encoding of temporal variables

Day-of-week encoded as integers (0–6) tells the model that Sunday (6) and Monday (0) are maximally distant — but they're actually adjacent. Sine/cosine encoding wraps the scale so temporal proximity is preserved correctly. This is a small but measurable improvement for models that need to learn weekly patterns.

Python — cyclic encoding
import numpy as np

df_prod["day_of_week"]  = df_prod["date"].dt.dayofweek
df_prod["dow_sin"]      = np.sin(2 * np.pi * df_prod["day_of_week"] / 7)
df_prod["dow_cos"]      = np.cos(2 * np.pi * df_prod["day_of_week"] / 7)

df_prod["month"]        = df_prod["date"].dt.month
df_prod["month_sin"]    = np.sin(2 * np.pi * df_prod["month"] / 12)
df_prod["month_cos"]    = np.cos(2 * np.pi * df_prod["month"] / 12)

Lag features and rolling statistics

Lag features are time-shifted versions of the price series: lag-1 is yesterday's price, lag-7 is the price from the same day last week (particularly powerful for weekly-seasonal markets like Corabastos). rolling_std_7 captures recent price volatility — when prices have been unstable, the model should predict with more caution, and making volatility explicit helps it do that.

Python — lag features and rolling statistics
for lag in [1, 2, 3, 7, 14, 21]:
    df_prod[f"lag_{lag}"] = df_prod["price_kg"].shift(lag)

df_prod["rolling_mean_7"]  = df_prod["price_kg"].rolling(7).mean()
df_prod["rolling_std_7"]   = df_prod["price_kg"].rolling(7).std()
df_prod["rolling_mean_14"] = df_prod["price_kg"].rolling(14).mean()

df_prod = df_prod.dropna().reset_index(drop=True)

Normalization and sequence construction

Python — normalization and LSTM sequences
from sklearn.preprocessing import MinMaxScaler

features = ["price_kg", "trend", "seasonal",
            "dow_sin", "dow_cos", "month_sin", "month_cos",
            "lag_1", "lag_7", "rolling_mean_7", "rolling_std_7"]

scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(df_prod[features])

WINDOW = 30  # 30-day lookback window

X, y = [], []
for i in range(WINDOW, len(data_scaled)):
    X.append(data_scaled[i-WINDOW:i])
    y.append(data_scaled[i, 0])  # predict next-day price (index 0)

X = np.array(X)  # shape: (samples, 30, n_features)
y = np.array(y)

Modeling

With prepared data and sequences built, we train a simple LSTM baseline first, then the LSTM-MSNet architecture that processes multiple temporal scales in parallel. The baseline is not a throwaway — it tells you whether architectural complexity is actually justified by the data.

Temporal split

For time-series models, random splits are methodologically incorrect. Training on a random sample of dates and validating on another introduces data leakage — the model effectively "sees" future patterns during training. Always split temporally: train on earlier periods, validate on later ones. This simulates real forecasting conditions where only past data is available.

Python — temporal split
split = int(len(X) * 0.8)

X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(f"Train: {X_train.shape}, Test: {X_test.shape}")

Baseline: single-layer LSTM

Python — LSTM baseline
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

baseline_model = Sequential([
    LSTM(64, input_shape=(X_train.shape[1], X_train.shape[2])),
    Dropout(0.2),
    Dense(32, activation="relu"),
    Dense(1)
])

baseline_model.compile(optimizer="adam", loss="mse", metrics=["mae"])

baseline_model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.1,
    verbose=1
)

LSTM-MSNet architecture (multiple temporal scales)

LSTM-MSNet processes the same price series in three temporal views simultaneously: daily (full resolution for short-term patterns), weekly (a 7-day moving average that emphasizes weekly cycles), and biweekly (a 14-day moving average for longer-term trends). Three separate LSTM branches capture each view independently, then concatenate before the output layer, so the final prediction is informed by all three at once. On food price series with multiple overlapping seasonality patterns, this consistently outperformed the single-scale LSTM in this project.

Python — LSTM-MSNet architecture
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, LSTM, Dense, Dropout,
                                      Concatenate, AveragePooling1D)

def build_lstm_msnet(window, n_features):
    inp = Input(shape=(window, n_features))

    # Branch 1: full daily resolution — captures short-term price movements
    x1 = LSTM(64, return_sequences=False)(inp)
    x1 = Dropout(0.2)(x1)

    # Branch 2: 7-day moving average (stride 1, so this smooths rather than
    # subsamples) focuses the branch on weekly market cycles
    pool2 = AveragePooling1D(pool_size=7, strides=1, padding="same")(inp)
    x2 = LSTM(32, return_sequences=False)(pool2)
    x2 = Dropout(0.2)(x2)

    # Branch 3: 14-day moving average for longer demand/supply shifts
    pool3 = AveragePooling1D(pool_size=14, strides=1, padding="same")(inp)
    x3 = LSTM(16, return_sequences=False)(pool3)
    x3 = Dropout(0.2)(x3)

    # Fuse all scales before output
    merged  = Concatenate()([x1, x2, x3])
    hidden  = Dense(32, activation="relu")(merged)
    output  = Dense(1)(hidden)

    model = Model(inputs=inp, outputs=output)
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

msnet_model = build_lstm_msnet(WINDOW, X_train.shape[2])

msnet_model.fit(
    X_train, y_train,
    epochs=80,
    batch_size=32,
    validation_split=0.1,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
    ],
    verbose=1
)

msnet_model.save("models/lstm_msnet.h5")

Metrics and comparison

Both the baseline and LSTM-MSNet are evaluated with standard regression metrics. Denormalization is mandatory to interpret errors in real-world units (Colombian pesos per kilogram): an error of 0.05 in scaled units says nothing about actual price accuracy. The denormalization technique below correctly handles the case where price is one column among several in a multi-feature scaler.

Evaluation code

Python — evaluation and denormalization
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate_model(model, X_test, y_test, scaler, n_features):
    pred_scaled = model.predict(X_test).flatten()

    # Denormalize: price is column index 0 in the multi-feature scaler
    # Pad with zeros for all other feature columns, then inverse_transform
    dummy_pred = np.zeros((len(pred_scaled), n_features))
    dummy_pred[:, 0] = pred_scaled
    pred = scaler.inverse_transform(dummy_pred)[:, 0]

    dummy_real = np.zeros((len(y_test), n_features))
    dummy_real[:, 0] = y_test
    real = scaler.inverse_transform(dummy_real)[:, 0]

    mae  = mean_absolute_error(real, pred)
    rmse = np.sqrt(mean_squared_error(real, pred))
    mape = np.mean(np.abs((real - pred) / real)) * 100

    print(f"MAE:  {mae:.2f} COP/kg")
    print(f"RMSE: {rmse:.2f} COP/kg")
    print(f"MAPE: {mape:.2f}%")
    return real, pred

print("=== Baseline LSTM ===")
evaluate_model(baseline_model, X_test, y_test, scaler, len(features))

print("\n=== LSTM-MSNet ===")
evaluate_model(msnet_model, X_test, y_test, scaler, len(features))

Model comparison table

Model               MAE (COP/kg)   RMSE (COP/kg)   MAPE (%)   Notes
Baseline LSTM       180            240             8.2        Good trend capture, fails on price spikes
LSTM-MSNet          120            165             5.4        Better weekly seasonality capture — selected for the optimization module
ARIMA (reference)   210            280             9.8        Classical benchmark without engineered features

LSTM-MSNet outperforms the baseline across all metrics. Moving from 8.2% to 5.4% MAPE means predictions land, on average, within 5.4% of the actual price: precise enough that the downstream optimization module works with realistic inputs and produces a genuinely useful menu recommendation rather than one that is optimal only against wrong prices.
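One comparison worth adding to any thesis table: a naive persistence baseline that forecasts each day with the price from seven days earlier. If the LSTM does not beat this by a clear margin, the architecture is not earning its complexity. A minimal sketch (the helper names are ours, not part of the pipeline above):

Python — persistence baseline MAPE (illustrative)

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def persistence_mape(prices, lag=7):
    """MAPE of the naive forecast price[t] = price[t - lag]."""
    prices = np.asarray(prices, dtype=float)
    return mape(prices[lag:], prices[:-lag])
```

For strongly weekly-seasonal products, lag-7 persistence is often a surprisingly hard baseline, which is exactly why it belongs in the comparison.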


Optimization and conclusions

Nutritional optimization module

With predicted prices for next week in hand, the recipe selection problem is formulated as integer linear programming: minimize total weekly food cost subject to minimum calorie, protein, and carbohydrate constraints. Decision variables represent the number of portions of each recipe. Because costs are fed from the LSTM predictions — not hardcoded — the optimizer automatically adapts the recommended menu to next week's expected market prices.

Python — nutritional optimization with PuLP
import pulp

# Recipe costs come from LSTM predictions — not hardcoded
recipes = {
    "Potato soup":       {"cost": 3200, "calories": 320, "protein": 8,  "carbs": 60},
    "Rice with chicken": {"cost": 5500, "calories": 480, "protein": 35, "carbs": 55},
    "Lentil stew":       {"cost": 2800, "calories": 280, "protein": 18, "carbs": 45},
    "Mixed salad":       {"cost": 1800, "calories": 120, "protein": 4,  "carbs": 15},
    "Scrambled eggs":    {"cost": 2200, "calories": 200, "protein": 14, "carbs": 2},
}

# Minimum daily nutritional requirements, scaled to a 7-day plan below
req_calories = 2000
req_protein  = 50
req_carbs    = 250
DAYS = 7

prob = pulp.LpProblem("NutritionalOptimization", pulp.LpMinimize)

# PuLP replaces spaces in variable names (with a warning), so sanitize them
x = {r: pulp.LpVariable(f"x_{r.replace(' ', '_')}", lowBound=0, cat="Integer")
     for r in recipes}

# Objective: minimize total weekly food cost
prob += pulp.lpSum(recipes[r]["cost"] * x[r] for r in recipes)

# Nutritional constraints: weekly totals must cover 7 days of requirements
prob += pulp.lpSum(recipes[r]["calories"] * x[r] for r in recipes) >= DAYS * req_calories
prob += pulp.lpSum(recipes[r]["protein"]  * x[r] for r in recipes) >= DAYS * req_protein
prob += pulp.lpSum(recipes[r]["carbs"]    * x[r] for r in recipes) >= DAYS * req_carbs

prob.solve(pulp.PULP_CBC_CMD(msg=0))

print(f"Optimal weekly cost: ${pulp.value(prob.objective):,.0f} COP")
for r in recipes:
    if pulp.value(x[r]) > 0:
        print(f"  {r}: {int(pulp.value(x[r]))} portion(s)")

Key findings

Finding 1

Weekly seasonality dominates price variation

Potato, onion, and tomato prices show a consistent weekly pattern at Corabastos: they drop on Mondays and Tuesdays when supply peaks from regional arrivals, then rise toward the weekend as inventory depletes. Explicitly modeling this seasonality with STL and cyclic encoding improves MAPE by approximately 2.5 percentage points over a model trained on raw prices alone.

Finding 2

LSTM-MSNet outperforms single-scale LSTM on multi-seasonal series

Processing the series at three temporal scales in parallel and fusing them before the output layer measurably improves all metrics compared to a single-scale LSTM. The improvement is most pronounced in capturing price spike events — exactly where a single-scale LSTM tends to smooth over the signal. For food price series where spikes represent real supply shocks, this matters.

Finding 3

The optimizer reduces weekly food cost by 15–25% vs. manual selection

Comparing the optimizer-recommended menu against manually selected menus that meet the same nutritional requirements, the automated pipeline reduces weekly food cost by 15–25% consistently. The savings come primarily from exploiting predicted low-price windows for high-calorie staples — a timing optimization that is practically impossible to do manually across multiple products simultaneously.

Finding 4

Pipeline integration is the core research contribution

Neither module delivers value in isolation. Price forecasting without optimization is just a prediction. Optimization without price forecasting uses stale or assumed prices. Only the integrated pipeline — predict next week's prices, then optimize the menu against those prices — produces the outcome that matters: a nutritionally complete, cost-minimized weekly menu that adapts to real market conditions.
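The glue between the two modules reduces to a small costing step: convert the LSTM's per-kilogram forecasts into per-portion recipe costs, then feed those into the PuLP model in place of hardcoded values. A sketch under assumed data shapes (the function name and the ingredient quantities are illustrative, not taken from the project code):

Python — from price forecasts to recipe costs (illustrative)

```python
def recipe_costs_from_forecasts(recipe_ingredients, predicted_prices):
    """Convert per-kg price forecasts into per-portion recipe costs.

    recipe_ingredients: {recipe: {ingredient: kg per portion}}
    predicted_prices:   {ingredient: forecast price in COP per kg}
    """
    return {
        recipe: sum(kg * predicted_prices[ing] for ing, kg in ingredients.items())
        for recipe, ingredients in recipe_ingredients.items()
    }

# Illustrative quantities and forecasts, not real Corabastos data
forecasts = {"potato": 2100.0, "rice": 3400.0, "chicken": 9800.0}
quantities = {
    "Potato soup":       {"potato": 0.40},
    "Rice with chicken": {"rice": 0.15, "chicken": 0.25},
}
costs = recipe_costs_from_forecasts(quantities, forecasts)
```

After this step, the `cost` field of each recipe in the optimization model is simply `costs[r]`, so the recommended menu automatically tracks next week's expected prices.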

Limitations and future work

  • The model was trained on Corabastos Bogotá data — generalizing to other markets requires retraining and likely different seasonality periods.
  • Nutritional tables used are approximations — production use requires certified sources from national health institutes or USDA equivalents.
  • Exogenous variables — weather events, commodity futures, macroeconomic shocks — are not incorporated, though they meaningfully affect agricultural prices.
  • The optimizer assumes prices are constant within the week — a daily price update loop would make recommendations more precise and actionable.

How to adapt this project to your thesis: If your institution has access to local market price data, replace Corabastos with your source — the LSTM-MSNet architecture and PuLP optimization module are fully reusable. The most important adaptations are: adjusting the STL period to match your market's dominant seasonality cycle, replacing the nutritional tables with certified references from your country's health authority, and calibrating the optimization constraints to match the dietary guidelines relevant to your study population.
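For the first of those adaptations (finding the dominant seasonality period in a new market), a rough autocorrelation scan is usually enough before committing to an STL period. A hedged sketch (the helper is ours, not from the project code, and it assumes a detrended or roughly trend-free series):

Python — estimating the dominant seasonality period (illustrative)

```python
import numpy as np

def dominant_period(series, max_period=60):
    """Return the lag in [1, max_period] with the highest autocorrelation."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    acf = acf / acf[0]                                  # normalize by lag 0
    return int(np.argmax(acf[1 : max_period + 1]) + 1)  # skip lag 0
```

If this returns 7 for your market, the `STL(period=7)` setting from this article transfers directly; a different value means the decomposition period and the pooling windows should change with it.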