Predict Concrete Compressive Strength from Mix Composition Using Machine Learning
Full regression pipeline: EDA, PCA dimensionality reduction, SVR vs KNN vs Random Forest benchmarking, GridSearchCV tuning, and FastAPI deployment for industrial quality control.
Project overview
What you'll find in this project
- How to predict concrete compressive strength at 1, 3, 7, and 28 days from mix composition.
- Reducing 31 correlated chemical attributes using PCA and VarianceThreshold.
- SVR, KNN, and Random Forest benchmarking with cross-validation.
- GridSearchCV hyperparameter tuning and FastAPI deployment for real-time QC.
This project solves a real problem in the cement industry: given that we know the chemical composition of a cement sample — clinker, gypsum, limestone, pozzolan, and so on — can we predict its 28-day compressive strength without waiting 28 days for physical testing?
Compressive strength is the critical quality metric for cement. Measuring it physically requires casting test specimens and curing them for 28 days. An ML model can estimate it from day one based on composition alone — accelerating quality control and reducing material waste before it reaches the construction site.
The dataset contains 31 chemical composition attributes of Type ICo cement, with high intercorrelation between variables — making this an ideal case for exploring feature selection and dimensionality reduction techniques before modeling.
Use cases for this model
- Early quality control: predict 28-day compressive strength from composition measured at the plant, eliminating the physical waiting period.
- Formulation optimization: simulate how strength changes when ingredient proportions are varied before production runs.
- Anomaly detection: flag batches with atypical composition before curing begins, preventing defective product from reaching downstream processes.
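The anomaly-detection use case can be sketched with an outlier detector fit on historical composition data. This is an illustrative sketch, not part of the project's pipeline: IsolationForest, the synthetic three-attribute batches, and the thresholds are all assumptions here.

```python
# Sketch: flag a batch with atypical composition before curing begins.
# Synthetic data stands in for the plant's historical composition table.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 historical batches of three hypothetical composition attributes
normal_batches = rng.normal(loc=[70.0, 5.0, 10.0], scale=[2.0, 0.5, 1.0], size=(200, 3))
odd_batch = np.array([[55.0, 9.0, 20.0]])  # far outside the historical cloud

detector = IsolationForest(contamination=0.05, random_state=42)
detector.fit(normal_batches)

print(detector.predict(odd_batch))  # -1 flags the batch as anomalous
```

In production the detector would be refit periodically on recent conforming batches, so that slow formulation drift does not trigger false alarms.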
Tech stack
- Python 3.10+ — main language
- Scikit-learn — SVR, KNN, Random Forest, PCA, GridSearchCV, StandardScaler
- pandas / numpy — data manipulation and analysis
- matplotlib — prediction visualization and learning curves
- FastAPI + pydantic — REST API deployment for plant integration
- pickle — trained model serialization
Environment setup
This project runs locally or in Google Colab. The original notebook was developed in Colab with Google Drive access for reading the Excel data file. Note that openpyxl is required to read .xlsx files with pandas — a dependency that's easy to miss and causes confusing import errors if omitted.
Install dependencies
pip install scikit-learn pandas numpy matplotlib fastapi pydantic uvicorn openpyxl
Project structure
concrete-strength-prediction/
├── data/
│ └── cement_composition_type_ico.xlsx # original dataset
├── models/
│ └── rf_cement.pkl # trained Random Forest model
├── notebooks/
│ ├── 01_eda.ipynb
│ ├── 02_feature_selection.ipynb
│ ├── 03_benchmarking.ipynb
│ └── 04_deploy.ipynb
├── src/
│ ├── preprocessing.py
│ └── model.py
├── api/
│ └── main.py # FastAPI endpoint
└── requirements.txt
Loading the dataset
The dataset is an Excel file with 31 chemical composition columns and 4 target columns for compressive strength in PSI at 1, 3, 7, and 28 days of curing. The first row is a title row, and the actual column headers are on row 2 — so header=1 is required. Missing this is one of the most common setup errors on this dataset.
import pandas as pd
data = pd.read_excel("data/cement_composition_type_ico.xlsx", header=1)
print(data.shape) # (n_samples, 35)
print(data.head())
print(data.describe())
# Check available strength targets
targets = ["CS 1-day (psi)", "CS 3-day (psi)", "CS 7-day (psi)", "CS 28-day (psi)"]
print(data[targets].describe())
EDA and feature engineering
With 31 chemical composition attributes and high intercorrelation between them, feature selection and dimensionality reduction are the most critical stages of the pipeline. Training a model on all raw features without filtering risks overfitting and produces a black box that plant engineers can't interrogate or trust.
Initial cleaning
Drop rows with null values and separate the dataset into features (X) and target (y). The column Clinker I Total (%) is removed because it is a linear combination of other clinker columns — keeping it would introduce perfect multicollinearity. Calcined Clay (%) is dropped because all its values are zero, meaning it carries no information whatsoever for the model.
import pandas as pd
data = data.dropna()
# Features: columns 2 through 33, minus redundant columns
df_features = data.iloc[:, 2:34].copy()
del df_features["Clinker I Total (%)"] # linear combination — causes multicollinearity
df_features.drop(["Calcined Clay (%)"], axis=1, inplace=True) # zero-variance — no signal
# Target: 28-day compressive strength
y = data["CS 28-day (psi)"]  # select by name — positional indexing is fragile here
print(f"Features: {df_features.shape[1]}")
print(f"Samples: {len(y)}")
print(f"Target stats:\n{y.describe()}")
Correlation analysis
Before reducing dimensions, compute the Pearson correlation matrix to understand which composition attributes are most related to compressive strength. This guides feature selection and informs which components the plant engineer should monitor most closely.
import matplotlib.pyplot as plt
# Correlation of all numeric attributes with the targets
corr_matrix = data.corr(numeric_only=True)  # numeric_only avoids errors on non-numeric columns
# Feature correlation with 28-day compressive strength
corr_target = corr_matrix["CS 28-day (psi)"].sort_values(ascending=False)
print(corr_target.head(10))
# Scatter matrix of the most correlated variables
columns = ["CS 28-day (psi)", "Clinker I Conforming (%)", "Clinker I Non-Conforming (%)"]
pd.plotting.scatter_matrix(data[columns], figsize=(10, 8), alpha=0.5)
plt.suptitle("Scatter Matrix — variables most correlated with 28-day CS")
plt.tight_layout()
plt.show()
Dimensionality reduction with PCA
With 31 highly correlated attributes, PCA reduces dimensionality while preserving explained variance. The key is to plot cumulative variance vs. number of components first — then choose the number that captures 95% or more. Never hardcode the number of components without checking this plot first; the answer varies significantly across datasets.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Fit full PCA to visualize explained variance
pca_full = PCA()
pca_full.fit(df_features)
# Cumulative variance plot
plt.figure(figsize=(10, 5))
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.axhline(y=0.95, color="red", linestyle="--", label="95% variance threshold")
plt.legend()
plt.title("PCA — Cumulative variance by number of components")
plt.grid(True, alpha=0.3)
plt.show()
# Reduce to 10 components (~95% of variance captured)
pca = PCA(n_components=10)
df_reduced = pca.fit_transform(df_features)
print(f"Variance explained with 10 components: {pca.explained_variance_ratio_.sum():.2%}")
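Note that plain PCA is driven by raw variance, so attributes measured on larger scales dominate the components. A common variant (not what the notebook does, just an alternative worth knowing) standardizes first so components reflect correlation structure instead. A minimal sketch on synthetic correlated data:

```python
# Variant: standardize before PCA so components capture correlation
# structure rather than raw scale differences between attributes.
# Synthetic data stands in for df_features here.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 5))
# 8 columns, 3 of which are near-duplicates of the first 3
X = np.hstack([base, base[:, :3] + 0.01 * rng.normal(size=(100, 3))])

# A float n_components tells PCA to keep enough components for 95% variance
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipe.fit_transform(X)
print(X_reduced.shape[1])  # fewer components than the original 8 columns
```

Passing a float to `n_components` also removes the need to hardcode a component count after reading the cumulative-variance plot.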
Feature filtering with VarianceThreshold
As an alternative to PCA for simpler pipelines, VarianceThreshold removes near-constant features — attributes that barely change across samples and therefore carry almost no predictive signal. It's a lighter intervention than PCA, preserving the original feature space while eliminating noise.
from sklearn.feature_selection import VarianceThreshold
# Remove features with variance below p*(1-p) = 0.16 for p = 0.8 —
# the Bernoulli-variance heuristic from the scikit-learn docs
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
df_filtered = sel.fit_transform(df_features)
print(f"Original features: {df_features.shape[1]}")
print(f"Features after VarianceThreshold: {df_filtered.shape[1]}")
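One practical caveat: `fit_transform` returns a NumPy array, so the column names are lost. `get_support()` recovers the boolean mask of kept features, which matters when you need to report *which* attributes survived. A small sketch with hypothetical columns:

```python
# Recovering column names after VarianceThreshold, which returns a
# plain NumPy array. Columns here are illustrative.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "gypsum": [4.8, 5.1, 4.9, 5.0],          # varies across samples
    "calcined_clay": [0.0, 0.0, 0.0, 0.0],    # constant: zero variance
    "limestone": [10.2, 9.8, 10.5, 10.0],
})
sel = VarianceThreshold(threshold=0.0)  # drop only strictly constant columns
filtered = sel.fit_transform(df)
kept = df.columns[sel.get_support()]
print(list(kept))  # ['gypsum', 'limestone']
```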
Train/val/test split and normalization
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df_features.copy()
# Split: 80% train, 10% val, 10% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Normalize: mean 0, variance 1 — critical for SVR and KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
print(f"Train: {X_train_scaled.shape}")
print(f"Val: {X_val_scaled.shape}")
print(f"Test: {X_test_scaled.shape}")
Benchmarking and modeling
We compare three classic regression models before investing time in hyperparameter tuning: SVR, KNN, and Random Forest. The primary selection criterion is RMSE on cross-validation — because in an industrial QC context, the cost of error is measured in PSI deviation from spec, and RMSE penalizes large errors more than MAE does.
Benchmarking: SVR, KNN, and Random Forest
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
import numpy as np
import pandas as pd
models = {
"Random Forest": RandomForestRegressor(random_state=42),
"SVR": SVR(C=1.0, epsilon=0.2),
"KNN": KNeighborsRegressor(),
}
errors = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train,
                             scoring="neg_root_mean_squared_error", cv=5)
    errors[name] = -scores.mean()
    print(f"{name}: RMSE = {-scores.mean():.2f} PSI")
eval_models = pd.DataFrame(errors.items(), columns=["Model", "RMSE"])
print(eval_models.sort_values("RMSE"))
Tuning the best model: Random Forest with GridSearchCV
Random Forest outperforms SVR and KNN in all cross-validation runs. The tuning focuses on n_estimators (stability vs. training cost), max_depth (overfitting control), and min_samples_leaf (smoothing predictions on small sample regions — particularly relevant for high-strength outliers in cement data).
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
param_grid = [
{
"n_estimators": [100, 200, 300],
"max_features": ["sqrt", "log2"],
"max_depth": [None, 10, 20],
"min_samples_leaf": [1, 2, 4],
}
]
grid_search = GridSearchCV(
RandomForestRegressor(random_state=42),
param_grid,
cv=5,
scoring="neg_root_mean_squared_error",
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: {-grid_search.best_score_:.2f} PSI")
Save the trained model
import pickle
# Train the final model on all training data
rf_final = RandomForestRegressor(**grid_search.best_params_, random_state=42)
rf_final.fit(X_train_scaled, y_train)
# Save model and scaler — both are needed at inference time
with open("models/rf_cement.pkl", "wb") as f:
    pickle.dump(rf_final, f)
with open("models/scaler_cement.pkl", "wb") as f:
    pickle.dump(scaler, f)
print("Model and scaler saved successfully.")
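Before relying on the pickled files, it's worth verifying that a serialized model reproduces the original's predictions exactly. A minimal in-memory round-trip check on a tiny synthetic forest (in the project, the same check would load models/rf_cement.pkl from disk):

```python
# Sanity check: a serialized model must reproduce the original predictions.
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 4)), rng.normal(size=50)

rf = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)
rf_restored = pickle.loads(pickle.dumps(rf))  # in-memory round-trip

assert np.allclose(rf.predict(X), rf_restored.predict(X))
print("round-trip OK: restored model matches original predictions")
```

Remember that pickles are tied to the scikit-learn version that produced them; pinning the version in requirements.txt avoids silent incompatibilities at load time.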
Metrics and comparison
All models are evaluated on the held-out test set — data no model saw during training or tuning. RMSE is the primary metric, but MAE and MAPE are reported to give a complete picture of error magnitude in both absolute and relative terms.
Final evaluation on the test set
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
import matplotlib.pyplot as plt
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
y_true = y_test
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"MAE: {mae:.2f} PSI")
print(f"RMSE: {rmse:.2f} PSI")
print(f"MAPE: {mape:.2f}%")
# Actual vs. predicted scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(y_true, y_pred, alpha=0.6)
plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()],
"r--", label="Perfect prediction")
plt.xlabel("Actual strength (PSI)")
plt.ylabel("Predicted strength (PSI)")
plt.title("Actual vs. Predicted — Tuned Random Forest")
plt.legend()
plt.tight_layout()
plt.show()
Model comparison table
| Model | MAE (PSI) | RMSE (PSI) | MAPE (%) | Notes |
|---|---|---|---|---|
| SVR | ~420 | ~560 | ~14% | Scale-sensitive; works only with normalization. High sensitivity to C and epsilon. |
| KNN | ~380 | ~510 | ~12% | Good quick baseline; degrades with correlated features due to distance metric distortion. |
| Random Forest (base) | ~280 | ~370 | ~9% | Best untuned model; robust to correlated features by design. |
| Random Forest (GridSearchCV) | ~235 | ~315 | ~7.5% | Selected for production. Main gain is on high-strength samples where the base model underestimates. |
GridSearchCV tuning reduces RMSE by approximately 15% compared to the untuned Random Forest baseline. The most notable improvement is in the high-strength region, where the base model tended to underpredict — exactly the region that matters most for structural cement specifications.
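Claims like "underpredicts in the high-strength region" can be checked directly by binning residuals by the actual target value. A hedged sketch on synthetic data (in the project, y_true and y_pred would come from the held-out test set):

```python
# Where does a model under- or over-predict? Bin residuals by actual value.
# Synthetic data with a deliberate high-end underprediction for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
y_true = rng.uniform(3000, 7000, size=200)  # strengths in PSI
y_pred = y_true - 0.05 * (y_true - 3000) + rng.normal(0, 100, 200)

resid = pd.DataFrame({"actual": y_true, "residual": y_true - y_pred})
resid["bin"] = pd.cut(resid["actual"], bins=[3000, 4500, 6000, 7000],
                      labels=["low", "mid", "high"])
# Positive mean residual in a bin means the model underpredicts there
print(resid.groupby("bin", observed=True)["residual"].mean())
```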
Feature importance
One of Random Forest's key advantages in industrial settings is that it provides feature importance directly, without needing SHAP. For a plant engineer who needs to explain model behavior to non-ML stakeholders, this is a significant practical advantage over SVR or neural networks.
import pandas as pd
import matplotlib.pyplot as plt
importances = pd.Series(
best_model.feature_importances_,
index=df_features.columns
).sort_values(ascending=False)
plt.figure(figsize=(10, 6))
importances.head(15).plot.bar()
plt.title("Top 15 features by importance — Random Forest")
plt.ylabel("Relative importance")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
print("Top 5 features:")
print(importances.head())
Deployment and conclusions
FastAPI deployment for plant integration
The trained model is deployed as a REST API with FastAPI. Any plant system — SCADA, LIMS, or a simple QC dashboard — can send a composition sample and receive the predicted 28-day compressive strength in real time, in both PSI and MPa to accommodate different reporting standards.
from fastapi import FastAPI
from pydantic import BaseModel
import pickle
import numpy as np
# Load model and scaler on startup
with open("models/rf_cement.pkl", "rb") as f:
    model = pickle.load(f)
with open("models/scaler_cement.pkl", "rb") as f:
    scaler = pickle.load(f)

app = FastAPI(title="API — Concrete Compressive Strength Prediction")

class CementSample(BaseModel):
    clinker_conforming: float
    clinker_non_conforming: float
    gypsum: float
    limestone: float
    pozzolan: float
    # ... remaining composition attributes

@app.post("/predict")
def predict(sample: CementSample):
    data = np.array([[
        sample.clinker_conforming,
        sample.clinker_non_conforming,
        sample.gypsum,
        sample.limestone,
        sample.pozzolan,
        # ... remaining values
    ]])
    data_scaled = scaler.transform(data)
    prediction = model.predict(data_scaled)[0]
    return {
        "predicted_cs_28day_psi": round(float(prediction), 2),
        "predicted_cs_28day_mpa": round(float(prediction) * 0.006895, 2),
    }
# Run: uvicorn api.main:app --reload
Key findings
Conforming Clinker percentage is the strongest predictor
Conforming Clinker % — the fraction of clinker that meets quality specifications — has the highest relative importance in the Random Forest model. Higher conforming clinker is directly correlated with higher 28-day compressive strength, which is chemically coherent: conforming clinker has the correct C3S and C2S phase composition that drives hydration and strength development.
Random Forest outperforms SVR and KNN with less tuning effort
Even without tuning, Random Forest achieves better RMSE than SVR and KNN at their default configurations. SVR is highly sensitive to C and epsilon — small changes produce large performance swings. KNN degrades with correlated features because Euclidean distance becomes unreliable in high-dimensional, correlated spaces. Random Forest is naturally robust to both issues through its ensemble structure and random feature subsampling at each split.
PCA reduces 31 features to 10 without significant loss
Ten principal components explain over 95% of the dataset variance. Using PCA before modeling reduces training time and mitigates the effect of high feature intercorrelation. However, in this specific case, Random Forest without PCA produced slightly better results — because the ensemble's native handling of correlations made the explicit dimensionality reduction redundant. The lesson: PCA is a tool, not a default step.
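The "PCA is a tool, not a default step" point is easy to verify for any model: cross-validate both variants and compare. A sketch on synthetic correlated data (in the project, X and y would be df_features and the 28-day target; the results here are illustrative, not the project's numbers):

```python
# Does PCA help this model? Cross-validate both variants and compare RMSE.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# effective_rank < n_features produces correlated features, like the cement data
X, y = make_regression(n_samples=300, n_features=20, n_informative=8,
                       effective_rank=6, noise=10.0, random_state=42)

rf_raw = RandomForestRegressor(n_estimators=100, random_state=42)
rf_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                       RandomForestRegressor(n_estimators=100, random_state=42))

for name, est in [("RF raw", rf_raw), ("RF + PCA", rf_pca)]:
    rmse = -cross_val_score(est, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: RMSE = {rmse:.1f}")
```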
The model is directly interpretable for plant engineers
Unlike neural networks or SVR, Random Forest provides feature importances in plain terms that a plant engineer can act on. The engineer can see that conforming clinker and specific additive ratios drive most of the prediction, and adjust the mix formulation accordingly — without needing to understand backpropagation or kernel functions. That operational trust is what makes a model worth deploying.
Limitations and future work
- The dataset comes from a single plant — generalizing to other facilities or cement types requires retraining or transfer learning, and likely additional feature engineering for process variables.
- The model predicts 28-day CS only, but does not model the full curing curve (1, 3, 7, and 28 days simultaneously) — a multi-output regression approach would be more useful in production QC workflows.
- Process variables (kiln temperature, grinding time, water-cement ratio) are not included but also affect final strength — incorporating them would improve prediction in plants where this data is available.
- Adding SHAP on top of the Random Forest would enable individual sample explanations — useful for auditing anomalous predictions where the model's output disagrees significantly with laboratory results.
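On the multi-output limitation above: RandomForestRegressor accepts a 2D target natively, so predicting all four curing ages jointly requires no extra machinery. A sketch with synthetic data (in the project, the target matrix would be data[targets] from the loading step):

```python
# Multi-output regression: one forest predicting all four curing ages.
# Synthetic data here; shapes mirror the project's (features, 4 targets).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))  # composition features
base = X @ rng.normal(size=6)
# Hypothetical strengths at 1, 3, 7, 28 days, scaled versions of one signal
Y = np.column_stack([base * f + rng.normal(0, 0.1, 120)
                     for f in (0.3, 0.5, 0.8, 1.0)])

rf_multi = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, Y)
pred = rf_multi.predict(X[:1])
print(pred.shape)  # (1, 4): one prediction per curing age
```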