Predict Concrete Compressive Strength from Mix Composition Using Machine Learning
Full regression pipeline: EDA, PCA dimensionality reduction, SVR vs KNN vs Random Forest benchmarking, GridSearchCV tuning, and FastAPI deployment for industrial quality control.
Project overview
What you'll find in this project
- How to predict concrete compressive strength at 1, 3, 7, and 28 days from mix composition.
- Reducing 31 correlated chemical attributes using PCA and VarianceThreshold.
- SVR, KNN, and Random Forest benchmarking with cross-validation.
- GridSearchCV hyperparameter tuning and FastAPI deployment for real-time QC.
This project solves a real problem in the cement industry: given that we know the chemical composition of a cement sample — clinker, gypsum, limestone, pozzolan, and so on — can we predict its 28-day compressive strength without waiting 28 days for physical testing?
Compressive strength is the critical quality metric for cement. Measuring it physically requires casting test specimens and curing them for 28 days. An ML model can estimate it from day one based on composition alone — accelerating quality control and reducing material waste before it reaches the construction site.
The dataset contains 31 chemical composition attributes of Type ICo cement, with high intercorrelation between variables — making this an ideal case for exploring feature selection and dimensionality reduction techniques before modeling.
Use cases for this model
- Early quality control: predict 28-day compressive strength from composition measured at the plant, eliminating the physical waiting period.
- Formulation optimization: simulate how strength changes when ingredient proportions are varied before production runs.
- Anomaly detection: flag batches with atypical composition before curing begins, preventing defective product from reaching downstream processes.
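The anomaly-detection use case can be sketched with an outlier detector fit on historical composition data. This is an illustrative sketch, not part of the project's pipeline: IsolationForest, the synthetic three-attribute batches, and the thresholds are all assumptions here.

```python
# Sketch: flag a batch with atypical composition before curing begins.
# Synthetic data stands in for the plant's historical composition table.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 historical batches of three hypothetical composition attributes
normal_batches = rng.normal(loc=[70.0, 5.0, 10.0], scale=[2.0, 0.5, 1.0], size=(200, 3))
odd_batch = np.array([[55.0, 9.0, 20.0]])  # far outside the historical cloud

detector = IsolationForest(contamination=0.05, random_state=42)
detector.fit(normal_batches)

print(detector.predict(odd_batch))  # -1 flags the batch as anomalous
```

In production the detector would be refit periodically on recent conforming batches, so that slow formulation drift does not trigger false alarms.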
Tech stack
- Python 3.10+ — main language
- Scikit-learn — SVR, KNN, Random Forest, PCA, GridSearchCV, StandardScaler
- pandas / numpy — data manipulation and analysis
- matplotlib — prediction visualization and learning curves
- FastAPI + pydantic — REST API deployment for plant integration
- pickle — trained model serialization
Environment setup
This project runs locally or in Google Colab. The original notebook was developed in Colab with Google Drive access for reading the Excel data file. Note that openpyxl is required to read .xlsx files with pandas — a dependency that's easy to miss and causes confusing import errors if omitted.
Install dependencies
pip install scikit-learn pandas numpy matplotlib fastapi pydantic uvicorn openpyxl
Project structure
concrete-strength-prediction/
├── data/
│ └── cement_composition_type_ico.xlsx # original dataset
├── models/
│ └── rf_cement.pkl # trained Random Forest model
├── notebooks/
│ ├── 01_eda.ipynb
│ ├── 02_feature_selection.ipynb
│ ├── 03_benchmarking.ipynb
│ └── 04_deploy.ipynb
├── src/
│ ├── preprocessing.py
│ └── model.py
├── api/
│ └── main.py # FastAPI endpoint
└── requirements.txt
Loading the dataset
The dataset is an Excel file with 31 chemical composition columns and 4 target columns for compressive strength in PSI at 1, 3, 7, and 28 days of curing. The first row is a title row, and the actual column headers are on row 2 — so header=1 is required. Missing this is one of the most common setup errors on this dataset.
import pandas as pd
data = pd.read_excel("data/cement_composition_type_ico.xlsx", header=1)
print(data.shape) # (n_samples, 35)
print(data.head())
print(data.describe())
# Check available strength targets
targets = ["CS 1-day (psi)", "CS 3-day (psi)", "CS 7-day (psi)", "CS 28-day (psi)"]
print(data[targets].describe())
EDA and feature engineering
With 31 chemical composition attributes and high intercorrelation between them, feature selection and dimensionality reduction are the most critical stages of the pipeline. Training a model on all raw features without filtering risks overfitting and produces a black box that plant engineers can't interrogate or trust.
Initial cleaning
Drop rows with null values and separate the dataset into features (X) and target (y). The column Clinker I Total (%) is removed because it is a linear combination of other clinker columns — keeping it would introduce perfect multicollinearity. Calcined Clay (%) is dropped because all its values are zero, meaning it carries no information whatsoever for the model.
import pandas as pd
data = data.dropna()
# Features: columns 2 through 33, minus redundant columns
df_features = data.iloc[:, 2:34].copy()
del df_features["Clinker I Total (%)"] # linear combination — causes multicollinearity
df_features.drop(["Calcined Clay (%)"], axis=1, inplace=True) # zero-variance — no signal
# Target: 28-day compressive strength
y = data["CS 28-day (psi)"]  # select by name — positional indexing is fragile here
print(f"Features: {df_features.shape[1]}")
print(f"Samples: {len(y)}")
print(f"Target stats:\n{y.describe()}")
Correlation analysis
Before reducing dimensions, compute the Pearson correlation matrix to understand which composition attributes are most related to compressive strength. This guides feature selection and informs which components the plant engineer should monitor most closely.
import matplotlib.pyplot as plt
# Correlation of all numeric attributes with the targets
corr_matrix = data.corr(numeric_only=True)  # numeric_only avoids errors on non-numeric columns
# Feature correlation with 28-day compressive strength
corr_target = corr_matrix["CS 28-day (psi)"].sort_values(ascending=False)
print(corr_target.head(10))
# Scatter matrix of the most correlated variables
columns = ["CS 28-day (psi)", "Clinker I Conforming (%)", "Clinker I Non-Conforming (%)"]
pd.plotting.scatter_matrix(data[columns], figsize=(10, 8), alpha=0.5)
plt.suptitle("Scatter Matrix — variables most correlated with 28-day CS")
plt.tight_layout()
plt.show()
Dimensionality reduction with PCA
With 31 highly correlated attributes, PCA reduces dimensionality while preserving explained variance. The key is to plot cumulative variance vs. number of components first — then choose the number that captures 95% or more. Never hardcode the number of components without checking this plot first; the answer varies significantly across datasets.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Fit full PCA to visualize explained variance
pca_full = PCA()
pca_full.fit(df_features)
# Cumulative variance plot
plt.figure(figsize=(10, 5))
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.axhline(y=0.95, color="red", linestyle="--", label="95% variance threshold")
plt.legend()
plt.title("PCA — Cumulative variance by number of components")
plt.grid(True, alpha=0.3)
plt.show()
# Reduce to 10 components (~95% of variance captured)
pca = PCA(n_components=10)
df_reduced = pca.fit_transform(df_features)
print(f"Variance explained with 10 components: {pca.explained_variance_ratio_.sum():.2%}")
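Note that plain PCA is driven by raw variance, so attributes measured on larger scales dominate the components. A common variant (not what the notebook does, just an alternative worth knowing) standardizes first so components reflect correlation structure instead. A minimal sketch on synthetic correlated data:

```python
# Variant: standardize before PCA so components capture correlation
# structure rather than raw scale differences between attributes.
# Synthetic data stands in for df_features here.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 5))
# 8 columns, 3 of which are near-duplicates of the first 3
X = np.hstack([base, base[:, :3] + 0.01 * rng.normal(size=(100, 3))])

# A float n_components tells PCA to keep enough components for 95% variance
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipe.fit_transform(X)
print(X_reduced.shape[1])  # fewer components than the original 8 columns
```

Passing a float to `n_components` also removes the need to hardcode a component count after reading the cumulative-variance plot.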
Feature filtering with VarianceThreshold
As an alternative to PCA for simpler pipelines, VarianceThreshold removes near-constant features — attributes that barely change across samples and therefore carry almost no predictive signal. It's a lighter intervention than PCA, preserving the original feature space while eliminating noise.
from sklearn.feature_selection import VarianceThreshold
# Remove features with variance below p*(1-p) = 0.16 for p = 0.8 —
# the Bernoulli-variance heuristic from the scikit-learn docs
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
df_filtered = sel.fit_transform(df_features)
print(f"Original features: {df_features.shape[1]}")
print(f"Features after VarianceThreshold: {df_filtered.shape[1]}")
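One practical caveat: `fit_transform` returns a NumPy array, so the column names are lost. `get_support()` recovers the boolean mask of kept features, which matters when you need to report *which* attributes survived. A small sketch with hypothetical columns:

```python
# Recovering column names after VarianceThreshold, which returns a
# plain NumPy array. Columns here are illustrative.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "gypsum": [4.8, 5.1, 4.9, 5.0],          # varies across samples
    "calcined_clay": [0.0, 0.0, 0.0, 0.0],    # constant: zero variance
    "limestone": [10.2, 9.8, 10.5, 10.0],
})
sel = VarianceThreshold(threshold=0.0)  # drop only strictly constant columns
filtered = sel.fit_transform(df)
kept = df.columns[sel.get_support()]
print(list(kept))  # ['gypsum', 'limestone']
```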
Train/val/test split and normalization
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df_features.copy()
# Split: 80% train, 10% val, 10% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Normalize: mean 0, variance 1 — critical for SVR and KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
print(f"Train: {X_train_scaled.shape}")
print(f"Val: {X_val_scaled.shape}")
print(f"Test: {X_test_scaled.shape}")
Benchmarking and modeling
We compare three classic regression models before investing time in hyperparameter tuning: SVR, KNN, and Random Forest. The primary selection criterion is RMSE on cross-validation — because in an industrial QC context, the cost of error is measured in PSI deviation from spec, and RMSE penalizes large errors more than MAE does.
Benchmarking: SVR, KNN, and Random Forest
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
import numpy as np
import pandas as pd
models = {
"Random Forest": RandomForestRegressor(random_state=42),
"SVR": SVR(C=1.0, epsilon=0.2),
"KNN": KNeighborsRegressor(),
}
errors = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train,
                             scoring="neg_root_mean_squared_error", cv=5)
    errors[name] = -scores.mean()
    print(f"{name}: RMSE = {-scores.mean():.2f} PSI")
eval_models = pd.DataFrame(errors.items(), columns=["Model", "RMSE"])
print(eval_models.sort_values("RMSE"))
Tuning the best model: Random Forest with GridSearchCV
Random Forest outperforms SVR and KNN in all cross-validation runs. The tuning focuses on n_estimators (stability vs. training cost), max_depth (overfitting control), and min_samples_leaf (smoothing predictions on small sample regions — particularly relevant for high-strength outliers in cement data).
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
param_grid = [
{
"n_estimators": [100, 200, 300],
"max_features": ["sqrt", "log2"],
"max_depth": [None, 10, 20],
"min_samples_leaf": [1, 2, 4],
}
]
grid_search = GridSearchCV(
RandomForestRegressor(random_state=42),
param_grid,
cv=5,
scoring="neg_root_mean_squared_error",
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: {-grid_search.best_score_:.2f} PSI")
Save the trained model
import pickle
# Train the final model on all training data
rf_final = RandomForestRegressor(**grid_search.best_params_, random_state=42)
rf_final.fit(X_train_scaled, y_train)
# Save model and scaler — both are needed at inference time
with open("models/rf_cement.pkl", "wb") as f:
    pickle.dump(rf_final, f)
with open("models/scaler_cement.pkl", "wb") as f:
    pickle.dump(scaler, f)
print("Model and scaler saved successfully.")
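Before relying on the pickled files, it's worth verifying that a serialized model reproduces the original's predictions exactly. A minimal in-memory round-trip check on a tiny synthetic forest (in the project, the same check would load models/rf_cement.pkl from disk):

```python
# Sanity check: a serialized model must reproduce the original predictions.
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 4)), rng.normal(size=50)

rf = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)
rf_restored = pickle.loads(pickle.dumps(rf))  # in-memory round-trip

assert np.allclose(rf.predict(X), rf_restored.predict(X))
print("round-trip OK: restored model matches original predictions")
```

Remember that pickles are tied to the scikit-learn version that produced them; pinning the version in requirements.txt avoids silent incompatibilities at load time.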
Metrics and comparison
All models are evaluated on the held-out test set — data no model saw during training or tuning. RMSE is the primary metric, but MAE and MAPE are reported to give a complete picture of error magnitude in both absolute and relative terms.
Final evaluation on the test set
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
import matplotlib.pyplot as plt
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
y_true = y_test
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"MAE: {mae:.2f} PSI")
print(f"RMSE: {rmse:.2f} PSI")
print(f"MAPE: {mape:.2f}%")
# Actual vs. predicted scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(y_true, y_pred, alpha=0.6)
plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()],
"r--", label="Perfect prediction")
plt.xlabel("Actual strength (PSI)")
plt.ylabel("Predicted strength (PSI)")
plt.title("Actual vs. Predicted — Tuned Random Forest")
plt.legend()
plt.tight_layout()
plt.show()
Model comparison table
| Model | MAE (PSI) | RMSE (PSI) | MAPE (%) | Notes |
|---|---|---|---|---|
| SVR | ~420 | ~560 | ~14% | Scale-sensitive; works only with normalization. High sensitivity to C and epsilon. |
| KNN | ~380 | ~510 | ~12% | Good quick baseline; degrades with correlated features due to distance metric distortion. |
| Random Forest (base) | ~280 | ~370 | ~9% | Best untuned model; robust to correlated features by design. |
| Random Forest (GridSearchCV) | ~235 | ~315 | ~7.5% | Selected for production. Main gain is on high-strength samples where the base model underestimates. |
GridSearchCV tuning reduces RMSE by approximately 15% compared to the untuned Random Forest baseline. The most notable improvement is in the high-strength region, where the base model tended to underpredict — exactly the region that matters most for structural cement specifications.
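Claims like "underpredicts in the high-strength region" can be checked directly by binning residuals by the actual target value. A hedged sketch on synthetic data (in the project, y_true and y_pred would come from the held-out test set):

```python
# Where does a model under- or over-predict? Bin residuals by actual value.
# Synthetic data with a deliberate high-end underprediction for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
y_true = rng.uniform(3000, 7000, size=200)  # strengths in PSI
y_pred = y_true - 0.05 * (y_true - 3000) + rng.normal(0, 100, 200)

resid = pd.DataFrame({"actual": y_true, "residual": y_true - y_pred})
resid["bin"] = pd.cut(resid["actual"], bins=[3000, 4500, 6000, 7000],
                      labels=["low", "mid", "high"])
# Positive mean residual in a bin means the model underpredicts there
print(resid.groupby("bin", observed=True)["residual"].mean())
```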
Feature importance
One of Random Forest's key advantages in industrial settings is that it provides feature importance directly, without needing SHAP. For a plant engineer who needs to explain model behavior to non-ML stakeholders, this is a significant practical advantage over SVR or neural networks.
import pandas as pd
import matplotlib.pyplot as plt
importances = pd.Series(
best_model.feature_importances_,
index=df_features.columns
).sort_values(ascending=False)
plt.figure(figsize=(10, 6))
importances.head(15).plot.bar()
plt.title("Top 15 features by importance — Random Forest")
plt.ylabel("Relative importance")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
print("Top 5 features:")
print(importances.head())
Deployment and conclusions
FastAPI deployment for plant integration
The trained model is deployed as a REST API with FastAPI. Any plant system — SCADA, LIMS, or a simple QC dashboard — can send a composition sample and receive the predicted 28-day compressive strength in real time, in both PSI and MPa to accommodate different reporting standards.
from fastapi import FastAPI
from pydantic import BaseModel
import pickle
import numpy as np
# Load model and scaler on startup
with open("models/rf_cement.pkl", "rb") as f:
    model = pickle.load(f)
with open("models/scaler_cement.pkl", "rb") as f:
    scaler = pickle.load(f)

app = FastAPI(title="API — Concrete Compressive Strength Prediction")

class CementSample(BaseModel):
    clinker_conforming: float
    clinker_non_conforming: float
    gypsum: float
    limestone: float
    pozzolan: float
    # ... remaining composition attributes

@app.post("/predict")
def predict(sample: CementSample):
    data = np.array([[
        sample.clinker_conforming,
        sample.clinker_non_conforming,
        sample.gypsum,
        sample.limestone,
        sample.pozzolan,
        # ... remaining values
    ]])
    data_scaled = scaler.transform(data)
    prediction = model.predict(data_scaled)[0]
    return {
        "predicted_cs_28day_psi": round(float(prediction), 2),
        "predicted_cs_28day_mpa": round(float(prediction) * 0.006895, 2),
    }
# Run: uvicorn api.main:app --reload
Key findings
Conforming Clinker percentage is the strongest predictor
Conforming Clinker % — the fraction of clinker that meets quality specifications — has the highest relative importance in the Random Forest model. Higher conforming clinker is directly correlated with higher 28-day compressive strength, which is chemically coherent: conforming clinker has the correct C3S and C2S phase composition that drives hydration and strength development.
Random Forest outperforms SVR and KNN with less tuning effort
Even without tuning, Random Forest achieves better RMSE than SVR and KNN at their default configurations. SVR is highly sensitive to C and epsilon — small changes produce large performance swings. KNN degrades with correlated features because Euclidean distance becomes unreliable in high-dimensional, correlated spaces. Random Forest is naturally robust to both issues through its ensemble structure and random feature subsampling at each split.
PCA reduces 31 features to 10 without significant loss
Ten principal components explain over 95% of the dataset variance. Using PCA before modeling reduces training time and mitigates the effect of high feature intercorrelation. However, in this specific case, Random Forest without PCA produced slightly better results — because the ensemble's native handling of correlations made the explicit dimensionality reduction redundant. The lesson: PCA is a tool, not a default step.
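The "PCA is a tool, not a default step" point is easy to verify for any model: cross-validate both variants and compare. A sketch on synthetic correlated data (in the project, X and y would be df_features and the 28-day target; the results here are illustrative, not the project's numbers):

```python
# Does PCA help this model? Cross-validate both variants and compare RMSE.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# effective_rank < n_features produces correlated features, like the cement data
X, y = make_regression(n_samples=300, n_features=20, n_informative=8,
                       effective_rank=6, noise=10.0, random_state=42)

rf_raw = RandomForestRegressor(n_estimators=100, random_state=42)
rf_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                       RandomForestRegressor(n_estimators=100, random_state=42))

for name, est in [("RF raw", rf_raw), ("RF + PCA", rf_pca)]:
    rmse = -cross_val_score(est, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: RMSE = {rmse:.1f}")
```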
The model is directly interpretable for plant engineers
Unlike neural networks or SVR, Random Forest provides feature importances in plain terms that a plant engineer can act on. The engineer can see that conforming clinker and specific additive ratios drive most of the prediction, and adjust the mix formulation accordingly — without needing to understand backpropagation or kernel functions. That operational trust is what makes a model worth deploying.
Limitations and future work
- The dataset comes from a single plant — generalizing to other facilities or cement types requires retraining or transfer learning, and likely additional feature engineering for process variables.
- The model predicts 28-day CS only, but does not model the full curing curve (1, 3, 7, and 28 days simultaneously) — a multi-output regression approach would be more useful in production QC workflows.
- Process variables (kiln temperature, grinding time, water-cement ratio) are not included but also affect final strength — incorporating them would improve prediction in plants where this data is available.
- Adding SHAP on top of the Random Forest would enable individual sample explanations — useful for auditing anomalous predictions where the model's output disagrees significantly with laboratory results.
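On the multi-output limitation above: RandomForestRegressor accepts a 2D target natively, so predicting all four curing ages jointly requires no extra machinery. A sketch with synthetic data (in the project, the target matrix would be data[targets] from the loading step):

```python
# Multi-output regression: one forest predicting all four curing ages.
# Synthetic data here; shapes mirror the project's (features, 4 targets).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))  # composition features
base = X @ rng.normal(size=6)
# Hypothetical strengths at 1, 3, 7, 28 days, scaled versions of one signal
Y = np.column_stack([base * f + rng.normal(0, 0.1, 120)
                     for f in (0.3, 0.5, 0.8, 1.0)])

rf_multi = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, Y)
pred = rf_multi.predict(X[:1])
print(pred.shape)  # (1, 4): one prediction per curing age
```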