Model Evaluation & Testing

📊 Step 6: Model Evaluation & Testing

Test Performance on Unseen Data

Machine Learning Pipeline – Final Step

🎯 Why Evaluate on Unseen Data?

Model evaluation on unseen data is crucial to assess how well your model generalizes to new, real-world scenarios. Training accuracy alone doesn’t guarantee good performance in production.

Key Principle: A model should perform well on data it has never seen during training. This tests its ability to generalize patterns rather than memorize training examples.

✂️ Data Splitting Strategies

1. Train-Test Split

Simple division of data into training and testing sets

train_size = 70-80%
test_size = 20-30%

2. Train-Validation-Test Split

Three-way split for hyperparameter tuning

train = 60-70%
validation = 15-20%
test = 15-20%

3. K-Fold Cross-Validation

Multiple train-test splits for robust evaluation

k = 5 or 10 folds
Each fold serves as the test set once

Example: Train-Test Split in Python

from sklearn.model_selection import train_test_split

# Split data into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Maintains class distribution
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

Example: K-Fold Cross-Validation

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

# Perform 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)

print(f"Accuracy scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
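
Example: Train-Validation-Test Split

The three-way split described above can be built from two calls to train_test_split (scikit-learn has no single three-way splitter). This is a minimal sketch; the 60/20/20 proportions are just one illustrative choice.

from sklearn.model_selection import train_test_split

# First carve off the final test set (20% of the data)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Then split the remaining 80% into train (60%) and validation (20%);
# 0.25 of the remaining 80% equals 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")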

📈 Classification Metrics

Confusion Matrix

Foundation for understanding classification performance:

Actual \ Predicted | Positive            | Negative
Positive           | True Positive (TP)  | False Negative (FN)
Negative           | False Positive (FP) | True Negative (TN)

Key Metrics

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

When to use: Balanced datasets

Precision

Precision = TP / (TP + FP)

When to use: Cost of false positives is high (e.g., spam detection)

Recall (Sensitivity)

Recall = TP / (TP + FN)

When to use: Cost of false negatives is high (e.g., disease detection)

F1-Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)

When to use: Balance between precision and recall
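
Example: Metrics from a Confusion Matrix

To make the formulas above concrete, here is a small worked example with made-up counts (TP = 40, FP = 10, FN = 20, TN = 30); the numbers are purely illustrative.

# Hypothetical confusion-matrix counts (illustrative only)
TP, FP, FN, TN = 40, 10, 20, 30

accuracy = (TP + TN) / (TP + TN + FP + FN)            # (40 + 30) / 100 = 0.70
precision = TP / (TP + FP)                            # 40 / 50 = 0.80
recall = TP / (TP + FN)                               # 40 / 60 ≈ 0.67
f1 = 2 * (precision * recall) / (precision + recall)  # ≈ 0.73

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")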

Example: Classification Evaluation

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Train model and make predictions
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1:.3f}")

# Detailed report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

ROC Curve & AUC

ROC-AUC Score: Measures the model’s ability to distinguish between classes
  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random classifier
  • AUC > 0.8: Good classifier
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Get probability predictions
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate ROC AUC
auc_score = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {auc_score:.3f}")

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'AUC = {auc_score:.3f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

📉 Regression Metrics

Mean Absolute Error (MAE)

MAE = (1/n) × Σ|y_true - y_pred|

Interpretation: Average absolute difference between predictions and actual values

Mean Squared Error (MSE)

MSE = (1/n) × Σ(y_true - y_pred)²

Interpretation: Penalizes larger errors more heavily

Root Mean Squared Error (RMSE)

RMSE = √MSE

Interpretation: Same units as target variable

R² Score (Coefficient of Determination)

R² = 1 - (SS_res / SS_tot)

Interpretation: Proportion of variance explained; 1.0 is a perfect fit, 0 means no better than predicting the mean, and values can be negative for models that fit worse than the mean
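
Example: Regression Metrics by Hand

A minimal NumPy sketch that applies the four formulas above to a handful of made-up values; useful as a sanity check against the sklearn functions used in the next example.

import numpy as np

# Made-up actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # 0.500
mse = np.mean((y_true - y_pred) ** 2)    # 0.375
rmse = np.sqrt(mse)                      # ≈ 0.612
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}, R²: {r2:.3f}")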

Example: Regression Evaluation

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt

# Train model and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae:.3f}")
print(f"MSE: {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R² Score: {r2:.3f}")

# Visualize predictions vs actual
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title(f'R² = {r2:.3f}')
plt.show()

⚠️ Detecting Overfitting & Underfitting

Scenario     | Training Performance | Test Performance            | Issue
Good Fit     | High                 | High (similar to train)     | ✅ Model generalizes well
Overfitting  | Very High            | Low (much worse than train) | ❌ Model memorized training data
Underfitting | Low                  | Low                         | ❌ Model too simple

Example: Learning Curves to Detect Overfitting

from sklearn.model_selection import learning_curve

# Generate learning curves
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10)
)

# Calculate mean and std
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot learning curves
plt.plot(train_sizes, train_mean, label='Training score')
plt.plot(train_sizes, test_mean, label='Cross-validation score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1)
plt.xlabel('Training Examples')
plt.ylabel('Score')
plt.legend()
plt.show()
Warning Signs of Overfitting:
  • Training accuracy >> Test accuracy (large gap)
  • Training loss continues decreasing while validation loss increases
  • Model performs perfectly on training data but poorly on new data

Best Practices for Model Evaluation

1. Never Use Test Data During Training

Test data should remain completely unseen until final evaluation. Using it during training or hyperparameter tuning leads to overoptimistic results.
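
A common way this rule gets violated indirectly is fitting preprocessing steps (such as a scaler) on the full dataset before splitting. A minimal sketch, assuming a StandardScaler preprocessing step, that keeps all fitting inside each training fold by wrapping the steps in a scikit-learn Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# The scaler is fitted only on each training fold, never on held-out data
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42)),
])

scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.3f}")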

2. Use Stratified Splitting for Imbalanced Data

Ensures each split maintains the same class distribution as the original dataset, preventing bias in evaluation.
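
The stratify=y argument shown earlier handles this for a single train-test split. For cross-validation, here is a sketch using StratifiedKFold (assuming a classification target y and the model from the earlier examples):

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Each fold preserves the original class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
print(f"Stratified CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")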

3. Choose Metrics Based on Business Goals

Don’t rely solely on accuracy. Consider precision/recall for classification, and choose regression metrics that align with your use case.
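
For example, if false negatives are costlier than false positives, a recall-oriented metric can be plugged into cross-validation via the scoring argument. A minimal sketch using the built-in 'recall' scorer and a custom F-beta scorer (beta = 2 here is an arbitrary illustrative choice that favors recall over precision):

from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, fbeta_score

# Score the model on business-relevant metrics instead of plain accuracy
recall_scores = cross_val_score(model, X, y, cv=5, scoring='recall')

f2_scorer = make_scorer(fbeta_score, beta=2)  # beta > 1 weights recall more than precision
f2_scores = cross_val_score(model, X, y, cv=5, scoring=f2_scorer)

print(f"Recall: {recall_scores.mean():.3f}, F2: {f2_scores.mean():.3f}")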

4. Perform Cross-Validation for Small Datasets

K-fold cross-validation provides more reliable estimates when you have limited data, reducing the impact of how you split the data.

5. Monitor Both Training and Test Performance

Tracking both helps identify overfitting early. Use validation data during training to make decisions about when to stop.
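
One way to act on this during training is built-in early stopping. A minimal sketch, assuming a gradient-boosting model: it holds out a validation fraction of the training data and stops adding trees once the validation score stops improving.

from sklearn.ensemble import GradientBoostingClassifier

# Hold out 10% of the training data as a validation set and stop
# once the validation score has not improved for 10 iterations
gb = GradientBoostingClassifier(
    n_estimators=500,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=42,
)
gb.fit(X_train, y_train)

print(f"Trees actually fitted: {gb.n_estimators_}")
print(f"Train accuracy: {gb.score(X_train, y_train):.3f}")
print(f"Test accuracy:  {gb.score(X_test, y_test):.3f}")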

🔄 Complete Evaluation Pipeline Example

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# 1. Load and split data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 3. Cross-validation on training data
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"CV Mean: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# 4. Evaluate on test set
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# 5. Calculate metrics
print("\nTest Set Performance:")
print(classification_report(y_test, y_pred))
print(f"\nROC-AUC Score: {roc_auc_score(y_test, y_proba):.3f}")

# 6. Check for overfitting
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"\nTraining Accuracy: {train_score:.3f}")
print(f"Test Accuracy: {test_score:.3f}")
print(f"Difference: {abs(train_score - test_score):.3f}")

if abs(train_score - test_score) > 0.1:
    print("⚠️ Potential overfitting detected!")
else:
    print("✅ Model generalizes well")


© 2026 ML Learning Resources | Machine Learning Pipeline Series
