🎯Why Evaluate on Unseen Data?
Model evaluation on unseen data is crucial to assess how well your model generalizes to new, real-world scenarios. Training accuracy alone doesn’t guarantee good performance in production.
Key Principle: A model should perform well on data it has never seen during training. This tests its ability to generalize patterns rather than memorize training examples.
✂️Data Splitting Strategies
1. Train-Test Split
Simple division of data into training and testing sets
train_size = 70-80%
test_size = 20-30%
2. Train-Validation-Test Split
Three-way split for hyperparameter tuning (a worked sketch follows the train-test split example below)
train = 60-70%
validation = 15-20%
test = 15-20%
3. K-Fold Cross-Validation
Multiple train-test splits for robust evaluation
k = 5 or 10 folds
Each fold serves as test set once
Example: Train-Test Split in Python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
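Example: Train-Validation-Test Split
The three-way split described above can be built with two consecutive calls to train_test_split. A minimal sketch, assuming the same X and y as above (the 60/20/20 ratios are illustrative):
from sklearn.model_selection import train_test_split

# First split: hold out 20% of the data as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Second split: 25% of the remaining 80% becomes validation (20% of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")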
Example: K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
📈Classification Metrics
Confusion Matrix
Foundation for understanding classification performance:
| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | True Positive (TP) | False Negative (FN) |
| Negative | False Positive (FP) | True Negative (TN) |
Key Metrics
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
When to use: Balanced datasets
Precision
Precision = TP / (TP + FP)
When to use: Cost of false positives is high (e.g., spam detection)
Recall (Sensitivity)
Recall = TP / (TP + FN)
When to use: Cost of false negatives is high (e.g., disease detection)
F1-Score
F1 = 2 × (Precision × Recall) / (Precision + Recall)
When to use: Balance between precision and recall
Example: Classification Evaluation
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
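To tie the printed confusion matrix back to the formulas above, the binary counts can be unpacked and each metric recomputed by hand. A minimal sketch, assuming y_test and y_pred hold binary (0/1) labels:
# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_f1 = 2 * (manual_precision * manual_recall) / (manual_precision + manual_recall)
print(f"Manual accuracy: {manual_accuracy:.3f}, precision: {manual_precision:.3f}, "
      f"recall: {manual_recall:.3f}, F1: {manual_f1:.3f}")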
ROC Curve & AUC
ROC-AUC Score: Measures the model’s ability to distinguish between classes
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random classifier
- AUC > 0.8: Good classifier
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
y_proba = model.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {auc_score:.3f}")
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'AUC = {auc_score:.3f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
📉Regression Metrics
Mean Absolute Error (MAE)
MAE = (1/n) × Σ|y_true - y_pred|
Interpretation: Average absolute difference between predictions and actual values
Mean Squared Error (MSE)
MSE = (1/n) × Σ(y_true - y_pred)²
Interpretation: Penalizes larger errors more heavily
Root Mean Squared Error (RMSE)
RMSE = √MSE
Interpretation: Same units as the target variable
R² Score (Coefficient of Determination)
R² = 1 - (SS_res / SS_tot)
Interpretation: Proportion of variance explained (1.0 is perfect, 0 is no better than predicting the mean, and values can be negative for models worse than that)
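Before handing things to scikit-learn, the same four metrics can be computed directly from the formulas with NumPy. A minimal sketch on a small illustrative array:
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.8, 5.4, 2.9, 6.1])

mae = np.mean(np.abs(y_true - y_hat))           # average absolute error
mse = np.mean((y_true - y_hat) ** 2)            # squared errors penalize outliers more
rmse = np.sqrt(mse)                             # back in the target's units
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # proportion of variance explained
print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")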
Example: Regression Evaluation
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.3f}")
print(f"MSE: {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R² Score: {r2:.3f}")
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title(f'R² = {r2:.3f}')
plt.show()
⚠️Detecting Overfitting & Underfitting
| Scenario | Training Performance | Test Performance | Issue |
|---|---|---|---|
| Good Fit | High | High (similar to train) | ✅ Model generalizes well |
| Overfitting | Very High | Low (much worse than train) | ❌ Model memorized training data |
| Underfitting | Low | Low | ❌ Model too simple |
Example: Learning Curves to Detect Overfitting
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10)
)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.plot(train_sizes, train_mean, label='Training score')
plt.plot(train_sizes, test_mean, label='Cross-validation score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1)
plt.xlabel('Training Examples')
plt.ylabel('Score')
plt.legend()
plt.show()
Warning Signs of Overfitting:
- Training accuracy >> Test accuracy (large gap)
- Training loss continues decreasing while validation loss increases
- Model performs perfectly on training data but poorly on new data
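A complementary diagnostic is a validation curve, which plots training and cross-validation scores against a model complexity parameter; a widening gap as complexity grows is the classic overfitting signature. A minimal sketch, assuming max_depth as the complexity knob and the same X and y as above:
from sklearn.model_selection import validation_curve
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt

param_range = np.arange(1, 21)  # tree depths to evaluate
train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=42), X, y,
    param_name='max_depth', param_range=param_range, cv=5
)
plt.plot(param_range, train_scores.mean(axis=1), label='Training score')
plt.plot(param_range, test_scores.mean(axis=1), label='Cross-validation score')
plt.xlabel('max_depth')
plt.ylabel('Score')
plt.legend()
plt.show()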
✨Best Practices for Model Evaluation
1. Never Use Test Data During Training
Test data should remain completely unseen until final evaluation. Using it during training or hyperparameter tuning leads to overoptimistic results.
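In practice this means tuning hyperparameters with cross-validation on the training split only and scoring the test split exactly once at the end. A minimal sketch with GridSearchCV (the parameter grid is illustrative):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)  # tuning only ever sees the training data
print(f"Best params: {search.best_params_}")
print(f"Held-out test accuracy: {search.score(X_test, y_test):.3f}")  # touched once, at the very end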
2. Use Stratified Splitting for Imbalanced Data
Ensures each split maintains the same class distribution as the original dataset, preventing bias in evaluation.
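Beyond stratify=y in train_test_split (used above), StratifiedKFold keeps the class ratio in every cross-validation fold. A minimal sketch:
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)  # every fold preserves the original class balance
print(f"Stratified CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")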
3. Choose Metrics Based on Business Goals
Don’t rely solely on accuracy. Consider precision/recall for classification, and choose regression metrics that align with your use case.
4. Perform Cross-Validation for Small Datasets
K-fold cross-validation provides more reliable estimates when you have limited data, reducing the impact of how you split the data.
5. Monitor Both Training and Test Performance
Tracking both helps identify overfitting early. Use validation data during training to make decisions about when to stop.
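During cross-validation, cross_validate with return_train_score=True reports both sides in one call, making a growing train/validation gap easy to spot. A minimal sketch:
from sklearn.model_selection import cross_validate

results = cross_validate(model, X, y, cv=5, return_train_score=True)
print(f"Mean training score: {results['train_score'].mean():.3f}")
print(f"Mean validation score: {results['test_score'].mean():.3f}")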
🔄Complete Evaluation Pipeline Example
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
X = df.drop('target', axis=1)  # df is assumed to be a pandas DataFrame with a 'target' column
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"CV Mean: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print("\nTest Set Performance:")
print(classification_report(y_test, y_pred))
print(f"\nROC-AUC Score: {roc_auc_score(y_test, y_proba):.3f}")
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"\nTraining Accuracy: {train_score:.3f}")
print(f"Test Accuracy: {test_score:.3f}")
print(f"Difference: {abs(train_score - test_score):.3f}")
if abs(train_score - test_score) > 0.1:
    print("⚠️ Potential overfitting detected!")
else:
    print("✅ Model generalizes well")