Data Preparation Cheat Sheet
Step 3 of ML Pipeline

Data Preparation

Clean, Normalize, and Split Your Data

1 Data Cleaning

Remove or handle missing values, outliers, duplicates, and inconsistencies in your dataset to ensure data quality and model reliability.
  • 🔍 Missing Values: Drop rows/columns or impute with mean, median, mode, or advanced methods
  • 🎯 Outliers: Identify using IQR, Z-score, or domain knowledge; remove or cap extreme values
  • 🔄 Duplicates: Remove exact or near-duplicate records that can bias your model
  • ✅ Validation: Check data types, ranges, and format consistency across features (a validation sketch follows the example below)

Example: Handling Missing Values and Outliers

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Load sample dataset
df = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 40, 200, 28],
    'salary': [50000, 60000, 55000, np.nan, 70000, 65000, 52000],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA', 'Chicago', 'NYC']
})

# 1. Check missing values
print("Missing values:\n", df.isnull().sum())

# 2. Handle missing values - Imputation
imputer = SimpleImputer(strategy='mean')
df[['age', 'salary']] = imputer.fit_transform(df[['age', 'salary']])

# 3. Remove outliers using IQR method
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_clean = df[(df['age'] >= lower_bound) & (df['age'] <= upper_bound)]

# 4. Remove duplicates
df_clean = df_clean.drop_duplicates()

print("\nCleaned data shape:", df_clean.shape)
Output:
Missing values:
age       1
salary    1
city      0

Cleaned data shape: (6, 3)
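
The ✅ Validation bullet above is not covered by this example, so here is a minimal sketch of basic consistency checks on the cleaned frame. It assumes the df_clean frame from the example; the expected dtypes, value ranges, and city labels are illustrative assumptions, not a real schema.

# Validation sketch: dtype, range, and format checks on df_clean
expected_types = {'age': 'float64', 'salary': 'float64', 'city': 'object'}
for col, dtype in expected_types.items():
    assert str(df_clean[col].dtype) == dtype, f"{col} has unexpected dtype {df_clean[col].dtype}"

# Range checks: flag values outside plausible domain bounds (bounds are assumptions)
assert df_clean['age'].between(0, 120).all(), "age outside expected range"
assert (df_clean['salary'] > 0).all(), "non-positive salary found"

# Format consistency: categorical labels should come from a known set
valid_cities = {'NYC', 'LA', 'Chicago'}
assert set(df_clean['city']).issubset(valid_cities), "unexpected city label"

print("All validation checks passed")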

2 Data Normalization

Scale features to a similar range to improve model convergence and performance. Essential for distance-based algorithms and neural networks.
Standardization

Z-score normalization: Mean = 0, Std = 1

z = (x - μ) / σ

Use when: Data follows a normal distribution, or for algorithms sensitive to feature scale (SVM, KNN, Neural Networks)

Min-Max Scaling

Range [0, 1]: Preserves relationships

x' = (x - min) / (max - min)

Use when: You need bounded values, e.g., for image processing or neural networks with bounded activations

Robust Scaling

Uses median and IQR: Resistant to outliers

x' = (x - median) / IQR

Use when: Data contains outliers that can’t be removed

Example: Applying Different Scaling Techniques

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'feature1': [100, 200, 300, 400, 500],
    'feature2': [10, 20, 15, 25, 30],
    'feature3': [0.1, 0.5, 1.2, 2.3, 3.1]
})

# 1. Standardization (StandardScaler)
scaler_standard = StandardScaler()
df_standard = pd.DataFrame(
    scaler_standard.fit_transform(df),
    columns=[f'{col}_std' for col in df.columns]
)

# 2. Min-Max Scaling
scaler_minmax = MinMaxScaler()
df_minmax = pd.DataFrame(
    scaler_minmax.fit_transform(df),
    columns=[f'{col}_mm' for col in df.columns]
)

# 3. Robust Scaling
scaler_robust = RobustScaler()
df_robust = pd.DataFrame(
    scaler_robust.fit_transform(df),
    columns=[f'{col}_rb' for col in df.columns]
)

print("Original data:")
print(df.head())
print("\nStandardized:")
print(df_standard.head(2))
print("\nMin-Max Scaled:")
print(df_minmax.head(2))
⚡ Pro Tip
Always fit your scaler on training data only, then transform both training and test sets using the same fitted scaler to prevent data leakage!
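
A minimal sketch of that tip, assuming X_train and X_test already exist from a split:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)        # same fitted statistics reused; never refit on test data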

3 Train-Test Split

Divide your dataset into separate training and testing sets to evaluate model performance on unseen data and prevent overfitting.
1. Training Set: 60-80% of data, used to train the model
2. Validation Set: 10-20% of data, used to tune hyperparameters
3. Test Set: 10-20% of data, used for the final evaluation

Example: Splitting Data with Stratification

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Create sample dataset
np.random.seed(42)
X = pd.DataFrame({
    'feature1': np.random.randn(1000),
    'feature2': np.random.randn(1000),
    'feature3': np.random.randn(1000)
})
y = np.random.choice([0, 1], size=1000, p=[0.7, 0.3])

# 1. Simple train-test split (80-20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,
    random_state=42
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

# 2. Train-Validation-Test split
# First split: separate test set (80-20)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, 
    test_size=0.2,
    random_state=42
)

# Second split: separate validation from training (75-25 of temp = 60-20 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp,
    test_size=0.25,
    random_state=42
)

print("\n=== Train-Val-Test Split ===")
print(f"Training: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Validation: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"Test: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

# 3. Stratified split (maintains class distribution)
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,  # Maintains class proportions
    random_state=42
)

print("\n=== Class Distribution ===")
print(f"Original: {np.bincount(y) / len(y)}")
print(f"Train: {np.bincount(y_train_strat) / len(y_train_strat)}")
print(f"Test: {np.bincount(y_test_strat) / len(y_test_strat)}")
Output:
Training set size: (800, 3)
Test set size: (200, 3)

=== Train-Val-Test Split ===
Training: 600 samples (60.0%)
Validation: 200 samples (20.0%)
Test: 200 samples (20.0%)

=== Class Distribution ===
Original: [0.703 0.297]
Train: [0.7025 0.2975]
Test: [0.705 0.295]

4 Complete Pipeline Example

Putting it all together: a complete data preparation pipeline from raw data to model-ready splits.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Load dataset (example with Titanic-like data)
df = pd.DataFrame({
    'Age': [22, 38, 26, 35, np.nan, 54, 2, 27, 14, 4],
    'Fare': [7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.08, np.nan, 11.13, 16.7],
    'Pclass': [3, 1, 3, 1, 3, 1, 3, 3, 2, 3],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
})

print("=== STEP 1: Data Cleaning ===")
# Handle missing values
imputer = SimpleImputer(strategy='median')
df[['Age', 'Fare']] = imputer.fit_transform(df[['Age', 'Fare']])
print(f"Missing values after imputation: {df.isnull().sum().sum()}")

# Remove duplicates
df = df.drop_duplicates()
print(f"Dataset shape after cleaning: {df.shape}")

print("\n=== STEP 2: Feature Engineering ===")
# Separate features and target
X = df.drop('Survived', axis=1)
y = df['Survived']
print(f"Features: {X.columns.tolist()}")
print(f"Target: Survived (Classes: {y.unique()})")

print("\n=== STEP 3: Train-Test Split ===")
# Split data (80-20 with stratification)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

print("\n=== STEP 4: Feature Scaling ===")
# Scale features (fit on training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use fitted scaler

print(f"Training data mean: {X_train_scaled.mean(axis=0)}")
print(f"Training data std: {X_train_scaled.std(axis=0)}")

print("\n=== Data Ready for Modeling! ===")
print(f"X_train_scaled shape: {X_train_scaled.shape}")
print(f"X_test_scaled shape: {X_test_scaled.shape}")
Output:
=== STEP 1: Data Cleaning ===
Missing values after imputation: 0
Dataset shape after cleaning: (10, 4)

=== STEP 2: Feature Engineering ===
Features: ['Age', 'Fare', 'Pclass']
Target: Survived (Classes: [0 1])

=== STEP 3: Train-Test Split ===
Training set: (8, 3)
Test set: (2, 3)

=== STEP 4: Feature Scaling ===
Training data mean: [-0.  0.  0.]
Training data std: [1. 1. 1.]

=== Data Ready for Modeling! ===
X_train_scaled shape: (8, 3)
X_test_scaled shape: (2, 3)
🎯 Best Practices Checklist
  • Always explore data before cleaning (use df.info(), df.describe())
  • Fit preprocessing steps (scalers, imputers) on training data only
  • Use stratified splitting for imbalanced datasets
  • Document all preprocessing steps for reproducibility
  • Consider cross-validation for small datasets
  • Save your fitted scalers/imputers for deployment (see the joblib sketch below)
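
The last checklist item deserves a concrete illustration: a minimal sketch of persisting fitted preprocessing objects with joblib, assuming the scaler and imputer fitted in the pipeline example above (file names are placeholders).

import joblib

# Save the fitted preprocessing objects for deployment
joblib.dump(scaler, 'standard_scaler.joblib')
joblib.dump(imputer, 'median_imputer.joblib')

# At inference time, reload them and apply the exact same transformations
scaler = joblib.load('standard_scaler.joblib')
imputer = joblib.load('median_imputer.joblib')
# new_data_scaled = scaler.transform(imputer.transform(new_data))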

📚 Quick Reference

  • Data Cleaning: df.dropna(), SimpleImputer(), df.drop_duplicates()
  • Scaling: StandardScaler(), MinMaxScaler(), RobustScaler()
  • Splitting: train_test_split(), stratify=y, random_state=42
  • Validation: KFold(), StratifiedKFold(), cross_val_score() (sketched below)
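
The validation utilities listed above are not demonstrated elsewhere on this cheat sheet, so here is a minimal stratified cross-validation sketch. It reuses the X and y arrays from the splitting example; the LogisticRegression model is only an illustrative choice, and wrapping the scaler in a pipeline keeps each fold free of leakage.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stratified 5-fold cross-validation; scaling happens inside each fold via the pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

print("Fold accuracies:", scores)
print(f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")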
