MLPipelineOptimizationStudy: End-to-End ML Pipeline Exploration
MLPipelineOptimizationStudy is a rigorous, reproducible investigation into the art of building high-performance machine learning pipelines. Each stage of the pipeline, from raw data to final predictions, is analysed independently and in combination to understand where gains are achieved.
Pipeline Stages Covered
Data Preprocessing
- Missing value imputation strategies (mean, median, KNN, iterative).
- Outlier detection and treatment.
- Class imbalance handling (SMOTE, class weights).
- Feature scaling (Standard, MinMax, Robust).
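The imputation strategies listed above can be compared with a sketch like the following. It is illustrative only: the synthetic dataset, the 10% missing rate, the `compare_imputers` helper, and the choice of logistic regression with class weights as the downstream model are assumptions made for this example, not the study's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (about 85% / 15%) standing in for a real dataset.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

# Blank out roughly 10% of the entries at random to simulate missing values.
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}


def compare_imputers(X, y, cv=5):
    """Hypothetical helper: mean cross-validated balanced accuracy per imputer."""
    scores = {}
    for name, imputer in imputers.items():
        # Imputer and scaler sit inside the pipeline, so they are fitted on
        # the training folds only and never see the held-out fold.
        pipe = make_pipeline(
            imputer,
            StandardScaler(),
            LogisticRegression(class_weight="balanced", max_iter=1000),
        )
        scores[name] = cross_val_score(
            pipe, X, y, cv=cv, scoring="balanced_accuracy"
        ).mean()
    return scores


imputation_scores = compare_imputers(X_missing, y)
for name, score in imputation_scores.items():
    print(f"{name:>9}: {score:.3f}")
```

Balanced accuracy is used here because plain accuracy can look good on imbalanced data even when the minority class is ignored. Swapping `StandardScaler` for `MinMaxScaler` or `RobustScaler` in the same pipeline gives the equivalent scaling comparison.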
Feature Engineering
- PCA for dimensionality reduction with explained-variance analysis.
- t-SNE for high-dimensional data visualisation.
- Manual feature construction and selection (correlation, mutual information, RFE).
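As a rough illustration of the explained-variance analysis and two of the selection methods above, the sketch below uses scikit-learn's bundled breast cancer dataset. The dataset, the 95% variance threshold, and the choice of five features are assumptions for this example rather than settings taken from the study.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# PCA: a float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print(f"{pca.n_components_} of {X.shape[1]} components kept")

# Mutual information: indices of the five features that share the most
# information with the target.
mi_scores = mutual_info_classif(X_scaled, y, random_state=0)
top_mi = np.argsort(mi_scores)[::-1][:5]
print("Top features by mutual information:", top_mi)

# RFE: repeatedly drop the weakest feature (by coefficient magnitude)
# until five remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_scaled, y)
rfe_selected = np.flatnonzero(rfe.support_)
print("Features kept by RFE:", rfe_selected)
```

For brevity the scaler and selectors are fitted on the full dataset. When the selected features feed a model that is then evaluated, these steps would normally go inside a `Pipeline` so that selection happens within each cross-validation fold.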
Model Selection
Systematic comparison of:
- Logistic Regression, Kernel SVM
- Random Forest, XGBoost, LightGBM, CatBoost
- Voting Ensembles (hard and soft)
Hyperparameter Tuning
- GridSearchCV and RandomizedSearchCV with cross-validation.
- Analysis of resource-accuracy trade-offs across search strategies.
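One way to look at the resource-accuracy trade-off is to run both search strategies over the same parameter region and record the number of candidates, the wall-clock time, and the best cross-validated score. The sketch below does this for an RBF SVM. The parameter ranges, the budget of six random draws, and the three-fold split are assumptions chosen to keep the example fast.

```python
import time

from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Grid search evaluates every combination: 4 x 4 = 16 candidates.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1e-3, 1e-2, 1e-1, 1],
}
# Randomized search samples a fixed budget from the same ranges.
param_distributions = {
    "svc__C": loguniform(1e-1, 1e2),
    "svc__gamma": loguniform(1e-3, 1e0),
}

searches = {
    "grid": GridSearchCV(pipe, param_grid, cv=3),
    "random": RandomizedSearchCV(
        pipe, param_distributions, n_iter=6, cv=3, random_state=0
    ),
}

search_summary = {}
for name, search in searches.items():
    start = time.perf_counter()
    search.fit(X, y)
    search_summary[name] = {
        "candidates": len(search.cv_results_["params"]),
        "best_score": search.best_score_,
        "seconds": time.perf_counter() - start,
    }

for name, row in search_summary.items():
    print(
        f"{name:>6}: {row['candidates']:2d} candidates, "
        f"best CV accuracy {row['best_score']:.3f}, {row['seconds']:.2f}s"
    )
```

Cost scales with the number of candidates multiplied by the number of folds, so the randomized search here performs 18 fits against 48 for the grid. Whether the cheaper search gives up any accuracy depends on the dataset and on the random draws.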
Technology
Python, scikit-learn, XGBoost, LightGBM, CatBoost, Matplotlib, Seaborn. All experiments are run in reproducible Jupyter Notebooks.
