A supervised machine learning project focused on implementing and comparing multiple classification algorithms using a real-world medical dataset.
- This project applies various classification algorithms to the Breast Cancer dataset available in the sklearn library.
- The dataset consists of 569 samples with 30 numerical features describing tumor characteristics, and a binary target variable indicating whether the tumor is Benign or Malignant.
- The primary goal is to build, evaluate, and compare multiple classification models to identify the most effective algorithm for this dataset.
The project includes:
🔹 Dataset loading and preprocessing
🔹 Feature scaling and data preparation
🔹 Implementation of multiple classification algorithms
🔹 Model evaluation and comparison
🔹 Interpretation of results
🔹 Clean and reproducible Google Colab Notebook
The dataset is sourced from sklearn.datasets.load_breast_cancer.
| Component | Description |
|---|---|
| Samples | 569 |
| Features | 30 numerical features |
| Target Classes | Malignant (0), Benign (1) |
✔ Dataset loading using load_breast_cancer()
✔ Converted into Pandas DataFrame
✔ Checked missing values (none found)
✔ Checked duplicates (none found)
✔ Train–Test split
✔ Feature scaling using StandardScaler
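The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the 80/20 split, `random_state=42`, and the stratified split are assumptions, as are all variable names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset and wrap it in a DataFrame
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Basic quality checks: missing values and duplicate rows
missing_total = int(df.isnull().sum().sum())
duplicate_total = int(df.duplicated().sum())
print(f"Missing values: {missing_total}, duplicate rows: {duplicate_total}")

X = df.drop(columns="target")
y = df["target"]

# Assumed split settings: 80/20, fixed seed, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both sets,
# so no information from the test set leaks into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps the test set unseen until evaluation.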
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- Support Vector Machine (SVM)
- k-Nearest Neighbors (k-NN)
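As a rough sketch, the five algorithms listed above could be trained and scored in one loop. The hyperparameters shown (mostly scikit-learn defaults), the split settings, and the `models` and `accuracies` names are assumptions for illustration, so the resulting scores may differ from those reported in this README.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Assumed split settings, not taken from the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Hypothetical model configurations (defaults unless noted)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM (RBF)": SVC(kernel="rbf"),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model and record its accuracy on the held-out test set
accuracies = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    accuracies[name] = model.score(X_test_scaled, y_test)

for name, acc in sorted(accuracies.items(), key=lambda item: -item[1]):
    print(f"{name}: {acc:.4f}")
```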
- Accuracy Score
- Confusion Matrix
- Precision
- Recall
- F1-score
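The metrics listed above are all available in `sklearn.metrics`. The sketch below computes them for a single model (Logistic Regression, chosen only as an example) under assumed split settings. Because scikit-learn's `target_names` for this dataset are `['malignant', 'benign']`, precision, recall, and F1 are computed here with the malignant class as the positive label, which is an assumption about what the evaluation should emphasize.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Assumed split settings, not taken from the notebook
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Look up the integer label of the malignant class from the dataset itself
malignant = list(data.target_names).index("malignant")

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
precision = precision_score(y_test, y_pred, pos_label=malignant)
recall = recall_score(y_test, y_pred, pos_label=malignant)
f1 = f1_score(y_test, y_pred, pos_label=malignant)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
print("Confusion matrix:")
print(cm)
```

In a medical setting, recall on the malignant class is often the figure to watch, since a missed malignant tumor is usually costlier than a false alarm.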
| Model | Accuracy |
|---|---|
| Logistic Regression | 0.982456 |
| SVM (RBF) | 0.982456 |
| k-NN | 0.956140 |
| Random Forest | 0.956140 |
| Pruned Decision Tree | 0.921053 |
| Decision Tree | 0.912281 |
Best Model: Logistic Regression & SVM (RBF) → 0.982
Worst Model: Decision Tree → 0.912
✔ Feature scaling improved SVM and k-NN performance
✔ Logistic Regression & SVM performed best
✔ Random Forest provided good generalization
✔ The unpruned Decision Tree showed overfitting and had the lowest test accuracy (0.912)
✔ Pruning improved Decision Tree test accuracy from 0.912 to 0.921
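One way to see the overfitting and pruning effect described above is to compare training and test accuracy for an unrestricted tree and a depth-limited one. This README does not say how the tree was pruned, so `max_depth=3` is an assumption used purely for illustration (cost-complexity pruning via `ccp_alpha` is another option), as are the split settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Assumed split settings; trees do not need feature scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Unrestricted tree: grows until every training leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Depth-limited tree: a simple form of pre-pruning (max_depth=3 is an assumption)
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(
    X_train, y_train
)

for name, tree in [("Full tree", full_tree), ("Pruned tree", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train={tree.score(X_train, y_train):.3f}, "
        f"test={tree.score(X_test, y_test):.3f}"
    )
```

A large gap between training and test accuracy for the full tree is the usual sign of overfitting; a shallower tree typically narrows that gap.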
| Tool | Purpose |
|---|---|
| Python | Programming |
| Pandas | Data handling |
| NumPy | Computation |
| Matplotlib | Visualization |
| Seaborn | Visualization |
| Scikit-learn | ML models |
| Google Colab | Development |
Classification-Algorithms-Model-Building/
│
├── Classification_Algorithms_Model_Building.ipynb
└── README.md
1️⃣ Open the notebook using the Colab link above
2️⃣ Run all cells sequentially
3️⃣ View model results and evaluation
This project was created as part of a Machine Learning academic assignment, demonstrating the implementation and comparison of multiple classification algorithms using a real-world medical dataset.
- Small dataset size
- Limited features
- Risk of overfitting
- No extensive hyperparameter tuning
- Not tested on external datasets
- Apply GridSearchCV tuning
- Use advanced models (XGBoost, LightGBM)
- Add cross-validation
- Build Streamlit web app
- Improve visualization dashboard
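As a starting point for the GridSearchCV and cross-validation items above, a minimal sketch might look like the following. The parameter grid, the 5-fold setting, the choice of SVM as the model to tune, and the split settings are all assumptions, not results or settings from this project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Assumed split settings
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Putting the scaler inside the pipeline means it is re-fit on each
# cross-validation training fold, avoiding leakage from validation folds
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Hypothetical, deliberately small search grid
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {test_accuracy:.4f}")
```

The same pattern applies to the other models by swapping the final pipeline step and its grid.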
Name: Laya Mary Joy
Organization: Entri Elevate
Date: January 19, 2026
Thanks to Entri Elevate for guidance and support throughout this project.