A supervised machine learning project focused on implementing and comparing multiple regression algorithms to predict car prices in the American automobile market.
Click below to open the notebook:
This project is based on a business problem where an automobile company plans to enter the US market and aims to understand key factors influencing car prices.
The complete pipeline includes:
- Data Cleaning
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Feature Scaling
- Model Building
- Model Evaluation
- Hyperparameter Tuning
- Model Comparison
The goal is to build an accurate regression model to predict car prices and support data-driven pricing strategies.
The main objectives of this project are:
- Identify variables affecting car price
- Analyze relationships between features and price
- Build multiple regression models
- Compare model performances
- Optimize the best model using hyperparameter tuning
- Provide business insights for pricing strategy
| Component | Description |
|---|---|
| Records | 205 cars |
| Features | 25+ independent variables |
| Target Variable | price |
| Data Type | Numerical and Categorical |
- Engine size
- Horsepower
- Fuel type
- Drive wheel type
- Car dimensions
- Brand
- Mileage
- Technical specifications
The following preprocessing steps were performed:
- Loaded dataset using Pandas
- Initial exploration (shape, info, summary statistics)
- Handled missing values
- Removed duplicate records
- Detected and handled outliers
- Feature engineering (brand extraction)
- Encoded categorical variables
- Split dataset into Training (80%) and Testing (20%)
- Scaled features using StandardScaler
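The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the small inline DataFrame is made-up data standing in for `CarPrice_Assignment.csv`, and the column names (`CarName`, `fueltype`, `enginesize`, `horsepower`) are assumed. Outlier handling is left out because the method used is not stated here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up rows standing in for pd.read_csv("CarPrice_Assignment.csv").
# Column names are assumptions for illustration.
df = pd.DataFrame({
    "CarName": ["audi 100ls", "bmw 320i", "toyota corolla", "audi 100ls",
                "honda civic", "bmw x3", "toyota corona", "honda accord",
                "audi 5000", "toyota mark ii"],
    "fueltype": ["gas", "gas", "gas", "gas", "gas",
                 "gas", "diesel", "gas", "diesel", "gas"],
    "enginesize": [109, 108, 92, 109, 92, 164, 110, 110, 131, 156],
    "horsepower": [102, 101, 62, 102, 76, 121, 56, 86, 110, 156],
    "price": [13950, 16430, 6338, 13950, 7129, 24565, 7898, 8845, 17710, 15690],
})

# Remove exact duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna()

# Feature engineering: take the brand as the first word of the car name.
df["brand"] = df["CarName"].str.split().str[0].str.lower()
df = df.drop(columns="CarName")

# One-hot encode the categorical columns (fueltype, brand).
X = pd.get_dummies(df.drop(columns="price"), drop_first=True, dtype=float)
y = df["price"]

# 80/20 split, then fit the scaler on the training rows only so that
# test-set statistics do not leak into the scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler after the split, rather than on the full dataset, is the design choice that matters most here: it keeps the test rows unseen until evaluation.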
Visualizations used:
- Histogram
- Box Plot
- Correlation Heatmap
- Scatter Plot

Key insights:
- Identified strong relationships between engine size, horsepower, and price
- Detected outliers affecting model performance
- Understood feature distributions
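A correlation heatmap is drawn from a correlation matrix, which can also be inspected directly. The sketch below uses synthetic data generated so that price depends on engine size and horsepower; the numbers are invented for illustration and the column names are assumptions, so the printed correlations say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: all values are generated, not taken from the dataset.
rng = np.random.default_rng(0)
n = 205
enginesize = rng.normal(130, 40, n)
horsepower = 0.8 * enginesize + rng.normal(0, 10, n)
highwaympg = 50 - 0.15 * enginesize + rng.normal(0, 3, n)
price = 100 * enginesize + 50 * horsepower + rng.normal(0, 1500, n)

df = pd.DataFrame({
    "enginesize": enginesize,
    "horsepower": horsepower,
    "highwaympg": highwaympg,
    "price": price,
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Correlation of each feature with the target, strongest positive first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr.round(2))
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` would render the same matrix as a heatmap.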
The following models were trained and evaluated:
- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- Support Vector Regressor (SVR)
- Pruned Decision Tree
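One way to train and compare the models listed above is a single loop over a dictionary of estimators. The following is a hedged sketch: the data is synthetic (`make_regression`), the hyperparameters are scikit-learn defaults, and the pruned tree is approximated with a `max_depth` cap, which is an assumption since the pruning method is not stated here. The scores it prints will not match the reported results.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared car data (205 rows, like the real set).
X, y = make_regression(n_samples=205, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42),
    "Support Vector Regressor": SVR(),
    # Assumption: pruning via a depth cap; ccp_alpha is another common option.
    "Pruned Decision Tree": DecisionTreeRegressor(max_depth=4, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({
        "Model": name,
        "R2": r2_score(y_test, pred),
        "MSE": mean_squared_error(y_test, pred),
        "MAE": mean_absolute_error(y_test, pred),
    })

# Rank the models by R² on the held-out test set.
results = pd.DataFrame(rows).sort_values("R2", ascending=False)
print(results.to_string(index=False))
```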
| Model | R² Score | MSE | MAE |
|---|---|---|---|
| Random Forest Regressor | 0.913 | 0.0136 | 0.0882 |
| Gradient Boosting Regressor | 0.908 | 0.0143 | 0.0838 |
| Linear Regression | 0.879 | 0.0189 | 0.1054 |
| Decision Tree Regressor | 0.860 | 0.0219 | 0.1062 |
| Support Vector Regressor | 0.831 | 0.0263 | 0.1051 |
| Pruned Decision Tree | 0.831 | 0.0263 | 0.0958 |
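- Best Model: Random Forest Regressor (highest R² and lowest MSE; Gradient Boosting has a slightly lower MAE)
- Lowest R²: Support Vector Regressor and Pruned Decision Tree, tied at 0.831 (the pruned tree has the lower MAE of the two)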
- Applied GridSearchCV
- Used Pipeline to prevent data leakage
- Performed 5-Fold Cross Validation

Parameters tuned:
- `n_estimators`
- `max_depth`
- `min_samples_split`
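The tuning setup described above might look like the sketch below. It runs on synthetic data, and the grid values are hypothetical since the actual search ranges are not listed here. Putting `StandardScaler` inside the `Pipeline` means each cross-validation fold fits the scaler on its own training portion only, which is how the pipeline prevents leakage.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared car data.
X, y = make_regression(n_samples=205, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling is a pipeline step, so it is refit inside every CV fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestRegressor(random_state=42)),
])

# Hypothetical grid values; step-name prefix "rf__" routes each
# parameter to the RandomForestRegressor step.
param_grid = {
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [None, 5],
    "rf__min_samples_split": [2, 5],
}

# 5-fold cross-validated grid search, scored by R².
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean CV R2:", round(search.best_score_, 4))
print("Test R2:", round(search.score(X_test, y_test), 4))
```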
| Model | R² Score | MSE | MAE |
|---|---|---|---|
| Random Forest (Untuned) | 0.9129 | 0.0136 | 0.0882 |
| Random Forest (Tuned) | 0.9131 | 0.0136 | 0.0847 |
Best Model: Tuned Random Forest Regressor
- R² Score: 0.9131
- MSE: 0.0136
- MAE: 0.0847
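- Hyperparameter tuning gave a marginal improvement: R² rose from 0.9129 to 0.9131 and MAE fell from 0.0882 to 0.0847, while MSE stayed at 0.0136
- Random Forest Regressor selected as the final model
- Optimized using hyperparameter tuning
- Provides the most accurate predictions of the models compared (test R² of 0.9131)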
Important features influencing car price:
- Engine size
- Curb weight
- Horsepower
- Car width
- highwaympg
These features play a major role in determining car pricing strategy.
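How the ranking above was produced is not stated here. One common option with a random forest is its impurity-based `feature_importances_` attribute, sketched below on synthetic data. The column names are borrowed from the list above purely as labels (an assumption), so the resulting ranking is arbitrary and does not reproduce the project's findings.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical column labels attached to synthetic data.
feature_names = ["enginesize", "curbweight", "horsepower", "carwidth", "highwaympg"]
X, y = make_regression(
    n_samples=205, n_features=5, n_informative=3, noise=5.0, random_state=0
)
X = pd.DataFrame(X, columns=feature_names)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances can favor high-cardinality features; `sklearn.inspection.permutation_importance` is an alternative worth comparing against.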
- R² Score
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
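For reference, the three metrics can be computed by hand on a tiny example and checked against scikit-learn. The four values below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented example values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

residuals = y_true - y_pred

# MSE: mean of squared residuals.
mse = np.mean(residuals ** 2)
# MAE: mean of absolute residuals.
mae = np.mean(np.abs(residuals))
# R²: 1 minus residual sum of squares over total sum of squares.
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MSE={mse:.3f} MAE={mae:.3f} R2={r2:.3f}")  # MSE=0.375 MAE=0.500 R2=0.925

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

MSE penalizes large errors more heavily than MAE because residuals are squared, which is why the two can rank models differently.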
- Dataset size is relatively small (205 records)
- Limited features may not capture all real-world pricing factors
- Market dynamics and external economic factors are not included
- Model may not generalize well to different regions or time periods
- Performance may vary for unseen or highly diverse car categories
| Tool | Purpose |
|---|---|
| Python | Programming language |
| Pandas | Data handling |
| NumPy | Numerical computation |
| Matplotlib / Seaborn | Visualization |
| Scikit-learn | ML models & preprocessing |
| Google Colab | Development |
car-price-prediction/
│
├── CarPrice_Assignment.csv
├── Car_Price_Prediction.ipynb
└── README.md
- Click the Google Colab link above
- If running locally, install the dependencies: `pip install pandas numpy matplotlib seaborn scikit-learn`
- Execute all cells step-by-step
- Analyze model performance
This project helps:
- Identify key drivers of car pricing
- Optimize product features
- Support strategic pricing decisions
- Enable data-driven business planning
This project was created as part of a Machine Learning & Data Science program, showcasing end-to-end regression modeling, including data preprocessing, EDA, feature engineering, model comparison, and optimization for car price prediction.
Name: Laya Mary Joy
Organization: Entri Elevate
Date: February 14, 2026
Thanks to Entri Elevate for guidance and support.
- Use larger and more diverse datasets
- Include real-time market data
- Try advanced models (XGBoost, LightGBM)
- Deploy as a web application