
Implementing Polynomial Regression for Predicting Car Price (From Scratch)

Introduction

Below is my Kaggle notebook for this project: implementing polynomial regression from scratch to predict car prices, and comparing it to scikit-learn's predefined implementation.
Enjoy!

Notebook

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/car-price-prediction/CarPrice_Assignment.csv
/kaggle/input/car-price-prediction/Data Dictionary - carprices.xlsx

Overview

In this notebook I will predict car prices using polynomial regression. I will implement everything from scratch, then compare my results to the predefined algorithm in scikit-learn.

Dataset

First I will load and explore the dataset. I'm working with the Car Price Prediction dataset on Kaggle:

https://www.kaggle.com/datasets/hellbuoy/car-price-prediction

df = pd.read_csv("/kaggle/input/car-price-prediction/CarPrice_Assignment.csv")
df.head()
|   | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |

5 rows × 26 columns

df.shape
(205, 26)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 17  fuelsystem        205 non-null    object 
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64  
 22  peakrpm           205 non-null    int64  
 23  citympg           205 non-null    int64  
 24  highwaympg        205 non-null    int64  
 25  price             205 non-null    float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB
df.describe()
(Output transposed for readability.)

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| car_ID | 205 | 103.000000 | 59.322565 | 1.0 | 52.0 | 103.0 | 154.0 | 205.0 |
| symboling | 205 | 0.834146 | 1.245307 | -2.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| wheelbase | 205 | 98.756585 | 6.021776 | 86.6 | 94.5 | 97.0 | 102.4 | 120.9 |
| carlength | 205 | 174.049268 | 12.337289 | 141.1 | 166.3 | 173.2 | 183.1 | 208.1 |
| carwidth | 205 | 65.907805 | 2.145204 | 60.3 | 64.1 | 65.5 | 66.9 | 72.3 |
| carheight | 205 | 53.724878 | 2.443522 | 47.8 | 52.0 | 54.1 | 55.5 | 59.8 |
| curbweight | 205 | 2555.565854 | 520.680204 | 1488.0 | 2145.0 | 2414.0 | 2935.0 | 4066.0 |
| enginesize | 205 | 126.907317 | 41.642693 | 61.0 | 97.0 | 120.0 | 141.0 | 326.0 |
| boreratio | 205 | 3.329756 | 0.270844 | 2.54 | 3.15 | 3.31 | 3.58 | 3.94 |
| stroke | 205 | 3.255415 | 0.313597 | 2.07 | 3.11 | 3.29 | 3.41 | 4.17 |
| compressionratio | 205 | 10.142537 | 3.972040 | 7.0 | 8.6 | 9.0 | 9.4 | 23.0 |
| horsepower | 205 | 104.117073 | 39.544167 | 48.0 | 70.0 | 95.0 | 116.0 | 288.0 |
| peakrpm | 205 | 5125.121951 | 476.985643 | 4150.0 | 4800.0 | 5200.0 | 5500.0 | 6600.0 |
| citympg | 205 | 25.219512 | 6.542142 | 13.0 | 19.0 | 24.0 | 30.0 | 49.0 |
| highwaympg | 205 | 30.751220 | 6.886443 | 16.0 | 25.0 | 30.0 | 34.0 | 54.0 |
| price | 205 | 13276.710571 | 7988.852332 | 5118.0 | 7788.0 | 10295.0 | 16503.0 | 45400.0 |
# Shuffle the data 

data = df.sample(frac=1,random_state=42).reset_index(drop=True)
# Calculate the number of samples in the training set
train_size = int(0.8 * len(data))

# Split the dataset into training and test sets
train_data = data[:train_size] # training data (80%)
test_data = data[train_size:]  # test data (remaining 20%)
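
As an aside, the same 80/20 split could be written with scikit-learn's train_test_split; a minimal sketch is below. Note that its internal shuffling differs from the manual df.sample call above, so the exact rows in each split would not match.

```python
from sklearn.model_selection import train_test_split

# Equivalent 80/20 split using scikit-learn; it shuffles internally,
# so the resulting rows differ from the manual split above
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```
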
data.head()
|   | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16 | 0 | bmw x4 | gas | std | four | sedan | rwd | front | 103.5 | ... | 209 | mpfi | 3.62 | 3.39 | 8.00 | 182 | 5400 | 16 | 22 | 30760.000 |
| 1 | 10 | 0 | audi 5000s (diesel) | gas | turbo | two | hatchback | 4wd | front | 99.5 | ... | 131 | mpfi | 3.13 | 3.40 | 7.00 | 160 | 5500 | 16 | 22 | 17859.167 |
| 2 | 101 | 0 | nissan nv200 | gas | std | four | sedan | fwd | front | 97.2 | ... | 120 | 2bbl | 3.33 | 3.47 | 8.50 | 97 | 5200 | 27 | 34 | 9549.000 |
| 3 | 133 | 3 | saab 99e | gas | std | two | hatchback | fwd | front | 99.1 | ... | 121 | mpfi | 3.54 | 3.07 | 9.31 | 110 | 5250 | 21 | 28 | 11850.000 |
| 4 | 69 | -1 | buick century luxus (sw) | diesel | turbo | four | wagon | rwd | front | 110.0 | ... | 183 | idi | 3.58 | 3.64 | 21.50 | 123 | 4350 | 22 | 25 | 28248.000 |

5 rows × 26 columns

numerical_features = ['wheelbase', 'carlength', 'carwidth', 'carheight',
                      'curbweight', 'enginesize', 'boreratio', 'stroke',
                      'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg']
target = 'price'
# Split the features and target for each set
X_train = train_data[numerical_features].values
y_train = train_data[target].values

X_test = test_data[numerical_features].values
y_test = test_data[target].values

print(f"Training set size: {len(X_train)} samples")
print(f"Test set size: {len(X_test)} samples")
Training set size: 164 samples
Test set size: 41 samples

Models

Polynomial Regression from scratch
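
For reference, this is the math the class below implements. With $X_{\text{poly}}$ holding the (standardized) features raised to powers $1$ through $d$, predictions are $\hat{y} = b + X_{\text{poly}} \mathbf{w}$, and the bias $b$ and weights $\mathbf{w}$ are fitted by batch gradient descent on the mean squared error cost:

$$
J(b, \mathbf{w}) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2
$$

$$
b \leftarrow b - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right), \qquad
\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \frac{1}{m} X_{\text{poly}}^{\top} \left( \hat{\mathbf{y}} - \mathbf{y} \right)
$$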

import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

class PolynomialRegression:
    def __init__(self, degree=2, learning_rate=0.01, iterations=1000, scale=True):
        """
        Initialize the polynomial regression model.
        
        Parameters:
        - degree: The degree of the polynomial features.
        - learning_rate: The learning rate (alpha) for gradient descent.
        - iterations: Number of iterations for gradient descent.
        - scale: Boolean flag to apply manual feature scaling.
        """
        self.degree = degree
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.scale = scale
        self.bias = 0.0
        self.weights = None
        self.cost_history = []
        # To store scaling parameters
        self.means = None
        self.stds = None

    def _scale_features(self, X):
        """
        Scale the features manually: standardize to zero mean and unit variance.
        
        Parameters:
        - X: A numpy array of shape (m, n).
        
        Returns:
        - X_scaled: The scaled features.
        """
        if self.means is None or self.stds is None:
            self.means = np.mean(X, axis=0)
            self.stds = np.std(X, axis=0)
            # Avoid division by zero
            self.stds[self.stds == 0] = 1.0
        return (X - self.means) / self.stds

    def _create_polynomial_features(self, X):
        """
        Create polynomial features for the input data X.
        
        Parameters:
        - X: A numpy array of shape (m, n) where m is the number of samples 
             and n is the number of original features.
             
        Returns:
        - X_poly: A numpy array of shape (m, n * degree) containing polynomial features.
        """
        # Optionally scale the features
        if self.scale:
            X = self._scale_features(X)
        
        m, n = X.shape
        poly_features = []
        for d in range(1, self.degree + 1):
            poly_features.append(np.power(X, d))
        X_poly = np.concatenate(poly_features, axis=1)
        return X_poly

    def _compute_cost(self, X_poly, y):
        """
        Compute the mean squared error cost.
        
        Parameters:
        - X_poly: Feature matrix.
        - y: True target values.
        
        Returns:
        - cost: The computed cost value.
        """
        m = len(y)
        predictions = self.bias + X_poly.dot(self.weights)
        error = predictions - y
        cost = (1 / (2 * m)) * np.sum(np.square(error))
        return cost

    def fit(self, X, y):
        """
        Fit the polynomial regression model using gradient descent,
        calculating bias and weights separately.
        
        Parameters:
        - X: A numpy array of shape (m, n) with the original features.
        - y: A numpy array of shape (m,) with the target values.
        """
        # Generate polynomial features (scaling is applied if self.scale=True)
        X_poly = self._create_polynomial_features(X)
        m, n_poly = X_poly.shape

        # Initialize parameters
        self.bias = 0.0
        self.weights = np.zeros(n_poly)
        self.cost_history = []

        # Gradient Descent loop
        for i in range(self.iterations):
            predictions = self.bias + X_poly.dot(self.weights)
            error = predictions - y

            # Compute gradients
            bias_gradient = (1 / m) * np.sum(error)
            weights_gradient = (1 / m) * X_poly.T.dot(error)

            # Update parameters
            self.bias -= self.learning_rate * bias_gradient
            self.weights -= self.learning_rate * weights_gradient

            cost = self._compute_cost(X_poly, y)
            self.cost_history.append(cost)

    def predict(self, X):
        """
        Make predictions using the trained model.
        
        Parameters:
        - X: A numpy array of shape (m, n) with the original features.
        
        Returns:
        - predictions: A numpy array of shape (m,) with the predicted values.
        """
        # _create_polynomial_features reuses the scaling parameters stored
        # during fit, so the features are scaled exactly once here
        X_poly = self._create_polynomial_features(X)
        predictions = self.bias + X_poly.dot(self.weights)
        return predictions

    def plot_cost_history(self):
        """
        Plot the cost history over iterations.
        """
        plt.figure(figsize=(8, 5))
        plt.plot(self.cost_history, label="Cost over Iterations")
        plt.xlabel("Iterations")
        plt.ylabel("Cost")
        plt.title("Gradient Descent Convergence")
        plt.legend()
        plt.show()

# Instantiate the polynomial regression model with degree 2
poly_reg = PolynomialRegression(degree=2, learning_rate=0.001, iterations=1000, scale=True)

# Fit the model using the training data
poly_reg.fit(X_train, y_train)

# Make predictions on the test data
y_pred = poly_reg.predict(X_test)

# Evaluate the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Test Mean Squared Error:", mse)

# Plot the cost history to observe convergence
poly_reg.plot_cost_history()
Test Mean Squared Error: 59698053.45753728

[Figure: cost over iterations, showing gradient descent convergence]

Polynomial Regression using scikit-learn

from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# First, manually scale the data using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create polynomial features using scikit-learn
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly_features.fit_transform(X_train_scaled)
X_test_poly = poly_features.transform(X_test_scaled)

# Fit a LinearRegression model
lin_reg = LinearRegression()
lin_reg.fit(X_train_poly, y_train)
y_pred_sklearn = lin_reg.predict(X_test_poly)
mse_sklearn = mean_squared_error(y_test, y_pred_sklearn)
print("scikit-learn Polynomial Regression MSE:", mse_sklearn)
scikit-learn Polynomial Regression MSE: 234990527.49339545
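
The three scikit-learn steps above (scaling, polynomial expansion, linear regression) can also be packaged into a single estimator with a Pipeline. This is just an optional refactor of the same model, shown as a sketch:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Same model as above, expressed as a single pipeline estimator
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression()
)
model.fit(X_train, y_train)
y_pred_pipeline = model.predict(X_test)  # should match y_pred_sklearn
```
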
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Compute error metrics for manual implementation
mse_manual = mean_squared_error(y_test, y_pred)
rmse_manual = np.sqrt(mse_manual)
mae_manual = mean_absolute_error(y_test, y_pred)
r2_manual = r2_score(y_test, y_pred)

# Compute error metrics for scikit-learn implementation
mse_sklearn = mean_squared_error(y_test, y_pred_sklearn)
rmse_sklearn = np.sqrt(mse_sklearn)
mae_sklearn = mean_absolute_error(y_test, y_pred_sklearn)
r2_sklearn = r2_score(y_test, y_pred_sklearn)

# Print results
print("Manual Polynomial Regression:")
print(f"   MSE  : {mse_manual:.4f}")
print(f"   RMSE : {rmse_manual:.4f}")
print(f"   MAE  : {mae_manual:.4f}")
print(f"   R²   : {r2_manual:.4f}")
print("\nScikit-Learn Polynomial Regression:")
print(f"   MSE  : {mse_sklearn:.4f}")
print(f"   RMSE : {rmse_sklearn:.4f}")
print(f"   MAE  : {mae_sklearn:.4f}")
print(f"   R²   : {r2_sklearn:.4f}")

Manual Polynomial Regression:
   MSE  : 59698053.4575
   RMSE : 7726.4515
   MAE  : 4168.0573
   R²   : 0.2301

Scikit-Learn Polynomial Regression:
   MSE  : 234990527.4934
   RMSE : 15329.4008
   MAE  : 7138.0699
   R²   : -2.0305
Interestingly, the from-scratch model does much better here (R² of 0.23 vs. −2.03 on the test set). A likely explanation: PolynomialFeatures also adds all pairwise interaction terms, giving 104 features for only 164 training samples, and the exact least-squares fit overfits them badly, while the from-scratch model uses only pure powers of each feature and stops after 1,000 small gradient steps, which acts as a mild implicit regularization. Neither result is strong; regularization (e.g., Ridge) would be a natural next step.

This post is licensed under CC BY 4.0 by the author.