Implementing Linear Regression for Predicting House Prices (From Scratch)

Introduction

Below is my Kaggle notebook for this project: implementing linear regression from scratch to predict house prices, and comparing it to scikit-learn's predefined implementation.
Enjoy!

Notebook

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/the-boston-houseprice-data/boston.csv

Overview

In this notebook I will predict house prices using linear regression. I will implement everything from scratch, then compare my results to the predefined algorithm in scikit-learn.

Dataset

First, I will load and explore the dataset. I'm working with the Boston House Prices dataset on Kaggle (506 samples, 13 features, and the target MEDV, the median home value in $1000s):

https://www.kaggle.com/datasets/fedesoriano/the-boston-houseprice-data

df = pd.read_csv("/kaggle/input/the-boston-houseprice-data/boston.csv")
df.head()
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222.0     18.7  396.90   5.33  36.2
df.shape
(506, 14)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
df.describe()
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
count506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000
mean3.61352411.36363611.1367790.0691700.5546956.28463468.5749013.7950439.549407408.23715418.455534356.67403212.65306322.532806
std8.60154523.3224536.8603530.2539940.1158780.70261728.1488612.1057108.707259168.5371162.16494691.2948647.1410629.197104
min0.0063200.0000000.4600000.0000000.3850003.5610002.9000001.1296001.000000187.00000012.6000000.3200001.7300005.000000
25%0.0820450.0000005.1900000.0000000.4490005.88550045.0250002.1001754.000000279.00000017.400000375.3775006.95000017.025000
50%0.2565100.0000009.6900000.0000000.5380006.20850077.5000003.2074505.000000330.00000019.050000391.44000011.36000021.200000
75%3.67708312.50000018.1000000.0000000.6240006.62350094.0750005.18842524.000000666.00000020.200000396.22500016.95500025.000000
max88.976200100.00000027.7400001.0000000.8710008.780000100.00000012.12650024.000000711.00000022.000000396.90000037.97000050.000000
# Shuffle the data
data = df.sample(frac=1, random_state=42).reset_index(drop=True)
# Calculate the number of samples for each set
train_size = int(0.7 * len(data))
val_size = int(0.15 * len(data))

# Split the dataset into training, validation, and test sets
train_data = data[:train_size] # training data (70%)
val_data = data[train_size:train_size+val_size] # validation data (15%)
test_data = data[train_size+val_size:] # test data (15%)
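As a cross-check, a split with the same 70/15/15 proportions could be produced with scikit-learn's train_test_split (a sketch for comparison only; it would not yield the same rows, since the shuffling differs, and the notebook uses the manual slicing above):

from sklearn.model_selection import train_test_split

# Carve off 70% for training, then split the remaining 30% evenly
# into validation and test sets.
train_data, rest = train_test_split(df, test_size=0.3, random_state=42)
val_data, test_data = train_test_split(rest, test_size=0.5, random_state=42)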
# Split the features and target for each set
X_train = train_data.drop('MEDV', axis=1)
y_train = train_data['MEDV']

X_val = val_data.drop('MEDV', axis=1)
y_val = val_data['MEDV']

X_test = test_data.drop('MEDV', axis=1)
y_test = test_data['MEDV']
print(f"Training set size: {len(X_train)} samples")
print(f"Validation set size: {len(X_val)} samples")
print(f"Test set size: {len(X_test)} samples")
Training set size: 354 samples
Validation set size: 75 samples
Test set size: 77 samples

Model

Linear Regression from scratch
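Before the code, here is the math the class below implements. With $m$ training examples, feature matrix $X$, weights $w$, bias $b$, and learning rate $\alpha$, the predictions, cost, and per-iteration gradient-descent updates are:

$$\hat{y} = Xw + b$$

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right)^2$$

$$w \leftarrow w - \alpha \cdot \frac{1}{m} X^\top (\hat{y} - y), \qquad b \leftarrow b - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)$$

The factor of $\frac{1}{2}$ in the cost only simplifies the gradient; it means the values stored in cost_history are half the MSE reported later during evaluation.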

class LinearRegression:
    def __init__(self, alpha=0.01, iterations=1000, scale=False):
        self.alpha = alpha  # Learning rate
        self.iterations = iterations  # Number of iterations for gradient descent
        self.scale = scale  # Whether to scale the features or not
        self.w = None  # Weights
        self.b = None  # Bias
        self.cost_history = []  # List to track cost history

    def scale_features(self, X):
        """Scale features using mean and std deviation (standardization)."""
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return (X - self.mean) / self.std

    def fit(self, X, y):
        """Fit the model to the training data using gradient descent."""
        m = len(y)  # Number of training examples
        
        # If scaling is needed, scale the features
        if self.scale:
            X = self.scale_features(X)

        # Initialize weights (w) and bias (b)
        self.w = np.zeros(X.shape[1])
        self.b = 0

        # Perform gradient descent
        for i in range(self.iterations):
            predictions = X.dot(self.w) + self.b
            error = predictions - y

            # Gradient of the cost function with respect to w and b
            dw = (1/m) * np.dot(X.T, error)
            db = (1/m) * np.sum(error)

            # Update weights and bias
            self.w -= self.alpha * dw
            self.b -= self.alpha * db

            # Track the cost (one-half MSE; the 1/2 just simplifies the gradient)
            cost = (1/(2*m)) * np.sum(error ** 2)
            self.cost_history.append(cost)

    def predict(self, X):
        """Make predictions using the trained model."""
        # If scaling was applied, scale the new data as well
        if self.scale:
            X = (X - self.mean) / self.std
        return X.dot(self.w) + self.b

    def get_cost_history(self):
        return self.cost_history
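One caveat about the implementation: scale_features divides by each column's standard deviation, so a constant feature (std = 0) would trigger a division by zero. Every column in the Boston data varies, so it doesn't matter here, but a defensive variant of the method (my tweak, not part of the notebook) could look like this:

    def scale_features(self, X):
        """Standardize features, guarding against zero-variance columns."""
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        self.std = np.where(self.std == 0, 1.0, self.std)  # avoid division by zero; constant columns become all zeros
        return (X - self.mean) / self.std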
import matplotlib.pyplot as plt

def mean_squared_error(y_true, y_pred):
    m = len(y_true)
    return (1/m) * np.sum((y_true - y_pred) ** 2)

# Initialize and train the model
model = LinearRegression(alpha=0.01, iterations=1000, scale=True)
model.fit(X_train, y_train)

cost_history = model.get_cost_history()

plt.plot(range(1, len(cost_history) + 1), cost_history, color='blue')
plt.title('Cost History During Training')
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.grid(True)
plt.show()

# Make predictions on all datasets
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

# Calculate Mean Squared Error for all datasets
train_mse = mean_squared_error(y_train, y_train_pred)
val_mse = mean_squared_error(y_val, y_val_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

print(f"Training MSE: {train_mse}")
print(f"Validation MSE: {val_mse}")
print(f"Test MSE: {test_mse}")

(Figure: cost history during training, plotted as cost vs. iteration.)

Training MSE: 22.384617927405444
Validation MSE: 21.208792092413585
Test MSE: 22.102794782467647
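To read these numbers on the target's scale, take the square root: an MSE near 22 is an RMSE of about 4.7, and since MEDV is in $1000s, a typical prediction misses by roughly $4,700. A quick check:

print(np.sqrt([train_mse, val_mse, test_mse]))  # ≈ [4.73 4.61 4.70]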
# Plot the actual vs predicted values for all datasets
plt.figure(figsize=(15, 5))

# Training data plot
plt.subplot(1, 3, 1)
plt.scatter(y_train, y_train_pred, color='blue', label="Train Set")
plt.plot([min(y_train), max(y_train)], [min(y_train), max(y_train)], color='red', label="Ideal")
plt.title('Training Set: Actual vs Predicted')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.legend()

# Validation data plot
plt.subplot(1, 3, 2)
plt.scatter(y_val, y_val_pred, color='green', label="Validation Set")
plt.plot([min(y_val), max(y_val)], [min(y_val), max(y_val)], color='red', label="Ideal")
plt.title('Validation Set: Actual vs Predicted')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.legend()

# Test data plot
plt.subplot(1, 3, 3)
plt.scatter(y_test, y_test_pred, color='orange', label="Test Set")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', label="Ideal")
plt.title('Test Set: Actual vs Predicted')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.legend()

plt.tight_layout()
plt.show()

(Figure: actual vs. predicted scatter plots for the training, validation, and test sets, each with the ideal y = x line in red.)

Linear Regression using scikit-learn and comparison
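Unlike the gradient-descent implementation above, scikit-learn's LinearRegression fits ordinary least squares in closed form (internally via a least-squares solver), so it needs no learning rate, no iteration count, and no feature scaling. Conceptually, it finds

$$w^{*} = \arg\min_{w}\ \lVert Xw - y \rVert_2^2 = (X^\top X)^{-1} X^\top y$$

(the normal-equation form, assuming $X^\top X$ is invertible). Its training MSE is therefore the minimum achievable for a linear model, which the gradient-descent version should approach as it converges.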

from sklearn.linear_model import LinearRegression as SklearnLR
from sklearn.metrics import mean_squared_error  # note: shadows the custom mean_squared_error above; both compute the same quantity

# Scikit-learn Linear Regression Model
sklearn_model = SklearnLR()
sklearn_model.fit(X_train, y_train)
y_train_pred_sklearn = sklearn_model.predict(X_train)
y_val_pred_sklearn = sklearn_model.predict(X_val)
y_test_pred_sklearn = sklearn_model.predict(X_test)

# Calculate Mean Squared Error for both models
train_mse = mean_squared_error(y_train, y_train_pred)
val_mse = mean_squared_error(y_val, y_val_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

train_mse_sklearn = mean_squared_error(y_train, y_train_pred_sklearn)
val_mse_sklearn = mean_squared_error(y_val, y_val_pred_sklearn)
test_mse_sklearn = mean_squared_error(y_test, y_test_pred_sklearn)

print("Custom Model MSE:")
print(f"Training MSE: {train_mse}")
print(f"Validation MSE: {val_mse}")
print(f"Test MSE: {test_mse}")

print("\nScikit-learn Model MSE:")
print(f"Training MSE: {train_mse_sklearn}")
print(f"Validation MSE: {val_mse_sklearn}")
print(f"Test MSE: {test_mse_sklearn}")

# Plot the actual vs predicted values for both models

plt.figure(figsize=(15, 10))

# Custom Model - Training Data
plt.subplot(2, 2, 1)
plt.scatter(y_train, y_train_pred, color='blue', label="Custom Model - Train Set")
plt.plot([min(y_train), max(y_train)], [min(y_train), max(y_train)], color='red', label="Ideal")
plt.title('Custom Model - Training Set: Actual vs Predicted')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.legend()

# Scikit-learn Model - Training Data
plt.subplot(2, 2, 2)
plt.scatter(y_train, y_train_pred_sklearn, color='green', label="Scikit-learn Model - Train Set")
plt.plot([min(y_train), max(y_train)], [min(y_train), max(y_train)], color='red', label="Ideal")
plt.title('Scikit-learn Model - Training Set: Actual vs Predicted')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.legend()

# Custom Model - Test Data
plt.subplot(2, 2, 3)
plt.scatter(y_test, y_test_pred, color='orange', label="Custom Model - Test Set")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', label="Ideal")
plt.title('Custom Model - Test Set: Actual vs Predicted')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.legend()

# Scikit-learn Model - Test Data
plt.subplot(2, 2, 4)
plt.scatter(y_test, y_test_pred_sklearn, color='purple', label="Scikit-learn Model - Test Set")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', label="Ideal")
plt.title('Scikit-learn Model - Test Set: Actual vs Predicted')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.legend()

plt.tight_layout()
plt.show()
Custom Model MSE:
Training MSE: 22.384617927405444
Validation MSE: 21.208792092413585
Test MSE: 22.102794782467647

Scikit-learn Model MSE:
Training MSE: 22.018971306970627
Validation MSE: 22.53291330113675
Test MSE: 22.14619818293375

(Figure: actual vs. predicted scatter plots comparing the custom and scikit-learn models on the training and test sets.)

This post is licensed under CC BY 4.0 by the author.