Implementing Logistic Regression for Predicting if a Tumor is Malignant or Benign (From Scratch)

Introduction

Below is my Kaggle notebook for a project that implements logistic regression from scratch to predict whether a tumor is malignant or benign, and compares it with scikit-learn's built-in implementation.
Enjoy!

Notebook

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/breast-cancer-wisconsin-data/data.csv

Overview

In this notebook I will predict whether a tumor is malignant or benign using logistic regression. I will implement everything from scratch, then compare my results with the built-in implementation in scikit-learn.

Dataset

First I will load and explore the dataset. I'm working with the Breast Cancer Wisconsin (Diagnostic) Data Set on Kaggle:

https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data

df = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")
df.head()
         id  diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean
0    842302          M        17.99         10.38          122.80     1001.0
1    842517          M        20.57         17.77          132.90     1326.0
2  84300903          M        19.69         21.25          130.00     1203.0
3  84348301          M        11.42         20.38           77.58      386.1
4  84358402          M        20.29         14.34          135.10     1297.0

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  ...
0          0.11840           0.27760          0.3001              0.14710  ...
1          0.08474           0.07864          0.0869              0.07017  ...
2          0.10960           0.15990          0.1974              0.12790  ...
3          0.14250           0.28390          0.2414              0.10520  ...
4          0.10030           0.13280          0.1980              0.10430  ...

   texture_worst  perimeter_worst  area_worst  smoothness_worst  compactness_worst
0          17.33           184.60      2019.0            0.1622             0.6656
1          23.41           158.80      1956.0            0.1238             0.1866
2          25.53           152.50      1709.0            0.1444             0.4245
3          26.50            98.87       567.7            0.2098             0.8663
4          16.67           152.20      1575.0            0.1374             0.2050

   concavity_worst  concave points_worst  symmetry_worst  fractal_dimension_worst  Unnamed: 32
0           0.7119                0.2654          0.4601                  0.11890          NaN
1           0.2416                0.1860          0.2750                  0.08902          NaN
2           0.4504                0.2430          0.3613                  0.08758          NaN
3           0.6869                0.2575          0.6638                  0.17300          NaN
4           0.4000                0.1625          0.2364                  0.07678          NaN

5 rows × 33 columns

df.shape
(569, 33)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
df.describe()
                 id  radius_mean  texture_mean  perimeter_mean    area_mean
count  5.690000e+02   569.000000    569.000000      569.000000   569.000000
mean   3.037183e+07    14.127292     19.289649       91.969033   654.889104
std    1.250206e+08     3.524049      4.301036       24.298981   351.914129
min    8.670000e+03     6.981000      9.710000       43.790000   143.500000
25%    8.692180e+05    11.700000     16.170000       75.170000   420.300000
50%    9.060240e+05    13.370000     18.840000       86.240000   551.100000
75%    8.813129e+06    15.780000     21.800000      104.100000   782.700000
max    9.113205e+08    28.110000     39.280000      188.500000  2501.000000

       smoothness_mean  compactness_mean  concavity_mean  concave points_mean  symmetry_mean  ...
count       569.000000        569.000000      569.000000           569.000000     569.000000  ...
mean          0.096360          0.104341        0.088799             0.048919       0.181162  ...
std           0.014064          0.052813        0.079720             0.038803       0.027414  ...
min           0.052630          0.019380        0.000000             0.000000       0.106000  ...
25%           0.086370          0.064920        0.029560             0.020310       0.161900  ...
50%           0.095870          0.092630        0.061540             0.033500       0.179200  ...
75%           0.105300          0.130400        0.130700             0.074000       0.195700  ...
max           0.163400          0.345400        0.426800             0.201200       0.304000  ...

       texture_worst  perimeter_worst   area_worst  smoothness_worst  compactness_worst
count     569.000000       569.000000   569.000000        569.000000         569.000000
mean       25.677223       107.261213   880.583128          0.132369           0.254265
std         6.146258        33.602542   569.356993          0.022832           0.157336
min        12.020000        50.410000   185.200000          0.071170           0.027290
25%        21.080000        84.110000   515.300000          0.116600           0.147200
50%        25.410000        97.660000   686.500000          0.131300           0.211900
75%        29.720000       125.400000  1084.000000          0.146000           0.339100
max        49.540000       251.200000  4254.000000          0.222600           1.058000

       concavity_worst  concave points_worst  symmetry_worst  fractal_dimension_worst  Unnamed: 32
count       569.000000            569.000000      569.000000               569.000000          0.0
mean          0.272188              0.114606        0.290076                 0.083946          NaN
std           0.208624              0.065732        0.061867                 0.018061          NaN
min           0.000000              0.000000        0.156500                 0.055040          NaN
25%           0.114500              0.064930        0.250400                 0.071460          NaN
50%           0.226700              0.099930        0.282200                 0.080040          NaN
75%           0.382900              0.161400        0.317900                 0.092080          NaN
max           1.252000              0.291000        0.663800                 0.207500          NaN

8 rows × 32 columns

list(df.columns.values)
['id',
 'diagnosis',
 'radius_mean',
 'texture_mean',
 'perimeter_mean',
 'area_mean',
 'smoothness_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'symmetry_mean',
 'fractal_dimension_mean',
 'radius_se',
 'texture_se',
 'perimeter_se',
 'area_se',
 'smoothness_se',
 'compactness_se',
 'concavity_se',
 'concave points_se',
 'symmetry_se',
 'fractal_dimension_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'compactness_worst',
 'concavity_worst',
 'concave points_worst',
 'symmetry_worst',
 'fractal_dimension_worst',
 'Unnamed: 32']

Here I noticed a column named ‘Unnamed: 32’ whose values are all NaN (df.info() reports 0 non-null entries for it), so I drop it.

df.drop(["Unnamed: 32"], axis = 1, inplace=True)
list(df.columns.values)
['id',
 'diagnosis',
 'radius_mean',
 'texture_mean',
 'perimeter_mean',
 'area_mean',
 'smoothness_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'symmetry_mean',
 'fractal_dimension_mean',
 'radius_se',
 'texture_se',
 'perimeter_se',
 'area_se',
 'smoothness_se',
 'compactness_se',
 'concavity_se',
 'concave points_se',
 'symmetry_se',
 'fractal_dimension_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'compactness_worst',
 'concavity_worst',
 'concave points_worst',
 'symmetry_worst',
 'fractal_dimension_worst']

Now we are talking :)
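
As a side note, the empty column could also be found programmatically instead of by reading the column list. This is a minimal sketch on a toy frame, not the notebook's own code; the names `toy`, `empty_cols` and `cleaned` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the last column mimics 'Unnamed: 32'.
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 19.69],
    "texture_mean": [10.38, 17.77, 21.25],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()

# dropna(axis=1, how="all") removes exactly those columns and returns a new frame.
cleaned = toy.dropna(axis=1, how="all")

print(empty_cols)
print(list(cleaned.columns))
```

On the real dataset the same two calls would presumably flag only ‘Unnamed: 32’, since every other column shows 569 non-null values.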

# Shuffle the data
data = df.sample(frac=1, random_state=42).reset_index(drop=True)
# Calculate the number of samples for each set
train_size = int(0.8 * len(data))
test_size = len(data) - train_size  # remaining rows; truncating 0.2 * len(data) separately would give one row fewer

# Split the dataset into training and test sets
train_data = data[:train_size] # training data (80%)
test_data = data[train_size:] # test data (20%)
# Split the features and target for each set
# Note: only 'diagnosis' is dropped here, so the 'id' column stays in X as a feature
X_train = train_data.drop('diagnosis', axis=1)
y_train = train_data['diagnosis']

X_test = test_data.drop('diagnosis', axis=1)
y_test = test_data['diagnosis']

y_train = y_train.map({'M': 1, 'B': 0}).astype(np.float64)
y_test = y_test.map({'M': 1, 'B': 0}).astype(np.float64)
print(f"Training set size: {len(X_train)} samples")
print(f"Test set size: {len(X_test)} samples")
Training set size: 455 samples
Test set size: 114 samples
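
The 455/114 sizes follow from integer truncation of the 80% mark: everything after the first 455 shuffled rows goes to the test set. The short sketch below just redoes that arithmetic for the 569 rows reported by df.shape; `remaining` and `naive_test_size` are illustrative names, not variables from the notebook.

```python
n = 569  # number of rows reported by df.shape

train_size = int(0.8 * n)        # 455.2 truncated to 455
remaining = n - train_size       # rows left over for the test slice
naive_test_size = int(0.2 * n)   # 113.8 truncated to 113

print(train_size, remaining, naive_test_size)  # 455 114 113
```

Truncating both fractions independently would account for only 568 of the 569 rows, which is why slicing with `data[train_size:]` is the safer way to form the test set.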

Models

Logistic Regression from scratch
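
For reference, the class below performs batch gradient descent on the binary cross-entropy cost. Written with the code's own names (`w`, `b`, `alpha` as $\alpha$, `m`), and with $x^{(i)}$, $y^{(i)}$ denoting the i-th training row and its label (notation introduced here, not in the notebook), each iteration computes:

```latex
\begin{aligned}
z^{(i)} &= w^\top x^{(i)} + b, \qquad
\hat{y}^{(i)} = \sigma\left(z^{(i)}\right) = \frac{1}{1 + e^{-z^{(i)}}} \\
J(w, b) &= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)}
           + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right] \\
\frac{\partial J}{\partial w} &= \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)}, \qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) \\
w &\leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
\end{aligned}
```

In the code, `error` is $\hat{y} - y$, `dw` and `db` are the two partial derivatives, and `cost` is $J$ evaluated with the predictions made before the update.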

import matplotlib.pyplot as plt

class LogisticRegression :
    def __init__(self, alpha=0.01, iterations=1000, scale=False):
        self.alpha = alpha
        self.iterations = iterations
        self.scale = scale
        self.w = None
        self.b = None
        self.cost_history = []

    def scale_features(self,X):
        """Scale features using mean and std deviation (standardization)."""
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return (X - self.mean) / self.std


    def fit(self, X, y):
        m = len(y) # Number of training examples

        # Scale the features if needed
        if self.scale : 
            X = self.scale_features(X)
        
        # Initialize weights and bias
        self.w = np.zeros(X.shape[1])
        self.b = 0
        
        for i in range(self.iterations):
            
            z = X.dot(self.w) + self.b
            # pred = np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))
            pred = 1 / (1 + np.exp(-z))
            
            epsilon = 1e-8
            pred = np.clip(pred, epsilon, 1 - epsilon)
            
            error = pred - y

            dw = (1/m) * np.dot(X.T , error)
            db = (1/m) * np.sum(error)

            # Update the weights and bias
            self.w -= self.alpha * dw
            self.b -= self.alpha * db

            # Calculate and store the cost
            cost = (-1/m) * np.sum(y * np.log(pred) + (1 - y) * np.log(1 - pred))
            self.cost_history.append(cost)

    def predict(self, X):
        """Make class predictions (0 or 1) using a 0.5 probability threshold."""
        return (self.predict_proba(X) >= 0.5).astype(int)

    def predict_proba(self, X):
        """Return probability estimates for the positive class."""
        if self.scale:
            X = (X - self.mean) / self.std
        logits = X.dot(self.w) + self.b
        return 1 / (1 + np.exp(-logits))

    def get_cost_history(self):
        return self.cost_history
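
The commented-out `np.where` line in `fit` hints at a numerically safer sigmoid. `np.where` evaluates both branches for every element, so it still calls `np.exp` on large values. A possible alternative, sketched here under the assumption that `z` is a NumPy array, is to apply each formula only where it is safe; `stable_sigmoid` is a hypothetical helper and is not used in the notebook.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a large positive number."""
    z = np.asarray(z, dtype=np.float64)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so this form cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, exp(z) < 1, so use the algebraically equivalent form.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

probs = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(probs)
```

With standardized features the logits are unlikely to get this extreme, so the plain `1 / (1 + np.exp(-z))` used in the class is probably fine here; the sketch only matters for unscaled inputs or very large weights.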
# Initialize and train the model
model = LogisticRegression(alpha=0.1, iterations=1000, scale=True)
model.fit(X_train, y_train)

cost_history = model.get_cost_history()

plt.plot(range(1, len(cost_history) + 1), cost_history, color='blue')
plt.title('Cost History During Training')
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.grid(True)
plt.show()

# Make predictions on the training and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

[Figure: "Cost History During Training", cost (y-axis) versus iterations (x-axis)]

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score, roc_curve

train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

conf_matrix = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:")
print(conf_matrix)

report = classification_report(y_test, y_test_pred)
print("Classification Report:")
print(report)
Training Accuracy: 0.9912087912087912
Test Accuracy: 0.9736842105263158
Confusion Matrix:
[[66  1]
 [ 2 45]]
Classification Report:
              precision    recall  f1-score   support

         0.0       0.97      0.99      0.98        67
         1.0       0.98      0.96      0.97        47

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
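
To see where the report's numbers come from, the metrics for the malignant class (label 1) can be recomputed by hand from the confusion matrix above. This is only a worked-arithmetic sketch; the variable names are illustrative, and it relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above, in label order [benign (0), malignant (1)]:
# [[66  1]
#  [ 2 45]]
tn, fp = 66, 1   # true benign: predicted benign / predicted malignant
fn, tp = 2, 45   # true malignant: predicted benign / predicted malignant

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 111 / 114
precision = tp / (tp + fp)                   # 45 / 46
recall = tp / (tp + fn)                      # 45 / 47
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.9737 precision=0.98 recall=0.96 f1=0.97
```

The two false negatives (malignant tumors predicted as benign) are what pull recall for class 1 down to 0.96.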

Logistic Regression using scikit-learn

from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from sklearn.preprocessing import StandardScaler

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the sklearn logistic regression model
clf = SklearnLogisticRegression(random_state=42, solver='lbfgs', max_iter=1000)
clf.fit(X_train_scaled, y_train)

# Make predictions on the training and test sets
y_train_pred_sklearn = clf.predict(X_train_scaled)
y_test_pred_sklearn = clf.predict(X_test_scaled)

# Evaluate the performance
train_accuracy = accuracy_score(y_train, y_train_pred_sklearn)
test_accuracy = accuracy_score(y_test, y_test_pred_sklearn)
conf_matrix = confusion_matrix(y_test, y_test_pred_sklearn)
class_report = classification_report(y_test, y_test_pred_sklearn)

print("Training Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)
Training Accuracy: 0.9934065934065934
Test Accuracy: 0.956140350877193
Confusion Matrix:
[[66  1]
 [ 4 43]]
Classification Report:
              precision    recall  f1-score   support

         0.0       0.94      0.99      0.96        67
         1.0       0.98      0.91      0.95        47

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
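
One difference to keep in mind when comparing the two models: scikit-learn's LogisticRegression applies L2 regularization by default (C=1.0), while the from-scratch class minimizes the plain cross-entropy cost. The sketch below only illustrates how an L2 term changes a gradient step, on a tiny made-up 1-D dataset. `gd_weight`, `lam` and the data are assumptions for illustration, and the penalty scaling does not reproduce scikit-learn's exact objective.

```python
import numpy as np

def gd_weight(x, y, alpha=0.1, iterations=200, lam=0.0):
    """Gradient descent for 1-D logistic regression without a bias term.

    lam is a hypothetical L2 penalty strength; lam=0 gives the plain update.
    """
    w = 0.0
    for _ in range(iterations):
        pred = 1.0 / (1.0 + np.exp(-w * x))
        grad = np.mean((pred - y) * x) + lam * w  # the L2 term pulls w toward 0
        w -= alpha * grad
    return w

# Made-up, perfectly separable toy data.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

w_plain = gd_weight(x, y, lam=0.0)
w_l2 = gd_weight(x, y, lam=0.5)
print(w_plain, w_l2)
```

The penalized weight ends up smaller than the unpenalized one, so the two models are not optimizing exactly the same thing, and small differences in the confusion matrices are to be expected.
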
This post is licensed under CC BY 4.0 by the author.