-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathLogistic Regression.py
More file actions
159 lines (139 loc) Β· 7.44 KB
/
Logistic Regression.py
File metadata and controls
159 lines (139 loc) Β· 7.44 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
'''
Logistic Regression with Python
In this notebook, you will learn Logistic Regression, and then, you'll create a model for a telecommunication company,
to predict when its customers will leave for a competitor, so that they can take some action to retain the customers.
What is the difference between Linear and Logistic Regression?
While Linear Regression is suited for estimating continuous values (e.g. estimating house price),
it is not the best tool for predicting the class of an observed data point.
In order to estimate the class of a data point, we need some sort of guidance on what would be the most probable class for that
data point. For this, we use Logistic Regression.Recall linear regression:
As you know, Linear regression finds a function that relates a continuous dependent variable, y, to some predictors (independent variables π₯1 , π₯2 , etc.).
For example, Simple linear regression assumes a function of the form:
π¦=π0+π1π₯1+π2π₯2+β―
and finds the values of parameters π0,π1,π2 , etc, where the term π0 is the "intercept". It can be generally shown as:
βπ(π₯)=πππ
Logistic Regression is a variation of Linear Regression, useful when the observed dependent variable, y, is categorical.
It produces a formula that predicts the probability of the class label as a function of the independent variables.
Logistic regression fits a special s-shaped curve by taking the linear regression and transforming the numeric estimate into a probability
with the following function, which is called sigmoid function π:
βπ(π₯)=π(πππ)=π(π0+π1π₯1+π2π₯2+...)1+π(π0+π1π₯1+π2π₯2+β―)
Or:
ππππππππππ‘π¦ππππΆπππ π 1=π(π=1|π)=π(πππ)=ππππ1+ππππ
In this equation, πππ is the regression result (the sum of the variables weighted by the coefficients),
exp is the exponential function and π(πππ) is the sigmoid or logistic function, also called logistic curve.
It is a common "S" shape (sigmoid curve).
So, briefly, Logistic Regression passes the input through the logistic/sigmoid but then treats the result as a probability:
The objective of Logistic Regression algorithm, is to find the best parameters ΞΈ, for βπ(π₯) = π(πππ) ,
in such a way that the model best predicts the class of each case.
Customer churn with Logistic Regression
A telecommunications company is concerned about the number of customers leaving their land-line business for cable competitors.
They need to understand who is leaving. Imagine that you are an analyst at this company and you have to find out who is leaving and why.
'''
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline
import matplotlib.pyplot as plt
!wget -O ChurnData.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv
churn_df = pd.read_csv("ChurnData.csv")
churn_df.head()
'''
Data pre-processing and selection
Lets select some features for the modeling. Also we change the target data type to be integer, as it is a requirement by the skitlearn algorithm:
'''
churn_df = churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless','churn']]
churn_df['churn'] = churn_df['churn'].astype('int')
churn_df.head()
'''
the basic steps before modeling
'''
X = np.asarray(churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
X[0:5]
y = np.asarray(churn_df['churn'])
y [0:5]
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
'''
MODELING
Modeling (Logistic Regression with Scikit-learn)
Lets build our model using LogisticRegression from Scikit-learn package. This function implements logistic regression and can use
different numerical optimizers to find parameters, including βnewton-cgβ, βlbfgsβ, βliblinearβ, βsagβ, βsagaβ solvers.
You can find extensive information about the pros and cons of these optimizers if you search it in internet.
The version of Logistic Regression in Scikit-learn, support regularization. Regularization is a technique used to solve the
overfitting problem in machine learning models. C__ parameter indicates __inverse of regularization strength which must be a
positive float. Smaller values specify stronger regularization. Now lets fit our model with train set:
'''
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
yhat = LR.predict(X_test)
yhat
yhat_prob = LR.predict_proba(X_test)
yhat_prob
'''
Evaluation
jaccard index
Lets try jaccard index for accuracy evaluation.
we can define jaccard as the size of the intersection divided by the size of the union of two label sets.
If the entire set of predicted labels for a sample strictly match with the true set of labels,
then the subset accuracy is 1.0; otherwise it is 0.0.
'''
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)
#confusion matrix
#Another way of looking at accuracy of classifier is to look at confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
print(confusion_matrix(y_test, yhat, labels=[1,0]))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')
print (classification_report(y_test, yhat))
'''
log loss
Now, lets try log loss for evaluation. In logistic regression, the output can be the probability of customer churn is yes
(or equals to 1). This probability is a value between 0 and 1. Log loss( Logarithmic loss) measures the performance of a
classifier where the predicted output is a probability value between 0 and 1.
'''
from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)