This repository was archived by the owner on Jul 22, 2019. It is now read-only.
cancer_classification.py
# -*- coding: utf-8 -*-
"""
Created on Sat Dec 22 15:18:59 2018
@author: Koffi Moïse AGBENYA
This script uses SVM (Support Vector Machines) to build and train a model on
human cell records, and to classify each cell sample as benign or malignant.
SVM works by mapping data to a high-dimensional feature space so that data
points can be categorized, even when the data are not otherwise linearly
separable. A separator between the categories is found, then the data are
transformed in such a way that the separator could be drawn as a hyperplane.
Following this, characteristics of new data can be used to predict the group to
which a new record should belong.
About dataset:
The example is based on a dataset that is publicly available from the UCI
Machine Learning Repository (Asuncion and Newman, 2007)
[http://mlearn.ics.uci.edu/MLRepository.html]. The dataset consists of several
hundred human cell sample records, each of which contains the values of a set
of cell characteristics.
The fields in each record are:

Field name    Description
ID            Patient identifier
Clump         Clump thickness
UnifSize      Uniformity of cell size
UnifShape     Uniformity of cell shape
MargAdh       Marginal adhesion
SingEpiSize   Single epithelial cell size
BareNuc       Bare nuclei
BlandChrom    Bland chromatin
NormNucl      Normal nucleoli
Mit           Mitoses
Class         Benign or malignant
"""
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
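#The module docstring says that SVM maps data into a higher-dimensional feature
#space so that classes become separable by a hyperplane. A minimal sketch of
#that idea, assuming a made-up 1-D toy dataset and a hand-picked feature map
#phi(x) = (x, x^2); neither comes from the cell samples data:

```python
import numpy as np

# Hypothetical 1-D toy data: class 1 sits between the two groups of class 0,
# so no single threshold on x separates the classes.
toy_x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
toy_y = np.array([0, 0, 1, 1, 1, 0, 0])

# Explicit feature map phi(x) = (x, x^2): lift each point into 2-D.
toy_phi = np.column_stack([toy_x, toy_x ** 2])

# In the lifted space the line x^2 = 2.5 separates the classes:
# class 1 has x^2 <= 1, class 0 has x^2 >= 4.
toy_pred = (toy_phi[:, 1] < 2.5).astype(int)
print(toy_pred)
print(bool(np.array_equal(toy_pred, toy_y)))
```

#A kernel SVM does not build phi explicitly; it only needs inner products in
#the lifted space, which the kernel function supplies.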
path = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv"
cell_df = pd.read_csv(path)
print(cell_df.head())
#The ID field contains the patient identifiers. The characteristics of the cell
#samples from each patient are contained in fields Clump to Mit. The values are
#graded from 1 to 10, with 1 being the closest to benign.
#The Class field contains the diagnosis, as confirmed by separate medical
#procedures, as to whether the samples are benign (value = 2) or malignant
#(value = 4).
#Let's look at the distribution of the classes based on Clump thickness and
#Uniformity of cell size (first 50 samples of each class):
ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant')
cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax)
plt.show()
#Data pre-processing and selection
#Let's first look at the column data types:
print(cell_df.dtypes)
#It looks like the BareNuc column includes some values that are not numerical.
#We can drop those rows:
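#.copy() makes the filtered frame independent of the original, so the column
#assignment below does not raise a SettingWithCopyWarning.
cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()].copy()
cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')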
print(cell_df.dtypes)
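#A small sketch of how the coerce-and-filter step above behaves. The frame
#below is hypothetical, and the "?" entry is only an assumed example of a
#non-numeric value; the actual contents of the raw file are not shown here.

```python
import pandas as pd

# Hypothetical miniature of the BareNuc column with one unparseable entry.
demo_df = pd.DataFrame({"BareNuc": ["1", "10", "?", "3"], "Class": [2, 4, 2, 2]})

# errors="coerce" turns anything unparseable into NaN instead of raising.
demo_numeric = pd.to_numeric(demo_df["BareNuc"], errors="coerce")
demo_mask = demo_numeric.notnull()

# Keep only the parseable rows, then convert the column to integers.
demo_clean = demo_df[demo_mask].copy()
demo_clean["BareNuc"] = demo_clean["BareNuc"].astype("int")

print(demo_mask.tolist())
print(demo_clean["BareNuc"].tolist())
```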
feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)
print(X[0:5])
#We want the model to predict the value of Class (that is, benign (=2) or
#malignant (=4)). As this field can have one of only two possible values, we
#cast it to an integer type so that it is treated as a discrete class label.
cell_df['Class'] = cell_df['Class'].astype('int')
y = np.asarray(cell_df['Class'])
print(y[0:5])
#Train/Test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
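#To illustrate what an 80/20 split does, here is a hand-rolled sketch using
#only NumPy. The helper simple_split and the toy arrays are hypothetical; this
#is not how train_test_split is implemented, only the same idea in miniature.

```python
import numpy as np

def simple_split(features, labels, test_size=0.2, seed=4):
    """Shuffle row indices once, then use the first `test_size` share as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(np.ceil(test_size * len(features)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]

demo_X = np.arange(20).reshape(10, 2)   # 10 hypothetical samples, 2 features
demo_y = np.array([2, 4] * 5)           # labels use this file's 2/4 coding

Xtr, Xte, ytr, yte = simple_split(demo_X, demo_y)
print('Train set:', Xtr.shape, ytr.shape)
print('Test set:', Xte.shape, yte.shape)
```

#Fixing the seed plays the same role as random_state above: the same rows land
#in the test set on every run.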
#Modeling SVM
#The SVM algorithm offers a choice of kernel functions for performing its
#processing. Basically, mapping data into a higher dimensional space is called
#kernelling. The mathematical function used for the transformation is known as
#the kernel function, and can be of different types, such as:
# 1.Linear
# 2.Polynomial
# 3.Radial basis function (RBF)
# 4.Sigmoid
#Each of these functions has its characteristics, its pros and cons, and its
#equation, but as there's no easy way of knowing which function performs best
#with any given dataset, we usually choose different functions in turn and
#compare the results. Let's just use the default, RBF (Radial Basis Function)
#for this set.
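#A minimal sketch of the four kernel functions listed above, written in their
#common textbook forms. The function names, the parameter names gamma, coef0
#and degree, and their default values are illustrative assumptions, not
#settings taken from this script.

```python
import numpy as np

def k_linear(a, b):
    # Plain inner product: no implicit mapping beyond the input space.
    return float(np.dot(a, b))

def k_polynomial(a, b, gamma=1.0, coef0=1.0, degree=3):
    # (gamma * <a, b> + coef0) ** degree
    return float((gamma * np.dot(a, b) + coef0) ** degree)

def k_rbf(a, b, gamma=0.5):
    # exp(-gamma * ||a - b||^2): similarity decays with squared distance.
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

def k_sigmoid(a, b, gamma=0.5, coef0=0.0):
    # tanh(gamma * <a, b> + coef0)
    return float(np.tanh(gamma * np.dot(a, b) + coef0))

u = np.array([1.0, 2.0])
v = np.array([2.0, 0.0])

for name, kernel in [("linear", k_linear), ("polynomial", k_polynomial),
                     ("rbf", k_rbf), ("sigmoid", k_sigmoid)]:
    print(name, kernel(u, v))
```

#Each function returns a single similarity value for a pair of samples; the
#SVM only ever sees these pairwise values, never the mapped coordinates.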
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
#After being fitted, the model can then be used to predict new values:
yhat = clf.predict(X_test)
print(yhat[0:5])
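#The comments above note that kernels are usually tried in turn and compared.
#A hedged sketch of such a loop, assuming synthetic stand-in data from
#make_classification (not the cell samples) and plain test-set accuracy as
#the comparison metric:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 200 rows, 9 features, two classes.
syn_X, syn_y = make_classification(n_samples=200, n_features=9, random_state=4)
syn_X_train, syn_X_test, syn_y_train, syn_y_test = train_test_split(
    syn_X, syn_y, test_size=0.2, random_state=4)

# Fit one SVC per kernel and record its accuracy on the held-out rows.
kernel_scores = {}
for kernel_name in ("linear", "poly", "rbf", "sigmoid"):
    model = SVC(kernel=kernel_name)
    model.fit(syn_X_train, syn_y_train)
    kernel_scores[kernel_name] = model.score(syn_X_test, syn_y_test)

for kernel_name, score in kernel_scores.items():
    print(f"{kernel_name:8s} accuracy = {score:.3f}")
```

#With a single split the ranking can change from one random_state to the
#next, so cross-validation would give a steadier comparison.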
#Evaluation
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])
np.set_printoptions(precision=2)
print(classification_report(y_test, yhat))
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)', 'Malignant(4)'], normalize=False, title='Confusion matrix')
#Display the figure when the file is run as a script.
plt.show()
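#The normalize=True option of plot_confusion_matrix divides each row by its
#total. A small sketch of that arithmetic, using a hypothetical 2x2 matrix
#(these counts are made up, not results from the model above):

```python
import numpy as np

# Rows are true labels, columns are predicted labels (benign, malignant).
demo_cm = np.array([[8, 2],
                    [1, 9]])

# Sum over columns gives one total per true class; the new axis lets the
# division broadcast row by row.
row_totals = demo_cm.sum(axis=1)[:, np.newaxis]
demo_cm_normalized = demo_cm.astype(float) / row_totals
print(demo_cm_normalized)
```

#After normalization each row sums to 1, and the diagonal holds the recall of
#each class.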
#F1 Score
from sklearn.metrics import f1_score
print(f1_score(y_test, yhat, average='weighted'))
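#To show what the weighted F1 score measures, here is a plain-Python sketch.
#The helpers f1_for_label and weighted_f1 and the label lists are
#hypothetical; the per-class F1 is 2 * precision * recall / (precision +
#recall), and the weighted average uses each class's share of the true labels.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 of one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 values, weighted by true-label counts."""
    total = len(y_true)
    return sum(f1_for_label(y_true, y_pred, label) * y_true.count(label) / total
               for label in sorted(set(y_true)))

demo_true = [2, 2, 2, 4, 4, 2]   # made-up labels in this file's 2/4 coding
demo_pred = [2, 2, 4, 4, 4, 2]

print(f1_for_label(demo_true, demo_pred, 2))
print(f1_for_label(demo_true, demo_pred, 4))
print(weighted_f1(demo_true, demo_pred))
```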
#Jaccard index
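#jaccard_similarity_score was removed in scikit-learn 0.23; jaccard_score
#replaces it. pos_label=4 reports the Jaccard index of the malignant class.
from sklearn.metrics import jaccard_score
print(jaccard_score(y_test, yhat, pos_label=4))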
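#The Jaccard index of a class is the size of the intersection of the true and
#predicted sets for that class divided by the size of their union, that is
#TP / (TP + FP + FN). A plain-Python sketch, with a hypothetical helper name
#and made-up labels:

```python
def jaccard_for_label(y_true, y_pred, label):
    """Jaccard index of one class: TP / (TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    union = tp + fp + fn
    return tp / union if union else 0.0

jac_true = [2, 2, 2, 4, 4, 2]   # made-up labels in this file's 2/4 coding
jac_pred = [2, 2, 4, 4, 4, 2]

print(jaccard_for_label(jac_true, jac_pred, 2))
print(jaccard_for_label(jac_true, jac_pred, 4))
```

#True negatives do not appear in the formula, so the index is never higher
#than the F1 score of the same class.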