Some of you might have heard about diagnosing different health conditions with the use of artificial intelligence and machine learning. Artificial intelligence is a buzzword these days, and to those who know little about programming it might actually seem like real intelligence. But it’s not, at least not in 2017…
Like Kevin Kelly, I prefer to use AI as an acronym for augmented intelligence to describe learning machines.
So, what do these learning machines do and how come they are so very powerful at certain tasks? Well, let’s look at a specific example.
I’ll be using a machine learning library in Python on a cancer dataset to classify tumors as malignant or benign.
The Details of the Process
A little bit about the dataset first…
It’s called the Breast Cancer Wisconsin (Diagnostic) dataset, and it contains 569 samples (digitized images) of fine needle aspirates (FNAs) of breast masses.
Each sample describes the characteristics of the cell nuclei of the image. In machine learning, these characteristics are called features.
Each sample is classified as benign or malignant. These classifications, in machine learning, are called targets.
The distribution of tumors within the dataset: 212 are malignant, 357 are benign. So, I’d say it’s a fairly balanced distribution.
The number of characteristics (the features) that describe each sample is 30. These include:
– radius (mean of distances from center to points on the perimeter)
– texture (standard deviation of gray-scale values)
– smoothness (local variation in radius lengths)
– compactness (perimeter^2 / area – 1.0)
– concavity (severity of concave portions of the contour)
– etc.
We are going to feed these features to a machine learning algorithm, along with the diagnosis (the targets). This is how the algorithm will learn to classify them. This is how we train it. Then, we can ask the algorithm to classify (predict for) a new sample.
How do we evaluate the accuracy of the algorithm?
We do not feed the entire dataset to the algorithm, but only a subset of samples.
So, we divide our dataset into: training samples and testing samples. The usual split is 75% for training and 25% for testing.
What you have to keep in mind is that we have diagnosis for 100% of the dataset. So, we use 75% in training the algorithm:
image 1 (30 features) => benign
image 2 (30 features) => benign
image 3 (30 features) => malignant
This is how the algorithm learns to classify the tumor. Then, we evaluate it on the testing subset (25% of samples).
Here, we only ‘show’ it an image (its 30 features) and let it classify it (make the diagnosis). It applies the ‘knowledge’ learned during training. Then we look at the actual diagnosis of the image (remember, we know it) to see whether the algorithm got it right. This is how we evaluate its accuracy.
If we are not satisfied with the accuracy, we can tweak the parameters of the algorithm or we can use different algorithms. Being good at this (tweaking) requires solid knowledge of statistics and mathematics, especially linear algebra.
If we are satisfied with the performance of the algorithm we can further test it on new data; in this case, new samples (in the exact same format).
Ok, Now to Programming…
Today, there are many ways to do machine learning.
‘Exceptional’ is probably an understatement for what open source (shared collaboration, if I may) has done for technological development and the progression of knowledge…
The Python machine learning library we’re using here is scikit-learn. So, let’s jump right into it.
To understand this part (and the entire post – for that matter) you need to be literate in Python and Machine Learning and, by extension, in maths and stats. I’m going to use full ‘jargon’.
This is inspired by Andreas C. Müller and Sarah Guido’s book Introduction to Machine Learning with Python.
First, I’m going to make the necessary imports: the cancer dataset, the module for splitting the data, the classifier, and a module for plotting.
The classifier that I’m going to use is KNeighborsClassifier, which classifies based on value proximity (see the image at the top). I’m using Python 3.6 and Jupyter Notebooks when I do machine learning.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
Then, loading the dataset and splitting it into training and testing:
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=66)
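As a quick sanity check on the split and on the class balance mentioned earlier, here is a small snippet (the shapes below follow from train_test_split’s default test_size of 0.25):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

# Class distribution: 212 malignant (target 0), 357 benign (target 1)
print(dict(zip(cancer.target_names, np.bincount(cancer.target))))

# The default split is 75% training / 25% testing: 426 and 143 samples
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)
print(X_train.shape, X_test.shape)
```

Note that stratify=cancer.target keeps the malignant/benign proportions the same in both subsets.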
Creating the classifier (algorithm) and training it (fitting it):
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
And that’s it! Machine learning in a few lines of code; but, with a few caveats…
Notice, I’m passing a parameter to the classifier: I’m telling it to use 6 neighboring values (neighbors) to make the prediction. I evaluated the algorithm with different numbers of neighbors and found that 6 yields a good accuracy. As you can see below, the classifier is about 94% accurate on the test set, which is quite good in my opinion.
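Here is how that accuracy is computed, along with the neighbor sweep I used to settle on 6 (the sweep over 1–10 neighbors follows the pattern in Müller and Guido’s book; the exact score depends on the random_state chosen above):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# Accuracy of the n_neighbors=6 classifier on the held-out test set
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
print("test set accuracy:", clf.score(X_test, y_test))

# Try 1 to 10 neighbors and record training/test accuracy for each
neighbors = range(1, 11)
train_acc, test_acc = [], []
for n in neighbors:
    knn = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))
    test_acc.append(knn.score(X_test, y_test))

plt.plot(neighbors, train_acc, label="training accuracy")
plt.plot(neighbors, test_acc, label="test accuracy")
plt.xlabel("n_neighbors")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```

The plot also shows the classic trade-off: very few neighbors overfit (near-perfect training accuracy, lower test accuracy), while too many underfit.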
With the classifier trained, I could now use it on new values. But the dirty reality is that more often than not, things are not so simple…
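Before getting to that: here’s what classifying a ‘new’ value looks like. Since I don’t have genuinely new samples, I reuse a row from the test set as a stand-in; real new data would just need the same 30 features in the same order:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same setup as above
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)
clf = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)

# A 'new' sample: one row with the same 30 features
new_sample = X_test[:1]
prediction = clf.predict(new_sample)
print(cancer.target_names[prediction[0]])  # prints 'malignant' or 'benign'
```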
The advantage in this specific case is that scikit-learn comes ‘pre-packaged’ with this and a few other datasets. This means that they have been cleaned and nicely pre-processed. They have been readied for the pipeline, so to speak…
In reality, data is very messy and you have to spend an inordinate amount of time cleaning and pre-processing it (data wrangling). You have to make sure you avoid ‘garbage in, garbage out’.
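As a made-up illustration of one such pre-processing step (the numbers below are invented, not from the dataset): if some measurements were missing, you could fill them in with the column mean using scikit-learn’s SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented fragment of 'messy' data: NaN marks a missing measurement
X_messy = np.array([[17.99, 10.38, np.nan],
                    [20.57, np.nan, 132.90],
                    [19.69, 21.25, 130.00]])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_clean = imputer.fit_transform(X_messy)
print(X_clean)  # same shape, no NaNs left
```

Whether mean imputation is appropriate is itself a modeling decision; the point is only that real pipelines need such steps before anything is fed to a classifier.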
What I look forward to doing is testing different classifiers (algorithms) against this dataset to see how I can improve on this accuracy. If you have questions pertinent to this, or if you need help should you decide to work on similar stuff, let me know in the comments below.