Machine Learning For Complete Beginners

The problem: Many machine learning tutorials out there expect you to have a PhD in Statistics or Mathematics. This tutorial is written for beginners, assuming no previous knowledge of machine learning.

Format: We will start off with an introduction to machine learning, followed by a machine learning script that tries to predict which people survived the Titanic. This is followed by two practice sessions for you: I will guide you on how to proceed, but you have to write the code yourself.

Prerequisite knowledge: A knowledge of Python is assumed. Also, basic knowledge of Pandas is expected. If you are new to Pandas, follow the basic lessons here.

The development is done using IPython. I recommend you use Anaconda with Python 3.

Note: There are 5 videos + transcript in this series. The videos are mixed with the transcripts, so scroll down if you are only interested in the videos. Make sure you turn on HD.

Introduction to Machine Learning

To start off, here is an introduction to machine learning, a short presentation that goes over the basics.

There is no transcript, but the presentation is available on Github. If you are completely new to machine learning, though, I strongly recommend you watch the video, as I talk over several points that may not be obvious from just looking at the presentation.

You have a task in the presentation. Look at titanic_train.csv (can be opened in Excel or OpenOffice), and guess which fields would be useful for our machine learning algorithm. For example, does age matter when predicting who would survive the Titanic? What about the port of boarding? Select 2-3 columns you feel are the most important.

Machine Learning with Python

The next video starts the actual coding. The code is available on Github. You don’t need a Github account, as you can download the repo as a zip file. Titanic_Machine_Learning.ipynb is the file we will be working on.

Transcript:

import numpy as np
import pandas as pd

So we start by importing everything we need. Pandas and Numpy are obviously needed.

# The machine learning algorithm
from sklearn.ensemble import RandomForestClassifier

For machine learning, we are using the Random Forest algorithm. You don’t need to know how it works internally (for this example), but you do need to know how to use it.

# Test train split
from sklearn.model_selection import train_test_split

If you watched the presentation (and you really should have, or you won’t follow half the code), you know we need to use test / train split to avoid overfitting. So we import the train_test_split() function.

# Just to switch off pandas warning
pd.options.mode.chained_assignment = None

This just switches off a Pandas warning.

# Used to write our model to a file
# (older scikit-learn versions used: from sklearn.externals import joblib)
import joblib

Finally, we import the joblib library. This will be used to write our model to a file for reuse.

We will now open our csv file in Pandas.

data = pd.read_csv("titanic_train.csv")
data.head()
   Unnamed: 0  row.names  pclass  survived                         name  age     embarked             home.dest  room  ticket   boat     sex
0         998        999     3rd         1         McCarthy, Miss Katie  NaN          NaN                   NaN   NaN     NaN    NaN  female
1         179        180     1st         0     Millet, Mr Francis Davis   65  Southampton  East Bridgewater, MA   NaN     NaN  (249)    male
2         556        557     2nd         0     Sjostedt, Mr Ernst Adolf   59  Southampton    Sault St Marie, ON   NaN     NaN    NaN    male
3         174        175     1st         0  McCaffry, Mr Thomas Francis   46    Cherbourg         Vancouver, BC   NaN     NaN  (292)    male
4        1232       1233     3rd         0             Strilic, Mr Ivan  NaN          NaN                   NaN   NaN     NaN    NaN    male

Look at the age. The first and last values are NaN, which means null, or empty. There are a lot of other NaNs in our data.

In the first presentation, I gave you a task: to find out which columns in the table above would be suitable inputs for our machine learning algorithm. Remember, we need both inputs and expected output (if you don’t know what that is, look at the presentation video again).

The expected output is the survived field. What about the input?

If you remember the movie Titanic, you will know that the rich were more likely to survive. Also, the first preference was given to women and children.

Based on this, we can say 3 things mattered the most to surviving the Titanic: How rich you were, your age, and your sex.

Rich older women and children were the most likely to survive.

Poor middle aged men were the least likely to survive. We know this just from the movie.

Age and sex are directly visible in our table. What about wealth? While there are multiple columns (like ticket price), the most direct field is the passenger class. First class passengers were the most likely to survive, no matter what price they paid for their ticket.

So these are the 3 inputs to our machine learning algorithm: Passenger class, age and sex

The expected output is the survived field.

Before we can extract these values, look at the csv file in Excel/OpenOffice. Specifically, the age field. The age is missing for large parts of the data. It seems the age is missing for most of the 3rd class passengers, while it is mostly there for first class.

We can’t just throw away the empty fields, as we will be getting rid of most of the 3rd class passenger data.

Our solution? Replace the empty fields with the median age.

Before we go ahead, are you clear on the difference between mean, median and mode?

The mean is what we call the average.

The median is the middle value once the values are sorted. For example, in

1, 2, 5, 10, 21, 33, 57

10 is the median, because it is the middle value.

Mode is the most common or repeated value.
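If you want to check these three definitions yourself, here is a small sketch using Python's built-in statistics module. The first list is the one from the example above. The second list is made up by me, because every value in the first list appears only once, so it has no interesting mode.

```python
import statistics

values = [1, 2, 5, 10, 21, 33, 57]   # the list from the text above

mean_value = statistics.mean(values)      # sum of the values divided by how many there are
median_value = statistics.median(values)  # middle value once the list is sorted

# Every value above appears only once, so a second, made-up list is used for the mode.
repeated = [3, 7, 7, 9, 7, 3]
mode_value = statistics.mode(repeated)    # the value that occurs most often

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```

Notice how the mean is pulled upwards by the large values (33 and 57), while the median stays at 10. That is one reason the median is often preferred for filling in missing values.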

For this example, we will use the median for the age, though you can experiment with the others.

median_age = data['age'].median()
print("Median age is {}".format(median_age))

Median age is 29.0

We can calculate the median using the Pandas median() function. In the example above, we see it is 29.

We now replace the empty values for age with the median, using the Pandas fillna() function.

data['age'] = data['age'].fillna(median_age)
data['age'].head()

0    29
1    65
2    59
3    46
4    29

If you remember, the 1st and 5th values were NaNs. They have been replaced with 29 now.
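To see what median() and fillna() do without opening the csv file, here is a minimal sketch on a tiny Series. The ages are made up for illustration and are not taken from titanic_train.csv.

```python
import numpy as np
import pandas as pd

# Made-up ages; NaN marks a missing value.
ages = pd.Series([np.nan, 65, 59, 46, np.nan], name="age")

toy_median = ages.median()          # median() skips NaN values by default
filled = ages.fillna(toy_median)    # returns a new Series; `ages` is left unchanged

print(toy_median)
print(filled.tolist())
```

Because fillna() returns a new Series here, you have to assign the result back to the column if you want the dataframe itself to change.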

Let’s now extract the 3 fields we need: Class, age and sex.

Why do we need to extract these 3 into a new Pandas dataframe object?

So as not to confuse our machine learning algorithm. If we pass in everything, we will have a lot of noise, and the algorithm will give a very poor prediction.

data_inputs = data[["pclass", "age", "sex"]]
data_inputs.head()
pclass age sex
0 3rd 29 female
1 1st 65 male
2 2nd 59 male
3 1st 46 male
4 3rd 29 male

And the expected output is:

expected_output = data[["survived"]]
expected_output.head()
survived
0 1
1 0
2 0
3 0
4 0

Now, we have a problem. The algorithms in Scikit, the library we are using, only work with numbers. That means we can’t pass in the sex as male or female, or the class as 1st or 3rd.

Let’s fix the class first, as it’s easy. We’ll simply replace 1st by 1, 2nd by 2 and 3rd by 3:

# Assign the result back instead of using inplace=True on a single column
data_inputs["pclass"] = data_inputs["pclass"].replace({"3rd": 3, "2nd": 2, "1st": 1})
data_inputs.head()
pclass age sex
0 3 29 female
1 1 65 male
2 2 59 male
3 1 46 male
4 3 29 male

There, we have fixed the class. The age is correct, just the sex is left now (don’t say the last sentence out loud, people will stare at you like you are a creep!).

We will be using the np.where() function, which is not intuitive. I forget how to use it every time, and have to Google for it.

data_inputs["sex"] = np.where(data_inputs["sex"] == "female", 0, 1)
data_inputs.head()
pclass age sex
0 3 29 0
1 1 65 1
2 2 59 1
3 1 46 1
4 3 29 1

The way the function works is, if the input sex is female, it is replaced by 0, otherwise 1. Like I said, the function is non-intuitive.
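Since I keep forgetting it myself, here is a small sketch of np.where() on its own. The list of values is made up. The pattern is np.where(condition, value_if_true, value_if_false).

```python
import numpy as np
import pandas as pd

sexes = pd.Series(["female", "male", "male", "female"])  # made-up values

# np.where(condition, value_if_true, value_if_false)
# Wherever the condition holds we get 0, everywhere else we get 1.
encoded = np.where(sexes == "female", 0, 1)

print(encoded)
```

The result is a NumPy array with one number per row, which is why it can be assigned straight back to a dataframe column.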

Test / Train Split to prevent overfitting

If you remember from the presentation, we split our data into a train set and test set. The training set is used to train the machine learning algorithm, while the test set is used to find the accuracy (since we still have the expected output for the test set, we can compare the actual output with the predicted output, and calculate our error).
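Here is a rough sketch of why we bother with a separate test set. It is not part of the Titanic notebook. The inputs and labels are pure random noise that I made up, so there is nothing real to learn, and I am assuming a current scikit-learn where train_test_split lives in sklearn.model_selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(300, 3)               # 300 rows of random "inputs"
y = rng.randint(0, 2, size=300)    # random 0/1 labels, unrelated to X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on rows the model has already seen vs. rows it has never seen
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```

The model can memorise the rows it was trained on, so the training accuracy looks far better than the test accuracy. Only the test number tells you how the model behaves on data it has not seen.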

Time to create the test / train split. While we can do it manually, it’s better to use the inbuilt function, as it will do other things like shuffle the data for us.

inputs_train, inputs_test, expected_output_train, expected_output_test = train_test_split(
    data_inputs, expected_output, test_size=0.33, random_state=42)

The function returns the training input and output, as well as the test input and output.

test_size=0.33 means 33% of the sample is used for testing and the rest for training. random_state is used to initialise the inbuilt randomiser, so we get the same split from the randomiser each time.
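To convince yourself of what test_size and random_state do, here is a small sketch on made-up data (100 numbered rows, not the Titanic file). I am assuming a current scikit-learn, where the function is imported from sklearn.model_selection.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows of inputs and 100 matching outputs
toy_inputs = pd.DataFrame({"value": np.arange(100)})
toy_outputs = pd.DataFrame({"label": np.arange(100) % 2})

# Two separate calls with the same random_state
first = train_test_split(toy_inputs, toy_outputs, test_size=0.33, random_state=42)
second = train_test_split(toy_inputs, toy_outputs, test_size=0.33, random_state=42)

in_train, in_test, out_train, out_test = first
print(len(in_train), len(in_test))

# Same random_state, same shuffle: the row labels match between the two calls.
print(in_train.index.equals(second[0].index))

# Inputs and outputs are shuffled together, so their rows still line up.
print(in_train.index.equals(out_train.index))
```

If you change random_state to another number, or leave it out, the rows that end up in each set will change between runs.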

Let’s print a few values:

print(inputs_train.head())
print(expected_output_train.head())

     pclass  age  sex
618       3   19    1
169       3   29    1
830       1   54    1
140       3   29    1
173       2   28    1
     survived
618         0
169         0
830         1
140         0
173         0

Time to start machine learning.

rf = RandomForestClassifier (n_estimators=100)

We create our Random Forest machine learning algorithm instance.

rf.fit(inputs_train, expected_output_train)

The fit() function is used to train our algorithm. It takes our input dataframe and tries to fit it to the expected output. That’s why we narrowed down the fields we pass in, so that the algorithm is not confused by noise.

If this works, the instance will now have “learnt” how to predict Titanic survivors. Let’s see how accurate our algorithm is:

accuracy = rf.score(inputs_test, expected_output_test)
print("Accuracy = {}%".format(accuracy * 100))

Accuracy = 79.60526315789474%

The score() function takes the test input, and finds out how accurate the prediction is based on the known test outputs.
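If score() feels like a black box, here is a sketch of what it boils down to for a classifier: the fraction of test rows where the prediction matches the known output. The handful of passengers below is completely made up, just to have something to train on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up rows in the form [pclass, age, sex], with made-up survival labels
train_inputs = np.array([[1, 30, 0], [3, 25, 1], [2, 40, 0],
                         [3, 50, 1], [1, 60, 1], [2, 8, 0]])
train_outputs = np.array([1, 0, 1, 0, 0, 1])
test_inputs = np.array([[1, 35, 0], [3, 22, 1], [2, 45, 1], [3, 5, 0]])
test_outputs = np.array([1, 0, 0, 1])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train_inputs, train_outputs)

# Do by hand what score() does: predict, compare, take the fraction that match
predictions = model.predict(test_inputs)
manual_accuracy = np.mean(predictions == test_outputs)
library_accuracy = model.score(test_inputs, test_outputs)

print(manual_accuracy, library_accuracy)
```

Both numbers come out the same, because score() is just a shortcut for predicting and then comparing against the expected output.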

In the example above, we get an accuracy of about 79.6%. Your number may differ slightly, since the random forest is not seeded. Is that good? We won’t know until we compare it to something (which we’ll do in the practice sessions).

There is one final thing to do. We spent all this time training our algorithm. We don’t want to repeat this process every time. This example is fairly fast, as the dataset is small, but for large datasets, it can take tens of minutes, if not hours.

To save time, we can write our machine learning model to a file, so we can reuse it in the future.

joblib.dump(rf, "titanic_model1", compress=9)

Pickle was the library originally used for this, but joblib.dump is a much simpler function, so I recommend you use it. compress = 9 compresses the output; in older versions of joblib, leaving it out would create dozens of files.

If you look in your code folder, you will see a file named titanic_model1, which contains our model.
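Here is a minimal sketch of the full round trip: save a model, load it back, and check that it still predicts the same thing. It assumes the standalone joblib package is installed (it comes with recent scikit-learn installs). The tiny dataset and the file name toy_model are made up, and the file goes into a temporary folder so nothing is left behind.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Made-up rows in the form [pclass, sex], with made-up survival labels
toy_inputs = [[1, 0], [3, 1], [2, 0], [3, 1]]
toy_outputs = [1, 0, 1, 0]

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(toy_inputs, toy_outputs)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_model")   # hypothetical file name
    joblib.dump(model, path, compress=9)       # write the trained model to disk
    restored = joblib.load(path)               # read it back as a new object

# The restored model should give the same predictions as the original
same_predictions = list(model.predict(toy_inputs)) == list(restored.predict(toy_inputs))
print(same_predictions)
```

The loaded object is a fully trained model, so there is no need to call fit() again before using predict().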

Why Programming Practice is Needed

Okay, before we go ahead, here is a video on why practice is necessary:

Titanic Practice Sessions

If you were convinced, here is the first of the practice videos. The worksheet is Titanic Practice 1.ipynb in the repo.

The first practice session is to repeat what we did in the previous example, except this time we will only extract 2 fields: Class and sex (ignoring age). Just follow the instructions in the Notebook.

The video contains hints, but the main hint is: If you get stuck, look at the previous example. Everything in the practice session builds on that.

And here is practice video 2 (Titanic Practice 2.ipynb is the file) :

In this practice session, we will load the machine learning algorithm you created and run it on a new file.

For this session, we will be working with a new file we have not touched till now, titanic_test.csv. I created this file by taking the original data and breaking off 30% of it.

Since this is new data, we can use it to measure the accuracy of our algorithm.

Extract the class and sex data from this file, as you did for the first practice session.

I have already given you the code to load your saved model (again, from the 1st practice session).

rf = joblib.load("titanic_model2")

You need to take your input dataframe and pass it to the predict function:

pred = rf.predict()  # ADD YOUR DATA VARIABLE HERE

The above example has an empty predict(). You need to do something like predict(data).

At the end, I have written a small function to find the accuracy of your algorithm vs the actual result. You don’t need to write anything, just run this code. I am getting an accuracy of 82%. Can you do better?

And there you go. That was your first machine learning example using Python.