joe
In yesterday’s post, we asked the basic question of what is machine learning. I hoped to illustrate the similarities and differences between artificial intelligence and machine learning. Lately, on this site, we have been spending a bit of time using Python and I wanted to take a moment today to look at a great library for machine learning in Python.
Scikit-learn is the go-to library for machine learning with an amazing ecosystem of plugins. It is open-source and supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. After you create a virtual environment with python3 -m venv EnvironmentName and activate it with source EnvironmentName/bin/activate, you can install it by running pip install scikit-learn. At that point, you can reference it in your code as sklearn.
The way that scikit-learn works is that you start with some data, you give it to a model, the model learns from it, and then you will be able to make predictions. The common notation is splitting up the data into a part called X (everything you are using to make a prediction) and another part called Y (the prediction you are interested in making). The X could be information about a house (square feet, number of bathrooms, etc.) where Y is the house price, or X could be a patient’s health statistics where Y is whether or not they develop diabetes. The model then uses X to try to predict Y.
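Here is a rough sketch of that X and Y idea. The house numbers are completely made up, and LinearRegression is just one model picked for illustration, so treat it as a toy rather than a real pricing model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# X: one row per house, columns are [square feet, number of bathrooms].
# These numbers are invented purely for illustration.
X = np.array([
    [1000, 1],
    [1500, 2],
    [2000, 2],
    [2500, 3],
])

# y: the price of each house in X (also invented).
y = np.array([200_000, 300_000, 400_000, 500_000])

# Give the data to a model and let it learn from it.
model = LinearRegression()
model.fit(X, y)

# Ask the trained model for a prediction on a house it has not seen.
new_house = np.array([[1800, 2]])
predicted_price = model.predict(new_house)
print(predicted_price)
```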
sklearn.datasets
Let’s take a look at the sklearn.datasets module first. You can use fetch_california_housing (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing) to get test data about the California housing market directly out of the library.
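A minimal sketch of what loading that data might look like. The load_housing helper and its fetch parameter are hypothetical names I am using here so the download step can be swapped out; they are not part of scikit-learn. fetch_california_housing itself downloads the dataset the first time it is called.

```python
from sklearn.datasets import fetch_california_housing


def load_housing(fetch=fetch_california_housing):
    """Return (data, X, y) for the California housing dataset.

    `load_housing` and its `fetch` parameter are hypothetical helpers for
    this sketch, not part of scikit-learn.
    """
    # The fetcher returns a Bunch (a dict-like object with attribute access).
    data = fetch()
    # Everything we are using to make a prediction (the feature columns).
    X = data.data
    # The prediction we are interested in making (median house value).
    y = data.target
    return data, X, y


# Usage (the first call downloads the dataset and caches it locally):
# data, X, y = load_housing()
# print(X.shape, y.shape)
# print(data.feature_names)
```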
In the above code, we load the 20,640 records and 9 columns (8 feature columns plus the target) into the data variable and then we set the things that we are using to make a prediction to X and the prediction that we are interested in making to y. So, what are the feature (column) names for the data? If you print(data.feature_names), it will print them.
sklearn.model_selection
Once you have data, you can start working on creating a model. The model itself is nothing more than a Python object, but the goal after you create it is to train it. You will want to split your data into a training set and a test set. Using train_test_split (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) in sklearn.model_selection, you can split it into 70% of the data for training the model and 30% of the data for testing the model (or whatever split you want). Let’s see what that looks like.
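A minimal sketch of a 70/30 split. To keep it self-contained, it uses a small synthetic dataset from make_regression as a stand-in for the housing data, and random_state=42 is an arbitrary value I picked so the split is repeatable.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 8 feature columns.
X, y = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=42)

# Hold back 30% of the rows for testing; train on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape)
```

Changing test_size is all it takes to use a different split.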
sklearn.impute
A dataset is rarely pristine. There are often missing data points or data points that are set to a value like 0. Imputing is the process of replacing missing or incomplete data with substituted values. SimpleImputer (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) in sklearn.impute lets you replace missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column.
Let’s see what that looks like.
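A minimal sketch of that idea on a tiny made-up table. Only num_preg is a column name from this post; the other column names and all of the values are invented, and treating 0 as the "missing" marker through missing_values=0 is an assumption about how the data was recorded.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: num_preg, glucose, blood_pressure.
# Only num_preg is a name from the post; the rest are invented for this sketch.
X = np.array([
    [0, 100.0, 70.0],
    [2,   0.0, 80.0],   # glucose recorded as 0, i.e. missing
    [5, 140.0,  0.0],   # blood_pressure recorded as 0, i.e. missing
])

# Treat 0 as "missing" and replace it with the mean of the other values
# in the same column.
imputer = SimpleImputer(missing_values=0, strategy="mean")

# Impute every column except the first: zero pregnancies is a real value.
X_imputed = X.copy()
X_imputed[:, 1:] = imputer.fit_transform(X[:, 1:])

print(X_imputed)
```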
In the above example, we are taking any X values except num_preg (the number of pregnancies) that have the value 0 and setting them to the column mean. That way, missing values don’t skew things when you go to train the model.
Creating and training a model
As I said above, the model itself is nothing more than a Python object. You can use sklearn to both create and train it, though. Let’s see what it looks like to create a model using sklearn.neighbors (for a regression based on k-nearest neighbors) and then call .fit() (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.fit) to train the model.
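A minimal sketch of creating and training a k-nearest neighbors regressor. It uses synthetic data from make_regression as a stand-in so it runs without a download, and n_neighbors=5 is simply the library default written out explicitly.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data so the sketch runs without a download.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Creating the model is just creating a Python object...
model = KNeighborsRegressor(n_neighbors=5)

# ...and .fit() is what trains it on the training data.
model.fit(X_train, y_train)
```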
The neat thing about .fit() is that if you want to swap out the KNeighborsRegressor model with a new one, .fit() still works just the same, which makes swapping pretty easy. Let’s look at what it would look like using a linear regression model.
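A small sketch of that swap, again on synthetic stand-in data: two different kinds of model, both trained with the identical .fit() call.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data so the sketch runs without a download.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Two different kinds of model, trained with exactly the same call.
for model in (KNeighborsRegressor(), LinearRegression()):
    model.fit(X, y)
    print(type(model).__name__, "trained on", X.shape[0], "rows")
```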
How do you check the accuracy of the trained model?
Sklearn models have a .predict() method for making predictions with your chosen model, and sklearn.metrics is a module of performance metrics. Let’s take a look at what those look like.
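A minimal sketch of predicting and scoring, on synthetic stand-in data. Which metric to use is my assumption here: r2_score and mean_absolute_error are two regression metrics from sklearn.metrics, and the numbers this prints will not match the figures quoted in this post.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs without a download.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Predict y for the training rows, then compare against the actual y.
y_train_pred = model.predict(X_train)
train_r2 = r2_score(y_train, y_train_pred)

# The held-out test rows show how the model does on data it has not seen.
y_test_pred = model.predict(X_test)
test_r2 = r2_score(y_test, y_test_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)

print(f"train R^2: {train_r2:.4f}")
print(f"test R^2:  {test_r2:.4f}")
print(f"test MAE:  {test_mae:.4f}")
```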
In the above code, we are predicting the value for y and then comparing it against the actual value of y. Using just the training data, it is predicting the values with a 75.23% level of accuracy.
So, what is next?
In a future post, I want to step through the whole process of picking a statement to test, adjusting the data, building and training a model, testing, adjusting the model, and making predictions. Let’s save that for another day, though.