joe, 1 month ago

In yesterday’s post, we asked the basic question of what is machine learning. I hoped to illustrate the similarities and differences between artificial intelligence and machine learning. Lately, on this site, we have been spending a bit of time using Python and I wanted to take a moment today to look at a great library for machine learning in Python.

Scikit-learn is the go-to library for machine learning in Python, with an amazing ecosystem of plugins. It is open source and supports supervised and unsupervised learning. It also provides tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. After you create and activate a virtual environment (python3 -m venv EnvironmentName, then source EnvironmentName/bin/activate), you can install it by running pip install scikit-learn. At that point, you can import it in your code as sklearn.

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-26-at-2.37.12%E2%80%AFPM.png?resize=1024%2C374&ssl=1

The way that scikit-learn works is that you start with some data, you give it to a model, the model learns from it, and then you can make predictions. The common notation splits the data into a part called X (everything you are using to make a prediction) and another part called y (the value you are interested in predicting). X could be information about a house (square feet, number of bathrooms, etc.) where y is the house price, or X could be a patient’s health statistics where y is whether or not they develop diabetes. The model then uses X to try to predict y.

sklearn.datasets

Let’s start with the sklearn.datasets module. You can use https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing to load sample data about the California housing market. The function downloads the dataset the first time you call it and caches it locally after that.

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-27-at-6.37.15%E2%80%AFPM.png?resize=1024%2C650&ssl=1

In the above code, we load the 20,640 records and 9 columns (8 feature columns plus the target) into the data variable. Then we assign the features that we are using to make a prediction to X and the value that we are interested in predicting to y. So, what are the feature (column) names for the data? If you print(data.feature_names), it will print them.
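For readers who can't see the screenshot, here is a minimal sketch of what loading the data might look like. The helper name load_housing is hypothetical and only there to keep the download in one place; it is not part of scikit-learn.

```python
from sklearn.datasets import fetch_california_housing


def load_housing():
    """Return features X, target y, and the feature names (hypothetical helper)."""
    # Downloads the dataset on the first call, then reads it from a local cache.
    data = fetch_california_housing()
    X = data.data    # shape (20640, 8): the values used to make a prediction
    y = data.target  # the value to predict: median house value, in units of $100,000
    return X, y, data.feature_names
```

Calling X, y, names = load_housing() and then print(names) should list the eight feature columns, such as MedInc and HouseAge.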

sklearn.model_selection

Once you have data, you can start working on creating a model. The model itself is nothing more than a Python object but the goal after you create it is to train it. You will want to split your data into a training set and a test set. Using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split">train_test_split</a> in sklearn.model_selection, you can split it into 70% of the data for training the model and 30% of the data for testing the model (or whatever split you want).

Let’s see what that looks like.

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-28-at-8.32.31%E2%80%AFPM.png?resize=1024%2C336&ssl=1
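As a rough sketch of a 70/30 split, the snippet below uses synthetic data from make_regression so that it runs without a download. The sample count, feature count, and random_state values are assumptions picked for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, so the sketch runs offline).
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# Hold out 30% of the rows for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```

With 100 rows, that leaves 70 rows for training and 30 rows for testing.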

sklearn.impute

A dataset is rarely pristine. There are often missing data points or data points that are set to a value like 0. Imputing is the process of replacing missing or incomplete data with substituted values. https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer in sklearn.impute lets you replace missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column.

Let’s see what that looks like.

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-29-at-1.53.33%E2%80%AFPM.png?resize=1024%2C302&ssl=1

In the above example, we are taking any X values except num_preg (the number of pregnancies) that have the value 0 and setting them to the column mean. Zero is a plausible number of pregnancies, so that column is left alone. That keeps missing values from skewing things when you go to train the model.
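Here is a small sketch of that idea on a made-up array. Only num_preg comes from the example above; the glucose and bmi columns and all of the values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical columns: num_preg, glucose, bmi. A 0 in the last two means "missing".
X = np.array([
    [0, 100.0, 30.0],
    [2,   0.0, 20.0],
    [1, 120.0,  0.0],
    [3, 140.0, 40.0],
])

# Treat 0 as the missing-value marker and replace it with the column mean.
imputer = SimpleImputer(missing_values=0, strategy="mean")

# Impute every column except the first one (num_preg), where 0 is a real value.
X_imputed = X.copy()
X_imputed[:, 1:] = imputer.fit_transform(X[:, 1:])

print(X_imputed)
```

The mean is computed from the non-missing entries only, so the missing glucose value becomes 120.0 and the missing bmi value becomes 30.0. When you have a train/test split, a common approach is to fit the imputer on the training data and then call transform on the test data.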

Creating and training a model

Like I said above, the model itself is nothing more than a Python object. You can use sklearn to both create and train it, though. Let’s see what it looks like to create a model using sklearn.neighbors (for a regression based on k-nearest neighbors) and then https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.fit to train the model.

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-29-at-3.46.17%E2%80%AFPM.png?resize=1024%2C246&ssl=1
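A minimal sketch of creating and training a k-nearest neighbors regressor might look like the following. It uses synthetic data as a stand-in, and n_neighbors=5 is written out only to make the default visible.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, so the sketch runs offline).
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Creating the model gives you a plain Python object...
model = KNeighborsRegressor(n_neighbors=5)

# ...and .fit() trains it on the training portion of the data.
model.fit(X_train, y_train)
```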

The neat thing about .fit() is that if you want to swap out the KNeighborsRegressor model with a new one, .fit() still works just the same. Let’s look at what it would look like using a linear regression model.

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-29-at-3.48.42%E2%80%AFPM.png?resize=1024%2C250&ssl=1
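To show that the same .fit() call works after swapping the model, here is a sketch with LinearRegression on made-up data that follows y = 3x + 2 exactly. The data is an assumption chosen so the learned values are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data on an exact line: y = 3x + 2.
X = np.arange(10, dtype=float).reshape(-1, 1)  # one feature, ten rows
y = 3.0 * X.ravel() + 2.0

# Only the model class changes; the .fit() call is the same as before.
model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)
```

After training, the slope in model.coef_ should come out very close to 3 and model.intercept_ very close to 2.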

That’s pretty easy.

How do you check the accuracy of the trained model?

Every scikit-learn model has a .predict() method for making predictions, and the sklearn.metrics module provides performance metrics. Let’s take a look at what those look like.

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-29-at-4.02.57%E2%80%AFPM.png?resize=1024%2C228&ssl=1

In the above code, we are predicting the value for y and then comparing it against the actual value of y. Using just the training data, it is predicting the values with a 75.23% level of accuracy.
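Here is a rough sketch of predicting and scoring a regression model. Which metric the original code used is not visible here, so r2_score and mean_absolute_error are assumptions, and the noisy linear data is made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Made-up data: a line (y = 3x + 2) plus some random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Predict on rows the model has not seen, then compare with the real values.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)              # 1.0 would be a perfect fit
mae = mean_absolute_error(y_test, y_pred)  # average absolute error, in units of y

print(f"R^2: {r2:.3f}, MAE: {mae:.3f}")
```

Scoring on the held-out test rows rather than the training rows gives a more honest picture of how the model handles new data.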

So, what is next?

In a future post, I want to step through the whole process of picking a statement to test, adjusting the data, building and training a model, testing, adjusting the model, and making predictions. Let’s save that for another day, though.

https://jws.news/2024/what-is-scikit-learn/

#MachineLearning #Python #scikitLearn