Today we take our first step in Machine Learning: Data Preprocessing. We are going to learn how to load and prepare data before giving it to a Machine Learning model. Not every step below is required for every model; which ones you need depends on your data. For example, if all of your data is already numerical, you don’t need to apply encoding, and if all of your features are already in the same range, you don’t need feature scaling.
Here are the requirements:
- Basic Python knowledge
- Google Colab
- and the passion to learn something new
We are going to use Google Colaboratory (Colab), which is built on the Jupyter notebook. It comes with many libraries preinstalled, so we don’t have to install anything. If you don’t know what a Jupyter notebook is, it is a web-based development environment that is widely used in data science and Machine Learning. You can learn more about it here: https://jupyter.org
So let’s first prepare our Colab notebook. Go to https://colab.research.google.com and log in using your Google account. Now create a new notebook by going to File Menu -> New Notebook.
Now rename the notebook by clicking on the title and changing it to “Data Preprocessing”, or to any name you like. The notebook is now ready to use.
If you want to add a code cell or a text cell, click the “+Code” or “+Text” button. To run a code cell, just press “Ctrl+Enter”. I recommend exploring all the features of Google Colab so that you get an idea of what is actually happening.
Let’s understand our dataset. I got the dataset from the Kaggle open datasets, and we are going to work on it in this tutorial.
Here is the link to our dataset:
Download it, because we will need to upload it to Colab.
You can see that we have 9 columns and 215 rows. The data is about the placement of university students, and the full task is to predict the salary and whether the student was placed or not. For this preprocessing tutorial, we are going to predict only the salary so that the core concepts stay clear. Here is the screenshot of the dataset:
Here we can see that it contains the following columns in the given order: SSC percentage, HSC percentage, HSC stream, degree percentage, whether the student has work experience, online test percentage, employability test percentage, MBA percentage, status of placement (Placed/Not placed) and, if placed, the salary.
So now let’s dive into the step-by-step tutorial. Go to your notebook and, for each step below, write the given code in a code cell.
1. Import the libraries
Here we are going to import the required libraries. Over the tutorial we mainly use pandas, NumPy, matplotlib, scipy, and scikit-learn, but at the start of any notebook I recommend importing these three libraries. Here is the code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Here the import keyword imports a library, and the as keyword gives it a short alias so that we don’t have to type the full library name every time we use it.
2. Import the dataset
Now that we have imported the libraries, it’s time to import the dataset. Click on the file icon in the left-hand side menu, wait for the kernel to connect, and then upload the dataset file.
Here is the code:
data = pd.read_csv('Placement_Data.csv')
x = data.iloc[:, :-2].values
y = data.iloc[:, -1].values
Here, we are creating a DataFrame called data from our dataset Placement_Data.csv. A DataFrame is a data structure that you can imagine as a table with rows and columns. We are calling the read_csv() function from pandas (aliased as pd) to read the data from the CSV file. By default, read_csv() treats the first row of the file as the column names rather than as data. CSV stands for “comma-separated values”. If you open a CSV file in a text editor, you will find that it is just a text file with a comma between the values. Next, we define the independent variable x and the dependent variable y. The independent variables are also called features: they are the variables on which we base our prediction. For example, to predict a house price we might use the number of bedrooms, the number of bathrooms, the age of the house, the size of the house, and so on as features. On the basis of the features we predict the price, which is the dependent variable.
Here we are using the iloc indexer of our DataFrame, which selects data by integer position. To predict the salary we choose all the rows, because we need all the data. For x we take all columns except the last two, because the last two columns (placement status and salary) are the prediction results. For y we take only the last column, the salary, which is what we want to predict. iloc takes the row selection first and the column selection second. In Python, ‘:’ denotes a range (a slice), so ‘:’ alone selects everything, ‘:-2’ selects everything except the last two, and -1 selects the last one. The values attribute at the end returns the selected data as a NumPy array, without the row and column labels.
I recommend using the print function to print the variables so that you get a basic idea of what you are doing.
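To see what these slices do without the real file, here is a minimal sketch on a made-up miniature table. The rows and the shortened column list are hypothetical and only stand in for Placement_Data.csv; io.StringIO is used so that the example needs no upload.

```python
import io

import pandas as pd

# Hypothetical miniature dataset (NOT the real Placement_Data.csv),
# used only to show how the column slices behave.
csv_text = """ssc_p,hsc_s,workex,status,salary
67.0,Commerce,No,Placed,270000
79.3,Science,Yes,Placed,200000
65.0,Arts,No,Not Placed,
"""

# The first line of the text is read as the header, not as data.
toy = pd.read_csv(io.StringIO(csv_text))

toy_x = toy.iloc[:, :-2].values  # all rows, every column except the last two
toy_y = toy.iloc[:, -1].values   # all rows, last column only

print(toy_x.shape)  # (3, 3)
print(toy_y.shape)  # (3,)
print(toy_y)        # the empty salary field shows up as nan
```

Note that toy_x is 2-D (rows and columns) while toy_y is 1-D, which is why a reshape is needed later when a 2-D input is expected.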
3. Handle missing data
If you look at the y variable, you will find that it contains some NaN values, which means some values are missing, so we have to handle them. A common rule of thumb: if numerical data is missing, take the average of that column and replace all missing values with it, which helps reduce the error in our model; if categorical data is missing, replace the missing values with the most frequent value in the column.
Here is the code:
from sklearn.impute import SimpleImputer
imputer_y = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer_y.fit(y.reshape(-1, 1))
y = imputer_y.transform(y.reshape(-1, 1))
Here we start using the scikit-learn library, which we will use the most. We import the SimpleImputer class from the sklearn.impute module. We will use this class whenever we need to handle missing values. Next, we create an object of the SimpleImputer class called imputer_y. The class takes several arguments, but not all are necessary, so here we pass two. The first identifies the missing values: we tell the class that missing values are NaN by passing np.nan from NumPy. The second is the strategy used to compute the replacement value; we use the 'mean' strategy to calculate the average. We then have to fit the imputer to the data so that it calculates the average. We have only one column, so we pass all rows and the single column of y to the fit() method. The reshape function changes y to 2-D, because y is a 1-D array and the imputer expects a 2-D array with one column. Finally, to replace the missing values we use the transform method and assign the result back to y.
Note: If you are confused about whether you are importing a class or a function, remember the naming convention: class names start with a capital letter (SimpleImputer) and function names start with a lowercase letter (read_csv).
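Here is a small sketch of both replacement rules mentioned above: the mean for numerical data and the most frequent value for categorical data. The arrays are made-up values for illustration, not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical values, chosen only for illustration.
salaries = np.array([200000.0, np.nan, 300000.0, np.nan])
streams = np.array(["Science", "Commerce", np.nan, "Commerce"], dtype=object)

# Numerical column: missing entries become the mean of the known values.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled_salaries = mean_imputer.fit_transform(salaries.reshape(-1, 1))

# Categorical column: missing entries become the most frequent value.
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled_streams = mode_imputer.fit_transform(streams.reshape(-1, 1))

print(filled_salaries.ravel())  # missing salaries replaced by 250000.0
print(filled_streams.ravel())   # missing stream replaced by 'Commerce'
```

fit_transform() does the fit and the transform in one call, which is convenient when you fit and transform the same data.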
4. Encoding categorical data that doesn’t have a binary result
We can tell the categories apart when we look at the data, but the machine can’t, so we generally use OneHotEncoder to convert a categorical column into one new column per category. In each row, the column of the category that applies holds 1 and the others hold 0. If a column has only a yes/no or true/false type of category, we use label encoding instead, which is covered in the next section.
Here is the code:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [2])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
First we import the ColumnTransformer class from sklearn.compose to transform the column, and then the OneHotEncoder class from sklearn.preprocessing to get the encoder. We create an object of the ColumnTransformer class and specify the required arguments, which are transformers and remainder. transformers takes a list of tuples, each giving a name for the step ('encoder'), the transformer to use (OneHotEncoder()), and the indexes of the columns to transform. Here we want to transform the column ‘hsc_s’, so we specify the index as 2. The remainder value 'passthrough' leaves the other columns as they are instead of dropping them. At the end we fit and transform x with the combined built-in method fit_transform(), and we convert the result back to an array using np.array(). In the result, the category column is removed and three new columns, one per category, are added at the start, so the number of columns will be 11.
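To make the column shuffle visible, here is a minimal sketch on three made-up rows. The data and the column index 1 are assumptions for this toy array only; in the tutorial’s x the stream column is at index 2.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: [percentage, stream, work experience]
toy_x = np.array(
    [
        [67.0, "Commerce", "No"],
        [79.3, "Science", "Yes"],
        [65.0, "Arts", "No"],
    ],
    dtype=object,
)

# One-hot encode column 1 (the stream) and pass the other columns through.
toy_ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],
    remainder="passthrough",
)
toy_encoded = np.array(toy_ct.fit_transform(toy_x))

print(toy_encoded.shape)  # (3, 5)
print(toy_encoded)
```

The three new columns come first, in alphabetical order of the categories (Arts, Commerce, Science), and the untouched columns follow. This is why the index of every remaining column changes after encoding.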
5. Encoding categorical data that has a binary result
Now we are going to encode the column that has a binary result. A binary result means there are only two possible values, such as true-false, yes-no, 0-1, etc. If a column already has 0-1 values, we don’t need the steps below, but any other binary column has to be converted to 0-1 using the label encoder. LabelEncoder sorts the distinct values and numbers them from 0, so ‘No’ becomes 0 and ‘Yes’ becomes 1, and the computer can work with the data.
Here is the code:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
x[:, 6] = le.fit_transform(x[:, 6])
Here we import the LabelEncoder class from sklearn.preprocessing to encode the binary result as 0-1. We then create an object of the LabelEncoder class, and the good news is that it doesn’t require any arguments. Then we fit and transform the data using the fit_transform method and assign the result back to the same column, so everything happens in a few lines of code. We encode only the ‘workex’ column, as it has yes/no values. We use column index 6 because the one-hot encoding above removed the category column and added three new columns, one per category, at the start, which shifts the remaining columns. You can check this by printing x.
You can also check whether the values of the column have changed by printing x. We now have only numerical data in x.
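A quick sketch of what the label encoder does to a yes/no column, using a made-up list rather than the real ‘workex’ values:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical yes/no column, for illustration only.
toy_workex = ["No", "Yes", "Yes", "No"]

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(toy_workex)

# classes_ holds the distinct values in sorted order; the position of a
# value in this list is the number it is encoded as.
print(toy_le.classes_.tolist())  # ['No', 'Yes']
print(encoded.tolist())          # [0, 1, 1, 0]
```

Because the numbering follows sorted order, it is worth printing classes_ on your own data to confirm which value became 0 and which became 1.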
6. Feature Scaling
We now have only numerical data in our dataset and no missing data, but the values are still not in the same range, and a model may not fit properly when some features have much larger values than others. So we use feature scaling to bring all the features into the same range, so that each of them can contribute to our future predictions.
Here is the code:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)
Here we first import the StandardScaler class from sklearn.preprocessing, which will help us scale the dataset. Then we create an object of the class; it also doesn’t require any arguments. Finally, we fit and transform our data using the fit_transform method and assign the result to x to complete the process.
Now you can see the scaled data by printing x.
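StandardScaler standardizes each column separately: it subtracts the column’s mean and divides by the column’s standard deviation. The sketch below checks this on a made-up array with two columns in very different ranges (the numbers are hypothetical).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: a percentage and a salary, in very different ranges.
toy = np.array(
    [
        [50.0, 200000.0],
        [60.0, 250000.0],
        [70.0, 300000.0],
    ]
)

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# The same result computed by hand: (value - column mean) / column std.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))  # True
```

After scaling, every column has a mean of about 0 and a standard deviation of about 1, so the salary column no longer dominates just because its raw numbers are larger.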
7. Splitting the data into the Training set and Test set
After successfully processing the data, we still have one last step, which is very important in machine learning: we have to split the data into a training set and a test set. You can think of the training set as studying at school and the test set as the exam. We first train our model on the training set and then we test the model on the test set, which the model has not seen before. This is very helpful in judging the model’s performance.
Here is the code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
Here we import the train_test_split function from sklearn.model_selection, which splits the data into a training set and a test set. The function returns the training set of x, the test set of x, the training set of y, and the test set of y, in that order, so we assign its result to four variables. We pass four arguments to the function: x, y, test_size, and random_state. test_size tells the function how much of the data we want for testing; here we want 20% of the data. random_state is the seed for the random shuffling, so using the same value gives the same split every time you run the code. Any fixed integer works; 0 and 42 are common choices.
We can check the data by printing its values.
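As a quick check of the split sizes, here is a minimal sketch on ten made-up rows. The arrays are hypothetical and built so that each y value can be matched to its x row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features, and one target value per row.
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42
)

print(x_tr.shape, x_te.shape)  # (8, 2) (2, 2)
print(y_tr.shape, y_te.shape)  # (8,) (2,)

# Running the split again with the same random_state gives the same rows.
_, x_te_again, _, _ = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42
)
print(np.array_equal(x_te, x_te_again))  # True
```

The rows of x and y are shuffled together, so each target value stays with its own features in both sets.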
Here is the link to the notebook file:
I hope you liked this tutorial. If you have any questions, please write them in the comment section.