Let’s create our first machine learning model in the Regression section. Many regression models are available: Simple Linear Regression, Multiple Linear Regression, Polynomial Regression, Support Vector Regression (SVR), Decision Tree Regression, Random Forest Regression, and others.

Today we are going to create two linear regression models, simple and multiple. You will need the data preprocessing section for this tutorial, because data preprocessing is required for every ML model.

If you have not read that section yet, please refer to this link:

https://erainnovator.com/data-preprocessing-with-python

Let’s first create two new notebooks on Google Colaboratory. Name them as you like; I am going to name them “Simple Linear Regression” and “Multiple Linear Regression”. Then download the dataset from this GitHub link and upload it to Colab.

Now let’s understand what linear regression is. Linear regression models a linear relationship between a dependent variable and one or more independent variables. If we draw the graph of a simple linear regression, we get a straight line. In simple linear regression, there is only one independent variable in the equation. In multiple linear regression, there is more than one independent variable.

The equation for Simple Linear Regression is:

y = m * x + c

The equation for Multiple Linear Regression is:

y = m_{1}*x_{1} + m_{2}*x_{2} + m_{3}*x_{3} + m_{4}*x_{4} + … + m_{n}*x_{n} + c

From the equations, we can see that the two forms differ only in the number of independent variables. Machine learning libraries provide many built-in classes and functions that make it easier to create a model. In both equations, the constant c is the intercept, the point where the line crosses the y axis; x represents the independent variables; and m represents the slope, that is, how much y changes for a unit change in x. We can easily plot simple linear regression because it needs only a 2-dimensional graph. Multiple linear regression involves more dimensions than we can conveniently draw, so we will not plot it. For more information on linear regression, I recommend going through this link.

I hope this intuition for linear regression has given you a basic idea of what we are going to do.

Now, let’s understand our dataset. It has 9 columns and 215 rows. The columns include SSC percentage, HSC percentage, HSC stream, degree percentage, whether the candidate has work experience, online test percentage, employability test percentage, MBA percentage, placement status (Placed/Not placed) and, if placed, salary.
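To make the two equations concrete, here is a minimal sketch that evaluates them with NumPy. The function names and the coefficient values are hypothetical and chosen only for illustration; they do not come from the placement dataset.

```python
import numpy as np

def predict_simple(x, m, c):
    """Simple linear regression: y = m * x + c."""
    return m * x + c

def predict_multiple(x, m, c):
    """Multiple linear regression: y = m_1*x_1 + ... + m_n*x_n + c."""
    return np.dot(x, m) + c

# Hypothetical slope(s) and intercept, for illustration only.
y_simple = predict_simple(x=10.0, m=2.0, c=5.0)
y_multi = predict_multiple(
    x=np.array([1.0, 2.0, 3.0]),
    m=np.array([0.5, 1.5, 2.0]),
    c=1.0,
)

print(y_simple)  # 2.0 * 10.0 + 5.0 = 25.0
print(y_multi)   # 0.5 + 3.0 + 6.0 + 1.0 = 10.5
```

The only difference between the two functions is that the multiple version sums one slope-times-feature term per independent variable.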

- Simple Linear Regression

In simple linear regression, purely for teaching purposes, we take only one variable, employability test percentage, as the independent variable (x) and salary as the dependent variable (y). In reality, many factors affect the value of y, but for learning purposes we assume that only this one does. After creating the model, we predict the values for the test set and then plot a graph to check how well the model fits. Here is the code for simple linear regression.

# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import the dataset
data = pd.read_csv('Placement_Data.csv')
x = data.iloc[:, 5].values   # employability test percentage
y = data.iloc[:, -1].values  # salary

# Handle missing data: replace missing salaries with the mean salary
from sklearn.impute import SimpleImputer
imputer_y = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer_y.fit(y.reshape(-1, 1))
y = imputer_y.transform(y.reshape(-1, 1))

# Split the data into the training set and the test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Create the simple linear regression model and train it on the training set
from sklearn.linear_model import LinearRegression
simple_regressor = LinearRegression()
simple_regressor.fit(x_train.reshape(-1, 1), y_train)

# Predict the result on the test set
y_pred = simple_regressor.predict(x_test.reshape(-1, 1))
print(y_pred)

# Visualize the training set result
plt.scatter(x_train, y_train, color='red')
plt.plot(x_train.reshape(-1, 1), simple_regressor.predict(x_train.reshape(-1, 1)), color='green')
plt.title('Salary vs Employability Test Percentage (Training Set)')
plt.xlabel('Employability Test Percentage')
plt.ylabel('Salary')
plt.show()

# Visualize the test set result
plt.scatter(x_test, y_test, color='red')
plt.plot(x_test.reshape(-1, 1), simple_regressor.predict(x_test.reshape(-1, 1)), color='green')
plt.title('Salary vs Employability Test Percentage (Test Set)')
plt.xlabel('Employability Test Percentage')
plt.ylabel('Salary')
plt.show()

# Evaluate the performance of the model
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))

Let’s understand the code. First, as the data preprocessing phase, we import the libraries and the dataset and assign the variables x and y. Then we replace the missing values of y with the average of all its values. Then we split the data into a training set and a test set in the ratio of 80% training to 20% test. You might have noticed that we haven’t applied every data preprocessing step here. The reason is that not every dataset requires every step; it depends on the dataset and on what we need.
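The two preprocessing steps described above can be tried on a small scale. The sketch below uses made-up toy arrays (not Placement_Data.csv), and the names x_demo and y_demo are hypothetical; it only shows what mean imputation and an 80/20 split do.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data for illustration only: ten feature values, ten targets, two missing.
x_demo = np.arange(10, dtype=float)
y_demo = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0, 9.0, 10.0])

# Replace each missing target with the mean of the known targets (45 / 8 = 5.625).
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
y_filled = imputer.fit_transform(y_demo.reshape(-1, 1))
print(y_filled.ravel())

# Hold out 20% of the rows for testing: 8 rows for training, 2 for testing.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_filled, test_size=0.2, random_state=42
)
print(len(x_tr), len(x_te))
```

Fixing random_state makes the split reproducible, so rerunning the notebook gives the same training and test rows.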

And now the interesting part comes. We are going to create an ML model. First, we import the LinearRegression class from the sklearn.linear_model module and create an object of that class without passing any extra arguments. Then we fit the model on x_train and y_train to train it on the training set. We use the reshape method several times because scikit-learn expects the features as a 2-dimensional array with one row per sample and one column per feature. Our x holds a single feature in a 1-dimensional array, so reshape(-1, 1) turns it into a 2-dimensional array with one column. So don’t be confused if you run into a dimension error; it usually means a reshape is missing.
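The effect of reshape is easier to see on a tiny example. This is a sketch on made-up numbers (x_1d and y_1d are hypothetical names), showing the shape before and after reshape(-1, 1) and that scikit-learn rejects the 1-dimensional version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data for illustration only: four samples of one feature.
x_1d = np.array([1.0, 2.0, 3.0, 4.0])
y_1d = np.array([3.0, 5.0, 7.0, 9.0])

# reshape(-1, 1) keeps the number of rows and makes a single column.
x_2d = x_1d.reshape(-1, 1)
print(x_1d.shape)  # (4,)
print(x_2d.shape)  # (4, 1)

# Fitting on the 1-D array raises a ValueError about the expected 2-D input.
try:
    LinearRegression().fit(x_1d, y_1d)
    fit_1d_failed = False
except ValueError as err:
    fit_1d_failed = True
    print("1-D input rejected:", err)

# Fitting on the reshaped 2-D array works.
model = LinearRegression().fit(x_2d, y_1d)
```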

Now that we have created our first ML model, it’s time to predict values. We call the predict method on the LinearRegression object and print the result.
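To connect the fitted model back to the equation y = m * x + c, here is a small sketch on made-up data that follows y = 2x + 1 exactly. The variable names are hypothetical; coef_ and intercept_ are the attributes where scikit-learn stores the learned slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data for illustration only, generated from y = 2 * x + 1.
x_demo = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y_demo = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(x_demo, y_demo)

slope = model.coef_[0]        # the learned m, approximately 2
intercept = model.intercept_  # the learned c, approximately 1
print(slope, intercept)

# Predict for a new value x = 5; the input must again be 2-dimensional.
pred = model.predict(np.array([[5.0]]))
print(pred)  # approximately [11.]
```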

Then the fun part comes. We are going to visualize the result on a graph. First, we visualize how well the model fits the training set, and then how well it fits the test set. For the visualization we use matplotlib.pyplot, aliased as plt. We use the scatter method to draw the original data and the plot method to draw the regression line, and then compare how well they match. So we pass x_train and y_train to scatter to draw the original data. Then we pass x_train and the predictions for x_train to plot to draw the predicted line. To tell the two apart, we give them different colors. Then we use the title method to give the graph a title, and xlabel and ylabel to name the x and y axes. Finally, we use the show method to display the graph. The test set graph works the same way: we pass x_test and y_test to scatter, and x_test and the predictions for x_test to plot, and color the graph in the same manner.

In the graph we can see that the actual values vary a lot around the predicted line. That does not mean the code is wrong; it means a single feature explains only part of the variation in salary. To measure how well the model fits, we use R-squared. The closer R-squared is to 1, the better the model explains the data. So we import the r2_score function from sklearn.metrics, pass it the two arguments y_test and y_pred, and print the result. Our score is not very good, but we will try to improve it. To learn more about the R-squared value, refer to this link.
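For intuition about what r2_score returns, the sketch below computes R-squared by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with scikit-learn. The arrays are made-up values, not results from the placement model.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: how far the predictions are from the actual values.
ss_res = np.sum((y_true - y_hat) ** 2)          # 0.75
# Total sum of squares: how far the actual values are from their mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20.0

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual)                # 0.9625
print(r2_score(y_true, y_hat))  # same value from scikit-learn
```

A model that always predicts the mean of y gets an R-squared of 0, and a model worse than that gets a negative value.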

- Multiple Linear Regression

In multiple linear regression, we need all the independent variables and the salary. We do not include the placement status column, because predicting it is a classification problem, which we will cover in the next tutorial. For now, we assign every column except status and salary to x, and salary to y. We predict the result for the test set and then print the predictions next to the actual test values to compare them ourselves, since we cannot draw a graph with more than 2 dimensions. Here is the code for multiple linear regression.

# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import the dataset
data = pd.read_csv('Placement_Data.csv')
x = data.iloc[:, :-2].values  # every column except status and salary
y = data.iloc[:, -1].values   # salary

# Handle missing data: replace missing salaries with the mean salary
from sklearn.impute import SimpleImputer
imputer_y = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer_y.fit(y.reshape(-1, 1))
y = imputer_y.transform(y.reshape(-1, 1))

# Encode categorical data that has more than two categories
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [2])], remainder='passthrough')
x = np.array(ct.fit_transform(x))

# Encode categorical data that has only two categories
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
x[:, 6] = le.fit_transform(x[:, 6])

# Split the data into the training set and the test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Create the multiple linear regression model and train it on the training set
from sklearn.linear_model import LinearRegression
multi_regressor = LinearRegression()
multi_regressor.fit(x_train, y_train)

# Predict the result on the test set and print predictions next to actual values
y_pred = multi_regressor.predict(x_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

# Evaluate the performance of the model
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))

In multiple linear regression we have used more data preprocessing steps, though still only the ones we need. We imported the libraries and the dataset, replaced the missing values with the average of all values, encoded the categorical data, and then split the data into a training set and a test set.

Now we are going to create the ML model. The steps are the same as for simple linear regression. We import the LinearRegression class from sklearn.linear_model, create an object of that class, and fit the model on the training set.
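The encoding step is the one that differs most from the simple model, so here is a sketch of it on a made-up array with one numeric column, one multi-category column and one Yes/No column. The toy values and column positions are assumptions for illustration and do not match the layout of Placement_Data.csv.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Toy data for illustration only: percentage, stream, work experience.
x_demo = np.array([
    [60.0, 'Science', 'Yes'],
    [70.0, 'Commerce', 'No'],
    [80.0, 'Arts', 'No'],
    [65.0, 'Science', 'Yes'],
], dtype=object)

# One-hot encode column 1 (three categories). The encoded columns come first,
# in alphabetical category order (Arts, Commerce, Science), followed by the
# untouched remaining columns.
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [1])],
    remainder='passthrough',
)
x_encoded = np.array(ct.fit_transform(x_demo))
print(x_encoded.shape)  # (4, 5): 3 one-hot columns + 2 passthrough columns

# Label-encode the Yes/No column, which moved to index 4: No -> 0, Yes -> 1.
le = LabelEncoder()
x_encoded[:, 4] = le.fit_transform(x_encoded[:, 4])
print(x_encoded)
```

Note how the column index of the Yes/No feature changes after one-hot encoding, which is why the tutorial code label-encodes index 6 rather than the column's original position.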

Now that we have created our multiple linear regression model, it’s time to predict values and see the result. As in simple linear regression, we use the predict method and store the result in the y_pred variable. We cannot visualize it because it has more than 2 dimensions, but we can print the values and compare them directly. We use NumPy’s set_printoptions function with precision set to 2, which limits the printed values to 2 decimal places. Then we reshape y_pred and y_test into column vectors and concatenate (join) them along the second axis, so that each row shows a prediction next to the actual value. Finally, we print the result. If you find this hard to follow, you can simply print y_test and y_pred separately and compare them. It’s up to you.

Now we evaluate the performance of the model using R-squared. It is somewhat closer to 1, but we can see that the model still needs improvement. We can also try other regression models for better performance.

Here is the link to the notebook files that we created in this tutorial:
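The side-by-side printing can be tried on its own. This sketch uses made-up prediction and actual values (the names y_pred_demo and y_test_demo are hypothetical) to show what reshape and concatenate produce.

```python
import numpy as np

# Limit printed floats to 2 decimal places.
np.set_printoptions(precision=2)

# Made-up predicted and actual salaries, for illustration only.
y_pred_demo = np.array([250000.123, 310000.456, 275000.789])
y_test_demo = np.array([240000.0, 300000.0, 280000.0])

# Turn each 1-D array into a column, then join the columns along axis 1.
side_by_side = np.concatenate(
    (y_pred_demo.reshape(-1, 1), y_test_demo.reshape(-1, 1)),
    axis=1,
)
print(side_by_side.shape)  # (3, 2): one row per sample, prediction then actual
print(side_by_side)
```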