Supervised Learning in Machine Learning

10 min read · Nov 2, 2020

Machine learning, as many of you know, is one of the most popular knowledge domains today. The reason is its usefulness in real-world scenarios: it helps enterprises deal with data effectively and increase both productivity and profit. Learning has three broad techniques: supervised, unsupervised and reinforcement learning. In this article we will discuss all three briefly and then deal mainly with supervised methods of learning.

Your ML model is built by an algorithm that is most commonly written in Python, a language popular for this work because of its simplicity.

Overview of Supervised, Unsupervised and Reinforcement Learning

In layman's terms, supervised learning is about gaining insights (learning, that is, the training process) from data where both the inputs and the known outputs are provided to the model. The trained model then makes predictions on unseen data or samples.

This is unlike unsupervised techniques, where the data you provide to the model has no known outputs and the model learns the structure of the data on its own instead of predicting a labelled value. Clustering data into categories based on similarity and dimensionality reduction both fall under unsupervised methods, as do some kinds of neural networks. Unsupervised learning brings order to data. Grouping the customers of a supermarket based on the items they purchase is an example of unsupervised learning.

Take a simple example. A person joining a new company states the salary they earned for a position at their old company, and the employer wants to check whether this is true. The employer can use a salary prediction model trained on data of previous positions and their corresponding salaries, predict a salary for the stated position, and see whether it matches the employee's claim. If it does, the claim is plausible. This is a kind of supervised learning. Predicting a numerical value (here, salary) is a regression task, which we will come to later.

Reinforcement learning is something different and really interesting. Here an agent in an environment takes actions in successive states so that it collects the maximum total reward by the end. Say you are playing a video game like Super Mario. Mario is the agent. If Mario touches a coin, he gets a reward; when he hits an enemy, he dies (or gets a negative reward). The display, containing the agent, the reward coins and the enemies, constitutes the environment. Mario can take actions (left, right, up, down), and each action moves him to a different condition, which is called a state. The run until Mario finishes a stage is called an episode.

Mapping this example to an RL model:

Agent - Mario

Set of actions - left, right, up, down

Set of states - the position reached after taking any of the above actions

Reward - coins (and a negative reward for hitting an enemy)

Environment - contains the rewards, the agent and the states

There are many other concepts in RL (Reinforcement Learning), such as policies, value functions and Q-learning, which are used to compute a solution to its objective. We will discuss them later.

Supervised Machine Learning

Supervised learning is a method of learning from labelled data in order to predict a class or a value. Here we teach the machine by providing labelled data so that it can figure out the relationship between the input and the output. We typically split the data into a training set and a test set. The training set is used to teach, or train, the machine. The test set acts as unseen data and is used to measure the accuracy of the trained model. The data has a set of independent variables and a dependent variable: the independent variables are the features that determine the value of the dependent variable (our output).
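The split into training and test sets can be sketched in plain Python as follows. This is only a minimal illustration: the function name split_train_test, the 80/20 ratio, the fixed seed and the salary figures are all assumptions made for the example, not values from this article.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and return (train_rows, test_rows)."""
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labelled data: (years_of_experience, salary) pairs.
data = [(1, 30000), (2, 35000), (3, 41000), (4, 44000), (5, 50000),
        (6, 56000), (7, 60000), (8, 66000), (9, 70000), (10, 75000)]

train_set, test_set = split_train_test(data, test_fraction=0.2)
print(len(train_set), "training rows,", len(test_set), "test rows")
```

The model is fitted only on train_set; test_set is held back so that accuracy is measured on rows the model has never seen.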

Supervised learning algorithms are primarily of two types: regression and classification.

Regression

Regression algorithms are supervised learning models trained to predict real-valued outputs such as temperature or stock price. Here we figure out the relationship between the inputs and a continuous numerical output, for example predicting a person's salary from features such as work experience and age.

The most commonly used regression algorithms are:

Simple Linear Regression
Multiple Linear Regression
Polynomial Regression
Decision Tree Regression
Random Forest Regression

Simple Linear Regression

Simple linear regression finds the best linear relation between one independent variable and the dependent variable. Graphically, its aim is to find the best-fit line, the one that predicts the output most accurately from a single feature. It is suitable for relatively small datasets with low complexity.

The equation connecting input and output in linear regression is

y = m*x + c

m is the slope of the line and c is the y-intercept

Graphically, it is a straight line with the input feature on the X-axis and the dependent variable on the Y-axis. Using this line we can find the y value, that is, the output corresponding to a given input value.
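As a small sketch of how m and c in y = m*x + c can be found, the snippet below uses the ordinary least-squares formulas in plain Python. The function names fit_line and predict and the sample points are assumptions chosen for illustration; in practice a library such as sklearn would normally be used.

```python
def fit_line(xs, ys):
    """Return (m, c) of the least-squares best-fit line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    # Intercept: the best-fit line passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


def predict(m, c, x):
    """Read the output for input x off the fitted line."""
    return m * x + c


# Illustrative points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, c = fit_line(xs, ys)
print("slope:", m, "intercept:", c)
print("prediction for x = 6:", predict(m, c, 6))
```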

Multiple Linear Regression

For the prediction of a continuous numerical value with several input features, we can use multiple linear regression.

y = b0 + b1*x1 + b2*x2 + … + bk-1*xk-1 + bk*xk

Here x1 to xk are the input features, b1 to bk are their coefficients and b0 is the intercept.

Predicting the output with every available feature can lead to an inefficient model, so feature selection is an important step in this type of regression. There are several methods for finding the most significant features. One of them is backward elimination: the stepwise removal of the statistically least significant feature, one at a time, based on its p-value. The p-value is the probability of observing a relationship at least as strong as the one in the data if the null hypothesis (that there is no relationship between the feature and the output) were true.

Steps in backward elimination:

Step 1: Select the significance level (here we use 0.05)
Step 2: Fit the model with all possible predictors
Step 3: Consider the predictor with the highest p-value. If its p-value is greater than the significance level, go to step 4; otherwise finish the process
Step 4: Eliminate that predictor
Step 5: Fit the model without that predictor and return to step 3

After eliminating all the unwanted features from the dataset, we can create an efficient model.
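The control flow of backward elimination can be sketched as a short loop. This sketch does not fit a real regression: fit_p_values is a hypothetical caller-supplied function that is assumed to refit the model on the given features and return their p-values, and the p-values in the demo table are invented purely to exercise the loop.

```python
def backward_elimination(features, fit_p_values, significance_level=0.05):
    """Remove the least significant feature until all p-values pass.

    fit_p_values(selected) is assumed to fit the model on the selected
    features and return a dict {feature_name: p_value}.
    """
    selected = list(features)
    while selected:
        p_values = fit_p_values(selected)                  # fit (or refit) the model
        worst = max(selected, key=lambda f: p_values[f])   # predictor with highest p-value
        if p_values[worst] <= significance_level:
            break                                          # all remaining predictors are significant
        selected.remove(worst)                             # eliminate it and loop to refit
    return selected


# Invented p-values, for illustration only. A real fit would recompute
# them every time a feature is removed.
ILLUSTRATIVE_P_VALUES = {
    "experience": 0.001,
    "age": 0.03,
    "shoe_size": 0.62,
    "commute_km": 0.21,
}


def fake_fit(selected):
    """Stand-in for a real model fit: looks the p-values up in the table."""
    return {name: ILLUSTRATIVE_P_VALUES[name] for name in selected}


kept = backward_elimination(list(ILLUSTRATIVE_P_VALUES), fake_fit)
print("features kept:", kept)
```

With a real dataset, fake_fit would be replaced by a function that fits the regression with a statistics library and reads the p-values from the fitted model.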

Polynomial Regression

It is a regression method in which the output is modelled as an nth-degree polynomial of the input x, which creates a nonlinear relation between the input and output variables. In some cases a straight line cannot be the best-fit line and only a curve will predict the values well. Polynomial regression can be used in such cases. The equation for polynomial regression is as follows:

y = b0 + b1*x + b2*x² + … + bm*xᵐ

It is also called polynomial linear regression, because linearity is considered with respect to the coefficients b0 to bm, not with respect to x.
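The point about linearity in the coefficients can be made concrete with a small sketch: once x is expanded into the powers 1, x, x², and so on, the prediction is just a weighted sum of those powers. The function names and the coefficient values below are assumptions for illustration, not fitted values.

```python
def polynomial_features(x, degree):
    """Expand a single input x into [1, x, x**2, ..., x**degree]."""
    return [x ** power for power in range(degree + 1)]


def predict(coefficients, x):
    """Evaluate y = b0 + b1*x + ... + bm*x**m for coefficients [b0, ..., bm]."""
    features = polynomial_features(x, len(coefficients) - 1)
    # The model is a plain weighted sum of the features, which is why
    # it still counts as a linear model.
    return sum(b * f for b, f in zip(coefficients, features))


coefficients = [1.0, 2.0, 3.0]   # hypothetical b0, b1, b2
print(polynomial_features(3, 3))
print(predict(coefficients, 2.0))
```

Because the expanded features can be treated like separate input columns, the same fitting procedure as multiple linear regression applies.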

Classification

Classification is a supervised learning technique in which data is assigned to predefined classes. Classification algorithms work on the principle of pattern recognition, and the target is to classify the data accurately. Classification models include linear ones, such as Logistic Regression and linear SVM, and nonlinear ones, such as K-NN, Kernel SVM, Decision Tree and Random Forest classification.

Categorizing emails into “spam” or “ham”, handwriting recognition, speech recognition and biometric identification are all applications of classification.

Logistic Regression

This is a binary classification algorithm, which means the output belongs to one of two classes (yes or no, cat or dog, etc.). Although its name contains the word regression, it is in fact a classification algorithm. It is named logistic because it uses the logistic function (the sigmoid function, which takes any real value and returns a value between 0 and 1). The input is one or more independent variables and the output is either 0 or 1: if the value of the sigmoid function is greater than 0.5 the prediction is 1, and if it is less than 0.5 the prediction is 0.

Sigmoid function: y = 1/(1 + e^(-x))
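A minimal sketch of the sigmoid function and the 0.5 threshold in plain Python is shown below. The weights and bias are hypothetical numbers rather than a trained model, and treating a value of exactly 0.5 as class 0 is an assumption made here, since the rule above does not cover that case.

```python
import math


def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use an equivalent form that avoids overflow in exp().
    e = math.exp(x)
    return e / (1.0 + e)


def predict_class(weights, bias, features, threshold=0.5):
    """Return 1 if the sigmoid of the weighted sum exceeds the threshold, else 0."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if sigmoid(z) > threshold else 0


# Hypothetical parameters for a model with two input features.
weights = [0.8, -0.4]
bias = -1.0

print(sigmoid(0))
print(predict_class(weights, bias, [3.0, 1.0]))   # weighted sum is 1.0
print(predict_class(weights, bias, [0.5, 2.0]))   # weighted sum is -1.4
```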

Support Vector Machines(SVM)

SVMs bring in the concepts of the hyperplane, Euclidean distance and maximum margin. If the data is linearly separable, it is classified simply by a line chosen to have the maximum margin from the nearest points of each class. This method is Linear SVM. The algorithm mainly comes into its own when the data is not linearly separable and we have to project the data points to a higher dimension. In the higher dimension the points take a different arrangement and become linearly separable: for example, project 2D data to 3D, separate it with a hyperplane, then project back to 2D. This is called Kernel SVM.

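Linear SVM is a parametric model: its complexity is fixed by the number of features and does not grow with the training set. Kernel SVM is non-parametric, and its complexity grows as the training size increases.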

Note, however, that Kernel SVM is computationally more demanding, because training and prediction involve the higher-dimensional space. The Gaussian kernel is commonly used.
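As a rough illustration of the Gaussian kernel, the function below computes the similarity K(a, b) = exp(-gamma * ||a - b||²) between two points. The formula and the parameter name gamma follow the usual convention for this kernel and are not defined in this article; the value 0.5 is an arbitrary choice for the example.

```python
import math


def gaussian_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel: 1.0 for identical points, falling towards 0 with distance."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


origin = [0.0, 0.0]
print(gaussian_kernel(origin, origin))        # identical points
print(gaussian_kernel(origin, [1.0, 0.0]))    # a nearby point
print(gaussian_kernel(origin, [3.0, 4.0]))    # a distant point
```

The kernel value acts as a similarity score in the higher-dimensional space, so nearby points score close to 1 and distant points score close to 0.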

Naive Bayes Algorithm

It is a classification algorithm based on Bayes' theorem. First of all, we have to understand Bayes' theorem.

Bayes' theorem calculates the probability of an outcome given the observed inputs, using the prior probability of that outcome and the likelihood of the inputs. Applied directly, each input variable would have to be treated as dependent on all the other variables, which is the main cause of complexity. This is resolved by switching from a dependent model to an independent one, that is, by assuming the input variables are independent of each other given the class, which simplifies the calculations.

When this simplification is applied to predictive modelling problems, it is called the Naive Bayes algorithm.

Let’s understand Naive Bayes through an example. We take a dataset of employees in a company, and our aim is to create a model that predicts whether a person goes to the office by driving or by walking, using the person's salary and age.

The plot of this dataset shows 30 data points, in which the red points are those who walk and the green points are those who drive. Now let's add a new data point to it. Our aim is to find the category that the new point belongs to.

Note that we have taken age on the X axis and salary on the Y axis. We use the Naive Bayes algorithm to find the category of the new data point. For this we have to find the posterior probabilities of walking and of driving for this data point. After comparing them, the point is assigned to the category with the higher probability.

The posterior probability of walking for the new data point X is:

P(Walks|X) = P(X|Walks) * P(Walks) / P(X)

and for driving it is:

P(Drives|X) = P(X|Drives) * P(Drives) / P(X)

Steps involved in Naive Bayes algorithm

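Step 1: Find all the probabilities that Bayes' theorem requires for calculating the posterior probability.

P(Walks) is the prior probability: the number of people who walk divided by the total number of people.

To find the marginal likelihood, P(X), we draw a circle of a chosen radius around the new data point so that it includes some red and green points. P(X) is the number of points inside the circle divided by the total number of points.

P(X|Walks) is found by considering only those who walk: the number of walkers inside the circle divided by the total number of walkers.

Now we can find the posterior probability P(Walks|X) using Bayes' theorem.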

Step 2: Similarly, we can find the posterior probability of driving, which is 0.25.

Step 3: Compare both posterior probabilities. Comparing them, we find that P(Walks|X) has the greater value, so the new point belongs to the walking category.
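The three steps can be sketched in plain Python by counting points inside a circle around the new data point. The ten (age, salary in thousands) points, the new point and the radius below are invented for this sketch and are not the article's 30-point dataset; the function name posterior is likewise an assumption.

```python
import math


def posterior(points, new_point, label, radius):
    """Posterior P(label | X), with X meaning 'lies within radius of new_point'."""
    total = len(points)
    in_class = [p for p, l in points if l == label]
    near = [(p, l) for p, l in points if math.dist(p, new_point) <= radius]
    near_in_class = [p for p, l in near if l == label]

    prior = len(in_class) / total                      # P(label)
    marginal = len(near) / total                       # P(X)
    likelihood = len(near_in_class) / len(in_class)    # P(X | label)
    return likelihood * prior / marginal               # Bayes' theorem


# Invented toy data: ((age, salary_in_thousands), how_they_commute).
points = [
    ((25, 30), "walks"), ((27, 32), "walks"), ((30, 28), "walks"), ((32, 35), "walks"),
    ((33, 36), "drives"), ((45, 60), "drives"), ((48, 58), "drives"),
    ((50, 65), "drives"), ((52, 62), "drives"), ((55, 70), "drives"),
]
new_point = (31, 33)
radius = 6   # must be large enough that the circle contains at least one point

p_walks = posterior(points, new_point, "walks", radius)
p_drives = posterior(points, new_point, "drives", radius)
category = "walks" if p_walks > p_drives else "drives"
print(round(p_walks, 2), round(p_drives, 2), category)
```

Because both posteriors share the same P(X), comparing them comes down to comparing how many walkers and how many drivers fall inside the circle.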

Decision Tree Regression and Classification

The concept of decision trees is similar for regression trees and classification trees. The only difference is that in regression we predict values and in classification we assign data points to different groups. A decision tree works by splitting the data points into smaller and smaller subsets. How the splits are made is determined by the algorithm, and splitting stops when a further split would no longer add enough information. A point where a split occurs is called a node, and a terminal node is called a leaf node.

Pruning (the opposite of splitting) is a method that removes nodes from the tree in order to remove anomalies the tree has picked up from noise in the training data.
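To give a feel for how one split might be chosen, the sketch below scores every candidate threshold on a single feature and keeps the one with the lowest weighted Gini impurity. Gini impurity is one common splitting criterion and is an assumption here, since this article does not name a criterion; the ages and labels are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for a mixed one."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    The threshold is None if no split improves on the unsplit data.
    """
    n = len(values)
    best_threshold, best_score = None, gini(labels)
    distinct = sorted(set(values))
    # Candidate thresholds are the midpoints between neighbouring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Invented example: age as the only feature.
ages = [22, 25, 30, 41, 45, 52]
commute = ["walks", "walks", "walks", "drives", "drives", "drives"]

threshold, score = best_split(ages, commute)
print("split at age <=", threshold, "with weighted Gini", score)
```

A full tree repeats this search on each resulting subset until a stopping rule is met.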

Random Forest Regression and Classification

This is an ensemble learning technique in which you build a stronger model from many decision trees to get better predictions.

It includes the following steps:

Step 1: Pick K data points at random from the training set
Step 2: Build a decision tree for these K data points
Step 3: Choose the number of trees you need and repeat steps 1 and 2 for each tree
Step 4: For each new data point, have every tree predict a value or a class, then combine the results (the average for regression, the majority vote for classification)
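The sampling and combining steps can be sketched as below. To keep the example short, each "tree" is a deliberately trivial stand-in that always predicts the mean target of its own sample, not a real decision tree. Sampling with replacement, the function names and the data are assumptions made for this illustration.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, k, rng):
    """Pick K data points at random from the training set (with replacement)."""
    return rng.choices(rows, k=k)


def build_stub_tree(sample):
    """Stand-in for a real decision tree: predicts the mean target of its sample."""
    average = mean(target for _, target in sample)
    return lambda x: average


def forest_predict_value(trees, x):
    """Regression: average the predictions of all trees."""
    return mean(tree(x) for tree in trees)


def forest_predict_class(trees, x):
    """Classification: return the class that most trees vote for."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Invented training rows: (feature, target).
rows = [(1, 10.0), (2, 12.0), (3, 15.0), (4, 19.0), (5, 24.0)]
rng = random.Random(0)

# Build one stub tree per bootstrap sample, ten trees in total.
trees = [build_stub_tree(bootstrap_sample(rows, k=3, rng=rng)) for _ in range(10)]
prediction = forest_predict_value(trees, x=3)
print("averaged prediction:", prediction)

# Majority vote with three hypothetical classifiers.
classifiers = [lambda x: "walks", lambda x: "drives", lambda x: "walks"]
print("majority vote:", forest_predict_class(classifiers, x=None))
```

In a real random forest, build_stub_tree would be replaced by an actual decision tree learner fitted on each sample.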

K-Nearest Neighbors classification

It is an important classification algorithm in which a new data point is classified based on its similarity to a group of neighboring data points. Despite its simplicity, it often gives competitive results.

Steps for classifying a new data point:

Step 1: Select the number of neighbors K (say K = 5)
Step 2: Find the K (5) data points nearest to the new data point, based on Euclidean distance
Step 3: Among these K data points, count the data points in each category
Step 4: Assign the new data point to the category that has the most of its neighbors
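These steps translate almost line for line into plain Python. The sketch below is a minimal version for small datasets; the function name knn_classify, the two categories "A" and "B" and the sample points are assumptions for illustration.

```python
import math
from collections import Counter


def knn_classify(training_points, new_point, k=5):
    """Classify new_point by majority vote among its k nearest neighbors."""
    # Sort all training points by Euclidean distance to the new point.
    by_distance = sorted(training_points,
                         key=lambda item: math.dist(item[0], new_point))
    # Count the categories among the K nearest points.
    nearest_labels = [label for _, label in by_distance[:k]]
    votes = Counter(nearest_labels)
    # Assign the category with the most neighbors.
    return votes.most_common(1)[0][0]


# Invented training data: ((x, y), category).
training_points = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
    ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B"), ((7, 7), "B"),
]

print(knn_classify(training_points, (2, 3), k=5))
print(knn_classify(training_points, (6, 5), k=5))
```

An odd K is a common choice for two categories because it avoids tied votes.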

CONCLUSION

Each of these algorithms can be imported from the sklearn module. The model is instantiated and fitted to the training data, and finally predictions are made, taking into account only the features that exploratory data analysis has shown to be relevant for prediction.
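The import, instantiate, fit and predict workflow might look like the sketch below. The choice of the K-NN classifier, the built-in iris dataset, the 80/20 split and the random_state value are assumptions made for the example, and the exploratory data analysis step is left out.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset: X holds the features, y the known classes.
X, y = load_iris(return_X_y=True)

# Hold back 20% of the rows as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Instantiate the algorithm, fit it to the training data, then predict.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping in another algorithm usually means changing only the import and the line that instantiates the model, since sklearn estimators share the same fit and predict methods.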

That was a brief description of supervised learning algorithms. Thank you for reading and Happy Learning!

Written by Surabhi S