Machine learning is an important Artificial Intelligence technique in which a system learns to perform a task effectively from experience. According to Forbes, machine learning will replace 25% of jobs within the next 10 years.
One of the most popular real-world applications of Machine Learning is classification. It corresponds to a task that occurs commonly in everyday life. For example, a hospital may want to classify medical patients into those who are at high, medium or low risk of acquiring a certain illness, an opinion polling company may wish to classify people interviewed into those who are likely to vote for each of several political parties or are undecided, or we may wish to classify a student project as distinction, merit, pass or fail. Other applications include clustering, language translation, recommendation, speech and image recognition, etc.
Selecting the right machine learning technique for a problem is one of the main challenges, as there are various algorithms available, each suited to different use cases and each with its own benefits.
In this article, we will discuss the top 5 machine learning algorithms which are most commonly used by data scientists.
As this is the first part of the blog, it covers two commonly used machine learning algorithms: Logistic Regression and Clustering.
TABLE OF CONTENTS:
Logistic Regression
Logit function
Practical usecase: Logistic regression using Python
Clustering
How to measure the performance of clustering?
Types of clustering
Practical usecase: K-Means clustering with Python
Applications of Clustering
1. LOGISTIC REGRESSION
Logistic Regression (also known as Logit Regression) is a regression technique used for classification (both binary and multiclass). It is a probabilistic statistical model in which the dependent variable is a categorical value. (To know more about dependent variables, click this link, where I have briefly explained the difference between dependent and independent variables.)
Why not just use Linear Regression for classification problems?
You might be wondering why Logistic Regression, despite being a regression algorithm, is used for classification instead of Linear Regression. To understand this, let us consider the following example:
Consider a case where we have to predict if a person is ‘obese’ or ‘not obese’ based on his/her current weight. The following is a graph where specific sample points are already plotted based on data. The y-axis denotes the categorical target values where 1 denotes that a person is obese and 0 denotes that the person is not obese.
Let f(x) be a linear regression line (or best-fit line) for the plotted data points. Now, if we are using linear regression, we need to set a threshold value on the basis of which we can perform the classification. If the estimated probability (P) lies in the interval 0.5 < P < 1, the model predicts the value 1, meaning the person is obese; if P lies in 0 < P < 0.5, the model predicts the person to be not obese (target value 0).
Figure 1: Graph depicting best fit lines
In figure 1 above, the regression line f(x) is given by the formula y = mx + c, where y is f(x), m is the slope, x is the independent variable and c is a constant (the intercept). But certain problems arise with this formulation.
Suppose we add a number of ‘very positive’ points to our training dataset. The regression line will tilt towards these examples (giving the line f’(x)), putting the correct classification of more marginal cases at risk. This can also push the estimated probability (P) above 1, which is unfortunate because these very positive points would have been classified correctly anyway. We want the line to cut between the classes and serve as a border, rather than pass through the classes as a scorer. The right way to think about classification is as carving feature space into regions, so that all the points within any given region are destined to be assigned the same label. Regions are defined by their boundaries, so we want regression to find separating lines instead of a fit.
Our hopes for accurate classification rest on regional coherence among the points: nearby points tend to have similar labels, and boundaries between regions tend to be sharp rather than fuzzy. Ideally, the two classes will be well separated in feature space, so that a line can easily partition them. More generally, though, there will be outliers, and we must judge a classifier by the “purity” of the resulting separation, penalising the misclassification of points which lie on the wrong side of the line. Any set of points can be perfectly partitioned if we design a complicated enough boundary that swerves in and out to capture every instance with a given label, but such custom-designed boundaries tend to overfit the data. Simple linear separators, such as the one Logistic Regression learns, offer the virtues of simplicity and robustness.
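To see the problem concretely, here is a minimal sketch on a made-up one-feature dataset (the weights and labels below are purely illustrative, not taken from any real data):
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Illustrative data: weight in kg, label 1 = obese, 0 = not obese
weights = np.array([55, 60, 65, 70, 90, 95, 100, 150, 160, 170]).reshape(-1, 1)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

lin = LinearRegression().fit(weights, labels)
log = LogisticRegression().fit(weights, labels)

# The linear fit is dragged around by the extreme points and can return
# 'probabilities' above 1 ...
print(lin.predict([[170], [55]]))
# ... while the logistic model always returns a probability in (0, 1).
print(log.predict_proba([[170], [55]])[:, 1])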
1.1 Logit Function
Similar to a Linear Regression model, the Logistic Regression model computes a weighted sum of the input features plus a bias term, but the estimated probability is given by the following equation:
P(y = 1 | x) = σ(Wᵀx + b)
Here, y is the predicted class, Wᵀx + b is the weighted sum (the raw prediction) and σ is the sigmoid function. The logit (sigmoid) function is given as:
σ(t) = 1 / (1 + e⁻ᵗ)
This function takes as input any real value −∞ < t < ∞ and produces a value in (0, 1), that is, a probability. The advantage of using this function is that it tames the effect of outliers and yields a cost function (the log loss) that can be optimised reliably when fitting the model.
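As a quick illustration, the sigmoid can be written in a couple of lines of NumPy; note how extreme inputs are squashed towards 0 or 1:
import numpy as np

def sigmoid(x):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# Extreme inputs are pushed towards 0 or 1, which is what tames outliers
print(sigmoid(np.array([-100, -2, 0, 2, 100])))
# approx. [0.0, 0.119, 0.5, 0.881, 1.0]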
1.2 Practical Usecase: Logistic Regression using Python
For the practical implementation in Python, we will use the HR Analytics dataset, which is available on Kaggle. In this dataset, the target column 'salary' takes three categorical values: 'low', 'medium' and 'high'. We will only use the 'low' and 'high' categories to demonstrate binary classification using Logistic Regression.
import pandas as pd
import numpy as np
#import the dataset
df=pd.read_csv('HR_comma_sep.csv')
#remove the 'medium' target category
df=df.loc[(df['salary'] == 'low') | (df['salary']=='high')]
df=df.drop(columns="Department")
Figure 2: Dataframe
As seen in figure 2, we have 8 independent variables and 1 dependent variable which is represented by column 'salary'. Now, before any ML algorithm is applied, we need to convert the target variables into numerical values. To achieve this, we will use label encoding.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['salary']= le.fit_transform(df['salary'])
X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]
Figure 3: Dataframe after label encoding
It can be clearly seen in the 'salary' column that the categories have been converted to numerical values. We will use the 'train_test_split' function from sklearn to split our data into training and testing datasets and then use Logistic Regression for classification.
from sklearn.model_selection import train_test_split
X_train_Logistic, X_test_Logistic, y_train_Logistic, y_test_Logistic = train_test_split(X, y, train_size=0.75)
#logistic regression model
from sklearn.linear_model import LogisticRegression
model_Logistic = LogisticRegression()
model_Logistic.fit(X_train_Logistic, y_train_Logistic)
The ratio of training to testing data is 75:25. The 'model_Logistic' variable holds the Logistic Regression instance. After training the model, we will make predictions on the test data.
model_Logistic.predict(X_test_Logistic)
The following output is generated, which shows the predictions for the target variable.
The output is a binary array of predictions. From these, we can further calculate the accuracy and a confusion matrix.
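For example, a minimal sketch of that follow-up computation using sklearn.metrics could look like this (the exact numbers will depend on the random train/test split):
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred_Logistic = model_Logistic.predict(X_test_Logistic)
# Fraction of test samples classified correctly
print(accuracy_score(y_test_Logistic, y_pred_Logistic))
# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test_Logistic, y_pred_Logistic))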
At this point, an analyst might do some model selection: find a subset of the variables that is sufficient for explaining their joint effect on the target variable. One way to proceed is to drop the least significant coefficient and refit the model, repeating this until no further terms can be dropped from the model.
A better but more time-consuming strategy is to refit each of the models with one variable removed, and then perform an analysis of deviance to decide which variable to exclude. The residual deviance of a fitted model is minus twice its log-likelihood, and the deviance between two models is the difference of their individual residual deviances (in analogy to sums-of-squares).
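As a rough sketch of how those deviances might be computed (using statsmodels rather than scikit-learn, and reusing the training split from above; this is illustrative, not part of the original workflow):
import statsmodels.api as sm

# Refit the model with statsmodels to get log-likelihoods
X_sm = sm.add_constant(X_train_Logistic.astype(float))
full_model = sm.Logit(y_train_Logistic, X_sm).fit(disp=0)
full_deviance = -2 * full_model.llf  # residual deviance of the full model

# Drop each variable in turn and see how much the deviance increases
for col in X_train_Logistic.columns:
    reduced = sm.Logit(y_train_Logistic, X_sm.drop(columns=col)).fit(disp=0)
    print(col, -2 * reduced.llf - full_deviance)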
2. CLUSTERING
The typical business applications of machine learning, such as predictive modelling and clustering, are relying less than ever on the production of original code - David Arnoux
Clustering is an unsupervised machine learning technique based on the grouping of similar objects together. There can be various use-cases of clustering, some of which are given below:
In a financial application, to find clusters of companies that have similar financial performance.
In a marketing application, to find clusters of customers with similar buying behaviour.
In an economics application, to find countries whose economies are similar.
In a medical application, to find clusters of patients with similar symptoms.
In a crime analysis application, we might look for clusters of high volume crimes such as burglaries or try to cluster together much rarer (but possibly related) crimes such as murders.
How to measure the performance of the clustering model?
Since clustering is an unsupervised machine learning technique, there is no ready-made measure of model performance such as accuracy or precision. Hence the question arises: how do we measure the performance of a clustering model?
There are mainly two kinds of techniques available for measuring performance: one allows us to compare different clustering methods with each other, and the other checks specific properties of a single clustering, such as compactness.
For comparing different clustering algorithms, the Rand measure is used. Through the Rand measure, we check how much the clusterings obtained by different methods coincide; in layman's terms, it measures the similarity between two clusterings of the same data. For checking specific properties such as compactness, Silhouette Analysis is used. Silhouette Analysis is based on the silhouette score, which indicates how well a data point belongs to a particular cluster. For example, if we have three clusters C1, C2, C3 and we take a random point x from cluster C1, the silhouette score tells us how well the point x belongs to cluster C1. Silhouette Analysis is discussed in section 2.1.1 (b).
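Both kinds of measure are available in scikit-learn; here is a small sketch on synthetic data (not the email dataset used later) comparing a k-means clustering with a hierarchical one and scoring its compactness:
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Toy data with three well separated blobs
X_toy, _ = make_blobs(n_samples=300, centers=3, random_state=42)

labels_km = KMeans(n_clusters=3, random_state=42).fit_predict(X_toy)
labels_ag = AgglomerativeClustering(n_clusters=3).fit_predict(X_toy)

# Rand-style comparison: how much do two clusterings agree with each other?
print(adjusted_rand_score(labels_km, labels_ag))
# Silhouette: how compact and well separated are the clusters themselves?
print(silhouette_score(X_toy, labels_km))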
2.1 Types of Clustering Algorithms
There are many types of clustering algorithms available which are applied for different use cases and data. Some of the clustering algorithms include k-means clustering, hierarchical clustering, DBSCAN, fuzzy c-means clustering, etc. In this article, we will discuss the most commonly used clustering algorithm (k-means clustering) with the Python implementation.
2.1.1 Practical Usecase: K-Means clustering with Python Code
K-means algorithm is a hard partition algorithm with the goal of assigning each data point to a single cluster. K-means algorithm divides a set of n samples into k disjoint clusters ci, i = 1,..., k, each described by the mean µi of the samples in the cluster. The means are commonly called cluster centroids. The K-means algorithm assumes that all k groups have equal variance.
It is a fast, simple-to-understand and generally effective approach to clustering. It starts by making a guess as to where the cluster centres might be, evaluates the quality of these centres, and then refines them to obtain better centre estimates. The k-means algorithm proceeds as follows:
1. Initialise the k cluster centres (centroids).
2. Assign every data point to its nearest centroid.
3. Recalculate the centroid of each of the k clusters as the mean of its assigned points.
4. Repeat steps 2 and 3 until the centroids no longer move (a NumPy sketch of these steps follows below).
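The four steps above can be written out directly in NumPy. This is only an illustrative sketch (the practical example below relies on scikit-learn's KMeans):
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Plain NumPy sketch of the four steps above (assumes no cluster ever becomes empty)
    rng = np.random.default_rng(seed)
    # Step 1: initialise the k centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assignments, centroids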
For the practical implementation, let us consider the Enron email Dataset.
emails = pd.read_csv('enron.csv')
After loading the data, the following dataframe is obtained:
Figure 4: Dataframe for the Enron Email Dataset
We then preprocess the dataset and obtain the TF-IDF features (you can go to my GitHub repository for reference; a rough sketch of this step is shown below). Before applying k-means clustering, it is necessary to find the optimal number of clusters. This can be obtained by the two most common methods: the Elbow Method and the Silhouette Score.
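The sketch below is only illustrative: the full preprocessing lives in the GitHub repository, and the column name 'message' is an assumption about how the email text is stored.
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sketch: assumes the email text is in a column named 'message'
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(emails['message'])  # sparse TF-IDF matrix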
a) Elbow method
For finding the number of centroids, the elbow method is used. In this method, the sum of squared errors (SSE) is calculated for different values of k (that is, the number of clusters) by clustering the dataset for each value of k. The elbow method works on the principle of minimising the within-cluster sum of squares (WCSS), which is given by the formula:
WCSS = Σᵢ₌₁ᵏ Σ_{x ∈ Sᵢ} ||x − µᵢ||²
In this equation, Sᵢ is the set of observations assigned to cluster i, µᵢ is the mean (centroid) of those observations, x is an observation represented as a d-dimensional vector, and k is the number of cluster centres.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
wcss = []
for i in range(1, 6):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X.todense())
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 6), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Here, the loop runs over range(1, 6), that is, candidate cluster counts from 1 to 5. As X is a sparse matrix, we use the todense() method to convert it into a dense one before fitting.
The point on the graph where the curve bends sharply (the 'elbow') is taken as the optimal value of k. As we can see from the figure, the bend occurs around the x-coordinates 2 and 3, so the optimal number of clusters could be either 2 or 3. To pin down the precise number of clusters, the Silhouette Score is used.
b) Silhouette Score
A silhouette value always lies between -1 and 1, where -1 indicates that the data point we are considering is closer to its neighbouring cluster than to its assigned cluster centre, and +1 indicates that the data point is close to its assigned cluster centre and far from the neighbouring cluster. The silhouette score of a point x is given by the following formula:
s(x) = (N(x) − M(x)) / max(M(x), N(x))
In the above equation, M(x) is the mean distance of the point x to all the other points within its assigned cluster and N(x) is the mean distance of the point x to all the points of the nearest neighbouring cluster.
from sklearn.metrics import silhouette_score
range_n_clusters = list(range(2, 6))
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_
    score = silhouette_score(X, preds, metric='euclidean')
    # Print the average silhouette score for this number of clusters
    print(n_clusters, score)
We evaluate cluster counts from 2 to 5 (a silhouette score requires at least two clusters). The metric is set to 'euclidean', which means the mean distances are computed using Euclidean distance. This code generated the following output:
Figure 5: Average Silhouette Scores
It is clear from figure 5 that the optimal number of clusters is 3, as it obtained the highest score. Hence, using this value, we will perform our k-means clustering.
n_clusters = 3
clf = KMeans(n_clusters=n_clusters,
             max_iter=100,
             init='k-means++',
             n_init=1)
labels = clf.fit_predict(X)
The variable 'n_clusters' contains the optimal number of clusters. This code produces the following output:
Figure 6: Clusters
Figure 6 shows that we have performed the k-means clustering successfully; the three cluster centres are clearly visible.
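As an optional follow-up, one way to make sense of the clusters is to look at the most heavily weighted TF-IDF terms in each cluster centre. This sketch assumes the hypothetical 'vectorizer' from the TF-IDF sketch earlier (on older scikit-learn versions the method is get_feature_names() instead of get_feature_names_out()):
import numpy as np

terms = np.array(vectorizer.get_feature_names_out())
for j, center in enumerate(clf.cluster_centers_):
    # Ten terms with the largest weight in this cluster centre
    top_terms = terms[np.argsort(center)[::-1][:10]]
    print(j, ', '.join(top_terms))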
2.2 Applications of Clustering
Hypothesis development: Learning that there appear to be (say) four distinct populations represented in your data set should spark the question as to why they are there. If these clusters are compact and well-separated enough, there has to be a reason and it is your business to find it. Once you have assigned each element a cluster label, you can study multiple representatives of the same cluster to figure out what they have in common, or look at pairs of items from different clusters and identify why they are different.
Modeling over smaller subsets of data: Data sets often contain a very large number of rows (n) relative to the number of feature columns (m): think of taxi cab data with 80 million trips and ten recorded fields per trip. Clustering provides a logical way to partition one large set of records into, say, a hundred distinct subsets, each grouped by similarity. Each of these clusters still contains more than enough records to fit a forecasting model on, and the resulting model may be more accurate on this restricted class of items than a general model trained over all items.
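A minimal sketch of that idea, assuming hypothetical arrays X_trips (trip features) and y_fare (the value to forecast):
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Partition the records into (say) 100 clusters of similar trips
cluster_ids = KMeans(n_clusters=100, random_state=0).fit_predict(X_trips)

# Fit one small, specialised model per cluster instead of a single global one
per_cluster_models = {}
for c in set(cluster_ids):
    mask = cluster_ids == c
    per_cluster_models[c] = LinearRegression().fit(X_trips[mask], y_fare[mask])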
Data reduction: Dealing with millions or billions of records can be overwhelming, for processing or visualization. Consider the computational cost of identifying the nearest neighbor to a given query point, or trying to understand a dot plot with a million points. One technique is to cluster the points by similarity, and then appoint the centroid of each cluster to represent the entire cluster. Such nearest neighbor models can be quite robust because you are reporting the consensus label of the cluster, and it comes with a natural measure of confidence: the accuracy of this consensus over the full cluster.
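A short sketch of this reduction, with hypothetical arrays X_big (points), y_big (integer class labels) and a query point x_query:
import numpy as np
from sklearn.cluster import KMeans

# Replace millions of points by k centroids, each carrying the consensus
# (majority) label of its cluster
km_big = KMeans(n_clusters=1000, random_state=0).fit(X_big)
consensus = np.array([np.bincount(y_big[km_big.labels_ == c]).argmax()
                      for c in range(1000)])

# A query point is labelled by its nearest centroid's consensus label
nearest = km_big.predict(x_query.reshape(1, -1))[0]
print(consensus[nearest])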
Outlier detection: Certain items resulting from any data collection procedure will be unlike all the others. Perhaps they reflect data entry errors or bad measurements. Perhaps they signal lies or other misconduct. Or maybe they result from the unexpected mixture of populations, a few strange apples potentially spoiling the entire basket.
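One common clustering-based way to surface such outliers is to flag the points that lie farthest from their assigned centroid; a small sketch using the email features X from above:
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, random_state=0).fit(X)
# Distance of each point to its nearest (i.e. assigned) cluster centre
dist_to_centroid = np.min(km.transform(X), axis=1)
# The ten points farthest from any centre are the most suspicious
outlier_indices = np.argsort(dist_to_centroid)[-10:]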
REFERENCES AND ADDITIONAL RESOURCES:
https://link.springer.com/chapter/10.1007/978-981-15-3369-3_9
https://github.com/abhishek-924/Cohesion-Analysis-in-Email-Clustering-
https://www.statisticssolutions.com/what-is-logistic-regression/
https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm
https://www.sciencedirect.com/science/article/pii/S1319157814000573
https://www.theaisorcery.com/post/linear-regression-for-beginners-a-mathematical-introduction