A Beginner’s Guide To Logistic Regression
You may think that logistic regression is used for regression tasks because of its name, or that it is just a variant of linear regression. The two are related, since logistic regression models a probability using a linear combination of the inputs, but in practice it is widely used for classification tasks. It predicts a categorical variable with the help of one or more independent variables.
After reading this post you should know the following:
- What Is Logistic Regression?
- What Are the Types of Logistic Regression?
- How does Logistic Regression work?
- What are the advantages and disadvantages of using logistic regression?
- Logistic Regression vs Linear Regression
- Python implementation for Logistic Regression
- Resources & References
What Is Logistic Regression?
Logistic regression is a supervised machine learning algorithm that solves binary classification tasks by predicting the probability of an outcome or class. It works best when the outcome is binary and the two classes can be separated reasonably well by a linear decision boundary.
Examples of classification problems include normal vs. spam email and fraudulent vs. normal transactions. In each case we are dealing with a binary outcome.
A binary outcome is one where there are only two possible scenarios, either the event happens (1) or it does not happen (0). Independent variables are those variables or factors which may influence the outcome (or dependent variable).
What Are the Types of Logistic Regression?
- Binary logistic regression
- Multinomial logistic regression
- Ordinal logistic regression
Binary logistic regression
It is the statistical technique used to model the relationship between the dependent variable (Y) and one or more independent variables (X), where the dependent variable takes a binary value (1 or 0). This is the type of logistic regression that we focus on in this post. For example:
- Spam filter : Normal / Spam
- Medical diagnosis : Normal / Abnormal
Multinomial logistic regression
Here the categorical dependent variable has three or more discrete possible outcomes with no natural order. It is very similar to binary logistic regression, except that there can be more than two possible outcomes. For example:
- Classifying texts into what language they come from.
- Predicting whether a student will go to college by bus or train or car
Ordinal logistic regression
Ordinal logistic regression applies when the dependent variable has more than two possible outcomes and those outcomes follow a natural order. For example:
- Formal shirt size : XS, S, M, L, XL
- Rating of a movie : Stars 0 to 5
How does Logistic Regression work?
Before applying logistic regression, check that the following assumptions hold:
- The dependent variable is binary (discrete, with two classes)
- There is a linear relationship between the independent variables and the log-odds of the outcome
- No extreme outliers
- No multicollinearity between the independent variables (they should be independent of each other, i.e. have low correlation)
- Preferably large sample size
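The multicollinearity assumption can be checked before training by looking at pairwise correlations between features. Below is a minimal sketch in plain Python; the feature names, their values, and the 0.9 cut-off are hypothetical and chosen only for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical feature columns: height_cm and height_in carry the same information.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30.0, 22.0, 41.0, 35.0, 27.0],
}
THRESHOLD = 0.9  # assumed cut-off, not a universal rule

# Compare every pair of features once and collect the strongly correlated ones.
names = list(features)
flagged = []
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(features[first], features[second])
        if abs(r) > THRESHOLD:
            flagged.append((first, second))

print(flagged)
```

Flagged pairs are candidates for dropping or combining one of the two features before fitting the model.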
The logistic regression equation is quite similar to the linear regression model: it starts from the same linear combination of the inputs, z = w·x + b. The difference is that logistic regression passes z through the sigmoid function, h(x) = 1 / (1 + e^(-z)), and is trained with a different cost function (the log loss) instead of the squared error.
The sigmoid function returns only values between 0 and 1, which can be interpreted as probabilities for the dependent variable, irrespective of the values of the independent variables. This also works if you have multiple independent variables: logistic regression models the relationship between all of them and the single dependent variable.
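To make the role of the sigmoid concrete, here is a minimal sketch in plain Python. The helper names (`sigmoid`, `predict_probability`), the weights, the bias, and the 0.5 threshold are assumptions made for illustration, not values from a trained model.

```python
import math

def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    # Two branches keep exp() from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(x, weights, bias):
    """h(x) = sigmoid(w . x + b) for a single example x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical weights and bias, chosen only for illustration.
weights, bias = [0.8, -0.4], 0.1
p = predict_probability([2.0, 1.0], weights, bias)

# Turn the probability into a class label; 0.5 is a common default threshold.
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```

However large or small the linear score z becomes, the output stays between 0 and 1, which is what allows it to be read as a probability.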
Unlike linear regression, the best-fit curve can't be determined in closed form using least squares. Instead, the parameters are estimated with maximum likelihood. The likelihood is the probability of the observed labels given the model's parameters, and training searches for the parameters that make it as large as possible.
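As a rough illustration of maximum likelihood, the sketch below fits a one-feature model by plain gradient ascent on the log-likelihood. The toy data, the function names, the learning rate, and the number of steps are all hypothetical; libraries such as scikit-learn use more sophisticated solvers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, b, xs, ys):
    """Sum of the log-probabilities the model assigns to the observed labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit(xs, ys, lr=0.05, steps=5000):
    """Maximise the log-likelihood with plain gradient ascent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood: sum of (label - predicted probability),
        # weighted by x for the slope w.
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs))
        grad_b = sum(errors)
        w += lr * grad_w
        b += lr * grad_b
    return w, b

# Hypothetical toy data: hours studied vs. pass (1) / fail (0), with some overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit(xs, ys)
print("before:", log_likelihood(0.0, 0.0, xs, ys))
print("after: ", log_likelihood(w, b, xs, ys))
```

The log-likelihood after fitting is higher than at the starting point (w = 0, b = 0), meaning the fitted parameters make the observed labels more probable.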
For a single example, the cost (log loss) is:
Cost(h(x), y) = -y * log(h(x)) - (1 - y) * log(1 - h(x))
Here y represents the actual class and h(x) is the predicted probability that the class is 1.
When y = 1, the second term becomes 0 and the cost is -log(h(x)).
If h(x) is 1, then -log(h(x)) is 0 and our cost will be 0. This means our prediction is correct.
If h(x) approaches 0, then -log(h(x)) approaches ∞ and so does our cost. This means our prediction is incorrect.
But when y = 0, the first term becomes 0 and the cost is -log(1 - h(x)).
If h(x) approaches 1, then -log(1 - h(x)) approaches ∞ and so does our cost. This means our prediction is incorrect.
If h(x) is 0, then -log(1 - h(x)) is 0 and our cost will be 0. This means our prediction is correct.
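These cases can be checked numerically. The sketch below assumes the usual log-loss form of the cost, -y * log(h(x)) - (1 - y) * log(1 - h(x)). The function name `log_loss` and the clipping constant `eps` are choices made for this example.

```python
import math

def log_loss(y, h, eps=1e-15):
    """Cost for one example: -y*log(h) - (1 - y)*log(1 - h)."""
    # Clip h away from exactly 0 or 1 so log() never receives 0.
    h = min(max(h, eps), 1 - eps)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

# Confident and correct, uncertain, and confident but wrong predictions.
for y, h in [(1, 0.99), (1, 0.5), (1, 0.01), (0, 0.01), (0, 0.99)]:
    print(f"y={y}, h(x)={h}: cost={log_loss(y, h):.3f}")
```

A confident correct prediction costs almost nothing, while a confident wrong prediction is penalised heavily, and the penalty grows without bound as the predicted probability for the true class approaches 0.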
What are the advantages and disadvantages of using logistic regression?
Advantages
- The main advantage of logistic regression is that it is very easy to set up and train.
- Logistic regression performs well when the classes are approximately linearly separable.
- It does not require many computational resources.
- It requires little hyperparameter tuning.
Disadvantages
- Logistic regression cannot predict a continuous outcome.
- Logistic regression may not be accurate if the sample size is too small or if the relationship between the features and the log-odds of the outcome is not linear.
Logistic Regression vs Linear Regression
Linear Regression | Logistic Regression |
---|---|
Used to solve regression problems | Used to solve classification problems |
Can deal with continuous output | Deals with discrete/categorical output only |
Fits a straight line | Fits an S-shaped (sigmoid) curve |
Python implementation for Logistic Regression
Importing the libraries
```python
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from sklearn.linear_model import LogisticRegression
import seaborn as sns
from matplotlib import rcParams
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, accuracy_score, roc_curve, auc, roc_auc_score
from sklearn.preprocessing import label_binarize
```
Importing the dataset
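```python
# Load the MNIST handwritten digit images and their labels
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Show the first three training images
for i in range(3):
    plt.subplot(3, 1, i + 1)
    plt.imshow(X_train[i])
plt.show()
```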
Data reshaping
```python
print(f'Before reshaping:\nX_train:{X_train.shape}\nX_test:{X_test.shape}')
# Flatten each 2D image into a 1D feature vector (one row per image)
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)
print(f'\nAfter reshaping:\nX_train:{X_train.shape}\nX_test:{X_test.shape}')
```
Data normalization
```python
print(f'Before normalization:\nmax value:{X_train.max()}\nmin value:{X_train.min()}')
# Scale pixel values from [0, 255] to [0, 1]
X_train = X_train / 255.0
X_test = X_test / 255.0
print(f'After normalization:\nmax value:{X_train.max()}\nmin value:{X_train.min()}')
```
Training the classification model on the Training set
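```python
# The default max_iter=100 is usually too low for the solver to converge on MNIST,
# so a larger limit is set here.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
```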
Predicting the Test set results
```python
y_pred = classifier.predict(X_test)
```
Making the Confusion Matrix
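```python
classes_num = 10
# Rows are the true digits, columns are the predicted digits
cm = confusion_matrix(y_test, y_pred)
print(cm)

rcParams['figure.figsize'] = 12, 8
sns.set(font_scale=1.2)
ax = sns.heatmap(cm, fmt='d', annot=True,
                 xticklabels=np.arange(classes_num), yticklabels=np.arange(classes_num))
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
plt.show()
```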
Calculate Precision, Recall, F1_score and Accuracy
```python
# Accuracy does not depend on the averaging method, so compute it once
accuracy = accuracy_score(y_test, y_pred)

average_methods = ['micro', 'macro', 'weighted']
for method in average_methods:
    precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred, average=method)
    print(f'{method}:')
    print(f'Precision: {round(precision*100, 2)}%\nRecall: {round(recall*100, 2)}%\nF1_score: {round(f1_score*100, 2)}%\nAccuracy: {round(accuracy*100, 2)}%\n')
```
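To see what the three averaging methods do, the sketch below computes micro, macro, and weighted precision by hand from a small confusion matrix. The 3-class matrix is hypothetical and is not the MNIST result; rows are true classes and columns are predicted classes, matching the layout returned by `confusion_matrix`.

```python
# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class.
cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 29],
]
n = len(cm)

tp = [cm[k][k] for k in range(n)]                                # correct predictions per class
predicted = [sum(cm[r][k] for r in range(n)) for k in range(n)]  # column sums
support = [sum(cm[k]) for k in range(n)]                         # row sums (true examples per class)

# Per-class precision: of everything predicted as class k, how much was right?
precision = [tp[k] / predicted[k] for k in range(n)]

macro = sum(precision) / n                                                # plain mean over classes
weighted = sum(p * s for p, s in zip(precision, support)) / sum(support)  # mean weighted by support
micro = sum(tp) / sum(predicted)                                          # pooled over all predictions
accuracy = sum(tp) / sum(support)

print(f"macro={macro:.4f} weighted={weighted:.4f} micro={micro:.4f} accuracy={accuracy:.4f}")
```

In a single-label multiclass problem like this one, micro-averaged precision equals accuracy, while macro averaging treats every class equally regardless of how many examples it has.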
Resources & References
- Read more about machine learning algorithms
- Read more about Linear Regression
- Full code implementation can be found On Github