Linear Regression for Continuous Value Prediction
Linear regression is usually the first machine learning algorithm that every data scientist comes across. It is one of the most important supervised machine learning algorithms. In brief, it is a very simple model that tries to mimic the behavior of a dataset using a straight line.
After reading this post you should know the following:
- What is regression?
- When and why do we use regression?
- Definition of Linear regression
- Advantages of using Linear regression
- Model training and evaluation
- Python implementation for Linear regression
- Resources & References
What is Linear Regression?
Before we explain what linear regression is, let us first explain what regression means.
Regression is a statistical way of modelling the relationship between an input and an output. We call the input the “independent variables” and the output the “dependent variable”.
This method is mostly used for forecasting and finding out the type of relationship between the independent and dependent variables.
When and Why Do We Use Linear Regression?
Regression is performed when the dependent variable (usually denoted by Y) is a continuous value.
The regression method tries to find the best fit line, which shows the relationship between the dependent variable and the independent variables (denoted by X, also called predictors). This line can be linear or non-linear; if the best fit line is a straight line, the model is called a linear regression model.
This is what the regression formula looks like. Since we are only using linear components (each X raised to the power of 1), it is also called the linear regression formula. Here, i denotes the i-th feature.
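Written out, the standard form of this formula is:
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon
where Y is the dependent variable, the X_i are the independent variables (features), the \beta_i are the coefficients the model learns, \beta_0 is the intercept, and \varepsilon is the error term.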
Definition of Linear regression
Linear regression analysis is one of the most widely used of all statistical techniques. It assumes that there is a linear relationship between the dependent variable and the predictors (independent variables).
You can think of linear regression as the answer to the question “How can I use X to predict Y?”, where X is some information that you have and Y is some information that you want to know. The answer is to find the best fit line that maps the input X to the output Y.
Advantages of using Linear regression
Linear regression is known for:
- Ease of use. It is very easy to use and to understand (it is simply the equation of a straight line with a slope and a Y-intercept).
- Scalability. You can add more data easily and retrain your model.
- Interpretability. It lets you analyze the impact of changes in the inputs (for example, price changes) on the predicted value.
Model training and evaluation
For linear regression to work well, it makes some assumptions about the dataset:
- Linearity: The relationship between the independent variables and the dependent variable is linear.
- Homoscedasticity: The variance of the residuals (errors) is constant across all values of the independent variables.
- Independence: The features are independent of each other, i.e. there should be no correlation (multicollinearity) between the independent variables.
- Normality: The residuals (errors) should be approximately normally distributed.
You can easily check the above assumptions by plotting the dataset: plot X versus Y and you will see whether a linear model is applicable to this dataset. You can also check for independence by drawing a correlation matrix between the independent variables, as sketched below.
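As a rough sketch of these checks (assuming a hypothetical pandas DataFrame df whose last column is the dependent variable Y and whose remaining columns are numeric independent variables):
import pandas as pd
import matplotlib.pyplot as plt
target = df.columns[-1]
# Linearity check: scatter plot of each feature against the target
for col in df.columns[:-1]:
    df.plot.scatter(x=col, y=target)
    plt.show()
# Independence check: correlation matrix between the independent variables
print(df[df.columns[:-1]].corr(numeric_only=True))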
Loss functions
After you have checked that linear regression is a valid choice, you need to ask yourself how to train this model. First, we have to define an error function or an evaluation metric to check the performance of our model, which can be one of the following (the formulas are written out after the list):
- Mean Absolute Error (MAE): we calculate the average absolute difference between the actual values and the predicted values.
- Mean Absolute Percentage Error (MAPE): MAPE is defined as the average absolute percentage deviation of the predicted values from the actual values.
- Root Mean Square Error (RMSE): RMSE is the square root of the average of the squared differences between the actual and the predicted values.
- R-squared values: R-squared is a measure of how much of the variance in the dependent variable our linear function accounts for. An R-squared value closer to 1 is better, because the model accounts for more of the variance.
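For reference, these metrics can be written as follows, where y_i is the i-th actual value, \hat{y}_i the corresponding predicted value and n the number of samples:
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
R^2 = 1 - \frac{SSR}{SST}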
In the R-squared formula:
- SSR: the sum of the squared differences between the predicted (expected) outputs and the actual observed outputs, also known as the sum of squares of the residuals.
- SST: the sum of the squared differences between the actual data points and their mean, also known as the total sum of squares.
The evaluation function used most often is R². Training then amounts to repeatedly adjusting the best fit line, computing the R² value, checking whether it has improved, adjusting the line again, and so on.
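As a minimal illustration of the R² computation itself (using a toy NumPy example, with y_true as the actual values and y_pred as the predictions of a candidate line):
import numpy as np
y_true = np.array([3.0, 5.0, 7.0, 9.0])      # actual values (toy data)
y_pred = np.array([2.8, 5.1, 7.2, 8.7])      # predictions from a candidate best fit line
ssr = np.sum((y_true - y_pred) ** 2)          # sum of squares of the residuals
sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r_squared = 1 - ssr / sst
print(f'R-squared: {r_squared:.4f}')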
Python implementation for Linear regression
Importing the libraries
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
Loading and checking the dataset
dataset = pd.read_csv('50_Startups.csv')
dataset.head()
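Assuming the commonly used 50_Startups dataset, the columns are typically R&D Spend, Administration, Marketing Spend, State (a categorical column) and Profit (the continuous value we want to predict); the column positions used below assume this layout.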
Finding the NaN values
nan_num = dataset.isna().sum().sum()
print(f'Number of nan: {nan_num}')
rows_with_nan = list()
for index, row in dataset.iterrows():
    is_nan_series = row.isnull()
    if is_nan_series.any():
        rows_with_nan.append(index)
print(f'NaN_indices: {rows_with_nan}')
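Equivalently, the same indices can be collected without an explicit loop, using only pandas calls on the dataset loaded above:
rows_with_nan = dataset[dataset.isnull().any(axis=1)].index.tolist()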
Dropping the rows with NaN values (or filling them with the mean value)
X_inital = dataset.iloc[:, :-1].values
print(f'Input with NaN:\n{X_inital[rows_with_nan[0]-1:rows_with_nan[0]+2]}')
dataset.dropna(inplace=True)
X_inital = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(f'Input without NaN:\n{X_inital[rows_with_nan[0]-1:rows_with_nan[0]+1]}')
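The code above simply drops the incomplete rows. If you would rather keep them and fill the missing numeric values with the column mean instead, a minimal sketch (not used in the rest of this post) would be:
dataset.fillna(dataset.mean(numeric_only=True), inplace=True)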
Encoding the categorical 'State' column
print(f'Input with strings:\n{X_inital[:,3]}')
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')  # one-hot encode the categorical 'State' column (index 3 of the feature matrix)
X_categorical = np.array(ct.fit_transform(X_inital))
print(f'Categorical input:\n{X_categorical[:,0:3]}')
le = LabelEncoder()
le.fit(dataset['State'].unique().tolist())
dataset['State'] = le.transform(dataset['State'].values)
print(f'String inputs:\n{X_inital[:,3]}')
X = dataset.iloc[:, :-1].values
print(f'Numerical inputs:\n{X[:,3]}')
y = dataset.iloc[:, -1].values
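To see which integer each state name was mapped to, you can inspect the fitted LabelEncoder:
print(dict(zip(le.classes_, le.transform(le.classes_))))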
Splitting the dataset and normalizing the values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
print(f'Before normalizing:\nMax value: {X_train.max()}\nMin value: {X_train.min()}')
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_norm = scaler.transform(X_train)
X_test_norm = scaler.transform(X_test)
print(f'After normalizing:\nMax value: {X_train_norm.max()}\nMin value: {X_train_norm.min()}')
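MinMaxScaler rescales each feature to the [0, 1] range using the minimum and maximum observed on the training data:
X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}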
Building the model and evaluating it
regressor = LinearRegression()
regressor.fit(X_train_norm, y_train)
y_pred = regressor.predict(X_test_norm)
Calculating the metrics
R_squared = r2_score(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)
RMSE = mean_squared_error(y_test, y_pred, squared=False)
MAE = mean_absolute_error(y_test, y_pred)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
print(f'R-squared: {round(R_squared*100,2)}%')
print(f'MAE: {round(MAE,2)}')
print(f'MSE: {round(MSE,2)}')
print(f'RMSE: {round(RMSE,2)}')
print(f'MAPE: {round(MAPE*100,2)}%')
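As a final step, it can be useful to inspect the learned parameters of the best fit line and to predict the profit of a new, unseen startup. The sketch below reuses the regressor and scaler fitted above and a hypothetical new sample whose four features follow the assumed column order (R&D Spend, Administration, Marketing Spend, label-encoded State):
print(f'Intercept (b0): {regressor.intercept_}')
print(f'Coefficients (b1..bn): {regressor.coef_}')
new_sample = np.array([[160000.0, 130000.0, 300000.0, 1]])  # hypothetical new startup
new_sample_norm = scaler.transform(new_sample)  # reuse the scaler fitted on the training data
print(f'Predicted profit: {regressor.predict(new_sample_norm)[0]:.2f}')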