If you are not familiar with basic machine learning terms such as features, labels, training, and inferencing, please check out the following blog as a prerequisite.
Introduction
Regression models are supervised machine learning models that are trained to predict label values based on the training data. In this blog, we will discuss linear regression.
Linear Regression
Let me first set up the context. Throughout the blog, we will use a practical example to understand the concept. Consider that we have existing data from an ice cream shop. The data has 2 columns: one is the average temperature of each day, and the other is the number of ice creams sold.
We are creating an algorithm using this existing data. The algorithm can then be used to predict the number of ice creams that will be sold, given the average temperature.
For this use case, we will try using linear regression. Hence, consider the algorithm to be the following:

y = mx + b

In this algorithm, y is the label (number of ice creams), x is the feature (average temperature), and m and b are the parameters.
Let's understand how this algorithm is derived.
- As the value of x (the feature) increases, the value of y (the prediction) will increase or decrease, because we are considering a linear regression.
- Hence, y is either directly or inversely proportional to x.
- In our case, the number of ice creams sold will increase with the increase in temperature.
- Therefore, y ∝ x.
- Therefore, y = mx, where m is the constant of proportionality (the slope of the line).
- Further, consider the point at which the value of y starts. In our case, we can call it the base value, or the number of ice creams that are sold regardless of the temperature. This value can also be 0. Let's represent this value by b. It is the y-intercept of the line.
- Now the equation becomes y = mx + b.
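To make the equation concrete, here is a minimal sketch in Python. The values of m and b below are made-up placeholders for illustration, not parameters fitted from real data.

```python
def predict_sales(temperature, m=1.0, b=-50.0):
    """Predict ice cream sales with the linear model y = m * x + b.

    m (slope) and b (y-intercept) are illustrative placeholders here;
    a real model would learn them from training data.
    """
    return m * temperature + b

# With these placeholder parameters, a 70-degree day predicts 20 sales.
print(predict_sales(70))
```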
Training
The following steps take place while training a linear regression model.
Step 1
- The available data, that is, both features and labels, is randomly split into multiple groups.
- This creates various groups of data that can be used to train the model.
- One group is held back, which can later be used to validate the trained model.
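The splitting step can be sketched as follows; the size of the held-back group and the fixed seed are assumptions for illustration.

```python
import random

# Feature (temperature) and label (sales) pairs from the ice cream data.
data = [(52, 0), (67, 14), (70, 23), (73, 22), (78, 26), (83, 36)]

random.seed(42)  # fixed seed so the split is reproducible
shuffled = data[:]
random.shuffle(shuffled)

holdout = shuffled[:2]   # held back to validate the trained model
training = shuffled[2:]  # used to fit the model
```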
Step 2
- Take one dataset from the multiple groups that we created in the previous step.
- Use a regression algorithm such as linear regression to fit the training data to a model. In other words, create a formula, based on the known data, by estimating the values of m and b such that it predicts the right label for a given feature.
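As a sketch of the fitting step, the closed-form ordinary least squares formulas below estimate m and b from the ice cream observations used later in this blog.

```python
# Ordinary least squares fit of y = m * x + b.
xs = [52, 67, 70, 73, 78, 83]  # feature: average temperature
ys = [0, 14, 23, 22, 26, 36]   # label: ice creams sold

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
# Intercept: the fitted line passes through the point of means.
b = mean_y - m * mean_x

print(round(m, 3), round(b, 3))
```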
Step 3
- Use the group of data that we held back to validate the model by letting it predict the labels for those features.
Step 4
- Compare the known actual labels in the held-back group of data with the labels that the model predicted.
- Then aggregate the differences between the predicted and actual labels to calculate a metric that indicates how accurately the model predicted the validation data.
After each train, validate, and evaluate iteration, you can repeat the process with different algorithms and parameters until an acceptable evaluation metric is achieved.
Regression evaluation metrics
- Based on the predicted and actual values, you can calculate some common metrics that are used to evaluate a regression model.
- To understand each metric, consider the following observations for the ice cream sales.
Temperature (x) | Actual sales (y) | Predicted sales (ŷ) | Absolute difference (\|y − ŷ\|)
---|---|---|---
52 | 0 | 2 | 2 |
67 | 14 | 17 | 3 |
70 | 23 | 20 | 3 |
73 | 22 | 23 | 1 |
78 | 26 | 28 | 2 |
83 | 36 | 33 | 3 |
Mean Absolute Error (MAE)
- The value of MAE is the average of all the absolute differences. Hence, the name Mean Absolute Error.
- In the ice cream example, the mean (average) of the absolute errors (2, 3, 3, 1, 2, and 3) is 2.33.
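Computing the MAE for the table above:

```python
actual    = [0, 14, 23, 22, 26, 36]   # observed ice cream sales
predicted = [2, 17, 20, 23, 28, 33]   # model predictions

# Mean of the absolute differences between actual and predicted labels.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(round(mae, 2))  # 2.33
```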
Mean Squared Error (MSE)
- The Mean Absolute Error takes all the discrepancies between the predicted and actual labels into account equally. However, it may be more desirable to have a model that consistently makes small errors than a model that makes fewer but larger errors.
- One way of getting a metric that amplifies large errors is by squaring the individual errors and calculating the mean of the squared values. This metric is known as the Mean Squared Error.
- In our ice cream example, the mean of the squared errors (which are 4, 9, 9, 1, 4, and 9) is 6.
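The same data gives the MSE:

```python
actual    = [0, 14, 23, 22, 26, 36]
predicted = [2, 17, 20, 23, 28, 33]

# Mean of the squared differences; large errors are amplified.
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
print(mse)  # 6.0
```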
Root Mean Squared Error (RMSE)
- The Mean Squared Error helps take the magnitude of errors into account, but because it squares the error values, the resulting metric no longer represents the quantity measured by the label.
- To get the error in terms of the unit of label, we need to calculate the square root of MSE. It produces a metric called Root Mean Squared Error.
- In this case, RMSE = √6, which is 2.45 (ice creams).
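And the RMSE, back in units of ice creams:

```python
import math

actual    = [0, 14, 23, 22, 26, 36]
predicted = [2, 17, 20, 23, 28, 33]

mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
rmse = math.sqrt(mse)  # square root puts the error back in label units
print(round(rmse, 2))  # 2.45
```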
Coefficient of determination (R²)
- All the metrics so far compare the discrepancy between the predicted and the actual values in order to evaluate the model. However, in reality, there is some natural random variance in the daily data that the model cannot fully account for.
- To measure the variation existing in the data, we need a reference point. This reference point is the mean of all the actual values, ȳ. Using it, we can calculate how far each observation varies from the mean.
- In this case, the mean of the actual sales is ȳ = (0 + 14 + 23 + 22 + 26 + 36) / 6 ≈ 20.167. The absolute deviation of each observation from the mean is 20.167, 6.167, 2.833, 1.833, 5.833, and 15.833.
- Squaring these deviations and summing them gives the total variation in the data, Σ(y − ȳ)² ≈ 740.83. This is known as the total sum of squares.
- Next, we need the variation between the predicted and the actual data. For this we do not need a reference value, because we already have two quantities to compare.
- The absolute differences between the actual and predicted values are 2, 3, 3, 1, 2, and 3. Squaring and summing them gives the residual sum of squares, Σ(y − ŷ)² = 4 + 9 + 9 + 1 + 4 + 9 = 36. This is the variation the model fails to explain.
- The ratio 36 / 740.83 ≈ 0.05 is the proportion of the total variation left unexplained by the model. Subtracting it from 1 gives the proportion the model does explain: R² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)² ≈ 0.95. This value is called the coefficient of determination, and it typically ranges between 0 and 1.
- A value of 1 indicates that the model captures all the variation that exists in the data, while 0 indicates that the model does no better than simply guessing the mean.
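The coefficient of determination can be computed directly from the standard formula R² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)²:

```python
actual    = [0, 14, 23, 22, 26, 36]
predicted = [2, 17, 20, 23, 28, 33]

mean_actual = sum(actual) / len(actual)

# Total variation in the data (total sum of squares).
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
# Variation the model fails to explain (residual sum of squares).
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

r2 = 1 - ss_res / ss_tot
print(round(r2, 2))  # 0.95
```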
All the metrics explained above are used to evaluate a regression model. A data scientist uses an iterative approach to repeatedly train and evaluate a model, varying:
- Feature Selection and Preparation: Choosing which features to include in the model, and calculations applied to them to help ensure a better fit.
- Algorithm selection: There are many regression algorithms besides simple linear regression, and a different one may fit the data better.
- Algorithm parameters: In the case of the linear regression algorithm, the parameters were m and b. In general, however, parameters means the coefficients that represent the relationship between the features and the predicted label values.
Final Words
Thank you for reading the blog. I understand that these concepts can be hard to grasp, especially with limited math and statistics knowledge. If you have any questions or thoughts, feel free to discuss them in the comments or contact me via any of my social profiles.
Citation
This blog post is inspired by a Microsoft Learn course's module. While the foundational concepts are based on the course material, I have expanded on them with additional explanations, examples, and insights to better simplify and contextualize the information for readers.