Linear Regression

Laxmikant R
7 min read · Feb 24, 2023


Linear regression is a statistical method that helps predict future values by looking at past data. Specifically, it looks at how one or more things (like temperature or sales) are related to another thing (like time or advertising budget). This helps us understand how changes in one thing affect the other.

Linear regression works by drawing a straight line through the data points that best fits the pattern of the data. This line represents the relationship between the things we are measuring. For example, if we are measuring temperature and time, the line might show us that temperature increases over time.

Once we have this line, we can use it to make predictions about the future. For example, if we know the temperature and time for the last 10 days, we can use linear regression to predict what the temperature will be on the 11th day.

Linear regression is used in many different fields, like economics, finance, and marketing, to make predictions and understand patterns in data. It is a helpful tool for people who want to use data to make informed decisions.

To keep the example short, I will take the temperatures for the weekdays, Monday to Friday:

Day         Temperature (°C)
Monday      23
Tuesday     24
Wednesday   26
Thursday    25
Friday      27

import numpy as np
import matplotlib.pyplot as plt

# Create a list of days of the week starting from Monday to Friday
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

# Create a list of temperatures for each day
temperatures = [23, 24, 26, 25, 27]

# Fit a linear regression to the data
slope, intercept = np.polyfit(range(len(days)), temperatures, 1)

# Create a linear temperature range based on the regression
temp_range = slope * np.array(range(len(days))) + intercept

# Plot the days of the week and temperatures
plt.plot(days, temperatures, 'o', label='Data')
plt.plot(days, temp_range, label='Linear Regression')

# Set the axis labels and legend
plt.xlabel('Day of the Week')
plt.ylabel('Temperature (°C)')
plt.legend()

# Show the plot
plt.show()

In the above example, np.polyfit finds the slope and intercept of the best-fit line, which is described by the linear equation

y = mx + c

You have probably already studied linear equations in school, but as a quick refresher, here is the definition in short:

The equation y = mx + c is the slope-intercept form of a linear equation, where:

  • y is the dependent variable, which represents the output or the value we are solving for.
  • x is the independent variable, which represents the input or the variable we are manipulating.
  • m is the slope of the line, which represents the rate at which the dependent variable changes in response to a change in the independent variable.
  • c is the y-intercept of the line, which represents the point where the line intersects the y-axis when x is equal to zero.

In other words, the equation y = mx + c represents a straight line on a Cartesian plane, where m is the steepness of the line and c is the point where the line intersects the y-axis.
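
To make the formula concrete, here is a minimal sketch that plugs the values fitted above back into y = mx + c. For the temperature data, np.polyfit returns roughly m = 0.9 and c = 23.2:

# Evaluating y = mx + c by hand with the fitted values
m = 0.9    # slope: the temperature rises about 0.9 °C per day
c = 23.2   # intercept: the fitted temperature at x = 0 (Monday)

x = 4                  # Friday's index
y = m * x + c          # predicted temperature for Friday
print(y)               # about 26.8 °C, close to the observed 27 °C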

Here are some basic terms associated with linear regression:

  1. Dependent variable: The variable that is being predicted or explained by the model. It is also referred to as the response variable or the target variable.
  2. Independent variable: The variable(s) used to predict the dependent variable. It is also referred to as the predictor variable or the explanatory variable.
  3. Simple linear regression: A linear regression model that uses only one independent variable to predict the dependent variable.
  4. Multiple linear regression: A linear regression model that uses more than one independent variable to predict the dependent variable.
  5. Slope: The coefficient of the independent variable in the linear regression equation. It represents the change in the dependent variable for a unit change in the independent variable.
  6. Intercept: The point where the regression line crosses the y-axis.
  7. Residuals: The difference between the actual values of the dependent variable and the values predicted by the model. The goal of linear regression is to minimize the sum of the squared residuals.
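
Here is a small sketch, reusing the temperature data from the example above, that computes the residuals and the sum of squared residuals for the fitted line:

import numpy as np

# Temperature data from the example above
x = np.arange(5)                                  # Monday = 0 ... Friday = 4
temperatures = np.array([23, 24, 26, 25, 27])

# Fit the line and compute its prediction for each day
slope, intercept = np.polyfit(x, temperatures, 1)
predicted = slope * x + intercept

# Residuals: actual values minus predicted values
residuals = temperatures - predicted
print("Residuals:", residuals)
print("Sum of squared residuals:", np.sum(residuals ** 2))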

Now, let's take a look at how to predict the temperature for Sunday, i.e. the 7th day of the week. Since the index starts from 0, Sunday has index 6:

import numpy as np
import matplotlib.pyplot as plt

# Create a list of days of the week starting from Monday to Friday
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

# Create a list of temperatures for each day
temperatures = [23, 24, 26, 25, 27]

# Fit a linear regression to the data
slope, intercept = np.polyfit(range(len(days)), temperatures, 1)

# Predict the temperature for Sunday
sunday_temperature = slope * 6 + intercept

# Create a scatter plot of the data points
plt.scatter(range(len(days)), temperatures)

# Create a line plot of the regression line
plt.plot(range(len(days)), slope * np.array(range(len(days))) + intercept, color='r')

# Add labels and title to the plot
plt.xlabel('Day of the week')
plt.ylabel('Temperature (°C)')
plt.title('Linear Regression for Temperature')

# Add a marker for the predicted temperature of Sunday
plt.scatter(6, sunday_temperature, color='g')

# Show the plot
plt.show()

In this code, we first create a list of days of the week starting from Monday to Friday and a corresponding list of temperatures for each day. We then fit a linear regression to the data using numpy.polyfit and predict the temperature for Sunday using the equation of the regression line.

To plot the data, we create a scatter plot of the data points using plt.scatter, and a line plot of the regression line using plt.plot. We also add labels and a title to the plot using plt.xlabel, plt.ylabel, and plt.title.

To show the predicted temperature of Sunday on the plot, we add a marker using plt.scatter with the x-coordinate of 6 (since Sunday is the 7th day of the week with index 6) and the predicted temperature for Sunday as the y-coordinate. We color the marker green to distinguish it from the data points and the regression line.

Finally, we show the plot using plt.show().

Let’s dive a little deeper to understand how machine learning plays a role in predicting the value. We will take a generic example.

Let’s say we want to predict the price of a house based on its size (in square feet). We can use a linear regression algorithm to build a model that predicts the price of a house given its size.

We will use a dataset that contains information about the size and price of houses in a certain area. The dataset has two columns: “Size” (in square feet) and “Price” (in dollars).

Here is an example of what the dataset might look like:


Size (sqft)   Price ($)
1500          200000
2000          250000
2500          300000
3000          350000
3500          400000

We can use this dataset to train a linear regression model. The goal of the model is to find a linear relationship between the size of the house and its price, so that we can predict the price of a house given its size.

We can visualize the relationship between the size of the house and its price using a scatter plot:

import matplotlib.pyplot as plt

size = [1500, 2000, 2500, 3000, 3500]
price = [200000, 250000, 300000, 350000, 400000]

plt.scatter(size, price)
plt.xlabel('Size (sqft)')
plt.ylabel('Price ($)')
plt.show()

This will produce a scatter plot of the data, showing the relationship between the size of the house and its price.

We can see that there is a positive linear relationship between the size of the house and its price. This means that as the size of the house increases, the price of the house also increases.

We can use a linear regression algorithm to find the equation of the line that best fits the data. The equation of the line will have the form y = mx + b, where y is the predicted price of the house, x is the size of the house, m is the slope of the line, and b is the y-intercept.

from sklearn.linear_model import LinearRegression

X = [[1500], [2000], [2500], [3000], [3500]]
y = [200000, 250000, 300000, 350000, 400000]

model = LinearRegression()
model.fit(X, y)

print("Slope: ", model.coef_[0])
print("Intercept: ", model.intercept_)
Slope:  100.0
Intercept: 50000.0

This will output the slope and intercept of the line that best fits the data. The slope represents the increase in price per unit increase in size, and the intercept represents the price of a house with zero square feet (which is not meaningful in this context).
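
As a quick sanity check, plugging one of the sizes from the table into the fitted equation reproduces the listed price:

# Checking the fitted line by hand against one row of the table
slope, intercept = 100.0, 50000.0   # the values printed above

size = 2000
price = slope * size + intercept
print(price)                        # 250000.0, matching the table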

Using this information, we can predict the price of a house given its size using the equation y = mx + b:

size = 4000
price = model.predict([[size]])[0]

print("Size: ", size)
print("Predicted price: ", price)
Size:  4000
Predicted price: 450000.0

This means that we predict the price of a house with size 4000 square feet to be $450,000.

In conclusion, linear regression is a powerful and widely used machine learning algorithm for predicting numerical values. It is particularly useful for modeling the relationship between a dependent variable and one or more independent variables.

Linear regression can be used for both simple and multiple regression problems, and is particularly useful when the relationship between the independent and dependent variables is linear. By minimizing the sum of the squared residuals, linear regression is able to find the best-fit line that represents the relationship between the variables.
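
As a quick illustration of the multiple-regression case, here is a minimal sketch that adds a second, made-up feature (number of bedrooms) to the house data; the bedroom counts are invented purely for illustration:

from sklearn.linear_model import LinearRegression

# Two features per house: size (sqft) and number of bedrooms
# (the bedroom counts are illustrative, not real data)
X = [[1500, 2], [2000, 3], [2500, 3], [3000, 4], [3500, 4]]
y = [200000, 250000, 300000, 350000, 400000]

model = LinearRegression()
model.fit(X, y)

print("Coefficients:", model.coef_)   # one coefficient per feature
print("Intercept:", model.intercept_)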

The choice of appropriate independent variables and the preprocessing of the dataset are crucial for obtaining an accurate and meaningful linear regression model. Additionally, it is important to evaluate the performance of the model using appropriate metrics and validation techniques.
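
For example, here is a minimal sketch that computes two common regression metrics, the mean squared error and the R² score, on the house-price data above. For this tiny, perfectly linear dataset the fit is exact, so the numbers are trivial, but the same calls apply to real data, ideally on a held-out validation set:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# House-price data from the example above
X = [[1500], [2000], [2500], [3000], [3500]]
y = [200000, 250000, 300000, 350000, 400000]

model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)

# Two common evaluation metrics for regression
print("MSE:", mean_squared_error(y, predictions))
print("R^2:", r2_score(y, predictions))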

Overall, linear regression is a powerful tool that can be used in a variety of contexts to make accurate predictions and gain insights into the relationship between variables.
