Introduction to Linear Regression in Python
In this short guide you will learn the basics of linear regression in Python. Linear regression is a model that represents the relationship between variables: a dependent variable and one or more independent variables. It can help answer questions about the dependency between numeric variables.
So let's explain it with a simple example. Suppose you are a data scientist who is asked:
Is there a relation between money and happiness?
To answer it you will need to work through several steps:
- define the question (already done)
- find the data
- analyse the data
- summarize the results
Step 1: Find and prepare data for Linear Regression
Once the question is defined, we can search for data. You may then need to do some preprocessing of the data, such as (a short pandas sketch follows the list):
- cleaning bad or incomplete records
- fixing format issues
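A minimal cleaning sketch with pandas might look like this (the data and column names here are hypothetical, just to illustrate the typical steps):

import pandas as pd

# hypothetical messy data, just to illustrate typical cleaning steps
raw = pd.DataFrame({
    "income": ["2096.76", "n/a", "7101.12"],
    "score": [4.35, 4.033, None],
})

# fix format issues: coerce strings to numbers, invalid entries become NaN
raw["income"] = pd.to_numeric(raw["income"], errors="coerce")

# drop bad/incomplete records
clean = raw.dropna()
print(clean)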
Suppose we have data like:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# import and preview the dataset
df = pd.read_csv("https://raw.githubusercontent.com/softhints/dataplotplus/master/data/happyscore_income.csv")
df.head()
| country | adjusted_satisfaction | avg_income | happyScore |
|---|---|---|---|
| Armenia | 37.0 | 2096.76 | 4.350 |
| Angola | 26.0 | 1448.88 | 4.033 |
| Argentina | 60.0 | 7101.12 | 6.574 |
| Austria | 59.0 | 19457.04 | 7.200 |
| Australia | 65.0 | 19917.00 | 7.284 |
The data is taken from Kaggle, and there is no need to clean or process it.
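If you want to verify this yourself, a couple of quick pandas checks on the df loaded above will do:

# quick sanity checks before modeling
print(df.shape)          # rows and columns
print(df.isna().sum())   # missing values per column
print(df.dtypes)         # column types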
Step 2: Scatterplot and Linear Regression
In this step we are going to plot a simple scatterplot in Python. We need two numeric variables, in this case happyScore and avg_income:
# basic scatter plot
plt.xlabel('Happy Score')
plt.ylabel('Income')
plt.scatter('happyScore', 'avg_income', data=df, s=20, color='green')
plt.show()
The result is a scatterplot of avg_income against happyScore:
The next step is to perform the linear regression. We are going to look for a relation between happyScore and avg_income using the LinearRegression class from sklearn.linear_model:
# sklearn expects 2D arrays, so reshape the single columns
X = df['happyScore'].values.reshape(-1, 1)
Y = df['avg_income'].values.reshape(-1, 1)

# fit the model and predict income for every happy score
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
Y_pred = linear_regressor.predict(X)
The result is an array of predicted values (truncated below):
array([[ 1842.30538429],
[ 481.79795859],
[11387.31647189],
[14073.99675104],
[14434.50975975],
[ 5541.85554504],
[ 3318.69199137],
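Since the model is fitted, we can also read off the slope and intercept of the regression line directly (the exact numbers depend on the dataset):

# the fitted line is: avg_income ≈ slope * happyScore + intercept
print("slope:", linear_regressor.coef_[0][0])
print("intercept:", linear_regressor.intercept_[0])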
Step 3: Plot Linear Regression in Python
Finally we are going to **plot the regression line on the scatterplot** from the previous step:
plt.xlabel('Happy Score')
plt.ylabel('Income')
plt.scatter(X, Y, color='green', alpha=0.5)
plt.plot(X, Y_pred, color='red')
plt.show()
The result shows that there might be a relation between the two variables:
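The plot only suggests a relation; to quantify it we can use the regressor's built-in score method, which returns the coefficient of determination R² (a sketch, reusing X, Y and linear_regressor from above; the 6.0 below is just a hypothetical happy score):

# R² of the fit: 1.0 is a perfect fit, values near 0 mean
# the line explains little of the variance in avg_income
r2 = linear_regressor.score(X, Y)
print(f"R^2: {r2:.3f}")

# predict income for a hypothetical happy score of 6.0
print(linear_regressor.predict([[6.0]]))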
Step 4: Polynomial Linear Regression in Python
Let's also check how to extend linear regression with polynomial features in Python. This approach can fit the data more closely in some cases:
from sklearn.preprocessing import PolynomialFeatures

# expand happyScore into polynomial features: 1, x, x^2, ..., x^5
pr = PolynomialFeatures(degree=5)
X_poly = pr.fit_transform(X)

# fit an ordinary linear regression on the polynomial features
lin_reg = LinearRegression()
lin_reg.fit(X_poly, Y)

# plot the data and the polynomial predictions
plt.scatter(X, Y, color='green', alpha=0.5)
plt.scatter(X, lin_reg.predict(X_poly), color='red')
plt.show()
The result:
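Because scatter was used to draw the predictions, the curve appears as separate points. If you prefer a continuous curve, one option (a sketch, reusing X, Y, X_poly and lin_reg from above) is to sort by happy score before plotting:

import numpy as np

# sort by happy score so plt.plot draws one smooth curve
order = np.argsort(X[:, 0])
plt.scatter(X, Y, color='green', alpha=0.5)
plt.plot(X[order], lin_reg.predict(X_poly)[order], color='red')
plt.show()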
We are fitting the model with degree=5. Keep in mind that:
- a degree equal to 1 is the same as plain linear regression
- a higher degree takes more time to compute
- a higher degree follows the relation more closely, at the risk of overfitting (see the comparison sketch below)
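To compare several degrees side by side, including the degree 15 example below, one convenient pattern is an sklearn Pipeline; a sketch:

from sklearn.pipeline import make_pipeline

# fit polynomial models of increasing degree on the same data
for degree, color in [(1, 'blue'), (5, 'red'), (15, 'orange')]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X, Y)
    plt.scatter(X, model.predict(X), s=10, color=color, label=f'degree {degree}')

plt.scatter(X, Y, color='green', alpha=0.5)
plt.legend()
plt.show()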
Example for degree 15: