In this short guide you can learn the **basics of linear regression in Python.** Linear regression is a model which represents the relation of variables - they can be independent or dependent. Linear regression can help to answer the question of dependency or numeric variables.

So let's explain it in a simple example. Suppose you are a data scientist which is asked:

Is there a relation between money and happiness?

To prove it you will need to do it in several steps:

- define the question (it's already done)
- find data
- analyse it
- summarize it

## Step 1: Find and prepare data for Linear Regression

Once the question is defined we can search for data. Then you may need to do some preprocessing of the data like:

- clean - bad records, incomplete data
- format issues

Suppose we have data like:

```
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# import and preview dataset
df = pd.read_csv("https://raw.githubusercontent.com/softhints/dataplotplus/master/data/happyscore_income.csv")
```

country | adjusted_satisfaction | avg_income | happyScore |
---|---|---|---|

Armenia | 37.0 | 2096.76 | 4.350 |

Angola | 26.0 | 1448.88 | 4.033 |

Argentina | 60.0 | 7101.12 | 6.574 |

Austria | 59.0 | 19457.04 | 7.200 |

Australia | 65.0 | 19917.00 | 7.284 |

Data is taken from Kaggle and there's no need to clean or process it.

## Step 2: Scatterplot and Linear Regression

In this step we are going to plot simple Scatterplot in Python. We need two numeric variables like:

```
# basic plot scatter
plt.xlabel('Happy Score')
plt.ylabel('Income')
plt.scatter('happyScore', 'avg_income', data=df, s=20, color='green')
```

result:

Next step is to perform the linear regression. We are going to search for the relation between `happyScore`

and `avg_income`

. We are using class: `LinearRegression`

from `sklearn.linear_model`

```
X = df['happyScore'].values.reshape(-1, 1)
Y = df['avg_income'].values.reshape(-1, 1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
Y_pred = linear_regressor.predict(X)
```

The result is array of predicted values like:

```
array([[ 1842.30538429],
[ 481.79795859],
[11387.31647189],
[14073.99675104],
[14434.50975975],
[ 5541.85554504],
[ 3318.69199137],
```

## Step 3: Plot Linear Regression in Python

Finally we are going to** plot the regression line on the scatterplot** from the previous step:

```
plt.xlabel('Happy Score')
plt.ylabel('Income')
plt.scatter(X, Y, color='green', alpha=0.5)
plt.plot(X, Y_pred, color='red')
plt.show()
```

the result shows that there might be a relation between those two:

## Step 4: Polynomial Linear Regression in Python

Finally let's check how to **extend linear regression with polynomial features** in Python. This approach is more precise in some cases:

```
from sklearn.preprocessing import PolynomialFeatures
# prepare Polynomial Features
pr = PolynomialFeatures(degree = 5)
X_poly = pr.fit_transform(X)
pr.fit(X_poly, Y)
# Polynomial Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_poly, Y)
# plot
plt.scatter(X, Y, color='green', alpha=0.5)
plt.scatter(X, lin_reg.predict(pr.fit_transform(X)))
plt.show()
```

The result:

We are fitting the model with `degree = 5`

which:

- degree equal to 1 will be the same as the linear regression
- higher degree will take more time to compute
- higher degree will match relation better

Example for degree 15: