In this guide you can find how to use Scatterplot and Kmeans in Python. We can see several examples on Scatterplot and Kmeans with matplotlib.

Scatterplot and Kmeans basic example

First we will start with imports of all libraries. Then we will read the data and visualize it by:

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

# import dataset
df = pd.read_csv("https://raw.githubusercontent.com/softhints/dataplotplus/master/data/happyscore_income.csv")

df[['country', 'GDP', 'happyScore', 'region', 'avg_income']].head()

data:

country GDP happyScore region avg_income
Armenia 0.76821 4.350 'Central and Eastern Europe' 2096.76
Angola 0.75778 4.033 'Sub-Saharan Africa' 1448.88
Argentina 1.05351 6.574 'Latin America and Caribbean' 7101.12
Austria 1.33723 7.200 'Western Europe' 19457.04
Australia 1.33358 7.284 'Australia and New Zealand' 19917.00

Next we are going to define variables for the Kmeans analysis and the scatterplot. In this case we are searching for answer of:

  • Is there a relation between happiness and GDP of a given country
# set the x and y variables
col_x = 'happyScore'
col_y = 'GDP'

Calculate the standard deviation and the mean

The standard deviation and the mean might be helpful when you work with Kmeans. We are going to find them by using Pandas methods:

  • .mean()
  • .std()

We need the mean and the standard deviation for both variables:

# calculate the standard deviation and the mean
x_mean = df[col_x].mean()
x_stdev = df[col_x].std()

y_mean = df[col_y].mean()
y_stdev = df[col_y].std()

print(x_mean, x_stdev)
print(y_mean, y_stdev)

result:

5.421909909909912 1.180800549315818
0.8419090990990992 0.3875905043131557

Calculate the normalized values

Next we are going to find the normalized values for both variables by Python and simple formula:

# calculate the normalised values on x and y
x_min = min(df[col_x])
x_max = max(df[col_x])

norm_x = ((df[col_x] - x_min)/(x_max - x_min))
df['norm_x'] = norm_x

y_min = min(df[col_y])
y_max = max(df[col_y])

norm_y = ((df[col_y] - y_min)/(y_max - y_min))
df['norm_y'] = norm_y

result of norm_y:

0      0.486184
1      0.479449
2      0.670414

Create Kmeans Clusters centers

For Kmeans we are going to use the library sklearn and it's class KMeans. In this example we will have 2 clusters which are set by n_clusters=2.

# create Kmeans clusters
from sklearn.cluster import KMeans

x_y = np.column_stack((df['norm_x'], df['norm_y']))

km_res = KMeans(n_clusters=2).fit(x_y)
clusters = km_res.cluster_centers_
clusters

The cluster information:

array([[0.74214874, 0.73933614],
       [0.35627928, 0.33903262]])

Plot Scatterplot and Kmeans in Python

Finally we can plot the scatterplot and the Kmeans by method plt.scatter. Where:

  • df.norm_x, df.norm_y - are the numeric variables for our Kmeans
  • alpha = 0.25 - is the transparency of the points. Which is useful when number of points grow
  • s = 100 - size of the data points
  • color='red' - color of the data points
figure(num=None, figsize=(10, 10), dpi=100, facecolor='w', edgecolor='k')

plt.xlabel(col_x)
plt.ylabel(col_y)

plt.scatter(df.norm_x, df.norm_y, alpha = 0.25, s = 100, color='red')
plt.title("Kmeans Scatterplot of World Happiness vs GDP", fontsize=18)
plt.scatter(clusters[:,0], clusters[:,1], s = 100000, alpha=0.30)

result is:

plot-scatterplot-and-kmeans-python

Plot data points in different colors - Optional

If you like to plot the cluster points in different colors with Scatterplot we can mark the DataFrame rows with the corresponding cluster labels by:

df['cluster'] = km_res.labels_
df1 = df[df['cluster'] == 0]
df2 = df[df['cluster'] == 1]

Then we will create two temporary DataFrames with the data for each cluster.

Finally we can plot data for each cluster in a different color:

figure(num=None, figsize=(10, 10), dpi=100, facecolor='w', edgecolor='k')

plt.xlabel(col_x)
plt.ylabel(col_y)

plt.scatter(df1.norm_x, df1.norm_y, alpha = 0.25, s = 100, color='red')
plt.scatter(df2.norm_x, df2.norm_y, alpha = 0.25, s = 100, color='green')
plt.title("Kmeans Scatterplot of World Happiness vs GDP", fontsize=18)
plt.scatter(clusters[:,0], clusters[:,1], s = 100000, alpha=0.30)

result: