Hari Prashanth: December 2016

K Means Clustering in Python Spark

Business Problem:

An insurance company wants to understand its customers by segmenting them on variables such as customer income, insurance coverage, and deductibles.

Introduction:

K-Means is one of the most popular "clustering" algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
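The alternation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for the insurance data: the function name `kmeans`, the random choice of starting centroids, the `max_iter` cap, and the toy points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: the starting centroids are k of the input points, drawn at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # (1) Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # (2) Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Made-up 2-D points forming two obvious groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

On this toy data the first three points end up sharing one label and the last three the other; which group gets label 0 depends on the random start.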

Why go for Clustering?

Clustering organizes data into groups such that there is:

High Intra-Cluster Similarity

Low Inter-Cluster Similarity

Informally, it finds the natural groups among objects
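One rough way to check these two properties is to compare the average distance between points that share a cluster with the average distance between points in different clusters. The sketch below uses made-up points and a hypothetical helper, `mean_pairwise_distance`; it only illustrates the idea and is not part of the original analysis.

```python
import math
from itertools import combinations


def mean_pairwise_distance(pairs):
    """Average Euclidean distance over an iterable of (point, point) pairs."""
    distances = [math.dist(a, b) for a, b in pairs]
    return sum(distances) / len(distances)


# Made-up clusters for illustration only.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5)]
cluster_b = [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Within a cluster: distances between points that share a cluster (want small).
within = mean_pairwise_distance(
    list(combinations(cluster_a, 2)) + list(combinations(cluster_b, 2))
)
# Between clusters: distances between points in different clusters (want large).
between = mean_pairwise_distance((a, b) for a in cluster_a for b in cluster_b)

print(f"within={within:.2f}, between={between:.2f}")
```

A good clustering gives a small `within` value relative to `between`.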

Step 1:

Choose the number of clusters.

Step 2:

Set the initial partition, and the initial mean vectors for each cluster.

Step 3:

For each remaining individual...

Step 4:

Get averages for comparison with Cluster 1:

Add individual's A value to the sum of A values of the individuals in Cluster 1, then divide by the total number of scores that were summed.

Add individual's B value to the sum of B values of the individuals in Cluster 1, then divide by the total number of scores that were summed.

Step 5:

Get averages for comparison with Cluster 2:

Add individual's A value to the sum of A values of the individuals in Cluster 2, then divide by the total number of scores that were summed.

Add individual's B value to the sum of B values of the individuals in Cluster 2, then divide by the total number of scores that were summed.

Step 6:

If the averages found in Step 4 are closer to the mean values of Cluster 1, then this individual belongs to Cluster 1, and the averages found now become the new mean vectors for Cluster 1.

If the averages found in Step 5 are closer to the mean values of Cluster 2, then the individual goes to Cluster 2, and those averages become the new mean vectors for Cluster 2.

Step 7:

If there are more individuals to process, continue again with Step 4. Otherwise go to Step 8.

Step 8:

Now compare each individual's distance to its own cluster's mean vector with its distance to the opposite cluster's mean vector. The distance to its own cluster's mean vector should be smaller than its distance to the other. If not, relocate the individual to the opposite cluster.

Step 9:

If any relocations occurred in Step 8, the algorithm must continue again with Step 3, using all individuals and the new mean vectors.

If no relocations occurred, stop. Clustering is complete.

In case the algorithm never settles on a final solution, it is a good idea to enforce a maximum number of iterations.
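The steps above can be sketched for two clusters and two scores (A and B) per individual. This is one possible reading, not a definitive implementation: Step 6 is interpreted as "join the cluster whose mean vector moves least when the individual is added", Steps 8 and 9 are run as repeated relocation passes capped by `max_passes`, and the function names and data are made up for the example.

```python
import math


def mean_vector(members):
    """Column-wise mean of a non-empty list of (A, B) tuples."""
    return tuple(sum(col) / len(members) for col in zip(*members))


def two_cluster_walkthrough(individuals, seed_a, seed_b, max_passes=10):
    """Sequential assignment (Steps 3-7), then relocation passes (Steps 8-9)."""
    # Step 2: the initial partition is one seed individual per cluster.
    clusters = [[individuals[seed_a]], [individuals[seed_b]]]
    for i, person in enumerate(individuals):            # Step 3
        if i in (seed_a, seed_b):
            continue
        shifts = []
        for members in clusters:                        # Steps 4 and 5
            trial = mean_vector(members + [person])
            shifts.append(math.dist(trial, mean_vector(members)))
        # Step 6 (assumed reading): join the cluster whose mean moves least.
        clusters[shifts.index(min(shifts))].append(person)
    for _ in range(max_passes):                         # Steps 8 and 9
        means = [mean_vector(c) for c in clusters]
        regrouped = [[], []]
        for person in clusters[0] + clusters[1]:
            closer_to_first = math.dist(person, means[0]) <= math.dist(person, means[1])
            regrouped[0 if closer_to_first else 1].append(person)
        # Stop when nobody moved, or when a move would empty a cluster.
        if not all(regrouped) or sorted(regrouped[0]) == sorted(clusters[0]):
            break
        clusters = regrouped
    return clusters


# Made-up (A, B) scores; individuals 0 and 1 seed the two clusters.
people = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
groups = two_cluster_walkthrough(people, seed_a=0, seed_b=1)
print(groups)
```

The `max_passes` argument plays the role of the maximum-iterations check mentioned above.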

Reason for Choosing K means from data set Perspective

K means was chosen because it forms natural groups, or clusters, that can be used to segment customers. Since the clusters it formed were good enough to identify segment patterns, I did not proceed with any other technique.

Let's have a look at the data in Excel format

The variables in this data are given below:

a) Claim Amount
b) Insurance Coverage
c) Income
d) Deductibles

Let's start clustering in Python

The optimal number of clusters is 4 for the above dataset. The final output in the above screenshot shows the cluster centre data points, which are good representatives of the entire population.

My inference about the clustering output:

a) People who fall into the 1st cluster have an income close to 148 and an insurance coverage amount close to 1161, and can claim an amount of about 305.

b) People who fall into the 2nd cluster have an income close to 53 and an insurance coverage of about 890, and can claim an amount of about 186.

c) People who fall into the 3rd cluster have an income close to 75.13 and an insurance coverage of about 336, and can claim an amount of about 77.09.

d) People who fall into the 4th cluster have an income close to 63.89 and an insurance coverage of about 282, and can claim an amount of about 61.64.
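As a small illustration of how such cluster centres can be used, the sketch below assigns a customer to the nearest of the four centres quoted above. The customer values and the helper name `nearest_cluster` are hypothetical, and deductibles are left out because no centre value for them is quoted.

```python
import math

# Cluster centres as (income, insurance coverage, claim amount), copied from
# the inference above. Deductibles are omitted: no centre value is quoted.
centres = {
    1: (148.0, 1161.0, 305.0),
    2: (53.0, 890.0, 186.0),
    3: (75.13, 336.0, 77.09),
    4: (63.89, 282.0, 61.64),
}


def nearest_cluster(customer):
    """Return the number of the cluster whose centre is closest (Euclidean)."""
    return min(centres, key=lambda c: math.dist(customer, centres[c]))


# Hypothetical customer, not taken from the dataset.
new_customer = (70.0, 300.0, 65.0)
print(nearest_cluster(new_customer))  # prints 4
```

Because the variables are on different scales, insurance coverage dominates the distance here; scaling the variables before clustering would give each one a more even influence.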

Hierarchical Clustering in Python

Wholesale dataset
The dataset consists of products consumed by people in certain regions. The main objective is to cluster the products consumed by people.

Dataset :

The dataset can be accessed from the UCI Machine Learning Repository at "https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv"

Screenshots of the code are as follows:

The above screenshots show the sequential steps of the code run in Python on the wholesale dataset; the output obtained is a dendrogram.

Inference:

All the data points which are close together are formed into a cluster. The clusters thus formed tell us that people who buy similar kinds of products are grouped together in one cluster.
To select clusters from a dendrogram, we usually draw a horizontal line at the height where it can move up and down the greatest distance without crossing a merge. That is the best place to split, and the number of vertical lines it crosses gives the number of clusters. The choice still depends on the business problem and on how many clusters we want; in our case the desired number of clusters is 11.
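The "largest vertical gap" rule for cutting a dendrogram can be sketched in plain Python. This is only an illustration on made-up points, not the code from the screenshots: it assumes single linkage (the distance between two clusters is the distance between their closest members), and the function names are hypothetical. In practice a library such as SciPy's `scipy.cluster.hierarchy` would normally build and draw the tree.

```python
import math


def merge_heights(points):
    """Single-linkage agglomerative clustering; returns each merge distance."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
        heights.append(d)
    return heights


def clusters_at_largest_gap(heights):
    """Cut the tree inside the widest gap between consecutive merge heights."""
    gaps = [later - earlier for earlier, later in zip(heights, heights[1:])]
    merges_kept = gaps.index(max(gaps)) + 1   # merges that lie below the cut
    return (len(heights) + 1) - merges_kept   # number of points minus merges kept


# Made-up 2-D points in three visible groups.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
heights = merge_heights(toy)
print(clusters_at_largest_gap(heights))  # prints 3
```

The rule only suggests a number of clusters; as noted above, the business problem can call for a different cut.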

Hari Prashanth

Thursday, 8 December 2016
