K-Means Clustering in PythonSpark
Business Problem:
An insurance company wants to understand its customers by segmenting them on variables such as customer income, insurance coverage, and deductibles.
Let's do the clustering in Python. The optimal number of clusters for this dataset is 4. The final output shows the cluster centre data points, which are good representatives of the entire population.
Introduction:
K-Means is one of the most popular "clustering" algorithms. K-Means stores centroids
that it uses to define clusters. A point is considered to be in a particular
cluster if it is closer to that cluster's centroid than to any other centroid.
K-Means finds the best centroids by alternating between (1) assigning data points to
clusters based on the current centroids and (2) choosing centroids (points which are
the center of a cluster) based on the current assignment of data points to
clusters.
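The alternation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for the insurance data: the function names `assign` and `update`, the toy points, and the choice of initial centroids are all assumptions made for the example.

```python
import math

def assign(points, centroids):
    """Step (1): label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Step (2): move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(d) / len(members) for d in zip(*members)))
    return new_centroids

# Toy 2-D data (made up for illustration) with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids = [points[0], points[3]]  # assumed initial centroids

for _ in range(10):  # alternate until the centroids stop moving
    labels = assign(points, centroids)
    new_centroids = update(points, labels, centroids)
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(labels)
print(centroids)
```

On this toy data the loop settles after two passes, with the first three points in one cluster and the last three in the other.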
Why go for Clustering?
Organizing data into clusters such that there is:
High Intra-Cluster Similarity
Low Inter-Cluster Similarity
Informally, finding natural groups among objects
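One rough way to check these two properties is to treat distance as the opposite of similarity: points should sit close to their own cluster's centroid (high intra-cluster similarity) while the centroids themselves sit far apart (low inter-cluster similarity). The sketch below assumes Euclidean distance and uses made-up points; the helper names are hypothetical.

```python
import math

def centroid(cluster):
    """Mean point of a cluster."""
    return tuple(sum(d) / len(cluster) for d in zip(*cluster))

def mean_intra_distance(cluster):
    """Average distance from each point to its own cluster's centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) for p in cluster) / len(cluster)

# Two made-up clusters for illustration.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
cluster_b = [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0)]

intra_a = mean_intra_distance(cluster_a)
intra_b = mean_intra_distance(cluster_b)
inter = math.dist(centroid(cluster_a), centroid(cluster_b))

# A good clustering has small intra-cluster distances relative to the
# distance between cluster centroids.
print(intra_a, intra_b, inter)
```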
Step 1:
Choose the number of clusters.
Step 2:
Set the initial partition, and the initial mean vectors for each
cluster.
Step 3:
For each remaining individual...
Step 4:
Get the averages for comparison with Cluster 1:
Add the individual's A value to the sum of A values of the individuals
in Cluster 1, then divide by the total number of scores that were summed.
Add the individual's B value to the sum of B values of the individuals
in Cluster 1, then divide by the total number of scores that were summed.
Step 5:
Get the averages for comparison with Cluster 2:
Add the individual's A value to the sum of A values of the individuals
in Cluster 2, then divide by the total number of scores that were summed.
Add the individual's B value to the sum of B values of the individuals
in Cluster 2, then divide by the total number of scores that were summed.
Step 6:
If the averages found in Step 4 are closer to the mean values of
Cluster 1, then this individual belongs to Cluster 1, and the averages found
now become the new mean vectors for Cluster 1.
If closer to Cluster 2, then it goes to Cluster 2, along with the
averages as new mean vectors.
Step 7:
If there are more individuals to process, continue again with
Step 4. Otherwise go to Step 8.
Step 8:
Now compare each individual's distance to its own cluster's mean
vector, and to that of the opposite cluster. The distance to its own
cluster's mean vector should be smaller than its distance to the other
vector. If not, relocate the individual to the opposite cluster.
Step 9:
If any relocations occurred in Step 8, the algorithm must continue
again with Step 3, using all individuals and the new mean vectors.
If no relocations occurred, stop. Clustering is complete.
Again, in case the algorithm never settles on a final solution, it
may be a good idea to implement a maximum number of iterations check.
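The steps above can be sketched for two clusters and two variables (A and B) as follows. This is one literal reading of Steps 4 to 6 (join the cluster whose trial averages stay closest to its current mean), not a definitive implementation. The function name, the choice of seed individuals for the initial partition, the Euclidean distance, and the toy values are all assumptions made for illustration.

```python
import math

def mean(members):
    """Mean vector (A, B) of a list of individuals."""
    return tuple(sum(v) / len(members) for v in zip(*members))

def sequential_two_means(individuals, seed_1, seed_2, max_iterations=100):
    # Step 2: initial partition, one seed individual per cluster (assumed choice).
    labels = {seed_1: 0, seed_2: 1}
    members = [[individuals[seed_1]], [individuals[seed_2]]]
    means = [mean(m) for m in members]

    # Steps 3-7: place each remaining individual, updating the means as we go.
    for i, person in enumerate(individuals):
        if i in labels:
            continue
        # Steps 4-5: averages each cluster would have if this individual joined.
        trial = [mean(m + [person]) for m in members]
        # Step 6: join the cluster whose trial averages stay closest to its mean.
        k = min((0, 1), key=lambda j: math.dist(trial[j], means[j]))
        labels[i] = k
        members[k].append(person)
        means[k] = trial[k]

    # Steps 8-9: relocate individuals that are closer to the other cluster's
    # mean, then recompute the means; stop when nothing moves or the cap is hit.
    for _ in range(max_iterations):
        relocated = False
        for i, person in enumerate(individuals):
            nearest = min((0, 1), key=lambda j: math.dist(person, means[j]))
            if nearest != labels[i]:
                labels[i] = nearest
                relocated = True
        if not relocated:
            break
        for k in (0, 1):
            group = [p for i, p in enumerate(individuals) if labels[i] == k]
            if group:  # an emptied cluster keeps its previous mean
                means[k] = mean(group)

    return [labels[i] for i in range(len(individuals))], means

# Toy (A, B) scores for seven individuals, made up for illustration.
data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
        (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
final_labels, final_means = sequential_two_means(data, seed_1=0, seed_2=3)
print(final_labels)
print(final_means)
```

The `max_iterations` argument is the safety check mentioned in Step 9: it bounds the relocation loop in case the assignments never settle.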
Reason for Choosing K-Means from the Data Set Perspective
The reason for choosing K-Means is that it forms natural
groups, or clusters, for segmenting customers. Since the clusters it formed were
good enough to identify segment patterns, I did not proceed with any
other technique.
Let's have a look at the data in Excel format.
The variables in this data are given below:
a) Claim Amount
b) Insurance Coverage
c) Income
d) Deductibles
My Inference about the Clustering Output:
a) People who fall into the 1st cluster have an income close to 148 and insurance coverage close to 1161, and can claim an amount of 305.
b) People who fall into the 2nd cluster have an income of 53 and insurance coverage of 890, and can claim an amount of 186.
c) People who fall into the 3rd cluster have an income close to 75.13 and insurance coverage of 336, and can claim an amount of 77.09.
d) People who fall into the 4th cluster have an income close to 63.89 and insurance coverage of 282, and can claim an amount of 61.64.
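Once the cluster centres are known, a new customer can be placed in a segment by finding the nearest centre. The sketch below uses the centre values quoted in the inference above. The ordering (income, coverage, claim amount), the unscaled Euclidean distance, and the example customer are assumptions; the deductibles variable is left out because its centre values are not reported here.

```python
import math

# Cluster centres as reported above: (income, insurance coverage, claim amount).
CENTRES = {
    1: (148.0, 1161.0, 305.0),
    2: (53.0, 890.0, 186.0),
    3: (75.13, 336.0, 77.09),
    4: (63.89, 282.0, 61.64),
}

def nearest_cluster(customer):
    """Return the number of the cluster whose centre is closest to the customer."""
    return min(CENTRES, key=lambda k: math.dist(customer, CENTRES[k]))

# Hypothetical new customer: income 70, coverage 300, claim amount 65.
new_customer = (70.0, 300.0, 65.0)
print(nearest_cluster(new_customer))  # prints 4
```

Because coverage has a much larger range than income, it dominates the unscaled distance; whether the original analysis scaled the variables is not stated.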



