Hari Prashanth: 2016

Thursday, 8 December 2016

K Means Clustering in Python Spark


Business Problem:

An insurance company wants to understand its customers by segmenting them on variables such as customer income, insurance coverage, and deductibles.

Introduction:

K-Means is one of the most popular "clustering" algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
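The alternation described above can be sketched in plain Python. This is only a minimal illustration, not the Spark implementation used later in the post: the function names, the random choice of initial centroids, the squared Euclidean distance, and the `max_iter` cap are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Step (1): for every point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Step (2): each centroid becomes the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: an empty cluster simply keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate steps (1) and (2) until the assignment stops changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialisation: k random points
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels
```

Calling `kmeans(points, k=2)` on a list of coordinate tuples returns the final centroids and one cluster index per point.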

Why go for Clustering?

Organizing data into clusters such that there is:

High Intra Cluster Similarity

Low Inter Cluster Similarity

Informally, finding natural groups among objects

Step 1:

Choose the number of clusters.

Step 2:

Set the initial partition, and the initial mean vectors for each cluster.

Step 3:

For each remaining individual...

Step 4:

Get averages for comparison with Cluster 1:

Add individual's A value to the sum of A values of the individuals in Cluster 1, then divide by the total number of scores that were summed.

Add individual's B value to the sum of B values of the individuals in Cluster 1, then divide by the total number of scores that were summed.

Step 5:

Get averages for comparison with Cluster 2:

Add individual's A value to the sum of A values of the individuals in Cluster 2, then divide by the total number of scores that were summed.

Add individual's B value to the sum of B values of the individuals in Cluster 2, then divide by the total number of scores that were summed.

Step 6:

If the averages found in Step 4 are closer to the mean values of Cluster 1, then this individual belongs to Cluster 1, and the averages found now become the new mean vectors for Cluster 1.

If closer to Cluster 2, then it goes to Cluster 2, along with the averages as new mean vectors.

Step 7:

If there are more individuals to process, continue again with Step 4. Otherwise go to Step 8.

Step 8:

Now compare each individual's distance to its own cluster's mean vector with its distance to the opposite cluster's mean vector. The distance to its own cluster's mean vector should be the smaller of the two. If it is not, relocate the individual to the opposite cluster.

Step 9:

If any relocations occurred in Step 8, the algorithm must continue again with Step 3, using all individuals and the new mean vectors.

If no relocations occurred, stop. Clustering is complete.

In case the algorithm never settles on a final solution, it is a good idea to implement a check on the maximum number of iterations.
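The arithmetic in Steps 4 and 5 and the check in Step 8 can be written out as two small helpers. This is a sketch under assumptions: the function names are made up for illustration, each individual is a tuple of (A, B) values, and "distance" is taken to be Euclidean, which the steps themselves do not specify.

```python
import math


def trial_mean(members, individual):
    """Steps 4 and 5: the cluster's mean vector if `individual` joined it.

    For each variable, add the individual's value to the sum over the
    current members, then divide by the number of values summed.
    """
    count = len(members) + 1
    return tuple(
        (sum(m[d] for m in members) + individual[d]) / count
        for d in range(len(individual))
    )


def euclidean(p, q):
    """Assumed distance measure: straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def needs_relocation(individual, own_mean, other_mean):
    """Step 8: True when the individual is closer to the opposite cluster."""
    return euclidean(individual, other_mean) < euclidean(individual, own_mean)
```

For example, with hypothetical members `(1.0, 1.0)` and `(2.0, 3.0)` in Cluster 1, adding the individual `(3.0, 2.0)` gives sums of 6 for both A and B over three values, so the trial mean vector is `(2.0, 2.0)`.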

Reason for Choosing K Means from the Data Set Perspective

I chose K means because it forms natural groups, or clusters, that can be used to segment customers. Since the clusters formed were good enough to identify segment patterns, I did not proceed with any other technique.

Let's have a look at the data in Excel format

The variables in this data are given below

a) Claim Amount
b) Insurance Coverage
c) Income
d) Deductibles

Let's start clustering in Python

The optimal number of clusters for the above dataset is 4. The final output in the above screenshot shows the cluster centre data points, which are good representatives of the entire population.

My inference from the clustering output:

a) People in the first cluster have an income close to 148 and insurance coverage close to 1161, and can claim an amount of about 305.

b) People in the 2nd cluster have an income of about 53 and insurance coverage of about 890, and can claim an amount of about 186.

c) People in the 3rd cluster have an income close to 75.13 and insurance coverage of about 336, and can claim an amount of about 77.09.

d) People in the 4th cluster have an income close to 63.89 and insurance coverage of about 282, and can claim an amount of about 61.64.
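One way to use such cluster centres is to place a new customer in the segment whose centre is nearest. The sketch below copies the centre values from the inference above and makes several assumptions: deductibles are left out because no centre value is reported for them, the distance is plain unscaled Euclidean distance (so coverage, the largest variable, dominates), and the function name and sample customers are hypothetical.

```python
import math

# Cluster centres as reported above, in the order (income, coverage, claim).
CENTRES = {
    1: (148.0, 1161.0, 305.0),
    2: (53.0, 890.0, 186.0),
    3: (75.13, 336.0, 77.09),
    4: (63.89, 282.0, 61.64),
}


def euclidean(p, q):
    """Unscaled straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_cluster(customer, centres=CENTRES):
    """Return the number of the cluster whose centre is closest to `customer`."""
    return min(centres, key=lambda c: euclidean(customer, centres[c]))


# Hypothetical customer: income 70, coverage 300, claim 70.
segment = nearest_cluster((70.0, 300.0, 70.0))
```

In practice the variables would usually be scaled to comparable ranges before measuring distance; the sketch skips this to keep the reported centre values recognisable.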

Hierarchical Clustering with Python

Wholesale dataset
The dataset consists of products consumed by people in certain regions. The main objective is to cluster the products consumed by people.

Dataset:

The dataset can be downloaded from the UCI Machine Learning Repository at "https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv"

Screenshots of the code are as follows:

The above screenshots show the sequential steps of the code run in Python on the wholesale dataset; the output obtained is a dendrogram.

Inference:

Data points that are close together are formed into a cluster. The clusters thus formed tell us that people who buy similar kinds of products are grouped together in one cluster.
To select clusters from a dendrogram, we usually draw a horizontal line across it. The best place to split is where that line can move up and down the greatest distance without crossing a merge, and the number of branches it cuts gives the number of clusters. The final choice still depends on the business problem and on how many clusters we want; in our case the desired number of clusters is 11.
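The "largest vertical gap" rule for cutting a dendrogram can be sketched without any clustering library, given only the heights at which merges happen. The function name and the sample heights are hypothetical, and the sketch assumes the merge heights never decrease as clusters are joined (true for the common linkage methods such as single, complete, average and Ward).

```python
def clusters_from_largest_gap(merge_heights):
    """Pick the cut height in the widest gap between consecutive merges.

    A dendrogram over n points has n - 1 merges. Cutting between the
    i-th and (i+1)-th merge (in order of height) keeps the first i
    merges and therefore leaves n - i clusters.
    """
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merges to compare gaps")
    # Gap between each merge and the next one up, remembered with its index.
    gaps = [(heights[i + 1] - heights[i], i) for i in range(len(heights) - 1)]
    _, i = max(gaps)
    cut_height = (heights[i] + heights[i + 1]) / 2
    n_points = len(heights) + 1
    n_clusters = n_points - (i + 1)
    return cut_height, n_clusters


# Hypothetical merge heights for a five-point dendrogram.
cut, k = clusters_from_largest_gap([1.0, 1.5, 2.0, 7.0])
```

Here the widest gap lies between the merges at heights 2.0 and 7.0, so the line is drawn at 4.5 and cuts two branches. As the text notes, a business requirement can override this rule and fix the number of clusters directly.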

Tuesday, 15 November 2016

Installation of Spark and Python step by step

A help guide giving a glimpse of how to install Spark and Python on your machine

This blog is for people who want to run Spark with Python on Linux (Ubuntu Desktop) installed in Oracle VM VirtualBox on a Windows machine

Introduction to Python and Spark

Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

Developers and analysts admire Spark for many reasons, chiefly that it lets them quickly query, analyze and transform data at scale. In simple words, you can call Spark a competent alternative to Hadoop, with its own characteristics, strengths and limitations. Spark processes data in memory, with more speed and sophistication than complementary approaches such as Hadoop MapReduce.

1. Installation of Oracle VM VirtualBox Manager

In case of Windows:
1) First download the Oracle VM VirtualBox Manager
3) Open the VM VirtualBox Manager

Begin the installation process as represented above. After installing, you will find a screen like this

2. Downloading Linux (Ubuntu Desktop) and installing it in Oracle VM VirtualBox Manager

In Windows > go to Google Chrome > follow the link to the Ubuntu software > download Ubuntu Desktop by clicking the download button

Once it is downloaded, open the Oracle VM VirtualBox Manager by double-clicking its icon
1) Click on Create a new virtual machine
2) Select Ubuntu (64 bit)
3) Click Open, then click Next
4) Give memory of size 20GB for this machine to run smoothly
5) Click Next
6) Ubuntu will now be installed in Oracle VM VirtualBox Manager; this takes more than 15 minutes

Once Ubuntu is installed it will look like this inside VirtualBox Manager

3. Downloading Anaconda 4.2.0 using Ubuntu

1) Inside Ubuntu, open the internet browser
2) Download Anaconda 4.2.0 for Linux
3) Wait until the download is finished
4) Open the terminal
5) Type the following command after the $ symbol
6) Press Enter

4. Python Spark

1) Download the latest version of Spark

After the download is finished, install Spark

As soon as you have installed Spark, you must run the following commands to launch PySpark on your machine. After that you can work on this platform to execute further programs

export SPARK_HOME=/home/rajat/spark16
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
export SPARK_LOCAL_IP=LOCALHOST
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=ipython3 PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark

4a. To start Spark in command-line mode, enter the "pyspark" command and you will see the Spark screen

4b. To start Spark in an IPython notebook

You can see the Jupyter Notebook below once the PySpark commands have been run.

4c. With Spark running on our Ubuntu machine, check its status at http://localhost:

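5. Running Simple Programs:

from operator import add

from pyspark import SparkContext

# Start a local Spark context named 'pyspark'
sc = SparkContext('local', 'pyspark')

# Load the text file as an RDD of lines
text = sc.textFile("Twilight.txt")
print(text)

def tokenize(text):
    # Split one line into whitespace-separated words
    return text.split()

# Flatten every line into a single RDD of words
words = text.flatMap(tokenize)
print(words)
# Output: PythonRDD[2] at RDD at PythonRDD.scala:48

# Pair each word with a count of 1
wc = words.map(lambda x: (x, 1))
print(wc.toDebugString())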

Copy the code and paste it into the notebook line by line to get the output shown above

Spark and Python are now working on our machine. Let's start working with real-life datasets.

Hope the explanation was clear.
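The Spark listing stops once every word has been paired with a count of 1. The imported but unused `add` suggests that the pairs would next be summed per word, although the post does not show that step. The plain-Python sketch below mirrors that assumed pipeline on an in-memory list of lines so the data flow can be followed without a Spark installation; the function `word_count` and the sample lines are hypothetical.

```python
from collections import Counter
from operator import add


def tokenize(line):
    """Split one line into whitespace-separated words."""
    return line.split()


def word_count(lines):
    """Mimic flatMap -> map -> sum-per-key on a plain list of lines."""
    # flatMap(tokenize): one flat list of words across all lines
    words = [word for line in lines for word in tokenize(line)]
    # map(lambda x: (x, 1)): pair each word with a count of 1
    pairs = [(word, 1) for word in words]
    # Assumed final step: combine the counts of equal words with `add`
    counts = Counter()
    for word, one in pairs:
        counts[word] = add(counts[word], one)
    return dict(counts)


# Hypothetical input standing in for the lines of a text file.
result = word_count(["a b a", "b c"])
```

Unlike Spark, this version holds everything in memory on one machine, so it is only a way to check what the distributed pipeline is expected to compute on a small sample.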