Hari Prashanth: November 2016

A help guide to installing Spark and Python on your machine

This blog is for people who want to run Spark with Python on Linux (Ubuntu Desktop) installed in Oracle VM VirtualBox on a Windows machine.

Introduction to Python and Spark

Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

Developers and analysts admire Spark for many reasons, chiefly because it lets them query, analyze and transform data quickly at scale. In simple words, you can call Spark a competent alternative to Hadoop, with its own characteristics, strengths and limitations. Spark processes data in memory, which gives it more speed and sophistication than complementary approaches such as Hadoop MapReduce.

1. Installation of Oracle VM VirtualBox Manager

On Windows:
1) Download Oracle VM VirtualBox Manager.
2) Run the installer.
3) Open the VM VirtualBox Manager.

Begin the installation process as shown above. After installing, you will see a screen like this.

2. Downloading Linux (Ubuntu Desktop) and installing it in Oracle VM VirtualBox Manager

On Windows, open a browser such as Google Chrome, follow the link to the Ubuntu download page, and download Ubuntu Desktop by clicking the download button.

Once the download has finished, open Oracle VM VirtualBox Manager by double-clicking its icon, then:
1) Click on Create a new virtual machine.
2) Select Ubuntu (64 bit).
3) Click Open, then click Next.
4) Allocate 20 GB to this machine so that it runs smoothly.
5) Click Next.
6) Ubuntu is now installed in Oracle VM VirtualBox Manager. The installation takes more than 15 minutes.

Once Ubuntu is installed, it will look like this inside VirtualBox Manager.

3. Downloading Anaconda 4.2.0 using Ubuntu

1) Inside Ubuntu, open the internet browser.
2) Download Anaconda 4.2.0 for Linux.
3) Once the download has finished, open the terminal.
4) Type the following command after the $ symbol and press Enter.

4. Python Spark

1) Download the latest version of Spark.

2) After the download has finished, install Spark.

As soon as you have installed Spark, run the following commands in the terminal to get Python Spark working on your machine. After that, you can use this platform to run further programs.

export SPARK_HOME=/home/rajat/spark16
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
export SPARK_LOCAL_IP=localhost
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=ipython3 PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
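
The two PYTHONPATH exports above make the pyspark package and its bundled py4j archive importable. As a rough sketch of the same idea from inside Python, the snippet below builds those two paths from SPARK_HOME and adds them to sys.path for the current process only. The function names are hypothetical and not part of Spark, and the wildcard for the py4j file name is an assumption made so that the sketch does not depend on version 0.10.3.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the paths that the PYTHONPATH exports point at."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j version in the file name differs between Spark releases,
    # so match it with a wildcard instead of hard-coding one version.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home=None):
    """Put the Spark Python paths at the front of sys.path (this process only)."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set")
    paths = spark_python_paths(spark_home)
    # Insert in reverse so the final order matches the returned list.
    for path in reversed(paths):
        if path not in sys.path:
            sys.path.insert(0, path)
    return paths
```

Calling add_spark_to_sys_path() before "from pyspark import SparkContext" would have a similar effect to the exports, but only for that one Python session. The exports remain the simpler choice when pyspark is started from the terminal.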

4a. To start Spark in command-line mode, enter the "pyspark" command and you will see the Spark screen.

4b. To start Spark in an IPython notebook

Once the Python Spark commands have been run, you can see the Jupyter Notebook below.

4c. Spark is now running on our Ubuntu machine. Check its status in a browser at http://localhost:<port>/ (the Spark web UI uses port 4040 by default).
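
If the status page does not load, it can help to check whether anything is listening on the port at all. The sketch below is a generic TCP check using only the standard library. The function name is_port_open is made up for illustration, and port 4040 is Spark's default web UI port, which is an assumption about this setup.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable.
        return False
```

For example, is_port_open("localhost", 4040) should return True while a Spark context is running with the default UI port, and False once it has been stopped.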

5. Running Simple Programs:

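from operator import add

from pyspark import SparkContext

# Start a local Spark context named 'pyspark'
sc = SparkContext('local', 'pyspark')

# Load the text file as an RDD of lines
text = sc.textFile("Twilight.txt")
print(text)

def tokenize(text):
    # Split one line into words on whitespace
    return text.split()

# Flatten all lines into a single RDD of words
words = text.flatMap(tokenize)
print(words)
# Output: PythonRDD[2] at RDD at PythonRDD.scala:48

# Pair each word with a count of 1
wc = words.map(lambda x: (x, 1))

# Print the lineage of the resulting RDD
print(wc.toDebugString())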

Copy the code and paste it into the notebook line by line to get the output shown above.
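
The program in this section stops after pairing each word with 1, and the imported add is never used. Presumably the next step would be wc.reduceByKey(add), which sums the 1s for each word. That is an assumption, not something this post shows. The plain-Python sketch below needs no Spark and mirrors the flatMap, map and reduceByKey(add) steps on a small list. The word_count function and the sample lines are made up for illustration.

```python
from collections import defaultdict
from operator import add


def tokenize(text):
    # Split one line into words on whitespace
    return text.split()


def word_count(lines):
    """Plain-Python equivalent of flatMap -> map -> reduceByKey(add)."""
    # flatMap: turn all lines into one flat list of words
    words = [word for line in lines for word in tokenize(line)]
    # map: pair each word with a count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey(add): combine the values that share the same key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] = add(counts[word], n)
    return dict(counts)


sample = ["the sun rises", "the sun sets"]
print(word_count(sample))
# → {'the': 2, 'sun': 2, 'rises': 1, 'sets': 1}
```

In Spark the same result would come back as a list of (word, count) pairs after calling collect() on the reduced RDD, and the order of the pairs is not guaranteed.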

Spark and Python are now working on our machine. Let's start working with real-life datasets.

I hope the explanation was clear.

Hari Prashanth

Tuesday, 15 November 2016

Installation of Spark and Python step by step