This blog is for people who want to run Spark with Python
on Linux (Ubuntu Desktop) installed in Oracle VM VirtualBox on a Windows machine.
Python is an easy to learn, powerful programming language. It
has efficient high-level data structures and a simple but effective approach to
object-oriented programming. Python’s elegant syntax and dynamic typing, together
with its interpreted nature, make it an ideal language for scripting and rapid
application development in many areas on most platforms.
Spark is popular with developers and analysts because it lets them quickly query,
analyze and transform data at scale. In simple words, you can call Spark a
competent alternative to Hadoop, with its own characteristics, strengths and
limitations. Spark processes data in memory, which lets it work with more speed
and sophistication than complementary approaches like Hadoop MapReduce.
1. Installation of Oracle VM VirtualBox Manager
On Windows: 1) First download the Oracle VM VirtualBox Manager 2) Begin the installation process 3) Open the VM VirtualBox Manager
After installing, you will see the VirtualBox Manager main screen.
2. Downloading Linux (Ubuntu Desktop) and installing it in Oracle VM VirtualBox Manager
In Windows, open Google Chrome, follow the link to the Ubuntu download page, and
download Ubuntu Desktop by clicking the download button.
Once it is downloaded, open the Oracle VM VirtualBox Manager by double-clicking
its icon: 1) click on create a new virtual machine 2) select Ubuntu (64 bit)
3) click Open, then click Next 4) give the machine a memory size of 20GB so it
runs smoothly 5) click Next 6) this installs Ubuntu in Oracle VM VirtualBox
Manager; it will take more than 15 minutes to install.
Once Ubuntu is installed, it appears as a virtual machine inside the VirtualBox Manager.
3. Downloading Anaconda 4.2.0 using Ubuntu
1) Inside Ubuntu, open the internet browser 2) download Anaconda 4.2.0 for Linux
3) wait until the download is finished 4) open the terminal 5) type the install
command after the $ symbol 6) press Enter
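After the installer finishes, one quick way to confirm which Python the terminal picks up is a short script like the one below. This is only a sketch: the function name describe_python is made up for illustration, and the check for "anaconda" in the interpreter path assumes the default install location, which may not match your setup.

```python
import sys


def describe_python():
    """Return the running interpreter's version and location."""
    executable = sys.executable or ""
    return {
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "executable": executable,
        # Assumption: a default Anaconda install has "anaconda" in its path.
        "looks_like_anaconda": "anaconda" in executable.lower(),
    }


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(key, "=", value)
```

If looks_like_anaconda comes back False, the terminal is probably still using the system Python, and opening a new terminal after the install is usually the first thing to try.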
4. Python Spark
1) Download the latest version of Spark
After the download is finished, install Spark.
As soon as you have installed Spark, you must execute the following commands
to run Python Spark on your machine. After that you can use this platform to
run further programs.
export SPARK_HOME=/home/rajat/spark16
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
export SPARK_LOCAL_IP=LOCALHOST
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=ipython3 PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
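The two PYTHONPATH exports above are what let Python find the pyspark package and its bundled py4j library. As a rough sketch of the same idea from inside Python, the snippet below builds those two paths from SPARK_HOME and reports whether they exist on disk. The function names pyspark_paths and add_pyspark_to_sys_path are hypothetical helpers, not part of Spark, and the py4j file name is simply the one used in this post; it differs between Spark releases.

```python
import os
import sys


def pyspark_paths(spark_home, py4j_zip="py4j-0.10.3-src.zip"):
    """Build the two PYTHONPATH entries used by the export lines."""
    python_dir = os.path.join(spark_home, "python")
    return [python_dir, os.path.join(python_dir, "lib", py4j_zip)]


def add_pyspark_to_sys_path(spark_home):
    """Prepend missing entries to sys.path; report which exist on disk."""
    status = {}
    for path in pyspark_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)
        status[path] = os.path.exists(path)
    return status


if __name__ == "__main__":
    # Falls back to the path from this post if SPARK_HOME is not exported.
    home = os.environ.get("SPARK_HOME", "/home/rajat/spark16")
    for path, exists in add_pyspark_to_sys_path(home).items():
        print(path, "found" if exists else "MISSING")
```

A MISSING line usually means SPARK_HOME points at the wrong folder or the py4j version in the lib directory has a different number.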
4b. To start Spark in an IPython notebook
Once the Python Spark commands are run, the Jupyter Notebook opens.
4c. Spark is now running on our Ubuntu machine; check out its status at http://localhost:
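If you would rather check the status page from Python than from the browser, a small helper along these lines can tell you whether anything answers at a given address. It is a sketch using only the standard library: is_reachable is a made-up name, and the URL you pass in, including the port, has to come from your own Spark startup output because it is not assumed here.

```python
import http.client
import urllib.error
import urllib.request


def is_reachable(url, timeout=2.0):
    """Return True if an HTTP GET to url succeeds, False otherwise."""
    # Bypass any system proxy, since the target is the local machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Connection refused, timeout, bad response or malformed URL.
        return False
```

Call it as is_reachable(url) with the full status page address; a False result while the notebook is running usually means the port in the URL is not the one Spark chose.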
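5. Running Simple Programs:
from pyspark import SparkContext
# Create a local SparkContext named 'pyspark'
sc = SparkContext('local', 'pyspark')
# Load the text file as an RDD of lines
text = sc.textFile("Twilight.txt")
print(text)
from operator import add
def tokenize(text):
    return text.split()
# Split every line into words and flatten the result into one RDD
words = text.flatMap(tokenize)
print(words)
PythonRDD[2] at RDD at PythonRDD.scala:48
# Pair each word with a count of 1
wc = words.map(lambda x: (x, 1))
print(wc.toDebugString())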
Copy the code and paste it into the notebook line by line to get the output shown above.
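To see what the flatMap and map steps in the program above actually compute, it can help to mirror them in plain Python on a couple of lines of text. The sketch below does that without Spark. The sample lines are made up, not taken from Twilight.txt, and the final summing step is an addition for illustration: the program above stops after map, and in Spark the counts would typically be summed with reduceByKey(add).

```python
from collections import Counter
from itertools import chain


def tokenize(text):
    """Split a line on whitespace, as in the Spark program."""
    return text.split()


def word_count(lines):
    # flatMap(tokenize): one flat stream of words from all lines
    words = chain.from_iterable(tokenize(line) for line in lines)
    # map(lambda x: (x, 1)): pair every word with a count of 1
    pairs = ((word, 1) for word in words)
    # Sum the 1s per word; in Spark this is usually reduceByKey(add)
    totals = Counter()
    for word, one in pairs:
        totals[word] += one
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample lines, used only to show the shape of the result.
    sample = ["the lion and the lamb", "the lamb"]
    print(word_count(sample))
```

The difference in Spark is that each of these steps is lazy and spread across partitions, which is why printing an RDD shows a description such as PythonRDD[2] rather than the words themselves.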
Spark and Python are now working on our machine. Let's
start working with real-life datasets.
Hope the explanation was clear.




