A Shortcut to Big Data - Lesson Plan 2, Installing Spark Standalone on Windows 10: Smart Programmer Series

satya - 9/8/2019, 3:09:13 PM

Introduction

If you want to learn Bigdata, Spark is a *very* good entry point. That means you don't have to start your Bigdata journey with HDFS, MapReduce, Hive, Pig, Yarn, and numerous other tools. It also means you don't have to learn the BigTheory behind Bigdata first.

Recent shifts in Bigdata infrastructure make Spark a good starting point. This significantly reduces learning time; it is a better ramp. You can learn the other Bigdata tools mentioned above at leisure while you are already reaping the benefits of Bigdata processing with Spark.

So, the problem of learning Bigdata reduces to learning Spark. Learning Spark involves two efforts.

The first is the effort and learning needed to set up a Spark cluster. You need a Spark cluster to run your Spark code, even a simple hello world. Your best bet is to set this cluster up using Databricks in an Azure pay-as-you-go account. I have documented this in a previous article/lesson; I will post a link to it below along with other links.

The second effort comes once you have a cluster. Although you can use notebooks to write Python code to run on it, you really need an IDE to write that code with intellisense, test it to an extent on your local machine, check it into GitHub, and have the Databricks cluster pick those files up from GitHub to run on a real cluster in Azure.

This document has the steps to install Spark to run standalone on Windows 10. This is a manual, multi-step installation process, prone to inconsistencies and errors. It is a deliberate effort to save you time, so you can install Spark on Windows without spending days on the effort.

The important thing to say at the outset is this: take courage, for IT IS POSSIBLE to install Spark on Windows.

satya - 9/8/2019, 3:22:38 PM

What I am going to cover

1. Essential Installs for Spark (a quick list of tools). This will also include what the key environment variables look like at the end of the installation

2. A More detailed list of installs

3. Warnings before you start the installations. I ran into a lot of gotchas; let me state them in brief.

4. Links and other references you need as you install these tools. These references start with a link to a journal I kept during the entire installation process. You can browse it to see if I missed anything in this summary article.

5. Detailed analysis for each of the installations

6. How to test the installation: Running PySpark interactive shell

7. Submitting pyspark python files with the spark-submit command to run on the local Spark standalone cluster

8. Working with VSCode on pyspark code

9. How to exercise the PySpark API

10. Now that you have a house, how to begin the practice of living in PySpark

satya - 9/9/2019, 10:49:19 AM

1. Essential Installs

satya - 9/9/2019, 10:53:35 AM

The Essential List


JDK 1.8 (not latest)

Python 3.7.4 (latest)

Spark 2.4.3 (latest)

Hadoop binaries (also referred to as winutils: 2.7.3)

satya - 9/9/2019, 11:04:17 AM

Essential list elaborated

JDK 1.8 (not latest):

Don't install the latest. It didn't work for me with JDK 12. There are some method discrepancies between the JDKs; one would think they would have been compatible. So stick to JDK 1.8. Also make sure you set JAVA_HOME to the parent of the "/bin" directory in that distribution.

Python 3.7.4 (latest).

I installed the latest from python.org. Do not install it from the Windows 10 Store. The Anaconda distribution is another suggested option; I have not tried it, but it will likely work.

Spark 2.4.3 (latest):

I just got the latest from Apache Spark site.

Hadoop binaries (also referred to as winutils: 2.7.3).

Although one can get these as part of the Hadoop download from Apache, many sources suggest downloading them from a GitHub repo called "winutils". These binaries are maintained by individuals. That repo is now superseded by another repo, also called 'winutils', which is listed later.

satya - 9/9/2019, 11:34:54 AM

2. A More detailed list of installs

Once Spark is installed, you need an IDE to code in Python and pyspark. I am using VSCode already, so I am going to use the Python extensions for VSCode to let me code. This section shows the additional installs I worked with in this effort.

satya - 9/9/2019, 11:39:19 AM

The full list is

//Essential

JDK 1.8 (not latest)

Python 3.7.4 (latest)

Spark 2.4.3 (latest)

Hadoop binaries (also referred to as winutils: 2.7.3)

//Additional

PySpark, as a python package (via pip). Its main purpose is not clear to me yet, whether it is for intellisense or some other reason.

VScode (if you don't have it)

7-zip (to unzip Spark and other downloads)

Git (To download hadoop binaries from Github)

GitHub Desktop (just to find out what the fuss is about)

//Vscode extensions

MS python extension

Gitlens (integrated with vscode)

other python extensions

satya - 9/9/2019, 11:51:20 AM

Order of installs

I don't think the order matters too much; it is only when you run Spark that all the dependencies need to be available. Nevertheless, here is a reasonable order.

First, read through all of this document before you install any of it

7 Zip (as you need it to unzip a number of things)

JDK 1.8 (usually an installer program sets this up). Make sure to install it into a good, short directory name.

Python from python.org

Winutils from Github (Hadoop binaries)

Spark

satya - 9/9/2019, 11:51:53 AM

3. Warnings before you start the installations. I ran into a lot of gotchas; let me state them in brief.

satya - 9/9/2019, 12:00:04 PM

For Spark to run, here is what your system environment variables should look like once you finish the installs:


System level Environment variables:

#***************************************
#Java options to disable ipv6
#***************************************
_JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"
JAVA_HOME "%install-dir%\jdk8\jre"
HADOOP_HOME C:\satya\i\hadoop312-common-bin

#***************************************
#Add the following paths to the system path variable
#***************************************
#Java path
%install-dir%\jdk8\jre\bin

#python-path
%install-dir%\python374
%install-dir%\python374\scripts


#hadoop-path
%install-dir%\hadoop312-common-bin\bin

satya - 9/9/2019, 12:02:45 PM

A Book reference: Apache Spark in 24 hours

Finally, this is the book that helped me navigate the installation of Spark on Windows. Do buy it; it is a good book.

There is also an online webpage where the installation section of the book is made available.

satya - 9/9/2019, 12:05:34 PM

Read this: Installing Spark, By Jeffrey Aven, Feb 16, 2017

Installing Spark, By Jeffrey Aven, Feb 16, 2017

Read this to see what environment variables are set and what the other essential installation tasks are. The page has Mac, Linux, and Windows instructions, so just focus on the Windows section. Give it a quick read, about 10 minutes; I have explained the rest here, including what is in that article.

satya - 9/9/2019, 12:06:32 PM

_JAVA_OPTIONS env variable

This is to disable the IPv6 internet option. I don't know why; I am just following that book's recommendation. I have not tested without it.

satya - 9/9/2019, 12:08:50 PM

Why do I have to disable IPv6 for Java applications when installing Spark

The answer is not that important, but if you are curious to follow up, see if anyone has explained this better:

Why do I have to disable IPv6 for Java applications when installing Spark

Search for: Why do I have to disable IPv6 for Java applications when installing Spark

satya - 9/9/2019, 12:55:17 PM

3. Key Lessons/Gotchas

1. Take courage that it is possible to install spark on windows and start the learning process

2. You have to use JDK 8. Don't use any higher version and don't install the latest. It didn't work for me with JDK 12; there are some method discrepancies between the JDKs, though one would think they would have been compatible. So stick to JDK 1.8. Also make sure you set JAVA_HOME to the parent of the "/bin" directory in that distribution.

3. Turn off the IPV6 as indicated above.

4. Spark says it does not depend on Hadoop, but it appears to need the Hadoop binaries that are already built for Windows. Some individuals have prepared them and made them available on GitHub for each release of Hadoop; this repo is called "winutils". Either download the entire repo and pick the version you want, or download just the directory for your version (this is described later in the document). Bottom line: you don't need to install Hadoop wholesale from Apache.

5. Python has a pip module called "pyspark". When you download Spark, it comes with its own python subdirectories that allow you to submit/run pyspark programs, so I am not sure yet what the pip module's role is and you can skip installing it with pip. When I did install it, the installation failed horribly; read later how to fix that.

6. When you install Python make sure it is 64-bit. The default download that shows up at python.org seems to be a 32-bit version, with no explicit directions to download the 64-bit one (but if you look for the x64 version you will find it; sometimes it is also labeled AMD64). A small check for this appears right after this list.

7. You can have multiple python versions at the same time. Your IDEs can choose which python interpreter to use.

8. Install Python from python.org and not from the Windows Store. The Anaconda distribution will likely work but I haven't tried it; I have read briefly that Anaconda has a lot of packages dedicated to data science, but I am not there yet. When you install Python, make sure to add both of the following to the path: \python374\ and \python374\scripts\

9. The intellisense for Python in VSCode is still confusing to me. More research is needed, especially to discover the "pyspark" API in code.

10. There is a URL where the pyspark API is documented, just like the JDK API. This is a useful resource, especially for understanding example source code.

11. The Spark install comes with examples of pyspark code under the examples\..\python subdirectory. Load them into the VSCode editor and use the PySpark API reference to walk through the code.

12. You will use the spark-submit command to submit python code to run. (The command structure is detailed later.)

13. Things I have done that you can avoid: 1) installed the wrong JDK, 2) installed Python from the Windows Store, 3) installed 32-bit Python, 4) set JAVA_HOME incorrectly for JDK 8.

14. VSCode, when loading a pyspark .py file, shows an error on the "from" import of pyspark code. (I had to run "pip install pyspark" to fix this.) I am not sure yet why this is.
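
As a quick way to confirm point 6 above, here is a tiny Python check of my own (not from the book or any official guide) that you can paste into any Python prompt; a 64-bit interpreter prints 64:


#quick check: is this Python interpreter 64-bit?
import platform
import struct

print(platform.python_version())     # e.g. 3.7.4
print(struct.calcsize("P") * 8)      # prints 64 for a 64-bit build, 32 otherwise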

satya - 9/9/2019, 2:50:50 PM

4. Key links and other references

Links and other references you need as you install these tools. These references start with a link to a journal I kept during the entire installation process. You can browse it to see if I missed anything in this summary article.

satya - 9/9/2019, 2:52:05 PM

PySpark on Windows 10: Installation Journal

This is the journal I kept during the entire installation process. You can browse it to see if I missed anything in this summary article.

satya - 9/9/2019, 2:54:03 PM

Here are notes from the Sams 24 Hours Spark book on installing Spark

You can also use this as a primary guide once you have read all of the material here, as by then you will know what can go wrong and what the nuances are.

satya - 9/9/2019, 2:56:30 PM

Spark home page: spark.apache.org. You will see spark download links here

satya - 9/9/2019, 3:01:18 PM

Key things around spark, while we are at this URL

Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.

Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.

You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.

satya - 9/9/2019, 5:23:02 PM

Spark Python API docs

Just like the API docs for Java and the JDK, these are the Python API docs for Spark. You will use these references a lot.

satya - 9/9/2019, 5:23:41 PM

A good entry point into those APIs is the pyspark package

satya - 9/9/2019, 5:26:15 PM

As you start learning, here is the Spark Overview from apache.org

You will find starting examples, quick starts, and other docs from Apache: how to run a spark-submit job, downloads, etc.

satya - 9/9/2019, 5:29:30 PM

Download page for Apache spark

Use this as the starting page to download Spark. You can download spark-2.4.4-bin-hadoop2.7.tgz from here. Although the file name mentions a Hadoop version, it does not contain any Hadoop binaries.

To unzip that file you need 7-zip. Download that program from here for Windows. You will notice that it is a secure site (https), which means its URL is validated and you can trust what is downloaded from it as long as the site is not compromised. Download the 64-bit version of the executable.

Common hadoop binaries from github repo: steveloughran/winutils

Windows binaries for Hadoop versions up to Hadoop 2.6.0. For later releases you need to use the repo linked below.

Link to winutils repo for later releases: cdarlint/winutils

winutils.exe hadoop.dll and hdfs.dll binaries for hadoop windows

The basic, innocuous goal of syntax checking for PySpark in VSCode on a Windows box has now led to: 8 hours so far, a new laptop, Spark, Python, Hadoop, 7-zip, Git, and still on the move....

So why?

Apparently Spark standalone clusters on Windows need some Hadoop binaries.

Downloading them from Apache Hadoop is a 300 MB affair, and I am not even sure it would give me the required bin directory.

More searching leads to a GitHub repo called "winutils" that has binaries built for each version of Hadoop, just the bin directory.

Well, here I am downloading the whole repo, although there seems to be a way to download a partial repo using Git! Well, for some other time.


HADOOP_HOME=<your local hadoop-ver folder>
PATH=%PATH%;%HADOOP_HOME%\bin

So download the winutils from this cdarlint path

Use the clone or download option from GitHub. That will give you all the versions in that repo. If you want to be narrow, which you can be, and want the binaries just for the latest release of Hadoop, you need to install Git and use it to download a specific subdirectory; see my install journal for how to do this. If you have the bandwidth, though, I wouldn't bother with that. Just download the whole repo and use what you need.

Once you download it you will see things like


C:\satya\i\hadoop-winutils\winutils\hadoop-2.6.5\bin
C:\satya\i\hadoop-winutils\winutils\hadoop-2.7.3\bin

Then copy your version into


C:\satya\i\hadoop312-common-bin\bin

where the hadoop_home is

C:\satya\i\hadoop312-common-bin

and the path is 

C:\satya\i\hadoop312-common-bin\bin

After that you can leave the directory C:\satya\i\hadoop-winutils as is or delete it
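
To make sure the copy landed where Spark will look for it, here is a small Python sketch of my own (the file name check_winutils.py is just a suggestion) that reads HADOOP_HOME and checks for winutils.exe:


#check_winutils.py: sanity check that HADOOP_HOME points at the copied binaries
import os

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)

if hadoop_home:
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print("winutils.exe present:", os.path.isfile(winutils))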

Download Python from here for windows

You will find various Python releases here. Make sure you find an installer executable that is 64-bit. It reads like this: Windows x86-64 executable installer.

Download it and install it into a directory like \python374. This will need to be added to the system path; also add \python374\scripts to the path. You will need these paths for a) Spark, b) VSCode, and c) installing additional Python modules through pip (a Python package installer).
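
Once the paths are in place, this quick check (my own addition, not from any guide) tells you which Python a plain "python" command actually picks up, which catches the Windows Store and 32-bit gotchas early:


#which python is actually on the PATH?
import sys

print(sys.executable)   # expect something like C:\satya\i\python374\python.exe
print(sys.version)      # expect 3.7.4 and a mention of 64 bit (AMD64)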

MS Document: Getting started with python in vscode is here

Caution: do not follow the instructions there to install Python through the Windows 10 Store. Instead, use python.org as above to install Python. Just be careful about "pip"; it may be better to run "pip" from an outside command line and not the one inside VSCode.

However, this link may be useful for other aspects as you get more into VSCode and Python.

Using multiple python environments in vscode

It is a bit strange how IDEs seem to treat Python. It appears you can pick a different Python interpreter for each folder that is opened in VSCode. It is too early for me to comment in detail on this.

Here is a link from python.org on working with python on windows

There is a lot of Windows-specific info on Python there. I need to explore it more; for now I leave this here as a key reference to come back to.

PyPi.org

This is where public Python packages such as PySpark are kept.

Here is a link in PyPi to PySpark package

I need more research on the relationship between Python, Spark, and PySpark to speak more lucidly. For now, I seem to need to install pyspark to get rid of errors in VSCode intellisense. That happens through a "pip" install, which pulls the package from pypi.org behind the scenes.

Java SE Development Kit 8 Downloads: from Oracle

There are so many JDK versions and types. Should I use the Oracle version, or will OpenJDK do? There are so many URLs and sites to download it from; some downloads prompted for an Oracle login. Look around, validate the right URL, and get your JDK SE version.

This link I navigated down from /oracle/java, so hopefully it is the correct one. In here, pick the Windows 64-bit version; I think it is called jdk-8u221-windows-x64.exe and it is under the title "windows 64". This is the base JDK SE version 8. It only installed the JRE, which seems to be sufficient. When you install, choose "change directory" and pick a target directory such as \jdk8\jre. Then add that path to the system path as c:\satya\i\jdk8\jre\bin\. JAVA_HOME should be c:\satya\i\jdk8\jre, the parent of the "bin" directory, even though this directory is empty other than the "\bin" in it.
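
To confirm the JDK install and path are right, a small sketch of my own like this (you can also just type java -version in a command window) should report a 1.8 version:


#check_java.py: confirm that java is on the PATH and is version 1.8
#(java -version writes its output to stderr, so that stream is captured here)
import subprocess

result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr)    # expect something like: java version "1.8.0_221"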

MIT Archive of Shakespeare sonnets text

When you start using pyspark code you will want some text files to process. This URL has text you can copy into files and then analyze for words using the pyspark sample code.

5. Detailed analysis for each of the installations

Use the link and instructions from the previous section to install JDK 8.

It has to be JDK8 and not a later version.

Install it in a good path, for example: c:\i\jdk8\jre

It only installs the JRE.

Choose a "change directory" for a target directory such as \jdk8\jre

Then add that path to the system path as: c:\satya\i\jdk8\jre\bin\

and JAVA_HOME to \jdk8\jre


JAVA_HOME to \jdk8\jre
Path: \jdk8\jre\bin

Use the links above to install from python.org

Make sure to pick 64-bit Python; the default links seem to point to the 32-bit version.

Install it to a good directory name like c:\i\python374

The path should point to c:\i\python374. Notice there is no \bin here

Also add c:\i\python374\scripts to the path (this is useful for invoking pip to install other Python packages).


\python374
\python374\scripts

//Notice there is no python_home
//as far as I know for now

Before installing spark you need a program to unzip a .tgz file (gzipped tar file).

So download and install 7zip from the links above

Download spark .tgz file from apache link above

Use 7-zip to extract the .tar file from the .tgz first

Then use 7-zip again to extract the contents of the .tar file into a suitable directory

I used c:\i\spark234

I have set spark_home to that directory

and the path to c:\i\spark234\bin

However I don't think those paths are necessary.


spark_home=c:\i\spark
path to c:\i\spark\bin

See the link above to download the hadoop binaries from github

Get the whole repo, unless you know how to get a particular directory

For the release of Hadoop that you need, copy those files to c:\i\hadoop312-common-bin.

The 312 refers to the Hadoop version (3.1.2). Look at the Spark release notes to see what version of Hadoop is used; hopefully anything after that is OK.

You need to set 2 environment variables

hadoop_home=c:\i\hadoop312-common-bin

and the path to c:\i\hadoop312-common-bin\bin


_JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"

This is to disable ipv6. Not sure why.

Many of the online links and books suggest using the 'setx' command line option.

Because there are many installs and many system variables, I found the Windows UI a better option for setting them.

In the Windows search (on the taskbar at the bottom), search for "Environment"; you will see "Edit Environment Variables".

Use that to open the "System Properties" window.

At the bottom you will see the option to edit the environment variables.

In the subsequent window you can see which system variables are already set. This visibility is useful, as you need to make sure all the variables are set, including the path.

The path editor is also very nice as it shows you what paths are already set. To add a new path is also easy and intuitive.

This method avoids a number of errors


#***************************************
#Java options to disable ipv6
#***************************************
_JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"
JAVA_HOME "%install-dir%\jdk8\jre"
HADOOP_HOME C:\satya\i\hadoop312-common-bin

#***************************************
#Add the following paths to the system path variable
#***************************************
#Java path
%install-dir%\jdk8\jre\bin

#python-path
%install-dir%\python374
%install-dir%\python374\scripts


#hadoop-path
%install-dir%\hadoop312-common-bin\bin

#Likely optional spark home
spark_home=%install-dir%\spark234 (whatever version you have)

#Likely optional spark path
%install-dir%\spark234\bin
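
After setting these (and opening a fresh command window so they take effect), a small Python sketch like the following, my own addition, can confirm that everything is visible:


#check_env.py: sanity check the Spark related environment variables
import os

for name in ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME", "_JAVA_OPTIONS"):
    print(name, "=", os.environ.get(name, "<not set>"))

#show only the PATH entries that matter for this setup
for entry in os.environ.get("PATH", "").split(os.pathsep):
    if any(piece in entry.lower() for piece in ("jdk", "python", "hadoop", "spark")):
        print("PATH entry:", entry)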

When you install software on Windows 10, unless you are just unzipping it and setting env variables yourself, it is better to run the installer programs as an administrator.

To do that, right-click on the executable and run it as admin.

Sometimes you may have to do that for the command line as well.

6. How to test the installation: Running PySpark interactive shell

Open a command window (I don't think you need to be an admin for this).

Run: c:\satya\i\spark\bin\pyspark

This should open an interactive pyspark shell that has a command prompt that looks like

>>>

That verifies that you have

1. Right jdk

2. Python is available

3. Hadoop binaries are located

4. spark is installed
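
To go one step further at that >>> prompt, you can try a couple of lines; the pyspark shell already creates a SparkSession named spark and a SparkContext named sc for you, so something like this should work:


#type these at the pyspark >>> prompt
spark.range(5).count()                                        # should print 5
sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2).collect()   # should print [2, 4, 6, 8]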


mkdir C:\tmp\hive
C:\Hadoop\bin\winutils.exe chmod 777 /tmp/hive

It may be safer to run the command line as an admin for this command. I don't remember now how I did it. I also don't see any files in there. So I don't even know if this is needed.

7. Submitting pyspark python files with the spark-submit command to run on the local Spark standalone cluster

Although you can test pyspark code in the interpreter, I suggest instead using the spark-submit command to run a python file that has the code in it. This section covers that process.

The Spark Python examples are located here (they came as part of the Spark installation): C:\satya\i\spark\examples\src\main\python

One of the files we are going to use from here is called "wordcount.py". This prints the list of words in a text file and how many times each word is repeated.

This file "wordcount.py" takes a text file as an input to read the sentences from.

I suggest you visit the MIT link for the sonnets, pick a sonnet (preferably the second one), and make a text file out of it. Drop this in the same directory as the examples for now to keep the command syntax short. Call this file "sonnet2.txt".


c:\satya\i\spark\bin\spark-submit --master local[4] wordcount.py sonnet2.txt
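
If you prefer to start from a file you typed yourself before opening the shipped wordcount.py, here is a minimal sketch of the same idea (my own simplified version, not the file from the Spark examples). Save it as, say, mywordcount.py next to sonnet2.txt:


#mywordcount.py: a simplified word count sketch, not the shipped example
from __future__ import print_function

import sys
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: mywordcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession.builder.appName("MyWordCount").getOrCreate()

    #read the file, keep just the line text, split into words, count each word
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda line: line.split(' ')) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(add)

    for word, count in counts.collect():
        print(word, count)

    spark.stop()

The submit command is the same as above, just with mywordcount.py in place of wordcount.py.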

8. Working with VSCode on pyspark code

You really need an IDE to work with code.

You can choose your favorite one. Many folks are using PyCharm, a JetBrains IDE. I am a bit of a generalist, so I decided to use VSCode, which I have started learning for JavaScript.

The extension I have enabled in vscode is called "Microsoft Python: ms-python.python"

The extension says: IntelliSense, linting, debugging, code navigation, code formatting, Jupyter notebook support, refactoring, variable explorer, test explorer, snippets, and more!


###############################################
# To test flatmap
###############################################

from __future__ import print_function

import sys
from operator import add

from pyspark.sql import SparkSession

#***********************************
#Function: printCollected
#***********************************


def printBeginMsg(msg):
    print ("*****************************")
    print ("*" + msg)
    print ("*****************************")

def printEndMsg(msg):
    print ("*****************************")
    print ("*" + msg)
    print ("*****************************")

def printCollected(msg, collectedRDD):
    printBeginMsg(msg)
    for item in collectedRDD:
        print (item)
    printEndMsg(msg)

#***********************************
#End of function
#***********************************

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()


    #
    # spark.read is a property
    # it returns a dataframereader
    #
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    printCollected("Raw lines", lines.collect())

    lineAsListOfWords = lines.map(lambda x: x.split(' ')) 
    printCollected("Raw lines split into words. Each line is a list of words",         lineAsListOfWords.collect())

    #flatMap flattens each line's list of words into one single list (lowercased)
    justAListOfWords = lineAsListOfWords.flatMap(lambda x: [word.lower() for word in x])
    printCollected("Just A List of Words, from flatmap", justAListOfWords.collect())

    #make each word a list which is (word, length-of-thatword)
    # WordObject: (word, length, howmany)
    listOfWordObjects = justAListOfWords.map(lambda x: (x, len(x), 1))
    printCollected("List of Word Objects", listOfWordObjects.collect())

    spark.stop()

from pyspark.sql import SparkSession

When I tried "pip install pyspark" to satisfy that import in VSCode, I ran into terrible installation issues.

It looks like "pip install pyspark" requires a prerequisite module.


//the pip script is in the python\scripts directory
//As long as that is on the path you can run pip

//Run the following from any command line
//Safe to run the command line as an admin

//Install this first
pip install pypandoc

//Then install this
pip install pyspark
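
Once both installs finish, a quick check of my own devising confirms that the import VSCode was complaining about now resolves:


#verify that the pip installed pyspark module resolves
import pyspark
from pyspark.sql import SparkSession   # the import VSCode was flagging

print(pyspark.__version__)             # e.g. 2.4.x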

9. How to exercise the PySpark API

Now you are ready to explore the PySpark API.

Put a bookmark on the PySpark API docs from Spark (these are listed in the links section above).


#Refer to the following package in the APIs
from pyspark.sql import SparkSession

#All starts with a SparkSession object
#Look at the API for SparkSession
spark = SparkSession\
    .builder\
    .appName("PythonWordCount")\
    .getOrCreate()

#This creates a new SparkSession with a name
#using the standard builder pattern

#Now you can use the SparkSession to
#read files as data frames and RDDs

    # spark.read is a property
    # it returns a dataframereader
    # .text returns a dataframe
    # dataframe has an RDD
    # Get a new RDD that is just a collection of lines
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

    #Bring it back to the driver and print it
    #Because RDDs are on multiple machines up until "collect"
    printCollected("Raw lines", lines.collect())

10. Now that you have a house, how to begin the practice of living in PySpark

Go through spark python samples

Keep a book mark on the Spark APIs for python

There are likely over 80 verbs in spark to achieve mastery

Understand how to process through native spark and also through Spark SQL

Understand how to keep data in Parquet tables

Understand the IDE for better intellisense

Understand the IDE for better code snippets

Implement the samples yourself in other languages like Java and Scala and see what they offer

Look at Spark/R library

See how to submit the files to databricks clusters

What are user defined functions in python for spark?

How do you package python code for deploys? (like libraries)

How do you use Spark UI?

How are SparkContext and SparkSession different?

Meaning of Collect

Difference between transformations and actions in Spark performance

How does VSCode or Python resolve Python imports? Is that a path?

rdd.filter

map

take

count

flatmap

distinct

sample

join

intersection

glom

collect

takeSample

etc.

etc.
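
To start practicing some of the verbs listed above, here is a small self-contained sketch of my own (the data is made up) that exercises a handful of them; run it with spark-submit just like the word count:


#verbs_practice.py: an illustrative sketch exercising a few RDD verbs
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("VerbsPractice").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize([1, 2, 2, 3, 4, 4, 5, 6])

    print(numbers.filter(lambda x: x % 2 == 0).collect())     # filter: keep even numbers
    print(numbers.map(lambda x: x * 10).take(3))              # map, then take the first 3
    print(numbers.distinct().count())                         # distinct, then count
    print(numbers.glom().collect())                           # glom: one list per partition

    lines = sc.parallelize(["a quick test", "of flatmap"])
    print(lines.flatMap(lambda s: s.split(" ")).collect())    # flatMap: one flat list of words

    spark.stop()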

A Shortcut to Big Data - Lesson Plan 1, Azure and Spark: Smart Programmer Series

That article goes over how to set up a Databricks Spark cluster in Azure. Much like this second lesson, the first lesson at that URL goes into a lot of detail on how to do that.