A Shortcut to Big Data - Lesson Plan 2, Installing Spark Standalone on Windows 10: Smart Programmer Series

If you want to learn Bigdata, Spark is a *very* good entry point. It means you don't have to start your journey with HDFS, MapReduce, Hive, Pig, Yarn, and numerous other tools, and you don't have to learn all the BigTheory behind Bigdata first.

Recent shifts in Bigdata infrastructure make Spark a good starting point. This significantly reduces learning time; it is a better on-ramp. You can learn the other Bigdata tools mentioned above at leisure while you are already taking the benefits of Bigdata processing with Spark.

So the problem of learning Bigdata reduces to learning Spark. Learning Spark involves two efforts.

The first is the effort and learning needed to set up a Spark cluster. You need a Spark cluster to run your Spark code, even a simple hello world. Your best bet is to set this cluster up using Databricks in an Azure pay-as-you-go account. I have documented this in a previous article/lesson; I will post a link to it below along with other links.

The second effort comes once you have a cluster. Although you can use notebooks to write Python code that runs on the cluster, you really need an IDE to write that code with IntelliSense, test it to an extent on your local machine, check it into GitHub, and have the Databricks cluster pick those files up from GitHub to run on a real cluster in Azure.

This document has the steps to install Spark to run standalone on Windows 10. This is a manual, multi-step installation process that is prone to inconsistencies and errors. This document is a deliberate effort to save you time, so that you can install Spark on Windows without spending days on the effort.

The important thing to say at the outset is this: take courage, for IT IS POSSIBLE to install Spark on Windows.

1. Essential installs for Spark (a quick list of tools). This also includes what the key environment variables look like at the end of the installation

2. A more detailed list of installs

3. Warnings before you start the installations. I ran into a lot of gotchas; let me state them in brief.

4. Links and other references you need as you go to install these tools. These references start with a link to a journal I kept during the entire installation process. You can browse it to see if I have missed anything in this summary article

5. Detailed analysis for each of the installations

6. How to test the installation: Running PySpark interactive shell

7. Submitting PySpark Python files using the spark-submit command to run on the local Spark standalone cluster

8. Working with VSCode on PySpark code

9. How to exercise the PySpark API

10. Now that you have a house, how to begin the practice of living in PySpark

1. Essential Installs


JDK 1.8 (not latest)

Python 3.7.4 (latest)

Spark 2.4.3 (latest)

Hadoop binaries (also referred to as winutils: 2.7.3)

JDK 1.8 (not latest):

Don't install the latest JDK. It didn't work for me with JDK 12; there are some method discrepancies between the JDKs. One would think they ought to have been compatible. So stick to JDK 1.8. Also make sure you set JAVA_HOME to the parent of the "/bin" directory in that distribution.
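As a quick sanity check (just a sketch; the directory below is a placeholder, substitute your own install directory), you can confirm from a command prompt that the right JDK is being picked up:

#***************************************
#Verify the JDK and JAVA_HOME
#***************************************
java -version
#should report version 1.8.x, not 9 or later

echo %JAVA_HOME%
#should print the parent of \bin, for example %install-dir%\jdk8\jre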

Python 3.7.4 (latest).

I installed the latest from python.org. Do not install it from the Windows 10 store. The Anaconda distribution is another suggested option; I have not tried that one, but it will likely work.
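Once Python is on the path, here is a small check that you picked up the 64-bit build, which matters for Spark as noted in the warnings below. Save these lines as any .py file, or paste them into a python prompt:

#check that the installed Python is 3.7.x and 64-bit
import struct
import sys

print(sys.version)
#should mention 3.7.4

print(struct.calcsize("P") * 8)
#prints 64 for a 64-bit build, 32 for a 32-bit one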

Spark 2.4.3 (latest):

I just got the latest from the Apache Spark site.

Hadoop binaries (also referred to as winutils: 2.7.3).

Although one can get these binaries as part of the Hadoop download from Apache, many sources suggest downloading them from a GitHub repo called "winutils". These were maintained by individuals. That original repo is now superseded by another repo, which is listed in the links below under 'winutils'.
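For orientation, here is roughly what the unpacked binaries look like on disk (the directory name matches the HADOOP_HOME value shown later and is only my own choice, not a requirement). The key point is that winutils.exe must live under a \bin folder and HADOOP_HOME must point to that folder's parent:

%install-dir%\hadoop312-common-bin
%install-dir%\hadoop312-common-bin\bin\winutils.exe
#HADOOP_HOME points to %install-dir%\hadoop312-common-bin, not to its \bin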

Once Spark is installed, you need an IDE to code in Python and PySpark. I am using VSCode already, so I am going to use the Python extensions for VSCode to let me code. This section shows the additional installs that I worked with in this effort.

//Essential

JDK 1.8 (not latest)

Python 3.7.4 (latest)

Spark 2.4.3 (latest)

Hadoop binaries (also referred to as winutils: 2.7.3)

//Additional

PySpark: as a Python (pip) package. Its main purpose is not yet clear to me, whether it is for IntelliSense or for some other reason.

VScode (if you don't have it)

7-zip (to unzip Spark and other downloads)

Git (To download hadoop binaries from Github)

GitHub Desktop (just to find out what the fuss is about)

//Vscode extensions

MS python extension

Gitlens (integrated with vscode)

other python extensions

I don't think the installation order matters too much; it is only when you run Spark that all the dependencies need to be available. Nevertheless, the installs below are listed in order.

First, read through all of this document before you install any of these tools.

7-Zip (as you need it to unzip a number of things)

JDK 1.8 (usually an installer program sets this up). Make sure to install it into a good, short directory name.

Python from python.org

Winutils from Github (Hadoop binaries)

Spark

3. Warnings before you start the installations. I ran into a lot of gotchas; let me state them in brief.


System-level environment variables:

#***************************************
#Java options to disable ipv6
#***************************************
_JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"
JAVA_HOME "%install-dir%\jdk8\jre"
HADOOP_HOME "%install-dir%\hadoop312-common-bin"

#***************************************
#Add the following paths to the system path variable
#***************************************
#Java path
%install-dir%\jdk8\jre\bin

#python-path
%install-dir%\python374
%install-dir%\python374\scripts


#hadoop-path
%install-dir%\hadoop312-common-bin\bin
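If you prefer setting these from a command prompt instead of the System Properties dialog, something like the following should work. This is only a sketch: run it from an administrator prompt, substitute your real install directory for %install-dir%, and note that setx /M writes system-level variables:

setx _JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true" /M
setx JAVA_HOME "%install-dir%\jdk8\jre" /M
setx HADOOP_HOME "%install-dir%\hadoop312-common-bin" /M
#append the java, python, and hadoop \bin paths listed above to the system Path variable by hand,
#then open a new command prompt so the changes take effect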

Finally, this is the book that helped me navigate the installation of Spark on Windows. Do buy it; it is a good book.

There is an online reference to a webpage where the installation section is made available.

Installing Spark, By Jeffrey Aven, Feb 16, 2017

Read this to see which environment variables are set and the other essential installation tasks. The page has Mac, Linux, and Windows instructions, so just focus on the Windows section. Give it a quick read, about 10 minutes, for I have explained the rest here, including what is in that article.

The _JAVA_OPTIONS setting above is there to disable the IPv6 internet option. I don't know why; I am just following that book's recommendation. I have not tested without it.

The answer is not that important, but if you are curious to follow up, see if anyone has explained this better:

Why do I have to disable IPv6 for Java applications when installing Spark

Search for: Why do I have to disable IPv6 for Java applications when installing Spark

1. Take courage that it is possible to install Spark on Windows and start the learning process.

2. You have to use JDK 8. Don't use any higher version, and don't install the latest. It didn't work for me with JDK 12; there are some method discrepancies between the JDKs. One would think they ought to have been compatible. So stick to JDK 1.8. Also make sure you set JAVA_HOME to the parent of the "/bin" directory in that distribution.

3. Turn off the IPV6 as indicated above.

4. Spark says it does not depend on Hadoop, but it appears to need the Hadoop binaries that are already built for Windows. Some individuals have prepared them and made them available on GitHub for each release of Hadoop; this repo is called "winutils". Download the entire repo and pick the version you want (this is described later in the document). The bottom line is you don't need to install Hadoop wholesale from Apache Hadoop.

5. Python has a (pip) module called "pyspark". When you download Spark, Spark comes with its own Python subdirectories that allow you to submit/run PySpark programs. I am not yet sure what role this pip module plays, so you can skip installing it with pip. When I did install it, the installation failed horribly; read later how to fix it.

6. When you install Python, make sure it is 64-bit. The default download that shows up at python.org seems to be a 32-bit version with no explicit directions to download the 64-bit one (but if you look hard for the x64 version you will find it). Sometimes it is also called AMD64.

7. You can have multiple python versions at the same time. Your IDEs can choose which python interpreter to use.

8. Install Python from python.org and not from the Windows store. The Anaconda distribution will likely work, but I haven't tried it. I have read briefly that Anaconda has a lot of packages dedicated to data science; I am not there yet. When you install Python, make sure to add both of the following to the path: \python374\ and \python374\scripts\

9. The IntelliSense for Python in VSCode is still confusing to me. More research is needed, especially to discover the "pyspark" API in code.

10. There is a URL where the PySpark API is documented, just like the JDK API. This is a useful resource, especially for understanding example source code.

11. The Spark install comes with examples of PySpark code under the examples\..\python subdirectory. Load them into the VSCode editor and use the PySpark API reference to walk through the code.

12. You will use the spark-submit command to submit Python code to run. (The command structure is detailed later; there is also a small sketch after this list.)

13. Things I have done that you can avoid: 1) installed the wrong JDK, 2) installed Python from the Windows store, 3) installed 32-bit Python, 4) set JAVA_HOME incorrectly for JDK 8.

14. VSCode, when loading a PySpark .py file, shows an error on the "from" import of pyspark code (I had to run pip install pyspark to fix this). I am not sure yet why this is. The sketch below shows what such an import looks like.
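To make items 12 and 14 concrete, here is a minimal sketch of a PySpark file and how it is submitted. The file name hello_spark.py is my own placeholder, not an official example:

# hello_spark.py: a minimal PySpark program for the local standalone Spark
# the next line is the kind of import VSCode flags until the pyspark pip package is installed
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HelloSpark").getOrCreate()

# build a tiny RDD and count it, just to prove the installation works end to end
count = spark.sparkContext.parallelize(range(100)).count()
print("count = %d" % count)

spark.stop()

Roughly, you would then run spark-submit hello_spark.py from a command prompt that has the Spark bin directory on its path; the exact command structure is covered later.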

4. Links and other references you need as you go to install these tools. These references start with a link to a journal I kept during the entire installation process. You can browse it to see if I have missed anything in this summary article.

PySpark on Windows 10: Installation Journal

A journal I kept during the entire installation process. You can browse it to see if I have missed anything in this summary article.

Here are notes on installing Spark from the Sams 24 Hours Spark book.

You can also use this as a primary guide once you have read all of the material here, as by then you will know what can go wrong and what the nuances are.

Spark home page: spark.apache.org. You will see the Spark download links here.

Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.

Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.

You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.

Spark Python API docs

Just like the API docs for Java and the JDK, these are the Python docs for Spark. You will use these references a lot.

A good entry point into those APIs is the pyspark package
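As a hedged illustration, once the pyspark package (or Spark's bundled python directory) is importable, you can poke around that entry point from a python prompt; the docstrings mirror what the online API docs describe:

import pyspark

print(pyspark.__version__)
#the Spark version the package was built against

print([name for name in dir(pyspark) if not name.startswith("_")])
#lists the top-level classes such as SparkContext, SparkConf, RDD

help(pyspark.SparkContext.textFile)
#the same documentation is available inline as docstrings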

As you start learning, here is the Spark Overview from Apache.org.

You will find starting examples, quick starts, other docs from Apache. How to run a spark submit job, downloads etc.

Download page for Apache Spark

Use this as the starting page to download Spark. You can download spark-2.4.4-bin-hadoop2.7.tgz from here. Although the file name mentions a Hadoop version, it does not contain any Hadoop binaries.
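One small note on unpacking that download: 7-Zip treats a .tgz as two layers, so extraction is a two-step affair (the GUI works the same way with two extract operations). A sketch, assuming 7z.exe is on your path and using the file name from this page; adjust for the version you actually download:

7z x spark-2.4.4-bin-hadoop2.7.tgz
#produces spark-2.4.4-bin-hadoop2.7.tar

7z x spark-2.4.4-bin-hadoop2.7.tar
#produces the spark-2.4.4-bin-hadoop2.7 folder; move or rename it into your install directory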