A Shortcut to Big Data - Lesson Plan 2, Installing Spark stand alone on windows 10: Smart Programmer Series
satya - 9/8/2019, 3:09:13 PM
Introduction
If you want to learn Bigdata, Spark is a *very* good entry point to it. That means you don't have to start your journey of Bigdata with HDFS, MapReduce, Hive, Pig, Yarn, and numerous other tools. It also means you don't have to learn the BigTheory behind Bigdata first.
Recent shifts in Bigdata infrastructure make Spark a good starting point. This significantly reduces learning time; it is a better ramp. You can learn the other Bigdata tools mentioned above at leisure while you are already reaping the benefits of Bigdata processing using Spark.
So, the problem of learning Bigdata is reduced to learning Spark. Learning Spark involves 2 efforts.
The first is the effort and learning needed to set up a Spark cluster. You need a Spark cluster to run your Spark code, even a simple helloworld. Your best bet is to set this cluster up using Databricks in an Azure pay-as-you-go account. I have documented this in a previous article/lesson; I will post a link to it below along with other links.
The second effort is this: once you have a cluster, although you can use notebooks to write python code to run on it, you really need an IDE to write that code with intellisense, test it to an extent on your local machine, check it into Github, and have the Databricks cluster pick those files from Github to run on a real cluster in Azure.
This document has the steps to install Spark to run standalone on Windows 10. This is a manual, multi-step installation process, prone to inconsistencies and errors. This document is a deliberate effort to save you time, so that you can install Spark on Windows without spending days on the effort.
Important thing to say at the onset is this: take courage, for IT IS POSSIBLE to install Spark on Windows.
satya - 9/8/2019, 3:22:38 PM
What I am going to cover
1. Essential Installs for Spark (a quick list of tools). This will also include what the key environment variables look like at the end of the installation
2. A More detailed list of installs
3. Warnings before you start installations. I ran into a lot of gotchas. Let me state what these are in brief.
4. Links and other references you need as you go to install these tools. These references start with a link to a journal I have kept during the entire installation process. You can browse it to see if I have missed anything here in the summary article
5. Detailed analysis for each of the installations
6. How to test the installation: Running PySpark interactive shell
7. Submitting pyspark python files using submit command to run on the local spark stand alone cluster.
8. Working with VSCode on pyspark code
9. How to exercise the PySpark API
10. Now that you have a house, how to begin the practice of living in PySpark
satya - 9/9/2019, 10:49:19 AM
1. Essential Installs
satya - 9/9/2019, 10:53:35 AM
The Essential List
JDK 1.8 (not latest)
Python 3.7.4 (latest)
Spark 2.4.3 (latest)
Hadoop binaries (also referred to as winutils: 2.7.3)
satya - 9/9/2019, 11:04:17 AM
Essential list elaborated
JDK 1.8 (not latest):
Don't install the latest. It didn't work for me with JDK 12. There are some method discrepancies between the JDKs; one would think they would have been compatible. So stick to JDK 1.8. Also make sure you set JAVA_HOME to the parent of the "/bin" directory in that distribution.
Python 3.7.4 (latest).
I installed the latest from python.org. Do not install it from the Windows 10 store. The Anaconda distribution is another suggested option; I have not tried that one, but it will likely work.
Spark 2.4.3 (latest):
I just got the latest from Apache Spark site.
Hadoop binaries (also referred to as winutils: 2.7.3).
Although one can get these as part of the Hadoop download from Apache, many sources suggest downloading them from a Github repo called "winutils". These were managed by individuals. That repo has since been superseded by another winutils repo; both are listed in the links section below.
satya - 9/9/2019, 11:34:54 AM
2. A More detailed list of installs
Once Spark is installed, you need an IDE to code in python and pyspark. I am already using VSCode, so I am going to utilize the python extensions for VSCode that allow me to code. This section shows the additional installs that I worked with in this effort.
satya - 9/9/2019, 11:39:19 AM
The full list is
//Essential
JDK 1.8 (not latest)
Python 3.7.4 (latest)
Spark 2.4.3 (latest)
Hadoop binaries (also referred to as winutils: 2.7.3)
//Additional
PySpark: as a python (pip) package. Its main purpose is not yet clear to me: whether it is for intellisense or some other reason.
VScode (if you don't have it)
7-zip (to unzip spark and other downloads)
Git (To download hadoop binaries from Github)
GitHub Desktop (just to find out what the fuss is about)
//Vscode extensions
MS python extension
Gitlens (integrated with vscode)
other python extensions
satya - 9/9/2019, 11:40:20 AM
Here is a pictorial representation
satya - 9/9/2019, 11:51:20 AM
Order of installs
I don't think the order matters too much; the dependencies only all need to be available when you run Spark. Nevertheless, things are listed here in order.
First, Read through all of this document before you install any of this
7 Zip (as you need it to unzip a number of things)
JDK 1.8 (usually an installation program sets this up). Make sure to install it into a good short directory name.
Python from python.org
Winutils from Github (Hadoop binaries)
Spark
satya - 9/9/2019, 11:51:53 AM
3. Warnings before you start installations. I ran into a lot of gotchas. Let me state what these are in brief.
satya - 9/9/2019, 12:00:04 PM
For Spark to run, here is what your system environment variables should look like once you are done installing:
System level Environment variables:
#***************************************
#Java options to disable ipv6
#***************************************
_JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"
JAVA_HOME "%install-dir%\jdk8\jre"
HADOOP_HOME C:\satya\i\hadoop312-common-bin
#***************************************
#Add the following paths to the system path variable
#***************************************
#Java path
%install-dir%\jdk8\jre\bin
#python-path
%install-dir%\python374
%install-dir%\python374\scripts
#hadoop-path
%install-dir%\hadoop312-common-bin\bin
satya - 9/9/2019, 12:02:45 PM
A Book reference: Apache Spark in 24 hours
Finally, this is the book that helped me navigate the installation of Spark on Windows. Do buy it. It is a good book.
There is an online reference to a webpage where the installation section is made available.
satya - 9/9/2019, 12:05:34 PM
Read this: Installing Spark, By Jeffrey Aven, Feb 16, 2017
Read this to see what environment variables are set and other essential installation tasks. This page has Mac, Linux, and Windows instructions, so just focus on the Windows section. Give it a quick read, 10 minutes, for I have explained the rest here, including what is in that article.
satya - 9/9/2019, 12:06:32 PM
_JAVA_OPTIONS env variable
This is to disable the ipv6 internet option. I don't know why; I am just following that book's recommendation. I have not tested without it.
satya - 9/9/2019, 12:08:50 PM
Why do I have to disable IPv6 for Java applications when installing Spark
The answer is not that important, but if you are curious to follow up, search to see if anyone may have explained this better.
satya - 9/9/2019, 12:55:17 PM
3. Key Lessons/Gotchas
1. Take courage that it is possible to install spark on windows and start the learning process
2. You have to use JDK 8. Don't use any higher versions; don't install the latest. It didn't work for me with JDK 12. There are some method discrepancies between the JDKs; one would think they would have been compatible. So stick to JDK 1.8. Also make sure you set JAVA_HOME to the parent of the "/bin" directory in that distribution
3. Turn off the IPV6 as indicated above.
4. Spark says it does not depend on Hadoop. But it appears it needs Hadoop binaries that are already built for windows. Some individuals have prepared them and made them available on Github for each release of Hadoop; this repo is called "winutils". Either download the entire repo and pick the version you want, or use Git to pull just the sub directory you need (this is described later in the document). The bottom line is you don't need to install Hadoop wholesale from Apache Hadoop.
5. Python has a (pip) module called "pyspark". When you download Spark, Spark comes with its own python sub directories that allow you to submit/run pyspark programs. I am not sure yet what the role of this pip module is, so you can skip installing it with pip. When I did install it, the installation failed horribly at first; read later how to fix it.
6. When you install python make sure it is 64 bit. The default download that shows up at python.org seems to be a 32 bit version, with no explicit directions to download 64 bit (but if you look for the x64 version you will find it). Sometimes it is also labeled AMD64.
7. You can have multiple python versions at the same time. Your IDEs can choose which python interpreter to use.
8. Install python from python.org and not from the windows store. The Anaconda distribution will likely work but I haven't tried it. I have read briefly that Anaconda has a lot of packages dedicated to data science; I am not there yet. When you install python make sure to add both of the following to the path: \python374\ and \python374\scripts\
9. The intellisense of python in VSCode is still confusing to me. More research is needed, especially to discover "pyspark" API in code.
10. There is a URL where the pyspark API is documented, just like the JDK API. This is a useful resource, especially to understand example source code
11. The Spark install comes with examples of pyspark code under the examples\..\python subdirectory. Load them into the vscode editor and use the PySpark API reference to walk through the code
12. You will use spark-submit command to submit python code to run. (The command structure is detailed later)
13. Things I have done that you can avoid: 1) installed the wrong JDK 2) installed python from the windows store 3) installed 32 bit python 4) set JAVA_HOME incorrectly for JDK 8
14. VSCode, when loading a pyspark .py file, shows an error on the "from" import of pyspark code. (I had to run "pip install pyspark" to fix this.) I am not sure yet why this is.
satya - 9/9/2019, 2:50:50 PM
4. Key links and other references
Links and other references you need as you go to install these tools. These references start with a link to a journal I have kept during the entire installation process. You can browse it to see if I have missed anything here in the summary article
satya - 9/9/2019, 2:52:05 PM
PySpark on Windows 10: Installation Journal
This is a journal I have kept during the entire installation process. You can browse it to see if I have missed anything here in the summary article.
satya - 9/9/2019, 2:54:03 PM
Here are notes from Sams 24 hours spark book on installing spark
You can also use this as a primary guide once you have read all of the material here, if you like, as by then you will know what can go wrong and what the nuances are.
satya - 9/9/2019, 2:56:30 PM
Spark home page: spark.apache.org. You will see spark download links here
satya - 9/9/2019, 3:01:18 PM
Key things around spark, while we are at this URL
Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
satya - 9/9/2019, 5:23:02 PM
Spark Python API docs
Just like the API docs for Java and the JDK, these are the python API docs for Spark. You will use these references a lot.
satya - 9/9/2019, 5:23:41 PM
A good entry point into those APIs is the pyspark package
satya - 9/9/2019, 5:26:15 PM
As you start learning here is Spark Overview from Apache.org
You will find starting examples, quick starts, and other docs from Apache: how to run a spark-submit job, downloads, etc.
satya - 9/9/2019, 5:29:30 PM
Download page for Apache spark
Use this as the starting page to download Spark. You can download spark-2.4.4-bin-hadoop2.7.tgz from here. Although the file name mentions a Hadoop version, it does not have any Hadoop binaries in it.
satya - 9/9/2019, 5:33:49 PM
To unzip that file you need 7 zip. Download that program from here for windows
You will notice that it is a secure site (https), which means its URL is validated and you can trust what is downloaded from it as long as the site is not compromised. Download the 64-bit version of the executable.
satya - 9/9/2019, 5:36:18 PM
Common hadoop binaries from github repo: steveloughran/winutils
Windows binaries for Hadoop versions up to Hadoop 2.6.0. For later releases you need to use the repo from the link below.
satya - 9/9/2019, 5:37:39 PM
Link to winutils repo for later releases: cdarlint/winutils
winutils.exe, hadoop.dll, and hdfs.dll binaries for Hadoop on windows
satya - 9/9/2019, 5:38:46 PM
What are these winutils?
A basic, innocuous goal of syntax checking for PySpark in VSCode on a windows box has now led to: 8 hours so far, a new laptop, spark, python, hadoop, 7zip, git, and still on the move....
So why?
Apparently spark stand alone clusters on windows need some Hadoop binaries.
Downloading them from Apache hadoop is a 300MB affair, and I was not even sure I would get the required bin directory.
More searching leads to a git site called "winutils" that has binaries built for each version of hadoop, just the bin directory.
Well, here I am downloading the whole repo, although there seems to be a way to download a partial repo using git! Well, for some other time.
satya - 9/9/2019, 5:39:24 PM
Setting up hadoop home and path once you have these utils
HADOOP_HOME=<your local hadoop-ver folder>
PATH=%PATH%;%HADOOP_HOME%\bin
satya - 9/9/2019, 5:47:22 PM
So download the winutils from this cdarlint path
Use the clone or download option from Github. That will give you all the versions in that repo. If you want to be narrow, which you can be, and want the binaries just for the latest release of hadoop, you need to install Git and use Git to download a specific sub directory; see my install journal on how to do this. If you have the bandwidth, I wouldn't recommend that. Just download the whole repo and use what you need.
Once you download it you will see things like
C:\satya\i\hadoop-winutils\winutils\hadoop-2.6.5\bin
C:\satya\i\hadoop-winutils\winutils\hadoop-2.7.3\bin
Then you copy your version into
C:\satya\i\hadoop312-common-bin\bin
where HADOOP_HOME is
C:\satya\i\hadoop312-common-bin
and the path is
C:\satya\i\hadoop312-common-bin\bin
After that you can leave the directory C:\satya\i\hadoop-winutils as is or delete it
satya - 9/9/2019, 5:56:17 PM
Download Python from here for windows 10
You will find various python releases here. Make sure you find an install executable that is 64bit. It reads like this: Windows x86-64 executable installer.
Download it and install it into a directory like \python374. This directory will need to be added to the system path. Also add \python374\scripts to the path. You will need these paths for a) spark b) vscode c) installing additional python modules through pip (a python package installer).
satya - 9/9/2019, 6:09:30 PM
MS Document: Getting started with python in vscode is here
Caution: Do not follow the instructions here to install python through the windows 10 store. Instead use python.org as above to install python. Just be careful about "pip": it may be better to run "pip" from an outside command line and not the one from inside vscode.
However this link may be useful as you get more into vscode and python for other aspects.
satya - 9/9/2019, 6:11:15 PM
Using multiple python environments in vscode
It is a bit strange how the IDEs seem to treat python. It seems you can identify a different python interpreter for each folder that is opened in vscode. It is too early for me to comment in detail on this.
satya - 9/9/2019, 6:14:42 PM
Here is a link from python.org on working with python on windows
There is a lot of windows specific info on python. I need to explore this more. For now leave this here as a key reference to come back to.
satya - 9/9/2019, 6:20:36 PM
PyPi.org
This is where public python packages are kept such as PySpark
satya - 9/9/2019, 6:23:19 PM
Here is a link in PyPi to PySpark package
I need more research on the relationship between python, Spark, and PySpark to speak more lucidly. For now I seem to need to install pyspark to get rid of errors in VSCode intellisense. But that happens through a "pip" install, which pulls it from pypi.org behind the scenes.
satya - 9/10/2019, 11:06:27 AM
Java SE Development Kit 8 Downloads: from Oracle
There are so many versions and types of JDK. Should I use the Oracle version, or will OpenJDK do? There are so many URLs and sites to download it from; some downloads prompted for a login to Oracle. Look around, validate the right URL, and get your JDK SE version.
I navigated down to this link from /oracle/java, so hopefully it is the correct one. In here pick the windows 64 bit version. I think it is called jdk-8u221-windows-x64.exe; it is under the title "windows 64". This is the base JDK SE version 8. It only installed the JRE, which seems to be sufficient. When you install, choose "change directory" and pick a target directory such as \jdk8\jre. Then add that path to the system path as: c:\satya\i\jdk8\jre\bin\. The JAVA_HOME should be c:\satya\i\jdk8\jre, the parent of the "bin" directory, even though this directory is empty other than the "\bin" in it.
satya - 9/10/2019, 11:14:24 AM
MIT Archive of Shakespeare sonnets text
When you start using pyspark code you need some text files to process. This URL has some text that you can copy into files and then analyze for words using the pyspark sample code.
satya - 9/10/2019, 11:19:21 AM
5. Detailed analysis for each of the installations
satya - 9/10/2019, 11:38:01 AM
Start with JDK8
Use the link and instructions from the previous section to install JDK 8.
It has to be JDK8 and not a later version.
Install it in a good path, ex: c:\i\jdk8\jre
It only installs the JRE
Choose "change directory" and pick a target directory such as \jdk8\jre
Then add that path to the system path as: c:\satya\i\jdk8\jre\bin\
and JAVA_HOME to \jdk8\jre
satya - 9/10/2019, 11:39:06 AM
So the paths are
JAVA_HOME to \jdk8\jre
Path: \jdk8\jre\bin
satya - 9/10/2019, 11:42:46 AM
Install Python next
Use the links above to install from python.org
Make sure to pick 64 bit python. The default links seem to have the 32 bit version as the default
Install it to good directory name like c:\i\python374
The path should point to c:\i\python374. Notice there is no \bin here.
Also set the path to c:\i\python374\scripts (this is useful to invoke pip to install other python packages)
satya - 9/10/2019, 11:43:58 AM
Paths for python are
\python374
\python374\scripts
//Notice there is no python_home
//as far as I know for now
satya - 9/10/2019, 11:46:30 AM
Install 7zip if you don't have it
Before installing spark you need a program to unzip a .tgz file (gzipped tar file).
So download and install 7zip from the links above
satya - 9/10/2019, 11:52:57 AM
Download and unzip Spark
Download spark .tgz file from apache link above
Use 7zip to unzip (extract) the tar file first
Use 7zip again to extract from the tar file into a suitable directory
I used c:\i\spark234
I have set spark_home to that directory
and the path to c:\i\spark234\bin
However I don't think those paths are necessary.
satya - 9/10/2019, 11:53:25 AM
If need be
spark_home=c:\i\spark
path to c:\i\spark\bin
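A quick smoke test I like to do at this point (my own step, not from the book) is to ask the spark-submit script for its version; adjust the path to whatever directory you unzipped Spark into. It may print a warning about missing Hadoop binaries/winutils, which the next step addresses.
c:\i\spark234\bin\spark-submit --version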
satya - 9/10/2019, 12:59:21 PM
Next install Hadoop binaries
See the link above to download the hadoop binaries from github
Get the whole repo, unless you know how to get a particular directory
For the release of Hadoop that you need, copy those files to c:\i\hadoop312-common-bin.
The 312 refers to the version of hadoop. Look at the Spark release notes and see what version of hadoop is used; hopefully anything at or after that version is ok.
You need to set 2 environment variables
hadoop_home=c:\i\hadoop312-common-bin
and the path to c:\i\hadoop312-common-bin\bin
satya - 9/10/2019, 1:02:33 PM
Set the following system variable
_JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"
This is to disable ipv6. Not sure why.
satya - 9/10/2019, 1:06:55 PM
How best to set system variables in Windows 10
Many of the online links and books suggest using the 'setx' command line option.
Because there are many installs and many system variables, I found using the windows UI a better option to set the system variables.
In the search box of windows (at the bottom) search for "Environment" and you will see "Edit Environment Variables"
Use that to open up the "System Properties" window
At the bottom you will see an option to "Edit Environment Variables"
In the subsequent window you can see what system variables are already set. This visibility is useful, as you need to make sure all variables are set, including the path.
The path editor is also very nice as it shows you what paths are already set. To add a new path is also easy and intuitive.
This method avoids a number of errors
satya - 9/10/2019, 1:10:02 PM
Let me summarize the system environment variables one more time
#***************************************
#Java options to disable ipv6
#***************************************
_JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"
JAVA_HOME "%install-dir%\jdk8\jre"
HADOOP_HOME C:\satya\i\hadoop312-common-bin
#***************************************
#Add the following paths to the system path variable
#***************************************
#Java path
%install-dir%\jdk8\jre\bin
#python-path
%install-dir%\python374
%install-dir%\python374\scripts
#hadoop-path
%install-dir%\hadoop312-common-bin\bin
#Likely optional spark home
spark_home=%install-dir%\spark234 (whatever version you have)
#Likely optional spark path
%install-dir%\spark234\bin
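Once all of these are set, here is a small sanity check I use; it is my own sketch, not a required step, and the file name checkenv.py is just an example. Run it with "python checkenv.py" from a freshly opened command window so the new variables are picked up.
# checkenv.py: print the environment variables and executables Spark cares about
import os
import shutil

for name in ["_JAVA_OPTIONS", "JAVA_HOME", "HADOOP_HOME", "SPARK_HOME"]:
    print(name, "=", os.environ.get(name))

# these should resolve through the PATH entries added above
for exe in ["java", "python", "winutils"]:
    print(exe, "->", shutil.which(exe))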
satya - 9/10/2019, 1:12:37 PM
Running windows command line as an administrator
When you install software on windows 10, unless you are just unzipping it and setting env variables yourself, it is better to run the install programs as an administrator.
To do that, right click on the executable and run it as admin.
Sometimes you may have to do that for the command line as well.
satya - 9/10/2019, 1:15:35 PM
It is time to test the installation
Open a command window (I don't think you need to be an admin for this)
Run: c:\satya\i\spark\bin\pyspark
This should open an interactive pyspark shell that has a command prompt that looks like
>>>
That verifies that you have
1. Right jdk
2. Python is available
3. Hadoop binaries are located
4. spark is installed
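Once you see the >>> prompt, you can go one step further and make the shell do a tiny bit of work. This is a minimal sketch of my own, not from the book; the pyspark shell pre-creates the SparkContext as sc and the SparkSession as spark, so you can type:
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.count()
5
>>> rdd.map(lambda x: x * 2).collect()
[2, 4, 6, 8, 10]
>>> exit()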
satya - 9/10/2019, 1:17:53 PM
There was an additional precaution in the 24 hours book to do the following
mkdir C:\tmp\hive
C:\Hadoop\bin\winutils.exe chmod 777 /tmp/hive
It may be safer to run the command line as an admin for this command. I don't remember now how I did it. I also don't see any files in there. So I don't even know if this is needed.
satya - 9/10/2019, 1:30:52 PM
7. Submitting pyspark python files using submit command to run on the local spark stand alone cluster.
Although you can test pyspark code in the interpreter, I suggest you instead use the submit command to run a python file that has the code in it. This section covers that process.
satya - 9/10/2019, 1:37:25 PM
Where can I get sample pyspark files?
These are located here (they come as part of the spark installation): C:\satya\i\spark\examples\src\main\python
One of the files we are going to use from here is called "wordcount.py". This prints the list of words in a text file and how many times each word is repeated.
This file "wordcount.py" takes a text file as an input to read the sentences from.
I suggest you visit the MIT link for the sonnets, download a sonnet (preferably the second one), and make a text file out of it. Drop this in the same directory as the examples for now, to keep the syntax short. Call this file "sonnet2.txt".
satya - 9/10/2019, 1:38:08 PM
Now you can use this command to submit wordcount.py to execute on spark
c:\satya\i\spark\bin\spark-submit --master local[4] wordcount.py sonnet2.txt
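To see what wordcount.py does in spirit, here is a stripped-down sketch of my own (it is not the exact contents of the bundled file, and the name miniwordcount.py is made up): read the lines, split them into words with flatMap, pair each word with a 1, and add up the 1s per word with reduceByKey.
# miniwordcount.py: a stripped-down word count sketch
import sys
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("MiniWordCount").getOrCreate()
    # read the input file as an RDD of lines
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = (lines.flatMap(lambda line: line.split(' '))
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))
    for word, count in counts.collect():
        print(word, count)
    spark.stop()
You would submit it the same way: c:\satya\i\spark\bin\spark-submit --master local[4] miniwordcount.py sonnet2.txt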
satya - 9/10/2019, 1:50:40 PM
8. Working with VSCode on pyspark code
You really need an IDE to work with code.
You can choose your favorite one. Many folks are using PyCharm, a JetBrains IDE. I am a bit of a generalist, so I decided to use VSCode, which I have started learning for javascript.
The extension I have enabled in vscode is called "Microsoft Python: ms-python.python"
The extension says: IntelliSense, linting, debugging, code navigation, code formatting, Jupyter notebook support, refactoring, variable explorer, test explorer, snippets, and more!
satya - 9/10/2019, 1:51:57 PM
Here is an example wordcount that I had altered slightly
###############################################
# To test flatmap
###############################################
from __future__ import print_function

import sys
from operator import add

from pyspark.sql import SparkSession

#***********************************
# Function: printCollected
#***********************************
def printBeginMsg(msg):
    print("*****************************")
    print("*" + msg)
    print("*****************************")

def printEndMsg(msg):
    print("*****************************")
    print("*" + msg)
    print("*****************************")

def printCollected(msg, collectedRDD):
    printBeginMsg(msg)
    for item in collectedRDD:
        print(item)
    printEndMsg(msg)
#***********************************
# End of functions
#***********************************

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()

    #
    # spark.read is a property
    # it returns a dataframereader
    #
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    printCollected("Raw lines", lines.collect())

    lineAsListOfWords = lines.map(lambda x: x.split(' '))
    printCollected("Raw lines split into words. Each line is a list of words", lineAsListOfWords.collect())

    # flatMap flattens the per-line lists into one list of (lowercased) words
    justAListOfWords = lineAsListOfWords.flatMap(lambda x: [w.lower() for w in x])
    printCollected("Just A List of Words, from flatmap", justAListOfWords.collect())

    # make each word a tuple which is (word, length-of-that-word, 1)
    # WordObject: (word, length, howmany)
    listOfWordObjects = justAListOfWords.map(lambda x: (x, len(x), 1))
    printCollected("List of Word Objects", listOfWordObjects.collect())

    spark.stop()
satya - 9/10/2019, 1:52:18 PM
I was getting an error at the following line
from pyspark.sql import SparkSession
satya - 9/10/2019, 1:53:12 PM
So I have decided to install pyspark through pip install
I ran into terrible installation issues.
It looks like "pip install pyspark" requires a prerequisite module.
satya - 9/10/2019, 1:54:48 PM
How to install "pip install pyspark"
//the pip script is in the python\scripts directory
//As long as that is on the path you can run pip
//Run the following from any command line
//Safe to run the command line as an admin
//Install this first
pip install pypandoc
//Then install this
pip install pyspark
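After both installs succeed, here is a quick check I use to confirm the package is visible to the interpreter; it is my own step, not a required one.
//Verify from any command window
python -c "import pyspark; print(pyspark.__version__)"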
satya - 9/10/2019, 1:56:15 PM
9. How to exercise the PySpark API
Now you are ready to explore the PySpark API
Put a bookmark on the PySpark API docs from Spark (this is listed in the links section above)
satya - 9/10/2019, 1:58:27 PM
All submits seem to start with this code
#Refer to the following package in the APIs
from pyspark.sql import SparkSession

#All starts with a SparkSession object
#Look at the API for SparkSession
spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()

#This creates a new SparkSession with a name
#using the standard builder pattern
#Now you can use the SparkSession to
#read files as data frames and RDDs
satya - 9/10/2019, 2:00:14 PM
Take a look at this code
# spark.read is a property
# it returns a dataframereader
# .text returns a dataframe
# dataframe has an RDD
# Get a new RDD that is just a collection of lines
lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
#Bring it back to the driver and print it
#Because RDDs are on multiple machines up until "collect"
printCollected("Raw lines", lines.collect())
satya - 9/10/2019, 2:00:37 PM
10. Now that you have a house, how to begin the practice of living in PySpark
satya - 9/10/2019, 3:51:08 PM
Next steps
Go through spark python samples
Keep a book mark on the Spark APIs for python
There are likely over 80 verbs in Spark to master
Understand how to process through native spark and also through Spark SQL
Understand how to keep data in Parquet tables
Understand the IDE for better intellisense
Understand the IDE for better code snippets
Implement the samples yourself in other languages like Java and Scala and see what they offer
Look at Spark/R library
See how to submit the files to databricks clusters
satya - 9/10/2019, 3:55:21 PM
More....
What are user defined functions in python for spark?
How do you package python code for deploys? (like libraries)
How do you use Spark UI?
How are SparkContext and SparkSession different?
Meaning of Collect
Difference between transformations and actions in Spark performance
How does VSCode or python resolve python imports? Is that a path?
satya - 9/10/2019, 3:56:24 PM
Understand the following APIs ...and more
rdd.filter
map
take
count
flatmap
distinct
sample
join
intersection
glom
collect
takeSample
etc.
etc.
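Here is a small practice script of my own (the names and data are made up) that exercises several of these verbs in one go; submit it with spark-submit just like wordcount.py.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("VerbPractice").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize([1, 2, 2, 3, 4, 5, 6, 6])

    # filter: keep only the even numbers
    evens = numbers.filter(lambda x: x % 2 == 0)
    # distinct: drop the duplicates
    unique = numbers.distinct()
    # map: double every number
    doubled = numbers.map(lambda x: x * 2)

    print("count:", numbers.count())
    print("first three doubled:", doubled.take(3))
    print("evens:", evens.collect())
    print("unique:", sorted(unique.collect()))
    # glom shows how the elements are grouped per partition
    print("partitions:", numbers.glom().collect())

    spark.stop()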
satya - 9/10/2019, 4:22:41 PM
A Shortcut to Big Data - Lesson Plan 1, Azure and Spark: Smart Programmer Series
This article goes over how to set up a Databricks spark cluster in Azure. Much like this second lesson, that first lesson goes into a lot of detail on how to do that.