PySpark on Windows 10: Installation Journal. A formal article based on this material will soon be posted to the bigdata folder.

what is vscode extension for working with pyspark library?

Search for: what is vscode extension for working with pyspark library?

VS code official python extension docs

Something about pip and pyspark

How to install pyspark?

Search for: How to install pyspark?

where does pip install the packages?

Search for: where does pip install the packages?

There is some local info on installing pyspark

where does pip install the packages? (on Stack Overflow)


c:\any-dir>pip install pyspark
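To sanity-check that pip actually made pyspark importable, here is a quick probe sketch using the standard-library `importlib.util.find_spec`, which looks a package up on the interpreter's search path without importing it:

```python
import importlib.util

def is_installed(package):
    """Return True if `package` is importable from this interpreter."""
    return importlib.util.find_spec(package) is not None

# After a successful `pip install pyspark`, this should print True.
print("pyspark installed:", is_installed("pyspark"))
```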

where is the official pyspark distribution?

Search for: where is the official pyspark distribution?

Here is the PySpark API: this is at the Apache Spark site

Spark homepage

Let's see what's here

A bit disappointed, as I am looking for direct instructions to get started: installing standalone Spark (if needed), the pyspark library for Python, and an IDE such as VS Code.

It is really sad, the links in the book don't seem to work any more :(

Here is the official pyspark site

Spark overview from Apache docs

How to install spark standalone on windows

Search for: How to install spark standalone on windows

Here are notes from Sams 24 hours spark book on installing spark

This is a good book overall

Yes you need 7zip: http://7-zip.org/download.html


setx /M _JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"

//This JVM option tells Java to prefer the IPv4 stack over IPv6 (see the environment variable notes below)

Here is a link where everyone is using the common binaries for Hadoop

The basic, innocuous goal of syntax checking for PySpark in VS Code on a Windows box has now led to: 8 hours so far, a new laptop, Spark, Python, Hadoop, 7zip, Git, and still on the move....

So why?

Apparently standalone Spark clusters on Windows need some Hadoop binaries.

Downloading them from Apache Hadoop is a 300MB affair, and I am not even sure I got the required bin directory.

More searching leads to a GitHub repo called "winutils" that has binaries built for each version of Hadoop, just the bin directory.

Well, here I am downloading the whole repo, although there seems to be a way to download a partial repo using Git. Well, some other time.


setx /M path "%path%;

//python: Notice no /bin
C:\satya\i\python374;

//java
C:\satya\i\jdk12\bin;

//Hadoop path
C:\satya\i\hadoop312-common-bin\bin"

Path=C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;

C:\Windows\System32\WindowsPowerShell\v1.0\;

C:\Windows\System32\OpenSSH\;

C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;

C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;

C:\Program Files\Intel\WiFi\bin\;

C:\Program Files\Common Files\Intel\WirelessCommon\;

C:\satya\i\git\Git\cmd;

C:\Users\satya\AppData\Local\Microsoft\WindowsApps;

C:\satya\i\vscode\bin;

C:\Users\satya\AppData\Local\GitHubDesktop\bin

PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC

I thought I installed Java!!!! What happened to it? Let me check.

environment variable java_home is not there

java is not in the path


Search for any of these 
  Environment
  Edit Environment variables

Do not search for
  Settings
  System settings
  System properties

Choose "Edit Environment variables"

That takes you to a dialog "System Properties"
Choose at the bottom: Environment variables

It is best to edit these environment variables this way
and not through the command line.

System level Environment variables:

#***************************************
#Java options to disable ipv6
#***************************************
_JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"
JAVA_HOME "%install-dir%\jdk12"
HADOOP_HOME C:\satya\i\hadoop312-common-bin

#***************************************
#Add the following paths to the system path variable
#***************************************
#Java path
%install-dir%\jdk12\bin

#python-path
%install-dir%\python374

#hadoop-path
%install-dir%\hadoop312-common-bin\bin
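One way to verify those PATH additions took effect is a small check from any Python prompt; a sketch, assuming the tools you care about are java, python, and winutils:

```python
import shutil

def missing_from_path(tools):
    """Return the subset of `tools` that cannot be resolved on the PATH."""
    return [t for t in tools if shutil.which(t) is None]

# On a correctly configured box this prints an empty list.
print(missing_from_path(["java", "python", "winutils"]))
```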

WARNING: An illegal reflective access operation has occurred

Search for: WARNING: An illegal reflective access operation has occurred

Here the problem appears to be with the Java version. The preferred one appears to be Java 8.


mkdir C:\tmp\hive
C:\Hadoop\bin\winutils.exe chmod 777 /tmp/hive

How do I tell vscode where python is installed?

Search for: How do I tell vscode where python is installed?

There are 2 Pythons:

the first one I installed from the Microsoft App Store,

the second one manually from python.org.

Neither is the Anaconda distribution.

It looks like, at least on the command line inside VS Code, it is picking up python.exe from the path, which I have explicitly set up.

Here is an explicit document on setting up python environments

By default, the Python extension looks for and uses the first Python interpreter it finds in the system path. If it doesn't find an interpreter, it issues a warning. On macOS, the extension also issues a warning if you're using the OS-installed Python interpreter, because you typically want to use an interpreter you install directly.

first Python interpreter it finds in the system path

use Ctrl-Shift-P for the command palette

type "python:sel...."

Choose Python: Select Interpreter

It shows what is current and what is available. To my surprise, I found that what I installed manually is 32-bit while the app-store version is 64-bit.
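A quick way to confirm from any terminal which interpreter is actually running, and whether it is a 32-bit or 64-bit build (the size of a C pointer gives the bitness):

```python
import struct
import sys

# Path of the interpreter currently executing this script
print(sys.executable)

# Pointer size in bits: 32 on a 32-bit build, 64 on a 64-bit build
print(struct.calcsize("P") * 8, "bit")
```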

More notes on vscode are here

Take an example pyspark program in python

look for "spark\examples\src\main\python\wordcount.py"

Looks like you can't use the spark interactive shell to do this

You have to submit the job instead

Or install pyspark for Python, and then you can run it as a Python program

I am going to try the submit option first

You can read about this in the next link below

Apache spark docs on running python programs

Pyspark: Unsupported class file major version 56

Search for: Pyspark: Unsupported class file major version 56

Looks like, after all, one needs JDK 8!! to run this version of Spark. (Class file major version 56 corresponds to Java 12, which this Spark build does not support.)


//Of course remove the line breaks (or use ^ continuations)
//when you try this out

..\bin\spark-submit ^
  --master local[4] ^
  ..\examples\src\main\python\wordcount.py

(Diagram: primary dependencies are inside the boundary.)

How do I tell python where other python packages are?

Search for: How do I tell python where other python packages are?

pip install pyspark failed with error code 1 in windows

Search for: pip install pyspark failed with error code 1 in windows

Could not import pypandoc - required to package PySpark

To fix it do the following

pip install pypandoc

pip install pyspark

The above is documented here, though in equally brief terms.

pip install pyspark failing on windows 10

Search for: pip install pyspark failing on windows 10

I also uninstalled the 32-bit python374, which I had installed by mistake.

VS Code can't find Python, although it is available as a Windows 64-bit app

I see it in: C:\Users\satya\AppData\Local\Microsoft\WindowsApps

Not sure why it is not in its own directory. If not, what should python_home then be set to??

what is python_home for windows 10 python app?

Search for: what is python_home for windows 10 python app?

Using python on windows: A python.org doc

Getting started with python code in VS Code

These are the original instructions I followed to install Python. I had misread them and gone for the easiest route, which is the Windows app. The other two options are Python from python.org or the Anaconda distribution.

I am going to go with python.org, although the Anaconda distribution may be more desirable. Will see.

As I had installed jdk12 before, I had to install jdk8 now

There are so many versions of it and types

Should I use the Oracle version, or will OpenJDK do?

there are so many URLs!!

Some asked for login to oracle

Look around, validate the right URL, and get your JDK SE version (not entirely sure)

I did pick the base jdk se version 8

It only installed the JRE

Choose "change directory" to set a target directory such as \jdk8\jre

Then add that path to the system path as: c:\satya\i\jdk8\jre\bin\

and JAVA_HOME to \jdk8

I don't even know if this works. Will keep you posted.

Uninstall, and reboot with no Python

Then install

Ask it to add environment variables; otherwise you will not find pip.exe, as it is in \Scripts

That will add path to

\python374\

\python374\scripts\

It will only add to the user path.

Add it to the system path as well.


pip install pypandoc

pip install pyspark

pip install seems to be smart enough to cache the previous download somehow.

Ctrl-Shift-P

then it looks like you have to select the interpreter for each folder inside the workspace.

I have two, so I had to choose twice.

Not sure if there is a way to do this globally


from pyspark.sql import SparkSession

spark-submit error The system cannot find the path specified.

Search for: spark-submit error The system cannot find the path specified.

It never ends....

On jdk12 I have set it to

java_home=/jdk12

and path as /jdk12/bin

However, in the Oracle JDK 8 installation the directory structure was

/jdk8/jre/bin

and there are no subdirectories under /jdk8 other than ./jre

In my haste I had set java_home as

java_home=/jdk8

And that was the problem; it should be

java_home=/jdk8/jre

in cases where the jdk is really a "jre" distribution

sometimes it is /jdk-x and sometime /jdk-x/jre

JAVA_HOME=/jdk8/jre

path=%path%;/jdk8/jre/bin

Of course take care of the slashes and root directories
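That rule of thumb can be written down as a check; a sketch with illustrative paths: JAVA_HOME is usable only when `<JAVA_HOME>/bin` holds the java launcher, which is exactly what the /jdk8 vs /jdk8/jre mix-up violated.

```python
from pathlib import Path

def java_home_ok(java_home):
    """True when <java_home>/bin contains a java launcher.

    This catches the mix-up above: if the installer really unpacked a
    JRE under <java_home>/jre, then <java_home>/bin does not exist and
    this returns False until JAVA_HOME points one level deeper.
    """
    bin_dir = Path(java_home) / "bin"
    return (bin_dir / "java.exe").exists() or (bin_dir / "java").exists()
```

With the layout above, `java_home_ok(r"C:\satya\i\jdk8")` comes out False, while `java_home_ok(r"C:\satya\i\jdk8\jre")` comes out True.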

Shakespeare sonnets

Search for: Shakespeare sonnets

MIT Archive


FROM fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light'st flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel.
Thou that art now the world's fresh ornament
And only herald to the gaudy spring,
Within thine own bud buriest thy content
And, tender churl, makest waste in niggarding.
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.

the: 6
thy: 4
to: 3
might: 2
But: 2
by: 2
tender: 2
thine: 2
own: 2
world's: 2
FROM: 1
fairest: 1
creatures: 1
we: 1
desire: 1
increase,: 1
That: 1
thereby: 1
beauty's: 1
rose: 1
never: 1
die,: 1
as: 1
riper: 1
should: 1
time: 1
decease,: 1
His: 1
heir: 1
bear: 1
his: 1
memory:: 1
thou,: 1
contracted: 1
bright: 1
eyes,: 1
Feed'st: 1
light'st: 1
flame: 1
with: 1
self-substantial: 1
fuel,: 1
Making: 1
a: 1
famine: 1
where: 1
abundance: 1
lies,: 1
Thyself: 1
foe,: 1
sweet: 1
self: 1
too: 1
cruel.: 1
Thou: 1
that: 1
art: 1
now: 1
fresh: 1
ornament: 1
And: 1
only: 1
herald: 1
gaudy: 1
spring,: 1
Within: 1
bud: 1
buriest: 1
content: 1
And,: 1
churl,: 1
makest: 1
waste: 1
in: 1
Pity: 1
world,: 1
or: 1
else: 1
this: 1
glutton: 1
be,: 1
To: 1
eat: 1
due,: 1
grave: 1
and: 1
thee.: 1
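Those counts come out of the Spark job, but the same split-and-count logic can be mirrored in plain Python to spot-check small inputs; a sketch, with `Counter` standing in for the flatMap/map/reduceByKey pipeline:

```python
from collections import Counter

def word_count(text):
    """Split on whitespace and count occurrences, mirroring the Spark
    wordcount example. Punctuation sticks to the words, which is why
    the output above has entries like "increase,: 1", and counting is
    case-sensitive ("the" vs "Thou").
    """
    return Counter(text.split())

counts = word_count("the cat and the dog and the bird")
print(counts.most_common(2))  # [('the', 3), ('and', 2)]
```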

import sys
from operator import add

# `spark` is the SparkSession created earlier, as in wordcount.py
lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)

# Sort by count, descending, before printing
counts2 = counts.sortBy(lambda x: x[1], False)
output = counts2.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

So the installation finally seems to work.

I will document the details in a linear fashion soon and will post a link.


When forty winters shall beseige thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery, so gazed on now,
Will be a tatter'd weed, of small worth held:
Then being ask'd where all thy beauty lies,
Where all the treasure of thy lusty days,
To say, within thine own deep-sunken eyes,
Were an all-eating shame and thriftless praise.
How much more praise deserved thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.