PySpark: PySpark on Windows 10: Installation Journal. A formal article based on this material will soon be posted to the bigdata folder.

satya - 8/31/2019, 2:01:09 PM

what is vscode extension for working with pyspark library?

Search for: what is vscode extension for working with pyspark library?

satya - 8/31/2019, 2:43:51 PM

VS code official python extension docs

satya - 8/31/2019, 2:47:24 PM

Something about pip and pyspark

satya - 8/31/2019, 2:48:58 PM

How to install pyspark?

Search for: How to install pyspark?

satya - 8/31/2019, 3:02:12 PM

where does pip install the packages?

Search for: where does pip install the packages?

satya - 8/31/2019, 3:02:39 PM

There is some local info on installing pyspark

satya - 8/31/2019, 3:06:42 PM

where does pip install the packages? on SOF

satya - 8/31/2019, 3:19:02 PM

Gave up and going to try


c:\any-dir>pip install pyspark
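The question that kept coming up above, where pip puts packages, can also be answered from Python itself. The following is a minimal sketch using only the standard library; the function name `package_locations` is made up for illustration, and the result only describes the interpreter that runs the script, which may not be the one VS Code has selected.

```python
import site
import sys
import sysconfig


def package_locations():
    """Return the directories where this interpreter keeps installed packages."""
    return {
        # The interpreter that is actually running this script
        "interpreter": sys.executable,
        # Default install target for pure-Python packages
        "purelib": sysconfig.get_paths()["purelib"],
        # Per-user target used by "pip install --user"
        "user_site": site.getusersitepackages(),
    }


if __name__ == "__main__":
    for name, path in package_locations().items():
        print(f"{name}: {path}")
```

Running this with each installed Python in turn shows which site-packages directory a given `pip install pyspark` would have written to.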

satya - 8/31/2019, 3:19:24 PM

where is the official pyspark distribution?

Search for: where is the official pyspark distribution?

satya - 8/31/2019, 3:21:09 PM

Here is the PySpark API: This is at Spark Apache site

satya - 8/31/2019, 3:22:25 PM

Spark homepage

satya - 8/31/2019, 3:31:06 PM

Let's see what's here: book reference, Learning PySpark

A bit disappointed, as I am looking for direct instructions to get started: installing standalone Spark (if needed), the pyspark library for Python, and an IDE such as VS Code.

satya - 8/31/2019, 3:35:42 PM

It is really sad, the links in the book don't seem to work any more :(

satya - 8/31/2019, 3:39:47 PM

Here is the official pyspark site

satya - 8/31/2019, 4:17:27 PM

Spark overview from Apache docs

satya - 9/1/2019, 11:20:07 AM

How to install spark standalone on windows

Search for: How to install spark standalone on windows

satya - 9/1/2019, 11:22:51 AM

Here are notes from Sams 24 hours spark book on installing spark

satya - 9/1/2019, 11:24:10 AM

This is a good book overall

satya - 9/1/2019, 11:24:31 AM

Yes you need 7zip: http://7-zip.org/download.html

satya - 9/1/2019, 11:28:46 AM

we are advised to set the following


setx /M _JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"

//Not sure yet why it is needed, but it makes the JVM prefer the IPv4 network stack over IPv6. Will investigate

satya - 9/1/2019, 2:44:07 PM

Here is a link where everyone is using the common binaries for Hadoop

satya - 9/1/2019, 3:40:29 PM

What are these about?

Basic, innocuous, goal of syntax checking for PySpark in VSCode on a windows box has now led to: 8 hours so far, a new laptop, spark, python, hadoop, 7zip, git, and still on the move....

So why?

Apparently Spark standalone clusters on Windows need some Hadoop binaries.

Downloading them from Apache Hadoop is a 300MB affair, and I am not even sure I got the required bin directory.

More searching leads to a git repo called "winutils" that has binaries built for each version of Hadoop, just the bin directory.

Well, here I am downloading the whole repo, although there seems to be a way to download a partial repo using git! Well, for some other time.

satya - 9/1/2019, 5:00:25 PM

Paths


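setx /M path "%path%;C:\satya\i\python374;C:\satya\i\jdk12\bin;C:\satya\i\hadoop312-common-bin\bin"

//The entries added above, one by one

//python: Notice no \bin
C:\satya\i\python374

//java
C:\satya\i\jdk12\bin

//Hadoop path
C:\satya\i\hadoop312-common-bin\bin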

satya - 9/1/2019, 5:03:36 PM

Nature of paths on my box prior to fixing anything


Path=C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;

C:\Windows\System32\WindowsPowerShell\v1.0\;

C:\Windows\System32\OpenSSH\;

C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;

C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;

C:\Program Files\Intel\WiFi\bin\;

C:\Program Files\Common Files\Intel\WirelessCommon\;

C:\satya\i\git\Git\cmd;

C:\Users\satya\AppData\Local\Microsoft\WindowsApps;

C:\satya\i\vscode\bin;

C:\Users\satya\AppData\Local\GitHubDesktop\bin

PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC

satya - 9/1/2019, 5:04:09 PM

I thought I installed java!!!! what happened to it? Let me check

satya - 9/1/2019, 5:05:22 PM

it says java se 12 is installed, how come then

environment variable java_home is not there

java is not in the path

satya - 9/2/2019, 12:46:39 PM

How best to setup system environment variables in windows 10


Search for any of these 
  Environment
  Edit Environment variables

Do not search for
  Settings
  System settings
  System properties

Choose "Edit Environment variables"

That takes you to a dialog "System Properties"
Choose at the bottom: Environment variables

It is best you edit these environment variables this way
and not through command line.

satya - 9/2/2019, 12:48:42 PM

What I have before testing


System level Environment variables:

#***************************************
#Java options to disable ipv6
#***************************************
_JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"
JAVA_HOME "%install-dir%\jdk12"
HADOOP_HOME "%install-dir%\hadoop312-common-bin"

#***************************************
#Add the following paths to the system path variable
#***************************************
#Java path
%install-dir%\jdk12\bin

#python-path
%install-dir%\python374

#hadoop-path
%install-dir%\hadoop312-common-bin\bin
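The settings listed above can be sanity-checked from Python before trying Spark. This is a rough sketch under a few assumptions: the helper name `missing_settings` is hypothetical, the environment is passed in as a plain mapping with an upper-case `PATH` key, and paths are compared case-insensitively, which suits Windows. It only checks that the variables are set and that each `bin` directory is on the path; it does not verify that the directories exist.

```python
import os

# Variable names taken from the notes above
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME")


def missing_settings(env):
    """Return a list of human-readable problems found in the given environment mapping."""
    problems = []
    # Normalise PATH entries: drop trailing slashes, compare case-insensitively
    path_entries = [
        p.rstrip("\\/").lower()
        for p in env.get("PATH", "").split(os.pathsep)
        if p
    ]
    for name in REQUIRED_VARS:
        home = env.get(name)
        if not home:
            problems.append(f"{name} is not set")
            continue
        # Each home directory is expected to contribute its bin folder to PATH
        bin_dir = os.path.join(home, "bin").rstrip("\\/").lower()
        if bin_dir not in path_entries:
            problems.append(f"{bin_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_settings(os.environ):
        print(problem)
```

On Windows `os.environ` is case-insensitive, so passing it directly works even though the variable is displayed as `Path`.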

satya - 9/2/2019, 12:58:15 PM

WARNING: An illegal reflective access operation has occurred

Search for: WARNING: An illegal reflective access operation has occurred

satya - 9/2/2019, 1:02:11 PM

Here the problem appears to be the Java version. The preferred one appears to be Java 8

satya - 9/2/2019, 1:03:28 PM

As the 24 hrs book suggested I had also done


mkdir C:\tmp\hive
C:\Hadoop\bin\winutils.exe chmod 777 /tmp/hive

satya - 9/2/2019, 1:26:48 PM

How do I tell vscode where python is installed?

Search for: How do I tell vscode where python is installed?

satya - 9/2/2019, 2:06:19 PM

Few notes on this

there are 2 pythons

the first one I have installed from microsoft app store

the second one manually from python.org

Neither is anaconda distribution

It looks like, at least on the command line inside vscode, python.exe is picked up from the path that I have explicitly set up.

satya - 9/2/2019, 2:06:54 PM

Here is an explicit document on setting up python environments

satya - 9/2/2019, 2:08:10 PM

So the fact is

By default, the Python extension looks for and uses the first Python interpreter it finds in the system path. If it doesn't find an interpreter, it issues a warning. On macOS, the extension also issues a warning if you're using the OS-installed Python interpreter, because you typically want to use an interpreter you install directly.

satya - 9/2/2019, 2:08:25 PM

first Python interpreter it finds in the system path

satya - 9/2/2019, 2:14:28 PM

You can also use the command

use Ctrl-shift-p for the command palette

type "python:sel...."

Choose Python: Select Interpreter

It shows what is current and what is available. To my surprise I found what I installed manually is 32bit while the app version is at 64 bit.
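The 32-bit versus 64-bit surprise can be confirmed from inside any interpreter. A small sketch, assuming nothing beyond the standard library; the function name `interpreter_bits` is made up for illustration.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    # "P" is the struct format code for a C pointer; its size is in bytes
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print(sys.executable)
    print(platform.python_version(), f"{interpreter_bits()}-bit")
```

Running it once per installed Python shows which executable is the 32-bit one.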

satya - 9/2/2019, 2:25:13 PM

More notes on vscode are here

satya - 9/2/2019, 3:23:19 PM

Next goal

Take an example pyspark program in python

look for "spark\examples\src\main\python\wordcount.py"

Looks like you can't use the spark interactive shell to do this

You have to submit the job instead

Or install the pyspark package for python, and then you can run it as a python program

I am going to try the submit option first

You can read about this in the next link below

satya - 9/2/2019, 3:24:05 PM

Apache spark docs on running python programs

satya - 9/2/2019, 3:35:00 PM

Pyspark: Unsupported class file major version 56

Search for: Pyspark: Unsupported class file major version 56

satya - 9/2/2019, 3:35:24 PM

Looks like after all one needs jdk8!! to run this version of Spark
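The number in that error message is a Java class-file major version. From Java 5 onward the major version is the release number plus 44, so 56 corresponds to Java 12 (the JDK installed at this point) and Java 8 corresponds to 52. A tiny sketch of the mapping; the function name is hypothetical and releases before Java 5 are deliberately left out.

```python
def java_release_for_class_version(major):
    """Map a class-file major version to the Java release that produces it."""
    if major < 49:
        raise ValueError("this sketch only covers Java 5 (major 49) and later")
    return major - 44


if __name__ == "__main__":
    print(java_release_for_class_version(56))  # → 12
    print(java_release_for_class_version(52))  # → 8
```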

satya - 9/2/2019, 3:36:39 PM

Anyway here is the example using submit


//Shown on one line
//wordcount.py reads the input text file from its first argument

..\bin\spark-submit --master local[4] ..\examples\src\main\python\wordcount.py <input-text-file>

satya - 9/3/2019, 2:15:52 PM

So what did I install

satya - 9/3/2019, 2:16:46 PM

Here is the dependency break down

satya - 9/3/2019, 2:16:54 PM

Primary dependencies are inside the boundary

satya - 9/3/2019, 2:18:00 PM

what did I test

satya - 9/3/2019, 2:18:22 PM

Key takeaways

satya - 9/3/2019, 2:18:40 PM

Next tasks

satya - 9/6/2019, 9:45:10 AM

How do I tell python where other python packages are?

Search for: How do I tell python where other python packages are?

satya - 9/6/2019, 9:51:35 AM

pip install pyspark failed with error code 1 in windows

Search for: pip install pyspark failed with error code 1 in windows

satya - 9/6/2019, 10:48:13 AM

This is a pseudo problem. The real reason and solution is

Could not import pypandoc - required to package PySpark

To fix it do the following

pip install pypandoc

pip install pyspark

satya - 9/6/2019, 10:48:45 AM

The above is documented here, in equally brief terms however

satya - 9/6/2019, 10:49:23 AM

Better search for future: pip install pyspark failing on windows 10

Search for: pip install pyspark failing on windows 10

satya - 9/6/2019, 10:50:03 AM

I had also uninstalled the 32-bit python374, which I had installed by mistake.

satya - 9/6/2019, 10:58:05 AM

Now that seems to cause problems of its own

VSCode can't find python, although it is available as a 64-bit windows app

I see it in: C:\Users\satya\AppData\Local\Microsoft\WindowsApps

Not sure why it is not in its own directory. If it isn't, what should python_home be set to?

satya - 9/6/2019, 10:58:22 AM

what is python_home for windows 10 python app?

Search for: what is python_home for windows 10 python app?

satya - 9/6/2019, 11:01:19 AM

Using python on windows: A python.org doc

satya - 9/6/2019, 11:09:16 AM

Getting started with python code in VS Code

These are the original instructions I followed to install python. I had misread them and went for the easiest route, which is the windows app. The other two options are python from python.org or the anaconda distribution.

I am going to go with python.org, although the anaconda distribution may be more desirable. Will see.

satya - 9/6/2019, 2:40:05 PM

Back to JDK installation

As I had installed jdk12 before, I had to install jdk8 now

There are so many versions of it and types

Should I use the Oracle version, or will openjdk do?

there are so many URLs!!

Some asked for login to oracle

Look around and validate the right URL and get your jdk SE version (Not entirely sure)

I did pick the base jdk se version 8

it only installed JRE

Choose a "change directory" for a target directory such as \jdk8\jre

Then add that path to the system path as: c:\satya\i\jdk8\jre\bin\

and JAVA_HOME to \jdk8

I don't even know if this works. Will keep you posted

satya - 9/6/2019, 2:42:19 PM

Reinstalled python374 from python.org

Uninstall and reboot so that no python is left

Then install

Ask it to add environment variables, otherwise you will not find pip.exe, as it is in \scripts

That will add path to

\python374\

\python374\scripts\

It will only add to local path

Add it to system path as well

satya - 9/6/2019, 2:43:33 PM

Rerun pip installs


pip install pypandoc

pip install pyspark

satya - 9/6/2019, 2:43:53 PM

Pip install seems to be smart enough to cache the previous download somehow.

satya - 9/6/2019, 2:45:17 PM

Now in vscode you have to tell where the interpreter is

ctrl-shift-p, then Python: Select Interpreter

Then it looks like you have to choose it for each of the folders inside the workspace.

I have two. So I had to choose twice

Not sure if there is a way to do this globally

satya - 9/6/2019, 2:45:42 PM

Finally the following "from" will not show the error in the .py file


from pyspark.sql import SparkSession
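When the editor still flags that import, it helps to ask the selected interpreter directly whether it can see the packages. A minimal sketch, assuming it is run with the same interpreter that VS Code shows in its status bar; the helper name `is_importable` is made up, and it only handles top-level module names.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the running interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    # Package names taken from the pip installs in these notes
    for name in ("pypandoc", "pyspark"):
        print(name, "found" if is_importable(name) else "NOT found")
```

If pyspark is reported as not found here but pip claims it is installed, pip and the editor are most likely pointing at different Python installations.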

satya - 9/6/2019, 5:55:32 PM

spark-submit error The system cannot find the path specified.

Search for: spark-submit error The system cannot find the path specified.

satya - 9/6/2019, 6:27:33 PM

It never ends....

satya - 9/6/2019, 6:30:26 PM

So this seems to be coming from a wrong setting of JAVA_HOME

On jdk12 I have set it to

java_home=/jdk12

and path as /jdk12/bin

However, in the Oracle JDK8 installation the directory structure was

/jdk8/jre/bin

and there are no sub directories under /jdk8 other than ./jre

In my haste i have set the java_home as

java_home=/jdk8

And that was the problem, it should be

java_home=/jdk8/jre

in cases where the jdk is really a "jre" distribution

satya - 9/6/2019, 6:31:46 PM

So JAVA_HOME should be the parent directory of /bin

sometimes it is /jdk-x and sometimes /jdk-x/jre

satya - 9/6/2019, 6:33:05 PM

So for Java SE JDK 8 the settings are

JAVA_HOME=/jdk8/jre

path=%path%;/jdk8/jre/bin

Of course take care of the slashes and root directories
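The rule worked out above, that JAVA_HOME is whichever directory directly contains bin, can be expressed as a short check. This is only a sketch under stated assumptions: the function name `find_java_home` is hypothetical, it looks for `java.exe` by default, and it only considers the install directory itself and its `jre` subdirectory, the two layouts seen in these notes.

```python
import os
import tempfile


def find_java_home(install_dir, exe_name="java.exe"):
    """Return the directory that directly contains bin/<exe_name>, or None.

    Checks install_dir itself first, then install_dir/jre.
    """
    for candidate in (install_dir, os.path.join(install_dir, "jre")):
        if os.path.isfile(os.path.join(candidate, "bin", exe_name)):
            return candidate
    return None


if __name__ == "__main__":
    # Build a throwaway JRE-only layout (<root>/jre/bin/java.exe) to show the behaviour
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "jre", "bin"))
        open(os.path.join(root, "jre", "bin", "java.exe"), "w").close()
        print(find_java_home(root) == os.path.join(root, "jre"))  # prints True
```

Pointing it at the real install directory would give the value to use for JAVA_HOME, with `bin` under it going on the path.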

satya - 9/6/2019, 7:10:04 PM

Shakespeare sonnets

Search for: Shakespeare sonnets

satya - 9/6/2019, 7:10:35 PM

MIT Archive

satya - 9/6/2019, 7:11:04 PM

Sonnet I


FROM fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light'st flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel.
Thou that art now the world's fresh ornament
And only herald to the gaudy spring,
Within thine own bud buriest thy content
And, tender churl, makest waste in niggarding.
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.

satya - 9/6/2019, 7:48:33 PM

Here are the words and their frequencies


the: 6
thy: 4
to: 3
might: 2
But: 2
by: 2
tender: 2
thine: 2
own: 2
world's: 2
FROM: 1
fairest: 1
creatures: 1
we: 1
desire: 1
increase,: 1
That: 1
thereby: 1
beauty's: 1
rose: 1
never: 1
die,: 1
as: 1
riper: 1
should: 1
time: 1
decease,: 1
His: 1
heir: 1
bear: 1
his: 1
memory:: 1
thou,: 1
contracted: 1
bright: 1
eyes,: 1
Feed'st: 1
light'st: 1
flame: 1
with: 1
self-substantial: 1
fuel,: 1
Making: 1
a: 1
famine: 1
where: 1
abundance: 1
lies,: 1
Thyself: 1
foe,: 1
sweet: 1
self: 1
too: 1
cruel.: 1
Thou: 1
that: 1
art: 1
now: 1
fresh: 1
ornament: 1
And: 1
only: 1
herald: 1
gaudy: 1
spring,: 1
Within: 1
bud: 1
buriest: 1
content: 1
And,: 1
churl,: 1
makest: 1
waste: 1
in: 1
Pity: 1
world,: 1
or: 1
else: 1
this: 1
glutton: 1
be,: 1
To: 1
eat: 1
due,: 1
grave: 1
and: 1
thee.: 1
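A tally like the one above can be cross-checked without Spark. The sketch below is plain Python, not the Spark job itself: it splits each line on single spaces, so punctuation stays attached to words exactly as in the listing, and sorts by count with the highest first. The function name `word_frequencies` is made up for illustration, and the order of words with equal counts may differ from what Spark prints.

```python
from collections import Counter


def word_frequencies(text):
    """Count space-separated tokens, most frequent first (punctuation stays attached)."""
    counts = Counter()
    for line in text.splitlines():
        # Split on single spaces only, so "die," and "die" are different tokens
        counts.update(line.split(" "))
    return counts.most_common()


if __name__ == "__main__":
    sample = "the rose\nthe world's due, the grave"
    for word, count in word_frequencies(sample):
        print("%s: %i" % (word, count))
```

Feeding it the full text of the sonnet file gives a list that can be compared against the Spark output line by line.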

satya - 9/6/2019, 7:49:08 PM

Code for it


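import sys
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()
# One string per input line; the file path is the first command-line argument
lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
counts = (lines.flatMap(lambda x: x.split(' '))
               .map(lambda x: (x, 1))
               .reduceByKey(add))
# Sort by count, highest first (second argument is ascending=False)
counts2 = counts.sortBy(lambda x: x[1], False)
for (word, count) in counts2.collect():
    print("%s: %i" % (word, count))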

satya - 9/6/2019, 7:49:18 PM

So the installation finally seems to work.

satya - 9/6/2019, 7:49:32 PM

I will document the details in a linear fashion soon and post a link.

satya - 9/7/2019, 11:44:56 AM

Sonnet 2 may be better


When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery, so gazed on now,
Will be a tatter'd weed, of small worth held:
Then being ask'd where all thy beauty lies,
Where all the treasure of thy lusty days,
To say, within thine own deep-sunken eyes,
Were an all-eating shame and thriftless praise.
How much more praise deserved thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.