PySpark: PySpark on Windows 10: Installation Journal. A formal article based on this material will soon be posted to the bigdata folder.
satya - 8/31/2019, 2:01:09 PM
what is vscode extension for working with pyspark library?
Search for: what is vscode extension for working with pyspark library?
satya - 8/31/2019, 2:43:51 PM
VS code official python extension docs
satya - 8/31/2019, 2:48:58 PM
How to install pyspark?
satya - 8/31/2019, 3:02:12 PM
where does pip install the packages?
satya - 8/31/2019, 3:02:39 PM
There is some local info on installing pyspark
satya - 8/31/2019, 3:06:42 PM
where does pip install the packages? on SOF
satya - 8/31/2019, 3:19:02 PM
Gave up and going to try
c:\any-dir>pip install pyspark
satya - 8/31/2019, 3:19:24 PM
where is the official pyspark distribution?
satya - 8/31/2019, 3:21:09 PM
Here is the PySpark API: This is at Spark Apache site
satya - 8/31/2019, 3:31:06 PM
Let's see what's here: Book reference Learning PySpark
A bit disappointed, as I am looking for direct instructions to get started: installing standalone Spark (if needed), the pyspark library for Python, an IDE such as VSCode, etc.
satya - 8/31/2019, 3:35:42 PM
It is really sad, the links in the book don't seem to work any more :(
satya - 9/1/2019, 11:20:07 AM
How to install spark standalone on windows
satya - 9/1/2019, 11:22:51 AM
Here are notes from the Sams Apache Spark in 24 Hours book on installing Spark
satya - 9/1/2019, 11:24:10 AM
This is a good book overall
satya - 9/1/2019, 11:24:31 AM
Yes you need 7zip: http://7-zip.org/download.html
satya - 9/1/2019, 11:28:46 AM
we are advised to set the following
setx /M _JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"
//This tells the JVM to prefer the IPv4 stack over IPv6; Hadoop and Spark networking are known to misbehave with IPv6 on some setups
satya - 9/1/2019, 2:44:07 PM
Here is a link where everyone is using the common binaries for Hadoop
satya - 9/1/2019, 3:40:29 PM
What are these about?
Basic, innocuous, goal of syntax checking for PySpark in VSCode on a windows box has now led to: 8 hours so far, a new laptop, spark, python, hadoop, 7zip, git, and still on the move....
So why?
Apparently spark stand alone clusters on windows needs some Hadoop binaries.
Downloading them from Apache Hadoop is a 300MB affair, and I am not even sure I got the required bin directory.
More searching leads to a git site called "winutils" that has binaries built for each version of Hadoop: just the bin directory.
Well, here I am downloading the whole repo, although there seems to be a way to download a partial repo using git! Well, for some other time.
satya - 9/1/2019, 5:00:25 PM
Paths
setx /M path "%path%;
//python: Notice no \bin
C:\satya\i\python374;
//java
C:\satya\i\jdk12\bin;
//Hadoop path
C:\satya\i\hadoop312-common-bin\bin"
satya - 9/1/2019, 5:03:36 PM
Nature of paths on my box prior to fixing anything
Path=C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;
C:\Windows\System32\WindowsPowerShell\v1.0\;
C:\Windows\System32\OpenSSH\;
C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;
C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;
C:\Program Files\Intel\WiFi\bin\;
C:\Program Files\Common Files\Intel\WirelessCommon\;
C:\satya\i\git\Git\cmd;
C:\Users\satya\AppData\Local\Microsoft\WindowsApps;
C:\satya\i\vscode\bin;
C:\Users\satya\AppData\Local\GitHubDesktop\bin
PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
satya - 9/1/2019, 5:04:09 PM
I thought I installed java!!!! what happened to it? Let me check
satya - 9/1/2019, 5:05:22 PM
It says Java SE 12 is installed. How come, then:
the environment variable JAVA_HOME is not there
java is not in the path
satya - 9/2/2019, 12:46:39 PM
How best to setup system environment variables in windows 10
Search for any of these
Environment
Edit Environment variables
Do not search for
Settings
System settings
System properties
Choose "Edit Environment variables"
That takes you to a dialog "System Properties"
Choose at the bottom: Environment variables
It is best you edit these environment variables this way
and not through command line.
satya - 9/2/2019, 12:48:42 PM
What I have set up, before testing any of it
System level Environment variables:
#***************************************
#Java options to disable ipv6
#***************************************
_JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"
JAVA_HOME "%install-dir%\jdk12"
HADOOP_HOME C:\satya\i\hadoop312-common-bin
#***************************************
#Add the following paths to the system path variable
#***************************************
#Java path
%install-dir%\jdk12\bin
#python-path
%install-dir%\python374
#hadoop-path
%install-dir%\hadoop312-common-bin\bin
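As a sanity check before testing, a short Python sketch can confirm which of the expected variables actually landed in the environment. The variable names and values below are the ones from this journal; the `missing_vars` helper and the `fake_env` dictionary are my own, for illustration only.

```python
import os

# Environment variables this Spark setup expects (values are
# examples from this journal; adjust for your machine).
required = {
    "JAVA_HOME": r"C:\satya\i\jdk12",
    "HADOOP_HOME": r"C:\satya\i\hadoop312-common-bin",
}

def missing_vars(env, required):
    """Return the names of required variables absent from env."""
    return [name for name in required if name not in env]

# Simulated check against a fake environment for illustration;
# against the real one you would pass os.environ instead.
fake_env = {"JAVA_HOME": r"C:\satya\i\jdk12"}
print(missing_vars(fake_env, required))  # ['HADOOP_HOME']
```

Running it with `os.environ` after a fresh console open (not the one where you ran setx) shows whether the system-level edits took effect.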
satya - 9/2/2019, 12:58:15 PM
WARNING: An illegal reflective access operation has occurred
Search for: WARNING: An illegal reflective access operation has occurred
satya - 9/2/2019, 1:02:11 PM
Here the problem appears to be with a java version. Preferred one appears to be Java 8
satya - 9/2/2019, 1:03:28 PM
As the 24 Hours book suggested, I had also done
mkdir C:\tmp\hive
C:\Hadoop\bin\winutils.exe chmod 777 /tmp/hive
satya - 9/2/2019, 1:26:48 PM
How do I tell vscode where python is installed?
satya - 9/2/2019, 2:06:19 PM
A few notes on this
There are 2 Pythons
The first one I installed from the Microsoft app store
The second one manually from python.org
Neither is the Anaconda distribution
It looks like, at least on the command line inside VSCode, it picks up python.exe from the path, which I have explicitly set up.
satya - 9/2/2019, 2:06:54 PM
Here is an explicit document on setting up python environments
satya - 9/2/2019, 2:08:10 PM
So the fact is
By default, the Python extension looks for and uses the first Python interpreter it finds in the system path. If it doesn't find an interpreter, it issues a warning. On macOS, the extension also issues a warning if you're using the OS-installed Python interpreter, because you typically want to use an interpreter you install directly.
satya - 9/2/2019, 2:08:25 PM
first Python interpreter it finds in the system path
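That "first interpreter on the system path" lookup can be sketched in a few lines of Python. The `first_on_path` helper is my own illustration of the mechanism, not VSCode's actual code:

```python
import os

def first_on_path(exe, path):
    """Return the first directory on `path` that contains `exe`,
    mimicking how an interpreter is located on the system path."""
    for d in path.split(os.pathsep):
        if d and os.path.isfile(os.path.join(d, exe)):
            return d
    return None

# Example: find whichever python.exe wins on the current PATH.
print(first_on_path("python.exe", os.environ.get("PATH", "")))
```

This is why the order of entries in the path variable matters when two Pythons are installed: the earlier directory wins.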
satya - 9/2/2019, 2:14:28 PM
You can also use the command palette
Use Ctrl-Shift-P to open the command palette
Type "python: sel..."
Choose "Python: Select Interpreter"
It shows what is current and what is available. To my surprise I found that what I installed manually is 32-bit, while the app-store version is 64-bit.
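You can also check the bitness of whatever interpreter is currently running, without the palette. A small sketch using only the standard library:

```python
import struct
import sys

# Pointer size reveals whether this interpreter is 32-bit or 64-bit,
# the same distinction "Python: Select Interpreter" surfaces.
bits = struct.calcsize("P") * 8
print(f"Python {sys.version.split()[0]} is {bits}-bit")
```

Running this in the VSCode integrated terminal tells you which of the two installed Pythons is actually being picked up.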
satya - 9/2/2019, 3:23:19 PM
Next goal
Take an example pyspark program in python
look for "spark\examples\src\main\python\wordcount.py"
Looks like you can't use the spark interactive shell to do this
You have to submit the job instead
Or install pyspark for Python, and then you can run it as a regular Python program
I am going to try the submit option first
You can read about this in the next link below
satya - 9/2/2019, 3:24:05 PM
Apache spark docs on running python programs
satya - 9/2/2019, 3:35:00 PM
Pyspark: Unsupported class file major version 56
Search for: Pyspark: Unsupported class file major version 56
satya - 9/2/2019, 3:35:24 PM
Looks like after all one needs jdk8!! to run this version of Spark
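The "major version 56" in that error maps directly to a JDK release: since Java 1.2, the class file major version is the Java release number plus 44. A quick lookup (my own sketch, to make the connection clear):

```python
# Class file "major version" = Java release + 44 (from Java 1.2 onward).
# "Unsupported class file major version 56" therefore means classes
# compiled for Java 12 are being loaded by an older JVM.
def java_release(major_version):
    return major_version - 44

print(java_release(56))  # 12 -> Java 12 class files
print(java_release(52))  # 8  -> Java 8 class files
```

So the error is consistent with Spark's own JVM code expecting Java 8 while my jdk12 was on the path.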
satya - 9/2/2019, 3:36:39 PM
Anyway here is the example using submit
//Of course remove the new lines
//when you try this out
..\bin\spark-submit --master local[4] ..\examples\src\main\python\wordcount.py
satya - 9/3/2019, 2:15:52 PM
So what did I install
satya - 9/3/2019, 2:16:46 PM
Here is the dependency breakdown
satya - 9/3/2019, 2:16:54 PM
Primary dependencies are inside the boundary
satya - 9/3/2019, 2:18:00 PM
what did I test
satya - 9/3/2019, 2:18:22 PM
Key takeaways
satya - 9/3/2019, 2:18:40 PM
Next tasks
satya - 9/6/2019, 9:45:10 AM
How do I tell python where other python packages are?
Search for: How do I tell python where other python packages are?
satya - 9/6/2019, 9:51:35 AM
pip install pyspark failed with error code 1 in windows
Search for: pip install pyspark failed with error code 1 in windows
satya - 9/6/2019, 10:48:13 AM
This is a pseudo-problem. The real reason, and the solution:
Could not import pypandoc - required to package PySpark
To fix it do the following
pip install pypandoc
pip install pyspark
satya - 9/6/2019, 10:48:45 AM
The above is documented here, in equally brief terms however
satya - 9/6/2019, 10:49:23 AM
Better search for future: pip install pyspark failing on windows 10
satya - 9/6/2019, 10:50:03 AM
I had also uninstalled the 32-bit python374, which I had installed by mistake.
satya - 9/6/2019, 10:58:05 AM
Now that seems to cause problems of its own
VSCode can't find Python, although it is available as a Windows 64-bit app
I see it in: C:\Users\satya\AppData\Local\Microsoft\WindowsApps
Not sure why it is not in its own directory. If not, what should python_home then be set to??
satya - 9/6/2019, 10:58:22 AM
what is python_home for windows 10 python app?
satya - 9/6/2019, 11:01:19 AM
Using python on windows: A python.org doc
satya - 9/6/2019, 11:09:16 AM
Getting started with python code in VS Code
Getting started with python code in VS Code
These are the original instructions I followed to install Python. I had misread them and went for the easiest route, which is the Windows app. The other two options are Python from python.org or the Anaconda distribution.
I am going to go with python.org, although the Anaconda distribution may be more desirable. Will see.
satya - 9/6/2019, 2:40:05 PM
Back to JDK installation
As I had installed jdk12 before, I had to install jdk8 now
There are so many versions and types of it
Should I use the Oracle version, or will OpenJDK do?
There are so many URLs!!
Some asked for a login to Oracle
Look around and validate the right URL and get your JDK SE version (not entirely sure)
I did pick the base JDK SE version 8
It only installed the JRE
Choose "change directory" for a target directory such as \jdk8\jre
Then add that path to the system path as: c:\satya\i\jdk8\jre\bin\
and JAVA_HOME to \jdk8
I don't even know if this works. Will keep you posted
satya - 9/6/2019, 2:42:19 PM
Reinstalled python374 from python.org
Uninstall and reboot with no Python
Then install
Ask it to add environment variables; otherwise you will not find pip.exe, as it is in \scripts
That will add paths to
\python374\
\python374\scripts\
It will only add them to the user path
Add them to the system path as well
satya - 9/6/2019, 2:43:33 PM
Rerun pip installs
pip install pypandoc
pip install pyspark
satya - 9/6/2019, 2:43:53 PM
pip seems to be smart enough to cache the previous download somehow.
satya - 9/6/2019, 2:45:17 PM
Now in VSCode you have to tell it where the interpreter is
Ctrl-Shift-P
Then it looks like you have to set it for each folder inside the workspace
I have two, so I had to choose twice
Not sure if there is a way to do this globally
satya - 9/6/2019, 2:45:42 PM
Finally, the following import no longer shows an error in the .py file
from pyspark.sql import SparkSession
satya - 9/6/2019, 5:55:32 PM
spark-submit error The system cannot find the path specified.
Search for: spark-submit error The system cannot find the path specified.
satya - 9/6/2019, 6:27:33 PM
It never ends....
satya - 9/6/2019, 6:30:26 PM
So this seems to be coming from a wrong setting of JAVA_HOME
On jdk12 I had set it to
java_home=/jdk12
and the path as /jdk12/bin
However, in the Oracle JDK8 installation the directory structure was
/jdk8/jre/bin
and there are no subdirectories under /jdk8 other than ./jre
In my haste I had set java_home as
java_home=/jdk8
And that was the problem; it should be
java_home=/jdk8/jre
in cases where the JDK is really a "JRE" distribution
satya - 9/6/2019, 6:31:46 PM
So JAVA_HOME should be the parent directory of /bin
sometimes it is /jdk-x and sometimes /jdk-x/jre
satya - 9/6/2019, 6:33:05 PM
So for Java SE JDK 8 the settings are
JAVA_HOME=/jdk8/jre
path=%path%;/jdk8/jre/bin
Of course take care of the slashes and root directories
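The rule above (JAVA_HOME is the parent directory of bin) is easy to check mechanically. Here is a sketch of mine; the `valid_java_home` helper and the throwaway directory layout are illustrations, not part of any tool:

```python
import os
import tempfile

def valid_java_home(java_home):
    """JAVA_HOME must be the directory that directly contains bin/.
    For a JRE-only install that is .../jdk8/jre, not .../jdk8."""
    return os.path.isdir(os.path.join(java_home, "bin"))

# Illustration with a throwaway layout mimicking the JRE install:
root = tempfile.mkdtemp()
jre = os.path.join(root, "jdk8", "jre")
os.makedirs(os.path.join(jre, "bin"))
print(valid_java_home(jre))                         # True
print(valid_java_home(os.path.join(root, "jdk8")))  # False
```

Pointing the check at your real JAVA_HOME value would have caught my /jdk8 vs /jdk8/jre mistake immediately.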
satya - 9/6/2019, 7:11:04 PM
Sonnet I
FROM fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light'st flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel.
Thou that art now the world's fresh ornament
And only herald to the gaudy spring,
Within thine own bud buriest thy content
And, tender churl, makest waste in niggarding.
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.
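Before submitting this through Spark, the same counts can be cross-checked with plain Python. This is my own sketch using the same rules as the Spark job: split on whitespace, punctuation left attached, case preserved.

```python
from collections import Counter

# Sonnet I, exactly as fed to the word count.
sonnet = """FROM fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light'st flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel.
Thou that art now the world's fresh ornament
And only herald to the gaudy spring,
Within thine own bud buriest thy content
And, tender churl, makest waste in niggarding.
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee."""

counts = Counter(sonnet.split())
for word, n in counts.most_common(3):
    print(f"{word}: {n}")  # the: 6, thy: 4, to: 3
```

The top counts match the Spark output below, which is a handy sanity check on the whole installation.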
satya - 9/6/2019, 7:48:33 PM
Here are the words and their frequencies
the: 6
thy: 4
to: 3
might: 2
But: 2
by: 2
tender: 2
thine: 2
own: 2
world's: 2
FROM: 1
fairest: 1
creatures: 1
we: 1
desire: 1
increase,: 1
That: 1
thereby: 1
beauty's: 1
rose: 1
never: 1
die,: 1
as: 1
riper: 1
should: 1
time: 1
decease,: 1
His: 1
heir: 1
bear: 1
his: 1
memory:: 1
thou,: 1
contracted: 1
bright: 1
eyes,: 1
Feed'st: 1
light'st: 1
flame: 1
with: 1
self-substantial: 1
fuel,: 1
Making: 1
a: 1
famine: 1
where: 1
abundance: 1
lies,: 1
Thyself: 1
foe,: 1
sweet: 1
self: 1
too: 1
cruel.: 1
Thou: 1
that: 1
art: 1
now: 1
fresh: 1
ornament: 1
And: 1
only: 1
herald: 1
gaudy: 1
spring,: 1
Within: 1
bud: 1
buriest: 1
content: 1
And,: 1
churl,: 1
makest: 1
waste: 1
in: 1
Pity: 1
world,: 1
or: 1
else: 1
this: 1
glutton: 1
be,: 1
To: 1
eat: 1
due,: 1
grave: 1
and: 1
thee.: 1
satya - 9/6/2019, 7:49:08 PM
Code for it (imports and session setup filled in to make it runnable)
import sys
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
counts = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
counts2 = counts.sortBy(lambda x: x[1], False)
output = counts2.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))
satya - 9/6/2019, 7:49:18 PM
So the installation finally seems to work.
satya - 9/6/2019, 7:49:32 PM
I will document the details in a linear fashion soon and will post a link.
satya - 9/7/2019, 11:44:56 AM
Sonnet 2 may be better
When forty winters shall beseige thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery, so gazed on now,
Will be a tatter'd weed, of small worth held:
Then being ask'd where all thy beauty lies,
Where all the treasure of thy lusty days,
To say, within thine own deep-sunken eyes,
Were an all-eating shame and thriftless praise.
How much more praise deserved thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.