PySpark Learn Journal 2
satya - 7/10/2020, 8:12:17 PM
Key ideas to reintroduce
1. Map Reduce
2. RDDs
3. Transformations
4. Actions
5. Data Frames
satya - 7/10/2020, 8:16:54 PM
What are the URLs for Spark
SQL
RDDs
APIs
Data frames
satya - 7/11/2020, 12:13:48 PM
Write a python program that demos the following
Data types
Numbers
Strings
Printing
Lists
Dictionaries
Booleans
Tuples
Sets
Comparison Operators
if,elif, else Statements
for Loops
while Loops
range()
list comprehension
functions
lambda expressions
map and filter
methods
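A sketch covering the topics in this list, in one runnable script (names and values are arbitrary):

```python
# Data types: numbers and strings
x = 7
pi = 3.14
name = "spark"
greeting = f"hello {name}"

# Printing
print(greeting)

# Lists, dictionaries, booleans, tuples, sets
nums = [1, 2, 3]
ages = {"alice": 30, "bob": 25}
flag = True
point = (2, 3)
unique = {1, 2, 2, 3}        # duplicates collapse: {1, 2, 3}

# Comparison operators with if, elif, else
if x > 10:
    size = "big"
elif x > 5:
    size = "medium"
else:
    size = "small"

# for loop, while loop, range()
total = 0
for n in nums:
    total += n
count = 0
while count < 3:
    count += 1
evens = list(range(0, 10, 2))

# List comprehension
squares = [n * n for n in nums]

# Functions
def double(n):
    return n * 2

# Lambda expressions with map and filter
doubled = list(map(lambda n: n * 2, nums))
odds = list(filter(lambda n: n % 2 == 1, nums))

# Methods (on strings and lists)
upper = name.upper()
nums.append(4)

print(size, total, evens, squares, doubled, odds, upper, nums)
```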
satya - 7/11/2020, 12:14:22 PM
Add a few more of my choosing on top to make it as comprehensive as possible
satya - 7/12/2020, 11:27:33 AM
Next steps
1. Diff between a spark session and a spark context
2. Run the sample of reading a file by typing defs
satya - 7/12/2020, 11:28:20 AM
Run the csv file read on
1. windows pyspark
2. local notebook
3. aws notebook
satya - 7/12/2020, 11:37:16 AM
Topics to understand DF: Lesson 1
1. what is a sparksession
2. how do you get one
3. what methods are on a session
4. how to read a file (csv, json) using session into a data frame
5. how to show a data frame
6. how to see the schema of the data frame
7. How to explicitly set the schema for a file
satya - 7/12/2020, 11:44:03 AM
Local Jupyter on my windows seems to work!
satya - 7/12/2020, 11:56:18 AM
How to get a session
#Python stuff for pyspark package
#pyspark is a package/module
from pyspark.sql import SparkSession
#Create a new session or get one that is named
spark = SparkSession.builder.appName("Basics").getOrCreate()
#Examine its type
type(spark)
satya - 7/12/2020, 6:56:01 PM
Submitting pyspark program to run on windows: a batch file rs1.bat
@echo off
@rem the spark examples are at
@rem c:\satya\i\spark\examples\src\main\python
@rem notice the spark bin directory in its installation path
@rem *****************************************************
@rem this is how to submit a spark job using a .py file
@rem example: rs1.bat wordcount.py sonnet2.txt
@rem
@rem rs1.bat : This batch file
@rem wordcount.py : pyspark program
@rem sonnet2.txt : input argument
@rem
@rem pwd: C:\satya\data\code\pyspark
@rem \wordcount.py
@rem \sonnet2.txt
@rem
@rem *****************************************************
@rem submit the job: %1 is the .py file, %2 its input argument
spark-submit %1 %2