PySpark Learn Journal 2

1. Map Reduce

2. RDDs

3. Transformations

4. Actions

5. Data Frames

SQL

RDDs

APIs

Data Frames


Data types
Numbers
Strings
Printing
Lists
Dictionaries
Booleans
Tuples
Sets
Comparison Operators
if, elif, else Statements
for Loops
while Loops
range()
List Comprehensions
Functions
Lambda Expressions
map and filter
Methods

Add a few more of my choosing on top to make it as comprehensive as possible

1. Difference between a SparkSession and a SparkContext (see the sketch after this list)

2. Run the sample of reading a file by typing out the defs
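A minimal sketch of the difference, assuming Spark 2.x or later: the SparkContext is the low-level entry point to the cluster (the RDD API), while a SparkSession wraps a context and layers the DataFrame/SQL APIs on top.

# A SparkSession (Spark 2.0+) wraps the older SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Basics").getOrCreate()

# the session carries the underlying context; RDD work goes through it
sc = spark.sparkContext
print(sc.appName)        # "Basics"

# pre-2.0 style code built the context directly:
# from pyspark import SparkContext
# sc = SparkContext(appName="Basics")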

1. Windows PySpark

2. Local notebook

3. AWS notebook

1. What is a SparkSession?

2. How do you get one?

3. What methods are on a session?

4. How to read a file (csv, json) using the session into a data frame (4-7 are worked in the sketch after this list)

5. How to show a data frame

6. How to see the schema of the data frame

7. How to explicitly set the schema for a file
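A sketch working questions 4-7 end to end, assuming a small people.csv (with a header row) and a people.json sit in the working directory; both file names and the name/age columns are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("Basics").getOrCreate()

# 4. read a csv (header row, let spark guess the types) and a json file
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)
df_json = spark.read.json("people.json")

# 5. show the first rows of the data frame
df_csv.show()

# 6. see the schema spark inferred
df_csv.printSchema()

# 7. explicitly set the schema instead of inferring it
schema = StructType([
    StructField("name", StringType(), True),   # True = nullable
    StructField("age", IntegerType(), True),
])
df_typed = spark.read.csv("people.csv", header=True, schema=schema)
df_typed.printSchema()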

Local Jupyter on my Windows machine seems to work!



# Python stuff for the pyspark package
# pyspark is a package/module
from pyspark.sql import SparkSession

# Create a new session, or get the existing one if this app is already running
spark = SparkSession.builder.appName("Basics").getOrCreate()

# Examine its type (pyspark.sql.session.SparkSession)
type(spark)
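For question 3 above, a few of the things I know hang off the session (not exhaustive; dir(spark) lists the rest):

spark.version        # Spark version string
spark.sparkContext   # the underlying SparkContext
spark.read           # DataFrameReader: csv, json, parquet, ...
spark.catalog        # table/database metadata
spark.sql("SELECT 1 AS one").show()   # run SQL directly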

@echo off

@rem the spark examples are at
@rem c:\satya\i\spark\examples\src\main\python
@rem notice the spark bin directory in its installation path

@rem *****************************************************
@rem this is how to submit a spark job using .py file
@rem example: rs1.cmd wordcount.py sonnets2.txt
@rem 
@rem rs1.cmd : This batch file
@rem wordcount.py : pyspark program
@rem sonnets2.txt : input argument
@rem
@rem pwd: C:\satya\data\code\pyspark 
@rem     \wordcount.py
@rem     \sonnets2.txt
@rem
@rem *****************************************************

@rem assumes spark's bin directory is on the PATH
@rem %1 = the .py file, %2 = its input argument
call spark-submit %1 %2
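The batch file above submits wordcount.py, which isn't shown in these notes; a minimal sketch of what such a script could look like (the name and the argument handling are assumed, not copied from the Spark examples):

# wordcount.py - count the words in the text file passed as the first argument
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile(sys.argv[1])    # e.g. sonnets2.txt
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, count in counts.collect():
        print(word, count)

    spark.stop()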