PySpark Learn Journal 2

1. Map Reduce

2. RDDs

3. Transformations

4. Actions

5. Data Frames

SQL

RDDs

APIs

Data Frames


Data types
Numbers
Strings
Printing
Lists
Dictionaries
Booleans
Tuples
Sets
Comparison Operators
if, elif, else Statements
for Loops
while Loops
range()
List Comprehensions
Functions
Lambda Expressions
map and filter
Methods

Add a few more of my choosing on top to make it as comprehensive as possible

1. Difference between a SparkSession and a SparkContext (see the sketch after this list)

2. Run the sample of reading a file by typing out the defs
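A minimal sketch of the difference, assuming Spark 2.x or later: the SparkContext is the low-level entry point to the cluster (the RDD API), while a SparkSession wraps a context and layers the DataFrame/SQL APIs on top.

# A SparkSession (Spark 2.0+) wraps the older SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Basics").getOrCreate()

# the session carries the underlying context; RDD work goes through it
sc = spark.sparkContext
print(sc.appName)        # "Basics"

# pre-2.0 style code built the context directly:
# from pyspark import SparkContext
# sc = SparkContext(appName="Basics")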

1. Windows PySpark

2. Local notebook

3. AWS notebook

1. What is a SparkSession?

2. How do you get one?

3. What methods are on a session?

4. How to read a file (csv, json) using the session into a data frame (4-7 are worked in the sketch after this list)

5. How to show a data frame

6. How to see the schema of the data frame

7. How to explicitly set the schema for a file
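A sketch working questions 4-7 end to end, assuming a small people.csv (with a header row) and a people.json sit in the working directory; both file names and the name/age columns are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("Basics").getOrCreate()

# 4. read a csv (header row, let spark guess the types) and a json file
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)
df_json = spark.read.json("people.json")

# 5. show the first rows of the data frame
df_csv.show()

# 6. see the schema spark inferred
df_csv.printSchema()

# 7. explicitly set the schema instead of inferring it
schema = StructType([
    StructField("name", StringType(), True),   # True = nullable
    StructField("age", IntegerType(), True),
])
df_typed = spark.read.csv("people.csv", header=True, schema=schema)
df_typed.printSchema()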

Local Jupyter on my Windows machine seems to work!



# Python stuff for the pyspark package
# pyspark is a package/module
from pyspark.sql import SparkSession

# Create a new session, or get the existing one if this app is already running
spark = SparkSession.builder.appName("Basics").getOrCreate()

# Examine its type (pyspark.sql.session.SparkSession)
type(spark)
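For question 3 above, a few of the things I know hang off the session (not exhaustive; dir(spark) lists the rest):

spark.version        # Spark version string
spark.sparkContext   # the underlying SparkContext
spark.read           # DataFrameReader: csv, json, parquet, ...
spark.catalog        # table/database metadata
spark.sql("SELECT 1 AS one").show()   # run SQL directly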

@echo off

@rem the spark examples are at
@rem c:\satya\i\spark\examples\src\main\python
@rem notice the spark bin directory in its installation path

@rem *****************************************************
@rem this is how to submit a spark job using .py file
@rem example: rs1.cmd wordcount.py sonnets2.txt
@rem 
@rem rs1.cmd : This batch file
@rem wordcount.py : pyspark program
@rem sonnets2.txt : input argument
@rem
@rem pwd: C:\satya\data\code\pyspark 
@rem     \wordcount.py
@rem     \sonnets2.txt
@rem
@rem *****************************************************

@rem assumes spark's bin directory is on the PATH
@rem %1 = the .py file, %2 = its input argument
call spark-submit %1 %2
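The batch file above submits wordcount.py, which isn't shown in these notes; a minimal sketch of what such a script could look like (the name and the argument handling are assumed, not copied from the Spark examples):

# wordcount.py - count the words in the text file passed as the first argument
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile(sys.argv[1])    # e.g. sonnets2.txt
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, count in counts.collect():
        print(word, count)

    spark.stop()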