AWS Databricks setup
satya - 7/11/2020, 1:35:24 PM
Key ideas
1. Free community edition with X Gig
2. DBFS - Databricks File System
3. Define tables as an abstraction over your data files
satya - 7/11/2020, 1:36:25 PM
Try databricks url: https://databricks.com/try-databricks
satya - 7/11/2020, 1:44:36 PM
First steps
Sign up for the community edition
Confirm your email
Log in
satya - 7/11/2020, 3:08:06 PM
Managing clusters is documented here
satya - 7/11/2020, 3:08:33 PM
To create a notebook
1. Create a cluster
2. Associate that cluster with the notebook
satya - 7/11/2020, 3:20:30 PM
Choose Python as the language
satya - 7/11/2020, 3:20:39 PM
Shift-Enter to run the cell and go to the next cell
satya - 7/11/2020, 3:49:31 PM
Use Tab to autocomplete
satya - 7/11/2020, 3:50:50 PM
How do you create a table from a file?
1. Go to Data
2. Click Add Data
3. Browse to choose a CSV file
4. Preview the file to select columns and their types
5. Give the table a name
6. Save the table
7. Now you can read the table in the notebook
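As an alternative to the table UI above, I believe the uploaded file can also be read straight from DBFS without defining a table first. The path below is a hypothetical example (Add Data uploads typically land under /FileStore/tables/), so adjust it to the actual upload location. `spark` is the session Databricks notebooks predefine:

```python
# Sketch: read the uploaded CSV directly from DBFS instead of
# defining a table first. The path is hypothetical - point it at
# wherever the Add Data step actually put the file.
df = spark.read.csv(
    "/FileStore/tables/my_data.csv",  # hypothetical upload path
    header=True,        # first row holds the column names
    inferSchema=True,   # guess column types, like the preview step
)
df.show()
```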
satya - 7/11/2020, 3:51:36 PM
How to read the table in the notebook
# sqlContext is predefined in a Databricks notebook; no import needed
df = sqlContext.sql("select * from mytable")  # the table saved above
df.show()
satya - 7/11/2020, 3:51:53 PM
Everything is case sensitive
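A plain-Python illustration of that point (no Spark needed): names are case sensitive, which is why the misspelling "sqlcontext" raises a NameError in a notebook where only "sqlContext" is defined. The stand-in object here is just for the demo:

```python
# Python names are case sensitive, which is why the misspelling
# "sqlcontext" fails in a notebook that only defines "sqlContext".
sqlContext = object()   # stand-in for the notebook's predefined handle

try:
    sqlcontext          # wrong casing: this name was never defined
    error_seen = None
except NameError as err:
    error_seen = err

print(error_seen)       # → name 'sqlcontext' is not defined
```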
satya - 7/11/2020, 3:52:54 PM
Mistakes I made as I created my first notebook
1. I terminated the cluster thinking I was stopping it and could come back to it later. There is no option to restart a terminated cluster.
2. I had to clone that cluster (or create a new one, I suppose), wait for it to start running, and attach it to the notebook when the notebook was created
3. If the cluster is terminated as in (1), the notebook creation dialog will not have a drop-down for a cluster. FYI
4. I misspelled "sqlContext" as "sqlcontext". It should be sqlContext
5. Shift-Enter will run the cell and open a new cell
6. Tab appears to trigger auto completion
7. I skipped ".sql" after "sqlContext" and got strange errors like "Hive object not found". So make sure it is sqlContext.sql(""). I know, it is all there. But eagerness ...
satya - 7/11/2020, 3:54:20 PM
Now...
It remains to be seen whether the notebook, cluster, and everything else will still be there when I come back after a few hours.
I am wary of explicitly terminating the cluster. Instead I will see if the notebook auto-restarts it when invoked. To be seen