AWS Databricks setup

satya - 7/11/2020, 1:35:24 PM

Key ideas

1. Free Community Edition with X GB of storage

2. DBFS - Databricks File System

3. Define tables to get a tabular abstraction over data files

satya - 7/11/2020, 1:36:25 PM

Try Databricks URL: https://databricks.com/try-databricks

satya - 7/11/2020, 1:44:36 PM

First steps

Sign up for the Community Edition

Confirm via email

Log in

satya - 7/11/2020, 1:47:26 PM

Databricks documentation

satya - 7/11/2020, 3:08:06 PM

Managing clusters is documented here

satya - 7/11/2020, 3:08:33 PM

To create a notebook

1. Create a cluster

2. Associate that cluster with the notebook

satya - 7/11/2020, 3:20:30 PM

Choose Python as the notebook language

satya - 7/11/2020, 3:20:39 PM

Shift-Enter runs the cell and moves to the next cell

satya - 7/11/2020, 3:49:31 PM

Use Tab for auto-completion

satya - 7/11/2020, 3:50:50 PM

How do you create a table (or read a file)

1. Go to Data

2. Click Add Data

3. Browse to choose a CSV file

4. Preview the file to select columns and their types

5. Give the table a name

6. Save the table

7. You can then read the table in the notebook

satya - 7/11/2020, 3:51:36 PM

How to read the table in the notebook


# sqlContext is predefined in a Databricks Python notebook; no import is needed
df = sqlContext.sql("select * from mytable_above")  # the table name you saved above

df.show()

satya - 7/11/2020, 3:51:53 PM

Everything is case-sensitive

satya - 7/11/2020, 3:52:54 PM

Mistakes I made as I created my first notebook

1. I terminated the cluster, thinking I was stopping it and would come back to it later. There is no option to restart a terminated cluster.

2. I had to clone that cluster (or create a new one, I suppose), wait for it to start running, and attach it to the notebook when the notebook was created.

3. If the cluster is terminated as in (1), the notebook-creation dialog will not show a drop-down for a cluster. FYI.

4. I misspelled "sqlcontext". It should be sqlContext.

5. Shift-Enter runs the cell and opens a new cell.

6. Tab appears to be for auto-completion.

7. I skipped ".sql" after "sqlContext" and got strange errors like "Hive object not found". So make sure it is sqlContext.sql(""). I know, it is all there. But eagerness ...

satya - 7/11/2020, 3:54:20 PM

Now...

It remains to be seen whether the notebook, the cluster, and everything else will still be there when I come back after a few hours.

I am wary of explicitly terminating the cluster. Instead I will see whether the notebook auto-restarts it when invoked. To be seen.