AWS Databricks setup

1. Free Community Edition with X GB

2. DBFS - Databricks File System

3. Define tables to get a structured abstraction over the data

Try Databricks URL: https://databricks.com/try-databricks

Sign up for the Community Edition

Confirm via the email link

Log in

Databricks documentation

Managing clusters is documented here

1. Create a cluster

2. Associate that cluster with the notebook

Choose Python as the language

Shift-Enter runs the cell and moves to the next one

Use Tab for autocompletion

1. Data

2. Add Data

3. Browse to choose a CSV file

4. Preview the file to select columns and their types

5. Give the table a name

6. Save the table

7. Now you can read the table in the notebook
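The "preview file" step above inspects the CSV's columns before saving it as a table. Outside Databricks, a rough local stand-in for that preview is the stdlib csv module (the file content and column names here are made up for illustration):

```python
import csv
import io

# Hypothetical CSV content standing in for the uploaded file
data = io.StringIO("name,age\nalice,30\nbob,25\n")
reader = csv.DictReader(data)
rows = list(reader)

print(reader.fieldnames)  # the column names, as shown in the preview step
print(rows)               # each row as a dict keyed by column name
```

Note that everything comes back as strings; choosing column types, as the preview dialog does, is a separate step.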


# sqlContext is predefined in Databricks notebooks; no import needed
df = sqlContext.sql("select * from my_table")  # the table saved above
df.show()

Everything is case-sensitive

1. I terminated the cluster thinking I was just stopping it and could come back later. There is no option to restart a terminated cluster.

2. I had to clone that cluster (or create a new one, I suppose), wait for it to start running, and attach it to the notebook once it was created.

3. If the cluster is terminated as in (1), the notebook-creation dialog will not show a cluster drop-down. FYI.

4. I misspelled "sqlcontext". It should be sqlContext.

5. Shift-Enter runs the cell and opens a new one.

6. Tab appears to do autocompletion.

7. I skipped ".sql" after "sqlContext" and got strange errors like HiveObject not found. So make sure it is sqlContext.sql(""). I know, it is all there. But eagerness ...
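The corrected pattern from item 7 is: call .sql() on the context, passing a "select * from <table>" string, and work with the rows that come back. A local sketch of that flow, using stdlib sqlite3 as a stand-in for sqlContext so it runs outside Databricks (the table name and contents are made up):

```python
import sqlite3

# sqlite3 substitutes for sqlContext here: same shape of flow,
# an object with a method that takes a SQL string and returns rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table my_table (name text, value integer)")
conn.executemany("insert into my_table values (?, ?)", [("a", 1), ("b", 2)])

rows = conn.execute("select * from my_table").fetchall()
print(rows)  # [('a', 1), ('b', 2)]
```

In the actual notebook the call is sqlContext.sql("select * from my_table"), which returns a DataFrame rather than a list of tuples.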

It remains to be seen whether the notebook, cluster, and everything else will still be there when I come back after a few hours.

I am wary of explicitly terminating the cluster. Instead I will see whether the notebook auto-restarts it when invoked. To be seen.