Spark on Azure

How to run Spark on Azure?


Difference between HDInsight and Azure Databricks


Here is how the two compare

Azure Databricks docs

Databricks pricing is here

Databricks quick start: get a cluster, run a Spark job

The Azure Databricks service points to its own documentation here

Getting started

User guide

Admin guide

APIs

Release Notes

Delta Lake guide

Spark Ref

ML

Meaning of: Deploying Azure Databricks in your Azure Virtual Network

Deploying Azure Databricks in your Azure Virtual Network


For now take the default option of "No" for this

The workspace creation takes a few minutes

What is the difference between a resource group and a managed resource group?


What are managed resource groups in Azure?


I am asking because creating a Databricks workspace automatically creates a managed resource group

Databricks managed resource group


A deep-dive Databricks PPT from MS Ignite

This is almost like a portal inside a portal. There appears to be an additional login into this portal, which seems to be handled with SSO based on the logged-in subscription credentials.

It keeps clusters handy so that cluster startup time is reduced.

Warning/Error: This account may not have enough CPU cores, new Azure Spark Cluster


What are drivers and workers in Spark clusters?


Clusters are documented here

However, the details seem to be sparse
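My working understanding: the driver coordinates the job and the workers run the tasks, and in Databricks each of them sits on its own VM. So the CPU-core warning above probably comes down to simple arithmetic. A rough sketch of that arithmetic follows. The numbers are made up for illustration (4 cores per node, a quota of 10), and ClusterPlan and fits_quota are my own names, not anything from Azure or Databricks.

```python
from dataclasses import dataclass


@dataclass
class ClusterPlan:
    # vCPUs of the chosen VM size (assumption: driver and workers use the same size)
    cores_per_node: int
    # upper bound of the autoscale range
    max_workers: int

    def cores_needed(self) -> int:
        # One driver node plus the workers, each occupying a full VM.
        return self.cores_per_node * (1 + self.max_workers)


def fits_quota(plan: ClusterPlan, quota: int, cores_in_use: int = 0) -> bool:
    # True when the fully scaled-out cluster still fits in the remaining core quota.
    return plan.cores_needed() <= quota - cores_in_use


# Illustrative numbers only: 4-core nodes, up to 8 workers, a quota of 10 cores.
plan = ClusterPlan(cores_per_node=4, max_workers=8)
print(plan.cores_needed())          # 36
print(fits_quota(plan, quota=10))   # False
```

With these numbers the cluster can only scale to one worker before hitting the quota, which would explain why the portal warns at cluster creation time.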

Here is a public data set called Boston Safety Data set


This data set is in Parquet format
updated daily
100,000 rows
10MB
Records from 2011 to today
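Since the data is Parquet, reading it from a notebook should be a one-liner on the Spark session. A minimal sketch, with assumptions: the storage account, container and path below are what I recall from the Azure Open Datasets page and should be checked there, and wasbs_url and load_parquet are helper names I made up. In a Databricks notebook the `spark` session is already defined.

```python
def wasbs_url(account: str, container: str, relative_path: str) -> str:
    # Standard Azure Blob Storage URL scheme understood by Spark.
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def load_parquet(spark, url: str):
    # spark.read.parquet returns a DataFrame; the schema comes from the Parquet files.
    return spark.read.parquet(url)


# Assumed location of the Boston Safety data set (verify on the data set's page).
BOSTON_SAFETY_URL = wasbs_url(
    account="azureopendatastorage",
    container="citydatacontainer",
    relative_path="Safety/Release/city=Boston",
)
print(BOSTON_SAFETY_URL)

# In a notebook attached to a running cluster:
# df = load_parquet(spark, BOSTON_SAFETY_URL)
# df.printSchema()
# print(df.count())
```

At roughly 100,000 rows and 10MB this should load quickly, so it looks like a reasonable first data set for trying out a new cluster.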

The open data initiative from MS is here

Notice that the Azure Databricks workspace is a separate portal from the Azure portal

Notice the menu on the left for navigating the Databricks experience.

Search for "subscription" in the global search

Locate subscription

Click on it to see subscription blade

Azure forums on MSDN

Info on quota increase

What DS series should be used for Databricks in Azure?


Service limits are described here

How to check current core quotas in Azure?


So far there is no good answer. Will look into it later.

....Once the cluster is running, you can attach notebooks to the cluster and run Spark jobs.

Notebooks are documented here

It took a while to start the cluster: about 10 minutes

You see the usual stop and start icons on the right-hand side of each named cluster's row

1. Start the cluster first (even if it is an on-demand/auto-scale cluster)

2. Create the notebook by choosing that running cluster

3. Do work in the notebook

4. When done, stop the cluster

5. The notebook seems to be stored somewhere outside the cluster, as it is still available even when the cluster is shut down
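Steps 1 and 4 above can also be scripted instead of clicking the start/stop icons. A hedged sketch using the Databricks Clusters REST API 2.0 (`clusters/start`, and `clusters/delete`, which terminates a cluster rather than removing it permanently). The host, token and cluster id are placeholders, cluster_request is my own helper name, and the sketch only builds the requests without sending them.

```python
import json
from urllib.request import Request


def cluster_request(host: str, token: str, action: str, cluster_id: str) -> Request:
    # "start" starts a terminated cluster; "delete" terminates a running one.
    if action not in ("start", "delete"):
        raise ValueError(f"unsupported action: {action}")
    url = f"https://{host}/api/2.0/clusters/{action}"
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return Request(url, data=body, headers=headers, method="POST")


# Placeholders: replace with the workspace URL, a personal access token and a real id.
HOST = "example.azuredatabricks.net"
TOKEN = "dapi-placeholder-token"
CLUSTER_ID = "0000-000000-placeholder"

start = cluster_request(HOST, TOKEN, "start", CLUSTER_ID)
stop = cluster_request(HOST, TOKEN, "delete", CLUSTER_ID)
print(start.get_method(), start.full_url)
print(stop.get_method(), stop.full_url)

# To actually send a request:
# from urllib.request import urlopen
# urlopen(start)
```

Given the roughly 10 minute startup, the start call returns well before the cluster is usable, so a real script would still need to wait for the cluster to reach the running state before attaching a notebook.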

Notice the ability to get plots at each field level in the query itself

Next steps: Using Spark for ETL through Data Lake storage