Spark on Azure

satya - 8/9/2019, 4:31:38 PM

How to run spark on Azure?

Search for: How to run spark on Azure?

satya - 8/9/2019, 4:50:07 PM

Difference between HDInsight and Azure Databricks

Search for: Difference between HDInsight and Azure Databricks

satya - 8/9/2019, 4:52:56 PM

Here is how these two are compared

satya - 8/9/2019, 5:02:12 PM

Azure Databricks docs

satya - 8/9/2019, 5:04:50 PM

Databricks pricing is here

satya - 8/11/2019, 4:57:32 PM

Databricks quick start: get a cluster, run a spark job

satya - 8/11/2019, 5:05:37 PM

The azure databricks service points to its own documentation here

satya - 8/11/2019, 5:07:35 PM

For Azure Databricks you will find the following here

Getting started

User guide

Admin guide

APIs

Release Notes

Delta lake guide

Spark Ref

ML

satya - 8/11/2019, 5:11:07 PM

Meaning of: Deploying Azure Databricks in your Azure Virtual Network

satya - 8/11/2019, 5:11:28 PM

Deploying Azure Databricks in your Azure Virtual Network

Search for: Deploying Azure Databricks in your Azure Virtual Network

satya - 8/11/2019, 5:12:53 PM

For now take the default option of "No" for this

satya - 8/11/2019, 5:14:20 PM

The workspace creation takes a few minutes

satya - 8/11/2019, 5:17:09 PM

what is the difference between a resource group and a managed resource group?

Search for: what is the difference between a resource group and a managed resource group?

satya - 8/11/2019, 5:17:48 PM

what are managed resource groups in Azure?

Search for: what are managed resource groups in Azure?

satya - 8/11/2019, 5:18:14 PM

I am asking this because creating a Databricks workspace automatically creates a managed resource group

satya - 8/11/2019, 5:18:23 PM

Databricks managed resource group

Search for: Databricks managed resource group

satya - 8/11/2019, 5:24:21 PM

A deep-dive Databricks PPT from MS Ignite

satya - 8/11/2019, 5:28:24 PM

The workspace gives you a starting point to explore Databricks. You have to launch it to see the following portal

This is almost like a portal inside a portal. There appears to be an additional login into this portal. This seems to be managed with SSO based on the logged-in subscription credentials.

satya - 8/13/2019, 5:10:10 PM

What is a Pool in a spark cluster?

It keeps idle instances on hand so that the startup time for a cluster is reduced.

satya - 8/13/2019, 5:12:12 PM

Warning/Error: This account may not have enough CPU cores, new Azure Spark Cluster

Search for: Warning/Error: This account may not have enough CPU cores, new Azure Spark Cluster
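
The warning comes down to simple arithmetic: the driver and every worker each consume CPU cores from the subscription's quota. Here is a minimal sketch of that check. The node sizes and the quota figure are hypothetical numbers chosen for illustration; they do not come from these notes or from Azure.

```python
def cores_needed(driver_cores: int, worker_cores: int, max_workers: int) -> int:
    """Total cores a cluster can consume at its largest size (driver + all workers)."""
    return driver_cores + worker_cores * max_workers


def fits_quota(needed: int, quota: int, in_use: int = 0) -> bool:
    """True when the cluster fits in the cores still available under the quota."""
    return needed <= quota - in_use


# Hypothetical figures: a 4-core driver, 4-core workers, autoscaling up to 8 workers,
# and a 10-core quota with nothing else running.
needed = cores_needed(driver_cores=4, worker_cores=4, max_workers=8)
print(needed, fits_quota(needed, quota=10))
```

With these made-up numbers the cluster needs more cores than the quota allows, which is the situation the warning describes. Lowering the maximum worker count or choosing smaller nodes brings the total down.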

satya - 8/13/2019, 5:13:06 PM

what are drivers and workers in spark clusters

Search for: what are drivers and workers in spark clusters

satya - 8/13/2019, 5:16:46 PM

Clusters are documented here

However, the details seem to be sparse

satya - 8/13/2019, 5:24:30 PM

Here is a public data set called Boston Safety Data set

satya - 8/13/2019, 5:26:16 PM

About the Boston Safety data set

This data set is in Parquet format
Updated daily
100,000 rows
10 MB
Records from 2011 to today

satya - 8/13/2019, 5:26:48 PM

open data initiative from MS is here

satya - 8/14/2019, 11:26:33 AM

Notice that the Azure Databricks workspace is a separate portal from the Azure portal

satya - 8/14/2019, 11:28:58 AM

Here is how to see if you have any existing clusters

Notice the menu on the left for navigating the Databricks experience.
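
The same list can presumably be fetched without the UI through the Databricks REST API. The sketch below only builds the request and does not send it. It assumes the Clusters API 2.0 `clusters/list` endpoint and bearer-token authentication; the workspace URL and token are hypothetical placeholders.

```python
import urllib.request

# Hypothetical placeholders: substitute a real workspace URL and access token.
WORKSPACE_URL = "https://example.azuredatabricks.net"
TOKEN = "dapi-example-token"


def list_clusters_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the cluster list."""
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


req = list_clusters_request(WORKSPACE_URL, TOKEN)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it. The shape of the response is not covered in these notes, so parsing is left out.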

satya - 8/14/2019, 11:43:30 AM

How do you find more about your subscription?

Search for "subscription" in the global search

Locate the subscription

Click on it to see the subscription blade

satya - 8/14/2019, 1:48:31 PM

Azure forums on MSDN

satya - 8/14/2019, 1:56:53 PM

Info on quota increase

satya - 8/14/2019, 1:57:14 PM

what DS series should be used for databricks in azure?

Search for: what DS series should be used for databricks in azure?

satya - 8/14/2019, 2:01:06 PM

Service limits are described here

satya - 8/15/2019, 10:38:19 AM

How to check current CORE quotas in Azure?

Search for: How to check current CORE quotas in Azure?

satya - 8/17/2019, 11:24:26 AM

So far there is no good answer. Will look into it later

satya - 8/17/2019, 11:42:13 AM

To create a notebook the cluster must be running

....Once the cluster is running, you can attach notebooks to the cluster and run Spark jobs.

satya - 8/17/2019, 11:45:24 AM

Notebooks are documented here

satya - 8/17/2019, 11:48:40 AM

It took a while to start the cluster: about 10 minutes
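
Since startup takes minutes rather than seconds, a script that depends on the cluster has to wait for it. Here is a minimal polling sketch. The `get_state` callable is a hypothetical stand-in for whatever reports the cluster state, and the state names `PENDING`, `RUNNING`, `TERMINATED` and `ERROR` are my assumption about what such a call would return.

```python
import time


def wait_until_running(get_state, timeout_s=900.0, poll_s=30.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it reports RUNNING.

    Returns True once the cluster is running, False if it ends up
    terminated or in error, or if timeout_s elapses first.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "RUNNING":
            return True
        if state in ("TERMINATED", "ERROR"):
            return False
        if clock() >= deadline:
            return False
        sleep(poll_s)


# Demo with a fake state source and a no-op sleep so it finishes immediately.
fake_states = iter(["PENDING", "PENDING", "RUNNING"])
print(wait_until_running(lambda: next(fake_states), sleep=lambda s: None))
```

The default timeout of 15 minutes is only a guess that leaves some margin over the 10 minutes observed here.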

satya - 8/17/2019, 12:26:04 PM

You can start and stop a cluster once you locate it using the cluster menu on the left

You will see the usual stop and start icons on the right-hand side of each named cluster row

satya - 8/17/2019, 12:26:27 PM

Icons look like this

satya - 8/17/2019, 12:37:47 PM

So the process of creating a notebook

1. Start the cluster first (even if it is an on-demand/auto-scale cluster)

2. Create the notebook by choosing that running cluster

3. Do work in the notebook

4. When done, shut down the cluster

5. The notebook seems to be stored somewhere, as it is still available even after the cluster is shut down
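
Steps 1 and 4 can probably be scripted instead of clicked. The sketch below builds the two requests without sending them. It assumes the Clusters API 2.0, where, as I understand it, `clusters/start` starts a terminated cluster and `clusters/delete` terminates it without removing its definition. The workspace URL, token, and cluster id are hypothetical placeholders.

```python
import json
import urllib.request

# Hypothetical placeholders for illustration only.
WORKSPACE_URL = "https://example.azuredatabricks.net"
TOKEN = "dapi-example-token"
CLUSTER_ID = "0000-000000-example"


def cluster_request(workspace_url, token, action, cluster_id):
    """Build (but do not send) a POST for a cluster action such as 'start' or 'delete'."""
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.0/clusters/{action}",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


start_req = cluster_request(WORKSPACE_URL, TOKEN, "start", CLUSTER_ID)   # step 1
stop_req = cluster_request(WORKSPACE_URL, TOKEN, "delete", CLUSTER_ID)   # step 4
print(start_req.full_url)
print(stop_req.full_url)
```

Neither request touches the notebook itself, which fits the observation in step 5 that the notebook stays available while the cluster is shut down.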

satya - 8/17/2019, 12:38:06 PM

Here is what a notebook may look like

satya - 8/17/2019, 12:38:46 PM

and this...

satya - 8/17/2019, 12:39:08 PM

Notice the ability to get plots at each field level in the query itself

satya - 8/17/2019, 12:40:00 PM

the plot configuration will look like this

satya - 8/17/2019, 12:42:03 PM

Next steps: Using Spark for ETL through Datalake storage
