Spark on Azure
satya - 8/9/2019, 4:31:38 PM
How to run spark on Azure?
satya - 8/9/2019, 4:50:07 PM
Difference between HDInsight and Azure Databricks
satya - 8/11/2019, 4:57:32 PM
Databricks quick start: get a cluster, run a spark job
satya - 8/11/2019, 5:05:37 PM
The azure databricks service points to its own documentation here
satya - 8/11/2019, 5:07:35 PM
For Azure Databricks, you will find the following here:
Getting started
User guide
Admin guide
APIs
Release Notes
Delta lake guide
Spark Ref
ML
satya - 8/11/2019, 5:11:07 PM
Meaning of: Deploying Azure Databricks in your Azure Virtual Network
satya - 8/11/2019, 5:11:28 PM
Deploying Azure Databricks in your Azure Virtual Network
satya - 8/11/2019, 5:12:53 PM
For now take the default option of "No" for this
satya - 8/11/2019, 5:14:20 PM
The workspace creation takes a few minutes
satya - 8/11/2019, 5:17:09 PM
what is the difference between a resource group and a managed resource group?
satya - 8/11/2019, 5:17:48 PM
what are managed resource groups in Azure?
satya - 8/11/2019, 5:18:14 PM
I am asking this because creating a Databricks workspace automatically creates a managed resource group
satya - 8/11/2019, 5:18:23 PM
Databricks managed resource group
satya - 8/11/2019, 5:24:21 PM
A deep-dive Databricks PPT from MS Ignite
satya - 8/11/2019, 5:28:24 PM
The workspace gives you a starting point to explore Databricks. You have to launch it to see the following portal.
This is almost like a portal inside a portal. There appears to be an additional login into this portal. This seems to be managed with SSO, based on the logged-in subscription credentials.
satya - 8/13/2019, 5:10:10 PM
What is a Pool in a spark cluster?
It keeps a set of idle instances handy, so the startup time for a cluster is reduced.
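To make the idea of a pool concrete, here is a minimal sketch of what a pool definition might look like as a request body. The field names are my recollection of the Databricks Instance Pools API 2.0 and should be treated as assumptions to check against the API docs; the pool name, node type, and sizes are made up for illustration.

```python
import json


def instance_pool_payload(name, node_type_id, min_idle, max_capacity, idle_minutes):
    """Build (but do not send) a body describing an instance pool.

    Field names are assumed from the Instance Pools API 2.0; verify before use.
    """
    if min_idle > max_capacity:
        raise ValueError("min_idle cannot exceed max_capacity")
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        # Instances kept warm so clusters can start without waiting for new VMs.
        "min_idle_instances": min_idle,
        # Upper bound on idle plus in-use instances in the pool.
        "max_capacity": max_capacity,
        # How long an extra idle instance may sit before it is released.
        "idle_instance_autotermination_minutes": idle_minutes,
    }


# Hypothetical values, for illustration only.
payload = instance_pool_payload("demo-pool", "Standard_DS3_v2", 2, 10, 60)
print(json.dumps(payload, indent=2))
```

The trade-off is that idle instances held in the pool are still running VMs, so a larger `min_idle_instances` buys faster startup at the cost of paying for machines that are waiting around.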
satya - 8/13/2019, 5:12:12 PM
Warning/Error: This account may not have enough CPU cores, new Azure Spark Cluster
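A rough way to see why this warning shows up: every node in the cluster (the driver plus each worker) consumes cores from the subscription's quota. The sketch below just does that arithmetic. The quota value is a hypothetical number, not my actual quota; the 4 cores per node is the vCPU count of a Standard_DS3_v2 VM.

```python
def cores_required(num_workers, cores_per_worker, driver_cores=None):
    """Total cores for one driver plus num_workers workers.

    If driver_cores is not given, the driver is assumed to use the
    same node type as the workers.
    """
    if driver_cores is None:
        driver_cores = cores_per_worker
    return driver_cores + num_workers * cores_per_worker


def fits_quota(required, quota, in_use=0):
    """True if the requested cores fit in what is left of the quota."""
    return required <= quota - in_use


DS3_V2_CORES = 4          # Standard_DS3_v2 has 4 vCPUs
HYPOTHETICAL_QUOTA = 10   # assumed quota, for illustration only

for workers in (1, 2, 8):
    need = cores_required(workers, DS3_V2_CORES)
    print(workers, "workers ->", need, "cores, fits:",
          fits_quota(need, HYPOTHETICAL_QUOTA))
```

With an auto-scaling cluster, the number to check is the maximum worker count, since that is what the cluster may grow to.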
satya - 8/13/2019, 5:13:06 PM
what are drivers and workers in spark clusters
satya - 8/13/2019, 5:16:46 PM
Clusters are documented here
However, details seem to be sparse
satya - 8/13/2019, 5:24:30 PM
Here is a public data set called Boston Safety Data set
satya - 8/13/2019, 5:26:16 PM
About the Boston Safety data set
This data set is in parquet format
updated daily
100,000 rows
10MB
Records from 2011 to today
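Since the data set is in parquet format, reading it from a notebook could look roughly like the sketch below. The storage account, container, and relative path are what I believe the Azure Open Datasets pages list for this data set, so treat them as assumptions to verify. The sketch also assumes the container allows anonymous reads from the cluster, which is not verified here. `spark` is the SparkSession that a Databricks notebook provides.

```python
def wasbs_path(container, account, relative_path):
    """Build a wasbs:// URL for a path inside an Azure Blob Storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


# Assumed location of the Boston Safety data set (verify before use).
ACCOUNT = "azureopendatastorage"
CONTAINER = "citydatacontainer"
RELATIVE_PATH = "Safety/Release/city=Boston"


def load_boston_safety(spark):
    """Read the parquet files into a DataFrame.

    `spark` is expected to be an existing SparkSession, such as the one
    predefined in a Databricks notebook. Not called in this sketch.
    """
    return spark.read.parquet(wasbs_path(CONTAINER, ACCOUNT, RELATIVE_PATH))


print(wasbs_path(CONTAINER, ACCOUNT, RELATIVE_PATH))
```

In a notebook attached to a running cluster, `df = load_boston_safety(spark)` followed by `df.printSchema()` would be a reasonable first look at the columns.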
satya - 8/13/2019, 5:26:48 PM
The open data initiative from MS is here
satya - 8/14/2019, 11:26:33 AM
Notice that the Azure Databricks workspace is a separate portal from the Azure portal
satya - 8/14/2019, 11:28:58 AM
Here is how to see if you have any existing clusters
Notice the menu on the left for navigating the Databricks experience.
satya - 8/14/2019, 11:43:30 AM
How do you find more about your subscription?
Search for "subscription" in the global search
Locate subscription
Click on it to see subscription blade
satya - 8/14/2019, 1:57:14 PM
what DS series should be used for databricks in azure?
satya - 8/15/2019, 10:38:19 AM
How to check current CORE quotas in Azure?
satya - 8/17/2019, 11:24:26 AM
So far there is no good answer. Will look into it later
satya - 8/17/2019, 11:42:13 AM
To create a notebook the cluster must be running
....Once the cluster is running, you can attach notebooks to the cluster and run Spark jobs.
satya - 8/17/2019, 11:48:40 AM
It took a while to start the cluster. About 10 minutes
satya - 8/17/2019, 12:26:04 PM
You can start and stop a cluster once you locate the cluster using the cluster menu on the left
You see your usual stop, start icons on the right hand side of each named cluster line
satya - 8/17/2019, 12:26:27 PM
Icons look like this
satya - 8/17/2019, 12:37:47 PM
So the process of creating a notebook is:
1. Start the cluster first (even if it is an on-demand/auto-scale cluster)
2. Create the notebook by choosing that running cluster
3. Do work in the notebook
4. When done, stop the cluster
5. The notebook seems to be stored somewhere, as it is still available after the cluster is shut down
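The start and stop steps above can also be done without the UI. Below is a minimal sketch that only builds the HTTP requests, using the Databricks Clusters API 2.0 as I understand it: `clusters/start` starts a terminated cluster, and `clusters/delete` terminates (stops) it while keeping its configuration. The workspace URL, token, and cluster id are hypothetical placeholders, and nothing is sent over the network.

```python
import json
import urllib.request


def clusters_request(workspace_url, token, action, cluster_id):
    """Build (but do not send) a POST request to the Clusters API 2.0."""
    url = f"{workspace_url.rstrip('/')}/api/2.0/clusters/{action}"
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical placeholders, for illustration only.
WORKSPACE_URL = "https://example.azuredatabricks.net"
TOKEN = "dapi-not-a-real-token"
CLUSTER_ID = "0000-000000-example0"

start = clusters_request(WORKSPACE_URL, TOKEN, "start", CLUSTER_ID)
# "delete" terminates the cluster; it does not remove its definition.
stop = clusters_request(WORKSPACE_URL, TOKEN, "delete", CLUSTER_ID)

print(start.get_method(), start.full_url)
print(stop.get_method(), stop.full_url)
```

Passing either request to `urllib.request.urlopen` would actually send it. Since starting took about 10 minutes here, a real script would need to poll the cluster state before attaching a notebook.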
satya - 8/17/2019, 12:38:06 PM
Here is what a notebook may look like
satya - 8/17/2019, 12:38:46 PM
and this...
satya - 8/17/2019, 12:39:08 PM
Notice the ability to get plots at each field level in the query itself
satya - 8/17/2019, 12:40:00 PM
the plot configuration will look like this
satya - 8/17/2019, 12:42:03 PM
Next steps: Using Spark for ETL through Datalake storage