A Shortcut to Big Data - Lesson Plan 1, Azure and Spark: Smart Programmer Series

Much of the complexity of Big Data lies in its infrastructure. Storage clusters. Compute clusters. HDFS, YARN, and a multitude of tools and technologies that make distributed storage and computing possible.

What is the quickest way to become productive without losing the generality of your learning, without impeding your capacity to innovate, and without limiting the best you can be in the data space?

Let me assume further that you are a smart programmer who gains no real benefit from learning volumes about how these clusters are put together and managed. When was the last time you asked what is inside your computer, let alone your CPU?

Most cloud offerings (Azure, AWS, Google) now offer a level of abstraction where you can focus on distributed programming through Spark and Python and not worry about the clusters.

What follows is a week-long exercise to get you (smart programmer) up to speed writing a hello-world program in Spark on a Spark cluster in Azure. Sorry to disappoint: it is not a 24-hour exercise!!

I will first cover the high-level topics in this journey, followed by links, comments, and references for each topic. This week-long exercise will set up the environment you need to go further; from there you will need to learn a) notebooks, b) Python, and c) Spark in greater detail. That is what you want: more time spent learning programming and little to no time on infrastructure.

This is not as simple as you might think. It involved eliminating a number of choices, like building your own VMs, or even managed clusters like HDInsight. Preliminary research led me to "Databricks" in Azure, which makes this pretty straightforward. Even so, it is a week-long exercise, which you will need.

One may ask: why not AWS or Google? Perhaps you can do this there as well. If your Linux skills are rusty, the managed clusters of Azure make it a lot easier. Perhaps EMR in AWS is close enough. When I get there, I will post what that experience is like.

1. You do not need any prerequisites. Any Windows laptop that can access Azure will do, as everything you need is in the cloud. Even Linux experience is optional at this early stage.

2. Create an account in Azure. Don't use a free account; use the pay-as-you-go option with a credit card on file. Do this before venturing into learning anything.

3. Go through the Azure Fundamentals online training track that Azure offers. This will take anywhere from 8 to 16 hours. I highly recommend it. I have links below to where it is available. Do not skip this step, and make notes as you go through the training material.

4. Create a resource group in your account and subscription (you will know what these are from your learning). A resource group is like a folder for putting related purchases together.

5. Follow the quickstart lesson for Spark and Databricks to create a Databricks workspace.

6. Use the Databricks workspace to create a Spark cluster. This can be involved through no fault of yours: you will need to increase CPU core limits through support tickets to Azure. I will post the specifics in the details section, along with a journal I kept while doing this.

7. Understand how to stop and start the Spark cluster.

8. Create a notebook where you can enter Spark/Python code and execute it against the cluster.

9. Finally, plan the next steps to ACTUALLY learn the three main things you need, now that the "rails" are laid: notebooks, Spark, and Python in depth.

You do not need any prerequisites. Any Windows laptop that can access Azure will do, as everything you need is in the cloud. Even Linux experience is optional at this early stage.

Hopefully you have some understanding of Big Data:

That it is all about reading, processing, and writing data in parallel.

It is similar (though not identical) to how traditional tools processed large ETL jobs.

You have some idea of HDFS, MapReduce, YARN, Pig, Hive, the Hive metastore, formats like Avro and Parquet, databases like Bigtable and HBase, plus Sqoop, Presto, Spark, Oozie, etc., because you may have read a summary book on Hadoop. If not, spend a week or so reading "Sams Teach Yourself Hadoop in 24 Hours". It is a good, right-sized introduction. Just don't go deep into most of these technologies, as you want to focus mainly on Spark. But this level of introduction is freeing, because you then know the problem space and "what to ignore".

Book: Sams Teach Yourself Hadoop in 24 Hours (search for it by title)

I have some zig-zagged notes on these technologies here

You will find here some hand-drawn figures on a number of these technologies

Here are some of my inquiry notes into MapReduce
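To make the MapReduce idea concrete before you go further, here is a minimal word count (the "hello world" of MapReduce) expressed in PySpark. This is an illustrative sketch of the map/reduce pattern, assuming a Databricks-style notebook where a "spark" session object is already available.

    # Word count: the classic MapReduce example, in Spark's RDD API.
    # `spark` is the SparkSession that Databricks notebooks predefine.
    lines = spark.sparkContext.parallelize(["hello world", "hello spark"])
    counts = (lines
              .flatMap(lambda line: line.split())  # map: split lines into words
              .map(lambda word: (word, 1))         # map: emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))    # reduce: sum counts per word
    print(counts.collect())  # e.g. [('hello', 2), ('world', 1), ('spark', 1)]

The same computation in classic Hadoop would be a full Java MapReduce job; Spark lets you express it in a few lines, which is exactly why we focus on Spark.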

Don't use a free account. Use the pay-as-you-go option with a credit card on file. Do this before venturing into learning anything.

This is because, as you go through the Azure Fundamentals tutorial, you will be asked to look at a number of details in an Azure account. Having a legitimate account makes this easy.

Because resources are billed based on usage, the bill is minimal. This paid account also sets you up nicely for the Databricks clusters you will create.

I don't have any links to help you create this account, as I did that a very long time ago for something else and so happened to have one already.

However, as I went through the Azure training materials, I kept some notes here

Go through the Azure Fundamentals online training track that Azure offers. This will take anywhere from 8 to 16 hours. I highly recommend it. I have links below to where it is available. Do not skip this step, and make notes as you go through the training material chapters.

Here you will learn how Azure is organized and its vocabulary: regions, subscriptions, Azure Active Directory, security principals, resources, Resource Manager, IP addresses, networks, virtual machines (compute), storage, the marketplace, etc.

The link immediately above points to the notes I kept as I went through all the tracks in this training program. You will get a bird's-eye view there of what you will learn.

At the end of that track I was hoping to take the certification, but that turned out to be too much of a hassle, with exams and such!!! :)

Here is that learning path

Create a resource group in your account and subscription (you will know what these are from your learning). A resource group is like a folder for putting related purchases together.

You see "Resource Groups" mentioned often and often in Azure. Under an account, there are one or more subscriptions. Under each subscription there are one or more resource groups one for each type of need.

The goal of this resource group is to put all related resources (clusters, storage, compute, etc.) together so that common policies (like security or deletion) can be applied uniformly.
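The portal is the easiest way to create a resource group, but if you are curious what it looks like in code, here is a hedged sketch using the Azure Python SDK. The group name, location, and subscription ID are placeholder values of my choosing, not anything Azure prescribes.

    # Hedged sketch: creating a resource group with the Azure Python SDK.
    # Requires: pip install azure-identity azure-mgmt-resource
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient

    subscription_id = "<your-subscription-id>"  # placeholder
    client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

    # "rg-spark-learning" and "eastus" are example values; choose your own.
    client.resource_groups.create_or_update(
        "rg-spark-learning", {"location": "eastus"})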

Follow the quickstart lesson for Spark and Databricks to create a Databricks workspace.

Azure has a quickstart guide that walks you step by step through the rest of the steps here. Follow that lesson, using the account and resource group above to create the necessary assets (resources).

The quickstart guide is here

My experience with following that guide and doing the work is here

Use the Databricks workspace to create a Spark cluster. This can be involved through no fault of yours: you will need to increase CPU core limits through support tickets to Azure.

You will see the difficulties in creating this cluster, as the instructions are not entirely sufficient.

The journal above has the details of the failure.

Cluster creation fails with an error saying there are not enough cores to create the cluster.

This is because, by default, the number of cores you can use is limited. There is not only a limit on total cores but also a limit per "machine type" (VM type or VM series). By default this is 20 cores for a given series or type of virtual machine.

It is not easy to see these limits, nor is it obvious how to increase them or whether you would be charged for increasing them. You will find some of these findings at that link.
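One hedged tip from my own poking around: if you have the Azure CLI installed, "az vm list-usage --location <your-region> --output table" lists the current core usage and quota per VM family in a region, which is a quicker way to see where you stand than clicking through the portal.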

During the cluster creation process, carefully note the listed virtual machine type names and write a few of them down. Obviously you want to choose low-powered ones (based on their non-intuitive names). Keep this paper handy.

Now go to the help/support section and open a ticket to increase the limits. It is one of the options you will see as you navigate the help/support questionnaire to create a support ticket.

The options will present a list of virtual machine types to increase cores for. These don't exactly match the names you noted before; do your best to guess based on the names. In this process you will see the current core limit for that machine type and specify how many you want. I would say 100. Obviously you are not billed for increasing the limits, only for what you use.

The case will be submitted, and hopefully your new quota will be approved within minutes to hours.

Now go back to cluster creation; hopefully the options specified in the training guide will work and you will not see the insufficient-cores error. I wish Microsoft were more helpful in this particular case; it is hard to track down. If it fails again, repeat the exercise, knowing that you may not have guessed the core type or the number of cores right.

Good luck in this step.

The guide suggests the default on-demand cluster.

It has an automatic inactivity shutdown of 2 hours.

I changed it to 15 minutes. (I don't know my billing yet, but I don't think it is too much of a worry as long as you shut down promptly; even 2 hours may not be too expensive.)

Also, I doubt it matters whether it is an on-demand cluster or an autoscaling cluster.
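For reference, the same knobs show up if you define a cluster programmatically. Below is a hedged sketch of the payload the Databricks Clusters REST API (POST /api/2.0/clusters/create) accepts, written as a Python dict. The field names come from that API; the values are placeholders, and the guide's UI route is all you actually need.

    # Hedged sketch: a cluster definition for the Databricks Clusters API.
    cluster_spec = {
        "cluster_name": "learning-cluster",    # hypothetical name
        "spark_version": "<runtime-version>",  # pick one your workspace lists
        "node_type_id": "<vm-type-name>",      # the VM type name you noted down
        "num_workers": 2,                      # keep it small on purpose
        "autotermination_minutes": 15,         # instead of the 120-minute default
    }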

Understand how to stop and start the Spark cluster. There are icons in the workspace to start and stop it.

During initial creation, the cluster may start up and remain in the started state until the inactivity timer kicks in and shuts it down.

Do stop it if you notice this.

It will take a good bit of time, 10 minutes or more, to start the cluster.

Further, the cluster must be in a started state before you can create the notebook used to run Spark/Python code.
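If you would rather not click the icons, Databricks also exposes these operations through its REST API. Here is a hedged sketch using Python's requests library; the workspace URL, token, and cluster ID are placeholders you would replace with your own.

    # Hedged sketch: start/stop a cluster via the Databricks REST API.
    import requests

    host = "https://<your-workspace>.azuredatabricks.net"  # placeholder URL
    headers = {"Authorization": "Bearer <personal-access-token>"}  # placeholder

    # Start the cluster (same effect as the start icon).
    requests.post(f"{host}/api/2.0/clusters/start",
                  headers=headers, json={"cluster_id": "<cluster-id>"})

    # "delete" terminates (stops) the cluster; it does not remove its definition.
    requests.post(f"{host}/api/2.0/clusters/delete",
                  headers=headers, json={"cluster_id": "<cluster-id>"})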

Create a notebook where you can enter Spark/Python code and execute it against the cluster.

The problem I ran into here is that it complained I had no running clusters, so I had to manually start the cluster created earlier and choose it as the target for the notebook.

The notebook is what you use to type in Spark/Python code.
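Once the notebook is attached to a running cluster, a minimal first cell can be as small as the sketch below. It assumes only what Databricks notebooks provide out of the box: a predefined "spark" session.

    # Hello world in a Databricks notebook cell.
    # `spark` (a SparkSession) is predefined in Databricks notebooks.
    data = [("hello", 1), ("world", 2)]
    df = spark.createDataFrame(data, ["word", "count"])
    df.show()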

Once created, the notebooks seem to be available even when the cluster is down.

This concludes all the steps you follow in the quickstart guide. By now you will have created a cluster, created a notebook, and used Spark/Python to query the Boston data set and even convert the results into a chart.
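The quickstart supplies the exact dataset and query; as a hedged illustration of the pattern, the sketch below reads one of the sample CSV files Databricks ships with (which may not be the exact file the guide uses) and hands it to display(), which renders a table with built-in charting options.

    # Hedged sketch of the query-and-chart pattern from the quickstart.
    # The path is a Databricks sample dataset; the guide may use another.
    df = spark.read.csv(
        "/databricks-datasets/samples/population-vs-price/data_geo.csv",
        header=True, inferSchema=True)
    display(df)  # table view with chart options in Databricks notebooks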

As noted earlier, my experience following the guide is at this link

Finally, plan the next steps to ACTUALLY learn the three main things you need, now that the "rails" are laid: notebooks, Spark, and Python in depth.

Expect to spend a few months mastering each of these, and hopefully in six months to a year you will migrate into machine learning and AI.

I will post those as I make progress.

Good luck!!