Azure storage

Azure storage is documented here

Here is a quick start on how to create Blob storage to keep files

Here are some sample CSV files from FSU: JBurkardt

sample csv files to download

Search for: sample csv files to download

All access to Azure Storage takes place through a storage account.

Storage accounts, containers, and blobs

A storage account can contain zero or more containers. A container contains properties, metadata, and zero or more blobs. A blob is a single entity composed of binary data, properties, and metadata.

The URI to reference a container or a blob must be unique. Because every storage account name is unique, two accounts can have containers with the same name. However, within a given storage account, every container must have a unique name. Every blob within a given container must also have a unique name within that container.


https://myaccount.blob.core.windows.net/mycontainer/myblob

1. Storage account URI: https://myaccount.blob.core.windows.net
2. Container name: mycontainer
3. Blob name: myblob

Here is how to create a storage account


Deployment model	Resource Manager
Performance	Standard
Account kind	StorageV2 (general-purpose v2)
Replication	Read-access geo-redundant storage (RA-GRS)
Access tier	Hot

The default deployment model is Resource Manager, which supports the latest Azure features.
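
For comparison, here is a minimal sketch of creating the same kind of account from Python with the azure-mgmt-storage SDK. The subscription id, resource group, account name, and region below are placeholders, and the begin_create call assumes a recent (track 2) version of the library:

    # A sketch: assumes azure-mgmt-storage (track 2) and azure-identity are installed
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    subscription_id = "<your-subscription-id>"
    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

    # Mirrors the portal settings above: StorageV2, RA-GRS replication, Hot tier
    poller = client.storage_accounts.begin_create(
        "my-resource-group",
        "myaccount",
        {
            "location": "eastus",
            "kind": "StorageV2",
            "sku": {"name": "Standard_RAGRS"},
            "access_tier": "Hot",
        },
    )
    account = poller.result()  # blocks until the deployment completes
    print(account.name, account.provisioning_state)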

Azure access keys

Search for: Azure access keys

How do I read an Azure blob from Databricks Spark

Search for: How do I read an Azure blob from Databricks Spark

Here is a specific way to read one using Databricks
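
As a sketch of that approach: hand Spark the storage account access key through its configuration, then read the blob with a wasbs:// URL. The account, container, and file names below are placeholders:

    # In a Databricks notebook (PySpark); account, container, and file are placeholders
    spark.conf.set(
        "fs.azure.account.key.myaccount.blob.core.windows.net",
        "<storage-account-access-key>")

    # URL form: wasbs://<container>@<account>.blob.core.windows.net/<blob-path>
    df = spark.read.csv(
        "wasbs://mycontainer@myaccount.blob.core.windows.net/sample.csv",
        header=True, inferSchema=True)

    df.show(5)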

Here are some examples

Manage anonymous read access to containers and blobs

Overview of Azure storage security

Created a storage account to keep files for processing by Spark in Databricks. This is done in its own resource group.

Created a container.

Located a CSV file and uploaded it as a blob in that container

There are some examples of granting access at the "storage account" level using generated access keys, and then using that key from the Spark cluster.

Securing storage accounts, containers, and files seems to involve too many options. It is confusing.

It is not clear how to provide fine-grained security at the container or file level. There are hints that one may have to use Active Directory. That then leads to identifying the clients (Spark programs and clusters) to AD in some way. That is not clear either.

The Databricks documentation seems to suggest things like DBFS, shared keys, public and private accounts, etc. Too much stuff going on!

I wonder why this is more difficult than an "ftp" or a "database"! Maybe it is. Is that what the storage account is?

I think I am going to use the access keys and then come back to these security concerns.

Use access keys in Spark programs to read that CSV file, select a few columns, and print them.

Then create a new CSV file that is a subset of the original CSV file using Spark
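
Continuing the read sketch from earlier, selecting a few columns and writing the subset back to the same container might look like this (the column names are made up for illustration):

    # Continues from the df read earlier; the column names are hypothetical
    subset = df.select("name", "city")
    subset.show(10)

    # Note: Spark writes a folder of part files, not a single CSV file
    subset.write.csv(
        "wasbs://mycontainer@myaccount.blob.core.windows.net/sample-subset",
        header=True, mode="overwrite")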

Go to storage account

Then look below the account details at the section called "Services"

Locate Blobs

Click on it

It will take you to a list of containers

Here you can view, delete, or create containers

Click on the container of your choice to see the blobs/files in that container
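
The same browsing can be done in code. Here is a sketch with the azure-storage-blob Python SDK (the v12-style API; account name, key, and container name are placeholders):

    # A sketch assuming the azure-storage-blob v12 package; names are placeholders
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient(
        account_url="https://myaccount.blob.core.windows.net",
        credential="<storage-account-access-key>")

    # The code equivalent of the portal's container list
    for container in service.list_containers():
        print("container:", container.name)

    # The code equivalent of clicking into a container to see its blobs
    container_client = service.get_container_client("mycontainer")
    for blob in container_client.list_blobs():
        print("blob:", blob.name)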

How to create folders and subfolders in an Azure blob container

Search for: How to create folders and subfolders in an Azure blob container

As always, SOF to the rescue: How to create sub directory in a blob container
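
The short version of that answer: blob storage is flat, and the portal simply treats a "/" in a blob name as a folder separator. Here is a sketch of creating a "folder" by uploading a blob with slashes in its name, reusing the service client from the sketch above:

    # The "folders" exist only as prefixes in blob names
    container_client = service.get_container_client("mycontainer")
    with open("sample.csv", "rb") as data:
        container_client.upload_blob(name="year2019/march/sample.csv", data=data)
    # The portal will now show year2019/march as folders inside the container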

Naming standards for Azure containers

Search for: Naming standards for Azure containers

How do I rename a container in Azure?

Search for: How do I rename a container in Azure?

Apparently you cannot rename a container. See the SOF here

Create a new container

Copy the blobs over

Delete the old container

However, a tool that offers a "rename" may just be doing the above procedure behind the scenes automatically
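
Here is a sketch of that copy-then-delete procedure with the same v12 SDK, reusing the service client from the earlier sketch. Within a single storage account an account-key-authorized copy should work, but treat this as unverified:

    # "Rename" a container: copy every blob into a new container, then delete the old one
    old = service.get_container_client("old-container")
    new = service.create_container("new-container")

    for blob in old.list_blobs():
        source_url = old.get_blob_client(blob.name).url
        # Server-side asynchronous copy; real code should poll the copy status
        new.get_blob_client(blob.name).start_copy_from_url(source_url)

    old.delete_container()  # only after all copies have completed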

Naming conventions for Azure resources

Looks like you can have policies to enforce naming