Azure Big Data: Attach Azure Blob Storage to CentOS 7 VM

In a previous blog I covered how I set up standalone Spark 2.3 on an Azure-provisioned CentOS 7.4 VM. This is the build I’m using to experiment with and learn Spark data applications and architectures. A benefit of using an Azure VM is that I can rip it down, rebuild it or clone it. When I do this I don’t want to lose my data every time and have to recover it and put it back in place. Having my data in a data lake in an Azure blob storage container is ideal, since I can kill and recycle my compute VMs and my data just stays persisted in the cloud. This blog covers how I mount my blob storage container to my CentOS 7.4 VM.

Note this is for a standalone build only, for the convenience of learning and experimentation. A multi-node Spark cluster would need further consideration and configuration to achieve distributed compute over Azure blob storage.

A final note: I’m learning Linux and Spark myself, and a lot of this stuff is already out on the web, albeit in several different places and sometimes poorly explained. Hopefully this provides a relatively layman’s end-to-end write-up, with the missing bits filled in that I found myself asking about.

Install Blobfuse

What is blobfuse? Well, to repeat the GitHub project’s opener…

blobfuse is an open source project developed to provide a virtual filesystem backed by the Azure Blob storage.

We need to download and install this; note the URL (….rhel/7…) is correct because we’re on CentOS 7.4 – not (….rhel/6…) like I first tried!

sudo rpm -Uvh https://packages.microsoft.com/config/rhel/7/packages-microsoft-prod.rpm
sudo yum install blobfuse
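
A quick sanity check that the package actually landed, and which version we got:

# confirm the blobfuse package is installed
rpm -q blobfuse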

Temporary Path

Blobfuse requires a temporary path. This is where it caches files locally, aiming to provide the performance of local native storage. This location obviously has to be big enough to accommodate the data that we want to use on our standalone Spark build. What better drive to use for this than the local temporary SSD storage that you get with an Azure Linux VM? Running the following gives us a summary of our attached physical storage:

df -lh

[Image: centos vm storage – output of df -lh]

Here we can see that /dev/sdb1 has 63GB available, which is plenty for me right now. It’s mounted on /mnt/resource, so we’ll create a temp directory there. Obviously substitute your own username when assigning permissions.

sudo mkdir /mnt/resource/blobfusetmp 
sudo chown shaunryan /mnt/resource/blobfusetmp
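
A quick check that the directory and its ownership came out as expected:

# the temp folder should now be owned by your user (shaunryan in my case)
ls -ld /mnt/resource/blobfusetmp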

When the machine is rebooted, everything here can be (and should be assumed to be) lost. But that’s fine, because it’s all held in our cloud storage container; this is just the cache.

In fact, if you navigate to the mounted folder and list the files:

cd /mnt/resource
ls -l

we can see a file called DATALOSS_WARNING_README.txt. If we open it in nano we can see the following:

[Image: datalosswarning – contents of DATALOSS_WARNING_README.txt]

Create an Azure Blob Storage Account

I’m not going to cover creating an Azure storage account since it’s pretty straightforward – see here.

After creating the storage account we need to create a container: click on Blobs and create a container.

[Image: storage1]

[Image: storage2]

Once the container is created, click on it and upload some data. I’m using the companion data files for the book Spark: The Definitive Guide; they can be found here.

[Image: storage3]
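
Uploading through the portal works fine for a handful of files, but if you happen to have the Azure CLI installed and are logged in (az login), something like the following will bulk-upload a local folder. The account name, key and local path below are just placeholders – substitute your own:

# bulk-upload a local folder of sample data into the datalake container
az storage blob upload-batch \
  --account-name myaccountname \
  --account-key "<access key>" \
  --destination datalake \
  --source ./spark-definitive-guide-data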

Now that the storage account, container and data are up, we need to note down the following details so that we can configure the blobfuse connection:

  • Storage Account Name
  • Access Key 1 or 2 (doesn’t matter)
  • Container Name – we already created this; I called it datalake

These can be obtained by clicking on Access Keys in the storage account.

[Image: storage1]

[Image: storage5]
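
Again, if you have the Azure CLI to hand, the same details can be pulled from a terminal instead of the portal. The account and resource group names below are placeholders:

# list both access keys for the storage account
az storage account keys list \
  --account-name myaccountname \
  --resource-group myresourcegroup \
  --output table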

Configure Blob Storage Access Credentials

Blobfuse takes a parameter which is a path to a file holding the Azure storage credentials, so we need to create this file. I created it in my home directory (i.e. /home/shaunryan or ~) for convenience. Because of its content it should be adequately secured on a shared machine, so store it wherever you want, but note the path.

cd ~
touch ~/fuse_connection.cfg
chmod 700 ~/fuse_connection.cfg

We need the following Azure storage details for the storage container that we want to mount using blobfuse:

  • account name
  • access key
  • container name

The Create an Azure Blob Storage Account section above shows where these details can be found.

Edit the new file using nano:

nano ~/fuse_connection.cfg

Enter the account details as follows:

accountName myaccountname
accountKey myaccountkey
containerName mycontainername

It should look something like this. When done, hit Ctrl+X and then Y to save.

[Image: fuse_connection – example fuse_connection.cfg in nano]
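
Alternatively, if you’d rather not edit the file interactively, a heredoc writes the same three lines in one go (the values below are the same placeholders):

# write the blobfuse credentials file in one shot
cat > ~/fuse_connection.cfg <<EOF
accountName myaccountname
accountKey myaccountkey
containerName mycontainername
EOF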

Mount the Drive

So now all that’s left to do is mount the drive. We need somewhere to mount it to, so create a directory of your liking. I’m using a sub-directory of a folder called data in my home directory, since I might mount more than one storage container and it’s just for me (~/data/datalake).

mkdir -p ~/data/datalake

We also need the path to our temp location (/mnt/resource/blobfusetmp) and the path to our fuse_connection.cfg file that holds the connection details (just fuse_connection.cfg, because I created it in ~).

cd ~
blobfuse ~/data/datalake --tmp-path=/mnt/resource/blobfusetmp --config-file=fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120

So now when we list the files in this directory, I can see all the files that are in my storage container and load them into my Spark console. See below, where I have all the data files available to work through Spark: The Definitive Guide. I copied them from GitHub into my Azure storage account, which is now attached to my VM.

[Image: datalake mounted – listing of the mounted container]
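
A couple of useful checks at this point: the blobfuse mount shows up in df like any other filesystem, and it can be detached cleanly with fusermount when you’re done:

# the mount should appear like any other filesystem
df -h ~/data/datalake

# list the container contents through the mount
ls ~/data/datalake

# unmount cleanly when finished (blobfuse is a FUSE filesystem)
fusermount -u ~/data/datalake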

Automate in Bash Profile

So it’s all up and working – until we reboot the machine, at which point the drive is unmounted and our temp location is potentially (we should assume it will be) deleted.

To remedy this we can automate the temporary file creation and blobfuse storage mount in the bash profile.

That way I can totally forget all this stuff and just be happy that it works; and when it doesn’t I’ll be back here reading what I wrote.

Nano the bash profile to edit it.

sudo nano ~/.bash_profile

Add the following to the end of the profile, then Ctrl+X to exit and Y to save.

# Mount Azure Storage account

if [ ! -d /mnt/resource/blobfusetmp ]
then
 echo "creating Azure Storage temporary folder /mnt/resource/blobfusetmp"
 sudo mkdir /mnt/resource/blobfusetmp
 sudo chown shaunryan /mnt/resource/blobfusetmp
 echo "created Azure Storage temporary folder /mnt/resource/blobfusetmp"
else
 echo "Azure Storage temprorary folder already exists /mnt/resource/blobfusetmp"
fi

echo "Mounting Azure storage at ~/data/datalake/ using ~/fuse_connection.cfg and temporary drive /mnt/resource/blobfusetmp"
blobfuse ~/data/datalake --tmp-path=/mnt/resource/blobfusetmp --config-file=/home/shaunryan/fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 -o nonempty

Now when we SSH in it should mount automatically. Below shows a login after a reboot and a login after an exit. The mount after the exit will fail because it’s already mounted, which is fine. Note that in this case the temporary storage already existed, but it may not always; I issued a reboot, so in all likelihood the VM wasn’t down long enough for the temporary disk to be recycled. It was destroyed, however, when I shut the VM down last night and powered it up this morning.

[Image: login-mount1 – profile output on login after reboot and after exit]
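
If the failed re-mount message on a second login bothers you, one tweak (mine, not something from the blobfuse docs) is to guard the blobfuse line in the profile with a mountpoint check so it only runs when nothing is mounted there yet:

# only attempt the mount if ~/data/datalake isn't already a mount point
if ! mountpoint -q ~/data/datalake
then
 blobfuse ~/data/datalake --tmp-path=/mnt/resource/blobfusetmp --config-file=/home/shaunryan/fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 -o nonempty
else
 echo "Azure storage already mounted at ~/data/datalake"
fi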
