As of late I’ve been investigating the options for running a Spark big data platform on Azure, using Blob Storage and Data Lake for data storage. So far I’ve poked around with the following – which I may blog about if I get time:
- IaaS Linux VM Build (standalone and clustered)
- Spark on Azure Container Cluster (AKS preview) i.e. Kubernetes
This is a basic how-to for installing Spark 2.3 on a standalone CentOS VM in Azure: the latest and greatest build of Spark 2.3, CentOS 7.4 (Linux), Scala 2.11.12 and Java 8. There are later versions of Scala, but Spark 2.3 requires Scala 2.11 at most, as covered here:
Preparing Your Client Machine
- Install bash client
- Create ssh rsa key – we need this before creating the Azure VM
We’re setting up a Linux server to run Spark on a CentOS VM in Azure. I’m not going to bother with a Linux desktop or remote desktop, but we’ll need a client bash terminal to connect to the machine in order to:
- Administer CentOS & install software
- Use the Spark terminal
I run macOS and Windows 10; mostly Windows 10. If you’re running a Mac you don’t need a bash client terminal since you already have one. Windows, however, does need a bash client.
There is a new Windows Subsystem for Linux available in the Windows 10 Fall Creators Update, but I hit some issues with it so wouldn’t advise it yet. It’s not just a bash client; it provides a local Linux subsystem, which brings some irritating complications. The best experience I’ve had by far is with Git Bash, so go ahead and install that if you’re using Windows.
Once we have bash we need to create a public/private key pair so that we can use the Secure Shell (SSH) command to securely connect to our Linux VM.
Open Git Bash and execute the following:
ssh-keygen -t rsa -b 2048
This creates a 2048-bit RSA public/private key pair that we can use to connect to our Azure VM. You’ll be prompted for a filename and passphrase. See here:
As we can see it says that it created the key in:
In my case it didn’t, however, and demokey.pub and demokey can be found here, which is my bash home directory:
C:\users\shaun\demokey.pub (this is the public key)
C:\users\shaun\demokey (no extension; this is the private key)
Review these files using Notepad. Copy the private key to your .ssh folder and rename it to id_rsa:
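If you’d rather script that step from a bash prompt, a minimal sketch looks like this (the file names are examples matching the post’s demokey; `-N ""` skips the passphrase purely to keep the sketch non-interactive – a passphrase is recommended in practice):

```shell
# Sketch only: generates an example key pair in a temp dir and installs the
# private key where ssh expects it. Adjust names and paths to taste.
keydir=$(mktemp -d)
ssh-keygen -t rsa -b 2048 -f "$keydir/demokey" -N "" -q
mkdir -p ~/.ssh
cp "$keydir/demokey" ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
cat "$keydir/demokey.pub"   # the public key to paste into the Azure portal
```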
Keep a note of the public key, which looks something like the one below, because it lives on the Linux server and we need it when creating the Linux VM in Azure. Also don’t forget the passphrase you entered, because we need that to log in using the ssh command.
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDBkb5GTWTIAtGhZeHNKXwbVF6WoQqb0u23D3opQc0TId9NdlWj8WnYmFu1/l4CuqdD/uzE7/JJTP2pW9mrb3/sZyygF560XGQzTmEUAGRlAexTr509Q0wB/Spekp9qGLVqkys3wQdbxjWsWI2lEhwJIvPlyzgzIAJrmeUU/NGS6rQN+tzoqntg4V2fI714W7f0YRerUICb9rveVwbDU0ieihs1B+n+ljNoJ+J3yFAKqYVcYyQIL4WYmpYgi/M1EMOyrRZK0hVySIbhGh4eI1FBOfplxEOhI8SgedK1KaemhBWs4f+zs1bntqkSCgFHJzV/eLUHDsYxTrgEK3Tn9s5X shaun@DESKTOP-CKJA9OR
Right, time to create our Linux VM.
Create Centos VM in Azure
Log into the Azure Portal, click Create a Resource and search for CentOS. Choose CentOS-based 7.4 and hit Create.
Fill in the necessaries to create your VM, choosing the most affordable and appropriate machine. For a standalone demo/learning build I tend to go for about 4 CPUs and 32GB of RAM (remember Spark is an in-memory-optimised big data platform). The important bit is to copy and paste our public RSA key into the SSH Public Key input box so it can be placed on the VM when provisioned. When Azure has provisioned your VM it leaves it up and running.
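If you prefer scripting to clicking through the portal, the Azure CLI equivalent looks roughly like this. Treat it as a sketch: the resource names and VM size are illustrative assumptions, and flag names vary between CLI versions, so check `az vm create --help` first.

```shell
# Illustrative only: group/VM names, location, size and image URN are assumptions.
az group create --name spark-demo-rg --location westeurope
az vm create \
  --resource-group spark-demo-rg \
  --name spark-demo-vm \
  --image OpenLogic:CentOS:7.4:latest \
  --size Standard_E4s_v3 \
  --admin-username shaun \
  --ssh-key-value ~/.ssh/id_rsa.pub
```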
Connect to CentOS VM
So hopefully that all went well and we’re now ready to connect. You can give your VM a DNS name (see the docs), but I tend to just connect using the IP. Navigate to the VM in the portal and click the Connect button. This shows the SSH command with the server address, which we can enter into a bash client in order to connect.
Enter the SSH command, enter the passphrase and we’re good to go:
Patch the OS
Ensure the OS is patched. The reboot will kick you out of your ssh session, so you’ll need to sign back in.
sudo yum update -y
sudo reboot
Install Java 8
Install OpenJDK 1.8 and validate the install:
sudo yum install java-1.8.0-openjdk.x86_64
java -version
Set the following home paths in your .bash_profile so that every time we log in our paths are set accordingly. To do this we’ll use the nano text editor.
sudo nano ~/.bash_profile
Add the following path statements, since they’re required by the Scala config:
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export JRE_HOME=/usr/lib/jvm/jre
It should look something like this.
To exit press ctrl+x; you’ll be prompted to save. Now reload the bash profile, then check the java version and paths:
source ~/.bash_profile
java -version
echo $JAVA_HOME
echo $JRE_HOME
Java is all done. On to Scala.
Install Scala 2.11.12
Spark 2.3.0 requires a Scala 2.11.x version. Note that the current Scala is version 2.12, so we’ll go for the last 2.11 release, which is 2.11.12; we want the rpm package:
wget http://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.rpm
sudo yum install scala-2.11.12.rpm
We should now validate the install:
scala -version
Note the output as follows:
cat: /usr/lib/jvm/jre-1.8.0-openjdk/release: No such file or directory Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
This is because there is no release file in the $JAVA_HOME directory, which the scala script looks for; see a more thorough explanation here. It’s not vitally necessary, but I got around it by just creating a release directory at $JAVA_HOME.
cd $JAVA_HOME
sudo mkdir release
cd ~
scala -version
cat: /usr/lib/jvm/jre-1.8.0-openjdk/release: Is a directory Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
Install Spark 2.3.0
Final step! Install Spark. Download the 2.3.0 tgz package (spark-2.3.0-bin-hadoop2.7.tgz). We’ll use wget again and download the package from a mirror URL listed on this page. I’m using the first listed mirror URL, but adjust as you see fit.
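The download command itself didn’t survive here; a known-good fallback is the Apache archive, which keeps every release (substitute your chosen mirror URL if you prefer):

```shell
# Mirrors may be faster; the archive URL below always works for old releases.
wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
```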
Extract the tarball and move the files into a more appropriate directory:
tar xf spark-2.3.0-bin-hadoop2.7.tgz
sudo mkdir /usr/local/spark
sudo cp -r spark-2.3.0-bin-hadoop2.7/* /usr/local/spark
We need to add some paths to our bash profile for convenience, so that we don’t have to set them every time we create a session. Again we’ll use nano for this.
cd ~
sudo nano ~/.bash_profile
Add the following lines alongside the java paths that we added earlier.
Also put the spark binary folder into the $PATH variable:
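The original screenshot of these lines is missing; given the install location above and the $SPARK_EXAMPLES_JAR check further down, the additions would look something like this (the examples jar path is an assumption for the 2.3.0 Hadoop 2.7 build – check your examples/jars directory):

```shell
# Assumed values; adjust to your install location and Spark build.
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export SPARK_EXAMPLES_JAR=$SPARK_HOME/examples/jars/spark-examples_2.11-2.3.0.jar
```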
The file should look something like this.
Now we can exit and save the file using ctrl+x, reload the bash profile and check the paths.
source ~/.bash_profile
echo $PATH
echo $SPARK_EXAMPLES_JAR
Now we should be good to run the spark shell!!
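Launching it is just one command now that $PATH includes the Spark binaries; the commented sum is a quick smoke test you can type at the prompt:

```shell
spark-shell
# at the scala> prompt, try e.g.:
#   sc.parallelize(1 to 100).sum   // 5050.0
```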
There you have it folks, enjoy!
To exit spark-shell use:
:quit
To close your ssh session use:
exit
Finally, don’t forget to stop your VM to reduce your Azure spend. Sensitive machine details, user details and rsa keys used for this blog have since been deleted.