Azure Big Data: Spark 2.3 CentOS VM Standalone

Lately I’ve been investigating the options for running a Spark big data platform on Azure, using Blob Storage and Data Lake for data storage. So far I’ve poked around with the following, which I may blog about if I get time:

  • IaaS Linux VM Build (standalone and clustered)
  • HDInsight
  • Databricks
  • Spark on Azure Container Cluster (AKS preview) i.e. Kubernetes

This is a basic how-to for installing Spark 2.3 on a standalone CentOS VM in Azure, using the latest and greatest builds: Spark 2.3, CentOS 7.4 (Linux), Scala 2.11.12 and Java 8+. There are later versions of Scala, but Spark 2.3 is built against Scala 2.11, as the Spark 2.3 documentation notes.

Preparing Your Client Machine


  1. Install bash client
  2. Create ssh rsa key – we need this before creating the Azure VM

We’re setting up a Linux server to run Spark on a CentOS VM in Azure. I’m not going to bother with a Linux desktop or remote desktop, but we’ll need a client bash terminal to connect to the machine in order to:

  • Administer CentOS and install software
  • Use the Spark terminal

I run macOS and Windows 10; mostly Windows 10. If you’re on a Mac you don’t need a bash client since you already have one. Windows, however, does need a bash client.

There is a new Windows Subsystem for Linux available in the Windows 10 Fall Creators Update, but I hit some issues with it so wouldn’t advise it yet. It’s not just a bash client; it emulates a local Linux subsystem, which brings some irritating complications. The best experience I’ve had by far is with Git Bash, so go ahead and install that if you’re using Windows.

Once we have bash we need to create a public/private key pair so that we can use the Secure Shell (SSH) command to securely connect to our Linux VM.

Open Git Bash and execute the following:

ssh-keygen -t rsa -b 2048

This creates a 2048-bit RSA public/private key pair that we can use to connect to our Azure VM. You’ll be prompted for a filename and a passphrase.


The output reports the location where it saved the key. In my case, however, the key didn’t end up there; demokey can be found here, in my bash home directory:

C:\users\shaun\demokey.pub (this is the public key)
C:\users\shaun\demokey  (no extension; this is the private key)

Review these files using notepad. Copy the private key to the respective .ssh folder and rename it to id_rsa:
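The copy and rename can also be done from Git Bash. This is a minimal sketch, assuming the private key is called demokey and sits in your bash home directory; the function name install_private_key is just something I made up for illustration.

```shell
# Sketch: install a private key as id_rsa in an .ssh folder.
# "demokey" is an assumed file name; use whatever name you gave ssh-keygen.
install_private_key() {
    key_file="$1"                  # path to the private key (the file with no extension)
    ssh_dir="${2:-$HOME/.ssh}"     # target folder, defaults to ~/.ssh
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    cp "$key_file" "$ssh_dir/id_rsa"
    chmod 600 "$ssh_dir/id_rsa"    # keep the private key readable by the owner only
}

# Usage: install_private_key ~/demokey
```

Keeping the key owner-only matters mostly on macOS and Linux, where ssh complains about private keys that other users can read.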


Keep a note of the public key, which looks something like the one below; it lives on the Linux server, and we need it when creating the Linux VM in Azure. Also, don’t forget the passphrase you entered because we need it to log in using the ssh command.

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDBkb5GTWTIAtGhZeHNKXwbVF6WoQqb0u23D3opQc0TId9NdlWj8WnYmFu1/l4CuqdD/uzE7/JJTP2pW9mrb3/sZyygF560XGQzTmEUAGRlAexTr509Q0wB/Spekp9qGLVqkys3wQdbxjWsWI2lEhwJIvPlyzgzIAJrmeUU/NGS6rQN+tzoqntg4V2fI714W7f0YRerUICb9rveVwbDU0ieihs1B+n+ljNoJ+J3yFAKqYVcYyQIL4WYmpYgi/M1EMOyrRZK0hVySIbhGh4eI1FBOfplxEOhI8SgedK1KaemhBWs4f+zs1bntqkSCgFHJzV/eLUHDsYxTrgEK3Tn9s5X shaun@DESKTOP-CKJA9OR

Right, time to create our Linux VM.

Create CentOS VM in Azure


Log in to the Azure Portal, click Create a Resource and search for CentOS. Choose the CentOS-based 7.4 image and hit Create.


Fill in the necessary details to create your VM, choosing the most affordable machine that fits the job. For a standalone demo or learning machine I tend to go for about 4 CPUs and 32 GB of RAM (remember, Spark is an in-memory optimised big data platform). The important bit is to copy and paste our public RSA key into the SSH Public Key input box so that it is placed on the VM when it is provisioned. When Azure has provisioned your VM it leaves it up and running.


Connect to CentOS VM


So hopefully that all went well and we’re now ready to connect. You can give your VM a DNS name (see docs); however, I tend to just connect using the IP. Navigate to the VM in the portal and click the Connect button. This will show you the SSH command, including the server address, that we can enter into a bash client in order to connect.


Enter the SSH command, enter the passphrase and we’re good to go:
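For reference, the command has roughly the shape sketched below. The user name and IP address are placeholders I’ve invented (the IP is from a documentation-only range); use the values the Connect button shows for your VM.

```shell
# Placeholders only: the portal's Connect button shows the real user and address.
VM_USER="azureuser"        # hypothetical admin user chosen when creating the VM
VM_HOST="203.0.113.10"     # documentation-range IP standing in for the VM's public IP

connect_vm() {
    # -i points ssh at the private key copied into ~/.ssh earlier
    ssh -i "$HOME/.ssh/id_rsa" "$VM_USER@$VM_HOST"
}

# Usage: connect_vm   (you will be asked for the key's passphrase)
```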


Patch the OS


Ensure the OS is patched. The reboot will kick you out of your ssh session, so you’ll need to sign back in.

sudo yum update -y
sudo reboot

Install Java 8


Install OpenJDK 1.8 and validate the install:

sudo yum install java-1.8.0-openjdk.x86_64
java -version

Set the following home paths in your .bash_profile so that every time we log in our paths are set accordingly. To do this we’ll use the nano text editor. The file belongs to your own user, so sudo isn’t needed.

nano ~/.bash_profile

Add the following path statements, since they’re required by the scala config:

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export JRE_HOME=/usr/lib/jvm/jre

To exit, press Ctrl+X; you’ll be prompted to save. Now reload the bash profile.

source ~/.bash_profile

Check the java version.

java -version


Check the Java paths:

echo $JAVA_HOME
echo $JRE_HOME
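Echoing a variable only shows that it is set, not that the path exists. As a small sanity check, a helper like the one below (check_dir is a name I’ve made up) confirms that each variable points at a real directory.

```shell
# Report whether a path stored in a variable points at a real directory.
check_dir() {
    name="$1"    # label to print, e.g. JAVA_HOME
    path="$2"    # value to test
    if [ -n "$path" ] && [ -d "$path" ]; then
        echo "$name ok: $path"
    else
        echo "$name missing or not a directory: '$path'"
        return 1
    fi
}

# Usage on the VM:
#   check_dir JAVA_HOME "$JAVA_HOME"
#   check_dir JRE_HOME "$JRE_HOME"
```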


Java is all done. On to Scala.

Install Scala 2.11.12


Spark 2.3.0 requires a Scala 2.11.x version. Note that the current Scala release is 2.12, so we’ll go for the last 2.11 release, which is 2.11.12. We want the rpm package, downloaded to the VM before installing it:
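Something like the following should fetch the RPM with wget. The URL is an assumption on my part (Lightbend’s download host), so check it against the Scala download page before relying on it; fetch_scala_rpm is just an illustrative name.

```shell
# Assumed download location for the Scala 2.11.12 RPM; verify before use.
SCALA_VERSION="2.11.12"
SCALA_RPM="scala-${SCALA_VERSION}.rpm"
SCALA_URL="https://downloads.lightbend.com/scala/${SCALA_VERSION}/${SCALA_RPM}"

fetch_scala_rpm() {
    # wget is not always present on a minimal CentOS image
    command -v wget >/dev/null 2>&1 || sudo yum install -y wget
    wget "$SCALA_URL"
}

# Usage on the VM: fetch_scala_rpm
```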

sudo yum install scala-2.11.12.rpm

We should now validate the install:

scala -version

Note the output as follows:

cat: /usr/lib/jvm/jre-1.8.0-openjdk/release: No such file or directory
 Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

This is because the scala script looks for a release entry under $JAVA_HOME and there isn’t one here; see a more thorough explanation here. It’s not vitally necessary, but I got around this by just creating a release directory at $JAVA_HOME.

cd $JAVA_HOME
sudo mkdir release
cd ~
scala -version


cat: /usr/lib/jvm/jre-1.8.0-openjdk/release: Is a directory
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
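As the output shows, cat still complains when release is a directory. I suspect an empty file would keep it quiet instead, though that is an assumption about what the scala script needs rather than something I have verified on the VM. The sketch below shows the difference in a scratch directory standing in for $JAVA_HOME.

```shell
# Demonstration in a scratch directory (a stand-in for $JAVA_HOME).
demo_home="$(mktemp -d)"

# Case 1: release is a directory, so cat fails.
mkdir "$demo_home/release"
cat "$demo_home/release" 2>/dev/null || echo "directory: cat still fails"

# Case 2: release is an empty file, so cat succeeds and prints nothing.
rmdir "$demo_home/release"
touch "$demo_home/release"
cat "$demo_home/release" && echo "empty file: cat succeeds silently"

# The untested equivalent on the VM would be:
#   sudo rmdir "$JAVA_HOME/release" && sudo touch "$JAVA_HOME/release"
```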

Install Spark 2.3.0


Final step! Install Spark. Download the 2.3.0 package, which is a .tgz archive rather than an rpm. We’ll use wget again and download the package from a mirror URL listed on this page. I’m using the first listed mirror URL, but adjust as you see fit.
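A sketch of the download, assuming the Apache archive rather than a mirror (the archive keeps old releases, so it is a safe fallback if a mirror no longer lists 2.3.0). The fetch_spark name is made up for illustration.

```shell
SPARK_VERSION="2.3.0"
SPARK_PACKAGE="spark-${SPARK_VERSION}-bin-hadoop2.7.tgz"
# Assumption: the Apache archive is used here instead of a mirror URL.
SPARK_URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}"

fetch_spark() {
    # Downloads the archive into the current directory
    wget "$SPARK_URL"
}

# Usage on the VM: fetch_spark
```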


Unpack the archive and move the files into a more appropriate directory. Writing to /usr/local needs sudo:

tar xf spark-2.3.0-bin-hadoop2.7.tgz
sudo mkdir /usr/local/spark
sudo cp -r spark-2.3.0-bin-hadoop2.7/* /usr/local/spark

We need to add some paths to our bash profile for convenience so that we don’t have to map them every time we create a session. Again we’ll use nano for this.

cd ~
nano ~/.bash_profile

Add the following lines alongside the Java paths that we added earlier.

export SPARK_EXAMPLES_JAR=/usr/local/spark/examples/jars/spark-examples_2.11-2.3.0.jar

Also put the spark binary folder into the $PATH variable:
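A one-line sketch of that PATH entry, assuming Spark was copied to /usr/local/spark as in the step above:

```shell
# Append Spark's bin directory so spark-shell and spark-submit resolve by name.
export PATH="$PATH:/usr/local/spark/bin"
```

Appending rather than prepending keeps the system’s own commands first in the lookup order.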


Now we can save and exit the file using Ctrl+X, reload the bash profile and check the paths.

source ~/.bash_profile
echo $PATH
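As an optional non-interactive check, the bundled SparkPi example can be submitted in local mode, which also exercises the SPARK_EXAMPLES_JAR variable. This is a sketch rather than part of the original steps; run_spark_pi is a made-up helper name, and it refuses to run if the jar path is wrong.

```shell
# Run the bundled SparkPi example locally as a smoke test.
run_spark_pi() {
    jar="${1:-${SPARK_EXAMPLES_JAR:-}}"   # examples jar; defaults to the profile variable
    if [ ! -f "$jar" ]; then
        echo "examples jar not found: '$jar'" >&2
        return 1
    fi
    # local[*] runs Spark in-process using all available cores
    spark-submit --class org.apache.spark.examples.SparkPi --master "local[*]" "$jar" 10
}

# Usage on the VM: run_spark_pi
```

If everything is wired up, the job output should include a line reporting an approximate value of Pi.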

Now we should be good to run the Spark shell!

spark-shell


There you have it, folks. Enjoy!


To exit spark-shell, use:

:quit

To close your ssh session, use:

exit

Finally, don’t forget to stop your VM to reduce your Azure spend. Sensitive machine details, user details and rsa keys used for this blog have since been deleted.