Azure Big Data: Spark Series

Doing lots with Spark in the cloud at the moment. Rather than having my Spark blogs all over the shop, I'm collecting an index page as a reference. It will be updated as fast as I can create content – which varies.


Azure Big Data: Spark 2.3 Centos VM Standalone

Of late I've been investigating the options for running a Spark big data platform on Azure, using Blob and Data Lake storage for the data. So far I've poked around with the following – which I may blog about if I get time:

  • IaaS Linux VM Build (standalone and clustered)
  • HDInsight
  • Databricks
  • Spark on Azure Container Cluster (AKS preview) i.e. Kubernetes

This is a basic how-to for installing Spark 2.3 on a standalone CentOS VM in Azure: the latest and greatest build of Spark 2.3, CentOS 7.4 (Linux), Scala 2.11.12 and Java 8+. There are later versions of Scala, but Spark 2.3 requires Scala 2.11 at most, as covered here:

Preparing Your Client Machine

 

  1. Install a bash client
  2. Create an SSH RSA key – we need this before creating the Azure VM

We're setting up a Linux server to run Spark on a CentOS VM in Azure. I'm not going to bother with a Linux desktop or remote desktop, but we'll need a client bash terminal to connect to the machine in order to:

  • Administer CentOS & install software
  • Use the Spark terminal

I run macOS and Windows 10; mostly Windows 10. If you're running a Mac you don't need a bash client since you already have a terminal. Windows, however, does need a bash client.

There is a new Windows Subsystem for Linux available in the Windows 10 Fall Creators Update, but I hit some issues with it so wouldn't advise it yet. It's not just a bash client; it provides a local Linux subsystem, which brings some irritating complications. The best experience I've had by far is with Git Bash, so go ahead and install that if you're using Windows.

Once we have bash we need to create a public/private key pair so that we can use the Secure Shell (SSH) command to securely connect to our Linux VM.

Open Git Bash and execute the following:

ssh-keygen -t rsa -b 2048

This creates a 2048-bit RSA public/private key pair that we can use to connect to our Azure VM. You'll be prompted for a filename and passphrase. See here:

[Screenshot: ssh-keygen output in Git Bash]
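Incidentally, if you'd rather skip the prompts, ssh-keygen can take the output file, passphrase and a comment as flags. A minimal sketch with hypothetical values:

# -f names the private key file (the public key gets .pub appended)
# -N sets the passphrase, -C attaches a comment to the public key
ssh-keygen -t rsa -b 2048 -f demokey -N 'use-a-real-passphrase' -C 'azure spark demo'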

As the screenshot shows, it says that it created the key in:

C:\Users\shaun\.ssh\id_rsa

In my case it didn't, however; demokey and demokey.pub can be found here, which is my bash home directory:

C:\users\shaun\demokey.pub (this is the public key)
C:\users\shaun\demokey (no extension; this is the private key)

Review these files using notepad. Copy the private key to the respective .ssh folder and rename it to id_rsa:

C:\Users\shaun\.ssh\id_rsa
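From Git Bash the copy and rename is quick to script. A sketch, assuming the key pair was created as demokey in your bash home directory:

cp ~/demokey ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa   # ssh expects the private key to be readable only by you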

Keep a note of the public key, which looks something like the one below, because this lives on the Linux server and we need it when creating the Linux VM in Azure. Also don't forget the passphrase you entered, because we need that to log in using the ssh command.

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDBkb5GTWTIAtGhZeHNKXwbVF6WoQqb0u23D3opQc0TId9NdlWj8WnYmFu1/l4CuqdD/uzE7/JJTP2pW9mrb3/sZyygF560XGQzTmEUAGRlAexTr509Q0wB/Spekp9qGLVqkys3wQdbxjWsWI2lEhwJIvPlyzgzIAJrmeUU/NGS6rQN+tzoqntg4V2fI714W7f0YRerUICb9rveVwbDU0ieihs1B+n+ljNoJ+J3yFAKqYVcYyQIL4WYmpYgi/M1EMOyrRZK0hVySIbhGh4eI1FBOfplxEOhI8SgedK1KaemhBWs4f+zs1bntqkSCgFHJzV/eLUHDsYxTrgEK3Tn9s5X shaun@DESKTOP-CKJA9OR

Right, time to create our Linux VM.

Create Centos VM in Azure

 

Log in to the Azure portal, click Create a Resource and search for CentOS. Choose the CentOS-based 7.4 image and hit Create.

[Screenshot: CentOS-based 7.4 image in the Azure portal]

Fill in the necessaries in order to create your VM, choosing the most affordable and appropriate machine. For a demo/learning standalone I tend to go for about 4 CPUs and 32 GB of RAM (remember Spark is an in-memory optimised big data platform). The important bit is to copy and paste our public RSA key into the SSH Public Key input box so it can be placed on the VM when provisioned. When Azure has provisioned your VM it leaves it up and running. If you'd rather script this than click through the portal, there's a CLI sketch after the screenshot below.

[Screenshot: VM creation blade with the SSH Public Key field]
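Here's that CLI sketch, with hypothetical resource group and VM names. OpenLogic:CentOS:7.4:latest is the CentOS 7.4 image URN and Standard_E4_v3 gives 4 vCPUs / 32 GB; note the flag spellings vary a little between Azure CLI versions:

# assumes the Azure CLI is installed and you've run: az login
az group create --name spark-demo-rg --location westeurope
az vm create \
  --resource-group spark-demo-rg \
  --name spark-demo-vm \
  --image OpenLogic:CentOS:7.4:latest \
  --size Standard_E4_v3 \
  --admin-username shaun \
  --ssh-key-value ~/.ssh/id_rsa.pub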

Connect to CentOS VM

 

So hopefully that all went well and we're now ready to connect. You can give your VM a DNS name (see the docs), but I tend to just connect using the IP. Navigate to the VM in the portal and click the Connect button. This will show you the SSH command, with the server address, that we can enter into a bash client in order to connect.

[Screenshot: Connect button showing the SSH command]

Enter the SSH command, enter the passphrase and we’re good to go:
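It'll be along these lines – a hypothetical admin username and a placeholder for your VM's public IP:

ssh shaun@<vm-public-ip>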

[Screenshot: SSH session connected to the VM]

Patch the OS

 

Ensure the OS is patched. The reboot will kick you out of your SSH session, so you'll need to sign back in.

sudo yum update -y
sudo reboot

Install Java 8

 

Install OpenJDK 1.8 and validate the install:

sudo yum install java-1.8.0-openjdk.x86_64
java -version

Set the following home paths in your .bash_profile so that every time we log in our paths are set accordingly. To do this we'll use the nano text editor.

sudo nano ~/.bash_profile

Add the following path statements, since they're required by the Scala config:

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export JRE_HOME=/usr/lib/jvm/jre

It should look something like this:

[Screenshot: .bash_profile with JAVA_HOME and JRE_HOME exports]

To exit, press ctrl+x and you'll be prompted to save. Now reload the bash profile.

source ~/.bash_profile

Check the Java version:

java -version

[Screenshot: java -version output]

Check the Java paths:

echo $JAVA_HOME
echo $JRE_HOME

[Screenshot: echoed Java path variables]

Java is all done. On to Scala.

Install Scala 2.11.12

 

Spark 2.3.0 requires a Scala 2.11.x version. Note that the current Scala release is 2.12, so we'll go for the last 2.11 version, which is 2.11.12; we want the rpm package:

wget http://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.rpm
sudo yum install scala-2.11.12.rpm

We should now validate the install:

scala -version

Note the output as follows:

cat: /usr/lib/jvm/jre-1.8.0-openjdk/release: No such file or directory
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

This is because there is no release directory in the $JAVA_HOME directory, which the Scala launcher script looks for; see a more thorough explanation here. It's not vitally necessary, but I got around it by just creating a release directory at $JAVA_HOME.

cd $JAVA_HOME
sudo mkdir release
cd ~
scala -version

Output:

cat: /usr/lib/jvm/jre-1.8.0-openjdk/release: Is a directory
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Install Spark 2.3.0

 

Final step! Install Spark. Download the 2.3.0 tgz package. We'll use wget again and download the package from a mirror URL listed on this page. I'm using the first listed mirror URL, but adjust as you see fit.

wget http://apache.mirror.anlx.net/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz

Unpack the tarball and move the files into a more appropriate directory (writing to /usr/local needs root, hence the sudo):

tar xf spark-2.3.0-bin-hadoop2.7.tgz
sudo mkdir /usr/local/spark
sudo cp -r spark-2.3.0-bin-hadoop2.7/* /usr/local/spark
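As an aside, instead of the mkdir/cp above you can keep the versioned directory and point a symlink at it, so a future Spark upgrade is just a re-link. A sketch; everything later in this post still works because the paths all go through /usr/local/spark:

sudo mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark-2.3.0
sudo ln -s /usr/local/spark-2.3.0 /usr/local/spark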

We need to add some paths to our bash profile for convenience, so that we don't have to set them every time we create a session. Again we'll use nano for this.

cd ~
sudo nano ~/.bash_profile

Add the following line alongside the Java paths we added earlier:

export SPARK_EXAMPLES_JAR=/usr/local/spark/examples/jars/spark-examples_2.11-2.3.0.jar

Also put the spark binary folder into the $PATH variable:

PATH=$PATH:$HOME/.local/bin:$HOME/bin:/usr/local/spark/bin

The file should look something like this:

[Screenshot: .bash_profile with the Spark paths added]
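For reference, the tail of ~/.bash_profile should now read roughly as follows (the stock CentOS profile already has the PATH and export lines, so you're just extending them):

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export JRE_HOME=/usr/lib/jvm/jre
export SPARK_EXAMPLES_JAR=/usr/local/spark/examples/jars/spark-examples_2.11-2.3.0.jar
PATH=$PATH:$HOME/.local/bin:$HOME/bin:/usr/local/spark/bin
export PATH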

Now we can exit and save the file using ctrl+x, reload the bash profile and check the paths.

source ~/.bash_profile
echo $PATH
echo $SPARK_EXAMPLES_JAR
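As a quick smoke test before the shell, spark-submit can run the stock SparkPi example using the jar we just exported (local mode with four cores; adjust to taste):

spark-submit --class org.apache.spark.examples.SparkPi --master 'local[4]' $SPARK_EXAMPLES_JAR 10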

Now we should be good to run the spark shell!!

spark-shell

There you have it folks, enjoy!

[Screenshot: spark-shell running]

To exit spark-shell use

:quit

To close your SSH session use

exit

Finally, don't forget to stop your VM to reduce your Azure spend; deallocating it releases the compute so you're no longer billed for it. Sensitive machine details, user details and RSA keys used for this blog have since been deleted.
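And if you're using the CLI, stopping is a one-liner with the hypothetical names from the earlier sketch:

az vm deallocate --resource-group spark-demo-rg --name spark-demo-vm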