Skip to content

Cluster and Spark GraphX setup instructions on AWS

Victor Zhang edited this page May 29, 2022 · 25 revisions

Instructions for setting up AWS cluster and executing Spark/GraphX algorithms on the cluster.

Navigate to EMR

Creating the cluster

  • Create Cluster

  • Change Release to emr 6.5.0 (most updated one)

  • Select Spark

  • Select instance types. For now, m5.xlarge.

  • EDIT: Under Hardware configuration, deselect "auto-termination"

  • Choose EC2 key pair EMR (real)

  • Launch the cluster

  • Modify the master's security policy to allow inbound traffic. This needs to be done once per IP. see https://www.oreilly.com/content/how-do-i-connect-to-my-amazon-elastic-mapreduce-emr-cluster-with-ssh/

    • Click on Summary Tab
    • Click on Security Group
    • Click On Master's security group
    • Edit inbound rules
    • Add rules -> SSH -> My Ip -> Save
  • Similarly, the Core nodes should allow SSH from the master's security group. This is also set, but similar instructions apply.

  • Ensure you can ssh into the master node from the public ip. If using putty, you need to convert the pem key to .ppk. See link above or google.

  • scp the .pem key file into the master node. You'll need to use the .ppk file as when sshing in.

  • ensure the ssh agent is running. eval `ssh-agent` EDIT: this is wrong if we want it to work in the background. we should be using eval "$(ssh-agent -s)"

  • chmod 400 'key file name'

  • ssh-add 'key file name'

  • Install ansible. sudo amazon-linux-extras install ansible2

  • Generate a hosts file for ansible. automation pending. EDIT: pls provide a skeleton example for this

  • cd /usr/lib/spark/sbin

  • sudo ./start-master.sh

  • get the connection string for spark from /var/log/spark/spark-root-org.apache.spark.[other stuff].out spark://'connection string'

  • Install spark (with ansible) from https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz

  • TODO: unzip spark

  • Use ansible to automate: ./start-worker.sh spark://'connection string'

Installing Spark and GraphX on the cluster

graphx instructions:

Make sure the master node has the key pair that enables it to connect to all of the workers:
scp -i EMR.pem EMR.pem hadoop@{remote_hostname}:/home/hadoop/EMR.pem

Within the master node:
eval $(ssh-agent)
ssh-add ~/EMR.pem

Install ansible:
sudo amazon-linux-extras -y install ansible2
sudo yum -y install ansible

Download Spark on master node:
wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz -P ~

Unzip on master:
tar -xvzf spark-3.2.1-bin-hadoop3.2.tgz

Run Spark daemon on master: (may need to re-run after restart) ~/spark-3.2.1-bin-hadoop3.2/sbin/start-master.sh

Run the following command to get the master daemon's info (what you want looks like this: spark//ip-172-31-94-126.ec2.internal:7077):
cat ~/spark-3.2.1-bin-hadoop3.2/logs/*.out

Install git on master if it isn't already installed:
sudo yum install -y git

Clone the repo that has the required ansible/Spark scripts:
git clone git@github.com:GraphStreamingProject/DistributedStreamingCC.git
cd DistributedStreamingCC
git checkout ansible_stuff

Create an inventory.ini file to be used by ansible. The AWS cluster settings should be used to determine the private IP of the workers.

Modify spark.yaml so it contains the correct addresses for the Spark daemon.

Disable host key checking:
export ANSIBLE_HOST_KEY_CHECKING=False

Install java and spark on the workers: EDIT: hadoop user usage in yaml files should be changed to ec2-user or genericized (see Evan's changes on main) ansible-playbook -i inventory.ini java.yaml ansible-playbook -i inventory.ini spark.yaml

The above script will:

  • Copy Spark tar to worker nodes and unzip.
  • Run Spark daemon processes on workers.

If the above steps are done correctly, the cluster is setup and configured to run Spark jobs. You may test this by running a sample Spark program:

  • Create a text file "/home/hadoop/sample.txt" that consists of a bunch of sentences.

Copy the sample to the workers:
ansible-playbook -i inventory.ini generate_sample.yaml

Edit the sparktest.py script, replacing the line sc = SparkContext("spark://ip-172-31-94-126.ec2.internal:7077","first app") with the proper info of your master daemon, as you determined from master's logs in a previous step.

Then run the toy Spark program, replacing "spark://ip-172-31-94-126.ec2.internal:7077" just like in the above step:
/home/hadoop/spark-3.2.1-bin-hadoop3.2/bin spark-submit --master spark//ip-172-31-94-126.ec2.internal:7077
/home/hadoop/DistributedStreamingCC/sparktest.py

========================================================================================= If you got to this point, that means the cluster is correctly setup to use Spark.

The next step is to install the correct scala version of the master. The version we want depends on the Spark version that we have. The site (https://spark-packages.org/package/graphframes/graphframes) lets you download graphframes, but certain Scala versions are only compatible with certain Spark versions. For example, Version: 0.8.2-spark2.4-s_2.11 requires Spark 2.4 and Scala 2.11. If you had Spark 2.4, you would need either Scala 2.11 or Scala 2.12 since those are the only supported Scala versions compatible with Spark 2.4 that GraphFrames supports.

Run the following on the master:

Debian:
scalaVer="2.12.8"
sudo yum remove scala-library scala
wget www.scala-lang.org/files/archive/scala-"$scalaVer".deb -P ~
dpkg -i scala-"$scalaVer".deb
apt-get update
apt-get install scala

RedHat:
scalaVer="2.12.8"
wget https://downloads.lightbend.com/scala/2.12.8/scala-"$scalaVer".rpm -P ~
sudo rpm -i ~/scala-"$scalaVer".rpm

After doing the above, verify that the correct Scala version has been installed with:
scala -version

We need to replicate the above steps on the worker nodes. Simply do:
ansible-playbook -i inventory.ini scala.yaml

The above will copy the scala rpm from master to the workers and then install scala on each worker using the rpm.

Now to install GraphFrames. Go to this site (https://spark-packages.org/package/graphframes/graphframes) and download the correct zip and jar based on your Spark/Scala versions.

For Spark 3.2 and Scala 2.12:
wget https://github.com/graphframes/graphframes/archive/1cd7abb0f424fd76d76ea07438e6486f44fbb440.zip -P ~
wget https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.2-s_2.12/graphframes-0.8.2-spark3.2-s_2.12.jar -P ~

Move the jar to the Spark jars directory:
cp ~/graphframes-0.8.2-spark3.2-s_2.12.jar ~/spark-3.2.1-bin-hadoop3.2/jars/

Run graphx.yaml:
ansible-playbook -i inventory.ini graphx.yaml

Decompile the jar:
jar xf ~/graphframes-0.8.2-spark3.2-s_2.12.jar
sudo cp -r ~/graphframes /usr/lib/python3.7/site-packages/
pip install numpy pip3 install numpy

================================ Accessing the Spark WebUI: ssh -i EMR.pem -D 12346 ec2-user@[master node public IP] Find the webUI address in spark logs (will be something like http://ip-172-31-9-24.ec2.internal:4040) Modify browser settings to allow proxy from port 12346 (on Firefox it's under settings > network settings > manual proxy configuration > SOCKS host) Connect to WebUI address while a Spark program is running

Running the first toy program:
mkdir ~/checkpoints NOTE: checkpoints may need to be made on a shared fs instead of just on master (???) /home/hadoop/spark-3.2.1-bin-hadoop3.2/bin/spark-submit --master spark//ip-172-31-94-126.ec2.internal:7077 /home/hadoop/DistributedStreamingCC/graphxtest1.py

The above toy program will run GraphFrames connected components on one of their sample datasets, but it will run it locally (i.e. only on the master node). If the program runs fine, it means GraphFrames successfully installed on the master node. The next step is getting GraphFrames to work on the entire cluster.

Second toy program:
/home/hadoop/spark-3.2.1-bin-hadoop3.2/bin/spark-submit --master spark//ip-172-31-94-126.ec2.internal:7077 /home/hadoop/DistributedStreamingCC/graphxtest2.py

This program will run GraphFrames BFS using the entire cluster. You will need to edit "graphxtest2.py" so it loads the SparkContext using your Spark master daemon's correct address.

If necessary, you may need to downgrade the Java version on all nodes from 11 to 8. Skip this step, but if things don't work, then come back to it: java.yaml does this.

Setting up NFS on master:

Install NFS:
sudo yum -y install nfs-utils

Create NFS root:
sudo mkdir /mnt/cloud

Give permissions:
sudo chown hadoop:hadoop /mnt/cloud
sudo chmod 777 /mnt/cloud

Add this line to /etc/exports:
/mnt/cloud *(rw,sync,no_root_squash,no_subtree_check)

Then run these two commands:
sudo exportfs -a #making the file share available
sudo systemctl restart nfs-kernel-server #restarting the NFS kernel