Cluster and Spark GraphX setup instructions on AWS
Instructions for setting up an AWS EMR cluster and executing Spark/GraphX algorithms on it.
- Navigate to EMR.
- Create Cluster.
- Change the release to emr-6.5.0 (the most recent one at the time of writing).
- Select Spark.
- Select instance types. For now, m5.xlarge.
- Choose the EC2 key pair EMR (your real key pair).
- Launch the cluster.
- Modify the master's security group to allow inbound SSH traffic. This needs to be done once per IP; an equivalent AWS CLI command is sketched after this list. See https://www.oreilly.com/content/how-do-i-connect-to-my-amazon-elastic-mapreduce-emr-cluster-with-ssh/
  - Click on the Summary tab
  - Click on Security Group
  - Click on the master's security group
  - Edit inbound rules
  - Add rules -> SSH -> My IP -> Save
- Similarly, the Core nodes should allow SSH from the master's security group. This is usually already set, but if not, similar instructions apply.
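If you prefer the command line, the same inbound SSH rule can be added with the AWS CLI. This is only a sketch: the security group ID and the CIDR below are placeholders for your master's security group and your own public IP.
# Sketch only: replace the group ID and CIDR with your master's security group and your public IP.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 \
    --cidr 203.0.113.7/32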
- Ensure you can ssh into the master node via its public IP. If using PuTTY, you need to convert the .pem key to .ppk; see the link above.
- scp the .pem key file into the master node (a command sketch follows below). You'll need to use the .ppk file, as when sshing in.
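For example (a sketch; the hostname is a placeholder for your master's public DNS and EMR.pem stands in for your key file):
# Sketch: copy the key into the hadoop user's home directory on the master.
scp -i EMR.pem EMR.pem hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:/home/hadoop/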
- Ensure the ssh agent is running:
  eval `ssh-agent`
- Restrict the key file's permissions:
  chmod 400 'key file name'
- Add the key to the agent:
  ssh-add 'key file name'
- Install ansible:
  sudo amazon-linux-extras install ansible2
- Generate a hosts file for ansible (automation pending); an example inventory appears in the GraphX instructions below.
- Start the Spark master:
  cd /usr/lib/spark/sbin
  sudo ./start-master.sh
- Get the connection string for Spark from the logs in /var/log/spark/ (one way to extract it is sketched below). It looks like:
  spark://'connection string'
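One way to pull that string out of the logs (a sketch; the exact log file names may differ):
# Sketch: search the Spark logs for the master URL.
grep -o "spark://[^ ]*" /var/log/spark/*.out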
- Install Spark (with ansible) from https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
- TODO: unzip Spark.
- Use ansible to automate running the following on each worker (a sketch follows below):
  ./start-worker.sh spark://'connection string'
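A rough sketch of that automation as an ansible ad-hoc command. It assumes a hosts/inventory file exists (as generated above) and that Spark was unpacked into /home/hadoop on each worker; both are assumptions, and the master URL is a placeholder.
# Sketch: start a worker daemon on every host in the inventory, pointing it at the master.
ansible all -i inventory.ini -m shell \
    -a "/home/hadoop/spark-3.2.1-bin-hadoop3.2/sbin/start-worker.sh spark://<connection string>"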
GraphX instructions:
Make sure the master node has the key pair that enables it to connect to all of the workers:
scp -i EMR.pem EMR.pem hadoop@{remote_hostname}:/home/hadoop/EMR.pem
Within the master node:
eval `ssh-agent`
ssh-add ~/EMR.pem
Install ansible:
sudo amazon-linux-extras install ansible2
sudo yum install ansible
Download Spark on master node:
wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
Unzip on master:
tar -xvzf spark-3.2.1-bin-hadoop3.2.tgz
Run Spark daemon on master:
~/spark-3.2.1-bin-hadoop3.2/sbin/start-master.sh
Run the following command to get the master daemon's info (the line you want looks like this: spark://ip-172-31-94-126.ec2.internal:7077):
cat ~/spark-3.2.1-bin-hadoop3.2/logs/*.out
Install git on master if it isn't already installed:
sudo yum install git
Clone the repo that has the required ansible/Spark scripts:
git clone git@github.com:GraphStreamingProject/DistributedStreamingCC.git
cd DistributedStreamingCC
git checkout ansible_stuff
Create an inventory.ini file to be used by ansible. Use the AWS cluster settings to determine the private IPs of the workers.
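A minimal sketch of what such a file could look like (the group name and IPs here are made up; use your workers' private IPs, check the playbooks in the repo for the group name they actually expect, and run this from the repo directory):
# Sketch: write a minimal inventory listing the worker nodes by private IP.
cat > inventory.ini <<'EOF'
[workers]
172.31.0.11
172.31.0.12
EOF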
Run spark.yaml on the master:
ansible-playbook -i inventory.ini spark.yaml
The above playbook will (see the sketch after this list):
- Copy Spark tar to worker nodes and unzip.
- Run Spark daemon processes on workers.
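For intuition, the copy-and-unzip step is roughly what the following ad-hoc commands would do (a sketch only; spark.yaml in the repo is the source of truth, and the paths and host group here are assumptions). Starting the worker daemons is then the same start-worker.sh call shown earlier, run through ansible.
# Sketch: push the Spark tarball to each worker and unpack it in the hadoop home directory.
ansible all -i inventory.ini -m copy \
    -a "src=/home/hadoop/spark-3.2.1-bin-hadoop3.2.tgz dest=/home/hadoop/"
ansible all -i inventory.ini -m unarchive \
    -a "src=/home/hadoop/spark-3.2.1-bin-hadoop3.2.tgz dest=/home/hadoop/ remote_src=yes"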
If the above steps are done correctly, the cluster is set up and configured to run Spark jobs. You may test this by running a sample Spark program:
- Create a text file "/home/hadoop/sample.txt" that consists of a few sentences (see the sketch below).
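For example (any text file with a few lines works; the sentences below are arbitrary):
# Sketch: create a small multi-line text file on the master.
printf 'the quick brown fox\njumps over the lazy dog\nthe dog barks back\n' > /home/hadoop/sample.txt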
Copy the sample to the workers:
ansible-playbook -i inventory.ini generate_sample.yaml
Edit the sparktest.py script, replacing the line sc = SparkContext("spark://ip-172-31-94-126.ec2.internal:7077","first app") so that it uses your own master daemon's URL, as determined from the master's logs in a previous step.
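For example, one way to make that substitution from the shell (a sketch; put your own master URL in place of the placeholder):
# Sketch: swap the hard-coded master URL in sparktest.py for your own.
sed -i 's|spark://ip-172-31-94-126.ec2.internal:7077|spark://<your-master-host>:7077|' \
    /home/hadoop/DistributedStreamingCC/sparktest.py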
Then run the toy Spark program, substituting your own master URL just as in the step above:
/home/hadoop/spark-3.2.1-bin-hadoop3.2/bin/spark-submit --master spark://ip-172-31-94-126.ec2.internal:7077 /home/hadoop/DistributedStreamingCC/sparktest.py
=========================================================================================
If you got to this point, the cluster is correctly set up to use Spark.
The next step is to install the correct Scala version on the master. The version we want depends on the Spark version that we have. The site (https://spark-packages.org/package/graphframes/graphframes) lets you download GraphFrames, but each GraphFrames build is only compatible with particular Spark and Scala versions. For example, version 0.8.2-spark2.4-s_2.11 requires Spark 2.4 and Scala 2.11. If you had Spark 2.4, you would need either Scala 2.11 or Scala 2.12, since those are the only Scala versions GraphFrames supports for Spark 2.4.
Run the following on the master:
Debian:
scalaVer="2.12.8"
sudo apt-get remove scala-library scala
wget https://www.scala-lang.org/files/archive/scala-"$scalaVer".deb
sudo dpkg -i scala-"$scalaVer".deb
sudo apt-get update
sudo apt-get install scala
RedHat:
scalaVer="2.12.8"
wget https://downloads.lightbend.com/scala/"$scalaVer"/scala-"$scalaVer".rpm
sudo rpm -i scala-"$scalaVer".rpm
After doing the above, verify that the correct Scala version has been installed with:
scala -version
We need to replicate the above steps on the worker nodes. Simply do:
ansible-playbook -i inventory.ini scala.yaml
The above will copy the scala rpm from master to the workers and then install scala on each worker using the rpm.
Now to install GraphFrames. Go to this site (https://spark-packages.org/package/graphframes/graphframes) and download the correct zip and jar based on your Spark/Scala versions.
For Spark 3.2 and Scala 2.12:
wget https://github.com/graphframes/graphframes/archive/1cd7abb0f424fd76d76ea07438e6486f44fbb440.zip
wget https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.2-s_2.12/graphframes-0.8.2-spark3.2-s_2.12.jar
Copy the jar to the Spark jars directory:
cp graphframes-0.8.2-spark3.2-s_2.12.jar ~/spark-3.2.1-bin-hadoop3.2/jars/
Extract the contents of the jar:
jar xf graphframes-0.8.2-spark3.2-s_2.12.jar
sudo cp -r graphframes /usr/lib/python3.7/site-packages/
pip install numpy
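A quick sanity check that the Python package landed where it needs to be (a sketch; adjust the site-packages path if your Python version differs):
# Sketch: the graphframes package directory should now exist under site-packages.
ls /usr/lib/python3.7/site-packages/graphframes/__init__.py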
================================
Running the first toy program:
mkdir ~/checkpoints
/home/hadoop/spark-3.2.1-bin-hadoop3.2/bin/spark-submit --master spark://ip-172-31-94-126.ec2.internal:7077 /home/hadoop/DistributedStreamingCC/graphxtest1.py
The above toy program runs GraphFrames connected components on one of their sample datasets, but only locally (i.e. on the master node). If it runs fine, GraphFrames was installed successfully on the master node. The next step is getting GraphFrames to work on the entire cluster.
Second toy program:
/home/hadoop/spark-3.2.1-bin-hadoop3.2/bin/spark-submit --master spark://ip-172-31-94-126.ec2.internal:7077 /home/hadoop/DistributedStreamingCC/graphxtest2.py