Spark Optimization:
**1:** Standalone Spark setup — Master: 15 GB memory and 12 cores; Worker: 1 GB memory per executor, 6 cores per executor.
**2:** Spark-submit command used:
bin/spark-submit --master spark://Host:7077 optimize.py
**1. Resource Tuning**
* Since Spark was running in standalone mode, increasing the number of executors had a negative effect on performance.
* Having 2 worker nodes with 1 executor and 6 cores each took 14 seconds to complete the job, whereas a single worker node with 2 executors took 13 seconds.
* When the number of executors was increased to 10 with a single core each, the job took more than 30 seconds to complete.
* The best time of 12 seconds was achieved with the configuration below on a single worker machine:
--num-executors 1 --executor-cores 12 --executor-memory 1GB
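The same best-run settings can also be applied programmatically; below is a minimal sketch, assuming the standalone master URL from the spark-submit command above and using standard Spark config keys (the spark.cores.max cap is my assumption for keeping everything in a single executor).

from pyspark.sql import SparkSession

# Sketch: the tuned resources expressed as configs instead of spark-submit flags
# (values taken from the best 12-second run above).
spark = (SparkSession.builder
         .master('spark://Host:7077')
         .appName('optimize')
         .config('spark.executor.cores', '12')   # one executor using all 12 cores
         .config('spark.executor.memory', '1g')
         .config('spark.cores.max', '12')        # cap total cores so no extra executors start
         .getOrCreate())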
**DataFrame Caching**
- DataFrame caching using cache()/persist() had no effect at all; execution times before and after were the same.
answers_month.cache()
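Worth noting: cache() is lazy, so nothing is stored until an action materializes the DataFrame. A minimal sketch of how the cached DataFrame would typically be exercised (the count() call and the question_id join key are illustrative assumptions):

# Sketch: cache() only marks the DataFrame; the first action populates the cache,
# and later uses of answers_month can then read from memory.
answers_month.cache()
answers_month.count()                                        # first action fills the cache
resultDF = questionsDF.join(answers_month, 'question_id')    # reuse reads cached data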
** Repartitioning **
- Repartitioning answersDF gave a 2 second gain when the default Spark configuration was used.
from pyspark.sql.functions import col, count, month
answers_month = answersDF.withColumn('month', month('creation_date')).repartition(col('question_id'), col('month')).groupBy('question_id', 'month').agg(count('*').alias('cnt'))
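A quick sanity check on partitioning is to read the partition count directly; a small sketch using the standard getNumPartitions() call:

# Sketch: inspect how many partitions the repartitioned data ends up in
# (column-based repartition defaults to spark.sql.shuffle.partitions partitions).
print(answersDF.withColumn('month', month('creation_date'))
               .repartition(col('question_id'), col('month'))
               .rdd.getNumPartitions())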
**Broadcast and Join Order**
- Using either a broadcast hint or changing the join order of questionsDF and answers_month had no effect on the execution time either (sketch below).
- From the Spark documentation I see that broadcast joins are on by default for tables under 10 MB (spark.sql.autoBroadcastJoinThreshold).
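For completeness, here is a minimal sketch of forcing the broadcast explicitly (the question_id join key is an assumption):

from pyspark.sql.functions import broadcast

# Sketch: hint that answers_month should be broadcast to every executor.
# When the smaller side is already under the 10 MB spark.sql.autoBroadcastJoinThreshold,
# Spark broadcasts it automatically, so the explicit hint changes nothing.
resultDF = questionsDF.join(broadcast(answers_month), 'question_id')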
**Limiting Shuffle Partitions**
- I see that by default the number of shuffle partitions is 200, which creates a lot of shuffle overhead. Using the
spark.sql.shuffle.partitions
config option, the number of shuffle partitions can be controlled.
- Setting this to less than 20 gave a 2 second performance gain. Overall the job took about 9 seconds after tuning, which is about 5 seconds less than the original 14 seconds.
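A minimal sketch of applying this setting at runtime (spark refers to the active SparkSession; the value 20 mirrors the tuning described above):

# Sketch: lower the shuffle partition count so the data is not split into 200 tiny tasks.
spark.conf.set('spark.sql.shuffle.partitions', '20')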