Skip to content

Swapnay/Spark_Optimization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Spark Optimization:

** 1: ** Standalone spark with: Master: 15 GB memory and 12 Cores Worker: 1 GB per executor 6 cores per executor.

** 2: ** Spark submit command used.

bin/spark-submit --master spark://Host:7077 optimize.py

** 1. Resource Tuning ** Here * Since spark was running in stand alone mode, increasing the number of executors had negative effect on performance. * having 2 worker nodes with 1 executor and 6 core took 14 sec to complete the job where as having single worker node with 2 executors took 13 seconds to complete * When number of executors were increased to 10 with single core, job took more than 30 seconds to complete * Best time of 12 Seconds was achieved with below configuration on single worker machine --num-executors 1 --executor-cores 12 --executor-memory 1GB

** Data frame Caching **

  • DataFrame caching using cache()/persist() had no effect at all.Before and after execution times were same. answers_month.cache()

** Repartitioning **

  • Repartitioning answersDF had 2 seconds gain when default spark configuration was use. answersDF.withColumn('month', month('creation_date')).repartition(col('question_id'),col('month')).groupBy('question_id', 'month').agg(count('*').alias('cnt'))

** Broadcast and join order **

  • Using either Broadcast or changing join order of questionsDF and answers_month had no effect on time either
  • From Here i see that by default under 10MB broadcast is on.

** Limiting shuffle partitions **

  • I see that by default number of partitions are 200.That will cause lot of shuffle.Using spark.sql.shuffle.partitions config option number of shuffle partitions can be controlled.
  • setting this to less than 20 had 2 second performance gain. Overall the job took about 9 seconds after tuning, which is about 5 seconds less than original 14 seconds.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages