PySpark Project Creation
- After setup, your PATH should include the Spark and Java bin directories, e.g.:
/home/sm/Downloads/spark-2.4.0-bin-hadoop2.7/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/lib/jvm/java-8-oracle/bin:/usr/lib/jvm/java-8-oracle/db/bin:/usr/lib/jvm/java-8-oracle/jre/bin
- sudo add-apt-repository ppa:webupd8team/java
- sudo apt update
- sudo apt install oracle-java8-installer
- export SPARK_HOME=/home/zekelabs-user/Downloads/spark-2.4.3-bin-hadoop2.7
- export PATH=$SPARK_HOME/bin:$PATH
- export PYSPARK_PYTHON=python3
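To confirm the exports above took effect, a quick sanity check can be run in Python (a minimal sketch; `check_env` and the example values are illustrative, matching the paths used in this guide):

```python
import os

def check_env(env):
    """Return a list of problems with the Spark-related environment variables."""
    problems = []
    spark_home = env.get('SPARK_HOME')
    if not spark_home:
        problems.append('SPARK_HOME is not set')
    elif spark_home + '/bin' not in env.get('PATH', '').split(':'):
        problems.append('$SPARK_HOME/bin is not on PATH')
    if env.get('PYSPARK_PYTHON') != 'python3':
        problems.append('PYSPARK_PYTHON should be python3')
    return problems

# Example environment mirroring the exports above; pass os.environ to check your shell.
example = {
    'SPARK_HOME': '/home/zekelabs-user/Downloads/spark-2.4.3-bin-hadoop2.7',
    'PATH': '/home/zekelabs-user/Downloads/spark-2.4.3-bin-hadoop2.7/bin:/usr/bin',
    'PYSPARK_PYTHON': 'python3',
}
print(check_env(example))  # → []
```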
- Create a project directory
- Copy the launch_spark_submit script here (required if a Jupyter notebook is also running against the same Spark installation):
#!/bin/bash
# Run spark-submit without the Jupyter driver setting, then restore the variable.
unset PYSPARK_DRIVER_PYTHON
spark-submit "$@"
export PYSPARK_DRIVER_PYTHON=jupyter
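The wrapper's trick is simply to remove PYSPARK_DRIVER_PYTHON from the environment seen by spark-submit. The same pattern can be sketched in Python (the `run_without_env` helper is hypothetical, used here only to show that the child process never sees the variable):

```python
import os
import subprocess

def run_without_env(cmd, var):
    """Run cmd with `var` stripped from the environment, mirroring the wrapper script."""
    env = {k: v for k, v in os.environ.items() if k != var}
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# The parent sets the variable (as a notebook session would)...
os.environ['PYSPARK_DRIVER_PYTHON'] = 'jupyter'
# ...but the child process does not see it.
result = run_without_env(
    ["python3", "-c", "import os; print(os.environ.get('PYSPARK_DRIVER_PYTHON'))"],
    'PYSPARK_DRIVER_PYTHON',
)
print(result.stdout.strip())  # → None
```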
- Now create the entry program entry.py with a 'main' guard:
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName('PySpark-App').getOrCreate()
    print('Session created')
    # Note: Spark does not expand '~'; prefer an absolute path to the CSV file
    emp_data = spark.read.csv('~/Downloads/HR_comma_sep.csv', inferSchema=True, header=True)
    print(emp_data.count())
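Rather than hard-coding the CSV path, entry.py can accept it on the command line with argparse. A minimal sketch (the `--input` flag and `parse_args` helper are illustrative, not part of the original program):

```python
import argparse

def parse_args(argv=None):
    """Hypothetical CLI for entry.py: lets the CSV path be passed at submit time."""
    parser = argparse.ArgumentParser(description='PySpark-App entry point')
    parser.add_argument('--input', default='HR_comma_sep.csv',
                        help='path to the input CSV file')
    return parser.parse_args(argv)

# Simulate: spark-submit ... entry.py --input /data/HR_comma_sep.csv
args = parse_args(['--input', '/data/HR_comma_sep.csv'])
print(args.input)  # → /data/HR_comma_sep.csv
```

Any arguments after the script name on the spark-submit command line are passed through to the application unchanged.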
- Create another directory 'additionalCode'
- cd additionalCode
- Create setup.py:
from setuptools import setup
setup(
name='PySparkUtilities',
version='0.1dev',
packages=['utilities'],
license='''
Creative Commons
Attribution-Noncommercial-Share Alike license''',
long_description='''
An example of how to package code for PySpark'''
)
- mkdir utilities
- Copy your modules inside it (add an empty __init__.py so setuptools recognizes the directory as a package)
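As an illustration, here is the kind of module that could live in utilities/ (temperature.py and its functions are a hypothetical example; any module works):

```python
# Hypothetical utilities/temperature.py module to package into the egg.

def celsius_to_fahrenheit(c):
    """Convert a temperature from Celsius to Fahrenheit."""
    return c * 9.0 / 5.0 + 32.0

def fahrenheit_to_celsius(f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (f - 32.0) * 5.0 / 9.0

print(celsius_to_fahrenheit(100))  # → 212.0
```

entry.py can then use it with `from utilities import temperature` once the egg is shipped via --py-files.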
- sudo apt-get install python3-pip
- pip3 install setuptools
- In additionalCode, execute: python setup.py bdist_egg
- This will create a dist directory containing the egg file (its exact name depends on the version in setup.py and the Python version used to build it)
- To run: ./launch_spark_submit.sh --master local[4] --py-files additionalCode/dist/PySparkUtilities-0.2.dev0-py2.7.egg entry.py
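Under the hood, --py-files works by placing the egg/zip on sys.path of the driver and executors so its modules become importable. The mechanism can be sketched without Spark (the zip contents here are a hypothetical utilities package built on the fly):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny zip containing a hypothetical 'utilities' package,
# standing in for the egg produced by bdist_egg.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, 'PySparkUtilities.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('utilities/__init__.py', '')
    zf.writestr('utilities/temperature.py',
                'def celsius_to_fahrenheit(c):\n'
                '    return c * 9.0 / 5.0 + 32.0\n')

# Putting the archive on sys.path is effectively what --py-files /
# SparkContext.addPyFile does on each node.
sys.path.insert(0, zip_path)
from utilities import temperature
print(temperature.celsius_to_fahrenheit(0))  # → 32.0
```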