dataframe
Follow the instructions for your environment.
For this lab we are going to use clickstream data in JSON format.
Follow the steps below to stage the sample data in HDFS.
# be in the working directory
$ cd hadoop-spark/dataframe
# create an HDFS directory
$ hdfs dfs -mkdir -p clickstream/in-json
# copy sample file
$ hdfs dfs -put ../data/click-stream/clickstream.json clickstream/in-json
Let's generate some more clickstream JSON data and stage it in HDFS.
# generate data
$ python ../data/clickstream/gen-clickstream-json.py
# copy data to HDFS
$ hdfs dfs -put *.json clickstream/in-json/
# verify data in HDFS
$ hdfs dfs -ls clickstream/in-json/
Launch the Spark shell:
$ spark-shell
Then set the log level to WARN:
scala>
sc.setLogLevel("WARN")
Issue the following commands in the Spark shell.
val clickstream = sqlContext.read.json("/user/root/clickstream/in-json/clickstream.json")
// inspect the schema
clickstream.printSchema
// inspect data
clickstream.show
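Beyond printSchema and show, you can project just the columns you care about. A minimal exploration sketch (the domain, action, and cost column names are the ones used in the later steps of this lab):
// show only a few columns and limit the output to 5 rows
clickstream.select("domain", "action", "cost").show(5)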
Use tab completion to explore the methods available on the DataFrame:
clickstream.<TAB>
Find records that have cost > 100
val over100 = clickstream.filter(clickstream("cost") > 100)
over100.show
// count the records
over100.count
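Filter conditions can also be combined. A small sketch, assuming the action column used in the next step:
// records that are both expensive and actually clicked
clickstream.filter(clickstream("cost") > 100 && clickstream("action") === "clicked").count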
Count records where action = 'clicked'
clickstream.filter(clickstream("action") === "clicked").count
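The same column expressions compose with select and distinct. For example, a sketch of counting how many distinct domains recorded a 'clicked' action:
// distinct domains with at least one 'clicked' action
clickstream.filter(clickstream("action") === "clicked").select("domain").distinct.count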
Count the number of visits from each domain
clickstream.groupBy("domain").count.show
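To rank the domains by visit count, the grouped result can be sorted with orderBy; here is a sketch using the desc helper from org.apache.spark.sql.functions:
import org.apache.spark.sql.functions.desc
// the count column produced by groupBy(...).count is named "count"
clickstream.groupBy("domain").count.orderBy(desc("count")).show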
// register the data frame as a temporary table
clickstream.registerTempTable("clickstream")
// try sql queries
sqlContext.sql("select * from clickstream").show
// find all facebook traffic
sqlContext.sql("select * from clickstream where domain = 'facebook.com'").show
// count traffic per domain from highest to lowest
sqlContext.sql("select domain, count(*) as total from clickstream group by domain order by total desc").show
Let's load all the JSON files in the clickstream/in-json directory.
val clicks = sqlContext.read.json("/user/root/clickstream/in-json/")
clicks.count
clicks.groupBy("domain").count.show
// repeat the queries from step 6 against the full dataset (a sketch follows below)
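For example, the same SQL pattern applies once the combined DataFrame is registered as a temporary table. A minimal sketch (the clicks_all table name is arbitrary):
clicks.registerTempTable("clicks_all")
sqlContext.sql("select domain, count(*) as total from clicks_all group by domain order by total desc").show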