Skip to content

Latest commit

 

History

History
101 lines (70 loc) · 9.17 KB

week-6.adoc

File metadata and controls

101 lines (70 loc) · 9.17 KB

Week 6 - New Zealand credit card spend data set

In this week’s episode, Alexander and Lju explore a data set on credit card spending in New Zealand, from the New Zealand government statistics website. Catch-up on the episode now!

maxresdefault
Week 6 catch-up video

The data set

This week we explored New Zealand credit card spend data, which had information on amounts, dates and categories involved in the spend.

Unfortunately, we discovered during the stream that the data in the CSV had a couple of issues:

  • We were unable to use LOAD CSV directly against the compressed file. Usually, this is handled by Cypher

  • It turns out that the data are the stats for populating a chart 🤦

For solving the first problem, we took the executive decision to just upload a decompressed version of the CSV up to a repo, so that for those of you following along, you’ll be able to run the queries in this write up.

For resolving the second problem, we just shrugged our shoulders and carried on. We can still use that data to answer some questions! Just perhaps not the questions we’d liked to have asked!

The questions

  • Which per month was highest spend category?

  • Per category, do people spend more on credit or debit cards? - Unfortunately we can’t answer this question, based on the data set! Sorry!

The data model

You can view and play with the data model in Arrows App here.

Nothing particularly contraversial here, and as always, there are a number of ways to do the model. One stand-out item here, we put cardType as a property on SeriesReference. We could have potentially done this as a double label as well.

img\nzcc
Figure 1. The New Zealand credit card data model used in the stream

Loading the data

To load the data into the database, we use the following query:

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/lju-lazarevic/misc/main/electronic-card-transactions-july-2021-csv-tables.csv" AS row
WITH row
WHERE row.Series_title_1 = "Actual"
MERGE (c:Category {type:row.Series_title_2})
MERGE (d:Date {value:row.Period})
CREATE (s:SeriesReference {id:row.Series_reference, value:tofloat(row.Data_value)})
CREATE (s)-[:HAS_DATE]->(d)
CREATE (s)-[:HAS_Category]->(c)

Note that we’re only drawing the 'Actual' values from the data. Also, row.Series_title_2 (fairly safe to say this gives the game away about the data - whoops) has a mix of things, including those category titles we’re interested in.

As before, we need to cast lots of the numbers into floats as appropriate!.

Querying the data

Question - Which per month was highest spend category?

Starting off, let’s have a look at what comes back as a category.

MATCH (d:Date)<--(s)-->(c:Category)
WHERE s.value IS NOT NULL
RETURN d.value AS Date, s.value AS Spend, c.type AS Category ORDER BY Date, Category

As well as getting categories of interest, we also get a bunch of other values we’re not interested in, such as 'Total spend', etc. We can provide a list of things we’re interested in to tweak the above query:

WITH ["Apparel", "Consumables", "Durables", "Fuel", "Hospitality", "Motor vehicles excl. fuel", "Non-retail excl. services", "Services"] AS cats
MATCH (d:Date)<--(s)-->(c:Category)
WHERE s.value IS NOT NULL AND c.type IN cats
RETURN d.value AS Date, s.value AS Spend, c.type AS Category ORDER BY Date, Category

Last but not least, the specific question was to get the top category per month, so let’s get that done! We’ll do it by using a little trick of ordering and then collecting values into an array, and then pulling the first value to get the 'top' values:

WITH ["Apparel", "Consumables", "Durables", "Fuel", "Hospitality", "Motor vehicles excl. fuel", "Non-retail excl. services", "Services"] AS cats
MATCH (d:Date)<--(s)-->(c:Category)
WHERE s.value IS NOT NULL AND c.type IN cats
WITH d, s, c ORDER BY d.value, s.value desc
WITH d, collect([c.type, s.value]) as col
RETURN d.value AS Date, col[0][1] AS Spend, col[0][0] AS Category
Tip
When using multiple values as part of an ORDER BY, you will need to specify each time the order, either ascending (leave blank as is by default) or descending with DESC