In this week’s episode, Alexander and Lju explore a data set on credit card spending in New Zealand, from the New Zealand government statistics website. Catch-up on the episode now!
This week we explored New Zealand credit card spend data, which had information on amounts, dates and categories involved in the spend.
Unfortunately, we discovered during the stream that the data in the CSV had a couple of issues:
-
We were unable to use
LOAD CSV
directly against the compressed file. Usually, this is handled by Cypher -
It turns out that the data are the stats for populating a chart 🤦
For solving the first problem, we took the executive decision to just upload a decompressed version of the CSV up to a repo, so that for those of you following along, you’ll be able to run the queries in this write up.
For resolving the second problem, we just shrugged our shoulders and carried on. We can still use that data to answer some questions! Just perhaps not the questions we’d liked to have asked!
-
Which per month was highest spend category?
-
Per category, do people spend more on credit or debit cards? - Unfortunately we can’t answer this question, based on the data set! Sorry!
You can view and play with the data model in Arrows App here.
Nothing particularly contraversial here, and as always, there are a number of ways to do the model. One stand-out item here, we put cardType
as a property on SeriesReference
. We could have potentially done this as a double label as well.
To load the data into the database, we use the following query:
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/lju-lazarevic/misc/main/electronic-card-transactions-july-2021-csv-tables.csv" AS row
WITH row
WHERE row.Series_title_1 = "Actual"
MERGE (c:Category {type:row.Series_title_2})
MERGE (d:Date {value:row.Period})
CREATE (s:SeriesReference {id:row.Series_reference, value:tofloat(row.Data_value)})
CREATE (s)-[:HAS_DATE]->(d)
CREATE (s)-[:HAS_Category]->(c)
Note that we’re only drawing the 'Actual' values from the data. Also, row.Series_title_2
(fairly safe to say this gives the game away about the data - whoops) has a mix of things, including those category titles we’re interested in.
As before, we need to cast lots of the numbers into floats as appropriate!.
Question - Which per month was highest spend category?
Starting off, let’s have a look at what comes back as a category.
MATCH (d:Date)<--(s)-->(c:Category)
WHERE s.value IS NOT NULL
RETURN d.value AS Date, s.value AS Spend, c.type AS Category ORDER BY Date, Category
As well as getting categories of interest, we also get a bunch of other values we’re not interested in, such as 'Total spend', etc. We can provide a list of things we’re interested in to tweak the above query:
WITH ["Apparel", "Consumables", "Durables", "Fuel", "Hospitality", "Motor vehicles excl. fuel", "Non-retail excl. services", "Services"] AS cats
MATCH (d:Date)<--(s)-->(c:Category)
WHERE s.value IS NOT NULL AND c.type IN cats
RETURN d.value AS Date, s.value AS Spend, c.type AS Category ORDER BY Date, Category
Last but not least, the specific question was to get the top category per month, so let’s get that done! We’ll do it by using a little trick of ordering and then collecting values into an array, and then pulling the first value to get the 'top' values:
WITH ["Apparel", "Consumables", "Durables", "Fuel", "Hospitality", "Motor vehicles excl. fuel", "Non-retail excl. services", "Services"] AS cats
MATCH (d:Date)<--(s)-->(c:Category)
WHERE s.value IS NOT NULL AND c.type IN cats
WITH d, s, c ORDER BY d.value, s.value desc
WITH d, collect([c.type, s.value]) as col
RETURN d.value AS Date, col[0][1] AS Spend, col[0][0] AS Category
Tip
|
When using multiple values as part of an ORDER BY , you will need to specify each time the order, either ascending (leave blank as is by default) or descending with DESC
|