I know I’m late to the game as Wordle was sold to the New York Times last week.
I haven’t played it much myself just saw it popping up on my twitter stream a lot.
If you haven’t played wordle yourself, it is a guessing game similar to mastermind, only with (5-letter) english words (there are now many clones in other languages). Correctly guessed letters in the right position appear green, wrong position yellow and incorrect letters gray. The goal is to guess the word in 6 attempts, i.e. make all 5 letters green.
Our creative folks even created a Neo4j version of it, see if you can spot the mistake.
So I thought one late night that it would be fun to represent the wordle world as a graph. And here we go.
If you missed our live-stream here is the recording, we had a lot of fun both playing the game but also exploring the graph version of it:
That wordle of the day was "elder", which got got to in attempt 3, after someone from the stream hinting at it.
You can find past live sessions and deep dive blog write-ups.
All the Cypher statements and the source CSV can be found in this GitHub repository:
But let’s create our database instance first, so that you can follow the modeling and import.
I found a scraped wordle list in this repository of an R solver, which I turned into a CSV with 12k entries.
Here are the first few words, none of which I would have associated with English :)
aahed aalii aargh aarti abaca
The CSV is available on GitHub, it doesn’t have a header and only a single column.
We can load the CSV data directly with LOAD CSV into Word
nodes, first we create a constraint to make this and future lookups fast and the import repeatable.
// create constraint
CREATE CONSTRAINT word_name IF NOT EXISTS ON (w:Word) ASSERT w.name IS UNIQUE;
// load csv and turn each row into a Word node
LOAD CSV FROM "https://raw.githubusercontent.com/jexp/wordle-graph/main/wordle.csv" AS row
MERGE (:Word {name:row[0]});
That’s a lot of 5-letter words (12k) nodes.
I also put a dump file on s3: https://data.neo4j.com/wordle.dump that you can load into your AuraDB or Neo4j Desktop.
The idea is to split the words into their constituent letters and represent those "character positions" in the graph, sharing letters and positions as needed.
That will allow us to later query things like
-
character frequency
-
follow frequencies
-
top-starter words
-
possible solution words if we have already some positive and negative clues
Initially I represented characters at positions with dedicated nodes labeled CharAtPos
with a char
and an idx
property connected to the word via HAS
and to each other with an NEXT
relationship and the first node (idx=0) having an extra STARTS
relationship.
As it turns out the model might have been a bit overengineered :)
Here is the code to split the words into characters and then create the first CharAt
pos and subsequent (1..4) nodes and connect them.
To make it work with memory constrained environments I wrapped it in a CALL {} IN TRANSACTIONS
subquery, but even in AuraDB Free that was actually not necessary.
MATCH (w:Word)
CALL { WITH w
WITH w, split(w.name,"") AS chars
MERGE (start:CharAtPos {idx:0, char:chars[0]})
MERGE (w)-[:STARTS]->(start)
MERGE (w)-[:HAS]->(start)
WITH *
UNWIND range(1,size(chars)-1) AS idx
MERGE (next:CharAtPos {idx:idx, char:chars[idx]})
MERGE (w)-[:HAS]->(next)
WITH *
MATCH (prev:CharAtPos {idx:idx-1, char:chars[idx-1]})
MERGE (prev)-[:NEXT]->(next)
} IN TRANSACTIONS OF 1000 ROWS;
IF we query a word like crash
now based on our model:
match p=(:Word {name:"crash"})--()
return p
To solve a word, you pass in the letters you know with their positions and the letters that you don’t have the right position for and match any words that fit this pattern.
MATCH (c1:CharAtPos {idx:0, char:'c'}),
(c5:CharAtPos {idx:4, char:'h'}),
(c:CharAtPos {char:'a'})
match (w:Word)-[:HAS]->(c1),
(w)-[:HAS]->(c5),
(w)-[:HAS]->(c)
return w.name;
╒════════╕ │"w.name"│ ╞════════╡ │"clach" │ ├────────┤ │"clash" │ ├────────┤ │"caneh" │ ├────────┤ │"coach" │ ├────────┤ │"catch" │ ├────────┤ │"crash" │ └────────┘
If we have more information, then we can extend the query by excluding letters or positions and get a smaller result set. It takes a bit longer to query due to the exclusions.
match (c1:CharAtPos {idx:0, char:'c'}), // correct
(c2:CharAtPos {idx:1, char:'a'}), // wrong pos
(c3:CharAtPos {char:'l'}), // incorrect
(c4:CharAtPos {char:'i'}), // incorrect
(c5:CharAtPos {idx:4, char:'h'}), // correct
(c:CharAtPos {char:'a'})
match (w:Word)-[h1:HAS]->(c1),
(w)-[h2:HAS]->(c5), (w)-[h3:HAS]->(c)
WHERE not exists { (w)-[:HAS]->(c2) } and not exists { (w)-[:HAS]->(c3) } and not exists { (w)-[:HAS]->(c4) }
return *
An alternative model represents just the 26 characters and puts the position onto the relationship either as a property or as the rel-type.
Because we know we have 5 letters we can just spell it out.
MATCH (w:Word)
WITH w, split(w.name,"") AS chars
MERGE (c0:Char {char:chars[0]})
MERGE (w)-[p0:POS0]->(c0) SET p0.idx = 0
MERGE (c1:Char {char:chars[1]})
MERGE (w)-[p1:POS1]->(c1) SET p1.idx = 1
MERGE (c2:Char {char:chars[2]})
MERGE (w)-[p2:POS2]->(c2) SET p2.idx = 2
MERGE (c3:Char {char:chars[3]})
MERGE (w)-[p3:POS3]->(c3) SET p3.idx = 3
MERGE (c4:Char {char:chars[4]})
MERGE (w)-[p4:POS4]->(c4) SET p4.idx = 4;
We can first look at the representation of a word ("diver") in this model.
match (w:Word {name:"diver"})-[r]->(c:Char)
return *;
Then see how shared characters between two words ("diver" and "elder") look like, here c
are the shared letters and c1, c2 the other letters respectively.
match path = (c1:Char)<--(:Word {name:"diver"})-->(c:Char)
match path2 = (c:Char)<--(:Word {name:"elder"})-->(c2:Char)
return path, path2;
Looking at that frequent suffix of er
, we wanted to see what the letter frequencies look like.
With this model we can also easy look at character frequencies and follow probabilities.
For the letter frequencies we just count the number of relationships pointing to a character node (aka the in-degree).
MATCH (c:Char)
RETURN c.char, size((c)<--()) as deg
ORDER BY deg DESC;
As expected for the English language the vowels, and R
,S
,T
are pretty high up.
╒════════╤═════╕ │"c.char"│"deg"│ ╞════════╪═════╡ │"s" │6665 │ ├────────┼─────┤ │"e" │6662 │ ├────────┼─────┤ │"a" │5990 │ ├────────┼─────┤ │"o" │4438 │ ├────────┼─────┤ │"r" │4158 │ ├────────┼─────┤ │"i" │3759 │ ├────────┼─────┤ │"l" │3371 │ ├────────┼─────┤ │"t" │3295 │ ├────────┼─────┤ │"n" │2952 │ ├────────┼─────┤ │"u" │2511 │ ├────────┼─────┤ │"d" │2453 │ ├────────┼─────┤ │"y" │2074 │ ├────────┼─────┤ │"c" │2028 │ ├────────┼─────┤ │"p" │2019 │ ├────────┼─────┤ │"m" │1976 │ ├────────┼─────┤ │"h" │1760 │ ├────────┼─────┤ │"g" │1644 │ ├────────┼─────┤ │"b" │1627 │ ├────────┼─────┤ │"k" │1505 │ ├────────┼─────┤ │"f" │1115 │ ├────────┼─────┤ │"w" │1039 │ ├────────┼─────┤ │"v" │694 │ ├────────┼─────┤ │"z" │434 │ ├────────┼─────┤ │"j" │291 │ ├────────┼─────┤ │"x" │288 │ ├────────┼─────┤ │"q" │112 │ └────────┴─────┘
We can now use that information to find good starting words, by summing up the character frequencies for each word.
MATCH (w:Word)
MATCH (w)-->(c:Char)
RETURN w.name, sum(size((c)<--())) as total
ORDER BY total DESC LIMIT 5;
We see there are quite a lot of "cheater" words which contain the high frequency characters multiple times.
╒════════╤═══════╕ │"w.name"│"total"│ ╞════════╪═══════╡ │"esses" │33319 │ ├────────┼───────┤ │"sasse" │32647 │ ├────────┼───────┤ │"sessa" │32647 │ ├────────┼───────┤ │"asses" │32647 │ ├────────┼───────┤ │"eases" │32644 │ └────────┴───────┘
If we want to avoid that we can state, that we want to only look at words with 5 distinct characters.
MATCH (w:Word)
MATCH (w)-->(c:Char)
RETURN w.name, sum(size((c)<--())) as total, count(distinct c) = 5 as uniques
ORDER BY uniques DESC, total DESC LIMIT 10;
╒════════╤═══════╤═════════╕ │"w.name"│"total"│"uniques"│ ╞════════╪═══════╪═════════╡ │"arose" │27913 │true │ ├────────┼───────┼─────────┤ │"soare" │27913 │true │ ├────────┼───────┼─────────┤ │"aeros" │27913 │true │ ├────────┼───────┼─────────┤ │"serai" │27234 │true │ ├────────┼───────┼─────────┤ │"arise" │27234 │true │ ├────────┼───────┼─────────┤ │"reais" │27234 │true │ ├────────┼───────┼─────────┤ │"aesir" │27234 │true │ ├────────┼───────┼─────────┤ │"raise" │27234 │true │ ├────────┼───────┼─────────┤ │"aloes" │27126 │true │ ├────────┼───────┼─────────┤ │"stoae" │27050 │true │ └────────┴───────┴─────────┘
Still not perfect, most of these are just variations of the same set of letters. Let’s group them by the sorted set of letters and show the first few.
MATCH (w:Word)
MATCH (w)-->(c:Char)
RETURN apoc.coll.sort(split(w.name,'')) as letters, sum(size((c)<--())) as total, count(distinct c) = 5 as uniques, collect(w.name)[0..2] as words
ORDER BY uniques DESC, total DESC LIMIT 10;
╒═════════════════════╤═══════╤═════════╤═════════════════╕ │"letters" │"total"│"uniques"│"words" │ ╞═════════════════════╪═══════╪═════════╪═════════════════╡ │["a","e","r","s","t"]│348010 │true │["aster","arets"]│ ├─────────────────────┼───────┼─────────┼─────────────────┤ │["a","e","p","r","s"]│331422 │true │["apres","apers"]│ ├─────────────────────┼───────┼─────────┼─────────────────┤ │["a","e","l","s","t"]│311796 │true │["salet","taels"]│ ├─────────────────────┼───────┼─────────┼─────────────────┤ │["a","e","l","p","s"]│247070 │true │["pales","lapse"]│ ├─────────────────────┼───────┼─────────┼─────────────────┤ │["d","e","o","r","s"]│243760 │true │["doser","deros"]│ ├─────────────────────┼───────┼─────────┼─────────────────┤ │["a","e","l","r","s"]│241614 │true │["arles","earls"]│ ├─────────────────────┼───────┼─────────┼─────────────────┤ │["a","c","e","r","s"]│229527 │true │["acres","acers"]│ ├─────────────────────┼───────┼─────────┼─────────────────┤ │["d","e","i","l","s"]│229100 │true │["delis","diels"]│ ├─────────────────────┼───────┼─────────┼─────────────────┤ │["a","i","r","s","t"]│214803 │true │["artis","astir"]│ ├─────────────────────┼───────┼─────────┼─────────────────┤ │["a","e","l","s","v"]│210438 │true │["avels","valse"]│ └─────────────────────┴───────┴─────────┴─────────────────┘
E.g. on position 1 we can use this for the follow frequency
MATCH (c1:Char)<-[:POS0]-(w)-[:POS1]->(c2:Char)
RETURN c1.char, c2.char, count(*) as freq
ORDER BY freq DESC LIMIT 5;
╒═════════╤═════════╤══════╕ │"c1.char"│"c2.char"│"freq"│ ╞═════════╪═════════╪══════╡ │"c" │"o" │220 │ ├─────────┼─────────┼──────┤ │"m" │"a" │193 │ ├─────────┼─────────┼──────┤ │"r" │"e" │187 │ ├─────────┼─────────┼──────┤ │"s" │"t" │183 │ ├─────────┼─────────┼──────┤ │"b" │"o" │175 │ └─────────┴─────────┴──────┘
Globally we can either do a big union and sum them up as we did on the stream. Or with the positions on the relationships, we can also generalize our statement from above.
MATCH (c1:Char)<-[p1]-(w)-[p2]->(c2:Char)
WHERE p1.idx +1 = p2.idx
RETURN c1.char, c2.char, count(*) as freq
ORDER BY freq DESC LIMIT 10;
╒═════════╤═════════╤══════╕ │"c1.char"│"c2.char"│"freq"│ ╞═════════╪═════════╪══════╡ │"e" │"s" │867 │ ├─────────┼─────────┼──────┤ │"e" │"r" │765 │ ├─────────┼─────────┼──────┤ │"a" │"r" │659 │ ├─────────┼─────────┼──────┤ │"r" │"e" │646 │ ├─────────┼─────────┼──────┤ │"e" │"d" │642 │ ├─────────┼─────────┼──────┤ │"r" │"a" │562 │ ├─────────┼─────────┼──────┤ │"i" │"n" │556 │ ├─────────┼─────────┼──────┤ │"a" │"n" │544 │ ├─────────┼─────────┼──────┤ │"a" │"l" │536 │ ├─────────┼─────────┼──────┤ │"a" │"s" │532 │ └─────────┴─────────┴──────┘
These can help us finding that missing letter, but I guess you have internalized it already from your language knowledge, except for some rare combinations.
One interesting question we wanted to answer is, which letters appear frequently in succession as double-letters.
MATCH (c:Char)<-[r1]-(w:Word)-[r2]->(c)
WHERE r1.idx = r2.idx +1
RETURN c.char, count(*) as freq
ORDER BY freq DESC LIMIT 8
As expected o
and e
lead the pack, l
was a bit surprising to me.
╒════════╤══════╕ │"c.char"│"freq"│ ╞════════╪══════╡ │"o" │328 │ ├────────┼──────┤ │"e" │308 │ ├────────┼──────┤ │"l" │209 │ ├────────┼──────┤ │"t" │114 │ ├────────┼──────┤ │"f" │111 │ ├────────┼──────┤ │"r" │108 │ ├────────┼──────┤ │"s" │105 │ ├────────┼──────┤ │"n" │91 │ └────────┴──────┘
For generic, repeats and re-occurrences one could generalize that to
For resolving our wordle puzzle (v1) we could use this Cypher using this time the relationships as structuring means.
MATCH (c:Char {char:'c'}),
(h:Char {char:'h'}),
(a:Char {char:'a'})
MATCH (wordle:Word)-[p0:POS0]->(c),
(wordle)-[p4:POS4]->(h),
(wordle)-[px]->(a)
WHERE not exists { (wordle)-[:POS1]->(a) }
AND not exists { (wordle)-[:POS2]->(:Char {char:'l'}) }
AND not exists { (wordle)-[:POS3]->(:Char {char:'i'}) }
RETURN *;
If we have more information, then we can extend the query by excluding letters or positions and get a smaller result set.
MATCH (c:Char {char:'c'}),
(h:Char {char:'h'}),
(a:Char {char:'a'})
MATCH (wordle:Word)-[p0:POS0]->(c),
(wordle)-[p4:POS4]->(h),
(wordle)-[px]->(a)
WHERE not exists { (wordle)-[:POS1]->(a) }
AND not exists { (wordle)-[:POS2]->(:Char {char:'l'}) }
AND not exists { (wordle)-[:POS3]->(:Char {char:'i'}) }
RETURN *;
We can even go so far as to implement a generic solver, that takes an structured input (we could also split an parse a marked up word as shown in the 2nd statement) and for each position, includes or excludes the letter (at position) as needed.
// Laser 🟩🟨⬜🟩⬜
WITH [{char:'l',match:true},{char:'a'},{char:'s',match:false},{char:'e',match:true},{char:'r',match:false}] as input
MATCH (w:Word)
CALL {
WITH w, input
UNWIND range(0,size(input)-1) as idx
WITH size(input) as total, idx, w, input[idx].match as m, input[idx].char as char
// for matching must match position
WHERE (m AND exists { (w)-[{idx:idx}]->(:Char {char:char}) })
// for non-matching must not contain
OR (m = false AND NOT exists { (w)-->(:Char {char:char}) })
// existing must contain somewhere
OR (m IS NULL AND exists { (w)-->(:Char {char:char}) })
// all conditions need to match for this word
WITH total, count(*) as count WHERE count = total
RETURN true AS found
}
RETURN w.name;
For 'laser' this returns 9 suggestions for the next word: label, laced, lacet, lacey, laded, laden, laked, lamed, lapel, lated, laten, latex, laved, lawed, laxed, layed, lazed, lutea, lycea.
We could now rank them by the information gain, i.e. which of those have high frequency new characters or reduce most of the uncertainty.
I leave that for you valued reader for now, so that we can have a look at (the much simpler) wordle game implemented using Cypher.
If you just want to play, run ./wordle-neo4j.sh
in your terminal, it sends a Cypher query to a wordle database in demo.neo4j.labs.com (username, password, database = wordle) to see if your guesses were right.
./wordle-neo4j.sh 12972
Guess 1: graph
⬜⬜⬜⬜⬜
Guess 2: bloom
⬜⬜🟩🟨⬜
Guess 3: scale
⬜⬜⬜⬜🟨
Guess 4: nodes
🟩🟨⬜🟨⬜
Guess 5: edges
🟨⬜⬜🟨⬜
Guess 6: neo4j
🟩🟩🟩🟩🟩
Guessed "neo4j" resulting in 🟩🟩🟩🟩🟩 in 6 rounds.
The statement that’s running is:
match (w:Word)
with w skip $word limit 1
// turn guess and word into lists
with split($guess,'') as guessed, split(w.name,'') as letters, w.name as name
// iterate and combine for each position
return reduce(res='', idx in range(0,size(letters)-1) | res +
// when correct
case when guessed[idx] = letters[idx] then '🟩'
// when contained
when name contains guessed[idx] then '🟨'
// otherwise
else '⬜' end) as res
Happy guessing and playing with the wordle data!
Let us know if you have more ideas / suggestions on how to have more fun with this or other datasets on AuraDB Free.