These JSON files can also be generated from a PostgreSQL database whose schema is defined in schema.sql
.
We have 3 top-level entities: Topics, Creators, Items. Reviews are are nested within Items and have links to either another Item or a Creator.
topis (name, hname, parent, rank) creators (name, hname, links) items (id, name, description, image, splinks, topics, creators, reviews, tags, difficulty, rating)
spnames are topic names designed to fix many shortcomings of traditional names:
- we want case-insensitive, unique, URL-safe, hashtag-compatible names as topic identifiers
- but case-sensitive and including special characters for human display. Eg: "AT & T" or "C++ 20"
- topics have a taxonomy, typically a hierarchy. For eg: comp.lang.python
- within a parent, we usually want to preserve display order which is not alphabetical. This can be done with names like "100-physics", "200-chemistry", "300-biology" etc.
Wikipedia handles it awkwardly (escape characters that are hard to type manually, no hierarchy):
Newsgroups are quite nice, but the taxonomy is not granular enough for us to build a universal knowledge graph. For eg: there is no name yet for quadratic equations. Also, everything other than Big-8 (comp, humanities, misc, news, rec, sci, soc, and talk) is shoved into the alt.* hierarchy (the historical reason for that is European networks did not want to pay for groups like religion or racism)
Taxonomy maintenance is not easy. When does a concept/subtopic deserve its own topic-name? What happens when topics get merged or retired? Should they be language-agnostic or English-first?
Should names optimize for brevity or readability? What if names get really long?
Should names always fully-specified like math.algebra.quadratics
or just quadratics
?
If sort order is included in the name, do we expect the users to write "100-science.200-chemistry"?
If sort order is NOT included in the name then it (and other attributes) needs to be looked up elsewhere.
What if a topic belongs under two separate parent topics? statistics.machine_learning or computer_science.machine_learning?
Can names be compatible with hashtags?
How to handle disambiguation (a name that belongs to 2+ things or a thing that has 2+ names?
How about nameless things or topics?
Should names be meaningful or randomized?
Are names properties (like DNS or ENS)? If not, who assigns them?
How can a taxonomy keep up with expanding knowledge?
splinks are HTTP URLs modified specially in a few ways:
- Link should indicate the learning media type (eg: book/course/video/game/event etc)
- Link should optionally include content-hash for integrity check and alternat fetch mechanism like IPFS
- Link should optionally include enable easy lookup for its snapshot on places like Wayback machine
- Link should optionally indicate its open-access status:
- Does it require login?
- Does it require payment?
- Does it require ads?
- Links should be backward-compatible and should navigate to primary location in the usual way
name
is used as primary key and therefore, must be unique and avoid uppercase and special characters other than hyphen and slash. Here are some examples: physics
, linear-algebra
, nations/india
, programming-languages/objective-c
.
hname
is used as human-readable name and can preserve uppercase. For eg: ADHD
.
parent
should be the name of the parent topic. This makes it possible to show a hierarchical view. If a topic does not have parent_name
, it would be at the top-level but if it doesn't have children topics of its own, it will be clubbed under a dummy top-level topic called Misc
.
rank
is an integer that's used for controlling the ordering in which topics are displayed.
A top-level table that lists well-known experts and their metadata like occupation, links etc. These are references from items and their reviews.
name
is unique, URL-safe.
hname
is case-sensitive, allows special characters and may not be unique.
id
should be a unique UUID. It is needed because there is no other natural primary key. This might also be useful later for defining collections of items.
description
can contain markdown with multiple lines.
links
is an array of strings. Each item in this array is format
, url
and optional identifiers separated by |
. For eg, one of the strings in links
might: summary|https://sivers.org/book/Decisive|ipfs:bafykbzaceaejt6z54qnwnl3ccvw2lsdfksbeuwuh4sv77ixj4c3ldeof2c5so?filename=Daniel%20Higginbotham%20-%20Clojure%20for%20the%20Brave%20and%20True-No%20Starch%20Press%20%282015%29.pdf
.
The use-case for optional identifiers are things like ipfs:
, doi:
, isbn:
or isbn13:
.
topics
is an array of topic names. These should exactly match topics
table's name
column.
creators
is an array of strings which are references to the creators
table's name
column. For eg: charles_darwin
.
difficulty
must be either null
or one of these: childlike
, beginner
, intermediate
, advanced
, research
.
rating
is an integer on 0-100 point scale. This is a curated value and should not be simply copied from external sources.
tags
can describe quality: visual
, entertaining
, challenging
, inspirational
, interactive
, oer
. oer
stands for "Open Educational Resource" and can be used if the linked content does not require payment or user login.
reviews
is an array of JSON objects that must match this schema as you can see in schema.sql
:
{
"type": "array",
"items": {
"type": "object",
"properties": {
"by_item": {"type": ["string", "null"]},
"by_creator": {"type": ["string", "null"]},
"rating": {"type": ["integer", "null"], "minimum": 0, "maximum": 100},
"blurb": {"type": ["string", "null"]},
"url": {"type": ["string", "null"]}
}
}
}