
How it works

zverok edited this page Apr 12, 2016 · 3 revisions

When you query some entity by name, Reality does the following:

  • fetches the page from the Wikipedia API;
  • parses this page with infoboxer into a navigable DOM-like structure;
  • also fetches the Wikidata item corresponding to that page, as a set of predicates.

For example, my home city Kharkiv is represented in Wikipedia this way and in Wikidata that way.
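To make the two lookups concrete, here is a minimal sketch that only builds the request URLs and performs no network calls. It uses the public MediaWiki `query` action and the Wikidata `wbgetentities` action; the helper names `wikipedia_page_url` and `wikidata_item_url` are hypothetical, and whether Reality issues exactly these requests is an assumption, since the real fetching is delegated to infoboxer and Reality's own Wikidata client.

```ruby
require 'uri'

# Hypothetical helper: URL for the raw wikitext of an English Wikipedia page.
def wikipedia_page_url(title)
  query = URI.encode_www_form(
    action: 'query', prop: 'revisions', rvprop: 'content',
    titles: title, format: 'json'
  )
  "https://en.wikipedia.org/w/api.php?#{query}"
end

# Hypothetical helper: URL for the Wikidata item linked to the same page title.
def wikidata_item_url(title)
  query = URI.encode_www_form(
    action: 'wbgetentities', sites: 'enwiki', titles: title, format: 'json'
  )
  "https://www.wikidata.org/w/api.php?#{query}"
end

puts wikipedia_page_url('Kharkiv')
puts wikidata_item_url('Kharkiv')
```

Both URLs are keyed by the same page title, which is what lets the two sources be merged into one entity.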

Next, there is a dictionary of Wikidata predicates (properties) that maps each of them to a method. So, any entity having the Wikidata predicate P625 ("coordinate location") exposes it through the #coord method, which returns an instance of Reality::Geo::Coord. Other properties can be parsed into entities, named measures, plain strings, lists of those objects, and so on.
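The idea of a predicate dictionary can be sketched in a few lines. This is not Reality's actual implementation: `PREDICATES`, `SketchEntity` and the `Coord` struct are hypothetical names, the input hash is a simplified stand-in for real Wikidata claims, and the coordinate values are illustrative only.

```ruby
# Hypothetical stand-in for Reality::Geo::Coord.
Coord = Struct.new(:lat, :lng)

# Hypothetical dictionary: Wikidata property id => [method name, parser].
PREDICATES = {
  'P625' => [:coord, ->(v) { Coord.new(v[:latitude], v[:longitude]) }],
  'P17'  => [:country, ->(v) { v.to_s }]
}.freeze

class SketchEntity
  def initialize(claims)
    @values = {}
    claims.each do |property, raw|
      name, parser = PREDICATES[property]
      next unless name # predicates missing from the dictionary are skipped

      @values[name] = parser.call(raw)
    end
  end

  def respond_to_missing?(name, include_private = false)
    @values.key?(name) || super
  end

  # Exposes every mapped predicate as a reader method.
  def method_missing(name, *args)
    return @values[name] if @values.key?(name) && args.empty?

    super
  end
end

# Simplified, illustrative claims (not real API output).
kharkiv = SketchEntity.new(
  'P625' => { latitude: 50.0, longitude: 36.23 },
  'P17' => 'Ukraine',
  'P9999' => 'not in the dictionary'
)
puts kharkiv.coord.lat
puts kharkiv.country
```

Keeping the mapping in a plain table means that supporting a new predicate is a one-line change rather than a new method definition.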

Then, there is a lot of useful data about objects that does not (yet?) exist in Wikidata's structured form. We take it from the parsed Wikipedia page. For example:

kharkiv = E('Kharkiv')
kharkiv.country # from Wikidata predicate
# => #<Reality::Entity?(Ukraine)>

kharkiv.area # not in Wikidata, from Wikipedia page infobox
# => #<Reality::Measure(350 km²)>

E('Bjork').albums # from the list in the Wikipedia page's "Discography" section
# => #<Reality::List[Björk (album)?, Debut (Björk album)?, Post (Björk album)?, Homogenic?, Vespertine?, Medúlla?, Volta (album)?, Biophilia (album)?, Vulnicura?]>

Unfortunately, Wikipedia infoboxes are not standardized, so we can never write a rule like "this field in the infobox should always be that method in the entity". For example, country infoboxes typically have the field "area_km2" for country area, city infoboxes typically name it "area_total_km2", and for continents it is "area", written in a different manner (using the {{Convert}} template).

So, for Wikipedia parsing, Reality defines a DSL for rules like "from this type of infobox, extract that type of data", "if there's a section named so-and-so, it goes to such-and-such method" and so on.
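A minimal sketch of what such a rule DSL could look like follows. It is an illustration only, not Reality's real DSL: the `InfoboxRules` module, its `infobox` and `extract` methods, and the infobox type names are assumptions, and the country area value is a sample input.

```ruby
# Hypothetical rule registry: infobox type => list of extraction rules.
module InfoboxRules
  RULES = Hash.new { |hash, key| hash[key] = [] }

  # Declares that, for the given infobox type, `field` feeds method `as`.
  def self.infobox(type, field, as:, &parser)
    RULES[type] << { field: field, method: as, parser: parser || ->(v) { v } }
  end

  # Applies every rule registered for the infobox type to its raw fields.
  def self.extract(type, fields)
    RULES[type].each_with_object({}) do |rule, result|
      raw = fields[rule[:field]]
      result[rule[:method]] = rule[:parser].call(raw) unless raw.nil?
    end
  end
end

# Differently named fields end up under the same :area key.
InfoboxRules.infobox('Infobox country', 'area_km2', as: :area) do |v|
  Float(v.delete(','))
end
InfoboxRules.infobox('Infobox settlement', 'area_total_km2', as: :area) do |v|
  Float(v.delete(','))
end

city = InfoboxRules.extract('Infobox settlement', 'area_total_km2' => '350')
country = InfoboxRules.extract('Infobox country', 'area_km2' => '603,628')
puts city[:area]
puts country[:area]
```

The per-type rules absorb the naming differences between infoboxes, so the calling code only ever asks for an area.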

Links to real code: