zverok edited this page Aug 18, 2015 · 8 revisions

An Infoboxer page is essentially a tree of nodes (like an AST or the DOM). For example, a paragraph containing some bold text and a hyperlink is parsed into a structure like this:

# Parse paragraph
para = Infoboxer::Parser.paragraph("Some paragraph with '''bold with [[Link]]'''")
# => #<Paragraph: "Some paragraph with bold with Link"> 

# Show its structure:
puts para.to_tree
# <Paragraph>
#   Some paragraph with  <Text>
#   <Bold>
#     bold with  <Text>
#     Link <Wikilink(link: "Link")>

(You can call #to_tree on entire pages, too. But beware: the tree is HUGE.)
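To make the "tree of nodes" idea concrete, here is a minimal sketch in plain Ruby. It does not use Infoboxer at all: `ToyNode`, `ToyParagraph`, `ToyBold` and `ToyText` are hypothetical classes invented for illustration, and the link is simplified to a plain text leaf. It only shows how a nested structure can be dumped in an indented form similar to the one above.

```ruby
# Toy node classes: hypothetical, NOT Infoboxer's real implementation.
class ToyNode
  attr_reader :children

  def initialize(*children)
    @children = children
  end

  # Label used in the dump: class name without the "Toy" prefix.
  def label
    "<#{self.class.name.sub(/\AToy/, '')}>"
  end

  # Render this node, then every child one indentation level deeper.
  def to_tree(level = 0)
    own_line = ('  ' * level) + label
    ([own_line] + children.map { |child| child.to_tree(level + 1) }).join("\n")
  end
end

class ToyParagraph < ToyNode; end
class ToyBold < ToyNode; end

# Leaf node that holds plain text and has no children.
class ToyText < ToyNode
  def initialize(text)
    super()
    @text = text
  end

  def to_tree(level = 0)
    "#{'  ' * level}#{@text} <Text>"
  end
end

para = ToyParagraph.new(
  ToyText.new('Some paragraph with'),
  ToyBold.new(ToyText.new('bold with'), ToyText.new('Link'))
)
puts para.to_tree
# <Paragraph>
#   Some paragraph with <Text>
#   <Bold>
#     bold with <Text>
#     Link <Text>
```

Compound nodes print their own label and recurse into children, while leaves print their text; that is all an indented dump of this kind needs.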

So data extraction basically amounts to navigating the tree, finding the nodes that contain the data you want, and extracting something from them.

Like this:

Infoboxer.wp.get('Sri Lanka').
  sections('Politics', 'Administrative divisions').
  tables.first.body.
  map{|tr| tr.cells.first.to_s}
# => ["Central", "Eastern", "North Central", "Northern", "North Western", "Sabaragamuwa", "Southern", "Uva", "Western"]
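The same navigate-find-extract pattern can be sketched without Infoboxer. The snippet below is an assumption-laden toy, not the library's API: `TreeNode`, its `lookup` method and the `:page`/`:table`/`:row`/`:cell` type tags are hypothetical names, and the second cell in each row is placeholder data. It shows a depth-first search for nodes of a given type, followed by text extraction from the first cell of each row.

```ruby
# Hypothetical minimal node: a type tag, optional text, and child nodes.
class TreeNode
  attr_reader :type, :text, :children

  def initialize(type, text = nil, children = [])
    @type = type
    @text = text
    @children = children
  end

  # Depth-first search: collect every descendant of the wanted type.
  def lookup(wanted)
    children.flat_map do |child|
      (child.type == wanted ? [child] : []) + child.lookup(wanted)
    end
  end

  # Plain text: the node's own text followed by its descendants' text.
  def to_s
    [text, *children.map(&:to_s)].compact.join
  end
end

# Toy page with one table of two rows (second cells are placeholders).
page = TreeNode.new(:page, nil, [
  TreeNode.new(:table, nil, [
    TreeNode.new(:row, nil, [TreeNode.new(:cell, 'Central'), TreeNode.new(:cell, 'x')]),
    TreeNode.new(:row, nil, [TreeNode.new(:cell, 'Eastern'), TreeNode.new(:cell, 'y')])
  ])
])

# Navigate to the first table, then take the first cell of every row.
names = page.lookup(:table).first.lookup(:row).map { |row| row.lookup(:cell).first.to_s }
p names
# => ["Central", "Eastern"]
```

Each step narrows the set of nodes (page, then table, then rows, then cells), and only the last step converts a node to plain text.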

Further reading: