Extracting data
zverok edited this page Aug 18, 2015 · 8 revisions
A page in Infoboxer is a tree of nodes, much like an AST or the DOM. For example, a paragraph containing some bold text with a hyperlink inside it is parsed into a structure like this:
# Parse paragraph
para = Infoboxer::Parser.paragraph("Some paragraph with '''bold with [[Link]]'''")
# => #<Paragraph: "Some paragraph with bold with Link">
# Show its structure:
puts para.to_tree
# <Paragraph>
#   Some paragraph with <Text>
#   <Bold>
#     bold with <Text>
#     Link <Wikilink(link: "Link")>
(You can use #to_tree on entire pages! But beware, the tree is HUGE.)
So data extraction comes down to navigating the tree: finding the nodes that contain the data you want and extracting something from them.
Like this:
Infoboxer.wp.get('Sri Lanka').
  sections('Politics', 'Administrative divisions').
  tables.first.body.
  map { |tr| tr.cells.first.to_s }
# => ["Central", "Eastern", "North Central", "Northern", "North Western", "Sabaragamuwa", "Southern", "Uva", "Western"]
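To make the "tree of nodes" idea concrete, here is a minimal plain-Ruby sketch of the same navigate-and-extract pattern. It does not use Infoboxer at all: the `Node` struct, the `node` helper and the `lookup` method are hypothetical stand-ins written for this illustration, not the library's real classes or API. The tree mirrors the paragraph example above.

```ruby
# Hypothetical stand-in for a parsed node; NOT Infoboxer's real classes.
Node = Struct.new(:type, :text, :children) do
  # Depth-first search: collect every descendant of the wanted type.
  def lookup(wanted)
    children.flat_map do |child|
      (child.type == wanted ? [child] : []) + child.lookup(wanted)
    end
  end

  # A leaf returns its own text; a container joins its children's text.
  def to_s
    children.empty? ? text.to_s : children.map(&:to_s).join
  end
end

# Small helper (also hypothetical) so leaves get an empty children list.
def node(type, text = nil, children = [])
  Node.new(type, text, children)
end

# The same shape as the paragraph tree shown earlier.
para = node(:paragraph, nil, [
  node(:text, 'Some paragraph with '),
  node(:bold, nil, [
    node(:text, 'bold with '),
    node(:wikilink, 'Link')
  ])
])

# Navigate: find nodes of a given type, then extract their text.
links     = para.lookup(:wikilink).map(&:to_s)
bold_text = para.lookup(:bold).first.to_s

puts links.inspect  # prints ["Link"]
puts bold_text      # prints bold with Link
```

The Sri Lanka example follows the same two steps on a much larger tree: each call in the chain narrows down the set of nodes, and the final `map` turns the surviving nodes into plain strings.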
Further reading: