Extracting data

An Infoboxer page is essentially a tree of nodes (like an AST or the DOM). For example, a paragraph with some bold text and a hyperlink inside is parsed into a structure like this:

# Parse paragraph
para = Infoboxer::Parser.paragraph("Some paragraph with '''bold with [[Link]]'''")
# => #<Paragraph: "Some paragraph with bold with Link">

# Show its structure:
puts para.to_tree
# <Paragraph>
#   Some paragraph with  <Text>
#   <Bold>
#     bold with  <Text>
#     Link <Wikilink(link: "Link")>

(You can use #to_tree on entire pages! Yet beware, the tree is HUGE.)
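
To see why a nested structure prints this way, here is a minimal plain-Ruby sketch of a node tree with an indented dump. `ToyNode` and `toy_tree` are hypothetical names made up for illustration; this is not Infoboxer's implementation or API, only the general idea:

```ruby
# Hypothetical toy node: a type name, optional text, and child nodes.
# It only imitates the idea of a parsed tree; it is not an Infoboxer class.
ToyNode = Struct.new(:type, :text, :children) do
  def initialize(type, text = nil, children = [])
    super(type, text, children)
  end
end

# Render the tree one node per line, indenting children by two spaces.
def toy_tree(node, depth = 0)
  label = node.text ? "#{node.text} <#{node.type}>" : "<#{node.type}>"
  lines = ["#{'  ' * depth}#{label}"]
  node.children.each { |child| lines.concat(toy_tree(child, depth + 1)) }
  lines
end

para = ToyNode.new('Paragraph', nil, [
  ToyNode.new('Text', 'Some paragraph with'),
  ToyNode.new('Bold', nil, [
    ToyNode.new('Text', 'bold with'),
    ToyNode.new('Wikilink', 'Link')
  ])
])

puts toy_tree(para)
# <Paragraph>
#   Some paragraph with <Text>
#   <Bold>
#     bold with <Text>
#     Link <Wikilink>
```

Each level of nesting adds one level of indentation, which is also why the dump of a whole page gets so long.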

So data extraction basically means navigating the tree, finding the nodes that contain the data you want, and extracting something from them.

Like this:

Infoboxer.wp.get('Sri Lanka').
  sections('Politics', 'Administrative divisions').
  tables.first.body.
  map { |tr| tr.cells.first.to_s }
# => ["Central", "Eastern", "North Central", "Northern", "North Western", "Sabaragamuwa", "Southern", "Uva", "Western"]
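
The chain above follows one pattern: walk down the tree, select nodes by kind, then pull a value out of each. Here is a minimal plain-Ruby sketch of that pattern, assuming a hypothetical `SketchNode` struct and `find_all` helper. Neither is part of Infoboxer; the province names come from the result above, and 'other column' is placeholder data:

```ruby
# Hypothetical node type for illustration; not an Infoboxer class.
SketchNode = Struct.new(:type, :text, :children) do
  def initialize(type, text = nil, children = [])
    super(type, text, children)
  end
end

# Depth-first search: collect every descendant whose type matches.
def find_all(node, type)
  node.children.flat_map do |child|
    (child.type == type ? [child] : []) + find_all(child, type)
  end
end

# Build a tiny "table" with one row per province, two cells per row.
rows = %w[Central Eastern Northern].map do |name|
  SketchNode.new(:row, nil, [
    SketchNode.new(:cell, name),
    SketchNode.new(:cell, 'other column') # placeholder data
  ])
end
page = SketchNode.new(:page, nil, [SketchNode.new(:table, nil, rows)])

# Navigate to the first table, take each row's first cell, extract its text.
table = find_all(page, :table).first
names = find_all(table, :row).map { |row| find_all(row, :cell).first.text }
p names
# => ["Central", "Eastern", "Northern"]
```

The real library wraps this kind of search in shortcuts such as `sections` and `tables`, so you rarely need to write the traversal yourself.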

(copyleft) 2015 Victor 'Zverok' Shepelev
