articlemasterextractor
This guide explains the functionality of the ArticleMasterExtractor, the extractors it incorporates, and its architecture.
The ArticleMasterExtractor bundles several tools into one pipeline module in order to extract metadata from raw articles. Based on the HTML response of the processed pipeline item, it extracts:
- author
- date the article was published
- article title
- article description
- article text
- top image
- used language
The ArticleMasterExtractor is the main class of the module. Based on the array passed at initialization, it initializes the extractors, a Cleaner and a Comparer.
This abstract class defines the basic structure of extractors. Each extractor has to implement AbstractExtractor, whose extract(NewscrawlerItem) method returns an ArticleCandidate holding the extractor's results.
If you want to implement your own extractor, simply add a new module to /extractor/extractors that includes an implementation of AbstractExtractor.
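As a rough illustration of that contract, the following sketch defines a hypothetical extractor. To stay self-contained it uses stand-in versions of AbstractExtractor and ArticleCandidate; the attribute names (`extractor`, `title`), the class `TitleTagExtractor`, and the `"html"` key on the item are assumptions made for this example and are not taken from the package.

```python
import re
from abc import ABC, abstractmethod


class ArticleCandidate:
    """Stand-in for the container of intermediate results (fields assumed)."""

    def __init__(self, extractor=None, title=None):
        self.extractor = extractor  # name of the extractor that produced this
        self.title = title


class AbstractExtractor(ABC):
    """Stand-in for the abstract base class every extractor implements."""

    @abstractmethod
    def extract(self, item):
        """Return an ArticleCandidate built from the pipeline item."""


class TitleTagExtractor(AbstractExtractor):
    """Hypothetical extractor that only reads the <title> tag."""

    name = "title_tag"

    def extract(self, item):
        # Assumption: the raw HTML is available under the key "html".
        match = re.search(r"<title[^>]*>(.*?)</title>", item["html"], re.I | re.S)
        title = match.group(1).strip() if match else None
        return ArticleCandidate(extractor=self.name, title=title)


candidate = TitleTagExtractor().extract({"html": "<html><title> Hello </title></html>"})
print(candidate.extractor, candidate.title)
```

A real extractor would fill whichever fields it can and leave the others empty, so that the Comparer can later pick the best value per field.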
The Cleaner provides preprocessing: it cleans the intermediate results by removing unnecessary whitespace, HTML tags, etc.
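The kind of cleaning described here could look roughly like the sketch below. The function name `clean_text` and the exact rules (regex-based tag removal, entity decoding, whitespace collapsing) are assumptions for illustration; the actual cleaner may apply different or additional steps.

```python
import html
import re


def clean_text(raw):
    """Strip HTML tags, decode entities and collapse runs of whitespace."""
    if raw is None:
        return None
    without_tags = re.sub(r"<[^>]+>", " ", raw)   # replace tags with a space
    decoded = html.unescape(without_tags)          # e.g. "&amp;" becomes "&"
    return re.sub(r"\s+", " ", decoded).strip()    # collapse whitespace


print(clean_text("  <p>Tom &amp; <b>Jerry</b></p>\n"))
```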
The Comparer compares the results of the extractors and creates an ArticleCandidate holding the best results.
- km4_extractor/
  - __init__.py (empty[1])
  - article_candidate.py (container for intermediate results)
  - article_extractor.py (manages all components)
  - cleaner.py (implements the cleaner used to preprocess extractor output)
  - comparer/
    - __init__.py (empty[1])
    - comparer.py
    - comparer_author.py
    - comparer_date.py
    - comparer_description.py
    - comparer_language.py
    - comparer_text.py
    - comparer_title.py
    - comparer_topimage.py
  - extractors/
    - __init__.py (empty[1])
    - abstract_extractor.py
    - date_extractor.py
    - langdetect__extractor.py
    - newspaper_extractor.py
    - readability_extractor.py

[1]: These files are empty but required because otherwise Python would not recognize these directories as packages.
- Module name: date_extractor
- Description: The DateExtractor uses Beautiful Soup to parse the response. It looks at the URL, JSON data, meta tags, and HTML tags in order to extract the publish date of an article.
- Extraction features:
  - publish date
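To illustrate just one of those sources, the sketch below tries to read a publish date from the URL. It is only an assumption about how such a lookup could work: the pattern (a `YYYY/MM/DD` or `YYYY-MM-DD` segment) and the function name `date_from_url` are hypothetical, and the JSON, meta tag and HTML tag lookups are not shown.

```python
import re
from datetime import date

# Assumed URL pattern: a four-digit year followed by month and day,
# separated by "/" or "-", and not embedded in a longer digit run.
URL_DATE = re.compile(r"(?<!\d)(\d{4})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def date_from_url(url):
    """Return a date found in the URL, or None if there is no valid one."""
    match = URL_DATE.search(url)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 or day 40
        return None


print(date_from_url("https://example.com/news/2016/11/03/some-article"))
```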
- Module name: lang_detect_extractor
- Description: LangDetect first parses the HTML response for meta tags in order to detect the language used. If this fails, it resorts to langdetect, a Python port of Google's language detection. As input it uses the largest connected body of text.
- Extraction features:
  - language
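The two-step strategy (meta tags first, statistical detection as a fallback) can be sketched as follows. This is a simplified assumption: it checks only a `content-language` meta tag using the standard library parser, and the fallback detector is passed in as a plain callable (`fallback`) so the example stays self-contained. In the real extractor that role is played by langdetect.

```python
from html.parser import HTMLParser


class LangMetaParser(HTMLParser):
    """Collects the first <meta http-equiv="content-language"> value."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        if self.language is None and tag == "meta" and http_equiv == "content-language":
            self.language = attributes.get("content")


def detect_language(html_text, body_text, fallback):
    """Prefer the meta tag; otherwise ask the fallback detector."""
    parser = LangMetaParser()
    parser.feed(html_text)
    if parser.language:
        return parser.language.strip().lower()
    return fallback(body_text) if body_text else None


page = '<html><head><meta http-equiv="Content-Language" content="de"></head></html>'
print(detect_language(page, "", lambda text: "unknown"))
```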
- Module name: newspaper_extractor
- Description: This extractor takes advantage of Newspaper, a tool authored and maintained by Lucas Ou-Yang. The main advantages of Newspaper are good image extraction and support for over 10 languages (English, Chinese, German, Arabic, ...). The extraction of the publish date, on the other hand, is very unreliable. Some information about Newspaper can be found on reddit.
- Extraction features:
  - authors
  - publish date
  - article title
  - article (meta) description
  - article text
  - top image
  - language
- Module name: readability_extractor
- Description: This extractor takes advantage of Readability, extracting features based on HTML tags. Readability shows good results for the extraction of descriptions and is licensed under the Apache License 2. For further information visit readability-lxml.
- Extraction features:
  - article title
  - article description
There is one comparer for each extracted field. The following sections briefly describe how the comparers work.
After excluding texts with fewer than 15 words, the comparer compares every text with every other text by forming the cross product of the extracted texts. It splits each text into words, which form a set. For each pair of extracted texts (sets) it then calculates a score: the number of elements in the symmetric difference is divided by twice the number of elements in the intersection (twice, because every word that exists in both texts is counted only once in the intersection), and the result is subtracted from one. This means that texts that are completely equal have a score of one. The more words there are in the symmetric difference, the less equal the texts are and the lower the score is. The comparer then checks which texts have the highest score and returns the one from newspaper or, failing that, the longer one.

### ComparerDescription
This comparer is primitive. If there is an extraction by newspaper, it is returned. If not, another extraction is returned if possible.
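The set-based score used for comparing article texts can be sketched as below. This is an illustration under stated assumptions, not the module's code: the function names, the dict input (extractor name mapped to text), and the handling of texts with no shared words are hypothetical. Because the score is symmetric, the sketch iterates over unordered pairs rather than the full cross product.

```python
import itertools


def similarity(text_a, text_b):
    """1 - |symmetric difference| / (2 * |intersection|) over word sets."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    common = set_a & set_b
    if not common:
        return float("-inf")  # assumption: no shared words is the worst score
    return 1 - len(set_a ^ set_b) / (2 * len(common))


def compare_texts(texts, min_words=15):
    """texts: dict mapping extractor name to extracted text."""
    candidates = {n: t for n, t in texts.items() if t and len(t.split()) >= min_words}
    if not candidates:
        return None
    if len(candidates) == 1:
        return next(iter(candidates.values()))
    # Find the pair of extractors whose texts agree most.
    _, first, second = max(
        ((similarity(candidates[a], candidates[b]), a, b)
         for a, b in itertools.combinations(candidates, 2)),
        key=lambda entry: entry[0],
    )
    if "newspaper" in (first, second):
        return candidates["newspaper"]
    return max(candidates[first], candidates[second], key=len)


print(similarity("a b c d", "a b e f"))
```

In the printed example, the symmetric difference has four words and the intersection has two, so the score is 1 - 4 / (2 * 2) = 0.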
In order to compare each title with every other title, the comparer creates a Cartesian product. If two titles match, the string is saved in a list. After comparing every title, it counts which one has the most matches and returns it. If no title is found this way, the comparer searches for the shortest title. If there are any matches between titles, the comparer starts its search in the list of matched titles. Otherwise it starts with the list of all extracted titles. This method searches for the shortest title because the page title, which usually contains a tag from the website (e.g. Title - BBC News), is often extracted instead of the title of the article. Since this happens much more often than a partially extracted title, this is a useful method.
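The title selection just described might be sketched as follows. The function name and the tie rule (falling back to the shortest title when no single title has the most matches) are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def compare_titles(titles):
    """Pick the most frequently matching title, else the shortest one."""
    titles = [t for t in titles if t]
    if not titles:
        return None
    # Record a title once for every pair of extractions that agree on it.
    matches = [a for a, b in combinations(titles, 2) if a == b]
    counts = Counter(matches).most_common()
    # Assumption: a winner needs strictly more matches than the runner-up.
    if counts and (len(counts) == 1 or counts[0][1] > counts[1][1]):
        return counts[0][0]
    # Otherwise search the matched titles if there are any, else all titles.
    pool = set(matches) if matches else titles
    return min(pool, key=len)


print(compare_titles(["Big story - BBC News", "Big story"]))
```

With no matching pair, the example falls back to the shortest title, which drops the site suffix.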
This comparer is primitive. If there is an extraction by newspaper, it is returned. If not, another extraction is returned if possible.
This comparer is primitive. If there is an extraction by DateExtractor, it is returned. If not, another extraction is returned if possible.
The comparer saves every extracted language in a list and counts how often each language was extracted. The language that occurs most frequently is returned. If two or more languages occur most frequently, the one occurring first in the list is returned. This "random" method is used because it is difficult to find the correct language through further criteria. If every language occurs equally often, the one extracted by Newspaper is returned. The reason is that Newspaper extracts the language tag contained in the HTML file, which makes this a very accurate method.
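A minimal sketch of this voting scheme, assuming the candidates arrive as (extractor name, language) tuples in extraction order; the function name and input shape are hypothetical.

```python
from collections import Counter


def compare_languages(candidates):
    """candidates: list of (extractor_name, language) in extraction order."""
    languages = [language for _, language in candidates if language]
    if not languages:
        return None
    counts = Counter(languages)
    newspaper = next(
        (language for name, language in candidates if name == "newspaper" and language),
        None,
    )
    # Every language occurs equally often: trust Newspaper if it has a value.
    if len(set(counts.values())) == 1 and newspaper is not None:
        return newspaper
    # most_common keeps first-seen order among languages with equal counts.
    return counts.most_common(1)[0][0]


print(compare_languages([("langdetect", "de"), ("newspaper", "en"), ("readability", "de")]))
```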
This comparer is primitive. If there is an extraction by newspaper, it is returned. If not, another extraction is returned if possible.
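The "primitive" comparers all follow the same pattern: prefer one extractor and otherwise fall back to whatever else is available. A sketch of that pattern is shown below. The function name, the tuple input, and the choice of the first remaining value as the fallback are assumptions, since the text does not say which other extraction is picked.

```python
def prefer_extractor(candidates, preferred):
    """candidates: list of (extractor_name, value); preferred: extractor name."""
    available = [(name, value) for name, value in candidates if value is not None]
    for name, value in available:
        if name == preferred:
            return value
    # Assumption: fall back to the first extraction that produced a value.
    return available[0][1] if available else None


print(prefer_extractor([("readability", "Jane Doe"), ("newspaper", "J. Doe")], "newspaper"))
```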