We propose a news recommendation approach based on article similarity and person co-occurrences found in the same articles. The potential business value lies in article-by-article reading: each recommendation leads the reader to the next article, which increases the time users actively spend on the service.
The direct benefit is that more reading time means, for example, more advertising shown.
The dataset is stored in the data folder. Dataset creation, the entity linking system, graph building, and the recommendation algorithm implementation can be found in ria_dataset.ipynb and social_network_project.ipynb.
Contributors:
- Artem Aroslankin
- Polina Smolnikova
The RIA News dataset contains the latest news from several categories:
- politics
- economy
- science
- culture
- incidents
Entity Linking system with DeepPavlov
Entity Linking component performs the following steps:
- The substring detected with NER (Russian) is fed to TfidfVectorizer, and the resulting sparse vector is converted to a dense one.
- The Faiss library is used to find the k nearest neighbours of the tf-idf vector in a matrix whose rows correspond to tf-idf vectors of words in entity titles.
- Entities are ranked by their number of relations in Wikidata (the number of outgoing edges of the node in the knowledge graph).
- BERT (Russian) is used to rank entities by the entity description and by the sentence that mentions the entity.
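The candidate retrieval in the first two steps above can be illustrated with a small, dependency-free sketch. It is not the DeepPavlov implementation: the character trigram features, the smoothed idf formula, and the brute-force cosine search (standing in for TfidfVectorizer and Faiss) are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of a lowercased, space-padded string (assumed feature type)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(titles, n=3):
    """Smoothed idf over entity titles: ln((1 + N) / (1 + df)) + 1."""
    df = Counter()
    for title in titles:
        df.update(set(char_ngrams(title, n)))
    total = len(titles)
    return {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf_vector(text, idf, n=3):
    """L2-normalised sparse tf-idf vector stored as a dict."""
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}


def nearest_titles(mention, titles, k=2, n=3):
    """Brute-force cosine search; a stand-in for the Faiss k-NN lookup."""
    idf = fit_idf(titles, n)
    query = tfidf_vector(mention, idf, n)
    scored = []
    for title in titles:
        title_vec = tfidf_vector(title, idf, n)
        similarity = sum(w * title_vec.get(g, 0.0) for g, w in query.items())
        scored.append((similarity, title))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:k]]
```

The retrieved candidates would then go through the two ranking steps (Wikidata relation counts and BERT), which this sketch does not cover.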
Our recommendation scenario is of the complementary type: for a given article, suggest a list of complementary articles.
In general, the algorithm consists of two parts:
- Find the top-K similar articles.
- Rerank the articles from the first part using information about article importance.
First, the articles are preprocessed:
- Remove the first sentence, which contains the date and place of the event
- Remove sentences that are not related to the news (for example, "only selected quotes in our telegram channel")
- Convert to lowercase
- Remove punctuation marks
- Tokenization, stopword removal and lemmatization
- Remove words shorter than 3 characters
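A minimal sketch of these preprocessing steps is shown below. It assumes a naive regex sentence splitter, a caller-supplied stopword set, and a pluggable `lemmatize` callable (identity by default, since the lemmatizer used in the notebooks is not stated here); `preprocess` and its parameters are hypothetical names.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def preprocess(text, stopwords, lemmatize=lambda w: w,
               boilerplate_markers=("telegram",), min_len=3):
    """Return a list of cleaned tokens for one article."""
    sentences = split_sentences(text)
    # Drop the first sentence (date and place of the event).
    sentences = sentences[1:]
    # Drop sentences unrelated to the news, detected by assumed marker words.
    sentences = [s for s in sentences
                 if not any(m in s.lower() for m in boilerplate_markers)]
    tokens = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Replace punctuation with spaces, then tokenize on whitespace.
        cleaned = re.sub(r"[^\w\s]", " ", lowered)
        for word in cleaned.split():
            if word in stopwords:
                continue
            lemma = lemmatize(word)
            if len(lemma) >= min_len:
                tokens.append(lemma)
    return tokens
```

For Russian text, a morphological analyzer would be passed in as `lemmatize`, and the stopword set and boilerplate markers would be in Russian.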
Next, a Word2Vec model is used to encode the articles, and Word Mover's Distance serves as a component of the similarity measure. The top-K articles by this similarity measure form the first-level recommendation list.
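To make the top-K selection concrete, here is a small sketch that ranks articles by a relaxed Word Mover's Distance. Note the substitution: the relaxed variant (each word moves to its nearest word in the other document) is only a lower bound on the exact distance and is used here to keep the example dependency-free; with a trained Word2Vec model, gensim's `KeyedVectors.wmdistance` computes the exact distance. The `vectors` mapping and the function names are illustrative assumptions.

```python
import math


def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed WMD between two token lists; a lower bound on the exact WMD."""
    a = [w for w in doc_a if w in vectors]
    b = [w for w in doc_b if w in vectors]
    if not a or not b:
        return float("inf")

    def one_way(src, dst):
        # Every source token (uniform weight) travels to its closest target word.
        return sum(min(math.dist(vectors[s], vectors[d]) for d in dst)
                   for s in src) / len(src)

    # The larger of the two one-way costs is the tighter bound.
    return max(one_way(a, b), one_way(b, a))


def top_k_similar(query_id, docs, vectors, k=3):
    """Ids of the k articles closest to the query article."""
    scored = [(relaxed_wmd(docs[query_id], tokens, vectors), doc_id)
              for doc_id, tokens in docs.items() if doc_id != query_id]
    scored.sort()
    return [doc_id for _, doc_id in scored[:k]]
```

Here `docs` maps an article id to its preprocessed tokens and `vectors` maps a word to its embedding.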
At this point, a top-K list of related articles is available. How can we predict which of these articles are significant enough to read and which are not?
Our proposed algorithm is the following:
- Build a graph of person co-occurrences in articles.
- Normalize the edge weights.
- Compute a PageRank score for each vertex.
- Compute an importance score for each article as the sum of the PageRank scores of all persons mentioned in the article.
- Rerank the top-K list by importance score.
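The reranking steps above can be sketched in plain Python as follows. This is an illustration under assumptions rather than the notebook code: weights are normalized so that each person's outgoing weights sum to 1 (the normalization scheme is not specified above), PageRank is computed by power iteration with a damping factor of 0.85, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(article_persons):
    """Weighted person graph; edge weight = number of shared articles, row-normalized."""
    weights = defaultdict(lambda: defaultdict(float))
    for persons in article_persons.values():
        unique = sorted(set(persons))
        for person in unique:
            _ = weights[person]  # register persons that never co-occur
        for u, v in combinations(unique, 2):
            weights[u][v] += 1.0
            weights[v][u] += 1.0
    graph = {}
    for u, neighbours in weights.items():
        total = sum(neighbours.values())
        graph[u] = {v: w / total for v, w in neighbours.items()} if total else {}
    return graph


def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank; isolated nodes spread their rank uniformly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        dangling = sum(rank[u] for u in nodes if not graph[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v, w in graph[u].items():
                new_rank[v] += damping * rank[u] * w
        rank = new_rank
    return rank


def rerank(top_k, article_persons, rank):
    """Order articles by the summed PageRank of the persons they mention."""
    def importance(article_id):
        return sum(rank.get(p, 0.0) for p in set(article_persons.get(article_id, [])))
    return sorted(top_k, key=importance, reverse=True)
```

Summing scores favours articles that mention many persons; whether that bias is desirable depends on the product, and an average would be an alternative design.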
Examples:
- Local politics news. Article currently opened: 'A State Duma deputy is obliged to be present during the plenary sessions of the chamber...' Top of the suggestion list: 'Russian President Vladimir Putin introduced the new head of the Emergencies Ministry, Alexander Kurenkov...'
- Space news. Article currently opened: 'Alexander Skvortsov, who has three space flights to his credit, is leaving the Russian cosmonaut corps...' Top of the suggestion list: 'The capsule of the Crew Dragon spacecraft with a crew of four splashed down in the Atlantic Ocean...'
There are no open-source event streams from real users of news products, so there is no ground truth against which to check the algorithm. Quality is therefore assessed by expert judgment (human inspection). The assessment procedure:
- Find the most interesting article to start with.
- Take the 3 articles from the top of the recommendation list.
- Use MAP@3 to evaluate.
The result is ambiguous: one assessor obtained high values, while another obtained extremely low ones.
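For reference, MAP@3 over a set of assessed recommendation lists can be computed as in the sketch below. It assumes binary relevance judgments and normalizes each average precision by the number of relevant items found in the top 3, since the total number of relevant articles is unknown in a manual assessment; other normalizations exist, and the function names are hypothetical.

```python
def average_precision_at_k(relevance, k=3):
    """AP@k for one ranked list of 0/1 relevance judgments."""
    hits = 0
    score = 0.0
    for position, relevant in enumerate(relevance[:k], start=1):
        if relevant:
            hits += 1
            score += hits / position  # precision at this position
    return score / hits if hits else 0.0


def mean_average_precision_at_k(judgments, k=3):
    """MAP@k: mean of AP@k over all assessed lists."""
    if not judgments:
        return 0.0
    return sum(average_precision_at_k(r, k) for r in judgments) / len(judgments)
```

For example, the judgments `[1, 0, 1]` give an AP@3 of (1/1 + 2/3) / 2, roughly 0.83, so a relevant article placed lower in the list contributes less.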
Another assessment procedure is based on article-by-article scrolling:
- Take 10 interesting articles as starting points.
- If there is an interesting article in the top 3 recommendations, move on to it.
- Repeat until there is no interesting article in the top 3.
- Count how many articles were scrolled.
Here, the average session length is about 8 articles. This is a good baseline to start from.
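The scrolling procedure can be expressed as a short simulation. The sketch below assumes that the starting article counts towards the session length and that already-read articles are not reopened; neither detail is specified above. `recommend` and `is_interesting` are hypothetical callables standing in for the recommender and the human assessor.

```python
def session_length(start, recommend, is_interesting, k=3, max_steps=100):
    """Number of articles read when following interesting top-k recommendations."""
    current = start
    visited = {start}
    length = 1  # the starting article is counted (assumption)
    while length < max_steps:
        candidates = [a for a in recommend(current)[:k]
                      if a not in visited and is_interesting(a)]
        if not candidates:
            break  # nothing interesting in the top k: the session ends
        current = candidates[0]
        visited.add(current)
        length += 1
    return length


def average_session_length(starts, recommend, is_interesting, k=3):
    """Mean session length over a list of starting articles."""
    lengths = [session_length(s, recommend, is_interesting, k) for s in starts]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

With real assessors, `is_interesting` is a person's decision at each step, so the function mainly documents how the count is taken.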