Created a set of visualizations and plots to illustrate some of the current World events (as of April 2022). Used images of 3 news sources (CNN, Fox, and Aljazeera) as data. Built dynamic maps of the specific locations of interest by performing OCR on images of news sources. Explored and analyzed the entities appearing in the news, as well as their relation. Compared and contrasted various news sources and identify the key similarities and differences between them.
- For Optical Character Recognition (OCR):
Install the following:
- pip install geopy
- pip install pytesseract
- sudo apt install tesseract-ocr
- sudo apt install libtesseract-dev
- pip install Pillow==9.0.0
Install the following libraries:
- import PIL
- from PIL import Image
- import pytesseract
- import os
- import re
- import string
- import nltk
- from collections import OrderedDict
- import json
- For Named-Entity Recognition (NER):
Install the following libraries:
- import spacy
- import matplotlib.pyplot as plt
- import json
- from itertools import islice
- nlp = spacy.load("en_core_web_sm")
- For Geoparsing/Building maps
Install the following:
- pip install geopy
- pip install plotly
- pip install geopandas
- pip install -U kaleido
Install the following libraries:
- import spacy
- from itertools import islice
- import plotly.graph_objects as go
- from geopy.geocoders import Nominatim
- from geopy.geocoders import Photon
- import kaleido
Dynamic World Map
The overall distribution of entities across the news sources is similar, but the frequencies and rankings of specific entities vary significantly. For example, the frequency of mentions of the NORP entity (countries) is ~1200, ~1750, and ~1400 for Fox, CNN, and Aljazeera, respectively. The ORG entity (organizations) has a large leader for each news source, but the leading ORG changes based on the source. Additionally, the OCR (optical character recognition) categorization of certain words into entities can be influenced by context. For example, the word "Biden" was both a PERSON and LOC entity in all three sources due to context in the sentence.