This package provides functionality to automate tasks on the USOSweb interface. It uses Selenium for navigating the interface and BeautifulSoup4 for parsing pages in the ScrapingTemplates.

- A good place to start is to clone the repository:

```bash
git clone https://github.com/mkochanowski/USOSweb-automated.git
```
- Inside the project's root directory, create a new virtual environment, then activate it:

```bash
python3 -m venv venv

# to activate on Linux:
source venv/bin/activate

# to activate on Windows:
.\venv\Scripts\activate
```

- Now you can safely install the required packages:

```bash
pip install -r requirements.txt
```
- For automating the browser, install ChromeDriver. You can skip this step if you already use a different driver, such as Ghost Driver or Edge Driver (see the quick smoke test after this list). Learn more about configuring web drivers in the documentation.
- Done! Time for some configuration.
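As promised above, here's a quick smoke test for your driver installation. This is a minimal sketch, assuming the `chromedriver` binary is on your `PATH` and using the Selenium 3 `webdriver.Chrome` API (the same API family the examples later in this document rely on):

```python
from selenium import webdriver

# Assumes chromedriver is on your PATH; otherwise pass its location
# via the executable_path argument (Selenium 3 API).
driver = webdriver.Chrome()
driver.get("https://usosweb.uni.wroc.pl/")
print(driver.title)  # should print the page title if everything works
driver.quit()
```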
Your app will not execute without a properly configured `.env` file. This project comes with a `.env.sample` to help you get started; you only need to make minor changes. The file's contents are:
```
USOS_SETTINGS_USERNAME=""
USOS_SETTINGS_PASSWORD=""
USOS_SCRAPER_ROOT_URL="https://usosweb.uni.wroc.pl/kontroler.php?_action="
USOS_SCRAPER_DESTINATIONS="dla_stud/studia/oceny/index dla_stud/studia/sprawdziany/index"
USOS_SCRAPER_MINIMUM_DELAY=4
USOS_SCRAPER_WEBDRIVER_HEADLESS=False
USOS_SCRAPER_DEBUG_MODE=True
USOS_NOTIFICATIONS_ENABLE=True
USOS_NOTIFICATIONS_STREAMS="Email WebPush SMS"
USOS_NOTIFICATIONS_CONFIG_FILE="notifications_config.json"
```
| Name of the setting | Description | Default value |
| --- | --- | --- |
| `USOS_SETTINGS_USERNAME`, `USOS_SETTINGS_PASSWORD` | Credentials needed for authentication on the USOSweb interface. | Empty strings |
| `USOS_SCRAPER_ROOT_URL` | The root URL of the USOSweb application. The default root URL includes a GET parameter `action`, because it is used throughout the interface. You might think of it as representing a structure similar to `http://usosweb.app/action/`. | The root URL for the University of Wroclaw |
| `USOS_SCRAPER_DESTINATIONS` | Predefined actions (destinations) that the scraper will visit after calling the `run()` method. | Final grades and course results |
| `USOS_SCRAPER_MINIMUM_DELAY` | Minimum delay between individual executions of the `app.py` main script. Do not abuse the services you're using, or you might get in trouble! | 4 minutes (don't go any lower) |
| `USOS_SCRAPER_WEBDRIVER_HEADLESS` | Whether to run the web driver in headless mode (in other words: silently, without the browser window appearing). You might want to disable it for debugging or developing new interactions. | `False` |
| `USOS_SCRAPER_DEBUG_MODE` | Whether to run the application in debug mode, which produces additional logging statements. Enable it only in your local development environment to avoid collecting unnecessary data. | `True` |
| `USOS_NOTIFICATIONS_ENABLE` | Whether to allow the dispatcher to send any notifications via the configured channels. | `True` |
| `USOS_NOTIFICATIONS_STREAMS` | Streams (channels) are user-configurable media for delivering notifications, such as Email, text messages, or WebPush notifications sent directly to your browser. | Email and other examples |
| `USOS_NOTIFICATIONS_CONFIG_FILE` | Path to the configuration file that provides variables such as API keys or special parameters to individual channels. Keeping config data in a separate source lets you design much more flexible streams. | A file provided with the project |
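For orientation, here is a sketch of how these settings could be read in Python. Whether this project actually uses python-dotenv is an assumption on my part, so treat the snippet as illustrative only:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # reads key=value pairs from .env into the environment

# USOS_SCRAPER_DESTINATIONS is a single space-separated string,
# so it has to be split into individual destinations.
destinations = os.getenv("USOS_SCRAPER_DESTINATIONS", "").split()
print(destinations)
# ['dla_stud/studia/oceny/index', 'dla_stud/studia/sprawdziany/index']
```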
Enter the credentials and the root URL of the USOSweb app you want to access, and you're good to go!

To execute the app, run:

```bash
python3 app.py
```
This script supports dispatching notifications via multiple channels, but Email is the one implemented by default. Initially, it comes with yagmail preinstalled, but you're free to replace it with a different library if needed.

To use yagmail you will need to configure OAuth2: see Configuring yagmail. You can place the `oauth2_creds.json` file in the root directory of your project, then update `notifications_config.json` with the recipient and sender email addresses.

When running on a server, remember to set `USOS_SCRAPER_DEBUG_MODE=False` and `USOS_SCRAPER_WEBDRIVER_HEADLESS=True` in the `.env` file.
Now that you've made sure the app is configured and fully working, let's deploy it to a server. There are different ways of doing that; the most basic one is to replicate the steps in the Getting started guide and copy the configuration files from your local machine.
Let's set up a script that will execute the app inside the virtual environment. It may look like this:

```bash
#!/bin/bash
cd /home/username/USOSweb-automated
source venv/bin/activate
python3 app.py
```

Replace the path with the directory you installed the script in and save the file as `cron.sh`. The last step is to add the script to the crontab.
Open the crontab by running:

```bash
crontab -e
```

And add the script:

```
*/10 * * * * /home/username/USOSweb-automated/cron.sh
```

This means the `cron.sh` script will be executed every 10 minutes. Congratulations! Your project is fully set up.
A `ScrapingTemplate` is a set of rules predefined for a specific page. Consider the following URL:

```
https://usosweb.uni.wroc.pl/kontroler.php?_action=dla_stud/studia/sprawdziany/pokaz&wez_id=33693
```

In this example, the `ROOT_URL` is `https://usosweb.uni.wroc.pl/kontroler.php?_action=` and the destination is `dla_stud/studia/sprawdziany/pokaz`.

The path of the template is going to be `templates/scraping/dla_stud-studia-sprawdziany-pokaz.py` (just replace the slashes with dashes).
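A quick illustration of that destination-to-filename mapping (plain Python, not taken from the project's source):

```python
destination = "dla_stud/studia/sprawdziany/pokaz"
template_path = "templates/scraping/" + destination.replace("/", "-") + ".py"
print(template_path)
# templates/scraping/dla_stud-studia-sprawdziany-pokaz.py
```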
This is what a minimal template looks like:
```python
import logging

from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


class ScrapingTemplate:
    """Scrapes the specific type of page by using a predefined
    set of actions."""

    def __init__(self, web_driver: object) -> None:
        self.driver = web_driver
        self.results = None

    def get_data(self) -> object:
        """Returns the scraped and parsed data."""
        self._parse(soup=self._soup())
        logger.debug(self.results)
        return self.results

    def _soup(self) -> object:
        """Generates a soup object out of a specific element
        provided by the web driver."""
        driver_html = self.driver.find_element_by_id("container")
        soup = BeautifulSoup(
            driver_html.get_attribute("innerHTML"),
            "html.parser")
        return soup

    def _parse(self, soup: object) -> None:
        """Initializes parsing of the innerHTML."""
        parser = Parser(soup=soup, web_driver=self.driver)
        self.results = {
            "module": __name__,
            "parsed_results": parser.get_parsed_results()
        }


class Parser:
    """Parses the provided HTML with BeautifulSoup."""

    def __init__(self, web_driver: object, soup: object) -> None:
        self.soup = soup
        self.driver = web_driver
        self.results = []

    def get_parsed_results(self) -> list:
        """Returns the results back to the ScrapingTemplate."""
        ...  # does parsing magic
        return self.results
```
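The body of `get_parsed_results()` is where the actual "parsing magic" lives and differs per template. Purely as an illustration (the CSS selector below is invented and will not match the real USOSweb markup), it could look like this:

```python
def get_parsed_results(self) -> list:
    """Returns the results back to the ScrapingTemplate."""
    # Hypothetical selector -- adjust it to the real page structure.
    for row in self.soup.select("table.grades tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if cells:
            self.results.append(cells)
    return self.results
```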
The only requirement for the `ScrapingTemplate` is to implement the `get_data()` method so that it returns a dictionary with a `module` key, such as:

```python
{
    "module": __name__,
    "new_destinations": [ ... ],
    "parsed_results": [ ... ]
}
```
Available keys:

- `new_destinations` - URLs to pass back to the scraper for building up the crawling queue.
- `parsed_results` - data saved in the form of a list of entities.

By default, the `Scraper` class uses `ChromeDriver` to automate the browser. You can add more drivers in `usos/web_driver.py`. Here is an example of a custom driver:
```python
def _driver_phantomjs(self) -> None:
    """Adds PhantomJS WebDriver support."""
    logging.info("Creating new PhantomJS Driver")
    dir_path = os.path.dirname(os.path.realpath(__file__))
    driver_path = dir_path + '/phantomjs'
    driver = webdriver.PhantomJS(executable_path=driver_path)
    driver.set_window_size(1120, 550)
    self._driver = driver
```
The only requirement is that the method sets the `self._driver` attribute to point to the instance of the driver. Now, suppose we want the `PhantomJS` driver to launch only in debug mode, and `ChromeDriver` on our production server:

```python
def get_instance(self) -> object:
    """Returns an instance of the selected web driver."""
    self.reset()
    if self.config["MY_DEBUG_MODE"]:
        self._driver_phantomjs()
    else:
        self._driver_chrome()
    return self._driver
```
The current implementation of an Entity will be replaced in the future by an independent data structure. Honestly, operating on dictionaries instead of a dedicated class feels a little weird for such an important element.
An `Entity` is a dictionary structure that contains two keys: `entity` and `items`. For example:
```json
{
    "entity": "course-results-tree",
    "items": [
        {
            "group": "28-INF-S-DOLI",
            "subgroup": "Logic for Computer Science",
            "hierarchy": "Exams",
            "item": "Final Exam",
            "values": ["85.0 pts", "Editor: John Doe"]
        },
        {
            "group": "28-INF-S-DOLI",
            "subgroup": "Logic for Computer Science",
            "hierarchy": "Class/Tests",
            "item": "Test no. 3",
            "values": ["15.0 pts", "Editor: Jane Doe"]
        }
    ]
}
```
The `course-results-tree` entity defines not only what it stores in the `items` key, but also how to process the data: the defined behaviour is to compare the supplied items with existing data to search for changes.
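To make that behaviour concrete, here is a minimal, hypothetical sketch of such a comparison; the actual logic lives in `usos.data.DataController` and is not reproduced here:

```python
def find_new_items(previous: list, current: list) -> list:
    """Returns items present in the current scrape but absent
    from the previously saved data (hypothetical helper)."""
    return [item for item in current if item not in previous]

old_items = [{"item": "Test no. 3", "values": ["15.0 pts"]}]
new_items = old_items + [{"item": "Final Exam", "values": ["85.0 pts"]}]
print(find_new_items(old_items, new_items))
# [{'item': 'Final Exam', 'values': ['85.0 pts']}]
```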
- If you want to introduce a new entity, start with a ScrapingTemplate. This is the very first step of an entity's lifecycle.
- Add custom behaviour for the specific entity you're implementing. Check and, if needed, expand the `_get_filename()` and `analyze()` methods of the `usos.data.DataController` class (see the hypothetical sketch after this list).
- Update your rendering templates to support this type of entity.
- Great! You now have a new type of entity that supports custom behaviour.
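As a purely hypothetical illustration of the second step (this is not the actual `usos.data.DataController` code), a per-entity filename lookup might be as simple as:

```python
def _get_filename(self, entity: str) -> str:
    """Maps an entity type to the file its data is stored in.
    Hypothetical sketch -- check the real usos.data.DataController."""
    filenames = {
        "course-results-tree": "course_results.json",
    }
    return filenames.get(entity, f"{entity}.json")
```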
This package comes with Jinja2 as the default templating engine. Notification templates live in the `templates/notifications/` directory. To learn more about writing templates in Jinja2, check out the documentation.
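If you haven't used Jinja2 before, here is a tiny self-contained example of rendering a template from a string; the real templates for this project live as files in `templates/notifications/` and are loaded as shown in the `_render()` example below:

```python
from jinja2 import Template

# An inline template, purely illustrative.
template = Template(
    "New results: {% for entity in data %}{{ entity['entity'] }} {% endfor %}")
print(template.render(data=[{"entity": "course-results-tree"}]))
# New results: course-results-tree
```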
Streams are defined in `usos/notifications.py`. To add your own channel, just subclass `Notification` and implement two private methods: `_render()` and `_send()`.

The `Dispatcher` class automatically sets the `self.data` and `self.config` attributes, which supply the results from the DataController as well as channel-specific variables from the `notifications_config.json` file.

The final template should be saved in the `self._rendered_template` attribute:
```python
from jinja2 import Environment, FileSystemLoader

def _render(self) -> None:
    env = Environment(loader=FileSystemLoader('templates/notifications'))
    template = env.get_template('WebRequest.html')
    self._rendered_template = template.render(data=self.data)
```
Your `_send()` method should return a boolean indicating whether the notification has been sent successfully:
```python
import requests

def _send(self) -> bool:
    data = {
        'API_KEY': self.config["API_KEY"],
        'MESSAGE': self._rendered_template
    }
    # API_URL is the endpoint of whatever service this channel posts to.
    request = requests.post(API_URL, data=data)
    return request.status_code == 200
```
Here's another example of a custom stream: `PaperMail`.
```python
class PaperMail(Notification):
    def _render(self) -> None:
        # Implicit string concatenation inside parentheses joins the parts.
        letter: str = (
            "Hey, {name}! "
            "{message} "
            "Take care, {author}.")
        letter = letter.format(
            name=self.data["recipient"],
            message=self.data["message"],
            author=self.data["sender"])
        self._rendered_template = letter

    def _send(self) -> bool:
        put_in_a_mailbox(self._rendered_template)
        return True
```
Now it can be used as a channel of its own:
```python
dispatcher = Dispatcher(
    channels="PaperMail",
    enable=True,
    config_file="mailbox_coordinates.json")

my_message = {
    "recipient": "Kate",
    "message": "I'm getting a divorce.",
    "sender": "Anthony"
}

dispatcher.send(my_message)
```
Visit https://docs.kochanow.ski/usos/api.html for more information.