This project explores the application of Information Processing and Retrieval techniques to the study of Portuguese monuments. We aim to develop an efficient system for collecting, organizing, and retrieving relevant data about historical landmarks across Portugal. This work contributes to the digital preservation of cultural heritage and supports the creation of user-friendly tools for educational and touristic purposes.
The first milestone of this project focuses on data collection and processing. We began by deciding which websites to draw the information from, and selected two sources comprising three specific pages: Rota do Românico; Wikipedia - List of National Monuments; and Wikipedia - Categoria: Imóveis de interesse público em Portugal.
As we explored the websites, we found that each one had a different HTML structure and that, in some cases, pages within the same website differed in structure from one monument to the next. To address this, we developed three distinct web scrapers: one for Rota do Românico; one for Wikipedia - List of National Monuments; and one for Wikipedia - Categoria: Imóveis de interesse público em Portugal (List of Public Interest Real Estate). Each link provides a detailed explanation of how the data collection process and pipeline were implemented for that source.
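The idea of one scraper per source can be sketched as a small registry of source-specific parsers that all return records in the same shape. This is only an illustration, not the project's actual scrapers: the source keys, the tag and class names (`h2.monument-title`, `td.name`, `li`), and the helpers `TextByTag`, `extract`, and `scrape` are all hypothetical, and the real pages will almost certainly need different selectors. It uses only Python's standard `html.parser` module and works on HTML that has already been downloaded.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Collects the text inside every element with a given tag (and optional class)."""

    def __init__(self, tag, css_class=None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0   # > 0 while we are inside a matching element
        self.items = []  # one text entry per matching element

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth > 0:
            # Same tag nested inside a match: track it so the end tags balance.
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class is None or self.css_class in classes:
            self.depth = 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.items[-1] += data


def extract(html, tag, css_class=None):
    """Return the stripped, non-empty text of every matching element."""
    parser = TextByTag(tag, css_class)
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items if item.strip()]


# ASSUMPTION: these selectors are invented for illustration; each real source
# needs its own rules, which is why a separate scraper exists per source.
SCRAPERS = {
    "rota_do_romanico": lambda html: extract(html, "h2", "monument-title"),
    "wikipedia_national_monuments": lambda html: extract(html, "td", "name"),
    "wikipedia_public_interest": lambda html: extract(html, "li"),
}


def scrape(source, html):
    """Run the parser registered for `source` and return uniform records."""
    names = SCRAPERS[source](html)
    return [{"source": source, "name": name} for name in names]
```

Calling `scrape("rota_do_romanico", html)` on a downloaded page would then yield a list of `{"source": ..., "name": ...}` dictionaries. Keeping the output shape identical across sources means later processing steps do not need to know which website a record came from.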