This is Python code meant to download and parse metadata from the Spanish government’s Plataforma de contratación del sector público. It produces parquet files that can be easily read in many programming languages.
This project was developed with nbdev, and hence each module stems from a Jupyter notebook that contains the code, along with tests and documentation. If you are interested in the inner workings of any module, you can check its corresponding notebook in the appropriate section of the project's GitHub Pages.
pip install git+https://github.com/nextprocurement/sproc@main
should do.
The software can be used as a library or as standalone scripts.
pip install -r requirements.txt
STEP 1: Opendata download {Gencat, Madrid, Zaragoza}
STEP 2: Integration with minors and outsiders
python3 integracion_opendata.py
--input_dir: 'path where the opendata and PLACE integrated files are saved'
--place_dir: 'path where the PLACE data (insiders, outsiders, minors) was previously downloaded'
--administration: 'all' (also valid: zaragoza, madrid, gencat)
The sproc_dl command is the workhorse of the library. It downloads all the data of a given kind into a parquet file, which can later be updated by invoking the same command. Running, e.g.,
sproc_dl outsiders
will download all the aggregated procurement data (excluding minor contracts) and write an outsiders.parquet file. The -o argument can be used to specify a directory other than the current one. Instead of outsiders, one can pass insiders or minors.
This is the highest-level command, and most likely the only one you need. The remaining ones (briefly explained below) provide access to finer granularity functionality.
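When sproc_dl has to be called from a Python script rather than from a shell, the invocation described above can be assembled programmatically. The following is only a minimal sketch: the helper name sproc_dl_command is hypothetical (it is not part of sproc), and it assumes that -o takes the target directory as the next argument.

```python
from pathlib import Path

# The three kinds of data mentioned above
VALID_KINDS = {"outsiders", "insiders", "minors"}


def sproc_dl_command(kind, output_dir=None):
    """Build the argument list for an `sproc_dl` call (hypothetical helper)."""
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    command = ["sproc_dl", kind]
    if output_dir is not None:
        # Assumption: `-o` is followed by the directory for the parquet file
        command += ["-o", str(Path(output_dir))]
    return command


# With sproc installed, the command could then be run with
# subprocess.run(sproc_dl_command("outsiders", "data"), check=True)
```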
For testing purposes, one can download the Outsiders contracts for 2018, either directly from the URL below or, if wget is available, by running
wget https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_2018.zip
Running
sproc_read_single_zip.py PlataformasAgregadasSinMenores_2018.zip 2018.parquet
outputs the file 2018.parquet (the name being given by the second argument), which contains a pd.DataFrame with all the 2018 metadata. It can be readily loaded (in Python, through Pandas’ pd.read_parquet). The columns of the pd.DataFrame stored inside are multiindexed, meaning one could get columns such as ('ContractFolderStatus', 'ContractFolderID') and ('ContractFolderStatus', 'ContractFolderStatusCode'). This is very convenient when visualizing the data (see the documentation for the hier module).
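To illustrate what multiindexed columns look like in practice, here is a small sketch using a synthetic pd.DataFrame rather than the real file. The row values are made up for illustration; only the column names come from the text above.

```python
import pandas as pd

# Synthetic stand-in for the real data; with the actual file one would use
# df = pd.read_parquet("2018.parquet")
columns = pd.MultiIndex.from_tuples(
    [
        ("ContractFolderStatus", "ContractFolderID"),
        ("ContractFolderStatus", "ContractFolderStatusCode"),
    ]
)
df = pd.DataFrame(
    [["0001/2018", "ADJ"], ["0002/2018", "RES"]],  # made-up values
    columns=columns,
)

# Select a single column by its full (level 0, level 1) tuple
folder_ids = df[("ContractFolderStatus", "ContractFolderID")]

# Select every column under a top-level key; only the second level remains
status = df["ContractFolderStatus"]
```

Selecting by the top-level key is what makes the hierarchy handy for exploring one group of related fields at a time.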
The columns of the above pd.DataFrame can be flattened to get, in the example above, ContractFolderStatus - ContractFolderID and ContractFolderStatus - ContractFolderStatusCode, respectively.
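The flattening just described can be reproduced in plain Pandas by joining the levels of each column with " - ". This is a sketch of the idea on synthetic data, not necessarily how sproc implements it, and it assumes every level of every column is a string.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [
        ("ContractFolderStatus", "ContractFolderID"),
        ("ContractFolderStatus", "ContractFolderStatusCode"),
    ]
)
df = pd.DataFrame([["0001/2018", "ADJ"]], columns=columns)  # made-up values

# Iterating over a MultiIndex yields one tuple of level names per column
flat = df.copy()
flat.columns = [" - ".join(levels) for levels in df.columns]
```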
Additionally, some renaming might be applied following the mapping in a YAML file:
sproc_rename_cols.py 2018.parquet -l samples/PLACE.yaml
This would yield a pd.DataFrame with plain columns in the file 2018_renamed.parquet. Renaming is carried out using the mapping in PLACE.yaml, which can be found in the samples directory of this repository. If you don’t provide a local file (-l) or a remote file (-r), a default naming scheme will be used if the name of the input file is outsiders.parquet, insiders.parquet, or minors.parquet.
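Conceptually, the renaming step amounts to applying a name-to-name mapping to the flattened columns. The sketch below uses a plain Python dict as a stand-in for the YAML file; the mapping contents and the target names (id, status) are hypothetical, and the actual layout of PLACE.yaml may well differ.

```python
import pandas as pd

# Hypothetical mapping from flattened names to short names;
# the real PLACE.yaml may be structured differently
mapping = {
    "ContractFolderStatus - ContractFolderID": "id",
    "ContractFolderStatus - ContractFolderStatusCode": "status",
}

df = pd.DataFrame(
    {
        "ContractFolderStatus - ContractFolderID": ["0001/2018"],  # made-up value
        "ContractFolderStatus - ContractFolderStatusCode": ["ADJ"],  # made-up value
    }
)

# Columns that are absent from the mapping would keep their original names
renamed = df.rename(columns=mapping)
```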
The sproc_read_zips.py command can be used to batch-process a sequence of files, e.g.,
sproc_read_zips.py contratosMenoresPerfilesContratantes_2018.zip contratosMenoresPerfilesContratantes_2019.zip
If no output file is specified (through the -o option), an out.parquet file (in which all the entries of all the zip files are stitched together) is produced.
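In Pandas terms, stitching the entries of several files together corresponds to a row-wise concatenation. The sketch below shows the idea on two tiny synthetic frames; the column names and values are invented for illustration, and sproc may handle details (such as column alignment) differently.

```python
import pandas as pd

# Synthetic stand-ins for the data parsed from two zip files
year_2018 = pd.DataFrame({"id": ["a", "b"], "amount": [10.0, 20.0]})
year_2019 = pd.DataFrame({"id": ["c"], "amount": [30.0]})

# Stack the rows and rebuild a clean 0..n-1 index
stitched = pd.concat([year_2018, year_2019], ignore_index=True)
```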
We can append new data to an existing pd.DataFrame. Let us, for instance, download data from January 2022,
wget https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_202201.zip
and extend the previous parquet file with data extracted from the newly downloaded zip,
sproc_extend_parquet_with_zip.py 2018.parquet PlataformasAgregadasSinMenores_202201.zip 2018_202201.parquet
The combined data is saved in 2018_202201.parquet.
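One plausible way to think about extending an existing pd.DataFrame is to concatenate the new rows and, when the same entry appears in both, keep the most recent version. Whether and how sproc deduplicates entries is not stated here, so the sketch below is an assumption-laden illustration on synthetic data with invented column names.

```python
import pandas as pd

# Synthetic stand-ins: existing data and a newer batch that updates entry "b"
existing = pd.DataFrame({"id": ["a", "b"], "status": ["open", "open"]})
new = pd.DataFrame({"id": ["b", "c"], "status": ["closed", "open"]})

combined = (
    pd.concat([existing, new], ignore_index=True)
    # Assumption: "id" identifies an entry and the later row supersedes the earlier
    .drop_duplicates(subset="id", keep="last")
    .reset_index(drop=True)
)
```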