Skip to content

Latest commit

 

History

History
208 lines (174 loc) · 7.18 KB

README.md

File metadata and controls

208 lines (174 loc) · 7.18 KB

Cartolafc reports

Docker Apache Airflow Apache Hive

Architecture

This project aims to build and structure a data lake and data warehouse based on the data extracted from Cartola FC (a game about the Brazilian national football championship). To do it, a few services are used:

  • Airflow: scheduler and task orchestrator;
  • DataHub: metadata lineage and data catalog;
  • Hadoop: data repository;
  • Hive: database over Hadoop;
  • Superset: visualization tool.

How to start

The base of components are orchestrated with docker containers in the compose files at the root level of the project. Services like Hadoop datanode and namenode, Hive server and metastore, Airflow webserver and scheduler, and Superset server can be found there. So, to create all the project components at the same time, run the following command:

docker-compose up \
    --file docker-compose.airflow.yaml \
    --file docker-compose.hive.yaml \
    --file docker-compose.datahub.yaml \
    --file docker-compose.superset.yaml \
    up --detach

# or just quickstart
bash quickstart.sh

To setup only the operational group of containers you will need Airflow, Hive, Hadoop and DataHub containers. So, run the following:

docker-compose \
    --file docker-compose.airflow.yaml \
    --file docker-compose.hive.yaml \
    --file docker-compose.datahub.yaml \
    up --detach

# or just quickstart
bash quickstart.sh airflow

To setup only the querying group of containers (Superset, Hive and Hadoop) run the following:

docker-compose \
    --file docker-compose.superset.yaml \
    --file docker-compose.hive.yaml \
    up --detach

# or just quickstart
bash quickstart.sh superset

After a few moments of the start and healthcheck of services, the web interfaces will be found in:

Miscellaneous

The current data warehouse schema used on Hive is presented next. It mirrors the trusted layer build on Hadoop with external tables (this is the curated schema) to make some ELT to construct the managed tables in the trusted schema.

Database schema

The Airflow DAG includes file extraction from github API and transform/load groups for each table of the data warehouse schema. The DAG diagram is presented next:

Airflow DAG

Two dashboards are built into Superset consuming data from views, the first with an emphasis on teams and the second on players perfomance:

Teams dashboard

Players dashboard

Also, DataHub is configured to gather metadata from Superset, Hive and Airflow. The following images present the metadata extracted from Airflow and the data catalog built from Hive schemas:

Airflow metadata

Data catalog

Data catalog

Query examples

Some sample queries are stored in docs/sql. As an example, the best lineup of players based on scouts in all data since 2014:

WITH ranked_players AS (
    SELECT
        scouts.temporada,
        partidas.rodada,
        clubes.nome AS clube,
        atletas.apelido AS jogador,
        scouts.pontos AS pontos,
        scouts.pontosmedia AS media,
        posicoes.abreviacao AS posicao,
        RANK() OVER (
            PARTITION BY scouts.posicaoid
            ORDER BY scouts.pontos DESC
        ) AS scoutsrank
    FROM trusted.scouts
    JOIN trusted.clubes
        ON scouts.clubeid = clubes.clubeid
    JOIN trusted.partidas
        ON scouts.partidaid = partidas.partidaid
    JOIN trusted.atletas
        ON scouts.atletaid = atletas.atletaid
    JOIN trusted.posicoes
        ON scouts.posicaoid = posicoes.posicaoid
)
SELECT
    temporada,
    rodada,
    clube,
    jogador,
    pontos,
    media,
    posicao
FROM ranked_players
WHERE
    (posicao IN ('gol', 'tec') AND scoutsrank = 1) OR
    (posicao IN ('mei', 'ata') AND scoutsrank <= 3) OR
    (posicao IN ('zag', 'lat') AND scoutsrank <= 2)

-- Output:
-- temporada rodada clube              jogador              pontos media pos
-- 2017      12     Avaí - SC          Douglas Friedrich    33.0   21.83 gol
-- 2014      20     Sport - PE         Patric               27.1   4.74  lat
-- 2014      11     Internacional - RS Fabrício             24.8   5.0   lat
-- 2016      15     Grêmio - RS        Pedro Geromel        23.5   7.97  zag
-- 2014      19     Atlético - MG      Leonardo Silva       20.5   4.01  zag
-- 2019      10     Flamengo - RJ      Arrascaeta           37.7   12.62 mei
-- 2014      4      Botafogo - RJ      Daniel               31.9   12.93 mei
-- 2016      12     Atlético - MG      Cazares              27.6   10.02 mei
-- 2015      23     Cruzeiro - MG      Willian              32.7   4.34  ata
-- 2017      9      Flamengo - RJ      Paolo Guerrero       32.1   7.48  ata
-- 2020      31     Internacional - RS Yuri Alberto         31.5   5.85  ata
-- 2017      15     Sport - PE         Vanderlei Luxemburgo 10.92  5.34  tec

Best scouts by season since 2014:

WITH ranked_scouts AS (
    SELECT
        scouts.temporada,
        partidas.rodada,
        clubes.nome AS clube,
        atletas.apelido AS jogador,
        scouts.pontos,
        scouts.pontosmedia AS media,
        RANK() OVER (
            PARTITION BY scouts.temporada
            ORDER BY scouts.pontos DESC
        ) AS scoutrank
    FROM trusted.scouts AS scouts
    JOIN trusted.clubes AS clubes
        ON clubes.clubeid  = scouts.clubeid
    JOIN trusted.partidas AS partidas
        ON partidas.partidaid = scouts.partidaid
    JOIN trusted.atletas AS atletas
        ON atletas.atletaid = scouts.atletaid
)
SELECT
    temporada,
    rodada,
    clube,
    jogador,
    pontos,
    media
FROM ranked_scouts
WHERE ranked_scouts.scoutrank = 1
ORDER BY ranked_scouts.pontos DESC

-- Output:
-- temporada rodada clube              jogador           pontos media
-- 2019      10     Flamengo - RJ      Arrascaeta        37.7   12.62
-- 2017      12     Avaí - SC          Douglas Friedrich 33.0   21.83
-- 2015      23     Cruzeiro - MG      Willian           32.7   4.34
-- 2014      4      Botafogo - RJ      Daniel            31.9   12.93
-- 2020      31     Internacional - RS Yuri Alberto      31.5   5.85
-- 2018      9      Santos - SP        Rodrygo           31.1   6.31
-- 2016      8      Botafogo - RJ      Sassá             28.2   6.67