Knowledge Graphs for Textual Heritage

This repository includes code and sources to create knowledge graphs for digitised textual heritage. The knowledge graphs generated here are based on the Heritage Text Ontology. Note that this repository does not introduce any methods to extract information from the source digital datasets, whose formats vary; instead, we use the dataframes generated by Information Extraction for the frances Project.

Get Started

Step 1: Install required packages (in the root folder):

pip install -r requirements.txt
python -m nltk.downloader all

Step 2: Prepare the source dataframes

Transfer the source dataframes (download here), generated by Information Extraction for the frances Project, into the following folders:

.
├── source_dataframes
│   ├── chapbooks
│   │   └── chapbooks_dataframe
│   └── eb
│       ├── nls_metadata_dataframe
│       ├── final_eb_1_dataframe
│       ├── .......

Step 3: Create Folders

Create a folder results to store the generated graphs, and a folder GraphGenerator/dataframe_with_uris for intermediate dataframe files (see the snippet after the tree). The project file structure will then be:

.
├── GraphGenerator
│   ├── dataframe_with_uris
│   ├── configs
│   │   ├── chapbook_nls_config.json
│   │   ├── eb_1_hq_config.json
│   │   ├── ......
│   ├── constructions
│   │   ├── multiple_source_eb_dataframe_to_rdf.py
│   │   ├── single_source_eb_dataframe_to_rdf.py
│   │   ├── ......
│   ├── enrichments
│   │   ├── summary.py
│   │   ├── sentiment_analysis.py
│   │   ├── ......
│   └── PythonScripts
│       ├── run_tasks.py
│       └── utils.py
├── hto.ttl
├── requirements.txt
├── results
└── source_dataframes
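
Both folders can be created from the repository root, for example with this short Python snippet (equivalent to mkdir -p from a shell):

from pathlib import Path

# Output folder for the generated graph files.
Path("results").mkdir(exist_ok=True)
# Folder for intermediate dataframe files produced during graph construction.
Path("GraphGenerator/dataframe_with_uris").mkdir(parents=True, exist_ok=True)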

Step 4: Create a task configuration JSON file

This file is passed as an argument when running run_tasks.py; it tells the program which tasks to execute and what the inputs and outputs of each task are. So far, we have implemented 11 task executors:

Construction tasks:

  1. single_source_eb_dataframe_to_rdf: generates a graph for the Encyclopaedia Britannica (EB) from a list of dataframes in which each term was extracted from a single source. For example, this task can handle a dataframe list containing the first edition dataframe extracted from Ash's work and the seventh edition from the National Library of Scotland (NLS). However, it should not be used to process a list containing the seventh edition from both NLS and the Knowledge Project, because it will ignore all terms in an edition that has already been added to the graph. That case should be handled by the next task, multiple_source_eb_dataframe_to_rdf.

  2. multiple_source_eb_dataframe_to_rdf: adds descriptions and extracted information for EB terms drawn from multiple sources.

  3. neuspell_corrected_eb_dataframe_to_rdf: adds descriptions and extracted information for EB terms corrected with NeuSpell.

  4. nls_dataframe_to_rdf: generates a graph for other NLS collections from a list of dataframes.

  5. add_page_permanent_url: adds page permanent URLs to the graph. This works for any NLS collection.

Enrichment tasks:

  1. summary: adds summaries of topic term descriptions to a graph.
  2. save_embedding: generates embeddings for terms from their highest-quality descriptions and saves the result in a dataframe (see the sketch after this list).
  3. sentiment_analysis: generates binary sentiment labels for terms and saves the result in a dataframe.
  4. term_record_linkage: links terms across editions by grouping them into concepts and saves the result in a dataframe.
  5. wikidata_linkage: adds Wikidata items to concepts created by the term_record_linkage task.
  6. dbpedia_linkage: adds DBpedia items to concepts created by the term_record_linkage task.
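
As an illustration of what the two dataframe-producing enrichment tasks output, here is a minimal sketch using off-the-shelf models; the model choices are assumptions, not necessarily what the repository's enrichment code uses.

# Illustration only: the model choices here are assumptions.
from sentence_transformers import SentenceTransformer
from transformers import pipeline

descriptions = ["ABACUS, an instrument for performing arithmetical computations."]

# save_embedding-style output: one dense vector per term description.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(descriptions)

# sentiment_analysis-style output: one binary label per term description.
classifier = pipeline("sentiment-analysis")
labels = classifier(descriptions)

print(vectors.shape, labels)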

More details about the tasks can be found here

The config file below tells the program to first generate a graph for the 7th edition of EB from the Knowledge Project, and then add summaries to that graph. More examples can be found here

{
  "tasks": [
    {
      "task_name": "single_source_eb_dataframe_to_rdf",
      "inputs": {
        "dataframes": [
          {"agent": "NCKP",
            "filename":"nckp_final_eb_7_dataframe_clean_Damon"}
        ],
        "results_filenames": {
          "dataframe_with_uris": "nckp_final_eb_7_dataframe_clean_Damon_with_uris",
          "graph": "hto_eb_7th_hq.ttl"
        }
      }
    },
    {
      "task_name": "summary",
      "inputs": {
        "results_filenames": {
          "graph": "hto_eb_7th_hq_summary.ttl"
        }
      }
    }
  ]
}
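
Conceptually, run_tasks.py reads such a file and runs each task in order, passing along its inputs. A minimal sketch of that loop, where the executors registry is hypothetical and stands in for the real dispatch logic:

import json

def run_config(config_path, executors):
    # `executors` maps a task_name to a callable taking the task's inputs;
    # this registry is hypothetical -- the real dispatch lives in run_tasks.py.
    with open(config_path) as f:
        config = json.load(f)
    for task in config["tasks"]:
        print("Running task:", task["task_name"])
        executors[task["task_name"]](task["inputs"])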

Step 5: Run the tasks

From the command line, run run_tasks.py:

python -m GraphGenerator.run_tasks --config_file=<path_to_your_config_file>
# example: python -m GraphGenerator.run_tasks --config_file=GraphGenerator/configs/eb_total_config.json

The final graph file can be found in the results folder.
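
Before uploading a graph anywhere, you can sanity-check a .ttl file directly with rdflib. A minimal sketch, using the output filename from the example config above; any real query would use class and property names from hto.ttl:

from rdflib import Graph

g = Graph()
g.parse("results/hto_eb_7th_hq.ttl", format="turtle")
print("Loaded", len(g), "triples")

# Count instances per class as a quick structural overview of the graph.
q = """
    SELECT ?type (COUNT(?s) AS ?n)
    WHERE { ?s a ?type }
    GROUP BY ?type
    ORDER BY DESC(?n)
"""
for row in g.query(q):
    print(row["type"], row["n"])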

We have generated some knowledge graphs.

Upload graphs to Fuseki Server

Jena Fuseki allows storing and querying RDF-based knowledge graphs. To set up a Fuseki server, see the official documentation. A quick way to start is the published Docker image.
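
Once a server is running (by default on port 3030) and a dataset exists, a graph file can be uploaded over HTTP via the SPARQL Graph Store Protocol. A minimal sketch with requests; the dataset name hto is an assumption and must match a dataset created on your server:

import requests

# POST a Turtle file into the default graph of an (assumed) "hto" dataset.
with open("results/hto_eb_7th_hq.ttl", "rb") as f:
    resp = requests.post(
        "http://localhost:3030/hto/data?default",
        data=f,
        headers={"Content-Type": "text/turtle"},
    )
resp.raise_for_status()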

You can explore the graph using this notebook.
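
For a quick check outside the notebook, the same server exposes a SPARQL endpoint at /<dataset>/sparql; again assuming a dataset named hto:

import requests

# Count all triples in the uploaded graph via the SPARQL endpoint.
query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"
resp = requests.get(
    "http://localhost:3030/hto/sparql",
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
print(resp.json()["results"]["bindings"][0]["triples"]["value"])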
