Canonical T-Res resources #278

Closed
6 tasks done
thobson88 opened this issue Oct 15, 2024 · 5 comments

@thobson88
Collaborator

thobson88 commented Oct 15, 2024

The resources/ subdirectory is .gitignored due to the large size of the data files on which T-Res depends, but there is currently no shared storage location containing a canonical set of resources.

This makes deployment more difficult and also causes problems for the integration tests, which depend on the resources and cannot be run with consistent results unless a common set of resources is used.

Steps:

  • identify a canonical set of resources (on existing VMs/disks),
  • collect these resources into a central storage location (with appropriate compression),
  • write a script to automatically download & unpack them into the resources/ subdirectory,
  • separately include a subset of resources under the tests/ subdirectory, for unit test fixtures,
  • mark any tests depending on the full, canonical resources to be skipped by default when running pytest,
  • repeat these steps for the experiments/ subdirectory.
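
For the step about skipping tests that need the full resources, one possible gate is a small helper that reports which expected top-level entries are missing from resources/. The entry names are taken from the bundling script later in this thread. The helper name `missing_resources` and the idea of gating on file presence are assumptions for illustration, not necessarily the approach taken in the PR.

```python
from pathlib import Path

# Top-level entries that the bundling script in this thread adds to resources.zip.
EXPECTED_ENTRIES = (
    "publication_metadata.json",
    "deezymatch",
    "models",
    "rel_db",
    "wikidata",
    "wikipedia",
)


def missing_resources(root, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under `root`.

    Hypothetical helper: an empty list means the full resources appear
    to be present, so tests that depend on them could be run.
    """
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]
```

A test module could then decorate the relevant tests with `pytest.mark.skipif(bool(missing_resources("resources")), reason="full T-Res resources not available")`, so they are skipped on a checkout without the canonical data.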

thobson88 self-assigned this Oct 15, 2024

@thobson88
Collaborator Author

thobson88 commented Oct 15, 2024

Identify a canonical set of resources

Latest and most complete resources (afaics) are to be found at:

@thobson88
Collaborator Author

thobson88 commented Oct 17, 2024

Collect resources into a central storage location (with appropriate compression)

Script for bundling up the resources:

#!/bin/bash

# Script that creates a zip file of T-Res resources.

# Run this script from the resources/ directory and pass the output directory
# in which the file `resources.zip` will be created.

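# Refuse to run without an output directory (otherwise the zip would be written to /).
if [ -z "$1" ]; then
    echo "Usage: $0 <output-directory>" >&2
    exit 1
fi

zip "$1/resources.zip" publication_metadata.json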
zip -ur "$1/resources.zip" deezymatch/
zip -ur "$1/resources.zip" models/
zip -ur "$1/resources.zip" rel_db/
zip -ur "$1/resources.zip" wikidata/
zip -ur "$1/resources.zip" wikipedia/
echo "Created T-Res resources.zip at $1/resources.zip"

To be run from /home/lukehare/T-Res/resources on the deployed-toponym-resolution VM.

Note: this script takes several minutes to run because some subfolders contain many small files (e.g. deezymatch/candidate_vectors/wkdtalts_w2v_ocr/embeddings/). It may be worth tweaking so that those subfolders are not compressed but instead added with level 0 compression (i.e. stored as-is). This would save time when uncompressing, at the expense of a larger zip file.
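
A minimal sketch of the level 0 idea from the note above, written in Python with the standard `zipfile` module rather than as a change to the bash script. The function name `bundle` and its `store_only` parameter are hypothetical. Unlike the script, this version zips everything under the given root, so the output file should be written outside that directory.

```python
import zipfile
from pathlib import Path


def bundle(root, zip_path, store_only=()):
    """Zip all files under `root` into `zip_path`.

    Files whose archive path starts with one of the `store_only`
    prefixes are stored uncompressed (the equivalent of level 0);
    everything else is deflated. `zip_path` must lie outside `root`.
    """
    root = Path(root)
    # Normalise prefixes to "a/b/" so that "a/bc" does not match "a/b".
    prefixes = tuple(Path(p).as_posix().rstrip("/") + "/" for p in store_only)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            arcname = path.relative_to(root).as_posix()
            method = (
                zipfile.ZIP_STORED
                if arcname.startswith(prefixes)
                else zipfile.ZIP_DEFLATED
            )
            zf.write(path, arcname, compress_type=method)
```

For example, `bundle("resources", "/some/output/dir/resources.zip", store_only=["deezymatch/candidate_vectors"])` would store the many small embedding files without compressing them, where the output directory is a placeholder.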

Update: files now on Azure in tres storage account, resources container.

@thobson88
Collaborator Author

With these resources in place (on Azure, at least), I'm now making sure all tests pass on Rosie's dev_rw branch, in preparation for merging that PR.

@thobson88
Collaborator Author

Script to fetch resources from Azure

Added to the existing PR as resources/fetch_resources.py. Requires AzCopy and a valid SAS token.

This is a stop-gap solution intended for internal use, pending public release of the resource files #279.
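
For readers without access to the PR, a rough sketch of what a fetch-and-unpack step could look like. This is not the contents of resources/fetch_resources.py. The storage account and container names come from this thread, but the blob name `resources.zip`, the function names, and the single-archive layout are assumptions.

```python
import subprocess
import zipfile
from pathlib import Path

# Account and container names are from this thread; the blob name is assumed
# to match the output of the bundling script.
ACCOUNT = "tres"
CONTAINER = "resources"
BLOB = "resources.zip"


def azcopy_command(sas_token, dest):
    """Build an `azcopy copy` command for the bundle (hypothetical helper)."""
    token = sas_token.lstrip("?")
    url = f"https://{ACCOUNT}.blob.core.windows.net/{CONTAINER}/{BLOB}?{token}"
    return ["azcopy", "copy", url, str(dest)]


def fetch_and_unpack(sas_token, resources_dir="resources"):
    """Download the bundle with AzCopy, then extract it into `resources_dir`."""
    resources_dir = Path(resources_dir)
    resources_dir.mkdir(parents=True, exist_ok=True)
    archive = resources_dir / BLOB
    # Requires the azcopy executable on PATH and a valid SAS token.
    subprocess.run(azcopy_command(sas_token, archive), check=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(resources_dir)
    archive.unlink()  # remove the downloaded archive once unpacked
```

The SAS token would need to be supplied by the caller, for instance read from an environment variable, so that it is never committed to the repository.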

@thobson88
Copy link
Collaborator Author

thobson88 commented Oct 28, 2024

All tests now pass on a new checkout (of dev_rw) after running the script fetch_resources.py with the command:

python resources/fetch_resources.py
