In 1096 ocw workflow #91

Draft
wants to merge 2 commits into
base: main
@jonavellecuerdo jonavellecuerdo commented Jan 23, 2025

Purpose and background context

This PR creates the OpenCourseWare DSC workflow.

This also introduces the following changes:

How can a reviewer manually see the effects of these changes?

A. Review the added unit tests.
Note: The only custom method defined for OpenCourseWare without a unit test is the item_metadata_iter method. See method B for testing with MinIO server.


B. Optional but highly recommended (especially for future development).
Run OpenCourseWare commands using local MinIO server.

Prerequisite

  1. Follow instructions in README: Running a Local MinIO Server.
    Note: As of this writing, the root password set for the local MinIO server must be at least 8 characters long. I did not add this requirement to the README because it may change if/when we download updated versions of the MinIO Docker image.

  2. Mock out the local MinIO server with test zip files.
    Note: I did these steps via the WebUI.

    • Create paths (i.e., prefixes) in the dsc bucket:
      • dsc/opencourseware/batch-00/
        • Upload two (2) sample zip files with metadata
          It is not important to mock other files as the bitstream for OpenCourseWare deposits is the zip file itself.
          • abc123.zip: Zip file containing a single data.json.
          • def456.zip: Zip file containing a single data.json.
      • dsc/opencourseware/batch-01/
        • Upload one (1) sample zip file without metadata.
  3. Add the following environment variables in your .env file.

    AWS_ENDPOINT_URL=http://localhost:9000/
    AWS_ACCESS_KEY_ID=<local-minio-username>
    AWS_SECRET_ACCESS_KEY=<local-minio-password>
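
For step 2, the sample zip files can also be generated with a short script instead of being built by hand. This is only a sketch: the helper name `write_sample_zip`, the placeholder member `README.txt`, and the keys written to data.json are assumptions for illustration and may not match the real OpenCourseWare data.json structure. The resulting files still need to be uploaded to the prefixes listed above.

```python
import json
import tempfile
import zipfile
from pathlib import Path
from typing import Optional


def write_sample_zip(path: Path, metadata: Optional[dict]) -> Path:
    """Write a sample zip file; include data.json only when metadata is given."""
    with zipfile.ZipFile(path, "w") as zf:
        if metadata is not None:
            zf.writestr("data.json", json.dumps(metadata))
        else:
            # Hypothetical placeholder member so the archive is not empty.
            zf.writestr("README.txt", "no metadata here")
    return path


out_dir = Path(tempfile.mkdtemp())

# For batch-00: a zip file containing a single data.json (keys are assumed).
with_metadata = write_sample_zip(
    out_dir / "abc123.zip",
    {"course_title": "Matrix Calculus for Machine Learning and Beyond"},
)

# For batch-01: a zip file without data.json.
without_metadata = write_sample_zip(out_dir / "ghi789.zip", None)
```
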
    

OpenCourseWare commands
Launch Python in your terminal: pipenv run python

  1. Check item_metadata_iter() result for batch-00.
from dsc.workflows import OpenCourseWare
opencourseware_workflow_instance = OpenCourseWare(collection_handle="blah", batch_id="batch-00", email_recipients="me@gmail.com")
item_metadata_iter = opencourseware_workflow_instance.item_metadata_iter()
list(item_metadata_iter)

You should see the following output:

[
    {
        "item_identifier": "abc123",
        "course_title": "Matrix Calculus for Machine Learning and Beyond",
        "course_description": "We all know that calculus courses.",
        "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
        "instructors": "Edelman, Alan|Johnson, Steven G.",
    },
    {
        "item_identifier": "def456",
        "course_title": "Burgers and Beyond",
        "course_description": "Investigating the paranormal, one burger at a time.",
        "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
        "instructors": "Burger, Cheese E.",
    },
]
  2. Check item_metadata_iter() result for batch-01.
from dsc.workflows import OpenCourseWare
opencourseware_workflow_instance = OpenCourseWare(collection_handle="blah", batch_id="batch-01", email_recipients="me@gmail.com")
item_metadata_iter = opencourseware_workflow_instance.item_metadata_iter()
list(item_metadata_iter)

You should see the following output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jcuerdo/Documents/repos/dspace-submission-composer/dsc/workflows/opencourseware.py", line 60, in item_metadata_iter
    **self._extract_metadata_from_zip_file(zip_file, item_identifier),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jcuerdo/Documents/repos/dspace-submission-composer/dsc/workflows/opencourseware.py", line 76, in _extract_metadata_from_zip_file
    raise FileNotFoundError(
FileNotFoundError: The required file 'data.json' file was not found in the zip file: s3://dsc/opencourseware/batch-01/ghi789.zip

A FileNotFoundError is raised if any zip file is missing metadata (i.e., the data.json file).
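
The check behind this error can be sketched locally without S3 or MinIO. The code below is not the PR's `_extract_metadata_from_zip_file`; the function names `extract_metadata` and `make_zip`, the in-memory zip files, and the error wording are assumptions used only to illustrate the idea of failing fast when data.json is absent.

```python
import io
import json
import zipfile


def extract_metadata(zip_bytes: bytes, source: str) -> dict:
    """Return the parsed data.json from a zip archive, or raise if it is absent."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        if "data.json" not in zf.namelist():
            raise FileNotFoundError(
                f"The required file 'data.json' was not found in the zip file: {source}"
            )
        with zf.open("data.json") as f:
            return json.load(f)


def make_zip(members: dict[str, str]) -> bytes:
    """Build an in-memory zip file from a mapping of member name to content."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buffer.getvalue()


good = make_zip({"data.json": json.dumps({"course_title": "Burgers and Beyond"})})
bad = make_zip({"notes.txt": "no metadata"})

metadata = extract_metadata(good, "abc123.zip")

try:
    extract_metadata(bad, "ghi789.zip")
    error_message = None
except FileNotFoundError as exc:
    error_message = str(exc)
```
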

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this Jan 23, 2025
@jonavellecuerdo jonavellecuerdo force-pushed the IN-1096-ocw-workflow branch 2 times, most recently from 9f54f9f to 5097b39 Compare January 23, 2025 18:09
coveralls commented Jan 23, 2025

Pull Request Test Coverage Report for Build 12935641949

Details

  • 44 of 54 (81.48%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.8%) to 95.397%

Changes Missing Coverage        | Covered Lines | Changed/Added Lines | %
dsc/workflows/opencourseware.py | 42            | 52                  | 80.77%

Totals Coverage Status
Change from base Build 12799339996: -1.8%
Covered Lines: 456
Relevant Lines: 478

💛 - Coveralls

Comment on lines +32 to +27
"dc.contributor.author": {
    "source_field_name": "instructor",
    "language": "en_US",
    "delimiter": "|"
}
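
To show what the "delimiter" entry is for, here is a rough sketch of how such a mapping entry could be applied to a source metadata record. This is not DSC's actual implementation: `apply_mapping`, the shape of the returned dictionary, and the sample `source_metadata` record are all hypothetical.

```python
# Mapping entry copied from the diff above.
mapping = {
    "dc.contributor.author": {
        "source_field_name": "instructor",
        "language": "en_US",
        "delimiter": "|",
    }
}

# Hypothetical source record; the key matches "source_field_name" above.
source_metadata = {"instructor": "Edelman, Alan|Johnson, Steven G."}


def apply_mapping(source: dict, mapping: dict) -> dict:
    """Split delimited source values into one entry per DSpace field value."""
    result = {}
    for dspace_field, rule in mapping.items():
        raw = source.get(rule["source_field_name"])
        if raw is None:
            continue
        delimiter = rule.get("delimiter")
        values = raw.split(delimiter) if delimiter else [raw]
        result[dspace_field] = [
            {"value": value, "language": rule.get("language")} for value in values
        ]
    return result


authors = apply_mapping(source_metadata, mapping)["dc.contributor.author"]
```
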

Comment on lines +132 to +138
def _construct_instructor_name(instructor: dict[str, str]) -> str:
    """Given a dictionary of name fields, derive instructor name."""
    if not (last_name := instructor.get("last_name")) or not (
        first_name := instructor.get("first_name")
    ):
        return ""
    return f"{last_name}, {first_name} {instructor.get("middle_initial", "")}".strip()
@jonavellecuerdo jonavellecuerdo Jan 23, 2025


While it is plausible that all the metadata in data.json will always be formatted as needed (i.e., all instructor name fields provided), it would be a good idea to check in with stakeholders (IN-1156) on the "minimum required instructor name fields" to construct an instructor name.

The sample mapping file we received, ocw_json_to_dspace_mapping.xlsx, indicates that instructor names must be formatted as:

<last_name>, <first_name> <middle_initial>

The code above will return an empty string if either the last_name or first_name is missing; it allows for missing middle_initial values.
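
A standalone sketch of that behavior, for reviewers who want to try a few inputs. The function below is a rewritten copy for illustration (it avoids the nested quotes in the f-string so it also runs on Python versions before 3.12), and the sample instructor dictionaries are made up. Filtering out empty names before joining with "|" is an assumption, not something confirmed by the diff.

```python
def construct_instructor_name(instructor: dict[str, str]) -> str:
    """Return '<last_name>, <first_name> <middle_initial>' or '' if incomplete."""
    last_name = instructor.get("last_name")
    first_name = instructor.get("first_name")
    if not last_name or not first_name:
        return ""
    middle_initial = instructor.get("middle_initial", "")
    # strip() removes the trailing space left when middle_initial is missing.
    return f"{last_name}, {first_name} {middle_initial}".strip()


instructors = [
    {"first_name": "Alan", "last_name": "Edelman"},
    {"first_name": "Steven", "last_name": "Johnson", "middle_initial": "G."},
    {"last_name": "Burger"},  # missing first_name, so the name is empty
]

names = [construct_instructor_name(instructor) for instructor in instructors]

# Assumption: empty names are dropped before joining with the "|" delimiter.
joined = "|".join(name for name in names if name)
```
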

Why these changes are being introduced:
* Support OpenCourseWare deposits requested by Technical Services staff.

How this addresses that need:
* Define custom methods to extract metadata from 'data.json'
* Define custom 'get_bitstream_s3_uris' to filter to zip files

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1096