Filter data #181

Merged
merged 13 commits into from
Nov 13, 2024
Conversation

@maehr (Member) commented Nov 10, 2024

Pull request

Proposed changes

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Enhancement (non-breaking change which enhances functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the CONTRIBUTING document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced data extraction and processing from the Omeka API, including support for CSV and JSON formats.
    • Improved pagination handling for item retrieval.
    • New functions for URL validation and combined value extraction.
    • Additional metadata fields for export and facets generation.
    • New upload step for JSON file in deployment workflow.
  • Improvements

    • Streamlined metadata configuration for better clarity and expanded export fields.
    • Modernized JavaScript functions for improved performance and readability.
    • Simplified data access and improved URL generation logic in JSON exports.
    • Enhanced artifact handling in Jekyll deployment workflow.
  • Bug Fixes

    • Enhanced error handling in data fetching processes and URL generation logic.
  • Chores

    • New scripts added for cleaning and populating metadata.
    • Updated workflow for clearer artifact management during Jekyll site deployment.
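The release notes mention a new helper for URL validation. A minimal sketch of what such a function typically looks like — the actual `is_valid_url` in `process_data.py` may differ in detail:

```python
from urllib.parse import urlparse

def is_valid_url(url):
    """Return True only for absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Anything that is not an absolute http(s) URL with a host is rejected, which is usually what you want before emitting reference links into exported metadata.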

changeset-bot bot commented Nov 10, 2024

⚠️ No Changeset found

Latest commit: d602e0c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

coderabbitai bot commented Nov 10, 2024

Walkthrough

The changes in this pull request primarily enhance the functionality of data extraction and processing from an Omeka API within the .github/workflows/process_data.py file. Key updates include new environment variable configurations, improved modularity and error handling, and the introduction of new helper functions. Additionally, the _config.yml, _data/theme.yml, and assets/data/metadata.json files have been modified to reflect changes in metadata handling and structure. The package.json file now includes new scripts for cleaning and populating data, streamlining the overall workflow.

Changes

File Change Summary
.github/workflows/process_data.py - Added ITEM_SET_ID, CSV_PATH, and JSON_PATH variables.
- Introduced is_valid_url and get_paginated_items functions.
- Updated get_items_from_collection and get_media functions.
- Consolidated extraction functions into extract_property and extract_combined_values_csv.
- Added save_to_files function.
_config.yml - Updated metadata to sgb-metadata-csv.
- Added metadata-json field with value sgb-metadata-json.
_data/theme.yml - Updated metadata-export-fields to include additional fields.
- Updated metadata-facets-fields to include publisher and license.
_includes/js/browse-js.html - Refactored filterItems and makeCard functions to modern JavaScript syntax.
- Removed shuffle function, replaced with sorting method.
assets/data/metadata.json - Updated items variable assignment to use site.metadata-json.
- Changed reference_url to use absolute_url.
package.json - Added clean and populate scripts.
- Updated @shopify/prettier-plugin-liquid dependency version.
.github/workflows/jekyll.yml - Updated artifact names for clarity.
- Added upload step for sgb-metadata-json.json.
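The improved pagination handling noted above boils down to a fetch-until-short-page loop. In this sketch the HTTP call is injected as a `fetch_page` callable so the loop can be shown and exercised without network access; the real `get_paginated_items` presumably issues `requests` calls against the Omeka API instead:

```python
def collect_all_pages(fetch_page, per_page=50):
    """Accumulate items page by page until a short (or empty) page signals the end.

    `fetch_page(page, per_page)` returns the list of items for one page.
    This is an illustrative stand-in, not the PR's actual function.
    """
    items = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        items.extend(batch)
        if len(batch) < per_page:  # a short page means there is nothing left
            break
        page += 1
    return items
```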

Possibly related PRs

  • Subpage Navigation via Buttons #170: The changes in this PR enhance navigation and introduce new buttons for various functionalities, which may relate to the improved data extraction and processing in the main PR, particularly in how data is presented and accessed through the UI.
  • ci: add jekyll build to ci #172: This PR adds a Jekyll build step to the CI process, which could indirectly relate to the main PR's changes in data processing and output formats, ensuring that the enhancements in .github/workflows/process_data.py are properly integrated and built in the deployment pipeline.

Suggested reviewers

  • koilebeit
  • mtwente

🐰 In the meadow, data flows,
With paths for CSVs and JSONs,
Functions dance, errors flee,
A rabbit's joy, oh can't you see?
With every script, our work's refined,
In code we hop, our dreams aligned! 🌼



@coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (5)
_includes/js/browse-js.html (1)

87-109: LGTM with performance optimization suggestions.

The modernized filterItems function is well-structured and more maintainable. Consider these optional performance optimizations:

  1. Cache Object.values(item).join(" ").toUpperCase() for large objects
  2. Use toLowerCase() instead of toUpperCase() as it's generally faster in most JavaScript engines
 function filterItems(arr, q) {
     // Show loading icon
     loadingIcon.classList.remove("d-none");
 
     // Trim and convert query to uppercase
-    const query = q.trim().toUpperCase();
+    const query = q.trim().toLowerCase();
 
     // Filter items based on query
     const filteredItems = query === "" ? arr : arr.filter(item => {
-        const itemValues = Object.values(item).join(" ").toUpperCase();
+        // Cache the joined values for better performance with large objects
+        if (!item._searchText) {
+            item._searchText = Object.values(item).join(" ").toLowerCase();
+        }
+        return item._searchText.includes(query);
     });
.github/workflows/process_data.py (4)

62-64: Handle pagination errors more robustly in get_paginated_items

If an error occurs during pagination, the current implementation logs the error and breaks the loop, potentially leaving data incomplete.

Consider re-raising the exception or implementing a retry mechanism:

 except requests.exceptions.RequestException as err:
     logging.error(f"Error fetching items: {err}")
+    raise

This ensures that the caller is aware of the incomplete data fetch.
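Going one step further than re-raising, a retry wrapper keeps transient network hiccups from aborting the run while still surfacing persistent failures. This is a hypothetical helper generalizing the suggestion, not code from the PR:

```python
import logging
import time

def with_retries(func, retries=3, backoff=0.0):
    """Call func(); on exception, retry up to `retries` total attempts.

    The last exception is re-raised so callers never proceed with
    silently incomplete data.
    """
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception as err:
            logging.error("Attempt %d/%d failed: %s", attempt, retries, err)
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)  # linear backoff between attempts
```

In `get_paginated_items` the page fetch would then be wrapped as `with_retries(lambda: session.get(url, params=params))`.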


153-176: Replace magic numbers with descriptive constants

Using hardcoded property IDs reduces code readability and maintainability.

Define constants with descriptive names for property IDs:

DC_IDENTIFIER_ID = 10
DC_TITLE_ID = 1
DC_DESCRIPTION_ID = 4
# Add other property IDs accordingly

Update your code to use these constants:

 return {
-    "objectid": extract_property(item.get("dcterms:identifier", []), 10),
+    "objectid": extract_property(item.get("dcterms:identifier", []), DC_IDENTIFIER_ID),
-    "title": extract_property(item.get("dcterms:title", []), 1),
+    "title": extract_property(item.get("dcterms:title", []), DC_TITLE_ID),
 # Update other property IDs similarly

This improves code clarity and ease of maintenance.


91-99: Optimize extract_property to handle multiple values

The current implementation of extract_property returns only the first matching property. If there are multiple values for a property, they are ignored.

Modify the function to return all matching values:

 def extract_property(props, prop_id, as_uri=False):
     """Extracts property values or URIs from properties based on property ID."""
-    for prop in props:
+    results = []
+    for prop in props:
         if prop.get("property_id") == prop_id:
             if as_uri:
                 result = f"[{prop.get('o:label', '')}]({prop.get('@id', '')})"
             else:
                 result = prop.get("@value", "")
+            results.append(result)
-            return result
-    return ""
+    return results if results else ""

This allows you to capture all values associated with a property ID.
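Applied as a standalone function — renamed `extract_property_all` here to avoid clashing with the original, and returning a plain list rather than the mixed list-or-empty-string of the diff above:

```python
def extract_property_all(props, prop_id, as_uri=False):
    """Collect every value (or markdown link) whose property_id matches.

    Returns [] when nothing matches, so callers always get a list.
    """
    results = []
    for prop in props:
        if prop.get("property_id") != prop_id:
            continue
        if as_uri:
            results.append(f"[{prop.get('o:label', '')}]({prop.get('@id', '')})")
        else:
            results.append(prop.get("@value", ""))
    return results
```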


9-10: Review TODO comments with external URLs

The TODO comments on lines 9 and 10 contain external URLs, which may not be accessible or relevant in all environments.

Consider adding context or additional information to these TODOs to clarify the intended actions, or move them to documentation if they require ongoing reference.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 6fd1525 and 9827137.

⛔ Files ignored due to path filters (2)
  • _data/config-metadata.csv is excluded by !**/*.csv
  • assets/data/metadata.csv is excluded by !**/*.csv
📒 Files selected for processing (6)
  • .github/workflows/process_data.py (5 hunks)
  • _config.yml (1 hunks)
  • _data/theme.yml (1 hunks)
  • _includes/js/browse-js.html (1 hunks)
  • assets/data/metadata.json (1 hunks)
  • package.json (1 hunks)
🧰 Additional context used
🪛 Gitleaks
.github/workflows/process_data.py

22-22: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)


23-23: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🪛 Biome
assets/data/metadata.json

Biome reports dozens of JSON parse errors on lines 4–15 (unexpected character %, Minus must be followed by a digit, String values must be double quoted, Expected a property, End of file expected, and similar). These are false positives: the file is a Liquid template that renders to JSON at build time, so its {% … %} and {{ … }} tags cannot be parsed as literal JSON by a static linter.

(parse)

🔇 Additional comments (9)
package.json (1)

6-6: Verify uv dependency requirements.

The script uses uv run to execute Python code, but uv is not listed in the dependencies.

Let's check if uv is documented elsewhere:

Consider:

  1. Adding documentation about the uv requirement
  2. Or using standard python command instead if uv is not strictly necessary
✅ Verification successful

Let me gather more information to verify if uv is required or documented elsewhere in the codebase.


Let me check the content of pyproject.toml to see if uv is documented there.


uv is properly configured in the GitHub Actions workflow

The GitHub Actions workflow file shows that uv is set up using the official astral-sh/setup-uv@v3 action before running the Python script. This is the correct way to handle the dependency, as it's a CI/CD tool requirement rather than a project dependency. No changes are needed.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if uv is documented as a requirement
# Test: Look for uv in documentation files
rg -i "uv|python" "README.md" || echo "No matches in README.md"

Length of output: 86


Script:

#!/bin/bash
# Check for uv in other documentation files and configuration
fd -e md -e txt -e yaml -e yml -x rg -i "uv|python" {} \;

# Check the Python script that's being executed
cat .github/workflows/process_data.py || echo "Script not found"

# Look for any requirements.txt or similar Python dependency files
fd -e txt -e toml "requirements|pyproject" 

Length of output: 10345


Script:

#!/bin/bash
# Check pyproject.toml for dependencies and Python requirements
cat pyproject.toml

# Also check for any GitHub workflow files that might document the uv setup
fd -e yml -e yaml . .github/workflows/ -x cat {}

Length of output: 3471

assets/data/metadata.json (3)

6-15: LGTM! Well-structured JSON generation.

The JSON structure is clean and maintainable:

  • Proper conditional field inclusion
  • Clean formatting
  • Valid JSON array structure
🧰 Tools
🪛 Biome: the same Liquid-template parse errors reported above (false positives).


12-12: LGTM! Improved parentid check and URL handling.

The changes improve the code in two ways:

  1. Using absolute_url ensures fully qualified URLs
  2. The parentid.size > 0 check is more robust than the previous implicit boolean check
#!/bin/bash
# Description: Verify URL generation in different environments
# Test: Check if absolute_url filter is consistently used across the codebase
rg -l "absolute_url" --type html --type md --type liquid

4-5: Verify data integrity of site.metadata-json.

The change from filtered site.metadata to direct site.metadata-json access simplifies the code. However, we should ensure that site.metadata-json maintains the same data quality as the previous filtered approach.

✅ Verification successful

Let me gather more information about the configuration and data structure.


Based on the _config.yml content, I can verify that the change from site.metadata to site.metadata-json is properly configured. Both fields are defined in the collection settings:

metadata: sgb-metadata-csv
metadata-json: sgb-metadata-json

Let me check the actual metadata files to ensure they exist and verify their structure.


Let me try a different approach to verify the metadata files and their usage in the codebase.


Let me check the content of the metadata.json file and related templates to verify the data structure and usage.


Based on the evidence gathered, I can now provide a final response:

Change to site.metadata-json is properly configured and safe

The change from site.metadata to site.metadata-json in assets/data/metadata.json is properly configured and supported by the system:

  1. The _config.yml explicitly defines both metadata sources:

    metadata: sgb-metadata-csv
    metadata-json: sgb-metadata-json
  2. The metadata.json template correctly uses the new format:

    {%- assign items = site.data[site.metadata-json] -%}
  3. The change only affects JSON data export while maintaining the original CSV metadata source for other features, as evidenced by the numerous templates still using site.metadata for core functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the structure and content of metadata configuration
# Test 1: Check if metadata-json is properly configured
rg -l "metadata-json:" _config.yml

# Test 2: Compare the structure with previous metadata field
rg -l "metadata:" _config.yml

Length of output: 232


Script:

#!/bin/bash
# Check the content of _config.yml for metadata configuration
cat _config.yml

# Look for any JSON files that might contain metadata
fd -e json -x cat {} \;

# Search for metadata-related configuration in other yml files
fd -e yml -e yaml -x cat {} \;

Length of output: 78924


Script:

#!/bin/bash
# Check for metadata files
fd -e csv -e json "sgb-metadata" _data/

# If files exist, peek at their content
find _data -name "sgb-metadata*" -exec head -n 5 {} \;

Length of output: 99


Script:

#!/bin/bash
# Search for references to metadata-json in liquid templates
rg -l "site.metadata-json" --type html --type md --type liquid

# Search for references to the old metadata format
rg -l "site.metadata" --type html --type md --type liquid

# Look for any JSON generation files
fd -e json.liquid -e json

Length of output: 558


Script:

#!/bin/bash
# Check the metadata.json template
cat assets/data/metadata.json

# Look for liquid templates that handle metadata
fd -e html -e liquid | xargs grep -l "site.metadata"

# Check if there are any other relevant configuration files
fd -g "*config*.{yml,yaml,json}" -x cat {}

Length of output: 4699

🧰 Tools
🪛 Biome: the same Liquid-template parse errors reported above (false positives).

_config.yml (1)

39-40: LGTM! Clear and consistent metadata configuration.

The changes effectively support both CSV and JSON metadata formats with clear, self-documenting field names.

_data/theme.yml (2)

71-71: LGTM! Enhanced metadata export fields.

The updated metadata export fields provide a more comprehensive set of information. The new organization groups related fields together (e.g., objectid/parentid, title/description/relation) and includes important additions like publisher and extent. This aligns well with the PR's goal of improving metadata handling.


73-73: LGTM! Improved faceted search capabilities.

The addition of publisher, type, language, and license to the facets fields enhances the filtering capabilities, allowing users to find content more effectively. This change aligns with the PR's goal of improving metadata presentation and user experience.

.github/workflows/process_data.py (2)

234-235: ⚠️ Potential issue

Remove duplicate call to get_media

The function get_media is called twice consecutively with the same item ID on lines 234 and 235. This redundancy may lead to unnecessary API calls and degrade performance.

You can remove the duplicate call to streamline the code:

 # Process media data for each item
-media_data = get_media(item.get("o:id", ""))
 media_data = get_media(item["o:id"])

Ensure that item["o:id"] always exists or handle the potential KeyError if it doesn't.

Likely invalid or redundant comment.


245-260: Ensure data consistency when saving records

When converting records to a DataFrame, there might be inconsistencies if some records have missing keys.

Please verify that all records have consistent keys before saving. Run the following script to identify any discrepancies:

This script helps ensure data integrity in your CSV and JSON outputs.
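The verification script itself did not survive in this capture. A sketch of what a key-consistency check over the extracted records might look like — the names are illustrative, not from the PR:

```python
def find_key_discrepancies(records):
    """Report keys that are missing from some records.

    Returns a dict mapping each key to the number of records lacking it;
    keys present in every record are omitted.
    """
    all_keys = set()
    for record in records:
        all_keys.update(record)
    missing = {}
    for key in sorted(all_keys):
        count = sum(1 for record in records if key not in record)
        if count:
            missing[key] = count
    return missing
```

An empty result means every record shares the same schema and the DataFrame conversion will not introduce unexpected NaN columns.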

Resolved review threads: package.json, _config.yml, _includes/js/browse-js.html (outdated), .github/workflows/process_data.py (two outdated threads).
@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (2)
.github/workflows/process_data.py (2)

Line range hint 93-106: Security: Use HTML escape function for semicolon replacement

Using direct string replacement for HTML escaping is unsafe. Consider using proper HTML escaping functions:

+import html
 def extract_combined_values(props):
     """Combines text values and URIs from properties into a single list."""
     values = [
-        prop.get("@value", "").replace(";", "&#59")
+        html.escape(prop.get("@value", ""), quote=False)
         for prop in props
         if "@value" in prop
     ]
     uris = [
-        f"[{prop.get('o:label', '').replace(';', '&#59')}]({prop.get('@id', '').replace(';', '&#59')})"
+        f"[{html.escape(prop.get('o:label', ''), quote=False)}]({html.escape(prop.get('@id', ''), quote=False)})"
         for prop in props
         if "@id" in prop
     ]
     return values + uris

216-234: Consider implementing checkpoint/resume capability

For large datasets, the script should be able to resume from where it left off in case of failures. Consider:

  1. Implementing checkpointing after processing each item
  2. Adding a resume capability that can skip already processed items

This would make the script more resilient to failures and more efficient when re-running:

def save_checkpoint(processed_ids, checkpoint_file="checkpoint.json"):
    with open(checkpoint_file, "w") as f:
        json.dump(processed_ids, f)

def load_checkpoint(checkpoint_file="checkpoint.json"):
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            return set(json.load(f))
    return set()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9827137 and c35f448.

📒 Files selected for processing (1)
  • .github/workflows/process_data.py (6 hunks)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
.github/workflows/jekyll.yml (1)

43-47: Consider adding file existence checks.

While the JSON artifact upload aligns with the PR's objectives for enhanced metadata handling, it would be beneficial to add error handling.

Consider wrapping the upload step with a condition:

- name: Check if JSON file exists
  id: check_json
  run: |
    if [ -f "_data/sgb-metadata-json.json" ]; then
      echo "exists=true" >> $GITHUB_OUTPUT
    else
      echo "exists=false" >> $GITHUB_OUTPUT
    fi

- name: Upload sgb-metadata-json.json
  if: steps.check_json.outputs.exists == 'true'
  uses: actions/upload-artifact@v4
  with:
    name: sgb-metadata-json
    path: _data/sgb-metadata-json.json
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c35f448 and 905ed7e.

📒 Files selected for processing (1)
  • .github/workflows/jekyll.yml (2 hunks)
🔇 Additional comments (3)
.github/workflows/jekyll.yml (3)

38-42: LGTM! Improved artifact naming for better clarity.

The renamed artifact explicitly indicates the CSV format, making it clearer what type of data is being handled.


60-64: LGTM! Consistent artifact naming with upload step.

The download step correctly mirrors the changes made in the upload step for the CSV artifact.


Line range hint 82-84: Verify Jekyll configuration for JSON metadata handling.

Please ensure that Jekyll is configured to handle both CSV and JSON metadata files during the build process.

✅ Verification successful

The configuration file shows explicit support for both CSV and JSON metadata through the settings metadata: sgb-metadata-csv and metadata-json: sgb-metadata-json. Let me verify the actual usage of these configurations in the Jekyll build process.


Jekyll is properly configured for both CSV and JSON metadata handling

The codebase shows proper configuration and implementation for handling both metadata formats:

  • Configuration in _config.yml explicitly defines both formats via metadata and metadata-json settings
  • The Jekyll plugins (_plugins/cb_helpers.rb and _plugins/cb_page_gen.rb) contain robust data loading logic that works with the configured metadata sources
  • The build process is format-agnostic, using site.data abstraction to handle the metadata regardless of source format
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check Jekyll configuration for metadata handling

# Test: Look for metadata configuration in Jekyll config files
echo "Checking Jekyll configuration files for metadata settings..."
rg -A 5 "metadata|data" "_config.yml"

Length of output: 1064


Script:

#!/bin/bash
# Check for Jekyll data processing logic
echo "Checking Jekyll data processing implementation..."
rg -l "metadata-json|sgb-metadata-json" --type ruby

# Check for the presence of both metadata files
echo -e "\nChecking for metadata files..."
fd -e csv -e json "sgb-metadata"

# Look for any Jekyll data loading logic
echo -e "\nChecking for data loading logic..."
rg "site\.data" --type ruby -A 3

Length of output: 3690

@maehr maehr requested a review from koilebeit November 11, 2024 15:28
@koilebeit koilebeit left a comment

Thank you very much for your PR. It looks very good. But we should also handle the following cases, which may occur on the browse page:

@maehr
Member Author

maehr commented Nov 12, 2024

Thank you very much for your PR. It looks very good. But we should also handle the following cases, which may occur on the browse page:

Thanks for pointing this out. I'll fix it.

The reason for the missing derivatives is manual uploads due to the file size. We should report this to @flicksolutions directly.

@maehr maehr marked this pull request as draft November 13, 2024 07:03
@maehr maehr self-assigned this Nov 13, 2024
@maehr maehr marked this pull request as ready for review November 13, 2024 08:21
@maehr maehr requested a review from koilebeit November 13, 2024 08:21
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (6)
_includes/js/browse-js.html (1)

162-181: Add event listener cleanup

Event listeners are added but never removed, which could lead to memory leaks if the browse functionality is re-initialized without proper cleanup.

Consider adding cleanup code:

+ // Store event listener references for cleanup
+ const buttonListeners = new Map();

  sortOptions.forEach((button) => {
-     button.addEventListener("click", () => {
+     const listener = () => {
          const field = button.dataset.filter;
          // ... rest of the code
-     });
+     };
+     buttonListeners.set(button, listener);
+     button.addEventListener("click", listener);
  });

+ // Add cleanup function
+ function cleanup() {
+     buttonListeners.forEach((listener, button) => {
+         button.removeEventListener("click", listener);
+     });
+     buttonListeners.clear();
+ }
.github/workflows/process_data.py (5)

14-19: Improve exception handling with proper error chaining

Use raise ... from err to maintain the exception chain for better debugging:

 try:
     ITEM_SET_ID = int(ITEM_SET_ID)
 except ValueError as err:
-    raise ValueError("ITEM_SET_ID must be a valid integer")
+    raise ValueError("ITEM_SET_ID must be a valid integer") from err
🧰 Tools
🪛 Ruff

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


Line range hint 9-13: Add validation for required environment variables

Critical environment variables should be validated at startup to fail fast:

 OMEKA_API_URL = os.getenv("OMEKA_API_URL")
 KEY_IDENTITY = os.getenv("KEY_IDENTITY")
 KEY_CREDENTIAL = os.getenv("KEY_CREDENTIAL")
+
+# Validate required environment variables
+if not all([OMEKA_API_URL, KEY_IDENTITY, KEY_CREDENTIAL]):
+    raise ValueError(
+        "Missing required environment variables. "
+        "Please set OMEKA_API_URL, KEY_IDENTITY, and KEY_CREDENTIAL"
+    )
🧰 Tools
🪛 Ruff

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


Line range hint 188-191: Remove or enhance debug logging

The current logging statements seem like debug information. Consider:

  1. Using debug level instead of info
  2. Adding more context
  3. Or removing if not necessary for production
-    logging.info(f"Media ID: {media['o:id']}")
-    logging.info(f"is_public: {media.get('o:is_public')}")
+    logging.debug(
+        f"Processing media - ID: {media['o:id']}, "
+        f"Public: {media.get('o:is_public')}, "
+        f"Format: {format_value}"
+    )

Line range hint 100-113: Optimize extract_combined_values function

The current implementation creates two separate lists and concatenates them. Consider using a list comprehension with conditional logic:

 def extract_combined_values(props):
     """Combines text values and URIs from properties into a single list."""
-    values = [
-        prop.get("@value", "").replace(";", "&#59")
-        for prop in props
-        if "@value" in prop
-    ]
-    uris = [
-        f"[{prop.get('o:label', '').replace(';', '&#59')}]({prop.get('@id', '').replace(';', '&#59')})"
-        for prop in props
-        if "@id" in prop
-    ]
-    return values + uris
+    return [
+        (prop.get("@value", "").replace(";", "&#59")
+         if "@value" in prop
+         else f"[{prop.get('o:label', '').replace(';', '&#59')}]({prop.get('@id', '').replace(';', '&#59')})")
+        for prop in props
+        if "@value" in prop or "@id" in prop
+    ]
🧰 Tools
🪛 Ruff

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


223-232: Add progress tracking for large datasets

For better observability during processing of large datasets, consider adding progress tracking:

+    total_items = len(items_data)
+    logging.info(f"Processing {total_items} items...")
     item_records, media_records = [], []
-    for item in items_data:
+    for index, item in enumerate(items_data, 1):
+        if index % 10 == 0:
+            logging.info(f"Progress: {index}/{total_items} items processed")
         item_record = extract_item_data(item)
         item_records.append(item_record)
         media_data = get_media(item.get("o:id", ""))
         if media_data:
+            logging.info(f"Processing {len(media_data)} media items for item {index}/{total_items}")
             for media in media_data:
                 media_records.append(extract_media_data(media, item_record["objectid"]))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 905ed7e and 4c8d44d.

⛔ Files ignored due to path filters (1)
  • assets/img/no-image.png is excluded by !**/*.png
📒 Files selected for processing (4)
  • .github/workflows/jekyll.yml (2 hunks)
  • .github/workflows/process_data.py (6 hunks)
  • _includes/js/browse-js.html (1 hunks)
  • package.json (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/jekyll.yml
  • package.json
🧰 Additional context used
🪛 Ruff
.github/workflows/process_data.py

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (3)
_includes/js/browse-js.html (2)

8-23: LGTM: Well-structured items initialization

The array initialization is well-structured with proper JSON escaping and careful handling of optional fields.


114-115: Replace biased shuffle with Fisher-Yates algorithm

The past review comment about the biased shuffle implementation is still valid.
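For reference, an unbiased Fisher-Yates shuffle looks like this (sketched in Python; the same index-swap loop ports directly to the JavaScript in browse-js.html):

```python
import random

def fisher_yates_shuffle(items):
    """In-place unbiased shuffle: every permutation is equally likely."""
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)  # inclusive: 0 <= j <= i
        items[i], items[j] = items[j], items[i]
    return items

shuffled = fisher_yates_shuffle(list(range(10)))
print(shuffled)  # a permutation of 0..9
```

The key property is that each element is swapped with a uniformly chosen index at or below it, which avoids the bias of `sort(() => Math.random() - 0.5)`-style shuffles.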

.github/workflows/process_data.py (1)

53-65: 🛠️ Refactor suggestion

Enhance pagination error handling with retry logic

The current implementation breaks the loop on first error. Consider implementing retry logic and validating the response data:

 def get_paginated_items(url, params):
     """Fetches all items from a paginated API endpoint."""
     items = []
+    max_retries = 3
+    retry_count = 0
     while url:
         try:
             response = requests.get(url, params=params)
             response.raise_for_status()
+            data = response.json()
+            if not isinstance(data, list):
+                raise ValueError(f"Expected list response, got {type(data)}")
+            items.extend(data)
+            url = response.links.get("next", {}).get("url")
+            params = None
+            retry_count = 0
         except requests.exceptions.RequestException as err:
+            if retry_count < max_retries:
+                retry_count += 1
+                logging.warning(f"Retry {retry_count}/{max_retries} after error: {err}")
+                continue
             logging.error(f"Error fetching items: {err}")
-            break
+            raise
-        items.extend(response.json())
-        url = response.links.get("next", {}).get("url")
-        params = None
     return items

Likely invalid or redundant comment.

@koilebeit koilebeit left a comment

Now it works fine. But I actually liked the solution with the icon themes better. Now it always uses the same placeholder instead of the media-specific placeholder.

function getIcon(objectTemplate, objectFormat, svgType) {

Now: [screenshot omitted]
Before: [screenshot omitted]

@mtwente, what do you think?

@maehr
Member Author

maehr commented Nov 13, 2024

Now it works fine. But I actually liked the solution with the icon themes better. Now it always uses the same placeholder instead of the media-specific placeholder

function getIcon(objectTemplate, objectFormat, svgType) {

Now: [screenshot omitted]
Before: [screenshot omitted]
@mtwente, what do you think?

I'll remove the line for media objects, OK? If they are images, they should have thumbnails anyway.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (3)
.github/workflows/process_data.py (3)

90-97: Optimize property extraction with dictionary lookup

The current implementation iterates through all properties. Using a dictionary would be more efficient for multiple lookups.

+def _build_props_dict(props):
+    """Build a dictionary of properties for efficient lookup."""
+    return {prop.get("property_id"): prop for prop in props}

 def extract_property(props, prop_id, as_uri=False):
     """Extracts a property value or URI from properties based on property ID."""
-    for prop in props:
-        if prop.get("property_id") == prop_id:
-            if as_uri:
-                return f"[{prop.get('o:label', '')}]({prop.get('@id', '')})"
-            return prop.get("@value", "")
-    return ""
+    props_dict = _build_props_dict(props)
+    prop = props_dict.get(prop_id, {})
+    if as_uri and "@id" in prop:
+        return f"[{prop.get('o:label', '')}]({prop.get('@id', '')})"
+    return prop.get("@value", "")
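A self-contained check of the dict-lookup variant (the sample `props` below are invented; real property IDs come from the Omeka vocabulary). Note one behavioral difference: with duplicate property IDs, the dict keeps the last entry, whereas the original loop returned the first match.

```python
def extract_property(props, prop_id, as_uri=False):
    """Extract a property value or Markdown URI by property ID via dict lookup."""
    props_dict = {p.get("property_id"): p for p in props}
    prop = props_dict.get(prop_id, {})
    if as_uri and "@id" in prop:
        return f"[{prop.get('o:label', '')}]({prop.get('@id', '')})"
    return prop.get("@value", "")

props = [
    {"property_id": 1, "@value": "Basel"},
    {"property_id": 10, "o:label": "GND", "@id": "https://d-nb.info/gnd/4004617-5"},
]
print(extract_property(props, 1))                # Basel
print(extract_property(props, 10, as_uri=True))  # [GND](https://d-nb.info/gnd/4004617-5)
print(extract_property(props, 99))               # empty string for unknown IDs
```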

Line range hint 100-112: Sanitize HTML in combined values to prevent XSS

The current implementation only escapes semicolons but doesn't sanitize HTML characters.

+import html

 def extract_combined_values(props):
     """Combines text values and URIs from properties into a single list."""
     values = [
-        prop.get("@value", "").replace(";", "&#59")
+        html.escape(prop.get("@value", "")).replace(";", "&#59")
         for prop in props
         if "@value" in prop
     ]
     uris = [
-        f"[{prop.get('o:label', '').replace(';', '&#59')}]({prop.get('@id', '').replace(';', '&#59')})"
+        f"[{html.escape(prop.get('o:label', '')).replace(';', '&#59')}]({html.escape(prop.get('@id', '')).replace(';', '&#59')})"
         for prop in props
         if "@id" in prop
     ]
     return values + uris
🧰 Tools
🪛 Ruff

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


224-237: Add progress logging and error handling in main processing loop

The main processing loop should provide progress updates and handle errors gracefully.

 def main():
     # Fetch item data
+    logging.info(f"Fetching items from collection {ITEM_SET_ID}")
     items_data = get_items_from_collection(ITEM_SET_ID)
+    total_items = len(items_data)
+    logging.info(f"Processing {total_items} items")

     # Process each item and associated media
     item_records, media_records = [], []
-    for item in items_data:
-        item_record = extract_item_data(item)
-        item_records.append(item_record)
-        media_data = get_media(item.get("o:id", ""))
-        if media_data:
-            for media in media_data:
-                media_records.append(extract_media_data(media, item_record["objectid"]))
+    for i, item in enumerate(items_data, 1):
+        try:
+            item_record = extract_item_data(item)
+            item_records.append(item_record)
+            
+            media_data = get_media(item.get("o:id", ""))
+            if media_data:
+                for media in media_data:
+                    media_records.append(extract_media_data(media, item_record["objectid"]))
+            
+            if i % 10 == 0:
+                logging.info(f"Processed {i}/{total_items} items")
+        except Exception as e:
+            logging.error(f"Error processing item {item.get('o:id')}: {e}")
+            continue
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4c8d44d and 14d1882.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (1)
  • .github/workflows/process_data.py (6 hunks)
🧰 Additional context used
🪛 Ruff
.github/workflows/process_data.py

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (2)
.github/workflows/process_data.py (2)

30-36: LGTM! Well-implemented URL validation

The URL validation function is concise and handles both scheme and netloc checks properly.
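The validated function body is not shown in this thread, so the following is an assumed sketch of what a scheme-and-netloc validator typically looks like:

```python
from urllib.parse import urlparse

def is_valid_url(url):
    """True only if the URL has both a scheme and a network location."""
    try:
        parts = urlparse(url)
    except ValueError:
        # urlparse can raise for malformed inputs such as invalid IPv6 literals
        return False
    return bool(parts.scheme and parts.netloc)

print(is_valid_url("https://example.org/api"))  # True
print(is_valid_url("files/original/foo.tif"))   # False: no scheme or netloc
```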


53-65: 🛠️ Refactor suggestion

Enhance pagination error handling with retry logic

The current implementation breaks the loop on the first error, which could result in incomplete data retrieval.

+import time
 def get_paginated_items(url, params):
     """Fetches all items from a paginated API endpoint."""
     items = []
+    max_retries = 3
+    retry_delay = 1  # seconds
     while url:
+        retry_count = 0
+        while retry_count < max_retries:
+            try:
+                response = requests.get(url, params=params)
+                response.raise_for_status()
+                items.extend(response.json())
+                url = response.links.get("next", {}).get("url")
+                params = None
+                break
+            except requests.exceptions.RequestException as err:
+                retry_count += 1
+                if retry_count == max_retries:
+                    logging.error(f"Failed after {max_retries} retries: {err}")
+                    raise
+                logging.warning(f"Attempt {retry_count} failed: {err}. Retrying...")
+                time.sleep(retry_delay * retry_count)
-        try:
-            response = requests.get(url, params=params)
-            response.raise_for_status()
-        except requests.exceptions.RequestException as err:
-            logging.error(f"Error fetching items: {err}")
-            break
-        items.extend(response.json())
-        url = response.links.get("next", {}).get("url")
-        params = None
     return items

Likely invalid or redundant comment.
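The retry pattern the diff above sketches can be tested in isolation with a fake flaky callable (the names here are illustrative, not part of process_data.py):

```python
import time

def with_retries(fn, max_retries=3, retry_delay=0.0):
    """Run fn(); on exception, retry with linear backoff, re-raising after max_retries."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(retry_delay * attempt)

state = {"calls": 0}

def flaky_fetch():
    """Fails twice, then succeeds, simulating transient network errors."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient network error")
    return ["item-1", "item-2"]

result = with_retries(flaky_fetch)
print(result)  # succeeds on the third attempt
```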

@flicksolutions

The reason for the missing derivatives is manual uploads due to the file size. We should report this to @flicksolutions directly.

I just recreated all derivatives on the new Omeka server. The thumbnail generation fails for these 4 TIFFs:

  • 10722
  • 10723
  • 10953
  • 11413

I think the files might just be too large for the server to handle:

convert: width or height exceeds limit `/var/www/html/omeka-s-new/files/original/75d3863cda7afe2c253d7ff2c75b41b9f30e412f.tif' @ error/cache.c/OpenPixelCache/3911.
convert: no images defined `/tmp/omekaOFSOEE.jpg' @ error/convert.c/ConvertImageCommand/3258.

Do you want me to look into that further?

@maehr
Member Author

maehr commented Nov 13, 2024

The reason for the missing derivatives is manual uploads due to the file size. We should report this to @flicksolutions directly.

I just recreated all derivatives on the new Omeka server. The thumbnail generation fails for these 4 TIFFs:

  • 10722
  • 10723
  • 10953
  • 11413

I think the files might just be too large for the server to handle:

convert: width or height exceeds limit `/var/www/html/omeka-s-new/files/original/75d3863cda7afe2c253d7ff2c75b41b9f30e412f.tif' @ error/cache.c/OpenPixelCache/3911.
convert: no images defined `/tmp/omekaOFSOEE.jpg' @ error/convert.c/ConvertImageCommand/3258.

Do you want me to look into that further?

Thank you very much for the swift reply. Can I create the thumbnails manually and send them to you?

@maehr maehr requested a review from koilebeit November 13, 2024 12:09
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
.github/workflows/process_data.py (2)

14-19: Enhance error handling in environment variable validation

The error chain is not preserved when re-raising the ValueError. Use raise from to maintain the exception context.

Apply this diff to improve the error handling:

 if not ITEM_SET_ID:
     raise ValueError("ITEM_SET_ID environment variable must be set")
 try:
     ITEM_SET_ID = int(ITEM_SET_ID)
 except ValueError as err:
-    raise ValueError("ITEM_SET_ID must be a valid integer")
+    raise ValueError("ITEM_SET_ID must be a valid integer") from err
🧰 Tools
🪛 Ruff

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


Line range hint 193-194: Remove or mask sensitive information in logs

Logging the media ID and public status might expose sensitive information about private media items.

Apply this diff to improve logging:

-    logging.info(f"Media ID: {media['o:id']}")
-    logging.info(f"is_public: {media.get('o:is_public')}")
+    logging.debug(f"Processing media ...{str(media['o:id'])[-4:]}")  # o:id may be an int; log only the last 4 chars
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 14d1882 and d602e0c.

⛔ Files ignored due to path filters (1)
  • assets/img/no-image.svg is excluded by !**/*.svg
📒 Files selected for processing (1)
  • .github/workflows/process_data.py (6 hunks)
🧰 Additional context used
🪛 Ruff
.github/workflows/process_data.py

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (2)
.github/workflows/process_data.py (2)

53-65: 🛠️ Refactor suggestion

Enhance error handling and reliability in get_paginated_items

The function could benefit from:

  1. Retry mechanism for transient failures
  2. Response data validation
  3. Proper error propagation

Apply this diff to improve the function:

 def get_paginated_items(url, params):
     """Fetches all items from a paginated API endpoint."""
     items = []
+    max_retries = 3
+    retry_count = 0
     while url:
         try:
             response = requests.get(url, params=params)
             response.raise_for_status()
+            data = response.json()
+            if not isinstance(data, list):
+                raise ValueError(f"Expected list response, got {type(data)}")
+            items.extend(data)
+            url = response.links.get("next", {}).get("url")
+            params = None
+            retry_count = 0
         except requests.exceptions.RequestException as err:
+            if retry_count < max_retries:
+                retry_count += 1
+                logging.warning(f"Retry {retry_count}/{max_retries} after error: {err}")
+                continue
             logging.error(f"Error fetching items: {err}")
-            break
-        items.extend(response.json())
-        url = response.links.get("next", {}).get("url")
-        params = None
+            raise
     return items

Likely invalid or redundant comment.


146-149: Verify the fallback image path

The hard-coded fallback image path assets/img/no-image.svg might not exist, causing broken image links.

✅ Verification successful

Fallback image path is valid

Fallback image exists at assets/img/no-image.svg.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the fallback image exists
if [ ! -f "assets/img/no-image.svg" ]; then
    echo "Warning: Fallback image not found at assets/img/no-image.svg"
    # Search for similar files that might be the intended fallback
    fd -e svg -e png -e jpg "no-image|default|placeholder" assets/img/
fi

Length of output: 40


Script:

#!/bin/bash
# Confirm the existence of the fallback image
if [ -f "assets/img/no-image.svg" ]; then
    echo "Fallback image exists at assets/img/no-image.svg"
else
    echo "Warning: Fallback image not found at assets/img/no-image.svg"
    # Search for similar files that might be the intended fallback
    fd -e svg -e png -e jpg -i "no-image|default|placeholder" assets/img/ --exec ls -l {}
fi

Length of output: 143

@flicksolutions

The reason for the missing derivatives is manual uploads due to the file size. We should report this to @flicksolutions directly.

I just recreated all derivatives on the new Omeka server. The thumbnail generation fails for these 4 TIFFs:

  • 10722
  • 10723
  • 10953
  • 11413

I think the files might just be too large for the server to handle:

convert: width or height exceeds limit `/var/www/html/omeka-s-new/files/original/75d3863cda7afe2c253d7ff2c75b41b9f30e412f.tif' @ error/cache.c/OpenPixelCache/3911.
convert: no images defined `/tmp/omekaOFSOEE.jpg' @ error/convert.c/ConvertImageCommand/3258.

Do you want me to look into that further?

Thank you very much for the swift reply. Can I create the thumbnails manually and send them to you?

That's worth a try. Sure, go ahead.

@maehr maehr merged commit ac004eb into Stadt-Geschichte-Basel:main Nov 13, 2024
3 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Nov 15, 2024
8 tasks
@mtwente mtwente mentioned this pull request Nov 18, 2024
7 tasks