Filter data #181
Conversation
Walkthrough: The changes in this pull request primarily enhance the functionality of data extraction and processing from an Omeka API.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (5)
_includes/js/browse-js.html (1)

87-109: LGTM with performance optimization suggestions.
The modernized `filterItems` function is well-structured and more maintainable. Consider these optional performance optimizations:

- Cache `Object.values(item).join(" ").toUpperCase()` for large objects
- Use `toLowerCase()` instead of `toUpperCase()`, as it's generally faster in most JavaScript engines

```diff
 function filterItems(arr, q) {
   // Show loading icon
   loadingIcon.classList.remove("d-none");
   // Trim and convert query to uppercase
-  const query = q.trim().toUpperCase();
+  const query = q.trim().toLowerCase();
   // Filter items based on query
   const filteredItems = query === ""
     ? arr
     : arr.filter(item => {
-        const itemValues = Object.values(item).join(" ").toUpperCase();
+        // Cache the joined values for better performance with large objects
+        if (!item._searchText) {
+          item._searchText = Object.values(item).join(" ").toLowerCase();
+        }
+        return item._searchText.includes(query);
       });
```

.github/workflows/process_data.py (4)
62-64: Handle pagination errors more robustly in `get_paginated_items`.
If an error occurs during pagination, the current implementation logs the error and breaks the loop, potentially leaving data incomplete. Consider re-raising the exception or implementing a retry mechanism:

```diff
 except requests.exceptions.RequestException as err:
     logging.error(f"Error fetching items: {err}")
+    raise
```

This ensures that the caller is aware of the incomplete data fetch.
153-176: Replace magic numbers with descriptive constants.
Using hardcoded property IDs reduces code readability and maintainability. Define constants with descriptive names for property IDs:

```python
DC_IDENTIFIER_ID = 10
DC_TITLE_ID = 1
DC_DESCRIPTION_ID = 4
# Add other property IDs accordingly
```

Update your code to use these constants:

```diff
 return {
-    "objectid": extract_property(item.get("dcterms:identifier", []), 10),
+    "objectid": extract_property(item.get("dcterms:identifier", []), DC_IDENTIFIER_ID),
-    "title": extract_property(item.get("dcterms:title", []), 1),
+    "title": extract_property(item.get("dcterms:title", []), DC_TITLE_ID),
     # Update other property IDs similarly
 }
```

This improves code clarity and ease of maintenance.
91-99: Optimize `extract_property` to handle multiple values.
The current implementation of `extract_property` returns only the first matching property. If there are multiple values for a property, they are ignored. Modify the function to return all matching values:

```diff
 def extract_property(props, prop_id, as_uri=False):
     """Extracts property values or URIs from properties based on property ID."""
-    for prop in props:
+    results = []
+    for prop in props:
         if prop.get("property_id") == prop_id:
             if as_uri:
                 result = f"[{prop.get('o:label', '')}]({prop.get('@id', '')})"
             else:
                 result = prop.get("@value", "")
-            return result
-    return ""
+            results.append(result)
+    return results if results else ""
```

This allows you to capture all values associated with a property ID.
9-10: Review TODO comments with external URLs.
The TODO comments on lines 9 and 10 contain external URLs, which may not be accessible or relevant in all environments. Consider adding context or additional information to these TODOs to clarify the intended actions, or move them to documentation if they require ongoing reference.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (2)
- `_data/config-metadata.csv` is excluded by `!**/*.csv`
- `assets/data/metadata.csv` is excluded by `!**/*.csv`

📒 Files selected for processing (6)
- `.github/workflows/process_data.py` (5 hunks)
- `_config.yml` (1 hunks)
- `_data/theme.yml` (1 hunks)
- `_includes/js/browse-js.html` (1 hunks)
- `assets/data/metadata.json` (1 hunks)
- `package.json` (1 hunks)
🧰 Additional context used
🪛 Gitleaks
.github/workflows/process_data.py
22-22: Detected a Generic API Key, potentially exposing access to various services and sensitive operations. (generic-api-key)
23-23: Detected a Generic API Key, potentially exposing access to various services and sensitive operations. (generic-api-key)
🪛 Biome
assets/data/metadata.json
[error] lines 4-15: numerous JSON parse errors — unexpected characters (`%`, `=`, `.`, `|`), "Minus must be followed by a digit", "Invalid exponent part", "expected `,` but instead found `{` / `[` / `:`", "String values must be double quoted", "Property key must be double quoted", "Expected a property", and "End of file expected — use an array for a sequence of values: [1, 2]". These all come from Biome parsing the raw file as JSON; the file is a Liquid template (it contains `{%- ... -%}` tags) that only renders to valid JSON at build time.
🔇 Additional comments (9)
package.json (1)
6-6: Verify uv dependency requirements.
The script uses `uv run` to execute Python code, but `uv` is not listed in the dependencies. Let's check if uv is documented elsewhere. Consider:
- Adding documentation about the `uv` requirement
- Or using the standard `python` command instead, if `uv` is not strictly necessary

✅ Verification successful
Let me gather more information to verify whether `uv` is required or documented elsewhere in the codebase. Let me check the content of pyproject.toml to see if uv is documented there.

`uv` is properly configured in the GitHub Actions workflow.
The GitHub Actions workflow file shows that `uv` is set up using the official `astral-sh/setup-uv@v3` action before running the Python script. This is the correct way to handle the dependency, as it's a CI/CD tool requirement rather than a project dependency. No changes are needed.
🏁 Scripts executed
The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Description: Check if uv is documented as a requirement
# Test: Look for uv in documentation files
rg -i "uv|python" "README.md" || echo "No matches in README.md"
```
Length of output: 86

```shell
#!/bin/bash
# Check for uv in other documentation files and configuration
fd -e md -e txt -e yaml -e yml -x rg -i "uv|python" {} \;
# Check the Python script that's being executed
cat .github/workflows/process_data.py || echo "Script not found"
# Look for any requirements.txt or similar Python dependency files
fd -e txt -e toml "requirements|pyproject"
```
Length of output: 10345

```shell
#!/bin/bash
# Check pyproject.toml for dependencies and Python requirements
cat pyproject.toml
# Also check for any GitHub workflow files that might document the uv setup
fd -e yml -e yaml . .github/workflows/ -x cat {}
```
Length of output: 3471
assets/data/metadata.json (3)
6-15
: LGTM! Well-structured JSON generation.
The JSON structure is clean and maintainable:
- Proper conditional field inclusion
- Clean formatting
- Valid JSON array structure
12-12: LGTM! Improved parentid check and URL handling.
The changes improve the code in two ways:
- Using `absolute_url` ensures fully qualified URLs
- The `parentid.size > 0` check is more robust than the previous implicit boolean check

```shell
#!/bin/bash
# Description: Verify URL generation in different environments
# Test: Check if absolute_url filter is consistently used across the codebase
rg -l "absolute_url" --type html --type md --type liquid
```
4-5: Verify data integrity of site.metadata-json.
The change from the filtered `site.metadata` to direct `site.metadata-json` access simplifies the code. However, we should ensure that `site.metadata-json` maintains the same data quality as the previous filtered approach.

✅ Verification successful
Let me gather more information about the configuration and data structure. Based on the _config.yml content, I can verify that the change from `site.metadata` to `site.metadata-json` is properly configured. Both fields are defined in the collection settings:

```yaml
metadata: sgb-metadata-csv
metadata-json: sgb-metadata-json
```

Let me check the actual metadata files to ensure they exist and verify their structure. Let me try a different approach to verify the metadata files and their usage in the codebase. Let me check the content of the metadata.json file and related templates to verify the data structure and usage.

Based on the evidence gathered, I can now provide a final response:

Change to site.metadata-json is properly configured and safe.
The change from `site.metadata` to `site.metadata-json` in `assets/data/metadata.json` is properly configured and supported by the system:
- The `_config.yml` explicitly defines both metadata sources: `metadata: sgb-metadata-csv` and `metadata-json: sgb-metadata-json`
- The metadata.json template correctly uses the new format: `{%- assign items = site.data[site.metadata-json] -%}`
- The change only affects the JSON data export while maintaining the original CSV metadata source for other features, as evidenced by the numerous templates still using `site.metadata` for core functionality.
🏁 Scripts executed
The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Description: Verify the structure and content of metadata configuration
# Test 1: Check if metadata-json is properly configured
rg -l "metadata-json:" _config.yml
# Test 2: Compare the structure with previous metadata field
rg -l "metadata:" _config.yml
```
Length of output: 232

```shell
#!/bin/bash
# Check the content of _config.yml for metadata configuration
cat _config.yml
# Look for any JSON files that might contain metadata
fd -e json -x cat {} \;
# Search for metadata-related configuration in other yml files
fd -e yml -e yaml -x cat {} \;
```
Length of output: 78924

```shell
#!/bin/bash
# Check for metadata files
fd -e csv -e json "sgb-metadata" _data/
# If files exist, peek at their content
find _data -name "sgb-metadata*" -exec head -n 5 {} \;
```
Length of output: 99

```shell
#!/bin/bash
# Search for references to metadata-json in liquid templates
rg -l "site.metadata-json" --type html --type md --type liquid
# Search for references to the old metadata format
rg -l "site.metadata" --type html --type md --type liquid
# Look for any JSON generation files
fd -e json.liquid -e json
```
Length of output: 558

```shell
#!/bin/bash
# Check the metadata.json template
cat assets/data/metadata.json
# Look for liquid templates that handle metadata
fd -e html -e liquid | xargs grep -l "site.metadata"
# Check if there are any other relevant configuration files
fd -g "*config*.{yml,yaml,json}" -x cat {}
```
Length of output: 4699
_config.yml (1)
39-40: LGTM! Clear and consistent metadata configuration.
The changes effectively support both CSV and JSON metadata formats with clear, self-documenting field names.
_data/theme.yml (2)
71-71: LGTM! Enhanced metadata export fields.
The updated metadata export fields provide a more comprehensive set of information. The new organization groups related fields together (e.g., objectid/parentid, title/description/relation) and includes important additions like publisher and extent. This aligns well with the PR's goal of improving metadata handling.

73-73: LGTM! Improved faceted search capabilities.
The addition of publisher, type, language, and license to the facets fields enhances the filtering capabilities, allowing users to find content more effectively. This change aligns with the PR's goal of improving metadata presentation and user experience.
.github/workflows/process_data.py (2)
234-235: Remove duplicate call to get_media.
The function `get_media` is called twice consecutively with the same item ID on lines 234 and 235. This redundancy may lead to unnecessary API calls and degrade performance. You can remove the duplicate call to streamline the code:

```diff
 # Process media data for each item
-media_data = get_media(item.get("o:id", ""))
 media_data = get_media(item["o:id"])
```

Ensure that `item["o:id"]` always exists or handle the potential `KeyError` if it doesn't.
Likely invalid or redundant comment.

245-260: Ensure data consistency when saving records.
When converting records to a DataFrame, there might be inconsistencies if some records have missing keys. Please verify that all records have consistent keys before saving. Run the following script to identify any discrepancies:
This script helps ensure data integrity in your CSV and JSON outputs.
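A key-consistency check of the kind suggested here could be sketched as follows — illustrative only; `find_key_discrepancies` and the record shape are assumptions, not the script the reviewer referenced:

```python
def find_key_discrepancies(records):
    """Report keys that are missing from some records.

    Returns a dict mapping each key that appears in at least one record
    to the number of records it is missing from; keys present in every
    record are omitted.
    """
    all_keys = set().union(*(r.keys() for r in records)) if records else set()
    discrepancies = {}
    for key in all_keys:
        missing = sum(1 for r in records if key not in r)
        if missing:
            discrepancies[key] = missing
    return discrepancies
```

Records flagged this way can then be normalized (e.g. `record.setdefault(key, "")`) before building the DataFrame, so the CSV and JSON outputs share one column set.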
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (2)
.github/workflows/process_data.py (2)
Line range hint 93-106: Security: Use HTML escape function for semicolon replacement.
Using direct string replacement for HTML escaping is unsafe. Consider using proper HTML escaping functions:

```diff
+import html
 def extract_combined_values(props):
     """Combines text values and URIs from properties into a single list."""
     values = [
-        prop.get("@value", "").replace(";", "&#59;")
+        html.escape(prop.get("@value", ""), quote=False)
         for prop in props
         if "@value" in prop
     ]
     uris = [
-        f"[{prop.get('o:label', '').replace(';', '&#59;')}]({prop.get('@id', '').replace(';', '&#59;')})"
+        f"[{html.escape(prop.get('o:label', ''), quote=False)}]({html.escape(prop.get('@id', ''), quote=False)})"
         for prop in props
         if "@id" in prop
     ]
     return values + uris
```

216-234: Consider implementing checkpoint/resume capability.
For large datasets, the script should be able to resume from where it left off in case of failures. Consider:
- Implementing checkpointing after processing each item
- Adding a resume capability that can skip already processed items

This would make the script more resilient to failures and more efficient when re-running:

```python
def save_checkpoint(processed_ids, checkpoint_file="checkpoint.json"):
    with open(checkpoint_file, "w") as f:
        # sets aren't JSON-serializable; store as a list
        json.dump(list(processed_ids), f)

def load_checkpoint(checkpoint_file="checkpoint.json"):
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            return set(json.load(f))
    return set()
```
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
.github/workflows/jekyll.yml (1)

43-47: Consider adding file existence checks.
While the JSON artifact upload aligns with the PR's objectives for enhanced metadata handling, it would be beneficial to add error handling. Consider wrapping the upload step with a condition:

```yaml
- name: Check if JSON file exists
  id: check_json
  run: |
    if [ -f "_data/sgb-metadata-json.json" ]; then
      echo "exists=true" >> $GITHUB_OUTPUT
    else
      echo "exists=false" >> $GITHUB_OUTPUT
    fi

- name: Upload sgb-metadata-json.json
  if: steps.check_json.outputs.exists == 'true'
  uses: actions/upload-artifact@v4
  with:
    name: sgb-metadata-json
    path: _data/sgb-metadata-json.json
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- `.github/workflows/jekyll.yml` (2 hunks)
🔇 Additional comments (3)
.github/workflows/jekyll.yml (3)
38-42: LGTM! Improved artifact naming for better clarity.
The renamed artifact explicitly indicates the CSV format, making it clearer what type of data is being handled.

60-64: LGTM! Consistent artifact naming with upload step.
The download step correctly mirrors the changes made in the upload step for the CSV artifact.

Line range hint 82-84: Verify Jekyll configuration for JSON metadata handling.
Please ensure that Jekyll is configured to handle both CSV and JSON metadata files during the build process.

✅ Verification successful
The configuration file shows explicit support for both CSV and JSON metadata through the settings `metadata: sgb-metadata-csv` and `metadata-json: sgb-metadata-json`. Let me verify the actual usage of these configurations in the Jekyll build process.

Jekyll is properly configured for both CSV and JSON metadata handling.
The codebase shows proper configuration and implementation for handling both metadata formats:
- Configuration in `_config.yml` explicitly defines both formats via the `metadata` and `metadata-json` settings
- The Jekyll plugins (`_plugins/cb_helpers.rb` and `_plugins/cb_page_gen.rb`) contain robust data loading logic that works with the configured metadata sources
- The build process is format-agnostic, using the `site.data` abstraction to handle the metadata regardless of source format
🏁 Scripts executed
The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Description: Check Jekyll configuration for metadata handling
# Test: Look for metadata configuration in Jekyll config files
echo "Checking Jekyll configuration files for metadata settings..."
rg -A 5 "metadata|data" "_config.yml"
```
Length of output: 1064

```shell
#!/bin/bash
# Check for Jekyll data processing logic
echo "Checking Jekyll data processing implementation..."
rg -l "metadata-json|sgb-metadata-json" --type ruby
# Check for the presence of both metadata files
echo -e "\nChecking for metadata files..."
fd -e csv -e json "sgb-metadata"
# Look for any Jekyll data loading logic
echo -e "\nChecking for data loading logic..."
rg "site\.data" --type ruby -A 3
```
Length of output: 3690
Thank you very much for your PR. Looks very good. But we should also catch the following cases on the browse page that may occur:
- a parent object does not (yet) have any media objects (e.g. https://omeka.unibe.ch/admin/item/10891)
- the media object has no derivatives in OMEKA (I don't know the reason for this yet). See https://omeka.unibe.ch/admin/item/11412

Thanks for pointing this out. I'll fix it. The reason for the missing derivatives is manual uploads because of the file size. We should report this to @flicksolutions directly.
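A defensive fallback for these two cases could be sketched in the extraction script along these lines — purely illustrative: `thumbnail_for` is a made-up helper, and the `o:thumbnail_urls` field name is an assumption about the Omeka API response shape:

```python
# Placeholder image shipped with this PR (assets/img/no-image.png).
NO_IMAGE_PLACEHOLDER = "/assets/img/no-image.png"

def thumbnail_for(item_media):
    """Pick a thumbnail URL for an item's media list, falling back to a placeholder.

    item_media: list of media dicts from the API (may be empty or None when a
    parent item has no media yet). The assumed "o:thumbnail_urls" key maps
    derivative names to URLs and may be empty when no derivatives exist.
    """
    for media in item_media or []:
        thumbs = media.get("o:thumbnail_urls") or {}
        # Prefer the medium derivative, otherwise any derivative that exists.
        url = thumbs.get("medium") or next(iter(thumbs.values()), None)
        if url:
            return url
    # Parent item has no media, or none of its media have derivatives.
    return NO_IMAGE_PLACEHOLDER
```

With this shape, both problem items would render the shared placeholder on the browse page instead of a broken image.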
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (6)
_includes/js/browse-js.html (1)

162-181: Add event listener cleanup.
Event listeners are added but never removed, which could lead to memory leaks if the browse functionality is re-initialized without proper cleanup. Consider adding cleanup code:

```diff
+// Store event listener references for cleanup
+const buttonListeners = new Map();
 sortOptions.forEach((button) => {
-  button.addEventListener("click", () => {
+  const listener = () => {
     const field = button.dataset.filter;
     // ... rest of the code
-  });
+  };
+  buttonListeners.set(button, listener);
+  button.addEventListener("click", listener);
 });
+// Add cleanup function
+function cleanup() {
+  buttonListeners.forEach((listener, button) => {
+    button.removeEventListener("click", listener);
+  });
+  buttonListeners.clear();
+}
```

.github/workflows/process_data.py (5)
14-19: Improve exception handling with proper error chaining.
Use `raise ... from err` to maintain the exception chain for better debugging:

```diff
 try:
     ITEM_SET_ID = int(ITEM_SET_ID)
 except ValueError as err:
-    raise ValueError("ITEM_SET_ID must be a valid integer")
+    raise ValueError("ITEM_SET_ID must be a valid integer") from err
```

🧰 Tools
🪛 Ruff
19-19: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)

Line range hint 9-13: Add validation for required environment variables.
Critical environment variables should be validated at startup to fail fast:

```diff
 OMEKA_API_URL = os.getenv("OMEKA_API_URL")
 KEY_IDENTITY = os.getenv("KEY_IDENTITY")
 KEY_CREDENTIAL = os.getenv("KEY_CREDENTIAL")
+
+# Validate required environment variables
+if not all([OMEKA_API_URL, KEY_IDENTITY, KEY_CREDENTIAL]):
+    raise ValueError(
+        "Missing required environment variables. "
+        "Please set OMEKA_API_URL, KEY_IDENTITY, and KEY_CREDENTIAL"
+    )
```

Line range hint 188-191: Remove or enhance debug logging.
The current logging statements seem like debug information. Consider:
- Using debug level instead of info
- Adding more context
- Or removing if not necessary for production

```diff
-logging.info(f"Media ID: {media['o:id']}")
-logging.info(f"is_public: {media.get('o:is_public')}")
+logging.debug(
+    f"Processing media - ID: {media['o:id']}, "
+    f"Public: {media.get('o:is_public')}, "
+    f"Format: {format_value}"
+)
```

Line range hint 100-113: Optimize extract_combined_values function.
The current implementation creates two separate lists and concatenates them. Consider using a list comprehension with conditional logic:

```diff
 def extract_combined_values(props):
     """Combines text values and URIs from properties into a single list."""
-    values = [
-        prop.get("@value", "").replace(";", "&#59;")
-        for prop in props
-        if "@value" in prop
-    ]
-    uris = [
-        f"[{prop.get('o:label', '').replace(';', '&#59;')}]({prop.get('@id', '').replace(';', '&#59;')})"
-        for prop in props
-        if "@id" in prop
-    ]
-    return values + uris
+    return [
+        (prop.get("@value", "").replace(";", "&#59;")
+         if "@value" in prop
+         else f"[{prop.get('o:label', '').replace(';', '&#59;')}]({prop.get('@id', '').replace(';', '&#59;')})")
+        for prop in props
+        if "@value" in prop or "@id" in prop
+    ]
```

223-232: Add progress tracking for large datasets.
For better observability during processing of large datasets, consider adding progress tracking:

```diff
+total_items = len(items_data)
+logging.info(f"Processing {total_items} items...")
 item_records, media_records = [], []
-for item in items_data:
+for index, item in enumerate(items_data, 1):
+    if index % 10 == 0:
+        logging.info(f"Progress: {index}/{total_items} items processed")
     item_record = extract_item_data(item)
     item_records.append(item_record)
     media_data = get_media(item.get("o:id", ""))
     if media_data:
+        logging.info(f"Processing {len(media_data)} media items for item {index}/{total_items}")
         for media in media_data:
             media_records.append(extract_media_data(media, item_record["objectid"]))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
- `assets/img/no-image.png` is excluded by `!**/*.png`

📒 Files selected for processing (4)
- `.github/workflows/jekyll.yml` (2 hunks)
- `.github/workflows/process_data.py` (6 hunks)
- `_includes/js/browse-js.html` (1 hunks)
- `package.json` (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/jekyll.yml
- package.json

🧰 Additional context used
🪛 Ruff
.github/workflows/process_data.py
19-19: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
🔇 Additional comments (3)

_includes/js/browse-js.html (2)

8-23: LGTM: Well-structured items initialization

The array initialization is well-structured with proper JSON escaping and careful handling of optional fields.

114-115: Replace biased shuffle with Fisher-Yates algorithm

The past review comment about the biased shuffle implementation is still valid.
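For reference, the unbiased Fisher-Yates loop this comment refers to, sketched in Python (the JavaScript version in browse-js.html would be the same loop with `Math.floor(Math.random() * (i + 1))` for the swap index):

```python
import random

def fisher_yates_shuffle(items, rand=random.random):
    """Shuffle `items` in place so that every permutation is equally likely."""
    # Walk from the end; swap position i with a uniformly chosen j in [0, i].
    for i in range(len(items) - 1, 0, -1):
        j = int(rand() * (i + 1))
        items[i], items[j] = items[j], items[i]
    return items

deck = list(range(10))
fisher_yates_shuffle(deck)
print(sorted(deck) == list(range(10)))  # → True (same elements, new order)
```

The common biased pattern (sorting with a random comparator) does not give each permutation equal probability, which is what the review flags.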
.github/workflows/process_data.py (1)

53-65: 🛠️ Refactor suggestion: Enhance pagination error handling with retry logic

The current implementation breaks the loop on the first error. Consider implementing retry logic and validating the response data:
```diff
 def get_paginated_items(url, params):
     """Fetches all items from a paginated API endpoint."""
     items = []
+    max_retries = 3
+    retry_count = 0
     while url:
         try:
             response = requests.get(url, params=params)
             response.raise_for_status()
+            data = response.json()
+            if not isinstance(data, list):
+                raise ValueError(f"Expected list response, got {type(data)}")
+            items.extend(data)
+            url = response.links.get("next", {}).get("url")
+            params = None
+            retry_count = 0
         except requests.exceptions.RequestException as err:
+            if retry_count < max_retries:
+                retry_count += 1
+                logging.warning(f"Retry {retry_count}/{max_retries} after error: {err}")
+                continue
             logging.error(f"Error fetching items: {err}")
-            break
+            raise
-        items.extend(response.json())
-        url = response.links.get("next", {}).get("url")
-        params = None
     return items
```
Likely invalid or redundant comment.
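The retry idea in the diff can also be factored out of the HTTP layer, which makes it testable without a network. `flaky_fetch` below is a hypothetical stand-in for the `requests.get` call, not code from the PR:

```python
import time

def call_with_retries(fetch, max_retries=3, delay=0.0):
    """Call `fetch()` until it succeeds, retrying up to `max_retries` attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: propagate instead of returning partial data
            time.sleep(delay * attempt)  # linear backoff between attempts

# Hypothetical flaky fetcher: fails twice, then returns one page of items.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return [{"o:id": 1}, {"o:id": 2}]

result = call_with_retries(flaky_fetch)
print(result, calls["n"])  # → [{'o:id': 1}, {'o:id': 2}] 3
```

Keeping the retry policy separate from the request code is what allows unit tests like this to exercise the failure path deterministically.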
Now it works fine. But I actually liked the solution with the icon themes better: now it always uses the same placeholder instead of the media-specific placeholder.
`function getIcon(objectTemplate, objectFormat, svgType) {`
Now: [screenshot]
Before: [screenshot]
@mtwente, what do you think?
I'll remove the line for media objects, ok? If they are images they are not supposed to have no thumbs anyway.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (3)
.github/workflows/process_data.py (3)
90-97: Optimize property extraction with dictionary lookup

The current implementation iterates through all properties. Using a dictionary would be more efficient for multiple lookups.

```diff
+def _build_props_dict(props):
+    """Build a dictionary of properties for efficient lookup."""
+    return {prop.get("property_id"): prop for prop in props}
 def extract_property(props, prop_id, as_uri=False):
     """Extracts a property value or URI from properties based on property ID."""
-    for prop in props:
-        if prop.get("property_id") == prop_id:
-            if as_uri:
-                return f"[{prop.get('o:label', '')}]({prop.get('@id', '')})"
-            return prop.get("@value", "")
-    return ""
+    props_dict = _build_props_dict(props)
+    prop = props_dict.get(prop_id, {})
+    if as_uri and "@id" in prop:
+        return f"[{prop.get('o:label', '')}]({prop.get('@id', '')})"
+    return prop.get("@value", "")
```
Line range hint 100-112: Sanitize HTML in combined values to prevent XSS

The current implementation only escapes semicolons but doesn't sanitize HTML characters.

```diff
+import html
 def extract_combined_values(props):
     """Combines text values and URIs from properties into a single list."""
     values = [
-        prop.get("@value", "").replace(";", "&#59;")
+        html.escape(prop.get("@value", "")).replace(";", "&#59;")
         for prop in props
         if "@value" in prop
     ]
     uris = [
-        f"[{prop.get('o:label', '').replace(';', '&#59;')}]({prop.get('@id', '').replace(';', '&#59;')})"
+        f"[{html.escape(prop.get('o:label', '')).replace(';', '&#59;')}]({html.escape(prop.get('@id', '')).replace(';', '&#59;')})"
         for prop in props
         if "@id" in prop
     ]
     return values + uris
```
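For context, `html.escape` from the standard library handles angle brackets, ampersands, and quotes. One caveat with the suggestion above: the escaped entities themselves end in semicolons, so running the semicolon replacement after escaping would mangle those entities too; the order of the two steps needs care.

```python
import html

label = '<b>Zurich & "Old Town"</b>'
print(html.escape(label))
# → &lt;b&gt;Zurich &amp; &quot;Old Town&quot;&lt;/b&gt;
```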
🪛 Ruff
19-19: Within an
except
clause, raise exceptions withraise ... from err
orraise ... from None
to distinguish them from errors in exception handling(B904)
224-237: Add progress logging and error handling in main processing loop

The main processing loop should provide progress updates and handle errors gracefully.

```diff
 def main():
     # Fetch item data
+    logging.info(f"Fetching items from collection {ITEM_SET_ID}")
     items_data = get_items_from_collection(ITEM_SET_ID)
+    total_items = len(items_data)
+    logging.info(f"Processing {total_items} items")
     # Process each item and associated media
     item_records, media_records = [], []
-    for item in items_data:
-        item_record = extract_item_data(item)
-        item_records.append(item_record)
-        media_data = get_media(item.get("o:id", ""))
-        if media_data:
-            for media in media_data:
-                media_records.append(extract_media_data(media, item_record["objectid"]))
+    for i, item in enumerate(items_data, 1):
+        try:
+            item_record = extract_item_data(item)
+            item_records.append(item_record)
+            media_data = get_media(item.get("o:id", ""))
+            if media_data:
+                for media in media_data:
+                    media_records.append(extract_media_data(media, item_record["objectid"]))
+            if i % 10 == 0:
+                logging.info(f"Processed {i}/{total_items} items")
+        except Exception as e:
+            logging.error(f"Error processing item {item.get('o:id')}: {e}")
+            continue
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

⛔ Files ignored due to path filters (1)

- `package-lock.json` is excluded by `!**/package-lock.json`

📒 Files selected for processing (1)

- `.github/workflows/process_data.py` (6 hunks)
🔇 Additional comments (2)
.github/workflows/process_data.py (2)
30-36: LGTM! Well-implemented URL validation

The URL validation function is concise and handles both scheme and netloc checks properly.
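The scheme-plus-netloc check described can be sketched with `urllib.parse`; the actual function in `process_data.py` may differ in detail:

```python
from urllib.parse import urlparse

def is_valid_url(url):
    """Return True when the URL parses with both a scheme and a host part."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    return bool(parsed.scheme) and bool(parsed.netloc)

print(is_valid_url("https://omeka.example.org/api"))  # → True
print(is_valid_url("not-a-url"))                      # → False
```

Requiring both parts rejects bare paths ("not-a-url" parses with an empty scheme) as well as scheme-only strings such as `file:///tmp/x`, which has no netloc.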
53-65: 🛠️ Refactor suggestion: Enhance pagination error handling with retry logic

The current implementation breaks the loop on the first error, which could result in incomplete data retrieval.
```diff
 def get_paginated_items(url, params):
     """Fetches all items from a paginated API endpoint."""
     items = []
+    max_retries = 3
+    retry_delay = 1  # seconds
     while url:
+        retry_count = 0
+        while retry_count < max_retries:
+            try:
+                response = requests.get(url, params=params)
+                response.raise_for_status()
+                items.extend(response.json())
+                url = response.links.get("next", {}).get("url")
+                params = None
+                break
+            except requests.exceptions.RequestException as err:
+                retry_count += 1
+                if retry_count == max_retries:
+                    logging.error(f"Failed after {max_retries} retries: {err}")
+                    raise
+                logging.warning(f"Attempt {retry_count} failed: {err}. Retrying...")
+                time.sleep(retry_delay * retry_count)
-        try:
-            response = requests.get(url, params=params)
-            response.raise_for_status()
-        except requests.exceptions.RequestException as err:
-            logging.error(f"Error fetching items: {err}")
-            break
-        items.extend(response.json())
-        url = response.links.get("next", {}).get("url")
-        params = None
     return items
```
Likely invalid or redundant comment.
I just recreated all derivatives on the new Omeka server. Thumbnail generation fails for these 4 TIFFs:
I think the files might just be too large for the server to handle:
Do you want me to look into that further?
Thank you very much for the swift reply. Can I create the thumbnails manually and send them to you?
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
.github/workflows/process_data.py (2)
14-19: Enhance error handling in environment variable validation

The error chain is not preserved when re-raising the ValueError. Use `raise ... from` to maintain the exception context.

Apply this diff to improve the error handling:

```diff
 if not ITEM_SET_ID:
     raise ValueError("ITEM_SET_ID environment variable must be set")
 try:
     ITEM_SET_ID = int(ITEM_SET_ID)
 except ValueError as err:
-    raise ValueError("ITEM_SET_ID must be a valid integer")
+    raise ValueError("ITEM_SET_ID must be a valid integer") from err
```
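To illustrate what `from err` buys: the original exception stays reachable on `__cause__`, so tracebacks show both errors. `parse_item_set_id` here is a hypothetical wrapper for illustration, not the module's actual code:

```python
def parse_item_set_id(raw):
    """Convert an ITEM_SET_ID string to int, preserving the error chain."""
    try:
        return int(raw)
    except ValueError as err:
        raise ValueError("ITEM_SET_ID must be a valid integer") from err

try:
    parse_item_set_id("not-a-number")
except ValueError as exc:
    print(type(exc.__cause__).__name__)  # → ValueError (the original int() error)
```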
🪛 Ruff
19-19: Within an
except
clause, raise exceptions withraise ... from err
orraise ... from None
to distinguish them from errors in exception handling(B904)
Line range hint 193-194: Remove or mask sensitive information in logs

Logging the media ID and public status might expose sensitive information about private media items. Apply this diff to improve logging:

```diff
-    logging.info(f"Media ID: {media['o:id']}")
-    logging.info(f"is_public: {media.get('o:is_public')}")
+    logging.debug(f"Processing media {media['o:id'][-4:]}...")  # Log only last 4 chars of ID
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

⛔ Files ignored due to path filters (1)

- `assets/img/no-image.svg` is excluded by `!**/*.svg`

📒 Files selected for processing (1)

- `.github/workflows/process_data.py` (6 hunks)
🔇 Additional comments (2)
.github/workflows/process_data.py (2)
53-65: 🛠️ Refactor suggestion: Enhance error handling and reliability in get_paginated_items

The function could benefit from:

- Retry mechanism for transient failures
- Response data validation
- Proper error propagation

Apply this diff to improve the function:
```diff
 def get_paginated_items(url, params):
     """Fetches all items from a paginated API endpoint."""
     items = []
+    max_retries = 3
+    retry_count = 0
     while url:
         try:
             response = requests.get(url, params=params)
             response.raise_for_status()
+            data = response.json()
+            if not isinstance(data, list):
+                raise ValueError(f"Expected list response, got {type(data)}")
+            items.extend(data)
+            url = response.links.get("next", {}).get("url")
+            params = None
+            retry_count = 0
         except requests.exceptions.RequestException as err:
+            if retry_count < max_retries:
+                retry_count += 1
+                logging.warning(f"Retry {retry_count}/{max_retries} after error: {err}")
+                continue
             logging.error(f"Error fetching items: {err}")
-            break
-        items.extend(response.json())
-        url = response.links.get("next", {}).get("url")
-        params = None
+            raise
     return items
```
Likely invalid or redundant comment.
146-149: Verify the fallback image path

The hard-coded fallback image path `assets/img/no-image.svg` might not exist, causing broken image links.

✅ Verification successful

Fallback image path is valid: the fallback image exists at `assets/img/no-image.svg`.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Check if the fallback image exists
if [ ! -f "assets/img/no-image.svg" ]; then
  echo "Warning: Fallback image not found at assets/img/no-image.svg"
  # Search for similar files that might be the intended fallback
  fd -e svg -e png -e jpg "no-image|default|placeholder" assets/img/
fi
```

Length of output: 40
Script:

```shell
#!/bin/bash
# Confirm the existence of the fallback image
if [ -f "assets/img/no-image.svg" ]; then
  echo "Fallback image exists at assets/img/no-image.svg"
else
  echo "Warning: Fallback image not found at assets/img/no-image.svg"
  # Search for similar files that might be the intended fallback
  fd -e svg -e png -e jpg -i "no-image|default|placeholder" assets/img/ --exec ls -l {}
fi
```

Length of output: 143
That's worth a try. Sure, go ahead.
Pull request
Proposed changes
Types of changes
Checklist
Summary by CodeRabbit
Release Notes
New Features
Improvements
Bug Fixes
Chores