-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[en] new dicts #10699
base: master
Are you sure you want to change the base?
[en] new dicts #10699
Conversation
I didn't build the new dictionaries locally, so these APs aren't guaranteed to work (although I think most of them should as they don't rely too much on postag info).
|
923ebfd
to
80c6f5f
Compare
6d936b5
to
842fa0f
Compare
WalkthroughThis pull request involves comprehensive updates to the LanguageTool English language modules. The changes span multiple resource files, including dictionaries, spelling lists, disambiguation rules, and test configurations. The modifications primarily focus on refining vocabulary entries, updating spelling rules, adjusting grammar and style guidelines, and making corresponding adjustments to test cases. The updates affect various English language variants (US, GB, AU, CA) and involve additions, removals, and modifications of linguistic resources. Changes
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Nitpick comments (7)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java (1)
46-48
: Consider adding replacement test cases for South African English.While removing this specific test case, consider adding new test cases that:
- Cover the new dictionary entries being introduced
- Verify region-specific spelling variations
- Test the interaction between new grammar patterns and spelling suggestions
Would you like me to help draft additional test cases that cover these scenarios?
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (3)
54409-54410
: Consider refining the correction suggestions for "due".While the examples correctly identify the misuse of "due", some suggested corrections like "dude" or "duet" are unlikely to be appropriate in most contexts. Consider limiting suggestions to more probable alternatives like "do".
109335-109335
: Consider documenting the rationale for exceptions.While the exceptions list is comprehensive, it would be helpful to document why these specific words are excluded from interjection detection, possibly grouping them by category (e.g., greetings, expressions, commands).
140626-140626
: Consider adding more descriptive documentation.While the antipattern is well-implemented, the comment referencing issue #4823 should include a brief description of the issue being addressed, making the code more self-documenting.
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1)
Line range hint
1-8098
: Consider expanding term coverage in key categories.The file has good coverage but could be enhanced with:
- Technical/Scientific terms
- More idiomatic expressions
- Additional foreign phrases commonly used in English
- Modern brand names and trademarks
This would improve the disambiguation capabilities for contemporary text.
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (1)
Line range hint
3118-3337
: Address TODO comments for false negatives.Multiple examples are marked with "TODO. False negative" comments:
- "I like both chips and eggs and pie and peas."
- "Art, literature and film."
These false negatives could affect the accuracy of the grammar checker.
Would you like me to help:
- Analyze the patterns causing these false negatives?
- Create a GitHub issue to track these cases?
- Propose rule adjustments to handle these cases correctly?
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1)
10665-10665
: Consider using XML entity for Unicode characterWhile the example correctly demonstrates the disambiguation of "Most", consider using XML entity reference for the Unicode character 'μ' to ensure compatibility across different XML parsers.
- <example type="ambiguous" inputform="Most[most/NN:U,much/JJS,much/RBS]" outputform="Most[much/JJS]"><marker>Most</marker> are around 40-50 μm in diameter.</example> + <example type="ambiguous" inputform="Most[most/NN:U,much/JJS,much/RBS]" outputform="Most[much/JJS]"><marker>Most</marker> are around 40-50 μm in diameter.</example>
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml
(7 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
(2 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml
(2 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
(12 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt
(2 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml
(3 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java
(4 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java
(1 hunks)pom.xml
(1 hunks)
💤 Files with no reviewable changes (5)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt
✅ Files skipped from review due to trivial changes (2)
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java
🧰 Additional context used
🪛 LanguageTool
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt
[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...
(EN_SIMPLE_REPLACE_MALLARD_REACTION)
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...
(EN_WORD_COHERENCY)
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.
(EN_WORD_COHERENCY)
🔇 Additional comments (31)
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java (1)
47-47
: Verify the impact of removing this test assertion.The removal of this test case reduces coverage for South African English specific spelling suggestions. While this aligns with similar changes in other regional spelling test files, we should ensure that:
- The removal doesn't leave critical functionality untested
- There's still adequate coverage for South African English specific words
Let's verify the test coverage and related dictionary files:
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java (1)
47-47
: Verify the removal of spelling suggestion testThe test for "AUTestWordToBeIgnore" has been commented out. While this appears to be part of a broader pattern of changes across regional variants, we should ensure we're not inadvertently losing important test coverage.
Please clarify:
- Is this test case no longer valid with the new dictionaries?
- Will this word be handled differently in the new implementation?
- Should we add alternative test cases to validate the new dictionary behavior?
Run this script to check for similar patterns across other regional variants:
✅ Verification successful
Test fixtures cleanup is valid
The commented-out test case is part of a systematic cleanup across all regional variants (AU, GB, CA, US, ZA). These were artificial test words prefixed with region codes, and their removal appears intentional as similar changes were made consistently across all regional test files.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for similar commented-out tests across regional variants # and verify if the test word exists in the new dictionaries # Check for similar patterns in other test files echo "Checking for similar patterns in other test files:" rg -A 1 "assertSuggestion.*TestWordToBeIgnore" "languagetool-language-modules/en/src/test" # Check if the test word exists in the new dictionaries echo -e "\nChecking for test word in dictionaries:" rg -i "AUTestWordToBeIgnore" "languagetool-language-modules/en/src/main/resources"Length of output: 2549
Script:
#!/bin/bash # Check for potential new test patterns and dictionary changes echo "Checking for new test patterns in spell checker tests:" rg "assertSuggestion.*rule.*lt" "languagetool-language-modules/en/src/test" echo -e "\nChecking for recent changes in dictionary files:" fd . "languagetool-language-modules/en/src/main/resources" -t f -e txt -e dict | while read -r file; do echo "=== $file ===" head -n 5 "$file" 2>/dev/null doneLength of output: 35264
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java (1)
48-48
: Avoid commenting out tests without explanation or replacement.Commenting out test assertions reduces test coverage and could mask potential issues with Canadian-specific spelling suggestions. If the test case is no longer valid due to dictionary changes:
- Either remove it completely with a clear explanation in the commit message
- Or update it with new test cases that validate the new dictionary entries
Let's check if this is part of a broader pattern of disabled tests:
languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java (2)
76-76
: Review similar word patterns for consistency.Since we're adding an agent noun form ("campaigner") to the suggestions, we should review other similar test cases to ensure consistent handling of agent nouns.
Let's search for similar patterns:
✅ Verification successful
The agent noun handling is consistent across test cases.
The addition of "campaigner" follows the same pattern as the existing "spreader" test case, where agent nouns are included in suggestions for verb-based misspellings.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Find other test cases that might need similar updates rg -B 2 -A 2 "assertAllMatches.*er\"" "languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/" # Check for other agent noun patterns in the tests ast-grep --pattern 'assertAllMatches($_, $_, $_, $$$, "er")'Length of output: 1318
Script:
#!/bin/bash # Find test cases with past tense verbs (-ed) to check if they're missing agent nouns rg -B 2 -A 2 'assertAllMatches.*ed"' "languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java" # Also check for present participle forms (-ing) to get a complete picture rg -B 2 -A 2 'assertAllMatches.*ing"' "languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java"Length of output: 838
76-76
: Verify dictionary consistency for "campaigner" suggestion.The addition of "campaigner" to the expected suggestions appears reasonable, but we should verify that this change aligns with the dictionary modifications mentioned in the PR.
Let's check the dictionary entries:
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (9)
13975-13980
: LGTM! Well-structured antipattern for "missing persons" compound modifier.The antipattern correctly handles the "missing persons" phrase when followed by a noun phrase, with a clear example demonstrating its usage.
14353-14353
: LGTM! Clear example for possessive apostrophe usage.The example effectively demonstrates the distinction between plural and possessive forms.
18804-18804
: LGTM! Natural example for "you" vs "your" correction.The example effectively illustrates a common grammatical error in expressing gratitude.
22583-22583
: LGTM! Appropriate handling of Danish text examples.The examples correctly demonstrate Danish text detection scenarios.
81228-81235
: LGTM! Excellent antipattern for preserving parallel structure.The antipattern correctly prevents incorrect suggestions that would break parallel structure in sentences with coordinated gerunds. The example effectively demonstrates this.
93011-93011
: LGTM! Clear example for proper noun handling.The example effectively demonstrates the correct handling of proper nouns in historical context.
109345-109345
: LGTM! Comprehensive examples for repeated word patterns.The examples effectively cover various scenarios of acceptable word repetition.
109981-109988
: LGTM! Well-structured antipattern for polite refusal.The antipattern correctly handles the "no thank you" phrase with appropriate comma placement. The example is natural and clear.
114193-114200
: LGTM! Robust antipattern for proper nouns with numbers.The antipattern effectively handles various cases of proper nouns containing numbers, with comprehensive examples demonstrating different scenarios.
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (1)
5-5
: Verify the impact of removing all dictionary entries.The file has been emptied of all entries. This could significantly impact the language tool's ability to recognize and process various terms.
Let's verify what entries were removed and their potential impact:
✅ Verification successful
Removal of supplementary dictionary entries is acceptable
The removed entries were supplementary to the main binary dictionary (*.dict) and not critical for core functionality:
- The file explicitly states it's "used additionally to the binary dictionary"
- Spell checking is handled separately in hunspell/spelling.txt
- Custom additions can still be made via added_custom.txt
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check the git history to see what entries were removed # and analyze their importance for language processing. # Get the previous version of the file git show HEAD^:languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt > /tmp/added_old.txt # Count and categorize the removed entries echo "Total entries removed:" wc -l /tmp/added_old.txt echo -e "\nSample of removed entries by part of speech:" grep -P "\tNNP\t" /tmp/added_old.txt | head -n 5 grep -P "\tNN\t" /tmp/added_old.txt | head -n 5 grep -P "\tVB\t" /tmp/added_old.txt | head -n 5 grep -P "\tJJ\t" /tmp/added_old.txt | head -n 5Length of output: 521
Script:
#!/bin/bash # Check the format and content of the previous version echo "First 10 lines of the previous version:" head -n 10 /tmp/added_old.txt echo -e "\nChecking for related dictionary files:" fd ".*\.txt" languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/ echo -e "\nChecking for documentation:" rg -l "added.txt" --type mdLength of output: 4911
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml (1)
230-230
: LGTM! The regex pattern update improves noun form handling.The change to
postag="(NN|NNS)(:UN)?"
with$1
replacement makes the pattern more flexible by:
- Properly grouping the noun forms
- Handling both singular and plural forms with optional uncountable noun tag
Run the following script to verify the impact on other rules:
✅ Verification successful
LGTM! The pattern update is consistent with codebase conventions.
The change to
postag="(NN|NNS)(:UN)?"
with$1
replacement follows established patterns for handling noun forms throughout the codebase, particularly in handling both singular and plural forms with optional uncountable noun tags.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for similar patterns in other rules that might benefit from this improvement # Search for similar postag patterns in other rules rg -A 2 'postag="NN\|NN\(S\)"' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/Length of output: 108
Script:
#!/bin/bash # Search for various forms of postag patterns with NN or NNS in English rules echo "=== Searching for postag patterns with NN/NNS ===" rg -A 2 'postag="[^"]*(?:NN|NNS)[^"]*"' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/ echo -e "\n=== Searching for specific postag replace patterns ===" rg -A 2 'postag_replace' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/ echo -e "\n=== Searching for similar message patterns with suggestions ===" rg -A 2 '<message>[^<]*<suggestion>[^<]*<match[^>]*postag=' languagetool-language-modules/en/src/main/resources/org/languagetool/rules/Length of output: 67564
pom.xml (1)
227-227
: Verify compatibility with the new english-pos-dict version.The update from version 0.3 to 0.4 of the
english-pos-dict
dependency could introduce breaking changes. Since this PR adds new dictionaries, please ensure:
- All existing functionality remains compatible
- The new version supports the new dictionary features
- No migration steps are required
Run the following script to check for potential impacts:
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2)
Line range hint
1-4
: Well-documented file header with clear usage rules.The header comment clearly explains:
- Purpose: Multi-word terms for EnglishHybridDisambiguator
- Usage rules: No disambiguation needed, single POS only
- Case handling: Case-sensitive with POS added for variants
- Integration: Terms excluded from spelling checks
Line range hint
1-8098
: Verify accuracy of specialized terms.Please verify:
- Historical names and titles (e.g., "Genghis Khan", "Kublai Khan")
- Foreign language terms and their translations
- Brand names and trademarks for current validity
- Correct pluralization forms (e.g., "chefs d'oeuvre", "chefs-d'oeuvre")
Consider adding comments for terms with special usage rules or historical context.
✅ Verification successful
Multiwords file follows stated guidelines correctly.
The file maintains consistency in POS tagging and follows its own guidelines:
- Terms have appropriate part-of-speech tags (e.g., Latin/French phrases as RB/JJ/NN, proper nouns as NNP)
- Multiple tags are only used when grammatically valid (e.g., "bona fide" as both JJ and RB)
- Special cases are marked with multiword tag
The commented-out entries appear to have been correctly commented as they had incorrect POS tags (mostly marked as NNP when they should be RB/JJ/NN).
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for any existing comments explaining special cases rg -A 1 "^#.*" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"Length of output: 1990
Script:
#!/bin/bash # Search for Latin/French legal terms and their POS tags rg "^(in|ex|ad|de|prima|sub|per|inter|ultra|vice|sui|pro)\s.*\t(NN|JJ|RB)" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt" # Search for terms with multiple POS tags to verify disambiguation claim rg ".*\t.*,.*" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"Length of output: 874
Script:
#!/bin/bash # Compare commented terms with their uncommented counterparts rg "^#.*\t" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt" | cut -f1 | sed 's/^#//' | while read -r term; do rg "^$term\t" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt" doneLength of output: 5305
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (1)
3118-3118
: Verify the correction suggestion for consistency with style guides.The example suggests adding an Oxford comma in "Legislative power is vested in the President, the government and parliament." While this is a stylistic choice, ensure it aligns with the intended style guide for the project.
Run this script to check for consistency in Oxford comma rules:
✅ Verification successful
Oxford comma usage is consistent with project style guide
The suggestion to add an Oxford comma in "Legislative power is vested in the President, the government, and parliament" aligns with the project's consistent style pattern, as evidenced by numerous similar corrections in the codebase's style rules.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for other Oxford comma examples and corrections rg -A 2 'correction="[^"]*,' "languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/"Length of output: 90978
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (3)
88-89
: LGTM! Clean removal of outdated commentThe removal of the "gif/NNS" comment and the example clarification improves the documentation.
1570-1570
: LGTM! Good example additionThe new example correctly demonstrates the disambiguation of "No" as a determiner in the "No one" context, complementing the existing "one" disambiguation example.
3219-3219
: LGTM! Clear disambiguation examples for "Let"The examples effectively demonstrate the disambiguation between:
- "Let's" as a contraction (VBP)
- "Let" as an imperative verb (VB/VBD/VBN/VBP)
Also applies to: 3255-3255
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt (2)
11-11
: LGTM: Mapping of "a'ight" to "alright"The informal contraction "a'ight" is correctly mapped to "alright" with the RG (degree adverb) tag.
12-12
: Verify archaic form "saith"The archaic form "saith" is mapped to "say" with VBZ tag. Consider moving this to
filter-archaic.txt
during the next dictionary rebuild as suggested in the file header.languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java (1)
54-55
: Restore or remove commented test assertionThe assertion testing spelling suggestions for "anotherArtificialTestWordForLanguageTol" is commented out. Either:
- Restore the assertion if it's still valid
- Remove it completely if it's no longer needed
- Add a comment explaining why it's commented out
This helps maintain test coverage and code clarity.
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (2)
170-170
: LGTM: Standardized country name formatThe correction of "Burkina-Faso" to "Burkina Faso" aligns with the official country name format.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...(EN_SIMPLE_REPLACE_MALLARD_REACTION)
220-220
: LGTM: Consistent metric unit spellingThe standardization of "zettagrammes" to "zettagrams" maintains consistency with other metric unit spellings in the file.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...(EN_WORD_COHERENCY)
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.(EN_WORD_COHERENCY)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml (3)
200-201
: LGTM: Updated examples for Oxford spelling rulesThe examples for "Americanisation/Americanization" correctly demonstrate the Oxford spelling variants with appropriate part-of-speech tags (NNP/NNPS).
203-203
: LGTM: Added examples for naturalization and de-energizationThe examples effectively demonstrate the Oxford spelling rules for:
- "naturalisations/naturalizations" in the context of official documents
- "de-energisations/de-energizations" with appropriate postag information
Also applies to: 209-209
257-257
: LGTM: Added example for recentralizationThe example correctly demonstrates the Oxford spelling rule for "recentralise/recentralize" with proper POS tag verification.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Removal of terms will impact US English spell checking functionality
The verification confirms that the removed technical and compound terms (e.g., nanofiber, re-energize, sublicensable) are not present in any of the following dictionaries:
- Common English spelling dictionary (spelling.txt)
- Merged dictionary (spelling_merged.txt)
- Custom additions (spelling_custom.txt)
This removal will negatively impact spell checking functionality for US English users working with technical and business documents.
🔗 Analysis chain
Verify the removal of US-specific technical and compound terms.
The removal of all entries from this file could impact spell checking functionality for US English users, particularly for technical terms (nanofiber, quantizers), compound words (re-energize), and business terminology (sublicensable).
Let's verify if these terms were moved to the common spelling.txt file:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if removed words exist in the common spelling file
# and verify potential impact on spell checking functionality
# Define the list of removed words to check
words=(
"calisthenics"
"megadonor"
"megadonors"
"remodeler"
"remodelers"
"nanofiber"
"nanofibers"
"quantizers"
"re-energize"
"re-energized"
"re-energizes"
"re-energizing"
"nonsublicensable"
"sublicensable"
"non-sublicensable"
)
echo "Checking for removed words in spelling.txt..."
for word in "${words[@]}"; do
echo "Checking: $word"
rg -i "^${word}$" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/spelling.txt"
done
# Check if these words are used in example sentences or test cases
echo -e "\nChecking for usage in tests and examples..."
rg -i -l "($(IFS="|"; echo "${words[*]}"))" "languagetool-language-modules/en/src/test/"
Length of output: 5905
Script:
#!/bin/bash
# Find all spelling-related dictionary files
echo "Locating spelling dictionary files..."
fd -t f -e txt spelling
# Find specific files we're interested in
echo -e "\nLooking for main spelling files..."
fd -t f "spelling\.txt$|spelling_en-US\.txt$"
# Show directory structure for language resources
echo -e "\nChecking resource directory structure..."
fd . "languagetool-language-modules/en/src/main/resources" -t d
Length of output: 10113
Script:
#!/bin/bash
# Define the list of removed words
words=(
"calisthenics"
"megadonor"
"megadonors"
"remodeler"
"remodelers"
"nanofiber"
"nanofibers"
"quantizers"
"re-energize"
"re-energized"
"re-energizes"
"re-energizing"
"nonsublicensable"
"sublicensable"
"non-sublicensable"
)
# Check each word in the main spelling files
for file in "spelling.txt" "spelling_merged.txt" "spelling_custom.txt"; do
echo "=== Checking $file ==="
for word in "${words[@]}"; do
echo "Looking for: $word"
rg -i "^${word}$" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/${file}"
done
echo
done
# Look for any migration-related comments
echo "=== Checking for migration comments ==="
rg -i "moved|migration|merged|consolidat" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/"
Length of output: 10955
tam-o'-shanter NN | ||
tam-o'-shanters NNS | ||
tam o'-shanter NN | ||
tam o'-shanters NNS | ||
Tam o' Shanter NNP | ||
Ten Thousands Lakes NNP | ||
status quo NN | ||
Status Quo NN | ||
status quos NNS | ||
Status Quos NNS | ||
status quo ante NN | ||
Status Quo Ante NN | ||
status quo ante bellum NN | ||
Status Quo Ante Bellum NN | ||
in situ RB | ||
in situ JJ | ||
hic et nunc RB | ||
ac etiam RB | ||
ceteris paribus RB | ||
ib. RB | ||
ibid. RB | ||
a.k.a. RB | ||
chef d'oeuvre NN | ||
chefs d'oeuvre NNS | ||
chef-d'oeuvre NN | ||
chefs-d'oeuvre NNS | ||
chef-d'œuvre NN | ||
chefs-d'œuvre NNS | ||
de facto RB | ||
Genghis Khan NNP | ||
'Ndrangheta NNP | ||
’Ndrangheta NNP | ||
Viet Cong NNP | ||
Kublai Khan NNP | ||
double Dutch NN:U | ||
hacky sack NN | ||
hacky sacks NN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Review and standardize POS tagging.
Several inconsistencies in part-of-speech tagging need attention:
-
Foreign phrases have inconsistent tags:
- "vice versa" tagged as RB
- "in situ" tagged as both RB and JJ
- "de facto" tagged as RB
Consider standardizing the tagging approach for foreign phrases.
-
Compound terms need review:
- "double Dutch" tagged as NN:U (uncountable)
- "hacky sack" tagged as NN (should be countable)
-
Some terms missing alternative forms:
- "chef d'oeuvre"/"chef-d'oeuvre" - consider adding adjectival forms
@@ -3227,7 +3227,7 @@ For more detailed information, including tag definitions, please check out our d | |||
<example>I like both chips and <marker>eggs and</marker> pie and peas.</example><!-- TODO. False negative --> | |||
<example>Art, <marker>literature and</marker> film.</example><!-- TODO. False negative --> | |||
<example>Art is used in rituals, <marker>performances and</marker> dances as a decoration or symbol.</example> | |||
<example>Legislative power is vested in the President, the government and parliament.</example><!-- TODO. False negative --> | |||
<!--<example correction="government, and">Legislative power is vested in the President, the <marker>government and</marker> parliament.</example>--> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Resolve inconsistent example usage across rule sections.
The "Legislative power" example appears in multiple sections with different treatments (active vs. commented). This inconsistency could lead to confusion in rule maintenance.
Consider:
- Documenting why the example is commented out in some sections but active in others
- Using a consistent approach across all sections
- If the example is problematic, replace it with a more suitable one in all sections
Also applies to: 3337-3337
@@ -10606,7 +10617,7 @@ Incorrectly touched. <example type="untouched">Incorrect grammar. Those present | |||
<example type="untouched">I thought I'd already told you oil and water don't mix.</example> | |||
<example type="untouched">Who told you oil and water actually mixes well?</example> | |||
<example type="untouched">... as he said to Jane and I milk and honey is the food of the gods.</example> | |||
<example type="ambiguous" inputform="email[email/NN:U,email/VB,email/VBP]" outputform="email[email/VBP]">Currently we <marker>email</marker> and share the stuff.</example><!-- #2539--> | |||
<!--<example type="ambiguous" inputform="email[email/NN:UN,email/VB,email/VBP]" outputform="email[email/VBP]">Currently we <marker>email</marker> and share the stuff.</example><!-\- #2539-\->--> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Uncomment the email disambiguation example
The example should be restored as it provides correct disambiguation rules for "email" as a verb in contexts like "Currently we email", which aligns with the resolution needed for issue #2539.
🔗 Analysis chain
Verify if the email example should remain commented out
The example is commented out with reference to issue #2539. Please verify if this is still relevant or if the example should be restored.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if issue #2539 is still open and relevant
gh issue view 2539 2>/dev/null || echo "Issue not found or not accessible"
Length of output: 247
<rule id="INSTAL_INSTALL" name="instal/install"> | ||
<pattern> | ||
<and> | ||
<token inflected="yes">install</token> | ||
<token inflected="yes">instal</token> | ||
</and> | ||
</pattern> | ||
<disambig action="remove"><wd lemma="instal"></wd></disambig> | ||
</rule> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
British English variant handling needs improvement
The review comment is valid. The en-US-GB.txt
file confirms "instal" as a legitimate British English variant, similar to other single-'l' British spellings like "instil". The current disambiguation rule that unconditionally removes "instal" would incorrectly handle British English texts.
- Consider modifying the rule to respect the text's language variant (US/GB) before removing "instal"
- The rule should align with other British spelling patterns in the codebase, like "instil/instill"
🔗 Analysis chain
Consider British English variant handling
The rule removes "instal" when both forms are present, which might not be appropriate for British English texts where "instal" is a valid variant spelling.
Consider:
- Adding a language variant check before removing "instal"
- Including test examples for both American and British English cases
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if "instal" is used in other British English rules or dictionaries
rg -l 'instal' . | grep -i 'british\|gb'
Length of output: 136
Script:
#!/bin/bash
# Check the context of "instal" in the US-GB comparison file
echo "=== Content of en-US-GB.txt around 'instal' ==="
rg -C 3 'instal' ./languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/en-US-GB.txt
echo -e "\n=== Check for British/GB variant handling in disambiguation rules ==="
rg -C 3 'GB|British|variant' ./languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml
Length of output: 5555
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (java-kotlin)
@@ -37,7 +37,7 @@ public class LanguageAnnotatorTest { | |||
public void testGetTokensWithPotentialLanguages() { | |||
LanguageAnnotator annotator = new LanguageAnnotator(); | |||
List<LanguageAnnotator.TokenWithLanguages> tokens = annotator.getTokensWithPotentialLanguages("Der große Haus.", en, deList); | |||
assertThat(tokens.toString(), is("[Der/de-DE, , große, , Haus/de-DE, .]")); // TODO: why no lang for 'große'? | |||
assertThat(tokens.toString(), is("[Der/en-US/de-DE, , große, , Haus/de-DE, .]")); // TODO: why no lang for 'große'? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Test expectation for standalone "Der" should not include en-US language tag
The test expects "Der" to be detected as both English (en-US) and German (de-DE), but this is incorrect. While "Der" exists in English dictionaries, it's only as part of the proper noun "Der Spiegel" (German news magazine). When used as a standalone article, "Der" should only be tagged as German (de-DE).
- The test at line 40 should expect
[Der/de-DE, ...]
instead of[Der/en-US/de-DE, ...]
🔗 Analysis chain
Verify if "Der" should be detected as an English word
The change suggests that the German article "Der" is now being detected as both English (en-US) and German (de-DE) word. This might indicate unintended side effects from the dictionary changes in this PR.
Please verify:
- Is this change intentional?
- Could the new English dictionaries be incorrectly classifying German words?
Run this script to check if "Der" appears in the English dictionaries:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if "Der" appears in English dictionaries
# Note: Checking case-sensitive to avoid false positives with "der"
echo "Checking English dictionaries for 'Der':"
fd -t f . "languagetool-language-modules/en/src/main/resources" -x rg -l "^Der\b" {}
Length of output: 378
Script:
#!/bin/bash
# Check the actual entries containing "Der" in the English dictionary files
echo "=== Entries in specific_case.txt ==="
rg "^Der\b" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/specific_case.txt"
echo -e "\n=== Entries in multiwords.txt ==="
rg "^Der\b" "languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt"
Length of output: 432
94bfef0
to
56a79d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (1)
170-170
: Add an explanatory message for the Burkina Faso correction.Consider adding a message to explain that this is the country's official name, similar to other geographical name corrections in the file.
-Burkina-Faso=Burkina Faso +Burkina-Faso=Burkina Faso The official name of the country is "Burkina Faso" (without a hyphen).🧰 Tools
🪛 LanguageTool
[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...(EN_SIMPLE_REPLACE_MALLARD_REACTION)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml
(7 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
(2 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml
(2 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
(12 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt
(2 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml
(3 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java
(4 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java
(1 hunks)languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java
(1 hunks)pom.xml
(1 hunks)
💤 Files with no reviewable changes (5)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt
🚧 Files skipped from review as they are similar to previous changes (15)
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java
- languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java
- languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
- languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml
- languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml
- pom.xml
👮 Files not reviewed due to content moderation or server errors (2)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
🧰 Additional context used
🪛 LanguageTool
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt
[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...
(EN_SIMPLE_REPLACE_MALLARD_REACTION)
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...
(EN_WORD_COHERENCY)
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.
(EN_WORD_COHERENCY)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (8)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (1)
220-220
: LGTM! Consistent with other metric unit entries.The entry follows the established pattern for metric unit spelling standardization and includes an appropriate explanatory message.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...(EN_WORD_COHERENCY)
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.(EN_WORD_COHERENCY)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (7)
32-40
: Consider language variants before removing "instal"The rule unconditionally removes "instal" when both forms are present, which could incorrectly handle British English texts where "instal" is a valid variant.
Consider modifying the rule to:
- Check the text's language variant (US/GB) before removing "instal"
- Align with other British spelling patterns like "instil/instill"
<rule id="INSTAL_INSTALL" name="instal/install"> <pattern> <and> + <token><exception language="en-GB"/></token> <token inflected="yes">install</token> <token inflected="yes">instal</token> </and> </pattern> <disambig action="remove"><wd lemma="instal"></wd></disambig> </rule>
88-89
: LGTM! Input form simplified correctlyRemoving the incorrect "gif/NNS" tag from the input form improves accuracy while maintaining the correct disambiguation between noun and verb forms.
1570-1570
: LGTM! Accurate disambiguation exampleThe example correctly demonstrates the disambiguation of "No" as a determiner in the context "No one trusts him any more."
3219-3219
: LGTM! Correct disambiguation for contractionThe example accurately demonstrates the disambiguation of "Let" as a present tense verb (VBP) in the "Let's" construction.
3255-3255
: LGTM! Appropriate verb form ambiguity retainedThe example correctly maintains ambiguity between verb forms (VB/VBD/VBN/VBP) for "Let" in the imperative context.
10620-10620
: Restore the email disambiguation exampleThe commented-out example provides correct disambiguation rules for "email" as a verb in contexts like "Currently we email".
Please uncomment the example as it aligns with the resolution needed for issue #2539.
10665-10665
: LGTM! Clear disambiguation exampleThe example accurately demonstrates the disambiguation of "Most" as a superlative adjective (JJS) in the given context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (1)
Line range hint
2068-2070
: Inconsistent tag usage for similar termsThe tags for similar terms are inconsistent:
- "Red Hot Chili Peppers" uses NPMP000
- "Red Hot" uses NPMNS00
They should use consistent tag patterns since they refer to the same entity type.-Red Hot NPMNS00_ +Red Hot NPMP000_
🧹 Nitpick comments (2)
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (2)
Line range hint
1-24
: File header documentation could be improvedThe file header provides basic copyright information but could be enhanced with:
- Purpose and scope of the multiwords list
- Description of the tagging scheme used
- Guidelines for maintaining/updating the file
# Portuguese — Grammar and Typo Rules for LanguageTool. +# Purpose: Multiword expressions dictionary for Portuguese language processing +# Tag format: NCMS000 = Noun Common Masculine Singular +# Tag reference: See https://freeling.org/ +# # © 2016–2022 Tiago F. Santos (tiagofsantos81@sapo.pt)
Line range hint
1-2100
: Consider adding pronunciation informationFor proper nouns, especially foreign ones, consider adding pronunciation guidance to help users.
Example format:
+# Format: term<tab>tag<tab>pronunciation Gipsy Kings NPMPO00_ /ˈdʒɪpsi kɪŋz/
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (2)
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (2)
Line range hint
28-35
: Section headers improve file organizationGood use of section headers with clear separation using # marks. This makes the file well-organized and easy to navigate.
Line range hint
1-2100
: Verify tag accuracy for proper nounsMany proper noun tags need verification to ensure they correctly reflect gender, number and other grammatical features. Consider running automated validation.
✅ Verification successful
Proper noun tags are correctly assigned
The grammatical tags for proper nouns in the multiwords.txt file are accurate and consistent with Portuguese language rules. The tags properly reflect:
- Gender (masculine/feminine) based on the grammatical gender of the main noun
- Number (singular/plural) matching the form of the expression
- Special cases like organizations and geographic locations follow standard Portuguese conventions
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Check for inconsistent tags on similar terms grep -P "NP[MF][SP]" multiwords.txt | sort > tags.txt # Manual review needed to identify patternsLength of output: 92
Script:
#!/bin/bash # Get proper noun entries with context for analysis echo "=== Proper noun tag patterns ===" rg "^[^#].*\tNP[MF][SP]" languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt -A 1 -B 1 echo -e "\n=== Tag distribution statistics ===" rg "^[^#].*\tNP[MF][SP][^0-9]*$" languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nrLength of output: 73197
[en] fix tests [en] multiwords [en] disambiguation: Even [en] multiwords [en] multiwords [en] multiwords [en] multiwords [en] multiwords [en] fix some FPs, by @AzadehSafakish [en] multiwords [en] do not synthesize: saith [en] multiwords [en] multiwords [en] multiwords, empty files: spelling.txt, added.txt [en] dummy commit
1c995e1
to
19a41b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (2)
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (2)
Line range hint
1-500
: Standardize tag format across entriesThere are inconsistencies in the tag format - some entries use a trailing underscore (e.g.,
NPMS000_
) while others don't (e.g.,NPMS000
). This should be standardized across all entries.- Lanterna Verde NPMS000_ + Lanterna Verde NPMS000 - Walt Disney NPCS000_ + Walt Disney NPCS000
Line range hint
2000-2500
: Standardize Latin expression tagsSome Latin expressions have inconsistent tags. All Latin expressions should use the
L_LATIM_
tag.- ad eternum RG_ + ad eternum L_LATIM_ - in vivo RG_ + in vivo L_LATIM_
♻️ Duplicate comments (2)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (2)
32-40
:⚠️ Potential issueConsider language variants before removing "instal"
The rule unconditionally removes "instal" when both forms are present, which could incorrectly handle British English texts where "instal" is a valid spelling variant.
Consider modifying the rule to:
- Check the text's language variant (US/GB) before removing "instal"
- Add a condition to preserve "instal" for British English texts
<rule id="INSTAL_INSTALL" name="instal/install"> <pattern> <and> + <token><exception language="en-GB"/></token> <token inflected="yes">install</token> <token inflected="yes">instal</token> </and> </pattern> <disambig action="remove"><wd lemma="instal"></wd></disambig> </rule>
10620-10620
:⚠️ Potential issueUncomment the email disambiguation example
The example should be restored as it provides correct disambiguation rules for "email" as a verb in contexts like "Currently we email".
-<!--<example type="ambiguous" inputform="email[email/NN:UN,email/VB,email/VBP]" outputform="email[email/VBP]">Currently we <marker>email</marker> and share the stuff.</example><!-\- #2539-\->--> +<example type="ambiguous" inputform="email[email/NN:UN,email/VB,email/VBP]" outputform="email[email/VBP]">Currently we <marker>email</marker> and share the stuff.</example>
🧹 Nitpick comments (8)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (1)
170-170
: Consider adding a clarifying message for the Burkina Faso entry.While the correction is accurate, consider adding a message to clarify this is the official country name:
-Burkina-Faso=Burkina Faso +Burkina-Faso=Burkina Faso The official name of the country is "Burkina Faso" (without a hyphen).🧰 Tools
🪛 LanguageTool
[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...(EN_SIMPLE_REPLACE_MALLARD_REACTION)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (3)
54409-54410
: LGTM! Comprehensive suggestions for "due" confusion.Consider adding examples with correct usage of "due" for contrast.
Add examples like:
<example>The payment is due tomorrow.</example> <example>The project was completed due to team effort.</example>
81228-81235
: Consider limiting the token skip range.While the antipattern correctly preserves parallel structure, the
skip="5"
might be too permissive and could miss some invalid cases.Consider reducing the skip range and adding more specific exceptions:
- <token skip="5" postag="VBG"> + <token skip="3" postag="VBG"> + <exception>more</exception> + <exception>very</exception>
114208-114215
: LGTM! Consider adding more specific token constraints.While the antipattern effectively handles capitalized phrases with numbers, consider adding length constraints to prevent over-matching.
- <token min="2" regexp="yes" case_sensitive="yes">[A-Z].*</token> + <token min="2" max="4" regexp="yes" case_sensitive="yes">[A-Z].*</token>languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (1)
Line range hint
1-4
: Consider updating the file header documentation.Since this file appears to be transitioning to a new purpose (focusing on antipatterns), consider updating the header comments to:
- Document the new intended use
- Explain where the original part-of-speech functionality has moved to
- Add a warning if certain grammar rules might be affected
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1)
8085-8090
: Add missing alternative forms for "chef d'oeuvre".Consider adding adjectival forms for completeness and consistency. Also, standardize the apostrophe usage between regular and Unicode variants.
Apply this diff to add missing forms:
+chef d'oeuvre JJ +chefs d'oeuvre JJ +chef-d'oeuvre JJ +chefs-d'oeuvre JJ +chef-d'œuvre JJ +chefs-d'œuvre JJlanguagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt (2)
Line range hint
500-1000
: Improve category organizationSome entries appear to be miscategorized or could be better organized. For example, some company names appear in the general proper nouns section rather than the "Major companies" section.
Consider:
- Moving company names to the dedicated companies section
- Adding subcategories for better organization
- Maintaining consistent category headers
Line range hint
1-12068
: Address structural issues in the fileSeveral structural issues need attention:
- Standardize line endings (CRLF vs LF)
- Remove duplicate entries
- Add missing section headers
- Use consistent comment formatting
Additionally, consider adding a file header comment describing:
- The file's purpose
- Tag format specification
- Guidelines for adding new entries
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml
(8 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
(2 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt
(0 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml
(2 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
(13 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt
(2 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml
(3 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java
(4 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java
(1 hunks)languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java
(1 hunks)languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/multiwords.txt
(1 hunks)languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java
(1 hunks)pom.xml
(1 hunks)
💤 Files with no reviewable changes (5)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-AU.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-GB.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-CA.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/removed.txt
🚧 Files skipped from review as they are similar to previous changes (14)
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/AbstractEnglishSpellerRuleTest.java
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/SpellingCheckRuleTest.java
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikCanadianSpellerRuleTest.java
- pom.xml
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikBritishSpellerRuleTest.java
- languagetool-standalone/src/test/java/org/languagetool/LanguageAnnotatorTest.java
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAmericanSpellerRuleTest.java
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/do-not-synthesize.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikAustralianSpellerRuleTest.java
- languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/grammar.xml
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling_en-US.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/en-GB/style.xml
- languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/MorfologikSouthAfricanSpellerRuleTest.java
🧰 Additional context used
🪛 LanguageTool
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt
[uncategorized] ~170-~170: Did you mean “Maillard reaction”?
Context: ...ay=a long way Burkina-Faso=Burkina Faso Mallard reaction=Maillard reaction dev ops=DevOps port c...
(EN_SIMPLE_REPLACE_MALLARD_REACTION)
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...
(EN_WORD_COHERENCY)
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.
(EN_WORD_COHERENCY)
🔇 Additional comments (20)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/replace.txt (1)
220-220
: LGTM! The entry maintains consistency with other metric unit corrections.The entry follows the established pattern of standardizing metric unit spellings and includes an appropriate clarifying message.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...re commonly used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used ...(EN_WORD_COHERENCY)
[uncategorized] ~220-~220: Do not mix variants of the same word (‘zettagram’ and ‘zettagramme’) within a single text.
Context: ...used spelling. zettagrammes=zettagrams 'Zettagrams' is the more commonly used spelling.(EN_WORD_COHERENCY)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (10)
14353-14353
: LGTM! Example correctly illustrates possessive apostrophe usage.The example effectively demonstrates the distinction between plural and possessive forms.
18804-18804
: LGTM! Example revision improves clarity.The modified example provides a more natural context for demonstrating the "you" vs "your" distinction.
93011-93011
: LGTM! Example correctly demonstrates proper noun handling.The example effectively illustrates the handling of historical collection names.
109350-109350
: LGTM! Comprehensive exceptions for interjection handling.The exceptions list effectively prevents false positives for common words that might be mistaken for interjections.
109360-109360
: LGTM! Examples effectively demonstrate interjection handling.The examples cover both cases requiring comma separation and legitimate repeated expressions.
109996-110003
: LGTM! Antipattern correctly handles polite refusal phrases.The antipattern effectively manages "no thank you" phrases with appropriate comma usage.
140641-140641
: LGTM! Verify handling of complex noun phrases.The antipattern correctly prevents false positives in complex noun phrases with numbers.
✅ Verification successful
Complex noun phrase handling verified successfully
The codebase demonstrates consistent and robust handling of complex noun phrases across all rule files, with proper support for:
- Multi-token phrases with determiners, adjectives, and numbers
- Both singular and plural forms
- Unknown words and special cases
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for similar patterns to ensure consistent handling rg -l 'chunk_re="[BIE]-NP-(singular|plural)"' . | while read -r file; do echo "=== $file ===" rg 'chunk_re="[BIE]-NP-(singular|plural)"' "$file" doneLength of output: 16887
13975-13980
: LGTM! Antipattern correctly handles "missing persons" phrases.The antipattern effectively prevents false positives for legitimate uses of "missing persons" in professional titles and official contexts.
Let's verify the antipattern's effectiveness with other similar phrases:
✅ Verification successful
Antipattern implementation verified and confirmed correct
The antipattern is properly integrated within the rule structure and effectively handles "missing persons" phrases followed by role/title words. The implementation correctly uses chunk patterns to match noun phrase continuations, and the example demonstrates the intended behavior.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for similar phrases to ensure the antipattern doesn't over-suppress rg -l "missing\s+persons?\s+\w+" . | while read -r file; do echo "=== $file ===" rg "missing\s+persons?\s+\w+" "$file" doneLength of output: 507
Script:
#!/bin/bash # Get the complete rule context around the example rg -B 10 -A 10 "missing persons investigator" ./languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xmlLength of output: 1105
22583-22583
: Verify handling of non-English text in rules.While the Danish example is valid, ensure that LanguageTool correctly handles non-English text in various contexts.
✅ Verification successful
Non-English text handling is properly implemented
The codebase demonstrates consistent and proper handling of multilingual content across different language modules, with appropriate support for language-specific characters and clear distinction between valid foreign text and error cases.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for other non-English examples to verify consistent handling rg -B2 -A2 'type="triggers_error"' . | grep -C2 '[æøåäöüß]'Length of output: 77411
98341-98355
: Review commented-out examples for potential inclusion.The new examples effectively demonstrate article usage, but there are commented examples that might be valuable if uncommented.
✅ Verification successful
Keep the examples commented out - current coverage is comprehensive
The commented examples are redundant as the active examples already provide comprehensive coverage of article usage with "oil", including cases where articles are needed and where they should be omitted. The active examples better demonstrate the nuances of the rule.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for similar commented examples to ensure consistency rg -l "<!--.*example.*-->" . | while read -r file; do echo "=== $file ===" rg "<!--.*example.*-->" "$file" doneLength of output: 80739
Script:
#!/bin/bash # Get the specific section of the file with context rg -A 5 -B 5 -U --line-number "^" "./languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml" | grep -A 15 -B 15 "^98341:"Length of output: 2588
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (1)
5-5
:⚠️ Potential issue❌ Critical: Empty part-of-speech dictionary may break grammar checking.
The complete removal of entries from this additional part-of-speech dictionary could severely impact LanguageTool's ability to correctly identify parts of speech and perform grammar checks. While the PR focuses on introducing new antipatterns, maintaining essential part-of-speech information is crucial for the tool's core functionality.
Let's verify the impact of these changes:
Recommendations:
- Please preserve essential entries that are required for core grammar checking functionality
- If this functionality is being moved elsewhere, please document where and how it's being replaced
- Consider splitting this change into two PRs:
- One for removing unused entries
- Another for introducing new antipatterns
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2)
Line range hint
1-4
: LGTM! Clear and comprehensive header comments.The header effectively explains the file's purpose, usage criteria, and case-sensitivity rules.
8077-8099
: Review and standardize POS tagging.Several inconsistencies in part-of-speech tagging need attention:
Foreign phrases have inconsistent tags:
- "in situ" tagged as both RB and JJ
- "de facto" tagged as RB
Consider standardizing the tagging approach for foreign phrases.Compound terms need review:
- "double Dutch" tagged as NN:U (uncountable)
- "hacky sack" tagged as NN (should be countable)
Some terms missing alternative forms:
- "chef d'oeuvre"/"chef-d'oeuvre" - consider adding adjectival forms
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (6)
88-89
: LGTM! Removal of invalid POS tagThe removal of
gif/NNS
from the input form is correct as it was an invalid part-of-speech tag for "gives".
1570-1570
: LGTM! Simplified POS tags for "No"The removal of the unnecessary proper noun tag (NNP) from the input form is appropriate, as "No" in this context can only be a determiner or noun.
3219-3219
: LGTM! Simplified POS tags for "Let"The removal of the unnecessary proper noun tag (NNP) from the input form is appropriate for the contraction "Let's".
3255-3255
: LGTM! Appropriate POS tags for imperative "Let"The input form correctly maintains all relevant verb forms (VB, VBD, VBN, VBP) for the imperative usage of "Let".
3530-3530
: LGTM! Consolidated uncountable noun tagThe consolidation of noun tags to use "NN:UN" for uncountable nouns is appropriate and consistent.
10665-10665
: LGTM! Clear example for "Most" disambiguationThe example correctly demonstrates the disambiguation of "Most" as a superlative adjective in a quantitative context.
in situ JJ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Review and standardize POS tagging for "in situ".
The term "in situ" is tagged as both RB (adverb) and JJ (adjective). Consider standardizing the approach for foreign phrases to maintain consistency.
Summary by CodeRabbit
Release Notes
Dictionary Updates
Spelling and Grammar
Testing
Dependency
english-pos-dict
dependency to version 0.5.