Fix tokenisation problem with n-dash #196

Open
mcollardanuy opened this issue Mar 16, 2023 · 0 comments · Fixed by #201
Assignees: mcollardanuy
Labels: bug, named entity recognition

@mcollardanuy (Collaborator) commented:

The n-dash is a very frequent character in historical newspapers, but the NER pipeline does not handle it well: `Plymouth—Kingston` is parsed as `"Plymouth" (B-LOC), "—" (B-LOC), "Kingston" (B-LOC)`, rather than the n-dash being treated as a word separator.
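To illustrate the behaviour being asked for, here is a minimal Python sketch of one possible workaround. It is not the fix from the linked pull request, and it does not use the project's API: `pad_dashes`, `relabel_dashes`, and the `DASHES` set are hypothetical names chosen for this example. The idea is to pad dashes with spaces before tokenisation, and to force the `O` (outside) label on any token made up only of dashes after tagging.

```python
import re

# Dash-like characters to treat as word separators. This set is an assumption
# for the sketch: en dash (U+2013), em dash (U+2014), horizontal bar (U+2015).
DASHES = "\u2013\u2014\u2015"
_DASH_RE = re.compile(f"\\s*([{DASHES}])\\s*")


def pad_dashes(text: str) -> str:
    """Surround every dash with single spaces so that a whitespace-aware
    tokeniser sees it as a separate token."""
    return _DASH_RE.sub(r" \1 ", text).strip()


def relabel_dashes(tagged):
    """Given (token, label) pairs, set the label to 'O' for tokens that
    consist only of dash characters; other pairs are returned unchanged."""
    return [
        (tok, "O" if tok and all(ch in DASHES for ch in tok) else label)
        for tok, label in tagged
    ]


if __name__ == "__main__":
    print(pad_dashes("Plymouth\u2014Kingston"))
    predictions = [("Plymouth", "B-LOC"), ("\u2014", "B-LOC"), ("Kingston", "B-LOC")]
    print(relabel_dashes(predictions))
```

With the example from this issue, padding yields three whitespace-separated tokens, and relabelling leaves `Plymouth` and `Kingston` as `B-LOC` while the dash becomes `O`. Whether the dash should be handled before tokenisation, after tagging, or both depends on the tokeniser the pipeline uses.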

mcollardanuy self-assigned this on Mar 16, 2023
mcollardanuy added the bug and named entity recognition labels on Mar 16, 2023
mcollardanuy linked a pull request (3 tasks) on Mar 22, 2023 that will close this issue