Find overlapping entities in a domain by type #6

Closed
AmateurAcademic opened this issue Sep 3, 2022 · 3 comments
Assignees: AmateurAcademic
Labels: enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed), question (Further information is requested)

Comments

AmateurAcademic (Collaborator) commented Sep 3, 2022

Description

When creating or maintaining an NLU dataset, entity types very frequently overlap. These overlaps confuse entity extraction models. Many entity extractors do use the context of the utterance in some form to label entity words and types, but overlaps are still often a problem.

To solve this, we should collect all the entities and their types, look for duplicates, and try to consolidate them as much as possible.

This issue relates to Flow for entity refinement #4 and must be solved before entity refinement can be completed.

User stories

As a user refining entity data, I want to:

  • find the overlapping entity types so that I can review them,
  • merge the reviewed entries back into my original data set.

Problems

Data types

One problem here is the data types. In the notebook for entity refinement, the domain_df (and other dfs!) uses a column called entities, where the entities were extracted using EntityExtractor.extract_entities. This creates entries in the column like so:
[{'type': 'time', 'words': ['five', 'am']}, ...].

One shouldn't simply place lists of dictionaries into pandas data frame columns. But instead of trying to figure out a smarter way to do this, I focused on getting it done, because the resulting clean data set is a higher priority than avoiding some crappy technical debt.

Question

However, we now need to search through this column to find duplicates. So the question is: how do we do this well given the current data structure?

Partial or full matches

Would it be enough to match only exact duplicates of the entity words, or would it be beneficial to also compare individual words for partial matches? We could even remove stop words to avoid some uninteresting matches. Perhaps this is better suited for the next version.

Question

Should we focus on just full matches as an MVP, or do we need to also go for partial matches and remove stop words?

Solutions

These are just the solutions I will go with for now; perhaps they can be refactored in the future. As stated, getting the cleaned-up data out is more important than nice code! We are going quick and dirty here.

Data types

I propose we make a quick and dirty version where we create a new data frame (see the sketch after this list) that contains:

  • index: just a normal unique index for every entry, as pandas usually does.
  • id: the original index from domain_df; several entries can share the same id, since one utterance can contain multiple entities and entity types!
  • entity_type: the type field from the original data frame's entities column.
  • entity_words: the words from the entity's words list, joined with spaces.
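A minimal sketch of that flattening step, assuming domain_df['entities'] holds lists of dicts shaped like the example above (the helper name flatten_entities is just illustrative, not existing code):

```python
import pandas as pd

def flatten_entities(domain_df: pd.DataFrame) -> pd.DataFrame:
    """Explode the entities column into one row per extracted entity.

    Assumes each cell of domain_df['entities'] is a list of dicts like
    {'type': 'time', 'words': ['five', 'am']}.
    """
    rows = []
    for original_index, entities in domain_df['entities'].items():
        for entity in entities:
            rows.append({
                'id': original_index,                      # index of the source utterance
                'entity_type': entity['type'],
                'entity_words': ' '.join(entity['words']),
            })
    return pd.DataFrame(rows)
```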

We then use this data frame to pandas out the matches and return them, including the original data frame's index, so we can bring the matches back together with the source data for review and refinement.
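One way that filtering could look, assuming the flattened data frame from the sketch above (find_overlapping_entities is again just an illustrative name):

```python
def find_overlapping_entities(entity_df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose entity_words appear under more than one entity_type."""
    types_per_words = entity_df.groupby('entity_words')['entity_type'].nunique()
    overlapping_words = types_per_words[types_per_words > 1].index
    overlaps = entity_df[entity_df['entity_words'].isin(overlapping_words)]
    # The id column is kept so each overlap can be traced back to its row in domain_df.
    return overlaps.sort_values(['entity_words', 'entity_type'])
```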

If this works well, it can be integrated as a method on the EntityExtractor class, and extract_entities can spit out this new data frame instead of just appending all of that junk into a column.

Partial or full matches

We will just go with full exact matches for now; we can always throw in the partial matching later. Let's keep the MVP lean! I will make an issue for this; perhaps someone else would like to give it a try? It shouldn't be too hard to do and would make a good first issue.
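For whoever picks that up, a rough sketch of what a partial match could look like, comparing the joined entity_words strings token by token (the stop word list below is only a placeholder, and the function names are made up for illustration):

```python
# Placeholder stop word list; a real implementation might use NLTK or spaCy stop words.
STOP_WORDS = {'the', 'a', 'an', 'of', 'to', 'in', 'on', 'at', 'for'}

def content_tokens(entity_words: str) -> set:
    """Split the joined entity words and drop stop words."""
    return {word for word in entity_words.lower().split() if word not in STOP_WORDS}

def is_partial_match(words_a: str, words_b: str) -> bool:
    """True if two entity word strings share at least one non-stop-word token."""
    return bool(content_tokens(words_a) & content_tokens(words_b))
```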

DoD (Definition of Done)

  • Flow to refine entities by overlap
  • The results can be saved to CSVs
  • The refinements can be added to the original data frame
AmateurAcademic added the enhancement, help wanted, good first issue, and question labels on Sep 3, 2022
AmateurAcademic self-assigned this on Sep 3, 2022
AmateurAcademic (Collaborator, Author) commented Sep 3, 2022

Almost done. I need to look at this again with fresh eyes.

The domain_df needs to have its entries updated with the correct entity_types and then be saved as a CSV.

AmateurAcademic (Collaborator, Author) commented:
Part of the code is still in the notebook: the part where the overlapping entities are replaced with the correct ones in domain_df. That last bit should be refactored into the entity refinement class next time so this issue can be closed. Otherwise, this feature is done.

AmateurAcademic (Collaborator, Author) commented:
This has been done with commit c8941de.

The DoD item for saving the results to CSVs was not needed. A bug was also discovered in the EntityExtractor.normalise_utterance method and has been fixed.
