Find overlapping entities in a domain by type #6
Labels
enhancement, good first issue, help wanted, question
Description
When creating or maintaining an NLU dataset, there are very frequently overlaps between types of entities. These overlaps confuse entity extraction models. Many entity extractors do use the context of the utterance in some form to label the entity words and types, but this is nevertheless often an issue.
To solve this, we should collect all the entities and their types and look for these duplicates, so we can try to combine them as much as possible.
This issue relates to Flow for entity refinement #4 and needs to be solved before entity refinement can happen.
User stories
As a user refining entity data, I want to:
- find entities whose words overlap across entity types, so that I can review and combine them.
Problems
Data types
One problem here is the data types. In the notebook for entity refinement, the `domain_df` (and other dfs!) use a column called `entities`, where the entities were extracted using `EntityExtractor.extract_entities`. This creates entries in the column like so: `[{'type': 'time', 'words': ['five', 'am']}, ...]`. One shouldn't simply place lists of dictionaries into pandas data frame columns, but instead of trying to figure out a smart way to do this, I focused on getting it done, because the resulting clean dataset is a higher priority than avoiding a bit of technical debt.
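For concreteness, here is a minimal sketch of what that structure can look like. The utterances and entity values are made up for illustration; only the `domain_df` name and the `entities` column come from the notebook.

```python
import pandas as pd

# Hypothetical example of the current structure: each cell of the
# "entities" column holds a list of dicts produced by the extractor.
domain_df = pd.DataFrame({
    "utterance": ["wake me up at five am", "remind me at five am"],
    "entities": [
        [{"type": "time", "words": ["five", "am"]}],
        [{"type": "alarm_time", "words": ["five", "am"]}],
    ],
})
# The same words ("five am") appear under two different entity types,
# which is exactly the kind of overlap this issue wants to surface.
```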
Question
However, now we need to search through this to find duplicates. So the issue here is: how do we do this well, given the current data structure?
Partial or full matches
Should we only look for exact matches of the entity words, or would it also be beneficial to search individual words for partial matches? We could even remove stop words to avoid some uninteresting matches. Perhaps this is better suited for a next version.
Question
Should we focus on just full matches as an MVP, or do we also need to go for partial matches and remove stop words?
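For reference, a rough sketch of what the partial-match variant could look like; the stop word list and helper names are made up, and none of this is needed for the MVP discussed below.

```python
# Illustrative only: a tiny hand-picked stop word list and a token-level
# overlap check between two entity word strings.
STOP_WORDS = {"the", "a", "an", "at", "for", "to", "me"}

def content_tokens(entity_words: str) -> set:
    """Split an entity_words string and drop stop words."""
    return {w for w in entity_words.lower().split() if w not in STOP_WORDS}

def partial_match(words_a: str, words_b: str) -> bool:
    """True if the two entity word strings share at least one content token."""
    return bool(content_tokens(words_a) & content_tokens(words_b))
```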
Solutions
These are just the solutions I will go with for now; perhaps this can be refactored in the future. As stated, getting the cleaned-up data out is more important than nice code! We're going quick and dirty here.
Data types
I propose we make a quick and dirty version where we create a new data frame that contains:
- `index`: just a normal unique index for every entry, as pandas usually provides.
- `id`: the original index from `domains_df`; several entries can have the same `id`, as there can be multiple entities and entity types in one utterance!
- `entity_type`: the entity type from the original data frame's `entities` column.
- `entity_words`: the words from the `words` list in the original data frame's `entities` column, joined with spaces.

We use this data frame to pandas out what we want for the matches and return the matches, including the original data frame's index, so we can bring them together for review and refinement. A sketch of this is shown below.
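A minimal sketch of how that data frame could be built with plain pandas, assuming the hypothetical `domain_df` from the example above (the helper name `build_entity_df` is made up, not existing code):

```python
import pandas as pd

def build_entity_df(domain_df: pd.DataFrame) -> pd.DataFrame:
    """Flatten the list-of-dicts "entities" column into one row per entity."""
    exploded = (
        domain_df["entities"]
        .explode()                       # one row per entity dict
        .dropna()                        # skip utterances without entities
        .reset_index()                   # keep the original index around
        .rename(columns={"index": "id"})
    )
    exploded["entity_type"] = exploded["entities"].map(lambda e: e["type"])
    exploded["entity_words"] = exploded["entities"].map(lambda e: " ".join(e["words"]))
    return exploded[["id", "entity_type", "entity_words"]]
```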
If this works well, this can be integrated as a method on the `EntityExtraction` class, and the `extract_entities` method can spit out this new data frame instead of just appending all of that junk into a column.
Partial or full matches
We will just go with full exact matches for now; we can always add partial matching later. Let's keep the MVP lean! I will make an issue for this; perhaps someone else would like to give it a try? It shouldn't be too hard to do and would make a good first issue.
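A minimal sketch of the full-match search on the flattened frame from the previous sketch, again just an illustration rather than the final implementation:

```python
import pandas as pd

def find_full_overlaps(entity_df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose entity_words appear under more than one entity_type."""
    type_counts = entity_df.groupby("entity_words")["entity_type"].nunique()
    overlapping_words = type_counts[type_counts > 1].index
    matches = entity_df[entity_df["entity_words"].isin(overlapping_words)]
    # Sorting groups the overlapping entries together for easier review.
    return matches.sort_values(["entity_words", "entity_type", "id"])
```

On the made-up example above, this would flag "five am" as appearing under both `time` and `alarm_time`, together with the `id` values needed to jump back to the original rows.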
DoD (Definition of Done)