Find overlapping entities in a domain by type #6
Labels
enhancement, good first issue, help wanted, question
Description
When creating or maintaining an NLU dataset, there are very frequently overlaps between types of entities. These overlaps confuse entity extraction models. Many entity extractors do use the context of the utterance in some form to label the entity words and types, but this is nevertheless often an issue.
To solve this, we should collect all the entities and their types and look for these duplicates, so we can try to combine them as much as possible.
This issue relates to Flow for entity refinement #4 and needs to be solved before entity refinement can happen.
User stories
As a user refining entity data, I want to:
- find entities whose words overlap across entity types, so that I can review and combine them.
Problems
Data types
One problem here is the data types. In the notebook for entity refinement, the `domain_df` (and other dfs!) use a column called `entities`, where the entities were extracted using `EntityExtractor.extract_entities`. This creates entries in the column like so: `[{'type': 'time', 'words': ['five', 'am']}, ...]`. One shouldn't simply place lists of dictionaries into pandas data frame columns, but instead of trying to figure out a smart way to do this, I focused on getting it done, because the resulting clean dataset is a higher priority than avoiding a bit of technical debt.
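For concreteness, here is a minimal sketch of what that structure can look like. The utterances and entity values are made up for illustration; only the `domain_df` name and the `entities` column come from the notebook.

```python
import pandas as pd

# Hypothetical example of the current structure: each cell of the
# "entities" column holds a list of dicts produced by the extractor.
domain_df = pd.DataFrame({
    "utterance": ["wake me up at five am", "remind me at five am"],
    "entities": [
        [{"type": "time", "words": ["five", "am"]}],
        [{"type": "alarm_time", "words": ["five", "am"]}],
    ],
})
# The same words ("five am") appear under two different entity types,
# which is exactly the kind of overlap this issue wants to surface.
```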
Question
However, now we need to search through this to find duplicates. So the issue here is: how do we do this well, given the current data structure?
Partial or full matches
Should we only look for exact matches of the entity words, or would it also be beneficial to search individual words for partial matches? We could even remove stop words to avoid some uninteresting matches. Perhaps this is better suited for a next version.
Question
Should we focus on just full matches as an MVP, or do we also need to go for partial matches and remove stop words?
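For reference, a rough sketch of what the partial-match variant could look like; the stop word list and helper names are made up, and none of this is needed for the MVP discussed below.

```python
# Illustrative only: a tiny hand-picked stop word list and a token-level
# overlap check between two entity word strings.
STOP_WORDS = {"the", "a", "an", "at", "for", "to", "me"}

def content_tokens(entity_words: str) -> set:
    """Split an entity_words string and drop stop words."""
    return {w for w in entity_words.lower().split() if w not in STOP_WORDS}

def partial_match(words_a: str, words_b: str) -> bool:
    """True if the two entity word strings share at least one content token."""
    return bool(content_tokens(words_a) & content_tokens(words_b))
```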
Solutions
These are just the solutions I will go with for now; perhaps this can be refactored in the future. As stated, getting the cleaned-up data out is more important than nice code! We're going quick and dirty here.
Data types
I propose we make a quick and dirty version where we create a new data frame that contains:
- `index`: just a normal unique index for every entry, as pandas usually provides.
- `id`: the original index from `domains_df`; several entries can have the same `id`, as there can be multiple entities and entity types in one utterance!
- `entity_type`: the entity type from the original data frame's `entities` column.
- `entity_words`: the words from the `words` list in the original data frame's `entities` column, joined with spaces.

We use this data frame to pandas out what we want for the matches and return the matches, including the original data frame's index, so we can bring them together for review and refinement. A sketch of this is shown below.
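A minimal sketch of how that data frame could be built with plain pandas, assuming the hypothetical `domain_df` from the example above (the helper name `build_entity_df` is made up, not existing code):

```python
import pandas as pd

def build_entity_df(domain_df: pd.DataFrame) -> pd.DataFrame:
    """Flatten the list-of-dicts "entities" column into one row per entity."""
    exploded = (
        domain_df["entities"]
        .explode()                       # one row per entity dict
        .dropna()                        # skip utterances without entities
        .reset_index()                   # keep the original index around
        .rename(columns={"index": "id"})
    )
    exploded["entity_type"] = exploded["entities"].map(lambda e: e["type"])
    exploded["entity_words"] = exploded["entities"].map(lambda e: " ".join(e["words"]))
    return exploded[["id", "entity_type", "entity_words"]]
```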
If this works well, this can be integrated as a method on the `EntityExtraction` class, and the `extract_entities` method can spit out this new data frame instead of just appending all of that junk into a column.
Partial or full matches
We will just go with full exact matches for now; we can always add partial matching later. Let's keep the MVP lean! I will make an issue for this; perhaps someone else would like to give it a try? It shouldn't be too hard to do and would make a good first issue.
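A minimal sketch of the full-match search on the flattened frame from the previous sketch, again just an illustration rather than the final implementation:

```python
import pandas as pd

def find_full_overlaps(entity_df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose entity_words appear under more than one entity_type."""
    type_counts = entity_df.groupby("entity_words")["entity_type"].nunique()
    overlapping_words = type_counts[type_counts > 1].index
    matches = entity_df[entity_df["entity_words"].isin(overlapping_words)]
    # Sorting groups the overlapping entries together for easier review.
    return matches.sort_values(["entity_words", "entity_type", "id"])
```

On the made-up example above, this would flag "five am" as appearing under both `time` and `alarm_time`, together with the `id` values needed to jump back to the original rows.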
DoD (Definition of Done)