How to train/modify collocation model with existing (ngram) dictionary? (question) #224
Comments
The current flow is that if you want "new york city" to be identified/collapsed as a collocation, then there should be pairs of
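A minimal sketch of that flow with text2vec's `Collocations` class might look like the following (the toy corpus and thresholds are illustrative, not recommended settings; two fit iterations are used so that "new_york" learned on the first pass can grow into "new_york_city" on the second):

```r
library(text2vec)

# Toy corpus: "new york city" co-occurs often enough to pass the count threshold
docs <- rep("i visited new york city and loved new york city", 50)
it   <- itoken(docs, tokenizer = word_tokenizer, progressbar = FALSE)

cc_model <- Collocations$new(collocation_count_min = 5)
cc_model$fit(it, n_iter = 2)          # pass 1 can learn "new_york"; pass 2 "new_york_city"

it_colloc <- cc_model$transform(it)   # iterator over documents with collapsed collocations
```

The learned pairs (prefix/suffix plus association statistics) end up in `cc_model$collocation_stat`.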
Thank you for your reply. So I was halfway on the right track by introducing dashes into the n-grams of the dictionary. Something like

So far, I have come up with a workaround (see code below) that identifies trailing n-grams in the dictionary and does a stepwise iteration over N (from low to high) to transform the main iterator for the documents. I use this stepwise iteration because, as soon as there are more than trigrams, the inner n-gram should presumably not be recognized as a collocation (e.g. in "the state of new york city" the collocation "the state" would usually not be desired). This workaround still requires training of cc_models, which, I guess, could probably be skipped with the right implementation. I would have to think over the order in which collocations should be bound, but I would be happy to work on a solution for this problem. Since I am not a computer/data scientist, my programming skills are not of a mature professional nature; hence, I would be glad if you could provide some guidance/pointers/critique so that I am on the right track.
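The stepwise part of that workaround can be sketched in base R as follows (the dashed dictionary is a made-up example and the per-step fitting is left as a placeholder; only the low-to-high ordering over N is shown):

```r
# Hypothetical dashed dictionary entries
dict <- c("new_york", "new_york_city", "state_of_new_york_city")

# Order of an entry = number of unigrams it joins
ord <- lengths(strsplit(dict, "_", fixed = TRUE))

# Process shorter collocations first, so inner n-grams are only merged
# when they are the longest match available at that step
for (n in sort(unique(ord))) {
  step_dict <- dict[ord == n]
  # ... fit/apply a collocation step for `step_dict` here,
  # transforming the document iterator before the next round ...
}
```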
Sorry for not being very responsive these days, a lot of things going on... Will try to spend some time tomorrow to clarify the way we can proceed with this issue.
That's fine, I guess you have some more important/complex problems to solve than some dictionary lookups. For the time being I think I can use my workaround, but as soon as you find some time I am happy to provide support where possible (or to the extent my limited skills allow) to establish a more elegant/robust solution.
Just realized that I had forgotten to insert the helper function that finds trailing n-grams into the code, sorry for that. I have updated my last code comment accordingly, so that you have my latest attempt as soon as you find time to have a further look.
This is an update on this issue, however not a solution yet. As per your first comment in this thread, I have created a

Following this approach, the collocation "state_of_new_york_city" present in the dictionary is trained correctly. However, it also trains word combinations embedded within longer phrases, such as "state_of", which would not be a desired output. To avoid this behaviour, I can only think of some kind of iterative training of collocations (e.g. from low to high n-grams or so) to improve the results. Let me know if you have any other ideas on how to approach this further (as far as you have time)?
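One possible way to suppress those embedded combinations after training: keep only the learned prefix/suffix pairs whose joined form is an exact dictionary entry. A base-R sketch with a mock statistics table (the real table would be the model's `collocation_stat`; the column names and values here are illustrative assumptions):

```r
# Mock of the model's collocation statistics (prefix/suffix pairs)
stat <- data.frame(prefix = c("state", "state_of_new_york", "new_york"),
                   suffix = c("of",    "city",              "city"),
                   stringsAsFactors = FALSE)
dict <- c("state_of_new_york_city", "new_york_city")

learned <- paste(stat$prefix, stat$suffix, sep = "_")
keep    <- learned %in% dict   # drops embedded phrases like "state_of"
stat    <- stat[keep, ]
```

Whether the pruned table can simply be written back into the model is exactly the open question of this thread, so this only shows the filtering step.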
Dear Dmitriy,
thank you again for solving issue #218 concerning the replacement of terms by multiple synonyms. I now have a question concerning how to best incorporate dictionaries that include information on n-grams/collocations, e.g., city names. A standard solution would be to simply replace all matched patterns in the text by the dashed_version_of_patterns, e.g., via stri_replace_all. However, this is slow for large corpora and I am interested in how you would solve this task in text2vec. As a workaround, I trained a collocation model on a modified dictionary containing all terms bound by dashes, leaving the first unigram unbound so the model sees only one prefix and suffix, e.g., "new york_city". Please see the code example below.
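The dictionary transformation described there (binding everything except the first unigram with dashes) can be sketched in base R; `bind_tail` is a hypothetical helper name, not part of text2vec:

```r
dict <- c("new york city", "state of new york")

# Join every word except the first with "_", e.g. "new york city" -> "new york_city",
# so a collocation model only needs to learn one prefix/suffix pair per entry
bind_tail <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  if (length(words) < 2) return(x)
  paste(words[1], paste(words[-1], collapse = "_"))
}

dict_bound <- vapply(dict, bind_tail, character(1), USE.NAMES = FALSE)
# dict_bound: "new york_city", "state of_new_york"
```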
I was wondering whether you would incorporate such dictionary information differently, e.g., without training a model, by manually defining collocation_stats or so.
I would appreciate your thoughts. Thank you in advance.