add vignette demonstrating application of syn in text analysis #22
It would be a really interesting application to have a "dictionary" of synonyms and then to use a function such as However, to apply this to a series of tokens, there would need to be some priority rules about conversion to avoid cycling or indeterminacy, maybe based on frequency. So great -> good, terrific -> good, but a match for good does not become great. There would also need to be a way to choose which word to select from a list of multiple synonyms. Frequency is probably the best criterion. Package looks great!
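As a rough illustration of the frequency idea above, here is a minimal base R sketch. The synonym set and the token vector are made up for the example and do not come from the syn package; the rule itself (most frequent member wins, ties go to the earlier entry) is just one possible assumption.

```r
# Hypothetical synonym set and token stream, for illustration only.
synonym_set <- c("good", "great", "terrific")
tokens <- c("a", "good", "day", "a", "great", "day", "a", "good", "plan",
            "a", "terrific", "idea")

# Count how often each member of the set occurs among the tokens.
freq <- table(factor(tokens[tokens %in% synonym_set], levels = synonym_set))

# The most frequent member becomes the canonical form. which.max() returns
# the first maximum, so ties go to the earlier entry in synonym_set.
canonical <- names(freq)[which.max(freq)]

# One-directional mapping: every member of the set points to the canonical
# form, so the canonical form itself is never rewritten to something else.
replacements <- setNames(rep(canonical, length(synonym_set)), synonym_set)

# Replace matching tokens and leave everything else untouched.
is_member <- tokens %in% synonym_set
converted <- tokens
converted[is_member] <- unname(replacements[tokens[is_member]])

print(canonical)
print(converted)
```

Because the mapping only ever points towards the canonical form, "great" and "terrific" become "good" while "good" stays as it is, which avoids the cycling problem.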
Thanks for your thoughts, @kbenoit :) I'm not quite sure how to avoid things as you said, so Now, on to your note about a "dictionary" of synonyms, I had a fiddle with the
library(syn)
library(quanteda)
#> Package version: 1.3.14
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#> View
mycorpus <- corpus_subset(data_corpus_inaugural, Year > 1900)
mydict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxing = "taxing",
                          taxation = "taxation",
                          taxregex = "tax*",
                          country = "america"))
mydict_syn <- dictionary(syns(c("christmas",
                                "opposition",
                                "taxing",
                                "taxation",
                                "country")))
head(dfm(mycorpus, dictionary = mydict))
#> Document-feature matrix of: 6 documents, 6 features (63.9% sparse).
#> 6 x 6 sparse Matrix of class "dfm"
#> features
#> docs christmas opposition taxing taxation taxregex country
#> 1901-McKinley 0 2 0 1 1 0
#> 1905-Roosevelt 0 0 0 0 0 0
#> 1909-Taft 0 1 0 4 6 4
#> 1913-Wilson 0 0 0 1 1 0
#> 1917-Wilson 0 0 0 0 0 2
#> 1921-Harding 0 0 0 1 2 15
head(dfm(mycorpus, dictionary = mydict_syn))
#> Document-feature matrix of: 6 documents, 5 features (23.3% sparse).
#> 6 x 5 sparse Matrix of class "dfm"
#> features
#> docs christmas opposition taxing taxation country
#> 1901-McKinley 0 2 7 1 17
#> 1905-Roosevelt 0 0 3 1 12
#> 1909-Taft 0 12 22 9 32
#> 1913-Wilson 0 3 7 4 14
#> 1917-Wilson 0 12 9 4 17
#> 1921-Harding 0 14 25 5 25
# subset a dictionary
mydict[1:2]
#> Dictionary object with 2 key entries.
#> - [christmas]:
#> - christmas, santa, holiday
#> - [opposition]:
#> - opposition, reject, notincorpus
mydict[c("christmas", "opposition")]
#> Dictionary object with 2 key entries.
#> - [christmas]:
#> - christmas, santa, holiday
#> - [opposition]:
#> - opposition, reject, notincorpus
mydict[["opposition"]]
#> [1] "opposition" "reject" "notincorpus"
# subset the synonym dictionary
mydict_syn[1:2]
#> Dictionary object with 2 key entries.
#> - [christmas]:
#> - [opposition]:
#> - adversary, adversity, agreement to disagree, alienation, allegory, analogy, antagonism, antagonist, antagonistic, anteposition, antipathy, antithesis, antithetical, apostasy, argumentation, arrest, arrestation, arrestment, assailant, at daggers drawn, averseness, aversion, backlash, backwardness, balancing, ban, blackball, blackballing, blockage, blocking, challenge, check, clashing, clogging, closing up, closure, collision, combatant, combative reaction, comparative anatomy, comparative degree, comparative grammar, comparative judgment, comparative linguistics, comparative literature, comparative method, compare, comparing, comparison, competing, competition, competitive, competitor, complaint, con, conflict, conflicting, confrontation, confrontment, confutation, constriction, contention, contradiction, contradistinction, contraindication, contraposition, contrariety, contrast, contrastiveness, controversy, correlation, counter-culture, counteraction, counterposition, counterworking, cramp, crankiness, cross-purposes, crotchetiness, cursoriness, defiance, delay, demur, departure, detainment, detention, deviation, difference, dim view, disaccord, disaccordance, disagreement, disappointment, disapprobation, disapproval, disconformity, discongruity, discontent, discontentedness, discontentment, discord, discordance, discordancy, discrepancy, discreteness, disenchantment, disesteem, disfavor, disgruntlement, disharmony, disillusion, disillusionment, disinclination, disobedience, disparity, displeasure, dispute, disrelish, disrespect, dissatisfaction, dissension, dissent, dissentience, dissidence, dissimilarity, dissonance, distaste, distinction, distinctiveness, distinctness, disunion, disunity, divergence, divergency, diversity, dropping out, enemy, exclusion, faction, far cry, fixation, flak, foe, foeman, foot-dragging, fractiousness, friction, grudging consent, grudgingness, hampering, heterogeneity, hindering, hindrance, holdback, holdup, hostile, hostility, 
impediment, in opposition, inaccordance, incompatibility, incongruity, inconsistency, inconsonance, indignation, indisposedness, indisposition, indocility, inequality, inharmoniousness, inharmony, inhibition, inimicalness, interference, interruption, intractableness, irreconcilability, jarring, kick, lack of enthusiasm, lack of zeal, let, likening, low estimation, low opinion, matching, metaphor, minority opinion, mixture, mutinousness, negation, negativism, nolition, nonagreement, nonassent, nonconcurrence, nonconformity, nonconsent, noncooperation, nuisance value, objection, obstinacy, obstruction, obstructionism, occlusion, odds, opponent, opposed, opposing, opposing party, opposite camp, oppositeness, opposition, opposure, oppugnance, oppugnancy, ostracism, other side, otherness, parallelism, passive resistance, perfunctoriness, perverseness, perversity, polar opposition, polarity, polarization, posing against, proportion, protest, reaction, rebuff, recalcitrance, recalcitrancy, recalcitration, recoil, recusance, recusancy, refractoriness, refusal, rejection, relation, reluctance, renitence, renitency, repellence, repellency, repercussion, repression, repudiation, repugnance, repulse, repulsion, resistance, restraint, restriction, retardation, retardment, revolt, rival, secession, separateness, setback, showdown, simile, similitude, slowness, squeeze, stand, stranglehold, stricture, stubbornness, sulk, sulkiness, sulks, sullenness, suppression, swimming upstream, the loyal opposition, the opposition, thumbs-down, trope of comparison, unconformity, uncooperativeness, underground, unenthusiasm, unfriendliness, unhappiness, unharmoniousness, unlikeness, unorthodoxy, unwillingness, variance, variation, variegation, variety, weighing, withdrawal, withstanding
mydict_syn[c("christmas", "opposition")]
#> Dictionary object with 2 key entries.
#> - [christmas]:
#> - [opposition]:
#> - adversary, adversity, agreement to disagree, alienation, allegory, analogy, antagonism, antagonist, antagonistic, anteposition, antipathy, antithesis, antithetical, apostasy, argumentation, arrest, arrestation, arrestment, assailant, at daggers drawn, averseness, aversion, backlash, backwardness, balancing, ban, blackball, blackballing, blockage, blocking, challenge, check, clashing, clogging, closing up, closure, collision, combatant, combative reaction, comparative anatomy, comparative degree, comparative grammar, comparative judgment, comparative linguistics, comparative literature, comparative method, compare, comparing, comparison, competing, competition, competitive, competitor, complaint, con, conflict, conflicting, confrontation, confrontment, confutation, constriction, contention, contradiction, contradistinction, contraindication, contraposition, contrariety, contrast, contrastiveness, controversy, correlation, counter-culture, counteraction, counterposition, counterworking, cramp, crankiness, cross-purposes, crotchetiness, cursoriness, defiance, delay, demur, departure, detainment, detention, deviation, difference, dim view, disaccord, disaccordance, disagreement, disappointment, disapprobation, disapproval, disconformity, discongruity, discontent, discontentedness, discontentment, discord, discordance, discordancy, discrepancy, discreteness, disenchantment, disesteem, disfavor, disgruntlement, disharmony, disillusion, disillusionment, disinclination, disobedience, disparity, displeasure, dispute, disrelish, disrespect, dissatisfaction, dissension, dissent, dissentience, dissidence, dissimilarity, dissonance, distaste, distinction, distinctiveness, distinctness, disunion, disunity, divergence, divergency, diversity, dropping out, enemy, exclusion, faction, far cry, fixation, flak, foe, foeman, foot-dragging, fractiousness, friction, grudging consent, grudgingness, hampering, heterogeneity, hindering, hindrance, holdback, holdup, hostile, hostility, 
impediment, in opposition, inaccordance, incompatibility, incongruity, inconsistency, inconsonance, indignation, indisposedness, indisposition, indocility, inequality, inharmoniousness, inharmony, inhibition, inimicalness, interference, interruption, intractableness, irreconcilability, jarring, kick, lack of enthusiasm, lack of zeal, let, likening, low estimation, low opinion, matching, metaphor, minority opinion, mixture, mutinousness, negation, negativism, nolition, nonagreement, nonassent, nonconcurrence, nonconformity, nonconsent, noncooperation, nuisance value, objection, obstinacy, obstruction, obstructionism, occlusion, odds, opponent, opposed, opposing, opposing party, opposite camp, oppositeness, opposition, opposure, oppugnance, oppugnancy, ostracism, other side, otherness, parallelism, passive resistance, perfunctoriness, perverseness, perversity, polar opposition, polarity, polarization, posing against, proportion, protest, reaction, rebuff, recalcitrance, recalcitrancy, recalcitration, recoil, recusance, recusancy, refractoriness, refusal, rejection, relation, reluctance, renitence, renitency, repellence, repellency, repercussion, repression, repudiation, repugnance, repulse, repulsion, resistance, restraint, restriction, retardation, retardment, revolt, rival, secession, separateness, setback, showdown, simile, similitude, slowness, squeeze, stand, stranglehold, stricture, stubbornness, sulk, sulkiness, sulks, sullenness, suppression, swimming upstream, the loyal opposition, the opposition, thumbs-down, trope of comparison, unconformity, uncooperativeness, underground, unenthusiasm, unfriendliness, unhappiness, unharmoniousness, unlikeness, unorthodoxy, unwillingness, variance, variation, variegation, variety, weighing, withdrawal, withstanding
head(mydict_syn[["opposition"]])
#> [1] "adversary" "adversity" "agreement to disagree"
#> [4] "alienation" "allegory" "analogy"
tail(mydict_syn[["opposition"]])
#> [1] "variation" "variegation" "variety" "weighing"
#> [5] "withdrawal" "withstanding"
Created on 2018-11-28 by the reprex package (v0.2.1)
I don't know too much about this either, but my first instinct upon seeing the package was that to work with tidytext I'd first arrange the synonyms as a tidy dataset.
library(dplyr)
library(tibble)
library(tidyr)
word_synonyms <- tibble::enframe(syn:::words_idx) %>%
  unnest(value) %>%
  transmute(word = name,
            synonym = syn:::all_words[value])
For instance, this allows us to find the most common synonym for each word (common defined as "being a synonym to many words"). This helps solve Nick's question about choosing one synonym for each word in a dictionary.
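One way the "most common synonym" step could look is sketched below. The helper name `pick_most_common_synonym()` and the toy table are hypothetical and only for illustration; the sketch assumes a two-column table of `word` and `synonym`, shaped like the tidy dataset described above, and breaks ties alphabetically.

```r
library(dplyr)

# Hypothetical helper: for each word, keep the synonym that is shared by the
# largest number of words in the table.
pick_most_common_synonym <- function(word_synonyms) {
  word_synonyms %>%
    add_count(synonym, name = "n_words") %>%   # how many words list this synonym
    arrange(word, desc(n_words), synonym) %>%  # most shared first, ties alphabetical
    distinct(word, .keep_all = TRUE) %>%       # first row per word is the winner
    select(word, synonym)
}

# Toy stand-in data (made up, not taken from the syn thesaurus).
toy_synonyms <- tibble::tribble(
  ~word,   ~synonym,
  "sense", "point",
  "sense", "feeling",
  "aim",   "point",
  "tip",   "point",
  "mood",  "feeling",
  "mood",  "humor"
)

toy_most_common <- pick_most_common_synonym(toy_synonyms)
print(toy_most_common)
```

Applied to the full tidy table, the same helper would give one row per word, which is the shape a later `left_join()` by `word` expects.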
If we wanted to be a bit silly, we could then replace every word in a text with the most common synonym, turning "Sense and Sensibility" into "Point and Note".
library(janeaustenr)
library(tidytext)
austen_books() %>%
  unnest_tokens(word, text) %>%
  left_join(most_common_synonyms, by = "word") %>%
  mutate(synonym = coalesce(synonym, word))
I don't know if there's a vignette to be made here; this would just be the direction I'd go in tidying syn.
Thanks for that, @dgrtwo! :)
Following @njtierney's example above, I experimented a bit myself and found a few issues that I illustrate below. I use here the language I've adopted for text analysis dictionaries, in terms of the key (the target word whose synonyms are retrieved) and its values (the synonyms retrieved). If the use case is to convert value matches to a key, say in order to simplify vocabulary by reducing synonyms to their canonical concepts, like "good" or "bad", then there are going to be a lot of overlapping matches. For instance:
library("syn")
library("quanteda")
## Package version: 1.3.16
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
syndict <- dictionary(list(
  good = syn("good"),
  excellent = syn("excellent")
))
# show the overlap
lapply(syndict, head, 20)
## $good
## [1] "able to pay" "absolutely" "acceptable"
## [4] "accomplished" "according to hoyle" "ace"
## [7] "actual" "adept" "adequate"
## [10] "admirable" "admissible" "adroit"
## [13] "advantage" "advantageous" "advisable"
## [16] "affable" "affectionate" "agreeable"
## [19] "all right" "all-knowing"
##
## $excellent
## [1] "a cut above" "above" "adept" "admirable"
## [5] "adroit" "advantageous" "aesthetic" "aggrandized"
## [9] "ahead" "al" "apotheosized" "apt"
## [13] "artistic" "ascendant" "attic" "auspicious"
## [17] "authoritative" "awesome" "bang-up" "banner"
syndict[["good"]][180:190]
## [1] "even" "evenhanded" "everlasting" "exactly"
## [5] "excellent" "exemplary" "expedient" "expert"
## [9] "exquisite" "extensive" "extraordinary"
syndict[["excellent"]][80:100]
## [1] "extraordinary" "fair" "famous" "fancy"
## [5] "fantastic" "favorable" "fine" "finer"
## [9] "first-class" "first-rate" "first-string" "glorified"
## [13] "good" "goodish" "goodly" "graceful"
## [17] "grade a" "grand" "great" "greater"
## [21] "handy"
Here, we see that "good" is a synonym of "excellent", and vice versa. In general, these are very long entries in the thesaurus, so we have lots of overlaps and linkages. Almost certainly, this thesaurus is too inclusive. "according to hoyle" is a synonym of "good"? 🤔
This means that if we use the entries as a dictionary, we will get multiple matches. Here, the token "good" (the first sentence) is replaced by both of its matching keys, "GOOD" and "EXCELLENT", and the other terms produce similar matches. Some priority rule is needed.
txt <- "Good? It's fantastic, great, awesome, even excellent!"
toks <- tokens(txt)
tokens_lookup(toks, syndict, exclusive = FALSE, capkeys = TRUE)
## tokens from 1 document.
## text1 :
## [1] "GOOD" "EXCELLENT" "?" "It's" "GOOD"
## [6] "EXCELLENT" "," "GOOD" "EXCELLENT" ","
## [11] "EXCELLENT" "," "GOOD" "GOOD" "EXCELLENT"
## [16] "!"
There are no doubt other use cases, but at least this illustrates a problem to be solved, as well as showing how the current thesaurus is too inclusive (but even a more restricted one will not solve the first problem, since there will always be overlap). But resolving #1 might improve things a lot. Great project!
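To make the idea of a priority rule concrete, here is a minimal base R sketch of one possible rule set. It is an assumption for illustration, not something quanteda or syn does: a key is never treated as a value of another key, and a value shared by several keys goes to the key listed first. The synonym lists are hypothetical and heavily shortened.

```r
# Hypothetical, shortened synonym lists; list order defines key priority.
syn_lists <- list(
  good      = c("fine", "great", "excellent", "awesome"),
  excellent = c("fantastic", "great", "good", "awesome")
)
keys <- names(syn_lists)

# Flatten to (value, key) pairs, keeping the key-priority order.
pairs <- data.frame(
  value = unlist(syn_lists, use.names = FALSE),
  key   = rep(keys, lengths(syn_lists)),
  stringsAsFactors = FALSE
)

# Rule 1: a key is never a value of another key ("good" stays "good").
pairs <- pairs[!pairs$value %in% keys, ]
# Rule 2: a shared value is kept only under the first key that lists it.
pairs <- pairs[!duplicated(pairs$value), ]

# Non-overlapping value lists, one per key.
resolved <- split(pairs$value, factor(pairs$key, levels = keys))

# Lookup vector: each key maps to itself, each value to its single key.
lookup <- c(setNames(keys, keys), setNames(pairs$key, pairs$value))

tokens <- c("good", "fantastic", "great", "awesome", "excellent", "day")
matched <- tokens %in% names(lookup)
out <- tokens
out[matched] <- toupper(unname(lookup[tokens[matched]]))

print(resolved)
print(out)
```

With these rules every token matches at most one key. The price is that the result depends on the order of the keys, so a frequency-based ordering might be a more principled choice than list order.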
I imagine that there would be some application of deriving synonyms of words to assist in parts of text analysis, but I do not work in this area. Perhaps @juliasilge @dgrtwo or @kbenoit might have an idea?