this project is an extension of the Word Embedding Association Test (WEAT) framework, proposed in 2016 by Caliskan et al.
includes code to scrape more than a dozen subreddits, obtaining roughly 111,000 tokens for linguistic research and creating more than a dozen corpora with a mean unique vocabulary of 5,580 tokens exclusive of stop-words. the pipeline automates text preprocessing to obtain top n-grams, trains 18 unique Word2Vec neural networks for shallow neural vectorization, & runs scalable in-depth cosine similarity analysis, all callable in one line of Python.
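as an illustrative sketch of the preprocessing stage (the helper below is hypothetical, not the project's exact code), stop-word removal and top n-gram extraction with nltk might look like:

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def top_ngrams(raw_text, n=2, k=10):
    """Lowercase, tokenize, drop stop-words and punctuation, then count the k most common n-grams."""
    stops = set(stopwords.words("english"))
    tokens = [t for t in word_tokenize(raw_text.lower()) if t.isalpha() and t not in stops]
    return Counter(ngrams(tokens, n)).most_common(k)
```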
the WEAT test was based on the discovery that word embeddings also encode measurable social biases, which correlate with known cultural phenomena:
Our results indicate that language itself contains recoverable and accurate imprints of our historic biases, whether these are morally neutral as towards insects or flowers, problematic as towards race or gender, or even simply veridical, reflecting the *status quo* for the distribution of gender with respect to careers or first names. These regularities are captured by machine learning along with the rest of semantics. In addition to our empirical findings concerning language, we also contribute new methods for evaluating bias in text...
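at its core, the test compares cosine similarities between a target word's vector and two lists of attribute vectors. a minimal sketch of that association score (the helper names below are illustrative, not from the paper's released code):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(word_vec, attr_vecs_A, attr_vecs_B):
    """WEAT-style association: mean similarity to attribute list A minus mean similarity to list B."""
    sim_A = np.mean([cosine(word_vec, a) for a in attr_vecs_A])
    sim_B = np.mean([cosine(word_vec, b) for b in attr_vecs_B])
    return sim_A - sim_B
```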
the paper speculated that the test could be applied to study historical cultures via vectorization & cosine similarity analysis of their surviving corpora, provided the corpora were large enough. however, historical corpora are notoriously limited, and the paper gave no guidance as to what size might be technically feasible given the constraints of shallow word embeddings.
this project aims to answer that question, as a proof-of-concept for applying WEAT analysis to very small corpora.
while the original paper used GloVe word embeddings trained on the Common Crawl (a very large corpus), this project uses the Word2Vec algorithm, trained on individual subreddits selected for their relatively small size (20,000 tokens or fewer) and ease of testing biases. for more information on Word2Vec, see this blog post.
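a minimal sketch of training one such model with gensim, assuming a toy tokenized corpus and placeholder hyperparameters (not the settings used for the 18 models here):

```python
from gensim.models import Word2Vec

# toy corpus: one list of tokens per Reddit post or comment
sentences = [
    ["the", "senate", "met", "in", "rome"],
    ["rome", "and", "the", "senate", "declared", "war"],
]

# placeholder hyperparameters for a small corpus (gensim >= 4.0 keyword names)
model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window size
    min_count=1,      # keep every token; these corpora are tiny
    sg=1,             # skip-gram (0 would select CBOW)
    workers=4,
)

# cosine similarity between two tokens in the trained embedding space
print(model.wv.similarity("rome", "senate"))
```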
in addition to `Word2Vec` from `gensim`, this project also makes use of the `praw`, `nltk`, `sklearn` and `matplotlib` libraries.
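the subreddit scraping step relies on praw's Reddit API wrapper; a minimal sketch, assuming a script-type Reddit app and placeholder credentials:

```python
import praw

# placeholder credentials for a script-type Reddit app
reddit = praw.Reddit(
    client_id="oauth_script_id",
    client_secret="oauth_secret",
    username="reddit_username",
    password="reddit_password",
    user_agent="my_weat_app by u/reddit_username",
)

# collect post titles and bodies from one (hypothetical) subreddit
corpus = []
for submission in reddit.subreddit("some_small_subreddit").hot(limit=100):
    corpus.append(submission.title)
    corpus.append(submission.selftext)
```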
code for this project lives in a jupyter notebook here. find the full writeup with results here.
calling the `final_pipeline` function requires several inputs: `subreddit_name`, `app_name`, `oa_script`, `oa_secret`, `pwd`, `vocab_number`, `term_list_A`, `term_list_B`.
`term_list_A` & `term_list_B` represent lists of tokens (as strings) for cosine similarity comparison.
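an illustrative call might look like the following (the example values, keyword order, and return handling are assumptions; see the notebook for the exact signature):

```python
results = final_pipeline(
    subreddit_name="some_small_subreddit",       # hypothetical subreddit
    app_name="my_weat_app",                      # placeholder Reddit app name
    oa_script="oauth_script_id",                 # placeholder OAuth script ID
    oa_secret="oauth_secret",                    # placeholder OAuth secret
    pwd="reddit_password",                       # placeholder account password
    vocab_number=20,                             # assumed: number of top vocabulary words to report
    term_list_A=["pleasant", "happy", "peace"],
    term_list_B=["unpleasant", "awful", "war"],
)
```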
the function returns the trained models, visualizations, cosine similarity comparisons as well as top vocabulary words & other stats from the corpus: