HLT

lexical_diversity_calculator.py

Uses type/token ratio to calculate lexical diversity.

Pre-processing includes tokenising the input, removing stopwords, and applying nltk's Porter Stemmer to obtain word stems.
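The type/token ratio idea can be sketched in a few lines. This is not the script itself: the real version uses nltk's tokeniser, stopword list, and Porter Stemmer, whereas this stand-in uses a regex tokeniser and a tiny hand-made stopword set.

```python
import re

# Illustrative stand-in for nltk's English stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}

def lexical_diversity(text: str) -> float:
    # Tokenise: lowercase alphabetic words only (the script uses nltk here).
    tokens = re.findall(r"[a-z']+", text.lower())
    # Remove stopwords (the script additionally stems with the Porter Stemmer).
    tokens = [t for t in tokens if t not in STOPWORDS]
    if not tokens:
        return 0.0
    # Type/token ratio: number of distinct words divided by total words.
    return len(set(tokens)) / len(tokens)

print(round(lexical_diversity("the cat sat on the mat and the cat slept"), 4))
```

Stemming (which this sketch omits) lowers the ratio further, since inflected forms such as "slept"/"sleep" collapse into one type.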

Run:

python3 lexical_diversity_calculator.py -n SampleTexts/EdSheeranLyrics.txt

Output:

EdSheeranLyrics.txt lexical diversity: 0.2112

word_proportions.py

Finds the proportions of adjectives, verbs, nouns and adverbs in a text. Categorises remaining types as 'other'.

Preprocessing involves tokenisation of input and removal of stopwords.

Uses nltk's part-of-speech (POS) tagger to assign parts of speech to the input tokens. Since nltk's POS tagger was trained on the Treebank Corpus, it uses the Treebank tag set. The script maps the Treebank tags to WordNet tags before outputting the proportions.
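The Treebank-to-WordNet mapping only needs the first letter of each Treebank tag: `J*` maps to adjectives, `V*` to verbs, `N*` to nouns, `R*` to adverbs, and anything else to 'other'. A minimal sketch, with hand-written tagged tokens standing in for the output of `nltk.pos_tag`:

```python
from collections import Counter

# First letter of a Treebank tag determines the WordNet category.
TREEBANK_TO_WORDNET = {"J": "adj", "V": "verb", "N": "noun", "R": "adv"}

def tag_proportions(tagged_tokens):
    # Count each coarse category, defaulting unmapped tags to 'other'.
    counts = Counter(
        TREEBANK_TO_WORDNET.get(tag[0], "other") for _, tag in tagged_tokens
    )
    total = sum(counts.values())
    # Express each category as a percentage of all tagged tokens.
    return {pos: round(100 * n / total, 2) for pos, n in counts.items()}

# Hypothetical tagged tokens, as nltk.pos_tag would produce them.
tagged = [("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("quickly", "RB"), ("over", "IN")]
print(tag_proportions(tagged))
```

With one token per category, each proportion comes out to 20.0 %; "over" (tagged `IN`, a preposition) falls into 'other'.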

Run:

python3 word_proportions.py -n SampleTexts/GulliversTravels.txt

Output:

Adjectives: 7.75 %
Verbs: 17.18 %
Nouns: 22.76 %
Adverbs: 5.6 %
Other: 46.7 %