HLT

lexical_diversity_calculator.py

Uses type/token ratio to calculate lexical diversity.

Pre-processing includes tokenising the input, removing stopwords, and applying nltk's Porter Stemmer to obtain word stems.
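The type/token ratio idea can be sketched in a few lines. This is not the script itself: the real version uses nltk's tokeniser, stopword list, and Porter Stemmer, whereas this stand-in uses a regex tokeniser and a tiny hand-made stopword set.

```python
import re

# Illustrative stand-in for nltk's English stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}

def lexical_diversity(text: str) -> float:
    # Tokenise: lowercase alphabetic words only (the script uses nltk here).
    tokens = re.findall(r"[a-z']+", text.lower())
    # Remove stopwords (the script additionally stems with the Porter Stemmer).
    tokens = [t for t in tokens if t not in STOPWORDS]
    if not tokens:
        return 0.0
    # Type/token ratio: number of distinct words divided by total words.
    return len(set(tokens)) / len(tokens)

print(round(lexical_diversity("the cat sat on the mat and the cat slept"), 4))
```

Stemming (which this sketch omits) lowers the ratio further, since inflected forms such as "slept"/"sleep" collapse into one type.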

Run:

python3 lexical_diversity_calculator.py -n SampleTexts/EdSheeranLyrics.txt

Output:

EdSheeranLyrics.txt lexical diversity: 0.2112

word_proportions.py

Finds the proportions of adjectives, verbs, nouns and adverbs in a text. Categorises remaining types as 'other'.

Preprocessing involves tokenisation of input and removal of stopwords.

Uses nltk's part-of-speech (POS) tagger to assign parts of speech to the input tokens. Since nltk's POS tagger was trained on the Treebank Corpus, it uses the Treebank tag set. The script maps the Treebank tags to WordNet tags before outputting the proportions.
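The Treebank-to-WordNet mapping only needs the first letter of each Treebank tag: `J*` maps to adjectives, `V*` to verbs, `N*` to nouns, `R*` to adverbs, and anything else to 'other'. A minimal sketch, with hand-written tagged tokens standing in for the output of `nltk.pos_tag`:

```python
from collections import Counter

# First letter of a Treebank tag determines the WordNet category.
TREEBANK_TO_WORDNET = {"J": "adj", "V": "verb", "N": "noun", "R": "adv"}

def tag_proportions(tagged_tokens):
    # Count each coarse category, defaulting unmapped tags to 'other'.
    counts = Counter(
        TREEBANK_TO_WORDNET.get(tag[0], "other") for _, tag in tagged_tokens
    )
    total = sum(counts.values())
    # Express each category as a percentage of all tagged tokens.
    return {pos: round(100 * n / total, 2) for pos, n in counts.items()}

# Hypothetical tagged tokens, as nltk.pos_tag would produce them.
tagged = [("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("quickly", "RB"), ("over", "IN")]
print(tag_proportions(tagged))
```

With one token per category, each proportion comes out to 20.0 %; "over" (tagged `IN`, a preposition) falls into 'other'.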

Run:

python3 word_proportions.py -n SampleTexts/GulliversTravels.txt

Output:

Adjectives: 7.75 %
Verbs: 17.18 %
Nouns: 22.76 %
Adverbs: 5.6 %
Other: 46.7 %