This code evaluates how the topics discussed in the literature evolved over the years. We divided the abstracts of the papers into separate corpora according to their publication year and ran a word frequency count for each period.
library(tm)         #Text mining utilities (e.g., stripWhitespace)
library(NLP)        #Natural language processing infrastructure required by tm
library(tidyverse)  #Data manipulation and plotting (dplyr, ggplot2, forcats)
library(tidytext)   #Tokenization and stopword lists
library(readxl)     #Read the Excel sheet
library(textstem)   #Lemmatization
library(skimr)      #Quick data summaries
library(kableExtra) #Table formatting
After defining the keywords and collecting the papers, the main information from each paper was entered into an Excel sheet. Additionally, we created two columns for filtering the papers for the historical context analysis and the specific paper analysis.
systematicreview <- read_excel("Data/table_systematic_review.xlsx")
- ID: Identification of the paper.
- Historical_Analysis: In the filtering process, papers selected for the historical analysis were assigned "1" in this column. Papers entirely out of the scope of the literature review's objective were assigned "0" and not included in this analysis.
- Specific_Analysis: In the filtering process, papers selected for full-paper analysis, i.e., those directly related to the topic of the literature review, were assigned "1". Otherwise, papers were assigned "0" and not included in the full-paper analysis.
- Keyword: The keywords used in the paper.
- Journal: The name of the journal in which the paper was published.
- Title: The title of the published paper.
- Authors: The authors of the published paper.
- Year: The year the paper was published.
- Abstract: The abstract of the published paper.
Note: For more information about the inclusion and exclusion criteria, please check section 2.2 of the paper.
df <- data.frame(systematicreview)
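Since skimr is already loaded, a quick check can confirm that the sheet was read with the columns described above. This is only a sketch; the output depends on the actual spreadsheet:
#Hypothetical sanity check of the imported sheet
skim(df) #Column types, completeness, and summary statistics
table(df$Historical_Analysis, useNA = "ifany") #How many papers are flagged for the historical analysis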
Note: In this code we want to access only the papers that were considered coherent for the historical analysis.
4. Filter only papers that were considered relevant for the analysis, i.e., those with "Historical_Analysis" = 1.
df <- subset(df, Historical_Analysis == "1") #Keep only rows flagged for the historical analysis
A78_85 <- df[df$Year >= 1978 & df$Year <= 1985, ] #Papers published between 1978 and 1985
A78_85 <- A78_85[, 9] #Keep only the Abstract column (column 9)
This cleaning step is essential for the text mining to yield meaningful words that can help detect how topics evolve over time.
#Remove punctuation
text <- gsub(pattern = "\\W", replacement = " ", A78_85)
#Remove numbers (digits)
text2 <- gsub(pattern = "\\d", replacement = " ", text)
#Lowercase words
text3 <- tolower(text2)
#Remove single letters (the text is already lowercase at this point)
text4 <- gsub(pattern = "\\b[a-z]\\b", replacement = " ", text3)
#Remove extra whitespace
text5 <- stripWhitespace(text4)
#Lemmatize terms to their dictionary form
A78_85 <- lemmatize_strings(text5, dictionary = lexicon::hash_lemmas)
A78_85 <- data.frame(A78_85)
colnames(A78_85) <- c("abstract")
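To illustrate what the lemmatizer does, here is a minimal example (the exact output depends on the installed version of lexicon::hash_lemmas):
lemmatize_strings("studies analyzed the changing topics")
## e.g., "study analyze the change topic"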
Tokenization is the process of breaking a chunk of text into smaller parts. In this case, we break each abstract into single words.
tokenizing_abstract <- A78_85 %>%
mutate(id = row_number()) %>% #Create a new id for each abstract
unnest_tokens(word, abstract) %>% #Split each abstract into one word per row
anti_join(stop_words) #Remove English stopwords (tidytext's stop_words lexicon)
## Joining, by = "word"
Some generic academic words (e.g., "study", "paper", "result") appear in almost every abstract and carry no topical signal, so we add them to a custom stopword list:
adicional_stopwords <- tribble(
~word, ~lexicon,
"study", "CUSTOM",
"paper", "CUSTOM",
"article", "CUSTOM",
"base", "CUSTOM",
"result", "CUSTOM",
"information", "CUSTOM",
"research", "CUSTOM")
stop_words2 <- stop_words %>%
bind_rows(adicional_stopwords)
tokenizing_abstract <- A78_85 %>%
mutate(id = row_number()) %>%
unnest_tokens(word, abstract) %>%
anti_join(stop_words2) #Remove both the default and the custom stopwords
## Joining, by = "word"
word_counts <- tokenizing_abstract %>%
count(word) %>%
filter(n > 1) %>% #Keep only words that appear more than once
mutate(word2 = fct_reorder(word, n)) %>% #Order words by frequency for plotting
arrange(desc(n))
ggplot(
word_counts, aes(x = word2, y = n / max(n))
) +
geom_col(fill = "gray75") +
coord_flip() +
labs(
title = "Most frequent words of filtered abstracts: 1978 - 1985",
x = "Words", y = "Normalized frequency - Total of 2 papers")
Note: The process is the same for the other periods. At the end, it is possible to compare the most frequent words in the abstracts of papers from different periods.
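Since the same pipeline repeats for every period, one way to avoid duplicating code is to wrap it in a function. The sketch below is hypothetical (the helper name period_word_counts is ours, not from the paper) and assumes the df and stop_words2 objects defined above:
#Hypothetical helper: run the full cleaning/tokenization pipeline for one period
period_word_counts <- function(df, from, to) {
  abstracts <- df$Abstract[df$Year >= from & df$Year <= to]
  text <- gsub("\\W", " ", abstracts) #Remove punctuation
  text <- gsub("\\d", " ", text) #Remove digits
  text <- tolower(text) #Lowercase words
  text <- gsub("\\b[a-z]\\b", " ", text) #Remove single letters
  text <- stripWhitespace(text) #Remove extra whitespace
  text <- lemmatize_strings(text, dictionary = lexicon::hash_lemmas)
  data.frame(abstract = text) %>%
    mutate(id = row_number()) %>%
    unnest_tokens(word, abstract) %>%
    anti_join(stop_words2, by = "word") %>%
    count(word) %>%
    filter(n > 1) %>%
    mutate(word2 = fct_reorder(word, n)) %>%
    arrange(desc(n))
}
#Example usage for a later period:
#word_counts_86_95 <- period_word_counts(df, 1986, 1995)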
Alternatively, the complete code used for the paper, covering all periods, can be accessed here.