This code evaluates how the topics discussed in the literature evolved over the years. We divided the abstracts of the papers into separate corpora according to their publication year and ran a word frequency count for each period.
library(tm)         #Text mining utilities (e.g., stripWhitespace)
library(NLP)        #Natural language processing infrastructure required by tm
library(tidyverse)  #Data manipulation and plotting (dplyr, ggplot2, forcats)
library(tidytext)   #Tokenization and stopword lists
library(readxl)     #Read the Excel sheet
library(textstem)   #Lemmatization
library(skimr)      #Quick data summaries
library(kableExtra) #Table formatting
After defining the keywords and collecting the papers, the main information from each paper was entered into an Excel sheet. Additionally, we created two columns for filtering the papers for the historical context analysis and the specific paper analysis.
systematicreview <- read_excel("Data/table_systematic_review.xlsx")
- ID: Identification of the paper.
- Historical_Analysis: In the filtering process, papers selected for the historical analysis were assigned "1" in this column. Papers entirely out of the scope of the literature review's objective were assigned "0" and not included in this analysis.
- Specific_Analysis: In the filtering process, papers selected for full-paper analysis, i.e., those directly related to the topic of the literature review, were assigned "1". Otherwise, papers were assigned "0" and not included in the full-paper analysis.
- Keyword: The keywords used in the paper.
- Journal: The name of the journal in which the paper was published.
- Title: The title of the published paper.
- Authors: The authors of the published paper.
- Year: The year the paper was published.
- Abstract: The abstract of the published paper.
Note: For more information about the inclusion and exclusion criteria, please check section 2.2 of the paper.
df <- data.frame(systematicreview)
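Since skimr is already loaded, a quick check can confirm that the sheet was read with the columns described above. This is only a sketch; the output depends on the actual spreadsheet:
#Hypothetical sanity check of the imported sheet
skim(df) #Column types, completeness, and summary statistics
table(df$Historical_Analysis, useNA = "ifany") #How many papers are flagged for the historical analysis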
Note: In this code we want to access only the papers that were considered coherent for the historical analysis.
4. Filter only papers that were considered relevant for the analysis, i.e., those with "Historical_Analysis" = 1.
df <- subset(df, Historical_Analysis == "1") #Keep only rows flagged for the historical analysis
A78_85 <- df[df$Year >= 1978 & df$Year <= 1985, ] #Papers published between 1978 and 1985
A78_85 <- A78_85[, 9] #Keep only the Abstract column (column 9)
This cleaning step is essential for the text mining to yield meaningful words that can help detect how topics evolve over time.
#Remove punctuation
text <- gsub(pattern = "\\W", replacement = " ", A78_85)
#Remove numbers (digits)
text2 <- gsub(pattern = "\\d", replacement = " ", text)
#Lowercase words
text3 <- tolower(text2)
#Remove single letters (the text is already lowercase at this point)
text4 <- gsub(pattern = "\\b[a-z]\\b", replacement = " ", text3)
#Remove extra whitespace
text5 <- stripWhitespace(text4)
#Lemmatize terms to their dictionary form
A78_85 <- lemmatize_strings(text5, dictionary = lexicon::hash_lemmas)
A78_85 <- data.frame(A78_85)
colnames(A78_85) <- c("abstract")
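To illustrate what the lemmatizer does, here is a minimal example (the exact output depends on the installed version of lexicon::hash_lemmas):
lemmatize_strings("studies analyzed the changing topics")
## e.g., "study analyze the change topic"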
Tokenization is the process of breaking a chunk of text into smaller parts. In this case, we break each abstract into single words.
tokenizing_abstract <- A78_85 %>%
mutate(id = row_number()) %>% #Create a new id for each abstract
unnest_tokens(word, abstract) %>% #Split each abstract into one word per row
anti_join(stop_words) #Remove English stopwords (tidytext's stop_words lexicon)
## Joining, by = "word"
Some generic academic words (e.g., "study", "paper", "result") appear in almost every abstract and carry no topical signal, so we add them to a custom stopword list:
adicional_stopwords <- tribble(
~word, ~lexicon,
"study", "CUSTOM",
"paper", "CUSTOM",
"article", "CUSTOM",
"base", "CUSTOM",
"result", "CUSTOM",
"information", "CUSTOM",
"research", "CUSTOM")
stop_words2 <- stop_words %>%
bind_rows(adicional_stopwords)
tokenizing_abstract <- A78_85 %>%
mutate(id = row_number()) %>%
unnest_tokens(word, abstract) %>%
anti_join(stop_words2) #Remove both the default and the custom stopwords
## Joining, by = "word"
word_counts <- tokenizing_abstract %>%
count(word) %>%
filter(n > 1) %>% #Keep only words that appear more than once
mutate(word2 = fct_reorder(word, n)) %>% #Order words by frequency for plotting
arrange(desc(n))
ggplot(
word_counts, aes(x = word2, y = n / max(n))
) +
geom_col(fill = "gray75") +
coord_flip() +
labs(
title = "Most frequent words of filtered abstracts: 1978 - 1985",
x = "Words", y = "Normalized frequency - Total of 2 papers")
Note: The process is the same for the other periods. At the end, it is possible to compare the most frequent words in the abstracts of papers from different periods.
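Since the same pipeline repeats for every period, one way to avoid duplicating code is to wrap it in a function. The sketch below is hypothetical (the helper name period_word_counts is ours, not from the paper) and assumes the df and stop_words2 objects defined above:
#Hypothetical helper: run the full cleaning/tokenization pipeline for one period
period_word_counts <- function(df, from, to) {
  abstracts <- df$Abstract[df$Year >= from & df$Year <= to]
  text <- gsub("\\W", " ", abstracts) #Remove punctuation
  text <- gsub("\\d", " ", text) #Remove digits
  text <- tolower(text) #Lowercase words
  text <- gsub("\\b[a-z]\\b", " ", text) #Remove single letters
  text <- stripWhitespace(text) #Remove extra whitespace
  text <- lemmatize_strings(text, dictionary = lexicon::hash_lemmas)
  data.frame(abstract = text) %>%
    mutate(id = row_number()) %>%
    unnest_tokens(word, abstract) %>%
    anti_join(stop_words2, by = "word") %>%
    count(word) %>%
    filter(n > 1) %>%
    mutate(word2 = fct_reorder(word, n)) %>%
    arrange(desc(n))
}
#Example usage for a later period:
#word_counts_86_95 <- period_word_counts(df, 1986, 1995)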
Alternatively, the complete code used for the paper, covering all periods, can be accessed here.