Coarse-Grained Word Sense Disambiguation (WSD)
This repository is dedicated to the coarse-grained Word Sense Disambiguation (WSD) task, which constitutes my Homework 2 for the course MULTILINGUAL NATURAL LANGUAGE PROCESSING, taught by Professor Roberto Navigli, as part of my Master’s in AI and Robotics at Sapienza University of Rome. The project explores two distinct approaches to Coarse-Grained WSD:
- Baseline Model (Bidirectional LSTM with GloVe pre-trained embeddings): This approach evaluates the performance of a Bidirectional LSTM model enhanced with GloVe pre-trained word embeddings, providing a foundation for understanding the baseline capabilities in coarse-grained WSD.
- Transformer Architecture (RoBERTa): This method leverages the RoBERTa model, a robust transformer-based architecture, to explore advanced techniques in coarse-grained WSD and compare its performance against the baseline model.
Important
For Homework 1 on Event Detection as a part of my Multilingual NLP course, please visit the Event Detection Repository.
In human language, words are often used in several contexts, and many Natural Language Processing applications need to understand these different usage patterns. The same word can have several meanings depending on the context. Such “difficult” words are generally referred to as homonyms.
Let's take the word "bark" as an example:
- “Cinnamon comes from the bark of the Cinnamon tree.”
- “The dog barked at the stranger.”
Both sentences use a form of the word “bark”, but sentence 1 refers to the outer covering of a tree, while sentence 2 refers to the sound made by a dog.
It is therefore clear that the same word can have multiple meanings depending on how it is used in a sentence. Much of a word's meaning is defined by its usage. When working with text data in NLP, however, we need a way to tell a word's different meanings apart. This is where WSD (Word Sense Disambiguation) comes into play.
Determining the correct meaning or "sense" of a word in a specific context is a challenge in Natural Language Processing (NLP), and it is known as word sense disambiguation (WSD). Since many words in natural language have numerous meanings, it can be difficult to determine the intended meaning. WSD aims to resolve this issue.
To better understand the challenge of WSD, let's consider the phrase "light bulb." Without any context, it could refer to an object that produces light when connected to electricity. However, with additional context, the meaning can change. In the sentence "He had a brilliant idea; a light bulb went off in his head," the term "light bulb" is used metaphorically to represent a sudden understanding or realization. The context surrounding the phrase disambiguates its sense.
In the context of Word Sense Disambiguation (WSD), coarse-grained and fine-grained approaches refer to the level of detail in distinguishing between different senses of a word.
A coarse-grained approach involves distinguishing between broader, more general senses of a word. This approach groups similar senses into larger categories, focusing on major sense distinctions rather than detailed nuances. It is useful when the goal is to achieve general understanding or when resources are limited.
For the word "bank," a coarse-grained approach might categorize its senses into general types like "financial institution" and "side of a river," without differentiating between specific types of financial institutions or riverbanks.
Note
In our project, we utilize a coarse-grained dataset, which means that our focus is on identifying broad sense categories rather than specific nuances. This choice aligns with the project's goals and constraints, enabling us to develop a system that performs well with a general understanding of word senses.
A fine-grained approach, in contrast, aims to identify more specific and detailed senses of a word. This approach involves distinguishing between subtle variations and specific meanings, providing a more precise understanding of word usage. It is beneficial for applications requiring detailed semantic distinctions.
For the word "bank," a fine-grained approach would differentiate between various types of financial institutions (e.g., "investment bank," "commercial bank") and specific types of riverbanks (e.g., "muddy bank," "rocky bank"), capturing detailed nuances of each sense.
Here I use the coarse-grained file to solve WSD. Each entry in the file has several inputs (instance_ids, words, lemmas, pos_tags, and candidates) and one output (senses).
To keep the program as simple as possible, I use only the words as inputs and the senses as outputs.
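As a rough illustration of this simplification, the sketch below reduces one record to its words and senses. The field names come from the description above, but the nesting of the record, the position-keyed dictionaries, and the sense labels (`bark.covering`, `bark.sound`) are assumptions invented for this example, not the actual dataset format.

```python
# Hypothetical record: field names follow the text, but the layout and the
# sense labels are made up for illustration only.
record = {
    "instance_ids": {"4": "example.t000"},
    "words": ["Cinnamon", "comes", "from", "the", "bark", "of", "the", "tree", "."],
    "lemmas": ["cinnamon", "come", "from", "the", "bark", "of", "the", "tree", "."],
    "pos_tags": ["NOUN", "VERB", "ADP", "DET", "NOUN", "ADP", "DET", "NOUN", "PUNCT"],
    "candidates": {"4": ["bark.covering", "bark.sound"]},
    "senses": {"4": ["bark.covering"]},
}


def extract_words_and_senses(record):
    """Keep only the words (input) and the gold sense per target position (output)."""
    words = list(record["words"])
    # Assumed layout: each target position maps to a list holding its gold sense.
    senses = {int(position): labels[0] for position, labels in record["senses"].items()}
    return words, senses


words, senses = extract_words_and_senses(record)
print(words[4], senses[4])
```

The remaining fields (lemmas, pos_tags, candidates) are simply ignored here, which mirrors the choice described above.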
This flow diagram illustrates the complete process of the WSD task.
In this task, I am considering two different approaches:
- Baseline Model (Bidirectional LSTM + GloVe pre-trained embedding)
- Transformer Architecture (RoBERTa)
The approach described utilizes a Bidirectional Long Short-Term Memory (Bi-LSTM) network architecture to capture contextual information from both preceding and subsequent words in a sentence. The model consists of two LSTM layers, with one processing the input sequence forward and the other processing it backward. Pre-trained GloVe embeddings, which provide dense vector representations of words based on co-occurrence statistics, are used as input for the Bi-LSTM. The GloVe embeddings have 300 dimensions and are kept fixed during training. To prevent overfitting, a dropout of 0.4 is applied to the Bi-LSTM's output. Finally, a fully connected layer with softmax activation is used to classify the data.
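The architecture described above could be sketched roughly as follows. This is a minimal sketch, assuming PyTorch; the class name `BiLSTMTagger`, the hidden size of 256, and the random matrix standing in for the GloVe vectors are all assumptions, since the text does not specify them. The model returns raw scores per token; in PyTorch the softmax is usually folded into the cross-entropy loss rather than applied inside the model.

```python
import torch
import torch.nn as nn


class BiLSTMTagger(nn.Module):
    """Frozen embeddings -> bidirectional LSTM -> dropout -> linear classifier."""

    def __init__(self, embeddings, num_senses, hidden_size=256, dropout=0.4, pad_index=0):
        super().__init__()
        # freeze=True keeps the pre-trained vectors fixed during training.
        self.embedding = nn.Embedding.from_pretrained(
            embeddings, freeze=True, padding_idx=pad_index
        )
        # bidirectional=True runs one forward and one backward LSTM.
        self.lstm = nn.LSTM(
            embeddings.size(1), hidden_size, batch_first=True, bidirectional=True
        )
        self.dropout = nn.Dropout(dropout)
        # Forward and backward states are concatenated, hence 2 * hidden_size.
        self.classifier = nn.Linear(2 * hidden_size, num_senses)

    def forward(self, word_ids):
        embedded = self.embedding(word_ids)   # (batch, seq_len, 300)
        outputs, _ = self.lstm(embedded)      # (batch, seq_len, 2 * hidden_size)
        return self.classifier(self.dropout(outputs))  # (batch, seq_len, num_senses)


# Stand-in for a real GloVe matrix: 50 words, 300 dimensions (random values).
fake_glove = torch.randn(50, 300)
model = BiLSTMTagger(fake_glove, num_senses=10)
scores = model(torch.randint(0, 50, (2, 7)))  # batch of 2 sentences, 7 tokens each
print(scores.shape)
```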
- The words appear in different cases, so I convert all of them to lowercase.
- Next, we build a list of the words and a list of the senses found in the training data. These lists are used to convert words and senses into numbers and to handle the out-of-vocabulary (OOV) problem.
- The model aims to classify words into senses, but the number of words and the number of senses in a sentence don't match. To address this, a dummy sense key 'PADDING' is assigned to words that are not homonyms. Padding is also applied to handle variable-length inputs, using the same 'PADDING' key for both inputs and outputs.
- The use of the same key helps in the loss function (categorical cross-entropy) by ignoring only one key instead of two. This approach allows the model to focus on learning sense keys and not the 'PADDING' sense key.
- Words and senses are converted into numbers using pre-built lists generated from the training data, resolving out-of-vocabulary (OOV) issues.
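The preprocessing steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the repository's code: the `UNKNOWN` key used for out-of-vocabulary items, the helper names, and the toy sentences with their sense labels are assumptions.

```python
PAD = "PADDING"   # shared key for padded positions and non-homonym words
UNK = "UNKNOWN"   # hypothetical key for items not seen in the training data


def build_index(sequences):
    """Map every item seen in the training data to an integer id."""
    index = {PAD: 0, UNK: 1}
    for sequence in sequences:
        for item in sequence:
            index.setdefault(item, len(index))
    return index


def token_labels(words, senses):
    """One label per word: the sense key for target words, PAD otherwise."""
    return [senses.get(position, PAD) for position in range(len(words))]


def encode(sequence, index, max_len):
    """Convert items to ids, fall back to UNK, and pad to max_len."""
    ids = [index.get(item, index[UNK]) for item in sequence[:max_len]]
    return ids + [index[PAD]] * (max_len - len(ids))


# Toy training data: (words, {target position: sense key}); labels are made up.
train = [
    (["The", "dog", "barked"], {2: "bark.sound"}),
    (["Bark", "of", "the", "tree"], {0: "bark.covering"}),
]

sentences = [[word.lower() for word in words] for words, _ in train]
labels = [token_labels(words, senses) for words, senses in train]

word_index = build_index(sentences)
sense_index = build_index(labels)
max_len = max(len(sentence) for sentence in sentences)

x = [encode(sentence, word_index, max_len) for sentence in sentences]
y = [encode(label, sense_index, max_len) for label in labels]
print(x)
print(y)
```

Because 'PADDING' maps to id 0 in both indexes, the loss only needs to skip a single label id to ignore both padded positions and non-homonym words.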
This flow diagram illustrates the complete process of the Baseline Model.
The second model I use is the Robustly Optimized BERT Approach (RoBERTa) Transformer. Bidirectional LSTM (BiLSTM) models are effective at capturing contextual dependencies sequentially by processing the input in both forward and backward directions. In WSD, however, a BiLSTM fed with pre-trained GloVe embeddings starts from the same embedding for every occurrence of a word, even when the contexts differ. Consider two sentences: "He didn't receive fair treatment" and "Fun fair in New York City this summer". The word 'fair' has a different meaning in each sentence, but GloVe assigns both occurrences the same vector. The transformer-based model RoBERTa, on the other hand, uses self-attention to relate each word to the other words in the sentence, so the representation of a word depends on its context.
- The words appear in different cases, so I convert all of them to lowercase.
- Next, we build a list of the words and a list of the senses found in the training data. These lists are used to convert words and senses into numbers and to handle the out-of-vocabulary (OOV) problem.
- Inputs (words) are in lowercase and need to be tokenized. Tokenization is crucial for RoBERTa models, and the 'roberta-base' tokenizer is used. 'roberta-base' utilizes subword tokenization (BPE variant) that divides words into subword units based on frequency. Tokenization is applied to inputs using the Datasets class.
- Subword tokenization changes the sequence length, so the outputs (senses) no longer line up with the inputs. An align_label() function is used to assign sense keys only to homonyms, while non-homonyms are assigned the 'PADDING' value (-100). This keeps the sense key list the same length as the tokenized input.
- The -100 values are ignored during training with categorical cross-entropy loss.
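The alignment and masking steps above might look roughly like the sketch below. It is a guess at the idea behind `align_label()`, not the author's implementation: the function names, the choice to label only the first subword of each word, and the example `word_ids` are assumptions. The `word_ids` list imitates what Hugging Face fast tokenizers return from `word_ids()`, where special tokens map to `None`. The loss is written in plain Python only to make the masking explicit; PyTorch's `CrossEntropyLoss` applies the same rule through its default `ignore_index=-100`.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss


def align_labels(word_ids, word_labels):
    """Spread word-level labels over subword tokens.

    Special tokens (word id None) and every subword after the first one of a
    word receive IGNORE_INDEX, so each word is scored at most once.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(score - top) for score in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0


# Hypothetical tokenization: <s>, word 0, word 1 split in two, word 2, </s>.
word_ids = [None, 0, 1, 1, 2, None]
# Word-level labels: only word 1 is a homonym, with sense id 3.
word_labels = [IGNORE_INDEX, 3, IGNORE_INDEX]

aligned = align_labels(word_ids, word_labels)
logits = [[0.0, 0.0, 0.0, 0.0] for _ in aligned]  # uniform scores over 4 senses
print(aligned)
print(masked_cross_entropy(logits, aligned))
```

Only one position contributes to the loss in this example, so padded positions, special tokens, and non-homonyms have no effect on the gradient.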
This flow diagram illustrates the complete process of the RoBERTa Model.
Models | Accuracy (Validation Data) | Accuracy (Test Data) |
---|---|---|
Baseline Model (Bi-LSTM + pre-trained GloVe word embeddings) | 72.7% | 71.2% |
RoBERTa Transformer | 90.1% | 89.0% |
The Coarse-Grained Word Sense Disambiguation (WSD) project highlights the effectiveness of modern transformer models on WSD tasks. Our baseline model, a Bi-LSTM architecture with pre-trained GloVe embeddings, provided a solid starting point with an accuracy of 71.2% on the test data. The RoBERTa Transformer outperformed the baseline by 17.8 percentage points, reaching 89.0% accuracy on the same data. This improvement reflects the ability of transformer-based models to capture contextual nuances in language, which makes them well suited to WSD applications. Future work will explore further optimizations and potential integrations with other advanced NLP frameworks.