This repository contains datasets, data processing code, model descriptions, and a datasheet for the benchmark used for 'TRAM: Benchmarking Temporal Reasoning in Large Language Models'.
TRAM encompasses ten temporal reasoning tasks, presented as multiple-choice questions (MCQs) across a range of time-related domains. For clarity, we ensure that each question has only one correct answer. TRAM draws on existing natural language understanding datasets, human-crafted templates and questions, web sources, and program generation. Answers have been derived through a combination of expert annotations and programmatic generation. The benchmark includes 526,668 problems in total. For each dataset, we introduce a few-shot development set, with 5 questions per category, and a separate test set for evaluation. All datasets used in the experiments are available in the "./datasets" folder. Overview of the ten tasks included in the benchmark:
[1] Zhou et al., 2019, [2] Rajpurkar et al., 2016, [3] Uzzaman et al., 2013, [4] Williams et al., 2018, [5] Bowman et al., 2015, [6] Roemmele et al., 2011, [7] Mostafazadeh et al., 2016, [8] Mostafazadeh et al., 2017
Note: The “Data Size” column aggregates totals from both the development and test sets. “K-Way MC” signifies a multiple-choice response format with K options. “Amb. Res.” denotes Ambiguity Resolution. “NLI” stands for natural language inference. “Same” indicates the text source is the same as the row above.
For more details, please refer to the paper.
We evaluate the performance of several well-known language models on the TRAM benchmark. The evaluation is organized into two main categories. In the first category, we consider four popular large language models (LLMs): the open-source model Llama-2-13b-chat, and the closed-source models PaLM-bison-chat, GPT-3.5-turbo, and GPT-4. We evaluate each model using two prompting strategies: standard prompting (SP) and chain-of-thought (CoT) prompting. Under both strategies, the models are tested in zero-shot and 5-shot settings. For all models, we apply greedy decoding (i.e., temperature = 0) for response generation. Each model is accessed through its API using the corresponding API key.
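The four prompting configurations described above (SP or CoT, each in zero-shot or 5-shot form) can be sketched as a small prompt builder. This is only an illustration, not the prompt wording used in the paper: the record layout (`question`, `options`, `answer`, optional `rationale`), the function names, and the "Let's think step by step" trigger are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical record layout (an assumption, not the repository's schema):
# {"question": str, "options": [str, ...], "answer": "A", "rationale": str (optional)}
LETTERS = "ABCDEFGHIJ"


def format_question(item: Dict, with_answer: bool = False, cot: bool = False) -> str:
    """Render one MCQ; exemplars carry their answer, the target question does not."""
    lines = [f"Question: {item['question']}"]
    for letter, option in zip(LETTERS, item["options"]):
        lines.append(f"({letter}) {option}")
    if with_answer:
        if cot and "rationale" in item:
            # CoT exemplars show the reasoning before the final answer.
            lines.append(f"Answer: {item['rationale']} So the answer is ({item['answer']}).")
        else:
            lines.append(f"Answer: ({item['answer']})")
    else:
        # Assumed CoT trigger phrase; standard prompting leaves the answer slot empty.
        lines.append("Answer: Let's think step by step." if cot else "Answer:")
    return "\n".join(lines)


def build_prompt(item: Dict, shots: Optional[List[Dict]] = None, cot: bool = False) -> str:
    """Zero-shot when `shots` is empty or None; few-shot (e.g. 5-shot) otherwise."""
    parts = [format_question(s, with_answer=True, cot=cot) for s in (shots or [])]
    parts.append(format_question(item, cot=cot))
    return "\n\n".join(parts)
```

In a 5-shot run, `shots` would hold five questions taken from the development set, and the resulting string would be sent to the model with temperature set to 0.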
In the second category, we consider minimal supervision as opposed to traditional fully supervised learning in order to establish baseline evaluations. Specifically, we employ four representative BERT-style models: BERT-base, BERT-large, RoBERTa-base, and RoBERTa-large. For the temporal NLI task, we employ the Sequence Classification variants of BERT and RoBERTa from Hugging Face (i.e., BertForSequenceClassification and RobertaForSequenceClassification), given their suitability for the task's structure. For the other tasks, we use the Multiple Choice variants of BERT and RoBERTa from Hugging Face (i.e., BertForMultipleChoice and RobertaForMultipleChoice).
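The multiple-choice heads expect inputs shaped (batch_size, num_choices, sequence_length) and return one logit per choice, so each question is paired with every option before tokenization. The sketch below shows only that pairing step and the final argmax, in plain Python so it runs without model weights. The helper names are hypothetical and this is not the repository's preprocessing code.

```python
from typing import List, Sequence, Tuple


def to_choice_pairs(question: str, options: List[str]) -> List[Tuple[str, str]]:
    """Expand one MCQ into (question, option) pairs, one per choice.

    A tokenizer would encode each pair as a two-segment input; stacking the
    encodings yields the num_choices axis that the multiple-choice head expects.
    """
    return [(question, option) for option in options]


def pick_choice(logits: Sequence[float]) -> int:
    """Return the index of the highest-scoring choice (argmax over one question)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Usage with made-up scores standing in for model logits.
pairs = to_choice_pairs(
    "How long does a typical lunch break last?",
    ["30 seconds", "1 hour", "3 days"],
)
predicted = pick_choice([-1.2, 2.7, -0.4])
print(pairs[predicted][1])
```

Temporal NLI has a fixed label set rather than per-question options, which is why a sequence classification head fits that task without this expansion.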