
a prompt-based within-document event coreference resolution system, trained and evaluated on the KBP corpus.


CorefPrompt: A Prompt-based Event Coreference Resolver

This code was used in the paper:

"CorefPrompt: Prompt-based Event Coreference Resolution by Measuring Event Type and Argument Compatibilities"
Sheng Xu, Peifeng Li and Qiaoming Zhu. EMNLP 2023.

CorefPrompt is a simple prompt-based model that predicts whether a pair of event mentions is coreferential. The model was trained and evaluated on the KBP corpus.


It first utilizes a prefix template $\mathcal{T}_{pre}$ to inform the pre-trained model what to focus on when encoding, then marks event types and arguments by inserting anchor templates $\mathcal{T}_{anc}$ around event mentions, and finally demonstrates the reasoning process of coreference prediction using an inference template $\mathcal{T}_{inf}$ which introduces two auxiliary prompt tasks, event-type compatibility and argument compatibility.
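The three templates can be illustrated with a minimal sketch. The template wording below is copied from the example prompt printed in the "How to use" section; the function names, the event dictionaries, and the choice to join multiple arguments with ", " are assumptions made for illustration and may differ from the actual implementation in this repo.

```python
# Sketch only: helper names and the argument-joining rule are assumptions.

def mark(trigger, idx):
    """Wrap an event trigger in the special start/end marker tokens."""
    return f"<e{idx}_start> {trigger} <e{idx}_end>"


def prefix_template(e1, e2):
    """T_pre: tell the encoder which two events to focus on."""
    return (
        "In the following text, the focus is on the events expressed by "
        f"{mark(e1['trigger'], 1)} and {mark(e2['trigger'], 2)}, and it needs to "
        "judge whether they refer to the same or different events:"
    )


def anchor_template(event, idx):
    """T_anc: state the (masked) event type and the arguments of one event."""
    participants = [a["mention"] for a in event["args"] if a["role"] == "participant"]
    places = [a["mention"] for a in event["args"] if a["role"] == "place"]
    text = f"Here {mark(event['trigger'], idx)} expresses a <mask> event"
    if participants:
        text += " with " + ", ".join(participants) + " as participants"
    if places:
        text += " at " + ", ".join(places)
    return text + "."


def inference_template(e1, e2):
    """T_inf: masked type compatibility, argument compatibility, and coreference."""
    return (
        "In conclusion, the events expressed by "
        f"{mark(e1['trigger'], 1)} and {mark(e2['trigger'], 2)} have <mask> event type "
        "and <mask> participants, so they refer to <mask> event."
    )


# Two illustrative events in the same dictionary format used later in this README.
ev_a = {"trigger": "suicide",
        "args": [{"mention": "Former Pakistani dancing girl", "role": "participant"}]}
ev_b = {"trigger": "death",
        "args": [{"mention": "her", "role": "participant"},
                 {"mention": "sixth floor Rome building", "role": "place"}]}

print(prefix_template(ev_a, ev_b))
print(anchor_template(ev_a, 1))
print(anchor_template(ev_b, 2))
print(inference_template(ev_a, ev_b))
```

In the full prompt, the prefix comes first, each anchor sentence is placed after the sentence containing its event mention, and the inference template closes the input.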

Set up

Set up a Python virtual environment and install packages using the requirements file:

conda create -n corefprompt python=3.9
conda activate corefprompt
python3 -m pip install -r requirements.txt

How to use

It is easy to use our model to predict event coreferences.

For example, consider the following text, which contains seven event mentions (ev1-ev7) and ten entity mentions (arg1-arg10) that serve as arguments.


Among them,

  • the death event mention $ev_1$ with the argument $arg_1$ and the death event mention $ev_5$ with the arguments $arg_5, arg_6,$ and $arg_7$ are coreferential, as both of them describe the girl's suicide by jumping off a building;
  • the injury event mention $ev_3$ with the arguments $arg_2$ and $arg_3$ and the injury event $ev_4$ with the argument $arg_4$ are coreferential, as both of them describe the girl's disfigurement;
  • other event mentions are singletons.

Let's apply CorefPrompt to predict the coreferences among these events. Note that for event arguments, we only consider participants and locations.

document = 'Former Pakistani dancing girl commits suicide 12 years after horrific acid attack which left her looking "not human". She had undergone 39 separate surgeries to repair damage. Leapt to her death from sixth floor Rome building earlier this month. Her ex-husband was charged with attempted murder in 2002 but has since been acquitted.'

ev1 = {
    'offset': 38,
    'trigger': 'suicide',
    'args': [
        {'mention': 'Former Pakistani dancing girl', 'role': 'participant'}
    ]
}
ev3 = {
    'offset': 88,
    'trigger': 'left',
    'args': [
        {'mention': 'acid', 'role': 'participant'},
        {'mention': 'her', 'role': 'participant'}
    ]
}
ev4 = {
    'offset': 168,
    'trigger': 'damage',
    'args': [
        {'mention': 'She', 'role': 'participant'}
    ]
}
ev5 = {
    'offset': 189,
    'trigger': 'death',
    'args': [
        {'mention': 'her', 'role': 'participant'},
        {'mention': 'sixth floor Rome building', 'role': 'place'}
    ]
}

First, we need to create a CorefPrompt model by loading our provided checkpoint coref-prompt-large (can be downloaded from Google Drive):

from corefprompt import CorefPrompt

model_checkpoint = './coref-prompt-large'
coref_model = CorefPrompt(model_checkpoint)

We provide two functions to predict event coreferences:

  • predict_coref(event1, event2): suitable for processing multiple event pairs located in the same document. It is necessary to first load the corresponding document using the init_document(doc) function.
  • predict_coref_in_doc(document, event1, event2): directly predict coreference between the event pair located in a document.

Here are some usage examples:

# directly predict coreference for an event pair in a document
res = coref_model.predict_coref_in_doc(document, ev1, ev5)
print('[Prompt]:', res['prompt'])
print(f"ev1[{ev1['trigger']}] - ev5[{ev5['trigger']}]: {res['label']} ({res['probability']})")
[Prompt]: In the following text, the focus is on the events expressed by <e1_start> suicide <e1_end> and <e2_start> death <e2_end>, and it needs to judge whether they refer to the same or different events: Former Pakistani dancing girl commits <e1_start> suicide <e1_end> 12 years after horrific acid attack which left her looking "not human". Here <e1_start> suicide <e1_end> expresses a <mask> event with Former Pakistani dancing girl as participants. She had undergone 39 separate surgeries to repair damage. Leapt to her <e2_start> death <e2_end> from sixth floor Rome building earlier this month. Here <e2_start> death <e2_end> expresses a <mask> event with her as participants at sixth floor Rome building. Her ex-husband was charged with attempted murder in 2002 but has since been acquitted. In conclusion, the events expressed by <e1_start> suicide <e1_end> and <e2_start> death <e2_end> have <mask> event type and <mask> participants, so they refer to <mask> event.

ev1[suicide] - ev5[death]: coref (0.9997438788414001)
# predict event pairs in the same document
coref_model.init_document(document)  # load the document once before calling predict_coref
res = coref_model.predict_coref(ev1, ev5)
print(f"ev1[{ev1['trigger']}] - ev5[{ev5['trigger']}]: {res['label']} ({res['probability']})")
res = coref_model.predict_coref(ev1, ev3)
print(f"ev1[{ev1['trigger']}] - ev3[{ev3['trigger']}]: {res['label']} ({res['probability']})")
res = coref_model.predict_coref(ev1, ev4)
print(f"ev1[{ev1['trigger']}] - ev4[{ev4['trigger']}]: {res['label']} ({res['probability']})")
res = coref_model.predict_coref(ev3, ev4)
print(f"ev3[{ev3['trigger']}] - ev4[{ev4['trigger']}]: {res['label']} ({res['probability']})")
ev1[suicide] - ev5[death]: coref (0.9997438788414001)
ev1[suicide] - ev3[left]: non-coref (0.9977204203605652)
ev1[suicide] - ev4[damage]: non-coref (0.9989845156669617)
ev3[left] - ev4[damage]: coref (0.999984622001648)
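
To score every event pair in a document rather than a few hand-picked ones, the two documented calls, init_document and predict_coref, can be wrapped in a small loop. This is only a sketch: predict_all_pairs is a hypothetical helper that is not part of this repo, and sorting events by offset so that the earlier mention is passed first is an assumption based on the examples above.

```python
from itertools import combinations


def predict_all_pairs(coref_model, document, events):
    """Run pairwise prediction over all event pairs of one document.

    coref_model: any object exposing init_document(doc) and predict_coref(e1, e2).
    events: dict mapping a name (e.g. 'ev1') to an event dictionary.
    Returns a dict mapping (name1, name2) to the prediction result.
    """
    # Load the document once; predict_coref then reuses it for every pair.
    coref_model.init_document(document)
    # Assumption: pass the earlier mention first, as in the README examples.
    ordered = sorted(events.items(), key=lambda item: item[1]["offset"])
    results = {}
    for (name1, event1), (name2, event2) in combinations(ordered, 2):
        results[(name1, name2)] = coref_model.predict_coref(event1, event2)
    return results
```

With the objects defined above, predict_all_pairs(coref_model, document, {'ev1': ev1, 'ev3': ev3, 'ev4': ev4, 'ev5': ev5}) would return six results, one per event pair.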

You can modify the file to try it out!

Training & Evaluation on the KBP corpus


Download the pre-trained model weights used in our experiments from the Hugging Face Model Hub:


Note: this script will save all downloaded weights in ./PT_MODELS/.

Download the evaluation script

Coreference results are obtained using the official Reference Coreference Scorer. This scorer reports results in terms of AVG-F, which is the unweighted average of the F-scores of four commonly used coreference evaluation metrics, namely $\text{MUC}$ (Vilain et al., 1995), $\text{B}^3$ (Bagga and Baldwin, 1998), $\text{CEAF}_e$ (Luo, 2005) and $\text{BLANC}$ (Recasens and Hovy, 2011).
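
As a quick illustration of how AVG-F is formed, the sketch below averages four F-scores. The helper name avg_f is hypothetical, and the input values are the CorefPrompt row of the results table at the end of this README, assuming its first four columns are the four metric F-scores. The official scorer computes the per-metric scores themselves; this only shows the final averaging step.

```python
def avg_f(muc, b3, ceaf_e, blanc):
    """Unweighted average of the four coreference F-scores."""
    return (muc + b3 + ceaf_e + blanc) / 4


# CorefPrompt row of the results table (assumed column order: MUC, B^3, CEAF_e, BLANC).
score = avg_f(45.3, 57.5, 59.9, 42.3)
print(f"AVG-F = {score:.2f}")
```

The result is about 51.25, consistent with the 51.3 reported in the last column of the table.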

Run (from inside the repo):

cd ./
git clone

Prepare the dataset

This repo assumes access to the English corpora used in the TAC KBP Event Nugget Detection and Coreference task (i.e., KBP 2015, KBP 2016, and KBP 2017). In total, they contain 648 + 169 + 167 = 984 documents, which are either newswire articles or discussion forum threads.

'2015': [
'2016': [
'2017': [
                  KBP 2015   KBP 2016   KBP 2017   All
#Documents        648        169        167        984
#Event mentions   18739      4155       4375       27269
#Event clusters   11603      3191       2963       17757

Following Lu & Ng (2021), we select LDC2015E29, E68, E73, E94 and LDC2016E64 as the training set (817 documents: 735 for training and the remaining 82 for parameter tuning), and report results on the KBP 2017 dataset.

Dataset Statistics:

                  Train   Dev    Test   All
#Documents        735     82     167    984
#Event mentions   20512   2382   4375   27269
#Event clusters   13292   1502   2963   17757


  1. Download kbp_sent.txt from the GitHub repository of previous work (Xu et al., 2022), which contains sentences split using Stanford CoreNLP, and place it in the ./data directory.

  2. Convert the original dataset into jsonlines format using:

    cd data/
    export DATA_DIR=<ldc_tac_kbp_data_dir>
    export SENT_DIR=./
    python3 --kbp_data_dir $DATA_DIR --sent_data_dir $SENT_DIR

    Note: this script will create train.json, dev.json and test.json in the data folder, as well as train_filtered.json, dev_filtered.json and test_filtered.json, in which identical or overlapping event mentions are filtered out.

  3. Use the trigger detector provided by Xu et al. (2022) to extract event triggers in the test set, and store the results in the ./data/epoch_3_test_pred_events.json file.

  4. Install the OmniEvent tool via pip install OmniEvent, download model weights, and then recognize event arguments using:

    cd data/KnowledgeExtraction/

    Note: this script will create xxx_pred_args.json files in the data/KnowledgeExtraction/argument_files folder.
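
The files produced in step 2 are in jsonlines format, meaning one JSON object per line. A minimal way to inspect them is sketched below; read_jsonlines is a hypothetical helper, and no assumption is made here about the field names inside each record.

```python
import json


def read_jsonlines(path):
    """Load a jsonlines file into a list, one parsed object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, len(read_jsonlines('data/train_filtered.json')) gives the number of records in the filtered training file, assuming one record per document.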


To reduce the computational cost, we apply undersampling on the training set based on the event similarities.


Train an event similarity scorer, which can output similar event embeddings for coreferential event mentions and then calculate cosine values as event similarities (Run with --do_train):

cd src/sample_selector/

export OUTPUT_DIR=./results/

python3 \
    --output_dir=$OUTPUT_DIR \
    --model_type=longformer \
    --model_checkpoint=../../PT_MODELS/allenai/longformer-large-4096/ \
    --train_file=../../data/train_filtered.json \
    --dev_file=../../data/dev_filtered.json \
    --test_file=../../data/test_filtered.json \
    --pred_test_file=../../data/epoch_3_test_pred_events.json \
    --max_seq_length=4096 \
    --learning_rate=1e-5 \
    --num_train_epochs=30 \
    --batch_size=1 \
    --do_train \
    --warmup_proportion=0. \

After training, the model weights and the evaluation results on the Dev set are saved in $OUTPUT_DIR. Then run with the --do_predict parameter to predict event similarities. The predicted results, i.e., XXX_with_cos.json, are saved in $OUTPUT_DIR.
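
The similarity stored for each event pair is the cosine of the two event embeddings. The sketch below shows that computation in plain Python on toy vectors. It is illustrative only: the scorer in this repo works on Longformer-based embeddings, and the function name and zero-vector handling are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat a zero vector as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings: vectors pointing the same way score 1, orthogonal ones score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```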

Finally, create event info files based on recognized arguments and event similarities:

cd data/KnowledgeExtraction/


This will create xxx_related_info_{cosine_threshold}.json files in the data/KnowledgeExtraction/simi_files folder, which contain the arguments of each event and the related event information with high similarity.
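
One way similarity-based undersampling can work is sketched below: keep every coreferential pair, and for each event keep only its top-k most similar non-coreferential partners as hard negatives. This is an assumption made for illustration, not a description of the repo's actual sampling strategy, which this README does not spell out; the helper name and its parameters are hypothetical. The default k mirrors the --neg_top_k=3 value used in the training command further down.

```python
def undersample_pairs(events, similarity, is_coref, neg_top_k=3):
    """Select training pairs: all positives plus the hardest negatives per event.

    events: list of sortable event ids.
    similarity(a, b): returns a float, higher means more similar.
    is_coref(a, b): returns True if the two events are coreferential.
    """
    selected = set()
    for a in events:
        negatives = []
        for b in events:
            if a == b:
                continue
            pair = tuple(sorted((a, b)))
            if is_coref(a, b):
                selected.add(pair)  # always keep positive pairs
            else:
                negatives.append((similarity(a, b), pair))
        # Keep only the most similar (hardest) negatives for this event.
        negatives.sort(key=lambda item: item[0], reverse=True)
        selected.update(pair for _, pair in negatives[:neg_top_k])
    return sorted(selected)


# Toy example with four events, one coreferential pair, and made-up similarities.
sims = {("a", "b"): 0.95, ("a", "c"): 0.9, ("a", "d"): 0.1,
        ("b", "c"): 0.2, ("b", "d"): 0.3, ("c", "d"): 0.5}
pairs = undersample_pairs(
    ["a", "b", "c", "d"],
    similarity=lambda x, y: sims[tuple(sorted((x, y)))],
    is_coref=lambda x, y: {x, y} == {"a", "b"},
    neg_top_k=1,
)
print(pairs)
```

In this toy run, four of the six possible pairs are kept, which shows how the training set shrinks while the hardest negatives remain.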

Event Coreference Resolution


Train our prompt-based model CorefPrompt using (Run with --do_train):

cd src/coref_prompt/

export OUTPUT_DIR=./roberta_m_hta_hn_512_with_mask_product_cosine_results/

python3 \
    --output_dir=$OUTPUT_DIR \
    --prompt_type=m_hta_hn \
    --with_mask \
    --select_arg_strategy=no_filter \
    --matching_style=product_cosine \
    --cosine_space_dim=64 \
    --cosine_slices=128 \
    --cosine_factor=4 \
    --model_type=roberta \
    --model_checkpoint=../../PT_MODELS/roberta-large/ \
    --train_file=../../data/train_filtered.json \
    --train_file_with_cos=../../data/train_filtered_with_cos.json \
    --dev_file=../../data/dev_filtered.json \
    --test_file=../../data/test_filtered.json \
    --train_simi_file=../../data/KnowledgeExtraction/simi_files/simi_omni_train_related_info_0.75.json \
    --dev_simi_file=../../data/KnowledgeExtraction/simi_files/simi_omni_dev_related_info_0.75.json \
    --test_simi_file=../../data/KnowledgeExtraction/simi_files/simi_omni_gold_test_related_info_0.75.json \
    --pred_test_simi_file=../../data/KnowledgeExtraction/simi_files/simi_omni_epoch_3_test_related_info_0.75.json \
    --sample_strategy=corefnm \
    --neg_top_k=3 \
    --max_seq_length=512 \
    --learning_rate=1e-5 \
    --num_train_epochs=10 \
    --batch_size=4 \
    --do_train \
    --warmup_proportion=0. \

After training, the model weights and the evaluation results on the Dev set are saved in $OUTPUT_DIR. Then run with the --do_predict parameter to predict coreferences for event mention pairs. The predicted results, i.e., XXX_test_pred_corefs.json, are saved in $OUTPUT_DIR.


Create the final event clusters using predicted pairwise results:

cd src/clustering


python3 \
    --output_dir=$OUTPUT_DIR \
    --test_golden_filepath=../../data/test.json \
    --test_pred_filepath=event-event/xxx_test_pred_corefs.json \
    --golden_conll_filename=gold_test.conll \
    --pred_conll_filename=pred_test.conll \
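
To show how pairwise decisions turn into clusters, the sketch below merges events with a union-find structure, so that any chain of predicted coreferent pairs ends up in one cluster. This is an illustrative assumption: the clustering script in this repo may use a different strategy (for example, a greedy one), and cluster_events is a hypothetical helper.

```python
def cluster_events(event_ids, coref_pairs):
    """Group events into clusters by taking the transitive closure of coref pairs."""
    parent = {e: e for e in event_ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in coref_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for e in event_ids:
        clusters.setdefault(find(e), []).append(e)
    return list(clusters.values())


# Pairwise predictions from the usage example: ev1-ev5 and ev3-ev4 are coreferential.
print(cluster_events(["ev1", "ev3", "ev4", "ev5"], [("ev1", "ev5"), ("ev3", "ev4")]))
```

Events that appear in no predicted coreferent pair remain singleton clusters.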


You can download the final event similarity scorer and our best weights from Google Drive.

Model               MUC    B^3    CEAF_e   BLANC   AVG-F
BERT                35.8   54.4   55.6     36.0    45.5
RoBERTa             37.9   55.9   57.3     38.3    47.3
(Lu & Ng, 2021)     45.2   54.7   53.8     38.2    48.0
(Xu et al., 2022)   46.2   57.4   59.0     42.0    51.2
CorefPrompt         45.3   57.5   59.9     42.3    51.3

Contact info

Contact Sheng Xu with questions about this repository.

      title={CorefPrompt: Prompt-based Event Coreference Resolution by Measuring Event Type and Argument Compatibilities}, 
      author={Sheng Xu and Peifeng Li and Qiaoming Zhu},






