Official repository for "Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity (COLING2022)"

ddehun/coling2022_reweighting_sts

Reweighting synthetic examples

This repository contains the code for our paper, "Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity" (COLING2022) [paper].

How to Begin

  • Install the required packages listed in requirements.txt.
  • Download the preprocessed benchmark datasets (STSb, QQP, and MRPC) from this drive link.
  • Prepare the PAWS-QQP dataset by following this repository, and place it in datasets/benchmarks/paws/.

How to Reproduce

1. Data preparation

  • Run the scripts/0_preprocessing.sh script. It prepares the source sentences (C_src) used to build the synthetic dataset and splits the PAWS dataset into dev and test splits.
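
The dev/test split mentioned above can be pictured with a minimal Python sketch. This is not the code in scripts/0_preprocessing.sh: the function name `split_dev_test`, the 50/50 ratio, and the seed are all hypothetical choices made for illustration, and the toy sentence pairs are made up.

```python
import random


def split_dev_test(examples, dev_ratio=0.5, seed=42):
    """Shuffle a copy of `examples` and cut it into (dev, test).

    ASSUMPTION: the ratio and seed are illustrative; the repository's
    preprocessing script may split PAWS differently.
    """
    if not 0.0 < dev_ratio < 1.0:
        raise ValueError("dev_ratio must be strictly between 0 and 1")
    shuffled = list(examples)              # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    cut = int(len(shuffled) * dev_ratio)
    return shuffled[:cut], shuffled[cut:]


# Toy (sentence1, sentence2, label) triples standing in for PAWS examples.
pairs = [(f"sentence a{i}", f"sentence b{i}", i % 2) for i in range(10)]
dev, test = split_dev_test(pairs)
print(len(dev), len(test))  # prints "5 5"
```

Fixing the seed means the same examples land in the same split on every run, which matters when results are compared across experiments.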

2. Synthetic dataset generation & Machine-written example identification

  • Run the scripts/1_generation.sh script to generate synthetic examples and train a discriminator model that identifies them.
  • The synthetic dataset is created with the same process as the original DINO framework proposed by Schick et al. (2021).

3. Training and evaluating STS models

  • Run scripts/2_run_sts.sh to train bi-encoder models for sentence similarity tasks.
  • The shell script reproduces all results in Table 2 (with and without reweighting, plus the ablation study).
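
To give a rough idea of what "reweighting" synthetic examples can mean in practice, here is a small Python sketch. It assumes the discriminator from step 2 yields, for each synthetic example, a probability that the example is machine-written, and it down-weights examples in proportion to that probability. The rule `1 - p`, the normalisation, and the names `human_likeness_weights` and `weighted_mse` are assumptions for illustration only; refer to the paper and scripts/2_run_sts.sh for the actual strategy.

```python
def human_likeness_weights(p_machine):
    """Map discriminator probabilities of 'machine-written' to weights in [0, 1].

    ASSUMPTION: weight = 1 - p is a hypothetical rule, not necessarily
    the one used in the paper.
    """
    weights = []
    for p in p_machine:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        weights.append(1.0 - p)
    return weights


def weighted_mse(preds, targets, weights):
    """Mean squared error in which each example contributes by its weight."""
    if not (len(preds) == len(targets) == len(weights)):
        raise ValueError("preds, targets, and weights must have equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * (p - t) ** 2 for p, t, w in zip(preds, targets, weights)) / total


# Toy values: predicted similarities, synthetic labels, discriminator outputs.
preds = [0.5, 0.5, 0.0]
targets = [1.0, 0.0, 1.0]
weights = human_likeness_weights([0.0, 0.5, 1.0])
print(weights)                                 # [1.0, 0.5, 0.0]
print(weighted_mse(preds, targets, weights))   # 0.25
```

In this toy case the third example has the largest error but is judged fully machine-written, so it contributes nothing to the loss.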

4. Other baseline models

  • Run scripts/3_run_other_baselines.sh to reproduce the results of the other baseline models in Table 6, such as GloVe, BERT, and USE.
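
For readers unfamiliar with how an embedding baseline is typically scored on sentence similarity, the sketch below averages word vectors per sentence, compares sentences by cosine similarity, and measures rank agreement with gold scores using Spearman correlation. Everything here is an assumption for illustration: the two-dimensional toy vectors, the gold scores, the function names, and the choice of Spearman correlation are not taken from scripts/3_run_other_baselines.sh.

```python
import math

# Hypothetical 2-d word vectors; real baselines such as GloVe use far larger ones.
TOY_VECTORS = {"a": [0.5, 0.5], "cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}


def sentence_vector(sentence, vectors=TOY_VECTORS):
    """Average the vectors of known tokens (whitespace tokenisation)."""
    known = [vectors[tok] for tok in sentence.lower().split() if tok in vectors]
    if not known:
        return [0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


sentence_pairs = [("a cat", "a dog"), ("a cat", "a car"), ("a cat", "a cat")]
gold = [4.0, 1.0, 5.0]  # made-up similarity scores
predicted = [cosine(sentence_vector(s1), sentence_vector(s2)) for s1, s2 in sentence_pairs]
print(spearman(predicted, gold))
```

Spearman correlation only looks at the ordering of the scores, so a baseline is rewarded for ranking pairs correctly even when its raw cosine values sit on a different scale from the gold labels.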

Acknowledgements

The code for generating the synthetic dataset is derived from the work of Schick et al. (2021). (Github)
