# Evaluation of Gemini or Other LLMs on DocRED
- Clone the repository:

  ```bash
  git clone <repo-url>
  cd DocRED-eval
  ```
- Set up a virtual environment and install the requirements. For pip, it's better to install in editable mode:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
  pip install -e .
  ```

  If you prefer `uv`, a faster alternative to pip, you can install it and sync the dependencies:

  ```bash
  pip install uv
  uv sync
  ```
- Download the data from DocRED and place it in the `data/` folder. Only `train_annotated.json`, `test.json`, `dev.json`, and `rel_info.json` are needed.
- Copy `.env.example` to `.env` and fill in your API key.
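  A minimal sketch of the result, assuming pydantic-ai's Gemini support reads `GEMINI_API_KEY`; check `.env.example` for the exact variable name:

  ```dotenv
  # Illustrative only -- see .env.example for the real variable name.
  GEMINI_API_KEY=your-api-key-here
  ```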
- Run the Python script:

  ```bash
  python main.py
  ```
- Check the results in the `output/` folder. Note that Gemini may sometimes raise an error, and the new result will overwrite the previous one.
  - `messages.json`: the chat history provided by pydantic-ai.
  - `pred_labels.json`: the output from the model.
  - `true_labels.json`: the labels extracted from `data/dev.json`.
  - `system_prompt.md`: the prompt, using few-shot examples from `data/train_annotated.json`.
  - `user_prompt.yaml`: the document extracted from `data/dev.json`.
- I use `pydantic` to validate the DocRED dataset. Its Rust-implemented core provides high, reliable speed, along with type hints.
- Convert the original document to a simpler version. I hypothesize that LLMs understand a document much as humans do, so entity position information and tokenized text are not important to the model. The schema for the validation is in `docred_eval/schema.py`.
- Convert the object into YAML flow style, which might be the most token-efficient format (my intuition, not fully researched for evidence). A sketch of these two steps follows this list.
- Use `pydantic-ai` to interact with the model and get the output; see the second sketch below.
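A minimal sketch of the simplify-and-serialize steps. This is not the real schema in `docred_eval/schema.py`; the model and field names are assumptions for illustration:

```python
# A simplified stand-in for the real schema in docred_eval/schema.py;
# the field names here are illustrative assumptions.
import yaml
from pydantic import BaseModel


class Document(BaseModel):
    title: str
    text: str
    entities: list[str]


doc = Document(
    title="Example",
    text="Alice works for Acme.",
    entities=["Alice", "Acme"],
)

# Flow style packs the object into compact inline YAML, which is
# plausibly cheaper in tokens than the default block style.
print(yaml.dump(doc.model_dump(), default_flow_style=True))
# -> {entities: [Alice, Acme], text: Alice works for Acme., title: Example}
```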
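And a minimal sketch of the `pydantic-ai` call. The `Agent` argument names (`result_type` here) differ across pydantic-ai versions, and the output schema is again an assumption, not the project's real one:

```python
# Minimal pydantic-ai sketch; argument names vary between versions,
# and the Relations schema is illustrative.
from pydantic import BaseModel
from pydantic_ai import Agent


class Relation(BaseModel):
    head: str
    tail: str
    relation: str  # e.g. a DocRED "P..." relation code


class Relations(BaseModel):
    relations: list[Relation]


agent = Agent(
    "gemini-1.5-flash",     # needs GEMINI_API_KEY in the environment
    result_type=Relations,  # pydantic-ai validates the model's output
    system_prompt="Extract all relations from the document.",
)

result = agent.run_sync("Alice works for Acme.")
print(result.data)  # a validated Relations instance
```

Because the output type is a pydantic model, a malformed response fails validation instead of silently producing broken JSON.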
With the rise of generative models, the old ways of performing NLP tasks (customized models and complex pipelines) are no longer the most popular choice. LLMs can perform a wide range of NLP tasks, but due to their generative nature, it is not easy to get them to extract information in a structured way.
However, I found that OpenAI provides a structured output mode for their API, which can enforce JSON output by some unknown magic. Unlike merely prompting the model to generate structured output, this approach is far more reliable.
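For reference, a minimal sketch of that mode with the official `openai` Python SDK (not used in this project; the schema is a made-up example):

```python
# Illustrative sketch of OpenAI's structured-output mode; this project
# targets Gemini instead, and the Relation schema is hypothetical.
from openai import OpenAI
from pydantic import BaseModel


class Relation(BaseModel):
    head: str
    tail: str
    relation: str


client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Alice works for Acme."}],
    response_format=Relation,  # the API enforces this JSON schema
)
print(completion.choices[0].message.parsed)
```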
The problem is that I don't want to spend money on a school project, and Gemini is almost free for personal use. So I chose to evaluate Gemini on DocRED, a relation extraction dataset.
I thought it would easily achieve near state-of-the-art performance, but it turns out that it is not that easy...
Gemini seems to refuse to output long text in structured output mode, which is a big problem for the relation extraction task.
The DocRED test set contains 3 million tokens. The input size is not the biggest problem for Gemini-1.5-flash, which has a 1-million-token context window. However, such a large input also means long output, and the model's maximum output limit of 8,192 tokens is not enough for the task.
We failed to achieve the original goal, but I have some ideas that worked pretty well on another project using Gemini-1.5-flash's structured output mode.
`google-generative-ai` is one of the worst Python packages I've ever encountered. However, Gemini itself isn't too bad. Interestingly, the pydantic-ai devs feel the same way and have re-implemented the API, sharing their thoughts on the reference page.
Also, the evaluation script in the DocRED repository is a nightmare and makes it harder to reach our goal. Nevertheless, thanks to the authors for providing the dataset.