This repository contains the trajectories and results for agent evaluations run on BALROG.
The repository is organized as follows:
submissions/
├── LLM/
└── VLM/
    ├── <date>_<agent>
    │   ├── babyai
    │   ├── babaisai
    │   ├── crafter
    │   ├── textworld
    │   ├── minihack
    │   ├── nle
    │   ├── metadata.yaml
    │   ├── README.md
    │   └── logs/*.log (Execution Logs)
    └── ...
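As a rough illustration of this layout, the sketch below walks a local clone and lists each submission folder per track. The helper name `list_submissions` is hypothetical and not part of this repository; it assumes only that submission folders sit directly under `submissions/LLM` or `submissions/VLM`.

```python
from pathlib import Path

# The two top-level directories under submissions/.
TRACKS = ("LLM", "VLM")


def list_submissions(root):
    """Return (track, folder_name) pairs found under root/<track>/, sorted per track."""
    found = []
    for track in TRACKS:
        track_dir = Path(root) / track
        if not track_dir.is_dir():
            continue  # tolerate a clone where one track has no entries yet
        for entry in sorted(track_dir.iterdir()):
            if entry.is_dir():  # skip stray files next to the submission folders
                found.append((track, entry.name))
    return found


if __name__ == "__main__":
    for track, name in list_submissions("submissions"):
        print(f"{track}: {name}")
```

Run from the repository root, this prints one line per submission folder, which can help confirm that a new folder landed in the intended track.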
More about how the repository is organized and how to view trajectories: <COMING SOON!>
If you are interested in submitting your agent to the BALROG Leaderboard, please do the following:
- Fork and clone this repository.
- Create a new folder with the submission date and the agent name in the LLM or VLM directory (e.g. submissions/LLM/20240921_balrog_gpt4o).
- Copy the logs from your agent's run into the new folder. Please include the following files from your agent's evaluation:
  - babaisai: babaisai folder, containing summary and trajectory logs
  - babyai: babyai folder, containing summary and trajectory logs
  - crafter: crafter folder, containing summary and trajectory logs
  - minihack: minihack folder, containing summary and trajectory logs
  - nle: nethack folder, containing summary and trajectory logs
  - textworld: textworld folder, containing summary and trajectory logs
  - summary.json: Summary of the evaluation outcomes for all environments
  - NOTE: You shouldn't have to create any of these files. They should be generated automatically by the BALROG evaluation.
- metadata.yaml: Metadata for how the result is shown on the website. Please include the following fields:
  - name: The name of your leaderboard entry
  - oss: true if your agent (model + strategy) is open-source
  - site: URL/link to more information about your agent
  - verified: false (see below for results verification)
  - date: submission date in string format (e.g. "2024-12-09")
- README.md: Include anything you'd like to share about your agent here!
- Run python submit.py
- Create a pull request to the BALROG/experiments repository with the new folder.
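Before opening the pull request, it may help to check the new folder against the list above. The following is a minimal sketch, not a replacement for `submit.py` and not a description of what that script does. The names `check_submission` and `top_level_keys` are hypothetical, the folder names follow the directory tree shown earlier, and the metadata check is a naive line scan rather than a real YAML parser.

```python
from pathlib import Path

# Folder names as shown in the directory tree; adjust if your BALROG output differs.
EXPECTED_DIRS = ("babaisai", "babyai", "crafter", "minihack", "nle", "textworld")
EXPECTED_FILES = ("summary.json", "metadata.yaml", "README.md")
# Fields listed above for metadata.yaml.
METADATA_FIELDS = ("name", "oss", "site", "verified", "date")


def top_level_keys(yaml_text):
    """Collect keys of unindented `key: value` lines (naive; not a YAML parser)."""
    keys = set()
    for line in yaml_text.splitlines():
        if not line or line[0].isspace() or line.startswith("#"):
            continue  # skip blanks, nested entries, and comments
        if ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return keys


def check_submission(folder):
    """Return a list of problems found; an empty list means none were detected."""
    folder = Path(folder)
    problems = []
    for name in EXPECTED_DIRS:
        if not (folder / name).is_dir():
            problems.append(f"missing folder: {name}")
    for name in EXPECTED_FILES:
        if not (folder / name).is_file():
            problems.append(f"missing file: {name}")
    metadata = folder / "metadata.yaml"
    if metadata.is_file():
        present = top_level_keys(metadata.read_text(encoding="utf-8"))
        for field in METADATA_FIELDS:
            if field not in present:
                problems.append(f"metadata.yaml lacks field: {field}")
    return problems


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for problem in check_submission(target):
        print(problem)
```

For example, pointing it at `submissions/LLM/20240921_balrog_gpt4o` prints one line per missing item and nothing when every listed folder, file, and metadata field is present. It checks presence only and does not validate the contents of the logs or of `summary.json`.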
You can refer to this tutorial for a quick overview of how to evaluate your agent on BALROG.
The Verified check ✓ indicates that we (the BALROG team) received access to your agent and were able to reproduce a selection of the results.
If you are interested in receiving the "verified" checkmark ✓ on your submission, please do the following:
- Create an issue.
- In the issue, provide us with instructions on how to run your agent on BALROG.
- We will run your agent on a random subset of BALROG and verify the results.