BALROG Experiments

This repository contains the trajectories and results for agent evaluations run on BALROG.

The repository is organized as follows:

submissions/
├── LLM/
└── VLM/
    ├── <date>_<agent>
    │   ├── babyai
    │   ├── babaisai
    │   ├── crafter
    │   ├── textworld
    │   ├── minihack
    │   ├── nle
    │   ├── metadata.yaml
    │   ├── README.md
    │   └── logs/*.log (Execution Logs)
    └── ...
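
The layout above can be checked mechanically before opening a pull request. The following is a minimal sketch, not part of the repository's tooling: the function name `missing_entries` is hypothetical, and treating every environment folder plus `metadata.yaml` and `README.md` as required is an assumption drawn from the tree, not a rule stated by the BALROG team.

```python
from pathlib import Path

# Names copied from the directory tree above. Treating all of them as
# mandatory is an assumption made for this sketch.
ENV_DIRS = ["babyai", "babaisai", "crafter", "textworld", "minihack", "nle"]
TOP_FILES = ["metadata.yaml", "README.md"]


def missing_entries(submission_dir):
    """Return the expected folders and files absent from a submission folder.

    Hypothetical helper: it only checks that the entries exist, not that
    their contents are valid.
    """
    root = Path(submission_dir)
    # Environment results are expected to be directories.
    missing = [name for name in ENV_DIRS if not (root / name).is_dir()]
    # Metadata and README are expected to be regular files.
    missing += [name for name in TOP_FILES if not (root / name).is_file()]
    return missing
```

Calling `missing_entries("submissions/LLM/20240921_balrog_gpt4o")` returns an empty list when every listed entry is present, and otherwise the names that still need to be added.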

More about how the repository is organized and how to view trajectories: <COMING SOON!>

Submit to BALROG Leaderboard

If you are interested in submitting your agent to the BALROG Leaderboard, please do the following:

1. Fork and clone this repository.
2. Create a new folder named with the submission date and the agent name in the LLM or VLM directory (e.g. submissions/LLM/20240921_balrog_gpt4o).
3. Copy the logs from your agent's run into the new folder. Include the following files from your agent's evaluation:
   - babaisai: babaisai folder, containing summary and trajectory logs
   - babyai: babyai folder, containing summary and trajectory logs
   - crafter: crafter folder, containing summary and trajectory logs
   - minihack: minihack folder, containing summary and trajectory logs
   - nle: nle folder, containing summary and trajectory logs
   - textworld: textworld folder, containing summary and trajectory logs
   - summary.json: Summary of the evaluation outcomes for all environments

   NOTE: You shouldn't have to create any of these files. The BALROG evaluation should generate them automatically.

4. metadata.yaml: Metadata for how the result is shown on the website. Please include the following fields:
   - name: The name of your leaderboard entry
   - oss: true if your agent (model + strategy) is open-source
   - site: URL/link to more information about your agent
   - verified: false (See below for results verification)
   - date: submission date in string format (e.g. "2024-12-09")
5. README.md: Include anything you'd like to share about your agent here!
6. Run python submit.py
7. Create a pull request to the balrog-ai/experiments repository with the new folder.
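
The metadata.yaml fields listed above can be sanity-checked before running submit.py. The sketch below is illustrative only: `metadata_problems` and `EXAMPLE` are hypothetical names, the example values are placeholders, and the specific checks are assumptions inferred from the field descriptions. They are not what submit.py itself does. The function expects a plain mapping, such as the result of loading the file with a YAML parser.

```python
import re
from datetime import datetime

# Field names copied from the list above.
REQUIRED_FIELDS = ("name", "oss", "site", "verified", "date")

# Placeholder values for illustration; none of them refer to a real entry.
EXAMPLE = {
    "name": "My Agent",
    "oss": True,
    "site": "https://example.com/my-agent",
    "verified": False,
    "date": "2024-12-09",
}


def metadata_problems(meta):
    """Return a list of problems found in a metadata mapping (hypothetical checks)."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in meta]
    if "name" in meta and not (isinstance(meta["name"], str) and meta["name"].strip()):
        problems.append("name must be a non-empty string")
    if "oss" in meta and not isinstance(meta["oss"], bool):
        problems.append("oss must be true or false")
    if "site" in meta and not isinstance(meta["site"], str):
        problems.append("site must be a string")
    # New submissions start unverified; the BALROG team sets this later.
    if "verified" in meta and meta["verified"] is not False:
        problems.append("verified must be false for a new submission")
    if "date" in meta:
        date = meta["date"]
        # Require the exact YYYY-MM-DD shape, then confirm it is a real date.
        ok = isinstance(date, str) and re.fullmatch(r"\d{4}-\d{2}-\d{2}", date) is not None
        if ok:
            try:
                datetime.strptime(date, "%Y-%m-%d")
            except ValueError:
                ok = False
        if not ok:
            problems.append('date must be a string such as "2024-12-09"')
    return problems
```

An empty list from `metadata_problems(EXAMPLE)` means no issue was detected. Quoting the date in the YAML file matters because some YAML parsers turn an unquoted 2024-12-09 into a date object rather than a string, which this check would flag.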

You can refer to this tutorial for a quick overview of how to evaluate your agent on BALROG.

Verify Your Results

The Verified check ✓ indicates that we (the BALROG team) received access to your agent and were able to reproduce a selection of the results.

If you are interested in receiving the "verified" checkmark ✓ on your submission, please do the following:

1. Create an issue.
2. In the issue, provide us with instructions on how to run your agent on BALROG.
3. We will run your agent on a random subset of BALROG and verify the results.
