BALROG Experiments

This repository contains the trajectories and results for agent evaluations run on BALROG.

The repository is organized as follows:

submissions/
├── LLM/
└── VLM/
    ├── <date>_<agent>
    │   ├── babyai
    │   ├── babaisai
    │   ├── crafter
    │   ├── textworld
    │   ├── minihack
    │   ├── nle
    │   ├── metadata.yaml
    │   ├── README.md
    │   └── logs/*.log (Execution Logs)
    └── ...
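
As a quick sanity check before submitting, the layout above can be verified with a few lines of Python. This is a minimal sketch and not part of the repository: the helper name `missing_entries` is hypothetical, and treating every entry in the tree as required is an assumption.

```python
from pathlib import Path

# Entries taken from the directory tree above. Whether each one is strictly
# required by the leaderboard is an assumption of this sketch.
EXPECTED_DIRS = ["babyai", "babaisai", "crafter", "textworld", "minihack", "nle"]
EXPECTED_FILES = ["metadata.yaml", "README.md"]


def missing_entries(submission_dir):
    """Return the expected entries that are absent from submission_dir."""
    root = Path(submission_dir)
    # Environment results are expected as folders, the rest as plain files.
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

For example, `missing_entries("submissions/LLM/20240921_balrog_gpt4o")` returns an empty list when every folder and file from the tree is present.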

More about how the repository is organized and how to view trajectories: <COMING SOON!>

Submit to BALROG Leaderboard

If you are interested in submitting your agent to the BALROG Leaderboard, please do the following:

  1. Fork and clone this repository.
  2. Create a new folder named with the submission date and the agent name in the LLM or VLM directory (e.g. submissions/LLM/20240921_balrog_gpt4o).
  3. Copy the logs from your agent's run into the new folder. Please include the following files from your agent's evaluation:
    • babaisai: babaisai folder, containing summary and trajectory logs
    • babyai: babyai folder, containing summary and trajectory logs
    • crafter: crafter folder, containing summary and trajectory logs
    • minihack: minihack folder, containing summary and trajectory logs
    • nle: nethack folder, containing summary and trajectory logs
    • textworld: textworld folder, containing summary and trajectory logs
    • summary.json: Summary of the evaluation outcomes for all environments
    • NOTE: You shouldn't have to create any of these files. They should be generated automatically by the BALROG evaluation.
  4. metadata.yaml: Metadata for how the result is shown on the website. Please include the following fields:
    • name: The name of your leaderboard entry
    • oss: true if your agent (model + strategy) is open-source
    • site: URL/link to more information about your agent
    • verified: false (See below for results verification)
    • date: The submission date as a string (e.g. "2024-12-09")
  5. README.md: Include anything you'd like to share about your agent here!
  6. Run python submit.py
  7. Create a pull request to the balrog-ai/experiments repository with the new folder.
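
The folder naming and the metadata.yaml fields described in the steps above can be sketched in Python as follows. This is an illustrative sketch only: the helpers `submission_folder` and `metadata_yaml` are hypothetical names, not part of submit.py or BALROG, and the string quoting assumes values contain no double quotes or backslashes.

```python
from datetime import date


def submission_folder(kind, agent, day):
    """Build submissions/<LLM|VLM>/<YYYYMMDD>_<agent>, following the example in the steps."""
    if kind not in ("LLM", "VLM"):
        raise ValueError("kind must be 'LLM' or 'VLM'")
    return f"submissions/{kind}/{day.strftime('%Y%m%d')}_{agent}"


def metadata_yaml(name, oss, site, day):
    """Render the five metadata.yaml fields listed in the steps as YAML text.

    Assumes name and site contain no double quotes or backslashes.
    """
    lines = [
        f'name: "{name}"',
        f"oss: {'true' if oss else 'false'}",
        f'site: "{site}"',
        "verified: false",  # new submissions start out unverified
        f'date: "{day.isoformat()}"',
    ]
    return "\n".join(lines) + "\n"
```

For instance, `submission_folder("LLM", "balrog_gpt4o", date(2024, 9, 21))` gives the folder name used in the example above, and the text returned by `metadata_yaml(...)` can be written to metadata.yaml inside that folder. The name and site values passed in are placeholders to replace with your own.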

You can refer to this tutorial for a quick overview of how to evaluate your agent on BALROG.

Verify Your Results

The Verified check ✓ indicates that we (the BALROG team) received access to your agent and were able to reproduce a selection of the results.

If you are interested in receiving the "verified" checkmark ✓ on your submission, please do the following:

  1. Create an issue.
  2. In the issue, provide us with instructions on how to run your agent on BALROG.
  3. We will run your agent on a random subset of BALROG and verify the results.
