
If I let the agent perform this task, how do I convert the text information into an executable dict? And how can we convert our executable dict back into text? #92

Open
SakuraXiaMF opened this issue Dec 22, 2024 · 5 comments


@SakuraXiaMF

Question

If I let the agent perform this task, how do I convert the text information into an executable dict? And how can we convert our executable dict back into text?

MiniWoB++ Action

@ppasupat
Collaborator

Hi. Could you give a small code snippet for what you are trying to do?

And are you using a 3rd party agent? In MiniWoB++ environments, the method env.step(action) requires action to be a Python dict. There is no need to convert into text, unless the 3rd party agent requires it (e.g. maybe it's an LLM-based agent).

@xiaxiaxiatengxi

xiaxiaxiatengxi commented Jan 2, 2025

Oh, yes. That's what I have in mind. I want to use this environment for my LLM agent experiments.

My plan is to input observation data (obs) into the language model, have the language model output text-based actions, and then convert these actions into executable methods for implementation.
My code is as follows:

```python
import time

import gymnasium
import miniwob
from miniwob.action import ActionTypes

gymnasium.register_envs(miniwob)

env = gymnasium.make("miniwob/click-test-2-v1", render_mode=None)

llm_agent = LLM_agent()  # My own LLM wrapper class (not part of MiniWoB++).

while True:
    try:
        # Start a new episode.
        obs, info = env.reset()

        assert obs["utterance"] == "Click button ONE."
        assert obs["fields"] == (("target", "ONE"),)
        print(f"======\nOBS:\n{obs}\n{obs.keys()}\n======")
        time.sleep(2)  # Only here to let you look at the environment.
        print(f"======\nINFO:\n{info}\n======")

        # Key step: the LLM produces text, but env.step expects an action dict.
        # (obs has no "text" key; "utterance" holds the task instruction.)
        llm_action = llm_agent(obs["utterance"])
        obs, reward, terminated, truncated, info = env.step(llm_action)

        print("My rewards", reward)  # Should be around 0.8 since 2 seconds have passed.
        assert terminated is True
        breakpoint()
    except AssertionError:
        print("Failed to reset the environment.")
```
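For the first half of this plan (observation to text), a minimal sketch is to flatten the instruction and the DOM elements into numbered lines so the LLM can refer to elements by their `ref`. It assumes `obs["utterance"]` holds the instruction and that each entry of `obs["dom_elements"]` is a dict with at least `"ref"`, `"tag"` and `"text"` keys; `obs_to_prompt`, `max_elements` and the reply format in the last prompt line are hypothetical choices, not part of MiniWoB++.

```python
from typing import Any


def obs_to_prompt(obs: dict[str, Any], max_elements: int = 50) -> str:
    """Render a MiniWoB++-style observation as plain text for an LLM (hypothetical helper).

    Assumes obs["utterance"] is the task instruction and obs["dom_elements"]
    is a sequence of dicts with at least "ref", "tag" and "text" keys.
    """
    lines = [f"Task: {obs['utterance']}", "Elements:"]
    for element in obs["dom_elements"][:max_elements]:
        text = (element.get("text") or "").strip()
        # One line per element, keyed by ref, so the LLM can name its target.
        lines.append(f"[{element['ref']}] <{element['tag']}> {text}".rstrip())
    # Hypothetical reply format; the parser on the other side must agree with it.
    lines.append("Reply with one action: CLICK <ref> or TYPE <ref> <text>.")
    return "\n".join(lines)
```

Capping the number of elements keeps prompts short on pages with large DOMs; a real agent might instead filter to visible or interactive elements.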

@ppasupat
Collaborator

ppasupat commented Jan 2, 2025

In this case, the llm_agent method is responsible for the conversion. The method needs to convert the obs dict into a format that the LLM can understand, and then convert the LLM's output into an action dict for the env.step method.
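A minimal sketch of such an agent, assuming the LLM is prompted to answer in a made-up one-line format (`CLICK <ref>` or `TYPE <ref> <text>`). `LLMAgent`, `call_llm` (any text-in/text-out function) and `parse_action` are hypothetical names. Instead of building the env's action dict directly, the agent returns an `(ActionTypes member name, keyword arguments)` pair, using the two action types mentioned later in this thread.

```python
import re
from typing import Any, Callable

# Hypothetical text protocol: "CLICK <ref>" or "TYPE <ref> <text>".
_CLICK = re.compile(r"^\s*CLICK\s+(\d+)\s*$", re.IGNORECASE)
_TYPE = re.compile(r"^\s*TYPE\s+(\d+)\s+(.+?)\s*$", re.IGNORECASE)


def parse_action(reply: str) -> tuple[str, dict[str, Any]]:
    """Turn the first line of the LLM's reply into (action type name, kwargs)."""
    line = reply.strip().splitlines()[0] if reply.strip() else ""
    if m := _CLICK.match(line):
        return "CLICK_ELEMENT", {"ref": int(m.group(1))}
    if m := _TYPE.match(line):
        return "FOCUS_ELEMENT_AND_TYPE_TEXT", {"ref": int(m.group(1)), "text": m.group(2)}
    raise ValueError(f"Unparseable action: {reply!r}")


class LLMAgent:
    """Wraps a hypothetical text-in/text-out LLM callable."""

    def __init__(self, call_llm: Callable[[str], str]):
        self.call_llm = call_llm

    def __call__(self, obs: dict[str, Any]) -> tuple[str, dict[str, Any]]:
        # obs -> text: a minimal prompt built from the instruction only.
        prompt = f"Task: {obs['utterance']}\nReply with CLICK <ref> or TYPE <ref> <text>."
        # text -> action spec: parse the reply, failing loudly on bad output.
        return parse_action(self.call_llm(prompt))
```

In the episode loop, the pair could then be turned into the env's dict with something like `env.unwrapped.create_action(ActionTypes[name], **kwargs)` before calling `env.step`. The `create_action(ActionTypes.CLICK_ELEMENT, ref=...)` pattern follows the MiniWoB++ usage example; looking the member up by name assumes `ActionTypes` is a standard `Enum`.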

The meaning of each dict entry in obs and action can be found in the documentation here and here. You will probably need to write custom code for the conversion.
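For the reverse direction (action back to text, e.g. to show the LLM its previous steps), one option is to keep the `(action type name, kwargs)` pair you passed to `create_action` and render that, rather than decoding the env's action dict, whose internal layout this sketch does not rely on. `action_to_text` and `history_to_text` are hypothetical helpers that mirror the made-up `CLICK <ref>` / `TYPE <ref> <text>` format.

```python
from typing import Any


def action_to_text(name: str, kwargs: dict[str, Any]) -> str:
    """Render an (action type name, create_action kwargs) pair as text."""
    if name == "CLICK_ELEMENT":
        return f"CLICK {kwargs['ref']}"
    if name == "FOCUS_ELEMENT_AND_TYPE_TEXT":
        return f"TYPE {kwargs['ref']} {kwargs['text']}"
    # Fallback for any other action type: a generic but still readable form.
    args = ", ".join(f"{k}={v!r}" for k, v in sorted(kwargs.items()))
    return f"{name}({args})"


def history_to_text(history: list[tuple[str, dict[str, Any]]]) -> str:
    """Number past actions so the LLM can see what it has already done."""
    return "\n".join(
        f"{i}. {action_to_text(name, kwargs)}" for i, (name, kwargs) in enumerate(history, 1)
    )
```

Using the same format in both directions means the rendered history doubles as few-shot examples of valid replies.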

@xiaxiaxiatengxi

Thank you for your reply! Yes, I understand. Have there been any existing works where a similar conversion method has been used? It seems that converting the action dictionary into text can be quite challenging.

@ppasupat
Collaborator

ppasupat commented Jan 8, 2025

There might be papers that use MiniWoB and have GitHub repositories, but I don't know of one off the top of my head.

Could you explain what challenges the conversion would involve? You can restrict the action set to reduce the complexity, e.g., to only clicking on and typing into an element with a specific ID: CLICK_ELEMENT and FOCUS_ELEMENT_AND_TYPE_TEXT.
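With the action set restricted to these two types, the LLM's output can be validated strictly, so malformed replies are caught before they reach `env.step`. A minimal sketch, assuming the LLM is asked to reply with a JSON object such as `{"action": "CLICK_ELEMENT", "ref": 4}` and that the valid refs come from the current observation (e.g. `[el["ref"] for el in obs["dom_elements"]]`). `parse_restricted` and `ALLOWED` are hypothetical names.

```python
import json
from typing import Any, Iterable

# Restricted action set: each allowed action type name maps to the exact
# set of keyword arguments it takes.
ALLOWED = {
    "CLICK_ELEMENT": {"ref"},
    "FOCUS_ELEMENT_AND_TYPE_TEXT": {"ref", "text"},
}


def parse_restricted(reply: str, valid_refs: Iterable[int]) -> tuple[str, dict[str, Any]]:
    """Parse a JSON reply such as {"action": "CLICK_ELEMENT", "ref": 4}.

    Raises ValueError (json.JSONDecodeError is a subclass) for anything outside
    the restricted action set, so the caller can re-prompt the LLM.
    """
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    name = data.pop("action", None)
    if not isinstance(name, str) or name not in ALLOWED:
        raise ValueError(f"action must be one of {sorted(ALLOWED)}, got {name!r}")
    if set(data) != ALLOWED[name]:
        raise ValueError(f"{name} takes exactly {sorted(ALLOWED[name])}, got {sorted(data)}")
    if not isinstance(data["ref"], int) or data["ref"] not in set(valid_refs):
        raise ValueError(f"unknown element ref {data['ref']!r}")
    if "text" in data and not isinstance(data["text"], str):
        raise ValueError("text must be a string")
    return name, data
```

Because the action names match the `ActionTypes` member names, the returned pair can be passed to `create_action` the same way as in the MiniWoB++ usage example.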
