Autonomous Agents

Autonomous Agents resources, updated daily. See also the Research papers section.


Introduction to Autonomous Agents

More than 1.3k arXiv research papers and more than 1k GitHub repositories exist with the term "Autonomous agents".


Definitions


Agent

The term "agent" originates from the Latin verb agere, meaning "to drive, lead, or do"1 . Its present participle, agens, provides the root for "agent," signifying "doing" or "acting" 2. This etymology emphasizes the capacity to effect change, underpinning the word's varied meanings 3,4.

The Latin root agere has also produced related terms like "actor." While both share a common ancestor, they have evolved distinct connotations. "Actor" is often associated with performing arts, while "agent" encompasses broader roles, including those with continuous action or agency5.

This chapter will explore various agentic roles, building upon the foundational concept of agency as the capacity to act and effect change.


Autonomous Agent

The Autonomous Agent was defined by Franklin & Graesser in 1996 as: "a system situated within and a part of an environment that senses that environment and acts on it, over time, in pursuit of its own agenda and so as to effect what it senses in the future."

Good:

  • Agnostic regarding the underlying technology.
  • Excludes controversial aspects: consciousness, AGI, "free will", etc.

Negative:

  • Takes no position on the degree of generalization / adaptation / embodiment / self-construction / communication / cognition.

Maes (1993) offered an even earlier, yet less cited, definition:

"Autonomous Agents are systems that inhabit dynamic, unpredictable environment in which they try to satisfy a set of time-dependent goals or motivations."


Artificial General Intelligence (AGI)

The term Artificial General Intelligence (AGI) was first used by Mark Avrum Gubrud in 1997, who defined it: "By advanced artificial general intelligence, I mean AI systems that rival or surpass the human brain in complexity and speed, that can acquire, manipulate and reason with general knowledge, and that are usable in essentially any phase of industrial or military operations where a human intelligence would otherwise be needed. Such systems may be modeled on the human brain, but they do not necessarily have to be, and they do not have to be "conscious" or possess any other competence that is not strictly relevant to their application. What matters is that such systems can be used to replace human brains in tasks ranging from organizing and running a mine or a factory to piloting an airplane, analyzing intelligence data or planning a battle."

However, the term Artificial General Intelligence (AGI) is today known mainly through the terminology that Shane Legg suggested in 2001 to Goertzel, who later went on to publish a collection of articles called "Artificial General Intelligence" (Goertzel & Pennachin, 2007). This original definition states:

"Applying these ideas to AI, we come to the conclusion that, to roughly emulate the nature of human general intelligence, an artificial general intelligence system should have:

  • the ability to solve general problems in a non-domain-restricted way, in the same sense that a human can;
  • most probably, the ability to solve problems in particular domains and particular contexts with particular efficiency;
  • the ability to use its more generalized and more specialized intelligence capabilities together, in a unified way;
  • the ability to learn from its environment, other intelligent systems, and teachers;
  • the ability to become better at solving novel types of problems as it gains experience with them."

Shane Legg has clarified (see TED talk, 4 min 15 s) that his original definition contrasted narrow systems, able only to play the game of Go, with AGI systems able to do "...many, many other things.", while his current definition is: "AGI is a system that can do all the cognitive tasks that people can do, possibly more, but at least the cognitive tasks that people can typically do."

AGI is, in addition, referred to with various types of definitions. Perhaps the best paper to check is by Morris et al. (2023), which not only reviews the different groups of AGI definers (Turing test, Strong AI / AI with consciousness, analogy to the human brain, human-level cognitive tasks, ability to learn tasks, economically valuable work / OpenAI, flexible and general, capable of earning money and generally performing), but also operationalises these groupings into different levels of AGI and defines six principles for AGI.

Good:

  • Categorization levels, widely used term

Negative

  • Vague: lacks clarity
  • Lacks agency, self-construction, etc.

Superintelligence

Nick Bostrom (2014) defined Superintelligence as:

"An intellect that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom, and social skills."

Good:

  • Categorization levels, widely used term

Negative

  • Vague: lacks clarity
  • Lacks agency, self-construction, etc.

Generalist Agent

Generalist Agent was defined by Reed et al. in 2022: "Generalist Agents, that can adapt to new embodiments and learn new tasks with few data." through "...a multi-modal, multi-task, multi-embodiment generalist policy."

Positive:

  • Generalization of tasks/embodiments.
  • Generalization to novel situations
  • Multi-modality, especially across perception, language and embodiment
  • Data efficiency

Negative aspects:

  • Lack of other key observations by Franklin & Graesser.
  • Vague about cognitive skills: reasoning and planning.

Reinforcement Learning Agents

The Reinforcement Learning Agent was defined by Sutton & Barto (1997):

"The reinforcement-learning agent and its environment interact over a sequence of discrete time steps. The specification of their interface defines a particular problme: The actiosn are the choices made by the agent; the situations provide tha agent's basis for making the choices; and the rewards are the basis for evaluating these chocices. Everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known. A policy is a stochastic rule by which the agent selects actions as a function of situations. Roughly, the agent's objective is to learn a policy that maximizes the amount of reward it receives over the log run"

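This definition maps directly onto the standard agent-environment interaction loop. The sketch below is a minimal, self-contained illustration with a toy two-state environment and a random policy; the environment and all names are assumptions for illustration, not code from Sutton & Barto.

```python
# Minimal sketch of the RL interaction loop described in the definition above.
# The toy environment and the random policy are illustrative assumptions.
import random

ACTIONS = [0, 1]                        # the choices made by the agent

def environment_step(state: int, action: int):
    """Returns (next situation, reward): the basis for making and evaluating choices."""
    next_state = (state + action) % 2
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def policy(state: int) -> int:
    """A stochastic rule selecting actions as a function of situations."""
    return random.choice(ACTIONS)

state, total_reward = 0, 0.0
for t in range(100):                    # a sequence of discrete time steps
    action = policy(state)
    state, reward = environment_step(state, action)
    total_reward += reward              # objective: maximize reward over the long run
print(total_reward)
```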

Positive:

  • Standard definition of a Reinforcement Learning (RL) system. Very similar to the Autonomous Agent definition by Franklin & Graesser (1996).
  • RL systems are provenly versatile and address: optimization, learning from experience, generalization, delayed consequences and exploration (Stanford CS234, lecture slide 19).
  • Most recent LLMs use RL during the fine-tuning phase.

Negative:

  • RL approaches around language/communication still require more investigation.

LLM Agents / Language Agents

Kenton et al. (2021) define the concept of a Language Agent: "machine learning systems whose actions are restricted to give natural language text-output only, rather than controlling physical actuators which directly influence the world."

Positive:

  • First paper defining LLM-based agents.
  • Language-based agents are an exceptionally good way of steering agents towards human perception, plans and objectives.

Negative:

  • Text-only
  • The definition does not consider RL Agent / Autonomous Agent aspects, such as environment, embodiment, etc.
  • The term "LLM agent" poorly describes the currently wide variety of components: memory/VLM/reasoning modules etc.

Embodied Agents

The term Embodied Agent was used by Brooks (1991) in "The Role of Learning in Autonomous Robots" (1991), and Brooks (1991) defined Embodiment in AI within "Intelligence Without Reason" and in the book "New Approaches to Intelligence":

"Embodiment: The robots have bodies and experience the world directly--their actions are part of a dynamic with the world, and the **actions have immediate feedback on the robots' own sensations. **".

Brooks revisits the prior literature on embodiment in Building Brains for Bodies. Steels and Brooks (1995) define the concepts of Embodied AI and Embodied Agents within autonomous agents in the book "The Artificial Life Route to Artificial Intelligence: Building Embodied, Situated Agents".

Positive:

  • Embodiment validates the capacity to operate in the real world.
  • Physical grounding provides meaning to (symbolic) information processed.

Negative:

  • Unclear regarding agents with virtual embodiment in virtual reality.
  • The definition does not consider Cognition/Language aspects.

AI-Agents (Agentic AI)

Shavit et al. (2023) define AI Agent: "we will generally conceptualize agentic AI systems as operating in pursuit of goals defined by humans and in environments determined by humans (and often in cooperation with human “teammates”), rather than fully-autonomous systems that set their own goals."

Positive:

  • Highlights concrete aspects of "agentiness": goal complexity, environment complexity, adaptability and independent execution.
  • Includes cooperation with human-in-the-loop
  • Identifies that there is no binary distinction between an LLM (GPT-4) and an agentic AI system.

Negative:

  • The definition itself is poorly framed to reflect the paper's "agentiness" aspects, such as the ability to generalize across a variety of tasks.
  • The definition does not highlight human cognitive capabilities like search, planning, perception, etc.
  • The level of independence and automation is debatable from a user-experience perspective.

Alternative definition uses:

  • Agent AI term is defined: "...as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally grounded data, and can produce meaningful embodied actions."

Autonomous Agent (my definition)

All of the above definitions include gaps, which I have noted alongside them.

Therefore, I found it necessary to add my own definition, which I call simply "Autonomous Agent" (AA):

Autonomous Agent (AA) perceives, reasons, plans and interacts using language, memories, emotions and tools as part of an environments made of infinite actors, actions, modalities and events to complete novel objectives over time.

Positive:

  • Perceive multimodal information
  • Reason
  • Plan own agenda
  • Communicate with language
  • Emotionally aware
  • Includes memory
  • Uses tools
  • Interact bi-directionally with the environment
  • Internal clock
  • Generalizes to novel tasks

Negative:

  • Would the agent find human-like consciousness useful? How would it work?
  • Lacks the aspects of self-construction and self-replication.

19.12.2024

Based on recent thoughts, I decided to update my prior definition to address the prior gaps.

The Autonomous Agent (AA) is now defined as:

Autonomous Agent (AA) perceives, reasons, plans, and interacts using language, memories, emotions, and tools within environments of infinite actors, actions, modalities, and events to complete novel objectives over time, driven by survival and replication, and capable of self-construction guided by an adaptable core.



Evaluation frameworks and Benchmarks

Autonomous agents operate in "Real-World Environments (RWEs)"1,2.

Therefore, to benchmark autonomous agents, we should evaluate them in RWEs. RWEs, with their unique events, are currently hard problems for agents. Thus, AI researchers typically prefer to benchmark autonomous agents with reproducible benchmarks instead.

For example, Anthropic's LLMs appear to be ahead of the other models in tasks like "pixel counting" and "coding".

An average developer could spend days of development work comparing the performance of different LLMs on a GUI benchmark, which would not generalize beyond the GUIs in its dataset. Thus these results could quickly become invalid as the OS/website/app design changes.

Rather, the developer could just pick a random GUI, test the LLM agent in it, and quickly iterate on the prompting technique, which improves performance across various LLMs, and the learning tends to transfer to new tasks.

We can alternatively review AI capabilities from a high level:

  • Levels of AGI1, 2

We must remember that above-human-level intelligence is not a theoretical concept, but current reality:

  • in game agents like AlphaZero, which demonstrated superhuman performance in multiple game domains through self-play, without domain-related human assistance, using the MCTS search algorithm.

Autonomous Agent Systems


Perception

F. Rosenblatt was an early investigator of perception through the Perceptron paper from 1958.

Perception provides agents with information about the world they inhabit by interpreting the responses of their sensors.

Perception of LLM-agents can be divided into:

  • Text: natural language, code
  • Visual1, 2: Image, Video, Graphs, GUIs1
  • Audio:
  • Physical: Sensors, Robots
  • Smell1, 2
  • Emotions1

Reasoning

Reasoning is [defined](https://dictionary.cambridge.org/dictionary/english/reasoning) by the Cambridge Dictionary as: "the process of thinking about something in order to make a decision".

An autonomous agent is characterized by its ability to make decisions autonomously in order to pursue its goals. Therefore, reasoning is a fundamental characteristic of an autonomous agent.

Humans reason in multiple ways. For example, mathematical reasoning cannot be solved using only perception/memory/planning.

Peng et al. 2024 categorize reasoning into:

The overall reasoning capability is currently roughly 86.6% (MMLU benchmark) with [Claude 3 Opus](https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf), published in March 2024.

Full human-level reasoning requires more progress, better reliability, better datasets and better benchmarks across multiple dimensions of reasoning, such as spatial, temporal, emotional, meta-cognitive and probabilistic reasoning.



Planning

Planning, in its essence, is "the act of deciding how to do something". Within AI, this translates to "devising a plan of action to achieve one’s goals", or more formally, "the reasoning side of acting," a computational deliberation process anticipating outcomes to best achieve pre-stated objectives. Model-based planning utilizes a mental model to visualize actions and predict their outcomes before execution.

Silver et al. (2021) argue that planning serves to maximise reward: intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment. As early as 1960, Minsky recognized "Planning" as a core challenge in heuristic programming towards achieving Artificial Intelligence. Gao et al. (2023) argue that agents should be capable of performing complex planning and predicting long-term consequences from internal models. The development of the STRIPS system in 1971 a, b marked a significant early milestone, followed by numerous planning systems throughout the 1970s-1990s, shaping the foundations of automated planning in AI.

Reinforcement Learning has widely used Planning: SoRB, Plan2Explore, AlphaZero (a),AlphaZero (b), DORA, DeepNash, Cicero (a) and Cicero (b).

Brown refers to search/planning with "test-time compute" as a key ingredient in past AI breakthroughs like Chess, Go and Poker. The Cicero model employed test-time compute in its planning module by predicting the actions of all players: Cicero predicted what the other players would think Cicero would do, in order to decide its output action and the intent passed to the dialogue model to generate communication back to the other players. This additional planning compute with its internal model made the model especially effective in the No-Press Diplomacy game.

Based on these impressive results in game environments, ChatGPT popularized the concept of offline Reinforcement Learning through RLHF. This concept builds on the initial idea of RL from Human Preferences, and we have since seen many variations, including RLAIF, with LLMs using offline RL.

LLMs using offline RL rely on static data collected from previous interactions/simulations, which traditionally suffers from data distribution shift during deployment. LLMs using online RL methods promise to overcome this by adjusting to new planning tasks outside the training distribution, for example new user intents or a new cultural context.

LLMs with online RL are known to be few-shot planners and zero-shot planners, with the ability to generate plans in the physical world, use closed-loop feedback, produce long-horizon plans, iteratively replan, self-refine plans, self-verify and plan interactively.
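To make the zero-shot planner idea concrete, the sketch below prompts a model to decompose a goal into steps and to replan when a step fails. The `llm(prompt) -> str` callable is a generic stand-in for any completion or chat API, and the prompt format is an illustrative assumption rather than one taken from the cited papers.

```python
# Hedged sketch of zero-shot planning with closed-loop replanning.
# `llm` is a placeholder for any text-completion callable, not a specific API.
from typing import Callable, List

def zero_shot_plan(llm: Callable[[str], str], goal: str) -> List[str]:
    prompt = (
        f"Goal: {goal}\n"
        "Write a numbered list of short, executable steps to achieve the goal."
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def replan(llm: Callable[[str], str], goal: str, failed_step: str, feedback: str) -> List[str]:
    prompt = (
        f"Goal: {goal}\n"
        f"The step '{failed_step}' failed with feedback: {feedback}\n"
        "Write a revised numbered list of steps."
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

# Dummy model so the sketch runs without any external service:
dummy_llm = lambda prompt: "1. open the fridge\n2. take the milk\n3. close the fridge"
print(zero_shot_plan(dummy_llm, "fetch the milk"))
```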

The integration of Large Language Models (LLMs) into planning and decision-making has recently garnered significant attention within the AI research community. Early works explored the zero-shot capabilities of LLMs for planning by directly prompting them to generate action sequences from natural language instructions, as seen in the work of Huang et al. (2022) with language models as zero-shot planners (Huang et al., 2022) and Jansen (2020) on visually-grounded planning without vision (Jansen, 2020). These initial investigations highlighted the potential of LLMs to extract actionable knowledge for embodied agents and infer detailed plans from high-level instructions. However, these methods often faced challenges in generating executable and grounded plans in complex environments.

To address the limitations of purely reactive approaches, various frameworks have been proposed to enhance the planning abilities of LLMs. One line of work focuses on adaptive and interactive planning. AdaPlanner (Sun et al., 2023) introduces a closed-loop approach allowing LLM agents to refine plans adaptively based on environmental feedback, using both in-plan and out-of-plan refinement strategies (Sun et al., 2023). Similarly, DEPS (Wang et al., 2023) proposes an interactive planning approach that incorporates descriptions of plan execution and self-explanations of failures, along with a goal selector to refine initial plans in open-world environments like Minecraft (Wang et al., 2023). ReAct (Yao et al., 2022) synergizes reasoning and acting by interleaving reasoning traces and task-specific actions, enabling LLMs to interact with external sources and improve performance in question answering and interactive decision-making (Yao et al., 2022). Inner Monologue (Huang et al., 2022) leverages environment feedback to enable LLMs to reason and plan more effectively in robotic control scenarios, processing feedback through natural language (Huang et al., 2022). Ask-before-Plan (Zhang et al., 2024) tackles proactive planning by enabling agents to predict clarification needs and invoke tools to gather information before plan generation, introducing the Ask-before-Plan benchmark (Zhang et al., 2024). RePrompt (Chen et al., 2024) proposes automatic prompt engineering to optimize step-by-step instructions in LLM agents' prompts based on chat history, improving planning in specific domains (Chen et al., 2024).

Another significant direction is knowledge-augmented planning. KnowAgent (Zhu et al., 2024) introduces a knowledge-augmented planning approach that incorporates explicit action knowledge to constrain action paths and mitigate planning hallucinations (Zhu et al., 2024). Reasoning with Language Model is Planning with World Model (RAP) (Hao et al., 2023) repurposes LLMs as both world models and reasoning agents, employing Monte Carlo Tree Search for strategic exploration in the reasoning space (Hao et al., 2023). WebDreamer (Gu et al., 2024) innovatively uses LLMs as world models in web environments, simulating action outcomes to determine optimal actions for web agents (Gu et al., 2024). LLM as Commonsense Knowledge (Zhao et al., 2023) explores combining LLMs' commonsense world model with Monte Carlo Tree Search for large-scale task planning, guided by a heuristic policy induced by LLMs (Zhao et al., 2023). ChemCrow (Bran et al., 2023) augments LLMs with chemistry tools, demonstrating enhanced performance in complex chemical tasks (Bran et al., 2023). RestGPT (Song et al., 2023) connects LLMs with real-world RESTful APIs, enabling them to tackle complex instructions through online planning and API execution (Song et al., 2023). KG-Agent (Jiang et al., 2024) proposes an autonomous agent framework that enables LLMs to reason over knowledge graphs, leveraging a toolbox and knowledge memory for efficient complex reasoning (Jiang et al., 2024).

Hierarchical and meta-planning approaches aim to manage the complexity of long-horizon tasks. Meta-Task Planning (MTP) (Zhang et al., 2024) introduces a zero-shot methodology that decomposes complex tasks into hierarchies of meta-tasks, simplifying planning for LLM-based multi-agent systems (Zhang et al., 2024). TwoStep (Singh et al., 2024) combines classical planning with LLMs for multi-agent task planning, leveraging LLMs for goal decomposition to achieve faster planning times and fewer execution steps (Singh et al., 2024). Tree of Thoughts (ToT) (Yao et al., 2023) generalizes chain-of-thought prompting, enabling LLMs to explore multiple reasoning paths and self-evaluate choices for deliberate problem-solving in tasks requiring search and lookahead (Yao et al., 2023). SelfGoal (Yang et al., 2024) presents an automatic approach for agents to adaptively break down high-level goals into subgoals during environment interaction, improving performance in various task environments (Yang et al., 2024). CodePlan (Wen et al., 2024) empowers LLMs to generate and follow code-form plans, improving reasoning capabilities across diverse multi-step reasoning benchmarks (Wen et al., 2024).

Benchmarks and evaluations are crucial for systematic progress. Valmeekam et al., 2022 introduce PlanBench and identify gaps in LLM planning: lack of few-shot examples, dynamic environments, model-based reasoning, rules & constraints, grounding to reality and to get feedback. TravelPlanner (Xie et al., 2024) introduces a real-world travel planning benchmark to evaluate the planning capabilities of language agents in complex scenarios (Xie et al., 2024). Ask-before-Plan (Zhang et al., 2024) establishes a new benchmark for proactive agent planning focusing on clarification needs (Zhang et al., 2024). FlowBench (Xiao et al., 2024) presents the first benchmark for workflow-guided planning, formalizing workflow knowledge in diverse formats (Xiao et al., 2024). SmartPlay (Wu et al., 2023) introduces a benchmark consisting of games to evaluate LLMs as intelligent agents across various capabilities (Wu et al., 2023). PPNL (Aghzal et al., 2023) proposes a benchmark to evaluate LLMs' spatial-temporal reasoning in path planning tasks (Aghzal et al., 2023). Valmeekam et al. (2023, 2023) critically investigate the planning abilities of LLMs and propose benchmarks to systematically evaluate their autonomous and heuristic planning capabilities (Valmeekam et al., 2023a, Valmeekam et al., 2023b). Huang et al. (2024) provide a comprehensive survey on the planning of LLM agents, categorizing existing works and discussing future challenges (Huang et al., 2024). Li (2024) reviews prominent paradigms for LLM-based agents including tool use, planning, and feedback learning, proposing a unified taxonomy for analysis (Li, 2024).

Several works explore multi-agent planning and collaboration. RoCo (Zhao et al., 2023) proposes a dialectic multi-robot collaboration approach using LLMs for high-level communication and low-level path planning (Zhao et al., 2023). Cooperative Strategic Planning (Wang et al., 2024) enhances reasoning capabilities by separating reasoning steps and assigning distinct roles to planning and reasoning agents, trained through PPO (Wang et al., 2024). SMART-LLM (Kannan et al., 2023) introduces a framework for multi-robot task planning, converting high-level instructions into multi-robot plans using LLMs (Kannan et al., 2023). Scalable Multi-Robot Collaboration (Chen et al., 2023) compares centralized and decentralized communication frameworks for LLM-based multi-robot planning, finding hybrid frameworks to be more effective (Chen et al., 2023). Co-NavGPT (Yu et al., 2023) proposes a multi-robot cooperative visual semantic navigation framework using LLMs as global planners, enhancing scene comprehension and task allocation (Yu et al., 2023). Building Cooperative Embodied Agents (Zhang et al., 2023) presents CoELA, a modular framework that uses LLMs for planning, communication, and cooperation in multi-agent embodied environments (Zhang et al., 2023). Theory of Mind for Multi-Agent Collaboration (Li et al., 2023) evaluates LLM agents in multi-agent cooperative games, exploring emergent collaborative behaviors and ToM capabilities (Li et al., 2023).

Embodied and robotic agents are a key application area. LLM-Planner (Song et al., 2022) proposes a few-shot grounded planning method for embodied agents using LLMs (Song et al., 2022). Generating Executable Action Plans (Gramopadhye and Szafir, 2022) integrates environmental awareness into LLM plan generation for better executability in robotic agents (Gramopadhye and Szafir, 2022). ProgPrompt (Singh et al., 2022) introduces a programmatic prompt structure for generating situated robot task plans, functional across environments and robot capabilities (Singh et al., 2022). JARVIS (Zheng et al., 2022) is a neuro-symbolic commonsense reasoning framework for conversational embodied agents, utilizing LLMs for language understanding and sub-goal planning (Zheng et al., 2022). Code as Policies (Liang et al., 2022) repurposes code-writing LLMs to write robot policy code, enabling spatial reasoning and generalization (Liang et al., 2022). LLM A* (Xiao and Wang, 2023; Meng et al., 2024) and LLM+P (Liu et al., 2023) integrate LLMs with classical planning algorithms like A* and PDDL planners to combine the strengths of both approaches (Xiao and Wang, 2023, Meng et al., 2024, Liu et al., 2023). GPT-Driver (Mao et al., 2023) reformulates motion planning as a language modeling problem, leveraging LLMs to generate driving trajectories (Mao et al., 2023a). A Language Agent for Autonomous Driving (Mao et al., 2023) proposes Agent-Driver, a framework that uses LLMs as cognitive agents to integrate human-like intelligence into autonomous driving systems (Mao et al., 2023b). ProAgent (Ye et al., 2023) introduces Agentic Process Automation, using LLM-based agents for advanced automation and workflow construction (Ye et al., 2023). Co-NavGPT (Yu et al., 2023) focuses on multi-robot cooperative visual semantic navigation (Yu et al., 2023). Multi-agent Planning using VLMs (Brienza et al., 2024) explores multi-agent planning with visual language models for embodied tasks (Brienza et al., 2024).

Finally, several works investigate learning and optimization for LLM-based agents. AgentGen (Hu et al., 2024) enhances planning abilities through instruction tuning, using automatically generated environments and tasks (Hu et al., 2024). RAFA (Liu et al., 2023) proposes a principled framework with provable regret guarantees for orchestrating reasoning and acting, casting reasoning as learning and planning in Bayesian adaptive MDPs (Liu et al., 2023). Retroformer (Yao et al., 2023) introduces a framework for reinforcing language agents by learning a retrospective model and tuning prompts through policy gradient optimization (Yao et al., 2023). AgentTuning (Zeng et al., 2023) presents a method to enhance agent abilities of LLMs through instruction tuning while maintaining general capabilities (Zeng et al., 2023). Learning Planning-based Reasoning (Jiao et al., 2024) proposes learning planning-based reasoning through Direct Preference Optimization on collected trajectories, ranked by synthesized process rewards (Jiao et al., 2024). Language Agents as Optimizable Graphs (Zhuge et al., 2024) describes LLM-based agents as computational graphs and proposes automatic graph optimizers to refine prompts and improve agent orchestration (Zhuge et al., 2024). SayCanPay (Hazra et al., 2023) combines LLMs with heuristic planning, using learnable domain knowledge to guide action generation and selection (Hazra et al., 2023). PDDLEGO (Zhang et al., 2024) iteratively constructs planning representations for textual environments, enabling partial plan generation and information acquisition in partially-observed settings (Zhang et al., 2024). RePrompt (Chen et al., 2024) uses gradient descent to optimize prompts for LLM agents based on interaction history (Chen et al., 2024). Graph Learning for Planning (Wu et al., 2024) explores graph learning methods to enhance task planning in language agents, addressing limitations of LLMs in graph-based decision-making (Wu et al., 2024). Executable Code Actions (Wang et al., 2024) proposes using executable Python code as actions for LLM agents, improving performance and flexibility (Wang et al., 2024). Self-collaboration Code Generation (Dong et al., 2023) presents a self-collaboration framework using multiple LLM agents as experts to tackle complex code generation tasks (Dong et al., 2023). Planning with Large Language Models for Code Generation (Zhang et al., 2023) proposes Planning-Guided Transformer Decoding, using planning algorithms to guide Transformer decoding for better code generation (Zhang et al., 2023).

Verifying Planning

Brown (2024) notes that it is easier for humans to verify the correctness of a reasoning chain in specific domains (math/programming/puzzles; while not true in image recognition/information retrieval) than to generate the reasoning solution, which means LLMs are better verifiers than generators of correct reasoning chains. Brown (2024) calls this the "generator-verifier gap". Brown argues that if, in a given domain, there is a generator-verifier gap and we have a good verifier, then it is possible to scale up the compute used for solution generation and then verify. Brown (2024) continues that the "Let's Verify Step by Step" paper introduces a process reward model, which instead of conditioning the verifier on the final state, conditions it on every correct step in the process towards the final goal, so it verifies each individual step.
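The generator-verifier gap translates into a simple recipe: spend extra test-time compute sampling many candidate solutions and keep the one a verifier scores highest. A minimal best-of-N sketch follows; the `generate` and `verify` callables are illustrative placeholders, not a published implementation.

```python
# Minimal "generate then verify" (best-of-N) sketch of scaling test-time compute.
# `generate` and `verify` are illustrative placeholders, not a published API.
import random
from typing import Callable, List

def best_of_n(generate: Callable[[str], str],
              verify: Callable[[str, str], float],
              problem: str,
              n: int = 16) -> str:
    candidates: List[str] = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda c: verify(problem, c))

# Toy example: "solve" 3 + 4 by sampling guesses and verifying them exactly.
generate = lambda problem: str(random.randint(0, 10))
verify = lambda problem, candidate: 1.0 if candidate == "7" else 0.0
print(best_of_n(generate, verify, "3 + 4 = ?", n=32))   # almost surely "7"
```

A process reward model would instead score each intermediate step of a candidate chain, rather than only the final answer.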

LLM-Modulo uses an external verifier to improve LLM planning through collaboration with a team of agents, which gives a large boost to LLM planning capabilities and overcomes the limitations of the "LLM as Verifier" and self-criticism approaches. Zhang et al. (2024) show that GenRM-CoT outperforms discriminative verifiers, scales with inference-time compute, model capacity and dataset size, and does not need humans to verify. Near-perfect planning accuracy can be achieved with a parallelized global solution evaluator, avoiding formalization of the problem, through the Mind Evolution1 strategy, with an LLM generating/recombining/refining candidate responses.

LLM-based planning approaches include:

  • Task decomposition (CoT, ReAct)
  • Multi-plan selection (ToT, CoT-SC)
  • External planner aided (LLM + PDDL)
  • Reflection and Refinement (Reflection, Self-Refine, CRITIC)
  • Memory-aided planning (REMEMBER)

These diverse approaches highlight the rapid progress in leveraging LLMs for planning, ranging from enhancing their inherent reasoning abilities to integrating them with external tools and algorithms for more robust and effective decision-making in complex environments.
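As a concrete instance of the task-decomposition family above, the sketch below implements a bare-bones ReAct-style loop: the model interleaves reasoning with tool calls until it emits a final answer. The transcript format, the tool registry and the `llm` stub are illustrative assumptions, not the original ReAct implementation.

```python
# Bare-bones ReAct-style reason/act loop. The formats and the `llm` stub are
# illustrative assumptions, not the original ReAct implementation.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy tool
}

def react_agent(llm: Callable[[str], str], question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(transcript)            # model produces a Thought/Action or a Final Answer
        transcript += output + "\n"
        if output.startswith("Final Answer:"):
            return output.removeprefix("Final Answer:").strip()
        if output.startswith("Action:"):    # e.g. "Action: calculator[2*21]"
            name, _, arg = output.removeprefix("Action:").strip().partition("[")
            observation = TOOLS[name](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return "no answer within step budget"

# Dummy model so the loop runs end-to-end without an external API:
steps = iter(["Action: calculator[2*21]", "Final Answer: 42"])
print(react_agent(lambda _: next(steps), "What is 2 * 21?"))   # -> 42
```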


Memory & Context

Memory is defined as "the ability to remember information, experiences, and people."

"Minsky (1985) proposed that the human ability to categorize experiences into recognizable objects is fundamental to learning. He argued that human memory is organized around discrete objects rather than working like a hologram, as holographic memory would only be useful when encountering exact replicas of past experiences. This object-based categorization enables us to generalize from experiences and accumulate knowledge, even when situations vary."

Memory is vital for humans and AI in order to retrieve relevant "context": "...all the facts/opinions/etc., which relate to a particular thing/event."

The term "context" is not formed from the words "con" and "text". It actually derives from the Latin word "contextus", which refers to "joining together": "com" = together and "texere" = to weave. We tend to think of the text input as the LLM context. However, LLM-based agents apply context from multiple modalities, and it is not always explicitly written or said aloud.

Terry Winograd argued in 2001 in "Architectures for Context" that communication is based on common ground between speaker and hearer during interpretation. This is guided not only by the physical environment, but also by non-physical shared context, such as a common goal.

"The Dimensions of Context Space" by Lenat (1998) offers "a must-read" analysis on the various dimensions and aspects of the context. According to Lenat, Context is a region in n-dimensional embedding space, where text is only one of the dimensions.

LLM context input length has rapidly increased from the 2k-token context of GPT-3 to current 1M-token production systems. We will likely soon see production systems with infinite context, which may additionally use LLM fine-tuning or tree-agents. Interestingly, LLMs are already at a "superintelligence" level in terms of their capacity to support vastly more textual context than any human.

The ability to support larger context windows is quickly making possible the use of new modalities such as vision, sound, actions, etc.

Traditionally, LLMs are considered "stateless", with no retention of the context used in the previous request. "In-Context Learning" (ICL) is the LLM's ability to "learn" to process and understand the context provided in the input without explicit parameter updates. Agentic systems today use ICL together with external memory such as vector/graph/SQL databases, or simply text/JSON/XML files. We often refer to these techniques as Retrieval-Augmented Generation (RAG), which aims to enhance the LLM context with up-to-date/personalized/factual/domain-specific information. Evidence exists that LLMs are able to track their own internal state changes. Newer models like Gemini 2.0 are surprisingly good at such calculations, which go well beyond mere pattern matching of the training data. The ability of LLMs to track states is promising for reasoning tasks. Extra-large input context windows enable models like Gemini to process even large memory structures. KV-caching reuses LLM prompts/tokens/internal states to significantly reduce latency. Alternative KV-caching1,2 techniques, however, directly improve the memory management of the LLMs.
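A minimal sketch of the RAG pattern mentioned above: retrieve the stored snippets most relevant to a query and prepend them to the prompt as extra context. The word-overlap similarity and the in-memory list are deliberate simplifications standing in for an embedding model and a vector/graph/SQL database.

```python
# Minimal Retrieval-Augmented Generation sketch. The word-overlap similarity and
# in-memory list stand in for a real embedding model and vector database.
from typing import List

MEMORY: List[str] = [
    "The user's favourite colour is green.",
    "The user lives in Helsinki.",
    "The project deadline is next Friday.",
]

def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))          # Jaccard overlap

def retrieve(query: str, k: int = 2) -> List[str]:
    return sorted(MEMORY, key=lambda doc: similarity(query, doc), reverse=True)[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("Where does the user live?"))
```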

Fine-tuning methods have been used effectively to improve LLM performance with extra-large context windows and to memorize domain-specific knowledge.

Titan models were recently introduced as models capable of memorizing at test time.

The Memory3 architecture suggests that infinite context is possible with human-like memory architectures, which support memory consolidation, conscious reasoning and sparse memory.

LLM-based agents apply various types of memory approaches (a minimal episodic-memory sketch follows the list):

  • Long term memory1
  • Episodic memory1,2,3,4,5, 6,7,8
  • Semantic memory1,2
  • Procedural memory1
  • Graph memory1
  • Working memory1,2,3,4
  • Dynamic memory1
  • Shared memory / Collective memory1
  • Persistent Experience Memory1
  • Explicit memory1
  • Parametric memory1
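As an illustration of the simplest of these, the sketch below implements a tiny episodic memory: time-stamped events that can be recalled by keyword and recency. It is a toy abstraction of the idea, not a specific framework's memory module.

```python
# Toy episodic memory: time-stamped events recalled by keyword and recency.
# This is an illustrative abstraction, not a specific agent framework's module.
from dataclasses import dataclass, field
from typing import List
import time

@dataclass
class Episode:
    timestamp: float
    content: str

@dataclass
class EpisodicMemory:
    episodes: List[Episode] = field(default_factory=list)

    def store(self, content: str) -> None:
        self.episodes.append(Episode(time.time(), content))

    def recall(self, keyword: str, k: int = 3) -> List[str]:
        hits = [e for e in self.episodes if keyword.lower() in e.content.lower()]
        hits.sort(key=lambda e: e.timestamp, reverse=True)   # most recent first
        return [e.content for e in hits[:k]]

memory = EpisodicMemory()
memory.store("Met Alice to discuss the benchmark results.")
memory.store("Alice asked for the updated planning module.")
print(memory.recall("alice"))
```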

Embodiment

Real-world physical interaction requires Autonomous Agents capable of making embodied actions and decisions.


Tool use



Emotional Intelligence


Self-Recursive Improvement

  • "LMs can Self-Improve" its own reasoning outputs using techniques such as CoT, Self-Consistency and In-Context Learning during Inference.
  • LLMs can Self-Improve its model weights with: STaR, where the LLM itself is fine-tuned using correct CoT reasoning.
  • V-STaR improves the STaR-method by making it data efficient: by learning not only from correct, but as well incorrect solutions generated.
  • LMs Recursively Self-Improving (RSI) code with [STOP]#stop). Adam Kalai explains insights from this technique in this lecture about STOP.
  • LLM Self-Improves its LLM by finetuning with its own synthetic data without human evaluation to imrove mathematical reasoning.
  • LLM fine-tuning may be based on Self-Play, where the LLM is fine-tuned based on it playing against itself from previous iteration.
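A minimal sketch of the self-consistency idea from the first bullet: sample several answers and keep the majority vote. The `sample_answer` stub stands in for a temperature-sampled LLM call; it is an assumption so the example runs offline.

```python
# Self-consistency sketch: sample several answers, return the majority vote.
# `sample_answer` is a stub standing in for a temperature-sampled LLM call.
from collections import Counter
import random

def sample_answer(question: str) -> str:
    # A noisy stand-in: usually answers "12", sometimes makes a mistake.
    return random.choices(["12", "11", "13"], weights=[0.7, 0.15, 0.15])[0]

def self_consistent_answer(question: str, samples: int = 15) -> str:
    votes = Counter(sample_answer(question) for _ in range(samples))
    return votes.most_common(1)[0][0]

print(self_consistent_answer("What is 3 * 4?"))   # almost always "12"
```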

Brain research

Movie reconstruction from mouse visual cortex activity

  • Reconstructs ground-truth video using images from mouse brain.

Brain representation in conscious and unconscious vision

  • Discovers fronto-parietal cortex is involved in representing unconscious content.

Consciousness

There is no single generally agreed definition of Consciousness and I will not try to define it here.

Integrated Information Theory (IIT), and its latest version 4.0, is one of the key existing theories. The theory includes "Phi", which measures the amount of integrated information to quantify the level of consciousness of a system. IIT includes 5 key characteristics:

  • Intrinsic
  • Composition
  • Information
  • Integration
  • Exclusion

IIT allows making predictions that can be tested through experiments, and it is not limited to human brain-like consciousness.

Ilya Sutskever defined, perhaps as the first, a test scenario to check whether AI models, such as LLMs, have consciousness.

Literature reviews on consciousness:


Why do Autonomous Agents work?


Next sequence prediction

LLMs are trained to predict the next word/token, which leads to multi-task learning:

  • Backed up by empirical evidence.

The single training objective "predict the next token" results in massively multi-task learning.

  • "Massively Multi-task learning" results massive amount of new skills learned from a single objective.

Next-sequence prediction is a generic algorithm.

  • Next-sequence prediction is a generic learning process.
  • Any "<input, output>" sequence relationship can be learned as a "next-sequence prediction task", as sketched below.

Information is typically sequential:

  • Language is a sequence of words.
  • DNA is a sequence of nucleotides.
  • Computer programs are sequences of instructions.
  • Media: videos are sequences of images, music is a sequence of notes, an image is a sequence of pixels and speech is a sequence of phonemes.
  • Actions: a dance is a sequence of movements, a day is a sequence of events, time is a sequence of time steps.
  • Concepts about the world: causality is sequential (cause-effect). Time is sequential (before-after). Life is sequential (parent-child).

Cross-modality Transformers:

  • The universal nature of next-sequence prediction is empirically visible in different Transformer models: ViT for Images, Whisper for Audio, SORA for video.

Next sequence prediction is perhaps the most generic single learning objective known to produce intelligent behaviour.

I call this surprising, even unexpected, phenomenon the "Paradox of Lexical Labyrinth".

Paradox of Lexical Labyrinth:

The paradoxical phenomenon whereby a seemingly simple mechanism of next-sequence prediction, such as predicting the next word in a language, gives rise to advanced cognitive skills like profound reasoning capabilities. The labyrinth refers to the vast and complex landscape of language, characterized by its infinite potential for compressing meaning, expressions and intelligence.

Teemu Maatta, 07.06.2024


Demystifying Emerging Abilities

Emergent Abilities refer to abilities present in a larger LLM but not in a smaller one.

  • The initial definition refers to the situation where emergent abilities have so far increased continuously as compute is scaled up and more data is introduced.

There are more than 137 known emergent abilities (and the number is increasing).

  • Emergent abilities include emergent prompting strategies such as CoT, which was not present in GPT-2 and emerged in the GPT-3 model.

Research has demonstrated the existence of emergent abilities from the perspective of pre-training loss, even with continuous metrics.

Overall, emergent abilities are shown to appear in language models as a function of pre-training loss, rather than model/data size.

Emergent abilities suggest that AI models self-organize internal structures to perform tasks to reduce pre-training loss, even without being explicitly programmed for those specific capabilities.


Free energy principle

Friston (2010) claims in "The free energy principle and cognitive agents" that biological systems, like human brains, reduce free energy by acting on the world and optimizing their internal states related to perception and action.

  • The basic idea is, that biological agents minimize free energy.

Just as the human brain minimizes free energy, LLMs minimize prediction error:

  • If we give an LLM the training objective of minimizing the loss for "next-sequence prediction", along with a lot of energy/compute and data, then it will self-organize its weights into an optimal local order.
  • This compression enables LLMs to learn emergent skills beyond merely memorizing the training data.

Interpretability

The ability to extract and directly interpret LLM features helps to build Autonomous agents, which understand and interact effectively with human language.

We also know that CLIP model neurons can be matched with biological human brain neurons.

Overall, we are now able not only to match human and AI model neurons, but also to interpret LLM model features.


Synthetic data

Synthetic data is useful not only for training ever-larger AI models, but more importantly for making efficient use of scarce domain data.

The trend of LLMs using TinyStories or textbook-like datasets with exercises is known to significantly improve LLM performance. TinyGSM achieved 81.5% accuracy on GSM8K, outperforming significantly larger LLMs. In these examples, synthetic data offers the possibility to distill smaller, yet high-performing student LLMs from the teacher LLM with a similar performance level. Secondly, LLMs can be used to generate diverse, yet cheaply available synthetic data to improve reasoning capabilities.

  • Autonomous Agents help generate long-range planning and action data within the real world, which is motivated by enabling fine-tuning of VLMs or LLMs with this data.
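A sketch of the teacher-student distillation recipe described above: a stronger teacher model generates synthetic exercises with solutions, which are written into a fine-tuning dataset for a smaller student model. The `teacher` callable and the JSONL schema are generic assumptions, not the TinyGSM pipeline itself.

```python
# Sketch of synthetic-data distillation: a teacher LLM writes exercises with
# solutions, stored as a fine-tuning dataset for a smaller student model.
# The `teacher` stub and JSONL schema are illustrative assumptions only.
import json
from typing import Callable, List, Dict

def generate_synthetic_dataset(teacher: Callable[[str], str],
                               topics: List[str],
                               per_topic: int = 2) -> List[Dict[str, str]]:
    dataset = []
    for topic in topics:
        for i in range(per_topic):
            prompt = f"Write grade-school math word problem #{i} about {topic}, then its solution."
            dataset.append({"topic": topic, "teacher_output": teacher(prompt)})
    return dataset

# Dummy teacher so the sketch runs offline; a real pipeline would call an LLM API.
dummy_teacher = lambda p: "Q: Sam has 3 apples and buys 2 more. How many? A: 5"
records = generate_synthetic_dataset(dummy_teacher, ["apples", "trains"])

with open("synthetic_student_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
print(f"wrote {len(records)} synthetic examples")
```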

Related work

Includes a list of literature reviews by other authors for quick reference.


Citation

How to cite my work?

@misc{MaattaAutonomousAgents2023,
  author = {Teemu Maatta},
  title = {Autonomous Agents},
  year = {2023},
  howpublished = {\url{https://github.com/tmgthb/Autonomous-Agents}},
  note = {Accessed: YYYY-MM-DD}
}

