Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: A Short Survey

Building autonomous machines that can explore open-ended environments, discover possible interactions and build repertoires of skills is a general objective of artificial intelligence. Developmental approaches argue that this can only be achieved by autotelic agents: intrinsically motivated learning agents that can learn to represent, generate, select and solve their own problems. In recent years, the convergence of developmental approaches with deep reinforcement learning (rl) methods has led to the emergence of a new field: developmental reinforcement learning. Developmental rl is concerned with the use of deep rl algorithms to tackle a developmental problem: the intrinsically motivated acquisition of open-ended repertoires of skills. The self-generation of goals requires the learning of compact goal encodings as well as their associated goal-achievement functions. This raises new challenges compared to standard rl algorithms, which were originally designed to tackle pre-defined sets of goals using external reward signals. The present paper introduces developmental rl and proposes a computational framework based on goal-conditioned rl to tackle the intrinsically motivated skills acquisition problem. It proceeds to present a typology of the various goal representations used in the literature, before reviewing existing methods to learn to represent and prioritize goals in autonomous systems. We finally close the paper by discussing some open challenges in the quest for intrinsically motivated skills acquisition.


Introduction
Building autonomous machines that can explore large environments, discover interesting interactions and learn open-ended repertoires of skills is a long-standing goal in artificial intelligence. Humans are remarkable examples of this lifelong, open-ended learning. They learn to recognize objects and crawl as infants, then learn to ask questions and interact with peers as children. Across their lives, humans build a large repertoire of diverse skills from a virtually infinite set of possibilities. What is most striking, perhaps, is their ability to invent and pursue their own problems, using internal feedback to assess completion. We would like to build artificial agents able to demonstrate equivalent lifelong learning abilities.
We can think of two approaches to this problem: developmental approaches, in particular developmental robotics, and reinforcement learning (rl). Developmental robotics takes inspiration from artificial intelligence, developmental psychology and neuroscience to model cognitive processes in natural and artificial systems (Asada et al., 2009; Cangelosi & Schlesinger, 2015). Following the idea that intelligence should be embodied, robots are often used to test learning models. Reinforcement learning, on the other hand, is the field interested in problems where agents learn to behave by experiencing the consequences of their actions in the form of rewards and costs. As a result, these agents are not explicitly taught; they need to learn to maximize cumulative rewards over time by trial and error (Sutton & Barto, 2018). While developmental robotics is a field oriented towards answering particular questions around sensorimotor, cognitive and social development (e.g. how can we model language acquisition?), reinforcement learning is a field organized around a particular technical framework and set of methods. Now powered by deep learning optimization methods leveraging the computational efficiency of large computational clusters, rl algorithms have recently achieved remarkable results including, but not limited to, learning to solve video games at a super-human level (Mnih et al., 2015), to beat world champions at chess and Go (Silver et al., 2016), or even to control stratospheric balloons in the real world (Bellemare et al., 2020).
Although standard rl problems often involve a single agent learning to solve a unique task, rl researchers have extended them to multi-goal rl problems. Instead of pursuing a single goal, agents can now be trained to pursue goal distributions (Kaelbling, 1993; Sutton et al., 2011; Schaul et al., 2015). As the field progresses, new goal representations emerge: from specific goal states to high-dimensional goal images or abstract language-based goals (Luketina et al., 2019). However, most approaches still fall short of modeling the learning abilities of natural agents because they train agents to solve predefined sets of tasks, via external and hand-defined learning signals.
Developmental robotics directly aims to model children's learning and, thus, takes inspiration from the mechanisms underlying autonomous behaviors in humans. Most of the time, humans are not motivated by external rewards but spontaneously explore their environment to discover and learn about what is around them. This behavior seems to be driven by intrinsic motivations (ims), a set of brain processes that motivate humans to explore for the mere purpose of experiencing novelty, surprise or learning progress (Berlyne, 1966; Gopnik et al., 1999; Kidd & Hayden, 2015; Oudeyer & Smith, 2016; Gottlieb & Oudeyer, 2018).
The integration of ims into artificial agents thus seems to be a key step towards autonomous learning agents (Schmidhuber, 1991c; Kaplan & Oudeyer, 2007). In developmental robotics, this approach enabled sample-efficient learning of high-dimensional motor skills in complex robotic systems (Santucci et al., 2020), including locomotion (Baranes & Oudeyer, 2013; Martius et al., 2013), soft object manipulation (Rolf & Steil, 2013; Nguyen & Oudeyer, 2014), visual skills (Lonini et al., 2013) and nested tool use in real-world robots (Forestier et al., 2017). Most of these approaches rely on population-based optimization algorithms: non-parametric models trained on datasets of (policy, outcome) pairs. Population-based algorithms cannot leverage automatic differentiation on large computational clusters, often demonstrate limited generalization capabilities and cannot easily handle high-dimensional perceptual spaces (e.g. images) without hand-defined input pre-processing. For these reasons, developmental robotics could benefit from new advances in deep rl.
Recently, we have been observing a convergence of these two fields, forming a new domain that we propose to call developmental reinforcement learning, or more broadly developmental artificial intelligence. Indeed, rl researchers now incorporate fundamental ideas from the developmental robotics literature in their own algorithms and, conversely, developmental robotics learning architectures are beginning to benefit from the generalization capabilities of deep rl techniques. These convergences can mostly be categorized in two ways depending on the type of intrinsic motivation (im) being used (Oudeyer & Kaplan, 2007):
• Knowledge-based IMs are about prediction. They compare the situations experienced by the agent to its current knowledge and expectations, and reward it for experiencing dissonance (or resonance). This family includes ims rewarding prediction errors (Schmidhuber, 1991c; Pathak et al., 2017), novelty (Bellemare et al., 2016; Burda et al., 2019; Raileanu & Rocktäschel, 2020), surprise (Achiam & Sastry, 2017), negative surprise (Berseth et al., 2019), learning progress (Lopes et al., 2012; Kim et al., 2020) or information gains (Houthooft et al., 2016); see a review in Linke et al. (2020). This type of im is often used as an auxiliary reward to organize the exploration of agents in environments characterized by sparse rewards. It can also be used to facilitate the construction of world models (Lopes et al., 2012; Kim et al., 2020; Sekar et al., 2020).
• Competence-based IMs, on the other hand, are about control. They reward agents for solving self-generated problems, for achieving self-generated goals. In this category, agents need to represent, select and master self-generated goals. As a result, competence-based ims were often used to organize the acquisition of repertoires of skills in task-agnostic environments (Baranes & Oudeyer, 2010, 2013; Santucci et al., 2016; Forestier & Oudeyer, 2016; Nair et al., 2018b; Warde-Farley et al., 2019; Colas et al., 2019; Blaes et al., 2019; Pong et al., 2020; Colas et al., 2020a). Just like knowledge-based ims, competence-based ims organize the exploration of the world and, thus, might be used to train world models (Baranes & Oudeyer, 2013; Chitnis et al., 2021) or facilitate learning in sparse-reward settings (Colas et al., 2018).
We propose to use the adjective autotelic, from the Greek auto (self) and telos (end, goal), to characterize agents that are intrinsically motivated to represent, generate, pursue and master their own goals (i.e. that are both intrinsically motivated and goal-conditioned).
rl algorithms using knowledge-based ims leverage ideas from developmental robotics to solve standard rl problems. On the other hand, rl algorithms using competence-based ims organize exploration around self-generated goals and can be seen as targeting a developmental robotics problem: the open-ended and self-supervised acquisition of repertoires of diverse skills.
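To make the knowledge-based case concrete, here is a minimal, hypothetical sketch of a prediction-error intrinsic reward in the spirit of Schmidhuber (1991c) and Pathak et al. (2017); the `forward_model` interface is an assumption made for illustration, not a construct from the surveyed works:

```python
def prediction_error_bonus(forward_model, state, action, next_state):
    """Knowledge-based intrinsic reward: the error of the agent's forward model.
    forward_model(state, action) is assumed to return a predicted next state."""
    predicted = forward_model(state, action)
    # Squared prediction error serves as the intrinsic reward signal:
    # poorly predicted transitions yield a large exploration bonus.
    return sum((p - x) ** 2 for p, x in zip(predicted, next_state))

# A (bad) model that predicts "no change" is surprised by any transition:
static_model = lambda state, action: state
bonus = prediction_error_bonus(static_model, [0.0, 0.0], None, [1.0, 0.0])
# bonus == 1.0: the unpredicted transition is rewarded
```

As the forward model improves, the bonus for familiar transitions vanishes, which is exactly how such rewards push the agent towards situations that contradict its current knowledge.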
Recently, goal-conditioned rl agents were also endowed with the ability to generate and pursue their own goals and to learn to achieve them via self-generated rewards. We call this new set of autotelic methods rl-imgeps. In contrast, one can refer to externally-motivated goal-conditioned rl agents as rl-emgeps. This paper proposes a formalization and a review of the rl-imgep algorithms at the convergence of rl methods and developmental robotics objectives. Figure 1 proposes a visual representation of intrinsic motivation approaches (knowledge-based vs competence-based ims, or imgeps) and goal-conditioned rl (externally vs intrinsically motivated). Their intersection is the family of autotelic algorithms that train agents to generate and pursue their own goals by training goal-conditioned policies.
We define goals as the combination of a compact goal representation and a goal-achievement function to measure progress. This definition highlights new challenges for autonomous learning agents. While traditional rl agents only need to learn to achieve goals, rl-imgep agents also need to learn to represent them, to generate them and to measure their own progress. After learning, the resulting goal-conditioned policy and its associated goal space form a repertoire of skills: a repertoire of behaviors that the agent can represent and control. We believe organizing past goal-conditioned rl algorithms at the convergence of developmental robotics and rl into a common classification, towards the resolution of a common problem, will help organize future research.

Definitions
• Goal: "a cognitive representation of a future object that the organism is committed to approach" (Elliot & Fryer, 2008). In rl, this takes the form of an (embedding, goal-achievement function) pair, see Section 2.2.
• Skill: the association of a goal and a policy to reach it, see Section 3.1.
• Goal-achievement function: a function that measures progress towards a goal (also called goal-conditioned reward function), see Section 2.2.
• Goal-conditioned policy: a function that generates the next action given the current state and the goal, see Section 3.
• Autotelic: from the Greek auto (self) and telos (end, goal), characterizes agents that generate their own goals and learning signals. It is equivalent to intrinsically motivated and goal-conditioned.
Scope of the survey. We are interested in algorithms from the rl-imgep family as algorithmic tools to enable agents to acquire repertoires of skills in an open-ended and self-supervised setting. Externally-motivated goal-conditioned rl approaches do not enable agents to generate their own goals and thus cannot be considered autotelic (imgeps). However, these approaches can often be converted into autotelic rl-imgeps by integrating the goal generation process within the agent. For this reason, we include some rl-emgep approaches when they present interesting mechanisms that can directly be leveraged in autotelic agents.
What is not covered. This survey does not discuss some related but distinct approaches such as multi-task rl (Caruana, 1997), rl with auxiliary tasks (Riedmiller et al., 2018; Jaderberg et al., 2017) and rl with knowledge-based ims (Bellemare et al., 2016; Pathak et al., 2017; Burda et al., 2019). In none of these approaches does the agent represent goals or see its behavior affected by them. The subject of intrinsically motivated goal-conditioned rl also relates to transfer learning and curriculum learning. This survey does not cover transfer learning approaches, but interested readers can refer to Taylor and Stone (2009). It discusses automatic curriculum learning approaches that organize the generation of goals according to the agent's abilities in Section 6 but, for a broader picture on the topic, readers can refer to the recent review of Portelas et al. (2020a). Finally, this survey does not review policy learning methods but only focuses on goal-related mechanisms. Indeed, the choice of mechanisms to learn to represent and select goals is somewhat orthogonal to the algorithms used to learn to achieve them. Since the policy learning algorithms used in rl-imgep architectures do not differ significantly from standard rl and goal-conditioned rl approaches, this survey focuses on goal-related mechanisms specific to rl-imgeps.

Survey organization.
We start by presenting some background on the formalization of rl and multi-goal rl problems and the corresponding algorithms to solve them (Section 2).
We then build on these foundations to formalize the intrinsically motivated skills acquisition problem and propose a computational framework to tackle it: rl-based intrinsically motivated goal exploration processes (Section 3). Once this is done, we organize the surveyed literature along three axes: 1) What are the different types of goal representations? (Section 4); 2) How can we learn goal representations? (Section 5); and 3) How can we prioritize goal selection? (Section 6). We finally close the survey with a discussion of open challenges for developmental reinforcement learning (Section 7).

Background: RL, Multi-Goal RL Problems and Their Solutions
This section presents background information on the rl problem, the multi-goal rl problem and the families of algorithms used to solve them. This will serve as a foundation to define the intrinsically motivated skills acquisition problem and to introduce the rl-based intrinsically motivated goal exploration process framework to solve it (rl-imgep, Section 3).

The Reinforcement Learning Problem
In a reinforcement learning (rl) problem, the agent learns to perform sequences of actions in an environment so as to maximize some notion of cumulative reward (Sutton & Barto, 2018). rl problems are commonly framed as Markov Decision Processes (mdps): M = {S, A, T, ρ0, R} (Sutton & Barto, 2018). The agent and its environment, as well as their interaction dynamics, are defined by the first components {S, A, T, ρ0}, where s ∈ S describes the current state of the agent-environment interaction and ρ0 is the distribution over initial states. The agent can interact with the environment through actions a ∈ A. Finally, the dynamics are characterized by the transition function T, which dictates the distribution of the next state s′ given the current state and action: T(s′ | s, a). The objective of the agent in this environment is defined by the remaining component of the mdp: R. R is the reward function; it computes a reward for any transition: R(s, a, s′). Note that, in a traditional rl problem, the agent only receives the rewards corresponding to the transitions it experiences but does not have access to the function itself. The objective of the agent is to maximize the cumulative reward computed over complete episodes. When computing the aggregation of rewards, we often introduce discounting and give smaller weights to delayed rewards: the discounted return is G = Σ_t γ^t r_t, with γ being a constant discount factor in ]0, 1]. Each instance of an mdp implements an rl problem, also called a task.
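The discounted aggregation described above can be sketched in a few lines; this is a generic illustration, not tied to any particular rl library:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return G = sum_t gamma^t * r_t over one episode."""
    g = 0.0
    # Iterate backwards so each step folds the discounted future return in.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With rewards (1, 0, 1) and gamma = 0.5: G = 1 + 0.5*0 + 0.25*1 = 1.25
```

The backward recursion g ← r + γg is the same computation used when bootstrapping value estimates, which is why it appears throughout rl implementations.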

Defining Goals for Reinforcement Learning
This section takes inspiration from the notion of goal in psychological research to inform the formalization of goals for reinforcement learning.
Goals in psychological research. Reflecting on the origin of the notion of goal and its use in past psychological research, Elliot and Fryer (2008) propose a general definition: "A goal is a cognitive representation of a future object that the organism is committed to approach or avoid" (Elliot & Fryer, 2008).
Because goals are cognitive representations, only animate organisms that represent goals qualify as goal-conditioned. Because this representation relates to a future object, goals are cognitive imaginings of future possibilities: goal-conditioned behavior is proactive, not reactive. Finally, organisms commit to their goals; their behavior is thus influenced directly by this cognitive representation.
Generalized goals for reinforcement learning. rl algorithms seem to be a good fit to train such goal-conditioned agents. Indeed, rl algorithms train learning agents (organisms) to maximize (approach) a cumulative (future) reward (object). In rl, goals can be seen as a set of constraints on one or several consecutive states that the agent seeks to respect. These constraints can be very strict and characterize a single target point in the state space (e.g. image-based goals) or a specific sub-space of the state space (e.g. a target x-y coordinate in a maze, target block positions in manipulation tasks). They can also be more general, for example when expressed by language (e.g. 'find a red object or a wooden one').
To pursue such goals, rl agents must be able to 1) represent them compactly and 2) assess their progress towards them. This is why we propose the following formalization for rl goals: each goal is a pair g = (z_g, R_g) where z_g is a compact goal parameterization, or goal embedding, and R_g is a goal-achievement function measuring progress towards the goal. The set of goal-achievement functions can be represented as a single goal-parameterized, or goal-conditioned, reward function such that R_G(· | z_g) = R_g(·). With this definition we can express a diversity of goals, see Section 4 and Table 1.
The goal-achievement function and the goal-conditioned policy both assign meaning to a goal. The former defines what it means to achieve the goal; it describes what the world looks like when it is achieved. The latter characterizes the process by which this goal can be achieved: what the agent needs to do to achieve it. In this search for the meaning of a goal, the goal embedding can be seen as the map: the agent follows this map and, via the two functions above, experiences the meaning of the goal.

Generalized definition of the goal construct for RL:
• Goal: a pair g = (z_g, R_g) where z_g is a compact goal parameterization, or goal embedding, and R_g is a goal-achievement function.
• Goal-achievement function: R_g, where R_G(· | z_g) = R_g(·) defines the corresponding goal-conditioned reward function.
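As an illustration of this definition, a goal can be sketched as an (embedding, achievement function) pair. The class and the reach-a-target example below are illustrative assumptions, not constructs from the surveyed literature:

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import math

@dataclass
class Goal:
    z_g: tuple                                        # compact goal embedding
    achievement: Callable[[Sequence[float]], float]   # goal-achievement function R_g

def make_reach_goal(target, tol=0.1):
    """Hypothetical example goal: reach a target x-y position. The reward is
    1.0 when the state lies within `tol` of the target, 0.0 otherwise."""
    z = tuple(target)
    return Goal(z_g=z, achievement=lambda s: float(math.dist(s, z) <= tol))

g = make_reach_goal([1.0, 2.0])
# g.achievement([1.0, 2.05]) -> 1.0 (state within tolerance of the target)
# g.achievement([0.0, 0.0])  -> 0.0 (state far from the target)
```

Here z_g is the target coordinates themselves; Section 4 surveys far richer embedding choices (images, language), but the pair structure stays the same.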

The Multi-Goal Reinforcement Learning Problem
By replacing the unique reward function R with the space of reward functions R_G, rl problems can be extended to handle multiple goals: M = {S, A, T, ρ0, R_G}. The term goal should not be mistaken for the term task, which refers to a particular mdp instance. As a result, multi-task rl refers to rl algorithms that tackle a set of mdps that can differ by any of their components (e.g. T, R, ρ0, etc.). The multi-goal rl problem can thus be seen as the particular case of the multi-task rl problem where mdps differ only by their reward functions. In the standard multi-goal rl problem, the set of goals, and thus the set of reward functions, is pre-defined by engineers. The experimenter sets goals for the agent and provides the associated reward functions.
rl algorithms use transitions collected via interactions between the agent and its environment (s, a, s′, R(s, a, s′)) to train a policy π: a function generating the next action a based on the current state s so as to maximize a cumulative function of rewards. Deep rl (drl) is the extension of rl algorithms that leverages deep neural networks as function approximators to represent policies, reward and value functions. It has been powering most recent breakthroughs in rl (Eysenbach et al., 2019; Warde-Farley et al., 2019; Florensa et al., 2018; Pong et al., 2020; Lynch & Sermanet, 2020; Hill et al., 2020b, 2021; Abramson et al., 2020; Colas et al., 2020a; Stooke et al., 2021).
This survey focuses on goal-related mechanisms that are mostly orthogonal to the choice of underlying optimization algorithm. In practice, however, most of the research in that space uses drl methods.

Solving the Multi-Goal RL Problem with Goal-Conditioned RL Algorithms
Goal-conditioned agents see their behavior affected by the goal they pursue. This is formalized via goal-conditioned policies, that is, policies which produce actions based on the environment state and the agent's current goal: Π : S × Z_G → A, where Z_G is the space of goal embeddings corresponding to the goal space G (Schaul et al., 2015). Note that ensembles of policies can also be formalized this way, via a meta-policy Π that retrieves a particular policy from a one-hot goal embedding z_g (e.g. Kaelbling, 1993; Sutton et al., 2011).
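A common way to condition a policy on a goal, as in uvfa-style approaches, is to concatenate the state with the goal embedding before feeding a function approximator. Below is a minimal sketch with a stand-in linear map; in practice any neural network plays this role:

```python
def goal_conditioned_policy(weights, state, z_g):
    """Pi : S x Z_G -> A. The goal embedding z_g is concatenated with the
    state; `weights` stands in for any learned function approximator."""
    x = list(state) + list(z_g)  # concatenated (state, goal) input
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Two action dimensions, here reading the first state and first goal entries:
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
action = goal_conditioned_policy(weights, state=[2.0, 3.0], z_g=[5.0, 7.0])
# action == [2.0, 5.0]: the same state maps to different actions under other goals
```

The key property is that a single set of parameters serves every goal, which is what enables transfer and generalization across goals.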
The idea of using a unique rl agent to target multiple goals dates back to Kaelbling (1993). Later, the horde architecture proposed to use interaction experience to update one value function per goal, effectively transferring to all goals the knowledge acquired while aiming at a particular one (Sutton et al., 2011). In these approaches, one policy is trained for each goal and the data collected by one can be used to train others.
Building on these early results, Schaul et al. (2015) introduced Universal Value Function Approximators (uvfas). They proposed to learn a unique goal-conditioned value function and goal-conditioned policy to replace the set of value functions learned in horde. Using neural networks as function approximators, they showed that uvfas enable transfer between goals and demonstrate strong generalization to new goals.
The idea of hindsight learning further improves knowledge transfer between goals (Kaelbling, 1993; Andrychowicz et al., 2017). Learning in hindsight, agents can reinterpret a past trajectory collected while pursuing a given goal in the light of a new goal. By asking themselves "what is the goal for which this trajectory is optimal?", they can use an originally failed trajectory as an informative trajectory to learn about another goal, thus making the most out of every trajectory (Eysenbach et al., 2020). This ability dramatically increases the sample efficiency of goal-conditioned algorithms and is arguably an important driver of the recent interest in goal-conditioned rl approaches.
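Hindsight relabelling can be sketched as follows, using the "final" strategy (relabel the trajectory with the goal achieved in its last state). The function interfaces are illustrative assumptions, not the exact formulation of Andrychowicz et al. (2017):

```python
def hindsight_relabel(trajectory, achieved_goal, reward_fn):
    """Relabel a trajectory with a goal it actually achieved (HER-style sketch).
    trajectory: list of (s, a, s_next) triples;
    achieved_goal(s): goal embedding achieved in state s;
    reward_fn(s_next, z_g): goal-conditioned reward."""
    z_new = achieved_goal(trajectory[-1][2])  # goal reached at episode end
    return [(s, a, s_next, z_new, reward_fn(s_next, z_new))
            for (s, a, s_next) in trajectory]

# A 2-step trajectory ending in state (2,): relabelled with z_g = (2,),
# its last transition now carries a positive learning signal.
traj = [((0,), 0, (1,)), ((1,), 1, (2,))]
relabelled = hindsight_relabel(traj, lambda s: s, lambda s, z: float(s == z))
# relabelled[-1] == ((1,), 1, (2,), (2,), 1.0)
```

Even if the original goal was never reached, the relabelled transitions yield non-zero rewards, which is what makes every trajectory informative in sparse-reward settings.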

The Intrinsically Motivated Skills Acquisition Problem and the RL-IMGEP Framework
This section builds on the multi-goal rl problem to formalize the intrinsically motivated skills acquisition problem, in which goals are not externally provided to the agents but must be represented and generated by them (Section 3.1). It then discusses how to evaluate competency in such an open problem (Section 3.2). Finally, we propose an extension of the goal-conditioned rl framework to tackle this problem: the rl-based intrinsically motivated goal exploration process framework (rl-imgep, Section 3.3).

The Intrinsically Motivated Skills Acquisition Problem
In the intrinsically motivated skills acquisition problem, the agent is set in an open-ended environment without any pre-defined goal and needs to acquire a repertoire of skills. Here, a skill is defined as the association of a goal embedding z_g and the policy Π_g to reach it. A repertoire of skills is thus defined as the association of a repertoire of goals G with a goal-conditioned policy Π_G trained to reach them. The intrinsically motivated skills acquisition problem can now be modeled by a reward-free mdp M = {S, A, T, ρ0} that only characterizes the agent, its environment and their possible interactions. Just like children, agents must be autotelic, i.e. they should learn to represent, generate, pursue and master their own goals.

Evaluating RL-IMGEP Agents
Evaluating agents is often trivial in reinforcement learning. Agents are trained to maximize one or several pre-coded reward functions; the set of possible interactions is known in advance. One can measure generalization abilities by computing the agent's success rate on a held-out set of testing goals. One can measure exploration abilities via several metrics such as the count of task-specific state visitations.
In contrast, autotelic agents evolve in open-ended environments and learn to represent and form their own set of skills. In this context, the space of possible behaviors might quickly become intractable for the experimenter, which is perhaps the most interesting feature of such agents. For these reasons, designing evaluation protocols is not trivial.
The evaluation of such systems raises similar difficulties as the evaluation of task-agnostic content generation systems like Generative Adversarial Networks (gans) (Goodfellow et al., 2014) or self-supervised language models (Devlin et al., 2019; Brown et al., 2020). In both cases, learning is task-agnostic and it is often hard to compare models in terms of their outputs (e.g. comparing the quality of gan output images, or comparing output repertoires of skills in autotelic agents).
One can also draw a parallel with the debate on the evaluation of open-ended systems in the field of open-ended evolution (Hintze, 2019; Stanley & Soros, 2016; Stanley, 2019). In both cases, a good system is expected to generate more and more original solutions, such that its output cannot be predicted in advance. But what does original mean, precisely? Stanley and Soros (2016) argue that subjectivity has a role to play in the evaluation of open-ended systems. Indeed, the notion of interestingness is tightly coupled with that of open-endedness. What we expect from our open-ended systems, and from our rl-imgep agents in particular, is to generate more and more behaviors that we deem interesting. This is probably why the evaluation of content generators often includes human studies. Our end objective is to generate artefacts that are interesting to us; we thus need to evaluate open-ended processes ourselves, subjectively.
Ideally, we would interact with trained rl-imgep agents directly, setting them goals and testing their abilities. The evaluation would need to adapt to the agent's capabilities: as the quote often attributed to Einstein goes, "if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid." rl-imgep agents need to be evaluated by humans looking for their areas of expertise, assessing the width and depth of their capacities in the world they were trained in. This said, science also requires more objective evaluation metrics to facilitate the comparison of existing methods and enable progress. Let us list some evaluation methods measuring the competency of agents via proxies:
• Measuring exploration: one can compute task-agnostic exploration proxies such as the entropy of the visited-state distribution, or measures of state coverage (e.g. coverage of the high-level x-y state space in mazes) (Florensa et al., 2018). Exploration can also be measured as the number of interactions from a set of interesting interactions defined subjectively by the experimenter (e.g. interactions with objects in Colas et al., 2020a).
• Measuring generalization: the experimenter can subjectively define a set of relevant target goals and prevent the agent from training on them. Evaluating agents on this held-out set at test time provides a measure of generalization (Ruis et al., 2020), although it is biased towards what the experimenter deems relevant goals.
• Measuring transfer learning: the intrinsically motivated exploration of the environment can be seen as a pre-training phase to bootstrap learning in a subsequent downstream task. In the downstream task, the agent is trained to achieve externally-defined goals and we report its performance and learning speed on these goals. This is akin to the evaluation of self-supervised language models, where the reported metrics evaluate performance on various downstream tasks (e.g. Brown et al., 2020). In this evaluation setup, autotelic agents can be compared to task-specific agents. Ideally, autotelic agents should benefit from their open-ended learning process to outperform task-specific agents on the latter's own tasks. This said, performance on downstream tasks remains an evaluation proxy and should not be seen as the explicit objective of the skill discovery phase. Indeed, in humans, skill discovery processes do not target any specific future task, but emerged from a natural evolutionary process maximizing reproductive success; see a discussion in Singh et al. (2010).
• Opening the black box: investigating internal representations learned during intrinsically motivated exploration is often informative. One can investigate properties of the goal generation system (e.g. does it generate out-of-distribution goals?) or properties of the goal embeddings (e.g. are they disentangled?). One can also look at the learning trajectories of the agents across training, especially when they implement their own curriculum learning (e.g. Florensa et al., 2018; Colas et al., 2019; Blaes et al., 2019; Pong et al., 2020; Akakzia et al., 2021).
• Measuring robustness: autonomous learning agents evolving in open-ended environments should be robust to a variety of properties that can be found in the real world. This includes very large environments, where possible interactions might vary in difficulty (trivial interactions, impossible interactions, interactions whose result is stochastic and thus prevents any learning progress). Environments can also include distractors (e.g. non-controllable objects) and various forms of non-stationarity. Evaluating learning algorithms in environments presenting each of these properties allows one to assess their ability to solve the corresponding challenges.
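The entropy-based exploration proxy from the first bullet above can be sketched as follows; the discretization bin size and the function interface are illustrative choices, not a standard from the literature:

```python
import math
from collections import Counter

def visitation_entropy(visited_states, bin_size=1.0):
    """Task-agnostic exploration proxy: Shannon entropy of the discretized
    visited-state distribution; higher means more uniform state coverage."""
    # Discretize each state into a grid cell, then count cell visitations.
    bins = Counter(tuple(int(x // bin_size) for x in s) for s in visited_states)
    n = sum(bins.values())
    return -sum((c / n) * math.log(c / n) for c in bins.values())

# An agent stuck in a single discretization bin scores 0 (minimum entropy);
# a uniform spread over k bins scores log(k) (maximum entropy).
```

Coverage-style metrics follow the same recipe: replace the entropy by the fraction of grid cells visited at least once.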

RL-Based Intrinsically Motivated Goal Exploration Processes
Until recently, the imgep family was powered by population-based algorithms (pop-imgep).
The emergence of goal-conditioned rl approaches that generate their own goals gave birth to a new type of imgep: the rl-based imgeps (rl-imgeps). This section builds on traditional rl and goal-conditioned rl algorithms to give a general definition of intrinsically motivated goal-conditioned rl algorithms (rl-imgeps).
rl-imgeps are intrinsically motivated versions of goal-conditioned rl algorithms. They need to be equipped with mechanisms to represent and generate their own goals in order to solve the intrinsically motivated skills acquisition problem, see Figure 2. Concretely, this means that, in addition to the goal-conditioned policy, they need to learn: 1) to represent goals g by compact embeddings z_g; 2) to represent the support of the goal distribution, also called the goal space: Z_G = {z_g} for g ∈ G; 3) a goal distribution D(z_g) from which targeted goals are sampled; 4) a goal-conditioned reward function R_G. In practice, only a few architectures tackle all four learning problems.
In this survey, we call autotelic any architecture in which the agent selects its own goals (learning problem 3). Simple autotelic agents assume pre-defined goal representations (1), a pre-defined support of the goal distribution (2) and pre-defined goal-conditioned reward functions (4). As autotelic architectures tackle more of the four learning problems, they become more and more advanced. As we will see in the following sections, many existing works in goal-conditioned rl can be formalized as autotelic agents by including goal sampling mechanisms within the definition of the agent.
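The four learning problems can be laid out as an interface, with the simplest autotelic case (pre-defined goal space and reward, learned goal selection only) filled in. All class and method names here are hypothetical, chosen for illustration:

```python
import random

class AutotelicAgent:
    """Interface mirroring the four RL-IMGEP learning problems."""
    def embed_goal(self, outcome):         # (1) compact goal embeddings z_g
        raise NotImplementedError
    def goal_space(self):                  # (2) support Z_G of the goal distribution
        raise NotImplementedError
    def sample_goal(self):                 # (3) goal-sampling distribution D(z_g)
        raise NotImplementedError
    def reward(self, s, a, s_next, z_g):   # (4) goal-conditioned reward R_G
        raise NotImplementedError

class SimpleAutotelicAgent(AutotelicAgent):
    """Simplest autotelic case: problems (1), (2) and (4) are pre-defined,
    the agent only selects its own goals (learning problem 3)."""
    def __init__(self, goals):
        self._goals = list(goals)          # pre-defined discrete goal space
    def goal_space(self):
        return self._goals
    def sample_goal(self):
        return random.choice(self._goals)  # uniform goal selection
    def reward(self, s, a, s_next, z_g):
        return float(s_next == z_g)        # sparse goal-achievement reward
```

More advanced architectures would replace the pre-defined pieces with learned ones, e.g. a learned embedding function for (1) or a curriculum-driven sampler for (3), without changing the interface.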
With a developmental perspective, one can reinterpret existing work through the autotelic rl framework. Let us take an example. The agent 57 algorithm automatically selects a parameter balancing the intrinsic and extrinsic rewards of the agent at the beginning of each training episode (Badia et al., 2020a). The authors do not mention the concept of goal but instead present this mechanism as a form of reward shaping technique independent from the agent. With a developmental perspective, one can interpret the mixing parameter as a goal embedding. Placing the sampling mechanism within the boundaries of the agent, agent 57 becomes autotelic: it is intrinsically motivated to sample and target its own goals, i.e. to define its own reward functions (here mixtures of intrinsic and extrinsic reward functions).
Algorithm 1 details the pseudo-code of rl-imgep algorithms. Starting from randomly initialized modules and memory, rl-imgep agents enter a standard rl interaction loop. They first observe the context (initial state), then sample a goal from their goal sampling policy. Then starts the proper interaction. Conditioned on their current goal embedding, they act in the world so as to reach their goal, i.e. to maximize the cumulative rewards generated by the goal-conditioned reward function. After the interaction, the agent can update all its internal models. It learns to represent goals by updating its goal embedding function and goal-conditioned reward function, and improves its behavior towards them by updating its goal-conditioned policy. This survey focuses on the mechanisms specific to rl-imgep agents, i.e. mechanisms that handle the representation, generation and selection of goals. These mechanisms are mostly orthogonal to the question of how to reach the goals themselves, which often relies on existing goal-conditioned algorithms, but can also be powered by imitation learning, evolutionary algorithms or other control and planning methods. Section 4 first presents a typology of goal representations used in the literature, before Sections 5 and 6 cover existing methods to learn to represent and prioritize goals respectively.

Algorithm 1 (excerpt, update steps):
10: Perform hindsight relabelling on {(s, a, s′, z_g)}_B.
11: Compute internal rewards r = R_G(s, a, s′ | z_g).
12: Update policy Π_G via rl on {(s, a, s′, z_g, r)}_B.
13: Update goal representations Z_G.
14: Update goal-conditioned reward function R_G.
15: Update goal sampling policy GS.
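The interaction-and-update loop of Algorithm 1 can be sketched in code. The following is a minimal illustration only, assuming a toy 2D point environment, a hand-coded stand-in for the learned goal-conditioned policy, and a pre-defined goal space and sparse reward function; all names are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_fn(state, z_g, eps=0.05):
    """Sparse goal-conditioned reward R_G: 1 if the goal is reached."""
    return float(np.linalg.norm(state - z_g) < eps)

def sample_goal():
    """Goal sampling policy GS: here, uniform over a pre-defined 2D goal space."""
    return rng.uniform(0.0, 1.0, size=2)

def policy(state, z_g, step=0.1):
    """Stand-in for the goal-conditioned policy: a noisy step toward the goal."""
    direction = z_g - state
    norm = np.linalg.norm(direction)
    if norm > 1e-8:
        direction = direction / norm
    return step * direction + rng.normal(0.0, 0.02, size=2)

replay = []
for episode in range(20):
    state = np.zeros(2)            # observe context (initial state)
    z_g = sample_goal()            # sample a goal embedding
    trajectory = []
    for t in range(30):            # roll out the goal-conditioned policy
        action = policy(state, z_g)
        next_state = np.clip(state + action, 0.0, 1.0)
        trajectory.append((state, action, next_state, z_g))
        state = next_state
    # Hindsight relabelling (step 10): also store each transition relabelled
    # with the final achieved state as if it had been the targeted goal.
    achieved = trajectory[-1][2]
    for (s, a, s2, g) in trajectory:
        replay.append((s, a, s2, g, reward_fn(s2, g)))
        replay.append((s, a, s2, achieved, reward_fn(s2, achieved)))

# Relabelled transitions guarantee some positive rewards for the policy update.
positives = sum(t[4] for t in replay)
print(len(replay), positives > 0)
```

In a complete rl-imgep agent, the policy, reward function, goal representations and goal sampling policy would all be updated from this replay buffer (steps 12-15 of Algorithm 1).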

A Typology of Goal Representations in the Literature
Now that we have defined the problem of interest and the overall framework to tackle it, we can start reviewing relevant approaches from the literature and how they fit in this framework. This section presents a typology of the different kinds of goal representations found in the literature. Each goal is represented by a pair: 1) a goal embedding and 2) a goal-conditioned reward function. Figure 3 also provides visuals of the main environments used by the autotelic approaches presented in this paper.

Goals as Choices Between Multiple Objectives
Goals can be expressed as a list of different objectives the agent can choose from.
Goal embedding. In that case, goal embeddings z_g are one-hot encodings of the current objective being pursued among the N available objectives: z_g^i is the i-th one-hot vector. This is the case in Oh et al. (2017), Mankowitz et al. (2018) and Codevilla et al. (2018).

Reward function. The goal-conditioned reward function is a collection of N objective-specific reward functions, one per goal embedding z_g^i. In Mankowitz et al. (2018) and Chan et al. (2019), each reward function gives a positive reward when the agent reaches the corresponding object: reaching guitars and keys in the first case, monsters and torches in the second.

Goals as Target Features of States
Goals can be expressed as target features of the state the agent desires to achieve.

Reward function.
For this type of goal, the reward function R_G is based on a distance metric D. One can define a dense reward as inversely proportional to the distance between features of the current state and the target goal embedding: R_g = R_G(s | z_g) = −α × D(φ(s), z_g) (e.g. Nair et al., 2018b). The reward can also be sparse: positive whenever that distance falls below a pre-defined threshold: R_G(s | z_g) = 1 if D(φ(s), z_g) < ϵ, 0 otherwise.
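The two reward forms above can be written down directly. This is a minimal sketch assuming a Euclidean distance metric and illustrative values for α, ϵ and the feature map φ.

```python
import numpy as np

def dense_reward(state_features, z_g, alpha=1.0):
    """Dense reward: negative scaled distance between phi(s) and z_g."""
    return -alpha * np.linalg.norm(state_features - z_g)

def sparse_reward(state_features, z_g, eps=0.1):
    """Sparse reward: 1 when the distance falls below the threshold eps."""
    return float(np.linalg.norm(state_features - z_g) < eps)

phi_s = np.array([0.5, 0.5])   # features of the current state, phi(s)
z_g = np.array([0.55, 0.5])    # target goal embedding

print(dense_reward(phi_s, z_g))   # close to 0: nearly at the goal
print(sparse_reward(phi_s, z_g))  # within the threshold
```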

Goals as Abstract Binary Problems
Some goals cannot be expressed as target state features but can be represented as binary problems, where each goal expresses a set of constraints on the state (or trajectory) such that these constraints are either verified or not (binary goal achievement).

Goal embeddings.
In binary problems, goal embeddings can be any expression of the set of constraints that the state should respect. Akakzia et al. (2021) and Ecoffet et al. (2021) both propose a pre-defined discrete state representation. These representations lie in a finite embedding space so that goal completion can be asserted when the current embedding φ(s) equals the goal embedding z_g. Another way to express sets of constraints is via language-based predicates. A sentence describes the constraints expressed by the goal, and the state or trajectory either verifies them or does not (Hermann et al., 2017; Chan et al., 2019; Jiang et al., 2019; Bahdanau et al., 2019a, 2019b; Hill et al., 2020a; Cideron et al., 2020; Colas et al., 2020a; Lynch & Sermanet, 2020); see Luketina et al. (2019) for a recent review. Language can easily characterize generic goals such as "grow any blue object" (Colas et al., 2020a), relational goals like "sort objects by size" (Jiang et al., 2019) or "put the cylinder in the drawer" (Lynch & Sermanet, 2020), or even sequential goals like "open the yellow door after you open a purple door" (Chevalier-Boisvert et al., 2019). When goals can be expressed by language sentences, goal embeddings z_g are usually language embeddings learned jointly with either the policy or the reward function. Note that, although rl goals always express constraints on the state, we can imagine time-extended goals where constraints are expressed on the trajectory (see a discussion in Section 7.1).

Reward function.
The reward function of a binary problem can be viewed as a binary classifier that evaluates whether a state s (or trajectory τ) verifies the constraints expressed by the goal semantics (positive reward) or not (null reward). This binary classification setting has been directly implemented as a way to learn language-based goal-conditioned reward functions R_G(s | z_g) in Bahdanau et al. (2019a) and Colas et al. (2020a). Alternatively, the setup described in Colas et al. (2020) turns binary problems expressed by language-based goals into goals as specific target features. To this end, they train a language-conditioned goal generator that produces specific target features verifying the constraints expressed by the binary problem. As a result, this setup can use a distance-based metric to evaluate the fulfillment of a binary goal.
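The reward-as-classifier idea can be sketched in a few lines. This is a toy illustration, not the architectures cited above: the "goal" is the hypothetical constraint x > 0.5 on a 1-D state, and the classifier is a logistic regression trained by gradient descent on labelled outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy binary problem: states labelled 1 when they verify the constraint x > 0.5.
states = rng.uniform(0, 1, size=(200, 1))
labels = (states[:, 0] > 0.5).astype(float)

# Logistic-regression reward model standing in for R_G(s | z_g),
# trained by gradient descent on the cross-entropy loss.
w, b = 0.0, 0.0
for _ in range(2000):
    p = sigmoid(w * states[:, 0] + b)
    grad = p - labels
    w -= 0.1 * np.mean(grad * states[:, 0])
    b -= 0.1 * np.mean(grad)

def reward(s):
    """Binary reward: positive when the classifier judges the goal achieved."""
    return float(sigmoid(w * s + b) > 0.5)

print(reward(0.9), reward(0.1))
```

In the cited works, the classifier is conditioned on a learned language embedding z_g rather than fixed to a single constraint.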

Goals as a Multi-Objective Balance
Some goals can be expressed not as desired regions of the state or trajectory space, but as more general objectives that the agent should maximize. In that case, goals can parameterize a particular mixture of multiple objectives.

Goal embeddings.
Here, goal embeddings are simply sets of weights balancing the different objectives: z_g = (β_i)_{i=[1..N]}, where β_i is the weight applied to objective i and N is the number of objectives. Note that, when β_j = 1 and β_i = 0 for all i ≠ j, the agent can decide to pursue any objective alone. In Never Give Up, for example, rl agents are trained to maximize a mixture of extrinsic and intrinsic rewards (Badia et al., 2020b). The agent can select the mixing parameter β, which can be viewed as a goal. Building on this approach, agent 57 adds control of the discount factor, effectively controlling the rate at which rewards are discounted as time goes by (Badia et al., 2020a).

Reward function.
When goals are represented as a balance between multiple objectives, the associated reward function can be represented neither as a distance metric nor as a binary classifier. Instead, the agent needs to maximize a convex combination of the objectives: R_g = Σ_{i=1}^{N} β_i R_i, where z_g = (β_i)_{i=[1..N]} is the set of weights.
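A convex-combination reward is a one-liner; the sketch below illustrates it with two hypothetical objective rewards (e.g. an extrinsic and an intrinsic one), normalizing the weights so they sum to one.

```python
import numpy as np

def mixed_reward(objective_rewards, z_g):
    """Goal-conditioned reward as a convex combination of N objectives:
    R_g = sum_i beta_i * R_i, with z_g = (beta_1, ..., beta_N)."""
    betas = np.asarray(z_g, dtype=float)
    betas = betas / betas.sum()           # normalize to a convex combination
    return float(np.dot(betas, objective_rewards))

r_objectives = np.array([1.0, -0.5])      # e.g. extrinsic and intrinsic rewards
print(mixed_reward(r_objectives, [1.0, 0.0]))   # -> 1.0: objective 1 alone
print(mixed_reward(r_objectives, [0.5, 0.5]))   # -> 0.25: balanced mixture
```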

Goal-Conditioning
Now that we have described the different types of goal embeddings found in the literature, there remains the question of how to condition the agent's behavior, i.e. the policy, on them. Originally, the uvfa framework proposed to concatenate the goal embedding to the state representation to form the policy input. Recently, other mechanisms have emerged. When language-based goals were introduced, Chaplot et al. (2018) proposed the gated-attention mechanism, where the state features are linearly scaled by attention coefficients computed from the goal representation φ(z_g): input = s ⊙ φ(z_g), where ⊙ is the Hadamard product. Later, the Feature-wise Linear Modulation (film) approach (Perez et al., 2018) generalized this principle to affine transformations: input = s ⊙ φ(z_g) + ψ(z_g). Alternatively, Andreas et al. (2016) proposed Neural Module Networks, a mechanism that leverages the linguistic structure of goals to derive a symbolic program that defines how states should be processed (Bahdanau et al., 2019a).
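The gated-attention and film conditioning schemes can be sketched side by side. This is a minimal numpy illustration where random linear maps stand in for the learned networks φ and ψ; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

state = rng.normal(size=8)        # state features s
z_g = rng.normal(size=4)          # goal embedding z_g

# Illustrative linear maps standing in for the learned networks phi and psi.
W_phi = rng.normal(size=(8, 4))
W_psi = rng.normal(size=(8, 4))

def gated_attention(s, z):
    """Chaplot et al. (2018): scale state features by goal-derived gates."""
    gates = 1.0 / (1.0 + np.exp(-W_phi @ z))   # sigmoid attention coefficients
    return s * gates                            # Hadamard product s ⊙ phi(z_g)

def film(s, z):
    """FiLM (Perez et al., 2018): goal-conditioned affine transformation."""
    return s * (W_phi @ z) + W_psi @ z          # s ⊙ phi(z_g) + psi(z_g)

print(gated_attention(state, z_g).shape, film(state, z_g).shape)
```

Both produce a goal-modulated feature vector of the same shape as the state features, which is then fed to the policy network.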

Conclusion
This section presented a diversity of goal representations, corresponding to a diversity of reward function architectures. However, we believe this represents only a small fraction of the diversity of goal types that humans pursue. Section 7 discusses other goal representations that rl algorithms could target.

How to Learn Goal Representations?
The previous section discussed various types of goal representations. Autotelic agents actually need to learn these goal representations. While individual goals are represented by their embeddings and associated reward functions, representing multiple goals also requires representing the support of the goal space, i.e. the collection of valid goals that the agent can sample from, see Figure 2. This section reviews different approaches from the literature.

Assuming Pre-Defined Goal Representation
Most approaches tackle the multi-goal rl problem, where goal spaces and associated rewards are pre-defined by the engineer and are part of the task definition. Navigation and manipulation tasks, for example, pre-define goal spaces (e.g. target agent positions and target block positions respectively) and use the Euclidean distance to compute rewards (Schaul et al., 2015; Andrychowicz et al., 2017; Nair et al., 2018a; Plappert et al., 2018; Florensa et al., 2018; Colas et al., 2019; Blaes et al., 2019; Lanier et al., 2019; Ding et al., 2019; Li et al., 2020). Akakzia et al. (2021) and Ecoffet et al. (2021) hand-define abstract state representations and provide positive rewards when these match target goal representations. Finally, Stooke et al. (2021) hand-define a large combinatorial goal space, where goals are Boolean formulas of predicates such as being near, on, seeing, and holding, as well as their negations, with arguments taken as entities such as objects, players, and floors in procedurally-generated multi-player worlds. In all these works, goals can only be sampled from a pre-defined bounded space. This falls short of solving the intrinsically motivated skills acquisition problem. The next sub-section investigates how goal representations can be learned.

Learning Goal Embeddings
Some approaches assume the pre-existence of a goal-conditioned reward function, but learn to represent goals by learning goal embeddings. This is the case of language-based approaches, which receive rewards from the environment (thus are rl-emgep), but learn goal embeddings jointly with the policy during policy learning (Hermann et al., 2017; Chan et al., 2019; Jiang et al., 2019; Bahdanau et al., 2019b; Hill et al., 2020a; Cideron et al., 2020; Lynch & Sermanet, 2020). When goals are target images, goal embeddings can be learned via generative models of states, assuming the reward to be a fixed distance metric computed in the embedding space (Nair et al., 2018b; Florensa et al., 2019; Pong et al., 2020; Nair et al., 2020).

(Figure 3 caption: Toy Envs. are used to investigate and visualise goal-as-state coverage over 2D worlds; Hard-Exploration Envs. are used to benchmark goal generation algorithms; Object Manipulation Envs. allow for the study of the diversity of learned goals as well as curriculum learning; Interactive Envs. permit representing goals using language and modeling interaction with caregivers; Procedurally Generated Envs. enhance the vastness of potentially reachable goals.)

Learning the Reward Function
A few approaches go even further and learn their own goal-conditioned reward function. Bahdanau et al. (2019a) and Colas et al. (2020a) learn language-conditioned reward functions from an expert dataset or from language descriptions of autonomous exploratory trajectories respectively. However, the agile approach from Bahdanau et al. (2019a) does not generate its own goals.
In the domain of image-based goals, Venkattaramanujam et al. (2019) and Hartikainen et al. (2020) learn a distance metric estimating the square root of the number of steps required to move from any state s1 to any state s2, and generate internal signals to reward agents for getting closer to their target goals. Warde-Farley et al. (2019) learn a similarity metric in the space of controllable aspects of the environment that is based on a mutual information objective between the state and the goal state s_g. Wu et al. (2019) compute a distance metric representing the ability of the agent to reach one state from another using the Laplacian of the transition dynamics graph, where nodes are states and edges are actions. More precisely, they use the eigenvectors of the Laplacian matrix of the graph given by the states of the environment as a basis to compute the L2 distance towards a goal configuration.
Another way to learn reward functions and their associated skills is via empowerment methods (Mohamed & Rezende, 2015; Gregor et al., 2016; Achiam et al., 2018; Eysenbach et al., 2019; Dai et al., 2020; Sharma et al., 2020; Choi et al., 2021). Empowerment methods aim at maximizing the mutual information between the agent's actions or goals and its experienced states. Recent methods train agents to develop a set of skills leading to maximally different areas of the state space. Agents are rewarded for experiencing states that are easy to discriminate, while a discriminator is trained to better infer the skill z_g from the visited states. This discriminator acts as a skill-specific reward function.
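The discriminator-as-reward idea can be sketched in the style of diayn (Eysenbach et al., 2019), where the intrinsic reward is log q(z|s) − log p(z) with p(z) uniform. In this toy illustration the discriminator q(z|s) is a fixed softmax classifier over a 1-D state space; in the actual methods it is trained jointly with the policy, and the setup here (regions, temperatures) is purely hypothetical.

```python
import numpy as np

N_SKILLS = 3

# Toy setup: each skill z leads the agent to a different region of a 1-D
# state space, located around these centers.
centers = np.array([-1.0, 0.0, 1.0])

def q_z_given_s(s):
    """Discriminator q(z|s): softmax over skills, peaked for the nearest region."""
    logits = -(s - centers) ** 2 / 0.1
    e = np.exp(logits - logits.max())
    return e / e.sum()

def intrinsic_reward(s, z):
    """DIAYN-style reward: log q(z|s) - log p(z), with p(z) uniform."""
    return np.log(q_z_given_s(s)[z]) - np.log(1.0 / N_SKILLS)

# A state near skill 2's region is rewarded under skill 2, penalized under 0.
print(intrinsic_reward(0.95, 2) > 0, intrinsic_reward(0.95, 0) < 0)
```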
All these methods set their own goals and learn their own goal-conditioned reward function. For these reasons, they can be considered as complete autotelic rl algorithms.

Learning the Support of the Goal Distribution
The previous sections reviewed several approaches to learn goal embeddings and reward functions. To represent collections of goals, one also needs to represent the support of the goal distribution, i.e. which embeddings correspond to valid goals and which do not.
The option framework (Sutton et al., 1999; Precup, 2000a) proposes to train a high-level policy to compose sequences of behaviors originating from learned low-level policies called options. Each option can be seen as a goal-directed policy where the goal embedding is represented by its index in the set of options. When options are policies aiming at specific states, option discovery methods learn the support of the goal space: they learn which goal-states are most useful to organize higher-level behaviors. Bottleneck states are often targeted as good sub-goals. McGovern and Barto (2001) propose to detect states that are common to multiple successful trajectories. Simsek and Barto (2004) propose to select states with maximal relative novelty, i.e. when the average novelty of following states is higher than the average novelty of previous ones. Simsek and Barto (2008) propose to leverage measures from graph theory.
The option-critic framework then opened the way to a wealth of new approaches (Bacon et al., 2017). Among those, methods based on successor features (Barreto et al., 2017, 2020; Ramesh et al., 2019) propose to learn the option space using reward embeddings. With successor features, the Q-value of a goal can be expressed as a linear combination of learned reward features, efficiently decoupling the rewards from the environmental dynamics. In a multi-goal setting, these methods pair each goal with a reward embedding and use generalized policy improvement to train a set of policies that efficiently share relevant reward features across goals. These methods provide key mechanisms to learn to discover and represent sub-goals. However, they do not belong to the rl-imgep family since high-level goals are externally provided. Some approaches use the set of previously experienced representations to form the support of the goal distribution (Veeriah et al., 2018; Akakzia et al., 2021; Ecoffet et al., 2021). In Florensa et al. (2018), a Generative Adversarial Network (gan) is trained on past representations of states φ(s) to model a distribution of goals and thus its support. In the same vein, approaches handling image-based goals usually train a generative model of image states based on Variational Auto-Encoders (vae) to model goal distributions and supports (Nair et al., 2018b; Pong et al., 2020; Nair et al., 2020). In both cases, valid goals are the ones generated by the generative model.
We saw that the support of valid goals can be pre-defined, a simple set of past representations, or approximated by a generative model trained on these. In all cases, the agent can only sample goals within the convex hull of previously encountered goals (in representation space). We say that goals are within the training distribution. This drastically limits exploration and the discovery of new behaviors.
Children, on the other hand, can imagine creative goals. Pursuing these goals is thought to be the main driver of exploratory play in children (Chu & Schulz, 2020). This is made possible by the compositionality of language, where sentences can easily be combined to generate new ones. The imagine algorithm leverages the creative power of language to generate such out-of-distribution goals (Colas et al., 2020a). The support of valid goals is extended to any combination of language-based goals experienced during training. The authors show that this mechanism augments the generalization and exploration abilities of learning agents.
In Section 6, we discuss how agents can learn to adapt the goal sampling distribution to maximize the learning progress of the agent.

Conclusion
This section presented how previous approaches tackled the problem of learning goal representations. While most approaches rely on pre-defined goal embeddings and/or reward functions, some propose to learn internal reward functions and goal embeddings jointly.

How to Prioritize Goal Selection?
Autotelic agents also need to select their own goals. While goals can be generated by uninformed sampling of the goal space, agents can benefit from mechanisms optimizing goal selection. In practice, this boils down to the automatic adaptation of the goal sampling distribution as a function of the agent's performance.

Automatic Curriculum Learning for Goal Selection
In real-world scenarios, goal spaces can be too large for the agent to master all goals in its lifetime. Some goals might be trivial, others impossible. Some goals might sometimes be reached by chance, although the agent cannot make any progress on them. Some goals might be reachable only after the agent has mastered more basic skills. For all these reasons, it is important to endow autotelic agents learning in open-ended scenarios with the ability to optimize their goal selection mechanism. This ability is a particular case of automatic curriculum learning (acl) applied to goal selection: mechanisms that organize goal sampling so as to maximize the long-term performance improvement (distal objective). As this objective is usually not directly differentiable, curriculum learning techniques usually rely on a proximal objective. In this section, we look at various proximal objectives used in automatic curriculum learning strategies to organize goal selection. Interested readers can refer to Portelas et al. (2020a), which presents a broader review of acl methods for rl. Note that knowledge-based ims can rely on similar proxies but focus on the optimization of the experienced states instead of on the selection of goals (e.g. maximizing next-state prediction errors). A recent review of knowledge-based ims approaches can be found in Linke et al. (2020).
Intermediate or uniform difficulty. Intermediate difficulty has been used as a proxy for long-term performance improvement, following the intuition that focusing on goals of intermediate difficulty results in short-term learning progress that will eventually turn into long-term performance increase. goalgan assigns feasibility scores to goals as the proportion of times the agent successfully reaches them (Florensa et al., 2018). Based on this data, a gan is trained to generate goals of intermediate difficulty, whose feasibility scores are contained within an intermediate range. Sukhbaatar et al. (2018) and Campero et al. (2021) train a goal policy with rl to propose challenging goals to the rl agent. The goal policy is rewarded for setting goals that are neither too easy nor impossible. In the same spirit, Stooke et al. (2021) use a mixture of three criteria to filter valid goals: 1) the agent has a low probability of scoring high; 2) the agent has a high probability of scoring higher than a control policy; 3) the control policy performs poorly. Finally, Zhang et al. (2020) select goals that maximize the disagreement in an ensemble of value functions. Value functions agree when the goals are too easy (the agent is always successful) or too hard (the agent always fails) but disagree for goals of intermediate difficulty. Racanière et al. (2019) propose a variant of the goalgan approach and train a goal generator to sample goals of all levels of difficulty, uniformly. This approach seems to lead to better stability and improved performance on more complex tasks compared to goalgan (Florensa et al., 2018).
Note that measures of intermediate difficulty are sensitive to the presence of stochasticity in the environment. Indeed, goals of intermediate difficulty can be detected as such either because the agent has not yet mastered them, or because the environment sometimes makes them impossible to achieve. In the second case, the agent should not focus on them, because it cannot learn anything new. Estimating medium-term learning progress helps overcome this problem (see below).
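The feasibility-filtering idea behind goalgan can be sketched without the generative model: keep only the goals whose empirical success rate lies in an intermediate band. The goal names and records below are purely hypothetical.

```python
import numpy as np

def goals_of_intermediate_difficulty(success_history, low=0.1, high=0.9):
    """Keep goals whose empirical success rate lies in [low, high],
    in the spirit of GoalGAN's feasibility scores (Florensa et al., 2018)."""
    keep = {}
    for goal, outcomes in success_history.items():
        rate = np.mean(outcomes)
        if low <= rate <= high:
            keep[goal] = rate
    return keep

# Hypothetical per-goal success records (1 = success, 0 = failure).
history = {
    "trivial": [1, 1, 1, 1, 1],       # too easy: filtered out
    "frontier": [1, 0, 1, 0, 0],      # intermediate: kept
    "impossible": [0, 0, 0, 0, 0],    # too hard: filtered out
}
print(sorted(goals_of_intermediate_difficulty(history)))  # only 'frontier' survives
```

In goalgan, a gan is then trained on the kept goals so that new goals of similar difficulty can be generated, rather than resampled from a fixed list.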
Novelty and diversity. Warde-Farley et al. (2019), Pong et al. (2020) and Pitis et al. (2020) all bias the selection of goals towards sparse areas of the goal space. For this purpose, they train density models in the goal space. While Warde-Farley et al. (2019) and Pong et al. (2020) aim at a uniform coverage of the goal space (diversity), Pitis et al. (2020) skew the distribution of selected goals even more, effectively maximizing novelty. Kovač et al. (2020) proposed to enhance these methods with a goal sampling prior focusing goal selection towards controllable areas of the goal space. Finally, Fang et al. (2021) use procedural content generation (pcg) to train a task generator that produces diverse environments in which agents can explore customized skills.
These algorithms have strong connections with empowerment methods (Mohamed & Rezende, 2015; Gregor et al., 2016; Achiam et al., 2018; Eysenbach et al., 2019; Campos et al., 2020; Sharma et al., 2020; Choi et al., 2021). Indeed, the mutual information between goals and states that empowerment methods aim to maximize can be rewritten as: I(Z; S) = H(Z) − H(Z | S). Thus, maximizing empowerment can be seen as maximizing the entropy of the goal distribution while minimizing the entropy of goals given experienced states. Algorithms that both learn to sample diverse goals (H(Z) ↗) and learn to represent goals with variational auto-encoders (H(Z|S) ↘) can be seen as maximizing empowerment. The recent wealth of empowerment methods, however, rarely discusses the link with autotelic agents: they do not mention the notion of goals or goal-conditioned reward functions and do not discuss the problem of goal representations (Gregor et al., 2016; Achiam et al., 2018; Eysenbach et al., 2019; Campos et al., 2020; Sharma et al., 2020). In a recent paper, Choi et al. (2021) investigated these links and formalized a continuum of methods from empowerment to visual goal-conditioned approaches.
While novelty refers to the originality of a reached outcome, diversity is a term that can only be applied to a collection of such outcomes. An outcome is said to be novel if it is semantically different from what exists in the set of known outcomes. A set of outcomes is said to be diverse when outcomes are far from each other and cover the space of possible outcomes well. Note that agents can also express diversity in their behavior towards a unique outcome, a skill known as versatility (Hausman et al., 2018; Kumar et al., 2020; Osa et al., 2021; Celik et al., 2021).
Medium-term learning progress. The idea of using learning progress (lp) as an intrinsic motivation for artificial agents dates back to the 1990s (Schmidhuber, 1991a, 1991b; Kaplan & Oudeyer, 2004; Oudeyer et al., 2007). At that time, however, it was used as a knowledge-based ims and rewarded progress in predictions. From 2007, Oudeyer and Kaplan (2007) suggested using it as a competence-based ims to reward progress in competence instead. In such approaches, agents estimate their lp in different regions of the goal space and bias goal sampling towards areas of high absolute learning progress using bandit algorithms (Baranes & Oudeyer, 2013; Moulin-Frier et al., 2014; Forestier & Oudeyer, 2016; Fournier et al., 2018, 2021; Colas et al., 2019; Blaes et al., 2019; Portelas et al., 2020b; Akakzia et al., 2021). Such estimation attempts to disambiguate the incompetence or uncertainty the agent could resolve with more practice (epistemic) from the one it could not (aleatoric). Agents should indeed focus on goals towards which they can make progress and avoid goals that are either too easy, currently too hard, or impossible. Forestier and Oudeyer (2016), Colas et al. (2019), Blaes et al. (2019) and Akakzia et al. (2021) organize goals into modules and compute average lp measures over modules. Fournier et al. (2018) define goals as a discrete set of precision requirements in a reaching task and compute lp for each requirement value. The use of absolute lp enables agents to focus back on goals for which performance decreases (due to perturbations or forgetting). Akakzia et al. (2021) introduce the success rate in the value optimized by the bandit, v = (1 − sr) × lp, so that agents favor goals with high absolute lp and low competence.
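An lp-based bandit over goal modules can be sketched as follows. This is a toy illustration: lp is estimated as the competence difference between the recent and older halves of each module's history (one of several estimators used in the cited works), and sampling probabilities are made proportional to absolute lp with an epsilon of residual uniform exploration.

```python
import numpy as np

def lp_sampling_probs(competence_history, eps=0.2):
    """Bias module selection toward high absolute learning progress (LP).
    LP is estimated per module as the competence difference between the
    recent and older halves of its history; eps keeps residual exploration."""
    lps = []
    for hist in competence_history:
        half = len(hist) // 2
        lps.append(abs(np.mean(hist[half:]) - np.mean(hist[:half])))
    lps = np.array(lps)
    if lps.sum() == 0:
        probs = np.ones(len(lps)) / len(lps)
    else:
        probs = lps / lps.sum()
    # epsilon-greedy mixture with uniform sampling
    return eps / len(lps) + (1 - eps) * probs

# Hypothetical competence histories for three goal modules:
histories = [
    [1.0, 1.0, 1.0, 1.0],   # mastered: no progress
    [0.1, 0.2, 0.5, 0.7],   # learning: high LP, favored
    [0.0, 0.0, 0.0, 0.0],   # impossible: no progress
]
probs = lp_sampling_probs(histories)
print(np.argmax(probs))  # module 1, the one being learned, is favored
```

Because the estimator uses absolute lp, a module whose competence collapses (forgetting, perturbations) would also regain sampling probability.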

Hierarchical Reinforcement Learning for Goal Sequencing.
Hierarchical reinforcement learning (hrl) can be used to guide the sequencing of goals (Dayan & Hinton, 1993; Sutton et al., 1998, 1999; Precup, 2000b). In hrl, a high-level policy is trained via rl or planning to generate sequences of goals for a lower-level policy so as to maximize a higher-level reward. This makes it possible to decompose tasks with long-term dependencies into simpler sub-tasks. Low-level policies are implemented by traditional goal-conditioned rl algorithms (Levy et al., 2018; Röder et al., 2020) and can be trained independently from the high-level policy (Kulkarni et al., 2016; Frans et al., 2018) or jointly (Levy et al., 2018; Nachum et al., 2018; Röder et al., 2020). In the option framework, options can be seen as goal-directed policies that the high-level policy can choose from (Sutton et al., 1999; Precup, 2000a). In that case, goal embeddings are simple indicators. Most approaches consider hand-defined spaces for the sub-goals (e.g. positions in a maze). Recent approaches propose to use the state space directly (Nachum et al., 2018) or to learn the sub-goal space (e.g. Vezhnevets et al. (2017), or with a generative model of image states in Nasiriany et al. (2019)).

Open Challenges
This section discusses open challenges in the quest for autotelic agents tackling the intrinsically motivated skills acquisition problem.

Challenge #1: Targeting a Greater Diversity of Goals
Section 4 introduced a typology of goal representations found in the literature. The diversity of goal representations seems however limited, compared to the diversity of goals humans target (Ram et al., 1995).

Time-extended goals. All rl approaches reviewed in this paper consider time-specific goals, that is, goals whose completion can be assessed from any state s. This is due to the Markov property requirement, where the next state and reward need to be a function of the previous state only. Time-extended goals, i.e. goals whose completion can be judged by observing a sequence of states (e.g. jump twice), can however be considered by adding time-extended features to the state (e.g. the difference between the current state and the initial state; Colas et al., 2020a). To avoid such ad-hoc state representations, one could imagine using reward function architectures that incorporate forms of memory such as Recurrent Neural Networks (rnn) (Elman, 1993) or Transformers (Vaswani et al., 2017). Although recurrent policies are often used in the literature (Chevalier-Boisvert et al., 2019; Hill et al., 2020a; Loynd et al., 2020; Goyal et al., 2021), recurrent reward functions have not been much investigated. Some works (Sutton & Tanner, 2004; Schlegel et al., 2021) investigate the benefit of computing relations between value functions when learning predictive representations. Sutton and Tanner (2004) propose to represent the interrelation of predictions in a TD-network where nodes are predictions computed from states. The network can perform predictions that have complex temporal semantics. Schlegel et al. (2021) train an rnn architecture where hidden states are multi-step predictions. Finally, recent work by Karch et al. (2021) shows that agents can derive rewards from linguistic descriptions of time-extended behaviors. Time-extended goals include interactions that span multiple time steps (e.g. shake the blue ball) and spatio-temporal references to objects (e.g. get the red ball that was on the left of the sofa yesterday).
Learning goals. Goal-driven learning is the idea that humans use learning goals, goals about their own learning abilities, as a way to simplify the realization of task goals (Ram et al., 1995). Here, we refer to task goals as goals that express constraints on the physical state of the agent and/or environment. On the other hand, learning goals express constraints on the knowledge of the agent. Although most rl approaches target task goals, one could envision the use of learning goals for rl agents.
In a way, learning-progress-based goal sampling is a form of learning goal: as the agent favors regions of the goal space to sample its task goals, it formulates the goal of learning about this specific goal region (Baranes & Oudeyer, 2013; Fournier et al., 2018, 2021; Colas et al., 2019; Blaes et al., 2019; Akakzia et al., 2021).
Embodied Question Answering problems can also be seen as using learning goals: the agent is asked a question (i.e. a learning goal) and needs to explore the environment to answer it (acquire new knowledge) (Das et al., 2018; Yuan et al., 2019).
In the future, one could envision agents that set their own learning targets as sub-goals towards the resolution of harder task or learning goals, e.g. I'm going to learn about knitting so I can knit a pullover for my friend's birthday.
Goals as optimization under selected constraints. We discussed the representation of goals as a balance between multiple objectives. An extension of this idea is to integrate the selection of constraints on states or trajectories. One might want to maximize a given metric (e.g. walking speed), while setting various constraints (e.g. maintaining the power consumption below a given threshold or controlling only half of the motors). The agent could explore the space of constraints, setting constraints for itself, building a curriculum on these, etc. This is partially investigated in Colas et al. (2021), where the agent samples constraint-based goals in the optimization of control strategies to mitigate the economic and health costs of simulated epidemics. This approach, however, only considers constraints on minimal values for the objectives and requires the training of an additional Q-function per constraint.
Meta-diversity of goals. Finally, autotelic agents should learn to target all these goals within the same run and to transfer their skills and knowledge between different types of goals. For instance, targeting visual goals could help the agent explore the environment and solve learning or linguistic goals. As the density of possible goals increases, agents can organize more interesting curricula. They can select goals in easier representation spaces first (e.g. sensorimotor spaces), then move on to more difficult goals (e.g. in the visual space), before targeting the most abstract goals (e.g. learning goals, abstract linguistic goals).
This can take the form of goal spaces organized hierarchically at different levels of abstraction. The exploration of such complex goal spaces has been called meta-diversity (Etcheverry et al., 2020). In the outer loop of the meta-diversity search, one aims at learning a diverse set of outcome/goal representations. In the inner loop, the exploration mechanism aims at generating a diversity of behaviors in each existing goal space. How to efficiently transfer knowledge and skills between these multi-modal goal spaces, and how to efficiently organize goal selection within them, remain open questions.

Challenge #2: Learning to Represent Diverse Goals
This survey mentioned only a handful of complete autotelic architectures. Indeed, most of the surveyed approaches assume pre-existing goal embeddings or reward functions. Among the approaches that learn goal representations autonomously, we find that the learned representations are often restricted to very specific domains. Visual goal-conditioned approaches, for example, learn reward functions and goal embeddings but restrict them to the visual space (Nair et al., 2018b, 2020; Warde-Farley et al., 2019; Venkattaramanujam et al., 2019; Pong et al., 2020; Hartikainen et al., 2020). Empowerment methods, on the other hand, develop skills that maximally cover the state space, often restricted to a few of its dimensions (e.g. the x-y space in navigation tasks; Achiam et al., 2018; Eysenbach et al., 2019; Campos et al., 2020; Sharma et al., 2020).
These methods are limited to learning goal representations within a bounded, pre-defined space: the visual space or the (sub-)state space. How to autonomously learn to represent the wild diversity of goals surveyed in Section 4 and discussed in Challenge #1 remains an open question.

Challenge #3: Imagining Creative Goals
Goal sampling methods surveyed in Section 6 are all bound to sample goals within the distribution of known effects. Indeed, the support of the goal distribution is either pre-defined (e.g. Schaul et al., 2015; Andrychowicz et al., 2017; Colas et al., 2019; Li et al., 2020) or learned using a generative model trained on previously experienced outcomes (Florensa et al., 2018; Nair et al., 2018b, 2020; Pong et al., 2020). Humans, on the other hand, can imagine creative goals beyond their past experience, which, arguably, powers their exploration of the world.
In this survey, one approach opened a path in this direction. The imagine algorithm uses linguistic goal representations learned via social supervision and leverages the compositionality of language to imagine creative goals beyond its past experience (Colas et al., 2020a). This is implemented by a simple mechanism detecting templates in known goals and recombining them to form new ones. This is in line with a recent line of work in developmental psychology arguing that human play might be about practicing to generate plans to solve imaginary problems (Chu & Schulz, 2020).
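A toy version of this detect-and-recombine mechanism might look as follows; the two-word verb-object goal phrasing is a deliberate simplification and does not reproduce imagine's actual template grammar.

```python
from itertools import product

def imagine_goals(known_goals):
    """Recombine the parts of known two-word goals (e.g. 'grasp red_block')
    to imagine goals that were never experienced."""
    verbs, objects = set(), set()
    for g in known_goals:
        v, o = g.split(maxsplit=1)  # crude template: verb + object
        verbs.add(v)
        objects.add(o)
    known = set(known_goals)
    # Any verb-object recombination outside past experience is 'imagined'.
    return sorted(f"{v} {o}" for v, o in product(verbs, objects)
                  if f"{v} {o}" not in known)
```

From the two known goals "grasp red_block" and "push blue_ball", the sketch imagines "grasp blue_ball" and "push red_block", goals the agent never observed but can now target.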
Another way to achieve similar outcomes is to compose known goals with Boolean algebras, where new goals can be formed by composing existing atomic goals with negation, conjunction and disjunction. The logical combination of atomic goals was investigated by Tasse et al. (2020), Chitnis et al. (2021), Colas et al. (2020a) and Akakzia et al. (2021). The first approach represents the space of goals as a Boolean algebra, which allows immediate generalization to compositions of goals (and, or, not). The second approach considers using general symbolic and logic languages to express goals, but uses symbolic planning techniques that are not yet fully integrated in the goal-conditioned deep rl framework. The third and fourth train a generative model of goals conditioned on language inputs. Because the model generates discrete goals, it can compose language instructions by composing the finite sets of discrete goals associated to each instruction (and is the intersection, or the union, etc.). However, these works fall short of exploring the richness of goal compositionality and its various potential forms. Tasse et al. (2020) seems to be limited to specific goals expressed as target features, while Akakzia et al. (2021) requires discrete goals. Finally, Barreto et al. (2019) propose to target new goals that are represented by linear combinations of pseudo-rewards called cumulants. They use the option framework and show that an agent that masters a set of options associated with cumulants can generalize to any new behavior induced by a linear combination of those known cumulants.
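The core idea of logical goal composition can be illustrated at the level of binary goal-achievement functions; note that this is a simplification: Tasse et al. (2020) compose learned value functions for zero-shot transfer, not mere reward predicates, and the state layout and atomic predicates below are made up.

```python
# Boolean combinators over goal-achievement predicates state -> bool.
def goal_and(f, g): return lambda s: f(s) and g(s)
def goal_or(f, g):  return lambda s: f(s) or g(s)
def goal_not(f):    return lambda s: not f(s)

# Atomic goals as predicates over a (hypothetical) symbolic state.
red_grasped   = lambda s: s['red_in_hand']
blue_on_table = lambda s: s['blue_on_table']

# 'Hold the red block while the blue block is NOT on the table.'
composed = goal_and(red_grasped, goal_not(blue_on_table))
```

Any such composition is itself a valid goal-achievement function, so the space of targetable goals grows combinatorially from a small set of atoms.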

Challenge #4: Composing Skills for Better Generalization
Although this survey focuses on goal-related mechanisms, autotelic agents also need to learn to achieve their goals. Progress in this direction directly relies on progress in standard rl and goal-conditioned rl. In particular, autotelic agents would considerably benefit from better generalization and skill composition. Indeed, as the set of goals agents can target grows, it becomes more and more crucial that agents can efficiently transfer knowledge between skills, infer new skills from the ones they already master, and compose skills to form more complex ones. Although hierarchical rl approaches learn to compose skills sequentially, concurrent skill composition remains under-explored.

Challenge #6: Leveraging Socio-Cultural Environments
Decades of research in psychology, philosophy, linguistics and robotics have demonstrated the crucial importance of rich socio-cultural environments in human development (Vygotsky, 1934; Whorf, 1956; Wood et al., 1976; Rumelhart et al., 1986; Berk, 1994; Clark, 1998; Tomasello, 1999, 2009; Zlatev, 2001; Carruthers, 2002; Dautenhahn et al., 2002; Lindblom & Ziemke, 2003; Mirolli & Parisi, 2011; Lupyan, 2012). However, modern ai may have lost track of these insights. Deep reinforcement learning rarely considers social interactions and, when it does, models them as direct teaching, depriving agents of all autonomy. A recent discussion of this problem, and an argument for the need for agents that are both autonomous and teachable, can be found in a concurrent work (Sigaud, Caselles-Dupré, Colas, Akakzia, Oudeyer, & Chetouani, 2021). As we embed autotelic agents in richer socio-cultural worlds and let them interact with humans, they might start to learn goal representations that are meaningful to us, in our society.

Discussion & Conclusion
This paper defined the intrinsically motivated skills acquisition problem and proposed to view autotelic rl algorithms, or rl-imgeps, as computational tools to tackle it. These methods belong to the new field of developmental reinforcement learning, at the intersection of developmental robotics and rl. We reviewed current goal-conditioned rl approaches under the lens of autotelic agents that learn to represent and generate their own goals in addition to learning to achieve them.
We propose a new general definition of the goal construct: the pair of a compact goal representation and an associated goal-achievement function. Interestingly, this viewpoint allowed us to categorize some rl approaches as goal-conditioned even though the original papers did not explicitly acknowledge it. For instance, we view the Never Give Up (Badia et al., 2020b) and Agent 57 (Badia et al., 2020a) architectures as goal-conditioned, because agents actively select parameters affecting the task at hand (the parameter mixing extrinsic and intrinsic objectives, the discount factor) and see their behavior affected by this choice (goal-conditioned policies).
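This definition of the goal construct can be written down directly as a data structure: a compact representation that conditions the policy, paired with an internal achievement function. The target-feature example and its tolerance are made-up values, not taken from a specific surveyed system.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Goal:
    """A goal as a pair: compact representation z_g plus an associated
    goal-achievement function R_g over states."""
    embedding: Sequence    # z_g, fed to the goal-conditioned policy
    achieved: Callable     # R_g, the agent's internal success signal

# A target-feature goal: bring feature x within a tolerance of a target.
target_x, tol = 1.0, 0.05
reach_x = Goal(embedding=[target_x],
               achieved=lambda s: abs(s['x'] - target_x) <= tol)
```

Under this view, any mechanism that produces such pairs, whether hand-designed, learned from pixels, or generated from language, defines a goal space the agent can sample from.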
This point of view also offers directions for future research. Autotelic agents need to learn to represent goals and to measure goal achievement. Future research could extend the diversity of considered goal representations, and investigate novel reward function architectures and inductive biases that allow time-extended goals and goal composition and improve generalization.
The general vision we convey in this paper builds on the metaphor of the learning agent as a curious scientist. A scientist who would formulate hypotheses about the world and explore it to find out whether they are true. A scientist who would ask questions and set up intermediate goals to explore the world and find answers. A scientist who would set challenges for itself to learn about the world, discover new ways to interact with it and grow its collection of skills and knowledge. Such a scientist could decide on its own agenda. It would not need to be instructed and could be guided only by its curiosity, by its desire to discover new information and to master new skills. Autotelic agents should nonetheless be immersed in complex socio-cultural environments, just like humans are. In contact with humans, they could learn to represent goals that humans and society care about.

Approach                      | Goal Type                    | Goal Rep. | Reward Function | Goal sampling strategy

RL-IMGEPs that assume goal embeddings and reward functions:
(Fournier et al., 2018)       | Target features (+tolerance) | Pre-def   | Pre-def         | lp-based
hac (Levy et al., 2018)       | Target features              | Pre-def   | Pre-def         | hrl
hiro (Nachum et al., 2018)    | Target features              | Pre-def   | Pre-def         | hrl
CURIOUS (Colas et al., 2019)  | Target features              | Pre-def   | Pre-def         | lp-based
CLIC (Fournier et al., 2021)  | Target features              | Pre-def   | Pre-def         | lp-based
CWYC (Blaes et al., 2019)     | Target                       |           |                 |

Classification of approaches that train agents to sample their own goals. The proposed classification groups algorithms depending on their degree of autonomy: 1) rl-imgeps that rely on pre-defined goal representations (embeddings and reward functions); 2) rl-imgeps that rely on pre-defined reward functions but learn goal embeddings; and 3) rl-imgeps that learn complete goal representations (embeddings and reward functions). For each algorithm, we report the type of goals being pursued (see Section 4), whether goal embeddings are learned (Section 5), whether reward functions are learned (Section 5.3) and how goals are sampled (Section 6). We mark in bold algorithms that use a developmental approach and explicitly pursue the intrinsically motivated skills acquisition problem.

Figure 1: A typology of intrinsically-motivated and/or goal-conditioned rl approaches. pop-imgep, rl-imgep and rl-emgep refer to population-based intrinsically motivated goal exploration processes, rl-based imgeps and rl-based externally motivated goal exploration processes respectively. pop-imgeps, rl-imgeps and rl-emgeps all represent goals, but knowledge-based ims do not. While imgeps (pop-imgeps and rl-imgeps) generate their own goals, rl-emgeps require externally-defined goals. This paper is interested in rl-imgeps: autotelic methods at the intersection of goal-conditioned rl agents and intrinsically motivated processes that train learning agents to generate and pursue their own goals with goal-conditioned rl algorithms.

Figure 2: Representation of the different learning modules in an rl-imgep algorithm. In contrast, externally motivated goal exploration processes (rl-emgeps) only train the goal-conditioned policy and assume an external goal generator and goal-conditioned reward function. Learning goal embeddings, goal space supports and goal-conditioned reward functions are all about learning to represent goals. Learning a sampling distribution is about learning to prioritize their selection.

Figure 3: Examples of environments in autotelic rl approaches. We organize them by dominant feature, but they might share features from other categories as well. Toy Envs. are used to investigate and visualize goal-as-state coverage over 2D worlds; Hard-Exploration Envs. are used to benchmark goal generation algorithms; Object Manipulation Envs. allow for the study of the diversity of learned goals as well as curriculum learning; Interactive Envs. make it possible to represent goals using language and to model interactions with caregivers; Procedurally Generated Envs. enhance the vastness of potentially reachable goals.