General Policies, Subgoal Structure, and Planning Width

It has been observed that many classical planning domains with atomic goals can be solved by means of a simple polynomial exploration procedure, called IW, that runs in time exponential in the problem width, which in these cases is bounded and small. Yet, while the notion of width has become part of state-of-the-art planning algorithms such as BFWS, there is no good explanation for why so many benchmark domains have bounded width when atomic goals are considered. In this work, we address this question by relating bounded width with the existence of general optimal policies that in each planning instance are represented by tuples of atoms of bounded size. We also define the notions of (explicit) serializations and serialized width, which have a broader scope, as many domains have a bounded serialized width but no bounded width. Such problems are solved non-optimally in polynomial time by a suitable variant of the Serialized IW algorithm. Finally, the language of general policies and the semantics of serializations are combined to yield a simple, meaningful, and expressive language for specifying serializations compactly in the form of sketches, which can be used for encoding domain control knowledge by hand or for learning it from small examples. Sketches express general problem decompositions in terms of subgoals, and sketches of bounded width express problem decompositions that can be solved in polynomial time.


Introduction
Width-based search methods exploit the structure of states to enumerate the state space in ways that differ from "blind" search methods such as breadth-first and depth-first search (Lipovetzky & Geffner, 2012). This is achieved by associating a non-negative integer with each state generated in the search, a so-called novelty measure, which is defined by the size of the smallest factor in the state that has not been seen in previously generated states.
States are deemed more novel and hence preferred in the exploration search when this novelty measure is smaller. Other types of novelty measures have been used in reinforcement learning for dealing with sparse rewards and in genetic algorithms for dealing with local minima (Tang, Houthooft, Foote, Stooke, Chen, Duan, Schulman, DeTurck, & Abbeel, 2017; Pathak, Agrawal, Efros, & Darrell, 2017; Ostrovski, Bellemare, Oord, & Munos, 2017), but the results are mostly empirical. In classical planning, where novelty measures are part of state-of-the-art search algorithms (Lipovetzky & Geffner, 2017b, 2017a), there is a solid body of theory that relates a specific type of novelty measure with a notion of problem width that bounds the complexity of planning problems (Lipovetzky & Geffner, 2012).
The basic width-based planning algorithms are simple and assume a fixed number of Boolean state features F that in classical planning are given by the atoms in the problem. The procedure IW(1) is a breadth-first search that starts in the given initial state and prunes all the states that do not make a feature from F true for the first time in the search. IW(k) is like IW(1) but uses conjunctions of up to k features from F instead. Alternatively, IW(k) can be regarded as a breadth-first search that prunes states s with novelty measures that are greater than k, where the novelty measure of s is the size of the minimum conjunction (set) of features that is true in s but false in all the states generated before s.
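The novelty measure just described can be illustrated with a minimal sketch. It assumes that states are Python sets of atom names and that the tuples seen so far are kept in a plain set; the names `novelty` and `record` are hypothetical helpers introduced only for this illustration.

```python
from itertools import combinations

def novelty(state, seen_tuples, max_size):
    """Size of the smallest atom tuple true in `state` that is not in
    `seen_tuples`, or None if there is no such tuple of size <= max_size."""
    atoms = sorted(state)  # sorting gives each tuple a canonical form
    for size in range(1, max_size + 1):
        if any(t not in seen_tuples for t in combinations(atoms, size)):
            return size
    return None

def record(state, seen_tuples, max_size):
    """Mark every tuple of size <= max_size that is true in `state` as seen."""
    atoms = sorted(state)
    for size in range(1, max_size + 1):
        seen_tuples.update(combinations(atoms, size))

# Illustrative run over made-up atoms p, q, r.
seen = set()
record({"p", "q"}, seen, 2)
first = novelty({"p", "r"}, seen, 2)   # the atom r has not been seen before
record({"p", "r"}, seen, 2)
second = novelty({"q", "r"}, seen, 2)  # only the pair (q, r) is new
```

Under this reading, IW(k) keeps a generated state exactly when `novelty` returns a value that is at most k.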
For many benchmark domains, it has been shown that IW(k) for a small value of k suffices to compute plans, and indeed optimal plans, for any atomic goal (Lipovetzky & Geffner, 2012). State-of-the-art planning algorithms like BFWS (Lipovetzky & Geffner, 2017b, 2017a) make use of this property for serializing conjunctive goals into atomic ones, an idea that is also present in algorithms that pre-compute atomic landmarks (Hoffmann, Porteous, & Sebastia, 2004) and use them as counting heuristics (Richter & Westphal, 2010).
An important open question in the area is why these width-based methods are effective, and in particular, why so many domains have a small width when atomic goals are considered. Is this a property of the domains? Is it an accident of the domain encodings used? In this work, we address these and related questions. For this, we bring in the notion of general policies, that is, policies that solve multiple instances of a planning domain all at once (Srivastava, Immerman, & Zilberstein, 2008; Bonet & Geffner, 2015; Hu & De Giacomo, 2011; Belle & Levesque, 2016; Segovia, Jiménez, & Jonsson, 2016), and we use the formulation of general policies expressed in terms of finite sets of rules over a fixed set of Boolean and numerical features (Bonet & Geffner, 2018).
In this paper, a class of instances is shown to have bounded width when there is a general optimal policy for the class which can be "followed" in each instance by just considering atom tuples of bounded size, without assuming knowledge of the policies or the features involved. The existing planning domains that have been shown to have bounded width can all be characterized in this way. In addition, the notion of general policies is extended to comprise serializations that split problems into subproblems. A serialization has bounded width when the resulting subproblems have bounded width and can be solved greedily for reaching the problem goal. A general policy turns out to be a serialization of width zero; namely, one in which the subproblems can be solved in a single step. Finally, the syntax of general policies is combined with the semantics of serializations to yield a simple, meaningful, and expressive language for specifying serializations succinctly in the form of sketches, which can be used to encode domain control knowledge by hand, or for learning it from examples (Drexler, Seipp, & Geffner, 2021, 2022).
The paper is organized as follows. We first review the notions of planning, width, and general policies, and relate width with the size of the tuples of atoms that are needed to apply such policies. We then introduce serializations, the more general notion of serialized width, the relation between general policies and serialized width, and policy sketches. We finally summarize the main contributions, discuss extensions and limitations, related work, and conclusions.

Planning
A classical planning problem is a pair $P = \langle D, I \rangle$ where D is a first-order domain, such as a STRIPS domain, and I contains information about the instance (Geffner & Bonet, 2013; Ghallab, Nau, & Traverso, 2016; Haslum, Lipovetzky, Magazzeni, & Muise, 2019). The domain D has a set of predicate symbols p and a set of action schemas with preconditions and effects given by atoms $p(x_1, \ldots, x_k)$, where p is a predicate symbol of arity k, and each $x_i$ is an argument of the schema. The instance information is a tuple $I = \langle O, Init, G \rangle$ where O is a set of object names $c_i$, and Init and G are sets of ground atoms $p(c_1, \ldots, c_k)$ denoting the initial and goal situations.
A classical problem $P = \langle D, I \rangle$ encodes a state model $S(P) = \langle S, s_0, S_G, Act, A, f \rangle$ in compact form where the states $s \in S$ are sets of ground atoms from P (assumed to be the true atoms in s), $s_0$ is the initial state Init, $S_G$ is the set of goal states s such that $G \subseteq s$, Act is the set of ground actions in P, A(s) is the set of ground actions whose preconditions are true in s, and f(a, s), for $a \in A(s)$, represents the state s′ that follows action a in the state s. An action sequence $\sigma = a_0, a_1, \ldots, a_n$ is applicable in P if $a_i \in A(s_i)$ and $s_{i+1} = f(a_i, s_i)$, for $i = 0, \ldots, n$. The states $s_i$ in the sequence are said to be reachable in P, and states(P) denotes the set of reachable states in P. The applicable action sequence $\sigma = a_0, a_1, \ldots, a_n$ is a plan if $s_{n+1} \in S_G$. The cost of a plan is given by its length, and a plan is optimal if there is no shorter plan. If there is a plan for P, the goal G of P is said to be reachable, and P is said to be solvable. The cost of P is the cost of an optimal plan for P, if such a plan exists, and infinite otherwise. A reachable state s is a dead-end if the goal is not reachable in the problem P[s] that is like P but with s as the initial state.
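The state model can be made concrete with a small sketch. It assumes ground STRIPS actions with precondition, add, and delete sets, and states represented as frozensets of atom names; the `Action` type, the helper names, and the one-block instance are hypothetical and serve only as an illustration.

```python
from typing import FrozenSet, List, NamedTuple

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # atoms that must be true in s
    add: FrozenSet[str]     # atoms made true by the action
    delete: FrozenSet[str]  # atoms made false by the action

def applicable(actions: List[Action], s: FrozenSet[str]) -> List[Action]:
    """A(s): the ground actions whose preconditions are true in s."""
    return [a for a in actions if a.pre <= s]

def f(a: Action, s: FrozenSet[str]) -> FrozenSet[str]:
    """The state that follows action a in state s."""
    return (s - a.delete) | a.add

def is_plan(seq: List[Action], s0: FrozenSet[str], goal: FrozenSet[str]) -> bool:
    """True if seq is applicable from s0 and ends in a state that contains goal."""
    s = s0
    for a in seq:
        if not a.pre <= s:
            return False
        s = f(a, s)
    return goal <= s

# One-block instance: pick up block A from the table.
pickup = Action(
    "pickup(A)",
    pre=frozenset({"ontable(A)", "clear(A)", "handempty"}),
    add=frozenset({"holding(A)"}),
    delete=frozenset({"ontable(A)", "clear(A)", "handempty"}),
)
s0 = frozenset({"ontable(A)", "clear(A)", "handempty"})
goal = frozenset({"holding(A)"})
solved = is_plan([pickup], s0, goal)
```

The cost of a plan in this representation is simply `len(seq)`.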
We refer to subsets of ground atoms in a planning problem P as tuples of atoms, or atom tuples.An atom tuple t is reachable in P if there is a reachable state s in P that makes true all the atoms in t.The cost of a reachable state s is the minimum cost of a plan that reaches s, the cost of a reachable tuple t is the minimum cost of a state that makes t true, and the cost of a state or tuple that is unreachable is infinite.

Width
While the standard definition of width is based on the consideration of sequences of atom tuples (Lipovetzky & Geffner, 2012), the formulation below is slightly more general and more convenient for our purposes.
Definition 1 (Admissible tuple set). A set T of reachable atom tuples in a planning problem P is admissible for P if
1. some tuple in T is true in the initial state of P, and
2. any optimal plan for a tuple t in T that is not an optimal plan for P can be extended into an optimal plan for another tuple t′ in T by adding a single action.

Lipovetzky and Geffner (2012) say that a sequence $(t_0, t_1, \ldots, t_n)$ of atom tuples is admissible if $t_0$ is true at the initial state, any optimal plan for $t_i$ can be extended with one action into an optimal plan for $t_{i+1}$, $0 \le i < n$, and any optimal plan for $t_n$ is an optimal plan for P. It is easy to show that in such a case, the set $T = \{t_0, t_1, \ldots, t_n\}$ is admissible according to the new definition of admissibility. There are cases, however, where all optimal plans for a fixed tuple t in T can be extended with a single action into plans that are optimal for either t′ or t′′ but not for a unique tuple of the same size. In such cases, the width of a problem can be smaller according to the new definition.

For a tuple t and a set of tuples T, |t| denotes the number of atoms in t, |T| denotes the number of tuples in T, and size(T) denotes the maximum |t| for a tuple t in T. The width of a problem P is the size of a minimum-size set of tuples T that is admissible:

Definition 2 (Width). The width of a STRIPS planning problem P is $w(P) \doteq \min_T size(T)$ where T ranges over the sets of tuples that are admissible in P. If P is solvable in zero or one step, w(P) is set to 0, and if P is not solvable at all, w(P) is set to ∞.

The definition of width for classes of problems Q is then:

Definition 3 (Width of a class of problems). The width w(Q) of a class Q of problems is the minimum integer k such that w(P) ≤ k for each problem P in Q, or ∞ if no such k exists.

The reason for defining the width of P as 0 or ∞ according to whether P is solvable in at most one step or not solvable at all, respectively, is convenience. Basically, problems are solvable in time exponential in their width, and hence, w(P) = 0 implies that P can be solved in constant time, if a constant branching factor is assumed. At the same time, a bounded width for a class of problems implies that all the problems in the class are solvable, something which is not ensured by setting w(P) to N + 1, where N is the number of problem atoms (Lipovetzky & Geffner, 2012).
Example 1: The Blocksworld domain
• $Q_{Blocks}$ is the class of all Blocksworld problems over the standard domain specification with 4 operator schemas: stack/unstack and pick/putdown operators. This class of problems has unbounded width as shown below.
• $Q_{Clear}$ is the subclass of $Q_{Blocks}$ made of the problems whose goal is the single atom clear(x), for some block x, and where the gripper is initially empty. Let $B_1, \ldots, B_\ell$ be the blocks above x, from top to bottom in the initial state, and let us consider the set of tuples $T = \{t_0, t_1, \ldots, t_{2\ell-1}\}$ where $t_0 = \{clear(B_1)\}$, and [...]. It is easy to check that the two conditions in Definition 1 hold for T. Hence, w(P) ≤ 1, and since w(P) > 0, as the goal cannot be reached in zero or one step in general, w(P) = 1 and $w(Q_{Clear}) = 1$.
• $Q_{On}$ is the subclass of $Q_{Blocks}$ made of the problems whose goal is the single atom on(x, y) for two blocks x and y.
Let us calculate the width of an arbitrary instance P with the assumption that the blocks x and y are initially in different towers. Let $B_1, \ldots, B_\ell$ (resp. $D_1, \ldots, D_m$) be the blocks above x (resp. y), in order from top to bottom, in the initial state. Let us consider the set $T = \{t_0, \ldots, t_{2\ell}, t'_0, \ldots, t'_{2m}, t''_0, t''_1\}$ of tuples where $t_i$, $0 \le i < 2\ell$, is as in the previous example, [...], clear(y)}, and $t''_1 = \{on(x, y)\}$. It is not difficult to check that T is admissible. Later, we will show that w(P) > 1 and thus w(P) = 2. If the two blocks x and y are in the same tower in the initial state, a different set of tuples must be considered but the width is still bounded by 2. Hence, $w(Q_{On}) = 2$.

Algorithm 1: IW(T) Search
 1: Input: Planning problem P with N ground atoms
 2: Input: Set T of atom tuples from P
 3: Initialize perfect hash table H for storing the tuples in T, on which the operations of insertion and lookup take constant time
 4: Initialize FIFO queue Q on which the enqueue and dequeue operations take constant time
 5: Enqueue node for the initial state $s_0$ of P
 6: While Q is not empty:
 7:     Dequeue node n for state s
 8:     If s is a goal state, return the path to node n (Solution found)
 9:     If s makes true some tuple t from T that is not in H:
10:         Insert all tuples from T made true by s in H
11:         Enqueue a node n′ for each successor s′ of s
12: Return FAILURE (T is not admissible for P)

Figure 1: IW(T) is a breadth-first search that prunes nodes that do not satisfy a tuple in T for the first time in the search. Algorithm IW(k) is IW(T) where T is the set of conjunctions of up to k atoms in P. Conditions for the completeness and optimality of IW(T) are given in Theorem 4.

Algorithms IW(T ), IW(k), and IW
If T is an admissible set of tuples for P, there is a very simple algorithm IW(T) that solves P optimally by expanding no more than |T| states. The algorithm, shown in Fig. 1, carries out a forward, breadth-first search where every newly generated node that does not make a tuple in T true for the first time in the search is pruned.
Theorem 4 (Completeness of IW(T)). If T is an admissible set of atom tuples in problem P, IW(T) finds an optimal plan for P. Moreover, IW(T′) finds an optimal plan for P for any set T′ that contains T.
Proof. The first claim is a special case of the second. Assuming that T′ contains an admissible set T, it is easy to show by induction that at the beginning of each iteration of the loop, the queue contains at least one node that represents an optimal path for some tuple in T not yet visited. Hence, IW(T′) cannot return FAILURE and must find a plan for P. Since the nodes are ordered in the queue by their costs, such a plan must be optimal.
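The IW(T) search can be sketched in a few lines. The sketch below assumes states are frozensets of atom names, tuples in T are frozensets as well, and the caller supplies `successors` and `is_goal` functions; these names and the one-dimensional corridor instance are assumptions made for illustration. An ordinary Python set stands in for the perfect hash table, and paths are stored in the nodes for simplicity.

```python
from collections import deque

def iw_T(s0, successors, is_goal, T):
    """Breadth-first search that expands a node only if its state makes true
    some tuple of T for the first time. Returns a state path or None."""
    H = set()                       # tuples of T already made true
    queue = deque([(s0, [s0])])
    while queue:
        s, path = queue.popleft()
        if is_goal(s):
            return path
        true_tuples = [t for t in T if t <= s]
        if any(t not in H for t in true_tuples):
            H.update(true_tuples)
            for s2 in successors(s):
                queue.append((s2, path + [s2]))
    return None                     # FAILURE: T is not admissible

# Corridor with cells 0..4; the agent moves left or right, goal is cell 4.
def state(i):
    return frozenset({f"pos({i})"})

def successors(s, size=5):
    (atom,) = s
    i = int(atom[4:-1])
    return [state(j) for j in (i - 1, i + 1) if 0 <= j < size]

T_all = {state(i) for i in range(5)}
plan_path = iw_T(state(0), successors, lambda s: s == state(4), T_all)
```

With `T_all`, the search expands one node per cell, in line with the bound of |T| expansions; with a set that omits the tuples for the intermediate cells, the search returns `None`.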
To obtain simple expressions that bound the time and space used by IW(T), and other algorithms, we make the following assumptions on the time/space required by the basic operations on states and tuples. For some domains, these bounds can be attained by suitable modifications of the algorithms, like a clever choice of data structures, and preprocessing that can be amortized.
Assumption (Simple time and space analyses). For any planning problem P and reachable state s in P, the generation of the set Succ(s) of its successor states takes time proportional to |Succ(s)|, and checking whether s is a goal state or makes true a given atom tuple t takes constant time. Likewise, each such state can be stored in a constant amount of memory. These complexities are independent of the numbers of ground atoms and actions in P, and the size of the tuple t.
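Under this assumption, the time and space requirements of IW(T) can be expressed as:

Theorem 5. Let T be a set of atom tuples in a planning problem P whose branching factor is bounded by b.
1. IW(T) expands up to |T| nodes, generates up to b|T| nodes, and runs in $O(b|T|^2)$ time and $O(b|T|)$ space.
2. If P has N ground atoms, the number of atoms that "flip" value across a transition in P is known and bounded by a constant (as in STRIPS domains), and size(T) ≤ k, then a running time of $O(b|T|N^{k-1})$ can be obtained.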
Proof. Recall that a node is generated if it is inserted into the queue (cf. lines 5 and 11), and it is expanded if its successor nodes are generated (cf. line 11). A node is expanded if it makes true some tuple in T for the first time in the search (cf. line 9). Thus, IW(T) expands and generates up to |T| and b|T| nodes, respectively. The construction of the hash table in line 3 requires time and space linear in the number |T| of keys to be stored.
The test in line 9 requires checking whether some tuple t in T that belongs to the state s is not in the hash table. This can be done in O(|T|) time by iterating over each tuple in T and checking whether it belongs to s and the hash table. Hence, assuming constant time and space for the generation of successor states and the check of tuple satisfiability by states, IW(T) runs in $O(b|T|^2)$ time and $O(b|T|)$ space.
To obtain the bound described in 2, an implementation is needed that keeps track of the set ∆ of atoms that change value when a successor state s′ of s is generated. Then, a tuple t true in s′ can be novel only if it contains some atom in ∆, as the tuples true in s′ that do not contain such atoms are also true in s. If all the tuples in T are of size at most k, and the size of ∆ is bounded by a constant (omitted in the O-notation), the number of tuples that need to be checked in line 9 is at most $O(N^{k-1})$.
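The ∆-based restriction of the novelty test can be sketched as follows. The function enumerates only the tuples of size at most k that are true in the successor state and contain an atom that changed value; the name `candidate_tuples` and the toy states are hypothetical, and states are again assumed to be sets of atom names.

```python
from itertools import combinations

def candidate_tuples(s2, delta, k):
    """Tuples of size <= k true in successor state s2 that contain at least
    one atom of delta (the atoms whose value changed in the transition)."""
    changed_true = sorted(a for a in delta if a in s2)
    atoms = sorted(s2)
    out = set()
    for a in changed_true:
        rest = [b for b in atoms if b != a]
        # Pair the changed atom with up to k-1 other atoms true in s2.
        for size in range(k):
            for others in combinations(rest, size):
                out.add(frozenset((a,) + others))
    return out

s = {"p", "q"}
s2 = {"p", "r"}
delta = (s - s2) | (s2 - s)          # atoms that flipped: q and r
cands = candidate_tuples(s2, delta, 2)
```

Since each changed atom is combined with at most k−1 of the N atoms, the number of candidates per transition is $O(N^{k-1})$ when |∆| is bounded by a constant.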
The algorithm IW(k) (Lipovetzky & Geffner, 2012) is a special case of IW(T) where T is the set $T^k$ of all conjunctions of up to k atoms.[2] Versions of IW(T) have been used before for planning in a class of video-games (Geffner & Geffner, 2015), where IW(1) explored too few nodes and IW(2) too many. The set T was then defined to comprise a selected class of atom pairs.

[2] There is a minor difference between the algorithm IW(k) of Lipovetzky and Geffner (2012) and the version that results from IW(T) when T is set to $T^k$. In the first case, non-novel nodes are pruned once they are generated and thus not enqueued during the search. In the second case, a node is pruned when it is selected for expansion if it is not novel and it is not a goal. This difference does not affect the formal properties of the algorithm (optimality and time/memory complexity) except in a "border case", ensuring that IW(0) is complete and optimal for problems of width zero, solvable in one step, according to the new definition.

The properties of IW(k) are:

Theorem 6 (Lipovetzky and Geffner, 2012). Let P be a planning problem with N atoms, branching factor bounded by b, and where the number of atoms that "flip" value across a transition is known and bounded. For each non-negative integer k, IW(k) expands up to $N^k$ nodes, generates up to $bN^k$ nodes, and runs in $O(bN^{2k-1})$ time and $O(bN^k)$ space. IW(k) is guaranteed to solve P optimally (shortest path) if w(P) ≤ k.
Proof. Direct from Theorems 4 and 5.

Finally, the algorithm IW, shown in Fig. 2, runs IW(k) for increasing values k = 0, 1, ..., N, stopping when a plan is found, or when no plan is found after k = N, where N is the number of atoms in P.

Algorithm 2: IW Search
 1: Input: Planning problem P with N ground atoms
 2: For k = 0, 1, ..., N do:
 3:     Run IW(k) on P
 4:     If IW(k) finds a plan for P, return the plan
 5: Return "no plan exists for P" (P has no solution)

Figure 2: IW performs multiple IW(k) searches for increasing values of k = 0, 1, ..., N, where N is the number of ground atoms in P. The completeness of IW is given in Theorem 7.

Theorem 7 (Completeness of IW). If w(P) ≤ k, IW finds a plan for P, not necessarily optimal, in time and space bounded by $O(bN^{2k-1})$ and $O(bN^k)$, respectively.
Proof. If w(P) ≤ k, IW(k) finds an optimal plan for P, but a sub-optimal plan may be found by IW(i) for i < k. In either case, by Theorem 6, IW(i) runs in time and space $O(bN^{2i-1})$ and $O(bN^i)$, respectively, for i = 1, ..., k, which are dominated by the expressions with i = k.

It is not always necessary to execute all the iterations in the loop in IW. Indeed, if the call to IW(k), for k = i, ends up pruning only duplicate states, the search is already complete and further calls of IW(k) for k = i+1, i+2, ..., N will expand and prune exactly the same sets of nodes expanded and pruned by IW(i).
The procedure IW solves problems P of width bounded by k in polynomial time but not necessarily optimally. IW(k) solves such problems optimally:
Theorem 8. Let Q be a collection of problems of bounded width. Then, any problem P in Q is solved in polynomial time by the IW algorithm. If w(Q) ≤ k, IW(k) optimally solves any instance in Q in polynomial time.
Proof. Let w(Q) = k. For any planning problem P in Q, IW runs IW(i) until i = k, when IW(k) is guaranteed to find a plan for P. Each run of IW(i) takes polynomial time as k is fixed for any problem P in Q. If the bound k is known, IW(k) can be run instead, solving each problem P in Q optimally.
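The iterative IW procedure can be sketched on top of a direct implementation of IW(k). The sketch assumes states are frozensets of atom names and that `successors`, `is_goal`, and the atom count are supplied by the caller; the function names and the corridor instance are illustrative assumptions. Nodes are pruned when selected for expansion, so IW(0) still solves problems that need at most one step.

```python
from collections import deque
from itertools import combinations

def iw_k(s0, successors, is_goal, k):
    """IW(T) with T the set of all conjunctions of up to k atoms."""
    seen = set()
    queue = deque([(s0, [s0])])
    while queue:
        s, path = queue.popleft()
        if is_goal(s):
            return path
        atoms = sorted(s)
        tuples = {c for size in range(k + 1) for c in combinations(atoms, size)}
        if not tuples <= seen:          # some tuple is true for the first time
            seen |= tuples
            for s2 in successors(s):
                queue.append((s2, path + [s2]))
    return None

def iw(s0, successors, is_goal, num_atoms):
    """Run IW(k) for k = 0, 1, ..., num_atoms; return (k, path) or None."""
    for k in range(num_atoms + 1):
        path = iw_k(s0, successors, is_goal, k)
        if path is not None:
            return k, path
    return None

# Corridor with cells 0..3; the goal is the last cell.
def cell(i):
    return frozenset({f"pos({i})"})

def moves(s, size=4):
    (atom,) = s
    i = int(atom[4:-1])
    return [cell(j) for j in (i - 1, i + 1) if 0 <= j < size]

result = iw(cell(0), moves, lambda s: s == cell(3), num_atoms=4)
```

In this instance IW(0) fails, because the successor of the initial state is neither a goal nor novel with respect to the empty tuple, and IW(1) then finds the plan.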
Example 2: IW(k) for Blocksworld instances
• If the width of $Q_{Blocks}$ were bounded by some constant k, every problem P in $Q_{Blocks}$ would be optimally solvable in polynomial time by IW(k), by Theorem 8. Since computing optimal plans for arbitrary instances of Blocksworld is NP-hard (Chenoweth, 1991; Gupta & Nau, 1992), the width of $Q_{Blocks}$ must be unbounded, unless P = NP.
• Any transition in a Blocksworld instance flips at most 5 atoms. Hence, since $w(Q_{Clear}) = 1$ (resp. $w(Q_{On}) = 2$), IW(1) (resp. IW(2)) is guaranteed to find an optimal plan for any problem P in $Q_{Clear}$ (resp. $Q_{On}$) with N ground atoms in O(N) time and space (resp. $O(N^3)$ time and $O(N^2)$ space).
From now on, we assume that the actions in the instances for a class Q flip at most a constant number of atoms, which is actually the case when the instances come from a STRIPS domain.

General Policies
The (general) policies over a class Q of problems P are first defined semantically, then syntactically. Semantically, a policy π is regarded as a relation on state pairs (s, s′) that is only true for state transitions. A state transition is a pair of states (s, s′) such that there is an action a that is applicable in s and which maps s into s′. The reason for policies to be defined as relations on state pairs and not on state-action pairs (e.g., as in RL and MDPs) is that the set of actions changes across the instances in Q. Policies π expressed as relations on state transitions (s, s′) specify the possible actions to do in s indirectly and non-deterministically (Bonet & Geffner, 2018): if (s, s′) is a state transition in the relation π, any action a that maps s into s′ is allowed by π and can thus be applied at the state s.
Definition 9 (Policies). A policy π for a class of problems Q is a binary relation on $\bigcup_{P \in Q} states(P)$; i.e., on the reachable states of the problems in Q, such that the state pair (s, s′) in P is in π only if (s, s′) is a state transition in P. Furthermore,
1. A state transition (s, s′) in a problem P is a π-transition if (s, s′) ∈ π, and s is not a goal state in P.
2. A π-trajectory in P is a sequence $s_0, s_1, \ldots, s_n$ of states in P such that $(s_i, s_{i+1})$ is a π-transition, $0 \le i < n$, and $s_0$ is the initial state of P.
3. A π-trajectory $s_0, s_1, \ldots, s_n$ in P is maximal if there are no π-transitions $(s_n, s)$ in P, or $s_n = s_i$ for some $0 \le i < n$. In the latter case, the trajectory is cyclic.
4. The policy π solves P if every maximal π-trajectory $s_0, s_1, \ldots, s_n$ ends in a goal state (i.e., $s_n$ is a goal state).
5. The policy π solves P optimally if every maximal π-trajectory reaches a goal state in n steps, where n is the cost of P.
6. The policy π solves Q if π solves each problem P in Q, and it is optimal for Q if it solves each problem in Q optimally.
Sufficient and necessary conditions for a general policy π to solve a class of problems Q can be expressed with suitable notions that apply to each of the instances in Q:

Definition 10 (Policy concepts). Let Q be a class of problems, and let π be a policy for Q. Then,
1. [...]
2. π is acyclic in Q if there is no cyclic π-trajectory starting in an initial state for P ∈ Q.

Theorem 11 (Requirements for solvability). A policy π solves a class of problems Q iff π is closed and acyclic in Q.
Proof. Direct. If π is closed and acyclic, π solves every problem in Q. Conversely, if π is either not closed or not acyclic, there is at least one problem P in Q that π does not solve.
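Whether a policy solves a single instance can be checked directly from Definition 9 on a small explicit state graph. The sketch below assumes the policy is given as a set of state pairs over integer-labelled states and checks that no maximal π-trajectory from the initial state gets stuck in a non-goal state or revisits a state; the function name `solves` and the toy graphs are assumptions for illustration.

```python
def solves(s0, goals, pi):
    """True iff every maximal pi-trajectory from s0 ends in a goal state."""
    GREY, BLACK = 1, 2          # GREY: on the current trajectory; BLACK: verified
    color = {}

    def visit(s):
        color[s] = GREY
        # Pairs that start in a goal state are not pi-transitions.
        succ = [] if s in goals else [t for (u, t) in pi if u == s]
        if s not in goals and not succ:
            return False        # trajectory stuck in a non-goal state
        for t in succ:
            if color.get(t) == GREY:
                return False    # cyclic pi-trajectory
            if t not in color and not visit(t):
                return False
        color[s] = BLACK
        return True

    return visit(s0)

goals = {3}
chain = {(0, 1), (1, 2), (2, 3)}
ok = solves(0, goals, chain)
with_cycle = solves(0, goals, chain | {(1, 0)})
stuck = solves(0, goals, {(0, 1)})
```

The first check succeeds, while the other two fail because of a cyclic trajectory and a trajectory that ends in a non-goal state, respectively.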
Example 3: General (semantic) policy for $Q_{Clear}$
• Let π be the general policy for $Q_{Clear}$ defined as the set of transitions (s, s′) such that: a) for a block y above x, s ⊨ clear(y) ∧ hand-empty and s′ ⊨ hold(y), or b) s ⊨ hold(y) and s′ ⊨ ontable(y).
To show that π solves all the instances in $Q_{Clear}$ one needs to show that π is closed and acyclic.
For closedness, it is easy to see that in every reachable state s in an instance P in $Q_{Clear}$ that is not a goal state, there are successors s′ for which condition (a) or (b) holds. Finally, π is acyclic because in any transition of a π-trajectory, one of the blocks that is initially above x is unstacked or placed on the table, and no block is ever stacked above x, so every π-trajectory must be finite.

Rule-based Policies
General policies are relations over the state pairs that are state transitions, and following Bonet and Geffner (2018), these relations can be defined in a compact way by means of rules of the form C → E over a fixed set Φ of Boolean and numerical features over Q.
Definition 12 (Features). A Boolean (resp. numerical) feature for a problem P is a function that maps reachable states into Boolean values (resp. non-negative integers). A feature for a class Q of problems is a feature that is defined on each problem P in Q.

When providing time bounds, we will be interested in linear features defined as follows:

Definition 13 (Linear features). A feature φ for problem P with N ground atoms is linear if for any reachable state s in P, the value φ(s) can be computed in O(N) time (i.e., linear time), and if φ is numerical, the size of {φ(s) : s ∈ states(P)} is O(N). A set Φ of features is a set of linear features for P if each φ in Φ is linear for P, and it is a set of linear features for a class Q if Φ is a set of linear features for each problem P in Q.

The form of the rules that make use of the features is as follows:

Definition 14 (Rules). Let Φ be a set of features. A Φ-rule is a rule of the form C → E where the condition C is a set (conjunction) of Boolean feature conditions and the effect E is a set (conjunction) of feature value changes. A Boolean feature condition is of the form p, ¬p, n = 0, and n > 0 for Boolean and numerical features p and n in Φ. Feature value changes are of the form p, ¬p, p? for Boolean p, and n↓, n↑, and n? for numerical n.
A collection of rules defines a binary relation over state transitions in Q. We first use rules to define policies, and later use them to define another type of binary relation, serializations. While policies select state transitions (s, s′) from states s, serializations select state pairs (s, s′) where s′ is not necessarily reachable from s through a single action.

The state pair (s, s′) is compatible with a rule C → E if s satisfies C, and the pair satisfies E. The pair is then also said to satisfy the rule itself. The state pair is compatible with a set R of rules if it is compatible with at least one rule in the set. The pair (s, s′) is then said to be an R-pair, or to be in R.
A set of rules R defines a policy in the following way:

Definition 16 (Rule-based policies). Let Q be a collection of problems and let Φ be a set of features for Q. A set of rules R over Φ defines the policy $\pi_R$ for Q in which a state pair (s, s′) over an instance P in Q is in $\pi_R$ iff (s, s′) is a state transition in P that is compatible with R.

Example 4: General policy for $Q_{Clear}$
• A general policy for $Q_{Clear}$ can be expressed using rules over the set Φ = {H, n} of two features, where H is the Boolean feature that is true when the gripper holds a block, and n is the numerical feature that counts the number of blocks above x. The policy has the two rules:
    {¬H, n > 0} → {H, n↓}
    {H} → {¬H}
The first rule says that when the gripper is empty and there are blocks above x, an action that decreases n and makes H true must be chosen, while the second rule says that when the gripper holds a block, an action that makes H false and does not affect n must be selected. This policy is slightly more general than the one in the previous example as a block being held can be put "away" in any position except above x.
• As we saw before, $w(Q_{Clear}) = 1$ and this policy solves any instance in $Q_{Clear}$.
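The compatibility of a state pair with a set of rules can be sketched over feature valuations. The encoding below is an assumption made for illustration: conditions and effects are dictionaries from feature names to small string codes, and a feature that is not mentioned in the effect of a rule is required to keep its value, which matches the reading of the second $Q_{Clear}$ rule as one that "does not affect n". The two rules are written from the prose description in Example 4.

```python
def holds(condition, value):
    """Check one feature condition (p, not p, n = 0, n > 0) on a value."""
    if condition == "true":
        return value is True
    if condition == "false":
        return value is False
    if condition == "=0":
        return value == 0
    if condition == ">0":
        return value > 0
    raise ValueError(condition)

def changes(effect, before, after):
    """Check one feature value change between two states."""
    if effect == "true":
        return after is True
    if effect == "false":
        return after is False
    if effect == "dec":
        return after < before
    if effect == "inc":
        return after > before
    if effect == "any":
        return True
    if effect == "same":      # assumed default for unmentioned features
        return after == before
    raise ValueError(effect)

def compatible(rule, f, f2):
    """True if the pair of feature valuations (f, f2) satisfies rule C -> E."""
    C, E = rule
    if not all(holds(c, f[name]) for name, c in C.items()):
        return False
    return all(changes(E.get(name, "same"), f[name], f2[name]) for name in f)

def is_R_pair(R, f, f2):
    return any(compatible(rule, f, f2) for rule in R)

# The two rules for Q_Clear over the features H and n.
R_clear = [
    ({"H": "false", "n": ">0"}, {"H": "true", "n": "dec"}),
    ({"H": "true"}, {"H": "false"}),
]
picked = is_R_pair(R_clear, {"H": False, "n": 2}, {"H": True, "n": 1})
put_back = is_R_pair(R_clear, {"H": True, "n": 1}, {"H": False, "n": 2})
```

Here `picked` corresponds to unstacking a block from above x, while `put_back` corresponds to placing the held block back above x, which the second rule rules out because n would change.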
Example 5: The Grid domain
• The Grid domain involves an agent that moves in a rectangular grid of arbitrary but finite size, and its goal is to reach a specific cell in the grid. Any instance in Grid is encoded with objects for the cells in the grid, and a unary predicate pos(c) that is true when the position of the agent is cell c. The class of Grid problems is denoted by $Q_{Grid}$.
The Grid2 domain is like Grid except that positions are encoded with horizontal and vertical coordinates. For an m × n grid, there are m (resp. n) objects to encode the vertical (resp. horizontal) coordinates, and two unary predicates hpos(h) and vpos(v) to encode that the agent is at column h and row v in the grid. The class of Grid2 problems is denoted by $Q_{Grid2}$.
• Observe that each reachable state in a problem P in $Q_{Grid}$ (resp. $Q_{Grid2}$) makes true exactly 1 (resp. 2) atoms. Since all problems are solvable, $w(Q_{Grid}) \le 1$ and $w(Q_{Grid2}) \le 2$. On the other hand, problems in $Q_{Grid}$ cannot be solved in one step, in general, and thus $w(Q_{Grid}) = 1$.
• For either encoding, an optimal general policy can be expressed with a single rule over the singleton Φ = {d} of features where d measures the distance to the goal cell:
    {d > 0} → {d↓}
Clearly, any transition (s, s′) compatible with the rule moves the agent one step closer to the goal, and the policy optimally solves any problem in $Q_{Grid}$ or $Q_{Grid2}$.

Example 6: The Delivery domain
• The Delivery domain D involves an agent that moves in a rectangular grid, and packages spread over the grid that can be picked up or dropped by the agent, which can hold one package at a time. A state for the domain thus specifies the position of the agent (i.e., a cell on the grid), and positions for the different packages, where the position of a package can be either a cell in the grid, or the agent's gripper. The task of the agent is to "deliver" all packages, one at a time, to a designated "target" cell.
The position of the agent can be encoded using a single unary predicate as for the Grid domain, or two unary predicates as for the Grid 2 domain.For simplicity, we assume an encoding of positions using a single predicate.
An instance for Delivery consists of objects for the different cells in the grid, and objects for the different packages. The atoms are as follows: pos(c) (resp. ppos(p, c)) that is true when the agent (resp. the package p) is in cell c, holding(p) that is true when the agent is holding package p, and empty that is true when the agent holds no package. Under this encoding, the goal can be expressed as the conjunction $\bigwedge_i ppos(p_i, target)$ where $p_i$ is the i-th package, target is the target cell, and the conjunction is over all the packages.
• The class of all Delivery problems is denoted by Q_D, while Q_D1 denotes the class of Delivery problems with exactly one package. Below we show that the width of Q_D is unbounded, and w(Q_D) > 1. On the other hand, if s is a state in a problem P in Q_D1, s either makes true exactly two atoms of form pos(c) and holding(p), for the unique package p, or makes true exactly three atoms of form pos(c), ppos(p, c), and empty. However, the atom empty is determined by the atoms ppos(p, c) and holding(p): holding(p) ⇒ ¬empty, and ppos(p, c) ⇒ empty for any cell c. This fact allows the construction of an admissible set T of size 2 for any problem P in Q_D1.
• A set of meaningful features for Delivery is Φ = {H, p, t, u} that capture whether the agent is holding a package, the distance to a nearest undelivered package (zero if the agent is holding a package, or no package to be delivered remains), the distance to the target cell, and the number of undelivered packages, respectively. The following set R of rules defines a policy π_R: The first rule says that when holding no package and undelivered packages remain in the grid, the agent must move closer to a nearest undelivered package. The second says that when the agent holds no package and is in the same cell as an undelivered package, the agent must pick the package. The third says that when holding a package and the agent is not at the target cell, it must move closer to the target cell. The last rule says that when the agent holds a package and is at the target cell, it must drop the package at the cell (i.e., deliver it).
It is not difficult to see that the policy π_R solves any problem P in both Q_D and Q_D1.
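The four rules described above can be illustrated with a minimal simulation. This is only a sketch under simplifying assumptions that are not part of the text: the grid is a one-dimensional corridor, and the function name run_delivery_policy and the action labels are hypothetical. The features H, p, and t are computed as described in the example, and u corresponds to the number of packages not yet dropped at the target.

```python
def run_delivery_policy(agent, target, packages):
    """Follow the four Delivery rules on a 1-D corridor (illustrative assumption).

    agent and target are integer cells; packages lists the cells of the
    packages still to be delivered. Returns the sequence of action labels.
    """
    packages = [c for c in packages if c != target]  # already at the target
    holding = False
    plan = []
    while packages or holding:  # u > 0: some package is still undelivered
        H = holding
        p = 0 if H or not packages else min(abs(agent - c) for c in packages)
        t = abs(agent - target)
        if not H and p > 0:
            # Rule 1: move closer to a nearest undelivered package.
            nearest = min(packages, key=lambda c: abs(agent - c))
            agent += 1 if nearest > agent else -1
            plan.append("move")
        elif not H and p == 0:
            # Rule 2: pick the package in the current cell.
            packages.remove(agent)
            holding = True
            plan.append("pick")
        elif H and t > 0:
            # Rule 3: move closer to the target cell.
            agent += 1 if target > agent else -1
            plan.append("move")
        else:
            # Rule 4: drop (deliver) the package at the target cell.
            holding = False
            plan.append("drop")
    return plan
```

For instance, with the agent at cell 0, the target at cell 3, and a single package at cell 1, the rules fire in the order move, pick, move, move, drop.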

Envelopes
We address next the relation between general policies and problem width by introducing the notion of envelopes. An envelope for a binary relation µ over the states in a problem P is a set of reachable states E that obeys certain closure properties:
Definition 17 (Envelopes). Let P be a problem, and let µ be a binary relation on states(P). A subset E ⊆ states(P) is an envelope of µ for P, or simply a µ-envelope, iff
1. the initial state s_0 of P belongs to E, and
2. if s is a non-goal state in E, there is a state s′ in E such that (s, s′) is a µ-transition; i.e., a state transition in P that is also in µ.
The envelope is backward-closed, abbreviated as closed, iff for each state s′ in E that is not the initial state, there is a state s in E such that (s, s′) is a µ-transition. E is an optimal envelope iff each maximal µ-trajectory that is contained in E and starts at a non-goal state s in E is a suffix of an optimal trajectory for P.
Notice that if π is a policy that solves P, the set of states reached by π on the way to the goal forms a µ-envelope for µ = π, and similarly, the set of states in any non-empty set of π-trajectories also forms a µ-envelope. These envelopes are actually closed, and indeed, for any state s in these envelopes there is a µ-trajectory that reaches the goal and passes through s. If the policy π is optimal for P, the envelopes based on the relation µ = π are optimal too.
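The two envelope conditions and the closedness condition can be checked mechanically on a small explicit state space. The following is a minimal sketch; the function names and the toy transition set are illustrative assumptions, and mu is assumed to contain only state pairs that are transitions of the problem (i.e., µ-transitions).

```python
def is_envelope(E, s0, goals, mu):
    """Check conditions 1 and 2 of a µ-envelope for the state set E."""
    if s0 not in E:
        return False
    # Every non-goal state in E needs a µ-transition that stays inside E.
    return all(s in goals or any((s, s2) in mu for s2 in E) for s in E)


def is_closed(E, s0, mu):
    """Check backward-closedness: every non-initial state has a µ-predecessor in E."""
    return all(s2 == s0 or any((s, s2) in mu for s in E) for s2 in E)


# Toy µ-transitions: 0 -> 1 -> 2, 0 -> 3 -> 2, and 4 -> 2; state 2 is the goal.
mu = {(0, 1), (1, 2), (0, 3), (3, 2), (4, 2)}
goals = {2}
```

Here {0, 1, 2} is a closed envelope, {0, 2} is not an envelope because state 0 has no µ-successor inside the set, and {0, 1, 2, 4} is an envelope that is not closed because state 4 has no µ-predecessor in it.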
Envelopes can be defined in ways that do not involve policies. In particular, cost-envelopes are defined in terms of a binary relation µ = ≺_cost that appeals to cost considerations. For this, optimal state costs and the binary ≺_cost relation are defined as follows:
Definition 18 (Optimal state costs and cost relation). The cost of a state s in P, cost(s), is the cost (length) of a min-cost plan for reaching s from the initial state s_0, and ∞ if there is no such plan. The optimal cost of a state s in P, cost*(s), is cost(s) if s is a non-goal state, and the cost of P (i.e., the cost of an optimal plan for P) if s is a goal state. The cost relation ≺_cost is the set of state pairs (s, s′) from P such that cost*(s) < cost*(s′).
The reason that the optimal cost of goal states is defined as the cost of the problem P is to have a correspondence between goal-reaching trajectories that are optimal and ≺_cost-trajectories:
Lemma 19 (≺_cost-trajectories). Let P be a planning problem with initial state s_0, and let τ = s_0, s_1, ..., s_n be a goal-reaching state trajectory in P. Then, τ is a ≺_cost-trajectory iff τ is an optimal trajectory for P.
Proof. (⇒) Let us assume that τ is a ≺_cost-trajectory. We use induction on i to show that the prefix τ_i ≐ s_0, s_1, ..., s_i is an optimal trajectory for s_i, 0 ≤ i ≤ n. The claim is true for i = 0. Let us assume that it holds for i = k, and let us consider the trajectory τ_{k+1} leading to state s_{k+1}. By assumption, cost*(s_k) < cost*(s_{k+1}). On the other hand, the existence of the trajectory implies cost(s_{k+1}) ≤ 1 + cost(s_k). Therefore, cost(s_{k+1}) = 1 + cost(s_k) and the trajectory τ_{k+1} is optimal for s_{k+1}. Finally, τ must be an optimal trajectory for P since otherwise cost*(s_n) ≤ n − 1 = cost*(s_{n−1}), contradicting cost*(s_{n−1}) < cost*(s_n).
(⇐) Let us assume that τ is an optimal trajectory for P. By the principle of optimality, the trajectory τ_i must be optimal for s_i, 0 ≤ i < n. Therefore, cost(s_0) = 0 and cost(s_{i+1}) = 1 + cost(s_i), 0 ≤ i < n. Observe that s_n must be a closest goal state to s_0 and thus cost*(s_n) = cost(s_n). Hence, τ is a ≺_cost-trajectory.
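Lemma 19 can be checked on a small example. The sketch below computes cost(s) by breadth-first search and cost*(s) as in Definition 18, and tests whether a trajectory is a ≺_cost-trajectory, reading the relation as cost*(s) < cost*(s′) in line with the step "cost*(s_k) < cost*(s_{k+1})" used in the proof. The function names and the toy graph are illustrative assumptions, and the problem is assumed to be solvable.

```python
from collections import deque


def bfs_costs(s0, succ):
    """cost(s): length of a shortest path from s0 to each reachable state s."""
    cost = {s0: 0}
    queue = deque([s0])
    while queue:
        s = queue.popleft()
        for s2 in succ.get(s, ()):
            if s2 not in cost:
                cost[s2] = cost[s] + 1
                queue.append(s2)
    return cost


def optimal_costs(cost, goals):
    """cost*(s): cost(s) for non-goal states, the cost of the problem for goal states."""
    problem_cost = min(cost[g] for g in goals if g in cost)
    return {s: (problem_cost if s in goals else c) for s, c in cost.items()}


def is_cost_trajectory(traj, succ, cstar):
    """True iff consecutive states are transitions with strictly increasing cost*."""
    return all(b in succ.get(a, ()) and cstar[a] < cstar[b]
               for a, b in zip(traj, traj[1:]))


# Toy problem: a short route 0 -> 1 -> 2 and a longer route 0 -> 3 -> 4 -> 2.
succ = {0: [1, 3], 1: [2], 3: [4], 4: [2]}
goals = {2}
cstar = optimal_costs(bfs_costs(0, succ), goals)
```

The optimal trajectory 0, 1, 2 is a ≺_cost-trajectory, while the suboptimal trajectory 0, 3, 4, 2 is not, since cost*(4) = cost*(2) = 2.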
A property of cost-envelopes, i.e., ≺_cost-envelopes, is that they are optimal, and that optimal µ-envelopes are cost-envelopes independently of the relation µ:
Theorem 20 (Optimality of cost-envelopes). Let P be a planning problem, and let E be a subset of reachable states in P. If E is an optimal envelope for some binary relation µ on states(P), then it is a cost-envelope. If E is a cost-envelope, then it is an optimal envelope.
Proof. (⇒) Let us assume that E is an optimal envelope for some relation µ. We need to show that E satisfies the two conditions in Definition 17 for the relation ≺_cost. The first condition is direct since the initial state s_0 of P belongs to E as E is a µ-envelope. For the second condition, let s be a non-goal state in E. By the optimality of E, there is an optimal trajectory for P of form τ = s_0, s_1, ..., s_ℓ, ..., s_n such that s_ℓ = s and {s_ℓ, s_{ℓ+1}, ..., s_n} ⊆ E. Since τ is optimal, cost*(s_i) = i for i = 0, 1, ..., n, and thus the transition (s, s_{ℓ+1}) is a ≺_cost-transition with s_{ℓ+1} ∈ E.
(⇐) Let us assume that E is a cost-envelope, and let τ_1 = s_ℓ, s_{ℓ+1}, ..., s_n be a maximal ≺_cost-trajectory contained in E that starts at state s_ℓ. In particular, cost*(s_ℓ) < cost*(s_{ℓ+1}) < · · · < cost*(s_n), and s_n is a goal state by the maximality of τ_1. Since s_ℓ is a reachable state in P, there is an optimal trajectory τ_2 = s_0, s_1, ..., s_ℓ from s_0 to the state s_ℓ which satisfies, by its optimality, cost*(s_0) < cost*(s_1) < · · · < cost*(s_ℓ). Therefore, the combined trajectory τ = τ_2, τ_1 is a goal-reaching ≺_cost-trajectory. By Lemma 19, τ is an optimal trajectory for P that contains τ_1 as a suffix. Since τ_1 is arbitrary, the envelope E is optimal.
We will see that a relation between the width of a problem and optimal policies can be established by considering cost-envelopes defined by tuples of atoms.

Cost Envelopes and Problem Width
For a set T of reachable atom tuples over P , the set OP T (T ) is the set of min-cost states that reach the tuples in T : Definition 21 (Optimal states for T ).Let P be a problem, let t be a reachable atom tuple in P , and let T be a set of such tuples.OP T (t) is the set of optimal states for t in P , i.e., the set of min-cost states where t is true, and OP T (T ) .= ∪{OP T (t) : tuple t is in T }.
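Definition 21 can be made concrete with a few lines of code. The sketch below assumes that states are represented as frozensets of atoms and that a dictionary with the cost of every reachable state is already available; the function names and the 2×2 grid encoding with atoms ("x", i) and ("y", j) are illustrative assumptions.

```python
from itertools import product


def opt_states(t, cost):
    """OPT(t): the min-cost states among those in which every atom of t is true."""
    holders = [s for s in cost if set(t) <= s]
    if not holders:
        return set()
    best = min(cost[s] for s in holders)
    return {s for s in holders if cost[s] == best}


def opt_set(T, cost):
    """OPT(T): the union of OPT(t) over the tuples t in T."""
    result = set()
    for t in T:
        result |= opt_states(t, cost)
    return result


def state(x, y):
    return frozenset({("x", x), ("y", y)})


# 2x2 grid with the agent initially at (0, 0): cost(s) is the Manhattan distance.
cost = {state(x, y): x + y for x, y in product(range(2), repeat=2)}
singles = opt_set([(atom,) for s in cost for atom in s], cost)
pairs = opt_set([tuple(s) for s in cost], cost)
```

With tuples of size 1, the state where both coordinates differ from the initial ones is not in OPT(T), because each of its atoms is reached more cheaply elsewhere; with tuples of size 2 every state is optimal for its own pair of atoms.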
It turns out that OP T (T ) is a cost-envelope iff T is admissible: Theorem 22 (Admissibility and cost-envelopes).Let P be a planning problem, and let T be a set of reachable atom tuples in P .Then, T is admissible for P iff OP T (T ) is a cost-envelope.
Proof. (⇒) Let us assume that T is admissible for P, and let s_0 be the initial state of P. We need to show that OPT(T) is a cost-envelope:
1. Since T is admissible, there is a tuple t in T with s_0 ⊨ t. Hence, s_0 belongs to OPT(T).
2. Let s be a non-goal state in OPT(T); i.e., there is a tuple t in T with s ⊨ t, and there is an optimal trajectory τ for t that ends in s. Using that T is admissible, it is easy to show that τ can be extended into an optimal trajectory τ, s_1, ..., s_n for P, where the trajectories τ, s_1, ..., s_i are optimal for tuples t_i in T; i.e., the states s_1, s_2, ..., s_n all belong to OPT(T). On the other hand, by Lemma 19, τ, s_1, ..., s_n is a ≺_cost-trajectory. Hence, (s, s_1) is a ≺_cost-transition.
(⇐) Let us assume that OPT(T) is a cost-envelope. We need to show that T is admissible; namely, that the two conditions in Definition 1 hold for T.
1. Since OPT(T) is an envelope, it contains the initial state s_0 of P. Hence, there is a tuple t ∈ T such that s_0 ⊨ t.
2. Let τ be an optimal trajectory for a tuple t ∈ T that ends in a non-goal state s. We need to show that τ can be extended with a single step into an optimal trajectory for a tuple t′ ∈ T. Since OPT(T) is a cost-envelope and s ∈ OPT(T), there is a ≺_cost-trajectory s, s′, τ′ entirely contained in OPT(T) that starts at s, transitions to s′, and ends in a goal state. Hence, there is a tuple t′ ∈ T such that s′ ∈ OPT(t′), and the joined trajectory τ, s′, τ′ is a goal-reaching ≺_cost-trajectory that starts at s_0. By Lemma 19, the trajectory τ, s′ is optimal for s′, and thus for the tuple t′ in T.
Therefore, optimal plans for P can be found by running the IW(T) algorithm, provided that there is a subset of tuples T′ ⊆ T such that OPT(T′) is a cost-envelope:
Theorem 23. The algorithm IW(T) finds an optimal plan for P if OPT(T′) is a cost-envelope for some T′ ⊆ T. Likewise, IW(k) finds an optimal plan for P if OPT(T) is a cost-envelope for some set T of conjunctions of up to k atoms in P.
Proof. If OPT(T′) is a cost-envelope, T′ is admissible by Theorem 22, and IW(T) finds an optimal plan by Theorem 4. The second claim follows from the first and Theorem 6.
The width of a problem P can thus be related to the min size of a tuple set T for which OPT(T) is a cost-envelope for P:
Theorem 24. Let P be a planning problem, and let T be a set of atom tuples in P. If OPT(T) is a cost-envelope, w(P) ≤ size(T).

Optimal Policies and Problem Width
A sufficient condition for OP T (T ) to be a cost-envelope is for OP T (T ) to be a π-envelope for an optimal policy π for a class Q of problems that includes the problem P : Theorem 25 (Envelopes for optimal policies).Let P be a planning problem, let π be an optimal policy for a class Q that includes P , and let E be a subset of reachable states in P .Then, E is a cost-envelope in P if E is a closed π-envelope in P .
Proof. By Theorem 20, it is sufficient to show that E is an optimal envelope. However, this is direct since for any maximal π-trajectory τ_1 that is contained in E and that starts at some state s, by closedness, there is a π-trajectory τ_2 from the initial state of P to the state s. Hence, the trajectory τ = τ_2, τ_1 is a goal-reaching π-trajectory from the initial state that must be optimal as π is an optimal policy.
A closed π-envelope is an envelope formed by the states in a (non-empty) set of π-trajectories, starting in the initial state of the problem and ending at a goal state. The theorem implies that if OPT(T) is a closed π-envelope, then OPT(T) is a cost-envelope, and then from Theorem 24, that the width of P is bounded by the size of T:
Theorem 26 (Optimal policies and width). Let Q be a class of planning problems, and let π be an optimal policy for Q. Then,
1. If P is a planning problem in Q, and T is a set of atom tuples in P such that OPT(T) is a closed π-envelope, then w(P) ≤ size(T).
2. If k is a non-negative integer such that for any problem P in Q, there is a set T of atom tuples in P such that size(T ) ≤ k and OP T (T ) is a closed π-envelope, then w(Q) ≤ k.
Proof. The first claim follows by Theorems 24 and 25 as the former establishes that w(P) ≤ size(T) if OPT(T) is a cost-envelope, and the latter that OPT(T) is a cost-envelope if it is a closed π-envelope for an optimal policy π. The second claim follows from the first since for any problem P in Q, w(P) ≤ k by the first claim and the assumed T.
This theorem is important as it sheds light on the notion of width and why many standard domains have bounded width when atomic goals are considered. Indeed, in such cases, the classes of instances Q admit optimal policies π that in each instance P in Q can be "followed" by considering a set of tuples T over P. If OPT(T) is a closed π-envelope, it is possible to reach the goal of P optimally through a π-trajectory without knowing π at all: one can then pay attention to the set of tuples in T only and just run the IW(T) algorithm. Indeed:
Theorem 27. Under the assumptions of Theorem 26, IW(T) reaches the goal of P optimally, through a π-trajectory; i.e., there is a goal-reaching π-trajectory τ = s_0, s_1, ..., s_n seeded at the initial state s_0 of P such that all the states s_i are expanded by IW(T), except the goal state s_n that is selected for expansion but is not expanded.
Proof.As shown in the proof of Theorem 26, OP T (T ) is a cost-envelope given the assumptions and T is an admissible set for P .Therefore, by Theorem 4, IW(T ) finds an optimal path τ = s 0 , s 1 , . . ., s n for P ; that is, IW(T ) expands nodes n i that represent the prefixes τ i = s 0 , s 1 , . . ., s i for 0 ≤ i < n, and selects for expansion the node that represents the path τ .Since s n is a goal state, IW(T ) returns the path τ .
When the conditions in Theorem 26 hold, the IW(T ) algorithm reaches the goal of the problem P ∈ Q, optimally through a π-trajectory, even if the policy π or the features involved in the policy are not known.We say in this case, that the set of tuples T represents the policy π in the problem P , as the IW(T ) search delivers then a π-trajectory to the goal, and it is thus "following" the policy.Notice however that even in this case, it is not necessary for the set of tuples T to capture all the possible π-trajectories to the goal.It suffices for T to capture one such trajectory.The goal of P then can be reached optimally through IW(T ) by expanding no more than |T | nodes (cf.Theorem 5).
It is also not necessary for the set of tuples T to be known for solving the problem P optimally.If the conditions in Theorem 26 hold for some set T of tuples of size k, i.e., k = size(T ), the algorithm IW(k) will deliver an optimal π-trajectory to the goal as well.On the other hand, if there is an upper bound k on the size of the tuples, but its value is not known, the algorithm IW would solve the problem in time and space exponential in k but not necessarily through an optimal π-trajectory, because non-optimal solutions can potentially be found by IW(k ′ ) when k ′ < k (cf.Theorem 7).
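The role of the bound k can be illustrated with a compact rendering of IW(k) as a breadth-first search that prunes every generated state that makes no new tuple of at most k atoms true. This is only a sketch and not the paper's pseudocode: the function names, the representation of states as frozensets of atoms, and the 3×3 grid encoding with atoms ("x", i) and ("y", j) are assumptions made for the illustration.

```python
from collections import deque
from itertools import combinations


def iw(k, init, is_goal, successors):
    """Breadth-first search that keeps a state only if it is novel for size <= k."""
    seen = set()

    def is_novel(state):
        novel = False
        for size in range(1, k + 1):
            for t in combinations(sorted(state), size):
                if t not in seen:
                    seen.add(t)
                    novel = True
        return novel

    is_novel(init)  # record the tuples made true by the initial state
    queue = deque([(init, [])])
    while queue:
        state, plan = queue.popleft()
        if is_goal(state):
            return plan
        for action, nxt in successors(state):
            if is_novel(nxt):
                queue.append((nxt, plan + [action]))
    return None


def grid_successors(state, size=3):
    """Moves of an agent in a size x size grid; each state makes two atoms true."""
    pos = dict(state)
    x, y = pos["x"], pos["y"]
    moves = {"right": (x + 1, y), "left": (x - 1, y),
             "up": (x, y + 1), "down": (x, y - 1)}
    return [(a, frozenset({("x", nx), ("y", ny)}))
            for a, (nx, ny) in moves.items()
            if 0 <= nx < size and 0 <= ny < size]


start = frozenset({("x", 0), ("y", 0)})
goal = frozenset({("x", 2), ("y", 2)})
plan1 = iw(1, start, lambda s: s == goal, grid_successors)
plan2 = iw(2, start, lambda s: s == goal, grid_successors)
```

Under this encoding, the search with k = 1 prunes every state whose row and column both differ from those already seen and fails to reach the opposite corner, while the search with k = 2 returns a shortest plan of four moves.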

Lower Bound on Width
Finally, the following result provides a lower bound on width: Theorem 28 (Necessary conditions for bounded width).Let P be a planning problem, let k be a non-negative integer, and let T be the set of all atom tuples in P of size at most k.If for every optimal trajectory τ for P , τ has a state that is not in OP T (T ), then w(P ) > k.
Proof. We show the contrapositive of the claim. Namely, if w(P) ≤ k, then there is an optimal trajectory τ for P that is entirely contained in OPT(T). Thus, let P be a problem such that w(P) ≤ k, and let T′ be an admissible set for P with size(T′) ≤ k. We construct an optimal trajectory τ = s_0, s_1, ..., s_n inductively, using the admissible set T′. Indeed, T′ contains a tuple t_0 that is made true by the initial state s_0, and thus s_0 ∈ OPT(T′). For the inductive step, assume that we have already constructed the prefix τ_i = s_0, s_1, ..., s_i such that τ_i ⊆ OPT(T′) and τ_i is an optimal trajectory for s_i; in particular, there is a tuple t_i in T′ such that s_i ∈ OPT(t_i). By admissibility, τ_i can be extended with one step into an optimal trajectory for a tuple t′ in T′; i.e., there is a state s_{i+1} and a tuple t_{i+1} such that s_{i+1} is in OPT(t_{i+1}) and τ_{i+1} ≐ τ_i, s_{i+1} is an optimal trajectory for t_{i+1}. By definition of admissible sets, this process can be continued until τ_i becomes an optimal trajectory for P. At such moment, by construction, τ_i ⊆ OPT(T′). To finish the proof, observe that T′ ⊆ T since size(T′) ≤ k, and hence OPT(T′) ⊆ OPT(T).
Example 7: Width for the class Q_On
• Let P be a problem in Q_On where in the initial state the blocks x and y are not clear and in different towers.
• Let T be the set of all atoms in P (i.e., all atom tuples of size 1). Any optimal trajectory for P contains a state s in which both x and y are clear, just before moving x on top of y. This state s is not in OPT(T); e.g., it does not belong to OPT(clear(x)) because any optimal trajectory for clear(x) does not move any block above y. By Theorem 28, w(P) > 1 and thus w(Q_On) > 1. As we saw before, w(Q_On) ≤ 2. Hence, w(Q_On) = 2.
Example 8: Width for Grid problems (classes Q_Grid and Q_Grid2)
• The states in problems for Q_Grid (resp. Q_Grid2) make true exactly one (resp. two) atoms; i.e., the subset of atoms that identify the position of the agent.
• Let π be an optimal policy for Q_Grid (resp. Q_Grid2), let P be a problem in Q_Grid (resp. Q_Grid2), and let τ be a goal-reaching π-trajectory seeded at the initial state of P. Let us consider the set of atom tuples T ≐ {atoms(s) : s ∈ τ} where atoms(s) refers to the subset of atoms made true at the state s. It is easy to show that OPT(T) is a closed π-envelope, and size(T) = 1 (resp. size(T) = 2). Therefore, by Theorem 26, w(Q_Grid) ≤ 1 and w(Q_Grid2) ≤ 2.
• Likewise, w(Q_Grid) = 1 as there are problems that require more than one step.
• Let P be a problem in Q_Grid2 in which the initial and target cells are neither in the same row nor same column. Any optimal trajectory for P contains some state s where the agent is at a row and column different than those at the initial state. Such a state does not belong to OPT(T), where T is the set of all atoms in P. By Theorem 28, w(P) > 1; together with the upper bound above, w(Q_Grid2) = 2.
Example 9: Width for Delivery problems (classes Q_D1 and Q_D)
• Problems in Q_D1 involve the agent and a single package. We showed before that w(Q_D1) = 2.
• The class Q D has unbounded width, however.The intuition is that any admissible set of atom tuples must "track" the position of an arbitrary number of packages, those that have been already delivered.We make this intuition formal in the following.
• Let P be a problem in Q D with k + 2 packages, let π be an optimal policy for P , and let T be the set of all atom tuples in P of size at most k.Without loss of generality, we assume that the packages p 1 , p 2 , . . ., p k+2 must be delivered in such an order to guarantee optimality.
• Let s be the state along an optimal trajectory τ for P in which the packages p 1 , p 2 , . . ., p k+1 have been delivered, and the agent is at the target cell (just after delivering p k+1 ).We now show that s does not belong to OP T (T ).Indeed, the tuples in T can only track the position of at most k objects in the set {p 1 , p 2 , . . ., p k+1 } of k + 1 objects.Hence, no matter which tuple t in T is considered, the state s does not belong to OP T (t) as no state in OP T (t) has k + 1 distinct packages at the target cell.Therefore, by Theorem 28, w(P ) > k.
• Since Q D contains problems with an arbitrary number of packages, w(Q D ) = ∞.

Algorithm IW Φ
The results above shed light on the power of the algorithms IW(T) and IW(k) for problems that belong to classes Q for which there is a general optimal policy π. Indeed, if the set of optimal T-states, OPT(T), forms a closed π-envelope, the algorithm IW(T) is complete and optimal for P, and moreover, reaches the goal of P through π-trajectories, even if the policy or its features are not known. There is, however, a variant of IW(T) that does not use tuples of atoms or other particular details about the encoding for P, and uses instead a set Φ of features. The new algorithm, called IW_Φ and shown in Fig. 3, is like IW(T) but works with feature valuations over Φ rather than with atom tuples. That is, IW_Φ does a breadth-first search that prunes the states s whose feature valuation f(s) has been seen before during the search.
The question is the following.Assuming that π is an optimal rule-based policy that solves a class Q that includes P , and that Φ is the set of features used by the policy: does IW Φ solve P optimally?
It turns out that without extra conditions, the answer to this question is no. One reason is that a policy π may solve a problem by using a number of feature valuations that is smaller (possibly, exponentially smaller) than the number of states required to reach the goal. In these cases, the IW_Φ search cannot get to the goal because any plan must involve sequences where the same feature valuation repeats. For example, a policy for the Gripper domain where a number of balls have to be carried from Room A to Room B, one by one, can be defined in terms of a set Φ of three Boolean features encoding whether the robot is in Room A, whether there are balls still left in Room A, and whether a ball is being held by the robot. The number of possible feature valuations is 8 but the length of the plans grows linearly with the number of balls.
Algorithm 3: IW_Φ Search
Input: Planning problem P with N atoms
Input: Set Φ of features, and function f to compute its valuation on states in P
Initialize hash table T for storing feature valuations
Initialize FIFO queue Q on which enqueue and dequeue operations take constant time
Enqueue node n_0 for the initial state s_0 of P
While Q is not empty:
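The behaviour described above, a breadth-first search that prunes states whose feature valuation has already been seen, can be sketched in a few lines, together with the Gripper observation. This is a minimal illustration rather than the paper's algorithm listing: the function names, the simplified Gripper encoding (a state is a tuple with the robot's room, the number of balls in Room A, whether a ball is held, and the number of balls in Room B), and the action labels are all assumptions.

```python
from collections import deque


def iw_phi(init, is_goal, successors, features):
    """Breadth-first search pruning states whose feature valuation was seen before."""
    seen = {features(init)}
    queue = deque([(init, [])])
    while queue:
        state, plan = queue.popleft()
        if is_goal(state):
            return plan
        for action, nxt in successors(state):
            valuation = features(nxt)
            if valuation not in seen:
                seen.add(valuation)
                queue.append((nxt, plan + [action]))
    return None


def gripper_successors(state):
    """Simplified one-arm Gripper: move between rooms, pick in A, drop anywhere."""
    room, in_a, holding, in_b = state
    result = [("move", ("B" if room == "A" else "A", in_a, holding, in_b))]
    if room == "A" and in_a > 0 and not holding:
        result.append(("pick", ("A", in_a - 1, True, in_b)))
    if holding:
        dropped = (("A", in_a + 1, False, in_b) if room == "A"
                   else ("B", in_a, False, in_b + 1))
        result.append(("drop", dropped))
    return result


def boolean_features(state):
    """The three Boolean features: robot in A, balls left in A, ball being held."""
    room, in_a, holding, _ = state
    return (room == "A", in_a > 0, holding)


def counting_features(state):
    """Variant in which the second feature counts the balls left in Room A."""
    room, in_a, holding, _ = state
    return (room == "A", in_a, holding)


def solve(balls, features):
    init = ("A", balls, False, 0)
    return iw_phi(init, lambda s: s[3] == balls, gripper_successors, features)
```

In this encoding, the three Boolean features suffice for a single ball, but with two balls the search runs out of unseen valuations before the goal is reached; replacing the second feature with a counter restores completeness at the cost of more valuations.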
We have the tools at our disposal however to provide conditions that ensure that the algorithm IW Φ solves any problem P ∈ Q optimally, if the policy π does.For a set of feature valuations F over the features in Φ, let OP T (F ) stand for the set of min-cost states s in P with feature valuation f (s) in F .Sufficient conditions that ensure the completeness and optimality of IW Φ can be expressed as follows: Theorem 29 (Completeness and optimality of IW Φ ).Let Φ be a set of features for a planning problem P , and let F be a set of feature valuations over Φ.Then, 1.If OP T (F ) is a cost-envelope, IW Φ finds an optimal plan for P .
2. If π is an optimal policy for a class Q, and OP T (F ) is a closed π-envelope for P in Q, IW Φ finds an optimal plan for P .
In either case, if the features in Φ are linear (cf.Definition 13), IW Φ finds a plan of length O(N ℓ ) using O(bN ℓ ) time, where ℓ is the number of numerical features in Φ, N is the number of atoms in P , and b bounds the branching factor in P .
Proof. Essentially, the proof involves a similar but more complex invariant than the one used in the proof for the completeness of IW(T) (cf. Theorem 4).
where the third equality is by the principle of optimality.Hence, the invariant holds for the next iteration. d We now show that the path found by IW Φ is optimal.Let n * be the last node dequeued by IW Φ ; i.e., n * [state] is a goal state.There are two complementary cases: ).Then, the path leading to n * is an optimal trajectory for P by Lemma 19.Notice that IW Φ provides complexity bounds for the solvability of classes of problems, independently of the underlying structure of states.That is, IW Φ does not know about atoms at all, only about feature valuations.Two encodings that are different syntactically but equivalent semantically will yield the same behaviour in IW Φ ; e.g., two encodings of Blocksworld, one in which clear is a primitive predicate and one where it is a derived predicate.
The set Φ of features in Theorem 29.2 does not need to be the set of features on which the policy π is defined.If F is the set of feature valuations reached by π, and Reachable(π, P ) is the set of states reachable by using π in P , IW Φ is guaranteed to find an optimal plan when π is an optimal policy, and OP T (F ) = Reachable(π, P ): Theorem 30.Let π be a rule-based policy that optimally solves a problem P , and that is defined over the features in Φ, and let F be the set of feature valuations reached by π in P .If OP T (F ) = Reachable(π, P ), then OP T (F ) is a closed π-envelope, and thus IW Φ finds an optimal plan for P .
Proof.By Theorem 29, it is enough to show that OP T (F ) is a closed π-envelope.
Clearly, the initial state s 0 belongs to OP T (F ).If s is a state in OP T (F ), then 1) there is a π-trajectory for s (as s is π-reachable), 2) there is a π-transition (s, s ′ ) (as π solves P ), and 3) s ′ belongs to OP T (F ) (as s ′ is reached by π).Closedness is direct by (1).
If different states have different feature valuations, IW Φ reduces to a breadth-first search, as only nodes for duplicate states are pruned.The interesting uses of the theorem however are on settings where the number of possible feature valuations is exponentially smaller than the number of states.We now present two examples involving the algorithm IW Φ .
Example 10: IW Φ search on Q Clear • Let us consider the policy π for Q Clear previously defined in Example 4 over the set of features Φ = {H, n}.Fix a problem P in Q Clear , and let F be the set of feature valuations reached by π on P .The policy π is optimal and OP T (F ) = Reachable(π, P ).By Theorem 30, IW Φ finds an optimal plan for problem P .Notice that if P has N blocks, there are 2N different feature valuations, but an exponential number of configurations for the N blocks.
Example 11: The Marbles domain
• Marbles is not a STRIPS domain as the goal must be specified with negative literals; i.e., ∧_b ¬ontable(b) where the conjunction is over all boxes b in the problem.
• Yet there is a simple optimal policy π for Q_M defined over the set Φ = {n, m} of numerical features where n counts the number of boxes still on the table, and m counts the number of marbles in the "first box" among those still on the table, where a static ordering of the boxes is assumed. A general optimal policy π over Φ can be expressed as follows: The first rule says to remove a marble from the first box on the table when such a box is not empty, while the second rule says to take an action that decreases the number of boxes on the table. The policy π solves any problem in Q_M or Q_M1 optimally, as any optimal plan must execute a number of actions that is equal to the total number of marbles in all boxes plus the number of boxes.
• For analyzing the algorithm IW Φ over instances in Q M , let P be a problem in Q M , and let F be the set of feature valuations reached by π on P .The policy π is optimal for Q M and OP T (F ) = Reachable(π, P ).By Theorem 30, IW Φ finds an optimal plan for P in O(N 3 ) time, where N is the number of atoms in P , since the features in Φ are linear, and the branching factor in P is bounded by N .
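The two rules described for Marbles can be simulated directly over the features n and m. In this sketch, an instance is assumed to be given as a list with the number of marbles in each box, in the static order of the boxes; the function name and the action labels are hypothetical.

```python
def run_marbles_policy(boxes):
    """Follow the two Marbles rules; boxes[i] is the number of marbles in box i."""
    boxes = list(boxes)
    plan = []
    while boxes:  # n > 0: some box is still on the table
        m = boxes[0]  # marbles in the first box still on the table
        if m > 0:
            # Rule 1: remove a marble from the first box (m decreases).
            boxes[0] -= 1
            plan.append("remove-marble")
        else:
            # Rule 2: remove the empty first box from the table (n decreases).
            boxes.pop(0)
            plan.append("remove-box")
    return plan
```

For boxes holding 2, 0, and 3 marbles, the resulting plan has 5 + 3 = 8 actions, matching the count of the total number of marbles plus the number of boxes.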

Serializations
The problem of subgoal structure is critical in classical planning, hierarchical planning, and reinforcement learning although in most cases the problem has not been addressed formally.
We draw on the language for general policies to express decompositions into subproblems, and on the notion of width for expressing and evaluating such decompositions and the subgoal structures that result.
We start with the notion of serializations which are defined semantically as binary relations on states that are acyclic.
Definition 31 (Serializations).A serialization over a collection of problems Q is a binary relation '≺' over the states in ∪ P ∈Q states(P ) that is acyclic in Q, meaning that there is no set {s 0 , s 1 , . . ., s n } of reachable states in P , where s 0 is the initial state, such that s i+1 ≺ s i for i = 0, . . ., n − 1, and s j ≺ s n for some 0 ≤ j ≤ n.
For binary relations '≺' that express serializations, the notation s ′ ≺ s expresses in infix form that the state pair (s, s ′ ) is in '≺'.There is no assumption that the state pair (s, s ′ ) is a state transition; namely, that s ′ is at one step from s, and the meaning of s ′ ≺ s is that s ′ is a possible subgoal state from state s.
A serialization '≺' over Q splits a problem P in Q into subproblems.For a reachable state s in P , the subproblem P ≺ [s] is like P but with two changes: the initial state is s, and the goal states are the states s ′ such that s ′ is a goal state of P , or s ′ ≺ s.
Definition 32 (Subproblems). Let ≺ be a serialization over a class Q, and let P be a problem in Q. The class of subproblems induced by ≺ on P, denoted by P_≺, is the smallest class that satisfies:
1. P_≺[s_0] is in P_≺ for the initial state s_0 of P, and
2. P_≺[s′] is in P_≺ if P_≺[s] is in P_≺, s′ ≺ s, s′ is not a goal state, and either a) s′ is a successor state of s, or b) s′ is a reachable state from s, and there is no successor s′′ of s that is either a goal state of P or s′′ ≺ s.
We say that the subproblem P ≺ [s] induces the subproblem P ≺ [s ′ ].
Cases 2a and 2b ensure that a non-successor state of s is regarded as a possible subgoal from s only when no successor state of s is subgoal from s.In other words, while the possible subgoal states from s do not have to be at a minimum distance from s, they have to be at distance 1 when there is a subgoal state that is at such a distance.
Intuitively, a serialization is "good" if it results in subproblems that have small, bounded width that can be solved greedily in the way to the goal.
Definition 33 (Serialized width).Let '≺' be a serialization over a class of problems Q, and let P ∈ Q.Then, 1.The (serialized) width of P , denoted as w ≺ (P ), is the minimum non-negative integer k that bounds the width w(P ≺ [s]) of all the subproblems P ≺ [s] in P ≺ .
2. The (serialized) width of Q, denoted as w ≺ (Q), is the minimum non-negative integer k that bounds the serialized width w ≺ (P ) of the problems P in Q.
Starting from a problem P in Q, a serialization may lead to a state s if the subproblem P_≺[s] belongs to the class P_≺ of subproblems induced by ≺. Hence, if for a dead-end state s, the subproblem P_≺[s] belongs to P_≺, by definition, w(P_≺[s]) = ∞ and thus w_≺(Q) = ∞ as well.
Interestingly, serializations of zero width are policies, and vice versa, policies are serializations of zero width.
Theorem 34 (Zero-width serializations and policies).Let Q be a class of problems, and let ≺ be a binary relation on ∪ P ∈Q states(P ).Then, ≺ is a serialization of zero width for Q iff ≺ is a policy that solves Q.
Proof.In this proof, we write s ≺ s ′ to denote that the pair (s, s ′ ) is in the relation ≺, either when ≺ denotes a serialization or a policy.
(⇒) Assume that ≺ is a serialization of zero width for Q.Clearly, ≺ is a policy as it is a binary relation on ∪ P ∈Q states(P ).It remains to show that for any problem P in Q, every maximal ≺-trajectory seeded at the initial state s 0 of P is goal reaching.
Let τ = s_0, s_1, ..., s_n, ... be one such trajectory; i.e., s_i ≺ s_{i+1} for i ≥ 0. Since states(P) is a finite set and ≺ is acyclic in P, τ must be of finite length. Let us assume that it ends at s_n. If s_n is not a goal state, the subproblem P_≺[s_n] belongs to P_≺ by Definition 32. Then, w(P_≺[s_n]) = 0 since ≺ is of zero width. This means that there is a successor state s′ of s_n such that s_n ≺ s′. Hence, τ is not a maximal trajectory, contradicting the assumption. Therefore, s_n must be a goal state.
line 5 of SIW_≺ takes O(bN^(2k−1)) time and O(bN^k) space, where N is the number of atoms, and b bounds the branching factor in P.
Proof. Let s_0 be the initial state of a problem P in Q, and let τ = s_0, s_1, ..., s_n be a state sequence in P where all states, except perhaps s_n, are non-goal states. We say that τ is a ≺-sequence iff for each index 0 ≤ i < n, the state s_{i+1} is reachable from the state s_i and s_{i+1} ≺ s_i, but if the state s_{i+1} is not a successor of s_i, then s_i has no successor s′ such that s′ ≺ s_i. Observe that if τ is a ≺-sequence, then a simple inductive argument shows that the subproblem P_≺[s_i] induces the subproblem P_≺[s_{i+1}], 0 ≤ i < n. If s_n is a goal state, the sequence is called a ≺-solution.
If τ = s_0, s_1, ..., s_n is a ≺-sequence for P, there is a ≺-solution τ′ for P that extends τ. Indeed, if s_n is a goal state, τ is already a ≺-solution. Otherwise, the subproblem P_≺[s_n] belongs to P_≺. Since w_≺(P) ≤ k, there is a state s_{n+1} reachable from s_n that is either a goal state, or s_{n+1} ≺ s_n; i.e., the sequence τ, s_{n+1} is a ≺-sequence. Iterate until finding an extension τ′ of τ that ends in a goal state, which can be done because ≺ is acyclic, and the number of states in P is finite.
It is easy to see that at the start of each iteration of the loop, SIW_≺ has discovered a ≺-sequence τ = s_0, s_1, ..., s_n that ends at the current state s_n = s. Hence, since ≺ is acyclic, the loop eventually ends. The time and space bounds for each call of IW in line 5 of SIW_≺ follow directly from Theorem 4.
However, even if the subproblems are solved greedily and in polynomial time by IW, the total number of calls to IW, and hence the total running time of SIW_≺, cannot be bounded without extra assumptions on the structure of the serialization. Indeed, there are serializations that split a problem into an exponential number of subproblems, like the Hanoi example below. However, once we move to serializations expressed by means of rules akin to those used to express policies, we will be able to provide conditions and bound the running time of SIW_≺.
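The loop analysed in this proof can be sketched as follows: from the current state s, search for a nearest state that is either a goal state or a subgoal s′ with s′ ≺ s, move there, and repeat. This is only an illustration of the control structure, not the SIW_≺ listing itself; in particular, a plain breadth-first search stands in for the IW calls, and the function names and the toy one-dimensional problem are assumptions.

```python
from collections import deque


def bfs_to(start, is_target, successors):
    """Shortest path from start to a state satisfying is_target (stand-in for IW)."""
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        state, path = queue.popleft()
        if is_target(state):
            return state, path
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None


def siw(init, is_goal, successors, precedes):
    """Greedy serialized search; precedes(s2, s) is True iff s2 ≺ s."""
    current, plan = init, []
    while not is_goal(current):
        # Subproblem at `current`: reach a goal state or a state s2 with s2 ≺ current.
        found = bfs_to(current,
                       lambda s2: is_goal(s2) or precedes(s2, current),
                       successors)
        if found is None:
            return None
        current, segment = found
        plan += segment
    return plan


# Toy problem: an integer position on a line, the goal is position 5, and
# s2 ≺ s holds when s2 is strictly closer to the goal than s.
def line_successors(x):
    return [("inc", x + 1), ("dec", x - 1)]


def closer(s2, s):
    return abs(5 - s2) < abs(5 - s)
```

Because the search is breadth-first, a subgoal at distance 1 is always found before a more distant one, which is in line with cases 2a and 2b of Definition 32. In the toy problem every subproblem is solved in one step.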
Example 13: The Hanoi domain • Q Hanoi is the class of Towers of Hanoi problems involving 3 pegs, numbered from 0 to 2, and any number of disks, where the initial and goal states correspond to single towers at different pegs, respectively.Recently, Liu, Xu, Van den Broeck, and Liang (2023) refer to a general strategy that solves problems of moving a single tower from peg 0 to peg 2: Alternate actions between the smallest disk and a non-smallest disk.When moving the smallest disk, always move it to the left.If the smallest disk is on the first pillar, move it to the third one.When moving a non-smallest disk, take the only valid action.
• This strategy can be expressed as a rule-based policy using three Boolean features p i,j , 1 ≤ i < j ≤ 3, that are true if the top disk at peg i is smaller than the top disk at peg j.However, in order to account for the alternation of movements, it must be assumed that the planning encoding adds an extra atom e that is true initially, and that flips with each movement.Provided then with a Boolean feature q that tracks the value of e, a general policy for Hanoi can be expressed with the following set R of rules over the features Φ = {q, p 1,2 , p 1,3 , p 2,3 }:
• The policy π R defined by rules in R is thus general for the class Q HanoiOdd that contains the problems with an odd number of disks, initial situation with a single tower at peg 0, and goal situation with a single tower at peg 2. A general policy for Q Hanoi can be obtained by considering additional Boolean features that tell the parity of the number of disks, and the pegs for the initial and final towers.
• The policy π_R defines a serialization of zero width by Theorem 34. This serialization splits a problem P in Q_HanoiOdd into an exponential number of subproblems, as 2^n − 1 steps are needed to solve a Hanoi problem with n disks. The algorithm SIW_R solves any problem P in Q_HanoiOdd, but not in polynomial time.
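The alternation strategy quoted in the example can be checked on small instances with a minimal sketch such as the one below. The function name `solve_hanoi_odd`, the list-based peg encoding, and the Boolean flag `move_smallest` (which plays the role of the extra atom e) are illustrative assumptions, not part of the planning encoding discussed above. "Moving to the left" is read cyclically, so that a smallest disk on peg 0 goes to peg 2.

```python
def solve_hanoi_odd(n):
    """Run the alternation strategy on n disks (n odd); return the number of moves."""
    # Each peg is a list of disk sizes, bottom first; disk 1 is the smallest.
    pegs = [list(range(n, 0, -1)), [], []]
    moves = 0
    move_smallest = True  # flips after every move, like the atom e in the text
    while len(pegs[2]) != n:
        if move_smallest:
            # Move the smallest disk one peg to the left, wrapping from 0 to 2.
            src = next(i for i in range(3) if pegs[i] and pegs[i][-1] == 1)
            dst = (src - 1) % 3
        else:
            # The two pegs whose top disk is not the smallest admit one legal move.
            a, b = [i for i in range(3) if not (pegs[i] and pegs[i][-1] == 1)]
            if not pegs[a] or (pegs[b] and pegs[b][-1] < pegs[a][-1]):
                src, dst = b, a
            else:
                src, dst = a, b
        pegs[dst].append(pegs[src].pop())
        moves += 1
        move_smallest = not move_smallest
    return moves


for disks in (1, 3, 5):
    print(disks, solve_hanoi_odd(disks), 2 ** disks - 1)
```

Each move corresponds to one zero-width subproblem of the serialization, so the number of subproblems grows as 2^n − 1 even though every individual subproblem is trivial.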

Rule-Based Serializations: Sketches
As with policies, the binary relations that encode serializations can be compactly represented by means of rules. The syntax of the rules is exactly the syntax of policy rules, and the only difference is in the semantics of the rules, where state pairs (s, s′) are not limited to state transitions:

Definition 36 (Sketches). Let Q be a collection of problems, let Φ be a set of features for Q, and let R be a set of rules over Φ. The rules in R define the binary relation ≺_R over the states in ∪_{P∈Q} states(P) given by s′ ≺_R s iff the state pair (s, s′) is compatible with some rule in R.
The rules that define serializations are called sketch rules, and sets of such rules are called sketches.The sketch width of Q given a sketch R is the serialized width of Q under the serialization ≺ R defined by R.
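The test s′ ≺_R s of Definition 36 can be sketched in code as follows. The sketch assumes the usual semantics of feature-based rules C → E, which this section does not restate: conditions are of the form p, ¬p, n > 0, or n = 0; effects are of the form p, ¬p, p?, n↓, n↑, or n?; and features not mentioned in E must keep their value. The dictionary encoding and the names `compatible` and `precedes` are hypothetical, and the feature values in the example are made up.

```python
def compatible(rule, before, after):
    """True iff the valuation pair (before, after) is compatible with rule C -> E."""
    cond, effects = rule
    holds = {"T": lambda v: v is True, "F": lambda v: v is False,
             ">0": lambda v: v > 0, "=0": lambda v: v == 0}
    if not all(holds[c](before[x]) for x, c in cond.items()):
        return False
    checks = {"T": lambda old, new: new is True,     # effect p
              "F": lambda old, new: new is False,    # effect not-p
              "dec": lambda old, new: new < old,     # effect n-down
              "inc": lambda old, new: new > old,     # effect n-up
              "?": lambda old, new: True,            # effect p? or n?
              None: lambda old, new: new == old}     # unmentioned: unchanged
    return all(checks[effects.get(x)](before[x], after[x]) for x in before)


def precedes(rules, before, after):
    """s' precedes s under the rule set iff some rule is compatible with (s, s')."""
    return any(compatible(rule, before, after) for rule in rules)


# The rule {not H} -> {H, p?, t?} over the features {H, p, t, u}.
R2 = [({"H": "F"}, {"H": "T", "p": "?", "t": "?"})]
s = {"H": False, "p": 3, "t": 5, "u": 1}
s_holding = {"H": True, "p": 0, "t": 5, "u": 1}
s_u_changed = {"H": True, "p": 0, "t": 5, "u": 0}
print(precedes(R2, s, s_holding), precedes(R2, s, s_u_changed))
```

Under these assumptions the pair (s, s_holding) is accepted, while (s, s_u_changed) is rejected because the rule does not mention u and u changes.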
Definition 37 (Sketch width). Let R be a set of rules that define a serialization ≺_R over a class Q of problems. The sketch width of R over Q, denoted by w_R(Q), is the serialized width of Q under ≺_R.

Algorithm 5: SIW_R Search
1: Input: Sketch R that defines relation ≺_R
2: Input: Planning problem P in collection Q
3: Initialize state s to initial state s_0 for P
4: While s is not a goal state of P:
5:    Do an IW search from s to find s′ that is either a goal state or f(s′) ≺_R f(s) (i.e., the goal test in line 9 of IW(k) is augmented with f(s′) ≺_R f(s), where s′ is the state for the dequeued node n in line 7)
6:    If s′ is found, set s ← s′. Else, return FAILURE (Serialized width of P is ∞)
7: Return the path from s_0 to the goal state s (Solution found)

Figure 5: SIW_R is SIW_≺ with the serialization ≺_R induced by the sketch R, which is testable. The completeness and complexity of SIW_R are given in Theorem 40.

A rule-based policy π that solves Q is a sketch R for Q of zero width:
Theorem 38 (Rule-based policies and sketches). Let R be a set of rules defined in terms of a set of features Φ for a class Q of problems. Then, R is a rule-based policy that solves Q iff R is a sketch of zero width for Q.
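The loop of SIW_R (Algorithm 5) can be illustrated with the following minimal sketch. Everything about the toy problem is an assumption made for illustration: a single agent on a line of 5 cells, one package at cell 3, a target at cell 0, and a one-rule sketch that asks for the package to be held. The `iw` function is a simplified IW(k) that prunes a generated state when it adds no new atom tuple of size at most k; it is not claimed to match the IW(k) pseudocode of the paper line by line.

```python
from collections import deque
from itertools import combinations


def iw(start, successors, atoms, is_target, k):
    """Breadth-first search that prunes states adding no new atom tuple of size <= k."""
    seen = set()

    def is_novel(state):
        tuples = {t for size in range(1, k + 1)
                  for t in combinations(sorted(atoms(state)), size)}
        fresh = tuples - seen
        seen.update(fresh)
        return bool(fresh)

    is_novel(start)
    queue = deque([(start, [start])])
    while queue:
        state, path = queue.popleft()
        if is_target(state):
            return path
        for nxt in successors(state):
            if is_novel(nxt):
                queue.append((nxt, path + [nxt]))
    return None


def siw_r(start, successors, atoms, is_goal, better, k):
    """Chain IW(k) searches; each one stops at a goal or at a state s2 with better(s, s2)."""
    path, state = [start], start
    while not is_goal(state):
        sub = iw(state, successors, atoms,
                 lambda s2, s=state: is_goal(s2) or better(s, s2), k)
        if sub is None:
            return None  # the subproblem has width greater than k
        path += sub[1:]
        state = path[-1]
    return path


# Hypothetical toy problem: state = (agent cell, package cell or "H" held or "D" delivered).
TARGET, LENGTH = 0, 5

def successors(state):
    pos, pkg = state
    result = [(p, pkg) for p in (pos - 1, pos + 1) if 0 <= p < LENGTH]
    if pkg == pos:
        result.append((pos, "H"))   # pick up the package
    if pkg == "H" and pos == TARGET:
        result.append((pos, "D"))   # deliver at the target
    return result

def atoms(state):
    return [f"at={state[0]}", f"pkg={state[1]}"]

def is_goal(state):
    return state[1] == "D"

def better(s, s2):
    return s[1] != "H" and s2[1] == "H"   # one-rule sketch: {not H} -> {H}

print(iw((0, 3), successors, atoms, is_goal, 1))
print(siw_r((0, 3), successors, atoms, is_goal, better, 1))
```

In this toy instance the plain IW(1) search fails, because carrying the package back revisits cells whose atoms were already seen, whereas the sketch splits the problem into two subproblems that IW(1) solves one after the other.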

Algorithms
If R is a sketch of bounded width over a class Q, the problems in Q can be solved by the SIW_R algorithm, shown in Fig. 5, where s′ ≺_R s is tested by checking if some rule in R is compatible with the state pair (s, s′). However, to bound the complexity of SIW_R, a bound on the total number of subproblems that need to be solved is needed. A simple way to bound such a number is to require that the subgoal states s_0, ..., s_n in state sequences compatible with a sketch R have different feature valuations:

Definition 39 (Feature-acyclic sketches). Let Q be a class of problems, and let R be a set of rules for Q defined on a set Φ of features that define a binary relation ≺_R. The relation ≺_R, or simply R, is said to be feature-acyclic over Q if it is so for each problem P in Q, where the latter means that there is no set {s_1, s_2, ..., s_n} of reachable states in P such that

Clearly, if R is feature-acyclic over Q, then ≺_R is (state) acyclic over Q, and hence ≺_R is a serialization, and R is a sketch. The complexity bound for algorithm SIW_R follows:

Theorem 40 (Completeness of SIW_R). Let R be a feature-acyclic sketch for a class Q of problems of width bounded by k. SIW_R solves any problem P in Q in polynomial time (exponential only in k, not in the size of P). In particular, if the features are linear, P is solved by SIW_R in O(N^ℓ (N^{k+1} + bN^{2k−1})) time and O(bN^k) space, producing a plan of length O(N^{ℓ+k}), where N is the number of atoms in P, b bounds the branching factor in P, and ℓ is the number of numerical features in Φ.
Proof. The SIW_R algorithm is the SIW_≺ algorithm that uses the serialization ≺_R induced by the sketch R. Hence, by Theorem 35, SIW_R solves any problem P in Q, and each call to IW in line 5 of SIW_R takes O(bN^{2k−1}) time and O(bN^k) space.
As ≺_R is feature-acyclic, the number of subproblems to solve is bounded by the maximum number of feature valuations that can appear when solving P. In the case of linear features, this number is O(N^ℓ). For each expanded state in each call to IW, the values of the features are computed in O(|Φ|N) = O(N) time. Thus, the total running time of SIW_R is O(N^ℓ (N^{k+1} + bN^{2k−1})).
For the space required by SIW_R, since the solutions to the subproblems produced by IW do not need to be stored, the space complexity of SIW_R is the space complexity of the IW calls; namely, O(bN^k). The length of the overall plan, however, is bounded by the number of subproblems times their maximum possible length, which is O(N^{ℓ+k}).
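The bounds in the proof can be assembled term by term as in the worked derivation below. The O(N^k) bound on the number of states expanded by a single IW(k) call, and on the length of the plan it returns, is an assumption taken from the standard analysis of IW(k); it is not restated in this section.

```latex
\begin{align*}
\text{number of subproblems} &= O(N^{\ell}) \\
\text{feature evaluation per IW}(k)\text{ call} &= O(N^{k}) \cdot O(N) = O(N^{k+1}) \\
\text{time per IW}(k)\text{ call} &= O(N^{k+1}) + O(bN^{2k-1}) \\
\text{total time} &= O(N^{\ell}) \cdot O\big(N^{k+1} + bN^{2k-1}\big)
                   = O\big(N^{\ell}(N^{k+1} + bN^{2k-1})\big) \\
\text{plan length} &= O(N^{\ell}) \cdot O(N^{k}) = O(N^{\ell+k})
\end{align*}
```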
Example 14: Sketches for the Delivery domain
Table 1 contains different sets of rules over the set Φ = {H, p, t, u} of features for the Delivery domain. For each such set, the table indicates whether the set is feature-acyclic, and contains the sketch width for the classes Q_D1 and Q_D. The width is only specified for sets that are acyclic. We briefly explain the entries in the table without providing formal proofs, but all the details can be easily filled in with the results in the paper.
• R 0 is the empty sketch whose width is the same as the plain width.
• The rule {H} → {¬H, p?, t?} in R 1 does not help in initial states that do not satisfy H, and hence the width remains 2 and ∞ for Q D1 and Q D , respectively.
• The rule {¬H} → {H, p?, t?} in R 2 says that a state s where ¬H holds can be "improved" by finding a state s ′ where H holds, while possibly affecting p, t, or both.This rule splits every problem P in Q D1 into two subproblems: achieve H first and then the goal, reducing the sketch width of Q D1 to 1.
• The rule set R 3 is not acyclic and thus not a proper sketch.
• The sketch R 4 decomposes problems using the feature u that counts the number of undelivered packages, reducing the width of Q D to 2, but not affecting the width of Q D1 .The reduction occurs because each problem P in Q D is split into subproblems, each one for delivering a single package, similar to the problems in Q D1 .
• R 5 combines the rules in R 2 and R 4 .Each problem in Q D is decomposed into subproblems, each one like a problem in Q D1 , and each problem in Q D1 is further decomposed into two subproblems of width 1 each.The combined result is that the sketch width of Q D1 and Q D both get reduced to 1.
• The sketches R_6 and R_7 do not help to reduce the width for either class. The rule in R_6 generates subproblems of zero width until reaching a state where ¬H and p = 0 hold, for which the remaining problem has width 2 or ∞ for Q_D1 or Q_D, respectively. R_7, on the other hand, does not help as the initial states do not satisfy H.
• Finally, the sketch R 8 yields a serialization of zero width, and hence a full policy, where each subproblem is solved in a single step.

Table 1: Different sketches for the Delivery domain, one rule set per line. The table shows whether each rule set is feature-acyclic and gives upper bounds on the sketch width for the classes Q_D1 and Q_D of Delivery problems. The rule set R_3 is not a proper sketch as it is not acyclic; hence the entries marked as '-'. For feature-acyclic sketches of bounded width, SIW_R solves any instance in the class in polynomial time.

Acyclicity and Termination
The notion of acyclicity appears in three places in our study. First, if a policy π is closed and acyclic in a problem P, then π solves P. Second, serializations must be acyclic, as otherwise, even if subproblems have small, bounded width, the SIW_≺ procedure may get stuck in a cycle. Third, feature acyclicity has been used above to provide runtime bounds. Interestingly, there are structural conditions on the set of rules R that ensure that the resulting binary relation on pairs of states (s, s′) is feature-acyclic by virtue of the form of the rules and the features involved, independently of the domain. This is the case, for example, if R only contains the rules r_1 = {¬H} → {H, n↓} and r_2 = {H} → {¬H}. A sequence of states s_0, s_1, s_2, ... compatible with R cannot contain infinitely many state pairs (s_i, s_{i+1}) compatible with rule r_1, because such a rule requires feature n to decrease, but n cannot decrease below zero and no rule allows n to increase. Then, since rule r_1 cannot be "applied" infinitely often, neither can rule r_2, which requires r_1 to restore the truth of the condition H. This analysis is independent of the underlying planning problem and the semantics of the features.
The notion of termination as captured by the Sieve algorithm for QNPs (Srivastava, Zilberstein, Immerman, & Geffner, 2011b;Bonet & Geffner, 2020) can be used to check, among other things, that a rule-based policy or sketch is feature-acyclic.Indeed, if R is a terminating set of rules over the features in Φ, as determined by Sieve, s 0 , s 1 , s 2 , . . . is a state sequence compatible with R, and s i and s j , with i < j, are two states with identical valuation over the Boolean conditions defined by Φ (see below), then there is a numerical feature n such that its values satisfy n(s j ) < n(s i ) and n(s k ) ≤ n(s i ) for i ≤ k ≤ j.This condition ensures that R is feature-acyclic, and thus that it is acyclic over any class Q.
The Sieve algorithm, shown in Fig. 6, receives as input a directed and edge-labeled graph G = ⟨V, E, ℓ⟩, where the edge labels ℓ(e) contain effects over Boolean and numerical features; i.e., expressions of the form p, ¬p and p? for Boolean features p, and expressions of the form n↓, n↑, and n? over numerical features n. Sieve iteratively computes the strongly connected components (SCCs) of a graph G′, initially set to the input graph G, and removes edges from the graph until it becomes acyclic, or no more edges can be removed. The graph is accepted iff it becomes acyclic; otherwise, it is rejected. An edge e in a component C of G′ can be removed if some feature n is decreased in e (i.e., n↓ ∈ ℓ(e)), and is not increased in any other edge e′ in the same component (i.e., n↑ ∉ ℓ(e′) and n? ∉ ℓ(e′)).

6: Choose SCC T and numerical feature n that is decreased but not increased in T; i.e.,
   - T contains some edge e such that n↓ ∈ ℓ(e), and
   - T contains no edge e′ such that n↑ ∈ ℓ(e′) or n? ∈ ℓ(e′)
7: Remove the edges in T where n is decreased
8: until G′ is acyclic or no such SCC exists

Figure 6: The Sieve algorithm takes as input a directed edge-labeled graph G that is either accepted or rejected. The graph G is the graph G(R) constructed using the feature-based rules in a set R. If G is accepted, the binary relation on feature valuations induced by R is deemed as terminating, and thus R is feature-acyclic (cf. Theorem 41).
The graph G(R) = ⟨V, E, ℓ⟩ that is passed to Sieve as input is constructed from a set R of rules over a set Φ of features as follows. The vertices in V correspond to the 2^{|Φ|} valuations v for the conditions p and n = 0 for the Boolean and numerical features p and n in Φ, and there is an edge (v, v′) in E if the pair of valuations v and v′ is compatible with some rule C → E in R. A set of rules R is terminating iff Sieve accepts the graph G(R).
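The edge-removal loop of Sieve can be sketched as follows on hand-built graphs. The encoding is an assumption made for illustration: vertices are pairs (H, n > 0), each edge is a triple (v, v′, label), and a label is a frozenset of (feature, kind) pairs with kind in {"dec", "inc", "any"} for n↓, n↑, and n?. Boolean effects are omitted from the labels since the loop does not inspect them, and SCCs are computed by naive mutual reachability rather than by a linear-time algorithm. The accepted graph encodes the rules r_1 = {¬H} → {H, n↓} and r_2 = {H} → {¬H}; the rejected graph replaces r_2 by a hypothetical rule {H} → {¬H, n↑}.

```python
def reachable(src, edges):
    """Vertices reachable from src (src included) by following edges."""
    seen, stack = {src}, [src]
    while stack:
        u = stack.pop()
        for a, b, _ in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def sieve(vertices, edges):
    """Return True (accept) iff repeated edge removal makes the graph acyclic."""
    edges = list(edges)
    while True:
        reach = {u: reachable(u, edges) for u in vertices}
        components = {frozenset(v for v in reach[u] if u in reach[v]) for u in vertices}
        # A component contains a cycle iff it has at least one internal edge.
        cyclic = [[e for e in edges if e[0] in c and e[1] in c] for c in components]
        cyclic = [es for es in cyclic if es]
        if not cyclic:
            return True
        for es in cyclic:
            effects = {eff for _, _, label in es for eff in label}
            usable = sorted(n for n, kind in effects if kind == "dec"
                            and (n, "inc") not in effects and (n, "any") not in effects)
            if usable:
                n = usable[0]
                # Remove the edges of this component where n is decreased.
                edges = [e for e in edges if not (e in es and (n, "dec") in e[2])]
                break
        else:
            return False  # no component admits a removable edge


A, B, C, D = (False, True), (True, True), (True, False), (False, False)
V = [A, B, C, D]
DEC, INC, NONE = frozenset({("n", "dec")}), frozenset({("n", "inc")}), frozenset()
accepted_edges = [(A, B, DEC), (A, C, DEC), (B, A, NONE), (C, D, NONE)]
rejected_edges = [(A, B, DEC), (A, C, DEC), (B, A, INC), (C, A, INC)]
print(sieve(V, accepted_edges), sieve(V, rejected_edges))
```

The first graph is accepted after the single decreasing edge inside its only cyclic component is removed; the second is rejected because n is both decreased and increased inside the same component.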
In our setting, this means the following:

Theorem 41 (Srivastava et al. (2011b), Bonet and Geffner (2020)). Let Φ be a set of features, and let R be a set of rules over Φ for a class Q of problems. If Sieve accepts G(R), then the binary relation ≺_R is feature-acyclic over Q, and therefore, R is a sketch that defines a serialization ≺_R for Q.
Proof. If Sieve accepts G(R), τ = s_0, s_1, s_2, ... is a state sequence that is compatible with R, and the states s_i and s_j, i < j, have identical valuation over the Boolean conditions for Φ, then there is a numerical feature n that is decremented in the path τ_{i,j} = s_i, s_{i+1}, ..., s_j and not incremented in τ_{i,j} (Srivastava et al., 2011b; Bonet & Geffner, 2020). Therefore, f(s_i) ≠ f(s_j). This implies that there is no such state sequence τ that contains two different states s and s′ such that f(s) = f(s′).

key changes are: a slightly more general and convenient definition of admissibility based on sets of tuples and not sequences, the new notion of envelopes, and the definition of both policies and serializations as binary relations on states, expressed syntactically by means of rules. In addition, serializations are no longer assumed to be transitive relations. As usual, P is a planning instance from a class Q, π is a general policy, T is a set of atom tuples over P, and T^k denotes the set of conjunctions of up to k atoms. A summary of the main theorems above and their meaning follows:
• Theorems 4-7. If T is admissible, IW(T) finds an optimal plan and w(P) ≤ size(T). If w(P) ≤ k, IW(k) finds an optimal plan, and IW finds a (not necessarily optimal) plan.
• Theorem 22: T admissible iff OP T (T ) is a cost-envelope.
• Theorem 23: IW(T ) is optimal if T contains T ′ such that OP T (T ′ ) is a cost-envelope, and IW(k) is optimal if such T ′ is contained in T k .
Meaning.If T is admissible and hence a cost envelope, P is solved optimally by IW(T ) which expands up to |T | nodes and also by IW(T ′ ) if T ′ ⊆ T .The width of P , w(P ), is the minimum size(T ) of an admissible T , and IW(k) solves P optimally if w(P ) ≤ k.IW(k) is equivalent to IW(T k ).
• Theorem 25: OP T (T ) is a cost-envelope in P if it is a closed π-envelope of an optimal policy π for Q, P ∈ Q.
• Theorem 27 and corollaries: If OP T (T ) is a closed π-envelope of an optimal policy π, and T ⊆ T k , IW(k) reaches goal of P through an optimal π-trajectory.
• Theorem 28: w(P ) > k if every optimal plan for P contains a state outside OP T (T k ).
Meaning.These results and the ones above explain why many standard planning domains have bounded width when goal atoms are considered.The reason is that such classes of problems admit general optimal policies π that can be "applied" in an instance P by just considering tuples of atoms of bounded size.If this bound is k, IW(k) finds the goal of P in polynomial time through a π-trajectory, without having to know π at all.The width of P is greater than k if no optimal plan "goes through" states that are all in OP T (T k ).
• Theorem 29: IW Φ is optimal if OP T (F ) is a closed π-envelope for some set F of feature valuations, and π is optimal.
• Theorem 30: OP T (F ) is a closed π-envelope if π is optimal and reaches all and only the states in OP T (F ).
Meaning.The IW Φ procedure is like IW(T ) but with the feature valuations over Φ playing the role of the atom tuples in T .The procedure is meaningful because in many tasks the number of possible feature valuations for a given Φ is exponentially smaller than the number of states (e.g., Q Clear above).Two relevant questions are what sets of features Φ ensure that IW Φ solves a problem (optimally) and whether the features used by a policy π that solves the problem do.The general answer to this last question is no: as shown in the example for Towers of Hanoi, one can define general policies in terms of a bounded and small set of Boolean features Φ, yet the length of any plan for Hanoi will grow exponentially with the number of disks.This simple combinatorial argument rules out Φ as a good set of features for IW Φ in Hanoi, even when Φ supports a solution policy.The results above provide a more general argument that cuts in both ways.Namely, if no subset F of feature valuations results in OP T (F ) being a closed cost-envelope (as in Hanoi), then IW Φ will not solve P optimally in general, and if F is one such set of feature valuations, then IW Φ will solve P optimally.
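A minimal sketch of the pruning idea behind IW_Φ is given below: a breadth-first search that keeps a generated state only if its feature valuation has not been seen before. The function name `iw_phi`, the choice to prune at generation time, and the toy problem (a counter n with an irrelevant toggle, and the single feature n) are assumptions made for illustration; the sketch is not claimed to match the IW_Φ pseudocode of Figure 3 line by line.

```python
from collections import deque


def iw_phi(start, successors, is_goal, features):
    """Breadth-first search that prunes states whose feature valuation was already seen."""
    seen = {features(start)}
    queue = deque([(start, [start])])
    while queue:
        state, path = queue.popleft()
        if is_goal(state):
            return path
        for nxt in successors(state):
            valuation = features(nxt)
            if valuation not in seen:
                seen.add(valuation)
                queue.append((nxt, path + [nxt]))
    return None


# Hypothetical toy problem: state = (n, junk); the goal is n = 0.
def toy_successors(state):
    n, junk = state
    result = [(n, (junk + 1) % 3)]       # irrelevant toggle
    if n > 0:
        result.append((n - 1, junk))     # decrement the counter
    return result

path = iw_phi((4, 0), toy_successors, lambda s: s[0] == 0, lambda s: s[0])
print(path)
```

With the single feature n, the search keeps one state per value of n, so the number of expanded nodes is bounded by the number of feature valuations rather than by the number of states.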
The definition of policies and serializations as binary relations on states, expressed in compact form by the same type of rules, makes the relation between policies and serializations direct:
• Theorem 34: Policy π solves Q iff π is a serialization over Q of width zero (semantics).
• Theorem 38: Rule-based policy R solves Q iff R is a rule-based serialization (sketch) over Q of width zero (syntax).
• Theorem 35: A serialization of width k over Q implies that problems P in Q can be solved by solving subproblems of width bounded by k, greedily, with the SIW R procedure.The number of subproblems to be solved, and hence the running time of SIW R , however, is not necessarily polynomial (e.g., Hanoi).
• Theorems 40 and 41: A terminating rule-based serialization (sketch) of bounded width over Q implies that problems P in Q are solved in polynomial time by SIW_R.
• Theorem 41: Termination of rule-based serializations and policies can be checked in time exponential in the number of features by Sieve.
Meaning. A general policy is a general serialization of zero width; namely, a particular type of serialization in which the subproblems can be solved greedily in a single step. Serializations of bounded width result in subproblems that can be solved greedily in polynomial time, while terminating rule-based serializations always result in a polynomially bounded number of subproblems. The direct correspondence between policies and serializations is new and important, and it has not been recognized before. The reasons have been the lack of general and flexible formal accounts of serializations, compact languages for describing them, and width-like measures for bounding the complexity of subproblems. Policies have been formulated as binary relations on states and not as mappings from states to actions because actions do not generalize across instances. This choice has also helped to make the relation between policies and serializations more explicit. Finally, termination is a property of the set of rules and has nothing to do with the class of problems. Introduced by Srivastava et al. (2011b), termination gives us state acyclicity, needed in the definition of serializations, and feature-value acyclicity, needed for the polynomial bound N^ℓ on the number of subproblems (provided that features are linear), where N is the number of atoms in the problem and ℓ is the number of numerical features used in the rules. If a terminating sketch has bounded width over a class Q of problems, the problems P in Q are solved greedily by SIW_R in polynomial time.

Extensions, Variations, and Limitations
We have presented a framework that accommodates policies and serializations, and have established relations between width and policies, on the one hand, and policies and serializations, on the other.Extensions, variations, and limitations of this framework are briefly discussed next.
Optimality and width. In the definition of an admissible set of tuples T and, hence, in the definition of width that follows, it is said that if an optimal plan σ for a tuple t in T is not an optimal plan for P, then an optimal plan for another tuple t′ in T can be obtained by appending a single action to σ. A similar condition appears in the original definition of admissibility and width for sequences of tuples (Lipovetzky & Geffner, 2012). If the condition that "σ is not an optimal plan for P" is replaced by "σ is not a plan for P," the resulting definition of width (size of a min-size admissible set T) still guarantees that P is solved by IW(k) if the width of P is bounded by k, but not that P is solved optimally by IW(k). In the serialized setting, where optimal solutions of subproblems do not translate into optimal solutions of problems, this relaxation of the definitions of admissibility (and width) makes sense, and it has been used for learning sketches of bounded width more effectively (Drexler et al., 2022).
Syntax of policy and sketch rules. The features and rules provide a convenient, compact, and general language for expressing policies and serializations, while the choice of features, Boolean and numerical, follows the type of variables used in qualitative numerical planning problems (QNPs) for defining bounds and termination conditions (Srivastava et al., 2011b; Bonet & Geffner, 2020). In QNPs, it is critical that numerical variables change via "qualitative" increments and decrements, as reasoning with arbitrary numerical variables is undecidable (Helmert, 2002). Still, the restriction that numerical features n can only appear in effect expressions of the form n↑, n↓ or n? is somewhat arbitrary, and other effect expressions like ¬n↑, ¬n↓, n = 0, or n > 0 can be accommodated with minor changes.
Non-deterministic sketches and policies.Policies and sketches are non-deterministic in the sense that many state transitions and pairs (s, s ′ ) can be compatible with a policy or sketch rule.If a policy π solves a problem P , it is because all π-transitions lead to the goal, and in the case of a bounded-width sketch, it is because the achievement of any such subgoal s ′ leads to the goal.This means that in a state s, one can pick any (policy or sketch) rule C → E such that C is true in s, and move from s to any state s ′ that satisfies the rule.
If the sketch has bounded width (a policy is a sketch of zero width), then at least one such rule exists.In certain cases, however, it is convenient to guarantee that there is one such state s ′ for any rule C → E whose antecedent C is true at s, so that one can choose the rule to "apply" in a state without having to look ahead for the existence of such states s ′ .Sketches that have this additional property have been called modular, as the sketch rules are considered independently of each other.One can then talk about the width of a sketch rule C → E as the maximum width of the problems with initial state s and goal states s ′ such that the state pair (s, s ′ ) satisfies the rule, or s ′ is a goal state of the problem.Modular sketches are useful for learning hierarchical policies, where a sketch rule representing a class of problems of width greater than k is decomposed into sketch rules of width bounded by k, and so on iteratively, until sketch rules are obtained with width zero (Drexler, Seipp, & Geffner, 2023).
Non-deterministic domains.The notion of width and the type of general policies considered are for deterministic planning domains.It is not yet clear how to extend the width notion to non-deterministic domains while preserving certain key properties like that bounded width problems can be solved in polynomial time, and that large classes of benchmark domains fall into such a class for suitable types of goals.The extension of general policies for non-deterministic domains appears to be simpler although it has not been explored.
In principle, a policy π for non-deterministic domains would still classify state transitions (s, s ′ ) as good or bad (in π or not), and non-deterministic actions in s are compatible with π if they give rise to good state transitions from s only.

Related Work
We review a number of related research threads.
Width, general policies, and sketches.This paper builds on prior works that introduced sketches (Bonet & Geffner, 2021), the language for expressing general policies in terms of features and rules (Bonet & Geffner, 2018), and the notion of width and the IW search procedures (Lipovetzky & Geffner, 2012;Lipovetzky, 2021).Methods for learning general policies and sketches of bounded width have also been developed (Frances, Bonet, & Geffner, 2021;Drexler et al., 2022), leading more recently to methods for learning hierarchical policies (Drexler et al., 2023).
Hierarchical RL and intrinsic rewards.Hierarchical structures have also been used in reinforcement learning in the form of options (Sutton, Precup, & Singh, 1999), hierarchies of machines (Parr & Russell, 1997) and MaxQ hierarchies (Dietterich, 2000), among others.
While this "control knowledge" is often provided by hand, a vast literature has explored techniques for learning them by considering "bottleneck states" (McGovern & Barto, 2001), "eigenpurposes" of the matrix dynamics (Machado, Bellemare, & Bowling, 2017), and informal width-based considerations (Junyent, Gómez, & Jonsson, 2021).Intrinsic rewards have also been introduced for improving exploration leading to exogenous rewards (Singh, Lewis, Barto, & Sorg, 2010), and some authors have addressed the problem of learning intrinsic rewards.Interestingly, the title of one of the papers in the area is the question "What can learned intrinsic rewards capture?" (Zheng, Oh, Hessel, Xu, Kroiss, Van Hasselt, Silver, & Singh, 2020).The answer that follows from our setting is clean and simple: intrinsic rewards should capture the general, low-width subgoal structure of the domain.Lacking a language to talk about families of problems and about subgoal structure, however, the answer to the question found in the RL literature is purely experimental and less crisp: learned intrinsic rewards are just supposed to speed up the convergence of (deep) RL.
Reward machines and sketches. A recent language for encoding subgoal structure in RL is based on reward machines (Icarte, Klassen, Valenzano, & McIlraith, 2018) and the closely related proposal of restraining bolts (De Giacomo, Iocchi, Favorito, & Patrizi, 2020). In these cases, the temporal structure of the (sub)goals to be achieved results in an automaton which is combined with the system MDP to produce the so-called cross-product MDP. A number of RL algorithms for exploiting the known structure of the subgoal automata have been developed (Icarte, Klassen, Valenzano, & McIlraith, 2022) as well as algorithms for learning them (Toro Icarte, Waldie, Klassen, Valenzano, Castro, & McIlraith, 2019). There is indeed a close relation between reward machines and sketches, as both convey subgoal structure, but there are some important differences too. First, reward machines encode the structure of explicit temporal goals (e.g., do X, then Y, and finally Z), while sketches encode structure that is implicit in the problem goal given the domain. Second, reward machines are defined in terms of additional propositional variables; sketches, in terms of state features that do not require cross-products. Third, sketches come with a theory of width that tells us where to split problems into subproblems and why. And, finally, sketches come with a notion of termination that ensures that subgoaling does not result in cycles.

Conclusions
We have established results that explain why many standard planning domains have bounded width, and have introduced a number of notions, like policy and cost envelopes, that shed light on this relation and on the optimality and completeness of old and new IW-algorithms like IW(T) and IW_Φ. We have also redefined the semantic and syntactic notions of general policies and serializations, making their relation direct and clean: a policy is a serialization that gives rise to subproblems of zero width which can be solved greedily. This relation between policies and problem decompositions has not been recognized before. The paper is a revised version of an earlier paper (Bonet & Geffner, 2021) that touched on similar themes and introduced the notion of sketches. The goal has been to make the results more transparent, useful, and meaningful.

Theorem 5 (Complexity of IW(T)). Let P be a planning problem with branching factor bounded by b, and let T be a set of atom tuples. Then, 1. IW(T) expands and generates at most |T| and b|T| nodes, respectively, thus running in O(bT^2) time and O(bT) space (where the T inside the O-notation refers to |T|).

Figure 3: An IW_Φ search is like an IW(T) search but instead of tracking the tuples in T, it tracks feature valuations, and prunes nodes whose valuations have already been seen. Guarantees for completeness and optimality are given in Theorem 29.

Algorithm 6: Sieve
1: Input: Directed edge-labeled graph G = ⟨V, E, ℓ⟩, where the labels contain feature effects
2: Output: Either accept or reject G
3: Initialize the graph G′ ← G

We provide full details in what follows. The invariant that must be shown is: at the start of each iteration, the queue contains a node n such that n[state] is in OPT(F) and n[cost] = cost*(n[state]), where n[state] denotes the state associated with the node n, n[cost] denotes the cost of n, and cost*(s) is the state function in Definition 18. We do an induction on the number of iterations:

1. The claim is true for the first iteration as Q only contains the node n_0 for the initial state s_0 which is in OPT(F), and n_0[cost] and cost*(s_0) are both equal to zero.

2. Suppose that the claim holds at the start of iteration i. That is, Q contains a node n such that n[state] is in OPT(F), and n[cost] = cost*(n[state]). We consider 4 cases:
a) The node n is not dequeued. It then remains in Q and the invariant holds for the next iteration.
b) The node n is dequeued and n[state] is a goal state. Then, IW_Φ terminates with a goal-reaching path.
c) The node n is dequeued, n[state] is not a goal state, and the node is not pruned (cf. line 9 in IW_Φ). Since n[state] belongs to OPT(F), a node n′ is enqueued for a successor s′ of n[state] such that s′ is in OPT(F). On the other hand, since OPT(F) is a cost-envelope, there is a goal-reaching ≺_cost-trajectory s_0, ..., n[state], s′, ... which is an optimal trajectory for P by Lemma 19. Therefore,

• The negation of the first case. By the claim, at the time when n* is dequeued, the queue contains another node n such that n[state] ∈ OPT(F) and n[cost] = cost*(n[state]). On one hand, n[cost] ≤ C*, where C* is the cost of P. On the other hand, n*[cost] ≤ n[cost] since n* is dequeued before n. Combining both inequalities, n*[cost] ≤ C*, which means that the path found by IW_Φ of cost n*[cost] is an optimal trajectory for P.

The second claim is implied by the first since OPT(F) is a cost-envelope by Theorem 25. Finally, for the complexity bounds, notice that the number of nodes expanded by IW_Φ is bounded by the number of different feature valuations, which is O(N^ℓ) if the features in Φ are linear and the number of numerical features in Φ is ℓ. Thus, the plan length is O(N^ℓ) while the number of generated nodes is O(bN^ℓ), if b bounds the branching factor. On the other hand, the operations on the hash table can be done in constant time on a perfect hash. Hence, IW_Φ runs in O(bN^ℓ) time and space.

• The Marbles domain M involves boxes and marbles. Boxes can be on the table and marbles inside boxes. Problems are specified with atoms ontable(b) to tell that box b is on the table, and in(r, b) to tell that marble r is in box b. The goal is to remove all boxes from the table, where a box can be removed only if it is empty. Marbles thus must be removed from boxes one at a time, in no specific order. The collection of all problems over the Marbles domain is denoted by Q_M, while Q_M1 ⊆ Q_M denotes the class of such problems with exactly one box.