Online Speedup Learning for Optimal Planning

Domain-independent planning is one of the foundational areas in the field of Artificial Intelligence. A description of a planning task consists of an initial world state, a goal, and a set of actions for modifying the world state. The objective is to find a sequence of actions, that is, a plan, that transforms the initial world state into a goal state. In optimal planning, we are interested in finding not just a plan, but one of the cheapest plans. A prominent approach to optimal planning these days is heuristic state-space search, guided by admissible heuristic functions. Numerous admissible heuristics have been developed, each with its own strengths and weaknesses, and it is well known that there is no single "best" heuristic for optimal planning in general. Thus, which heuristic to choose for a given planning task is a difficult question. This difficulty can be avoided by combining several heuristics, but that requires computing numerous heuristic estimates at each state, and the time spent doing so may outweigh the time saved by the combined advantages of the different heuristics. We present a novel method that reduces the cost of combining admissible heuristics for optimal planning, while maintaining its benefits. Using an idealized search space model, we formulate a decision rule for choosing the best heuristic to compute at each state. We then present an active online learning approach for learning a classifier with that decision rule as the target concept, and employ the learned classifier to decide which heuristic to compute at each state. We evaluate this technique empirically, and show that it substantially outperforms the standard method for combining several heuristics via their pointwise maximum.


Introduction
At the center of the problem of intelligent autonomous behavior is the task of selecting the actions to take next. Planning in AI is best conceived as the model-based approach to automated action selection (Geffner, 2010). The models represent the current situation, goals, and possible actions. Planning-specific languages are used to describe such models concisely. The main challenge in planning is computational, as most planning languages lead to intractable problems in the worst case. However, using rigorous search-guidance tools often allows for efficient solving of interesting problem instances.
In classical planning, which is concerned with the synthesis of plans constituting goal-achieving sequences of deterministic actions, significant algorithmic progress has been achieved in the last two decades. In turn, this progress in classical planning translates into advances in more involved planning languages, allowing for uncertainty and feedback (Yoon, Fern, & Givan, 2007; Palacios & Geffner, 2009; Keyder & Geffner, 2009; Brafman & Shani, 2012). In optimal planning, the objective is not just to find any plan, but to find one of the cheapest plans.
A prominent approach to domain-independent planning, and to optimal planning in particular, is state-space heuristic search. It is very natural to view a planning task as a search problem, and to use a heuristic search algorithm to solve it. Recent advances in the automatic construction of heuristics for domain-independent planning have produced many heuristics to choose from, each with its own strengths and weaknesses. However, this wealth of heuristics raises a new question: given a specific planning task, which heuristic should we choose?
In this paper, we propose selective max, an online learning approach that combines the strengths of several heuristic functions, leading to a speedup in optimal heuristic-search planning. At a high level, selective max can be seen as a hyper-heuristic (Burke, Kendall, Newall, Hart, Ross, & Schulenburg, 2003), that is, a heuristic for choosing among other heuristics. It is based on the seemingly trivial observation that, for each state, there is one heuristic which is the "best" for that state. In principle, it is possible to compute several heuristics for each state, and then choose one according to the values they provide. However, heuristic computation in domain-independent planning is typically expensive, and thus computing several heuristic estimates for each state takes a long time. Selective max works by predicting for each state which heuristic will yield the "best" heuristic estimate, and computes only that heuristic.
As it is not always clear how to decide which heuristic is the "best" for each state, we first analyze an idealized model of a search space and describe how, within that model, to choose the best heuristic for each state in order to minimize the overall search time. We then describe an online active learning procedure that uses a decision rule formulated for the idealized model. This procedure constitutes the essence of selective max.
Our experimental evaluation, which we conducted using three state-of-the-art heuristics for domain-independent planning, shows that selective max is very effective in combining several heuristics in optimal search. Furthermore, the results show that using selective max results in a speedup over the baseline heuristic combination method, and that selective max is robust to different parameter settings. These claims are further supported by selective max having been a runner-up ex aequo in the last International Planning Competition, IPC-2011 (García-Olaya, Jiménez, & Linares López, 2011). This paper expands on the conference version (Domshlak, Karpas, & Markovitch, 2010) in several ways. First, we improve and expand the presentation of the selective max decision rule. Second, we explain how to handle non-uniform action costs in a principled way. Third, the empirical evaluation is greatly extended, and now includes the results from IPC-2011, as well as controlled experiments with three different heuristics, and an exploration of how the parameters of selective max affect its performance.

Previous Work
Selective max is a speedup learning system. In general, speedup learning is concerned with improving the performance of a problem solving system with experience. The computational difficulty of domain-independent planning has led many researchers to use speedup learning techniques in order to improve the performance of planning systems; for a survey of many of these, see the work of Minton (1994), Zimmerman and Kambhampati (2003), and Fern, Khardon, and Tadepalli (2011).
Speedup learning systems can be divided along several dimensions (Zimmerman & Kambhampati, 2003; Fern, 2010). Arguably the most important dimension is the phase in which learning takes place. An offline, or inter-problem, speedup learner analyzes the problem solver's performance on different problem instances in an attempt to formulate some rule which would not only improve this performance but would also generalize well to future problem instances. Offline learning has been applied extensively to domain-independent planning, with varying degrees of success (Fern et al., 2011). However, one major drawback of offline learning is the need for training examples; in our case, planning tasks from the domains of interest.
Learning can also take place online, during problem solving. An online, or intra-problem, speedup learner is invoked by the problem solver on the concrete problem instance the solver is working on, and it attempts to learn online, with the objective of improving the solver's performance on that specific problem instance. In general, online learners are not assumed to be pretrained on other, previously seen problem instances; all the information they can rely on has to be collected during the process of solving the concrete problem instance they were called for. Online learning has been shown to be extremely helpful in propositional satisfiability (SAT) and general constraint satisfaction (CSP) solving, where nogood learning and clause learning are now among the essential components of any state-of-the-art solver (Schiex & Verfaillie, 1993; Marques-Silva & Sakallah, 1996; Bayardo Jr. & Schrag, 1997). Thus, indirectly, SAT- and CSP-based domain-independent planners already benefit from these online learning techniques (Kautz & Selman, 1992; Rintanen, Heljanko, & Niemelä, 2006). However, to the best of our knowledge, our work is the first application of online learning to optimal heuristic-search planning.

Background
A domain-independent planning task (or planning task, for short) consists of a description of an initial state, a goal, and a set of available operators. Several formalisms for describing planning tasks are in use, including STRIPS (Fikes & Nilsson, 1971), ADL (Pednault, 1989), and SAS+ (Bäckström & Klein, 1991; Bäckström & Nebel, 1995). We describe the SAS+ formalism, the one used by the Fast Downward planner (Helmert, 2006), on top of which we have implemented and evaluated selective max. Nothing, however, precludes using selective max in the context of other formalisms.
A SAS+ planning task is given by a 4-tuple Π = ⟨V, A, s_0, G⟩. V = {v_1, ..., v_n} is a set of state variables, each associated with a finite domain dom(v_i). A complete assignment s to V is called a state. s_0 is a specified state called the initial state, and the goal G is a partial assignment to V. A is a finite set of actions. Each action a is given by a pair ⟨pre(a), eff(a)⟩ of partial assignments to V, called preconditions and effects, respectively. Each action a also has an associated cost C(a) ∈ R_0+. An action a is applicable in a state s iff s |= pre(a). Applying a changes the value of each state variable v to eff(a)[v] if eff(a)[v] is specified. The resulting state is denoted by s⟦a⟧. We denote the state obtained from sequential application of the (respectively applicable) actions a_1, ..., a_k starting at state s by s⟦a_1, ..., a_k⟧. Such an action sequence is a plan if s_0⟦a_1, ..., a_k⟧ |= G. In optimal planning, we are interested in finding one of the cheapest plans, where the cost of a plan ⟨a_1, ..., a_k⟩ is the sum of its constituent action costs, Σ_{i=1}^{k} C(a_i). A SAS+ planning task Π = ⟨V, A, s_0, G⟩ can easily be seen as a state-space search problem whose states are simply the complete assignments to the variables V, with transitions uniquely determined by the actions A.
The initial state and goal states of this search problem are likewise given by the initial state and goal of Π. An optimal solution for a state-space search problem can be found by using the A* search algorithm with an admissible heuristic h. A heuristic evaluation function h assigns to each state it evaluates an estimate of the distance to the closest goal state. The cost of a cheapest path from state s to the goal is denoted by h*(s), and h is called admissible if it never overestimates the true goal distance, that is, if h(s) ≤ h*(s) for any state s. A* works by expanding states in order of increasing f(s) := g(s) + h(s), where g(s) is the cost of the cheapest path from the initial state to s known so far.
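To make the search setting concrete, the following is a minimal, self-contained A* sketch over a toy state space. It is an illustration only, not the planner's implementation: the successor function and the heuristic below are invented for the example.

```python
import heapq

def astar(start, goal, successors, h):
    """A* search: expands states in order of f(s) = g(s) + h(s).

    With an admissible h (h(s) <= h*(s) for all s), the first goal
    expansion yields a cheapest plan.
    """
    open_list = [(h(start), 0, start, [])]   # entries are (f, g, state, plan)
    best_g = {start: 0}
    while open_list:
        f, g, s, plan = heapq.heappop(open_list)
        if g > best_g.get(s, float("inf")):
            continue                          # stale queue entry
        if s == goal:
            return g, plan
        for action, cost, t in successors(s):
            g2 = g + cost
            if g2 < best_g.get(t, float("inf")):
                best_g[t] = g2
                heapq.heappush(open_list, (g2 + h(t), g2, t, plan + [action]))
    return None

# Toy state space: states are integers; "inc1" moves +1 at cost 1,
# "inc2" moves +2 at cost 3, so inc1 is always the cheaper route.
succ = lambda s: [("inc1", 1, s + 1), ("inc2", 3, s + 2)]
h = lambda s: max(0, 4 - s)                   # admissible: equals the true cost-to-go
print(astar(0, 4, succ, h))                   # -> (4, ['inc1', 'inc1', 'inc1', 'inc1'])
```

The heuristic here happens to be perfect, so A* marches straight to the goal; with a weaker admissible heuristic the same code expands more states but still returns a cheapest plan.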

Selective Max as a Decision Rule
Many admissible heuristics have been proposed for domain-independent planning; these vary from cheap to compute yet not very accurate, to more accurate yet expensive to compute. In general, the more accurate a heuristic is, the fewer states A* expands when using it. As the accuracy of heuristic functions varies across planning tasks, and even across different states of the same task, we may be able to produce a more robust optimal planner by combining several admissible heuristics. Presumably, each heuristic is more accurate, that is, provides higher estimates, in different regions of the search space. The simplest and best-known way of combining heuristics is to take their point-wise maximum at each state. Given n admissible heuristics h_1, ..., h_n, a new heuristic max_h is defined by max_h(s) := max_{1≤i≤n} h_i(s). It is easy to see that max_h(s) ≥ h_i(s) for any state s and any heuristic h_i. Thus A* search using max_h is expected to expand fewer states than A* using any individual heuristic. However, if we denote the time needed to compute h_i by t_i, the time needed to compute max_h is Σ_{i=1}^{n} t_i. As mentioned previously, selective max is a form of hyper-heuristic (Burke et al., 2003) that chooses which heuristic to compute at each state. We can view selective max as a decision rule dr, which is given a set of heuristics h_1, ..., h_n and a state s, and chooses which heuristic to compute for that state. One natural candidate for such a decision rule is to choose the heuristic yielding the highest, that is, most accurate, estimate:

dr_max(s) = h_j, where j = argmax_{1≤i≤n} h_i(s).

Using this decision rule yields a heuristic which is as accurate as max_h, while still computing only one heuristic per state, in time t_j. This analysis, however, does not take into account the different computation times of the different heuristics. For instance, let h_1 and h_2 be a pair of admissible heuristics such that h_2 ≥ h_1. A priori, it seems that using h_2 should always be preferred to using h_1, because the former should cause A* to expand fewer states. However, suppose that on a given planning task, A* expands 1000 states when guided by h_1 and only 100 states when guided by h_2. If computing h_1 for each state takes 10 ms, and computing h_2 for each state takes 1000 ms, then switching from h_1 to h_2 increases the overall search time. Using max_h over h_1 and h_2 only makes things worse: because h_2 ≥ h_1, computing the maximum simply wastes the time spent on computing h_1. It is possible, however, that computing h_2 for a few carefully chosen states, and h_1 for all other states, would result in expanding 100 states, while reducing the overall search time compared to running A* with only h_2.
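The point-wise maximum combination can be sketched in a few lines. The two heuristics below are made up for illustration; they are not the heuristics used in the paper.

```python
def max_h(heuristics):
    """Point-wise maximum of admissible heuristics: still admissible and
    dominates each component, but costs the sum of their runtimes."""
    return lambda s: max(h(s) for h in heuristics)

# Two illustrative admissible estimates of "distance to state 0":
h1 = lambda s: abs(s)
h2 = lambda s: s if s > 0 else 0
h = max_h([h1, h2])
print(h(-3), h(5))             # -> 3 5
```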
As this example shows, even given knowledge of the heuristics' estimates in advance, it is not clear which heuristic should be computed at each state when our objective is to minimize the overall search time. Therefore, we begin by formulating a decision rule for choosing between two heuristics with respect to an idealized state-space model. Selective max then operates as an online active learning procedure, attempting to predict the outcome of that decision rule and choose which heuristic to compute at each state.

Figure 1: An illustration of the idealized search space model and the f-contours of two admissible heuristics.

Decision Rule with Perfect Knowledge
We now formulate a decision rule for choosing which of two given admissible heuristics, h_1 and h_2, to compute for each state in an idealized search space model. In order to formulate such a decision rule, we make the following assumptions:

• The search space is a tree with a single goal state, constant branching factor b, and uniform-cost actions. Such an idealized search space model was used in the past to analyze the behavior of A* (Pearl, 1984).
• The time t_i required for computing heuristic h_i is independent of the state being evaluated; w.l.o.g. we assume t_2 ≥ t_1.
• The heuristics are consistent. A heuristic h is said to be consistent if it obeys the triangle inequality: for any two states s and s′, h(s) ≤ h(s′) + k(s, s′), where k(s, s′) is the optimal cost of reaching s′ from s.
• We have: (i) perfect knowledge of the structure of the search tree, and in particular of the cost c* of the optimal solution, (ii) perfect knowledge of the heuristic estimates for each state, and (iii) a perfect tie-breaking mechanism.
Obviously, none of the above assumptions holds in typical search problems, and later we examine their individual influence on our framework. Adopting the standard notation, let g(s) be the cost of the cheapest path from s_0 to s. Defining max_h(s) = max(h_1(s), h_2(s)), we then use the notation f_1(s) = g(s) + h_1(s), f_2(s) = g(s) + h_2(s), and max_f(s) = g(s) + max_h(s). The A* algorithm with a consistent heuristic h expands states in increasing order of f = g + h (Pearl, 1984). In particular, every state s with f(s) < h*(s_0) = c* will surely be expanded by A*, and every state with f(s) > c* will surely not be expanded by A*. The states with f(s) = c* might or might not be expanded by A*, depending on the tie-breaking rule being used. Under our perfect tie-breaking assumption, the only states with f(s) = c* that will be expanded are those that lie along some optimal plan. Let us consider the states satisfying f_1(s) = c* (the dotted line in Fig. 1) and those satisfying f_2(s) = c* (the solid line in Fig. 1). The states above the f_1 = c* and f_2 = c* contours are those that are surely expanded by A* with h_1 and h_2, respectively. The states above both these contours (the grid-marked region in Fig. 1), that is, the states SE = {s | max_f(s) < c*}, are those that are surely expanded by A* using max_h (Pearl, 1984, Thm. 4, p. 79).
Under the objective of minimizing the search time, note that the optimal decision for any state s ∈ SE is not to compute any heuristic at all, since all these states are surely expanded anyway. Assuming that we still must choose one of the heuristics, we would choose to compute the cheaper heuristic h_1. Another easy case is when f_1(s) ≥ c*. In these states, computing h_1(s) suffices to ensure that s is not surely expanded, and with a perfect tie-breaking rule, s will not be expanded unless it must be. Because h_1 is also cheaper to compute than h_2, h_1 should be preferred, regardless of the heuristic estimate of h_2 for state s.
Let us now consider the optimal decision for all other states, that is, those with f_1(s) < c* and f_2(s) ≥ c*. In fact, it is enough to consider only the shallowest such states; in Figure 1, these are the states on the part of the f_2 = c* contour that separates the grid-marked and line-marked areas. Since f_1(s) and f_2(s) are based on the same g(s), we have h_2(s) > h_1(s), that is, h_2 is more accurate in state s than h_1. If we were interested solely in reducing state expansions, then h_2 would obviously be the right heuristic to compute at s. However, for our objective of reducing the actual search time, h_2 may actually be the wrong choice, because it might be much more expensive to compute than h_1.
Let us consider the effects of each of our two alternatives. If we compute h_2(s), then s is not surely expanded, because f_2(s) = c*, and thus whether or not A* expands s depends on tie-breaking. As before, we assume perfect tie-breaking, and thus s will not be expanded unless it must be. Computing h_2 would "cost" us t_2 time.
In contrast, if we compute h_1(s), then s is surely expanded because f_1(s) < c*. Note that not computing h_2 for s and then computing h_2 for a descendant s′ of s is clearly a sub-optimal strategy: we still pay the cost of computing h_2, yet the pruning of A* is limited only to the search sub-tree rooted at s′. Therefore, our choices are really either computing h_2 for s, or computing h_1 for all the states in the sub-tree rooted at s that lie on the f_1 = c* contour. Suppose we need to expand l complete levels of the state space from s to reach the f_1 = c* contour. Then we need to generate on the order of b^l states, and invest b^l · t_1 time in calculating h_1 for all these states that lie on the f_1 = c* contour.
Considering these two options, the optimal decision in state s is thus to compute h_2 iff t_2 < b^l · t_1, or, expressed differently, iff l > log_b(t_2/t_1). As a special case, if both heuristics take the same time to compute, this decision rule reduces to l > 0, that is, the optimal choice is simply the more accurate heuristic for state s.
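The break-even analysis above is easy to make concrete. The sketch below plugs in hypothetical numbers in the spirit of the earlier h_1/h_2 example (10 ms vs. 1000 ms per state); the specific values are illustrative, not taken from the paper's experiments.

```python
import math

def prefer_h2(t1, t2, b, l):
    """Idealized rule: computing h2 at s costs t2 once; computing h1 instead
    costs ~ b**l * t1 at the states on the f1 = c* contour below s."""
    return t2 < (b ** l) * t1          # equivalently: l > log_b(t2 / t1)

# Hypothetical numbers: h2 is 100x more expensive than h1, branching factor 2.
t1, t2, b = 10.0, 1000.0, 2.0
print(math.log(t2 / t1, b))            # break-even depth, about 6.64
print(prefer_h2(t1, t2, b, l=5))       # -> False (too shallow to pay off)
print(prefer_h2(t1, t2, b, l=8))       # -> True
```

The exponential factor b**l is why even a very expensive heuristic becomes worthwhile once the depth to the f_1 = c* contour is large enough.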
Putting all of the above cases together yields the decision rule dr_opt below, with l_s being the depth to go from s until f_1 reaches c*:

dr_opt(s) = h_2, if f_1(s) < c* ≤ f_2(s) and l_s > log_b(t_2/t_1); h_1, otherwise.

Decision Rule without Perfect Knowledge
The idealized model above makes several assumptions, some of which appear to be very problematic to meet in practice. Here we examine these assumptions more closely and, when needed, suggest pragmatic compromises. First, the model assumes that the search space forms a tree with a single goal state, that the heuristics in question are consistent, and that we have a perfect tie-breaking rule. The first assumption does not hold in most planning tasks, the second is not satisfied by many state-of-the-art heuristics (Karpas & Domshlak, 2009; Helmert & Domshlak, 2009; Bonet & Helmert, 2010), and the third is not realistic; nevertheless, none of these prevents us from using the decision rule suggested by the model.
The idealized model also assumes that both the branching factor and the heuristic computation times are constant across the search states. In our application of the decision rule to planning in practice, we deal with this assumption by adopting the average branching factor and average heuristic computation times, estimated from a random sample of search states.
Finally, the decision rule dr_opt above requires unrealistic knowledge of both heuristic estimates, as well as of the optimal plan cost c* and of the depth l_s to go from state s until f_1(s) = c*. As we obviously do not have this knowledge in practice, we must use some approximation of the decision rule.
The first approximation we make is to ignore the "trivial" cases that require knowledge of c*; these are the cases where either s is surely expanded, or h_1 is enough to prune s. Instead, we apply the reasoning for the "complicated" case to all states, resulting in the following decision rule:

dr_app1(s) = h_2, if l_s > log_b(t_2/t_1); h_1, otherwise.
The next step is to somehow estimate the "depth to go" l_s, the number of levels we need to expand in the tree until f_1 reaches c*. In order to derive a useful decision rule, we assume that l_s is positively correlated with Δ_h(s) = h_2(s) − h_1(s); that is, if h_1 and h_2 are close, then l_s is low, and if h_1 yields a much lower estimate than h_2, implying that h_1 is not very accurate for s, then the depth to go until f_1(s) = c* is large. Our approximation uses the simplest such correlation, a linear one, between Δ_h(s) and l_s, with a hyper-parameter α controlling the slope.
Recall that in our idealized model all actions have unit cost, and thus cost-to-go and depth-to-go coincide. However, some planning tasks, and notably all planning tasks from the 2008 International Planning Competition, feature non-uniform action costs. Therefore, our decision rule converts heuristic estimates of cost-to-go into estimates of depth-to-go by dividing the cost-to-go estimate by the average action cost. We do this by modifying our estimate of the depth-to-go, l_s, with the average action cost, which we denote by ĉ. Plugging all of the above into our decision rule yields:

dr_app2(s) = h_2, if Δ_h(s) > α · ĉ · log_b(t_2/t_1); h_1, otherwise.
Given b, t_1, t_2, and ĉ, the quantity α · ĉ · log_b(t_2/t_1) is fixed, and in what follows we denote it simply by the threshold τ. Note that a linear correlation between Δ_h(s) and l_s does occur in some simple cases. The first such case is when the h_1 value remains constant in the subtree rooted at s, that is, the additive error of h_1 increases by 1 for each level below s. In this case, f_1 increases by 1 for each expanded level of the sub-tree (because h_1 remains the same and g increases by 1), and it will take expanding exactly Δ_h(s) = h_2(s) − h_1(s) levels to reach the f_1 = c* contour. The second such case is when the absolute error of h_1 remains constant, that is, h_1 increases by 1 for each level expanded, and so f_1 increases by 2. In this case, we will need to expand Δ_h(s)/2 levels. This can be generalized to the case where the estimate h_1 increases by any constant additive factor c, which results in Δ_h(s)/(c+1) levels being expanded.
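As a sketch, the resulting per-state test is cheap once τ is fixed: only the gap h_2(s) − h_1(s) is compared against it. All numbers below are purely illustrative, and α is a tuned hyper-parameter, not a value from the paper.

```python
import math

def make_dr_app2(alpha, c_hat, b, t1, t2):
    """dr_app2 sketch: tau is computed once from the estimated b, t1, t2,
    and average action cost c_hat; per state we only compare the gap."""
    tau = alpha * c_hat * math.log(t2 / t1, b)
    return lambda h1_s, h2_s: "h2" if h2_s - h1_s > tau else "h1"

# Illustrative numbers: h2 is 10x slower, branching factor 3, avg cost 2.
rule = make_dr_app2(alpha=1.0, c_hat=2.0, b=3.0, t1=5.0, t2=50.0)
print(rule(10, 12))    # gap 2 <= tau (about 4.19) -> 'h1'
print(rule(10, 16))    # gap 6 >  tau              -> 'h2'
```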
Furthermore, there is some empirical evidence to support our conclusion about exponential growth of the search effort as a function of heuristic error, even when the assumptions made by the model do not hold. In particular, the experiments of Helmert and Röger (2008) on IPC benchmarks, with heuristics having small constant additive errors, show that the number of expanded nodes most typically grows exponentially as the (still very small and additive) error increases.
Finally, we remark that because our decision rule always chooses an admissible heuristic, the resulting heuristic estimate will always be admissible. Thus, even if the chosen heuristic is not the "correct" one according to dr_opt, this will not result in a loss of optimality of the solution, but only in a possible increase in search time.

Online Learning of the Decision Rule
While the decision rule dr_app2 still requires knowledge of h_1 and h_2, we can now use it as a binary label for each state. We can compute the value of the decision rule by "paying" the computation time of both heuristics, t_1 + t_2, and, more importantly, we can use a binary classifier to predict the value of this decision rule for some unknown state. Note that we use the classifier online, during the problem solving process, and the time spent on learning and classification is counted as time spent on problem solving. Furthermore, as in active learning, we can choose to "pay" for a label for some state, where the payment is also in computation time. Therefore we refer to our setting as active online learning.
In what follows, we provide a general overview of the selective max procedure, and describe several alternatives for each of its components. Our decision rule states that the more expensive heuristic h_2 should be computed at a search state s when h_2(s) − h_1(s) > τ. This decision rule serves as a binary target concept, which corresponds to the set of states where the more expensive heuristic h_2 is "significantly" more accurate than the cheaper heuristic h_1, that is, the states where, according to our model, the reduction in expanded states from computing h_2 outweighs the extra time needed to compute it. Selective max then uses a binary classifier to predict the value of the decision rule. There are several steps to building the classifier:

1. State-Space Sampling: We first collect a sample of search states, which is used to estimate b, t_1, t_2, and ĉ, and thus to compute the threshold τ for the decision rule.

2. Labeling: We then generate a label for each training example s by calculating Δ_h(s) = h_2(s) − h_1(s) and comparing it to the decision threshold: if Δ_h(s) > τ, we label s with h_2, otherwise with h_1. If t_1 > t_2 we simply switch between the heuristics; our decision is always whether or not to compute the more expensive heuristic, and the default is to compute the cheaper heuristic unless the classifier says otherwise.
3. Feature Extraction: Having obtained a set of training examples, we must decide on the features used to characterize each example. Since our target concept is based on heuristic values, the features should represent the information that heuristics are derived from, typically the problem description and the current state.
While several feature-construction techniques for characterizing states of planning tasks have been proposed in previous literature (Yoon, Fern, & Givan, 2008; de la Rosa, Jiménez, & Borrajo, 2008), they were all designed for inter-problem learning, that is, for learning from different planning tasks which have already been solved offline. However, our approach is concerned with only one problem, in an online setting, and thus these techniques are not applicable. In our implementation, we use the simplest features possible, taking each state variable as a feature. As our empirical evaluation demonstrates, even these elementary features suffice for selective max to perform well.

After completing the steps described above, we have a binary classifier that can be used to predict the value of our decision rule. However, as the classifier is not likely to have perfect accuracy, we further consult the confidence the classifier associates with its classification. The resulting state evaluation procedure of selective max is depicted in Figure 2. For every state s evaluated by the search algorithm, we use our classifier to decide which heuristic to compute. If the classification confidence exceeds a confidence threshold ρ, a parameter of selective max, then only the indicated heuristic is computed for s. Otherwise, we conclude that there is not enough information to make a selective decision for s, and compute the regular maximum over h_1(s) and h_2(s). However, we use this opportunity to improve the quality of our prediction for states similar to s, and update our classifier by generating a label based on h_2(s) − h_1(s) and learning from the newly labeled example. These decisions to dedicate computation time to obtaining a label for a new example constitute the active part of our learning procedure. It is also possible to update the estimates for b, t_1, t_2, and ĉ, and to change the threshold τ accordingly. However, this would result in the concept we are trying to learn
constantly changing, a phenomenon known as concept drift, which usually affects learning adversely. Therefore, we do not update the threshold τ.
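The state evaluation procedure described above can be sketched as follows. The classifier interface (predict returning a label and a confidence, and an update method) is hypothetical; it stands in for whatever incremental classifier is plugged in.

```python
def selective_eval(s, h1, h2, classifier, rho, tau):
    """Selective max state evaluation (a sketch of the Figure 2 procedure).

    `classifier` is a hypothetical object with predict(s) -> (label, confidence)
    and update(s, label), where labels are "h1" or "h2".
    """
    label, confidence = classifier.predict(s)
    if confidence >= rho:
        # Confident prediction: compute only the indicated heuristic.
        return h2(s) if label == "h2" else h1(s)
    # Low confidence: pay for both heuristics, return their maximum, and
    # turn the state into a fresh training example (the "active" step).
    v1, v2 = h1(s), h2(s)
    classifier.update(s, "h2" if v2 - v1 > tau else "h1")
    return max(v1, v2)

class AlwaysUnsure:
    """Stub classifier that never reaches the confidence threshold."""
    def predict(self, s):
        return ("h1", 0.0)
    def update(self, s, label):
        self.last = (s, label)

clf = AlwaysUnsure()
print(selective_eval(3, lambda s: s, lambda s: 2 * s, clf, rho=0.6, tau=1.0))  # -> 6
```

With the stub above, confidence never exceeds ρ, so both heuristics are computed, their maximum (6) is returned, and the state becomes a labeled training example.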

State-Space Sampling
The initial state-space sample serves two purposes. First, it is used to estimate the branching factor b, the heuristic computation times t_1 and t_2, and the average action cost ĉ, and then to compute the threshold τ = α · ĉ · log_b(t_2/t_1), which is used to specify our concept. After the concept is specified, the state-space sample also provides the set of examples on which the classifier is initially trained. Therefore, it is important to have an initial state-space sample that is representative of the states which will be evaluated during search. The number of states in the initial sample is controlled by a parameter N.
One option is to use the first N states of the search. However, this method is biased towards states closer to the initial state, and therefore is not likely to represent the search space well. Thus, we discuss three more sophisticated state-space sampling procedures, all of which are based on performing random walks, or "probes", from the initial state. While the details of these sampling procedures vary, each such "probe" terminates at some pre-set depth limit.
The first sampling procedure, which we refer to as "biased probes", uses an inverse-heuristic selection bias for choosing the next state of the probe. Specifically, the probability of choosing state s as the successor from which the random walk will continue is proportional to 1/max_h(s). This biases the sample towards states with lower heuristic estimates, which are more likely to be expanded during the search.
The second sampling procedure is similar to the first one, except that it chooses the successor uniformly, and thus we refer to it as "unbiased probes". Both these sampling procedures add all of the generated states (that is, the states along the probe as well as their "siblings") to the state-space sample, and both terminate after collecting N training examples. The depth limit for all random walks is the same in both sampling schemes, and is set to some estimate of the goal depth; we discuss this goal depth estimate later.
The third state-space sampling procedure, referred to here as PDB sampling, was proposed by Haslum, Botea, Helmert, Bonet, and Koenig (2007). This procedure also uses unbiased probes, but adds only the last state reached in each probe to the state-space sample. The depth of each probe is determined individually, by drawing a random depth from a binomial distribution around the estimated goal depth.
Note that all three sampling procedures rely on some estimate of the minimum goal depth. When all actions have unit cost, the minimum goal depth is the same as h*(s_0), and thus we can use a heuristic to estimate it. In our evaluation, we used twice the heuristic estimate of the initial state, 2 · max_h(s_0), as the goal depth estimate. However, with non-uniform action costs, goal depth and goal cost are no longer measured in the same units. While it seems we could divide the above heuristic-based estimate by the average action cost ĉ, recall that we use the state-space sample in order to obtain an estimate for ĉ, thus creating a circular dependency. Although it is possible to estimate ĉ by taking the average cost of all actions in the problem description, there is no reason to assume that all actions are equally likely to be used. Another option is to modify the above state-space sampling procedures, and place a cost limit, rather than a depth limit, on each probe. However, this would pose a problem in the presence of 0-cost actions. In such a case, when a probe reaches its cost limit yet has an applicable 0-cost action, it is not clear whether the probe should terminate. Therefore, we keep using depth-limited probes and attempt to estimate the depth of the cheapest goal. We compute a heuristic estimate for the initial state, and then use the number of actions on which that heuristic estimate is based as our goal depth estimate. While this is not possible with every heuristic, in our empirical evaluation we use the monotonically relaxed plan heuristic. This heuristic, also known as the FF heuristic (Hoffmann & Nebel, 2001), does provide such information: we first use this heuristic to find a relaxed plan from the initial state, and then use the number of actions in the relaxed plan as our goal depth estimate.
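A minimal sketch of probe-based sampling, for the "unbiased probes" variant, might look as follows; the successor function is generic, and the biased and PDB variants differ only in how the next state is chosen and in which states are kept.

```python
import random

def unbiased_probe(s0, successors, depth_limit, rng=random):
    """One unbiased random-walk probe from the initial state (a sketch).
    Collects every generated state along the walk, i.e., the chosen
    successors together with their siblings."""
    sample, s = [], s0
    for _ in range(depth_limit):
        children = successors(s)
        if not children:
            break                        # dead end: terminate the probe early
        sample.extend(children)
        s = rng.choice(children)         # continue the walk uniformly at random
    return sample

def sample_states(s0, successors, depth_limit, n):
    """Repeat probes until N training examples have been collected."""
    sample = []
    while len(sample) < n:
        sample.extend(unbiased_probe(s0, successors, depth_limit))
    return sample[:n]

succ = lambda s: [s + 1, s + 2]          # toy successor function for illustration
states = sample_states(0, succ, depth_limit=5, n=8)
print(len(states))                        # -> 8
```

For the biased variant, `rng.choice` would be replaced by a draw weighted proportionally to 1/max_h of each child; for PDB sampling, only the final state of each probe would be kept.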

Classifier
The last decision to be made is the choice of classifier. Although many classifiers could be used here, our particular setup imposes several requirements. First, both training and classification must be very fast, as both are performed during time-constrained problem solving. Second, the classifier must be incremental to support active learning; this is achieved by allowing online updates of the learned model. Finally, the classifier should provide us with a meaningful measure of confidence for its predictions.
While several classifiers meet these requirements, we found the Naive Bayes classifier to provide a good balance between speed and accuracy. One caveat is that Naive Bayes assumes strong conditional independence between the features given the class. Although this is not a fully realistic assumption for planning tasks, using a SAS + task formulation instead of the classical STRIPS formulation helps considerably: instead of many highly dependent binary variables, we have a much smaller set of less dependent multi-valued ones.
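The three requirements above (fast training, online updates, and a confidence measure) can be illustrated with a minimal categorical Naive Bayes sketch. This is an illustrative implementation under our own assumptions (Laplace smoothing, confidence as the normalized posterior of the winning class), not the planner's actual code.

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Minimal categorical Naive Bayes with online (incremental) updates."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = defaultdict(int)    # label -> #examples
        self.feature_counts = defaultdict(int)  # (label, i, value) -> count
        self.feature_values = defaultdict(set)  # i -> observed values

    def update(self, features, label):
        """Incorporate one training example (e.g., a state's variable values)."""
        self.class_counts[label] += 1
        for i, v in enumerate(features):
            self.feature_counts[(label, i, v)] += 1
            self.feature_values[i].add(v)

    def predict(self, features):
        """Return (best_label, confidence); confidence is the normalized
        posterior probability of the winning label."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            p = count / total
            for i, v in enumerate(features):
                num = self.feature_counts[(label, i, v)] + self.smoothing
                den = count + self.smoothing * len(self.feature_values[i])
                p *= num / den  # conditional independence assumption
            scores[label] = p
        best = max(scores, key=scores.get)
        return best, scores[best] / sum(scores.values())
```

Both `update` and `predict` run in time linear in the number of features, which is what makes this classifier attractive during time-constrained search.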
Although, as the empirical evaluation will demonstrate, Naive Bayes appears to be the most suitable classifier to use with selective max, other classifiers can also be used. The most obvious choice for a replacement is a different Bayesian classifier. One such classifier is AODE (Webb, Boughton, & Wang, 2005), an extension of Naive Bayes which somewhat relaxes the assumption of independence between the features, and is typically more accurate than Naive Bayes. However, this added accuracy comes at the cost of increased training and classification time.
Decision trees are another popular type of classifier, and they allow for even faster classification. While most decision tree induction algorithms are not incremental, the Incremental Tree Inducer (ITI) algorithm (Utgoff, Berkman, & Clouse, 1997) supports incremental updating of decision trees by tree restructuring, and also has a freely available implementation in C. In our evaluation, we used ITI in incremental mode and incorporated every example into the tree immediately, because the tree is likely to be used for many classifications between consecutive updates with training examples from active learning. The classification confidence with ITI is obtained from the frequency of examples at the leaf node from which the classification came.
A different family of possible classifiers is k-Nearest Neighbors (kNN) (Cover & Hart, 1967). In order to use kNN, we need a distance metric between examples, which, with our features, are simply states. As with our choice of features, we opt for simplicity and use Euclidean distance as our metric. kNN enjoys very fast learning time, but suffers from slow classification time. The classification confidence is obtained by a simple (unweighted) vote among the k nearest neighbors.
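The kNN variant just described fits in a few lines. This is a sketch under the stated choices (Euclidean distance, unweighted vote, confidence as the winning vote fraction); the function name and the representation of examples as (feature vector, label) pairs are our own.

```python
import math
from collections import Counter

def knn_predict(examples, query, k=5):
    """Classify `query` by an unweighted vote of its k nearest neighbors
    under Euclidean distance.  `examples` is a list of (features, label)
    pairs.  Returns (label, confidence), where confidence is the fraction
    of the k neighbors that voted for the winning label."""
    nearest = sorted(examples, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / k
```

Note that "learning" here is just appending to `examples`, which is why kNN trains quickly but classifies slowly: every prediction scans the whole example set.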
Another question related to the choice of classifier is feature selection. In some planning tasks, the number of variables, and accordingly of features, can exceed 2000 (for example, task 35 of the AIRPORT domain has 2558 variables). While the performance of Naive Bayes and kNN could likely be improved by feature selection, doing so poses a problem when the initial sample is considered. Since feature selection would have to be done right after the initial sample is obtained, it would have to be based only on that sample. This could cause a problem, since some features might appear to be irrelevant according to the initial sample, yet turn out to be very relevant when active learning is used after some low-confidence states are encountered. Therefore, we do not use feature selection in our empirical evaluation of selective max.

Extension to Multiple Heuristics
To this point, we have discussed how to choose which heuristic to compute for each state when there are only two heuristics to choose from. When given more than two heuristics, the decision rule presented in Section 4 is inapplicable, and extending it to handle more than two heuristics is not straightforward. However, extending selective max itself is straightforward: we simply compare heuristics in a pairwise manner, and use a voting rule to choose which heuristic to compute.
While there are many possible voting rules, we use the simplest one, which compares every pair of heuristics and chooses the winner by a vote, weighted by the confidence of each pairwise decision. The overall winner is simply the heuristic with the highest total confidence over all pairwise comparisons, with ties broken in favor of the cheaper-to-compute heuristic. Although this requires a quadratic number of classifiers, training and classification time (at least with Naive Bayes) appear to be much lower than the overall time spent on heuristic computations, and thus the overhead induced by learning and classification is likely to remain relatively low for reasonable heuristic ensembles.
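The confidence-weighted voting rule above can be sketched as follows. Here `classify` stands in for the learned pairwise classifiers (returning the pair's winner and a confidence), and `cost` for the measured per-state computation time of each heuristic; both are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def choose_heuristic(heuristics, classify, cost):
    """Choose one heuristic from an ensemble by confidence-weighted pairwise
    voting.  `classify(h_i, h_j)` returns (winner, confidence) for that pair;
    `cost[h]` is h's per-state computation time.  The heuristic with the
    highest total confidence wins, with ties broken toward the cheaper one."""
    score = defaultdict(float)
    for hi, hj in combinations(heuristics, 2):
        winner, confidence = classify(hi, hj)
        score[winner] += confidence
    # Sort by descending total confidence, then ascending cost for ties.
    return min(heuristics, key=lambda h: (-score[h], cost[h]))
```

With n heuristics this consults n(n-1)/2 pairwise classifiers, which is the quadratic number of classifiers mentioned in the text.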

Experimental Evaluation
To evaluate selective max empirically, we implemented it on top of the open-source Fast Downward planner (Helmert, 2006). Our empirical evaluation is divided into three parts. First, we examine the performance of selective max using the last International Planning Competition, IPC-2011, as our benchmark. Selective max was the runner-up ex aequo at IPC-2011, tying for 2nd place with a version of Fast Downward using the abstraction-based merge-and-shrink heuristic (Nissim, Hoffmann, & Helmert, 2011), and losing to a sequential portfolio combining the heuristics used in both runners-up (Helmert, Röger, & Karpas, 2011). Second, we present a series of controlled parametric experiments, where we examine the behavior of selective max under different settings. Finally, we compare selective max to a simulated sequential portfolio using the same heuristics as selective max.

Performance Evaluation: Results from IPC-2011
The IPC-2011 experiments (García-Olaya et al., 2011) were run by the IPC organizers, on their own machines, with a time limit of 30 minutes and a memory limit of 6 GB per planning task.
The competition included some new domains, which none of the participants had seen before, thus precluding the participants from using offline learning approaches. Although many planners participated in the sequential optimal track of IPC-2011, we report here only the results relevant to selective max. The selective max entry in IPC-2011 was called selmax, and consisted of selective max over the uniform action cost partitioning version of h LA (Karpas & Domshlak, 2009) and the h LM-CUT (Helmert & Domshlak, 2009) heuristic. The parameters used for selective max in IPC-2011 are reported in Table 1. Additionally, each of the heuristics selmax used was entered individually, as BJOLP (h LA ) and lmcut (h LM-CUT ), and we report results for all three planners. While a comparison of selective max with the regular maximum of h LA and h LM-CUT would be interesting, there was no such entry at IPC-2011, and thus we cannot report on it. In our controlled experiments, we do compare selective max to the regular maximum, as well as to other baseline combination methods.
Figure 3 shows the anytime profile of these three planners on IPC-2011 tasks, plotting the number of tasks solved under different timeouts, up to the time limit of 30 minutes. Additionally, Table 2 shows the number of tasks solved in each domain of IPC-2011 after 30 minutes, and includes the number of problems solved by the winner, Fast Downward Stone Soup 1 (FDSS-1), for reference.
As these results show, selective max solves more problems than each of the individual heuristics it uses. Furthermore, the anytime profile of selective max dominates those of these heuristics in the range from 214 seconds until the full 30-minute timeout. The behavior of the anytime plot with shorter timeouts is due to the overhead of selective max, which consists of obtaining the initial state-space sample, as well as learning and classification. However, it appears that selective max quickly compensates for its relatively slow start.

Controlled Experiments
In our series of controlled experiments, we attempted to evaluate the impact of different parameters on selective max. We controlled the following independent variables:
• Heuristics: We used three state-of-the-art admissible heuristics: h LA (Karpas & Domshlak, 2009), h LM-CUT (Helmert & Domshlak, 2009), and h LM-CUT + (Bonet & Helmert, 2010). None of these base heuristics yields better search performance than the others across all planning domains. Of these heuristics, h LA is typically the fastest to compute and the least accurate, h LM-CUT is more expensive to compute and more accurate, and h LM-CUT + is the most expensive to compute and the most accurate. (All three heuristics are computable in polynomial time from the SAS + description of the planning task.) From the data we have gathered in these experiments, h LM-CUT takes on average 4.5 times more time per state than h LA , and h LM-CUT + takes 53 times more time per state than h LA . We evaluate selective max with all possible subsets of two or more of these three heuristics.
While there are other admissible heuristics for SAS + planning that are competitive with the three above (for example, Helmert, Haslum, & Hoffmann, 2007; Nissim et al., 2011; Katz & Domshlak, 2010), they are based on expensive offline preprocessing, followed by very fast online per-state computation. In contrast, h LA , h LM-CUT and h LM-CUT + perform most of their computation online, and thus can be better exploited by selective max.
Additionally, we empirically examine the effectiveness of selective max in deciding whether to compute a heuristic value at all. This is done by combining our most accurate heuristic, h LM-CUT + , with the blind heuristic.
• Heuristic difference bias α: The hyper-parameter α controls the tradeoff between computation time and heuristic accuracy. Setting α = 0 sets the threshold τ to 0, forcing the decision rule to always choose the more accurate heuristic. Increasing α increases the threshold, forcing the decision rule to choose the more accurate heuristic h 2 only if its value is much higher than that of h 1 . We evaluate selective max with α values of 0.1, 0.5, 1, 1.5, 2, 3, 4, and 5.
• Confidence threshold ρ: The confidence threshold ρ controls the active learning part of selective max. Setting ρ = 0.5 turns off active learning completely, because the chosen heuristic always comes with a confidence of at least 0.5. Setting ρ = 1 would mean using active learning almost always, essentially reducing selective max to regular point-wise maximization. We evaluate selective max with ρ values of 0.51, 0.6, 0.7, 0.8, 0.9, and 0.99.
• Initial sample size N : The initial sample size N is an important parameter, not just because it is used to train the initial classifier before any active learning is done, but also because it is the only source of estimates for the branching factor, average action cost, and heuristic computation times; it thus affects the threshold τ . Increasing N increases the accuracy of the initial classifier and of the various aforementioned estimates, but also increases the preprocessing time. We evaluate selective max with N values of 10, 100, and 1000.
• Sampling method: The sampling method used to obtain the initial state-space sample is important in that it affects this initial sample, and thus the accuracy of both the threshold τ and of the initial classifier. We evaluate selective max with three different sampling methods, all described in Section 5.1: biased probes (sel P h ), unbiased probes (sel U P h ), and the sampling method of Haslum et al. (2007) (sel PDB h ).
• Classifier: The choice of classifier is also very important. The Naive Bayes classifier combines very fast learning and classification (sel N B h ). A more sophisticated variant of Naive Bayes called AODE (Webb et al., 2005) is also considered here (sel AODE h ); AODE is more accurate than Naive Bayes, but has higher classification and learning times, as well as increased memory overhead. Another possible choice is incremental decision trees (Utgoff et al., 1997), which offer even faster classification, but more expensive learning when the tree structure needs to be changed (sel IT I h ). We also consider kNN classifiers (Cover & Hart, 1967), which offer faster learning than Naive Bayes, but usually more expensive classification, especially as k grows larger (sel kN N h , for k = 3, 5).
Table 3 describes our default values for each of these independent variables. In each of the subsequent experiments, we vary one of these independent variables, keeping the rest at their default values. In all of these experiments, the search for each planning task instance was limited to 30 minutes and to 3 GB of memory. The search times do not include the time needed for translating the planning task from PDDL to SAS + and building some of the Fast Downward data structures, which is common to all planners, and is tangential to the issues considered in our study. The search times do include learning and classification time for selective max.

• Heuristics
We begin by varying the set of heuristics in use. For every possible choice of two or more heuristics out of the uniform action cost partitioning version of h LA (which we simply refer to as h LA ), h LM-CUT and h LM-CUT + , we compare selective max to other methods of heuristic combination, as well as to the individual heuristics. We compare selective max (sel h ) to the regular maximum (max h ), as well as to a planner which randomly chooses which heuristic to compute at each state (rnd h ). As it is not clear whether the random choice should favor the more expensive and accurate heuristic or the cheaper and less accurate one, we simply use a uniform random choice.
This experiment was conducted on all 31 domains from the International Planning Competitions 1998-2008 that contain no conditional effects and axioms (which none of the heuristics we used support). Because domains vary in difficulty and in the number of tasks, we normalize the score for each planner in each domain between 0 and 1. Normalizing by the number of problems in the domain is not a good idea, as it is always possible to generate any number of effectively unsolvable problems in each domain, so that the fraction of solved problems will approach zero. Therefore, we normalize the number of problems solved in each domain by the number of problems in that domain that were solved by at least one of our planners. While this measure of normalized coverage has the undesirable property that introducing a new planner could change the normalized coverage of the other planners, we believe that it best reflects performance nonetheless. As an overall performance measure, we list the average normalized coverage score across all domains. Using normalized coverage means that domains have equal weight in the aggregate score. Additionally, we list for each domain the number of problems that were solved by any planner (in parentheses next to the domain name), and for each planner we list the number of problems it solved in parentheses.
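The normalized coverage score just defined can be sketched as follows; the data representation (domain -> planner -> set of solved task ids) is our assumption for illustration.

```python
def normalized_coverage(solved):
    """Per-planner average normalized coverage.  `solved` maps
    domain -> planner -> set of task ids solved there.  In each domain a
    planner scores |tasks it solved| / |tasks solved by at least one planner|;
    the overall score averages these over domains, so every domain carries
    equal weight regardless of its size."""
    planners = {p for per_domain in solved.values() for p in per_domain}
    scores = {p: 0.0 for p in planners}
    for per_domain in solved.values():
        solved_by_any = set().union(*per_domain.values())
        for p in planners:
            scores[p] += len(per_domain.get(p, set())) / len(solved_by_any)
    return {p: s / len(solved) for p, s in scores.items()}
```

This also makes the caveat from the text concrete: adding a planner can enlarge a domain's solved-by-any set and thereby lower every other planner's score in that domain.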
Tables 4 and 5 summarize the results of this experiment, broken down by three sets of domains: domains with non-uniform action costs, domains with unit action costs which exhibited a high variance in the number of problems solved between different planners, and domains with unit action costs which exhibited a low variance in the number of problems solved between different planners. We make this distinction because we conducted the following experiments, which examine the effects of the other parameters of selective max, only on the unit action cost domains which exhibited high variance. Tables 4 and 5 also report results for all domains combined. Detailed, per-domain results are relegated to Appendix A.

Table 5: Geometric mean of the ratio of expansions relative to max h , broken down by groups of domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs.
Table 4 lists the normalized coverage score, averaged across all domains, and the total number of problems solved in parentheses. Table 4a lists these for each individual heuristic, and Table 4b for every combination method of every set of two or more heuristics. Table 5 shows how accurate each of these heuristic combination methods is. Since, for a given set of base heuristics, max h is the most accurate heuristic possible, accuracy is evaluated relative to max h . We evaluate each heuristic's accuracy on each task as the number of states expanded by A * using that heuristic, divided by the number of states expanded by A * using max h . We compute, for each domain, the geometric mean of this "accuracy ratio" over the tasks solved by all planners, and list here the geometric mean over these per-domain numbers. Each row lists the results for a combination of two or three heuristics; for combinations of two heuristics, we leave the cell representing the heuristic that is not in the combination empty.
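The two-level geometric-mean aggregation used in Table 5 can be sketched as follows; the input format (domain -> list of per-task expansion pairs) is our assumption for illustration.

```python
import math

def expansion_ratio_score(expansions_by_domain):
    """Aggregate accuracy as described for Table 5: for each domain, take the
    geometric mean over tasks of (expansions with this planner / expansions
    with max_h), then take the geometric mean of the per-domain numbers.
    `expansions_by_domain` maps domain -> list of (planner_exp, maxh_exp)."""
    def gmean(xs):
        # Geometric mean via logs, numerically safer than multiplying ratios.
        return math.exp(sum(math.log(x) for x in xs) / len(xs))
    per_domain = [gmean([p / m for p, m in pairs])
                  for pairs in expansions_by_domain.values()]
    return gmean(per_domain)
```

A score of 1.0 means the method expands as many states as max h on average; values above 1.0 quantify the accuracy sacrificed for speed.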
Looking at the results of individual heuristics first, we see that the most accurate heuristic (h LM-CUT + ) does not do well overall, while the least accurate heuristic (h LA ) solved the most tasks in total, and h LM-CUT wins in terms of normalized coverage. However, when looking at the results for individual domains, we see that the best heuristic to use varies, indicating that combining different heuristics could indeed be of practical value.
We now turn our attention to the empirical results for the combinations of all possible subsets of two or more heuristics. The results clearly demonstrate that when more than one heuristic is used, selective max is always better than the regular maximum or random choice, both in terms of normalized coverage and the absolute number of problems solved. Furthermore, the poor performance of rnd h , in both coverage and accuracy, demonstrates that the decision rule and the classifier used in selective max are important to its success, and that computing only one randomly chosen heuristic at each state is insufficient, to say the least.
When compared to individual heuristics, selective max does at least as well as each of the individual heuristics it uses, for all combinations except that of h LM-CUT and h LM-CUT + . This is most likely because h LM-CUT and h LM-CUT + are based on a very similar procedure, and thus their heuristic estimates are highly correlated. To see why this hinders selective max, consider the extreme case of two heuristics whose values have a correlation of 1.0 (that is, which yield the same heuristic values), where selective max can offer no benefit. Finally, we remark that the best planner in this experiment was the selective max combination of h LA and h LM-CUT .
The above results are all based on a 30-minute time limit, which, while commonly used in the IPC, is arbitrary, and the number of tasks solved after 30 minutes does not tell the complete tale. Here, we examine the anytime profile of the different heuristic combination methods, by plotting the number of tasks solved under different timeouts, up to a timeout of 30 minutes.
Figure 4 shows this plot for the three combination methods when all three heuristics are used.
As the figure shows, the advantage of sel h over the baseline combination methods is even greater under shorter timeouts. This indicates that the advantage of sel h over max h is even greater than is evident from the results after 30 minutes, and that sel h is indeed effective for minimizing search time. Since the anytime plots for the combinations of pairs of heuristics are very similar, we omit them here for the sake of brevity.

Table 6: Selective max overhead. Each row lists the average percentage of time spent on learning and classification, out of the total time taken by selective max, for each set of heuristics.
Finally, we present overhead statistics for using selective max: the proportion of time spent on learning and classification, including the time spent obtaining the initial state-space sample, out of the total solution time. Table 6 presents the average overhead of selective max for each of the combinations of two or more heuristics. Detailed, per-domain results are presented in Table 18 in Appendix A. As these results show, selective max does incur a noticeable overhead, but it is still relatively low. It is also worth mentioning that the overhead varies significantly between different domains.
We also performed an empirical evaluation of using selective max with an accurate heuristic alongside the blind heuristic. The blind heuristic returns 0 for goal states, and the cost of the cheapest action for non-goal states. For this experiment, we chose our most accurate heuristic, h LM-CUT + . We compare the performance of A * using h LM-CUT + alone to that of A * using selective max of h LM-CUT + and the blind heuristic. Because the blind heuristic returns a constant value for all non-goal states, the decision rule that selective max uses to combine some heuristic h with the blind heuristic h b is simply h(s) ≥ τ + h b , that is, compute h when the predicted value of h is greater than some constant threshold. Recall that, when h(s) + g(s) < c * , computing h is simply a waste of time, because s will not be pruned. Therefore, it only makes sense to compute h(s) when h(s) ≥ c * − g(s). Note that this threshold for computing h depends on g(s), and thus is not constant. This shows that a constant threshold for computing h(s) is not the best possible decision rule. Unfortunately, the selective max decision rule is based on an approximation that fails to capture the subtleties of this case.
Table 7 shows the normalized coverage of A * using h LM-CUT + , and of A * using selective max of h LM-CUT + and the blind heuristic. As the results show, selective max has little effect in most domains, though it does harm performance in some, and in one domain, OPENSTACKS, it actually performs better than the single heuristic. Table 8 shows the average expansions ratio, using the number of states expanded with h LM-CUT + as the baseline; note that using the blind heuristic never increases heuristic accuracy. As these results show, selective max chooses to use the blind heuristic quite often, expanding on average more than twice as many states as A * with h LM-CUT + alone.
• Hyper-parameter α
Figure 5a plots the total number of problems solved under different values of α. As these results show, selective max is fairly robust with respect to the value of α, unless a very large value for α is chosen, making it more difficult for selective max to choose the more accurate heuristic.
Detailed, per-domain results appear in Table 19 in Appendix A, as well as in Figure 6. These results show a more complex picture, where there seems to be some cutoff value for each domain, such that increasing α past that value impairs performance. The one exception to this is the PIPESWORLD-TANKAGE domain, where setting α = 5 helps.
• Confidence threshold ρ
Figure 5b plots the total number of problems solved under different values of ρ. Detailed, per-domain results appear in Table 20 in Appendix A. These results indicate that selective max is also robust to the value of ρ, unless it is set to a very high value, causing selective max to behave like the regular point-wise maximum.
• Initial sample size N
Figure 5c plots the total number of problems solved under different values of N , with the x-axis in log scale. Detailed, per-domain results appear in Table 21 in Appendix A. As the results show, our default value of N = 100 is the best of the three values we tried, although selective max is still fairly robust with respect to the choice of this parameter.

• Sampling method
Figure 7 shows the total number of problems solved using different methods for the initial state-space sampling. Detailed, per-domain results appear in Table 22 in Appendix A. As the results demonstrate, the choice of sampling method can notably affect the performance of selective max. However, as the detailed results show, this effect is only evident in the FREECELL domain. We also remark that our default sampling method, PDB, performs worse than the others. Indeed, by using the probe-based sampling methods, selective max outperforms A * using h LA alone. However, as this difference is only due to the FREECELL domain, we cannot state with certainty that this would generalize across all domains.

• Classifier
Figure 8 shows the total number of problems solved using different classifiers. Detailed, per-domain results appear in Table 23 in Appendix A. Naive Bayes appears to be the best classifier to use with selective max, although AODE also performs quite well. Even though kNN enjoys very fast learning, the classifier is used mostly for classification, and as expected, kNN does not do well. However, the increased accuracy of k = 5 seems to pay off against the faster classification of k = 3.

Comparison with Sequential Portfolios
Sequential portfolio solvers for optimal planning are another approach for exploiting the merits of different heuristic functions, and they have been very successful in practice, with the Fast Downward Stone Soup sequential portfolio (Helmert et al., 2011) winning the sequential optimal track at IPC-2011. A sequential portfolio utilizes different solvers by running them sequentially, each with a prespecified time limit. If one solver fails to find a solution within its allotted time limit, the sequential portfolio terminates it and moves on to the next solver. However, a sequential portfolio solver needs to know the time allowance for the problem it is trying to solve beforehand, a setting known as contract anytime (Russell & Zilberstein, 1991). In contrast, selective max can be used in an interruptible anytime manner, where the time limit need not be known in advance.
Here, we compare selective max to sequential portfolios of A * with the same heuristics. As we have the exact time it took A * search using each heuristic alone to solve each problem, we can determine whether a sequential portfolio which assigns each heuristic some time limit will be able to solve each problem. Using this data, we simulate the results of two types of sequential portfolio planners. In the first setting, we assume that the time limit is known in advance, and simulate the results of a contract portfolio giving an equal share of time to all heuristics. In the second setting, we simulate an interruptible anytime portfolio by using binary exponential backoff time limits: starting with a time limit of 1 second for each heuristic, we increase the time limit by a factor of 2 if none of the heuristics was able to guide A * to solve the planning problem. There are several possible orderings for the heuristics here, and we use the de facto best ordering for each problem. We denote the contract anytime portfolio by port ctr , and the interruptible anytime portfolio by port int .

Figure 9: Anytime profiles of sequential portfolios and selective max. Each plot shows the number of problems solved by selective max (sel h ), a simulated contract anytime portfolio (port ctr ), and a simulated interruptible portfolio (port int ) using (a) h LA and h LM-CUT , (b) h LA and h LM-CUT + , (c) h LM-CUT and h LM-CUT + , and (d) h LA , h LM-CUT , and h LM-CUT + .
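The exponential-backoff simulation can be sketched as follows. The function name, the use of dictionary order as the heuristic ordering, and the exact accounting of the final cut-off run are our assumptions for illustration; only the backoff scheme itself (1s, 2s, 4s, ... per heuristic per round) comes from the text.

```python
def simulate_interruptible_portfolio(solve_times, total_limit):
    """Simulate an interruptible sequential portfolio with binary exponential
    backoff.  `solve_times[h]` is the time A* with heuristic h needs to solve
    the task (None if it never solves it); the dict's iteration order serves
    as the heuristic ordering.  Returns the total wall-clock time at which a
    plan is found, or None if the task stays unsolved within `total_limit`."""
    elapsed = 0.0
    limit = 1.0  # per-heuristic time limit for the current round, in seconds
    while elapsed < total_limit:
        for h, t in solve_times.items():
            if t is not None and t <= limit:
                # This run solves the task within its per-run limit.
                return elapsed + t if elapsed + t <= total_limit else None
            elapsed += limit  # the run was cut off after `limit` seconds
            if elapsed >= total_limit:
                return None
        limit *= 2  # binary exponential backoff
    return None
```

For example, with standalone solving times of 3s and 10s, the 1s and 2s rounds fail (consuming 6 seconds total), and the first heuristic then succeeds in the 4s round, for a simulated total of 9 seconds.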
Figure 9 shows the number of problems solved under different time limits for selective max, the contract anytime sequential portfolio, and the interruptible anytime sequential portfolio. As these results show, the contract anytime sequential portfolio almost always outperforms selective max. On the other hand, when the sequential portfolio does not know the time limit in advance, its performance deteriorates significantly. The best heuristic combination for selective max, h LA and h LM-CUT , outperforms the interruptible anytime portfolio using the same heuristics, and so does the selective max combination of h LM-CUT and h LM-CUT + . With the other combinations of heuristics, the interruptible anytime portfolio performs better than selective max.

Discussion
Learning for planning has been a very active field since the early days of planning (Fikes, Hart, & Nilsson, 1972), and has recently been receiving growing attention in the community. However, despite some early work (Rendell, 1983), relatively little work has dealt with learning for state-space search guided by distance-estimating heuristics, one of the most prominent approaches to planning these days. Most works in this direction have been devoted to learning macro-actions (see, for example, Finkelstein & Markovitch, 1998; Botea, Enzenberger, Müller, & Schaeffer, 2005; Coles & Smith, 2007). Recently, learning for heuristic search planning has received more attention: Yoon et al. (2008) suggested learning (inadmissible) heuristic functions based upon features extracted from relaxed plans. Arfaee, Zilles, and Holte (2010) attempted to learn an almost admissible heuristic estimate using a neural network. Perhaps the work most closely related to ours is that of Thayer, Dionne, and Ruml (2011), who learn to correct errors in heuristic estimates online. Thayer et al. attempt to improve the accuracy of a single given heuristic, while selective max attempts to choose one of several given heuristics for each state; the two works differ technically on this point. More importantly, however, none of the aforementioned approaches can guarantee that the resulting heuristic will be admissible, and thus that an optimal solution will be found. In contrast, our focus is on optimal planning, and we are not aware of any previous work that deals with learning for optimal heuristic search.
Our experimental evaluation demonstrates that selective max is a more effective method for combining arbitrary admissible heuristics than the baseline point-wise maximization. Also advantageous is selective max's ability to exploit pairs of heuristics where one is guaranteed to always be at least as accurate as the other. For example, the h LA heuristic can be used with two action cost partitioning schemes: uniform and optimal (Karpas & Domshlak, 2009). The heuristic induced by the optimal action cost partitioning is at least as accurate as the one induced by the uniform action cost partitioning, but takes much longer to compute. Selective max might be used to learn when it is worth spending the extra time to compute the optimal cost partitioning, and when it is not. In contrast, the max-based combination of these two heuristics would simply waste the time spent on computing the uniform action cost partitioning.
The controlled parametric experiments demonstrate that the right choice of classifier and of the sampling method for the initial state-space sample is very important. The other parameters of selective max do not appear to affect performance much, as long as they are set to reasonable values. This implies that selective max could be improved by using faster, more accurate classifiers, and by developing sampling methods that represent the state space well.

4. Learning: Once we have a set of labeled training examples, each represented by a vector of features, we can train a binary classifier. Several different choices of classifier are discussed in Section 5.2.

Figure 3: IPC-2011 anytime performance. Each line shows the number of problems from IPC-2011 solved by the BJOLP, lmcut, and selmax planners, respectively, under different timeouts.

Figure 5: Number of problems solved by selective max under different values for (a) hyper-parameter α, (b) confidence threshold ρ, and (c) initial sample size N .

Table 1: Parameters for the selmax entry in IPC-2011.

Table 2: Number of planning tasks solved at IPC-2011 in each domain by the BJOLP, lmcut, and selmax planners. The best result among these 3 planners is in bold. The number of problems solved by Fast Downward Stone Soup 1 (FDSS-1) in each domain is also included for reference.

Table 3: Default parameters for sel h .

Table 4 :
Average normalized coverage, and total coverage in parentheses, broken down by groups of domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs. Table (a) shows the results for A * with individual heuristics, and Table (b) shows the results for the maximum (max h ), random choice (rnd h ), and selective max (sel h ) combinations of the set of heuristics listed in each major row.

Table 7 :
Normalized coverage of h LM-CUT + and selective max combining h LM-CUT + with the blind heuristic. Domains are grouped into domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs, respectively.

Table 9 :
Detailed per-domain results of A * with each individual heuristic. Normalized coverage is shown, with the number of problems solved shown in parentheses. Domains are grouped into domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs, respectively.

Table 10 :
Detailed per-domain normalized coverage using h LA and h LM-CUT . Each line shows the normalized coverage in each domain, with the number of problems solved shown in parentheses. Domains are grouped into domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs, respectively.

Table 12 :
Detailed per-domain normalized coverage using h LA and h LM-CUT + . Each line shows the normalized coverage in each domain, with the number of problems solved shown in parentheses. Domains are grouped into domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs, respectively.

Table 14 :
Detailed per-domain normalized coverage using h LM-CUT and h LM-CUT + . Each line shows the normalized coverage in each domain, with the number of problems solved shown in parentheses. Domains are grouped into domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs, respectively.

Table 16 :
Detailed per-domain normalized coverage using h LA , h LM-CUT and h LM-CUT + . Each line shows the normalized coverage in each domain, with the number of problems solved shown in parentheses. Domains are grouped into domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs, respectively.

Table 18 :
Selective max overhead. Each row lists the average percentage of time spent on learning and classification, out of the total time taken by selective max, in each domain, for each set of heuristics. Domains are grouped into domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs, respectively.

Table 19 :
Number of problems solved by selective max in each domain with varying values of hyperparameter α.

Table 20 :
Number of problems solved by selective max in each domain with varying values of confidence threshold ρ.

Table 18 lists the average overhead of selective max in each domain, for each combination of two or more heuristics. Tables 19, 20, 21, 22 and 23 list the number of problems solved in each domain under various values of α, ρ, N, the sampling method, and the classifier, respectively.

Table 21 :
Number of problems solved by selective max in each domain with varying values of initial sample size N.

Table 22 :
Number of problems solved by selective max in each domain with different sampling methods. PDB is the sampling method of Haslum et al. (2007), P is the biased probes sampling method, and UP is the unbiased probes sampling method.

Table 23 :
Number of problems solved by selective max in each domain with different classifiers.

Table 24 :
Detailed coverage of portfolio using h LA / h LM-CUT . Number of problems solved by selective max (sel h ), a simulated interruptible portfolio (port int ), and a simulated contract anytime portfolio (port ctr ) in each domain using heuristics h LA / h LM-CUT . Domains are grouped into domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs, respectively.

Table 25 :
Detailed coverage of portfolio using h LA / h LM-CUT + . Number of problems solved by selective max (sel h ), a simulated interruptible portfolio (port int ), and a simulated contract anytime portfolio (port ctr ) in each domain using heuristics h LA / h LM-CUT + . Domains are grouped into domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs, respectively.

Table 26 :
Detailed coverage of portfolio using h LM-CUT / h LM-CUT + . Number of problems solved by selective max (sel h ), a simulated interruptible portfolio (port int ), and a simulated contract anytime portfolio (port ctr ) in each domain using heuristics h LM-CUT / h LM-CUT + . Domains are grouped into domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs, respectively.

Table 27 :
Detailed coverage of portfolio using h LA / h LM-CUT / h LM-CUT + . Number of problems solved by selective max (sel h ), a simulated interruptible portfolio (port int ), and a simulated contract anytime portfolio (port ctr ) in each domain using heuristics h LA / h LM-CUT / h LM-CUT + . Domains are grouped into domains with unit cost actions and high variance in coverage, domains with unit cost actions and low variance in coverage, and domains with non-uniform action costs, respectively.