A Bilinear Programming Approach for Multiagent Planning

Multiagent planning and coordination problems are common and known to be computationally hard. We show that a wide range of two-agent problems can be formulated as bilinear programs. We present a successive approximation algorithm that significantly outperforms the coverage set algorithm, which is the state-of-the-art method for this class of multiagent problems. Because the algorithm is formulated for bilinear programs, it is more general and simpler to implement. The new algorithm can be terminated at any time and-unlike the coverage set algorithm-it facilitates the derivation of a useful online performance bound. It is also much more efficient, on average reducing the computation time of the optimal solution by about four orders of magnitude. Finally, we introduce an automatic dimensionality reduction method that improves the effectiveness of the algorithm, extending its applicability to new domains and providing a new way to analyze a subclass of bilinear programs.


Introduction
We present a new approach for solving a range of multiagent planning and coordination problems using bilinear programming.The problems we focus on represent various extensions of the Markov decision process (MDP) to multiagent settings.The success of MDP algorithms for planning and learning under uncertainty has motivated researchers to extend the model to cooperative multiagent problems.One possibility is to assume that all the agents share all the information about the underlying state.This results in a multiagent Markov decision process (Boutilier, 1999), which is essentially an MDP with a factored action set.A more complex alternative is to allow only partial sharing of information among agents.In these settings, several agents-each having different partial information about the world-must cooperate with each other in order to achieve some joint objective.Such problems are common in practice and can be modeled as decentralized partially observable MDPs (DEC-POMDPs) (Bernstein, Zilberstein, & Immerman, 2000).Some refinements of this model have been studied, for example by making certain independence assumptions (Becker, Zilberstein, & Lesser, 2003) or by adding explicit communication actions (Goldman & Zilberstein, 2008).DEC-POMDPs are closely related to extensive games (Rubinstein, 1997).In fact, any DEC-POMDP represents an exponentially larger extensive game with a common objective.Unfortunately, DEC-POMDPs with just two agents are intractable in general, unlike MDPs that can be solved in polynomial time.
Despite recent progress in solving DEC-POMDPs, even state-of-the-art algorithms are generally limited to very small problems (Seuken & Zilberstein, 2008).This has motivated the development of algorithms that either solve a restricted class of problems (Becker, Lesser, & Zilberstein, 2004;Kim, Nair, Varakantham, Tambe, & Yokoo, 2006) or provide only approximate solutions (Emery-Montemerlo, Gordon, Schneider, & Thrun, 2004;Nair, Roth, Yokoo, & Tambe, 2004;Seuken & Zilberstein, 2007).In this paper, we introduce an efficient algorithm for several restricted classes, most notably decentralized MDPs with transition and observation independence (Becker et al., 2003).For the sake of simplicity, we denote this model as DEC-MDP, although this is usually used to denote the model without the independence assumptions.The objective in these problems is to maximize the cumulative reward of a set of cooperative agents over some finite horizon.Each agent can be viewed as a single decision-maker operating on its own "local" MDP.What complicates the problem is the fact that all these MDPs are linked through a common reward function that depends on their states.
The coverage set algorithm (CSA) was the first optimal algorithm to solve efficiently transition and observation independent DEC-MDPs (Becker, Zilberstein, Lesser, & Goldman, 2004).By exploiting the fact that the interaction between the agents is limited compared to their individual local problems, CSA can solve problems that cannot be solved by the more general exact DEC-POMDP algorithms.It also exhibits good anytime behavior.However, the anytime behavior is of limited applicability because solution quality is only known in hindsight, after the algorithm terminates.
We develop a new approach to solve DEC-MDPs-as well as a range of other multiagent planning problems-by representing them as bilinear programs.We also present an efficient new algorithm for solving these kinds of separable bilinear problems.When the algorithm is applied to DEC-MDPs, it improves efficiency by several orders of magnitude compared with previous state-of the art algorithms (Becker, 2006;Petrik & Zilberstein, 2007a).In addition, the algorithm provides useful runtime bounds on the approximation error, which makes it more useful as an anytime algorithm.Finally, the algorithm is formulated for general separable bilinear programs and therefore it can be easily applied to a range of other problems.
The rest of the paper is organized as follows.First, in Section 2, we describe the basic bilinear program formulation and how a range of multiagent planning problems can be expressed within this framework.In Section 3, we describe a new successive approximation algorithm for bilinear programs.The performance of the algorithm depends heavily on the number of interactions between the agents.To address that, we propose in Section 4 a method that automatically reduces the number of interactions and provides a bound on the degradation in solution quality.Furthermore, to be able to project the computational effort required to solve a given problem instance, we develop offline approximation bounds in Section 5.In Section 6, we examine the performance of the approach on a standard benchmark problem.We conclude with a summary of the results and a discussion of future work that could further improve the performance of this approach.

Formulating Multiagent Planning Problems as Bilinear Programs
We begin with a formal description of bilinear programs and the different types of multiagent planning problems that can be formulated as such.In addition to multiagent planning problems, bilinear programs can be used to solve a variety of other problems such as robotic manipulation (Pang, Trinkle, & Lo, 1996), bilinear separation (Bennett & Mangasarian, 1992), and even general linear complementarity problems (Mangasarian, 1995).We focus on multiagent planning problems where this formulation turns out to be particularly effective.
Definition 1.A separable bilinear program in the normal form is defined as follows: maximize w,x,y,z f (w, x, y, z) = s T 1 w + r T 1 x + x T Cy + r T 2 y + s T 2 z subject to The size of the program is the total number of variables in w, x, y and z.The number of variables in y determines the dimensionality of the program1 .
Unless otherwise specified, all vectors are column vectors.We use boldface 0 and 1 to denote vectors of zeros and ones respectively of the appropriate dimensions.This program specifies two linear programs that are connected only through the nonlinear objective function term x T Cy.The program contains two types of variables.The first type includes the variables x, y that appear in the bilinear term of the objective function.The second type includes the additional variables w, z that do not appear in the bilinear term.As we show later, this distinction is important because the complexity of the algorithm we propose depends mostly on the dimensionality of the problem, which is the number of variables y involved in the bilinear term.
The bilinear program in Eq. ( 1) is separable because the constraints on x and w are independent of the constraints on y and z.That is, the variables that participate in the bilinear term of the objective function are independently constrained.The theory of nonseparable bilinear programs is much more complicated and the corresponding algorithms are not as efficient (Horst & Tuy, 1996).Thus, we limit the discussion in this paper to separable bilinear programs and often omit the term "separable".As discussed later in more detail, a separable bilinear program may be seen as a concave minimization problem with multiple local minima.It can be shown that solving this problem is NP-complete, compared to polynomial time complexity of linear programs.
In addition to the formulation of the bilinear program shown in Eq. ( 1), we also use the following formulation, stated in terms of inequalities: The latter formulation can be easily transformed into the normal form using standard transformations of linear programs (Vanderbei, 2001).In particular, we can introduce slack variables w, z to obtain the following identical bilinear program in the normal form: maximize w,x,y,z We use the following matrix and block matrix notation in the paper.Matrices are denoted by square brackets, with columns separated by commas and rows separated by semicolons.Columns have precedence over rows.For example, the notation [A, B; C, D] corresponds to the matrix As we show later, the presence of the variables w, z in the objective function may prevent a crucial function from being convex.Since this has an unfavorable impact on the properties of the bilinear program, we introduce a compact form of the problem.
Definition 2. We say that the bilinear program in Eq. ( 1) is in a compact form when s 1 and s 2 are zero vectors.It is in a semi-compact form if s 2 is a zero vector.
The compactness requirement is not limiting because any bilinear program in the form shown in Eq. ( 1) can be expressed in a semi-compact form as follows: maximize w,x,y,z,x,ŷ Clearly, feasible solutions of Eq. (1) and Eq. ( 4) have the same objective value when ŷ is set appropriately.Notice that the dimensionality of the bilinear term in the objective function increases by 1 for both x and y.Hence, this transformation increases the dimensionality of the program by 1.
The rest of this section describes several classes of multiagent planning problems that can be formulated as bilinear programs.Starting with observation and transition independent DEC-MDPs, we extend the formulation to allow a different objective function (maximizing average reward over an infinite horizon), to handle interdependent observations, and to find Nash equilibria in competitive settings.

DEC-MDPs
As mentioned previously, any transition-independent and observation-independent DEC-MDP (Becker et al., 2004) may be formulated as a bilinear program.Intuitively, a DEC-MDP is transition independent when no agent can influence the other agents' transitions.A DEC-MDP is observation independent when no agent can observe the states of other agents.These assumptions are crucial since they ensure a lower complexity of the problem (Becker et al., 2004).In the remainder of the paper, we use simply the term DEC-MDP to refer to transition and observation independent DEC-MDP.
The DEC-MDP model has proved useful in several multiagent planning domains.One example that we use is the Mars rover planning problem (Bresina, Golden, Smith, & Washington, 1999), first formulated as a DEC-MDP by Becker et al. (2003).This domain involves two autonomous rovers that visit several sites in a given order and may decide to perform certain scientific experiments at each site.The overall activity must be completed within a given time limit.The uncertainty about the duration of each experiment is modeled by a given discrete distribution.While the rovers operate independently and receive local rewards for each completed experiment, the global reward function also depends on some experiments completed by both rovers.The interaction between the rovers is thus limited to a relatively small number of such overlapping tasks.We return to this problem and describe it in more detail in Section 6.
A DEC-MDP problem is composed of two MDPs with state-sets S 1 , S 2 and action sets A 1 , A 2 .The functions r 1 and r 2 define local rewards for action-state pairs.The initial state distributions are α 1 and α 2 .The MDPs are coupled through a global reward function defined by the matrix R. Each entry R(i, j) represents the joint reward for the state-action i by one agent and j by the other.Our definition of a DEC-MDP is based on the work of Becker et al. (2004), with some modifications that we discuss below.Definition 3. A two-agent transition and observation independent DEC-MDP with extended reward structure is defined by a tuple S, F, α, A, P, R : is the factored set of terminal states.
• α = (α 1 , α 2 ) where α i : S i → [0, 1] are the initial state distribution functions is the factored set of actions • P = (P 1 , P 2 ), P i : S i × A i × S i → [0, 1] are the transition functions.Let a ∈ A i be an action, then P a i : S i × S i → [0, 1] is a stochastic transition matrix such that P i (s, a, s ) = P a i (s, s ) is the probability of a transition from state s ∈ S i to state s ∈ S i of agent i, assuming it takes action a.The transitions from the final states have 0 probability; that is Local rewards r i are represented as vectors, and R is a matrix with (s 1 , a 1 ) as rows and (s 2 , a 2 ) as columns.
Definition 3 differs from the original definition of transition and observation independent DEC-MDP (Becker et al. 2004, Definition 1) in two ways.The modifications allow us to explicitly capture assumptions that are implicit in previous work.First, the individual MDPs in our model are formulated as stochastic shortest-path problems (Bertsekas & Tsitsiklis, 1996).That is, there is no explicit time horizon, but instead some states are terminal.The process stops upon reaching a terminal state.The objective is to maximize the cumulative reward received before reaching the terminal states.
The second modification of the original definition is that Definition 3 generalizes the reward structure of the DEC-MDP formulation, using the extended reward structure.The joint rewards in the original DEC-MDP are defined only for the joint states (s ).However, our DEC-MDP formulation with extended reward structure also allows the reward to depend on (s 1 1 , s 2 2 ) and (s 1 2 , s 2 1 ), even when they are not visited simultaneously.As a result, the global reward may depend on the history, not only on the current state.Note that this reward structure is more general than what is commonly used in DEC-POMDPs.
We prefer the more general definition because it has been already implicitly used in previous work.In particular, this extended reward structure arises from introducing the primitive and compound events in the work of Becker et al. (2004).This reward structure is necessary to capture the characteristics of the Mars rover benchmark.Interestingly, this extension does not complicate our proposed solution methods in any way.Note that the stochastic shortest path formulation (right side of Figure 1) inherently eliminates any loops because time always advances when an action is taken.Therefore, every state in that representation may be visited at most once.This property is commonly used when an MDP is formulated as a linear program (Puterman, 2005).
The solution of a DEC-MDP is a deterministic stationary policy π = (π 1 , π 2 ), where π i : S i → A i is the standard MDP policy (Puterman, 2005) for agent i.In particular, π i (s i ) represents the action taken by agent i in state s i .To define the bilinear program, we use variables x(s 1 , a 1 ) to denote the probability that agent 1 visits state s 1 and takes action a 1 and y(s 2 , a 2 ) to denote the same for agent 2. These are the standard dual variables in MDP formulation.Given a solution in terms of x for agent 1, the policy is calculated for s ∈ S 1 as follows, breaking ties arbitrarily.
The policy π 2 is similarly calculated from y.The correctness of the policy calculation follows from the existence of an optimal policy that is deterministic and depends only on the local states of that agent (Becker et al., 2004).
The objective in DEC-MDPs in terms of x and y is then to maximize: The stochastic shortest path representation is more general because any finite-horizon MDP can be represented as such by keeping track of time as part of the state, as illustrated in Figure 1.This modification allows us to apply the model directly to the Mars rover benchmark problem.Actions in the Mars rover problem may have different durations, while all actions in finite-horizon MDPs take the same amount of time.
A DEC-MDP problem with an extended reward structure can be formulated as a bilinear mathematical program as follows.Vector variables x and y represent the state-action probabilities for each agent, as used in the dual linear formulation of MDPs.Given the transition and observation independence, the feasible regions may be defined by linear equalities A 1 x = α 1 and x ≥ 0, and A 2 y = α 2 and y ≥ 0. The matrices A 1 and A 2 are the same as for the dual formulation of total expected reward MDPs (Puterman, 2005), representing the following equalities for agent i: for every s ∈ S i .As described above, variables x(s, a) represent the probabilities of visiting the state s and taking action a by the appropriate agent during plan execution.Note that for agent 2, the variables are y(s, a) rather than x(s, a).Intuitively, these equalities ensure that the probability of entering each non-terminal state, through either the initial step or from other states, is the same as the probability of leaving the state.The bilinear problem is then formulated as follows: In this formulation, we treat the initial state distributions α i as vectors, based on a fixed ordering of the states.The following simple example illustrates the formulation.
Example 4. Consider a DEC-MDP with two agents, depicted in Figure 2. The transitions in this problem are deterministic, and thus all the branches represent actions a i , ordered for each state from left to right.In some states, only one action is available.The shared reward r 1 , denoted by a dotted line, is received when both agents visit the state.The local rewards are denoted by the numbers above next to the states.The terminal states are omitted.The agents start in states s 1 1 and s 2 1 respectively.The bilinear formulation of this problem is: While the results in this paper focus on two-agent problems, our approach can be extended to DEC-MDPs with more than two agents in two ways.The first approach requires that each component of the global reward depends on at most two agents.The DEC-MDP then may be viewed as a graph with vertices representing agents and edges representing the immediate interactions or dependencies.To formulate the problem as a bilinear program, this graph must be bipartite.Interestingly, this class of problems has been previously formulated (Kim et al., 2006).Let G 1 and G 2 be the indices of the agents in the two partitions of the bipartite graph.Then the problem can be formulated as follows: Here, R ij denotes the global reward for interactions between agents i and j.This program is bilinear and separable because the constraints on the variables in G 1 and G 2 are independent.
The second approach to generalize the framework is to represent the DEC-MDP as a multilinear program.In that case, no restrictions on the reward structure are necessary.An algorithm to solve, say a trilinear program, could be almost identical to the algorithm we propose, except that the best response would be calculated using bilinear, not linear programs.However, the scalability of this approach to more than a few agents is doubtful.

Average-Reward Infinite-Horizon DEC-MDPs
The previous formulation deals with finite-horizon DEC-MDPs.An average-reward problem may also be formulated as a bilinear program (Petrik & Zilberstein, 2007b).This is particularly useful for infinite-horizon DEC-MDPs.For example, consider the infinitehorizon version of the Multiple Access Broadcast Channel (MABC) (Rosberg, 1983;Ooi & Wornell, 1996).In this problem, which has been used widely in recent studies of decentralized decision making, two communication devices share a single channel, and they need to periodically transmit some data.However, the channel can transmit only a single message at a time.When both agents send messages at the same time, this leads to a collision, and the transmission fails.The memory of the devices is limited, thus they need to send the messages sooner rather than later.We adapt the model from the work of Rosberg (1983), which is particularly suitable because it assumes no sharing of local information among the devices.
The definition of average-reward two-agent transition and observation independent DEC-MDP is the same as Definition 3, with the exception of the terminal states, policy, and objective.There are no terminal states in average-reward DEC-MDPs, and the policy (π 1 , π 2 ) may be stochastic.That is, π i (s, a) → [0, 1] is the probability of agent i taking an action a in state s.The objective is to find a stationary infinite-horizon policy π that maximizes the average reward, or gain, defined as follows.
Definition 5. Let π = (π 1 , π 2 ) be a stochastic policy, and X t and Y t be random variables that represent the probability distributions over the state-action pairs at time t of the two agents respectively according to π.The gain G of the policy π is then defined for states s 1 ∈ S 1 and s 2 ∈ S 2 as: where π i (s i ) is the distribution over the actions in state s i .Note that the expectation is with respect to the initial states and action distributions (s 1 , π 1 (s 1 )), (s 2 , π 2 (s 2 )).
The actual gain of a policy depends on the agents' initial state distributions α 1 , α 2 and may be expressed as α T 1 Gα 2 , with G represented as a matrix.Puterman (2005), for example, provides a more detailed discussion of the definition and meaning of policy gain.
To simplify the bilinear formulation of the average-reward DEC-MDP, we assume that r 1 = 0 and r 2 = 0.The bilinear program follows.maximize The variables in the program come from the dual formulation of the average-reward MDP linear program (Puterman, 2005).The state sets of the MDPs is divided into recurrent and transient states.The recurrent states are expected to be visited infinitely many times, while the transient states are expected to be visited finitely many times.Variables p 1 and p 2 represent the limiting distributions of each MDP, which is non-zero for all recurrent states.The (possibly stochastic) policy π i of agent i is defined in the recurrent states by the probability of taking action a ∈ A i in state s ∈ S i : .  The variables p i are 0 in transient states.The policy in the transient states is calculated from variables q i as: .
The correctness of the constraints follows from the dual formulation of optimal average reward (Puterman 2005, Equation 9.3.4).Petrik and Zilberstein (2007b) provide further details of this formulation.

General DEC-POMDPs and Extensive Games
The general DEC-POMDP problem and extensive-form games with two agents, or players, can also be formulated as bilinear programs.However, the constraints may not be separable because actions of one agent influence the other agent.The approach in this case may be similar to linear complementarity problem formulation of extensive games (Koller, Megiddo, & von Stengel, 1994), and integer linear program formulation of DEC-POMDPs (Aras & Charpillet, 2007).The approach we develop is closely related to event-driven DEC-POMDPs (Becker et al., 2004), but it is in general more efficient.Nevertheless, the size of the bilinear program is exponential in the size of the DEC-POMDP.This can be expected since solving DEC-POMDPs is NEXP-complete (Bernstein et al., 2000), while solving bilinear programs is NP-complete (Mangasarian, 1995).Because the general formulation in this case is somewhat cumbersome, we only illustrate it using the following simple example.Aras (2008) provides the details of a similar construction.
Example 6.Consider the problem depicted in Figure 3, assuming that the agents are cooperative.The actions of the other agent are not observable, as denoted by the information sets.This approach can be generalized to any problem with any observable sets as long as the perfect recall condition is satisfied.Agents satisfy the perfect recall condition when they remember the set of actions taken in the prior moves (Osborne & Rubinstein, 1994).Rewards are only collected in the leaf-nodes in this case.The variables on the edges represent the probability of taking the action.Here, variables a denote the actions of one agent, and variables b of the other.The total common reward received in the end is: The constraints in this problem are of the following form: a 11 + a 12 = 1.
Any DEC-POMDP problem can be represented using the approach used above.It is also straightforward to extend the approach to problems with rewards in every node.However, the above formulation is clearly not bilinear.To apply our algorithm to this class of problems, we need to reformulate the problem in a bilinear form.This can be easily accomplished in a way similar to the construction of the dual linear program for an MDP.Namely, we introduce variables: and so on for every set of variables on any path to a leaf node.Then, the objective may be reformulated as follows: The constraints may be reformulated as follows.The constraint a 21 + a 22 = 1 can be multiplied by a 11 and then replaced by c 21 + c 22 = c 11 , and so on.That is, the variables in each level have to sum to the variable that is their least common parent in the level above for the same agent.

General Two-Player Games
In addition to cooperative problems, some competitive problems with 2 players may be formulated as bilinear programs.It is known that the problem of finding an equilibrium for a bi-matrix game may be formulated as a linear complementarity problem (Cottle, Pang, & Stone, 1992).It has also been shown that a linear complementarity problem may be formulated as a bilinear problem (Mangasarian, 1995).However, a direct application of these two reductions results in a complex problem with a large dimensionality.Below, we demonstrate how a general game can be directly formulated as a bilinear program.There are many ways to formulate a game, thus we take a very general approach.We simply assume that each agent optimizes a linear program, as follows.
In Eq. ( 8), the variable y is considered to be a constant and similarly in Eq. ( 9) the variable x is considered to be a constant.For normal form games, the constraint matrices A 1 and A 2 are simply rows of ones, and b 1 = b 2 = 1.For competitive DEC-MDPs, the constraint matrices A 1 and A 2 are the same as in Section 2.1.Extensive games may be formulated similarly to DEC-POMDPs, as described in Section 2.3.
The game specified by linear programs Eq. ( 8) and Eq. ( 9) may be formulated as a bilinear program as follows.First, define the reward vectors for each agent, given a policy of the other agent.
These values are unrelated to those of Eq. ( 7).The complementary slackness values ( Vanderbei, 2001) for the linear programs Eq. ( 8) and Eq. ( 9) are: where λ 1 and λ 2 are the dual variables of the corresponding linear programs.For any primal feasible x and y, and dual feasible λ 1 and λ 2 , we have that k 1 (x, y, λ 1 ) ≥ 0 and k 2 (x, y, λ 2 ) ≥ 0. The equality is attained if and only if x and y are optimal.This can be used to write the following optimization problem, in which we implicitly assume that x,y,λ 1 ,λ 2 are feasible in the appropriate primal and dual linear programs: Therefore, any feasible x and y that set the right hand side to 0 solve both linear programs in Eq. ( 8) and Eq. ( 9) optimally.Adding the primal and dual feasibility conditions to the above, we get the following bilinear program: The optimal solution of Eq. ( 10) is 0 and it corresponds to a Nash equilibrium.This is because both the primal variables x, y and dual variables λ 1 , λ 2 are feasible and the complementary slackness condition is satisfied.The open question in this example are the interpretation of an approximate result and a formulation that would select the equilibrium.It is not clear yet whether it is possible to formulate the program so that the optimal solution will be a Nash equilibrium that maximizes a certain criterion.The approximate solutions of the program probably correspond to -Nash equilibria, but this remain an open question.
The algorithm in this case also relies on the number of shared rewards being small compared to the size of the problem.But even if this is not the case, it is often possible that the number of shared rewards may be automatically reduced as described in Section 4. In fact, it is easy to show that a zero-sum normal form game is automatically reduced to two uncoupled linear programs.This follows from the dimensionality reduction procedure in Section 4.

Solving Bilinear Programs
One simple method often used for solving bilinear programs is the iterative procedure shown in Algorithm 1.The parameter B represents the bilinear program.While the algorithm often performs well in practice, it tends to converge to a suboptimal solution (Mangasarian, 1995).When applied to DEC-MDPs, this algorithm is essentially identical to JESP (Nair, Tambe, Yokoo, Pynadath, & Marsella, 2003)-one of the early solution methods.In the following, we use f (w, x, y, z) to denote the objective value of Eq. ( 1).
The rest of this section presents a new anytime algorithm for solving bilinear programs.The goal of the algorithm to is to produce a good solution quickly and then improve the solution in the remaining time.Along with each approximate solution, the maximal approximation bound with respect to the optimal solution is provided.As we show below, our algorithm can benefit from results produced by suboptimal algorithms, such as Algorithm 1, to quickly determine tight approximation bounds.

The Successive Approximation Algorithm
We begin with an overview of a successive approximation algorithm for bilinear problems that takes advantage of a low number of interactions between the agents.It is particularly suitable when the input problem is large in comparison to its dimensionality, as defined in Section 2. We address the issue of dimensionality reduction in Section 4.
We begin with a simple intuitive explanation of the algorithm, and then show how it can be formalized.The bilinear program can be seen as an optimization game played by two agents, in which the first agent sets the variables w, x and the second one sets the variables y, z.This is a general observation that applies to any bilinear program.In any practical application, the feasible sets for the two sets of variables may be too large to explore exhaustively.In fact, when this method is applied to DEC-MDPs, these sets are infinite and continuous.The basic idea of the algorithm is to first identify the set of best responses of one of the agents, say agent 1, to some policy of the other agent.This is simple because once the variables of agent 2 are fixed, the program becomes linear, which is relatively easy to solve.Once the set of best-response policies of agent 1 is identified, assuming it is of a reasonable size, it is possible to calculate the best response of agent 2.
This general approach is also used by the coverage set algorithm (Becker et al., 2004).One distinction is that the representation used in CSA applies only to DEC-MDPs, while our formulation applies to bilinear programs-a more general representation.The main distinction between our algorithm and CSA is the way in which the variables y, z are chosen.In CSA, the values y, z are calculated in a way that simply guarantees termination in finite time.We, on the other hand, choose values y, z greedily so as to minimize the approximation bound on the optimal solution.This is possible because we establish bounds on the optimality of the solution throughout the calculation.As a result, our algorithm converges more rapidly and may be terminated at any time with a guaranteed performance bound.Unlike the earlier version of the algorithm (Petrik & Zilberstein, 2007a), the version described in this paper calculates the best response using only a subset of the values of y, z.
As we show, it is possible to identify regions of y, z in which it is impossible to improve the current best solution and exclude these regions from consideration.
We now formalize the ideas described above.To simplify the notation, we define feasible sets as follows: We use y ∈ Y to denote that there exists z such that (y, z) ∈ Y .In addition, we assume that the problem is in a semi-compact form.This is reasonable because any bilinear program may be converted to semi-compact form with an increase in dimensionality of one, as we have shown earlier.
Assumption 7. The sets X and Y are bounded, that is, they are contained in a ball of a finite radius.While Assumption 7 is limiting, coordination problems under uncertainty typically have bounded feasible sets because the variables correspond to probabilities bounded to [0, 1].
Assumption 8.The bilinear program is in a semi-compact form.
The main idea of the algorithm is to compute a set X ⊆ X that contains only those elements that satisfy a necessary optimality condition.The set X is formally defined as follows: X ⊆ (x * , w * ) ∃(y, z) ∈ Y f (w * , x * , y, z) = max (x,w)∈X f (w, x, y, z) .
As described above, this set may be seen as a set of best responses of one agent to the variable settings of the other.The best responses are easy to calculate since the bilinear program in Eq. ( 1) reduces to a linear program for fixed w, x or fixed y, z.In our algorithm, we assume that X is potentially a proper subset of all necessary optimality points and focus on the approximation error of the optimal solution.Given the set X, the following simplified problem is solved.maximize w,x,y,z f (w, x, y, z) Unlike the original continuous set X, the reduced set X is discrete and small.Thus the elements of X may be enumerated.For a fixed w and x, the bilinear program in Eq. ( 11) reduces to a linear program.
To help compute the approximation bound and to guide the selection of elements for X, we use the best-response function g(y), defined as follows: with the second equality for semi-compact programs only and feasible y ∈ Y .Note that g(y) is also defined for y / ∈ Y , in which case the choice of z is arbitrary since it does not influence the objective function.The best-response function is easy to calculate using a linear program.The crucial property of the function g that we use to calculate the approximation bound is its convexity.The following proposition holds because g(y) = max {x,w (x,w)∈X} f (w, x, y, 0) is a maximum of a finite set of linear functions.
Proposition 9.The function g(y) is convex when the program is in a semi-compact form.
Proposition 9 relies heavily on the separability of Eq. ( 1), which means that the constraints on the variables on one side of the bilinear term are independent of the variables on the other side.The separability ensures that w, x are valid solutions regardless of the values of y, z.The semi-compactness of the program is necessary to establish convexity, as shown in Example 23 in Appendix C. The example is constructed using the properties described in the appendix, which show that f (w, x, y, z) may be expressed as a sum of a convex and a concave function.
We are now ready to describe Algorithm 2, which computes the set X for a bilinear problem B such that the approximation error is at most 0 .The algorithm iteratively adds the best response (x, w) for a selected pivot point y into X.The pivot points are selected hierarchically.At an iteration j, the algorithm keeps a set of polyhedra S 1 . . .S j which represent the triangulation of the feasible space Y , which is possible based on Assumption 7.For each polyhedron S i = (y 1 . . .y n+1 ), the algorithm keeps a bound i on the maximal difference between the optimal solution on the polyhedron and the best solution found so far.This error bound on a polyhedron S i is defined as: // Add the best response to the pivot point y to the set X X ← X ∪ {arg max (x,w)∈X f (w, x, y, 0)} ; 8 // Calculate errors and pivot points of the refined polyhedra for k = 1, . . ., n + 1 do 9 j ← j + 1 ; 10 // Replace the k-th vertex by the pivot point y S j ← (y, y 1 . . .y k−1 , y k+1 , . . .y n+1 ) ; 11 ( j , φ j ) ← P olyhedronError(S j ) ; // Section 3.2,Section 3.3 12 // Take the smaller of the errors on the original and the refined polyhedron.The error may not increase with the refinement, although the bound may j ← min{ i , j } ; 13 // Set the error of the refined polyhedron to 0, since the region is covered by the refinements i ← 0 ; 14 (w, x, y, z) ← arg max {w,x,y,z (x,w)∈ X,(y,z)∈Y } f (w, x, y, 0) ; 15 return (w, x, y, z) ; 16 where X represents the current, not final, set of best responses.
Next, a point y 0 is selected as described below and n + 1 new polyhedra are created by replacing one of the vertices by y 0 to get: (y 0 , y 2 , . ..), (y 1 , y 0 , y 3 , . ..), . . ., (y 1 , . . ., y n , y 0 ).This is depicted for a 2-dimensional set Y in Figure 4.The old polyhedron is discarded and the above procedure is then repeatedly applied to the polyhedron with the maximal approximation error.
For the sake of clarity, the pseudo-code of Algorithm 2 is simplified and does not address any efficiency issues.In practice, g(y i ) could be cached, and the errors i could be stored in a prioritized heap or at least in a sorted array.In addition, a lower bound l i and an upper bound u i is calculated and stored for each polyhedron S i = (y 1 . . .y n+1 ).The function e(S i ) calculates their maximal difference on the polyhedron S i and the point where it is attained.The error bound i on the polyhedron S i may not be tight, as we describe in Remark 13.As a result, when the polyhedron S i is refined to n polyhedra S 1 . . .S n with online error bounds 1 . . .n , it is possible that for some k: k > i .Since S 1 . . .S n ⊆ S i , the true error on S k is less than on S i and therefore k may be set to i .
Conceptually, the algorithm is similar to CSA, but there are some important differences.The main difference is in the choice of the pivot point y 0 and the bounds on g.CSA does not keep any upper bound and it evaluates g(y) on all the intersection points of planes defined by the current solutions in X.That guarantees that g(y) is eventually known precisely (Becker et al., 2004).A similar approach was also taken for POMDPs (Cheng, 1988).The upper bound on the number of intersection points in CSA is | X| dim Y .The principal problem is that the bound is exponential in the dimension of Y , and experiments do not show a slower growth in typical problems.In contrast, we choose the pivot points to minimize the approximation error.This is more selective and tends to more rapidly reduce the error bound.In addition, the error at the pivot point may be used to determine the overall error bound.The following proposition states the soundness of the triangulation, proved in Appendix A. The correctness of the triangulation establishes that in each iteration the approximation error over Y is equivalent to the maximum of the approximation errors over the current polyhedra S 1 . . .S j .
Proposition 10.In the proposed triangulation, the sub-polyhedra do not overlap and they cover the whole feasible set Y , given that the pivot point is in the interior of S.

Online Error Bound
The selection of the pivot point plays a key role in the performance of the algorithm, in both calculating the error bound and the speed of convergence to the optimal solution.In this section we show exactly how we use the triangulation in the algorithm to calculate an error bound.To compute the approximation bound, we define the approximate best-response function g(y) as: g(y) = max {x,w (x,w)∈ X} f (w, x, y, 0).
Notice that z is not considered in this expression, since we assume that the bilinear program is in the semi-compact form.The value of the best approximate solution during the execution of the algorithm is: This value can be calculated at runtime when each new element of X is added.Then the maximal approximation error between the current solution and the optimal one may be calculated from the approximation error of the best-response function g(•), as stated by the following proposition.
Proposition 11.Consider a bilinear program in a semi-compact form.Then let w, x, ỹ be an optimal solution of Eq. ( 11) and let w * , x * , y * be an optimal solution of Eq. ( 1).The approximation error is then bounded by: Now, the approximation error is max y∈Y g(y) − g(y), which is bounded by the difference between an upper bound and a lower bound on g(y).Clearly, g(y) is a lower bound on g(y).Given points in which g(y) is the same as the best-response function g(y), we can use Jensen's inequality to obtain the upper bound.This is summarized by the following lemma.
The actual implementation of the bound relies on the choice of the pivot points.Next we describe the maximal error calculation on a single polyhedron defined by S = (y 1 . . .y n ).Let matrix T have y i as columns, and let L = {x 1 . . .x n+1 } be the set of the best responses for its vertices.The matrix T is used to convert any y in absolute coordinates to a relative representation t that is a convex combination of the vertices.This is defined formally as follows: where the y i 's are column vectors.We can represent a lower bound l(y) for g(y) and an upper bound u(y) for g(y) as: The upper bound correctness follows from Lemma 12. Notice that u(y) is a linear function, which enables us to use a linear program to determine the maximal-error point.
Algorithm 3: PolyhedronError(B, S) P ← one of Eq. ( 12), or (13), or ( 14), or (20) ; Remark 13.Notice that we use L instead of X in calculating l(y).Using all of X would lead to a tighter bound, as it is easy to show in three-dimensional examples.However, this also would substantially increase the computational complexity.Now, the error on a polyhedron S may be expressed as: As a result, the point with the maximal error bound may be determined using the following linear program in terms of variables t, : Here x is not a variable.The formulation is correct because all feasible solutions are bounded below the maximal error and any maximal-error solution is feasible.
We thus select the next pivot point to greedily minimize the error.The maximal difference is actually achieved in points where some of the planes meet, as Becker et al. (2004) have suggested.However, checking these intersections is very similar to running the simplex algorithm.In general, the simplex algorithm is preferable to interior point methods for this program because of its small size (Vanderbei, 2001).
Algorithm 3 shows a general way to calculate the maximal error and the pivot point on the polyhedron S.This algorithm may use the basic formulation in Eq. ( 12), or the more advanced formulations in Eqs. ( 13), ( 14), and (20) defined in Section 3.3.
In the following section, we describe a more refined pivot point selection method that can in some cases dramatically improve the performance.

Advanced Pivot Point Selection
As described above, the pivot points are chosen greedily to both determine the maximal error in each polyhedron and to minimize the approximation error.The basic approach described in Section 3.1 may be refined, because the goal is not to approximate the function g(y) with the least error, but to find the optimal solution.Intuitively, we can ignore those regions of Y that will not guarantee any improvement of the current solution, as illustrated in Figure 5.As we show below, the search for the maximal error point could be limited to this region as well.We first define a set Y h ⊆ Y that we will search for the maximal error, given that the optimal solution f * ≥ h.
The next proposition states that the maximal error needs to be calculated only in a superset of Y h .
Proposition 15.Let w, x, ỹ, z be the approximate optimal solution and w * , x * , y * , z * be the optimal solution.Also let f (w * , x * , y * , z * ) ≥ h and assume some Ỹh ⊇ Y h .The approximation error is then bounded by: Proof.First, f (w * , x * , y * , z * ) = g(y * ) ≥ h and thus y * ∈ Y h .Then: Proposition 15 indicates that the point with the maximal error needs to be selected only from the set Y h .The question is how to easily identify Y h .Because the set is not convex in general, a tight approximation of this set needs to be found.In particular, we use methods that approximate the intersection of a superset of Y h with the polyhedron that is being refined, using the following methods: 1. Feasibility [Eq.( 13)]: Require that pivot points are feasible in Y .2. Linear bound [Eq.( 14)]: Use the linear upper bound u(y) ≥ h.

3.
Cutting plane [Eq.( 20)]: Use the linear inequalities that define Y C h , where Any combination of these methods is also possible.
Feasibility The first method is the simplest, but also the least constraining.The linear program to find the pivot point with the maximal error bound is as follows: This approach does not require that the bilinear program is in the semi-compact form.
Linear Bound The second method, using the linear bound, is also very simple to implement and compute, and it is more selective than just requiring feasibility.Let: This set is convex and thus does not need to be approximated.The linear program used to find the pivot point with the maximal error bound is as follows: The difference from Eq. ( 12) is the last constraint.This approach requires that the bilinear program is in the semi-compact form to ensure that u(y) is a bound on the total return.

Cutting Plane
The third method, using the cutting plane elimination, is the most computationally intensive one, but also the most selective one.Using this approach requires additional assumptions on the other parts of the algorithm, which we discuss below.The method is based on the same principle as α-extensions in concave cuts (Horst & Tuy, 1996).We start with the set Y C h because it is convex and may be expressed as: Figure 6: Approximating Y h using the cutting plane elimination method.
To use these inequalities in selecting the pivot point, we need to make them linear.But there are two obstacles: Eq. ( 15) contains a bilinear term and is a maximization.Both of these issues can be addressed by using the dual formulation of Eq. ( 15).The corresponding linear program and its dual for fixed y, ignoring constants h and r T 2 y, are: Using the dual formulation, Eq. ( 15) becomes: Now, we use that for any function φ and any value θ the following holds: Finally, this leads to the following set of inequalities.
The above inequalities define the convex set Y C h .Because its complement Y h is not necessarily convex, we need to use its convex superset Ỹh on the given polyhedron.This is done by projecting Y C h , or its subset, onto the edges of each polyhedron as depicted in Figure 6 and described in Algorithm 4. The algorithm returns a single constraint which cuts off part of the set Y C h .Notice that only the combination of the first n points f k is Algorithm 4: PolyhedronCut({y 1 , . . ., y n+1 }, h) returns constraint σ T y ≤ τ // Find vertices of the polyhedron {y 1 , . . ., // Find at least n points f k in which the edge of Y h intersects an edge of the polyhedron k ← 1 ; 13 return σ T y ≤ τ 14 used.In general, there may be more than n points, and any subset of points f k of size n can be used to define a new cutting plane that constraints Y h .This did not lead to significant improvements in our experiments.The linear program to find the pivot point with the cutting plane option is as follows: Here, σ, and τ are obtained as a result of running Algorithm 4. Note that this approach requires that the bilinear program is in the semi-compact form to ensure that g(y) is convex.The following proposition states the correctness of this procedure.
Proposition 16.The resulting polyhedron produced by Algorithm 4 is a superset of the intersection of the polyhedron S with the complement of Y h .
Proof.The convexity of g(y) implies that Y C h is also convex.Therefore, the intersection

Dimensionality Reduction
Our experiments show that the efficiency of the algorithm depends heavily on the dimensionality of the matrix C in Eq. ( 1).In this section, we show the principles behind automatically determining the necessary dimensionality of a given problem.Using the proposed procedure, it is possible to identify weak interactions and eliminate them.Finally, the procedure works for arbitrary bilinear programs and is a generalization of a method we have previously introduced (Petrik & Zilberstein, 2007a).
The dimensionality is inherently part of the model, not the problem itself.There may be equivalent models of a given problem with very different dimensionality.Thus, procedures for reducing the dimensionality are not necessary when the modeler can create a model with minimal dimensionality.However, this is nontrivial in many cases.In addition, some dimensions may have little impact on the overall performance.To determine which ones can be discarded, we need a measure of their contribution that can be computed efficiently.We define these notions more formally later in this section.
We assume that the feasible sets have bounded L 2 norms, and assume a general formulation of the bilinear program, not necessarily in the semi-compact form.Given Assumption 7, this can be achieved by scaling the constraints when the feasible region is bounded.
Assumption 17.For all x ∈ X and y ∈ Y , their norms satisfy x 2 ≤ 1 and y 2 ≤ 1.
We discuss the implications of and problems with this assumption after presenting Theorem 18. Intuitively, the dimensionality reduction removes those dimensions where g(y) is constant, or almost constant.Interestingly, these dimensions may be recovered based on the eigenvectors and eigenvalues of C T C. We use the eigenvectors of C T C instead of the eigenvectors of C, because our analysis is based on L 2 norm of x and y and thus of C. The L 2 norm C 2 is bounded by the largest eigenvalue of C T C. In addition, a symmetric matrix is required to ensure that the eigenvectors are perpendicular and span the whole space.
Given a problem represented using Eq. ( 1), let F be a matrix whose columns are all the eigenvectors of C T C with eigenvalues greater than some λ.Let G be a matrix with all the remaining eigenvectors as columns.Notice that together, the columns of the matrices span the whole space and are real-valued, since C T C is a symmetric matrix.Assume without loss of generality that the eigenvectors are unitary.The compressed version of the bilinear program is then the following: Notice that the program is missing the element x T CGy 2 , which would make its optimal solutions identical to the optimal solutions of Eq. ( 1).We describe a more practical approach to reducing the dimensionality in Appendix B. This approach is based on singular value decomposition and may be directly applied to any bilinear program.The following theorem quantifies the maximum error when using the compressed program.
Theorem 18.Let f * and f * be optimal solutions of Eq. ( 1) and Eq. ( 21) respectively.Then: Moreover, this is the maximal linear dimensionality reduction possible with this error without considering the constraint structure.
Proof.We first show that indeed the error is at most √ λ and that any linearly compressed problem with the given error has at least f dimensions.Using a mapping that preserves the feasibility of both programs, the error is bounded by: Denote the feasible region of y 2 as Y 2 .From the orthogonality of [F, G], we have that y 2 2 ≤ 1 as follows: Then we have: The result follows from Cauchy-Schwartz inequality, the fact that C T C is symmetric, and Assumption 17.The matrix L denotes a diagonal matrix of eigenvalues corresponding to eigenvectors of G. Now, let H be an arbitrary matrix that satisfies the preceding error inequality for G. also that this dimensionality reduction technique ignores the constraint structure.When the constraints have some special structure, it might be possible to obtain an even tighter bound.As described in the next section, the dimensionality reduction technique generalizes the reduction that Becker et al. (2004) used implicitly.
The result of Theorem 18 is based on an approximation of the feasible set Y by y 2 ≤ 1, as Assumption 17 states.This approximation may be quite loose in some problems, which may lead to a significant multiplicative overestimation of the bound in Theorem 18.For example, consider the feasible set depicted in Figure 7.The bound may be achieved in a point ŷ, which is far from the feasible region.In specific problems, a tighter bound could be obtained by either appropriately scaling the constraints, or using a weighted L 2 with a better precision.We partially address this issue by considering the structure of the constraints.To derive this, consider the following linear program and corresponding theorem: Theorem 19.The optimal solution of Eq. ( 22) is the same as when the objective function is modified to where I is the identity matrix.
Proof.The objective function is: The first term may be ignored because it does not depend on the solution x.
The following corollary shows how the above theorem can be used to strengthen the dimensionality reduction bound.For example, in zero-sum games, this stronger dimensionality reduction splits the bilinear program into two linear programs.
Corollary 20.Assume that there are no variables w and z in Eq. (1).Let: where A i are defined in Eq. ( 1).Let C be: where C is the bilinear-term matrix from Eq. ( 1).Then the bilinear programs will have identical optimal solutions with either C or C.
Proof.Using Theorem 19, we can modify the original objective function in Eq. ( 1) to: For the sake of simplicity we ignore the variables w and z, which do not influence the bilinear term.Because both (I − A T i (A i A T i ) −1 A i ) for i = 1, 2 are orthogonal projection matrices, none of the eigenvalues in Theorem 18 will increase.
The dimensionality reduction presented in this section is related to the idea of compound events used in CSA.Allen, Petrik, andZilberstein (2008a, 2008b) provide a detailed discussion of this issue.

Offline Bound
In this section we develop an approximation bound that depends only on the number of points for which g(y) is evaluated and the structure of the problem.This kind of bound is useful in practice because it provides performance guarantees without actually solving the problem.In addition, the bound reveals which parameters of the problem influence the algorithm's performance.The bound is derived based on the maximal slope of g(y) and the maximal distance among the points.
Theorem 21.To achieve an approximation error of at most , the number of points to be evaluated in a regular grid with k points in every dimension must satisfy: where n is the number of dimensions of Y .
The theorem follows using basic algebraic manipulations from the following lemma.
Lemma 22. Assume that for each y 1 ∈ Y there exists y 2 ∈ Y such that y 1 − y 2 2 ≤ δ and g(y 2 ) = g(y 2 ).Then the maximal approximation error is: Proof.Let y 1 be a point where the maximal error is attained.This point is in Y , because this set is compact.Now, let y 2 be the closest point to y 1 in L 2 norm.Let x 1 and x 2 be the best responses for y 1 and y 2 respectively.From the definition of solution optimality we can derive: The error now can be expressed, using the fact that x 1 − x 2 2 ≤ 1, as: The above derivation follows from Assumption 17, and the bound reduces to the matrix norm using Cauchy-Schwartz inequality.
Not surprisingly, the bound is independent of the local rewards and transition structure of the agents.Thus it in fact shows that the complexity of achieving a fixed approximation with a fixed interaction structure is linear in the problem size.However, the bounds are still exponential in the dimensionality of the space.Notice also that the bound is additive.

Experimental Results
We now turn to an empirical analysis of the performance of the algorithm.For this purpose we use the Mars rover problem described earlier.We compared our algorithm with the original CSA and with a mixed integer linear program (MILP), derived for Eq.(1) as Petrik and Zilberstein (2007b) describe.Although Eq. ( 1) can also be modeled as a linear complementarity problem (LCP) (Murty, 1988;Cottle et al., 1992), we do not evaluate that option experimentally because LCPs are closely related to MILPs (Rosen, 1986).We expect these two formulations to exhibit similar performance.We also do not compare to any of the methods described by Horst and Tuy (1996) and Bennett and Mangasarian (1992) due to their very different nature and high complexity, and because some of these algorithms do not provide any optimality guarantees.
In our experiments, we applied the algorithm to randomly generated problem instances with the same parameters that Becker et al. (2003Becker et al. ( , 2004) ) used.Each problem instance includes 2 rovers and 6 sites.At each site, the rovers can decide to perform an experiment or to skip the site.Performing experiments takes some time, and all the experiments must be performed in 15 time units.The time required to perform an experiment is drawn from a discrete normal distribution with the mean uniformly chosen from 4.0-6.0.The variance Figure 8: The six algorithm configurations that were evaluated.Feasible, linear bound, and cutting plane refer to methods used to determine the optimal solution.
problem setup with at most 4 shared sites, CSA solved only 76% of the problems, and the longest solution took approximately 4 hours (Becker et al., 2004).In contrast, MPBP solved all 200 problems with 4 shared sites optimally in less than 1 second on average, about 10000 times faster.In addition, MPBP returns solutions that are guaranteed to be close to optimal in the first few iterations.While CSA also returns solutions close to optimal very rapidly, it takes a very long time to confirm that.
Figure 9 shows the average guaranteed ratio of the optimal solution, achieved as a function of the number of iterations, that is, points for which g(y) is evaluated.This figure, as all others, shows the result of the online error bound.This value is guaranteed and is not based on the optimal solution.This compares the performance of the various configurations of the algorithm, without using the presolve step.While the optimal solution was typically discovered in the first few iterations, it takes significantly longer to prove its optimality.
The average of absolute errors in both linear and log scale are shown in Figure 10.These results indicate that the methods proposed to eliminate the dominated region in searching for the pivot point can dramatically improve performance.While requiring that the new pivot points are feasible in Y improves the performance, it is much more significant with  a better approximation of Y h .As expected, the cutting plane elimination is most efficient, but also most complex.
To evaluate the tradeoffs in the implementation, we also show the average time per iteration and the average total time in Figure 11.These figures show that the time per iteration is significantly larger when the cutting plane elimination is used.Overall, the algorithm is faster when the simpler linear bound is used.This trend is most likely problem specific.In problems with higher dimensionality, the more precise cutting plane algorithm may be more efficient.Implementation issues play a significant role in this problem too, and it is likely that the implementation of Algorithm 4 can be further improved.Figure 12 shows the influence of using the presolve method.The plots of C 3 and C 4 are identical to the plots of C 5 and C 6 respectively, indicating that the presolve method does not have any significant influence.This also indicates that a solution that is very close to optimal is obtained when the values of the initial points are calculated.
We also performed experiments with CPLEX-a state-of-the-art MILP solver on the direct MILP formulation of the DEC-MDP.CPLEX was not able to solve any of the problems within 30 minutes, no matter how many of the sites were shared.The main reason for this that it does not take any advantage of the limited interaction.Nevertheless, it is possible that some specialized MILP solvers may perform better.

Conclusion and Further Work
We present an algorithm that significantly improves the state-of-the-art in solving two-agent coordination problems.The algorithm takes as input a bilinear program representing the problem, and solves the problem using a new successive approximation method.It provides a useful online performance bound that can be used to decide when the approximation is good enough.The algorithm can take advantage of the limited interaction among the agents, which is translated into a small dimensionality of the bilinear program.Moreover, using our approach, it is possible to reduce the dimensionality of such problems automatically, without extensive modeling effort.This makes it easy to apply our new method in practice.When applied to DEC-MDPs, the algorithm is much faster than the existing CSA method, on average reducing computation time by four orders of magnitude.We also show that a variety of other coordination problems can be treated within this framework.
Besides multiagent coordination problems, bilinear programs have been previously used to solve problems in operations research and global optimization (Sherali & Shetty, 1980;White, 1992;Gabriel, Garca-Bertrand, Sahakij, & Conejo, 2005).Global optimization deals with finding the optimal solutions to problems with multi-extremal objective function.Solution techniques often share the same idea and are based on cutting plane methods.The main idea is to iteratively restrict the set of feasible solutions, while improving the incumbent solution.Horst and Tuy (1996) provide an excellent overview of these techniques.These algorithms have different characteristics and cannot be directly compared to the algorithm we developed.Unlike these traditional algorithms, we focus on providing quickly good approximate solutions with error bounds.In addition, we exploit the small dimensionality of the best-response space Y to get tight approximation bounds.
Future work will address several interesting open questions with respect to the bilinear formulation as well as further improvement of the efficiency of the algorithm.With regard to the representation, it is yet to be determined whether the anytime behavior can be exploited when applied to games.That is, it is necessary to verify that an approximate solution to the bilinear program is also a meaningful approximation of the Nash equilibrium.It is also important to identify the classes of extensive games that can be efficiently formulated as bilinear programs.
The algorithm we present can be made more efficient in several ways.In particular, a significant speedup could be achieved by reducing the size of the individual linear programs.The programs are solved many times with the same constraints, but a different objective function.The objective function is always from a small-dimensional space.Therefore, the problems that are solved are all very similar.In the DEC-MDP domain, one option would be to use a procedure similar to action elimination.In addition, the performance could be significantly improved by starting with a tight initial triangulation.In our implementation, we simply use a single large polyhedron that covers the whole feasible region.A better approach would be to start with something that approximates the feasible region more tightly.A tighter approximation of the feasible region could also improve the precision of the dimensionality reduction procedure.Instead of the naive ellipsis used in Assumption 7, it is possible to use one that approximates the feasible region as tightly as possible.It is however very encouraging to see that even without these improvements, the algorithm is very effective compared with existing solution techniques.
denotes the original polyhedron and ŷ = T c is the pivot point, where 1 T c = 1 and c ≥ 0. Note that T is a matrix and c, d, ŷ are vectors, and β is a scalar.
We show that the sub-polyhedra cover the original polyhedron S as follows.Take any a = T d such that 1 T d = 1 and d ≥ 0. We show that there exists a sub-polyhedron that contains a and has ŷ as a vertex.First, let This matrix is square and invertible, since the polyhedron is non-empty.To get a representation of a that contains ŷ, we show that there is a vector o such that for some i, o(i) = 0: for some β > 0. This will ensure that a is in the sub-polyhedron with ŷ with vertex i replaced by ŷ.The value o depends on β as follows: This can be achieved by setting: Since both d and c = T −1 ŷ are non-negative.This leaves us with an equation for the sub-polyhedron containing the point a.Notice that the resulting polyhedron may be of a smaller dimension than n when o(j) = 0 for some i = j.
To show that the polyhedra do not overlap, assume there exists a point a that is common to the interior of at least two of the polyhedra.That is, assume that a is a convex combination of the vertices: where T 3 represents the set of points common to the two polyhedra, and y 1 and y 2 represent the disjoint points in the two polyhedra.The values h 1 , h 2 , β 1 , and β 2 are all scalars, while c 1 and c 2 are vectors.Notice that the sub-polyhedra differ by at most one vertex.The coefficients satisfy: Since the interior of the polyhedron is non-empty, this convex combination is unique.First assume that h = h 1 = h 2 .Then we can show the following: This holds since y 1 and y 2 are independent of T 3 when the polyhedron is nonempty and y 1 = y 2 .The last equality follows from the fact that y 1 and y 2 are linearly independent.This is a contradiction, since β 1 = β 2 = 0 implies that the point a is not in the interior of two polyhedra, but at their intersection.Finally, assume WLOG that h 1 > h 2 .Now let ŷ = T 3 ĉ + α 1 y 1 + α 2 y 2 , for some scalars α 1 ≥ 0 and α 2 ≥ 0 that represent a convex combination.We get: The coefficients sum to one as shown below.
Now, the convex combination is unique, and therefore the coefficients associated with each vertex for the two representations of a must be identical.In particular, equating the coefficients for y 1 and y 2 results in the following: We have that α 1 > 0 and α 2 > 0 from the fact that ŷ is in the interior of the polyhedron S.
Then, having β 2 ≤ 0 is a contradiction with a being a convex combination of the vertices of S.

Figure 3 :
Figure 3: A tree form of a policy for a DEC-POMDP or an extensive game.The dotted ellipses denote the information sets.

r
= c 21 b 11 r 1 + c 22 b 11 r 2 + c 23 b 12 r 3 + c 24 b 12 r 4 + c 25 b 11 r 5 + c 26 b 11 r 6 + c 27 b 12 r 7 + c 28 b 12 r 8 .Variables b ij are replaced in the same fashion.This objective function is clearly bilinear.

Figure 4 :
Figure 4: Refinement of a polyhedron in two dimensions with a pivot y 0 .

1t←←
the optimal solution of P ; 2 the optimal objective value of P ; 3 // Coordinates t are relative to the vertices of S, convert them to absolute values in Y φ ← T t ; 4 return ( , φ) ;5

Figure 5 :
Figure5: The reduced set Y h that needs to be considered for pivot point selection.

Figure 7 :
Figure 7: Approximation of the feasible set Y according to Assumption 17.

Figure 9 :
Figure 9: Guaranteed fraction of optimality according to the online bound.

Figure 10 :
Figure 10: Comparison of absolute error of various region elimination methods.

Figure 11 :Figure 12 :
Figure 11: Time per iteration and the total time to solve.With configurations C 1 and C 2 , the optimal value is not reached with 200 iterations .The figure only shows the time to compute up to 200 iterations.

Figure 13 :
Figure13: A plot of a non-convex best-response function g for a bilinear program, which is not in a semi-compact form.