Distributed Constraint Optimization Problems and Applications: A Survey

The field of Multi-Agent Systems (MAS) is an active area of research within Artificial Intelligence, with an increasingly important impact in industrial and other real-world applications. Within a MAS, autonomous agents interact to pursue personal interests and/or to achieve common objectives. Distributed Constraint Optimization Problems (DCOPs) have emerged as one of the prominent agent architectures to govern the agents' autonomous behavior, where both algorithms and communication models are driven by the structure of the specific problem. During the last decade, several extensions to the DCOP model have enabled it to support MAS in complex, real-time, and uncertain environments. This survey aims at providing an overview of the DCOP model, giving a classification of its multiple extensions and addressing both resolution methods and applications that find a natural mapping within each class of DCOPs. The proposed classification suggests several future perspectives for DCOP extensions, and identifies challenges in the design of efficient resolution algorithms, possibly through the adaptation of strategies from different areas.

system in the pursuit of some goals. A multi-agent system (MAS) is a system where multiple agents interact in the pursuit of goals. Within a MAS, agents may interact with each other directly, via communication acts, or indirectly, by acting on the shared environment. In addition, agents may decide to cooperate, to achieve a common goal, or to compete, to serve their own interests at the expense of other agents. In particular, agents may form cooperative teams, which can in turn compete against other teams of agents. Multi-agent systems play an important role in distributed artificial intelligence, thanks to their ability to model a wide variety of real-world scenarios, where information and control are decentralized and distributed among a set of agents.
Figure 2 illustrates a MAS, representing a sensor network scenario, where a group of agents, equipped with sensors, seeks to determine the position of some targets (identified in the figure as star-shaped objects). Agents may interact with each other (the dotted lines in the figure define the interaction graph). Agents may move away from their current position (the directional arrows illustrate such actions). In addition, various events that obstruct the sensors of the agents (e.g., the presence of an obstacle along the sensing range of an agent) may dynamically occur.
Within a MAS, an agent is:
• Autonomous, as it operates without the direct intervention of humans or other entities and has full control over its own actions and internal state (e.g., in the example, an agent can decide to sense, to move, etc.);
• Interactant, in the sense that it interacts with other agents in order to achieve its objectives (e.g., in the example, agents may exchange information concerning results of sensing activities);
• Reactive, as it responds to changes that occur in the environment and/or to the requests from other agents (e.g., in the example, agents may react with a move to the sudden appearance of obstacles);
• Proactive, because of its goal-driven behavior, which allows the agent to take initiatives beyond the reactions in response to its environment.
Agent architectures are the fundamental mechanisms underlying the autonomous agent components, supporting their behavior in real-world, dynamic and uncertain environments. Currently, agent architectures based on Decision Theory (DT), Game Theory (GT), and Constraint Programming (CP) have successfully been developed and are popular in the Autonomous Agents and Multi-Agent Systems (AAMAS) community.
Decision Theory (DT) [155] assumes that the agent and the environment are inherently uncertain, and DT models such uncertainty explicitly. Acting in complex and dynamic environments requires agents to deal with various sources of uncertainty. Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) [18] are one of the most general multi-agent frameworks focused on team coordination in the presence of uncertainty about agents' actions and observations. The ability to capture a wide range of complex scenarios makes Dec-POMDPs of central interest within MAS research. However, the price of this generality is a high complexity for generating optimal solutions. Dec-POMDPs are NEXP-complete (complete for non-deterministic exponential time) [18], even for two-agent problems, and scalability remains a critical challenge [7].
Game Theory (GT) [25] studies interactions between self-interested agents, aiming at maximizing the welfare of the participants. Some of the most compelling applications of game theory to MAS have been in the area of auctions and negotiations [82,132,139]. Such approaches model the trading process by which agents can reach agreements on matters of common interest, using market-oriented and cooperative mechanisms, such as reaching Nash Equilibria (NEs). Typical resolution approaches aim at deriving a set of equilibrium strategies for each agent, such that, when these strategies are employed, no agent can profit by unilaterally deviating from its strategy. A limitation of game-theoretical approaches is the lack of an agent's ability to reason upon a global objective, as the underlying model relies on the interactions of self-interested agents.
Constraint Programming (CP) [161] aims at solving decision-making problems formulated as optimization problems of some real-world objective. Constraint programs use the notion of constraints, i.e., relations among entities of the problem (variables), in both problem modeling and problem solving. CP relies on inference techniques which prevent the exploration of those parts of the solution search space whose assignments to variables are inconsistent with the constraints and/or dominated with respect to the objective function.
Distributed Constraint Optimization Problems (DCOPs) [121,147,193] are problems where agents need to coordinate their value assignments, in a decentralized manner, to optimize their objective functions. DCOPs focus on attaining a global optimum given the interaction graph of a collection of agents. This approach is flexible and can effectively model a wide range of problems. In general, problem solving and communication strategies are directly linked in DCOPs. This feature makes the algorithmic components of DCOPs suitable for exploiting the structure of the interaction graph to generate efficient solutions.
The absence of a framework to model dynamic problems and uncertainty makes classical DCOPs unsuitable for solving certain classes of multi-agent problems, such as those characterized by action uncertainty and dynamic environments. However, since its original introduction, the DCOP model has undergone a continuous evolution, to capture a variety of characteristics regarding agents' behavior and the environment in which agents act. Researchers have proposed a number of DCOP frameworks that differ from each other in terms of expressivity and classes of problems they can target, extending the DCOP model to handle both dynamic and uncertain environments. However, current research has not explored how the different DCOP frameworks relate to each other within the general MAS context, which is critical to understand: (a) what resolution methods could be borrowed from other MAS paradigms, and (b) what applications can be most effectively modeled within each framework.
This survey aims to comprehensively analyze and categorize the different DCOP frameworks proposed by the AAMAS community. We do so by presenting an extensive review of the DCOP model and its extensions, the different resolution methods, as well as a number of applications modeled within each particular DCOP extension. This analysis also provides opportunities to identify and discuss future directions in the general DCOP research area.
This survey paper is organized as follows. In the next section, we provide an overview of two relevant Constraint Satisfaction models and their generalization to the distributed cases. We then introduce, in Section 3, DCOPs, give an overview of the representation and coordination models adopted in DCOP resolution, and propose a classification of the different variants of DCOPs based on the characteristics of the agents and the environment. Section 4 presents the classical DCOP model as well as two notable extensions characterized, respectively, by asymmetric function utilities and multi-objective optimization. In Section 5, we present DCOP models where the environment changes over time. In Section 6, we discuss DCOP models in which agents act under uncertainty and may have partial knowledge of the environment in which they act. Section 7 discusses DCOP models in which agents are non-cooperative. For each of these models, we introduce their formal definitions, related concepts and resolution algorithms; Table 4 summarizes the complexity of the various classes of problems. In Section 8, we describe a number of applications that have been proposed in the DCOP literature. Section 9 provides a critical review of the various DCOP variants surveyed and focuses on their applicability. Additionally, it describes some potential future directions for research. Finally, Section 10 provides some concluding remarks. To facilitate the reading of this survey, we have provided in Table 1 a summary of the most commonly used notations.

Overview of (Distributed) Constraint Satisfaction and Optimization
In this section, we provide an overview of several models of Constraint Satisfaction Problems, which form the foundation of Distributed Constraint Optimization. Figure 3 illustrates the relations among these models.

Constraint Satisfaction Problems
Constraint Satisfaction Problems (CSPs) [9,161] are decision problems that involve the assignment of values to variables, under a set of specified constraints on how variable values should be related to one another. A number of problems can be formulated as CSPs, including resource allocation, vehicle routing, circuit diagnosis, scheduling, and bioinformatics. Over the years, CSPs have become a paradigm of choice to address hard combinatorial problems, drawing and integrating insights from diverse domains, including Artificial Intelligence and Operations Research [161].
Formally, a CSP is a triple $\langle X, D, C \rangle$, where:
• $X = \{x_1, \ldots, x_n\}$ is a finite set of variables.
• $D = \{D_1, \ldots, D_n\}$ is a set of finite domains for the variables in $X$, with $D_i$ being the set of possible values for the variable $x_i$.
• $C$ is a finite set of constraints over subsets of $X$, where a constraint $c_i$, defined on the $m$ variables $x_{i_1}, \ldots, x_{i_m}$, is a relation $c_i \subseteq \prod_{j=1}^{m} D_{i_j}$. The set of variables $\mathbf{x}^i = \{x_{i_1}, \ldots, x_{i_m}\}$ is referred to as the scope of $c_i$. If $m = 1$, $c_i$ is called a unary constraint; if $m = 2$, it is called a binary constraint. For all other $m > 2$, the constraint is referred to as a global constraint.
A solution is a value assignment for a subset of variables from $X$ that is consistent with their respective domains; i.e., it is a partial function $\sigma : X \to \bigcup_{i=1}^{n} D_i$ such that, for each $x_j \in X$, if $\sigma(x_j)$ is defined, then $\sigma(x_j) \in D_j$. A solution is complete if it assigns a value to each variable in $X$. We will use the notation $\sigma$ to denote a complete solution, and, for a set of variables $V = \{x_{i_1}, \ldots, x_{i_h}\} \subseteq X$, $\sigma_V = \langle \sigma(x_{i_1}), \ldots, \sigma(x_{i_h}) \rangle$, where $i_1 < \cdots < i_h$, to denote the projection of the values in $\sigma$ associated to the variables in $V$. The goal in a CSP is to find a complete solution $\sigma$ such that for each $c_i \in C$, $\sigma_{\mathbf{x}^i} \in c_i$, that is, one that satisfies all the problem constraints.
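The definition above can be made concrete with a minimal sketch. The toy instance below (three variables with pairwise "not equal" constraints) and all names in it (`domains`, `constraints`, `project`, `is_consistent`, `solve`) are illustrative assumptions rather than part of the surveyed material; the exhaustive enumeration is only meant to show what a complete solution and the projection $\sigma_{\mathbf{x}^i}$ look like, not how practical CSP solvers work.

```python
from itertools import product

# Hypothetical toy CSP: X = {x1, x2, x3} with pairwise "not equal" constraints.
variables = ["x1", "x2", "x3"]
domains = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1, 2]}


def not_equal(scope):
    """Relation c_i: the set of value tuples over the scope whose two entries differ."""
    return {t for t in product(*(domains[v] for v in scope)) if t[0] != t[1]}


# Each constraint is a pair (scope, relation), with relation a subset of the
# Cartesian product of the domains of the variables in the scope.
constraints = [(s, not_equal(s)) for s in [("x1", "x2"), ("x2", "x3"), ("x1", "x3")]]


def project(sigma, scope):
    """Projection of the assignment sigma onto the variables in scope."""
    return tuple(sigma[v] for v in scope)


def is_consistent(sigma):
    """True if the complete assignment sigma satisfies every constraint."""
    return all(project(sigma, scope) in relation for scope, relation in constraints)


def solve():
    """Enumerate all complete solutions that satisfy all constraints."""
    for values in product(*(domains[v] for v in variables)):
        sigma = dict(zip(variables, values))
        if is_consistent(sigma):
            yield sigma


solutions = list(solve())
```

In this instance, `x1` and `x2` must take the two distinct values of their binary domains, so `x3` is forced to the only remaining value.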

Weighted Constraint Satisfaction Problems
A solution to a CSP must satisfy all its constraints. In many practical cases, however, it is not possible to satisfy all constraints of a problem, while at the same time it is desirable to consider assignments whose constraints can be violated (according to a violation degree) and in which preferences among solutions can be expressed. To faithfully represent such properties, researchers introduced the notion of Weighted Constraint Satisfaction Problems (WCSPs) [88,164], which are problems whose constraints are considered as preferences, specifying to what extent they are satisfied (or violated).
A weighted constraint satisfaction problem (WCSP) [164,88] is a tuple $\langle X, D, F \rangle$, where $X$ and $D$ are the set of variables and their domains defined as in a CSP, and $F$ is a set of weighted constraints (or reward functions). A weighted constraint $f_i \in F$ is a function $f_i : \prod_{x_j \in \mathbf{x}^i} D_j \to \mathbb{R}_+ \cup \{-\infty\}$, where $\mathbf{x}^i \subseteq X$ is the scope of $f_i$. The utility of an assignment $\sigma$ is the sum of the evaluation of the constraints involving all the variables in $\sigma$. A solution is a complete assignment with non-negative utility, and an optimal solution is a solution with maximal utility.
Thus, a WCSP is a generalization of a CSP which, in turn, can be seen as a WCSP whose constraints use exclusively the utilities 0 and $-\infty$.
When the elements of a CSP are distributed among a set of autonomous agents, we refer to it as a Distributed Constraint Satisfaction Problem (DisCSP) [194,195]. Formally, a DisCSP is described by a tuple $\langle A, X, D, C, \alpha \rangle$, where $X$, $D$ and $C$ are the set of variables, their domains, and the set of constraints, defined as in a classical CSP, $A = \{a_1, \ldots, a_m\}$ is a finite set of autonomous agents, and $\alpha : X \to A$ is a surjective function, from variables to agents, which assigns the control of each variable $x \in X$ to an agent $\alpha(x)$. The goal in a DisCSP is to find a complete solution that satisfies all the constraints of the problem. DisCSPs can be seen as an extension of CSPs to the multi-agent case, where agents communicate with each other to assign values to the variables they control so as to satisfy all the problem constraints. For a more detailed treatment of the topic, we refer the reader to [161] (Chapter 20).
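The WCSP utility defined above, and the view of a CSP as a WCSP whose constraints only take the values 0 and $-\infty$, can be sketched in a few lines of Python. The two weighted constraints `f1` and `f12`, their reward values, and the helper names (`utility`, `optimal_solution`, `as_weighted`) are hypothetical and chosen only for illustration.

```python
import math
from itertools import product

NEG_INF = -math.inf  # value of a forbidden combination

# Hypothetical toy instance with two binary-domain variables.
domains = {"x1": [0, 1], "x2": [0, 1]}


def f1(x1):
    """Unary weighted constraint: prefers x1 = 1."""
    return 2.0 if x1 == 1 else 0.0


def f12(x1, x2):
    """Binary weighted constraint: forbids (1, 1), prefers different values."""
    if x1 == 1 and x2 == 1:
        return NEG_INF
    return 3.0 if x1 != x2 else 1.0


def utility(sigma, weighted_constraints):
    """Sum of the weighted constraints evaluated on the complete assignment sigma."""
    return sum(f(*(sigma[v] for v in scope)) for scope, f in weighted_constraints)


def complete_assignments():
    names = list(domains)
    for values in product(*(domains[v] for v in names)):
        yield dict(zip(names, values))


def optimal_solution(weighted_constraints):
    """Exhaustive search for a complete assignment with maximal utility."""
    return max(complete_assignments(), key=lambda s: utility(s, weighted_constraints))


def as_weighted(relation):
    """Encode a CSP relation (a set of allowed tuples) as a 0 / -inf function."""
    return lambda *values: 0.0 if values in relation else NEG_INF


wcsp = [(("x1",), f1), (("x1", "x2"), f12)]
csp_as_wcsp = [(("x1", "x2"), as_weighted({(0, 1), (1, 0)}))]

best = optimal_solution(wcsp)
csp_solutions = [s for s in complete_assignments() if utility(s, csp_as_wcsp) == 0.0]
```

Under this encoding, the complete assignments of `csp_as_wcsp` with utility 0 are exactly the solutions of the underlying CSP.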

Distributed Constraint Optimization Problems
Similar to the generalization of CSPs to WCSPs, the Distributed Constraint Optimization Problem (DCOP) [121,147,193] model emerges as a generalization of the DisCSP model, where constraints specify a degree of preference over their violation, rather than a Boolean satisfaction metric. DCOPs can be viewed as an extension of the WCSP framework to the multi-agent case, where agents control variables and constraints, and need to coordinate the value assignment for the variables they control so as to optimize a global objective function. We formally introduce the DCOP framework in the next section.

DCOP Classification
The DCOP model has undergone a process of continuous evolution to capture diverse characteristics of the agents' behavior and the environment in which they operate. We propose a classification of DCOP models from a Multi-Agent Systems perspective, that accounts for the different assumptions made about the behavior of agents and their interactions with the environment. The classification is based on the following elements:
• Agent Behavior: This parameter captures the stochastic nature of the effects of an action being executed. In particular, we distinguish between deterministic and stochastic effects.
• Agent Knowledge: This parameter captures the knowledge of an agent about its own state and the environment, distinguishing between total and partial (i.e., incomplete) knowledge.
• Agent Teamwork: This parameter characterizes the approach undertaken by (teams of) agents to solve a distributed problem. It can be either a cooperative resolution approach or a competitive resolution approach. In the former class, all agents cooperate to achieve a common goal (i.e., optimize a reward function). In the latter class, each agent (or team of agents) seeks to achieve its own individual goal.
• Environment Behavior: This parameter captures the exogenous properties of the environment. For example, it is possible to distinguish between deterministic and stochastic responses of the environment to the execution of an action.
• Environment Evolution: This parameter captures whether the DCOP is static (i.e., it does not change over time) or dynamic (i.e., it changes over time).
Figure 4 illustrates a categorization of the DCOP models proposed to date from a MAS perspective. In particular, we focus on the DCOP models proposed at the junction of Constraint Programming (CP), Game Theory (GT), and Decision Theory (DT). The classical DCOP model is directly inherited from CP and characterized by a static model, a deterministic environment and agent behavior, total agent knowledge, and cooperative agents. Concepts from auctions and negotiations, traditionally explored in GT, have influenced the DCOP framework, leading to Asymmetric DCOPs, which have asymmetric agent payoffs, and Multi-Objective DCOPs. The DCOP framework has borrowed fundamental DT concepts related to modeling uncertain and dynamic environments, resulting in models like Probabilistic DCOPs and Dynamic DCOPs. Researchers from the DCOP community have also designed solutions that inherit from all three communities.
In the next sections, we will describe the different DCOP frameworks, starting with classical DCOPs before proceeding to their various extensions. We will focus on a categorization based on three dimensions: agent knowledge, environment behavior, and environment evolution. We assume deterministic agent behavior, fully cooperative agent teamwork, and total agent knowledge (unless otherwise specified), as they are, by far, the most common assumptions adopted by the DCOP community. The DCOP models associated with such categorization are summarized in Table 3. The bottom-right entry of the table is left empty, indicating a promising model with dynamic and uncertain environments that, to the best of our knowledge, has not been explored yet. There has been only a modest amount of effort in modeling the different aspects of teamwork within the DCOP community. We will describe a formalism that has been adopted to model DCOPs with mixed cooperative and competitive agents in Section 7.

Classical DCOP
With respect to our categorization, in the classical DCOP model [121,147,193], the agents are completely cooperative and they have deterministic behavior and total knowledge. Additionally, the environment is static and deterministic. In this section, we review the formal definitions of classical DCOPs, present some relevant solving algorithms, and provide details of selected variants of classical DCOPs of particular interest.

Definition
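A classical DCOP is described by a tuple $P = \langle A, X, D, F, \alpha \rangle$, where:
• $A = \{a_1, \ldots, a_m\}$ is a finite set of agents.
• $X = \{x_1, \ldots, x_n\}$ is a finite set of variables.
• $D = \{D_1, \ldots, D_n\}$ is the set of domains for the variables in $X$, with $D_i$ being the domain of variable $x_i$.
• $F = \{f_1, \ldots, f_k\}$ is a finite set of functions, with $f_i : \prod_{x_j \in \mathbf{x}^i} D_j \to \mathbb{R}_+ \cup \{\perp\}$, where $\mathbf{x}^i \subseteq X$ is the set of variables relevant to $f_i$, referred to as the scope of $f_i$, and $\perp$ is a special element used to denote that a given combination of values for the variables in $\mathbf{x}^i$ is not allowed. Each function $f_i$ represents a factor in a global objective function, $F_g(X) = \sum_{i=1}^{k} f_i(\mathbf{x}^i)$. In the DCOP literature, the functions $f_i$ are also called constraints, cost functions, utility functions, or reward functions.
• $\alpha : X \to A$ is an onto total function, from variables to agents, which assigns the control of each variable $x \in X$ to an agent $\alpha(x)$.
With a slight abuse of notation, we will denote with $\alpha(f_i)$ the set of agents whose variables are involved in the scope of $f_i$, i.e., $\alpha(f_i) = \{\alpha(x) \mid x \in \mathbf{x}^i\}$. A solution is a value assignment for a subset of variables of $X$. A solution is complete if it assigns a value to each variable in $X$. For a given solution $\sigma$, we say that a constraint $f_i$ is satisfied by $\sigma$ if $f_i(\sigma_{\mathbf{x}^i}) \neq \perp$. The goal in a DCOP is to find a complete solution that maximizes the total problem reward expressed by its reward functions:
$$\sigma^* = \operatorname*{argmax}_{\sigma \in \Sigma} F_g(\sigma) = \operatorname*{argmax}_{\sigma \in \Sigma} \sum_{f_i \in F} f_i(\sigma_{\mathbf{x}^i}),$$
where $\Sigma$ is the state space, defined as the set of all possible complete solutions satisfying all problem constraints.
Let us also introduce the following notations. Given an agent $a_i$, we denote with $L_{a_i} = \{x_j \in X \mid \alpha(x_j) = a_i\}$ the set of variables controlled by agent $a_i$, or its local variables, and we denote with $N_{a_i}$ the set of its neighboring agents, i.e., the agents other than $a_i$ that control a variable appearing in the scope of some function together with a variable of $a_i$. A constraint $f_i$ is said to be hard if it only takes values in $\{0, \perp\}$. Otherwise, the constraint is said to be soft.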
Finding an optimal solution for a classical DCOP is known to be NP-hard [121].
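To make the objective concrete, the following minimal sketch encodes a small DCOP and finds $\sigma^*$ by exhaustive, centralized enumeration. It is not a DCOP algorithm (no distribution or message passing is modeled), and the enumeration is exponential in the number of variables, in line with the NP-hardness of the problem. The instance mirrors the structure of the four-agent example used later in this section (a ternary function over $x_1, x_2, x_3$ and a binary function over $x_2, x_4$), but the reward values, the use of `None` for $\perp$, and all function names are assumptions made for illustration.

```python
from itertools import product

BOTTOM = None  # stands for the special element ⊥ (combination not allowed)

# alpha maps each variable to the agent that controls it.
alpha = {"x1": "a1", "x2": "a2", "x3": "a3", "x4": "a4"}
domains = {x: [0, 1] for x in alpha}


def f123(x1, x2, x3):
    """Hypothetical ternary reward: all-zero is forbidden, otherwise count the ones."""
    return BOTTOM if (x1, x2, x3) == (0, 0, 0) else float(x1 + x2 + x3)


def f24(x2, x4):
    """Hypothetical binary reward: prefers different values."""
    return 5.0 if x2 != x4 else 0.0


functions = [(("x1", "x2", "x3"), f123), (("x2", "x4"), f24)]


def global_reward(sigma):
    """F_g(sigma), or BOTTOM if some function is not satisfied by sigma."""
    total = 0.0
    for scope, f in functions:
        reward = f(*(sigma[x] for x in scope))
        if reward is BOTTOM:
            return BOTTOM
        total += reward
    return total


def solve():
    """Return a complete solution maximizing F_g over the satisfying solutions."""
    best_sigma, best_reward = None, None
    names = list(domains)
    for values in product(*(domains[x] for x in names)):
        sigma = dict(zip(names, values))
        reward = global_reward(sigma)
        if reward is not BOTTOM and (best_reward is None or reward > best_reward):
            best_sigma, best_reward = sigma, reward
    return best_sigma, best_reward


def local_variables(agent):
    """L_a: the variables controlled by the given agent."""
    return {x for x, a in alpha.items() if a == agent}


def agents_of(scope):
    """alpha(f): the agents controlling the variables in the scope of a function."""
    return {alpha[x] for x in scope}


best_sigma, best_reward = solve()
```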

DCOP: Representation and Coordination
Representation in DCOPs plays a fundamental role, both from an agent coordination perspective and from an algorithmic perspective. We discuss here the most predominant representations adopted in various DCOP algorithms. Let us start by describing some widely adopted assumptions regarding agent knowledge and coordination, which will apply throughout this document, unless otherwise stated:
i. A variable and its domain are known exclusively to the agent controlling it and its neighboring agents.
ii. Each agent knows the reward values of the constraints involving at least one of its local variables. No other agent has knowledge about such constraints.
iii. Each agent knows exclusively (and it may communicate with) its own neighboring agents.

Constraint Graph
Given a DCOP $P$, $G_P = (X, E_C)$ is the constraint graph of $P$, where an undirected edge $\{x, y\} \in E_C$ exists if and only if there exists $f_j \in F$ such that $\{x, y\} \subseteq \mathbf{x}^j$. A constraint graph is a standard way to visualize a DCOP instance. It underlines the agents' locality of interactions and therefore it is commonly adopted by DCOP resolution algorithms.
Given an ordering $o$ on $X$, we say that a variable $x_i$ has a higher priority with respect to a variable $x_j$ if $x_i$ appears before $x_j$ in $o$. Given a constraint graph $G_P$ and an ordering $o$ on its nodes, the induced graph $G^*_P$ on $o$ is the graph obtained by connecting nodes, processed in increasing order of priority, to all their higher-priority neighbors. For a given node, the number of higher-priority neighbors is referred to as its width. The induced width $w^*_o$ of $G_P$ is the maximum width over all the nodes of $G^*_P$ on ordering $o$.
Figure 5(a) shows an example constraint graph of a DCOP with four agents $a_1$ through $a_4$, each controlling one variable with domain $\{0, 1\}$. There are two constraints: a ternary constraint $f_{123}$, with scope $\mathbf{x}^{123} = \{x_1, x_2, x_3\}$ and represented by a clique among $x_1$, $x_2$ and $x_3$, and a binary constraint $f_{24}$ with scope $\mathbf{x}^{24} = \{x_2, x_4\}$.
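As an illustration, the induced width of the example constraint graph can be computed with a short sketch. It assumes the usual construction of the induced graph, in which nodes are processed from lowest to highest priority and the higher-priority neighbors of each processed node are connected to one another; this reading of the definition, as well as the function name `induced_width`, is an assumption of the sketch.

```python
def induced_width(edges, order):
    """Induced width of an undirected graph along an ordering of its nodes.

    Nodes are processed from last (lowest priority) to first (highest priority);
    the earlier neighbors of each processed node are pairwise connected.
    """
    position = {v: i for i, v in enumerate(order)}
    adjacency = {v: set() for v in order}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    width = 0
    for v in reversed(order):
        earlier = {u for u in adjacency[v] if position[u] < position[v]}
        width = max(width, len(earlier))
        # Add the induced edges among the higher-priority neighbors of v.
        for a in earlier:
            for b in earlier:
                if a != b:
                    adjacency[a].add(b)
    return width


# Constraint graph of the example: a clique on x1, x2, x3 plus the edge {x2, x4}.
edges = [("x1", "x2"), ("x1", "x3"), ("x2", "x3"), ("x2", "x4")]

width_natural = induced_width(edges, ["x1", "x2", "x3", "x4"])
width_other = induced_width(edges, ["x4", "x3", "x1", "x2"])
```

The two orderings give different induced widths, which shows why the choice of ordering matters for algorithms whose complexity depends on $w^*$.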

Pseudo-Tree
A number of DCOP algorithms require a partial ordering among the agents. In particular, when such an order is derived from a depth-first search exploration, the resulting structure is known as a (DFS) pseudo-tree. A pseudo-tree arrangement for a DCOP $P$ is a subgraph $T_P = \langle X, E_T \rangle$ of $G_P$ such that $T_P$ is a spanning tree of $G_P$, i.e., a connected subgraph of $G_P$ containing all the nodes and being a rooted tree, with the following additional condition: for each $x, y \in X$, if $\{x, y\} \subseteq \mathbf{x}^i$ for some $f_i \in F$, then $x, y$ appear in the same branch of $T_P$ (i.e., $x$ is an ancestor of $y$ in $T_P$ or vice versa). Edges of $G_P$ that are in (respectively out of) $E_T$ are called tree edges (respectively backedges). The tree edges connect parent-child nodes, while backedges connect a node with its pseudo-parents and its pseudo-children. We use $C_{a_i}$, $PC_{a_i}$, $P_{a_i}$, $PP_{a_i}$ to denote the set of children, pseudo-children, parent and pseudo-parents of the agent $a_i$.
Neither the constraint graph nor the pseudo-tree representation can deal explicitly with n-ary constraints (functions whose scope has more than two variables). A typical artifact to deal with n-ary constraints in a pseudo-tree representation is to introduce a virtual variable which monitors the value assignments for all the variables in the scope of the constraint, and generates the reward values [26]; the role of the virtual variables can be delegated to one of the variables participating in the constraint [141,110].
Figure 5(b) shows one possible pseudo-tree of the example DCOP in Figure 5(a), where $C_{a_1} = \{x_2\}$, $PC_{a_1} = \{x_3\}$, $P_{a_4} = \{x_2\}$, and $PP_{a_3} = \{x_1\}$. The solid lines are tree edges and dotted lines are backedges.
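A pseudo-tree with these relations can be obtained from a depth-first traversal of the constraint graph. The sketch below is a centralized illustration only (distributed DFS-tree construction protocols are not modeled); the function name `dfs_pseudo_tree` and the choice of visiting neighbors in sorted order are assumptions made so that the result is deterministic.

```python
def dfs_pseudo_tree(adjacency, root):
    """Build a DFS pseudo-tree of an undirected graph given as {node: set of neighbors}."""
    parent = {root: None}
    children = {v: [] for v in adjacency}
    pseudo_parents = {v: [] for v in adjacency}
    pseudo_children = {v: [] for v in adjacency}
    on_path = set()  # nodes on the current root-to-node path (the ancestors)

    def visit(v):
        on_path.add(v)
        for u in sorted(adjacency[v]):
            if u not in parent:
                # Tree edge: u is discovered from v.
                parent[u] = v
                children[v].append(u)
                visit(u)
            elif u != parent[v] and u in on_path:
                # Backedge to a proper ancestor other than the parent.
                pseudo_parents[v].append(u)
                pseudo_children[u].append(v)
        on_path.discard(v)

    visit(root)
    return parent, children, pseudo_parents, pseudo_children


# Constraint graph of the example: a clique on x1, x2, x3 plus the edge {x2, x4}.
adjacency = {
    "x1": {"x2", "x3"},
    "x2": {"x1", "x3", "x4"},
    "x3": {"x1", "x2"},
    "x4": {"x2"},
}

parent, children, pseudo_parents, pseudo_children = dfs_pseudo_tree(adjacency, "x1")
```

Because every non-tree edge found by a depth-first traversal of an undirected graph links a node to one of its ancestors, any two variables sharing a constraint end up in the same branch, as the pseudo-tree condition requires.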

Factor Graph
Another way to represent DCOPs is through a factor graph [83]. A factor graph is a bipartite graph used to represent the factorization of a function. In particular, given the global objective function $F_g$, the corresponding factor graph $F_P = \langle X, F, E_F \rangle$ is composed of variable nodes $x_i \in X$, factor nodes $f_j \in F$ and edges $E_F$ such that there is an undirected edge between factor node $f_j$ and variable node $x_i$ if $x_i$ is in the scope of $f_j$.
Factor graphs can handle n-ary constraints explicitly. To do so, they use a method similar to that adopted within pseudo-trees with n-ary constraints: they delegate the control of a factor node to one of the agents controlling a variable in the scope of the constraint. From an algorithmic perspective, the algorithms designed over factor graphs can directly handle n-ary constraints, while algorithms designed over pseudo-trees require changes in the algorithm design so as to delegate the control of the n-ary constraints to some particular entity.
Figure 5(c) shows the factor graph of the example DCOP in Figure 5(a), where each agent $a_i$ controls its variable $x_i$ and, in addition, $a_3$ controls the constraint $f_{123}$ and $a_4$ controls $f_{24}$.
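The factor graph of the example can be written down directly from the scopes of the functions. In the sketch below, the dictionary-based encoding and the helper names (`build_factor_graph`, `neighbors`, `owner_is_valid`) are illustrative assumptions; the assignment of factor nodes to agents follows the description of Figure 5(c).

```python
# Scopes of the two functions of the example and the variable-to-agent mapping.
scopes = {"f123": ("x1", "x2", "x3"), "f24": ("x2", "x4")}
alpha = {"x1": "a1", "x2": "a2", "x3": "a3", "x4": "a4"}

# Agent in charge of each factor node, as described for Figure 5(c).
factor_owner = {"f123": "a3", "f24": "a4"}


def build_factor_graph(scopes):
    """Return (variable nodes, factor nodes, edges) of the bipartite factor graph."""
    variable_nodes = sorted({x for scope in scopes.values() for x in scope})
    factor_nodes = sorted(scopes)
    # One undirected edge (f, x) for every variable x in the scope of f.
    edges = {(f, x) for f, scope in scopes.items() for x in scope}
    return variable_nodes, factor_nodes, edges


def neighbors(node, edges):
    """Nodes adjacent to the given variable or factor node."""
    return {x if f == node else f for f, x in edges if node in (f, x)}


def owner_is_valid(factor_owner, scopes, alpha):
    """Check that each factor is controlled by an agent owning a variable in its scope."""
    return all(factor_owner[f] in {alpha[x] for x in scopes[f]} for f in scopes)


variable_nodes, factor_nodes, edges = build_factor_graph(scopes)
```

Since every edge joins a factor node to a variable node, the ternary function needs no special treatment: it is simply a factor node of degree three.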

Algorithms
The field of classical DCOPs is mature and a number of different resolution algorithms have been proposed. DCOP algorithms can be classified as being either complete or incomplete, based on whether they can guarantee the optimal solution or they trade optimality for shorter execution times, producing approximate solutions. In addition, each of these classes can be categorized into several groups, depending on the degree of locality exploited by the algorithms, the way local information is updated, and the type of exploration process adopted. We describe these categories of DCOP algorithms next.

Partial Centralization
In general, the DCOP solving process is fully decentralized, driving DCOP algorithms to obey the agents' knowledge and communication restrictions described in Section 4.2. However, some solution approaches explored methods to centralize the decisions to be taken by a group of agents, by delegating them to one of the agents in the group. Such approaches explore the concept of partial centralization [67,105,152], and thus they are classified as partially decentralized algorithms. In contrast, algorithms that do not centralize any decision are said to be fully decentralized. Typically, partial centralization improves the algorithms' performance, allowing agents to coordinate their local assignments more efficiently. However, such performance enhancement comes with a loss of information privacy, as the centralizing agent needs to be granted access to the local subproblem of other agents in the group [55,105]. In contrast, fully decentralized algorithms prevent loss of information privacy, at the cost of a larger communication effort.

Synchronicity
DCOP algorithms can enhance their effectiveness by exploiting distributed and parallel processing. Based on the way the agents update their local information, DCOP algorithms are classified as synchronous or asynchronous.
Asynchronous algorithms allow agents to update the assignment for their variables based solely on their local view of the problem, and thus independently from the actual decisions of the other agents [43,50,121]. In contrast, synchronous algorithms constrain the agents' decisions to follow a particular order, typically enforced by the representation structure adopted [105,140,147].
Synchronous methods tend to delay the actions of some agents, guaranteeing that their local view of the problem is always consistent with that of the other agents. In contrast, asynchronous methods tend to minimize the idle time of the agents, which in turn can react quickly to each message being processed; however, they provide no guarantee on the consistency of the state of the local view of each agent. This effect was studied in [145], which concluded that inconsistent agents' views may have a negative impact on network load and algorithm performance, and that introducing some level of synchronization may be beneficial for some algorithms, enhancing their performance.

Exploration Process
The resolution process adopted by each algorithm can be classified into three categories [188]: search-based methods, inference-based methods, and sampling-based methods.
• Sampling-based methods are incomplete approaches that sample the search space to approximate a function (usually a probability distribution) as a product of statistical inference.
Figure 6 illustrates a taxonomy of classical DCOP algorithms. In the following subsections, we briefly describe some representative complete and incomplete algorithms from each of the classes introduced above. A detailed description of the DCOP algorithms is beyond the scope of this manuscript. We refer the interested readers to the original articles that introduce each algorithm.
Throughout this document, we will often adopt the following notation when discussing the complexity of the algorithms: the size of the largest domain is denoted by $d = \max_{D_i \in D} |D_i|$, and $w^*$ refers to the induced width of the pseudo-tree.

COMPLETE ALGORITHMS
SynchBB [67]. Synchronous Branch-and-Bound (SynchBB) is a complete, synchronous, search-based algorithm that can be considered as a distributed version of a branch-and-bound algorithm. It uses a complete ordering of the agents in order to extend a Current Partial Assignment (CPA) via a synchronous communication process. The CPA holds the assignments of all the variables controlled by all the visited agents, and, in addition, functions as a mechanism to propagate bound information. The algorithm prunes those parts of the search space whose solution quality is sub-optimal, by exploiting the bounds that are updated at each step of the algorithm. The space requirement of SynchBB agents and the maximum message size are in $O(n)$, while the agents need to perform, in the worst case, $O(d^m)$ operations. The network load is also in $O(d^m)$.
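The pruning idea behind branch-and-bound can be illustrated with a centralized, sequential sketch. It is not SynchBB itself: agents, messages, and the synchronous passing of the CPA are not modeled, and the bound used here (exact rewards of fully assigned functions plus the maximum reward of every other function) is a simple assumption chosen for illustration. The toy instance, its reward values, and the names `upper_bound` and `branch_and_bound` are hypothetical.

```python
from itertools import product

BOTTOM = None  # stands for ⊥ (combination not allowed)

domains = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1], "x4": [0, 1]}
order = ["x1", "x2", "x3", "x4"]  # complete ordering used to extend the CPA


def f123(x1, x2, x3):
    return BOTTOM if (x1, x2, x3) == (0, 0, 0) else float(x1 + x2 + x3)


def f24(x2, x4):
    return 5.0 if x2 != x4 else 0.0


functions = [(("x1", "x2", "x3"), f123), (("x2", "x4"), f24)]


def max_reward(scope, f):
    """Largest allowed reward of f, used as an optimistic estimate."""
    rewards = [f(*t) for t in product(*(domains[x] for x in scope))]
    return max(r for r in rewards if r is not BOTTOM)


F_MAX = [max_reward(scope, f) for scope, f in functions]


def upper_bound(cpa):
    """Optimistic reward of any completion of cpa, or BOTTOM if cpa is already infeasible."""
    total = 0.0
    for (scope, f), f_max in zip(functions, F_MAX):
        if all(x in cpa for x in scope):
            reward = f(*(cpa[x] for x in scope))
            if reward is BOTTOM:
                return BOTTOM
            total += reward
        else:
            total += f_max
    return total


def branch_and_bound():
    best = {"sigma": None, "reward": float("-inf")}
    visited = 0  # number of partial assignments examined

    def extend(cpa, depth):
        nonlocal visited
        visited += 1
        bound = upper_bound(cpa)
        if bound is BOTTOM or bound <= best["reward"]:
            return  # prune: no completion can improve on the best solution found
        if depth == len(order):
            best["sigma"], best["reward"] = dict(cpa), bound
            return
        x = order[depth]
        for value in domains[x]:
            cpa[x] = value
            extend(cpa, depth + 1)
            del cpa[x]

    extend({}, 0)
    return best["sigma"], best["reward"], visited


sigma, reward, visited = branch_and_bound()
```

Without pruning, the search tree of this instance has 31 nodes; the bound lets the search skip part of it while still returning an optimal solution.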
AFB [50]. Asynchronous Forward Bounding (AFB) is a complete, asynchronous, search-based algorithm that can be considered as the asynchronous version of SynchBB. In its original description the algorithm was defined for minimization problems. In this algorithm, agents communicate their reward estimates, which in turn are used to compute bounds and prune the search space. In AFB, agents extend a CPA sequentially, provided that the upper bound on their rewards exceeds the global bound, that is, the reward of the best solution found so far. Each agent performing an assignment (the "assigning" agent) triggers asynchronous checks of bounds, by sending forward messages containing copies of the CPA to agents that have not yet assigned their variables. The unassigned agents that receive a CPA estimate the upper bound of the CPA, given their local view of the constraint graph. They also send such estimates back to the agent that originated the forward message. This assigning agent will receive these estimates asynchronously and aggregate them into an updated upper bound. If the updated upper bound falls below the current lower bound, the agent initiates a backtracking phase. As in SynchBB, the worst-case complexity for network load and agent's operations is $O(d^m)$, while the size of messages and each agent's space requirement are in $O(n)$.
ADOPT [121]. Asynchronous Distributed OPTimization (ADOPT) is a complete, asynchronous, search-based algorithm that makes use of a DFS pseudo-tree ordering of the agents. In its original description the algorithm was defined for minimization problems. The algorithm relies on maintaining, in each agent, lower and upper bounds on the solution reward for the subtree rooted at its node(s) in the DFS tree. Agents explore partial solutions in best-first order, that is, in decreasing upper bound order. Agents use COST messages (propagated upwards in the DFS pseudo-tree) and THRESHOLD and VALUE messages (propagated downwards in the tree) to iteratively tighten the lower and upper bounds, until the upper bound of the best reward solution is equal to its lower bound. ADOPT agents store upper bounds as thresholds, which can be used to prune partial solutions that are provably sub-optimal. ADOPT agents need to maintain a context which stores the assignments of higher-priority neighbors, and a lower bound and an upper bound for each domain value and child; thus, the space requirement for each agent is in $O(d(l + 1))$, where $l = \max_{a_i \in A} |N_{a_i}|$. Its worst-case network load and agent complexity is $O(d^m)$, while its maximum message size is in $O(l)$.
ADOPT has been extended in several ways. In particular, BnB-ADOPT [189,62] uses a branch-and-bound method to reduce the amount of computation performed during search, and ADOPT(k) combines both ADOPT and BnB-ADOPT into an integrated algorithm [63]. There are also extensions that trade solution optimality for smaller runtimes [190], extensions that use more memory for smaller runtimes [191], and extensions that maintain soft arc-consistency [21,20,61,60].
DPOP [147]. Distributed Pseudo-tree Optimization Procedure (DPOP) is a complete, synchronous, inference-based algorithm that makes use of a DFS pseudo-tree ordering of the agents. It involves three phases. In the first phase, the agents order themselves into a DFS pseudo-tree. In the second phase, each agent, starting from the leaves of the pseudo-tree, aggregates the rewards in its subtree for each value combination of variables in its separator. The aggregated rewards are encoded in a UTIL message, which is propagated from children to their parents, up to the root. In the third phase, each agent, starting from the root of the pseudo-tree, selects the optimal values for its variables. The optimal values are calculated based on the UTIL messages received from the agent's children and the VALUE message received from its parent. The VALUE messages contain the optimal values of the agents and are propagated from parents to their children, down to the leaves of the pseudo-tree. Thus, DPOP generates a number of messages that is in $O(m)$. However, the size of the messages and the agent's space requirement are exponential in the induced width of the pseudo-tree: $O(d^{w^*})$. Finally, the number of operations performed by DPOP agents is in $O(d^{w^*+z})$, with $z = \max_{a_i \in A} |L_i|$. DPOP has also been extended in several ways to enhance its performance and capabilities. O-DPOP and MB-DPOP trade runtimes for smaller memory requirements [149,150], A-DPOP trades solution optimality for smaller runtimes [146], SS-DPOP trades runtime for increased privacy [55], PC-DPOP trades privacy for smaller runtimes [152], H-DPOP propagates hard constraints for smaller runtimes [85], BrC-DPOP enforces branch consistency for smaller runtimes [45], and ASP-DPOP is a declarative version of DPOP that uses Answer Set Programming [92].
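The UTIL and VALUE phases can be sketched on the simplest possible case, a tree-structured problem without back-edges, where the separator of each agent reduces to its parent. The following is a centralized simulation under that assumption, not the distributed algorithm; the function `dpop_tree` and its arguments are hypothetical names introduced for illustration.

```python
def dpop_tree(parent, domains, unary, binary):
    """UTIL/VALUE propagation on a tree (each separator is the parent only).

    parent:  dict var -> parent var (the root maps to None)
    domains: dict var -> list of values
    unary:   dict var -> function(value) -> reward
    binary:  dict (child, parent) -> function(child_val, parent_val) -> reward
    """
    children = {v: [] for v in parent}
    for child, par in parent.items():
        if par is not None:
            children[par].append(child)
    root = next(v for v, p in parent.items() if p is None)

    util = {}    # util[child][parent_value]: best reward of the child's subtree
    choice = {}  # choice[child][parent_value]: child's value achieving it

    def subtree_reward(var, val):
        # Own unary reward plus the UTIL messages received from the children.
        return unary[var](val) + sum(util[c][val] for c in children[var])

    def util_phase(var):
        # Leaves first: children are processed before their parent.
        for c in children[var]:
            util_phase(c)
        par = parent[var]
        if par is None:
            return
        util[var], choice[var] = {}, {}
        for pv in domains[par]:
            score = lambda v: subtree_reward(var, v) + binary[(var, par)](v, pv)
            best = max(domains[var], key=score)
            util[var][pv], choice[var][pv] = score(best), best

    def value_phase(var, assignment):
        # Root first: each child picks its best value given the parent's value.
        for c in children[var]:
            assignment[c] = choice[c][assignment[var]]
            value_phase(c, assignment)

    util_phase(root)
    root_val = max(domains[root], key=lambda v: subtree_reward(root, v))
    assignment = {root: root_val}
    value_phase(root, assignment)
    return assignment, subtree_reward(root, root_val)


parent = {"a": None, "b": "a", "c": "a"}
domains = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}
unary = {"a": lambda v: 0, "b": lambda v: v, "c": lambda v: 0}
binary = {
    ("b", "a"): lambda b, a: 2 if b != a else 0,
    ("c", "a"): lambda c, a: 3 if c == a else 0,
}
assignment, reward = dpop_tree(parent, domains, unary, binary)
```

Each UTIL table here has one entry per parent value; with back-edges the tables would range over all value combinations of the separator, which is the source of the exponential message size noted above.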
OptAPO [105]. Optimal Asynchronous Partial Overlay (OptAPO) is a complete, synchronous, search-based algorithm. It trades agent privacy for smaller runtimes through partial centralization. It employs a cooperative mediation schema, where agents can act as mediators and propose value assignments to other agents. In particular, agents check if there is a conflict with some neighboring agent. If a conflict is found, the agent with the highest priority acts as a mediator. During mediation, OptAPO solves subproblems using a centralized branch-and-bound-based search, and when solutions of overlapping subproblems still have conflicting assignments, the solving agents increase the centralization to resolve them. By sharing their knowledge with centralized entities, agents can improve their local decisions, reducing the communication costs. For instance, the algorithm has been shown to be superior to ADOPT on simple combinatorial problems. However, it is possible that several mediators solve overlapping problems, duplicating efforts [152], which can be a bottleneck especially for dense problems. The worst case agent complexity is in $O(d^n)$, as an agent might solve the entire problem. The agent space requirement is in $O(nd)$, as a mediator agent needs to maintain the domains of all the variables involved in the mediation section, while the message size is in the order of $O(d)$. The network load decreases with the amount of partial centralization required; however, its worst case order complexity is exponential in the number of agents, $O(d^m)$. The original version of OptAPO has been shown to be incomplete [57], but a complete variant has been proposed [57].

INCOMPLETE ALGORITHMS
Max-Sum [43]. Max-Sum is an incomplete, asynchronous, inference-based algorithm based on belief propagation. It operates on factor graphs by performing a marginalization process of the reward functions, and optimizing the rewards for each given variable. This process is performed by recursively propagating messages between variable nodes and function nodes. The value assignments take into account their impact on the marginalized reward function. Max-Sum is guaranteed to converge to an optimal solution in acyclic graphs, but convergence is not guaranteed in cyclic graphs. Max-Sum has also been extended in several ways. Bounded Max-Sum is able to bound the quality of the solutions found by removing a subset of edges from a cyclic DCOP graph to make it acyclic, and by running Max-Sum to solve the acyclic problem [159]; Improved Bounded Max-Sum improves on the error bounds [160]; and Max-Sum AD guarantees convergence in acyclic graphs through a two-phase value propagation phase [199]. Max-Sum and its extensions have been successfully employed to solve a number of large scale, complex MAS applications (see Section 8).
Region Optimal [140]. Region-optimal algorithms are incomplete, synchronous, search-based algorithms that allow users to specify regions of the constraint graph (e.g., regions with a maximum size of $k$ agents [140], $t$ hops from each agent [79], or a combination of both size and hops [177]) and solve the subproblem within each region optimally. The concept of $k$-optimality is defined with respect to the number of agents whose assignments conflict, whose set is denoted by $c(\sigma, \sigma')$, for two assignments $\sigma$ and $\sigma'$. The deviating cost of $\sigma$ with respect to $\sigma'$, denoted by $\Delta(\sigma, \sigma')$, is defined as the difference between the aggregated reward associated to the assignment $\sigma$, $F(\sigma)$, and the reward associated to $\sigma'$, $F(\sigma')$. An assignment $\sigma$ is $k$-optimal if, $\forall \sigma' \in \Sigma$ such that $|c(\sigma, \sigma')| \leq k$, we have that $\Delta(\sigma, \sigma') \geq 0$. In contrast, the concept of $t$-distance emphasizes the number of hops from a central agent $a$ of the region $\Omega_t(a)$, that is, the set of agents which are separated from $a$ by at most $t$ hops. An assignment $\sigma$ is $t$-distance optimal if, $\forall \sigma' \in \Sigma$, $F(\sigma) \geq F(\sigma')$ with $c(\sigma, \sigma') \subseteq \Omega_t(a)$, for any $a \in A$. The Distributed Asynchronous Local Optimization (DALO) simulator provides a mechanism to coordinate the decision of local groups of agents based on the concepts of $k$-optimality and $t$-distance [79]. The quality of the solutions found is bounded by a function of $k$ or $t$ [177].
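The definition of k-optimality can be checked by brute force on small instances. The sketch below assumes one variable per agent and a centralized view of the aggregated reward; `is_k_optimal`, `F`, and the toy coordination problem are hypothetical and serve only to illustrate the definition, not the region-optimal algorithms themselves.

```python
from itertools import combinations, product


def is_k_optimal(sigma, domains, total_reward, k):
    """Return True if no deviation by at most k agents improves the reward."""
    agents = list(sigma)
    base = total_reward(sigma)
    for size in range(1, k + 1):
        for group in combinations(agents, size):
            for values in product(*(domains[a] for a in group)):
                other = dict(sigma)
                other.update(zip(group, values))
                # Deviating cost Delta(sigma, sigma') = F(sigma) - F(sigma').
                if base - total_reward(other) < 0:
                    return False
    return True


domains = {"a1": [0, 1], "a2": [0, 1], "a3": [0, 1]}


def F(sigma):
    # Toy coordination problem: agreeing on 1 is better than agreeing on 0,
    # and any disagreement yields no reward.
    vals = set(sigma.values())
    return 10 if vals == {1} else 5 if vals == {0} else 0


all_zero = {"a1": 0, "a2": 0, "a3": 0}
```

Here the all-zero assignment is 1-optimal and 2-optimal, since any deviation by one or two agents breaks the agreement, but it is not 3-optimal, since all three agents can jointly move to the better agreement.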
MGM [103]. Maximum Gain Message (MGM) is an incomplete, synchronous, search-based algorithm that performs a distributed local search. Each agent starts by assigning a random value to each of its variables. Then, it sends this information to all its neighbors. Upon receiving the values of its neighbors, it calculates the maximum gain in reward if it changes its value and sends this information to all its neighbors as well. Upon receiving the gains of its neighbors, it changes its value if its gain is the largest among its neighbors. This process repeats until a termination condition is met. MGM provides no quality guarantees on the returned solution.
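The rounds of MGM can be simulated centrally as follows. This is a minimal sketch under several assumptions: one variable per agent, a single symmetric pairwise reward function, an initial assignment passed in explicitly (instead of being drawn at random) for reproducibility, and ties between equal gains broken by agent name. The names `mgm`, `neighbors`, and `different` are hypothetical.

```python
def mgm(neighbors, domains, reward, values, max_rounds=100):
    """Simulate synchronous MGM rounds and return the final assignment."""
    values = dict(values)

    def local(agent, value):
        # Reward the agent collects given its neighbors' current values.
        return sum(reward(value, values[b]) for b in neighbors[agent])

    for _ in range(max_rounds):
        # Each agent computes its best unilateral change and the gain it yields.
        proposals = {}
        for a in neighbors:
            best = max(domains[a], key=lambda v: local(a, v))
            proposals[a] = (local(a, best) - local(a, values[a]), best)
        # An agent moves only if its gain is positive and the largest in its
        # neighborhood (ties broken by agent name).
        movers = [
            a for a, (gain, _) in proposals.items()
            if gain > 0
            and all((gain, a) > (proposals[b][0], b) for b in neighbors[a])
        ]
        if not movers:
            break  # no agent can improve: a local optimum has been reached
        for a in movers:
            values[a] = proposals[a][1]
    return values


def different(u, v):
    # Graph-coloring style reward: 1 if the two neighbors differ.
    return 1 if u != v else 0


neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
domains = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}
result = mgm(neighbors, domains, different, {"a": 0, "b": 0, "c": 0})
```

Because two neighboring agents never move in the same round, the aggregated reward cannot decrease from one round to the next.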
DSA [197]. Distributed Stochastic Algorithm (DSA) is an incomplete, synchronous, search-based algorithm that is similar to MGM, except that each agent does not send its gains to its neighbors and it does not change its value to the value with the maximum gain. Instead, it decides stochastically if it takes on the value with the maximum gain or other values with smaller gains. This stochasticity allows DSA to escape from local optima. Similarly to MGM, it repeats until a termination condition is met, and it cannot provide bounded solution quality.
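One simple way to realize the stochastic decision is sketched below: in each round, an agent that can improve adopts its best value with probability p and keeps its current value otherwise. This is only one of several possible rules and is an assumption of the sketch, as are the names `dsa`, `p`, and `seed`; all agents decide on the same snapshot of the previous round to mimic synchronous execution.

```python
import random


def dsa(neighbors, domains, reward, values, p=0.7, rounds=50, seed=0):
    """Simulate synchronous DSA-style rounds with activation probability p."""
    rng = random.Random(seed)
    values = dict(values)
    for _ in range(rounds):
        snapshot = dict(values)  # values announced in the previous round
        for a in neighbors:
            def local(v):
                return sum(reward(v, snapshot[b]) for b in neighbors[a])
            best = max(domains[a], key=local)
            # Move only if the change improves the local reward, and then
            # only with probability p.
            if local(best) > local(snapshot[a]) and rng.random() < p:
                values[a] = best
    return values


def different(u, v):
    return 1 if u != v else 0


neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
domains = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}
start = {"a": 0, "b": 0, "c": 0}

always = dsa(neighbors, domains, different, start, p=1.0, rounds=1)
never = dsa(neighbors, domains, different, start, p=0.0, rounds=5)
```

With p = 1 every agent flips at once and the three agents still agree after the round, which illustrates why neighboring agents should not all move deterministically; intermediate values of p break this symmetry.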
DUCT [137]. The Distributed Upper Confidence Tree (DUCT) algorithm is an incomplete, sampling-based algorithm that is inspired by Monte-Carlo tree search and employs confidence bounds to solve DCOPs. DUCT emulates a search process analogous to that of ADOPT, where agents select the values to assign to their variables according to the information encoded in their context messages (i.e., the assignments to all the variables in the receiving variable's separator). However, rather than systematically selecting the next value to assign to their own variables, DUCT agents sample such values. To focus on promising assignments, DUCT constructs a confidence bound $B$, such that the best value for any context is at least $B$, and hence agents sample the choice with the lowest bound. This process starts from the pseudo-tree root agent, which, after sampling a value for its variable, communicates its assignment to its children in a context message. When an agent receives a context message it repeats such process, which proceeds until the leaf agents are reached. When the leaf agents choose a value assignment, they calculate the utility within their context and propagate this information up the tree in a cost message. This process continues for a given number of iterations or until convergence is achieved, i.e., until the sampled values in two successive iterations do not change.
D-Gibbs [130]. The Distributed Gibbs (D-Gibbs) algorithm is an incomplete, synchronous, sampling-based algorithm that extends the Gibbs sampling process [49] by tailoring it to solve DCOPs in a decentralized manner. The Gibbs sampling process is a centralized Markov chain Monte Carlo algorithm that can be used to approximate joint probability distributions. By mapping DCOPs to maximum a-posteriori estimation problems, probabilistic inference algorithms like Gibbs sampling can be used to solve DCOPs. D-Gibbs provides convergence guarantees. A version of the algorithm which speeds up the agents' sampling process with Graphic Processing Units (GPUs) was proposed in [44].
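The idea of solving a DCOP by sampling can be sketched with a centralized Gibbs sampler. The sketch assumes a particular mapping, in which an assignment has probability proportional to the exponential of its aggregated reward, so that the most probable assignment is also the reward-maximizing one; this mapping, the function `gibbs_dcop`, and its parameters are assumptions made for illustration and do not reproduce D-Gibbs.

```python
import math
import random


def gibbs_dcop(neighbors, domains, reward, iterations=200, seed=0):
    """Centralized Gibbs sampling that tracks the best assignment visited."""
    rng = random.Random(seed)
    values = {a: rng.choice(domains[a]) for a in neighbors}

    def total(vals):
        # Each edge is counted once (agents are compared by name).
        return sum(
            reward(vals[a], vals[b])
            for a in neighbors for b in neighbors[a] if a < b
        )

    best, best_reward = dict(values), total(values)
    for _ in range(iterations):
        for a in neighbors:
            # Conditional distribution of a's value given its neighbors:
            # proportional to exp(local reward).
            weights = [
                math.exp(sum(reward(v, values[b]) for b in neighbors[a]))
                for v in domains[a]
            ]
            values[a] = rng.choices(domains[a], weights=weights)[0]
        current = total(values)
        if current > best_reward:
            best, best_reward = dict(values), current
    return best, best_reward


def different(u, v):
    return 1 if u != v else 0


neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
domains = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}
best, best_reward = gibbs_dcop(neighbors, domains, different)
```

Only the local reward of an agent is needed to resample its value, which is what makes the sampling step amenable to decentralization.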

Notable Variant: Asymmetric DCOPs
Asymmetric DCOPs [56] are used to model multi-agent problems where agents controlling variables in the scope of a reward function can receive different rewards, given a fixed joint assignment. Such a problem cannot be naturally represented by classical DCOPs, which require that all agents controlling variables participating in a reward function receive the same rewards as each other.

Definition
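An Asymmetric DCOP is defined by a tuple $\langle A, X, D, F, \alpha \rangle$, where $A$, $X$, $D$ and $\alpha$ are defined as in Definition 4.1, and each $f_i \in F$ takes as input both a value assignment to the variables in its scope and one of the agents controlling those variables, and returns the reward obtained by that agent. In other words, an Asymmetric DCOP is a DCOP where the reward that an agent obtains from a reward function may differ from the reward another agent obtains from the same reward function.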
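As rewards for participating agents may differ from each other, the goal in Asymmetric DCOPs is also different from the goal in classical DCOPs. Given a reward function $f_j \in F$ and a complete solution $\sigma$, let $f_j(\sigma, a_i)$ denote the reward obtained by agent $a_i$ from reward function $f_j$ with solution $\sigma$. Then, the goal in Asymmetric DCOPs is to find the complete solution $\sigma^*$ that maximizes the rewards aggregated over all reward functions and all agents participating in them: $\sigma^* = \arg\max_{\sigma \in \Sigma} \sum_{f_j \in F} \sum_{a_i} f_j(\sigma, a_i)$, where the inner sum ranges over the agents controlling variables in the scope of $f_j$. As for classical DCOPs, solving Asymmetric DCOPs is NP-hard. In particular, it is possible to reduce any Asymmetric DCOP to an equivalent classical DCOP by introducing a polynomial number of variables and constraints, as described in the next section.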
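The aggregation of the agents' individual rewards can be illustrated with a small brute-force sketch. The representation of a reward function as a scope paired with one reward function per participating agent, as well as the names `asymmetric_total` and `solve`, are assumptions introduced for illustration.

```python
from itertools import product


def asymmetric_total(sigma, functions):
    """Sum, over all reward functions, the rewards of every participating agent.

    functions: list of (scope, sides) pairs, where scope is a tuple of
    variables and sides maps each agent to its own reward function.
    """
    return sum(
        side(*(sigma[x] for x in scope))
        for scope, sides in functions
        for side in sides.values()
    )


def solve(domains, functions):
    # Exhaustive search; adequate only for tiny illustrative problems.
    variables = list(domains)
    candidates = (
        dict(zip(variables, vals)) for vals in product(*domains.values())
    )
    return max(candidates, key=lambda s: asymmetric_total(s, functions))


# Two agents schedule a meeting; they value the two slots differently.
domains = {"x1": ["am", "pm"], "x2": ["am", "pm"]}
functions = [
    (
        ("x1", "x2"),
        {
            "a1": lambda u, v: (3 if u == "am" else 1) if u == v else 0,
            "a2": lambda u, v: (1 if u == "am" else 2) if u == v else 0,
        },
    )
]
best = solve(domains, functions)
```

If only the side of agent a2 were considered, the afternoon slot would look preferable; aggregating both sides selects the morning slot, with a total reward of 4 against 3.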

Relation to Classical DCOPs
One way to solve MAS problems with asymmetric rewards via classical DCOPs is through the Private Events As Variables (PEAV) model [103]. It can capture asymmetric rewards by introducing, for each agent, as many "mirror" variables as the number of variables held by neighboring agents. The consistency with the neighbors' state variables is imposed by a set of equality constraints. However, such formalism suffers from scalability problems, as it may result in a significant increase in the number of variables in a DCOP. In addition, Grinshpoun et al. showed that most of the existing incomplete classical DCOP algorithms cannot be used to effectively solve Asymmetric DCOPs, even when the problems are reformulated through the PEAV model [56]. They show that such algorithms are unable to distinguish between different solutions that satisfy all hard constraints, resulting in a convergence to one of those solutions and the inability to escape that local optimum. Therefore, it is important to generate ad-hoc algorithms to solve Asymmetric DCOPs.

Algorithms
The current research direction in the design of Asymmetric DCOP algorithms has focused on adapting existing classical DCOP algorithms to handle the asymmetric rewards. Asymmetric DCOPs require that each agent, whose variables participate in a reward function, coordinate the aggregation of their individual rewards. To do so, two approaches have been identified [27]:
• A two-phase strategy, where only one side of the constraint (i.e., the reward of one agent) is considered in the first phase. The other side(s) (i.e., the reward of the other agent(s)) is considered in the second phase once a full assignment is produced. As a result, the rewards of all agents are aggregated.
• A single-phase strategy, which requires a systematic check of each side of the constraint before reaching a full assignment. Checking each side of the constraint is often referred to as back checking, a process that can be performed either synchronously or asynchronously.

COMPLETE ALGORITHMS
SyncABB-2ph [56]. Synchronous Asymmetric Branch and Bound - 2-phase (SyncABB-2ph) is a complete, synchronous, search-based algorithm that extends SynchBB with the two-phase strategy. Phase 1 emulates SynchBB, where each agent considers the rewards of its constraints with higher-priority agents. Phase 2 starts once a full assignment is found. During this phase, each agent aggregates the sides of the constraints that were not considered during Phase 1 and verifies that the known bound is not exceeded. If the bound is exceeded, Phase 2 ends and the agents restart Phase 1 by backtracking and resuming the search from the lower priority agent that exceeded the bound. The worst case complexity of this algorithm reflects that of SynchBB in terms of space requirement, network load, size of messages, and number of operations performed by each agent.
SyncABB-1ph [56,96]. Synchronous Asymmetric Branch and Bound - 1-phase (SyncABB-1ph) is a complete, synchronous, search-based algorithm that extends SynchBB with the one-phase strategy. Each agent, after having extended the CPA, updates the bound with its local reward associated to the constraints involving its variables, as done in SynchBB. In addition, the CPA is sent back to the assigned agents to update its bound via a sequence of back checking operations. Its worst case order complexity reflects that of SynchBB in terms of space requirement, network load, size of messages, and number of operations performed by each agent.
ATWB [56]. The Asymmetric Two-Way Bounding (ATWB) algorithm is a complete, asynchronous, search-based algorithm that extends AFB to accommodate both forward bounding and backward bounding. The forward bounding is performed analogously to AFB. The backward bounding, instead, is achieved by sending copies of the CPA backward to the agents whose assignments are included in the CPA. Similarly to what is done in AFB, agents that receive a copy of the CPA compute their estimates and send them forward to the assigning agent. Its worst case order complexity reflects that of AFB in terms of space requirement, network load, size of messages, and number of operations performed by each agent.

INCOMPLETE ALGORITHMS
ACLS [56]. Asymmetric Coordinated Local Search (ACLS) is an incomplete, synchronous, search-based algorithm. After a random value initialization, each agent exchanges its values with all its neighboring agents. At the end of this step, each agent knows the values of all its neighboring agents, and identifies all possible improving assignments for its own variables, given the current neighbors' choices. Each agent then picks one such assignment, according to the distribution of gains from each proposal assignment, and exchanges it with its neighbors. When an agent receives a proposal assignment, it responds with the evaluation of its side of the constraints, resulting from its current assignment and the proposal assignments of the other agents participating in the constraint. After receiving the evaluations from each of its neighbors, each agent estimates the potential gain or loss derived from its assignment, and commits to a change with a given probability, similarly to agents in DSA, to escape from local optima.
MCS-MGM [56]. Minimal Constraint Sharing MGM (MCS-MGM) is an incomplete, synchronous, search-based algorithm that extends MGM by considering each side of the constraint. Like MGM, the agents operate in an iterative fashion, where they exchange their current values at the start of each iteration. Afterwards, each agent sends the utility for its side of each constraint to its neighboring agents that participate in the same constraint. Upon receiving this information, each agent knows the total utility for each constraint, obtained by adding together the utilities of both sides of the constraint. Therefore, like in MGM, the agent can calculate the maximum gain in utility if it changes its values, and will send this information to all its neighbors. Upon receiving the gains of its neighbors, each agent changes its value if its gain is the largest among its neighbors.

Notable Variant: Multi-Objective DCOPs
Multi-objective optimization (MOO) [116,107] aims at solving problems involving more than one objective function to be optimized simultaneously. In a MOO problem, optimal decisions need to accommodate conflicting objectives. Examples of MOO problems include optimization of electrical power generation in a power grid while minimizing emission of pollutants, and minimization of the costs of buying a vehicle while maximizing comfort.
Multi-objective DCOPs extend MOO problems and DCOPs.

Definition
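A Multi-objective DCOP (MO-DCOP) is defined by a tuple $\langle A, X, D, \mathbf{F}, \alpha \rangle$, where $A$, $X$, $D$, and $\alpha$ are defined as in Definition 4.1, and $\mathbf{F} = [F_1, \ldots, F_h]^T$ is a vector of multi-objective functions, where each $F_i$ is a set of objective functions $f_j$ defined as in Definition 4.1. For a solution $\sigma$ of a MO-DCOP, let the reward for $\sigma$ according to the $i$-th multi-objective optimization function set be $F_i(\sigma) = \sum_{f_j \in F_i} f_j(\sigma)$. The goal of a MO-DCOP is to find an assignment $\sigma^*$, such that: $\sigma^* = \arg\max_{\sigma \in \Sigma} \mathbf{F}(\sigma) = \arg\max_{\sigma \in \Sigma} [F_1(\sigma), \ldots, F_h(\sigma)]^T$, where $\mathbf{F}(\sigma)$ is a reward vector for the MO-DCOP. A solution to a MO-DCOP involves the optimization of a set of partially-ordered assignments. Note that we consider, in the above definition, point-wise comparison of vectors, i.e., $\mathbf{F}(\sigma) \geq \mathbf{F}(\sigma')$ if $F_i(\sigma) \geq F_i(\sigma')$ for all $1 \leq i \leq h$. Typically, there is no single global solution where all the objectives are optimized at the same time. Thus, solutions of a MO-DCOP are characterized by the concept of Pareto optimality, which can be defined through the concept of dominance: Definition 1 (Dominance) A complete solution $\sigma$ is dominated by a complete solution $\sigma'$ iff $\mathbf{F}(\sigma') \geq \mathbf{F}(\sigma)$ and $F_i(\sigma') > F_i(\sigma)$ for at least one objective $F_i$. Definition 2 (Pareto Optimality) A complete solution $\sigma^* \in \Sigma$ is Pareto optimal iff it is not dominated by any other complete solution.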
Therefore, a solution is Pareto optimal iff there is no other solution that improves at least one objective function without deteriorating the reward of another function. Another important concept is the Pareto front: Definition 3 (Pareto Front) The Pareto front is the set of all reward vectors of all Pareto optimal solutions.
Solving a MO-DCOP is equivalent to finding the Pareto front. Even for tree-structured MO-DCOPs, the size of the Pareto front may be exponential in the number of variables. Thus, multi-objective algorithms often provide solutions that may not be Pareto optimal but may satisfy other criteria that are significant for practical applications.
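A widely-adopted criterion is that of weak Pareto optimality: Definition 4 (Weak Pareto Optimality) A complete solution $\sigma^* \in \Sigma$ is weakly Pareto optimal iff there is no other solution $\sigma \in \Sigma$ such that $F_i(\sigma) > F_i(\sigma^*)$ for all $1 \leq i \leq h$. In other words, a solution is weakly Pareto optimal if there is no other solution that improves all of the objective functions simultaneously. An alternative approach to Pareto optimality is one that uses the concept of utopia points: Definition 5 (Utopia Point) A reward vector $\mathbf{F}^\circ = [F_1^\circ, \ldots, F_h^\circ]^T$ is a utopia point iff $F_i^\circ = \max_{\sigma \in \Sigma} F_i(\sigma)$ for all $1 \leq i \leq h$. Thus, a utopia point is the vector of rewards obtained by independently optimizing $h$ DCOPs, each associated to one objective of the multi-objective function vector. In general, $\mathbf{F}^\circ$ is unattainable. Therefore, different approaches focus on finding a compromise solution [162], which is a Pareto optimal solution that is close to the utopia point. The concept of close is dependent on the approach adopted.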
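The notions of dominance, Pareto optimality, weak Pareto optimality, and utopia point can be made concrete on a finite set of reward vectors. The sketch below assumes that the reward vector of every complete solution has already been computed (which is infeasible in general, given the exponential number of solutions); the function names and the example vectors are hypothetical.

```python
def dominates(u, v):
    """u dominates v: at least as good everywhere, strictly better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def strictly_dominates(u, v):
    """u improves on v in every objective simultaneously."""
    return all(a > b for a, b in zip(u, v))


def pareto_front(vectors):
    return {u for u in vectors if not any(dominates(v, u) for v in vectors)}


def weakly_pareto_optimal(vectors):
    return {u for u in vectors if not any(strictly_dominates(v, u) for v in vectors)}


def utopia_point(vectors):
    # Each objective is optimized independently of the others.
    return tuple(max(component) for component in zip(*vectors))


# Reward vectors (objective 1, objective 2) of five complete solutions.
vectors = {(4, 1), (3, 3), (1, 4), (2, 2), (4, 0)}
front = pareto_front(vectors)
weak = weakly_pareto_optimal(vectors)
utopia = utopia_point(vectors)
```

In this example (4, 0) is weakly Pareto optimal but not Pareto optimal, since (4, 1) improves the second objective without deteriorating the first, and the utopia point (4, 4) is not attained by any solution.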
Similarly to the centralized version, MO-DCOPs have been shown to be NP-hard (their decision versions) and #P-hard (the related counting versions), and to have exponentially many efficient solutions and non-dominated points [54].

Algorithms
We categorize the proposed MO-DCOP algorithms into two classes: complete and incomplete algorithms, according to their ability to find the complete set of Pareto optimal solutions or only a subset of it.

COMPLETE ALGORITHMS
MO-SBB [114]. Multi-Objective Synchronous Branch and Bound (MO-SBB) is a complete, synchronous, search-based algorithm that extends SynchBB. It uses a search strategy analogous to that of the mono-objective SynchBB: after establishing a complete ordering, MO-SBB agents extend a Current Partial Assignment with their own value assignments and the current associated reward vectors. Once a non-dominated solution is found, it is broadcast to all agents, which add the solution to a list of global bounds. Thus, agents maintain an approximation of the Pareto front, which is used to bound the exploration, and extend the CPA only if the new partial assignment is not dominated by solutions in the list of global bounds. When the algorithm terminates, it returns the set of Pareto optimal solutions obtained by filtering the list of global bounds by dominance. Its worst case complexity reflects that of SynchBB in terms of network load, size of messages, and number of operations performed by each agent, while the agent's memory requirement is $O(np)$, where $p$ is the size of the Pareto set.
Pseudo-tree Based Algorithm [112]. The proposed algorithm is a complete, asynchronous, search-based algorithm that extends ADOPT. It introduces the notion of boundaries on the vectors of multi-objective values, which extends the concept of lower and upper bounds to vectors of values. The proposed approach starts with the assumption that functions within each $F_i$ are sorted according to a predefined ordering, and for each $1 \leq j \leq k$, we have that the scope of $f^i_j$ (i.e., the $j$-th function in $F_i$) is the same for each $i$ (i.e., all functions in the same position in different $F_i$ have the same scope). Thus, without loss of generality, we will refer to the scope of $f^i_j$ as $\mathbf{x}_j$. In such context, for $1 \leq j \leq k$, given a complete assignment $\sigma$ we define the vector of reward values $\sigma_j = [f^1_j(\sigma), \ldots, f^h_j(\sigma)]^T$. The notion of non-dominance is applied to these vectors, where a vector $\sigma_j$ is non-dominated iff there is no other vector $\sigma'_j$ such that $ub(\sigma^s_j) \leq lb(\sigma'^s_j)$ for all $1 \leq s \leq h$ and $ub(\sigma^s_j) < lb(\sigma'^s_j)$ for at least one $s$. The algorithm uses the notion of non-dominance for bounded vectors to retain exclusively non-dominated vectors. As in ADOPT, the network load is in $O(d^m)$. The worst case computational and memory complexity at each agent is in $O(p)$, as the number of combinations of utility vectors grows exponentially with the number of tuples of utility values, in the worst case.
Such method has also been proposed to solve Asymmetric MO-DCOPs [113], which are an extension of both Asymmetric DCOPs and MO-DCOPs.

INCOMPLETE ALGORITHMS
B-MOMS [39]. Bounded multi-objective max-sum (B-MOMS) is an incomplete, asynchronous, inference-based algorithm, and was the first MO-DCOP algorithm introduced. It extends Bounded Max-Sum to compute bound approximations for multi-objective DCOPs. It consists of three phases. The Bounding Phase generates an acyclic subgraph of the multi-objective factor graph, using a generalization of the maximum spanning tree problem to vector weights. During the Max-sum Phase, the agents coordinate to find the Pareto optimal set of solutions to the acyclic factor graph generated in the bounding phase. This is achieved by extending the addition and marginal maximization operators adopted in Max-Sum to the case of multiple objectives. Finally, the Value Propagation Phase allows agents to select a consistent variable assignment, as there may be multiple Pareto optimal solutions. The bounds provided by the algorithm are computed using the notion of utopia points. B-MOMS requires $O(d_i^{|\mathbf{x}^i|})$ evaluations of function $f_i$, where $d_i$ is the size of the largest domain among the variables in $\mathbf{x}^i$. Furthermore, since in the worst case every variable assignment of $\mathbf{x}$ is Pareto optimal, the message size complexity is $O(h\,d^{n+1})$, where $h$ is the size of each non-dominated vector.
DP-AOF [133]. Dynamic Programming based on Aggregate Objective Functions (DP-AOF) is an incomplete, synchronous, inference-based algorithm. It adapts the AOF technique [116], designed to solve centralized multi-objective optimization problems, to solve MO-DCOPs. Centralized AOF adopts a scalarization to convert a MOO problem into a single objective optimization. This is done by assigning weights $(\alpha_1, \ldots, \alpha_h)$ to each of the functions in the objective vector $[F_1, \ldots, F_h]^T$ such that $\sum_{i=1}^{h} \alpha_i = 1$ and $\alpha_i > 0$ for all $1 \leq i \leq h$. The resulting mono-objective function $\sum_{i=1}^{h} \alpha_i F_i$ can be solved using any mono-objective optimization technique with the guarantee of finding a Pareto optimal solution [116].
DP-AOF proceeds in two phases. First, it computes the utopia point $\mathbf{F}^\circ$ by solving as many mono-objective DCOPs as the number of objective functions in the MO-DCOP. It then constructs a new problem building upon the solutions obtained from the first phase. Such a problem is used to assign weights to each objective function of the MO-DCOP to construct the new mono-objective function in the same way as centralized AOF, which then can be solved optimally. The complexity of DP-AOF, in terms of number of operations, is given by $O(h\,d^{w^*})$, as the algorithm solves $h$ objectives using DPOP.
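The scalarization step of AOF can be illustrated in isolation. The sketch below enumerates a handful of candidate solutions with precomputed reward vectors instead of solving a mono-objective DCOP with DPOP, and it takes the weights as given rather than deriving them as DP-AOF does; both simplifications, and the names `weighted_sum_solution` and `solutions`, are assumptions made for illustration.

```python
def weighted_sum_solution(solutions, weights):
    """Return the solution maximizing the weighted sum of its rewards.

    solutions: dict name -> reward vector (one entry per objective)
    weights:   strictly positive weights summing to 1
    """
    if not all(w > 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to 1")

    def score(name):
        return sum(w * r for w, r in zip(weights, solutions[name]))

    return max(solutions, key=score)


solutions = {"s1": (4, 1), "s2": (3, 3), "s3": (1, 4), "s4": (2, 2)}
balanced = weighted_sum_solution(solutions, (0.5, 0.5))
skewed = weighted_sum_solution(solutions, (0.9, 0.1))
```

Different weight vectors select different Pareto optimal solutions (here s2 and s1, respectively), while the dominated solution s4 is never selected by any admissible weights.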
MO-DPOP$_{L_p}$ [134]. Multi-Objective $L_p$-norm based Distributed Pseudo-tree Optimization Procedure (MO-DPOP$_{L_p}$) is an incomplete, synchronous, inference-based algorithm. It adapts DPOP using a scalarization measure based on the $L_p$-norm to find a subset of the Pareto front of a MO-DCOP. Similar to DP-AOF, the algorithm proceeds in two phases. During the first phase, it uses DPOP to find the utopia point $\mathbf{F}^\circ$. In the second phase, the agents coordinate to find a complete solution that minimizes the distance from $\mathbf{F}^\circ$ according to the $L_p$-norm. The algorithm is guaranteed to find a Pareto optimal solution only when the $L_1$-norm (Manhattan norm) is adopted. In this case, MO-DPOP$_{L_1}$ finds a Pareto optimal solution that maximizes the average of the reward values of all objectives.
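The second phase can be sketched as the selection of the candidate whose reward vector is closest to the utopia point under the $L_p$-norm. As before, the sketch enumerates precomputed reward vectors rather than running DPOP; `closest_to_utopia` and the example data are hypothetical.

```python
def closest_to_utopia(solutions, p=1):
    """Return the solution minimizing the L_p distance to the utopia point."""
    # Utopia point: each objective optimized independently.
    utopia = tuple(max(component) for component in zip(*solutions.values()))

    def distance(name):
        gaps = (abs(u - r) ** p for u, r in zip(utopia, solutions[name]))
        return sum(gaps) ** (1 / p)

    return min(solutions, key=distance), utopia


solutions = {"s1": (4, 1), "s2": (3, 3), "s3": (1, 4), "s4": (2, 2)}
chosen, utopia = closest_to_utopia(solutions, p=1)
```

With p = 1, minimizing the distance to the utopia point amounts to maximizing the sum (and hence the average) of the rewards over all objectives, which is why s2, with total reward 6, is selected.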
DIPLS [178]. Distributed Iterated Pareto Local Search (DIPLS) is an incomplete, synchronous, search-based algorithm. It extends the Pareto Local Search (PLS) algorithm [138], which is a hill climbing algorithm designed to solve centralized multi-objective optimization problems, to solve MO-DCOPs. The idea behind DIPLS is to evolve an initial solution toward the Pareto front. To do so, it starts from an initial set of random assignments, and applies PLS iteratively to generate new non-dominated solutions. DIPLS requires a total ordering of agents and elects one agent as the controller. At each iteration, the controller filters the set of solutions by dominance and broadcasts them to the agents in the MO-DCOP. Upon receiving a solution, an agent generates a list of neighboring solutions by modifying the assignments of the variables that it controls, and sends them back to the controller. When the controller receives the messages from all agents, it proceeds to filter (by dominance) the set of solutions received, and if a new non-dominated solution is found, it repeats the process. DIPLS is shown to outperform B-MOMS on random graph problems [178].

Dynamic DCOPs
Within a real-world MAS application, agents often act in dynamic environments that evolve over time. For instance, in a disaster management search and rescue scenario, new information (e.g., the number of victims in particular locations, or priorities on the buildings to evacuate) typically becomes available in an incremental manner. Thus, the information flow modifies the environment over time. To cope with such a requirement, researchers have introduced the Dynamic DCOP (D-DCOP) model, where reward functions can change during the problem solving process, agents may fail, and new agents may be added to the DCOP being solved. With respect to our categorization, in the D-DCOP model, the agents are completely cooperative and they have deterministic behavior and total knowledge. On the other hand, the environment is dynamic and deterministic.

Definition
The Dynamic DCOP (D-DCOP) model is defined as a sequence of classical DCOPs: $\mathcal{D}_1, \ldots, \mathcal{D}_T$, where each $\mathcal{D}_t = \langle A_t, X_t, D_t, F_t, \alpha_t \rangle$ is a DCOP, representing the problem at time step $t$, for $1 \leq t \leq T$. The goal in a D-DCOP is to solve optimally the DCOP at each time step. We assume that the agents have total knowledge about their current environment (i.e., the current DCOP), but they are unaware of changes to the problem in future time steps. In a dynamic system, agents are required to adapt as fast as possible to environmental changes. Stability [40,175] is a core algorithmic concept, where an algorithm seeks to minimize the number of steps that it requires to converge to a solution each time the problem changes. In such a context, these converged solutions are also called stable solutions. Self-stabilization is a related concept derived from the area of fault-tolerance: Definition 6 (Self-stabilization) A system is self-stabilizing if and only if the following two properties hold:
• Convergence: The system reaches a stable solution in a finite number of steps, starting from any given state. In the DCOP context, this property expresses the ability of the agents to coordinate a joint variable assignment that optimizes the problem at time $t + 1$, starting from an assignment of the problem's variables at time $t$.
• Closure: The system remains in a stable solution, provided that no changes in the environment happen. In the DCOP context, this means that agents do not change the assignment for their variables after converging to a solution.
An extension of the concept of self-stabilization is that of super-stabilization [41], which focuses on stabilization after topological changes. In the context of D-DCOPs, differently from self-stabilizing algorithms, where convergence after a single change in the constraint graph can be as slow as the convergence from an arbitrary starting state, super-stabilizing algorithms take special care of the time required to adapt to a single change in the constraint graph.
Solving Dynamic DCOPs is NP-hard, as it requires solving each DCOP of the Dynamic DCOP independently.

Algorithms
In principle, one could use classical DCOP algorithms to solve each DCOP $\mathcal{D}_t$ at each time step $1 \leq t \leq T$. However, the dynamic evolution of the environment imposes firm requirements on the algorithm design, in order for the agents to respond automatically and efficiently to environmental changes over time. In particular, D-DCOP algorithms often follow the self-stabilizing property. As in the previous sections, we categorize the algorithms as being either complete or incomplete, according to their ability to determine the optimal solution at each time step.

COMPLETE ALGORITHMS
SDPOP [148]. Self-stabilizing DPOP (SDPOP) is a synchronous, inference-based algorithm that extends DPOP to handle dynamic environments. It is composed of three self-stabilizing phases: (1) A self-stabilizing DFS tree generation, whose goal is to create and maintain a DFS pseudo-tree structure; (2) A self-stabilizing algorithm for populating the utility messages; and (3) A self-stabilizing algorithm for the value propagation phase. Such procedures work as in DPOP and they are invoked whenever any change in the DCOP problem sequence is revealed.
In [148], the authors discuss two self-stabilizing extensions, namely super-stabilization and fault-containment, which can be used to provide guarantees about the way the system transitions from a valid state to the next, after an environment change. The complexity of SDPOP is similar to that of DPOP: at each time step, it uses $O(m)$ messages of maximal size $O(d^{w^*})$. Upon individual changes, SDPOP stabilizes after a number of UTIL messages bounded by the length of the longest branch in the pseudo-tree, and after at most $k$ VALUE messages, where $k$ is the number of reward functions of the problem.
RSDPOP [151]. RSDPOP is a synchronous, inference-based algorithm that extends the SDPOP algorithm by introducing the possibility of specifying commitment deadlines and stability constraints. In the proposed model, some of the variables may be unassigned at a given point in time, while others must be assigned within a specific deadline. Commitment deadlines can be either hard or soft. Hard commitments model irreversible processes: when a hard committed variable is assigned, its value cannot be changed. Soft commitments model contracts with penalties: if a soft committed variable $x_i^t$ has been assigned at time $t$, its value can be changed at time $t' > t$, at the price of a cost penalty. Such costs are modeled via stability constraints, which are defined as binary relations representing the cost of changing the value of variable $x_i$ from time $t$ to time $t + 1$. Given the set of stability constraints S ⊆ F, at each time t, the goal is to find a solution that optimizes the reward functions together with a stability term. The latter term accounts for the penalties associated with the value assignment updates for the soft committed variables. RSDPOP has the same order of complexity as SDPOP.
I-ADOPT and I-BnB-ADOPT [192]. Incremental Any-space ADOPT (I-ADOPT) and Incremental Any-space BnB-ADOPT (I-BnB-ADOPT) are asynchronous, search-based algorithms that extend ADOPT and BnB-ADOPT, respectively. In the incremental any-space versions of the algorithms, each agent maintains bounds for multiple contexts (in contrast, agents in ADOPT and BnB-ADOPT maintain bounds for one context only). By doing so, when solving the next DCOP in the sequence, agents may reuse the bounds information computed in the previous DCOP. In particular, the algorithms identify affected agents, which are agents that cannot reuse the information computed in the previous iterations, and they recompute bounds exclusively for such agents.
The worst-case complexity of the algorithms is analogous to that of ADOPT, in terms of network load, size of messages, and number of operations performed by each agent. In contrast, the agent space requirements increase with respect to those of ADOPT, by a factor proportional to the amount of cache used by the agents to reuse bounds of the previous DCOP.

INCOMPLETE ALGORITHMS
SBDO [22]. Support Based Distributed Optimization (SBDO) is an asynchronous search-based algorithm that extends the Support Based Distributed Search algorithm [65] to the multi-agent case. It uses two types of messages: is-good and no-good. Is-good messages contain an ordered partial assignment; these messages are exchanged among neighboring agents upon a change in their value assignments. Each agent, upon receiving a message, decides what value to assign to its own variables, attempting to maximize its local utility, and communicates such decisions to its neighboring agents via is-good messages. No-good messages are used in response to violations of hard constraints, or in response to obsolete assignments. A no-good message is augmented with a justification, that is, the set of hard constraints that are violated, and is saved locally within each agent. This information is used to discard partial assignments when they are supersets of one of the known no-goods. The changes of the dynamic environment are communicated via messages, which are sent from the environment to the agents. In particular, changes in hard constraints require the update of all the justifications in all no-goods.
FMS [156]. Fast Max-Sum (FMS) is an asynchronous inference-based algorithm that extends Max-Sum to the dynamic DCOP model. As in Max-Sum, the algorithm operates on a factor graph. Solution stability is maintained by recomputing only those factors that changed between the previous DCOP $D_{t-1}$ and the current DCOP $D_t$. In [156], the authors exploit domain-specific properties in a task allocation problem to reduce the number of states over which each factor has to compute its solution. In addition, FMS is able to efficiently manage the addition or removal of tasks (e.g., factors), by propagating messages exclusively in the regions of the factor graph affected by such topological changes. FMS has been extended in several ways. Bounded Fast Max-Sum provides bounds on the solution found and guarantees super-stabilization [100]. Branch-and-Bound Fast Max-Sum (BnB-FMS) extends FMS by providing online domain pruning using a branch-and-bound technique [101].
Mobed [170]. Multiagent Organization with Bounded Edit Distance (Mobed) does not directly solve D-DCOPs, but instead is a self-stabilizing pseudo-tree generation algorithm. It operates in dynamic environments where constraints may change, new agents may appear, and existing agents may be dropped. As virtually all algorithms solving D-DCOPs are required to deal with network structural reconfigurations, we believe that describing this approach in this section is appropriate. Unlike existing distributed DFS algorithms, Mobed is able to bound the edit distance (a metric that measures the difference between two pseudo-trees) between the pseudo-trees of subsequent DCOPs in a D-DCOP. Mobed is able to do so by exploiting the concept of agent hierarchies, which are defined as subtrees in the original multi-agent constraint graph, during the addition and removal of an agent in the constraint graph. In particular, the addition of an agent connected to all nodes in a pseudo-tree consisting of a root with $c$ children has a worst-case edit distance of $c + 1$ for a DFS algorithm. In Mobed, this operation is performed by determining the insertion point (defined as a neighbor of the agent being added) that minimizes the tree depth when inserting the new agent as the parent or the child of the insertion point. The number of edits for each agent addition is thus bounded by two.
Distributed Q-learning and R-learning [131]. The Distributed Q-learning and R-learning algorithms are synchronous reinforcement-learning-based algorithms that extend the centralized Q-learning [3] and centralized R-learning [163,102] algorithms. The algorithms solve a variant of D-DCOPs, called Markovian Dynamic DCOPs (MD-DCOPs), where the DCOP in the next time step $D_{t+1}$ depends on the variable assignment performed by the agents in the DCOP in the current time step $D_t$. However, the transition function between these two DCOPs is not known to the agents and the agents must, thus, learn it. Each agent maintains Q-values and R-values for each $(\sigma_{t-1}, d_i^t)$ pair, where $\sigma_{t-1}$ is the solution for the DCOP $D_{t-1}$ and $d_i^t$ is the value of its variables in the reward function $f_i^t \in F^t$. These Q-values and R-values represent the predicted reward the agent will get if it assigns its variables values according to $d_i^t$ when $\sigma_{t-1}$ is the previous solution. The agents repeatedly refine these values at every iteration and choose the values with the maximum Q-value or R-value at each iteration.
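To make the bookkeeping concrete, the following Python sketch shows a standard tabular Q-learning update keyed by a (previous solution, own value) pair. It is only an illustration: the learning rate `alpha`, the discount `gamma`, the helper names, and the toy rewards are assumptions, and the actual distributed update rules of [131] are not reproduced here.

```python
from collections import defaultdict

def q_update(q, prev_solution, value, reward, next_best, alpha=0.1, gamma=0.9):
    """One standard tabular Q-learning step for the pair (prev_solution, value).

    next_best is the largest Q-value available in the following time step.
    alpha and gamma are illustrative defaults, not values from the survey.
    """
    key = (prev_solution, value)
    q[key] += alpha * (reward + gamma * next_best - q[key])
    return q[key]

def greedy_value(q, prev_solution, domain):
    """Choose the domain value with the largest Q-value given the previous solution."""
    return max(domain, key=lambda v: q[(prev_solution, v)])

# Hypothetical toy run: value "b" consistently yields a higher reward than "a".
q = defaultdict(float)
domain = ["a", "b"]
prev = ("a", "a")  # hypothetical previous joint solution
for _ in range(50):
    q_update(q, prev, "a", reward=1.0, next_best=0.0)
    q_update(q, prev, "b", reward=2.0, next_best=0.0)
best = greedy_value(q, prev, domain)
```

In this toy run the table entry for "b" approaches its reward of 2, so the greedy choice settles on "b".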

Probabilistic DCOPs
So far, we have discussed DCOP models that can model MAS problems in environments that are deterministic. However, many real-world applications are characterized by environments with stochastic behavior. In other words, there are exogenous events that can influence the outcome of an agent's action. For example, weather conditions or the state of a malfunctioning device can affect the reward of an agent's action. To cope with such scenarios, researchers have introduced Probabilistic DCOP (P-DCOP) models, where the uncertainty in the state of the environment is modeled through stochasticity in the reward functions. With respect to our categorization, in the P-DCOP model the agents are completely cooperative and they have deterministic behavior. Additionally, the environment is static and stochastic. While a large body of research has focused on problems where agents have total knowledge, we will discuss a subclass of P-DCOPs where the agents' knowledge of the environment is limited, and agents must balance exploration of the unknown environment and the exploitation of the known rewards.

Definition
A common strategy to model uncertainty is to augment the outcome of the reward functions with a stochastic character [11,167,129]. Another method is to introduce additional random variables to the reward functions, which simulate exogenous uncontrollable traits [93,94,180]. To cope with such a variety, we introduce the Probabilistic DCOP (P-DCOP) model, which generalizes the proposed models of uncertainty. A P-DCOP is defined by a tuple $\langle A, X, D, F, \alpha, I, \Omega, P, E, U \rangle$, where A and D are defined as in Definition 4.1. In addition,
• X is a mixed set of decision variables and random variables.
• $I = \{r_1, \ldots, r_q\} \subseteq X$ is a set of random variables modeling uncontrollable stochastic events, such as weather or a malfunctioning device.
• F is the set of reward functions, each defined over a mixed set of decision variables and random variables, and such that each combination of values for the decision variables in the scope of a reward function results in a probability distribution. As a result, the local value assignment $\sigma_{x_i \setminus I}$, given an outcome for the random variables involved in $f_i$, is itself a random variable.
• $\alpha : X \setminus I \to A$ is a mapping from decision variables to agents. Notice that random variables are not controlled by any agent, as their outcomes do not depend on the agents' actions.
• $\Omega = \{\Omega_1, \ldots, \Omega_q\}$ is the (possibly discrete) set of events for the random variables (e.g., different weather conditions or stress levels a device is subjected to) such that each random variable $r_i \in I$ takes values in $\Omega_i$. In other words, $\Omega_i$ is the domain of random variable $r_i$.
• $P = \{p_1, \ldots, p_q\}$ is a set of probability distributions for the random variables, such that $p_i : \Omega_i \to [0, 1]$ assigns a probability value to an event for $r_i$, and $\int_{\omega \in \Omega_i} p_i(\omega)\, d\omega = 1$, for each random variable $r_i \in I$.
• E is an evaluator function from random variables to real values that, given an assignment of values to the decision variables, summarizes the distribution of the aggregated reward functions.
• U is a function that, given a random variable, returns an ordered set of different outcomes, and it is based on the decision maker's preferences. This function is needed when the reward functions have uncertain outcomes, and thus these distributions are not readily comparable.
The goal in a P-DCOP is to find a complete solution $\sigma^*$, that is, an assignment of values to all the decision variables, such that:
$$\sigma^* := \operatorname*{arg\,min/max}_{\sigma}\; E\Big[\sum_{f_i \in F} U\big(f_i(\sigma_{x_i \setminus I})\big)\Big] \qquad (5)$$
where argmin or argmax is selected depending on the algorithm adopted, and $\Sigma$ is the operator used to aggregate the values from the functions $f_i \in F$. Typically such an operator is a summation; however, to handle continuous distributions, other operators have been proposed. In other words, agents attempt to maximize the utility of the cumulative reward functions of the P-DCOP, with respect to the evaluator function E.
The probability distribution over the domain of random variables $r_i \in I$ is called a belief. An assignment of all random variables in I describes a (possible) scenario governed by the environment. As the random variables are not under the control of the agents, they act independently of the decision variables. Specifically, their beliefs are drawn from probability distributions. Furthermore, they are assumed to be independent of each other and, thus, they model independent sources of exogenous uncertainty.
The reward function U enables us to compare the uncertain reward outcomes of the reward functions. In general, the reward function is non-decreasing, that is, the higher the reward, the higher the utility. However, the reward function should be defined for the specific application of interest. For example, in farming, the utility increases with the amount of produce harvested. However, farmers may prefer a smaller but highly certain amount of produce harvested over a larger but highly uncertain and, thus, risky outcome. The evaluation function E is used to summarize in one criterion the rewards of a given assignment that depends on the random variables. A possible evaluation function is the expectation function, $E[\cdot] = \mathbb{E}[\cdot]$.
Let us now introduce some concepts that are commonly adopted in the study of P-DCOPs.

Definition 7 (Convolution)
The convolution of the probability density functions (PDFs) $f(x)$ and $g(y)$ of two independent random variables X and Y is the integral of the product of the two functions after one is reversed and shifted:
$$h(z) = (f * g)(z) = \int_{-\infty}^{\infty} f(x)\, g(z - x)\, dx$$
It produces a new PDF $h(z)$ that defines the overlapping area between $f(x)$ and $g(y)$ as a function of the quantity that one of the original functions is translated by. In other words, the convolution is a method for determining the distribution of the sum of two independent random variables. The counterpart for the distribution of the sum Z = X + Y of two independent discrete variables is:
$$P(Z = z) = \sum_{k=-\infty}^{\infty} P(X = k)\, P(Y = z - k)$$
In a P-DCOP, the value returned by a function $f_i$, for an assignment on its scope $x_i$, is a random variable $V_i$ ($V_i \sim f_i(x_i)$). Thus, the global value $\sum_{f_i \in F} V_i$ is also a random variable, whose probability density function (PDF) is the convolution of the PDFs of the individual $V_i$. Thus, the concept of convolution of two PDFs in a P-DCOP is related to the summation of the utilities of two reward functions in classical DCOPs.
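As a small illustration of the discrete case, the sketch below convolves the probability mass functions of two independent discrete rewards to obtain the distribution of their sum. The function name `convolve_pmf` and the example distributions are assumptions made for the illustration.

```python
from collections import defaultdict

def convolve_pmf(pmf_x, pmf_y):
    """Return the PMF of Z = X + Y for independent discrete X and Y.

    Each PMF is a dict mapping an outcome to its probability.
    """
    pmf_z = defaultdict(float)
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            # Independence: P(X = x, Y = y) = P(X = x) * P(Y = y).
            pmf_z[x + y] += px * py
    return dict(pmf_z)

# Hypothetical rewards of two reward functions under a fixed assignment.
v1 = {0: 0.5, 10: 0.5}
v2 = {0: 0.2, 5: 0.8}
total = convolve_pmf(v1, v2)  # distribution of the global value V1 + V2
```

Here the global value takes the outcomes 0, 5, 10, and 15, and the resulting probabilities again sum to one.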
A common concept in optimization with uncertainty is that of ranking a set of random variables $\{r_1, r_2, \ldots\}$ with cumulative distribution functions (CDFs) $\{F_1(x), F_2(x), \ldots\}$. Such distributions are also commonly called lotteries, a concept related to that of stochastic dominance, which is a form of stochastic ordering based on preference regarding outcomes. It refers to situations where a probability distribution over possible outcomes can be ranked as superior to another.
First-order stochastic dominance refers to the situation when one lottery is unambiguously better than another:
Definition 8 (First-Order Stochastic Dominance)
Given two random variables $r_i$ and $r_j$ with CDFs $F_i(x)$ and $F_j(x)$, respectively, $F_i$ first-order stochastically dominates $F_j$ iff:
$$F_i(x) \le F_j(x)$$
for all x, with a strict inequality over some interval.
If $F_i$ first-order stochastically dominates $F_j$, then $F_i$ necessarily has a strictly larger expected value:
$$E[F_i(x)] > E[F_j(x)]$$
In other words, if $F_i$ dominates $F_j$, then the decision maker prefers $F_i$ over $F_j$ regardless of what the reward function U is, as long as it is weakly increasing. It is not always the case that one CDF will first-order stochastically dominate another. In such a case, one can use second-order stochastic dominance to compare them. The latter refers to the situation when one lottery is unambiguously less risky than another:
Definition 9 (Second-Order Stochastic Dominance)
Given two random variables $r_i$ and $r_j$ with CDFs $F_i(x)$ and $F_j(x)$, respectively, $F_i$ second-order stochastically dominates $F_j$ iff:
$$\int_{-\infty}^{c} F_i(x)\, dx \le \int_{-\infty}^{c} F_j(x)\, dx \qquad (9)$$
for all c, with a strict inequality for some values of c.
If Equation (9) holds with equality for all $c \ge c'$, for some sufficiently large $c'$, then $E[F_i(x)] = E[F_j(x)]$. In this case, as both lotteries are equal in expectation, the decision maker prefers the lottery $F_i$, which has less variance and is, thus, less risky.
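The two dominance tests can be sketched for discrete lotteries as follows. This is a minimal illustration under the assumption that each lottery is a finite probability mass function; the function names and the example lotteries are hypothetical. Because the CDFs of discrete lotteries are step functions, the integrated CDFs are piecewise linear, so it suffices to compare them at the support points.

```python
def cdf_on(pmf, grid):
    """Evaluate the CDF of a discrete PMF (dict outcome -> prob) at each grid point."""
    return [sum(p for v, p in pmf.items() if v <= x) for x in grid]

def first_order_dominates(pmf_i, pmf_j, tol=1e-12):
    """True if F_i(x) <= F_j(x) everywhere, with strict inequality somewhere."""
    grid = sorted(set(pmf_i) | set(pmf_j))
    fi, fj = cdf_on(pmf_i, grid), cdf_on(pmf_j, grid)
    diffs = [b - a for a, b in zip(fi, fj)]  # F_j - F_i at each support point
    return all(d >= -tol for d in diffs) and any(d > tol for d in diffs)

def second_order_dominates(pmf_i, pmf_j, tol=1e-12):
    """True if the integral of F_i up to c never exceeds that of F_j, strictly below somewhere."""
    grid = sorted(set(pmf_i) | set(pmf_j))
    fi, fj = cdf_on(pmf_i, grid), cdf_on(pmf_j, grid)
    area, areas = 0.0, []
    for k in range(1, len(grid)):
        width = grid[k] - grid[k - 1]
        # Both CDFs are constant on [grid[k-1], grid[k]).
        area += (fj[k - 1] - fi[k - 1]) * width
        areas.append(area)
    return all(a >= -tol for a in areas) and any(a > tol for a in areas)

# Hypothetical lotteries: "safe" and "risky" have the same mean of 5.
safe = {5: 1.0}
risky = {0: 0.5, 10: 0.5}
better = {5: 0.5, 10: 0.5}
```

With these lotteries, `better` first-order dominates `risky`, while `safe` dominates `risky` only in the second-order sense, which matches the reading of second-order dominance as a preference for the less risky of two lotteries with equal expectation.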
Another common concept in P-DCOPs is that of regret. In decision theory, regret expresses the negative emotion arising from learning that a different solution than the one adopted would have had a more favorable outcome. In P-DCOPs, the regret of a given solution is typically defined as the difference between its associated reward and that of the theoretical optimal solution. The notion of regret is especially useful in allowing agents to make robust decisions in settings where they have limited information about the reward functions.
An important type of regret is the minimax regret. Minimax regret is a decision rule used to minimize the possible loss for a worst case (i.e., maximum) regret. As opposed to the (expected) regret, minimax regret is independent of the probabilities of the various outcomes. Thus, minimax regret could be used when the probabilities of the outcomes are unknown or difficult to estimate.
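The minimax regret rule can be illustrated on a small table of rewards. In the sketch below, the solutions, scenarios, and reward values are made up for the example, and the regret of a solution in a scenario is taken to be the gap to the best reward achievable in that scenario.

```python
def minimax_regret(rewards):
    """Return (solution, regret) minimizing the worst-case regret.

    rewards[solution][scenario] is the reward of a solution in a scenario.
    """
    scenarios = list(next(iter(rewards.values())).keys())
    # Best achievable reward in each scenario, over all solutions.
    best = {s: max(r[s] for r in rewards.values()) for s in scenarios}
    # Worst-case regret of each solution across scenarios.
    worst = {
        sol: max(best[s] - r[s] for s in scenarios)
        for sol, r in rewards.items()
    }
    choice = min(worst, key=worst.get)
    return choice, worst[choice]

# Hypothetical rewards of three solutions under two scenarios.
rewards = {
    "x1": {"sunny": 10, "rainy": 0},
    "x2": {"sunny": 6, "rainy": 5},
    "x3": {"sunny": 3, "rainy": 6},
}
choice, value = minimax_regret(rewards)
```

No scenario probabilities appear anywhere in the computation: the rule selects `x2`, whose worst-case regret of 4 is the smallest among the three solutions.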
Solving Probabilistic DCOPs is PSPACE-hard, as in general, the process is required to remember a complete solution for each possible state associated to the uncertain random variables. The study of complexity classes for Probabilistic DCOPs is largely unexplored. Thus, we foresee this as a potential direction for future research, in which particular focus could be given to determining fragments of Probabilistic DCOPs characterized by lower complexity than the one above.

Algorithms
We categorize P-DCOP algorithms into complete and incomplete algorithms, according to whether or not they guarantee to find the optimal solution for a given evaluator and reward functions. Unless otherwise specified, the ordering operator in equation (5) refers to the argmax operator.

COMPLETE ALGORITHMS
E[DPOP] [94]. E[DPOP] is an (in)complete, synchronous, sampling-based and inference-based algorithm. It uses a collaborative sampling strategy, where all agents concerned with a given random variable agree on a common sample set that will be used to estimate the PDF of that random variable. Agents performing collaborative sampling independently propose sample sets for the random variables influencing the variables they control, and elect one agent among themselves as responsible for combining the proposed sample sets into one. The algorithm is defined over P-DCOPs with I ≠ ∅ and deterministic reward function outcomes, that is, for each combination of values for the variables in $x_i$, $f_i(\sigma_{x_i \setminus I})$ is a degenerate distribution (i.e., a distribution that results in a single value), and the reward function U is the identity function. E is an arbitrary evaluator function summing over all functions in F.
E[DPOP] builds on top of DPOP, and proceeds in four phases. In Phase 1, the agents order themselves into a pseudo-tree, ignoring the random variables. In Phase 2, the agents bind random variables to some decision variable. In Phases 3 and 4, the agents run the UTIL and VALUE propagation phases as in DPOP, except that random variables are sampled. Based on the strategy adopted in binding the random variables in Phase 2, the algorithm has two variants [93]. In Local-E[DPOP], a random variable $r_i \in I$ is assigned to each decision variable responsible for enforcing a constraint involving $r_i$. In this approach, the agents do not collaborate by exchanging information about how their utilities depend on the random variables. In contrast, Global-E[DPOP] assigns $r_i$ to the lowest common ancestor agent, which is responsible for combining the proposed samples. While this additional information can produce higher-quality solutions, both algorithms are generally incomplete. One exception is when the evaluation function E adopted is linear, as in the case of the expectation function, in which case the algorithms are complete.
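The role of a common sample set can be sketched as follows for a single decision variable and a single random variable, assuming the expectation is used as the evaluator. The function name, the reward table, and the sample set are hypothetical, and the distributed UTIL and VALUE propagation of E[DPOP] is deliberately left out.

```python
def best_value_by_sampling(domain, samples, reward):
    """Pick the decision value with the highest average reward over a shared sample set."""
    estimates = {
        value: sum(reward(value, s) for s in samples) / len(samples)
        for value in domain
    }
    return max(estimates, key=estimates.get), estimates

# Hypothetical reward of an action under an outcome of the random variable.
REWARD_TABLE = {
    ("irrigate", "dry"): 8, ("irrigate", "wet"): 2,
    ("wait", "dry"): 1, ("wait", "wet"): 6,
}

def reward(action, weather):
    return REWARD_TABLE[(action, weather)]

# A sample set that all agents concerned with the random variable agreed on.
shared_samples = ["dry", "dry", "wet", "dry"]
choice, estimates = best_value_by_sampling(["irrigate", "wait"], shared_samples, reward)
```

Because every agent evaluates its options against the same samples, the estimates of different agents refer to the same scenarios and can be added consistently.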
In terms of complexity, E[DPOP] has the same order of complexity as DPOP, for both number and size of messages. Local-E[DPOP] produces messages of size $O(d^{w^*})$, where d is the size of the largest decision variable domain and $w^*$ is the induced width of the DFS pseudo-tree. In contrast, Global-E[DPOP] generates, in the worst case, UTIL messages of size $O(d^{w^*} s^q)$, where s is the largest sample set size; this occurs when the root as well as all leaves are constrained with all q random variables.
SD-DPOP [129]. Stochastic Dominance DPOP (SD-DPOP) also operates on a P-DCOP model, where I = ∅, E is the second-order stochastic dominance criterion, Σ denotes the convolution of the distributions $f_i(\sigma_{x_i \setminus I})$, and U is the identity function. It is a complete, synchronous, inference-based algorithm that extends DPOP to solve P-DCOPs. Similar to DPOP, it has three phases. In Phase 1, like DPOP, it constructs a pseudo-tree. In Phase 2, instead of summing up utilities, the agents convolve the reward functions, and instead of propagating utilities up the pseudo-tree, they propagate convolved reward functions. In Phase 3, like DPOP, the agents choose values for their variables. However, instead of choosing values that maximize the utility of their subtrees, the agents choose their values according to the second-order stochastic dominance criterion.
Like DPOP, SD-DPOP requires a linear number of messages. In addition, in SD-DPOP, VALUE messages contain each Pareto optimal value of the sending agent, and UTIL messages contain a representation of the reward function for each Pareto optimal solution and each combination of values of the parent and pseudo-parents of the sending agent. Thus, for continuous PDFs that can be represented by mean and variance, the message complexity is $O(p\, d^{w^*})$, where p is the size of the Pareto set. If the reward functions are represented by discretized bins, then the message complexity is $O(b\, p\, d^{w^*})$, with b being the maximum number of bins used to represent a reward function.

INCOMPLETE ALGORITHMS
DNEA [11]. The P-DCOP model proposed in [11] is characterized by uncertainty exclusively at the level of the outcome of the reward functions, and not due to random variables. Thus, I = ∅. In addition, the reward function U is the identity function, while E is a given evaluator function (e.g., the expectation) for the functions $f_i \in F$. In such settings, by employing the evaluation function, the authors show that one can reduce the uncertainty associated to each reward function to the deterministic case. Thus, one can solve the proposed P-DCOP problems using classical DCOP approaches.
In particular, they propose the Distributed Neighbor Exchange Algorithm (DNEA), which is an incomplete, synchronous, search-based algorithm that is similar to DSA. Each agent starts by assigning a random value to each of its variables and sends this information to all its neighbors. Upon receiving the values of its neighbors, it computes a reward vector, which contains the reward for each possible combination of values for all its variables under the assumption that its neighbors' values are those in the messages received. It then sends this reward vector to all its neighbors. Upon receiving the reward vectors of its neighbors, it computes the best value for each of its variables, assigns those values to its variables probabilistically, and sends the assigned values to all its neighbors. This process repeats until a termination condition is satisfied. At each iteration, each agent $a_i$ has complexity $O(l\, d^2)$, where $l = \max_{a_i \in A} |N_{a_i}|$ and $d = \max_{D_i \in D} |D_i|$, and the total number of messages exchanged is [...].
U-GDL [167]. The P-DCOP model proposed in [167] also assumes that the reward functions are not dependent on random variables. Thus, I = ∅. Additionally, the authors assume that E is the expectation of the convolution (Σ) of the distributions $f_i(\sigma_{x_i \setminus I})$, and U is a given risk function. They propose the Uncertain Generalized Distributive Law (U-GDL) algorithm, which is an incomplete, asynchronous, inference-based algorithm similar to Max-Sum. It extends the Generalized Distributive Law (GDL) algorithm [5] by redefining the (max, +) algebra to the setting where rewards are random variables rather than scalars. The + operator is extended to perform convolution of two random variables. To cope with the potential issue that not all PDFs are closed under convolution, the authors suggest resorting to sampling methods to approximate such operations. The max operator is defined to distribute over convolution and to select the maximal elements from a set of random variables based on their expected utility.
In order to filter partial potential solutions that can never achieve global optimality, the authors introduce a first-order stochastic dominance condition, which is employed in the context of the max operator. They also discuss necessary and sufficient conditions for dominance. The former discards all dominated solutions, but it might also discard some non-dominated solutions (this is equivalent to using a standard DCOP to solve the P-DCOP model adopted in their work). The latter preserves optimal solutions, but in general retains sub-optimal ones as well. Prior to convergence, U-GDL requires a number of messages equal to twice the diameter of the resulting acyclic graph, while the message size is exponential in the maximum size of the merged variables' domain.

Notable Variant: P-DCOPs with Partial Agent Knowledge
We now describe a class of probabilistic DCOPs where agents have partial knowledge about the environment. That is, the reward functions are only partially known and, therefore, agents may discover the unknown rewards via exploration [173]. The new model aims at capturing those domains where agents have an "explorative nature," i.e., one of the agents' goals is to acquire knowledge about the environment in which they act. Agents are concerned with a total, online reward achievable in a limited time frame. In this context, agents must balance the coordinated exploration of the unknown environment and the exploitation of the known portion of the rewards, in order to maximize the global utility. This model was originally called Distributed Coordination of Exploration and Exploitation (DCEE) [172].

Definition
The P-DCOP model for agents with partial knowledge is described by extending the P-DCOP model introduced in Section 6.1, as follows: $\langle A, X, D, F, \alpha, I, \Omega, P, E, U, T \rangle$, where T > 0 is a finite time horizon characterizing the time within which the agents can exploit the unknown reward functions and explore the search space. The goal in such a P-DCOP problem is to find a sequence of assignments $\sigma^* = [\sigma^*_1, \ldots, \sigma^*_T]$ that maximizes the utility of the cumulative reward within the finite time horizon T:
$$\sigma^* := \operatorname*{arg\,min/max}_{[\sigma_1, \ldots, \sigma_T]}\; E\Big[\sum_{t=1}^{T} \sum_{f_i \in F} U\big(f_i((\sigma_t)_{x_i \setminus I})\big)\Big] \qquad (10)$$
where $\sigma_t \in \Sigma$ denotes a complete solution at time t. In other words, agents have at most T steps to modify the value of their decision variables, and solve T P-DCOP problems by acquiring more and more knowledge of the environment as time unrolls.

Algorithms
In a stochastic and a priori unknown environment, the reward functions need to be learned online through interactions between the agents and their environment. Thus, the algorithms presented in this section are targeted to coordinate agents to solve a sequence of optimization problems in order to simultaneously reduce uncertainty about the local reward functions (exploration) and optimize the global objective (exploitation). In addition, the following algorithms are incomplete. Unless otherwise specified, the ordering operator in equation (10) refers to the argmax operator.
BE-Rebid [172]. The Balanced Exploration Rebid (BE-Rebid) is a synchronous, search-based algorithm that solves P-DCOPs with I = ∅, where U and E are the identity functions. It extends MGM in that each agent calculates and communicates its expected gain. The algorithm is introduced in the context of a wireless sensor network problem, where agents can perform small movements in order to enhance their communication capabilities, which are characterized by the distance between pairs of agents. Each agent can perform three actions: stay in the current position, explore another position, or backtrack to a previously explored position and halt movement. In each time step, BE-Rebid computes the expected reward of executing the explore or backtrack actions, assuming complete knowledge of the underlying distribution of the reward functions. Exploring is evaluated by using order statistics and is based on the reward of the best value found during exploration. Backtracking to a known position results in a reward associated to the backtracked state for the remainder of the time steps (i.e., the agent stays in that state). Following the region-optimal approaches presented in the context of classical DCOPs, the authors propose a version of the algorithm, called BE-Rebid-2 [173], that allows pairs of agents to explore in a coordinated fashion. Interestingly, in such settings, the authors find that increasing coordination (measured by the number of agents that can execute a joint action) can decrease solution quality. This phenomenon is referred to as team uncertainty penalty.
HEIST [169]. HEIST is a synchronous, inference-based algorithm that solves P-DCOPs with I = ∅, where E is the expectation function and U is the identity function. Thus, it aims at maximizing the expected utility of the cumulative reward function $F_g$ within the finite time horizon T. It does so by adapting a Multi-Armed Bandit (MAB) approach [176] to a distributed scenario. A MAB is a slot machine with multiple arms, each of which yields a reward drawn from an unknown but fixed probability distribution. A MAB strategy trades off exploration and exploitation when pulling the arms, in order to maximize the cumulative reward over a finite horizon. To cope with the uncertain and stochastic nature of the reward functions, HEIST models each reward function as a MAB, such that the joint assignment of the variables in the scope of the given reward function becomes an arm of that bandit. It seeks to maximize the expected cumulative optimization reward received over a finite time horizon by repeatedly pulling the MAB arms to select the joint action with the highest estimated upper confidence bound (UCB) [13] on the sum of the local gains received in a single time step. To do so, it employs a belief propagation algorithm, known as generalized distributive law (GDL) [5], in order to maximize the UCB in a decentralized fashion. The authors show that HEIST enables agents to balance between exploration and exploitation, and derive optimal asymptotic bounds on the regret of the global cumulative reward attained.
ICG-MaxSum [183]. The Iterative Constraint Generation Max-Sum (ICG-MaxSum) algorithm is a synchronous, inference-based algorithm that solves P-DCOPs with I = ∅, where E is the identity function and $U(f_i(\cdot))$ is the maximal regret function. The algorithm uses the argmin operator. Thus, it aims at minimizing the sum of maximal regrets for all the functions in F. Furthermore, the horizon is T = 1. Thus, unlike the previous algorithms, ICG-MaxSum does not attempt to learn the outcome of the reward functions. Its objective is to find robust solutions to the uncertain problem distributions; it does so by finding the solution that minimizes the maximum regret. The algorithm extends the Iterative Constraint Generation (ICG) method [16,158] to the decentralized case, by decomposing the overall problem into a master problem and a subproblem that are iteratively solved until convergence. At each iteration, these problems are solved by using Max-Sum. The master problem solves a relaxation of the minimax regret goal, where only a subset of all possible joint beliefs is considered, attempting to minimize the loss for the worst case derived from the considered joint beliefs. Once it generates a solution, the subproblem finds the maximally violated constraint associated to such a solution. This is referred to as the witness point, indicating that the current solution is not the best one in terms of the minimax regret. This point is added to the set of joint beliefs considered by the master problem, and the process is repeated until no new witness points can be found. The computational complexity of the algorithm is dominated by the master problem, whose computation is exponential in the number of variables in the scope of the associated reward function (similarly to the standard Max-Sum) and linear in the number of witness points. The messages in the master problem have size proportional to the size of the set of the joint beliefs considered during the master problem, while the messages in the subproblems are normal Max-Sum messages.
A variation of this framework, which aims at minimizing the expected regret, rather than minimizing the maximum regret, has been reported in [91].
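The master/subproblem loop of iterative constraint generation can be illustrated with a centralized, brute-force Python sketch. It is not the decentralized ICG-MaxSum algorithm: Max-Sum is replaced here by exhaustive enumeration, and the reward table, the scenario names, and the function names are assumptions made for the example.

```python
def regret(rewards, x, b):
    """Regret of solution x under scenario b: gap to the best reward in b."""
    return max(r[b] for r in rewards.values()) - rewards[x][b]

def icg_minimax_regret(rewards, scenarios):
    """Constraint generation for minimax regret; enumeration stands in for Max-Sum."""
    witnesses = [scenarios[0]]  # start from an arbitrary scenario
    while True:
        # Master problem: minimize the worst regret over the witness set only.
        x = min(rewards, key=lambda s: max(regret(rewards, s, b) for b in witnesses))
        lower = max(regret(rewards, x, b) for b in witnesses)
        # Subproblem: find the scenario in which x is most regrettable.
        b_star = max(scenarios, key=lambda b: regret(rewards, x, b))
        upper = regret(rewards, x, b_star)
        if upper <= lower:  # no new witness point: x is minimax-regret optimal
            return x, upper, witnesses
        witnesses.append(b_star)

# Hypothetical rewards of three solutions under two scenarios.
rewards = {
    "x1": {"sunny": 10, "rainy": 0},
    "x2": {"sunny": 6, "rainy": 5},
    "x3": {"sunny": 3, "rainy": 6},
}
solution, value, witnesses = icg_minimax_regret(rewards, ["sunny", "rainy"])
```

In this run the master problem first proposes `x1` using only the "sunny" scenario, the subproblem returns "rainy" as a witness point, and the second master solution `x2` admits no further witness, so the loop stops with a maximum regret of 4.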

Quantified DCOPs
The various extensions of the DCOP model discussed so far differ from each other in terms of agent behavior (deterministic vs. stochastic), agent knowledge (total vs. partial), environmental behavior (deterministic vs. stochastic), and environment evolution (static vs. dynamic). But in terms of the agent teamwork, all of the models assume that the agents are completely cooperative. Recently, researchers have introduced the Quantified DCOP (QDCOP) model [111], which assumes that a subset of agents are adversarial, that is, the agents are partially cooperative or competitive.

Definition
The Quantified DCOP (QDCOP) model [111] adapts the Quantified Constraint Satisfaction Problem (QCSP) [17] and Quantified Distributed CSP (QDCSP) [14,15] models to DCOPs.In QCSPs and QDCSPs, all variables are associated to quantifiers and the constraints should be satisfied independently of the value taken by universally quantified variables.Analogously, in QDCOPs, existential (∃) and universal (∀) quantifiers are introduced to differentiate the cooperative agents from the adversarial ones.
A QDCOP has the form $Q(F) = q_0 x_0 \ldots q_n x_n$. $Q$ is a sequence of quantified variables, where each $q_i \in \{\exists, \forall\}$ quantifies the variable $x_i$. The goal of a QDCOP is to find a global optimal solution of the corresponding DCOP. However, a universally quantified variable is neither coordinated nor assigned, as the result has to hold when it takes any value from its domain. In contrast, an existentially quantified variable takes exactly one value from its domain, as in (cooperative) DCOPs. Thus, the optimal solution of a QDCOP may be different from that of the corresponding DCOP. While a DCOP solution defines a single value, associated to its utility, a QDCOP defines upper and lower bounds to the optimal solution. In particular, the best choice in a QDCOP defines the smallest lower bound. In the worst case, the universally quantified variables can worsen the overall objective as much as possible. Therefore, the worst case defines the smallest upper bound. While finding an optimal solution for a DCOP is NP-hard, solving a QDCOP is in general PSPACE-hard [106,17].
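To make the role of the quantifiers concrete, the following centralized sketch evaluates a quantifier prefix by brute force, under the reading that existential variables maximize the utility and universal variables take their worst-case value. The encoding (`"E"`/`"A"` tags, the `qdcop_value` name, and the example utility) is an assumption made for illustration, not the formalization of [111].

```python
def qdcop_value(prefix, domains, utility, assignment=None):
    """Evaluate a quantifier prefix by exhaustive recursion.

    prefix:  list of (quantifier, variable), quantifier in {"E", "A"}
    domains: variable -> list of values
    utility: function from a complete assignment (dict) to a number
    """
    assignment = dict(assignment or {})
    if not prefix:
        return utility(assignment)
    (quantifier, var), rest = prefix[0], prefix[1:]
    values = [
        qdcop_value(rest, domains, utility, {**assignment, var: v})
        for v in domains[var]
    ]
    # Existential variables cooperate; universal variables are adversarial.
    return max(values) if quantifier == "E" else min(values)


domains = {"x0": [0, 1], "x1": [0, 1], "x2": [0, 1]}

def utility(a):
    return 3 * (a["x0"] == a["x1"]) + 2 * (a["x1"] != a["x2"]) + a["x0"]

quantified = qdcop_value([("E", "x0"), ("A", "x1"), ("E", "x2")], domains, utility)
cooperative = qdcop_value([("E", "x0"), ("E", "x1"), ("E", "x2")], domains, utility)
```

Here the fully cooperative problem reaches utility 6, while making `x1` universally quantified lowers the guaranteed value to 3, which illustrates why the QDCOP optimum can differ from that of the corresponding DCOP.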

Algorithms
QDCOPs impose a rigid order on the variables, which reflects the correct order of evaluation of the quantifiers.Therefore, classical DCOP algorithms cannot be directly applied to solve QDCOPs.In [111] the authors proposed several variations of ADOPT to solve QDCOPs, which are all based on a DFS pseudo-tree ordering.To keep the ordering of the quantifiers unchanged, the pseudo-tree can be reshaped by applying extra null edges for each pair of nodes, if necessary.
The algorithms presented here are based on the intuition that universally quantified variables can be seen as adversarial virtual agents, whose goal is to minimize the overall objective.Following this intuition and the pseudotree modifications discussed above, pseudo-tree-based DCOP algorithms can be extended to solve QDCOP.
Min-max ADOPT [111].Min-max ADOPT is a synchronous, search-based algorithm that extends ADOPT to solve QDCOPs.It uses VALUE messages to communicate values of the variables, and COST messages to announce their utilities, similarly to ADOPT.Each agent, starting from the root of the pseudo-tree, assigns values to its variables and propagates them to its neighboring agents with lower priority.Upon receiving VALUE messages from all higher-priority neighbors, the agent updates its context and repeats the same process by choosing an assignment that maximizes its local utility.In Min-max ADOPT, the existentially quantified variables are used to compute the lower bounds, while the universally quantified variables are used to compute the upper bounds.This process is executed until the root agent detects that the upper bound is equal to the lower bound.This algorithm has a relatively simple structure and does not adopt any major pruning strategy.
Alpha-beta ADOPT [111].Alpha-beta ADOPT is a synchronous, search-based algorithm that extends Min-max ADOPT by adapting the alpha-beta search strategy, a common pruning strategy adopted in game-tree search.This strategy employs two boundary parameters, alpha and beta, representing the lower bound and the upper bound for each possible utility of an assignment, respectively.Alpha represents the lower bound, controlled by the universally quantified variables, while beta represents the upper bound, controlled by the existentially quantified variables.Lower bound and upper bound can be modified exclusively by equally universally quantified and existentially quantified variables, respectively.In Alpha-beta ADOPT, when an agent reports the utility value of the current partial solution, its parent reduces the alpha/beta threshold accordingly.Thus, the new alpha/beta values are used to prune the search when an agent detects that the current solution cannot be better than any other solution already evaluated.Alpha and beta values are obtained using a backtracking technique similar to how thresholds are obtained through backtracking in the original ADOPT.
Bi-threshold ADOPT [111]. Bi-threshold ADOPT extends ADOPT by employing two backtracking thresholds instead of one as in ADOPT. In ADOPT, each agent $a_i$ maintains the threshold invariant $lb_i^* \le t_i \le ub_i^*$, where $lb_i^*$ and $ub_i^*$ are the smallest lower and upper bounds, respectively, of the agent over all of its values and $t_i$ is the threshold of the agent. In contrast, in Bi-threshold ADOPT, each agent maintains the threshold invariant $lb_i^* \le t_i^{\alpha} \le t_i^{\beta} \le ub_i^*$, where $t_i^{\alpha}$ is a lower bound on the threshold, similar to alpha in Alpha-beta ADOPT, and $t_i^{\beta}$ is an upper bound on the threshold, similar to beta in Alpha-beta ADOPT.

DCOP MODEL        COMPLEXITY
Classical         NP-hard
Asymmetric        NP-hard
Multiobjective    NP-hard
Dynamic           NP-hard
Probabilistic     PSPACE-hard
Quantified        PSPACE-hard

Table 4: Complexity of the DCOP models.

DCOP Applications
DCOP models have been adopted to represent a wide range of MAS applications, thanks to their ability to capture essential and fundamental MAS aspects as well as the support for the development of general domain-independent algorithms, including applications in wireless sensor networks, power networks, and scheduling.We provide here a description of some of the most compelling applications as well as a general overview of their corresponding DCOP models.A comprehensive list of DCOP applications, categorized according to the DCOP classification of Table 2, is given in Table 5 with related references.

Disaster Management and Coordination Problems
Disaster management and coordination problems refer to how to efficiently and effectively respond to an emergency.In such scenarios, low-powered mobile devices that require limited bandwidth are often deployed and utilized.Due to their decentralized nature, the DCOP approach fits naturally with this application.We now describe several problems within this application domain.
Disaster Evacuation Problems.In a disaster scenario, moving evacuees to the closest refuge shelter can quickly overwhelm shelter capacities.A number of researchers have proposed a DCOP model for disaster evacuation, in which several groups of evacuees have to be led to available shelters [81,30,89,90,80].Group leaders can communicate via mobile devices to monitor and coordinate actions.Each group is represented by a DCOP agent managing variables that represent shelter allocations.Thus, the domain of each variable corresponds to the available shelters.Group sizes and shelter capacity, as well as additional group requirements (e.g., medical needs) and the distance of a group to shelters, are encoded as reward functions.Solving the DCOP ensures an assignment of all groups to shelters that minimizes overflow, such that groups receive the services they need and their travel distances are minimal.
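A minimal centralized sketch of such a model is given below. It assumes one variable per group whose domain is the set of shelters, and it combines a shelter-overflow penalty with travel distance in a single cost. The penalty weight, the function name, and the data are illustrative assumptions; a DCOP algorithm would replace the exhaustive enumeration.

```python
from itertools import product


def best_evacuation(sizes, capacity, distance, overflow_penalty=100):
    """Assign each group to a shelter, minimizing overflow first, then travel."""
    groups, shelters = list(sizes), list(capacity)

    def cost(assign):
        load = {s: 0 for s in shelters}
        for g, s in assign.items():
            load[s] += sizes[g]
        overflow = sum(max(0, load[s] - capacity[s]) for s in shelters)
        travel = sum(distance[g][s] for g, s in assign.items())
        return overflow_penalty * overflow + travel

    # Enumerate every joint assignment of groups to shelters.
    candidates = (
        dict(zip(groups, choice))
        for choice in product(shelters, repeat=len(groups))
    )
    return min(candidates, key=cost)


sizes = {"g1": 30, "g2": 30, "g3": 20}
capacity = {"A": 50, "B": 40}
distance = {
    "g1": {"A": 1, "B": 5},
    "g2": {"A": 2, "B": 3},
    "g3": {"A": 1, "B": 4},
}
plan = best_evacuation(sizes, capacity, distance)
```

Every group is closest to shelter A, yet sending all of them there would exceed its capacity, so the cheapest plan diverts `g2` to shelter B.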
Coalition Formation with Spatial and Temporal Constraints Problems.In a Coalition Formation with Spatial and Temporal Constraints (CFST) problem [156,154,166], ambulance and fire brigade agents cooperate in order to react efficiently to an emergency scenario so as to rescue victims and extinguish fires located in different parts of a city.Agents can travel from one location to another in a given time.Each task (i.e., rescuing a victim, extinguishing a fire) has a deadline, representing the time until which the victim will survive, and a workload,  denoting the amount of time necessary to rescue the victim or put out the fire.The locations of the victims and the fires may be unknown to the agents, and need to be discovered at run time, which requires agents to dynamically update the sequence of the tasks they will attempt, taking account of two main constraints: (1) Spatial constraints, which model where an agent can travel and at what time; and (2) Temporal constraints, which model task deadlines and completion times.Agents may also form coalitions to execute a given task faster or if the requirements of a given task cannot be met by a single agent.Hence, the agents' arrival times at each task need to be coordinated in order to form the desired coalition.The objective is to maximize the number of tasks to be completed.A DCOP formalization for the CFST is described in [156], where ambulances and fire brigades are modeled as DCOP agents.Each agent controls a variable that encodes the current task that the agent will attempt, and whose domain represents task locations.Unary constraints restrict the set of reachable locations, according to distance from a destination and the victim's deadline.Agents' coalitions are defined as groups of agents traveling to the same location.Each task is associated to a utility, which encodes the success for a coalition to complete such task.The goal is to find an assignment of agents to tasks that maximizes the overall utility.

Radio Frequency Allocation Problems
The performance of a wireless local area network (WLAN) depends on the channel assignments among neighboring access points (APs). Transmissions from neighboring APs on the same channel or on adjacent channels degrade network performance due to interference. Typically, in dense urban areas, APs may belong to different administrative domains, whose control is delegated to different entities. Thus, a distributed approach to the channel assignment is necessary.
Cooperative Channel Assignment Problems. In a cooperative channel assignment problem [69,124,125], APs need to be configured in order to reduce the overall interference between simultaneous transmissions on neighboring channels. In [124,125], a DCOP-based approach is proposed for cooperative channel assignment in WLANs where APs may belong to different administration entities. In the proposed model, each AP is represented by a DCOP agent, which controls a decision variable modeling a choice for the AP's channels. The signal-to-interference-and-noise ratio (SINR) perceived by an AP or a client is modeled as a reward function that accounts for the concurrent transmissions occurring in the same channel and in partially overlapping adjacent channels. The goal is to find an assignment of channels to APs that minimizes the total interference experienced in the WLAN.
In [184,120], the authors study dynamic solutions to the problem of allocating and utilizing the wireless network's available spectrum. In such a problem, the agents operate in a dynamic radio frequency environment that is composed of time-varying interference sources, which are periodically sampled and measured.

Recommendation Systems
Recommendation systems are tools that provide users with information about items, tailored to the users' characteristics and preferences.
Group Recommendation Problems. Unlike individual recommendations, group recommendations need to take into account the preferences of all group members and formulate a recommendation that suits the whole group. In [98], the authors proposed a DCOP-based travel package recommendation system for groups of users. The users in the group share a common goal (the travel package recommendation), and have individual preferences for each travel service (hotel, flight companies, tour operators, etc.). In such a problem, the objective is to find a group recommendation that optimizes the users' preferences. The proposed DCOP solution is composed of two types of agents: user agents and recommender agents. Each user agent controls a decision variable that models that user's travel choices, while each recommender agent controls a decision variable that models a travel service supplier's recommendations. User travel preferences are modeled via unary constraints on the user agents' decision variables. Binary constraints between user agents in a group and their associated recommender agent ensure that each user's choice in the group is compatible with the recommendation of the recommender agent. The goal is to find the best recommendation for the entire group.

Scheduling Problems
Scheduling problems are an important class of problems, and they have long been studied in the areas of Constraint Programming and Operations Research [52,165,118,66]. In such problems, time schedules are to be associated to resource usage. The problem is made particularly difficult when the scheduling process needs to be coordinated in a distributed manner across several entities. In such a context, many scheduling problems can be naturally mapped to DCOPs.
Distributed Meeting Scheduling Problems. The distributed meeting scheduling problem captures generic scheduling problems where one wishes to schedule a set of events within a time range [47,73]. Each event is defined by: (i) the resources required for it to be completed, (ii) the time required for it to be completed, within which it holds the required resources, and (iii) the cost of using such resources at a given time. A scheduling conflict occurs if two events with at least one common resource are scheduled in overlapping time slots. The goal is to maximize the utilities over all the resources, defined as the net gain between the opportunity benefit and opportunity cost of scheduling various events.
In [104], the authors discuss three possible DCOP formulations for this problem: time slots as variables (TSAV), events as variables (EAV), and private events as variables (PEAV). We describe the EAV formulation and refer the reader to the original article for the other two formulations and additional details. In the EAV formulation, events are considered as decision variables. Each variable can take on a value from the time slot range that is sufficiently early to schedule the required resources for the required amount of time, or zero to denote that an event is not scheduled. If a variable takes on a non-zero value, then none of its required resources can be assigned to any other overlapping event.
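The EAV encoding can be sketched as follows, assuming time slots numbered from 1, the value 0 for an unscheduled event, and a fixed value gained for each scheduled event. These conventions, the function name, and the data are illustrative assumptions rather than the exact model of [104]; the enumeration stands in for a DCOP solver.

```python
from itertools import product


def eav_schedule(events, horizon):
    """events: name -> (duration, set of resources, value); slots are 1..horizon."""
    names = list(events)

    def domain(name):
        duration = events[name][0]
        # 0 means "not scheduled"; other values are start slots that fit.
        return [0] + list(range(1, horizon - duration + 2))

    def feasible(assign):
        scheduled = [(e, s) for e, s in assign.items() if s != 0]
        for i, (e1, s1) in enumerate(scheduled):
            for e2, s2 in scheduled[i + 1:]:
                d1, r1, _ = events[e1]
                d2, r2, _ = events[e2]
                overlap = s1 < s2 + d2 and s2 < s1 + d1
                if overlap and r1 & r2:  # shared resource in overlapping slots
                    return False
        return True

    best, best_value = None, -1
    for choice in product(*(domain(e) for e in names)):
        assign = dict(zip(names, choice))
        if feasible(assign):
            value = sum(events[e][2] for e, s in assign.items() if s != 0)
            if value > best_value:
                best, best_value = assign, value
    return best, best_value


events = {
    "m1": (2, {"alice", "bob"}, 5),
    "m2": (1, {"bob"}, 3),
    "m3": (1, {"carol"}, 2),
}
schedule, value = eav_schedule(events, horizon=2)
```

With a two-slot horizon, `m1` occupies both slots and shares `bob` with `m2`, so the best schedule leaves `m2` unscheduled (value 0) and gains a total value of 7.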
Water Allocation Scheduling Problems.The management of water resources in large-scale systems is often associated with multiple institutionally-independent decision makers (DMs), which may represent different and conflicting interests, such as flood prevention, hydropower production, and water supply [53].The aim of such problems is to find an efficient use of water allocation and distribution according to the different users' interests.
In [53], the authors formalize a regulatory mechanism in water management as a DCOP.The model involves several active human agents and passive ecological agents.Each agent is associated with an objective function that it seeks to maximize.Active agents make decisions about the amount of water to divert from the river or to be released from a dam in order to maximize their corresponding objective functions.Passive agents, on the other hand, represent ecological interests through their associated objective functions and do not make decisions.The agents model different water supplies for cities and agricultural districts, hydropower productions, and ecological preservation.The goal is to optimize the agents' objective functions, satisfying hard (physical) constraints and maximizing the soft (normative) constraints, which aim at protecting the interests of the passive agents.A solution to such problem, which makes use of a multi-objective DCOP formalization, is presented in [8].
Patient Scheduling Problems.Medical appointments scheduling problems are related to meeting scheduling problems, as they need to associate patients to resources (e.g., doctors, medical machinery) and times, but they require different types of constraints.Patients may require several services from different departments within the same hospital or in multiple hospitals.In general, the objective is to minimize the patient treatment waiting time under limited resource conditions, as well as ensuring efficient resource usage, taking into account patient preferences.
In [64], the authors formulate the problem of scheduling patients to diagnostic units in a hospital as a DCOP, where appointments are modeled as variables, whose domains describe times, durations, and locations. The constraints of the problem model the schedule feasibility, the patient preferences over hospitalization times, the workplace constraints, which restrict the types of appointment for a given workplace, and the diagnostic unit constraints, which model resource usage.
In [24,23], the authors propose a Dynamic DCOP model for a radiotherapy patients scheduling problem.In this problem, each agent represents a patient, and it controls variables that represent private information (e.g., type of tumor, number of radiation doses per day, the use of chemotherapy) and public information (e.g., current schedule of the radiotherapy machine).The constraints of the problem model the duration of each daily treatment, as well as tumor-specific treatment restrictions.The problem objective considers patient waiting times to receive their treatment, patient priorities (based on tumor aggressiveness), and patient preferences.

Sensor Network Problems
Sensor networks typically consist of a large number of inexpensive and autonomous sensor nodes, constrained by a limited communication range and battery life.These networks have been deployed for environmental sensing (temperature, humidity, etc.), military applications (e.g., battlefield surveillance), and target tracking [6].When deploying sensor networks, it may not be possible to pre-determine the position of each sensor node.The distributed nature of the problem and the presence of several communication and sensing constraints create a natural fit for DCOPs to solve a wide range of related applications.
Target Tracking Problems. In a typical target tracking application [197,108,71,136,168,70], a collection of small Doppler sensors is scattered in an area to detect possible moving targets in the region. Each sensor is battery-powered, can communicate with other sensors through radio communication, and can scan and detect an object within a fixed range. Communication incurs an energy cost. Thus, to save energy, a sensor may turn itself off. Multiple sensors may be necessary to detect a single target with high accuracy. The overall objective is to maximize the number of targets detected, as quickly as possible and, at the same time, preserve energy so as to prolong the system's lifetime. In [197], a simplified version of the above problem is viewed as a weighted graph coloring problem, where the total weight of violated constraints needs to be minimized. A node corresponds to a sensor, an edge between two nodes represents the constraint of a shared region between agents, and the weight captures the importance of the common region. The size of the common region reflects the amount of energy loss when two sensors scan the shared region at the same time. Each color corresponds to a time slot in which a sector is scanned. A node must have at least one color so that the corresponding sector is scanned at least once. This graph coloring problem is mapped to a DCOP, where agents represent nodes, the agents' variables represent the agents' decisions on their colors, and reward functions represent the graph edges.
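The weighted graph coloring view can be sketched with a simple sequential local search, in which each node in turn picks the time slot that minimizes the weight of its violated edges. This is a centralized stand-in chosen for brevity, not the algorithm of [197]; the function names and the example graph are illustrative assumptions.

```python
def violated_weight(colors, edges):
    """Total weight of edges whose endpoints scan in the same time slot."""
    return sum(w for u, v, w in edges if colors[u] == colors[v])


def local_search(nodes, edges, n_slots):
    """Greedy sequential improvement; stops when no node can lower its cost."""
    colors = {n: 0 for n in nodes}
    improved = True
    while improved:
        improved = False
        for n in nodes:
            def local_cost(c):
                # Weight of edges incident to n that would be violated with slot c.
                return sum(
                    w for u, v, w in edges
                    if (u == n and colors[v] == c) or (v == n and colors[u] == c)
                )

            best = min(range(n_slots), key=local_cost)
            if local_cost(best) < local_cost(colors[n]):
                colors[n] = best
                improved = True
    return colors


nodes = ["a", "b", "c"]
edges = [("a", "b", 3), ("b", "c", 2), ("a", "c", 1)]
colors = local_search(nodes, edges, n_slots=2)
```

Each accepted move strictly lowers the total violated weight, so the loop terminates. On this triangle with two slots, one edge must be violated, and the search ends with only the lightest edge in conflict.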
In [70], the authors use a hierarchical DCOP approach to scale to larger problems.The authors partition the original problem into n local regions, and use n DCOPs to solve the smaller subproblems.Their solutions are then combined in a hierarchical approach, solved by a DCOP that encompasses variables and constraints shared among the connected regions of the lower hierarchy DCOPs.
Robotic Network Optimization Problems.The robotic network optimization problem describes a sensor network problem where sensors are placed on top of robots that have limited movement capability.In such a problem, robots can make small movements to optimize the wireless connectivity with their neighbors, without affecting the network topology [37,71].
In [71], the authors propose a DCOP formulation where each robot is represented by an agent. Each agent controls one variable describing the decision on the robot's possible movements. Thus, the variables' domains consist of the valid positions the agent can move to. The reward functions of the problem model the power gain (or loss, depending on the optimization criterion) of the wireless link between a transmitter and a receiver robot, and depend on their positions. Radio communication in wireless sensor networks has a predictable signal strength loss that is roughly inversely proportional to the square of the distance between transmitter and receiver. However, radio wave interference is very difficult to predict [122]. Thus, in [71], the authors use a P-DCOP-based approach with partial agent knowledge to capture the robot's partial knowledge on its reward functions, and to balance exploration of the unknown rewards and exploitation of the known portion of the rewards.
Mobile Sensor Team Problem. The Mobile Sensor Team (MST) problem is similar to the target tracking problem with the difference that agents are capable of moving autonomously within the environment and that time is modeled explicitly as a discrete sequence of time steps. In an MST, agents are placed on a grid. For an agent $a_i$, $cur\_pos_i$ denotes the agent's current position; $SR_i$ denotes the agent's perception sensing range, which determines the coverage range within which an agent can detect targets; $MR_i$ denotes the agent's mobility range, which defines the maximum distance that the agent can move within a single time step; and $cred_i$ denotes the agent's credibility, which reflects the likelihood of the correctness of the detected targets. The targets are defined implicitly through an environmental requirement (ER) function, which defines, for each point in space, the minimum joint credibility value (the sum of the credibility variables) required for that point to be sensed. In such a representation, targets are points $p$ with $ER(p) > 0$. Given a set of agents $SR_p$ whose sensing range covers a target $p$, the remaining coverage requirement of $p$ is the environmental requirement diminished by the joint credibility of the agents currently covering $p$: $Cur\_REQ(p) = \max\{0, ER(p) \ominus \sum_{a_i \in SR_p} cred_i\}$, where $\ominus : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is an operator that defines how the environmental requirement decreases by the joint credibility. The goal of the agents is to find positions that minimize the values of $Cur\_REQ$ for all targets.
MST problems are modeled through a subclass of dynamic DCOPs, named DCOP_MST [198,187,186]. Each agent $a_i$ controls one variable $x_i$ representing its position, and whose domain contains all locations within $MR_i$ of $cur\_pos_i$. Thus, the domains are updated each time the agent moves. The constraint $C_p$ of a target $p$ involves exclusively those agents $a_i$ whose variable's domain includes a location within $SR_i$ of $p$. Thus, at each time step, both domains and constraints may change. As a consequence, the constraint graph changes as well: the neighbors of each agent have to be updated at each time step. Finally, in a DCOP_MST two agents are neighbors if their sensing areas overlap.
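The remaining coverage requirement of a target can be computed directly from the definitions above. The sketch below assumes Euclidean distance on the grid and uses plain subtraction as a default for the operator that reduces the requirement by the joint credibility; both choices, as well as the dictionary layout of an agent, are illustrative assumptions.

```python
import math


def cur_req(target, er, agents, diminish=lambda req, cred: req - cred):
    """Remaining coverage requirement of a target point.

    target:   (x, y) position of the target
    er:       environmental requirement ER(p) of the target
    agents:   list of dicts with keys "pos", "sr" (sensing range), "cred"
    diminish: operator reducing the requirement by the joint credibility
              (subtraction here, as an assumed default)
    """
    covering = [a for a in agents if math.dist(a["pos"], target) <= a["sr"]]
    joint_credibility = sum(a["cred"] for a in covering)
    return max(0, diminish(er, joint_credibility))


agents = [
    {"pos": (0, 0), "sr": 2, "cred": 3},
    {"pos": (5, 5), "sr": 1, "cred": 4},
    {"pos": (1, 1), "sr": 2, "cred": 2},
]
remaining = cur_req(target=(1, 0), er=7, agents=agents)
```

Only the first and third agents cover the target, so their joint credibility of 5 leaves a remaining requirement of 2. Moving the second agent within range would bring it to 0.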
Sensor Sleep Scheduling Problem. Wireless sensor nodes are equipped with a radio, which can be used to communicate with neighboring nodes, and a limited power source. These sensor nodes are often deployed in inaccessible terrains, thus replacing their power sources may not be possible. The wireless sensor sleep scheduling problem aims at switching on/off a particular sensor node component (such as the sensor or the radio) for a certain period of time, so as to ensure power conservation, maximizing the lifetime of the sensor network.
In [32], the authors propose a DCOP model for this problem, where each sensor is an agent whose variables denote its status (on or off) for each time step. Hard constraints are employed to enforce that if a sensor is on, then all its neighbors should be off, and that sensors cannot stay on for two consecutive time steps. The overall objective is to minimize the delay induced in the network.
A similar problem is solved in [168], where sensors are also able to harvest energy from the environment (e.g. using a photo-voltaic cell or vibration-harvesting microgenerators).In such a context, the goal is to find a schedule that maximizes the probability of detecting events while maintaining energy neutral operations (that is, exhibit an indefinite lifetime for each of the agents).

Service-Oriented Computing Problems
The service-oriented computing paradigm is one that relies on sharing resources over a network, focusing on maximizing the effectiveness of the shared resources, which are used by multiple applications. Efficient solutions with optimal use of resources are crucial in this paradigm and have a wide industrial impact [126]. The distributed nature of the resources and the privacy concerns arising when different clouds are involved in the deployment make DCOPs appealing to solve a range of problems in this paradigm [115].
Application Component Placement Problems. An application component placement (ACP) problem is defined over a network of servers offering storage devices with various capabilities, and component-based applications with requirements for processing, communication, and storage [74,97]. The ACP problem is the problem of deciding which server to assign to each application component. A component-based application is described by a set of characteristics that establish its requirements in terms of hardware (e.g., CPU speed, storage capacity) as well as constraints between components of the same application (e.g., minimum bandwidth, secure communication channel requirement). When the ACP involves deployment on multiple clouds, data privacy must be preserved. Additionally, in cloud environments computing resources are shared by many applications and the infrastructure is dynamically changing, making centralized solutions infeasible.
In [74], the authors propose a DCOP model for the ACP problem where servers bid for a component to host, with an emphasis that is proportional to the affinity between the server characteristics and the component hardware and software requirements. Each server is modeled by an agent, which controls a decision variable representing the server bids. Thus, the domain of each variable is the set of possible components that may be deployed on the server. Unary functions express the utility for each component. Hard constraints are employed to ensure that each component is deployed on exactly one server, and that two communicating components are placed on servers that satisfy the required communication bandwidth. The objective is to find a feasible assignment of components to servers that maximizes the utilities.
Server Allocation Problems. Service-oriented middleware networks are composed of entities that can both provide and require multiple services, connected within a physical network. In turn, each service can be provided by multiple servers and can serve multiple clients. A service request from a given client takes into account various quality of service (QoS) parameters (e.g., service response time, service completion time). When a client generates a service request, it can be satisfied by any of the servers offering such a service. The server allocation problem is the problem of selecting servers to allocate services, ensuring maximum social welfare, while meeting the QoS requirements of all clients.
In [36], the authors present a DCOP model for this problem where agents correspond to network entities, variables correspond to services, and their values are either 1, if the associated agent is willing to provide/forward the service, or 0, otherwise. Clients' service requests are mapped to servers' service offers, accounting for the delays that occur when traversing intermediate nodes using a multicast routing protocol [185]. Moreover, in order to provide a service, all requested QoS requirements need to be satisfied. The utility associated to each variable is the combination of the utility for such a node when acting as a service consumer, a service provider, and a service forwarder, and depends on several parameters, such as available CPU cycles, battery power, memory, and bandwidth. The problem may change dynamically when a new request is made, or when a new service is offered or released, and as such can be modeled as a Dynamic DCOP.

Smart Grid Problems
The smart grid is a vision of the future electricity grid, where bidirectional flow of both electricity and information, in an automated fashion, improves efficiency and reliability of energy production and distribution.The development of smart grids poses several challenges: (1) How to deal with the increasing electric grid utilization due to growth of loads, such as electric vehicles (EVs) and heat pumps; (2) How to efficiently integrate a diverse range of energy sources, including intermittent generators, into the power network; and (3) How to deal with the uncertainty in the equipment as well as in the participation of consumers through demand-side technologies.Due to the distributed and dynamic nature of loads and generators participating in the smart grid, agent-based decentralized autonomous control of smaller distributed microgrids is a very compelling solution [38,157].In particular, several solutions based on agent-based decentralized optimization have been explored to deliver this vision [117,109,72,84].The following is a list of the most prominent DCOP approaches for smart grid applications.
Economic Dispatch Problems. The Economic Dispatch (ED) problem is the problem of coordinating the various settings of the power generators in order to meet the power loads with the lowest cost possible, while satisfying the physical network constraints [182]. Researchers have cast this problem as a DCOP [117,58,10,72], considering a network of nodes (agents), each of which relays power to other nodes, but can also contain a combination of generators and loads. Generators are distributed across nodes, and are represented through variables whose domains describe a certain set of discrete power outputs. The distribution cables connecting nodes of the network are also associated to DCOP variables, each of which has a thermal capacity describing the maximum power that it can safely carry. The DCOP function describes a particular optimization criterion, such as minimizing the carbon emissions of generators within the network, and imposes load and network constraints. In particular, the constraints ensure that the overall demand and supply are in balance and that the thermal capacity constraints of the cables are satisfied.
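As a toy instance of this formulation, the sketch below considers a hypothetical two-node network: generator `g1` sits at node 1, while generator `g2` and the only load sit at node 2, so all of `g1`'s output crosses the single cable. The topology, the linear cost model, and all names are assumptions made for illustration; enumeration stands in for a DCOP algorithm.

```python
from itertools import product


def dispatch(outputs, unit_cost, load, line_capacity):
    """Cheapest discrete dispatch for the assumed two-node network.

    outputs:       generator -> list of discrete power outputs
    unit_cost:     generator -> cost per unit of power
    load:          demand at node 2
    line_capacity: thermal capacity of the cable from node 1 to node 2
    """
    best = None
    for p1, p2 in product(outputs["g1"], outputs["g2"]):
        if p1 + p2 != load:      # demand and supply must balance
            continue
        if p1 > line_capacity:   # g1's power must fit through the cable
            continue
        cost = unit_cost["g1"] * p1 + unit_cost["g2"] * p2
        if best is None or cost < best[0]:
            best = (cost, {"g1": p1, "g2": p2})
    return best


outputs = {"g1": [0, 10, 20, 30], "g2": [0, 10, 20, 30]}
unit_cost = {"g1": 1, "g2": 3}
plan = dispatch(outputs, unit_cost, load=30, line_capacity=20)
```

The cheap generator alone could cover the load, but the cable capacity of 20 forces the expensive local generator to supply the remaining 10 units, for a total cost of 50.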
Power Supply Restoration Problems. After (multiple) line failures, a power grid must be reconfigured to ensure restoration of power supply. A power distribution network is a network of power lines connected by switching devices (SDs) and fed by circuit breakers (CBs). SDs are analogous to sinks (transformer stations), while CBs are analogous to power sources. Both of these devices can operate in two states: open or closed. Closed SDs consume some power and forward the rest of it on other lines. Open SDs stop power flow. CBs feed the network when they are closed. The configuration of the devices' states is such that the energy flow traversing CBs takes the form of a (feeder) tree, and that no SD is powered by more than one power line. Flow conservation and transmission line capacity constraints must be enforced. The power supply restoration problem is the problem of finding a configuration that ensures power restoration for the maximum number of sinks affected by the line failures.
Researchers have proposed a DCOP formulation for this problem in [84,4]. In such a framework, each node of the distribution network is controlled by an agent, which owns all variables and constraints corresponding to that node. Two DCOP variables are associated with each network node: a load variable and a direction variable. Load variables model the amount of incoming flow for sink nodes, and the number of sinks fed for power source nodes. Direction variables model all the possibilities of feeding a node, as the set of possible configurations in which its neighboring nodes can forward power to it. The acyclicity of the power flow and the flow conservation are modeled as constraints. The former restricts the power path to be a tree and also defines the optimization criterion. The latter enforces Kirchhoff's law: the amount of incoming power flow to a node i must equal the sum of the power consumed at i and the amount of power forwarded to other nodes.
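The flow conservation constraint stated above can be written as a simple local check. The sketch below is only an illustration under assumed data structures: flows is a hypothetical dictionary mapping directed edges (source, destination) to the power they carry, and consumption maps each node to the power it consumes.

```python
def conserves_flow(node, flows, consumption, tol=1e-9):
    """Check Kirchhoff's law at one node: incoming = consumed + forwarded."""
    incoming = sum(f for (src, dst), f in flows.items() if dst == node)
    forwarded = sum(f for (src, dst), f in flows.items() if src == node)
    return abs(incoming - (consumption[node] + forwarded)) <= tol

# A circuit breaker "cb" feeds node "a", which forwards part of the power to "b".
flows = {("cb", "a"): 5.0, ("a", "b"): 2.0}
consumption = {"a": 3.0, "b": 2.0}
ok_a = conserves_flow("a", flows, consumption)
ok_b = conserves_flow("b", flows, consumption)
```

In a DCOP encoding, each agent would evaluate such a check only over the variables of its own node and its neighbors, so the constraint remains local.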

Microgrid Islanding Problems. A microgrid islanding problem is the problem of creating islands (i.e., clusters of generator units and loads able to operate without external energy supply) in response to major power outages and blackouts. In [59], the authors formalized this problem as a DCOP, where agents represent nodes in the network, and each agent has its own power generation and power consumption capabilities. Variables represent the amount of power that an agent generates and consumes, as well as transmission line flows and switch status between network nodes. Flow variables are constrained by their maximum transmission line capacities, while switches are modeled as binary variables that can be turned on or off. Flow conservation is modeled through constraints that enforce Kirchhoff's law. The goal is to find a switching configuration that minimizes the unserved load of the system.
Prosumer Energy Trading Problems. In its more general form, a smart grid is populated by prosumers capable of both generating and consuming resources. The prosumer energy trading problem aims at setting market-based prices for prosumers to trade directly over the smart grid, while taking the grid constraints into account. This problem has been cast as an optimization problem called the energy allocation problem: given a graph whose nodes represent prosumers and whose edges describe transmission lines connecting adjacent prosumers, the goal is to find the allocation that maximizes the benefits of all the prosumers while satisfying the capacity constraints of the network.
In [31], the authors propose a DCOP formalization for this problem, where each prosumer is an agent. Variables are associated with the edges (i, j) of the energy trading network and describe the number of units of energy that prosumer i sells to/buys from prosumer j. Thus, two variables (i, j) and (j, i) are associated with each edge of the network. For each prosumer, an energy balance constraint models the utility of a given instantiation of its offers as the sum of the offers associated with the energy traded with each of its neighbors. Line capacity and flow conservation constraints ensure that the energy traded along the transmission lines is within their maximum capacity and is consistent with Kirchhoff's law.

Supply Chain Management Problems
The management of large businesses involves the management of the flow of goods from suppliers to customers. This flow of goods is called a supply chain. Supply chains have to be carefully managed to ensure that a sufficient quantity of raw material is available at factories for production and a sufficient quantity of processed goods is available at stores for consumers to buy. Additionally, since goods can be purchased from different producers and sold to different consumers, there is also the need to consider how much of each good to buy/sell and whom to buy it from/sell it to. In such an environment, information, decision making, and control are inherently decentralized.
Supply Chain Formation Problems. The supply chain formation problem is the problem of determining the participants in a supply chain, who will exchange what, with whom, and the terms of the exchange. Several DCOP-based approaches for this problem have been proposed in the literature [48,142,143,181,144]. In general, they rely on the notion of a Task Dependency Network (TDN) introduced by Walsh and Wellman [179], which is a graph-based representation that captures dependencies among production processes. A TDN is a bipartite directed acyclic graph representing exchanges, where nodes in the graph correspond to producers and consumers and edges in the graph correspond to feasible exchanges of goods between the producers and consumers. A path from potential producers to consumers defines a feasible supply chain configuration, and the goal is to find the feasible configuration that optimizes a particular reward function. A DCOP encoding for this problem models producers and consumers as agents, each of which controls a variable describing the agent's decision regarding whom to buy from/sell to as well as the associated quantities and costs. Cost functions are associated with each edge of the TDN, encoding the willingness of the producer and consumer to trade with each other.
A dynamic version of the above formalization has been investigated in [35]. Such a model allows for the entry and departure of producers and consumers as well as changes in properties of the problem (e.g., prices of goods, production capacity of producers, and consumption requirements of consumers).

Traffic Flow Control Problems
A challenge posed by the increase in transportation demand is to enforce traffic flow control using the existing infrastructure, such as traffic lights, loop detectors, and cameras. Coordinating the actions of these individual devices aims at smoothing the traffic flow at the network level. Such coordinated actions often generate coherent traffic control plans faster and more accurately than a human traffic operator [174]. Due to the distributed nature of such devices, multi-agent solutions are particularly suitable for this class of problems.
Traffic Light Synchronization Problem. This is the problem of finding a synchronization schema for the traffic lights in adjacent intersections that creates green waves, which are waves of vehicles that travel at a given speed and are able to cross multiple intersections without stopping at red lights.
In [135,75], the authors model this problem as a DCOP, where agents represent traffic lights, each controlling one variable that models the coordination direction for the associated traffic signal. Thus, the domain of the variables is given by two possible directions of coordination (north-south/south-north, east-west/west-east). Conflicts that arise when two neighboring traffic signals choose different directions are modeled as constraints. Cost functions are defined to model the number of incoming vehicles in a junction, and the costs are influenced by whether two adjacent agents agree on a direction of coordination or not. The goal is to minimize the global cost. Due to the dynamic nature of the problem, the proposed algorithms fall under the umbrella of D-DCOP algorithms.
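To make the structure of this encoding concrete, the following toy sketch builds a two-intersection instance and finds the minimum global cost by enumeration. The cost model is an assumption made for illustration, not the one used in [135,75]: each light pays the number of vehicles waiting on the direction it does not serve, and each pair of neighboring lights pays a fixed penalty when their directions disagree. All names (global_cost, solve, incoming, penalty) are hypothetical.

```python
from itertools import product

DIRECTIONS = ("NS", "EW")  # the two coordination directions in each domain

def global_cost(assignment, incoming, neighbors, penalty):
    """Sum of unary costs (unserved vehicles) and binary costs (disagreements)."""
    cost = 0
    for light, direction in assignment.items():
        unserved = "EW" if direction == "NS" else "NS"
        cost += incoming[light][unserved]
    for a, b in neighbors:
        if assignment[a] != assignment[b]:
            cost += penalty
    return cost

def solve(incoming, neighbors, penalty):
    """Centralized exhaustive search; a DCOP algorithm would distribute this."""
    lights = sorted(incoming)
    candidates = (dict(zip(lights, values))
                  for values in product(DIRECTIONS, repeat=len(lights)))
    best = min(candidates,
               key=lambda a: global_cost(a, incoming, neighbors, penalty))
    return best, global_cost(best, incoming, neighbors, penalty)

incoming = {"A": {"NS": 10, "EW": 2}, "B": {"NS": 4, "EW": 5}}
neighbors = [("A", "B")]
coordinated = solve(incoming, neighbors, penalty=3)
independent = solve(incoming, neighbors, penalty=0)
```

With a positive disagreement penalty, light B gives up its locally preferred direction so that both lights coordinate, which mirrors how the binary constraints push neighboring agents toward a common green wave.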
In [153,28], the authors propose a P-DCOP-based approach with partial agent knowledge to solve the traffic light synchronization problem in a context where agents have partial information about their reward functions. In particular, the authors argue that traffic patterns may vary over time and, thus, the agents should learn them to update their reward functions. In this context, agents are given a limited amount of time to explore different signal duration intervals, whose associated rewards are learned by evaluating the average travel time of the first 100 cars traveling across the agent's traffic light.

Analyses and Perspectives on DCOPs
DCOPs have emerged as a popular formalism for distributed reasoning and coordination in multi-agent systems. They provide an elegant modeling framework, which offers flexibility in both the agents' reasoning and coordination strategies. Their ability to support the notions of preferences and constraints makes them suitable to model a variety of multi-agent optimization problems. Preferences are a central concept of decision making and arise in a multitude of contexts in multi-agent systems. They are fundamental for the analysis of human choice behavior, and allow agents to express their inclinations through specific actions and behaviors. Constraints have long been studied in centralized systems [161] and have proved especially practical and efficient for modeling and solving resource allocation and scheduling problems. They are naturally handled within DCOPs and offer a flexible and effective means to model a variety of complex problems. In addition, DCOPs support several aspects that are crucial in multi-agent systems, such as agent privacy, autonomy in reasoning, and cooperation.
The classical DCOP notion is unable to capture important aspects of the problem related to the characteristics of the environment, such as partial observability, environment evolution, and uncertainty. Therefore, several DCOP model extensions have recently been presented. Each DCOP model imposes distinctive algorithmic requirements, which concern both the agents' reasoning and cooperation aspects. These requirements, in turn, are strictly related to the characteristics of the problem domain. Due to the performance variability imposed by such requirements, an appropriate selection of the DCOP model and algorithm is essential to obtain desirable performance in realistic application domains.

Comparative Overview of DCOP Models
Classical DCOPs can be used to represent a wide range of MAS applications where agents in a team need to work cooperatively to achieve a single goal in a static, deterministic, and fully observable environment. Exploring the structural properties of the domain, as well as understanding the requirements of the problem designer, is crucial to design and apply effective DCOP algorithms. When an optimal solution is required, a complete algorithm can be used to solve the problem. However, if particular assumptions can be made on the problem structure, more efficient solutions can be adopted. For instance, if the constraint graph of the DCOP is always a tree, then an incomplete inference-based algorithm, like Max-Sum, is sufficient to guarantee the optimality of the solution found. Complete algorithms are often unsuitable for tackling large-scale problems, due to their exponential requirements in time or memory. In contrast, incomplete algorithms are more appropriate to rapidly find solutions, at the cost of sacrificing optimality. The communication requirements also need to be taken into account. For example, when communication is unreliable, it is not recommended to employ search-based solutions, such as ADOPT or AFB, where communication requirements are exponential in the size of the problem. In contrast, inference-based algorithms are more reliable in the presence of uncertain communication networks as they, in general, require only a linear number of messages to complete their computations.
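To illustrate why inference can be exact on tree-structured constraint graphs, the following minimal sketch runs min-sum dynamic programming along a chain (the simplest tree) and compares it with exhaustive search. It is a centralized illustration of the underlying inference principle, not an implementation of Max-Sum; the function names and the cost tables are hypothetical.

```python
from itertools import product

def chain_min_cost(domain, costs):
    """Minimum total cost on a chain x_0 - x_1 - ... - x_n.

    costs[i][(a, b)] is the cost of assigning x_i = a and x_{i+1} = b.
    One message (a table over the domain) is passed per edge, so the
    number of messages is linear in the number of variables.
    """
    # message[b]: best cost of the chain prefix given the current variable = b
    message = {value: 0 for value in domain}
    for table in costs:
        message = {b: min(message[a] + table[(a, b)] for a in domain)
                   for b in domain}
    return min(message.values())

def brute_force(domain, costs):
    """Exhaustive search over all assignments, exponential in chain length."""
    n = len(costs) + 1
    return min(sum(costs[i][(x[i], x[i + 1])] for i in range(n - 1))
               for x in product(domain, repeat=n))

domain = (0, 1)
costs = [
    {(0, 0): 5, (0, 1): 1, (1, 0): 2, (1, 1): 4},
    {(0, 0): 3, (0, 1): 6, (1, 0): 7, (1, 1): 0},
]
by_inference = chain_min_cost(domain, costs)
by_search = brute_force(domain, costs)
```

Both procedures return the same optimum on this instance, but the inference pass touches each edge once, whereas the search enumerates every joint assignment.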
Asymmetric DCOPs are suitable when constrained agents receive different rewards for a joint action, a situation that arises especially in scenarios where privacy is a particular concern and agents cannot reveal to the other agents the rewards associated with their putative actions. Examples of problems that can be suitably modeled as Asymmetric DCOPs are resource allocation problems where agents may have different gains from using the same resource, and where their preferences and constraints regarding usage time slots and durations are expected to be different. Asymmetric DCOPs are particularly attractive to model those domains that can be represented as graphical games [76], and where constraint reasoning could be actively used to exploit problem structure. In a graphical game, the utilities of each agent are affected exclusively by its neighboring agents. It is important to note that, even though the Asymmetric DCOP model bears similarities with many game-theoretic approaches, these two models are fundamentally different. While game-theoretic agents are self-interested, and their non-cooperative actions lead to a desirable global target, Asymmetric DCOP agents are cooperative and seek to maximize the global reward even at the expense of smaller local rewards.
Multi-Objective DCOPs (MO-DCOPs) are tailored to represent those classes of distributed problems that cannot be modeled with a single optimization function. The observations above for classical DCOPs still hold for MO-DCOPs and, in addition, agents cooperate to optimize multiple objectives. A number of MAS applications fall in the category of multi-objective optimization. One example is that of disaster management and coordination problems, where agents coordinate to effectively respond to an emergency scenario (see Section 8.1). Due to memory requirements, which are proportional to the number of objectives and the size of the Pareto set, incomplete strategies seem particularly promising for this research area.
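The Pareto set mentioned above can be illustrated with a short sketch that filters a list of cost vectors down to the non-dominated ones. It assumes that every objective is to be minimized; the function names and the sample vectors are hypothetical, and the quadratic filter is chosen for clarity rather than efficiency.

```python
def dominates(u, v):
    """u dominates v if u is no worse on every objective and better on one."""
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))

def pareto_front(points):
    """Keep the cost vectors that no other vector dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Each tuple is the (objective 1, objective 2) cost of one complete solution.
solutions = [(1, 5), (2, 2), (3, 3), (5, 1), (2, 6)]
front = pareto_front(solutions)
```

The front grows with the number of mutually incomparable solutions, which is the source of the memory requirements noted in the text.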
Dynamic DCOPs capture the dynamic behavior of the evolving environment in which agents act. Dynamic environments play a fundamental role in real-world MAS applications. Virtually all complex MAS applications involve dynamic situations, which may restructure the network topology due to agent movements, or bring additional information to the problem being solved. For example, in a search and rescue operation during disaster management, as the environment evolves over time, new information becomes available about the civilians to be rescued and new agencies may arrive at any time to help conduct rescue operations. In a smart grid domain, real-time pricing is commonly enforced and, thus, agent preferences need to be adapted over time as energy costs are updated (see Section 8.7). Dynamic DCOPs are therefore a modern and exciting area for further research.
Probabilistic DCOPs extend the classical DCOP model to include the capability of handling uncertain events, allowing DCOP agents to handle a wider range of applications. In particular, Probabilistic DCOPs are suitable to capture those applications characterized by a static environment evolution with exogenous uncertain events (e.g., when the actions of agents on the environment can have different outcomes, based on external, uncertain factors) and in which, nevertheless, agents have total knowledge of their own actions and of the observable environment and act in a fully cooperative context. The domain of multi-agent task planning and scheduling encompasses diverse problems that require complex models and robust solutions under uncertainty (see Section 8.4). The Probabilistic DCOP model for agents with partial knowledge is suitable to model those applications where agents have no prior knowledge of how the environment reacts to some of the actions. In such a model, the agents are aware of their own actions, which are performed deterministically. However, there is uncertainty in the reward associated with such actions, which is influenced by uncertain events that can be discovered over time. Thus, a common approach in such cases is to resort to sampling strategies, in order to obtain simple approximations of the probability distributions in the form of sample realizations of the probabilistic rewards. Due to the uncertainty arising in such problems, it is especially appealing to resort to solutions that adopt approaches to decision making under uncertainty, such as minimax, maximin, and regret.
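As a small illustration of the decision criteria named above, the following sketch selects an action by minimax regret over a table of sampled rewards. The table layout (one reward per action and per sampled scenario), the function name, and the numbers are assumptions made for this example only.

```python
def minimax_regret(rewards):
    """Pick the action whose worst-case regret over the scenarios is smallest.

    rewards[action] is a list with one sampled reward per scenario.
    """
    n_scenarios = len(next(iter(rewards.values())))
    # Best achievable reward in each scenario, over all actions.
    best = [max(r[s] for r in rewards.values()) for s in range(n_scenarios)]
    # Regret of an action in a scenario: how much reward it leaves on the table.
    regret = {action: max(best[s] - r[s] for s in range(n_scenarios))
              for action, r in rewards.items()}
    return min(regret, key=regret.get), regret

# Two sampled realizations of the uncertain events for three candidate actions.
rewards = {"a": [10, 0], "b": [6, 5], "c": [4, 4]}
choice, regret = minimax_regret(rewards)
```

Action "a" has the highest reward in the first scenario but a large loss in the second, so the criterion prefers the more balanced action "b".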
As outlined in Table 3, more research effort is needed to solve DCOP models in which agents act in a combined uncertain and dynamic environment. This is by far the most realistic setting for teams of agents acting within a MAS. More generally, a coordination strategy that can adapt to situations where the environment or network problem is evolving dynamically and rapidly, and where several scenarios require different approaches to coordination, has not yet been studied. Finally, Quantified DCOPs (QDCOPs) make it possible to model adversarial agents, which are common in many MAS applications. In a QDCOP, universally quantified variables can be considered as the choice of the nature of an adversary. QDCOPs can thus formalize application problems where a team of agents seeks to limit the effect of the adversarial agents, as well as problems associated with planning under uncertainty. Examples of relevant applications include distributed surveillance planning problems, where sensors in a network need to coordinate their surveillance areas to detect intruders, whose positions are unknown and who are trying to avoid detection. Thus, the sensors try to find a robust plan that can handle different intrusion scenarios (see Section 8.5).

Algorithms and Theoretical Analysis
Although classical DCOPs have reached a sufficient level of maturity from the algorithmic perspective, most of the other proposed formalisms fall short on both algorithmic and theoretical foundations. The proposed algorithms mostly extend classical DCOP algorithms and, therefore, they result in similar performance. Investigating strategies that are based on different backgrounds could help propel the evolution of this area. In particular, further investigations relating DCOPs to the areas of game theory and decision theory are necessary. Similar to the work in [33], where the authors study the relationship between DCOPs and potential games, analyzing the relationship of DCOPs to auction mechanisms could shed light on how to effectively address coordination and reasoning strategies in DCOPs with partially cooperative agents. Another direction is to relate DCOPs to machine learning techniques. For instance, as outlined in [86,51], one can use inference-based algorithms such as expectation-maximization and convex optimization machinery to develop efficient message-passing algorithms for solving large DCOPs. Additionally, merging insights from decision theory, such as handling partial observability, with the inherent DCOP ability of naturally exploiting problem structure could result in improved performance and/or refined models. A step in this direction is outlined in [127] and [68].
Due to the complexity of the DCOP models, the study of incomplete approaches to solve large DCOPs whose agents may act in a dynamic and/or uncertain environment seems particularly suitable. Within the current incomplete methods proposed, a considerable effort has been devoted to developing anytime algorithms. Anytime algorithms can return a valid solution even if the DCOP agents are interrupted at any time before the algorithm terminates, and are expected to find solutions of increasing quality as they keep running. In addition to the anytime mechanism, which constrains the problem resolution within a particular time requirement, any-space algorithms have been proposed to limit the amount of space needed by an agent during problem resolution [63,150,192]. Similarly to these mechanisms, we see an urgent need to investigate algorithms that allow problem resolution within any particular network load restriction. For instance, in a congested network agents might need to exchange smaller messages, or to communicate less frequently and/or with a restricted set of neighbors. Situations like these may arise in problems where a multitude of interconnected systems share the same medium, and thus limited bandwidth and interference may cause unacceptable delays. This situation may be exacerbated with the advent of concepts such as the Internet of Things, which is expected to see millions of connected devices, possibly sharing several mediums [12,34,119]. Thus, orthogonal to the direction pursued by anytime and any-space algorithms, we envision the development of any-communication procedures.
Another open question is related to agent coordination. It has been observed that simple coordination strategies give good results in environments with extreme (high or low) agent workloads. However, at intermediate workload levels, such strategies lose their effectiveness, and more complex coordination strategies are necessary [95,196]. For example, in [196] the authors study agent learning, where coordination is driven by a dynamic decomposition of a DCOP into smaller independent subproblems. Their proposal effectively produced near-optimal agent policies for learning and significantly reduced the amount of communication. These empirical observations suggest the existence of phase transition behaviors occurring at the level of agent coordination. A formal understanding of these phenomena, which looks at the environment, the agents' local reasoning strategies, their assumptions on other agents' states, as well as team coordination, could help researchers to understand the inherent complexity of coordination problems. Such a formal framework could be useful to build new and more efficient coordination strategies for a wide variety of multi-agent applications, perhaps resembling the way studies of phase transitions of NP-hard problems [123] led to the understanding of problem complexity and the creation of effective heuristics and search strategies.

Evaluation Metrics and Benchmarks
Modeling many real-world problems as DCOPs often requires each agent to solve large complex subproblems, each requiring many variables and constraints. A limitation of most DCOP algorithms is the assumption that each agent controls exactly one variable, and that all constraints are binary. Such assumptions simplify the algorithm organization and presentation. However, in order to operate under these assumptions, DCOP algorithms have to reformulate the problem using pre-processing techniques [29,194], which can negatively impact their performance. A more realistic view is to allow each agent to solve its local subproblem (in a centralized fashion) since it is independent of the subproblems of other agents. The agent's subproblem resolution can then exploit techniques from centralized reasoning, such as constraint optimization problems (COPs), linear programming (LP), and graphical models. One can even exploit novel hardware platforms, such as general-purpose graphics processing units (GPGPUs), to parallelize such solvers [46].
Despite the wide applicability of the DCOP model, unfortunately, there is no general language being used to formally specify a DCOP. While there are several DCOP simulators that include implementations of various DCOP algorithms using a common language specification [42,93,171], by and large, most stand-alone algorithms specify DCOPs in an ad-hoc manner. As a result, it is often inconvenient to experimentally compare different algorithms. More importantly, the great majority of such languages require constraint values to be specified explicitly. Such a requirement makes it impractical to convert problems that are naturally defined as mathematical optimization problems (such as Mixed Integer Programs) into an explicit form that specifies the utilities for each value combination of the variables in the scope of the problem constraints. The adoption of a common distributed constraint modeling language, which allows one to express constraints as standard algebraic or logic expressions, may be beneficial for developing standard benchmarks. Additionally, it would provide a tool for researchers outside the AI communities to model and test the applicability of new problems, extending the applicability of DCOPs to new areas. For instance, within the CP community, the adoption of the MiniZinc language [128] to model CSPs and COPs has gained wide traction, and it is becoming a modeling choice even outside the CP community.
Another open question in this research area concerns the definition of a systematic process to evaluate and compare DCOP algorithms. There are multiple metrics that can be used to measure the runtime of an algorithm, such as the number of non-concurrent constraint checks (NCCCs) [87], the simulated runtime [171], and the number of cycles [99]. However, there is no consensus on a standard metric. This issue, combined with the absence of a general DCOP language, makes it inconvenient to experimentally compare different algorithms. In addition, newly proposed algorithms are, in general, evaluated on arbitrary benchmarks, some inherited from the CSP literature, others derived from approximations of real-world problems. To cope with these issues, it would be useful to develop a benchmark repository, perhaps by taking inspiration from the efforts made by the constraint programming (CP) community with CSPLib [1], a library of test problems for constraint solvers, or by the planning community with their international planning competitions [2].

Conclusions
DCOPs have emerged as a popular formalism for distributed reasoning and coordination in multi-agent systems. Because the classical model offers limited support for complex, real-time, and uncertain environments, researchers have introduced several model extensions to handle dynamic and uncertain environments, as well as different levels of cooperation among the agents.
While the theoretical foundations and algorithmic frameworks of DCOPs have matured significantly over the past decade, their applicability to realistic domains is lagging behind. This survey aims at linking the DCOP theoretical framework and solving strategies with a set of potential applications where their use is having, or may have, a significant impact.
In this survey, we provided an analysis of the recent advances made by the AAMAS community within the DCOP framework and proposed a categorization based on agent characteristics, environment properties, and type of teamwork adopted. Within the proposed classification, we (1) presented a review of the characteristics of the different algorithmic solutions; (2) discussed a number of application domains that can be naturally modeled within each DCOP framework; and (3) identified some potential directions for future work with regard to agent coordination, algorithm scalability, modeling languages, and evaluation criteria of DCOP models and algorithms.

Fig. 2: Illustration of a Multi-agent System: Sensors (agents) seek to determine the position of the targets.

Fig. 3: DCOP problems as a generalization and extension of Constraint Satisfaction Problems

Table 1: Commonly Used Symbols and Notations

• Search-based methods are based on the use of search techniques to explore the space of possible solutions.