Exploiting Causality for Selective Belief Filtering in Dynamic Bayesian Networks

Dynamic Bayesian networks (DBNs) are a general model for stochastic processes with partially observed states. Belief filtering in DBNs is the task of inferring the belief state (i.e. the probability distribution over process states) based on incomplete and noisy observations. This can be a hard problem in complex processes with large state spaces. In this article, we explore the idea of accelerating the filtering task by automatically exploiting causality in the process. We consider a specific type of causal relation, called passivity, which pertains to how state variables cause changes in other variables. We present the Passivity-based Selective Belief Filtering (PSBF) method, which maintains a factored belief representation and exploits passivity to perform selective updates over the belief factors. PSBF produces exact belief states under certain assumptions and approximate belief states otherwise, where the approximation error is bounded by the degree of uncertainty in the process. We show empirically, in synthetic processes with varying sizes and degrees of passivity, that PSBF is faster than several alternative methods while achieving competitive accuracy. Furthermore, we demonstrate how passivity occurs naturally in a complex system such as a multi-robot warehouse, and how PSBF can exploit this to accelerate the filtering task.


Introduction
Dynamic Bayesian networks (DBNs) (Dean & Kanazawa, 1989) are a general model for stochastic processes with partially observed states. The topology of a DBN is a compact specification of how variables in the process interact during transitions (cf. Figure 1). Given the possible incompleteness and noise in observations, it may not generally be possible to infer the state of the process with absolute certainty. Instead, we may infer beliefs about the process state based on the history of observations, in the form of a probability distribution over the state space of the process. This is often called a belief state and the task of calculating belief states is commonly referred to as belief filtering.
A number of exact and approximate inference methods exist for Bayesian networks (see, e.g., Koller & Friedman, 2009; Pearl, 1988) which can be used for filtering in DBNs, by applying them to the "unrolled" DBN in which the t + 1 slice is repeated for each observed time step, or via a successive update in which the current posterior (belief state) is used as the prior in the next time step (see also Murphy, 2002). However, the unrolled variant becomes intractable because the network grows without bound over time. Even in the successive update, exact methods become intractable in high-dimensional process states and approximate methods may propagate growing errors over time. Therefore, filtering methods were developed which utilise the special structure of DBNs and control the errors propagated over time. (We defer a detailed discussion of such methods to Section 2.)
Often, the key to developing efficient filtering methods is to identify structure in the process which can be leveraged for inference. In this article, we are interested in the application of DBNs as representations of actions in partially observed decision processes, such as POMDPs (Kaelbling, Littman, & Cassandra, 1998; Sondik, 1971) and their many variants. DBNs can be used to represent the effects of actions on the decision process, by specifying how variables interact and what information the decision maker observes. In many cases, decision processes exhibit high degrees of causal structure (Pearl, 2000), by which we mean that a change in one part of the process may cause a change in another part. Our experience with such processes is that this causal structure may be used to make the filtering task more tractable, because it can tell us that beliefs need only be revised for certain aspects of the process state. For example, if the variable $x_2$ in Figure 1 changes its value only if variable $x_1$ changed its value (i.e. a change in $x_1$ causes a change in $x_2$), then it seems intuitive to use this causal relation when deciding whether to revise one's belief about $x_2$. Unfortunately, current filtering methods do not take such causal structure into account.
We refer to the above type of causal relation (between $x_1$ and $x_2$) as passivity. Intuitively, we say that a state variable $x_i$ is passive in a given action if, when executing that action, there is a subset of the state variables that directly affect $x_i$ (i.e. $x_i$'s parents in the DBN) such that $x_i$ may change its value only if at least one of the variables in this subset changed its value. It is worth pointing out that passivity occurs naturally and frequently in many planning domains, especially in robotic and other physical systems (Mainzer, 2010). The following example illustrates this in a simple robot arm:
Example 1 (Robot arm). Consider a robot arm with three rotational joints and a gripper, as shown in Figure 2a. The joints are denoted by $\theta_1, \theta_2, \theta_3$ and may take any values from the discrete set $\{0^\circ, 1^\circ, \ldots, 359^\circ\}$ which indicate their absolute orientations (e.g. $\theta_i = 0^\circ$ means that joint $i$ points exactly to the right, $\theta_i = 180^\circ$ means that it points to the left). For each joint $i$, let there be two actions $CW_i$ and $CCW_i$ which rotate the joint by $1^\circ$ clockwise and counter-clockwise, respectively. The uncertainty in this system could be due to stochastic joint movements or unreliable sensor readings for the joint orientations. For any action $CW_i$ or $CCW_i$, the variable $\theta_i$ is not passive because its value is directly modified by the action. However, the variables $\theta_j$ with $j \neq i$ are passive because they change their values only if the corresponding preceding variable $\theta_{j-1}$ changed its value, since a changed orientation of joint $j - 1$ causes a changed orientation of joint $j$ (recall that the orientations are absolute). Note that this also accounts for chains of such causal effects, as indicated by the arrows: the orientation of joint 3 changes if the orientation of joint 1 changes, since joint 1 causes joint 2 to change, which in turn causes joint 3 to change.
Further examples of passivity can be seen in the context of object manipulation, such as in the "blocks" planning domain (e.g. Pasula, Zettlemoyer, & Kaelbling, 2007). Figure 2b shows the arm holding blocks B and A, with A on top of B. Here, the position of B ($X_B$) is passive with respect to the joint orientations since it will only change if any of the orientations changed. Furthermore, there is a causal chain from the joint orientations to the position of block A ($X_A$), since A's position will change if B's position changes.
How can passivity be exploited to accelerate the filtering task in the above example? The fact that the state variables are passive means that some aspects of the state may remain unchanged, depending on which action we choose. For example, if we choose to rotate joint 3, then the fact that joints 1 and 2 are passive means that they are unaffected by this action. Thus, it seems redundant to revise beliefs for the orientations of joints 1 and 2. However, this is precisely what current filtering methods do (cf. Section 2).
More concretely, assume we use a factored belief representation $P(\theta_1, \theta_2, \theta_3) = P(\theta_1, \theta_2) \cdot P(\theta_2, \theta_3)$ and choose to rotate $\theta_3$ in any direction. Then, it is easy to see that we will need to update the factor $P(\theta_2, \theta_3)$, since $\theta_3$ changes its value, but not the factor $P(\theta_1, \theta_2)$, since the variables $\theta_1, \theta_2$ are both passive. Since the parents of $\theta_1, \theta_2$ (if any) do not change their values, we know that $\theta_1, \theta_2$ will not change their values either. As we will show later, skipping over $P(\theta_1, \theta_2)$ does not result in a loss of information in such cases, and similarly for chains of such causal connections (cf. Example 1). A more complex example of a planning domain involving passivity, and how it can be exploited, is discussed in Section 6.2.
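The claim that the factor over the first two joints can be skipped can be checked numerically on a small instance. The following is a minimal sketch, not the PSBF method itself. It assumes a toy arm with only 4 orientations per joint and a hypothetical noisy action `rotate_joint3` that advances joint 3 with probability 0.8. It applies an exact transition step to the full joint belief and compares the marginals before and after.

```python
from itertools import product

N = 4  # assumed toy domain: 4 orientations per joint instead of 360
STATES = list(product(range(N), repeat=3))  # (theta1, theta2, theta3)

def rotate_joint3(state, p_success=0.8):
    """Hypothetical noisy action: joint 3 advances one step with prob. p_success."""
    t1, t2, t3 = state
    return {(t1, t2, (t3 + 1) % N): p_success, (t1, t2, t3): 1.0 - p_success}

def transition_step(belief):
    """Exact transition step over the full joint belief (no observation)."""
    new_belief = {s: 0.0 for s in STATES}
    for s, p in belief.items():
        for s_next, q in rotate_joint3(s).items():
            new_belief[s_next] += p * q
    return new_belief

def marginal(belief, idx):
    """Marginal distribution over the variables at positions idx."""
    out = {}
    for s, p in belief.items():
        key = tuple(s[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

# Arbitrary non-uniform prior over the joint state space.
weights = [1 + (7 * i) % 5 for i in range(len(STATES))]
total = sum(weights)
prior = {s: w / total for s, w in zip(STATES, weights)}
posterior = transition_step(prior)

m12_before, m12_after = marginal(prior, (0, 1)), marginal(posterior, (0, 1))
m23_before, m23_after = marginal(prior, (1, 2)), marginal(posterior, (1, 2))
unchanged_12 = all(abs(m12_before[k] - m12_after[k]) < 1e-12 for k in m12_before)
changed_23 = any(abs(m23_before[k] - m23_after[k]) > 1e-6 for k in m23_before)
```

In this toy setting `unchanged_12` and `changed_23` both hold: the marginal over $(\theta_1, \theta_2)$ is untouched by the action, while the marginal over $(\theta_2, \theta_3)$ changes.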
In addition to guiding belief revision, there are several features which make passivity an interesting example of a causal relation: First of all, passivity is a latent causal relation, meaning that it can be readily extracted from the process dynamics without additional annotation by an expert. (In Section 4, we give a procedure which identifies passive variables based on their conditional probability tables.) Furthermore, passivity is not a deterministic relation since passive variables may have any stochastic behaviour when changing their values. Finally, passivity is a relatively simple example of a causal relation, and the idea of exploiting passivity in order to accelerate the filtering task is intuitive. Yet, to the best of our knowledge, this has not been formalised and explored rigorously before.
The purpose of the present article is to formalise and evaluate the idea of automatically exploiting causal structure for efficient belief filtering in DBNs, using passivity as a concrete example of a causal relation. Specifically, our hypothesis is that in large processes with high degrees of passivity, this structure can be exploited to accelerate the filtering task. After discussing related work in Section 2 and technical preliminaries in Section 3, our contributions can be grouped into the following parts:
• In Section 4, we give a formal and concise definition of passivity and discuss various aspects of this definition. Our definition assumes a decision process which is specified as a set of dynamic Bayesian networks (one for each action). We also discuss a non-example of passivity, by which we mean variables which appear to be passive but are in fact not passive. Finally, we give a simple procedure which can detect passive variables based on their conditional probability tables.
• In Section 5, we present the Passivity-based Selective Belief Filtering (PSBF) method. Following the idea outlined above, PSBF uses a factored belief representation in which the belief factors are defined over clusters of correlated state variables. PSBF follows a 2-step update procedure wherein the belief state is first propagated through the process dynamics (the transition step) and then conditioned on the observation (the observation step). The interesting novelty of PSBF is the way in which it performs the transition step: rather than updating all belief factors, PSBF updates only those factors whose variables it suspects to have changed, which is possible by exploiting passivity (to be made precise shortly). Similarly, in the observation step, PSBF updates only those belief factors which it determines to be structurally connected with the observation, and it uses only those parts of the observation which are relevant to the belief factor, thus allowing for a more efficient incorporation of observations. PSBF produces exact belief states under certain assumptions and approximate belief states otherwise. We also discuss the computational complexity and error bounds of PSBF.
• In Section 6, we evaluate PSBF in two experimental domains: We first evaluate PSBF in synthetic (i.e. randomly generated) processes of varying sizes and degrees of passivity. The process sizes vary from one thousand to one trillion states, and the passivity degrees vary from 25% to 100% passivity. Our results show that PSBF is faster than several alternative methods while maintaining competitive accuracy. In particular, our results indicate that the computational gains grow significantly with both the degree of passivity and the size of the process. We then evaluate PSBF in a complex simulation of a multi-robot warehouse system in the style of Kiva (Wurman, D'Andrea, & Mountz, 2008). We show how passivity occurs in this system and how PSBF can exploit this to accelerate the filtering task, again outperforming alternative methods.
Finally, we discuss the strengths and weaknesses of PSBF in Section 7, and we conclude our work in Section 8. All proofs can be found in the appendix.

Related Work
There exists a substantial body of work on belief filtering in partially observed stochastic processes. In this section, we review filtering methods that utilise the special structure of DBNs and situate our work within this and other related literature.

Approximate Belief Filtering in DBNs
Several authors proposed filtering methods wherein the belief state is represented as a set of state samples. Specifically, the probability that the process is in state s is the normalised frequency with which the state samples correspond to s. These methods are now commonly referred to as particle filters (PF); see the work of Doucet, de Freitas, and Gordon (2001) for a survey. In a common variant of PF (Gordon, Salmond, & Smith, 1993), the filtering task consists of propagating the current state samples through the process dynamics and a subsequent resampling step based on the probabilities with which the new state samples would have produced the observation. Two interesting features of PF are that it can be applied to processes with discrete and continuous variables, and that the approximation error converges to zero as we increase the number of state samples.
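The propagate, weight and resample cycle described above can be sketched as follows. This is a minimal illustration of a bootstrap-style particle filter, not the implementation used in the experiments. The ring-shaped toy process and the names `move`, `sensor` and `particle_filter_step` are assumptions made for the example.

```python
import random

def particle_filter_step(particles, transition_sample, obs_likelihood, observation, rng):
    """One filtering step: propagate, weight by the observation, resample."""
    # 1. Propagate each state sample through the process dynamics.
    propagated = [transition_sample(s, rng) for s in particles]
    # 2. Weight each sample by the probability of producing the observation.
    weights = [obs_likelihood(s, observation) for s in propagated]
    if sum(weights) == 0.0:
        return propagated  # no sample explains the observation; skip resampling
    # 3. Resample with replacement in proportion to the weights.
    return rng.choices(propagated, weights=weights, k=len(particles))

def belief_from_particles(particles):
    """Belief state as the normalised frequency of each sampled state."""
    counts = {}
    for s in particles:
        counts[s] = counts.get(s, 0) + 1
    return {s: c / len(particles) for s, c in counts.items()}

# Hypothetical toy process: a single variable moving on a ring of 10 positions.
N = 10

def move(s, rng):
    return (s + 1) % N if rng.random() < 0.9 else s

def sensor(s, o):
    return 0.8 if s == o else 0.2 / (N - 1)

rng = random.Random(0)
particles = [rng.randrange(N) for _ in range(2000)]
true_state = 3
for _ in range(5):
    true_state = (true_state + 1) % N  # the observation equals the true state here
    particles = particle_filter_step(particles, move, sensor, true_state, rng)
belief = belief_from_particles(particles)
```

The sample count (2000) is fixed across steps. It is the quantity that has to grow when the process dynamics have high variance.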
A known problem of PF is the fact that the number of samples needed for acceptable approximations can grow drastically with the variance in the process dynamics (as shown in our experiments; cf. Section 6). Rao-Blackwellised PF (RBPF) (Doucet, De Freitas, Murphy, & Russell, 2000) was developed to address this problem. RBPF assumes that the state variables can be grouped into sets R and X such that the distribution over X can be efficiently calculated from R during the filtering. Hence, a sample in RBPF consists of a sample of R and a corresponding marginal distribution over X. RBPF is useful when the variance in R is relatively low and the variance in X is high, since this reduces the number of samples needed for acceptable approximations.
Boyen and Koller (1999, 1998) recognised that if a process consists of several independent or weakly interacting subcomponents, then the belief state can be represented more efficiently as a product of smaller beliefs about these individual subcomponents. Their seminal contribution is to show that the approximation error due to this factored representation is essentially bounded by the degree of uncertainty (or "mixing rates") in the process. More precisely, they prove that the relative entropy (or KL divergence; Kullback & Leibler, 1951) between two belief states contracts at an exponential rate when propagated through a stochastic transition process. Based on this observation, they propose a filtering method (BK) wherein the belief state is represented in factored form and the belief factors are updated using an exact inference method, such as the junction tree algorithm (Lauritzen & Spiegelhalter, 1988). Since the internal "cliques" used in the junction tree algorithm may not correspond to the belief state representation of BK, a final "projection step" will typically have to be performed in which the original factorisation is restored. The performance of this method depends crucially on whether the relevant correlations between state variables can be captured in small clusters, and whether the projection step can be performed efficiently.
Factored particle filtering (FP) (Ng, Peshkin, & Pfeffer, 2002) addresses the main drawbacks of PF (many samples needed) and BK (small clusters required) by approximating the belief factors using a set of factored state samples. The samples are factored in the sense that they only assign values to the variables in the corresponding factor. This allows FP to represent belief factors which are too large for BK, and it reduces the number of samples needed due to the smaller number of variables in each factor. The authors provide different methods of updating the factored state samples, but the generic idea is to first perform a "join" operation in which full state samples are reconstructed from the factored samples, which are then updated as in standard PF. The updated samples are then projected down into factored form using a "project" operation. The main drawback of FP is that these join and project operations essentially correspond to standard relational database operations, which can be very expensive.
Murphy and Weiss (2001) propose a filtering method called factored frontier (FF). FF uses a fully factored representation of belief states; that is, the belief state is a product of marginals for each individual state variable. This allows for a very compact representation of beliefs. The algorithm works by "moving" a set of state variables (the frontier) forward and backward in the DBN topology. This requires a certain variable ordering, which can be difficult to attain if intra-correlations between state variables (i.e. edges within the t + 1 slice of the DBN) are allowed. The authors show that their method is equivalent to a single iteration of loopy belief propagation (LBP) (Pearl, 1988). Thus, similar to LBP, FF can be applied in successive iterations to improve the approximation accuracy.
None of the works discussed above explicitly address the question of how causal relations between state variables can be exploited to accelerate the filtering task, or, alternatively, how the filtering methods proposed therein implicitly benefit from causal structure. Our method, PSBF, is related to BK and FP in that PSBF, too, uses a factored belief representation, where the belief factors are defined over clusters of correlated state variables. Therefore, the analysis of approximation errors by Boyen and Koller (1998) also applies to PSBF, as we show in Section 5 as well as in our experiments. However, in contrast to BK and FP, PSBF does not perform inference over the complete factorisation, but rather over the individual factors. As a consequence, PSBF does not require a join or project operation, which is one of the main disadvantages of BK and FP.

Belief Filtering in Decision Processes
The methods discussed in the preceding subsection can be used for belief filtering in decision processes, including POMDPs (Kaelbling et al., 1998; Sondik, 1971). In this regard, these methods can be viewed as "pure" filters in that they are only concerned with belief filtering and not with the control of the decision process. This is in contrast to combined filtering methods, which interleave the filtering and control tasks in decision processes and make specific assumptions regarding solutions thereof. There exists a large body of literature on such combined methods, including reachability-based methods (Hauskrecht, 2000; Washington, 1997), grid-based methods (Zhou & Hansen, 2001; Brafman, 1997; Lovejoy, 1991), point-based methods (Smith & Simmons, 2005; Pineau, Gordon, & Thrun, 2003), and compression methods (Roy, Gordon, & Thrun, 2005; Poupart & Boutilier, 2002).
A potential advantage of such combined methods is that they have access to additional structure and may, therefore, utilise synergies between the filtering and control tasks. One such synergy is the use of decision quality to guide belief filtering, rather than metrics such as relative entropy. Poupart and Boutilier (2001, 2000) propose a filtering method, called value-directed approximation, which chooses different approximation schemes for different decisions so as to minimise the expected loss in decision quality (i.e. accumulated rewards). The method assumes that the POMDP has been solved exactly and that the value function is provided in the form of α-vectors which represent the available actions in the POMDP. Based on the value function, their algorithm computes a "switching set" and "alternative plans" to determine the error bounds of approximation schemes. This is used to search for an optimal approximation scheme in a tree-based manner, where the search traverses from approximate to exact schemes.
While the idea of using decision quality to guide belief filtering is appealing, their method involves a series of optimisation problems and an exhaustive tree search, which can be very costly in complex systems. The advantage of pure filtering methods, including our proposed method PSBF, is that they can filter processes which are too complex for combined methods, such as the multi-robot warehouse system studied in Section 6. The actual control task can then be done via domain-specific solutions (cf. Section 6.2.1).

Substructure in Parameterisation
Bayesian networks, and hence DBNs, allow for a compact parameterisation (i.e. specification of probabilities) and efficient inference via conditional independence relations. In addition, there has been considerable work in identifying substructure in the parameterisation to further simplify knowledge acquisition and enhance inference (Koller & Friedman, 2009; Boutilier, Dean, & Hanks, 1999). The property studied in this work, passivity, is one example of substructure in the parameterisation. Other notable examples include causal independence (e.g. Heckerman & Breese, 1994; Heckerman, 1993) and context-specific independence (Boutilier, Friedman, Goldszmidt, & Koller, 1996).
Causal independence is the assumption that the effects of individual causes on a common variable (i.e. the parents of that variable) are independent of one another. This allows for a compact parameterisation via operators such as "noisy-or" (Srinivas, 1993; Pearl, 1988), and it can be used to enhance inference (Zhang & Poole, 1996). Note that passivity is a conceptually much simpler property than causal independence, because passivity is neither concerned with the strength of individual causes nor the extent to which they depend on each other. Moreover, passivity can be read directly from the parameterisation (cf. Section 4.3) whereas causal independence is usually imposed by the designer.
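For readers unfamiliar with the "noisy-or" operator, the following minimal sketch shows the standard binary form, in which each active cause independently fails to produce the effect. The function name and the optional `leak` parameter are choices made for this illustration.

```python
def noisy_or(cause_strengths, active, leak=0.0):
    """P(effect = 1 | causes) under the binary noisy-or model.

    cause_strengths[i]: probability that cause i alone produces the effect.
    active[i]: whether cause i is present.
    leak: probability that the effect occurs with no active cause.
    """
    p_no_effect = 1.0 - leak
    for strength, is_on in zip(cause_strengths, active):
        if is_on:
            # Each active cause independently fails with prob. 1 - strength.
            p_no_effect *= 1.0 - strength
    return 1.0 - p_no_effect

p_both = noisy_or([0.8, 0.5], [True, True])     # 1 - 0.2 * 0.5
p_none = noisy_or([0.8, 0.5], [False, False])   # no active cause, no leak
```

With n binary parents, this form needs n strength parameters, whereas a full conditional probability table needs one entry per parent assignment (2^n).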
Context-specific independence (CSI) is a property which states that a variable is independent of some of its parents given a certain assignment of values (i.e. "context") to some of its other parents. Non-local CSI statements follow similarly to d-separation (Geiger, Verma, & Pearl, 1989). This can allow for a further reduction of parameters (Boutilier et al., 1996) and enhancement of inference (Poole & Zhang, 2003). As we will discuss in Section 4, passivity can be viewed as a special kind of CSI applied to DBNs, in that the parents with respect to which the variable is passive provide the context for CSI. However, in contrast to CSI, passivity does not assume that the context is actually observed.

Technical Preliminaries
This section introduces the basic concepts and notation used in our work. We begin with a brief discussion of decision processes to provide the context for our work, followed by a discussion of dynamic Bayesian networks as the model over which we perform inference.

Decision Processes, Belief States, Exact Updates
We consider a stochastic decision process wherein, at each time $t$, the process is in state $s^t \in S$ and a decision maker, or "agent", is choosing an action $a^t$. After executing $a^t$ in $s^t$, the process transitions into state $s^{t+1} \in S$ with probability $T^{a^t}(s^t, s^{t+1})$ and the agent receives an observation $o^{t+1} \in O$ with probability $\Omega^{a^t}(s^{t+1}, o^{t+1})$. We assume factored representations of the state space $S$ and observation space $O$, such that $S = X_1 \times \ldots \times X_n$ and $O = Y_1 \times \ldots \times Y_m$, where the domains $X_i, Y_j$ are finite. The notation $s_i$ is used to denote the value of $X_i$ in state $s \in S$, and analogously for $o_j$ with $o \in O$. Moreover, we assume that the process is time-invariant, meaning that $T^a$ and $\Omega^a$ are independent of $t$. This framework is compatible with many decision models used in the artificial intelligence literature, including POMDPs (Kaelbling et al., 1998; Sondik, 1971) and their many variants.
The agent chooses action $a^t$ based on its belief state $b^t$ (also known as information state), which represents the agent's beliefs about the likelihood of states at time $t$. Formally, a belief state is a probability distribution over the state space $S$ of the process. Belief filtering is the task of calculating a belief state based on the history of observations. Ideally, the resulting belief state should be exact in that it retains all relevant information from the past observations (this is sometimes referred to as a sufficient statistic; cf. Astrom, 1965). The exact update rule is a simple procedure that produces exact belief states:
Definition 1 (Exact update rule). The exact update rule is defined as follows: After taking action $a^t$ and observing $o^{t+1}$, the belief state $b^t$ is updated to $b^{t+1}$ via
$$\hat{b}^{t+1}(s') = \sum_{s \in S} T^{a^t}(s, s') \, b^t(s) \qquad (1)$$
$$b^{t+1}(s') = \eta \, \Omega^{a^t}(s', o^{t+1}) \, \hat{b}^{t+1}(s') \qquad (2)$$
where $\eta$ is a normalisation constant.
We sometimes refer to the step $b^t \to \hat{b}^{t+1}$ as the transition step and to the step $\hat{b}^{t+1} \to b^{t+1}$ as the observation step. Unfortunately, the space complexity of storing exact belief states and the time complexity of updating them using the exact update rule are both exponential in the number of state variables, making it infeasible for complex systems with large state spaces. Hence, more efficient approximate methods are required.
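The transition step and observation step of the exact update rule can be sketched over an explicitly enumerated state space. This is a minimal illustration which assumes that the transition and observation functions are given as nested dictionaries. The two-state process, its probabilities and the name `exact_update` are hypothetical.

```python
def exact_update(belief, T, Omega, observation):
    """Exact belief update over an enumerated state space.

    belief[s]      : current probability of state s
    T[s][s2]       : probability of moving from s to s2 under the chosen action
    Omega[s2][o]   : probability of observing o in the new state s2
    """
    states = list(belief)
    # Transition step: propagate the belief through the process dynamics.
    predicted = {s2: sum(T[s][s2] * belief[s] for s in states) for s2 in states}
    # Observation step: condition on the observation and normalise.
    unnormalised = {s2: Omega[s2][observation] * predicted[s2] for s2 in states}
    z = sum(unnormalised.values())
    if z == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s2: p / z for s2, p in unnormalised.items()}

# Hypothetical two-state process with two possible observations.
T = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
Omega = {"A": {"a": 0.7, "b": 0.3}, "B": {"a": 0.1, "b": 0.9}}
posterior = exact_update({"A": 0.5, "B": 0.5}, T, Omega, "a")
```

The transition step alone touches every pair of states. With n state variables the number of states is exponential in n, which is the cost referred to above.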

Dynamic Bayesian Networks
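A dynamic Bayesian network (DBN) (Dean & Kanazawa, 1989) is a Bayesian network with a special temporal semantics that specifies how a stochastic process transitions from one state into another. DBNs can be used to model the effects of actions in a stochastic decision process. Specifically, they are a compact representation of the transition function $T^a$ and observation function $\Omega^a$ of action $a$:
Definition 2 (DBN). A dynamic Bayesian network for action $a$, denoted $\Delta^a$, is an acyclic directed graph consisting of:
• State variables $X^t = \{x^t_1, \ldots, x^t_n\}$ and $X^{t+1} = \{x^{t+1}_1, \ldots, x^{t+1}_n\}$ with $x^t_i, x^{t+1}_i \in X_i$, representing the states of the process at time $t$ and $t + 1$, respectively.
• Observation variables $Y^{t+1} = \{y^{t+1}_1, \ldots, y^{t+1}_m\}$ with $y^{t+1}_j \in Y_j$, representing the observation received at time $t + 1$.
• Directed edges $E^a \subseteq (X^t \times X^{t+1}) \cup (X^{t+1} \times X^{t+1}) \cup (X^{t+1} \times Y^{t+1}) \cup (Y^{t+1} \times Y^{t+1})$, specifying the network topology and dependencies between variables.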
• Conditional probability distributions $P^a(z \mid pa_a(z))$ for each variable $z \in X^{t+1} \cup Y^{t+1}$, specifying the probability that $z$ assumes a certain value given a specific assignment to its parents $pa_a(z) = \{z' \mid (z', z) \in E^a\}$.
For convenience, we also define $pa^t_a(Z) = X^t \cap pa_a(Z)$ and $pa^{t+1}_a(Z) = X^{t+1} \cap pa_a(Z)$, where $pa_a(Z) = \bigcup_{z \in Z} pa_a(z)$.
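The edges $E^a$ and distributions $P^a$ define the functions $T^a$ and $\Omega^a$ as
$$T^a(s, s') = \prod_{i=1}^{n} P^a\big(x^{t+1}_i = s'_i \mid pa_a(x^{t+1}_i) \leftarrow (s, s')\big) \qquad (3)$$
$$\Omega^a(s', o) = \prod_{j=1}^{m} P^a\big(y^{t+1}_j = o_j \mid pa_a(y^{t+1}_j) \leftarrow (s', o)\big) \qquad (4)$$
where we use the notation $pa_a(x^{t+1}_i) \leftarrow (s, s')$ to specify that the parents of $x^{t+1}_i$ in $X^t$ and $X^{t+1}$, respectively, assume their corresponding values from $s$ and $s'$. Formally, if $x^t_l \in pa^t_a(x^{t+1}_i)$ and $x^{t+1}_{l'} \in pa^{t+1}_a(x^{t+1}_i)$, then $x^t_l = s_l$ and $x^{t+1}_{l'} = s'_{l'}$. Similarly, we use the notation $pa_a(y^{t+1}_j) \leftarrow (s', o)$ to specify that the parents of $y^{t+1}_j$ in $X^{t+1}$ and $Y^{t+1}$, respectively, assume corresponding values from $s'$ and $o$.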
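To illustrate how the edges and conditional distributions of a DBN determine the transition function, the following minimal sketch computes a transition probability as a product of per-variable conditional probabilities, with each variable's parents reading their values from the old and new state. The two-variable DBN, its probabilities and the names `PARENTS`, `cpt_x1`, `cpt_x2` and `transition_prob` are assumptions made for this example.

```python
from itertools import product

# Hypothetical toy DBN with two binary state variables (indices 0 and 1).
# Each parent is a pair (slice, index): slice 0 is time t, slice 1 is time t+1.
PARENTS = [
    [(0, 0)],                  # x1 at t+1 depends on x1 at t
    [(0, 1), (0, 0), (1, 0)],  # x2 at t+1 depends on x2 at t, x1 at t, x1 at t+1
]

def cpt_x1(context):
    (old,) = context
    return {old: 0.7, 1 - old: 0.3}  # x1 flips with probability 0.3

def cpt_x2(context):
    old, x1_old, x1_new = context
    if x1_old == x1_new:
        return {old: 1.0}            # x2 keeps its value if x1 did not change
    return {old: 0.5, 1 - old: 0.5}  # otherwise x2 flips with probability 0.5

CPTS = [cpt_x1, cpt_x2]

def transition_prob(s, s_next):
    """Product over state variables of P(new value | parent values)."""
    prob = 1.0
    for i, (cpt, parents) in enumerate(zip(CPTS, PARENTS)):
        # Parents in slice t take values from s, parents in slice t+1 from s_next.
        context = tuple(s[k] if sl == 0 else s_next[k] for sl, k in parents)
        prob *= cpt(context).get(s_next[i], 0.0)
    return prob

STATES = list(product((0, 1), repeat=2))
row_sums = {s: sum(transition_prob(s, s2) for s2 in STATES) for s in STATES}
```

Each row of the resulting transition function sums to one. In this toy network, the second variable can only change when the first one does.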
Example 2 (DBN representation of robot arm). We can represent the robot arm from Example 1 as a set of DBNs, where we have one DBN $\Delta^a$ for each action $a \in \{CW_i, CCW_i\}$. The state and observation variables in the DBNs are $X^t = \{\theta^t_1, \theta^t_2, \theta^t_3\}$, $X^{t+1} = \{\theta^{t+1}_1, \theta^{t+1}_2, \theta^{t+1}_3\}$, and $Y^{t+1} = \{\hat{\theta}^{t+1}_1, \hat{\theta}^{t+1}_2, \hat{\theta}^{t+1}_3\}$. To make our example more realistic, let us assume that the joint orientations are bounded relative to the orientation of the immediately preceding joint (e.g. in the form of a cone), where the first joint is bounded relative to the ground. This means that the joint movement depends on its own as well as the preceding joint orientation, as shown in Figure 3. Moreover, the joint orientations are correlated (i.e. edges within [...]
[...] if at least one of the variables in this subset changed its value. Conversely, $x^{t+1}_i$ does not change if the variables in the subset did not change. Formally, we define passivity as follows:
Definition 3 (Passivity). Let action $a$ be given by a DBN $\Delta^a$. A state variable $x^{t+1}_i$ is called passive in $\Delta^a$ if there exists a set $\Phi^{a,i} \subseteq pa^t_a(x^{t+1}_i) \setminus \{x^t_i\}$ such that: (i) for all $x^t_j \in \Phi^{a,i}$: $(x^{t+1}_j, x^{t+1}_i) \in E^a$, and (ii) for any two states $s^t$ and $s^{t+1}$ with $T^a(s^t, s^{t+1}) > 0$:
$$\big(\forall x^t_j \in \Phi^{a,i} : s^t_j = s^{t+1}_j\big) \;\Rightarrow\; s^t_i = s^{t+1}_i \qquad (5)$$
A state variable which is not passive is called active.
The set $\Phi^{a,i}$ corresponds to the subset of variables described above: it contains all those variables which directly affect $x^{t+1}_i$ (i.e. they are parents of $x^{t+1}_i$ in $X^t$) such that $x^{t+1}_i$ may change its value only if any of the variables in $\Phi^{a,i}$ changed its value. We will sometimes say that a variable $x^{t+1}_i$ [...] As an example, see Figure 1 in which we assumed that the variable $x^{t+1}_2$ was passive with respect to the variable $x^t_1$. (We will discuss the purpose of clause (i) in the next subsection.) Clause (ii) defines the core semantics of passivity by requiring that $x^{t+1}_i$ remains unchanged if all variables in $\Phi^{a,i}$ remain unchanged. Note that this means that the distribution $P^a$ for $x^{t+1}_i$ may specify any deterministic or stochastic behaviour if the variables in $\Phi^{a,i}$ change their values. This includes the case in which $x^{t+1}_i$ does not change its value at all. A state variable $x^{t+1}_i$ can be passive even if it has no parents in $X^t$, or none other than $x^t_i$. In this case, the set $\Phi^{a,i}$ would be empty and clause (i) as well as the premise in (5) would trivially hold true. However, such a variable can only be passive if it does not change its value under any circumstances. In other words, it would have to be a constant. In that case, one should consider removing the variable from the state description in order to reduce computational costs.
As noted in Section 2.3, passivity can be shown to be a special kind of context-specific independence (CSI) (Boutilier et al., 1996) applied to DBNs. Here, the associated set $\Phi^{a,i}$ of a passive variable $x^{t+1}_i$ provides the context: given any assignment of values to [...] However, besides this similarity, there is an important difference between passivity and CSI, which is that passivity does not actually assume that the context is observed. Thus, passivity can be viewed as a kind of CSI for unobserved contexts. This will become clear in Section 5, when we describe a filtering method that exploits passivity.

Non-Example of Passivity
What is the purpose of clause (i) in the definition of passivity? After all, and as discussed previously, clause (ii) captures the core idea of passivity, which is that a variable may only change its value if any of the variables with respect to which it is passive changed its value. However, while it may seem intuitive that clause (ii) be sufficient for passivity, there are in fact processes in which clause (ii) alone does not suffice. In other words, clause (ii) is necessary but not sufficient for passivity. We illustrate this in the following example:
Example 3 (Non-example of passivity). Consider a process with two binary state variables, $x_1, x_2$, and a single action, $a$, shown in Figure 4. (We omit the observation variables for clarity.) The dynamics of the process are such that $x^{t+1}_1$ takes the value of $x^t_2$ and $x^{t+1}_2$ takes the value of $x^t_1$ (i.e. $x_1$ and $x_2$ swap their values at each time step). In this process, both state variables satisfy clause (ii) of Definition 3: If we set $x^0_1 = x^0_2$ (i.e. same initial values), then $T^a(s^t, s^{t+1})$ is positive only for states $s^t = s^{t+1}$, and hence (5) holds. If we set $x^0_1 \neq x^0_2$ (i.e. different initial values), then both variables change their values at every time step, i.e. $s^t_j \neq s^{t+1}_j$ for $j \in \{1, 2\}$, and hence (5) is trivially true since its premise is false.
Despite satisfying clause (ii), the state variables $x^{t+1}_1$ and $x^{t+1}_2$ from Example 3 are in fact not passive, for the following two reasons: Firstly, passivity is a causal relation and as such it must imply a causal order (Pearl, 2000). However, there is no causal order between $x_1$ and $x_2$, because there is no edge between $x^{t+1}_1$ and $x^{t+1}_2$. Secondly, passivity means that a variable may change its value only if another variable with respect to which it is passive (a variable in $\Phi^{a,i}$) changed its value. In other words, whether or not a passive variable $x^{t+1}_i$ may change its value depends on both the past values of $\Phi^{a,i}$ (at time $t$) and the new values of $\Phi^{a,i}$ (at time $t + 1$). However, the variables in Example 3 only depend on the values at time $t$, hence their own values at time $t + 1$ are predetermined and do not depend on whether the variables in $\Phi^{a,i}$ change values.
The first issue, namely that of the causal order, can be addressed by adding the corresponding edges in X t+1 . For instance, in Example 3 we could add an edge from x t+1 1 to x t+1 2 to establish a causal order. However, this does not generally solve the second issue, which is that every passive variable x t+1 i must depend on both past and new values of the variables in Φ a,i . In other words, x t+1 i must be both inter-correlated as well as intra-correlated with the variables in Φ a,i . The former is given by definition (since every variable in Φ a,i is a parent of x t+1 i ) and the latter is precisely what is required by clause (i) in Definition 3. Therefore, clauses (i) and (ii) together define the formal meaning of passivity.

Detecting Passive Variables
As mentioned in Section 1, passivity is a latent causal property in the sense that it can be extracted from the process dynamics without additional information, and with no additional assumptions regarding the representation of variable distributions. In order to determine if a variable $x^{t+1}_i$ is passive in $\Delta^a$, one has to find a set $\Phi_{a,i}$ such that both clauses of Definition 3 are satisfied. A simple procedure which does this for any representation of the variable distributions is given in Algorithm 1. The algorithm takes as inputs a variable $x^{t+1}_i$ and a DBN $\Delta^a$, and checks whether $x^{t+1}_i$ is passive in $\Delta^a$ by searching for a set $\Phi_{a,i}$ which satisfies both clauses of Definition 3. Note that the power set $\mathcal{P}$ in line 3 includes the empty set $\emptyset$, hence it also accounts for $\Phi_{a,i} = \emptyset$. Lines 7 to 9 check if clause (i) is satisfied while lines 10 to 14 check if clause (ii) is satisfied. Line 13 essentially checks if (5) holds true. If both clauses are satisfied, then $x^{t+1}_i$ is passive in $\Delta^a$ with respect to the variables in $\Phi_{a,i}$, and the algorithm returns the set $\Phi_{a,i}$. Otherwise, the algorithm returns a logical false.
The time complexity of Algorithm 1 is exponential in the worst case, in which $x^{t+1}_i$ is not passive. Specifically, the time requirements of line 4 grow exponentially with the number of parents of $x^{t+1}_i$ in $X^t$, and the time requirements of line 12 grow exponentially with the cardinality of $\Phi_{a,i}$ and $\Psi_{a,i}$. However, these time requirements can be reduced significantly when committing to specific representations for the variable distributions $P_a$. For example, if the distributions are represented in tabular form, then one can utilise arrays of indices to perform sweeping tests of (5), i.e. line 13. Moreover, it is important to realise that the algorithm needs to be performed only once for each state variable, prior to the start of the process or on demand. This is because passivity is invariant to the process states. In other words, if a variable is passive in $\Delta^a$, then it will always be passive in $\Delta^a$. Therefore, it suffices to check once in advance for passivity.
Note that the set $\Phi_{a,i}$ is not necessarily unique. For example, consider a variable $x^{t+1}_1$ which is passive in $\Delta^a$ with respect to variables $x^t_2$ and $x^t_3$, i.e. $\Phi_{a,1} = \{x^t_2, x^t_3\}$, and assume that $x^{t+1}_2$ changes if and only if $x^{t+1}_3$ changes (i.e. they change at the same time). Then, it is easy to verify that $\Phi'_{a,1} = \{x^t_2\}$ and $\Phi''_{a,1} = \{x^t_3\}$ also satisfy clauses (i) and (ii), and hence $\Phi_{a,1}$, $\Phi'_{a,1}$, $\Phi''_{a,1}$ are all valid sets under our definition of passivity. The guiding principle in such cases is Occam's razor, which, intuitively speaking, states that the simplest explanation suffices. In our case, this means that it suffices to use the smallest set $\Phi_{a,i}$ in terms of the cardinality $|\Phi_{a,i}|$. (Hence, line 3 in Algorithm 1 sorts the queue $Q$ in ascending order of $|\Phi_{a,i}|$.) The rationale is that if there exist multiple causal explanations for a passive variable $x^{t+1}_i$, then the one involving the fewest key variables is to be favoured since it reduces (compared to the alternative explanations) the number of cases in which we would have to revise our beliefs about $x^{t+1}_i$. In our earlier example, if we accept $\Phi'_{a,1}$ as a causal explanation for $x^{t+1}_1$, then we would have to revise our beliefs for $x^{t+1}_1$ only if $x^{t+1}_2$ may have changed its value. This difference will become more obvious in Section 5.2, which explains how passivity can be exploited to reduce computational costs.
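The search for a smallest explanatory set can be illustrated with a minimal Python sketch. It is not Algorithm 1: it checks only an assumed reading of clause (ii), namely that in every transition with positive probability the variable keeps its value whenever no variable in the candidate set changed, and it does not check clause (i). The tabular transition function `T`, the function names, and the two-variable process are hypothetical and introduced for illustration only.

```python
from itertools import combinations

def satisfies_clause_ii(T, i, phi):
    """True if variable i changes only when some variable in phi changes.

    T maps (s, s_next) pairs of state tuples to transition probabilities.
    This encodes an assumed reading of condition (5); clause (i) is not checked.
    """
    for (s, s_next), prob in T.items():
        if prob <= 0.0:
            continue  # impossible transitions impose no constraint
        phi_unchanged = all(s[j] == s_next[j] for j in phi)
        if phi_unchanged and s[i] != s_next[i]:
            return False
    return True

def smallest_candidate_set(T, i, parents):
    """Try candidate sets in ascending order of cardinality (Occam's razor)."""
    for size in range(len(parents) + 1):
        for phi in combinations(parents, size):
            if satisfies_clause_ii(T, i, phi):
                return set(phi)
    return None  # no candidate set satisfies the check

# Hypothetical process: x0 flips with probability 0.5, and x1 takes the
# new value of x0 whenever x0 flips.
T = {}
for x0 in (0, 1):
    for x1 in (0, 1):
        T[((x0, x1), (x0, x1))] = 0.5          # nothing changes
        T[((x0, x1), (1 - x0, 1 - x0))] = 0.5  # x0 flips, x1 follows

print(smallest_candidate_set(T, 1, [0]))  # candidate set for x1
print(smallest_candidate_set(T, 0, [1]))  # x0 changes on its own
```

Because candidate sets are enumerated by increasing size, the first set that passes the check is also a smallest one, which mirrors the ordering of the queue described above.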

Passivity-based Selective Belief Filtering
This section presents the Passivity-based Selective Belief Filtering (PSBF) method, which exploits passivity for efficient filtering. As discussed in Section 3, we assume that the process is specified as a set of dynamic Bayesian networks which contains one DBN ∆ a for each action a ∈ A. Therefore, whenever we refer to an action a (e.g. T a , Ω a , P a , pa a ), this is assumed to be in the context of ∆ a .
PSBF follows the general two-step update procedure in which the belief state is first propagated through the process dynamics (transition step) and then conditioned on the observation (observation step). Thus, it is natural to divide the exposition of PSBF into three parts: (1) the belief state representation, (2) the transition step, and (3) the observation step. These are discussed in Sections 5.1, 5.2, and 5.3, respectively. A summary of PSBF is given in Section 5.4. We also discuss the computational complexity and error bounds of PSBF in Sections 5.5 and 5.6, respectively.

Belief State Representation
Recall from Section 1 that the principal idea behind PSBF is to maintain separate beliefs about individual aspects of the process, and to exploit passivity in order to perform selective updates over these separate beliefs. The union of all individual aspects constitutes a complete state description of the process. Therefore, the belief state can be represented as the product of all separate beliefs about the individual aspects.
We capture the informal notion of "individual aspects" formally in the form of clusters, which are defined as follows:
Definition 4 (Cluster). A clustering of $X^{t+1}$ is a set $C = \{C_1, ..., C_K\}$ which satisfies $\forall k : C_k \subseteq X^{t+1}$ and $C_1 \cup ... \cup C_K = X^{t+1}$. We refer to the elements $C_k \in C$ as clusters.
The underlying idea behind the concept of clusters is that the variables in a cluster C k are connected in some important sense. Specifically, if two or more variables are in a common cluster, then there exists some relation between these variables regarding the likelihood of values which they may assume. In other words, the variables are correlated in X t+1 .
The number $K$ and the concrete choice of clusters $C_k$ can be specified by the user or generated automatically. For example, they may be specified manually by a domain expert who is familiar with the structure of the modelled system, or generated automatically using methods such as the ones described in Section 6.1. It should be stressed, however, that in order to reduce computational costs, it is advisable to follow the general rule "as small as possible, as large as necessary" when choosing clusters (see Section 5.5 for a discussion about computational complexity). Therefore, if two variables are strongly correlated, then they should presumably be in a common cluster, whereas if they are not or only weakly correlated ("weakly" meaning that the correlation can be ignored safely), then they should be in separate clusters in order to reduce computational costs. This is illustrated in the following example:
Example 4 (Clusters in robot DBN). Recall the robot arm DBN from Example 2, specifically Figure 3. One way to cluster the state variables in $X^{t+1}$ is given by the three clusters $C_1 = \{\theta^{t+1}_1\}$, $C_2 = \{\theta^{t+1}_2\}$, $C_3 = \{\theta^{t+1}_3\}$, as shown in Figure 5a. This clustering is most efficient since it minimises the size of each cluster. However, the clusters fail to capture the important correlation that the joint orientation $\theta_i$ is restricted by the preceding joint orientation $\theta_{i-1}$. Another way to cluster the state variables is given by the single cluster $C_1 = \{\theta^{t+1}_1, \theta^{t+1}_2, \theta^{t+1}_3\}$, as shown in Figure 5b. This clustering captures all correlations between variables. However, this is the largest possible cluster and, therefore, the least efficient one. A compromise is given by the two clusters $C_1 = \{\theta^{t+1}_1, \theta^{t+1}_2\}$, $C_2 = \{\theta^{t+1}_2, \theta^{t+1}_3\}$, which are shown in Figure 5c. This clustering captures the correlation of the joint orientations with the immediately preceding joint orientations, and it is more efficient than the previous clustering since it has smaller clusters.
Given the definition of clusters, we capture the informal notion of "separate beliefs" in the form of belief factors:
Definition 5 (Belief factor). Given a cluster $C_k$, the corresponding belief factor $b_k$ is a probability distribution over the set $S(C_k)$.
Intuitively, a belief factor b k represents the agent's beliefs as to the likelihood of values for the variables in the corresponding cluster C k . An analogy to this is to view a belief factor as a "smaller" belief state, and to view b as the "full" belief state which is a combination of the smaller belief states. However, to distinguish the two, we refer to b simply as the belief state and to b k as a belief factor.
Finally, given the clusters $C_k$ and their corresponding belief factors $b_k$, the belief state $b$ is represented in factored form as
$$b(s) = \prod_{k=1}^{K} b_k(s_k),$$
where we use the notation $s_k$ to refer to the tuple $(s_i)_{x^{t+1}_i \in C_k}$. (For example, if $C_k = \{x^{t+1}_2, x^{t+1}_3\}$ and $s = (s_1, s_2, s_3, s_4)$, then $s_k = (s_2, s_3)$.)
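As a small illustration of the factored representation, the following Python sketch evaluates a belief state as a product of belief factors. The representation of clusters as index tuples and of factors as dictionaries, as well as the example numbers, are assumptions made for this sketch only.

```python
from itertools import product
from math import prod

def joint_belief(state, clusters, factors):
    """Evaluate b(s) as the product of the belief factors b_k(s_k).

    clusters[k] lists the variable indices in cluster C_k, and
    factors[k] maps each tuple s_k to its probability.
    """
    return prod(
        factor[tuple(state[i] for i in cluster)]
        for cluster, factor in zip(clusters, factors)
    )

# Hypothetical example: four binary variables in two disjoint clusters.
clusters = [(0, 1), (2, 3)]
factors = [
    {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4},
    {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25},
]

# With disjoint clusters, the product is a distribution over all states.
total = sum(joint_belief(s, clusters, factors)
            for s in product((0, 1), repeat=4))
```

The two factors store 8 numbers in total, whereas an unfactored belief state over four binary variables would store 16.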

Exploiting Passivity in the Transition Step
In order to perform selective updates over the belief factors $b_k$, we require a procedure which performs the transition step independently for each factor. We obtain such a procedure by introducing two assumptions which allow us to modify the transition step (1) of the exact update rule. The assumptions guarantee that the transition step is performed exactly, in the sense of (1). However, as we will discuss shortly, the assumptions can be violated to obtain approximate belief states. The first assumption, (A1), states that the clusters must be uncorrelated (i.e. there are no edges in $X^{t+1}$ between clusters), and the second assumption, (A2), states that the clusters must be disjoint. Formally, these are defined as follows:
(A1) $\forall k \;\, \forall x^{t+1}_i \in C_k : \mathrm{pa}^{t+1}_a(x^{t+1}_i) \subseteq C_k$
(A2) $\forall k \neq k' : C_k \cap C_{k'} = \emptyset$
Note that neither assumption implies the other. That is, it may be the case that (A1) is satisfied while (A2) is violated, and vice versa. Assuming both (A1) and (A2), we can reformulate (1) as the cluster-wise transition step (6), where $\eta_1$ is a normalisation constant and $T^a_k$ is the cluster-based transition function. This procedure performs the transition step independently for each belief factor $b_k$, hence the factors can be updated in any order and in parallel.
Assumption (A1) is what allows us to bring (1) into a form which updates the belief factors $b_k$ independently of each other. Specifically, (A1) allows us to define the cluster-based transition function $T^a_k$, which in turn enables the summation in (6). Assumption (A2), on the other hand, guarantees that the product in (6) is correct. In particular, it may be the case that $|s_k| < |C_k|$ (i.e. there are fewer elements in $s_k$ than in $C_k$). In such cases, $b^t_k$ is taken to be the marginal distribution over the variables of $C_k$ that are represented in $s_k$, where (A2) guarantees that the marginalisation introduces no errors.
As mentioned previously, each assumption may be violated to obtain approximate belief states. However, there is an important distinction between (A1) and (A2) in this regard: If (A2) is violated, then (6) is still well-defined in the sense that it can still be executed, except that the product in (6) may degrade the accuracy of the results. This is in contrast to (A1), which is a structural requirement of T a k in the sense that T a k is ill-defined without (A1). This is since, if (A1) is violated, the variables in C k may have parents in X t+1 which are not in C k , in which case pa a (x t+1 i ) ← (s, s k ) would be ill-defined. Thus, if (A1) is violated, we have to enforce it by modifying the distributions P a of all x t+1 i ∈ C k to marginalise out all variables in pa t+1 a t (x t+1 i ) which are not in C k , for all clusters C k . This means that each variable has a separate distribution for every cluster which contains the variable, thereby possibly introducing an approximation error.
Given the modified transition step (6), we can exploit passivity to perform selective updates over the belief factors $b_k$. Recall from Section 4.1 that a variable $x^{t+1}_i$ is passive in $\Delta^a$ if there exists a set $\Phi_{a,i}$ of variables such that $x^{t+1}_i$ may change its value only if any of the variables in $\Phi_{a,i}$ changed its value. This causal connection can be used to decide whether or not the values of the variables in a cluster $C_k$ may have changed, in which case the corresponding belief factor $b_k$ should be updated. Theorem 1 provides the formal foundation:
Theorem 1. If (A1) and (A2) hold, and if all $x^{t+1}_i \in C_k$ are passive in $\Delta^{a^t}$, then $\forall s_k \in S(C_k) : \hat{b}^{t+1}_k(s_k) = b^t_k(s_k)$.
Theorem 1 states that if the clusters $C_1, ..., C_K$ are disjoint and uncorrelated, and if all variables in cluster $C_k$ are passive in $\Delta^{a^t}$, then the transition step for the corresponding belief factor, $b^t_k \to \hat{b}^{t+1}_k$, can be omitted without loss of information. How does Theorem 1 translate into situations in which (A1) or (A2), or both, are violated?
The key assumption is again (A1), which states that the clusters must be uncorrelated. As discussed earlier, we can enforce this by modifying the variable distributions $P_a$ in each cluster. However, if a passive variable $x^{t+1}_i \in C_k$ is correlated with a (passive or active) variable $x^{t+1}_j \notin C_k$ with $x^{t+1}_j \in \mathrm{pa}_a(x^{t+1}_i)$, then marginalising out $x^{t+1}_j$ in the distribution $P_a$ of $x^{t+1}_i$ will typically cause $x^{t+1}_i$ to lose its passivity, in the sense that it would no longer satisfy the clauses in Definition 3. Consequently, we would always have to perform the transition step for $C_k$, even if the unmodified variables in $C_k$ are all passive. This is problematic not only because of the unnecessary computations, but also because the modified distributions will introduce an error every time the transition step is performed.
Figure 6: Robot arm DBN implementing the action $CW_3$. Dashed circles mark passive state variables. The coloured ellipses represent the clusters $C_1$ and $C_2$.
To alleviate this effect, one can check if there is a chance that the unmodified variables in the cluster would change their values. It can be shown that this is the case whenever there is a causal path from any active variable to a variable in the cluster:
Definition 6 (Causal path). A causal path in $\Delta^a$, from an active variable $x^{t+1}_i$ to another variable $x^{t+1}_j$, is a sequence $x^{(1)}, x^{(2)}, ..., x^{(Q)}$ such that $x^{(1)} = x^{t+1}_i$, $x^{(Q)} = x^{t+1}_j$, and for all $1 \le q < Q$: $x^{(q+1)}$ is passive in $\Delta^a$ with respect to $x^{(q)}$.
Intuitively, a causal path defines a chain of causal effects (such as between joints 1 and 3 in Example 1): since the active variable $x^{(1)}$ may have changed its value and $x^{(2)}$ is passive with respect to $x^{(1)}$, $x^{(2)}$ may also have changed its value; since $x^{(2)}$ may have changed its value and $x^{(3)}$ is passive with respect to $x^{(2)}$, $x^{(3)}$ may also have changed its value, etc. Hence, in the absence of observing these changes, the mere existence of a causal path from $x^{(1)}$ to $x^{(Q)}$ is reason to revise our beliefs about $x^{(Q)}$. Therefore, as a general update rule, we can omit the transition step $b^t_k \to \hat{b}^{t+1}_k$ if all unmodified variables in cluster $C_k$ are passive in $\Delta^{a^t}$, and if there is no causal path from any active variable in $\Delta^{a^t}$ to any variable in $C_k$. This is demonstrated in the following example:
Example 5 (PSBF update rule in robot arm DBN). Let us again consider the robot arm from the previous examples. Figure 6 shows a DBN which implements the action $CW_3$. This action rotates joint 3 of the robot arm by 1° clockwise (i.e. the joint orientation $\theta^{t+1}_3$ is a direct target of the action). Therefore, the variable $\theta^{t+1}_3$ is active while the variables $\theta^{t+1}_1$ and $\theta^{t+1}_2$ are passive (shown as dashed circles). We use the clustering $C_1 = \{\theta^{t+1}_1, \theta^{t+1}_2\}$, $C_2 = \{\theta^{t+1}_2, \theta^{t+1}_3\}$ for reasons given in Example 4.
Since $\theta^{t+1}_1$ is a parent of $\theta^{t+1}_2$, PSBF will have to enforce assumption (A1) by marginalising $\theta^{t+1}_1$ out of the variable distribution $P_a$ of $\theta^{t+1}_2$ in cluster $C_2$. While the modified variable distribution loses the passivity property (both clauses of Definition 3 are violated), the unmodified distribution of $\theta^{t+1}_1$ is still passive.
Algorithm 2 SkippableClusters($C$, $\Delta^a$). Input: clustering $C = \{C_1, ..., C_K\}$, DBN $\Delta^a$. Output: set of clusters $C^* \subset C$ which can be skipped in the transition step.
When performing the transition step, PSBF has to update the belief factor b 2 because the corresponding cluster C 2 contains the active variable θ t+1 3 . However, since all variables in cluster C 1 are passive (there are no modified variables in C 1 ), and since there is no causal path from θ t+1 3 to any variable in C 1 , PSBF can omit the update for the belief factor b 1 . Intuitively, this makes sense since a change in the orientation of joint 3 cannot cause a change in the orientations of the preceding joints. Note that this corresponds to a saving of 50% in the transition step.
Algorithm 2 defines a procedure which utilises this rule to find clusters for which the transition step can be skipped. The algorithm takes as inputs a clustering $C$ and a DBN $\Delta^a$, and returns a set $C^*$ of skippable clusters. It essentially searches through all active variables $x^{t+1}_i$ in $\Delta^a$ and removes all clusters $C_k$ from $C$ which contain variables to which there is a causal path from $x^{t+1}_i$. The function OrderedQueue($X^{t+1}$) returns an ordered queue $Q$ with all variables in $X^{t+1}$. The performance of Algorithm 2 depends on the order of the queue. In our experiments, we obtained good performance by ordering the variables in descending order of their number of outgoing edges. The function NextElement($Q$) returns the next element in the queue; the function Passive($x^{t+1}_i$, $\Delta^a$) is defined in Algorithm 1; and the function CausalPath($x^{t+1}_i$, $x^{t+1}_j$, $\Delta^a$) returns a logical true if and only if there is a causal path from $x^{t+1}_i$ to $x^{t+1}_j$ in $\Delta^a$. Note that, given the invariance of passivity to process states (cf. Section 4.1), it suffices to call Algorithm 2 only once (in advance or as needed) to determine which of the clusters to omit in the transition step.
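The update rule can be sketched in Python as a reachability search over passivity relations. This is a simplified illustration under stated assumptions, not a transcription of Algorithm 2: it assumes that the passivity sets have already been computed, it ignores the queue ordering, and it treats every variable without a passivity set as active. The function name, the `passive_wrt` mapping, and the passivity sets assumed for the robot arm are hypothetical.

```python
from collections import deque

def skippable_clusters(clusters, variables, passive_wrt):
    """Return the clusters whose transition step can be skipped.

    passive_wrt maps each passive variable to the set of variables it is
    passive with respect to; variables missing from the map are active.
    """
    active = [v for v in variables if v not in passive_wrt]
    # Collect the active variables and everything reachable from them
    # through a chain of "is passive with respect to" relations.
    affected = set(active)
    queue = deque(active)
    while queue:
        source = queue.popleft()
        for target, phi in passive_wrt.items():
            if source in phi and target not in affected:
                affected.add(target)
                queue.append(target)
    return [c for c in clusters if not (set(c) & affected)]

# Robot arm under action CW_3: joint 3 is active. We assume here that
# joint 2 is passive with respect to joint 1, and joint 1 with respect
# to the empty set.
variables = ["theta1", "theta2", "theta3"]
clusters = [("theta1", "theta2"), ("theta2", "theta3")]
passive_wrt = {"theta1": set(), "theta2": {"theta1"}}

print(skippable_clusters(clusters, variables, passive_wrt))
```

Under these assumptions only the cluster containing joints 1 and 2 is returned, which matches the outcome described in Example 5. Variables whose distributions were modified to enforce (A1) would have to be left out of `passive_wrt` so that they count as active.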

Efficient Incorporation of Observations
PSBF can perform the observation step similarly to the exact update rule (2), which conditions the propagated belief state $\hat{b}^{t+1}$ on the observation $o^{t+1}$ to obtain a fully updated belief state $b^{t+1}$. However, given the factored belief state representation used by PSBF, we require a procedure which respects this factorisation in the observation step. Assuming that (A1) and (A2) both hold, we can bring (2) into a form, referred to as (7), which updates the belief factors $b_k$ independently of each other and in which $\eta_2$ is a normalisation constant. Note that, analogously to (6), if there are variables in $C_k$ which are not in $\mathrm{pa}^{t+1}_{a^t}(Y^{t+1})$, then $\hat{b}^{t+1}_k$ is taken to be the marginal distribution over $C_k \cap \mathrm{pa}^{t+1}_{a^t}(Y^{t+1})$. Assumption (A2) guarantees that the marginalisation introduces no errors. If (A1) and (A2) both hold, then the transition step (6) and observation step (7) produce exact belief states in the sense of (1) and (2), regardless of how many clusters were skipped in the transition step (cf. Theorem 1).
The observation step (7) updates all belief factors and uses all observation variables in the process. In other words, it ignores the internal structure of the observation variables. However, it is clear that if the variables in a cluster $C_k$ are marginally independent of the observation variables $Y^{t+1}$ (this can be determined using d-separation (Geiger et al., 1989), or simply by checking if there is a directed path from $C_k$ to $Y^{t+1}$), then there is no need to perform the observation step for the corresponding belief factor $b_k$. This is expressed formally in Theorem 2:
Theorem 2. If the variables in $C_k$ are marginally independent of the variables in $Y^{t+1}$, then $\forall s_k \in S(C_k) : b^{t+1}_k(s_k) = \hat{b}^{t+1}_k(s_k)$.
Theorem 2 states that if the variables in $C_k$ are independent of those in $Y^{t+1}$, then the observation step for $b_k$ can be skipped. However, even if $C_k$ is not independent of $Y^{t+1}$, it may be the case that the variables in $C_k$ depend only on a subset $Y_k \subset Y^{t+1}$ of the observation variables. Clearly, in such cases, it suffices to use $Y_k$ rather than $Y^{t+1}$ in the observation step. To account for this, we first note that the variables in $Y^{t+1}$ may be correlated with each other. To preserve the correlations, we subdivide $Y^{t+1}$ into clusters $\hat{C}_l \subseteq Y^{t+1}$ and introduce the following assumptions:
(A3) The observation clusters are uncorrelated (i.e. there are no edges in $Y^{t+1}$ between observation clusters)
(A4) $\forall l \neq l' : \hat{C}_l \cap \hat{C}_{l'} = \emptyset$
Assumptions (A3) and (A4) are analogous to (A1) and (A2), respectively, and essentially serve the same purposes for the observation step. To distinguish the clusters $C_k$ and $\hat{C}_l$, we sometimes refer to the former as state clusters and to the latter as observation clusters.
Assuming that (A3) and (A4) both hold, we can redefine the observation step as (8), where $Y_k \subset Y^{t+1}$ is the set of observation variables which are not marginally independent of the variables in $C_k$. Given Theorem 2, one can see that (8) is equivalent to (7) if the observation variables are not clustered (or, equivalently, there is a single observation cluster $\hat{C}_l = Y^{t+1}$). However, it is important to note that if the observation variables are clustered (i.e. there are multiple observation clusters $\hat{C}_l$), then (8) is not necessarily equivalent to (7). To see this, it is helpful to compare the abstract formulations of the two steps. However, (8) approximates (7) closely as we increase the number of state variables $n$. Our experiments indicate that it often suffices to use just a few more state variables than observation variables in order to obtain good approximations.
Finally, to show that it suffices to perform the observation step for $b_k$ using only those clusters $\hat{C}_l$ whose variables are not independent of the variables in $C_k$, we observe that (8) is in fact a repeated application of (7) for every $\hat{C}_l$, where the updated belief factor $b^{t+1}_k$ is used in place of $\hat{b}^{t+1}_k$ in the subsequent application. Since every application has the same form as (7) (with $Y^{t+1} = \hat{C}_l$), we conclude that Theorem 2 holds, and hence the observation step can be skipped for clusters $\hat{C}_l$ which are independent of $C_k$.

Summary of PSBF
The preceding sections can be summarised as follows:
• Representation: The belief state $b^t$ is represented as a product of $K$ belief factors $b^t_k$, such that $b^t(s) = \prod_{k=1}^{K} b^t_k(s_k)$. Each belief factor $b^t_k$ is a probability distribution over the set $S(C_k)$, where $C_k \subseteq X^{t+1}$ is a cluster of correlated state variables.
• Transition step: The transition step $b^t_k \to \hat{b}^{t+1}_k$ is performed using (6), for all clusters $C_k$ which include active variables in $\Delta^{a^t}$, or to which there is a causal path from an active variable in $\Delta^{a^t}$. All other clusters are skipped.
• Observation step: The observation step $\hat{b}^{t+1}_k \to b^{t+1}_k$ is performed using (8), for all clusters $C_k$ which are dependent on the observation variables $Y^{t+1}$, using only those observation clusters $\hat{C}_l$ which are relevant for $C_k$. All other clusters are skipped.
Algorithm 3 provides a procedural specification of PSBF. The algorithm takes as inputs the action at time $t$, $a^t$, the subsequent observation at time $t+1$, $o^{t+1}$, and the belief factors at time $t$, $b^t_k$. The internal parameters are the state clustering $C$, the observation clustering $\hat{C}$, and the set of DBNs $(\Delta^a)_{a \in A}$ which define the process. Lines 4 to 11 implement the transition step while lines 12 to 19 implement the observation step. Note that it suffices to execute lines 5 and 14 once in advance (or on demand) and to remember the results for future reference. The algorithm returns the updated belief factors $b^{t+1}_k$.

Space and Time Complexity
A belief factor $b_k$ has one element $b_k(s_k)$ for each $s_k \in S(C_k)$ (see footnote 5). Thus, the total space required to maintain $K$ belief factors $b_k$ is $\sum_{k=1}^{K} |S(C_k)|$. Furthermore, the size of the set $S(C_k)$ grows exponentially with the number of variables in $C_k$, hence the dominant growth factor in the space requirement is given by the largest cluster $C_{k'}$ such that $|C_{k'}| = \max_k |C_k|$. Therefore, the space complexity of PSBF is in $O(\exp \max_k |C_k|)$, hence the representation is feasible for reasonably small clusters $C_k$.
Similarly, the number of operations required to perform the transition and observation steps is in the order of $2 \sum_{k=1}^{K} |S(C_k)|$ in the worst case (i.e. all clusters need to be updated in both steps). Specifically, line 11 and line 19 in Algorithm 3 are each executed once for every $s_k \in S(C_k)$. The dominant growth factor is again given by the largest cluster $C_k$, hence the time complexity of PSBF is in $O(2 \exp \max_k |C_k|) = O(\exp \max_k |C_k|)$. Note that this assumes that the analysis performed by lines 5 and 14 in Algorithm 3 is done in advance.
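To make the growth rates concrete, the following Python sketch compares the number of stored entries for a factored and an unfactored belief state. The process size and cluster sizes are hypothetical, and all variables are assumed to have the same number of values.

```python
def factored_size(cluster_sizes, arity=2):
    """Entries needed to store one belief factor per cluster."""
    return sum(arity ** size for size in cluster_sizes)

def flat_size(num_variables, arity=2):
    """Entries in an unfactored belief state over all variables."""
    return arity ** num_variables

# Hypothetical process: 20 binary variables in 5 disjoint clusters of 4.
factored = factored_size([4] * 5)  # 5 * 2**4 entries
flat = flat_size(20)               # 2**20 entries

# A single large cluster dominates the factored size.
skewed = factored_size([12, 2, 2, 2, 2])
```

In the skewed clustering the cluster of 12 variables alone accounts for 4096 of the 4112 entries, which illustrates why the largest cluster determines the complexity.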
The above time complexity is for the worst case, in which all clusters need to be updated in the transition and observation steps. It is difficult to derive the time complexity for the average case because it is unclear what the average case is in terms of passivity. Even if we stipulate a certain average degree of passivity (e.g. 50% of all variables are passive), it would still be difficult to make a general statement about time requirements since this depends crucially on how the passive variables are distributed across the clusters. For example, even if a process has on average 90% passivity, if there is one active variable in each cluster then every cluster would need to be updated in the transition step. Thus, the only general statement we can make with regards to passivity is that the time complexity of PSBF can be refined to O(exp max C k ∈ C T ∪ C O |C k |), where C T and C O include only those clusters that need to be updated in the transition and observation step, respectively.

Error Bounds
There are five possible sources of approximation errors in PSBF:
• If the clusters are correlated (i.e. (A1) or (A3) are violated)
• If the clusters are overlapping (i.e. (A2) or (A4) are violated)
• Generally in (8) if multiple observation clusters $\hat{C}_l$ are used
In the first two cases, the approximation error depends on the amount of correlation and overlap. If there is only little correlation and overlap between the clusters, then the approximation error can be expected to be small. Conversely, if the clusters are strongly correlated and overlapping, then the approximation error can be expected to be large. Boyen and Koller (1998) provide a useful analysis of the error bound of any filtering method which uses a factored belief state representation. Since PSBF uses a factored representation, their analysis applies directly to PSBF. The purpose of this section is to restate the main result of their analysis in the context of our work.
5. In practice, it suffices to store only $|S(C_k)| - 1$ elements, but this is irrelevant in our analysis.
Their analysis uses the concept of relative entropy (Kullback & Leibler, 1951) as a measure of similarity for belief states:
Definition 7 (Relative entropy). Let $\varphi$ and $\psi$ be two probability distributions defined over a set $X$. The relative entropy from $\varphi$ to $\psi$ is defined as
$$\mathrm{KL}(\varphi \,\|\, \psi) = \sum_{x \in X} \varphi(x) \ln \frac{\varphi(x)}{\psi(x)}.$$
Similar to Boyen and Koller (1998), we define the approximation error incurred by PSBF relative to the exact belief state. However, since we consider a decision process with multiple actions $a \in A$ (represented by the DBNs $\Delta^a$), we define the error for each action respectively:
Definition 8 (Approximation error). Let $b$ be an exact belief state and $\tilde{b}$ be the approximation by PSBF. After taking action $a$, let $b'$ be the exact update of $b$ (using (1) and (2)) and $\tilde{b}'$ be the PSBF-update of $\tilde{b}$ (using (6) and (8)). Furthermore, let $\hat{b}'$ be the exact update of $\tilde{b}$ (using (1) and (2)). We say that PSBF incurs error $\epsilon_a$ in $\Delta^a$ relative to $b'$ if
$$\mathrm{KL}(b' \,\|\, \tilde{b}') - \mathrm{KL}(b' \,\|\, \hat{b}') \le \epsilon_a.$$
The analysis also relies on the concept of mixing rates. Intuitively, the mixing rate $\gamma^a$ of a DBN $\Delta^a$ quantifies the degree of stochasticity in $\Delta^a$. It depends on the mixing rates $\gamma^a_k$ of the individual clusters $C_k$:
Definition 9 (Mixing rate). The mixing rate of a cluster $C_k \subset X^{t+1}$ in $\Delta^a$ is defined as
$$\gamma^a_k = \min_{s', s''} \sum_{s} \min\left[ T^a_k(s', s), \, T^a_k(s'', s) \right].$$
If all $C_k$ satisfy (A1) and (A2), and if all observation variables $Y^{t+1}$ are in one observation cluster, then the mixing rate of $\Delta^a$ is given by $\gamma^a = (\min_k \gamma^a_k / r)^q$ where each cluster $C_k$ depends on at most $r$ and influences at most $q$ other clusters $C_{k' \neq k}$ (Boyen & Koller, 1998). In the worst case (that is, all (A1-A4) are violated), the minimal mixing rate is given by $\gamma^a_k$ for the single cluster $C_k = X^{t+1}$.
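For a tabular transition function, the mixing rate of a single cluster can be computed directly. The Python sketch below assumes the definition of Boyen and Koller (1998), i.e. the minimum over all pairs of source states of the summed pointwise minimum of their next-state distributions. The matrix representation and the example matrices are hypothetical.

```python
def mixing_rate(T):
    """Mixing rate of a row-stochastic transition matrix T.

    T[s1][s] is the probability of moving from state s1 to state s.
    The result is the smallest overlap between the next-state
    distributions of any two source states.
    """
    n = len(T)
    return min(
        sum(min(T[s1][s], T[s2][s]) for s in range(n))
        for s1 in range(n)
        for s2 in range(n)
    )

# A deterministic process has no overlap between rows.
deterministic = [[1.0, 0.0],
                 [0.0, 1.0]]

# A noisy process: the two rows share some probability mass.
noisy = [[0.9, 0.1],
         [0.2, 0.8]]

rate_deterministic = mixing_rate(deterministic)
rate_noisy = mixing_rate(noisy)
```

A deterministic process has mixing rate zero, whereas stochastic transitions yield a positive rate, which is consistent with reading the mixing rate as a measure of stochasticity.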
Finally, the main result in the work of Boyen and Koller (1998), here restated in the context of our work in Theorem 3, essentially states that the approximation error of PSBF (measured in terms of relative entropy) is bounded by the mixing rates of the process: Theorem 3 (Boyen & Koller, 1998). Let b t be an exact belief state andb t be the approximation by PSBF using clusters C k . Then, for any t with states s = (s 0 , ..., s t ) and actions a = (a 0 , ..., a t−1 ), we have

Experimental Evaluation
We evaluated PSBF in two experimental domains: In Section 6.1, we evaluated PSBF in synthetic (i.e. randomly generated) processes with varying sizes and degrees of passivity. In Section 6.2, we evaluated PSBF in a simulation of a multi-robot warehouse system. A brief summary of the experimental results is given in Section 6.3.

Synthetic Processes
We first evaluated PSBF in a series of synthetic processes. PSBF is compared with a selection of alternative methods, including PF (Gordon et al., 1993), RBPF (Doucet et al., 2000), BK (Boyen & Koller, 1998), and FF (Murphy & Weiss, 2001); see Section 2 for a discussion of these methods. The algorithms were implemented in Matlab 7.13, where we used the Matlab toolbox BNT (Murphy, 2001) to implement BK and FF.

Specification of Synthetic Processes
We generated synthetic processes of four different sizes which are specified in Table 1. Each process was generated as follows: First, each variable x t+1 i is chosen to be passive with probability p, in which case we also add the edge (x t i , x t+1 i ). We refer to p as the degree of passivity. To sample further edges from X t /X t+1 to X t+1 , we generate a mixture of Gaussians G using Algorithm 4 (see Appendix C). Figure 7 shows an example of G generated for a process of size M. The set G is used to produce "areas" of correlated variables (i.e. the Gaussians), which will then constitute natural candidates for state clusters.
Let $\omega$ be the vector of maximum densities for each Gaussian in $G$, and let $\delta_i$ be the vector of densities at value $i \in \mathbb{N}$. Then, for every combination of $i$ and $j$, the edge $(x^t_i, x^{t+1}_j)$ is added with probability equal to the maximum element in $\delta_i \delta_j / \omega^2$, in which all operators are point-wise. If $x^{t+1}_i$ was chosen to be passive, then the edge $(x^t_i, x^{t+1}_j)$ is only added if $i < j$. In that case, we also add the edge $(x^{t+1}_i, x^{t+1}_j)$. Edges $(x^{t+1}_i, x^{t+1}_j)$ are added similarly for each $i < j$ (the condition $i < j$ in both cases is to ensure that the resulting DBN is acyclic), where we also add the edge $(x^t_i, x^{t+1}_j)$ for passive $x^{t+1}_j$. To ensure that every variable has an effect in the generated process, each $x^t_i$ is connected to at least one $x^{t+1}_j$ (adding $(x^t_i, x^{t+1}_i)$ if necessary) and each $x^{t+1}_j$ has at least one parent in $X^t$ or $X^{t+1}$ (adding an edge if necessary). The more two variables are under the peak of a common Gaussian, the higher the probability that an edge will be added between them. Finally, edges $(x^{t+1}_i, y^{t+1}_j)$ are added with probability 0.1, for each $i$, $j$, while ensuring that each $y^{t+1}_j$ has at least one parent in $X^{t+1}$. All variables in the process are binary. Passive variables are assumed to be passive with respect to all of their parents in $X^t$. The distributions $P_a$ of $x^{t+1}_i \in X^{t+1}$ are generated uniformly randomly without bias. For passive variables $x^{t+1}_i$, we modify $P_a$ to satisfy clause (ii) in Definition 3. The distributions $P_a$ of $y^{t+1}_j \in Y^{t+1}$ are generated with each probability sampled uniformly from either [0.0, 0.2] or [0.8, 1.0], to obtain meaningful observations. Finally, every process consists of two actions. These are obtained by randomly choosing between one and three variables $x^{t+1}_i$ whose distributions $P_a$ are resampled as above and edges from $X^t$ added with probability 0.1 (passive variables chosen in this way are no longer passive). During simulations, these actions are chosen uniformly randomly.
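The edge probability described above can be sketched in Python as follows. The sketch assumes that each Gaussian is given by a hypothetical (mean, standard deviation) pair; the Gaussians produced by Algorithm 4 are not reproduced here. Dividing each density by the maximum density of the same Gaussian cancels the normalising constant (and any mixture weight), so only the exponential term remains.

```python
from math import exp

def edge_probability(i, j, gaussians):
    """Maximum element of delta_i * delta_j / omega**2 (point-wise).

    gaussians is a list of (mean, std) pairs. For each Gaussian, the
    density at i divided by its peak density is exp(-(i - mean)**2 /
    (2 * std**2)), and likewise for j.
    """
    return max(
        exp(-((i - mu) ** 2 + (j - mu) ** 2) / (2.0 * sigma ** 2))
        for mu, sigma in gaussians
    )

# Hypothetical mixture with two "areas" of correlated variables.
gaussians = [(5.0, 2.0), (20.0, 3.0)]

p_near = edge_probability(5, 6, gaussians)   # both close to the first peak
p_far = edge_probability(5, 20, gaussians)   # under different peaks
```

Two indices close to the peak of the same Gaussian receive a probability near one, while indices under different peaks receive a probability near zero.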
Each process starts in a random initial state, and all algorithms are tested on the same sequence of processes, initial states, chosen actions, and random numbers.
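The edge-sampling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the generator used in the experiments: the mixture $G$ is passed in as a hypothetical list of (mean, standard deviation) pairs rather than produced by Algorithm 4, variable indices double as positions on the real line, and only the edges from $X^t$ to $X^{t+1}$ (plus the linked edges within $X^{t+1}$ for passive variables) are sampled. The names `gaussian_density`, `edge_probability`, and `sample_edges` are illustrative.

```python
import math
import random


def gaussian_density(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def edge_probability(i, j, gaussians):
    """Largest element of delta_i * delta_j / omega^2, taken point-wise over G."""
    best = 0.0
    for mean, std in gaussians:
        omega = gaussian_density(mean, mean, std)  # maximum density of this Gaussian
        ratio = (gaussian_density(i, mean, std)
                 * gaussian_density(j, mean, std) / omega ** 2)
        best = max(best, ratio)
    return best


def sample_edges(n, p, gaussians, rng):
    """Sample passive flags, edges X^t -> X^{t+1}, and linked edges within X^{t+1}."""
    passive = [rng.random() < p for _ in range(n)]
    # A passive variable always depends on its own previous value.
    inter = {(i, i) for i in range(n) if passive[i]}
    intra = set()
    for i in range(n):
        for j in range(n):
            # For passive x_i, the edge (x_i^t, x_j^{t+1}) is only allowed if i < j.
            if passive[i] and i >= j:
                continue
            if rng.random() < edge_probability(i, j, gaussians):
                inter.add((i, j))
                if passive[i]:
                    intra.add((i, j))  # also add (x_i^{t+1}, x_j^{t+1})
    return passive, inter, intra
```

Two variables that both sit near the mean of the same Gaussian get an edge probability close to 1, while variables far from every mean are almost never connected, which is what produces the "areas" of correlated variables.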

Clustering Methods
We used three different clustering methods, denoted pc, moral, and modis. The methods were applied to the variables in $X^{t+1}$ without edges involving $X^t$ or $Y^{t+1}$:
• pc drops the directions of the edges (i.e. for any edge $x^{t+1}_i \to x^{t+1}_j$ it adds the reverse edge $x^{t+1}_j \to x^{t+1}_i$) and puts all variables between which there is an (undirected) path into one cluster. By definition, the resulting clusters satisfy all assumptions (A1-A4).
• moral connects all parents of a variable and drops the directions (it "moralises" the variables) and then extracts clusters of fully connected variables ("maximum cliques"). The resulting clusters may not satisfy any of the assumptions (A1-A4).
• modis is similar to moral but truncates the resulting clusters to make them disjoint (clusters are removed if they become a subset of another cluster). By definition, the resulting clusters satisfy (A2/A4), but not necessarily (A1/A3).
As an example, consider Figure 5 from Section 5.1. Here, pc would produce the cluster $C_1$ from Figure 5b, since all variables are connected by an undirected path. Furthermore, moral would produce the two clusters $C_1$ and $C_2$ from Figure 5c, which correspond to the two maximum cliques after moralising the variables in $X^{t+1}$. Finally, modis would produce the cluster $C_1$ from Figure 5c and the cluster $C_3$ from Figure 5a.

PSBF used the same clustering method to generate clusters of state variables ($C_k$) and observation variables ($\hat{C}_l$). Moreover, PSBF enforced (A1/A3) whenever necessary by modifying the variable distributions as described in Section 5.1.
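To make the first two clustering methods concrete, the following Python sketch computes pc clusters as connected components of the undirected graph and moral clusters as maximal cliques of the moralised graph. It is a minimal illustration, not the implementation used in the experiments: the function names and the small example graph are hypothetical, the clique search is a plain Bron-Kerbosch recursion without pivoting, and modis is omitted because the text does not specify how overlapping clusters are truncated.

```python
from itertools import combinations


def pc_clusters(variables, edges):
    """Connected components after dropping edge directions (union-find)."""
    parent = {v: v for v in variables}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for v in variables:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=sorted)


def moral_clusters(variables, edges):
    """Maximal cliques of the moralised graph."""
    adj = {v: set() for v in variables}
    parents = {v: set() for v in variables}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
        parents[v].add(u)
    for ps in parents.values():
        for a, b in combinations(ps, 2):  # connect parents of a common child
            adj[a].add(b)
            adj[b].add(a)
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended: it is a maximal clique
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(variables), set())
    return sorted(cliques, key=sorted)


variables = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "c"), ("c", "d")]  # directed edges within X^{t+1}
print(pc_clusters(variables, edges))
print(moral_clusters(variables, edges))
```

In this toy graph, pc merges a, b, c, and d into one cluster, whereas moral yields the overlapping clusters {a, b, c} and {c, d}; the variable c shared between them is the kind of overlap that modis is described as removing.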

Accuracy
In order to compare the accuracy of the tested algorithms, we computed the relative entropy (cf. Definition 7) from exact belief states obtained using the exact update rule (cf. Definition 1) to the approximate belief states produced by the tested algorithms. However, since exact belief states and relative entropy are hard to compute for large processes, we were able to compare the accuracy of algorithms in processes of size S only. All algorithms were initialised with uniform belief states, or uniformly sampled particles.
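As a rough illustration of the accuracy measure, the sketch below computes the relative entropy from an exact belief state to an approximate one. It assumes that Definition 7 is the standard Kullback-Leibler divergence with natural logarithm and that belief states are given as dictionaries mapping states to probabilities; both are assumptions made for this example, and `relative_entropy` is an illustrative name.

```python
import math


def relative_entropy(exact, approx):
    """Sum over states of exact(s) * log(exact(s) / approx(s)), in nats."""
    total = 0.0
    for state, p in exact.items():
        if p == 0.0:
            continue  # terms with zero exact probability contribute nothing
        q = approx.get(state, 0.0)
        if q == 0.0:
            return math.inf  # approx excludes a state the exact belief allows
        total += p * math.log(p / q)
    return total


exact = {"00": 0.5, "01": 0.5}
approx = {"00": 0.25, "01": 0.75}
print(relative_entropy(exact, exact))   # identical beliefs give zero
print(relative_entropy(exact, approx))  # equals 0.5 * log(4/3)
```

The sum ranges over the full joint state space, which is why the comparison was feasible only for processes of size S.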
We first compared the accuracy of PSBF and BK, since they use the same factorisation in their belief state representations. Figure 8 shows the relative entropy of PSBF and BK averaged over 1000 processes with 0%, 20%, 40%, 60%, 80%, and 100% passivity, respectively. The results show that PSBF pc/modis produced a lower relative entropy (i.e. higher accuracy) than BK pc/modis , and that PSBF moral produced a relative entropy comparable to that of BK moral . This indicates that violations of (A2/A4) introduce smaller errors than violations of (A1/A3). Note that PSBF and BK had the same convergent behaviour in their relative entropy, which shows that the approximation error due to the factorisation was bounded, as discussed in Section 5.6. This is interesting since PSBF and BK obtain approximation errors from the factorisation in different ways: PSBF loses accuracy by modifying the variable distributions to ensure that the state clusters are independent (cf. Section 5.2), while BK loses accuracy by marginalising out the original factorisation after the inference (i.e. the "projection step"; cf. Section 2.1). Nevertheless, as shown in our results, the resulting approximation errors were bounded in both cases, with similar convergence.
Note that the relative entropy of both methods increased with the degree of passivity in the process. This is explained by the fact that higher passivity implies higher determinacy and, therefore, lower mixing rates (cf. Definition 9), which are a crucial factor in the error bounds of PSBF and BK (cf. Theorem 3). Finally, note that PSBF did not produce exact belief states (i.e. zero relative entropy) when using pc clustering, despite the fact that the clusters generated by pc satisfy all assumptions (A1-A4). However, as discussed in detail in Sections 5.3 and 5.6, another possible source of approximation errors is if multiple observation clusters are used, which was often the case when using pc to produce observation clusters.
To compare the accuracy of PF/RBPF with PSBF/BK, the number of samples used in PF/RBPF was chosen automatically in each process such that they required approximately as much time per belief update as PSBF moral and BK moral, respectively. In our experiments, this meant that PF (RBPF) was only able to process between 100 and 300 (20 and 50) samples. However, since each process has over 1000 states, this was not nearly enough to represent a uniform belief state. Hence, PF/RBPF produced much higher relative entropy than PSBF/BK. Moreover, the fact that the processes have very high variance means that PF/RBPF would require many more samples to achieve the same accuracy as PSBF/BK (as shown in the next section). One would expect that this latter issue was alleviated by the use of exact inference in RBPF (cf. Section 2.1). However, this is only the case if much of the variance in the process can be captured in the marginal distributions used in the particles in RBPF. In contrast, our synthetic processes exhibit high variance across all variables, and our automatic grouping⁷ of state variables into "sampled" and "exact" variables still contained much variance in the sampled variables. Hence, RBPF required significantly more samples than the number it could process in the time provided.

Figure 8: Accuracy results for PSBF and BK. Plots show relative entropy from exact to algorithms' belief states (lower is better). Results are averaged over 1000 processes of size S (n = 10, m = 3), where on average 0%-100% of non-target variables were passive (cf. Section 6.1.1). PSBF/BK used clustering methods pc, moral, and modis.
Finally, in order to compare the accuracy of FF with PSBF/BK, the number of iterations used in FF (more precisely, the number of iterations in loopy belief propagation; cf. Murphy & Weiss, 2001) was chosen automatically in each process such that FF required approximately as much time per belief update as PSBF moral and BK moral , respectively. However, while FF was often able to perform several iterations in the provided time, the resulting relative entropy was again substantially higher than that of PSBF/BK. The problem is that FF was designed for a specific class of DBN topologies, namely those containing no edges within X t+1 (called "regular" DBNs by Murphy & Weiss, 2001). This is what allows FF to use a fully factored representation of belief states, in which each variable is its own belief factor. However, the processes used in our experiments have high intra-correlation between state variables (i.e. many edges in X t+1 ), especially with increasing passivity. These correlations cannot be captured in the belief state representation of FF, resulting in a significantly higher relative entropy than PSBF/BK.

Timing
We measured computation times in processes of sizes S, M, L, and XL, each with passivities of 25%, 50%, 75%, and 100%. PSBF and BK used moral clustering, which seemed most appropriate for a fair comparison since it produced consistently similar accuracy for both algorithms. The number of samples used in PF was chosen automatically in each process such that PF achieved an average accuracy approximately as good as that of PSBF and BK, respectively, in the final 20% of the process. As this involved computing exact belief states and relative entropies, we were able to use PF in processes of size S only. We omit RBPF and FF in this section as they were shown in the previous section to be unsuitable for the processes we consider. PSBF was tested with 1, 2, and 4 parallel processes, which were allocated approximately the same number of belief factors. Figures 9a-9d show the times for 1000 transitions averaged over 1000 processes, and Figure 9e shows the average percentage of belief factors that were updated in the transition and observation steps of PSBF. The timing reported for PSBF includes the time taken to modify variable distributions (in case of overlapping clusters) and to detect skippable clusters in the transition and observation steps, both of which were done once in advance for each action.

The results show that PSBF was able to reduce the time requirements significantly by exploiting passivity. First, we note that there were only marginal gains from 25% to 50% passivity, despite the fact that PSBF updated 14% fewer clusters in the transition step. This is because these clusters were mostly very small. However, there were significant gains from 50% to 75% passivity with average speed-ups of 11% (S), 14% (M), 15% (L), 18% (XL), and from 75% to 100% passivity with further average speed-ups of 11% (S), 33% (M), 46% (L), 49% (XL). This shows that the computational gains can grow significantly with both the degree of passivity and the size of the process.

Figure 9: (a)-(d) Times for 1000 transitions, averaged over 1000 processes. Passivity of p% means that on average p% of non-target variables were passive (cf. Section 6.1.1). PSBF and BK used moral clustering. PF was optimised for binary variables and used the number of samples needed to achieve the accuracy of PSBF and BK, respectively. PSBF was run with 1 (PSBF-1), 2 (PSBF-2), and 4 (PSBF-4) parallel processes. (e) Average percentage of belief factors which were updated in the transition and observation steps, respectively.

7. It is an open question how to group state variables into "sampled" and "exact" variables (Doucet et al., 2000). We used a simple heuristic whereby the set of sampled variables contained all variables $x^{t+1}_i$ that had no parents in $X^t$ or $X^{t+1}$, or none other than $x^t_i$. The remaining variables in $X^{t+1}$ constituted the set of exact variables. To ensure that the resulting grouping was valid for all actions (i.e. DBNs) in a process, we considered edges in all involved DBNs; that is, we performed the grouping over the union of $E^a$ for all $a$. Moreover, to improve efficiency, we further subdivided the set of exact variables into clusters of variables that were connected by undirected edges in $X^{t+1}$ without edges involving the sampled variables.
Our results show that PSBF consistently outperformed BK in all process sizes. There are two main computational savings in PSBF relative to BK: firstly, by skipping over belief factors in the transition and observation steps, and secondly, by not having to perform a potentially expensive projection step to restore the original factorisation after the inference. However, while the times of both algorithms grew exponentially in the size of the process, we note that the relative difference between PSBF and BK decreased significantly for lower degrees of passivity. This is an instance of "No Free Lunch" (see Section 7 for a discussion), which means that PSBF performs best in processes with high passivity but can suffer in performance in processes that lack passivity. Specifically, the computational overhead of modifying variable distributions and detecting skippable belief factors does not amortise as effectively in large processes with low passivity. Furthermore, with low passivity, PSBF often has to perform full transition and observation steps (i.e. update all belief factors in each step), which can be costly in large processes.
How were BK and PF affected by passivity? Not surprisingly, the performance of BK was nearly unaffected by the increasing degrees of passivity. The junction tree algorithm used in BK benefited marginally from an increased sparsity in the process, but the computational gains were minimal. We were at first unable to use PF as it required too many samples (between 10k and 200k) to achieve comparable accuracy to PSBF/BK, due to the very high variance in the processes. In order to investigate the effect of passivity on PF, we implemented a version of PF which was strictly optimised for binary variables. Interestingly, we found that passivity had an adverse effect on the performance of PF, requiring it to use exponentially more samples with increased passivity (see Figure 9a). This makes sense if we view PF as a factored approximation method (such as PSBF and BK) which means that the analysis in Section 5.6 applies. However, because PF puts all variables into a single cluster (since it is not actually a factored method), the mixing rate of the process will be much lower than for PSBF and BK (as discussed in Section 5.6) and, thus, the error bounds are less tight. To compensate for this, PF requires significantly more samples for increased passivity.

Multi-robot Warehouse System
In this section, we demonstrate how passivity can occur naturally in a more complex system and how PSBF can exploit this to accelerate the filtering task. To this end, we consider a multi-robot warehouse system in the style of Kiva, in which the robots' task is to transport goods within the warehouse (cf. Figure 10a). Figure 10b shows the initial state of the warehouse simulation. The warehouse consists of 2 workstations (W1, W2), 4 robots (R1-R4), and 16 inventory pods (I1-I16). Each robot can move forward and backward, turn left and right, load and unload an inventory pod (if positioned under the pod), or do nothing. As in Kiva, robots can move under inventory pods unless they are carrying a pod, in which case the other pods become obstacles. The move and turn operations are stochastic in that the robot may move/turn too far (3% chance) or do nothing (2% chance). Each robot possesses two sensors, one telling it which inventory pod it has loaded (if any) and one for the direction it is facing. The direction sensor is noisy in that a random direction may be reported (3% chance).
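The stochastic move operation described above can be written as a small transition model. The sketch below is a simplification made for illustration: positions lie on a one-dimensional track, "too far" is assumed to mean exactly one extra cell, and walls, pods, and other robots are ignored. The function names are hypothetical.

```python
def move_outcomes(step=1):
    """Displacement distribution for one move command (assumed semantics)."""
    return {
        step: 0.95,      # intended displacement
        2 * step: 0.03,  # moved too far (assumed to be one extra cell)
        0: 0.02,         # did nothing
    }


def predict_position(belief, step=1):
    """Push a belief over positions through the noisy move."""
    predicted = {}
    for pos, p in belief.items():
        for delta, q in move_outcomes(step).items():
            predicted[pos + delta] = predicted.get(pos + delta, 0.0) + p * q
    return predicted


print(predict_position({0: 1.0}))
```

A robot that is certain of its position before moving is therefore uncertain over three cells afterwards, and this uncertainty compounds over successive moves unless observations correct it.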

Specification of Warehouse System
Each robot maintains a list of tasks in the form of "Bring inventory pod I to workstation W" (yellow area around W) and "Bring inventory pod I to position (x,y)". How these tasks are executed depends on the control mode, of which we use two in our simulations:⁸

8. Our control modes are ad hoc and often make suboptimal decisions. However, we found that current solution techniques for (DEC-)POMDPs, including approximate methods, were infeasible in this setting. Nonetheless, the quality of the decisions made by our control modes largely depends on the accuracy of the belief states, hence it is important that the belief states are updated accurately. Therefore, the control modes were sufficient for our purposes.

Centralised mode: A central controller maintains a belief state $b^t$ about the state of the warehouse system. At each time $t$, it samples 100 states from $b^t$ and removes all duplicate states, resulting in the set $\hat{S} = \{\hat{s}_1, \hat{s}_2, \ldots\}$. It then resamples a state $\hat{s}^* \in \hat{S}$ with probabilities $w(\hat{s}^*) = b^t(\hat{s}^*) / \sum_q b^t(\hat{s}_q)$. Based on $\hat{s}^*$ and the current task of each robot, it performs an A* search (Hart, Nilsson, & Raphael, 1968) (with Manhattan distance) in the space of joint actions to find the optimal action for each robot. After executing their actions, the robots send their sensor readings to the controller, and the controller updates its belief state using the sensor readings.
Decentralised mode: Each robot maintains its own belief state and there is no communication between the robots. The only knowledge the robots have about each other is their current tasks, communicated by the task allocation module. At each time $t$, each robot samples the set $\hat{S}$ and state $\hat{s}^*$ as is done in the centralised mode. Treating the other robots as static obstacles, it performs an A* search based on $\hat{s}^*$ and its current task to find an action $a^t$. This is repeated for each other robot $r$ in all states $\hat{s}_q \in \hat{S}$, resulting in actions $a_{r,q}$ which are used to obtain distributions $\pi_r : A \to [0, 1]$ ($A$ is the set of all actions) with $\pi_r(a) = \sum_{q : a_{r,q} = a} w(\hat{s}_q)$. The robot then executes its action $a^t$ and updates its belief state using its sensor readings and the distributions $\pi_r$ to average over the other robots' actions.
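The sampling steps shared by both control modes can be sketched as follows. This is a simplified illustration rather than the controller used in the simulations: the belief state is assumed to be an explicit dictionary over states (which would not be feasible for the full warehouse), and `planned_action` is a hypothetical stand-in for the A* search that returns the action another robot would choose in a given state.

```python
import random
from collections import defaultdict


def sample_candidate_states(belief, n_samples, rng):
    """Draw states from the belief, drop duplicates, and renormalise the weights."""
    states = list(belief)
    probs = [belief[s] for s in states]
    drawn = set(rng.choices(states, weights=probs, k=n_samples))
    norm = sum(belief[s] for s in drawn)
    return {s: belief[s] / norm for s in drawn}


def resample_state(weights, rng):
    """Pick the planning state s* according to the renormalised weights."""
    states = list(weights)
    return rng.choices(states, weights=[weights[s] for s in states], k=1)[0]


def action_distribution(weights, planned_action):
    """pi_r(a): total weight of the candidate states in which robot r picks a."""
    pi = defaultdict(float)
    for state, w in weights.items():
        pi[planned_action(state)] += w
    return dict(pi)


rng = random.Random(1)
belief = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
weights = sample_candidate_states(belief, 100, rng)
s_star = resample_state(weights, rng)
pi_r = action_distribution(weights, lambda s: "turn" if s == "s1" else "move")
```

Because duplicates are removed before renormalising, the weights depend only on which states were drawn and not on how often each was drawn.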
The tasks are generated by an external scheduler in time intervals sampled from $U[1, 10]$. Each generated task is assigned to one of the robots through a sequential auction (Dias, Zlot, Kalra, & Stentz, 2006). The robots' bids are calculated as their total number of steps needed to solve all of their current tasks and the auctioned task (in a simplified model in which the other robots are removed), averaged over all states in $\hat{S}$. The robot with the lowest bid is assigned the task.

Figure 11: Example DBN of a smaller warehouse system consisting of only one inventory pod (I1) and two robots (R1, R2). The DBN implements the joint action in which R1 moves and R2 turns. Dashed circles mark passive state variables. The coloured areas represent the state clusters $C_1$ to $C_8$.

Figure 11 shows an example DBN for a smaller warehouse with one inventory pod and two robots. Each inventory pod I is represented by two variables, I.x and I.y, which correspond to the x and y position of the inventory pod. Each robot R is represented by four variables: R.x/R.y for its x/y position, R.d for its direction, and R.s for its status. The status of a robot R is either R.s=0 (unloaded) or R.s=I (loaded with inventory pod I). Constants such as the size of the warehouse and the positions of the workstations are omitted in the DBN.

DBN Topology and Clustering
There are four types of clusters: the I-clusters (C1-C4) preserve the correlation that if R is loaded with I, then I must always have the same position as R (there are two I-clusters for each (I,R) pair); the R-clusters (C5) and S-clusters (C6), respectively, preserve the correlation that no two robots can have the same position or carry the same inventory pod (there is one R/S-cluster for each (Ra,Rb) pair with a > b); and, finally, the D-clusters (C7, C8). PSBF uses singleton observation clusters (i.e. one cluster for each observation variable).
There are some differences between the DBNs for the centralised and decentralised modes (Figure 11 uses the centralised mode). In the centralised mode, there is one DBN for each action combination of the robots. Since the controller observes all R.s noise-free, it can add edges from R.x/R.y to I.x/I.y if R.s=I or remove them otherwise to simplify the inference (thus, in Figure 11, R1 is loaded with I1 and R2 is unloaded). In the decentralised mode, each robot only observes its own sensor readings, hence it can add or remove edges only for itself, while edges for all other robots must be permanently added. This also means that the other robots' status variables (R.s) must be linked to all I.x/I.y and, therefore, included in the I-clusters (to preserve the correlation that I must have the same position as R if R is loaded with I). Moreover, since each robot only knows its own action, there is one DBN for each of its own actions, and all variables associated with the other robots are active (the distributions $\pi_r$ defined in the previous section are used to average over their actions).

Results
We implemented PSBF, BK, and PF in C#, using the framework Infer.NET (Minka, Winn, Guiver, & Knowles, 2012) to implement BK. This allowed BK to exploit sparsity in the process and offered improved memory handling. PSBF was optimised for sparsity in (6) and (8), respectively, by summing over states s for which all $b^t_k$ / $b^{t+1}_k$ are positive. PF naturally benefits from sparsity as it allows it to concentrate the samples on fewer states. The number of samples used in PF was set in such a way that the controller decisions were invariant to the random numbers used in the sampling process of PF. This was done to ensure that the results were repeatable. Finally, to maintain sparsity in the process, each probability in the belief states lower than 0.01 was set to 0. All tested algorithms were initialised with an exact belief state, shown in Figure 10b.

Figure 12 shows the time per transition averaged over 20 different simulations with 100 transitions each. The timing reported for PSBF includes the time needed to modify variable distributions (for overlapping clusters) and to detect skippable belief factors for the transition and observation steps, both of which were done once on demand for every previously unseen DBN. In the centralised mode, PSBF was able to outperform BK on average by 49% and PF by 36%. PF needed 20,000 samples to produce consistent (i.e. repeatable) results. In the decentralised mode, PSBF outperformed BK on average by 17% and PF by 32%. PF now needed 45,000 samples to produce consistent results, due to the increased variance in the process. All differences were statistically significant, based on paired t-tests with a 5% significance level. Note that PSBF and BK were slower in the decentralised mode since the corresponding DBNs had much higher inter-connectivity. In addition, PSBF updated more belief factors since there were more active variables.
As expected, PSBF was able to exploit the high degree of passivity in the process to accelerate the filtering task. In many cases, this meant that PSBF needed to update less than half of the belief factors. Precisely how many belief factors had to be updated depends on the performed action. To illustrate this, consider the smaller warehouse DBN shown in Figure 11 (for the centralised mode), in which R1 is moving and R2 is turning. Here, R1.x, R1.y, and R2.d are active variables while all other variables are passive (dashed circles), corresponding to a passivity of 70%. In this DBN, PSBF updates the belief factors corresponding to clusters C1, C2, C5, and C8, since they each contain active variables, and it also updates the belief factors for C3 and C4, since there are directed paths from active variables (R1.x and R1.y) to each of them. Therefore, the only factors which are not updated are those for C6 and C7. Now consider the full warehouse in our experiment, which contains 16 inventory pods and 4 robots, resulting in 48 variables with 128 I-clusters, 6 R-clusters, 6 S-clusters, and 4 D-clusters. Assume a similar situation in which one robot moves with an inventory pod, say R4 with I1, while R1-R3 turn. In this case, PSBF updates only 3 of 6 R-clusters (those containing R4), 0 of 6 S-clusters (since no status changes), 3 of 4 D-clusters (for R1-R3), and 38 of 128 I-clusters (32 I-clusters containing R4 plus 6 I-clusters from R1-R3 for I1), amounting to a total saving of 69.44% of belief factors which do not need to be updated.
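The counts in this example can be reproduced with a few lines of arithmetic. The sketch below only restates the numbers given in the text (two I-clusters per pod-robot pair, one R-cluster and one S-cluster per pair of robots, one D-cluster per robot); the function and variable names are illustrative.

```python
def warehouse_cluster_counts(n_pods, n_robots):
    """Number of state clusters of each type in the warehouse DBN."""
    robot_pairs = n_robots * (n_robots - 1) // 2
    return {
        "I": 2 * n_pods * n_robots,  # two I-clusters per (I, R) pair
        "R": robot_pairs,            # one per (Ra, Rb) pair
        "S": robot_pairs,            # one per (Ra, Rb) pair
        "D": n_robots,               # one per robot
    }


n_pods, n_robots = 16, 4
n_variables = 2 * n_pods + 4 * n_robots  # 2 per pod, 4 per robot
counts = warehouse_cluster_counts(n_pods, n_robots)

# Scenario from the text: R4 moves while carrying I1, and R1-R3 turn.
updated = {
    "I": 2 * n_pods + 2 * (n_robots - 1),  # all pairs with R4, plus I1 with R1-R3
    "R": n_robots - 1,                     # pairs containing R4
    "S": 0,                                # no status changes
    "D": n_robots - 1,                     # the three turning robots
}
saving = 1.0 - sum(updated.values()) / sum(counts.values())
print(n_variables, counts, f"{100 * saving:.2f}%")  # 48 variables, 69.44% saved
```

In total, 44 of the 144 belief factors are updated, so 100 of 144 are skipped.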
The number of states in the warehouse system (including invalid states) exceeded $10^{45}$. Therefore, we were unable to compare the accuracy of the tested algorithms in terms of relative entropy. Instead, we compared their accuracy based on the results of the task auctions and the number of completed tasks by the end of each simulation. This gives a good indication of the algorithms' accuracy, since both the outcome of the auction and the number of completed tasks depend on the accuracy of the belief states. In the centralised mode, the algorithms generated over 95% identical task auctions and completed 15.7 (BK), 15.5 (PSBF), and 15.2 (PF) tasks on average. In the decentralised mode, they generated over 93% identical auctions and completed 12.1 (BK), 12.2 (PSBF), and 11.7 (PF) tasks on average. In both modes, none of these differences were statistically significant. This indicates that PSBF achieved an accuracy similar to that of BK and PF.

Summary of Experimental Evaluation
The experimental results show that PSBF produces belief states with competitive accuracy: In the synthetic processes, PSBF achieved an accuracy which on average was better than or comparable to the accuracy of the alternative methods. In the warehouse system, PSBF was able to complete a statistically equivalent number of tasks as compared to the other methods, which indicates that its accuracy was equivalent or comparable.
Furthermore, the experimental results show that PSBF performed the belief updates significantly faster than the alternative methods: In the synthetic processes, PSBF without parallel processes outperformed BK by up to 64% in the largest process (XL), while PF took too much time to achieve an accuracy comparable to PSBF. In particular, the results show that the computational gains can grow significantly with both the degree of passivity and the size of the process. In the warehouse system, PSBF outperformed the alternative methods by up to 49%, which is a substantial saving considering the size of the state space (more than $10^{45}$ states). Furthermore, the computational gains were much higher in the centralised control mode than in the decentralised control mode, since the latter had a significantly lower degree of passivity. This again shows that high degrees of passivity can bear great potential for the filtering task.

No Free Lunch for PSBF
Our view is that no belief filtering method is generally suited for all types of processes. Instead, each method assumes a certain structure in the process (explicitly or implicitly) which it attempts to exploit in order to render the filtering task more tractable. Typically, the methods are tailored in such a way with respect to this structure that they perform well if the structure is present in the process, but suffer a significant loss in performance if the structure is absent. For instance, PF works best in processes with low degrees of uncertainty, since this means that fewer state samples are needed for acceptable approximations. On the other hand, the number of samples needed for acceptable approximations can grow substantially with the degree of uncertainty in the process (as shown in our experiments). As another example, BK works best in processes with little correlation between state variables, since this means that the belief factors will be small and can be processed efficiently. However, if there are many variables which are strongly correlated, then BK typically becomes infeasible. Therefore, these structural assumptions have to be taken into account when choosing a filtering method for a specific process.
A formal account of this view is given by the "No Free Lunch" theorems (Wolpert & Macready, 1995, 1997), which state that, intuitively speaking, any two algorithms have equivalent performance when averaged over all possible instances of the problem. In other words, if there are classes of problem instances for which algorithm A has better performance than algorithm B, then there must be other classes of problem instances for which A has worse performance than B. Then, the question is: for what class of problem instances (that is, processes) can PSBF be expected to achieve good performance? This class is essentially described by the following three criteria:

Degree of passivity - PSBF attempts to accelerate the filtering task by omitting the transition step for as many belief factors as possible. This depends on the passivity of the variables in the state clusters. In the ideal case, the process exhibits a high degree of passivity such that PSBF can omit the transition step for many belief factors. In the worst case, the process has no passive variables at all, and PSBF has to update all belief factors in the transition step. However, as discussed in Section 5.5, a high degree of passivity is not necessarily sufficient to infer that many clusters can be skipped in the transition step, since the passive variables could be distributed in such a way that no cluster can be skipped (e.g. if the passive variables are distributed uniformly amongst the state clusters). Therefore, in an optimal case, the passivity is concentrated on correlated state variables such that passive variables end up in the same clusters.
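The point about how passive variables are distributed can be illustrated with a toy example. The sketch below checks only the prerequisite that every variable in a cluster is passive; it is not the full detection procedure of PSBF, which also has to account for dependencies on active variables outside the cluster. The cluster layout and names are hypothetical.

```python
def fully_passive_clusters(clusters, passive):
    """Clusters in which every variable is passive (a prerequisite for skipping)."""
    return [name for name, members in clusters.items()
            if all(v in passive for v in members)]


clusters = {"C1": {"x1", "x2"}, "C2": {"x3", "x4"}, "C3": {"x5", "x6"}}

# Both settings have 50% passivity (3 of 6 variables).
concentrated = {"x1", "x2", "x3"}  # passivity concentrated in C1
spread = {"x1", "x3", "x5"}        # one passive variable per cluster

print(fully_passive_clusters(clusters, concentrated))  # ['C1']
print(fully_passive_clusters(clusters, spread))        # []
```

With the same degree of passivity, the concentrated layout leaves one cluster as a candidate for skipping, while the spread layout leaves none.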
Size of state clusters - The space and time complexity of the belief state representation in PSBF is exponential in the size of the largest state cluster (cf. Section 5.5). Therefore, in the ideal case, the relevant variable correlations can be captured in small state clusters and the cost of storing the belief factors and performing the update procedures is small. In the worst case, large state clusters are required to retain the variable correlations and the cost of storing and updating belief factors is large. A second reason why the state clusters should be small lies in the way PSBF performs the transition step. One prerequisite for omitting the transition step for a belief factor is that all variables in the corresponding cluster are passive. If there are many variables in one cluster, then it is less likely that all variables in the cluster are passive, and, therefore, it is less likely that the cluster can be skipped.
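Both effects of cluster size can be quantified in a simple idealised setting. The expressions below assume binary variables and, purely for illustration, that each variable in a cluster $C_k$ is passive independently with probability $p$ (as in the synthetic generator); real processes need not satisfy this independence assumption.

```latex
\[
  \underbrace{2^{|C_k|}}_{\text{entries in the belief factor for } C_k}
  \qquad\text{and}\qquad
  \Pr[\text{all variables in } C_k \text{ are passive}] = p^{|C_k|}.
\]
\[
  p = 0.75:\quad |C_k| = 2 \;\Rightarrow\; 2^2 = 4,\; p^2 \approx 0.56;
  \qquad |C_k| = 8 \;\Rightarrow\; 2^8 = 256,\; p^8 \approx 0.10.
\]
```

Under these assumptions, growing a cluster from two to eight variables multiplies the size of its belief factor by 64 while reducing the chance that it meets the all-passive prerequisite from about 56% to about 10%.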
Structure of observations - A third criterion, though arguably less important than the other criteria, is the structure of the observations (i.e. the way in which the observation variables depend on the state variables) and the size of the observation clusters ($\hat{C}_l$). PSBF attempts to accelerate the observation step by skipping over all those state clusters whose variables are structurally independent of the observation, and, if a cluster cannot be skipped, by incorporating only those observation clusters which are relevant to the update. Therefore, in the ideal case, only a fraction of the state clusters depend on the observation, and the relevant correlations between observation variables can be captured in small observation clusters. In the worst case, all state clusters depend on the observation in some sense, and the structure of the observation does not allow for an efficient clustering.
Thus, in summary, PSBF is most suitable for processes with high degrees of passivity and in which the relevant variable correlations can be captured in small state and observation clusters. On the other hand, PSBF may not be suitable if there is little or no passivity, and if large state and observation clusters are necessary to retain the relevant variable correlations in the process.
In addition to identifying the class of processes for which a filtering method is suitable, it is also important to justify the practical relevance of this class. In this work, we are interested in robotic and other physical decision processes (as shown by our examples and experiments). Such systems typically exhibit a number of features. First of all, robotic systems usually have some causal structure (e.g. Mainzer, 2010; Pearl, 2000). Passivity, as a specific type of causality, can be observed in many robotic systems, including the robot arm used in our examples and the multi-robot warehouse system in Section 6.2. Furthermore, robotic systems typically have a modular structure, in which each module is responsible for a specific subtask and may interact with other modules. This modular structure often allows for an efficient clustering, in the sense that each module corresponds to a cluster of correlated state variables. Finally, the sensors used in robotic systems typically only provide information about certain aspects of the system, and some components of the system may not benefit from some of the sensor information. In other words, there are independencies between state and observation variables. These features correspond to the three criteria above, which specify the class of processes for which PSBF is a suitable filtering method. Therefore, we believe that this class is practically justified.

Conclusion
Inferring the state of a stochastic process can be a difficult technical challenge in complex systems with large state spaces. The key to developing efficient solutions is to identify special structure in the process, e.g. in the topology and parameterisation of dynamic Bayesian networks, which can be leveraged to render the filtering task more tractable.
To this end, the present article explored the idea of automatically detecting and exploiting causal structure in order to accelerate the belief filtering task. We considered a specific type of causal relation, termed passivity, which pertains to how state variables cause changes in other state variables. To demonstrate the potential of exploiting passivity, we developed a novel filtering method, PSBF, which uses a factored belief state representation and exploits passivity to perform selective updates over the belief factors. PSBF produces exact belief states under certain assumptions and approximate belief states otherwise. We showed empirically, in synthetic processes with varying sizes and degrees of passivity as well as in an example of a complex multi-robot system, that PSBF can be faster than several alternative methods while achieving competitive accuracy. In particular, our results showed that the computational gains can grow significantly with the size of the process and the degree of passivity.
Our work demonstrates that if a system exhibits substantial causal structure, then there can be great potential in exploiting this structure to render the filtering task more tractable. In particular, our experiments support our initial hypothesis that factored beliefs and passivity can be a useful combination in large processes. This insight is relevant for complex processes with high degrees of causality, such as robots used in homes, offices, and industrial factories, where the filtering task may constitute a major impediment due to the often very large state space of the system.
There are several potential directions for future work. For example, it would be useful to know whether the definition of passivity could be relaxed such that more variables fall under this definition, and such that the principal idea behind PSBF is still applicable. One such relaxation could be in the form of approximate passivity, which allows for small probabilities that passive variables change values even if the relevant parents remain unchanged. In addition, it would be interesting to know whether the idea of performing selective updates over belief factors (via passivity) could also be applied to other existing methods that use a factored belief state representation (cf. Section 2.1). Finally, another useful avenue for future work would be to formulate additional types of causal relations which can be exploited in ways similar to how PSBF exploits passivity, or perhaps in other ways.

Appendix A. Proof of Theorem 1
To prove Theorem 1, it will be useful to first establish the following lemma:

Lemma 1. If (A1) holds and all $x^{t+1}_i \in C_k$ are passive in $\Delta^a$, then $\forall s, s' : T^a_k(s, s'_k) = 1 \Leftrightarrow s_k = s'_k$.

Proof.
$\Rightarrow$: The fact of (A1) means that $\Phi_{a,i} \subseteq C_k$ for all $x^{t+1}_i \in C_k$. Since all $x^{t+1}_i \in C_k$ are passive in $\Delta^a$, it follows that all $x^t_j \in \Phi_{a,i}$ are passive in $\Delta^a$, for all $\Phi_{a,i}$. Therefore, given $T^a_k(s, s'_k) = 1$ and clause (ii) in Definition 3, it follows that $s_k = s'_k$.
$\Leftarrow$: Follows directly by (A1) and the fact that all $x^{t+1}_i \in C_k$ are passive in $\Delta^a$.
Using Lemma 1, we can give a compact proof of Theorem 1:

Theorem 1. If (A1) and (A2) hold, and if all $x^{t+1}_i \in C_k$ are passive in $\Delta^{a^t}$, then

Proof.
$$\hat{b}^{t+1}_k(s'_k) = \eta_1 \sum_{s \in S(\mathrm{pa}^t_{a^t}(C_k))} T^{a^t}_k(s, s'_k) \prod_{k' : [\exists x^{t+1}_i \in C_{k'} : x^t_i \in \mathrm{pa}^t_{a^t}(C_k)]} b^t_{k'}(s_{k'})$$
$$\overset{\text{Lem. 1}}{=} \eta_1 \sum_{s \in S(\mathrm{pa}^t_{a^t}(C_k)) : s_k = s'_k} T^{a^t}_k(s, s'_k)$$

Appendix B. Proof of Theorem 2
To prove Theorem 2, we first note the following proposition:

Proposition 1. If all $x^{t+1}_i \in C_k$ are marginally independent of all $y^{t+1}_j \in Y^{t+1}$ in $\Delta^{a^t}$, then $\forall s, s' : \bigwedge_{k' \neq k} s_{k'} = s'_{k'} \rightarrow \Omega^a(s, o^t) = \Omega^a(s', o^t)$.

This proposition follows directly by definition.

Using Proposition 1, we can give a compact proof of Theorem 2:

Theorem 2. If all $x^{t+1}_i \in C_k$ are marginally independent of all $y^{t+1}_j \in Y^{t+1}$ in $\Delta^{a^t}$, then $\forall s : b^{t+1}_k(s_k) = \hat{b}^{t+1}_k(s_k)$.

Proof.
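Separately from the formal proof, the statement of Theorem 2 can be checked numerically on a toy example. The sketch below is only an illustration under simplifying assumptions, not the PSBF implementation: it uses a product-form belief over two binary clusters, and the names `observation_step`, `b1`, `b2` and `omega`, as well as all numbers, are hypothetical. The observation likelihood is chosen so that it ignores the first cluster, mirroring the premise of Proposition 1.

```python
def observation_step(b1, b2, likelihood):
    """Condition a product-form belief b1(s1) * b2(s2) on one observation.

    `likelihood(s1, s2)` returns the probability of the observation in the
    joint state (s1, s2). Returns the posterior marginals of both factors.
    """
    joint = {
        (s1, s2): p1 * p2 * likelihood(s1, s2)
        for s1, p1 in b1.items()
        for s2, p2 in b2.items()
    }
    z = sum(joint.values())  # normalisation constant
    post1 = {s1: sum(p for (u, _), p in joint.items() if u == s1) / z for s1 in b1}
    post2 = {s2: sum(p for (_, v), p in joint.items() if v == s2) / z for s2 in b2}
    return post1, post2


# Toy belief factors over two binary clusters (hypothetical numbers).
b1 = {0: 0.3, 1: 0.7}
b2 = {0: 0.6, 1: 0.4}


# Hypothetical observation likelihood that depends on s2 only.
def omega(s1, s2):
    return 0.9 if s2 == 1 else 0.2


post1, post2 = observation_step(b1, b2, omega)
```

In this example `post1` agrees with `b1` up to floating-point rounding, while `post2` shifts towards `s2 = 1`. The observation step could therefore be skipped for the first factor without changing the result.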

Appendix C. Mixture of Gaussians
Algorithm 4 provides a simple procedure that randomly generates a mixture of Gaussians (i.e. a set of normal distributions) for the synthetic processes in Section 6.1. The algorithm takes as input the number n of state variables and returns a set G of Gaussians whose means are in the set {1, ..., n}. The number of Gaussians, their means, and their variances are chosen automatically so as to achieve good "coverage" of state variables while minimising the (visual) overlap of Gaussians. See Figure 7 for an example.

Algorithm 4 (excerpt):
$R^- \leftarrow (R(1), R(2), ..., R(p))$ such that $R(p) < \mu - \sigma\lambda$
$R^+ \leftarrow (R(q), R(q+1), ..., R(|R|))$ such that $R(q) > \mu + \sigma\lambda$
if $R^- = \emptyset$ then
if $R^+ = \emptyset$ then
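The splitting step that survives from Algorithm 4 (the sets R− and R+ lying more than σλ to the left and right of a mean µ) suggests a divide-and-cover procedure. The following is a minimal sketch in that spirit, not a reproduction of Algorithm 4. The function name `generate_gaussians`, the default `lam=2.0` (assumed positive), the uniform choice of the mean, and the range from which the standard deviation is drawn are all assumptions made for illustration. So is the choice to continue on each side only when it is non-empty.

```python
import random


def generate_gaussians(n, lam=2.0, rng=None):
    """Return a list of (mean, sigma) pairs with distinct means in {1, ..., n}."""
    rng = rng or random.Random()
    gaussians = []
    # Each stack entry is a non-empty ascending list of positions still uncovered.
    stack = [list(range(1, n + 1))] if n >= 1 else []
    while stack:
        region = stack.pop()
        mu = rng.choice(region)  # assumption: mean drawn uniformly from the region
        # Assumption: sigma drawn uniformly, with an upper bound tied to region size.
        sigma = rng.uniform(0.5, max(0.5, len(region) / (2.0 * lam)))
        gaussians.append((mu, sigma))
        half_width = lam * sigma
        left = [r for r in region if mu - r > half_width]   # plays the role of R^-
        right = [r for r in region if r - mu > half_width]  # plays the role of R^+
        if left:
            stack.append(left)
        if right:
            stack.append(right)
    return gaussians
```

By construction, every position in {1, ..., n} ends up within λσ of at least one mean, and no mean is used twice. The sketch does not guarantee that the resulting Gaussians are free of overlap, so it only approximates the "coverage with little overlap" goal described above.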