Variational Cumulant Expansions for Intractable Distributions

Intractable distributions present a common difficulty in inference within the probabilistic knowledge representation framework, and variational methods have recently been popular in providing an approximate solution. In this article, we describe a perturbational approach in the form of a cumulant expansion which, to lowest order, recovers the standard Kullback-Leibler variational bound. Higher-order terms describe corrections to the variational approach without incurring much further computational cost. The relationship to other perturbational approaches such as TAP is also elucidated. We demonstrate the method on a particular class of undirected graphical models, Boltzmann machines, for which our simulation results confirm improved accuracy and enhanced stability during learning.


Introduction
In recent years, interest has steadily grown over many associated fields in representing and processing information in a probabilistic framework. Arguably, the reason for this is that prior knowledge of a problem domain is rarely absolute, and the probabilistic framework therefore arises naturally as a candidate for dealing with this uncertainty. Better known examples of models are Belief Networks (Pearl, 1988) and probabilistic neural networks (Bishop, 1995; MacKay, 1995). However, incorporating many prior beliefs can make models of domains so ambitious that dealing with them in a probabilistic framework is intractable, and some form of approximation is inevitable. One well-known approximate technique is Monte Carlo sampling (see e.g., Neal, 1993; Gilks, Richardson, & Spiegelhalter, 1996), which has the benefit of universal applicability. However, not only can sampling be prohibitively slow, but also the lack of a suitable convergence criterion can lead to low confidence in the results. Recently, variational methods have provided a popular alternative, since they not only approximate but can also bound quantities of interest, giving, potentially, greater confidence in the results (Jordan, Ghahramani, Jaakkola, & Saul, 1998; Barber & Wiegerinck, 1999; Barber & Bishop, 1998). The accuracy of variational methods, however, is limited by the flexibility of the variational distribution employed. Whilst, in principle, the flexibility, and therefore the accuracy of the model, can be increased at will (see e.g., Jaakkola & Jordan, 1998; Lawrence, Bishop, & Jordan, 1998), this generally comes at the expense of incurring even greater computational difficulties and optimization problems.
In this article, we describe an alternative approach which is a perturbation around standard variational methods. This has the property of potentially improving the performance at only a small extra computational cost. Since most quantities of interest are derivable from knowledge of the normalizing constant of a distribution, we present an approximation of this quantity which, to lowest order, recovers the standard Kullback-Leibler variational bound. Higher-order corrections in this expansion are expected to improve on the accuracy of the variational solution, despite the loss of a strict bound.
In section (2) we will briefly discuss why the normalizing constant of probability distributions is of such importance. We briefly review the Kullback-Leibler variational bound in section (3) before introducing the perturbational approach in section (4). This approach is illustrated on a well-known class of undirected graphical models, Boltzmann machines, in section (5), in which we also explain the relation of this approach to other perturbational methods such as TAP (Thouless, Anderson, & Palmer, 1977; Kappen & Rodríguez, 1998a). We conclude in section (6) with a discussion of the advantages and drawbacks of this approach compared to better known techniques.

Normalizing Constants and Generating Functions
We consider a family of probability distributions $Q$ over a vector of random variables $s = \{s_i \mid i \in [1, \ldots, N]\}$, parameterized by $w$,

$$Q(s; w) = \frac{\exp(H(s; w))}{Z(w)}. \qquad (1)$$

With a slight abuse of notation, we write the normalizing constant as

$$Z(w) = \int ds\, \exp(H(s; w)). \qquad (2)$$

The `potential' $H(s; w)$ thus uniquely determines the distribution $Q$. We have assumed here that the random variable $s$ is continuous; if not, the corresponding integral (see equation 2) should be replaced by a summation over all possible discrete states.
One approach in statistics to obtain marginals and other quantities of interest is given by considering generating functions (Grimmett & Stirzaker, 1992), which take the form

$$G(\lambda; w) = \left\langle e^{\lambda \cdot s} \right\rangle_Q. \qquad (3)$$

Averages of variables with respect to $Q$ are given by derivatives of $G(\lambda; w)$ with respect to $\lambda$ for fixed $w$, so that, for example, $\langle s_1 s_2 \rangle = \partial^2 G(\lambda; w)/\partial \lambda_1 \partial \lambda_2$, evaluated at $\lambda = 0$. We can equally well consider the function

$$Z(\lambda; w) = \int ds\, \exp\left(H(s; w) + \lambda \cdot s\right) \qquad (4)$$

so that

$$\langle s_1 s_2 \rangle = \frac{1}{Z(\lambda; w)} \frac{\partial^2 Z(\lambda; w)}{\partial \lambda_1 \partial \lambda_2} \bigg|_{\lambda = 0}. \qquad (5)$$

Formally, all quantities of interest can be calculated from ``normalizing constants'' of distributions, possibly modified in a suitable manner, such as in (4) above. It is clear that all moments of $Q$ can be generated by differentiating (4) in a similar manner, and can be obtained as easily from the generating function approach as from the normalising constant approach. We prefer here the normalising constant approach since this enables us more easily to make contact with results from statistical physics. The normalising constant approach is also more natural for the examples of undirected networks that we shall consider in later sections.
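To make the connection between moments and derivatives of the normalising constant concrete, the following short Python sketch (our own illustration; the pairwise binary potential and all variable names are hypothetical choices, not from the text) checks the identity (5) by finite differences on a model small enough to enumerate:

```python
import numpy as np
from itertools import product

# Illustrative check of <s1 s2> = (1/Z) d^2 Z(lambda)/(dlambda1 dlambda2) at
# lambda = 0, equation (5), for a small pairwise binary potential.
rng = np.random.default_rng(0)
N = 4
W = rng.standard_normal((N, N)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.standard_normal(N)
S = np.array(list(product([0, 1], repeat=N)), dtype=float)  # all 2^N states
H = 0.5 * np.einsum('ki,ij,kj->k', S, W, S) + S @ b

def Z(lam):
    """Modified normalising constant Z(lambda) of equation (4)."""
    return np.exp(H + S @ lam).sum()

# Direct expectation <s1 s2> under Q.
p = np.exp(H) / Z(np.zeros(N))
direct = p @ (S[:, 0] * S[:, 1])

# Mixed second derivative of Z by central finite differences.
d = 1e-4
e1 = np.eye(N)[0] * d
e2 = np.eye(N)[1] * d
mixed = (Z(e1 + e2) - Z(e1 - e2) - Z(-e1 + e2) + Z(-e1 - e2)) / (4 * d * d)
print(direct, mixed / Z(np.zeros(N)))  # the two numbers agree closely
```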
In the following sections, we lay out how to approximate the normalising constant, assuming that any additional terms in the normalising constant, of the form $\lambda \cdot s$ in equation (4), are absorbed into the definition of the potential $H$.
As we will see in section (5.2.3), the normalising constant also plays a central role in learning. Unfortunately, for many probability distributions that arise in applications, calculation of the normalizing constant is intractable due to the high dimensionality of the random variable $s$. When $s$ is an $N$-dimensional binary vector, naively at least, a summation over $2^N$ states has to be performed to calculate the normalizing constant.
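The exponential cost is easy to see in code. The following sketch (ours; the example potential is an arbitrary illustrative choice) computes $\log Z$ exactly by enumerating all $2^N$ states, which is feasible only for small $N$:

```python
import numpy as np
from itertools import product

def log_Z_exact(H, N):
    """Exact log normalising constant of a binary model by enumerating 2^N states."""
    energies = np.array([H(np.array(s, dtype=float))
                         for s in product([0, 1], repeat=N)])
    m = energies.max()
    return m + np.log(np.exp(energies - m).sum())  # log-sum-exp for stability

# Example potential: H(s) = sum_{i<j} w_ij s_i s_j + sum_i w_i s_i.
rng = np.random.default_rng(0)
N = 10                                  # already 1024 states; 2^N grows quickly
W = rng.standard_normal((N, N)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
w = rng.standard_normal(N)
print(log_Z_exact(lambda s: 0.5 * s @ W @ s + w @ s, N))
```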
However, sometimes it is possible to exploit the structure of the distribution in order to reduce the computational expense. A trivial example is that of factorised models, $Q(s) = \prod_i Q_i(s_i)$, for which the computation scales only linearly with $N$. Similarly, in the case of a Gaussian distribution, computation scales (roughly) with $N^3$. These tractable distributions have recently been exploited in the variational method to approximate intractable distributions, see for example (Saul, Jaakkola, & Jordan, 1996; Jaakkola, 1997; Barber & Wiegerinck, 1999; Jordan et al., 1998).
In the following section we will briefly review one of the most common variational methods, which exploits the Kullback-Leibler bound.

The Kullback-Leibler Variational Bound
Our aim in this section is to briefly describe the current state of the art in approximating the normalizing constant of an intractable probability distribution. We denote the intractable distribution of interest as $Q_1$, to distinguish it from an auxiliary distribution, $Q_0$, that we will introduce. Without loss of generality, we write the distribution over a vector of random variables $s$, parameterized by a vector $w$, as

$$Q_1(s; w) = \frac{e^{H_1(s; w)}}{Z_1(w)} \qquad (6)$$

where the potential $H_1(s; w)$ is a known function of $s$ and $w$; e.g., in the case of a factorised model this would have the form $H_1 = \sum_k w_k s_k$. The corresponding (assumed intractable) normalizing constant is given by

$$Z_1(w) = \int ds\, e^{H_1(s; w)}. \qquad (7)$$

We will use a family of tractable distributions, $Q_0$, parameterized by $\theta$, which we write as

$$Q_0(s; \theta) = \frac{e^{H_0(s; \theta)}}{Z_0(\theta)} \quad \text{with} \quad Z_0(\theta) = \int ds\, e^{H_0(s; \theta)} \qquad (8)$$

where, by assumption, the normalizing constant $Z_0$ is tractably computable. Standard variational methods attempt to find the best approximating distribution $Q_0(s; \theta)$ that matches $Q_1(s; w)$ by minimizing the Kullback-Leibler divergence,

$$KL(Q_0, Q_1) = \int ds\, Q_0(s; \theta) \log \frac{Q_0(s; \theta)}{Q_1(s; w)}. \qquad (9)$$

Since, using Jensen's inequality, $KL(Q_0, Q_1) \ge 0$, we immediately obtain the lower bound

$$\log Z_1 \ge -\int ds\, Q_0(s; \theta) \log Q_0(s; \theta) + \int ds\, Q_0(s; \theta) H_1(s; w). \qquad (10)$$

Using the definition (8) we obtain the bound in the more intuitive form

$$\log Z_1 \ge \log Z_0 + \langle H_1 - H_0 \rangle_0, \qquad (11)$$

where $\langle \cdot \rangle_0$ denotes expectation with respect to $Q_0$. The lower bound is made as tight as possible by maximizing the right-hand side with respect to the parameters of $Q_0$, which corresponds to minimising the Kullback-Leibler divergence (9). This approach has recently received attention in the artificial intelligence community and has been demonstrated to be a useful technique (Saul et al., 1996; Jaakkola, 1997; Barber & Wiegerinck, 1999). Note that whilst this lower bound is useful in providing an objective comparison of two approximations to the normalising constant (the approximation with the higher bound value is preferred), it does not translate directly into a bound on conditional probabilities. An upper bound on the normalising constant is also required in this case. Whilst, in some cases, this may be feasible, in general it tends to be a rather more difficult task, and we restrict our attention to the lower bound (Jaakkola, 1997).
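As an illustration of how the bound (11) is tightened in practice, the sketch below (our own; the naive gradient ascent scheme and all names are illustrative choices, not the optimization scheme of the appendix) fits a factorised $H_0(s; \theta) = \theta \cdot s$ to a small pairwise binary model and compares the optimized bound to the exact $\log Z_1$:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N = 6
W = rng.standard_normal((N, N)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.standard_normal(N)

def bound(theta):
    """Right-hand side of eq. (11): log Z0 + <H1 - H0>_0, for factorised Q0."""
    m = 1.0 / (1.0 + np.exp(-theta))          # <s_i>_0
    log_Z0 = np.logaddexp(0.0, theta).sum()   # sum_i log(1 + e^theta_i)
    EH1 = 0.5 * m @ W @ m + b @ m             # <H1>_0, using <s_i s_j>_0 = m_i m_j
    return log_Z0 + EH1 - theta @ m           # theta @ m = <H0>_0

theta = np.zeros(N)
for _ in range(500):                          # naive gradient ascent on the bound
    m = 1.0 / (1.0 + np.exp(-theta))
    grad = m * (1 - m) * (b + W @ m - theta)  # d(bound)/d(theta)
    theta += 0.5 * grad

S = np.array(list(product([0, 1], repeat=N)), dtype=float)
log_Z1 = np.log(np.exp(0.5 * np.einsum('ki,ij,kj->k', S, W, S) + S @ b).sum())
print(bound(theta), '<=', log_Z1)             # the bound never exceeds log Z1
```

Setting the gradient to zero recovers the fixed point condition that reappears as equation (32) in section (5.1).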
In the next section, we describe another approach that enables us to exploit such tractable distributions further.

The Variational Cumulant Expansion
In order to extend the Kullback-Leibler variational lower bound with a view to improving its accuracy without much greater computational expense, we introduce the following family of probability distributions $Q_\epsilon$, parameterized by $\epsilon$, $w$ and $\theta$:

$$Q_\epsilon(s; w, \theta) = \frac{e^{H_\epsilon(s; w, \theta)}}{Z_\epsilon(w, \theta)} \qquad (12)$$

where the potential is given by

$$H_\epsilon(s; w, \theta) = \epsilon H_1(s; w) + (1 - \epsilon) H_0(s; \theta). \qquad (13)$$

The probability distributions in this family interpolate between a tractable distribution, $Q_0$, which is obtained for $\epsilon = 0$, and an intractable distribution, $Q_1$, which is obtained for $\epsilon = 1$.
For intermediate values of $\epsilon \in (0, 1)$, the distributions remain intractable. The normalizing constant of a distribution from this family is given by

$$\log Z_\epsilon(w, \theta) = \log \int ds\, e^{H_0(s; \theta) + \epsilon (H_1(s; w) - H_0(s; \theta))} = \log Z_0 + \log \left\langle e^{\epsilon (H_1(s; w) - H_0(s; \theta))} \right\rangle_0, \qquad (14)$$

where $\langle \cdot \rangle_0$ denotes expectation with respect to the tractable distribution $Q_0$. The final term in (14) is intractable for any $\epsilon \neq 0$, and we shall therefore develop a Taylor series expansion for it around $\epsilon = 0$.
Writing $\Delta H = H_1 - H_0$, the expansion reads

$$\log \left\langle e^{\epsilon \Delta H} \right\rangle_0 = \sum_{n=1}^{\infty} \frac{\epsilon^n}{n!}\, k_n^0(\Delta H), \qquad (15)$$

where $k_n^0(\Delta H)$ is the $n$th cumulant of $\Delta H$ with respect to the distribution $Q_0$. (The definition of the $n$th cumulant is $k_n^0(\Delta H) = \frac{\partial^n}{\partial \epsilon^n} \log \left\langle e^{\epsilon \Delta H} \right\rangle_0 \big|_{\epsilon = 0}$.) For convenience, we no longer denote the explicit dependence on the parameters $\theta$ and $w$.
Since the intractable normalizing constant $Z_1$ corresponds to setting $\epsilon = 1$ in equation (14), we can write the Taylor series as

$$\log Z_1 = \log Z_0 + \sum_{n=1}^{l} \frac{1}{n!}\, k_n^0(\Delta H) + \frac{1}{(l+1)!}\, k_{l+1}^{\epsilon}(\Delta H), \qquad (17)$$

where the remainder follows from the Mean Value Theorem (see, for example, Spivak, 1967), and $k_{l+1}^{\epsilon}(\Delta H)$ is the $(l+1)$th cumulant of $\Delta H$ with respect to $Q_\epsilon$, $0 \le \epsilon \le 1$. An approximation is obtained by simply neglecting this (intractable) remainder and using the truncated expansion:

$$\log Z_1 \approx \log Z_0 + \sum_{n=1}^{l} \frac{1}{n!}\, k_n^0(\Delta H). \qquad (18)$$

Note that the right-hand side of equation (18) depends on the free parameters $\theta$, and we shall discuss later how these parameters can be set to improve the accuracy of the approximation.
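The expansion is easy to verify numerically on a model small enough to enumerate. In the sketch below (ours; all names are illustrative), the first and second cumulants of $\Delta H$ under $Q_0$ are computed exactly, and the truncations (18) with $l = 1, 2$ are compared to the exact $\log Z_1$:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N = 6
W = rng.standard_normal((N, N)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.standard_normal(N)
theta = rng.standard_normal(N)                     # parameters of the tractable Q0

S = np.array(list(product([0, 1], repeat=N)), dtype=float)
H1 = 0.5 * np.einsum('ki,ij,kj->k', S, W, S) + S @ b   # 'intractable' potential
H0 = S @ theta                                          # factorised potential

log_Z0 = np.log(np.exp(H0).sum())
Q0 = np.exp(H0 - log_Z0)                                # exact Q0 over all states
dH = H1 - H0

k1 = Q0 @ dH                                            # first cumulant <dH>_0
k2 = Q0 @ dH**2 - k1**2                                 # second cumulant var_0(dH)

log_Z1 = np.log(np.exp(H1).sum())
print('exact      ', log_Z1)
print('first order', log_Z0 + k1)                       # the bound, eq. (20)
print('second ord.', log_Z0 + k1 + 0.5 * k2)            # truncation (18), l = 2
```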
In the following two subsections, we will look in more detail at the first and second-order approximations of this Taylor series, since these illuminate the variational method and the difficulties in retaining a lower bound.

First Order and the Variational Bound
The Taylor series (equation (17)) up to first order yields

$$\log Z_1 = \log Z_0 + \langle \Delta H \rangle_0 + \frac{1}{2} k_2^{\epsilon}(\Delta H), \qquad (19)$$

where the remainder $k_2^{\epsilon}(\Delta H)$ is equal to the variance (the second cumulant) of $\Delta H$ with respect to $Q_\epsilon$ with $0 \le \epsilon \le 1$, i.e., $k_2^{\epsilon} = \mathrm{var}_\epsilon(\Delta H) = \langle \Delta H^2 \rangle_\epsilon - \langle \Delta H \rangle_\epsilon^2$, and $\langle \cdot \rangle_\epsilon$ denotes expectation with respect to the distribution $Q_\epsilon$. Since the variance is nonnegative for any value of $\epsilon$, we have

$$\log Z_1 \ge \log Z_0 + \langle \Delta H \rangle_0, \qquad (20)$$

so that $\log Z_1$ is not only approximated but also bounded from below by the right-hand side of equation (20). This bound is the standard mean field bound (see also equation (10)), and is employed in many variational methods (see, for example, Saul et al., 1996; Jaakkola, 1997; Barber & Wiegerinck, 1999). This bound can then be made as tight as possible by optimizing with respect to the free parameters of the tractable distribution $Q_0$.

Second and Higher Order
To second order, equation (17) yields

$$\log Z_1 = \log Z_0 + \langle \Delta H \rangle_0 + \frac{1}{2} \mathrm{var}_0(\Delta H) + \frac{1}{6} k_3^{\epsilon}(\Delta H), \qquad (21)$$

with the remainder $k_3^{\epsilon} = \langle \Delta H^3 \rangle_\epsilon - 3 \langle \Delta H^2 \rangle_\epsilon \langle \Delta H \rangle_\epsilon + 2 \langle \Delta H \rangle_\epsilon^3$ and $0 \le \epsilon \le 1$. Since this remainder cannot be bounded in a tractable manner for general distributions $Q_0$ and $Q_1$, we have no obvious criterion to prefer one set of parameters of the tractable distribution over another. Similarly, for higher-order expansions it is unclear, within this expansion, how to obtain a tractable bound and, consequently, how to select a set of parameters $\theta$. We stress that any criterion is essentially a heuristic, since the error made by the resulting approximation is, by assumption, not tractably computable. Whilst, in principle, a whole range of criteria could be considered in this framework, we mention here only two. The first, the bound criterion, is arguably the simplest and most natural choice. The second criterion we briefly mention is the independence method, which we discuss mainly for its relation to the TAP method of statistical physics (Thouless et al., 1977).

Bound Criterion
The first criterion assigns the free parameters $\theta^*$ to maximize the (first-order) bound, equation (20). The resulting distribution $Q_0(s; \theta^*)$ is used in the truncated approximation, equation (18). To second order this is explicitly given by

$$\log Z_1 \approx \log Z_0^* + \langle \Delta H \rangle_0^* + \frac{1}{2} \mathrm{var}_0^*(\Delta H), \qquad (22)$$

where a `$*$' denotes that the distribution $Q_0(s; \theta^*)$ is to be used. Note that the computational cost involved over that of the variational method of section (4.1) is small. No further optimization is required, only the calculation of the second cumulant $\mathrm{var}_0^*(\Delta H)$, which by assumption is tractable.

Independence Criterion
The second criterion we consider is based on the fact that the exact value of $Z_1$ is independent of the free parameters $\theta$, i.e.,

$$\frac{\partial \log Z_1}{\partial \theta_i} \equiv 0. \qquad (23)$$

Since $\log Z_1$ is intractable, we substitute the truncated cumulant expansion (18) into equation (23). In general, there are many solutions resulting from the independence criterion, and alone it is too weak to provide reliable solutions. Under certain restrictions, however, this approach is closely related to the TAP approach of statistical mechanics (Thouless et al., 1977; Plefka, 1982; Kappen & Rodríguez, 1998a, 1998b), as will also be described in section (5.1.2).

Boltzmann Machines

The potential of a Boltzmann machine, with binary random variables $s_i \in \{0, 1\}$, is given by

$$H(s; w) = \sum_{i<j} w_{ij} s_i s_j + \sum_i w_i s_i \qquad (25)$$

with $w_{ij} = w_{ji}$ and $w_{ii} = 0$. Unfortunately, Boltzmann machines are in general intractable, since calculation of the normalizing constant involves a summation over an exponential number of states. In an early effort to overcome this intractability, Peterson and Anderson (1987) proposed the variational approximation using a factorised model as a fast alternative to stochastic sampling. However, Galland (1993) pointed out that this approach often fails since it inadequately captures second order statistics of the intractable distribution.
Using the potentially more accurate TAP approximation did not overcome these difficulties and often led to unstable solutions (Galland, 1993). Furthermore, it is not clear how to extend the TAP approach to deal with more complex, non-factorised approximating distributions. We examine the relationship of our approach to the TAP method in section (5.1.2). Another approach, which aims to improve the accuracy of the correlations, though not the normalising constant, is linear response (Parisi, 1988), which was applied to Boltzmann machines by Kappen and Rodríguez (1998a) for the case of a factorised model, see section (5.1). The approach we take here is a little different in that we wish to find a better approximation primarily to the normalising constant. Approaches such as linear response can then be applied to this approximation to derive approximations for the correlations, if desired.
More flexible approximating distributions have also been considered in the variational approach, for example, mixtures of factorised distributions (Jaakkola & Jordan, 1998; Lawrence et al., 1998), and decimatable structures (Saul & Jordan, 1994; Barber & Wiegerinck, 1999). We show how to combine the use of such more flexible approximating distributions with higher-order corrections to the variational procedure. For clarity and simplicity, we will initially approximate the normalizing constant of a general intractable Boltzmann machine $Q_1$, see equation (25), using only a factorised Boltzmann machine as our tractable model $Q_0$. Subsequently, in our simulations, we will use a more flexible tractable model, a decimatable Boltzmann machine.

Using Factorised Boltzmann Machines
The potential of a factorised Boltzmann machine is given by

$$H_0(s; \theta) = \sum_i \theta_i s_i. \qquad (26)$$

For this simple model, calculating higher-order cumulants is straightforward once the values of the free parameters $\theta$ are known, since

$$\langle s_{i_1} s_{i_2} \cdots s_{i_l} \rangle_0 = \langle s_{i_1} \rangle_0 \langle s_{i_2} \rangle_0 \cdots \langle s_{i_l} \rangle_0 \quad \text{with} \quad i_1 \neq i_2 \neq \cdots \neq i_l. \qquad (27)$$

For notational convenience, we define $m_i$ to be the expectation value of $s_i$,

$$m_i = \langle s_i \rangle_0 = \left(1 + \exp(-\theta_i)\right)^{-1}. \qquad (28)$$

Bound Criterion
Using a factorised distribution to approximate the intractable normalizing constant corresponding to equation (25) gives, from equation (20), the variational bound

$$\log Z_1 \ge \sum_{i<j} w_{ij} m_i m_j + \sum_i w_i m_i + S(m), \qquad (29)$$

where, for convenience, we have defined the entropy

$$S(m) = -\sum_i \left[ m_i \log m_i + (1 - m_i) \log(1 - m_i) \right]. \qquad (30)$$

One can either maximize the bound, i.e., the right-hand side of equation (29), directly, or use the fact that at the maximum

$$\frac{\partial}{\partial \theta_i} \left( \log Z_0 + \langle \Delta H \rangle_0 \right) = 0, \qquad (31)$$

which leads to the fixed point equation

$$\theta_i = w_i + \sum_j w_{ij} m_j \qquad (32)$$

(with the understanding that the means $m$ are related to the parameters $\theta$ by equation (28)). The well-known ``mean field'' equations (Parisi, 1988) are a re-expression of (32) in terms of the means only, given by (28). The optimal parameters of the factorised model might therefore also be obtained by iterating, either synchronously or asynchronously (Peterson & Anderson, 1987), these fixed point equations. Note that equation (31) also holds for minima and saddle-points, so in general additional checks are needed to assert that the parameters correspond to a maximum of (29). Insertion of the fixed point solution into the second-order expansion for the bound criterion, equation (22), yields (up to second order)

$$\log Z_1 \approx \sum_{i<j} w_{ij} m_i m_j + \sum_i w_i m_i + S(m) + \frac{1}{2} \sum_{i<j} w_{ij}^2\, m_i (1 - m_i)\, m_j (1 - m_j), \qquad (33)$$

where we emphasize that the $m_i$ are given by the variational bound solution, i.e., equation (32).
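A minimal sketch of this procedure (ours; the synchronous update scheme and all names are illustrative) iterates the mean field form of (32) and then adds the second-order correction of (33):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(W, w, n_iter=200):
    """Synchronous iteration of m_i = sigmoid(w_i + sum_j w_ij m_j), eqs. (32), (28)."""
    m = np.full(len(w), 0.5)
    for _ in range(n_iter):
        m = sigmoid(w + W @ m)
    return m

def log_Z_estimates(W, w):
    """First-order bound (29) and second-order corrected estimate (33)."""
    m = mean_field(W, w)
    eps = 1e-12                                   # guard against log(0)
    S = -np.sum(m * np.log(m + eps) + (1 - m) * np.log(1 - m + eps))
    first = 0.5 * m @ W @ m + w @ m + S           # right-hand side of (29)
    v = m * (1 - m)
    second = first + 0.25 * v @ (W**2) @ v        # adds (1/2) sum_{i<j} w_ij^2 v_i v_j
    return first, second
```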

Independence Criterion
Using a factorised distribution to approximate the intractable normalizing constant, see equation (25), gives up to second order

$$\log Z_1 \approx \log Z_0 + \langle \Delta H \rangle_0 + \frac{1}{2} \left[ \sum_i \Big( w_i - \theta_i + \sum_j w_{ij} m_j \Big)^2 m_i (1 - m_i) + \sum_{i<j} w_{ij}^2\, m_i (1 - m_i)\, m_j (1 - m_j) \right].$$

The independence criterion leads to the equations

$$\frac{1}{2} (1 - 2 m_i) \left( \sum_j w_{ij}^2\, m_j (1 - m_j) + \Big( \theta_i - w_i - \sum_j w_{ij} m_j \Big)^2 \right) = \sum_j w_{ij}\, m_j (1 - m_j) \Big( \theta_j - w_j - \sum_k w_{jk} m_k \Big). \qquad (34)$$

In general, there are many solutions to these equations, and the search can be limited by inserting in equation (34) the constraint that the second-order solution is close to the first-order solution, i.e., $\theta_i = w_i + \sum_j w_{ij} m_j + O(w^2)$, and by neglecting terms of $O(w^4)$ and higher. This simplifies equation (34) to $\partial F_{TAP} / \partial \theta = 0$, where $F_{TAP}$ is the negative TAP free energy (Thouless et al., 1977; Plefka, 1982; Kappen & Rodríguez, 1998a),

$$F_{TAP}(m) = \sum_{i<j} w_{ij} m_i m_j + \sum_i w_i m_i + S(m) + \frac{1}{2} \sum_{i<j} w_{ij}^2\, m_i (1 - m_i)\, m_j (1 - m_j). \qquad (35)$$

The independence criterion, i.e., $\partial F_{TAP} / \partial \theta = 0$, leads now to the fixed point condition

$$m_i = \left( 1 + \exp\Big( -w_i - \sum_j w_{ij} m_j - \frac{1}{2} (1 - 2 m_i) \sum_j w_{ij}^2\, m_j (1 - m_j) \Big) \right)^{-1}. \qquad (36)$$

These TAP equations can have poor convergence properties and often produce a poor solution (Bray & Moore, 1979; Nemoto & Takayama, 1985; Galland, 1993). The additional constraint that the TAP solution should correspond to a minimum of $F_{TAP}$ improves the solution. However, since TAP is therefore essentially an expansion around the first-order solution, we expect the numerical difference between the bound criterion and convergent TAP results to be small. For these reasons, we prefer the straightforward bound criterion and leave other criteria for separate study.

Figure 1: The Boltzmann machine with the random variable $s_1$ can be decimated to a Boltzmann machine without $s_1$.
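For comparison with the mean field sketch above, a sketch of the TAP fixed point (36) is given below (ours; the damping of the updates is our own choice, included because undamped TAP iterations are known to converge poorly):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tap_means(W, w, n_iter=1000, damp=0.5):
    """Damped iteration of the TAP condition (36) for a {0,1} Boltzmann machine."""
    m = np.full(len(w), 0.5)
    for _ in range(n_iter):
        v = m * (1 - m)
        field = w + W @ m + 0.5 * (1 - 2 * m) * ((W**2) @ v)  # Onsager correction
        m = damp * m + (1 - damp) * sigmoid(field)
    return m
```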

Numerical Results
In this section we present results of computer simulations to validate our approach. In section (5.2.1) we compare the different methods of section (4.1) and section (4.2) for approximating the normalizing constant $Z$. A similar comparison is made for approximating the correlations in section (5.2.2). In section (5.2.3) we show the results of a learning problem in which a Boltzmann machine is used to learn a set of sample patterns, a typical machine learning problem. In all cases, the numbers of random variables in the Boltzmann machines are chosen to be small to facilitate comparisons with the exact results.
In some simulations we use not only factorised Boltzmann machines but also more flexible models that can make the variational bound (see equation (20)) tighter, namely decimatable Boltzmann machines (Saul & Jordan, 1994; Rüger, 1997; Barber & Wiegerinck, 1999). Decimation is a technique that eliminates random variables so that the normalizing constant for the distribution on the remaining variables remains unchanged up to a constant known factor. For completeness, we briefly describe this technique in the current context.
Suppose we have a Boltzmann machine with many random variables, for which a three-variable subgraph is depicted in Figure 1(a), so that random variable $s_1$ is connected only to random variables $s_2$ and $s_3$. Mathematically, the condition for invariance of $Z$ after removing random variable $s_1$, Figure 1(b), becomes

$$\sum_{s \setminus s_1} e^{\theta_0' + \theta_2' s_2 + \theta_3' s_3 + \theta_{23}' s_2 s_3} = \sum_{s} e^{\theta_0 + \theta_1 s_1 + \theta_2 s_2 + \theta_3 s_3 + \theta_{12} s_1 s_2 + \theta_{13} s_1 s_3 + \theta_{23} s_2 s_3}, \qquad (37)$$

where we include the constant $\theta_0$ for convenience. Invariance is fulfilled if

$$\theta_2' = \theta_2 + \log \frac{1 + e^{\theta_1 + \theta_{12}}}{1 + e^{\theta_1}}, \quad \theta_3' = \theta_3 + \log \frac{1 + e^{\theta_1 + \theta_{13}}}{1 + e^{\theta_1}}, \quad \theta_{23}' = \theta_{23} + \log \frac{\left(1 + e^{\theta_1}\right)\left(1 + e^{\theta_1 + \theta_{12} + \theta_{13}}\right)}{\left(1 + e^{\theta_1 + \theta_{12}}\right)\left(1 + e^{\theta_1 + \theta_{13}}\right)} \quad \text{and} \quad \theta_0' = \theta_0 + \log\left(1 + e^{\theta_1}\right). \qquad (38)$$

For decimatable Boltzmann machines, one can repeat the decimation process until, finally, one ends up with a Boltzmann machine which consists of a single random variable, whose normalizing constant is trivial to compute. This means that for decimatable Boltzmann machines the normalizing constant can be computed in time linear in the number of variables. Decimation in undirected models is similar in spirit to variable elimination schemes in graphical models. For example, in directed belief networks, ``bucket elimination'' enables variables to be eliminated by passing compensating messages to other buckets (collections of nodes) such that the desired marginal on the remaining nodes remains the same (Dechter, 1999).
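A sketch of a single decimation step (ours; parameter names follow the $\theta$ notation above) implements equation (38) and checks the invariance (37) by brute force:

```python
import numpy as np

def log1pexp(x):
    return np.logaddexp(0.0, x)          # log(1 + e^x), numerically stable

def decimate(th0, th1, th2, th3, th12, th13, th23):
    """Remove s1, connected only to s2 and s3; return the updated parameters (38)."""
    th2n  = th2  + log1pexp(th1 + th12) - log1pexp(th1)
    th3n  = th3  + log1pexp(th1 + th13) - log1pexp(th1)
    th23n = th23 + (log1pexp(th1) + log1pexp(th1 + th12 + th13)
                    - log1pexp(th1 + th12) - log1pexp(th1 + th13))
    th0n  = th0  + log1pexp(th1)
    return th0n, th2n, th3n, th23n

# Brute-force check of the invariance condition (37).
th = np.random.default_rng(2).standard_normal(7)
lhs = sum(np.exp(th[0] + th[1]*s1 + th[2]*s2 + th[3]*s3
                 + th[4]*s1*s2 + th[5]*s1*s3 + th[6]*s2*s3)
          for s1 in (0, 1) for s2 in (0, 1) for s3 in (0, 1))
t0, t2, t3, t23 = decimate(*th)
rhs = sum(np.exp(t0 + t2*s2 + t3*s3 + t23*s2*s3)
          for s2 in (0, 1) for s3 in (0, 1))
print(lhs, rhs)  # equal up to rounding
```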

Approximating the Normalizing Constant
As our `intractable' distribution $Q_1$, we took a fully connected, eight-node Boltzmann machine, which is not decimatable (see Figure 2(c)). As our tractable Boltzmann machine, $Q_0$, we took both the factorised model of Figure 2(a) and the decimatable model of Figure 2(b). We then tried to approximate the normalizing constant of the fully connected machine, in which the parameters were randomly drawn from a zero-mean Gaussian distribution with unit variance. See the appendix for additional theoretical and experimental details regarding the optimization scheme used. The relative error of the approximation,

$$\epsilon_{rel} = \frac{\log \hat{Z}_1 - \log Z_1}{\log Z_1}, \qquad (39)$$

where $\hat{Z}_1$ denotes the approximation, was then determined for both the first and second-order approaches, using both the factorised and decimatable models. This was repeated 550 times for different random drawings of the parameters of the fully connected Boltzmann machine. The resulting histograms are depicted in Figure 3. In these histograms one can clearly see that the bound, while present in the first-order approach, is lost in the second-order approach, since some normalising constant estimates were above the true value, violating the lower bound. In five out of 1100 cases (two times 550) the synchronous mean field iterations did not converge, and these outliers were neglected in the plots. It is, however, possible to make the number of non-convergent cases even lower by using asynchronous instead of synchronous updates of the mean field parameters (Peterson & Anderson, 1987; Ansari, Hou, & Yu, 1995; Saul et al., 1996). Note that, in both cases, the second-order method improves on average on the standard (first order) variational results. Indeed, using only a factorised model with the second-order correction improves on the standard variational decimatable result.
We also determined whether the second-order approach improves the accuracy of the prediction in each individual run. To determine this, we used the paired difference between the magnitudes of the first and second-order relative errors,

$$\Delta = |\epsilon_{rel}^{(1)}| - |\epsilon_{rel}^{(2)}|. \qquad (40)$$

If $\Delta$ is positive, the second order improves on the first order approach. The corresponding results of our computer simulations are depicted in Figure 4. In all runs $\Delta$ was positive, corresponding to a consistent improvement in accuracy.

Approximating the Means and Correlations
There are many techniques that can be called upon to approximate moments and correlations. Primarily, the cumulant expansion, as we have described it, is intended to improve the accuracy in estimating the normalising constant, and not directly the moments of the distribution. Arguably the simplest way to approximate correlations is to use the standard variational approach to find the best approximating distribution to $Q_1$ and then to approximate the moments of $Q_1$ by the moments of $Q_0$. In this approach, however, certain correlations that are not present in the structure of the approximating distribution $Q_0$ will be trivially approximated. For example, approximating $\langle s_i s_j \rangle_{Q_1}$ using the factorised distribution gives $\langle s_i s_j \rangle_{Q_1} \approx \langle s_i \rangle_{Q_0} \langle s_j \rangle_{Q_0}$. We will examine in this section some ways of improving the estimation of the correlations.
One procedure that can be used to approximate correlations such as $\langle s_i s_j \rangle_{Q_1}$ is given by the so-called linear response theory (Parisi, 1988). This approach uses the relationship (5), in which the intractable normalising constant is replaced with its approximation (29).
The approach that we adopt here, due to its straightforward relationship to normalising constants, is given by considering the following relationship:

$$\langle s_i s_j \rangle = e^{h_i + h_j + J_{ij}}\, \frac{Z_{[i,j]}}{Z}, \qquad (41)$$

where $Z_{[i,j]}$ is the normalising constant of the original network in which nodes $i$ and $j$ have been removed and the biases are transformed to $h_k' = h_k + J_{ik} + J_{jk}$. This means that we need to evaluate $O(n^2)$ normalising constants to approximate the correlations. To approximate the normalising constants, we use the higher-order expansion (33). Similarly, we attempt to obtain a more accurate approximation to the means $\langle s_i \rangle$, based on the identity

$$\langle s_i \rangle = e^{h_i}\, \frac{Z_{[i]}}{Z}, \qquad (42)$$

where $Z_{[i]}$ is the normalising constant of the original network in which node $i$ has been removed and the biases are transformed to $h_k' = h_k + J_{ik}$. The ``intractable'' Boltzmann machine in our simulations has 8 fully connected nodes, with biases and weights drawn from a standard zero-mean, unit-variance Gaussian distribution. In Figures 5 and 6 we plot results for approximating the means $\langle s_i \rangle$ and correlations $\langle s_i s_j \rangle$, respectively. As can be seen, the approach we take to approximate the correlations is an improvement on the standard variational factorised model. We emphasize that, since we are primarily concerned with approximating the normalising constant, there may be more suitable methods dedicated to the task of approximating the correlations themselves. Indeed, one of the drawbacks of these perturbative approaches is that physical constraints, such as that moments should be bounded between 0 and 1, can be violated.
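The following sketch (ours; it uses exact enumeration for the sub-network constants, where the text would instead substitute the second-order approximation (33)) illustrates the ratio identity (41):

```python
import numpy as np
from itertools import product

def log_Z(J, h):
    """Exact log normalising constant by enumeration (for illustration only)."""
    N = len(h)
    S = np.array(list(product([0, 1], repeat=N)), dtype=float)
    E = 0.5 * np.einsum('ki,ij,kj->k', S, J, S) + S @ h
    m = E.max()
    return m + np.log(np.exp(E - m).sum())

def corr(J, h, i, j):
    """<s_i s_j> = exp(h_i + h_j + J_ij) Z_[i,j] / Z, equation (41)."""
    keep = [k for k in range(len(h)) if k not in (i, j)]
    h_sub = h[keep] + J[keep][:, i] + J[keep][:, j]   # transformed biases h'_k
    J_sub = J[np.ix_(keep, keep)]
    return np.exp(h[i] + h[j] + J[i, j]
                  + log_Z(J_sub, h_sub) - log_Z(J, h))

rng = np.random.default_rng(3)
N = 8
J = rng.standard_normal((N, N)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
h = rng.standard_normal(N)
print(corr(J, h, 0, 1))
```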

Learning
The results of the previous section show that, for randomly chosen connection matrices $J$, the higher-order terms can significantly improve the approximation to the normalising constant of the distribution. However, in a learning scenario, such results may be of little interest, since the connection matrix will become intimately related to the patterns to be learned; that is, $J$ will take on a form that is appropriate for storing the patterns in the training set as learning progresses.
We are therefore interested here in training networks on a set of visible patterns. That is, we split the nodes of the network into a set of visible units $S_V$ and a set of hidden units $S_H$, with the aim of learning a set of patterns $S_V^1, \ldots, S_V^p$ with a Boltzmann machine $Q_1(S_V, S_H)$. By using the KL divergence between an approximating distribution $\tilde{Q}_0(S_H | S_V)$ on the hidden units and the conditional distribution $Q_1(S_H | S_V)$, we obtain the following bound:

$$\ln Q_1(S_V) \ge -\int \tilde{Q}_0(S_H | S_V) \ln \tilde{Q}_0(S_H | S_V) + \int \tilde{Q}_0(S_H | S_V) \ln Q_1(S_H, S_V). \qquad (43)$$

For the case of Boltzmann machines, $\ln Q_1(S_H, S_V) = H(S_H, S_V) - \ln Z$, in which $\ln Z$ is (assumed) intractable. In order to obtain an approximation to this quantity, we therefore introduce a further variational distribution, $Q_0(S_V, S_H)$ (not the same as $\tilde{Q}_0(S_H | S_V)$), which can be used as in (10) to obtain a lower bound on $\ln Z$. Unfortunately, used in (43), this lower bound on $\ln Z$ does not give a lower bound on the likelihood of the visible units $Q_1(S_V)$. Nevertheless, we may hope that the approximation to the likelihood gradient is sufficiently accurate that ascent of the likelihood can be made. Taking the derivative of (43) with respect to the parameters of the Boltzmann machine $Q_1(S_H, S_V)$, we arrive at the following learning rule for gradient ascent given a pattern $S_V$:

$$\Delta J_{ij} = \eta \left( \langle s_i s_j \rangle_{\tilde{Q}_0(S_H | S_V)} - \langle s_i s_j \rangle_{Q_1(S_H, S_V)} \right) \qquad (44)$$

where $\eta$ is a chosen learning rate. The correlations in the first term of (44) are straightforward to compute in the case of a factorised distribution $\tilde{Q}_0$. The ``free'' expectations in the second term are more troublesome and need to be approximated in some manner. We examine using the standard factorised model approximations for the free correlations, and more accurate ``higher-order'' approximations, as described in section (5.2.2). Here, we are primarily concerned with monitoring the likelihood bound (43) under such gradient dynamics, so we will look at the exact bound value, its approximation using a factorised model for $\ln Z$ (29), and its second-order correction (33).
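A minimal sketch of the resulting training loop is given below (ours; it uses the simple factorised approximations for both terms of (44) and updates only the weights, so it is an illustration of the rule rather than the full experimental setup):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def learning_step(J, h, patterns, visible, hidden, eta=0.05, n_iter=50):
    """One gradient step of eq. (44), with factorised (mean field) statistics."""
    dJ = np.zeros_like(J)
    for sv in patterns:
        # Clamped phase: visibles fixed to the pattern, hidden means by mean field.
        s = np.empty(len(h))
        s[visible] = sv
        s[hidden] = 0.5
        for _ in range(n_iter):
            s[hidden] = sigmoid(h[hidden] + (J @ s)[hidden])
        dJ += np.outer(s, s)
        # Free phase: <s_i s_j> approximated by m_i m_j from mean field, eq. (32).
        m = np.full(len(h), 0.5)
        for _ in range(n_iter):
            m = sigmoid(h + J @ m)
        dJ -= np.outer(m, m)
    dJ *= eta / len(patterns)
    np.fill_diagonal(dJ, 0.0)                 # keep J_ii = 0
    return J + dJ
```

The free statistics above could equally be replaced by the ratio-of-normalising-constants estimates of section (5.2.2), as in Figure 7(b).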
We consider a small learning problem with 3 hidden units and 4 visible units. Ten visible training patterns were formed in which the elements were chosen to be on (value = 1) with probability 0.4. In Figure 7(a) we demonstrate variational learning in which the free statistics, required for the likelihood gradient ascent rule (44), are approximated with a simple factorised assumption, $\langle s_i s_j \rangle_{Q_1(S_H, S_V)} \approx m_i m_j$ ($i \neq j$). We display the time evolution of the exact training pattern likelihood bound (43) (solid line), and two approximations to it: the first approximates the intractable $\ln Z$ term by using the standard variational approach with a factorised model (dashed line). The second approximation uses the second-order correction to the factorised model value for $\ln Z$ (dot-dash line). The learning rate was fixed at 0.05. In monitoring the progress of learning, the higher-order correction to the likelihood bound can be seen to be a more accurate approximation of the likelihood than that given by the use of the standard variational approximation alone. Interestingly, using the second-order correction to the likelihood, a maximum in the likelihood is detected at a point close to saturation of the exact bound. This is not the case using the first-order approximation alone and, with the standard approximation to the likelihood, the dynamics of the learning process, in terms of the training set likelihood, are poorly monitored.
In Figure 7(b) the gradient dynamics are again given by equation (44), but now with the free correlations $\langle s_i s_j \rangle_{Q_1(S_H, S_V)}$ approximated by the more accurate ratio-of-normalising-constants method, as described in section (5.2.2). Again, we see an improvement in the accuracy of monitoring the progress of the learning dynamics by the simple inclusion of the higher-order term and, again, the higher-order approximation displays a maximum at roughly the correct position. The likelihood is higher than in Figure 7(a) since the gradient of the likelihood bound is more accurately approximated through the more accurate correlation estimates. Note that the reason that the exact likelihood lower bound can be above 1 is due to pattern repetitions in the training set. The instabilities in learning after approximately 120 updates are due to the emergence of multimodality in the learned distribution $Q_1(S_H, S_V | J)$, indicating that a more powerful approximation method, able to capture such multimodal effects, should be used to obtain further improvements in the likelihood.

Discussion
In this article we have described perturbational approximations of intractable probability distributions. The approximations are based on a Taylor series using a family of probability distributions that interpolate between the intractable distribution and a tractable approximation thereto. These approximations can be seen as an extension of the variational method, although they no longer bound the quantities of interest. We have illustrated our approach both theoretically and by computer simulations for Boltzmann machines. These simulations showed that the approximation can be improved beyond the variational bound by including higher-order terms of the corresponding cumulant expansion. Simulations also showed that the accuracy in monitoring the training set likelihood during the learning process can be improved by including higher-order corrections. However, these perturbational approximations cannot be expected to consistently improve on zeroth order (variational) solutions. For instance, if the distribution is strongly multimodal, then a unimodal variational distribution cannot be expected to be improved much by the inclusion of higher-order perturbational terms. On the other hand, higher-order corrections to multimodal approximations may well improve the solution considerably.
We elucidated the relationship of our approach to TAP, arguing that our approach is expected to offer a more stable solution. Furthermore, the application of our approach to models other than the Boltzmann machine is transparent. Indeed, these techniques are readily applicable to other variational methods in a variety of contexts, both for discrete and continuous systems (Barber & Bishop, 1998; Wiegerinck & Barber, 1998).
One drawback of the perturbational approach that we have described is that known ``physical'' constraints, for example that moments must be bounded in a certain range, are not necessarily adhered to. It would be useful to develop a perturbation method that ensures that solutions are at least physical.
We hope to apply these methods to variational techniques other than the standard Kullback-Leibler bound, and also to study the evaluation of other suitable criteria for the setting of the variational parameters.

Appendix

Setting the gradient of the bound (11) with respect to the parameters $\theta$ of a tractable Boltzmann machine to zero leads to the fixed point equation

$$\theta_I = \sum_J \left( F^{-1} \right)_{IJ} \left( \langle H_1 s_J \rangle_0 - \langle H_1 \rangle_0 \langle s_J \rangle_0 \right), \qquad (49)$$

where the Fisher matrix is given by $F_{IJ} = \langle s_I s_J \rangle_0 - \langle s_I \rangle_0 \langle s_J \rangle_0$, which depends on the free parameters $\theta$; here $F$ is the Fisher matrix of the tractable Boltzmann machine, and the indices $I, J$ range over its sufficient statistics. For a factorised approximating model, $F$ is a diagonal matrix and equation (49) simplifies to equation (32). For a decimatable Boltzmann machine, the elements of the Fisher matrix are tractable. The iteration can be performed either synchronously or asynchronously (Peterson & Anderson, 1987). We prefer here the synchronous case, since for all $N$ parameter updates we only need to calculate and invert a single Fisher matrix, instead of $N$ matrices in the asynchronous case. In most applications, however, the asynchronous method seems to be advantageous with respect to convergence (Peterson & Anderson, 1987; Ansari et al., 1995; Saul et al., 1996).

Figure 2: Boltzmann machines of eight random variables.

Figure 3: Approximation of the normalizing constant of a fully connected, eight-node Boltzmann machine (see Figure 2(c)), whose parameters are randomly drawn from a zero-mean Gaussian distribution with unit variance. A factorised model (see Figure 2(a)) is used as the approximating distribution in (a) and (b), and a decimatable model (see Figure 2(b)) is used in (c) and (d). The histograms are based on 550 different random drawings of the parameters. For the readability of the histogram, we have excluded five outliers (four in (a) and (b), one in (c) and (d)), for which the synchronous mean field iteration did not converge. The mean error is the mean of the absolute value of equation (39).

Figure 4: Difference between first and second-order approximations of the normalizing constant of a fully connected, eight-node Boltzmann machine using (a) a factorised and (b) a decimatable model. The histograms are based on the same 550 random drawings as used in Figure 3. All 1100 differences, including the five non-convergent runs, were larger than zero. Again, for readability of the histogram, the non-convergent runs are not included. The mean difference was 0.0281 (0.0269 without the non-convergent runs) and 0.0155 (0.0154 without the non-convergent run) for the factorised model and the decimatable model, respectively.

Figure 5: Approximating the means $\langle s_i \rangle$ for 1000 randomly chosen eight-node, fully connected Boltzmann machines. The weights were drawn from the standard normal distribution. (a) The standard variational approach, in which the means are estimated by the means of the best approximating factorised distribution. (b) Using the ratio of normalising constants (41), in which second-order corrections are included in approximating the normalising constants. Using the ratio of normalising constants without the higher-order corrections gave a mean error of 0.046, slightly worse than the standard variational result.

Figure 6: Approximating the correlations $\langle s_i s_j \rangle$ for 1000 randomly chosen eight-node, fully connected Boltzmann machines. The weights were drawn from the standard normal distribution. (a) The standard variational approach, in which the correlations are estimated from the means of the best approximating factorised distribution, and (b) using the ratio of normalising constants, including second-order corrections. Without second-order corrections, this gives a mean error of 0.0428.

Figure 7: Learning a set of 10 random, 4-visible-unit patterns with a Boltzmann machine with 3 hidden units. In (a) the free correlations are approximated by the simple factorised assumption; in (b) learning uses dynamics in which the free correlations are given by the ratio-of-normalising-constants approach. The solid line is the exact value of the total likelihood bound (43) of all the patterns in the training set. The dashed line is the likelihood bound approximation based on using a factorised model to approximate $\ln Z$, equation (29). The dash-dot line uses the second-order correction to $\ln Z$, equation (33).