Multi-fidelity Gaussian Process Bandit Optimisation

In many scientific and engineering applications, we are tasked with the optimisation of an expensive to evaluate black box function $f$. Traditional settings for this problem assume just the availability of this single function. However, in many cases, cheap approximations to $f$ may be obtainable. For example, the expensive real world behaviour of a robot can be approximated by a cheap computer simulation. We can use these approximations to eliminate low function value regions cheaply and use the expensive evaluations of $f$ in a small but promising region and speedily identify the optimum. We formalise this task as a \emph{multi-fidelity} bandit problem where the target function and its approximations are sampled from a Gaussian process. We develop MF-GP-UCB, a novel method based on upper confidence bound techniques. In our theoretical analysis we demonstrate that it exhibits precisely the above behaviour, and achieves better regret than strategies which ignore multi-fidelity information. Empirically, MF-GP-UCB outperforms such naive strategies and other multi-fidelity methods on several synthetic and real experiments.


Introduction
In stochastic bandit optimisation, we wish to optimise a function f : X → R by sequentially querying it and obtaining bandit feedback, i.e. when we query at any x ∈ X , we observe a possibly noisy evaluation of f (x). f is typically expensive and the goal is to identify its maximum while keeping the number of queries as low as possible. Some applications are hyper-parameter tuning in expensive machine learning algorithms (Snoek, Larochelle, & Adams, 2012), optimal policy search in complex systems (Martinez-Cantin, de Freitas, Doucet, & Castellanos, 2007), online advertising (Kar, Li, Narasimhan, Chawla, & Sebastiani, 2016), scientific experiments (Parkinson, Mukherjee, & Liddle, 2006), and statistical tasks such as collaborative filtering (S. Li, Karatzoglou, & Gentile, 2016) and clustering (Gentile et al., 2017). Historically, bandit problems were studied in settings where the goal is to maximise the cumulative reward of all queries to the payoff instead of just finding the maximum. Applications in this setting include clinical trials and online advertising.
Conventional methods in these settings assume access to only this single expensive function of interest f. We will collectively refer to them as single fidelity methods. In many practical problems however, cheap approximations to f might be available. For instance, when tuning hyper-parameters of learning algorithms, the goal is to maximise a cross validation score on a training set, which can be expensive if the training set is large. However, validation curves tend to vary smoothly with training set size; therefore, we can train and cross validate on small subsets to approximate the validation accuracies of the entire dataset. For a concrete example, consider kernel density estimation (KDE), where we need to tune the bandwidth h of a kernel when using a dataset of size 3000. Figure 1 shows the average cross validation likelihood against h for a dataset of size n = 3000 and a smaller subset of size n = 300. Since the cross validation performance of a hyper-parameter depends on the training set size (Vapnik & Vapnik, 1998), we can obtain only a biased estimate of the cross validation performance with 3000 points using a subset of size 300. Consequently, the two maximisers are also different. That said, the curve for n = 300 approximates the n = 3000 curve quite well. Since training and cross validation on small n is cheap, we can use it to eliminate bad values of the hyper-parameters and reserve the expensive experiments with the entire dataset for the promising hyper-parameter values (for example, the boxed region in Figure 1).
Figure 1: Average 5-fold CV log likelihood on datasets of size n = 300 and n = 3000 on a synthetic kernel density estimation task. The crosses are the maxima.
© 2019 AI Access Foundation. All rights reserved.
In the conventional treatment for online advertising, each query to f is, say, the public display of an ad on the internet for a certain time period. However, we could also choose smaller experiments by, say, confining the display to a small geographic region and/or for shorter periods. The estimate is biased, since users in different geographies are likely to have different preferences, but will nonetheless be useful in gauging the all round performance of an ad. In optimal policy search in robotics and autonomous driving, vastly cheaper computer simulations are used to approximate the expensive real world performance of the system (Cutler, Walsh, & How, 2014; Urmson et al., 2008). Scientific experiments can be approximated to varying degrees using less expensive data collection, analysis, and computational techniques (Parkinson et al., 2006).
In this paper, we cast these tasks as multi-fidelity bandit optimisation problems assuming the availability of cheap approximate functions (fidelities) to the payoff f . Our contributions are:

Related Work
Since the seminal work by Robbins (1952), the multi-armed bandit problem has been studied extensively in the K-armed setting. Recently, there has been a surge of interest in the optimism under uncertainty principle for K-armed bandits, typified by upper confidence bound (UCB) methods (Auer, 2003; Bubeck & Cesa-Bianchi, 2012). UCB strategies have also been used in bandit tasks with linear (Dani, P. Hayes, & Kakade, 2008) and GP (Srinivas, Krause, Kakade, & Seeger, 2010) payoffs. There is a plethora of work on single fidelity methods for global optimisation both with noisy and noiseless evaluations. Some examples are branch and bound techniques such as dividing rectangles (DiRect), simulated annealing, genetic algorithms and more (Jones, Perttunen, & Stuckman, 1993; Kawaguchi, Kaelbling, & Lozano-Pérez, 2015; Kirkpatrick, Gelatt, & Vecchi, 1983; Munos, 2011). A suite of single fidelity methods in the GP framework closely related to our work is Bayesian Optimisation (BO). While there are several techniques for BO (Hernández-Lobato, Hoffman, & Ghahramani, 2014; Jones, Schonlau, & Welch, 1998; Mockus, 1994; Thompson, 1933), of particular interest to us is the Gaussian process upper confidence bound (GP-UCB) algorithm of Srinivas et al. (2010).
Many applied domains of research such as aerodynamics, industrial design and hyper-parameter tuning have studied multi-fidelity methods (Forrester, Sóbester, & Keane, 2007; Huang, Allen, Notz, & Miller, 2006; Klein, Bartels, Falkner, Hennig, & Hutter, 2015; L. Li, Jamieson, DeSalvo, Rostamizadeh, & Talwalkar, 2017; Swersky, Snoek, & Adams, 2013); a plurality of them use BO techniques. However, these treatments neither formalise nor analyse any notion of regret in the multi-fidelity setting. In contrast, MF-GP-UCB is an intuitive UCB idea with good theoretical properties. Bogunovic, Scarlett, Krause, and Cevher (2016) study a version of BO where an algorithm might use cheap, noisy, yet unbiased approximations to a function f; but as we will explain in Section 2, this is different to the multi-fidelity problem. Agarwal, Duchi, Bartlett, and Levrard (2011) derive oracle inequalities for hyper-parameter tuning with ERM under computational budgets. Our setting is more general as it applies to any bandit optimisation task. Sabharwal, Samulowitz, and Tesauro (2015) present a UCB based idea for tuning hyper-parameters with incremental data allocation. However, their theoretical results are for an idealised non-realisable algorithm. Cutler et al. (2014) study reinforcement learning with multi-fidelity simulators by treating each fidelity as a Markov Decision Process. Finally, Zhang and Chaudhuri (2015) study active learning when there is access to a cheap weak labeler and an expensive strong labeler. These works study problems different to optimisation. This is the first line of work to formalise a notion of regret and provide a theoretical analysis for multi-fidelity optimisation.
Subsequent to our work, there has been a line of research on multi-fidelity optimisation in various settings. Shakkottai (2018, 2019) develop an algorithm in frequentist settings which builds on the key intuitions here, i.e. query at low fidelities and proceed higher only when the uncertainty has shrunk. In addition, Song, Chen, and Yue (2018) develop a Bayesian algorithm which chooses fidelities based on the mutual information. Poloczek, Wang, and Frazier (2017) and Wu, Toscano-Palmerin, Frazier, and Wilson (2019) use knowledge gradient methods for multi-fidelity Bayesian optimisation, while Hoag and Doppa (2018) use techniques from search based optimisation for this problem.
The remainder of this manuscript is organised as follows. Section 2 presents our formalism including a notion of simple regret for multi-fidelity GP optimisation. Section 4 presents our algorithm. We present our theoretical results in Section 5 beginning with an informal discussion of results for M = 2 fidelities in Section 5.1 to elucidate the main ideas. The proofs are given in Section 8. Section 7 presents our experiments with some details deferred to Appendix A. Appendix B collects some ancillary material including a table of notations and abbreviations in Appendix B.2.

Problem Set Up
We wish to maximise a function $f : \mathcal{X} \to \mathbb{R}$ where $\mathcal{X}$ is a finite discrete or compact subset of $[0, r]^d$, where $r > 0$ and $d$ is the dimension of $\mathcal{X}$. We can interact with $f$ only by querying it at some $x \in \mathcal{X}$ and obtaining a noisy evaluation $y = f(x) + \epsilon$ of $f$, where the noise satisfies $\mathbb{E}[\epsilon] = 0$. Let $x_\star \in \operatorname{argmax}_{x \in \mathcal{X}} f(x)$ be a maximiser of $f$ and $f_\star = f(x_\star)$ be the maximum value. Let $x_t \in \mathcal{X}$ be the point queried at time $t$ by a sequential procedure. The goal in bandit optimisation is to achieve small simple regret $S_n$, defined below, after $n$ queries to $f$.
$$S_n = \min_{t = 1, \dots, n} f_\star - f(x_t). \qquad (1)$$
Our primary distinction from the usual setting is that we have access to $M - 1$ successively accurate approximations $f^{(1)}, f^{(2)}, \dots, f^{(M-1)}$ to the function of interest $f = f^{(M)}$. We refer to these approximations as fidelities. The multi-fidelity framework is attractive when the following two conditions are true about the problem.
1. The approximations $f^{(1)}, \dots, f^{(M-1)}$ approximate $f^{(M)}$. To this end, we will assume a uniform bound for the fidelities, $\|f^{(M)} - f^{(m)}\|_\infty \le \zeta^{(m)}$ for $m = 1, \dots, M-1$.
2. The approximations are cheaper than evaluating at $f^{(M)}$. We will assume that a query at fidelity $m$ expends a cost $\lambda^{(m)}$ of a resource, such as computational effort or money. The costs are known and satisfy $0 < \lambda^{(1)} < \lambda^{(2)} < \dots < \lambda^{(M)}$.
Above, and throughout this manuscript, for any $h : \mathcal{X} \to \mathbb{R}$, we define $\|h\|_\infty = \sup_{x \in \mathcal{X}} |h(x)|$. As the fidelity $m$ increases, the approximations become better but are also more costly. An algorithm for multi-fidelity bandits is a sequence of query-fidelity pairs $\{(x_t, m_t)\}_{t \ge 0}$, where at time $n$, the algorithm chooses $(x_n, m_n)$ using information from previous query-observation-fidelity triples $\{(x_t, y_t, m_t)\}_{t=1}^{n-1}$. Here $y_t = f^{(m_t)}(x_t) + \epsilon_t$, where the $\epsilon_t$ values are independent noise at each time step $t$ and $\mathbb{E}[\epsilon_t] = 0$.
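Some smoothness assumptions on the $f^{(m)}$'s are needed to make the problem tractable. A standard in the Bayesian nonparametric literature is to use a Gaussian process (GP) prior (Rasmussen & Williams, 2006) with covariance kernel $\kappa$. Two popular kernels of choice are the squared exponential (SE) kernel $\kappa_{\sigma,h}$ and the Matérn kernel $\kappa_{\nu,\rho}$. Writing $z = \|x - x'\|_2$, they are defined as
$$\kappa_{\sigma,h}(x, x') = \sigma \exp\left(-\frac{z^2}{2h^2}\right), \qquad \kappa_{\nu,\rho}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, z}{\rho}\right)^{\nu} B_\nu\left(\frac{\sqrt{2\nu}\, z}{\rho}\right),$$
respectively. Here $\sigma, h, \nu, \rho > 0$ are parameters of the kernels and $\Gamma, B_\nu$ are the Gamma and modified Bessel functions. A convenience the GP framework offers is that posterior distributions are analytically tractable. If $f \sim \mathcal{GP}(0, \kappa)$ is a sample from a GP, and we have observations $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^n$, where $y_i = f(x_i) + \epsilon$ and $\epsilon \sim \mathcal{N}(0, \eta^2)$ is Gaussian noise, then the posterior $f \mid \mathcal{D}_n$ is also a GP, with mean $\mu_n$ and covariance $\kappa_n$ given by (Rasmussen & Williams, 2006),
$$\mu_n(x) = k^\top (K + \eta^2 I_n)^{-1} Y, \qquad \kappa_n(x, x') = \kappa(x, x') - k^\top (K + \eta^2 I_n)^{-1} k'. \qquad (2)$$
Here $Y \in \mathbb{R}^n$ with $Y_i = y_i$, $k, k' \in \mathbb{R}^n$ with $k_i = \kappa(x, x_i)$ and $k'_i = \kappa(x', x_i)$, and $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = \kappa(x_i, x_j)$.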
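The closed-form GP posterior can be illustrated with a short numerical sketch. The code below is only an illustration under stated assumptions: one-dimensional inputs, an SE kernel, Gaussian noise with standard deviation `eta`, and the function names `se_kernel` and `gp_posterior`, none of which are taken from the paper.

```python
import numpy as np

def se_kernel(a, b, sigma=1.0, h=0.2):
    """Squared exponential kernel sigma * exp(-z^2 / (2 h^2)) for 1-D inputs."""
    z = np.abs(a[:, None] - b[None, :])
    return sigma * np.exp(-z**2 / (2.0 * h**2))

def gp_posterior(x_train, y_train, x_test, eta=0.1, kernel=se_kernel):
    """Posterior mean and standard deviation at x_test under a zero-mean GP prior.

    Assumes i.i.d. Gaussian observation noise with variance eta^2.
    """
    K = kernel(x_train, x_train) + eta**2 * np.eye(len(x_train))
    k_star = kernel(x_train, x_test)                  # shape (n, n_test)
    L = np.linalg.cholesky(K)                         # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, k_star)
    mean = k_star.T @ alpha                           # k^T (K + eta^2 I)^{-1} Y
    var = np.diag(kernel(x_test, x_test)) - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Illustrative data (hypothetical): three noiseless samples of sin(6x).
x_train = np.array([0.1, 0.4, 0.9])
y_train = np.sin(6.0 * x_train)
mean, std = gp_posterior(x_train, y_train, np.linspace(0.0, 1.0, 5))
```

The Cholesky factorisation is used instead of an explicit matrix inverse for numerical stability; far from the data the posterior reverts to the prior (mean 0, standard deviation $\sigma^{1/2}$).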

The Generative Process for Multi-fidelity Optimisation
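In keeping with the above framework, we assume the following generative model for the functions $f^{(1)}, \dots, f^{(M)}$. A generative mechanism is given constants $\zeta^{(1)}, \dots, \zeta^{(M-1)}$. It then generates the functions as follows.
Step 1. Sample $f^{(m)} \sim \mathcal{GP}(0, \kappa)$ independently for $m = 1, \dots, M$. (A1)
Step 2. Check if $\|f^{(M)} - f^{(m)}\|_\infty \le \zeta^{(m)}$ for all $m = 1, \dots, M-1$. If true, deliver $f^{(1)}, \dots, f^{(M)}$; otherwise, return to Step 1. (A2)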
Condition A2 characterises the approximation conditions for the lower fidelities. Lemma 2 shows that A2 is satisfied with positive probability when f (1) , . . . , f (M ) are sampled from a GP. Hence this is a valid generative process since A2 will eventually be satisfied. Moreover, in Section 4 we argue that while A2 renders the computation of the true posterior of all GPs inefficient via closed form equations such as in (2), it is still possible to derive an efficient algorithm that uses (2) to determine future points for evaluation.
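One way to read this generative process is as rejection sampling: draw all fidelities independently from the GP prior and accept the draw only when the uniform bounds hold. The sketch below illustrates that reading on a finite grid. The grid discretisation, the SE kernel with bandwidth `h`, the jitter term, the deliberately generous value of `zetas` and the name `sample_fidelities` are all assumptions made for illustration.

```python
import numpy as np

def sample_fidelities(grid, zetas, h=0.3, max_tries=10000, seed=0):
    """Rejection-sample M = len(zetas) + 1 functions on a grid.

    Each attempt draws all M functions independently from GP(0, SE kernel) and
    accepts only if max |f^(M) - f^(m)| <= zetas[m-1] for every lower fidelity.
    """
    rng = np.random.default_rng(seed)
    z = np.abs(grid[:, None] - grid[None, :])
    K = np.exp(-z**2 / (2.0 * h**2)) + 1e-6 * np.eye(len(grid))  # jitter for stability
    L = np.linalg.cholesky(K)
    zetas = np.asarray(zetas, dtype=float)
    M = len(zetas) + 1
    for attempt in range(1, max_tries + 1):
        fs = (L @ rng.standard_normal((len(grid), M))).T   # rows are f^(1), ..., f^(M)
        gaps = np.max(np.abs(fs[-1] - fs[:-1]), axis=1)    # sup-norm gap on the grid
        if np.all(gaps <= zetas):
            return fs, attempt
    raise RuntimeError("no accepted sample within max_tries")

grid = np.linspace(0.0, 1.0, 25)
fs, attempts = sample_fidelities(grid, zetas=[1.5])
```

Because the acceptance event has positive probability, the loop terminates with probability one; the number of attempts grows quickly as the bounds tighten, which is why the sketch caps it with `max_tries`.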
We note that other natural approximation conditions can be used to characterise the cheaper fidelities. We choose a uniform bound condition because it provides a simple way to reason about one fidelity from the others, hence keeping the analysis tractable while ensuring the model is interesting enough so that empirical performance is not compromised. Our theoretical analysis assumes that the algorithm needs to know the uniform bounds ζ (1) , . . . , ζ (M −1) which can be unrealistic in practical settings. In Section 6 we describe a heuristic for choosing these values in a data dependent manner. That said, we believe that the intuitions in this work can be used to develop other upper confidence based multi-fidelity BO algorithms for other approximation conditions. In fact, the approximation conditions in our follow up work in Kandasamy et al. (2017), are of a Bayesian flavour via a kernel on the fidelities. The algorithm, BOCA, builds on the key insights developed here.
It is worth mentioning that while our theoretical results are valid for arbitrary $M$ and costs $\lambda^{(m)}$, in this work we will assume that $M$ is a small fixed value and that $\lambda^{(1)}$ is comparable to $\lambda^{(M)}$. For instance, in many practical applications of multi-fidelity optimisation, while an approximation may be cheaper than the real experiment, it could itself be quite expensive and hence require an intelligent procedure, such as Bayesian optimisation, to choose the next point. This is the regime the current paper focuses on, as opposed to asymptotic regimes where $M \to \infty$ and/or $\lambda^{(1)} \to 0$. Moreover, very large values of $M$ are better handled by the formalism in our follow up work in Kandasamy et al. (2017).
Finally, we note that Assumption A1 can be relaxed to hold for different kernels and noise variances for each fidelity, i.e. different κ (m) , η (m) for m = 1, . . . , M , with minimal modifications to our analysis but we use the above form to simplify the presentation of the results. In fact, our practical implementation uses different kernels.

Simple Regret for Multi-fidelity Optimisation
Our goal is to achieve small simple regret $S(\Lambda)$ after spending capital $\Lambda$ of a resource. We will aim to provide any-capital bounds, meaning that we will assume that the game is played indefinitely and will try to bound the regret for all (sufficiently large) values of $\Lambda$. This is similar in spirit to any-time analyses in single fidelity bandit methods as opposed to fixed time horizon analyses. Let $\{m_t\}_{t \ge 0}$ be the fidelities queried by a multi-fidelity method at each time step. Let $N$ be the random quantity such that $N = \max\{n \ge 1 : \sum_{t=1}^{n} \lambda^{(m_t)} \le \Lambda\}$, i.e. it is the number of queries the strategy makes across all fidelities until capital $\Lambda$. Only the optimum of $f = f^{(M)}$ is of interest to us. The lower fidelities are useful to the extent that they help us optimise $f^{(M)}$ with less cost, but there is no reward for optimising a cheaper approximation. Accordingly, we set the instantaneous reward $q_t$ at time $t$ to be $-\infty$ if $m_t \ne M$ and $f^{(M)}(x_t)$ if $m_t = M$. The instantaneous regret $r_t = f_\star - q_t$ is then $+\infty$ if $m_t \ne M$ and $f_\star - f^{(M)}(x_t)$ if $m_t = M$. For optimisation, the simple regret is simply the best instantaneous regret, $S(\Lambda) = \min_{t=1,\dots,N} r_t$. Equivalently,
$$S(\Lambda) = \begin{cases} \min_{t \le N :\, m_t = M} \; f_\star - f^{(M)}(x_t) & \text{if we have queried at the } M\text{th fidelity at least once,} \\ +\infty & \text{otherwise.} \end{cases}$$
Note that the above reduces to $S_n$ in (1) when we only have access to $f^{(M)}$ with $n = N = \lfloor \Lambda/\lambda^{(M)} \rfloor$. Before we proceed, we note that it is customary in the bandit literature to analyse cumulative regret. The definition of cumulative regret depends on the application at hand (Kandasamy, Dasarathy, Poczos, & Schneider, 2016) and our results can be extended to many sensible notions of cumulative regret. However, both to simplify exposition and since our focus in this paper is optimisation, we stick to simple regret.
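The accounting behind $N$ and $S(\Lambda)$ can be made concrete with a small sketch. It assumes a hypothetical query log of (fidelity, highest-fidelity value) pairs, where the value is only meaningful for queries at fidelity $M$; the function name and data are illustrative.

```python
import math

def simple_regret(queries, costs, capital, f_star):
    """Return (N, S(capital)) for a log of queries.

    queries: list of (m_t, f^(M)(x_t)) pairs with 1-based fidelity m_t; the
             second entry is ignored (and may be None) when m_t < M.
    costs:   [lambda^(1), ..., lambda^(M)].
    """
    M = len(costs)
    spent, n_queries, best = 0.0, 0, math.inf
    for m, top_value in queries:
        if spent + costs[m - 1] > capital:   # the next query would exceed the capital
            break
        spent += costs[m - 1]
        n_queries += 1
        if m == M:                           # only highest-fidelity queries earn reward
            best = min(best, f_star - top_value)
    return n_queries, best

# Hypothetical log: cheap queries first, then a few expensive ones.
queries = [(1, None), (1, None), (2, 0.7), (1, None), (2, 0.95), (2, 1.0)]
n, s = simple_regret(queries, costs=[1.0, 10.0], capital=25.0, f_star=1.0)
```

With capital 25 the first five queries are affordable (total cost 23), so the last query is never counted, and the regret stays infinite until the first query at the highest fidelity.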
Challenges: We conclude this subsection with a commentary on some of the challenges in multi-fidelity optimisation using Figure 2 for illustration. For simplicity, we will focus on 2 fidelities when we have one approximation $f^{(1)}$ to an expensive function $f^{(2)}$. For now assume that (unrealistically) $f^{(1)}$ and its optimum $x^{(1)}_\star$ are known. Typically $x^{(1)}_\star$ is suboptimal for $f^{(2)}$. A seemingly straightforward solution might be to search for $x_\star$ in an appropriate subset, such as a neighborhood of $x^{(1)}_\star$. However, if this neighborhood is too small, we might miss the optimum $x_\star$ (green region in Figure 2(a)). A crucial challenge for multi-fidelity methods is to not get stuck at the optimum of a lower fidelity. While exploiting information from lower fidelities, it is also important to explore sufficiently at higher fidelities. In our experiments, we demonstrate that naive strategies which do not do so could get stuck at the optimum of a lower fidelity. Alternatively, if we pick a very large subset (Figure 2(b)) we might not miss $x_\star$; however, it defeats the objectives of the multi-fidelity set up where the goal is to use the approximation to be prudent about where we query $f^{(2)}$. Figure 2(c) displays a seemingly sensible subset, but it remains to be seen how it is chosen. Further, this subset might not even be a neighborhood as illustrated in Figure 2(d), where $f^{(1)}, f^{(2)}$ are multi-modal and the optima are in different modes. In such cases, an appropriate algorithm should explore all such modes. On top of the above, an algorithm does not actually know $f^{(1)}$. A sensible algorithm should explore $f^{(1)}$ and simultaneously identify the above subset, either implicitly or explicitly, for exploration at the second fidelity $f^{(2)}$. Finally, it is also important to note that $f^{(1)}$ is not simply a noisy version of $f^{(2)}$; this setting is more challenging as an algorithm needs to explicitly account for the bias in the approximations.

Some Useful Properties of GPs
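For what follows, we present some useful properties and concepts related to GPs with well behaved kernels. We will denote probabilities when $f^{(1)}, \dots, f^{(M)} \sim \mathcal{GP}(0, \kappa)$ independently, by $\mathbb{P}_{GP}$. $\mathbb{P}$ will denote probabilities under the prior in the multi-fidelity setting which includes A2 after sampling the functions; i.e. for any event $E$, $\mathbb{P}(E) = \mathbb{P}_{GP}(E \mid \text{A2})$. First, we will need the following regularity condition on the kernel. It is satisfied for four times differentiable kernels such as the SE kernel and the Matérn kernel when $\nu > 2$; see Ghosal and Roy (2006), Theorem 5.
Assumption 1. (Theorem 5 in Ghosal and Roy, 2006) Let $f \sim \mathcal{GP}(0, \kappa)$, where $\kappa : [0, r]^d \times [0, r]^d \to \mathbb{R}$ is a stationary kernel (Rasmussen & Williams, 2006). The partial derivatives of $f$ satisfy the following condition. There exist constants $a, b > 0$ such that, for all $J > 0$ and for all $i \in \{1, \dots, d\}$,
$$\mathbb{P}_{GP}\left( \sup_{x} \left| \frac{\partial f(x)}{\partial x_i} \right| > J \right) \le a\, e^{-(J/b)^2}.$$
Observe that we have used the notation $\mathbb{P}_{GP}$ to indicate the prior probability when $f \sim \mathcal{GP}(0, \kappa)$ for consistency. Next, the following assumption supposes that there is a positive probability to the event that the supremum of a GP in a bounded domain is smaller than any given $\epsilon > 0$.
Assumption 2. Let $\mathcal{X} = [0, r]^d$ and $f \sim \mathcal{GP}(0, \kappa)$. Let $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be such that for all $\epsilon > 0$, there exists $Q(\epsilon) > 0$ such that,
$$\mathbb{P}_{GP}\left( \sup_{x \in \mathcal{X}} |f(x)| < \epsilon \right) \ge Q(\epsilon).$$
As shown by Theorem 4 in Ghosal and Roy (2006), this is satisfied for the SE and Matérn kernels. Finally, following Srinivas et al. (2010), our theoretical results will be given in terms of the Maximum Information Gain (MIG), defined below.

Definition 1. (Maximum Information Gain, Srinivas et al., 2010) Let $f \sim \mathcal{GP}(0, \kappa)$. For a finite set of points $A' = \{x_1, \dots, x_n\} \subset \mathcal{X}$, let $f_{A'}, \epsilon_{A'} \in \mathbb{R}^n$ be such that $(f_{A'})_i = f(x_i)$ and $(\epsilon_{A'})_i \sim \mathcal{N}(0, \eta^2)$, and let $y_{A'} = f_{A'} + \epsilon_{A'}$. Let $I$ denote the Shannon mutual information. The MIG of a set $A \subset \mathcal{X}$ after $n$ evaluations is the maximum mutual information between the function values and observations among all choices of $n$ points in $A$, $\Psi_n(A) = \max_{A' \subset A,\, |A'| = n} I(y_{A'}; f_{A'})$.
The MIG, which depends on the kernel and the set $A$, will be an important quantity in our analysis as it characterises the statistical difficulty of GP bandits. For a given kernel it typically scales with the volume of $A$. It is known that for the SE kernel, $\Psi_n([0, 1]^d) \in O((\log(n))^{d+1})$ and for the Matérn kernel, $\Psi_n([0, 1]^d) \in O(n^{\frac{d(d+1)}{2\nu + d(d+1)}} \log(n))$ (Seeger, Kakade, & Foster, 2008; Srinivas et al., 2010).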
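For a GP with Gaussian noise of variance $\eta^2$, the mutual information between observations and function values at a finite set of points has the closed form $\frac{1}{2} \log\det(I + \eta^{-2} K)$, where $K$ is the kernel matrix at those points (Srinivas et al., 2010). The sketch below uses it to compute the MIG by brute force over a small candidate set. The SE kernel, its bandwidth, the candidate grids and the function names are assumptions chosen for illustration, and the brute-force search is only feasible for tiny problems.

```python
import itertools
import numpy as np

def info_gain(points, eta=0.1, h=0.2):
    """Mutual information I(y_A; f_A) = 0.5 * log det(I + K_A / eta^2), SE kernel."""
    x = np.asarray(points, dtype=float)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * h**2))
    _, logdet = np.linalg.slogdet(np.eye(len(x)) + K / eta**2)
    return 0.5 * logdet

def max_info_gain(candidates, n, **kwargs):
    """Brute-force MIG: best information gain over all n-point subsets."""
    return max(info_gain(subset, **kwargs)
               for subset in itertools.combinations(candidates, n))

# Same number of candidates, but spread over a wide versus a narrow domain.
wide = max_info_gain(np.linspace(0.0, 1.0, 9), 3)
narrow = max_info_gain(np.linspace(0.4, 0.6, 9), 3)
```

Points drawn from the narrow domain are strongly correlated, so they carry less joint information than well separated points from the wide domain. This is the sense in which the MIG grows with the size of the set.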

A Review of GP-UCB
Sequential optimisation methods adopting UCB principles maintain a high probability upper bound $\varphi_t : \mathcal{X} \to \mathbb{R}$ for $f(x)$ for all $x \in \mathcal{X}$ (Auer, 2003). At time $t$ we query at the maximiser of this upper bound, $x_t = \operatorname{argmax}_{x \in \mathcal{X}} \varphi_t(x)$. Our work builds on GP-UCB (Srinivas et al., 2010), where $\varphi_t$ takes the form $\varphi_t(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$. Here $\mu_{t-1}, \sigma_{t-1}$ are the posterior mean and standard deviation of the GP conditioned on the previous $t - 1$ queries $\{(x_i, y_i)\}_{i=1}^{t-1}$ and $\beta_t > 0$. The key intuition here is that the mean $\mu_{t-1}$ encourages an exploitative strategy, in that we want to query where we know the function is high, and the standard deviation $\sigma_{t-1}$ encourages an explorative strategy, in that we want to query at regions where we are uncertain about $f$ lest we miss out on high valued regions. $\beta_t$ will control the trade-off between exploration and exploitation. We have presented GP-UCB in Algorithm 1 and illustrated it in Figure 3.
Algorithm 1 (GP-UCB), initialisation step: $\mathcal{D}_0 \leftarrow \emptyset$, $(\mu_0, \sigma_0) \leftarrow (0, \kappa^{1/2})$.
The following theorem from Srinivas et al. (2010) bounds the simple regret $S_n$ (1) for GP-UCB. They give their bounds in terms of the cumulative regret, but converting it to simple regret is straightforward.
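Theorem 1. (Srinivas et al., 2010) Let $f \sim \mathcal{GP}(0, \kappa)$, $f : \mathcal{X} \to \mathbb{R}$, where the kernel $\kappa$ satisfies Assumption 1. At each query, we have noisy observations $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \eta^2)$. Denote $C_1 = 8/\log(1 + \eta^{-2})$ and pick a failure probability $\delta \in (0, 1)$. The following bounds on the simple regret $S_n$ hold with probability at least $1 - \delta$.
• If $\mathcal{X}$ is a finite discrete set, run GP-UCB with $\beta_t = 2 \log\left(|\mathcal{X}| t^2 \pi^2 / (6\delta)\right)$. Then, for all $n \ge 1$, $S_n \le \sqrt{C_1 \beta_n \Psi_n(\mathcal{X}) / n}$.
• If $\mathcal{X} = [0, r]^d$, run GP-UCB with $\beta_t = 2 \log\left(2\pi^2 t^2 / (3\delta)\right) + 2d \log\left(t^2 b d r \sqrt{\log(4da/\delta)}\right)$. Then, for all $n \ge 1$, $S_n \le \sqrt{C_1 \beta_n \Psi_n(\mathcal{X}) / n} + 2/n$.
Footnote 1: In Section C.2 of Srinivas et al. (2010), the kernel's eigenspectrum is defined with respect to the uniform measure on the domain $\mathcal{X}$. When we consider any subset $A \subset \mathcal{X}$ with the same measure and eigenspectrum, a multiplicative $\mathrm{vol}(A)$ term appears.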
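The GP-UCB loop described in this section can be sketched as follows for a finite grid. This is an illustrative sketch rather than a reference implementation: the SE kernel with bandwidth `h`, the noise level `eta`, the toy objective and all function names are assumptions, while the $\beta_t$ schedule follows the discrete-domain choice stated in the theorem.

```python
import numpy as np

def se_kernel(a, b, h=0.1):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * h**2))

def posterior(xq, yq, grid, eta):
    """GP posterior mean and std on the grid given queries (xq, yq)."""
    if len(xq) == 0:                       # prior: (mu_0, sigma_0) = (0, kappa^{1/2})
        return np.zeros(len(grid)), np.ones(len(grid))
    xq, yq = np.asarray(xq, dtype=float), np.asarray(yq, dtype=float)
    K = se_kernel(xq, xq) + eta**2 * np.eye(len(xq))
    ks = se_kernel(xq, grid)
    sol = np.linalg.solve(K, ks)           # K^{-1} k for every grid point
    var = 1.0 - np.sum(ks * sol, axis=0)
    return sol.T @ yq, np.sqrt(np.maximum(var, 0.0))

def gp_ucb(f, grid, n_queries, eta=0.05, delta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    xq, yq = [], []
    for t in range(1, n_queries + 1):
        mean, std = posterior(xq, yq, grid, eta)
        beta = 2.0 * np.log(len(grid) * t**2 * np.pi**2 / (6.0 * delta))
        x_t = grid[np.argmax(mean + np.sqrt(beta) * std)]   # maximise the UCB
        xq.append(x_t)
        yq.append(f(x_t) + eta * rng.standard_normal())     # noisy evaluation
    return np.array(xq), np.array(yq)

grid = np.linspace(0.0, 1.0, 51)
f = lambda x: -(x - 0.3) ** 2              # toy objective (hypothetical)
xq, yq = gp_ucb(f, grid, n_queries=30)
regret = np.max(f(grid)) - np.max(f(xq))   # simple regret on the grid
```

With this conservative $\beta_t$ the early queries are dominated by the exploration term, so the procedure initially behaves much like a space-filling design before concentrating near the maximiser.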

Multi-fidelity Gaussian Process Upper Confidence Bound (MF-GP-UCB)
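We now propose MF-GP-UCB, which extends GP-UCB to the multi-fidelity setting. Like GP-UCB, MF-GP-UCB will also maintain a UCB for $f^{(M)}$ obtained via the previous queries at all fidelities. Denote the posterior GP mean and standard deviation of $f^{(m)}$ conditioned only on the previous queries at fidelity $m$ by $\mu^{(m)}_t, \sigma^{(m)}_t$ respectively (see (2)). Then define,
$$\varphi^{(m)}_t(x) = \mu^{(m)}_{t-1}(x) + \beta_t^{1/2} \sigma^{(m)}_{t-1}(x) + \zeta^{(m)} \;\; \text{for all } m, \qquad \varphi_t(x) = \min_{m = 1, \dots, M} \varphi^{(m)}_t(x),$$
where we take $\zeta^{(M)} = 0$. For appropriately chosen $\beta_t$, $\mu^{(m)}_{t-1}(x) + \beta_t^{1/2} \sigma^{(m)}_{t-1}(x)$ will upper bound $f^{(m)}(x)$ with high probability. By A2, $\varphi^{(m)}_t(x)$ then upper bounds $f^{(M)}(x)$ for all $m$. We have $M$ such bounds, and their minimum $\varphi_t(x)$ gives the best upper bound for $f^{(M)}$. Following UCB strategies such as GP-UCB, our next query is at the maximiser of this UCB, $x_t = \operatorname{argmax}_{x \in \mathcal{X}} \varphi_t(x)$.
Next we need to decide which fidelity to query at. Consider any $m < M$. The $\zeta^{(m)}$ conditions on $f^{(m)}$ constrain the value of $f^{(M)}$: the confidence band $\beta_t^{1/2} \sigma^{(m)}_{t-1}$ for $f^{(m)}$ is lengthened by $\zeta^{(m)}$ to obtain confidence on $f^{(M)}$. If $\beta_t^{1/2} \sigma^{(m)}_{t-1}(x_t)$ is large, it means that we have not constrained $f^{(m)}$ sufficiently well at $x_t$ and should query at the $m$th fidelity. On the other hand, querying indefinitely in the same region to reduce the uncertainty $\beta_t^{1/2} \sigma^{(m)}_{t-1}$ at the $m$th fidelity in that region will not help us much as the $\zeta^{(m)}$ elongation caps off how much we can learn about $f^{(M)}$ from $f^{(m)}$; i.e. even if we knew $f^{(m)}$ perfectly, we will only have constrained $f^{(M)}$ to within a $\pm\zeta^{(m)}$ band. Our algorithm captures this simple intuition. Having selected $x_t$, we begin by checking at the first fidelity. If $\beta_t^{1/2} \sigma^{(1)}_{t-1}(x_t)$ is smaller than a threshold $\gamma^{(1)}$, we proceed to the second fidelity. If at any stage $\beta_t^{1/2} \sigma^{(m)}_{t-1}(x_t) \ge \gamma^{(m)}$, we query at fidelity $m_t = m$. If we proceed all the way to fidelity $M$, we query at $m_t = M$. We will discuss choices for $\gamma^{(m)}$ in Sections 5.1 and 6. We summarise the resulting procedure in Algorithm 2.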
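The two decisions described above (where to query, then at which fidelity) can be sketched in code. This is a hedged illustration on a finite grid and not the authors' Algorithm 2: the SE kernel, the $\beta_t$ schedule (borrowed from the discrete single-fidelity case), the capital-based stopping rule, the toy functions and all names are assumptions.

```python
import numpy as np

def se_kernel(a, b, h=0.1):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * h**2))

def posterior(xq, yq, grid, eta):
    """Posterior mean and std of one fidelity, using only that fidelity's data."""
    if len(xq) == 0:
        return np.zeros(len(grid)), np.ones(len(grid))
    xq, yq = np.asarray(xq, dtype=float), np.asarray(yq, dtype=float)
    K = se_kernel(xq, xq) + eta**2 * np.eye(len(xq))
    ks = se_kernel(xq, grid)
    sol = np.linalg.solve(K, ks)
    var = 1.0 - np.sum(ks * sol, axis=0)
    return sol.T @ yq, np.sqrt(np.maximum(var, 0.0))

def mf_gp_ucb(fs, grid, zetas, gammas, costs, capital, eta=0.05, delta=0.1, seed=0):
    """fs: M callables (lowest to highest fidelity); zetas, gammas: length M - 1."""
    rng = np.random.default_rng(seed)
    M = len(fs)
    zeta = list(zetas) + [0.0]                       # zeta^(M) = 0
    data = [([], []) for _ in range(M)]
    history, spent, t = [], 0.0, 0
    while True:
        t += 1
        beta = 2.0 * np.log(len(grid) * t**2 * np.pi**2 / (6.0 * delta))
        post = [posterior(xq, yq, grid, eta) for xq, yq in data]
        ucbs = np.array([mu + np.sqrt(beta) * sd + z
                         for (mu, sd), z in zip(post, zeta)])
        idx = int(np.argmax(ucbs.min(axis=0)))       # maximise the tightest UCB
        m = 0                                        # climb while the band is below gamma
        while m < M - 1 and np.sqrt(beta) * post[m][1][idx] < gammas[m]:
            m += 1
        if spent + costs[m] > capital:
            break
        spent += costs[m]
        y = fs[m](grid[idx]) + eta * rng.standard_normal()
        data[m][0].append(grid[idx])
        data[m][1].append(y)
        history.append((grid[idx], m + 1))           # (query point, 1-based fidelity)
    return history

grid = np.linspace(0.0, 1.0, 51)
f2 = lambda x: -(x - 0.3) ** 2                       # toy target (hypothetical)
f1 = lambda x: f2(x) + 0.05 * np.sin(20.0 * x)       # biased approximation, gap <= 0.05
costs, capital = [1.0, 10.0], 100.0
history = mf_gp_ucb([f1, f2], grid, zetas=[0.05], gammas=[0.5],
                    costs=costs, capital=capital)
```

Each fidelity keeps its own GP conditioned only on its own queries, which keeps every posterior Gaussian; information is shared across fidelities only through the minimum over the $\zeta$-inflated upper bounds.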
Before we proceed, we make an essential observation. The posterior for any $f^{(m)}(x)$ conditioned on previous queries at all fidelities $\bigcup_{\ell=1}^{M} \mathcal{D}^{(\ell)}_t$ is not Gaussian due to the $\zeta^{(m)}$ constraints (A2). However, $|f^{(m)}(x) - \mu^{(m)}_{t-1}(x)| < \beta_t^{1/2} \sigma^{(m)}_{t-1}(x)$ holds with high probability, since, by conditioning only on queries at the $m$th fidelity, we have Gaussianity for $f^{(m)}(x)$ (see Lemma 9, Section 8.1).
Figure 4: The small crosses are queries from 1 to $t - 1$ and the red star is the maximiser of $\varphi_t$, i.e. the next query $x_t$. $x_\star$, the optimum of $f^{(2)}$, is shown in magenta. In the bottom figures, the solid orange line is $\beta_t^{1/2} \sigma^{(1)}_{t-1}$; when it falls below the threshold $\gamma^{(1)}$ at $x_t$ we play at fidelity $m_t = 2$ and otherwise at $m_t = 1$. The cyan region in the last panel is the good set $\mathcal{X}_g$ described in Section 5.1.

An Illustration of MF-GP-UCB:
Figure 4 illustrates MF-GP-UCB via a simulation on a 2-fidelity problem. At the initial stages, MF-GP-UCB is mostly exploring $\mathcal{X}$ in the first fidelity. $\beta_t^{1/2} \sigma^{(1)}_{t-1}$ is large and we are yet to constrain $f^{(1)}$ well to proceed to $m = 2$. At $t = 10$, we have constrained $f^{(1)}$ sufficiently well at a region around the optimum. $\beta_t^{1/2} \sigma^{(1)}_{t-1}(x_t)$ falls below $\gamma^{(1)}$ and we query at $m_t = 2$. Notice that once we do this (at $t = 11$), $\varphi^{(2)}_t$ dips to change $\varphi_t$ in that region. At $t = 14$, MF-GP-UCB has identified the maximum $x_\star$ with just 4 queries to $f^{(2)}$. The region shaded in cyan in the last figure is the "good set" $\mathcal{X}_g$, which we alluded to in Section 2. We will define it formally and explain its significance in the multi-fidelity set up shortly. Our analysis predicts that most second fidelity queries in MF-GP-UCB will be confined to this set (roughly) and the simulation corroborates this claim. For example, in the last figure, at $t = 50$, the algorithm decides to explore at a point far away from the optimum. However, this query occurs in the first fidelity since we have not sufficiently constrained $f^{(1)}(x_t)$ in this region and $\beta_t^{1/2} \sigma^{(1)}_{t-1}(x_t)$ is large. The key idea is that it is not necessary to query such regions at the second fidelity as the first fidelity alone is enough to conclude that it is suboptimal. In addition, observe that in a large portion of $\mathcal{X}$, $\varphi_t$ is given by $\varphi^{(1)}_t$ except in a small neighborhood around $x_\star$, where it is given by $\varphi^{(2)}_t$. Next we present our main theoretical results. We wish to remind the reader that a table of notations is available in Appendix B.2.

Theoretical Results
First and foremost, we will show that condition A2 occurs with positive probability when we sample the functions from a GP. The following lemma shows that P GP (A2) = ξ A2 > 0 which establishes that the generative mechanism is valid. The proof is given in Section 8.
Here $Q$ is from Assumption 2. $\xi_{A2} > 0$ since each of the terms in the product is positive.
We are now ready to present our theoretical results. We begin with an informal yet intuitive introduction to our theorems for $M = 2$ fidelities.

A Preview of our Theorems
In this subsection, we will ignore constants and polylog terms when they are dominated by other terms. $\lesssim$, $\gtrsim$, $\asymp$ denote inequality and equality ignoring constants. When $A \subset \mathcal{X}$, we will denote its complement by $\overline{A}$.
Fundamental to the 2-fidelity problem is the good set $\mathcal{X}_g = \{x \in \mathcal{X} : f^{(1)}(x) \ge f_\star - \zeta^{(1)}\}$. $\mathcal{X}_g$ is a high-valued region for $f^{(2)}$: for all $x \in \mathcal{X}_g$, $f^{(2)}(x)$ is at most $2\zeta^{(1)}$ away from the optimum. If a multi-fidelity strategy were to use all its second fidelity queries only in $\mathcal{X}_g$, then, by Theorem 1, the regret will only have $\Psi_n(\mathcal{X}_g)$ dependence after $n$ high fidelity queries. In contrast, a strategy that only operates at the highest fidelity, such as GP-UCB, will have $\Psi_n(\mathcal{X})$ dependence. When $\zeta^{(1)}$ is small, i.e. when $f^{(1)}$ is a good approximation to $f^{(2)}$, $\mathcal{X}_g$ will be much smaller than $\mathcal{X}$. Then, $\Psi_n(\mathcal{X}_g) \ll \Psi_n(\mathcal{X})$, and the multi-fidelity strategy will have better bounds on the regret than a single fidelity strategy. Alas, achieving this somewhat ideal goal is not possible without perfect knowledge of the approximation. However, with MF-GP-UCB we can come quite close. As we will show shortly, most second fidelity queries will be confined to the slightly inflated good set $\widetilde{\mathcal{X}}_g = \{x \in \mathcal{X} : f^{(1)}(x) \ge f_\star - \zeta^{(1)} - 3\gamma^{(1)}\}$. The following lemma bounds the number of first and second fidelity evaluations in $\widetilde{\mathcal{X}}_g$ and its complement $\overline{\widetilde{\mathcal{X}}_g}$. We denote the number of queries at the $m$th fidelity in a set $A \subset \mathcal{X}$ within the first $n$ time steps by $T^{(m)}_n(A)$. Here $\Pi(A) = |A|$ for discrete $A$ and $\Pi(A) = \mathrm{vol}(A)$ for continuous $A$. The bound for $T^{(2)}_n(\overline{\widetilde{\mathcal{X}}_g})$ holds for any sublinear increasing sequence $\{\tau_n\}_{n \ge 1}$.
The above lemma will be useful for two reasons. First, the bounds on $T^{(2)}_n(\cdot)$ show that most second fidelity queries are inside $\widetilde{\mathcal{X}}_g$; the number of such expensive queries outside $\widetilde{\mathcal{X}}_g$ is small. This strong result is only possible in the multi-fidelity setting. From the results of Srinivas et al. (2010), we can infer that the best achievable bound on the number of plays for GP-UCB inside a suboptimal set is $n^{1/2}$ for the SE kernel and even worse for the Matérn kernel. For example, in the simulation of Figure 4, all queries to $f^{(2)}$ are in fact confined to $\mathcal{X}_g$, which is a subset of $\widetilde{\mathcal{X}}_g$. This allows us to obtain regret that scales with $\Psi_n(\widetilde{\mathcal{X}}_g)$ as explained above. Second, we will use Lemma 3 to control $N$, the (random) number of queries by MF-GP-UCB within capital $\Lambda$. Let $n_\Lambda = \lfloor \Lambda/\lambda^{(2)} \rfloor$ be the (non-random) number of queries by a single fidelity method operating only at the second fidelity. As $\lambda^{(1)} < \lambda^{(2)}$, $N$ could be large for an arbitrary multi-fidelity method. However, using the bounds on $T^{(1)}_n(\cdot)$ we can show that $N$ is on the order of $n_\Lambda$ when $\Lambda$ is larger than some value $\Lambda_0$. Below, we detail the main ingredients in the proof of Lemma 3.

• $T^{(1)}_n(\widetilde{\mathcal{X}}_g)$: By the design of our algorithm, MF-GP-UCB will begin querying $f^{(1)}$. To achieve finite regret we need to show that we will eventually query $f^{(2)}$. For any region in $\widetilde{\mathcal{X}}_g$ the switching condition of step 2 in Algorithm 2 ensures that we do not query that region indefinitely. That is, if we keep querying a certain region, the first fidelity GP uncertainty $\beta_t^{1/2} \sigma^{(1)}_{t-1}$ will reduce below $\gamma^{(1)}$ in that region. We will discuss the implications of the choice of $\gamma^{(1)}$ at the end of this subsection and in Section 6.
• $T^{(1)}_n(\overline{\widetilde{\mathcal{X}}_g})$: For queries to $f^{(1)}$ outside $\widetilde{\mathcal{X}}_g$, we use the following reasoning: as $f^{(1)}$ is small outside $\widetilde{\mathcal{X}}_g$, it is unlikely to contain the UCB maximiser and be selected in step 1 of Algorithm 2 several times.
• $T^{(2)}_n(\overline{\widetilde{\mathcal{X}}_g})$: We appeal to previous first fidelity queries. If we are querying at the second fidelity at a certain region, it can only be because the first fidelity confidence band is small. This implies that there must be several first fidelity queries in that region which in turn implies that we can learn about $f^{(1)}$ with high confidence. As $f^{(1)}$ alone would tell us that any point in $\overline{\widetilde{\mathcal{X}}_g}$ is suboptimal for $f^{(2)}$, the maximiser of the UCB is unlikely to lie in this region frequently. Hence, we will not query outside $\widetilde{\mathcal{X}}_g$ often.
It follows from the above that the number of second fidelity queries in $\widetilde{\mathcal{X}}_g$ scales as $T^{(2)}_n(\widetilde{\mathcal{X}}_g) \asymp n$.
Finally, we invoke techniques from Srinivas et al. (2010) to control the regret using the MIG. However, unlike them, we can use the MIG of $\widetilde{\mathcal{X}}_g$ since an overwhelming amount of evaluations at the second fidelity are in $\widetilde{\mathcal{X}}_g$. This allows us to obtain a tighter bound on $S(\Lambda)$ of the following form.
It is instructive to compare the above rates against that for GP-UCB in Theorem 1. By dropping the common and sub-dominant terms, the rate for GP-UCB is $\Psi^{1/2}_{n_\Lambda}(\mathcal{X})$. Therefore, whenever the approximation is very good ($\mathrm{vol}(\widetilde{\mathcal{X}}_g) \ll \mathrm{vol}(\mathcal{X})$) the rates for MF-GP-UCB are very appealing. When the approximation worsens and $\mathcal{X}_g$, $\widetilde{\mathcal{X}}_g$ become larger, the bound decays gracefully. In the worst case, MF-GP-UCB is never worse than GP-UCB up to constant terms for $\Lambda \ge \Lambda_0$. The $\Lambda_0$ term is required since at the initial stages, MF-GP-UCB will be exploring $f^{(1)}$ before proceeding to $f^{(2)}$, at which stage its regret will still be $+\infty$. The costs $\lambda^{(1)}, \lambda^{(2)}$ get factored into the result via the $\Lambda > \Lambda_0$ condition. If $\lambda^{(1)}$ is large, for fixed $\gamma^{(1)}$, a larger amount of capital is spent at the first fidelity, so $\Lambda_0$ will be large. We will make the dependence of $\Lambda_0$ on the lower fidelities explicit in the formal theorem statements. Now let us analyse the effect of the parameter $\gamma^{(1)}$ on the result. At first sight, large $\gamma^{(1)}$ seems to increase the size of $\widetilde{\mathcal{X}}_g$, which would suggest that we should keep it as small as possible. However, smaller $\gamma^{(1)}$ also increases $\Lambda_0$; intuitively, if $\gamma^{(1)}$ is too small, then one will wait for a long time in step 2 of Algorithm 2 for $\beta_t^{1/2} \sigma^{(1)}_{t-1}$ to decrease without proceeding to $f^{(2)}$. As one might expect, an "optimal" choice of $\gamma^{(1)}$ depends on how large a $\Lambda_0$ we are willing to tolerate; i.e. how long we are willing to wait investigating the cheap approximation. Moreover, if the approximation is extremely cheap, it makes sense to use very small $\gamma^{(1)}$ and learn as much as possible about $f^{(2)}$ from $f^{(1)}$. However, it also depends on other problem dependent quantities such as $\mathcal{X}_g$. In Section 5.2 we describe a choice for $\gamma^{(1)}$ based on $\lambda^{(1)}$, $\lambda^{(2)}$ and $\zeta^{(1)}$ that aims to balance the cost spent at each fidelity. In our experiments however, we found that more aggressive choices for these threshold values $\gamma^{(m)}$ perform better in practice. We describe one such technique in Section 6.
For general M , we will define a hierarchy of good sets, the complement of which will be eliminated when we proceed from one fidelity to the next. At the highest fidelity, we will be querying mostly inside a small subset of X informed by the approximations f (1) , . . . , f (M −1) . We will formalise these intuitions in the next two subsections.

Discrete X
We first analyse the case when X is a discrete subset of > 0 for all m by our assumptions. Central to our analysis will be a partitioning $(\mathcal{H}^{(m)})_{m=1}^{M}$ of $\mathcal{X}$. First define $\mathcal{H}^{(1)}$ to be the arms whose $f^{(1)}$ value is at least $\zeta^{(1)} + 3\gamma^{(1)}$ below the optimum $f_\star$. The remaining sets $\mathcal{H}^{(m)}$ are then defined recursively. In addition to the above, we will also find it useful to define the sets "above" $\mathcal{H}^{(m)}$ as $\widehat{\mathcal{H}}^{(m)} = \bigcup_{\ell=m+1}^{M} \mathcal{H}^{(\ell)}$ and the sets "below" $\mathcal{H}^{(m)}$ as $\widecheck{\mathcal{H}}^{(m)} = \bigcup_{\ell=1}^{m-1} \mathcal{H}^{(\ell)}$. Our analysis reveals that most of the capital invested at points in $\mathcal{H}^{(m)}$ will be due to queries to the $m$th fidelity function $f^{(m)}$. $\widecheck{\mathcal{H}}^{(m)}$ is the set of points that can be excluded from queries at fidelities $m$ and beyond due to information from lower fidelities. $\widehat{\mathcal{H}}^{(m)}$ are points that will be queried at fidelities higher than $m$ several times. In the 2 fidelity setting described in Section 5, the good set corresponds to $\mathcal{H}^{(2)}$ and its complement to $\mathcal{H}^{(1)} = \widecheck{\mathcal{H}}^{(2)}$. We have illustrated these sets in Figure 5.
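For a finite set of arms, the partition can be illustrated with a short Python sketch. It assigns each arm to the first fidelity at which its value falls at least $\zeta^{(m)} + 3\gamma^{(m)}$ below the optimum, and places all remaining arms in the last set. The recursion used here (fidelities tested in increasing order, already assigned arms skipped) is an assumption about the definition, and the names `partition_arms`, `f_vals`, `zetas` and `gammas` are hypothetical.

```python
def partition_arms(f_vals, zetas, gammas):
    """Partition arm indices into sets H^(1), ..., H^(M) (returned 0-indexed).

    f_vals[m][i] is the value of fidelity m+1 at arm i; the last row is the
    target function f^(M). zetas and gammas have length M-1.
    Assumption: an arm joins H^(m) at the first fidelity m < M where
    f^(m)(x) <= f_star - zeta^(m) - 3 * gamma^(m); leftovers form H^(M).
    """
    n_fidelities = len(f_vals)
    n_arms = len(f_vals[0])
    f_star = max(f_vals[-1])  # optimum of the target function
    parts = [[] for _ in range(n_fidelities)]
    for i in range(n_arms):
        for m in range(n_fidelities - 1):
            if f_vals[m][i] <= f_star - zetas[m] - 3.0 * gammas[m]:
                parts[m].append(i)
                break
        else:
            # Not eliminated by any approximation: stays in the last set.
            parts[-1].append(i)
    return parts


# Toy 2-fidelity example with four arms (values are made up).
f1 = [0.10, 0.40, 0.80, 0.85]   # cheap approximation
f2 = [0.00, 0.50, 1.00, 0.90]   # target function
parts = partition_arms([f1, f2], zetas=[0.2], gammas=[0.05])
```

In this toy example the two clearly poor arms are assigned to the first set, so under the analysis above they would mostly be handled by the cheap approximation.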
Recall that $n_\Lambda = \lfloor \Lambda/\lambda^{(M)} \rfloor$ is the number of queries by a single-fidelity method; it is a lower bound on $N$, the number of queries by a multi-fidelity method. Similarly, $\overline{n}_\Lambda = \lfloor \Lambda/\lambda^{(1)} \rfloor$ will be an upper bound on $N$. We will now define two quantities $\Lambda_1, \Lambda_2$ where $\Lambda_1 < \Lambda_2$. We will show improved simple regret over GP-UCB when the capital $\Lambda$ is larger than these quantities, with the $\Lambda > \Lambda_2$ regime being better by an additive $\log(\lambda^{(M)}/\lambda^{(1)})$ term over the $\Lambda > \Lambda_1$ case. Formally, we define $\Lambda_1$ to be the smallest $\Lambda$ satisfying condition (7), and $\Lambda_2$ to be the smallest $\Lambda$ satisfying condition (8). We can find such $\Lambda_1, \Lambda_2$ since, for fixed $\gamma^{(m)}$'s, in both cases the right side is linear in $\Lambda$ and the left is logarithmic, as $\beta_n \in \mathcal{O}(\log n)$ and $n_\Lambda \asymp \Lambda$. Since $\{\mathcal{H}^{(m)}\}_{m=1}^{M}$ form a partition of $\mathcal{X}$ and $\lambda^{(1)} < \dots < \lambda^{(M)}$, we see that $\Lambda_1 < \Lambda_2$. Recall that at the initial stages, MF-GP-UCB has infinite simple regret since the evaluations are at lower fidelities. $\Lambda > \Lambda_1$ indicates the phase where $\Theta(n_\Lambda)$ evaluations have been made inside $\mathcal{H}^{(M)}$, but the total number of evaluations $N$ could be much larger. When $\Lambda > \Lambda_2$, we have reached a phase where $N$ is also in $\Theta(n_\Lambda)$.
Moreover, note that when the approximations are good, i.e. the sets H (m) are small, both Λ 1 and Λ 2 are small. Λ 1 is also small when the approximations are cheap, i.e. λ (m) 's are small. Therefore, the cheaper and better the approximations, the less time we have to wait (for fixed γ (m) ) before MF-GP-UCB starts querying at the M th fidelity and achieves good regret.
We now state our main theorem for discrete X . To simplify the analysis, we will introduce an additional condition in the fidelity selection criterion in step 2 of Algorithm 2: we will evaluate f (m) at x t only if x t has been evaluated at all lower fidelities, 1, . . . , m − 1; precisely, that n (x t ) = 0}. Both this condition and the dependence of Λ 2 on |X | in (8) are artefacts of our analysis. They arise only because we do not account for the correlations between the arms in our discrete analysis; doing so requires us to make assumptions about the locations of the arms in [0, r] d . We will not need this condition or have Λ 2 depend on |X | for the continuous case.
The difference between the two results is the $\beta_{\overline{n}_\Lambda}$ dependence in the former setting and $\beta_{n_\Lambda}$ in the latter; the latter bound is better by an additive $\log(\lambda^{(M)}/\lambda^{(1)})$ term, but we have to wait for longer. Dropping constant and polylog terms and comparing to the result in Theorem 1 reveals that we outperform GP-UCB by a factor of $\Psi_{n_\Lambda}(\mathcal{H}^{(M)})/\Psi_{n_\Lambda}(\mathcal{X}) \asymp \mathrm{vol}(\mathcal{H}^{(M)})/\mathrm{vol}(\mathcal{X})$. Figure 6 supports the claim that the rates improve as the $\zeta^{(m)}$ values decrease. As the approximations worsen, the advantage to multi-fidelity optimisation diminishes as expected, but we are never worse than GP-UCB up to constant factors.
Figure 6: Empirically computed values for the ratio $\mathrm{vol}(\mathcal{H}^{(2)})/\mathrm{vol}(\mathcal{X})$ for a one dimensional (left) and two dimensional (right) 2-fidelity problem. For this, the samples $f^{(1)}, f^{(2)}$ were generated using the generative mechanism of Section 2.1, under the stipulated value for $\zeta^{(1)}$. In both cases, we used an SE kernel with bandwidth 1 and scale parameter 1. The y-axis is the mean value for the ratio over several samples and the x-axis is $\zeta^{(1)}$. In both cases, we used $\gamma^{(1)} = \zeta^{(1)}/3$, and approximated the continuous domain with a uniform grid of size $10^4$. The figure indicates that as the approximation improves, i.e. $\zeta^{(1)}$ decreases, the ratio decreases and consequently, we get better bounds.
A few remarks are in order. First, note that the dependence on n Λ (or equivalently Λ) is the same for both GP-UCB and MF-GP-UCB. In fact, one should not expect multi-fidelity optimisation to yield "rate" improvements since such 1/n dependencies are typical in the bandit literature (Bubeck, Munos, Stoltz, & Szepesvári, 2011;Shang, Kaufmann, & Valko, 2017). The multi-fidelity framework allows us to find a good region, i.e. H (M ) , where the optimum exists, and as such, we should expect the improvements to be in terms of the size of this set, relative to X . Second, even when the kernels for each GP are different, the MIG dependence in Theorem 5 will be that of the highest fidelity GP f (M ) . The dependence of the other kernels will factor in via the ξ A2 bound; precisely, the more κ (m) is different from κ (M ) , the corresponding Q(ζ (m) /2) term will be smaller, leading to a smaller ξ A2 value. Finally, the bound is given in terms of H (M ) which, as illustrated by Figure 6, gives us insight into the types of gains we can expect from multi-fidelity optimisation. However, H (M ) is a random quantity and obtaining high probability bounds on its volume could shed more light on the gains of our multi-fidelity optimisation framework; this is an interesting avenue for future work.
Choice of γ (m) . It should be noted that an "optimal" choice of γ (m) depends on the available budget, i.e. how long we are willing to wait before achieving non-trivial regret. If we are willing to wait long, we can afford to choose small γ (m) and consequently have better guarantees on the regret. This optimal choice also depends on several unknown problem dependent factors, such as the sizes of the sets H (m) . In Kandasamy, Dasarathy, Poczos, and Schneider (2016), the choice $\gamma^{(m)} = \zeta^{(m)}\sqrt{\lambda^{(m)}/\lambda^{(m+1)}}$ was used, which ensures that for an arm x ∈ H (m) , the cost spent at lower fidelities 1, . . . , m − 1 is not more than the cost spent at fidelity m. Beyond this intuitive property, this choice further achieves a lower bound on the K-armed multi-fidelity problem. The same choice for γ (m) here ensures that the cost spent at the lower fidelities is not more than an upper bound on the cost spent at fidelity m; we elaborate on this in Remark 2 after our proofs. We have empirically demonstrated the effect of different choices of γ (m) values via an experiment in Figure 8(b). Building on these ideas, an explicit prescription for the choice of γ (m) is bound to be a fruitful avenue of research, and we leave this to future work. In the meanwhile, in Section 6, we describe a heuristic for adaptively choosing γ (m) which worked well in our experiments.

Continuous and Compact X
We define the sets H (m) , H (m) for m = 1, . . . , M as in the discrete case. Let {ν n } n≥0 be any sublinear sequence such that ν n → ∞. Let $\mathcal{H}^{(m)}_n$ be a ν n -dependent L 2 dilation of H (m) . Here, B 2 (x, ε) is an L 2 ball of radius ε centred at x. Notice that as n → ∞, $\mathcal{H}^{(m)}_n \to \mathcal{H}^{(m)}$. Similar to the discrete case, we define Λ 1 to be the smallest Λ satisfying the following condition, and Λ 2 to be the smallest Λ satisfying the following condition, Here p = 1/2 for the SE kernel and p = 1 for the Matérn kernel. C κ is a kernel dependent constant elucidated in our proofs; for the SE kernel, $C_\kappa = 2^{2+d/2}(d\kappa_0/h^2)^{d/2}$ where κ 0 , h are parameters of the kernel. Via a reasoning similar to the discrete case we see that Λ 1 < Λ 2 . Our main theorem is as follows.

Note that the sets $\mathcal{H}^{(M)}_{n_\Lambda}$ depend on the sublinear increasing sequence $\{\nu_n\}_{n \ge 0}$; the theorem is valid for any such choice of $\nu_n$. The comparison of the above bound against GP-UCB is similar to the discrete case. The main difference is that we have an additional dilation of $\mathcal{H}^{(M)}$ to $\mathcal{H}^{(M)}_{n_\Lambda}$, which occurs due to a covering argument in our analysis. Recall that $\mathcal{H}^{(M)}_{n_\Lambda} \to \mathcal{H}^{(M)}$ as $\Lambda \to \infty$, which is small when the approximations are good.

Some Implementation Details of MF-GP-UCB and other Baselines
Our implementation uses some standard techniques in the Bayesian optimisation literature given below. In addition, we describe the heuristics used to set the γ (m) , ζ (m) parameters of our method.
Initialisation: Following recommendations in Brochu, Cora, and de Freitas (2010), all GP methods were initialised with uniform random queries using an initialisation capital Λ 0 . For single fidelity methods, we used it at the M th fidelity, whereas for multi-fidelity methods we used Λ 0 /2 at the first fidelity and Λ 0 /2 at the second fidelity.
Kernel: In all our experiments, we used the SE kernel. We initialise the kernel by maximising the GP marginal likelihood (Rasmussen & Williams, 2006) on the initial sample and then update the kernel every 25 iterations using marginal likelihood.
Choice of β t : β t , as specified in Theorems 1, 6 has unknown constants and tends to be too conservative in practice. Following Kandasamy, Schneider, and Póczos (2015) we use β t = 0.2d log(2t) which captures the dominant dependencies on d and t.
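The heuristic value of β t is simple to compute. The Python sketch below evaluates it and plugs it into a generic upper confidence bound of the form mean plus β t ^(1/2) times standard deviation; the function names `beta_t` and `ucb` are hypothetical, and the `ucb` helper shows the plain GP-UCB form rather than the full multi-fidelity acquisition.

```python
import math


def beta_t(t: int, d: int) -> float:
    """Heuristic exploration coefficient beta_t = 0.2 * d * log(2t)."""
    if t < 1 or d < 1:
        raise ValueError("t and d must be positive integers")
    return 0.2 * d * math.log(2 * t)


def ucb(mean: float, std: float, t: int, d: int) -> float:
    """Generic upper confidence bound: mean + sqrt(beta_t) * std."""
    return mean + math.sqrt(beta_t(t, d)) * std


# Example: the coefficient grows logarithmically with the iteration count.
values = [beta_t(t, d=2) for t in (1, 10, 100)]
```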
Maximising ϕ t : We used the DiRect algorithm.
Choice of γ (m) 's: The γ (m) values control how much effort is spent at the lower fidelities: if γ (m) is too small, MF-GP-UCB spends an unnecessarily large number of queries at fidelity m in order to reduce the uncertainty below γ (m) . We therefore start with small values for all γ (m) and increase them adaptively: if the algorithm does not query above the m th fidelity for more than λ (m+1) /λ (m) iterations, we double γ (m) . All γ (m) values were initialised to 1% of the range of initial queries.
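The doubling rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `GammaSchedule` and its fields are hypothetical, "range of initial queries" is interpreted as the range of the initially observed values, and resetting the stall counter after a doubling is our own choice.

```python
class GammaSchedule:
    """Adaptive doubling of the thresholds gamma^(m), m = 1..M-1 (0-indexed)."""

    def __init__(self, costs, value_range, init_fraction=0.01):
        # costs[m] is lambda^(m+1); costs are assumed strictly increasing.
        self.costs = list(costs)
        n_lower = len(self.costs) - 1
        # Initialise every gamma to 1% of the range of the initial observations.
        self.gammas = [init_fraction * value_range] * n_lower
        # stalled[m]: consecutive iterations with no query above fidelity m.
        self.stalled = [0] * n_lower

    def update(self, queried_fidelity):
        """Record one iteration; queried_fidelity is 0-indexed."""
        for m in range(len(self.gammas)):
            if queried_fidelity > m:
                self.stalled[m] = 0
                continue
            self.stalled[m] += 1
            # Double gamma^(m) after more than lambda^(m+1)/lambda^(m) stalls.
            if self.stalled[m] > self.costs[m + 1] / self.costs[m]:
                self.gammas[m] *= 2.0
                self.stalled[m] = 0  # assumption: restart the count
```

With costs 1 and 10, for example, the first threshold is doubled on the eleventh consecutive iteration that stays at the first fidelity.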
Whilst the first four choices are standard in the BO literature (Brochu et al., 2010;Snoek et al., 2012), our methods for selecting the ζ (m) and γ (m) parameters are heuristic in nature. We obtained robust implementations of MF-GP-UCB with little effort in tweaking these choices. In fact, we found our implementation was able to recover even from fairly bad approximations at the lower fidelities (see experiment in Figure 9). We believe that other reasonable heuristics can also be used in place of our choices here, and a systematic investigation into protocols for the same will be a fruitful avenue for future research.

Experiments
We present experiments for compact and continuous X since it is the more practically relevant setting. We compare MF-GP-UCB to the following baselines. Single fidelity methods: GP-UCB; EI: the expected improvement criterion for BO; DiRect: the dividing rectangles method. Multi-fidelity methods: MF-NAIVE: a naive baseline where we use GP-UCB to query at the first fidelity a large number of times and then query at the last fidelity at the points queried at f (1) in decreasing order of f (1) -value; MF-SKO: the multi-fidelity sequential kriging method from Huang et al. (2006). Previous works on multi-fidelity methods (including MF-SKO) had not made their code available and were not straightforward to implement. We discuss this more in Appendix A.1 along with some other single and multi-fidelity baselines we tried but excluded in the comparison to avoid clutter in the figures. We also detail some design choices and hyper-parameters for the baselines in Appendix A.1.

Synthetic Examples
We begin with a series of synthetic experiments, designed to demonstrate the applicability and limitations of MF-GP-UCB. We use the Currin exponential (d = 2), Park (d = 4) and Borehole (d = 8) functions in M = 2 fidelity experiments and the Hartmann functions in d = 3 and 6 with M = 3 and 4 fidelities respectively. The first three functions are taken from previous multi-fidelity literature (Xiong, Qian, & Wu, 2013) while we tweaked the Hartmann functions to obtain the lower fidelities for the latter two cases. In Appendix A we give the formulae for these functions and the approximations used for the lower fidelities. We show the simple regret S(Λ) against capital Λ in Figure 7. The number of fidelities and the costs used for each fidelity are also given in Figure 7. MF-GP-UCB outperforms other baselines on all problems.
The last panel of Figure 7 shows a histogram of the number of queries at each fidelity after 184 queries of MF-GP-UCB, for different ranges of f (3) (x) for the Hartmann-3D function. Many of the queries at the low f (3) values are at fidelity 1, but as we progress they decrease and the second fidelity queries increase. The third fidelity dominates very close to the optimum but is used sparingly elsewhere. This corroborates the prediction in our analysis that MF-GP-UCB uses low fidelities to explore and successively higher fidelities at promising regions to zero in on x⋆. (Also see Figure 4.)
A common occurrence with MF-NAIVE was that once we started querying at fidelity M , the regret barely decreased. The diagnosis in all cases was the same: it was stuck around the maximum of f (1) which is suboptimal for f (M ) . This suggests that while we have cheap approximations, the problem is by no means trivial. As explained previously, it is also important to explore at higher fidelities to achieve good regret. The efficacy of MF-GP-UCB when compared to single fidelity methods is that it confines this exploration to a small set containing the optimum. In our experiments we found that MF-SKO did not consistently beat other single fidelity methods. Despite our best efforts to reproduce MF-SKO, we found it to be quite brittle. In fact, we also tried another multi-fidelity method and found that it did not perform as desired (see Appendix A.1 for details).
Effect of the cost of the approximations: We now test the effect of the cost of the approximation on performance. Figure 8(a) shows the results when MF-GP-UCB was run on the 2-fidelity Borehole experiment for different costs for the approximation f (1) . We fixed λ (2) = 1 and varied λ (1) between 0.01 and 0.5. As λ (1) increases, the performance worsens as expected. At λ (1) = 0.5 it is indistinguishable from GP-UCB as the overhead of managing 2 fidelities becomes significant when compared to the improvements of using the approximation.
Figure 8(b) shows the effect of different choices of γ (1) . We see that as γ (1) decreases the curves start later in the figure, indicating that MF-GP-UCB spends more time at the approximation f (1) before proceeding to f (2) ; however, the simple regret is also generally better for smaller γ (1) . Therefore, if we have a large computational budget and are willing to wait longer, we can choose small γ (m) values and achieve better simple regret.
Bad Approximations: It is natural to ask how MF-GP-UCB performs with bad approximations at lower fidelities. We found our implementation, with the heuristics suggested in Section 6, to be quite robust. We demonstrate this using the Currin exponential function, but using the negative of f (2) as the first fidelity approximation, i.e. f (1) (x) = −f (2) (x). Figure 9 illustrates f (1) , f (2) and gives the simple regret S(Λ). Understandably, MF-GP-UCB loses to the single fidelity methods since the first fidelity queries are wasted and it spends some time at the second fidelity recovering from the bad approximation. However, it is eventually able to achieve low regret.

Model Selection and Astrophysics Experiments
We now present results on three hyper-parameter tuning tasks and a maximum likelihood inference task in Astrophysics. We compare methods on computation time since that is the "cost" in all experiments. We include the processing time for each method in the comparison (i.e. the cost of determining the next query). The results are given in Figure 10, where, as we see MF-GP-UCB outperforms other baselines on all tasks. The experimental set up for each optimisation problem is described below.
Classification using SVMs (SVM): We trained a support vector classifier on the magic gamma dataset using the sequential minimal optimisation algorithm to an accuracy of $10^{-12}$. The goal is to tune the kernel bandwidth and the soft margin coefficient in the ranges $(10^{-3}, 10^{1})$ and $(10^{-1}, 10^{5})$ respectively on a dataset of size 2000. We set this up as an M = 2 fidelity experiment with the entire training set at the second fidelity and 500 points at the first. Each query to f (m) required 5-fold cross validation on the respective training sets.
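To make the construction of the fidelities concrete, the following Python sketch builds one objective per training set size and scores a hyper-parameter configuration by k-fold cross validation. It is a minimal illustration, not the code used in the experiments: `make_fidelity`, `kfold_splits` and the user-supplied callback `fit_and_score` are hypothetical names, and taking the first `n_points` examples as the low fidelity subset is an assumption.

```python
def kfold_splits(n, k=5):
    """Yield (train_idx, val_idx) pairs for k roughly equal contiguous folds."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    for i in range(k):
        val = list(range(bounds[i], bounds[i + 1]))
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n))
        yield train, val


def make_fidelity(data, n_points, fit_and_score, k=5):
    """Return an objective that cross-validates on the first n_points examples.

    fit_and_score(params, train, val) is a user-supplied (hypothetical)
    callback that trains a model with `params` and returns a validation score.
    """
    subset = data[:n_points]

    def objective(params):
        scores = []
        for train_idx, val_idx in kfold_splits(len(subset), k):
            train = [subset[i] for i in train_idx]
            val = [subset[i] for i in val_idx]
            scores.append(fit_and_score(params, train, val))
        return sum(scores) / len(scores)

    return objective


# Toy usage: a stand-in callback that just reports the training set size.
data = list(range(2000))
report_train_size = lambda params, train, val: len(train)
f1 = make_fidelity(data, 500, report_train_size)    # first fidelity
f2 = make_fidelity(data, 2000, report_train_size)   # second fidelity
```

The cheaper objective trains on 400 points per fold and the expensive one on 1600, which is where the difference in cost between the two fidelities comes from.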
Regression using additive kernels (SALSA): We used the SALSA method for additive kernel ridge regression (Kandasamy & Yu, 2016) on the 4-dimensional coal power plant dataset. We tuned the 6 hyper-parameters (the regularisation penalty, the kernel scale and the kernel bandwidth for each dimension), each in the range $(10^{-3}, 10^{4})$, using 5-fold cross validation. This experiment used M = 3 and 2000, 4000, 8000 points at each fidelity respectively.

Viola & Jones face detection (V&J): The Viola & Jones cascade face classifier (Viola & Jones, 2001), which uses a cascade of weak classifiers, is a popular method for face detection. To classify an image, we pass it through each classifier. If at any point the classifier score falls below a threshold, the image is classified as negative. If it passes through the cascade, then it is classified as positive.
One of the more popular implementations comes with OpenCV and uses a cascade of 22 weak classifiers. The threshold values in the OpenCV implementation are pre-set based on some heuristics and there is no reason to think they are optimal for a given face detection problem. The goal is to tune these 22 thresholds by optimising them over a training set. We modified the OpenCV implementation to take in the thresholds as parameters. As our domain X we chose a neighbourhood around the configuration used in OpenCV. We set this up as an M = 2 fidelity experiment where the second fidelity used 3000 images from the Viola and Jones face database and the first used just 300. Interestingly, on an independent test set, the configurations found by MF-GP-UCB consistently achieved over 90% accuracy while the OpenCV configuration achieved only 87.4% accuracy.
Figure 10: For the three hyper-parameter tuning tasks we plot the best cross validation error (lower is better) and for the astrophysics task we plot the highest log likelihood (higher is better). For the hyper-parameter tuning tasks we obtained the lower fidelities by using smaller training sets, indicated by n tr in the figures and for the astrophysical problem we used a coarser grid for numerical integration, indicated by "Grid". MF-NAIVE is not visible in the last experiment because it performed very poorly. All curves were produced by averaging over 10 experiments. The error bars indicate one standard error. The lengths of the curves are different in time as we ran each method for a pre-specified number of iterations and they concluded at different times.
Type Ia Supernovae (Supernova): We use Type Ia supernovae data from Davis et al. (2007) for maximum likelihood inference on 3 cosmological parameters, the Hubble constant $H_0 \in (60, 80)$, the dark matter fraction $\Omega_M \in (0, 1)$ and the dark energy fraction $\Omega_\Lambda \in (0, 1)$. Unlike typical parametric maximum likelihood problems we see in machine learning, the likelihood is only available as a black-box. It is computed using the Robertson-Walker metric (Davis et al., 2007), which requires a (one dimensional) numerical integration for each sample in the dataset. We set this up as an M = 3 fidelity task. At the third fidelity, the integration was performed using the trapezoidal rule on a grid of size $10^6$. For the first and second fidelities, we used grids of size $10^2$ and $10^4$ respectively. The goal is to maximise the likelihood at the third fidelity.
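The way grid size acts as a fidelity can be illustrated with a small Python sketch of the trapezoidal rule. The integrand below is a toy stand-in chosen only for illustration (it is not the likelihood computation of the experiment), and the names `trapezoid`, `toy_integrand` and `integral_at_fidelity` are hypothetical.

```python
import math


def trapezoid(g, a, b, n_grid):
    """Trapezoidal rule for the integral of g over [a, b] on n_grid points."""
    h = (b - a) / (n_grid - 1)
    total = 0.5 * (g(a) + g(b))
    for i in range(1, n_grid - 1):
        total += g(a + i * h)
    return total * h


def toy_integrand(z, omega_m=0.3, omega_l=0.7):
    # Illustrative smooth integrand; an assumption, not the paper's model.
    return 1.0 / math.sqrt(omega_m * (1.0 + z) ** 3 + omega_l)


# Grid sizes for fidelities 1, 2 and 3, as described in the text.
GRID_SIZES = (10**2, 10**4, 10**6)


def integral_at_fidelity(m, z_max=1.0):
    """Approximate the integral on [0, z_max] at fidelity m (1-indexed)."""
    return trapezoid(toy_integrand, 0.0, z_max, GRID_SIZES[m - 1])


coarse = integral_at_fidelity(1)   # cheap, slightly biased
medium = integral_at_fidelity(2)   # 100 times more grid points
```

The cost of each evaluation grows linearly with the grid size, while for a smooth integrand the discretisation error shrinks quadratically in the grid spacing, so the coarse grids give cheap but slightly biased approximations.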

Proofs
In this section we present the proofs of our main theorems. While it is self contained, the reader will benefit from first reading the more intuitive discussion in Section 5. The goal in this section is to bound the simple regret S(Λ) given in (3). Recall that N is the random number of plays within capital Λ. While $N \le \lfloor \Lambda/\lambda^{(1)} \rfloor$ is a trivial upper bound for N , this will be too loose for our purposes. In fact, we will show that after a sufficiently large number of queries at any fidelity, the number of queries at fidelities smaller than M will be sublinear in N . Hence $N \in \mathcal{O}(n_\Lambda)$ where $n_\Lambda = \lfloor \Lambda/\lambda^{(M)} \rfloor$ is the number of plays by any algorithm that operates only at the highest fidelity.
We introduce some notation to keep track of the evaluations at each fidelity in MF-GP-UCB. After n steps, we will have queried multiple times at any of the M fidelities. $T^{(m)}_n(x)$ denotes the number of queries at $x \in \mathcal{X}$ at fidelity $m$ within the first $n$ steps, and $T^{(m)}_n(A)$ the number of such queries inside a set $A \subset \mathcal{X}$.
Roadmap: To bound S(Λ) in both the discrete and continuous settings, we will begin by studying the algorithm after n evaluations at any fidelity and analyse the quantity $\tilde{R}_n$ defined in (11). Readers familiar with the bandit literature will see that this is similar to the notion of cumulative regret, except we only consider queries at the M th fidelity and inside a set Z ⊂ X . Z contains the optimum and generally has high value for the payoff function f (M ) (x); it will be determined by the approximations provided via the lower fidelity evaluations. We will show that most of the M th fidelity evaluations will be inside Z in the multi-fidelity setting, and hence, the regret for MF-GP-UCB will scale with Ψ n (Z) instead of Ψ n (X ) as is the case for GP-UCB. Finally, to convert this bound in terms of n to one that depends on Λ, we show that both the total number of evaluations N and the number of highest fidelity evaluations $T^{(M)}_N(\mathcal{X})$ are on the order of n Λ when Λ is sufficiently large. For this, we bound the number of plays at the lower fidelities (see Lemma 3).
Then S(Λ) can be bounded by, Before we proceed, we will prove a series of results that will be necessary in our proofs of Theorems 5 and 6. We first prove Lemma 2.
Hence, P GP (A2) ≥ P GP (A2 ). We can now bound, Here the equality in the first step comes from the observation that the f (m) 's are independent under the P GP probability. The last inequality comes from Assumption 2.
Remark 1. It is worth noting that the above bound is a fairly conservative lower bound on ξ A2 since A2 essentially requires that all samples f (m) be small so as to make the differences f (M ) − f (m) small. We can obtain a more refined bound on ξ A2 by noting that f (M ) − f (m) ∼ GP(0, 2κ) and following proofs for bounding the supremum of a GP (e.g. Theorem 5.4 in Adler, 1990, or Theorem 4 in Ghosal and Roy, 2006). This leads to smaller values for β t in Theorems 5 and 6 and consequently better constants in our bounds. However, this analysis will require accounting for correlations when analysing multiple GPs, which is beyond the scope of, and tangential to the goals of, this paper. Moreover, from a practical perspective it would not result in anything actionable since many quantities in the expression for β t are already unknown in practice, even for GP-UCB. It is also worth noting that the dependence of our regret bounds on ξ A2 is mild since it appears as a log(1/ξ A2 ) term.
Next, Lemma 7 provides a way to bound the probability of an event under our prior (A1 and A2) using the probability of the event when the functions are sampled from a GP (A1 only).
Proof This follows via a straightforward application of Bayes' rule, shown below. The last step uses Lemma 2 and that the intersection of two sets is at most as large as either set.
For our analysis, we will also need to control the sum of conditional standard deviations for queries in a subset A ⊂ X . We provide the lemma below, whose proof is based on a similar result in Srinivas et al. (2010).
Lemma 8. Let $f \sim \mathcal{GP}(0, \kappa)$, $f : \mathcal{X} \to \mathbb{R}$ and each time we query at any $x \in \mathcal{X}$ we observe $y = f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \eta^2)$. Let $A \subset \mathcal{X}$. Assume that we have queried $f$ at $n$ points, $(x_t)_{t=1}^{n}$, of which $s$ points are in $A$. Let $\sigma^2_{t-1}$ denote the posterior variance at time $t$, i.e. after $t - 1$ queries. Then, $\sum_{x_t \in A} \sigma^2_{t-1}(x_t) \le \frac{2}{\log(1 + \eta^{-2})} \Psi_s(A)$.
Proof Let $A_s = \{z_1, z_2, \dots, z_s\}$ be the queries inside $A$ in the order they were queried. Now, assuming that we have only queried inside $A$ at $A_s$, denote by $\tilde{\sigma}_{t-1}(\cdot)$ the posterior standard deviation after $t - 1$ such queries. We then bound $\sum_{t : x_t \in A} \sigma^2_{t-1}(x_t)$ through a chain of inequalities. Queries outside $A$ will only decrease the variance of the GP so we can upper bound the first sum by the posterior variances of the GP with only the queries in $A$. The third step uses the inequality $u^2/v^2 \le \log(1 + u^2)/\log(1 + v^2)$ with $u = \tilde{\sigma}_{t-1}(z_t)/\eta$ and $v = 1/\eta$, and the last step uses Lemma 15 in Appendix B.1. The result follows from the fact that $\Psi_s(A)$ maximises the mutual information among all subsets of size $s$.
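The chain of inequalities in this proof can plausibly be written out as follows. This is a sketch under two assumptions that are not stated in the text above: that $\kappa(x, x) \le 1$, so that $u_t \le v$ and the logarithmic inequality applies (the map $x \mapsto x/\log(1+x)$ is increasing), and that Lemma 15 is the standard identity expressing the information gain of $s$ queries as half the sum of $\log(1 + \eta^{-2}\tilde{\sigma}^2_{t-1}(z_t))$. Here $u_t = \tilde{\sigma}_{t-1}(z_t)/\eta$ and $v = 1/\eta$, so that $u_t^2/v^2 = \tilde{\sigma}^2_{t-1}(z_t)$.

```latex
\begin{align*}
\sum_{t \,:\, x_t \in A} \sigma^2_{t-1}(x_t)
  &\le \sum_{t=1}^{s} \tilde{\sigma}^2_{t-1}(z_t)
   = \sum_{t=1}^{s} \frac{u_t^2}{v^2} \\
  &\le \sum_{t=1}^{s} \frac{\log(1 + u_t^2)}{\log(1 + v^2)}
   = \frac{1}{\log(1 + \eta^{-2})}
     \sum_{t=1}^{s} \log\!\left(1 + \eta^{-2}\,\tilde{\sigma}^2_{t-1}(z_t)\right) \\
  &= \frac{2}{\log(1 + \eta^{-2})}\, I\!\left(y_{A_s} ; f_{A_s}\right)
   \le \frac{2}{\log(1 + \eta^{-2})}\, \Psi_s(A).
\end{align*}
```

The constant $2/\log(1 + \eta^{-2})$ obtained this way is consistent with the constant $C_1 = 8/\log(1 + \eta^{-2})$ that appears later in the regret bounds.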

Discrete X
Proof of Theorem 5. Without loss of generality, we can assume that MF-GP-UCB is run indefinitely. Let $N$ denote the (random) number of queries within $\Lambda$, i.e. the quantity satisfying $N = \max\{n \ge 1 : \sum_{t=1}^{n} \lambda^{(m_t)} \le \Lambda\}$. Note that $\mathrm{supp}(N) \subset \{n \in \mathbb{N} : n_\Lambda \le n \le \overline{n}_\Lambda\}$. In our analysis, we will first analyse MF-GP-UCB after $n$ steps and control the regret and the number of lower fidelity evaluations.
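The definition of N and the two deterministic bounds on its support can be checked on a small example. The Python sketch below is illustrative only; `plays_within_capital` and `play_count_bounds` are hypothetical names, fidelities are 1-indexed, and the costs are assumed to be sorted in increasing order.

```python
def plays_within_capital(fidelities, costs, capital):
    """Largest n such that the first n queries cost at most `capital`.

    fidelities: sequence m_1, m_2, ... of queried fidelities (1-indexed),
    assumed long enough to exhaust the capital.
    costs[m - 1] is the cost lambda^(m) of one query at fidelity m.
    """
    spent = 0
    n = 0
    for m in fidelities:
        cost = costs[m - 1]
        if spent + cost > capital:
            break
        spent += cost
        n += 1
    return n


def play_count_bounds(costs, capital):
    """Return (floor(capital / lambda^(M)), floor(capital / lambda^(1)))."""
    return int(capital // costs[-1]), int(capital // costs[0])


# Toy run: two fidelities with costs 1 and 10, and capital 25.
costs = [1, 10]
queries = [1, 1, 2, 1, 2, 1, 1, 1]
n_plays = plays_within_capital(queries, costs, capital=25)
lower, upper = play_count_bounds(costs, capital=25)
```

Here the cumulative costs are 1, 2, 12, 13, 23, 24, 25, 26, so seven queries fit within the capital, which lies between the single-fidelity count of 2 and the cheapest-fidelity count of 25.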
Bounding the regret after n evaluations: We will need the following lemma to establish that ϕ t (x) upper bounds f (M ) (x). The proof is given in Section 8.1.1.
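Lemma 9. Pick $\delta \in (0, 1)$ and choose $\beta_t \ge 2\log\left(\frac{M|\mathcal{X}|\pi^2 t^2}{3\xi_{A2}\delta}\right)$. Then, with probability at least $1 - \delta/2$, for all $t \ge 1$, for all $x \in \mathcal{X}$ and for all $m \in \{1, \dots, M\}$, we have $|f^{(m)}(x) - \mu^{(m)}_{t-1}(x)| \le \beta_t^{1/2}\sigma^{(m)}_{t-1}(x)$.
We first bound the instantaneous regret $f_\star - f^{(M)}(x_t)$ when $m_t = M$ in three steps. The first step uses that $\varphi^{(m)}_t(x)$ is an upper bound for $f^{(M)}(x)$ by Lemma 9 and the assumption A2, and hence so is the minimum $\varphi_t(x)$. The second step uses that $x_t$ was the maximiser of $\varphi_t(x)$, and the third step that $\varphi_t(x_t) \le \varphi^{(M)}_t(x_t)$, since $\varphi_t$ is the minimum over the fidelities.
To control $\tilde{R}_n$, we will use $Z = \mathcal{H}^{(M)}$ in (11) and invoke Lemma 8. Applying the Cauchy Schwarz inequality then yields a bound on $\tilde{R}_n$ with constant $C_1 = 8/\log(1 + \eta^{-2})$.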
Bounding the number of evaluations: Lemma 10, given below, bounds the number of evaluations at different fidelities in different regions of X . This will allow us to bound, among other things, the total number of plays N and the number of M th fidelity evaluations outside Z. The proof of Lemma 10 is given in Section 8.1.2. Recall that T (m) n (x) denotes the number of queries at point x ∈ X at fidelity m. Similarly, T (>m) n (x) will denote the number of queries at point x at fidelities larger than m.
Lemma 10. Pick $\delta \in (0, 1)$ and set $\beta_t = 2\log\left(\frac{M|\mathcal{X}|\pi^2 t^2}{3\xi_{A2}\delta}\right)$. Further assume $\varphi_t(x_\star) \ge f_\star$. Consider any $x \in \mathcal{H}^{(m)} \setminus \{x_\star\}$ for $m < M$. We then have the following bounds on the number of queries at any given time step n,
First, whenever $\varphi_t(x_\star) \ge f_\star$, by using the union bound on the second result of Lemma 10, Here we have used $\sum_{n} n^{-2} = \pi^2/6$. The last two quantifiers just enumerate over all $x \in \mathcal{X} \setminus \{x_\star\}$. Similarly, applying the union bound for u = 1 on the third result, we have, for any given n, We will apply the above result for $n = \lfloor \Lambda/\lambda^{(1)} \rfloor$ and observe that $T^{(>m)}_n(x)$ is non-decreasing in n. Hence, The condition for Lemma 10 holds with probability at least 1 − δ/2 (by Lemma 9), and therefore the above bounds hold together with probability > 1 − δ. We have tabulated these bounds in Table 1. We therefore have the following bound on the number of fidelity m (< M ) plays $T^{(m)}_n(\mathcal{X})$, The second step uses that $\Delta^{(m)}(x) \ge 3\gamma^{(m)}$ for $x \in \mathcal{H}^{(m)}$ and the last step uses the modification to the discrete algorithm which ensures that we will always play an arm at a lower fidelity before we play it at a higher fidelity. Hence, for an arm in $\mathcal{H}^{(m)}$, the 1 play at fidelities larger than m will be played at fidelity m + 1.
Proof of first result: First consider the total cost Λ (n) expended at fidelities 1, . . . , M − 1 and at the M th fidelity outside of H (M ) after n evaluations. Using (17), we have, Since $N \le \overline{n}_\Lambda$, we have for all n ∈ supp (N ), Λ (n) is less than the LHS of (7) and hence less than Λ/2. Therefore, the amount of cost spent at the M th fidelity inside H (M ) is at least Λ/2 and since each such evaluation expends λ (M ) , we have $T^{(M)}_N(\mathcal{H}^{(M)}) \ge n_\Lambda/2$. Therefore using (14) we have, Here, we have used $N \le \overline{n}_\Lambda$ and that n Λ ≥ T The first term of the RHS above follows via (15) and the following argument. In particular, this does not use the additional condition on the discrete algorithm; we will use a similar argument in the continuous domain setting.
Let the LHS of (18) be A and the RHS be B when n = N . When Λ > Λ 2 , by (8) and using the fact that $N \le \overline{n}_\Lambda$, we have B < n Λ /2 < N/2. Since N = A + T
Remark 2. Choice of γ (m) : As described in the main text, the optimal choice for γ (m) depends on the available budget and unknown problem dependent quantities. However the choice $\gamma^{(m)} = \sqrt{\lambda^{(m)}/\lambda^{(m+1)}}\,\zeta^{(m)}$ ensures that for any x ∈ H (m) , the bounds on the number of plays in Table 1 are on the same order for fidelities m and below. To see this, consider any ℓ < m. Then, We therefore have, Above, by Table 1, the left most expression is an upper bound on the cost spent at fidelity ℓ and the term inside the parentheses is an upper bound on the cost spent at fidelity m. Hence, the capital spent at the lower fidelities is within a constant factor of this bound. In the K-armed setting (Kandasamy, Dasarathy, Poczos, & Schneider, 2016), we showed an O(η 2 /∆ (m) (x) 2 ) lower bound on the number of plays at the m th fidelity as well; such a result is not straightforward in the GP setting due to correlations between arms.

PROOF OF LEMMA 9
This is a straightforward argument using Gaussian concentration and the union bound. Consider any given m, t, x.
The first step uses Lemma 7. In the second step we have conditioned w.r.t. $\mathcal{D}^{(m)}_{t-1}$ which allows us to use Lemma 14. Recall that conditioning on all queries will not be a Gaussian due to the $\zeta^{(m)}$ constraints. The statement follows via a union bound over all $m \in \{1, \dots, M\}$, $x \in \mathcal{X}$ and all $t$, and noting that $\sum_{t} t^{-2} = \pi^2/6$.
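The arithmetic behind the union bound can be made explicit. The following is a sketch, assuming that Lemma 14 gives a two-sided Gaussian tail bound of the form $\mathbb{P}(|Z| > \beta_t^{1/2}\sigma) \le e^{-\beta_t/2}$ and that Lemma 7 converts a probability under the GP into one under the prior at the price of a factor $\xi_{A2}^{-1}$; the event written on the left is likewise an assumption about the lost display.

```latex
\begin{align*}
\mathbb{P}\!\left( \left| f^{(m)}(x) - \mu^{(m)}_{t-1}(x) \right|
      > \beta_t^{1/2}\, \sigma^{(m)}_{t-1}(x) \right)
  &\le \frac{1}{\xi_{A2}}\, e^{-\beta_t/2}
   \le \frac{1}{\xi_{A2}} \cdot \frac{3\,\xi_{A2}\,\delta}{M |\mathcal{X}| \pi^2 t^2}
   = \frac{3\delta}{M |\mathcal{X}| \pi^2 t^2}, \\
\sum_{m=1}^{M} \sum_{x \in \mathcal{X}} \sum_{t \ge 1}
      \frac{3\delta}{M |\mathcal{X}| \pi^2 t^2}
  &= \frac{3\delta}{\pi^2} \sum_{t \ge 1} t^{-2}
   = \frac{3\delta}{\pi^2} \cdot \frac{\pi^2}{6}
   = \frac{\delta}{2}.
\end{align*}
```

This recovers the failure probability $\delta/2$ in the statement of Lemma 9 and explains the particular constants inside the logarithm in the choice of $\beta_t$.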

PROOF OF LEMMA 10
First consider any ℓ < m. Assume that we have already queried η 2 β n /γ (m) 2 times at any t ≤ n. Since the Gaussian variance after s observations is η 2 /s and since queries elsewhere will only decrease the conditional variance we have, κ (m) and by the design of our algorithm we will not play at the ℓ th fidelity at time t for all t until n. This establishes the first result.
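The variance argument can be checked numerically for a single point. The Python sketch below uses the closed form for the posterior variance of one Gaussian variable observed s times with independent noise, which is always at most η²/s, and computes how many repeated queries make β times the variance drop to γ² or below. The function names are hypothetical and the numbers are arbitrary.

```python
import math


def posterior_variance(prior_var, noise_var, s):
    """Posterior variance of a Gaussian value after s noisy observations."""
    return 1.0 / (1.0 / prior_var + s / noise_var)


def queries_needed(noise_var, beta, gamma):
    """Smallest s with noise_var / s <= gamma^2 / beta."""
    return math.ceil(noise_var * beta / gamma ** 2)


# Example: prior variance 1, noise variance eta^2 = 0.25, beta = 4, gamma = 0.1.
s = queries_needed(noise_var=0.25, beta=4.0, gamma=0.1)
width_sq = 4.0 * posterior_variance(1.0, 0.25, s)  # beta * posterior variance
```

After that many queries at the point, the squared confidence width is below γ², so a switching rule based on comparing the width against γ would move on to the next fidelity.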
To bound T (m) n (x) we first observe, The first line just enumerates the conditions in our algorithm for it to have played x at time t at fidelity m. In the second step we have relaxed some of those conditions, noting in particular that if ϕ t (·) was maximised at x then it must be larger than ϕ t (x ). The last step uses the fact that ϕ (m) t (x) ≥ ϕ t (x) and the assumption on ϕ t (x ). Consider the event {ϕ t−1 (x) ≥ u}. We will choose u = 5η 2 β n /∆ (m) (x) 2 and bound its probability via, t−1 (x) and that β n ≥ β t . The fourth step uses Lemma 14 after conditioning on D (m) t−1 , the fifth step uses ( √ 5 − 1) 2 > 3/2 and the last step uses 3δ/|X |π 2 < 1. Using the union bound on (20), we get P(T (m) The second inequality of the lemma follows by noting that there are at most n terms in the summation.
Finally, for the third inequality we observe As before, we have used that if x is to be queried at time t, then ϕ t (x) should be at least larger than ϕ t (x ) which is larger than f due to the assumption in the theorem. The second condition is necessary to ensure that the switching procedure proceeds beyond the m th fidelity. It is also necessary to have β 1/2 t σ ( ) t−1 (x) < γ ( ) for < m, but we have relaxed them. We first bound the probability of the event {ϕ Here, the second step uses that for all x ∈ H (m) , ∆ (m) (x) > 3γ (m) > 2γ (m) and the third step uses the second condition. Using the union bound on (21) and bounding the sum by an integral gives us,

Compact and Convex X
To prove Theorem 6 we will require a fairly delicate set up for the continuous setting. Given a sequence $\{\nu_n\}_{n \ge 0}$, at time $n$ we will consider a covering of the space $\mathcal{X}$. For instance, if $\mathcal{X} = [0, r]^d$ a sufficient discretisation would be an equally spaced grid having $\nu_n^{1/(2d)}$ points per side. Let $\{a_{i,n}\}_{i=1}^{n^{\alpha/2}}$ be the points in the covering and $F_n = \{A_{i,n}\}_{i=1}^{n^{\alpha/2}}$ be the "cells" in the covering, i.e. $A_{i,n}$ is the set of points which are closest to $a_{i,n}$ in $\mathcal{X}$, and the union of all sets $A_{i,n}$ in $F_n$ is $\mathcal{X}$.
Next we will define another partitioning of the space using this covering. We define disjoint subsets of $F_n$, beginning with $F^{(1)}_n$, and illustrate them with respect to $\mathcal{H}^{(m)}$ and $\mathcal{H}^{(m)}_n$ in Figure 11. By inspecting Figure 11 we obtain the decomposition used in the remainder of the proof.
We are now ready to prove Theorem 6. We will denote the $\varepsilon$ covering number of a set $A \subset \mathcal{X}$ in the $\|\cdot\|_2$ metric by $\Omega_\varepsilon(A)$.
Proof of Theorem 6. As in the discrete case, we will first control the regret and the number of lower fidelity evaluations by controlling each term in (23).
Bounding the regret after n evaluations: We will need the following lemma whose proof is given in Section 8.1.1.
As in the discrete setting, we set $Z = \mathcal{H}^{(M)}_n$ in (11) to bound $R_n$. Using $m = M$ in Lemma 11 and using calculations similar to the discrete case yields the bound (24). Here $C_1 = 8/\log(1 + \eta^{-2})$. We have also used the fact that $\sum_{t>0} t^{-2} = \pi^2/6$.
Bounding the number of evaluations: The following lemma will be used to bound the number of plays in $\mathcal{H}^{(m)}_n \cup \mathcal{H}^{(m)}$. The proof is given in Section 8.2.2.
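Lemma 12. Let $f \sim \mathcal{GP}(0, \kappa)$, $f : \mathcal{X} \to \mathbb{R}$, and we observe $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \eta^2)$. Let $A \subset \mathcal{X}$ be such that its $L_2$ diameter satisfies $\mathrm{diam}(A) \le D$. Say we have $n$ queries $(x_t)_{t=1}^n$ of which $s$ points are in $A$. Then the posterior variance of the GP, $\kappa'(x, x)$, at any $x \in A$ satisfies
$$\kappa'(x, x) \le \begin{cases} C_{SE} D^2 + \eta^2/s & \text{if } \kappa \text{ is the SE kernel}, \\ C_{Mat} D + \eta^2/s & \text{if } \kappa \text{ is the Matérn kernel}, \end{cases}$$
for appropriate kernel dependent constants $C_{SE}$, $C_{Mat}$.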
First consider the SE kernel. At time $t$ consider an $\varepsilon_n$-covering $\{B_i\}$. The number of queries inside any $B_i$ of this covering at time $n$ will be at most $2\eta^2\beta_n/(\gamma^{(m)})^2$. To see this, assume we have already queried this many times inside $B_i$ at time $t \le n$. By Lemma 12 the maximum variance in $A_i$ can be bounded so that $\beta_t^{1/2}\sigma^{(m)}_{t-1}(x) < \gamma^{(m)}$, and we will not query inside $A_i$ until time $n$. A similar result is obtained for the Matérn kernel by setting $\varepsilon_n = (\gamma^{(m)})^2/(4 C_{Mat}\beta_n)$. Therefore we have the following bound.
Here $C_\kappa = 2^{2+d/2}(d C_{SE})^{d/2}$ and $p = 1/2$ for the SE kernel while $C_\kappa = 2^{2+d}(C_{Mat})^d d^{d/2}$ and $p = 1$ for the Matérn kernel. We have also used the fact that k ≤ 2k for large enough k and the following bound for a $\delta$-packing in the Euclidean metric: $\Omega_\delta(A) \le \mathrm{vol}(A)\, d^{d/2}/(2^{d/2}\delta^d)$.
Next, we will bound T To that end we provide the following Lemma whose proof is given in Section 8.2.3.

Lemma 13. Consider any
is as defined in (22). Let $\beta_t$ be as given in Theorem 6. Then for all $n \ge u \ge (3\eta)^{-2/3}$ we have the stated bound. Using the above result with $n = n_\Lambda$ gives us the result for all $n \le n_\Lambda$, since the number of plays is non-decreasing in $n$. Henceforth, all statements we make will make use of the bounds above and will hold with probability $> 1 - \delta$ for all $n \in \mathrm{supp}(N)$.
The second step uses (23). The third step uses (25), (26), and an additional argument. The remainder of the proof follows similarly to the discrete case. Noting that n Λ ≤ n ≤ n Λ and that $\mathcal{H}^{(m)}_n$ is shrinking with $n$, we can conclude that Λ (n) is less than the LHS of (10). Therefore, T The first step uses (23) while the second step uses (25) and (27). Once again, similar to the discrete case we can argue that for all Λ > Λ 2 , the RHS B of (28) satisfies B < n Λ /2 < N/2, the $M$th fidelity plays in H n Λ ) n Λ + π 2 3n Λ .

PROOF OF LEMMA 11
The first part of the proof mimics the arguments in Lemmas 5.6, 5.7 of Srinivas et al. (2010). By Assumption 1, a high-probability bound holds for any given $m \in \{1, \dots, M\}$ and $i \in \{1, \dots, d\}$. Then, by the union bound and Lemma 7, the analogous bound holds simultaneously for all $m \in \{1, \dots, M\}$, all $i \in \{1, \dots, d\}$ and all $x \in \mathcal{X}$. Now we construct a discretisation $F_t$ of $\mathcal{X}$ of size $(\nu_t)^d$ such that for all $x \in \mathcal{X}$, $\|x - [x]_t\|_1 \le rd/\nu_t$. Here $[x]_t$ is the closest point to $x$ in the discretisation. (Note that this is different from the discretisation appearing in Theorem 6 even though we have used the same notation.) By choosing $\nu_t = t^2 brd \log(6Mad/(\xi_{A2}\delta))$ and using the above we have $|f^{(m)}(x) - f^{(m)}([x]_t)| \le 1/t^2$ for all $x \in \mathcal{X}$ and all $f^{(m)}$'s with probability $> 1 - \delta/6$. Noting that $\beta_t \ge 2\log(M|F_t|\pi^2 t^2/2\delta)$ for the given choice of $\nu_t$, we have confidence on all points of $F_t$ with probability $> 1 - \delta/3$.
The proof mimics that of Lemma 9 using the same conditioning argument. However, instead of a fixed set over all $t$, we change the set on which we have confidence based on the discretisation. Similarly we can show that with probability $> 1 - \delta/3$ we also have confidence on the decisions $x_t$ at all time steps. Using (29), (30) and (31) the following statements hold with probability $> 1 - 5\delta/6$. First we can upper bound $f_\star$ in terms of $\varphi^{(m)}_t([x_\star]_t)$ for each $m$. Since this holds for all $m$, we have $f_\star \le \varphi_t([x_\star]_t) + 1/t^2$. Now, using similar calculations as (13) we bound $\Delta^{(m)}(x_t)$.
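The discretisation property $\|x - [x]_t\|_1 \le rd/\nu_t$ used in this proof can be illustrated with a small sketch. It assumes $\mathcal{X} = [0, r]^d$ and a grid of cell centres with `nu` points per side; the function name `nearest_grid_point` and the choice of cell centres are assumptions made for illustration, not the paper's construction.

```python
import random

def nearest_grid_point(x, r, nu):
    """Snap x in [0, r]^d to the closest point of an equally spaced grid.

    The grid has nu points per side, placed at the cell centres
    (k + 0.5) * r / nu for k = 0, ..., nu - 1 (an assumed layout).
    """
    h = r / nu
    snapped = []
    for xi in x:
        k = min(int(xi / h), nu - 1)  # clamp so that xi == r maps to the last cell
        snapped.append((k + 0.5) * h)
    return snapped

def l1_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

# Check the covering property on random points (illustrative values).
random.seed(0)
r, d, nu = 2.0, 3, 16
worst = 0.0
for _ in range(1000):
    x = [random.uniform(0.0, r) for _ in range(d)]
    worst = max(worst, l1_distance(x, nearest_grid_point(x, r, nu)))
print(worst <= r * d / nu)
```

With cell centres each coordinate moves by at most $r/(2\nu_t)$, so the $L_1$ error is in fact at most $rd/(2\nu_t)$, which is within the bound required above.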

PROOF OF LEMMA 12
Since the posterior variance only decreases with more observations, we can upper bound $\kappa'(x, x)$ for any $x \in A$ by considering its posterior variance with only the $s$ observations in $A$. Further, the maximum variance within $A$ occurs if we pick 2 points $x_1, x_2$ that are distance $D$ apart and have all observations at $x_1$; then $x_2$ has the highest posterior variance. Therefore, we will bound $\kappa'(x, x)$ for any $x \in A$ with $\kappa'(x_2, x_2)$ in the above scenario. Let $\kappa_0 = \kappa(x, x)$ and $\kappa(x, x') = \kappa_0 \phi(\|x - x'\|_2)$, where $\phi(\cdot) \le 1$ depends on the kernel. Denote the Gram matrix in the scenario described above by $\Delta = \kappa_0 \mathbf{1}\mathbf{1}^\top + \eta^2 I$. Then using the Sherman-Morrison formula on the posterior variance (2),
$$\kappa'(x_2, x_2) = \kappa_0 - \kappa_0^2 \phi^2(D)\, \mathbf{1}^\top \Delta^{-1} \mathbf{1} = \kappa_0 - \frac{s \kappa_0^2 \phi^2(D)}{\eta^2 + s\kappa_0} \le \kappa_0 \big(1 - \phi^2(D)\big) + \frac{\eta^2}{s}.$$
For the SE kernel with bandwidth $h$ we have $1 - \phi^2(D) = 1 - \exp(-D^2/h^2) \le D^2/h^2$. Plugging this into the bound above retrieves the first result with $C_{SE} = \kappa_0/h^2$. For the Matérn kernel we use a Lipschitz constant $L_{Mat}$ of $\phi$. Then $1 - \phi^2(D) = (1 - \phi(D))(1 + \phi(D)) \le 2(\phi(0) - \phi(D)) \le 2 L_{Mat} D$. We get the second result with $C_{Mat} = 2\kappa_0 L_{Mat}$. Since the SE kernel decays fast, we get a stronger result on its posterior variance which translates to a better bound in our theorems.
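The worst-case scenario in this proof (all $s$ observations at $x_1$, variance evaluated at $x_2$) is simple enough to check numerically. The sketch below is an illustration under assumptions: it conditions a two-point Gaussian on one noisy observation at a time using the standard rank-one update, compares the result with the closed form $\kappa_0 - s\kappa_0^2\phi^2(D)/(\eta^2 + s\kappa_0)$, and then with the bound $\kappa_0(1 - \phi^2(D)) + \eta^2/s$. All function names and numeric values are hypothetical.

```python
import math

def posterior_var_sequential(kappa0, phi_D, eta, s):
    """Posterior variance at x2 after s noisy observations at x1.

    Tracks var(x1), var(x2) and cov(x1, x2) and applies the usual
    Gaussian conditioning update once per observation.
    """
    v1, v2, c = kappa0, kappa0, kappa0 * phi_D
    for _ in range(s):
        denom = v1 + eta**2          # predictive variance of the next observation
        v2 -= c * c / denom          # variance reduction at x2
        c *= eta**2 / denom          # shrink the cross-covariance
        v1 *= eta**2 / denom         # shrink the variance at x1
    return v2

def posterior_var_closed_form(kappa0, phi_D, eta, s):
    """Closed form obtained from the Sherman-Morrison formula."""
    return kappa0 - s * kappa0**2 * phi_D**2 / (eta**2 + s * kappa0)

def variance_upper_bound(kappa0, phi_D, eta, s):
    """Upper bound kappa0 * (1 - phi(D)^2) + eta^2 / s."""
    return kappa0 * (1.0 - phi_D**2) + eta**2 / s

# SE kernel with assumed bandwidth h and separation D.
kappa0, eta, h, D = 1.5, 0.3, 1.0, 0.2
phi_D = math.exp(-D**2 / (2 * h**2))
for s in (1, 5, 50):
    seq = posterior_var_sequential(kappa0, phi_D, eta, s)
    closed = posterior_var_closed_form(kappa0, phi_D, eta, s)
    print(s, abs(seq - closed) < 1e-9, seq <= variance_upper_bound(kappa0, phi_D, eta, s))
```

The sequential update is used here only to avoid a matrix inverse; it is an equivalent way of computing the same posterior variance, not the route taken in the proof.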

PROOF OF LEMMA 13
First, we will invoke the same discretisation used in the proof of Lemma 11 via which we have $\varphi_t([x_\star]_t) \ge f_\star - 1/t^2$ (32). (Therefore, Lemma 13 holds only with probability $> 1 - \delta/6$, but this event has already been accounted for in Lemma 11.) Let $b_{i,n,t} = \mathrm{argmax}_{x \in A_{i,n}} \varphi_t(x)$ be the maximiser of the upper confidence bound in $A_{i,n}$ at time $t$. Note that the discretisation is fixed ahead of time and $b_{i,n,t}$ is deterministic given the data $\{(x_i, m_i, y_i)\}_{i=1}^{t-1}$ at time $t$. We now proceed using the relaxation $\varphi_t(b_{i,n,t}) > f_\star - 1/t^2$. In the second step we have rearranged the terms and used the definition of $\Delta^{(m)}(x)$. The third step uses $A_{i,n} \subset J_{t-1}(b_{i,n,t})$. For the fourth step, we have used $t > u \ge 1/(3\eta)^{2/3}$, $\beta_t > 2\log(M\pi^2 t^2/2\delta) > (3/2)^2$, and the bound on $\sigma^{(m)}$. The last step bounds the sum by an integral.

Conclusion
We introduced and studied the multi-fidelity bandit problem under Gaussian process assumptions. Our theorems demonstrate that MF-GP-UCB explores the space using the cheap lower fidelities, and uses the higher fidelity queries on successively smaller regions, hence performing better than single fidelity strategies. Via experiments on synthetic functions, three hyper-parameter tuning tasks, and an astrophysical maximum likelihood estimation problem, we demonstrate the efficacy of our method and, more generally, the utility of the multi-fidelity framework. Our Matlab implementation and experiments can be downloaded from github.com/kirthevasank/mf-gp-ucb.
Going forward we wish to study multi-fidelity optimisation under different model assumptions, and extend the algorithm when we have to deal with approximations from structured fidelity spaces.
Hartmann-3D function: The $M$th fidelity function is $f^{(M)}(x) = \sum_{i=1}^{4} \alpha_i \exp\big(-\sum_{j=1}^{3} A_{ij}(x_j - P_{ij})^2\big)$ where $A, P \in \mathbb{R}^{4 \times 3}$ are fixed matrices given below and $\alpha = [1.0, 1.2, 3.0, 3.2]$. For the lower fidelities we use the same form except that $\alpha$ is replaced by a modified coefficient vector.
Hartmann-6D function: The 6-D Hartmann function takes the same form as the 3-D case except $A, P \in \mathbb{R}^{4 \times 6}$ are as given below. We use the same modifications as above to obtain the lower fidelities using $M = 4$.
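As an illustration of the functional form above, here is a minimal sketch of the highest-fidelity Hartmann-3D function. The matrices `A` and `P` below are the values commonly used for the standard Hartmann-3D benchmark and are an assumption here, since the paper's own matrices are not reproduced in this section; the function name `hartmann3` is likewise hypothetical. The `alpha` argument can be overridden to mimic a lower fidelity with modified coefficients.

```python
import math

# Coefficients of the highest fidelity, as stated in the text.
ALPHA = [1.0, 1.2, 3.0, 3.2]

# Assumed: standard Hartmann-3D benchmark matrices.
A = [
    [3.0, 10.0, 30.0],
    [0.1, 10.0, 35.0],
    [3.0, 10.0, 30.0],
    [0.1, 10.0, 35.0],
]
P = [
    [0.3689, 0.1170, 0.2673],
    [0.4699, 0.4387, 0.7470],
    [0.1091, 0.8732, 0.5547],
    [0.0381, 0.5743, 0.8828],
]

def hartmann3(x, alpha=ALPHA):
    """Evaluate sum_i alpha_i * exp(-sum_j A_ij * (x_j - P_ij)^2) for x in [0, 1]^3."""
    total = 0.0
    for i, a_i in enumerate(alpha):
        exponent = sum(A[i][j] * (x[j] - P[i][j]) ** 2 for j in range(3))
        total += a_i * math.exp(-exponent)
    return total

# Commonly reported maximiser of the standard benchmark (maximisation form).
x_best = [0.114614, 0.555649, 0.852547]
print(hartmann3(x_best))
```

Because the function is written in maximisation form, a bandit method would seek the largest value; the point `x_best` is the commonly reported optimum of the standard benchmark under the assumed matrices.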