Sliding-Window Thompson Sampling for Non-Stationary Settings

Restless Bandits describe sequential decision-making problems in which the rewards evolve with time independently of the actions taken by the policy-maker. It has been shown that classical bandit algorithms fail when the underlying environment changes, making clear that, in order to tackle these more challenging scenarios, specifically crafted algorithms are needed. In this paper, extending and correcting the work by Trovò et al. [2020], we analyze two Thompson Sampling-inspired algorithms, namely BETA-SWTS and γ-SWGTS, introduced to face the additional complexity given by the non-stationary nature of the settings; in particular, we derive a general formulation for the regret in any arbitrary restless environment for both Bernoulli and Subgaussian rewards and, through the introduction of new quantities, we investigate which contribution lays the deeper foundations of the error made by the algorithms. Finally, we infer from the general formulation the regret for two of the most common non-stationary settings: the Abruptly Changing and the Smoothly Changing environments.


Introduction
The field of reinforcement learning has seen remarkable advancements, with bandit algorithms (Lattimore and Szepesvári [2020]) standing out as a fundamental component. These algorithms, which balance exploration and exploitation to optimize decision-making, have traditionally been studied in stationary settings, where the environment does not change over time. However, many real-world applications, such as online advertising, medical treatment scheduling, and dynamic resource allocation, operate in environments that are inherently dynamic. These are often referred to as "restless" settings, where the state of the world evolves independently of the actions taken by the decision-maker.
Analyzing bandit algorithms in restless settings is crucial, as the assumption of stationarity in traditional bandit problems is rarely met in practice. Markets fluctuate, user preferences shift, and external conditions vary, all contributing to a dynamic environment that challenges the efficacy of conventional bandit strategies. In these contexts, the ability to adapt to changing conditions can significantly enhance performance, making the study of restless bandit algorithms highly relevant: they embody a richer and more complex decision-making scenario and offer a closer approximation to real-world challenges. This complexity necessitates the development of more sophisticated algorithms, capable of handling uncertainty and temporal dynamics.
Original Contributions In this paper, we provide original contributions in the field of non-stationary MABs, as we extend and correct the original work by Trovò et al. [2020]. In particular:
• In Section 4, we derive a novel general formulation for the frequentist regret of Sliding-Window Thompson Sampling-inspired algorithms in an arbitrary restless setting, for both Bernoulli and Subgaussian rewards, unveiling the deeper dynamics that rule the performance of those algorithms. This requires the introduction of new quantities that characterize the non-stationary nature of the environment;
• In Section 5, we show what the statements for the general setting imply for the abruptly changing environment, retrieving regret bounds in agreement with the state of the art;
• In Section 6, we show what the statements for the general setting imply for the smoothly changing environment, retrieving regret bounds in agreement with the state of the art.
Restless Bandits Even if, as stated earlier, Thompson Sampling is optimal in the stationary case, it has been shown in multiple cases that in non-stationary (Garivier and Moulines [2011], Trovò et al. [2020], Liu et al. [2024]) or adversarial settings (Cesa-Bianchi and Lugosi [2006]) it provides poor performance in terms of regret. Lately, the UCB1 and TS algorithms have inspired the development of techniques to face the inherent complexities of restless MAB settings (Whittle [1988]). The main idea behind these newly crafted algorithms is to forget past observations, removing samples from the statistics of the arms' rewards. There are two main approaches in the literature to forget past observations: passive and active. The former iteratively discards the information coming from the far past, making decisions on the most recent samples coming from the arms pulled by the algorithm. Examples of this family of algorithms are D-UCB (Garivier and Moulines [2011]), Discounted TS (Qi et al. [2023], Raj and Kalyani [2017]), SW-UCB (Garivier and Moulines [2011]), and SW-TS (Trovò et al. [2020]). Instead, the latter class of algorithms uses change-detection techniques (Basseville et al. [1993]) to decide when to discard old samples; this occurs when a sufficiently large change affects the arms' expected rewards. Among the active approaches we mention CUSUM-UCB (Liu et al. [2018]), REXP3 (Besbes et al. [2014]), GLR-klUCB (Besson and Kaufmann [2019]), and BR-MAB (Re et al. [2021]).
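To make the two passive-forgetting mechanisms concrete, the following is a minimal Python sketch, assumed notation and helper names ours: a sliding-window estimator (as in SW-UCB/SW-TS) that keeps only the last τ samples, and a discounted estimator (as in D-UCB/Discounted TS) that geometrically down-weights old samples.

```python
from collections import deque

def sliding_window_mean(rewards, tau):
    """Passive forgetting, sliding-window style (as in SW-UCB / SW-TS):
    estimate the reward using only the last `tau` observations."""
    window = deque(maxlen=tau)  # old samples fall out automatically
    means = []
    for x in rewards:
        window.append(x)
        means.append(sum(window) / len(window))
    return means

def discounted_mean(rewards, gamma):
    """Passive forgetting, discount style (as in D-UCB / Discounted TS):
    a sample observed s steps in the past is weighted by gamma**s."""
    num, den = 0.0, 0.0
    means = []
    for x in rewards:
        num = gamma * num + x      # discounted sum of rewards
        den = gamma * den + 1.0    # discounted number of pulls
        means.append(num / den)
    return means
```

Both estimators track the post-change mean after an abrupt shift, whereas a plain running average remains biased by the samples from the far past.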

Definition of the Problem and Algorithms
We model the problem as a stochastic NS-MAB setting in which, at each round t over a finite time horizon T, the learner selects an arm i among a finite set of K arms. At each round t the learner observes a realization of the reward X_{i,t} obtained from the chosen arm i. The reward for each arm i ∈ K at round t is modeled by a sequence of independent random variables X_{i,t} from a distribution unknown to the learner. We denote μ_{i,t} := E[X_{i,t}], and we will study two types of reward distributions, encoded by the following assumptions.

Assumption 3.1 (Bernoulli rewards). For every arm i ∈ K and round t ∈ T, we have:
X_{i,t} ∼ Be(μ_{i,t}),   (1)
where Be(μ) denotes the Bernoulli distribution with parameter μ ∈ [0, 1].

Assumption 3.2 (Subgaussian rewards). For every arm i ∈ K and round t ∈ T, we have:
X_{i,t} ∼ SubG(μ_{i,t}, σ²_var),   (2)
where SubG(μ, σ²_var) denotes a generic subgaussian distribution with mean μ ∈ ℝ and variance proxy σ²_var, i.e., E[exp(λ(X − μ))] ≤ exp(λ²σ²_var/2) for every λ ∈ ℝ.

The goal of the learner A is to minimize the expected cumulative dynamic frequentist regret over the horizon T, against the comparator that chooses at each round the arm with the largest expected reward at time t, defined as i*(t) ∈ argmax_{i∈K} μ_{i,t}; formally:
R_T(A) := E[ Σ_{t=1}^{T} ( μ_{i*(t),t} − μ_{I_t,t} ) ],
where the expected value is taken w.r.t. the possible randomness of the algorithm.

Algorithm 1 Beta-SWTS
1: Input: number of arms K, time horizon T, time window τ
2: Set X̄_{i,t,τ} ← 0 for each i ∈ K
3: Set α_{i,1} ← 1 + X̄_{i,t,τ} and β_{i,1} ← 1 + (1 − X̄_{i,t,τ}) for each i ∈ K
4: Set ν_{i,1} ← Beta(α_{i,1}, β_{i,1}) for each i ∈ K
5: for t ∈ T do
6:   Sample θ_{i,t,τ} ∼ ν_{i,t} for each i ∈ K
7:   Select I_t ∈ argmax_{i∈K} θ_{i,t,τ}
8:   Pull arm I_t
9:   Collect reward X_t
10:  Update X̄_{i,t,τ} and T_{i,t,τ}, respectively the sum of the rewards collected for arm i within rounds t − τ + 1 and t and the number of times arm i has been pulled within rounds t − τ + 1 and t
11:  Update ν_{i,t+1} ← Beta(1 + X̄_{i,t,τ}, 1 + (T_{i,t,τ} − X̄_{i,t,τ})) for each i ∈ K
12: end for

Algorithm 2 γ-SWGTS
1: Input: number of arms K, time horizon T, exploration parameter γ, time window τ
2: Play every arm once and collect reward X_t
3: Set T_{i,t,τ} ← 1, μ̃_{i,t,τ} ← X_t, μ̂_{i,t,τ} ← μ̃_{i,t,τ} for each i ∈ K
4: Set ν_{i,t} ← N(μ̂_{i,t,τ}, 1/γ) for each i ∈ K
5: for t ∈ T do
6:   Sample θ_{i,t,τ} ∼ ν_{i,t} for each i ∈ K
7:   Select I_t ∈ argmax_{i∈K} θ_{i,t,τ}
8:   Pull arm I_t
9:   Collect reward X_t
10:  Update the sum of the rewards collected within rounds t − τ + 1 and t, namely μ̃_{i,t,τ}, and the number of pulls within the same window, namely T_{i,t,τ}, and set μ̂_{i,t,τ} = μ̃_{i,t,τ}/T_{i,t,τ}
11:  Update ν_{i,t+1} ← N(μ̂_{i,t,τ}, 1/(γ T_{i,t,τ})) for each i ∈ K
12:  Every τ rounds, play all the arms once in order to ensure T_{i,t,τ} > 0
13: end for

We analyse two sliding-window algorithms, namely Beta-SWTS, proposed in Trovò et al. [2020], and γ-SWGTS, introduced by Fiandri et al. [2024], both inspired by the classical Thompson Sampling algorithm. Similarly to SW-UCB, they face the problem posed by the non-stationarity of the rewards by exploiting only the subset of the most recently collected rewards (i.e., a window of size τ), in order to handle the bias introduced by the older rewards, which, in a non-stationary environment, may be non-representative of the real state of the system. We will characterize the performance by estimating E_τ[T_i(T)], i.e., the expected value of T_i(T) given a choice of τ, where T_i(T) is the random variable describing the total number of pulls of arm i up to the time horizon T.
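As an illustration of the two routines above, here is a minimal Python sketch; the simulation environment, the noise level, and all helper names are our assumptions, not the paper's. The window is kept as a deque of the last τ (arm, reward) pairs, so the per-arm window statistics are rebuilt at every round.

```python
import random
from collections import deque

def beta_swts(mu, tau, rng):
    """Beta-SWTS sketch: Thompson Sampling with a Beta posterior built only
    from the last `tau` (arm, Bernoulli reward) pairs.
    `mu[t][i]` is the expected reward of arm i at round t."""
    T, K = len(mu), len(mu[0])
    window = deque(maxlen=tau)           # stores (arm, reward) pairs
    pulls = [0] * K
    for t in range(T):
        s = [sum(r for a, r in window if a == i) for i in range(K)]  # window successes
        n = [sum(1 for a, _ in window if a == i) for i in range(K)]  # window pulls
        theta = [rng.betavariate(1 + s[i], 1 + n[i] - s[i]) for i in range(K)]
        arm = max(range(K), key=lambda i: theta[i])
        reward = 1.0 if rng.random() < mu[t][arm] else 0.0
        window.append((arm, reward))
        pulls[arm] += 1
    return pulls

def gamma_swgts(mu, tau, gamma, rng):
    """γ-SWGTS sketch: Gaussian posterior N(window mean, 1/(γ · window pulls)).
    Arms with an empty window are forced to be pulled, playing the role of
    the periodic replay of all arms in Algorithm 2."""
    T, K = len(mu), len(mu[0])
    window = deque(maxlen=tau)
    pulls = [0] * K
    for t in range(T):
        n = [sum(1 for a, _ in window if a == i) for i in range(K)]
        empty = [i for i in range(K) if n[i] == 0]
        if empty:                        # ensure T_{i,t,τ} > 0
            arm = empty[0]
        else:
            m = [sum(r for a, r in window if a == i) / n[i] for i in range(K)]
            theta = [rng.gauss(m[i], (1.0 / (gamma * n[i])) ** 0.5) for i in range(K)]
            arm = max(range(K), key=lambda i: theta[i])
        reward = mu[t][arm] + rng.gauss(0.0, 0.1)  # Gaussian, hence subgaussian, noise
        window.append((arm, reward))
        pulls[arm] += 1
    return pulls
```

With the noise level 0.1 used here, any γ ≤ 1 satisfies the condition γ ≤ min{1/(4σ²_var), 1} required by the analysis.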

Regret Analysis for a General Non-Stationary Restless Environment
We now analyse the cumulative regret of the previously introduced Thompson Sampling strategies in a generic non-stationary restless environment. We point out that the presented analysis does not make any assumption on the nature of the non-stationarity (e.g., abrupt or smoothly changing).

Definition 4.0.1 (F_τ, F_τ^A). For every τ ∈ ℕ, we define F_τ as any superset of F'_τ, defined as:
F'_τ := { t ∈ T : ∃ i ≠ i*(t) such that min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} ≤ max_{t' ∈ [t−τ, t−1]} μ_{i,t'} },
and we define the complementary set of F_τ as F_τ^A.

Notice that, by definition, for every t ∈ F_τ^A the following inequality holds true for all i ≠ i*(t):
min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} > max_{t' ∈ [t−τ, t−1]} μ_{i,t'}.
Intuitively, F_τ^A collects all the time instants t in which the optimal arm at t, i.e., i*(t), is such that its smallest expected reward within the last τ rounds is larger than the largest expected reward of any other arm in the same window. This makes it possible to introduce a more general definition of the suboptimality gaps, one that encapsulates how challenging it is for the algorithms to rank the arms relying on the inferential estimate of the past τ rewards.

Definition 4.0.2 (General suboptimality gap, ∆_τ). For every τ ∈ ℕ, we define the general suboptimality gap as:
∆_τ := min_{t ∈ F_τ^A} min_{i ≠ i*(t)} { min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} − max_{t' ∈ [t−τ, t−1]} μ_{i,t'} }.
Analogously to the definition of F_τ^A, the suboptimality gap ∆_τ quantifies the minimum non-zero distance in expected reward between the optimal arm i*(t) and all other arms, across all rounds t ∈ F_τ^A. We are now ready to present the results on the upper bound of the expected number of pulls for each algorithm.

Theorem 4.1 (General Analysis for Beta-SWTS). Under Assumption 3.1 and τ ∈ ℕ, for Beta-SWTS the following holds true for every arm i ∈ K:

Theorem 4.2 (General Analysis for γ-SWGTS). Under Assumption 3.2 and τ ∈ ℕ, for γ-SWGTS with γ ≤ min{1/(4σ²_var), 1}, the following holds true for every arm i ∈ K:

These results capture the intuition that, to guarantee that the algorithms learn, we must set the window in such a way that, on average, every possible realization within the window allows distinguishing the optimal arm from the suboptimal ones at the time we must make a play; using the introduced notation, this means that, given a choice of τ, only in the set F_τ^A are the algorithms surely able to make an informed decision. Figure 1 provides a graphical representation of the fact that selecting a large window, τ′ in the example, might lead to realizations in which the sub-optimal arm dominates the optimal one (in the first interval, within the dashed lines); conversely, selecting a proper window size τ, the average reward sampled within τ for the optimal arm is strictly larger than any possible average reward sampled within τ for the sub-optimal arm. We stress that the results hold for any arbitrary restless setting, e.g., the Rising Restless (Metelli et al. [2022]) or the Rotting Restless Bandits (Seznec et al. [2020]). Now we are ready to show which results these theorems imply for the most common non-stationary restless settings.
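Definitions 4.0.1 and 4.0.2 can be computed directly on a discretized sequence of expected rewards; the following sketch does exactly that (function names and the toy representation `mu[t][i]` are ours, not the paper's).

```python
def f_tau_sets(mu, tau):
    """Split rounds into F'_tau (the window may confuse the ranking) and its
    complement F_tau^A, following Definition 4.0.1. `mu[t][i]` is the expected
    reward of arm i at round t; windows cover rounds [t - tau, t - 1]."""
    T, K = len(mu), len(mu[0])
    confusing, informative = [], []
    for t in range(tau, T):          # consider rounds with a full window behind them
        star = max(range(K), key=lambda i: mu[t][i])           # i*(t)
        lo_star = min(mu[s][star] for s in range(t - tau, t))  # min over the window
        hi_other = max(mu[s][i] for s in range(t - tau, t)
                       for i in range(K) if i != star)         # max over the window
        (confusing if lo_star <= hi_other else informative).append(t)
    return confusing, informative

def general_gap(mu, tau):
    """General suboptimality gap Delta_tau (Definition 4.0.2): the smallest
    window-wise margin of the optimal arm over the rounds in F_tau^A."""
    _, informative = f_tau_sets(mu, tau)
    gaps = []
    for t in informative:
        K = len(mu[0])
        star = max(range(K), key=lambda i: mu[t][i])
        lo_star = min(mu[s][star] for s in range(t - tau, t))
        hi_other = max(mu[s][i] for s in range(t - tau, t)
                       for i in range(K) if i != star)
        gaps.append(lo_star - hi_other)
    return min(gaps) if gaps else None
```

On a piecewise-constant instance with one breakpoint, the rounds flagged as confusing are exactly the first τ rounds of the second phase, and ∆_τ is the phase gap.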
Corollaries for Abruptly Changing Restless Environments
• A breakpoint is a round t ∈ T such that i*(t) ≠ i*(t−1) (for the sake of the analysis, we will also consider t = T as a breakpoint, T being the time horizon), or a round t ∈ T such that there exists i ∈ K \ {i*(t−1)} satisfying μ_{i,t} ≥ μ_{i*(t),t−1}.
• The ψ-th breakpoint (i.e., the ψ-th smallest round t in which we have a breakpoint) determines the phase F_ψ as the set of rounds between the (ψ−1)-th and the ψ-th breakpoint. Formally, denoting with t_ψ the round of the ψ-th breakpoint (with the convention that t_0 = 1), we define F_ψ := { t ∈ T | t ∈ [t_{ψ−1}, t_ψ) }.
• The pseudophase for ψ > 1 is defined as F̃_ψ := { t ∈ T | t ∈ [t_{ψ−1} + τ, t_ψ) } (if τ ≥ t_ψ − t_{ψ−1} we have F̃_ψ = {}), and F̃_1 = F_1. Finally, F* := ∪_{ψ≥1} F̃_ψ.
• We denote the total number of breakpoints (excluding the one at time t = T) as Υ_T.
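The phase/pseudophase construction above can be sketched in a few lines of Python; for simplicity this implements only the first clause of the breakpoint definition (a change of the optimal arm), and all names are ours.

```python
def breakpoints(mu):
    """Rounds t where the identity of the optimal arm changes, plus (by the
    convention used in the analysis) the horizon T itself.
    `mu[t][i]` is the expected reward of arm i at round t."""
    T, K = len(mu), len(mu[0])
    best = [max(range(K), key=lambda i: mu[t][i]) for t in range(T)]
    return [t for t in range(1, T) if best[t] != best[t - 1]] + [T]

def phases_and_pseudophases(mu, tau):
    """Phases F_psi = [t_{psi-1}, t_psi) and pseudophases, i.e. the phases
    with their first tau rounds removed (except the first one), so that a
    window of size tau never straddles a breakpoint."""
    cuts = [0] + breakpoints(mu)
    phases, pseudo = [], []
    for psi in range(1, len(cuts)):
        start, end = cuts[psi - 1], cuts[psi]
        phases.append(list(range(start, end)))
        trim = 0 if psi == 1 else tau     # pseudophase F~_1 = F_1
        pseudo.append(list(range(min(start + trim, end), end)))
    return phases, pseudo
```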
In order to grasp the intuition behind the definition of the pseudophases, observe that, by definition (see also Figures 2, 3, 4), when sampling in a pseudophase within a window τ we will sample rewards belonging only to a single phase.

Assumption 5.1 (General Abruptly Changing Setting). For all ψ ∈ Υ_T + 1, the following holds true:

This assumption captures the intuition of the abruptly changing restless bandit setting: it ensures that, in every phase F_ψ, the optimal arm i*(t) does not change. Notice that, given Assumption 5.1, we can define F_τ as the union of the sets of rounds of length τ after every breakpoint, formally:

Consequently, we have F_τ^A = F*, and since at any round t belonging to any pseudophase, by its definition, the algorithms will use, within a time window τ, samples belonging only to a single phase, we have for any t ∈ F*:
min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} > max_{t' ∈ [t−τ, t−1]} μ_{i,t'} for all i ≠ i*(t).
The latter inequality follows from the fact that any round t ∈ F* belongs to a pseudophase F̃_ψ, and therefore all the rounds t' ∈ [t − τ, t − 1] belong to the single phase F_ψ, so that every reward sampled at those rounds allows the optimal arm to be properly distinguished from the suboptimal ones, using Assumption 5.1. By the definition of the general suboptimality gap given in the previous section, we have:
∆_τ = min_{t ∈ F_τ^A} min_{i ≠ i*(t)} { min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} − max_{t' ∈ [t−τ, t−1]} μ_{i,t'} }.
Notice that Assumption 5.1 is looser than the usual piece-wise constant setting. Indeed, every piece-wise constant restless bandit (Garivier and Moulines [2008]) respects the condition, but not every restless setting that respects the condition is piece-wise constant (as shown in Figure 4, which depicts a piece-wise constant abruptly changing environment, while Figures 2 and 3 represent arbitrary instances that satisfy Assumption 5.1). The definition of ∆_τ, if τ is such that no pseudophase is empty, matches the one given for ∆ in Garivier and Moulines [2008] in the case of a piece-wise constant restless setting. We are now ready to present the results on the upper bounds on the number of plays in the abruptly changing, restless environment.

Theorem 5.1 (Analysis for Beta-SWTS for Abruptly Changing Environments). Under Assumptions 3.1 and 5.1, τ ∈ ℕ, for Beta-SWTS the following holds:

Remark 5.1. Notice that, since the error suffered at each round cannot be greater than one, the upper bound on the expected cumulative dynamic regret can be written as:

retrieving the same order in terms of T, τ, and Υ_T derived by Trovò et al. [2020].

Theorem 5.2 (Analysis for γ-SWGTS for Abruptly Changing Environments). Under Assumptions 3.2 and 5.1, τ ∈ ℕ, for γ-SWGTS with γ ≤ min{1/(4σ²_var), 1}, it holds that:

Remark 5.2. Assuming there exists a finite M such that max_{t∈T, i∈K\{i*(t)}} { μ_{i*(t),t} − μ_{i,t} } ≤ M, the upper bound on the expected cumulative dynamic regret can be written as:

Remark 5.3. We notice that both Theorem 5.1 and Theorem 5.2 achieve the same performance as SW-UCB (Garivier and Moulines [2008], Theorem 7) in terms of T, τ, and Υ_T, and Theorem 5.2 achieves the same order also in terms of ∆_τ.

Corollaries for Smoothly Changing Restless Environments
We now study the implications of the statements about the general restless setting in the smoothly changing environments. First, we introduce the assumptions that characterize the environment.

Assumption 6.1 (Combes and Proutiere [2014], Trovò et al. [2020]). The expected reward of each arm varies smoothly over time, i.e., it is Lipschitz continuous. Formally, there exists σ < +∞ such that:
|μ_i(t) − μ_i(t')| ≤ σ|t − t'| for every t, t' ∈ T and i ∈ K.

Notice that the assumption of Combes and Proutiere [2014] is a particular case of the above assumption when β = 1. In the supplementary material, we show that, in a smoothly changing environment (Assumption 6.1), Assumption 6.2, introduced in the paper by Combes and Proutiere [2014], is a particular case of our general statement; in fact, it is possible to prove that F_{∆',T} is defined in a way that implies, for the set of times t ∈ F^A_{∆',τ}, that the following will surely hold true:
min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} > max_{t' ∈ [t−τ, t−1]} μ_{i,t'},
making it possible, using our notation, to set F_τ = F_{∆',τ} and to prove that ∆_τ = ∆' − 2στ. We are now ready to present the results on the upper bounds on the number of plays for the smoothly changing environment.

Theorem 6.1 (Analysis for Beta-SWTS for Smoothly Changing Environments). Under Assumptions 3.1, 6.1, and 6.2, for Beta-SWTS, it holds that:

Remark 6.1. Since the error suffered at every round cannot be greater than one, the upper bound on the expected cumulative dynamic regret at time T can be written as:

retrieving the same order in T and τ derived by Trovò et al. [2020].
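The relation ∆_τ = ∆' − 2στ can be checked numerically: under the Lipschitz assumption, every window value of the optimal arm is at least its instantaneous value minus στ, and every window value of a suboptimal arm is at most its instantaneous value plus στ. A small sketch (function names and the linear-drift example are ours):

```python
def window_gap_lower_bound(mu_star, mu_sub, tau, sigma, t):
    """Under |mu_i(t) - mu_i(t')| <= sigma * |t - t'|, the window-wise gap at
    round t is at least the instantaneous gap at t - 1 minus 2 * sigma * tau."""
    lo_star = min(mu_star(s) for s in range(t - tau, t))   # worst window value, optimal arm
    hi_sub = max(mu_sub(s) for s in range(t - tau, t))     # best window value, suboptimal arm
    inst_gap = mu_star(t - 1) - mu_sub(t - 1)
    # the Lipschitz bound from the text, checked numerically
    assert lo_star - hi_sub >= inst_gap - 2 * sigma * tau - 1e-12
    return lo_star - hi_sub
```

For example, with linear drifts of slope 0.001 the bound holds with equality up to the drift accumulated inside the window.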
Theorem 6.2 (Analysis for γ-SWGTS for Smoothly Changing Environments). Under Assumptions 3.2, 6.1, and 6.2, for γ-SWGTS with γ ≤ min{1/(4σ²_var), 1}, it holds that:

where:

Remark 6.2. Assuming there exists a finite M such that max_{t∈T, i∈K\{i*(t)}} { μ_{i*(t),t} − μ_{i,t} } ≤ M, the upper bound on the expected cumulative dynamic regret at time T can be written as:

Remark 6.3. Notice that the results we obtain in Theorem 6.1 and Theorem 6.2 are of the same order as those obtained in Theorem 5.1 of Combes and Proutiere [2014] in T and τ, and Theorem 6.2 matches even in terms of (∆' − 2στ).

Conclusions
We have characterized the performance of Thompson Sampling-inspired algorithms designed for non-stationary environments, namely Beta-SWTS and γ-SWGTS, in a general formulation of a newly characterized restless setting, inferring the underlying dynamics that regulate how these algorithms learn in any arbitrary environment, for both Bernoulli and Subgaussian rewards. Finally, we have tested how these general rules apply to two of the most common restless settings in the literature, namely the abruptly changing environment and the smoothly changing one, deriving upper bounds on the performance that are in line with the state-of-the-art analysis of sliding-window algorithms.

Supplementary Material

Theorem 4.1 (General Analysis for Beta-SWTS). Under Assumption 3.1 and τ ∈ ℕ, for Beta-SWTS the following holds true for every arm i ∈ K:

Proof. Let us define the two threshold quantities x_{i,t} and y_{i,t} for t ∈ F_τ^A (t being the time the policy-maker has to choose the arm) as:
max_{t' ∈ [t−τ, t−1]} μ_{i,t'} < x_{i,t} < y_{i,t} < min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'},
with ∆_{i,t,τ} = min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} − max_{t' ∈ [t−τ, t−1]} μ_{i,t'}; in the following analysis we will always consider:
x_{i,t} = max_{t' ∈ [t−τ, t−1]} μ_{i,t'} + ∆_{i,t,τ}/3,  y_{i,t} = min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} − ∆_{i,t,τ}/3.
Notice then that the following quantities attain their minima at those t ∈ F_τ^A such that ∆_{i,t,τ} = ∆_τ:
y_{i,t} − x_{i,t},  x_{i,t} − max_{t' ∈ [t−τ, t−1]} μ_{i,t'},  min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} − y_{i,t},
and, independently of the time t ∈ T at which this happens, they will always have the same value. We then refer to the minimum values the quantities above can attain over t ∈ F_τ^A as:

With the introduced thresholds, we can divide the analysis considering the following events:
• E^μ_i(t), the event for which μ̂_{i,t,τ} ≤ x_{i,t};
• E^θ_i(t), the event for which θ_{i,t,τ} ≤ y_{i,t}, where θ_{i,t,τ} denotes a sample generated for arm i from the posterior distribution at time t from the samples collected in the last τ plays, i.e., Beta(S_{i,t,τ} + 1, F_{i,t,τ} + 1), S_{i,t,τ} and F_{i,t,τ} being the number of successes and failures from t − τ up to round t (excluded) for arm i (note that T_{i,t,τ} = S_{i,t,τ} + F_{i,t,τ} and μ̂_{i,t,τ} = S_{i,t,τ}/T_{i,t,τ}, with μ̂_{i,t,τ} = 0 when T_{i,t,τ} = 0);
• p_{i,t}, which in this framework is defined as p_{i,t} = Pr(θ_{i*(t),t,τ} ≥ y_{i,t} | F_{t−1});
• we will assign an "error" equal to one for every t ∈ F_τ.
Moreover, let us denote by E^μ_i(t)^A and E^θ_i(t)^A the complementary events of E^μ_i(t) and E^θ_i(t), respectively. Let us focus on decomposing the probability term in the regret as follows:

Term A. We have:

where the inequality for the first term in (27) is due to Lemma 10.6, while the inequality for the summands in (28) follows from the Chernoff-Hoeffding bound, Lemma 10.1; in fact, as E^μ_i(t)^A is the event that μ̂_{i,t,τ} > x_{i,t}, we have that:

Term B. Let us focus on the summands of the term P_B of the regret. To this end, let (F_{t−1})_{t∈T} be the canonical filtration. We have:
≤ P( Beta(x_{i,t} T_{i,t,τ} + 1, (1 − x_{i,t}) T_{i,t,τ} + 1) > y_{i,t} )   (36)
≤ F^B_{T_{i,t,τ}, y_{i,t}}( x_{i,t} T_{i,t,τ} ) ≤ exp( −T_{i,t,τ} d(x_{i,t}, y_{i,t}) )   (37)
where for the last inequality we exploited the Pinsker inequality, and the penultimate inequality follows from the generalized Chernoff-Hoeffding bounds (Lemma 10.1) and the Beta-Binomial identity (Fact 3 of Agrawal and Goyal [2017]). Equation (35) was derived by exploiting the fact that, on the event E^μ_i(t), a sample from Beta(x_{i,t} T_{i,t,τ} + 1, (1 − x_{i,t}) T_{i,t,τ} + 1) is at least as likely to be large as a sample from Beta(μ̂_{i,t,τ} T_{i,t,τ} + 1, (1 − μ̂_{i,t,τ}) T_{i,t,τ} + 1), as reported formally in Fact 10.2. Therefore, for t such that T_{i,t,τ} > L_i(τ), where L_i(τ) := log(τ)/(2(y_i − x_i)²), we have:

We decompose into two events, when T_{i,t,τ} ≤ L_i(τ) and when T_{i,t,τ} ≥ L_i(τ); then:

where for the first term in (43) we exploited Lemma 10.6, as:

Term C. For this term, we shall use Lemma 2.8 of Agrawal and Goyal [2017]. Let us define p_{i,t} = P(θ_{i*(t),t,τ} > y_{i,t} | F_{t−1}). We have:

Thus, we can rewrite the term P_C as follows:

We will write this term as a sum of two contributions:

Exploiting the fact that E[XY] = E[X E[Y | X]], we can rewrite both A and B as:

Let us first tackle term A:

Let us evaluate what happens when C_1 holds true, as these are the only cases in which the summands within the summation in (57) are different from zero. Now consider an arbitrary
instantiation T'_{i*,t,τ} of T_{i*,t,τ} (i.e., an arbitrary number of pulls of the optimal arm within the time window τ) in which C_1 holds true; we can rewrite (*) as:

where the expected value E[·] is taken over all the values of T'_{i*,t,τ} that make C_1 true. As Lemma 10.7 states, any bound obtained for the stationary case on the term (*1) also holds true for the non-stationary case; we can then bound (*1) with Lemma 4 of Agrawal and Goyal [2012], using as the average reward for the best arm the smallest possible average reward within the time window τ (i.e., min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'}), which, as encoded by Lemma 10.7, is the worst-case scenario for the quantity under analysis. For ease of notation we will denote μ'_{i*} := min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'}:

where by definition ∆'_i := μ'_{i*} − y_{i,t}. We notice that, for every possible instantiation in which condition C_1 holds true, the expected value of (1 − p_{i,t})/p_{i,t} can be upper bounded by substituting the worst-case scenario for T'_{i*,t,τ} into the latter inequalities, obtaining a term independent of the pulls:

so that the inequality for A can be rewritten as:
A ≤ O( T ln(τ) / ( τ (μ_{i*,F_τ^A} − y_i)³ ) )   (61)
where we have exploited Lemma 10.6, which bounds the maximum number of times C_1 can be true within T rounds:

Facing now the term B:

Let us evaluate what happens when C_2 holds true, as these are the only cases in which the summands within the summation in (63) are different from zero. Let us now consider an arbitrary instantiation T'_{i*,t,τ} of T_{i*,t,τ} in which C_2 holds true (i.e., an arbitrary number of pulls of the optimal arm within the time window τ):

where the expected value E[·] is taken over all the values of T'_{i*,t,τ} that make C_2 true. Again, thanks to Lemma 10.7, we can bound the latter term (**1) using Lemma 4 of Agrawal and Goyal [2012] with the smallest expected reward the optimal arm can attain within the window τ. We see that the worst-case scenario when C_2 holds true is when T'_{i*,t,τ} = 8 ln(τ)/(μ_{i*,F_τ^A} − y_i)², so, considering the worst-case scenario for the case in which C_2 holds true, we can bound the expected value of (1 − p_{i,t})/p_{i,t} for every possible realization of C_2, independently of T'_{i*,t,τ}, as:

so that:

Summing all the terms yields the result.
Theorem 8.1 (General Analysis for γ-SWGTS). Under Assumption 3.2, τ ∈ ℕ, for γ-SWGTS with γ ≤ min{1/(4σ²_var), 1}, the following holds true for every arm i ∈ K:

Proof. Let us define x_{i,t} and y_{i,t} for t ∈ F_τ^A (t being the time the policy-maker has to choose the arm) as:
max_{t' ∈ [t−τ, t−1]} μ_{i,t'} < x_{i,t} < y_{i,t} < min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'},
with ∆_{i,t,τ} = min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} − max_{t' ∈ [t−τ, t−1]} μ_{i,t'}; in the following analysis we will always consider:
x_{i,t} = max_{t' ∈ [t−τ, t−1]} μ_{i,t'} + ∆_{i,t,τ}/3,  y_{i,t} = min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} − ∆_{i,t,τ}/3.
Notice then that the following quantities attain their minima at those t ∈ F_τ^A such that ∆_{i,t,τ} = ∆_τ:
y_{i,t} − x_{i,t},  x_{i,t} − max_{t' ∈ [t−τ, t−1]} μ_{i,t'},  min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} − y_{i,t},
and, independently of the time t ∈ T at which this happens, they will always have the same value. We then refer to the minimum values the quantities above can attain over t ∈ F_τ^A as:

With the introduced thresholds, we can divide the analysis considering the following events:
• E^μ_i(t), the event for which μ̂_{i,t,τ} ≤ x_{i,t};
• E^θ_i(t), the event for which θ_{i,t,τ} ≤ y_{i,t}, where θ_{i,t,τ} denotes a sample generated for arm i from the posterior distribution at time t, i.e., N(μ̂_{i,t,τ}, 1/(γ T_{i,t,τ})), T_{i,t,τ} being the number of trials of arm i at time t within the temporal window τ;
• p_{i,t}, which in this framework is defined as p_{i,t} = Pr(θ_{i*(t),t,τ} ≥ y_{i,t} | F_{t−1});
• we will assign an "error" equal to one for those t ∈ F_τ.
Notice that, by definition (as within the window we have at least one pull for each arm):

Moreover, let us denote by E^μ_i(t)^A and E^θ_i(t)^A the complementary events of E^μ_i(t) and E^θ_i(t), respectively. Let us focus on decomposing the probability term in the regret as follows:
... + error due to the round robin every τ rounds.   (72)

Term A. We have:

where in (75) we used Lemma 10.6 and in (76) we used the Chernoff-Hoeffding bound for subgaussian random variables, Lemma 10.5, remembering that γ ≤ min{1, 1/(4σ²_var)}; in fact, as E^μ_i(t)^A is the event that μ̂_{i,t,τ} > x_{i,t}, we have that:

Term B. We decompose each summand into two parts:

The first term is bounded by L_i(τ) T/τ due to Lemma 10.6. For the second term:

Now, θ_{i,t,τ} is an N(μ̂_{i,t,τ}, 1/(γ T_{i,t,τ}))-distributed Gaussian random variable. An N(m, σ²)-distributed r.v. (i.e., a Gaussian random variable with mean m and variance σ²) is stochastically dominated by an N(m', σ²)-distributed r.v. if m' ≥ m. Therefore, given μ̂_{i,t,τ} ≤ x_{i,t}, the distribution of θ_{i,t,τ} is stochastically dominated by N(x_{i,t}, 1/(γ T_{i,t,τ})). That is,

Using Lemma 10.4, with L_i(τ) = log(τ)/(γ(y_i − x_i)²), and substituting, we get:

Summing over t ∈ F_τ^A, we get an upper bound of

Term C. For this term, we shall use Lemma 2.8 of Agrawal and Goyal [2017]. Let us define p_{i,t} = P(θ_{i*,t,τ} > y_{i,t} | F_{t−1}). We have:

Thus, we can rewrite the term P_C as follows:

We will decompose the latter inequality into two contributions:

where L_i(τ) = 288 log(τ ∆²_τ + e⁶)/(γ ∆²_τ). Let us first tackle term A, exploiting the fact that E[XY] = E[X E[Y | X]]; we can rewrite it as:

Let us evaluate what happens when C_1 holds true, as these are the only cases in which the summands within the summation in (96) are different from zero. We will show that, whenever condition C_1 holds true, (*) is bounded by a constant: for any realization of the number of pulls within a time window τ such that condition C_1 holds true (i.e., a number of pulls j of the optimal arm within the time window smaller than L_i(τ)), the expected value of G_j is bounded by a constant, for all j defined as below.

Let Θ_j denote an N(μ̂_{i*(t),j}, 1/(γj))-distributed Gaussian random variable, where μ̂_{i*(t),j} is the sample mean of the optimal arm's rewards when it has been played j times within a time window τ at time t. Let G_j be the geometric random variable denoting the number of consecutive independent trials until, and including, the trial where a sample of Θ_j becomes greater than y_{i,t}. Consider now an arbitrary realization of T_{i*,t,τ}, namely T_{i*,t,τ} = j, respecting condition C_1; then observe that p_{i,t} = Pr(Θ_j > y_{i,t} | F_{τ_j}) and:

where by E_{j|C_1}[·] we denote the expected value taken over every j respecting condition C_1. Consider any integer r ≥ 1. Let z = √(ln r) and let the random variable MAX_r denote the maximum of r independent samples of Θ_j. We abbreviate μ̂_{i*(t),j} to μ̂_{i*}, and we will abbreviate min_{t' ∈ [t−τ, t−1]} μ_{i*(t),t'} as μ_{i*} in the following. Then, for any integer r ≥ 1:
P(G_j ≤ r) ≥ P(MAX_r > y_{i,t})   (98)
For any instantiation F_{τ_j} of F_{τ_j}, since Θ_j is a Gaussian N(μ̂_{i*}, 1/(γj))-distributed r.v., using Lemma 10.3 this gives:
= 1 − ( 1 − (1/√(2π)) · (√(ln r)/(ln r + 1)) · (1/√r) )^r
For r ≥ e¹²:

Substituting, we obtain:

Applying Lemma 10.5 to the second term, we can write:

where the last inequality follows since, by definition, we will always have μ_{i*} − E[μ̂_{i*}] ≤ 0.

Using y_{i,t} ≤ μ_{i*}, this gives:

Substituting all back, we obtain:

This shows a constant bound, independent of j, on E[1/p_{i,t} − 1] for any possible arbitrary j such that condition C_1 holds true. Then A can be rewritten as:

where in the last inequality we exploited Lemma 10.6, which bounds the maximum number of times C_1 can hold true within T rounds:

Let us now tackle B, yet again exploiting the fact that E[XY] = E[X E[Y | X]]:

Let us evaluate what happens when C_2 holds true, as these are the only cases in which the summands within the summation in (119) are different from zero. We derive a bound for (**) for large j, as imposed by condition C_2. Consider then an arbitrary instantiation in which T_{i*,t,τ} = j ≥ L_i(τ) (as dictated by C_2):

where by E_{j|C_2}[·] we denote the expected value taken over every j respecting condition C_2. Given any r ≥ 1, define G_j, MAX_r, and z = √(ln r) as earlier.

Then, since Θ_j is an N(μ̂_{i*,j}, 1/(γj))-distributed random variable, using the upper bound in Lemma 10.4, we obtain, for any instantiation F_{τ_j} of the history F_{τ_j}, with j ≥ L_i(τ). This implies:

Also, for any t such that condition C_2 holds true, we have j ≥ L_i(τ), and using Lemma 10.

where the last inequality of (129) follows from the fact that:

where the last inequality follows since, by definition, we will always have μ_{i*} − E[μ̂_{i*}] ≤ 0.

Define T' := (τ ∆²_τ + e⁶)². Therefore, for 1 ≤ r ≤ T':
P(G_j ≤ r) ≥ 1 − (1/2^r)(T')^{r/2 − 1} − 1/(T')^8   (134)
When r ≥ T' ≥ e¹², we obtain:

Combining all the bounds, we have derived a bound independent of j:

So that:

The statement follows by summing all the terms.
Theorem 5.1 (Analysis for Beta-SWTS for Abruptly Changing Environments).Under Assumptions 3.1 and 5.1, τ P N, for Beta-SWTS the following holds: Proof.The proof follows by defining F τ as the set of times of length τ after every breakpoint, and noticing that by definition of the general abruptly changing setting we have for any t P F A τ , as we have demonstrated in the main paper, that: Theorem 5.2 (Analysis for γ-SWGTS for Abruptly Changing Environments).Under Assumptions 3.2 and 5.1, τ P N, for γ-SWGTS with γ ď mint 1 4σ 2 var , 1u it holds that: Proof.The proof, yet again, follows by defining F τ as the set of times of length τ after every breakpoint, and noticing that by definition of the general abruptly changing setting we have for any t P F A τ , as we have demonstrated in the main paper, that: min Theorem 6.1 (Analysis for Beta-SWTS for Smoothly Changing Environments).Under Assumptions 3.1, 6.1, and 6.2 for Beta-SWTS, it holds that: Proof.In order to derive the bound we will assign "error" equal to one for every t P F ∆ 1 ,T and we will study what happens in F A ∆ 1 ,T .Notice that by definition of F A ∆ 1 ,T we will have that @i ‰ i ˚ptq: µ i ˚ptq,t´1 ´µi,t´1 ě ∆ 1 ą 2στ.
Substituting we obtain: min t 1 Prt´1,t´τ s tµ i ˚ptq,t 1 u ´max t 1 Prt´1,t´τ s tµ i,t 1 u ě µ i ˚ptq,t´1 ´στ ´µi,t´1 ´στ, so that due to the introduced assumptions we have: Notice that is the assumption for the general theorem so we will have that F A ∆ 1 ,T " F A τ , this yields to the desired result noticing that by definition ∆ τ " ∆ 1 ´2στ .Theorem 6.2 (Analysis for γ-SWGTS for Smoothly Changing Environments).Under Assumptions 3.2, 6.1, and 6.2, for γ-SWGTS with γ ď mint 1 4σ 2 var , 1u, it holds that: where: Proof.In order to derive the bound we will assign "error" equal to one for every t P F ∆ 1 ,T and we will study what happens in F A ∆ 1 ,T , i.e. the set of times t P T such that t R F ∆ 1 ,T .Notice that by definition of F A ∆ 1 ,T we will have that @i ‰ i ˚ptq: µ i ˚ptq,t´1 ´µi,t´1 ě ∆ 1 ą 2στ.
Using the Lipschitz assumption, we can infer that for $i \neq i^*(t)$: $\min_{t' \in [t-\tau, t-1]} \mu_{i^*(t), t'} - \max_{t' \in [t-\tau, t-1]} \mu_{i, t'} \ge \mu_{i^*(t), t-1} - \sigma\tau - \mu_{i, t-1} - \sigma\tau$, so that, due to the introduced assumptions, we have: Notice that this is the assumption of the general theorem, so that $\mathcal{F}^A_{\Delta', T} = \mathcal{F}^A_\tau$; this yields the desired result, noticing that by definition $\Delta_\tau = \Delta' - 2\sigma\tau$.

9 Errors from the paper by Trovò et al. [2020]

Rewriting Eq. 18 to Eq. 21 from Trovò et al. [2020]: notice that the sum is bounded using Lemma 10.6, implying that the event $\{\cdot\}$ in $\mathbb{1}\{\cdot\}$ is: however, notice that the separation of the event used by the authors (following the line of proof of Kaufmann et al. [2012]) in Eq. 12 to Eq. 16 of Trovò et al. [2020]: is such that the event $\{\cdot\}$ is given by: thus making the derived inequality incorrect. The same error is also made in the following equations (Eq. 70 to Eq. 72 of Trovò et al. [2020]): where notice that, yet again, $\sum_{t \in \mathcal{F}_{\Delta_C, N}} \mathbb{P}\left(T_{i_t, t, \tau} \le n_A\right)$ has been wrongly bounded by $n_A \lceil N/\tau \rceil$.
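To make the algorithm under analysis concrete, the following is a minimal Python sketch of sliding-window Thompson Sampling with Beta priors in the spirit of Beta-SWTS: success and failure counts are computed only over the last $\tau$ pulls, so observations from before a breakpoint are eventually discarded. The environment interface and all identifiers are our own illustrative choices, not the paper's notation.

```python
import random
from collections import deque

def beta_swts(env, n_arms, horizon, tau, rng):
    """Sliding-window Thompson Sampling with Beta(1, 1) priors.

    env(t, i) -> Bernoulli reward in {0, 1} of arm i at round t.
    Posterior counts are taken only over the last `tau` pulls.
    """
    window = deque(maxlen=tau)  # last tau (arm, reward) pairs
    pulls = []
    for t in range(horizon):
        # Per-arm success/failure counts inside the sliding window.
        succ = [0] * n_arms
        fail = [0] * n_arms
        for arm, r in window:
            if r:
                succ[arm] += 1
            else:
                fail[arm] += 1
        # Thompson step: sample from each windowed posterior, pull the argmax.
        samples = [rng.betavariate(succ[i] + 1, fail[i] + 1)
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        window.append((arm, env(t, arm)))
        pulls.append(arm)
    return pulls
```

With $\tau = \infty$ this reduces to standard Thompson Sampling; a finite $\tau$ trades accuracy within a stationary phase for reactivity after a breakpoint, mirroring the $\tau$-dependent terms in the bounds above.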

Auxiliary Lemmas
In this section, we report some results that already exist in the bandit literature and that have been used to demonstrate our results.

Lemma 10.1 (Generalized Chernoff-Hoeffding bound from Agrawal and Goyal [2017]). Let $X_1, \ldots, X_n$ be independent Bernoulli random variables with $\mathbb{E}[X_i] = p_i$; consider the random variable $X = \frac{1}{n}\sum_{i=1}^n X_i$, with $\mu = \mathbb{E}[X]$. For any $0 < \lambda < 1 - \mu$ we have: $\mathbb{P}(X \ge \mu + \lambda) \le \exp\left(-n\, d(\mu + \lambda, \mu)\right)$, and for any $0 < \lambda < \mu$: $\mathbb{P}(X \le \mu - \lambda) \le \exp\left(-n\, d(\mu - \lambda, \mu)\right)$, where $d(a, b) := a \ln\frac{a}{b} + (1-a)\ln\frac{1-a}{1-b}$.

Lemma 10.2 (Beta-Binomial identity). For all positive integers $\alpha, \beta \in \mathbb{N}$, the following equality holds: $F^{\mathrm{beta}}_{\alpha,\beta}(y) = 1 - F^{B}_{\alpha+\beta-1, y}(\alpha - 1)$, where $F^{\mathrm{beta}}_{\alpha,\beta}(y)$ is the cumulative distribution function of a Beta distribution with parameters $\alpha$ and $\beta$, and $F^{B}_{\alpha+\beta-1,y}(\alpha-1)$ is the cumulative distribution function of a Binomial variable with $\alpha + \beta - 1$ trials, each with success probability $y$.
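Lemma 10.2 is easy to verify numerically. The stdlib-only sketch below (our own check, not from the paper) computes the Beta CDF by trapezoidal integration of its density and compares it with the complementary Binomial CDF; the parameter values are arbitrary.

```python
import math

def beta_cdf(y, a, b, steps=200_000):
    """CDF of Beta(a, b) at y via trapezoidal integration of the density."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    h = y / steps
    total = 0.0
    for k in range(steps + 1):
        x = k * h
        w = 0.5 if k in (0, steps) else 1.0  # trapezoid endpoint weights
        total += w * x ** (a - 1) * (1 - x) ** (b - 1)
    return norm * total * h

def binom_cdf(k, n, p):
    """P(Binomial(n, p) <= k)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

# Lemma 10.2: F_beta(y; a, b) = 1 - F_binom(a - 1; a + b - 1, y) for integer a, b.
a, b, y = 3, 5, 0.4
lhs = beta_cdf(y, a, b)
rhs = 1 - binom_cdf(a - 1, a + b - 1, y)
```

This identity is what lets Thompson-Sampling analyses translate tail bounds on Beta posterior samples into Binomial concentration bounds.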
Lemma 10.3 (Abramowitz and Stegun [1968], Formula 7.1.13). Let $Z$ be a Gaussian random variable with mean $\mu$ and standard deviation $\sigma$; then:

Lemma 10.4 (Abramowitz and Stegun [1968]). Let $Z$ be a Gaussian random variable with mean $m$ and standard deviation $\sigma$; then: $\frac{1}{4\sqrt{\pi}} e^{-7z^2/2} < \mathbb{P}(|Z - m| > z\sigma) \le \frac{1}{2} e^{-z^2/2}$. (158)

Lemma 10.5 (Rigollet and Hütter [2023], Corollary 1.7). Let $X_1, \ldots, X_n$ be $n$ independent random variables such that $X_i \sim \mathrm{subG}(\sigma^2)$; then, for any $a \in \mathbb{R}^n$, we have: Of special interest is the case $a_i = 1/n$ for all $i$, in which the average $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ satisfies $\mathbb{P}(\bar{X} > t) \le e^{-\frac{nt^2}{2\sigma^2}}$, where the equality holds if and only if the success probabilities $p_1 = \cdots = p_n$ of the Poisson-Binomial distribution are all equal to the success probability $p$ of the Binomial.

Lemma 10.7 (Fiandri et al. [2024], Lemma 4.1 (Technical Lemma)). Let us define $p_{i,t}$, for any $y_i \in (0,1)$, as: $p_{i,t} := \mathbb{P}\left(\mathrm{Beta}(S_{i^*,t} + 1, F_{i^*,t} + 1) > y_{i,t} \mid \mathcal{F}_{t-1}\right)$, where $S_{i^*,t}$ is the random variable, characterized by either a Binomial or a Poisson-Binomial distribution, describing the number of successes of the stochastic process, $F_{i^*,t} = N_{i^*,t} - S_{i^*,t}$ is the number of failures, and $\mathcal{F}_{t-1}$ is the filtration of the history up to time $t-1$. Let $\mathrm{PB}(\boldsymbol{\mu}_{i^*}(j))$ be a Poisson-Binomial distribution (that is, the distribution describing the number of successes of a certain number of Bernoulli trials, each with a different probability of success) with individual means $\boldsymbol{\mu}_{i^*}(j) = (\mu_{i^*}(1), \ldots, \mu_{i^*}(j))$, and let $\mathrm{Bin}(j, x)$ be a Binomial distribution with an arbitrary number $j$ of trials and success probability $x \le \bar{\mu}_{i^*}(j) = \frac{\sum_{l=1}^{j} \mu_{i^*}(l)}{j}$. For any $N_{i^*,t} = j$ and $y_i \in (0,1)$, it holds that:
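The averaged form of Lemma 10.5 can be sanity-checked by simulation. The sketch below (assumptions ours: standard Gaussians as the $\sigma^2 = 1$ subgaussian family, and illustrative values of $n$ and $t$) estimates the tail probability of the empirical mean by Monte Carlo and compares it with the bound $e^{-nt^2/(2\sigma^2)}$.

```python
import math
import random

def mean_tail_estimate(n, t, trials, rng):
    """Monte Carlo estimate of P(mean of n i.i.d. N(0, 1) draws > t)."""
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if xbar > t:
            hits += 1
    return hits / trials

n, t, sigma2 = 25, 0.5, 1.0
# Subgaussian bound on P(Xbar > t) from Lemma 10.5 with a_i = 1/n.
bound = math.exp(-n * t * t / (2 * sigma2))
```

The empirical tail is well below the bound here, as expected: for Gaussians the subgaussian inequality is loose at this scale, since the true tail decays like $e^{-nt^2/2}$ divided by an additional polynomial factor.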

Figure 1: The intuition behind the general regret analysis.
Lemma 10.6 (Combes and Proutiere [2014], Lemma D.1). Let $A \subset \mathbb{N}$, and let $\tau \in \mathbb{N}$ be fixed. Define $a(n) = \sum_{t=n-\tau}^{n-1} \mathbb{1}(t \in A)$. Then, for all $T \in \mathbb{N}$ and $s \in \mathbb{N}$, we have the inequality: $\sum_{n=1}^{T} \mathbb{1}(n \in A,\, a(n) \le s) \le s \lceil T/\tau \rceil$. (161)