Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning

Offline reinforcement learning -- learning a policy from a batch of data -- is known to be hard for general MDPs. These results motivate the need to look at specific classes of MDPs where offline reinforcement learning might be feasible. In this work, we explore a restricted class of MDPs to obtain guarantees for offline reinforcement learning. The key property, which we call Action Impact Regularity (AIR), is that actions primarily impact a part of the state (an endogenous component) and have limited impact on the remaining part of the state (an exogenous component). AIR is a strong assumption, but it nonetheless holds in a number of real-world domains including financial markets. We discuss algorithms that exploit the AIR property, and provide a theoretical analysis for an algorithm based on Fitted-Q Iteration. Finally, we demonstrate that the algorithm outperforms existing offline reinforcement learning algorithms across different data collection policies in simulated and real world environments where the regularity holds.


Introduction
Offline reinforcement learning (RL) involves using a previously collected static dataset, without any online interaction, to learn an output policy. This problem setting is important for a variety of real-world problems where learning online can be dangerous, such as for self-driving cars, or when building a good simulator is difficult or costly, such as for healthcare. It is also a useful setting for applications where there is a large amount of data available, such as dynamic search advertising.
A challenge in offline RL is that the quality of the output policy can be highly dependent on the data. Most obviously, the data might not cover some parts of the environment, resulting in two issues. The first is that the learned policy, when executed in the environment, is likely to deviate from the behavior that generated its training data and reach a state-action pair that was unseen in the dataset. For these unseen state-action pairs, the algorithm has no information about how to choose a good action. The second issue is that if the dataset does not contain transitions in the high-reward regions of the state-action space, it may be impossible for any algorithm to return a good policy. One can easily construct a family of MDPs with missing data such that no algorithm can identify the true MDP, and hence any algorithm suffers a large suboptimality gap (Chen & Jiang, 2019).
The works that provide guarantees on the suboptimality of the output policy usually rely on strong assumptions about good data coverage and mild distribution shift. The theoretical results are for methods based on approximate value iteration (AVI) and approximate policy iteration (API), with results showing the output policy is close to an optimal policy (Farahmand, Szepesvári, & Munos, 2010; Munos, 2003, 2005, 2007). These analyses assume a small concentration coefficient, which bounds the ratio between the state-action distribution induced by any policy and the data distribution (Munos, 2003). However, the concentration coefficient can be very large or even infinite in practice, so assuming a small concentration coefficient can be unrealistic for many real world problems. Different measures of distribution shift are also used; for example, Yin, Bai, and Wang (2021) assume the visitation probability of the least occupied state-action pair is greater than zero.
To avoid making strong assumptions on the concentration coefficient, several works consider constraining the divergence between the behavior policy and the output policy in the policy improvement step of API algorithms. The constraints can be enforced either as direct policy constraints or by a penalty added to the value function (Levine, Kumar, Tucker, & Fu, 2020; Wu, Tucker, & Nachum, 2019). Another approach, for AVI algorithms, is to constrain the policy set so that updates only use actions or state-action pairs with sufficient data coverage (Kumar, Fu, Soh, Tucker, & Levine, 2019; Liu, Swaminathan, Agarwal, & Brunskill, 2020). However, these methods only work when the data collection policy covers an optimal policy (see our discussion in Section 3), which can be difficult or impossible to guarantee.
Another direction has been to assume pessimistic values for unknown state-action pairs, to encourage the agent to learn an improved policy that stays within the parts of the space covered by the data. CQL (Kumar, Zhou, Tucker, & Levine, 2020) penalizes the values of out-of-distribution actions and learns a lower bound on the value estimates. A related idea is to constrain the bootstrap target to avoid out-of-distribution actions, introduced first in the BCQ algorithm (Fujimoto, Meger, & Precup, 2019) with practical improvements given by IQL (Kostrikov, Nair, & Levine, 2022). MOReL (Kidambi, Rajeswaran, Netrapalli, & Joachims, 2020) learns a model and an unknown state-action detector to partition states, similar to the R-Max algorithm (Brafman & Tennenholtz, 2002), but then uses the principle of pessimism for these unknown states rather than optimism. Safe policy improvement methods (Thomas, Theocharous, & Ghavamzadeh, 2015; Laroche, Trichelair, & Des Combes, 2019) rely on a high-confidence lower bound on the output policy performance, performing policy improvement only when the performance is higher than a threshold.
In practice, the results of pessimistic approaches are mixed. Some have been shown to be effective on the D4RL dataset (Fu, Kumar, Nachum, Tucker, & Levine, 2020). Other results, however, show that these methods can be too conservative and fail drastically when the behavior policy is not near-optimal (Kumar et al., 2020; Liu et al., 2020), as we reaffirm in our experiments. Further, actually using these pessimistic methods can be difficult, as their hyperparameters are not easy to tune in the offline setting (Wu et al., 2019) and some methods require an estimate of the behavior policy or the data distribution (Kumar et al., 2019; Liu et al., 2020).
Intuitively, however, there are settings where offline RL should be effective. Consider a trading agent in a stock market. A policy that merely observes stock prices and volumes, without buying or selling any shares, provides useful information about the environment. From such a dataset, an offline agent can counterfactually reason about the utility of many different actions, as demonstrated in Figure 1, because its actions have limited impact on the prices and volumes. Such MDPs, which are called Exogenous MDPs, have states that separate into exogenous states (stock prices), not impacted by actions, and endogenous states (number of shares owned by the agent).

Figure 1: In the stock market example, the exogenous state corresponds to the stock price, the endogenous state corresponds to the number of shares the agent has, and the action corresponds to the number of shares to buy or sell at each time step. Given an observed exogenous trajectory, the agent can counterfactually reason about the outcomes of different actions and endogenous states.

The structure in Exogenous MDPs has been used in online RL to learn more efficiently (Dietterich, Trimponias, & Chen, 2018). This exogenous structure, however, has yet to be formally investigated for offline RL, though it is likely already being exploited in industry. Exploiting this structure is natural in applied financial applications, because datasets allow for alternative trajectories to be simulated, as described in the above example. One (unpublished) system uses RL and trajectory simulation for the optimal order execution problem (Burhani, Ding, Hernandez-Leal, Prince, Shi, & Szeto, 2020); it seems likely that there are other such systems in use.
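This counterfactual reasoning can be sketched in a few lines. The following toy example replays a logged price series (the exogenous trajectory) and simulates the endogenous state under an alternative policy; the function and policy interface are hypothetical, purely for illustration.

```python
def evaluate_trading_policy(prices, policy, cash=0.0, shares=0):
    """Counterfactually evaluate a trading policy on a logged price trajectory.

    Because an individual trader barely moves the market (the AIR property),
    the logged exogenous prices can be replayed unchanged while the endogenous
    state (shares, cash) is simulated. policy(price, shares) returns the
    number of shares to buy (negative = sell). Returns final wealth,
    marking remaining shares to the last observed price."""
    for price in prices:
        trade = policy(price, shares)
        shares += trade
        cash -= trade * price  # buying costs cash, selling earns it
    return cash + shares * prices[-1]
```

For instance, on the logged prices [1.0, 2.0], a policy that buys one share below 1.5 and otherwise liquidates earns a wealth of 1.0, even though the logged behavior policy may never have traded at all.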
What has yet to be done, however, is to understand the theoretical properties of such algorithms, as well as potential algorithmic improvements.
In this paper, we first generalize the definition of exogenous MDPs and formalize the action impact regularity (AIR) property. We say an MDP has the AIR property, or is ε-AIR, if the actions have a limited impact on the exogenous dynamics, with the level of impact determined by ε ≥ 0. This generalizes the previous definition, which required strict separation, corresponding to ε = 0. We develop both theory and algorithms for this more general setting, assuming access only to an offline dataset, an approximate (learned) endogenous model and the reward function. We design an efficient algorithm, called FQI-AIR, to exploit the AIR property, that (1) does not require an estimate of the behavior policy or the data distribution, (2) has a straightforward approach to select hyperparameters using just the given data and (3) is much less sensitive to the quality of the data collection policy, if our assumptions hold. This algorithm is a simple extension of FQI, but is significantly more computationally efficient than the trajectory simulation approach mentioned above, and allows us to leverage and extend the existing theory for FQI. We bound the suboptimality of the output policy from FQI-AIR in terms of ε and other standard terms (model errors and the inherent Bellman error; see Section 5.2). Importantly, in place of the concentration coefficient, we have a term that depends on the size of the endogenous state space and the number of actions; when the concentration coefficient is bounded and small, it is on the same order as this term.
We then conduct a comprehensive empirical study of FQI-AIR. We compare several algorithms in two simulated environments, across three different data collection policies, with varying offline dataset sizes, for ε = 0 (assumption perfectly satisfied) and a larger ε (assumption somewhat violated). FQI-AIR significantly outperforms the offline RL algorithms that do not leverage the AIR property, including FQI, MBS-QI, CQL and IQL; this outcome is expected, but nonetheless verifies that exploiting the AIR property, when appropriate, can have a big benefit. We show that these conclusions extend to two environments based on real-world datasets (for bitcoin trading and for controlling battery usage in a hybrid car). An important detail here is how hyperparameters are chosen. FQI-AIR can exploit AIR for policy evaluation, to automatically select hyperparameters. For the other algorithms, we do not have such an approach, and instead report idealized performance by picking hyperparameters based on performance in the environment.
Finally, these results all used the true endogenous model for FQI-AIR. We chose to do so partly because the endogenous model is known for certain AIR-MDPs (e.g., trading, inventory management) and partly to focus the investigation on the role of ε rather than model error. However, for certain AIR-MDPs, we will not have access to the true endogenous model. In our final experiment, we investigated the impact of using a learned endogenous model in the hybrid car environment, and show that FQI-AIR remains effective.
All of this is only possible because we make a strong assumption about the environment. However, given the hardness results in offline RL, we should acknowledge that we likely need to restrict the class of MDPs. This work is a step towards understanding for what classes of MDPs offline RL is feasible. At the same time, though we consider a restricted setting, it is by no means a trivial setting. There are many real-world examples where this regularity holds (as we discuss later in this work). This is doubly true given that our generalization provides some flexibility in violating the assumption: the regularity only needs to hold approximately rather than exactly. The algorithms and theory developed here can benefit these real-world applications now, by providing an approach that is well-designed and well-behaved, with strong theoretical guarantees for their specific problem setting.

Problem Formulation
The agent-environment interaction is formalized as a finite horizon Markov decision process (MDP) M = (S, A, P, r, H, ν). S is a set of states, and A is a set of actions; for simplicity, we assume that both sets are finite. P : S × A → ∆(S) is the transition probability, where ∆(S) is the set of probability distributions on S and, slightly abusing notation, we write P(s, a, s′) for the probability that the process transitions into state s′ when taking action a in state s. The function r : S × A → [0, r_max] gives the reward when taking action a in state s, where r_max ∈ R^+. H ∈ Z^+ is the planning horizon, and ν ∈ ∆(S) is the initial state distribution.
In the finite horizon setting, policies are non-stationary. A non-stationary policy is a sequence of memoryless policies (π_0, ..., π_{H−1}), where π_h : S → ∆(A). We assume that the sets of states reachable at each time step h, S_h ⊂ S, are disjoint. This is without loss of generality, because we can always define a new state space S′ = S × [H − 1], where [n] := {0, 1, 2, ..., n}. Then it is sufficient to consider stationary policies π : S → ∆(A).
Given a policy π, h ∈ [H − 1], and (s, a) ∈ S × A, we define the value function and the action-value function as

v^π_h(s) := E[ Σ_{t=h}^{H−1} r(S_t, A_t) | S_h = s ],   q^π_h(s, a) := E[ Σ_{t=h}^{H−1} r(S_t, A_t) | S_h = s, A_h = a ],

where the expectation is with respect to P^π_M (we may drop the subscript M when it is clear from the context). P^π is the probability measure on the random element in (S × A)^H induced by the policy π and the MDP such that, for the trajectory of state-action pairs (S_0, A_0, ..., S_{H−1}, A_{H−1}), we have P^π(S_0 = s) = ν(s), P^π(A_t = a | S_0, A_0, ..., S_t) = π(a | S_t), and P^π(S_{t+1} = s′ | S_0, A_0, ..., S_t, A_t) = P(S_t, A_t, s′) for t ≥ 0 (Lattimore & Szepesvári, 2020). The optimal value function is defined by v*_h(s) := sup_π v^π_h(s), and the Bellman operator T is defined by

(T q)(s, a) := r(s, a) + Σ_{s′ ∈ S} P(s, a, s′) max_{a′ ∈ A} q(s′, a′).

In the batch setting, we are given a fixed set of transitions D with samples drawn from a data distribution µ over S × A. In this paper, we consider the setting where the data is collected by a data collection policy π_b. That is, D consists of N trajectories (S_0, A_0, ..., S_{H−1}, A_{H−1}) induced by the interaction of the policy π_b and the MDP M. A representative algorithm for the batch setting is Fitted Q Iteration (FQI) (Ernst, Geurts, & Wehenkel, 2005). In the finite horizon setting, FQI learns an action-value function for each time step, q_0, ..., q_{H−1} ∈ F, where F ⊆ R^{S×A} is the value function class. The algorithm is defined recursively from the end of the episode: with q_H = 0, for each time step h from H − 1 to 0,

q_h = argmin_{q ∈ F} || q − T̂ q_{h+1} ||²_2,

where T̂ is defined by replacing the expectation in the Bellman operator T with sample averages over D. The output policy is obtained by greedifying according to these action-value functions.
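As a concrete illustration of finite-horizon FQI, here is a minimal tabular sketch, where the least-squares fit reduces to averaging empirical Bellman targets per state-action pair. The dataset encoding is an assumption made for this sketch.

```python
import numpy as np

def fqi(dataset, n_states, n_actions, horizon):
    """Finite-horizon FQI with a tabular 'function class'.

    dataset: list of (h, s, a, r, s_next) transitions with integer states.
    Returns q[h][s, a] for h = 0..H-1 (q_H = 0 implicitly)."""
    q = [np.zeros((n_states, n_actions)) for _ in range(horizon + 1)]
    for h in range(horizon - 1, -1, -1):
        # Accumulate empirical Bellman targets r + max_a' q_{h+1}(s', a').
        targets = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        for (t, s, a, r, s_next) in dataset:
            if t != h:
                continue
            targets[s, a] += r + q[h + 1][s_next].max()
            counts[s, a] += 1
        seen = counts > 0
        # Tabular least squares = average target per (s, a); unseen pairs stay 0.
        q[h][seen] = targets[seen] / counts[seen]
    return q[:horizon]
```

Note that unseen state-action pairs keep value zero, which is precisely the data-coverage issue discussed above: FQI has no information about them.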

Common Assumptions for Offline RL
In this section we discuss common assumptions used in offline RL. The most common assumptions typically concern properties of the data distribution and the MDP together: either that the data sufficiently covers the set of possible transitions, or that it sufficiently covers a near-optimal policy. However, these assumptions are often impractical for many real world applications. Furthermore, recent results show that restrictions on the data distribution alone are insufficient to obtain guarantees. This motivates considering realistic assumptions on the MDP alone, as we do in this work.
Sufficient coverage of the data distribution has primarily been quantified by concentration coefficients (Munos, 2003). Given a data distribution µ, the concentration coefficient C is defined to be the smallest value such that, for any policy π and any h ∈ [H − 1],

P^π(S_h = s, A_h = a) ≤ C µ(s, a)   for all (s, a) ∈ S × A.

If µ(s, a) = 0 for some reachable (s, a), then we define C = ∞ by convention. Several results bound the suboptimality of batch API and AVI algorithms in terms of the concentration coefficient (Chen & Jiang, 2019; Farahmand et al., 2010; Munos, 2003, 2007). For example, it has been shown that FQI outputs a near-optimal policy when C is small and the value function class is rich enough, where the upper bound on the suboptimality of the output policy scales linearly with √C. However, in practice, the concentration coefficient can be very large or even infinite. For example, if the data collection policy is not well-randomized or exploratory, often the case in practice, then the concentration coefficient is infinite due to missing some state-action pairs. Munos (2007) provides some intuition about the size of the concentration coefficient. Suppose that the data distribution is uniform (i.e., µ(s, a) = 1/(|S||A|)) and the environment transition probability is less uniform, that is, there exist some policies whose visitation distribution concentrates on a single state-action pair (e.g., P^π(S_h = s, A_h = a) = 1 for some s, a and h); then the concentration coefficient can be as large as the number of state-action pairs, |S||A|.

Another direction has been to consider approaches where the data covers a near-optimal policy. The key idea behind these methods is to restrict the policy to choose actions that have sufficient data coverage, which is effective if the given data has near-optimal action selection. For example, BCQ (Fujimoto et al., 2019) and BEAR (Kumar et al., 2019) only bootstrap values from actions a for which the probability π_b(a|s) is above a threshold b. MBS-QI (Liu et al., 2020) extends this to consider state-action probabilities, only bootstrapping from state-action pairs (s, a) when µ(s, a) is above a threshold. The algorithm modifies FQI by replacing the bootstrap value q_h(s, a) with

q̃_h(s, a) := I{µ(s, a) ≥ b} q_h(s, a),

and the policy is greedy with respect to q̃_h(s, a). If a state-action pair does not have sufficient data coverage, its value is zero. They show that MBS-QI outputs a near-optimal policy if µ(s, a) ≥ b for every state-action pair visited under an optimal policy π*. That is, the data provides sufficient coverage for the state-action pairs visited under an optimal policy.
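The bootstrap-masking idea is compact enough to write directly; below is a minimal sketch in the MBS-QI style, assuming tabular value estimates and an empirical estimate of µ (the threshold b is a hyperparameter).

```python
import numpy as np

def masked_bootstrap(q_next, mu_hat, b):
    """MBS-QI-style masking: zero out bootstrap values for (s, a) pairs
    whose estimated data probability mu_hat(s, a) falls below threshold b.

    q_next, mu_hat: arrays of shape (n_states, n_actions).
    Returns the masked values q~(s, a) = I{mu_hat >= b} * q(s, a)."""
    return np.where(mu_hat >= b, q_next, 0.0)
```

The masked values then replace q_{h+1} in the FQI target, so poorly covered pairs contribute a pessimistic value of zero rather than an unreliable estimate.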
Though potentially less stringent than having a small concentration coefficient, this assumption can be impractical. We may be able to satisfy this assumption in simulated environments, such as those in our experiments; in the real world, though, if we have a simulator we are unlikely to use offline RL. For many real world domains, optimal policies are unknown. In fact, one of the primary purposes of using offline RL is to get (significantly) improved policies. It is also hard to carefully design a data collection policy to cover an unknown optimal policy, making it difficult to even check whether this assumption holds.
Finally, some recent negative results suggest it is not sufficient to have a good data distribution alone, and that it will be necessary to also make assumptions on the MDP. In particular, Chen and Jiang (2019) showed that if we do not make assumptions on the MDP dynamics, no algorithm can achieve a polynomial sample complexity to return a near-optimal policy, even when the algorithm can choose any data distribution µ. Wang, Foster, and Kakade (2021) provide an exponential lower bound on the sample complexity of off-policy policy evaluation and optimization algorithms with a q^π-realizable linear function class, even when the data distribution induces a well-conditioned covariance matrix. Zanette (2021) provides an example where offline RL is exponentially harder than online RL, even with the best data distribution, a q^π-realizable function class and assuming exact feedback is observed for each sample in the dataset. Xiao, Lee, Dai, Schuurmans, and Szepesvari (2022) provide an exponential lower bound on the sample complexity of obtaining nearly-optimal policies when the data is obtained by following a data collection policy. These results are consistent with the above, since achieving a small concentration coefficient implicitly places assumptions on the MDP.
The main message from these negative results is that a good data distribution alone is not sufficient. We need to investigate realistic problem-dependent assumptions on MDPs. In the remainder of this work, we explore a restricted class of MDPs for which we can obtain much stronger guarantees when learning offline, without stringent requirements on data collection.

Action Impact Regularity
Actions play an important role in the exponential lower bound constructions cited in the last section. They use tree structures where different actions lead to different subtrees, and hence different sequences of future states and rewards. A class of MDPs that do not suffer from these lower bounds are those where actions do not have such a strong impact on the future states and rewards. In this section, we introduce the Action Impact Regularity (AIR) property, a property of the MDP which allows for more effective offline RL. The state is partitioned into an exogenous and an endogenous component, and the property reflects that the agent's actions primarily impact the endogenous state, with limited influence on the exogenous state. We first provide the formal definition and the assumptions we leverage to design a practical offline RL algorithm, and then discuss when these assumptions are likely to be satisfied.

Formal Definition and Assumptions
We use the standard state decomposition from Exogenous MDPs (McGregor, Houtman, Montgomery, Metoyer, & Dietterich, 2017; Dietterich et al., 2018). We assume the state space is S = S^exo × S^end, where S^exo contains the exogenous variables and S^end the endogenous variables. The transition dynamics are P^exo : S^exo × A → ∆(S^exo) and P^end : S × A → ∆(S^end) for the exogenous and endogenous variables respectively. The transition probability from a state s_1 = (s^exo_1, s^end_1) to another state s_2 = (s^exo_2, s^end_2) under action a factors as P(s_1, a, s_2) = P^exo(s^exo_1, a, s^exo_2) P^end(s_1, a, s^end_2).

Definition 1 (The AIR Property). An MDP is ε-AIR if S = S^exo × S^end and, for any actions a, a′ ∈ A, the next exogenous variable distribution is similar whether action a or a′ is taken. That is, for each state s = (s^exo, s^end) ∈ S,

D_TV(P^exo(s^exo, a), P^exo(s^exo, a′)) ≤ ε,

where D_TV is the total variation distance between two probability distributions on S^exo. For discrete spaces, the total variation distance is D_TV(P, P′) = ½ ||P − P′||_1 (ℓ1 norm).

We define the AIR-MDP such that the property holds for all exogenous state-action pairs. If the property fails for even one exogenous state-action pair, then one can design an adversarial MDP that hides all the difficulty in this single exogenous state-action pair, so assuming the property holds for all but one pair would be useless (Jiang, 2018).
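For a tabular MDP where the exogenous kernel is known, the smallest ε for which Definition 1 holds can be computed directly as the largest pairwise total variation distance over actions. The array layout below is an assumption made for this sketch.

```python
import numpy as np

def air_epsilon(p_exo):
    """Smallest eps for which a tabular MDP is eps-AIR.

    p_exo has shape (n_exo, n_actions, n_exo): p_exo[s, a] is the
    next-exogenous-state distribution for action a in exogenous state s.
    Returns the maximum pairwise TV distance D_TV = 0.5 * ||P - P'||_1."""
    eps = 0.0
    n_exo, n_actions, _ = p_exo.shape
    for s in range(n_exo):
        for a in range(n_actions):
            for a2 in range(a + 1, n_actions):
                tv = 0.5 * np.abs(p_exo[s, a] - p_exo[s, a2]).sum()
                eps = max(eps, tv)
    return eps
```

An MDP where all actions share the same exogenous kernel gives eps = 0, recovering the strict Exogenous MDP case.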
Access to an (approximate) endogenous model is critical to exploit the AIR property, and is a fundamental component of our algorithm. To be precise, we make the following assumption in this paper.
Assumption 1 (AIR with an Approximate Endogenous Model). We assume that the MDP is ε_air-AIR and that we have the reward model r : S × A → [0, r_max] and an approximate endogenous model P̂^end : S × A → ∆(S^end) such that D_TV(P̂^end(s, a), P^end(s, a)) ≤ ε_p for any (s, a) ∈ S × A.
As mentioned in the introduction, it is common to assume that only the transition dynamics are approximated. Moreover, similar to Definition 1, we need the error of the approximate model to hold uniformly. Finally, the above assumption implicitly assumes that the separation between exogenous and endogenous state is given to us. More generally, the separation could be identified or learned by the agent, as has been done for contingency-aware RL agents (Bellemare, Veness, & Bowling, 2012) and wireless networks (Dietterich et al., 2018). Because there are many settings where the separation is clear, we first focus on the case where the separation is known.

When Are These Assumptions Satisfied?
Many real-world problems can be formulated as ε-AIR MDPs. Further, for many of these environments, the separation between exogenous and endogenous state is clear, and we either know or can reasonably approximate the endogenous model. In this section, we go through several concrete examples.
We can first return to our stock trading example from the introduction. The exogenous component is the market information (stock prices and volumes) and the endogenous component is the number of stock shares owned by the agent. The agent's actions influence its own number of shares but, as an individual trader, the agent has limited impact on stock prices. Using a dataset of stock prices over time allows the agent to reason counterfactually about the impact of many possible trajectories of actions (buying/selling) on its shares (endogenous state) and profits (reward).
There are many settings where the agent has a limited impact on a part of the state. The optimal order execution problem is the task of selling M shares of a stock within H steps; the goal is to maximize the profit. The problem can be formulated as an MDP where the exogenous variable is the stock price and the endogenous variable is the number of shares left to sell. It is common to assume infinite market liquidity (Nevmyvaka, Feng, & Kearns, 2006) or that actions have a small impact on the stock price (Abernethy & Kale, 2013; Bertsimas & Lo, 1998); this corresponds to assuming the AIR property.
Another example is the secretary problem (Freeman, 1983), a family of problems that can often be used to model real-world applications (Babaioff, Immorlica, Kempe, & Kleinberg, 2007; Goldstein, McAfee, Suri, & Wright, 2020). The goal for the agent is to hire the best secretary out of H applicants, interviewed in random order. After each interview, the agent must decide whether to hire that applicant, or wait to see a potentially better applicant in the future. The problem can be formulated as a 0-AIR MDP where the endogenous variable is a binary variable indicating whether we have chosen to stop.
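As a sketch of this 0-AIR formulation, the following toy simulator replays a given interview order (the exogenous trajectory); the only endogenous information is whether the agent has stopped, tracked implicitly by returning at the hire decision. The policy interface is hypothetical.

```python
def run_secretary(ranks, policy):
    """Secretary problem as a 0-AIR MDP: the interview order (exogenous)
    is unaffected by the agent's actions.

    ranks: candidate qualities in interview order (higher is better).
    policy(i, is_best_so_far) -> True to hire candidate i.
    Returns True iff the hired candidate is the overall best."""
    best_so_far = float("-inf")
    for i, rank in enumerate(ranks):
        is_best = rank > best_so_far
        best_so_far = max(best_so_far, rank)
        if policy(i, is_best):
            return rank == max(ranks)
    return ranks[-1] == max(ranks)  # forced to take the last candidate
```

For example, the classic cutoff rule (skip an initial fraction, then hire the next best-so-far candidate) plugs in directly as `lambda i, best: i >= cutoff and best`, where `cutoff` is a hypothetical parameter.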
Other examples include those where the agent only influences energy efficiency, such as the hybrid vehicle problem (Shahamiri, 2008; Lian, Peng, Wu, Tan, & Zhang, 2020) and the electric vehicle charging problem (Abdullah, Gastli, & Ben-Brahim, 2021). In the former, the agent controls whether the vehicle uses the gas engine or the electric motor at each time step, with the goal of minimizing gas consumption; its actions do not impact the driver's behavior. In the latter, the agent controls the charging schedule of an electric vehicle to minimize costs; its actions do not impact the electricity cost.
In some settings we can even restrict the action set or policy set to make the MDP ε-AIR.For example, if we know that selling M shares hardly impacts the markets, we can restrict the action space to selling less than or equal to M shares.In the hybrid vehicle example, if the driver can see which mode is used, we can restrict the policy set to only switch actions periodically to minimize distractions for the driver.
In these problems with AIR, we often know the reward and the transitions for the endogenous variables, or have a good approximation. For the optimal order execution problem, the reward is simply the selling price times the number of shares sold, minus transaction costs, and the endogenous transition P^end is deterministic: the next inventory level is the current inventory minus the number of shares sold. In other applications, we may be able to use domain knowledge to build an accurate model of the endogenous dynamics. For the hybrid vehicle, we can use domain knowledge to calculate how much gas would be used for a given acceleration. Such information about the dynamics of the system can be simpler for engineers to specify than the (unknown) behavior of different drivers and environment conditions. Our theoretical results include a term for the error in the endogenous model, but it is reasonable to assume that for many settings we can get this error relatively low, particularly in comparison to the error we might incur when trying to model the exogenous state.
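The known endogenous model for order execution described above is just a few lines; the following sketch makes it concrete, with the per-trade `fee` a hypothetical stand-in for transaction costs.

```python
def order_execution_step(inventory, sell, price, fee=0.0):
    """Known endogenous model for optimal order execution: the next
    inventory is deterministically the current inventory minus the
    shares sold, and the reward is the proceeds minus transaction costs.

    Returns (next_inventory, reward)."""
    sell = min(sell, inventory)  # cannot sell more shares than we hold
    reward = sell * price - (fee if sell > 0 else 0.0)
    return inventory - sell, reward
```

Given a logged price trajectory, chaining this step function over any candidate action sequence yields the counterfactual return of that sequence, exactly as the AIR property permits.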

Connections to the Literature on Exogenous MDPs
AIR MDPs can be viewed as an extension of Exogenous MDPs. (1) We allow the action to have a small impact on the exogenous state, while the action has no impact on the exogenous state in Exogenous MDPs. (2) We do not assume the reward decomposes additively into an exogenous reward and an endogenous reward (Dietterich et al., 2018), nor factors into a sum over each exogenous state variable (Chitnis & Lozano-Pérez, 2020). For these previous definitions of Exogenous MDPs, the focus was on identifying and removing the exogenous state/noise so that the learning problem could be solved more efficiently (Dietterich et al., 2018; Efroni, Misra, Krishnamurthy, Agarwal, & Langford, 2022), hence the focus on reward decomposition. Our focus is offline learning, where we want to exploit the known structure to enable counterfactual reasoning and avoid data coverage issues.

Offline Policy Optimization for AIR MDPs
In this section, we discuss several offline algorithms that exploit the AIR property for policy optimization. We then theoretically analyze an FQI-based algorithm, characterizing the performance of its output policy.

Algorithms for AIR MDPs
Two standard classes of algorithms in offline RL are model-based algorithms, which learn a model from the offline dataset and then use dynamic programming, and model-free algorithms like fitted Q-iteration (FQI). These two approaches can be tailored to our setting with AIR MDPs, as we describe below. There is, however, an even more basic approach in our offline RL setting, using trajectory simulation, that has previously been used (Burhani et al., 2020). We start by describing this simpler approach, and then the modified model-based and FQI approaches.
A natural approach is to reuse trajectories in the dataset to simulate alternative trajectories for an online RL algorithm. For each episode, a random trajectory is selected from the dataset. The online RL algorithm, such as an actor-critic method or a Q-learning agent, takes actions and deterministically transitions to the next exogenous state in the trajectory. The approximate endogenous and reward models are used to sample the next endogenous variable and reward. With such a trajectory simulator, we can run any online reinforcement learning algorithm to find a good policy for the simulator.
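The replay-based simulator described above can be sketched as a small environment class; the interface and the combined endogenous/reward model `end_model` are assumptions for this illustration.

```python
import random

class TrajectorySimulator:
    """Replay-based simulator: each episode replays one logged exogenous
    trajectory while the endogenous state is advanced with a known model.

    end_model(s_exo, s_end, a) -> (next_s_end, reward) is assumed given."""

    def __init__(self, exo_trajectories, end_model, s_end_init=0):
        self.trajs = exo_trajectories
        self.end_model = end_model
        self.s_end_init = s_end_init

    def reset(self):
        # Pick a random logged exogenous trajectory for this episode.
        self.traj = random.choice(self.trajs)
        self.h = 0
        self.s_end = self.s_end_init
        return self.traj[0], self.s_end

    def step(self, a):
        # Endogenous state responds to the action; exogenous state is replayed.
        self.s_end, reward = self.end_model(self.traj[self.h], self.s_end, a)
        self.h += 1
        done = self.h >= len(self.traj) - 1
        obs = (self.traj[min(self.h, len(self.traj) - 1)], self.s_end)
        return obs, reward, done
```

Any online RL agent with a standard reset/step interface can then be trained against this simulator, which is exactly the trajectory simulation approach discussed here.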
This approach, however, does not exploit the fact that the agent is actually learning offline. The online RL algorithm cannot simply query the model for any state and action, and needs a good exploration strategy to find a reasonable policy. There are fewer theoretical guarantees for such online RL algorithms, and arguably more open questions about their properties, than for DP-based algorithms and fitted value iteration algorithms.
A more explicit model-based approach is to learn the exogenous model from data, to obtain a complete transition and reward model, and then use any planning approach. The transition model for exogenous states can be constructed as if the action has no impact. With the model, we can use any query-efficient planning algorithm to find a good policy for the model. Because actions have only a small impact in the true MDP, we can learn an accurate exogenous model even if we do not have full data coverage.
More precisely, recall that the offline data is randomly generated by running π_b on M; that is, D = {τ_i}_{i=1}^N is sampled according to the probability measure P^{π_b}_M. The pertinent part is the transitions between exogenous variables, so we define D^exo = {(s^exo_h, s^exo_{h+1})}, the set of consecutive exogenous-variable pairs in the data. For finite state spaces, we can estimate P̂^exo(s^exo_h, a, s^exo_{h+1}) from the empirical transition frequencies in D^exo, using the same estimate for all a ∈ A. Exogenous variables not seen in the data are not reachable, and so can either be omitted from P̂^exo or set to self-loop. For large or continuous state spaces, we can learn p̂(s^exo_{h+1} | s^exo_h) using any conditional distribution learning algorithm, and set P̂^exo(s^exo_h, a, s^exo_{h+1}) = p̂(s^exo_{h+1} | s^exo_h) for all a ∈ A. For large or continuous state spaces, however, learning such a model and planning can be impractical. Learning an accurate exogenous model might be difficult if the exogenous transitions are complex or the exogenous state is high-dimensional. Further, it is not possible to sweep through all states during planning. Smarter approximate dynamic programming algorithms need to be used, but even these can be quite computationally costly.
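For a finite exogenous space, the empirical, action-independent estimate described above takes only a few lines; this sketch assumes hashable exogenous states.

```python
from collections import Counter, defaultdict

def fit_exogenous_model(exo_pairs):
    """Estimate the exogenous transition model from consecutive
    exogenous-state pairs, ignoring actions: under AIR, the action's
    influence on the exogenous state is negligible, so a single
    action-independent model is shared by all actions.

    exo_pairs: iterable of (s_exo, next_s_exo) pairs.
    Returns dict: s_exo -> {next_s_exo: empirical probability}."""
    counts = defaultdict(Counter)
    for s, s_next in exo_pairs:
        counts[s][s_next] += 1
    model = {}
    for s, c in counts.items():
        total = sum(c.values())
        model[s] = {s2: n / total for s2, n in c.items()}
    return model
```

Exogenous states absent from the data simply have no entry in the returned dict, matching the text's observation that unseen exogenous variables can be omitted or set to self-loop.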
A reasonable alternative is FQI, which approximates value iteration without the need to learn a model. Our FQI algorithm that exploits the AIR property, which we call FQI-AIR, is described in Algorithm 1. The algorithm simulates all actions from a state, and assumes it transitions to the exogenous state observed in the dataset. The reward and endogenous state for each simulated action can be obtained using the reward model and the approximate endogenous model.

Algorithm 1 FQI-AIR
Input: dataset D, value function class F, P̂^end, r
Let q_H = 0 and D_{H−1} = ∅, ..., D_0 = ∅
for h = H − 1, ..., 0 do
  for all i ∈ {1, ..., N}, all s^end_h ∈ S^end, all a ∈ A do
    Sample s̃^end ∼ P̂^end((s^exo_{h,i}, s^end_h), a)
    Add the (synthetic) pair (((s^exo_{h,i}, s^end_h), a), r((s^exo_{h,i}, s^end_h), a) + max_{a′} q_{h+1}((s^exo_{h+1,i}, s̃^end), a′)) to D_h
  q_h = argmin_{q ∈ F} Σ_{(x, y) ∈ D_h} (q(x) − y)²
Output: the greedy policy with respect to q_0, ..., q_{H−1}

Even though the true MDP is not necessarily a 0-AIR MDP, we will show in the analysis that, as long as ε_air is small, the algorithm can return a nearly optimal policy in the true MDP. This algorithm, although simple, enjoys theoretical guarantees without making assumptions on the concentration coefficient, and can be much more computationally efficient than trajectory simulation methods.
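A minimal tabular sketch of this full-sweep idea follows. The integer encoding of exogenous states, the deterministic `end_model`, and the `reward_fn` signature are assumptions made for the sketch, not part of the algorithm's specification.

```python
import numpy as np

def fqi_air(exo_trajs, end_model, reward_fn, n_end, n_actions, horizon):
    """Tabular FQI-AIR sketch: for every logged exogenous transition,
    sweep all endogenous states and all actions, using the (assumed
    known) endogenous model end_model(s_exo, s_end, a) -> next s_end
    and reward model reward_fn(s_exo, s_end, a) to build synthetic
    Bellman targets. Exogenous states are integers indexing q-tables."""
    n_exo = max(max(t) for t in exo_trajs) + 1
    # q[h] has shape (n_exo, n_end, n_actions); q[horizon] = 0.
    q = [np.zeros((n_exo, n_end, n_actions)) for _ in range(horizon + 1)]
    for h in range(horizon - 1, -1, -1):
        tgt = np.zeros((n_exo, n_end, n_actions))
        cnt = np.zeros((n_exo, n_end, n_actions))
        for traj in exo_trajs:
            if h + 1 >= len(traj):
                continue
            x, x_next = traj[h], traj[h + 1]  # logged exogenous transition
            for s_end in range(n_end):
                for a in range(n_actions):
                    s_end_next = end_model(x, s_end, a)
                    tgt[x, s_end, a] += reward_fn(x, s_end, a) + q[h + 1][x_next, s_end_next].max()
                    cnt[x, s_end, a] += 1
        seen = cnt > 0
        q[h][seen] = tgt[seen] / cnt[seen]  # tabular least-squares fit
    return q[:horizon]
```

In contrast to plain FQI, every endogenous state and action at each logged exogenous state receives a synthetic target, so coverage of the endogenous component by the behavior policy is not needed.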
Note that the computational cost scales with the sizes of S_end and A. When |S_end| or |A| is large, we can modify FQI-AIR to no longer use full sweeps: instead, we randomly sample from the endogenous state space and action space.We include this practical implementation of FQI-AIR in Algorithm 2. For each exogenous state in the dataset, we sample an endogenous state and an action, and query the approximate model to obtain a target for the FQI update.As a result, the computation can be independent of the sizes of S_end and A. For the sample complexity, however, the performance loss of the algorithm depends on the square root of the product |S_end||A|, as shown in the next section.
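The full-sweep update at the heart of FQI-AIR can be sketched in a few lines. The tabular sketch below is our own illustration (the paper's implementation uses neural network regression; `p_end_next` and `reward` stand in for the approximate endogenous and reward models, and the model is taken to be deterministic for simplicity):

```python
def fqi_air_tabular(exo_trajectories, H, S_end, A, p_end_next, reward):
    """Tabular sketch of FQI-AIR.

    exo_trajectories: exogenous trajectories [s_exo_0, ..., s_exo_{H-1}]
    p_end_next(s_end, a): approximate endogenous model (deterministic here)
    reward(s_exo, s_end, a): reward model
    Returns action values q keyed by (h, s_exo, s_end, a).
    For large S_end or A, the full sweep below would be replaced by random
    sampling of (s_end, a) pairs, as in the practical variant.
    """
    q = {}
    for h in range(H - 1, -1, -1):
        for traj in exo_trajectories:
            s_exo = traj[h]
            # Assume we transition to the exogenous state observed in the data.
            s_exo_next = traj[h + 1] if h + 1 < H else None
            for s_end in S_end:            # sweep endogenous states
                for a in A:                # simulate every action
                    r = reward(s_exo, s_end, a)
                    if s_exo_next is None:
                        target = r         # last step: no bootstrap
                    else:
                        s_end_next = p_end_next(s_end, a)
                        target = r + max(
                            q[(h + 1, s_exo_next, s_end_next, a2)] for a2 in A)
                    q[(h, s_exo, s_end, a)] = target
    return q
```

With function approximation, the pairs ((s_exo, s_end, a), target) would instead be collected into D_h and fit by regression.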

Theoretical Analysis of FQI-AIR
First we need the following definitions.For a given MDP M, we define J(π, M) := E^π_M[R(τ)] where τ = (S_0, A_0, ..., S_{H−1}, A_{H−1}) is a random element in (S × A)^H, the expectation E^π_M is with respect to P^π_M, and R(τ) = Σ_{h=0}^{H−1} r(S_h, A_h).Given a probability measure ν_h on S^exo at horizon h and p ∈ [1, ∞), define the norm ‖f‖_{p,ν_h} := (Σ_s ν_h(s)|f(s)|^p)^{1/p}.We also need the following assumption on the function approximation error.This is a common assumption in analyses of approximate value iteration algorithms (Antos, Szepesvári, & Munos, 2006; Munos, 2007).

Algorithm 2 FQI-AIR (practical implementation, excerpt)
  Compute the mini-batch loss L(θ) = Σ_{j=1}^B (q_θ(s^exo_j, s̃^end_j, ã_j, h_j) − t_j)²
  Update θ to reduce L(θ); θ̄ ← θ
Output: the greedy policy with respect to q_θ̄

Assumption 2. Assume the function class F is finite and the inherent Bellman error is bounded by ε_apx.

We assume the function class is finite for simplicity, which is common in many offline RL papers (Chen & Jiang, 2019).If the function class is not finite but has a bounded complexity measure, we can derive similar results by replacing the size of the function class with the complexity measure.For example, Duan, Jin, and Li (2021) analyze FQI with the Rademacher complexity.Since the choice of complexity measure is not a critical point for our paper, we make the finite function class assumption.
Theorem 1.Under Assumptions 1 and 2, let π*_M be an optimal policy in M and π̂ the output policy of FQI-AIR.Then, with probability at least 1 − ζ, the performance loss J(π*_M, M) − J(π̂, M) is bounded by the sum of three components: (1) a sampling error term, which decreases with more trajectories; (2) a term scaling with the AIR parameter ε_air; and (3) an approximation error term, which depends on the function approximation used.The result implies that as long as we have a sufficient number of episodes, good function approximation, and small ε_air, the algorithm can find a nearly-optimal policy with high probability.For example, if ε_air, ε_p and ε_apx are small enough, we only need N = Õ(H^4 v_max^2 |S_end||A|/δ) trajectories, which is polynomial in H, to obtain a δ-optimal policy.
The proof can be found in the appendix.The key idea is to introduce a baseline MDP M_b that is 0-AIR and approximates M, which is ε_air-AIR.The baseline MDP M_b = (S, A, P̄_exo, P̂_end, r, H, ν) uses the approximate endogenous model P̂_end together with an exogenous transition P̄_exo that does not depend on the action taken, so M_b is 0-AIR.We show that FQI returns a good policy in M_b, and that good policies in M_b are also good in the true MDP M.
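One concrete way to write the baseline construction (our notation, a sketch of the proof device; the fixed reference action ā is our choice for illustration) is:

```latex
% Baseline MDP M_b: replace the exogenous kernel by an action-independent one,
% e.g. by pinning it to an arbitrary fixed reference action \bar{a}:
\bar{P}_{\mathrm{exo}}(s^{\mathrm{exo}}_h, a, \cdot)
    := P_{\mathrm{exo}}(s^{\mathrm{exo}}_h, \bar{a}, \cdot)
    \qquad \text{for all } a \in \mathcal{A}.
% The AIR property then bounds the per-step perturbation:
D_{TV}\big(P_{\mathrm{exo}}(s^{\mathrm{exo}}_h, a, \cdot),\,
           \bar{P}_{\mathrm{exo}}(s^{\mathrm{exo}}_h, a, \cdot)\big)
    \le \varepsilon_{\mathrm{air}}.
```

Combining this per-step bound with the endogenous-model error ε_p, Lemma 3 in the appendix gives |J(π, M) − J(π, M_b)| ≤ v_max H (ε_air + ε_p) for every policy π.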
We can contrast this bound with others in offline RL.For FQI results that assume the concentration coefficient is bounded and small, the error bound has a term that scales with √C, which plays the same role as the √(|S_end||A|) term in our bound.We can get a similar bound by considering this restricted class of MDPs that are ε-AIR, without making any assumptions on the concentration coefficient.For settings where this assumption is appropriate, namely when the MDP is ε-AIR, this is a significant gain, as we need not impose stringent conditions on the data distribution.

Policy Evaluation for AIR MDPs
We can also exploit the AIR property, together with access to the approximate endogenous model and reward model, to evaluate the value of a given policy.Given N trajectories of exogenous states (S^{(i),exo}_0, S^{(i),exo}_1, ..., S^{(i),exo}_{H−1}), we can roll out a synthetic trajectory under π and P̂_end along each of them; the average return over the N synthetic trajectories, Ĵ(π, M), is an estimator of J(π, M).This method is simple, but very useful, because it lets us do hyperparameter selection with only the offline dataset and without introducing extra hyperparameters.
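The estimator above can be sketched directly: replay each observed exogenous trajectory, let the policy drive only the endogenous part, and average the returns. Names below (`p_end_next`, `reward`) are our illustrative stand-ins for the approximate models, with a deterministic endogenous model for simplicity:

```python
def evaluate_policy_air(policy, exo_trajectories, p_end_next, reward, s_end_0):
    """Estimate J(pi, M) by synthetic rollouts along observed exogenous
    trajectories.

    policy(s_exo, s_end, h) -> action
    p_end_next(s_end, a)    -> next endogenous state (approximate model)
    reward(s_exo, s_end, a) -> reward (reward model)
    """
    returns = []
    for traj in exo_trajectories:
        s_end, total = s_end_0, 0.0
        for h, s_exo in enumerate(traj):
            a = policy(s_exo, s_end, h)
            total += reward(s_exo, s_end, a)
            s_end = p_end_next(s_end, a)   # only the endogenous part evolves
        returns.append(total)
    return sum(returns) / len(returns)
```

Because the exogenous stream is replayed from data, no environment interaction is needed, which is what makes offline hyperparameter selection possible.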
We can bound the policy evaluation error by Hoeffding's inequality.More sophisticated bounds for policy evaluation can be found in Thomas et al. (2015).
Theorem 2. Under Assumption 1, given a deterministic policy π, with probability at least 1 − ζ the error | Ĵ(π, M) − J(π, M)| is bounded by a sampling term of order v_max √(log(2/ζ)/N) plus a bias term of order v_max H (ε_air + ε_p).

The result suggests that if we have a sufficient number of trajectories and small ε_air and ε_p, then Ĵ(π, M) is a good estimator of J(π, M).Even though the estimator is biased and not consistent, we find it provides sufficient information for hyperparameter selection in our experiments.
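The Hoeffding step behind the sampling term can be sketched as follows (our reconstruction of the standard argument: synthetic returns are i.i.d. and bounded in [0, v_max]):

```latex
\Pr\Big(\big|\hat{J}(\pi, M) - \mathbb{E}[\hat{J}(\pi, M)]\big| \ge t\Big)
  \le 2\exp\!\left(-\frac{2 N t^2}{v_{\max}^2}\right),
\qquad\text{so with probability at least } 1-\zeta,\quad
t = v_{\max}\sqrt{\frac{\log(2/\zeta)}{2N}} .
```

The remaining bias |E[Ĵ(π, M)] − J(π, M)| is controlled by Lemma 3, which bounds the gap between the true MDP and the 0-AIR baseline by v_max H (ε_air + ε_p).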
Theorem 2 only holds for a given policy that is independent of the offline data.If we want to evaluate the output policy, which depends on the data, we need to apply a union bound over all deterministic policies; in that case, the sampling error term becomes Õ(√(|S|/N)).To avoid the dependence on the size of the state space, an alternative is to split the data into two subsets: one subset is used to obtain an output policy and the other is used to evaluate it.

Simulation Experiments
We evaluate the performance of the FQI-based algorithm in two simulated environments with the AIR property: an optimal order execution problem and an inventory management problem.We also tested the other algorithms described in Section 5.1, for completeness and to contrast to FQI-AIR.We include these results in Appendix C. FQI-AIR is notably better than these other approaches, and so we focus on it as the representative algorithm that exploits the AIR property.
The first goal of these experiments is to demonstrate that existing offline RL algorithms fail to learn a good policy for some natural data collection policies, while our proposed algorithm returns a near-optimal policy.To demonstrate this, we test three data collection policies: (1) a random policy which is designed to give a small concentration coefficient, (2) a learned near-optimal policy obtained using DQN with online interaction, which covers an optimal policy reasonably well, and (3) a constant policy which, in theory, has an infinite concentration coefficient due to missing state-action pairs and does not cover an optimal policy.The second goal is to validate the policy evaluation analysis with a varying number of trajectories N and ε air .

Environments
We investigated the behavior of the algorithms on two simulated environments that mimic two real-world problems satisfying the AIR property: optimal order execution and inventory management.In the optimal order execution problem, the task is to sell M = 10 shares within H = 100 steps.The stock prices X_1, ..., X_H are generated by an ARMA(2, 2) process and scaled to the interval [0, 1].Specifically, the ARMA(2, 2) coefficients are sampled as φ_1 ∼ U(−0.9, 0.0), φ_2 ∼ U(0.0, 0.9) and θ_i ∼ U(−0.5, 0.5) for i = 1, 2. The scaling parameters are chosen so that the process is stable and the price stays in the interval [0, 1].The endogenous variable P_h is the number of shares left to sell.To construct the state, we use the K = 3 most recent prices together with the most recent endogenous variable.The reward is the stock price X_h multiplied by the number of shares sold, min{A_h, P_h}.
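A price stream of this kind can be generated in a few lines. The sketch below is our own illustration: the paper states the prices are scaled to [0, 1] but not the exact scaling, so we use min-max scaling here as an assumption, and unit Gaussian innovations:

```python
import random

def arma22_prices(H, phi1, phi2, theta1, theta2, seed=None):
    """Generate H prices from an ARMA(2,2) process, min-max scaled to [0, 1].

    In the paper the coefficients are sampled as phi1 ~ U(-0.9, 0),
    phi2 ~ U(0, 0.9), theta_i ~ U(-0.5, 0.5); they are passed in here.
    """
    rng = random.Random(seed)
    x = [0.0, 0.0]     # last two values X_{t-2}, X_{t-1}
    eps = [0.0, 0.0]   # last two innovations
    xs = []
    for _ in range(H):
        e = rng.gauss(0.0, 1.0)
        x_t = (phi1 * x[-1] + phi2 * x[-2]
               + e + theta1 * eps[-1] + theta2 * eps[-2])
        xs.append(x_t)
        x.append(x_t)
        eps.append(e)
    lo, hi = min(xs), max(xs)
    # Min-max scale so prices lie in [0, 1] (our scaling choice).
    return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in xs]
```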
We consider settings with both ε_air = 0 and ε_air > 0, as well as different data generating policies.When the number of shares sold is greater than 0, the stock price drops by 10% with probability ε_air.For ε_air = 0, this means selling shares has no impact on the stock price.When ε_air > 0, it does, allowing us to test how robust FQI-AIR is to some violation of the AIR property.The random policy used in the environment chooses 0 with probability 75% and chooses each of 1, ..., 5 with probability 5%.The constant policy always chooses action 0.
We design an inventory management problem based on the existing literature (Kunnumkal & Topaloglu, 2008;Van Roy, Bertsekas, Lee, & Tsitsiklis, 1997).The task is to control the inventory of a product over H = 100 stages.At each stage, we observe the inventory level X_t (endogenous) and the previous demand D_{t−1} (exogenous) and place an order A_t ∈ [10].The inventory level evolves according to the order placed and the realized demand, and the reward charges an order cost c, a holding cost h, and a cost b for lost sales.We use c = 0.1, h = 0.25 and b = 1.0 in the experiment.To make sure the reward is bounded, we clip it at a large negative number, −100.The endogenous variable is the inventory level, which can be as large as 1000, so we restrict FQI-AIR to sweep over only a subset of the endogenous space, namely [15] ⊂ S_end.
As before, we consider both ε_air = 0 and ε_air > 0, which in this case impacts the demand.The demand is D_t = (D̃_t)^+, where D̃_t follows a normal distribution with mean μ and standard deviation σ = μ/3, and (d)^+ := max{d, 0}.At the beginning of each episode, μ is sampled from a uniform distribution on the interval [3, 9].When the order is greater than 0, the mean of the demand distribution decreases or increases by 10%, each with probability ε_air/2.
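One step of the inventory dynamics can be sketched as follows. This is an illustrative newsvendor-style form using the paper's constants c = 0.1, h = 0.25, b = 1.0 and the −100 reward clip; the paper's exact equations are not reproduced in the text, so the precise functional form is our assumption:

```python
def inventory_step(x, a, d, c=0.1, h=0.25, b=1.0):
    """One step of a textbook inventory model (illustrative form).

    x: inventory level, a: order placed, d: realized demand.
    Returns (next inventory level, reward).
    """
    y = x + a                       # stock on hand after the order arrives
    next_x = max(y - d, 0)          # leftover inventory carried over
    lost = max(d - y, 0)            # unmet demand (lost sales)
    reward = -(c * a + h * next_x + b * lost)
    return next_x, max(reward, -100.0)   # clip at -100 as in the paper
```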
Again, we consider three different data generating policies.The random policy used in the environment chooses a value Ã_t ∈ [D_{t−1} − 3, D_{t−1} + 3] uniformly and then takes the action A_t = max{min{Ã_t, 10}, 0}.The constant policy always chooses action A_t = min{D_{t−1}, 10}.The near-optimal policy is obtained using DQN with online interaction, for both environments.
There is an important nuance for the inventory problem: the endogenous transition and reward depend on the next exogenous variable.Fortunately, we can generalize the definition of exogenous MDPs so that the endogenous transition is P_end : S × A × S_exo → Δ(S_end) and the reward is r : S × A × S_exo → [0, r_max].We assume we have an approximate endogenous model P̂_end : S × A × S_exo → Δ(S_end) such that D_TV(P_end(s, a, s^exo), P̂_end(s, a, s^exo)) ≤ ε_p for any (s, a, s^exo) ∈ S × A × S_exo.With these changes, the algorithms and the theoretical analysis extend naturally to the new definition of exogenous MDPs.

Algorithm Details
We compare FQI-AIR to FQI, MBS-QI (Liu et al., 2020), CQL (Kumar et al., 2020), and IQL (Kostrikov et al., 2022).As discussed in the previous sections, FQI is expected to work well when the concentration coefficient is small, and MBS-QI is expected to perform well when the data covers an optimal policy.CQL and IQL are strong baselines which have been shown to be effective empirically for discrete-action environments such as Atari games.
We had several choices to make for the baseline algorithms.MBS-QI requires density estimation for the data distribution μ.For the optimal order execution problem, we use state discretization and empirical counts to estimate the data distribution, as in the original paper.For the inventory problem, the state space is already discrete, so there is no need for discretization.We show the results with the best threshold b from the set {0.002, 0.001, 0.0001, 0.00005}.Note that it is possible that there is no data for some states (or state discretizations) visited by the output policy; for these states, all action values are zero.To break ties, we allow MBS-QI to choose an action uniformly at random.For CQL, we add the CQL(H) loss with a weighting α when updating the action values, and show the results with the best α from the set {0.1, 0.5, 1.0, 5.0}, as suggested in the original paper.For IQL, we show the results with the best τ from the set {0.7, 0.8, 0.9} and β from the set {10.0, 3.0, 1.0}.
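The count-based density estimate and the MBS-style thresholding it feeds can be sketched as follows (names are ours; `discretize` maps a state to its bin, and the threshold `b` plays the role described above):

```python
from collections import Counter

def empirical_density(dataset, discretize):
    """Count-based estimate of the data distribution mu(s, a).

    dataset: iterable of (state, action) pairs.
    discretize: maps a state to a hashable bin.
    """
    counts = Counter((discretize(s), a) for (s, a) in dataset)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def masked_q(q, mu, b):
    """Zero out action values whose estimated density falls below b."""
    return {k: (v if mu.get(k, 0.0) >= b else 0.0) for k, v in q.items()}
```

When all action values at a state are masked to zero, ties would be broken uniformly at random, as described above.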
We use the same value function approximation for all algorithms in our experiments: two-layer neural networks with hidden size 128.The networks are optimized by Adam (Kingma & Ba, 2014) or RMSprop with a learning rate from the set {0.001, 0.0003, 0.0001}.All algorithms are trained for 100 iterations.We also tried training the comparator algorithms for longer, but it did not improve their performance.
The hyperparameters for FQI-AIR are selected based on Ĵ(π, M), which only depends on the dataset.The hyperparameters for the comparator algorithms are selected based on J(π, M), estimated by running the policy π in the true environment M with 100 rollouts; this gives them a large advantage.

Policy Performance When ε air = 0
Figure 2 shows the performance of our algorithm and the comparator algorithms with varying numbers of trajectories N ∈ {1, 5, 10, 25, 50, 100, 200} and ε_air = 0. Our algorithm outperforms the other algorithms for all data collection policies.This result is not too surprising, as FQI-AIR is the only algorithm that exploits this important regularity in the environment; nonetheless, it shows how useful it is to exploit the AIR property when possible.
We can first look more closely at the optimal order execution results.MBS performs slightly better than FQI; we found this is because its tie-breaking is done with a uniform random policy, which helps especially under the constant-policy dataset.CQL and IQL fail when the data collection policy is far from optimal (the constant policy) and perform reasonably with the learned policy.FQI-AIR exploits the AIR property, and so is robust to different data collection policies.The results show that exploiting the AIR property is critical for robust performance.
We see similar patterns for the inventory management problem.FQI-AIR outperforms the other algorithms for all data collection policies.CQL and IQL perform well in this environment.MBS outperforms FQI under the learned policy, but FQI outperforms MBS under the random policy.The results match the expectation that FQI performs well with uniform data and MBS-QI performs well with expert data.

Policy Performance with a Large Value of ε air
Now we consider the impact of using these algorithms when ε air > 0. We should expect FQI-AIR to be most impacted, as the other algorithms do not exploit the AIR property.We vary ε air from 0.1 to 0.8 and find that the results are similar to those with small ε air .FQI-AIR still significantly outperforms other offline methods.
Figure 3 shows the results with ε_air = 0.8, where the performance of all algorithms drops significantly.In theory, FQI-AIR can suffer a large performance loss with large ε_air; however, it still consistently outperforms the other baselines in our experiments, except in the inventory management problem with the learned policy.This is because the divergence between the true exogenous transition and the synthetic exogenous transition used by FQI-AIR does not occur at every time step, even when ε_air is large.For example, in the optimal order execution problem, the divergence can only happen when we sell a positive number of shares.The theoretical result is a worst-case analysis, in which the divergence can occur at every time step and we suffer an r_max loss each time it does.The experimental results therefore suggest that the practical problems considered in this paper are not worst cases, and that FQI-AIR can perform well even with large ε_air.

Results for Policy Evaluation
To validate the policy evaluation analysis, we investigate the difference | Ĵ(π, M) − J(π, M)| with ε_air ∈ {0, 0.05, 0.1, 0.2, 0.4} and N ∈ {1, 5, 25, 100, 200}, where π is the output policy of FQI-AIR.We show the 90th percentile of the difference for each combination of ε_air and N over 90 data points (30 runs under each data collection policy) in Figure 4.The 90th percentiles scale approximately linearly with ε_air and decrease as N grows.The results suggest that the dependence on ε_air is linear and that the sampling error goes to zero at a sublinear rate in N, which reflects the bound of Theorem 2.

Real World Experiments
To demonstrate the practicality of the proposed algorithm, we evaluate it in two real world experiments: (1) Bitcoin, an optimal order execution problem for the bitcoin market, and (2) Prius, a hybrid car control problem.For the Bitcoin experiment, we use historical prices of bitcoin.The problem is to sell one bitcoin within 60 steps, where each step corresponds to 10 minutes in the real world.On each step, the agent chooses to sell some number of bitcoins in {0, 0.1, 0.2, 0.3, 0.4, 0.5}.Each episode corresponds to 10 hours, with a start state chosen from a random time step in the data (consisting of 300 days).
The exogenous state contains the most recent 60 closing prices, and the endogenous state contains the number of shares left to be sold.We collect an offline dataset by running a trained policy by DQN for N episodes, and report performance of the output policy for the testing period (about 41 days).
For the Prius experiment, we use the hybrid car environment from Lian et al. (2020).The agent can switch between using fuel or battery, with the goal of minimizing fuel consumption while maintaining a desired battery level.The exogenous state is the driving pattern and the endogenous state contains the state of charge and the previous action.We collect the offline dataset by running a learned policy with 10 different driving patterns, and test on 12 driving patterns.To better mimic the real world, where we would not have a random or constant policy, we use the learned policy from DQN as the data collection policy.Further, since the state space is larger, we run the variant of FQI-AIR that randomly samples endogenous states and actions, rather than sweeping through all endogenous states and actions.FQI-AIR performs significantly better than CQL, IQL and FQI, as shown in Figure 5. MBS-QI does not scale to high-dimensional continuous state spaces, and so is excluded.These results highlight that FQI-AIR can scale to a high-dimensional continuous state space and a large endogenous state space.

Learning an Endogenous Model in AIR-MDPs
In the previous experiments, we assume we are given the endogenous models.In this section, we investigate the impact of using an approximate endogenous model learned from offline data.We perform this experiment in the hybrid car environment, which reflects a setting where the endogenous dynamics might in fact not be known and would be useful to estimate from data.We use a neural network to approximate the endogenous state and reward model, and run FQI-AIR with the learned endogenous model.
Let us first reason about when it might be feasible to learn a reasonably accurate endogenous model.In the worst case, learning accurate endogenous and reward models would require data coverage of the entire state space.However, in many practical scenarios, the endogenous model can be easy to learn and does not require full data coverage.For example, in the optimal order execution problem, the endogenous dynamics do not depend on the exogenous variables; as a result, we only need coverage of the endogenous state.In the Prius environment, collecting data from just one driving cycle is sufficient to learn a good endogenous model, as we will demonstrate in these experiments.
We first collect a dataset from the hybrid car environment by running a random policy in one of the driving cycles and a deterministic policy for the other driving cycles.This data generation approach mimics a scenario we might see in practice.In the factory, we might have a test system for which it is acceptable to try many different actions (using gas or the battery), and so get a more varied dataset for learning the endogenous model.We would only get this data from one limited course (one driving cycle).The rest of the data would be collected in the wild, where the deployed solution should not be exploring many actions and should largely be deterministic.
We also test two model-based (MB) baselines: (1) The first baseline has full knowledge of the reward and endogenous models, and learns the exogenous model from offline data without exploiting the AIR property.The algorithm is similar to Algorithm 1, but s^exo_{h+1} is generated from the learned exogenous model.(2) The second baseline has no knowledge of the reward, endogenous, or exogenous models; it learns a full model from a state-action pair to the next state.For both model-based baselines, the model is parameterized by a two-layer neural network and learned by minimizing the ℓ2 distance between predicted states and the next states recorded in the data.The transitions in these environments are deterministic, so it is appropriate to learn an expectation model.
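Fitting an expectation model by minimizing the ℓ2 distance can be sketched as below. For a self-contained illustration we use a linear model fit by least squares rather than the paper's two-layer network; the function names and feature layout are our own:

```python
import numpy as np

def fit_expectation_model(states, actions, next_states):
    """Least-squares expectation model: s' ~ W^T [s; a; 1].

    A linear stand-in for the paper's two-layer network baseline, trained by
    minimizing the l2 distance between predictions and recorded next states.
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    next_states = np.asarray(next_states, dtype=float)
    # Stack state, action, and a bias feature into the design matrix.
    X = np.hstack([states, actions, np.ones((len(states), 1))])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W

def predict_next(W, s, a):
    """Predict the expected next state for one (state, action) pair."""
    x = np.concatenate([np.asarray(s, dtype=float),
                        np.asarray(a, dtype=float), [1.0]])
    return x @ W
```

Since the transitions here are deterministic, the minimizer of the ℓ2 loss coincides with the true next-state function whenever the model class can represent it.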
In Figure 6 (a), we perform an ablation study comparing FQI-AIR and MB with either the true endogenous model or a learned one.The results show that (1) MB with the true endogenous model performs slightly worse than FQI-AIR with a small data size; (2) FQI-AIR with a learned endogenous model performs worse than FQI-AIR with the true model, but outperforms IQL and MB without the true endogenous model; and (3) MB with a learned endogenous model performs worse than FQI-AIR with a learned endogenous model.This suggests that it is useful to separate the exogenous and endogenous state, especially when the endogenous model must be learned.
Next, we test FQI-AIR when learning the endogenous model only from a more limited dataset: a dataset based solely on one cycle.We collect a dataset from the hybrid car environment by running a random policy in one of the driving cycles for 500 episodes.This reflects a practical scenario that we can just run our vehicle in a closed area and still are

Conclusion
In this paper, we aim to understand whether and when offline RL is feasible for real world problems.It is known that, without extra assumptions on the MDP, offline RL requires an exponential number of samples to obtain a nearly optimal policy with high probability, even with a good data distribution.As a result, we must make assumptions on the MDP to make learning feasible, that is, possible with polynomial sample complexity.However, common assumptions can be impractical, as discussed in Section 3. Our goal in this paper is therefore to study an MDP property that (1) is realistic for several important real world problems and (2) makes offline RL feasible.
We introduced such an MDP property, which we call Action Impact Regularity (AIR).We developed an algorithm for MDPs satisfying AIR that (1) has strong theoretical guarantees on the suboptimality, without making assumptions about concentration coefficients or data coverage, (2) provides a simple way to select hyperparameters offline, without introducing extra hyperparameters, and (3) is simple to implement and computationally efficient.We showed empirically that the proposed algorithm significantly outperforms existing offline RL algorithms, across two simulated environments and two real world environments.

Data Collection Policies
To collect data for the learned policy, we first train a DQN algorithm for 1000 episodes with online interaction with the underlying environment.The DQN parameters (that is, learning rate and optimizer) are chosen based on estimated J(π, M ).After training, we collect data by running the trained policy.

Appendix B. Theoretical Analysis
We first provide definitions and lemmas that will be useful for proving our main theorems.Given q ∈ R^{S×A}, a probability measure β_h on S × A, and p > 0, define ‖q‖_{p,β_h}^p = Σ_{s∈S,a∈A} β_h(s,a) |q(s,a)|^p.Given q ∈ R^{S×A}, we also define the Bellman evaluation operator T^π for a given policy π.

Lemma 1.Given an MDP M, suppose the sequence of value functions (q_0, ..., q_{H−1}), with q_H = 0, satisfies ‖T q_{h+1} − q_h‖_{2,d^π_h} ≤ ε for all policies π. Let π̂ be the greedy policy with respect to (q_0, ..., q_{H−1}).Then J(π̂, M) ≥ J(π*_M, M) − (H + 1)Hε.

Proof.By the performance difference lemma (Lemma 5.2.1 of Kakade (2003), or Chen and Jiang (2019)), the suboptimality can be decomposed in terms of the state-action distributions d^π̂_h at horizon h induced by π̂; the first inequality in that decomposition follows because π̂_h is greedy with respect to q_h.Consider a state-action distribution β_0 that is induced by some policy.Then

‖q* − q_0‖_{2,β_0} ≤ ‖T q* − T q_1‖_{2,β_0} + ‖T q_1 − q_0‖_{2,β_0} ≤ ‖q* − q_1‖_{2,β_1} + ε,

where β_1(s', a') = Σ_{s,a} β_0(s,a) P(s,a,s') I{a' = argmax_{a''∈A} (q*(s', a'') − q_1(s', a''))²} is also induced by some policy.The first inequality follows from the fact that q* is the fixed point of the operator T.We can recursively apply the same argument to ‖q* − q_h‖_{2,β_h} for h > 0, and plugging the resulting inequality into the performance difference lemma gives the claim.

The simulation lemma was first introduced for the discounted setting in (Kearns & Singh, 2002); here we prove a modified version of the simulation lemma for finite horizon MDPs.

Proof.Since M is ε_air-AIR, we have D_TV(P_exo(s^exo, a), P_exo(s^exo, a')) = ½ ‖P_exo(s^exo, a) − P_exo(s^exo, a')‖_1 ≤ ε_air.Let e(s, a, s^end) = P_end(s, a, s^end) − P̂_end(s, a, s^end), so that D_TV(P_end(s, a), P̂_end(s, a)) is controlled by ε_p, where s^end_{h+1} ∼ P̂_end.

For the sampling error, our goal is to bound R(f̂) with high probability.This is similar to bounding the generalization error in the statistical learning literature.We follow
the proof technique of Lemma A.11 in Agarwal, Jiang, Kakade, and Sun (2019) to bound the excess risk.
Fix u = (s^end_h, a) ∈ U := S^end × A, and let x^u_i = (s^exo_h, s^end_h, a) and y^u_i = r(s^exo_h, s^end_h, a) + max_{a'} f(s^exo_{h+1}, s^end_{h+1}, a'), with (x^u_i, y^u_i) ∼ ν and f*(x^u_i) = E[y^u_i | x^u_i].Our goal is to bound the excess risk.We can bound the deviation from the mean for all f ∈ F with the one-sided Bernstein inequality, so that the resulting bound holds with probability 1 − ζ.We need this to hold for all u ∈ U; by the union bound, the guarantee holds simultaneously for all u ∈ U.Since f̂ is the empirical minimizer and f̂ ∈ F, the excess risk bound follows.

The model-based approach constructs an empirical MDP M_D = (S, A, P̂, r, H, ν).For the tabular setting we have ν(s) = 1

Figure 2 :
Figure 2: Comparison between algorithms in the optimal order execution problem and inventory management problem, for ε air = 0.The gray lines show the performance of the data collection policies.Results are averaged over 30 runs with error bars for one standard error.

Figure 3 :Figure 4 :
Figure 3: Comparison between algorithms in two simulation problems with ε air = 0.8.The gray lines show the performance of the data collection policies.The results are averaged over 30 runs with error bars representing one standard error.

Figure 5 :
Figure 5: Performance on real world datasets.For (a), the numbers represent the total selling price minus the average price.For (b), the numbers represent the fuel cost with a penalty for not maintaining a desired battery level.Results are averaged over 30 runs, shown with one standard error.

Figure 6 :
Figure 6: Experiment on the Prius dataset.We include IQL with 500 episodes as a baseline (purple line).Results are averaged over 30 runs, shown with one standard error.
The first equality follows from the Bellman equation.The second and third inequalities follow from the fact that v^π_M(s) is at most (H − h) r_max for s ∈ S_h.

Lemma 3. Let M = (S, A, P_exo, P_end, r, H) be an ε_air-AIR MDP and let M_b = (S, A, P̄_exo, P̂_end, r, H) with D_TV(P_end(s, a), P̂_end(s, a)) ≤ ε_p.Then, for any policy π, |J(π, M) − J(π, M_b)| ≤ v_max H (ε_air + ε_p).