On overfitting and asymptotic bias in batch reinforcement learning with partial observability

This paper stands in the context of reinforcement learning with partial observability and limited data. In this setting, we focus on the tradeoff between asymptotic bias (suboptimality with unlimited data) and overfitting (additional suboptimality due to limited data), and theoretically show that while potentially increasing the asymptotic bias, a smaller state representation decreases the risk of overfitting. Our analysis relies on expressing the quality of a state representation by bounding L1 error terms of the associated belief states. Theoretical results are empirically illustrated when the state representation is a truncated history of observations. Finally, we also discuss and empirically illustrate how using function approximators and adapting the discount factor may enhance the tradeoff between asymptotic bias and overfitting.


Introduction
This paper studies sequential decision-making problems that may be modeled as Markov Decision Processes (MDPs) but for which the state is partially observable. This class of problems is called Partially Observable Markov Decision Processes (POMDPs) (Sondik, 1978). Within this setting, we focus on decision-making strategies computed using Reinforcement Learning (RL). When the model of the environment is not available, RL approaches rely on observations gathered through interactions with the (PO)MDP, and, although some RL approaches have strong convergence guarantees, classic RL approaches are challenged by data scarcity. When acquisition of new observations is possible (the "online" case), data scarcity is gradually phased out using strategies balancing the exploration/exploitation (E/E) tradeoff. The scientific literature related to this topic is vast; in particular, Bayesian RL techniques (Ross, Pineau, Chaib-draa, & Kreitmann, 2011; Ghavamzadeh, Mannor, Pineau, & Tamar, 2015) offer an elegant way of formalizing the E/E tradeoff.
However, such E/E strategies are not applicable when the acquisition of new observations is no longer possible. In the pure "batch" setting, the task is to learn the best possible policy from a fixed set of transition samples (Farahmand, 2011; Lange, Gabel, & Riedmiller, 2012). Within this context, we propose to revisit RL as a learning paradigm that faces, similarly to supervised learning, a tradeoff between simultaneously minimizing two sources of error: an asymptotic bias and an overfitting error. The asymptotic bias (also simply called bias in the following) directly relates to the choice of the RL algorithm (and its parameterization). Any RL algorithm defines a policy class as well as a procedure to search within this class, and the bias may be defined as the performance gap between actual optimal policies and the best policies within the policy class considered. This bias does not depend on the set of observations. On the other hand, overfitting is an error term induced by the fact that only a limited amount of data is available to the algorithm. This overfitting error vanishes as the size and the quality of the dataset increase.
In this paper, we focus on studying the interactions between these two sources of error, in a setting where the state is partially observable. Due to this particular setting, one needs to build a state representation from the sequence of observations, actions and rewards in a trajectory (Singh, Jaakkola, & Jordan, 1994; Aberdeen, 2003). By increasing the cardinality of the state representation, the algorithm may be provided with a more informative representation of the POMDP, but at the price of simultaneously increasing the size of the set of candidate policies, thus also increasing the risk of overfitting. We analyze this tradeoff in the case where the RL algorithm provides an optimal solution to the frequentist-based MDP associated with the state representation (independently of the method used by the learning algorithm to converge towards that solution). Our novel analysis relies on expressing the quality of a state representation by bounding $L_1$ error terms of the associated belief states, thus introducing the concept of ε-sufficient statistics in the hidden state dynamics.
Experimental results illustrate the theoretical findings on a distribution of synthetic POMDPs as well as a large-scale POMDP with real-world data. In addition, we illustrate how the variance observed across different datasets (which is directly linked to the size of the dataset) leads to overfitting when the feature space is too large.
By extending known results for MDPs, we also briefly discuss and illustrate how using function approximators and adapting the discount factor play a role in the tradeoff between bias and overfitting when the state is partially observable. This has the advantage of providing the reader with an overview of key elements involved in the bias-overfitting tradeoff, specifically for the POMDP case.
The remainder of the paper is organized as follows. Section 2 formalizes POMDPs, (limited) sets of observations and state representations. Section 3 details the main contribution of this paper: an analysis of the bias-overfitting tradeoff in learning POMDPs in the batch setting. Section 4 empirically illustrates the main theoretical results, while Section 5 concludes with a discussion of the findings.

Formalization
We consider a discrete-time POMDP (Sondik, 1978) $M$ described by the 7-tuple $(\mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma)$ where
• $\mathcal{S}$ is a finite set of states $\{1, \ldots, N_S\}$,
• $\mathcal{A}$ is a finite set of actions $\{1, \ldots, N_A\}$,
• $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition function (set of conditional transition probabilities between states),
• $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathcal{R}$ is the reward function, where $\mathcal{R}$ is a continuous set of possible rewards in a range $R_{\max} \in \mathbb{R}^+$ (e.g., $[0, R_{\max}]$ without loss of generality),
• $\Omega$ is a finite set of observations $\{1, \ldots, N_\Omega\}$,
• $O : \mathcal{S} \times \Omega \to [0, 1]$ is a set of conditional observation probabilities, and
• $\gamma \in [0, 1)$ is the discount factor.

The initial state is drawn from an initial distribution $b(s_0)$. At each time step $t \in \mathbb{N}_0$, the environment is in a state $s_t \in \mathcal{S}$. At the same time, the agent receives an observation $\omega_t \in \Omega$ which depends on the state of the environment with probability $O(s_t, \omega_t)$ and the agent has to take an action $a_t \in \mathcal{A}$. Then, the environment transitions to state $s_{t+1} \in \mathcal{S}$ with probability $T(s_t, a_t, s_{t+1})$ and the agent receives a reward $r_t \in \mathcal{R}$ equal to $R(s_t, a_t, s_{t+1})$. In this paper, the conditional transition probabilities $T$, the reward function $R$ and the conditional observation probabilities $O$ are unknown. The only information available to the agent is the past experience it gathered while interacting with the POMDP. A POMDP is illustrated in Fig. 1.
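As an illustration of the interaction loop described above, the following minimal sketch samples one observable history from a tabular POMDP. The function name `sample_trajectory`, the array layout (`T[s, a, s']`, `R[s, a, s']`, `O[s, w]`), the flat history layout and the toy two-state model are assumptions made for this example only; they are not taken from the paper's experiments.

```python
import numpy as np

def sample_trajectory(T, R, O, b0, policy, n_steps, rng):
    """Sample one observable history (w_0, a_0, r_0, w_1, ..., w_n) from a tabular POMDP.

    T[s, a, s'] : transition probabilities, R[s, a, s'] : rewards,
    O[s, w]     : observation probabilities, b0[s]      : initial state distribution.
    `policy(history)` returns an action index (hypothetical signature).
    """
    n_states, n_obs = O.shape
    s = rng.choice(n_states, p=b0)                      # hidden initial state
    history = [int(rng.choice(n_obs, p=O[s]))]          # first observation w_0
    for _ in range(n_steps):
        a = policy(history)                             # agent only sees the history
        s_next = rng.choice(n_states, p=T[s, a])        # hidden transition
        r = float(R[s, a, s_next])                      # reward R(s_t, a_t, s_{t+1})
        w_next = int(rng.choice(n_obs, p=O[s_next]))    # observation of the new state
        history.extend([a, r, w_next])
        s = s_next
    return history

# Toy model (assumed values): 2 states, 2 actions, 2 observations.
rng = np.random.default_rng(0)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                                        # reward 1 when landing in state 1
O = np.array([[0.8, 0.2], [0.3, 0.7]])
b0 = np.array([1.0, 0.0])
uniform_policy = lambda history: int(rng.integers(2))   # uniform random actions
H = sample_trajectory(T, R, O, b0, uniform_policy, n_steps=5, rng=rng)
```

The hidden state `s` never appears in the returned history, which is the only information the agent is assumed to have access to.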

Figure 1: Graphical model of a POMDP.

Processing a History of Data
Policies considered in this paper are mappings from (an ordered set of) observation(s) into actions. A simple approach to build a space of candidate policies is to consider the set of mappings taking only the very last observation(s) as input (Whitehead & Ballard, 1990). However, in a POMDP setting, this leads to candidate policies that are likely not rich enough to capture the system dynamics, and thus suboptimal (Singh et al., 1994; Wolfe, 2006). The alternative is to use a history of previously observed features to better estimate the hidden state dynamics (McCallum, 1996; Littman & Sutton, 2002; Singh, James, & Rudary, 2004).
We denote by $\mathcal{H}_t = \Omega \times (\mathcal{A} \times \mathcal{R} \times \Omega)^t$ the set of histories observed up to time $t$ for $t \in \mathbb{N}_0$, and by $\mathcal{H} = \bigcup_{t=0}^{\infty} \mathcal{H}_t$ the space of all possible observable histories.
A straightforward approach is to take the whole history $H_t \in \mathcal{H}$ as input of candidate policies. However, taking too long a history may have several drawbacks. Indeed, increasing the size of the set of candidate optimal policies generally implies: (i) more computation to search within this set (Littman, 1994; McCallum, 1996) and (ii) an increased risk of including candidate policies that suffer from overfitting (see Section 3). In this paper, we are specifically interested in minimizing the latter overfitting drawback while keeping an informative state representation.
In this paper, we consider a mapping $\phi : \mathcal{H} \to \phi(\mathcal{H})$, where $\phi(\mathcal{H}) = \{\phi(H) \mid H \in \mathcal{H}\}$ is of finite cardinality $|\phi(\mathcal{H})|$. On the one hand, we will show that when φ discards information from the whole history, the state representation $\phi(H)$ that the agent uses to take decisions might depart from sufficient statistics, which can hurt performance. On the other hand, we will show that it is beneficial to use a mapping φ that has a low cardinality $|\phi(\mathcal{H})|$ to avoid overfitting. This can be intuitively understood since a mapping φ(·) induces an upper bound on the number of candidate policies: $|\Pi_{\phi(\mathcal{H})}| \le |\mathcal{A}|^{|\phi(\mathcal{H})|}$. In the following, we discuss this tradeoff formally.
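To give a feeling for how fast the bound on the number of candidate policies grows, the following sketch evaluates $|\mathcal{A}|^{|\phi(\mathcal{H})|}$ for an assumed example mapping that keeps the current observation and the last $h-1$ (observation, action) pairs. The function names and the cardinality formula (which ignores the shorter histories at the start of a trajectory) are assumptions made for illustration.

```python
def max_num_policies(n_actions: int, n_representations: int) -> int:
    """Upper bound |A|^{|phi(H)|} on the number of deterministic policies phi(H) -> A."""
    return n_actions ** n_representations

def truncated_history_cardinality(n_obs: int, n_actions: int, h: int) -> int:
    """|phi(H)| when phi keeps the current observation and the last h-1
    (observation, action) pairs (assumed example mapping, full-length windows only)."""
    return n_obs * (n_obs * n_actions) ** (h - 1)

for h in (1, 2, 3):
    size = truncated_history_cardinality(n_obs=5, n_actions=2, h=h)
    digits = len(str(max_num_policies(2, size)))
    print(f"h={h}: |phi(H)|={size}, policy bound has {digits} decimal digits")
```

Even for small observation and action sets, each additional step of history multiplies the size of the representation, so the policy-count bound grows doubly exponentially in $h$.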
Let us first introduce a notion of information on the latent hidden state $s$ through the notion of belief state (Cassandra, Kaelbling, & Littman, 1994), i.e. the distribution $b(\cdot \mid H)$ over hidden states given a history $H$.

Among all possible mappings φ, we are particularly interested in the ones that extract enough information from the history to accurately capture the corresponding belief state, with the notion of sufficient statistics (Kaelbling, Littman, & Cassandra, 1998; Aberdeen, Buffet, & Thomas, 2007). We thus define this notion in the context of the mapping φ.

Definition 2.3. In a POMDP $M$, a statistic $\phi(H)$ is a sufficient statistic if, $\forall s \in \mathcal{S}$: $P(s \mid \phi(H)) = b(s \mid H)$ for $H \in \mathcal{H}$. A mapping φ which provides sufficient statistics for all histories $H \in \mathcal{H}$ is called a sufficient mapping and is denoted as $\phi_0$.
However, as mentioned previously, a mapping φ may not capture enough information from a history to contain the same information as the belief state. One of the key notions on which our analysis relies is that of "approximately sufficient mappings", i.e. mappings whose corresponding belief state lies in an $L_1$-ball of radius ε centered on $b(\cdot \mid H)$.

1. Note that $s$ and $H$ are random variables and their exact distribution will depend on the context that is considered. For any given probability distribution $\mathcal{D}_H$ over histories, with $H \sim \mathcal{D}_H$, the probability $P(s \mid \phi(H))$ is the expectation of the state when $\phi(H)$ is observed.

Definition 2.4. In a POMDP $M$, a statistic $\phi_\epsilon(H)$ is an ε-sufficient statistic if it meets the following condition, with $\epsilon \ge 0$ and with the $L_1$ norm: $\lVert P(\cdot \mid \phi_\epsilon(H)) - b(\cdot \mid H) \rVert_1 \le \epsilon$ for $H \in \mathcal{H}$. A mapping that provides ε-sufficient statistics for all histories $H \in \mathcal{H}$ is called an ε-sufficient mapping and is denoted as $\phi_\epsilon$.
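The condition of Definition 2.4 can be checked numerically when the two belief vectors are available (which, in practice, requires knowing the POMDP). The sketch below assumes beliefs are given as NumPy probability vectors over the hidden states; the helper names `l1_belief_error` and `is_eps_sufficient` and the toy beliefs are hypothetical.

```python
import numpy as np

def l1_belief_error(belief_from_history, belief_from_statistic):
    """L1 distance between the belief given H and the belief given phi(H)."""
    diff = np.asarray(belief_from_history) - np.asarray(belief_from_statistic)
    return float(np.abs(diff).sum())

def is_eps_sufficient(pairs, eps):
    """pairs: iterable of (belief given H, belief given phi(H)) vectors, one per history H.
    Returns True if every pair is within an L1-ball of radius eps."""
    return all(l1_belief_error(b, b_phi) <= eps for b, b_phi in pairs)

# Toy beliefs over 3 hidden states (assumed values).
pairs = [
    (np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])),  # L1 error of about 0.2
    (np.array([0.1, 0.1, 0.8]), np.array([0.1, 0.1, 0.8])),  # identical beliefs
]
```

With these toy values, the mapping would qualify as ε-sufficient for ε = 0.25 but not for ε = 0.1, since the first pair differs by about 0.2 in $L_1$ norm.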

Working with a Limited Dataset
Let $\mathcal{M}(\mathcal{S}, \mathcal{A}, \Omega, \gamma)$ be a set of POMDPs with fixed $\mathcal{S}$, $\mathcal{A}$, $\Omega$, and $\gamma$. For any POMDP $M(T, R, O) \in \mathcal{M}$, we denote by $D_{M,\pi_s,N_{tr},N_l}$ a random dataset generated according to a probability distribution $\mathcal{D}_{M,\pi_s,N_{tr},N_l}$ over the set of $N_{tr}$ trajectories of length $N_l$. One such trajectory is defined as the observable history $H_{N_l} \in \mathcal{H}_{N_l}$ obtained in $M$ when starting from $s_0$ and following a stochastic sampling policy $\pi_s$ that ensures a non-zero probability of taking any action given an observable history $H \in \mathcal{H}$. For simplicity, we denote $D_{M,\pi_s,N_{tr},N_l}$ simply as $D \sim \mathcal{D}_M$. For the purpose of the analysis, we also introduce the asymptotic dataset $D_\infty = D_{M,\pi_s,N_{tr} \to \infty,N_l \to \infty}$ that would be theoretically obtained in the case where one could generate an infinite number of observations ($N_{tr} \to \infty$ and $N_l \to \infty$).
In this paper, the algorithm cannot generate additional data. The challenge is to determine a high-performance policy (in the actual environment) while having only access to a fixed dataset D. We formalize this hereafter.

Assessing the Performance of a Policy
Let us consider stationary and deterministic control policies $\pi : \phi(\mathcal{H}) \to \mathcal{A}$ with $\pi \in \Pi$. Any particular choice of φ induces a particular definition of the policy space Π. We introduce $V^\pi_M(\phi(H))$ with $H \in \mathcal{H}$ as the expected return obtained over an infinite time horizon when the system is controlled using policy π in the POMDP $M$. For any given distribution $\mathcal{D}_H$ over histories, this is defined as:

$$V^\pi_M(\phi(H)) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, \phi(H_t) = \phi(H), \pi, b(s_0)\Big],$$

where $P(s_{t+1} \mid s_t, a_t) = T(s_t, a_t, s_{t+1})$ and $r_t = R(s_t, a_t, s_{t+1})$. We also define $\pi^*$ as an optimal policy in $M$:

$$\pi^* \in \operatorname*{argmax}_{\pi : \phi_0(\mathcal{H}) \to \mathcal{A}} V^\pi_M(\phi_0(H_0)),$$

where $H_0$ is taken out of the distribution of initial observations (compatible with the distribution $b(s_0)$ of initial states through the conditional observation probabilities).

Bias-overfitting in RL with Partial Observability
In this section, we study the performance difference (or gap) between the expected return that can be obtained following the policy built from limited data and the highest possible expected return that we would obtain if the algorithm had access to the POMDP parameters.
In particular, we analyze how this performance gap can be decomposed into the sum of two terms: a term related to an asymptotic bias (suboptimality with unlimited data) and a term due to overfitting (additional suboptimality due to limited data).

Importance of the Feature Space
To study the importance of the feature space, let us assume that the policies built from limited data are optimal according to frequentist statistics, which allows removing from the analysis how the RL algorithm converges. In order to define the optimal policy according to frequentist statistics, let us first introduce a frequentist-based (augmented) MDP from the dataset $D$.

Definition 3.1. The frequentist-based augmented MDP $\hat{M}_D = (\Sigma, \mathcal{A}, \hat{T}, \hat{R}, \Gamma)$ is defined with:
• the state space: $\Sigma = \phi(\mathcal{H})$,
• the action space: $\mathcal{A}$, identical to that of the POMDP,
• the estimated transition function: for $\sigma, \sigma' \in \Sigma$ and $a \in \mathcal{A}$, $\hat{T}(\sigma, a, \sigma')$ is the number of times we observe the transition $(\sigma, a) \to \sigma'$ divided by the number of times we observe $(\sigma, a)$²,
• the estimated reward function: for $\sigma, \sigma' \in \Sigma$ and $a \in \mathcal{A}$, $\hat{R}(\sigma, a, \sigma')$ is the mean of the rewards observed for the tuple $(\sigma, a, \sigma')$³, and
• the discount factor $\Gamma \le \gamma$.
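The counting estimates of the frequentist-based MDP can be sketched as follows. This is a minimal illustration, assuming the dataset has already been converted into a list of `(sigma, a, r, sigma_next)` tuples with hashable state representations; the function name `estimate_mdp` is hypothetical. The two fallbacks for unseen entries follow the arbitrary conventions stated in the footnotes (uniform transition, dataset-wide mean reward).

```python
from collections import defaultdict

def estimate_mdp(transitions, states):
    """Estimate (T_hat, R_hat) from a list of (sigma, a, r, sigma_next) tuples.

    Returns two functions T_hat(sigma, a, sigma_next) and R_hat(sigma, a, sigma_next).
    """
    counts = defaultdict(lambda: defaultdict(int))   # counts[(sigma, a)][sigma_next]
    reward_sums = defaultdict(float)                 # summed rewards per (sigma, a, sigma_next)
    total_reward = 0.0
    for sigma, a, r, sigma_next in transitions:
        counts[(sigma, a)][sigma_next] += 1
        reward_sums[(sigma, a, sigma_next)] += r
        total_reward += r
    mean_reward = total_reward / len(transitions)

    def T_hat(sigma, a, sigma_next):
        seen = counts.get((sigma, a))
        if not seen:                                 # (sigma, a) never observed: uniform
            return 1.0 / len(states)
        return seen.get(sigma_next, 0) / sum(seen.values())

    def R_hat(sigma, a, sigma_next):
        seen = counts.get((sigma, a))
        n = seen.get(sigma_next, 0) if seen else 0
        if n == 0:                                   # tuple never observed: dataset mean
            return mean_reward
        return reward_sums[(sigma, a, sigma_next)] / n

    return T_hat, R_hat

# Toy dataset (assumed values) over two state representations "x" and "y".
transitions = [("x", 0, 1.0, "y"), ("x", 0, 0.0, "x"), ("x", 0, 1.0, "y"), ("y", 1, 6.0, "x")]
T_hat, R_hat = estimate_mdp(transitions, states=["x", "y"])
```

Here `T_hat("x", 0, "y")` is 2/3 because two of the three observed transitions from `("x", 0)` lead to `"y"`, while the never-observed pair `("y", 0)` falls back to the uniform value 1/2.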
As long as the mapping φ is a sufficient mapping (thus denoted $\phi_0$), the asymptotic frequentist-based MDP (when unlimited data is available) actually gathers the relevant information from the actual POMDP. Indeed, when the POMDP is known (i.e. $T$, $R$, $O$ are known), the knowledge of $H_t$ allows one to obtain the belief state $b(s_t \mid H_t)$, calculated recursively thanks to the Bayes rule based on $b(s_t \mid H_t) = P(s_t \mid \omega_t, a_t, b(s_{t-1} \mid H_{t-1}))$. It is then possible to define, from the history $H \in \mathcal{H}$ and for any action $a \in \mathcal{A}$, the expected immediate reward as well as a transition function into the next observation $\omega'$:

$$\mathbb{E}[r \mid H, a] = \sum_{s \in \mathcal{S}} b(s \mid H) \sum_{s' \in \mathcal{S}} T(s, a, s')\, R(s, a, s'), \qquad P(\omega' \mid H, a) = \sum_{s \in \mathcal{S}} b(s \mid H) \sum_{s' \in \mathcal{S}} T(s, a, s')\, O(s', \omega').$$

In the frequentist approach, this information is estimated directly from interactions with the POMDP in $\hat{R}$ and $\hat{T}$ without any explicit knowledge of the POMDP parameters. We introduce $V^\pi_{\hat{M}_D}(\sigma)$ with $\sigma \in \Sigma$ as the expected return obtained over an infinite time horizon when the system is controlled using a policy $\pi : \Sigma \to \mathcal{A}$ in the augmented decision process $\hat{M}_D$:

$$V^\pi_{\hat{M}_D}(\sigma) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \Gamma^k \hat{r}_{t+k} \,\Big|\, \sigma_t = \sigma, \pi\Big],$$

where the transitions follow $\hat{T}$ and the rewards $\hat{r}_t$ are given by $\hat{R}$. A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states. In an MDP, there is always at least one policy that is better than or equal to all other policies and this is an optimal policy (Sutton & Barto, 1998). In the augmented MDP $\hat{M}_D$, we denote the optimal policy as $\pi_{D,\phi}$ and we also call it the frequentist-based policy. Let us now decompose the error of using a frequentist-based policy $\pi_{D,\phi}$ in the actual POMDP:

$$\mathbb{E}_{D \sim \mathcal{D}_M}\big[V^{\pi^*}_M(\phi_0(H)) - V^{\pi_{D,\phi}}_M(\phi(H))\big] = \underbrace{V^{\pi^*}_M(\phi_0(H)) - V^{\pi_{D_\infty,\phi}}_M(\phi(H))}_{\text{bias}} + \underbrace{\mathbb{E}_{D \sim \mathcal{D}_M}\big[V^{\pi_{D_\infty,\phi}}_M(\phi(H)) - V^{\pi_{D,\phi}}_M(\phi(H))\big]}_{\text{overfitting}}. \tag{1}$$

The bias is a function of the dataset $D_\infty$ (itself a function of $\pi_s$) and of the frequentist-based policy $\pi_{D_\infty,\phi}$ (a function of φ and Γ). The overfitting is due to the finite dataset $D$ (a function of $\pi_s$, $N_l$, $N_{tr}$) in the context of the frequentist-based policy $\pi_{D,\phi}$ (a function of φ and Γ). The term bias actually refers to an asymptotic bias when the size of the dataset tends to infinity, while the term overfitting refers to the expected suboptimality due to the finite size of the dataset (and thus due to the variance in the estimated transition function and reward function).

2. If any $(\sigma, a)$ has never been encountered in a dataset, we arbitrarily set $\hat{T}(\sigma, a, \sigma') = 1/|\Sigma|$, $\forall \sigma'$. The theoretical results that follow are independent of how this case is treated.
3. If any $(\sigma, a, \sigma')$ has never been encountered in a dataset, we arbitrarily set $\hat{R}(\sigma, a, \sigma')$ to the average of rewards observed over the whole dataset $D$. The theoretical results that follow are independent of how this case is treated.
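The analysis is independent of how the optimal policy of the estimated MDP is computed. As one possible illustration, the sketch below obtains a frequentist-based policy by value iteration, assuming the estimates are stored as arrays `T_hat[sigma, a, sigma']` and `R_hat[sigma, a, sigma']`. The function name `frequentist_policy`, the stopping rule and the toy two-state model are assumptions made for this example.

```python
import numpy as np

def frequentist_policy(T_hat, R_hat, discount, n_iter=1000, tol=1e-10):
    """Greedy policy and value function of the estimated MDP via value iteration.

    T_hat, R_hat: arrays of shape (n_sigma, n_actions, n_sigma); discount: Gamma in [0, 1).
    """
    n_sigma = T_hat.shape[0]
    V = np.zeros(n_sigma)
    for _ in range(n_iter):
        # Q[sigma, a] = sum_{sigma'} T_hat[sigma, a, sigma'] * (R_hat[...] + Gamma * V[sigma'])
        Q = np.einsum("ijk,ijk->ij", T_hat, R_hat + discount * V[None, None, :])
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return Q.argmax(axis=1), V

# Toy estimated MDP (assumed values): action 0 stays, action 1 switches state,
# and a reward of 1 is received whenever the next state is state 1.
T_hat = np.zeros((2, 2, 2))
T_hat[0, 0, 0] = T_hat[1, 0, 1] = 1.0
T_hat[0, 1, 1] = T_hat[1, 1, 0] = 1.0
R_hat = np.zeros((2, 2, 2))
R_hat[:, :, 1] = 1.0
policy, V = frequentist_policy(T_hat, R_hat, discount=0.9)
```

In this toy model the greedy policy switches to state 1 and then stays there, so both states have value $1/(1-\Gamma) = 10$.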
Selecting the feature space φ(H) carefully allows building a class of policies that have the potential to accurately capture information from data (low bias), but also generalize well (low overfitting). On the one hand, using too many non-informative features will increase overfitting, as stated in Theorem 3 below. On the other hand, a mapping φ(H) that discards useful available information will suffer an asymptotic bias, as stated in Theorem 1 below (arbitrarily large depending on the POMDP and on the features discarded).
We start by providing a bound on the bias, which is an original result based on the belief states via the ε-sufficient statistic.

Theorem 1. "Bound on the bias":
Let $M$ be a POMDP described by the 7-tuple $(\mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma)$. Let $\hat{M}_{D_\infty}$ be an augmented MDP $(\Sigma, \mathcal{A}, \hat{T}, \hat{R}, \Gamma = \gamma)$ estimated, according to Definition 3.1, from a dataset $D_\infty$. Then, for any ε-sufficient mapping $\phi = \phi_\epsilon$, the asymptotic bias can be bounded as follows:

Proof. We consider the frequentist-based MDP $\hat{M}_{D_\infty,\phi_0}(\Sigma_0, \mathcal{A}, \hat{T}, \hat{R}, \Gamma = \gamma)$. For $H \in \mathcal{H}$ and $a \in \mathcal{A}$, let us define
where the reward $\hat{R}$
Then the main part of the proof is to demonstrate Proposition 2 below. From there, by applying Lemma 1 by Abel, Hershkowitz, and Littman (2016), we have:

By further noticing that, when starting in $s_0$, $\hat{M}_{D_\infty,\phi_0}$ and $M$ provide an identical value function for a given policy $\pi_{D,\phi}$ and that $\pi_{D_\infty,\phi_0} \sim \pi^*$, i.e. $V^{\pi^*}_M = V^{\pi_{D_\infty,\phi_0}}_M$, the theorem follows.
Remark. As compared to Hutter (2014) and Abel et al. (2016), this bound relates directly to the capacity of the mapping φ(H) to retrieve sufficient information on the latent hidden state. As compared to PBVI (Pineau, Gordon, & Thrun, 2003) and similar approaches, we do not make the assumption that T , R and O are known and, as such, they need to be estimated from data.
We now provide Proposition 2, which is the key result required in the proof of Theorem 1.

Proposition 2.
Let $\phi_\epsilon$ be an ε-sufficient mapping, and let $\phi_0$ be a sufficient mapping. Then, for any

Proof. For this proposition, we rely on the fact that since $\phi_\epsilon(H^{(1)}) = \phi_\epsilon(H^{(2)})$, we are able to bound the $L_1$ error terms of the associated belief states of $H^{(1)}$ and $H^{(2)}$. This is illustrated in Figure 2. From that bound, we then present two different ways of independent interest to prove Proposition 2. The details of the proofs are given in Appendix A.1.

Figure 2: Illustration of the $\phi_\epsilon$ mapping and the belief for histories $H^{(1)}$ and $H^{(2)}$.

The first proof makes use of a tree of possible future observations, rewards and corresponding actions given a policy π, and we show that when starting from $H^{(1)}$, $H^{(2)}$ such that $\phi_\epsilon(H^{(1)}) = \phi_\epsilon(H^{(2)})$, the bound holds.
We also provide an alternative proof that makes use of the formalism of the bisimulation metric (Ferns, Panangaden, & Precup, 2004) along with the data processing inequality. Intuitively, the proof relies on the fact that the histories H (1) and H (2) are close according to a bisimulation metric that measures the behavioral similarity (future rewards and observations). The data processing inequality is used for finding that distance as a function of the L1 error terms of the associated belief states.
We now provide a bound on the overfitting error that monotonically grows with $|\phi(\mathcal{H})|$. Theorem 3 shows that using a large set of features allows a larger policy class, hence potentially leading to a stronger drop in performance when the available dataset $D$ is limited (the bound decreases proportionally to $1/\sqrt{n}$). A theoretical analysis in the context of MDPs with a finite dataset was performed by Jiang, Kulesza, and Singh (2015a).
. Then the overfitting due to using the frequentist-based policy $\pi_{D,\phi}$ instead of $\pi_{D_\infty,\phi}$ in the POMDP $M$ can be bounded as follows: with probability at least $1 - \delta$.
Proof. The proof of Theorem 3 is deferred to Appendix A.2.
Overall, Theorems 1 and 3 can help choose a good state representation for POMDPs as they provide bounds on the two terms that appear in the bias-overfitting decomposition of Equation 1. For example, an additional feature in the mapping φ has an overall positive effect only if it provides a significant increase of information on the belief state (i.e. if it allows one to obtain a more accurate knowledge of the underlying state of the MDP defined by T and R when given φ(H)). This increase of information must be significant enough to compensate for the additional risk of overfitting when choosing a large cardinality of φ(H). Note that one could combine the two bounds to theoretically define an optimal choice of the state representation with lower bound guarantees regarding the bias-overfitting tradeoff. In practice, as the two bounds are loose, other techniques described in Section 3.3 are usually more useful.
Related work
Partial observability is very common in real-world domains and, though there have been many works on state abstraction and homomorphisms in the MDP setting (Ravindran & Barto, 2004; Ferns et al., 2004), there have been relatively few in the POMDP setting. One related work in POMDPs by Castro, Panangaden, and Precup (2009) has discussed the parallel between the notion of bisimulation and a notion of trace equivalence, under which states are considered equivalent if they generate the same conditional probability distributions over observation sequences (where the conditioning is on action sequences).
In this paper, we introduce the definition of ε-sufficient statistics in POMDPs and derive bounds for the bias based on this property. An insightful part of the proof of Proposition 2 is to use the bisimulation metric (Ferns et al., 2004) and the data processing inequality. Note that the bisimulation metric may also be used to take into account how the errors on the belief states may have less of an impact provided that the hidden states affected by these errors are close according to the bisimulation metric. In case there is some knowledge on the underlying dynamics, one could also extend the notion of the bisimulation metric to allow certain distinct actions to be essentially equivalent (Arun-Kumar, 2006). In that context, Taylor, Precup, and Panangaden (2009) have generalized the notion of the bisimulation metric to a lax bisimulation metric, which may take into account some symmetries and special structures.

Discussion on Function Approximators
As described earlier, a straightforward mapping φ(·) may be obtained by discarding some features from the observable history. In addition, the theoretical work developed in this paper can be useful to understand how a specific design of deep neural networks may work well (or not). Indeed, deep neural networks can be seen as a composition of many learnable mappings (constrained by the design of the neural network) and we provide bounds that depend on the property of a given mapping φ of the inputs (e.g., the first layers of a deep Q-network). If there is, for instance, an attention mechanism in the first layers of a deep neural network, the mapping made up of those first layers can be analyzed through the bounds developed in our work (e.g., it should not discard important observations, which would prevent the mapping from being close to some sufficient statistics). As another example, our work helps explain why basic recurrence in a neural network may not be well-suited for POMDPs that require a long history of observations for approaching some sufficient statistics. Indeed, basic recurrent cells are known to have difficulty conveying long-term dependencies in sequences (i.e. features from the first time steps in a long time series), and LSTMs (Hochreiter & Schmidhuber, 1997) or other variants should be preferred.
Note that we could add a theorem showing that using, for instance, Rademacher complexity would allow providing other bounds (potentially tighter) than Theorem 3. However, such a bound is usually of little interest in practice, specifically concerning deep learning, because the bounds based on complexity measures fail to provide insights on the generalization capabilities of neural networks (Zhang, Bengio, Hardt, Recht, & Vinyals, 2016). Indeed, it has been empirically demonstrated that deep neural networks have strong generalization capabilities even with a high number of parameters (hence a potentially large complexity). In other words, a very loose bound on the overfitting error due to a large number of parameters may not lead to a performance drop in practice.

Selection of the Parameters with Validation or Cross-Validation to Balance the Bias-Overfitting Tradeoff
In the batch setting, the selection of the policy parameters to effectively balance the bias-overfitting tradeoff can be done similarly to that in supervised learning (e.g., cross-validation) as long as the performance criterion can be estimated from a subset of the trajectories from the dataset $D$ not used during training (validation set). One possibility is to fit an MDP model from data via the frequentist approach (or regression), and evaluate the policy against the model (with Monte-Carlo simulations). Another approach is to use the idea of importance sampling (Precup, 2000). A mix of the two approaches has also been developed (Jiang & Li, 2016; Thomas & Brunskill, 2016).
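As a rough illustration of the importance sampling idea, the following sketch implements the ordinary (trajectory-wise) importance sampling estimator of the return of a deterministic policy from validation trajectories. It is a generic textbook estimator rather than the specific procedure of the cited works; the function name, the trajectory layout as `(sigma, a, r)` tuples and the uniform sampling policy in the example are assumptions.

```python
import numpy as np

def importance_sampling_return(trajectories, target_policy, behavior_prob, discount):
    """Ordinary importance sampling estimate of the discounted return of a
    deterministic target policy, from trajectories collected by a sampling policy.

    trajectories : list of lists of (sigma, a, r) tuples (validation set)
    target_policy(sigma) -> action chosen by the evaluated policy
    behavior_prob(sigma, a) -> probability that the sampling policy took a in sigma
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (sigma, a, r) in enumerate(traj):
            # A deterministic target policy puts probability 1 on its own action.
            pi_prob = 1.0 if target_policy(sigma) == a else 0.0
            weight *= pi_prob / behavior_prob(sigma, a)
            ret += discount ** t * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Toy validation set (assumed values): one-step trajectories under a uniform
# sampling policy over 2 actions; the evaluated policy always picks action 1.
val_trajectories = [[(0, 1, 1.0)], [(0, 0, 5.0)]]
estimate = importance_sampling_return(
    val_trajectories,
    target_policy=lambda sigma: 1,
    behavior_prob=lambda sigma, a: 0.5,
    discount=0.9,
)
```

Trajectories that disagree with the evaluated policy receive a zero weight, which is why this estimator tends to have a high variance for long trajectories and motivates the mixed approaches mentioned above.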
Importance of the discount factor Γ used in the training phase: Besides the elements related to feature selection and function approximators, artificially lowering the discount factor can also be used to improve the performance of the policy when solving MDPs with limited data (Petrik & Scherrer, 2009; Jiang, Kulesza, Singh, & Lewis, 2015b).
In the partially observable setting, these results can be transferred to the frequentist-based MDP $(\Sigma, \mathcal{A}, \hat{T}, \hat{R}, \Gamma)$ by choosing $\Gamma < \gamma$, which introduces a bias but reduces the risk of overfitting.

Experiments
This section provides empirical illustrations of the theoretical results both on synthetic POMDPs and on a large-scale POMDP in the context of smartgrids, with real-world data.
On the synthetic POMDPs, we first illustrate the main theoretical results of this paper related to the state representation of POMDPs and we also illustrate the use of function approximators and the training discount factor Γ. On the large-scale POMDP, we illustrate that an efficient feature selection process can be useful, even when used in addition to function approximators and a biased discount factor.

Synthetic POMDPs
In order to be representative of a diversity of environments, we provide results on a distribution of POMDPs.

Protocol
We randomly sample $N_P$ POMDPs such that $N_S = 5$, $N_A = 2$ and $N_\Omega = 5$ (except when stated otherwise) from a distribution $\mathcal{P}$ that we refer to as Random POMDP. The distribution $\mathcal{P}$ is fully determined by specifying a distribution over the set of possible transition functions $T(\cdot, \cdot, \cdot)$, a distribution over the set of reward functions $R(\cdot, \cdot, \cdot)$, and a distribution over the set of possible conditional observation probabilities $O(\cdot, \cdot)$. Random transition functions $T(\cdot, \cdot, \cdot)$ are drawn by assigning, for each entry $(s, a, s')$, a zero value with probability 3/4, and, with probability 1/4, a non-zero entry with a value drawn uniformly in

For each generated POMDP $P \sim \mathcal{P}$, we generate 20 datasets $D \sim \mathcal{D}_P$ where $\mathcal{D}_P$ is a probability distribution over all possible sets of $n$ trajectories ($n \in [2, 5000]$), where each trajectory is made up of a history $H_{100}$ of 100 time steps, when starting from an initial state $s_0 \in \mathcal{S}$ and taking uniform random decisions. Each dataset $D$ induces a policy $\pi_{D,\phi}$, and we want to evaluate the expected return of this policy while discarding the variance related to the stochasticity of the transitions, observations and rewards. To do so, policies are tested with 1000 rollouts of the policy. For each POMDP $P$, we are then able to get an estimate of the average score $\mu_P$ which is defined as:

We are also able to get an estimate of a parametric variance $\sigma^2_P$ defined as:

History Processing
We show experimentally that any additional feature from the history $H_t$ is likely to reduce the asymptotic bias, but may also increase the overfitting. For any history length $h$, we consider the mapping $\phi_h$ that extracts the current observation and the last $h-1$ (observation, action) tuples. In the experiments, we compare the policies $\pi_{D,\phi_h}$ for $h = 1, 2, 3$.
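The truncated-history mapping described above can be sketched as follows, assuming a history is stored as a flat sequence $(\omega_0, a_0, r_0, \omega_1, \ldots, a_{t-1}, r_{t-1}, \omega_t)$. The function name `phi_h`, the tuple encoding of the output and the handling of histories shorter than $h$ (keeping whatever is available) are assumptions made for illustration.

```python
def phi_h(history, h):
    """Keep the current observation and the last h-1 (observation, action) pairs.

    history: flat sequence (w_0, a_0, r_0, w_1, ..., a_{t-1}, r_{t-1}, w_t).
    Rewards are discarded; shorter histories keep as many pairs as are available.
    """
    observations = history[0::3]
    actions = history[1::3]
    current = observations[-1]
    k = min(h - 1, len(actions))                      # number of past pairs to keep
    past_obs = observations[len(observations) - 1 - k:len(observations) - 1]
    past_act = actions[len(actions) - k:]
    return tuple(zip(past_obs, past_act)) + (current,)

# Example history with three observations (3, 4, 2) and two actions (0, 1).
example_history = [3, 0, 1.0, 4, 1, 0.0, 2]
print(phi_h(example_history, 1))   # (2,)
print(phi_h(example_history, 2))   # ((4, 1), 2)
print(phi_h(example_history, 3))   # ((3, 0), (4, 1), 2)
```

Returning a hashable tuple makes the output directly usable as a state key σ when counting transitions in the frequentist-based MDP.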
The values $\mathbb{E}_{P \sim \mathcal{P}}\, \mu_P$ and $\mathbb{E}_{P \sim \mathcal{P}}\, \sigma_P$ are displayed in Figure 3. One can observe that a small set of features (small history) appears to be a better choice (in terms of total bias) when the dataset is small (only a few trajectories). With an increasing number of trajectories, the optimal choice in terms of number of features ($h$ = 1, 2 or 3) also increases. In addition, one can also observe that the expected variance of the score decreases as the number of samples increases. As the variance decreases, the risk of overfitting also decreases, and it becomes possible to target a larger policy class (using a larger feature set).

Figure 3: Values of $\mathbb{E}_{P \sim \mathcal{P}}\, \mu_P$ and $\mathbb{E}_{P \sim \mathcal{P}}\, \sigma_P$ computed from a sample of $N_P = 50$ POMDPs drawn from $\mathcal{P}$. The bars are used to represent the variance observed when dealing with different datasets drawn from a distribution; note that this is not a usual error bar.
The overfitting error is also linked to the variance of the value function estimates in the frequentist-based MDP. When these estimates have a large variance, an overfitting term appears because of a higher chance of picking one of the suboptimal policies, as illustrated in Figure 4.

Function Approximator and Discount Factor
We also illustrate the effect of using function approximators on the bias-overfitting tradeoff. To do so, we process the output of the state representation φ(·) into a deep Q-learning scheme (technical details are given in Appendix A.3). We can see in Figure 5 that deep Q-learning policies suffer less overfitting as compared to the frequentist-based approach (lower degradation of performance in the low-data regime) even though using a large set of features still leads to more overfitting than a small set of features. We also see that deep Q-learning policies do not introduce an important asymptotic bias (identical performance when a lot of data is available) because the neural network architecture is rich enough. Note that the variance is slightly larger than in Figure 3, and does not vanish to 0 with additional data. This is due to the additional stochasticity induced when building the Q-value function with neural networks (note that when performing the same experiments while taking the average recommendation of several Q-value functions, this variance decreases with the number of Q-value functions).
Finally, we empirically illustrate in Figure 6 the effect of modifying the discount factor Γ. When the training discount factor is lower than the one used in the actual POMDP ($\Gamma < \gamma$), there is an additional bias term, while when a high discount factor is used (Γ close to 1) with a limited amount of data, overfitting increases. In our experiments, the influence of the discount factor is more subtle as compared to the impact of the state representation and the function approximator. The influence is nonetheless clear: it is better to have a low discount factor when only a few data points are available, and it is better to have a high discount factor when a lot of data is available, which is in line with previous analyses for MDPs (Jiang et al., 2015b).

Figure 4: Policies $\pi_1$, $\pi_2$, $\pi_3$; the best performing (in green), the worst performing (in red) and the median performing were selected in a set of 50 policies built when $D \sim \mathcal{D}$ has 5 trajectories of data from the actual POMDP. On the actual POMDP, the expected returns are $V^{\pi_1}_M = 28$, $V^{\pi_2}_M = 33$, $V^{\pi_3}_M = 42$ (in general, these values need not be the same as the expected value of the probability distribution in the two graphs).

Real-world Application in Smartgrids
The microgrid model considered here was introduced by François-Lavet, Taralla, Ernst, and Fonteneau (2016) and we next describe the main elements.

Microgrid Benchmark
The microgrid is powered by photovoltaic (PV) panels combined with both long-term storage (hydrogen-based) and short-term storage (such as, for instance, LiFePO$_4$ batteries). These two types of storage aim at fulfilling, at best, the demand by addressing the seasonal and daily fluctuations of solar irradiance. Distinguishing short- from long-term storage is mainly a question of cost: batteries are too expensive to be used for addressing seasonal variations.

Figure (caption truncated): POMDPs drawn from $\mathcal{P}$ with $N_S = 8$ and $N_\Omega = 8$ ($h = 3$). The bars are used to represent the variance observed when dealing with different datasets drawn from a distribution; this is not a usual error bar.
Operating the microgrid is formalized as a POMDP over discrete time steps of one hour. The observations at each time step are made up of the consumption ($c_t$), the production ($\psi_t$) and the level of energy in the battery ($s^B_t$). The mapping φ(·) is defined such that $\phi(H_t)$ gathers the $h_c$ most recent consumption values, the $h_p$ most recent production values and the battery level $s^B_t$, where $h_c = h_p$ are the lengths of the time series considered for the consumption and production, respectively. The instantaneous reward signal $r_t$ is obtained by adding the revenues generated by the hydrogen production $r^{H_2}$ and the penalties $r^-$ due to the value of loss load:
$$r_t = r^{H_2}(a_t) + r^-(a_t, d_t),$$
where $d_t$ denotes the net electricity demand, which is the difference between the local consumption $c_t$ and the local production of electricity $\psi_t$. The penalty $r^-$ is proportional to the total amount of energy that was not supplied to meet the demand, with the cost incurred per kilowatt-hour (kWh) not supplied within the microgrid set to 2 €/kWh (corresponding to a value of loss load). The revenue (or cost) generated by the hydrogen production $r^{H_2}$ is proportional to the total amount of energy transformed (or consumed) in the form of hydrogen, where the revenue (or cost) per kWh of hydrogen produced (or used) is set to 0.1 €/kWh.
A synthetic residential consumption profile is considered with a daily consumption of 18 kWh. The PV production profile comes from actual data of a residential customer located in Belgium (average production changes by a factor of about 1:5 between summer and winter). A fixed sizing of the microgrid is considered. The size of the battery is $x^B$ = 15 kWh, the instantaneous power of the hydrogen storage is $x^{H_2}$ = 1.1 kW and the peak power generation of the PV installation is $x^{PV}$ = 12 kWp (kWp stands for kilowatt-peak, which is the maximum electric power that can be supplied by the PV panels).
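As a rough illustration of the reward described above, the following Python sketch adds the hydrogen revenue (or cost) to the loss-load penalty for one hourly time step. Only the two prices come from the text; the function name, its arguments and the sign convention for the hydrogen energy are assumptions made for this example, and the battery dynamics that determine the unsupplied energy are not modeled.

```python
# Prices quoted in the text, in euros per kWh.
LOSS_LOAD_COST = 2.0   # cost per kWh of demand that is not supplied
HYDROGEN_PRICE = 0.1   # revenue (or cost) per kWh of hydrogen produced (or used)


def reward(hydrogen_kwh: float, unsupplied_kwh: float) -> float:
    """Return r_t = r^{H2} + r^- for one time step of one hour.

    hydrogen_kwh:   energy converted into hydrogen during the step
                    (assumed positive when hydrogen is produced and
                    negative when hydrogen is consumed).
    unsupplied_kwh: energy demand that could not be met (non-negative).
    """
    if unsupplied_kwh < 0:
        raise ValueError("unsupplied energy cannot be negative")
    r_h2 = HYDROGEN_PRICE * hydrogen_kwh          # revenue or cost
    r_minus = -LOSS_LOAD_COST * unsupplied_kwh    # penalty, always <= 0
    return r_h2 + r_minus
```

With these prices, one kWh of unmet demand costs as much as the revenue from 20 kWh of hydrogen, so any loss load dominates the reward signal.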
The action $a_t$ is an element of the action space made up of three discrete actions: $A = \{a^{(0)}, a^{(1)}, a^{(2)}\}$. The possible actions relate to whether the microgrid creates hydrogen from electricity, creates electricity from hydrogen or leaves the long-term storage unused. The battery adapts to store excess electrical power (except when the battery is full) while avoiding any loss load (except when the battery is empty, which leads to negative rewards).
Note that in this specific application, using a history of observations from the previous time steps provides information on the latent features (of the state of the POMDP) such as the time of the day, the weather, the season (PV production and consumption time series are conditionally dependent on these latent features).
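A truncated history of this kind can be sketched as follows. This is only an illustration: the function name `phi`, the tuple layout, and the choice to take the most recent values up to and including the last available entry are assumptions, not the exact indexing used by the authors.

```python
from typing import List, Sequence, Tuple


def phi(consumption: Sequence[float],
        production: Sequence[float],
        battery_level: float,
        h_c: int,
        h_p: int) -> Tuple[List[float], List[float], float]:
    """Keep the last h_c consumption values, the last h_p production
    values and the current battery level as the state representation."""
    if not 1 <= h_c <= len(consumption):
        raise ValueError("h_c must be between 1 and len(consumption)")
    if not 1 <= h_p <= len(production):
        raise ValueError("h_p must be between 1 and len(production)")
    return (list(consumption[-h_c:]), list(production[-h_p:]), battery_level)
```

A larger `h_c` and `h_p` expose more information about the latent features (time of day, weather, season) at the price of a larger state representation.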
Also note that the dynamics of the system depend on exogenous time series (production and consumption time series) for which the agent only has finite data. In order to evaluate the capacity of the agent to generalize, we can break down the exogenous time series into a part that will be used for training and one part that will be used for validation (François-Lavet et al., 2016). The final performance is then evaluated on the environment with unseen time series.

Splitting the Time Series to Avoid Overfitting
In this microgrid domain, the agent is provided with up to two years of actual past realizations of the consumption ($c_t$) and the production ($\psi_t$). These past realizations are split into a training environment (one year) and a validation environment (one year). The training environment is used to train the policy, while the validation environment is used at each epoch (see footnote 5) to estimate how well the policy performs on the undiscounted sum of rewards. The validation environment is also used to select the best (approximated) Q-network, denoted Q*, before overfitting sets in (by selecting a discount factor lower than the maximum and by using early stopping). This selection also has the advantage of picking up the Q-network at an epoch less affected by instabilities. The selected trained Q-network is then used in a test environment (y = 3) to provide an independent estimate of how well the resulting policy performs. Technical details relative to the deep Q-network algorithm used are given in Appendix B.1.
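The epoch-selection part of this procedure can be sketched in a few lines of Python. The function name and the list of per-epoch validation scores are hypothetical; the sketch covers only the choice of the epoch with the best validation score, not the choice of the discount factor or the training loop itself.

```python
from typing import Sequence


def select_best_epoch(validation_scores: Sequence[float]) -> int:
    """Return the index of the epoch whose Q-network obtained the highest
    undiscounted sum of rewards on the validation environment.

    Ties are resolved in favor of the earliest epoch.
    """
    if len(validation_scores) == 0:
        raise ValueError("at least one epoch is required")
    return max(range(len(validation_scores)),
               key=lambda epoch: validation_scores[epoch])
```

The Q-network saved at the returned epoch is the one that would then be evaluated once on the test environment.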
To empirically demonstrate the effect of the bias-overfitting tradeoff, we artificially reduce the diversity of the training and validation time series by a factor κ ∈ {1, 2, 4, 8, 16}. Overall, the time series for training and validation should still be as close to 365 days as possible (the epochs are run on approximately the same given number of time steps for all cases). In addition, the true underlying processes should be respected as closely as possible by (i) guaranteeing that the succession of the seasons is not corrupted and (ii) guaranteeing that consecutive days in the original time series are kept consecutive whenever possible.
In order to do so, the time series of one year is divided into four seasons. Each season is then split into κ blocks of the same number of days. For each season, data reduction is done by copying one of the blocks over all remaining blocks of the same season. This artificial reduction of the available data is illustrated in Figure 7. Within each season, data reduction is done by replicating a fraction 1/κ of the time series (here κ = 2). With this process, κ new time series are obtained from one original time series.
Figure 7: Illustration of how the quantity of data in a time series of one year is artificially reduced by a factor κ = 2 while guaranteeing that the specificity of the seasons is essentially preserved. This schema applies to both the consumption and the production time series and to both the training and validation time series.
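The reduction procedure can be sketched as below. This is a minimal illustration under stated assumptions: each element of `series` stands for one day of data, seasons are taken as four consecutive chunks of (almost) equal length, and the function name and the `keep_block` argument are hypothetical. How the authors handle season lengths that are not divisible by κ is not specified; here the kept block is tiled and truncated.

```python
from typing import List, Sequence


def reduce_diversity(series: Sequence, kappa: int,
                     n_seasons: int = 4, keep_block: int = 0) -> List:
    """Reduce the diversity of a one-year series by a factor kappa.

    Each season is split into kappa blocks; the block with index
    keep_block is copied over the whole season, so the length of the
    series and the order of the seasons are preserved.
    """
    if kappa < 1 or not 0 <= keep_block < kappa:
        raise ValueError("need kappa >= 1 and 0 <= keep_block < kappa")
    n = len(series)
    season_len = n // n_seasons
    reduced: List = []
    for s in range(n_seasons):
        start = s * season_len
        # The last season absorbs the remainder (e.g. 365 = 3 * 91 + 92).
        end = n if s == n_seasons - 1 else start + season_len
        season = list(series[start:end])
        block_len = len(season) // kappa
        if block_len == 0:
            raise ValueError("kappa is too large for the season length")
        block = season[keep_block * block_len:(keep_block + 1) * block_len]
        # Tile the kept block over the season and truncate any excess.
        repeats = -(-len(season) // block_len)  # ceiling division
        reduced.extend((block * repeats)[:len(season)])
    return reduced
```

Calling the function with `keep_block` ranging over 0, ..., κ − 1 yields the κ different reduced time series mentioned in the text.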

Results of the Experiments
The experiments are meant to illustrate the theoretical results; they are reported in Figure 8. On the one hand, using a long history may pose problems relative to overfitting when the amount of training data is limited. On the other hand, using a long history of observations allows achieving a good operation when enough data is available (information on the latent features is preserved).
Figure 8: Evolution of the performance of the policy in the test setting (one year of unseen past time series for both the consumption and production). Vertical axis: operational revenue on test data (€/year); curves: 3 hours of history and 12 hours of history. Using a relatively short history of observations is beneficial when the amount of data is limited while a longer history of observations leads to improved results when the amount of data increases (lower κ means that more data is available). For each point, the mean and the standard deviation are calculated over 2 3 different runs of the Deep Q-network algorithm (over different ways to split the time series defined in appendix and for different random seeds).
5. An epoch is defined as the set of all iterations required to go through the whole year of the exogenous time series $c_t$ and $\psi_t$. Each iteration is made up of a transition of one time step in the environment as well as a gradient step of all parameters θ of the Q-network.
As already noted, there is a close link between (i) choosing a function approximator and (ii) selecting features, since the function approximator characterizes how the features will be treated into higher levels of abstraction (a fortiori it can thus give more or less weight to some features). Notwithstanding, this example shows that, in general, the use of function approximators may still benefit from an efficient feature selection process. Indeed, we still observe a tradeoff between bias and overfitting in this microgrid application, even with the convolutional architecture of the deep Q-network.

Conclusion and Future Works
This paper discusses the bias-overfitting tradeoff of batch RL algorithms in the context of POMDPs. Most current RL works compare algorithms in an online setting, where the performance depends on many factors such as improved exploration, better generalization, or better numerical efficiency of the gradient step (in the case of deep RL). In contrast, this paper focuses on one source of sub-optimality related to generalization from limited data in the case of POMDPs. In that context, we propose an analysis showing that, similarly to supervised learning techniques, RL may face a bias-overfitting dilemma in situations where the policy class is too large compared to the batch of data. In such situations, we show that it may be preferable to concede an asymptotic bias in order to reduce overfitting. This (favorable) asymptotic bias may be introduced in different ways: (i) downsizing the state representation, (ii) using specific types of function approximators and (iii) lowering the discount factor.
The main theoretical results of this paper relate to the state representation; the originality of the setting proposed in this paper compared to Munos (2011), Ortner, Maillard, and Ryabko (2014) and the related work is mainly to formalize the problem in a batch setting (limited set of tuples) instead of the online setting, where they investigate the E/E dilemma. As compared to Jiang et al. (2015b), the originality is to consider a partially observable setting. We introduce the notion of ε-sufficient statistics and show that it allows us to formalize the intuition of the bias-overfitting tradeoff in a rigorous way and provides new insights compared to the MDP case. In particular, the bound of Theorem 1 is a new result based on L1 error terms of the associated belief states. There are also interesting insights from the techniques used to derive that bound, where we use the formalism of bisimulation metrics and the data processing inequality.
The work proposed in this paper may also be of interest in online settings because, at each stage, obtaining a performant policy from given data is part of the solution to an efficient exploration/exploitation tradeoff. For instance, optimizing the bias-overfitting tradeoff suggests that it can be beneficial to dynamically adapt the feature space and the function approximator. This can be done through ad hoc regularization or by adapting the neural network architecture, using for instance the NET2NET transformation (Chen, Goodfellow, & Shlens, 2015).
Given a history $H \in \mathcal{H}$, we can then define the Q-value of a state s ∈ S and an action a ∈ A in the underlying MDP following π and assuming that H has been observed:
$$Q^{\pi}_H(s, a) = R'(s, a) + \gamma\, \mathbb{E}_{s_1|s,a}\, \mathbb{E}_{\omega_1|s_1} R'\big(s_1, a_1 = \pi_H(a\omega_1)\big) + \gamma^2\, \mathbb{E}_{s_1|s,a}\, \mathbb{E}_{\omega_1|s_1}\, \mathbb{E}_{s_2|s_1,a_1}\, \mathbb{E}_{\omega_2|s_2} R'\big(s_2, \pi_H(a\omega_1 a_1 \omega_2)\big) + \dots$$
where $R'(s, a) = \mathbb{E}_{s'|s,a} R(s, a, s')$. Note that $Q^{\pi}_M$ [...] for any history $H \in \mathcal{H}$ and any action a ∈ A since $\phi_0$ is a sufficient mapping.
We can now prove the proposition. Let a ∈ A. Let $\pi^* = \pi_{D_\infty,\phi_0}$ be the optimal policy in M and let $\pi_i = \pi^*_{H^{(i)}}$ be the optimal policy conditioned on having observed $H^{(i)}$ for i = 1, 2. Since $\pi^*$ is optimal we have [...] We then have [...] where we used the facts that $b(\cdot|H^{(1)}) - b(\cdot|H^{(2)})$ is a vector of $L_1$-norm less than 2ε (by Lemma 4) whose components sum to zero and $|Q^{\pi_1}(s, a) - Q^{\pi_1}(s', a)| \le \frac{R_{max}}{1-\gamma}$ for any s, s' ∈ S.
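The step above rests on an elementary inequality: if the components of a vector v sum to zero, then for any function f, $|\sum_i v_i f_i| \le \frac{1}{2} \|v\|_1 (\max f - \min f)$, because subtracting the midpoint of the range of f from f leaves the sum unchanged. The Python sketch below checks this numerically on random belief differences. It is an illustration only, with all names chosen for this example, and is not part of the authors' proof.

```python
import random


def inner(v, f):
    """Inner product sum_i v_i * f_i."""
    return sum(vi * fi for vi, fi in zip(v, f))


def zero_sum_bound(v, f):
    """Upper bound 0.5 * ||v||_1 * (max f - min f), valid when sum(v) = 0."""
    return 0.5 * sum(abs(x) for x in v) * (max(f) - min(f))


def random_belief(n, rng):
    """Random probability vector over n states."""
    weights = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def bound_holds(n_states=8, trials=1000, seed=0):
    """Check the inequality on random pairs of beliefs and random values."""
    rng = random.Random(seed)
    for _ in range(trials):
        b1 = random_belief(n_states, rng)
        b2 = random_belief(n_states, rng)
        v = [p - q for p, q in zip(b1, b2)]   # components sum to zero
        f = [rng.uniform(0.0, 10.0) for _ in range(n_states)]
        if abs(inner(v, f)) > zero_sum_bound(v, f) + 1e-9:
            return False
    return True
```

The bound is tight: with v = (0.5, −0.5) and f = (0, 1), both sides equal 0.5.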
Applying the same argument mutatis mutandis we obtain Q π * M D∞,φ 0 [...]
Alternative proof. An alternative proof of independent interest makes use of the formalism of the bisimulation metric (Ferns et al., 2004) along with the data processing inequality. Let us consider $d \in \mathcal{M}$, where $\mathcal{M}$ is the set of all semi-metrics on $\Sigma_0$ with distance at most 1.
We fix a particular d as follows: $\forall \sigma^{(1)}, \sigma^{(2)} \in \Sigma_0$: [...] where $d_P$ is some probability metric, and where $c_R, c_T \ge 0$ are such that $c_R + c_T \le 1$. We define $F : \mathcal{M} \to \mathcal{M}$ by [...] where $T_K(d)$ is the Kantorovich metric induced by d. From Lemma 5, F has a least fixed point $d_{fix}$, and it is a bisimulation metric. From Lemmas 6 and 7 (using the data processing inequality), we have $\forall \sigma^{(1)}, \sigma^{(2)} \in \Sigma_0$ that [...] Using Lemma 8, it follows that [...]
Then F has a least fixed point, $d_{fix}$, and $d_{fix}$ is a bisimulation metric.
Proof. This follows directly from the fact that ∀H: [...] where $R'(s, a) = \mathbb{E}_{s'|s,a} R(s, a, s')$.
Proof. For any a ∈ A, by direct application of the data processing inequality (DPI) in the special case of the total variation: [...] Moreover, as illustrated in Figure 9 and once more by application of the DPI in the special case of the total variation, we have that [...] with the distributions
• $P_X = P(s \mid \phi_0(H^{(1)}), a)$,
• $Q_X = P(s \mid \phi_0(H^{(2)}), a)$,
• $P_Y = P(\omega, s' \mid \phi_0(H^{(1)}), a)$, and
• $Q_Y = P(\omega, s' \mid \phi_0(H^{(2)}), a)$.
Note that the output distribution is on the tuple (ω, s'). Since the right-hand side in the aforementioned DPI can be bounded by ε: [...] we have
$$\frac{1}{2} \sum_{\omega \in \Omega} \sum_{s' \in S} \Big| P(\omega \mid \phi_0(H^{(1)}), a)\, P(s' \mid \phi_0(H^{(1)}), a, \omega) - P(\omega \mid \phi_0(H^{(2)}), a)\, P(s' \mid \phi_0(H^{(2)}), a, \omega) \Big| \le \epsilon.$$
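The data processing inequality used here states that passing two input distributions through the same stochastic channel cannot increase their total variation distance. The Python sketch below checks this on random discrete distributions and channels. It is a numerical illustration only; the function names and problem sizes are assumptions made for this example.

```python
import random


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def push_forward(p_x, channel):
    """Output distribution P_Y(y) = sum_x P_X(x) * channel[x][y]."""
    n_y = len(channel[0])
    return [sum(p_x[x] * channel[x][y] for x in range(len(p_x)))
            for y in range(n_y)]


def random_distribution(n, rng):
    """Random probability vector of length n."""
    weights = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def dpi_holds(n_x=8, n_y=16, trials=500, seed=0):
    """Check TV(P_Y, Q_Y) <= TV(P_X, Q_X) on random instances."""
    rng = random.Random(seed)
    for _ in range(trials):
        p_x = random_distribution(n_x, rng)
        q_x = random_distribution(n_x, rng)
        # Each row is a conditional distribution over outputs given x.
        channel = [random_distribution(n_y, rng) for _ in range(n_x)]
        tv_in = total_variation(p_x, q_x)
        tv_out = total_variation(push_forward(p_x, channel),
                                 push_forward(q_x, channel))
        if tv_out > tv_in + 1e-9:
            return False
    return True
```

In the proof, the input variable is the current state s and the channel is the fixed transition and observation model, so the same inequality applies.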
We have [...] Taking the limit m → ∞, with $Q_m \to Q^{\pi}_M$, we have [...] which completes the proof.

A.3 Q-learning with Neural Network as a Function Approximator: Technical Details of Figure 5
The neural network is made up of three intermediate fully connected layers with 20, 50 and 20 neurons with ReLU activation functions and is trained with Q-learning. Weights are initialized with a Glorot uniform initializer (Glorot & Bengio, 2010). Training uses a target Q-network with a freeze interval (see Mnih, Kavukcuoglu, Silver, et al., 2015) of 100 mini-batch gradient descent steps, an RMSprop update rule (learning rate of 0.005, ρ = 0.9), mini-batches of size 32 and 20000 mini-batch gradient descent steps.
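The layer structure and initialization described above can be sketched with the Python standard library alone. This is not the authors' implementation: the input and output sizes in the usage are hypothetical, biases are assumed to start at zero, the output layer is assumed to be linear, and the training procedure (target network, RMSprop, mini-batches) is not shown.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Weights drawn uniformly in [-limit, limit] with
    limit = sqrt(6 / (fan_in + fan_out)) (Glorot & Bengio, 2010)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def build_q_network(n_inputs, n_actions, hidden=(20, 50, 20), seed=0):
    """Return a list of (weights, biases) pairs, one per layer."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, n_actions]
    return [(glorot_uniform(a, b, rng), [0.0] * b)  # zero biases: assumption
            for a, b in zip(sizes[:-1], sizes[1:])]


def q_values(network, x):
    """Forward pass: ReLU on hidden layers, linear output (assumption)."""
    for i, (weights, biases) in enumerate(network):
        x = [sum(x[k] * weights[k][j] for k in range(len(x))) + biases[j]
             for j in range(len(biases))]
        if i < len(network) - 1:
            x = [max(0.0, v) for v in x]
    return x
```

For instance, `q_values(build_q_network(4, 3), [0.1, 0.2, 0.3, 0.4])` returns one Q-value per action for a hypothetical problem with four input features and three actions.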

B.1.1 Neural Network Architecture
The inputs of the neural network architecture are provided by $\phi(H_t)$ (normalized into [0, 1]), and the outputs represent the Q-values for each discretized action. The neural network processes the time series (when $h_c > 2$ and $h_p > 2$) using a set of convolutions with 16 filters of 2 × 1 with stride 1, followed by a convolution with 16 filters of 2 × 2 with stride 1. The combination of the output of the convolutions and the non-time-series inputs is then followed by two fully connected layers with 50 and 20 neurons. The activation function used is the Rectified Linear Unit (ReLU) except for the output layer where no activation function is used. A sketch of the structure of the neural network is provided in Figure 10.

B.1.2 Hyperparameters Used in DQN
The different hyperparameters used for the DQN algorithm (Mnih et al., 2015) are provided in Table 1.