Policy Invariance under Reward Transformations for General-Sum Stochastic Games

We extend the potential-based shaping method from Markov decision processes to multi-player general-sum stochastic games. We prove that the Nash equilibria in a stochastic game remain unchanged after potential-based shaping is applied to the environment. This policy invariance property provides a possible way of speeding up convergence when learning to play a stochastic game.


Introduction
In reinforcement learning, one may suffer from the temporal credit assignment problem (Sutton & Barto, 1998), where a reward is received only after a sequence of actions. The delayed reward makes it difficult to distribute credit or punishment among the actions in a long sequence, and this causes the algorithm to learn slowly. An example of this problem can be found in episodic tasks such as a soccer game, where the player is only given credit or punishment after a goal is scored. If the number of states in the soccer game is large, it will take a long time for a player to learn its equilibrium policy.
Reward shaping is a technique to improve the learning performance of a reinforcement learner by introducing shaping rewards to the environment (Gullapalli & Barto, 1992; Mataric, 1994). When the state space is large, the delayed reward will slow down learning dramatically. To speed up learning, the learner may apply shaping rewards to the environment as a supplement to the delayed reward. In this way, a reinforcement learning algorithm can improve its learning performance by combining a "good" shaping reward function with the original delayed reward.
Applications of reward shaping can be found in the literature (Gullapalli & Barto, 1992; Dorigo & Colombetti, 1994; Mataric, 1994; Randløv & Alstrøm, 1998). Gullapalli and Barto (1992) demonstrated the application of shaping to a key-press task where a robot was trained to press keys on a keyboard. Dorigo and Colombetti (1994) applied shaping policies for a robot to perform a predefined animate-like behavior. Mataric (1994) presented an intermediate reinforcement function for a group of mobile robots to learn a foraging task. Randløv and Alstrøm (1998) combined reinforcement learning with shaping to make an agent learn to drive a bicycle to a goal. Theoretical analysis of reward shaping can also be found in the literature (Ng, Harada, & Russell, 1999; Wiewiora, 2003; Asmuth, Littman, & Zinkov, 2008). Ng et al. (1999) presented a potential-based shaping reward that guarantees policy invariance for a single agent in a Markov decision process (MDP). Ng et al. proved that the optimal policy remains unchanged after adding the potential-based shaping reward to an MDP environment. Following Ng et al., Wiewiora (2003) showed that the effects of potential-based shaping can be achieved by a particular initialization of Q-values for agents using Q-learning. Asmuth et al. (2008) applied the potential-based shaping reward to a model-based reinforcement learning approach.
The above articles focus on applications of reward shaping to a single agent in an MDP. For applications of reward shaping in general-sum games, Babes, Munoz de Cote, and Littman (2008) introduced a social shaping reward for players to learn their equilibrium policies in the iterated prisoner's dilemma game. But there is no theoretical proof of policy invariance under that reward transformation. In our research, we prove that the Nash equilibria under the potential-based shaping reward transformation (Ng et al., 1999) will also be the Nash equilibria for the original game under the framework of general-sum stochastic games. Note that the similar work of Devlin and Kudenko (2011) was published while this article was under review. But Devlin and Kudenko only proved sufficiency based on a proof technique introduced by Asmuth et al. (2008), while we prove both sufficiency and necessity using a different proof technique in this article.

Framework of Stochastic Games
Stochastic games were first introduced by Shapley (1953). In a stochastic game, players choose a joint action and move from one state to another based on the joint action they choose. In this section, under the framework of stochastic games, we introduce Markov decision processes, matrix games and stochastic games in turn.

Markov Decision Processes
A Markov decision process is a tuple (S, A, T, γ, R) where S is the state space, A is the action space, T : S × A × S → [0, 1] is the transition function, γ ∈ [0, 1] is the discount factor and R : S × A × S → ℝ is the reward function. The transition function denotes a probability distribution over next states given the current state and action. The reward function denotes the reward received at the next state given the current state and action. A Markov decision process has the following Markov property: the player's next state and reward only depend on the player's current state and action. A player's policy π : S → A is defined as a probability distribution over the player's actions given a state. An optimal policy π* will maximize the player's discounted future reward. For any MDP, there exists a deterministic optimal policy for the player (Bertsekas, 1987).
Starting in the current state s and following the optimal policy thereafter, we can get the optimal state-value function as the expected sum of discounted rewards (Sutton & Barto, 1998)
$$V^*(s) = E\left\{ \sum_{j=0}^{T-k-1} \gamma^j r_{k+j+1} \,\middle|\, s_k = s, \pi^* \right\} \tag{1}$$
where k is the current time step, $r_{k+j+1}$ is the immediate reward received at time step k + j + 1, γ ∈ [0, 1] is the discount factor, and T is the final time step. In (1), we have T → ∞ if the task is an infinite-horizon task, that is, a task that runs over an infinite period. If the task is episodic, T is defined as the terminal time at which each episode ends. We call the state where each episode ends the terminal state $s_T$. In a terminal state, the state-value function is always zero, such that $V(s_T) = 0$ for all $s_T \in S$. Given the current state s and action a, and following the optimal policy thereafter, we can define an optimal action-value function (Sutton & Barto, 1998)
$$Q^*(s, a) = \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right] \tag{2}$$
where $T(s, a, s') = \Pr\{s_{k+1} = s' \mid s_k = s, a_k = a\}$ is the probability of the next state being $s_{k+1} = s'$ given the current state $s_k = s$ and action $a_k = a$ at time step k, and $R(s, a, s') = E\{r_{k+1} \mid s_k = s, a_k = a, s_{k+1} = s'\}$ is the expected immediate reward received at state s′ given the current state s and action a. In a terminal state, the action-value function is always zero, such that $Q(s_T, a) = 0$ for all $s_T \in S$.
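The optimal state-value and action-value functions described above can be computed numerically by value iteration. The following is a minimal sketch, not part of the original article: the function name `value_iteration`, the array layout `T[s, a, s']` and `R[s, a, s']`, and the two-state example MDP are all assumptions made for illustration.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-10, max_iter=10_000):
    """Return (V*, Q*) for an MDP with T[s, a, s'] and R[s, a, s']."""
    V = np.zeros(T.shape[0])
    for _ in range(max_iter):
        # Q(s, a) = sum_{s'} T(s, a, s') [R(s, a, s') + gamma V(s')]
        Q = np.einsum("ijk,ijk->ij", T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)  # act greedily in every state
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = np.einsum("ijk,ijk->ij", T, R + gamma * V[None, None, :])
    return V, Q

# Hypothetical MDP: state 1 is absorbing with zero reward.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = 1.0   # in state 0, action 0 stays in state 0
T[0, 1, 1] = 1.0   # in state 0, action 1 moves to state 1
T[1, :, 1] = 1.0   # state 1 always returns to itself
R = np.zeros((2, 2, 2))
R[0, 1, 1] = 1.0   # reward 1 for the move from state 0 to state 1
gamma = 0.9

V_star, Q_star = value_iteration(T, R, gamma)
greedy_policy = Q_star.argmax(axis=1)  # a deterministic optimal policy
print(V_star, Q_star, greedy_policy)
```

In this toy example the delayed reward is only one step away; in larger state spaces the same backup has to propagate the reward through many states, which is the slow-learning effect that shaping is meant to reduce.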

Matrix Games
A matrix game is a tuple $(n, A_1, \dots, A_n, R_1, \dots, R_n)$ where n is the number of players, $A_i$ (i = 1, …, n) is the action set for player i and $R_i : A_1 \times \cdots \times A_n \to \mathbb{R}$ is the payoff function for player i. A matrix game is a game involving multiple players and a single state. Each player i (i = 1, …, n) selects an action from its action set $A_i$ and receives a payoff. Player i's payoff function $R_i$ is determined by all players' joint action from the joint action space $A_1 \times \cdots \times A_n$. For a two-player matrix game, we can set up a matrix with each element containing a payoff for each joint action pair. Then the payoff function $R_i$ for player i (i = 1, 2) becomes a matrix. If the two players in the game are fully competitive, we will have a two-player zero-sum matrix game with $R_1 = -R_2$.
In a matrix game, each player tries to maximize its own payoff based on the player's strategy. A player's strategy in a matrix game is a probability distribution over the player's action set. To evaluate a player's strategy, we introduce the following concept of Nash equilibrium. A Nash equilibrium in a matrix game is a collection of all players' strategies $(\pi_1^*, \dots, \pi_n^*)$ such that
$$V_i(\pi_1^*, \dots, \pi_i^*, \dots, \pi_n^*) \ge V_i(\pi_1^*, \dots, \pi_i, \dots, \pi_n^*), \quad \forall \pi_i \in \Pi_i,\ i = 1, \dots, n \tag{3}$$
where $V_i(\cdot)$ is the expected payoff for player i given all players' current strategies and $\pi_i$ is any strategy of player i from the strategy space $\Pi_i$. In other words, a Nash equilibrium is a collection of strategies for all players such that no player can do better by changing its own strategy given that the other players continue playing their Nash equilibrium strategies (Başar & Olsder, 1999). We define $Q_i(a_1, \dots, a_n)$ as the payoff received by player i given the players' joint action $a_1, \dots, a_n$, and $\pi_i(a_i)$ (i = 1, …, n) as the probability of player i choosing action $a_i$. Then the Nash equilibrium defined in (3) becomes
$$\sum_{a_1, \dots, a_n} Q_i(a_1, \dots, a_n)\, \pi_1^*(a_1) \cdots \pi_i^*(a_i) \cdots \pi_n^*(a_n) \ge \sum_{a_1, \dots, a_n} Q_i(a_1, \dots, a_n)\, \pi_1^*(a_1) \cdots \pi_i(a_i) \cdots \pi_n^*(a_n), \quad \forall \pi_i \in \Pi_i,\ i = 1, \dots, n \tag{4}$$
where $\pi_i^*(a_i)$ is the probability of player i choosing action $a_i$ under player i's Nash equilibrium strategy $\pi_i^*$. A two-player matrix game is called a zero-sum game if the two players are fully competitive, in which case $R_1 = -R_2$. A zero-sum game has a unique Nash equilibrium in the sense of the expected payoff. That is, although each player may have multiple Nash equilibrium strategies in a zero-sum game, the value of the expected payoff $V_i$ under these Nash equilibrium strategies will be the same. If the players in the game are not fully competitive, or the sum of the players' payoffs is not zero, the game is called a general-sum game. In a general-sum game, the Nash equilibrium is no longer unique and the game might have multiple Nash equilibria. Unlike the deterministic optimal policy for a single player in an MDP, the equilibrium strategies in a multi-player matrix game may be stochastic.
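The Nash equilibrium condition for a two-player matrix game can be checked numerically. The sketch below is illustrative only: the helper `is_nash` and the payoff matrices are assumptions, not taken from the article. It relies on the fact that a player's expected payoff is linear in its own strategy, so it is enough to compare the equilibrium payoff against every pure-strategy deviation.

```python
import numpy as np

def is_nash(R1, R2, p1, p2, tol=1e-9):
    """Check whether mixed strategies (p1, p2) form a Nash equilibrium.

    R1[a1, a2] and R2[a1, a2] are the payoffs of players 1 and 2.
    """
    v1 = p1 @ R1 @ p2            # player 1's expected payoff
    v2 = p1 @ R2 @ p2            # player 2's expected payoff
    best1 = (R1 @ p2).max()      # best pure deviation for player 1
    best2 = (p1 @ R2).max()      # best pure deviation for player 2
    return bool(best1 <= v1 + tol and best2 <= v2 + tol)

# Zero-sum example (matching pennies): R1 = -R2.
R1_mp = np.array([[1.0, -1.0], [-1.0, 1.0]])
R2_mp = -R1_mp
uniform = np.array([0.5, 0.5])
first_action = np.array([1.0, 0.0])
mp_uniform_is_nash = is_nash(R1_mp, R2_mp, uniform, uniform)
mp_pure_is_nash = is_nash(R1_mp, R2_mp, first_action, first_action)

# General-sum example (a prisoner's dilemma with hypothetical payoffs);
# action 0 = cooperate, action 1 = defect.
R1_pd = np.array([[3.0, 0.0], [5.0, 1.0]])
R2_pd = R1_pd.T
defect = np.array([0.0, 1.0])
cooperate = np.array([1.0, 0.0])
pd_defect_is_nash = is_nash(R1_pd, R2_pd, defect, defect)
pd_cooperate_is_nash = is_nash(R1_pd, R2_pd, cooperate, cooperate)

print(mp_uniform_is_nash, mp_pure_is_nash, pd_defect_is_nash, pd_cooperate_is_nash)
```

The matching pennies case illustrates a stochastic equilibrium strategy, in contrast to the deterministic optimal policy of a single-player MDP.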

Stochastic Games
A Markov decision process contains a single player and multiple states, while a matrix game contains multiple players and a single state. For a game with more than one player and multiple states, we define a stochastic game (or Markov game) as the combination of Markov decision processes and matrix games. A stochastic game is a tuple $(n, S, A_1, \dots, A_n, T, \gamma, R_1, \dots, R_n)$ where n is the number of players, S is the state space, $A_i$ is the action set for player i, $T : S \times A_1 \times \cdots \times A_n \times S \to [0, 1]$ is the transition function, γ ∈ [0, 1] is the discount factor and $R_i : S \times A_1 \times \cdots \times A_n \times S \to \mathbb{R}$ is the reward function for player i. The transition function in a stochastic game is a probability distribution over next states given the current state and joint action of the players. The reward function $R_i(s, a_1, \dots, a_n, s')$ denotes the reward received by player i in state s′ after taking joint action $(a_1, \dots, a_n)$ in state s. Similar to Markov decision processes, stochastic games also have the Markov property. That is, the players' next state and rewards only depend on the current state and all the players' current actions.
To solve a stochastic game, we need to find a policy $\pi_i : S \to A_i$ that can maximize player i's discounted future reward with a discount factor γ. Similar to matrix games, the player's policy in a stochastic game is probabilistic. An example is the soccer game introduced by Littman (1994), where an agent on the offensive side must use a probabilistic policy to pass an unknown defender. In the literature, a solution to a stochastic game can be described as Nash equilibrium strategies in a set of associated state-specific matrix games (Bowling, 2003; Littman, 1994). In these state-specific matrix games, we define the action-value function $Q_i^*(s, a_1, \dots, a_n)$ as the expected reward for player i when all the players take joint action $a_1, \dots, a_n$ in state s and follow the Nash equilibrium policies thereafter. If the value of $Q_i^*(s, a_1, \dots, a_n)$ is known for all the states, we can find player i's Nash equilibrium policy by solving the associated state-specific matrix game (Bowling, 2003). Therefore, for each state s, we have a matrix game and we can find the Nash equilibrium strategies in this matrix game. Then the Nash equilibrium policies for the game are the collection of Nash equilibrium strategies in each state-specific matrix game for all the states.

Multi-Player General-Sum Stochastic Games
For a multi-player general-sum stochastic game, we want to find the Nash equilibria in the game if we know the reward function and transition function in the game. A Nash equilibrium in a stochastic game can be described as a tuple of n policies $(\pi_1^*, \dots, \pi_n^*)$ such that for all s ∈ S and i = 1, …, n,
$$V_i(s, \pi_1^*, \dots, \pi_i^*, \dots, \pi_n^*) \ge V_i(s, \pi_1^*, \dots, \pi_i, \dots, \pi_n^*), \quad \forall \pi_i \in \Pi_i \tag{5}$$
where $\Pi_i$ is the set of policies available to player i and $V_i(s, \pi_1^*, \dots, \pi_n^*)$ is the expected sum of discounted rewards for player i given the current state and all the players' equilibrium policies. To simplify notation, we use $V_i^*(s)$ to denote $V_i(s, \pi_1^*, \dots, \pi_n^*)$, the state-value function under Nash equilibrium policies. We can also define the action-value function $Q_i^*(s, a_1, \dots, a_n)$ as the expected sum of discounted rewards for player i given the current state and the current joint action of all the players, and following the Nash equilibrium policies thereafter. Then we can get
$$V_i^*(s) = \sum_{a_1, \dots, a_n} Q_i^*(s, a_1, \dots, a_n)\, \pi_1^*(s, a_1) \cdots \pi_n^*(s, a_n) \tag{6}$$
$$Q_i^*(s, a_1, \dots, a_n) = \sum_{s' \in S} T(s, a_1, \dots, a_n, s') \left[ R_i(s, a_1, \dots, a_n, s') + \gamma V_i^*(s') \right] \tag{7}$$
where $\pi_i^*(s, a_i) \in PD(A_i)$ is a probability distribution over action $a_i$ under player i's Nash equilibrium policy, $T(s, a_1, \dots, a_n, s') = \Pr\{s_{k+1} = s' \mid s_k = s, a_1, \dots, a_n\}$ is the probability of the next state being s′ given the current state s and joint action $(a_1, \dots, a_n)$, and $R_i(s, a_1, \dots, a_n, s')$ is the expected immediate reward received in state s′ given the current state s and joint action $(a_1, \dots, a_n)$. Based on (6) and (7), the Nash equilibrium in (5) can be rewritten as
$$\sum_{a_1, \dots, a_n} Q_i^*(s, a_1, \dots, a_n)\, \pi_1^*(s, a_1) \cdots \pi_i^*(s, a_i) \cdots \pi_n^*(s, a_n) \ge \sum_{a_1, \dots, a_n} Q_i^*(s, a_1, \dots, a_n)\, \pi_1^*(s, a_1) \cdots \pi_i(s, a_i) \cdots \pi_n^*(s, a_n), \quad \forall \pi_i \in \Pi_i. \tag{8}$$

Potential-Based Shaping in General-Sum Stochastic Games
Ng et al. (1999) presented a reward shaping method to deal with the credit assignment problem by adding a potential-based shaping reward to the environment. The combination of the shaping reward with the original reward may improve the learning performance of a reinforcement learning algorithm and speed up the convergence to the optimal policy. The theoretical studies on potential-based shaping methods that appear in the published literature consider the case of a single agent in an MDP (Ng et al., 1999; Wiewiora, 2003; Asmuth et al., 2008). In our research, we extend the potential-based shaping method from Markov decision processes to multi-player stochastic games. We prove that the Nash equilibria under the potential-based shaping reward transformation will be the Nash equilibria for the original game under the framework of general-sum stochastic games.
We define a potential-based shaping reward $F_i(s, s')$ for player i as
$$F_i(s, s') = \gamma \Phi_i(s') - \Phi_i(s) \tag{9}$$
where $\Phi_i : S \to \mathbb{R}$ is a real-valued shaping function and $\Phi_i(s_T) = 0$ for any terminal state $s_T$. We define a multi-player stochastic game as a tuple $M = (S, A_1, \dots, A_n, T, \gamma, R_1, \dots, R_n)$ where S is a set of states, $A_1, \dots, A_n$ are the players' action sets, T is the transition function, γ is the discount factor, and $R_i(s, a_1, \dots, a_n, s')$ (i = 1, …, n) is the reward function for player i. After adding the shaping reward function $F_i(s, s')$ to the reward function $R_i(s, a_1, \dots, a_n, s')$, we define a transformed multi-player stochastic game as a tuple
$$M' = (S, A_1, \dots, A_n, T, \gamma, R_1 + F_1, \dots, R_n + F_n).$$
Inspired by Ng et al. (1999)'s proof of policy invariance in an MDP, we prove the policy invariance in a multi-player general-sum stochastic game as follows.
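One way to see why a shaping reward of the form γΦ(s′) − Φ(s) is benign is that its discounted sum telescopes along any trajectory. If a trajectory starts in s_0 and ends in a terminal state with zero potential, the accumulated shaping reward equals −Φ(s_0) regardless of the path taken. The sketch below illustrates this for one player; the potential values, the state indices and the helper names are hypothetical.

```python
import numpy as np

def shaping_reward(phi, s, s_next, gamma):
    """Potential-based shaping reward F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * phi[s_next] - phi[s]

def discounted_shaping_return(phi, trajectory, gamma):
    """Discounted sum of shaping rewards along a list of visited states."""
    steps = zip(trajectory[:-1], trajectory[1:])
    return sum(gamma**j * shaping_reward(phi, s, s_next, gamma)
               for j, (s, s_next) in enumerate(steps))

# Hypothetical potentials over four states; state 3 is terminal, so Phi = 0 there.
phi = np.array([0.5, 2.0, -1.0, 0.0])
gamma = 0.9

path_a = [0, 1, 3]       # two different ways of reaching the terminal state
path_b = [0, 2, 1, 3]
ret_a = discounted_shaping_return(phi, path_a, gamma)
ret_b = discounted_shaping_return(phi, path_b, gamma)
print(ret_a, ret_b, -phi[0])
```

Because the shaping return depends only on the starting state, it shifts the value of every policy from that state by the same amount and cannot change which policy is preferred.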
Theorem 1. Given an n-player discounted stochastic game $M = (S, A_1, \dots, A_n, T, \gamma, R_1, \dots, R_n)$, we define a transformed n-player discounted stochastic game $M' = (S, A_1, \dots, A_n, T, \gamma, R_1 + F_1, \dots, R_n + F_n)$ where $F_i : S \times S \to \mathbb{R}$ is a shaping reward function for player i. We call $F_i$ a potential-based shaping function if $F_i$ has the form of (9). Then, $F_i$ being a potential-based shaping function is a necessary and sufficient condition to guarantee Nash equilibrium policy invariance, such that
• (Sufficiency) If $F_i$ (i = 1, …, n) is a potential-based shaping function, then every Nash equilibrium policy in M′ will also be a Nash equilibrium policy in M (and vice versa).
• (Necessity) If $F_i$ (i = 1, …, n) is not a potential-based shaping function, then there may exist a transition function T and reward function R such that the Nash equilibrium policy in M′ will not be the Nash equilibrium policy in M.
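Proof. (Proof of Sufficiency) Based on (8), a Nash equilibrium in the stochastic game M can be represented as a set of policies such that for all i = 1, …, n, s ∈ S and $\pi_{M_i} \in \Pi_i$,
$$\sum_{a_1, \dots, a_n} Q^*_{M_i}(s, a_1, \dots, a_n)\, \pi^*_{M_1}(s, a_1) \cdots \pi^*_{M_i}(s, a_i) \cdots \pi^*_{M_n}(s, a_n) \ge \sum_{a_1, \dots, a_n} Q^*_{M_i}(s, a_1, \dots, a_n)\, \pi^*_{M_1}(s, a_1) \cdots \pi_{M_i}(s, a_i) \cdots \pi^*_{M_n}(s, a_n).$$
We define
$$\hat{Q}_{M'_i}(s, a_1, \dots, a_n) = Q^*_{M_i}(s, a_1, \dots, a_n) - \Phi_i(s). \tag{13}$$
Because the joint action probabilities on each side of the inequality sum to one, subtracting $\Phi_i(s)$ from both sides preserves it. Then we can get
$$\sum_{a_1, \dots, a_n} \hat{Q}_{M'_i}(s, a_1, \dots, a_n)\, \pi^*_{M_1}(s, a_1) \cdots \pi^*_{M_i}(s, a_i) \cdots \pi^*_{M_n}(s, a_n) \ge \sum_{a_1, \dots, a_n} \hat{Q}_{M'_i}(s, a_1, \dots, a_n)\, \pi^*_{M_1}(s, a_1) \cdots \pi_{M_i}(s, a_i) \cdots \pi^*_{M_n}(s, a_n). \tag{14}$$
We now use some algebraic manipulations to rewrite the action-value function under the Nash equilibrium in (7) for player i in the stochastic game M as
$$Q^*_{M_i}(s, a_1, \dots, a_n) - \Phi_i(s) = \sum_{s' \in S} T(s, a_1, \dots, a_n, s') \left[ R_{M_i}(s, a_1, \dots, a_n, s') + \gamma V^*_{M_i}(s') + \gamma \Phi_i(s') - \gamma \Phi_i(s') \right] - \Phi_i(s). \tag{15}$$
Since $\sum_{s' \in S} T(s, a_1, \dots, a_n, s') = 1$, the above equation becomes
$$Q^*_{M_i}(s, a_1, \dots, a_n) - \Phi_i(s) = \sum_{s' \in S} T(s, a_1, \dots, a_n, s') \left[ R_{M_i}(s, a_1, \dots, a_n, s') + \gamma \Phi_i(s') - \Phi_i(s) + \gamma \left( V^*_{M_i}(s') - \Phi_i(s') \right) \right]. \tag{16}$$
According to (6), we can rewrite the above equation as
$$Q^*_{M_i}(s, a_1, \dots, a_n) - \Phi_i(s) = \sum_{s' \in S} T(s, a_1, \dots, a_n, s') \Big[ R_{M_i}(s, a_1, \dots, a_n, s') + \gamma \Phi_i(s') - \Phi_i(s) + \gamma \sum_{a'_1, \dots, a'_n} \left( Q^*_{M_i}(s', a'_1, \dots, a'_n) - \Phi_i(s') \right) \pi^*_{M_1}(s', a'_1) \cdots \pi^*_{M_n}(s', a'_n) \Big]. \tag{17}$$
Based on the definitions of $F_i(s, s')$ in (9) and $\hat{Q}_{M'_i}(s, a_1, \dots, a_n)$ in (13), the above equation becomes
$$\hat{Q}_{M'_i}(s, a_1, \dots, a_n) = \sum_{s' \in S} T(s, a_1, \dots, a_n, s') \Big[ R_{M_i}(s, a_1, \dots, a_n, s') + F_i(s, s') + \gamma \sum_{a'_1, \dots, a'_n} \hat{Q}_{M'_i}(s', a'_1, \dots, a'_n)\, \pi^*_{M_1}(s', a'_1) \cdots \pi^*_{M_n}(s', a'_n) \Big]. \tag{18}$$
Since equations (14) and (18) have the same form as equations (6)-(8), we can conclude that $\hat{Q}_{M'_i}(s, a_1, \dots, a_n)$ is the action-value function under the Nash equilibrium for player i in the stochastic game M′. Therefore, we can obtain
$$Q^*_{M'_i}(s, a_1, \dots, a_n) = \hat{Q}_{M'_i}(s, a_1, \dots, a_n) = Q^*_{M_i}(s, a_1, \dots, a_n) - \Phi_i(s).$$
If the state s is the terminal state $s_T$, then we have $\hat{Q}_{M'_i}(s_T, a_1, \dots, a_n) = Q^*_{M_i}(s_T, a_1, \dots, a_n) - \Phi_i(s_T) = 0 - 0 = 0$. Based on (14) and $Q^*_{M'_i}(s, a_1, \dots, a_n) = \hat{Q}_{M'_i}(s, a_1, \dots, a_n)$, we can find that the Nash equilibrium in M is also the Nash equilibrium in M′. Then the state-value function under the Nash equilibrium in the stochastic game M′ can be given as
$$V^*_{M'_i}(s) = V^*_{M_i}(s) - \Phi_i(s).$$
(Proof of Necessity) If $F_i$ (i = 1, …, n) is not a potential-based shaping function, we will have $F_i(s, s') \neq \gamma \Phi_i(s') - \Phi_i(s)$. Similar to Ng et al. (1999)'s proof of necessity, we define
[equation missing]
Then we can build a stochastic game M by giving the following transition function T and player 1's reward function
[equation (21) missing]
where $a_i$ (i = 1, …, n) represents any possible action $a_i \in A_i$ from player i, and $a_1^1$ and $a_1^2$ represent player 1's action 1 and action 2 respectively. Equation $T(s_1, a_1^1, a_2, \dots, a_n, s_3) = 1$ in (21) denotes that, given the current state $s_1$, player 1's action $a_1^1$ will lead to the next state $s_3$ no matter what joint action the other players take. Based on the above transition function and reward function, we can get the game model including states $(s_1, s_2, s_3)$ shown in Figure 1. We now define $\Phi_1(s_i) = -F_1(s_i, s_3)$ (i = 1, 2, 3). Based on (6), (7), (19), (20) and (21), we can obtain player 1's action-value function at state $s_1$ in M and M′
[equation missing]
Then the Nash equilibrium policy for player 1 at state $s_1$ is
[equation missing]
Therefore, in the above case, the Nash equilibrium policy for player 1 at state $s_1$ in M is not the Nash equilibrium policy in M′.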
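The sufficiency argument can be sanity-checked numerically. The sketch below does not compute Nash equilibria. Instead it fixes an arbitrary stationary joint policy in a randomly generated two-player general-sum stochastic game and verifies that potential-based shaping shifts each player's state values and action values by exactly that player's potential at the current state. Since all of a player's action values in a given state shift by the same constant, every best-response comparison, and hence every Nash equilibrium, is preserved. The game sizes, random seed, array layouts and the helper `evaluate` are assumptions made for illustration, and the game is taken to be infinite-horizon with no terminal states.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA1, nA2, gamma = 4, 2, 3, 0.9

# Random transition function T[s, a1, a2, s'], normalized over next states.
T = rng.random((nS, nA1, nA2, nS))
T /= T.sum(axis=-1, keepdims=True)

# R[i, s, a1, a2, s'] is player i's reward; Phi[i, s] is player i's potential.
R = rng.normal(size=(2, nS, nA1, nA2, nS))
Phi = rng.normal(size=(2, nS))

# F[i, s, s'] = gamma * Phi[i, s'] - Phi[i, s], added to every joint action.
F = gamma * Phi[:, None, :] - Phi[:, :, None]
R_shaped = R + F[:, :, None, None, :]

# An arbitrary stationary joint policy (not necessarily an equilibrium).
pi1 = rng.random((nS, nA1))
pi1 /= pi1.sum(axis=1, keepdims=True)
pi2 = rng.random((nS, nA2))
pi2 /= pi2.sum(axis=1, keepdims=True)

def evaluate(R_i):
    """State and action values of one player under the fixed joint policy."""
    joint = pi1[:, :, None] * pi2[:, None, :]             # joint[s, a1, a2]
    P = np.einsum("sab,sabt->st", joint, T)               # induced state chain
    r = np.einsum("sab,sabt,sabt->s", joint, T, R_i)      # expected one-step reward
    V = np.linalg.solve(np.eye(nS) - gamma * P, r)        # V = r + gamma P V
    Q = np.einsum("sabt,sabt->sab", T, R_i + gamma * V[None, None, None, :])
    return V, Q

shift_holds = []
for i in range(2):
    V, Q = evaluate(R[i])
    V_s, Q_s = evaluate(R_shaped[i])
    shift_holds.append(bool(np.allclose(V_s, V - Phi[i])
                            and np.allclose(Q_s, Q - Phi[i][:, None, None])))
print(shift_holds)
```

This check covers a single random game and a single joint policy, so it illustrates the algebra behind the proof rather than replacing it.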
The above analysis shows that the potential-based shaping reward with the form of $F_i(s, s') = \gamma \Phi_i(s') - \Phi_i(s)$ guarantees the Nash equilibrium policy invariance. Now the question becomes how to select a shaping function $\Phi_i(s)$ to improve the learning performance of the learner. Ng et al. (1999) showed that $\Phi_i(s) = V^*_{M_i}(s)$ is a good candidate for improving the player's learning performance in an MDP. We substitute $\Phi_i(s) = V^*_{M_i}(s)$ into (18) and get
$$Q^*_{M'_i}(s, a_1, \dots, a_n) = \sum_{s' \in S} T(s, a_1, \dots, a_n, s') \left[ R_{M_i}(s, a_1, \dots, a_n, s') + F_i(s, s') + \gamma \left( V^*_{M_i}(s') - \Phi_i(s') \right) \right] = \sum_{s' \in S} T(s, a_1, \dots, a_n, s') \left[ R_{M_i}(s, a_1, \dots, a_n, s') + F_i(s, s') \right]. \tag{23}$$
Equation (23) shows that the action-value function $Q^*_{M'_i}(s, a_1, \dots, a_n)$ in state s can be easily obtained by checking the immediate reward $R_{M_i}(s, a_1, \dots, a_n, s') + F_i(s, s')$ that player i received in state s′. However, in practical applications, we will not have all the information about the environment, such as $T(s, a_1, \dots, a_n, s')$ and $R_i(s, a_1, \dots, a_n, s')$. This means that we cannot find a shaping function $\Phi_i(s)$ such that $\Phi_i(s) = V^*_{M_i}(s)$ without knowing the model of the environment. Therefore, the goal in designing a shaping function is to find a $\Phi_i(s)$ that is a "good" approximation to $V^*_{M_i}(s)$.
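The effect of choosing the optimal value function as the potential can be illustrated in the single-player special case (n = 1), where the stochastic game reduces to an MDP and the Nash equilibrium reduces to the optimal policy. In the sketch below, the random MDP, its sizes, the seed and the helper `optimal_values` are all hypothetical. With Φ set to the optimal value function of the original MDP, the optimal action values of the shaped MDP coincide with the expected one-step shaped reward, and the optimal state values of the shaped MDP are zero.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 3, 0.9

# Random MDP: T[s, a, s'] normalized over next states, R[s, a, s'].
T = rng.random((nS, nA, nS))
T /= T.sum(axis=-1, keepdims=True)
R = rng.normal(size=(nS, nA, nS))

def optimal_values(R_sas, iters=2000):
    """Value iteration; returns (V*, Q*) for rewards R_sas[s, a, s']."""
    V = np.zeros(nS)
    for _ in range(iters):
        Q = np.einsum("sat,sat->sa", T, R_sas + gamma * V[None, None, :])
        V = Q.max(axis=1)
    Q = np.einsum("sat,sat->sa", T, R_sas + gamma * V[None, None, :])
    return V, Q

V_star, Q_star = optimal_values(R)

# Use the optimal value function of the original MDP as the potential.
Phi = V_star
F = gamma * Phi[None, :] - Phi[:, None]          # F[s, s']
R_shaped = R + F[:, None, :]

V_shaped, Q_shaped = optimal_values(R_shaped)

# Expected one-step shaped reward: sum_{s'} T(s, a, s') [R + F].
one_step = np.einsum("sat,sat->sa", T, R_shaped)
print(np.abs(Q_shaped - one_step).max(), np.abs(V_shaped).max())
```

In this idealized setting a learner could act optimally by looking only at the immediate shaped reward. In practice the optimal value function is unknown, which is why an approximation is the realistic target.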

Conclusion
A potential-based shaping method can be used to deal with the temporal credit assignment problem and speed up the learning process in MDPs. In this article, we extend the potential-based shaping method to general-sum stochastic games. We prove that the proposed potential-based shaping reward applied to a general-sum stochastic game will not change the original Nash equilibria of the game. This result has the potential to improve the learning performance of the players in a stochastic game.