The primary focus of this repository is to train deep RL models to excel at the game of dominoes. For now, the repository exclusively uses agents trained with the TD-lambda algorithm. I have explored how various TD-lambda agents perform, including analyses of how input structure, network architecture, and training regimen affect their performance.
This documentation file starts by explaining the TD-lambda algorithm and how it is applied to dominoes, then shows how it is implemented by TD-lambda agents.
The foundation of the TD-Lambda algorithm is
Temporal Difference Learning.
In temporal-difference learning, a value function $V(S_t)$ estimates the total future reward expected from state $S_t$, and is updated using the difference between successive predictions: $V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$, where $\alpha$ is a learning rate and $\gamma$ is a temporal discount factor.
The term in the parentheses is called the temporal difference error because it reflects the error in predicting the next state's reward from the previous state. It is denoted $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$.
Suppose the value function is defined as a neural network $f_V(S_t)$ with parameters $\theta$.
How does one implement an update of the value function? To do so, we need to determine how the parameters of the network affect the estimate of the value. For this, we need the gradient of the value network with respect to the parameters, $\nabla_\theta f_V(S_t)$.
In an autocorrelated game like dominoes, it makes sense to keep track of how the parameters have been influencing the estimate of the final score throughout each hand. This value is a temporally discounted accumulation of gradients referred to as the eligibility trace, because it represents the "eligibility" of each parameter to be updated by temporal difference errors. The eligibility trace is denoted $z_t = \sum_{k=1}^{t} \lambda^{t-k} \nabla_\theta f_V(S_k)$.
Fortunately, this equation is recursive, so it can be updated each time step without recomputing the gradients of all past time steps as follows: $z_t = \lambda z_{t-1} + \nabla_\theta f_V(S_t)$. Here, $\lambda$ is the temporal discounting factor that gives the TD-Lambda algorithm its name.
Now, we can't just add the eligibility trace to the network's parameters; we have to make sure that we update the parameters such that the value function becomes progressively more accurate over time. To do that, the eligibility trace needs to be scaled by the temporal difference error ($\delta_t$) and a learning rate ($\alpha$), giving the update $\theta \leftarrow \theta + \alpha\, \delta_t\, z_t$.
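Putting these pieces together, one full TD-Lambda step can be summarized as follows (written with a generic reward $R_{t+1}$ and discount $\gamma$; the dominoes-specific definition of the temporal difference is given below):

$$
\begin{aligned}
\delta_t &= R_{t+1} + \gamma\, f_V(S_{t+1}) - f_V(S_t) && \text{(temporal difference error)} \\
z_t &= \lambda\, z_{t-1} + \nabla_\theta f_V(S_t) && \text{(eligibility trace update)} \\
\theta &\leftarrow \theta + \alpha\, \delta_t\, z_t && \text{(parameter update)}
\end{aligned}
$$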
In a game of dominoes, the goal of the game is to end each hand with as few points as possible. Therefore, an agent's value function is defined as its estimate of its final score at the end of the game (the sum of the points in its hand when a player goes out - see the rules). The final score is denoted by $R_{final}$.
Following the convention of the influential TD-Gammon model of TD-Lambda learning, the temporal difference is defined in two different ways depending on the game state.
- If the hand is not over, then the temporal difference is defined as the difference in the model's prediction of the final score before and after a turn occurs. The model predictions before and after a turn are referred to, respectively, as the pre-state and the post-state predictions, denoted $f_V(S_t)$ for the pre-state and $f_V(S_{t+})$ for the post-state, so $\delta_t = f_V(S_{t+}) - f_V(S_t)$.
- If the hand is over, then the temporal difference is defined as the difference between the true final score ($R_{final}$) and the model prediction from the previous game state: $\delta_t = R_{final} - f_V(S_t)$.
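In code, these two cases could be sketched as follows (the argument names are illustrative, not necessarily the ones used in the repository):

```python
def temporalDifferenceError(prestateValue, poststateValue=None, finalScore=None, handIsOver=False):
    """Sketch of the two TD-error cases described above."""
    if handIsOver:
        return finalScore - prestateValue      # delta = R_final - f_V(S_t)
    return poststateValue - prestateValue      # delta = f_V(S_{t+}) - f_V(S_t)
```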
To choose a move, TD-lambda agents simply simulate the future game state for each possible legal move and estimate the final score given that simulated future state, then pick whichever move leads to the lowest estimate of their final score.
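A sketch of this move-selection loop could look like the following, where `legalMoves`, `simulateMove`, and `estimateFinalScore` are placeholder callables standing in for whatever the repository actually uses:

```python
def selectMove(legalMoves, simulateMove, estimateFinalScore):
    """Sketch: pick the legal move whose simulated post-state has the lowest predicted final score."""
    bestMove, bestScore = None, float("inf")
    for move in legalMoves():
        postState = simulateMove(move)                   # simulate the future game state for this move
        predictedScore = estimateFinalScore(postState)   # value-network estimate of the final score
        if predictedScore < bestScore:
            bestMove, bestScore = move, predictedScore
    return bestMove
```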
Here, I explain the code that implements TD-lambda in value agents as they learn to play dominoes. This section provides an overview with a few key details; for further information see the code defining TD-lambda agents and the network architectures that are used as value functions.
Following the style of the hand-crafted agents in this repository, there is a parent class called `valueAgent` that defines the core code of any agent learning with the TD-lambda algorithm. This parent class is not meant to be used on its own. Child classes defined within the same file inherit from `valueAgent` and add their own rules for measuring value. I'll also mention that if you add a new agent, then you should make sure to add it to the standard imports here.
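As a rough illustration of this inheritance pattern (the child class name and method bodies below are hypothetical and simplified; see the agents file for the real interface):

```python
class valueAgent:
    """Parent class holding the shared TD-lambda machinery (not meant to be used directly)."""

    def prepareValueInputs(self, gameState):
        # Each child class defines how the game state becomes the value-network input.
        raise NotImplementedError

class exampleValueAgent(valueAgent):  # hypothetical child class
    """Child class that adds its own rule for measuring value."""

    def prepareValueInputs(self, gameState):
        # e.g. concatenate binarized components of the game state (illustrative only)
        return gameState
```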
This repository contains several network architectures designed to represent the value function (all coded in pytorch). These architectures are located in the networks file and are created by the `prepareNetwork()` method of each `valueAgent`. At the time of network creation, TD-lambda agents also initialize their eligibility traces as torch tensors with the same shape as the network parameters.
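For example, initializing one zero trace per parameter might look like the following sketch (the real `prepareNetwork()` may organize this differently):

```python
import torch

def prepareEligibility(network, numOutputs=1):
    """Sketch: create one zero-initialized trace tensor per network parameter, per output."""
    return [
        [torch.zeros_like(prms) for prms in network.parameters()]
        for _ in range(numOutputs)
    ]

# e.g. self.finalScoreEligibility = prepareEligibility(self.finalScoreNetwork)
```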
Since all network architectures are coded in pytorch, the first step of measuring the value of a game state is converting the game state into a torch tensor. Like every `dominoeAgent`, TD-lambda agents have a method called `processGameState()` that is called by the `gameState()` method when the `dominoeGame` object feeds the game state to each agent. For TD-lambda agents, this takes each component of the game state, converts it to a binarized tensor representation, and then uses the additional method `prepareValueInputs()` to concatenate the tensors into one or more vectors that represent the input to the value function.
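As a toy illustration of what a binarized representation can look like (a simplified sketch, not the exact encoding used by `processGameState()`), a hand of dominoes can be encoded as a multi-hot tensor over all possible dominoes:

```python
import torch

def binarizeHand(handIndices, numDominoes=28):
    """Toy sketch: encode a hand as a multi-hot vector over all possible dominoes."""
    handTensor = torch.zeros(numDominoes)
    handTensor[list(handIndices)] = 1.0
    return handTensor

# e.g. a hand containing the dominoes with indices 0, 5, and 12:
handInput = binarizeHand([0, 5, 12])  # ones at positions 0, 5, and 12, zeros elsewhere
```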
Every turn, the `dominoeGame` object tells each agent to estimate the pre-state value with `performPrestateValueEstimate()`. Each agent decides whether or not to estimate the pre-state value based on an agent method called `checkTurnUpdate()`, which returns `True` or `False` depending on whose turn it is and on whether the agent is in the "learning" state, which can be toggled with `setLearning()`.

Note: I've found that TD-lambda agents learn most effectively when they estimate and update their value function on every turn, regardless of whether it is their turn. This might be overkill, but it works, so that's how the code is set up right now.
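A minimal sketch of this gating logic (the exact conditions and attribute names in the repository's `checkTurnUpdate()` may differ):

```python
def checkTurnUpdate(self, currentPlayer):
    """Sketch: return True when the agent should estimate and update its value this turn."""
    if not self.learning:   # toggled by setLearning()
        return False
    # Updating on every turn works best in practice (see note above); a stricter
    # rule could instead require currentPlayer == self.agentIndex (hypothetical attribute).
    return True
```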
Estimating the pre-state value is used exclusively to measure the eligibility trace of the value function's parameters, $z_t$.
Let an agent be known as `self`, let `finalScoreOutput` be the output of the value function, let `finalScoreEligibility` be the eligibility traces for each parameter, and let `finalScoreNetwork` be the value function of each agent. Here is the code:
```python
for idx, fsOutput in enumerate(self.finalScoreOutput):
    fsOutput.backward(retain_graph=True)  # measure gradient of the parameters with respect to this output value
    for trace, prms in zip(self.finalScoreEligibility[idx], self.finalScoreNetwork.parameters()):
        trace *= self.lam   # discount past eligibility traces by lambda
        trace += prms.grad  # add the new gradient to the eligibility trace
        prms.grad.zero_()   # reset gradients of finalScoreNetwork in between each backward call from the output
```
We enumerate through `finalScoreOutput` (at the moment, this is just a single value representing the agent's own final score estimate, but the code is written so that it can be extended to represent every player's final score). Then, we compute the gradient with respect to each output using `fsOutput.backward(retain_graph=True)`, making sure to retain the graph so that it is preserved across the for loop. Then, for each set of parameters (corresponding to the weights and biases of each layer of the `finalScoreNetwork`), we scale the previous eligibility trace by $\lambda$ (`self.lam`), add the current gradient, and zero the parameter gradients so they do not accumulate across backward calls.
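The snippet above only accumulates the eligibility traces; the parameters themselves are updated once the temporal difference error is known. A minimal sketch of that step, assuming a learning-rate attribute `self.alpha` and a scalar TD error `tdError` (both names are illustrative), could look like:

```python
import torch

def updateNetworkWeights(self, tdError, idx=0):
    """Sketch: apply the TD-lambda update theta <- theta + alpha * delta * z."""
    with torch.no_grad():  # the update itself should not be tracked by autograd
        for trace, prms in zip(self.finalScoreEligibility[idx], self.finalScoreNetwork.parameters()):
            prms += self.alpha * tdError * trace  # scale the trace by the TD error and learning rate
```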
TD-lambda agents come pre-equipped with methods for saving and loading their parameters, including all value function parameters and any ancillary parameters required by each agent type. These are the methods `saveAgentParameters()` and `loadAgentParameters()`.