This repository contains the code for A4C introduced in the paper
A4C: Anticipatory Asynchronous Advantage Actor-Critic
Tharun Medini, Xun Luan, Anshumali Shrivastava
If you find the idea useful, please cite:

@article{anonymous2018anticipatory,
  title={Anticipatory Asynchronous Advantage Actor-Critic (A4C): The power of Anticipation in Deep Reinforcement Learning},
  author={Anonymous},
  journal={International Conference on Learning Representations},
  year={2018},
  url={https://openreview.net/forum?id=rkKkSzb0b}
}
The repository contains two folders: one for the baseline GA3C and one for the Switching version of A4C. The other two variants, Dependent Updating (DU) and Independent Updating (IU), can be obtained by tweaking switching_time in Config.py in the Switching folder: if switching_time is set very high (beyond a 20-hour run, i.e. 72000 seconds), the code performs DU; if switching_time is set to 0, it performs IU.

The code structure is essentially the same as GA3C, with critical differences in Config.py, ProcessAgent.py, GameManager.py and Environment.py. We use the same network architecture as GA3C (a conv layer with 16 filters of size 8x8, a conv layer with 32 filters of size 4x4, and a fully connected layer with 256 units) and the same optimizer.

To run the A4C code, change your directory to Switching and run:
sh _clean.sh
sh _train.sh
The folder Switching has the file Config.py with the following chunk of code:
meta_size = 2 # aka step_size
begin_time = time.time()
switching_time = 9000
Here, meta_size is the maximum length of multi-step actions; by default, we use up to 2-step actions.
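For intuition, with meta_size = 2 the effective action set contains every base action plus every ordered pair of base actions. The helper below is a small illustration of that counting, not code from the repository:

```python
from itertools import product

def extended_actions(base_actions, meta_size=2):
    """All multi-step actions up to length meta_size (illustrative only)."""
    acts = []
    for k in range(1, meta_size + 1):
        acts.extend(product(base_actions, repeat=k))
    return acts

# A game with 6 base actions yields 6 + 6*6 = 42 anticipatory actions.
print(len(extended_actions(range(6))))  # -> 42
```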
switching_time is the time (in seconds) after which we switch. This should roughly be the time when Dependent Updating starts to decay (9000 for Pong, Qbert and SpaceInvaders; 18000 for BeamRider). Automating this is difficult because the rewards are highly variable even after a moving average over the last 1000 episodes, so it is hard to judge whether they have saturated. We are working on tracking the moving median of rewards over 1000 episodes and switching when it starts to decrease.
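One way the moving-median idea could be sketched is below; the window size and tolerance are illustrative assumptions, not values from the paper or the repository.

```python
from collections import deque
from statistics import median

class MedianSwitchDetector:
    """Sketch of automatic switching: track the moving median of episode
    rewards and flag once it decays below the best median seen so far."""

    def __init__(self, window=1000, tolerance=0.0):
        self.rewards = deque(maxlen=window)   # last `window` episode rewards
        self.best_median = float("-inf")
        self.tolerance = tolerance

    def update(self, episode_reward) -> bool:
        """Record one episode reward; return True when it is time to switch."""
        self.rewards.append(episode_reward)
        if len(self.rewards) < self.rewards.maxlen:
            return False  # not enough episodes to estimate the median yet
        m = median(self.rewards)
        if m > self.best_median:
            self.best_median = m
            return False
        return m < self.best_median - self.tolerance
```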
The code in both the GA3C and Switching folders writes out a text file with the total reward for each episode. Running the script plot.py plots a comparison of the two approaches. We can plot with respect to either time or episodes by changing the variable plotby in the following chunk:
plotby = 'time'
if plotby == 'episodes':
    plt1, = plt.plot(r_mean1, 'r', label='GA3C')
    plt2, = plt.plot(r_mean2, 'g', label='A4C')
    .....
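The overall plotting flow might look like the sketch below. The file names, the two-column log format, and the moving_average helper are assumptions for illustration, not taken from plot.py; only plotby, r_mean1/r_mean2, and the colors/labels come from the chunk above.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

def moving_average(x, w=1000):
    """Trailing moving average to smooth noisy per-episode rewards."""
    w = min(w, len(x))
    return np.convolve(x, np.ones(w) / w, mode="valid")

def plot_comparison(ga3c_file, a4c_file, plotby="time"):
    # Assumed log format: one "elapsed_seconds total_reward" pair per line.
    t1, r1 = np.loadtxt(ga3c_file, unpack=True)
    t2, r2 = np.loadtxt(a4c_file, unpack=True)
    r_mean1, r_mean2 = moving_average(r1), moving_average(r2)
    if plotby == "episodes":
        plt1, = plt.plot(r_mean1, "r", label="GA3C")
        plt2, = plt.plot(r_mean2, "g", label="A4C")
        plt.xlabel("Episode")
    else:
        plt1, = plt.plot(t1[-len(r_mean1):], r_mean1, "r", label="GA3C")
        plt2, = plt.plot(t2[-len(r_mean2):], r_mean2, "g", label="A4C")
        plt.xlabel("Time (s)")
    plt.ylabel("Total reward (moving average)")
    plt.legend(handles=[plt1, plt2])
    plt.savefig("comparison.png")
```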