Python implementation of the rod maneuvering example (solved with prioritized sweeping) from the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.
There are 2 differences from the original example:
- Instead of 20x20x36 states, there are 21x21x36 states.
- Instead of 4 actions, there are 6 actions. They correspond to:
  - Up: the center of the rod moves up 30 units
  - Down: the center of the rod moves down 30 units
  - Right: the center of the rod moves right 30 units
  - Left: the center of the rod moves left 30 units
  - +10 Degree: the rod rotates +10 degrees about its center
  - -10 Degree: the rod rotates -10 degrees about its center
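The state space and action set above can be sketched as follows. This is a hypothetical illustration, not the project's code: it assumes rod-center coordinates run over 0, 30, ..., 600 on each axis (21 values), angles over 0, 10, ..., 350 (36 values), "up" means decreasing y as in screen coordinates, and a move that would leave the grid keeps the rod in place. Obstacles and the rod's length are ignored here.

```python
from itertools import product

STEP = 30          # translation per move, from the README
ANGLE_STEP = 10    # rotation per move in degrees, from the README
GRID = 21          # center positions per axis (21x21)
N_ANGLES = 36      # 360 / 10

# Hypothetical action table: (dx, dy, dtheta). "up" is -y (screen convention);
# the real project may use different names, order, or conventions.
ACTIONS = {
    "up": (0, -STEP, 0),
    "down": (0, STEP, 0),
    "right": (STEP, 0, 0),
    "left": (-STEP, 0, 0),
    "+10": (0, 0, ANGLE_STEP),
    "-10": (0, 0, -ANGLE_STEP),
}


def all_states():
    """Every (x, y, theta) triple on the assumed lattice."""
    coords = range(0, GRID * STEP, STEP)                   # 0, 30, ..., 600
    angles = range(0, N_ANGLES * ANGLE_STEP, ANGLE_STEP)   # 0, 10, ..., 350
    return list(product(coords, coords, angles))


def apply_action(state, action):
    """Apply one action; angles wrap around, translations are bounded."""
    x, y, theta = state
    dx, dy, dtheta = ACTIONS[action]
    nx, ny = x + dx, y + dy
    max_coord = (GRID - 1) * STEP
    if not (0 <= nx <= max_coord and 0 <= ny <= max_coord):
        return state  # assumption: moving off the grid leaves the rod in place
    return (nx, ny, (theta + dtheta) % 360)


print(len(all_states()))  # 21 * 21 * 36 states
```

Collision checking against the obstacles would replace the simple bounds test in `apply_action`.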
With these actions and states, the shortest path from the start state to the goal state takes 47 steps.
All the obstacles are hard-coded, so if you change the screen resolution, you must also update the obstacle positions.
Also, if you change the initial or goal state, make sure their center coordinates are multiples of 30 and their angles are multiples of 10.
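A shortest-path length such as the 47 steps above can be checked with a breadth-first search over the discretized states. The sketch below is hypothetical: it assumes the same 0..600 lattice as the rod task, takes a placeholder `is_free(state)` predicate standing in for the project's real collision check (by default everything is free), and also rejects start or goal states that are not on the 30-unit / 10-degree lattice, which is why those multiples matter.

```python
from collections import deque

STEP, ANGLE_STEP, MAX_COORD = 30, 10, 600  # assumed lattice: 0..600 in steps of 30
MOVES = [(0, -STEP, 0), (0, STEP, 0), (STEP, 0, 0), (-STEP, 0, 0),
         (0, 0, ANGLE_STEP), (0, 0, -ANGLE_STEP)]


def check_aligned(state):
    """Raise if a state cannot be reached from the lattice by the six actions."""
    x, y, theta = state
    if x % STEP or y % STEP or theta % ANGLE_STEP:
        raise ValueError(f"{state} is not on the 30-unit / 10-degree lattice")


def shortest_path_length(start, goal, is_free=lambda s: True):
    """BFS over (x, y, theta); returns the number of steps, or None if unreachable."""
    check_aligned(start)
    check_aligned(goal)
    start = (start[0], start[1], start[2] % 360)
    goal = (goal[0], goal[1], goal[2] % 360)
    dist = {start: 0}
    frontier = deque([start])
    while frontier:
        s = frontier.popleft()
        if s == goal:
            return dist[s]
        x, y, theta = s
        for dx, dy, dt in MOVES:
            nxt = (x + dx, y + dy, (theta + dt) % 360)
            in_bounds = 0 <= nxt[0] <= MAX_COORD and 0 <= nxt[1] <= MAX_COORD
            if in_bounds and nxt not in dist and is_free(nxt):
                dist[nxt] = dist[s] + 1
                frontier.append(nxt)
    return None


print(shortest_path_length((0, 0, 0), (60, 30, 20)))
```

BFS fits here because every action costs one step, so the first time the goal is dequeued its distance is minimal.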
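Default hyperparameter values:
- α = 0.1 (step size)
- γ = 0.97 (discount factor)
- ε = 0.1 (exploration rate of the ε-greedy policy)
- θ = 0.01 (minimum priority for a state-action pair to enter the queue)
- n = 30 (planning updates per real step)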
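The role of each hyperparameter can be seen in a minimal tabular prioritized sweeping loop modeled on the book's pseudocode. This is a sketch, not this project's implementation: it assumes a deterministic environment exposed through two hypothetical callables, `env_reset()` and `env_step(s, a) -> (reward, next_state, done)`, and a toy five-state chain stands in for the rod task. Stale duplicate queue entries are allowed rather than updated in place, which keeps the heap simple.

```python
import heapq
import random
from collections import defaultdict


def prioritized_sweeping(env_reset, env_step, n_actions, episodes,
                         alpha=0.1, gamma=0.97, epsilon=0.1, theta=0.01,
                         n=30, max_steps=10_000, seed=0):
    """Tabular prioritized sweeping for a deterministic environment."""
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0] * n_actions)
    model = {}                       # (s, a) -> (reward, next_state, done)
    predecessors = defaultdict(set)  # s' -> {(s, a) observed to lead to s'}
    pqueue = []                      # heap of (-priority, tie, s, a)
    tie = 0                          # insertion counter so states are never compared

    def target(r, s2, done):
        # One-step target; terminal states contribute no bootstrap value.
        return r if done else r + gamma * max(Q[s2])

    def push(s, a, priority):
        nonlocal tie
        if priority > theta:  # theta filters out tiny updates
            heapq.heappush(pqueue, (-priority, tie, s, a))
            tie += 1

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return rng.randrange(n_actions)
        best = max(Q[s])
        return rng.choice([a for a, q in enumerate(Q[s]) if q == best])

    total_steps = 0
    for _ in range(episodes):
        s = env_reset()
        for _ in range(max_steps):
            a = epsilon_greedy(s)
            r, s2, done = env_step(s, a)
            total_steps += 1
            model[(s, a)] = (r, s2, done)
            predecessors[s2].add((s, a))
            push(s, a, abs(target(r, s2, done) - Q[s][a]))
            # Planning: up to n highest-priority model updates per real step.
            for _ in range(n):
                if not pqueue:
                    break
                _, _, ps, pa = heapq.heappop(pqueue)
                pr, ps2, pdone = model[(ps, pa)]
                Q[ps][pa] += alpha * (target(pr, ps2, pdone) - Q[ps][pa])
                # Re-prioritize every pair predicted to lead into ps.
                for bs, ba in predecessors[ps]:
                    br, _, bdone = model[(bs, ba)]
                    push(bs, ba, abs(target(br, ps, bdone) - Q[bs][ba]))
            s = s2
            if done:
                break
    return Q, total_steps


# Toy stand-in for the rod task: states 0..4 on a line, goal at 4, +1 on arrival.
def chain_reset():
    return 0


def chain_step(s, a):  # a = 0: left, a = 1: right
    s2 = max(0, s - 1) if a == 0 else s + 1
    return (1.0 if s2 == 4 else 0.0), s2, s2 == 4


Q, steps = prioritized_sweeping(chain_reset, chain_step, n_actions=2, episodes=50)
print(Q[3], steps)
```

For the rod task, the states would be (x, y, angle) tuples and `n_actions` would be 6.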
The reward is 0 on every step except the step that enters the goal state, which earns +1. When the agent reaches the goal state, the episode ends and the rod is reset to its initial position.
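Why this sparse reward still favors short paths follows from the discount. Because the only nonzero reward is the +1 on the last step, the discounted return of an episode that reaches the goal in L steps is

$$
G_0 = \sum_{k=0}^{L-1} \gamma^{k} R_{k+1} = \gamma^{L-1},
\qquad \text{e.g. } \gamma = 0.97,\ L = 47:\ 0.97^{46} \approx 0.246 .
$$

For γ < 1 this is strictly decreasing in L, so maximizing the return is the same as minimizing the number of steps to the goal.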
With these hyperparameters, learning converges after about 130,000 steps on average.
If you use Q-learning instead of prioritized sweeping, about 1,720,000 steps are needed on average.
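For comparison, a minimal one-step tabular Q-learning loop is sketched below under the same hypothetical `env_reset()` / `env_step(s, a) -> (reward, next_state, done)` interface and toy chain; it is not taken from this project. Each real step produces exactly one update and nothing is replayed from a model.

```python
import random
from collections import defaultdict


def q_learning(env_reset, env_step, n_actions, episodes,
               alpha=0.1, gamma=0.97, epsilon=0.1, max_steps=10_000, seed=0):
    """One-step tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0] * n_actions)
    total_steps = 0
    for _ in range(episodes):
        s = env_reset()
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(Q[s])
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            r, s2, done = env_step(s, a)
            total_steps += 1
            # Off-policy target: bootstrap from the greedy value of s2.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break
    return Q, total_steps


def chain_reset():
    return 0


def chain_step(s, a):  # toy chain 0..4, goal at 4; a = 0: left, a = 1: right
    s2 = max(0, s - 1) if a == 0 else s + 1
    return (1.0 if s2 == 4 else 0.0), s2, s2 == 4


Q, steps = q_learning(chain_reset, chain_step, n_actions=2, episodes=50)
print(Q[3], steps)
```

Prioritized sweeping performs up to n = 30 additional model-based updates per real step, focused on the pairs whose values would change most, which is a plausible reason it needs far fewer real steps than Q-learning.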
To install the requirements:
pip install -r requirements.txt
To train from scratch:
python3 main.py -t
With this flag, the initial screen shows a warning. Clicking it starts the animation, but rendering slows learning down, so wait until training has converged somewhat before clicking if you want to see results sooner.
To run with the pre-trained Q-values:
python3 main.py
To slow down the animation when using the pre-trained values, so the actions are easier to follow:
python3 main.py -s
Do not combine the -t and -s flags: with -s the agent waits 0.05 s after each step, which makes training impractically slow (over the roughly 130,000 steps prioritized sweeping needs, the delay alone adds about 6,500 s, or 1.8 hours).
To train with Q-learning instead:
python3 main.py -q -t
This uses Q-learning instead of prioritized sweeping and trains from scratch.
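The three flags described above could be parsed with the standard library's `argparse` roughly as follows. This is a hypothetical sketch, not the project's actual `main.py`: the flag meanings come from this README, while `parse_args` is an illustrative name and the explicit rejection of `-t` together with `-s` is an added design choice (the README only advises against combining them).

```python
import argparse


def parse_args(argv=None):
    """Parse the README's flags; hypothetical stand-in for main.py's parsing."""
    parser = argparse.ArgumentParser(
        description="Rod maneuvering with prioritized sweeping (sketch)")
    parser.add_argument("-t", action="store_true",
                        help="train from scratch instead of loading pre-trained Q-values")
    parser.add_argument("-s", action="store_true",
                        help="wait 0.05 s after each step so the animation is easier to follow")
    parser.add_argument("-q", action="store_true",
                        help="use Q-learning instead of prioritized sweeping")
    args = parser.parse_args(argv)
    if args.t and args.s:
        # Assumption: fail fast instead of training with the per-step delay.
        parser.error("-t and -s should not be combined: the 0.05 s delay "
                     "makes training impractically slow")
    return args


if __name__ == "__main__":
    print(parse_args())
```

`parser.error` prints the usage message and exits with status 2, so a forbidden combination stops before any training starts.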