The agent wants to cross the frozen lake from Start (S) to Goal (G) by walking over Frozen (F) tiles without falling into any Holes (H). Because the lake is slippery, the agent does not always move in the direction it intends[1].
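For reference, this slippery behaviour corresponds to the `is_slippery` flag in Gymnasium's `FrozenLake-v1`; the snippet below is a minimal setup sketch under the assumption that this is the environment version used.

```python
# Minimal environment setup sketch (assumes Gymnasium's FrozenLake-v1).
import gymnasium as gym

# With is_slippery=True the dynamics are stochastic: the agent moves in the
# intended direction with probability 1/3 and slips to each perpendicular
# direction with probability 1/3.
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
```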
We identify this as a stochastic planning problem and use two approaches to solve it: Policy Iteration (PI) and Value Iteration (VI).
- Policy Iteration (PI): Firstly, we randomly initialize a policy and a value function. Secondly, we iterate the value function under the current policy until it converges (policy evaluation). Thirdly, we update the policy to act greedily with respect to that value function (policy improvement), and repeat evaluation and improvement until the policy stops changing.
- Value Iteration (VI): Firstly, we randomly initialize a value function. Secondly, we apply the Bellman optimality update repeatedly until the value function converges to the optimal value function. Thirdly, we extract the greedy policy from that value function.
Our code is based on the following pseudocode[2].
[Pseudocode figures: Policy Iteration and Value Iteration]
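As an illustration (not our exact implementation), here is a minimal Python sketch of both algorithms for a tabular environment such as FrozenLake. It assumes a transition model `P` where `P[s][a]` is a list of `(probability, next_state, reward, done)` tuples, as exposed by `env.unwrapped.P` in Gymnasium; the discount factor and convergence threshold are assumed values.

```python
import numpy as np

GAMMA = 0.99   # discount factor (assumed value)
THETA = 1e-8   # convergence threshold (assumed value)


def one_step_lookahead(P, V, s, n_actions):
    """Return Q(s, a) for every action, given the current value estimate V."""
    q = np.zeros(n_actions)
    for a in range(n_actions):
        for prob, next_s, reward, done in P[s][a]:
            q[a] += prob * (reward + GAMMA * V[next_s] * (not done))
    return q


def policy_iteration(P, n_states, n_actions):
    """Alternate policy evaluation and greedy policy improvement."""
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: sweep V under the current policy until it converges.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = one_step_lookahead(P, V, s, n_actions)[policy[s]]
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < THETA:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        stable = True
        for s in range(n_states):
            best_a = int(np.argmax(one_step_lookahead(P, V, s, n_actions)))
            if best_a != policy[s]:
                stable = False
            policy[s] = best_a
        if stable:
            return policy, V


def value_iteration(P, n_states, n_actions):
    """Iterate the Bellman optimality update, then extract the greedy policy."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = np.max(one_step_lookahead(P, V, s, n_actions))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < THETA:
            break
    policy = np.array([int(np.argmax(one_step_lookahead(P, V, s, n_actions)))
                       for s in range(n_states)])
    return policy, V
```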
- Policy Iteration (PI): [Figure: value function computed by Policy Iteration]
- Value Iteration (VI): [Figure: value function computed by Value Iteration]
We run 50 trials. In each trial we compute the value function and the policy, run the agent with that policy for 100 episodes, and count the number of episodes in which it reaches the goal without falling into a hole.
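A sketch of this evaluation loop, assuming the `policy_iteration` and `value_iteration` functions from the sketch above and Gymnasium's `FrozenLake-v1`, might look like this:

```python
import numpy as np
import gymnasium as gym


def run_trial(solver, n_episodes=100):
    """Solve the MDP once, then count successful episodes under the greedy policy."""
    env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    policy, _ = solver(env.unwrapped.P, n_states, n_actions)

    successes = 0
    for _ in range(n_episodes):
        state, _ = env.reset()
        reward, done = 0.0, False
        while not done:
            state, reward, terminated, truncated, _ = env.step(int(policy[state]))
            done = terminated or truncated
        successes += int(reward == 1.0)  # reward 1 is given only at the goal
    return successes


# 50 independent trials per algorithm, as in the experiment described above.
pi_scores = [run_trial(policy_iteration) for _ in range(50)]
vi_scores = [run_trial(value_iteration) for _ in range(50)]
print(np.mean(pi_scores), np.std(pi_scores), np.min(pi_scores), np.max(pi_scores))
print(np.mean(vi_scores), np.std(vi_scores), np.min(vi_scores), np.max(vi_scores))
```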
Successful episodes out of 100 (statistics over 50 trials):

| | Policy Iteration | Value Iteration |
|---|---|---|
| mean | 78.46 | 78.94 |
| std | 3.45354 | 3.683443 |
| min | 70 | 67 |
| max | 86 | 88 |