<!DOCTYPE html>
<html>
<head>
<title>Ch. 7 - Dynamic Programming</title>
<meta name="Ch. 7 - Dynamic Programming" content="text/html; charset=utf-8;" />
<link rel="canonical" href="http://underactuated.mit.edu/dp.html" />
<script src="https://hypothes.is/embed.js" async></script>
<script type="text/javascript" src="chapters.js"></script>
<script type="text/javascript" src="htmlbook/book.js"></script>
<script src="htmlbook/mathjax-config.js" defer></script>
<script type="text/javascript" id="MathJax-script" defer
src="htmlbook/MathJax/es5/tex-chtml.js">
</script>
<script>window.MathJax || document.write('<script type="text/javascript" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js" defer><\/script>')</script>
<link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
<script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
<script>hljs.initHighlightingOnLoad();</script>
<link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
</head>
<body onload="loadChapter('underactuated');">
<div data-type="titlepage">
<header>
<h1><a href="index.html" style="text-decoration:none;">Underactuated Robotics</a></h1>
<p data-type="subtitle">Algorithms for Walking, Running, Swimming, Flying, and Manipulation</p>
<p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
<p style="font-size: 14px; text-align: right;">
© Russ Tedrake, 2023<br/>
Last modified <span id="last_modified"></span>.<br/>
<script>
var d = new Date(document.lastModified);
document.getElementById("last_modified").innerHTML = d.getFullYear() + "-" + (d.getMonth()+1) + "-" + d.getDate();</script>
<a href="misc.html">How to cite these notes, use annotations, and give feedback.</a><br/>
</p>
</header>
</div>
<p><b>Note:</b> These are working notes used for <a
href="https://underactuated.csail.mit.edu/Spring2023/">a course being taught
at MIT</a>. They will be updated throughout the Spring 2023 semester. <a
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture videos are available on YouTube</a>.</p>
<table style="width:100%;"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=stochastic.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=lqr.html>Next Chapter</a></td>
</tr></table>
<script type="text/javascript">document.write(notebook_header('dp'))
</script>
<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 6"><h1>Dynamic Programming</h1>
<p>In chapter 2, we spent some time thinking about the phase portrait of the
simple pendulum, and concluded with a challenge: can we design a nonlinear
controller to <em>reshape</em> the phase portrait, with a very modest amount
of actuation, so that the upright fixed point becomes globally stable? With
unbounded torque, feedback-cancellation solutions (e.g., invert gravity) can
work well, but can also require an unnecessarily large amount of control
effort. The energy-based swing-up control solutions presented for the acrobot
and cart-pole systems are considerably more appealing, but required some
cleverness and might not scale to more complicated systems. Here we
investigate another approach to the problem, using computational optimal
control to synthesize a feedback controller directly.</p>
<section><h1>Formulating control design as an optimization</h1>
<p>In this chapter, we will introduce optimal control - a control design
process using optimization. This approach is powerful for a number of
reasons. First and foremost, it is very general - allowing us to specify
the goal of control equally well for fully- or under-actuated, linear or
nonlinear, deterministic or stochastic, and continuous or discrete
systems. Second, it permits concise descriptions of potentially very
complex desired behaviours, specifying the goal of control as a scalar
objective (plus a list of constraints). Finally, and most importantly,
optimal control is very amenable to numerical solutions.
<elib>Bertsekas00a</elib> is a fantastic reference on this material for
those who want a somewhat rigorous treatment; <elib>Sutton18</elib> is an
excellent (free) reference for those who want something more
approachable.</p>
<p>The fundamental idea in optimal control is to formulate the goal of
control as the <em>long-term</em> optimization of a scalar cost function.
Let's introduce the basic concepts by considering a system that is even
simpler than the simple pendulum.</p>
<example><h1>Optimal Control Formulations for the Double Integrator</h1>
<p>Consider the double integrator system $$\ddot{q} = u, \quad |u| \le
1.$$ If you would like a mechanical analog of the system (I always do),
then you can think about this as a unit mass brick moving along the x-axis
on a frictionless surface, with a control input which provides a
horizontal force, $u$. The task is to design a control system, $u =
\pi(\bx,t)$, $\bx=[q,\dot{q}]^T$ to regulate this brick to $\bx =
[0,0]^T$.</p>
<figure>
<img width="70%" src="figures/double_integrator_brick.svg"/>
<figcaption>The double integrator as a unit-mass brick on a frictionless
surface</figcaption>
</figure>
<p>In order to formulate this control design problem using optimal
control, we must define a scalar objective which scores the long-term
performance of running each candidate control policy, $\pi(\bx,t)$, from
each initial condition, $(\bx_0,t_0)$, and a list of constraints that must
be satisfied. For the task of driving the double integrator to the origin,
one could imagine a number of optimal control formulations which would
accomplish the task, e.g.: <ul> <li> Minimum time: $\min_\pi t_f,$
subject to $\bx(t_0) = \bx_0,$ $\bx(t_f) = {\bf 0}.$ </li> <li> Quadratic
cost: $\min_\pi \int_{t_0}^{\infty} \left[ \bx^T(t) {\bf Q} \bx(t)
\right] dt,$ ${\bf Q}\succ0$.</li> </ul> where the first is a constrained
optimization formulation which optimizes time, and the second accrues a
penalty at every instant according to the distance that the state is away
from the origin (in a metric space defined by the matrix ${\bf Q}$), and
therefore encourages trajectories that go more directly towards the goal,
possibly at the expense of requiring a longer time to reach the goal (in
fact it will result in an exponential approach to the goal, whereas the
minimum-time formulation will arrive at the goal in finite time). Note
that both optimization problems only have well-defined solutions if it is
possible for the system to actually reach the origin, otherwise the
minimum-time problem cannot satisfy the terminal constraint, and the
integral in the quadratic cost would not converge to a finite value as
time approaches infinity (fortunately the double integrator system is
controllable, and therefore can be driven to the goal in finite time).</p>
<p> Note that the input limits, $|u|\le1$ are also required to make this
problem well-posed; otherwise both optimizations would result in the
optimal policy using infinite control input to approach the goal
infinitely fast. Besides input limits, another common approach to limiting
the control effort is to add an additional quadratic cost on the input (or
"effort"), e.g. $\int \left[ \bu^T(t) {\bf R} \bu(t) \right] dt,$ ${\bf
R}\succ0$. This could be added to either formulation above. We will
examine many of these formulations in some detail in the examples worked
out at the end of this chapter. </p>
</example>
<p>Optimal control has a long history in robotics. For instance, there has
been a great deal of work on the minimum-time problem for pick-and-place
robotic manipulators, and the linear quadratic regulator (LQR) and linear
quadratic regulator with Gaussian noise (LQG) have become essential tools
for any practicing controls engineer. With increasingly powerful computers
and algorithms, the popularity of numerical optimal control has grown at an
incredible pace over the last few years.</p>
<example id="minimum_time_double_integrator"><h1>The minimum time problem
for the double integrator</h1>
<p>For more intuition, let's do an informal derivation of the solution to
the minimum time problem for the double integrator with input constraints:
\begin{align*}
\minimize_{\pi} \quad & t_f\\
\subjto \quad & \bx(t_0) = \bx_0, \\
& \bx(t_f) = {\bf 0}, \\
& \ddot{q}(t) = u(t), \\
& |u| \le 1.
\end{align*}
What behavior would you expect an optimal controller to exhibit?</p>
<p> Your intuition might tell you that the best thing that the brick can
do, to reach the goal in minimum time with limited control input, is to
accelerate maximally towards the goal until reaching a critical point,
then hitting the brakes in order to come to a stop exactly at the goal.
This would be called a <em>bang-bang</em> control policy; these are often
optimal for systems with bounded input, and it is in fact optimal for the
double integrator, although we will not prove it until we have developed
more tools. <!-- leave the proof to the pontryagin notes --></p>
<p> Let's work out the details of this bang-bang policy. First, we can
figure out the states from which, when the brakes are fully applied, the
system comes to rest precisely at the origin. Let's start with the case
where $q(0) < 0$, and $\dot{q}(0)>0$, and "hitting the brakes" implies
that $u=-1$ . Integrating the equations, we have \begin{gather*}
\ddot{q}(t) = u = -1 \\\dot{q}(t) = \dot{q}(0) - t \\ q(t) = q(0) +
\dot{q}(0) t - \frac{1}{2} t^2. \end{gather*} Substituting $t =
\dot{q}(0) - \dot{q}$ into the solution reveals that the system orbits
are parabolic arcs: \[ q = -\frac{1}{2} \dot{q}^2 + c_{-}, \] with $c_{-}
= q(0) + \frac{1}{2}\dot{q}^2(0)$.</p>
<figure>
<img width="80%" src="figures/double_integrator_orbits.svg"/>
<figcaption>Two solutions for the system with $u=-1$</figcaption>
</figure>
<!-- t = qdot - qdot0,
q = q0 + qdot0(qdot-qdot0) + 1/2(qdot-qdot0)^2
= q0 + qdot0*qdot - qdot0^2 + 1/2qdot^2 - qdot*qdot0 + 1/2qdot0^2
= 1/2qdot^2 + (q0 - 1/2 qdot0^2)
-->
<p>Similarly, the solutions for $u=1$ are $q = \frac{1}{2} \dot{q}^2 +
c_{+}$, with $c_{+}=q(0)-\frac{1}{2}\dot{q}^2(0)$.</p>
<p> Perhaps the most important of these orbits are the ones that pass
directly through the origin (e.g., $c_{-}=0$). Following our initial
logic, if the system is going slower than this $\dot{q}$ for any $q$, then
the optimal thing to do is to slam on the accelerator
($u=-\text{sgn}(q)$). If it's going faster than the $\dot{q}$ that we've
solved for, then still the best thing to do is to brake; but inevitably
the system will overshoot the origin and have to come back. We can
summarize this policy with: \[ u = \begin{cases} +1 & \text{if } (\dot{q}
< 0 \text{ and } q \le \frac{1}{2} \dot{q}^2) \text{ or } (\dot{q}\ge 0
\text{ and } q < -\frac{1}{2} \dot{q}^2) \\ 0 & \text{if } q=0 \text{ and
} \dot{q}=0 \\ -1 & \text{otherwise} \end{cases} \]</p> <!--This policy is
cartooned in Figure xxx. %Trajectories of the
system %executing this policy are also included - the fundamental
%characteristic is that the system is accelerated as quickly as %possible
toward the switching surface, then rides the switching %surface in to the
origin. -->
<figure>
<img width="80%" src="figures/double_integrator_mintime_policy.svg"/>
<figcaption>Candidate optimal (bang-bang) policy for the minimum-time
double integrator problem.</figcaption>
</figure>
<p>and illustrate some of the optimal solution trajectories:</p>
<figure>
<img width="80%" src="figures/double_integrator_mintime_orbits.svg"/>
<figcaption>Solution trajectories of the system using the optimal
policy</figcaption>
</figure>
<p>And for completeness, we can compute the optimal time to the goal by
solving for the amount of time required to reach the switching surface
plus the amount of time spent moving along the switching surface to the
goal. With a little algebra, you will find that the time to the goal,
$J(\bx)$, is given by \[ J(\bx) = \begin{cases}
2\sqrt{\frac{1}{2}\dot{q}^2-q} - \dot{q} & \text{for } u=+1 \text{
regime}, \\ 0 & \text{for } u=0, \\ \dot{q} +
2\sqrt{\frac{1}{2}\dot{q}^2+q} & \text{for } u=-1, \end{cases} \]<!-- call
t_m the time to the surface, then the time on switching surface =
|qdot(t_m)|
for u=1
q0,qdot0 => qm,qdotm with qm = -1/2 qdotm^2
-1/2 qdotm^2 = 1/2 qdotm^2 + c+
qdotm^2 = - c+ , qdotm = sqrt(-c+)
T = (qdotm-qdot0)+qdotm = 2*sqrt(-c+) - qdot0
for u=-1, qdotm^2 = c- , qdotm = -sqrt(c-)
T = (qdot0-qdotm)-qdotm = qdot0 + 2*sqrt(c-)
-->
plotted here:</p>
<figure>
<iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="data/double_integrator_mintime_cost_to_go.html" height="350" width="100%"></iframe>
<figcaption>Time to the origin using the bang-bang policy</figcaption>
</figure>
<p> Notice that the function is continuous (though not smooth), even
though the policy is discontinuous.</p>
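<p>If you would like to experiment with these expressions numerically, here
is a minimal Python sketch of the bang-bang policy and the time-to-go
formula derived above (the function names are my own, for illustration
only):</p>
<pre><code class="python">
import numpy as np

def bang_bang_policy(q, qdot):
    # Candidate minimum-time policy derived above.
    if q == 0 and qdot == 0:
        return 0.0
    if (qdot &lt; 0 and q &lt;= 0.5 * qdot**2) or (qdot >= 0 and q &lt; -0.5 * qdot**2):
        return 1.0
    return -1.0

def time_to_go(q, qdot):
    # Time to reach the origin under the bang-bang policy, J(x) from the text.
    u = bang_bang_policy(q, qdot)
    if u == 0.0:
        return 0.0
    if u == 1.0:
        return 2.0 * np.sqrt(0.5 * qdot**2 - q) - qdot
    return qdot + 2.0 * np.sqrt(0.5 * qdot**2 + q)

# Starting at rest one unit away from the goal: accelerate for 1s, brake for 1s.
print(time_to_go(-1.0, 0.0))  # 2.0
</code></pre>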
</example> <!-- end of example -->
<subsection><h1>Additive cost</h1>
<p> As we begin to develop theoretical and algorithmic tools for optimal
control, we will see that some formulations are much easier to deal with
than others. One important example is the dramatic simplification that
can come from formulating objective functions using <em>additive
cost</em>, because they often yield recursive solutions. In the additive
cost formulation, the long-term "score" for a trajectory can be written
as $$\int_0^T \ell(x(t),u(t)) dt,$$ where $\ell()$ is the instantaneous
cost (also referred to as the "running cost"), and $T$ can be either a
finite real number or $\infty$. We will call a problem specification
with a finite $T$ a "finite-horizon" problem, and $T=\infty$ an
"infinite-horizon" problem. Problems and solutions for infinite-horizon
problems tend to be more elegant, but care is required to make sure that
the integral converges for the optimal controller (typically by having an
achievable goal that allows the robot to accrue zero-cost).</p>
<p> The quadratic cost function suggested in the double integrator
example above is clearly written as an additive cost. At first glance,
our minimum-time problem formulation doesn't appear to be of this form,
but we actually can write it as an additive cost problem using an
infinite horizon and the instantaneous cost $$\ell(x,u) = \begin{cases}
0 & \text{if } x=0, \\ 1 & \text{otherwise.} \end{cases}$$</p>
<p> We will examine a number of approaches to solving optimal control
problems throughout the next few chapters. For the remainder of this
chapter, we will focus on additive-cost problems and their solution via
<em>dynamic programming</em>.</p>
</subsection> <!-- end of additive cost -->
</section> <!-- control design as an optimization -->
<section><h1>Optimal control as graph search</h1>
<p> For systems with continuous states and continuous actions, dynamic
programming is a set of theoretical ideas surrounding additive cost optimal
control problems. For systems with a finite, discrete set of states and a
finite, discrete set of actions, dynamic programming also represents a set
of very efficient numerical <em>algorithms</em> which can compute optimal
feedback controllers. Many of you will have learned it before as a tool for
graph search. </p>
<p>Imagine you have a directed graph with states (or nodes) $\{s_1,s_2,...\}
\in S$ and "actions" associated with edges labeled as $\{a_1,a_2,...\} \in
A$, as in the following trivial example:</p>
<figure><img width="70%" src="figures/graph_search.svg"/><figcaption>A
simple directed graph.</figcaption></figure>
<p>Let us also assume that each edge has an associated weight or cost, using
$\ell(s,a)$ to denote the cost of being in state $s$ and taking action $a$.
Furthermore we will denote the transition "dynamics" using \[ s[n+1] =
f(s[n],a[n]). \] For instance, in the graph above, $f(s_1,a_1) = s_2$.</p>
<p>There are many algorithms for finding (or approximating) the optimal
path from a start to a goal on directed graphs. In dynamic programming,
the key insight is that we can find the shortest path from every node by
solving recursively for the optimal <em>cost-to-go</em> (the cost that will
be accumulated when running the optimal controller) from every node to the
goal. One such algorithm starts by initializing an estimate $\hat{J}^*=0$
for all $s_i$, then proceeds with an iterative algorithm which sets
\begin{equation} \hat{J}^*(s_i) \Leftarrow \min_{a \in A} \left[
\ell(s_i,a) + \hat{J}^*\left({f(s_i,a)}\right) \right].
\label{eq:value_update} \end{equation} The algorithm is simpler if we
ensure that the update $f(s,a)$ is defined for every state-action pair; we
can add zero-cost self transitions at the goal node for every action. In
software, $\hat{J}^*$ can be represented as a vector with dimension equal
to the number of discrete states. This algorithm, appropriately known as
<em>value iteration</em>, is guaranteed to converge to the optimal
cost-to-go up to an additive constant, $\hat{J}^* \rightarrow J^* + c$
<elib>Bertsekas96</elib>, and in practice does so rapidly. Typically the
update is done in <em>batch</em> -- i.e., the estimate is updated for all
$i$ at once -- but the <em>asynchronous</em> version where states are
updated one at a time is also known to converge, so long as every state is
eventually updated infinitely often. Assuming the graph has a goal state
with a zero-cost self-transition, then this cost-to-go function represents
the weighted shortest distance to the goal. </p>
<p> Value iteration is an amazingly simple algorithm, but it accomplishes
something quite remarkable: it efficiently computes the long-term cost of an
optimal policy from <i>every</i> state by iteratively evaluating the
one-step cost. If we know the optimal cost-to-go, then it's easy to extract
the optimal policy, $a = \pi^*(s)$: \begin{equation} \pi^*(s_i) = \argmin_a
\left[ \ell(s_i,a) + J^*\left( f(s_i,a) \right) \right].
\label{eq:policy_update} \end{equation} It's a simple algorithm, but playing
with an example can help our intuition.</p>
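<p>To make the recursion concrete, here is a minimal Python sketch of the
value iteration update (\ref{eq:value_update}) and the policy extraction
(\ref{eq:policy_update}) on a tiny directed graph. The graph below is made
up for illustration; it is not the graph from the figure above.</p>
<pre><code class="python">
# Next-state and one-step-cost tables; state 3 is the goal, with zero-cost
# self-transitions for every action.
f = {0: {0: 1, 1: 2}, 1: {0: 3, 1: 2}, 2: {0: 3, 1: 0}, 3: {0: 3, 1: 3}}
l = {0: {0: 1, 1: 1}, 1: {0: 1, 1: 1}, 2: {0: 1, 1: 1}, 3: {0: 0, 1: 0}}

J = {s: 0.0 for s in f}          # initialize the cost-to-go estimate
for _ in range(100):             # batch value iteration updates
    J = {s: min(l[s][a] + J[f[s][a]] for a in f[s]) for s in f}

# Extract the optimal policy from the converged cost-to-go.
policy = {s: min(f[s], key=lambda a: l[s][a] + J[f[s][a]]) for s in f}
print(J)       # {0: 2.0, 1: 1.0, 2: 1.0, 3: 0.0}
print(policy)  # e.g. {0: 0, 1: 0, 2: 0, 3: 0}
</code></pre>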
<example id="grid_world"><h1>Grid World</h1>
<p>Imagine a robot living in a grid (finite state) world. It wants to get
to the goal location, and may have to negotiate cells with obstacles.
The actions are to move up, down, left, right, or do nothing.
<elib>Sutton98</elib></p>
<figure>
<iframe style="border:0;height:460px;width:550px;"
src="data/grid_world.html?height=350px" pdf="no"></iframe>
<p pdf="only"><a href="data/grid_world.html">Click here for the
animation.</a></p>
<figcaption>The one-step cost for the grid-world minimum-time problem.
The goal state has a cost of zero, the obstacles have a cost of 10, and
every other state has a cost of 1.</figcaption>
</figure>
<script>document.write(notebook_link('dp'))</script>
</example> <!-- end grid world -->
<todo>figure/text for graph approximation of a continuous state
space.</todo>
<example><h1>Dynamic Programming for the Double Integrator</h1>
<p>You can run value iteration for the double integrator (using
barycentric interpolation to interpolate between nodes) in <drake></drake>
using: </p>
<script>document.write(notebook_link('dp'))</script>
<p>Please do take some time to try different cost functions by
editing the code yourself.</p>
</example>
<p> Let's take a minute to appreciate how amazing this is. Our solution to
finding the optimal controller for the double integrator wasn't all that
hard, but it required some mechanical intuition and solutions to
differential equations. The resulting policy was non-trivial -- bang-bang
control with a parabolic switching surface. The value iteration algorithm
doesn't use any of this directly -- it's a simple algorithm for graph
search. But remarkably, it can generate effectively the same policy with
just a few moments of computation.</p>
<p>It's important to note that there <em>are</em> some differences between
the computed policy and the optimal policy that we derived, due to
discretization errors. We will ask you to explore these in the
problems.</p>
<p>The real value of this numerical solution, however, is that, unlike our
analytical solution for the double integrator, we can apply this same
algorithm to any number of dynamical systems virtually without modification.
Let's apply it now to the simple pendulum, which was intractable
analytically.</p>
<example><h1>Dynamic Programming for the
Simple Pendulum</h1>
<p>You can run value iteration for the simple pendulum (using barycentric
interpolation to interpolate between nodes) in <drake></drake> using:</p>
<script>document.write(notebook_link('dp'))</script>
<p>Again, you can easily try different cost functions by
editing the code yourself.</p>
</example>
</section> <!-- end of graph search -->
<section id="continuous"><h1>Continuous dynamic programming</h1>
<p> I find the graph search algorithm extremely satisfying as a first step,
but also become quickly frustrated by the limitations of the discretization
required to use it. In many cases, we can do better; coming up with
algorithms which work more natively on continuous dynamical systems. We'll
explore those extensions in this section.</p>
<subsection><h1>The Hamilton-Jacobi-Bellman Equation</h1>
<p> It's important to understand that the value iteration equations,
equations (\ref{eq:value_update}) and (\ref{eq:policy_update}), are more
than just an algorithm. They are also sufficient conditions for
optimality: if we can produce a $J^*$ and $\pi^*$ which satisfy these
equations, then $\pi^*$ must be an optimal controller. There are an
analogous set of conditions for the continuous systems. For a system
$\dot{\bx} = f(\bx,\bu)$ and an infinite-horizon additive cost
$\int_0^\infty \ell(\bx,\bu)dt$, we have: \begin{gather} 0 = \min_\bu \left[
\ell(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu) \right], \label{eq:HJB} \\
\pi^*(\bx) = \argmin_\bu \left[ \ell(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu)
\right]. \end{gather} Equation \ref{eq:HJB} is known as the
<em>Hamilton-Jacobi-Bellman</em> (HJB) equation.
<sidenote>Technically, a Hamilton-Jacobi equation is a PDE whose <em>time
derivative</em> depends on the first-order partial derivatives over
state, which we get in the finite-time derivation; Eq. \ref{eq:HJB} is
the steady-state solution of the Hamilton-Jacobi equation.</sidenote>
<elib>Bertsekas05</elib><!-- chapter 3 --> gives an informal derivation
of these equations as the limit of a discrete-time approximation of the
dynamics, and also gives the famous "sufficiency theorem". The treatment
in <elib>Bertsekas05</elib> is for the finite-horizon case; I have
modified it to one particular form of an infinite-horizon case here.</p>
<theorem><h1>HJB Sufficiency Theorem <i>(stated informally)</i></h1>
<p>Consider a system $\dot{\bx}=f(\bx,\bu)$ and an infinite-horizon
additive cost $\int_0^\infty \ell(\bx,\bu)dt$, with $f$ and $\ell$
continuous functions, and $\ell$ a strictly-positive-definite function
that obtains zero only at a unique state $\bx^*$. Suppose $J(\bx)$ is a
solution to the HJB equation: $J$ is continuously differentiable in
$\bx$ and is such that \[ 0 = \min_{\bu \in U} \left[\ell(\bx,\bu) +
\pd{J}{\bx}f(\bx,\bu) \right],\quad \text{for all } \bx. \] Further
assume that $J(\bx)$ is positive definite and that $\pi(\bx)$ is the
minimizer for all $\bx$. Then, under some technical conditions on the
existence and boundedness of solutions, we have that $J(\bx) -
J(\bx^*)$ is the optimal cost-to-go and $\pi$ is an optimal policy.</p>
<p style="font-size:smaller;"><details><summary>Here is a more formal version from a personal communication with <a href="http://www.mit.edu/~ameg/">Sasha Megretski</a>.</summary>
<p>Given an open subset $\Omega\subset\mathbb R^n$, with a selected
element $x^*$, a subset $U\subset\mathbb R^m$, continuous functions
$f:~\Omega\times U\to\mathbb R^n$, $g:~\Omega\times U\to[0,\infty)$,
continuously differentiable function $V:~\Omega\to[0,\infty)$, and a
function $\mu:~\Omega\to U$, such that <ol type="a"> <li>
$g(x,u)$ is strictly positive definite, in the sense that
$\lim_{k\to\infty}x_k=x^*$ for every sequence of vectors
$x_k\in\Omega$, $u_k\in U$ ($k=1,2,\dots$) such that
$\lim_{k\to\infty}g(x_k,u_k)=0$; </li><li> the function
$g_V:~\Omega\times U\to\mathbb R$, defined by
\begin{equation}\label{e0}
g_V(x,u)=g(x,u)+\frac{dV(x)}{dx}f(x,u),\end{equation} is non-negative,
and achieves zero value whenever $u=\mu(x)$; </li><li> for every
$x_0\in\Omega$, the system of equations \begin{equation}\label{e1} \dot
x(t)=f(x(t),u(t)),~x(0)=x_0~\left(\text{i.e.,}~
x(t)=x_0+\int_0^tf(x(\tau),u(\tau))d\tau\right)\end{equation}
\begin{equation}\label{e2} u(t)=\mu(x(t)),\end{equation} has at least
one piecewise continuous solution $x:~[0,\infty)\to\Omega$,
$u:~[0,\infty)\to U$. </ol>Then, for every $x_0\in\Omega$, the
functional \[ J(x,u)=\int_0^\infty g(x(t),u(t))dt,\] defined on the
set of all piecewise continuous pairs $(x,u)$ satisfying (\ref{e1}),
achieves its minimal value of $V(x_0)-V(x^*)$ at every piecewise
continuous solution of (\ref{e1}), (\ref{e2}).
</p>
<p><b>Proof.</b>
First, observe that, for all piecewise continuous $(x,u)$ satisfying
(\ref{e1}),
<ol type="I">
<li> due to (\ref{e0}),
\begin{equation}\label{e3}
\int_0^Tg(x(t),u(t))dt=V(x_0)-V(x(T))+\int_0^Tg_V(x(t),u(t))dt;
\end{equation}
</li>
<li> whenever $J(x,u)<\infty$, there exists a sequence
of positive numbers $T_k$, $k=1,2,\dots$, $T_k\to+\infty$, such that
$g(x(T_k),u(T_k))\to0$, which, according to assumption (a), implies
$x(T_k)\to x^*$ as $k\to\infty$.</li>
</ol>
In particular, for a piecewise continuous solution $(x,u)$ of
(\ref{e1}), satisfying, in addition, equation (\ref{e2}), condition
(I) means \[ \int_0^Tg(x(t),u(t))dt=V(x_0)-V(x(T))\le V(x_0).\] Since
the upper bound $V(x_0)$ does not depend on $T$, and
$g(x(t),u(t))\ge0$ is non-negative, we conclude that \[ \int_0^\infty
g(x(t),u(t))dt<\infty.\] Therefore observation (II) applies, and, for
the corresponding $T_k$, \begin{align*} J(x,u)&=\int_0^\infty
g(x(t),u(t))dt=\lim_{k\to\infty}\int_0^{T_k}g(x(t),u(t))dt \\&=
\lim_{k\to\infty}V(x_0)-V(x(T_k))=V(x_0)-V(x^*).\end{align*}
Moreover, since $g$ is non-negative, the inequality $V(x_0)\ge
V(x^*)$ must hold for every $x_0\in\Omega$.</p>
<p>To finish the proof, for an arbitrary piecewise continuous
solution of (\ref{e1}), equation (\ref{e3}), together with
non-negativity of $g_V$, implies \[ J(x,u)=\int_0^Tg(x(t),u(t))dt\ge
V(x_0)-V(x(T)).\] When $J(x,u)<\infty$, applying this to $T=T_k$,
where $T_k$ are described in observation (II), and taking the limit
as $k\to\infty$ (and $x(T_k)\to x^*$) yields $J(x,u)\ge
V(x_0)-V(x^*)$. $\Box$
</details></p>
</theorem>
<p>It is possible to prove sufficiency under different assumptions, too.
The particular assumptions here were chosen to ensure that $J(\bx(0)) <
\infty$ implies that $\bx(t) \rightarrow \bx^*$. As Sasha says, "without
something like this, all sorts of counter-examples emerge."</p>
<p>As a tool for verifying optimality, the HJB equations are actually
surprisingly easy to work with: we can verify optimality for an
infinite-horizon objective without doing any integration; we simply have
to check a derivative condition on the optimal cost-to-go function $J^*$.
Let's see this play out on the double integrator example.</p>
<example id="hjb_double_integrator"><h1>HJB for the Double
Integrator</h1>
<p>Consider the problem of regulating the double integrator (this time
without input limits) to the origin using a quadratic cost: $$
\ell(\bx,\bu) = q^2 + \dot{q}^2 + u^2. $$ I claim (without
derivation) that the optimal controller for this objective is
$$\pi(\bx) = -q - \sqrt{3}\dot{q}.$$ To convince you that this is
indeed optimal, I have produced the following cost-to-go function:
$$J(\bx) = \sqrt{3} q^2 + 2 q \dot{q} + \sqrt{3} \dot{q}^2.$$</p>
<p>Taking \begin{gather*} \pd{J}{q} = 2\sqrt{3} q + 2\dot{q}, \qquad
\pd{J}{\dot{q}} = 2q + 2\sqrt{3}\dot{q}, \end{gather*} we can write
\begin{align*} \ell(\bx,\bu) + \pd{J}{\bx}f(\bx,\bu) &= q^2 + \dot{q}^2
+ u^2 + (2\sqrt{3} q + 2\dot{q}) \dot{q} + (2q + 2\sqrt{3}\dot{q}) u
\end{align*} This is a convex quadratic function in $u$, so we can
find the minimum with respect to $u$ by finding where the gradient with
respect to $u$ evaluates to zero. \[ \pd{}{u} \left[ \ell(\bx,\bu) +
\pd{J}{\bx} f(\bx,\bu) \right] = 2u + 2q + 2\sqrt{3}\dot{q}. \]
Setting this equal to $0$ and solving for $u$ yields: $$u^* = -q -
\sqrt{3} \dot{q},$$ thereby confirming that our policy $\pi$ is in fact
the minimizer. Substituting $u^*$ back into the HJB reveals that the
right side does in fact simplify to zero. I hope you are
convinced!</p>
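<p>If you prefer to let the computer do the algebra, this derivation is
easy to verify symbolically; here is a short (unofficial) check using
sympy:</p>
<pre><code class="python">
import sympy as sp

q, qdot, u = sp.symbols('q qdot u', real=True)
ell = q**2 + qdot**2 + u**2                            # running cost
J = sp.sqrt(3)*q**2 + 2*q*qdot + sp.sqrt(3)*qdot**2    # claimed cost-to-go
dJdx = sp.Matrix([[sp.diff(J, q), sp.diff(J, qdot)]])  # dJ/dx as a row vector
f = sp.Matrix([qdot, u])                               # double integrator dynamics

hjb = ell + (dJdx * f)[0]                   # l(x,u) + dJ/dx f(x,u)
u_star = sp.solve(sp.diff(hjb, u), u)[0]    # minimize the convex quadratic in u
print(u_star)                               # -q - sqrt(3)*qdot
print(sp.simplify(hjb.subs(u, u_star)))     # 0
</code></pre>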
</example>
<p>Note that evaluating the HJB for the time-to-go of the minimum-time
problem for the double integrator will also reveal that the HJB is
satisfied wherever that gradient is well-defined. This is certainly
mounting evidence in support of our bang-bang policy being optimal, but
since $\pd{J}{\bx}$ is not defined everywhere, it does not actually
satisfy the requirements of the sufficiency theorem as stated above. In
some sense, the assumption in the sufficiency theorem that $\pd{J}{\bx}$
is defined everywhere makes it very weak.</p>
</subsection> <!-- end HJB -->
<subsection id="hjb_minimizing_control"><h1>Solving for the minimizing
control</h1>
<p>We still face a few barriers to actually using the HJB in an algorithm.
The first barrier is the minimization over $u$. When the action set was
discrete, as in the graph search version, we could evaluate the one-step
cost plus cost-to-go for every possible action, and then simply take the
best. For continuous action spaces, in general we cannot rely on the
strategy of evaluating a finite number of possible $\bu$'s to find the
minimizer.</p>
<p>All is not lost. In the quadratic cost double integrator example
above, we were able to solve explicitly for the minimizing $\bu$ in
terms of the cost-to-go. It turns out that this strategy will actually
work for a number of the problems we're interested in, even when the
system (which we are given) or cost function (which we are free to pick,
but which should be expressive) gets more complicated.</p>
<p>Recall that I've already tried to convince you that a majority of the
systems of interest are <em>control affine</em>, e.g. I can write \[
f(\bx,\bu) = f_1(\bx) + f_2(\bx)\bu. \] We can make another dramatic
simplification by restricting ourselves to instantaneous cost functions
of the form \[ \ell(\bx,\bu) = \ell_1(\bx) + \bu^T {\bf R} \bu, \qquad
{\bf R}={\bf R}^T \succ 0. \] In my view, this is not very restrictive -
many of the cost functions that I find myself choosing to write down can
be expressed in this form. Given these assumptions, we can write the HJB
as \[ 0 = \min_{\bu} \left[ \ell_1(\bx) + \bu^T {\bf R} \bu + \pd{J}{\bx}
\left[ f_1(\bx) + f_2(\bx)\bu \right]\right]. \] Since this is a
positive quadratic function in $\bu$, if the system does not have any
constraints on $\bu$, then we can solve in closed-form for the minimizing
$\bu$ by taking the gradient of the right-hand side: \[ \pd{}{\bu} =
2\bu^T {\bf R} + \pd{J}{\bx} f_2(\bx) = 0, \] and setting it equal to
zero to obtain \[ \bu^* = -\frac{1}{2}{\bf R}^{-1}f_2^T(\bx)
\pd{J}{\bx}^T.\] If there are linear constraints on the input, such as
torque limits, then more generally this could be solved (at any
particular $\bx$) as a quadratic program.</p>
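<p>Here is a small Python sketch of this closed-form minimizer, $\bu^* =
-\frac{1}{2}{\bf R}^{-1}f_2^T(\bx)\pd{J}{\bx}^T$, evaluated (as an assumed
example) on the double integrator with the quadratic cost-to-go from the
HJB example above:</p>
<pre><code class="python">
import numpy as np

def u_star(dJdx, f2_x, R):
    # u* = -1/2 R^{-1} f2(x)^T (dJ/dx)^T, for unconstrained control-affine
    # systems with quadratic input cost.  dJdx is a row vector, f2_x is n-by-m.
    return -0.5 * np.linalg.solve(R, f2_x.T @ dJdx.T)

# Double integrator: f2(x) = [0, 1]^T, R = [[1]], and the cost-to-go
# J = sqrt(3) q^2 + 2 q qdot + sqrt(3) qdot^2 from the example above.
q, qdot = 1.0, 0.0
dJdx = np.array([[2*np.sqrt(3)*q + 2*qdot, 2*q + 2*np.sqrt(3)*qdot]])
f2_x = np.array([[0.0], [1.0]])
R = np.array([[1.0]])
print(u_star(dJdx, f2_x, R))   # [[-1.]]  (matches -q - sqrt(3)*qdot here)
</code></pre>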
<p>Exploiting the ability to solve for the optimal action in closed form
is also nice for generating benchmark problems for numerical optimal
control, as it can be used for "converse optimal control"
<elib>Doyle96</elib>.</p>
<example id="optimal_cubic_polynomial">
<h1>Converse optimal control for a cubic polynomial</h1>
<p>Consider a one-dimensional system of the form $\dot{x} = f_1(x) +
xu,$ and the running cost function $\ell(x,u) = x^2 + u^2.$ The optimal
policy is $u^* = -\frac{1}{2}x\pd{J^*}{x},$ leading to the HJB $$0 =
x^2 - \frac{1}{4}x^2\left[\pd{J^*}{x}\right]^2 + \pd{J^*}{x} f_1(x).$$
Choosing $J^*(x) = x^2$, we find that this is the cost-to-go when
$f_1(x) = -\frac{1}{2}x + \frac{1}{2}x^3$ and $u^* = -x^2.$
</p>
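<p>Here is a quick (unofficial) sympy check of this example:</p>
<pre><code class="python">
import sympy as sp

x = sp.symbols('x', real=True)
J = x**2                                               # chosen cost-to-go
f1 = -sp.Rational(1, 2)*x + sp.Rational(1, 2)*x**3     # claimed f1(x)
u_star = -sp.Rational(1, 2) * x * sp.diff(J, x)        # u* = -1/2 x dJ*/dx
hjb = x**2 - sp.Rational(1, 4)*x**2*sp.diff(J, x)**2 + sp.diff(J, x)*f1
print(sp.simplify(u_star))   # -x**2
print(sp.simplify(hjb))      # 0
</code></pre>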
</example>
<p> What happens in the case where our system is not control affine or if
we really do need to specify an instantaneous cost function on $\bu$
that is not simply quadratic? If the goal is to produce an iterative
algorithm, like value iteration, then one common approach is to make a
(positive-definite) quadratic approximation in $\bu$ of the HJB, and
update that approximation on every iteration of the algorithm. This
broad approach is often referred to as <em>differential dynamic
programming</em> (c.f. <elib>Jacobson70</elib>). </p>
</subsection> <!-- end solve for u -->
<subsection><h1>Numerical solutions for $J^*$</h1>
<p> The other major barrier to using the HJB in a value iteration
algorithm is that the estimated optimal cost-to-go function, $\hat{J}^*$,
must somehow be represented with a finite set of numbers, but we don't yet
know anything about the potential form it must take. In fact, knowing the
time-to-goal solution for the minimum-time problem with the double integrator,
we see that this function might need to be non-smooth for even very simple
dynamics and objectives.</p>
<p>One natural way to parameterize $\hat{J}^*$ -- a scalar
valued-function defined over the state space -- is to define the values
on a mesh. This approach then admits algorithms with close ties to the
very advanced numerical methods used to solve other partial
differential equations (PDEs), such as the ones that appear in finite
element modeling or fluid dynamics. One important difference, however, is
that our PDE lives in the dimension of the state space, while many of the
<a href="https://en.wikipedia.org/wiki/Types_of_mesh">mesh
representations</a> from the other disciplines are optimized for two or
three dimensional space. Also, our PDE may have discontinuities (or at
least discontinuous gradients) at locations in the state space which are
not known a priori.</p>
<p>A slightly more general view of the problem would describe the mesh
(and the associated interpolation functions) as just one form of
representation for <a
href="https://en.wikipedia.org/wiki/Function_approximation">function
approximation</a>. Using a <a
href="https://en.wikipedia.org/wiki/Deep_learning">
neural network</a> to represent the cost-to-go also falls under the
domain of function approximation, perhaps representing the other extreme
in terms of complexity; using neural networks in approximate dynamic
programming is common in <a
href="http://rail.eecs.berkeley.edu/deeprlcourse/">reinforcement
learning</a>, which we will discuss more later in the book.</p>
<todo> (see Appendix C for a brief background on function approximation)
</todo>
<subsubsection id="function_approximation">
<h1>Value iteration with function approximation</h1>
<p>If we approximate $J^*$ with a finitely-parameterized function
$\hat{J}_\balpha^*$, with parameter vector $\balpha$, then this
immediately raises many important questions. If the true cost-to-go
function does not live in the prescribed function class -- e.g., there
does not exist an $\balpha$ which satisfies the sufficiency conditions
for all $\bx$ -- then we might instead write a cost function to
match the optimality conditions as closely as possible. But our choice
of cost function is important. Are errors in all states equally
important? If we are balancing an Acrobot, presumably having an
accurate cost-to-go in the vicinity of the upright condition is more
important than having it accurate at some rarely visited state. Even if we can define a cost
function that we like, then can we say anything about the convergence
of our algorithm in this more general case?</p>
<p>Let us start by considering a least-squares approximation of the
value iteration update.</p>
<p>Using the least squares solution in a value iteration update is
sometimes referred to as <i>fitted value iteration</i>, where $\bx_k$
are a set of samples taken from the continuous space. For
discrete-time systems, the iterative approximate solution to
\begin{gather*} J^*(\bx_0) = \min_{\bu[\cdot]} \sum_{n=0}^\infty
\ell(\bx[n],\bu[n]), \\ \text{ s.t. } \bx[n+1] = f(\bx[n], \bu[n]),
\bx[0] = \bx_0\end{gather*} becomes \begin{gather} J^d_k = \min_\bu
\left[ \ell(\bx_k,\bu) + \hat{J}^*_\balpha\left({f(\bx_k,\bu)}\right)
\right], \\ \balpha \Leftarrow \argmin_\balpha \sum_k
\left(\hat{J}^*_\balpha(\bx_k) - J^d_k \right)^2.
\label{eq:fitted_value_iteration} \end{gather} Since the desired values
$J^d_k$ are only an initial guess of the cost-to-go, we will apply this
algorithm iteratively until (hopefully) we achieve some numerical
convergence.</p>
<p>Note that the update in \eqref{eq:fitted_value_iteration} is not
<i>quite</i> the same as doing least-squares optimization of $$\sum_k
\left(\hat{J}^*_\balpha(\bx_k) - \min_\bu \left[ \ell(\bx_k,\bu) +
\hat{J}^*_\balpha\left({f(\bx_k,\bu)}\right) \right] \right)^2,$$
because in this equation $\balpha$ has an effect on both occurrences of
$\hat{J}^*$. In \eqref{eq:fitted_value_iteration}, we cut that
dependence by taking $J_k^d$ as fixed desired values; this version
performs better in practice. Like many things, this is an old idea
that has been given a new name in the deep reinforcement learning
literature -- people think of the $\hat{J}^*_\balpha$ on the right-hand
side as being the output from a fixed "target network". For nonlinear
function approximators, the update to $\balpha$ in
\eqref{eq:fitted_value_iteration} is often replaced by a few steps of
gradient descent.</p>
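<p>Here is a minimal sketch of fitted value iteration with fixed targets,
applied to a time-discretized double integrator with a quadratic running
cost. The time step, feature choice, sample grid, and the discount factor
(added to keep the iteration well behaved; see the discussion of
discounting below) are all assumptions made for illustration:</p>
<pre><code class="python">
import numpy as np

h = 0.1                                        # time step (assumed)
gamma = 0.99                                   # discount factor (assumed)
actions = np.linspace(-1.0, 1.0, 9)

def f(x, u):                                   # discrete-time double integrator
    return np.array([x[0] + h * x[1], x[1] + h * u])

def ell(x, u):                                 # quadratic running cost
    return h * (x[0]**2 + x[1]**2 + u**2)

def features(x):                               # quadratic features in (q, qdot)
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

samples = [np.array([q, qd]) for q in np.linspace(-3, 3, 15)
                             for qd in np.linspace(-3, 3, 15)]
Psi = np.array([features(x) for x in samples])

alpha = np.zeros(3)
for _ in range(100):
    alpha_target = alpha.copy()                # frozen "target" parameters
    Jhat = lambda x: features(x) @ alpha_target
    Jd = np.array([min(ell(x, u) + gamma * Jhat(f(x, u)) for u in actions)
                   for x in samples])          # desired values J^d_k
    alpha, *_ = np.linalg.lstsq(Psi, Jd, rcond=None)   # least-squares refit
print(alpha)
</code></pre>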
</subsubsection>
<subsubsection><h1>Linear function approximators</h1>
<p>In general, the convergence and accuracy guarantees for value
iteration with generic function approximators are quite weak. But we
do have some results for the special case of <em>linear function
approximators</em>. A linear function approximator takes the form: \[
\hat{J}^*_\balpha(\bx) = \sum_i \alpha_i \psi_i(\bx) = \bpsi^T(\bx)
\balpha, \] where $\bpsi(\bx)$ is a vector of potentially nonlinear
features. Common examples of features include polynomials, radial
basis functions, or most interpolation schemes used on a mesh. The
distinguishing feature of a linear function approximator is the ability
to exactly solve for $\balpha$ in order to represent a desired function
optimally, in a least-squares sense. For linear function
approximators, this is simply: \begin{gather*} \balpha \Leftarrow
\begin{bmatrix} \bpsi^T(\bx_1) \\ \vdots \\
\bpsi^T(\bx_K)\end{bmatrix}^+ \begin{bmatrix} J^d_1 \\ \vdots \\ J^d_K
\end{bmatrix}, \end{gather*} where the $^+$ notation refers to a
Moore-Penrose pseudoinverse. In finite state and action settings,
fitted value iteration with linear function approximators is known to
converge to the globally optimal $\balpha^*$<elib>Tsitsiklis97</elib>.
<elib>Munos08</elib> extends these results and treats the continuous
state-action case.</p>
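<p>As a small, self-contained illustration of this least-squares fit (with
made-up features, sample points, and desired values):</p>
<pre><code class="python">
import numpy as np

def features(x):                     # psi(x): a polynomial feature vector (assumed)
    return np.array([1.0, x, x**2])

xk = np.linspace(-2.0, 2.0, 50)      # sample states x_k
Jd = xk**2 + 0.5                     # desired values J^d_k (made up here)

Psi = np.array([features(x) for x in xk])     # row k is psi^T(x_k)
alpha = np.linalg.pinv(Psi) @ Jd              # Moore-Penrose pseudoinverse solution
print(alpha)                                  # approximately [0.5, 0., 1.]
</code></pre>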
<p>There can be some subtleties required for achieving convergence in
practice. <a href="lqr.html#fvi">In the next chapter</a>, we'll see why
exponentially-discounted cost functions are often used to help value
iteration converge even for infinite-horizon formulations.</p>
</subsubsection>
<subsubsection id="barycentric"><h1>Value iteration on a mesh</h1>
<p>Imagine that we use a mesh to approximate the cost-to-go function
over the state space with $K$ mesh points $\bx_k$. We would like to
perform the value iteration update: \begin{equation} \forall k,
\hat{J}^*(\bx_k) \Leftarrow \min_\bu \left[ \ell(\bx_k,\bu) +
\hat{J}^*\left({f(\bx_k,\bu)}\right) \right],
\label{eq:mesh_value_iteration} \end{equation} but must deal with the
fact that $f(\bx_k,\bu)$ might not result in a next state that is
directly at a mesh point. Most interpolation schemes for a mesh can be
written as some weighted combination of the values at nearby mesh
points, e.g. \[ \hat{J}^*(\bx) = \sum_i \beta_i(\bx) \hat{J}^*(\bx_i),
\quad \sum_i \beta_i = 1 \] with $\beta_i$ the relative weight of the
$i$th mesh point. In <drake></drake> we have implemented barycentric
interpolation<elib>Munos98</elib>. Taking $\alpha_i =
\hat{J}^*(\bx_i)$, the cost-to-go estimate at mesh point $i$, we can
see that this is precisely an important special case of fitted value
iteration with linear function approximation. Furthermore, assuming
$\beta_i(\bx_i) = 1$ (i.e., the only point contributing to the
interpolation <i>at a mesh point</i> is the value at the mesh point
itself), the update in Eq. (\ref{eq:mesh_value_iteration}) is precisely
the least-squares update (and it achieves zero error). This is the
representation used in the value iteration examples that you've already
experimented with above.</p>
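<p>A minimal sketch of this mesh-based update, for a hypothetical scalar
system $x[n+1] = x + hu$ with a quadratic running cost and simple linear
interpolation between mesh points (all parameters here are assumptions for
illustration):</p>
<pre><code class="python">
import numpy as np

h = 0.1
mesh = np.linspace(-2.0, 2.0, 41)              # mesh points x_k
actions = np.linspace(-1.0, 1.0, 9)

def ell(x, u):
    return h * (x**2 + u**2)

def interp(J, x):
    # hat{J}(x) = sum_i beta_i(x) J_i with the usual two-point linear weights;
    # np.interp clamps at the mesh boundary.
    return np.interp(x, mesh, J)

J = np.zeros_like(mesh)
for _ in range(500):                           # value iteration on the mesh
    J = np.array([min(ell(x, u) + interp(J, x + h * u) for u in actions)
                  for x in mesh])
print(J[np.argmin(np.abs(mesh))])              # approximately zero at the origin
</code></pre>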
</subsubsection>
<subsubsection id="neural"><h1>Neural fitted value iteration</h1>
<example><h1>Neural Fitted Value Iteration</h1>
<p>Let us try reproducing our double-integrator value iteration
examples using neural networks:</p>
<script>document.write(notebook_link('dp'))</script>
<!--
<p>In this example, you'll see that we use two different approaches to
minimizing over $\bu$. First, we use discrete actions where we simply
evaluate all possible actions and keep the best. Second, we explore
the case which exploits the quadratic cost and the control-affine
dynamics -- but note that we also have to go to continuous time to
have the closed-form solution. (Can you understand why?)</p>
-->
</example>
<example><h1>Continuous Neural Fitted Value Iteration</h1>
<script>document.write(notebook_link('dp'))</script>
</example>
</subsubsection>
<subsubsection><h1>Continuous-time systems</h1>
<p>For solutions to systems with continuous-time dynamics, I have to
uncover one of the details that I've so far been hiding to keep the
notation simpler. Let us consider a problem with a finite-horizon:
\begin{gather*} \min_{\bu[\cdot]} \sum_{n=0}^N \ell(\bx[n],\bu[n]), \\
\text{ s.t. } \bx[n+1] = f(\bx[n], \bu[n]), \bx[0] = \bx_0\end{gather*}
In fact, the way that we compute this is by solving the
<em>time-varying cost-to-go function</em> backwards in time:
\begin{gather*}J^*(\bx,N) = \min_\bu \ell(\bx, \bu) \\ J^*(\bx,n-1) =
\min_\bu \left[ \ell(\bx, \bu) + J^*(f(\bx,\bu), n) \right].
\end{gather*} The convergence of the value iteration update is
equivalent to solving this time-varying cost-to-go backwards in time
until it reaches a steady-state solution (the infinite-horizon
solution). This explains why value iteration only converges if the
optimal cost-to-go is bounded.</p>
<p>Now let's consider the continuous-time version. Again, we have a
time-varying cost-to-go, $J^*(\bx,t)$. Now $$\frac{dJ^*}{dt} =
\pd{J^*}{\bx}f(\bx,\bu) + \pd{J^*}{t},$$ and our sufficiency condition
is $$0 = \min_\bu \left[\ell(\bx, \bu) + \pd{J^*}{\bx}f(\bx,\bu) +
\pd{J^*}{t} \right].$$ But since $\pd{J^*}{t}$ doesn't depend on $\bu$,
we can pull it out of the $\min$ and write the (true) HJB:
$$-\pd{J^*}{t} = \min_\bu \left[\ell(\bx, \bu) + \pd{J^*}{\bx}f(\bx,\bu)
\right].$$ The standard numerical recipe <elib>Osher03 </elib> for
solving this is to approximate $\hat{J}^*(\bx,t)$ on a mesh and then
integrate the equations backwards in time (until convergence, if the
goal is to find the infinite-horizon solution). If, for mesh point
$\bx_i$ we have $\alpha_i(t) = \hat{J}^*(\bx_i, t)$, then:
$$-\dot\alpha_i(t) = \min_\bu \left[\ell(\bx_i, \bu) + \pd{J^*(\bx_i,
t)}{\bx}f(\bx_i,\bu) \right],$$ where the partial derivative is
estimated with a suitable finite-difference approximation on the mesh
and often some "viscosity" terms are added to the right-hand side to
provide additional numerical robustness; see the Lax-Friedrichs scheme
<elib>Osher03 </elib> (section 5.3.1) for an example. </p>
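<p>As a toy illustration of this recipe (without the viscosity terms), here
is a sketch for the hypothetical scalar system $\dot{x} = u$ with running
cost $\ell = x^2 + u^2$, for which the minimizing control is $u^* =
-\frac{1}{2}\pd{J}{x}$ and the true infinite-horizon cost-to-go is $J^*(x)
= x^2$:</p>
<pre><code class="python">
import numpy as np

mesh = np.linspace(-2.0, 2.0, 201)             # mesh over the state space
dx = mesh[1] - mesh[0]
dt = 1e-3                                      # small enough for stability here
J = np.zeros_like(mesh)                        # terminal condition J(x, T) = 0

for _ in range(3000):                          # integrate backwards in time
    dJdx = np.gradient(J, dx)                  # finite-difference gradient
    min_over_u = mesh**2 - 0.25 * dJdx**2      # min_u [ l(x,u) + dJ/dx * u ]
    J = J + dt * min_over_u                    # J(t - dt) = J(t) + dt * min_u[...]

# Roughly recovers J*(x) = x^2, up to discretization and finite-horizon error.
print(np.max(np.abs(J - mesh**2)))
</code></pre>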
<p>Probably the most visible and consistent campaign using numerical HJB
solutions in applied control (at least in robotics) has come from <a
href="http://hybrid.eecs.berkeley.edu/index.html">Claire Tomlin's group
at Berkeley</a>. Their work leverages <a
href="https://www.cs.ubc.ca/~mitchell/ToolboxLS/">Ian Mitchell's Level
Set Toolbox</a>, which solves the Hamilton-Jacobi PDEs on a Cartesian
mesh using the technique cartooned above, and even includes the
minimum-time problem for the double integrator as a tutorial
example<elib>Mitchell05</elib>.</p>
</subsubsection>
</subsection> <!-- end function approx -->
</section> <!-- end continuous time DP -->
<todo> try just running snopt on quartic objective </todo>
<!--
<subsection><h1>A continuous policy iteration algorithm</h1>
<todo> simple (e.g. one-d example) </todo>
</section> -->
<!-- end policy iteration -->
<!--
<subsection><h1>How far can we take this? (Performance and Scaling)</h1>
<todo> Errors in bang-bang for double integrator due to discretization.
Limited in number of dimensions. Function approximation, but lacks
convergence results. </todo>
</section> --><!-- end scaling -->
<section><h1>Extensions</h1>
<todo> finite-horizon / time-varying dynamics or cost.</todo>
<p>There are many many nice extensions to the basic formulations that we've
presented so far. I'll try to list a few of the most important ones here.
I've also had a number of students in this course explore very interesting
extensions; for example <elib>Yang20</elib> looked at imposing a low-rank
structure on the (discrete-state) value function using ideas from matrix
completion to obtain good estimates despite updating only a small fraction
of the states.</p>
<subsection id="discounting">
<h1>Discounted and average cost formulations</h1>
<p>Throughout this chapter, we have focused primarily on minimizing an
infinite horizon integral of costs. Whenever you write an infinite sum
(or integral), one should immediately question whether that sum converges
to a finite value. For many control problems, it does -- for instance, if
we set the cost to be zero at the goal and the controller can drive the
system to the goal at a rate that is faster cost is accrued. However,
once we enter the realm of approximate optimal control, we can run into
problems. For instance, if we can a discrete state and action
approximation of the dynamics, it may not be possible for the discretized
system to arrive <i>exactly</i> at the zero-cost goal.</p>
<p>A common way to protect oneself from this is to add an exponential
discount factor to the integral cost. This is a natural choice since it
preserves the recursive structure of the Bellman equation. In discrete
time, it takes the form $$\min_{\bu[\cdot]} \sum_{n=0}^\infty \gamma^n
\ell(\bx[n], \bu[n]),$$ which leads to the Bellman equation $$J^*(\bx) =
\min_\bu \left[ \ell(\bx, \bu) + \gamma J^*(f(\bx,\bu)) \right].$$ In
continuous time we have<elib>Doya00</elib> $$\min_{\bu(\cdot)}
\int_0^\infty e^{-\frac{t}{\tau}} \ell(\bx(t), \bu(t)) dt,$$ which leads
to the corresponding HJB: $$\frac{1}{\tau} J^*(\bx) = \min_\bu \left[
\ell(\bx, \bu) + \pd{J^*}{\bx} f(\bx,\bu)\right].$$
</p>
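<p>A compact sketch of the discounted update on a randomly generated (and
entirely made-up) finite deterministic system; with $\gamma$ strictly less
than one the Bellman update is a contraction, so the iteration converges
even without a reachable zero-cost goal:</p>
<pre><code class="python">
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, gamma = 20, 4, 0.95
f = rng.integers(num_states, size=(num_states, num_actions))   # next-state table
l = rng.uniform(0.0, 1.0, size=(num_states, num_actions))      # one-step cost table

J = np.zeros(num_states)
for _ in range(500):
    # J*(s) = min_a [ l(s,a) + gamma * J*(f(s,a)) ]
    J = np.min(l + gamma * J[f], axis=1)
print(J)
</code></pre>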
<p>A word of warning: this discount factor <i>does</i> change the optimal
policy, and it is not without its pitfalls. We'll work out a specific case
of the discounted cost for the LQR problem, and see that the discounted
controller is exactly a controller which assumes that the plant is more
stable than it actually is. It is possible to find an optimal controller
for the discounted cost that results in an unstable controller.</p>
<p>A more robust alternative is to consider an average-cost
formulation... (details coming soon!) <!-- : $$J^\pi(\bx) =
\liminf_{N \rightarrow \infty} \frac{1}{N} \sum_{n=0}^N
\ell(\bx[n],\pi(\bx[n])) \qquad \text{or} \qquad J^\pi(\bx) = \liminf_{T
\rightarrow \infty} \int_0^T \ell(\bx(t), \pi(\bx(t)) dt.$$ The
corresponding Bellman equations are: $$J^*(\bx) + c^* = \min_u \left[
\ell(\bx, \bu) + J^*(f(\bx,\bu))\right],$$ and $$c^* = \min_\bu \left[
\ell(\bx,\bu) + \pd{J^*}{\bx} f(\bx,\bu)\right],$$ respectively, where
$c^*$ is a constant.-->
</p>
</subsection>
<subsection id="mdp"><h1>Stochastic control for finite MDPs</h1>
<p> One of the most amazing features of the dynamic programming, additive
cost approach to optimal control is the relative ease with which it
extends to optimizing stochastic systems. </p>
<p>For discrete systems, we can generalize our dynamics on a graph by
adding action-dependent transition probabilities to the edges. This new
dynamical system is known as a Markov Decision Process (MDP), and we
write the dynamics completely in terms of the transition probabilities
\[\Pr(s[n+1] = s' | s[n] = s, a[n] = a). \] For discrete systems, this is
simply a big lookup table. The cost that we incur for any execution of
the system is now a random variable, and so we formulate the goal of control
as optimizing the expected cost, e.g. \begin{equation} J^*(s[0]) =
\min_{a[\cdot]} E \left[ \sum_{n=0}^\infty \ell(s[n],a[n]) \right].
\label{eq:stochastic_dp_optimality_cond} \end{equation} Note that there
are many other potential objectives, such as minimizing the worst-case
error, but the expected cost is special because it preserves the dynamic
programming recursion: \[ J^*(s) = \min_a E \left[\ell(s,a) +
J^*(s')\right] = \min_a \left[ \ell(s,a) + \sum_{s'} \Pr(s'|s,a) J^*(s')
\right].\] Remarkably, if we use these optimality conditions to construct
our value iteration algorithm \[ \hat{J}(s) \Leftarrow \min_a \left[
\ell(s,a) + \sum_{s'} \Pr(s'|s,a) \hat{J}(s') \right],\] then this
algorithm has the same strong convergence guarantees of its counterpart
for deterministic systems. And it is essentially no more expensive to
compute!</p>
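<p>The corresponding value iteration update is again just a few lines of
code; here is a sketch on a randomly generated MDP (all numbers are made up
for illustration), with state 0 acting as a zero-cost absorbing goal:</p>
<pre><code class="python">
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 10, 3
# P[a, s, :] is the transition distribution Pr(s' | s, a).
P = rng.dirichlet(np.ones(num_states), size=(num_actions, num_states))
l = rng.uniform(0.0, 1.0, size=(num_states, num_actions))
l[0, :] = 0.0                   # zero cost at the goal state...
P[:, 0, :] = 0.0
P[:, 0, 0] = 1.0                # ...with a self-transition for every action

J = np.zeros(num_states)
for _ in range(1000):
    # J*(s) = min_a [ l(s,a) + sum_s' Pr(s'|s,a) J*(s') ]
    J = np.min(l + np.einsum('asp,p->sa', P, J), axis=1)
print(J)
</code></pre>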
<subsubsection><h1>Stochastic interpretation of deterministic,
continuous-state value iteration</h1>
<p> There is a particularly nice observation to be made here. Let's
assume that we have discrete control inputs and discrete-time
dynamics, but a continuous state space. Recall the fitted value
iteration on a mesh algorithm described above. In fact, the resulting
update is exactly the same as if we treated the system as a discrete
state MDP with $\beta_i$ representing the probability of
transitioning to state $x_i$! This sheds some light on the impact of
discretization on the solutions -- discretization error here will
cause a sort of diffusion corresponding to the probability of
spreading across neighboring nodes.</p>
</subsubsection>
<p>This is a preview of a much more general toolkit that we develop later
for <a href="robust.html">stochastic and robust control</a>.</p>
</subsection>
<subsection id="LP"><h1>Linear Programming Dynamic Programming</h1>