Errata and Notes for: Reinforcement Learning: An Introduction (Sutton and Barto)
Errata:
- p. xviii, Ben Van Roy should be acknowledged only once in the list. (Ben Van Roy)
- p. 155, the parameter alpha was 0.01, not 0.1 as stated. (Abinav Garg)
- p. 233, last line of caption: "ne-step" should be "one-step". (Michael Naish)
- p. 309, the reference for Tsitsiklis and Van Roy (1997b) should
be to Technical Report LIDS-P-2390, Massachusetts Institute of Technology. (Ben Van Roy)
- p. 146, the windy gridworld example may have used alpha=0.5
rather than alpha=0.1 as stated. Can you confirm this?
- p. 322, in the index entry for TD error, the range listed as "174-165" should be "174-175". (Jette Randlov)
- p. 197, in the bottom formula, the last theta_t(2) should be theta_t(n). (Dan Bernstein)
- p. 151, second line of the equation, pi(s_t,a_t) should be pi(s_{t+1},a_t). (Dan Bernstein)
- p. 174, 181, 184, 200, 212, 213: in the boxed algorithms on all these pages, the setting
of the eligibility traces to zero should appear not in the first line,
but as a new first line inside the first loop (just after the "Repeat..."); see the
Sarsa(lambda) sketch after this list. (Jim Reggia)
- p. 215, Figure 8.11, in the y-axis label, "first 20 trials" should be "first 20 episodes".
- p. 215. The data shown in Figure 8.11 was apparently not generated exactly as
described in the text, as its details (but not its overall shape) have defied
replication. In particular, several researchers have reported best "steps per episode"
in the 200-300 range.
- p. 78. In the 2nd max equation for V*(h), at the end of the first line, "V*(h)" should be
"V*(l)". (Christian Schulz)
- p. 29. In the upper graph, the third line is unlabeled, but should be labeled
"epsilon=0 (greedy)".
- p. 212-213. In these two algorithms, a line is missing that is recommended, though
perhaps not required. A next-to-last line should be added, just before ending the
loop, that recomputes Q_a. That line would be Q_a <- \sum_{i\in F_a} theta(i); see the
Q_a sketch after this list.
- p. 127, Figure 5.7. The first two lines of step (c) refer to pairs s,a and times t
at or later than time \tau. In fact, they should be treated only for times strictly later
than \tau, not equal to it. (Thorsten Buchheim)
- p. 267, Table 11.1. The number of hidden units for TD-Gammon 3.0 is given as 80, but
should be 160. (Michael Naish)
- p. 98, Figure 4.3. Stuart Reynolds points out that for some MDPs the given
policy iteration algorithm never terminates. The problem is
that there may be small changes in the values computed in step 2 that cause the policy
to keep changing forever in step 3. The solution is to terminate
step 3 not when the policy is stable, but as soon as the largest change in state value
due to a policy change is less than some epsilon; see the termination-test sketch after this list.
- p. 259, the reference to McCallum, 1992 should be to Chrisman, 1992. And in
the references section, on p. 302, the (incorrect) listing for McCallum, 1992
should not be there. (Paul Crook)
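
For the eligibility-trace item above (pp. 174-213), here is a minimal sketch of where the
reset belongs, using tabular Sarsa(lambda) as a stand-in for the boxed algorithms. The
environment interface (env.reset, env.step returning state, reward, done) and the
epsilon_greedy helper are illustrative assumptions, not the book's notation.

    import numpy as np

    def epsilon_greedy(Q, s, epsilon):
        # Illustrative helper: random action with probability epsilon,
        # otherwise the greedy action for state s.
        if np.random.rand() < epsilon:
            return np.random.randint(Q.shape[1])
        return int(np.argmax(Q[s]))

    def sarsa_lambda(env, num_episodes, n_states, n_actions,
                     alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))
        for _ in range(num_episodes):         # Repeat (for each episode):
            e = np.zeros_like(Q)              #   e <- 0, reset INSIDE the episode loop
            s = env.reset()
            a = epsilon_greedy(Q, s, epsilon)
            done = False
            while not done:                   #   Repeat (for each step of episode):
                s2, r, done = env.step(a)
                if done:
                    delta = r - Q[s, a]
                else:
                    a2 = epsilon_greedy(Q, s2, epsilon)
                    delta = r + gamma * Q[s2, a2] - Q[s, a]
                e[s, a] += 1.0                # accumulating traces
                Q += alpha * delta * e
                e *= gamma * lam
                if not done:
                    s, a = s2, a2
        return Q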
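
For the pp. 212-213 item about recomputing Q_a, a minimal sketch of the recommended line,
assuming theta is indexable by feature number and F_a is the set of feature indices active
for action a (the names follow the formula above; the function wrapper is mine):

    def recompute_Q_a(theta, F_a):
        # Q_a <- sum_{i in F_a} theta(i), recomputed just before ending the loop
        # so that Q_a reflects the weights updated on this step.
        return sum(theta[i] for i in F_a)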
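
For the p. 98 policy iteration item, a minimal sketch of the suggested termination test for
step 3, assuming V_old and V_new are the state-value arrays from before and after the policy
change (the names and the default epsilon are mine):

    import numpy as np

    def value_change_small(V_old, V_new, epsilon=1e-8):
        # Terminate when the largest change in state value caused by the policy
        # change falls below epsilon, rather than waiting for the policy itself
        # to stop changing (it may cycle forever among equally good actions).
        return np.max(np.abs(np.asarray(V_new) - np.asarray(V_old))) < epsilon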
Notes:
- p. 212-213. In these two algorithms, it is implicit that the set of features for
the terminal state (and all actions) is the empty set, so the corresponding sum of
weights, and hence the approximate value there, is zero; a small illustration appears below.
- p. 28. The description of the 10-armed testbed could be clearer. Basically, there are
2000 randomly generated 10-armed bandit problems. The ten Q*(a) values of each were drawn
from a normal distribution with mean 0 and variance 1. Then, on each play with each bandit,
the reward was determined by adding to Q*(a) another normally distributed random number
with mean 0 and variance 1. (A sketch of this procedure appears below.)
- p. 127, Figure 5.7. This algorithm is only valid if all policies are proper,
meaning that they produce episodes that always eventually terminate (this assumption
is made on the first page of the chapter). This restriction on environments can be
lifted if the algorithm is modified to use epsilon-soft policies, which are proper
for all environments. Such a modification is a good exercise for the reader!
(A sketch of epsilon-soft action selection appears below.) Alternative ideas for
off-policy Monte Carlo learning are discussed in this recent research paper.
- John Tsitsiklis has obtained some new results which come very close to solving
"one of the most important open theoretical questions in reinforcement learning" --
the convergence of Monte Carlo ES. See
here.
- The last equation on page 214 can be a little confusing. The minus sign here is
meant to be grouped with the 0.0025 (as the spacing suggests). Thus the consecutive plus and
minus signs have the same effect as a single minus sign, as illustrated below. (Chris Hobbs)
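
For the pp. 212-213 note on the terminal state, a tiny illustration of the convention, with
hypothetical action names and weights:

    ACTIONS = ["left", "right"]                    # placeholder action set
    theta = [0.5, -0.2, 1.0]                       # placeholder weight vector
    F = {("terminal", a): set() for a in ACTIONS}  # empty feature sets at the terminal state
    for a in ACTIONS:
        # Q_a = sum_{i in F_a} theta(i) is a sum over the empty set, i.e. 0.
        assert sum(theta[i] for i in F[("terminal", a)]) == 0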
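
For the p. 28 note on the 10-armed testbed, a minimal NumPy sketch of the generation
procedure just described (the variable names and seed are mine):

    import numpy as np

    rng = np.random.default_rng(0)   # illustrative seed
    n_bandits, n_arms = 2000, 10
    # True action values Q*(a): one row of 10 per bandit problem, drawn from a
    # normal distribution with mean 0 and variance 1.
    q_star = rng.normal(loc=0.0, scale=1.0, size=(n_bandits, n_arms))

    def reward(bandit, action):
        # On each play, the reward is Q*(a) plus another normally distributed
        # random number with mean 0 and variance 1.
        return q_star[bandit, action] + rng.normal(0.0, 1.0)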
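
For the p. 127 note, a minimal sketch of epsilon-soft action selection, the key ingredient
of the suggested modification (the remaining changes to the off-policy Monte Carlo algorithm
are left as the exercise above; Q, actions, and epsilon are illustrative names):

    import random

    def epsilon_soft_action(Q, s, actions, epsilon=0.1):
        # Every action keeps probability at least epsilon / len(actions); using such
        # an epsilon-soft policy is the modification suggested in the note above.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])   # Q assumed keyed by (s, a)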
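
For the page 214 note, with x and y standing in for the actual terms of that equation, the
intended reading is

    x + (-0.0025)y  =  x - 0.0025y

that is, the minus sign belongs to the 0.0025, and the consecutive plus and minus act as a
single subtraction.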