$\renewcommand\Pr{\mathbb{P}}$ $\newcommand\E{\mathbb{E}}$

Wednesday, October 22, 2014

Ensembles for reinforcement learning


More than six years after my PhD thesis on "ensembles for sequence learning", which included a chapter on ensemble representations of value functions using bootstrap replicates (Sec 8.5.1), it seems that the subject is back in vogue. Back then I used Thompson sampling (which I called "sampling-greedy", being an ignorant fellow) and a discount-sensitive variant of it on a number of bandit problems, and experimented with different types of ensembles, from a plain online bootstrap to particle filters.
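As a minimal sketch of the "sampling-greedy" (Thompson sampling) idea over a value-function ensemble — not the thesis code, and with the ensemble represented here simply as a list of dicts mapping (state, action) pairs to value estimates — one draws a single member at random and acts greedily with respect to it:

```python
import random

def thompson_action(ensemble, state, actions):
    """'Sampling-greedy': draw one value-function estimate from the
    ensemble uniformly at random and act greedily with respect to it."""
    q = random.choice(ensemble)               # one plausible value function
    return max(actions, key=lambda a: q.get((state, a), 0.0))
```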

Recently, Hado van Hasselt presented Double Q-learning, which maintains two value functions (with the twist that each is updated using the other), while Eckles and Kaptein presented Thompson sampling with the online bootstrap.
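For reference, a hedged sketch of the tabular Double Q-learning update (with the usual step size $\alpha$ and discount $\gamma$): one table selects the maximising action in the next state, the other evaluates it, and the roles are swapped at random.

```python
import random
from collections import defaultdict

def double_q_update(QA, QB, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Double Q-learning step: one table selects the greedy
    action in s_next, the other evaluates it; roles swap at random."""
    if random.random() < 0.5:
        QA, QB = QB, QA                       # swap which table gets updated
    a_star = max(actions, key=lambda b: QA[(s_next, b)])
    QA[(s, a)] += alpha * (r + gamma * QB[(s_next, a_star)] - QA[(s, a)])

# usage: tables default to zero for unseen (state, action) pairs
QA, QB = defaultdict(float), defaultdict(float)
double_q_update(QA, QB, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
```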

What kind of bootstrap should we use? Let's say we have a sequence of observations $x_t \in X$, $t=1,\ldots, n$. One option is the standard bootstrap, where samples are drawn with replacement: for the $i$-th estimate $\theta_i : X^n \to \Theta$ of a parameter $\theta$, we use $n$ points drawn with replacement from the original sequence. Alternatively, one could use the so-called "Bayesian" bootstrap, which places random "posterior" weights on each $x_t$. In practice the two methods are quite similar, and the latter is the basis of online bootstrapping: each new observation is simply added to each member of the ensemble with a certain probability, or with a random weight.
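A minimal sketch of this online variant, under the assumption that each ensemble member keeps a weighted running mean and that each incoming observation receives an independent Poisson(1) weight per member (one common choice, as in online bagging; Bernoulli inclusion is another):

```python
import numpy as np

def online_bootstrap_update(estimates, counts, x, rng=None):
    """Feed one observation x into an ensemble of weighted running means.

    Each member receives x with a random Poisson(1) weight, which mimics
    drawing n samples with replacement as the data arrive online.
    """
    rng = rng or np.random.default_rng()
    for i in range(len(estimates)):
        w = rng.poisson(1.0)                  # weight 0: member i skips x
        if w > 0:
            counts[i] += w
            estimates[i] += w * (x - estimates[i]) / counts[i]
```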

Like Eckles and Kaptein, I had used a version of the online bootstrap. Back then, I had also experimented with mixing different types of estimators, but the results were not very encouraging. Some kind of particle filtering appeared to work much better.

One important question is whether this can be taken to be a posterior distribution or not. It could certainly be the case for identical estimators, but things become more complicated when they are not: for example, when each estimator encodes different assumptions about the model class, such as its stationarity. In that case, the probability of the complete model class should be involved, and that implies that there would be interactions between the different estimators.

Concretely, if you wish to obtain a bootstrap ensemble estimate of expected utility, you could write something like \[ \E(U \mid x) \approx \frac{1}{K} \sum_{i=1}^K \E_{\theta_i} (U \mid x), \] assuming the $\theta_i$ are distributed approximately according to the posterior $P(\theta \mid x) \propto P_\theta(x) P(\theta)$. However, typically this is not the case! Consider the simple example where $\theta_1$ assumes the $x_t$ are i.i.d. and $\theta_2$ assumes they are Markov. Then the marginal likelihoods $P_{\theta_1}(x)$ and $P_{\theta_2}(x)$ will generally converge to different values (no matter how the $x_t$ are allocated), so the two estimators should not receive equal weight. This implies that some reweighting would be required in the general case. In fact, there exist simple reweighting procedures which can approximate posterior distributions rather well via the bootstrap, and which avoid the computational complexity of MCMC. It would be interesting to look at that in more detail perhaps.
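To make the reweighting concrete (a sketch only, not a full derivation): weight the $i$-th member by its marginal likelihood times its prior, \[ \E(U \mid x) \approx \sum_{i=1}^K w_i \, \E_{\theta_i}(U \mid x), \qquad w_i = \frac{P_{\theta_i}(x)\, P(\theta_i)}{\sum_{j=1}^K P_{\theta_j}(x)\, P(\theta_j)}, \] which collapses to the plain average above when all members assign the same marginal likelihood to the data, and otherwise downweights estimators whose model assumptions fit the data poorly.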
