$\renewcommand\Pr{\mathbb{P}}$ $\newcommand\E{\mathbb{E}}$

Wednesday, October 22, 2014

Ensembles for reinforcement learning


More than 6 years after my PhD thesis on "ensembles for sequence learning", which included a chapter on ensemble representations of value functions using bootstrap replicates (Sec 8.5.1), it seems that the subject is back in vogue. Back then I used Thompson sampling (which I called "sampling-greedy", being an ignorant fellow), and a discount-sensitive variant of it in a number of bandit problems, and experimented with different types of ensembles, from plain online bootstrap to particle filters.

Recently, Hado van Hasselt presented Double Q-learning, which maintains two value functions (with the twist that one depends on the other), while Eckles and Kaptein presented Thompson sampling with the online bootstrap.

What kind of bootstrap should we use? Let's say we have a sequence of observations $x_t \in X$, $t=1,\ldots, n$. One option is the standard bootstrap, where samples are drawn with replacement: for the $i$-th estimate $\theta_i : X^n \to \Theta$ of a parameter $\theta$, we use $n$ points drawn with replacement from the original sequence. Alternatively, one could use the so-called "Bayesian" bootstrap, which places "posterior" probabilities on each $x_t$. In practice both methods are quite similar, and the latter is the basis of online bootstrapping (simply add a new observation, with a certain probability or weight, to a given member of the ensemble).
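As a concrete illustration, here is a minimal sketch in Python of the two flavours for a simple mean estimator (all names are mine, and the Poisson(1) weights are just one common way to approximate resampling with replacement online):

```python
import numpy as np

rng = np.random.default_rng(0)

def standard_bootstrap_means(x, K):
    """K bootstrap estimates of the mean: each uses n points drawn with replacement."""
    n = len(x)
    return np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(K)])

class OnlineBootstrapMean:
    """Online bootstrap ensemble for the mean: each new observation enters each
    member with an independent random weight (here Poisson(1)), approximating
    resampling with replacement without storing the data."""

    def __init__(self, K):
        self.sums = np.zeros(K)
        self.counts = np.zeros(K)

    def update(self, x_t):
        w = rng.poisson(1.0, size=self.sums.shape)  # per-member weight of the new point
        self.sums += w * x_t
        self.counts += w

    def estimates(self):
        return self.sums / np.maximum(self.counts, 1)

x = rng.normal(loc=1.0, scale=2.0, size=200)
print(standard_bootstrap_means(x, K=10))   # ensemble of "offline" estimates
ens = OnlineBootstrapMean(K=10)
for x_t in x:
    ens.update(x_t)
print(ens.estimates())                     # ensemble of online estimates
```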

Like Eckles and Kaptein, I had used a version of the online bootstrap. Back then I had also experimented with mixing different types of estimators, but the results were not very encouraging; some kind of particle filtering appeared to work much better.

One important question is whether such an ensemble can be taken to represent a posterior distribution or not. It could certainly be the case for identical estimators, but things become more complicated when they are not: for example, when each estimator encodes different assumptions about the model class, such as its stationarity. In that case, the probability of the complete model class should be involved, which implies that there would be interactions between the different estimators.

Concretely, if you wish to obtain a bootstrap ensemble estimate of expected utility, you could write something like \[ \E(U \mid x) \approx \frac{1}{K} \sum_{i=1}^K \E_{\theta_i} (U \mid x), \] assuming the $\theta_i$ are distributed approximately according to $P(\theta \mid x) \propto P_\theta(x) P(\theta)$. However, typically this is not the case! Consider the simple example where $\theta_1$ assumes the $x_t$ are i.i.d. and $\theta_2$ assumes they are Markov. Then $P_\theta(x)$ would clearly converge to different values for the two models (no matter how the $x_t$ are allocated), which implies that some reweighting would be required in the general case. In fact, there exist simple reweighting procedures which can approximate posterior distributions rather well via the bootstrap, while avoiding the computational complexity of MCMC. It would be interesting to look at that in more detail.
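Here is a rough sketch of the kind of reweighting I have in mind (my own illustration rather than any specific published procedure): weight each ensemble member in proportion to $P_{\theta_i}(x) P(\theta_i)$ instead of using a uniform $1/K$.

```python
import numpy as np

def reweighted_ensemble_utility(member_utilities, log_marginal_likelihoods, log_prior=None):
    """Combine per-member expected utilities E_{theta_i}(U | x) with weights
    proportional to P_{theta_i}(x) * P(theta_i), rather than a uniform average."""
    log_w = np.asarray(log_marginal_likelihoods, dtype=float)
    if log_prior is not None:
        log_w = log_w + np.asarray(log_prior, dtype=float)
    log_w -= log_w.max()                  # subtract the max for numerical stability
    w = np.exp(log_w)
    w /= w.sum()                          # normalise to posterior-style weights
    return float(np.dot(w, member_utilities)), w

# Toy usage: an i.i.d. member and a Markov member with different marginal likelihoods.
utility, weights = reweighted_ensemble_utility(
    member_utilities=[1.0, 1.4],
    log_marginal_likelihoods=[-120.0, -95.0],
)
print(utility, weights)
```

With equal marginal likelihoods (and a uniform prior) this reduces to the plain bootstrap average.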

Tuesday, July 22, 2014

The unfalsifiability of rationality

Spurred by a recent discussion on "rational" versus "adaptive" models in economics, I recalled something I worked on briefly in 2011: how can we actually define rationality in the first place, so as to obtain a falsifiable hypothesis?

For simplicity, let us restrict ourselves to the following setting. A set of $n$ subjects participate in a psychological experiment. In this experiment, the $i$-th subject obtains a sequence of observations $x_{i,t} \in X$; after each observation, it performs an action $a_{i,t} \in A$. We assume that some observation and action sequences are preferred to others. In particular, the observations may entail the dispersal of monetary rewards or presumably pleasurable stimuli. How can we tell whether subjects are rational?

Let us construct a family of belief and utility models with parameters $\theta \in \Theta$. That is, for a subject with parameters $\theta_i$, we define the following probabilities for the next observation \[ \Pr_{\theta_i} (x_{i,t+1} \mid x_{i,1:t}, a_{i,1:t}) \] conditioned on the observation-action history. We also assume that each subject has a utility function $U_i : X^* \times A^* \to \mathbb{R}$ and that it tries to maximise its expected utility \[ \E^{\pi_i}_{\theta_i} U_i \] with some policy $\pi_i : X^* \times A^* \to A$ for selecting actions.
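To make the setting concrete, here is a small Python sketch of the three ingredients and a Monte Carlo estimate of $\E^{\pi}_{\theta} U$; the particular Bernoulli belief, matching-reward utility and repeat-the-last-observation policy are purely hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_obs(theta, xs, acts):
    """Sample x_{t+1} ~ P_theta(. | history); here a Bernoulli belief that ignores history."""
    return int(rng.random() < theta)

def utility(xs, acts):
    """U(x_{1:T}, a_{1:T}): reward 1 whenever an action matched the following observation."""
    return sum(float(a == x) for a, x in zip(acts, xs[1:]))

def policy(xs, acts):
    """pi(history): repeat the most recent observation."""
    return xs[-1]

def expected_utility(theta, T=10, n_samples=2000):
    """Monte Carlo estimate of E^pi_theta U under the model and policy above."""
    total = 0.0
    for _ in range(n_samples):
        xs, acts = [next_obs(theta, [], [])], []
        for _ in range(T):
            acts.append(policy(xs, acts))
            xs.append(next_obs(theta, xs, acts))
        total += utility(xs, acts)
    return total / n_samples

print(expected_utility(theta=0.8))  # close to T * (0.8**2 + 0.2**2) = 6.8
```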

The falsifiability problem occurs because, even if we assume a particular utility function a priori, any subject's actions will always be consistent with some belief model, provided $\Theta$ is large enough. Consequently, we should stop thinking about formal rationality and start discussing reasonable models instead.

What are reasonable models? One standard answer comes from Occam's razor. Given a nice family of distributions (or programs), we could simply bet that most subjects would put most of their belief on the simplest parts of the family.

However, for any hypothesis test we need an alternative hypothesis, and unfortunately there is no obvious way of building "unreasonable" models. One could always consider the $\epsilon$-contamination class with "oracle" beliefs, which assign maximum probability to the observed data. I then performed an analysis which seemed to support the null hypothesis that the beliefs are "reasonable" when the subjects (in this case, algorithms) are indeed reasonable, even in the absence of a known alternative hypothesis.
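To fix ideas, one way to write the alternative down (my formulation, so treat it as a sketch rather than the original analysis) is as the contaminated belief \[ \Pr^{\epsilon}_{\theta}(x_{1:t}) = (1 - \epsilon) \Pr_{\theta}(x_{1:t}) + \epsilon \Pr^{*}(x_{1:t}), \] where $\Pr^{*}$ is the "oracle" that assigns maximum probability to the sequence actually observed; the null hypothesis of a reasonable belief then corresponds to $\epsilon = 0$ and the alternative to $\epsilon > 0$.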

I am not sure whether there is interest in that sort of thing. If so, perhaps a bit of refinement of the alternative model class would be in order.

Tuesday, July 15, 2014

Inflation expectations. Whose expectations?

#economics #braindump Reading Simon Wren-Lewis made me want to look at econometric studies of Phillips curve models. Now, IANAE, so I might be misinterpreting a lot of things here, but as far as I can understand, one interesting question in macroeconomics concerns inflation and inflation expectations.

I will use the notation from a paper by Nason and Smith, together with whatever information I could dig out. If we look at a discrete-time model with inflation rate $\pi_t$, the basic equation is \[ \pi_t = \gamma_f \E_t \pi_{t+1} + \lambda x_t, \] where $\E_t$ is an expectation operator; I choose to interpret the underlying probability space as one of beliefs over states of the world. Meanwhile, $\gamma_f, \lambda$ are scalars and $x_t$ is "marginal costs", which I interpret as some external variable.
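For reference, if I iterate this equation forward myself, assuming a single expectation operator that satisfies the law of iterated expectations over time ($\E_t \E_{t+1} = \E_t$) and that the discounted terminal term vanishes, I get \[ \pi_t = \lambda \sum_{j=0}^{\infty} \gamma_f^j \, \E_t x_{t+j}, \] so current inflation is a discounted sum of expected future marginal costs. Note that this already bakes in the assumption that the same probability measure is used at every date, which is exactly the point I question below.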

One interesting thing in that paper is that it explicitly mentions the alternative model \[ \pi_t = \gamma_b \pi_{t-1} + \gamma_f \E_t \pi_{t+1} + \lambda x_t. \] Now I don't see how this helps things, because one would expect the expectations to be dependent on $\pi_{t-1}$ in any case. And why are we only looking at $\pi_{t-1}$ and not $\pi_t$ or maybe $\pi_{t-k}$? The rationale seems to be that a very simple (apparently, didn't look at the details) price-setting model ends up having the above form.

The thing that is strange about all this is that the $\E_t$ operator is never clearly defined. At the bottom of page 366 the authors talk about iterated expectations. This works out symbolically, but... let's say our information state at time $t$ is $z_t$. Then we have a probability distribution over states of the world, $P(\omega \mid z_t)$, with each world state $\omega \in \Omega$ corresponding to a particular sequence $\pi_t(\omega)$, and the corresponding expectation \[ \E_P(\pi_{t+1} \mid z_t) = \int_{\Omega} \pi_{t+1}(\omega) \, dP(\omega \mid z_t). \] But that is not necessarily the same as the expectation of another forecaster. Say we have a probability measure $Q$ over the possible beliefs $B \in \cal B$ that the other forecaster has, conditioned on the evidence. Then \[ \E_Q(\pi_{t+1} \mid z_t) = \int_{\cal B} \int_{\Omega} \pi_{t+1}(\omega) \, dB(\omega) \, dQ(B \mid z_t). \]

So I don't see how it follows that "our effort to predict what someone with better information will forecast simply gives us our own best forecast". For example, the evidence might tell us that the other forecaster believes next-period inflation will equal current-period inflation. So in general $\E_P \neq \E_Q$; and even if the two expectations are equal, the distributions of possible future inflation need not be. Am I missing something obvious, or is this rather sloppy? What kind of statistically meaningful conclusions can one obtain this way, if any?
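To make the $\E_P \neq \E_Q$ point concrete, here is a toy example with made-up numbers. Suppose $\Omega = \{\omega_1, \omega_2\}$ with $\pi_{t+1}(\omega_1) = 0$ and $\pi_{t+1}(\omega_2) = 0.04$, and my own posterior puts probability $1/2$ on each, so $\E_P(\pi_{t+1} \mid z_t) = 0.02$. If the evidence tells me that the other forecaster pegs next-period inflation to the current value of $0$, then $Q$ concentrates on the single belief $B$ with $B(\omega_1) = 1$, and \[ \E_Q(\pi_{t+1} \mid z_t) = \int_{\Omega} \pi_{t+1}(\omega) \, dB(\omega) = 0 \neq 0.02 = \E_P(\pi_{t+1} \mid z_t). \]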

Any enlightening comments?