$\renewcommand\Pr{\mathbb{P}}$ $\newcommand\E{\mathbb{E}}$

Thursday, May 21, 2020

Sweden, Tegnell and coronavirus: Separating lies from stupidity. Is it possible?

The Swedish state epidemiologist Tegnell is the face of the coronavirus pandemic in Sweden. He appears calm and confident. However, his statements are misleading at best, and bare-faced lies at worst. He cherry-picks data and anecdotes, and makes statements without any evidence or models. Unfortunately, journalists never seem to challenge him, and the state media is operating in what appears to be a propaganda mode. Now that Sweden has become one of the countries with the highest deaths per capita over the last few weeks, it is time for a thorough debunking of Tegnell's dangerous nonsense. The picture that emerges is either of an utterly incompetent government, or of one that has performed a cynical calculation and decided that letting a few thousand people die preventable deaths is acceptable if it lets them stay in power. So far the strategy seems to have worked, as voters have been supportive of the government.

Tegnell said "Closedown, lockdown, closing borders — nothing has a historical scientific basis, in my view."

There is a great scientific basis for lockdowns in periods of epidemics. A simple example is the 1918-19 influenza pandemic and the response to it. A range of measures were tried, and the comparison between Philadelphia and Seattle is instructive. As Philadelphia did nothing to prepare, its health system was overwhelmed and the city was devastated. Seattle instead started quarantining cases and shutting things down, to great success.

Though there can be no stratified randomised double blind trial of measures, historical data does tell us what strategies have a good chance of working. This is a similar situation to economics, where we must rely mainly on modelling and natural experiments. Thankfully, the data from the US at the time do provide us with a great natural experiment.

Closing borders, or screening, does make sense when there is no endemic transmission. Since transmission in Sweden is uncontrolled, it makes no sense for Sweden to shut its borders. Denmark's border closure was probably unnecessary given the high incidence in that country at the time, but Norway was probably right to close its borders (especially with Sweden), as it had (and still has) few cases.


To say that there is no scientific basis for shutdowns is utter nonsense and deliberate misinformation.

Tegnell claimed that "In Sweden we are following the tradition that we have in Sweden and working very much with voluntary measures, very much with informing the public about the right things to do. That has worked reasonably well so far."

However, it is interesting to contrast Sweden with Norway: similar countries with different strategies. Firstly, Norway has had 43 deaths per million, while Sweden has had 384, almost an order of magnitude more. In fact, Sweden had the highest death rate per million in the world in the week of 13-20 May.

Tegnell dismissed the figures on Tuesday night, arguing that it was misleading to focus on the death toll over a single week.

In fact, Sweden has consistently been in the top 10 of deaths per million for multiple weeks, and is right now 8th in overall deaths since the beginning of the pandemic. Consequently, this is no statistical fluke, but a direct consequence of Sweden's failed response.

In terms of the public following the advice, it is also interesting to compare with Norway and Greece in terms of mobility changes.
Mobility change from baseline (%):

         Recreation  Grocery  Parks  Transit  Work  Home
 Sweden         -16       -1    146      -17    -5     2
 Norway         -18        5     84      -22   -11     3
 Greece         -54        4     45      -34   -25     9
Coronavirus resulted in a mass exodus of Swedes into the parks. Hardly anybody stayed at home more, and almost everybody continued going to work, since most schools did not close. The government had claimed that they could not shut down schools because of the need for health workers, but they could have allowed only the children of essential workers to go to school. So that sounds like an excuse.

"We are somewhere around 20 percent [of infected individuals] plus in Stockholm now," Tegnell claimed, after results were published showing that around 7 percent of individuals had been infected three weeks ago.

Tegnell pulls this number directly out of his arse. Going from 7 percent to over 20 percent implies a near-tripling of infections every three weeks. Does the model they use really say something like that? Indeed, consider a statement by a colleague of his, Tom Britton, a maths professor who helped develop the agency's forecasting model:
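As a sanity check (my own back-of-the-envelope arithmetic, not anything from the agency's model), the implied growth rate is easy to compute under a crude assumption of constant exponential growth:

```python
# Back-of-the-envelope check (my numbers and assumptions, not the agency's):
# if 7% were infected three weeks ago and 20%+ now, what growth is implied?
prev, now, weeks = 0.07, 0.20, 3

overall = now / prev              # growth factor over the whole three-week period
weekly = overall ** (1 / weeks)   # implied constant weekly growth factor

print(f"{overall:.2f}x over {weeks} weeks, {weekly:.2f}x per week")
```

That is a factor of about 2.86 over three weeks, i.e. infections would have to nearly triple every three weeks.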

“It means either the calculations made by the agency and myself are quite wrong, which is possible, but if that’s the case it’s surprising they are so wrong,” he told the newspaper Dagens Nyheter. “Or more people have been infected than developed antibodies.” [Source: the Guardian, see also original article]

This is mind-blowing. His model has been falsified by the data to the extent that it is "surprising they are so wrong". He refuses to believe that the model is wrong, and suggests instead that the data is. It is also interesting that Tegnell says the number is "a bit lower than we'd thought", which directly contradicts the previous statement about the calculations being "so wrong".

Later Tegnell continued to doubt the data saying that "In Sweden, anybody who has the diagnosis of COVID-19 and dies within 30 days after that is called a COVID-19 case, irrespective of the actual cause of death. And we know that in many other countries there are other ways of counting that are used".

However, excess death figures are in line with the official COVID-19 death counts, so the way of counting cannot explain the difference.

"We calculated on more people being sick, but the death toll really came as a surprise to us," Tegnell said. "We really thought our elderly homes would be much better at keeping this disease outside of them than they have actually been."

There are reports that the state utterly failed to protect the elderly population. They were slow to forbid visits to elderly homes (only doing so on March 30th!), they did not provide protective equipment to healthcare workers, and they discouraged bringing infected elderly people to hospitals. (In fact, ICUs are mainly full of young rather than old patients; the old just seem to die alone in care homes.) At the time of writing, Tegnell blames the care homes for not following basic hygiene rules, rather than the government and his agency for not enabling them to do so by providing protective equipment.

June 3 Update: Tegnell defended the strategy while admitting that too many people have died

In fact he said: “Other countries started with a lot of measures all at once. The problem with that is that you don’t really know which of the measures you have taken is most effective,” adding that conclusions would have to be drawn about “what else, besides what we did, you could do without imposing a total shutdown.” Closing schools would have been a start. Why not go the other way around, and start with a total lockdown, slowly relaxing measures to see which are crucial? It makes no sense to start easy given the uncertainty.

July 28 Update: Tegnell bizarrely talks about "evidence of more public spread than home spread"

In particular, after doubting ECDC evidence that masks seem to help limit the spread of the disease, he said "If there is more indications that the disease spreads more in public places rather than at home, the authority will of course consider a mask recommendation". This is puzzling, because the spread of the disease happens by default outside the home: if everybody stayed at home, the disease would stop spreading, and after two weeks it would be nearly eliminated.

But after all, the ex-welder and metal workers' unionist prime minister, who is utterly out of his depth, like most of the useless government cabinet, did say in his first video address about the crisis: "be prepared to lose your loved ones".







Thursday, October 27, 2016

A critical reading of Pearl's Causality

I wanted to get a better understanding of causal models. I first went through Dawid's excellent survey, but I was still somewhat uncertain about what can and cannot be done in causal inference. My opinion was that everything can be done with decision diagrams, but Pearl was kind enough to point out that this was not the case. So, I decided to start reading his book in my spare time. I intend to be as critical as possible. The book didn't help to clarify my thinking very much (in contrast to decision diagrams).

In this post, I'll mainly go through the first chapter, and chapter 7, which is the one with a formal definition of causal models. These are simply graphical models, with two types of random variables: endogenous and exogenous. There are two questions to answer:

1. How do we model interventions? Here, Pearl offers 'do' actions, which simply disconnect endogenous variables from their inputs in the rest of the graph and replace them with a fixed choice. This makes sense, but the set of actions is limited to these, and so the formalism looks like a specialisation of decision diagrams.

2. How do we deal with observed variables when dealing with counterfactuals and interventions? Should we assume that "everything else remains the same"? It's hard to motivate this for interventions, but it seems to make sense for counterfactuals. But in the end, it should all depend on whether observations are helping us infer model parameters or simply 'what is currently happening'. The book offers no clear guidance on this topic.

All in all, I found the formalisms in this book clear but restrictive, and the examples and discussion unhelpful and confusing. For example, it is unclear, and left to intuition, whether to consider some variables observed or not, and "structural equation modelling" does nothing to clarify the issue. But it is possible I misunderstood something.

Without further ado, here are my detailed notes.

* Chapter 1

This chapter introduces the book's main themes.  After going through the basics of Bayesian networks, the concept of Causal Bayesian Networks is introduced.  The general principle is that directed links should imply causal effects.  There is also the principle of interventions. If the BN is causal, then performing an action that sets a variable to some value also involves removing the link between that variable and its parents. This preserves the probability properties that we'd expect.  The chapter also discusses the stability of causal relations, talking about how causal relationships are /ontological/ while probabilistic ones are /epistemic/.

Functional causal models are basically described as equations of the form
\[
x_i = f_i (P_i, u_i)
\]
where $P_i$ are the parents of $i$ and $u_i$ are disturbances.
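As a concrete toy example of such a functional model (entirely my own illustration, not code from the book), here is a two-variable system with a confounding disturbance, where intervening on $x$ means replacing its equation with a constant, exactly as the 'do' operation prescribes:

```python
import random

random.seed(0)

# Toy functional causal model: u is an exogenous disturbance affecting
# both x and y (a confounder). If do_x is given, the mechanism for x is
# replaced by the constant do_x -- the 'do' operation.
def sample(do_x=None):
    u = random.gauss(0, 1)
    x = (1 if u + random.gauss(0, 1) > 0 else 0) if do_x is None else do_x
    y = x + u + random.gauss(0, 0.1)
    return x, y

n = 100_000
obs = [y for x, y in (sample() for _ in range(n)) if x == 1]   # E[y | x = 1]
intv = [y for _, y in (sample(do_x=1) for _ in range(n))]      # E[y | do(x = 1)]

# The two differ: observing x = 1 is informative about the disturbance u,
# while setting x = 1 cuts that link.
print(sum(obs) / len(obs), sum(intv) / len(intv))
```

Here the observational mean comes out well above the interventional mean of about 1, which is exactly the distinction between conditioning and intervening.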

There are three types of queries that Pearl wishes to discuss:
- Predictions: Probability of $x_i$ if we observe $x_j$.
- Interventions: Probability of $x_i$ if we /set/ $x_j$ to some value.
- Counterfactuals: Probability of $x_i$ if $x_j$ had some other value.

In my view, counterfactuals are no different from predictions conceptually. Since $x_j$ is a random variable, what we have essentially is
\[
\Pr(x_i = a \mid x_j = b)
 = \Pr(\{\omega : x_i(\omega) = a \wedge x_j(\omega) = b\})
 / \Pr(\{\omega : x_j(\omega) = b\}).
\]
So now we are counting in how many worlds $x_i, x_j$ take a joint value.  So, for the counterfactual, we need to be able to say something about how something else might have happened.  That would necessarily mean changing the value of $x_j$ for that specific $\omega$.

For more on counterfactuals, Pearl gives an example where $x, y \in \{0,1\}$ and where $\Pr(y \mid x) = 1/2$ for all $x, y$. This example is basically that of patients either dying or being cured by treatments. Statistically, it's impossible to separate the case where the effect of the treatment is nil from the case where the patients belong to two populations, with one population dying due to the treatment and the other being cured.

The solution to this paradox is to *give the treatment* to a sub-sample of the population that survived. However, this effectively creates an additional random variable. Instead of having one treatment-effect pair, we have two for each patient $\omega$: $(x_1, y_1)$ and $(x_2, y_2)$. In this context, the $x_i$ are decision variables. Hence, decision diagrams seem to be perfectly sufficient to model this.
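This can be made concrete with a small simulation (my own illustration, not from the book): two worlds with identical observational statistics, separated only by re-treating the survivors.

```python
import random

random.seed(3)

# World A ("null effect"): survival after treatment is a fair coin per trial.
# World B ("two populations"): half the patients are of a type the treatment
# always cures, the other half of a type it always kills.
# Both worlds give P(survive | treated) = 1/2 observationally.
N = 100_000
a_survive = [random.random() < 0.5 for _ in range(N)]
b_type = [random.random() < 0.5 for _ in range(N)]   # True = "cured" type
b_survive = b_type[:]                                # outcome is fixed by type

print(sum(a_survive) / N, sum(b_survive) / N)        # both close to 0.5

# Re-treat only the survivors: the second (x2, y2) pair separates the worlds.
a_again = [random.random() < 0.5 for s in a_survive if s]
b_again = [t for t in b_survive if t]                # cured types survive again
print(sum(a_again) / len(a_again), sum(b_again) / len(b_again))
```

In world A about half the re-treated survivors survive again; in world B they all do, revealing the difference the first experiment could not.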

Finally, this section introduces an alternative functional equation notation for the conditional probability table. This is more economical as a representation, but I see no other advantage in it (e.g. it does not allow for some other causal inferences). I would personally prefer to consider $u_1, u_2$ as samples from the underlying probability space rather than as additive 'disturbances' and then fall back to familiar decision theoretic territory.

* Chapter 7

This is the first formal chapter, and the only one that actually needs
reading. A causal model is a triple
\[
M = \langle U, V, F\rangle
\]
where
- $U$ are background/exogenous variables
- $V$ are endogenous variables (determined by variables in the model, i.e. $U \cup V$)
- $F$ is a set of functions so that
\[
v_i = f_i (u_i, \mathcal{P}_i),
\]
where $\mathcal{P}_i$ are the parents of $v_i$ in $V$.
Thus, every causal model $M$ can be represented as a directed graph.

A submodel is an important concept. If $X \subset V$ and $x$ is a realisation of $X$, then the submodel is $M_x = \langle U, V, F_x \rangle$ with $F_x = \{f_i : V_i \notin X\} \cup \{X = x\}$. This corresponds to deleting the functions of the variables in $X$ and replacing them with constant functions.

Pearl then considers a very specific set of actions, such that each action consists of assigning a value $x$ to $X$. However, more generally we can speak of actions as policies $\pi$. Each policy has a certain effect on variables, such as setting some of them to a specific value.


In the decision theoretic setting, we can instead consider a set of actions $A$. We can then simply define each random variable $v_i$ as a function
\[
v_i = v_i(f_i, a).
\]
Pearl's account amounts to a particular special case, where the set of actions $A$ includes the null action (which has no effect) and a set of interventions on specific variables (which correspond to submodels).


*Potential responses* of actions are simply the set of possible values of the other variables when we fix some action $a$.

*Counterfactual* The counterfactual that some variable $Y = y$ is simply the equality $Y_x(u) = y$, so we need to calculate the probability of $Y = y$ under the counterfactual model $M_x$. Here note that nothing done here is different from anything done in standard decision diagrams... [See Theorem 7.1.7]

*Probabilistic Causal Model* A probabilistic causal model is a tuple
\[
\langle M, P \rangle
\]
where $P$ is simply a probability measure on the exogenous variables $U$.
Then counterfactual statements have the following probabilities
\[
P(Y = y \mid a)
=
P(\{u : Y(u) = y \} \mid a) .
\]
So, this is the same as the standard decision theoretic calculus.
I once again see no reason to introduce new notation, which only seems to limit us to very specific action models.

** Probabilistic example in linear models

Here, we have a model with the dependencies
\begin{align}
q &= b_1 p + d_1 i + u_1
\\
p &= b_2 q + d_2 w + u_2
\end{align}

In this model, the set of endogenous variables is
\[
V = \{q, p\}
\]
and the exogenous ones are
\[
u = \{u_1, u_2, i, w\},
\]
with $i, w$ observed and $u_1, u_2$ latent and independent of $i, w$. We can assume that $u$ is jointly Gaussian. The book then asks three questions about this model.

1. What is the expected value of the demand $q$ if we set the price to $p_0$? I take this to mean $E[q | a(p := p_0)]$, but the book actually also conditions on $i$.

The modified model says that $p$ depends only on our non-null action, which is $p := p_0$.
Then
\[
E[q | a(p := p_0)] = b_1 p_0 + d_1 E[i] + E[u_1],
\qquad
E[q | a(p := p_0), i] = b_1 p_0 + d_1 i + E[u_1]
\]
But we also know that
\[
E[u_1] = E[q] - b_1 E[p] - d_1 E[i],
\]
so we can replace that directly in there to obtain
\begin{align}
E[q | a(p := p_0)] &= E[q] + b_1(p_0 - E[p])
\\
E[q | a(p := p_0), i] &= E[q] + b_1(p_0 - E[p]) + d_1(i - E[i])
\end{align}
This seems right, though the question said nothing about conditioning on $i$.
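The unconditional formula is easy to check by Monte Carlo (my own simulation; the coefficients and distributions are arbitrary choices, not from the book):

```python
import random

random.seed(1)

# Check E[q | a(p := p0)] = E[q] + b1 (p0 - E[p]) by simulation.
b1, d1, b2, d2 = 0.5, 1.0, -0.3, 0.8   # arbitrary coefficients
mu_i = 2.0                             # mean of the observed exogenous i
N, p0 = 200_000, 1.5

def exo():
    return (random.gauss(mu_i, 1), random.gauss(0, 1),
            random.gauss(0, 1), random.gauss(0, 1))

# Observational regime: solve the two simultaneous structural equations.
qs, ps = [], []
for _ in range(N):
    i, w, u1, u2 = exo()
    p = (b2 * (d1 * i + u1) + d2 * w + u2) / (1 - b1 * b2)
    qs.append(b1 * p + d1 * i + u1)
    ps.append(p)

# Interventional regime: the equation for p is replaced by p := p0.
q_do = [b1 * p0 + d1 * i + u1 for i, w, u1, u2 in (exo() for _ in range(N))]

Eq, Ep = sum(qs) / N, sum(ps) / N
print(sum(q_do) / N, Eq + b1 * (p0 - Ep))   # the two agree up to Monte Carlo error
```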

2. This is a simple observation question.

3. Given that the actual price is $p_0$, what would the expected demand be if we set it to $p_1$? I interpret this to mean
\[
E[q | a(p:=p_1), p_0, i, w],
\]
i.e. since this is a counterfactual, we assume that /everything else, including the exogenous variables, is equal/. I am not sure why this needs to be the case. I understand why we need to condition on the observed endogenous variable $p$, but conditioning on the exogenous variables lacks any motivation.

Wednesday, August 26, 2015

An annotated reading of rational expectations - Conclusion: Where a sleight of hand strengthens the assumptions

Rational agents, or oracles? Concluding remarks on rational expectations

So far, dear reader, I had been reading the theory of rational expectations under the assumptions I laid out in my first post about this. Most important of those is that agents are rational with respect to their own subjective beliefs. This is what I'd call the classic decision theory assumption about rational agents.

However, economics appears to assign a distinctly different meaning to these words. In particular, Manski, a researcher in econometrics, claims that rational expectations mean that agents' subjective beliefs match the actual probability distribution from which Nature generates events. Is that claim true? 

From my reading of Muth's paper, it's certainly ambiguous. The central assumption made in the beginning is merely that the agents' aggregate belief matches Nature's. However, later on the paper develops a notational obscurity that makes it hard to distinguish whose expectations we are talking about, and what knowledge each actor has. This includes:

  1. The knowledge of each economic agent
  2. The knowledge of the person writing up the model (i.e. how much the model corresponds to reality).
  3. What the process itself depends on.

Nevertheless, my reading of Manski implies that the rational expectations economics subgroup has actually taken the idea to extremes. For them, "rational agents" are what I call "oracles", because they know everything about the system.

Muth makes things difficult by re-using assumptions without clearly stating them, but his own assumptions are somewhat weaker than that. I will go over this in the next section. 

Correlated deviations.

As the assumption of uncorrelated deviations is rather unrealistic, the paper then discusses correlated ones. In particular, it uses the model of disturbances $\epsilon_k$ to define the disturbance $U_t$ at time $t$:
\begin{equation}
U_t = \sum_{i=0}^\infty w_i \epsilon_{t-i},
\end{equation}
where the $\epsilon_k$ are i.i.d. standard normal (so the $\epsilon$ vector has a diagonal covariance matrix).

The price is now also a random variable, depending on another parameter vector $v$, so that
\begin{equation}
P_t = \sum_{i=0}^\infty v_i \epsilon_{t-i}
\end{equation}
Now the paper talks about $P_t^e$, the aggregate expected value, which it claims is a linear function of the $\epsilon_{t-i}$ for $i > 1$. However, if the agents don't observe those directly, this is impossible.

I am flummoxed. I believe that this is just clumsy notation, and that what is meant is the actual expectation of the real process with respect to the filtration generated by $\epsilon_1, \ldots, \epsilon_t$, and not the aggregate belief!

In particular, since the agents can't observe $\epsilon$, it must simply denote the expected value of the process given all the values of $\epsilon$ up to time $t-1$. So this is another example of the conflation between what information is used within the process itself and what is known by the agents.
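To spell out what such a conditional expectation looks like, here is a sketch with a truncated moving-average disturbance (the weights and the horizon are my own arbitrary choices):

```python
import random

random.seed(2)

# Truncated version of U_t = sum_i w_i eps_{t-i}. An observer who knows the
# eps history can compute E[U_t | eps_{t-1}, eps_{t-2}, ...] exactly: only
# the current shock eps_t is unknown, and it has mean 0. An observer who
# does not see the eps values can only use the marginal mean, which is 0.
w = [1.0, 0.6, 0.3, 0.1]                           # w[0] multiplies the unseen eps_t
eps_hist = [random.gauss(0, 1) for _ in range(3)]  # eps_{t-1}, eps_{t-2}, eps_{t-3}

cond_mean = sum(w_i * e for w_i, e in zip(w[1:], eps_hist))
marg_mean = 0.0
print(cond_mean, marg_mean)   # generally different: the eps history carries information
```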

Arguably, one could say that if all the agents observed the previous $\epsilon$ values, then their aggregate conclusion would be equal to the expected value of the process. This is still a point that needs to be spelled out. However, the asymmetry of information between the agents and the process generating the data is critical. At this point, it is still not clear whether the assumption is that, given the publicly available information, the agents' aggregate predictions are identical to the expected value of the process, even if the latter has hidden state.

Now we can try to plug this process into the equilibrium model.

But the way Muth does it simply assumes that the expectation of the market includes the sequence of all hidden information $\epsilon_1, \ldots, \epsilon_{t-1}$, as previously discussed. So, in some sense, doing so is not very useful.

Muth's solution is to first recognise that we have to write $P_t^e$ in terms of observables, as $P_t^e = \sum_{j=1}^\infty z_j P_{t-j}$. But how does he actually do that? The algebra is straightforward, but what is plugged into what? Muth uses the much stronger "rationality" assumption that \[ \E_M (P_t \mid \epsilon_1, \ldots, \epsilon_{t-1}) = P_t^e \] to obtain the final result.

This is completely unconvincing. The information that the disturbances $\epsilon$ carry can be substantial, and the agents should not know it. Thus, from that point on, Muth is subscribing to a very "oracle"-like definition of rationality. The lack of discussion surrounding the trivial algebra, coupled with ambiguous notation, creates such a confusion in concepts [who knows what, whose expectations we are talking about] that the paper's claim that its assumptions are mild cannot be supported at all.

The final part of this blog was going to talk about the deviations from rationality that Muth discusses, but I am not sure I have the heart to go into it right now.







Tuesday, August 25, 2015

An annotated reading of rational expectations theory - Part 3

A final comment on Section 3 of Muth's paper

As discussed, this paper now introduces a simple market model (theory?) which is
\begin{align}
D_t &= - \beta P_t\\
S_t &= \gamma P_t^e + U_t\\
D_t &= S_t
\end{align}
where $D_t$ is the demand and $S_t$ the supply, while $P_t^e$ is the expected market price at time $t$, conditioned on all previous information. Here each $P_t$ is a deviation from the equilibrium price, so it is better thought of as the amount one expects to gain by producing and selling a unit.

The main ambiguity here is whose expectation that is. It is most natural to assume that this is the aggregate expectation of the population, i.e. that the amount produced during the $t$-th period is basically proportional to the price that people expect to get. It's better to make this into a formal assumption about the theory.

Assumption 3. The theory predicts that total production is linear in market expectations, i.e.
\[
P_t^e = \E_\Psi \E_{p_i} (P_t \mid x_t),
\]
where $x_t = (P_1, D_1, S_1, \ldots, P_{t-1}, D_{t-1}, S_{t-1})$ is the current information state, which may include some other side-information.

This assumption essentially tells us that the amount of supply is proportional to how much people expect to gain (since the quantities are deviations from the equilibrium) for the goods they produce. At first glance, this appears reasonable, but what decision model does this imply for the producers?

The $U_t$ variable can be taken to be simply zero mean noise in this context, and part of our model $M$. Let us now just equate everything, and write the quantities $U_t, P_t$ in terms of the state $\omega_t$ to obtain:
\begin{equation}
%\gamma P_t^e + U(\omega_t) = - \beta P(\omega_t),
 P(\omega_t) = - \frac{\gamma}{\beta} P_t^e - \frac{1}{\beta} U(\omega_t),
\end{equation}
noting that $P_t^e$ only directly depends on the history of observations and the priors of the population, and not any specific $\omega$.

What is the expected price according to this model? We can take expectations with respect to the model's distribution $M$, to obtain
\begin{align}
\E_M (P_t \mid x_t)
&= \int_{\Omega_t}  P(\omega_t) dM(\omega_t \mid x_t) \\
&=  - \frac{\gamma}{\beta} P_t^e  - \frac{1}{\beta} \int_{\Omega_t}  U(\omega_t) dM(\omega_t \mid x_t) \\
&=  - \frac{\gamma}{\beta} P_t^e  - \frac{1}{\beta}\E_M (U_t \mid x_t)
\end{align}
where $M(\omega_t \mid x_t)$ is the conditional distribution of the model given the information.

Uncorrelated deviations. The paper assumes that $\E_M (U_t \mid x_t) = 0$ for any $x_t$ under the model $M$. I am not sure how to interpret that. It seems to hold if the actual supply of each producer is proportional to the market price it expects, plus some zero-mean noise due to externalities. This is not necessarily true, but the paper later relaxes this assumption.

Rationality assumption. This boils down to simply
\[
\E_M P_t = P_t^e,
\]
i.e. that the aggregate market prediction agrees with the model's expectation, which actually implies that the expected price must be the equilibrium price, i.e. $P_t^e = 0$, recalling that quantities are deviations from equilibrium. It is unclear whether this is the marginal expectation, or an expectation conditioned on something. If the latter, then on what?

Finally, can this assumption be relaxed somehow? It is hard to say, since we don't know precisely what it is.

Given all of the above assumptions, we can rewrite the aggregate expected prediction as
\[
P_t^e = - \frac{1}{\beta + \gamma} \E_M (U_t \mid x_t).
\]
However, I am not sure what this tells us. If we already know $M$, we must also know the beliefs of the agents, and so we already know $P_t^e$. So what is the point?
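At least the fixed-point algebra behind this closed form is easy to verify numerically (my own check, with arbitrary numbers):

```python
# Substituting the rationality condition E_M[P_t | x_t] = P_t^e into
# P_t = -(gamma/beta) P_t^e - (1/beta) U_t and solving for P_t^e gives
# the closed form below; check that it is indeed a fixed point.
beta, gamma, EU = 2.0, 0.5, 0.3   # arbitrary values; EU stands in for E_M[U_t | x_t]

Pe = -EU / (beta + gamma)                         # claimed closed form
model_expectation = -(gamma / beta) * Pe - EU / beta
print(Pe, model_expectation)                      # equal: the fixed point holds
```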


Monday, August 24, 2015

An annotated reading of rational expectations theory - Part 2

Whereby, having previously guessed a general model from the assertions of Muth's original paper, I now proceed to look at the paper itself.

To summarise, my own take on those assertions would be that they imply:

  1. Each agent $i$ has his own prior belief $p_i(\omega)$ about the state of the world.
  2. This prior belief is generated independently of those of other agents from some distribution $\Psi$.
  3. This belief is conditioned on some information $x$, which is common to all; so each agent has a posterior belief $p_i(\omega \mid x)$.
  4. The agents are interested in predicting some quantity $y$, which depends deterministically on the state of the world. Let us call this $f(\omega)$. 
  5. Given the posterior $p_i(\omega \mid x)$, each agent can calculate a corresponding posterior distribution for the quantity itself. This can be done by firstly sampling $\omega$ from the posterior and then calculating $f(\omega)$. 
  6. A more limited prediction is to just calculate the expectation of $f$ itself. For simplicity, let's call $f_i(x) = \E_{p_i} (f \mid x)$ the expected value of $f$ for the $i$-th agent, given information $x$.
  7. There is some "theory" $\theta$ which predicts a particular value for $f$; let us call this $\theta(f, x)$.
  8. The prior distribution $\Psi$ is such that, for any information $x$, we have that $\theta(f, x) = \E_{\Psi} \E_{p_i} (f \mid x)$.
Section 3 of the paper discusses the problem of price fluctuations. Here we are specifically talking about some kind of time series, although that does not, strictly speaking, follow from the previous discussion.

Here I immediately run into a problem. The paper presents a set of demand supply equations, where it assumes that the demand $D_t$ is equal to the supply $S_t$, i.e. $D_t = S_t$. There is also a market price $P_t$ for a single good, as well as a fluctuation term $U_t$. This is all fine. However, the supply is defined as
\[
S_t = \gamma P_t^e + U_t,
\]
where $P_t^e$ is defined to be "the market price expected to prevail during the t-th period on the basis of information available through the (t-1)'st period". I can't help but feel that this is quite imprecise, so let us turn this back into the basic framework I outlined above.

At time $t$, the state of the world is $\omega_t$. Agents are allowed to make inferences about the complete state of the world $\omega = (\omega_1, \ldots, \omega_t, \omega_{t+1}, \ldots)$.
I assume that the model that we have (again, here I am not sure whether this describes the actual system or our model of the system) is with respect to the aggregate expectation, so that if $P_t = f(\omega_t)$ then the first interpretation is this.

Interpretation 1. The "price expected to prevail" is the aggregate population expectation
\[
P_t^e = \E_{\Psi} \E_{p_i} (f \mid x).
\]
where $x = (S_1, \ldots, S_{t-1}, D_1, \ldots, D_{t-1}, P_1, \ldots, P_{t-1})$. I am not quite sure that's what is intended, though. It is equally likely that the equations represent a "true" model of some sort.

Interpretation 2. The "price expected to prevail" is that of the "true" model.
Here there is some other, true, model which places a probability measure $M$ on world states, and then
\[
P_t^e = \E_{M} (f \mid x).
\]

It is not at all clear which is meant.


Sunday, August 23, 2015

An annotated reading of rational expectations theory - Part 1

Following up on this post by Noah, I revisited the rational expectations literature. This is an attempt to do some annotated reading.

Muth's paper from 1961 opens by saying that "the expectations of firms ... tend to be distributed, for the same information set, about the prediction of the theory." 

How can we formalise this more precisely? Adopting a point of view where each economic agent $i$ has some prior belief $p_i$ over the set of possible states of the world $\Omega$, we need to define what we mean by expectations. Let $F$ be a function space of functions $f : \Omega \to Y$. For example, one particular $f$ could be the dollar price of gold per ounce for different world states $\omega$. To take expectations, we need to define a prior belief $p$ over the possible states of the world, with a posterior belief $p(\omega \mid x)$ given some information $x$. Combining those, the expected value of some $f \in F$, given information $x$, is:
\[
\E_p(f \mid x) = \int_\Omega f(\omega) d p(\omega \mid x).
\]
The assertion that economic agents' expectations are distributed about the prediction of the theory can also be quantified. First we need to identify what we mean by the prediction of the theory.

Definition 1. A theory $\theta$ is a function $\theta : F \times X \to Y$ where $F$ is the set of economic quantities we wish to measure, $X$ is the set of information states and $Y$ is the set of possible predictions.

Assumption 1. (a) Each economic agent $i$ has a prior belief $p_i$ drawn independently from a prior distribution $\Psi$. (b) For any information state $x \in X$, the corresponding set of posterior distributions $p_i(\omega \mid x)$ satisfies
\[
\E_{\Psi} \E_p(f \mid x) = \theta(f, x).
\]

This is quite a strong assumption. Note that the outer expectation is with respect to the population distribution of subjective beliefs, while the inner one is with respect to the actual beliefs of each agent.
When the number of agents is large, then
\[
\frac{1}{N} \sum_{i=1}^N  \E_{p_i}(f \mid x) \approx \theta(f, x).
\]
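A minimal simulation of this large-$N$ approximation (the value of $\theta$ and the noise scale are my own illustrative choices):

```python
import random

random.seed(4)

# Each agent's expectation E_{p_i}(f | x) is modelled as the theory's
# prediction theta(f, x) plus independent zero-mean noise drawn from Psi.
theta = 4.2          # stand-in for theta(f, x)
N = 50_000           # number of agents

agent_expectations = [theta + random.gauss(0, 1.0) for _ in range(N)]
aggregate = sum(agent_expectations) / N
print(aggregate)     # close to theta, by the law of large numbers
```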
I am not yet sure if that's what the paper is really about (I have only scanned it once and am now going through it), but I am tempted to make guesses. The actual paper never explicitly states the above, but that's what I am guessing it assumes behind the scenes. The remaining sections seem to be about a specific market problem, and then about very specific expectations/belief models for the agents.


Assumptions in section 2 of the paper

The paper concludes section 2 by making assumptions about:
1. Normal random disturbances. I have no idea what those disturbances are supposed to be. Of what?
2. Certainty equivalence. This probably means that the optimal investment choice depends only on the posterior expectation of $f$ and not on the complete posterior belief.
3. Linearity of the system equations. I have no idea how to interpret this.

(to be continued)

Wednesday, October 22, 2014

Ensembles for reinforcement learning


More than six years after my PhD thesis on "ensembles for sequence learning", which included a chapter on ensemble representations of value functions using bootstrap replicates (Sec. 8.5.1), it seems that the subject is back in vogue. Back then I used Thompson sampling (which, being an ignorant fellow, I called "sampling-greedy") and a discount-sensitive variant of it on a number of bandit problems, and experimented with different types of ensembles, from plain online bootstrap to particle filters.

Recently, Hado van Hasselt presented Double Q-Learning, which maintains two value functions (with the twist that each is updated using the other), while Eckles and Kaptein presented Thompson sampling with the online bootstrap.

What kind of bootstrap should we use? Let's say we have a sequence of observations $x_t \in X$, $t = 1, \ldots, n$. One option is the standard bootstrap, where a set of samples is drawn with replacement: for the $i$-th estimate $\theta_i : X^n \to \Theta$ of a parameter $\theta$, we use $n$ points drawn with replacement from the original sequence. Alternatively, one could use the so-called "Bayesian" bootstrap, which places random "posterior" weights on each $x_t$. In practice both methods are quite similar, and the latter is the basis of online bootstrapping (simply add each new observation to each member of the ensemble with a certain probability, or weight).
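A minimal sketch of such an online bootstrap for a stream of observations (this uses the Poisson(1)-weight variant; the mean-estimation setting and all constants are my own choices for illustration):

```python
import math
import random

random.seed(0)

# Online bootstrap ensemble of K mean estimators: each incoming observation
# receives an independent Poisson(1) weight in each ensemble member, which
# approximates resampling with replacement without storing the data.
K = 50
sums = [0.0] * K
counts = [0.0] * K

def poisson1():
    # Knuth's algorithm for a Poisson(1) sample
    limit, k, p = math.exp(-1.0), 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

for _ in range(5000):
    x = random.gauss(3.0, 1.0)          # the observation stream
    for i in range(K):
        w = poisson1()
        sums[i] += w * x
        counts[i] += w

means = [s / c for s, c in zip(sums, counts)]
print(min(means), max(means))           # the spread reflects estimator uncertainty
```

Sampling one ensemble member uniformly at random and acting greedily with respect to it then gives the Thompson-sampling flavour mentioned above.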

Like Eckles and Kaptein, I had used a version of the online bootstrap. Back then, I had also experimented with mixing different types of estimators, but the results were not very encouraging. Some kind of particle filtering appeared to work much better.

One important question is whether this ensemble can be taken to be a posterior distribution or not. It could certainly be the case for identical estimators, but things become more complicated when they are not: for example, when each estimator encodes different assumptions about the model class, such as its stationarity. In that case, the probability of the complete model class should be involved, and that implies that there would be interactions between the different estimators.

Concretely, if you wish to obtain a bootstrap ensemble estimate of expected utility, you could write something like \[ \E(U \mid x) \approx \frac{1}{K} \sum_{i=1}^K \E_{\theta_i} (U \mid x), \] assuming the $\theta_i$ are distributed approximately according to $P(\theta \mid x) \propto P_\theta(x) P(\theta)$. However, typically this is not the case! Consider the simple example where $\theta_1$ assumes the $x_t$ are i.i.d. and $\theta_2$ assumes they are Markov. Then $P_\theta(x)$ would clearly converge to different values for the two (no matter how the $x_t$ are allocated). This implies that some reweighting would be required in the general case. In fact, there exist simple reweighting procedures which can approximate posterior distributions rather well via the bootstrap, and which avoid the computational complexity of MCMC. It would be interesting to look at that in more detail.