This is a preliminary version. The algorithm was originally proposed in (Metropolis et al., 1953) and later extended in (Hastings, 1970). For an exhaustive description of the algorithm and other MC methods I refer the reader to (Robert & Casella, 2005).

## Quick overview

- Definitions of *detailed balance* and *stationary distribution* of a Markov chain.
- Detailed balance implies stationarity.
- Definition of the *MH kernel*.
- The MH kernel satisfies detailed balance: formal and intuitive proof.

## Prelims

We start with a couple of definitions to set the stage for the algorithm and why it works.
Recall that a **stationary distribution** $\pi$ of the chain with transition kernel $K$ is
characterized by

\[\int K(x, Y) \ \pi(x) \ dx = \pi(Y)\]

for all events $Y$. Note that in the discrete finite case, where $K$ is just a transition matrix, this translates to

\[\pi = \pi K.\]

Moreover, for the Markov chain $(X_n)$ defined by $X_n \sim K(X_{n-1}, \cdot)$, if
$X_0 \sim \pi$ then $X_n \sim \pi$ for every $n$. What is a bit more involved is showing that no matter the initial distribution the chain *converges* to $\pi$ (cf. (Robert & Casella, 2005)), i.e. showing that for any state $x$ (and for some appropriate norm) we have

\[\lim_{n \to \infty} \big\| K^n(x, \cdot) - \pi \big\| = 0.\]

This property is called "*ergodicity*, that is, independence of initial conditions" (Robert & Casella, 2005).
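In the discrete finite case both statements are easy to check numerically. A minimal sketch (the 3-state transition matrix `K` below is a made-up example): the stationary distribution is a left eigenvector of $K$ with eigenvalue $1$, and powers of $K$ push any initial distribution towards it.

```python
import numpy as np

# Hypothetical 3-state transition matrix K (rows sum to 1).
K = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

# pi = pi K means pi is a left eigenvector of K with eigenvalue 1,
# i.e. an eigenvector of K^T.
eigvals, eigvecs = np.linalg.eig(K.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()  # normalize to a probability vector

print(np.allclose(pi @ K, pi))  # pi = pi K holds

# Ergodicity: regardless of the initial distribution x0,
# x0 K^n converges to pi.
x0 = np.array([1.0, 0.0, 0.0])
print(np.allclose(x0 @ np.linalg.matrix_power(K, 200), pi))
```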

A Markov chain with transition kernel $K$ satisfies the **detailed balance** condition if there is a function $f$ such that for every $x, y$

\[f(x) \ K(x,y) = f(y) \ K(y,x).\]

If $f$ is a probability density – as it will be in the MH case below – this implies that the chain is reversible and $f$ is a *stationary distribution of the chain*.
Simply compute

\[\int f(x) \ K(x,y) \ dx = \int f(y) \ K(y,x) \ dx = f(y) \int K(y,x) \ dx = f(y).\]
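The claim that detailed balance implies stationarity can also be checked numerically. A small sketch with a hypothetical 3-state chain: we build the "flow" matrix $F$ with $F_{xy} = f(x) K(x,y)$ directly, so that symmetry of $F$ is exactly detailed balance, and recover $K$ from it.

```python
import numpy as np

# Hypothetical target distribution f on 3 states.
f = np.array([0.2, 0.3, 0.5])

# Symmetric flow matrix F with F[x, y] = f(x) K(x, y); symmetry of F
# is precisely the detailed balance condition. Row x sums to f(x),
# so K = F / f has rows summing to 1.
F = np.array([
    [0.10, 0.06, 0.04],
    [0.06, 0.14, 0.10],
    [0.04, 0.10, 0.36],
])
K = F / f[:, None]

print(np.allclose(K.sum(axis=1), 1.0))  # K is a proper transition matrix
print(np.allclose(f @ K, f))            # detailed balance implies f K = f
```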

## Metropolis-Hastings transition kernel

Given a distribution $p(x)$ we will now construct a Markov chain $(X_1, X_2, \ldots)$ with stationary distribution $p(x)$.
For a family of proposal distributions $q(y \mid x)$ we define the **acceptance probability** (of a proposed state transition $x \to y$) as

\[a(x,y) = \min \Big\{ 1, \frac{q(x \mid y) \ p(y)}{q(y \mid x) \ p(x)} \Big\}.\]

As the name suggests, $a(x,y)$ guides which proposals to accept and which to reject. The way it is set up ensures that $p$ is the stationary distribution of the chain. But we are getting ahead of ourselves. Let us define the Markov chain first.

Pick an initial state $x_0$ and sample $x_1, x_2, x_3, \ldots$ by iterating the following process for $t \geq 1$:

\[\left.\begin{aligned} y_t \ & \sim \ q(y \mid x_{t-1}) \\ a_t \ & \sim \ \text{Bernoulli}\big(a(x_{t-1}, y_t) \big) \end{aligned}\right. \ \ \ \ \longrightarrow \ \ \ \ \ x_t := \begin{cases} y_t & \text{if $a_t$ is true,} \\ x_{t-1} & \text{else.} \end{cases}\]

We can also write down the MH kernel explicitly: Let $r(x) = 1 - \int a(x,\tilde{y}) \ q(\tilde{y} \mid x) \ d\tilde{y}$ denote the mass of ending up in the second branch of the above if-else statement, or equivalently the probability of the proposal being rejected. Then we have

\[K(x,y) = a(x,y) \ q(y \mid x) + r(x) \ \delta_x(y).\]

## Detailed balance for MH kernel

The above Markov chain satisfies the detailed balance condition with $f = p$: Note that detailed balance is trivially satisfied for $x = y$, which is the only case in which the second term is non-vanishing. It remains to show detailed balance for the first term, which can be written as

\[\begin{eqnarray*} p(x) \ \big( a(x,y) \ q(y \mid x) \big) &=& \min \Big\{ 1, \frac{q(x \mid y) \ p(y)}{q(y \mid x) \ p(x)} \Big\} \ q(y \mid x) \ p(x) \\ &=& \min \Big\{ q(y \mid x) \ p(x), \ q(x \mid y) \ p(y) \Big\}. \end{eqnarray*}\]

The expression on the right-hand side is symmetric in $x$ and $y$ and thus implies detailed balance.
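The sampling loop defined above fits in a few lines of Python. A minimal sketch (the function name, the Gaussian random-walk proposal, and the standard-normal target are my own illustrative choices, not part of the text): since this proposal is symmetric, $q(x \mid y) = q(y \mid x)$ cancels in $a(x,y)$ and we are in the original Metropolis special case.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_p, x0, n_steps, step=1.0):
    """MH with a Gaussian random-walk proposal q(y | x) = N(y; x, step^2).

    The proposal is symmetric, so the acceptance probability reduces to
    min{1, p(y) / p(x)}; we work with log p for numerical stability.
    """
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        y = x + step * rng.standard_normal()       # y_t ~ q(y | x_{t-1})
        a = min(1.0, np.exp(log_p(y) - log_p(x)))  # acceptance prob a(x, y)
        if rng.random() < a:                       # a_t ~ Bernoulli(a(x, y))
            x = y                                  # accept: x_t = y_t
        samples[t] = x                             # else keep x_t = x_{t-1}
    return samples

# Hypothetical target: standard normal, known only up to a constant.
log_p = lambda x: -0.5 * x**2

samples = metropolis_hastings(log_p, x0=3.0, n_steps=50_000)
# After discarding burn-in, mean and std should be close to 0 and 1.
print(samples[1000:].mean(), samples[1000:].std())
```

Note that `log_p` only needs to be known up to an additive constant, since only differences $\log p(y) - \log p(x)$ enter the acceptance probability.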

## Intuition

Consider the probability ratio of the reversed transition $y \to x$ over the proposed transition $x \to y$,

\[\frac{p(y \to x)}{p(x \to y)} = \frac{q(x \mid y) \ p(y)}{q(y \mid x) \ p(x)},\]

and distinguish two cases:

\[\text{(i)} \ \ a = 1 \ \ \ \text{and} \ \ \ \text{(ii)} \ \ a < 1.\]

The first case translates to $p(y \to x) \geq p(x \to y)$. In order to re-establish balance we have to increase the denominator $p(x \to y)$, which means we want to accept the proposed transition. Conversely, the second case translates to $p(y \to x) < p(x \to y)$ and we want to reject the proposed transition (proportionally to how far we are out of balance) to avoid amplifying the imbalance.
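When the proposal is asymmetric, the $q$-ratio in $a(x,y)$ does not cancel and acts as the Hastings correction. A small numerical sketch (the exponential target and log-normal proposal are illustrative assumptions of mine): for a log-normal proposal centred at $\log x$, the correction at $x = 1$, $y = 2$ is exactly $2$, giving $a(1,2) = 2/e$.

```python
import numpy as np

def acceptance_prob(p, q, x, y):
    """a(x, y) = min{1, q(x|y) p(y) / (q(y|x) p(x))} for densities p, q."""
    return min(1.0, (q(x, y) * p(y)) / (q(y, x) * p(x)))

# Hypothetical target: exponential density on (0, inf).
p = lambda x: np.exp(-x)

# Asymmetric proposal: q(y | x) is log-normal centred at log(x),
# so q(x | y) != q(y | x) and the Hastings correction matters.
def q(y, x):
    return np.exp(-0.5 * (np.log(y) - np.log(x)) ** 2) / (y * np.sqrt(2 * np.pi))

# q(1|2)/q(2|1) = 2 and p(2)/p(1) = 1/e, so a(1, 2) = 2/e.
print(acceptance_prob(p, q, 1.0, 2.0))  # ≈ 0.7358
```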

# References

- Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. *The Journal of Chemical Physics*, *21*(6), 1087–1092. https://doi.org/10.1063/1.1699114
- Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. *Biometrika*, *57*(1), 97–109. https://doi.org/10.1093/biomet/57.1.97
- Robert, C. P., & Casella, G. (2005). *Monte Carlo Statistical Methods (Springer Texts in Statistics)*. Springer-Verlag.