How to Engineer Default Priors for Bayesian Hypothesis Testing

Build integral priors with reference analysis for model comparison without prior knowledge

Consider these birth records from Paris collected between 1745 and 1770:

Table 1: Boys and girls born in Paris between 1745 and 1770 [1, p. 59]

How might we use the data to test the hypothesis that the birth rate of boys is greater than the birth rate for girls?

The famous mathematician Pierre-Simon Laplace first pursued this question in 1778 [1]. Laplace modeled the number of boys born as binomially distributed, parameterized by the probability of a boy being born, and sought to compute “the probability that the possibility of the birth of a boy is equal or less than 1/2”. Today, we would call this approach Bayesian statistics; but at the time no such term existed, and Laplace switched freely between using the term “probability” for deductive and inductive inference. Laplace recognized that an answer would depend on the prior used. Seeking a result that represents a lack of any prior knowledge, Laplace assumed a uniform prior, writing

When we have nothing given a priori on the possibility of an event, it is necessary to assume all the possibilities, from zero to unity, equally probable; thus, observation can alone instruct us on the ratio of the births of boys and of girls, we must, considering the thing only in itself and setting aside the events, to assume the law of possibility of the births of a boy or of a girl constant from zero to unity, and to start from this hypothesis into the different problems that we can propose on this object.

Pierre-Simon Laplace, [1, p. 26]

Laplace then computed the answer as

P(p ≤ 1/2 | B, G) = Γ(B + G + 2) / (Γ(B + 1) Γ(G + 1)) × integral_0^(1/2) p^B (1 - p)^G dp

where B=251527 is the number of boys, G=241945 is the number of girls, and Γ denotes the gamma function, and he concluded

We can regard as certain as any other moral truth, that the difference observed in Paris between births of boys and those of girls is due to a greater possibility in the births of boys.

Pierre-Simon Laplace, [1, p. 35]
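
In modern terms, the posterior distribution of p under the uniform prior is Beta(B + 1, G + 1), so Laplace’s probability can be checked with a couple of lines of Python (a minimal sketch, assuming scipy is available):

from scipy.stats import beta

B, G = 251527, 241945
# Under the uniform prior, the posterior for p is Beta(B + 1, G + 1);
# Laplace's quantity is the posterior probability that p <= 1/2.
posterior = beta(B + 1, G + 1)
print(posterior.cdf(0.5))  # vanishingly small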

But as Fisher would emphasize nearly 150 years later, the uniform prior gives results that depend arbitrarily on how the binomial model is parameterized ([2]).

Suppose that instead of parameterizing by the probability of a boy being born, p, we parameterize by a variable θ with

p = φ(θ),

where φ is a monotonically increasing function onto [0, 1], for example φ(θ) = (1 + sin θ)/2 with θ ∈ [-π/2, π/2]. Applying integration by substitution shows us that

integral_a^b π(φ(θ)) φ′(θ) dθ = integral_φ(a)^φ(b) π(p) dp

Thus, a prior density π(p) in p is equivalent to the density π(φ(θ)) φ′(θ) in θ. And a uniform prior in p is equivalent to the prior cos(θ)/2 in θ. Similarly, we can show that a uniform prior in θ is equivalent to the prior (1/π) p^(-1/2) (1 - p)^(-1/2) in p.
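
As a quick numerical check of this equivalence (using the example parameterization p = (1 + sin θ)/2 above), we can sample p uniformly, transform to θ, and compare the empirical density of θ with cos(θ)/2:

import numpy as np

# Sample p uniformly, map to theta with p = (1 + sin(theta)) / 2, and
# compare the empirical density of theta against cos(theta) / 2.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, size=1_000_000)
theta = np.arcsin(2 * p - 1)

hist, edges = np.histogram(theta, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - np.cos(centers) / 2)))  # close to zero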

Figure 1: Uniform priors under different parameterizations

If the uniform prior depends arbitrarily on the parameterization, how should we pick a prior when we want to assume no prior knowledge?

Constructing Non-informative Priors

Let’s approach the problem experimentally. Suppose someone hands us a prior, π(p), and claims that it represents “knowing nothing”. What tests could we run to validate the claim?

Let’s try picking a value p_true ∈ (0, 1). Now let’s generate n random values from the distribution P(·|p_true). 

y <- number of successes in n samples from P(· | p_true)

Suppose we use the prior to compute the two-tailed 95% posterior credible interval about p:

p_low, p_high <-
     a, b such that
        1 / Z integral_0^a P(y | p) π(p) dp = 0.025
     and
        1 / Z integral_b^1 P(y | p) π(p) dp = 0.025
     where
        Z = integral_0^1 P(y | p) π(p) dp

Does the interval [p_low, p_high] contain p_true? What if we repeat this experiment many times? If π(p) is a good non-informative prior, then the fraction of simulation runs whose interval contains p_true should approach a value close to 95%, and we can say that the prior has good frequentist matching coverage.

Let’s try this experiment for the uniform prior Laplace used with n=3 and p=0.123.

import numpy as np
from scipy.stats import beta

alpha = 0.95  # use 95% credible intervals
low = (1 - alpha) / 2
high = 1 - low

p = 0.123
n = 3
N = 1000  # number of simulation runs

# tallies for each possible observation y = 0, 1, ..., n
counts = np.zeros(n + 1, dtype=int)

# randomly sample values
for _ in range(N):
    y = np.random.binomial(n, p)
    counts[y] += 1

# compute frequentist coverage performance
for y, cnt in enumerate(counts):
    dist = beta(y + 1, n - y + 1)  # posterior for the uniform prior
    a = dist.ppf(low)
    b = dist.ppf(high)
    contains = a < p < b
    print(y, cnt, a, b, contains)

Running the code gives me the result

Table 2: Coverage performance for Laplace’s prior with p=0.123, n=3, and 1000 simulation runs.

We can compute the frequentist coverage of the simulation as the fraction of the N simulated data sets whose credible interval contains p, that is, by summing counts[y] over the observations y whose interval [a, b] contains p and dividing by N.
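
Continuing from the simulation code above, a short sketch of this computation:

# fraction of simulation runs whose credible interval contains p
covered = 0
for y, cnt in enumerate(counts):
    dist = beta(y + 1, n - y + 1)
    if dist.ppf(low) < p < dist.ppf(high):
        covered += cnt
print(covered / N)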

If we keep n fixed at 3, vary p, and continually increase the number of times we run the simulation, the result will converge to the graph below. 

Figure 2: Frequentist coverage performance for Laplace’s prior with n=3.

We see that frequentist coverage is for the most part close to the target of 95%. But towards the endpoints, p=0 and p=1, the prior gives poor frequentist matching coverage. Now, let’s try testing what is called Jeffreys prior, (1/π) p^(-1/2) (1 - p)^(-1/2):

Figure 3: Frequentist coverage performance for Jeffreys prior (red) with n=3. For comparison, Laplace’s prior (blue) is included.

We can see that the prior gives better coverage close to the endpoints while still giving good coverage away from the endpoints.
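
To reproduce curves like those in Figures 2 and 3 without simulation noise, we can compute the exact coverage for any conjugate Beta(a0, b0) prior by summing the binomial probabilities of the observations whose credible interval contains p. This is a sketch; Laplace’s uniform prior corresponds to Beta(1, 1) and Jeffreys prior to Beta(1/2, 1/2):

import numpy as np
from scipy.stats import beta, binom

def exact_coverage(p, n, a0, b0, alpha=0.95):
    # exact frequentist coverage of the two-tailed credible interval
    # built from a Beta(a0, b0) prior, for true success probability p
    low, high = (1 - alpha) / 2, (1 + alpha) / 2
    total = 0.0
    for y in range(n + 1):
        post = beta(a0 + y, b0 + n - y)
        if post.ppf(low) < p < post.ppf(high):
            total += binom.pmf(y, n, p)
    return total

ps = np.linspace(0.01, 0.99, 99)
laplace = [exact_coverage(p, 3, 1.0, 1.0) for p in ps]    # uniform prior
jeffreys = [exact_coverage(p, 3, 0.5, 0.5) for p in ps]   # Jeffreys prior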

What happens if we try different values of n?

Figure 4: Frequentist coverage performance for Jeffreys prior (red) and Laplace’s prior (blue)

Matching coverage improves for both of the priors as n increases, but Jeffreys prior consistently performs better.

Can we do better than Jeffreys prior? Well, no, or at least not asymptotically. It can be shown that with Jeffreys prior the frequentist matching coverage error decreases on the order of O(n^-1) as n increases, while for any other prior the coverage error decreases at a slower rate [3, sec. 0.2.3.2; 4].

Non-informative Priors and Hypothesis Testing

Now consider the problem of testing the two hypotheses H_1: p < 1/2 and H_2: p > 1/2.

If y denotes the observed data and we assume that both hypotheses are equally likely a priori, then we can compute the posterior probability for H_i as

P(H_i | y) = m_i(y) / (m_1(y) + m_2(y))

where

m_i(y) = integral over the parameter space of H_i of P(y | p) π_i(p) dp

and π_i(p) denotes the prior for hypothesis H_i.

Hence, given priors, we can determine posterior probabilities for the hypotheses. What priors should we use? One option is to adapt Jeffreys prior by restricting and renormalizing. 
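
As a minimal sketch of this option (using the facts that Jeffreys prior for the binomial parameter is Beta(1/2, 1/2) and that the corresponding posterior is Beta(y + 1/2, n - y + 1/2); the function name is mine):

from scipy.stats import beta

def restricted_jeffreys_posteriors(y, n):
    # Posterior probabilities of H1: p < 1/2 and H2: p > 1/2 when each
    # hypothesis gets Jeffreys prior restricted and renormalized to its
    # parameter space, with equal prior probability on the hypotheses.
    prior = beta(0.5, 0.5)             # Jeffreys prior for the binomial
    post = beta(y + 0.5, n - y + 0.5)  # posterior under the full Jeffreys prior
    # m_i(y) is proportional to the posterior mass on H_i divided by the
    # prior mass on H_i; the common marginal factor cancels when normalizing.
    m1 = post.cdf(0.5) / prior.cdf(0.5)
    m2 = post.sf(0.5) / prior.sf(0.5)
    return m1 / (m1 + m2), m2 / (m1 + m2)

print(restricted_jeffreys_posteriors(y=7, n=10))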

Figure 5: Jeffreys prior for two hypotheses

As we saw, Jeffreys prior has optimal frequentist matching performance. However, many believe that priors for hypothesis testing should concentrate more probability mass towards the decision boundary of the two hypotheses, which Jeffreys prior doesn’t do.

Typical prior beliefs will concentrate closer to the dividing line between the hypotheses (zero in the one-sided testing problem mentioned above), and that using a prior distribution which is extremely diffuse is thus unreasonable, at least for small or moderate sample sizes.

Berger & Mortera [5]

In the literature are widely used diffuse, vague or flat priors and the objective ones like the Jeffreys prior or the reference prior, to estimate the parameters of regressor models. However, the use of these priors is not recommended for Bayesian model selection due to the fact, among other reasons, that their formulation does not take into account the null hypothesis, making difficult for π_2(θ_2) to be concentrated around the hypothesis, which is a widely accepted condition.

Salmerón, Cano, & Robert, [6]

Using a process called integral priors, introduced by Cano, Salmerón, and Robert ([7]), we can adapt non-informative priors to be more suitable for hypothesis testing. Here’s how the process works for the binomial distribution: Let π_1(p) and π_2(p) denote the hypothesis priors (we can think of these as unknown probability densities to be solved for), and let π^N_1(p) and π^N_2(p) denote Jeffreys prior restricted and renormalized to the parameter space of H_1 and H_2, respectively. Let y ∈ {0, 1} denote the number of successes in a single observation. Then π_1(p) and π_2(p) are the unique priors that satisfy the equations

π_1(p) = sum_{y=0}^{1} π^N_1(p | y) m_2(y)  and  π_2(p) = sum_{y=0}^{1} π^N_2(p | y) m_1(y)

where

π^N_i(p | y) = P(y | p) π^N_i(p) / integral_0^1 P(y | p) π^N_i(p) dp  and  m_i(y) = integral_0^1 P(y | p) π_i(p) dp
Integral priors build on the ideas of imaginary observations and expected-posterior priors introduced by Pérez and Berger ([8]). We can think of π_i(p) as the expected posterior from a single imaginary observation that follows the marginal distribution determined by the other prior. As Cano et al. explain

The argument to propose these equations is that a priori we suppose that the two models are equally valid and provided with ideal unknown priors π_i(θ_i), i=1,2, that yield true marginals allowing to balance each model with respect to the other one since the prior π_i(θ_i) is derived from the marginal m_j(x), j≠i, as an unknown expected posterior prior. This is nothing but the idea of being a priori neutral comparing the two models.

Cano, Salmerón, & Robert, [7]

It’s fairly easy to solve for π_1(p) and π_2(p) in the case of the binomial distribution. Because a single observation y can only take the values 0 and 1, each integral prior is a two-component mixture of restricted posteriors,

π_1(p) = m_2(0) π^N_1(p | 0) + m_2(1) π^N_1(p | 1)

and

π_2(p) = m_1(0) π^N_2(p | 0) + m_1(1) π^N_2(p | 1)

and since m_i(1) = integral_0^1 p π_i(p) dp with m_i(0) = 1 - m_i(1), the mixture weights are determined by a pair of coupled linear equations in the prior means.
Plotting the integral priors next to the Jeffreys priors shows how the integral priors naturally distribute more probability mass towards the boundary of the two hypotheses:

Figure 6: Integral prior for binomial model together with Jeffreys prior (dashed)

Let’s compare the priors on the birth records data set Laplace analyzed:

Table 3: Reanalyzing birth records with Laplace’s prior, Jeffreys prior, and the integral prior.

As we might expect, there’s not much difference. With enough data, the differences in the priors are of little consequence. The more interesting question is, how do the priors compare on smaller data sets where a prior contributes more to the posterior? To explore this question, we can simulate different small data sets with varying values of p_true.

Simulation Experiment

Let’s start by looking at data with p_true=0.5 and use Laplace’s prior, Jeffreys prior, and the integral prior to test the two hypotheses: H_1: p<0.5 and H_2: p>0.5.

Table 4: Laplace’s prior, Jeffreys prior, and the integral prior for simulated binomial data with p=0.5.

We can see that the differences are largest when n is small (n=1, n=2), and that the integral prior produces posterior probabilities for the two hypotheses that are closer to each other.

Now, let’s look at a similar experiment with data where p_true=0.6.

Table 5: Laplace’s prior, Jeffreys prior, and the integral prior for simulated binomial data with p=0.6.

As in the first example, the integral prior produces posterior probabilities that are closer to each other; but after 15 samples, the posterior probability of p > 0.5 exceeds 0.95 for all three priors.

SP500 Performance

Let’s try using integral priors on some real-world data. Consider the daily price changes of the SP500 index. Suppose we model whether the price increases or decreases on each day as independent events following a binomial distribution with parameter p_up, and consider the two hypotheses H_1: p_up < 0.5 and H_2: p_up > 0.5.

Below I show how the posteriors for the two hypotheses with the integral prior evolve starting from Jan 2nd, 1990:

Table 6: Posteriors for the two hypotheses p_up < 0.5 and p_up > 0.5 using integral priors.

Conclusion

Though often used, the uniform prior leads to results that depend arbitrarily on the parameterization and therefore can’t reasonably claim to represent “knowing nothing”. In comparison, Jeffreys prior is uniquely determined and, for single-parameter models, has asymptotically optimal frequentist matching coverage. If the goal is prediction or inference with an assumed binomial distribution, Jeffreys prior is a well-justified choice. If, however, the goal is model comparison, Jeffreys prior can be ill-suited, as it doesn’t concentrate probability mass at the boundary dividing the hypotheses. Fortunately, by using the process of integral priors, we can use Jeffreys prior as a building block to produce a prior that is both independent of parameterization and well-suited for hypothesis testing.

References

[1]: Laplace, P. (1778). Mémoire sur les probabilités. Translated by Richard J. Pulskamp.

[2]: Fisher, R. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222, 309–368

[3]: Berger, J., J. Bernardo, and D. Sun (2022). Objective Bayesian inference and its relationship to frequentism.

[4]: Welch, B. L. and H. W. Peers (1963). On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society Series B-methodological 25, 318–329.

[5]: Berger, J. and J. Mortera (1999). Default Bayes factors for nonnested hypothesis testing. Journal of the American Statistical Association 94(446), 542–554.

[6]: Salmerón, D., J. A. Cano, and C. P. Robert (2015). Objective Bayesian hypothesis testing in binomial regression models with integral prior distributions. Statistica Sinica 25(3), 1009–1023.

[7]: Cano, J. A., D. Salmerón, and C. P. Robert (2008). Integral equation solutions as prior distributions for Bayesian model selection. TEST 17(3), 493–504.

[8]: Pérez, J. M. and J. O. Berger (2002). Expected-posterior prior distributions for model selection. Biometrika 89(3), 491–511. http://www.jstor.org/stable/4140597