How to Apply the Central Limit Theorem to Bounded Data

What can we say about the mean of data distributed in an interval [a, b]?

Let’s imagine that we’re measuring the approval rating of an unpopular politician. Suppose we sample ten polls and get the values

Table 1: Sample approval ratings for an unpopular politician

How can we construct a posterior distribution for our belief in the politician’s mean approval rating? 

Let’s assume that the polls are independent and identically distributed random variables, X_1, …, X_n. The central limit theorem tells us that the sample mean will asymptotically approach a normal distribution with variance σ²/n,

X̄ = (X_1 + ⋯ + X_n)/n ∼ N(µ, σ²/n) (approximately, for large n),

where µ and σ² are the mean and variance of X_i.

Figure 1: Plots of a normalized histogram of sample approval means for our unpopular politician together with the normal distribution approximation for n=1, n=3, n=5, n=7, n=10, and n=20. We can see that by n=10, the sample mean distribution is quite close to its normal approximation.
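
To get a feel for the kind of comparison Figure 1 shows, we can simulate sample means and compare them with the CLT’s normal approximation. The sketch below uses a right-skewed beta distribution on [0, 100] as a stand-in for the unknown distribution of a single poll; that choice is purely illustrative.

import numpy as np
import scipy.stats

np.random.seed(0)

n = 10              # number of polls per sample mean
num_trials = 100000

# Stand-in distribution for a single poll: a right-skewed beta on [0, 100].
# The true poll distribution is unknown; this is only for illustration.
poll_dist = scipy.stats.beta(a=2, b=8, scale=100)
mu, sigma2 = poll_dist.mean(), poll_dist.var()

# Draw many sample means of n polls each
sample_means = poll_dist.rvs(size=(num_trials, n)).mean(axis=1)

# Compare with the normal approximation N(mu, sigma^2 / n)
approx = scipy.stats.norm(loc=mu, scale=np.sqrt(sigma2 / n))
print(np.mean(sample_means), approx.mean())
print(np.std(sample_means), approx.std())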

Motivated by this asymptotic limit, let’s approximate the likelihood of observed data y with

L(µ, σ²; y) ∝ (σ²)^(−n/2) exp(−∑_i (y_i − µ)² / (2σ²)).

Using the objective prior

π(µ, σ²) ∝ 1/σ²

(more on this later) and integrating out σ² gives us a t distribution for the posterior, π(µ|y),

π(µ|y) ∝ (1 + n(µ − ȳ)² / ((n−1) s²))^(−n/2),

where

ȳ = (1/n) ∑_i y_i   and   s² = (1/(n−1)) ∑_i (y_i − ȳ)²;

equivalently, (µ − ȳ) / (s/√n) follows a standard t distribution with n − 1 degrees of freedom.
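
In code, this posterior is just a location-scale t distribution, so we can work with it directly through scipy. The values below are hypothetical stand-ins, not the actual Table 1 data.

import numpy as np
import scipy.stats

# Hypothetical approval ratings standing in for Table 1 (percent)
y = np.array([10.2, 7.3, 12.1, 5.5, 8.8, 9.0, 11.4, 6.1, 7.9, 10.0])

n = len(y)
y_bar = np.mean(y)
s = np.std(y, ddof=1)

# Posterior for the mean under the 1/sigma^2 prior:
# (mu - y_bar) / (s / sqrt(n)) ~ t with n - 1 degrees of freedom
posterior = scipy.stats.t(df=n - 1, loc=y_bar, scale=s / np.sqrt(n))

print(posterior.mean())          # posterior mean
print(posterior.interval(0.95))  # central 95% credible interval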

Let’s look at the posterior distribution for the data in Table 1.

Figure 2: Posterior distribution for a politician’s approval rating after observing the polls in Table 1.

We can see that something is clearly wrong: 1/4 of the posterior’s probability mass lies on (−∞, 0]. But of course, we know that an approval rating must be between 0% and 100%. Let’s consider a simple fix: we’ll restrict the posterior distribution to be between 0 and 100 and renormalize so that the posterior integrates to 1.
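
Concretely, the fix amounts to dividing the unconstrained t posterior’s density by the probability mass it places on [0, 100]. Here’s a minimal sketch of that restriction-and-renormalization step, again with hypothetical data in place of Table 1.

import numpy as np
import scipy.stats

def renormalized_posterior_pdf(mu_grid, y, a=0.0, b=100.0):
    # Restrict the unconstrained t posterior to [a, b] and renormalize
    # so that it integrates to 1 over the interval
    n = len(y)
    y_bar = np.mean(y)
    scale = np.std(y, ddof=1) / np.sqrt(n)
    dist = scipy.stats.t(df=n - 1, loc=y_bar, scale=scale)
    mass = dist.cdf(b) - dist.cdf(a)   # probability mass inside [a, b]
    pdf = dist.pdf(mu_grid) / mass
    return np.where((mu_grid >= a) & (mu_grid <= b), pdf, 0.0)

# Hypothetical approval ratings (percent)
y = [1.8, 0.7, 3.2, 1.1, 2.5, 0.9, 1.4, 2.0, 0.6, 1.3]
print(renormalized_posterior_pdf(np.linspace(0, 5, 6), y))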

Figure 3: Posterior distribution for a politician’s approval rating after observing the polls in Table 1 and restricting the distribution to be between 0% and 100%.


This is certainly better. But let’s think a bit harder, because something is still not right. The Bhatia–Davis inequality tells us that the variance of a distribution supported on [m, M] is bounded above by

σ² ≤ (M − µ)(µ − m),

where µ is the mean of the distribution.

Figure 4: Max variance given mean using Bhatia–Davis inequality.
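
The bound is easy to compute. The sketch below evaluates it for a few means (essentially what Figure 4 plots) and checks it against the empirical variance of some arbitrary bounded data; the uniform samples are just for illustration.

import numpy as np

def bhatia_davis_bound(mu, m=0.0, M=100.0):
    # Maximum possible variance for a distribution on [m, M] with mean mu
    return (M - mu) * (mu - m)

# The bound peaks at the midpoint and vanishes at the endpoints
for mu in [0.0, 10.0, 50.0, 90.0, 100.0]:
    print(mu, bhatia_davis_bound(mu))

# The empirical variance of any data confined to [0, 100] respects the bound
y = np.random.uniform(0, 100, size=1000)
print(np.var(y) <= bhatia_davis_bound(np.mean(y)))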

Thus, if the mean is close to one of the endpoints 0% or 100%, the data must have variance close to zero. But if we’ve observed distinct values, the variance can’t be zero, so the posterior for the mean approval rating should go to zero at the endpoints, which the posterior in Figure 3 doesn’t do.

In this blog post, I’m going to show how we can use reference analysis to derive a posterior that incorporates these constraints and better expresses our belief. Before getting into the details, let’s plot out the posterior for the data in Table 1 when the Bhatia–Davis inequality is properly incorporated.

Figure 5: Posterior distribution for the mean approval rating given the data in Table 1 and applying the constraint on the variance from the Bhatia–Davis inequality.

We see that the posterior now correctly goes to zero at 0%, while differing minimally from the unconstrained posterior elsewhere.

What is reference analysis?

Reference analysis is a general purpose framework for deriving noninformative priors given a likelihood function and a parameter of interest.

What makes a prior noninformative? A key property is frequentist matching coverage: across the parameter space, the credible sets the prior produces should cover the true parameter value with frequency close to their stated credibility level. We can intuitively think of frequentist matching coverage as answering the question “How accurate are the Bayesian credible sets produced using a given prior?”.

Note: For more background on objective priors and frequentist matching coverage, see Introduction to Objective Bayesian Inference, [1], and [2].

Given the Fisher information matrix for a model and a parameter of interest, reference analysis provides a process to produce a prior with excellent frequentist matching performance (see Proposition 0.2 of [2] and Chapter 10 of [1]).

Let’s look at how this works for the normal distribution with the mean as the parameter of interest.

Reference analysis: unconstrained normal distribution

Suppose y represents observed independent identically distributed normal data with unknown mean, µ, and variance, σ². Using reference analysis, we’ll derive an objective prior with the mean as the parameter of interest. We start by computing the Fisher information matrix. The likelihood function is given by

L(µ, σ²; y) = (2πσ²)^(−n/2) exp(−∑_i (y_i − µ)² / (2σ²)).

Put

f(µ, σ²) = log L(µ, σ²; y) = −(n/2) log(2πσ²) − ∑_i (y_i − µ)² / (2σ²).

Differentiating f, we have

∂f/∂µ = ∑_i (y_i − µ) / σ²,
∂f/∂σ² = −n/(2σ²) + ∑_i (y_i − µ)² / (2σ⁴),
∂²f/∂µ² = −n/σ²,
∂²f/∂µ∂σ² = −∑_i (y_i − µ) / σ⁴,
∂²f/∂(σ²)² = n/(2σ⁴) − ∑_i (y_i − µ)² / σ⁶.

Now, we can fill in the Fisher information matrix, I(µ, σ²) = −E[∇²f]:

I(µ, σ²) = [[n/σ², 0], [0, n/(2σ⁴)]].

We compute the inverse of the Fisher information matrix, V(µ, σ²), as

V(µ, σ²) = [[σ²/n, 0], [0, 2σ⁴/n]].
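
As a quick sanity check, we can verify the entries of I(µ, σ²) symbolically with sympy. The sketch below computes the single-observation Fisher information by taking −E of the second derivatives of the log-density; the full-sample matrix is just n times it.

import sympy as sp
from sympy.stats import Normal, density, E

mu = sp.Symbol('mu', real=True)
sigma2 = sp.Symbol('sigma2', positive=True)
t = sp.Symbol('t', real=True)

# Log-density of a single N(mu, sigma2) observation
y = Normal('y', mu, sp.sqrt(sigma2))
logf = sp.log(density(y)(t))

# Fisher information: I_jk = -E[ d^2 log f / d theta_j d theta_k ]
params = [mu, sigma2]
I = sp.Matrix(2, 2, lambda j, k:
    sp.simplify(-E(sp.diff(logf, params[j], params[k]).subs(t, y))))

print(I)  # expect Matrix([[1/sigma2, 0], [0, 1/(2*sigma2**2)]])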

Because the conditional prior for the variance,

π(σ²|µ) ∝ 1/σ²,

is improper over (0, ∞), we need to proceed (Proposition 0.2 of [2]) with a sequence of compact subsets

A_1 ⊂ A_2 ⊂ ⋯,

where

⋃_n A_n = (0, ∞).

We can use the sets A_n = [1/(n+1), n+1]. Put

Put µ_0, σ²_0 = 1 (any values will do here). Then the reference prior is given by (Case 2 of Proposition 0.2 in [2])

π(µ, σ²) ∝ 1/σ².
We can run a simple simulation to test the frequentist coverage performance of the prior.

import numpy as np
import scipy.stats

# Set a target coverage percentage.
# We use 95%
alpha = 0.95
low = (1 - alpha) / 2
high = 1 - low

N = 10000 # number of simulation trials

# Run a simulation to estimate the frequentist coverage
# performance of the prior for given true parameters
def coverage_sim(mu_true, sigma_true, n):
    res = 0
    for _ in range(N):
        # randomly sample data given true parameters
        y = np.random.normal(loc=mu_true, scale=sigma_true, size=n)

        mu = np.mean(y)
        s = np.std(y, ddof=1) / np.sqrt(n)

        # determine whether mu_true is within the two-tailed
        # posterior credible set covering <alpha> percent of
        # the probability mass
        dist = scipy.stats.t(df=n-1, loc=mu, scale=s)
        t = dist.cdf(mu_true)
        if low < t and t < high:
            res += 1

    res /= N
    return res

# Try different values of mu_true, sigma_true, and n
#
# For a good noninformative prior, the frequentist coverage
# performance will consistently be close to the target <alpha>
ns = [3, 5, 10, 20]
mus = [0.0, 1.2, 3.4]
sigma2s = [0.01, 0.1, 1.0]
for n in ns:
    for sigma2 in sigma2s:
        for mu in mus:
            cov = coverage_sim(mu, np.sqrt(sigma2), n)
            print(n, sigma2, mu, ':', cov)

When I run the simulation, it outputs the values

Table 2: Frequentist coverage results for 1/σ² prior and various parameters. Values closer to 0.95 indicate better performance.

We can see that all the coverages are close to the targeted 95%, indicating the objective prior will perform well.

In fact, in this case the prior is exactly frequentist matching. See §5.1.1 of [1].

Now that we’ve seen how reference analysis works for the unconstrained normal distribution, let’s use it to produce a prior for a constrained normal distribution that incorporates the Bhatia–Davis inequality.

Reference analysis: constrained normal distribution

Suppose the mean is constrained to be between a and b and, following the Bhatia–Davis inequality, the variance is constrained by

σ² ≤ (b − µ)(µ − a).
As in the unconstrained case, the conditional prior for the variance,

π(σ²|µ) ∝ 1/σ²,

is improper, so we need to proceed with compact subsets. Put

where

Then

and

Put L = log ε_n. Then

where

and

When L → -∞, we have

Thus,

Constrained normal distribution posterior

Let’s now use the reference prior to derive an equation for the posterior. We have

where

and Γ(∙, ∙) denotes the incomplete gamma function

While the posterior, π(µ|y), may not lend itself to easy symbolic integration, numerical techniques let us perform operations on the distribution efficiently and to a high level of accuracy.

Note: See How to Approximate Functions with Adaptive Sparse Grids in Chebyshev Nodes for more background on such numerical methods.
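
To make the point concrete, here’s a rough sketch of a grid-based approach: tabulate an unnormalized density, normalize it numerically, and build approximate CDF and quantile functions from the table. The density unnormalized_pdf below is a hypothetical stand-in rather than the posterior derived above, and a uniform grid is far cruder than the adaptive Chebyshev approach linked in the note.

import numpy as np

def make_grid_distribution(unnormalized_pdf, a, b, num_points=10000):
    # Tabulate the unnormalized density on a uniform grid and build
    # approximate normalization, CDF, and quantile functions from it
    grid = np.linspace(a, b, num_points)
    vals = unnormalized_pdf(grid)
    dx = grid[1] - grid[0]
    z = vals.sum() * dx                 # approximate normalizing constant
    pdf = vals / z
    cdf = np.cumsum(vals) * dx / z
    ppf = lambda q: np.interp(q, cdf, grid)
    return pdf, cdf, ppf

# Hypothetical unnormalized posterior density on [0, 100]
unnormalized_pdf = lambda mu: np.exp(-0.5 * ((mu - 3.0) / 2.0) ** 2) * mu * (100 - mu)
pdf, cdf, ppf = make_grid_distribution(unnormalized_pdf, 0, 100)
print(ppf(0.5))  # approximate median of the normalized distribution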

Here’s a quick tour of how we can work with the distribution using the statistical software package bbai:

from bbai.model import BoundedNormal
import numpy as np

y = [1.2, 9.5, 6.3, 2.1] # some arbitrary data
a = 0   # lower bound on the mean
b = 10  # upper bound on the mean

# fit a distribution to the data
dist = BoundedNormal(a, b)
dist.fit(y)

print(dist.pdf(5))
    # The probability density function of the distribution at 5
    #    prints 0.267
print(dist.cdf(5))
    # The cumulative distribution function of the distribution at 5
    #    prints 0.551
print(dist.ppf(0.5))
    # The percentage point function of the distribution (inverse of the CDF)
    # at 50%
    #     prints 4.811

Validation

Imagine we’re a user looking for the right statistical tool for inference of the mean of normally distributed data, where the mean is constrained to be in [a, b] and the variance is bounded above by the Bhatia–Davis inequality.

Let’s suppose the statistical tools we’re considering all provide a common interface that takes the bounds [a, b], the observed data, a target percentage, and returns a posterior credible set:

# Sketch of an example statistical tool we're considering for inference
# of constrained normal data
def mean_credible_set(a, b, y, alpha):
    # ...
    # Let µ denote the unknown mean of the distribution associated with the
    # observations y
    #
    # Compute a posterior credible set, [low, high], such that
    #    π(low ≤ µ ≤ high|y) = alpha
    # ...
    return low, high

Let’s now imagine that we’re presented with multiple such candidate tools. They are available to us as a function that we can invoke, but we have no knowledge of their internal workings. How might we go about experimentally validating the tools to decide which is better?

Clearly, frequentist matching coverage is important. A tool that produces posterior credible sets that don’t match up with their frequentist coverage isn’t suitable. 

But frequentist matching can’t be the only metric. Consider this example:

import numpy as np

def mean_credible_set_bad(a, b, y, alpha):
    # A deliberately bad "tool": it ignores most of the information in the
    # data and simply returns the full interval [a, b] roughly <alpha>
    # percent of the time and the empty interval [a, a] otherwise.
    mean = np.mean(y)
    std = np.std(y)
    # The fractional part of the scaled statistic acts as a crude
    # pseudo-random draw that is approximately uniform on [0, 1)
    t = (y[0] - mean) / std
    t *= 100
    t -= int(t)
    t = np.abs(t)
    if t < alpha:
        return [a, b]
    else:
        return [a, a]

Clearly, mean_credible_set_bad isn’t a good tool for inference. But if we run a frequentist coverage simulation with it, we’ll get reasonable performance.

mu = 0.123
cov = 0
N = 1000
for _ in range(N):
    y = np.random.normal(loc=mu, scale=0.321, size=10)
    low, high = mean_credible_set_bad(-10, 10, y, 0.95)
    if low < mu and mu < high:
        cov += 1 / N
print(cov) # Printed 0.957

This example highlights that we also need to consider the size of the credible sets produced. A good inferential tool should both have frequentist matching coverage close to the target and produce credible sets of minimal size.

With these goals in mind, let’s write a function to benchmark candidate tools. We’ll run a simulation to measure both frequentist coverage and the average length of covering intervals.

Note: These metrics aren’t meant to be exhaustive, but they do measure what should be key requirements.

N = 10000
alpha = 0.95

def run_sim(f, a, b, mu_true, sigma2_true, n):
    # Measure frequentist coverage and the
    # average length of covering intervals.
    #
    # Ideally, f should produce intervals having coverage
    # close to <alpha> and minimal average length.
    res_cov = 0
    res_len = 0
    for _ in range(N):
        y = np.random.normal(loc=mu_true, scale=np.sqrt(sigma2_true), size=n)
        low, high = f(a, b, y, alpha)
        if low < mu_true and mu_true < high:
            res_cov += 1
            res_len += high - low
    if res_cov > 0:
        res_len /= res_cov  # average over the intervals that covered mu_true
    res_cov /= N
    return res_cov, res_len

Let’s look at the three candidates:

def chopped_t(a, b, y, alpha):
    # Take the posterior credible set from the unrestricted t distribution
    # and intersect it with the interval [a, b]
    #
    # The result will no longer be a true credible set, but we can still
    # benchmark its frequentist coverage and length
    n = len(y)
    mean = np.mean(y)
    s = np.std(y, ddof=1)
    s /= np.sqrt(n)

    dist = scipy.stats.t(df=n-1, loc=mean, scale=s)
    low = (1 - alpha) / 2
    high = 1 - low

    res_a = dist.ppf(low)
    res_a = max(res_a, a)

    res_b = dist.ppf(high)
    res_b = min(res_b, b)

    return res_a, res_b

def renormalized_t(a, b, y, alpha):
    # Take the unrestricted t distribution and then
    # restrict and renormalize it so that the posterior integrates 
    # to 1 over the interval [a, b]
    n = len(y)
    mean = np.mean(y)
    s = np.std(y, ddof=1)
    s /= np.sqrt(n)

    dist = scipy.stats.t(df=n-1, loc=mean, scale=s)
    low = (1 - alpha) / 2
    high = 1 - low

    norm = dist.cdf(b) - dist.cdf(a)

    low *= norm
    low += dist.cdf(a)
    high *= norm
    high += dist.cdf(a)

    return dist.ppf(low), dist.ppf(high)

def reference_t(a, b, y, alpha):
    # Use the reference prior and posterior we derived
    # for mean inference of the constrained normal distribution.
    from bbai.model import BoundedNormal
    model = BoundedNormal(a, b)
    model.fit(y)
    low = (1 - alpha) / 2
    high = 1 - low
    return model.ppf(low), model.ppf(high)
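
To tie things together, a driver loop along these lines runs the benchmark for all three candidates. The parameter grid below is illustrative rather than the exact one used for Table 3; the true variances are taken as fractions of the Bhatia–Davis bound so that the constraint is respected.

a, b = 0, 10
candidates = [chopped_t, renormalized_t, reference_t]

for n in [3, 5, 10, 20]:
    for mu_true in [0.5, 2.5, 5.0]:
        # Keep the true variance below the Bhatia-Davis bound mu * (10 - mu)
        bound = (mu_true - a) * (b - mu_true)
        for frac in [0.1, 0.5, 0.9]:
            sigma2_true = frac * bound
            row = []
            for f in candidates:
                cov, length = run_sim(f, a, b, mu_true, sigma2_true, n)
                row.append(f'{cov:.3f} ({length:.2f})')
            print(n, mu_true, round(sigma2_true, 3), ':', ' '.join(row))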

Running the simulation across a range of parameters with the bounds fixed to a=0, b=10, I get the result

Table 3: Frequentist coverage performance and average covering length (in parentheses) for the three methods chopped, renormalized, and reference, using normally distributed data with mean in [0, 10] and variance bounded above by µ(10 − µ). Coverage closer to 0.95 with smaller covering lengths indicates better performance.

Some observations

  1. Looking first at the entries for µ=5 and σ²=1, all three methods give similar performance. This is to be expected: with the variance small relative to the Bhatia–Davis bound and the mean far from either endpoint, the constraints have little effect and the methods should resemble inference for the unrestricted normal distribution.

  2. The chopped model consistently gives good coverage performance. By construction, its coverage should match the reference prior’s performance for the unrestricted normal distribution, which is exactly frequentist matching. However, it does a poor job of incorporating the constraints and often produces covering intervals significantly larger than the other two methods’.

  3. The reference method consistently produces the smallest covering intervals and usually has coverage close to the target 0.95. Where coverage is good, it can produce intervals that are 20%–30% smaller than the other methods’. For smaller n and variance close to the Bhatia–Davis limit, coverage starts to break down a bit, but in general it is good.


Discussion

Q1: Can we produce a better reference prior for this problem?

Possibly. Reference analysis provides a framework for producing objective priors; it doesn’t necessarily produce a single unique prior. 

The method that usually produces the best-performing prior is what Berger et al. call the one-at-a-time reference prior algorithm:

There is a more general reference prior method called the one-at-a-time reference prior algorithm. … This is, indeed, our recommended reference prior algorithm, since it is based on one-dimensional information matrices, which all perspectives suggest are optimal [2].

Berger, J., J. Bernardo, and D. Sun (2022). Objective Bayesian inference and its relationship to frequentism.

I produced the prior using the algorithm from Proposition 0.2 of [2], which is based on an asymptotic normal approximation. For many problems the two methods result in the same prior, but they can differ, and in such cases the one-at-a-time algorithm is usually better.

While the one-at-a-time algorithm is often impractical, I suspect that for this problem it’s workable, so perhaps it could be used to produce a better prior.

Q2: What’s the justification for the likelihood approximation you used?

Letting y denote the observed independent, identically distributed values from an unknown distribution with finite variance, the likelihood approximation I used models the individual observations, y_i, as normally distributed.

I don’t believe the approximation has a good justification and I’m skeptical that this is the best way to apply the central limit theorem. William Gosset, aka Student, who popularized the t distribution, makes clear that the normality assumption used when deriving the t distribution is more a practicality than something that is justified by the central limit theorem:

It is usual, however, to assume a normal distribution, because, in a very large number of cases, this gives an approximation so close that a small sample will give no real information as to the manner in which the population deviates from normality: since some law of distribution must be assumed it is better to work with a curve whose area and ordinates are tabled, and whose properties are well known. This assumption is accordingly made in the present paper, so that its conclusions are not strictly applicable to populations known not to be normally distributed; yet it appears probable that the deviates from normality must be very extreme to lead to serious error ([3]).

William Gosset (aka Student), The Probable Error of a Mean

Q3: Why did you use the likelihood approximation if you don’t think it is well justified?

Even if the particular likelihood approximation I used is somewhat dubious, I think the process of using reference analysis to incorporate the Bhatia–Davis inequality is still valid and could be adapted to other likelihood approximations. Experimenting with the crude approximation I used and showing how the approach performs is still a step forward.

Conclusion

Statistical methods should provide us with greater insight than what we can achieve with simple back-of-the-envelope reasoning. When there’s a mismatch, something is clearly wrong and we should rework our methods.

A thermometer, a scale, a tape measure: all of these instruments have a minimum and maximum value, and the data they produce is naturally bounded.

Using a t distribution to model data that’s naturally bounded is a method that presents such a clash. The simple work-around of restricting the distribution and renormalizing doesn’t fully resolve the problem, as it doesn’t incorporate the bound imposed on the variance by the Bhatia–Davis inequality.

With reference analysis, we saw how to fix the issue by taking the likelihood function and deriving a prior for the constrained parameter space. Experimental validation shows that the approach performs well and can produce smaller, more tailored posterior credible sets than alternatives like chopping and restriction with renormalization.

References

[1]: Berger, J., J. Bernardo, and D. Sun (2024). Objective Bayesian Inference. World Scientific.

[2]: Berger, J., J. Bernardo, and D. Sun (2022). Objective Bayesian inference and its relationship to frequentism.

[3]: Student (1908). The probable error of a mean. Biometrika 6(1), 1–25.