How to Approximate Functions with Adaptive Sparse Grids in Chebyshev Nodes

Use Chebyshev interpolants and adaptive sparse grids to efficiently approximate smooth functions of multiple variables

An adaptive sparse grid in Chebyshev-Gauss-Lobatto nodes.

Consider this approximation problem: Suppose you have a function f(x), x ∈ [-1, 1]^p, that is expensive to evaluate, and you'd like to build a new function f’ that is cheaper to evaluate but accurately approximates f on [-1, 1]^p, using as few evaluations of f as possible.

Approximation like this has many applications across statistics and other fields. f might, for example, represent a probability density function where evaluation requires an expensive integration; an efficient approximation then gives you a mechanism to compute credible sets.

In this blog post, I’m going to cover two powerful techniques for building function approximations: Chebyshev interpolants and adaptive sparse grids.

Chebyshev interpolants are a basic building block for approximation. If f is a little bit smooth (e.g. Lipschitz continuous), then its Chebyshev interpolants will converge; if f is differentiable to a high order or analytic, its Chebyshev interpolants will converge extremely rapidly (much better than other techniques such as cubic splines) [1, 2].

Combining Chebyshev interpolants with adaptive sparse grids gives an efficient mechanism to approximate smooth functions of several variables.

Let’s start by looking at the single dimension case.

Approximation of a single variable function

There’s an unfortunate misconception, often repeated, that high degree polynomial interpolants are problematic (see myth 1 of [1]). In fact, high degree polynomial interpolants are often the best tool for the job; we just need to choose the interpolation points correctly.

While it’s true that interpolation in equispaced points can be disastrous, interpolation in Chebyshev nodes has excellent convergence properties.

Figure 1: Example of Chebyshev nodes on the interval [-1, 1]. The nodes correspond to equispaced points on a semicircle and hence have greater density towards the endpoints of the interval.
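For reference, here’s a minimal sketch of how one might generate these points with NumPy, following the x_j = cos(jπ/n) convention of [2] (the helper name is just for illustration):

import numpy as np

def chebyshev_points(n):
    # Chebyshev points x_j = cos(j*pi/n), j = 0, ..., n: the projections onto
    # [-1, 1] of equispaced points on the upper half of the unit circle
    j = np.arange(n + 1)
    return np.cos(j * np.pi / n)

print(np.round(chebyshev_points(4), 3))
  # approximately [1, 0.707, 0, -0.707, -1]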

Example 1: Runge function

Consider the Runge function

f(x) = 1 / (1 + 25 x²).

Equispaced interpolants are notoriously bad at approximating this function (see Runge’s phenomenon). But as we see, Chebyshev interpolants handle the function just fine.

Figure 2: Chebyshev interpolants for the Runge function with 9, 17, and 33 points.
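As a quick sanity check, here is a sketch using NumPy’s built-in Chebyshev utilities (which interpolate at Chebyshev points of the first kind rather than the points pictured above; the convergence behavior is the same). We interpolate the Runge function and watch the maximum error fall as points are added:

import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def runge(x):
    return 1 / (1 + 25 * x**2)

xs = np.linspace(-1, 1, 2001)
for n_points in [9, 17, 33, 65]:
    # interpolant of degree n_points - 1 through n_points Chebyshev points
    p = Chebyshev.interpolate(runge, n_points - 1)
    print(n_points, np.max(np.abs(p(xs) - runge(xs))))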

Chebyshev interpolants are robust. If f is Lipschitz continuous on [-1, 1], its Chebyshev interpolants will converge (see 3.1 of [2]).

And for smooth functions, Chebyshev interpolants converge rapidly. If f has v derivatives with the vth derivative having bounded variation V, then

‖f − p_n‖ ≤ 4V / (π v (n − v)^v) = O(n^−v) for n > v,

where p_n denotes the Chebyshev interpolant of f with n points and ‖·‖ denotes the maximum error over [-1, 1].

When f is analytic, convergence is even better:

‖f − p_n‖ ≤ 4M ρ^−n / (ρ − 1) = O(ρ^−n)

for some ρ > 1, where M bounds |f| on a neighborhood of [-1, 1] in the complex plane (a Bernstein ellipse with parameter ρ) [see theorems 7.2 and 8.2 in 2].

The convergence rate of Chebyshev interpolants is far better than what you can expect from something like cubic splines, where the rate is typically O(n^-4) [see 3]. As Lloyd Trefethen writes:

The introduction of splines is a red herring: the true advantage of splines is not that they converge where polynomials fail to do so, but that they are more easily adapted to irregular point distributions and more localized [2, p. 104]

Handling Functions of Multiple Variables

There’s a natural extension of Chebyshev interpolants to multiple dimensions. Let X^n denote the set of n Chebyshev nodes on [-1, 1]. Then the Cartesian product

X^n × ··· × X^n (p times)

determines a unique multivariate Chebyshev interpolant p_{n^p}. As with the single dimension case, the multivariate Chebyshev interpolant will approximate f well for n large enough; however, the grid requires n^p evaluations of f, so this becomes very expensive very quickly. The naive way of extending approximation to multiple dimensions is not very efficient. Fortunately, we can do much better.
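To get a feel for how quickly the naive approach blows up, consider asking for 33 nodes per dimension (the largest count in Figure 2) across 10 dimensions:

# Evaluations needed for a dense tensor-product grid with n nodes per dimension
n, p = 33, 10
print(n ** p)
  # prints a number on the order of 1.5e15 -- hopelessly expensive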

Put

m_1 = 1 and m_i = 2^{i-1} + 1 for i > 1,

and let X^i denote the set of m_i points

x_j^i = -cos( π (j - 1) / (m_i - 1) ), j = 1, …, m_i (with x_1^1 = 0).

Then the X^i define a nested sequence of point sets called the Chebyshev-Gauss-Lobatto nodes and serve as a basic building block for more efficient interpolation [8].

Figure 3: Chebyshev-Gauss-Lobatto nodes for X^1 to X^5.
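Here’s a minimal sketch of the construction (using the level sizes m_1 = 1, m_i = 2^{i-1} + 1 from above; the helper name is illustrative):

import numpy as np

def cgl_level(i):
    # Chebyshev-Gauss-Lobatto nodes at level i
    if i == 1:
        return np.array([0.0])
    m = 2 ** (i - 1) + 1
    j = np.arange(m)
    return -np.cos(np.pi * j / (m - 1))

for i in range(1, 6):
    # each level contains every node of the previous level, so the sets are nested
    print(i, np.round(cgl_level(i), 3))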

Let ψ_j^i denote the unique (m_i - 1) degree polynomial where

ψ_j^i(x_k^i) = 1 if k = j and 0 otherwise.

Let V^i denote the vector space spanned by the basis functions {ψ_j^i} for j = 1, …, m_i.

Figure 4: Example basis functions
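For example, here is one way to construct a ψ_j^i numerically, as the Lagrange cardinal polynomial that equals 1 at the j-th node and 0 at the others (a sketch using scipy; in practice you’d use a barycentric form for stability):

import numpy as np
from scipy.interpolate import lagrange

def nodal_basis(nodes, j):
    # The unique polynomial of degree len(nodes) - 1 equal to 1 at nodes[j]
    # and 0 at every other node
    values = np.zeros(len(nodes))
    values[j] = 1.0
    return lagrange(nodes, values)

nodes = -np.cos(np.pi * np.arange(5) / 4)   # the 5 Chebyshev-Gauss-Lobatto nodes of X^3
psi = nodal_basis(nodes, 2)
print(psi(nodes))
  # approximately [0, 0, 1, 0, 0]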

Put ΔX⁰ = {}, ΔX^i = X^i \ X^{i-1}, and

ΔV^i = span{ ψ_j^i : x_j^i ∈ ΔX^i }.

We’ll look to approximate functions using elements of the vector spaces

W^I = Σ_{i∈I} ΔV^{i_1} ⊗ ··· ⊗ ΔV^{i_p},

where I = {i : i_1 + ··· + i_p ≤ r} and r is some integer controlling the level of refinement of the approximation. The value of f at the points

X^I = ∪_{i∈I} ΔX^{i_1} × ··· × ΔX^{i_p}

will uniquely determine an interpolant in W^I.

Figure 5: Sparse grids for r=3 to r=8.
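Here’s a sketch of how the point sets come together in two dimensions: take the union of the products ΔX^{i_1} × ΔX^{i_2} over the index set. (This is just to enumerate and count the points, not an efficient implementation; level sizes are as above.)

import numpy as np
from itertools import product

def cgl_level(i):
    # Chebyshev-Gauss-Lobatto nodes at level i
    if i == 1:
        return np.array([0.0])
    m = 2 ** (i - 1) + 1
    return -np.cos(np.pi * np.arange(m) / (m - 1))

def delta_nodes(i):
    # Nodes added at level i: Delta X^i = X^i \ X^{i-1}
    if i == 1:
        return cgl_level(1)
    prev = {round(float(x), 12) for x in cgl_level(i - 1)}
    return np.array([x for x in cgl_level(i) if round(float(x), 12) not in prev])

def sparse_grid_points(r, p=2):
    # Union of Delta X^{i_1} x ... x Delta X^{i_p} over indices with i_1 + ... + i_p <= r
    points = set()
    for index in product(range(1, r + 1), repeat=p):
        if sum(index) > r:
            continue
        for point in product(*(delta_nodes(i) for i in index)):
            points.add(tuple(round(float(c), 12) for c in point))
    return points

for r in range(3, 9):
    print(r, len(sparse_grid_points(r)))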

Sparse grids were originally discovered by the Russian mathematician Sergey Smolyak [4] and used to develop more efficient quadrature rules for numerical integration. Suppose a function f of dimension p is v times differentiable (with bounded mixed derivatives). Using a dense quadrature rule, the integration error can be expressed as

O( N_l^{-v/p} ),

where N_l is the number of evaluation points. When a sparse grid is used, the error improves to

O( N_l^{-v} · (log N_l)^{(p-1)(v+1)} ).

Similar improvements in performance carry over to interpolation as well [5].

Adaptive Approximation

While sparse grids are much more efficient for approximating multivariable functions than dense grids, they are still far from optimal. In most real-world problems, a function will require more evaluation points in certain areas than others to reach a target accuracy. A couple of modifications make Smolyak sparse grids adaptive and the algorithm even more efficient.

First, consider the index set, I. Instead of requiring that I = {i : i_1 + ··· + i_p ≤ r} for some r, we can impose the weaker requirement that the index set need only be admissible: if i ∈ I and i_k > 0, then i - e_k ∈ I, where e_k denotes the unit vector with (e_k)_k = 1. Relaxing the requirement to admissibility allows certain dimensions of the grid to have greater refinement than others.

Next, by tracking the difference between successive refinements, we can estimate the localized error of the approximation and adapt the sparse grid algorithm to add refinement points only in regions of the domain that haven’t yet reached the desired target accuracy.

We can combine these two modifications into a greedy algorithm that continually refines the sparse grid in the regions where it approximates the function least well, until every region has reached the desired accuracy (see [7] for a detailed description of such an algorithm).
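To make the first modification concrete, here is a small sketch of the index-set bookkeeping: a check that a set of multi-indices is admissible, and a helper that lists the forward neighbors we’re allowed to add next. (The full algorithm of [7] additionally attaches an error estimate, computed from the hierarchical surpluses, to each candidate and greedily adds the worst one; the index set below is just an example.)

def is_admissible(index_set):
    # I is admissible if for every i in I with i_k > 0, the backward
    # neighbor i - e_k is also in I
    I = set(index_set)
    for i in I:
        for k in range(len(i)):
            if i[k] > 0:
                backward = i[:k] + (i[k] - 1,) + i[k + 1:]
                if backward not in I:
                    return False
    return True

def admissible_forward_neighbors(index_set):
    # Candidate indices that can be added while keeping the set admissible
    I = set(index_set)
    candidates = set()
    for i in I:
        for k in range(len(i)):
            forward = i[:k] + (i[k] + 1,) + i[k + 1:]
            if forward not in I and is_admissible(I | {forward}):
                candidates.add(forward)
    return candidates

I = {(0, 0), (1, 0), (2, 0), (0, 1)}
print(is_admissible(I))
  # prints: True
print(sorted(admissible_forward_neighbors(I)))
  # prints: [(0, 2), (1, 1), (3, 0)]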

Let’s look at how this works on a few examples. We’ll use the Python package bbai to fit adaptive sparse grid interpolants.

Example 2: Localized Adaptivity

Consider this function (taken from [7]):

f(x, y) = 1 / ( |0.3 - x² - y²| + 0.1 ).

The function has a sharp ridge along the arc x² + y² = 0.3, with a maximum value of 10. Let’s fit a sparse grid to approximate it.

import numpy as np

def f(x, y):
    t = np.abs(0.3 - x*x - y*y)
    return 1/(t + 0.1)

# Fit a sparse grid to approximate f
from bbai.numeric import SparseGridInterpolator
ranges = [(0, 1), (0, 1)]
interp = SparseGridInterpolator(tolerance=1.0e-2, ranges=ranges)
interp.fit(f)

# Test the accuracy at some random points
N = 100
np.random.seed(0)
sample_pts = np.random.uniform(0, 1, size=(N, 2))
errs = [np.abs(interp.evaluate(x, y) - f(x, y)) for x, y in sample_pts]
print('mean_error: ', np.mean(errs))  
  # prints:
  #   mean_error: 0.0019457748570626515
print('max_error: ', np.max(errs))
  # prints:
  #   max_error: 0.029750550242468213

We can visualize the constructed grid by overlaying the interpolation points on a contour plot for the function. As we can see, the algorithm adaptively adds more interpolation points in the region surrounding the ridge.

Figure 6: Sparse grid to approximate the function {|0.3-x^2 - y^2| + 0.1}^-1 for 0<=x<=1, 0<=y<=1.

Example 3: Adaptivity by Dimension

Next let’s look at a function where one dimension is much more influential than the other:

f(x, y) = exp{ -2 (x - 0.5)² - 0.01 (y - 0.5)² }.

We’ll again fit the function and test it at some random points.

import numpy as np

def f(x, y):
    dx = x - 0.5
    dy = y - 0.5
    mult = 0.01
    return np.exp(-2 * dx*dx - mult * dy * dy)

# Fit a sparse grid to approximate f
from bbai.numeric import SparseGridInterpolator
ranges = [(0, 1), (0, 1)]
interp = SparseGridInterpolator(tolerance=1.0e-5, ranges=ranges)
interp.fit(f)

# Test the accuracy at some random points
N = 100
np.random.seed(0)
sample_pts = np.random.uniform(0, 1, size=(N, 2))
errs = [(np.abs(interp.evaluate(x, y) - f(x, y))) / f(x, y) for x, y in sample_pts]
print('mean_error: ', np.mean(errs))
  # prints:
  #   mean_error: 3.265838801494769e-10
print('max_error: ', np.max(errs))
  # prints:
  #   max_error: 1.7325188938498375e-09

Plotting the grid points, we can see that the grid is more refined in the x dimension than in the y dimension.

Figure 7: Sparse grid to approximate the function exp{-2(x-0.5)^2 - 0.01(y-0.5)^2} for 0<=x<=1, 0<=y<=1

Example 4: Fitting a Gaussian of 10 Dimensions

Let’s look at what happens when we try to fit a function of much higher dimensionality.

# Define a Gaussian of 10 dimensions
import numpy as np
p = 10
mult = np.array([0.2699884 , 0.35183688, 0.296529  , 0.26805488, 0.20841667,
       0.31774713, 0.21527071, 0.43870707, 0.47407319, 0.18863377])
offsets = np.array([0.79172504, 0.52889492, 0.56804456, 0.92559664, 0.07103606,
       0.0871293 , 0.0202184 , 0.83261985, 0.77815675, 0.87001215])
def f(*args):
    n = len(args[0])
    t = np.zeros(n)
    for j, arg in enumerate(args):
        delta = arg - offsets[j]
        t -= mult[j] * delta * delta
    return np.exp(t)

Let’s measure how long it takes to fit this function.

# Fit function
from bbai.numeric import SparseGridInterpolator
domain = [(0, 1)]*p
interp = SparseGridInterpolator(tolerance=1.0e-4, ranges=domain)
import time
t1 = time.time()
interp.fit(f)
t2 = time.time()
print('took {} seconds to fit sparse grid'.format(t2 - t1))
   # Prints
   #   took 1.0177533626556396 seconds to fit sparse grid
print('num_points =', interp.points.shape[1])
   # Prints
   #  num_points = 13256

It only takes a second on my computer to fit the function and we only needed ~13k points. Let’s test the interpolant for accuracy.

# Test accuracy at some random points
N = 100
X = np.random.uniform(size=(N, p))
args = [X[:, j] for j in range(p)]
f_true = f(*args)
f_approx = interp.evaluate(*args)
errs = np.abs((f_true - f_approx) / f_true)
print('mean_error:', np.mean(errs))
   # Prints
   #   mean_error: 1.2278939934746143e-05
print('max_error:', np.max(errs))
   # Prints
   #   max_error: 7.69422569392403e-05

We see that the fit is quite accurate. Of course not every higher dimensional function will be as easy to handle as a Gaussian, but this shows that adaptive sparse grids can scale to dimensionalities that would quickly become problematic for a dense grid.

You can see this and some other examples in this Jupyter notebook.

An Application to Statistics

Consider the function

f(µ_ML, σ_ML, t) = E[ 1/2 - (1/π) arctan( (X̄ - t) / s_X ) ],

where

X̄ = (X_1 + X_2) / 2, s_X² = ( (X_1 - X̄)² + (X_2 - X̄)² ) / 2,

and X_1 and X_2 are independent normally distributed random variables with mean µ_ML and standard deviation σ_ML.

For context, this function comes up when computing default Bayes factors for hypothesis testing of the mean of normally distributed data with unknown variance. See the blog post Introduction to Objective Bayesian Hypothesis Testing and the reference [9] for more details.

Clearly, the function lends itself to easy approximation via sampling; but with Chebyshev interpolants, we can engineer something more accurate and efficient.

Some algebraic manipulation will simplify f to

f(µ_ML, σ_ML, t) = 1/2 - (1/π) E[ arctan( (z̄ + δ) / s_z ) ],

where

z_i = (X_i - µ_ML) / σ_ML and δ = (µ_ML - t) / σ_ML.

Put

z̄ = (z_1 + z_2) / 2 and s_z² = ( (z_1 - z̄)² + (z_2 - z̄)² ) / 2.

Then

(X̄ - t) / s_X = (z̄ + δ) / s_z.

From Cochran’s theorem, it follows that √2 z̄ follows a standard normal distribution, that 2 s_z² follows a chi-squared distribution with 1 degree of freedom, and that the two are independent. Thus,

(z̄ + δ) / s_z = ( √2 z̄ + √2 δ ) / √(2 s_z²)

follows a noncentral t distribution with noncentrality parameter √2 (µ_ML - t) / σ_ML.

Put

G(t) = E[ arctan(Z) ],

where Z denotes a random variable from a noncentral t distribution (with 1 degree of freedom) and noncentrality parameter t. Given G, it’s trivial to compute f:

f(µ_ML, σ_ML, t) = 1/2 - G( √2 (µ_ML - t) / σ_ML ) / π.

Here’s the plan to approximate G (and hence f):

  1. We can numerically evaluate G at any given point to a high level of accuracy using two double exponential exp-sinh quadrature rules [10] to the left and right of t.

  2. We’ll use a Chebyshev interpolant to approximate G. Since G is an odd function, we need only approximate G on [0, ∞). We can easily turn this into a finite range with the transformation φ(s) = s/(1-s), taking G(φ(1)) = π/2 as the limiting value [11, p. 326].

  3. We’ll convert the interpolant to an equivalent representation as a Chebyshev series [2, p. 13] and then chop the coefficients that are smaller than a minimal threshold [12]. This gives us an approximation that is even more efficient to evaluate.
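Here’s a rough sketch of what those three steps can look like with off-the-shelf tools: scipy’s general-purpose quad in place of the exp-sinh rules of [10] (splitting the integral to the left and right, as in step 1), a closed-form expression for the 1-degree-of-freedom noncentral t density (which follows from the representation above), and NumPy’s Chebyshev utilities for the fit. The helper names, the interpolation degree, and the chopping threshold are all illustrative rather than the values used to produce the coefficients below.

import numpy as np
from scipy.integrate import quad
from scipy.special import ndtr   # standard normal CDF

def nct1_pdf(z, delta):
    # Density of a noncentral t variable with 1 degree of freedom and noncentrality
    # delta, in a form that stays numerically stable for large delta
    a = 1.0 + z * z
    return (np.exp(-0.5 * delta * delta) / a
            + np.exp(-0.5 * delta * delta / a) * (delta * z / a)
            * np.sqrt(2 * np.pi / a) * ndtr(delta * z / np.sqrt(a))) / np.pi

def G(delta):
    # Step 1: E[arctan(Z)], integrating to the left and right of the noncentrality
    integrand = lambda z: np.arctan(z) * nct1_pdf(z, delta)
    return quad(integrand, -np.inf, delta)[0] + quad(integrand, delta, np.inf)[0]

def G_mapped(t):
    # Step 2: G is odd, so approximate on [0, inf); map [-1, 1] -> [0, 1] -> [0, inf)
    s = (t + 1) / 2
    return np.pi / 2 if s >= 1 else G(s / (1 - s))

# Steps 2-3: build the Chebyshev interpolant, then chop small trailing coefficients
approx = np.polynomial.chebyshev.Chebyshev.interpolate(np.vectorize(G_mapped), deg=40)
keep = np.nonzero(np.abs(approx.coef) > 1e-10)[0].max() + 1
print(len(approx.coef), '->', keep, 'coefficients after chopping')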

Let’s try this out and test the accuracy against the simpler simulation approach.

import numpy as np

# These are the coefficients of the approximation when represented as
# a truncated series of Chebyshev polynomials (of the first kind)
coefs = [
    0.77143906574069909,     0.85314572098378538,     0.024348685879360281,
    -0.080216391111436719,   -0.016243633646524293,   0.014244249784927322,
    0.0083546074842089004,   -0.0013585115546325592,  -0.0032124111873301194,
    -0.00091825774110923682, 0.0007309343106888075,   0.00071403856022216007,
    9.4913853419609061e-05,  -0.00023489116699729724, -0.00017729416753392774,
    -8.6319144348730995e-06, 7.1368665041116644e-05,  5.0436256633845485e-05,
    1.8715564507905244e-06,  -2.1699237167998914e-05, -1.6200449174481386e-05,
]

def g(s):
    # Invert the mapping to go from the domain [0, ∞] -> [-1, 1]
    t = s / (1 + s)
    t = 2 * t - 1

    # evaluate the approximation at t
    return np.polynomial.chebyshev.chebval(t, coefs)

# Approximate f using Chebyshev polynomials
def f(mu, sigma, t):
    t = np.sqrt(2) * (mu - t) / sigma
    mult = 1.0
    if t < 0:
        mult = -1.0
        t = -t
    return 0.5 - mult * g(t) / np.pi

# Approximate f using a brute force simulation approach
N = 1000000
def f_sim(mu, sigma, t):
    vals = np.random.normal(mu, sigma, size=(N, 2))
    X1 = vals[:, 0]
    X2 = vals[:, 1]
    xbar = (X1 + X2) / 2.0
    sx = np.sqrt(((X1 - xbar)**2 + (X2 - xbar)**2) / 2.0)
    vals = 0.5 - np.arctan((xbar - t) / sx) / np.pi
    return np.mean(vals)


# Compare the approximations against each other
mu_ml = 0.123
std_ml = 1.5
np.random.seed(0)
for t in [0.0, 0.01, 0.1, 0.2, 0.3, 5, 10, 100]:
    ft = f(mu_ml, std_ml, t)
    fpt = f_sim(mu_ml, std_ml, t)
    rerr = np.abs((ft - fpt) / fpt)
    print(t, f(mu_ml, std_ml, t), f_sim(mu_ml, std_ml,t), rerr)

Prints

0.0 0.47059206926744995 0.4705617410005234 0.0006829181799177236
0.01 0.47297735696592846 0.47284440543778317 0.00044800674755964237
0.1 0.49449429024483676 0.4947273411923894 0.0001393914288991927
0.2 0.518426675093887 0.5184967680183266 2.2982920448911075e-06
0.3 0.5422530672885648 0.5421840161788148 0.0008338542554252333
5 0.943798414491211 0.9438468766277546 7.809732402546117e-05
10 0.9726189450498345 0.9726397857047364 4.339034701734454e-05
100 0.9973033311969053 0.9973052128447453 1.2957899605273032e-06

We can see that the two approximation methods differ by only fractions of a percent. See this Jupyter notebook for the full example.

Discussion

Q1: Is interpolation in Chebyshev nodes the most accurate way to approximate a smooth function?

No. But in practice, it may be the best choice, and more accurate algorithms often do not yield much of an improvement.

Other systems of polynomials such as Legendre polynomials [2, p. 21] lead to interpolants with similar accuracy to those of Chebyshev polynomials. The major advantage of Chebyshev polynomials is that one can use a Fast Fourier Transform to efficiently convert between interpolant values and coefficients in the basis expansion.
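To illustrate that point (a small sketch; the scaling here follows the conventions of [2]): given values at the Chebyshev points x_j = cos(jπ/n), a type-I discrete cosine transform recovers the coefficients of the interpolant in O(n log n) operations.

import numpy as np
from scipy.fft import dct

def chebyshev_coefficients(values):
    # Map values at x_j = cos(j*pi/n), j = 0, ..., n, to the coefficients c_k of
    # the interpolant p(x) = sum_k c_k T_k(x) using a type-I DCT
    n = len(values) - 1
    c = dct(values, type=1) / n
    c[0] /= 2
    c[-1] /= 2
    return c

def f(x):
    return np.exp(x) * np.sin(5 * x)

n = 20
x = np.cos(np.arange(n + 1) * np.pi / n)
c = chebyshev_coefficients(f(x))

# The recovered series matches f at the nodes (up to rounding) ...
print(np.max(np.abs(np.polynomial.chebyshev.chebval(x, c) - f(x))))
# ... and approximates f closely in between
xs = np.linspace(-1, 1, 1001)
print(np.max(np.abs(np.polynomial.chebyshev.chebval(xs, c) - f(xs))))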

The Remez algorithm can be used to produce a more accurate approximating polynomial; but as Trefethen notes [Myth 3 of 1], for analytic and smooth functions the best approximation will not yield a large improvement in accuracy over Chebyshev interpolants.

Q2: Do adaptive sparse grids solve the “curse of dimensionality”?

I don’t find the “curse of dimensionality” to be the most useful framework for thinking about many problems.

Adaptive sparse grids avoid the exponential growth in evaluation points that dense grids require to achieve the same accuracy as dimensionality increases; but they still require more than a linear increase in evaluation points, so approximation problems of high enough dimension will eventually become too expensive.

The important takeaway is that adaptive sparse grids are a great tool for solving many practical problems, and they can handle dimensionalities that would be infeasible for a dense grid.

Conclusion

As a rough mental model, I like to think of three levels for solutions to numerical problems.

Level 1: Exact answers that can be expressed as an integer, a rational value, or a symbolic expression in terms of known constants or special functions.

Level 2: Answers that are so close to the real value that for practical purposes you can treat them as if they are exact. Examples include invoking standard numerical functions (e.g. tan(1)) or applying BLAS-LAPACK operations.

Level 3: Answers that are close enough to be useful but where there is a notable difference from the real solution. This might include approximating a sum of independent identically distributed random variables with finite variance by a normal distribution or many nondeterministic sampling algorithms.

Especially in statistics, I feel too much emphasis is placed on Level 1 and Level 3 methods. A problem is often not considered “solved” unless it admits a Level 1 solution; and frequently unjustified assumptions, such as conjugate priors, are made so that answers can be expressed neatly in Level 1 form. To an end user, though, software that provides a Level 2 solution is just as good as a Level 1 answer; and as we saw with the hypothesis testing example, modern numerical techniques such as high-precision quadrature rules, sparse grids, and Chebyshev interpolants let us provide deterministic, efficient Level 2 solutions to many common statistical problems that don’t lend themselves to an exact solution.

References

[1]: Trefethen, Lloyd N. (2011). Six myths of polynomial interpolation and quadrature. Mathematics Today 47.

[2]: Trefethen, Lloyd N. (2019). Approximation Theory and Approximation Practice, Extended Edition. Society for Industrial and Applied Mathematics.

[3]: Kershaw, D. (1971). A Note on the Convergence of Interpolatory Cubic Splines. SIAM Journal on Numerical Analysis 8(1), 67–74. http://www.jstor.org/stable/2949522

[4]: Smolyak, S. A. (1963). Quadrature and interpolation formulas for tensor products of certain classes of functions. Dokl. Akad. Nauk SSSR 148, 1042–1043. English translation: Soviet Math. Dokl. 4, 240–243.

[5]: Garcke, Jochen (2012). Sparse Grids in a Nutshell. In Garcke, Jochen and Griebel, Michael (eds.), Sparse Grids and Applications. Springer, pp. 57–80. ISBN 978-3-642-31702-6.

[6]: Gerstner, Thomas and Griebel, Michael (2003). Dimension-Adaptive Tensor-Product Quadrature. Computing 71, 65–87.

[7]: Jakeman, John D. and Roberts, Stephen G. (2011). Local and Dimension Adaptive Sparse Grid Interpolation and Quadrature.

[8]: Klimke, Andreas (2006). Uncertainty Modeling using Fuzzy Arithmetic and Sparse Grids. PhD thesis, Universität Stuttgart, pp. 40–41.

[9]: Berger, J. and Mortera, J. (1999). Default Bayes factors for nonnested hypothesis testing. Journal of the American Statistical Association 94(446), 542–554.

[10]: Takahasi, Hidetosi and Mori, Masatake (1974). Double Exponential Formulas for Numerical Integration. Publications of the Research Institute for Mathematical Sciences 9(3), 721–741.

[11]: Boyd, John P. (2000). Chebyshev and Fourier Spectral Methods. Dover.

[12]: Aurentz, Jared L. and Trefethen, Lloyd N. (2017). Chopping a Chebyshev Series. ACM Transactions on Mathematical Software 43(4).