This semester I am a graduate-student instructor for Math 1B — the second half of a first-year calculus sequence. Students attend three hours a week of large lectures with 400 classmates, and also have three hours of GSI-led "section". We get remarkable freedom with our classes: I write my own quizzes and worksheets, and plan my own material. I spend a lot of my time with "big picture" material: I feel like it's not that valuable for the students to watch me solve problems at the chalkboard. Or rather, it's very valuable to do so in an interactive setting, but I have yet to figure out how to make an interactive class with twenty to thirty students. In my office hours I often have closer to six students, and there we work through homework problems together.
We just finished a unit on techniques with which to approximate an integral, and it is about these that I would like to tell you. Although elementary, approximation techniques connect with both advanced mathematics and deep philosophical questions.
Let me begin with the latter. Why do we do calculus? Certainly, few mathematicians will ever use much calculus in their later work. But physicists and engineers and statisticians do: indeed, anyone whose job involves processing numerical data will use calculus for every calculation, even if it seems like she's just asking the computer to do some magic.
So calculus, and, indeed, arithmetic, is primarily a tool for dealing with "physical numbers": measurements and quantities and values. What's interesting about numbers in the real world is that they are never precise: physical numbers are "fat" in the sense that they come (or ought to come) equipped with error terms. Fat numbers cannot be added or multiplied on the nose — errors can propagate, although they also tend to cancel out — and fat numbers do not form a group (the number "exactly zero" is not physical). The universe has a built-in uncertainty principle, not because of quantum mechanics, but simply because we are limited beings, only able to make finitely many measurements. But a "real number" in the mathematicians' sense has infinitely much data in it, and so can never exist in the real world.
Much of the time, the hardest parts of physical calculations consist of evaluating integrals. We have a (fat) function (which may approximate an honestly "true" function, or may be a convention that we are using to think about messy data), usually consisting of a collection of data-points, which we would like to integrate over some interval. Riemann tells us how to do this: divide the interval into a series of small subintervals, thereby dividing the area under the curve into a series of small regions; estimate the height of the function on each small region; multiply each height by the width of its region and add up the resulting areas.
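To make the procedure concrete, here is a minimal sketch in Python; the function name riemann_sum and the height callback are my own notation, not anything standard.

    # A minimal sketch of the procedure just described: split [a, b] into n
    # subintervals and add up width * (estimated height) for each one.
    def riemann_sum(f, a, b, n, height):
        """Approximate the integral of f over [a, b] using n subintervals.

        height(f, left, right) estimates the height of f on one subinterval.
        """
        width = (b - a) / n
        total = 0.0
        for i in range(n):
            left = a + i * width
            right = left + width
            total += width * height(f, left, right)
        return total

    def left_height(f, left, right):
        # One possible choice of height estimate: the value at the left endpoint.
        return f(left)

    print(riemann_sum(lambda x: x * x, 0.0, 1.0, 100, left_height))  # ~0.3283, versus the true 1/3

Everything below amounts to making a smarter choice of that height estimate.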
We are now left with the difficult task of estimating the heights of the regions. In some sense, our estimates don't matter: provided our function is sufficiently nice (and if it isn't, we have no right to call it a "function"), then all possible procedures of estimates will eventually converge to the same value as the number of regions gets large. But some methods converge faster than others, and we need to be able to calculate this integral in a reasonable amount of time: neither humans nor computers can make truly large numbers of calculations. Much better would be if we picked a method for which we could bound the number of calculations needed to get within an allowable error.
One easy procedure is to use the height of the function at the left-hand side of the region. To investigate this, let's do a simple example: let's integrate
\int_0^1 x dx
By algebra, we know that this area is actually 1/2. If we use the "left-hand rule" for estimating heights, treating all of [0,1] as a single region, the height at the left endpoint is 0, so we get an estimate of the total area equal to 0. Our error is 1/2.
Why did I start with this function to test the rule? Let's say I had any linear function y=mx+b. Then the "+b" part cannot affect the error — the error depends on first (and higher) derivatives, and y=x is the simplest function with a nonzero first derivative.
By multiplying by m and using linearity, the error from using the left-hand rule in evaluating
\int_0^1 (mx + b) dx
will be m/2: error estimates must always scale linearly with the derivative on which they depend. And integrating over [a,b]? Holding m constant but scaling [0,1] to [a,b] requires scaling both the x- and y-ranges, so our error should be proportional to m(b-a)^2.
This we can read off from dimensional analysis. If x is measured in seconds, say, and y in feet, then m is in feet per second, but the integral and the error are both in foot-seconds. If we want to estimate the error E in terms of m and the domain of integration, we must have E = Am, where A carries units of seconds squared; since (b-a) is the only other unitful input, we must have E \propto m(b-a)^2, and the coefficient of proportionality computed above gives E = m(b-a)^2/2.
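As a quick sanity check on E = m(b-a)^2/2, here is a small numerical experiment; the function names, and the use of c for the intercept (to avoid clashing with the endpoint b), are my own choices.

    # Check that the one-region left-hand-rule error on a line y = m*x + c
    # over [a, b] equals m*(b - a)^2 / 2.
    def left_rule_one_region(f, a, b):
        return (b - a) * f(a)  # width times the height at the left endpoint

    def exact_line_integral(m, c, a, b):
        return m * (b * b - a * a) / 2 + c * (b - a)

    m, c, a, b = 3.0, 7.0, 2.0, 5.0
    estimate = left_rule_one_region(lambda x: m * x + c, a, b)
    error = exact_line_integral(m, c, a, b) - estimate
    print(error, m * (b - a) ** 2 / 2)  # both print 13.5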
What other parameters do we have? We could divide [a,b] into n regions. If we do, then our total error is
E = n \cdot m((b-a)/n)^2/2 = m(b-a)^2/(2n).
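A corresponding check (same hypothetical names as above) shows that with n regions the error on a line really does shrink like 1/n:

    # For a line, the left-hand-rule error with n regions is m*(b - a)^2 / (2*n).
    def left_rule(f, a, b, n):
        width = (b - a) / n
        return sum(width * f(a + i * width) for i in range(n))

    m, c, a, b = 3.0, 7.0, 2.0, 5.0
    exact = m * (b * b - a * a) / 2 + c * (b - a)
    for n in (1, 2, 4, 8):
        error = exact - left_rule(lambda x: m * x + c, a, b, n)
        print(n, error, m * (b - a) ** 2 / (2 * n))  # the last two columns agree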
This is the error for a line. But we expect every function to behave locally like a line, so if we have a generic function, we expect this to estimate the error. We can make that estimate into a stronger inequality by being more precise: E should be a number, so we should use some m that estimates the derivative; if we pick m bigger than the absolute value of the derivative, then we will bound the absolute error. (You can make this rough inequality strict by simply imagining some curvature in the graph: it's clear that the line passing through the left endpoint whose slope is the maximum slope of the function is strictly above the function everywhere.)
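For a genuinely curved function this becomes an inequality; here is a small check with a test function of my choosing, f(x) = sin x on [0, \pi/2], where the derivative is bounded in absolute value by M = 1.

    import math

    # The left-hand-rule error should stay below M*(b - a)^2 / (2*n)
    # whenever M bounds |f'| on [a, b].
    def left_rule(f, a, b, n):
        width = (b - a) / n
        return sum(width * f(a + i * width) for i in range(n))

    a, b = 0.0, math.pi / 2
    M = 1.0       # |cos x| <= 1 on [0, pi/2]
    exact = 1.0   # integral of sin x from 0 to pi/2
    for n in (1, 10, 100):
        error = abs(exact - left_rule(math.sin, a, b, n))
        bound = M * (b - a) ** 2 / (2 * n)
        print(n, error, bound)  # the error stays under the bound; both shrink like 1/n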
We could have instead used the right endpoint of each interval to estimate the total integral. In the large-n limit, we expect the function over each small interval to behave like a straight line; the right-hand rule thus gives us an error almost exactly the same as the left-hand rule, except with the opposite sign (this is exactly true for a straight line). Thus we can get a much better estimate of the true integral by averaging the right- and left-hand estimates; doing this gives the "trapezoid" estimate, so named because on each subinterval the average of the two rectangles is the area of a trapezoid.
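Here is the cancellation in action, again with sin x on [0, \pi/2] as my test case: the left- and right-hand errors are nearly equal and opposite, and their average, the trapezoid estimate, does far better.

    import math

    def left_rule(f, a, b, n):
        width = (b - a) / n
        return sum(width * f(a + i * width) for i in range(n))

    def right_rule(f, a, b, n):
        width = (b - a) / n
        return sum(width * f(a + (i + 1) * width) for i in range(n))

    a, b, n = 0.0, math.pi / 2, 10
    exact = 1.0  # integral of sin x over [0, pi/2]
    left = left_rule(math.sin, a, b, n)
    right = right_rule(math.sin, a, b, n)
    trapezoid = (left + right) / 2
    print(exact - left, exact - right)  # roughly +0.081 and -0.076
    print(exact - trapezoid)            # roughly +0.002, much smaller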
When we average out the errors of the left- and right-hand rules, we are averaging out any dependence of the error on the first derivative of the function. Indeed, the trapezoid rule gives exactly the right results for the straight line. But there may still be an error. Our estimates on the errors are only true when the functions are straight lines; the failure of the errors to exactly cancel out must depend to first-order on the second derivative, as this measures the deviation in the (first) derivative. Dimensional analysis gives the correct error estimate for the trapezoid method:
Error(trapezoid) \leq m (b-a)^3 / (c n^2)
where c is some numerical constant, and m is a bound on the absolute second derivative. We can calculate c by working a particular example: the trapezoid estimate is exact for lines, and so up to scaling and subtracting a line, any parabola is y = x^2. The trapezoid rule (with a single subinterval) predicts an area of 1/2 for the integral from 0 to 1, whereas the true area is 1/3, so the error is 1/6; since the second derivative of x^2 is 2, this forces c=12.
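The same computation can be checked numerically (the helper name is mine): for y = x^2 the second derivative is 2 everywhere, and the trapezoid error matches 2(b-a)^3/(12 n^2) for every n.

    # Pinning down c = 12: for y = x^2 on [0, 1], the error of the trapezoid
    # estimate (estimate minus true value) equals 2*(b - a)^3 / (12 * n^2).
    def trapezoid_rule(f, a, b, n):
        width = (b - a) / n
        return sum(width * (f(a + i * width) + f(a + (i + 1) * width)) / 2
                   for i in range(n))

    a, b, exact = 0.0, 1.0, 1.0 / 3.0
    for n in (1, 2, 4, 8):
        error = trapezoid_rule(lambda x: x * x, a, b, n) - exact
        print(n, error, 2 * (b - a) ** 3 / (12 * n ** 2))  # the two columns agree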
How do we get a better estimate? If we had another method that also had an error depending on the second derivative, we could average again, and get an error of order the third derivative. Notice that by dimensional analysis the power of n in the denominator is always equal to the order of the derivative: the more accurately we approximate our curves by polynomials, the more quickly our estimates converge on the true values.
One such second-derivative-dependent method is the "midpoint rule": estimate the height of each subinterval by the value of the function at the midpoint of the interval. I'll say why this is second-derivative-dependent in a moment. Dimensional analysis gives the same formula for the error as for trapezoids, and looking at y=x^2 yields c=24.
Error(midpoint) \leq m (b-a)^3 / (24 n^2).
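Running the same experiment for the midpoint rule (names again mine) shows both the factor of two and a sign difference that will matter in a moment.

    # For y = x^2, the midpoint error is half the trapezoid error in size,
    # and it has the opposite sign.
    def midpoint_rule(f, a, b, n):
        width = (b - a) / n
        return sum(width * f(a + (i + 0.5) * width) for i in range(n))

    def trapezoid_rule(f, a, b, n):
        width = (b - a) / n
        return sum(width * (f(a + i * width) + f(a + (i + 1) * width)) / 2
                   for i in range(n))

    a, b, exact = 0.0, 1.0, 1.0 / 3.0
    for n in (1, 2, 4):
        mid_err = midpoint_rule(lambda x: x * x, a, b, n) - exact
        trap_err = trapezoid_rule(lambda x: x * x, a, b, n) - exact
        print(n, mid_err, trap_err)  # e.g. n=1 gives -1/12 versus +1/6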
Thus the midpoint rule is twice as accurate as the trapezoid rule; taking the appropriately weighted average (two parts midpoint to one part trapezoid, since the two errors have opposite signs) yields "Simpson's rule". This should be called the "parabola rule", since geometrically Simpson estimates each subinterval by drawing the parabola that passes through the midpoint and the two endpoints.
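Concretely, on each subinterval that 2:1 weighted average works out to the familiar three-point formula, width times (f(left) + 4 f(mid) + f(right))/6; the sketch below (my notation, with sin x as the test function) checks that the two descriptions agree and shows how much smaller the error becomes.

    import math

    def midpoint_rule(f, a, b, n):
        width = (b - a) / n
        return sum(width * f(a + (i + 0.5) * width) for i in range(n))

    def trapezoid_rule(f, a, b, n):
        width = (b - a) / n
        return sum(width * (f(a + i * width) + f(a + (i + 1) * width)) / 2
                   for i in range(n))

    # Simpson's rule written directly via the parabola-through-three-points formula.
    def simpson_rule(f, a, b, n):
        width = (b - a) / n
        return sum(width * (f(a + i * width)
                            + 4 * f(a + (i + 0.5) * width)
                            + f(a + (i + 1) * width)) / 6
                   for i in range(n))

    f, a, b, n = math.sin, 0.0, math.pi / 2, 4
    weighted = (2 * midpoint_rule(f, a, b, n) + trapezoid_rule(f, a, b, n)) / 3
    print(weighted, simpson_rule(f, a, b, n))  # identical up to rounding
    print(1.0 - weighted)  # error around 1e-5, versus roughly 1e-2 for the trapezoid rule here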
Naively, we might predict that Simpson's rule, by averaging out any dependence on the second derivative, should give an error linear in the third derivative. But in fact it is linear in the fourth. A hands-on way of seeing this is to simply apply it to a cubic. We might as well be interested in the integral from -1 to +1, and we may assume n=1. Given any cubic function, we can subtract some parabola so that it goes through y=0 at x=-1, 0, and +1; the parabola rule predicts an area of 0. And the true area? The cubics vanishing at these three points form a one-dimensional family: the scalar multiples of x^3 - x = x(x-1)(x+1). Each of these is an odd function, so its true integral over [-1,1] is also 0. Thus, Simpson's rule is exact on cubics, and the fourth derivative is the smallest that can create an error.
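Both facts are easy to see numerically; in the sketch below (test functions chosen by me) Simpson's rule gets the cubic exactly, and on a quartic its error drops by a factor of 16 every time n doubles, the signature of an n^{-4} error.

    # Simpson's rule is exact on a cubic, and on a quartic its error
    # shrinks like 1/n^4 (a factor of 16 per doubling of n).
    def simpson_rule(f, a, b, n):
        width = (b - a) / n
        return sum(width * (f(a + i * width)
                            + 4 * f(a + (i + 0.5) * width)
                            + f(a + (i + 1) * width)) / 6
                   for i in range(n))

    # The cubic from the argument above: with n = 1 on [-1, 1], Simpson
    # predicts 0, and the true integral is 0 because the function is odd.
    print(simpson_rule(lambda x: x ** 3 - x, -1.0, 1.0, 1))  # 0.0

    # On x^4 over [0, 1] (true value 1/5), each doubling of n divides the error by 16.
    for n in (1, 2, 4, 8):
        print(n, simpson_rule(lambda x: x ** 4, 0.0, 1.0, n) - 0.2)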
There is a more elegant way of seeing that Simpson's error depends on the fourth and not the third derivative, which will also explain why the midpoint rule, which naively just measures the height at a point and so ought to have an error controlled by the first derivative, really depends on the second. In particular, both the midpoint rule and Simpson's rule are symmetric under the reflection x \to -x (reflecting each subinterval about its midpoint). But the first and third derivatives (and indeed all odd derivatives) pick up minus signs under this reflection. By dimensional analysis, the error in an estimation method must be linear in the controlling derivative, and so any symmetric method cannot depend on odd derivatives.
Thus, I end with a cute challenge: come up with an (interesting, simple) estimation method with an error that really does scale as the third derivative of the function.