Doing a lot of mathematical calculations fast and accurately

[Warning: You need to enable Javascript, lest the math be all garbled and hurt your eyes even more.]

Programmers know how to improve the speed and robustness of their codes. However, when a program does a lot of calculations, new problems arise and new solutions are needed. Both are fairly language-independent: the same issues, with the same fixes, will exist in C++, Java, Fortran, Javascript.

Consequently, I will focus on the numerical aspects, without paying too much attention to opportunities for language-specific optimizations. The code snippets are in C++; the necessary #include and using (e.g. using namespace std) are assumed to be there (to make the code leaner and the syntax less specifically C++). Also some elements will seem to be missing (such as classes without constructors whose instances are miraculously initialized just fine) so as to focus on what is relevant without diluting it in an ocean of boilerplate code.

Do it faster

Making a code run faster is a wide-spread concern among programmers (of compiled languages). And calculations, when they are very massive or very numerous (e.g. deep nested loops), can be a performance drag. This is where programming meets applied mathematics and computations.

Avoid useless calculations

Assume we have a class defining 3D vectors

class vector3D {
double x, y, z;
inline double length() const { return sqrt(x*x + y*y + z*z); }
};

and we want to find, in some list, all those shorter than a certain length:

vector<vector3D> myVectors = ...;
const double cutOff = 2.5; // e.g. this value may be passed as an argument
for (vector3D vec : myVectors)
if (vec.length() < cutOff) { ... }

This code is inefficient because there is in fact no need to calculate all these square roots. A quick benchmark on my laptop found that the square root calculation was an order of magnitude slower than calculating the three products and two sums of x*x + y*y + z*z.

A better alternative (with additions and modifications in bold) is thus

class vector3D {
double x, y, z;
inline double sqrLength() const { return x*x + y*y + z*z; }
inline double length() const { return sqrt(sqrLength()); }
};

and

vector<vector3D> myVectors = ...;
const double cutOff = 2.5;
const double sqrCutOff = cutOff * cutOff;
for (vector3D vec : myVectors)
if (vec.sqrLength() < sqrCutOff) { ... }

When you have to first calculate the square of what you are interested in (e.g. Euclidean distance, standard deviation), ask yourself whether this square root is really necessary. However, sqrLength() may run into trouble if x, y or z is very large, see below.

Your code variables need not be your mathematical variables

If you have to implement, for \(1 \le i \le N\),

\begin{cases} x_i = x_{i-1} + \sqrt{2}\,\sigma \times (2\,a_i + b_i + c_i + 2\,d_i + e_i + f_i) \\ y_i = y_{i-1} + \sqrt{2}\,\sigma \times (a_i + 2\,b_i + c_i + d_i + 2\,e_i + f_i) \\ z_i = z_{i-1} + \sqrt{2}\,\sigma \times (a_i + b_i + 2\,c_i + d_i + e_i + 2\,f_i) \end{cases}

the following will work fine:

const unsigned N = ...;
const double sigma = ...;
array<double, N> a(...), b(...), ...; // set the values of a, b, ..., f
array<double, N> x, y, z;
x[0]=...; y[0]=...; z[0]=...; // initial values must be set by hand
for (unsigned i = 1; i < N; ++i) {
x[i] = x[i-1] + sqrt(2) * sigma * (2*a[i] + b[i] + c[i] + 2*d[i] + e[i] + f[i]);
y[i] = y[i-1] + sqrt(2) * sigma * (a[i] + 2*b[i] + c[i] + d[i] + 2*e[i] + f[i]);
z[i] = z[i-1] + sqrt(2) * sigma * (a[i] + b[i] + 2*c[i] + d[i] + e[i] + 2*f[i]);
}

However, it is uselessly repetitive. We are naturally drawn to implementing the exact variables that appear in the mathematical equations: if the equations have \(a\), \(b\) and \(c\), so will the program (likewise with the distance in the previous example). But in this case, we do not need them individually:

const unsigned N = ...;
// sqrt2Sigma would be ambiguous: sqrt(2)*sigma or sqrt(2*sigma)?
const double sigmaSqrt2 = sqrt(2) * ...;
// we do not need the values of a, b, ..., f individually
// so we initialize directly what we do need
array<double, N> a_plus_d(...), b_plus_e(...), c_plus_f(...);
array<double, N> x, y, z;
x[0]=...; y[0]=...; z[0]=...; // initial values still set by hand
for (unsigned i = 0; i < N; ++i) {
// sumAToF (and APlusD, etc.) would be far less clear
double sum_a_to_f = a_plus_d[i] + b_plus_e[i] + c_plus_f[i];
x[i] = x[i-1] + sigmaSqrt2 * (sum_a_to_f + a_plus_d[i]);
y[i] = y[i-1] + sigmaSqrt2 * (sum_a_to_f + b_plus_e[i]);
z[i] = z[i-1] + sigmaSqrt2 * (sum_a_to_f + c_plus_f[i]);
}

Depending on how \(x\), \(y\) and \(z\) are used, one may even choose to store \(x'_i = x_i / (\sqrt{2}\,\sigma)\) instead of \(x_i\):

\[x'_i = x'_{i-1} + \text{sum_a_to_f}_i + \text{a_plus_d}_i. \]

Avoid calculating the same thing over and over again

The following excerpt is based on an actual code I came across (the names of the variables have been changed to safeguard their privacy):

const unsigned N = ...;
const double a = ...;
const double b = ...;
array<double, N> u, v;
for (unsigned i = 0; i < N; ++i) {
update(u[i]); // do with u what needs doing
v[i] = u[i] * exp((a-b)*i);
}

The weakness of this algorithm is subtle. One must first notice that, with exponentials as with powers, \(e^{(a-b) \times i} = (e^{a-b})^i = e^{a-b} \times (e^{a-b})^{i-1}\). And since we use these powers in increasing order we can simply multiply in each iteration:

[same declarations]
const double constExp = exp(a-b);
double bufferExp = 1; // exp((a-b)i) = 1 for i = 0
for (unsigned i = 0; i < N; ++i) {
update(u[i]);
v[i] = u[i] * bufferExp;
bufferExp *= constExp;
}

On my laptop this runs 20 times faster (I said subtle, not small).

Reuse what you already calculated (example: Lennard-Jones potential)

The energy between two atoms with the Lennard-Jones potential is given by

\[U = U_0 \left[ \left(\frac{r_0}{r}\right)^{\!12} - 2\left(\frac{r_0}{r}\right)^{\!6} \right]\!.\]

In practice, this will be dedimensionalized, so that \(U_0\) and \(r_0\) can be assumed to be 1. The force is negative the gradient (see appendix):

\[d\,F_x = \left( \frac{12}{r^{14}} - \frac{12}{r^8} \right) d\,x\]

(and the same with \(y\) and \(z\)). The implementation is straightforward, based on a loop over all pairs of atoms:

const unsigned D = ...; // 2 for 2D and 3 for 3D
for (atom1 : atomList) {
for (atom2 : atom1.neighborList()) { // O(N) algorithm
double r = distance(atom1, atom2); // we assume this has been implemented
double dForce = 12 / pow(r, 8) - 12 / pow(r, 14);
for (unsigned dim = 0; dim < D; dim++) {
double proj = projection(atom1, atom2, dim); // provides dx, dy, dz
atom1.addToForce(-dForce * proj, dim);
atom2.addToForce( dForce * proj, dim);
}
potEnergy += 1 / pow(r, 12) - 2 / pow(r, 6); // 1/r^12-2/r^6
}
}

(For the sake of simplicity some functions are assumed to be implemented elsewhere.) What is wrong with this? About everything. The only good thing about it is that it is O(\(N\)), not O(\(N^2\)): the inner loop is not over all atoms but only over the neighbors of the first atom (we keep a list of them, with help from the NSA), so the inner loop has constant complexity.

The powers (6 and 12) are calculated multiple times (and pow() may or may not be efficient with simple integral exponents). The distance is obtained through a perfectly pointless square root. Am I finicky? Yes (otherwise I would code in Python). But when the loop above runs millions of times for millions of atoms, and when simulations take days or weeks, no gain is really negligible. The following implementation is better (3.5 times faster on my laptop) without being more complicated:

inline double sqr (double x) { return x*x; }
inline double cube(double x) { return x*x*x; }
double r2, inverse_r6, dForce, proj;
for (atom1 : atomList) {
for (atom2 : atomList) {
r2 = sqrDistance(atom1, atom2); // see above
inverse_r6 = 1 / cube(r2);
dForce = 12 * inverse_r6 / r2 * (1 - inverse_r6);
// [ 12/r^6/r^2 * (1 - 1/r^6) = 12/r^8 - 12/r^14 ]
for (unsigned dim = 0; dim < D; dim++) { // loop over spatial dimensions
proj = projection(atom1, atom2, dim);
atom1.addToForce(-dForce * proj, dim);
atom2.addToForce( dForce * proj, dim);
}
potEnergy += inverse_r6 * (inverse_r6 - 2); // (1/r^6 - 2)/r^6 = 1/r^12 - 2/r^6
}
}

One would make further modifications, such as having the class for atoms handle vectors directly instead of individual coordinates. Note that a pow6() function would be pointless in this case: the function taking \(x^2\) in and returning \(x^6\) is called cube().

Do it more accurately

A code may crash, freeze, return the wrong result, etc. These are often clear-cut: either black or white. When floating-point calculations are involved arises the grey zone of imprecise results — not completely wrong, but not quite right either.

Do not lose accuracy for no good reason

Imagine that we want to track the time evolution of some data (wittily called Data in the code snippet) and calculate the average change each time:

class Data {
vector<double> values;
void update() { . . . } // calculate the new values
double average() {
return accumulate(values.begin(), values.end(), 0.) / values.size(); }
};
const unsigned numberIterations = ...;
Data myData(); // constructor assumed to initialize myData
for (unsigned t=0; t<numberIterations; ++t) {
Data oldData = myData; // oldData holds the old values
myData.update(); // and myData the updated ones
double avgChange = myData.average() - oldData.average();
}

This looks fine, but is likely to produce incorrect results.

Consider a simpler case, viz. the following subtraction:

123456789.1234
− 123456789

An eight year old knows that the answer is 0.1234. But if you attempt this operation with two floats, your code will not be as accurate as an eight-year old. This problem is known as "big number minus big number".

The case of integers is obviously different. It is the worst of situations (adding up too many positive integers will dump you straight into negative territory — the mother of all inaccuracies). It is the best of situations (the risk is so blatant than only the greenest of tyros would fall for it.)

Mathematically speaking, \(\sum_i b_i - \sum_i a_i = \sum_i (b_i - a_i)\). But computationally they may differ because of the finite accuracy of floating-point numbers. This is particularly relevant when \(b_i\) is of the form \(a_i + \varepsilon_i\), where the \(\{\varepsilon_i\}\) are small. This is because the accuracy of floating point numbers is relative: a number on the order of 1 will be accurate to about 10⁻¹⁵, but only to 10⁻⁵ if it is around 10¹⁰.

The following would improve the accuracy:

class Data {
[same old]
static Data change() { . . . } // replaces update()
};
[same declarations and initializations]
for (unsigned t=0; t<numberIterations; ++t) {
Data dData = myData.change();
myData += dData; // operator+=() assumed to be overloaded
double avgChange = dData.average();
}

This solves the problem by calculating the new value of myData off its previous value and dDdata instead of calculating dData as the difference between data and oldData. Of course, this really works only if myData.change() does not simply return myData.update() - myData.

Implement a better (yet mathematically equivalent) formula

Example 1

Instead of rushing ahead and implementing your equation as it is written on paper, you may benefit from looking for a mathematically equivalent one which will run faster and (or) more accurately. For instance, if \(\varepsilon\) is small, \(\sqrt{x+\varepsilon} - \sqrt{x}\) is open to the problem of big number minus big number. This problem can easily be averted by remembering that \((a-b)(a+b) = a^2 - b^2\) still works if \(a\) and \(b\) are square roots:

\[\sqrt{x+\varepsilon} - \sqrt{x} = \dfrac{(x+\varepsilon) - x}{\sqrt{x+\varepsilon} + \sqrt{x}} = \dfrac{\varepsilon}{\sqrt{x + \varepsilon} + \sqrt{x}} \quad \left(\!\approx \dfrac{\varepsilon}{2 \sqrt{x}}\right).\]

Example 2

Likewise there are several ways of calculating a variance. The variance of a set of numbers is:

\[\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i-m)^2,\]

where the average \(m\) is \(\frac{1}{n} \sum_{i=1}^{n} x_i\). (I am not interested here in matters of \(n\) vs. \(n-1\).) The square can be expanded:

\[\sigma^2 %= \frac{1}{n} \sum_i (x_i-m)^2 = \frac{1}{n} \sum_i (x_i^2 + m^2 - 2\,x_i\,m) = \frac{1}{n}\sum_i x_i^2 + m^2 - 2\,m \frac{1}{n} \sum_i x_i.\]

Since the last sum is \(m \times n\), one finds that

\[\sigma^2 = \frac{1}{n}\sum_i x_i^2 - m^2,\]

i.e. the average of the squares minus the square of the average. However, this clever little formula, being open to the problem of big number minus big number, looks better on paper than in a code.

Examples 3 and 4

Another two (slightly more complicated) situations where a better mathematical formula makes a difference, a trigonometric function and the Verlet algorithm to discretize Newton's law, are given in an appendix.

Do not pay for accuracy you will not consume

I insisted that, on top on the usual trade-offs between speed, memory and readability, in codes carrying out a lot of calculations there is a trade-off with accuracy. A little cleverness can often give you more accurate results at very little cost (except using your brain and math). But you should also ask yourself whether it may not sometimes be a good idea to give up some precision you do not in fact need in exchange for using less memory or making the program run faster.

Use a smaller type for calculations

Say you want to create a class of fractions. You make it a template so that users can have fractions of 16-bit, 32-bit or 64-bit integers. But if you define 4/8 as a fraction of long longs and want to reduce it, it is probably a good idea to test whether numerator and denominator would not both fit is a smaller type and then call the reduction algorithm for that type:

template<typename T>
Fraction<T> Fraction<T>::convert_and_reduce(const Fraction<T>& fraction) {
T largest_abs = max( abs(fraction.numerator()), abs(fraction.denominator()) );
if (largest_abs < UINT_MAX )
return Fraction<int>::do_reduction(fraction);
else
return Fraction<long long>::do_reduction(fraction);
}

(The code is obviously far from complete, but I think the trick is clear.) On my computer the speed-up is on the order of 2 or 3.

Using integers (?)

Imagine you need to deal with points, line segments, etc. in 2D (e.g. to describe a building blue print). It is fairly natural to store such dimensions as doubles. But you will not need numbers both on the scale of 10¹⁰ and of 10⁻¹⁰: all coordinates will be similar in scale. So they could be stored as integers, giving distances in millimeters for instance. A 16-bit unsigned integral type can go up to 65 535 mm = 65.5 m, which should be enough to describe most buildings. You do not need to know building dimensions down to the millimeter, but using millimeters ensures that the round-off errors will remain smaller than the uncertainty on the empirical measurements (centimeters would be more prone to a build-up of errors).

This strategy may be useful when there are only simple calculations, e.g. monetary amounts can probably be advantageously stored as integers representing cents in accounting. Additions and subtractions will then be between integers, no problem. If there are more involved calculations, it may be necessary to use a different type to do the math: calculating the length of the diagonal of a rectangle requires to square the lengths of its sides, and these intermediate values may be larger than 65 535. So, 16-bit integers may be used for storage and doubles for calculations. (Full disclosure: I had this idea after implementing financial calculations when it was too late to change the type, so this is more a hunch than a tested idea; but if you do try it, let me know how it went.)

Do it at all

Inaccurate or slow calculations are bad. But it could be worse: some computations plainly crash.

Double-precision floats are typically limited to about 10³⁰⁸ in absolute value. Try to go past this limit and it will overflow. In some cases, this may be hard to avoid: your numbers are just huge. But in most cases, the final result would be well within bounds, but intermediate calculations overflow.

Back to the lengths of vectors

Using sqrLength(), as I advised above, can for instance be an issue if x, y or z is very large since it may overflow. For example, if x = y = z = 10²⁰⁰ then the square of the length equals 3 × 10⁴⁰⁰, which is larger than the largest double the computer can handle (even though the length itself is not, as it is barely larger than x, y and z).

If this is a genuine risk in your application then it would be a good idea to bypass the squared length and compare lengths directly. Of course length() cannot involve the square of the length and should be calculated using something like:

inline double length() const {
double myMax = max(max(abs(x), abs(y)), abs(z));
return myMax * sqrt( (x/myMax)*(x/myMax) + (y/myMax)*(y/myMax) + (z/myMax)*(z/myMax) );
}

x/myMax, etc. are all between 0 and 1 (at least one of them exactly 1), so we take the square root of a number between 1 and 3 — no risk of overflow.

Factorials

You do not deal with numbers larger than 10³⁰⁸ every day. Well, you may. Factorials and exponentials for instance get pretty big pretty fast. An important case is that of combinations (and permutations):

\[\binom nk = \frac{n!}{k!\,(n-k)!}\]

with \(k \le n\). Since \(n! = 1 \times 2 \times 3 \times \ldots \times (n-1) \times n\), numerator and denominator partly cancel out (mathematically), so that

\[\binom nk = \frac{(k+1) \times (k+2) \times \ldots \times (n-1) \times n}{1 \times 2 \times \ldots \times (n-k-1) \times (n-k)} = \frac{(n-k+1) \times (n-k+2) \times \ldots \times (n-1) \times n}{1 \times 2 \times \ldots \times (k-1) \times k}.\]

As the numerator and denominator have the same number of factors, we can do them in pairs, alternating between multiplication and division to minimize the risk of overflow:

inline static uint64_t combination(uint32_t n, uint32_t k) {
long double ans = 1;
uint32_t i_min = min(k, n-k), i_max = max(k, n-k);
for (uint32_t i = 1; i <= i_min; ++i) {
ans *= i_max+i;
ans /= i;
}
return round(ans);
}

Since \(\binom {n}{k} = \binom {n}{n-k}\) we used the variant requiring the fewest computations of the two (through i_min). Since dividing by i can create havoc if ans is not a multiple of it, I use a long double to carry out the calculations (one may consider using fractions, I have not tried it). The return type is (unsigned) 64-bit integer, whereas everything else is 32 bits, because the combination is generally much larger than n and k (in generic programming, it would probably be necessary to use the same large type for all).

Mathematical appendices

Derivation of the force for the Lennard-Jones potential

The force is minus the gradient of the potential energy (i.e. the derivative with respect to position). However, the energy is known as a function of the distance \(r\), not of the positions themselves (we have easy access to \(-dU/dr\) rather than to \(-dU/dx\)). Since \(d\,r^2 = 2\,r\,d\,r\) and \(d\,r^2 / d\,x = d(x^2+y^2+z^2) / d\,x = 2\,x\), we can write:

\[\frac{d\,U}{d\,x} = \frac{d\,U}{d\,r} \frac{d\,r}{d\,x} = \frac{d\,U}{d\,r} \frac{1}{2\,r} \frac{d\,r^2}{d\,x} = \frac{d\,U}{d\,r} \frac{d\,x}{r}.\]

All we have to do now is calculate the derivative of the energy with respect to \(r\):

\[\frac{d\,U}{d\,r} = \frac{d}{d\,r} \left(\frac{1}{r^{12}} - \frac{2}{r^6} \right) = -12\frac{1}{r^{13}} - (-6) \frac{2}{r^7}.\]

Verlet algorithm to discretize Newton's law

In order to implement Newton's law, one must discretize it. A naïve approach is a straight Taylor expansion:

\[\vec{r}(t+\delta t) = \vec{r}(t) + \frac{d\, \vec{r}(t)}{d\,t} \delta t + \frac{d^2 \vec{r}(t)}{d\,t^2} \frac{\delta t^2}{2} + o(\delta t^2).\]

The first and second derivatives being respectively the velocity and the acceleration (and the latter being the force over the mass), one gets

\[\vec{r}(t+\delta t) = \vec{r}(t) + \vec{v}(t)\, \delta t + \frac{\vec{F}(t)}{m} \frac{\delta t^2}{2} + o(\delta t^2).\]

However, one can get more accuracy with a very small trick, namely expanding at both \(t+\delta t\) and \(t-\delta t\):

\[\vec{r}(t+\delta t) = \vec{r}(t) + \frac{d\, \vec{r}(t)}{d\,t} \delta t + \frac{d^2\, \vec{r}(t)}{d\,t^2} \frac{\delta t^2}{2} + \frac{d^3\, \vec{r}(t)}{d\,t^3} \frac{\delta t^3}{6} + o(\delta t^3).\] \[\vec{r}(t-\delta t) = \vec{r}(t) - \frac{d\, \vec{r}(t)}{d\,t} \delta t + \frac{d^2\, \vec{r}(t)}{d\,t^2} \frac{\delta t^2}{2} - \frac{d^3\, \vec{r}(t)}{d\,t^3} \frac{\delta t^3}{6} + o(\delta t^3).\]

All we now have to do is take their sum. The odd terms vanish,

\[\vec{r}(t+\delta t) + \vec{r}(t-\delta t) = 2\, \vec{r}(t) + 2\, \frac{d^2\, \vec{r}(t)}{d\,t^2} \frac{\delta t^2}{2} + o(\delta t^3),\]

and we finally get the Verlet algorithm:

\[\vec{r}(t+\delta t) = 2\, \vec{r}(t) - \vec{r}(t-\delta t) + \frac{\vec{F}(t)}{m} \delta t^2 + o(\delta t^3).\]

If \(\delta t = 10^{-3}\) then \(\delta t^2 = 10^{-6}\), but all it takes to get \(\delta t^3 = 10^{-6}\) is \(\delta t = 10^{-2}\). Concretely, running with a longer time step is like walking with a longer stride: much faster. This is often used in mathematical models such as molecular dynamics.

Efficient and accurate computations