Summing an array of floating point numbers

No and no.

In fact, the use of 32-bit float types is on the rise.

Moving data in and out of memory is now the bottleneck, not arithmetic operations per se, and floats let you pack more data into a given amount of memory.

Not only that, there is growing interest in floating point types even smaller than 32-bit floats.

See the following posts for examples:

- 8-bit floating point
- 16-bit floating point
- Posit numbers

And while sometimes you need less precision than 32 bits, sometimes you need more precision than 64 bits.

See, for example, this post.

Now for some code.

Here is the obvious way to sum an array in C:

    float naive(float x[], int N) {
        float s = 0.0;
        for (int i = 0; i < N; i++) {
            s += x[i];
        }
        return s;
    }

Kahan’s algorithm is not much more complicated, but it looks a little mysterious and provides more accuracy.

    float kahan(float x[], int N) {
        float s = x[0];
        float c = 0.0;              /* running compensation for lost low-order bits */
        for (int i = 1; i < N; i++) {
            float y = x[i] - c;     /* correct the next term by the previous error */
            float t = s + y;        /* add; low-order bits of y may be lost here */
            c = (t - s) - y;        /* recover what was just lost */
            s = t;
        }
        return s;
    }

Now to compare the accuracy of these two methods, we’ll use a third method that uses a 64-bit accumulator.

We will assume its value is accurate to at least 32 bits and use it as our exact result.

    double dsum(float x[], int N) {
        double s = 0.0;
        for (int i = 0; i < N; i++) {
            s += x[i];
        }
        return s;
    }

For our test problem, we’ll use the reciprocals of the first 100,000 positive integers rounded to 32 bits as our input.

This problem was taken from [2].

    #include <stdio.h>

    int main() {
        int N = 100000;
        float x[N];
        for (int i = 1; i <= N; i++)
            x[i-1] = 1.0/i;
        printf("%.15g\n", naive(x, N));
        printf("%.15g\n", kahan(x, N));
        printf("%.15g\n", dsum(x, N));
    }

This produces the following output.

    12.0908508300781
    12.0901460647583
    12.0901461953972

The naive method is correct to 6 significant figures while Kahan’s method is correct to 9 significant figures.

The error in the former is 5394 times larger.
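To see where that factor comes from, subtract the 64-bit result from each 32-bit result: the naive sum is off by 12.0908508300781 − 12.0901461953972 ≈ 7.046 × 10^-4, while the Kahan sum is off by 12.0901461953972 − 12.0901460647583 ≈ 1.306 × 10^-7, and 7.046 × 10^-4 divided by 1.306 × 10^-7 is about 5394.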

In this example, Kahan’s method is as accurate as possible using 32-bit numbers.
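One way to check that claim (a minimal sketch, assuming the three functions above are in scope): the best any 32-bit method could possibly do is return the 64-bit sum rounded to a float, so compare Kahan’s result against that.

    #include <stdio.h>

    /* Sketch: the most accurate possible float result is the 64-bit
       sum rounded to 32 bits. Check whether Kahan's sum achieves it.
       Assumes kahan() and dsum() are defined as above. */
    void check(float x[], int N) {
        float best = (float) dsum(x, N);  /* 64-bit sum rounded to float */
        printf("Kahan matches rounded 64-bit sum: %s\n",
               kahan(x, N) == best ? "yes" : "no");
    }

For the test data above, this prints "yes": the Kahan sum and the rounded 64-bit sum are bit-for-bit identical.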

Note that in our example we were not exactly trying to find the Nth harmonic number, the sum of the reciprocals of the first N positive integers.

Instead, we were using these integer reciprocals rounded to 32 bits as our input.

Think of them as just data.

The data happened to be produced by computing reciprocals and rounding the result to 32-bit floating point accuracy.

But what if we did want to compute the 100,000th harmonic number? There are methods for computing harmonic numbers directly, as described here.
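For example, here is a sketch of one such direct method, using the standard asymptotic expansion H_n ≈ log n + γ + 1/(2n) − 1/(12n^2), where γ ≈ 0.5772 is the Euler–Mascheroni constant. (This is a well-known formula, not necessarily the method in the post linked above.)

    #include <math.h>

    /* Sketch: approximate the nth harmonic number from the asymptotic
       expansion H_n ~ log(n) + gamma + 1/(2n) - 1/(12n^2). A standard
       formula, not necessarily the one in the linked post. */
    double harmonic(double n) {
        const double gamma = 0.5772156649015329;  /* Euler-Mascheroni constant */
        return log(n) + gamma + 1/(2*n) - 1/(12*n*n);
    }

For n = 100,000 this returns roughly 12.0901461298634, the harmonic number itself. Note that it differs from the 64-bit sum above in the eighth decimal place, because that sum used reciprocals that had already been rounded to 32 bits.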

Related posts

- Anatomy of a floating point number
- Why IEEE floating point numbers have +0 and -0
- IEEE floating point exceptions in C++

[1] The condition number of a problem is basically a measure of how much floating point errors are multiplied in the process of computing a result. A well-conditioned problem has a small condition number, and an ill-conditioned problem has a large condition number.

With ill-conditioned problems, you may have to use extra precision in your intermediate calculations to get a result that’s accurate to ordinary precision.
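For summation in particular, the condition number is usually taken to be (|x_1| + … + |x_N|) / |x_1 + … + x_N|. For our test data all the terms are positive, so this ratio is 1 and the problem is well-conditioned; sums with heavy cancellation between positive and negative terms are where the condition number, and with it the potential error amplification, blows up.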

[2] Handbook of Floating-Point Arithmetic by Jean-Michel Muller et al.

