## My Bug, My Bad #2: Sunk by Float

posted by Craig Gidney on March 19, 2013

Last week I read a post that sparked my interest: Comparing an Integer With a Floating-Point Number (Part 1, Part 2). The problem from the post is interesting because a) it involves floating point numbers and b) neither type (32-bit integers, single precision floats) can represent all values of the other type (and you’re not allowed to cheat by casting to double).

The post starts solving the float/int comparison problem with an approach based on boundaries: find the smallest float guaranteed to be a whole number, do different things on either side of that boundary, etc. I find this distasteful because it screams “easy to get wrong and not realize it!”. You’re in danger of testing the wrong corner cases if you compute the wrong boundary value (i.e. it’s difficult to repeat yourself differently in the tests), so a mistake might go unnoticed.

I figured I would try to come up with a different approach, and I did:

```csharp
bool Equals(int i, float f) {
    // warning: this is wrong
    float iToF = (float)i;
    return (int)iToF == i && f == iToF;
}
```

The main idea behind the above code is that only integers that can be represented as floats will survive unchanged through a round-trip casting to float and back to int. If an integer can be represented as a float, then we can use its float version for the comparison. If an integer can’t be represented as a float, then there’s no way it’s equal to any float.
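The round-trip idea is easy to sketch outside of C#. Here is a minimal Python version (names are mine, not from the original), where `struct` pack/unpack stands in for the float cast by rounding through the 32-bit IEEE single format. Note that Python's unbounded integers quietly sidestep the out-of-range cast problem discussed below, so this sketch only mirrors the round-trip logic itself:

```python
import struct

def f32(x):
    # round a Python float (a double) to the nearest IEEE single-precision value
    return struct.unpack('f', struct.pack('f', x))[0]

def naive_equals(i, f):
    # warning: mirrors the buggy round-trip logic from the post
    i_to_f = f32(float(i))
    return int(i_to_f) == i and f == i_to_f

print(naive_equals(5, 5.0))                 # True
print(naive_equals(16777217, 16777216.0))   # False: 16777217 isn't float-representable
```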

The other idea behind the above code is being horribly, terribly wrong. Wrong stacked on top of wrong stacked on top of even more wrong, in a never-ending cycle of pain and unreliable knowledge… but maybe I’m being a bit hard on myself. Working with floating point numbers is notoriously difficult, after all. Let’s break down the issues in the above code.

### Wrong #1: Rounding out of range

The code assumes round-tripping a 32-bit integer through float will give back a proper 32-bit integer. That’s actually not the case. For example, the maximum 32-bit signed integer is rounded up when you convert it to float: from 2147483647 to 2147483648. When you convert the float back to an Int32, it’s out of the valid range.
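You can verify the rounding without any C# involved. A quick Python check, using `struct` to round through the 32-bit IEEE single format:

```python
import struct

max_i32 = 2**31 - 1
# simulate (float)int.MaxValue: pack to 32-bit IEEE single, unpack back to a double
as_f32 = struct.unpack('f', struct.pack('f', float(max_i32)))[0]
print(as_f32)            # 2147483648.0 -- rounded up, one past int.MaxValue
print(as_f32 > max_i32)  # True: casting this back to a 32-bit int overflows
```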

Bad things happen when you try to convert a too-large/out-of-range float into an int. In C it’s undefined behavior, which is very very bad. Even in a “safe” language like C# the spec says the cast either throws an exception or returns an “unspecified value”, depending on whether you’re in a checked or unchecked context.

Relying on undefined behavior or on an unspecified value is a great way to turn your code into a time bomb. Everyone loves wasting time figuring out why apparently working code broke when you started using a more recent compiler… right?

Before even getting into the muck of floating point precision, the code is wrong. The round-trip might fail more than it should. On the other hand, a few ‘is it in range?’ checks could bypass this problem. The wrong goes so much deeper.

### Wrong #2: Precision

The code assumes floats have a fixed amount of precision. This sounds reasonable, except in reality the compiler and the hardware will “help out” by providing more precision in some cases.

For example, the compiler is likely to keep the value of the local variable iToF in a register, instead of in memory. On x86, the x87 floating point registers are 80 bits wide, which is more precision than a 32-bit float requires. The extra precision, whether it comes from registers or from optimizations, has observable effects. Consider this C# code:

```csharp
float f = (float)(1 << 24);
int i0 = (1 << 24) + 1;
int i1 = (int)(f + 1f);
int i2 = (int)(float)(f + 1f);
Console.WriteLine(i0); // 16777217
Console.WriteLine(i1); // 16777217
Console.WriteLine(i2); // 16777216
```

Those well versed in the dangers of floating point will notice that floats don't have enough precision to represent the integer 16777217 (that is, 2^24 + 1). That's the first level of trickery. It explains i2 printing 16777216 instead of 16777217. The second level of trickery is the "bonus" precision, which allows the apparently impossible feat of converting a float to an int and getting 16777217 stored in i1, despite the fact that floats can't represent 16777217.

Note that the above sample might behave differently on your machine or in the future. The C# spec does not specify how extra precision is handled, stating nothing more concrete than "Floating-point operations may be performed with higher precision than the result type of the operation.".
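The first level of trickery (the 24-bit significand) is deterministic and reproduces anywhere. A Python check, rounding through the 32-bit IEEE single format with `struct` (extended precision is compiler- and hardware-dependent, so the second level doesn't show up here):

```python
import struct

def f32(x):
    # round a Python float (a double) to the nearest IEEE single-precision value
    return struct.unpack('f', struct.pack('f', x))[0]

print(f32(2.0**24))      # 16777216.0 -- exactly representable
print(f32(2.0**24 + 1))  # 16777216.0 -- ties round to the even significand
print(f32(2.0**24 + 2))  # 16777218.0 -- float spacing here is 2
```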

(Terrible Trivia You've Just Learned: In C#, is there any type T such that casting from T to T is not guaranteed to be a no-op? Yup. Float. In the above sample, on my machine, casting a float expression to float removes the extra precision.)

In terms of my wrong code for solving the integer/float comparison problem, extended precision means the round-trip might succeed more often than it should. Even if the integer can't be represented as a float, there's no guarantee extended precision won't be used (in the future or on other machines or at different optimization levels) to allow the round-trip to preserve all integers.

For those keeping track: that's a total of two time bombs accidentally written into the same two-line function.

Actually, this whole extended precision thing calls into question the problem itself.

### Wrong #3: Assuming Floats Have a Consistent Value

The truly terrifying thing about "bonus" precision is that you don't control it. It can be available for one comparison, but not the next, creating some truly nasty bugs.

Consider the following code:

```csharp
if (someFloat > 1) Print("More");
if (someFloat <= 1) Print("Not More");
```

You might be surprised to find out that it's possible for the above code to print "More" and "Not More" at the same time. If someFloat has extra precision and is barely larger than 1, then it passes the 'greater than 1' check. If it then loses that precision, perhaps due to the call to Print needing floating point registers, then it will be rounded to 1 in time to also pass the 'less than or equal to 1' check.

Basically, any code that compares a float to the same value twice can be broken by a sufficiently evil compiler, even a compliant-to-spec one. If that doesn't immediately terrify you, maybe a practical example will:

```csharp
Vector2 Normal(Vector2 v) {
    var length = Math.Sqrt(v.x*v.x + v.y*v.y);
    if (length == 0) return new Vector2(0, 0);
    return new Vector2(v.x / length, v.y / length);
}
```

The above is a typical function to get the normal of a vector. Similar code appears in almost every geometry library ever. It's not really "wrong", exactly, but bad things happen if you look at it through the eyes of a sufficiently evil compiler.

You might assume that, given an input vector with real coordinates (not infinity or NaN), this function is guaranteed to return a vector with real coordinates with a length approximately equal to 1 or 0. Except, if you take the normal of a vector near zero then the extra precision might be available for the length == 0 check but not for the divisions, resulting in non-real outputs. So much for the assumed "real preserving" invariant. Ugh.
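The fragility of that `length == 0` boundary can be seen even without an evil compiler. Here is a Python simulation (names are mine) where every intermediate is rounded to float32 via `struct`; it shows the underflow side of the same boundary rather than the extended-precision mechanism, since Python's rounding here is deterministic:

```python
import math
import struct

def f32(x):
    # round a Python float (a double) to the nearest IEEE single-precision value
    return struct.unpack('f', struct.pack('f', x))[0]

def normal_f32(x, y):
    # single-precision version of Normal: every intermediate rounded to float32
    length = f32(math.sqrt(f32(f32(x) * f32(x)) + f32(f32(y) * f32(y))))
    if length == 0.0:
        return (0.0, 0.0)
    return (f32(x / length), f32(y / length))

print(normal_f32(3.0, 4.0))    # roughly (0.6, 0.8)
print(normal_f32(1e-23, 0.0))  # (0.0, 0.0): the squares underflow to zero
```

A nonzero input vector lands exactly on the zero-vector branch because its squared components underflow; widen or shrink the available precision and inputs near that boundary flip branches.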

Using an epsilon check (length &lt; ε instead of length == 0) improves the situation a lot, by preventing the evil compiler from rounding to zero for the division. On the other hand, it increases the size of the boundary where the compiler might flip the result from a unit vector to the zero vector or from the zero vector to a unit vector.

What I'm getting at is that floats flout the idea of being comparable. The idea that we can write a function that tells if an integer is equal to a float is, frankly, a little bit misguided. We'll either get false positives, when precision is lost after the check, or false negatives, when precision is temporarily lost (or purposefully removed) during the check.

So, in a sense, even the function specification I was implementing was wrong. No code can implement it perfectly. Compilers just don't treat floats that way.

### A Corrected Solution

To correct the code I gave above, the three wrongs I listed (rounding, extra precision, impossible spec) need to be addressed.

The rounding issue is the easiest to fix. It only happens in one case (numbers near int.MaxValue), and can be caught with an explicit check. The other two issues will be fixed by slightly altering the spec.

To make the function independent of the whims of the compiler, I will avoid comparing against the overprecise-by-unknown-amount runtime value of the float. Instead, the function will compare against the nearest representable IEEE single precision float (i.e. what you'd naively expect the function to do, before learning about extended precision). You can get that value, in many languages, by casting to a volatile float or storing in a volatile float field.

Applying both of these fixes (while still trying to avoid risky boundaries) gives a corrected solution in C#:

```csharp
private static volatile float staticVolatileFloat;
private static float RemoveExtendedPrecision(float f) {
    staticVolatileFloat = f;
    return staticVolatileFloat;
}

public static bool EqualAfterRemovingExtendedPrecision(int i, float f) {
    f = RemoveExtendedPrecision(f);
    var iToF = RemoveExtendedPrecision((float)i);

    if (iToF == RemoveExtendedPrecision((float)int.MaxValue))
        return false; // rounded out of range, can't be represented
    if ((int)iToF != i)
        return false; // rounded to a different value, can't be represented

    // can be represented faithfully, naive equality check will work
    return iToF == f;
}
```

(To be honest, I'm drawing a blank trying to think of natural uses for a function like EqualAfterRemovingExtendedPrecision.)

I tested that the function corresponds to what happens if you compare after casting both inputs to double, so hopefully it's correct now and forever. Maybe someone who knows even more about floating point horrors will correct that optimism.
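That cross-check can be reproduced in Python (names are mine, not from the original). Python floats are doubles, so plain `float(i) == f` serves as the compare-after-casting-to-double oracle, and `struct`-based rounding plays the role of the volatile store:

```python
import random
import struct

def f32(x):
    # round a Python float (a double) to the nearest IEEE single-precision value
    return struct.unpack('f', struct.pack('f', x))[0]

def equals_int_float(i, f):
    # Python port of the corrected C#; f32 plays the role of RemoveExtendedPrecision
    i_to_f = f32(float(i))
    if i_to_f == f32(float(2**31 - 1)):
        return False  # rounded out of int32 range
    if int(i_to_f) != i:
        return False  # i can't be represented as a float
    return i_to_f == f

def oracle(i, f):
    # Python floats are doubles: compare after widening both sides to double
    return float(i) == f

random.seed(0)
for _ in range(100_000):
    i = random.randint(-2**31, 2**31 - 1)
    f = f32(float(i + random.choice([-1, 0, 1])))
    assert equals_int_float(i, f) == oracle(i, f), (i, f)
print("matches the double-cast oracle on all sampled cases")
```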

Update: If a different valid IEEE rounding rule is being used, (float)int.MaxValue might be rounded down instead of up. As a result, the "corrected" code would wrongly categorize the largest integer representable as a float as being not representable and create false negatives. Pessimism increased.

### Summary

Floats are terrifying. Maybe not as terrifying as Unicode, but pretty darn terrifying.

Remember: floats are fundamentally imprecise approximations of continuous quantities. Not only because of their representation, but because of how they are treated by compilers and CPUs. As a result, trying to force a precise discrete result out of a float (by rounding or doing a comparison) is just inherently risky. Prepare to compromise.
