softdouble.c
floating point numbers are a useful way to store an approximation of real numbers inside a finite computer memory — often inside single registers. they support a large dynamic range and precision relative to the magnitude of the number.
floating point numbers come in many sizes, defined by the widths of their components in bits. IEEE 754 defines two standard floating point formats that we'll care about here: "single precision", with 32 bits of data, commonly referred to as a "float", and "double precision", with 64 bits of data, referred to as a "double".
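as a rough sketch of how those widths break down, the snippet below asks the C implementation for them. the helper names are hypothetical (they are not from softdouble.c), and it assumes the platform's float and double are the IEEE 754 binary formats with radix 2, which is common but not guaranteed by the C standard.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

/* hypothetical helper, not from softdouble.c.
   FLT_MANT_DIG / DBL_MANT_DIG count the significand bits *including* the
   implicit leading 1, so the stored layout is
   1 sign bit + exponent bits + (MANT_DIG - 1) mantissa bits. */
static int exponent_bits(int total_bits, int mant_dig) {
    assert(total_bits > mant_dig);
    return total_bits - 1 - (mant_dig - 1);
}

/* exponent width of the platform's float (8 on IEEE 754 platforms) */
static int float_exponent_bits(void) {
    return exponent_bits((int)(sizeof(float) * CHAR_BIT), FLT_MANT_DIG);
}

/* exponent width of the platform's double (11 on IEEE 754 platforms) */
static int double_exponent_bits(void) {
    return exponent_bits((int)(sizeof(double) * CHAR_BIT), DBL_MANT_DIG);
}
```

on such a platform, a float stores 23 mantissa bits and 8 exponent bits, and a double stores 52 mantissa bits and 11 exponent bits, each plus one sign bit.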
in microcontrollers and other low-resource hardware, a floating point unit (FPU) isn't always available on board.
for this reason, a lot of compilers support software floats (sometimes shortened to softfloats), for example through gcc's -msoft-float option.
here, instead of using hardware instructions to perform operations on floating point numbers, the operations are implemented, or "emulated", in software, representing floating point numbers using equally sized unsigned numbers.
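to make "equally sized unsigned numbers" a bit more concrete, here is a minimal sketch of moving a float's raw bits into a uint32_t and back. the function names are made up for illustration and are not from softdouble.c, and the sketch assumes float is a 32-bit IEEE 754 single.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* hypothetical helpers, not from softdouble.c */

/* copy the raw bits of a float into a same-sized unsigned integer.
   memcpy avoids the undefined behaviour of casting between pointer types. */
static uint32_t float_to_bits(float f) {
    uint32_t bits;
    assert(sizeof bits == sizeof f); /* assumption: float is 32 bits wide */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* copy the bits back into a float */
static float bits_to_float(uint32_t bits) {
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

a software float library works entirely on the integer side of this conversion, using shifts, masks and integer arithmetic in place of floating point instructions.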
in this post, we'll explore my attempt at solving a slightly more contrived problem — what if we have hardware processors which support single-precision numbers (floats), but not double-precision numbers (doubles)? can we make software doubles, which emulate double-precision numbers using only hardware capable of operating on single-precision numbers?
i implemented this code for code guessing #14. the submission on the site had all its functions renamed and obfuscated for extra obscurity, but here we'll be looking at the original version. if you want to take a look at the code without my comments on it, the original file is available at this link.
background — what are floats
because we'll be working with binary strings a lot in this article, we'll use the notation $(x)_2$ to mean the binary number $x$. for example, $(11)_2$ would represent the number 3, and $(1.101)_2$ would represent 1.625.
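binary fractions work like decimal ones, except each digit after the point is worth half of the one before it. a small sketch that evaluates such a string (the function name is hypothetical, and the input is assumed to contain only '0', '1' and at most one '.'):

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical helper: evaluate a string of binary digits with an
   optional '.' as a number. no input validation is done beyond the
   NULL check; other characters give meaningless results. */
static double binary_value(const char *s) {
    double value = 0.0;
    double place = 0.5; /* weight of the next digit after the point */
    int seen_point = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s == '.') {
            seen_point = 1;
            continue;
        }
        int digit = *s - '0';
        if (!seen_point) {
            value = value * 2.0 + digit; /* integer part: shift in a digit */
        } else {
            value += digit * place;      /* fractional part */
            place /= 2.0;
        }
    }
    return value;
}
```

with this, binary_value("11") gives 3 and binary_value("1.101") gives 1 + 0.5 + 0.125 = 1.625, matching the two examples above.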
we'll also use the notation to notate binary sequences. for example, if , then we have
we represent a number using a "mantissa" $m$ of $M$ bits, an "exponent" $e$ of $E$ bits and a "sign bit" $s$ which is either $+1$ (represented as a zero bit) or $-1$ (represented as a one bit):

$$x = s \times (1.m_1 m_2 \ldots m_M)_2 \times 2^{e - (2^{E-1} - 1)}$$

reading from right to left, we first use the exponent to determine the order of magnitude of the number. the subtraction by $2^{E-1} - 1$ subtracts a number about halfway in the range of $e$, which is $[0, 2^E - 1]$, giving us a roughly symmetrical range around 0.

we then represent the significand using the mantissa of the number. we choose to set the first bit to 1 because if it were 0 we could simply decrease the exponent by one and shift the mantissa to the left, giving us multiple ways of representing the same number. we note that the significand here will always be in the range $[1, 2)$ (half open range).

lastly, we multiply by the sign to make the number positive or negative.
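the description above can be sketched in code by pulling a hardware float apart and rebuilding its value from the three fields. this is only an illustration: the names are hypothetical (not from softdouble.c), it hardcodes the IEEE 754 single precision widths (23 mantissa bits, 8 exponent bits, so the subtracted offset is 127), and it only handles normal numbers, not zeros, subnormals, infinities or NaNs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* hypothetical types and helpers, not from softdouble.c */
typedef struct {
    uint32_t sign;     /* 0 for positive, 1 for negative */
    uint32_t exponent; /* stored exponent, 8 bits */
    uint32_t mantissa; /* stored mantissa bits, 23 bits */
} float_fields;

/* split a float's bits into sign, exponent and mantissa fields */
static float_fields split_float(float f) {
    uint32_t bits;
    float_fields out;
    assert(sizeof bits == sizeof f); /* assumption: 32-bit float */
    memcpy(&bits, &f, sizeof bits);
    out.sign = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.mantissa = bits & 0x7FFFFFu;
    return out;
}

/* 2^e by repeated multiplication or division, to avoid needing libm */
static double pow2(int e) {
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* rebuild the value: sign * significand * 2^(exponent - 127) */
static double fields_to_value(float_fields x) {
    assert(x.exponent != 0 && x.exponent != 0xFFu); /* normal numbers only */
    double significand = 1.0 + (double)x.mantissa / 8388608.0; /* 2^23 */
    double sign = x.sign ? -1.0 : 1.0;
    return sign * significand * pow2((int)x.exponent - 127);
}
```

for example, -6.5 is $-(1.101)_2 \times 2^2$, so split_float(-6.5f) yields a sign bit of 1, a stored exponent of 2 + 127 = 129, and a mantissa whose top three bits are 101.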
as an example, say we wanted to represent the number 6.21 in this format. we'll fix the number of mantissa bits to and the number of exponent bits to
we'll first rewrite the number as $a \times 2^b$, where the significand $a$ is between 1 and 2. in this case, we see $6.21 = 1.5525 \times 2^2$.
we write 1.5525 in binary as , and the exponent . from this, we see that and . lastly, we have that the number is positive, so the sign bit
#include <stdio.h>
int main(void) { printf("hello world\n"); return 0; }