Definition
Floating-Point Number Representation
Floating-point representation is a method of representing decimal numbers (numbers with a fractional part) on a computer.
Exponent Notation
The foundation for understanding floating-point numbers is exponent notation. A decimal number can be written as follows:
$$x = m \cdot b^{e}$$
where:
- $x$ is the decimal number to be represented
- $m$ is the mantissa
- $e$ is the exponent
- $b$ is the base
Clearly, this notation does not have a single unique representation for all decimal numbers. For example: . Thus, the notation must be normalised in order to have a single valid representation in the memory.
There must be exactly one digit directly before the comma, and that digit must not be 0. Now, the representation of the above example is unambiguous:
Remark
Notice that we postulated that the pre-comma digit of the mantissa must not be zero. In the binary system, the only other possible digit is 1. Because this leading 1 is always present, it does not have to be stored (the implicit bit), which saves one bit.
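The normalisation rule can be tried out in a few lines of Python. This is a minimal sketch: the helper name `normalise_binary` and the value 6.5 are chosen here for illustration. It relies on `math.frexp`, which returns a mantissa in the range 0.5 to 1, and shifts it by one position so that exactly one non-zero digit precedes the comma.

```python
import math

def normalise_binary(x: float) -> tuple[float, int]:
    """Return (m, e) with x == m * 2**e and 1 <= |m| < 2."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("only non-zero finite values can be normalised")
    m, e = math.frexp(x)   # frexp yields 0.5 <= |m| < 1
    return m * 2, e - 1    # shift so that the pre-comma digit is 1

m, e = normalise_binary(6.5)   # 6.5 is 110.1 in binary
print(m, e)                    # 1.625 2, i.e. 1.101 (binary) * 2**2
```

Since the pre-comma digit of every normalised binary mantissa is 1, only the digits after the comma (here `101`) carry information.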
Layout
The first bit denotes the sign of the represented number, where:
- 0 means positive, and
- 1 means negative.
The mantissa is represented as a fixed-point number with exactly one pre-comma digit.
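To make the layout concrete, the fields of a stored number can be extracted with bit operations. The sketch below assumes the IEEE 754 single-precision widths (1 sign bit, 8 exponent bits, 23 mantissa bits), which this section does not spell out; the function name `float32_fields` is hypothetical.

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, exponent_bits, mantissa_bits) of x stored in single precision."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31                # first bit: 0 positive, 1 negative
    exponent = (word >> 23) & 0xFF   # next 8 bits, stored with the excess added
    mantissa = word & 0x7FFFFF       # last 23 bits, without the implicit leading 1
    return sign, exponent, mantissa

print(float32_fields(-6.5))   # (1, 129, 5242880)
```

For -6.5, which is -1.101 (binary) times 2 to the power 2, the sign bit is 1, the exponent field holds 2 + 127 = 129, and the mantissa field holds the bits `101` followed by zeros.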
IEEE754
The following floating-point number system can represent numbers between .
The excess is relative to the smallest representable number:
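For illustration, the excess can be computed from the width of the exponent field. The sketch below assumes the usual IEEE 754 convention that a k-bit exponent field uses the excess 2^(k-1) - 1, which gives 127 for single precision and 1023 for double precision; the function names are hypothetical.

```python
def excess(exponent_bits: int) -> int:
    """Excess (bias) for an exponent field of the given width."""
    return 2 ** (exponent_bits - 1) - 1

def stored_exponent(e: int, exponent_bits: int) -> int:
    """Exponent field of a normalised number with actual exponent e."""
    stored = e + excess(exponent_bits)
    # The all-zero and all-one fields are reserved for special values
    if not 1 <= stored <= 2 ** exponent_bits - 2:
        raise ValueError("exponent out of range for a normalised number")
    return stored

print(excess(8), excess(11))      # 127 1023
print(stored_exponent(2, 8))      # 129
print(stored_exponent(-126, 8))   # 1
```

Adding the excess maps the smallest exponent of a normalised number to the stored value 1, so the exponent field never needs a sign of its own.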
IEEE754 Conventions
Some values have fixed representations:
- Zero:
  - Sign: 0 or 1
  - Exponent Bits: all set to 0
  - Mantissa Bits: all set to 0
- NaN:
  - Sign: 0 or 1
  - Exponent Bits: all set to 1
  - Mantissa Bits: at least one bit is not 0
- Plus Infinity:
  - Sign: 0
  - Exponent Bits: all set to 1
  - Mantissa Bits: all set to 0
- Minus Infinity:
  - Sign: 1
  - Exponent Bits: all set to 1
  - Mantissa Bits: all set to 0
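The conventions above translate directly into a small classifier over the bit pattern. This is a sketch for single precision only (8 exponent bits, 23 mantissa bits); `classify_float32` and `bits` are hypothetical helper names.

```python
import struct

def classify_float32(word: int) -> str:
    """Classify a 32-bit pattern according to the conventions listed above."""
    sign = word >> 31
    exponent = (word >> 23) & 0xFF
    mantissa = word & 0x7FFFFF
    if exponent == 0 and mantissa == 0:
        return "zero"
    if exponent == 0xFF:             # all exponent bits set to 1
        if mantissa != 0:
            return "NaN"
        return "minus infinity" if sign else "plus infinity"
    return "other"                   # any pattern not covered by the list

def bits(x: float) -> int:
    """Return the single-precision bit pattern of x as an integer."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word

print(classify_float32(bits(-0.0)))           # zero
print(classify_float32(bits(float("inf"))))   # plus infinity
print(classify_float32(0xFF800000))           # minus infinity
print(classify_float32(0x7FC00000))           # NaN
```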
Conversion
Converting to Normalised Numbers
Example: Converting into IEEE754 Single Precision:
- Step 1: Convert into the binary number system:
- Step 2: Normalise:
- Step 3: Offset the exponent with the excess:
- Step 4: Layout:
The implicit bit (the leading 1, red) can be omitted since it will always be 1. Thus, the layout is:
The result is a 32-bit word. The most significant bit (cyan) is the sign bit; here it signals that the number is negative.
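The conversion steps can also be followed in code. The sketch below is one possible implementation, not the only one: the function name `to_float32_bits` and the input -6.5 are chosen for illustration and are not necessarily the example used above. It handles only non-zero values in the normalised single-precision range, and it truncates mantissa bits that do not fit instead of rounding them.

```python
import struct
from fractions import Fraction

def to_float32_bits(x: float) -> str:
    """Return 'sign exponent mantissa' bit strings for a normalised value."""
    if x == 0:
        raise ValueError("zero has a fixed representation and is not normalised")
    sign = 1 if x < 0 else 0
    m = Fraction(abs(x))          # exact binary value of |x|
    # Convert and normalise: shift the binary comma until 1 <= m < 2
    e = 0
    while m >= 2:
        m /= 2
        e += 1
    while m < 1:
        m *= 2
        e -= 1
    # Offset the exponent with the excess 127
    stored = e + 127
    if not 1 <= stored <= 254:
        raise ValueError("outside the normalised single-precision range")
    # Layout: drop the implicit leading 1, keep 23 bits (extra bits are truncated)
    fraction = int((m - 1) * 2**23)
    return f"{sign:01b} {stored:08b} {fraction:023b}"

print(to_float32_bits(-6.5))
# 1 10000001 10100000000000000000000

# Cross-check against the encoding produced by the struct module
(word,) = struct.unpack(">I", struct.pack(">f", -6.5))
print(f"{word:032b}")
# 11000000110100000000000000000000
```

The first bit of the output is 1 because the input is negative, the exponent field 10000001 is 129 = 2 + 127, and the mantissa field starts with the bits `101` that follow the implicit leading 1.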
Rounding
Rounding is important in numerical computing, since computers can only store a finite number of bits and therefore offer limited precision. There are multiple methods of rounding floating-point numbers.
Truncate
Truncating means removing part or all of the fractional part of a number.
Example: Truncating to 2 decimal digits is .
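A minimal sketch of truncation to a given number of decimal digits, using Python's `decimal` module so that the decimal digits are handled exactly. The function name `truncate` and the sample values are chosen here for illustration; values are passed as strings to avoid binary floating-point error on input.

```python
from decimal import Decimal, ROUND_DOWN

def truncate(value: str, digits: int) -> Decimal:
    """Drop all decimal digits beyond `digits` without rounding."""
    step = Decimal(1).scaleb(-digits)   # e.g. 0.01 for digits=2
    return Decimal(value).quantize(step, rounding=ROUND_DOWN)

print(truncate("3.14159", 2))   # 3.14
print(truncate("-2.789", 2))    # -2.78
print(truncate("2.999", 0))     # 2
```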
Directed Rounding
Directed rounding refers to rounding a number in a specific direction:
- Round Towards Zero: Always rounds the number closer to zero.
- Round Away from Zero: Always rounds the number farther from zero.
Examples:
- Rounding Towards Zero:
- Round Away from Zero:
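Both directions can be sketched for rounding to an integer with the standard `math` module. The function names and the sample values 2.3 and -2.3 are chosen here for illustration.

```python
import math

def round_towards_zero(x: float) -> int:
    """Discard the fractional part, so the result is never farther from zero."""
    return math.trunc(x)

def round_away_from_zero(x: float) -> int:
    """Move to the next integer away from zero unless x is already an integer."""
    return math.ceil(x) if x > 0 else math.floor(x)

for x in (2.3, -2.3):
    print(x, round_towards_zero(x), round_away_from_zero(x))
# 2.3 2 3
# -2.3 -2 -3
```

Note that both functions treat positive and negative inputs symmetrically, which distinguishes them from always rounding up or always rounding down.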
Round to Nearest
Rounding to the nearest refers to rounding a number to the closest value at a specified level of precision, such as the nearest integer, tenth, hundredth, or any other place value.
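The description above leaves open what happens when a number lies exactly halfway between two candidates. The sketch below assumes ties are resolved by rounding half to even, which is the default rounding mode of IEEE 754 binary arithmetic, and shows rounding half up for comparison. The function name `round_nearest` is hypothetical; values are passed as strings so that the decimal digits are exact.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_nearest(value: str, digits: int, ties: str = ROUND_HALF_EVEN) -> Decimal:
    """Round to the closest value with `digits` decimal places."""
    step = Decimal(1).scaleb(-digits)   # e.g. 0.01 for digits=2
    return Decimal(value).quantize(step, rounding=ties)

print(round_nearest("3.14159", 3))                 # 3.142
print(round_nearest("2.675", 2))                   # 2.68 (tie, 8 is even)
print(round_nearest("2.665", 2))                   # 2.66 (tie, 6 is even)
print(round_nearest("2.665", 2, ROUND_HALF_UP))    # 2.67
```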