numeric

Definition

Floating-Point Number Representation

The floating-point number representation is a method of representing decimal numbers (numbers with a fractional part) on a computer.

Exponent Notation

The foundation for understanding floating-point numbers is the exponent notation. Decimal numbers can be represented as follows:

$$x = m \cdot b^{e}$$

where:

  • $x$ is the decimal number to be represented
  • $m$ is the mantissa
  • $e$ is the exponent
  • $b$ is the base

Clearly, this notation is not unique: the same number can be written with many different pairs of mantissa and exponent. For example: $6.75 = 6.75 \cdot 10^{0} = 0.675 \cdot 10^{1} = 67.5 \cdot 10^{-1}$. Thus, the notation must be normalised in order to have a single valid representation in memory.

There must be exactly one digit directly before the comma, and that digit must not be 0. With this rule, the representation of the above example is unambiguous: $6.75 = 6.75 \cdot 10^{0}$.
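
The normalisation rule can be sketched for the binary system as follows. This is only an illustration: the helper name `normalise_binary` is hypothetical, and it relies on Python's `math.frexp`, which returns a mantissa in the range 0.5 to 1 and therefore has to be shifted by one binary place.

```python
import math

def normalise_binary(x: float) -> tuple[float, int]:
    """Return (m, e) with x == m * 2**e and 1 <= |m| < 2."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("only finite non-zero values can be normalised")
    # math.frexp yields 0.5 <= |m| < 1, so shift one binary place to the left.
    m, e = math.frexp(x)
    return m * 2, e - 1

m, e = normalise_binary(6.75)
print(m, e)  # 1.6875 2, i.e. 6.75 = 1.6875 * 2**2
```

Zero is rejected here because it has no non-zero leading digit and therefore cannot be normalised.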

Remark

Notice that we postulated that the pre-comma digit of the mantissa must not be zero. Therefore, in the binary system, the only possible digit is 1. Because this digit is always 1, it does not have to be stored (implicit bit), which saves one bit.

Layout

The first bit denotes the sign of the represented number, where:

  • 0 means positive, and
  • 1 means negative.

The mantissa is represented as a fixed-point number with exactly one pre-comma digit.
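
To make the layout concrete, the following sketch splits a number into its sign, exponent, and mantissa fields. It assumes the IEEE754 single-precision widths (1 sign bit, 8 exponent bits, 23 mantissa bits), which this section does not state; the helper name `float32_fields` and the input value are hypothetical.

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Split x (rounded to single precision) into (sign, exponent, mantissa)."""
    # Reinterpret the 4 bytes of the big-endian float as an unsigned integer.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31               # first bit: 0 positive, 1 negative
    exponent = (word >> 23) & 0xFF  # next 8 bits: stored exponent field
    mantissa = word & 0x7FFFFF      # last 23 bits: digits after the comma
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(-6.75)
print(sign, f"{exponent:08b}", f"{mantissa:023b}")
# 1 10000001 10110000000000000000000
```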


IEEE754

The following floating-point number system can represent numbers between .

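The exponent is stored with an excess (bias), chosen relative to the smallest representable exponent so that the stored exponent field is never negative. For an exponent field of $k$ bits, IEEE754 uses $\text{excess} = 2^{k-1} - 1$: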
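
As a small sketch of how the excess works, assuming the usual IEEE754 choice of 2^(k-1) - 1 for an exponent field of k bits (the helper name `excess` is hypothetical):

```python
def excess(exponent_bits: int) -> int:
    """IEEE754 excess (bias) for an exponent field of the given width."""
    return 2 ** (exponent_bits - 1) - 1

print(excess(8))       # single precision: 127
print(excess(11))      # double precision: 1023
# Storing the exponent e = 2 in single precision:
print(2 + excess(8))   # 129
```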

IEEE754 Conventions

Some values have fixed representations:

  • Zero:
    • Sign: 0 or 1
    • Exponent Bits: all set to 0
    • Mantissa Bits: all set to 0
  • NaN:
    • Sign: 0 or 1 (not interpreted)
    • Exponent Bits: all set to 1
    • Mantissa Bits: at least one bit is not 0
  • Plus Infinity:
    • Sign: 0
    • Exponent Bits: all set to 1
    • Mantissa Bits: all set to 0
  • Minus Infinity:
    • Sign: 1
    • Exponent Bits: all set to 1
    • Mantissa Bits: all set to 0
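
The conventions above can be checked with a short sketch. It assumes single precision (8 exponent bits, 23 mantissa bits); the helper names `classify_float32` and `bits` are hypothetical.

```python
import struct

def classify_float32(word: int) -> str:
    """Classify a 32-bit pattern according to the conventions listed above."""
    sign = word >> 31
    exponent = (word >> 23) & 0xFF
    mantissa = word & 0x7FFFFF
    if exponent == 0 and mantissa == 0:
        return "zero"
    if exponent == 0xFF:
        if mantissa != 0:
            return "NaN"
        return "minus infinity" if sign else "plus infinity"
    return "other"

def bits(x: float) -> int:
    """Return the single-precision bit pattern of x as an integer."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word

for value in (0.0, -0.0, float("inf"), float("-inf"), float("nan"), 1.5):
    print(value, classify_float32(bits(value)))
```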

Conversion

Converting to Normalised Numbers

Example: Converting into IEEE754 Single Precision:

  1. Step: Convert into the binary number system:
  2. Step: Normalise:

  3. Step: Offset exponent with excess 127:

  4. Step: Layout:

The implicit bit (1, red) can be omitted since it will always be 1. Thus, the layout is:

The result is a 32-bit word. The most significant bit (cyan) is the sign bit; it signals that the number is negative.
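
The four steps can be traced in code. This is a minimal sketch under stated assumptions: the input value -6.75 is hypothetical (it is not the value of the original example), the helper name `encode_float32` is made up, and the input must be finite, non-zero, within the normalised range, and exactly representable, since no rounding is performed.

```python
import struct

def encode_float32(x: float) -> str:
    """Encode a normalised value by hand; returns the 32 bits as a string."""
    sign = 1 if x < 0 else 0
    m = abs(x)  # Step 1: Python already holds the value in binary
    e = 0
    # Step 2: normalise so that 1 <= m < 2.
    while m >= 2:
        m /= 2
        e += 1
    while m < 1:
        m *= 2
        e -= 1
    # Step 3: offset the exponent with the excess 127.
    stored_exponent = e + 127
    # Step 4: drop the implicit leading 1 and keep 23 bits after the comma.
    fraction = int((m - 1) * 2**23)
    return f"{sign:01b}{stored_exponent:08b}{fraction:023b}"

value = -6.75  # hypothetical input
bits = encode_float32(value)
print(bits[0], bits[1:9], bits[9:])
# 1 10000001 10110000000000000000000

# Cross-check against Python's own single-precision packing.
(word,) = struct.unpack(">I", struct.pack(">f", value))
assert bits == f"{word:032b}"
```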


Rounding

Rounding is important in numerics since computers can only store a certain number of bits, which limits precision. There are multiple methods of rounding floating-point numbers.

Truncate

Truncating means removing a part of or the whole fractional part from a number.

Example: Truncating $3.14159$ to 2 decimal digits yields $3.14$.
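
Truncation to a given number of decimal digits can be sketched with Python's `decimal` module. The helper name `truncate` is hypothetical. Values are passed as strings on purpose: a binary float such as 0.29 is stored slightly below 0.29, so truncating the float directly could yield 0.28.

```python
from decimal import Decimal, ROUND_DOWN

def truncate(value: str, digits: int) -> Decimal:
    """Cut off everything after `digits` decimal places (no rounding)."""
    step = Decimal(10) ** -digits  # e.g. digits=2 -> Decimal("0.01")
    return Decimal(value).quantize(step, rounding=ROUND_DOWN)

print(truncate("3.14159", 2))  # 3.14
print(truncate("-2.789", 2))   # -2.78
print(truncate("0.29", 2))     # 0.29
```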

Directed Rounding

Directed rounding refers to rounding a number in a specific direction:

  1. Round Towards Zero: Always rounds the number closer to zero.
  2. Round Away from Zero: Always rounds the number farther from zero.

Examples:

  • Round Towards Zero: $2.7 \to 2$ and $-2.7 \to -2$
  • Round Away from Zero: $2.3 \to 3$ and $-2.3 \to -3$
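
Both directions can be sketched for rounding to integers. The function names are hypothetical, and the choice of integer precision is an assumption made only to keep the example short.

```python
import math

def round_towards_zero(x: float) -> int:
    """Drop the fractional part, moving the value closer to zero."""
    return math.trunc(x)

def round_away_from_zero(x: float) -> int:
    """Move to the next integer farther from zero (integers stay unchanged)."""
    return math.ceil(x) if x > 0 else math.floor(x)

for x in (2.3, -2.3, 2.7, -2.7):
    print(x, round_towards_zero(x), round_away_from_zero(x))
# 2.3 2 3
# -2.3 -2 -3
# 2.7 2 3
# -2.7 -2 -3
```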

Round to Nearest

Rounding to the nearest refers to rounding a number to the closest value at a specified level of precision, such as the nearest integer, tenth, hundredth, or any other place value.
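
Rounding to the nearest value needs a rule for ties, which the text above does not specify. The sketch below assumes "ties to even" as the default and shows "ties up" for comparison; the helper name `round_nearest` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_nearest(value: str, digits: int, ties=ROUND_HALF_EVEN) -> Decimal:
    """Round to the closest value with `digits` decimal places."""
    step = Decimal(10) ** -digits  # e.g. digits=2 -> Decimal("0.01")
    return Decimal(value).quantize(step, rounding=ties)

print(round_nearest("3.14159", 2))               # 3.14
print(round_nearest("2.718", 1))                 # 2.7
print(round_nearest("0.125", 2))                 # 0.12 (tie, goes to even digit)
print(round_nearest("0.125", 2, ROUND_HALF_UP))  # 0.13 (tie, goes up)
```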

Precision of Rounding

Arithmetic

Addition

todo

Subtraction

todo

Multiplication

todo

Division

todo