Rules for Arithmetic Operations

Fixed-point arithmetic refers to how signed or unsigned binary words are operated on. The simplicity of fixed-point arithmetic functions such as addition and subtraction allows for cost-effective hardware implementations. 

This section describes the blockset-specific rules that are followed when arithmetic operations are performed on inputs and parameters. These rules are organized into four groups based on the operations involved: addition and subtraction, multiplication, division, and shifts. For each of these four groups, the rules for performing the specified operation are presented followed by an example using the rules.

Computational Units

The core architecture of many processors contains several computational units including arithmetic logic units (ALUs), multiply and accumulate units (MACs), and shifters. These computational units process the binary data directly and provide support for arithmetic computations of varying precision. The ALU performs a standard set of arithmetic and logic operations as well as division. The MAC performs multiply, multiply/add, and multiply/subtract operations. The shifter performs logical and arithmetic shifts, normalization, denormalization, and other operations.

Addition and Subtraction

Addition is the most common arithmetic operation a processor performs. When two n-bit numbers are added together, it is always possible to produce a result with n + 1 nonzero digits due to a carry from the leftmost digit. For two's complement addition of two numbers, there are three cases to consider:

  • If both numbers are positive and the result of their addition has a sign bit of 1, then overflow has occurred; otherwise the result is correct.

  • If both numbers are negative and the sign of the result is 0, then overflow has occurred; otherwise the result is correct.

  • If the numbers are of unlike sign, overflow cannot occur and the result is always correct.
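The three cases above can be checked with a short sketch. The function name add_twos_complement and the 8-bit default word size are assumptions made for illustration; they are not part of the blockset.

```python
def add_twos_complement(a, b, n=8):
    """Add two n-bit two's complement values.

    Returns (wrapped_result, overflowed). Illustrative helper only.
    """
    mask = (1 << n) - 1
    sign_bit = 1 << (n - 1)
    raw = (a + b) & mask                      # keep only the low n bits
    # Reinterpret the n-bit pattern as a signed value.
    result = raw - (1 << n) if raw & sign_bit else raw
    same_sign = (a < 0) == (b < 0)
    # Overflow is only possible when the inputs share a sign and the
    # sign of the result differs from it.
    overflowed = same_sign and ((result < 0) != (a < 0))
    return result, overflowed
```

For example, adding 100 and 28 in 8 bits sets the sign bit of the result, so the sketch reports overflow, while adding -100 and 100 can never overflow.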

Fixed-Point Blockset Summation Process

Consider the summation of two numbers. Ideally, the real-world values obey the equation

Va = Vb + Vc

where Vb and Vc are the input values and Va is the output value. To see how the summation is actually implemented, the three ideal values should be replaced by the general slope/bias encoding scheme described in Scaling.

The solution of the resulting equation for the stored integer, Qa, is given by the equation in Addition. Using shorthand notation, that equation becomes

Qa = Fsb 2^(Eb - Ea) Qb + Fsc 2^(Ec - Ea) Qc + Bnet

where Fsb and Fsc are the adjusted fractional slopes and Bnet is the net bias. The offline conversions and the online conversions and operations are discussed below.

Offline Conversions.   Fsb, Fsc, and Bnet are computed offline using round-to-nearest and saturation. Furthermore, Bnet is stored using the output data type.

Online Conversions and Operations.   The remaining operations are performed online by the fixed-point processor, and depend on the slopes and biases for the input and output data types. The worst (most inefficient) case occurs when the slopes and biases are mismatched. The worst-case conversions and operations are given by these steps:

  1. The initial value for Qa is given by the net bias, Bnet.

     

  2. The first input integer value, Qb, is multiplied by the adjusted slope, Fsb.

     

  3. The previous product is converted to the modified output data type where the slope is one and the bias is zero. This conversion includes any necessary bit shifting, rounding, or overflow handling.

     

  4. The summation operation is performed, which includes any necessary overflow handling.

     

  5. Steps 2 to 4 are repeated for every number to be summed. 

It is important to note that bit shifting, rounding, and overflow handling are applied to the intermediate steps (3 and 4) and not to the overall sum.
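The worst-case steps can be sketched as a loop. This is a simplified illustration, not the blockset's implementation: the names saturate and worst_case_sum are hypothetical, the adjusted slope is held as a floating-point number for readability, round-to-floor is assumed for the conversion, and saturation is assumed for overflow handling. Each term carries its stored integer, its adjusted slope, and the difference between its exponent and the output exponent.

```python
import math

def saturate(q, lo, hi):
    """Clamp a stored integer to the representable range [lo, hi]."""
    return max(lo, min(hi, q))

def worst_case_sum(terms, b_net, lo, hi):
    """terms: iterable of (q_in, f_s, shift), where shift = E_in - E_out."""
    q_a = b_net                                    # step 1: start from the net bias
    for q_in, f_s, shift in terms:                 # step 5: repeat for every input
        q_raw = f_s * q_in                         # step 2: apply the adjusted slope
        # Step 3: convert to the output scaling (shift, round to floor, saturate).
        q_temp = saturate(math.floor(q_raw * 2.0 ** shift), lo, hi)
        # Step 4: accumulate, handling overflow after each addition.
        q_a = saturate(q_a + q_temp, lo, hi)
    return q_a
```

Because rounding and saturation happen inside the loop, the final value can differ from the result of rounding the exact sum once at the end.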

Streamlining Simulations and Generated Code

If the scaling of the input and output signals is matched, the number of summation operations is reduced from the worst (most inefficient) case. For example, when an input has the same fractional slope as the output, step 2 reduces to multiplication by one and can be eliminated. Trivial steps in the summation process are eliminated for both simulation and code generation. Exclusive use of radix point-only scaling for both input signals and output signals is a common way to eliminate the occurrence of mismatched slopes and biases, and results in the most efficient simulations and generated code.

Example: The Summation Process

Suppose you want to sum three numbers. Each of these numbers is represented by an 8-bit word, and each has a different radix point-only scaling. Additionally, the output is restricted to an 8-bit word with radix point-only scaling of 2^-3.

The summation is shown below for the input values 19.875, 5.4375, and 4.84375.

Applying the rules from the previous section, the sum follows these steps:

  1. Since the biases are matched, the initial value of Qa is trivial.

     

  2. The first number to be summed (19.875) has a fractional slope that matches the output fractional slope. Furthermore, the radix points and storage types are identical so the conversion is trivial.

     

  3. The summation operation is performed.

     

  4. The second number to be summed (5.4375) has a fractional slope that matches the output fractional slope, so a slope adjustment is not needed. The storage data types also match but the difference in radix points requires that both the bits and the radix point be shifted one place to the right. Note that a loss in precision of one bit occurs, with the resulting value of QTemp determined by the rounding mode. For this example, round-to-floor is used. Overflow cannot occur in this case since the bits and radix point are both shifted to the right.

     

  5. The summation operation is performed. Note that overflow did not occur, but it is possible for this operation.

     

  6. The third number to be summed (4.84375) has a fractional slope that matches the output fractional slope, so a slope adjustment is not needed. The storage data types also match but the difference in radix points requires that both the bits and the radix point be shifted two places to the right. Note that a loss in precision of two bits occurs, with the resulting value of QTemp determined by the rounding mode. For this example, round-to-floor is used. Overflow cannot occur in this case since the bits and radix point are both shifted to the right.

     

  7. The summation operation is performed. Note that overflow did not occur, but it is possible for this operation.

As shown below, the result of step 7 differs from the ideal sum.
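The example can be reproduced numerically with the sketch below. It assumes unsigned 8-bit words and input scalings of 2^-3, 2^-4, and 2^-5, which are inferred from the one-place and two-place shifts described in steps 4 and 6. The helper name to_stored is hypothetical.

```python
OUT_FRAC_BITS = 3          # output scaling of 2^-3
WORD_MAX = 255             # largest unsigned 8-bit stored integer

def to_stored(value, frac_bits):
    """Stored integer for a value that is exactly representable."""
    return int(value * (1 << frac_bits))

# (real-world value, number of fractional bits) for each input
inputs = [(19.875, 3), (5.4375, 4), (4.84375, 5)]

q_a = 0                                        # biases match, so start at zero
for value, frac_bits in inputs:
    q_in = to_stored(value, frac_bits)
    # Shift bits and radix point right; >> discards bits (round to floor).
    q_temp = q_in >> (frac_bits - OUT_FRAC_BITS)
    q_a += q_temp
    if q_a > WORD_MAX:
        raise OverflowError("sum does not fit in the output word")

result = q_a / (1 << OUT_FRAC_BITS)            # fixed-point result as a real value
ideal = sum(value for value, _ in inputs)      # exact sum for comparison
```

Under these assumptions the fixed-point result is 30.0 while the ideal sum is 30.15625. The difference comes from the bits discarded in steps 4 and 6.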

Blocks that perform addition and subtraction include the FixPt Sum, FixPt Matrix Gain, and FixPt FIR blocks. 

Multiplication

The multiplication of an n-bit binary number with an m-bit binary number results in a product that is up to m + n bits in length for both signed and unsigned words. Most processors perform n-bit by n-bit multiplication and produce a 2n-bit result (double bits) assuming there is no overflow condition.

For example, the Texas Instruments TMS320C2x family of processors performs two's complement 16-bit by 16-bit multiplication and produces a 32-bit (double bit) result.
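The width claim can be checked for the 16-bit case with a small sketch. The helper fits_signed and the variable names are illustrative only.

```python
def fits_signed(value, bits):
    """True if value is representable in a two's complement word of this size."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

n = 16
most_negative = -(1 << (n - 1))            # -32768
largest_unsigned = (1 << n) - 1            # 65535

# Extreme products for signed and unsigned 16-bit operands.
signed_extreme = most_negative * most_negative
unsigned_extreme = largest_unsigned * largest_unsigned
```

The extreme signed product fits in 32 bits but not in 31, and the extreme unsigned product needs exactly 32 bits, so a 2n-bit result register is sufficient for either case.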

Fixed-Point Blockset Multiplication Process

Consider the multiplication of two numbers. Ideally, the real-world values obey the equation

Va = Vb Vc

where Vb and Vc are the input values and Va is the output value. To see how the multiplication is actually implemented, the three ideal values should be replaced by the general slope/bias encoding scheme described in Scaling.

The solution of the resulting equation for the output stored integer, Qa, is given below.

The worst-case implementation of this equation occurs when the slopes and biases of the input and output signals are mismatched. This worst-case implementation is permitted in simulation but is not always permitted for code generation since it often requires more resources than is considered practical for an embedded system. For code generation and bit-true simulations, the biases must be zero and the fractional slopes must match for most blocks. When these requirements are met, the implementation reduces to

The bit-true implementation of this equation is discussed below.

Offline Conversions.  As shown in the previous section, no offline conversions are performed.

Online Conversions and Operations.  The online conversions and operations for matched slopes and biases of zero are given by these steps:

  1. The integer values, Qb and Qc, are multiplied together. To maintain the full precision of the product, the radix point of QRawProduct is given by the sum of the radix points of Qb and Qc.

     

  2. The previous product is converted to the output data type. This conversion includes any necessary bit shifting, rounding, or overflow handling. Conversions are discussed in Signal Conversions.

     

  3. Steps 1 and 2 are repeated for each additional number to be multiplied. 

Example: The Multiplication Process

Suppose you want to multiply three numbers. Each of these numbers is represented by a 5-bit word, and each has a different radix point-only scaling. Additionally, the output is restricted to a 10-bit word with radix point-only scaling of 2^-4. The multiplication is shown below for the input values 5.75, 2.375, and 1.8125.

Applying the rules from the previous section, the multiplication follows these steps:

  1. The first two numbers (5.75 and 2.375) are multiplied. Note that the radix point of the product is given by the sum of the radix points of the multiplied numbers.

     

  2. The result of step 1 is converted to the output data type. Conversions are discussed in Signal Conversions. Note that a loss in precision of one bit occurs, with the resulting value of QTemp determined by the rounding mode. For this example, round-to-floor is used. Furthermore, overflow did not occur but is possible for this operation. 

     

  3. The result of step 2 and the third number (1.8125) are multiplied. Note that the radix point of the product is given by the sum of the radix points of the multiplied numbers.

     

  4. The product is converted to the output data type. Note that a loss in precision of four bits occurred, with the resulting value of QTemp determined by the rounding mode. For this example, round-to-floor is used. Furthermore, overflow did not occur but is possible for this operation.
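The four steps can be traced with the sketch below. It assumes unsigned words and input scalings of 2^-2, 2^-3, and 2^-4, which are inferred from the one-bit and four-bit precision losses described in steps 2 and 4. The function name multiply_step is hypothetical.

```python
OUT_FRAC_BITS = 4          # output scaling of 2^-4
OUT_WORD_BITS = 10         # output word size

def multiply_step(q_left, frac_left, q_right, frac_right):
    """Multiply two stored integers and convert the product to the output type."""
    q_raw = q_left * q_right                       # radix points add
    drop = frac_left + frac_right - OUT_FRAC_BITS  # fractional bits to discard
    q_temp = q_raw >> drop if drop >= 0 else q_raw << -drop   # round to floor
    if q_temp >= 1 << OUT_WORD_BITS:
        raise OverflowError("product does not fit in the output word")
    return q_raw, q_temp

# 5.75 = 23 * 2^-2, 2.375 = 19 * 2^-3, 1.8125 = 29 * 2^-4
raw1, temp1 = multiply_step(23, 2, 19, 3)                   # steps 1 and 2
raw2, temp2 = multiply_step(temp1, OUT_FRAC_BITS, 29, 4)    # steps 3 and 4

result = temp2 / (1 << OUT_FRAC_BITS)
ideal = 5.75 * 2.375 * 1.8125
```

Under these assumptions the fixed-point result is 24.6875, compared with an ideal product of 24.751953125.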

Division

As with multiplication, division with mismatched scaling is complicated. Mismatched division is permitted for simulation only. For code generation and bit-true simulation, the signals must all have zero biases and matched fractional slopes.

Fixed-Point Blockset Division Process

Consider the division of two numbers. Ideally, the real-world values obey the equation

Va = Vb / Vc

where Vb and Vc are the input values and Va is the output value. To see how the division is actually implemented, the three ideal values should be replaced by the general slope/bias encoding scheme described in Scaling.

For the case where the slopes are one and the biases are zero for all signals, the solution of the resulting equation for the output stored integer, Qa, is given below.

Qa = (Qb / Qc) 2^(Eb - Ec - Ea)

This equation involves an integer division and some bit shifts. If Ea ≥ Eb - Ec then any bit shifts are to the right and the implementation is simple. However, if Ea < Eb - Ec then the bit shifts are to the left and the implementation can be more complicated. The essential issue is the output has more precision than the integer division provides. To get full precision, a fractional division is needed. The C programming language provides access to integer division only for fixed-point data types. Depending on the size of the numerator, some of the fractional bits may be obtained by performing a shift prior to the integer division. In the worst case, it may be necessary to resort to repeated subtractions in software.

In general, division of values is an operation that should be avoided in fixed-point embedded systems. Division where the output has more precision than the integer division (i.e., Ea < Eb - Ec) should be used with even greater reluctance. Division of signals with nonzero biases or mismatched slopes is not supported.

Example: The Division Process

Suppose you want to divide two numbers. Each of these numbers is represented by an 8-bit word, and each has a radix point-only scaling of 2^-4. Additionally, the output is restricted to an 8-bit word with radix point-only scaling of 2^-4.

The division of 9.1875 by 1.5000 is shown below.

For this example,

Qa = (Qb / Qc) 2^(-4 - (-4) - (-4)) = 2^4 Qb / Qc

Assuming a larger data type is available, this could be implemented as

Qa = (2^4 Qb) / Qc

where the numerator uses the larger data type. If a larger data type is not available, integer division combined with four repeated subtractions would be used. Both approaches produce the same result, with the former being more efficient.
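Both approaches can be compared with the sketch below. The shift-and-subtract loop is one plausible reading of "four repeated subtractions" (a restoring long division that produces one extra quotient bit per pass); the function name divide_with_extra_bits is hypothetical, and only nonnegative operands are handled.

```python
FRAC_BITS = 4                       # all three signals use a scaling of 2^-4

q_b = int(9.1875 * (1 << FRAC_BITS))    # stored integer 147
q_c = int(1.5 * (1 << FRAC_BITS))       # stored integer 24

# Approach 1: pre-shift the numerator in a wider type, then divide once.
q_a_wide = (q_b << FRAC_BITS) // q_c

# Approach 2: integer division, then one compare-and-subtract per extra bit.
def divide_with_extra_bits(num, den, extra_bits):
    quotient, remainder = divmod(num, den)
    for _ in range(extra_bits):
        remainder <<= 1             # bring down the next (zero) bit
        quotient <<= 1
        if remainder >= den:        # the subtraction yields a 1 in the quotient
            remainder -= den
            quotient |= 1
    return quotient

q_a_narrow = divide_with_extra_bits(q_b, q_c, FRAC_BITS)
result = q_a_wide / (1 << FRAC_BITS)
```

Both approaches give a stored integer of 98, which corresponds to 6.125. In this particular case that equals the ideal quotient exactly.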

Shifts

Nearly all microprocessors and digital signal processors support well-defined bit-shift (or simply shift) operations for integers. For example, consider the 8-bit unsigned integer 00110101. The results of a 2-bit shift to the left and a 2-bit shift to the right are shown below.

Shift Operation               Binary Value    Decimal Value
No shift (original number)    00110101        53
Shift left by 2 bits          11010100        212
Shift right by 2 bits         00001101        13
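The values in the table can be reproduced with ordinary integer shifts. Python integers are unbounded, so the sketch masks the left shift to 8 bits to mimic an 8-bit register; the variable names are illustrative.

```python
WIDTH_MASK = 0xFF                           # keep results within 8 bits

value = 0b00110101                          # 53
shifted_left = (value << 2) & WIDTH_MASK    # bits pushed past bit 7 are lost
shifted_right = value >> 2                  # low bits are discarded
```

Here no set bits are lost on the left shift, so the result is simply four times the original value.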

You can perform a shift with the Fixed-Point Blockset using either the FixPt Conversion block or the FixPt Gain block. The FixPt Conversion block shifts both the bits and radix point while the FixPt Gain block shifts the bits but not the radix point. These two modes of shifting as well as shifting to the right are discussed below.

Performing a "plain" or "raw" machine-level shift such as those given in the example above with the Fixed-Point Blockset is complicated by the available scaling options. Therefore, a single "FixPt Shift" block is not provided. For more information about scaling, refer to Scaling.

 

Shifting to the Right

Shifts to the right can be classified as a logical shift right or an arithmetic shift right. For a logical shift right, a 0 is incorporated into the most significant bit for each bit shift. For an arithmetic shift right, the most significant bit is recycled for each bit shift. With the Fixed-Point Blockset, shifting to the right follows these rules:

  • For signed numbers, an arithmetic shift right is performed. Therefore, the most significant bit is recycled for each bit shift. For example, given the signed fixed-point number 10110.101, a bit shift two places to the right with the radix point unmoved yields the number 11101.101.

  • For unsigned numbers, a logical shift right is performed. Therefore, the most significant bit is a 0 for each bit shift. For example, given the unsigned fixed-point number 10110.101, a bit shift two places to the right with the radix point unmoved yields the number 00101.101.
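The two rules can be illustrated on the bit pattern 10110101 used above. The function name shift_right is hypothetical. The sketch relies on the fact that Python's >> operator on a negative integer fills with the sign, which matches an arithmetic shift.

```python
def shift_right(bits, places, signed, width=8):
    """Shift a width-bit pattern right; arithmetic if signed, logical otherwise."""
    mask = (1 << width) - 1
    if signed and bits & (1 << (width - 1)):
        bits -= 1 << width              # reinterpret the pattern as negative
    # For a negative value >> replicates the sign bit; for a nonnegative
    # value zeros are shifted in from the left.
    return (bits >> places) & mask

pattern = 0b10110101                    # 10110.101 with the radix point ignored
signed_result = shift_right(pattern, 2, signed=True)
unsigned_result = shift_right(pattern, 2, signed=False)
```

The signed result is the pattern 11101101 and the unsigned result is 00101101, matching the two bullets.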

Shifting Bits and the Radix Point

With the FixPt Conversion block, you can perform a shift operation on the input by specifying the appropriate radix point-only scaling for the output. This block shifts both the bits and the radix point.

In most cases, you will perform a "plain" or "raw" shift. To perform such a shift using the FixPt Conversion block, you must configure the block's dialog box this way:

  • The output data type is identical to the input data type.

  • The rounding mode is set to Floor. Therefore, bits simply fall off the left or fall off the right when a shift occurs.

  • Overflows wrap.

  • The output scaling is specified to reflect the required shift.

For example, suppose you start with the fixed-point number 00110.101 (a decimal value of 6.625), which is characterized by the blockset as an 8-bit unsigned, generalized fixed-point number with radix point-only scaling of 2^-3. To shift the bits and radix point two places to the right, the input scaling of 2^-3 is multiplied by 2^2, which yields a scaling of 2^-1. To shift the bits and radix point two places to the left, the input scaling of 2^-3 is multiplied by 2^-2, which yields a scaling of 2^-5. This situation is shown below.

Shift Operation               Scaling    Binary Value    Decimal Value
No shift (original number)    2^-3       00110.101       6.625
Shift right by 2 bits         2^-1       0000110.1       6.5
Shift left by 2 bits          2^-5       110.10100       6.625
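The effect of changing only the output scaling can be sketched as follows. This is not the FixPt Conversion block itself. The function name convert_scaling is hypothetical, and the sketch assumes an unsigned 8-bit word, round-to-floor, and wrapping on overflow, as in the configuration listed above.

```python
def convert_scaling(q, frac_in, frac_out, width=8):
    """Re-express stored integer q using frac_out fractional bits."""
    delta = frac_out - frac_in
    # More fractional bits: shift left. Fewer: shift right (round to floor).
    q_out = q << delta if delta >= 0 else q >> -delta
    return q_out & ((1 << width) - 1)   # overflows wrap

q = 0b00110101                           # 00110.101 with scaling 2^-3

q_right = convert_scaling(q, 3, 1)       # scaling 2^-1
q_left = convert_scaling(q, 3, 5)        # scaling 2^-5

value_right = q_right / 2 ** 1
value_left = q_left / 2 ** 5
```

Because the radix point moves with the bits, the real-world value is preserved except for the bits that fall off: 6.5 after the right shift and 6.625 after the left shift.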

The figure below shows the fixed-point model used to generate the above data.

Refer to Chapter 9, Block Reference for more information about the FixPt Conversion block.

Shifting Bits but Not the Radix Point

With the FixPt Gain block, you can perform a shift operation on the input by specifying the gain as a power of two. This block shifts only the bits and not the radix point.

In most cases, you will perform a plain or raw shift. To perform such a shift using the FixPt Gain block, you must configure the block's dialog box this way:

  • The output data type is identical to the input data type.

  • The rounding mode is set to Floor. Therefore, bits simply fall off the left or fall off the right when a shift occurs.

  • Overflows wrap.

  • The gain is specified as the appropriate power of 2 to reflect the required shift.

For example, suppose you start with the same fixed-point number, 00110.101, defined above. To shift the bits two places to the left, a gain of 4 is specified, and to shift the bits two places to the right, a gain of 0.25 is specified. This situation is shown below.

Shift Operation          Gain Value    Binary Value    Decimal Value
N/A (original number)    2^-3          00110.101       6.625
Shift left by 2 bits     4             11010.100       26.5
Shift right by 2 bits    0.25          00001.101       1.625
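A power-of-two gain with unchanged scaling can be sketched as follows. This is not the FixPt Gain block itself. The function name gain_shift is hypothetical, and the sketch assumes an unsigned 8-bit word, round-to-floor, and wrapping on overflow.

```python
import math

def gain_shift(q, gain, width=8):
    """Apply a power-of-two gain to stored integer q as a pure bit shift."""
    power = math.log2(gain)
    if power != int(power):
        raise ValueError("gain must be a power of two")
    power = int(power)
    shifted = q << power if power >= 0 else q >> -power   # round to floor
    return shifted & ((1 << width) - 1)                   # overflows wrap

FRAC_BITS = 3                            # scaling stays at 2^-3 throughout
q = 0b00110101                           # 00110.101, or 6.625

q_left = gain_shift(q, 4)                # bits move left, radix point does not
q_right = gain_shift(q, 0.25)            # bits move right, radix point does not

value_left = q_left / 2 ** FRAC_BITS
value_right = q_right / 2 ** FRAC_BITS
```

Since the scaling is unchanged, the real-world value changes with the shift: 26.5 for a gain of 4 and 1.625 for a gain of 0.25. The exact result of 6.625 times 0.25 is 1.65625; the difference is the bit discarded by the right shift.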

The figure below shows the fixed-point model used to generate the above data.

Refer to Chapter 9, Block Reference for more information about the FixPt Gain block.