Large-Scale Computations Using an Adaptive Numerical Format

Published November 26, 2019

Assignee not available in USPTO data we have

Technical Abstract

Patent Claims

12 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A computational device, comprising: an input memory, configured to receive a first array of input numbers having a first precision, such that each input number is represented by N bits; an output memory, configured to store a second array of output numbers having a second precision, less than the first precision, such that each output number is represented by M bits, M<N; quantization logic, which is configured to read the input numbers from the input memory, to extract from each input number a set of M bits, at a bit offset within the input number that is indicated by a quantization factor, and to write a corresponding output number based on the extracted set of bits to the second array in the output memory; and a quantization controller, which is configured to set the quantization factor so as to optimally fit an available range of the output numbers in the second array to an actual range of the input numbers in the first array in extraction of the M bits from the input numbers, wherein the quantization controller is configured to adjust the quantization factor responsively to a predefined limitation on overflow in the extraction of the M bits from the input numbers and to fit the available range of the output numbers to the actual range of the input numbers in the first array by estimating a largest value among the input numbers, and setting the quantization factor so that the largest value fills but does not overflow the set of bits extracted by the quantization logic, wherein the predefined limitation on the overflow is defined by a saturation margin SM, and wherein the quantization controller is configured to set the quantization factor to a value QF such that 2^(M+QF) is no less than the estimated largest value among the input numbers, and 2^(M+QF−SM) is less than the estimated largest value among the input numbers.
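The two inequalities at the end of claim 1 can be illustrated with a small sketch. It is not the patented implementation: the function names, the restriction to non-negative integer inputs and QF >= 0, and the clamping of overflowing values are all assumptions made for illustration.

```python
def choose_quantization_factor(max_value: int, m_bits: int) -> int:
    """Return the smallest QF >= 0 such that 2**(m_bits + QF) >= max_value."""
    qf = 0
    while (1 << (m_bits + qf)) < max_value:
        qf += 1
    return qf


def within_saturation_margin(qf: int, max_value: int, m_bits: int, sm: int) -> bool:
    """Check both inequalities recited in the claim for a candidate QF."""
    upper_ok = 2.0 ** (m_bits + qf) >= max_value      # 2^(M+QF) is no less than the max
    lower_ok = 2.0 ** (m_bits + qf - sm) < max_value  # 2^(M+QF-SM) is less than the max
    return upper_ok and lower_ok


def extract_bits(x: int, m_bits: int, qf: int) -> int:
    """Take the M-bit window at bit offset QF; clamping on overflow is an assumption."""
    return min(x >> qf, (1 << m_bits) - 1)


inputs = [3, 120, 731, 1000]   # hypothetical input magnitudes (N = 16 bits, say)
M, SM = 8, 1                   # hypothetical output width and saturation margin
QF = choose_quantization_factor(max(inputs), M)
outputs = [extract_bits(x, M, QF) for x in inputs]
```

With these example values the largest input, 1000, needs a window reaching up to 2^10, so QF comes out as 2 and every output fits in 8 bits. A larger QF such as 3 would still avoid overflow but would fail the lower inequality for SM = 1, meaning one bit of the output range would go unused.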

2. The device according to claim 1, wherein the input numbers comprise fixed-point numbers having a predefined radix point, and the quantization factor indicates a shift of the radix point in the output numbers relative to the input numbers.

3. The device according to claim 1, wherein the input numbers comprise floating-point numbers, while the output numbers comprise fixed-point numbers, and wherein the quantization logic is configured to convert the floating-point numbers to the fixed-point numbers, while setting a radix point of the fixed-point numbers responsively to the quantization factor.
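A floating-point to fixed-point conversion of the kind described in claim 3 might look like the following sketch. It assumes that the quantization factor QF gives the weight 2^QF of the least significant output bit (so QF may be negative), that outputs use a sign plus an M-bit magnitude, and that out-of-range values are clamped. None of these details, nor the function names, come from the claim text.

```python
import math


def choose_qf_for_floats(max_abs: float, m_bits: int) -> int:
    """Smallest integer QF with 2**(m_bits + QF) >= max_abs (max_abs must be > 0)."""
    return math.ceil(math.log2(max_abs)) - m_bits


def float_to_fixed(x: float, m_bits: int, qf: int) -> int:
    """Quantize x to a signed integer code whose unit step is 2**qf."""
    magnitude = math.floor(math.ldexp(abs(x), -qf))   # abs(x) / 2**qf, truncated
    magnitude = min(magnitude, (1 << m_bits) - 1)     # clamp instead of wrapping
    return -magnitude if x < 0 else magnitude


def fixed_to_float(code: int, qf: int) -> float:
    """Map an integer code back to the real value it represents."""
    return math.ldexp(code, qf)


values = [0.5, -1.0, 3.7]      # hypothetical floating-point inputs
M = 8
QF = choose_qf_for_floats(max(abs(v) for v in values), M)
codes = [float_to_fixed(v, M, QF) for v in values]
```

Here the largest magnitude, 3.7, lies below 2^2, so QF is -6 and the radix point sits six bits from the right of the 8-bit magnitude. Converting the code for 3.7 back with `fixed_to_float` gives 3.6875, which shows the truncation error introduced by the reduced precision.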

4. The device according to claim 1, wherein the quantization logic is further configured to extract a sign bit from the input number and to apply the sign bit to the corresponding output number.

5. The device according to claim 1, wherein the bits of the input number that are less significant than the extracted set of bits define a quantization remainder, and wherein the quantization logic is configured to make a comparison between the quantization remainder and a random number, and to derive the corresponding output number by rounding the extracted set of bits responsively to the comparison.
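The comparison between the quantization remainder and a random number in claim 5 resembles what is commonly called stochastic rounding. The sketch below is one possible reading, not the claimed circuit: it assumes non-negative integer inputs, a random threshold drawn uniformly from the range of possible remainders, and clamping when rounding up would overflow. The function name is hypothetical.

```python
import random


def stochastic_round_extract(x: int, m_bits: int, qf: int, rng: random.Random) -> int:
    """Extract the M-bit window at offset QF and round up with probability remainder / 2**qf."""
    kept = x >> qf                               # bits at and above the offset
    if qf > 0:
        remainder = x & ((1 << qf) - 1)          # bits below the extracted window
        threshold = rng.randrange(1 << qf)       # uniform integer in [0, 2**qf)
        if remainder > threshold:                # true with probability remainder / 2**qf
            kept += 1
    return min(kept, (1 << m_bits) - 1)          # clamping is an assumption


rng = random.Random(0)                           # fixed seed so runs are repeatable
samples = [stochastic_round_extract(44, 8, 3, rng) for _ in range(10_000)]
mean_output = sum(samples) / len(samples)
```

With x = 44 and QF = 3 the exact quotient is 5.5, so each sample is either 5 or 6 and the average over many samples approaches 5.5. Plain truncation would return 5 every time and lose that half unit systematically.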

6. The device according to claim 1, wherein the bits of the input number that are more significant than the extracted set of bits define an overflow, and wherein the quantization controller is configured to increment the quantization factor when the overflow exceeds a predefined limit.

7. A method for computation, comprising: receiving in an input memory a first array of input numbers having a first precision, such that each input number is represented by N bits; reading the input numbers from the input memory into quantization logic, and extracting from each input number, by the quantization logic, a set of M bits, M<N, at a bit offset within the input number that is indicated by a quantization factor; writing, from the quantization logic to an output memory, a corresponding output number based on the extracted set of bits to a second array of output numbers having a second precision, less than the first precision, such that each output number is represented by M bits; and setting the quantization factor, so as to optimally fit an available range of the output numbers in the second array to an actual range of the input numbers in the first array in extraction of the M bits from the input numbers, wherein setting the quantization factor comprises adjusting the quantization factor responsively to a predefined limitation on overflow in the extraction of the M bits from the input numbers, wherein adjusting the quantization factor comprises fitting the available range of the output numbers to the actual range of the input numbers in the first array by estimating a largest value among the input numbers, and selecting the quantization factor so that the largest value fills but does not overflow the set of bits extracted by the quantization logic, and wherein the limitation on the overflow is defined by a saturation margin SM, and wherein setting the quantization factor comprises assigning a value QF to the quantization factor such that 2^(M+QF) is no less than the estimated largest value among the input numbers, and 2^(M+QF−SM) is less than the estimated largest value among the input numbers.

8. The method according to claim 7, wherein the input numbers comprise fixed-point numbers having a predefined radix point, and the quantization factor indicates a shift of the radix point in the output numbers relative to the input numbers.

9. The method according to claim 7, wherein the input numbers comprise floating-point numbers, while the output numbers comprise fixed-point numbers, and wherein extracting the M bits comprises converting the floating-point numbers to the fixed-point numbers, while setting a radix point of the fixed-point numbers responsively to the quantization factor.

10. The method according to claim 7, wherein writing the corresponding output number comprises extracting a sign bit from the input number and applying the sign bit to the corresponding output number.

11. The method according to claim 7, wherein the bits of the input number that are less significant than the extracted set of bits define a quantization remainder, and wherein writing the corresponding output number comprises making a comparison between the quantization remainder and a random number, and deriving the corresponding output number by rounding the extracted set of bits responsively to the comparison.

12. The method according to claim 7, wherein the bits of the input number that are more significant than the extracted set of bits define an overflow, and wherein setting the quantization factor comprises incrementing the quantization factor when the overflow exceeds a predefined limit.

Patent Metadata

Filing Date

Unknown

Publication Date

November 26, 2019

Inventors

Itay Hubara
