10491239

Large-Scale Computations Using an Adaptive Numerical Format

Published: November 26, 2019
Assignee: not available in the USPTO data we have
Inventors: Itay Hubara
Patent Claims
12 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A computational device, comprising: an input memory, configured to receive a first array of input numbers having a first precision, such that each input number is represented by N bits; an output memory, configured to store a second array of output numbers having a second precision, less than the first precision, such that each output number is represented by M bits, M<N; quantization logic, which is configured to read the input numbers from the input memory, to extract from each input number a set of M bits, at a bit offset within the input number that is indicated by a quantization factor, and to write a corresponding output number based on the extracted set of bits to the second array in the output memory; and a quantization controller, which is configured to set the quantization factor so as to optimally fit an available range of the output numbers in the second array to an actual range of the input numbers in the first array in extraction of the M bits from the input numbers, wherein the quantization controller is configured to adjust the quantization factor responsively to a predefined limitation on overflow in the extraction of the M bits from the input numbers and to fit the available range of the output numbers to the actual range of the input numbers in the first array by estimating a largest value among the input numbers, and setting the quantization factor so that the largest value fills but does not overflow the set of bits extracted by the quantization logic, wherein the predefined limitation on the overflow is defined by a saturation margin SM, and wherein the quantization controller is configured to set the quantization factor to a value QF such that 2^(M+QF) is no less than the estimated largest value among the input numbers, and 2^(M+QF−SM) is less than the estimated largest value among the input numbers.

Plain English Translation

A computational device reduces the precision of numerical data while limiting overflow. The device includes an input memory that receives a first array of numbers with high precision (N bits per number) and an output memory that stores a second array with lower precision (M bits per number, where M < N). Quantization logic extracts M bits from each input number at a bit offset indicated by a quantization factor (QF), then writes an output number based on the extracted bits. A quantization controller sets QF so that the available range of the output numbers fits the actual range of the input numbers. The controller estimates the largest input value and sets QF so that this value fills the extracted M bits without overflowing them, subject to a saturation margin (SM). Specifically, QF is chosen so that 2^(M+QF) is at least the estimated largest input value, while 2^(M+QF−SM) is less than it. The first bound keeps the largest value from overflowing the extracted bits; the second keeps QF from being set so high that the upper output bits go unused.
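The two bounds on QF can be checked numerically. The following is a minimal Python sketch, not the patented implementation. It assumes unsigned integer inputs and a non-negative bit offset, and the names `choose_qf`, `qf_is_acceptable`, and `extract_bits` are hypothetical helpers introduced here for illustration.

```python
def choose_qf(max_value: int, m_bits: int) -> int:
    """Return the smallest QF with 2**(M+QF) >= max_value (illustrative helper)."""
    if max_value < 1:
        raise ValueError("max_value must be a positive integer")
    # (max_value - 1).bit_length() is the smallest k with 2**k >= max_value.
    return (max_value - 1).bit_length() - m_bits


def qf_is_acceptable(qf: int, max_value: int, m_bits: int, sm: int) -> bool:
    """Check both bounds: 2^(M+QF) >= max_value and 2^(M+QF-SM) < max_value."""
    upper = 2.0 ** (m_bits + qf)
    lower = 2.0 ** (m_bits + qf - sm)
    return upper >= max_value and lower < max_value


def extract_bits(x: int, qf: int, m_bits: int) -> int:
    """Extract M bits of x starting at bit offset QF (assumes x >= 0, qf >= 0)."""
    return (x >> qf) & ((1 << m_bits) - 1)


if __name__ == "__main__":
    m, sm = 8, 1          # assumed example parameters
    largest = 1000        # assumed estimate of the largest input value
    qf = choose_qf(largest, m)
    print(qf, qf_is_acceptable(qf, largest, m, sm), extract_bits(largest, qf, m))
```

With SM = 1 exactly one QF satisfies both bounds. A larger SM widens the window to SM consecutive values of QF, which gives the controller room to leave QF unchanged when the estimated maximum drifts slightly.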

Claim 2

Original Legal Text

2. The device according to claim 1, wherein the input numbers comprise fixed-point numbers having a predefined radix point, and the quantization factor indicates a shift of the radix point in the output numbers relative to the input numbers.

Plain English Translation

This claim narrows the device of claim 1 to fixed-point inputs. The input numbers are fixed-point numbers with a predefined radix point, and the quantization factor indicates how far the radix point in the output numbers is shifted relative to the input numbers. Because the quantization logic extracts M bits at the offset given by the quantization factor, choosing that offset is equivalent to choosing where the radix point falls in the M-bit output. This lets the output represent the magnitude of the input with fewer bits, trading precision in the least significant bits for a narrower word. The claim recites no hardware beyond the elements of claim 1. Fixed-point arithmetic of this kind is common in digital signal processing and other resource-constrained hardware, where it is typically cheaper and lower in power than floating-point arithmetic.
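The radix-point shift can be illustrated with a short Python sketch. It assumes non-negative fixed-point words and a non-negative QF; `requantize_fixed` and `to_real` are hypothetical names used only for this example.

```python
from typing import Tuple


def requantize_fixed(raw: int, frac_bits_in: int, qf: int, m_bits: int) -> Tuple[int, int]:
    """Keep M bits of a non-negative fixed-point word, starting at bit offset QF.

    Returns (output word, number of fractional bits in the output word).
    """
    out = (raw >> qf) & ((1 << m_bits) - 1)
    # Dropping the QF low-order bits leaves QF fewer fractional bits,
    # which is the radix-point shift indicated by the quantization factor.
    return out, frac_bits_in - qf


def to_real(raw: int, frac_bits: int) -> float:
    """Interpret a fixed-point word as a real number."""
    return raw / (2 ** frac_bits)


if __name__ == "__main__":
    word = 960                      # 3.75 stored with 8 fractional bits
    out, frac_out = requantize_fixed(word, frac_bits_in=8, qf=4, m_bits=8)
    print(out, frac_out, to_real(out, frac_out))
```

In this example the value 3.75 survives exactly because its dropped low-order bits are all zero; in general the dropped bits are the quantization error.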

Claim 3

Original Legal Text

3. The device according to claim 1, wherein the input numbers comprise floating-point numbers, while the output number comprise fixed-point numbers, and wherein the quantization logic is configured to convert the floating-point numbers to the fixed-point numbers, while setting a radix point of the fixed-point numbers responsively to the quantization factor.

Plain English Translation

This invention relates to a digital processing device that converts floating-point numbers to fixed-point numbers with configurable precision. The device addresses the challenge of efficiently handling numerical data in systems where fixed-point arithmetic is preferred for performance or hardware constraints, but input data is in floating-point format. The device includes quantization logic that processes input floating-point numbers to produce output fixed-point numbers. The quantization logic adjusts the radix point of the fixed-point numbers based on a quantization factor, allowing dynamic control over the precision and range of the output. This ensures accurate conversion while maintaining compatibility with fixed-point processing requirements. The device may be used in digital signal processing, embedded systems, or other applications where floating-point to fixed-point conversion is necessary. The quantization factor determines the scaling applied during conversion, enabling trade-offs between precision and computational efficiency. The system ensures that the converted fixed-point numbers retain meaningful values by dynamically positioning the radix point according to the input data range and the specified quantization factor. This approach optimizes resource usage while preserving numerical accuracy.
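A floating-point to fixed-point conversion with a QF-controlled radix point might look like the sketch below. It assumes non-negative inputs, rounding down, and saturation at the largest M-bit word; none of these details come from the claim, and `float_to_fixed` and `fixed_to_float` are hypothetical helper names.

```python
import math


def float_to_fixed(x: float, qf: int, m_bits: int) -> int:
    """Convert a non-negative float to an M-bit fixed-point word with step 2**qf."""
    if x < 0:
        raise ValueError("this sketch handles non-negative values only")
    scaled = math.floor(math.ldexp(x, -qf))   # x / 2**qf, rounded down
    return min(scaled, (1 << m_bits) - 1)     # saturate instead of wrapping (assumption)


def fixed_to_float(word: int, qf: int) -> float:
    """Recover the real value represented by a fixed-point word."""
    return math.ldexp(word, qf)               # word * 2**qf


if __name__ == "__main__":
    # QF = -8 places the radix point above all 8 output bits (values in [0, 1)).
    w = float_to_fixed(0.8125, qf=-8, m_bits=8)
    print(w, fixed_to_float(w, -8))
```

A negative QF here simply means the output word is made up entirely of fractional bits; a larger QF moves the radix point so that the same M bits cover a wider range with a coarser step.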

Claim 4

Original Legal Text

4. The device according to claim 1, wherein the quantization logic is further configured to extract a sign bit from the input number and to apply the sign bit to the corresponding output number.

Plain English Translation

This invention relates to digital signal processing, specifically to quantization logic in data conversion systems. The problem addressed is the loss of sign information during quantization, which can lead to errors in processing signed numerical data. The invention provides a quantization device that preserves the sign bit of input numbers during quantization, ensuring accurate representation of both positive and negative values in the output. The device includes quantization logic that processes input numbers, which may be in floating-point or fixed-point format, and converts them into output numbers with reduced precision. The quantization logic extracts the sign bit from each input number and applies it to the corresponding output number, maintaining the original sign throughout the conversion process. This ensures that the output retains the correct polarity of the input, which is critical for applications requiring precise numerical representation, such as digital signal processing, machine learning, and scientific computing. The quantization logic may also include additional features, such as rounding or truncation of the mantissa or fractional part of the input number, to further reduce precision while preserving the sign. The device can be implemented in hardware, software, or a combination thereof, and may be integrated into larger systems such as digital signal processors, neural network accelerators, or general-purpose computing platforms. By preserving the sign bit, the invention improves the accuracy and reliability of quantized numerical data in various computational tasks.
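As a rough illustration of sign handling, the Python sketch below separates the sign from the magnitude, quantizes the magnitude, and reapplies the sign. The sign-magnitude treatment and the saturation step are assumptions made for the example, and `quantize_signed` is a hypothetical name.

```python
def quantize_signed(x: int, qf: int, m_bits: int) -> int:
    """Quantize a signed integer by extracting its sign and shifting its magnitude."""
    sign_bit = 1 if x < 0 else 0
    magnitude = abs(x)
    # Keep M bits of the magnitude at offset QF, saturating on overflow (assumption).
    q = min(magnitude >> qf, (1 << m_bits) - 1)
    return -q if sign_bit else q


if __name__ == "__main__":
    # Shifting the magnitude treats positive and negative inputs symmetrically,
    # whereas an arithmetic shift of a negative integer rounds toward minus infinity.
    print(quantize_signed(1001, 2, 8), quantize_signed(-1001, 2, 8), -1001 >> 2)
```

Handling the sign separately keeps the quantization error symmetric around zero, which is one plausible reason to carry the sign bit across rather than shift a two's-complement word directly.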

Claim 5

Original Legal Text

5. The device according to claim 1, wherein the bits of the input number that are less significant than the extracted set of bits define a quantization remainder, and wherein the quantization logic is configured to make a comparison between the quantization remainder and a random number, and to derive the corresponding output number by rounding the extracted set of bits responsively to the comparison.

Plain English Translation

This claim adds a randomized rounding step to the device of claim 1. The quantization logic extracts the set of M bits from each input number at the offset given by the quantization factor. The bits of the input number that are less significant than the extracted set are treated as a quantization remainder. The quantization logic compares this remainder to a random number and, based on the comparison, rounds the extracted bits up or down to produce the output number. This probabilistic (stochastic) rounding avoids the systematic bias of deterministic rounding. In the usual form of such rounding, a larger remainder makes rounding up more likely, so rounding errors tend to average out across many values rather than accumulate in one direction. The claim does not specify how the random number is generated or the exact comparison rule. The technique is relevant wherever many low-precision values are accumulated, such as in machine learning and signal processing.
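One common way to realize remainder-versus-random rounding is sketched below in Python. The comparison direction (round up when the remainder exceeds the random number), the uniform distribution, and the saturation on round-up are assumptions; the claim only requires that rounding depend on the comparison. `stochastic_round` is a hypothetical name.

```python
import random


def stochastic_round(x: int, qf: int, m_bits: int, rng: random.Random) -> int:
    """Extract M bits of x at offset QF and round up with probability remainder / 2**qf."""
    mask = (1 << m_bits) - 1
    kept = (x >> qf) & mask                 # extracted set of bits
    remainder = x & ((1 << qf) - 1)         # bits below the extracted set
    threshold = rng.randrange(1 << qf)      # uniform integer in [0, 2**qf)
    if remainder > threshold:               # true for exactly `remainder` of the 2**qf draws
        kept = min(kept + 1, mask)          # saturate rather than wrap (assumption)
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(1003, 2, 8, rng) for _ in range(20000)]
    # The sample mean should be close to the unquantized value 1003 / 4 = 250.75.
    print(sum(samples) / len(samples))
```

Under these assumptions the expected output equals the unquantized value, which is why this style of rounding does not introduce a systematic bias.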

Claim 6

Original Legal Text

6. The device according to claim 1, wherein the bits of the input number that are more significant than the extracted set of bits define an overflow, and wherein the quantization controller is configured to increment the quantization factor when the overflow exceeds a predefined limit.

Plain English Translation

This claim adds overflow-driven adaptation to the device of claim 1. The bits of an input number that are more significant than the extracted set of M bits define an overflow: any value held in those bits cannot be represented in the output. The quantization controller increments the quantization factor when the overflow exceeds a predefined limit. Incrementing the factor moves the extraction window toward the more significant bits, which extends the representable range at the cost of dropping one more low-order bit of resolution. This lets the device follow input data whose magnitude varies over time, balancing dynamic range against precision. The claim does not state how the overflow is measured against the limit.
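The overflow-triggered adjustment can be sketched as follows. The claim does not say how overflow is compared with the limit; this Python example assumes the controller counts how many inputs have non-zero bits above the extraction window. `count_overflows` and `update_qf` are hypothetical names.

```python
from typing import Sequence


def count_overflows(values: Sequence[int], qf: int, m_bits: int) -> int:
    """Count inputs with any set bit above the M-bit window starting at offset QF."""
    return sum(1 for x in values if (x >> (qf + m_bits)) != 0)


def update_qf(values: Sequence[int], qf: int, m_bits: int, overflow_limit: int) -> int:
    """Increment QF by one when the overflow count exceeds the limit (assumed policy)."""
    if count_overflows(values, qf, m_bits) > overflow_limit:
        return qf + 1
    return qf


if __name__ == "__main__":
    data = [100, 300, 1000, 5000]   # assumed non-negative inputs
    qf = 0
    # Apply the update repeatedly until QF stops changing.
    while (new_qf := update_qf(data, qf, m_bits=8, overflow_limit=0)) != qf:
        qf = new_qf
    print(qf)
```

A non-zero limit tolerates a few saturating outliers in exchange for keeping more low-order resolution for the bulk of the data.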

Claim 7

Original Legal Text

7. A method for computation, comprising: receiving in an input memory a first array of input numbers having a first precision, such that each input number is represented by N bits; reading the input numbers from the input memory into quantization logic, and extracting from each input number, by the quantization logic, a set of M bits, M<N, at a bit offset within the input number that is indicated by a quantization factor; writing, from the quantization logic to an output memory, a corresponding output number based on the extracted set of bits to a second array of output numbers having a second precision, less than the first precision, such that each output number is represented by M bits; and setting the quantization factor, so as to optimally fit an available range of the output numbers in the second array to an actual range of the input numbers in the first array in extraction of the M bits from the input numbers, wherein setting the quantization factor comprises adjusting the quantization factor responsively to a predefined limitation on overflow in the extraction of the M bits from the input numbers, wherein adjusting the quantization factor comprises fitting the available range of the output numbers to the actual range of the input numbers in the first array by estimating a largest value among the input numbers, and selecting the quantization factor so that the largest value fills but does not overflow the set of bits extracted by the quantization logic, and wherein the limitation on the overflow is defined by a saturation margin SM, and wherein setting the quantization factor comprises assigning a value QF to the quantization factor such that 2^(M+QF) is no less than the estimated largest value among the input numbers, and 2^(M+QF−SM) is less than the estimated largest value among the input numbers.

Plain English Translation

This method relates to computational quantization, a technique used to reduce the precision of numerical data for efficient processing, storage, or transmission. The problem addressed is the need to convert high-precision input numbers into lower-precision output numbers while minimizing data loss and preventing overflow. The method involves receiving an array of input numbers, each represented by N bits, and extracting a set of M bits (M < N) from each number at a bit offset indicated by a quantization factor. An output number based on the extracted bits is written to an output array, where each output number is represented by M bits. The quantization factor is set so that the available range of the output numbers fits the actual range of the input numbers. This is done by estimating the largest input value and selecting a quantization factor QF such that this value fills but does not overflow the extracted bits, subject to a predefined saturation margin (SM): 2^(M+QF) must be no less than the estimated largest value, and 2^(M+QF−SM) must be less than it. The method balances precision reduction against overflow, making it suitable for applications that use low-precision arithmetic, such as machine learning or signal processing.
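The two conditions on QF in the claim can be combined into a single window. The derivation below is a worked restatement under the assumption that the estimated largest value, written here as x_max (a symbol introduced for this example), is positive.

```latex
2^{M+QF-SM} \;<\; x_{\max} \;\le\; 2^{M+QF}
\quad\Longleftrightarrow\quad
\log_2 x_{\max} - M \;\le\; QF \;<\; \log_2 x_{\max} - M + SM
```

For example, with M = 8, SM = 1, and x_max = 1000, log2(1000) is about 9.97, so QF must satisfy 1.97 ≤ QF < 2.97, which gives QF = 2. Checking the bounds directly: 2^10 = 1024 is no less than 1000, and 2^9 = 512 is less than 1000.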

Claim 8

Original Legal Text

8. The method according to claim 7, wherein the input numbers comprise fixed-point numbers having a predefined radix point, and the quantization factor indicates a shift of the radix point in the output numbers relative to the input numbers.

Plain English Translation

This invention relates to numerical processing, specifically methods for quantizing fixed-point numbers in computational systems. The problem addressed is the need to efficiently adjust the precision of numerical values while maintaining computational efficiency, particularly in hardware-accelerated or resource-constrained environments. The method processes input numbers that are fixed-point numbers with a predefined radix point, meaning they have a fixed number of integer and fractional bits. The quantization process involves applying a quantization factor that determines a shift of the radix point in the output numbers relative to the input numbers. This shift effectively scales the numerical values, allowing for controlled precision reduction or adjustment. The method ensures that the quantization operation is performed in a manner that preserves the integrity of the numerical representation while optimizing computational resources. The technique is particularly useful in applications such as digital signal processing, machine learning inference, and embedded systems, where fixed-point arithmetic is commonly used to balance precision and performance. By adjusting the radix point shift, the method enables dynamic control over the precision of intermediate or final results, which can be critical for meeting hardware constraints or optimizing power consumption. The approach avoids the overhead of floating-point operations while still providing flexibility in numerical representation.

Claim 9

Original Legal Text

9. The method according to claim 7, wherein the input numbers comprise floating-point numbers, while the output number comprise fixed-point numbers, and wherein extracting the M bits comprises converting the floating-point numbers to the fixed-point numbers, while setting a radix point of the fixed-point numbers responsively to the quantization factor.

Plain English Translation

This claim narrows the method of claim 7 to the case where the input numbers are floating-point numbers and the output numbers are fixed-point numbers. Floating-point numbers carry their own exponent, whereas fixed-point numbers have a fixed word width and radix point, so the conversion must decide where the radix point of the output falls. In this method, extracting the M bits comprises converting each floating-point number to a fixed-point number, with the radix point of the fixed-point number set according to the quantization factor. The number of output bits M is fixed; the quantization factor controls only the position of the extracted bits, and therefore the range and step size of the fixed-point output. This approach is useful in digital signal processing, embedded systems, and other settings where fixed-point arithmetic is preferred for its efficiency and lower power consumption compared to floating-point arithmetic.

Claim 10

Original Legal Text

10. The method according to claim 7, wherein writing the corresponding output number comprises extracting a sign bit from the input number and applying the sign bit to the corresponding output number.

Plain English Translation

A method for processing numerical data involves handling input numbers and generating corresponding output numbers. The method addresses the challenge of efficiently managing numerical data, particularly in systems where sign information must be preserved or manipulated. The process includes extracting a sign bit from an input number and applying this sign bit to the corresponding output number. This ensures that the sign information of the input number is accurately reflected in the output, which is critical for maintaining data integrity in applications such as arithmetic operations, data encoding, or digital signal processing. The method may also involve additional steps, such as converting the input number into a different numerical format or performing intermediate calculations, before applying the sign bit to the output. By explicitly handling the sign bit, the method ensures that the output number retains the correct sign, whether positive or negative, of the original input. This approach is particularly useful in systems where numerical precision and sign accuracy are essential, such as in scientific computing, financial calculations, or embedded systems. The method can be implemented in hardware, software, or a combination of both, depending on the specific requirements of the application.

Claim 11

Original Legal Text

11. The method according to claim 7, wherein the bits of the input number that are less significant than the extracted set of bits define a quantization remainder, and wherein writing the corresponding output number comprises making a comparison between the quantization remainder and a random number, and deriving the corresponding output number by rounding the extracted set of bits responsively to the comparison.

Plain English Translation

This invention relates to digital signal processing, specifically to methods for quantizing numerical values with randomized rounding to reduce quantization bias. The problem addressed is the error introduced when converting high-precision numerical values into lower-precision representations. Deterministic rounding produces errors that can accumulate as bias or artifacts in processed signals. The method extracts a set of M bits from an input number, at the offset given by the quantization factor, to form the base of the quantized value. The less significant bits, which are not retained in the quantized output, form a quantization remainder. The method compares this remainder against a random number and rounds the extracted set of bits according to the result. For example, the extracted bits may be rounded up when the remainder is greater than the random number and rounded down otherwise; the claim itself requires only that the rounding depend on the comparison. This stochastic rounding introduces randomness into the quantization process, reducing systematic bias in the quantized output. The method can be applied in digital processing systems where precision is limited, such as audio processing, image compression, or machine learning, because it distributes rounding errors more evenly instead of letting them accumulate in one direction.

Claim 12

Original Legal Text

12. The method according to claim 7, wherein the bits of the input number that are more significant than the extracted set of bits define an overflow, and wherein setting the quantization factor comprises incrementing the quantization factor when the overflow exceeds a predefined limit.

Plain English Translation

This invention relates to digital signal processing, specifically methods for adaptive quantization of numerical data to prevent overflow in computational systems. The problem addressed is the risk of overflow when processing high-precision input numbers, which can lead to data loss or system errors. The solution involves dynamically adjusting a quantization factor based on the significance of input bits to maintain numerical stability. The method processes an input number by extracting a subset of its bits for further computation. The remaining, more significant bits are monitored to detect potential overflow conditions. If the magnitude of these higher-order bits exceeds a predefined threshold, the quantization factor is incremented to reduce the resolution of the processed data, thereby preventing overflow. This adaptive adjustment ensures that the system can handle varying input ranges without exceeding computational limits. The technique is particularly useful in applications requiring real-time processing, such as digital signal processing, where input data may vary widely in magnitude. By dynamically adjusting the quantization factor, the method avoids the need for fixed, conservative bit allocations that may waste computational resources or fail to handle extreme values. The approach balances precision and system stability, making it suitable for embedded systems, audio processing, and other domains where numerical overflow must be mitigated.

Patent Metadata

Filing Date

Unknown

Publication Date

November 26, 2019

Inventors

Itay Hubara

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, FAQs, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “LARGE-SCALE COMPUTATIONS USING AN ADAPTIVE NUMERICAL FORMAT” (10491239). https://patentable.app/patents/10491239

© 2026 Nomic Interactive Technology LLC. Machine-readable context available at /api/llm-context/10491239. See llms.txt for full attribution policy.