Critical Stage Optimization for Reconfigurable Architectures

Published: September 30, 2025

Assignee: not available in the USPTO data we have

Inventors: Adam BORDELON, David Alan KOEPLINGER

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for reducing latency and increasing throughput in a reconfigurable computing system, the method comprising: receiving a user program for execution on a reconfigurable dataflow computing system, the reconfigurable dataflow computing system comprising a grid of compute units and a grid of memory units interconnected with a switching array, the user program comprising a plurality of tensor-based algebraic expressions; converting the plurality of tensor-based algebraic expressions to an intermediate representation comprising a plurality of stages, including a first stage and a second stage which is adjacent to the first stage, each stage comprising one or more logical operations executable via dataflow through one or more compute units of the grid of compute units, each stage preceded by and followed by a buffer, each buffer corresponding to one or more memory units within the grid of memory units; detecting a memory mapping operation within the first stage; and moving the memory mapping operation to the second stage; wherein the memory mapping operation is executable by the one or more memory units within the second stage and wherein dataflow through the buffer is controlled by one or more memory units within the grid of memory units.

2. The method of claim 1, wherein the memory mapping operation comprises one or more of a transpose operation, a reshape operation, a layout transformation, a roll operation, a permutation operation, a slice operation or a tile operation.

3. The method of claim 1, wherein the one or more logical operations correspond to one or more template library functions.

4. The method of claim 1, wherein the first stage comprises a logical operation selected from one or more of matrix multiplication operation, a batch normalization operation, a batch Cholesky operation, or a layer normalization operation.

5. The method of claim 1, wherein the second stage comprises a logical operation selected from a ReLU operation, a Sigmoid operation, or a Hyperbolic Tangent operation.

6. The method of claim 1, wherein the one or more logical operations are represented as dataflow statements or compute graph nodes.

7. The method of claim 1, wherein moving the memory mapping operation to the second stage exposes further optimizations.

8. The method of claim 7, wherein the further optimizations include fusing buffers.

9. The method of claim 1, wherein the first stage has a highest latency among the plurality of stages.

10. A system for reducing latency and increasing throughput in reconfigurable dataflow processors, the system comprising: a host computer comprising an optimization module configured to conduct a method comprising: receiving a user program for execution on a reconfigurable dataflow computing system, the reconfigurable dataflow computing system comprising a grid of compute units and a grid of memory units interconnected with a switching array, the user program comprising a plurality of tensor-based algebraic expressions; converting the plurality of tensor-based algebraic expressions to an intermediate representation comprising a plurality of stages, including a first stage and a second stage which is adjacent to the first stage, each stage comprising one or more logical operations executable via dataflow through one or more compute units of the grid of compute units, each stage preceded by and followed by a buffer, each buffer corresponding to one or more memory units within the grid of memory units; detecting a memory mapping operation within the first stage; and moving the memory mapping operation to the second stage; wherein the memory mapping operation is executable by the one or more memory units within the second stage and wherein dataflow through the buffer is controlled by one or more memory units within the grid of memory units.

11. The system of claim 10, wherein the memory mapping operation comprises one or more of a transpose operation, a reshape operation, a layout transformation, a roll operation, a permutation operation, a slice operation or a tile operation.

12. The system of claim 10, wherein the one or more logical operations correspond to one or more template library functions.

13. The system of claim 10, wherein the first stage comprises a logical operation selected from one or more of matrix multiplication operation, a batch normalization operation, a batch Cholesky operation, or a layer normalization operation.

14. The system of claim 10, wherein the second stage comprises a logical operation selected from a ReLU operation, a Sigmoid operation, or a Hyperbolic Tangent operation.

15. The system of claim 10, wherein the one or more logical operations are represented as dataflow statements or compute graph nodes.

16. The system of claim 10, wherein moving the memory mapping operation to the second stage exposes further optimizations.

17. The system of claim 16, wherein the further optimizations include fusing buffers.

18. The system of claim 10, wherein the first stage has a highest latency among the plurality of stages.

19. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, wherein the program instructions are executable by a processor to cause the processor to conduct a method comprising: receiving a user program for execution on a reconfigurable dataflow computing system, the reconfigurable dataflow computing system comprising a grid of compute units and a grid of memory units interconnected with a switching array, the user program comprising a plurality of tensor-based algebraic expressions; converting the plurality of tensor-based algebraic expressions to an intermediate representation comprising a plurality of stages, including a first stage and a second stage which is adjacent to the first stage, each stage comprising one or more logical operations executable via dataflow through one or more compute units of the grid of compute units, each stage preceded by and followed by a buffer, each buffer corresponding to one or more memory units within the grid of memory units; detecting a memory mapping operation within the first stage; and moving the memory mapping operation to the second stage; wherein the memory mapping operation is executable by the one or more memory units within the second stage and wherein dataflow through the buffer is controlled by one or more memory units within the grid of memory units.

20. The computer program product of claim 19, wherein the first stage has a highest latency among the plurality of stages.

Patent Metadata

Filing Date

Unknown

Publication Date

September 30, 2025

Inventors

Adam BORDELON

David Alan KOEPLINGER
