Systolic Arithmetic on Sparse Data

Published: July 15, 2025

Assignee: not available in USPTO data we have

Inventors: ABHISHEK R. APPU, PRASOONKUMAR SURTI, JILL BOYCE, SUBRAMANIAM MAIYURAN, MICHAEL APODACA, and 6 more

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A parallel processor comprising: a cache memory; and a processing cluster coupled with the cache memory, the processing cluster including a plurality of multiprocessors coupled with a data interconnect, wherein a multiprocessor of the plurality of multiprocessors includes a tensor core configured to: load tensor data and metadata associated with the tensor data from the cache memory, wherein the metadata indicates a first numerical transform applied to the tensor data; perform an inverse transform of the first numerical transform; perform a tensor operation on the tensor data after the inverse transform is performed; and write output of the tensor operation to a memory coupled with the processing cluster.

2. The parallel processor of claim 1, wherein to write output of the tensor operation to the memory coupled with the processing cluster includes to write the output to a main memory of the parallel processor.

3. The parallel processor of claim 2, wherein to write the output of the tensor operation to the main memory of the parallel processor includes to write the output to the cache memory.

4. The parallel processor of claim 1, the tensor core configured to: decompress or decode the tensor data before the inverse transform is applied to the tensor data.

5. The parallel processor of claim 1, the tensor operation performed on the tensor data associated with an inference operation performed via the tensor core.

6. The parallel processor of claim 5, the tensor core configured to compress or encode the output of the tensor operation within the tensor core after performing the numerical transform on the tensor data.

7. The parallel processor of claim 1, the tensor core configured to apply a second numerical transform to the output of the tensor operation and generate metadata to indicate that the second numerical transform is applied to the output of the tensor operation.

8. The parallel processor of claim 7, the tensor core configured to: apply the first numerical transform to at least a portion of the output of the tensor operation to generate first test transform data; apply the second numerical transform to at least a portion of the output of the tensor operation to generate second test transform data; determine compressibility metrics based on analysis of the first test transform data and the second test transform data; and apply the second transform to the output of the tensor operation based on the compressibility metrics.

9. The parallel processor of claim 8, wherein to write the output of the tensor operation to the memory of the processing cluster includes to: compress or encode the output of the tensor operation after the second transform is applied; and write the output of the tensor operation after compression or encoding is applied to the output of the tensor operation.

10. The parallel processor of claim 9, wherein the first numerical transform or the second numerical transform is selected from a set of numerical transforms including a discrete cosine transform, a discrete sine transform, a bit-flip transform, and a bit-rotate transform.

11. A method comprising: performing numerical operations to train a neural network model via a tensor core, including generating a first matrix of weights associated with the neural network model; applying a numerical transform to the first matrix of weights to generate a set of transformed weights and a transform type, wherein the transform type identifies the numerical transform applied to the first matrix of weights, the first matrix of weights is a sparse matrix, and the transformed weights compress to a higher compression ratio than the first matrix of weights; and applying a numerical inverse transform to the transformed weights to generate a second matrix of weights, wherein the numerical inverse transform to perform is identified via the transform type associated with the set of transformed weights.

12. The method of claim 11, further comprising: applying a first numerical transform to at least a portion of the first matrix of weights to generate first test transform data; applying a second numerical transform to at least a portion of the first matrix of weights to generate second test transform data; determining compressibility metrics based on analysis of the first test transform data and the second test transform data; and sending a recommended transform to enable selection of a numerical transform to apply to the first matrix of weights.

13. The method of claim 12, wherein the numerical transform is selected from a set of numerical transforms including a discrete cosine transform, a discrete sine transform, a bit-flip transform, and a bit-rotate transform.

14. A graphics processing system comprising: a memory device; and a graphics processor coupled with the memory device, the graphics processor including a cache memory and a processing cluster coupled with the cache memory, the processing cluster including a plurality of multiprocessors coupled with a data interconnect, wherein a multiprocessor of the plurality of multiprocessors includes a tensor core configured to: load tensor data and metadata associated with the tensor data from the cache memory, wherein the metadata indicates a first numerical transform applied to the tensor data; perform an inverse transform of the first numerical transform; perform a tensor operation on the tensor data after the inverse transform is performed; and write output of the tensor operation to a memory coupled with the processing cluster.

15. The graphics processing system of claim 14, wherein to write output of the tensor operation to the memory coupled with the processing cluster includes to write the output to the memory device.

16. The graphics processing system of claim 15, wherein to write the output of the tensor operation to the memory device.

17. The graphics processing system of claim 14, the tensor core configured to: decompress or decode the tensor data before the inverse transform is applied to the tensor data.

18. The graphics processing system of claim 14, the tensor operation performed on the tensor data associated with an inference operation performed via the tensor core.

19. The graphics processing system of claim 18, the tensor core configured to compress or encode the output of the tensor operation within the tensor core after performing the numerical transform on the tensor data.

20. The graphics processing system of claim 14, the tensor core configured to apply a second numerical transform to the output of the tensor operation and generate metadata to indicate that the second numerical transform is applied to the output of the tensor operation.
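The transform-selection and round-trip flow recited in claims 1, 4, 7 through 10, and 11 through 13 can be illustrated in software. The following is a minimal sketch, not the claimed hardware: it assumes byte-wise versions of the bit-flip and bit-rotate transforms, uses `zlib` as a stand-in compressor, takes compressed size as the compressibility metric, and uses hypothetical names (`choose_transform`, `encode`, `decode`, `sample_bytes`) that do not appear in the patent.

```python
import zlib


def bit_flip(data: bytes) -> bytes:
    """Invert every bit; applying it twice restores the input."""
    return bytes(b ^ 0xFF for b in data)


def rotate_left(data: bytes, k: int = 1) -> bytes:
    """Rotate each byte left by k bits (assumed byte-wise granularity)."""
    k %= 8
    return bytes(((b << k) | (b >> (8 - k))) & 0xFF for b in data)


def rotate_right(data: bytes, k: int = 1) -> bytes:
    """Inverse of rotate_left for the same k."""
    return rotate_left(data, 8 - (k % 8))


# Transform type -> (forward transform, inverse transform).
# The key plays the role of the metadata that identifies the transform.
TRANSFORMS = {
    "bit_flip": (bit_flip, bit_flip),
    "bit_rotate": (rotate_left, rotate_right),
}


def choose_transform(data: bytes, sample_bytes: int = 4096):
    """Apply each candidate to a portion of the data and compare metrics."""
    sample = data[:sample_bytes]
    metrics = {
        name: len(zlib.compress(forward(sample)))
        for name, (forward, _) in TRANSFORMS.items()
    }
    best = min(metrics, key=metrics.get)  # smallest compressed size wins
    return best, metrics


def encode(data: bytes):
    """Transform, then compress; return the transform type with the payload."""
    name, _ = choose_transform(data)
    forward, _ = TRANSFORMS[name]
    return name, zlib.compress(forward(data))


def decode(name: str, payload: bytes) -> bytes:
    """Decompress, then apply the inverse transform named by the metadata."""
    _, inverse = TRANSFORMS[name]
    return inverse(zlib.decompress(payload))
```

In this sketch the transform type travels alongside the compressed payload, so the consumer can pick the matching inverse without inspecting the data, mirroring the order in the claims: decompress first, then invert the transform, then operate on the restored values.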

Patent Metadata

Filing Date: Unknown

Publication Date: July 15, 2025

Inventors: ABHISHEK R. APPU, PRASOONKUMAR SURTI, JILL BOYCE, SUBRAMANIAM MAIYURAN, MICHAEL APODACA, ADAM T. LAKE, JAMES HOLLAND, VASANTH RANGANATHAN, ALTUG KOKER, LIDONG XU, NIKOS KABURLASOS
