US-10991452

Hardware acceleration of short read mapping for genomic and other types of analyses

Published: April 27, 2021

Assignee: not available in the USPTO data

Inventors: not available in the USPTO data

Technical Abstract

A scalable FPGA-based solution to the short read mapping problem in DNA sequencing is disclosed which greatly accelerates the task of aligning short length reads to a known reference genome. A representative system comprises one or more memory circuits storing a plurality of short reads and a reference genome sequence; and one or more field programmable gate arrays configured to select a short read; to extract a plurality of seeds from the short read, each seed comprising a genetic subsequence of the short read; for each seed, to determine at least one candidate alignment location (CAL) in the reference genome sequence to form a plurality of CALs; for each CAL, to determine a likelihood of the short read matching the reference genome sequence in the vicinity of the CAL; and to select one or more CALs having the currently greater likelihood of the short read matching the reference genome sequence.
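The flow described in the abstract (extract seeds, look up candidate alignment locations, score each one, keep the best) can be illustrated in software. The following is a minimal Python sketch under stated assumptions, not the patented FPGA implementation: the function names, the toy seed length `SEED_LEN`, the `slack` margin, the plain dictionary index, the forward-strand-only lookup, and the scoring parameters are all hypothetical choices made for illustration. A Smith-Waterman local alignment score stands in for the "likelihood" of a match.

```python
from collections import defaultdict

SEED_LEN = 4  # hypothetical toy seed length, chosen only for this example


def extract_seeds(read, seed_len=SEED_LEN):
    """Yield (offset, seed) for every seed_len-long subsequence of the read."""
    for offset in range(len(read) - seed_len + 1):
        yield offset, read[offset:offset + seed_len]


def build_index(reference, seed_len=SEED_LEN):
    """Map each seed in the reference to the positions where it occurs.

    Assumption: a plain dictionary replaces the hashed index of the claims.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def candidate_locations(read, index, seed_len=SEED_LEN):
    """Collect candidate alignment locations (CALs) for a read.

    Each hit is normalized to where the first base of the read would fall;
    the set removes redundant CALs produced by different seeds.
    """
    cals = set()
    for offset, seed in extract_seeds(read, seed_len):
        for pos in index.get(seed, ()):
            cals.add(pos - offset)
    return sorted(cals)


def smith_waterman(read, window, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score (linear gap penalty)."""
    prev = [0] * (len(window) + 1)
    best = 0
    for r in read:
        cur = [0]
        for j, w in enumerate(window, start=1):
            diag = prev[j - 1] + (match if r == w else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def map_read(read, reference, index, slack=2):
    """Score the reference window around each CAL and keep the best one."""
    best_cal, best_score = None, -1
    for cal in candidate_locations(read, index):
        start = max(0, cal - slack)
        end = min(len(reference), cal + len(read) + slack)
        score = smith_waterman(read, reference[start:end])
        if score > best_score:  # keep the currently greater likelihood
            best_cal, best_score = cal, score
    return best_cal, best_score


reference = "ACGTTGCAAGGCTTACCGGATAGC"
index = build_index(reference)
print(map_read("AGGCTTAC", reference, index))
```

In this toy run every seed of the read points to the same location, so a single window is scored. On real data many CALs are usually scored per read, which is the step the disclosure offloads to parallel hardware.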

Patent Claims

30 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for acceleration in a system of short read mapping to a reference genome sequence for genomic analysis, the system having at least one host computing system, one or more field programmable gate arrays, and one or more memory circuits, the method comprising: using the one or more field programmable gate arrays, selecting a short read from a plurality of short reads, each short read of the plurality of short reads comprising a sequence of a plurality of genetic bases; using the one or more field programmable gate arrays, extracting a plurality of seeds from the selected short read, each seed of the plurality of seeds comprising a genetic subsequence of the selected short read; using the one or more field programmable gate arrays, for each seed of the plurality of seeds, determining at least one candidate alignment location in the reference genome sequence to form a plurality of candidate alignment locations; using the one or more field programmable gate arrays, for each candidate alignment location of the plurality of candidate alignment locations, determining a likelihood of the selected short read matching the reference genome sequence in a vicinity of the candidate alignment location; and using the one or more field programmable gate arrays, selecting one or more candidate alignment locations, of the plurality of candidate alignment locations, having a currently greater likelihood of the selected short read matching the reference genome sequence.

2. The method of claim 1, wherein the step of determining at least one candidate alignment location further comprises: using the one or more field programmable gate arrays, accessing a reference genome index using a selected seed of the plurality of seeds.

3. The method of claim 2, further comprising: using the host computing system, partitioning the reference genome index over a plurality of memories to form a plurality of reference genome index partitions.

4. The method of claim 3, wherein the step of selecting one or more candidate alignment locations having the currently greater likelihood further comprises: using the one or more field programmable gate arrays, selecting one or more first candidate alignment locations having a currently greater one or more first likelihoods from a first reference genome index partition of the plurality of reference genome index partitions; using the one or more field programmable gate arrays, comparing the one or more first likelihoods with one or more second likelihoods of one or more second candidate alignment locations from a second reference genome index partition of the plurality of reference genome index partitions; and using the one or more field programmable gate arrays, selecting the one or more candidate alignment locations having the currently greater likelihood of the one or more first and second likelihoods.

5. The method of claim 4, further comprising: using the one or more field programmable gate arrays, transferring the one or more candidate alignment locations having the currently greater likelihood for mapping of the selected short read using a next, third partition of the plurality of reference genome index partitions.

6. The method of claim 2, wherein the step of accessing the reference genome index further comprises: using the one or more field programmable gate arrays, hashing the selected seed, and using the hashed seed to access the reference genome index.

7. The method of claim 6, wherein the step of hashing the selected seed further comprises: using the one or more field programmable gate arrays, generating a forward sequence and a reverse complement sequence for the selected seed; using the one or more field programmable gate arrays, determining which of the forward sequence or the reverse complement sequence is lexicographically smaller; using the one or more field programmable gate arrays, hashing the lexicographically smaller sequence to produce a hash result; and using the one or more field programmable gate arrays, using the hash result as the hashed seed to access the reference genome index.

8. The method of claim 2, wherein the reference genome index comprises a pointer table and a candidate alignment location table.

9. The method of claim 8, wherein each entry of the pointer table comprises a first predetermined number of most significant bits of a hashed seed and a pointer to a corresponding part of the candidate alignment location table.

10. The method of claim 9, wherein each entry of the candidate alignment location table comprises a second predetermined number of least significant bits of the hashed seed and a corresponding candidate alignment location.

11. The method of claim 10, wherein the step of accessing the reference genome index further comprises: using the one or more field programmable gate arrays, using the first predetermined number of the most significant bits of a selected hashed seed, accessing the pointer table to obtain a corresponding pointer; and using the one or more field programmable gate arrays, using the corresponding pointer and the second predetermined number of the least significant bits of the selected hashed seed, determining the candidate alignment location.

12. The method of claim 2, further comprising: using the host computing system, creating the reference genome index.

13. The method of claim 12, further comprising: using the host computing system, determining all <seed, location> tuples in the reference genome sequence to form a plurality of <seed, location> tuples; using the host computing system, sorting and eliminating redundant tuples from the plurality of <seed, location> tuples; using the host computing system, for each seed of the plurality of seeds, generating a forward sequence and a reverse complement sequence and determining which of the forward sequence or the reverse complement sequence is lexicographically smaller; using the host computing system, for each seed of the plurality of seeds, hashing the lexicographically smaller sequence to produce a hash result; and using the host computing system, for each seed of the plurality of seeds, using the hash result as a hashed seed for the reference genome index.

14. The method of claim 13, further comprising: using the host computing system, creating a pointer table, each entry of the pointer table comprising a first predetermined number of most significant bits of a selected hashed seed and a corresponding pointer; and using the host computing system, creating a candidate alignment location table, each entry of the candidate alignment location table comprising a second predetermined number of least significant bits of the selected hashed seed and a corresponding candidate alignment location.

15. The method of claim 1, further comprising: using the one or more field programmable gate arrays, filtering the plurality of candidate alignment locations to eliminate any redundant candidate alignment locations.

16. The method of claim 1, wherein the step of determining the likelihood of the short read matching the reference genome sequence further comprises: using the one or more field programmable gate arrays, performing a Smith-Waterman string matching of the short read with the reference genome sequence.

17. The method of claim 16, further comprising: using the host computing system or using the one or more field programmable gate arrays, instantiating a plurality of Smith-Waterman engines in the field programmable gate array.

18. The method of claim 1, wherein the step of determining a likelihood of the selected short read matching the reference genome sequence in the vicinity of the candidate alignment location further comprises: using the one or more field programmable gate arrays, determining the vicinity of the candidate alignment location as a sequence beginning at the start of the candidate alignment location minus a predetermined offset and extending through the end of the candidate alignment location plus a length of the selected short read and the predetermined offset.

19. The method of claim 1, further comprising: using a plurality of field programmable gate arrays, performing the selection, extraction, and determination steps in parallel.

20. A system for acceleration of short read mapping to a reference genome sequence for genomic analysis, the system coupled to a host computing system, the system comprising: one or more memory circuits storing a plurality of short reads and a reference genome sequence, each short read of the plurality of short reads comprising a sequence of a plurality of genetic bases, and further storing a reference genome index partitioned over the one or more memory circuits to form a plurality of reference genome index partitions; and one or more field programmable gate arrays coupled to the one or more memory circuits, the one or more field programmable gate arrays configured to select a short read from the plurality of short reads; to extract a plurality of seeds from the selected short read, each seed of the plurality of seeds comprising a genetic subsequence of the selected short read; to hash a selected seed, of the plurality of seeds, and use the hashed seed to access at least one reference genome index partition of the plurality of reference genome index partitions to determine at least one candidate alignment location in the reference genome sequence, for each seed of the plurality of seeds, to form a plurality of candidate alignment locations; for each candidate alignment location of the plurality of candidate alignment locations, to determine a likelihood of the selected short read matching the reference genome sequence in a vicinity of the candidate alignment location; and to select one or more candidate alignment locations, of the plurality of candidate alignment locations, having a currently greater likelihood of the selected short read matching the reference genome sequence.

21. The system of claim 20, wherein the one or more field programmable gate arrays are further configured to select one or more first candidate alignment locations having a currently greater one or more first likelihoods from a first reference genome index partition of the plurality of reference genome index partitions; to compare the one or more first likelihoods with one or more second likelihoods of one or more second candidate alignment locations from a second reference genome index partition of the plurality of reference genome index partitions; to select the one or more candidate alignment locations having the currently greater likelihood of the one or more first and second likelihoods; and to transfer the one or more candidate alignment locations having the currently greater likelihood for mapping of the selected short read using a next, third partition of the plurality of reference genome index partitions.

22. The system of claim 21, wherein the one or more field programmable gate arrays are further configured to generate a forward sequence and a reverse complement sequence for the selected seed; to determine which of the forward sequence or the reverse complement sequence is lexicographically smaller; to hash the lexicographically smaller sequence to produce a hash result; and to use the hash result as the hashed seed to access the reference genome index.

23. The system of claim 20, wherein the reference genome index comprises a pointer table and a candidate alignment location table, wherein each entry of the pointer table comprises a first predetermined number of most significant bits of a hashed seed and a pointer to a corresponding part of the candidate alignment location table; and wherein each entry of the candidate alignment location table comprises a second predetermined number of least significant bits of the hashed seed and a corresponding candidate alignment location.

24. The system of claim 23, wherein the one or more field programmable gate arrays are further configured to use the first predetermined number of the most significant bits of a selected hashed seed to access the pointer table to obtain a corresponding pointer; and to use the corresponding pointer and the second predetermined number of the least significant bits of the selected hashed seed to determine the candidate alignment location.

25. The system of claim 20, wherein the host computing system is adapted to create the reference genome index; and wherein the host computing system is further adapted to determine all <seed, location> tuples in the reference genome sequence to form a plurality of <seed, location> tuples; to sort and eliminate redundant tuples from the plurality of <seed, location> tuples; for each seed of the plurality of seeds, to generate a forward sequence and a reverse complement sequence and determine which of the forward sequence or the reverse complement sequence is lexicographically smaller; for each seed of the plurality of seeds, to hash the lexicographically smaller sequence to produce a hash result; and for each seed of the plurality of seeds, to use the hash result as the hashed seed for the reference genome index.

26. The system of claim 25, wherein the host computing system is further adapted to create a pointer table, each entry of the pointer table comprising a first predetermined number of most significant bits of a selected hashed seed and a corresponding pointer; and create a candidate alignment location table, each entry of the candidate alignment location table comprising a second predetermined number of least significant bits of the selected hashed seed and a corresponding candidate alignment location; and wherein the one or more field programmable gate arrays are further configured to filter the plurality of candidate alignment locations to eliminate any redundant candidate alignment locations.

27. The system of claim 20, wherein the one or more field programmable gate arrays are further configured to determine the vicinity of the candidate alignment location as a sequence beginning at the start of the candidate alignment location minus a predetermined offset and extending through the end of the candidate alignment location plus a length of the selected short read and the predetermined offset.

28. The system of claim 20, wherein the one or more field programmable gate arrays are further configured to perform the selections, extraction, and determinations in parallel.

29. The system of claim 20, wherein the reference genome sequence is divided into a plurality of reference blocks, each reference block having a size corresponding to a single read from the one or more memory circuits.

30. A system for acceleration of short read mapping to a reference genome sequence for genomic analysis, the system coupled to a host computing system, the system comprising: one or more memory circuits storing a plurality of short reads and a reference genome sequence divided into a plurality of reference blocks, each reference block having a size corresponding to a single read from the one or more memory circuits; and further storing a reference genome index, each short read comprising a sequence of a plurality of genetic bases, the reference genome index comprising a pointer table and a candidate alignment location table; and one or more field programmable gate arrays coupled to the one or more memory circuits, the one or more field programmable gate arrays configured to select a short read from the plurality of short reads; to extract a plurality of seeds from the selected short read, each seed of the plurality of seeds comprising a genetic subsequence of the selected short read; for each seed of the plurality of seeds, to generate a forward sequence and a reverse complement sequence for a selected seed of the plurality of seeds and determine which of the forward sequence or the reverse complement sequence is lexicographically smaller; to hash the lexicographically smaller sequence to produce a hash result; to use the hash result to access the reference genome index to determine a candidate alignment location in the reference genome sequence to form a plurality of candidate alignment locations; for each candidate alignment location of the plurality of candidate alignment locations, to perform string matching to determine a likelihood of the short read matching the reference genome sequence in a vicinity of the candidate alignment location; and to select a candidate alignment location, of the plurality of candidate alignment locations, having a currently greatest likelihood of the short read matching the reference genome sequence.
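Several claims describe a two-level reference genome index: the lexicographically smaller of a seed and its reverse complement is hashed, the most significant bits of the hash select a pointer table entry, and the least significant bits are matched against entries in the candidate alignment location (CAL) table. The Python sketch below illustrates that idea under stated assumptions and is not the claimed implementation. The bit widths `ADDR_BITS` and `KEY_BITS`, the seed length `SEED_LEN`, the function names, and the use of `zlib.crc32` as the hash are all hypothetical. As a simplification, the most significant bits are the array position of a pointer table entry rather than a stored field.

```python
import zlib

ADDR_BITS = 8  # hypothetical count of most significant bits (pointer table address)
KEY_BITS = 8   # hypothetical count of least significant bits (key in CAL table)
SEED_LEN = 6   # hypothetical toy seed length

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(seed):
    """Return the lexicographically smaller of a seed and its reverse complement."""
    reverse_complement = seed.translate(_COMPLEMENT)[::-1]
    return min(seed, reverse_complement)


def hash_seed(seed):
    """Hash the canonical seed and split the result into (address, key).

    Assumption: CRC-32 is only a stand-in for whatever hash the hardware uses.
    """
    mask = (1 << (ADDR_BITS + KEY_BITS)) - 1
    h = zlib.crc32(canonical(seed).encode("ascii")) & mask
    return h >> KEY_BITS, h & ((1 << KEY_BITS) - 1)


def build_index(reference):
    """Build the pointer table and CAL table from all seeds in the reference."""
    entries = []
    for loc in range(len(reference) - SEED_LEN + 1):
        addr, key = hash_seed(reference[loc:loc + SEED_LEN])
        entries.append((addr, key, loc))
    entries.sort()  # groups entries by address, then by key
    cal_table = [(key, loc) for _, key, loc in entries]
    # pointer_table[a] is the start of bucket a; pointer_table[a + 1] is its end.
    pointer_table = [0] * ((1 << ADDR_BITS) + 1)
    for addr, _, _ in entries:
        pointer_table[addr + 1] += 1
    for i in range(1, len(pointer_table)):
        pointer_table[i] += pointer_table[i - 1]
    return pointer_table, cal_table


def lookup(seed, pointer_table, cal_table):
    """Return the CALs whose stored key matches the hashed seed."""
    addr, key = hash_seed(seed)
    start, end = pointer_table[addr], pointer_table[addr + 1]
    return [loc for k, loc in cal_table[start:end] if k == key]


reference = "ACGTTGCAAGGCTTACCGGATAGC"
pointer_table, cal_table = build_index(reference)
print(lookup("AAGGCT", pointer_table, cal_table))  # forward-strand seed
print(lookup("AGCCTT", pointer_table, cal_table))  # its reverse complement
```

Because both strands hash to the same canonical value, the two lookups return the same list. Distinct seeds that happen to share a hash also land in the same bucket, so a lookup can return extra locations; the subsequent alignment scoring step is what rejects them.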

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G16B C12Q

Patent Metadata

Filing Date

August 4, 2017

Publication Date

April 27, 2021
