A usage model and the underlying technology for providing sequence analysis as part of a relational database system. Components include the semantic and syntactic integration of the sequence analysis with an existing query language, the storage methods for the sequence data, and the design of a multipart execution scheme that runs the sequence analysis as part of a potentially larger database query, especially using parallel execution techniques.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method for sequence analysis comprising: storing at least one subject sequence as a relation in a first data object of a relational database; determining an instruction execution plan for an instruction of a query language associated with an operation of the relational database that performs a comparison of at least one relational query sequence within a second data object of the relational database against the at least one relational subject sequence within the first data object, the instruction execution plan having a first part that comprises at least one relational database operation that uses the at least one relational query sequence of the second data object and a third data object of the relational database comprising controls, and the instruction execution plan also having a subsequent part that comprises at least one relational database operation including a JOIN operation, such that the instruction execution plan has multiple parts and includes steps for evaluating data in one or more sequences to determine which of the subject sequence or query sequence is to be specified as an inner table or outer table for the JOIN operation; executing the at least one relational database operation in the first part of the instruction execution plan in a data processing unit; storing a data output result relation of execution of the first part of the instruction execution plan as a database relation in the relational database; using the data output result relation of execution of the first part of the instruction execution plan and the at least one subject sequence of the first data object as at least two data inputs to the at least one relational database operation in the subsequent part of the plan; and executing the at least one relational database operation in the subsequent part of the plan in a data processing unit to complete the comparison.
2. A method as in claim 1 wherein the instruction execution plan specifies that multiple query sequences are to be compared, and wherein the selection of query sequences is carried out in a predetermined order as defined by the instruction execution plan.
3. A method as in claim 1 wherein a data processing system for carrying out the instruction execution plan has multiple processing units, and wherein the instruction execution plan further specifies which instructions in the instruction execution plan are to be distributed to designated ones of the processing units for execution.
4. A method as in claim 3 wherein the instruction execution plan further comprises sequence data broadcast instructions to broadcast data to two or more of the processing units.
5. A method as in claim 3 further comprising: returning at least one relation that represents results of a first part of the instruction execution plan performed by the multiple processing units to a host processing unit; and broadcasting at least one such relation from the host processing unit to processing units specified for carrying out the subsequent part of the instruction execution plan.
6. A method as in claim 3 wherein one or more parts of the instruction execution plan are compiled on a host processing unit prior to distribution for execution on other processing units.
7. A method as in claim 3 wherein the instruction execution plan parts are realized as a set of relational database query-specific instruction statements that are compiled and dynamically bound to the processing units at execution time.
8. A method as in claim 3 where the comparison of at least one relational query sequence against at least one relational subject sequence is carried out by distributing the at least one relational subject sequence evenly across multiple processing units.
9. A method as in claim 8 wherein the at least one relational subject sequence is distributed as whole records to the multiple processing units.
10. A method as in claim 9 wherein the subject sequences are sized such that a number of record bytes distributed to a given processing unit is equal to a number of record bytes distributed to other processing units.
11. A method as in claim 9 wherein records representing subject sequences are duplicated on at least some processing units.
12. A method as in claim 9 wherein each processing unit contains a complete copy of a subject sequence, and each processing unit receives a unique query sequence.
13. A method as in claim 9 wherein the subject sequences are duplicated among processing units by broadcasting from a central processing unit.
14. A method as in claim 1 wherein the subsequent part of the instruction execution plan includes a relational database operation selected from a group consisting of SORT, AGGREGATE, SCAN, PROJECT, and RESTRICT.
15. A method as in claim 1 wherein the relational database operation in the first part of the instruction execution plan includes a SELECT operation.
16. A method as in claim 1 wherein the sequences represent data selected from a group consisting of nucleic acid, amino acid, and protein identifiers.
17. A method as in claim 1 wherein the sequences represent non-biological data.
18. A method as in claim 1 wherein neither the subject sequence nor the query sequence is stored as a static materialized relational database definition.
19. A method as in claim 1 wherein the comparison operation carried out by the plan determines a degree of similarity of the at least one query sequence to a portion of the at least one subject sequence.
20. A method as in claim 1 wherein the instruction execution plan specifies streaming operations, coupled with tuple set operations, to compare the query sequence against the subject sequence for optimizing performance.
21. A method as in claim 1 wherein the at least one relational database operation of the first part of the instruction execution plan comprises a cross-product join of a control table and the at least one query sequence, and the at least one relational database operation of the second part of the instruction execution plan comprises a join of the resultant cross-product join of the first part of the instruction plan and the at least one subject sequence to produce a result set.
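The two-part plan recited in claims 1 and 21 can be illustrated with a small sketch. It is not the patented implementation: it uses Python's standard-library sqlite3 module on a single process, and the table names (subject_seqs, query_seqs, controls, part1), the toy k-mer scoring function, and the "fewer total bytes goes outer" rule are all assumptions made up for illustration. Part one builds the cross-product of the control table and the query sequences and stores it as a relation; part two joins that stored relation against the subject sequences.

```python
import sqlite3


def shared_kmers(query: str, subject: str, k: int) -> int:
    """Toy similarity (an assumption, not the patent's comparison):
    count distinct length-k substrings of `query` that occur in `subject`."""
    kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return sum(1 for kmer in kmers if kmer in subject)


def choose_outer(conn, left, right):
    """Evaluate the sequence data and return (outer, inner) table names.
    Hypothetical rule: the relation with fewer total sequence bytes is outer."""
    def total_bytes(table):
        sql = f"SELECT COALESCE(SUM(LENGTH(seq)), 0) FROM {table}"
        return conn.execute(sql).fetchone()[0]
    return (left, right) if total_bytes(left) <= total_bytes(right) else (right, left)


def run_plan(conn):
    # Expose the comparison to SQL as a three-argument user-defined function.
    conn.create_function("shared_kmers", 3, shared_kmers)

    # Part 1: cross-product of controls and query sequences,
    # with the output stored as a relation in the database.
    conn.execute("DROP TABLE IF EXISTS part1")
    conn.execute(
        "CREATE TABLE part1 AS "
        "SELECT q.id AS query_id, q.seq AS seq, c.k AS k, c.min_score AS min_score "
        "FROM query_seqs AS q CROSS JOIN controls AS c"
    )

    # Decide which input is written first in the JOIN. Whether a given
    # engine honors this ordering is engine-specific.
    outer, inner = choose_outer(conn, "part1", "subject_seqs")
    alias = {"part1": "p", "subject_seqs": "s"}

    # Part 2: JOIN the stored part-1 relation with the subject sequences.
    sql = (
        "SELECT p.query_id, s.id AS subject_id, "
        "shared_kmers(p.seq, s.seq, p.k) AS score "
        f"FROM {outer} AS {alias[outer]} JOIN {inner} AS {alias[inner]} "
        "ON shared_kmers(p.seq, s.seq, p.k) >= p.min_score "
        "ORDER BY p.query_id, s.id"
    )
    return outer, conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subject_seqs (id TEXT PRIMARY KEY, seq TEXT);
CREATE TABLE query_seqs (id TEXT PRIMARY KEY, seq TEXT);
CREATE TABLE controls (k INTEGER, min_score INTEGER);
INSERT INTO subject_seqs VALUES ('s1', 'ACGTACGTGG'), ('s2', 'TTTTCCCCAA');
INSERT INTO query_seqs VALUES ('q1', 'ACGTAC'), ('q2', 'CCCCAA');
INSERT INTO controls VALUES (4, 2);
""")

outer_table, hits = run_plan(conn)
print(outer_table)  # part1
print(hits)         # [('q1', 's1', 3), ('q2', 's2', 3)]
```

Storing the part-one output as its own table mirrors the claim language about keeping the intermediate result as a database relation before the subsequent part consumes it. A parallel system of the kind described in claims 3 through 13 would additionally distribute or broadcast these relations across processing units, which this single-process sketch does not attempt.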
April 20, 2010
August 12, 2014