Organizing, Joining, and Performing Statistical Calculations on Massive Sets of Data

Published: January 13, 2015

Assignee: not available in the USPTO data we have

Inventors: Srinivas S. Vemuri, Maneesh Varshney, Krishna P. Puttaswamy Naga, Rui Liu

Patent Claims

28 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of correlating multiple multi-dimensional datasets, the method comprising: selecting one or more dimensions of a first dataset as a partition key; selecting a cost constraint; dividing the first dataset into a first set of blocks with one or more computers, said dividing comprising, for each block in the first set of blocks: associating with the block a distinct subset of partition key values such that a combination of all records of the first dataset having partition key values within the associated subset of partition key values satisfies the cost constraint; collecting all records of the first dataset having partition key values within the associated subset of partition key values; sorting the collected records using a sort key comprising one or more of the dataset dimensions other than the partition key; and writing the block to storage after said collecting and said sorting; dividing a second multi-dimensional dataset that includes the partition key into a second set of blocks; and correlating the first dataset and the second dataset by: for each pair of corresponding blocks in the first set of blocks and the second set of blocks, the corresponding blocks consisting of a first block in the first set of blocks and a second block in the second set of blocks: storing the first block in memory; for each of multiple sub-blocks of the second block, correlating the sub-block with the first block; and aggregating the correlations between the first block and each of the multiple sub-blocks of the second block.

2. The method of claim 1, wherein the cost constraint is a maximum storage size of the collected records.

3. The method of claim 2, wherein the maximum storage size is configured to allow the collected records to be stored in memory by a single computer process.

4. The method of claim 1, further comprising, for each block in the first set of blocks: updating an index to identify: the block; the subset of partition key values associated with the block; and the storage location of the block.

5. The method of claim 4, wherein a plurality of the blocks in the first set of blocks is stored in a single file.

6. The method of claim 4, further comprising incrementally updating the first set of blocks by: receiving an incremental update to the first dataset; dividing the incremental update according to the index to form incremental blocks corresponding to one or more blocks of the first set of blocks; and merging the incremental blocks with corresponding blocks of the first set of blocks; wherein a given incremental block corresponds to a block of the first set of blocks having the same subset of partition key values.

7. The method of claim 1, wherein dividing the second multi-dimensional dataset comprises: for each record in the second dataset, using the partition key value of the record to assign the record to a block in the second set of blocks; and for each block in the second set of blocks, sorting the records assigned to the block using a second sort key; wherein each block in the second set of blocks corresponds to a block in the first set of blocks and is associated with the same partition key values as the corresponding block in the first set of blocks; and wherein a given record in the second dataset is assigned to the block in the second set of blocks that is associated with partition key values that include the given record's partition key value.

8. The method of claim 1, wherein correlating a sub-block of the second block with the first block comprises: storing the sub-block in memory; joining the sub-block with each of a plurality of sub-blocks of the first block; and aggregating the plurality of joins.

9. The method of claim 1, further comprising: prior to said correlating: assembling a daily update to the first dataset after dividing the first dataset into the set of blocks; dividing the daily update into an update set of blocks corresponding to the first set of blocks; and storing the update set of blocks in memory; and only after said aggregating: physically merging each update block with its corresponding block in the first set of blocks.

10. The method of claim 1, wherein: the first dataset comprises computed metrics of users of an online service for a predetermined time period; the partition key comprises a user identifier dimension of the first dataset; and the sort key comprises a metric identifier dimension of the first dataset.

11. The method of claim 10, wherein: the cost constraint ensures each of the multiple blocks of the first set of blocks is able to fit into a memory space allocated to a computer process programmed to join a block of the first set of blocks with a block of a second set of blocks created by dividing a second multi-dimensional dataset comprising the partition key.
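The dividing step of claim 1, together with the size constraint of claim 2, the index of claim 4, and the user and metric keys of claim 10, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the names `Record` and `divide_into_blocks`, measuring cost as a record count rather than bytes, packing partition key values greedily in sorted order, and using in-memory lists with a made-up location string in place of real storage.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    user_id: int     # partition key (hypothetical, after claim 10)
    metric_id: str   # sort key (hypothetical, after claim 10)
    value: float


def divide_into_blocks(records, max_records):
    """Divide records into blocks whose key subsets are disjoint.

    Returns (blocks, index). Cost is approximated as a record count.
    A single key with more than max_records records still gets its own
    (oversized) block; this sketch does not handle that case.
    """
    # Group records by partition key so each key lands in exactly one block.
    groups = defaultdict(list)
    for rec in records:
        groups[rec.user_id].append(rec)

    # Greedy packing (an assumption): close the current key subset when
    # adding the next key's records would exceed the cost constraint.
    subsets, current, size = [], [], 0
    for key in sorted(groups):
        n = len(groups[key])
        if current and size + n > max_records:
            subsets.append(current)
            current, size = [], 0
        current.append(key)
        size += n
    if current:
        subsets.append(current)

    blocks, index = [], []
    for keys in subsets:
        # Collect every record for the subset, then sort on the sort key.
        rows = [rec for k in keys for rec in groups[k]]
        rows.sort(key=lambda r: r.metric_id)
        # Index entry: block id, its key subset, and a stand-in location.
        index.append({
            "block_id": len(blocks),
            "keys": frozenset(keys),
            "location": f"block-{len(blocks)}",
        })
        blocks.append(rows)  # stands in for "writing the block to storage"
    return blocks, index
```

Because whole key groups are packed together, every record for a given user ends up in one block, which is what later lets corresponding blocks of two datasets be joined independently of all other blocks.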

12. A non-transitory computer-readable medium storing instructions that, when executed by a processor, cause the processor to perform a method of correlating multiple multi-dimensional datasets, the method comprising: selecting one or more dimensions of a first dataset as a partition key; dividing the first dataset with one or more computers, said dividing comprising, for each of multiple blocks in a first set of blocks: associating with the block a distinct subset of partition key values such that a combination of all records of the first dataset having partition key values within the associated subset of partition key values satisfies the cost constraint; collecting all records of the first dataset having partition key values within the associated subset of partition key values; sorting the collected records using a sort key comprising one or more of the dataset dimensions other than the partition key; and writing the block to storage after said collecting and said sorting; dividing a second multi-dimensional dataset that includes the partition key into a second set of blocks; and correlating the first dataset and the second dataset by: for each pair of corresponding blocks in the first set of blocks and the second set of blocks, the corresponding blocks consisting of a first block in the first set of blocks and a second block in the second set of blocks: storing the first block in memory; for each of multiple sub-blocks of the second block, correlating the sub-block with the first block; and aggregating the correlations between the first block and each of the multiple sub-blocks of the second block.

13. A system, comprising: a first multi-dimensional dataset; one or more processors; and memory comprising instructions that, when executed by the one or more processors, cause the system to: select one or more dimensions of the first dataset as a partition key; select a cost constraint; divide the first dataset into a first set of blocks, said dividing comprising, for each block in the first set of blocks: associating with the block a distinct subset of partition key values such that a combination of all records of the first dataset having partition key values within the associated subset of partition key values satisfies the cost constraint; collecting all records of the first dataset having partition key values within the associated subset of partition key values; sorting the collected records using a sort key comprising one or more of the dataset dimensions other than the partition key; and writing the block to storage after said collecting and said sorting; divide a second multi-dimensional dataset that includes the partition key into a second set of blocks; and correlate the first dataset and the second dataset by: for each pair of corresponding blocks in the first set of blocks and the second set of blocks, the corresponding blocks consisting of a first block in the first set of blocks and a second block in the second set of blocks: storing the first block in memory; for each of multiple sub-blocks of the second block, correlating the sub-block with the first block; and aggregating the correlations between the first block and each of the multiple sub-blocks of the second block.

14. The system of claim 13, wherein the cost constraint is a maximum storage size of the collected records.

15. The system of claim 14, wherein the maximum storage size is configured to allow the collected records to be stored in a portion of the memory allocated to a single process executed by one of the one or more processors.

16. The system of claim 13, wherein the memory further comprises instructions that, when executed by the one or more processors, cause the system to, for each block in the first set of blocks: update an index to identify: the block; the subset of partition key values associated with the block; and the storage location of the block.

17. The system of claim 16, wherein the memory further comprises instructions that, when executed by the one or more processors, cause the system to incrementally update the first set of blocks by: receiving an incremental update to the first dataset; dividing the incremental update according to the index to form incremental blocks corresponding to one or more blocks of the first set of blocks; and merging the incremental blocks with corresponding blocks of the first set of blocks; wherein a given incremental block corresponds to a block of the first set of blocks having the same subset of partition key values.

18. The system of claim 14, wherein dividing the second multi-dimensional dataset comprises: for each record in the second dataset, using the partition key value of the record to assign the record to a block in the second set of blocks; and for each block in the second set of blocks, sorting the records assigned to the block using a second sort key; wherein each block in the second set of blocks corresponds to a block in the first set of blocks and is associated with the same partition key values as the corresponding block; and wherein a given record in the second dataset is assigned to the block in the second set of blocks that is associated with the partition key values that include the given record's partition key value.

19. The system of claim 13, wherein: the first dataset comprises computed metrics of users of an online service for a predetermined time period; the partition key comprises a user identifier dimension of the first dataset; and the sort key comprises a metric identifier dimension of the first dataset.

20. The system of claim 19, wherein: the cost constraint ensures each of the multiple blocks of the first set of blocks is separately able to fit into a memory space allocated to a process executed by one of the one or more processors to join a block of the first set of blocks with a block of the second set of blocks.
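The index-driven incremental update recited in claims 6 and 17 might look roughly like the following Python sketch. The function name `apply_incremental_update`, the dictionary-based record and block layout, and the choice to re-sort a block after appending are all assumptions for illustration; the claims only require that the update be divided according to the index and merged into the blocks with the same partition key subsets. Handling of partition key values that appear in no existing block is not addressed by those claims, so the sketch simply lets the lookup fail.

```python
def apply_incremental_update(blocks, index, update, partition_key, sort_key):
    """Merge an incremental update into existing blocks, in place.

    blocks: dict mapping block id -> list of record dicts
    index:  list of (block id, frozenset of partition key values)
    update: iterable of record dicts
    Returns the incremental blocks that were formed (block id -> records).
    """
    # Invert the index so each partition key value points at its block.
    key_to_block = {
        key: block_id for block_id, keys in index for key in keys
    }

    # Divide the update according to the index.
    incremental = {}
    for rec in update:
        # Raises KeyError for a key value unknown to the index (not handled).
        block_id = key_to_block[rec[partition_key]]
        incremental.setdefault(block_id, []).append(rec)

    # Merge each incremental block with the block that has the same
    # key subset, restoring sort order (a plain re-sort, as an assumption).
    for block_id, rows in incremental.items():
        merged = blocks[block_id] + rows
        merged.sort(key=lambda r: r[sort_key])
        blocks[block_id] = merged
    return incremental
```

Only blocks that actually receive update records are rewritten in this sketch; the others are left untouched.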

21. A computer-implemented method of correlating two multi-dimensional datasets, the method comprising: partitioning a first dataset into a first set of blocks: selecting as a partition key one or more fields common to the two datasets; and populating each block of the first set of blocks with all records of the first dataset having a partition key value included in a unique subset of partition key values corresponding to the block; wherein each record of the first dataset is included in no more than one block of the first set of blocks; partitioning a second dataset into a second set of blocks, by: associating each block in the second set of blocks with a block in the first set of blocks; and populating each block of the second set of blocks with all records of the second dataset having a partition key value included in the subset of partition key values corresponding to the associated block in the first set of blocks; and correlating the two datasets by: (a) opening a first block in the first set of blocks; (b) for each of multiple sub-blocks of the associated second block in the second set of blocks: 1. joining the sub-block with the first block; and 2. aggregating results of the joining with results of joining of other sub-blocks; and (c) repeating (a) through (b) for all other blocks in the first set of blocks.

22. The method of claim 21, wherein a maximum storage size of each block in the first set of blocks is configured to allow the block to be stored in memory by a single computer process.

23. The method of claim 21, further comprising, for each block in the first set of blocks: updating an index to identify: the block; the subset of partition key values associated with the block; and the storage location of the block.

24. The method of claim 23, wherein a plurality of the blocks in the first set of blocks is stored in a single file.

25. The method of claim 23, further comprising incrementally updating the first set of blocks by: receiving an incremental update to the first dataset; dividing the incremental update according to the index to form incremental blocks corresponding to one or more blocks of the first set of blocks; and merging the incremental blocks with corresponding blocks of the first set of blocks; wherein a given incremental block corresponds to a block of the first set of blocks having the same subset of partition key values.

26. The method of claim 21, wherein partitioning the second dataset comprises: for each record in the second dataset, using the partition key value of the record to assign the record to a block in the second set of blocks; and for each block in the second set of blocks, sorting the records assigned to the block using a sort key; wherein each block in the second set of blocks corresponds to a block in the first set of blocks and is associated with the same partition key values as the corresponding block in the first set of blocks; and wherein a given record in the second dataset is assigned to the block in the second set of blocks that is associated with partition key values that include the given record's partition key value.

27. The method of claim 21, further comprising: prior to said correlating: assembling a daily update to the first dataset after partitioning the first dataset; dividing the daily update into an update set of blocks corresponding to the first set of blocks; and storing the update set of blocks in memory; and only after said correlating: physically merging each update block with its corresponding block in the first set of blocks.

28. The method of claim 21, wherein: the first dataset comprises computed metrics of users of an online service for a predetermined time period; and the partition key comprises a user identifier dimension of the first dataset.
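Steps (a) through (c) of claim 21 can be pictured with a short Python sketch. The claims do not say what the join or the aggregation computes, so the specifics here are hypothetical: the first dataset holds `(user, metric, value)` rows, the second holds `(user, variant)` rows, the join matches on the user identifier, and the aggregate is a running sum of metric values per `(variant, metric)` pair. The names `sub_blocks` and `correlate` and the fixed-size chunking of the second block are likewise assumptions.

```python
from collections import defaultdict


def sub_blocks(rows, size):
    """Yield consecutive chunks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def correlate(first_blocks, second_blocks, sub_block_size):
    """Join corresponding blocks and aggregate across sub-blocks.

    first_blocks:  dict block id -> list of (user, metric, value)
    second_blocks: dict block id -> list of (user, variant)
    Corresponding blocks share a block id and the same user subset.
    """
    totals = defaultdict(float)
    for block_id, first_block in first_blocks.items():
        # (a) Open the first block and hold it in memory, keyed by user.
        by_user = defaultdict(list)
        for user, metric, value in first_block:
            by_user[user].append((metric, value))

        # (b) Stream the associated second block one sub-block at a time.
        second_block = second_blocks.get(block_id, [])
        for chunk in sub_blocks(second_block, sub_block_size):
            for user, variant in chunk:
                # (b)1. Join the sub-block row with the first block.
                for metric, value in by_user.get(user, ()):
                    # (b)2. Fold the result into the running aggregate.
                    totals[(variant, metric)] += value
    # (c) The outer loop has repeated (a)-(b) for every first block.
    return dict(totals)
```

Only one first block and one sub-block of the second dataset need to be resident at a time, and because the aggregate in this sketch is a sum, the result does not depend on the sub-block size chosen.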

Patent Metadata

Filing Date

Unknown

Publication Date

January 13, 2015

Inventors

Srinivas S. Vemuri

Maneesh Varshney

Krishna P. Puttaswamy Naga

Rui Liu
