US-8918372

Content-aware distributed deduplicating storage system based on consistent hashing

Published: December 23, 2014

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

A set of metadata associated with backup data is obtained. A consistent hash key for the backup data is generated based at least in part on the set of metadata. The backup data is assigned to one of a plurality of deduplication nodes based at least in part on the consistent hash key.
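As a rough illustration of the abstract, the sketch below derives a key from a set of metadata and maps it onto a fixed key space. The use of SHA-1, the 32-bit key space, and the metadata field names are assumptions made for illustration only; the patent text here does not specify any of them.

```python
import hashlib

KEY_SPACE = 2**32  # assumed size of the hash key space


def consistent_hash_key(metadata: dict) -> int:
    """Map a set of metadata to an integer key in [0, KEY_SPACE)."""
    # Sort the fields so the same metadata always yields the same key,
    # regardless of the order in which the fields were collected.
    canonical = "|".join(f"{name}={metadata[name]}" for name in sorted(metadata))
    digest = hashlib.sha1(canonical.encode("utf-8")).digest()
    # Use the first four bytes of the digest as the key.
    return int.from_bytes(digest[:4], "big") % KEY_SPACE


# Hypothetical metadata fields for one piece of backup data.
key = consistent_hash_key({"os": "linux", "file_type": "vmdk", "app": "db"})
```

Because the key depends only on the metadata, backup data with identical metadata always maps to the same key and therefore to the same deduplication node.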

Patent Claims

25 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for processing backup data, comprising: obtaining corresponding sets of metadata associated with each of at least some of a plurality of backup data; generating, using a processor, one or more consistent hash keys for each of at least some of the plurality of backup data based at least in part on the corresponding sets of metadata; assigning the plurality of backup data to at least some of a plurality of deduplication nodes comprising a deduplication system based at least in part on the one or more consistent hash keys and on an interval associated with each of at least some of the plurality of deduplication nodes; detecting, based at least in part on a table, that a number of functioning deduplication nodes in the deduplication system has changed, wherein the table is configured to store a plurality of received deduplication node identifier information and received timestamp information; determining, in response to detecting that the number of functioning deduplication nodes has changed, a new interval to assign to a first deduplication node and a new interval to assign to a second deduplication node; and redistributing at least some of the plurality of backup data based at least in part on the one or more consistent hash keys and on the new intervals assigned to each of the first deduplication node and the second deduplication node, including by performing one of the following: reassigning at least a portion of backup data from the first deduplication node and the second deduplication node to a third deduplication node, and reassigning at least a portion of backup data from the third deduplication node to the first deduplication node and the second deduplication node; and further comprising selecting the first deduplication node and the second deduplication node; and wherein: in the event the third deduplication node comprises a new deduplication node that is added to the deduplication system, the first deduplication node and the second deduplication node are each selected based at least in part on a corresponding utilization value; and in the event the third deduplication node comprises an existing deduplication node that is removed from the deduplication system, the first deduplication node and the second deduplication node are selected based at least in part on a respective association with intervals that are adjacent to the interval associated with the third deduplication node.

2. The method of claim 1, wherein obtaining the corresponding sets of metadata includes analyzing the plurality of backup data and generating the corresponding sets of metadata based at least in part on the analysis.

3. The method of claim 2, wherein the analysis comprises one or more of the following: a run-time analysis, an algorithmic selection based at least in part on a policy requirement, a heuristic analysis, or an analysis based at least in part on an environmental condition extant at a time of backup.

4. The method of claim 1, wherein assigning a backup data to a deduplication node includes: obtaining the intervals associated with each of at least some of the plurality of deduplication nodes; and selecting, from the plurality of deduplication nodes, a deduplication node that is associated with an interval that includes the consistent hash key generated for the backup data.

5. The method of claim 1, wherein a backup data is assigned to a deduplication node based on a consistent hash key only in the event that the backup data is determined to not have been flagged.

6. The method of claim 5, wherein in the event that the backup data is determined to have been flagged, assigning the backup data to a deduplication node not based on the consistent hash key but based at least in part on a policy associated with flagged backup data.

7. The method of claim 6, wherein the policy associated with flagged backup data includes assigning flagged backup data to a storage node based at least in part on a source organization associated with flagged backup data.

8. The method of claim 1 further comprising receiving, from one of the plurality of deduplication nodes a message that includes a deduplication node identifier, a hash key, and routing information.

9. The method of claim 8, wherein the deduplication node identifier, the hash key, the routing information, and a corresponding timestamp are stored as an entry in the table.

10. The method of claim 9, wherein the timestamp is included in the message.

11. The method of claim 1 further comprising cleaning the table including by determining whether to delete a first entry from the table based at least in part on a corresponding timestamp.

12. The method of claim 1, wherein the first deduplication node is determined to have the highest utilization value of the plurality of deduplication nodes in the deduplication system, wherein the second deduplication node is one of two deduplication nodes that are associated with intervals that are adjacent to the interval associated with the first deduplication node, and wherein the second deduplication node is determined to have a higher utilization value than the other adjacent deduplication node.

13. A system for processing backup data, comprising: a processor configured to: obtain corresponding sets of metadata associated with each of at least some of a plurality of backup data; generate one or more consistent hash keys for each of at least some of the plurality of backup data based at least in part on the corresponding sets of metadata; assign the plurality of backup data to at least some of a plurality of deduplication nodes comprising a deduplication system based at least in part on the one or more consistent hash keys and on an interval associated with each of at least some of the plurality of deduplication nodes; detect, based at least in part on a table, that a number of functioning deduplication nodes in the deduplication system has changed, wherein the table is configured to store a plurality of received deduplication node identifier information and received timestamp information; determine, in response to detecting that the number of functioning deduplication nodes has changed, a new interval to assign to a first deduplication node and a new interval to assign to a second deduplication node; and redistribute at least some of the plurality of backup data based at least in part on the one or more consistent hash keys and on the new intervals assigned to each of the first deduplication node and the second deduplication node, including by performing one of the following: reassigning at least a portion of backup data from the first deduplication node and the second deduplication node to a third deduplication node, and reassigning at least a portion of backup data from the third deduplication node to the first deduplication node and the second deduplication node; and a memory coupled with the processor and configured to provide the processor with instructions; wherein: the processor is further configured to select the first deduplication node and the second deduplication node; in the event the third deduplication node comprises a new deduplication node that is added to the deduplication system, the first deduplication node and the second deduplication node are each selected based at least in part on a corresponding utilization value; and in the event the third deduplication node comprises an existing deduplication node that is removed from the deduplication system, the first deduplication node and the second deduplication node are selected based at least in part on a respective association with intervals that are adjacent to the interval associated with the third deduplication node.

14. The system of claim 13, wherein obtaining the corresponding sets of metadata includes analyzing the plurality of backup data and generating the corresponding sets of metadata based at least in part on the analysis.

15. The system of claim 14, wherein the analysis comprises one or more of the following: a run-time analysis, an algorithmic selection based at least in part on a policy requirement, a heuristic analysis, or an analysis based at least in part on an environmental condition extant at a time of backup.

16. The system of claim 13, wherein assigning a backup data to a deduplication node includes: obtaining the intervals associated with each of at least some of the plurality of deduplication nodes; and selecting, from the plurality of deduplication nodes, a deduplication node that is associated with an interval that includes the consistent hash key generated for the backup data.

17. The system of claim 13, wherein the backup data is assigned to a deduplication node based on a consistent hash key only in the event that the backup data is determined to not have been flagged.

18. The system of claim 17, wherein in the event that the backup data is determined to have been flagged, assigning the backup data to a deduplication node not based on the consistent hash key but based at least in part on a policy associated with flagged backup data.

19. The system of claim 18, wherein the policy associated with flagged backup data includes assigning flagged backup data to a storage node based at least in part on a source organization associated with flagged backup data.

20. The system of claim 13, wherein the first deduplication node is determined to have the highest utilization value of the plurality of deduplication nodes in the deduplication system, wherein the second deduplication node is one of two deduplication nodes that are associated with intervals that are adjacent to the interval associated with the first deduplication node, and wherein the second deduplication node is determined to have a higher utilization value than the other adjacent deduplication node.

21. A computer program product for processing backup data, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions for: obtaining corresponding sets of metadata associated with each of at least some of a plurality of backup data; generating one or more consistent hash keys for each of at least some of the plurality of backup data based at least in part on the corresponding sets of metadata; assigning the plurality of backup data to at least some of a plurality of deduplication nodes comprising a deduplication system based at least in part on the one or more consistent hash keys and on an interval associated with each of at least some of the plurality of deduplication nodes; detecting, based at least in part on a table, that a number of functioning deduplication nodes in the deduplication system has changed, wherein the table is configured to store a plurality of received deduplication node identifier information and received timestamp information; determining, in response to detecting that the number of functioning deduplication nodes has changed, a new interval to assign to a first deduplication node and a new interval to assign to a second deduplication node; and redistributing at least some of the plurality of backup data based at least in part on the one or more consistent hash keys and on the new intervals assigned to each of the first deduplication node and the second deduplication node, including by performing one of the following: reassigning at least a portion of backup data from the first deduplication node and the second deduplication node to a third deduplication node, and reassigning at least a portion of backup data from the third deduplication node to the first deduplication node and the second deduplication node; and further comprising computer instructions for selecting the first deduplication node and the second deduplication node; and wherein: in the event the third deduplication node comprises a new deduplication node that is added to the deduplication system, the first deduplication node and the second deduplication node are each selected based at least in part on a corresponding utilization value; and in the event the third deduplication node comprises an existing deduplication node that is removed from the deduplication system, the first deduplication node and the second deduplication node are selected based at least in part on a respective association with intervals that are adjacent to the interval associated with the third deduplication node.

22. The computer program product of claim 21 further comprising computer instructions for receiving, from one of the plurality of deduplication nodes a message that includes a deduplication node identifier, a hash key, and routing information.

23. The computer program product of claim 22 wherein the deduplication node identifier, the hash key, the routing information, and a corresponding timestamp are stored as an entry in the table.

24. The computer program product of claim 23, wherein the timestamp is included in the message.

25. The computer program product of claim 21, further comprising computer instructions for cleaning the table, including by determining whether to delete a first entry from the table based at least in part on a corresponding timestamp.
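The interval lookup recited in claims 4 and 16 can be sketched as a binary search over interval boundaries. The node names, the four equal intervals, and the 32-bit key space below are hypothetical; the claims only require that the selected node's interval include the key.

```python
import bisect

KEY_SPACE = 2**32  # assumed size of the hash key space

# Hypothetical layout: node i owns the half-open interval
# [starts[i], starts[i + 1]); the last interval runs to KEY_SPACE.
starts = [0, 2**30, 2**31, 3 * 2**30]
nodes = ["node-a", "node-b", "node-c", "node-d"]


def node_for_key(key: int) -> str:
    """Return the node whose interval includes the given hash key."""
    if not 0 <= key < KEY_SPACE:
        raise ValueError("key outside the key space")
    # bisect_right finds the first start greater than key; the interval
    # containing key begins one position earlier.
    index = bisect.bisect_right(starts, key) - 1
    return nodes[index]
```

Keeping the interval starts sorted makes each lookup logarithmic in the number of nodes, which is one plausible reason to store intervals this way.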
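The removal case in claims 1, 13, and 21 (a third node leaves and its data goes to the two nodes with adjacent intervals) might look like the sketch below. Splitting the removed interval at its midpoint, the dictionary representation, and all names are assumptions; the claims do not say how the new intervals are computed.

```python
KEY_SPACE = 2**32  # assumed key space; arcs are half-open [start, end) on a ring


def contains(arc, key):
    """True if key lies in the arc, allowing the arc to wrap past zero."""
    start, end = arc
    if start < end:
        return start <= key < end
    return key >= start or key < end


def owner(ring, key):
    """Return the node whose arc includes the key."""
    return next(node for node, arc in ring.items() if contains(arc, key))


def remove_node(ring, removed):
    """Give the removed node's arc to its two adjacent neighbors."""
    if len(ring) < 3:
        raise ValueError("sketch assumes two distinct neighbors remain")
    start, end = ring[removed]
    pred = next(n for n, (_, e) in ring.items() if e == start)
    succ = next(n for n, (s, _) in ring.items() if s == end)
    # Assumed policy: split the freed arc at its midpoint.
    mid = (start + ((end - start) % KEY_SPACE) // 2) % KEY_SPACE
    new_ring = {n: arc for n, arc in ring.items() if n != removed}
    new_ring[pred] = (ring[pred][0], mid)
    new_ring[succ] = (mid, ring[succ][1])
    return new_ring


def redistribute(keys, ring):
    """Recompute the node for every item from its stored hash key."""
    return {item: owner(ring, key) for item, key in keys.items()}


ring = {"A": (0, 100), "B": (100, 200), "C": (200, 0)}
keys = {"x": 50, "y": 120, "z": 180, "w": 300}
before = redistribute(keys, ring)
after = redistribute(keys, remove_node(ring, "B"))
```

In this example only the items that were on the removed node change owner; items already on the neighboring nodes stay where they are, which is the usual benefit of consistent hashing.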
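Claims 12 and 20 describe how the first and second nodes are chosen when a node is added: the most utilized node, then the more utilized of its two interval neighbors. A minimal sketch follows, assuming the nodes are listed in interval order around a ring and that utilization is a number per node; the tie-breaking behavior is incidental to this sketch, not taken from the claims.

```python
def select_donors(ring_order, utilization):
    """Pick the most utilized node and its more utilized neighbor.

    ring_order lists node ids in interval order; adjacency is treated
    as circular, which is an assumption of this sketch.
    """
    if len(ring_order) < 3:
        raise ValueError("sketch assumes at least three nodes")
    first = max(ring_order, key=lambda node: utilization[node])
    i = ring_order.index(first)
    prev_node = ring_order[i - 1]  # index -1 wraps to the last node
    next_node = ring_order[(i + 1) % len(ring_order)]
    second = max((prev_node, next_node), key=lambda node: utilization[node])
    return first, second


order = ["A", "B", "C", "D"]
donors = select_donors(order, {"A": 0.4, "B": 0.9, "C": 0.7, "D": 0.2})
```

Choosing two neighboring donors means the new node can take a contiguous slice of the key space that relieves the busiest region.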
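The flagged-data path in claims 5 to 7 and 17 to 19 bypasses the hash key and follows a policy keyed on the source organization. The sketch below assumes a dictionary-based item, a lookup function for the hash path, and a simple organization-to-node mapping as the policy; all of these names are hypothetical.

```python
def assign_node(item, hash_lookup, org_policy):
    """Route one piece of backup data to a node.

    item: dict with hypothetical fields "flagged", "source_org", "hash_key".
    hash_lookup: callable mapping a hash key to a deduplication node.
    org_policy: dict mapping a source organization to a storage node.
    """
    if item.get("flagged"):
        # Flagged data ignores the consistent hash key entirely.
        return org_policy[item["source_org"]]
    return hash_lookup(item["hash_key"])


# Stand-in for an interval lookup, for illustration only.
lookup = lambda key: "node-a" if key < 1000 else "node-b"
policy = {"legal": "node-secure"}

normal = assign_node({"flagged": False, "hash_key": 42}, lookup, policy)
flagged = assign_node(
    {"flagged": True, "source_org": "legal", "hash_key": 42}, lookup, policy
)
```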
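The table described in claims 8 to 11 and 22 to 25 stores one entry per node message and is cleaned by timestamp. The following sketch assumes the table is keyed by node identifier, that the message carries its own timestamp (as in claims 10 and 24), and that an entry older than a fixed age is deleted; the field names and the age threshold are illustrative.

```python
from dataclasses import dataclass


@dataclass
class TableEntry:
    node_id: str
    hash_key: int
    routing_info: str
    timestamp: float


def record_message(table, message):
    """Store one node message, keyed by node id (newest entry wins)."""
    entry = TableEntry(
        message["node_id"],
        message["hash_key"],
        message["routing_info"],
        message["timestamp"],
    )
    table[entry.node_id] = entry


def clean_table(table, now, max_age):
    """Delete entries older than max_age seconds; return the removed ids."""
    stale = [nid for nid, e in table.items() if now - e.timestamp > max_age]
    for nid in stale:
        del table[nid]
    return stale


table = {}
record_message(table, {"node_id": "node-a", "hash_key": 0,
                       "routing_info": "10.0.0.1:9000", "timestamp": 100.0})
record_message(table, {"node_id": "node-b", "hash_key": 2**31,
                       "routing_info": "10.0.0.2:9000", "timestamp": 160.0})
count_before = len(table)
removed = clean_table(table, now=170.0, max_age=30.0)
count_changed = len(table) != count_before
```

Comparing the entry count before and after cleaning is one simple way a change in the number of functioning nodes could be detected from the table, under the assumption that a stale entry means the node has stopped reporting.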

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F

Patent Metadata

Filing Date

September 19, 2012

Publication Date

December 23, 2014
