For calculation of digest segmentations for input data using similar data in a data deduplication system using a processor device in a computing environment, a stream of input data is partitioned into input data chunks. Similar repository intervals are calculated for each input data chunk. Anchor positions are determined between an input data chunk and the similar repository intervals, based on data matches between a previous input data chunk and previous similar repository intervals. Digest segmentations of the similar repository intervals are projected onto the input data chunk, starting at the anchor positions.
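The projection step in the abstract can be illustrated with a minimal sketch. It assumes a digest segmentation is a list of (position, size) pairs and an anchor is a pair of positions, one in the input data and one in the repository data. The function name, the parameter names, and the choice to keep only segments that fall entirely inside the input chunk are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start position, size in bytes); assumed representation


def project_segmentation(
    repo_segments: List[Segment],
    anchor_input_pos: int,
    anchor_repo_pos: int,
    chunk_start: int,
    chunk_end: int,
) -> List[Segment]:
    """Project repository digest segments onto an input chunk.

    Segments that start at or after the repository anchor are shifted by the
    offset between the two anchor positions. Segments that do not fall
    entirely inside [chunk_start, chunk_end) are dropped (an assumption).
    """
    offset = anchor_input_pos - anchor_repo_pos
    projected: List[Segment] = []
    for pos, size in repo_segments:
        if pos < anchor_repo_pos:
            continue  # segment lies before the anchor in the repository
        start = pos + offset
        end = start + size
        if start >= chunk_start and end <= chunk_end:
            projected.append((start, size))
    return projected


# Example: anchor pairs input position 1000 with repository position 100.
repo = [(90, 10), (100, 20), (120, 30), (150, 50)]
result = project_segmentation(repo, 1000, 100, 1000, 1060)
```

In this example the segment at repository position 90 precedes the anchor and is skipped, and the segment at 150 would run past the end of the chunk, so only two segments are projected.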
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method for calculation of digest segmentations for input data in a data deduplication system using a processor device in a computing environment, comprising: partitioning an input stream of data into input data chunks, the input data chunks each being at least 16 Megabytes (MB) in size; calculating similar repository intervals for an input data chunk, the repository intervals produced using a single linear scan of rolling hash values to calculate both similarity elements and digest block boundaries corresponding to the repository intervals; wherein each of the rolling hash values are discarded upon contributing to the calculation; identifying anchor positions for the input data chunk and each one of the similar repository intervals; projecting digest segmentations of the similar repository intervals starting at the anchor positions onto the input data chunk; wherein an anchor position is defined as a pair of ending position of a last data match, in the input data and in repository data, calculated between a previous input data chunk in the input stream of data and a previous similar repository interval, whose ending positions in the input stream of data and in the repository data are closest to starting positions of a current input data chunk and of a current similar repository interval respectively; determining when and when not to apply a similarity search associated with the similar repository intervals for the input data chunk based on a deduplication result of a previous input data chunk in the input stream of data; and avoiding the similarity search if the deduplication result of the previous input data chunk in the input stream of data is one of above and equal to a predetermined deduplication result threshold, thereby only calculating the rolling hash values of the input data chunk when needed to be used in the similarity search in the data deduplication system of the computing environment.
2. The method of claim 1, further including storing a position and a size of a last data match for each one of the similar repository intervals for each of plurality of input streams of data.
3. The method of claim 1, further including using positions and sizes of digest segments of a similar repository interval starting at an anchor position in the similar repository interval for projecting the positions and the sizes of the digest segments onto the input data chunk starting from a respective anchor position in the input data chunk.
4. The method of claim 1, further including calculating a plurality of digest segmentations for the input data chunk using each one of the digest segmentations of the similar repository intervals, wherein respective digest values are computed for each one of the plurality of calculated digest segmentation.
5. The method of claim 4, further including selecting for storage at least one digest segmentation of the plurality of digest segmentations calculated for the input data chunk that produced a highest deduplication ratio for the input data chunk.
6. The method of claim 5, further including concatenating into a single segmentation the plurality of digest segmentations selected for all sub-sections of the input data chunk if the input data chunk is partitioned into sub-sections such that each of the sub-sections has a set of the similar repository intervals.
7. A system for calculation of candidate digest segmentations for an input data chunk in a data deduplication system of a computing environment, the system comprising: the data deduplication system; a repository operating in the data deduplication system; a memory in the data deduplication system; a search structure in association with the memory in the data deduplication system; and at least one processor device operable in the computing storage environment for controlling the data deduplication system, wherein the at least one processor device: partitions an input stream of data into input data chunks, the input data chunks each being at least 16 Megabytes (MB) in size, calculates similar repository intervals for an input data chunk, the repository intervals produced using a single linear scan of rolling hash values to calculate both similarity elements and digest block boundaries corresponding to the repository intervals; wherein each of the rolling hash values are discarded upon contributing to the calculation, identifies anchor positions for the input data chunk and each one of the similar repository intervals, projects digest segmentations of the similar repository intervals starting at the anchor positions onto the input data chunk; wherein an anchor position is defined as a pair of ending position of a last data match, in the input data and in repository data, calculated between a previous input data chunk in the input stream of data and a previous similar repository interval, whose ending positions in the input stream of data and in the repository data are closest to starting positions of a current input data chunk and of a current similar repository interval respectively, determining when and when not to apply a similarity search associated with the similar repository intervals for the input data chunk based on a deduplication result of a previous input data chunk in the input stream of data, and avoiding the similarity search if the deduplication result of the previous input data chunk in the input stream of data is one of above and equal to a predetermined deduplication result threshold, thereby only calculating the rolling hash values of the input data chunk when needed to be used in the similarity search in the data deduplication system of the computing environment.
8. The system of claim 7, wherein the at least one processor device stores a position and a size of a last data match for each one of the similar repository intervals for each of plurality of input streams of data.
9. The system of claim 7, wherein the at least one processor device uses positions and sizes of digest segments of a similar repository interval starting at an anchor position in the similar repository interval for projecting the positions and the sizes of the digest segments onto the input data chunk starting from a respective anchor position in the input data chunk.
10. The system of claim 7, wherein the at least one processor device calculates a plurality of digest segmentations for the input data chunk using each one of the digest segmentations of the similar repository intervals, wherein respective digest values are computed for each one of the plurality of calculated digest segmentation.
11. The system of claim 10, wherein the at least one processor device selects for storage at least one digest segmentation of the plurality of digest segmentations calculated for the input data chunk that produced a highest deduplication ratio for the input data chunk.
12. The system of claim 11, wherein the at least one processor device concatenates into a single segmentation the plurality of digest segmentations selected for all sub-sections of the input data chunk if the input data chunk is partitioned into sub-sections such that each of the sub-sections has a set of the similar repository intervals.
13. A computer program product for calculation of candidate digest segmentations for an input data chunk in a data deduplication system using a processor device in a computing environment, the computer program product comprising a non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising: a first executable portion that partitions an input stream of data into input data chunks, the input data chunks each being at least 16 Megabytes (MB) in size; a second executable portion that calculates similar repository intervals for an input data chunk, the repository intervals produced using a single linear scan of rolling hash values to calculate both similarity elements and digest block boundaries corresponding to the repository intervals; wherein each of the rolling hash values are discarded upon contributing to the calculation; a third executable portion that identifies anchor positions for the input data chunk and each one of the similar repository intervals; a fourth executable portion that projects digest segmentations of the similar repository intervals starting at the anchor positions onto the input data chunk; wherein an anchor position is defined as a pair of ending position of a last data match, in the input data and in repository data, calculated between a previous input data chunk in the input stream of data and a previous similar repository interval, whose ending positions in the input stream of data and in the repository data are closest to starting positions of a current input data chunk and of a current similar repository interval respectively; a fifth executable portion that determines when and when not to apply a similarity search associated with the similar repository intervals for the input data chunk based on a deduplication result of a previous input data chunk in the input stream of data; and a sixth executable portion that avoids the similarity search if the deduplication result of the previous input data chunk in the input stream of data is one of above and equal to a predetermined deduplication result threshold, thereby only calculating the rolling hash values of the input data chunk when needed to be used in the similarity search in the data deduplication system of the computing environment.
14. The computer program product of claim 13, further including a seventh executable portion that stores a position and a size of a last data match for each one of the similar repository intervals for each of plurality of input streams of data.
15. The computer program product of claim 13, further including a seventh executable portion that uses positions and sizes of digest segments of a similar repository interval starting at an anchor position in the similar repository interval for projecting the positions and the sizes of the digest segments onto the input data chunk starting from a respective anchor position in the input data chunk.
16. The computer program product of claim 13, further including a seventh executable portion that calculates a plurality of digest segmentations for the input data chunk using each one of the digest segmentations of the similar repository intervals, wherein respective digest values are computed for each one of the plurality of calculated digest segmentation.
17. The computer program product of claim 16, further including an eighth executable portion that selects for storage at least one digest segmentation of the plurality of digest segmentations calculated for the input data chunk that produced a highest deduplication ratio for the input data chunk.
18. The computer program product of claim 17, further including a ninth executable portion that concatenates into a single segmentation the plurality of digest segmentations selected for all sub-sections of the input data chunk if the input data chunk is partitioned into sub-sections such that each of the sub-sections has a set of the similar repository intervals.
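The decision and selection steps recited in the claims (skipping the similarity search when the previous chunk deduplicated well, choosing the candidate segmentation with the highest deduplication ratio, and concatenating the selections of sub-sections) can be sketched as follows. This is an illustration only. All function names are hypothetical, the deduplication result is assumed to be a ratio between 0 and 1, and running the search for the first chunk of a stream (when no previous result exists) is an assumption the claims do not address.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # (start position, size in bytes); assumed representation


def should_run_similarity_search(
    prev_dedup_result: Optional[float], threshold: float
) -> bool:
    """Return False when the previous chunk's result is at or above the threshold.

    With no previous chunk (None), the search is run; this is an assumption.
    """
    return prev_dedup_result is None or prev_dedup_result < threshold


def select_best_segmentation(
    candidates: List[Tuple[List[Segment], float]]
) -> List[Segment]:
    """Pick the candidate segmentation with the highest deduplication ratio."""
    best_segmentation, _ratio = max(candidates, key=lambda c: c[1])
    return best_segmentation


def concatenate_segmentations(per_subsection: List[List[Segment]]) -> List[Segment]:
    """Join the selected segmentations of consecutive sub-sections into one."""
    merged: List[Segment] = []
    for segments in per_subsection:
        merged.extend(segments)
    return merged


# Example with two sub-sections, each with two candidate segmentations.
first = select_best_segmentation([([(0, 8), (8, 8)], 0.40), ([(0, 16)], 0.75)])
second = select_best_segmentation([([(16, 4), (20, 12)], 0.90), ([(16, 16)], 0.10)])
combined = concatenate_segmentations([first, second])
```

Here a previous result equal to the threshold suppresses the search, which matches the "above and equal to" wording of the independent claims.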
July 15, 2013
September 29, 2020