Efficiency Sets for Determination of Unique Data

Published January 7, 2025

Assignee not available in USPTO data we have

Technical Abstract

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method comprising: applying, by a computing device, a first membership test to a first group of candidate block identifiers corresponding to a first data set in a distributed storage system to generate a first efficiency set; applying, by the computing device, a second membership test to a second group of candidate block identifiers corresponding to a second data set in the distributed storage system to generate a second efficiency set; wherein at least one of the first membership test or the second membership test includes a filter that specifies that a candidate block identifier is added to the first efficiency set or the second efficiency set, respectively, when the candidate block identifier matches a threshold number of bits of a filter sequence of bits; determining, by the computing device, a set difference based on a comparison of the first efficiency set and the second efficiency set; and estimating, by the computing device, an amount of unique data within the second data set based on the set difference.

2. The method of claim 1, further comprising: estimating, by the computing device, an amount of memory that would be reclaimed by deleting the second data set based on the amount of unique data estimated within the second data set.

3. The method of claim 1, wherein the estimating comprises: applying, by the computing device, a third membership test to the set difference to generate a new set difference; and estimating, by the computing device, an amount of unique data within the second data set based on the new set difference.

4. The method of claim 1, wherein applying the first membership test comprises: increasing, by the computing device, a strictness of a filter of the first membership test when a size of the first efficiency set reaches a threshold number of entries.

5. The method of claim 1, wherein applying the second membership test comprises: increasing, by the computing device, a strictness of a filter of the second membership test when a size of the second efficiency set reaches a threshold number of entries.

6. The method of claim 1, wherein each of the first membership test and the second membership test includes a bitmask selected based on a targeted probability of accuracy of the first efficiency set and the second efficiency set.

7. The method of claim 1, wherein the first membership test and the second membership test have different membership criteria.

8. The method of claim 1, wherein the first data set includes an active data set stored at a first volume and the second data set includes a set of one or more snapshots stored at the first volume.

9. The method of claim 1, wherein the first data set includes an active data set and a first set of snapshots stored at a first volume, the second data set includes a second set of snapshots stored at the first volume, and the first set of snapshots is different from the second set of snapshots.

10. The method of claim 1, wherein the threshold number of bits determines a strictness of the filter.

11. The method of claim 1, wherein the first membership test and the second membership test include filters of different filter levels and wherein determining, by the computing device, the set difference comprises: identifying a maximum filter level of the different filter levels; applying the maximum filter level to both the first efficiency set and second efficiency set; and determining, by the computing device, the set difference based on a comparison of the first efficiency set and the second efficiency set after the maximum filter level has been applied.
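As an illustration of claims 1, 10, and 11, the following is a minimal sketch rather than the patented implementation. It assumes block identifiers are uniformly distributed integers (for example, content hashes), that the filter compares the low-order bits of an identifier against a filter sequence, and that each surviving identifier stands in for roughly 2^level blocks. The names `block_id`, `passes`, `efficiency_set`, `estimate_unique_bytes`, and `BLOCK_SIZE` are hypothetical and do not appear in the claims.

```python
import hashlib
from collections.abc import Iterable

BLOCK_SIZE = 4096  # assumed bytes per block (hypothetical value)


def block_id(data: bytes) -> int:
    """Derive an integer block identifier from block content (assumed scheme)."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def passes(bid: int, level: int, filter_bits: int = 0) -> bool:
    """Membership test: the low `level` bits of `bid` must equal those of `filter_bits`."""
    mask = (1 << level) - 1
    return (bid ^ filter_bits) & mask == 0


def efficiency_set(ids: Iterable[int], level: int, filter_bits: int = 0) -> set[int]:
    """Keep only the identifiers that pass the membership test at `level`."""
    return {b for b in ids if passes(b, level, filter_bits)}


def estimate_unique_bytes(first_es: set[int], first_level: int,
                          second_es: set[int], second_level: int,
                          filter_bits: int = 0) -> int:
    """Estimate bytes that are unique to the second data set.

    Both sets are first re-filtered at the stricter (maximum) level so they
    sample the identifier space at the same rate (claim 11).
    """
    level = max(first_level, second_level)
    a = efficiency_set(first_es, level, filter_bits)
    b = efficiency_set(second_es, level, filter_bits)
    diff = b - a  # sampled identifiers present only in the second set
    # Assumption: each sampled identifier represents about 2**level blocks.
    return len(diff) * (1 << level) * BLOCK_SIZE


# Synthetic identifiers: the two data sets share identifiers 500..999.
first = efficiency_set(range(0, 1000), level=2)
second = efficiency_set(range(500, 1500), level=3)
estimate = estimate_unique_bytes(first, 2, second, 3)
```

With these synthetic identifiers, 500 blocks are truly unique to the second data set, and the sketch estimates 504 blocks from 63 sampled identifiers. Raising the threshold number of bits shrinks the efficiency sets at the cost of a coarser estimate.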

12. A computing device comprising: a memory containing a machine-readable medium comprising machine executable code having stored thereon instructions for estimating an amount of memory used for storing unique data in a distributed storage system; and a processor coupled to the memory, the processor configured to execute the machine executable code to: apply a first membership test to a first group of candidate block identifiers corresponding to a first data set in a distributed storage system to generate a first efficiency set in which the first membership test specifies a first threshold number of bits of a candidate block identifier of the first group of candidate block identifiers that should match a first filter sequence of bits for the candidate block identifier to be added to the first efficiency set; apply a second membership test to a second group of candidate block identifiers corresponding to a second data set in the distributed storage system to generate a second efficiency set in which the second membership test specifies a second threshold number of bits of a candidate block identifier of the second group of candidate block identifiers that should match a second filter sequence of bits for the candidate block identifier to be added to the second efficiency set; determine a set difference based on a comparison of the first efficiency set and the second efficiency set; apply a third membership test to the set difference to generate a new set difference; and estimate an amount of unique data within the second data set based on the new set difference.

13. The computing device of claim 12, further comprising: estimate an amount of memory that would be reclaimed by deleting the second data set based on the amount of unique data estimated within the second data set.

14. The computing device of claim 12, wherein the first efficiency set comprises a union of an efficiency set for each volume of a plurality of volumes in the distributed storage system.

15. The computing device of claim 12, wherein the first threshold number of bits of the first membership test and the second threshold number of bits of the second membership test are different.

16. The computing device of claim 12, wherein the processor is configured, for the comparison of the first efficiency set and the second efficiency set, to execute the machine executable code to: tuning a threshold number of entries for each of the first efficiency set and the second efficiency set based on a target range of statistical uncertainty in the estimating of the amount of unique data within the second data set, wherein increasing the threshold number of entries reduces the statistical uncertainty.

17. The computing device of claim 12, wherein the processor is configured to execute the machine executable code to: remove the unique data from the distributed storage system in response to a request to remove the second data set, wherein a difference between the second data set and the unique data remains used in the distributed storage system after completing the request to remove the second data set.
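The third membership test in claim 12 and the reclaim estimate in claim 13 can be sketched as follows. This is a hedged illustration under stated assumptions: both efficiency sets are taken to be already filtered at a common level `shared_level`, the filter sequence is all zeros, and each surviving identifier is assumed to represent about 2^level blocks. The function names and the 4096-byte block size are hypothetical.

```python
def refine_difference(diff: set[int], level: int) -> set[int]:
    """Third membership test: keep identifiers whose low `level` bits are zero."""
    mask = (1 << level) - 1
    return {b for b in diff if b & mask == 0}


def estimate_reclaimable_bytes(first_es: set[int], second_es: set[int],
                               shared_level: int, third_level: int,
                               block_size: int = 4096) -> int:
    """Estimate bytes freed by deleting the second data set."""
    diff = second_es - first_es                      # set difference
    new_diff = refine_difference(diff, third_level)  # new set difference
    # The effective sampling rate is set by the strictest filter applied.
    effective_level = max(shared_level, third_level)
    return len(new_diff) * (1 << effective_level) * block_size


# Both sets below were (hypothetically) built at level 4: low 4 bits are zero.
first_es = {0, 16, 32}
second_es = {16, 48, 64, 80}
reclaimable = estimate_reclaimable_bytes(first_es, second_es,
                                         shared_level=4, third_level=5)
```

Here the set difference is {48, 64, 80}; only 64 passes the stricter level-5 test, so the estimate is 1 × 32 blocks × 4096 bytes.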

18. A non-transitory machine-readable medium having stored thereon instructions for estimating an amount of memory used for storing unique data in a distributed storage system, comprising machine executable code which when executed by at least one machine, causes the machine to: applying a first membership test to a first group of candidate block identifiers corresponding to a first data set in a distributed storage system to generate a first efficiency set, wherein the first membership test specifies a first threshold number of bits of a candidate block identifier of the first group of candidate block identifiers that should match a first filter sequence of bits for the candidate block identifier to be added to the first efficiency set; and wherein the first membership test increasing in strictness, by increasing the first threshold number of bits, each time a threshold number of entries in the first efficiency set is reached until all of the first group of candidate block identifiers have been tested for membership; apply a second membership test to a second group of candidate block identifiers corresponding to a second data set in the distributed storage system to generate a second efficiency set, wherein the second membership test specifies a second threshold number of bits of a candidate block identifier of the second group of candidate block identifiers that should match a second filter sequence of bits for the candidate block identifier to be added to the second efficiency set; and wherein the second membership test increases in strictness, by increasing the second threshold number of bits, each time a threshold number of entries in the first efficiency set is reached until all of the second group of candidate block identifiers have been tested for membership; subtract the first efficiency set from the second efficiency set to determine a set difference; and estimate an amount of unique data within the second data set based on the set difference.

19. The non-transitory machine-readable medium of claim 18, further comprising code which causes the machine to: estimate an amount of memory that would be reclaimed by deleting the second data set based on the amount of unique data estimated within the second data set.

20. The non-transitory machine-readable medium of claim 18, further comprising code which causes the machine to: remove the unique data from the distributed storage system in response to a request to remove the second data set, wherein a difference between the second data set and the unique data remains used in the distributed storage system after completing the request to remove the second data set.
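The adaptive strictness described in claims 4, 5, and 18 can be sketched as below. This is one possible reading, not the claimed implementation: it assumes an all-zero filter sequence over the low-order bits, and it assumes that entries already in the set are re-tested and pruned whenever the filter becomes stricter (the claims do not spell out the pruning step). `build_adaptive_efficiency_set` and `max_entries` are hypothetical names.

```python
from collections.abc import Iterable


def build_adaptive_efficiency_set(ids: Iterable[int], max_entries: int,
                                  start_level: int = 0) -> tuple[set[int], int]:
    """Build an efficiency set whose filter tightens as the set fills up.

    Returns the final set and the final filter level (threshold number of bits).
    """
    if max_entries < 2:
        raise ValueError("max_entries must be at least 2")
    level = start_level
    es: set[int] = set()
    for bid in ids:
        if bid & ((1 << level) - 1) != 0:
            continue  # fails the current membership test
        es.add(bid)
        # Each time the size threshold is reached, require one more matching
        # bit and drop entries that no longer pass (assumed pruning step).
        while len(es) >= max_entries:
            level += 1
            mask = (1 << level) - 1
            es = {b for b in es if b & mask == 0}
    return es, level


es, level = build_adaptive_efficiency_set(range(1, 101), max_entries=10)
estimated_blocks = len(es) * (1 << level)
```

For identifiers 1 through 100 and a threshold of 10 entries, the filter ends at level 4 with the set {16, 32, 48, 64, 80, 96}, giving an estimate of 96 blocks against a true count of 100. The set never holds more than `max_entries` identifiers, so memory use stays bounded regardless of data set size.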

Patent Metadata

Filing Date: Unknown

Publication Date: January 7, 2025

Inventors: Alyssa Proulx, Mark David Olson
