Apparatus, methods, and other embodiments associated with de-duplication seeding are described. One example method includes re-configuring a data de-duplication repository with a blocklet from a data de-duplication seed corpus. Reconfiguring the repository may include adding a blocklet from the seed corpus to the repository, activating a blocklet identified with the seed corpus in the repository, removing a blocklet from the repository, and de-activating a blocklet in the repository. The example method may also include re-configuring a data de-duplication index associated with the data de-duplication repository with information about the blocklet. Reconfiguring the repository and the index increases the likelihood that a blocklet ingested by a data de-duplication apparatus that relies on the repository and the index will be treated as a duplicate blocklet by the data de-duplication apparatus.
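The seeding idea in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `SeededRepository` class, the SHA-256 fingerprint, and the per-entry `active` flag are all assumptions made for illustration. It shows only how adding or activating seed blocklets before ingest makes a later, matching blocklet more likely to be treated as a duplicate.

```python
import hashlib


def fingerprint(blocklet: bytes) -> str:
    """Hypothetical blocklet identifier; SHA-256 is an assumption, not from the source."""
    return hashlib.sha256(blocklet).hexdigest()


class SeededRepository:
    """Toy repository plus index, both keyed by fingerprint (illustrative only)."""

    def __init__(self):
        self.blocklets = {}  # fingerprint -> blocklet bytes (the "repository")
        self.index = {}      # fingerprint -> {"active": bool} (the "index")

    def seed(self, seed_corpus):
        """Add each seed blocklet to the repository and activate it in the index."""
        for blocklet in seed_corpus:
            fp = fingerprint(blocklet)
            self.blocklets.setdefault(fp, blocklet)
            self.index[fp] = {"active": True}

    def deactivate(self, blocklet: bytes):
        """De-activate a blocklet in the index without removing it from the repository."""
        fp = fingerprint(blocklet)
        if fp in self.index:
            self.index[fp]["active"] = False

    def ingest(self, blocklet: bytes) -> bool:
        """Return True if the blocklet is treated as a duplicate, else store it."""
        fp = fingerprint(blocklet)
        entry = self.index.get(fp)
        if entry is not None and entry["active"]:
            return True
        self.blocklets[fp] = blocklet
        self.index[fp] = {"active": True}
        return False


# Usage: the same blocklet is new to an unseeded repository
# but a duplicate for a seeded one.
unseeded = SeededRepository()
seeded = SeededRepository()
seeded.seed([b"common header block"])
first_seen_unseeded = unseeded.ingest(b"common header block")  # False
first_seen_seeded = seeded.ingest(b"common header block")      # True
```

In this sketch a de-activated blocklet stays in the repository but is no longer matched, which mirrors the activate and de-activate options listed in the abstract.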
Legal claims defining the scope of protection, as filed with the USPTO.
1. A non-transitory computer-readable medium storing computer-executable instructions that when executed by a computer cause the computer to perform a data de-duplication method, the method comprising:
    re-configuring a data de-duplication repository with a first blocklet taken from a source other than a data stream being ingested by a data de-duplication apparatus, where the source other than the data stream being ingested is a seed corpus, and where re-configuring the repository comprises moving the first blocklet from the seed corpus into the repository;
    re-configuring a data de-duplication index associated with the data de-duplication repository with index information about the first blocklet, where reconfiguring the data de-duplication repository or the data de-duplication index increases the likelihood that a second blocklet will be treated as a duplicate blocklet when processed by the data de-duplication apparatus using the data de-duplication repository and the data-duplication index to support duplicate blocklet determinations, and
    generating a new seed corpus, where generating the new seed corpus comprises selecting a seed blocklet from an existing repository based, at least in part, on one or more of, a reference count associated with the seed blocklet, an attribute describing the generalness of the seed blocklet, a trial and error approach, and a random approach.
2. The non-transitory computer-readable medium of claim 1, where re-configuring the repository comprises activating the first blocklet in the repository.
3. The non-transitory computer-readable medium of claim 1, where reconfiguring the index comprises moving information about the first blocklet into the index.
4. The non-transitory computer-readable medium of claim 1, where reconfiguring the index comprises activating information about the first blocklet in the index.
5. The non-transitory computer-readable medium of claim 1, comprising selecting the seed corpus from two or more available seed corpora.
6. The non-transitory computer-readable medium of claim 5, where the seed corpus is selected as a function of one or more of, a relationship between data to be ingested by the data de-duplication apparatus and the seed corpus, a historical performance measurement associated with the seed corpus, an on-the-fly performance measurement associated with the seed corpus, a user action, a calendar date, a day of the week, a time of day, a user identity, and an occurrence of a pre-defined event.
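Two of the selection steps recited in the claims can be sketched as follows: generating a new seed corpus from an existing repository by reference count (claim 1), and choosing among available seed corpora by their relationship to the data to be ingested (claims 5 and 6). The function names, the dictionary layout, the `top_n` cutoff, and the fingerprint-overlap measure are illustrative assumptions; the claims do not prescribe any of them.

```python
import hashlib
from typing import Dict, List, Tuple


def fingerprint(blocklet: bytes) -> str:
    """Hypothetical blocklet identifier; SHA-256 is an assumption."""
    return hashlib.sha256(blocklet).hexdigest()


def generate_seed_corpus(
    repository: Dict[str, Tuple[bytes, int]], top_n: int
) -> List[bytes]:
    """Pick the top_n blocklets with the highest reference counts.

    `repository` maps fingerprint -> (blocklet, reference_count).
    Ties are broken by fingerprint so the result is deterministic.
    """
    ranked = sorted(repository.items(), key=lambda item: (-item[1][1], item[0]))
    return [blocklet for _fp, (blocklet, _refs) in ranked[:top_n]]


def select_seed_corpus(corpora: Dict[str, List[bytes]], sample: List[bytes]) -> str:
    """Return the name of the corpus sharing the most blocklets with `sample`.

    `sample` stands in for data about to be ingested; counting shared
    fingerprints is one assumed way to measure the "relationship" between
    that data and each corpus. Ties go to the alphabetically first name.
    """
    if not corpora:
        raise ValueError("at least one seed corpus is required")
    sample_fps = {fingerprint(b) for b in sample}

    def overlap(name: str) -> int:
        return len(sample_fps & {fingerprint(b) for b in corpora[name]})

    return max(sorted(corpora), key=overlap)


# Usage with made-up blocklets and reference counts.
existing = {
    fingerprint(b"a"): (b"a", 5),
    fingerprint(b"b"): (b"b", 1),
    fingerprint(b"c"): (b"c", 9),
}
new_corpus = generate_seed_corpus(existing, top_n=2)
chosen = select_seed_corpus(
    {"office": [b"a", b"b"], "media": [b"z"]},
    sample=[b"a", b"b", b"q"],
)
```

The other criteria listed in the claims, such as historical performance, time of day, or user identity, could replace the overlap measure without changing the overall shape of the selection step.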
January 11, 2012
November 18, 2014