A method of data caching for compliance and storage systems that provide keyword-search-based access to documents computes a value for each data document based on a document information-retrieval relevancy metric for user keyword queries and on the recency and frequency of each query. The values are adapted to changing query frequencies and popularities. Documents are then selected for and evicted from a cache based on these values according to a knapsack solution. A weight is computed for each query such that more recent and more frequent queries get a higher weight. An information-retrieval metric measures the relevancy of a document for a query. The value of a document is the weighted sum, over all queries, of the information-retrieval metric times the query weight.
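The weighted sum described above can be sketched in Python. This is a minimal illustration and not the patented implementation: the geometric `decay` factor, the term-overlap `relevance` function, and all names are assumptions introduced here, since the abstract does not fix a particular query-weight formula or information-retrieval metric.

```python
from collections import deque


def query_weights(history, decay=0.5):
    """Assumed weighting: the newest query gets 1.0 and each older
    position is multiplied by `decay`, so recent queries weigh more."""
    n = len(history)
    return [decay ** (n - 1 - i) for i in range(n)]


def relevance(document_terms, query_terms):
    """Stand-in IR metric (an assumption): the fraction of query terms
    that appear in the document."""
    query_terms = set(query_terms)
    if not query_terms:
        return 0.0
    return len(query_terms & set(document_terms)) / len(query_terms)


def document_weight(document_terms, history, decay=0.5):
    """Weighted sum over all queries of relevance times query weight."""
    weights = query_weights(history, decay)
    return sum(w * relevance(document_terms, q)
               for w, q in zip(weights, history))


# Query history kept oldest-to-newest; a repeated query appears twice,
# so frequency raises the weight of the documents it matches.
history = deque(maxlen=3)
for q in (["cache", "eviction"], ["patent", "claims"], ["cache", "eviction"]):
    history.append(q)

doc_a = ["cache", "eviction", "knapsack"]
doc_b = ["patent", "claims"]
weight_a = document_weight(doc_a, history)
weight_b = document_weight(doc_b, history)
```

In this sketch `doc_a` matches both the oldest and the newest query, so it ends up with a higher weight than `doc_b`, which matches only the middle one.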
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method of data caching for compliance and storage systems that provides keyword search query based access to documents, the method comprising: searching documents from a storage device by a keyword based interface; staging from a cache documents that are read and that are expected to be needed again from the storage device; computing a document weight for each of the documents read and expected to be needed again, wherein the document weight is based on a document information retrieval (IR) relevancy metric for user keyword queries and a recency and a frequency of each query and the document weight models a probability of a particular document being accessed again through a query, and wherein the document weight is based on a relevance of each document for queries in a query history; placing a processor and a disk in data communication with a First In First Out queue and a cache; and if the document being accessed again was not already in the cache, evicting another document from the cache to make room for the document being accessed again to be placed in the cache by packing elements in the order of a document weight-to-size ratio, highest to smallest, and evicting documents with a smallest document weight-to-size ratio first; maintaining a query history of recent queries from a user in a query history first-in first-out queue; assigning each query from a user a query weight based on a position of the query from a user in the First In First Out queue, wherein the query weight models a probability of a query or a related query being invoked again; wherein each one of the document weight is recomputed by the processor when a document to be retrieved was not previously cached; updating the query history First-in First-Out queue and each of the document weights when a new query has been entered; adapting each of the document weights to changing query frequencies and popularities; and selecting and evicting documents from the cache according to a knapsack solution.
2. The method of claim 1, wherein the computing of said value for each data document further comprises: computing a higher document weight for recent queries; computing a higher document weight for more frequent queries; computing an information-retrieval (IR) metric measuring a relevancy of a document for a query; and taking a weighted sum of the IR metric times a query weight over all queries.
3. The method of claim 1, wherein if all document sizes are the same, the ordering is done according to cached values of said document weight.
4. The method of claim 1, further comprising: allowing direct document accesses to be interspersed between keyword query accesses; and calculating a document weight for a direct access as a query which matches only one document.
5. A document search system, comprising: a keyword based interface that searches documents from a storage device; a cache that stages documents that are read and that are expected to be needed again from said storage device, wherein said cache further includes a document weight that is maintained for each document, said document weight models a probability of a particular document being accessed again through a query, said document weight is based on a relevance of each document for queries in a query history, and if said document being accessed again was not already in said cache, another document is evicted from said cache to make room for said document being accessed again to be placed in said cache by packing elements in the order of a document weight-to-size ratio, highest to smallest, and documents with a smallest document weight-to-size ratio are evicted first; a query history first-in first-out (FIFO) queue that maintains a query history of recent queries from a user, wherein, each query is assigned a query weight based on its position in said FIFO queue, wherein the query weight models a probability of a query or a related query being invoked again; a processor connected to said query history FIFO queue, wherein said processor computes a value for each data document based on a document information retrieval (IR) relevancy metric for user keyword queries and a recency and a frequency of each query, and said processor recomputes each one of said document weight (Dw) for each data document when a document to be retrieved was not previously cached; an updating system that updates said query history FIFO queue, each of said query weight, and each of said document weight when a new query has been entered; a mechanism that adapts each one of said document weight for each data document to changing query frequencies and popularities; and a mechanism selecting and evicting documents from said cache based on said document weight for each data document according to a knapsack solution.
6. The document search system of claim 5, wherein if a document selected from a result set was not already in said cache, said document selected from a result set will be fetched from a storage system and a document weight for said document selected from a result set and not already in the cache is calculated by iterating over all queries then in said query history FIFO queue.
7. The document search system of claim 5, wherein: if said document being accessed again was not already in said cache, another document is evicted from said cache to make room for said document being accessed again to be placed in said cache according to a 0(1) knapsack computation.
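The eviction step recited in the claims, where documents with the smallest weight-to-size ratio are evicted first to make room for a newly fetched document, can be illustrated with a short Python sketch. It is a greedy approximation under stated assumptions and not the claimed system itself: the function name `evict_for`, the dictionary layout of the cache, and the sample weights and sizes are all hypothetical.

```python
def evict_for(cache, capacity, new_id, new_weight, new_size):
    """Make room for a new document and insert it.

    `cache` maps a document id to a (weight, size) pair with size > 0.
    Documents are evicted in ascending weight-to-size ratio until the
    new document fits. Returns the list of evicted document ids.
    """
    if new_size > capacity:
        raise ValueError("document is larger than the whole cache")
    used = sum(size for _, size in cache.values())
    # Candidate victims, smallest weight-to-size ratio first.
    victims = sorted(cache, key=lambda d: cache[d][0] / cache[d][1])
    evicted = []
    for doc_id in victims:
        if used + new_size <= capacity:
            break  # enough free space, stop evicting
        used -= cache[doc_id][1]
        del cache[doc_id]
        evicted.append(doc_id)
    cache[new_id] = (new_weight, new_size)
    return evicted


# Hypothetical cache of capacity 10, currently full.
cache = {"a": (4.0, 4), "b": (1.0, 4), "c": (1.0, 2)}
evicted = evict_for(cache, capacity=10, new_id="d",
                    new_weight=2.0, new_size=3)
```

Here document `b` has the lowest ratio (0.25) and is the only one evicted, after which the new document fits. When all sizes are equal, the ratio ordering reduces to ordering by document weight alone, which matches the special case in claim 3.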
April 17, 2008
March 20, 2012