Legal claims defining the scope of protection, as filed with the USPTO.
1. A method comprising: receiving, by a processing device, data representing a plurality of corpora, each of the plurality of corpora including a set of documents; receiving, by a processing device, data representing terms that appear in the corpora; for each one of the terms, determining, by a processing device, a plurality of inverse document frequency values each associated with a respective one of the plurality of corpora; receiving, by a processing device, data representing a subset of the terms that also appear in a document; for each term in the subset of the terms, determining, by a processing device, a term frequency for the term in the document; for each term in the subset of the terms, determining, by a processing device, an augmented term frequency-inverse document frequency value based on: (i) the term frequency, and (ii) the plurality of inverse document frequency values that were determined for the term in the subset of the terms; and for each term in the subset of the terms that also appear in the document, selecting, by a processing device, as a combined inverse document frequency value, a minimum value of a plurality of normalized inverse document frequency values determined for the term in the subset of the terms.
2. The method of claim 1, further comprising determining, by a processing device, one or more keywords for the document based on the augmented term frequency-inverse document frequency values determined for the subset of the terms.
3. The method of claim 1, wherein the determining, for each one of the terms, a plurality of inverse document frequency values each associated with a respective one of the plurality of corpora comprises: determining, for each one of the terms, a plurality of inverse document frequency values each associated with a respective one of the plurality of corpora and inversely proportional to a count of documents that are in the respective one of the plurality of corpora and include the one of the terms.
4. The method of claim 1, further comprising, for each one of the plurality of corpora, determining, by a processing device, a normalization factor based on a subset of the inverse document frequency values associated with the one of the plurality of corpora and a set of reference inverse document frequency values.
5. The method of claim 4, further comprising: for each term in the subset of the terms that also appear in the document, determining, by a processing device, the plurality of normalized inverse document frequency values each: (a) associated with a respective one of the plurality of corpora and (b) based on: (i) the normalization factor that was determined for the respective one of the plurality of corpora and (ii) the inverse document frequency value determined for the term in the subset of the terms and associated with the respective one of the plurality of corpora.
6. The method of claim 5, further comprising: for each term in the subset of the terms that also appear in the document, determining, by a processing device, a combined inverse document frequency value based on a geometric mean of the plurality of normalized inverse document frequency values that were determined for the term in the subset of the terms.
7. A non-transitory computer readable storage medium having instructions stored thereon, the instructions being executable by a machine to result in a method comprising: receiving data representing a plurality of corpora, each of the plurality of corpora including a set of documents; receiving data representing terms that appear in the corpora; for each one of the terms, determining a plurality of inverse document frequency values each associated with a respective one of the plurality of corpora; receiving data representing a subset of the terms that also appear in a document; for each term in the subset of the terms, determining a term frequency for the term in the document; for each term in the subset of the terms, determining, an augmented term frequency-inverse document frequency value based on: (i) the term frequency, and (ii) the plurality of inverse document frequency values that were determined for the term in the subset of the terms; and for each term in the subset of the terms that also appear in the document, selecting, by a processing device, as a combined inverse document frequency value, a minimum value of a plurality of normalized inverse document frequency values determined for the term in the subset of the terms.
8. The non-transitory computer readable medium of claim 7, the method further comprising: determining, by a processing device, one or more keywords for the document based on the augmented term frequency-inverse document frequency values determined for the subset of the terms.
9. The non-transitory computer readable medium of claim 7, wherein the determining, for each one of the terms, a plurality of inverse document frequency values each associated with a respective one of the plurality of corpora comprises: determining, for each one of the terms, a plurality of inverse document frequency values each associated with a respective one of the plurality of corpora and inversely proportional to a count of documents that are in the respective one of the plurality of corpora and include the one of the terms.
10. The non-transitory computer readable medium of claim 7, the method further comprising: for each one of the plurality of corpora, determining a normalization factor based on a subset of the inverse document frequency values associated with the one of the plurality of corpora and a set of reference inverse document frequency values.
11. The non-transitory computer readable medium of claim 10, the method further comprising: for each term in the subset of the terms that also appear in the document, determining the plurality of normalized inverse document frequency values each: (a) associated with a respective one of the plurality of corpora and (b) based on: (i) the normalization factor that was determined for the respective one of the plurality of corpora and (ii) the inverse document frequency value determined for the term in the subset of the terms and associated with the respective one of the plurality of corpora.
12. The non-transitory computer readable medium of claim 11, the method further comprising: for each term in the subset of the terms that also appear in the document, determining a combined inverse document frequency value based on a geometric mean of the plurality of normalized inverse document frequency values that were determined for the term in the subset of the terms.
13. A system comprising: a processing device to receive data representing a plurality of corpora, each of the plurality of corpora including a set of documents; a processing device to receive data representing terms that appear in the corpora; a processing device to determine, for each one of the terms, a plurality of inverse document frequency values each associated with a respective one of the plurality of corpora; a processing device to receive data representing a subset of the terms that also appear in a document; a processing device to determine, for each term in the subset of the terms, a term frequency for the term in the document; a processing device to determine, for each term in the subset of the terms, an augmented term frequency-inverse document frequency value based on: (i) the term frequency, and (ii) the plurality of inverse document frequency values that were determined for the term in the subset of the terms; and a processing device to determine, for each term in the subset of the terms that also appear in the document, selecting, by a processing device, as a combined inverse document frequency value, a minimum value of a plurality of normalized inverse document frequency values determined for the term in the subset of the terms.
14. The system of claim 13, further comprising: a processing device to determine one or more keywords for the document based on the augmented term frequency-inverse document frequency values determined for the subset of the terms.
15. The system of claim 13, wherein the determine, for each one of the terms, a plurality of inverse document frequency values each associated with a respective one of the plurality of corpora comprises: determine, for each one of the terms, a plurality of inverse document frequency values each associated with a respective one of the plurality of corpora and inversely proportional to a count of documents that are in the respective one of the plurality of corpora and include the one of the terms.
16. The system of claim 13, further comprising: a processing device to determine, for each one of the plurality of corpora, a normalization factor based on a subset of the inverse document frequency values associated with the one of the plurality of corpora and a set of reference inverse document frequency values.
17. The system of claim 16, further comprising: a processing device to determine, for each term in the subset of the terms that also appear in the document, the plurality of normalized inverse document frequency values each: (a) associated with a respective one of the plurality of corpora and (b) based on: (i) the normalization factor that was determined for the respective one of the plurality of corpora and (ii) the inverse document frequency value determined for the term in the subset of the terms and associated with the respective one of the plurality of corpora.
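For illustration only, the steps recited in claims 1 through 6 can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation described in the patent. The claims do not give formulas, so the sketch assumes the conventional log(N / df) form for inverse document frequency, a normalization factor equal to the ratio of mean reference IDF to mean corpus IDF over shared terms, and normalization by simple multiplication. It also skips any corpus in which a term does not appear. All function and parameter names (idf_table, normalization_factor, combined_idf, augmented_tfidf, keywords) are hypothetical.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Per-corpus IDF for every term; a corpus is a list of token lists.

    Assumption: IDF = log(N / df), the conventional form. The claims only
    say the value falls as the count of documents containing the term rises.
    """
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}


def normalization_factor(idf, reference_idf):
    """Hypothetical factor: mean reference IDF / mean corpus IDF over shared terms."""
    shared = [term for term in idf if term in reference_idf]
    if not shared:
        return 1.0
    corpus_mean = sum(idf[term] for term in shared) / len(shared)
    reference_mean = sum(reference_idf[term] for term in shared) / len(shared)
    # Fall back to 1.0 when the corpus mean is zero to avoid dividing by zero.
    return reference_mean / corpus_mean if corpus_mean else 1.0


def combined_idf(normalized_values, mode="min"):
    """Combine normalized IDFs: minimum (claim 1) or geometric mean (claim 6)."""
    if mode == "min":
        return min(normalized_values)
    return math.prod(normalized_values) ** (1.0 / len(normalized_values))


def augmented_tfidf(document, corpora, reference_idf, mode="min"):
    """Score each document term as term frequency times the combined IDF."""
    tables = [idf_table(corpus) for corpus in corpora]
    factors = [normalization_factor(table, reference_idf) for table in tables]
    scores = {}
    for term, term_freq in Counter(document).items():
        # Assumption: a corpus that lacks the term contributes no IDF value.
        normalized = [
            factor * table[term]
            for table, factor in zip(tables, factors)
            if term in table
        ]
        if normalized:
            scores[term] = term_freq * combined_idf(normalized, mode)
    return scores


def keywords(scores, k=3):
    """Return the k highest-scoring terms as candidate keywords (claim 2)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Taking the minimum means a term scores highly only if it is rare in every corpus considered, whereas the geometric mean lets a term that is common in one corpus still earn a moderate score. In this sketch the reference IDF values are passed in by the caller; one option, again an assumption, is to reuse the IDF table of one of the corpora.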
July 5, 2016