A method, electronic device, and computer program product for categorizing a document. The method includes determining a key word associated with a document category and a corresponding weight. The method also includes determining a score of the document with respect to the key word, based at least on the frequency with which the key word appears in a field of the document and on the weight, and determining that the document is in the document category in response to the score of the document being higher than a threshold.
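The scoring step summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`tokenize`, `score_document`, `in_category`), the simple regex tokenizer, the choice to sum weight times frequency across all fields, and the sample weights and threshold are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_document(fields, keyword_weights):
    """Sum weight * frequency for every key word over every field.

    fields: mapping of field name (e.g. "title", "content") to text.
    keyword_weights: mapping of key word to weight for one category.
    """
    score = 0.0
    for text in fields.values():
        counts = Counter(tokenize(text))
        for keyword, weight in keyword_weights.items():
            # Counter returns 0 for key words that do not appear.
            score += weight * counts[keyword]
    return score


def in_category(fields, keyword_weights, threshold):
    """The document is in the category when its score exceeds the threshold."""
    return score_document(fields, keyword_weights) > threshold


# Hypothetical key words and weights for a "legal" category.
legal_keywords = {"contract": 2.0, "liability": 1.5}
doc = {
    "title": "Service contract",
    "content": "The contract limits liability. Liability is capped.",
}
print(score_document(doc, legal_keywords))
print(in_category(doc, legal_keywords, threshold=5.0))
```

In this example "contract" appears twice and "liability" twice across the two fields, so the score is 2.0 × 2 + 1.5 × 2 = 7.0, which exceeds the assumed threshold of 5.0.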
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method of categorizing a document, comprising: determining a key word and a weight associated with a document category; determining, at least based on a frequency of the key word appearing in a field of the document and the weight, a score of the document with respect to the key word; and in response to the score of the document being greater than a threshold, determining that the document is in the document category, wherein the document category is one of a plurality of document categories; determining a plurality of scores of the document corresponding to the plurality of document categories; determining a post-threshold based on a comparison between a maximum score of the plurality of scores and a pre-threshold; normalizing the plurality of scores to obtain a plurality of normalized scores; and in response to a normalized score of the plurality of normalized scores of the document being greater than the post-threshold, determining that the document is in the document category corresponding to the normalized score.
2. The method according to claim 1, wherein the key word and the weight are obtained by a neural network trained based on a text corpus.
3. The method according to claim 1, wherein the field of the document comprises at least one of a title field and a content field.
4. The method according to claim 1, wherein the field of the document comprises only a content field.
5. The method according to claim 1, wherein determining the score of the document is further based on at least one of: the number of key words appearing in the field of the document, wherein the key word is one of the key words; a length of the field of the document; and the number of documents of a plurality of documents in which the key words appear, wherein the document is one of the plurality of documents.
6. The method according to claim 1, wherein determining the score of the document comprises: normalizing a plurality of weights to obtain a plurality of normalized weights, wherein the weight is one of the plurality of weights; and determining the score of the document based on the plurality of normalized weights.
7. The method according to claim 1, further comprising: in response to determining that the document is in the document category, applying a tag corresponding to the document category to the document.
8. The method according to claim 7, further comprising: in response to a query request for the document, providing a query result comprising the document and the tag.
9. An electronic device, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions executable by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform a method, the method comprising: determining a key word and a weight associated with a document category; determining, at least based on a frequency of the key word appearing in a field of the document and the weight, a score of the document with respect to the key word; and in response to the score of the document being greater than a threshold, determining that the document is in the document category, wherein the document category is selected from a plurality of document categories; determining a plurality of scores of the document corresponding to the plurality of document categories; determining a post-threshold based on a comparison between a maximum score of the plurality of scores and a pre-threshold; normalizing the plurality of scores to obtain a plurality of normalized scores; and in response to a normalized score of the plurality of normalized scores of the document being greater than the post-threshold, determining that the document is in the document category corresponding to the normalized score.
10. The electronic device according to claim 9, wherein the key word and the weight are obtained by a neural network trained based on a text corpus.
11. The electronic device according to claim 9, wherein the field of the document comprises at least one of a title field and a content field.
12. The electronic device according to claim 9, wherein the field of the document comprises only a content field.
13. The electronic device according to claim 9, wherein determining the score of the document is further based on at least one of: the number of key words appearing in the field of the document, wherein the key word is one of the key words; a length of the field of the document; and the number of documents of a plurality of documents in which the key words appear, wherein the document is one of the plurality of documents.
14. The electronic device according to claim 9, wherein determining the score of the document comprises: normalizing a plurality of weights to obtain a plurality of normalized weights, wherein the weight is one of the plurality of weights; and determining the score of the document based on the plurality of normalized weights.
15. The electronic device according to claim 9, wherein the method further comprises: in response to determining that the document is in the document category, applying a tag corresponding to the document category to the document.
16. The electronic device according to claim 15, wherein the method further comprises: in response to a query request for the document, providing a query result comprising the document and the tag.
17. A computer program product being tangibly stored on a non-transitory computer readable medium and comprising machine executable instructions which, when executed, causing a machine to perform a method, the method comprising: determining a key word and a weight associated with a document category; determining, at least based on a frequency of the key word appearing in a field of the document and the weight, a score of the document with respect to the key word; and in response to the score of the document being greater than a threshold, determining that the document is in the document category, wherein the document category is one of a plurality of document categories; determining a plurality of scores of the document corresponding to the plurality of document categories; determining a post-threshold based on a comparison between a maximum score of the plurality of scores and a pre-threshold; normalizing the plurality of scores to obtain a plurality of normalized scores; and in response to a normalized score of the plurality of normalized scores of the document being greater than the post-threshold, determining that the document is in the document category corresponding to the normalized score.
18. The computer program product according to claim 17 being tangibly stored on a non-transient computer readable medium and comprising machine executable instructions which, when executed, causing a machine to perform a method, the method further comprising: in response to determining that the document is in the document category, applying a tag corresponding to the document category to the document; and in response to a query request for the document, providing a query result comprising the document and the tag.
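The multi-category flow recited in the independent claims (scores per category, a post-threshold chosen by comparing the maximum score with a pre-threshold, normalization, then tagging and querying) can be sketched as follows. The claims do not state how the post-threshold is derived or how scores are normalized, so this sketch makes explicit assumptions: scores are normalized by dividing by their sum, and a lower cutoff (`strong_post`) is used when the maximum raw score exceeds the pre-threshold, otherwise a stricter one (`weak_post`). All function names, parameter names, and numeric values are hypothetical.

```python
def normalize_scores(scores):
    """Divide each score by the total so the results sum to 1 (assumed scheme)."""
    total = sum(scores.values())
    if total <= 0:
        return {category: 0.0 for category in scores}
    return {category: s / total for category, s in scores.items()}


def categorize(scores, pre_threshold, strong_post=0.3, weak_post=0.6):
    """Return the categories whose normalized score exceeds the post-threshold.

    Assumption: if the best raw score exceeds pre_threshold, the match is
    treated as strong and the lower cutoff strong_post applies; otherwise
    the stricter cutoff weak_post applies.
    """
    if not scores:
        return []
    best = max(scores.values())
    post_threshold = strong_post if best > pre_threshold else weak_post
    normalized = normalize_scores(scores)
    return [c for c, s in normalized.items() if s > post_threshold]


def apply_tags(document, categories):
    """Attach one tag per matched category (here the tag is the category name)."""
    document["tags"] = list(categories)
    return document


def query(documents, term):
    """Return (title, tags) for each document whose content mentions the term."""
    term = term.lower()
    return [
        (d["title"], d.get("tags", []))
        for d in documents
        if term in d["content"].lower()
    ]


# Hypothetical per-category scores for one document.
scores = {"legal": 7.0, "finance": 4.0, "hr": 1.0}
doc = {"title": "Service contract", "content": "The contract limits liability."}
apply_tags(doc, categorize(scores, pre_threshold=5.0))
print(query([doc], "liability"))
```

With a pre-threshold of 5.0, the maximum score of 7.0 counts as strong, so the 0.3 cutoff applies and both "legal" (7/12) and "finance" (4/12) are kept. Raising the pre-threshold to 10.0 switches to the 0.6 cutoff, and no category qualifies.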
August 29, 2018
December 8, 2020