An exemplary method may comprise receiving a matrix for a set of documents, each cell of the matrix including a frequency value indicating a number of instances of a corresponding text segment in a corresponding document, receiving an indication of a relationship between two text segments, each of the two text segments associated with a first column and a second column, respectively, of the matrix, adjusting, for each document, a frequency value of the second column based on the frequency value of the first column, projecting each frequency value into a reference space to generate a set of projection values, identifying a plurality of subsets of the reference space, clustering, for each subset of the plurality of subsets, at least some documents that correspond to projection values, and generating a graph of nodes, each of the nodes identifying one or more of the documents corresponding to each cluster.
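The steps in the abstract can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the additive adjustment rule, the row-sum projection, the interval cover, the single-linkage threshold clustering, and every function name and parameter below are hypothetical choices made for the example. The source says only that the second frequency value is adjusted "based on" the first and does not fix a projection or clustering method.

```python
from collections import Counter
from itertools import combinations
import math


def frequency_matrix(docs, vocab):
    """Rows are documents, columns are text segments, cells are raw counts."""
    rows = []
    for text in docs:
        counts = Counter(text.lower().split())
        rows.append([counts[term] for term in vocab])
    return rows


def adjust(matrix, vocab, first, second):
    """Assumed rule: add the first segment's count to the related second segment's count."""
    i, j = vocab.index(first), vocab.index(second)
    return [row[:j] + [row[j] + row[i]] + row[j + 1:] for row in matrix]


def project(matrix):
    """Assumed projection: map each document to the sum of its frequencies (1-D reference space)."""
    return [float(sum(row)) for row in matrix]


def cover(values, n_intervals=3, overlap=0.5):
    """Overlapping intervals that together span the projection values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals or 1.0
    pad = width * overlap / 2
    return [(lo + k * width - pad, lo + (k + 1) * width + pad)
            for k in range(n_intervals)]


def cluster(ids, matrix, threshold):
    """Single-linkage: a document joins every cluster holding a document within the threshold."""
    clusters = []
    for d in ids:
        near = [c for c in clusters
                if any(math.dist(matrix[d], matrix[e]) <= threshold for e in c)]
        merged = {d}.union(*near)
        clusters = [c for c in clusters if c not in near] + [merged]
    return clusters


def mapper_graph(docs, vocab, relation, threshold=1.5):
    """Return (nodes, edges): nodes are sets of document indices, edges join nodes sharing a document."""
    matrix = adjust(frequency_matrix(docs, vocab), vocab, *relation)
    values = project(matrix)
    nodes = []
    for lo, hi in cover(values):
        ids = [d for d, v in enumerate(values) if lo <= v <= hi]
        nodes.extend(cluster(ids, matrix, threshold))
    edges = [(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]]
    return nodes, edges


docs = ["car car engine", "automobile engine",
        "recipe flour sugar", "flour sugar sugar"]
vocab = ["car", "automobile", "engine", "flour", "sugar"]
nodes, edges = mapper_graph(docs, vocab, relation=("car", "automobile"))
print(nodes, edges)
```

Because the intervals overlap, one document can fall into clusters from two neighboring intervals, and that shared membership is what produces an edge between the corresponding nodes.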
Legal claims defining the scope of protection, as filed with the USPTO.
1. A non-transitory computer readable medium comprising instructions, the instructions being executable by a processing device processor to perform a method, the method comprising: receiving a set of document identifiers, a plurality of text segments, and a plurality of frequency values, each document identifier of the set of document identifiers identifying one of a plurality of documents, each of the plurality of text segments being associated with a particular document identifier of the set of document identifiers, each of the plurality of text segments being in at least one document of the plurality of documents which is identified by the particular document identifier, and each of the plurality of frequency values indicating a number of instances of a corresponding text segment in a corresponding document of the plurality of documents; receiving an indication of a relationship between a first text segment of the plurality of text segments and a second text segment of the plurality of text segments, the first text segment being associated with a first frequency value of the plurality of frequency values, the second text segment being associated with a second frequency value of the plurality of frequency values; adjusting, for each document, the second frequency value based on a frequency value of the first frequency value; projecting each frequency value of the plurality of frequency values into a reference space to generate a set of projection values in the reference space; identifying a plurality of subsets of the reference space, at least some of the plurality of subsets including at least some of the projection values of the set of projection values in the reference space; clustering, for each subset of the plurality of subsets, at least some documents of the set of documents that correspond to a subset of the set of projection values to generate clusters of one or more documents; and generating a graph of nodes, each of the nodes identifying one or more of the documents corresponding to each cluster.
2. The computer readable medium of claim 1, wherein clustering, for each subset of the plurality of subsets, at least some documents of the set of documents comprises: determining a distance between at least two documents of the set of documents corresponding to at least two projection values in a first subset of the plurality of subsets; comparing the distance to a threshold value; and clustering each of the at least two documents in two different clusters or one cluster based on the comparison.
3. The computer readable medium of claim 1, wherein generating the graph of nodes comprises generating a graphical representation of the graph of nodes.
4. The computer readable medium of claim 1, the method further comprising generating a link between at least two nodes of the graph of nodes, each node corresponding to different clusters, a first document of the set of documents being a member of the different clusters.
5. The computer readable medium of claim 4, wherein generating the graph of nodes comprises generating a graphical representation of the graph of nodes and generating the link comprises generating an edge between the at least two nodes.
6. The computer readable medium of claim 1, wherein the plurality of subsets of the reference space have a non-empty intersection.
7. The computer readable medium of claim 1, wherein adjusting, for each document, the second frequency value based on a frequency value of the first frequency value comprises generating a third frequency value based on the frequency value of the first frequency value and leaving the second frequency value unchanged.
8. The computer readable medium of claim 1, wherein projecting each frequency value comprises projecting each frequency value, including each of the adjusted frequency values, into the reference space to generate the set of projection values in the reference space.
9. The computer readable medium of claim 8, wherein the second frequency value remains unchanged.
10. The computer readable medium of claim 1, wherein the text segments are from at least one dictionary of text segments.
11. The computer readable medium of claim 1, wherein one or more of the text segments are words.
12. The computer readable medium of claim 1, wherein one or more of the text segments are n-grams.
13. The computer readable medium of claim 1, wherein each frequency value is a term frequency-inverse document frequency for the corresponding text segment and the corresponding document.
14. A system comprising: at least one processor; memory including executable instructions, the executable instructions being executable by the at least one processor to control the at least one processor to: receive a set of document identifiers, a plurality of text segments, and a plurality of frequency values, each document identifier of the set of document identifiers identifying one of a plurality of documents, each of the plurality of text segments being associated with a particular document identifier of the set of document identifiers, each of the plurality of text segments being in at least one document of the plurality of documents which is identified by the particular document identifier, and each of the plurality of frequency values indicating a number of instances of a corresponding text segment in a corresponding document of the plurality of documents; receive an indication of a relationship between a first text segment of the plurality of text segments and a second text segment of the plurality of text segments, the first text segment being associated with a first frequency value of the plurality of frequency values, the second text segment being associated with a second frequency value of the plurality of frequency values; adjust, for each document, the second frequency value based on a frequency value of the first frequency value; project each frequency value of the plurality of frequency values into a reference space to generate a set of projection values in the reference space; identify a plurality of subsets of the reference space, at least some of the plurality of subsets including at least some of the projection values of the set of projection values in the reference space; cluster, for each subset of the plurality of subsets, at least some documents of the set of documents that correspond to a subset of the set of projection values to generate clusters of one or more documents; and generate a graph of nodes, each of the nodes identifying one or more of the documents corresponding to each cluster.
15. The system of claim 14, wherein clustering, for each subset of the plurality of subsets, at least some documents of the set of documents comprises: determining a distance between at least two documents of the set of documents corresponding to at least two projection values in a first subset of the plurality of subsets; comparing the distance to a threshold value; and clustering each of the at least two documents in two different clusters or one cluster based on the comparison.
16. The system of claim 14, wherein generating the graph of nodes comprises generating a graphical representation of the graph of nodes.
17. The system of claim 14, the method further comprising generating a link between at least two nodes of the graph of nodes, each node corresponding to different clusters, a first document of the set of documents being a member of the different clusters.
18. The system of claim 17, wherein generating the graph of nodes comprises generating a graphical representation of the graph of nodes and generating the link comprises generating an edge between the at least two nodes.
19. The system of claim 14, wherein the plurality of subsets of the reference space have a non-empty intersection.
20. The system of claim 14, wherein adjusting, for each document, the second frequency value based on a frequency value of the first frequency value comprises generating a third frequency value based on the frequency value of the first frequency value and leaving the second frequency value unchanged.
21. The system of claim 14, wherein projecting each frequency value comprises projecting each frequency value, including each of the adjusted frequency values, into the reference space to generate the set of projection values in the reference space.
22. The system of claim 21, wherein the second frequency value remains unchanged.
23. The system of claim 14, wherein the text segments are from at least one dictionary of text segments.
24. The system of claim 14, wherein one or more of the text segments are words.
25. The system of claim 14, wherein one or more of the text segments are n-grams.
26. The system of claim 14, wherein each frequency value is a term frequency-inverse document frequency for the corresponding text segment and the corresponding document.
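Two of the dependent claims describe variations on the frequency values that are easy to illustrate. The sketch below shows one possible reading of claims 7 and 20 (generate a third frequency value and leave the second unchanged) and of claims 13 and 26 (TF-IDF frequency values). The function names, the additive combination, and the unsmoothed `ln(N / df)` weighting are assumptions made for illustration; the claims do not specify which formula is used.

```python
import math


def add_combined_column(matrix, i, j):
    """Append a third value derived from columns i and j; both originals stay unchanged.

    Assumed combination rule: simple addition of the two related counts.
    """
    return [row + [row[i] + row[j]] for row in matrix]


def tf_idf(matrix):
    """Convert raw counts (documents x text segments) into TF-IDF weights.

    Assumed variant: tf is the raw count and idf = ln(N / df).
    """
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # df[j] counts the documents in which segment j occurs at least once
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


counts = [[2, 0, 1],
          [0, 1, 1]]
extended = add_combined_column(counts, 0, 1)
weights = tf_idf(counts)
print(extended)
print(weights)
```

Under this weighting a segment that appears in every document receives a weight of zero, so it no longer influences the projection or the distances used for clustering.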
September 7, 2018
June 9, 2020