A computer-implemented tool for automated classification and interpretation of documents, such as life science documents supporting clinical trials, is configured to perform a combination of raw text, document construct, and image analyses to enhance classification accuracy by enabling a more comprehensive machine-based understanding of document content. The combination of analyses provides context for classification by leveraging relative spatial relationships among text and image elements, identifying characteristics and formatting of elements, and extracting additional metadata from the documents as compared to conventional automated classification tools.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A computer-implemented method for classifying and interpreting life science documents, comprising: receiving digitized representations of the life science documents, the digitized representations including document elements comprising one or more of text or image; performing text analysis of the digitized representations of the life science documents, the text analysis including identifying raw words in the text; performing construct analysis of the digitized representations of the life science documents, the construct analysis including identifying document context that describes characteristics of document elements and relative spatial positioning of document elements on pages of the life science documents; performing image analysis of the digitized representations of the life science documents, the image analysis including identifying images and processing the identified images to extract additional characteristics for document elements; collectively utilizing results of the text, construct, and image analyses to classify the life science documents into one or more predefined classes; and tagging the life science documents with classification tags and event tags, wherein the event tags are configured for operation as a trigger or an alert.
2. The computer-implemented method of claim 1 in which the relative spatial positions include one of header, footer, caption, footnote, or title.
3. The computer-implemented method of claim 1 in which the identification of context further includes identifying formatting of the life science documents.
4. The computer-implemented method of claim 1 in which the image analysis further includes identifying logos, graphics, diagrams, diagram text, or captions.
5. The computer-implemented method of claim 4 in which the image analysis further includes interpreting one or more of the identified logos, graphics, diagrams, diagram text, or captions.
6. The computer-implemented method of claim 1 in which the characteristics of the document elements include one of font, size, or format of text.
7. The computer-implemented method of claim 1 in which the construct analysis further includes tracking text neighboring each document element.
8. The computer-implemented method of claim 1 in which the image analysis further includes image to text conversion to extract text in digitized form from images.
9. The computer-implemented method of claim 1 further comprising classifying content in the life science documents into one or more predefined classes.
10. The computer-implemented method of claim 1 in which the text analysis includes tracking a sequence of text in the life science documents.
11. The computer-implemented method of claim 1 in which one or more of the text, construct, or image analyses generate metadata that is associated with the life science documents, wherein the metadata is utilized, at least in part, to perform the classifying.
12. The computer-implemented method of claim 1 in which the one or more predefined classes comprise classes defined by the Drug Information Association.
13. A computing device configured to operate as a computer-implemented automated classification and interpretation tool, comprising: one or more processors; and one or more non-transitory computer-readable storage media storing instructions which, when executed by the one or more processors, cause the computing device to: deconstruct one or more life science documents into a standardized data structure to generate document elements comprising images and digitized text as an input to the computer-implemented automated classification and interpretation tool, perform a combination of text, construct, and image analyses on the document elements to create context-based representations of the life science documents whereby spatial relationships among document elements are identified, extract metadata that describes one or more of the document elements, utilize the context-based representations and extracted metadata to assist classification of the life science documents into pre-defined classes; and interpret the life science documents into classification tags or event tags, wherein the event tags operate as alerts, actions or triggers.
14. The computing device of claim 13 in which the executed instructions further cause the computing device to classify the life science document using machine learning processes, wherein the machine learning processes are adjustable according to input from a human operator.
15. One or more non-transitory computer-readable storage media storing executable instructions which, when executed by one or more processors in a computing device, implement a computer-implemented automated classification tool configured to perform a method including the steps of: identifying raw text in a digitized life science document; identifying construction of the digitized life science document to identify relative spatial locations of text and image elements in the digitized life science document; identifying images to extract text in digitized form; identifying characteristics of raw and extracted text; utilizing results of each of the identification steps in combination to generate metadata; classifying the life science document utilizing the generated metadata; and tagging the life science documents with classification tags and event tags, wherein the event tags are configured for operation as a trigger or an alert.
16. The one or more non-transitory computer-readable storage media of claim 15 in which the classifying utilizes one or more of weighting the results, application of latent semantic analysis, or non-parametric ANCOVA (analysis of covariance).
Cooperative Patent Classification codes for this invention. Click any code to explore related patents in that topic.
March 1, 2019
November 17, 2020
Browse 5M+ US patents with plain-English claim translations and AI-generated analysis.