Training Language Models Using Text Corpora Comprising Realistic Optical Character Recognition (OCR) Errors

Published: May 24, 2022

Assignee: not available in the USPTO data we have

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method, comprising: generating, by a computer system, an initial set of images based on an input text corpus comprising text; determining, by the computer system, one or more characteristics of one or more simulated defects of a particular type, wherein determining the one or more characteristics comprises: determining positional information of each of the simulated defects, wherein the positional information comprises one or more coordinates randomly assigned to the one or more simulated defects, the one or more randomly assigned coordinates to be within a range of values from an initial value to a value corresponding to a width of an image of the initial set of images; overlaying, by the computer system, the one or more simulated defects of the particular type over the initial set of images to generate an augmented set of images comprising one or more text segments based on the one or more characteristics of the one or more simulated defects; generating an output text corpus based on the augmented set of images; and training, using the output text corpus, a language model for optical character recognition.

2. The method of claim 1, wherein generating the initial set of images further comprises: segmenting the input text corpus into a plurality of segments; generating a rendering of one or more of the segments; and generating one or more images comprising one or more of the segments.

3. The method of claim 1, wherein the one or more simulated defects comprise a line or a spot in one or more of the augmented set of images.

4. The method of claim 1, wherein the one or more simulated defects represent at least one of a printing defect, a scanning defect, or a photo defect.

5. The method of claim 1, wherein overlaying the one or more simulated defects over the initial set of images further comprises: modifying one or more portions of an image of the initial set of images based on the one or more characteristics of the one or more simulated defects.

6. The method of claim 5, wherein determining the one or more characteristics of the simulated defects further comprises: determining dimensional information of each of the simulated defects.

7. The method of claim 5, wherein determining the one or more characteristics of the simulated defects further comprises: determining color information of each of the simulated defects.

8. The method of claim 5, wherein determining the one or more characteristics of the simulated defects further comprises selecting a number of the simulated defects to be overlaid onto one or more of the initial set of images.

9. The method of claim 5, wherein modifying one or more portions of the image of the initial set of images based on the one or more characteristics of the simulated defects comprises: adjusting values of one or more pixels in the image of the initial set of images based on the characteristics of the one or more simulated defects.

10. The method of claim 1, further comprising varying a number of realistic optical character recognition (OCR) errors in the output text corpus for learning a plurality of language models, wherein the realistic OCR errors comprise context-dependent information.

11. The method of claim 1, wherein generating the output text corpus based on the augmented set of images comprises performing optical character recognition on the augmented set of images.

12. The method of claim 1, wherein the language model for optical character recognition comprises at least one of a language model using word embeddings or a language model using character embeddings.

13. The method of claim 1, wherein the input text corpus comprises straight text.

14. A system, comprising: a memory; a processing device, coupled to the memory, the processing device to: generate an initial set of images based on an input text corpus comprising text; determine one or more characteristics of one or more simulated defects of a particular type, wherein determining the one or more characteristics comprises: determining positional information of each of the simulated defects, wherein the positional information comprises one or more coordinates randomly assigned to the one or more simulated defects, the one or more randomly assigned coordinates to be within a range of values from an initial value to a value corresponding to a width of an image of the initial set of images; overlay the one or more simulated defects of the particular type over the initial set of images to generate an augmented set of images comprising one or more text segments based on the one or more characteristics of the one or more simulated defects; generate an output text corpus based on the augmented set of images; and train, using the output text corpus, a language model for optical character recognition.

15. The system of claim 14, wherein, to generate the initial set of images, the processing device is further to: segment the input text corpus into a plurality of segments; generate a rendering of one or more of the segments; and obtain one or more images comprising one or more of the segments.

16. The system of claim 15, wherein the one or more simulated defects comprise a line or a spot in one or more of the augmented set of images.

17. The system of claim 15, wherein the simulated defects represents at least one of a printing defect, a scanning defect, or a photo defect.

18. The system of claim 14, wherein, to overlay the one or more simulated defects over the initial set of images, the processing device is further to: modify one or more portions of an image of the initial set of images based on the one or more characteristics of the one or more simulated defects.

19. The system of claim 18, wherein the one or more characteristics further comprise at least one of dimensional information of the simulated defects, a number of the simulated defects, or color information of each of the simulated defects.

20. A computer-readable non-transitory storage medium comprising executable instructions that, when executed by a processing device, cause the processing device to: generate an initial set of images based on an input text corpus comprising text; determine one or more characteristics of one or more simulated defects of a particular type, wherein determining the one or more characteristics comprises: determining positional information of each of the simulated defects, wherein the positional information comprises one or more coordinates randomly assigned to the one or more simulated defects, the one or more randomly assigned coordinates to be within a range of values from an initial value to a value corresponding to a width of an image of the initial set of images; overlay the one or more simulated defects of the particular type over the initial set of images to generate an augmented set of images based on the one or more characteristics of the one or more simulated defects; generate an output text corpus based on the augmented set of image comprising text segments; and train, using the output text corpus, a language model for optical character recognition.
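Illustrative sketch (not part of the filed claims). The defect-simulation steps recited in claims 1, 3, and 5 through 9 could look roughly like the following. This is a minimal sketch under stated assumptions, not the patented implementation: the image is modeled as a plain grayscale grid of integers, the "initial value" of the coordinate range is assumed to be 0, the defect is assumed to be a vertical line, and every name here (`LineDefect`, `blank_image`, `sample_line_defects`, `overlay`) is hypothetical. Rendering text into images, running OCR on the augmented images, and training the language model are outside the scope of the sketch.

```python
import random
from dataclasses import dataclass

WHITE, BLACK = 255, 0  # assumed 8-bit grayscale values


@dataclass
class LineDefect:
    """Hypothetical record of one simulated vertical-line defect."""
    x: int          # positional information: column in [0, image width)
    thickness: int  # dimensional information (claim 6)
    color: int      # color information (claim 7)


def blank_image(width, height):
    """Stand-in for a rendered text segment: an all-white pixel grid."""
    return [[WHITE] * width for _ in range(height)]


def sample_line_defects(width, rng, max_count=3, max_thickness=2):
    """Randomly choose how many defects to draw and their characteristics."""
    count = rng.randint(1, max_count)  # number of defects (claim 8)
    return [
        LineDefect(
            x=rng.randrange(width),  # random coordinate from 0 up to the width
            thickness=rng.randint(1, max_thickness),
            color=rng.choice((BLACK, WHITE)),
        )
        for _ in range(count)
    ]


def overlay(image, defects):
    """Return a copy of the image with pixel values adjusted per defect (claim 9)."""
    width = len(image[0])
    out = [row[:] for row in image]  # leave the initial image untouched
    for d in defects:
        for row in out:
            for x in range(d.x, min(d.x + d.thickness, width)):
                row[x] = d.color
    return out


rng = random.Random(0)  # seeded only to make the sketch repeatable
page = blank_image(width=40, height=8)
defects = sample_line_defects(width=40, rng=rng)
augmented = overlay(page, defects)
```

On a real rendered page, a black line would add spurious strokes and a white line would erase parts of glyphs; either can push an OCR engine toward the kind of context-dependent misreadings the claims refer to. A spot defect (claim 3) would follow the same pattern with a second random coordinate for the row.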

Patent Metadata

Filing Date: Unknown

Publication Date: May 24, 2022

Inventors: Ivan Germanovich Zagaynov
