Reconstructing High-Fidelity Electronic Documents from Images via Generation of Synthetic Fonts

Published: April 14, 2009

Assignee: not available in USPTO data we have

Inventors: Dennis G. Nicholson

Technical Abstract

Patent Claims

19 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for creating an electronic version of a document, comprising: using a computer to perform: receiving images for the document; extracting character images from the images; generating a synthetic font for the document from the extracted character images, wherein generating the synthetic font involves: producing glyphs from the extracted character images, wherein producing glyphs from the extracted character images involves grouping similar character images into clusters, and iteratively: registering extracted character images in each cluster with sub-pixel accuracy, extracting a high-resolution, noise-reduced prototype from the registered character images for each cluster, measuring a distance from each registered character image to its associated prototype, and using the measured distances to purify each cluster via histogram analysis of inter-cluster and intra-cluster distances; obtaining character labels for the glyphs; and using the glyphs and associated character labels to form the synthetic font; and constructing the electronic version of the document by, using the synthetic font to represent text regions of the document, wherein the synthetic font represents both a logical content and a visual appearance of characters in the text regions, wherein the visual appearance of characters in the synthetic font are faithful replicas of corresponding characters on printed pages from which the images were generated, and using image-segments extracted from the images for the document to represent non-text regions of the document.
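The registration step in claim 1 calls for sub-pixel accuracy but does not name a technique. As a minimal sketch, assuming grayscale images stored as lists of rows with higher values meaning more ink, the offset between a character image and a reference can be estimated from their intensity-weighted centroids. The function names `centroid` and `subpixel_offset` are hypothetical, and centroid alignment is only one possible approach, not necessarily the one the patent uses.

```python
def centroid(img):
    """Return the intensity-weighted (row, col) centroid of a 2-D grayscale image."""
    total = sum(sum(row) for row in img)
    if total == 0:
        raise ValueError("image contains no ink")
    r = sum(i * v for i, row in enumerate(img) for v in row) / total
    c = sum(j * v for row in img for j, v in enumerate(row)) / total
    return r, c


def subpixel_offset(img, reference):
    """Estimate the fractional (d_row, d_col) shift that moves img onto reference.

    Assumption: aligning centroids is a good enough proxy for registration.
    """
    img_r, img_c = centroid(img)
    ref_r, ref_c = centroid(reference)
    return ref_r - img_r, ref_c - img_c


reference = [[0, 0, 0],
             [0, 1, 0],
             [0, 0, 0]]
smeared = [[0, 0, 0],
           [0, 1, 1],
           [0, 0, 0]]
offset = subpixel_offset(smeared, reference)  # half a pixel to the left
```

Applying the estimated shift would additionally require interpolated resampling, which this sketch leaves out.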

2. The method of claim 1, wherein obtaining character labels for the glyphs involves performing an optical character recognition (OCR) operation on the glyphs.

3. The method of claim 1, wherein using the measured distances to purify each cluster involves statistically analyzing extracted character images which are similar to each other to ensure that the character images fall into homogenous clusters.

4. The method of claim 3, wherein the statistical analysis is based on an inter-character distance metric.
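Claims 3 and 4 describe purifying a cluster through statistical analysis of an inter-character distance metric, without fixing the metric or the statistic. The sketch below assumes mean squared pixel difference as the metric and a simple mean-plus-`k`-standard-deviations cutoff as a stand-in for the histogram analysis named in claim 1. The names `distance`, `purify`, and the parameter `k` are hypothetical.

```python
from statistics import mean, pstdev


def distance(a, b):
    """Mean squared pixel difference between two equal-size grayscale images."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def purify(cluster, prototype, k=2.0):
    """Split a cluster into (kept, rejected) by distance to the prototype.

    Assumption: members farther than mean + k * std of the intra-cluster
    distances are treated as not belonging to the cluster.
    """
    dists = [distance(img, prototype) for img in cluster]
    cutoff = mean(dists) + k * pstdev(dists)
    kept = [img for img, d in zip(cluster, dists) if d <= cutoff]
    rejected = [img for img, d in zip(cluster, dists) if d > cutoff]
    return kept, rejected


prototype = [[0, 1], [1, 0]]
outlier = [[1, 0], [0, 1]]
cluster = [[[0, 1], [1, 0]] for _ in range(5)] + [outlier]
kept, rejected = purify(cluster, prototype)
```

A rejected image could then be reassigned to another cluster or seed a new one on the next iteration; that policy is not specified in the claims and is left open here.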

5. The method of claim 1, wherein producing glyphs from the extracted character images further involves: converting the extracted character images to grayscale prior to said iteratively registering, extracting, measuring and using.

6. The method of claim 1, wherein extracting the noise-reduced prototype from the registered character images for a given cluster involves averaging registered character images in the given cluster to produce a reduced-noise glyph which is representative of the given cluster.
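The averaging in claim 6 can be illustrated with a pixel-wise mean. This is a minimal sketch, assuming the images in a cluster are already registered, share the same dimensions, and are stored as lists of rows; the function name `average_prototype` is hypothetical.

```python
def average_prototype(cluster):
    """Return the pixel-wise mean of equally sized, pre-registered images."""
    if not cluster:
        raise ValueError("cluster is empty")
    n = len(cluster)
    rows, cols = len(cluster[0]), len(cluster[0][0])
    return [
        [sum(img[r][c] for img in cluster) / n for c in range(cols)]
        for r in range(rows)
    ]


# Two noisy scans of the same 1x2 character fragment.
scans = [[[0, 2]], [[2, 2]]]
glyph = average_prototype(scans)  # [[1.0, 2.0]]
```

Because uncorrelated scanner noise tends to cancel in the mean while the shared character shape does not, the prototype generally gets cleaner as the cluster grows.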

7. A computer-readable storage device storing instructions that when executed by a computer cause the computer to perform a method for creating an electronic version of a document, the method comprising: receiving images for the document; extracting character images from the images; generating a synthetic font for the document from the extracted character images, wherein generating the synthetic font involves: producing glyphs from the extracted character images, wherein producing glyphs from the extracted character images involves grouping similar character images into clusters and iteratively: registering extracted character images in each cluster with sub-pixel accuracy, extracting a high-resolution, noise-reduced prototype from the registered character images for each cluster, measuring a distance from each registered character image to its associated prototype, and using the measured distances to purify each cluster via histogram analysis of inter-cluster and intra-cluster distances; obtaining character labels for the glyphs; and using the glyphs and associated character labels to form the synthetic font; and constructing the electronic version of the document by, using the synthetic font to represent text regions of the document, wherein the synthetic font represents both a logical content and a visual appearance of characters in the text regions, wherein the visual appearance of characters in the synthetic font are faithful replicas of corresponding characters on printed pages from which the images were generated, and using image-segments extracted from the images for the document to represent non-text regions of the document.

8. The computer-readable storage device of claim 7, wherein obtaining character labels for the glyphs involves performing an optical character recognition (OCR) operation on the glyphs.

9. The computer-readable storage device of claim 7, wherein using the measured distances to purify each cluster involves statistically analyzing extracted character images which are similar to each other to ensure that the character images fall into homogenous clusters.

10. The computer-readable storage device of claim 9, wherein the statistical analysis is based on an inter-character distance metric.

11. The computer-readable storage device of claim 7, wherein producing glyphs from the extracted character images further involves: converting the extracted character images to grayscale prior to said iteratively registering, extracting, measuring and using.

12. The computer-readable storage device of claim 7, wherein extracting the noise-reduced prototype from the registered character images for a given cluster involves averaging registered character images in the given cluster to produce a reduced-noise glyph which is representative of the given cluster.

13. A method for generating a synthetic font, comprising: using a computer to perform: receiving a set of scanned character images; producing glyphs from the set of scanned character images, wherein producing glyphs from the set of scanned character images involves grouping similar character images into clusters, and iteratively: registering scanned character images in each cluster with sub-pixel accuracy, extracting a high-resolution, noise-reduced prototype from the registered character images for each cluster, measuring a distance from each registered character image to its associated prototype, and using the measured distances to purify each cluster via histogram analysis of inter-cluster and intra-cluster distances; obtaining character labels for the glyphs; and using the glyphs and associated character labels to form the synthetic font, whereby the synthetic font can represent both a logical content and a visual appearance of characters in a document, wherein the visual appearance of characters in the synthetic font are faithful replicas of corresponding characters on printed pages from which the images were generated.

14. The method of claim 13, wherein obtaining character labels for the glyphs involves performing an optical character recognition (OCR) operation on the glyphs.

15. The method of claim 13, wherein using the measured distances to purify each cluster involves statistically analyzing character images which are similar to each other to ensure that the scanned character images fall into homogenous clusters.

16. The method of claim 15, wherein the statistical analysis is based on an inter-character distance metric.

17. The method of claim 13, wherein producing glyphs from the set of scanned character images further involves, prior to said iteratively registering, extracting, measuring and using: increasing a resolution of the scanned character images through up-sampling; and converting the scanned character images to grayscale using an inverse of a scanner modulation transfer function.
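The up-sampling step in claim 17 is not tied to a particular interpolation scheme. The following sketch assumes bilinear interpolation with an integer scale factor and edge clamping; the function name `upsample_bilinear` is hypothetical. The inverse modulation transfer function step is not covered, since it depends on scanner characteristics the claims do not give.

```python
def upsample_bilinear(img, factor):
    """Enlarge a 2-D grayscale image by an integer factor using bilinear interpolation."""
    rows, cols = len(img), len(img[0])
    out = []
    for i in range(rows * factor):
        # Map the output pixel centre back into source coordinates, clamped to the image.
        y = min(max((i + 0.5) / factor - 0.5, 0), rows - 1)
        y0 = int(y)
        y1 = min(y0 + 1, rows - 1)
        fy = y - y0
        out_row = []
        for j in range(cols * factor):
            x = min(max((j + 0.5) / factor - 0.5, 0), cols - 1)
            x0 = int(x)
            x1 = min(x0 + 1, cols - 1)
            fx = x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            out_row.append(top * (1 - fy) + bottom * fy)
        out.append(out_row)
    return out


edge = [[0, 4]]
enlarged = upsample_bilinear(edge, 2)  # 2 rows x 4 columns, with a smooth ramp
```

Working at a higher resolution gives the later registration and averaging steps a finer grid on which to place fractional shifts.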

18. The method of claim 13, wherein extracting the noise-reduced prototype from the registered character images for a given cluster involves averaging registered character images in the given cluster to produce a reduced-noise glyph which is representative of the given cluster.

19. A computer-readable storage device storing instructions that when executed by a computer cause the computer to perform a method for generating a synthetic font, the method comprising: receiving a set of scanned character images; producing glyphs from the set of scanned character images, wherein producing glyphs from the set of scanned character images involves grouping similar character images into clusters and iteratively: registering scanned character images in each cluster with sub-pixel accuracy, extracting a high-resolution, noise-reduced prototype from the registered character images for each cluster, measuring a distance from each registered character image to its associated prototype, and using the measured distances to purify each cluster via histogram analysis of inter-cluster and intra-cluster distances; obtaining character labels for the glyphs; and using the glyphs and associated character labels to form the synthetic font, whereby the synthetic font can represent both a logical content and a visual appearance of characters in a document, wherein the visual appearance of characters in the synthetic font are faithful replicas of corresponding characters on printed pages from which the images were generated.

Patent Metadata

Filing Date

Unknown

