US-7715635

Identifying similarly formed paragraphs in scanned images

Published: May 11, 2010
Assignee: not available in USPTO data we have
Inventors: not available in USPTO data we have
Technical Abstract

A system and method for identifying and/or categorizing similarly formed paragraphs in a digital image is set forth. An exemplary system includes a processor and a memory. The memory stores executable components which, when executed, direct the system to perform the following: obtain at least one page image of reflowable textual content and identify at least one paragraph of textual content. Thereafter, for each identified paragraph, a plurality of paragraph metrics regarding the identified paragraph is determined. Based on the paragraph metrics, similarly formed paragraphs are clustered.

Patent Claims
52 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1

1. A computing device for identifying similarly formed paragraphs according to an analysis of paragraph metrics, the computing device comprising: a processor; and a memory, the memory storing at least one component suitable for execution by the processor and which, when executed, directs the computing device to: obtain at least one page image, each page image comprising reflowable textual content; identify paragraphs of reflowable textual content in the obtained at least one page image; for each identified paragraph, determine a plurality of metrics regarding the identified paragraph; and perform a clustering analysis of the identified paragraphs based on at least one of the plurality of metrics of each paragraph, thereby resulting in at least one cluster of similarly formed paragraphs found on the at least one page image.

2

2. The computing device of claim 1 , wherein performing the clustering analysis of the identified paragraphs based on at least one of the plurality of metrics of each paragraph comprises performing a quality threshold (QT) clustering analysis to generate at least one cluster of similarly formed paragraphs found in the at least one page image.

3

3. The computing device of claim 2 , wherein performing the clustering analysis of the identified paragraphs based on at least one of the plurality of metrics of each paragraph further comprises combining at least two of the plurality of metrics and ordering a combination of metrics according to their relevancy of clustering paragraphs prior to performing said QT clustering analysis.

4

4. The computing device of claim 3 , wherein combining at least two of the plurality of metrics comprises combining at least two of the plurality of metrics using a principle component analysis (PCA) on the plurality of metrics.

5

5. The computing device of claim 4 further configured to repeatedly: update the metrics of each paragraph within a cluster to reflect a standard paragraph of the cluster; and perform a QT clustering analysis on the updated metrics of each updated paragraph; until the number of clusters generated by the QT clustering analysis is not reduced.

6

6. The computing device of claim 2 , wherein the clustering analysis of the identified paragraphs based on at least one of the plurality of metrics of each paragraph comprises one of a statistical analysis or a deterministic analysis on the at least one of the plurality of metrics of each paragraph.

7

7. The computing device of claim 1 further configured to determine the content boundaries of the reflowable textual content for each page image of the at least one page image.

8

8. The computing device of claim 7 , wherein determining the content boundaries of the reflowable textual content for each page image comprises determining content boundaries to exclude non-reflowable content found on each page image.

9

9. The computing device of claim 8 , wherein determining metrics for each identified paragraph comprises determining a bounding region for each of the identified paragraphs.

10

10. The computing device of claim 9 , wherein determining metrics for each identified paragraph further comprises, for each identified paragraph, determining the distances between the paragraph's bounding region and the closest adjacent paragraph or the distances between the paragraph's bounding region and a corresponding content boundary when the paragraph is immediately adjacent to a content boundary.

11

11. The computing device of claim 10 , wherein determining metrics for each identified paragraph further comprises, for each identified paragraph, determining the distance of indentation of the first line of text in the identified paragraph.

12

12. The computing device of claim 11 , wherein determining metrics for each identified paragraph further comprises, for each identified paragraph, determining the line height of the paragraph, the line height being the distance between the baselines of two consecutive lines of textual content.

13

13. The computing device of claim 12 , wherein determining metrics for each identified paragraph further comprises, for each identified paragraph, determining the level of hierarchical nesting of the textual content.

14

14. The computing device of claim 13 , wherein determining metrics for each identified paragraph further comprises, for each identified paragraph, determining the width of the paragraph's bounding region.

15

15. The computing device of claim 14 , wherein determining metrics for each identified paragraph further comprises, for each identified paragraph, determining whether the paragraph includes only a single line, and setting the paragraph width and distance to the right to zero if there is only one line.

16

16. The computing device of claim 1 further comprising a local storage area, and wherein obtaining the at least one page image comprises obtaining the at least one page image from the local storage area.

17

17. The computing device of claim 1 further comprising an input interface connected to a digitizing device, and wherein obtaining the at least one page image comprises obtaining the at least one page image from the digitizing device via the input interface.

18

18. The computing device of claim 1 further comprising a network interface connected to a network, and wherein obtaining the at least one page image comprises obtaining the at least one page image from an external source on the network via the network interface.

19

19. The computing device of claim 18 , wherein the network interface is a wireless network interface that wirelessly connects the computing device to a network, wherein obtaining the at least one page image comprises obtaining the at least one page image from an external source on the network via the wireless network interface.

20

20. The computing device of claim 1 , wherein the computing device is further directed to: associate a paragraph category with each cluster resulting from the clustering analysis; and generate a paragraph style for each paragraph category, wherein each paragraph style corresponds to at least some paragraph metrics of a typical paragraph of the categorized cluster.

21

21. The computing device of claim 20 , wherein the computing device is further directed to determine whether the number of clusters resulting from the clustering analysis exceeds a predetermined threshold, and if so, obtaining human input regarding associating a paragraph category with each cluster.
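
Claims 9 through 15 recite the per-paragraph metrics: a bounding region, distances to the nearest neighbor or content boundary, first-line indentation, line height, width, and the single-line rule. The following Python sketch shows one way those values might be computed. The `Box` and `Paragraph` types, the field and key names, the top-left pixel coordinate system, and the single-column layout (left and right distances are measured only against the content boundary) are illustrative assumptions, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Box:
    """Axis-aligned region in page pixels, origin at the top-left (assumed)."""
    left: float
    top: float
    right: float
    bottom: float


@dataclass
class Paragraph:
    """Hypothetical record for one identified paragraph."""
    box: Box                 # bounding region (claim 9)
    first_line_left: float   # x-coordinate where the first line of text starts
    baselines: List[float]   # y-coordinate of each text line's baseline


def paragraph_metrics(p: Paragraph, content: Box,
                      above: Optional[Paragraph] = None,
                      below: Optional[Paragraph] = None) -> Dict[str, float]:
    """Compute the metrics of claims 10-15 for a single-column page (assumed)."""
    # Claim 10: distance to the adjacent paragraph, or to the content
    # boundary when no paragraph lies between the two.
    dist_up = p.box.top - (above.box.bottom if above else content.top)
    dist_down = (below.box.top if below else content.bottom) - p.box.bottom
    dist_left = p.box.left - content.left
    dist_right = content.right - p.box.right

    # Claim 11: indentation of the first line, measured here from the left
    # edge of the paragraph's bounding region (an assumption).
    indent = p.first_line_left - p.box.left

    # Claim 12: line height is the gap between two consecutive baselines.
    single_line = len(p.baselines) < 2
    line_height = 0.0 if single_line else p.baselines[1] - p.baselines[0]

    # Claim 14: width of the bounding region.
    width = p.box.right - p.box.left

    # Claim 15: a single-line paragraph gets width and right distance of zero.
    if single_line:
        width = 0.0
        dist_right = 0.0

    return {
        "dist_up": dist_up, "dist_down": dist_down,
        "dist_left": dist_left, "dist_right": dist_right,
        "indent": indent, "line_height": line_height, "width": width,
    }
```

The nesting level of claim 13 is omitted because it depends on a document hierarchy that the claims do not spell out.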

22

22. A computer-implemented method for categorizing similarly formed paragraphs in at least one page image of reflowable textual content, the method comprising: obtaining at least one page image; identifying a plurality of paragraphs of reflowable textual content in each page of the at least one page image; determining a plurality of paragraph metrics regarding each of the plurality of identified paragraphs; clustering the identified paragraphs into at least one cluster of paragraphs according to at least some of the paragraph metrics; associating a paragraph category with each cluster of paragraphs; and generating a paragraph style for each paragraph category, wherein each paragraph style corresponds to at least some paragraph metrics of a typical paragraph of the corresponding categorized cluster.

23

23. The method of claim 22 further comprising determining whether the number of clusters exceeds an expected threshold, and if so, obtaining human input regarding associating a paragraph category with each cluster.

24

24. The method of claim 22 , wherein clustering the identified paragraphs into at least one cluster of paragraphs according to at least some of the paragraph metrics comprises performing a clustering analysis of at least some of the paragraph metrics, the result yielding a clustering of the identified paragraphs.

25

25. The method of claim 24 , wherein the clustering analysis of the paragraphs metrics comprises a quality threshold (QT) clustering analysis.

26

26. The method of claim 24 further comprising combining at least two paragraph metrics and ordering the combined paragraph metrics according to those combinations that are most relevant for clustering the paragraphs, and performing a clustering analysis of the ordered paragraph metric combinations.

27

27. The method of claim 26 , wherein combining at least two paragraph metrics comprises performing a principle component analysis (PCA) on the paragraph metrics.

28

28. The method of claim 26 , wherein determining a plurality of paragraph metrics regarding each of the identified paragraphs comprises determining content boundaries for the reflowable content of the corresponding page image.

29

29. The method of claim 28 , wherein determining content boundaries for the reflowable content of the corresponding page image comprises excluding non-reflowable content on the corresponding page image.

30

30. The method of claim 28 , wherein determining a plurality of paragraph metrics regarding each of the identified paragraphs further comprises determining a bounding region for each of the identified paragraphs.

31

31. The method of claim 30 , wherein determining a plurality of paragraph metrics regarding each of the identified paragraphs further comprises, for each identified paragraph, determining distances of the paragraph to adjacent paragraphs or to the content boundaries of the corresponding page image when the paragraph is immediately adjacent to a content boundary, wherein the distances include at least one of a distance up, a distance left, a distance right, and a distance down.

32

32. The method of claim 31 , wherein determining a plurality of paragraph metrics regarding each of the identified paragraphs further comprises, for each identified paragraph, determining the amount of indentation for the first line of text in each identified paragraph.

33

33. The method of claim 32 , wherein determining a plurality of paragraph metrics regarding each of the identified paragraphs further comprises, for each identified paragraph, determining the line height of the paragraph, the line height being the distance between the baselines of two consecutive lines of textual content in the paragraph.

34

34. The method of claim 33 , wherein determining a plurality of paragraph metrics regarding each of the identified paragraphs further comprises, for each identified paragraph, determining a level of nesting of the textual content with regard to a hierarchical document structure.

35

35. The method of claim 33 , wherein determining a plurality of paragraph metrics regarding each of the identified paragraphs further comprises, for each identified paragraph, determining the width of the paragraph's bounding region.

36

36. The method of claim 35 , wherein determining a plurality of paragraph metrics regarding each of the identified paragraphs further comprises, for each identified paragraph, determining whether the paragraph includes only a single line of textual content, and setting the paragraph width and distance right metrics to zero if there is only a single line of textual content in the paragraph.

37

37. The method of claim 24 further comprising: repeatedly: updating the metrics of each paragraph within each of the at least one clusters of paragraphs to reflect a standard paragraph of the cluster; and performing a clustering analysis of the updated paragraph metrics, the result yielding another clustering of the identified paragraphs; until the number of clusters generated by the clustering analysis is not reduced.
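
Claims 25 and 37 together describe a quality threshold (QT) clustering pass that is repeated on standardized metrics until the cluster count stops shrinking. The following Python sketch illustrates that loop under stated assumptions: paragraphs are represented as plain numeric tuples, Euclidean distance is the similarity measure, and the "standard paragraph" of a cluster is taken to be the per-metric mean. None of these choices, nor the function names, come from the patent.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _farthest(points: Sequence[Point], i: int, members: List[int]) -> float:
    """Largest distance from point i to any current cluster member."""
    return max(math.dist(points[i], points[j]) for j in members)


def qt_cluster(points: Sequence[Point], max_diameter: float) -> List[List[int]]:
    """Quality threshold clustering; returns clusters as lists of point indices."""
    remaining = list(range(len(points)))
    clusters: List[List[int]] = []
    while remaining:
        best: List[int] = []
        for seed in remaining:
            # Grow a candidate cluster around each seed, always adding the
            # point that keeps the cluster diameter smallest.
            candidate = [seed]
            others = [i for i in remaining if i != seed]
            while others:
                nxt = min(others, key=lambda i: _farthest(points, i, candidate))
                if _farthest(points, nxt, candidate) > max_diameter:
                    break
                candidate.append(nxt)
                others.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        # Keep the largest candidate and repeat on what is left.
        clusters.append(best)
        remaining = [i for i in remaining if i not in best]
    return clusters


def standardize(points: Sequence[Point], clusters: List[List[int]]) -> List[Point]:
    """Replace each paragraph's metrics with its cluster mean (assumed rule)."""
    updated = list(points)
    for members in clusters:
        dims = len(points[members[0]])
        mean = tuple(sum(points[i][d] for i in members) / len(members)
                     for d in range(dims))
        for i in members:
            updated[i] = mean
    return updated


def cluster_until_stable(points: Sequence[Point], max_diameter: float) -> List[List[int]]:
    """Re-cluster on standardized metrics until the cluster count is not reduced."""
    clusters = qt_cluster(points, max_diameter)
    while True:
        points = standardize(points, clusters)
        reclustered = qt_cluster(points, max_diameter)
        if len(reclustered) >= len(clusters):
            return reclustered
        clusters = reclustered
```

For example, with one-dimensional metrics `[(0.0,), (2.0,), (3.5,)]` and a threshold of `3.0`, the first pass yields two clusters; after the first cluster is standardized to its mean, the second pass merges everything into one.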

38

38. A computer-readable medium bearing computer-executable instructions which, when executed by a computer, configure the computer to: obtain at least one page image having a plurality of paragraphs of textual content therein; identify a plurality of paragraphs of textual content from the at least one page image; for each identified paragraph, determine a plurality of paragraph metrics; perform a clustering analysis of the identified paragraphs based on the paragraph metrics, thereby yielding at least one cluster of similarly formed paragraphs of the at least one page image; and repeatedly: standardize the paragraph metrics of each paragraph of each cluster to be consistent with the paragraphs within its cluster; and perform a subsequent clustering analysis of the identified paragraphs based on the standardized paragraph metrics, the subsequent clustering analysis yielding a clustering of paragraphs; until the number of clusters yielded by the subsequent clustering analysis is no longer reduced.

39

39. The computer-readable medium of claim 38 , wherein the clustering analysis of identified paragraphs comprises one of a statistical analysis and a deterministic analysis.

40

40. The computer-readable medium of claim 38 , wherein the clustering analysis comprises: performing a principle component analysis (PCA) of the paragraph metrics to generate at least one combination of paragraph metrics for clustering the paragraphs; and performing a quality threshold (QT) clustering analysis based on the results of the PCA to yield at least one cluster of similarly formed paragraphs of the at least one page image.

41

41. The computer-readable medium of claim 38 , wherein the computer is further configured to: associate a paragraph category with each cluster of paragraphs; and generate a paragraph style for each paragraph category, wherein each paragraph style corresponds to at least some paragraph metrics of a typical paragraph of the corresponding categorized cluster.

42

42. The computer-readable medium of claim 41 , wherein the computer is further configured to determine whether the number of clusters exceeds an expected threshold, and if so, obtain human input regarding associating a paragraph category with each cluster.
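
Claim 40 recites a PCA step that produces combinations of paragraph metrics before clustering. As a rough illustration, the Python sketch below derives the leading components with power iteration and deflation, which naturally orders them from most to least variance. The function names, the power-iteration approach, the fixed starting vector, and the sample-covariance normalization are all assumptions made for this sketch; the patent does not specify how the PCA is carried out.

```python
import math
from typing import List, Sequence, Tuple


def _dominant_eigenpair(m: List[List[float]], iterations: int = 200,
                        eps: float = 1e-12) -> Tuple[List[float], float]:
    """Power iteration for the largest eigenvalue of a symmetric matrix."""
    d = len(m)
    # Fixed, non-uniform start vector (an assumption; a start vector that is
    # orthogonal to the dominant direction would not converge to it).
    v = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(m[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < eps:          # no variance left in this matrix
            return v, 0.0
        v = [x / norm for x in w]
    if max(v, key=abs) < 0:     # fix the sign so results are reproducible
        v = [-x for x in v]
    lam = sum(v[a] * sum(m[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, lam


def pca(rows: Sequence[Sequence[float]], k: int):
    """Return (components, variances, scores) for the top k components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components: List[List[float]] = []
    variances: List[float] = []
    for _ in range(k):
        v, lam = _dominant_eigenpair(cov)
        components.append(v)
        variances.append(lam)
        # Deflate: remove the variance already explained by this component.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    # Project every paragraph's centered metrics onto the components.
    scores = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
              for c in centered]
    return components, variances, scores
```

Each row of `scores` is a paragraph expressed in the combined metrics, so a clustering step could operate on the first few columns only.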

43

43. A computer-implemented method for identifying similarly formed paragraphs according to an analysis of paragraph metrics, the computer-implemented method comprising: as implemented by one or more computing devices configured with specific executable instructions, obtaining at least one page image, each page image comprising reflowable textual content; identifying paragraphs of reflowable textual content in the obtained at least one page image; for each identified paragraph, determining a plurality of metrics regarding the identified paragraph; and performing a clustering analysis of the identified paragraphs based on at least one of the plurality of metrics of each paragraph, thereby resulting in at least one cluster of similarly formed paragraphs found on the at least one page image.

44

44. The computer-implemented method of claim 43 , wherein determining metrics for each identified paragraph comprises determining a bounding region for each of the identified paragraphs.

45

45. The computer-implemented method of claim 43 , wherein determining metrics for each identified paragraph comprises, for each identified paragraph, determining the distance of indentation of the first line of text in the identified paragraph.

46

46. The computer-implemented method of claim 43 , wherein determining metrics for each identified paragraph comprises, for each identified paragraph, determining a line height of the paragraph, the line height being the distance between baselines of two consecutive lines of textual content.

47

47. The computer-implemented method of claim 43 , wherein determining metrics for each identified paragraph comprises, for each identified paragraph, determining a level of hierarchical nesting of the textual content.

48

48. The computer-implemented method of claim 44 , wherein determining metrics for each identified paragraph comprises, for each identified paragraph, determining a width of the paragraph's bounding region.

49

49. The computer-implemented method of claim 43 , further comprising: associating a paragraph category with each cluster resulting from the clustering analysis; and generating a paragraph style for each paragraph category, wherein each paragraph style corresponds to at least one paragraph metric of a typical paragraph of the categorized cluster.

50

50. A computer-readable storage medium bearing computer-executable instructions which, when executed by a computer, configure the computer to: obtain at least one page image, each page image comprising reflowable textual content; identify paragraphs of reflowable textual content in the obtained at least one page image; for each identified paragraph, determine a plurality of metrics regarding the identified paragraph; and perform a clustering analysis of the identified paragraphs based on at least one of the plurality of metrics of each paragraph, thereby resulting in at least one cluster of similarly formed paragraphs found on the at least one page image.

51

51. The computer-readable storage medium of claim 50 , wherein performing the clustering analysis comprises performing a quality threshold (QT) clustering analysis to generate at least one cluster of similarly formed paragraphs found on the at least one page image.

52

52. The computer-readable storage medium of claim 50 , wherein the computer is further configured to: associate a paragraph category with each cluster resulting from the clustering analysis; and generate a paragraph style for each paragraph category, wherein each paragraph style corresponds to at least one paragraph metric of a typical paragraph of the categorized cluster.
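
Several claims (20, 21, 41, 42, 49 and 52) describe associating a category with each cluster, deriving a paragraph style from a typical paragraph, and asking for human input when there are more clusters than expected. The Python sketch below illustrates that flow. Treating the per-metric median as the "typical paragraph", the `category-N` naming scheme, and the `max_clusters` parameter are assumptions chosen for illustration; the claims leave these details open.

```python
from statistics import median
from typing import Dict, List, Tuple

Metrics = Dict[str, float]


def typical_metrics(cluster: List[Metrics]) -> Metrics:
    """Per-metric median across a cluster (an assumed notion of 'typical')."""
    return {key: median(p[key] for p in cluster) for key in cluster[0]}


def build_styles(clusters: List[List[Metrics]],
                 max_clusters: int) -> Tuple[Dict[str, Metrics], bool]:
    """Return one style per cluster plus a flag requesting human review."""
    # More clusters than expected suggests the grouping needs a person
    # to confirm or correct the category assignments.
    needs_review = len(clusters) > max_clusters
    styles: Dict[str, Metrics] = {}
    for number, cluster in enumerate(clusters, start=1):
        category = f"category-{number}"   # hypothetical placeholder label
        styles[category] = typical_metrics(cluster)
    return styles, needs_review


if __name__ == "__main__":
    body = [
        {"indent": 40, "line_height": 30},
        {"indent": 42, "line_height": 30},
        {"indent": 38, "line_height": 31},
    ]
    heading = [{"indent": 0, "line_height": 48}]
    styles, needs_review = build_styles([body, heading], max_clusters=5)
    print(styles, needs_review)
```

The median is used rather than the mean so that one badly measured paragraph does not skew the generated style.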

Patent Metadata

Filing Date

September 28, 2006

Publication Date

May 11, 2010

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “Identifying similarly formed paragraphs in scanned images” (US-7715635). https://patentable.app/patents/US-7715635
