9996614

Method and System for Determining Relevant Text in a Web Page

Published: June 12, 2018
Assignee: not available in the USPTO data we have
Inventors: Paul Broman
Patent Claims
18 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method comprising: receiving, by a computing device, a web page; analyzing, via the computing device, said web page, and based on said analysis, identifying text elements in said web page, said identification comprising determining a position of each text element on said web page, said identification further comprising determining a physical dimension of each text element on said web page, each text element comprising a set of characters or symbols; determining, via the computing device, a weight value for each text element, each weight value determination based on a determined position and a determined physical dimension of a respective text element; assigning, via the computing device, for each located text element, said determined weight value to a respective text element; comparing, via the computing device, for each text element, the assigned weight value to a threshold weight; determining, via the computing device, whether said assigned weight satisfies said threshold weight, when said assigned weight satisfies said threshold weight, storing, via the computing device, information related to said set of characters or symbols of each text element in storage, wherein when said assigned weight does not satisfy the threshold weight: analyzing, via the computing device, each text element, and based on said analysis, determining a layout of each text element, comparing, via the computing device, each text element's layout to each other, and determining, based on said comparison, a similarity score for each text element, and storing, when said text element has an assigned weight below said threshold weight, said information of the text element in said storage when said text element satisfies said similarity score.
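The flow recited in claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation: the claim does not disclose a weighting formula or a similarity measure, so `weight`, `layout_similarity`, and the parameters `threshold` and `min_similarity` are hypothetical stand-ins, and comparing low-weight elements only against high-weight ones is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    text: str
    left: float    # x position of the rendered element
    top: float     # y position of the rendered element
    width: float   # rendered width
    height: float  # rendered height


def weight(el: TextElement, page_height: float) -> float:
    """Hypothetical weight: rendered area, scaled up for elements near the top."""
    area = el.width * el.height
    position_factor = 1.0 - min(el.top / page_height, 1.0)
    return area * (0.5 + 0.5 * position_factor)


def layout_similarity(a: TextElement, b: TextElement) -> float:
    """Hypothetical score: 1.0 when left edge and width match, lower otherwise."""
    return 1.0 / (1.0 + abs(a.left - b.left) + abs(a.width - b.width))


def select_relevant(elements, page_height, threshold, min_similarity):
    """Keep elements whose weight meets the threshold, plus low-weight
    elements whose layout resembles a high-weight element (an assumption)."""
    weights = [weight(e, page_height) for e in elements]
    heavy = [e for e, w in zip(elements, weights) if w >= threshold]
    stored = []
    for e, w in zip(elements, weights):
        if w >= threshold:
            stored.append(e.text)
        else:
            best = max((layout_similarity(e, h) for h in heavy), default=0.0)
            if best >= min_similarity:
                stored.append(e.text)
    return stored
```

Under these assumptions, a short paragraph that shares the left edge and width of a large body paragraph would be stored even though its own weight is low, while a small navigation label elsewhere on the page would not.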

2. The method of claim 1 further comprising, for each text element, determining, by the computing device, the size of the each text element when rendered.

3. The method of claim 1 wherein the locating of text elements in the web page further comprises using W3C Document Object Model (DOM) standard to locate text nodes and parent elements.

4. The method of claim 3 further comprising storing, by the computing device, the text nodes in a text node array and storing the parent elements in a parent element array.
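Claims 3 and 4 describe walking the DOM to collect text nodes and their parent elements into two arrays. The sketch below shows one way this could look, using Python's standard-library `xml.dom.minidom` as a stand-in for a browser DOM. The function name `collect_text_nodes` and the choice to skip whitespace-only text nodes are assumptions, not details from the claims.

```python
from xml.dom import Node, minidom


def collect_text_nodes(root):
    """Return (text_nodes, parent_elements) found under root in document order."""
    text_nodes = []
    parent_elements = []

    def walk(node):
        for child in node.childNodes:
            if child.nodeType == Node.TEXT_NODE:
                # Assumption: whitespace-only text nodes are ignored.
                if child.data.strip():
                    text_nodes.append(child)
                    parent = child.parentNode
                    # Record each parent element once, by identity.
                    if not any(p is parent for p in parent_elements):
                        parent_elements.append(parent)
            elif child.nodeType == Node.ELEMENT_NODE:
                walk(child)

    walk(root)
    return text_nodes, parent_elements


doc = minidom.parseString(
    "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
texts, parents = collect_text_nodes(doc)
print([t.data for t in texts])
print([p.tagName for p in parents])
```

A real page would need an HTML parser and a rendering engine to obtain positions and sizes; `minidom` only demonstrates the node traversal.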

7. The method of claim 4 further comprising marking, by the computing device, the each text element as potentially relevant if the weight of the each text element is above the threshold weight.

8. The method of claim 7 further comprising sorting, by the computing device, the parent element array in descending order by weight before comparing the weight of the each text element to the threshold weight.

9. The method of claim 7 further comprising sorting, by the computing device, the parent element array in ascending order by a first node index value to find adjacent elements next to the elements marked as relevant by weight.
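The two sort orders in claims 8 and 9 serve different passes: descending weight for the threshold comparison, then ascending first node index to restore document order so neighbors can be inspected. A minimal sketch follows, where `ParentEntry`, `mark_by_weight`, and `in_document_order` are hypothetical names and the early exit from the loop is an optimization the claims do not mention.

```python
from dataclasses import dataclass


@dataclass
class ParentEntry:
    tag: str
    weight: float
    first_node_index: int  # index of the element's first text node
    relevant: bool = False


def mark_by_weight(entries, threshold):
    """Sort by weight, highest first, and mark entries above the threshold."""
    by_weight = sorted(entries, key=lambda e: e.weight, reverse=True)
    for entry in by_weight:
        if entry.weight <= threshold:
            break  # everything after this point weighs no more
        entry.relevant = True
    return by_weight


def in_document_order(entries):
    """Sort by first node index so adjacent elements sit next to each other."""
    return sorted(entries, key=lambda e: e.first_node_index)
```

Because `sorted` returns new lists that reference the same `ParentEntry` objects, the `relevant` marks set in the first pass remain visible in the second.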

10. The method of claim 9 wherein finding adjacent elements further comprises determining one or more of whether the each text element has less than a predetermined number of characters of text, whether a left edge of a previous text element and the each text element match, and whether space between the top of the each text element and the bottom of the previous text element is less than a maximum allowed gap.
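The three checks listed in claim 10 can be sketched as independent predicates over a previous and a current element. The names `Box` and `adjacency_checks`, the `edge_tolerance` parameter, and the dictionary return shape are illustrative assumptions; the claim only requires that one or more of the checks be determined.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    left: float
    top: float
    bottom: float


def adjacency_checks(prev, cur, max_chars, max_gap, edge_tolerance=0.0):
    """Evaluate the three adjacency tests for cur against the element before it."""
    return {
        # Fewer characters than the predetermined limit.
        "short_text": len(cur.text) < max_chars,
        # Left edges line up (exactly, unless a tolerance is given).
        "left_edges_match": abs(prev.left - cur.left) <= edge_tolerance,
        # Vertical space between prev's bottom and cur's top is small enough.
        "gap_ok": (cur.top - prev.bottom) < max_gap,
    }
```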

11. The method of claim 4 further comprising storing, by the computing device, the text from the each text element if the each text element is marked as relevant or (a left edge of a previous text element and the each text element match and the space between the top of the each text element and the bottom of the previous text element is less than a maximum allowed gap and a ratio computed for the current text element and the previous text element is similar), or (a left edge of a next text element and the each text element match and the space between the bottom of the each text element and the top of the next text element is less than a maximum allowed gap and a ratio computed for the current text element and the next text element is similar).
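The storing condition in claim 11 is a three-way OR: the element is marked relevant, or it continues the previous element, or it is continued by the next element. The sketch below makes that grouping explicit. The claim does not say what the "ratio" measures or how similarity is judged, so `ratio` is treated here as a precomputed number and `ratio_tol` as a hypothetical absolute tolerance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    text: str
    left: float
    top: float
    bottom: float
    ratio: float            # assumption: some precomputed per-element ratio
    relevant: bool = False  # set earlier by the weight threshold pass


def continues(upper: Block, lower: Block, max_gap: float, ratio_tol: float) -> bool:
    """True when lower sits directly under upper with a matching layout."""
    return (
        upper.left == lower.left
        and (lower.top - upper.bottom) < max_gap
        and abs(upper.ratio - lower.ratio) <= ratio_tol
    )


def should_store(prev: Optional[Block], cur: Block, nxt: Optional[Block],
                 max_gap: float, ratio_tol: float) -> bool:
    """Store cur if it is relevant or is layout-adjacent to a neighbor."""
    if cur.relevant:
        return True
    if prev is not None and continues(prev, cur, max_gap, ratio_tol):
        return True
    if nxt is not None and continues(cur, nxt, max_gap, ratio_tol):
        return True
    return False
```

Passing `None` for `prev` or `nxt` covers the first and last elements in document order.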

12. A non-transitory computer readable storage medium tangibly storing computer program instructions, that when executed by a computing device, cause the computing device to perform a method comprising: receiving, by the computing device, a web page; analyzing, via the computing device, said web page, and based on said analysis, identifying text elements in said web page, said identification comprising determining a position of each text element on said web page, said identification further comprising determining a physical dimension of each text element on said web page, each text element comprising a set of characters or symbols; determining, via the computing device, a weight value for each text element, each weight value determination based on a determined position and a determined physical dimension of a respective text element; assigning, via the computing device, for each located text element, said determined weight value to a respective text element; comparing, via the computing device, for each text element, the assigned weight value to a threshold weight; determining, via the computing device, whether said assigned weight satisfies said threshold weight, wherein when said assigned weight satisfies said threshold weight, storing, via the computing device, information related to said set of characters or symbols of each text element in storage, wherein when said assigned weight does not satisfy the threshold weight: analyzing, via the computing device, each text element, and based on said analysis, determining a layout of each text element, comparing, via the computing device, each text element's layout to each other, and determining, based on said comparison, a similarity score for each text element, and storing, when said text element has an assigned weight below said threshold weight, said information of the text element in said storage when said text element satisfies said similarity score.

13. The non-transitory computer readable storage medium of claim 12 further comprising computer program instructions defining the step of, for each text element, determining the size of the each text element when rendered.

14. The non-transitory computer readable storage medium of claim 12 wherein the computer program instructions defining the step of the assigning of the weight value further comprises computer program instructions defining the step of assigning the weight value based on a position of the each text element in the web page.

15. The non-transitory computer readable storage medium of claim 12 wherein the computer program instructions defining the step of locating of text elements in the web page further comprises computer program instructions defining the step of using W3C Document Object Model (DOM) standard to locate text nodes and parent elements.

16. The non-transitory computer readable storage medium of claim 15 further comprising computer program instructions defining the step of storing the text nodes in a text node array and storing the parent elements in a parent element array.

19. The non-transitory computer readable storage medium of claim 18 further comprising computer program instructions defining the step of marking the each text element as potentially relevant if the weight of the each text element is above the threshold weight.

20. The non-transitory computer readable storage medium of claim 19 further comprising computer program instructions defining the step of determining one or more of whether the each text element has less than a predetermined number of characters of text, whether a left edge of a previous text element and the each text element match, and whether space between the top of the each text element and the bottom of the previous text element is less than a maximum allowed gap.

21. The non-transitory computer readable storage medium of claim 16 further comprising computer program instructions defining the step of storing the text from the each text element if the each text element is marked as relevant or a left edge of a previous text element and the each text element match and the space between the top of the each text element and the bottom of the previous text element is less than a maximum allowed gap and a ratio computed for the current text element and the previous text element is similar, or a left edge of a next text element and the each text element match and the space between the bottom of the each text element and the top of the next text element is less than a maximum allowed gap and a ratio computed for the current text element and the next text element is similar.

22. A computing device comprising: a processor; and a non-transitory computer-readable storage medium tangibly storing thereon program logic executable by the processor, the program logic comprising: logic executed by the processor for receiving, by a computing device, a web page; logic executed by the processor for analyzing, via the computing device, said web page, and based on said analysis, identifying text elements in said web page, said identification comprising determining a position of each text element on said web page, said identification further comprising determining a physical dimension of each text element on said web page, each text element comprising a set of characters or symbols; logic executed by the processor for determining, via the computing device, a weight value for each text element, each weight value determination based on a determined position and a determined physical dimension of a respective text element; logic executed by the processor for assigning, via the computing device, for each located text element, said determined weight value to a respective text element; logic executed by the processor for comparing, via the computing device, for each text element, the assigned weight value to a threshold weight; logic executed by the processor for determining, via the computing device, whether said assigned weight satisfies said threshold weight, when said assigned weight satisfies said threshold weight, storing, via the computing device, information related to said set of characters or symbols of each text element in storage, wherein when said assigned weight does not satisfy the threshold weight: logic executed by the processor for analyzing, via the computing device, each text element, and based on said analysis, determining a layout of each text element, logic executed by the processor for comparing, via the computing device, each text element's layout to each other, and determining, based on said comparison, a similarity score for each text element, and logic executed by the processor for storing, when said text element has an assigned weight below said threshold weight, said information of the text element in said storage when said text element satisfies said similarity score.

Patent Metadata

Filing Date

Unknown

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “METHOD AND SYSTEM FOR DETERMINING RELEVANT TEXT IN A WEB PAGE” (9996614). https://patentable.app/patents/9996614

© 2026 Patentable. All rights reserved.

Patentable is a research and drafting-assistant tool, not a law firm, and does not provide legal advice. Documents we generate are drafts for review by a licensed patent attorney.