US-11281729

Method for automatically generating a wrapper for extracting web data, and a computer system

Published: March 22, 2022

Assignee: not available in USPTO data

Inventors: not available in USPTO data

Technical Abstract

Methods for automatically generating a wrapper for extracting web data and corresponding computer systems are disclosed. In one arrangement, a first wrapper is used to generate a second wrapper. The first wrapper extracts target data from one or more target web pages hosted by one or more target web servers. The second wrapper is capable of extracting the same target data from the same one or more target web pages without using a web browser engine to perform a) sending requests to the one or more target web servers, and/or b) processing replies from the one or more target web servers. The generation of the second wrapper comprises analysing one or both of the following: (i) code defining the first wrapper, (ii) interactions between the first wrapper and the one or more target web servers that occur during execution of the first wrapper.

Patent Claims

31 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for automatically transforming a wrapper for extracting web data, comprising: transforming a first wrapper into a second wrapper, wherein: the first wrapper is configured to extract target data from one or more target web pages hosted by one or more target web servers; the second wrapper is configured to extract the same target data from the same one or more target web pages without using a web browser engine to perform a) sending requests to the one or more target web servers, and/or b) processing replies from the one or more target web servers; and the transforming comprises analyzing one or both of the following: (i) code defining the first wrapper, (ii) interactions between the first wrapper and the one or more target web servers that occur during execution of the first wrapper.

2. The method of claim 1, wherein the first wrapper is configured to extract the target data using a web browser engine.

3. The method of claim 1, wherein the first wrapper is configured to simulate, where necessary, user input to the one or more target web pages.

4. The method of claim 1, wherein the analysis of the interactions comprises identifying data transformations that the one or more target web servers apply to parameters sent to them by the first wrapper.

5. The method of claim 1, wherein the analysis of interactions comprises construction and use of a dependency graph comprising nodes and arcs, each node representing an interaction or set of interactions and each arc representing propagation of parameters from one interaction or set of interactions to another interaction or set of interactions.

6. The method of claim 5, wherein the dependency graph is constructed by first including the interactions conveying the target data and then iteratively including other interactions which convey parameters necessary for already included interactions.

7. The method of claim 5, wherein the dependency graph consists exclusively of nodes corresponding to interactions conveying the target data and nodes and arcs necessary for providing parameters for the interactions conveying the target data.
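The backward construction described in claims 5 to 7 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `build_dependency_graph`, the `needs`/`provides` representation of an interaction, and the sample trace are all assumptions made for illustration. The sketch starts from the interactions that convey the target data and repeatedly adds any interaction that supplies a parameter needed by one already included, so unrelated interactions never enter the graph.

```python
from collections import deque


def build_dependency_graph(interactions, target_ids):
    """Build a graph backwards from the interactions conveying the target data.

    interactions: dict mapping an interaction id to
        {"needs": set of parameter names, "provides": set of parameter names}
        (a hypothetical, simplified view of a recorded interaction).
    target_ids: ids of the interactions whose replies carry the target data.
    Returns (nodes, arcs); each arc is (producer, consumer, parameter).
    """
    nodes = set(target_ids)
    arcs = set()
    queue = deque(target_ids)
    while queue:
        consumer = queue.popleft()
        for param in sorted(interactions[consumer]["needs"]):
            for producer, info in interactions.items():
                if producer != consumer and param in info["provides"]:
                    arcs.add((producer, consumer, param))
                    if producer not in nodes:
                        # Newly included interaction: its own needs are
                        # resolved in a later iteration.
                        nodes.add(producer)
                        queue.append(producer)
    return nodes, arcs


# Hypothetical trace: "ads" is irrelevant to the target data in "details".
trace = {
    "login": {"needs": set(), "provides": {"session"}},
    "search": {"needs": {"session"}, "provides": {"result_id"}},
    "ads": {"needs": set(), "provides": {"banner"}},
    "details": {"needs": {"session", "result_id"}, "provides": {"price"}},
}
nodes, arcs = build_dependency_graph(trace, ["details"])
print(sorted(nodes))
print(sorted(arcs))
```

The sketch ignores the ordering of interactions within a trace; a fuller version would presumably only let an earlier interaction provide a parameter to a later one.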

8. The method of claim 5, further comprising constructing one or more data-selectors configured to extract selected data from replies sent by the one or more target servers, wherein the selected data comprises one or more of the following: a portion of the target data, all of the target data, and one or more parameters required by one or more of the interactions.

9. The method of claim 8, wherein the data-selectors are synthetized via generalization from examples.

10. The method of claim 8, wherein each data-selector comprises one or more of the following: (i) regular expressions, (ii) generalized regular expressions, (iii) XPath expressions, (iv) XPath-like expressions that apply to HTML, (v) expressions in some appropriate tree-grammar, tree automata, finite-state automata, procedures, and programs.

11. The method of claim 8, wherein the data-selectors are used to propagate data according to the arcs of the dependency graph.
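As one possible reading of claims 8 to 11, a data-selector of the regular-expression kind could be generalized from examples by keeping the text that surrounds the selected value in every example reply. The Python sketch below is an illustration only, not the patented synthesis procedure; `synthesize_selector` and the sample replies are hypothetical, and the approach assumes that the value appears once per reply with stable text on both sides.

```python
import os
import re


def synthesize_selector(examples):
    """Generalize a regex selector from (reply_text, selected_value) pairs.

    The selector keeps the context shared by all examples immediately before
    and after the selected value and captures whatever lies in between.
    """
    lefts, rights = [], []
    for reply, value in examples:
        start = reply.index(value)  # first occurrence of the example value
        lefts.append(reply[:start])
        rights.append(reply[start + len(value):])
    # Longest common suffix of the left contexts (via reversed strings) and
    # longest common prefix of the right contexts.
    left = os.path.commonprefix([s[::-1] for s in lefts])[::-1]
    right = os.path.commonprefix(rights)
    return re.compile(re.escape(left) + "(.+?)" + re.escape(right))


# Hypothetical replies from two executions, each with a different token.
examples = [
    ('<form><input name="token" value="abc123"><input name="q"></form>', "abc123"),
    ('<form><input name="token" value="zz9"><input name="q"></form>', "zz9"),
]
selector = synthesize_selector(examples)

new_reply = '<form><input name="token" value="k-42"><input name="q"></form>'
token = selector.search(new_reply).group(1)
print(token)
```

A value extracted this way could then be passed along an arc of the dependency graph as a parameter of a later request.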

12. The method of claim 1, wherein the analysis of the interactions uses recorded traces representing sequences of messages exchanged between the first wrapper and the one or more target web servers.

13. The method of claim 12, wherein multiple traces are obtained that correspond to multiple executions of the first wrapper, each execution being performed with different input data to the first wrapper.

14. The method of claim 13, further comprising grouping interactions from the multiple traces that satisfy an equivalence relation and generating from the group a generalized HTTP request for the group.

15. The method of claim 14, wherein the equivalence relation requires HTTP requests in the interactions to comprise one or more of the following: the same parametrized URLs, names of parameters and structures of response bodies.
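The grouping step of claims 14 and 15 might be sketched in Python as follows. This is a simplified illustration under stated assumptions: traces are reduced to lists of GET URLs, the equivalence relation compares only scheme, host, path, and query parameter names (response-body structure is left out), and the names `signature` and `generalize` as well as the `shop.example` URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def signature(url):
    """Equivalence key: scheme, host, path, and the sorted parameter names."""
    parts = urlsplit(url)
    names = tuple(sorted(n for n, _ in parse_qsl(parts.query, keep_blank_values=True)))
    return (parts.scheme, parts.netloc, parts.path, names)


def generalize(traces):
    """Group equivalent requests across traces and emit one template per group.

    A parameter whose value is identical in every member of the group is kept
    as a constant; any other parameter becomes a {placeholder}.
    """
    groups = defaultdict(list)
    for trace in traces:
        for url in trace:
            query = dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))
            groups[signature(url)].append(query)
    templates = []
    for (scheme, netloc, path, names), members in groups.items():
        query = []
        for name in names:
            values = {member[name] for member in members}
            if len(values) == 1:
                query.append(f"{name}={values.pop()}")
            else:
                query.append(f"{name}={{{name}}}")
        suffix = "?" + "&".join(query) if query else ""
        templates.append(f"{scheme}://{netloc}{path}{suffix}")
    return sorted(templates)


# Two hypothetical traces recorded with different input data.
trace_a = ["https://shop.example/search?q=lamp&lang=en", "https://shop.example/item?id=17"]
trace_b = ["https://shop.example/search?q=desk&lang=en", "https://shop.example/item?id=93"]
templates = generalize([trace_a, trace_b])
print(templates)
```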

16. The method of claim 13, wherein similar interactions from the multiple traces are grouped together based on analysis of parameters of the interactions.

17. The method of claim 16, wherein parameters for each group of similar interactions are identified by analysis of differences in values of the parameters.

18. The method of claim 17, wherein the analysis of differences in values of the parameters can identify composite parameters of each group of similar interactions, where composite parameters are parameters consisting of other parameters.
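One way to picture the difference analysis of claims 17 and 18 is to compare the values a request field takes across executions against the inputs given to each execution. The Python sketch below is an assumption-laden illustration, not the claimed analysis: `classify_field`, the result labels, and the flight-search style inputs are hypothetical, and plain substring replacement stands in for whatever matching a real system would use.

```python
def classify_field(values, inputs):
    """Classify one request field from its values across several executions.

    values: the field's value in each execution.
    inputs: for each execution, a dict of input name -> input value.
    Returns ("constant", value), ("simple", template),
    ("composite", template), or ("unknown", None).
    """
    if len(set(values)) == 1:
        return ("constant", values[0])
    templates = set()
    for value, run_inputs in zip(values, inputs):
        template = value
        # Replace longer input values first so that a short value does not
        # split a longer one.
        for name, text in sorted(run_inputs.items(), key=lambda kv: -len(kv[1])):
            template = template.replace(text, "{" + name + "}")
        templates.add(template)
    if len(templates) != 1:
        return ("unknown", None)
    template = templates.pop()
    is_single = (
        template.count("{") == 1
        and template.startswith("{")
        and template.endswith("}")
    )
    return ("simple" if is_single else "composite", template)


# Hypothetical inputs for two executions of the first wrapper.
inputs = [
    {"origin": "VIE", "dest": "LHR"},
    {"origin": "OXF", "dest": "JFK"},
]
print(classify_field(["VIE-LHR", "OXF-JFK"], inputs))
print(classify_field(["VIE", "OXF"], inputs))
print(classify_field(["en", "en"], inputs))
```

Here the field holding `VIE-LHR` and `OXF-JFK` is reported as a composite of the `origin` and `dest` inputs, which matches the notion of a parameter consisting of other parameters.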

19. The method of claim 12, wherein the recorded traces comprise sequences of HTTP request-reply pairs between the first wrapper and the one or more target web servers.

20. The method of claim 1, wherein the analysis of the interactions includes identification of parameters of requests which can be removed without changing replies to the requests by more than a predetermined extent.
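The test described in claim 20 could look roughly like the following Python sketch, which drops one parameter at a time and compares the reply with a baseline. Everything specific here is an assumption: the claim does not say how the change is measured, so `difflib.SequenceMatcher` and the 5% threshold are stand-ins, and `removable_parameters`, `fake_server`, and the parameter names are hypothetical. The stub server keeps the example runnable without network access.

```python
from difflib import SequenceMatcher


def removable_parameters(send, params, max_change=0.05):
    """Return the parameters whose removal changes the reply by at most max_change.

    send: callable taking a dict of parameters and returning the reply body.
    The change is measured as 1 - similarity ratio against the baseline reply.
    """
    baseline = send(params)
    removable = []
    for name in params:
        reduced = {k: v for k, v in params.items() if k != name}
        reply = send(reduced)
        change = 1.0 - SequenceMatcher(None, baseline, reply).ratio()
        if change <= max_change:
            removable.append(name)
    return removable


def fake_server(params):
    """Stub server: only the "q" parameter influences the reply."""
    if "q" not in params:
        return "<p>bad request</p>"
    return f"<ul><li>result for {params['q']}</li></ul>"


request_params = {"q": "lamp", "utm_source": "mail", "ts": "1699"}
removable = removable_parameters(fake_server, request_params)
print(removable)
```

Each parameter is tested in isolation, so the sketch does not establish that all reported parameters can be removed together.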

21. The method of claim 1, wherein the analysis of the interactions comprises identifying requests originating from the first wrapper that are not necessary for extracting the target data and omitting those requests during transformation of the first wrapper into the second wrapper.

22. The method of claim 1, wherein the analysis of the interactions comprises identifying data transformations that the first wrapper applies to replies and to parameters of the interactions.

23. The method of claim 22, wherein the data transformation comprises transforming each reply to a parameter of another interaction or to the target data.

24. The method of claim 22, wherein the analysis of the interactions comprises analysis of JavaScript programs that the first wrapper executes.

25. The method of claim 1, comprising: executing the first wrapper at a client to extract the target data from the one or more target web pages, the first wrapper extracting the target data by simulating, where necessary, user input to the one or more target web pages, the simulated user input specifying the target data to be extracted; analyzing interactions between the first wrapper and the one or more target web servers that occurred during execution of the first wrapper; and using the analysis of the interactions to transform the first wrapper into the second wrapper.

26. The method of claim 25, wherein the simulation of user input by the first wrapper comprises rendering a web page using a web browser engine.

27. The method of claim 26, wherein the user input comprises one or more interactions that a user can make with a displayed web page.

28. The method of claim 1, wherein the transforming comprises direct compilation of the first wrapper into the second wrapper.

29. A non-transitory computer readable storage medium having recorded thereon program code that, when executed on a computer system, instructs the computer system to carry out the method of claim 1.

30. A computer system configured to automatically transform a wrapper for extracting web data, the computer system being configured to perform the following steps: transforming a first wrapper into a second wrapper, wherein: the first wrapper is configured to extract target data from one or more target web pages hosted by one or more target web servers; the second wrapper is configured to extract the same target data from the same one or more target web pages without using a web browser engine to perform a) sending requests to the one or more target web servers, and/or b) processing replies from the one or more target web servers; and the transforming comprises analyzing one or both of the following: (i) code defining the first wrapper, (ii) interactions between the first wrapper and the one or more target web servers that occur during execution of the first wrapper.

31. A method for automatically generating a wrapper for extracting web data, comprising: executing a first wrapper to extract target data from one or more target web pages hosted by one or more target web servers; analyzing one or both of the following: (i) code defining the first wrapper, and (ii) interactions between the first wrapper and the one or more target web servers that occur during the execution of the first wrapper; and using the analysis to generate an output wrapper that, when executed, extracts the same target data from the same one or more target web pages without using a web browser engine to perform a) sending requests to the one or more target web servers, and/or b) processing replies from the one or more target web servers.
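To make the idea of a browser-free output wrapper more concrete, the following Python sketch shows what such a generated wrapper might look like for an imagined two-step site. It is purely illustrative and not derived from the patent: the `shop.example` host, the `/search-form` and `/search` paths, the `token` field, the price markup, and the function names are all hypothetical. The wrapper issues plain HTTP requests and applies regular-expression selectors to the replies, with no rendering or script execution.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen


def http_fetch(url):
    """Plain HTTP GET using the standard library; no browser engine involved."""
    with urlopen(url) as reply:
        return reply.read().decode("utf-8", errors="replace")


def second_wrapper(query, fetch=http_fetch, base="https://shop.example"):
    """Hypothetical generated wrapper: two requests linked by one parameter."""
    # Interaction 1: the reply carries a token required by the next request.
    form = fetch(base + "/search-form")
    token = re.search(r'name="token" value="([^"]+)"', form).group(1)
    # Interaction 2: the reply carries the target data (prices).
    results = fetch(base + "/search?" + urlencode({"q": query, "token": token}))
    return re.findall(r'<span class="price">([^<]+)</span>', results)


def fake_fetch(url):
    """Stub standing in for the target web server, so the sketch runs offline."""
    if url.endswith("/search-form"):
        return '<form><input name="token" value="t0k"></form>'
    return '<span class="price">12.50</span><span class="price">8.00</span>'


prices = second_wrapper("lamp", fetch=fake_fetch)
print(prices)
```

Passing the fetch function as an argument is a design choice made for this example so that the wrapper can be exercised against recorded replies instead of a live server.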

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F

Patent Metadata

Filing Date

July 12, 2018

Publication Date

March 22, 2022
