US-8751498

Finding and disambiguating references to entities on web pages

Published June 10, 2014

Assignee not available in USPTO data we have

Inventors not available in USPTO data we have

Technical Abstract

A system and method for disambiguating references to entities in a document. In one embodiment, an iterative process is used to disambiguate references to entities in documents. An initial model is used to identify documents referring to an entity based on features contained in those documents. The occurrence of various features in these documents is measured. From the number of occurrences of features in these documents, a second model is constructed. The second model is used to identify documents referring to the entity based on features contained in the documents. The process can be repeated, iteratively identifying documents referring to the entity and improving subsequent models based on those identifications. Additional features of the entity can be extracted from documents identified as referring to the entity.
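The iterative process described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that each document is reduced to a set of feature strings, that a model is a list of feature combinations each treated as sufficient, and that a feature is promoted to a standalone rule when most of the documents containing it were already matched. The function names (`identify`, `refine`, `disambiguate`), the `threshold` and `min_count` parameters, and the sample data are all hypothetical.

```python
from collections import Counter

def identify(corpus, rules):
    """Return ids of documents containing every feature of at least one rule."""
    return {doc_id for doc_id, feats in corpus.items()
            if any(rule <= feats for rule in rules)}

def refine(corpus, matched, features, threshold=0.65, min_count=2):
    """Build single-feature rules from occurrence counts (assumed heuristic).

    A feature becomes sufficient on its own when it occurs in at least
    `min_count` matched documents and at least `threshold` of all documents
    containing it are already matched.
    """
    in_matched, in_corpus = Counter(), Counter()
    for doc_id, feats in corpus.items():
        for f in feats & features:
            in_corpus[f] += 1
            if doc_id in matched:
                in_matched[f] += 1
    rules = []
    for f in sorted(features):
        if in_corpus[f] == 0 or in_matched[f] < min_count:
            continue
        if in_matched[f] / in_corpus[f] >= threshold:
            rules.append(frozenset({f}))
    return rules

def disambiguate(corpus, features, initial_rules, max_rounds=5):
    """Alternate between identifying documents and rebuilding the model."""
    rules = list(initial_rules)
    matched = identify(corpus, rules)
    for _ in range(max_rounds):
        new_rules = list(initial_rules) + refine(corpus, matched, features)
        new_matched = identify(corpus, new_rules)
        if new_matched == matched:  # no new documents found: stop iterating
            break
        rules, matched = new_rules, new_matched
    return matched, rules

# Hypothetical corpus: five pages mentioning a "J. Smith".
corpus = {
    "d1": {"name:J. Smith", "born:1970", "city:Springfield", "job:architect"},
    "d2": {"name:J. Smith", "born:1970", "job:architect"},
    "d3": {"name:J. Smith", "job:architect", "city:Springfield"},
    "d4": {"name:J. Smith", "job:plumber"},
    "d5": {"name:J. Smith", "city:Shelbyville"},
}
features = {"name:J. Smith", "born:1970", "job:architect", "city:Springfield"}
initial_rules = [frozenset({"name:J. Smith", "born:1970"})]

matched, rules = disambiguate(corpus, features, initial_rules)
```

In this toy run the initial model matches only d1 and d2. The occurrence counts then make the occupation feature sufficient by itself, so the second model also picks up d3, while the ambiguous name alone never becomes sufficient and d4 and d5 stay excluded.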

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for identifying documents referring to an entity, the entity being associated with a first set of features, the method comprising: at a computer having one or more processors and memory storing programs for execution by the one or more processors: identifying a first set of documents based on a first model and the first set of features, wherein the first model includes a first set of rules specifying at least one combination of features from the first set of features that are sufficient for identifying a document referring to the entity, and each document in the first set of documents includes a sufficient number of features in common with the first set of features to identify a document referring to the entity according to the first model; determining a second model based on features included in one or more documents in the first set of documents, wherein the second model includes a second set of rules specifying at least one combination of features from the first set of documents that are sufficient for identifying a document referring to the entity; identifying a second set of documents based on the second model, wherein each document in the second set of documents includes a sufficient number of features in common with the first set of features to identify a document referring to the entity according to the second model, and wherein the second set of documents includes at least one document not included in the first set of documents; and extracting one or more facts from the second set of documents and associating the extracted facts with the entity.

2. The method of claim 1, wherein the first set of features is stored as a set of facts in a fact repository in association with a second object that corresponds to the entity.

3. The method of claim 1, wherein the first model is different than the second model.

4. The method of claim 1, wherein determining the second model comprises determining a number of occurrences of the first set of features in the first set of documents.

5. The method of claim 1, further comprising: identifying a second set of features based on the second set of documents; determining if the second set of features are associated with the entity; and responsive to determining that the second set of features are associated with the entity, identifying a third set of documents based on a third model and the second set of features, each document of the third set of documents comprising a sufficient number of features in common with the second set of features to identify a document referring to the entity according to the third model.

6. The method of claim 5, wherein the second set of features includes at least one feature not included in the first set of features.

7. The method of claim 5, wherein the first set of features includes at least one feature not included in the second set of features.

8. The method of claim 5, further comprising: storing at least one feature of the second set of features as a fact in the fact repository.

9. The method of claim 1, further comprising: estimating importance of the entity based on the second set of documents.

10. The method of claim 1, further comprising: estimating importance of the entity based on a number of documents in the second set of documents.

11. The method of claim 1, further comprising: estimating importance of the entity based on an estimated importance of at least one of the documents in the second set of documents.

12. The method of claim 1, further comprising: associating at least one of the documents of the second set of documents with the entity.

13. The method of claim 1, wherein identifying a second set of documents based on the second model and the first set of features comprises estimating a probability that a document of the second set of documents refers to the entity.

14. The method of claim 1, wherein the first set of features comprises at least a first feature and a second feature, and wherein the second model specifies that an occurrence of the first feature is sufficient to identify a document referring to the entity.

15. The method of claim 14, wherein the second model specifies that an occurrence of the second feature is not sufficient to identify a document referring to the entity.

16. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions for: identifying a first set of documents based on a first model and a first set of features, wherein the first model includes a first set of rules specifying at least one combination of features from the first set of features that are sufficient for identifying a document referring to an entity, and each document in the first set of documents includes a sufficient number of features in common with the first set of features to identify a document referring to an entity according to the first model; determining a second model based on features included in one or more documents in the first set of documents, wherein the second model includes a second set of rules specifying at least one combination of features from the first set of documents that are sufficient for identifying a document referring to the entity; identifying a second set of documents based on the second model, wherein each document in the second set of documents includes a sufficient number of features in common with the first set of features to identify a document referring to the entity according to the second model, and wherein the second set of documents includes at least one document not included in the first set of documents; and extracting one or more facts from the second set of documents and associating the extracted facts with the entity.

17. The non-transitory computer readable storage medium of claim 16, wherein the first set of features is stored as a set of facts in the fact repository in association with a second object that corresponds to the entity.

18. The non-transitory computer readable storage medium of claim 16, wherein the first model is different than the second model.

19. A computer system comprising: a processor; memory; and one or more programs, wherein the one or more programs comprising instructions for: identifying a first set of documents based on a first model and a first set of features, wherein the first model includes a first set of rules specifying at least one combination of features from the first set of features that are sufficient for identifying a document referring to an entity, and each document in the first set of documents includes a sufficient number of features in common with the first set of features to identify a document referring to the entity according to the first model; determining a second model based on features included in one or more documents in the first set of documents, wherein the second model includes a second set of rules specifying at least one combination of features from the first set of documents that are sufficient for identifying a document referring to the entity; identifying a second set of documents based on the second model, wherein each document in the second set of documents includes a sufficient number of features in common with the first set of features to identify a document referring to the entity according to the second model, and wherein the second set of documents includes at least one document not included in the first set of documents; and extracting one or more facts from the second set of documents and associating the extracted facts with the entity.

20. The system of claim 19, wherein the first set of features is stored as a set of facts in the fact repository in association with a second object that corresponds to the entity.
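Two ideas from the dependent claims lend themselves to a short Python sketch: a model in which one feature is sufficient on its own while another is not (claims 14 and 15), and importance estimates derived from the identified documents (claims 10 and 11). The sketch below is only an illustration under stated assumptions. The rule representation, the feature strings, the function names, and the choice to sum per-document scores are all hypothetical and are not taken from the patent.

```python
# Hypothetical model: each rule is a combination of features that is
# treated as sufficient to say a document refers to the entity.
RULES = [
    # First feature: sufficient on its own (compare claim 14).
    frozenset({"isbn:0000000000"}),
    # Second feature ("title:Example"): only sufficient together with
    # another feature, never alone (compare claim 15).
    frozenset({"title:Example", "author:A. Writer"}),
]

def refers_to_entity(doc_features, rules=RULES):
    """True if the document contains every feature of at least one rule."""
    return any(rule <= doc_features for rule in rules)

def importance_by_count(doc_ids):
    """Importance as the number of identified documents (compare claim 10)."""
    return len(doc_ids)

def importance_by_document_scores(doc_ids, doc_scores):
    """Importance as the sum of per-document importance scores.

    Summing is an assumed aggregation (compare claim 11); documents
    without a known score contribute 0.0.
    """
    return sum(doc_scores.get(doc_id, 0.0) for doc_id in doc_ids)
```

For example, `refers_to_entity({"title:Example"})` is false because the title alone matches no rule, while a document carrying the ISBN feature matches regardless of what else it contains.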

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F G06N

Patent Metadata

Filing Date

February 1, 2012

Publication Date

June 10, 2014
