US-8589404

Semantic data integration

Published: November 19, 2013
Assignee: not available in USPTO data we have
Inventors: not available in USPTO data we have
Technical Abstract

Systems and methods are provided for retrieving data relevant to a subject of interest. Occurrences of each of a plurality of n-grams within the data record are identified. A multinomial distribution is defined from the respective numbers of occurrence of a subset of the plurality of n-grams. The multinomial distribution is stored in a semantic model as a point on an information manifold. The semantic model is configured to represent an indexed family of probability distributions as points on the information manifold. It is determined if the data record is relevant to the subject of interest according to the position of the point on the information manifold, and the data record is retrieved if the data record is relevant to the subject of interest.
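The pipeline in the abstract (count n-grams, normalize the counts into a multinomial distribution, compare distributions as points on a manifold) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, whitespace tokenization, and additive smoothing are all assumptions, and the abstract does not name a specific metric. The sketch uses the Fisher-Rao geodesic distance, a standard choice for multinomial families.

```python
import math
from collections import Counter


def ngram_counts(text, n=2):
    """Count word-level n-grams in a record.

    Lowercasing and whitespace tokenization are assumptions for illustration.
    """
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def to_multinomial(counts, vocabulary, smoothing=1.0):
    """Normalize counts over a fixed n-gram vocabulary into probabilities.

    Additive smoothing (an assumption) keeps every coordinate strictly
    positive, so the point lies in the interior of the simplex.
    """
    raw = [counts.get(gram, 0) + smoothing for gram in vocabulary]
    total = sum(raw)
    return [value / total for value in raw]


def fisher_rao_distance(p, q):
    """Geodesic distance between two multinomial distributions under the
    Fisher information metric: 2 * arccos(sum_i sqrt(p_i * q_i))."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to [0, 1] to guard against floating-point rounding before acos.
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))
```

Under these assumptions, a record could be judged relevant when its point lies within some chosen distance of points already known to be relevant; the claims describe defining such a region more generally, for example with a support vector machine.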

Patent Claims
19 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A system for identifying the relevance of a data record to a subject of interest comprising: at least one non-transitory computer readable medium storing machine executable instructions comprising: an indexer configured to identify occurrences of each of a plurality of n-grams within the data record; a distribution generator configured to associate at least one of the plurality of n-grams with a semantic parameter, where a semantic parameter is a value derived at least one of associated meanings, inter-symbol structure, and associated source properties of the data record, and define a multinomial distribution from the respective numbers of occurrence of a subset of the plurality of n-grams and the assigned semantic parameter; a semantic model configured to represent a family of probability distributions as points on an information manifold, the information manifold having an intrinsic geometry defined by the family of probability distributions such that a distance between two points on the information manifold represents a similarity between the probability distributions represented by the two points and the semantic model storing the multinomial distribution as a point on the information manifold; a classifier configured to determine the relevance of the data record according to the position of the point on the information manifold; and a processor operatively connected to one or more of the at least one non-transitory computer readable media and configured to execute at least a subset of the machine executable instructions.

2. The system of claim 1, the information manifold being represented as an N-simplex, where N is equal to a number of possible distributions for a given set of identifiable n-grams and semantic parameters in a universe of discourse represented by the information manifold.

3. The system of claim 1, wherein the semantic parameter represents a geographic location associated with the data record.

4. The system of claim 1, wherein the semantic parameter represents a time period associated with the data record.

5. The system of claim 1, wherein the semantic parameter comprises an additional n-gram that is not present in the data record but is related to one of the plurality of n-grams.

6. The system of claim 5, wherein the additional phrase is a name of an organization and the one of the plurality of n-grams is a name of an individual in the organization.

7. The system of claim 5, wherein the one of the plurality of n-grams in a name of a first individual and the additional n-gram is a name of a second individual having a familial relationship to the first individual.

8. The system of claim 1, wherein the classifier comprises at least one support vector machine configured to define a region on the information manifold containing a plurality of indexed distributions representing data records relevant to the subject of interest.

9. A computer implemented method for retrieving data relevant to a subject of interest comprising: creating respective initial multinomial distributions from each of a plurality of data records; augmenting each initial multinomial distribution with a semantic parameter to form a plurality of augmented multinomial distributions from a family of multinomial distributions, the semantic parameter for each initial multinomial distribution representing a portion of a semantic content of the data record associated with the initial multinomial distribution as a value derived at least one of associated meanings, inter-symbol structure, and associated source properties of a data record; creating a semantic model representing the plurality of augmented distributions as points on an information manifold, the information manifold having an intrinsic geometry defined by the family of multinomial distributions such that a distance between two points on the information manifold represents a similarity between the multinomial distributions represented by the two points; defining a region on the information manifold associated with the subject of interest; and retrieving at least one data record within the defined region.

10. The computer implemented method of claim 9, wherein retrieving the at least one data record comprises providing the at least one data record to a user via a graphical user interface.

11. The computer implemented method of claim 9, wherein defining a region on the information manifold comprises: allowing the user to select a first set of the plurality of data records that are relevant to the subject of interest and a second set of the plurality of data records that are not relevant to the subject of interest; and defining the region on the information manifold according to the selected first and second sets.

12. The computer implemented method of claim 11, wherein defining the region on the information manifold according to the selected first and second sets comprises training a support vector machine on the first and second sets.

13. The computer implemented method of claim 9, wherein augmenting each initial multinomial distribution with a semantic parameter comprises: defining a first grid over a geospatial region of interest, the first grid comprising a first plurality of subregions each having a first area; defining a second grid over the geospatial region of interest, the second grid comprising a second plurality of subregions, each of the subregions of the second grid having a second area greater than the first area; and determining at least one subregion of the first plurality of subregions and at least one subregion of the second plurality of subregions associated with the portion of the semantic content of the data record.

14. A system comprising: a first non-transitory computer readable medium storing a first set of machine executable instructions; a first processor and operatively connected to the first non-transitory computer readable medium, the first processor being local to the first non-transitory computer readable medium; a second non-transitory computer readable medium storing a second set of machine executable instructions, the second non-transitory computer readable medium being remote from the first non-transitory computer readable medium and connected via a network connection; and a second processor and operatively connected to the second non-transitory computer readable medium, the second processor being local to the first non-transitory computer readable medium; wherein the first non-transitory computer readable medium and the second non-transitory computer readable medium collectively store machine readable instructions configured to perform a method comprising creating respective initial multinomial distributions from each of a plurality of data records; augmenting each initial multinomial distribution with a semantic parameter to form a plurality of augmented multinomial distributions from a family of multinomial distributions, the semantic parameter for each initial multinomial distribution representing a portion of a semantic content of the data record associated with the initial multinomial distribution as a value derived at least one of associated meanings, inter-symbol structure, and associated source properties of a data record; creating a semantic model representing the plurality of augmented distributions as points on an information manifold, the information manifold having an intrinsic geometry defined by the family of multinomial distributions such that a distance between two points on the information manifold represents a similarity between the multinomial distributions represented by the two points; defining a region on the information manifold associated with a subject of interest; and retrieving at least one data record within the defined region.

15. A method for providing data relevant to a subject of interest to a user comprising: identifying occurrences of each of a plurality of n-grams within the data record; defining a multinomial distribution from the respective numbers of occurrence of a subset of the plurality of n-grams, wherein defining the multinomial distribution comprises associating at least one of the n-grams with a semantic parameter, the semantic parameter being a value derived at least one of associated meanings, inter-symbol structure, and associated source properties of the data record, and defining the multinomial distribution from the respective numbers of occurrence of the subset of the plurality of n-grams and the semantic parameter; storing the multinomial distribution in a semantic model as a point on an information manifold, the semantic model being configured to represent a plurality of indexed distributions as points on the information manifold, and the information manifold being an N-simplex, where N is an integer greater than one; determining if the data record is relevant to the subject of interest according to the position of the point on the information manifold; and providing the data record to the user for review if the data record is relevant to the subject of interest.

16. The method of claim 15, wherein associating at least one n-gram with a semantic parameter comprises: defining a grid over a geospatial region of interest, the grid comprising a plurality of subregions; and determining at least one subregion of the plurality of subregions associated with the at least one n-gram.

17. The method of claim 15, wherein associating at least one n-gram with a semantic parameter comprises: defining a series of subintervals over a time frame of interest; and determining at least one subinterval of the series of subintervals associated with the at least one n-gram.

18. The method of claim 15, wherein associating at least one n-gram with a semantic parameter comprises identifying a relationship between one of the plurality of n-grams and an additional n-gram that is not present in the data record, and representing the additional n-gram in the multinomial distribution.

19. A computer implemented method for retrieving data relevant to a subject of interest comprising: creating respective initial multinomial distributions from each of a plurality of data records; augmenting each initial multinomial distribution with a semantic parameter to form a plurality of augmented multinomial distributions from a family of multinomial distributions, the semantic parameter for each initial multinomial distribution representing a portion of a semantic content of the data record associated with the initial multinomial distribution as a value derived at least one of associated meanings, inter-symbol structure, and associated source properties of a data record, and the augmenting of each initial multinomial distribution with a semantic parameter comprising; defining a series of subintervals over a time frame of interest; and determining at least one subinterval associated with the portion of the semantic content of the data record; creating a semantic model representing the plurality of augmented distributions as points on an information manifold, the information manifold having an intrinsic geometry defined by the family of multinomial distributions such that a distance between two points on the information manifold represents a similarity between the multinomial distributions represented by the two points; and defining a region on the information manifold associated with the subject of interest; and retrieving at least one data record within the defined region.
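Claims 13, 16, 17, and 19 describe deriving semantic parameters by locating a record within geospatial grid subregions (at two resolutions in claim 13) and within time subintervals. One way this might look is sketched below. Everything here is an assumption for illustration: the helper names, the square latitude/longitude cells, the equal-width time bins, and in particular the choice to fold each semantic parameter into the count vector as an extra category with a count of one. The claims do not specify how the parameter is combined with the n-gram counts.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg):
    """Index of the grid subregion containing (lat, lon), for square cells
    that are cell_deg degrees on a side (hypothetical gridding scheme)."""
    return (math.floor((lat + 90.0) / cell_deg),
            math.floor((lon + 180.0) / cell_deg))


def time_bin(t, start, width):
    """Index of the subinterval containing time t, for equal-width
    subintervals of the given width beginning at start."""
    return math.floor((t - start) / width)


def augment(counts, lat, lon, t,
            fine_deg=1.0, coarse_deg=10.0, start=0.0, width=86400.0):
    """Return a copy of the n-gram counts with semantic categories added.

    Each semantic parameter becomes one extra category with a count of one;
    this encoding is an assumption, not taken from the claims.
    """
    augmented = Counter(counts)
    augmented[("geo_fine", grid_cell(lat, lon, fine_deg))] += 1      # first grid, smaller cells
    augmented[("geo_coarse", grid_cell(lat, lon, coarse_deg))] += 1  # second grid, larger cells
    augmented[("time", time_bin(t, start, width))] += 1              # time subinterval
    return augmented
```

With two grids, records from nearby but distinct locations can share the coarse-cell category while differing in the fine-cell category, so their augmented distributions overlap partially rather than not at all.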

Patent Metadata

Filing Date

June 19, 2012

Publication Date

November 19, 2013

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “Semantic data integration” (US-8589404). https://patentable.app/patents/US-8589404

© 2026 Patentable. All rights reserved.

Patentable is a research and drafting-assistant tool, not a law firm, and does not provide legal advice. Documents we generate are drafts for review by a licensed patent attorney.