Semantic indexing methods and systems are disclosed. One such method is directed to training a semantic indexing model by employing an expanded query. The query can be expanded by merging the query with documents that are relevant to the query for purposes of compensating for a lack of training data. In accordance with another exemplary aspect, time difference features can be incorporated into a semantic indexing model to account for changes in query distributions over time.
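The query expansion step described above can be illustrated with a minimal sketch. It assumes bag-of-words term counts, whitespace tokenization, and selection of the closest documents by cosine similarity; the helper names (`bow`, `cosine`, `expand_query`) and the `top_k` parameter are hypothetical and do not come from the patent text.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts (assumption: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def expand_query(query, docs, top_k=2):
    """Merge the query with the words of its top_k closest documents."""
    q = bow(query)
    # Rank the retrieved documents by similarity to the original query.
    ranked = sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)
    expanded = Counter(q)
    for doc in ranked[:top_k]:
        expanded.update(bow(doc))  # add the document's term counts
    return expanded

docs = [
    "semantic indexing model training",
    "cooking pasta recipes",
    "query expansion for semantic search",
]
expanded = expand_query("semantic query", docs, top_k=2)
```

Here the two documents sharing terms with the query are merged into it, while the unrelated document contributes nothing. A real system might merge only part of each document, or only words meeting a similarity threshold.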
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method for training a semantic indexing model comprising: providing a search engine with a first query; receiving a set of documents of a plurality of documents related to the first query from the search engine; generating, by at least one hardware processor, an expanded query by merging at least a portion of a subset of the set of the documents with the first query; and training the semantic indexing model based on the expanded query; wherein the training comprises presenting at least a portion of the plurality of documents to a user, receiving indications of which of the plurality of documents are relevant to the expanded query and which of the plurality of documents are irrelevant to the expanded query; wherein the training updates the model based on the expanded query, the documents that are relevant to the expanded query and the documents that are irrelevant to the expanded query; wherein the updating comprises modifying the model by computing the model such that $\sum_{(q, d^+, d^-)} \max(0, 1 - f(q', d^+) + f(q', d^-))$ is minimized, where $f$ is the model, $q'$ denotes the expanded query, $d^+$ denotes documents that are relevant to the query $q'$ and $d^-$ denotes documents that are irrelevant to the query $q'$.
2. The method of claim 1, wherein the method further comprises: re-ranking the set of documents based on the expanded query using the semantic indexing model.
3. The method of claim 1, wherein the receiving further comprises selecting the subset by applying a cosine distance between a vector denoting the first query and vectors denoting the documents in the set.
4. The method of claim 1, wherein the merging comprises merging words of said subset with words of the first query that have a particular similarity to the words of the subset.
5. The method of claim 4, wherein the particular similarity is a co-occurrence based measure.
6. A system for training a semantic indexing model comprising: a search engine, which is configured to receive a first query and generate a set of documents of a plurality of documents related to the first query; a query generator unit, implemented by at least one hardware processor, configured to generate an expanded query by merging at least a portion of a subset of the set of documents with the first query; and a controller configured to train the semantic indexing model based on the expanded query; wherein the training by the controller comprises presenting at least a portion of the plurality of documents to a user, receiving indications of which of the plurality of documents are relevant to the expanded query and which of the plurality of documents are irrelevant to the expanded query; wherein the training updates the model based on the expanded query, the documents that are relevant to the expanded query and the documents that are irrelevant to the expanded query; wherein the updating comprises modifying the model by computing the model such that $\sum_{(q, d^+, d^-)} \max(0, 1 - f(q', d^+) + f(q', d^-))$ is minimized, where $f$ is the model, $q'$ denotes the expanded query, $d^+$ denotes documents that are relevant to the query $q'$ and $d^-$ denotes documents that are irrelevant to the query $q'$.
7. The system of claim 6, wherein the merging comprises merging words of said subset with words of the first query that have a particular similarity to the words of the subset.
8. The system of claim 6, further comprising: a ranker configured to re-rank the set of documents based on the expanded query using the semantic indexing model.
9. The system of claim 6, further comprising: a time difference module configured to determine at least one time difference parameter denoting a time difference between a receipt of the query and a generation of at least one document of the plurality of documents and to modify a similarity measure of the model based on the at least one time difference parameter.
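The margin ranking objective recited in claims 1 and 6 can be illustrated with a short sketch. The claims do not specify the form of the model f, so the bilinear scoring function f(q', d) = q'ᵀ W d used here is an assumption, as are the stochastic subgradient update, the learning rate `lr`, and all function names.

```python
def score(W, q, d):
    """Assumed bilinear model: f(q, d) = q^T W d over dense term vectors."""
    return sum(q[i] * W[i][j] * d[j]
               for i in range(len(q)) for j in range(len(d)))

def hinge(W, q, d_pos, d_neg):
    """One term of the claimed sum: max(0, 1 - f(q', d+) + f(q', d-))."""
    return max(0.0, 1.0 - score(W, q, d_pos) + score(W, q, d_neg))

def sgd_step(W, q, d_pos, d_neg, lr=0.1):
    """Subgradient step on W, applied only when the margin is violated."""
    if hinge(W, q, d_pos, d_neg) > 0.0:
        for i in range(len(q)):
            for j in range(len(d_pos)):
                # d(loss)/dW[i][j] = -q[i] * (d_pos[j] - d_neg[j])
                W[i][j] += lr * q[i] * (d_pos[j] - d_neg[j])
    return W

# Toy example over a 3-term vocabulary (values are illustrative only).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
q_expanded = [1.0, 0.0, 1.0]   # expanded query q'
d_pos = [1.0, 0.0, 0.0]        # document marked relevant
d_neg = [0.0, 0.0, 1.0]        # document marked irrelevant

loss_before = hinge(W, q_expanded, d_pos, d_neg)
sgd_step(W, q_expanded, d_pos, d_neg)
loss_after = hinge(W, q_expanded, d_pos, d_neg)
```

Each step raises the score of the relevant document relative to the irrelevant one, so the hinge term for this triple decreases. Minimizing the full sum would repeat such steps over all (q', d⁺, d⁻) triples.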
October 28, 2013
May 10, 2016