US-10942954

Dataset adaptation for high-performance in specific natural language processing tasks

Published: March 9, 2021

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

Systems, methods, and computer program products to perform an operation comprising identifying a first available dataset having a degree of similarity to a received input dataset that exceeds a similarity threshold, determining, based on a plurality of features of the first available dataset and a plurality of features of the input dataset, a set of recommendations for transforming the input dataset, and transforming a text of the input dataset based on the set of recommendations and to optimize the input dataset for processing by a natural language processing (NLP) algorithm.
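The operation summarized above (select a sufficiently similar available dataset, derive recommendations from feature differences, then transform the input text) can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the two features, the relative-difference similarity score, the `min_gap` and `threshold` values, and every function name are assumptions introduced here, since the abstract does not specify them.

```python
def upper_ratio(text):
    """Share of alphabetic characters that are uppercase (hypothetical feature)."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

def extract_features(texts):
    """Hypothetical feature set: mean tokens per text and overall uppercase share."""
    return {
        "mean_tokens": sum(len(t.split()) for t in texts) / len(texts),
        "upper_ratio": upper_ratio("".join(texts)),
    }

def similarity(a, b):
    """Assumed score: one minus the mean relative feature difference.
    Falls in [0, 1] when all feature values are non-negative."""
    diffs = [abs(a[k] - b[k]) / max(abs(a[k]), abs(b[k]), 1e-9) for k in a]
    return 1.0 - sum(diffs) / len(diffs)

def select_reference(available, input_features, threshold):
    """Return (name, features, score) for the most similar available dataset
    whose score exceeds the threshold, or None if no dataset qualifies."""
    best = None
    for name, feats in available.items():
        score = similarity(feats, input_features)
        if score > threshold and (best is None or score > best[2]):
            best = (name, feats, score)
    return best

def recommend(ref_features, input_features, min_gap=0.1):
    """Emit one recommendation per feature whose gap exceeds min_gap (assumed value)."""
    recs = []
    for name, target in ref_features.items():
        gap = input_features[name] - target
        if abs(gap) > min_gap:
            recs.append({"feature": name, "target": target, "gap": gap})
    return recs

def transform(texts, recs):
    """Rewrite only the text elements that do not comply with a recommendation.
    Only the uppercase recommendation is handled in this sketch."""
    out = list(texts)
    for rec in recs:
        if rec["feature"] == "upper_ratio" and rec["gap"] > 0:
            out = [t.lower() if upper_ratio(t) > rec["target"] else t for t in out]
    return out

# Toy data: one available dataset and an input dataset with a shouting headline.
available = {"news": extract_features(["the cat sat on the mat", "a dog barked at night"])}
input_texts = ["THE CAT SAT ON THE MAT", "a bird sang at dawn"]

match = select_reference(available, extract_features(input_texts), threshold=0.4)
if match is not None:
    _, ref_features, before = match
    recs = recommend(ref_features, extract_features(input_texts))
    transformed = transform(input_texts, recs)
    after = similarity(ref_features, extract_features(transformed))
```

In this toy run the only feature gap is the uppercase share, so a single recommendation is produced and only the first text is rewritten; the similarity to the reference dataset rises after the transformation, which is the property the claims require of each recommendation.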

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method comprising: receiving an input dataset comprising textual data from a plurality of sources; determining a degree of similarity between a first available dataset, from a plurality of available datasets, and the input dataset; and upon determining that the degree of similarity exceeds a similarity threshold: generating, based on performance of a natural language processing (NLP) algorithm with respect to the first available data set and based further on a plurality of features of the first available dataset and a plurality of features of the input dataset, a set of recommendations for transforming the input dataset, wherein each recommendation of the set of recommendations, when applied to the input dataset, will increase the degree of similarity between the input dataset and the first available dataset; and transforming the textual data of the input dataset based on the set of recommendations, such that the input dataset becomes more similar to the first available dataset and is optimized for processing by the NLP algorithm.

2. The method of claim 1, wherein the first available dataset is of a plurality of available datasets, wherein identifying the first available dataset comprises: computing a similarity score reflecting the degree of similarity between the first available dataset and the input dataset; and determining that the similarity score exceeds a similarity threshold.

3. The method of claim 1, further comprising prior to identifying the first available dataset: receiving a plurality of available datasets including the first available dataset; transforming each of the plurality of available datasets into a plurality of transformed datasets based on at least one transformation rule; extracting a plurality of features for each available dataset and each transformed dataset; applying a plurality of NLP algorithms to each available dataset and each transformed dataset; monitoring performance metrics when applying the plurality of NLP algorithms to each available dataset and each transformed dataset; and storing the monitored performance metrics.

4. The method of claim 1, wherein generating the set of recommendations comprises: computing, for a first feature of the plurality of features of the first available dataset and the input dataset, a difference between a first feature value for the first feature of the first available dataset and the input dataset; determining that the computed difference exceeds a threshold; and generating a first recommendation of the set of recommendations based on the first feature and the computed difference.

5. The method of claim 1, wherein transforming the input dataset comprises: identifying, in the text of the input dataset, a first element of text that does not comply with a first recommendation of the set of recommendations; regenerating the first element of text as a first transformed element of text that complies with the first recommendation; and storing the first transformed element of text in a transformed dataset corresponding to the first input dataset.

6. The method of claim 5, wherein transforming the input dataset further comprises: identifying, based on a second recommendation of the set of recommendations, an additional category of text required for the input dataset; identifying, from at least one data source, data corresponding to the additional category of text required for the input dataset; generating, based on the identified data, additional text that satisfies the second recommendation; and storing the additional text in the transformed dataset.

7. The method of claim 1, further comprising: applying the NLP algorithm to the transformed text of the input dataset.

8. A computer program product, comprising: a non-transitory computer-readable storage medium having computer readable program code embodied therewith, the computer readable program code executable by a processor to perform an operation comprising: receiving an input dataset comprising textual data from a plurality of sources; determining a degree of similarity between a first available dataset, from a plurality of available datasets, and the input dataset; and upon determining that the degree of similarity exceeds a similarity threshold: generating, based on performance of a natural language processing (NLP) algorithm with respect to the first available data set and based further on a plurality of features of the first available dataset and a plurality of features of the input dataset, a set of recommendations for transforming the input dataset, wherein each recommendation of the set of recommendations, when applied to the input dataset, will increase the degree of similarity between the input dataset and the first available dataset; and transforming the textual data of the input dataset based on the set of recommendations, such that the input dataset becomes more similar to the first available dataset and is optimized for processing by the NLP algorithm.

9. The computer program product of claim 8, wherein the first available dataset is of a plurality of available datasets, wherein identifying the first available dataset comprises: computing a similarity score reflecting the degree of similarity between the first available dataset and the input dataset; and determining that the similarity score exceeds a similarity threshold.

10. The computer program product of claim 8, the operation further comprising prior to identifying the first available dataset: receiving a plurality of available datasets including the first available dataset; transforming each of the plurality of available datasets into a plurality of transformed datasets based on at least one transformation rule; extracting a plurality of features for each available dataset and each transformed dataset; applying a plurality of NLP algorithms to each available dataset and each transformed dataset; monitoring performance metrics when applying the plurality of NLP algorithms to each available dataset and each transformed dataset; and storing the monitored performance metrics.

11. The computer program product of claim 8, wherein generating the set of recommendations comprises: computing, for a first feature of the plurality of features of the first available dataset and the input dataset, a difference between a first feature value for the first feature of the first available dataset and the input dataset; determining that the computed difference exceeds a threshold; and generating a first recommendation of the set of recommendations based on the first feature and the computed difference.

12. The computer program product of claim 8, wherein transforming the input dataset comprises: identifying, in the text of the input dataset, a first element of text that does not comply with a first recommendation of the set of recommendations; regenerating the first element of text as a first transformed element of text that complies with the first recommendation; and storing the first transformed element of text in a transformed dataset corresponding to the first input dataset.

13. The computer program product of claim 12, wherein transforming the input dataset further comprises: identifying, based on a second recommendation of the set of recommendations, an additional category of text required for the input dataset; identifying, from at least one data source, data corresponding to the additional category of text required for the input dataset; generating, based on the identified data, additional text that satisfies the second recommendation; and storing the additional text in the transformed dataset.

14. The computer program product of claim 8, the operation further comprising: applying the NLP algorithm to the transformed text of the input dataset.

15. A system, comprising: a processor; and a memory storing one or more instructions which, when executed by the processor, performs an operation comprising: receiving an input dataset comprising textual data from a plurality of sources; determining a degree of similarity between a first available dataset, from a plurality of available datasets, and the input dataset; and upon determining that the degree of similarity exceeds a similarity threshold: generating, based on performance of a natural language processing (NLP) algorithm with respect to the first available data set and based further on a plurality of features of the first available dataset and a plurality of features of the input dataset, a set of recommendations for transforming the input dataset, wherein each recommendation of the set of recommendations, when applied to the input dataset, will increase the degree of similarity between the input dataset and the first available dataset; and transforming the textual data of the input dataset based on the set of recommendations, such that the input dataset becomes more similar to the first available dataset and is optimized for processing by the NLP algorithm.

16. The system of claim 15, wherein the first available dataset is of a plurality of available datasets, wherein identifying the first available dataset comprises: computing a similarity score reflecting the degree of similarity between the first available dataset and the input dataset; and determining that the similarity score exceeds a similarity threshold.

17. The system of claim 15, the operation further comprising prior to identifying the first available dataset: receiving a plurality of available datasets including the first available dataset; transforming each of the plurality of available datasets into a plurality of transformed datasets based on at least one transformation rule; extracting a plurality of features for each available dataset and each transformed dataset; applying a plurality of NLP algorithms to each available dataset and each transformed dataset; monitoring performance metrics when applying the plurality of NLP algorithms to each available dataset and each transformed dataset; and storing the monitored performance metrics.

18. The system of claim 15, wherein generating the set of recommendations comprises: computing, for a first feature of the plurality of features of the first available dataset and the input dataset, a difference between a first feature value for the first feature of the first available dataset and the input dataset; determining that the computed difference exceeds a threshold; and generating a first recommendation of the set of recommendations based on the first feature and the computed difference.

19. The system of claim 15, wherein transforming the input dataset comprises: identifying, in the text of the input dataset, a first element of text that does not comply with a first recommendation of the set of recommendations; regenerating the first element of text as a first transformed element of text that complies with the first recommendation; and storing the first transformed element of text in a transformed dataset corresponding to the first input dataset.

20. The system of claim 19, wherein transforming the input dataset further comprises: identifying, based on a second recommendation of the set of recommendations, an additional category of text required for the input dataset; identifying, from at least one data source, data corresponding to the additional category of text required for the input dataset; generating, based on the identified data, additional text that satisfies the second recommendation; and storing the additional text in the transformed dataset.
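The preparatory steps recited in claims 3, 10, and 17 (transform each available dataset with transformation rules, extract features, apply several NLP algorithms, monitor and store performance metrics) might look something like the sketch below. Everything concrete in it is an assumption made for illustration: the two rules, the two toy keyword taggers standing in for NLP algorithms, the single feature, the choice of accuracy and wall-clock time as metrics, and the in-memory list used as the store.

```python
import time

# Hypothetical transformation rules: each maps a list of texts to a new list.
RULES = {
    "lowercase": lambda texts: [t.lower() for t in texts],
    "strip_punct": lambda texts: [
        "".join(c for c in t if c.isalnum() or c.isspace()) for t in texts
    ],
}

# Toy stand-ins for NLP algorithms: each labels one text as "pos" or "neg".
def case_sensitive_tagger(text):
    return "pos" if "good" in text.split() else "neg"

def case_insensitive_tagger(text):
    return "pos" if "good" in text.lower().split() else "neg"

ALGORITHMS = {
    "case_sensitive": case_sensitive_tagger,
    "case_insensitive": case_insensitive_tagger,
}

def extract_features(texts):
    """Hypothetical single feature: mean number of tokens per text."""
    return {"mean_tokens": sum(len(t.split()) for t in texts) / len(texts)}

def benchmark(datasets):
    """datasets maps a name to (texts, labels). Returns one stored record per
    (dataset, variant, algorithm) combination."""
    store = []
    for name, (texts, labels) in datasets.items():
        # The original dataset plus one transformed dataset per rule.
        variants = {"original": texts}
        for rule_name, rule in RULES.items():
            variants[rule_name] = rule(texts)
        for variant_name, variant_texts in variants.items():
            features = extract_features(variant_texts)
            for algo_name, algo in ALGORITHMS.items():
                start = time.perf_counter()
                predictions = [algo(t) for t in variant_texts]
                elapsed = time.perf_counter() - start
                correct = sum(p == y for p, y in zip(predictions, labels))
                store.append({
                    "dataset": name,
                    "variant": variant_name,
                    "algorithm": algo_name,
                    "features": features,
                    "accuracy": correct / len(labels),
                    "seconds": elapsed,
                })
    return store

datasets = {
    "reviews": (["GOOD movie", "bad plot", "good acting!"], ["pos", "neg", "pos"]),
}
records = benchmark(datasets)
```

With this toy data the case-sensitive tagger scores higher on the lowercased variant than on the original text, which is the kind of stored evidence a later recommendation step could draw on when deciding how an input dataset should be transformed.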

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F, G06N

Patent Metadata

Filing Date

December 22, 2017

Publication Date

March 9, 2021
