Categorical data transformation and clustering for machine learning using natural language processing

Published: December 20, 2022

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

Categorical data transformation and clustering techniques and systems are described for machine learning using natural language processing. These techniques and systems are configured to improve operation of a computing device to support efficient and accurate use of categorical data, which is not possible using conventional techniques. In an example, categorical data is received by a computing device that includes a categorical variable having a non-numerical data type for a number of classes. The categorical data is then converted into numerical data using natural language processing. Data is then generated by the computing device that includes a plurality of latent classes. This is performed by clustering the numerical data into a number of clusters that is smaller than the number of classes in the categorical data.
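The flow in the abstract (convert class names to numbers, then cluster them into fewer latent classes) can be illustrated with a minimal sketch. It is not the patented implementation: the character tri-gram counts, the plain Lloyd's-style K-means, the fixed seed indices, and the six sample class names are all assumptions chosen to keep the example small and deterministic.

```python
import math
from collections import Counter


def trigram_vector(text):
    """Count character tri-grams in a padded, lower-cased string."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def to_dense(counters):
    """Turn sparse counts into L2-normalised vectors over a shared vocabulary."""
    vocab = sorted(set().union(*counters))
    rows = []
    for counts in counters:
        row = [float(counts.get(gram, 0)) for gram in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, seeds, iterations=20):
    """Plain Lloyd's algorithm; `seeds` are indices of the initial centroids."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical categorical variable with six classes.
classes = [
    "sports news", "sports scores", "sports video",
    "travel deals", "travel guides", "travel blog",
]
vectors = to_dense([trigram_vector(c) for c in classes])
latent = kmeans(vectors, seeds=[0, 3])  # two latent classes for six original classes
```

Here the six original classes collapse into two latent classes, which a downstream model could consume in place of the raw categorical variable. Fixed seeds are used only so the sketch is reproducible; a practical system would more likely rely on a library implementation with randomised initialisation.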

Patent Claims

15 claims

Legal claims defining the scope of protection, as filed with the USPTO.

2. The method as described in claim 1, wherein the converting includes converting the second said categorical variable into the numerical data as n-gram vector representations.

3. The method as described in claim 1, wherein the vector representations represent multiple words.

4. The method as described in claim 1, wherein the clustering is based on features included in a set of strings of alphabetical text.

5. The method as described in claim 4, wherein the features include bi-gram words and tri-gram characters.

6. The method as described in claim 1, wherein the clustering uses a K-means clustering technique.

8. The method as described in claim 1, wherein the number is ten or more.

9. The method as described in claim 1, wherein the second said categorical variable includes URLs.

10. The method as described in claim 1, further comprising parsing the second said categorical variable and the converting is based on the parsed categorical variable.

11. The method as described in claim 10, wherein the parsing includes removing characters from that include punctuation and stop words.

13. The computing device as described in claim 12, the operations further comprising parsing the second said categorical variable to remove characters that do not contribute to the clustering.

14. The computing device as described in claim 12, wherein the numerical data is configured as vector representations.

15. The computing device as described in claim 14, wherein characters in the plurality of categorical variables are converted into the vector representations.

17. The computing device as described in claim 12, wherein the number has been found to produce results having limited accuracy.

19. The one or more computer readable storage media as described in claim 18, wherein the converting includes converting the second said categorical variable into the numerical data as vector representations of the number of classes.

20. The one or more computer readable storage media as described in claim 18, further comprising parsing the second said categorical variable to remove characters that do not contribute to clustering.
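Several of the claims above mention parsing a categorical variable (removing punctuation and stop words, handling URLs) and using bi-gram words and tri-gram characters as features. The sketch below shows one way such steps might look. The function names, the tiny stop-word set, and the example URL are hypothetical and not taken from the patent.

```python
import re
from urllib.parse import urlparse

# Illustrative stop-word list only; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "www", "com", "html"}


def parse_value(value):
    """Lower-case a value, strip punctuation, and drop stop words."""
    parts = urlparse(value)
    # For URLs, keep host and path; otherwise use the raw string.
    text = f"{parts.netloc} {parts.path}" if parts.scheme and parts.netloc else value
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]


def word_bigrams(tokens):
    """Adjacent word pairs, joined with a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def char_trigrams(tokens):
    """Sliding three-character windows over the space-joined tokens."""
    joined = " ".join(tokens)
    return [joined[i:i + 3] for i in range(len(joined) - 2)]


tokens = parse_value("https://www.example.com/the-news/sports_scores.html")
bigrams = word_bigrams(tokens)
trigrams = char_trigrams(tokens)
```

The resulting bi-grams and tri-grams could then be counted into vectors and passed to a clustering step such as K-means.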

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06N G06F

Patent Metadata

Filing Date

November 28, 2017

Publication Date

December 20, 2022
