US-11531927

Categorical data transformation and clustering for machine learning using natural language processing

Published: December 20, 2022
Assignee: not available in the USPTO data we have
Inventors: not available in the USPTO data we have
Technical Abstract

Categorical data transformation and clustering techniques and systems are described for machine learning using natural language processing. These techniques and systems are configured to improve operation of a computing device to support efficient and accurate use of categorical data, which is not possible using conventional techniques. In an example, categorical data is received by a computing device that includes a categorical variable having a non-numerical data type for a number of classes. The categorical data is then converted into numerical data using natural language processing. Data is then generated by the computing device that includes a plurality of latent classes. This is performed by clustering the numerical data into a number of clusters that is smaller than the number of classes in the categorical data.

Patent Claims
15 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 2

Original Legal Text

2. The method as described in claim 1, wherein the converting includes converting the second said categorical variable into the numerical data into n-gram vector representations.

Plain English Translation

This invention relates to data processing techniques for handling categorical variables in machine learning or statistical analysis. The problem addressed is the difficulty of incorporating categorical data into algorithms that require numerical inputs, as categorical variables (e.g., text labels, categories) cannot be directly processed without transformation. The method converts a categorical variable into numerical data in the form of n-gram vector representations. N-grams are contiguous sequences of items (e.g., words, characters) from a given sample of text or data, and their vector representations capture local patterns within the categorical data. This conversion allows the categorical variable to be used in numerical analysis, improving compatibility with machine learning models and statistical techniques that rely on numerical inputs. Because classes that share n-grams receive similar vectors, the representation retains structure that is useful for tasks such as the clustering recited in claim 1. One-hot encoding, by contrast, assigns every class its own dimension and so treats all classes as equally unrelated. Capturing these relationships can improve the accuracy of models that depend on numerical representations of categorical variables.
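As a rough illustration of n-gram vector representations, the sketch below maps each categorical value to a count vector over a shared vocabulary of character n-grams. The function names (`char_ngrams`, `ngram_vectors`), the use of character rather than word n-grams, and the use of raw counts are assumptions made for illustration; the claim does not specify them.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return the contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_vectors(values, n=2):
    """Map each value to a count vector over a shared n-gram vocabulary.

    Hypothetical helper: returns the sorted vocabulary and one vector per value.
    """
    counts = [Counter(char_ngrams(v.lower(), n)) for v in values]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for n-grams that a value does not contain.
    vectors = [[c[g] for g in vocab] for c in counts]
    return vocab, vectors


vocab, vectors = ngram_vectors(["red", "reddish", "blue"], n=2)
```

Here "red" and "reddish" share the bi-grams "re" and "ed", so their vectors overlap, while "blue" shares none with either.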

Claim 3

Original Legal Text

3. The method as described in claim 1, wherein the vector representations represents multiple words.

Plain English Translation

The invention relates to natural language processing (NLP) and machine learning, specifically to methods for generating and using vector representations of text data. The core problem addressed is the need for efficient and meaningful representations of text that capture semantic relationships between words or sequences of words. Traditional methods often struggle with contextual nuances or fail to represent multi-word phrases effectively. The method involves generating vector representations that encode information about multiple words rather than single words. These representations are derived from a neural network model trained on a large corpus of text data. The model processes input text sequences and produces dense, low-dimensional vectors that capture semantic relationships. By representing multiple words together, the method improves the ability to understand and generate coherent text, particularly in tasks like machine translation, text summarization, and question answering. The vector representations are used to compare or transform text data, enabling applications such as semantic search, where queries and documents are matched based on meaning rather than exact keyword matches. The method also supports tasks like text classification, where the vector representations help categorize documents into predefined classes. The approach enhances the accuracy and efficiency of NLP systems by providing richer, context-aware embeddings that better reflect the meaning of multi-word expressions.

Claim 4

Original Legal Text

4. The method as described in claim 1, wherein the clustering is based on features included in a set of strings of alphabetical text.

Plain English Translation

This invention relates to a method for clustering data based on features extracted from strings of alphabetical text. The method addresses the challenge of organizing and categorizing text data by identifying meaningful patterns or groupings within the text. The clustering process involves analyzing the features present in the text strings, such as word frequency, syntax, or semantic relationships, to determine similarities between different text entries. By grouping similar text strings together, the method enables more efficient data organization, retrieval, and analysis. The clustering is performed using a predefined set of features derived from the text, ensuring that the groupings are based on relevant and consistent criteria. This approach is particularly useful in applications such as document classification, natural language processing, and information retrieval, where organizing large volumes of text data is essential. The method improves upon existing techniques by focusing on specific text features, leading to more accurate and meaningful clusters. The clustering process may involve machine learning algorithms or statistical techniques to identify patterns and relationships within the text data. The resulting clusters can be used for various purposes, including data mining, trend analysis, and automated content categorization.

Claim 5

Original Legal Text

5. The method as described in claim 4, wherein the features include bi-gram words and tri-gram characters.

Plain English Translation

This invention relates to natural language processing (NLP) and text analysis, specifically improving the accuracy of text classification or information extraction by incorporating bi-gram words and tri-gram characters as features. The method addresses the challenge of capturing meaningful patterns in text data, where traditional single-word or character-based approaches may fail to account for contextual relationships. The process involves extracting bi-gram words (pairs of adjacent words) and tri-gram characters (sequences of three consecutive characters) from input text. These features are then used to enhance machine learning models, such as classifiers or extractors, by providing additional contextual information. For example, bi-grams can help distinguish between words with similar meanings but different contexts, while tri-grams can capture subtle linguistic patterns, such as prefixes, suffixes, or character-level sequences that are important for tasks like language identification or named entity recognition. The method may be applied in various NLP applications, including sentiment analysis, topic modeling, and document categorization, where understanding word and character relationships improves model performance. By leveraging both word-level and character-level n-grams, the approach provides a more robust representation of text, reducing ambiguity and increasing accuracy in automated text processing tasks.
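The two feature types named in the claim can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and lowercasing; the helper names are hypothetical and not taken from the patent.

```python
def word_bigrams(text):
    """Return pairs of adjacent words, joined by a space."""
    words = text.lower().split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]


def char_trigrams(text):
    """Return sequences of three consecutive characters."""
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


def extract_features(text):
    """Collect both feature types under separate keys so they cannot collide."""
    return {
        "word_bigrams": word_bigrams(text),
        "char_trigrams": char_trigrams(text),
    }


features = extract_features("New York City")
```

Strings shorter than the n-gram length simply yield an empty feature list.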

Claim 6

Original Legal Text

6. The method as described in claim 1, wherein the clustering uses a K-means clustering technique.

Plain English Translation

A method for data analysis involves clustering data points using a K-means clustering technique. In the context of claim 1, the data points are the numerical representations of the categorical classes, and K is the number of clusters. K-means begins by choosing K initial centroids, commonly by selecting K data points from the dataset. Each data point is then assigned to its nearest centroid, and each centroid is recalculated as the mean of all data points assigned to it. These two steps repeat until the assignments stop changing or a predefined convergence criterion is met. The procedure seeks to minimize the total squared distance between points and their assigned centroids, although it is only guaranteed to reach a local minimum. This approach is particularly useful for organizing large datasets into meaningful groups, improving data interpretation and pattern recognition. The K-means technique is efficient for unsupervised learning tasks, where the goal is to identify natural groupings within the data without prior labeling. The method can be applied in various fields, including machine learning, bioinformatics, and market segmentation, to extract insights from complex datasets.
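The iteration described above can be sketched in plain Python. This is a minimal version of the standard K-means procedure (Lloyd's algorithm), not the patent's implementation; the seeded random initialization, the iteration cap, and all function names are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumed initialization: pick k of the input points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


points = [[0.0, 0.0], [1.0, 0.0], [100.0, 0.0], [101.0, 0.0]]
labels, centroids = kmeans(points, k=2)
```

To obtain latent classes as the abstract describes, `k` would be chosen smaller than the number of classes in the categorical variable, and each class would take the label of the cluster its vector falls into.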

Claim 8

Original Legal Text

8. The method as described in claim 1, wherein the number is ten or more.

Plain English Translation

This claim narrows the method of claim 1 by placing a lower bound on "the number" recited there. Claim 1 is not reproduced on this page; read against the abstract, "the number" most plausibly refers to the number of classes of the categorical variable, so the claim covers variables with ten or more classes. Variables with many classes can be difficult to use directly in machine learning, because each class may be represented by few examples and common encodings grow with the class count. Under this reading, the method converts those classes into numerical data using natural language processing and clusters them into a smaller number of latent classes, as the abstract describes. The claim itself states no reason for the threshold of ten.

Claim 9

Original Legal Text

9. The method as described in claim 1, wherein the second said categorical variable includes URLs.

Plain English Translation

The invention relates to data processing systems that analyze categorical variables, particularly in the context of machine learning or statistical modeling. A key challenge in such systems is efficiently handling categorical data, which often includes non-numeric values like URLs, text labels, or identifiers. Traditional methods may struggle with high-cardinality categorical variables, leading to inefficiencies in model training or analysis. The invention addresses this by providing a method for processing categorical variables, where at least one of the variables includes URLs. The method involves encoding these categorical variables into a numerical format suitable for machine learning algorithms. This encoding may involve techniques like one-hot encoding, embedding, or hashing, ensuring that the categorical data, including URLs, is transformed into a structured numerical representation. The method may also include preprocessing steps to clean or normalize the URLs before encoding, such as removing protocol prefixes or query parameters. Additionally, the system may apply dimensionality reduction techniques to handle high-cardinality variables efficiently. The encoded data is then used in downstream tasks like classification, regression, or clustering, improving model performance and computational efficiency. The invention ensures that categorical variables, including URLs, are processed in a way that maintains their informational value while being compatible with numerical analysis.
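One way to prepare URLs for this kind of encoding is to split them into word-like tokens. The sketch below uses Python's standard `urllib.parse.urlsplit`; the choice to keep only the host and path, drop the scheme and query string, and split on non-alphanumeric characters is an assumption made for illustration.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL's host and path into lowercase alphanumeric tokens."""
    parts = urlsplit(url)
    # The scheme and query string are ignored here (an assumed design choice).
    text = f"{parts.netloc} {parts.path}".lower()
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]


tokens = url_tokens("https://www.example.com/shoes/running-shoes?id=42")
```

The resulting tokens can then be fed to an n-gram or other text-based vectorizer like any other string-valued categorical variable.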

Claim 10

Original Legal Text

10. The method as described in claim 1, further comprising parsing the second said categorical variable and the converting is based on the parsed categorical variable.

Plain English Translation

This invention relates to data processing systems that handle categorical variables in datasets. The problem addressed is the difficulty of incorporating categorical variables into machine learning models or statistical analyses, as these variables often require conversion into numerical formats for compatibility. The invention provides a method to improve the conversion process by parsing categorical variables and using the parsed data to guide the conversion. The method involves receiving a dataset containing at least one categorical variable, which is a variable that takes on a limited number of distinct values (e.g., colors, labels, or categories). The categorical variable is parsed to extract meaningful information, such as the number of unique categories, the frequency of each category, or hierarchical relationships between categories. This parsed information is then used to determine an appropriate conversion method, such as one-hot encoding, label encoding, or ordinal encoding. The conversion process ensures that the numerical representation retains the original categorical structure, improving the accuracy and interpretability of subsequent analyses or machine learning tasks. The method may also include preprocessing steps, such as handling missing values or standardizing the categorical data, to further enhance the conversion process. The parsed categorical variable can be used to dynamically adjust the conversion parameters, ensuring optimal performance for different types of categorical data.

Claim 11

Original Legal Text

11. The method as described in claim 10, wherein the parsing includes removing characters from that include punctuation and stop words.

Plain English Translation

A method for processing text data involves parsing input text to extract relevant information. The parsing step includes filtering out unwanted characters, specifically punctuation marks and stop words, to refine the text for further analysis. Stop words are common words such as articles, prepositions, and conjunctions that typically do not contribute significant meaning to the text. By removing these elements, the method focuses on the core content, improving efficiency in subsequent processing steps like natural language understanding or machine learning tasks. This approach enhances the accuracy and performance of text-based applications by reducing noise and irrelevant data. The method is particularly useful in fields such as document analysis, search engines, and automated content summarization, where precise and meaningful text extraction is essential. The filtering process ensures that only the most relevant words remain, allowing for more effective analysis and interpretation of the text.
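The filtering step can be sketched as follows, assuming a small hand-picked stop-word list and ASCII punctuation. Both are illustrative choices, and `parse_value` is a hypothetical name; the claim does not specify which stop words or punctuation marks are removed.

```python
import string

# Illustrative stop-word list (an assumption, not taken from the patent).
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "for", "to", "with"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


def parse_value(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = text.lower().translate(PUNCTUATION_TO_SPACE).split()
    return [w for w in words if w not in STOP_WORDS]


cleaned = parse_value("The Lord of the Rings: Part-1!")
```

Replacing punctuation with spaces rather than deleting it keeps hyphenated or colon-separated words from being fused together.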

Claim 13

Original Legal Text

13. The computing device as described in claim 12, the operations further comprising parsing the second said categorical variable to remove characters that do not contribute to the clustering.

Plain English Translation

This invention relates to data processing systems that handle categorical variables in clustering algorithms. The problem addressed is the inefficiency and inaccuracy of clustering when categorical variables contain irrelevant or noisy characters that do not contribute meaningful information to the grouping process. For example, categorical data may include extraneous symbols, prefixes, or suffixes that do not reflect the true categorical distinctions needed for clustering. The system processes categorical variables by first identifying and then removing non-contributing characters from the data. This preprocessing step ensures that only relevant characters are used in the clustering algorithm, improving the accuracy and efficiency of the resulting groupings. The system may apply rules or statistical methods to determine which characters are irrelevant, such as removing punctuation, whitespace, or standardized prefixes. The cleaned categorical variables are then used in a clustering algorithm to group similar data points based on the refined categorical distinctions. This approach is particularly useful in applications where categorical data is noisy or inconsistently formatted, such as text data in natural language processing, log files, or database records. By filtering out irrelevant characters, the system enhances the quality of the clustering results, making the groupings more meaningful and reliable. The method can be integrated into existing data processing pipelines to improve the performance of machine learning models that rely on categorical clustering.

Claim 14

Original Legal Text

14. The computing device as described in claim 12, wherein the numerical data is configured as vector representations.

Plain English Translation

This invention relates to computing devices that process numerical data, particularly for tasks like machine learning or data analysis. The problem addressed is the efficient representation and manipulation of numerical data in computing systems, especially when dealing with high-dimensional datasets or complex computations. The computing device includes a processor and memory storing instructions for processing numerical data. The data is configured as vector representations, which are mathematical constructs used to encode information in a structured format. These vector representations allow for efficient storage, retrieval, and computation, particularly in applications like neural networks, where data is often processed in batches or matrices. The device may also include a data preprocessing module to transform raw data into the vector format, ensuring compatibility with downstream processing tasks. Additionally, the system may support operations like vector arithmetic, normalization, or dimensionality reduction to optimize performance. The vector representations can be used in various applications, such as natural language processing, image recognition, or recommendation systems, where numerical data must be processed in a structured and efficient manner. By using vector representations, the computing device improves computational efficiency, reduces memory usage, and enhances the accuracy of data-driven models. This approach is particularly useful in environments where large-scale data processing is required, such as cloud computing or edge devices.

Claim 15

Original Legal Text

15. The computing device as described in claim 14, wherein characters in the plurality of categorical variables are converted into the vector representations.

Plain English Translation

This invention relates to data processing systems that handle categorical variables, which are variables with discrete, non-numeric values (e.g., colors, labels, or categories). A common challenge in machine learning and data analysis is that categorical variables cannot be directly processed by algorithms that require numerical inputs. The invention addresses this by converting categorical variables into vector representations, enabling their use in computational models. The system includes a computing device configured to process data containing categorical variables. The device identifies a plurality of categorical variables within a dataset and applies a transformation to convert each category into a numerical vector. This conversion may involve techniques such as one-hot encoding, embedding layers, or other dimensionality reduction methods. The resulting vectors preserve the relationships between categories while making the data compatible with numerical processing algorithms. The invention ensures that the vector representations maintain meaningful distinctions between categories, allowing machine learning models to effectively utilize the information. This approach is particularly useful in applications like classification, clustering, and predictive modeling where categorical data is prevalent. By automating the conversion process, the system improves efficiency and reduces errors compared to manual encoding methods. The invention may be integrated into data preprocessing pipelines or deployed as a standalone tool for data scientists and analysts.

Claim 17

Original Legal Text

17. The computing device as described in claim 12, wherein the number has been found to produce results having limited accuracy.

Plain English Translation

This claim narrows the computing device of claim 12 by characterizing "the number" recited there. Claim 12 is not reproduced on this page; read against the abstract, "the number" most plausibly refers to the number of classes of the categorical variable. Under this reading, the claim covers categorical variables whose class count has been found to produce results of limited accuracy when the variable is used as-is, for example because there are too many classes for a model to learn from each one reliably. The device addresses this by converting the categorical variable into numerical data using natural language processing and clustering that data into a smaller number of latent classes, which then stand in for the original classes. The claim does not specify how the limited accuracy is measured or what threshold applies.

Claim 19

Original Legal Text

19. The one or more computer readable storage media as described in claim 18, wherein the converting includes converting the second said categorical variable into the numerical data as vector representations of the number of classes.

Plain English Translation

This invention relates to data processing techniques for handling categorical variables in machine learning or statistical analysis. Categorical variables, which represent discrete groups or classes, are often challenging to incorporate into numerical models that require continuous or ordinal data. The invention addresses this problem by converting categorical variables into numerical data representations that preserve the distinct class information while enabling compatibility with numerical processing algorithms. The method involves transforming a categorical variable into a numerical format by generating vector representations based on the number of classes present. Each class in the categorical variable is assigned a unique vector, where the vector's dimensionality corresponds to the total number of classes. For example, if a categorical variable has three classes (A, B, and C), each class is represented by a binary vector of length three, with a single "1" indicating the presence of the class and "0"s elsewhere. This approach ensures that the numerical representation retains the categorical distinctions, allowing models to differentiate between classes effectively. The conversion process may also include additional steps such as normalization or scaling to optimize the numerical data for downstream analysis. The resulting vector representations can then be used in machine learning models, statistical analyses, or other computational tasks that require numerical inputs. This technique is particularly useful in scenarios where categorical data must be integrated into algorithms that do not natively support non-numerical inputs, such as neural networks or regression models. The invention enhances data compatibility and improves the accuracy of models that rely on ca
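The three-class example above (A, B, and C) corresponds to one-hot encoding, which can be sketched as follows. The function name and the alphabetical ordering of classes are assumptions for illustration.

```python
def one_hot(classes):
    """Map each distinct class to a binary vector with a single 1."""
    ordered = sorted(set(classes))
    index = {c: i for i, c in enumerate(ordered)}
    return {
        c: [1 if i == index[c] else 0 for i in range(len(ordered))]
        for c in ordered
    }


encoding = one_hot(["A", "B", "C"])
```

Each vector has as many dimensions as there are classes, so the representation grows with the class count.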

Claim 20

Original Legal Text

20. The one or more computer readable storage media as described in claim 18, further comprising parsing the second said categorical variable to remove characters that do not contribute to clustering.

Plain English Translation

This invention relates to data preprocessing for clustering algorithms, specifically addressing the challenge of handling categorical variables with irrelevant or noisy characters that degrade clustering performance. The system processes a dataset containing categorical variables, where at least one variable includes non-informative characters that do not contribute to meaningful clustering. The method involves parsing the categorical variable to identify and remove these irrelevant characters, ensuring only relevant data is used for clustering. This preprocessing step enhances the accuracy and efficiency of subsequent clustering operations by reducing noise and improving feature relevance. The system may also include additional preprocessing steps, such as encoding categorical variables into numerical representations, to prepare the data for clustering algorithms. The invention is particularly useful in applications where categorical data contains extraneous characters, such as special symbols, formatting artifacts, or irrelevant prefixes/suffixes, which could otherwise distort clustering results. By filtering out these non-contributing elements, the method ensures that the clustering process relies on the most relevant aspects of the categorical data, leading to more accurate and interpretable clusters.

Patent Metadata

Filing Date: November 28, 2017

Publication Date: December 20, 2022

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, FAQs, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “Categorical data transformation and clustering for machine learning using natural language processing” (US-11531927). https://patentable.app/patents/US-11531927
