Provided is a computer-implemented method for automatically generating annotations for tabular cell data of a table having columns and rows, wherein the method includes: supplying raw cell data of cells of a row of the table as input to an embedding layer of a semantic type annotation neural network which transforms the received raw cell data of the cells of the supplied row into cell embedding vectors; processing the cell embedding vectors by a self-attention layer of the semantic type annotation neural network to calculate attentions among the cells of the respective row of the table, encoding a context within the row that is output as cell context vectors; and processing the cell context vectors generated by the self-attention layer by a classification layer of the semantic type annotation neural network to predict semantic column type annotations and/or to predict relations between semantic column type annotations for the columns of the table.
Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
2. The computer-implemented method according to claim 1 wherein a bidirectional recurrent neural network, RNN, trained as an encoder of an autoencoder on cell embeddings provided by a byte-pair encoding model, BPE, is used as an encoder of the embedding layer of the semantic type annotation neural network.
This claim concerns the embedding layer of the semantic type annotation neural network for tabular data. Before attention among the cells of a row can be calculated, the raw content of each cell has to be converted into a cell embedding vector. A byte-pair encoding (BPE) model splits the cell content into subword units and provides an embedding for each unit. A bidirectional recurrent neural network (RNN) is trained as the encoder of an autoencoder on these embeddings, so it learns to compress the subword sequence of a cell into a representation from which the sequence can be reconstructed. The trained RNN is then used as the encoder in the embedding layer of the semantic type annotation neural network, where it produces the cell embedding vectors. Because BPE works on subword units rather than whole words, the encoder can also represent cell values that are not dictionary words, such as codes or abbreviations.
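As a rough illustration of the encoder described above, the sketch below runs a plain tanh RNN over the subword embeddings of one cell in both directions and concatenates the two final states into a cell embedding vector. It is a minimal sketch under stated assumptions, not the patented implementation: the function names, the tanh cell, and the hand-picked weights are hypothetical, and the autoencoder training that would produce real weights is not shown.

```python
import math

def rnn_pass(token_vectors, w_in, w_rec):
    """Run a simple tanh RNN over the token vectors and return the final hidden state.

    w_in is a (hidden x input) matrix, w_rec a (hidden x hidden) matrix.
    """
    hidden = [0.0] * len(w_rec)
    for x in token_vectors:
        hidden = [
            math.tanh(
                sum(wi * xi for wi, xi in zip(w_in[j], x))
                + sum(wr * hi for wr, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(w_rec))
        ]
    return hidden

def encode_cell(token_vectors, fwd_weights, bwd_weights):
    """Bidirectional encoding: read the subword sequence left-to-right and
    right-to-left, then concatenate both final states into one cell embedding."""
    forward_state = rnn_pass(token_vectors, *fwd_weights)
    backward_state = rnn_pass(list(reversed(token_vectors)), *bwd_weights)
    return forward_state + backward_state

# Hypothetical subword embeddings for one cell (two subword units, dimension 3).
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Hypothetical, untrained weights (hidden size 2); in the claim these would come
# from training the RNN as the encoder of an autoencoder.
w_in = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]]
w_rec = [[0.5, 0.0], [0.0, 0.5]]
cell_embedding = encode_cell(tokens, (w_in, w_rec), (w_in, w_rec))
```

The resulting vector has twice the hidden size, one half per reading direction.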
3. The computer-implemented method according to claim 1 wherein the generated annotations and the tabular cell data of the table, T, are supplied to an extract, transform, load (ETL) process used to generate a knowledge graph instance stored in a memory.
The invention relates to a computer-implemented method for processing tabular data to generate a knowledge graph. The method addresses the challenge of extracting structured information from tables and integrating it into a knowledge graph, which is a graph-based representation of data that enables efficient querying and analysis. The method involves generating annotations for tabular cell data, where these annotations describe the semantic meaning or relationships of the data within the table. These annotations, along with the original tabular cell data, are then supplied to an extract, transform, load (ETL) process. The ETL process transforms the annotated data into a format suitable for storage in a knowledge graph, which is then stored in memory. The knowledge graph instance allows for the representation of entities, relationships, and attributes derived from the tabular data, facilitating advanced data analysis and querying capabilities. The method ensures that the structured information from tables is accurately and efficiently integrated into a knowledge graph, enabling better data utilization and insights.
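The hand-off from annotations to a knowledge graph can be pictured as follows. This is a minimal sketch, assuming the knowledge graph is held in memory as a set of subject-predicate-object triples; the function name, the `hasType` predicate, and the example table are hypothetical and are not taken from the patent.

```python
def table_to_triples(rows, column_types, relations):
    """Turn annotated tabular data into knowledge graph triples.

    rows: list of rows, each a list of cell strings.
    column_types: one predicted semantic type name per column.
    relations: (subject_column, predicate, object_column) tuples predicted
               between annotated columns.
    """
    triples = set()
    for row in rows:
        # Each cell becomes an entity typed by its column annotation.
        for col, cell in enumerate(row):
            triples.add((cell, "hasType", column_types[col]))
        # Each predicted column relation links the cells of the same row.
        for subject_col, predicate, object_col in relations:
            triples.add((row[subject_col], predicate, row[object_col]))
    return triples

# Hypothetical annotated table.
rows = [["Pump A", "Plant 1"], ["Pump B", "Plant 2"]]
graph = table_to_triples(rows, ["Equipment", "Site"], [(0, "locatedAt", 1)])
```

A production ETL process would typically also deduplicate entities and map them to an existing schema; those steps are left out here.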
4. The computer-implemented method according to claim 1 wherein the classification layer calculates column type vectors, y, comprising for the cell data of each cell, C, of the respective supplied row, R, predicted semantic column type probabilities.
The invention relates to a computer-implemented method for classifying data in a tabular format, addressing the challenge of accurately determining the semantic types of columns in structured data. The method processes a supplied row of tabular data to predict the most likely semantic types for each column in the table. A classification layer generates column type vectors for each cell in the row, where each vector contains predicted probabilities for various semantic column types. These probabilities indicate the likelihood that the cell's data belongs to a specific semantic category, such as dates, names, or numerical values. The method leverages these probabilities to infer the overall semantic structure of the table, improving data interpretation and processing in applications like data analysis, machine learning, and database management. By analyzing individual cells and their context within a row, the system enhances accuracy in column type classification, particularly in datasets with mixed or ambiguous data formats. The approach supports automated data cleaning, schema inference, and integration tasks, reducing manual effort and improving consistency in data handling.
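One common way to obtain such probabilities is a linear layer followed by a softmax, sketched below. The claim does not name a particular function, so the softmax, the function names, and the example weights are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_row(cell_context_vectors, weights, bias):
    """Return one column type vector y per cell of the supplied row.

    cell_context_vectors: one d-dimensional vector per cell.
    weights: (k x d) matrix for k semantic types; bias: k values.
    """
    row_probs = []
    for context in cell_context_vectors:
        logits = [
            sum(w_i * c_i for w_i, c_i in zip(w, context)) + b
            for w, b in zip(weights, bias)
        ]
        row_probs.append(softmax(logits))
    return row_probs

# Hypothetical context vectors for a row with two cells, and two semantic types.
contexts = [[2.0, 0.0], [0.0, 1.0]]
probs = classify_row(contexts, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

Each entry of `probs` sums to one and can be read as the predicted semantic column type probabilities for that cell.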
5. The computer-implemented method according to claim 4 wherein a mean pooling of the column type vectors, y, of all rows, R, of the table, T, is performed to predict a semantic column type for each column, C, of the table, T.
The invention relates to a computer-implemented method for predicting semantic column types in tabular data. The method addresses the challenge of automatically determining the semantic meaning of columns in a table, such as distinguishing between dates, names, or numerical values, which matters for data integration, analysis, and machine learning applications. The method processes a table, T, containing multiple rows, R, and columns, C. For each supplied row, the classification layer outputs a column type vector, y, for every cell, holding the predicted semantic column type probabilities for that cell (claim 4). For each column, these vectors are then averaged over all rows of the table (mean pooling), and the averaged vector is used to predict the semantic type of the column. Because each per-row prediction already reflects the other cells in the same row, and pooling combines the evidence from every row, a single atypical or ambiguous cell has limited influence on the predicted column type. This is useful where column headers or other metadata are missing or unreliable.
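The pooling step can be sketched in a few lines. The layout of the input (a list of rows, each holding one probability vector per cell) and the function names are assumptions for illustration; the type names and numbers are made up.

```python
def mean_pool_column_types(row_outputs):
    """Average the column type vectors of every column over all rows.

    row_outputs[r][c] is the probability vector y for the cell in row r, column c.
    Returns one pooled probability vector per column.
    """
    n_rows = len(row_outputs)
    n_cols = len(row_outputs[0])
    n_types = len(row_outputs[0][0])
    return [
        [sum(row_outputs[r][c][t] for r in range(n_rows)) / n_rows
         for t in range(n_types)]
        for c in range(n_cols)
    ]

def predict_column_types(row_outputs, type_names):
    """Pick the most probable semantic type for each column after pooling."""
    pooled = mean_pool_column_types(row_outputs)
    return [type_names[max(range(len(p)), key=p.__getitem__)] for p in pooled]

# Three rows, two columns, two hypothetical semantic types.
row_outputs = [
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.4, 0.6], [0.1, 0.9]],  # this row alone would mislabel the first column
    [[0.9, 0.1], [0.3, 0.7]],
]
pooled = mean_pool_column_types(row_outputs)
column_types = predict_column_types(row_outputs, ["date", "name"])
```

In this example the second row on its own favors `name` for the first column, but the pooled vector still favors `date`, which is the effect the mean pooling is meant to provide.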
6. The computer-implemented method according to claim 1 wherein the self-attention layer of the semantic type annotation neural network comprises a stack of transformers to calculate attentions among the cells, C, of the respective row, R, of the table, T.
The invention relates to a computer-implemented method for processing tabular data using a neural network, specifically focusing on semantic type annotation. The method addresses the challenge of accurately interpreting and categorizing data within tables, where relationships between cells are often context-dependent. The core innovation involves a self-attention layer within a semantic type annotation neural network, which employs a stack of transformers to compute attention scores among cells in a given row of a table. This allows the network to dynamically weigh the relevance of each cell when determining the semantic type of another cell in the same row. The transformers in the stack process the cells iteratively, capturing complex dependencies and interactions between them. By leveraging self-attention mechanisms, the method improves the accuracy of semantic type predictions, particularly in structured data where cell values may have contextual meanings that are not immediately apparent. The approach enhances the ability of machine learning models to understand and classify tabular data, making it more interpretable and useful for downstream applications.
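The attention calculation among the cells of one row can be illustrated with a single scaled dot-product attention step. This is a deliberately reduced sketch: a full transformer stack also has learned query, key, and value projections, multiple heads, feed-forward sublayers, residual connections, and normalization, none of which are shown. The function names and example vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(cells):
    """Single-head scaled dot-product self-attention over the cells of one row.

    cells: one d-dimensional embedding vector per cell. The embeddings are used
    directly as queries, keys, and values (no learned projections).
    Returns (cell_context_vectors, attention_weights).
    """
    dim = len(cells[0])
    scale = math.sqrt(dim)
    contexts, weights = [], []
    for query in cells:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in cells]
        attn = softmax(scores)
        # Each context vector is the attention-weighted sum of all cell vectors.
        context = [sum(a * value[j] for a, value in zip(attn, cells))
                   for j in range(dim)]
        contexts.append(context)
        weights.append(attn)
    return contexts, weights

# Hypothetical embeddings for a row with three cells.
row_cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contexts, weights = self_attention(row_cells)
```

`weights[i][j]` expresses how strongly cell `i` attends to cell `j` of the same row, and `contexts[i]` is the resulting cell context vector.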
7. The computer-implemented method according to claim 1 wherein the semantic type annotation neural network is trained in a supervised learning process using labeled rows, R, as samples.
This claim concerns how the semantic type annotation neural network is trained. Training is supervised and uses labeled rows as samples: each training sample is one row of a table together with the correct semantic type labels for its cells. The labeled rows are fed into the neural network, its predictions are compared with the labels, and its parameters are adjusted so that the predictions move toward the labels. The aim is a network that generalizes from the labeled rows to annotate new, unseen tables with the appropriate semantic types, which supports tasks such as data integration and knowledge graph construction.
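A supervised objective for one labeled row could look like the sketch below. The claim does not specify a loss function; cross-entropy is assumed here because it is a standard choice for classification, and the function name and numbers are hypothetical.

```python
import math

def row_cross_entropy(row_probs, row_labels):
    """Mean cross-entropy loss for one labeled row.

    row_probs: one predicted probability vector per cell of the row.
    row_labels: the index of the correct semantic type for each cell.
    """
    losses = [
        -math.log(max(probs[label], 1e-12))  # clamp to avoid log(0)
        for probs, label in zip(row_probs, row_labels)
    ]
    return sum(losses) / len(losses)

# Hypothetical predictions for a row with two cells; both cells have true type 0.
loss = row_cross_entropy([[0.5, 0.5], [0.9, 0.1]], [0, 0])
```

During training, a loss of this kind would be minimized over many labeled rows, for example by gradient descent on the network parameters.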
8. A computer program product comprising a computer readable storage device having computer readable program code stored therein, the program code executable by a processor of a computer system to perform the computer-implemented method according to claim 1 to generate automatically annotations for tabular cell data received from a data source.
This claim covers the method of claim 1 in the form of a computer program product. The problem addressed is the manual effort required to label or annotate data within tables, which is time-consuming and prone to human error. The program code is stored on a computer readable storage device and, when executed by a processor, performs the method of claim 1 on tabular cell data received from a data source: the raw cell data of each row is transformed into cell embedding vectors, attentions among the cells of the row are calculated to produce cell context vectors, and a classification layer predicts semantic column type annotations and/or relations between them. The automated approach reduces manual labor and makes the resulting annotations more uniform.
10. The apparatus according to claim 9 wherein a bidirectional recurrent neural network, RNN, trained as an encoder of an autoencoder on cell embeddings provided by a byte-pair encoding model, BPE, is implemented as an encoder of the embedding layer of the semantic type annotation neural network of the apparatus.
This claim is the apparatus counterpart of claim 2 and concerns the embedding layer of the semantic type annotation neural network for tabular data. The raw content of each cell has to be converted into a cell embedding vector before the rest of the network can process it. A byte-pair encoding (BPE) model splits the cell content into subword units and provides an embedding for each unit. A bidirectional recurrent neural network (RNN) is trained as the encoder of an autoencoder, that is, a neural network that learns to compress and reconstruct its input, on these embeddings. The trained RNN is then implemented as the encoder of the embedding layer of the apparatus, where it produces the cell embedding vectors. The approach combines autoencoders for representation learning with subword tokenization, so that cell values which are not dictionary words can still be represented.
11. The apparatus according to claim 9 wherein the generated annotations and the tabular cell data of the table, T, are supplied to an extract, transform, load (ETL) process used to generate a knowledge graph instance of the knowledge base.
The invention relates to a system for processing tabular data to generate a knowledge graph. The problem addressed is the difficulty of extracting structured information from tables and integrating it into a knowledge base in a machine-readable format. The apparatus includes a component that generates annotations for tabular cell data, where these annotations describe relationships between the cells. These annotations and the tabular data are then supplied to an extract, transform, load (ETL) process. The ETL process transforms the annotated data into a knowledge graph instance, which is stored in a knowledge base. The knowledge graph represents the relationships and entities from the table in a structured, queryable format. This allows for efficient integration of tabular data into a broader knowledge base, enabling advanced querying and analysis. The system automates the conversion of tabular data into a knowledge graph, reducing manual effort and improving data consistency. The annotations ensure that the relationships and context of the tabular data are preserved in the knowledge graph.
12. The apparatus according to claim 9 wherein the classification layer of the semantic type annotation neural network is adapted to calculate column type vectors, y, comprising for the cell data of each cell, C, of the respective supplied row, R, predicted semantic column type probabilities.
This invention relates to a neural network-based apparatus for semantic type annotation in tabular data, addressing the challenge of automatically identifying and classifying the semantic types of columns in structured datasets. The apparatus includes a semantic type annotation neural network that processes input data to predict the most likely semantic types for each column in a table. The network generates column type vectors for each cell in a row, where each vector contains predicted semantic type probabilities for the corresponding cell data. These probabilities indicate the likelihood that the cell belongs to a particular semantic type, such as dates, names, or numerical values. The apparatus is designed to enhance data processing and analysis by accurately classifying column types, which is crucial for tasks like data integration, validation, and querying. The neural network's classification layer computes these probabilities based on the input data, enabling automated and scalable semantic type annotation in tabular datasets. This solution improves upon traditional rule-based or statistical methods by leveraging machine learning to handle diverse and complex data structures. The apparatus is particularly useful in applications requiring high accuracy in semantic type detection, such as database management, data mining, and artificial intelligence-driven analytics.
13. The apparatus according to claim 9, wherein a mean pooling of column type vectors, y, of all rows, R, of the table, T, is performed to predict the semantic type annotation of each column, C, of the table, T.
This claim is the apparatus counterpart of claim 5 and relates to predicting semantic type annotations in tabular data. The problem addressed is the automatic classification of columns in a table into semantic categories, such as dates, names, or numerical values, to improve data understanding and processing. The apparatus processes a table, T, containing multiple rows, R, and columns, C. For each row, the classification layer produces column type vectors, y, which hold the predicted semantic column type probabilities for the cells of that row. The apparatus performs mean pooling on these vectors across all rows of the table, giving one averaged vector per column, and uses it to predict the semantic type annotation of that column. Averaging over all rows limits the influence of individual atypical or ambiguous cells. The predicted annotations help in tasks like data cleaning, integration, and analysis by providing structured metadata about the table's content.
14. The apparatus according to claim 9 wherein the self-attention layer of the semantic type annotation neural network comprises a stack of transformers adapted to calculate attentions among the cells, C, of the respective row, R, of the table, T.
The apparatus relates to a semantic type annotation system for tabular data, addressing the challenge of accurately identifying and labeling the semantic types of cells within a table. The system employs a neural network with a self-attention layer designed to process and analyze the relationships between cells in a table row. The self-attention layer consists of a stack of transformer modules, which compute attention scores to capture dependencies and contextual information among the cells in each row. This approach enhances the model's ability to understand the semantic meaning of each cell by considering its interactions with other cells in the same row. The transformer stack allows the model to dynamically weigh the importance of different cells, improving the accuracy of type annotations. The system is particularly useful in applications requiring automated data interpretation, such as database management, data cleaning, and natural language processing tasks involving tabular data. By leveraging self-attention mechanisms, the apparatus provides a robust solution for semantic type annotation, ensuring that the relationships between cells are properly accounted for in the annotation process.
15. The apparatus according to claim 9 wherein the semantic type annotation neural network of the apparatus is trained in a supervised learning process using labeled rows, R, as samples.
This claim is the apparatus counterpart of claim 7. The apparatus performs semantic type annotation of tabular data, and its semantic type annotation neural network is trained in a supervised learning process in which labeled rows, R, serve as training samples. Each labeled row pairs the cell data of one row with the corresponding semantic type annotations. During training, the neural network learns to map the cell data of a row to the correct semantic types by comparing its predictions with the labels, which improves its ability to generalize and annotate new, unseen tables. The annotated tables can then be used in applications such as database management and knowledge graph construction.
June 17, 2021
May 7, 2024