Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. A computer-implemented method comprising: compressing a first data stream at least in part by: deriving both a first set of literal tokens and a master set of references from the first data stream, wherein each reference of the master set of references refers to a literal token of the first set of literal tokens, and storing a compressed version of the first data stream, wherein the compressed version of the first data stream comprises the first set of literal tokens and the master set of references; compressing a second data stream at least in part by: deriving a second set of literal tokens from the second data stream, wherein each literal token of the second set of literal tokens uniquely corresponds to a corresponding literal token of the first set of literal tokens, storing a compressed version of the second data stream, wherein the compressed version of the second data stream comprises the second set of literal tokens, and storing metadata, for the compressed version of the second data stream, that refers to at least a portion of the master set of references; wherein the first data stream is distinct from the second data stream; wherein at least a portion of the first data stream encodes the same substantive content as at least a portion of the second data stream; and wherein the method is performed by one or more computing devices.
The invention relates to a computer-implemented method for compressing data streams where multiple streams contain overlapping substantive content. Compressing such streams independently can store redundant information. The method instead derives literal tokens and a master set of references from a first data stream, then reuses those references when compressing a second, distinct data stream that encodes at least some of the same content. For the first stream, each reference in the master set refers to a literal token of the first set, and the compressed version of the first stream contains both the literal tokens and the master set of references. For the second stream, a second set of literal tokens is derived in which each token uniquely corresponds to a token in the first set. The compressed version of the second stream contains this second set of literal tokens, and metadata stored for it refers to at least a portion of the master set of references. Reusing the first stream's references in this way reduces redundancy across similar data streams. The method is performed by one or more computing devices.
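The flow described in claim 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: tokens are whitespace-separated words, a reference is a hypothetical (stream position, literal index) pair, and the names compress_first, compress_second, decompress, and the metadata layout are all invented for this example.

```python
def compress_first(tokens):
    """Split the first stream into literal tokens and a master set of references."""
    literals = []      # first set of literal tokens, in order of first appearance
    references = []    # master set: (stream position, index of the literal referred to)
    seen = {}          # token content -> index in `literals`
    for pos, tok in enumerate(tokens):
        if tok in seen:
            references.append((pos, seen[tok]))
        else:
            seen[tok] = len(literals)
            literals.append(tok)
    return {"literals": literals, "references": references, "length": len(tokens)}


def compress_second(tokens, first):
    """Store literals for the second stream plus metadata naming the master references."""
    if len(tokens) != first["length"]:
        raise ValueError("streams do not line up token for token")
    ref_at = dict(first["references"])
    literals = []
    for pos, tok in enumerate(tokens):
        if pos in ref_at:
            # The master reference must reproduce this token from the second literals.
            if literals[ref_at[pos]] != tok:
                raise ValueError("master reference does not fit the second stream")
        else:
            # The i-th literal stored here corresponds to the i-th literal of stream one.
            literals.append(tok)
    # Hypothetical metadata layout: which slice of the master references applies.
    metadata = {"master": "first", "reference_range": (0, len(first["references"]))}
    return {"literals": literals, "metadata": metadata}


def decompress(literals, references, length):
    """Rebuild a token sequence from literals and (position, literal index) references."""
    ref_at = dict(references)
    out, next_literal = [], 0
    for pos in range(length):
        if pos in ref_at:
            out.append(literals[ref_at[pos]])
        else:
            out.append(literals[next_literal])
            next_literal += 1
    return out


first_tokens = "alice likes alice and bob".split()
second_tokens = "ALICE LIKES ALICE AND BOB".split()  # same content, different encoding
first = compress_first(first_tokens)
second = compress_second(second_tokens, first)
start, stop = second["metadata"]["reference_range"]
restored = decompress(second["literals"], first["references"][start:stop], first["length"])
```

In this sketch the second stream's compressed form holds no references of its own; it is restored by combining its literals with the references stored once for the first stream.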
2. The method of claim 1, wherein compressing the first data stream and compressing the second data stream is performed concurrently.
This dependent claim narrows the method of claim 1 by requiring that the first data stream and the second data stream be compressed concurrently rather than one after the other. The compression itself is unchanged from claim 1: the first stream yields a first set of literal tokens and a master set of references, and the second stream yields a second set of literal tokens plus metadata that refers to at least a portion of the master set. The claim does not specify how the concurrency is achieved, and it does not limit the method to any particular application.
3. The method of claim 1, further comprising: identifying a particular token from the second data stream that corresponds to a corresponding token from the first data stream; identifying, for the second data stream, first one or more references that refer to the particular token; determining that the master set of references includes corresponding second one or more references that refer to the corresponding token from the first data stream; wherein the first one or more references have the same respective values as the corresponding second one or more references; in response to determining that the master set of references includes the corresponding second one or more references that refer to the corresponding token from the first data stream, omitting, from an output buffer for the second data stream, the first one or more references.
This dependent claim adds a step to the method of claim 1 that keeps duplicate references out of the second data stream's output. The method identifies a particular token in the second data stream that corresponds to a token in the first data stream, and identifies the references, for the second data stream, that refer to that particular token. It then determines that the master set of references, derived from the first data stream, includes corresponding references that refer to the corresponding first-stream token and that have the same respective values. In response, those references are omitted from the output buffer for the second data stream. Because the master set already holds references with the same values, omitting them reduces the size of the second stream's compressed output without losing information.
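The omission step in claim 3 can be pictured as a filter over the second stream's references. The sketch below is illustrative only and assumes references are hypothetical (stream position, literal index) tuples; the function name write_second_stream_references is invented for this example.

```python
def write_second_stream_references(second_refs, master_refs):
    """Return the second stream's output buffer, leaving out references
    whose values already appear in the master set."""
    master = set(master_refs)
    output_buffer = []
    omitted = 0
    for ref in second_refs:
        if ref in master:
            omitted += 1               # same value exists in the master set: omit it
        else:
            output_buffer.append(ref)  # no counterpart in the master set: keep it
    return output_buffer, omitted


master_refs = [(2, 0), (5, 1)]
second_refs = [(2, 0), (5, 1), (7, 3)]
kept, omitted = write_second_stream_references(second_refs, master_refs)
```

Here two of the three references match the master set and are dropped, so only one reference reaches the second stream's output buffer.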
4. The method of claim 1, wherein each reference of the master set of references refers to a position of a literal token of the first set of literal tokens.
This dependent claim narrows the method of claim 1 by specifying what the references point to: each reference in the master set of references refers to a position of a literal token of the first set of literal tokens. In other words, a reference identifies a literal token by where it is located. The claim does not specify how that position is encoded.
5. The method of claim 1, wherein the first data stream and the second data stream are related data streams.
This dependent claim narrows the method of claim 1 by requiring that the first data stream and the second data stream be related data streams. This is consistent with claim 1, which already requires that at least a portion of the first stream encode the same substantive content as at least a portion of the second stream. The claim itself does not define what makes two streams related.
6. The method of claim 1, wherein: deriving the first set of literal tokens comprises tokenizing the first data stream; deriving the second set of literal tokens comprises tokenizing the second data stream; tokenizing the first data stream and tokenizing the second data stream results in similar data, from the first data stream and the second data stream, being tokenized similarly.
This dependent claim narrows the method of claim 1 by specifying how the literal tokens are derived. The first set of literal tokens is derived by tokenizing the first data stream, and the second set is derived by tokenizing the second data stream. The tokenization is such that similar data in the two streams is tokenized similarly. This consistency supports the token-to-token correspondence that claim 1 requires between the second set of literal tokens and the first, and it avoids discrepancies that could arise if different tokenization rules were applied to similar data in different streams.
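One simple way to get similar data tokenized similarly is to run both streams through the same tokenizer. The sketch below assumes a hypothetical rule, chosen only for illustration, that splits runs of letters and digits from individual delimiter characters; the claim does not prescribe any particular rule.

```python
import re

# Hypothetical rule: a token is a run of letters/digits, or one other character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]")


def tokenize(stream):
    """Apply the same boundary rule to any stream."""
    return TOKEN_PATTERN.findall(stream)


first = tokenize("id=17;name=ann")
second = tokenize("ID=17;NAME=ANN")  # same content in a different encoding

# Token boundaries fall in the same places, so tokens pair up one to one.
aligned = len(first) == len(second) and all(
    a.lower() == b.lower() for a, b in zip(first, second)
)
```

Because the boundaries coincide, the i-th token of one stream corresponds to the i-th token of the other, which is the kind of correspondence claim 1 relies on.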
7. The method of claim 6, wherein compressing the first data stream further comprises: searching, within a first history buffer for the first data stream, for one or more tokens from the first data stream; wherein the first history buffer stores history data that is at least partially tokenized; identifying a particular reference, of the master set of references, based on finding particular content of the one or more tokens in the first history buffer represented as one or more whole tokens within the history data stored in the first history buffer.
This dependent claim builds on claim 6 and describes how a reference is found while compressing the first data stream. The method searches a first history buffer, maintained for the first data stream, for one or more tokens from that stream. The history buffer stores history data that is at least partially tokenized. A particular reference of the master set of references is identified when the content of the searched tokens is found in the history buffer and that content is represented there as one or more whole tokens. Requiring the match to fall on whole tokens, rather than on arbitrary fragments, ties each reference to token boundaries in the history data. The technique is useful where data streams exhibit repetitive patterns.
8. The method of claim 1, wherein storing the compressed version of the first data stream further comprises: storing a first token of the first set of literal tokens as a plurality of sub-tokens; wherein a particular sub-token of the plurality of sub-tokens is represented by a reference to at least a portion of a second literal token of the first set of literal tokens.
This dependent claim narrows the method of claim 1 by describing how a literal token of the first stream can be stored. A first token of the first set of literal tokens is stored as a plurality of sub-tokens, and a particular one of those sub-tokens is represented by a reference to at least a portion of a second literal token of the same set. Content that a token shares with another literal token therefore does not need to be stored twice, which reduces redundancy within the stored literal tokens. The approach is particularly useful where data streams contain repetitive patterns or shared sub-sequences, such as in text, code, or structured data, and where minimizing storage footprint is critical.
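The sub-token idea in claim 8 can be sketched as follows. This is an illustration under assumptions, not the claimed implementation: the shared portion is found with Python's difflib.SequenceMatcher, a reference is a hypothetical ("ref", literal index, start, length) tuple, and the min_len threshold and function names are invented for this example.

```python
from difflib import SequenceMatcher


def store_as_subtokens(token, literals, min_len=3):
    """Split `token` into sub-tokens, replacing its longest overlap with an
    already stored literal by a reference to that portion of the literal."""
    best = None  # (size, start in token, literal index, start in literal)
    for i, lit in enumerate(literals):
        matcher = SequenceMatcher(None, token, lit, autojunk=False)
        m = matcher.find_longest_match(0, len(token), 0, len(lit))
        if m.size >= min_len and (best is None or m.size > best[0]):
            best = (m.size, m.a, i, m.b)
    if best is None:
        return [("lit", token)]  # nothing worth referencing
    size, a, i, b = best
    parts = []
    if a > 0:
        parts.append(("lit", token[:a]))
    parts.append(("ref", i, b, size))  # portion of literal i, starting at offset b
    if a + size < len(token):
        parts.append(("lit", token[a + size:]))
    return parts


def rebuild(parts, literals):
    """Reassemble a token from its sub-tokens."""
    pieces = []
    for part in parts:
        if part[0] == "lit":
            pieces.append(part[1])
        else:
            _, i, b, size = part
            pieces.append(literals[i][b:b + size])
    return "".join(pieces)


literals = ["compress"]
parts = store_as_subtokens("decompression", literals)
restored = rebuild(parts, literals)
```

In this example "decompression" is stored as three sub-tokens: the literal "de", a reference to all eight characters of the stored literal "compress", and the literal "ion".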
9. The method of claim 1, wherein deriving the master set of references from the first data stream comprises: determining that particular content of a particular token, from the first data stream, occurs within a history buffer maintained for the first data stream; wherein the particular content of the particular token includes all content of the particular token; determining whether the particular content, within the history buffer maintained for the first data stream, is represented as one or more whole tokens; in response to determining that the particular content is represented as one or more whole tokens within the history buffer, outputting a particular reference to represent the particular token in an output buffer for the first data stream; wherein the particular reference refers to the one or more whole tokens.
This dependent claim narrows the method of claim 1 by describing how the master set of references is derived from the first data stream. The method determines that the particular content of a particular token from the first stream, meaning all content of that token, occurs within a history buffer maintained for the first stream. It then determines whether that content is represented within the history buffer as one or more whole tokens. If it is, a particular reference that refers to those whole tokens is output to the output buffer for the first data stream to represent the particular token. The token is thereby replaced by a reference to its earlier occurrence, which reduces the amount of data stored for repeated content.
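The whole-token test in claim 9 can be sketched as a search over a tokenized history buffer. The code below is a minimal illustration that assumes the history buffer is a list of non-empty string tokens and that a reference is a hypothetical ("ref", first token index, token count) tuple; none of these names or layouts come from the patent.

```python
def find_whole_token_match(history_tokens, content):
    """Return (first token index, token count) if `content` occurs in the
    history exactly as one or more whole tokens, else None."""
    joined = "".join(history_tokens)
    starts, ends = {}, {}  # character offset -> token index
    offset = 0
    for i, tok in enumerate(history_tokens):
        starts[offset] = i
        offset += len(tok)
        ends[offset] = i
    pos = joined.find(content)
    while pos != -1:
        end = pos + len(content)
        # A usable match must begin and end on token boundaries.
        if pos in starts and end in ends:
            first = starts[pos]
            return first, ends[end] - first + 1
        pos = joined.find(content, pos + 1)
    return None


def encode(tokens):
    """Emit a reference when a token's full content is found as whole
    history tokens; otherwise emit the token as a literal."""
    history, output_buffer = [], []
    for tok in tokens:
        match = find_whole_token_match(history, tok)
        if match is not None:
            output_buffer.append(("ref", match[0], match[1]))
        else:
            output_buffer.append(("lit", tok))
        history.append(tok)
    return output_buffer


encoded = encode(["ab", "c", "abc", "b", "ab"])
```

In this run "abc" becomes a reference to the two whole tokens "ab" and "c", while "b" stays a literal: its content does occur in the history, but only inside other tokens rather than as a whole token.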
10. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause: compressing a first data stream at least in part by: deriving both a first set of literal tokens and a master set of references from the first data stream, wherein each reference of the master set of references refers to a literal token of the first set of literal tokens, and storing a compressed version of the first data stream, wherein the compressed version of the first data stream comprises the first set of literal tokens and the master set of references; compressing a second data stream at least in part by: deriving a second set of literal tokens from the second data stream, wherein each literal token of the second set of literal tokens uniquely corresponds to a corresponding literal token of the first set of literal tokens, and storing a compressed version of the second data stream, wherein the compressed version of the second data stream comprises the second set of literal tokens, and storing metadata, for the compressed version of the second data stream, that refers to at least a portion of the master set of references wherein the first data stream is distinct from the second data stream; and wherein at least a portion of the first data stream encodes the same substantive content as at least a portion of the second data stream.
The invention relates to data compression techniques for handling multiple data streams that share substantive content. The problem addressed is efficient compression of related data streams where some content is redundant across streams. The solution involves compressing a first data stream by deriving literal tokens and a master set of references, where each reference points to a literal token. The compressed version of the first stream includes these literal tokens and references. A second, distinct data stream is compressed by deriving literal tokens that correspond uniquely to those in the first stream. The compressed version of the second stream includes these literal tokens, and metadata is stored to reference at least part of the master set of references from the first stream. This approach leverages shared content between streams to reduce redundancy, improving compression efficiency. The method ensures that while the data streams are distinct, portions of them encode the same substantive content, allowing references from the first stream to be reused in the second stream's compression. This technique is particularly useful in systems where multiple data streams contain overlapping or similar data, such as in versioned datasets or related documents.
11. The one or more non-transitory computer-readable media of claim 10, wherein compressing the first data stream and compressing the second data stream is performed concurrently.
This dependent claim narrows the computer-readable media of claim 10 by requiring that the stored instructions cause the first data stream and the second data stream to be compressed concurrently rather than one after the other. The compression itself is unchanged from claim 10: the first stream yields a first set of literal tokens and a master set of references, and the second stream yields a second set of literal tokens plus metadata that refers to at least a portion of the master set. The claim does not specify how the concurrency is achieved, and it does not limit the instructions to any particular application.
12. The one or more non-transitory computer-readable media of claim 10, wherein the instructions further comprise instructions that, when executed by one or more processors, cause: identifying a particular token from the second data stream that corresponds to a corresponding token from the first data stream; identifying, for the second data stream, first one or more references that refer to the particular token; determining that the master set of references includes corresponding second one or more references that refer to the corresponding token from the first data stream; wherein the first one or more references have the same respective values as the corresponding second one or more references; in response to determining that the master set of references includes the corresponding second one or more references that refer to the corresponding token from the first data stream, omitting, from an output buffer for the second data stream, the first one or more references.
This invention relates to data stream processing, specifically optimizing reference handling in data streams to reduce redundancy. The problem addressed is the inefficiency of storing or transmitting duplicate references in data streams, which can increase storage requirements and bandwidth usage. The solution involves comparing tokens and their associated references between two data streams to eliminate redundant references. The system processes a first data stream and a second data stream, each containing tokens and references that point to those tokens. A master set of references is maintained, which includes references from the first data stream. When processing the second data stream, the system identifies a token in the second data stream that matches a corresponding token in the first data stream. For this token, the system identifies references in the second data stream and checks if the master set already contains corresponding references from the first data stream. If the references have the same values, the system omits the redundant references from the output buffer for the second data stream, thereby reducing redundancy. This approach ensures that only unique references are stored or transmitted, improving efficiency.
13. The one or more non-transitory computer-readable media of claim 10, wherein each reference of the master set of references refers to a position of a literal token of the first set of literal tokens.
This dependent claim narrows the computer-readable media of claim 10 by specifying what the references point to: each reference in the master set of references refers to a position of a literal token of the first set of literal tokens. A reference therefore identifies a literal token by where it is located. The claim does not specify how that position is encoded.
14. The one or more non-transitory computer-readable media of claim 10, wherein the first data stream and the second data stream are related data streams.
This dependent claim narrows the computer-readable media of claim 10 by requiring that the first data stream and the second data stream be related data streams. This is consistent with claim 10, which already requires that at least a portion of the first stream encode the same substantive content as at least a portion of the second stream. The claim itself does not define what makes two streams related.
15. The one or more non-transitory computer-readable media of claim 10, wherein: deriving the first set of literal tokens comprises tokenizing the first data stream; deriving the second set of literal tokens comprises tokenizing the second data stream; tokenizing the first data stream and tokenizing the second data stream results in similar data, from the first data stream and the second data stream, being tokenized similarly.
This dependent claim narrows the computer-readable media of claim 10 by specifying how the literal tokens are derived. The first set of literal tokens is derived by tokenizing the first data stream, and the second set is derived by tokenizing the second data stream. Literal tokens are discrete units of data that can be processed individually. The tokenization is such that similar data in the two streams is tokenized similarly, so similar segments are represented by similar tokens. This consistency supports the token-to-token correspondence that claim 10 requires between the second set of literal tokens and the first.
16. The one or more non-transitory computer-readable media of claim 15, wherein compressing the first data stream further comprises: searching, within a first history buffer for the first data stream, for one or more tokens from the first data stream; wherein the first history buffer stores history data that is at least partially tokenized; identifying a particular reference, of the master set of references, based on finding particular content of the one or more tokens in the first history buffer represented as one or more whole tokens within the history data stored in the first history buffer.
This invention relates to data compression techniques, specifically improving compression efficiency by leveraging tokenized history data. The problem addressed is the inefficiency of traditional compression methods that fail to fully utilize previously processed data for better compression ratios. The solution involves a system that compresses a first data stream by searching a first history buffer, which stores at least partially tokenized history data, for matching tokens from the current data stream. The system identifies a particular reference from a master set of references when the content of the tokens in the current data stream is found in the history buffer as whole tokens. This allows for more accurate and efficient compression by reusing previously tokenized data patterns. The method enhances compression performance by reducing redundancy and improving the accuracy of reference matching. The history buffer's tokenized structure enables faster searches and more precise matches, leading to better compression outcomes. This approach is particularly useful in systems where data streams contain repetitive patterns or where historical data can be effectively reused for compression. The invention optimizes storage and transmission efficiency by leveraging tokenized history data to improve compression ratios.
17. The one or more non-transitory computer-readable media of claim 10, wherein storing the compressed version of the first data stream further comprises: storing a first token of the first set of literal tokens as a plurality of sub-tokens; wherein a particular sub-token of the plurality of sub-tokens is represented by a reference to at least a portion of a second literal token of the first set of literal tokens.
The invention relates to data compression techniques, specifically improving compression efficiency by tokenizing and referencing literal tokens within a data stream. The problem addressed is the inefficiency of traditional compression methods when dealing with repetitive or similar literal tokens in a data stream, leading to suboptimal storage and transmission performance. The invention involves storing a compressed version of a data stream by breaking down literal tokens into sub-tokens, where at least one sub-token is represented by a reference to another literal token in the same set. This approach leverages redundancy within the data stream itself, reducing storage overhead by avoiding redundant storage of similar or identical sub-tokens. The method includes identifying literal tokens in the data stream, decomposing them into sub-tokens, and replacing certain sub-tokens with references to other literal tokens already stored. This technique enhances compression efficiency by minimizing repetitive storage of similar data segments while maintaining data integrity. The invention is particularly useful in systems where data streams contain repetitive or similar literal tokens, such as text processing, log file compression, or database storage, where traditional compression methods may not fully exploit internal redundancies. By dynamically referencing existing literal tokens, the method reduces the overall size of the compressed data stream without sacrificing decompressibility.
18. The one or more non-transitory computer-readable media of claim 10, wherein deriving the master set of references from the first data stream comprises: determining that particular content of a particular token, from the first data stream, occurs within a history buffer maintained for the first data stream; wherein the particular content of the particular token includes all content of the particular token; determining whether the particular content, within the history buffer maintained for the first data stream, is represented as one or more whole tokens; in response to determining that the particular content is represented as one or more whole tokens within the history buffer, outputting a particular reference to represent the particular token in an output buffer for the first data stream; wherein the particular reference refers to the one or more whole tokens.
This dependent claim narrows the computer-readable media of claim 10 by describing how the master set of references is derived from the first data stream. The instructions cause a determination that the particular content of a particular token from the first stream, meaning all content of that token, occurs within a history buffer maintained for the first stream. They then cause a determination of whether that content is represented within the history buffer as one or more whole tokens. If it is, a particular reference that refers to those whole tokens is output to the output buffer for the first data stream to represent the particular token. Repeated content is thereby replaced with a reference to data already held in the history buffer, which reduces redundancy in the compressed output.
June 9, 2020