A serialization data format for the transfer, analysis, and modification of schemaless mass data is proposed. The data format supports stream oriented, pipeline-based processing, and it enables read access of contained data without deserialization, and modification access that only requires the deserialization of portions of structure and meta data. Incoming, semi-structured data records may be transformed into records of the proposed serialization data format, compressed and stored in processing buffers containing multiple of those records. During processing, only individual records are decompressed, and processed records are then compressed and stored in output processing buffers for efficient memory usage. Manipulations of data records are performed by appending new values to data records, invalidating old ones and updating access data structures to refer to new values instead of old ones, to enable various modification activities by only requiring append or not-size-changing operations of serialized data records.
Legal claims defining the scope of protection, as filed with the USPTO.
. A computer-implemented method for ingesting data records, comprising:
. The method ofwherein formatting the serialized record container includes updating a validity indicator indicating whether data in the serialized record container is valid, and updating a type indicator field indicating the type of data stored in the serialized record container.
. The method ofwherein the serialized record container includes a size field indicating size of the serialized record container and further comprises determining size of the serialized record container after the step of appending the record index and updating the size field with the determined size of the serialized record container.
. The method ofwherein formatting the serialized record container includes allocating a second memory space for the record index, where the second memory space differs from the first memory space.
. The method ofwherein each of the serialized data records in the plurality of serialized data records includes a validity indicator indicating whether data in a particular serialized data record is valid, a type indicator field indicating the type of data stored in the particular serialized data record, a size field indicating size of the particular serialized data record, and payload area.
. The method ofwherein each of the serialized data records in the plurality of serialized data records includes an encoding field, where the encoding field indicates how data in size field is encoded in the particular serialized data record.
. The method offurther comprises compressing the serialized record container prior to storing the serialized record container
. A computer-implemented method for manipulating a serialized data record, comprising:
. The method ofwherein comparing size of the replacement data record to size of the serialized data record includes determining size of the replacement data record from a size field in the replacement data record and determining size of the serialized data record from a size field in the serialized data record.
. The method ofwherein determining size of the serialized data record further includes reading an encoding field in the serialized data record and determining size of the serialized data record in part based on the encoding field, where the encoding field indicates how data in size field is encoded in the serialized data record.
. The method offurther comprises
. The method offurther comprises
. The method offurther comprises
. The method offurther comprises
. The method offurther comprises
. The method offurther comprises compressing the serialized record container prior to storing the serialized record container.
. The method ofwherein the serialized record container includes a validity indicator indicating whether data in the serialized data container is valid, a type indicator field indicating the type of data stored in the serialized data container, and a size field indicating size of the serialized record container.
. The method ofwherein the at least one serialized data record includes a validity indicator indicating whether data in a particular serialized data record is valid, a type indicator field indicating the type of data stored in the particular serialized data record, a size field indicating size of the particular serialized data record, and payload area.
. The method ofwherein the at least one serialized data record is defined as a key-value tuple and includes a key that uniquely identifies the at least one serialized data record in the serialized record container.
. A computer-implemented method for serializing data records in a computer system, comprising:
. The method ofwherein formatting the serialized record container includes serializing the input data record to form a serialized data record, appending the serialized data record to an end of the serialized record container, updating a validity indicator indicating whether data in the serialized data container is valid, and updating a type indicator field indicating the type of data stored in the serialized data container.
. The method ofwherein the serialized data record includes a validity indicator indicating whether data in a particular serialized data record is valid, a type indicator field indicating the type of data stored in the particular serialized data record, a size field indicating size of the particular serialized data record, and payload area.
. The method ofwherein the serialized data record further includes an encoding field, where the encoding field indicates how data in size field is encoded in the particular serialized data record.
. The method ofwherein the serialized data record is defined as a key-value tuple and includes a key that uniquely identifies the serialized data record in the serialized record container.
. The method offurther comprises compressing the serialized data record prior to storing the serialized data record.
. A computer-implemented method for ingesting data records, comprising:
. The method ofwherein appending the hash structure further comprises compressing the hash structure by only storing updated slots of the hash structure at the end of the serialized tuple container.
. The method ofwherein formatting the serialized tuple container includes updating a validity indicator indicating whether data in the serialized tuple container is valid, and updating a type indicator field indicating the type of data stored in the serialized tuple container.
. The method ofwherein the serialized tuple container includes a size field indicating size of the serialized tuple container and further comprises determining size of the serialized tuple container after the step of appending the hash structure and updating the size field with the determined size of the serialized tuple container.
. The method ofwherein formatting the serialized tuple container includes allocating a second memory space for the hash structure, where the second memory space differs from the first memory space.
. The method ofwherein each of the serialized data records in the plurality of serialized data records includes a validity indicator indicating whether data in a particular serialized data record is valid, a type indicator field indicating the type of data stored in the particular serialized data record, a size field indicating size of the particular serialized data record, and payload area.
. The method ofwherein each of the serialized data records in the plurality of serialized data records includes an encoding field, where the encoding field indicates how data in size field is encoded in the particular serialized data record.
Complete technical specification and implementation details from the patent document.
This application claims the benefit and priority of U.S. Provisional Application No. 63/659,432 filed on Jun. 13, 2024. The entire disclosure of the above application is incorporated herein by reference.
The present disclosure relates to the technical field of data processing and transmission. In particular, the disclosure relates to a unified, binary data format that is suitable for transfer, read-only and modifying processing of data records. Although the data format is applicable to various types and amounts of data and forms of processing, it is best suited for the distributed, stream-oriented processing of large amounts of schemaless data records.
Application performance and functionality monitoring and management systems rely on constantly provided monitoring data ingested from monitored processing infrastructure like host computing systems, processes, and services. Early monitoring solutions relied solely on agents or other monitoring data generation functionality that is injected into monitored environments and that creates limited, controllable amounts of structured monitoring data records tailored to the requirements of the monitoring systems. This approach is no longer feasible in modern application scenarios.
First, modern application provisioning and operating environments, like cloud computing, operating orchestration systems, or even serverless application operation systems make it difficult or even impossible for application monitoring systems to place sensors or agents inside or near to monitored entities, to create and provide high-quality but relatively low-volume monitoring data, like transaction traces, that can be ingested and processed by a monitoring system with relatively low effort.
To overcome this issue, monitoring systems had to widen their monitoring data ingestion capabilities to also ingest lower-quality, semi-formatted, or schemaless monitoring data, like events, log data and the like, or other evidence data that is created by monitored components out-of-the-box, without significant interference by the monitoring system.
The increased variety of ingested evidence data variants, and the lack of knowledge about structure and semantic of to be ingested monitoring evidence data causes new requirements that monitoring systems need to cope with. First, schemaless or semi-structured data needs to be analyzed, preferably at, or near to ingestion time, to identify the structure of the ingested data. Second, this data needs to be analyzed to identify the semantic of the ingested data, which may also include changing or updating identifiers, keys or names contained in the ingested data to align them with identifiers, keys or names that are used by the monitoring system and that are linked to a certain semantic. In addition, pre-processing steps may be performed on ingested schemaless data to extract implicitly available evidence data and make it available in an explicit way, like e.g., the extraction of numeric metric values for evidence data that was ingested in textual form.
The preferred system architecture to perform those tasks is a processing pipeline, which receives the ingested evidence data via a stream like interface, where a theoretically infinite stream of consecutive evidence data records is received by the pipeline and where a sequence of operations is executed on each data record.
In addition to extending the variety of record types that monitoring systems ingest and process, also the number of ingested records drastically increased. This leads to distributed, scalable architectures of those pipeline systems, where the processing tasks are distributed to various, loosely coupled processing nodes. Evidence data records may be transmitted between processing or storage nodes via interconnecting computer networks.
Various solutions exist in the art that support automated sender-side serialization, network transfer, and receiver-side deserialization of data records. The Protocol Buffers (Protobuf) framework, initially developed by Google, provides serialization/deserialization functionality for various programming languages. Pipeline systems using such automated serialization frameworks for the network transfer of evidence records work in principle, and development of such systems requires relatively low effort, because complex network communication logic is encapsulated by the serialization frameworks used. However, such pipeline systems quickly show performance and processing time issues if the number of processed records increases.
The reason for those problems is the high CPU and memory consumption required for serialization and deserialization of data records.
Those problems are intensified in modern processing environments that provide automated heap memory management, like Java or .NET. In such systems, objects are implicitly allocated in heap memory via instructions in code, but deallocation of those objects and reclaiming of memory is automatically performed by garbage collection subsystems, which detect when objects are no longer referenced and then delete them.
Serialization/deserialization frameworks typically create short lived objects of specific type during receipt and deserialization of received data and during serialization and send of to be delivered data. Processing steps performed on received data records typically use processing specific objects representing those data records. All those short-living objects increase the burden of the garbage collection system, as they need to be identified as unreachable and the memory allocated by those objects needs to be reclaimed.
Various approaches are known in the art, which aim to mitigate these serialization/deserialization issues by providing data layouts that can be used for both transfer via a computing network and for access of the data during its processing.
Apache Arrow is one of those approaches, which uses a columnar data format for transferring and processing mass data. Basically, received data elements are split into key-value tuples, and all values of the same key are stored in one column, where the head of the column is specified by the key of the key value tuple and the values for the key are then stored in a sequence, where the sequence number identifies the data record to which the value belongs. Those “column oriented” approaches show very good performance for batch-oriented processing, where all or a large subset of values of a specific key are accessed. One reason for this performance improvement is that those values are stored next to each other in memory, which is very well exploited by various caching mechanisms available in modern CPUs and computing architectures.
However, such column-oriented storage schemes are not well suited for processing pipelines, where individual data records are received and processed in a sequential, stream-oriented way, due to the effort caused by rearranging sequences of individual, independent records into sequences of columns, where each column represents values of multiple records.
Other approaches exist that apply a record-oriented approach, where serialized sequences of data records are stored in memory, and structure or meta data describing the inner format of the data records are created and used for selective read access of portions of individual records. The Flatbuffers library, also initially developed by Google, is a prominent example of those approaches. Although this library and other similar approaches provide means to reuse one and the same binary format for network transfer and for read access of data records, they all lack support for modification. In case a data record needs to be modified, this record must first be transformed into a deserialized format to then be serialized again after desired modifications were performed.
Although not all processing steps performed by pipeline systems designed for the ingest and processing of semi-formatted and schemaless data perform modifications of received data records, the portion of processing steps that also perform modification of received data records is considerable. Therefore, it is desirable to also avoid serialization overhead for modification processing steps.
Consequently, there is need in the art for a serialized data format that is suitable for the transfer of data over a computing network, for selective read access of specific portions of the data, and that also supports modification of the data, without requiring a deserialization of to be modified data records.
This section provides background information related to the present disclosure which is not necessarily prior art.
This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
The disclosed technologies and methods are directed to a unified serialization and processing format, for data records of various complexity that are received and processed in a sequential, stream-oriented form. The proposed format supports interleaved network transport and processing of encoded data records, without the need to deserialize or otherwise reformat serialized data records. Input data records may be received and converted into serialized data records by ingestion nodes at the perimeter of processing pipeline systems, and then forwarded to sequences of processing nodes. The serialized records may also be persisted on a file system or in a data base, as an intermediate or terminating step of the processing pipeline.
Some embodiments may distinguish between read-only and modifying processing and create and use variants of serialized data records that are optimized for the respective processing type. A serialized record supporting its modification may contain a validity indicator, which may be used to perform deletion or modification operation of the serialized record. This validity indicator may be omitted if no modifications of the record are required.
The layout of serialized data records may, for modifiable records, start with a validity indicator, followed by a byte size encoding section. For read-only records, it may start with the byte size encoding section. The byte size encoding section is independent of the type of the serialized record and contains data to determine how data for the byte size of the payload of the serialized record is stored. This enables processing nodes that are not aware of specific record types to gracefully ignore the unknown record type, by reading and interpreting the byte size encoding data to determine the byte size of the unknown record and to skip over to the next serialized record.
A type information section follows the byte size encoding, which specifies the type of the serialized record. Exemplary types include Strings, integer numbers (Short, Integer, Long, etc.), floating point numbers, complex types of fixed byte size (IP address, location coordinates, etc.) or complex types with variable byte size (arrays or key-value sets). An optional payload size data section may follow the type information section. In some cases, e.g., where the type information already determines the byte size of the payload, like numeric types, or complex types with fixed size, the payload size data section is omitted.
Depending on the type of the serialized record and on data access requirements, either a random access meta data offset section, or the payload section containing the actual data of the serialized record follow. The random access meta data offset may be available for complex types with variable size, like arrays or key-value sets, when random access to the elements of those complex types is required. Type info may be used to encode and determine whether a specific serialized record containing a complex type with variable length contains a random access meta data offset. For some types of records, where the number of different values for the type is limited, those values may be encoded into the type information section. In this case, also the payload section may be omitted. An example is the type Boolean, where the possible values “true” or “false” may already be encoded in the type information section. For serialized records that contain complex types with variable size that support random access of contained elements, a random-access meta data section, containing offsets for those data elements terminates the serialized record.
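The skip-over behaviour for unknown record types described above can be illustrated with a minimal sketch. The concrete byte layout used here is an assumption made for illustration only (one validity byte, one size-encoding byte giving the width of the payload size field, one type byte, then the payload size and the payload); the type codes and the names `encode_record`, `read_header` and `iter_known` are hypothetical and do not come from the disclosure.

```python
import struct

VALID, INVALID = 0x01, 0x00
TYPE_STRING, TYPE_INT32 = 0x10, 0x11     # hypothetical type codes
_SIZE_FMT = {1: "<B", 2: "<H", 4: "<I"}  # size-encoding byte -> width of size field


def encode_record(type_code: int, payload: bytes) -> bytes:
    """Assumed layout: validity | size encoding | type | payload size | payload."""
    n = len(payload)
    width = 1 if n < 2**8 else 2 if n < 2**16 else 4
    header = bytes([VALID, width, type_code])
    return header + struct.pack(_SIZE_FMT[width], n) + payload


def read_header(buf: bytes, pos: int):
    """Read only the type-independent header fields of the record at pos."""
    valid = buf[pos] == VALID
    width = buf[pos + 1]
    type_code = buf[pos + 2]
    (size,) = struct.unpack_from(_SIZE_FMT[width], buf, pos + 3)
    return valid, type_code, pos + 3 + width, size


def iter_known(buf: bytes, known_types):
    """Yield payloads of known, valid records; skip everything else by size."""
    pos = 0
    while pos < len(buf):
        valid, type_code, start, size = read_header(buf, pos)
        if valid and type_code in known_types:
            yield type_code, bytes(buf[start:start + size])
        pos = start + size  # works without understanding the record type


stream = (encode_record(TYPE_STRING, b"hello")
          + encode_record(0x7F, b"\x00" * 300)  # type unknown to this reader
          + encode_record(TYPE_INT32, struct.pack("<i", 42)))
known = list(iter_known(stream, {TYPE_STRING, TYPE_INT32}))
```

Because the size information is readable without interpreting the type, the reader in this sketch never inspects the payload of the record with the unknown type code.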
Variant embodiments may, for complex serialized records with variable size, use a same length indicator, which indicates whether all elements contained in the serialized record have the same byte size. In this case, random access meta data may be omitted, even if random access for contained data elements is required, because offsets for individual data elements can be calculated by determining the size of the first data element and multiplying it by the index number of the data element to be accessed. Consequently, the same length indicator needs to be stored and interpreted before the random-access meta data offset.
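As a small worked sketch of the offset calculation for same-length elements, assume an array payload that holds only 8-byte little-endian doubles. The element encoding and the function names are illustrative assumptions.

```python
import struct

ELEMENT_SIZE = 8  # assumed: every element is an 8-byte little-endian double


def pack_same_length_array(values) -> bytes:
    """Serialize values back to back; no per-element offsets are stored."""
    return b"".join(struct.pack("<d", v) for v in values)


def element_at(payload: bytes, index: int, element_size: int) -> bytes:
    """Random access by arithmetic: offset = index * element_size."""
    offset = index * element_size
    if index < 0 or offset + element_size > len(payload):
        raise IndexError(index)
    return payload[offset:offset + element_size]


payload = pack_same_length_array([1.5, 2.5, 3.5])
third = struct.unpack("<d", element_at(payload, 2, ELEMENT_SIZE))[0]
```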
Variant embodiments of complex types may provide mapping functionality, where contained elements form key-value tuples, and where random-access data is available in form of hash tables, where entries of such hash tables map hash values generated from keys to information for the location of the key within the complex serialized record. Those hash tables may be stored in the serialized record in a compressed form, consisting of a sequence of indicators or flags for occupied hash slots, followed by a list of offsets for those hash slots.
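One possible reading of the compressed hash table form is a bitmap of occupied slots followed by the offsets of only those slots. The sketch below assumes that reading; the bit order, the use of `None` for empty slots and the function names are illustrative choices, not details taken from the disclosure.

```python
def compress_slots(slots):
    """Turn a sparse slot list (None = empty) into (occupancy bitmap, offsets)."""
    bitmap = bytearray((len(slots) + 7) // 8)
    offsets = []
    for i, offset in enumerate(slots):
        if offset is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # flag slot i as occupied
            offsets.append(offset)
    return bytes(bitmap), offsets


def is_occupied(bitmap: bytes, slot: int) -> bool:
    return bool((bitmap[slot // 8] >> (slot % 8)) & 1)


def lookup(bitmap: bytes, offsets, slot: int):
    """Return the stored offset for a slot, or None if the slot is empty."""
    if not is_occupied(bitmap, slot):
        return None
    # The position in the offset list equals the number of occupied slots
    # that precede this one.
    rank = sum(is_occupied(bitmap, i) for i in range(slot))
    return offsets[rank]


slots = [None, 40, None, None, 12, None, None, None, 77, None]
bitmap, offsets = compress_slots(slots)
```

In this sketch, ten slots with three entries are stored as two bitmap bytes plus three offsets instead of ten offset fields.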
Other embodiments may provide means to select between serialization forms of serialized records that support modifications and forms that do not support modifications. The ability to perform modifications on serialized records requires the storage of additional management data describing those manipulations. This additional management data should only be created when it is required due to desired manipulations.
Various forms of manipulations of serialized data are provided by some embodiments, including in-place modifications, where a portion of the serialized record is replaced by data that at most requires the serialized storage size of the original data, and the storage space of the original data can be reused, and size-changing modifications, where the storage size of updated data exceeds the size of the original data. For size-changing manipulations, the original data may need to be invalidated, the new data appended to the serialized record, and any available random-access data updated to refer to the new data portion instead of the invalidated old one.
Different forms of finalization procedures, which process serialized data records after sequences of manipulations, may be applied, e.g., to store the modified serialized data records in an output stream. A first finalization variant may store modified serialized data records in the form they have after manipulation, thereby also storing portions of the data record that were invalidated due to manipulations in the output stream. This finalization strategy has the lowest CPU impact because the record is transferred to the output stream as it is, without any manipulations, but it has the drawback that storage space in the output stream is wasted due to the already invalidated portions of data that are written to the output stream.
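The two kinds of modification can be sketched as follows. The element header (one validity byte plus a two-byte payload size), the class name and the in-memory `index` dictionary standing in for deserialized access meta data are assumptions for illustration. As a simplification, the sketch treats only equal-size replacements as in-place, although the text also allows smaller replacements to reuse the original space.

```python
import struct

VALID, INVALID = 1, 0
HEADER = struct.Struct("<BH")  # assumed: validity byte + 2-byte payload size


class ModifiableRecord:
    def __init__(self):
        self.buf = bytearray()  # serialized form, only overwritten or appended to
        self.index = {}         # key -> offset of the element header

    def _append(self, value: bytes) -> int:
        offset = len(self.buf)
        self.buf += HEADER.pack(VALID, len(value)) + value
        return offset

    def put(self, key, value: bytes) -> None:
        if key in self.index:
            offset = self.index[key]
            _, old_size = HEADER.unpack_from(self.buf, offset)
            if len(value) == old_size:
                # In-place modification: reuse the existing storage space.
                start = offset + HEADER.size
                self.buf[start:start + old_size] = value
                return
            # Size-changing modification: invalidate the old element.
            self.buf[offset] = INVALID
        # Append the new value and point the access data at it.
        self.index[key] = self._append(value)

    def get(self, key) -> bytes:
        offset = self.index[key]
        _, size = HEADER.unpack_from(self.buf, offset)
        start = offset + HEADER.size
        return bytes(self.buf[start:start + size])


record = ModifiableRecord()
record.put("host", b"web-1")
record.put("status", b"ok")
record.put("host", b"web-2")        # same size: overwritten in place
record.put("status", b"degraded")   # larger: old element invalidated, new one appended
```

Neither operation moves existing bytes: the buffer is either overwritten at the same size or extended at the end.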
A second finalization variant may identify invalidated sections of modified serialized records and overwrite those invalidated sections with specific, homogeneous values, like sequences of null values (i.e., those sections are “zeroed out”). Typically, serialized records are compressed before they are stored in an output stream, and the homogeneous values that the finalization process creates for invalidated sections show a much better compression ratio than arbitrary data. This variant causes limited CPU effort, because only some sections of a to be finalized serialized data record need to be identified and overwritten, and it mitigates the issue of wasted storage space for invalidated data sections due to the more efficient compression.
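The compression benefit of zeroing out invalidated sections can be demonstrated with a small sketch. It uses `zlib` purely as a stand-in compressor (the disclosure does not name one), pseudo-random bytes as a stand-in for arbitrary record data, and hypothetical invalidated ranges.

```python
import random
import zlib


def zero_out(record: bytearray, invalid_ranges) -> bytearray:
    """Overwrite each invalidated [start, end) range with null bytes, in place."""
    for start, end in invalid_ranges:
        record[start:end] = bytes(end - start)
    return record


rng = random.Random(0)  # fixed seed so the sketch is repeatable
record = bytearray(rng.getrandbits(8) for _ in range(4096))
invalid = [(512, 1536), (2048, 3072)]  # hypothetical invalidated sections

size_as_is = len(zlib.compress(bytes(record)))
size_zeroed = len(zlib.compress(bytes(zero_out(bytearray(record), invalid))))
```

The record keeps its length and all offsets stay valid, so no access data has to be corrected; only the compressed size in the output stream shrinks.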
A third finalization variant may iterate over the elements of a serialized record that is to be finalized and write only the still valid portions of the serialized record to the output stream. If the serialized record contains random-access data, like a hash table, the hash table also needs to be updated, because offsets stored in the random-access data may need to be corrected after such a “rewriting” finalization. This finalization variant has the highest CPU impact, because the whole serialized record needs to be analyzed and rewritten, but it also has the best memory efficiency, because invalidated portions of the serialized data record are not stored in the output stream. In addition, subsequent processing of the serialized data record may benefit from such a “clean” rewrite, because in this case, it is not required to read and interpret data describing invalidated sections of the serialized record to interpret the stored data correctly.
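A rewriting finalization, including the offset correction, might look like the following sketch. The element header (validity byte plus two-byte payload size) and the dictionary used as random-access data are illustrative assumptions, as are the names `element` and `compact`.

```python
import struct

VALID, INVALID = 1, 0
HEADER = struct.Struct("<BH")  # assumed: validity byte + 2-byte payload size


def element(validity: int, payload: bytes) -> bytes:
    return HEADER.pack(validity, len(payload)) + payload


def compact(buf: bytes, index: dict):
    """Copy only valid elements to the output and correct the stored offsets."""
    out = bytearray()
    remap = {}  # old element offset -> new element offset
    pos = 0
    while pos < len(buf):
        validity, size = HEADER.unpack_from(buf, pos)
        end = pos + HEADER.size + size
        if validity == VALID:
            remap[pos] = len(out)
            out += buf[pos:end]
        pos = end
    # Access data must only refer to valid elements at this point.
    new_index = {key: remap[offset] for key, offset in index.items()}
    return bytes(out), new_index


buf = (element(INVALID, b"old")     # invalidated by an earlier update
       + element(VALID, b"keep")
       + element(VALID, b"new!"))
index = {"b": 6, "a": 13}           # offsets into the unfinalized record
clean, clean_index = compact(buf, index)
```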
Yet other variant embodiments may provide means to represent hierarchical structures of serialized data records and to also manipulate those hierarchical records. As an example, serialized records that represent containers for other serialized records, like arrays or maps, may contain elements that are in turn containers of other serialized records. This provides for arbitrary nesting depths of serialized records. For size-changing modifications of nested serialized records, data for updates may be appended to the serialized record that represents the top-level of the nesting hierarchy.
Still other embodiments may use serialized records in combination with dictionary data structures, where those dictionary data may map textual identifiers to corresponding numerical identifiers, like hash values.
Some variants of those embodiments may maintain global, semantic dictionaries, that map textual identifiers, like key values, to numeric (hash) values. During creation of serialized records that contain those keys, not the original textual value of the key may be stored in the serialized record, but the corresponding numeric value.
Other variants of those embodiments may create groups of serialized records, create a local key dictionary for each created group of records, and store the dictionary together with the group of records for which it was created. Also in this variant, the dictionary maps textual values of keys to corresponding numeric values, and the serialized records use those numeric values instead of the textual values as keys.
Both above variants reduce the storage memory footprint, as the numeric representations of textual keys typically require less memory than the textual representation of those keys. Those variants also support fast and efficient renaming of keys: if a change of the textual value of a key is desired, only an update of the textual representation of the key in the semantic or local dictionary is required, instead of an update of all key values in all affected serialized records.
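The renaming benefit can be sketched with a small key dictionary. This sketch assigns sequential integer identifiers instead of the hash values mentioned in the text, which is a deliberate simplification; the class and method names are hypothetical, and plain Python dictionaries stand in for serialized records.

```python
class KeyDictionary:
    """Maps textual keys to numeric identifiers (sequential here, not hashes)."""

    def __init__(self):
        self._id_by_name = {}
        self._name_by_id = {}

    def intern(self, name: str) -> int:
        """Return the numeric identifier for a key, assigning one if needed."""
        if name not in self._id_by_name:
            key_id = len(self._name_by_id)
            self._id_by_name[name] = key_id
            self._name_by_id[key_id] = name
        return self._id_by_name[name]

    def rename(self, old: str, new: str) -> None:
        """Change the textual form of a key; its numeric identifier is kept."""
        key_id = self._id_by_name.pop(old)
        self._id_by_name[new] = key_id
        self._name_by_id[key_id] = new

    def name_of(self, key_id: int) -> str:
        return self._name_by_id[key_id]


keys = KeyDictionary()
records = [
    {keys.intern("svc"): "checkout", keys.intern("status"): 200},
    {keys.intern("svc"): "cart", keys.intern("status"): 500},
]
snapshot = [dict(r) for r in records]
keys.rename("svc", "service.name")  # one dictionary update, no record is touched
```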
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
The disclosed technologies, data formats and processing methods are directed to the seamless network transfer, interpretation and modification of record-oriented mass data that is received in a stream-like fashion. Processing and conversion efforts required for the transfer of such data records are eliminated or at least minimized, because interpretation and read-only processing use the serialized format of received data records. Conversion efforts are also minimized for processing steps that perform manipulations of serialized records: only access meta data contained in those records is deserialized and updated in this deserialized form. Portions of manipulated records that are not affected by manipulations remain in the serialized form, and for parts that are affected by those manipulations, the manipulation results are directly stored in serialized form.
Each serialized record contains, next to its payload data, also meta data describing type and size of the payload data. This meta data is generated on the fly, at the ingest of input data and the conversion of this input data into its serialized representation. This supports schemaless processing and storage, where semantic of data and processing steps performed on data are not dictated by a fixed schema to which all ingested data records are converted. Instead, only structure and type data of each individual record define the processing steps that are performed on it.
Although the proposed data formats and processing methods are defined and shaped according to the specific requirements of the application performance domain, where large amounts of individual data records, like log lines, transaction traces, event records or metric records, need to be processed in a fast and efficient way, the proposed technologies are also applicable to other technological fields.
The figures depict two different variants of distributed pipeline systems, where one variant performs persistent storage of ingested serialized records only at the end of the processing chain, while the other also performs intermediate persistent storage of serialized records, followed by the possibly delayed reading of the persisted serialized records for additional processing steps.
The depicted distributed pipeline system receives semi-structured input records containing application monitoring data in form of log lines/log files, transaction trace data, metric, or event data, in arbitrary, customer specific form. The input data records may be received from a monitored environment and may be created by monitoring components, like agents or dedicated monitoring interfaces located in the monitored environment. The input records may be transferred to one or more ingestion nodes of the distributed pipeline system via a connecting computer network. The ingestion node may convert received input records into serialized records, where the created serialized records contain meta data describing type, semantic and structure of the contained payload data, which enables the processing of this payload data in serialized form. Generated serialized records are compressed and stored in a buffer, which may be forwarded to subsequent processing nodes, if specific forwarding conditions for the buffer are met. Exemplary forwarding conditions include but are not limited to time elapsed since creation of the buffer, number of serialized records contained in the buffer, or memory size of the buffer. After one or more of those conditions are met, such buffers may be forwarded to a processing node, either via a connecting computer network, or via in-memory transfer, if ingest node and processing node reside in the same process or otherwise share the same memory address space.
The read only processing node may analyze the serialized records contained in the compressed buffer and afterwards forward the compressed buffer to another processing node. In the depicted example, this is a modifying processing node. The modifying processing node may analyze and update serialized records contained in the received compressed buffer and store the updated serialized records in another compressed buffer. Various other read only or modifying processing nodes may receive, analyze and update serialized records, until a final persisting node is reached, which persistently stores the compressed buffer in a database, in a plain file on a file system, or by using other means for the persistent storage of data. It is noteworthy that the same format of serialized records is used for their transfer over a computer network, read only analysis, modification, and persistent storage.
Another pipeline variant performs persistent storage of serialized records as an intermediate step. Also in this variant, semi-structured input records are received by an ingestion node and transformed into serialized records, which are stored in compressed buffers, and the compressed buffers are then forwarded to subsequent processing nodes. However, in this case, an intermediate processing/persisting node performs persistent storage of received compressed buffers in a persistent medium. A reading/processing node may, potentially at a later time, read the persisted compressed buffers, perform additional analysis and modification activities on contained serialized records and then forward compressed buffers containing serialized records to subsequent processing nodes. Intermediate persistent storage may be used to decouple different stages of the processing pipeline. It is noteworthy also here that network transfer, read-only analysis, modification and persistent storage of serialized records use the same format and structure; no transformation or format change of serialized records is required for those activities.
A more detailed block diagram shows an ingestion node and a read only processing node.
An ingestion node receives semi-structured input records from a monitored environment and processes those input records using a serializer module, which analyzes those input records and converts them into serialized records according to one or more conversion rules.
Created serialized records are forwarded to an ingest buffer manager module, which compresses received serialized records and appends them to a compressed ingest buffer. The ingest buffer manager may select or create an appropriate compressed ingest buffer for each received serialized record. The ingest buffer to which a serialized record is appended may be selected based on the origin of the serialized record, to create compressed ingest buffers that contain serialized records with homogeneous origins, or it may be selected based on the type of content of received serialized records, like log data, transaction trace data, metric or event data, to create compressed ingest buffers containing serialized records with homogeneous content types.
Next to appending serialized records to compressed ingest buffers, the ingest buffer manager also monitors buffer provisioning parameters for each compressed ingest buffer to determine when a compressed ingest buffer is forwarded to a next pipeline step. Those provisioning parameters may include but are not limited to the time since a compressed ingest buffer was created, the number of serialized records contained in the buffer, or the memory size occupied by the buffer. If one or more of those parameters of a compressed ingest buffer exceed a certain threshold, the ingest buffer manager may provide the buffer to a next processing node of the pipeline.
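The buffer selection and provisioning logic described above can be sketched as follows. The class names, the default thresholds, the `forward` callback and the use of `zlib` for per-record compression are all assumptions for illustration; the grouping key can be an origin or a content type.

```python
import time
import zlib


class CompressedIngestBuffer:
    def __init__(self, created_at: float):
        self.created_at = created_at
        self.records = []   # individually compressed serialized records
        self.byte_size = 0


class IngestBufferManager:
    def __init__(self, forward, max_age_s=5.0, max_records=1000,
                 max_bytes=1 << 20, clock=time.monotonic):
        self.forward = forward  # callback standing in for the next pipeline step
        self.max_age_s = max_age_s
        self.max_records = max_records
        self.max_bytes = max_bytes
        self.clock = clock
        self.buffers = {}       # grouping key -> CompressedIngestBuffer

    def append(self, group_key, serialized_record: bytes) -> None:
        """Select or create the buffer for this key, then append the record."""
        buf = self.buffers.get(group_key)
        if buf is None:
            buf = self.buffers[group_key] = CompressedIngestBuffer(self.clock())
        chunk = zlib.compress(serialized_record)
        buf.records.append(chunk)
        buf.byte_size += len(chunk)
        self.flush_due()

    def _is_due(self, buf: CompressedIngestBuffer) -> bool:
        return (self.clock() - buf.created_at >= self.max_age_s
                or len(buf.records) >= self.max_records
                or buf.byte_size >= self.max_bytes)

    def flush_due(self) -> None:
        """Forward every buffer for which at least one threshold is reached."""
        due = [key for key, buf in self.buffers.items() if self._is_due(buf)]
        for key in due:
            self.forward(key, self.buffers.pop(key))
```

In a real pipeline `flush_due` would also need to run on a timer, since in this sketch the age threshold is only checked when a record arrives or when the method is called explicitly.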
In the depicted scenario, this is a processing node that only performs read access on received serialized records. A processing buffer manager receives the compressed ingest buffer from the ingest node and stores it as compressed input buffer.
December 18, 2025