Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. An apparatus comprising a processor and a storage to store instructions that, when executed by the processor, cause the processor to perform operations comprising: receive, at the processor, a first request to store a flow input data set in a federated area, wherein: at least one federated area is defined within storage space provided by at least one of a set of storage devices to store objects to perform a job flow; the objects to perform the job flow comprise a job flow definition that defines the job flow as a set of tasks to be performed, and a corresponding set of task routines to perform the set of tasks; processors associated with the set of storage devices cooperate to maintain a distributed file system as spanning storage spaces provided by each storage device of the set of storage devices; and as part of maintaining the distributed file system, at least one processor associated with of at least one storage device of the set of storage devices determines whether a data object received by the set of storage devices is to be stored as an undivided object or stored as a set of data object blocks into which the received data object is divided and distributed among the set of storage devices based on a size of the received data object compared to a distribution block size; compare a size of the flow input data set to a threshold size that is based on the distribution block size to determine whether the size of the flow input data set is larger than the threshold size; and in response to a determination that the size of the flow input data set is larger than the threshold size, the processor is caused to perform operations comprising: analyze the flow input data set to determine whether the flow input data set is of a distributable form in which data items of the flow input data set are organized into a single homogeneous data structure such that, after the flow input data set is divided into a set of data object blocks, the data items remain accessible from each data object block of the flow input data set independently of the other data object blocks of the flow input data set; in response to a determination that the flow input data set is not of the distributable form of the flow input data set, the processor is caused to perform operations comprising: convert the flow input data set from an original form and into the distributable form of the flow input data set; and following conversion of the original form of the flow input data set into the distributable form, provide the distributable form of the flow input data set to the set of storage devices to be divided by the set of storage devices into the set of data object blocks of the flow input data set that are to be stored in a distributed manner within a first federated area of the at least one federated area, wherein the first federated area is defined within the distributed file system.
This invention relates to distributed storage systems for job flow processing. The system addresses the challenge of efficiently storing and managing large input data sets in a distributed file system where storage is provided by multiple storage devices. The system defines federated areas within the distributed storage space to organize objects used in job flows, which consist of task definitions and corresponding routines. The distributed file system spans storage spaces across all storage devices, with processors associated with these devices cooperating to maintain the system. When storing data, the system determines whether to store objects as undivided units or as divided blocks based on their size relative to a distribution block size. For large input data sets exceeding a threshold size, the system analyzes whether the data is in a distributable form, where data items are organized into a single homogeneous structure that remains accessible after division into blocks. If the data is not in this form, the system converts it into a distributable format before distributing it across the storage devices. This ensures efficient storage and retrieval of large data sets while maintaining data integrity and accessibility. The converted data is then stored in a federated area within the distributed file system.
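The storage decision described above can be sketched in a few lines. The following Python is a minimal illustration, not the patented implementation: the `DataSet` class, the `plan_storage` and `to_distributable` names, the 128-byte block size, and the choice of a JSON array as the original form and delimited text as the distributable form are all assumptions made for the example. The claim does not say what happens when the size equals the threshold; the sketch treats that case as not larger.

```python
import csv
import io
import json
from dataclasses import dataclass

DISTRIBUTION_BLOCK_SIZE = 128          # bytes; hypothetical and tiny for illustration
THRESHOLD = DISTRIBUTION_BLOCK_SIZE    # threshold based on the block size


@dataclass
class DataSet:
    form: str       # "json" (assumed original form) or "csv" (assumed distributable form)
    payload: bytes


def to_distributable(ds: DataSet) -> DataSet:
    """Convert a JSON array of flat records into one delimited line per record."""
    records = json.loads(ds.payload)
    fields = list(records[0])          # assumes a non-empty list of uniform records
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for rec in records:
        writer.writerow([rec[f] for f in fields])
    return DataSet("csv", buf.getvalue().encode("utf-8"))


def plan_storage(ds: DataSet):
    """Return (mode, data set to hand to the storage devices)."""
    if len(ds.payload) <= THRESHOLD:
        return "undivided", ds         # not larger than the threshold: store as-is
    if ds.form != "csv":               # not yet in the distributable form
        ds = to_distributable(ds)
    return "divided", ds               # storage devices divide it into blocks
```

A caller would pass the returned data set to whatever interface the storage devices expose; that interface is outside the scope of this sketch.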
2. The apparatus of claim 1, wherein the processor is caused, in response to a determination that the original form of the flow input data set is the distributable form of the flow input data set, to provide the original form of the flow input data set to the set of storage devices to be divided by the set of storage devices into the set of data object blocks of the flow input data set that are to be stored in a distributed manner within the first federated area.
This invention relates to distributed data storage systems, specifically methods for managing and storing flow input data sets in a federated storage environment. The problem addressed is the efficient distribution and storage of data across multiple storage devices in a federated area, ensuring data is divided into manageable blocks while maintaining its original form when appropriate. The apparatus includes a processor and a set of storage devices. The processor determines whether the original form of a flow input data set is suitable for distribution. If the original form is already in a distributable format, the processor provides the data set directly to the storage devices. The storage devices then divide the data set into a set of data object blocks, which are stored in a distributed manner within a first federated area. This ensures efficient storage and retrieval while preserving the data's integrity and structure. The system optimizes storage by avoiding unnecessary transformations when the data is already in a suitable format, reducing processing overhead and improving performance. The invention is particularly useful in environments where data must be distributed across multiple nodes while maintaining consistency and accessibility.
3. The apparatus of claim 1, wherein the processor is caused, in response to a determination that the size of the flow input data set is smaller than the predetermined threshold size, to provide the flow input data set to the set of storage devices to be stored as an undivided object within storage space provided by a single storage device of the set of storage devices, and within a federated area comprising at least one of: the first federated area defined within the distributed file system; and a second federated area defined within storage space of a local file system maintained by one storage device of the set of storage devices.
This invention relates to data storage systems, specifically optimizing storage of small data sets in distributed and local file systems. The problem addressed is inefficient storage of small data objects, which can lead to wasted storage space and reduced performance due to excessive fragmentation or metadata overhead. The apparatus includes a processor and a set of storage devices. The processor determines the size of a flow input data set and compares it to a predetermined threshold. If the data set is smaller than the threshold, the processor stores it as an undivided object within a single storage device, rather than distributing it across multiple devices. The storage occurs in a federated area, which can be either a first federated area within a distributed file system or a second federated area within the local file system of a single storage device. This approach minimizes fragmentation and metadata overhead for small data objects while maintaining compatibility with both distributed and local storage architectures. The system dynamically selects the appropriate storage location based on data size, improving storage efficiency and performance.
4. The apparatus of claim 1, wherein the processor is caused to perform operations comprising: retrieve, from the set of storage devices, an indication of the distribution block size; and define the threshold size to be less than or equal to the distribution block size.
This invention relates to distributed data storage, specifically how the threshold size used when storing a flow input data set is chosen. The apparatus includes a processor and a set of storage devices. The processor retrieves an indication of the distribution block size from the storage devices rather than assuming a fixed value. The distribution block size is the size against which the distributed file system compares a received data object to decide whether to store it undivided or to divide it into data object blocks. The processor then defines the threshold size to be less than or equal to that distribution block size. This threshold is the value against which the size of a flow input data set is compared in claim 1: a data set larger than the threshold is analyzed for distributable form, converted if necessary, and provided to the storage devices to be divided into data object blocks. Because the threshold does not exceed the block size, any data set large enough for the file system to divide is also larger than the threshold, so it is always checked for distributable form before it is divided.
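As a rough illustration of this step, the threshold can be derived from whatever block size the storage devices report. The function name `define_threshold`, the `fraction` parameter, and the `reported` dictionary are hypothetical. In HDFS, which claim 8 names, the block size is configured by the `dfs.blocksize` property; the value 134217728 bytes (128 MiB) is used here only as an example of a reported value.

```python
def define_threshold(distribution_block_size: int, fraction: float = 1.0) -> int:
    """Return a threshold that never exceeds the distribution block size."""
    if distribution_block_size <= 0:
        raise ValueError("distribution block size must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(distribution_block_size * fraction)


# Hypothetical indication retrieved from the storage devices.
reported = {"dfs.blocksize": 134217728}
threshold = define_threshold(reported["dfs.blocksize"])
```

Restricting `fraction` to the interval (0, 1] is one simple way to guarantee the "less than or equal to" relationship the claim requires.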
5. The apparatus of claim 1, wherein the processor is caused to perform operations comprising: receive, at the processor and from a remote device, a second request to retrieve the flow input data set; retrieve a stored indication of at least the size of the flow input data set; compare the size of the flow input data set to the threshold size to determine whether the size of the flow input data set is larger than the threshold size; analyze the retrieved indication to determine whether the flow input data set was converted into the distributable form from the original form prior to being provided to the set of storage devices; and in response to a determination that the size of the flow input data set is smaller than the threshold size or a determination that the flow input data set was not converted from the original form and into the distributable form, the processor is caused to perform operations comprising: retrieve the original form of the flow input data set from the one or more storage devices; and transmit the original form to the remote device.
This invention relates to data retrieval systems for managing and transmitting large datasets efficiently. The problem addressed is the inefficient handling of large datasets when requested by remote devices, particularly when the data may exist in different forms (original or distributable) and when storage constraints or performance considerations apply. The system includes a processor and storage devices that store a flow input data set in either an original form or a distributable form. The processor receives a request from a remote device to retrieve the data set. Before retrieving the data, the processor checks the stored size of the data set against a predefined threshold size. If the data set is smaller than the threshold or was not previously converted into the distributable form, the processor retrieves the original form of the data set from storage and transmits it directly to the remote device. This approach optimizes data retrieval by avoiding unnecessary conversions or processing steps when the data is small or already in the desired form, improving efficiency and reducing latency. The system ensures that data is transmitted in the most appropriate format based on size and prior conversion status, balancing storage and performance considerations.
6. The apparatus of claim 5, wherein, in response to a determination that the size of the flow input data set is larger than the threshold size and a determination that the flow input data set was converted from the original form and into the distributable form, the processor is caused to perform operations comprising: retrieve a stored indication of one or more characteristics of the original form of the flow input data set; retrieve the distributable form of the flow input data set from the one or more storage devices; employ the indication of the one or more characteristics to reverse the conversion to of the flow input data set to re-generate the original form; and transmit the original form to the remote device.
This invention relates to data processing systems that handle large datasets, particularly those that convert data between original and distributable forms for storage or transmission. The problem addressed is efficiently managing and reconstructing large datasets that have been transformed into a distributable format, ensuring accurate recovery of the original data when needed. The system includes a processor that processes flow input data sets, which may be converted from an original form into a distributable form for storage or transmission. When the size of the flow input data set exceeds a predefined threshold and the data has been converted into a distributable form, the processor performs specific operations. First, it retrieves stored characteristics of the original form of the data set, which describe how the data was transformed. Next, it retrieves the distributable form of the data from storage. Using the stored characteristics, the processor reverses the conversion process to regenerate the original form of the data. Finally, the original form is transmitted to a remote device, ensuring the data is restored accurately for further use. This approach ensures that large datasets can be efficiently stored or transmitted in a distributable format while maintaining the ability to reconstruct the original data when required, addressing challenges in data integrity and retrieval.
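The retrieval path of claims 5 and 6 can be illustrated with a small round trip. This is a sketch under stated assumptions: the original form is taken to be a JSON array of flat records, the distributable form is delimited text with one record per line, and the stored characteristics are just the field names and a format label. The names `convert`, `reverse`, and `retrieve` are hypothetical and do not come from the patent.

```python
import csv
import io
import json


def convert(original: bytes):
    """Original form -> (distributable form, stored characteristics)."""
    records = json.loads(original)
    fields = list(records[0])              # assumes a non-empty list of uniform records
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for rec in records:
        # JSON-encode each value so its type survives and no raw newline appears.
        writer.writerow([json.dumps(rec[f]) for f in fields])
    characteristics = {"original_format": "json-array", "fields": fields}
    return buf.getvalue().encode("utf-8"), characteristics


def reverse(distributable: bytes, characteristics: dict) -> bytes:
    """Use the stored characteristics to re-generate the original form."""
    fields = characteristics["fields"]
    rows = csv.reader(io.StringIO(distributable.decode("utf-8")))
    records = [{f: json.loads(v) for f, v in zip(fields, row)} for row in rows]
    return json.dumps(records).encode("utf-8")


def retrieve(stored: bytes, size: int, threshold: int, characteristics) -> bytes:
    """Return the original form to transmit to the remote device."""
    if size <= threshold or characteristics is None:
        return stored                      # stored data is already the original form
    return reverse(stored, characteristics)
```

Here `size` stands for the stored indication of the data set's size, and a `characteristics` value of `None` stands for "was not converted". A real system would record both alongside the stored data in some form the patent text does not specify.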
7. The apparatus of claim 6, wherein the stored indication of one or more characteristics comprises at least one of: a copy of metadata incorporated into the original form; an indication of a characteristic of at least one data structure by which data values are organized within the original form; and an indication of a characteristic of an indexing scheme by which data values are accessed within the original form.
This invention relates to distributed data storage systems that convert a flow input data set into a distributable form, and it specifies what the stored indication of characteristics used in claim 6 may contain. When a data set is converted from its original form into the distributable form, details of the original form can be lost, and those details are needed to reverse the conversion when the data set is later retrieved. The stored indication comprises at least one of three things: a copy of metadata that was incorporated into the original form; an indication of a characteristic of at least one data structure by which data values are organized within the original form; and an indication of a characteristic of an indexing scheme by which data values are accessed within the original form. Because this information is stored, the processor can use it to re-generate the original form, including its metadata, organization, and indexing, before transmitting the data set to the requesting remote device.
8. The apparatus of claim 1, wherein: the distributed file system is Hadoop distributed file system (HDFS); and the distributable form of the flow input data set comprises at least one of: a text file comprising data items separated by delimiters; and a optimized row columnar (ORC) file comprising compressed data items.
This invention relates to a distributed file system apparatus designed to process and store large-scale data efficiently. The apparatus specifically utilizes the Hadoop Distributed File System (HDFS) to manage and distribute data across multiple nodes in a cluster. The system is configured to handle flow input data sets in distributable forms, including text files with delimiter-separated data items and Optimized Row Columnar (ORC) files containing compressed data items. The use of HDFS ensures high availability, fault tolerance, and scalability, while the support for different file formats allows for flexible data ingestion and processing. The apparatus optimizes storage and retrieval operations by leveraging HDFS's distributed architecture and the efficiency of ORC files, which reduce storage footprint and improve query performance. The system is particularly suited for big data applications requiring robust, scalable, and efficient data management solutions. The invention addresses the challenge of handling large volumes of structured and semi-structured data by providing a distributed storage solution that supports multiple data formats, ensuring compatibility and performance across diverse data processing workflows.
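To show why a delimited text file counts as a distributable form, the sketch below divides such a file into blocks and confirms that each block can be read on its own. It is a simplification: `split_into_blocks` is a hypothetical helper that cuts only at line boundaries, whereas HDFS itself divides files at fixed byte offsets and leaves record boundaries to the reader.

```python
from typing import List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Divide delimited text into blocks of whole lines, each at most block_size
    bytes (a single line longer than block_size gets a block of its own)."""
    blocks: List[bytes] = []
    current = b""
    for line in data.splitlines(keepends=True):
        if current and len(current) + len(line) > block_size:
            blocks.append(current)
            current = b""
        current += line
    if current:
        blocks.append(current)
    return blocks


# Ten two-field records, five bytes each including the newline.
data = b"".join(f"{i},v{i}\n".encode("ascii") for i in range(10))
blocks = split_into_blocks(data, block_size=12)
```

Every block holds complete records with the same two fields, so a task routine can process any block without seeing the others, which is the property the claim calls a distributable form.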
9. The apparatus of claim 1, wherein the processor is caused to perform operations comprising: receive, at the processor and from a remote device, a second request to perform the job flow using the flow input data set as an input to the job flow performance, wherein: at least one result report is to be generated as an output of the job flow performance; and the job flow definition and each task routine of the set of task routines is stored as an undivided object within one storage device of a set of storage devices; retrieve the job flow definition and each task routine of the set of task routines from one or more storage devices of the set of storage devices; retrieve a stored indication of at least the size of the flow input data set; compare the size of the flow input data set to the threshold size to determine whether the size of the flow input data set is larger than the threshold size; and in response to the determination that the flow input data set is larger than the threshold size, the processor is caused to perform operations comprising: generate a container that contains the job flow definition and the set of task routines to enable the processor associated with each storage device to independently perform an instance of the job flow using one of the data object blocks of the flow input data set stored locally within the storage device as an input to the instance, wherein each performance of an instance of the job flow within each storage device generates a corresponding data object block of a set of data object blocks of the result report; and provide a copy of the container to each storage device of the set of storage devices to enable the processors associated with at least two storage devices of the set of storage devices to perform instances of the job flow at least partially in parallel.
This invention relates to distributed data processing systems, specifically job flow execution on large input data sets. The system addresses the challenge of processing a large data set efficiently by distributing the work across multiple storage devices, each with an associated processor. The system first receives, from a remote device, a request to perform a job flow using a flow input data set as input, with at least one result report to be generated as output. The job flow definition and each task routine are each stored as an undivided object within one storage device. The system retrieves the job flow definition and the task routines, retrieves a stored indication of the size of the input data set, and compares that size to the threshold size. If the data set is larger than the threshold, the system generates a container holding the job flow definition and the task routines, and provides a copy of the container to each storage device. Each storage device can then independently perform an instance of the job flow using a data object block of the input data set that it stores locally, and each instance generates a corresponding data object block of the result report. Because the instances are independent, at least two storage devices can perform them at least partially in parallel.
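The container mechanism can be sketched as follows. This is an illustration only: the `Container` class and the `run_instance` and `run_job_flow` functions are hypothetical names, the job flow definition is reduced to an ordered list of task names, and threads in one process stand in for the processors associated with separate storage devices.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Container:
    job_flow_definition: List[str]                    # ordered task names
    task_routines: Dict[str, Callable[[bytes], bytes]]  # task name -> routine


def run_instance(container: Container, block: bytes) -> bytes:
    """Perform one instance of the job flow on one locally stored block."""
    data = block
    for task in container.job_flow_definition:
        data = container.task_routines[task](data)
    return data                                       # one block of the result report


def run_job_flow(container: Container, blocks: List[bytes]) -> List[bytes]:
    """Give every 'storage device' the same container and run the instances in parallel."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda b: run_instance(container, b), blocks))
```

`pool.map` returns results in the order of the input blocks, so the result report blocks line up with the input blocks that produced them.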
10. The apparatus of claim 9, wherein the processor is caused to perform operations comprising: retrieve, from each storage device of the multiple storage devices, at least one data object block of the set of data object blocks of the result report; assemble the result report from the set of data object blocks of the result report; transmit the result report to the remote device; compare a size of the result report to the threshold size to determine whether the size of the result report is larger than the threshold size; and in response to a determination that the size of the result report is larger than the threshold size, the processor is caused to perform operations comprising: analyze the result report to determine whether the result report is of a distributable form of the result report; in response to a determination that the result report is not of the distributable form of the result report, convert the result report into the distributable form of the result report; and provide the distributable form of the result report to the set of storage devices to be divided into the set of data object blocks of the result report that are to be distributed among the set of storage devices.
This invention relates to distributed data processing systems, specifically returning and storing the result report produced by a distributed job flow performance. The system includes a processor and multiple storage devices. The processor retrieves at least one data object block of the result report from each storage device, assembles the result report from those blocks, and transmits it to the remote device. The processor also compares the size of the result report to the threshold size. If the report is larger than the threshold, the processor analyzes whether it is in a distributable form and, if it is not, converts it into one. The distributable form is then provided to the set of storage devices to be divided into data object blocks that are distributed among the storage devices. The result report is therefore subject to the same size and form checks that are applied to a flow input data set when it is stored.
11. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, the computer-program product including instructions operable to cause a processor to perform operations comprising: receive, at the processor, a first request to store a flow input data set in a federated area, wherein: at least one federated area is defined within storage space provided by at least one of a set of storage devices to store objects to perform a job flow; the objects to perform the job flow comprise a job flow definition that defines the job flow as a set of tasks to be performed, and a corresponding set of task routines to perform the set of tasks; processors associated with the set of storage devices cooperate to maintain a distributed file system as spanning storage spaces provided by each storage device of the set of storage devices; and as part of maintaining the distributed file system, at least one processor associated with of at least one storage device of the set of storage devices determines whether a data object received by the set of storage devices is to be stored as an undivided object or stored as a set of data object blocks into which the received data object is divided and distributed among the set of storage devices based on a size of the received data object compared to a distribution block size; compare a size of the flow input data set to a threshold size that is based on the distribution block size to determine whether the size of the flow input data set is larger than the threshold size; and in response to a determination that the size of the flow input data set is larger than the threshold size, the processor is caused to perform operations comprising: analyze the flow input data set to determine whether the flow input data set is of a distributable form in which data items of the flow input data set are organized into a single homogeneous data structure such that, after the flow input data set is divided into a set of data object blocks, the data items remain accessible from each data object block of the flow input data set independently of the other data object blocks of the flow input data set; in response to a determination that the flow input data set is not of the distributable form of the flow input data set, the processor is caused to perform operations comprising: convert the flow input data set from an original form and into the distributable form of the flow input data set; and following conversion of the original form of the flow input data set into the distributable form, provide the distributable form of the flow input data set to the set of storage devices to be divided by the set of storage devices into the set of data object blocks of the flow input data set that are to be stored in a distributed manner within a first federated area of the at least one federated area, wherein the first federated area is defined within the distributed file system.
This invention relates to distributed storage systems and data processing in federated storage environments. The problem addressed is efficient storage and management of large data sets in a distributed file system where data objects may be divided into blocks for storage across multiple storage devices. The system defines federated areas within a distributed file system to store objects for job flows, which consist of task definitions and corresponding routines. The distributed file system spans storage spaces across multiple storage devices, with processors cooperating to manage storage and determine whether incoming data objects should be stored as undivided objects or divided into blocks based on their size relative to a distribution block size. When a request is received to store a flow input data set in a federated area, the system compares its size to a threshold based on the distribution block size. If the data set exceeds the threshold, the system analyzes whether it is in a distributable form, where data items are organized into a single homogeneous structure that remains accessible after division into blocks. If not, the system converts the data set into a distributable form before distributing it across storage devices as blocks within a federated area. This ensures efficient storage and accessibility of large data sets in a distributed environment.
12. The computer-program product of claim 11, wherein the processor is caused, in response to a determination that the original form of the flow input data set is the distributable form of the flow input data set, to provide the original form of the flow input data set to the set of storage devices to be divided by the set of storage devices into the set of data object blocks of the flow input data set that are to be stored in a distributed manner within the first federated area.
This invention relates to distributed data storage systems, specifically storing a flow input data set whose original form is already suitable for distribution. The processor, operating under the instructions of the computer-program product, evaluates the form of the flow input data set. If the original form is determined to be the distributable form, the processor skips the conversion step and provides the original form directly to the set of storage devices. The storage devices then divide the data set into a set of data object blocks, which are stored in a distributed manner within the first federated area. This avoids the processing overhead of a conversion when the data is already in a form that can be divided into independently accessible blocks.
13. The computer-program product of claim 11, wherein the processor is caused, in response to a determination that the size of the flow input data set is smaller than the predetermined threshold size, to provide the flow input data set to the set of storage devices to be stored as an undivided object within storage space provided by a single storage device of the set of storage devices, and within a federated area comprising at least one of: the first federated area defined within the distributed file system; and a second federated area defined within storage space of a local file system maintained by one storage device of the set of storage devices.
This invention relates to data storage systems, specifically optimizing storage of input data sets in distributed and local file systems. The problem addressed is inefficient storage allocation when handling small data sets, which can lead to wasted storage space or excessive fragmentation. The system processes input data sets for storage across a set of storage devices. If the input data set is smaller than a predetermined threshold size, it is stored as an undivided object within a single storage device rather than being divided and distributed. The storage occurs in a federated area, which can be either a first federated area within a distributed file system or a second federated area within the storage space of a local file system managed by one of the storage devices. This approach ensures efficient use of storage space for small data sets while maintaining compatibility with both distributed and local storage architectures. The system dynamically determines the appropriate storage location based on the size of the input data, optimizing performance and reducing overhead for small files. The federated areas allow seamless integration between distributed and local storage systems, enabling flexible and scalable data management. This method improves storage efficiency, particularly for small data sets, by avoiding unnecessary fragmentation and distribution.
14. The computer-program product of claim 11, wherein the processor is caused to perform operations comprising: retrieve, from the set of storage devices, an indication of the distribution block size; and define the threshold size to be less than or equal to the distribution block size.
This invention relates to distributed data storage, specifically how the threshold size used when storing a flow input data set is defined. Under the instructions of the computer-program product, the processor retrieves an indication of the distribution block size from the set of storage devices. The distribution block size is the size against which the distributed file system compares a received data object to decide whether to store it undivided or to divide it into data object blocks. The processor then defines the threshold size to be less than or equal to the retrieved distribution block size. This threshold is the one applied in claim 11: a flow input data set larger than the threshold is analyzed for distributable form and converted if necessary before being provided to the storage devices. Because the threshold does not exceed the block size, any data set large enough for the file system to divide is also larger than the threshold and is therefore checked before it is divided. Retrieving the block size from the storage devices keeps the threshold consistent with the configuration of the distributed file system actually in use.
15. The computer-program product of claim 11 , wherein the processor is caused to perform operations comprising: receive, at the processor and from a remote device, a second request to retrieve the flow input data set; retrieve a stored indication of at least the size of the flow input data set; compare the size of the flow input data set to the threshold size to determine whether the size of the flow input data set is larger than the threshold size; analyze the retrieved indication to determine whether the flow input data set was converted into the distributable form from the original form prior to being provided to the set of storage devices; and in response to a determination that the size of the flow input data set is smaller than the threshold size or a determination that the flow input data set was not converted from the original form and into the distributable form, the processor is caused to perform operations comprising: retrieve the original form of the flow input data set from the one or more storage devices; and transmit the original form to the remote device.
This invention relates to data retrieval systems in distributed computing environments, specifically addressing the efficient handling of large datasets. The problem solved involves optimizing the retrieval of data stored in different forms, such as original or distributable formats, to minimize processing overhead and bandwidth usage. The system includes a processor that receives a request from a remote device to retrieve a dataset, referred to as a flow input data set. The processor checks the stored size of the dataset against a predefined threshold size to determine if the dataset is large. Additionally, the system analyzes whether the dataset has been converted from its original form into a distributable form before storage. If the dataset is smaller than the threshold size or has not been converted into the distributable form, the processor retrieves the original form of the dataset from storage and transmits it directly to the remote device. This approach ensures that only necessary data transformations occur, reducing computational and network resource consumption during retrieval operations. The invention improves efficiency in distributed data management by dynamically selecting the optimal data format for transmission based on size and prior conversion status.
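The retrieval decision described above reduces to a two-condition branch, sketched here in Python. The `StoredIndication` record and the path labels are hypothetical names chosen for illustration. The claim does not say how a data set exactly equal to the threshold is treated, so this sketch assumes it counts as not larger.

```python
from dataclasses import dataclass


@dataclass
class StoredIndication:
    """Hypothetical record kept alongside a stored flow input data set."""
    size: int            # size of the data set in bytes
    was_converted: bool  # True if it was converted into the distributable form


def retrieval_path(indication: StoredIndication, threshold: int) -> str:
    """Pick how to answer a retrieval request from a remote device."""
    # Small data sets, and large ones that were never converted, are still
    # stored in their original form and can be transmitted as-is.
    if indication.size <= threshold or not indication.was_converted:
        return "transmit_original_as_stored"
    # Otherwise the stored copy is the distributable form, and the
    # conversion has to be reversed before transmission.
    return "reverse_conversion_then_transmit"
```

For example, `retrieval_path(StoredIndication(size=10, was_converted=False), threshold=100)` selects the direct path.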
16. The computer-program product of claim 15 , wherein, in response to a determination that the size of the flow input data set is larger than the threshold size and a determination that the flow input data set was converted from the original form and into the distributable form, the processor is caused to perform operations comprising: retrieve a stored indication of one or more characteristics of the original form of the flow input data set; retrieve the distributable form of the flow input data set from the one or more storage devices; employ the indication of the one or more characteristics to reverse the conversion to of the flow input data set to re-generate the original form; and transmit the original form to the remote device.
This invention relates to data processing systems that handle large datasets, particularly those that convert data between original and distributable forms for storage or transmission. The problem addressed is efficiently managing and reconstructing large datasets that have been transformed into a distributable form, ensuring accurate recovery of the original data when needed. The system processes a flow input data set, which is initially in an original form. If the data set exceeds a threshold size, it is converted into a distributable form for storage or transmission. When the data set is later retrieved, the system checks if it was converted and if its size exceeds the threshold. If both conditions are met, the system retrieves stored metadata indicating the original form's characteristics and the distributable version of the data. Using this metadata, the system reverses the conversion to reconstruct the original form and transmits it to a remote device. This ensures that large datasets can be efficiently stored or transmitted in a distributable format while allowing accurate reconstruction of the original data when required. The approach optimizes storage and transmission efficiency while maintaining data integrity.
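A round trip of this kind might look like the following Python sketch. Everything about the formats is assumed for illustration: the original form is modeled as a dictionary with a metadata header, the distributable form as comma-separated lines, and the stored characteristics as the metadata plus field names and types. Values are assumed not to contain commas or newlines.

```python
def to_distributable(original: dict) -> tuple:
    """Convert the assumed original form into delimited text.

    Returns the text plus the characteristics needed to undo the conversion.
    """
    fields = original["fields"]
    lines = [",".join(str(rec[f]) for f in fields) for rec in original["records"]]
    characteristics = {
        "metadata": original["metadata"],  # copy of metadata from the original form
        "fields": fields,                  # how data values were organized
        "types": original["types"],        # needed to restore value types
    }
    return "\n".join(lines), characteristics


def from_distributable(text: str, characteristics: dict) -> dict:
    """Reverse the conversion using the stored characteristics."""
    casts = {"int": int, "float": float, "str": str}
    fields, types = characteristics["fields"], characteristics["types"]
    records = []
    for line in text.splitlines():
        values = line.split(",")
        records.append({f: casts[t](v) for f, t, v in zip(fields, types, values)})
    return {
        "metadata": characteristics["metadata"],
        "fields": fields,
        "types": types,
        "records": records,
    }


original = {
    "metadata": {"title": "sensor log"},
    "fields": ["id", "temp"],
    "types": ["int", "float"],
    "records": [{"id": 1, "temp": 20.5}, {"id": 2, "temp": 21.0}],
}
text, stored = to_distributable(original)
restored = from_distributable(text, stored)
```

The point of the sketch is that the delimited text alone loses the metadata and the value types; the stored characteristics are what make the original form recoverable.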
17. The computer-program product of claim 16 , wherein the stored indication of one or more characteristics comprises at least one of: a copy of metadata incorporated into the original form; an indication of a characteristic of at least one data structure by which data values are organized within the original form; and an indication of a characteristic of an indexing scheme by which data values are accessed within the original form.
This invention relates to the computer-program product of claim 16, specifically the information that is stored about the original form of a flow input data set so that its conversion into the distributable form can later be reversed. The stored indication includes at least one of the following: a copy of metadata incorporated into the original form, an indication of a characteristic of at least one data structure by which data values are organized within the original form, or an indication of a characteristic of an indexing scheme by which data values are accessed within the original form. The claim does not list specific examples; plausible ones would be a schema or header block for the metadata, a table or key-value layout for the data structure, and a key or index definition for the indexing scheme. Keeping these characteristics alongside the distributable form gives the processor what it needs to re-generate the original form when a remote device requests the data set.
18. The computer-program product of claim 11 , wherein: the distributed file system is Hadoop distributed file system (HDFS); and the distributable form of the flow input data set comprises at least one of: a text file comprising data items separated by delimiters; and a optimized row columnar (ORC) file comprising compressed data items.
This invention relates to distributed file systems, specifically the case where the distributed file system is the Hadoop Distributed File System (HDFS). The claim narrows claim 11 by naming the file system and the formats that count as the distributable form. The distributable form of the flow input data set comprises at least one of a text file in which data items are separated by delimiters or an Optimized Row Columnar (ORC) file containing compressed data items. Under claim 11, a distributable form organizes data items into a single homogeneous data structure so that, after the data set is divided, the data items remain accessible from each data object block independently of the other blocks. The text format is simple to parse, while ORC stores data in a compressed columnar layout.
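The requirement that data items stay accessible from each block independently can be illustrated for the delimited-text format. The Python sketch below is a simplification and not a description of HDFS itself: it cuts only at line boundaries and measures size in characters, and the function names are hypothetical.

```python
import csv
import io


def split_into_blocks(text: str, block_size: int) -> list:
    """Divide delimited text into blocks, cutting only at line boundaries."""
    blocks, current, current_len = [], [], 0
    for line in text.splitlines(keepends=True):
        # Start a new block when adding this line would exceed the block size.
        if current and current_len + len(line) > block_size:
            blocks.append("".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line)
    if current:
        blocks.append("".join(current))
    return blocks


def parse_block(block: str) -> list:
    """Parse one block on its own, without reference to any other block."""
    return [row for row in csv.reader(io.StringIO(block))]


text = "1,a\n2,b\n3,c\n4,d\n"
blocks = split_into_blocks(text, block_size=8)
second_block_rows = parse_block(blocks[1])
```

Because every line is a complete record, the second block parses correctly even though the first block is never consulted. A format with a single enclosing structure, such as one large nested document, would not have this property.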
19. The computer-program product of claim 11 , wherein the processor is caused to perform operations comprising: receive, at the processor and from a remote device, a second request to perform the job flow using the flow input data set as an input to the job flow performance, wherein: at least one result report is to be generated as an output of the job flow performance; and the job flow definition and each task routine of the set of task routines is stored as an undivided object within one storage device of a set of storage devices; retrieve the job flow definition and each task routine of the set of task routines from one or more storage devices of the set of storage devices; retrieve a stored indication of at least the size of the flow input data set; compare the size of the flow input data set to the threshold size to determine whether the size of the flow input data set is larger than the threshold size; and in response to the determination that the flow input data set is larger than the threshold size, the processor is caused to perform operations comprising: generate a container that contains the job flow definition and the set of task routines to enable the processor associated with each storage device to independently perform an instance of the job flow using one of the data object blocks of the flow input data set stored locally within the storage device as an input to the instance, wherein each performance of an instance of the job flow within each storage device generates a corresponding data object block of a set of data object blocks of the result report; and provide a copy of the container to each storage device of the set of storage devices to enable the processors associated with at least two storage devices of the set of storage devices to perform instances of the job flow at least partially in parallel.
This invention relates to distributed job flow processing systems, specifically performing a job flow on a large input data set in parallel. The processor receives a request from a remote device to perform a job flow using the flow input data set as input, with at least one result report to be generated as output. The job flow definition and each task routine are each stored as an undivided object within one storage device of the set. The processor retrieves the job flow definition and the task routines, retrieves a stored indication of the size of the flow input data set, and compares that size to the threshold size. If the data set is larger than the threshold, the processor generates a container that contains the job flow definition and the set of task routines and provides a copy of the container to each storage device. The processor associated with each storage device can then independently perform an instance of the job flow using a data object block of the flow input data set stored locally within that device, and each instance generates a corresponding data object block of the result report. Instances on at least two storage devices run at least partially in parallel.
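The container idea can be sketched in Python as a bundle of the job flow definition and its task routines that is run once per locally stored block. This is an illustration only: the `Container` class is a hypothetical stand-in, the job flow is modeled as a simple ordered list of task names, and worker threads stand in for the processors of separate storage devices.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Container:
    """Hypothetical bundle of a job flow definition and its task routines."""
    job_flow_definition: List[str]      # ordered task names
    task_routines: Dict[str, Callable]  # task name -> routine

    def perform(self, block):
        """Perform one instance of the job flow on a single data object block."""
        data = block
        for task in self.job_flow_definition:
            data = self.task_routines[task](data)
        return data


def perform_in_parallel(container: Container, local_blocks: list) -> list:
    """Run one instance per block; each result is one block of the result report."""
    workers = max(1, len(local_blocks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input blocks.
        return list(pool.map(container.perform, local_blocks))


container = Container(
    job_flow_definition=["parse", "double"],
    task_routines={
        "parse": lambda rows: [int(x) for x in rows],
        "double": lambda nums: [2 * n for n in nums],
    },
)
report_blocks = perform_in_parallel(container, [["1", "2"], ["3"]])
```

Each instance sees only its own block, which is why the input data set must be in a distributable form before it is divided.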
20. The computer-program product of claim 19 , wherein the processor is caused to perform operations comprising: retrieve, from each storage device of the multiple storage devices, at least one data object block of the set of data object blocks of the result report; assemble the result report from the set of data object blocks of the result report; transmit the result report to the remote device; compare a size of the result report to the threshold size to determine whether the size of the result report is larger than the threshold size; and in response to a determination that the size of the result report is larger than the threshold size, the processor is caused to perform operations comprising: analyze the result report to determine whether the result report is of a distributable form of the result report; in response to a determination that the result report is not of the distributable form of the result report, convert the result report into the distributable form of the result report; and provide the distributable form of the result report to the set of storage devices to be divided into the set of data object blocks of the result report that are to be distributed among the set of storage devices.
This invention relates to distributed data storage and retrieval systems, specifically the handling of the result report produced by a parallel job flow performance. The processor retrieves at least one data object block of the result report from each storage device, assembles the result report from those blocks, and transmits it to the remote device. It then compares the size of the result report to the threshold size. If the report is larger than the threshold, the processor analyzes whether the report is in a distributable form and, if it is not, converts it into the distributable form. The distributable form is then provided to the set of storage devices to be divided into data object blocks that are distributed among the devices. In effect, a large result report is stored in the same way as a large flow input data set.
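The assemble, transmit, and re-store sequence might be sketched as follows in Python. The block index used for ordering is an assumption, since the claim does not say how block order is tracked, and `is_distributable` and `convert` are hypothetical callbacks supplied by the caller.

```python
def assemble_result_report(blocks_by_device: dict) -> str:
    """Join result-report blocks gathered from every storage device.

    Each device contributes (index, block) pairs; the index is an assumed
    ordering key.
    """
    ordered = sorted(
        (index, block)
        for device_blocks in blocks_by_device.values()
        for index, block in device_blocks
    )
    return "".join(block for _, block in ordered)


def handle_result_report(blocks_by_device, threshold, is_distributable, convert):
    """Return (report to transmit, form to hand back for distributed storage)."""
    report = assemble_result_report(blocks_by_device)
    to_store = None
    if len(report.encode("utf-8")) > threshold:
        # Only large reports are checked for, or converted into,
        # the distributable form.
        to_store = report if is_distributable(report) else convert(report)
    return report, to_store


blocks_by_device = {
    "dev_a": [(1, "b\n")],
    "dev_b": [(0, "a\n"), (2, "c\n")],
}
report, to_store = handle_result_report(
    blocks_by_device,
    threshold=3,
    is_distributable=lambda r: False,  # pretend the report needs conversion
    convert=str.upper,                 # placeholder conversion
)
```

In this sketch a return value of `None` for the second element means the report was not larger than the threshold, a case this claim does not address.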
21. A computer-implemented method comprising: receiving, by a processor, a first request to store a flow input data set in a federated area, wherein: at least one federated area is defined within storage space provided by at least one of a set of storage devices to store objects to perform a job flow; the objects to perform the job flow comprise a job flow definition that defines the job flow as a set of tasks to be performed, and a corresponding set of task routines to perform the set of tasks; processors associated with the set of storage devices cooperate to maintain a distributed file system as spanning storage spaces provided by each storage device of the set of storage devices; and as part of maintaining the distributed file system, at least one processor associated with of at least one storage device of the set of storage devices determines whether a data object received by the set of storage devices is to be stored as an undivided object or stored as a set of data object blocks into which the received data object is divided and distributed among the set of storage devices based on a size of the received data object compared to a distribution block size; comparing, by the processor, a size of the flow input data set to a threshold size that is based on the distribution block size to determine whether the size of the flow input data set is larger than the threshold size; and in response to a determination that the size of the flow input data set is larger than the threshold size, performing operations comprising: analyzing, by the processor, the flow input data set to determine whether the flow input data set is of a distributable form in which data items of the flow input data set are organized into a single homogeneous data structure such that, after the flow input data set is divided into a set of data object blocks, the data items remain accessible from each data object block of the flow input data set independently of the other data object blocks of the 
flow input data set; in response to a determination that the flow input data set is not of the distributable form of the flow input data set, performing operations comprising: converting, by the processor, the flow input data set from an original form and into the distributable form of the flow input data set; and following conversion of the original form of the flow input data set into the distributable form, providing the distributable form of the flow input data set to the set of storage devices to be divided by the set of storage devices into the set of data object blocks of the flow input data set that are to be stored in a distributed manner within a first federated area of the at least one federated area, wherein the first federated area is defined within the distributed file system.
This invention relates to distributed storage systems for job flow processing. The system addresses the challenge of efficiently storing and managing large input datasets in a federated storage environment where data is distributed across multiple storage devices. The system defines federated areas within a distributed file system, where each area stores objects needed to perform job flows. These objects include job flow definitions and corresponding task routines. The distributed file system spans storage spaces across multiple storage devices, with processors cooperating to manage data distribution. When a data object is received, the system determines whether to store it as an undivided object or split it into blocks based on its size relative to a predefined distribution block size. For large input datasets exceeding a threshold size, the system analyzes whether the data is in a distributable form, meaning it can be split into blocks while maintaining accessibility of individual data items. If the data is not in a distributable form, the system converts it into such a form before distributing it across storage devices. The converted data is then divided into blocks and stored in a federated area within the distributed file system. This approach ensures efficient storage and retrieval of large datasets in a distributed environment, optimizing job flow performance.
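The storage decision in this method can be summarized as a short Python sketch. The names are hypothetical, and the undivided branch comes from dependent claim 23 rather than claim 21 itself. The example conversion, from a JSON array to one JSON object per line, is an assumed illustration of turning a non-splittable form into a line-oriented one.

```python
import json


def store_flow_input(data: bytes, threshold: int, is_distributable, convert) -> tuple:
    """Return (storage mode, bytes to hand to the storage devices)."""
    if len(data) <= threshold:
        # Small data set: kept whole (the case covered by claim 23).
        return ("undivided", data)
    if is_distributable(data):
        # Already distributable: provide the original form as-is.
        return ("distributed", data)
    # Large and not distributable: convert before providing it for division.
    return ("distributed", convert(data))


def json_array_to_lines(data: bytes) -> bytes:
    """Assumed conversion: one JSON object per line instead of one big array."""
    items = json.loads(data.decode("utf-8"))
    return "".join(json.dumps(item) + "\n" for item in items).encode("utf-8")


def looks_line_oriented(data: bytes) -> bool:
    """Assumed check: treat anything that is not a JSON array as distributable."""
    return not data.lstrip().startswith(b"[")


data = b'[{"id": 1}, {"id": 2}]'
mode, payload = store_flow_input(data, 10, looks_line_oriented, json_array_to_lines)
```

With a threshold of 10 bytes, the 22-byte array is converted and marked for distributed storage; with a threshold of 100 bytes it would be stored undivided in its original form.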
22. The computer-implemented method of claim 21 , comprising, in response to a determination that the original form of the flow input data set is the distributable form of the flow input data set, providing the original form of the flow input data set to the set of storage devices to be divided by the set of storage devices into the set of data object blocks of the flow input data set that are to be stored in a distributed manner within the first federated area.
This invention relates to distributed data storage systems, specifically the case in which a large flow input data set already arrives in the distributable form. The method of claim 21 analyzes the flow input data set to determine whether it is of the distributable form. If the original form is the distributable form, no conversion is performed: the original form is provided directly to the set of storage devices, which divide it into a set of data object blocks. These blocks are stored in a distributed manner within the first federated area, which is the federated area defined within the distributed file system. Skipping the conversion avoids unnecessary preprocessing for data sets whose data items are already accessible from each block independently of the other blocks.
23. The computer-implemented method of claim 21 , comprising, in response to a determination that the size of the flow input data set is smaller than the predetermined threshold size, providing the flow input data set to the set of storage devices to be stored as an undivided object within storage space provided by a single storage device of the set of storage devices, and within a federated area comprising at least one of: the first federated area defined within the distributed file system; and a second federated area defined within storage space of a local file system maintained by one storage device of the set of storage devices.
This invention relates to data storage management in distributed and local file systems, specifically addressing the efficient handling of small data sets. The problem solved is the inefficiency of storing small data objects in distributed storage systems, where overhead from fragmentation and metadata management can outweigh the benefits of distribution. The method involves determining the size of a flow input data set and comparing it to a predetermined threshold. If the data set is smaller than the threshold, it is stored as an undivided object within a single storage device, rather than being distributed across multiple devices. The storage occurs in a federated area, which can be either a first federated area within a distributed file system or a second federated area within the storage space of a local file system managed by one of the storage devices. This approach reduces overhead by avoiding unnecessary fragmentation and metadata operations for small data sets while maintaining compatibility with both distributed and local storage architectures. The method ensures efficient storage allocation by dynamically adapting to the size of the input data.
24. The computer-implemented method of claim 21 , comprising: retrieving, from the set of storage devices, an indication of the distribution block size; and defining, by the processor, the threshold size to be less than or equal to the distribution block size.
This invention relates to data storage systems, specifically how the size threshold that governs distributed storage is chosen. The method adds two steps to claim 21. First, an indication of the distribution block size is retrieved from the set of storage devices; this is the size the distributed file system compares a received data object against when deciding whether to store it undivided or divide it into data object blocks. Second, the processor defines the threshold size to be less than or equal to that distribution block size. Tying the threshold to the reported block size keeps the processor's decision about whether a data set needs the distributable form aligned with how the storage devices will handle it.
25. The computer-implemented method of claim 21 , comprising: receiving, by the processor and from a remote device, a second request to retrieve the flow input data set; retrieving a stored indication of at least the size of the flow input data set; comparing, by the processor, the size of the flow input data set to the threshold size to determine whether the size of the flow input data set is larger than the threshold size; analyzing, by the processor, the retrieved indication to determine whether the flow input data set was converted into the distributable form from the original form prior to being provided to the set of storage devices; and in response to a determination that the size of the flow input data set is smaller than the threshold size or a determination that the flow input data set was not converted from the original form and into the distributable form, performing operations comprising: retrieving the original form of the flow input data set from the one or more storage devices; and transmitting, from the processor, the original form to the remote device.
This invention relates to a computer-implemented method for efficiently retrieving and transmitting data sets, particularly addressing the challenge of optimizing storage and retrieval of large data sets in distributed systems. The method involves receiving a request from a remote device to retrieve a flow input data set, which may exist in either an original form or a distributable form. The system first checks the stored size of the data set against a predefined threshold size. If the data set is smaller than the threshold or was not previously converted into the distributable form, the system retrieves the original form of the data set from storage and transmits it directly to the remote device. This approach ensures efficient data handling by avoiding unnecessary conversions for smaller data sets or those already in the desired form, thereby reducing processing overhead and improving retrieval performance. The method leverages stored metadata about the data set's size and conversion status to make these determinations, ensuring optimal resource utilization during data retrieval operations.
26. The computer-implemented method of claim 25 , wherein, in response to a determination that the size of the flow input data set is larger than the threshold size and a determination that the flow input data set was converted from the original form and into the distributable form, performing operations comprising: retrieving a stored indication of one or more characteristics of the original form of the flow input data set; retrieving the distributable form of the flow input data set from the one or more storage devices; employing, by the processor, the indication of the one or more characteristics to reverse the conversion to of the flow input data set to re-generate the original form; and transmitting, from the processor, the original form to the remote device.
This invention relates to data processing systems that handle large datasets, particularly those that convert data between different forms for storage or transmission. The problem addressed is efficiently managing and reconstructing original data forms when needed, especially when dealing with large datasets that have been converted into a distributable form for storage or transmission. The method involves determining whether a flow input data set exceeds a predefined threshold size and whether it has been converted from its original form into a distributable form. If both conditions are met, the system retrieves stored characteristics of the original form and the distributable form of the data set. Using these characteristics, the system reverses the conversion process to regenerate the original form of the data. Finally, the original form is transmitted to a remote device. The stored characteristics of the original form may include metadata, structural information, or other attributes that define how the data was originally structured. The distributable form is a modified version of the original data, optimized for storage or transmission, but may lack some of the original formatting or structure. The method ensures that when the original form is required, it can be accurately reconstructed and sent to a remote device without requiring the original data to be stored redundantly. This approach improves storage efficiency and reduces the need for redundant data storage while maintaining data integrity.
27. The computer-implemented method of claim 26 , wherein the stored indication of one or more characteristics comprises at least one of: a copy of metadata incorporated into the original form; an indication of a characteristic of at least one data structure by which data values are organized within the original form; and an indication of a characteristic of an indexing scheme by which data values are accessed within the original form.
This invention relates to the computer-implemented method of claim 26, specifically the information that is stored about the original form of a flow input data set so that its conversion into the distributable form can later be reversed. Here "original form" means the format in which the data set was received, not a document or fillable form. The stored indication of characteristics comprises at least one of: a copy of metadata incorporated into the original form, an indication of a characteristic of at least one data structure by which data values are organized within the original form, or an indication of a characteristic of an indexing scheme by which data values are accessed within the original form. When a remote device requests a data set that was converted before storage, the processor uses these stored characteristics to re-generate the original form from the distributable form.
28. The computer-implemented method of claim 21 , wherein: the distributed file system is Hadoop distributed file system (HDFS); and the distributable form of the flow input data set comprises at least one of: a text file comprising data items separated by delimiters; and a optimized row columnar (ORC) file comprising compressed data items.
This invention relates to distributed file systems, specifically the case where the distributed file system is the Hadoop Distributed File System (HDFS). The method of claim 21 converts a large flow input data set into a distributable form when it does not already have one, and this claim names the formats that qualify. The distributable form comprises at least one of a text file in which data items are separated by delimiters or an Optimized Row Columnar (ORC) file containing compressed data items. The text format is simple to parse, while ORC stores data in a compressed columnar layout. HDFS then divides the distributable form into blocks and stores them across the storage devices. The claim does not tie either format to a particular workload.
29. The computer-implemented method of claim 21 , wherein the processor is caused to perform operations comprising: receiving, by the processor and from a remote device, a second request to perform the job flow using the flow input data set as an input to the job flow performance, wherein: at least one result report is to be generated as an output of the job flow performance; and the job flow definition and each task routine of the set of task routines is stored as an undivided object within one storage device of a set of storage devices; retrieving the job flow definition and each task routine of the set of task routines from one or more storage devices of the set of storage devices; retrieving a stored indication of at least the size of the flow input data set; comparing the size of the flow input data set to the threshold size to determine whether the size of the flow input data set is larger than the threshold size; and in response to the determination that the flow input data set is larger than the threshold size, the performing operations comprising: generating, by the processor, a container that contains the job flow definition and the set of task routines to enable the processor associated with each storage device to independently perform an instance of the job flow using one of the data object blocks of the flow input data set stored locally within the storage device as an input to the instance, wherein each performance of an instance of the job flow within each storage device generates a corresponding data object block of a set of data object blocks of the result report; and providing a copy of the container to each storage device of the set of storage devices to enable the processors associated with at least two storage devices of the set of storage devices to perform instances of the job flow at least partially in parallel.
This invention relates to distributed job flow processing systems, specifically to handling large input data sets by running the job flow in parallel across multiple storage devices. It addresses the bottlenecks and delays that arise when a large data set is processed sequentially. The method begins when a remote device requests performance of a job flow that is to produce at least one result report. The job flow definition and each task routine are each stored as an undivided object within a single storage device. The system retrieves the job flow definition, the task routines, and a stored indication of the size of the flow input data set, then compares that size to the threshold size. If the input data set is larger than the threshold, the system generates a container holding the job flow definition and the task routines and provides a copy of it to each storage device. The processor associated with each storage device independently performs an instance of the job flow on the data object block of the input data set stored locally on that device, and each instance generates a corresponding data object block of the result report. Because at least two storage devices perform their instances at least partially in parallel, large data sets are processed more efficiently than they would be by a single sequential run.
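The container step can be sketched as follows, assuming threads stand in for the processors of separate storage devices. The names `Container` and `perform_job_flow`, the use of lists of integers as data object blocks, and the single-instance fallback for small inputs are all hypothetical choices for this sketch and are not taken from the patent.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

Block = List[int]  # stand-in for one data object block


@dataclass(frozen=True)
class Container:
    """Bundles the job flow definition with its task routines."""
    job_flow_definition: List[str]                      # ordered task names
    task_routines: Dict[str, Callable[[Block], Block]]  # task name -> routine

    def run(self, block: Block) -> Block:
        """Perform one instance of the job flow on a single block."""
        data = block
        for task in self.job_flow_definition:
            data = self.task_routines[task](data)
        return data


def perform_job_flow(container: Container, local_blocks: List[Block],
                     input_size: int, threshold_size: int) -> List[Block]:
    """Return the data object blocks of the result report."""
    if input_size <= threshold_size:
        # Assumed fallback: a small input is processed by a single instance.
        whole = [item for block in local_blocks for item in block]
        return [container.run(whole)]
    # Each worker models one storage device running its own copy of the
    # container against its locally stored block.
    with ThreadPoolExecutor(max_workers=max(1, len(local_blocks))) as pool:
        return list(pool.map(container.run, local_blocks))


container = Container(
    job_flow_definition=["double"],
    task_routines={"double": lambda block: [2 * x for x in block]},
)
result_blocks = perform_job_flow(container, [[1, 2], [3, 4]],
                                 input_size=4, threshold_size=3)
```

Each input block yields exactly one result block, mirroring the claim's requirement that every instance generates a corresponding data object block of the result report.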
30. The computer-implemented method of claim 29, comprising: retrieving, from each storage device of the multiple storage devices, at least one data object block of the set of data object blocks of the result report; assembling, by the processor, the result report from the set of data object blocks of the result report; transmitting, from the processor, the result report to the remote device; comparing, by the processor, a size of the result report to the threshold size to determine whether the size of the result report is larger than the threshold size; and in response to a determination that the size of the result report is larger than the threshold size, performing operations comprising: analyzing, by the processor, the result report to determine whether the result report is of a distributable form of the result report; in response to a determination that the result report is not of the distributable form of the result report, converting, by the processor, the result report into the distributable form of the result report; and providing the distributable form of the result report to the set of storage devices to be divided into the set of data object blocks of the result report that are to be distributed among the set of storage devices.
This invention relates to a distributed data processing system for assembling, transmitting, and storing large result reports across multiple storage devices. It addresses the challenge of handling result reports that exceed a threshold size so that they can still be stored in a scalable way. The method retrieves the data object blocks of a result report from the storage devices, assembles the full report from them, and transmits the report to the remote device that requested the job flow. The system then compares the size of the report to the threshold size. If the report is larger than the threshold, it is analyzed to determine whether it is already in a distributable form. If it is not, the report is converted into a distributable form. The distributable form is then provided to the storage devices, which divide it into data object blocks and distribute those blocks among themselves. In this way the format and storage layout of a report depend on its size: only reports above the threshold are checked, converted if necessary, and stored as distributed blocks.
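The assemble, compare, convert, and divide steps above can be sketched as follows. The function names, the byte-based sizes, the choice to split only on row boundaries, and the caller-supplied `is_distributable` and `to_distributable` callables are assumptions for illustration. Storing a small report as one undivided object follows the behavior described for the distributed file system in claim 1.

```python
from typing import Callable, List


def assemble_report(blocks: List[bytes]) -> bytes:
    """Rebuild the result report from its data object blocks, in order."""
    return b"".join(blocks)


def split_on_rows(report: bytes, block_size: int) -> List[bytes]:
    """Divide a row-oriented report into blocks without cutting a row.
    A single row longer than block_size still gets a block of its own."""
    blocks: List[bytes] = []
    current = b""
    for line in report.splitlines(keepends=True):
        if current and len(current) + len(line) > block_size:
            blocks.append(current)
            current = b""
        current += line
    if current:
        blocks.append(current)
    return blocks


def store_report(report: bytes, threshold_size: int, block_size: int,
                 is_distributable: Callable[[bytes], bool],
                 to_distributable: Callable[[bytes], bytes]) -> List[bytes]:
    """Return the objects to store: one undivided report, or its blocks."""
    if len(report) <= threshold_size:
        return [report]  # small report is stored as an undivided object
    if not is_distributable(report):
        report = to_distributable(report)  # convert before distribution
    return split_on_rows(report, block_size)


report = assemble_report([b"a 1\n", b"b 2\n", b"c 3\n"])
stored = store_report(
    report, threshold_size=8, block_size=8,
    is_distributable=lambda r: b"," in r,         # toy check for this example
    to_distributable=lambda r: r.replace(b" ", b","),  # toy conversion
)
```

Splitting on row boundaries is one way to keep the data items in each block readable on their own, which is the property the claims attach to a distributable form.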
May 19, 2020