US-12591565-B2

Predicting purge effects in hierarchical data environments

Published: March 31, 2026

Assignee: not available

Inventors: not available

Technical Abstract

Systems and methods described herein relate to the prediction of effects of data purging on data sources that are related through hierarchical data relationships. A purge request comprises a set of purge parameters that identify a data source and define one or more purge criteria for purging of data items of the data source. A plurality of impacted data sources is identified based on one or more hierarchical data relationships held by the data items of the data source. The impacted data sources include the data source and one or more additional data sources. The purge parameters are provided to a machine learning model to obtain output indicative of a predicted effect of execution of the purge request on the impacted data sources. The predicted effect is caused to be presented at the user device prior to the execution of the purge request.

Patent Claims


. A system comprising:

. The system of, wherein the predicted effect is a first predicted effect, the operations further comprising:

. The system of, wherein the plurality of impacted data sources comprise a first set of impacted data sources, the adjustment of the set of purge parameters causes identification of a second set of impacted data sources, and the adjusted output is indicative of the second predicted effect of the execution of the purge request on the second set of impacted data sources.

. The system of, wherein the predicted effect comprises a data source-specific effect for each of the plurality of impacted data sources.

. The system of, wherein the causing of the presentation of the predicted effect at the user device comprises causing presentation of the data source-specific effect for each of the plurality of impacted data sources separately within a graphical user interface.

. The system of, wherein the predicted effect comprises an overall effect covering the plurality of impacted data sources.

. The system of, wherein the set of purge parameters comprises at least one of: an identifier of the data source, the one or more purge criteria, user data, a purge scope, a data retention policy, a purge date, or a purge time.

. The system of, wherein the operations comprise providing, to the machine learning model, the set of purge parameters from the purge request together with one or more additional purge parameters that identify at least a subset of the plurality of impacted data sources.

. The system of, wherein the predicted effect comprises at least one of: a predicted purge volume, a predicted number of data items purged, or a predicted data purge execution duration.

. The system of, the operations further comprising:

. The system of, wherein each impacted data source corresponds to a respective functional module of a cloud-based service, and the user device is associated with a user account held with the cloud-based service.

. The system of, wherein the machine learning model comprises at least one feedforward neural network (FNN).

. A method comprising:

. The method of, wherein the predicted effect is a first predicted effect, the method further comprising:

. The method of, comprising providing, to the machine learning model, the set of purge parameters from the purge request together with one or more additional purge parameters that identify at least a subset of the plurality of impacted data sources.

. A non-transitory computer-readable medium that stores instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising:

. The non-transitory computer-readable medium of, wherein the predicted effect is a first predicted effect, the operations further comprising:

. The non-transitory computer-readable medium of, wherein the operations comprise providing, to the machine learning model, the set of purge parameters from the purge request together with one or more additional purge parameters that identify at least a subset of the plurality of impacted data sources.

Detailed Description


The subject matter disclosed herein generally relates to data purging. More specifically, but not exclusively, the subject matter relates to the prediction of effects of data purging on data sources that are related through hierarchical data relationships.

Data purging is an important task in many business environments. For example, data purging operations can free up storage space, improve system speeds, reduce costs, better protect sensitive information, or ensure compliance with data regulations, such as the European Union's General Data Protection Regulation (GDPR). A purging engine of a data purging system (e.g., a data purging system operating in a cloud-based environment) may, in response to a purge request, automatically perform a data purging process according to specified purge criteria.

Hierarchical data relationships between data items can impact a data purging process. For example, a purge request may indicate that certain “parent” data should be purged from a particular data source (e.g., records containing employee numbers are to be deleted from a human resources module). Other data sources may contain “child” data relying on the parent data (e.g., records of an information technology module that utilize the employee numbers to track computer equipment allocations). In some cases, the parent data and child data are then purged as part of the same purging operation. The effect of execution of a purge request can therefore be unpredictable. For example, when a user generates a new purge request that targets one module, the user has limited visibility into the storage space that will be freed up by the purge request, or the time it will take to execute the purge request, due to possible cascading purging in other modules caused by hierarchical data relationships.

With modern computing systems hosting various types of data in cloud storage, it may be important for entities to manage constantly increasing data volumes. Executing regular data purges to delete certain data can be vital for controlling cloud-related costs (e.g., to ensure that storage needs, and thus costs, do not balloon out of control).

The term “data purging” (or simply “purging”), as used herein, generally refers to a process of deleting or removing data from a system or device. A data purge may be a soft purge or a hard purge. The term “soft purge,” as used herein, refers to purging one or more data items in a reversible manner or in a manner that otherwise allows for recovery of the purged data. For example, a soft purge may involve marking a record in a table as “deleted,” or transferring a file to a recycle bin from where it can be recovered. The term “hard purge,” as used herein, refers to purging one or more data items in a permanent or irreversible manner. For example, a hard purge may involve removing a data item from all systems or devices, including backup systems or devices, in such a manner that the data item is not recoverable after the purging process (e.g., the hard purge cannot subsequently be reversed).
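The difference between the two purge types can be illustrated with a minimal sketch. The `Record` class, its `deleted` flag, and the function names are hypothetical and do not come from the disclosure; a real hard purge would also have to cover backup systems, which this in-memory example ignores.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    payload: str
    deleted: bool = False  # soft-purge marker (hypothetical field)


def soft_purge(table: dict[str, Record], key: str) -> None:
    """Mark the record as deleted; the data stays recoverable."""
    table[key].deleted = True


def restore(table: dict[str, Record], key: str) -> None:
    """Reverse a soft purge by clearing the marker."""
    table[key].deleted = False


def hard_purge(table: dict[str, Record], key: str) -> None:
    """Remove the record entirely; nothing remains to restore."""
    del table[key]


table = {"e1": Record("e1", "Alice"), "e2": Record("e2", "Bob")}
soft_purge(table, "e1")
restore(table, "e1")     # the soft purge is reversible
hard_purge(table, "e2")  # the hard purge is not
```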

A purge request may specify or identify at least one data source. The term “data source,” as used herein, refers to any collection or repository of data items, or a component or module associated with such a collection or repository, that can be acted upon as a logical unit. This may include, for example, functional modules within enterprise systems (e.g., human resources, information technology, payroll, or time management modules of a cloud-based enterprise service), databases, tables, data warehouses, files, or other logical groupings of data items that have some shared significance or purpose within a computing environment. A data source may comprise a subset of data from a larger data collection (e.g., a subset that is unified by common attributes, dependencies, business functions, or intended usage).

The term “data item,” as used herein, refers to any unit, collection, or aggregation of data that can be stored digitally. A data item may comprise any type, structure, or format of data. Non-limiting examples of data items include a database record, a table, a file (e.g., a document, image, video, or audio file), a folder (e.g., a folder containing multiple files), an email message, a packet of raw data, a data object instance, a cell or row in a spreadsheet, a log entry or log file, or metadata associated with a data object.

Data sources, or data items associated with different data sources, may be arranged in hierarchical data relationships, where one data source or data item relies on or interacts with another. Data systems often utilize hierarchical data structures to organize and relate information. For example, in a business context, employee data may comprise foundational “parent” or “master” data that resides at a top level of a data hierarchy, not dependent on any other data. Lower levels of related data rely on the parent or master data. For example, data items used in time sheets or payroll typically rely on employee data, such as employee identifiers. Through a tree-like hierarchy (e.g., using foreign key relationships or dependencies), additional layers of connected data can branch from the parent data.

Hierarchical data relationships may include parent-child relationships or more complex structures with multiple connections or dependencies. Hierarchical data relationships may, for example, be found in data sources relating to organizational structures, enterprise resource planning systems, file systems, manufacturing systems, social networks, source code, or product catalogs.

When purging data items in a hierarchical data environment, it is often necessary to purge not only parent data items, but also their associated child or dependent data items. For example, this may be done to free up additional storage, ensure privacy, maintain data integrity (e.g., to avoid child records remaining in data systems in an “orphaned” state), or avoid anomalies, such as confusing outputs that reference records or identifiers that no longer exist. A purging engine may be configured to follow natural hierarchical data relationships, deleting downstream data before top-level data.
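As an illustration of deleting downstream data before top-level data, the following sketch computes a child-first (post-order) deletion order over a parent-to-children mapping. The mapping, the item identifiers, and the `purge_order` function are assumptions made for this example, not the purging engine described in the patent.

```python
def purge_order(children: dict[str, list[str]], root: str) -> list[str]:
    """Return items in child-first (post-order) deletion order."""
    order: list[str] = []
    seen: set[str] = set()

    def visit(item: str) -> None:
        if item in seen:
            return
        seen.add(item)
        for child in children.get(item, []):
            visit(child)
        order.append(item)  # a parent is appended only after all its children

    visit(root)
    return order


# Hypothetical hierarchy: an employee record with dependent records.
children = {
    "employee:42": ["timesheet:7", "laptop:3"],
    "timesheet:7": ["payslip:9"],
}
order = purge_order(children, "employee:42")
```

Here `order` lists the payslip before the time sheet and the employee record last, so no child record is left orphaned at any point during the purge.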

It may be desirable or even necessary to predict the impacts of data purges. For example, a company may have a 10 terabyte (TB) cloud storage quota contracted for a 3-year period. Advance purge planning may be required to ensure that the storage quota is not exceeded, or to plan for contract changes to the extent required. However, as a result of hierarchical data relationships between data items in different data sources, there may be a lack of visibility into a purge request's potential effects (e.g., the volume to be purged or expected completion time). For example, a user may generate a purge request to purge obsolete data from one data source that contains parent data items, without understanding the knock-on effect that the data purging process will have on child data items in other data sources. This may make it challenging to design an efficient purging routine.

Examples described herein leverage historical purge data to enable more accurate impact forecasting. A machine learning model may be trained to take the hierarchical nature of data items and the effects thereof on data purges (e.g., on the number of data items purged, the purge volume, or the data purge execution duration) into account. In some examples, the machine learning model comprises one or more feedforward neural networks (FNNs).

In some examples, the input parameters of a purge request and the outcome of the data purging process are parameterized, with purge requests being executed multiple times to obtain training data, thereby enabling the use of machine learning algorithms to build a connection between the input parameters and the outcome of the data purging process. A system collects input parameters (also referred to herein as “purge parameters”) and output results from previously executed data purge jobs. The parameters and results may be formatted into multi-dimensional arrays, and split into training and testing sets. In some examples, the parameters and results are grouped and the groups are utilized in the training process.
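The formatting and splitting step might look like the sketch below. The history rows, the feature encoding, and the 75/25 split are illustrative assumptions (the values are invented for the example); the disclosure does not specify a particular encoding or split ratio.

```python
import random

# Hypothetical history of executed purge jobs: (purge parameters, observed result).
history = [
    ({"source": "hr", "retention_days": 365, "hard": False}, {"items": 1200, "seconds": 40.0}),
    ({"source": "it", "retention_days": 730, "hard": True}, {"items": 300, "seconds": 12.0}),
    ({"source": "payroll", "retention_days": 365, "hard": True}, {"items": 900, "seconds": 31.0}),
    ({"source": "hr", "retention_days": 1095, "hard": False}, {"items": 450, "seconds": 15.0}),
]

SOURCES = ["hr", "it", "payroll"]  # assumed catalogue of data sources


def encode(params: dict) -> list[float]:
    """Turn purge parameters into a numeric vector (one-hot source + scalars)."""
    one_hot = [1.0 if params["source"] == s else 0.0 for s in SOURCES]
    return one_hot + [params["retention_days"] / 365.0, 1.0 if params["hard"] else 0.0]


def split(rows: list, test_fraction: float = 0.25, seed: int = 0) -> tuple[list, list]:
    """Shuffle deterministically and split into (training, testing) sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


X = [encode(p) for p, _ in history]
y = [[r["items"], r["seconds"]] for _, r in history]
train, test = split(list(zip(X, y)))
```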

The machine learning model may train on the training set, learning relationships between input parameters and output results. When a new data purge job arises, the system may take the input parameters and generate output using the machine learning model. The output may be indicative of a predicted effect of execution of the relevant purge request.
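To make the train-then-predict cycle concrete, here is a minimal feedforward network written with the standard library only. It is a sketch under stated assumptions: one tanh hidden layer, a single linear output, plain stochastic gradient descent on squared error, and invented, pre-scaled inputs and targets. The patent does not disclose the architecture or training procedure at this level of detail.

```python
import math
import random


class TinyFNN:
    """One hidden tanh layer and a single linear output unit."""

    def __init__(self, n_in: int, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x: list[float]) -> list[float]:
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x: list[float]) -> float:
        h = self._hidden(x)
        return sum(w * hi for w, hi in zip(self.w2, h)) + self.b2

    def loss(self, xs: list[list[float]], ys: list[float]) -> float:
        """Mean squared error over a data set."""
        return sum((self.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

    def train(self, xs, ys, epochs: int = 500, lr: float = 0.05) -> None:
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                h = self._hidden(x)
                err = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2 - y
                for j, hj in enumerate(h):
                    # Backpropagate through tanh before touching w2[j].
                    grad_h = err * self.w2[j] * (1.0 - hj * hj)
                    self.w2[j] -= lr * err * hj
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * grad_h * xi
                    self.b1[j] -= lr * grad_h
                self.b2 -= lr * err


# Invented, scaled training data: [is_hard_purge, retention_years] -> scaled purge volume.
xs = [[0.0, 0.2], [1.0, 0.2], [0.0, 1.0], [1.0, 1.0]]
ys = [0.1, 0.3, 0.5, 0.9]

net = TinyFNN(n_in=2, n_hidden=4)
before = net.loss(xs, ys)
net.train(xs, ys)
after = net.loss(xs, ys)
```

Comparing `after` with `before` shows whether training reduced the error on this toy data; a production model would be evaluated on the held-out testing set instead.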

A method may include receiving, from a user device, a purge request that comprises a set of purge parameters. The user device may be associated with a user account held with a cloud-based service. The purge parameters identify a data source and define one or more purge criteria for purging of data items of the data source. Examples of purge parameters include: an identifier of the data source, the one or more purge criteria, user data, a purge scope, a data retention policy, a purge date, or a purge time.

The method may include identifying, based on one or more hierarchical data relationships held by the data items of the data source, a plurality of impacted data sources. The impacted data sources may include the data source and one or more additional data sources. In some examples, impacted data sources are used as purge parameters.

The term “primary data source,” as used herein, refers to a data source that is specified in the purge request, or which the purge request is specifically targeted at. A purge request may have multiple primary data sources. A primary data source may also be referred to as a “target data source.” The term “secondary data source,” as used herein, refers to a data source that is not specified in the purge request, or which the purge request is not specifically targeted at, but which will be affected by execution of the data purge request due to existing hierarchical data relationships (e.g., a data structure that will cause a purging engine to purge child data from the secondary data sources together with parent data from the primary data source). A secondary data source may thus be identified as an additional data source impacted by a purge. Each impacted data source may correspond to a respective functional module of the cloud-based service.
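One way to derive the impacted data sources from the primary data sources is a breadth-first walk over a dependency map, sketched below. The `dependents` mapping and the module names are hypothetical; the disclosure does not prescribe a particular traversal.

```python
from collections import deque


def impacted_sources(dependents: dict[str, list[str]], primary: list[str]) -> list[str]:
    """Walk breadth-first from the primary sources to every dependent source."""
    seen = list(primary)  # keeps discovery order; primary sources come first
    queue = deque(primary)
    while queue:
        source = queue.popleft()
        for child in dependents.get(source, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


# Hypothetical map: parent module -> modules holding child data.
dependents = {"hr": ["payroll", "it"], "payroll": ["tax"]}

impacted = impacted_sources(dependents, ["hr"])
secondary = [s for s in impacted if s != "hr"]
```

In this example `impacted` contains the primary source `hr` plus three secondary sources, including `tax`, which depends on `hr` only indirectly through `payroll`.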

The method may include identifying the one or more hierarchical data relationships based on relationships or dependencies, such as one or more parent-child relationships, between first data items of the primary data source and second data items of the one or more secondary data sources. In some examples, the machine learning model generates the output based on learned connections without having to identify the hierarchical data relationships.

The purge parameters of the purge request may be provided to a machine learning model to obtain output indicative of the predicted effect of execution of the purge request on the impacted data sources. In some examples, the purge parameters from the purge request may be provided together with one or more additional purge parameters that identify the impacted data sources (or at least a subset of the impacted data sources, such as the secondary data sources). In other words, in some examples, parameters specified by the user together with identifiers of the impacted data sources may be provided as input to the machine learning model.
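Combining user-specified parameters with identifiers of the impacted data sources could be done with a multi-hot encoding, as in this sketch. The source catalogue, the `older_than_days` criterion, and the feature layout are assumptions chosen for illustration.

```python
ALL_SOURCES = ["hr", "it", "payroll", "time"]  # hypothetical module catalogue


def build_features(request: dict, impacted: list[str]) -> list[float]:
    """Concatenate the primary source, the impacted sources, and a purge criterion."""
    primary = [1.0 if s == request["source"] else 0.0 for s in ALL_SOURCES]
    affected = [1.0 if s in impacted else 0.0 for s in ALL_SOURCES]
    return primary + affected + [request["older_than_days"] / 365.0]


features = build_features(
    {"source": "hr", "older_than_days": 730},
    impacted=["hr", "it", "payroll"],
)
```

The first four values mark the primary data source, the next four mark every impacted data source, and the last value carries the age criterion in years.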

The predicted effect may, for example, be a predicted purge volume, a predicted number of data items purged, or a predicted data purge execution duration. In some examples, the predicted effect is presented at the user device prior to the execution of the purge request. For example, a user interface, such as a purge management interface, may present the predicted effect at the user device in relation to the purge request. The user may finalize the purge request (e.g., cause it to be submitted for execution) or adjust the purge request.

The method may include receiving, from the user device, user input to adjust the set of purge parameters of the purge request (e.g., after the user has reviewed a first predicted effect). The purge parameters may be adjusted, and the adjusted purge parameters may be provided to the machine learning model to obtain adjusted output indicative of a second predicted effect of the execution of the purge request. The second predicted effect may then be presented at the user device.

Adjustment of the purge parameters may change the impacted data sources. For example, as a result of the change in the purge parameters, the primary data source may change, one or more of the secondary data sources may no longer be impacted, or one or more additional secondary data sources may become impacted data sources. The adjustment of the set of purge parameters may thus cause identification of a second set of impacted data sources, with the adjusted output being indicative of the predicted effect of the execution of the purge request on the second set of impacted data sources.
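The preview, adjust, and re-predict cycle can be sketched as follows. Everything here is a stand-in: the `DEPENDENTS` map, the `scope` values, and especially `stub_model`, which replaces the trained machine learning model with an invented formula so that the example stays self-contained.

```python
DEPENDENTS = {"hr": ["it", "payroll"]}  # hypothetical parent -> child modules


def impacted_for(request: dict) -> list[str]:
    """Identify impacted sources; only a 'cascade' scope reaches dependents."""
    sources = [request["source"]]
    if request["scope"] == "cascade":
        sources += DEPENDENTS.get(request["source"], [])
    return sources


def stub_model(request: dict, impacted: list[str]) -> int:
    """Stand-in for a trained model: a made-up items-purged estimate."""
    return 500 * len(impacted)


def preview(request: dict) -> tuple[list[str], int]:
    """Return the impacted sources and the predicted effect for a request."""
    impacted = impacted_for(request)
    return impacted, stub_model(request, impacted)


request = {"source": "hr", "scope": "cascade"}
first_sources, first_effect = preview(request)

adjusted = {**request, "scope": "source_only"}  # the user narrows the scope
second_sources, second_effect = preview(adjusted)
```

Narrowing the scope shrinks the set of impacted data sources from three to one, and the second predicted effect changes accordingly.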

In some examples, the predicted effect generated by the machine learning model comprises a data source-specific effect for each of the impacted data sources. In other words, the machine learning model may generate a separate result or impact prediction for each impacted data source. The data source-specific effects may be separately presented at the user device. In some cases, the predicted effect generated by the machine learning model comprises an overall effect that covers all of the impacted data sources (as opposed to generating a separate result or impact prediction for each impacted data source).
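The two output shapes can be related with a small sketch. Note the assumption: the disclosure describes the model itself generating an overall effect, whereas this example simply sums hypothetical per-source predictions, which is only reasonable if the sources are purged one after another. The `Effect` fields and all values are invented.

```python
from dataclasses import dataclass


@dataclass
class Effect:
    items: int
    volume_mb: float
    seconds: float


def overall(per_source: dict[str, Effect]) -> Effect:
    """Sum per-source effects; adding durations assumes sequential execution."""
    effects = list(per_source.values())
    return Effect(
        items=sum(e.items for e in effects),
        volume_mb=sum(e.volume_mb for e in effects),
        seconds=sum(e.seconds for e in effects),
    )


# Hypothetical data source-specific predictions.
per_source = {
    "hr": Effect(items=1200, volume_mb=12.5, seconds=40.0),
    "it": Effect(items=300, volume_mb=3.5, seconds=10.0),
}
total = overall(per_source)
```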

The output of the machine learning model may be used in various downstream operations. In some examples, a data purging system automatically schedules the execution of the data purge request based on the predicted effect. A processor-implemented purging component (e.g., a purging engine) then executes the purge request in accordance with the scheduling.
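A scheduler could use the predicted execution duration to pick a maintenance window, as in this sketch. The window list and the first-fit rule are assumptions for illustration; the patent does not describe a specific scheduling policy.

```python
from datetime import datetime, timedelta
from typing import Optional


def pick_window(
    predicted: timedelta,
    windows: list[tuple[datetime, datetime]],
) -> Optional[datetime]:
    """Return the start of the first window long enough for the predicted run."""
    for start, end in windows:
        if end - start >= predicted:
            return start
    return None  # no window fits; the job needs manual planning


# Hypothetical maintenance windows.
windows = [
    (datetime(2026, 1, 1, 1, 0), datetime(2026, 1, 1, 1, 30)),  # 30 minutes
    (datetime(2026, 1, 2, 1, 0), datetime(2026, 1, 2, 3, 0)),   # 2 hours
]

start = pick_window(timedelta(minutes=45), windows)
too_long = pick_window(timedelta(hours=5), windows)
```

A 45-minute predicted duration skips the first window and lands in the second, while a five-hour prediction fits nowhere and returns `None`.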

As mentioned, the machine learning model may be trained on historical purge data. The historical purge data may include a plurality of input-output pairs, with each input-output pair including a set of purge parameters and at least one corresponding purge effect. For example, the purge parameters of a particular input-output pair may identify a target or primary data source for a purge request associated with the input-output pair, together with one or more purge criteria used in that specific purge request.

In some examples, the purge effect of each input-output pair is specific to an impacted data source. In other words, the purge effect in the output of the input-output pair may relate only to one data source, e.g., one impacted functional module. Execution of one purge request may thus result in creation of multiple input-output pairs, each corresponding to a different data source. The input-output pairs may be grouped by impacted data source, and the machine learning model may be trained using such grouped input-output pairs.

In other cases, the purge effect of each input-output pair covers all impacted data sources. In other words, the purge effect in the output of the input-output pair may relate to multiple data sources, e.g., multiple impacted functional modules. Execution of one purge request may thus result in creation of a single input-output pair, in which case the machine learning model may be trained without grouping of input-output pairs.
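The two ways of building input-output pairs described above can be contrasted in a short sketch. The job records, the use of an items-purged count as the purge effect, and the helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical executed jobs: purge parameters plus items purged per impacted source.
jobs = [
    ({"source": "hr", "older_than_days": 365}, {"hr": 100, "it": 40}),
    ({"source": "hr", "older_than_days": 730}, {"hr": 70, "payroll": 20}),
]


def per_source_pairs(jobs: list) -> dict[str, list[tuple[dict, int]]]:
    """One input-output pair per impacted source, grouped by that source."""
    groups: dict[str, list[tuple[dict, int]]] = defaultdict(list)
    for params, effects in jobs:
        for source, items in effects.items():
            groups[source].append((params, items))
    return dict(groups)


def overall_pairs(jobs: list) -> list[tuple[dict, int]]:
    """One input-output pair per job, covering all impacted sources."""
    return [(params, sum(effects.values())) for params, effects in jobs]


grouped = per_source_pairs(jobs)
combined = overall_pairs(jobs)
```

Two jobs yield four grouped pairs (two for `hr`, one each for `it` and `payroll`) in the first variant, but only two pairs in the second.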

Examples described herein may address or alleviate technical problems associated with data purging systems. For example, accurate predictions of data storage needs and growth rates may be enabled by analyzing historical purge data that include data with hierarchical data relationships, thus improving the management of storage capacity and reducing storage costs. Furthermore, issues such as abnormal data increases (e.g., abnormal spikes in data growth due to bugs or misconfigurations) or unexpected, cascading purging of downstream data, may be detected and addressed.

In some examples, purge job scheduling can be optimized by providing a purge job scheduling component of a data purging system with an accurate impact prediction that is useful in allocating appropriate computing resources to a purge job. This may improve the functioning of the data purging system, including its overall efficiency in executing purge jobs.

The machine learning-driven techniques described herein may further improve the functioning of a data purging system by enabling users to preview purge results for specified purge parameters. As mentioned, the purge results may be indicative of the impact that a proposed purge may have on one or more impacted data sources. This can empower users to make data-driven decisions when configuring and scheduling purge routines. By accounting for hierarchical data relationships, users may be provided with better visibility into the impacts of a purge request prior to execution thereof.

When the effects in this disclosure are considered in aggregate, one or more of the methodologies described herein may obviate a need for certain efforts or resources that otherwise would be involved in data storage systems, data purging processes or data purging management, such as reactive adjustments of purging schedules or purge job settings resulting from storage capacity issues. Examples of such computing resources may include processor cycles, network traffic, memory usage, graphics processing unit (GPU) resources, data storage capacity, power consumption, and cooling capacity.

A diagram represents a networked computing environment in which some examples of the present disclosure may be implemented or deployed. One or more servers in a server system provide server-side functionality via a network to a networked device, in the example form of a user device that is accessed by a user. The user may, for example, be a customer accessing one or more products or services provided by a service provider via the server system. Examples of the products or services are provided below. An administrator, such as an administrator associated with the service provider, may also access the server system via the network (e.g., by using an administrator device).

A web client (e.g., a browser) or a programmatic client (e.g., an “app”) may be hosted and executed on the user device. Although not shown in the diagram, the administrator device of the administrator may be similar to the user device and also host and execute a similar web client or programmatic client.

An Application Program Interface (API) server and a web server provide respective programmatic and web interfaces to components of the server system. A specific application server hosts a data purging system, which includes components, modules, or applications. Storage system servers host or provide access to a storage system. For example, the storage system may be a distributed cloud-based storage system, such as a Hadoop Distributed File System (HDFS).

The user device can communicate with the application server. For example, communication can occur via the web interface supported by the web server or via the programmatic interface provided by the API server. It will be appreciated that, although only a single user device is shown, a plurality of user devices may be communicatively coupled to the server system in some examples. Further, while certain functions may be described herein as being performed at either the user device or administrator device (e.g., web client or programmatic client) or the server system, the location of certain functionality either within the user device or administrator device, or the server system, may be a design choice.

The application server is communicatively coupled to the storage system servers, facilitating access to one or more information storage repositories. The storage may, for example, include one or more databases or file systems. In some examples, the storage system servers provide access to storage devices that store data items to be purged by the data purging system (e.g., files, records, or logs). In some examples, the storage system servers may also be accessed by the user using the user device (e.g., to add new files or modify files), or by the administrator using the administrator device. The storage system servers may be accessed directly, or via the API server or web server, depending on the implementation.

The application server accesses application data (e.g., application data stored by the storage system servers) to provide one or more applications or software tools to the user device or the administrator device (e.g., via a web interface or an app interface). As described further below according to examples, the application server, using the data purging system, may provide one or more tools or functions for performing data purges and predicting the effect of data purges on one or more data sources (e.g., predicting volume changes, numbers of data items removed, or purge duration).

In some examples, the server system is part of a cloud-based platform or cloud-based service provided by a software provider that allows the user to utilize features of one or more of the storage system servers and the data purging system. The user may utilize one or more software offerings of the software provider, such as a data storage solution, an accounting module, a human resources module, a planning module, or an enterprise resource planning module. These modules may be regarded as functional modules of the cloud-based platform or cloud-based service. Such functional modules may represent respective data sources that can be targeted or selected for purging, as described further below. For each data source or module, the server system may cause data items to be stored in the storage.

For example, the user may store data items via the storage system servers and make use of the data purging system to perform purges according to one or more purge policies (e.g., to ensure compliance with data retention regulations). Different purge policies may be applied to different offerings or modules, or even within the same offering or module. In some examples, the user can transmit a purge request to initiate or cause scheduling of a data purge. The user may upload one or more purge policies to the data purging system via the user device. The data purging system then uses each purge policy to schedule purge jobs.

In some examples, the data purging system is a centralized system configured to execute automated data purging operations on one or more storage systems associated with an enterprise based on defined purge policies. The data purging system provides a platform to apply retention rules for deleting obsolete, redundant, or unnecessary data. The data purging system may also delete specific or custom data items on request.

The data purging system may generate predictions to provide the user or administrator with visibility into an expected or predicted effect of a data purge (or a series of data purges). The data purging system may also provide one or more dashboards via a graphical user interface on the user device or the administrator device, such as a dashboard that allows the user or administrator to create, adjust, track, monitor, or manage data purges. The graphical user interface may also present the predictions referred to herein.

As mentioned, the storage system servers may provide access to a distributed storage system that is accessed by the data purging system to purge data. Purging may be performed to free up storage space or reduce costs, and may be driven by user instructions, purge policies, or regulatory compliance (e.g., a law requiring data to be completely removed from a system after a certain period). In some examples, the distributed storage system comprises an HDFS or other distributed file system (DFS). A DFS is a file system that enables clients to access file storage from multiple hosts through a computer network. Files may be spread across multiple storage servers in multiple locations, and hierarchical data relationships may exist between files. In some examples, a DFS can be designed so that geographically distributed users, such as remote workers and distributed teams, can access and share files remotely as if they were stored locally.

A DFS may cluster together multiple storage nodes that each have their own computing power and storage and distribute data sets across multiple nodes. In some examples, data items are replicated onto multiple servers, which enables redundancy to keep data highly available. The data on a DFS can reside on various types of storage devices, such as solid-state drives and hard disk drives, and examples described herein are not restricted to a particular type of storage device.

One or more of the application server, the data purging system, the storage system servers, the API server, the web server, or parts thereof, may each be implemented in a computer system, in whole or in part, as described below. In some examples, external applications, such as an external application executing on an external server, can communicate with the server system via the programmatic interface provided by the API server. For example, a third-party application may support one or more features or functions on a website or platform hosted by a third party, or may perform certain methodologies and provide input or output information to the server system for further processing or publication. The external application may, for example, access the storage system servers to view or modify files, or access the data purging system to view the status of data purge jobs or data purge effect predictions.

The network may be any network that enables communication between or among machines, databases, and devices. Accordingly, the network may be a wired network, a wireless network (e.g., a mobile or cellular network), or any suitable combination thereof. The network may include one or more portions that constitute a private network, a public network (e.g., the Internet), or any suitable combination thereof.

One or more of the components described above may be implemented using hardware (e.g., one or more processors of one or more machines) or a combination of hardware and software. For example, a component may be implemented by a processor configured to perform the operations described herein for that component. Moreover, two or more of these components may be combined into a single component, or the functions described herein for a single component may be subdivided among multiple components. Furthermore, according to various examples, components described herein may be implemented using a single machine, database, or device, or be distributed across multiple machines, databases, or devices.

A further diagram illustrates components of the data purging system, according to some examples. The diagram also shows the user device and the administrator device, which are communicatively coupled with the data purging system. The data purging system is shown to include a purge request handling component, a purge effect prediction component, a purging engine, and a retention management component.

The purge request handling component may provide an interface for users or administrators to create or submit purge requests (e.g., by uploading purge policies or creating once-off purge jobs) and view purge effect predictions. The purge request handling component may include a purge job previewer that provides at least some of these functions.

In some examples, the purge job previewer works with the purge effect prediction component to provide, at the user device or administrator device, a preview of the predicted effects for a given purge request. The purge job previewer may retrieve purge parameters entered by a user and transmit them to the purge effect prediction component. The purge effect prediction component then applies a machine learning model to generate predicted effects or metrics, such as purged volume and duration. These predicted effects are returned to the purge job previewer.

The purge job previewer may format predictions into graphical and textual analytics displayed at a user interface, such as a purge management interface. The purge job previewer may provide data-backed insights into the possible impacts of specified purge configurations prior to execution. This may enable informed decision-making when planning and scheduling data purge routines. The purge management interface can be accessed by the user device or the administrator device, in some examples. More detailed, non-limiting examples of the purge management interface are described below.
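The interaction between the previewer and the prediction component might be wired together as in the sketch below. The class names mirror the components named in the description, but their methods, the prediction fields, and the stub model are assumptions; the actual interfaces are not disclosed.

```python
from typing import Callable

Prediction = dict[str, float]


class PurgeEffectPredictionComponent:
    """Wraps a model callable; a trained model would be injected here."""

    def __init__(self, model: Callable[[dict], Prediction]) -> None:
        self._model = model

    def predict(self, params: dict) -> Prediction:
        return self._model(params)


class PurgeJobPreviewer:
    """Forwards purge parameters and formats the returned prediction as text."""

    def __init__(self, predictor: PurgeEffectPredictionComponent) -> None:
        self._predictor = predictor

    def preview(self, params: dict) -> str:
        p = self._predictor.predict(params)
        return (f"Purging '{params['source']}': "
                f"~{p['volume_gb']:.1f} GB in ~{p['minutes']:.0f} min")


def stub_model(params: dict) -> Prediction:
    """Stand-in for the machine learning model; the values are invented."""
    return {"volume_gb": 12.34, "minutes": 18.0}


previewer = PurgeJobPreviewer(PurgeEffectPredictionComponent(stub_model))
summary = previewer.preview({"source": "hr", "older_than_days": 365})
```

Injecting the model as a callable keeps the previewer independent of how predictions are produced, which matches the separation of components in the description.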

