Content-based data protection is provided for data stored in a large-scale data storage system by scanning data stored in one or more databases to discover metadata, extracting the discovered metadata, and storing it in a data catalog. The data catalog has a scanning function that performs the scanning step and comprises a database storing the metadata in one or more tables. A protection policy is defined to commonly protect content data referenced by metadata in the data catalog, and is applied to the referenced content data to perform a data protection operation on the content data. Datasets stored in the catalog are generated by running queries on the catalog, where a query comprises metadata selectors in the form of tags applied to the catalog, and the tags define at least one of a file type, name, location, creation time, or file characteristic.
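As a rough illustration of how a dataset might be generated by running a tag-based query on a catalog, the following minimal sketch uses an in-memory list in place of the catalog database. The `CatalogEntry` class, the `query_catalog` function, the tag keys (`file_type`, `dept`), and the sample locations are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one file or object in the catalog."""
    name: str
    location: str
    tags: dict = field(default_factory=dict)


def query_catalog(catalog, **selectors):
    """Return the entries whose tags match every selector (a logical AND)."""
    return [
        entry for entry in catalog
        if all(entry.tags.get(key) == value for key, value in selectors.items())
    ]


# Illustrative catalog contents; the entries live on different storage systems.
catalog = [
    CatalogEntry("q3.pdf", "nas01:/finance", {"file_type": "pdf", "dept": "finance"}),
    CatalogEntry("q3.xlsx", "s3://bucket/finance", {"file_type": "xlsx", "dept": "finance"}),
    CatalogEntry("logo.png", "nas02:/marketing", {"file_type": "png", "dept": "marketing"}),
]

# The dataset is selected by content tags, not by where the files are stored.
dataset = query_catalog(catalog, dept="finance")
print([entry.name for entry in dataset])  # ['q3.pdf', 'q3.xlsx']
```

In this sketch the resulting dataset spans two storage locations, which is the sense in which protection applied to it would follow content rather than location.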
Legal claims defining the scope of protection, as filed with the USPTO.
1. A computer-implemented method of providing content based data protection, comprising: scanning data stored in one or more databases for discovery of metadata; extracting the discovered metadata; storing the metadata in a data catalog, the data catalog having a scanning function performing the scanning step, and comprising a database storing the metadata in one or more tables; defining protection policies to commonly protect content data referenced by metadata in the data catalog, wherein the content data comprises data objects having disparate file format and protected by different protection policies; iteratively processing the dataset to tag the data objects according to a native file format; attaching multiple tags to the dataset to indicate that the data objects of the dataset are of different file types according to the disparate file formats; merging the protection policies to protect the dataset under a merged protection policy utilizing a most restrictive policy of the different protection policies; producing, from the data catalog, a change file list storing names of files changed from a first scan period to a next scan period for use by the protection policy; and applying the merged protection policy to the referenced content data to perform a data protection operation on the content data to provide content-based data protection rather than location-based data protection.
2. The method of claim 1 further comprising compiling the metadata into a single dataset prior to storing in the data catalog, wherein the dataset automatically tracks data added, removed or relocated to content data protected by the defined protection policy.
3. The method of claim 2 wherein the dataset is organized into collection information and per file and object information, and further wherein collection information comprises a dataset creation time, a query, role-based access control (RBAC) for the dataset, and first free-form metadata, and wherein the per file and object information comprises location of data of the dataset, unstructured metadata information, and second free-form metadata.
4. The method of claim 3 wherein the dataset is one of a static dataset or a dynamic dataset, wherein the static dataset comprises a fixed amount of data set at a time of creation, and the dynamic dataset comprises an amount of data that changes over time.
5. The method of claim 4 further comprising interfacing both the static dataset and dynamic dataset to the content data through a catalog interface to form a static database catalog and a dynamic database catalog.
6. The method of claim 5 wherein the static database catalog is used to create and store persistent datasets that contain data that is not modifiable during its lifecycle.
7. The method of claim 6 wherein the catalog comprises a user interface displaying to the user data usage trends, storage device usage, or storage device health, and further providing a mechanism through which a user can perform searches for files of the content data.
8. The method of claim 2 wherein the dataset comprises a logical collection of metadata for unstructured files and objects that are grouped together by one or more filters from a data query performed on the data catalog.
9. The method of claim 8 wherein the dataset represents a subset of data that a user categorizes for specific needs, wherein actions performed on the dataset will affect only the corresponding content data referenced by the metadata.
10. The method of claim 9 wherein the dataset spans multiple storage device types and multiple operating environments including edge networks, core networks and public or cloud networks.
11. The method of claim 1 wherein the data protection operation comprises at least one of: backing up data from operating memory to storage memory, restoring data from the storage to the operating memory, moving data among storage devices, and tiering data between different storage devices, and wherein the dataset automatically tracks data added, removed or relocated to content data protected by the defined protection policy.
12. A computer-implemented method of providing content-based data protection for data stored in a large-scale data storage system, comprising: accessing content data stored in the data storage system; deploying a data catalog comprising a scanning function configured to discover metadata associated with the content data, and a database storing the discovered metadata; defining protection policies to commonly protect selected data referenced by metadata in the data catalog; iteratively processing the dataset to tag the data objects according to a native file format; attaching multiple tags to the dataset to indicate that the data objects of the dataset are of different file types according to the disparate file formats; merging the protection policies to protect the dataset under a merged protection policy utilizing a most restrictive policy of the different protection policies; running a query received from a user against the catalog to generate the dataset based on the multiple tags; producing, from the data catalog, a change file list storing names of files changed from a first scan period to a next scan period for use by the protection policy; and applying the merged protection policy to the dataset to perform a data protection application on the selected data so as to provide content-based data protection rather than location-based data protection.
13. The method of claim 12 further comprising: creating the dataset by grouping metadata for unstructured data objects that are grouped together by one or more filters, wherein the dataset spans multiple storage devices of different storage types; initiating the query that generates one or more filters; and defining the protection policy to protect the dataset as the single unit based on data content rather than data location, wherein the query comprises metadata selectors applied to the catalog.
14. The method of claim 13 wherein the metadata selectors comprise tags consisting of alphanumeric strings applied to respective data objects based on user-defined rules, and wherein the tags define at least one of a file type, name, location, creation time, or characteristic.
15. The method of claim 12 wherein the dataset is one of a static dataset or a dynamic dataset, wherein the static dataset comprises a fixed amount of data set at a time of creation, and the dynamic dataset comprises an amount of data that changes over time, and wherein the dataset is organized into collection information and per file and object information.
16. The method of claim 15 wherein collection information comprises a dataset creation time, the query, role-based access control (RBAC) for the dataset, and first free-form metadata, and wherein the per file and object information comprises location of data of the dataset, unstructured metadata information, and second free-form metadata.
17. The method of claim 12 wherein the defined protection policy comprises at least one of: backing up data from operating memory to storage memory, restoring data from the storage to the operating memory, moving data among memory, and tiering data between different storage memory.
18. The method of claim 12 wherein the dataset spans multiple storage device types and multiple operating environments including edge networks, core networks and public or cloud networks.
19. A hardware-embodied computer program product having stored thereon program code that, when executed by a processor, causes the processor to perform a method of providing content based data protection, comprising: scanning data stored in one or more databases for discovery of metadata; extracting the discovered metadata; storing the metadata in a data catalog, the data catalog having a scanning function performing the scanning step, and comprising a database storing the metadata in one or more tables; defining protection policies to commonly protect content data referenced by metadata in the data catalog, wherein the content data comprises data objects having disparate file format and protected by different protection policies; iteratively processing the dataset to tag the data objects according to a native file format; attaching multiple tags to the dataset to indicate that the data objects of the dataset are of different file types according to the disparate file formats; merging the protection policies to protect the dataset under a merged protection policy utilizing a most restrictive policy of the different protection policies; producing, from the data catalog, a change file list storing names of files changed from a first scan period to a next scan period for use by the protection policy; and applying the merged protection policy to the referenced content data to perform a data protection operation on the content data to provide content-based data protection rather than location-based data protection.
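Two steps recited in the independent claims, merging protection policies under a most restrictive policy and producing a change file list between scan periods, can be sketched as follows. This is only an illustration under stated assumptions: the claims do not define what makes a policy "most restrictive", so the `Policy` fields and the rule used here (longest retention, shortest backup interval, encryption if any policy requires it) are hypothetical, as are the function names and the use of modification times to detect changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical protection policy; the fields are illustrative assumptions."""
    retention_days: int
    backup_interval_hours: int
    encrypt: bool


def merge_policies(policies):
    """Merge policies by taking the strictest value of each field (assumed rule)."""
    policies = list(policies)
    if not policies:
        raise ValueError("at least one policy is required")
    return Policy(
        retention_days=max(p.retention_days for p in policies),
        backup_interval_hours=min(p.backup_interval_hours for p in policies),
        encrypt=any(p.encrypt for p in policies),
    )


def change_file_list(previous_scan, current_scan):
    """Return sorted names of files that are new or modified since the previous scan.

    Each scan maps a file name to a modification marker (here, an mtime).
    Files that disappeared between scans are not reported in this sketch.
    """
    return sorted(
        name for name, mtime in current_scan.items()
        if previous_scan.get(name) != mtime
    )


# Two file types in one dataset, each originally covered by its own policy.
pdf_policy = Policy(retention_days=30, backup_interval_hours=24, encrypt=False)
db_policy = Policy(retention_days=365, backup_interval_hours=4, encrypt=True)
merged = merge_policies([pdf_policy, db_policy])
print(merged)  # Policy(retention_days=365, backup_interval_hours=4, encrypt=True)

scan_1 = {"a.pdf": 100, "b.db": 100}
scan_2 = {"a.pdf": 100, "b.db": 250, "c.txt": 300}
print(change_file_list(scan_1, scan_2))  # ['b.db', 'c.txt']
```

A protection operation driven by the merged policy could then process only the names in the change file list instead of rescanning every file in the dataset.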
October 27, 2022
June 3, 2025