A data storage node includes a plurality of compute nodes that allocate portions of local memory to a shared cache. The shared cache is configured with mirrored and non-mirrored segments. The mirrored and non-mirrored segments are separately configured with pools of data slots. Within each segment, each pool is associated with same-size data slots that differ in size relative to the data slots of other pools. The sizes of the pools in each segment are adjusted to minimize cache loss, where cache loss is the difference between the amount of shared cache used to service IOs and the minimum amount required for servicing the IOs.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method comprising:
2. The method offurther comprising changing the sizes of at least some of the partitions contingent upon the cache efficiency loss of at least one of the partitions exceeding a predetermined limit.
3. The method offurther comprising calculating updated percentages of the shared cache for each of the partitions to minimize aggregate cache efficiency loss of all the partitions.
4. The method offurther comprising calculating target bank counts for each of the partitions based on the updated percentages.
5. The method offurther comprising updating a record in compute node memory of cache bank donors and cache bank acceptors based on differences between enabled cache bank counts and target cache bank counts.
6. The method offurther comprising reallocating cache banks from donor partitions to acceptor partitions.
7. The method offurther comprising reallocating cache banks from mirrored donor partitions to mirrored acceptor partitions and reallocating cache banks from non-mirrored donor partitions to non-mirrored acceptor partitions.
8. A non-transitory computer-readable storage medium storing instructions that when executed by a data storage system with a plurality of compute nodes that allocate portions of local cache to a shared cache that is organized into a plurality of partitions, each of the partitions containing data slots of only one size in terms of storage capacity, different ones of the partitions being characterized by different slot sizes, each external IO received by one of the compute nodes being serviced by selecting an unallocated data slot that is equal to or larger in size than the IO and allocating that data slot for servicing the IO and afterwards deallocating that data slot, cause the data storage system to perform a method comprising:
9. The non-transitory computer-readable storage medium ofin which the method further comprises changing sizes of at least some of the partitions to reduce aggregate cache efficiency loss of all the partitions contingent upon the cache efficiency loss of at least one of the partitions exceeding a predetermined limit.
10. The non-transitory computer-readable storage medium ofin which the method further comprises calculating updated percentages of the shared cache for each of the partitions to minimize aggregate cache efficiency loss of all the partitions.
11. The non-transitory computer-readable storage medium ofin which the method further comprises calculating target bank counts for each of the partitions based on the updated percentages.
12. The non-transitory computer-readable storage medium ofin which the method further comprises updating a record of cache bank donors and cache bank acceptors based on differences between enabled cache bank counts and target cache bank counts.
13. The non-transitory computer-readable storage medium ofin which the method further comprises reallocating cache banks from donor partitions to acceptor partitions.
14. The non-transitory computer-readable storage medium ofin which the method further comprises reallocating cache banks from mirrored donor partitions to mirrored acceptor partitions, and reallocating cache banks from non-mirrored donor partitions to non-mirrored acceptor partitions.
15. An apparatus comprising:
16. The apparatus offurther comprising at least one cache partition balancer configured to calculate updated percentages of the shared cache for each of the partitions to minimize aggregate cache efficiency loss of all the partitions.
17. The apparatus offurther comprising at least one cache partition balancer configured to calculate target bank counts for each of the partitions based on the updated percentages.
18. The apparatus offurther comprising at least one cache partition balancer configured to update a record of cache bank donors and cache bank acceptors based on differences between enabled cache bank counts and target cache bank counts.
19. The apparatus offurther comprising at least one cache partition balancer configured to reallocate cache banks from donor partitions to acceptor partitions.
20. The apparatus offurther comprising at least one cache partition balancer configured to reallocate cache banks from mirrored donor partitions to mirrored acceptor partitions, and reallocating cache banks from non-mirrored donor partitions to non-mirrored acceptor partitions.
Complete technical specification and implementation details from the patent document.
The subject matter of this disclosure is generally related to cache partition allocations in data storage systems.
High-capacity data storage systems such as storage area networks (SANs) and storage arrays manage access to host application data stored on arrays of non-volatile drives. The storage systems respond to input-output (IO) commands from instances of host applications that run on host servers. Examples of host applications may include, but are not limited to, software for email, accounting, manufacturing, inventory control, and a wide variety of other business processes. It has long been standard practice in the art to use a single, fixed size data allocation unit for data access so that storage system metadata is practical to manage. The data allocation units are sometimes referred to as tracks (TRKs). The single, fixed TRK size can be selected as a design choice, where TRK size is generally proportional to the manageability of the metadata, but inversely proportional to resource utilization efficiency. Using a larger TRK size can reduce the resource burden on memory and processing resources for metadata management but decreases the efficiency of managed drive utilization by increasing unused space. TRKs are distinct from hard disk drive (HDD) tracks that characterize spinning disk storage architecture. An HDD track is a physical characteristic that corresponds to a concentric band on a platter. TRKs are larger in size than HDD tracks and are not limited by the physical architecture of a spinning platter. It has also long been standard practice in the art to mirror the volatile memory of pairs of interconnected storage system compute nodes for failover. Mirroring causes all TRKs in volatile memory of a primary compute node to also be in volatile memory of a secondary compute node so that the secondary compute node can quickly take responsibility for IO processing in the event of failure of the primary compute node.
It has recently been proposed to implement selective mirroring based on whether data in volatile memory is stored on non-volatile drives. It has also been proposed to simultaneously support multiple data allocation unit sizes. In order to implement selective mirroring, the volatile memory may be divided into mirrored and non-mirrored segments. In order to simultaneously support multiple data allocation unit sizes, pools (partitions) of different sized data slots may be created. Some aspects of the presently disclosed invention are predicated in part on recognition that supporting multiple TRK sizes and implementing selective mirroring creates new problems. Different organizations and different storage nodes tend to generate and service a variety of IO workloads that vary in both size and type. Depending on a variety of factors, the read-to-write ratio of an IO workload and distribution of IO sizes may vary widely. Thus, a default segmentation configuration can lead to inefficient operation and resource starvation. For example, an organization that generates an IO workload that is dominated by large size read IOs will inefficiently utilize the resources of a storage array that is configured with a relatively large, mirrored cache segment and relatively large size non-mirrored cache segment allocations to small data slot size pools.
In accordance with some implementations, a method comprises, in a data storage system comprising a plurality of compute nodes that allocate portions of local cache to a shared cache that is organized into partitions characterized by different data slot sizes: calculating cache loss of each of the partitions as a difference between amount of shared cache used to service IOs and minimum amount required to service the IOs; and changing sizes of at least some of the partitions to reduce aggregate cache loss of all the partitions.
In accordance with some implementations, an apparatus comprises a data storage system comprising a plurality of compute nodes configured to allocate portions of local cache to a shared cache that is organized into partitions characterized by different slot sizes; and at least one cache partition balancer configured to: calculate cache loss of each of the partitions as a difference between amount of shared cache used to service IOs and minimum amount required to service the IOs; and change sizes of at least some of the partitions to reduce aggregate cache loss of all the partitions.
In accordance with some implementations, a non-transitory computer-readable storage medium stores instructions that when executed by a data storage system with a plurality of compute nodes that allocate portions of local cache to a shared cache that is organized into partitions characterized by different slot sizes cause the data storage system to perform a method comprising: calculating cache loss of each of the partitions as a difference between amount of shared cache used to service IOs and minimum amount required to service the IOs; and changing sizes of at least some of the partitions to reduce aggregate cache loss of all the partitions.
This summary is not intended to limit the scope of the claims or the disclosure. Other aspects, features, and implementations will become apparent in view of the detailed description and figures. Moreover, all the examples, aspects, implementations, and features can be combined in any technically possible way.
The terminology used in this disclosure is intended to be interpreted broadly within the limits of subject matter eligibility. The terms “disk,” “drive,” and “disk drive” are used interchangeably to refer to non-volatile storage media and are not intended to refer to any specific type of non-volatile storage media. The terms “logical” and “virtual” are used to refer to features that are abstractions of other features, for example, and without limitation, abstractions of tangible features. The term “physical” is used to refer to tangible features that possibly include, but are not limited to, electronic hardware. For example, multiple virtual computers could operate simultaneously on one physical computer. The term “logic” is used to refer to special purpose physical circuit elements, firmware, software, computer instructions that are stored on a non-transitory computer-readable medium and implemented by multi-purpose tangible processors, and any combinations thereof. Aspects of the inventive concepts are described as being implemented in a data storage system that includes host servers and a storage array. Such implementations should not be viewed as limiting. Those of ordinary skill in the art will recognize that there are a wide variety of implementations of the inventive concepts in view of the teachings of the present disclosure.
Some aspects, features, and implementations described herein may include machines such as computers, electronic components, optical components, and processes such as computer-implemented procedures and steps. It will be apparent to those of ordinary skill in the art that the computer-implemented procedures and steps may be stored as computer-executable instructions on a non-transitory computer-readable medium. Furthermore, it will be understood by those of ordinary skill in the art that the computer-executable instructions may be executed on a variety of tangible processor devices, i.e., physical hardware. For practical reasons, not every step, device, and component that may be part of a computer or data storage system is described herein. Those of ordinary skill in the art will recognize such steps, devices, and components in view of the teachings of the present disclosure and the knowledge generally available to those of ordinary skill in the art. The corresponding machines and processes are therefore enabled and within the scope of the disclosure.
An accompanying figure illustrates a storage array with shared memory/cache segment balancers. The storage array includes at least one brick. Each brick includes an engine and one or more disk array enclosures (DAEs). Each engine includes two interconnected compute nodes that are arranged as a pair for failover and may be referred to as “storage directors.” Although it is known in the art to refer to the compute nodes of a SAN as “hosts,” that naming convention is avoided in this disclosure to help distinguish the network server hosts from the compute nodes. Nevertheless, the host applications could run on the compute nodes, e.g., on virtual machines or in containers. Each compute node is implemented as a separate blade and includes resources such as at least one multi-core processor and local memory. The processor may include central processing units (CPUs), graphics processing units (GPUs), or both. The local memory may include volatile media such as dynamic random-access memory (DRAM), non-volatile memory (NVM) such as storage class memory (SCM), or both. Each compute node includes one or more host adapters (HAs) for communicating with host servers. Each host adapter has resources for servicing input-output commands (IOs) from the host servers. The host adapter resources may include processors, volatile memory, and ports via which the hosts may access the storage array. Each compute node also includes a remote adapter (RA) for communicating with other storage systems. Each compute node also includes one or more disk adapters (DAs) for communicating with managed drives in the DAEs. Each disk adapter has processors, volatile memory, and ports via which the compute node may access the DAEs for servicing IOs. Each compute node may also include one or more channel adapters (CAs) for communicating with other compute nodes via an interconnecting fabric.
The managed drives include non-volatile storage media that may be of any type, e.g., solid-state drives (SSDs) based on EEPROM technology such as NAND and NOR flash memory and hard disk drives (HDDs) with spinning disk magnetic storage media. Disk controllers may be associated with the managed drives as is known in the art. An interconnecting fabric enables implementation of an N-way active-active backend. A backend connection group includes all disk adapters that can access the same drive or drives. In some implementations, every disk adapter in the storage array can reach every DAE via the fabric. Further, in some implementations every disk adapter in the storage array can access every managed disk.
Referring to the figures, host application data is persistently stored on the managed drives and, because the managed drives are not discoverable by the host servers, logically stored on a storage object that can be discovered by the host servers. Without limitation, a storage object may be referred to as a volume, device, or LUN, where a logical unit number (LUN) is a number used to identify logical storage volumes in accordance with the small computer system interface (SCSI) protocol. From the perspective of the host servers, the storage object is a single disk having a set of contiguous logical block addresses (LBAs) on which data used by the instances of a host application resides. However, the host application data is stored at non-contiguous addresses on various managed drives. There may be a large number of host servers and the storage array may maintain a large number of storage objects.
Each compute node allocates a fixed amount of its local memory to a shared cache (aka shared memory) that can be accessed by all compute nodes of the storage array using direct memory access (DMA). The shared cache includes metadata slots and data slots, each of which is a fixed allocation of the shared cache. The basic allocation units of storage capacity that are used by the compute nodes to access the managed drives are back-end tracks (BE-TRKs). The host application data is logically stored in front-end tracks (FE-TRKs) on the production storage object and actually stored on BE-TRKs on the managed drives. The FE-TRKs are mapped to the BE-TRKs and vice versa by FE-TRK IDs and BE-TRK IDs, which are pointers that are maintained in the metadata slots. More specifically, the BE-TRK IDs are pointers to BE-TRKs of host application data in the data slots. The data slots, which function to hold data for processing IOs, are divided into a mirrored segment and a non-mirrored segment. The mirrored segment is mirrored by both compute nodes of an engine, whereas the non-mirrored segment is not mirrored. Each segment is divided into a plurality of pools. The sizes of the data slots correspond to the sizes of the BE-TRKs, and the terms data slot and BE-TRK may be used interchangeably when referring to partition and segment allocations. Each pool (partition) contains same-size data slots for holding BE-TRK data, and the sizes of the data slots/BE-TRKs differ between pools. For example, and without limitation, one pool may contain only 16 KB data slots, another pool may contain only 64 KB data slots, and a third pool may contain only 128 KB data slots.
The shared cache is used to service IOs from the host server, with the pools being used selectively to reduce wasted space. In the illustrated example, a compute node receives an IO from a host with a storage object as the target. The IO could be a Read or Write to an FE-TRK that is logically stored on the storage object. A response to a Write IO is an Ack, whereas a response to a Read IO is data. The response is collectively represented as Ack/Data. The compute node uses information in the IO to identify a metadata page corresponding to the FE-TRK, e.g., by inputting information such as the storage object ID and LBAs into a hash table. The hash table indicates the location of the corresponding metadata page in the metadata slots. The location of the metadata page in the shared cache may be local or remote relative to the compute node. A BE-TRK ID pointer from that metadata page is obtained and used by the compute node to find the corresponding data slot that contains the BE-TRK associated with the FE-TRK. The BE-TRK is not necessarily present in the data slots when the IO is received because the managed drives have much greater storage capacity than the data slots, so data slots are routinely recycled to create free data slots. If the IO is a Read and the corresponding BE-TRK is not present in the data slots, then the compute node locates and retrieves a copy of the BE-TRK from the managed drives. More specifically, the BE-TRK is copied into an empty data slot in the pool with the closest sized data slots that are ≥ the BE-TRK size in the non-mirrored segment. That copy is then used to respond to the host server and the data is eventually flushed from the data slots. If the IO is a Write and the corresponding BE-TRK is not present in the data slots, then the compute node places the Write data into an empty data slot in the pool with the closest sized data slots that are ≥ the BE-TRK size in the mirrored segment.
In accordance with mirroring, the data is copied to the corresponding mirrored segment and pool of the paired compute node. Worker threads running in the background eventually destage the data to the BE-TRK on the managed drives, e.g., overwriting the stale data on the managed drives and flushing the data from the data slots.
A race condition exists between the recycling of data slots by worker threads running in the background and use of free data slots in the foreground to service IOs, e.g., to receive data for a pending write. When an appropriately sized data slot is unavailable, a larger size data slot is used. For example, a write IO sized at 8 KB would normally be written to a free data slot in the 16 KB pool of the mirrored segment. However, if there are no free data slots in the 16 KB pool of the mirrored segment then a free data slot in the nearest larger size pool of the mirrored segment is used, e.g., a free data slot from the 64 KB pool or the 128 KB pool. Although this procedure enables the write to be processed, it increases the amount of temporarily wasted space in the shared cache by using a 64 KB or 128 KB data slot rather than a 16 KB data slot for an 8 KB write. Moreover, the unpredictability of the IO workload creates a likelihood that inefficiently large data slot pools will be used in some implementations if a single set of pool size allocations is used in all situations.
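The fallback behavior described above can be sketched as follows. This is a minimal illustration only: the function name `select_pool` and the dictionary of free-slot counts keyed by slot size are hypothetical and do not appear in the disclosure.

```python
from typing import Dict, Optional


def select_pool(io_size_kb: int, free_slots: Dict[int, int]) -> Optional[int]:
    """Return the slot size (KB) of the smallest pool that fits the IO
    and currently has a free data slot, or None if no pool qualifies.

    free_slots maps slot size in KB -> number of free slots (hypothetical layout).
    """
    for slot_kb in sorted(free_slots):
        if slot_kb >= io_size_kb and free_slots[slot_kb] > 0:
            return slot_kb
    return None


# Mirrored segment with an exhausted 16 KB pool: an 8 KB write falls
# through to the nearest larger pool that has a free slot.
mirrored_free = {16: 0, 64: 3, 128: 5}
chosen_kb = select_pool(8, mirrored_free)
wasted_kb = chosen_kb - 8  # temporarily wasted space for this IO
```

With the counts shown, the 8 KB write lands in a 64 KB slot and wastes 56 KB, compared with 8 KB had a 16 KB slot been free.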
A further figure illustrates an example configuration of data slot segments and pools with area representing relative percentage of shared cache. The mirrored segment, which is used for Write data, may differ in size relative to the non-mirrored segment, which is used for Read data. For example, a configuration with a relatively larger mirrored segment may be created for organizations that historically generate Write-heavy IO workloads. In contrast, a configuration with a relatively larger non-mirrored segment may be created for organizations that historically generate Read-heavy IO workloads. Similarly sized mirrored and non-mirrored segments may be created for organizations that historically generate balanced (read:write) IO workloads. In the illustrated example, the non-mirrored segment includes, in ascending order according to size, a pool of 32 KB BE-TRKs, a larger pool of 64 KB BE-TRKs, and a larger pool of 128 KB BE-TRKs. The mirrored segment includes, in ascending order according to size, a pool of 16 KB BE-TRKs, a larger pool of 64 KB BE-TRKs, and a larger pool of 128 KB BE-TRKs. The data slot sizes of the pools can be configured based on cache loss signals as will be described below.
The accompanying figures illustrate operation of the shared memory/cache partition balancers. The storage array is initially configured with default percentages of the shared cache allocated to each partition. The first step is calculating the cache loss of each partition. Cache loss is the difference between the amount of shared cache actually used and the minimum shared cache needed for the IOs serviced. Continuing with the example from the preceding paragraph, cache loss for the IO would have been 16 KB − 8 KB = 8 KB if a data slot in the 16 KB pool had been free, but instead is either 64 KB − 8 KB = 56 KB or 128 KB − 8 KB = 120 KB depending on whether a data slot in the 64 KB pool is free. The cache loss may be calculated for all TRKs present in the data slots or all IOs serviced over some period of time. The cache loss calculations may be represented by a cache loss table such as the one shown in the figures. The table shows the cache losses for various IO types on the mirrored and non-mirrored 128 KB partitions W128, R128. For example, if the IO path uses a W128 cache slot for a 16 KB write IO then there is a Cache Loss = 112 KB for that IO. Using data slots in the mirrored segment for reads exacerbates the cache loss. For example, if the IO path uses a W128 cache slot for a 16 KB read IO then there is a Cache Loss = 240 KB for that IO. The next step is comparing the cache loss of each partition with a predetermined cache loss limit. Different limits may be assigned to different partitions, e.g., based on data slot size and whether the segment is mirrored or non-mirrored. If none of the partitions exceed their limits as determined in the comparison step, then flow returns to the cache loss calculation step and monitoring continues. If any of the partitions exceed their limits as determined in the comparison step, then new percentages of shared cache are calculated for each partition to minimize aggregate cache loss. More specifically, updated percentages are calculated based on the cache losses calculated in the first step.
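The per-IO cache loss arithmetic and the limit comparison can be sketched as below. The function names are hypothetical. The doubling of the slot size for a read placed in a mirrored slot is an assumption inferred from the 240 KB example (2 × 128 KB − 16 KB); the disclosure does not state the formula explicitly.

```python
from typing import Dict, List


def cache_loss_kb(slot_kb: int, io_kb: int, mirrored_slot: bool, is_write: bool) -> int:
    """Cache loss for one IO: shared cache consumed minus the minimum needed.

    Assumption (inferred from the table examples): a read serviced from a
    mirrored slot consumes the slot on both compute nodes of the engine,
    so the consumed cache is counted twice.
    """
    copies = 2 if (mirrored_slot and not is_write) else 1
    return copies * slot_kb - io_kb


def partitions_over_limit(losses: Dict[str, int], limits: Dict[str, int]) -> List[str]:
    """Names of partitions whose accumulated loss exceeds their own limit."""
    return [name for name, loss in losses.items() if loss > limits[name]]


# Examples from the text.
loss_w16 = cache_loss_kb(16, 8, mirrored_slot=True, is_write=True)            # 8 KB
loss_w64 = cache_loss_kb(64, 8, mirrored_slot=True, is_write=True)            # 56 KB
loss_w128_write = cache_loss_kb(128, 16, mirrored_slot=True, is_write=True)   # 112 KB
loss_w128_read = cache_loss_kb(128, 16, mirrored_slot=True, is_write=False)   # 240 KB

# Hypothetical per-partition totals and limits, in KB.
over = partitions_over_limit({"W16": 500, "W128": 9000}, {"W16": 1000, "W128": 4000})
```

Per-partition limits are kept in their own mapping because the text allows different limits for different slot sizes and for mirrored versus non-mirrored segments.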
The next step is calculating target cache bank counts for each partition corresponding to the updated percentages of shared cache. A bank is a fixed-size chunk of cache with a size that is less than the sizes of partitions. For example, a bank size of 0.25 GB might be used.
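One way to turn updated percentages into whole-bank targets is sketched below. The disclosure does not say how fractional banks are rounded, so the largest-remainder rounding used here is an assumption, as are the function name, the partition labels, and the 16 GB segment size.

```python
import math
from typing import Dict


def target_bank_counts(percentages: Dict[str, float], total_banks: int) -> Dict[str, int]:
    """Convert per-partition percentages into integer bank counts that sum
    to total_banks, using largest-remainder rounding (an assumed policy)."""
    raw = {part: pct / 100 * total_banks for part, pct in percentages.items()}
    counts = {part: math.floor(value) for part, value in raw.items()}
    leftover = total_banks - sum(counts.values())
    # Hand the remaining banks to the partitions with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda part: raw[part] - counts[part], reverse=True)
    for part in by_remainder[:leftover]:
        counts[part] += 1
    return counts


# Hypothetical 16 GB segment split into 0.25 GB banks -> 64 banks.
total_banks = int(16 / 0.25)
targets = target_bank_counts({"W16": 50, "W64": 30, "W128": 20}, total_banks)
```

Rounding so that the targets sum to the number of available banks keeps the later donor and acceptor totals equal, which the one-bank-at-a-time reallocation relies on.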
The next step is calculating a donor/acceptor list for the mirrored segment. Each partition in the mirrored segment is analyzed independently. If the count of enabled banks is greater than the (new) target bank count, then the mirrored segment donor list entry is set equal to the difference between the enabled bank count and the target bank count. The number of enabled cache banks in a partition is the number of cache banks that the partition contains and that are currently serving cache slot allocations. The modifier “enabled” is used because cache banks can be logically and physically enabled/disabled. If the count of enabled banks is less than the target bank count, then the mirrored segment acceptor list entry is set equal to the difference between the enabled bank count and the target bank count.
The next step is calculating a donor/acceptor list for the non-mirrored segment. Each partition in the non-mirrored segment is analyzed independently. If the count of enabled banks is greater than the (new) target bank count, then the non-mirrored segment donor list entry is set equal to the difference between the enabled bank count and the target bank count. If the count of enabled banks is less than the target bank count, then the non-mirrored segment acceptor list entry is set equal to the difference between the enabled bank count and the target bank count.
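The donor/acceptor calculation is the same for both segments and can be sketched as a single function applied once per segment. The function name and the partition labels are hypothetical, and storing the acceptor entry as a positive count is a choice made here for readability.

```python
from typing import Dict, Tuple


def donor_acceptor_lists(
    enabled: Dict[str, int], target: Dict[str, int]
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split one segment's partitions into donors and acceptors.

    enabled: banks currently enabled per partition.
    target:  target bank count per partition.
    Each entry holds the number of banks to give up (donors) or to
    receive (acceptors), stored as a positive count.
    """
    donors: Dict[str, int] = {}
    acceptors: Dict[str, int] = {}
    for part, have in enabled.items():
        want = target[part]
        if have > want:
            donors[part] = have - want
        elif have < want:
            acceptors[part] = want - have
    return donors, acceptors


# Hypothetical mirrored-segment state.
donors, acceptors = donor_acceptor_lists(
    enabled={"W16": 40, "W64": 14, "W128": 10},
    target={"W16": 32, "W64": 19, "W128": 13},
)
```

Partitions already at their target appear in neither list, so nothing is moved for them.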
A mirrored segment balancer reallocates mirrored banks from donors to acceptors. The balancer must first logically disable a relevant cache bank. When a cache bank is logically disabled it cannot be used for new incoming IOs. Once a relevant bank is disabled, the contents of that bank are freed by running a disabled banks scan on that bank so as to allow any write data to be saved to disk and to ensure the bank is completely free before being formatted. First, a single donor bank is selected, disabled, and drained of data. Next, the balancer updates the cache bank map table for the system to reflect that the cache bank is associated with a different partition. Then the entire bank is formatted, including its metadata, because the size of each slot in the bank is changed to the size associated with the new partition. The non-mirrored segment balancer operates in the same manner, performing the corresponding steps on the non-mirrored partitions. The processes are iterated, one bank at a time by each balancer, until the enabled bank count for each partition equals the target bank count. Flow then returns to cache loss monitoring.
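The one-bank-at-a-time iteration can be sketched as follows for a single segment. This is an illustration under stated assumptions: the function name is hypothetical, the disable/drain, map update, and format operations are represented only by comments, and picking the first donor and first acceptor is an arbitrary policy not taken from the disclosure.

```python
from typing import Dict, List, Tuple


def rebalance_segment(
    enabled: Dict[str, int], target: Dict[str, int]
) -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    """Move banks one at a time from donors to acceptors within one segment
    until no donor/acceptor pair remains. Returns the final enabled counts
    and the ordered list of (donor, acceptor) moves."""
    enabled = dict(enabled)  # work on a copy
    moves: List[Tuple[str, str]] = []
    while True:
        donor_parts = [p for p in enabled if enabled[p] > target[p]]
        acceptor_parts = [p for p in enabled if enabled[p] < target[p]]
        if not donor_parts or not acceptor_parts:
            break
        donor, acceptor = donor_parts[0], acceptor_parts[0]
        # 1. Select one donor bank, logically disable it, and drain its data.
        # 2. Update the cache bank map so the bank belongs to the acceptor.
        # 3. Format the whole bank with the acceptor's slot size.
        enabled[donor] -= 1
        enabled[acceptor] += 1
        moves.append((donor, acceptor))
    return enabled, moves


# Hypothetical mirrored-segment state; the non-mirrored segment would be
# processed by a separate call with its own counts.
final_counts, moves = rebalance_segment(
    enabled={"W16": 40, "W64": 14, "W128": 10},
    target={"W16": 32, "W64": 19, "W128": 13},
)
```

Calling the function separately for the mirrored and non-mirrored segments keeps banks from crossing between segments, consistent with the separate balancers described above.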
Specific examples have been presented to provide context and convey inventive concepts. The specific examples are not to be considered as limiting. A wide variety of modifications may be made without departing from the scope of the inventive concepts described herein. Moreover, the features, aspects, and implementations described herein may be combined in any technically possible way. Accordingly, modifications and combinations are within the scope of the following claims.
October 14, 2025