Systems and methods for extended memory are disclosed. An apparatus for extended memory may include a first processing device; a second processing device; a first memory device; and a second memory device. A first logical memory space and a second logical memory space are configured to be allocated for respectively the first processing device and the second processing device. The first logical memory space and the second logical memory space are further configured to be mapped to a first physical memory space of one of the first memory device or the second memory device.
Legal claims defining the scope of protection, as filed with the USPTO.
1. An apparatus comprising: a first processing device; a second processing device; a first memory device; and a second memory device; wherein, a first logical memory space and a second logical memory space are configured to be allocated for respectively the first processing device and the second processing device, wherein the first logical memory space and the second logical memory space are further configured to be mapped to a first physical memory space of one of the first memory device or the second memory device.
2. The apparatus of claim 1, wherein the first processing device includes a graphical processing device, and the second processing device includes a processing engine embedded in the first memory device.
3. The apparatus of claim 1, wherein the first memory device includes two or more memory chips configured to be vertically stacked on top of each other.
4. The apparatus of claim 1, wherein the first memory device includes a type of random access memory.
5. The apparatus of claim 1, wherein the first processing device and the second processing device are configured to share data stored in the first physical memory space.
6. The apparatus of claim 1, wherein the first processing device is configured to provide an address of the first physical memory space, and the second processing device is configured to map the second logical memory space to the first physical memory space based on the address.
7. The apparatus of claim 1, wherein the first processing device is configured to execute a first computation and the second processing device is configured to execute a second computation, wherein execution of the second computation is based on the execution of the first computation.
8. The apparatus of claim 7, wherein the first computation is associated with a first stream, and the second computation is associated with a second stream.
9. The apparatus of claim 1, wherein the first logical memory space and the second logical memory space are assigned based on a memory allocation request, wherein the memory allocation request identifies the one of the first memory device or the second memory device, and further identifies a type of processing device for accessing the one of the first memory device or the second memory device.
10. A method comprising: receiving a memory allocation request; based on receiving the memory allocation request: allocating a first logical memory space for a first processing device; allocating a first physical memory space in a memory device based on the first logical memory space; allocating a second logical memory space for a second processing device; and mapping the first physical memory space to the first logical memory space and the second logical memory space.
11. The method of claim 10, wherein the first processing device includes a graphical processing device, and the second processing device includes a processing engine embedded in the first memory device.
12. The method of claim 10, wherein the first memory device includes two or more memory chips configured to be vertically stacked on top of each other.
13. The method of claim 10, wherein the first memory device includes a type of random access memory.
14. The method of claim 10, wherein the first processing device and the second processing device are configured to share data stored in the first physical memory space.
15. The method of claim 10, wherein the first processing device is configured to provide an address of the first physical memory space, and the second processing device is configured to map the second logical memory space to the first physical memory space based on the address.
16. The method of claim 10, wherein the first processing device is configured to execute a first computation and the second processing device is configured to execute a second computation, wherein execution of the second computation is based on the execution of the first computation.
17. The method of claim 16, wherein the first computation is associated with a first stream, and the second computation is associated with a second stream.
18. The method of claim 10, wherein the first logical memory space and the second logical memory space are assigned based on a memory allocation request, wherein the memory allocation request identifies the one of the first memory device or the second memory device, and further identifies a type of processing device for accessing the one of the first memory device or the second memory device.
Complete technical specification and implementation details from the patent document.
The present application claims priority to and the benefit of U.S. Provisional Application No. 63/666,105, filed June 28, 2024, entitled “ADVANCED HIGH BANDWIDTH MEMORY (A-HBM),” and claims priority to and the benefit of U.S. Provisional Application No. 63/704,975, filed October 8, 2024, entitled “ADVANCED HIGH BANDWIDTH MEMORY (A-HBM) COMPANION + LOW-POWER DOUBLE DATA RATE (LPDDR): USAGE MODES AND PROGRAMMING METHODS,” and claims priority to and the benefit of U.S. Provisional Application No. 63/760,905, filed February 20, 2025, entitled “ADVANCED HIGH BANDWIDTH MEMORY (A-HBM),” the entire content of each of which is incorporated herein by reference.
One or more aspects of embodiments according to the present disclosure relate to memory devices, and more particularly, to extending memory capacity and bandwidth for memory devices with compute capability.
The use of artificial intelligence (AI) has increased dramatically over the last few years. AI has become commonly used in domains such as image classification, speech recognition, media analytics, health care, autonomous machines, smart assistants, and the like. Using AI often necessitates the use of large datasets and advanced algorithms that similarly necessitate efficient and cost-effective data processing solutions.
The above information disclosed in this Background section is only for enhancement of understanding of the background of the present disclosure, and therefore, it may contain information that does not form prior art.
One or more embodiments of the present disclosure are directed to an apparatus comprising: a first processing device; a second processing device; a first memory device; and a second memory device; wherein, a first logical memory space and a second logical memory space are configured to be allocated for respectively the first processing device and the second processing device, wherein the first logical memory space and the second logical memory space are further configured to be mapped to a first physical memory space of one of the first memory device or the second memory device.
According to some embodiments, the first processing device includes a graphical processing device, and the second processing device includes a processing engine embedded in the first memory device.
According to some embodiments, the first memory device includes two or more memory chips configured to be vertically stacked on top of each other.
According to some embodiments, the first memory device includes a type of random access memory.
According to some embodiments, the first processing device and the second processing device are configured to share data stored in the first physical memory space.
According to some embodiments, the first processing device is configured to provide an address of the first physical memory space, and the second processing device is configured to map the second logical memory space to the first physical memory space based on the address.
According to some embodiments, the first processing device is configured to execute a first computation and the second processing device is configured to execute a second computation, wherein execution of the second computation is based on the execution of the first computation.
According to some embodiments, the first computation is associated with a first stream, and the second computation is associated with a second stream.
According to some embodiments, the first logical memory space and the second logical memory space are assigned based on a memory allocation request, wherein the memory allocation request identifies the one of the first memory device or the second memory device, and further identifies a type of processing device for accessing the one of the first memory device or the second memory device.
One or more embodiments of the present disclosure are directed to a method that includes: receiving a memory allocation request; based on receiving the memory allocation request: allocating a first logical memory space for a first processing device; allocating a first physical memory space in a memory device based on the first logical memory space; allocating a second logical memory space for a second processing device; and mapping the first physical memory space to the first logical memory space and the second logical memory space.
These and other features, aspects and advantages of the embodiments of the present disclosure will be more fully understood when considered with respect to the following detailed description, appended claims, and accompanying drawings. Of course, the actual scope of the invention is defined by the appended claims.
Hereinafter, example embodiments will be described in more detail with reference to the accompanying drawings, in which like reference numbers refer to like elements throughout. The present disclosure, however, may be embodied in various different forms, and should not be construed as being limited to only the illustrated embodiments herein. Rather, these embodiments are provided as examples so that this disclosure will be thorough and complete, and will fully convey the aspects and features of the present disclosure to those skilled in the art. Accordingly, processes, elements, and techniques that are not necessary to those having ordinary skill in the art for a complete understanding of the aspects and features of the present disclosure may not be described. Unless otherwise noted, like reference numerals denote like elements throughout the attached drawings and the written description, and thus, descriptions thereof may not be repeated. Further, in the drawings, the relative sizes of elements, layers, and regions may be exaggerated and/or simplified for clarity.
Embodiments of the present disclosure are described below with reference to block diagrams and flow diagrams. Thus, it should be understood that each block of the block diagrams and flow diagrams may be implemented in the form of a computer program product, an entirely hardware embodiment, a combination of hardware and computer program products, and/or apparatus, systems, computing devices, computing entities, and/or the like carrying out instructions, operations, steps, and similar words used interchangeably (for example the executable instructions, instructions for execution, program code, and/or the like) on a computer-readable storage medium for execution. For example, retrieval, loading, and execution of code may be performed sequentially such that one instruction is retrieved, loaded, and executed at a time. In some example embodiments, retrieval, loading, and/or execution may be performed in parallel such that multiple instructions are retrieved, loaded, and/or executed together. Thus, such embodiments can produce specifically-configured machines performing the steps or operations specified in the block diagrams and flow diagrams. Accordingly, the block diagrams and flow diagrams support various combinations of embodiments for performing the specified instructions, operations, or steps.
In addition, a feature of embodiments of the present disclosure may be combined with one or more other features, partially or entirely, and may be operated in various ways, and an embodiment may be implemented independently of one or more other embodiments, or in conjunction with the one or more other embodiments.
Applications such as AI and machine learning applications may use computing devices such as graphics processing units (GPUs) to accelerate the computing of data. Processing engines integrated into memory, such as processing-in-memory (PIM) devices, may also be used to increase processing speeds of computations. When the PIM architecture is incorporated with high bandwidth memory (HBM) dies, an HBM-PIM stack or cube may result that provides increased memory speed and capacity in addition to processing speed.
In some cases, one or more GPUs may be provided in a same package as one or more HBM-PIMs. A computing device that includes both GPUs and HBM-PIMs may be referred to as an A-HBM companion. Both the GPUs and the processing engines of the HBM-PIMs may access the HBM stack for writing/storing and reading/loading data. The HBM stack may include several memory dies (may also be referred to as core dies) stacked vertically over a logic die (may also be referred to as a base die or buffer). The memory die may be relatively fast memory such as, for example, a dynamic random access memory (DRAM). Although DRAM may be fast compared to other memory solutions, it may also be costly in terms of power and money, and have lower capacity. It may be desirable to increase the memory capacity and bandwidth of an A-HBM companion device while minimizing cost.
In general terms, embodiments of the present disclosure are directed to a single physical computing device with two or more independent (e.g., separate) compute units and two or more independent (e.g., separate) memory devices that allow mixing and matching of the compute units with the memory devices. In some embodiments, the two or more independent compute units include a GPU and a PIM, although embodiments are not limited thereto. In some embodiments, the two or more memory devices include an HBM stack and a low-power double data rate (LPDDR) memory, although embodiments are not limited thereto.
In some embodiments, the mixing and matching of the compute units with the memory devices allows for four usage modes in a single physical computing device. For example, the usage modes may include: 1) GPU + HBM; 2) GPU + LPDDR; 3) PIM + HBM; and 4) PIM + LPDDR. A first one of the usage modes may be activated concurrently with a second one of the usage modes (or a subset of other usage modes).
In some embodiments, the operating system of a host computing device may identify the compute units as separate and independent compute units, and further identify four independent memory regions in a same logical address space that may be allocated to respectively the four usage modes. Two of the logical memory regions allocated respectively to the GPU+HBM and PIM+HBM usage modes may be mapped to the physical address space in the HBM. Two of the logical memory regions allocated respectively to the GPU+LPDDR and PIM+LPDDR may be mapped to the physical address space of the LPDDR.
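By way of a non-limiting illustration, the relationship between the four usage modes, their logical memory regions, and the backing physical address spaces may be sketched as a lookup table. The Python names below (`UsageMode`, `REGION_TO_PHYSICAL`, `physical_medium`, `modes_sharing`) are hypothetical and are introduced only for this sketch; they are not part of the present disclosure.

```python
from enum import Enum


class UsageMode(Enum):
    """The four usage modes named in the description."""
    GPU_HBM = "GPU + HBM"
    GPU_LPDDR = "GPU + LPDDR"
    PIM_HBM = "PIM + HBM"
    PIM_LPDDR = "PIM + LPDDR"


# Assumption: one logical memory region per usage mode, each backed by the
# physical address space of exactly one memory device.
REGION_TO_PHYSICAL = {
    UsageMode.GPU_HBM: "HBM",
    UsageMode.PIM_HBM: "HBM",
    UsageMode.GPU_LPDDR: "LPDDR",
    UsageMode.PIM_LPDDR: "LPDDR",
}


def physical_medium(mode: UsageMode) -> str:
    """Return the memory device that backs the logical region of a usage mode."""
    return REGION_TO_PHYSICAL[mode]


def modes_sharing(medium: str) -> list:
    """Return the usage modes whose logical regions map to the same device."""
    return [m for m, dev in REGION_TO_PHYSICAL.items() if dev == medium]
```

In this sketch, two usage modes map to each memory device, which mirrors the statement that two logical memory regions may be mapped to the physical address space of the HBM and two to that of the LPDDR.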
In some embodiments, a user determines the usage mode to invoke for a particular application, or portions of the application, and allocates memory in the appropriate logical address space based on the determined usage mode. For example, for computations where speed may not be a primary concern, the LPDDR may be selected as the memory medium to be used by the GPU and/or PIM. In another example, for computations where speed may be a primary concern, the HBM stack may be selected as the memory medium to be used by the GPU and/or PIM. In a more specific example, the GPU + LPDDR may be used for prefill operations of an LLM, and the PIM + HBM may be used for decode operations of the LLM, all within a single physical device containing the GPU, PIM, HBM, and LPDDR.
In some embodiments, programming methods are provided for allocating memory in the HBM or LPDDR for use by GPU, PIM, or both. In some embodiments, the GPU and PIM share data stored in a particular memory device. In this regard, programming methods may be provided to provide data coherency and operation synchronization between the GPU and the PIM.
FIG. 1 depicts a block diagram of a computing system according to an embodiment of the present disclosure. The computing system includes a computational memory device 100 coupled to a processing device 102 and an extended memory device 104. In some embodiments, one or more of the computational memory devices 100, processing devices 102, and extended memory devices 104 may be packaged together in a single physical device, such as, for example, in an A-HBM companion device.
The computational memory device 100 may be, for example, a 3D-stacked memory device (e.g., an HBM) with one or more embedded processing devices (e.g., PIMs). In this regard, the 3D-stacked memory device may include a base die (also referred to as a buffer or logic die) 106, and two or more memory dies (also referred to as core dies or memory stack) 108 stacked over the base die 106. One or more of the processing devices (e.g., PIMs) may be embedded in the one or more memory dies (also referred to as memory stack) 108 and/or in the base die 106.
The memory dies 108 may be implemented as DRAMs. However, the present invention is not limited thereto, and the memory dies 108 may be implemented as any suitable memory that may be implemented in a 3D-stacked structure.
In some embodiments, the processing device 102 includes a GPU, although embodiments are not limited thereto. For example, the processing device 102 may, in addition to or in lieu of the GPU, include a neural processing unit (NPU), a tensor processing unit (TPU), a co-processor unit, and/or the like. The processing device 102 may be configured to perform computations or operations (used interchangeably herein) of an application running in a host computing device (“host”) 112. The operations may include, for example, prefill and/or decode operations of a large language model (LLM), although embodiments are not limited thereto, and may include other computations of the application.
In some embodiments, the extended memory device 104 includes an LPDDR, although embodiments are not limited thereto. For example, the extended memory device 104 may, in lieu of or in addition to an LPDDR, include a static DRAM (SDRAM), random access memory (RAM), flash memory, and/or the like. The use of an extended memory device 104 such as an LPDDR may allow the capacity and bandwidth of the computational memory device 100 to be increased in a cost efficient manner.
In some embodiments, the host 112 is coupled to the computational memory device 100, processing device 102, and extended memory device 104 over one or more data communication links 110. In some embodiments, communication of the processing device 102 and the extended memory device 104 is through the base die 106. The host 112 may include an external processor such as, for example, a central processing unit (CPU). In some embodiments, the CPU is configured to run one or more applications including AI and/or machine learning applications. The machine learning application may include, for example, an LLM. Computations may be performed while running the application. The computations may be offloaded to the computational memory device 100 and/or processing device 102 for a faster and more efficient computation.
In some embodiments, the computational memory device 100 and/or processing device 102 receive control instructions from the host 112 through the data communication links 110. The control instructions may be for allocating memory in the computational memory device 100 and/or the extended memory device 104. The control instructions may also be for instructing the computational memory device 100 and/or processing device 102 to perform operations for an application running in the host 112.
FIG. 2 depicts a block diagram of the base die 106 according to one or more embodiments of the present disclosure. The base die 106 includes one or more processing elements (PE) or devices (also referred to as PIMs) 200, one or more memory controllers 202, and an extended memory controller 204, communicating with each other over a data communications network 206. The data communications network 206 may also enable communication between the processing device 102 and the base die 106 or the extended memory device 104. The communication may be, for example, to allow the processing device 102 to write or read data to or from the memory stack 108 and/or extended memory device 104. The data communications network 206 may be implemented, for example, as a network-on-chip (NoC) interconnect.
One or more of the processing elements 200 may include a computing circuit including, for example, arithmetic logic units (ALUs). The computing circuit may be configured to perform computations requested by the application. For example, the ALUs may retrieve data from the memory stack 108 or the extended memory device 104 to perform a computation using the data, and may transmit the results of the computation back to the memory stack 108 or extended memory device 104 for storing therein.
In some embodiments, the processing device 102 and one or more of the PEs 200 may be independently invoked to access the memory stack 108 and/or extended memory device 104 to perform computations of an application. In this regard, the processing device 102 and PE 200 (collectively referenced as processing devices 102, 200) may be mixed and matched with the extended memory device 104 and memory stack 108 (collectively referenced as memory devices 104, 108) to effectively provide four computing devices in a single physical device. The four computing devices (also referred to as usage modes) may include: 1) processing device 102 + memory stack 108; 2) processing device 102 + extended memory device 104; 3) PE 200 + memory stack 108; and 4) PE 200 + extended memory device 104. The computing devices may be independently invoked for performing a computation based on instructions from the host 112.
In some embodiments, the application provides instructions that identify the processing device(s) that are to use a memory device. Physical memory may be allocated in the identified memory device and mapped to the logical address in a logical address space. One or more embodiments of the present invention allow for data stored in the allocated physical memory to be shared by the processing devices 102, 200.
FIG. 3 depicts a conceptual layout diagram of logical and physical memory spaces that may be allocated for use by the processing devices 102, 200 according to one or more embodiments of the present disclosure. In some embodiments, an operating system (OS) of the host 112 allocates memory in a logical address space 300 in response to a memory allocation request from the application. The logical address space 300 may be divided into four independent memory regions 302, 304, 306, and 308 (referred to as 302-308). One or more of the memory regions 302-308 may be allocated based on the usage mode identified in the request.
The four memory regions 302-308 of the logical address space 300 may include: 1) a first memory region 302 for use by the processing device 102 and memory stack 108; 2) a second memory region 304 for use by the PEs 200 and memory stack 108; 3) a third memory region 306 for use by the processing device 102 and extended memory device 104; and 4) a fourth memory region 308 for use by the PEs 200 and extended memory device 104.
In some embodiments, memory is allocated in the physical address space based on the allocations in the logical address space. For example, a physical address space 310 in the memory stack 108 may be allocated for one or more allocations in the corresponding logical address space 302, 304. A physical address space 312 in the extended memory device 104 may be allocated for one or more allocations in the corresponding logical address space 306, 308.
In some embodiments, the processing devices 102, 200 may share data in one or more of the memory devices 104, 108. For example, both the processing device 102 and the PEs 200 may share data stored in the memory stack 108. In this case, logical memory addresses 302a, 304a are allocated in the appropriate logical address spaces 302, 304. A single physical memory allocation 310a may occur in the corresponding physical address space 310. The logical memory addresses 302a, 304a may be mapped to the same physical address 310a (e.g., the same physical page).
In another example, both the processing device 102 and the PEs 200 may share data stored in the extended memory device 104. In this case, logical memory addresses 306a, 308a are allocated in the appropriate logical address spaces 306, 308. A single physical memory allocation 312a may occur in the corresponding physical address space 312. The logical memory addresses 306a, 308a may be mapped to the same physical address 312a (e.g., the same physical page). In some embodiments, the logical memory addresses 302a, 304a, 306a, 308a are each contiguous memory addresses that are mapped to non-contiguous physical address pages 310a, 312a.
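By way of a non-limiting illustration, the mapping of two separate logical addresses to a single physical allocation made of non-contiguous pages may be sketched with a simple per-region page table. The page size, the region base addresses, the physical page numbers, and the `PageTable` class below are assumptions made for this sketch only; the present disclosure does not specify them.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; the description does not fix one


class PageTable:
    """Maps contiguous logical pages of one region to physical page numbers."""

    def __init__(self, region_base: int):
        self.region_base = region_base
        self.entries = {}  # logical page number -> physical page number

    def map_range(self, logical_addr: int, physical_pages: list) -> None:
        """Map a contiguous logical range onto the given physical pages."""
        first = (logical_addr - self.region_base) // PAGE_SIZE
        for offset, phys in enumerate(physical_pages):
            self.entries[first + offset] = phys

    def translate(self, logical_addr: int) -> int:
        """Translate a logical address to a physical address."""
        page, offset = divmod(logical_addr - self.region_base, PAGE_SIZE)
        return self.entries[page] * PAGE_SIZE + offset


# Hypothetical region bases for the processing device and the PE.
gpu_table = PageTable(region_base=0x1000_0000)
pim_table = PageTable(region_base=0x2000_0000)

# A single physical allocation made of non-contiguous pages.
shared_pages = [7, 42, 19]
gpu_table.map_range(0x1000_0000, shared_pages)
pim_table.map_range(0x2000_0000, shared_pages)
```

In this sketch, the same byte offset within either logical range translates to the same physical address, while each processing device keeps its own logical addresses.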
FIG. 4 depicts components of the host 112 that are invoked for a shared memory allocation according to one or more embodiments of the present disclosure. In some embodiments, an application includes a memory allocation function 400 that is executed by the OS of the host 112 for allocating memory in the appropriate memory device 104, 108. The memory allocation function 400 may include parameters that identify a memory type (extended memory 104 or memory stack 108), device type (processing device 102, PE 200, or all), pointer to the allocated memory in the logical address space based on the device type, and size of the memory to allocate.
In the example of FIG. 4, the memory allocation function 400 is a shared memory allocation request. In this regard, the memory allocation function 400 identifies the memory type 400a as the extended memory 104 (e.g., HBM), and the device type 400b as both processing devices 102, 200 (e.g., all). The function returns a first address pointer 400c to the logical memory in the logical address space 302 allocated to the processing device 102 (e.g., GPU), and a second address pointer 400d to the logical memory in the logical address space 304 allocated to the PEs 200 (e.g., PIM). A requested size 400e of the allocation in the example of FIG. 4 is 100 bytes.
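By way of a non-limiting illustration, the parameters carried by such a memory allocation function (memory type, device type, and size) may be sketched as a small request record, together with a helper that determines which device types should each receive a logical address pointer. All names below (`MemoryType`, `DeviceType`, `AllocationRequest`, `devices_for`) are hypothetical, and the request shown is illustrative rather than a reproduction of FIG. 4.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryType(Enum):
    MEMORY_STACK = "memory stack"
    EXTENDED_MEMORY = "extended memory"


class DeviceType(Enum):
    PROCESSING_DEVICE = "processing device"
    PE = "PE"
    ALL = "all"


@dataclass(frozen=True)
class AllocationRequest:
    memory_type: MemoryType
    device_type: DeviceType
    size: int  # requested size in bytes


def devices_for(request: AllocationRequest) -> list:
    """Return the device types that should each receive a logical pointer."""
    if request.size <= 0:
        raise ValueError("size must be positive")
    if request.device_type is DeviceType.ALL:
        # A shared request yields one logical pointer per processing device.
        return [DeviceType.PROCESSING_DEVICE, DeviceType.PE]
    return [request.device_type]


# Illustrative shared request: both device types, 100 bytes.
shared = AllocationRequest(MemoryType.MEMORY_STACK, DeviceType.ALL, size=100)
```

A shared request in this sketch resolves to two device types, corresponding to the two address pointers returned for a shared allocation; a non-shared request resolves to one.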
In some embodiments, a memory allocation request is transmitted based on execution of the memory allocation function 400. The request may be received by a runtime memory management layer 402 of a memory management engine 404 executed by the OS of the host 112. In some embodiments, a first (e.g., GPU) runtime layer 402a may process memory allocation requests for use by the processing device 102 (e.g., GPU), and a second (e.g., PIM) runtime layer 402b may process memory allocation requests for use by the PE/PIM 200.
In some embodiments, the memory management engine 404 receives the memory allocation request from the runtime layer 402 and forwards the request to an appropriate driver 406. In some embodiments, a request for allocating memory for the processing device 102 (e.g., the GPU) is provided to a first (e.g., GPU) driver 406a, and a request for allocating memory for the PE 200 is provided to a second (e.g., PIM) driver 406b.
The first driver 406a may be configured to keep track of available logical and physical memory chunks or pages of the memory devices 104, 108 and allocate a logical address in the first memory region 302 or the third memory region 306, and one or more physical addresses in the physical address space 310 or 312 based on the parameters of the request.
The first driver 406a may return the allocated logical memory address and the one or more allocated physical memory addresses to the memory management engine 404. The memory management engine 404 may send the memory allocation request for the PE 200 to the second driver 406b along with the list of physical addresses returned by the first driver 406a.
The second driver 406b may be configured to keep track of available logical and physical memory of the memory devices 104, 108 and allocate a logical address in the second memory region 304 or the fourth memory region 308 of the logical address space 300. For a shared memory allocation request, the second driver 406b may refrain from allocating a physical address in the physical address space 310 or 312 of the requested memory device (e.g., the extended memory device 104 or the memory stack 108). The second driver 406b may map the allocated logical address to the list of physical addresses returned by the first driver 406a. In this manner, the processing device 102 and the PE 200 may share the physical memory allocation and write and read to and from the same physical address(es). The logical addresses, however, may remain separate and independent, allowing the processing device 102 and PE to execute concurrently with each other.
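By way of a non-limiting illustration, the hand-off between the two drivers for a shared allocation may be sketched as follows: the first driver allocates both a logical address and physical pages, and the second driver allocates only a logical address and maps it to the pages it is given. The `Driver` class, the `allocate_shared` helper (which plays the role of the memory management engine), the page size, and the page numbers are assumptions made for this sketch only.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class Driver:
    """Tracks free logical addresses of one region and, optionally, physical pages."""

    def __init__(self, region_base: int, free_physical_pages=None):
        self.next_logical = region_base
        self.free_physical = list(free_physical_pages or [])
        self.mappings = {}  # logical address -> list of physical pages

    def _alloc_logical(self, num_pages: int) -> int:
        addr = self.next_logical
        self.next_logical += num_pages * PAGE_SIZE
        return addr

    def allocate(self, num_pages: int):
        """First-driver path: allocate a logical address and physical pages."""
        if num_pages > len(self.free_physical):
            raise MemoryError("not enough physical pages")
        pages = [self.free_physical.pop(0) for _ in range(num_pages)]
        addr = self._alloc_logical(num_pages)
        self.mappings[addr] = pages
        return addr, pages

    def map_shared(self, physical_pages) -> int:
        """Second-driver path: allocate only a logical address and map it to
        physical pages that another driver already allocated."""
        addr = self._alloc_logical(len(physical_pages))
        self.mappings[addr] = list(physical_pages)
        return addr


def allocate_shared(first: Driver, second: Driver, size: int):
    """Forward the first driver's physical pages to the second driver."""
    num_pages = -(-size // PAGE_SIZE)  # ceiling division
    first_addr, pages = first.allocate(num_pages)
    second_addr = second.map_shared(pages)
    return first_addr, second_addr, pages


gpu_driver = Driver(0x1000_0000, free_physical_pages=[3, 9, 4])
pim_driver = Driver(0x2000_0000)  # allocates no physical pages of its own here
gpu_ptr, pim_ptr, pages = allocate_shared(gpu_driver, pim_driver, size=100)
```

In this sketch, the two returned pointers differ while both map to the same physical page list, so that only one physical allocation is made for the shared request.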
One or more different types of memory allocation functions may be generated throughout an application for different combinations of the processing devices 102, 200 and memory devices 104, 108 depending on the operations to be performed. For example, memory may be allocated for one or more of the four usage modes: 1) processing device 102 (e.g., GPU) + memory stack 108 (e.g., HBM); 2) processing device 102 + extended memory device 104 (e.g., LPDDR); 3) PE 200 + memory stack 108; and 4) PE 200 + extended memory device 104. For example, shared (or non-shared) memory may be allocated in the memory stack 108 for certain computations where speed may be a factor. For other operations where speed may not be a factor, shared (or non-shared) memory may be allocated in the extended memory device 104.
In some embodiments, when the shared data can be modified (e.g., written) by either processing device 102, 200, a data coherency measure may be implemented to ensure data coherency. The data coherency measure may allow the processing devices 102, 200 to have the same, up-to-date view of the shared data when the data is stored in their respective caches (e.g., hardware caches).
FIG. 5 depicts an example programming code 500 for sharing data stored in the memory device 104 or 108 according to one or more embodiments of the present disclosure. In some embodiments, the programming code 500 generates a first data structure (also referred to as a stream) 502 for execution by the processing device 102 (e.g., GPU), and a second data structure or stream 504 for execution by the PEs 200 (e.g., PIM).
In the example of FIG. 5, the programming code 500 initializes a GPU processing device and a PIM processing device via initialization instructions 500a, 500b. The first stream 502 (e.g., GPU stream) and the second stream 504 (e.g., PIM stream) are generated via stream creation instructions 500c, 500d.
The streams 502, 504 may queue one or more operations to be executed in order by the corresponding processing device. For example, the programming code 500 may define an operation with one or more parameters as an event 500e (e.g., event 0). The operation may include a kernel launch operation, a memory copy operation, or the like. The parameters of a kernel launch operation may include, for example, identification of a prefill layer and identification of the stream where the operation is to be queued. In the example of FIG. 5, a GPU kernel prefill layer 0 operation defined as event 0 may be queued as a first queued operation 502a of the GPU stream.
In some embodiments, the operations queued in the streams 502, 504 are executed in order. For example, the first queued operation 502a in the GPU stream is executed before a second queued operation 502b in the same stream. The second queued operation 502b may be defined by the programming code 500 as another event 500h (e.g., event 1). The operations queued in one stream may be synchronized with operations in another stream via a synchronization operation.
In some embodiments, the programming code 500 includes synchronization instructions 500f for executing a synchronization operation. The synchronization instructions 500f may identify the event (e.g., event 0) with which a current stream is to be synchronized, and the current stream where the synchronization operation is to be queued (e.g., PIM stream).
In the example of FIG. 5, a synchronization operation 504a is queued in the PIM stream based on the synchronization instructions 500f. The synchronization operation 504a in the example of FIG. 5 identifies event 0 as the event that is to finish executing before other operations of the stream may be executed. For example, the PIM stream waits for event 0 of the GPU stream to finish executing before the PIM stream may move to execute a next operation 504b of the stream as defined by instructions 500g of the programming code 500.
In some embodiments, when a processing device 102, 200 finishes executing an operation, data in the corresponding hardware cache is flushed, and the data is stored in the memory device (e.g., memory device 104 or 108). For example, when a GPU kernel prefill layer 0 operation identified as event 0 finishes executing, the data is flushed from the hardware cache and stored in the memory device. A notification may be received by the synchronization operation 504a that event 0 has finished executing, allowing the synchronization operation 504a to finish execution and allowing the next queued operation 504b (e.g., PIM kernel decode layer 0 operation) to be executed. In executing the PIM kernel decode layer 0 operation, results of the GPU kernel prefill layer 0 operation may need to be accessed. The synchronization event may help ensure that the data that is accessed by the next queued operation 504b is the up-to-date data returned by the GPU kernel prefill layer 0 operation.
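The queueing and synchronization behavior described for the programming code of FIG. 5 may be illustrated with a minimal Python sketch. The sketch is not the programming code of FIG. 5; the names `Event`, `Stream`, `launch`, `wait_event`, and `run`, as well as the extra events and the "prefill layer 1" operation, are hypothetical and chosen only for illustration. It assumes a simple round-robin scheduler in place of real devices.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Marks completion of one queued operation (e.g., event 0). Hypothetical."""
    name: str
    done: bool = False


@dataclass
class Stream:
    """In-order queue of operations for one processing device. Hypothetical."""
    device: str
    queue: deque = field(default_factory=deque)

    def launch(self, op_name, event):
        # Queue a kernel launch whose completion is recorded in `event`.
        self.queue.append(("op", op_name, event))

    def wait_event(self, event):
        # Queue a synchronization operation that blocks this stream on `event`.
        self.queue.append(("sync", event.name, event))


def run(streams):
    """Round-robin over the streams; each stream only ever runs its head entry."""
    order = []
    while any(s.queue for s in streams):
        progressed = False
        for s in streams:
            if not s.queue:
                continue
            kind, name, event = s.queue[0]
            if kind == "sync" and not event.done:
                continue  # head of this stream is blocked; let the other stream run
            s.queue.popleft()
            if kind == "op":
                order.append((s.device, name))
                # A real device would flush its cache here before signaling.
                event.done = True
            progressed = True
        if not progressed:
            raise RuntimeError("deadlock: every stream is waiting")
    return order


gpu, pim = Stream("GPU"), Stream("PIM")
event0, event1, event2 = Event("event 0"), Event("event 1"), Event("event 2")

pim.wait_event(event0)                 # PIM stream waits for event 0
pim.launch("decode layer 0", event2)   # runs only after event 0 has finished
gpu.launch("prefill layer 0", event0)  # defined as event 0
gpu.launch("prefill layer 1", event1)  # defined as event 1

order = run([pim, gpu])
```

Even though the PIM stream is visited first by the scheduler, its decode operation is held back until the GPU prefill operation recorded as event 0 has completed, while operations within each stream keep their queued order.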
FIG. 6 depicts a flow diagram of a process for extending memory capacity and bandwidth according to one or more embodiments of the present disclosure. The process starts, and in step 600, the memory management module 404 receives a memory allocation request from an application. The memory allocation request may identify a memory type (extended memory 104 or memory stack 108) where memory is to be allocated, device type (processing device 102, PE 200, or both) to use the allocated memory, pointer to the allocated memory in the logical address space based on the device type, and size of the memory to allocate. For purposes of illustration, it is assumed that the memory allocation request in step 600 is for shared memory for use by the processing device 102 and the PE 200.
In step 602, the first driver 406a allocates a first logical memory space in the first memory region 302 or the third memory region 306 depending on the memory device identified in the memory allocation request.
In step 604, the first driver 406a allocates a first physical memory space in the memory device 104 or 108 for the allocated logical memory. The allocated logical and physical memory addresses may be returned to the memory management engine 404.
In step 606, the second driver 406b allocates a second logical memory space in the second memory region 304 or the fourth memory region 308 depending on the memory device identified in the memory allocation request. In the example where data in the physical memory region is to be shared by the processing devices 102, 200, the second driver 406b may refrain from allocating a physical memory region for the allocated second logical memory addresses. In some embodiments, the first physical memory space allocated by the first driver 406a in step 604 is used for the allocated second logical memory addresses.
In step 608, the allocated first physical memory space is mapped to the allocated first logical memory addresses and the second logical memory addresses. A pointer to the allocated logical memory addresses may be returned to the requesting application (e.g., via the appropriate address pointers 400c, 400d). The application may use the returned addresses for writing and reading data to and from the addresses when operations are performed by the processing devices 102, 200.
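The allocation flow of FIG. 6 may be sketched in Python as follows. This is an illustrative model only, not the disclosed drivers or memory management module: the `Driver` and `MemoryManager` classes, the bump-pointer allocation, the base addresses, and the `b"kv-cache"` payload are all assumptions made for the example. The sketch shows two logical address ranges, one per device, that resolve to the same physical offset.

```python
from dataclasses import dataclass, field


@dataclass
class Driver:
    """Hypothetical per-device driver that owns one logical address range."""
    logical_base: int
    _next: int = 0

    def alloc_logical(self, size):
        # Simple bump allocation inside this driver's logical region.
        addr = self.logical_base + self._next
        self._next += size
        return addr


@dataclass
class MemoryManager:
    """Hypothetical manager mapping logical addresses to one physical buffer."""
    physical: bytearray
    first_driver: Driver
    second_driver: Driver
    page_table: dict = field(default_factory=dict)  # logical addr -> physical offset
    _next_phys: int = 0

    def alloc_shared(self, size):
        if self._next_phys + size > len(self.physical):
            raise MemoryError("physical memory exhausted")
        first_ptr = self.first_driver.alloc_logical(size)    # first logical space
        phys = self._next_phys                               # first physical space
        self._next_phys += size
        second_ptr = self.second_driver.alloc_logical(size)  # second logical space,
        # with no additional physical allocation for the second device
        self.page_table[first_ptr] = phys                    # map both logical spaces
        self.page_table[second_ptr] = phys                   # to the same physical space
        return first_ptr, second_ptr

    def write(self, ptr, data):
        off = self.page_table[ptr]
        self.physical[off:off + len(data)] = data

    def read(self, ptr, n):
        off = self.page_table[ptr]
        return bytes(self.physical[off:off + n])


mm = MemoryManager(bytearray(64), Driver(0x1000), Driver(0x8000))
gpu_ptr, pim_ptr = mm.alloc_shared(16)
mm.write(gpu_ptr, b"kv-cache")   # written through the first device's pointer
result = mm.read(pim_ptr, 8)     # read back through the second device's pointer
```

Because both pointers map to the same physical offset, data written through one device's logical address is visible through the other's without a copy.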
FIG. 7 depicts a flow diagram of a process for executing operations of the streams 502, 504 of FIG. 5, where data is shared by the processing device 102 and PE 200 according to one or more embodiments of the present disclosure. In some embodiments, the streams are executed concurrently with each other.
The process starts, and in step 700, the processing device 102 and PE 200 identify the stream generated for the respective devices.
In step 702, the processing device 102 and PE 200 respectively determine whether there is an operation event queued in the corresponding stream 502, 504.
If the answer is YES, the corresponding processing device 102 and/or PE 200 executes the operation in step 704.
In step 706, the corresponding processing device 102 and/or PE 200 flushes its respective hardware cache, and stores the flushed data in the shared physical memory location.
In step 708, the corresponding processing device 102 and/or PE 200 signals that execution of the operation has finished.
Referring again to step 702, if the event to be executed is not an operation, the processing device 102 and/or PE 200 determines, in step 710, whether the event is a synchronization event.
If the answer is YES, the event in the other stream identified by the synchronization event is monitored for determining whether the event has finished execution. For example, the synchronization event may wait for a completion signal by the monitored event.
In step 712, a determination is made as to whether the monitored event has finished executing. In some embodiments, the synchronization event continues monitoring for completion of the execution of the event until a determination is made that the event has finished executing. If the event has finished executing, the synchronization event may stop, and a next event of the stream may be executed.
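The per-stream loop of FIG. 7 may be sketched in Python with one thread per stream. The sketch is a simplified model under stated assumptions: a dictionary stands in for the shared physical memory space, a per-thread dictionary stands in for each hardware cache, `threading.Event` stands in for the completion signal, and the functions `run_stream`, `prefill`, and `decode` are hypothetical.

```python
import threading

shared_memory = {}  # stands in for the shared physical memory space


def run_stream(stream, events):
    """Execute one stream's queued events in order (hypothetical model)."""
    cache = {}  # private cache of the device running this stream
    for kind, event_name, fn in stream:       # step 702: take the next queued event
        if kind == "op":
            fn(cache, shared_memory)          # execute the operation
            shared_memory.update(cache)       # step 706: flush cache to shared memory
            cache.clear()
            events[event_name].set()          # step 708: signal completion
        elif kind == "sync":                  # step 710: synchronization event
            events[event_name].wait()         # step 712: wait until it has finished


def prefill(cache, mem):
    # Result stays in the private cache until the flush.
    cache["kv"] = [1, 2, 3]


def decode(cache, mem):
    # Reads the flushed prefill result from shared memory.
    cache["token"] = sum(mem["kv"])


events = {"event 0": threading.Event(), "event 1": threading.Event()}
gpu_stream = [("op", "event 0", prefill)]
pim_stream = [("sync", "event 0", None), ("op", "event 1", decode)]

# Start the PIM stream first to show that it blocks until event 0 completes.
threads = [threading.Thread(target=run_stream, args=(s, events))
           for s in (pim_stream, gpu_stream)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Setting the completion signal only after the flush is what lets the waiting stream assume the shared memory already holds the up-to-date result.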
As a person of skill in the art should appreciate, embodiments of the present disclosure allow for increased transactions to be processed via, for example, the concurrent usage modes, when compared to a solution that uses a single processing device (e.g., a GPU server). The usage modes may be invoked to handle different operations (e.g., prefill operations and decode operations) concurrently using a shared physical memory space. Although two GPU servers may achieve a similar increase in transactions, the use of two GPU servers may be more expensive in terms of power and money than embodiments of the present disclosure. Embodiments of the present disclosure also provide flexibility in mixing and matching processing devices and memory devices based on the operations to be performed.
One or more embodiments of the present disclosure may be implemented in one or more processors or processing devices. The term processor or processing device may refer to one or more processors and/or one or more processing cores. The one or more processors may be hosted in a single device or distributed over multiple devices (e.g. over a cloud system). A processor may include, for example, application specific integrated circuits (ASICs), general purpose or special purpose central processing units (CPUs), digital signal processors (DSPs), graphics processing units (GPUs), and programmable logic devices such as field programmable gate arrays (FPGAs). In a processor, as used herein, each function is performed either by hardware configured, i.e., hard-wired, to perform that function, or by more general-purpose hardware, such as a CPU, configured to execute instructions stored in a non-transitory storage medium (e.g. memory). A processor may be fabricated on a single printed circuit board (PCB) or distributed over several interconnected PCBs. A processor may contain other processing circuits; for example, a processing circuit may include two processing circuits, an FPGA and a CPU, interconnected on a PCB.
It will be understood that, although the terms “first”, “second”, “third”, etc., may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed herein could be termed a second element, component, region, layer or section, without departing from the spirit and scope of the inventive concept.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. Also, unless explicitly stated, the embodiments described herein are not mutually exclusive. Aspects of the embodiments described herein may be combined in some implementations.
As used herein, the terms “substantially,” “about,” and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art.
As used herein, the singular forms “a” and “an” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Further, the use of “may” when describing embodiments of the inventive concept refers to “one or more embodiments of the present disclosure”. Also, the term “exemplary” is intended to refer to an example or illustration. As used herein, the terms “use,” “using,” and “used” may be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively.
Although exemplary embodiments of systems and methods for extended memory capacity and bandwidth have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that systems and methods for extended memory capacity and bandwidth constructed according to principles of this disclosure may be embodied other than as specifically described herein. The disclosure is also defined in the following claims, and equivalents thereof.
The systems and methods for extending memory capacity and bandwidth may contain one or more combinations of features set forth in the below statements.
Statement 1. An apparatus comprising: a first processing device; a second processing device; a first memory device; and a second memory device; wherein, a first logical memory space and a second logical memory space are configured to be allocated for respectively the first processing device and the second processing device, wherein the first logical memory space and the second logical memory space are further configured to be mapped to a first physical memory space of one of the first memory device or the second memory device.
Statement 2. The apparatus of Statement 1, wherein the first processing device includes a graphical processing device, and the second processing device includes a processing engine embedded in the first memory device.
Statement 3. The apparatus of Statement 1, wherein the first memory device includes two or more memory chips configured to be vertically stacked on top of each other.
Statement 4. The apparatus of Statement 1, wherein the first memory device includes a type of random access memory.
Statement 5. The apparatus of Statement 1, wherein the first processing device and the second processing device are configured to share data stored in the first physical memory space.
Statement 6. The apparatus of Statement 1, wherein the first processing device is configured to provide an address of the first physical memory space, and the second processing device is configured to map the second logical memory space to the first physical memory space based on the address.
Statement 7. The apparatus of Statement 1, wherein the first processing device is configured to execute a first computation and the second processing device is configured to execute a second computation, wherein execution of the second computation is based on the execution of the first computation.
Statement 8. The apparatus of Statement 7, wherein the first computation is associated with a first stream, and the second computation is associated with a second stream.
Statement 9. The apparatus of Statement 1, wherein the first logical memory space and the second logical memory space are assigned based on a memory allocation request, wherein the memory allocation request identifies the one of the first memory device or the second memory device, and further identifies a type of processing device for accessing the one of the first memory device or the second memory device.
Statement 10. A method comprising: receiving a memory allocation request; based on receiving the memory allocation request: allocating a first logical memory space for a first processing device; allocating a first physical memory space in a memory device based on the first logical memory space; allocating a second logical memory space for a second processing device; and mapping the first physical memory space to the first logical memory space and the second logical memory space.
Statement 11. The method of Statement 10, wherein the first processing device includes a graphical processing device, and the second processing device includes a processing engine embedded in the first memory device.
Statement 12. The method of Statement 10, wherein the first memory device includes two or more memory chips configured to be vertically stacked on top of each other.
Statement 13. The method of Statement 10, wherein the first memory device includes a type of random access memory.
Statement 14. The method of Statement 10, wherein the first processing device and the second processing device are configured to share data stored in the first physical memory space.
Statement 15. The method of Statement 10, wherein the first processing device is configured to provide an address of the first physical memory space, and the second processing device is configured to map the second logical memory space to the first physical memory space based on the address.
Statement 16. The method of Statement 10, wherein the first processing device is configured to execute a first computation and the second processing device is configured to execute a second computation, wherein execution of the second computation is based on the execution of the first computation.
Statement 17. The method of Statement 16, wherein the first computation is associated with a first stream, and the second computation is associated with a second stream.
Statement 18. The method of Statement 10, wherein the first logical memory space and the second logical memory space are assigned based on a memory allocation request, wherein the memory allocation request identifies the one of the first memory device or the second memory device, and further identifies a type of processing device for accessing the one of the first memory device or the second memory device.
May 30, 2025
January 1, 2026