US-12627502-B2

Achieving high SSL/TLS throughput in embedded devices

Published: May 12, 2026

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

An embedded system includes hash message authentication code (HMAC) hardware. The HMAC hardware receives data in separate data transfers to compute a hash. The HMAC hardware receives data of unaligned lengths in at least one of the separate data transfers. The data of unaligned lengths includes fewer valid bytes than the transfer size. The HMAC hardware responds to a residue indication indicating valid bytes associated with the data transfer to fill in the residue from a subsequent data transfer. For each data transfer the HMAC hardware receives an indication of whether the data is final data or if more data will be transferred for computation of the hash. The embedded system loads a linear buffer directly from scatter buffers, which contain encrypted data from a network. Decrypted data in the linear buffer is sent to a host using a direct memory access (DMA) operation responsive to a host request.

Patent Claims

. A method comprising:

. The method as recited in, further comprising:

. The method as recited in, wherein the first data is prepended to decrypted secure sockets layer (SSL)/transport layer security (TLS) data, the second data is the decrypted SSL/TLS data, and the third data is padding data added to an end of the decrypted SSL/TLS data.

. The method as recited in, further comprising:

. The method as recited in, further comprising: transferring the decrypted SSL/TLS data responsive to a read request using a direct memory access (DMA) operation that supports at least up to a 16 KB DMA operation.

. The apparatus as recited in,

. The apparatus as recited in, wherein the first data is prepended to decrypted secure sockets layer (SSL)/transport layer security (TLS) data, the second data is the decrypted SSL/TLS data, and the third data is padding data added to an end of the decrypted SSL/TLS data.

. The apparatus as recited in, further comprising: a plurality of scatter buffers; and a linear buffer communicatively coupled to the scatter buffers to receive data directly from the scatter buffers.

. The apparatus as recited in, wherein the apparatus is responsive to a read request by a host to send decrypted data stored in the linear buffer to the host using a direct memory access (DMA) operation.

. The system as recited in, further comprising:

. The system as recited in, wherein the system is responsive to a read request by a host to send decrypted data stored in the linear buffer to the host using a direct memory access (DMA) operation.

Detailed Description

This disclosure relates to throughput on secure sockets layer/transport layer security (SSL/TLS) in embedded devices.

There are increasing demands for secure communication in today's world. To address those demands, it is desirable to ensure high SSL/TLS throughput. SSL/TLS provides for secure network communications by encrypting the communications for transport over the network. The data is then decrypted for use by the receiving device.

illustrates a prior art approach for the receive flow in a portion of an embedded device receiving SSL/TLS data. The SSL/TLS layer requests data from the transmission control protocol/internet protocol (TCP/IP) layer and receives the data from the TCP/IP layer responsive to the request. The TCP/IP layer includes the TCP scatter buffers. A common TCP payload size is 1460 bytes. A larger SSL/TLS record, e.g., 16 kilobytes (KB), can be broken up into transfers of smaller portions of 1460 bytes each. The scatter buffers are sized to contain 512 bytes of data. A 1460-byte payload along with the required packet and header information requires four scatter buffers. Assume that the SSL layer requests a record sized at 16 KB. The TCP scatter buffers receive multiple 1460-byte payloads and transfer the 1460 bytes to an intermediate linear hold buffer responsive to the request. Once the complete TCP payload of 1460 bytes is copied from the four scatter buffers to the linear intermediate hold buffer, the four scatter buffers are freed at the same time. Once the intermediate linear hold buffer is filled, e.g., with 1460 bytes of the requested data, a memory copy operation transfers the contents of the intermediate linear hold buffer to the 16 KB linear buffer, which is sized to accommodate a 16 KB payload. The SSL/TLS layer read request length may not be equivalent to the length of the TCP packet received. For example, the read request may be for less data than contained in the TCP packet received. So, a portion of the data that is in the intermediate linear hold buffer that is not part of the SSL/TLS request is stored in the intermediate linear buffer in the SSL layer to satisfy a future SSL/TLS layer request.

Once the 16 KB linear buffer has received the encrypted data transmitted over the network, decryption logic receives encrypted data stored in the 16 KB linear buffer, decrypts the data, and the decrypted data is then stored back into the 16 KB linear buffer. In addition, a message authentication process occurs as described further herein. Once the record is decrypted, the record is available for the host. illustrates a prior art approach to transferring the data from the 16 KB linear buffer to the host. In the approach illustrated in, transferring a 16 KB size SSL/TLS record requires 12 request-responses (request from the host, response from the SSL/TLS layer in the Networking Processor (NWP)) as the maximum packet size that can be transferred to the host is limited to 1460 bytes in the embodiment illustrated in. Those 1460 bytes are stored in the host scatter buffers shown in. In order to transfer one 16 KB record, 1460 bytes of data is copied into the scatter buffers from the 16 KB linear buffer responsive to each host request and then transferred to the host from the scatter buffers. That approach not only requires multiple (12) request-responses but also requires additional memory for creating the host scatter buffers. shows the 16 KB linear buffer starts with 16 KB of decrypted data and after 12 request-responses the entire 16 KB record has been transferred through the scatter buffers to the host. The hatched portion in illustrates the cumulative amount of data that has been transferred.
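The count of 12 request-responses follows from dividing the record size by the per-response limit and rounding up. A minimal sketch of that arithmetic, using the sizes stated above (the function names are illustrative and not from the patent):

```python
import math

RECORD_SIZE = 16 * 1024    # 16 KB SSL/TLS record, as stated in the description
MAX_HOST_TRANSFER = 1460   # per-response limit in the prior-art flow


def request_responses(record_size: int, max_transfer: int) -> int:
    """Number of host request-response pairs needed to move one record."""
    return math.ceil(record_size / max_transfer)


def last_transfer_size(record_size: int, max_transfer: int) -> int:
    """Size of the final, possibly partial, transfer."""
    remainder = record_size % max_transfer
    return remainder or max_transfer


print(request_responses(RECORD_SIZE, MAX_HOST_TRANSFER))   # 12
print(last_transfer_size(RECORD_SIZE, MAX_HOST_TRANSFER))  # 324
```

Eleven full transfers carry 16,060 bytes, and the twelfth carries the remaining 324 bytes.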

Referring back to, the HMAC (hash-based message authentication code) SHA (Secure Hash Algorithm) hardware shown in requires data be sent on aligned four-byte boundaries and supports only a one-shot input, meaning that data cannot be transferred in multiple transfers. Thus, all data sent for a particular HMAC SHA hardware operation must be aligned on four-byte boundaries and must be sent in one shot. Considering an SSL record size of 16 KB, before computing the hash on the given data, HMAC computations can require data be prepended and also require padding data to properly account for the required block size. Thus, 13 bytes of HMAC inner data is prepended to the record and the data must be padded if the given data is not a multiple of the HMAC block size. With HMAC hardware supporting only single shot mode, that requires a second linear buffer to be allocated sized at (13 bytes+record size+padding data) and the (13 bytes+record size+padding data) must be memory copied into the second linear buffer.
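To make the cost of the one-shot constraint concrete, the size of that second linear buffer can be estimated as follows. This is a rough sketch under two assumptions that the description does not state: a 64-byte block size (the SHA-256 block size) and padding modeled simply as rounding up to the next block multiple. The function name is hypothetical.

```python
PREPEND_LEN = 13  # bytes of HMAC inner data prepended to the record (from the text)


def one_shot_buffer_size(record_len: int, block_size: int = 64) -> tuple[int, int]:
    """Return (total_buffer_size, padding_len) for a single-shot HMAC input.

    Assumption: padding only rounds (prepend + record) up to a multiple of
    block_size; a real implementation may pad differently.
    """
    unpadded = PREPEND_LEN + record_len
    padding = (-unpadded) % block_size
    return unpadded + padding, padding


total, padding = one_shot_buffer_size(16 * 1024)
print(total, padding)  # 16448 51
```

Under these assumptions a 16 KB record needs a 16,448-byte duplicate buffer plus a full memory copy, which is the overhead the multi-input approach described later avoids.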

In order to provide enhanced SSL/TLS throughput an embodiment provides a method that includes receiving first data at a hash message authentication code (HMAC) hardware at a first time in a first data transfer. The method further includes receiving second data at the HMAC hardware at a second time in a second data transfer that is separate from the first data transfer and receiving third data at the HMAC hardware accelerator at a third time in a third data transfer that is separate from the first data transfer and the second data transfer. The HMAC hardware performs a hash operation using the first data, the second data, and the third data.

In another embodiment an apparatus includes a hash message authentication code (HMAC) hardware and the HMAC hardware is configured to receive data in separate data transfers to perform a hash operation.

In another embodiment a system comprises a hash message authentication code (HMAC) hardware. The HMAC hardware is configured to receive data in separate data transfers to compute a hash for message authentication. The HMAC hardware is configured to receive data of unaligned lengths, the data of unaligned lengths being in at least one of the separate data transfers. The data of unaligned lengths includes a first number of bytes of valid data in a transfer having a second number of bytes, the second number of bytes being greater than the first number of bytes. The HMAC hardware is responsive to an indication of a residue associated with the data of unaligned lengths to fill in the residue from a subsequent data transfer; and wherein for each of the separate data transfers the HMAC hardware receives an indication of whether the data is a final data transfer for computation of the hash or if more data is to be transferred after the current data for computation of the hash.

The use of the same reference symbols in different drawings indicates similar or identical items.

Embodiments herein improve on the system illustrated in to provide higher SSL/TLS throughput. High SSL/TLS throughput provides a number of benefits for embedded devices. One benefit is improved user experience. Users expect fast and responsive applications. If an application has slow SSL/TLS performance, that can lead to a poor user experience and frustration for end-users. Another benefit of high SSL throughput is improved security. Since SSL/TLS is used to secure data transmissions over the network, SSL/TLS is critical to the security of many applications. However, SSL/TLS processing can be CPU-intensive and slow down the application, which can lead to lower security if the SSL/TLS processing is skipped or reduced. Therefore, high SSL/TLS throughput helps maintain a high level of security. Achieving high SSL/TLS throughput allows devices to perform SSL/TLS operations more quickly and reduces the amount of time that the central processing unit (CPU) and related circuitry needs to be active, which can enhance energy efficiency by reducing power consumption and thereby prolonging battery life.

Embodiments described herein utilize multiple enhancements to speed up SSL/TLS throughput. A first enhancement avoids the intermediate linear hold buffer shown in. A second enhancement provides more flexible HMAC SHA hardware. A third enhancement eliminates the need for multiple transfers from the 16 KB linear buffer to the host, saving time and memory resources, particularly for large transfers.

illustrates a high-level system diagram showing various system layers. In a receive operation the data flows from the bottom layer in the Networking Processor (NWP) to the top layer in the host. For a transmit operation, the flow is the opposite. The bottom layer includes the hardware for the Physical (PHY) layer, which implements the physical and electrical requirements of the network. It is responsible for managing the hardware that modulates and demodulates the radio frequency (RF) transmissions. The Media Access Control (MAC) layer is responsible for sending and receiving RF frames over the air. The layer above is the Lower Media Access Control (LMAC) layer, which provides software for interacting with hardware. The next layer is the Supplicant, which gains access to and maintains the wireless connection with an access point. The layer above includes the TCP/IP stack and the SSL/TLS layer. The TCP/IP stack is a suite of communication protocols used to interconnect network devices on the internet or in a private computer network (an intranet or extranet). The SSL/TLS layer is a protocol or communication rule that allows devices to communicate on the internet safely. The NWP-Host interfacing layer in the NWP acts as a connecting interface to the Host from the NWP. The Real Time Operating System (RTOS) runs in the Host. The Host-NWP interfacing layer acts as a connecting interface from the Host to the NWP. The topmost Application layer provides the functionality to be performed in the embedded system. The SSL/TLS layer is the focus of the improvements described herein.

illustrates an embodiment that avoids the use of the intermediate linear hold buffer shown in. Data from the scatter buffers is written directly into the linear buffer without utilizing the intermediate linear hold buffer. The control logic transfers data in accordance with the request from the SSL layer. The control logic maintains the proper pointers into the 16 KB linear buffer and the TCP scatter buffers and ensures that the transfers continue from the TCP scatter buffers into the 16 KB linear buffer until the request has been fulfilled. For example, the SSL/TLS layer request may be for a 16 KB record or less than a 16 KB record. As four scatter buffers (4×512 bytes) are required to hold a 1460-byte payload, not all the bytes in the scatter buffers are used and the control logic ensures the appropriate scatter buffer locations are transferred to the linear buffer to fulfill the request. Once the contents of the scatter buffers for a particular TCP packet have been consumed by the 16 KB linear buffer, the control logic frees the scatter buffers. The scatter buffers may not be freed if fulfilling the SSL request leaves valid data in the scatter buffers that is not part of the request. Thus, the scatter buffer may hold that data until it is needed to fulfill another request by the SSL layer. Hold scatter buffer represents that situation. Writing directly from the TCP scatter buffers into the 16 KB linear buffer saves the time required for memory copy operations due to the elimination of the intermediate linear hold buffer and saves memory by eliminating the need to allocate memory for the intermediate linear hold buffer shown in.
The control logic, while shown as a separate entity in, can be implemented as code and/or microcode running on a processor, which can be implemented as a microcontroller unit (MCU), as digital logic including state machines, as a separate memory management unit, or as any software and hardware combination that tracks the SSL/TLS request, effects the transfer of data into the 16 KB linear buffer from the TCP scatter buffers, frees the scatter buffers when possible, and ensures the data is written sequentially into the 16 KB linear buffer responsive to the SSL/TLS data request.
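The pointer bookkeeping performed by the control logic can be sketched in software. The following is a simplified model, not the patented implementation: the class and method names are hypothetical, packet and header bytes are omitted (so a 1460-byte payload occupies three 512-byte buffers here rather than four), and "freeing" a buffer is modeled as removing it from a queue.

```python
from collections import deque

SCATTER_SIZE = 512  # bytes per scatter buffer, as stated in the description


class DirectCopier:
    """Toy model: copy from scatter buffers straight into a linear buffer."""

    def __init__(self) -> None:
        self.pending: deque[bytes] = deque()  # scatter buffers with unread data
        self.offset = 0   # read pointer into the oldest scatter buffer
        self.freed = 0    # count of scatter buffers released so far

    def on_tcp_payload(self, payload: bytes) -> None:
        """Split an incoming payload across scatter buffers (headers omitted)."""
        for i in range(0, len(payload), SCATTER_SIZE):
            self.pending.append(payload[i:i + SCATTER_SIZE])

    def fill(self, linear: bytearray, write_pos: int, want: int) -> int:
        """Copy up to `want` bytes into `linear` at `write_pos`; return count."""
        copied = 0
        while copied < want and self.pending:
            buf = self.pending[0]
            n = min(want - copied, len(buf) - self.offset)
            dst = write_pos + copied
            linear[dst:dst + n] = buf[self.offset:self.offset + n]
            copied += n
            self.offset += n
            if self.offset == len(buf):   # fully consumed: free it
                self.pending.popleft()
                self.freed += 1
                self.offset = 0
            # otherwise the buffer is held for a future request
        return copied
```

In this model, a request for fewer bytes than a payload contains leaves a partially consumed buffer at the head of the queue, which plays the role of the hold scatter buffer described above.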

A second enhancement involves more efficient use of security hardware, specifically HMAC (hash-based message authentication code) SHA (Secure Hash Algorithm) hardware (referred to herein as HMAC hardware for ease of reference), shown in. HMAC hardware provides message authentication using a hash algorithm. The hash algorithm is a cryptographic operation that can consume a lot of time when implemented in software. To avoid that time-consuming operation, the HMAC hardware is implemented as a hardware accelerator. After decryption, the message authentication assures the recipient of the authenticity of the message by obtaining the correct hash. Thus, HMAC hardware can be used to check SSL/TLS data for data integrity and to authenticate the parties involved in a transaction. That prevents, e.g., a man-in-the-middle attack, where data is changed in the SSL/TLS record before it arrives at the destination.

As pointed out above, a prior implementation of HMAC hardware in described above requires that data be sent on aligned four-byte boundaries and supports only one-shot input, meaning that data cannot be transferred in multiple transfers. That is, the HMAC expects the data for the hash operation to be provided without interruption. To improve on that implementation, an embodiment of HMAC hardware is configured to support multi-input. That is, all inputs do not have to come in one data transfer operation. An embodiment of multi-input identifies different types of inputs to the HMAC hardware. One type of input is HMAC Update, which indicates to the HMAC hardware that more data will be transferred to the HMAC hardware following the current data. Another kind of input is HMAC Final, which indicates to the HMAC hardware that the current data transferred is the last data. In an embodiment a hardware register in the HMAC is programmed to indicate that the data is final data, which triggers computing the hash. That identification can be, e.g., in a command field associated with the data transfer sent to the HMAC hardware from a processor, in a control signal line supplied to the HMAC hardware, or using any other suitable mechanism to inform the HMAC hardware either that more data will follow that is to be used to generate the hash or that the data is final and the current data transfer to the HMAC hardware is the last data transfer required for hash generation. To still support single shot mode in multi-input capable HMAC hardware, the initial data transfer is indicated as an HMAC Final transfer and all the data is transferred in one shot.
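The Update/Final distinction mirrors the incremental interface that software HMAC libraries already expose. The sketch below uses Python's standard `hmac` module as a stand-in for the hardware; the `MultiInputHmac` wrapper, its `transfer` method, the key, and the all-zero placeholder data are illustrative assumptions, and SHA-256 is chosen only as an example hash.

```python
import hashlib
import hmac


class MultiInputHmac:
    """Software stand-in for multi-input HMAC hardware (illustrative only)."""

    def __init__(self, key: bytes) -> None:
        self._mac = hmac.new(key, digestmod=hashlib.sha256)
        self.digest = None

    def transfer(self, data: bytes, final: bool):
        """Accept one data transfer; `final=False` is Update, `final=True` is Final."""
        if self.digest is not None:
            raise RuntimeError("hash already finalized")
        self._mac.update(data)
        if final:                      # the Final indication triggers the hash
            self.digest = self._mac.digest()
        return self.digest


key = b"k" * 32                 # placeholder key
first = bytes(13)               # placeholder for the prepended data
second = bytes(16 * 1024)       # placeholder for the decrypted record
third = bytes(51)               # placeholder for trailing padding data

multi = MultiInputHmac(key)
multi.transfer(first, final=False)
multi.transfer(second, final=False)
result = multi.transfer(third, final=True)

# Single-shot mode: the first and only transfer is marked Final.
single = MultiInputHmac(key).transfer(first + second + third, final=True)
print(result == single)  # True
```

Because the three pieces are fed in sequence, no combined copy of the prepend, record, and padding has to be built in memory for the multi-input case.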

In addition to supporting multi-input instead of single shot mode, embodiments of the HMAC hardware support unaligned lengths of data while using direct memory access (DMA). If the input is in unaligned bytes, then the HMAC hardware calculates the residue bytes that need to be filled and fills the data appropriately from the next data input. For example, in embodiments in which data lengths that can be transferred via DMA to the HMAC hardware are limited to multiples of 4 bytes of aligned data, when 13 bytes of data is programmed to be transferred for a required prepend to the 16 KB data record, the residue will be 3 bytes as the aligned length for the DMA will be 16 bytes (4×4 bytes to accommodate the 13 bytes). The HMAC hardware fills the residue bytes from the next DMA data input. The HMAC hardware receives a notification of the transfer size and the number of valid bytes. That notification can be in a command field or other notification sent from the processor that utilizes the HMAC hardware as an accelerator. In an embodiment HMAC hardware registers are programmed for each transaction indicating the actual length of valid data for the bytes that are received in the DMA.
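The residue in this example is the gap between the number of valid bytes and the next 4-byte boundary. A minimal sketch of that calculation (function names are illustrative; the 4-byte alignment is the one used in the example above):

```python
DMA_ALIGN = 4  # DMA transfers are limited to 4-byte multiples in this example


def aligned_length(valid_bytes: int) -> int:
    """Smallest multiple of DMA_ALIGN that can hold valid_bytes."""
    return -(-valid_bytes // DMA_ALIGN) * DMA_ALIGN


def residue(valid_bytes: int) -> int:
    """Bytes in the aligned transfer that carry no valid data."""
    return aligned_length(valid_bytes) - valid_bytes


print(aligned_length(13), residue(13))  # 16 3
```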

illustrates a flow diagram showing the operation of the HMAC hardware for an exemplary hash operation in which the HMAC hardware supports both multi-input and unaligned lengths. In the HMAC hardware receives a command from the associated processor requesting a hash operation. In addition, the command indicates 13 bytes of HMAC inner data forming the prepend and required by the SHA is going to be supplied as HMAC Update data (indicating more data is coming). The HMAC hardware computes and stores an indication of the residue (here three bytes) in. Since a minimum of 16 bytes is transferred, assuming DMA is limited to four-byte increments, transferring only 13 bytes of valid data results in a residue of 3 bytes. Then in the HMAC hardware receives a command indicating a DMA transfer of a 16 KB record from the 16 KB linear buffer directly to the HMAC hardware. The HMAC starts processing based on the block size. If sufficient data has not been received, e.g., just 13 valid bytes, the HMAC retains that data waiting for additional data to be received to form a complete block. The HMAC hardware also receives an HMAC Update indication associated with the 16 KB record again indicating that more data is to come in order to complete computation of the hash. In, the HMAC uses the stored residue information to fill in the residue data from the 16 KB transfer. Although the current 16 KB transfer is aligned to 4 bytes, the previous update left a residue of 3 bytes and that residue gets filled first, so at the end of the 16 KB transfer there remains a residue of 3 bytes. Padding is done in such a way that the final data includes enough bytes to account for any remaining residue to form blocks of the appropriate block size for the HMAC. In required padding data is supplied to the HMAC hardware via a DMA transfer along with an HMAC Final indication (indicating this is the last data).
The HMAC hardware completes computation of the hash in and supplies the hash to the processor or stores the result in an appropriate memory location such as a temporary buffer for comparison with the expected hash. Support for multi-input and data of unaligned lengths avoids the need for memory copy into a duplicative memory prior to transferring data to the HMAC hardware. Note that implementations of HMAC hardware to compute a desired hash, e.g., SHA-256 or SHA-512, are well known in the art and accordingly are not being described herein other than the capability to handle non-aligned byte transfers and multi-input as described, e.g., in.
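The sequence in this flow (13 valid bytes in a 16-byte transfer, then the 16 KB record, then padding with the Final indication) can be traced with a small software model. This is a sketch under stated assumptions, not the hardware design: the class name, method name, key, and placeholder data are hypothetical; Python's `hmac` module stands in for the hash engine and applies its own internal padding, so the 51 trailing bytes here are just illustrative data; and "filling the residue" is modeled by feeding only the valid bytes of each transfer so that the next transfer's bytes follow them directly.

```python
import hashlib
import hmac


class HmacFrontEnd:
    """Toy model of aligned DMA input with a valid-byte count and residue."""

    ALIGN = 4

    def __init__(self, key: bytes) -> None:
        self._mac = hmac.new(key, digestmod=hashlib.sha256)
        self.total_valid = 0
        self.residue = 0
        self.result = None

    def dma_transfer(self, words: bytes, valid_len: int, final: bool) -> int:
        """Accept one aligned transfer; return the residue left afterwards."""
        if len(words) % self.ALIGN or not 0 <= valid_len <= len(words):
            raise ValueError("transfer must be aligned and valid_len in range")
        self._mac.update(words[:valid_len])   # invalid tail bytes are dropped
        self.total_valid += valid_len
        self.residue = (-self.total_valid) % self.ALIGN
        if final:
            self.result = self._mac.digest()
        return self.residue


key = b"k" * 32
prepend = bytes(13)
record = bytes(16 * 1024)
padding = bytes(51)

fe = HmacFrontEnd(key)
r1 = fe.dma_transfer(prepend + b"\xff" * 3, 13, final=False)  # 16-byte transfer
r2 = fe.dma_transfer(record, len(record), final=False)        # aligned 16 KB
r3 = fe.dma_transfer(padding + b"\xff", 51, final=True)       # 52-byte transfer
print(r1, r2, r3)  # 3 3 0
```

The residue stays at 3 bytes after the aligned 16 KB transfer because 13 + 16,384 = 16,397 bytes have been consumed, which is one byte past a 4-byte boundary; the 51 valid padding bytes then bring the total to 16,448, which is aligned.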

A third enhancement improves transfer speed of a 16 KB size SSL/TLS record from the 16 KB linear buffer to the host. In an embodiment illustrated in, a complete record up to a maximum size of 16 KB can be transferred to the host with a single read request from the host application. Assuming the host application has enough memory, as shown in the host sends the read request of maximum length 16 KB and firmware running in the NWP uses DMA to send the decrypted contents of the linear buffer directly to the 16 KB application buffer of the host. The linear buffer is freed upon completion of the 16 KB data transfer. Thus, transferring the complete 16 KB record from the 16 KB linear buffer to the 16 KB application buffer with a single request-response improves the throughput by reducing the number of host interactions as compared to the multiple request-responses shown in and also eliminates the need for allocating memory for host scatter buffers shown in.

is a flow diagram illustrating the differences between the old architecture illustrated in and the new architecture shown in. The new architecture flow is indicated by solid lines and the old architecture flow is indicated by dotted lines. In, the host requests data from the SSL/TLS layer. If a decrypted buffer is available in, in the old architecture the path is taken, scatter buffers are created in, and data is sent to the host in using the scatter buffers with any single transfer being limited to 1460 bytes. In the new architecture, if decrypted data is already available, the path is taken and data is transferred to the host in a single request-response with the DMA transfer size being up to a maximum size of 16 KB, although different maximum sizes can be used in other embodiments. If decrypted data is not available in, in the old architecture in the TCP packet is received from the TCP/IP stack into the scatter buffers. In the contents of the scatter buffers are copied into the intermediate linear hold buffer in. In, if the requested record size has been received, the data in the buffer is decrypted in. If the complete record has not been received the flow returns to and the process is repeated until the requested record is available for decryption. In the old architecture, once decryption is finished the flow proceeds to as described above. In the new architecture, if the decrypted data is not available in, the path is taken and the scatter buffers are copied directly into the 16 KB linear buffer in. The flow then proceeds to to determine if the data corresponding to the record size has been received. If not, the flow returns to and the transfer into the linear buffer is repeated until all data has been received. If all the data has been received the flow proceeds to decrypt the buffer in. In the new architecture once decryption is completed the flow proceeds to, which is described above.

Utilizing embodiments with the enhancements described herein saves both memory and time, thus improving the SSL/TLS throughput, which helps provide an improved user experience and power savings. The table shown in shows the improvements for reception of both asynchronous and synchronous 16 KB SSL/TLS records. For synchronous reception, data is sent to the host responsive to a read request from the host. In a synchronous transmission the NWP waits for a request from the host and then sends the data to the host. An asynchronous receive does not need a read request from the host. Instead, whenever a decrypted buffer is available data is sent immediately to the host.

illustrates a high level block diagram of an embodiment of an embedded device that incorporates the high throughput enhancements described above. The device includes wireless modem circuits, providing wireless communication capabilities for one or more wireless standards including IEEE 802.11 and Bluetooth® Low Energy (BLE) at various frequencies such as 2.4 GHz and 5.4 GHz. A networking processor (corresponding to NWP) and associated memory store code providing support and control for transmit and receive functions for one or more of the various wireless protocols supported by the wireless modem circuits. The device includes a security block that includes, e.g., the HMAC hardware, encryption and decryption logic, and other security related functions not illustrated. The device further includes a processor and associated memory to provide programming capability for application programs and other functionality of the embedded device. The processor includes a central processing unit (CPU), a floating point unit (FPU), and other functionality, e.g., memory management, not specifically illustrated. Note that the memory includes, e.g., cache memory integrated onto the processor integrated circuit, read only memory (ROM), non-volatile memory (NVM) such as flash memory, and static random access memory (SRAM). In addition to the blocks shown, the embedded device further includes a peripheral block to provide such functions as timing, memory control, analog to digital (A/D) converters, digital to analog converters (DACs), and various communication interfaces such as serial input/output (I/O). The device further includes a power management unit to reduce power consumption where possible and extend battery life. In embodiments, the embedded device further includes a host microcontroller unit (MCU) that includes the CPU and memory that provides host functionality described herein. Of course, various embodiments can include additional or fewer capabilities.
In addition, the functionality of the individual blocks in the device, while shown separately, may be incorporated into one or more of the other illustrated blocks or integrated with other functionality not illustrated. The various blocks communicate on an interconnect. Of course, other interconnects may provide for direct communication between certain blocks or functionality within blocks. The various blocks illustrated in implement the layers shown in and the functionality described for high SSL/TLS throughput.

Thus, embodiments with various enhancements to achieve high SSL/TLS throughput have been described. The description of the invention set forth herein is illustrative and is not intended to limit the scope of the invention as set forth in the following claims. The terms such as “first” and “second”, as used in the claims, unless otherwise clear by context, are used to distinguish between different items in the claims and do not otherwise indicate or imply any order in time, location, or quality. Other variations and modifications of the embodiments disclosed herein, may be made based on the description set forth herein, without departing from the scope of the invention as set forth in the following claims.
