US-20260140725-A1

Apparatus, System, and Method of Compiling Code for a Processor

Published May 21, 2026
Assignee not available in USPTO data we have
Technical Abstract

For example, a compiler may be configured to identify a loop nest based on a source code to be compiled into a target code to be executed by a target processor, the loop nest including a plurality of loops including at least a first loop and a second loop nested in the first loop, the first loop including at least one first-loop instruction outside the second loop; and to generate Address Generation Unit (AGU) configuration code to configure an AGU of the target processor based on the first-loop instruction, wherein the AGU configuration code is to configure a first dimension of the AGU based on the first loop and a second dimension of the AGU based on the second loop to configure a memory-access operation based on the first-loop instruction, to be performed at a start of the second loop or at an end of the second loop.
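As a non-limiting illustration of the loop-nest shape described above, the following Python sketch records the order of events in a two-level loop nest in which the first (outer) loop contains instructions located outside the second (inner) loop. The function names, the event labels, and the flattened variant are assumptions made for illustration only; the sketch models where the memory-access operations fall (at the start or at the end of the second loop) and does not model the AGU hardware or the compiler's actual transformation.

```python
def run_loop_nest(outer_trips, inner_trips):
    """Return the order of events in a two-level loop nest."""
    trace = []
    for i in range(outer_trips):            # first (outer) loop
        trace.append(("pre-header", i))     # first-loop instruction before the second loop
        for j in range(inner_trips):        # second (inner) loop
            trace.append(("body", i, j))
        trace.append(("latch", i))          # first-loop instruction after the second loop
    return trace


def run_flattened(outer_trips, inner_trips):
    """Same events from a single loop, assuming inner_trips >= 1.

    The outer-loop accesses are attached to the first and last
    iterations of the inner loop instead of living outside it.
    """
    trace = []
    for k in range(outer_trips * inner_trips):
        i, j = divmod(k, inner_trips)
        if j == 0:                          # start of the second loop
            trace.append(("pre-header", i))
        trace.append(("body", i, j))
        if j == inner_trips - 1:            # end of the second loop
            trace.append(("latch", i))
    return trace


nested = run_loop_nest(2, 3)
flat = run_flattened(2, 3)
```

Under the stated assumption of at least one inner iteration, both functions produce the same event order, which is the property that lets an outer-loop access be expressed in terms of the inner loop's start or end.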

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

1

1-33. (canceled)

2

34. A product comprising one or more tangible computer-readable non-transitory storage media comprising computer-executable instructions operable to, when executed by at least one processor, enable the at least one processor to cause a compiler to: identify a loop nest based on a source code to be compiled into a target code to be executed by a target processor, the loop nest comprising a plurality of loops, the plurality of loops comprising at least a first loop and a second loop nested in the first loop, wherein the first loop comprises at least one first-loop instruction, which is outside the second loop; generate Address Generation Unit (AGU) configuration code to configure an AGU of the target processor based on the first-loop instruction, wherein the AGU configuration code is to configure a first dimension of the AGU based on the first loop, and to configure a second dimension of the AGU based on the second loop, wherein the AGU configuration code is to configure the second dimension of the AGU to configure a memory-access operation to be performed at a start of the second loop or at an end of the second loop, wherein the memory-access operation is based on the first-loop instruction; and generate the target code based on compilation of the source code, wherein the target code is based on the AGU configuration code.

3

35. The product of claim 34, wherein the plurality of loops comprises a third loop nested in the first loop, the second loop is nested in the third loop, the first-loop instruction is outside the third loop, wherein the AGU configuration code is to configure a third dimension of the AGU based on the third loop, wherein the AGU configuration code is to configure the third dimension to configure the memory-access operation to be performed at the start of the second loop or at the end of the second loop.

4

36. The product of claim 35, wherein the third loop comprises a third-loop instruction, which is outside the second loop, wherein the AGU configuration code is to configure an other AGU of the target processor based on the third-loop instruction, wherein the AGU configuration code is to configure a first dimension of the other AGU based on the third loop, and to configure a second dimension of the other AGU based on the second loop, wherein the AGU configuration code is to configure the second dimension of the other AGU to configure an other memory-access operation to be performed at the start of the second loop or at the end of the second loop, wherein the other memory-access operation is based on the third-loop instruction.

5

37. The product of claim 36, wherein the instructions, when executed, cause the compiler to transform the loop nest into a transformed loop comprising the memory-access operation and the other memory-access operation, wherein the target code is based on the transformed loop.

6

38. The product of claim 35, wherein the AGU configuration code is to set a Maximum (Max) parameter of the second dimension of the AGU and a Max parameter of the third dimension of the AGU based on an entry size corresponding to the first-loop instruction.

7

39. The product of claim 34, wherein the AGU configuration code is to set a base parameter of the AGU based on a memory pointer of the first-loop instruction, and to set a Maximum (Max) parameter of the second dimension of the AGU based on an entry size corresponding to the first-loop instruction.

8

40. The product of claim 34, wherein the at least one first-loop instruction comprises a pre-header instruction to be performed before a first iteration of the second loop, wherein the AGU configuration code is to configure the second dimension of the AGU to configure the memory-access operation to be performed only at the start of the second loop.

9

41. The product of claim 40, wherein the AGU configuration code is to set a Minimum (Min) parameter of the second dimension of the AGU to zero.

10

42. The product of claim 40, wherein the instructions, when executed, cause the compiler to, based on a determination that the pre-header instruction comprises a load operation, configure the AGU configuration code to set a step parameter of the second dimension of the AGU to zero.

11

43. The product of claim 40, wherein the instructions, when executed, cause the compiler to, based on a determination that the pre-header instruction comprises a store operation, configure the AGU configuration code to set a step parameter of the second dimension of the AGU based on an entry size corresponding to the pre-header instruction.

12

44. The product of claim 40, wherein the AGU configuration code is to set a base parameter of the AGU to a memory pointer of the pre-header instruction.

13

45. The product of claim 34, wherein the at least one first-loop instruction comprises a latch instruction to be performed after a last iteration of the second loop, wherein the AGU configuration code is to configure the second dimension of the AGU to configure the memory-access operation to be performed only at the end of the second loop.

14

46. The product of claim 45, wherein the latch instruction comprises a load operation.

15

47. The product of claim 46, wherein the AGU configuration code is to set a base parameter of the AGU to a memory pointer of the latch instruction, to set a Minimum (Min) parameter of the second dimension of the AGU to zero, to set a Maximum (Max) parameter of the second dimension of the AGU to an entry size corresponding to the latch instruction, and to set a step parameter of the second dimension of the AGU to zero.

16

48. The product of claim 45, wherein the latch instruction comprises a store operation.

17

49. The product of claim 48, wherein the AGU configuration code is to set a base parameter of the AGU based on a first parameter value, a second parameter value and a third parameter value, wherein the first parameter value comprises an entry size corresponding to the latch instruction, the second parameter value comprises a total count of iterations over one or more loops, which are in the first loop and include the second loop, the third parameter value comprising a count of dimensions of the AGU corresponding to the one or more loops.

18

50. The product of claim 49, wherein the AGU configuration code is to set the base parameter, denoted Base, of the AGU as follows: wherein OrigBase denotes a memory pointer of the latch instruction, EntrySize denotes the entry size, [Σ TripCount(L)] denotes the total count of iterations over the one or more loops, and #InnerDims denotes the count of dimensions of the AGU corresponding to the one or more loops.

19

51. The product of claim 48, wherein the AGU configuration code is to set a step parameter of the second dimension of the AGU based on an entry size corresponding to the latch instruction; to set a Minimum (Min) parameter of the second dimension of the AGU based on the entry size and a count of iterations in the second loop; and to set a Maximum (Max) parameter of the second dimension of the AGU based on the Min parameter of the second dimension of the AGU and the entry size.

20

52. The product of claim 51, wherein the AGU configuration code is to set the step parameter of the second dimension of the AGU based on an additive inverse of the entry size.

21

53. The product of claim 51, wherein the AGU configuration code is to set the Min parameter of the second dimension of the AGU based on a product of an additive inverse of the entry size and a subtraction result of subtracting one from the count of iterations in the second loop.

22

54. The product of claim 51, wherein the AGU configuration code is to set the Max parameter of the second dimension of the AGU based on a sum of the Min parameter of the second dimension of the AGU and the entry size.

23

55. A computing system comprising: at least one memory to store instructions; and at least one processor to retrieve the instructions from the memory and to execute the instructions to cause the computing system to: identify a loop nest based on a source code to be compiled into a target code to be executed by a target processor, the loop nest comprising a plurality of loops, the plurality of loops comprising at least a first loop and a second loop nested in the first loop, wherein the first loop comprises at least one first-loop instruction, which is outside the second loop; generate Address Generation Unit (AGU) configuration code to configure an AGU of the target processor based on the first-loop instruction, wherein the AGU configuration code is to configure a first dimension of the AGU based on the first loop, and to configure a second dimension of the AGU based on the second loop, wherein the AGU configuration code is to configure the second dimension of the AGU to configure a memory-access operation to be performed at a start of the second loop or at an end of the second loop, wherein the memory-access operation is based on the first-loop instruction; and generate the target code based on compilation of the source code, wherein the target code is based on the AGU configuration code.

24

56. The computing system of claim 55, comprising the target processor to execute the target code, the target processor comprising a vector processor.

25

57. A method comprising: identifying a loop nest based on a source code to be compiled into a target code to be executed by a target processor, the loop nest comprising a plurality of loops, the plurality of loops comprising at least a first loop and a second loop nested in the first loop, wherein the first loop comprises at least one first-loop instruction, which is outside the second loop; generating Address Generation Unit (AGU) configuration code to configure an AGU of the target processor based on the first-loop instruction, wherein the AGU configuration code is to configure a first dimension of the AGU based on the first loop, and to configure a second dimension of the AGU based on the second loop, wherein the AGU configuration code is to configure the second dimension of the AGU to configure a memory-access operation to be performed at a start of the second loop or at an end of the second loop, wherein the memory-access operation is based on the first-loop instruction; and generating the target code based on compilation of the source code, wherein the target code is based on the AGU configuration code.

26

58. The method of claim 57, wherein the plurality of loops comprises a third loop nested in the first loop, the second loop is nested in the third loop, the first-loop instruction is outside the third loop, wherein the AGU configuration code is to configure a third dimension of the AGU based on the third loop, wherein the AGU configuration code is to configure the third dimension to configure the memory-access operation to be performed at the start of the second loop or at the end of the second loop.
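As a non-limiting illustration, the second-dimension parameter settings recited in the pre-header and latch claims above may be sketched in Python as follows. The `DimConfig` container and the function names are hypothetical and are not taken from the claims. Where a claim recites that a parameter is set "based on" a value, the sketch assumes the simplest reading, namely that the parameter equals that value. The base parameter is not modeled, and the sketch computes parameter values only; it does not simulate how an AGU consumes them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimConfig:
    """Hypothetical container for one AGU dimension's parameters."""
    min_val: int
    max_val: int
    step: int


def preheader_dim(entry_size: int, is_load: bool) -> DimConfig:
    """Second-dimension settings for a pre-header instruction.

    Min is zero and Max is taken as the entry size; the step is zero
    for a load and (by assumption) equal to the entry size for a store.
    """
    step = 0 if is_load else entry_size
    return DimConfig(min_val=0, max_val=entry_size, step=step)


def latch_load_dim(entry_size: int) -> DimConfig:
    """Second-dimension settings for a latch load: Min 0, Max entry size, step 0."""
    return DimConfig(min_val=0, max_val=entry_size, step=0)


def latch_store_dim(entry_size: int, trip_count: int) -> DimConfig:
    """Second-dimension settings for a latch store.

    Step is the additive inverse of the entry size, Min is that inverse
    times (trip_count - 1), and Max is Min plus the entry size.
    """
    step = -entry_size
    min_val = -entry_size * (trip_count - 1)
    return DimConfig(min_val=min_val, max_val=min_val + entry_size, step=step)


store_cfg = latch_store_dim(entry_size=4, trip_count=8)
```

For an entry size of 4 and a second loop of 8 iterations, the latch-store settings evaluate to a step of -4, a Min of -28, and a Max of -24.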

Detailed Description

Complete technical specification and implementation details from the patent document.

This application claims the benefit of and priority from U.S. Provisional Patent Application No. 63/415,308 entitled “APPARATUS, SYSTEM, AND METHOD OF VECTOR PROCESSING”, filed Oct. 12, 2022, the entire disclosure of which is incorporated herein by reference.

A compiler may be configured to compile source code into target code configured for execution by a processor.

There is a need to provide a technical solution to support efficient processing functionalities.

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of some aspects. However, it will be understood by persons of ordinary skill in the art that some aspects may be practiced without these specific details. In other instances, well-known methods, procedures, components, units and/or circuits have not been described in detail so as not to obscure the discussion.

Some portions of the following detailed description are presented in terms of algorithms and symbolic representations of operations on data bits or binary digital signals within a computer memory. These algorithmic descriptions and representations may be the techniques used by those skilled in the data processing arts to convey the substance of their work to others skilled in the art.

An algorithm is here, and generally, considered to be a self-consistent sequence of acts or operations leading to a desired result. These include physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers or the like. It should be understood, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.

Discussions herein utilizing terms such as, for example, “processing”, “computing”, “calculating”, “determining”, “establishing”, “analyzing”, “checking”, or the like, may refer to operation(s) and/or process(es) of a computer, a computing platform, a computing system, or other electronic computing device, that manipulate and/or transform data represented as physical (e.g., electronic) quantities within the computer's registers and/or memories into other data similarly represented as physical quantities within the computer's registers and/or memories or other information storage medium that may store instructions to perform operations and/or processes.

The terms “plurality” and “a plurality”, as used herein, include, for example, “multiple” or “two or more”. For example, “a plurality of items” includes two or more items.

References to “one aspect”, “an aspect”, “demonstrative aspect”, “various aspects” etc., indicate that the aspect(s) so described may include a particular feature, structure, or characteristic, but not every aspect necessarily includes the particular feature, structure, or characteristic. Further, repeated use of the phrase “in one aspect” does not necessarily refer to the same aspect, although it may.

As used herein, unless otherwise specified the use of the ordinal adjectives “first”, “second”, “third” etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

Some aspects, for example, may take the form of an entirely hardware aspect, an entirely software aspect, or an aspect including both hardware and software elements. Some aspects may be implemented in software, which includes but is not limited to firmware, resident software, microcode, or the like.

Furthermore, some aspects may take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For example, a computer-usable or computer-readable medium may be or may include any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

In some demonstrative aspects, the medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.

In some demonstrative aspects, a data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements, for example, through a system bus. The memory elements may include, for example, local memory employed during actual execution of the program code, bulk storage, and cache memories which may provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

In some demonstrative aspects, input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers. In some demonstrative aspects, network adapters may be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices, for example, through intervening private or public networks. In some demonstrative aspects, modems, cable modems and Ethernet cards are demonstrative examples of types of network adapters. Other suitable components may be used.

Some aspects may be used in conjunction with various devices and systems, for example, a computing device, a computer, a mobile computer, a non-mobile computer, a server computer, or the like.

As used herein, the term “circuitry” may refer to, be part of, or include, an Application Specific Integrated Circuit (ASIC), an integrated circuit, an electronic circuit, a processor (shared, dedicated, or group), and/or memory (shared, dedicated, or group), that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functionality. In some aspects, some functions associated with the circuitry may be implemented by one or more software or firmware modules. In some aspects, circuitry may include logic, at least partially operable in hardware.

The term “logic” may refer, for example, to computing logic embedded in circuitry of a computing apparatus and/or computing logic stored in a memory of a computing apparatus. For example, the logic may be accessible by a processor of the computing apparatus to execute the computing logic to perform computing functions and/or operations. In one example, logic may be embedded in various types of memory and/or firmware, e.g., silicon blocks of various chips and/or processors. Logic may be included in, and/or implemented as part of, various circuitry, e.g., processor circuitry, control circuitry, and/or the like. In one example, logic may be embedded in volatile memory and/or non-volatile memory, including random access memory, read only memory, programmable memory, magnetic memory, flash memory, persistent memory, and the like. Logic may be executed by one or more processors using memory, e.g., registers, stack, buffers, and/or the like, coupled to the one or more processors, e.g., as necessary to execute the logic.

Reference is now made to FIG. 1, which schematically illustrates a block diagram of a system 100, in accordance with some demonstrative aspects.

As shown in FIG. 1, in some demonstrative aspects system 100 may include a computing device 102.

In some demonstrative aspects, device 102 may be implemented using suitable hardware components and/or software components, for example, processors, controllers, memory units, storage units, input units, output units, communication units, operating systems, applications, or the like.

In some demonstrative aspects, device 102 may include, for example, a computer, a mobile computing device, a non-mobile computing device, a laptop computer, a notebook computer, a tablet computer, a handheld computer, a Personal Computer (PC), or the like.

In some demonstrative aspects, device 102 may include, for example, one or more of a processor 191, an input unit 192, an output unit 193, a memory unit 194, and/or a storage unit 195. Device 102 may optionally include other suitable hardware components and/or software components. In some demonstrative aspects, some or all of the components of one or more of device 102 may be enclosed in a common housing or packaging, and may be interconnected or operably associated using one or more wired or wireless links. In other aspects, components of one or more of device 102 may be distributed among multiple or separate devices.

In some demonstrative aspects, processor 191 may include, for example, a Central Processing Unit (CPU), a Digital Signal Processor (DSP), one or more processor cores, a single-core processor, a dual-core processor, a multiple-core processor, a microprocessor, a host processor, a controller, a plurality of processors or controllers, a chip, a microchip, one or more circuits, circuitry, a logic unit, an Integrated Circuit (IC), an Application-Specific IC (ASIC), or any other suitable multi-purpose or specific processor or controller. Processor 191 may execute instructions, for example, of an Operating System (OS) of device 102 and/or of one or more suitable applications.

In some demonstrative aspects, input unit 192 may include, for example, a keyboard, a keypad, a mouse, a touch-screen, a touch-pad, a track-ball, a stylus, a microphone, or other suitable pointing device or input device. Output unit 193 may include, for example, a monitor, a screen, a touch-screen, a flat panel display, a Light Emitting Diode (LED) display unit, a Liquid Crystal Display (LCD) display unit, a plasma display unit, one or more audio speakers or earphones, or other suitable output devices.

In some demonstrative aspects, memory unit 194 includes, for example, a Random Access Memory (RAM), a Read Only Memory (ROM), a Dynamic RAM (DRAM), a Synchronous DRAM (SD-RAM), a flash memory, a volatile memory, a non-volatile memory, a cache memory, a buffer, a short term memory unit, a long term memory unit, or other suitable memory units. Storage unit 195 may include, for example, a hard disk drive, a Solid State Drive (SSD), or other suitable removable or non-removable storage units. Memory unit 194 and/or storage unit 195, for example, may store data processed by device 102.

In some demonstrative aspects, device 102 may be configured to communicate with one or more other devices via at least one network 103, e.g., a wireless and/or wired network.

In some demonstrative aspects, network 103 may include a wired network, a local area network (LAN), a wireless network, a wireless LAN (WLAN) network, a radio network, a cellular network, a WiFi network, an IR network, a Bluetooth (BT) network, and the like.

In some demonstrative aspects, device 102 may be configured to perform and/or to execute one or more operations, modules, processes, procedures and/or the like, e.g., as described herein.

In some demonstrative aspects, device 102 may include a compiler 160, which may be configured to generate a target code 115, for example, based on a source code 112, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to translate the source code 112 into the target code 115, e.g., as described below.

In some demonstrative aspects, compiler 160 may include, or may be implemented as, software, a software module, an application, a program, a subroutine, instructions, an instruction set, computing code, words, values, symbols, and/or the like.

In some demonstrative aspects, the source code 112 may include computer code written in a source language.

In some demonstrative aspects, the source language may include a programming language. For example, the source language may include a high-level programming language, such as the C language, the C++ language, and/or the like.

In some demonstrative aspects, the target code 115 may include computer code written in a target language.

In some demonstrative aspects, the target language may include a low-level language, such as assembly language, object code, machine code, or the like.

In some demonstrative aspects, the target code 115 may include one or more object files, e.g., which may create and/or form an executable program.

In some demonstrative aspects, the executable program may be configured to be executed on a target computer. For example, the target computer may include a specific computer hardware, a specific machine, and/or a specific operating system.

In some demonstrative aspects, the executable program may be configured to be executed on a processor 180, e.g., as described below.

In some demonstrative aspects, processor 180 may include a vector processor 180, e.g., as described below. In other aspects, processor 180 may include any other type of processor.

Some demonstrative aspects are described herein with respect to a compiler, e.g., compiler 160, configured to compile source code 112 into target code 115 configured to be executed by a vector processor 180, e.g., as described below. In other aspects, a compiler, e.g., compiler 160, may be configured to compile source code 112 into target code 115 configured to be executed by any other type of processor 180.

In some demonstrative aspects, processor 180 may be implemented as part of device 102.

In other aspects, processor 180 may be implemented as part of any other device, e.g., separate from device 102.

In some demonstrative aspects, vector processor 180 (also referred to as an “array processor”) may include a processor, which may be configured to process an entire vector in one instruction, e.g., as described below.

In other aspects, the executable program may be configured to be executed on any other additional or alternative type of processor.

In some demonstrative aspects, the vector processor 180 may be designed to support high-performance image and/or vector processing. For example, the vector processor 180 may be configured to process 1/2/3/4D arrays of fixed point data and/or floating point arrays, e.g., very quickly and/or efficiently.

In some demonstrative aspects, the vector processor 180 may be configured to process arbitrary data, e.g., structures with pointers to structures. For example, the vector processor 180 may include a scalar processor to compute the non-vector data, for example, assuming the non-vector data is minimal.
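As a non-limiting illustration of processing a vector in one instruction, the following Python sketch contrasts an element-by-element addition with an addition that handles a fixed number of elements per operation. The function names, the operation counts, and the lane width of 8 are assumptions made for illustration; the sketch is a toy cost model and does not describe the actual instruction set or width of the vector processor.

```python
def scalar_add(a, b):
    """Element-wise add, counting one add operation per element."""
    out = [x + y for x, y in zip(a, b)]
    return out, len(out)


def vector_add(a, b, lanes=8):
    """Element-wise add in chunks of `lanes`, counting one operation per chunk."""
    out, ops = [], 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        ops += 1  # one vector operation covers the whole chunk
    return out, ops


a = list(range(16))
b = [1] * 16
scalar_result, scalar_ops = scalar_add(a, b)
vector_result, vector_ops = vector_add(a, b, lanes=8)
```

In this toy model both paths compute the same result, while the chunked path issues 2 operations for 16 elements instead of 16.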

In some demonstrative aspects, compiler 160 may be implemented as a local application to be executed by device 102. For example, memory unit 194 and/or storage unit 195 may store instructions resulting in compiler 160, and/or processor 191 may be configured to execute the instructions resulting in compiler 160 and/or to perform one or more calculations and/or processes of compiler 160, e.g., as described below.

In other aspects, compiler 160 may include a remote application to be executed by any suitable computing system, e.g., a server 170.

In some demonstrative aspects, server 170 may include at least a remote server, a web-based server, a cloud server, and/or any other server.

In some demonstrative aspects, the server 170 may include a suitable memory and/or storage unit 174 having stored thereon instructions resulting in compiler 160, and a suitable processor 171 to execute the instructions, e.g., as described below.

In some demonstrative aspects, compiler 160 may include a combination of a remote application and a local application.

In one example, compiler 160 may be downloaded and/or received by the user of device 102 from another computing system, e.g., server 170, such that compiler 160 may be executed locally by users of device 102. For example, the instructions may be received and stored, e.g., temporarily, in a memory or any suitable short-term memory or buffer of device 102, e.g., prior to being executed by processor 191 of device 102.

In another example, compiler 160 may include a client-module to be executed locally by device 102, and a server module to be executed by server 170. For example, the client-module may include and/or may be implemented as a local application, a web application, a web site, a web client, e.g., a Hypertext Markup Language (HTML) web application, or the like.

For example, one or more first operations of compiler 160 may be performed locally, for example, by device 102, and/or one or more second operations of compiler 160 may be performed remotely, for example, by server 170.

In other aspects, compiler 160 may include, or may be implemented by, any other suitable computing arrangement and/or scheme.

In some demonstrative aspects, system 100 may include an interface 110, e.g., a user interface, to interface between a user of device 102 and one or more elements of system 100, e.g., compiler 160.

In some demonstrative aspects, interface 110 may be implemented using any suitable hardware components and/or software components, for example, processors, controllers, memory units, storage units, input units, output units, communication units, operating systems, and/or applications.

In some aspects, interface 110 may be implemented as part of any suitable module, system, device, or component of system 100.

In other aspects, interface 110 may be implemented as a separate element of system 100.

In some demonstrative aspects, interface 110 may be implemented as part of device 102. For example, interface 110 may be associated with and/or included as part of device 102.

In one example, interface 110 may be implemented, for example, as middleware, and/or as part of any suitable application of device 102. For example, interface 110 may be implemented as part of compiler 160 and/or as part of an OS of device 102.

In some demonstrative aspects, interface 110 may be implemented as part of server 170. For example, interface 110 may be associated with and/or included as part of server 170.

In one example, interface 110 may include, or may be part of a Web-based application, a web-site, a web-page, a plug-in, an ActiveX control, a rich content component, e.g., a Flash or Shockwave component, or the like.

In some demonstrative aspects, interface 110 may be associated with and/or may include, for example, a gateway (GW) 113 and/or an Application Programming Interface (API) 114, for example, to communicate information and/or communications between elements of system 100 and/or to one or more other, e.g., internal or external, parties, users, applications and/or systems.

In some aspects, interface 110 may include any suitable Graphic-User-Interface (GUI) 116 and/or any other suitable interface.

In some demonstrative aspects, interface 110 may be configured to receive the source code 112, for example, from a user of device 102, e.g., via GUI 116, and/or API 114.

In some demonstrative aspects, interface 110 may be configured to transfer the source code 112, for example, to compiler 160, for example, to generate the target code 115, e.g., as described below.

Reference is made to FIG. 2, which schematically illustrates a compiler 200, in accordance with some demonstrative aspects. For example, compiler 160 (FIG. 1) may implement one or more elements of compiler 200, and/or may perform one or more operations and/or functionalities of compiler 200.

In some demonstrative aspects, as shown in FIG. 2, compiler 200 may be configured to generate a target code 233, for example, by compiling a source code 212 in a source language.

In some demonstrative aspects, as shown in FIG. 2, compiler 200 may include a front-end 210 configured to receive and analyze the source code 212 in the source language.

In some demonstrative aspects, front-end 210 may be configured to generate an intermediate code 213, for example, based on the source code 212.

In some demonstrative aspects, intermediate code 213 may include a lower level representation of the source code 212.

In some demonstrative aspects, front-end 210 may be configured to perform, for example, lexical analysis, syntax analysis, semantic analysis, and/or any other additional or alternative type of analysis, of the source code 212.

In some demonstrative aspects, front-end 210 may be configured to identify errors and/or problems with an outcome of the analysis of the source code 212. For example, front-end 210 may be configured to generate error information, e.g., including error and/or warning messages, for example, which may identify a location in the source code 212, for example, where an error or a problem is detected.

In some demonstrative aspects, as shown in FIG. 2, compiler 200 may include a middle-end 220 configured to receive and process the intermediate code 213, and to generate an adjusted, e.g., optimized, intermediate code 223.

In some demonstrative aspects, middle-end 220 may be configured to perform one or more adjustments, e.g., optimizations, to the intermediate code 213, for example, to generate the adjusted intermediate code 223.

In some demonstrative aspects, middle-end 220 may be configured to perform the one or more optimizations on the intermediate code 213, for example, independent of a type of the target computer to execute the target code 233.

In some demonstrative aspects, middle-end 220 may be implemented to support use of the optimized intermediate code 223, for example, for different machine types.

In some demonstrative aspects, middle-end 220 may be configured to optimize the intermediate representation of the intermediate code 223, for example, to improve performance and/or quality of the produced target code 233.

In some demonstrative aspects, the one or more optimizations of the intermediate code 213 may include, for example, inline expansion, dead-code elimination, constant propagation, loop transformation, parallelization, and/or the like.

In some demonstrative aspects, as shown in FIG. 2, compiler 200 may include a back-end 230 configured to receive and process the adjusted intermediate code 213, and to generate the target code 233 based on the adjusted intermediate code 213.

In some demonstrative aspects, back-end 230 may be configured to perform one or more operations and/or processes, which may be specific for the target computer to execute the target code 233. For example, back-end 230 may be configured to process the optimized intermediate code 213 by applying to the adjusted intermediate code 213 analysis, transformation, and/or optimization operations, which may be configured, for example, based on the target computer to execute the target code 233.

In some demonstrative aspects, the one or more analysis, transformation, and/or optimization operations applied to the adjusted intermediate code 213 may include, for example, resource and storage decisions, e.g., register allocation, instruction scheduling, and/or the like.

In some demonstrative aspects, the target code 233 may include target-dependent assembly code, which may be specific to the target computer and/or a target operating system of the target computer, which is to execute the target code 233.

In some demonstrative aspects, the target code 233 may include target-dependent assembly code for a processor, e.g., vector processor 180 (FIG. 1).

In some demonstrative aspects, compiler 200 may include a Vector Micro-Code Processor (VMP) Open Computing Language (OpenCL) compiler, e.g., as described below. In other aspects, compiler 200 may include, or may be implemented as part of, any other type of vector processor compiler.

In some demonstrative aspects, the VMP OpenCL compiler may include a Low Level Virtual Machine (LLVM) based (LLVM-based) compiler, which may be configured according to an LLVM-based compilation scheme, for example, to lower OpenCL C-code to VMP accelerator assembly code, e.g., suitable for execution by vector processor 180 (FIG. 1).

In some demonstrative aspects, compiler 200 may include one or more technologies, which may be required to compile code to a format suitable for a VMP architecture, e.g., in addition to open-sourced LLVM compiler passes.

In some demonstrative aspects, FE 210 may be configured to parse the OpenCL C-code and to translate it, e.g., through an Abstract Syntax Tree (AST), for example, into an LLVM Intermediate Representation (IR).

In some demonstrative aspects, compiler 200 may include a dedicated API, for example, to detect a correct pattern for compiler pattern matching, for example, suitable for the VMP. For example, the VMP may be configured as a Complex Instruction Set Computer (CISC) machine implementing a very complex Instruction Set Architecture (ISA), which may be hard to target from standard C code. Accordingly, compiler pattern matching may not be able to easily detect the correct pattern, and for this case the compiler may require a dedicated API.

In some demonstrative aspects, FE 210 may implement one or more vendor extension built-ins, which may target VMP-specific ISA, for example, in addition to standard OpenCL built-ins, which may be optimized to a VMP machine.

In some demonstrative aspects, FE 210 may be configured to implement OpenCL structures and/or work item functions.

In some demonstrative aspects, ME 220 may be configured to process LLVM IR code, which may be general and target-independent, for example, although it may include one or more hooks for specific target architectures.

In some demonstrative aspects, ME 220 may perform one or more custom passes, for example, to support the VMP architecture, e.g., as described below.

In some demonstrative aspects, ME 220 may be configured to perform one or more operations of a Control Flow Graph (CFG) Linearization analysis, e.g., as described below.

In some demonstrative aspects, the CFG Linearization analysis may be configured to linearize the code, for example, by converting if-statements to select patterns, for example, in case VMP vector code does not support standard control flow.

In one example, ME 220 may receive a given code, e.g., as follows:

if (x > 0) {
  A = A + 5;
} else {
  B = B * 2;
}

According to this example, ME 220 may be configured to apply the CFG Linearization analysis to the given code, e.g., as follows:
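The linearized listing is not reproduced in this text. As a minimal, non-limiting sketch, assuming the conversion takes the common if-conversion form in which each assignment becomes an unconditional select, the two forms may be compared as shown here. The function names `branchy`, `linearized`, and `forms_agree` are hypothetical and are used only for illustration.

```c
#include <assert.h>

/* Original form: control flow decides which variable is updated. */
void branchy(int x, int *A, int *B) {
    if (x > 0) {
        *A = *A + 5;
    } else {
        *B = *B * 2;
    }
}

/* Linearized form (illustrative assumption): both candidate values are
 * computed, and a select keeps either the new or the old value. */
void linearized(int x, int *A, int *B) {
    int cond = (x > 0);
    int a_new = *A + 5;
    int b_new = *B * 2;
    *A = cond ? a_new : *A;
    *B = cond ? *B : b_new;
}

/* Returns 1 when both forms leave A and B in the same state. */
int forms_agree(int x, int a, int b) {
    int a1 = a, b1 = b, a2 = a, b2 = b;
    branchy(x, &a1, &b1);
    linearized(x, &a2, &b2);
    return a1 == a2 && b1 == b2;
}

/* Small self-check covering both sides of the condition. */
void check_linearization(void) {
    assert(forms_agree(3, 10, 7));
    assert(forms_agree(-1, 10, 7));
}
```

In this sketch the select form has no branch, so the same instruction sequence runs regardless of the value of `x`, which is the property a vector unit without standard control flow would need.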

In some demonstrative aspects, ME 220 may be configured to perform one or more operations of an auto-vectorization analysis, e.g., as described below.

In some demonstrative aspects, the auto-vectorization analysis may be configured to vectorize, e.g., auto-vectorize, a given code, e.g., to utilize vector capabilities of the VMP.

In some demonstrative aspects, ME 220 may be configured to perform the auto-vectorization analysis, for example, to vectorize code in a scalar form. For example, some or all operations of the auto-vectorization analysis may not be performed, for example, in case the code is already provided in a vectorized form.

In some demonstrative aspects, for example, in some use cases and/or scenarios, a compiler may not always be able to auto-vectorize a code, for example, due to data dependencies between loop iterations.

In one example, ME 220 may receive a given code, e.g., as follows:

char *a, *b, *c;
for (int i = 0; i < 2048; i++) {
  a[i] = b[i] + c[i];
}

According to this example, ME 220 may be configured to perform the CFG auto-vectorization analysis by applying a first conversion, e.g., as follows:

char *a, *b, *c;
for (int i = 0; i < 2048; i += 32) {
  a[i...i+31] = b[i...i+31] + c[i...i+31];
}

For example, ME 220 may be configured to perform the CFG auto-vectorization analysis by applying a second conversion, for example, following the first conversion, e.g., as follows:

char32 *a, *b, *c;
for (int i = 0; i < 64; i++) {
  a[i] = b[i] + c[i];
}
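As a non-limiting sketch, the conversions above may be emulated in plain C, with an inner loop standing in for a single 32-lane vector addition. The names `add_scalar`, `add_chunked`, `LANES`, and `N_ELEMS` are hypothetical, and the use of `unsigned char` (to keep the wrap-around arithmetic well defined) is an assumption of the sketch.

```c
#include <assert.h>
#include <string.h>

#define N_ELEMS 2048
#define LANES   32   /* assumed vector width, matching the step of 32 above */

/* Scalar form: one element per iteration, 2048 iterations. */
void add_scalar(unsigned char *a, const unsigned char *b, const unsigned char *c) {
    for (int i = 0; i < N_ELEMS; i++) {
        a[i] = (unsigned char)(b[i] + c[i]);
    }
}

/* Chunked form: 64 outer iterations, each covering elements i..i+31.
 * The inner loop stands in for one 32-lane vector addition. */
void add_chunked(unsigned char *a, const unsigned char *b, const unsigned char *c) {
    assert(N_ELEMS % LANES == 0);  /* no remainder loop is needed */
    for (int i = 0; i < N_ELEMS; i += LANES) {
        for (int lane = 0; lane < LANES; lane++) {
            a[i + lane] = (unsigned char)(b[i + lane] + c[i + lane]);
        }
    }
}

/* Returns 1 when both forms produce identical output on sample data. */
int vectorized_matches_scalar(void) {
    static unsigned char b[N_ELEMS], c[N_ELEMS], r1[N_ELEMS], r2[N_ELEMS];
    for (int i = 0; i < N_ELEMS; i++) {
        b[i] = (unsigned char)(i * 3);
        c[i] = (unsigned char)(i + 7);
    }
    add_scalar(r1, b, c);
    add_chunked(r2, b, c);
    return memcmp(r1, r2, N_ELEMS) == 0;
}
```

The outer trip count of the chunked form is 2048 / 32 = 64, which corresponds to the loop bound of the second conversion.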

In some demonstrative aspects, ME 220 may be configured to perform one or more operations of a Scratch Pad Memory Loop Access Analysis (SPMLAA), e.g., as described below.

In some demonstrative aspects, the SPMLAA may define Processing Blocks (PB), e.g., that should be outlined and compiled for VMP later.

In some demonstrative aspects, the processing blocks may include accelerated loops, which may be executed by the vector unit of the VMP.

In some demonstrative aspects, a PB, e.g., each PB, may include memory references. For example, some or all memory accesses may refer to local memory banks.

In some demonstrative aspects, the VMP may enable access to memory banks through AGUs, e.g., AGUs 320 as described below with reference to FIG. 3, and Scatter Gather units (SG).

In some demonstrative aspects, the AGUs may be pre-configured, e.g., before loop execution. For example, a loop trip count may be calculated, e.g., ahead of running a processing block.

In some demonstrative aspects, image references, e.g., some or all image references, may be created at this stage, and may be followed by calculation of strides and offsets, e.g., per dimension for each reference.

In some demonstrative aspects, ME 220 may be configured to perform one or more operations of an AGU planner analysis, e.g., as described below.

In some demonstrative aspects, the AGU Planner analysis may include iterator assignment, which may cover image references, e.g., all image references, from the entire Processing Block.

In some demonstrative aspects, an iterator may cover a single reference or a group of references.

In some demonstrative aspects, one or more memory references may be coalesced and/or reuse a same access through shuffle instructions, and/or saving values read from previous iterations.

In some demonstrative aspects, other memory references, e.g., which have no linear access pattern, may be handled using a Scatter-Gather (SG) unit, which may have a performance penalty, e.g., as it may require maintaining indices and/or masks.

In some demonstrative aspects, a plan may be configured as an arrangement of iterators in a processing block. For example, a processing block may have multiple plans, e.g., theoretically.

In some demonstrative aspects, the AGU Planner analysis may be configured to build all possible plans for all PBs, and to select a combination, e.g., a best combination, e.g., from all valid combinations.

In some demonstrative aspects, a total number of iterators in a valid combination may be limited, e.g., not to exceed a number of available AGUs on a VMP.
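As a non-limiting sketch of the selection described above, the following code exhaustively tries one plan per processing block and keeps the cheapest combination whose total iterator count fits the AGU budget. The `Plan` and `ProcessingBlock` types, the per-plan `cost` estimate, and all numbers are assumptions made for illustration; the text does not specify how plans are scored.

```c
#include <assert.h>
#include <limits.h>

#define MAX_PLANS 4
#define MAX_PBS   8

/* Hypothetical plan description: iterator count and an assumed cost estimate. */
typedef struct { int iterators; int cost; } Plan;
typedef struct { int num_plans; Plan plans[MAX_PLANS]; } ProcessingBlock;

/* Tries every combination of one plan per block. Writes the chosen plan
 * index per block into `choice` and returns the best total cost, or -1
 * when no combination fits within `num_agus` iterators. */
int select_plans(const ProcessingBlock *pbs, int num_pbs, int num_agus, int *choice) {
    int idx[MAX_PBS] = {0};
    int best = INT_MAX;
    assert(num_pbs > 0 && num_pbs <= MAX_PBS);
    for (int p = 0; p < num_pbs; p++) {
        assert(pbs[p].num_plans >= 1 && pbs[p].num_plans <= MAX_PLANS);
    }
    for (;;) {
        int iters = 0, cost = 0;
        for (int p = 0; p < num_pbs; p++) {
            iters += pbs[p].plans[idx[p]].iterators;
            cost += pbs[p].plans[idx[p]].cost;
        }
        if (iters <= num_agus && cost < best) {  /* valid and cheaper */
            best = cost;
            for (int p = 0; p < num_pbs; p++) choice[p] = idx[p];
        }
        /* Advance to the next combination (mixed-radix counter). */
        int p = 0;
        while (p < num_pbs && ++idx[p] == pbs[p].num_plans) idx[p++] = 0;
        if (p == num_pbs) break;
    }
    return best == INT_MAX ? -1 : best;
}

int example_choice[2];

/* Example with two processing blocks and 4 AGUs (all numbers are made up). */
int example_selection(void) {
    const ProcessingBlock pbs[2] = {
        { 2, { {3, 10}, {2, 14} } },
        { 2, { {3, 8},  {1, 20} } },
    };
    return select_plans(pbs, 2, 4, example_choice);
}
```

In this example only the combinations using 4 or 3 iterators are valid, and the cheaper of the two is selected. An exhaustive search is used here only for clarity and may not scale to many processing blocks.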

In some demonstrative aspects, one or more parameters, e.g., including stride, width and/or base, may be defined for an iterator, e.g., for each iterator, for example, as part of the AGU Planner analysis. For example, min-max ranges for the iterators may be defined in a dimension, e.g., in each dimension, for example, as part of the AGU Planner analysis.

In some demonstrative aspects, the AGU Planner analysis may be configured to track and evaluate a memory reference, e.g., each memory reference, to an image, e.g., to understand its access pattern.

In one example, according to Examples 2a/2b, the image ‘a’ which is the base address, may be accessed with steps of 32 bytes for 64 iterations.
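As a non-limiting sketch, a linear access pattern of this kind may be summarized by a base, a step, and a trip count. The `AccessPattern` type and the function names are hypothetical, and the base offset of 0 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of a linear (affine) access pattern. */
typedef struct {
    size_t base;        /* starting byte offset of the image */
    size_t step_bytes;  /* distance between consecutive accesses */
    size_t trip_count;  /* number of loop iterations */
} AccessPattern;

/* Byte offset touched on iteration i. */
size_t access_offset(const AccessPattern *p, size_t i) {
    assert(i < p->trip_count);
    return p->base + i * p->step_bytes;
}

/* Total number of bytes spanned by the pattern. */
size_t access_span(const AccessPattern *p) {
    return p->step_bytes * p->trip_count;
}

/* Pattern from the example: steps of 32 bytes for 64 iterations. */
size_t example_span(void) {
    AccessPattern p = { 0, 32, 64 };
    return access_span(&p);
}

size_t example_last_offset(void) {
    AccessPattern p = { 0, 32, 64 };
    return access_offset(&p, 63);
}
```

Under these assumptions the pattern spans 64 × 32 = 2048 bytes, and the last access starts at byte offset 2016.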

In some demonstrative aspects, the LLVM may include a scalar evaluation analysis (SCEV), which may compute an access pattern, e.g., to understand every image reference.

In some demonstrative aspects, ME 220 may utilize masking capabilities of the AGUs, for example, to avoid maintaining an induction variable, which may have a performance penalty.

In some demonstrative aspects, ME 220 may be configured to perform one or more operations of a rewrite analysis, e.g., as described below.

In some demonstrative aspects, the rewrite analysis may be configured to transform the code of a processing block, for example, while setting iterators and/or modifying memory access instructions.

In some demonstrative aspects, setting of the iterators, e.g., of all iterators, may be implemented in IR in target-specific intrinsics. For example, the setting of the iterators may reside in a pre-header of an outermost loop.

In some demonstrative aspects, the rewrite analysis may include loop-perfectization analysis, e.g., as described below.

In some demonstrative aspects, the code may be compiled with a target that substantially all calculations should be executed inside the innermost loop.

For example, the loop-perfectization analysis may hoist instructions, e.g., to move into a loop an operation performed after a last iteration of the loop.

For example, the loop-perfectization analysis may sink instructions, e.g., to move into a loop an operation performed before a first iteration of the loop.

For example, the loop-perfectization analysis may hoist instructions and/or sink instructions, for example, such that substantially all instructions are moved from outer loops to the innermost loops.

For example, the loop-perfectization analysis may be configured to provide a technical solution to support VMP iterators, e.g., to work on perfectly nested loops only.

For example, the loop-perfectization analysis may result in a situation where there are no instructions between the “for” statements that compose the loop, e.g., to support VMP iterators, which cannot emulate such cases.

In some demonstrative aspects, the loop-perfectization analysis may be configured to collapse a nested loop into a single collapsed loop.

In one example, ME 220 may receive a given code, e.g., as follows:

for (int i = 0; i < N; i++) {
  int sum = 0;
  for (int j = 0; j < M; j++) {
    sum += a[j + stride * i];
  }
  res[i] = sum;
}

According to this example, ME 220 may be configured to perform the loop-perfectization analysis to collapse the nested loop in the code to a single collapsed loop, e.g., as follows:

int sum = 0;
for (int k = 0; k < N * M; k++) {
  sum = (k % M == 0 ? 0 : sum);
  sum += a[k % M + stride * (k / M)];
  res[k / M] = sum;
}
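As a non-limiting sketch, the nested and collapsed forms may be wrapped in functions and compared on sample data. The function names and the sample dimensions are hypothetical. The declaration of `sum` ahead of the collapsed loop is an assumption needed to make the listing compile.

```c
#include <assert.h>

/* Nested form from the example above. */
void sum_rows_nested(const int *a, int *res, int N, int M, int stride) {
    for (int i = 0; i < N; i++) {
        int sum = 0;
        for (int j = 0; j < M; j++) {
            sum += a[j + stride * i];
        }
        res[i] = sum;
    }
}

/* Collapsed form: a single loop of N * M iterations. */
void sum_rows_collapsed(const int *a, int *res, int N, int M, int stride) {
    int sum = 0;
    assert(M > 0);
    for (int k = 0; k < N * M; k++) {
        sum = (k % M == 0 ? 0 : sum);   /* reset at the start of each row */
        sum += a[k % M + stride * (k / M)];
        res[k / M] = sum;               /* the last write per row holds the full sum */
    }
}

/* Returns 1 when both forms agree on a 3x4 sample with a row stride of 5. */
int collapse_matches_nested(void) {
    enum { N = 3, M = 4, STRIDE = 5 };
    int a[N * STRIDE], r1[N], r2[N];
    for (int i = 0; i < N * STRIDE; i++) a[i] = i * i - 3;
    sum_rows_nested(a, r1, N, M, STRIDE);
    sum_rows_collapsed(a, r2, N, M, STRIDE);
    for (int i = 0; i < N; i++) {
        if (r1[i] != r2[i]) return 0;
    }
    return 1;
}
```

In the collapsed form the initialization of `sum` and the store to `res` both execute inside the single loop body, so no instruction remains between loop levels.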

In some demonstrative aspects, ME 220 may be configured to perform one or more operations of a Vector Loop Outlining analysis, e.g., as described below.

In some demonstrative aspects, the Vector Loop Outlining analysis may be configured to divide a code between a scalar subsystem and a vector subsystem, e.g., vector processing block 310 (FIG. 3) and scalar processor 330 (FIG. 3) as described below with reference to FIG. 3.

In some demonstrative aspects, the VMP accelerator may include the scalar and/or vector subsystems, e.g., as described below. For example, each of the subsystems may have different compute units/processors. Accordingly, a scalar code may be compiled on a scalar compiler, e.g., an SSC compiler, and/or an accelerated vector code may run on the VMP vector processor.

In some demonstrative aspects, the Vector Loop Outlining analysis may be configured to create a separate function for a loop body of the accelerated vector code. For example, these functions may be marked for the VMP and/or may continue to the VMP backend, for example, while the rest of the code may be compiled by the SSC compiler.

In some demonstrative aspects, one or more parts of a vector loop, e.g., configuration of the vector unit and/or initialization of vector registers, may be performed by a scalar unit. However, these parts may be performed in a later stage, for example, by performing backpatching into the scalar code, e.g., as the scalar code may still be in LLVM IR before processing by the SSC compiler.

In some demonstrative aspects, BE 230 may be configured to translate the LLVM IR into machine instructions. For example, the BE 230 may not be target agnostic and may be familiar with target-specific architecture and optimizations, e.g., compared to ME 220, which may be agnostic to a target-specific architecture.

In some demonstrative aspects, BE 230 may be configured to perform one or more analyses, which may be specific to a target machine, e.g., a VMP machine, to which the code is lowered, e.g., although BE 230 may use common LLVM.

In some demonstrative aspects, BE 230 may be configured to perform one or more operations of an instruction lowering analysis, e.g., as described below.

In some demonstrative aspects, the instruction lowering analysis may be configured to translate LLVM IR into target-specific instructions Machine IR (MIR), for example, by translating the LLVM IR into a Directed Acyclic Graph (DAG).

In some demonstrative aspects, the DAG may go through a legalization process of instructions, for example, based on the data types and/or VMP instructions, which may be supported by a VMP HW.

In some demonstrative aspects, the instruction lowering analysis may be configured to perform a process of pattern-matching, e.g., after the legalization process of instructions, for example, to lower a node, e.g., each node, in the DAG, for example, into a VMP-specific machine instruction.

In some demonstrative aspects, the instruction lowering analysis may be configured to generate the MIR, for example, after the process of pattern-matching.

In some demonstrative aspects, the instruction lowering analysis may be configured to lower the instruction according to machine Application Binary Interface (ABI) and/or calling conventions.

In some demonstrative aspects, BE 230 may be configured to perform one or more operations of a unit balancing analysis, e.g., as described below.

In some demonstrative aspects, the unit balancing analysis may be configured to balance instructions between VMP compute units, e.g., data processing units 316 (FIG. 3) as described below with reference to FIG. 3.

In some demonstrative aspects, the unit balancing analysis may be familiar with some or all available arithmetic transformations, and/or may perform transformations according to an optimal algorithm.

In some demonstrative aspects, BE 230 may be configured to perform one or more operations of a modulo scheduler (pipeliner) analysis, e.g., as described below.

In some demonstrative aspects, the pipeliner may be configured to schedule the instructions according to one or more constraints, e.g., data dependency, resource bottlenecks and/or any other constraints, for example, using Swing Modulo Scheduling (SMS) heuristics and/or any other additional and/or alternative heuristic.

In some demonstrative aspects, the pipeliner may be configured to schedule a set, e.g., an Initiation Interval (II), of Very Long Instruction Word (VLIW) instructions that the program will iterate on, e.g., during a steady state.

In some demonstrative aspects, a performance metric, which may be based on a number of cycles a typical loop may execute, may be measured, e.g., as follows:

In some demonstrative aspects, the pipeliner may try to minimize the II, e.g., as much as possible, for example, to improve performance.

In some demonstrative aspects, the pipeliner may be configured to calculate a minimum II, and to schedule accordingly. For example, if the pipeliner fails the scheduling, the pipeliner may try to increase the II and retry scheduling, e.g., until a predefined II threshold is violated.
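As a non-limiting sketch, the retry behavior described above may be expressed as a search over candidate II values. The callback type `TrySchedule`, the step of one II per retry, and the toy scheduler `schedules_at_or_above` are assumptions made for illustration and do not represent an actual scheduler.

```c
#include <assert.h>

/* Hypothetical scheduler callback: returns 1 if the loop can be scheduled
 * at initiation interval `ii`, 0 otherwise. */
typedef int (*TrySchedule)(int ii, void *ctx);

/* Starts at `min_ii` and increases the II by one after each failed attempt.
 * Returns the first II that schedules, or -1 once `max_ii` is exceeded. */
int find_initiation_interval(int min_ii, int max_ii, TrySchedule try_schedule, void *ctx) {
    assert(min_ii >= 1 && try_schedule != 0);
    for (int ii = min_ii; ii <= max_ii; ii++) {
        if (try_schedule(ii, ctx)) {
            return ii;
        }
    }
    return -1;
}

/* Toy stand-in for a real scheduler: succeeds once the II reaches *ctx. */
int schedules_at_or_above(int ii, void *ctx) {
    return ii >= *(const int *)ctx;
}

/* Example: minimum II of 2, but scheduling only succeeds from II = 4. */
int example_ii(int threshold) {
    int needed = 4;
    return find_initiation_interval(2, threshold, schedules_at_or_above, &needed);
}
```

With a threshold of 8 the search in this example settles on an II of 4. With a threshold of 3 it gives up, which corresponds to the case where the predefined II threshold is violated.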

In some demonstrative aspects, BE 230 may be configured to perform one or more operations of a register allocation analysis, e.g., as described below.

In some demonstrative aspects, the register allocation analysis may be configured to attempt to assign a register in an efficient, e.g., optimal, way.

In some demonstrative aspects, the register allocation analysis may assign values to bypass vector registers, general purpose vector registers, and/or scalar registers.

In some demonstrative aspects, the values may include private variables, constants, and/or values that are rotated across iterations.

In some demonstrative aspects, the register allocation analysis may implement an optimal heuristic that suits one or more VMP register file (regfile) constraints. For example, in some use cases, the register allocation analysis may not use a standard LLVM register allocation.

In some demonstrative aspects, in some cases, the register allocation analysis may fail, which may mean that the loop cannot be compiled. Accordingly, the register allocation analysis may implement a retry mechanism, which may go back to the modulo scheduler and may attempt to reschedule the loop, e.g., with an increased initiation interval. For example, increasing the initiation interval may reduce register pressure, and/or may support compilation of the vector loop, e.g., in many cases.

In some demonstrative aspects, BE 230 may be configured to perform one or more operations of an SSC configuration analysis, e.g., as described below.

In some demonstrative aspects, the SSC configuration analysis may be configured to set a configuration to execute the kernel, e.g., the AGU configuration.

In some demonstrative aspects, the SSC configuration analysis may be performed at a late stage, for example, due to configurations calculated after legalization, the register allocation analysis, and/or the modulo scheduling analysis.

In some demonstrative aspects, the SSC configuration analysis may include a Zero Overhead Loop (ZOL) mechanism in the vector loop. For example, the ZOL mechanism may configure a loop trip count based on an access pattern of the memory references in the loop, for example, to avoid running instructions that check the loop exit condition every iteration.

In some demonstrative aspects, a VMP Compilation Flow may include one or more, e.g., a few, steps, which may be invoked during the compilation flow in a test library (testlib), e.g., a wrapper script for compilation, execution, and/or program testing. For example, these steps may be performed outside of the LLVM Compiler.

In some demonstrative aspects, a PCB Hardware Description Language (PHDL) simulator may be implemented to perform one or more roles of an assembler, encoder, and/or linker.

In some demonstrative aspects, compiler 200 may be configured to provide a technical solution to support robustness, which may enable compilation of a vast selection of loops, with HW limitations. For example, compiler 200 may be configured to support a technical solution, which may not create verification errors.

In some demonstrative aspects, compiler 200 may be configured to provide a technical solution to support programmability, which may provide a user an ability to express code in multiple ways, which may compile correctly to the VMP architecture.

In some demonstrative aspects, compiler 200 may be configured to provide a technical solution to support an improved user-experience, which may allow the user capability to debug and/or profile code. For example, the improved user-experience may provide informative error messages, report tools, and/or a profiler.

In some demonstrative aspects, compiler 200 may be configured to provide a technical solution to support improved performance, for example, to optimize a VMP assembly code and/or iterator accesses, which may lead to a faster execution. For example, improved performance may be achieved through high utilization of the compute units and usage of its complex CISC.

Reference is made to FIG. 3, which schematically illustrates a vector processor 300, in accordance with some demonstrative aspects. For example, vector processor 180 (FIG. 1) may implement one or more elements of vector processor 300, and/or may perform one or more operations and/or functionalities of vector processor 300.

In some demonstrative aspects, vector processor 300 may include a Vector Microcode Processor (VMP).

In some demonstrative aspects, vector processor 300 may include a Wide Vector machine, for example, supporting Very Long Instruction Word (VLIW) architectures, and/or Single Instruction/Multiple Data (SIMD) architectures.

In some demonstrative aspects, vector processor 300 may be configured to provide a technical solution to support high performance for short integral types, which may be common, for example, in computer-vision and/or deep-learning algorithms.

In other aspects, vector processor 300 may include any other type of vector processor, and/or may be configured to support any other additional or alternative functionalities.

In some demonstrative aspects, as shown in FIG. 3, vector processor 300 may include a vector processing block (vector processor) 310, a scalar processor 330, and a Direct Memory Access (DMA) 340, e.g., as described below.

In some demonstrative aspects, as shown in FIG. 3, vector processing block 310 may be configured to process, e.g., efficiently process, image data and/or vector data. For example, the vector processing block 310 may be configured to use vector computation units, for example, to speed up computations.

In some demonstrative aspects, scalar processor 330 may be configured to perform scalar computations. For example, the scalar processor 330 may be used as a “glue logic” for programs including vector computations. For example, some, e.g., even most, of the computation of the programs may be performed by the vector processing block 310. However, several tasks, for example, some essential tasks, e.g., scalar computations, may be performed by the scalar processor 330.

In some demonstrative aspects, the DMA 340 may be configured to interface with one or more memory elements in a chip including vector processor 300.

In some demonstrative aspects, the DMA 340 may be configured to read inputs from a main memory, and/or write outputs to the main memory.

In some demonstrative aspects, the scalar processor 330 and the vector processing block 310 may use respective local memories to process data.

In some demonstrative aspects, as shown in FIG. 3, vector processor 300 may include a fetcher and decoder 350, which may be configured to control the scalar processor 330 and/or the vector processing block 310.

In some demonstrative aspects, operations of the scalar processor 330 and/or the vector processing block 310 may be triggered by instructions stored in a program memory 352.

In some demonstrative aspects, the DMA 340 may be configured to transfer data, for example, in parallel with the execution of the program instructions in memory 352.

In some demonstrative aspects, DMA 340 may be controlled by software, e.g., via configuration registers, for example, rather than instructions, and, accordingly, may be considered as a second “thread” of execution in vector processor 300.

In some demonstrative aspects, the scalar processor 330, the vector processing block 310, and/or the DMA 340 may include one or more data processing units, for example, a set of data processing units, e.g., as described below.

In some demonstrative aspects, the data processing units may include hardware configured to perform computations, e.g., an Arithmetic Logic Unit (ALU).

In one example, a data processing unit may be configured to add numbers, and/or to store the numbers in a memory.

In some demonstrative aspects, the data processing units may be controlled by commands, e.g., encoded in the program memory 352 and/or in configuration registers. For example, the configuration registers may be memory mapped, and may be written by the memory store commands of the scalar processor 330.

In some demonstrative aspects, the scalar processor 330, the vector processing block 310, and/or the DMA 340 may include a state configuration including a set of registers and memories, e.g., as described below.

In some demonstrative aspects, as shown in FIG. 3, vector processor block 310 may include a set of vector memories 312, which may be configured, for example, to store data to be processed by vector processor block 310.

In some demonstrative aspects, as shown in FIG. 3, vector processor block 310 may include a set of vector registers 314, which may be configured, for example, to be used in data processing by vector processor block 310.

In some demonstrative aspects, the scalar processor 330, the vector processing block 310, and/or the DMA 340 may be associated with a set of memory maps.

In some demonstrative aspects, a memory map may include a set of addresses accessible by a data processing unit, which may load and/or store data from/to registers and memories.

In some demonstrative aspects, as shown in FIG. 3, the vector processing block 310 may include a plurality of Address Generation Units (AGUs) 320, which may include addresses accessible to them, e.g., in one or more of memories 312.

In some demonstrative aspects, as shown in FIG. 3, vector processor block 310 may include a plurality of data processing units 316, e.g., as described below.

In some demonstrative aspects, data processing units 316 may be configured to process commands, e.g., including several numbers at a time. In one example, a command may include 8 numbers. In another example, a command may include 4 numbers, 16 numbers, or any other count of numbers.

In some demonstrative aspects, two or more data processing units 316 may be used simultaneously. In one example, data processing units 316 may process and execute a plurality of different commands, e.g., 3 different commands, for example, including 8 numbers, at a throughput of a single cycle.

In some demonstrative aspects, data processing units 316 may be asymmetrical. For example, first and second data processing units 316 may support different commands. For example, addition may be performed by a first data processing unit 316, and/or multiplication may be performed by a second data processing unit 316. For example, both operations may be performed by one or more additional other data processing units 316.

In some demonstrative aspects, data processing units 316 may be configured to support arithmetic operations for many combinations of input & output data types.

In some demonstrative aspects, data processing units 316 may be configured to support one or more operations, which may be less common. For example, processing units 316 may support operations working with a Look Up Table (LUT) of vector processor 300, and/or any other operations.

In some demonstrative aspects, data processing units 316 may be configured to support efficient computation of non-linear functions, histograms, and/or random data access, e.g., which may be useful to implement algorithms like image scaling, Hough transforms, and/or any other algorithms.

In some demonstrative aspects, vector memories 312 may include, for example, memory banks having a size of 16K or any other size, which may be accessed at a same cycle.

In one example, a maximal memory access size may be 64 bits. According to this example, a peak throughput may be 256 bits, e.g., 64×4=256. For example, high memory bandwidth may be implemented to utilize computation capabilities of the data processing units 316.

In one example, two data processing units 316 may support 16 8-bit multiply & accumulate operations (MACs) per cycle. According to this example, the two data processing units 316 may not be useful, for example, in case the input numbers are not fetched at this speed, and/or there isn't exactly 256 bits of input, e.g., 16×8×2=256.

320 314 In some demonstrative aspects, AGUsmay be configured to perform memory access operations, e.g., loading and storing data from/to vector memories.

320 316 In some demonstrative aspects, AGUsmay be configured to compute addresses of input and output data items, for example, to handle I/O to utilize the data processing units, e.g., in case sheer bandwidth is not enough.

320 330 In some demonstrative aspects, AGUsmay be configured to compute the addresses of the input and/or output data items, for example, based on configuration registers written by the scalar processor, for example, before a block of vector commands, e.g., a loop, is entered.

For example, AGUs 320 may be configured to write an image base pointer, a width, a height and/or a stride to the configuration registers, for example, in order to iterate over an image.

In some demonstrative aspects, AGUs 320 may be configured to handle addressing, e.g., all addressing, for example, to provide a technical solution in which data processing units 316 may not have the burden of incrementing pointers or counters in a loop, and/or the burden to check for end-of-row conditions, e.g., to zero a counter in the loop.
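The division of labor described above can be illustrated with a small software model in C. This is a sketch under stated assumptions, not the hardware described in the source: the type agu_model, its fields, and the functions agu_configure, agu_next and sum_region are hypothetical names, and addresses are modeled as byte offsets into a flat array. The configuration holds a base, a width, a height and a stride; the address generator owns the counters and the end-of-row check, so the compute code increments nothing itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one address generator. */
typedef struct {
    size_t base;    /* offset of the first element (image base pointer) */
    size_t width;   /* elements per row */
    size_t height;  /* number of rows */
    size_t stride;  /* distance between the starts of consecutive rows */
    size_t x, y;    /* internal counters, hidden from the compute code */
} agu_model;

/* Written once before the loop is entered, like configuration registers. */
static void agu_configure(agu_model *a, size_t base, size_t width,
                          size_t height, size_t stride)
{
    a->base = base;
    a->width = width;
    a->height = height;
    a->stride = stride;
    a->x = 0;
    a->y = 0;
}

/* Produces the next address in *addr and returns 1, or returns 0 once the
 * configured region is exhausted (the modeled termination signal). */
static int agu_next(agu_model *a, size_t *addr)
{
    if (a->width == 0 || a->y >= a->height)
        return 0;
    *addr = a->base + a->y * a->stride + a->x;
    if (++a->x == a->width) {   /* end-of-row handled here, not by the caller */
        a->x = 0;
        a->y++;
    }
    return 1;
}

/* Compute code: no pointer arithmetic, no counters, no end-of-row test. */
static long sum_region(const uint8_t *mem, agu_model *a)
{
    long sum = 0;
    size_t addr;
    while (agu_next(a, &addr))
        sum += mem[addr];
    return sum;
}
```

In this model the loop in sum_region exits when agu_next reports that the region is finished, which loosely mirrors a loop whose exit condition is signaled by the address generator rather than tested by the compute code.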

In some demonstrative aspects, as shown in FIG. 3, AGUs 320 may include 4 AGUs, and, accordingly, four memories 312 may be accessed at a same cycle. In other aspects, any other count of AGUs 320 may be implemented.

In some demonstrative aspects, AGUs 320 may not be “tied” to memory banks 312. For example, an AGU 320, e.g., each AGU 320, may access a memory bank 312, e.g., every memory bank 312, for example, as long as two or more AGUs 320 do not try to access the same memory bank 312 at the same cycle.
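The access constraint described above can be stated as a simple check. The following C sketch is illustrative only: the function name banks_conflict_free and the representation of a cycle as an array of bank indices, one per AGU, are assumptions made for this example and are not taken from the source.

```c
#include <stddef.h>

/* Returns 1 if no two AGUs target the same memory bank in this cycle,
 * and 0 if at least two AGUs collide on one bank.
 * bank_of_agu[i] is the bank index that AGU i wants to access. */
static int banks_conflict_free(const unsigned *bank_of_agu, size_t num_agus)
{
    for (size_t i = 0; i < num_agus; i++) {
        for (size_t j = i + 1; j < num_agus; j++) {
            if (bank_of_agu[i] == bank_of_agu[j])
                return 0;
        }
    }
    return 1;
}
```

Any assignment of AGUs to banks passes the check as long as the bank indices are pairwise distinct, which reflects the statement that AGUs are not tied to particular banks.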

In some demonstrative aspects, vector registers 314 may be configured to support communication between the data processing units 316 and AGUs 320.

In one example, a total number of vector registers 314 may be 28, which may be divided into several subsets, e.g., based on their function. For example, a first subset of vector registers 314 may be used for inputs/outputs, e.g., of all data processing units 316 and/or AGUs 320; and/or a second subset of vector registers 314 may not be used for outputs of some operations, e.g., most operations, and may be used for one or more other operations, e.g., to store loop-invariant inputs.

In some demonstrative aspects, a data processing unit 316, e.g., each data processing unit 316, may have one or more registers to host an output of a last executed operation, e.g., which may be fed as inputs to other data processing units 316. For example, these registers may “bypass” the vector registers 314, and may work faster than writing these outputs to the first subset of vector registers 314.

In some demonstrative aspects, fetcher and decoder 350 may be configured to support low-overhead vector loops, e.g., very low overhead vector loops (also referred to as “zero-overhead vector loops”), for example, where there may be no need to check a termination (exit) condition of a vector loop during an execution of the vector loop.

For example, a termination (exit) condition may be signaled by an AGU 320, for example, when the AGU 320 finishes iterating over a configured memory region.

For example, fetcher and decoder 350 may quit the loop, for example, when the AGU 320 signals the termination condition.

For example, the scalar processor 330 may be utilized to configure the loop parameters, e.g., first & last instructions and/or the exit condition.

In one example, vector loops may be utilized, for example, together with high memory bandwidth and/or cheap addressing, for example, to solve a control and data flow problem, for example, to provide a technical solution to allow the data processing units 316 to process data, e.g., substantially without additional overhead.

In some demonstrative aspects, scalar processor 330 may be configured to provide one or more functionalities, which may be complementary to those of the vector processing block 310. For example, a large portion, e.g., most, of the work in a vector program may be performed by the data processing units 316. For example, the scalar processor 330 may be utilized, for example, for “gluing” together the various blocks of vector code of the vector program.

In some demonstrative aspects, scalar processor 330 may be implemented separately from vector processing block 310. In other aspects, scalar processor 330 may be configured to share one or more components and/or functionalities with vector processing block 310.

In some demonstrative aspects, scalar processor 330 may be configured to perform operations, which may not be suitable for execution on vector processing block 310.

For example, scalar processor 330 may be utilized to execute 32-bit C programs. For example, scalar processor 330 may be configured to support 1, 2, and/or 4 byte data types of C code, and/or some or all arithmetic operators of C code.

For example, scalar processor 330 may be configured to provide a technical solution to perform operations that cannot be executed on vector processing block 310, for example, without using a full-blown CPU.

In some demonstrative aspects, scalar processor 330 may include a scalar data memory 332, e.g., having a size of 16K or any other size, which may be configured to store data, e.g., variables used by the scalar parts of a program.

For example, scalar processor 330 may store local and/or global variables declared by portable C code, which may be allocated to scalar data memory by a compiler, e.g., compiler 200 (FIG. 2).

In some demonstrative aspects, as shown in FIG. 3, scalar processor 330 may include, or may be associated with, a set of vector registers 334, which may be used in data processing performed by the scalar processor 330.

In some demonstrative aspects, scalar processor 330 may be associated with a scalar memory map, which may support scalar processor 330 in accessing substantially all states of vector processor 300. For example, the scalar processor 330 may configure the vector units and/or the DMA channels via the scalar memory map.

In some demonstrative aspects, scalar processor 330 may not be allowed to access one or more block control registers, which may be used by external processors to run and debug vector programs.

In some demonstrative aspects, DMA 340 may be configured to communicate with one or more other components of a chip implementing the vector processor 300, for example, via main memory. For example, DMA 340 may be configured to transfer blocks of data, e.g., large, contiguous, blocks of data, for example, to support the scalar processor 330 and/or the vector processing block, which may manipulate data stored in the local memories. For example, a vector program may be able to read data from the main chip memory using DMA 340.

In some demonstrative aspects, DMA 340 may be configured to communicate with other elements of the chip, for example, via a plurality of DMA channels, e.g., 8 DMA channels or any other count of DMA channels. For example, a DMA channel, e.g., each DMA channel, may be capable of transferring a rectangular patch from the local memories to the main chip memory, or vice versa. In other aspects, the DMA channel may transfer any other type of data block between the local memories and the main chip memory.

In some demonstrative aspects, a rectangular patch may be defined by a base pointer, a width, a height, and a stride.

For example, at peak throughput, 8 bytes per cycle may be transferred; however, there may be overheads for each patch and/or for each row in a patch.
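A rectangular patch transfer of this kind can be sketched in C as follows. The descriptor type patch_desc and the functions copy_patch and min_transfer_cycles are hypothetical names introduced for illustration. The sketch models memories as byte arrays and the base pointer as a byte offset, and it assumes that source and destination patches have the same width and height. The cycle estimate is only a lower bound derived from the 8 bytes per cycle figure; the per-patch and per-row overheads mentioned above are not quantified in the text and are therefore not modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor of a rectangular patch inside a flat memory. */
typedef struct {
    size_t base;    /* byte offset of the first row */
    size_t width;   /* bytes per row */
    size_t height;  /* number of rows */
    size_t stride;  /* byte distance between the starts of consecutive rows */
} patch_desc;

/* Copies a patch row by row. Assumes src and dst describe patches of the
 * same width and height; only the source dimensions are consulted. */
static void copy_patch(uint8_t *dst_mem, const patch_desc *dst,
                       const uint8_t *src_mem, const patch_desc *src)
{
    for (size_t row = 0; row < src->height; row++) {
        memcpy(dst_mem + dst->base + row * dst->stride,
               src_mem + src->base + row * src->stride,
               src->width);
    }
}

/* Lower bound on transfer cycles at a peak of 8 bytes per cycle.
 * Per-patch and per-row overheads are deliberately ignored. */
static size_t min_transfer_cycles(const patch_desc *p)
{
    size_t bytes = p->width * p->height;
    return (bytes + 7) / 8;
}
```

Because the stride is independent of the width, the same descriptor can pick a small window out of a wider image in main memory and pack it densely into a local memory.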

In some demonstrative aspects, DMA 340 may be configured to transfer data, for example, in parallel with computations, e.g., via the plurality of DMA channels, for example, as long as executed commands do not access a local memory involved in the transfer.

In one example, as all channels may access the same memory bus, using several channels to implement a transfer may not save I/O cycles, e.g., compared to the case when a single channel is used. However, the plurality of DMA channels may be utilized to schedule several transfers and execute them in parallel with computations. This may be advantageous, for example, compared to a single channel, which may not allow scheduling a second transfer before completion of the first transfer.

In some demonstrative aspects, DMA 340 may be associated with a memory map, which may support the DMA channels in accessing vector memories and/or the scalar data. For example, access to the vector memories may be performed in parallel with computations. For example, access to the scalar data may usually not be allowed in parallel, e.g., as the scalar processor 330 may be involved in almost any sensible program, and may likely access its local variables while the transfer is performed, which may lead to a memory contention with the active DMA channel.

In some demonstrative aspects, DMA 340 may be configured to provide a technical solution to support parallelization of I/O and computations. For example, a program performing computations may not have to wait for I/O, for example, in case these computations may be run fast by vector processing block 310.

In some demonstrative aspects, an external processor, e.g., a CPU, may be configured to initiate execution of a program on vector processor 300. For example, vector processor 300 may remain idle, e.g., as long as program execution is not initiated.

In some demonstrative aspects, the external processor may be configured to debug the program, e.g., execute a single step at a time, halt when the program reaches breakpoints, and/or inspect contents of registers and memories storing the program variables.

In some demonstrative aspects, an external memory map may be implemented to support the external processor in controlling the vector processor 300 and/or debugging the program, for example, by writing to control registers of the vector processor 300.

In some demonstrative aspects, the external memory map may be implemented by a superset of the scalar memory map. For example, this implementation may make all registers and memories defined by the architecture of the vector processor 300 accessible to a debugger back-end running on the external processor.

In some demonstrative aspects, the vector processor 300 may raise an interrupt signal, for example, when the vector processor 300 terminates a program.

In some demonstrative aspects, the interrupt signal may be used, for example, to implement a driver to maintain a queue of programs scheduled for execution by the vector processor 300, and/or to launch a new program, e.g., by the external processor, for example, upon the completion of a previously executed program.

Referring back to FIG. 1, in some demonstrative aspects, compiler 160 may be configured to generate the target code 115 based on one or more loops, which may be based, for example, on source code 112, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the target code 115 based on one or more loops, which may be configured, for example, according to a loop-execution scheme, e.g., as described below.

In some demonstrative aspects, the loop-execution scheme may be configured to provide a technical solution to support one or more vector processing architectures, for example, VLIW architectures and/or any other architectures, e.g., as described below.

In some demonstrative aspects, the loop-execution scheme may be configured to provide a technical solution to support improvement of one or more types of loop nests, for example, imperfect-loop-nests, e.g., as described below.

In some demonstrative aspects, a loop nest may include at least an outer loop and an inner loop, e.g., as described below.

In some demonstrative aspects, the loop nest may include an outer loop, e.g., an outermost loop, an inner loop, e.g., an innermost loop, and one or more nested loops (also referred to as “intermediate nested loops” or “intermediate loops”), which may be nested between the outer loop and the inner loop, e.g., as described below.

In some demonstrative aspects, the loop nest may include a plurality of loops nested in a plurality of nest levels, e.g., as described below.

In one example, the plurality of loops may include a first loop, e.g., an outer loop, for example, in a first nest level, and a second loop, e.g., an inner loop, for example, in a second nest level.

In one example, the plurality of loops may include one or more intermediate loops, for example, in one or more intermediate nest levels, e.g., between the first nest level and the second nest level.

In one example, the plurality of loops may include three loops in three nest levels. For example, the three loops may include a first loop, e.g., an outermost loop, at a first nest level; a second loop, e.g., an intermediate loop, at a second nest level; and a third loop, e.g., an innermost loop, at a third nest level. For example, the second loop may be nested in the first loop, and the third loop may be nested in the second loop.

In some demonstrative aspects, there may be a need to provide a technical solution to efficiently transform imperfect-loop-nests into perfect loop nests, for example, to improve performance of an executable program, for example, when executed by a processor, for example, a vector processor or any other target processor, e.g., as described below.

In some demonstrative aspects, a perfect loop-nest may be configured to include a loop nest, in which all compute operations of the loop-nest reside in an inner-most loop of the perfect loop-nest.

For example, outer-loops of the perfect loop-nest may not include any compute instructions.

For example, all the compute instructions of the perfect loop-nest may be in the most inner loop of the perfect loop-nest.

In one example, one or more processor architectures may require using perfect loop nests in a program and/or may benefit from using the perfect loop nests in the program.

In another example, the perfect loop-nests may be amenable to one or more, e.g., many more, loop-optimizations.

In another example, one or more scheduling schemes, e.g., modulo-scheduling, which may be a critical optimization for VLIW targets, may not be able to optimize code across loop-levels/basic-blocks, for example, when perfect loop-nests are not used.

In another example, one or more processor architectures may support only the perfect loop nests. For example, these architectures may rely on being able to perfectize loop-nests into perfect loops.

In some demonstrative aspects, compiler 160 may be configured to generate the target code 115 based on one or more loops, which may be configured, for example, according to a loop-execution scheme, which may be configured to provide a technical solution to improve performance of a program executed by a target processor, for example, a vector processor, for example, by transforming imperfect-loop-nests into perfect loop nests, e.g., as described below.

In some demonstrative aspects, the loop-execution scheme may be configured to provide a technical solution to improve performance of a program executed by a target processor, for example, a vector processor, for example, by efficiently transforming imperfect-loop-nests into collapsed loops, e.g., as described below.

In some demonstrative aspects, a collapsed loop of a perfect loop nest may be configured to include a single-basic-block loop including all nested loops of the perfect loop, e.g., as described below.

In some demonstrative aspects, execution of the single-basic-block loop may be configured in advance, for example, along different dimensions, e.g., which may correspond to the original loops in the original loop nest.

For example, execution of the collapsed loop may be configured in advance, e.g., by control hardware (“HW controlled”), e.g., as described below.

In one example, one or more processor architectures may support only collapsed loops. For example, these processor architectures may rely on an ability to collapse and/or perfectize loop-nests into single-basic-block and/or perfect loops.

In some demonstrative aspects, compiler 160 may be configured to generate the target code 115 based on one or more loops, which may be configured, for example, according to a loop-execution scheme, which may be configured to provide a technical solution to support transformation of imperfect-loop-nests into perfect loop nests, for example, to improve performance of an executable program, e.g., as described below.

In some demonstrative aspects, the loop-execution scheme may be configured to provide a technical solution to efficiently transform imperfect-loop-nests into collapsed loops, for example, by transforming perfect loop nests into collapsed loops, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the target code 115 based on a compilation scheme, which may be configured, for example, to provide a technical solution to support computing one or more predicates, which may be utilized for the transformation of imperfect-loop-nests into perfect loop nests and/or for the transformation of imperfect-loop-nests into collapsed loops, e.g., as described below.

In some demonstrative aspects, a predicate may be configured to indicate, identify, affirm, predict, and/or assert a start of a loop and/or an end of a loop, e.g., a first iteration or a last iteration of the loop.

In some demonstrative aspects, the predicate may be configured to identify the start of a loop and/or the end of a loop, for example, even without handling and/or maintaining an induction variable, e.g., as described below.

In one example, one or more predicates may be utilized to indicate a start and/or an end of execution of one or more inner-loops nested in an original loop nest, e.g., as described below.

In one example, it may be important to efficiently compute the predicates, for example, in cases where loop perfectization and/or loop collapsing rely on predication, e.g., as described below.

In another example, it may be important to efficiently compute the predicates, for example, to support processor architectures, which may not be able to efficiently compute induction-variables, e.g., as described below.

In some demonstrative aspects, the loop-execution scheme may be configured to provide a technical solution to support processor architectures, which may not support predicated instructions for non-memory-access operations.

In some demonstrative aspects, the loop-execution scheme may be configured to provide a technical solution to support computing predicates, for example, to support transformation of loop nests into perfect loop nests and/or collapsed loops, e.g., as described below.

In some demonstrative aspects, the loop-execution scheme may be configured to provide a technical solution to support executing programs more efficiently, for example, while avoiding a need to compute an induction-variable based predicate, e.g., as described below.

In some demonstrative aspects, the loop-execution scheme may be configured to provide a technical solution to support one or more architectures, which do not have predicated operations in hardware and/or have loop-nests controlled by hardware, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to identify one or more loop nests based on source code, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to identify one or more of the loop nests in source code 112, e.g., in case the loop nests are included in source code 112.

In some demonstrative aspects, compiler 160 may be configured to identify one or more of the loop nests in code, e.g., middle-end code, which may be compiled from the source code 112.

In some demonstrative aspects, compiler 160 may be configured to transform one or more identified loop nests into one or more perfect loop-nests, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to compile the source code into the target code 115, for example, such that target code 115 may be based on the one or more perfect loop-nests, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to transform the one or more identified loop nests into the one or more perfect loop-nests, for example, according to a loop-perfectization scheme, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to move one or more outer instructions from outer loop-levels of a loop nest into an inner loop, e.g., an innermost loop, of the loop nest, for example, while using one or more predicates to guard the execution of the outer instructions that were moved into the inner loop, e.g., as described below.

In some demonstrative aspects, the predicates may be configured to check and/or represent a state of an induction-variable, which counts a number of iterations of the inner-loop, e.g., as described below.

In some demonstrative aspects, a predicate (also referred to as a “loop-start predicate”) may be configured to identify when the induction-variable may be equal to the start of a respective inner-loop, e.g., for an instruction (also referred to as “sunk instruction”) moved from before the inner-loop into the inner-loop, e.g., as described below.

In some demonstrative aspects, a predicate (also referred to as a “loop-end predicate”) may be configured to identify when the induction-variable may be equal to the end of a respective inner-loop, e.g., for an instruction (also referred to as “hoist instruction”) moved from after the inner-loop into the inner-loop, e.g., as described below.
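The sink and hoist transformations guarded by loop-start and loop-end predicates can be illustrated on a two-level loop nest. The following C sketch is not taken from the source: the row-sum computation and the function names are chosen for illustration, and the predicates are written as plain comparisons on the inner induction variable x. The two functions produce the same output only under the assumption that width is at least 1, since with an empty inner loop the guarded instructions would never execute.

```c
/* Imperfect loop nest: the outer loop holds one instruction before the
 * inner loop (pre-header position) and one after it (latch position). */
static void row_sums_imperfect(const int *in, int *out, int height, int width)
{
    for (int y = 0; y < height; y++) {
        int acc = 0;                       /* before the inner loop */
        for (int x = 0; x < width; x++) {
            acc += in[y * width + x];
        }
        out[y] = acc;                      /* after the inner loop */
    }
}

/* Perfect loop nest: both outer instructions now live in the inner loop
 * and are guarded by predicates. Assumes width >= 1. */
static void row_sums_perfect(const int *in, int *out, int height, int width)
{
    int acc = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int loop_start = (x == 0);          /* loop-start predicate */
            int loop_end = (x == width - 1);    /* loop-end predicate */
            if (loop_start)
                acc = 0;                        /* sunk instruction */
            acc += in[y * width + x];
            if (loop_end)
                out[y] = acc;                   /* hoisted instruction */
        }
    }
}
```

After the transformation, the outer loop body contains nothing but the inner loop, which is the property the text calls a perfect loop nest.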

In some demonstrative aspects, compiler 160 may be configured to move all instructions from outer loop-levels (nest levels) of the loop nest into the innermost loop of the loop nest, for example, to transform the loop nest into a perfect loop nest, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to identify one or more outer-loop instructions of an outer-loop of a loop nest, which is outer to an inner loop of the loop nest, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to move an outer-loop instruction into the inner loop of the loop nest, for example, based on a location-based criterion relating to a location of the outer-loop instruction with respect to the inner loop, e.g., as described below.

In some demonstrative aspects, the location-based criterion may be used to identify whether the outer-loop instruction is before the inner loop (“a pre-header instruction”) or after the inner loop (“a latch instruction”), e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to transform an outer-loop instruction into a conditional instruction within the inner loop of the loop nest, e.g., as described below.

For example, the conditional instruction may be configured based on the location-based criterion, e.g., as described below.

In some demonstrative aspects, the conditional instruction may be configured, for example, based on a predicate to indicate, identify, affirm, predict, and/or assert a count of iterations of the inner loop, e.g., as described below.

In some demonstrative aspects, the predicate may be utilized to indicate, identify, affirm, predict, and/or assert whether the inner loop is at a first iteration of the inner loop or at a last iteration of the inner loop, e.g., as described below.

In some demonstrative aspects, the conditional instruction may be configured as a memory access operation, which may be based, for example, on the predicate on the count of iterations of the inner loop, e.g., as described below.

In some demonstrative aspects, the outer-loop instruction may include a load operation, and the memory access operation may be configured to perform the load operation, for example, based on the predicate on the count of iterations of the inner loop, e.g., as described below.

In some demonstrative aspects, the outer-loop instruction may include a store operation, and the memory access operation may be configured to perform the store operation, for example, based on the predicate on the count of iterations of the inner loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to sink an instruction, for example, by moving into an inner loop an instruction to be performed before a first iteration of the inner loop, e.g., as described below.

In some demonstrative aspects, a pre-header instruction may be sunk, for example, by moving the pre-header instruction into the inner loop, and transforming the pre-header instruction into a pre-header conditional instruction, e.g., as described below.

In some demonstrative aspects, the pre-header conditional instruction may include a condition to configure a result of the pre-header conditional instruction based on a predicate, for example, on the inner loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may generate the target code 115 based on compiled code, which may be configured to configure a particular result of the pre-header conditional instruction, for example, when the predicate identifies that execution of the inner loop is before the first iteration of the inner loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to hoist an instruction, for example, by moving into an inner loop an instruction to be performed after a last iteration of the inner loop, e.g., as described below.

In some demonstrative aspects, a latch instruction may be hoisted, for example, by moving the latch instruction into the inner loop, and transforming the latch instruction into a latch conditional instruction (also referred to as “hoisted conditional instruction”), e.g., as described below.

In some demonstrative aspects, the latch conditional instruction may include a condition to configure a result of the latch conditional instruction based on a predicate, for example, on the inner loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may generate the target code 115 based on compiled code, which may be configured to configure a particular result of the latch conditional instruction, for example, when the predicate identifies that execution of the inner loop is after the last iteration of the inner loop.

In some demonstrative aspects, compiler 160 may be configured to repeatedly and/or iteratively perform the hoist and/or sink operations, for example, to iterate over all instructions in the loop nest, for example, until substantially all instructions are moved from outer loops into the innermost loops, e.g., as described below.

For example, the loop-execution scheme may be configured to provide a technical solution to support VMP iterators, which may work on perfect loop nests only. For example, the loop-execution scheme may result in a situation where there are no instructions between two subsequent “for” statements that compose the loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to compile source code 112 into target code 115, for example, by transforming one or more loop nests into collapsed loops, e.g., as described below.

In some demonstrative aspects, the one or more loop nests may be transformed into collapsed loops, for example, to provide a technical solution to improve performance of execution of a program by a target processor, for example, a vector processor and/or any other processor, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to transform one or more identified loop nests into collapsed loops, for example, according to a loop-collapsing scheme, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to apply the loop-collapsing scheme, for example, based on a result of the loop-perfectization scheme, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to apply the loop-collapsing scheme, for example, even without performing the loop-perfectization scheme, for example, when the loop-perfectization scheme is unnecessary, e.g., when an input loop nest includes a perfect loop nest.

In one example, compiler 160 may be configured to apply the loop-collapsing scheme, for example, to provide target code 115 configured for execution by one or more processor architectures, which support only single-basic-block loops.

In some demonstrative aspects, compiler 160 may be configured to collapse a plurality of individual loops of a loop nest into a single loop, which may be configured, for example, to execute substantially all the iterations of the original loop nest, e.g., as described below.

In some demonstrative aspects, execution of the collapsed loop may be configured in advance, for example, to configure advancement along dimensions of the original loop.

In some demonstrative aspects, the loop-collapsing scheme may be configured to provide a technical solution to support execution of target code 115 by one or more processor architectures, including processor architectures in which computation of induction variables may be computationally expensive.

In some demonstrative aspects, compiler 160 may be configured to identify a perfect loop nest including a plurality of nested loops, for example, based on the source code 112, e.g., as described below.

In some demonstrative aspects, the plurality of nested loops may correspond to a respective plurality of dimensions, e.g., as described below.

In some demonstrative aspects, a dimension of a nested loop may be executed during a number of iterations of the nested loops, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to configure a collapsed loop based on the loop nest, for example, by collapsing the plurality of nested loops into a single loop, for example, based on the plurality of dimensions, e.g., as described below.

In some demonstrative aspects, compiler 160 may compile a source code 112 of a program to be executed by a target processor, e.g., processor 180, as described below.

For example, compiler 160 may identify, for example, based on source code 112, a loop nest including an outer loop (y loop) along a dimension based on a variable height, a nested loop (x loop) along a dimension based on a variable width, and an inner loop (z loop) along a dimension based on a variable area, e.g., as follows (Example 4):

for(int y = 0; y < height; y++) {
 char a;
 for(int x = 0; x < width; x++) {
  out1[y * width + x] = inp1[y * width + x] + 7;
  for(int z = 0; z < area; z++) {
   a = inp2[y * width * area + x * area + z];
  }
 }
 out2[y] = a;
}

For example, as shown by Example 4, the nested loop may include a pre-header instruction, e.g., out1[y*width+x]=inp1[y*width+x]+7, which may reside before the header of the inner loop (z loop).

For example, as shown by Example 4, the pre-header instruction (out1[y*width+x]=inp1[y*width+x]+7) may be executed, for example, before execution of the inner loop begins.

For example, as shown by Example 4, the pre-header instruction (out1[y*width+x]=inp1[y*width+x]+7) may include an outer load operation, e.g., inp1[y*width+x], and an outer store operation, e.g., “out1[y*width+x]=”, which may be executed, for example, every time before execution of the inner loop begins.

In some demonstrative aspects, compiler 160 may be configured to sink the pre-header instruction into the inner loop, for example, to generate a perfect loop nest, e.g., as described below.

For example, compiler 160 may be configured to move the pre-header instruction (out1[y*width+x]=inp1[y*width+x]+7) into the inner loop, and to transform the pre-header instruction (out1[y*width+x]=inp1[y*width+x]+7) into a conditional pre-header instruction (also referred to as a “sunk conditional instruction”), e.g., as described below.

For example, as shown by Example 4, the outer loop may include a latch instruction, e.g., out2[y]=a, which may be after the nested loop and the inner loop.

For example, as shown by Example 4, the latch instruction may include an outer store operation, e.g., out2[y]=a.

For example, as shown by Example 4, the latch instruction (out2[y]=a) may be executed, for example, every time after the last iteration of the inner-loop and the nested-loop.

In some demonstrative aspects, compiler 160 may be configured to hoist the latch instruction (out2[y]=a) into the inner loop, for example, to generate a perfect loop nest, e.g., as described below.

For example, compiler 160 may be configured to move the latch instruction (out2[y]=a) into the inner loop, and to transform the latch instruction (out2[y]=a) into a conditional latch instruction (also referred to as a “hoisted conditional instruction”), which may be based on the latch instruction, e.g., as described below.

For example, as shown by Example 4, the inner loop may include an inner load instruction, e.g., a=inp2[y*width*area+x*area+z], which may be within the inner loop.

In some demonstrative aspects, compiler 160 may be configured to transform the perfect loop nest into a collapsed loop along a dimension, which may be based, for example, on the value height, on the value width, and on the value area, e.g., as follows:

for(int ind = 0; ind < height * width * area; ind++) {
  char val = inp1[inp1_ind];
  char result = val + 7;
  if (first_iteration_of_z_loop)
    out1[out1_ind] = result;
  char a = inp2[inp2_ind];
  if(last_iteration_of_x_and_z_loops)
    out2[out2_ind] = a;
}

In some demonstrative aspects, as shown by Example 5, the collapsed loop may include a single block including instructions based on all the instructions of Example 4.

In some demonstrative aspects, as shown by Example 5, load and store operations in the pre-header instruction out1[y*width+x]=inp1[y*width+x]+7, may be transformed into a conditional pre-header instruction “if (first_iteration_of_z_loop) out1[out1_ind]=result”.

In some demonstrative aspects, as shown by Example 5, the conditional pre-header instruction may include a condition based on a predicate, which may indicate, identify, affirm, predict, and/or assert a start of the inner-loop.

In some demonstrative aspects, as shown by Example 5, the conditional pre-header instruction may be executed, for example, only when the predicate (first_iteration_of_z_loop) is true, e.g., only when the inner loop starts to execute.

In some demonstrative aspects, as shown by Example 5, the latch instruction out2[y]=a may be transformed into a conditional latch instruction “if(last_iteration_of_x_and_z_loops) out2[out2_ind]=a”.

In some demonstrative aspects, as shown by Example 5, the conditional latch instruction may include a condition based on a predicate, which may indicate, identify, affirm, predict, and/or assert the last iteration of the inner-loop and the last iteration of the nested loop.

In some demonstrative aspects, as shown by Example 5, the conditional latch instruction may be executed, for example, only when the predicate (last_iteration_of_x_and_z_loops) is true, e.g., only at the last iteration of both the inner loop and the nested loop.

In some demonstrative aspects, as shown by Example 5, the inner load instruction, e.g., a=inp2[y*width*area+x*area+z], may be transformed into a load instruction, e.g., char a=inp2[inp2_ind]. For example, as shown by Example 5, the inner load instruction (a=inp2[y*width*area+x*area+z]) may not be transformed into a conditional instruction.

In some demonstrative aspects, as shown by Example 5, indices of the load and store instructions may be computed, for example, as AGU parameters.

In some demonstrative aspects, as shown by Example 5, indices of the conditional latch instruction and/or the conditional pre-header instruction may be computed, for example, as AGU parameters, e.g., as described below.
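The relationship between the loop nest and the collapsed loop may be illustrated by a minimal sketch in C. The sketch assumes a loop-nest shape inferred from the description of Example 4 (a y loop over height, an x loop over width, and a z loop over area); the function names run_nested and run_collapsed are hypothetical. In the sketch the predicates are computed in software from the flat index, which is the induction-variable cost that the AGU-based approach is intended to avoid.

```c
/* Imperfect loop nest in the style the text attributes to Example 4.
 * The exact listing is not reproduced here, so this shape is an assumption. */
void run_nested(int height, int width, int area,
                const char *inp1, const char *inp2, char *out1, char *out2)
{
    for (int y = 0; y < height; y++) {
        char a = 0;
        for (int x = 0; x < width; x++) {
            out1[y * width + x] = (char)(inp1[y * width + x] + 7); /* pre-header */
            for (int z = 0; z < area; z++)
                a = inp2[y * width * area + x * area + z];         /* inner load */
        }
        out2[y] = a;                                               /* latch */
    }
}

/* Collapsed single-block loop with the predicates computed in software.
 * The division and modulo operations recover the loop counters from the
 * flat index; they stand in for the induction variables discussed above. */
void run_collapsed(int height, int width, int area,
                   const char *inp1, const char *inp2, char *out1, char *out2)
{
    for (int ind = 0; ind < height * width * area; ind++) {
        int z = ind % area;
        int x = (ind / area) % width;
        int y = ind / (area * width);
        int first_iteration_of_z_loop = (z == 0);
        int last_iteration_of_x_and_z_loops = (z == area - 1) && (x == width - 1);

        char val = inp1[y * width + x];
        char result = (char)(val + 7);
        if (first_iteration_of_z_loop)
            out1[y * width + x] = result;
        char a = inp2[ind];   /* ind equals y*width*area + x*area + z */
        if (last_iteration_of_x_and_z_loops)
            out2[y] = a;
    }
}
```

For positive values of height, width, and area, both functions are expected to write the same values into out1 and out2.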

In some demonstrative aspects, compiler 160 may be configured to provide a technical solution to support computing predicates, e.g., conditional instructions, which may be utilized for the transformation of outer-loop instructions into instructions of a collapsed loop, e.g., as described below.

In some demonstrative aspects, for example, in some use cases, implementations, and/or scenarios, transforming the outer-loop instructions using conditional store/load instructions may be inefficient.

In one example, the conditional store/load instructions may require an additional condition instruction.

In another example, the conditional store/load instructions may require maintaining one or more induction variables in the loop.

In some demonstrative aspects, compiler 160 may be configured to configure a collapsed loop, e.g., the collapsed loop of Example 5, for example, according to a predicate-based memory-access mechanism, which may be configured to support configuration of memory access operations, for example, based on one or more predicates, e.g., as described below.

In some demonstrative aspects, the predicate-based memory-access mechanism may be configured to provide a technical solution to support configuration of memory access operations, e.g., load instructions and/or store instructions, based on a start-of-loop predicate and/or an end-of-loop predicate, e.g., as described below.

In some demonstrative aspects, the predicate-based memory-access mechanism may be configured to provide a technical solution to support memory access operations based on the start-of-loop predicate and/or the end-of-loop predicate, for example, even without computing induction variables of one or more of the loops in the loop nest, e.g., as described below.

In some demonstrative aspects, the predicate-based memory-access mechanism may be configured to provide a technical solution to support memory access operations based on the start-of-loop predicate and/or the end-of-loop predicate, for example, at processor architectures, which may not support computation of induction variables.

In some demonstrative aspects, the predicate-based memory-access mechanism may be configured to provide a technical solution to support memory access operations based on the start-of-loop predicate and/or the end-of-loop predicate, for example, at processor architectures, in which it may be computationally expensive to compute induction variables, e.g., when loops are entirely configured in advance.

In some demonstrative aspects, compiler 160 may be configured to implement the conditional latch instruction (hoisted conditional instruction) and/or the conditional pre-header instruction (sunk conditional instruction) of Example 5, for example, by setting one or more AGU parameters corresponding to the conditional latch instruction and/or the conditional pre-header instruction, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to configure an AGU to perform a memory access operation, which may be performed at a start of an inner loop, or at an end of an inner loop, for example, based on an outer loop instruction, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to identify a loop nest based on a source code 112 to be compiled into a target code 150 to be executed by a target processor 180, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the target code 115 configured, for example, for execution by a target vector processor, for example, vector processor 180, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the target code 115 configured, for example, for execution by a Very Long Instruction Word (VLIW) Single Instruction/Multiple Data (SIMD) target processor, e.g., processor 180.

In other aspects, compiler 160 may be configured to generate the target code 115 configured, for example, for execution by any other suitable type of processor.

In some demonstrative aspects, compiler 160 may be configured to generate the target code 115, for example, based on the source code 112 including Open Computing Language (OpenCL) code.

In other aspects, compiler 160 may be configured to generate the target code 115, for example, based on the source code 112 including any other suitable type of code.

In some demonstrative aspects, compiler 160 may be configured to compile the source code 112 into the target code 115, for example, according to a Low Level Virtual Machine (LLVM) based (LLVM-based) compilation scheme.

In other aspects, compiler 160 may be configured to compile the source code 112 into the target code 115 according to any other suitable compilation scheme.

In some demonstrative aspects, compiler 160 may be configured to identify the loop nest in source code 112, e.g., in case the loop nest is included in source code 112.

In some demonstrative aspects, compiler 160 may be configured to identify the loop nest in code, e.g., middle-end code or any other code, which may be compiled from the source code 112.

In some demonstrative aspects, the loop nest may include a plurality of loops, for example, including at least an outer loop and an inner loop inside the outer loop, e.g., as described below.

In some demonstrative aspects, the plurality of loops may include at least a first loop, e.g., an outer loop, and a second loop, e.g., an inner loop, nested in the first loop, e.g., as described below.

In some demonstrative aspects, the first loop may include at least one first-loop instruction (“outer loop instruction”), which may be outside the second loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate AGU configuration code, for example, to configure an AGU of the target processor 180, for example, based on the first-loop instruction, e.g., as described below.

In some demonstrative aspects, the AGU configuration code may be configured to configure a first dimension of the AGU, for example, based on the first loop, e.g., as described below.

In some demonstrative aspects, the AGU configuration code may be configured to configure a second dimension of the AGU, for example, based on the second loop, e.g., as described below.

In some demonstrative aspects, the AGU configuration code may be configured to configure the second dimension of the AGU, for example, to configure a memory-access operation to be performed at a start of the second loop, or at an end of the second loop, e.g., as described below.

In some demonstrative aspects, the memory-access operation may be based, for example, on the first-loop instruction, e.g., as described below.

In some demonstrative aspects, the memory-access operation may include a load operation or a store operation, e.g., as described below.
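One way to picture how a per-dimension AGU configuration could restrict a memory-access operation to the start or the end of a loop is the following minimal software model in C. The model is an assumption made for illustration and is not taken from the description of any particular hardware: each dimension is assumed to keep a running offset that starts at zero and advances by the step parameter, and an access is assumed to be enabled only while every offset lies in the range from Min (inclusive) to Max (exclusive). The names agu, agu_dim, and agu_next are hypothetical.

```c
#include <stdbool.h>

#define AGU_MAX_DIMS 4

/* Hypothetical per-dimension parameters; count is the loop trip count. */
typedef struct {
    long min, max, step, count;
} agu_dim;

/* Hypothetical AGU state; dim[0] corresponds to the innermost loop. */
typedef struct {
    long base;
    int ndims;
    agu_dim dim[AGU_MAX_DIMS];
    long off[AGU_MAX_DIMS];   /* running offset per dimension */
    long iter[AGU_MAX_DIMS];  /* iteration counter per dimension */
} agu;

/* Reports whether the access is enabled for the current iteration, stores
 * the address it would use in *addr, and advances to the next iteration. */
bool agu_next(agu *g, long *addr)
{
    bool enabled = true;
    long a = g->base;

    for (int d = 0; d < g->ndims; d++) {
        a += g->off[d];
        if (g->off[d] < g->dim[d].min || g->off[d] >= g->dim[d].max)
            enabled = false;   /* offset out of bounds: access is masked */
    }
    *addr = a;

    /* Advance the innermost dimension and carry outward on wrap-around. */
    for (int d = 0; d < g->ndims; d++) {
        g->iter[d]++;
        g->off[d] += g->dim[d].step;
        if (g->iter[d] < g->dim[d].count)
            break;
        g->iter[d] = 0;
        g->off[d] = 0;
    }
    return enabled;
}
```

Under these assumptions, an inner dimension with Min of zero, and with Step and Max both equal to the entry size, enables the access only in the first iteration of the corresponding loop.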

In some demonstrative aspects, compiler 160 may be configured to generate target code 115, for example, based on compilation of the source code 112, e.g., as described below.

In some demonstrative aspects, the target code 115 may be based, for example, on the AGU configuration code, e.g., as described below.

In some demonstrative aspects, the plurality of loops may include a third loop, which may be, for example, nested in the first loop, e.g., as described below.

In some demonstrative aspects, the second loop may be nested in the third loop, e.g., as described below.

In some demonstrative aspects, the first-loop instruction may be outside the third loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to configure a third dimension of the AGU, for example, based on the third loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to configure the third dimension, for example, to configure the memory-access operation to be performed at the start of the second loop or at the end of the second loop, e.g., as described below.

In some demonstrative aspects, the third loop may include a third-loop instruction, which may be, for example, outside the second loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to configure an other AGU of the target processor, for example, based on the third-loop instruction, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to configure a first dimension of the other AGU, for example, based on the third loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to configure a second dimension of the other AGU, for example, based on the second loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to configure the second dimension of the other AGU, for example, to configure an other memory-access operation to be performed at the start of the second loop or at the end of the second loop, e.g., as described below.

In some demonstrative aspects, the other memory-access operation may be based, for example, on the third-loop instruction, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to transform the loop nest into a transformed loop including the memory-access operation, e.g., as described below.

In some demonstrative aspects, the target code 115 may be based, for example, on the transformed loop, e.g., as described below.

In some demonstrative aspects, the transformed loop may include the memory-access operation and the other memory access operation, e.g., as described below.

In some demonstrative aspects, the transformed loop may include a perfect flat loop, in which, for example, all compute operations of the loop nest are implemented in the transformed loop, e.g., as described below.

In some demonstrative aspects, the transformed loop may include a fully collapsed loop including, for example, only a single-basic-block loop, for example, based on the plurality of loops, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set a base parameter of the AGU, for example, based on a memory pointer of the first-loop instruction, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set a Maximum (Max) parameter of the second dimension of the AGU, for example, based on an entry size corresponding to the first-loop instruction, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the Max parameter of the second dimension of the AGU, and to set a Max parameter of the third dimension of the AGU, for example, based on the entry size corresponding to the first-loop instruction, for example, in case the second loop is nested in the third loop, e.g., as described below.

In some demonstrative aspects, the at least one first-loop instruction may include a pre-header instruction to be performed before a first iteration of the inner loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to configure the second dimension of the AGU, for example, to configure the memory-access operation to be performed only at the start of the second loop, for example, when the first-loop instruction includes the pre-header instruction, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the base parameter of the AGU to the memory pointer of the pre-header instruction, for example, when the first-loop instruction includes the pre-header instruction, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set a Minimum (Min) parameter of the second dimension of the AGU to zero, for example, when the first-loop instruction includes the pre-header instruction, e.g., as described below.

In some demonstrative aspects, the pre-header instruction may include a load operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to, for example, based on a determination that the pre-header instruction includes a load operation, configure the AGU configuration code to set a step parameter of the second dimension of the AGU to zero, e.g., as described below.

In some demonstrative aspects, the pre-header instruction may include a store operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to, for example, based on a determination that the pre-header instruction includes a store operation, configure the AGU configuration code to set the step parameter of the second dimension of the AGU, for example, based on an entry size corresponding to the pre-header instruction, e.g., as described below.
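The pre-header settings described above may be summarized, as a minimal sketch, by a small selection function in C. The sketch assumes that "based on an entry size" means the value is set equal to the entry size; the type and function names (mem_op, dim_params, preheader_dim) are hypothetical.

```c
/* Kind of memory access carried by the pre-header instruction. */
typedef enum { OP_LOAD, OP_STORE } mem_op;

/* Hypothetical container for the second-dimension (inner-loop) parameters. */
typedef struct {
    long min, max, step;
} dim_params;

/* Selects inner-loop dimension parameters for a sunk pre-header access.
 * Assumption: Max and the store step are set equal to the entry size. */
dim_params preheader_dim(mem_op op, long entry_size)
{
    dim_params p;
    p.min = 0;                                   /* Min is set to zero      */
    p.max = entry_size;                          /* Max from the entry size */
    p.step = (op == OP_LOAD) ? 0 : entry_size;   /* load: 0, store: entry   */
    return p;
}
```

With a zero step, a load keeps reading the same entry in every inner iteration; with a step equal to the entry size, a store stays within the Min/Max range only in the first inner iteration.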

In some demonstrative aspects, the at least one first-loop instruction may include a latch instruction to be performed after a last iteration of the second-loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to configure the second dimension of the AGU, for example, to configure the memory-access operation to be performed only at the end of the second loop, for example, when the first-loop instruction includes the latch instruction, e.g., as described below.

In some demonstrative aspects, the latch instruction may include a load operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the base parameter of the second dimension of the AGU, for example, to the memory pointer of the latch instruction, for example, when the latch instruction includes the load operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set a Min parameter of the second dimension of the AGU to zero, for example, when the latch instruction includes the load operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the Max parameter of the second dimension of the AGU, for example, to an entry size corresponding to the latch instruction, for example, when the latch instruction includes the load operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the step parameter of the second dimension of the AGU to zero, for example, when the latch instruction includes the load operation, e.g., as described below.

In some demonstrative aspects, the latch instruction may include a store operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the base parameter of the AGU, for example, based on a first parameter value, a second parameter value, and a third parameter value, for example, when the latch instruction includes the store operation, e.g., as described below.

In some demonstrative aspects, the first parameter value may include the entry size corresponding to the latch instruction, e.g., as described below.

In some demonstrative aspects, the second parameter value may include a total count of iterations over one or more loops, which are in the first loop and include the second loop, e.g., as described below.

In some demonstrative aspects, the third parameter value may include a count of dimensions of the AGU corresponding to the one or more loops, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the base parameter, denoted Base, of the AGU, e.g., as follows:

wherein OrigBase denotes a memory pointer of the latch instruction, wherein EntrySize denotes the entry size corresponding to the latch instruction, wherein [Σ TripCount(L)] denotes the total count of iterations over the one or more loops, which are in the first loop and include the second loop, and wherein #InnerDims denotes the count of dimensions of the AGU corresponding to the one or more loops, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the step parameter of the second dimension of the AGU, for example, based on the entry size corresponding to the latch instruction, for example, when the latch instruction includes the store operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the Min parameter of the second dimension of the AGU, for example, based on the entry size corresponding to the latch instruction, and a count of iterations in the second loop, for example, when the latch instruction includes the store operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the Max parameter of the second dimension of the AGU, for example, based on the Min parameter of the second dimension of the AGU and the entry size corresponding to the latch instruction, for example, when the latch instruction includes the store operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the step parameter of the second dimension of the AGU, for example, based on an additive inverse of the entry size corresponding to the latch instruction, for example, when the latch instruction includes the store operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the Min parameter of the second dimension of the AGU, for example, based on the entry size corresponding to the latch instruction, and the count of iterations in the second loop, for example, when the latch instruction includes the store operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the Min parameter of the second dimension of the AGU, for example, based on a product of the additive inverse of the entry size corresponding to the latch instruction, and a subtraction result of subtracting one from the count of iterations in the second loop, for example, when the latch instruction includes the store operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the Max parameter of the second dimension of the AGU, for example, based on the Min parameter of the second dimension of the AGU and the entry size corresponding to the latch instruction, for example, when the latch instruction includes the store operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set the Max parameter of the second dimension of the AGU, for example, based on a sum of the Min parameter of the second dimension of the AGU and the entry size corresponding to the latch instruction, for example, when the latch instruction includes the store operation, e.g., as described below.
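The latch-store settings described above may be gathered, as a minimal sketch, into one C function. The step, Min, and Max values follow the text (additive inverse of the entry size, that inverse multiplied by the iteration count minus one, and Min plus the entry size). The expression used for Base is an assumption: it is the form that makes the address return to OrigBase at the last iteration of every inner loop, if the address is taken to be the base plus the sum of the per-dimension offsets. The names latch_store_cfg and latch_store_dim are hypothetical.

```c
/* Hypothetical container for a latch-store configuration of one dimension. */
typedef struct {
    long base, step, min, max;
} latch_store_cfg;

/* Computes the parameters for inner dimension `dim` of a hoisted store.
 * trip_count[] holds the iteration counts of the inner loops, and
 * inner_dims is the number of AGU dimensions that correspond to them. */
latch_store_cfg latch_store_dim(long orig_base, long entry_size,
                                const long *trip_count, int inner_dims, int dim)
{
    long total = 0;
    for (int d = 0; d < inner_dims; d++)
        total += trip_count[d];               /* sum of TripCount(L) */

    latch_store_cfg c;
    /* Assumed form of Base, chosen so the final offsets cancel it out. */
    c.base = orig_base + entry_size * (total - inner_dims);
    c.step = -entry_size;                          /* additive inverse  */
    c.min  = -entry_size * (trip_count[dim] - 1);  /* reached last      */
    c.max  = c.min + entry_size;                   /* one-entry window  */
    return c;
}
```

For example, with an entry size of 2 and inner trip counts of 4 and 3, the sketch yields a base that is 10 above OrigBase, and the two Min values (-6 and -4) bring the address back to OrigBase at the last iteration of both loops.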

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, based on a determination that the plurality of loops includes a third loop nested in the first loop, that the second loop is nested in the third loop, and that the latch instruction including the store operation is outside the third loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to configure the third dimension of the AGU, for example, based on the third loop, for example, when the latch instruction including the store operation is outside the third loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set a step parameter of the third dimension of the AGU, for example, based on the entry size corresponding to the latch instruction, for example, when the latch instruction including the store operation is outside the third loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set a Min parameter of the third dimension of the AGU, for example, based on the entry size corresponding to the latch instruction, and a count of iterations in the third loop, for example, when the latch instruction including the store operation is outside the third loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to set a Max parameter of the third dimension of the AGU, for example, based on the Min parameter of the third dimension of the AGU, and the entry size corresponding to the latch instruction, for example, when the latch instruction including the store operation is outside the third loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to perform one or more operations, for example, according to a loop-compilation scheme, which may be configured to compile instructions for a loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to identify load and/or store operations, and to analyze one or more attributes of a load/store operation, e.g., for each load and/or store operation, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to analyze, for a load/store operation, e.g., for each load and/or store operation, its innermost enclosing loop, its base parameter, its step parameter, e.g., stride, its offset, its bounds, e.g., Min/Max parameters, and/or its passthrough value.

In some demonstrative aspects, compiler 160 may be configured to partition the identified load and/or store operations, for example, into one or more groups, for example, according to their attributes, e.g., according to their stride, bounds, and/or passthrough values.

In some demonstrative aspects, compiler 160 may be configured to assign an AGU to a group of identified load/store operations, e.g., to each group of identified load/store operations.
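The partitioning and assignment steps may be sketched in C as follows. This is a simplified illustration under stated assumptions: each access is reduced to a single stride, a single Min/Max pair, and a passthrough value, and two accesses are placed in the same group only when all of these attributes are equal. The names mem_access and assign_agus are hypothetical, and a real compiler may use additional criteria.

```c
/* Hypothetical summary of one load/store operation. */
typedef struct {
    long step;        /* stride                     */
    long min, max;    /* bounds                     */
    int passthrough;  /* passthrough value          */
    int agu;          /* output: assigned AGU index */
} mem_access;

/* Groups accesses with identical attributes and gives each group one AGU.
 * Returns the number of AGUs used. */
int assign_agus(mem_access *ops, int n)
{
    int next_agu = 0;
    for (int i = 0; i < n; i++) {
        ops[i].agu = -1;
        for (int j = 0; j < i; j++) {
            if (ops[j].step == ops[i].step &&
                ops[j].min == ops[i].min &&
                ops[j].max == ops[i].max &&
                ops[j].passthrough == ops[i].passthrough) {
                ops[i].agu = ops[j].agu;   /* join the existing group */
                break;
            }
        }
        if (ops[i].agu < 0)
            ops[i].agu = next_agu++;       /* open a new group */
    }
    return next_agu;
}
```

The quadratic search is used only for brevity; the number of memory accesses in a loop body is typically small.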

In one example, each group of identified load/store operations may be implemented with a single AGU.

In some demonstrative aspects, compiler 160 may be configured to configure AGU configuration code for AGUs implementing load and/or store operations, which may be located outside an innermost loop.

For example, the AGU configuration code may set one or more special parameters, e.g., including a special base parameter, a special stride parameter, and/or special Min/Max parameters, for a load and/or store operation in an outer loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to sink or hoist an outer load operation, which may be outside an inner loop.

In some demonstrative aspects, compiler 160 may be configured to generate AGU configuration code, for example, to configure an AGU based on the outer load operation.

In some demonstrative aspects, the AGU configuration code may configure a load operation based on the outer load operation, which may be performed at a start or an end of the inner-loop.

In some demonstrative aspects, compiler 160 may be configured to generate the AGU configuration code, for example, to configure the AGU for execution of the load operation in the first iteration of the inner-loop or in the last iteration of the inner-loop.

In some demonstrative aspects, compiler 160 may be configured to set a base parameter of the AGU, for example, to a memory pointer of the outer load operation, e.g., similar to a usual setting of the base parameter.

In some demonstrative aspects, compiler 160 may be configured to set to zero the step parameter of the AGU for a dimension corresponding to an inner loop, denoted L, for example, for each inner loop, e.g., as follows:

Step_L=0

For example, the setting of Step_L=0 may be applied with respect to load operations.

In some demonstrative aspects, compiler 160 may be configured to set the minimum parameter of the AGU for a dimension corresponding to the inner loop L to zero, and to set the maximum parameter of the AGU for the dimension corresponding to the inner loop L, for example, based on an entry size of the outer load operation, e.g., as follows:

Min_L=0
Max_L=EntrySize

In some demonstrative aspects, compiler 160 may be configured, for example, to sink the load operation inp1[y*width+x] of the pre-header instruction of Example 4, for example, into the inner loop of Example 4.

In some demonstrative aspects, compiler 160 may be configured to identify an innermost enclosing loop of the outer load operation of Example 4.

For example, compiler 160 may identify the loop over x (X-loop) as the innermost enclosing loop for the load operation inp1[y*width+x], for example, while the loop over z (Z-loop) may be more inner than the loop of the load operation inp1[y*width+x].

In some demonstrative aspects, compiler 160 may be configured to identify an entry size of the memory pointer inp1. For example, the entry size of the memory pointer inp1 may be 1, for example, as the memory pointer inp1 may be configured to include a Character (Char).

In some demonstrative aspects, compiler 160 may be configured to transform the outer load operation inp1[y*width+x], for example, into a load instruction, e.g., char val=inp1[inp1_ind], for example, by configuring AGU configuration code for a dimension corresponding to the inner loop, e.g., the dimension corresponding to the Z-loop, of an AGU implementing the memory pointer inp1, e.g., as follows:

Step_Z=0
Min_Z=0
Max_Z=1

In some demonstrative aspects, compiler 160 may be configured to sink an outer store operation, which may be outside of an inner loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate AGU configuration code, for example, to configure an AGU based on the sinking of the outer store operation.

In some demonstrative aspects, the AGU configuration code may configure a store operation based on the sinking of the outer store operation, which may be executed at a start of the inner-loop. For example, the store operation may be executed at the first iteration of the inner-loop.

In some demonstrative aspects, compiler 160 may be configured to set the base parameter of the AGU, for example, to a memory pointer of the outer store operation, e.g., similar to a usual setting of the base parameter.

In some demonstrative aspects, compiler 160 may be configured to set the step parameter of the AGU for a dimension corresponding to an inner loop, denoted L, for example, for each inner loop, for example, based on an entry size of the outer store operation, e.g., as follows:

Step_L=EntrySize

In some demonstrative aspects, compiler 160 may be configured to set the minimum parameter of the AGU for the dimension corresponding to the inner loop L, e.g., to zero, and to set the maximum parameter of the AGU for the dimension corresponding to the inner loop L, for example, based on an entry size of the outer store operation, e.g., as follows:

Min_L=0
Max_L=EntrySize

In some demonstrative aspects, compiler 160 may be configured to sink the outer store operation “out1[y*width+x]= . . . ” of the pre-header instruction of Example 4, for example, into the inner loop of Example 4.

In some demonstrative aspects, compiler 160 may be configured to identify an innermost enclosing loop of the outer store operation of Example 4.

For example, compiler 160 may identify the X-loop as the innermost enclosing loop for the store operation “out1[y*width+x]= . . . ”, for example, while the Z-loop may be more inner than the loop of the store operation “out1[y*width+x]= . . . ”.

In some demonstrative aspects, compiler 160 may be configured to identify an entry size of the memory pointer out1.

For example, the entry size of the memory pointer out1 may be 1, for example, as the memory pointer out1 may be configured to include a Char.

In some demonstrative aspects, compiler 160 may be configured to transform the outer store operation “out1[y*width+x]= . . . ”, for example, into a store instruction, e.g., “if (first_iteration_of_z_loop) out1[out1_ind]=result”, for example, by configuring AGU configuration code for a dimension corresponding to the inner-loop, e.g., the dimension corresponding to the Z-loop, of an AGU implementing the memory pointer out1, e.g., as follows:

In some demonstrative aspects, compiler 160 may be configured to hoist an outer store operation, which may be outside of an inner loop, e.g., as described below.

In some demonstrative aspects, compiler 160 may be configured to generate AGU configuration code, for example, to configure an AGU based on the hoisting of the outer store operation.

In some demonstrative aspects, the AGU configuration code may configure, for example, a store operation based on hoisting of the outer store operation, which may be executed at an end of the inner-loop. For example, the store operation may be executed at the last iteration of the inner-loop.

In some demonstrative aspects, compiler 160 may be configured to set a base parameter of the AGU, for example, based on an entry size corresponding to the outer store operation, e.g., as follows:

Base = OrigBase + EntrySize*([Σ TripCount(L)] − #InnerDims),

wherein OrigBase denotes the base of the outer store operation, wherein EntrySize denotes the entry size corresponding to the outer store operation, wherein TripCount(L) denotes a number of iterations for the loop L, and wherein #InnerDims denotes a number (total count) of dimensions of the AGU, which correspond to loops that are more inner than the loop of the outer store operation. For example, the summation over loop L may be over all loops L, which are more inner than the loop of the outer store operation being hoisted.

In some demonstrative aspects, compiler 160 may be configured to set the step parameter of the AGU, e.g., for dimensions corresponding to the inner loops, for example, based on the entry size of the outer store operation, e.g., as follows:

In some demonstrative aspects, compiler 160 may be configured to set the minimum parameter of the AGU, e.g., for a dimension corresponding to an inner loop L, for example, for each inner loop, and to set the maximum parameter of the AGU for the dimension corresponding to the inner loop L, for example, based on the entry size of the outer store operation, e.g., as follows:

In some demonstrative aspects, compiler 160 may be configured to hoist the store operation “out2[y]=a” of the latch instruction of Example 4, for example, into the inner loop of Example 4.

In some demonstrative aspects, compiler 160 may be configured to identify an innermost enclosing loop of the outer store operation of Example 4.

For example, compiler 160 may identify the loop over y (Y-loop) as the innermost enclosing loop for the store operation “out2[y]=a”, for example, while the X-loop and the Z-loop may be more inner than the loop of the store operation “out2[y]=a”.

In some demonstrative aspects, compiler 160 may be configured to identify an entry size of the memory pointer out2.

For example, the entry size of the memory pointer out2 may be 1, for example, as the memory pointer out2 may be configured to include a Char.

In some demonstrative aspects, compiler 160 may be configured to identify a count of iterations (trip count) of the inner X-loop, and a count of iterations (trip count) of the inner Z-loop, e.g., as follows:

In some demonstrative aspects, compiler 160 may be configured to transform the outer store operation “out2[y]=a”, for example, into a conditional store instruction, e.g., “if(last_iteration_of_x_and_z_loops) out2[out2_ind]=a”, for example, by configuring AGU configuration code for dimensions of the inner loops, e.g., the dimension corresponding to the X-loop and the dimension corresponding to the Z-loop, of an AGU implementing the memory pointer out2, e.g., as follows:

Set Step, Min, Max of Y-Loop and TripCounts of all the loops, e.g., as usual:

In some demonstrative aspects, compiler 160 may configure AGU configuration code, for example, to configure a plurality of AGUs based on code of Example 4, e.g., as follows:

In some demonstrative aspects, compiler 160 may configure loop code to execute the loop of Example 4, for example, based on the AGU configuration code of Example 6a, e.g., as follows:

In some demonstrative aspects, as shown by Example 6a, compiler 160 may assign a first AGU, e.g., agu1, to perform the load operation “val=agu1.load( )”, for example, based on the outer load operation “inp1[y*width+x]” of Example 4.

In some demonstrative aspects, as shown by Example 6a, compiler 160 may assign a second AGU, e.g., agu2, to perform the load operation “a=agu2.load( )”, for example, based on the inner load operation “a=inp2[y*width*area+x*area+z]” of Example 4.

In some demonstrative aspects, as shown by Example 6a, compiler 160 may assign a third AGU, e.g., agu3, to perform the store operation “agu3.store(val);”, for example, based on the outer store operation “out1[y*width+x]= . . . ;” of Example 4.

In some demonstrative aspects, as shown by Example 6a, compiler 160 may assign a fourth AGU, e.g., agu4, to perform a store operation “agu4.store(a)”, for example, based on the outer store operation “out2[y]=a” of Example 4.

In some demonstrative aspects, as shown by Example 6a, compiler 160 may configure AGU configuration code to configure agu1.

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu1 may be configured to set a base parameter of the first AGU to the memory pointer inp1.

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu1 may be configured to set a Min parameter for the dimension z of the first AGU to zero, and to set the Max parameter for the dimension z of the first AGU to 1.

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu1 may be configured to set a count of iterations for the dimension z of the first AGU to the value area, and to set the stride (step) for the dimension z of the first AGU to zero.

For example, these settings for the dimension z of the first AGU may configure the load operation val=agu1.load( ) to be performed, for example, at the start of the inner loop z.

In some demonstrative aspects, as shown by Example 6a, compiler 160 may configure AGU configuration code to configure agu3.

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu3 may be configured to set a base parameter of the third AGU to the memory pointer out1.

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu3 may be configured to set a Min parameter for the dimension z of the third AGU to zero, and to set the Max parameter for the dimension z of the third AGU to 1.

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu3 may be configured to set a count of iterations for the dimension z of the third AGU to the value area, and to set the stride (step) for the dimension z of the third AGU to 1.

For example, these settings for the dimension z of the third AGU may configure the store operation agu3.store(val) to be performed, for example, at the start of the inner loop z.
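The z-dimension settings listed above for agu1 (Min 0, Max 1, stride 0) and agu3 (Min 0, Max 1, stride 1) can be checked with a small sketch. It assumes a gating model that the text only implies: the access is performed while the accumulated offset of the dimension lies in the half-open window [Min, Max). The type ZDim and the helper names are hypothetical and are not part of Example 6a.

```c
#include <assert.h>

/* Hypothetical view of one AGU dimension; field names are illustrative. */
typedef struct {
    int min;    /* inclusive lower bound of the access window */
    int max;    /* exclusive upper bound of the access window (assumed) */
    int stride; /* offset added per iteration of the loop */
    int count;  /* trip count of the loop */
} ZDim;

/* Count the iterations whose offset falls inside [min, max).
 * Assumption: the offset at iteration z is stride * z. */
int accesses_in_window(ZDim d) {
    assert(d.count > 0);
    int hits = 0;
    for (int z = 0; z < d.count; z++) {
        int offset = d.stride * z;
        if (offset >= d.min && offset < d.max) {
            hits++;
        }
    }
    return hits;
}

/* agu1 (pre-header load): stride 0, so the offset never leaves the window. */
ZDim agu1_z(int area) {
    ZDim d = {0, 1, 0, area};
    return d;
}

/* agu3 (pre-header store): stride 1, so only iteration 0 is in the window. */
ZDim agu3_z(int area) {
    ZDim d = {0, 1, 1, area};
    return d;
}
```

Under this assumed model, the store of agu3 is in the window only at the first iteration of the Z-loop, whereas the load of agu1 stays in the window on every iteration but always reads the same address, so it returns the value that a single load at the start of the Z-loop would return.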

In some demonstrative aspects, as shown by Example 6a, compiler 160 may configure AGU configuration code to configure agu4.

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu4 may be configured to set a base parameter of the fourth AGU based, for example, on the memory pointer out2 and a count of iterations of the inner loops, e.g., the value area and the value width, e.g., as follows:

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu4 may be configured to set a Min parameter for the dimension x of the fourth AGU, e.g., as follows:

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu4 may be configured to set a Max parameter for the dimension x of the fourth AGU, e.g., as follows:

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu4 may be configured to set a Stride (Step) parameter for the dimension x of the fourth AGU, e.g., as follows:

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu4 may be configured to set a Count parameter for the dimension x of the fourth AGU to width, e.g., according to the count of iterations of the X-loop.

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu4 may be configured to set a Min parameter for the dimension z of the fourth AGU, e.g., as follows:

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu4 may be configured to set a Max parameter for the dimension z of the fourth AGU, e.g., as follows:

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu4 may be configured to set a Stride (Step) parameter for the dimension z of the fourth AGU, e.g., as follows:

In some demonstrative aspects, as shown by Example 6a, the AGU configuration code to configure agu4 may be configured to set a Count parameter for the dimension z of the fourth AGU to area, e.g., according to the count of iterations of the Z-loop.
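The agu4 parameter values are referenced above without their expressions. The following sketch computes them from the relations stated in Examples 17 and 19-21 of the further aspects (Base from the summed trip counts, Step as the additive inverse of the entry size, Min as that inverse times the trip count minus one, Max as Min plus the entry size). The type LatchDim and the function names are hypothetical, and reading "based on" as "equal to" is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-dimension settings for a hoisted (latch) store. */
typedef struct {
    long step;
    long min;
    long max;
} LatchDim;

/* Base = OrigBase + EntrySize * ([sum of TripCount(L)] - #InnerDims) */
long latch_base(long orig_base, long entry_size,
                const long *trip_counts, size_t inner_dims) {
    assert(entry_size > 0);
    long total = 0;
    for (size_t i = 0; i < inner_dims; i++) {
        assert(trip_counts[i] > 0);
        total += trip_counts[i];
    }
    return orig_base + entry_size * (total - (long)inner_dims);
}

/* Step = -EntrySize; Min = -EntrySize * (TripCount - 1); Max = Min + EntrySize */
LatchDim latch_dim(long entry_size, long trip_count) {
    assert(entry_size > 0 && trip_count > 0);
    LatchDim d;
    d.step = -entry_size;
    d.min = -entry_size * (trip_count - 1);
    d.max = d.min + entry_size;
    return d;
}
```

For an entry size of 1 with width=3 and area=5, this gives a base offset of 6 from out2, and the per-dimension Min values (-2 for x, -4 for z) sum to -6, so the base plus the offsets at the last iteration of both inner loops lands back on the original pointer.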

In one example, compiler 160 may process source code 112, which may be based on Example 4, for example, with a setting of width=3, and a setting of area=5, e.g., as follows:

for(int y = 0; y < height; y++) {
  char c = 0;
  for(int x = 0; x < 3; x++) {
    for(int z = 0; z < 5; z++) {
      c |= inp[y * width * area + x * area + z];
    }
  }
  out[y] = c;
}

In some demonstrative aspects, as shown by Example 7, an outer-most loop (Y-loop) may include an outer store instruction, e.g., out[y]=c, which may be after a nested loop (X-loop) and an inner loop (Z-loop).

In some demonstrative aspects, the outer store instruction may be executed after 3 iterations of the X-loop, for example, wherein each iteration of the X-loop includes 5 iterations of the Z-loop. For example, the outer store instruction may be executed after a total of 15 iterations, e.g., 3*5=15.
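To make the 15-iteration count concrete, the sketch below compares the loop nest of Example 7 with a single collapsed loop in which the store is guarded by a "last iteration of the X-loop and Z-loop" condition. The guard is written here in software purely for illustration; in the described scheme it would be realized by the AGU settings. The function names nest and flat, the constants WIDTH and AREA, and the reset of c at the start of each Y-iteration are assumptions of this sketch.

```c
#include <assert.h>

enum { WIDTH = 3, AREA = 5 };

/* Loop nest of Example 7: the store sits after the X-loop and Z-loop. */
void nest(const char *inp, char *out, int height) {
    for (int y = 0; y < height; y++) {
        char c = 0;
        for (int x = 0; x < WIDTH; x++) {
            for (int z = 0; z < AREA; z++) {
                c |= inp[y * WIDTH * AREA + x * AREA + z];
            }
        }
        out[y] = c;
    }
}

/* Collapsed form: one loop, with the store guarded so that it fires
 * only on the 15th inner iteration (3 * 5) of each Y-iteration. */
void flat(const char *inp, char *out, int height) {
    assert(height >= 0);
    const int inner = WIDTH * AREA;  /* 15 iterations per Y-iteration */
    char c = 0;
    for (int i = 0; i < height * inner; i++) {
        int k = i % inner;           /* position inside the current Y-iteration */
        if (k == 0) {
            c = 0;                   /* start of a new Y-iteration */
        }
        c |= inp[i];                 /* same address as y*15 + x*5 + z */
        if (k == inner - 1) {
            out[i / inner] = c;      /* last X and last Z iteration */
        }
    }
}
```

Both functions write the same values to out, which is the property the hoisted store has to preserve.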

In some demonstrative aspects, compiler 160 may be configured to generate AGU configuration code, for example, to configure an AGU based on the outer store instruction of Example 7.

In some demonstrative aspects, the AGU configuration code may set parameters, for example, for the x dimension and the z dimension of the AGU, e.g., including the base parameter, the total count of iterations, the step parameter, the Min parameter, and the Max parameter, for example, based on the outer store instruction out[y]=c, e.g., as described above.

Reference is made to FIG. 4, which schematically illustrates an execution scheme 400 to execute a latch store operation in a loop nest, in accordance with some demonstrative aspects.

In some demonstrative aspects, execution scheme 400 may demonstrate the setting of the AGU to configure execution of the outer store operation out[y]=c of Example 7.

In some demonstrative aspects, as shown in FIG. 4, execution scheme 400 may include three execution steps, for example, corresponding to the three iterations of the X-loop of Example 7.

In some demonstrative aspects, as shown in FIG. 4, execution scheme 400 may include a first execution step 410 corresponding to a first iteration of the X-loop, e.g., x=0.

In some demonstrative aspects, as shown in FIG. 4, execution scheme 400 may include a second execution step 420 corresponding to a second iteration of the X-loop, e.g., x=1.

In some demonstrative aspects, as shown in FIG. 4, execution scheme 400 may include a third execution step 430 corresponding to a third iteration of the X-loop, e.g., x=2.

In some demonstrative aspects, as shown in FIG. 4, the Z-loop may perform 5 iterations, for example, in each execution step of the execution scheme 400.

In some demonstrative aspects, as shown in FIG. 4, the AGU configuration code may be configured based on Example 7, for example, to set the Min parameter to zero and the Max parameter to one, e.g., for each of the x dimension and the z dimension of the AGU.

In some demonstrative aspects, as shown in FIG. 4, the AGU configuration code may be configured based on Example 7, for example, to set the base parameter of the AGU to a memory pointer 6.

In some demonstrative aspects, as shown in FIG. 4, during the first execution step 410, a first iteration of the Z-loop may begin at a memory pointer 7 and a last iteration of the Z-loop may be at a memory pointer 2. Accordingly, the store operation, which may be bounded by the Min parameter of zero and the Max parameter of one, may not be executed.

In some demonstrative aspects, as shown in FIG. 4, during the second execution step 420, a first iteration of the Z-loop may begin at memory pointer 6 and the last iteration of the Z-loop may be at memory pointer 1. Accordingly, the store operation, which may be bounded by the Min parameter of zero and the Max parameter of one, may not be executed.

In some demonstrative aspects, as shown in FIG. 4, during the third execution step 430, a first iteration of the Z-loop may begin at memory pointer 5 and the last iteration of the Z-loop may be at memory pointer 0. Accordingly, the store operation, which may be bounded by the Min parameter of zero and the Max parameter of one, may be executed, for example, only at the last iteration of the Z-loop in the last iteration of the X-loop.
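The walk through the three execution steps can be traced with a small simulation. This is one possible reading, not the figure itself: it assumes the pointer is the base plus the sum of the x and z offsets, that each offset advances by the additive inverse of the entry size, and that the store is gated per dimension by a window [Min, Max) taken from the relations in Examples 19-21 of the further aspects. The type Trace and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { X_TRIPS = 3, Z_TRIPS = 5, ENTRY = 1 };

/* Record of what the simulated AGU did over one Y-iteration. */
typedef struct {
    int last_ptr[X_TRIPS]; /* pointer at the last Z-iteration of each step */
    int stores;            /* number of times the store was performed */
    int store_ptr;         /* pointer used by the store, -1 if none */
    int store_x;           /* X-iteration of the store, -1 if none */
    int store_z;           /* Z-iteration of the store, -1 if none */
} Trace;

/* Assumed gating rule: the offset must lie in [min, max). */
bool offset_in_window(int offset, int min, int max) {
    return offset >= min && offset < max;
}

Trace simulate_latch_store(int orig_base) {
    assert(ENTRY > 0);
    /* Base = OrigBase + EntrySize * ((3 + 5) - 2) = OrigBase + 6 */
    const int base = orig_base + ENTRY * ((X_TRIPS + Z_TRIPS) - 2);
    const int step = -ENTRY;
    const int x_min = -ENTRY * (X_TRIPS - 1);
    const int z_min = -ENTRY * (Z_TRIPS - 1);
    Trace t = { {0}, 0, -1, -1, -1 };

    for (int x = 0; x < X_TRIPS; x++) {
        for (int z = 0; z < Z_TRIPS; z++) {
            int x_off = step * x;
            int z_off = step * z;
            int ptr = base + x_off + z_off;
            if (z == Z_TRIPS - 1) {
                t.last_ptr[x] = ptr;
            }
            if (offset_in_window(x_off, x_min, x_min + ENTRY) &&
                offset_in_window(z_off, z_min, z_min + ENTRY)) {
                t.stores++;
                t.store_ptr = ptr;
                t.store_x = x;
                t.store_z = z;
            }
        }
    }
    return t;
}
```

In this model the pointer at the last Z-iteration is 2, 1 and 0 for the three execution steps, and the store is performed exactly once, at x=2 and z=4, where the pointer equals the original base.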

Reference is made to FIG. 5, which schematically illustrates an execution scheme 500 to execute a pre-header load or store operation in a loop nest, in accordance with some demonstrative aspects.

In some demonstrative aspects, execution scheme 500 may demonstrate the execution of the pre-header load or store instruction, for example, according to the loop execution scheme.

In some demonstrative aspects, the pre-header load or store instruction may be within an outer loop, which may be outer to an inner loop.

In some demonstrative aspects, as shown in FIG. 5, the AGU configuration code corresponding to an AGU to perform the pre-header load or store operation may set, for example, the Min parameter to zero and the Max parameter to one, for example, for a dimension of the AGU corresponding to the pre-header load or store instruction.

In some demonstrative aspects, as shown in FIG. 5, the AGU configuration code corresponding to the AGU to perform the pre-header load or store operation may set the base parameter of the AGU, for example, to the memory pointer of the outer load or store pre-header instruction.

In some demonstrative aspects, as shown in FIG. 5, the AGU configuration code may be configured to provide a technical solution to ensure that the pre-header load or store operation is to be executed only once at a beginning of the inner loop.

For example, the setting of the Min parameter and the Max parameter, which may bound the execution of the load or store operation only to the first iteration of the inner loop, may ensure that the pre-header load or store operation is to be executed only once at the beginning of the inner loop, e.g., as described above.

Reference is made to FIG. 6, which schematically illustrates a method of compiling code for a processor. For example, one or more operations of the method of FIG. 6 may be performed by a system, e.g., system 100 (FIG. 1); a device, e.g., device 102 (FIG. 1); a server, e.g., server 170 (FIG. 1); and/or a compiler, e.g., compiler 160 (FIG. 1), and/or compiler 200 (FIG. 2).

In some demonstrative aspects, as indicated at block 602, the method may include identifying a loop nest based on a source code to be compiled into a target code to be executed by a target processor. For example, the loop nest may include a plurality of loops, the plurality of loops including at least a first loop and a second loop nested in the first loop. For example, the first loop may include at least one first-loop instruction, which is outside the second loop. For example, compiler 160 (FIG. 1) may be configured to identify the loop nest, for example, based on the source code 112 (FIG. 1), e.g., as described above.

In some demonstrative aspects, as indicated at block 604, the method may include generating AGU configuration code to configure an AGU of the target processor based on the first-loop instruction. For example, the AGU configuration code may configure a first dimension of the AGU based on the first loop, and a second dimension of the AGU based on the second loop. For example, the AGU configuration code may configure the second dimension of the AGU to configure a memory-access operation to be performed at a start of the second loop or at an end of the second loop. For example, the memory-access operation may be based on the first-loop instruction. For example, compiler 160 (FIG. 1) may be configured to generate the AGU configuration code to configure the AGU of the target processor 180 (FIG. 1) based on the first-loop instruction, e.g., as described above.

In some demonstrative aspects, as indicated at block 606, the method may include generating the target code based on compilation of the source code. For example, the target code may be based on the AGU configuration code. For example, compiler 160 (FIG. 1) may be configured to generate target code 115 (FIG. 1), which is based on the AGU configuration code, for example, based on compilation of the source code 112 (FIG. 1), e.g., as described above.

Reference is made to FIG. 7, which schematically illustrates a product of manufacture 700, in accordance with some demonstrative aspects. Product 700 may include one or more tangible computer-readable (“machine-readable”) non-transitory storage media 702, which may include computer-executable instructions, e.g., implemented by logic 704, operable to, when executed by at least one computer processor, enable the at least one computer processor to implement one or more operations at device 102 (FIG. 1), server 170 (FIG. 1), and/or compiler 160 (FIG. 1), to cause device 102 (FIG. 1), server 170 (FIG. 1), and/or compiler 160 (FIG. 1) to perform, trigger and/or implement one or more operations and/or functionalities, and/or to perform, trigger and/or implement one or more operations and/or functionalities described with reference to FIGS. 1-6, and/or one or more operations described herein. The phrases “non-transitory machine-readable medium” and “computer-readable non-transitory storage media” may be directed to include all computer-readable media, with the sole exception being a transitory propagating signal.

In some demonstrative aspects, product 700 and/or machine-readable storage media 702 may include one or more types of computer-readable storage media capable of storing data, including volatile memory, non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and the like. For example, machine-readable storage media 702 may include RAM, DRAM, Double-Data-Rate DRAM (DDR-DRAM), SDRAM, static RAM (SRAM), ROM, programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory (e.g., NOR or NAND flash memory), content addressable memory (CAM), polymer memory, phase-change memory, ferroelectric memory, silicon-oxide-nitride-oxide-silicon (SONOS) memory, a disk, a hard drive, and the like. The computer-readable storage media may include any suitable media involved with downloading or transferring a computer program from a remote computer to a requesting computer carried by data signals embodied in a carrier wave or other propagation medium through a communication link, e.g., a modem, radio or network connection.

In some demonstrative aspects, logic 704 may include instructions, data, and/or code, which, if executed by a machine, may cause the machine to perform a method, process and/or operations as described herein. The machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, or the like, and may be implemented using any suitable combination of hardware, software, firmware, and the like.

In some demonstrative aspects, logic 704 may include, or may be implemented as, software, a software module, an application, a program, a subroutine, instructions, an instruction set, computing code, words, values, symbols, and the like. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The instructions may be implemented according to a predefined computer language, manner or syntax, for instructing a processor to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language, machine code, and the like.

The following examples pertain to further aspects.

Example 1 includes a product comprising one or more tangible computer-readable non-transitory storage media comprising computer-executable instructions operable to, when executed by at least one processor, enable the at least one processor to cause a compiler to identify a loop nest based on a source code to be compiled into a target code to be executed by a target processor, the loop nest comprising a plurality of loops, the plurality of loops comprising at least a first loop and a second loop nested in the first loop, wherein the first loop comprises at least one first-loop instruction, which is outside the second loop; generate Address Generation Unit (AGU) configuration code to configure an AGU of the target processor based on the first-loop instruction, wherein the AGU configuration code is to configure a first dimension of the AGU based on the first loop, and to configure a second dimension of the AGU based on the second loop, wherein the AGU configuration code is to configure the second dimension of the AGU to configure a memory-access operation to be performed at a start of the second loop or at an end of the second loop, wherein the memory-access operation is based on the first-loop instruction; and generate the target code based on compilation of the source code, wherein the target code is based on the AGU configuration code.

Example 2 includes the subject matter of Example 1, and optionally, wherein the plurality of loops comprises a third loop nested in the first loop, the second loop is nested in the third loop, the first-loop instruction is outside the third loop, wherein the AGU configuration code is to configure a third dimension of the AGU based on the third loop, wherein the AGU configuration code is to configure the third dimension to configure the memory-access operation to be performed at the start of the second loop or at the end of the second loop.

Example 3 includes the subject matter of Example 2, and optionally, wherein the third loop comprises a third-loop instruction, which is outside the second loop, wherein the AGU configuration code is to configure an other AGU of the target processor based on the third-loop instruction, wherein the AGU configuration code is to configure a first dimension of the other AGU based on the third loop, and to configure a second dimension of the other AGU based on the second loop, wherein the AGU configuration code is to configure the second dimension of the other AGU to configure an other memory-access operation to be performed at the start of the second loop or at the end of the second loop, wherein the other memory-access operation is based on the third-loop instruction.

Example 4 includes the subject matter of Example 3, and optionally, wherein the instructions, when executed, cause the compiler to transform the loop nest into a transformed loop comprising the memory-access operation and the other memory access operation, wherein the target code is based on the transformed loop.

Example 5 includes the subject matter of any one of Examples 2-4, and optionally, wherein the AGU configuration code is to set a Maximum (Max) parameter of the second dimension of the AGU and a Max parameter of the third dimension of the AGU based on an entry size corresponding to the first-loop instruction.

Example 6 includes the subject matter of any one of Examples 1-5, and optionally, wherein the AGU configuration code is to set a base parameter of the AGU based on a memory pointer of the first-loop instruction, and to set a Maximum (Max) parameter of the second dimension of the AGU based on an entry size corresponding to the first-loop instruction.

Example 7 includes the subject matter of any one of Examples 1-6, and optionally, wherein the at least one first-loop instruction comprises a pre-header instruction to be performed before a first iteration of the second loop, wherein the AGU configuration code is to configure the second dimension of the AGU to configure the memory-access operation to be performed only at the start of the second loop.

Example 8 includes the subject matter of Example 7, and optionally, wherein the AGU configuration code is to set a Minimum (Min) parameter of the second dimension of the AGU to zero.

Example 9 includes the subject matter of Example 7 or 8, and optionally, wherein the instructions, when executed, cause the compiler to, based on a determination that the pre-header instruction comprises a load operation, configure the AGU configuration code to set a step parameter of the second dimension of the AGU to zero.

Example 10 includes the subject matter of any one of Examples 7-9, and optionally, wherein the instructions, when executed, cause the compiler to, based on a determination that the pre-header instruction comprises a store operation, configure the AGU configuration code to set a step parameter of the second dimension of the AGU based on an entry size corresponding to the pre-header instruction.

Example 11 includes the subject matter of any one of Examples 7-10, and optionally, wherein the AGU configuration code is to set a base parameter of the AGU to a memory pointer of the pre-header instruction.

Example 12 includes the subject matter of any one of Examples 1-11, and optionally, wherein the at least one first-loop instruction comprises a latch instruction to be performed after a last iteration of the second loop, wherein the AGU configuration code is to configure the second dimension of the AGU to configure the memory-access operation to be performed only at the end of the second loop.

Example 13 includes the subject matter of Example 12, and optionally, wherein the latch instruction comprises a load operation.

Example 14 includes the subject matter of Example 13, and optionally, wherein the AGU configuration code is to set a base parameter of the AGU to a memory pointer of the latch instruction, to set a Minimum (Min) parameter of the second dimension of the AGU to zero, to set a Maximum (Max) parameter of the second dimension of the AGU to an entry size corresponding to the latch instruction, and to set a step parameter of the second dimension of the AGU to zero.

Example 15 includes the subject matter of Example 12, and optionally, wherein the latch instruction comprises a store operation.

Example 16 includes the subject matter of Example 15, and optionally, wherein the AGU configuration code is to set a base parameter of the AGU based on a first parameter value, a second parameter value and a third parameter value, wherein the first parameter value comprises an entry size corresponding to the latch instruction, the second parameter value comprises a total count of iterations over one or more loops, which are in the first loop and include the second loop, the third parameter value comprising a count of dimensions of the AGU corresponding to the one or more loops.

Example 17 includes the subject matter of Example 16, and optionally, wherein the AGU configuration code is to set the base parameter, denoted Base, of the AGU as follows:

Base=OrigBase+EntrySize*([Σ TripCount(L)]−#InnerDims),

wherein OrigBase denotes a memory pointer of the latch instruction, EntrySize denotes the entry size, [Σ TripCount(L)] denotes the total count of iterations over the one or more loops, and #InnerDims denotes the count of dimensions of the AGU corresponding to the one or more loops.

Example 18 includes the subject matter of any one of Examples 15-17, and optionally, wherein the AGU configuration code is to set a step parameter of the second dimension of the AGU based on an entry size corresponding to the latch instruction; to set a Minimum (Min) parameter of the second dimension of the AGU based on the entry size and a count of iterations in the second loop; and to set a Maximum (Max) parameter of the second dimension of the AGU based on the Min parameter of the second dimension of the AGU and the entry size.

Example 19 includes the subject matter of Example 18, and optionally, wherein the AGU configuration code is to set the step parameter of the second dimension of the AGU based on an additive inverse of the entry size.

Example 20 includes the subject matter of Example 18 or 19, and optionally, wherein the AGU configuration code is to set the Min parameter of the second dimension of the AGU based on a product of an additive inverse of the entry size and a subtraction result of subtracting one from the count of iterations in the second loop.

Example 21 includes the subject matter of any one of Examples 18-20, and optionally, wherein the AGU configuration code is to set the Max parameter of the second dimension of the AGU based on a sum of the Min parameter of the second dimension of the AGU and the entry size.

Example 22 includes the subject matter of any one of Examples 16-21, and optionally, wherein the plurality of loops comprises a third loop nested in the first loop, the second loop is nested in the third loop, the latch instruction is outside the third loop, wherein the AGU configuration code is to configure a third dimension of the AGU based on the third loop.

Example 23 includes the subject matter of Example 22, and optionally, wherein the AGU configuration code is to set a step parameter of the third dimension of the AGU based on the entry size, to set a Minimum (Min) parameter of the third dimension of the AGU based on the entry size and a count of iterations in the third loop, and to set a Maximum (Max) parameter of the third dimension of the AGU based on the Min parameter of the third dimension of the AGU and the entry size.

Example 24 includes the subject matter of any one of Examples 1-23, and optionally, wherein the instructions, when executed, cause the compiler to transform the loop nest into a transformed loop comprising the memory-access operation, wherein the target code is based on the transformed loop.

Example 25 includes the subject matter of Example 24, and optionally, wherein the transformed loop comprises a perfect flat loop, in which all compute operations of the loop nest are implemented in the transformed loop.

Example 26 includes the subject matter of Example 24 or 25, and optionally, wherein the transformed loop comprises a fully collapsed loop comprising only a single-basic-block loop based on the plurality of loops.

Example 27 includes the subject matter of any one of Examples 1-26, and optionally, wherein the memory-access operation comprises a load operation or a store operation.

Example 28 includes the subject matter of any one of Examples 1-27, and optionally, wherein the source code comprises Open Computing Language (OpenCL) code.

Example 29 includes the subject matter of any one of Examples 1-28, and optionally, wherein the computer-executable instructions, when executed, cause the compiler to compile the source code into the target code according to a Low Level Virtual Machine (LLVM) based (LLVM-based) compilation scheme.

Example 30 includes the subject matter of any one of Examples 1-29, and optionally, wherein the target code is configured for execution by a Very Long Instruction Word (VLIW) Single Instruction/Multiple Data (SIMD) target processor.

Example 31 includes the subject matter of any one of Examples 1-30, and optionally, wherein the target code is configured for execution by a target vector processor.

Example 32 includes a compiler configured to perform any of the described operations of any of Examples 1-31.

Example 33 includes a computing device configured to perform any of the described operations of any of Examples 1-31.

Example 34 includes a computing system comprising at least one memory to store instructions; and at least one processor to retrieve instructions from the memory and execute the instructions to cause the computing system to perform any of the described operations of any of Examples 1-31.

Example 35 includes a computing system comprising a compiler to generate target code according to any of the described operations of any of Examples 1-31, and a processor to execute the target code.

Example 36 comprises an apparatus comprising means for executing any of the described operations of any of Examples 1-31.

Example 37 comprises an apparatus comprising: a memory interface; and processing circuitry configured to perform any of the described operations of any of Examples 1-31.

Example 38 comprises a method comprising any of the described operations of any of Examples 1-31.
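As an illustration of the transformation described in Examples 24 to 26, the following minimal Python sketch models a two-level loop nest whose first loop contains one load outside the second loop, and then rewrites it as a single flat loop in which that load is performed at the start of the second loop. It is only a software analogy under stated assumptions: the names `nested_reference`, `toy_agu`, and `collapsed` are hypothetical, the two-counter generator merely stands in for a two-dimensional AGU, and none of this is taken from the disclosed compiler or from any real AGU programming interface.

```python
from typing import Iterator, List, Tuple


def nested_reference(data: List[int], bias_table: List[int],
                     outer_n: int, inner_n: int) -> List[int]:
    """Original loop nest: one load sits in the first loop, outside the second loop."""
    out = []
    for i in range(outer_n):
        bias = bias_table[i]  # first-loop instruction (outside the second loop)
        for j in range(inner_n):
            out.append(data[i * inner_n + j] + bias)
    return out


def toy_agu(outer_n: int, inner_n: int) -> Iterator[Tuple[int, bool]]:
    """Hypothetical stand-in for a two-dimensional AGU.

    Yields, once per flat iteration, the first-dimension index and a flag
    that is True when the second dimension is at its first position.
    """
    for k in range(outer_n * inner_n):
        outer_idx, inner_idx = divmod(k, inner_n)
        yield outer_idx, inner_idx == 0


def collapsed(data: List[int], bias_table: List[int],
              outer_n: int, inner_n: int) -> Tuple[List[int], int]:
    """Single flat loop: the first-loop load fires at the start of the second loop."""
    out = []
    bias = 0
    bias_loads = 0  # counts how often the first-loop load is performed
    for k, (outer_idx, at_inner_start) in enumerate(toy_agu(outer_n, inner_n)):
        if at_inner_start:
            bias = bias_table[outer_idx]
            bias_loads += 1
        out.append(data[k] + bias)
    return out, bias_loads


data = list(range(6))
bias_table = [10, 20]
flat_result, loads = collapsed(data, bias_table, outer_n=2, inner_n=3)
assert flat_result == nested_reference(data, bias_table, 2, 3)
```

In this sketch the flat loop has a single body and the load from `bias_table` still happens once per first-loop iteration, which is the property the collapsed form is meant to preserve. In hardware the condition would presumably be handled by the AGU configuration rather than by a branch in the loop body; the `if` here only makes the timing visible.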

Functions, operations, components and/or features described herein with reference to one or more aspects, may be combined with, or may be utilized in combination with, one or more other functions, operations, components and/or features described herein with reference to one or more other aspects, or vice versa.

While certain features have been illustrated and described herein, many modifications, substitutions, changes, and equivalents may occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the disclosure.

Patent Metadata

Filing Date: October 12, 2023

Publication Date: May 21, 2026

Inventors: Michael ZUCKERMAN, Ayal ZAKS

Cite as: Patentable. “APPARATUS, SYSTEM, AND METHOD OF COMPILING CODE FOR A PROCESSOR” (US-20260140725-A1). https://patentable.app/patents/US-20260140725-A1
