US-20260158647-A1

Techniques for Synergistic Planning, Imitation, and Reinforcement Learning for Robot Control

Published: June 11, 2026
Assignee: not available in USPTO data
Technical Abstract

The disclosed method for training one or more robot control models includes performing, based on one or more demonstration trajectories of a robot performing one or more skills associated with a task, one or more training operations to generate one or more first trained machine learning models for controlling the robot; and performing one or more reinforcement learning operations using the one or more first trained machine learning models to generate one or more second trained machine learning models for controlling the robot.

Patent Claims


1

A computer-implemented method for training one or more robot control models, the method comprising: performing, based on one or more demonstration trajectories of a robot performing one or more skills associated with a task, one or more training operations to generate one or more first trained machine learning models for controlling the robot; and performing one or more reinforcement learning operations using the one or more first trained machine learning models to generate one or more second trained machine learning models for controlling the robot.

2

The computer-implemented method of claim 1, wherein each first trained machine learning model included in the one or more first trained machine learning models is trained to control the robot to perform a different skill included in the one or more skills, and wherein each second trained machine learning model included in the one or more second trained machine learning models is trained to control the robot to perform a different skill included in the one or more skills.

3

The computer-implemented method of claim 1, wherein each first trained machine learning model included in the one or more first trained machine learning models is trained to generate a base action to control the robot, and each second trained machine learning model included in the one or more second trained machine learning models is trained to generate a delta action that modifies the base action generated by a corresponding first trained machine learning model included in the one or more first trained machine learning models.

4

The computer-implemented method of claim 1, further comprising generating the one or more demonstration trajectories based on one or more user inputs to control the robot via one or more input/output devices.

5

The computer-implemented method of claim 1, wherein performing one or more training operations to generate the one or more first trained machine learning models comprises: generating, using an untrained machine learning model, one or more robot actions; generating, based on the one or more robot actions and using a simulator, one or more state-action pairs; calculating, based on the one or more state-action pairs and at least one trajectory included in the one or more demonstration trajectories, a loss; and updating, based on the loss, one or more parameters of the untrained machine learning model.

6

The computer-implemented method of claim 1, wherein performing one or more reinforcement learning operations to generate the one or more second trained machine learning models comprises: generating, using a first trained machine learning model included in the one or more first trained machine learning models and an untrained machine learning model, one or more actions; generating, based on the one or more actions and using a simulator, one or more state-action pairs; calculating, based on the one or more state-action pairs, a reward; and updating, based on the reward, the one or more parameters of the untrained machine learning model to generate a second trained machine learning model included in the one or more second trained machine learning models.

7

The computer-implemented method of claim 6, wherein the reward comprises at least one of: a sparse reward of one upon successful completion of a first skill included in the one or more skills and zero otherwise; a dense reward based on progress toward one or more goals associated with the first skill; or one or more penalty terms for movements greater than a threshold and collisions.

8

The computer-implemented method of claim 1, wherein performing one or more reinforcement learning operations comprises updating one or more parameters of an untrained machine learning model based on a Kullback-Leibler (KL) divergence term that penalizes differences between one or more first actions generated using a first trained machine learning model included in the one or more first trained machine learning models and one or more second actions generated using the first trained machine learning model and the untrained machine learning model.

9

The computer-implemented method of claim 1, wherein performing one or more reinforcement learning operations comprises: scheduling a plurality of workers based on a sampling strategy for sampling workers to execute and a queue that stores indications of workers that require scheduling; and executing the plurality of workers based on the scheduling to generate the one or more second trained machine learning models.

10

The computer-implemented method of claim 1, further comprising: receiving sensor data from one or more sensors; generating, based on the sensor data and using the one or more first trained machine learning models and the one or more second trained machine learning models, one or more actions; and causing the robot to perform one or more first movements based on the one or more actions.

11

One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: performing, based on one or more demonstration trajectories of a robot performing one or more skills associated with a task, one or more training operations to generate one or more first trained machine learning models for controlling the robot; and performing one or more reinforcement learning operations using the one or more first trained machine learning models to generate one or more second trained machine learning models for controlling the robot.

12

The one or more non-transitory computer-readable media of claim 11, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the step of generating the one or more demonstration trajectories based on one or more user inputs to control the robot via one or more input/output devices.

13

The one or more non-transitory computer-readable media of claim 11, wherein performing one or more training operations to generate the one or more first trained machine learning models comprises: generating, using an untrained machine learning model, one or more robot actions; generating, based on the one or more robot actions and using a simulator, one or more state-action pairs; calculating, based on the one or more state-action pairs and at least one trajectory included in the one or more demonstration trajectories, a loss; and updating, based on the loss, one or more parameters of the untrained machine learning model.

14

The one or more non-transitory computer-readable media of claim 13, wherein the loss comprises a difference between one or more first robot actions generated using the untrained machine learning model and one or more second robot actions included in the one or more demonstration trajectories.

15

The one or more non-transitory computer-readable media of claim 11, wherein performing one or more reinforcement learning operations to generate the one or more second trained machine learning models comprises: generating, using a first trained machine learning model included in the one or more first trained machine learning models and an untrained machine learning model, one or more actions; generating, based on the one or more actions and using a simulator, one or more state-action pairs; calculating, based on the one or more state-action pairs, a reward; and updating, based on the reward, the one or more parameters of the untrained machine learning model to generate a second trained machine learning model included in the one or more second trained machine learning models.

16

The one or more non-transitory computer-readable media of claim 11, wherein performing one or more reinforcement learning operations comprises updating one or more parameters of an untrained machine learning model based on a Kullback-Leibler (KL) divergence term that penalizes differences between one or more first actions generated using a first trained machine learning model included in the one or more first trained machine learning models and one or more second actions generated using the first trained machine learning model and the untrained machine learning model.

17

The one or more non-transitory computer-readable media of claim 11, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the step of controlling the robot to perform one or more other skills associated with the task using a task and motion planner (TAMP).

18

The one or more non-transitory computer-readable media of claim 11, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the steps of: receiving sensor data from one or more sensors; generating, based on the sensor data and using the one or more first trained machine learning models and the one or more second trained machine learning models, one or more actions; and causing the robot to perform one or more first movements based on the one or more actions.

19

The one or more non-transitory computer-readable media of claim 18, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the step of causing the robot to perform one or more second movements based on one or more motions generated using a motion planning technique.

20

A system comprising: one or more memories storing instructions, and one or more processors that are coupled to the one or more memories and, when executing the instructions, are configured to: perform, based on one or more demonstration trajectories of a robot performing one or more skills associated with a task, one or more training operations to generate one or more first trained machine learning models for controlling the robot, and perform one or more reinforcement learning operations using the one or more first trained machine learning models to generate one or more second trained machine learning models for controlling the robot.

Detailed Description


This application claims priority benefit of the United States Provisional Patent Application titled, “SYNERGISTIC PLANNING, IMITATION, AND REINFORCEMENT FOR LONG-HORIZON MANIPULATION,” filed on Jul. 26, 2024, and having Ser. No. 63/676,223. The subject matter of this related application is hereby incorporated herein by reference.

Embodiments of the present disclosure relate generally to computer science, artificial intelligence and machine learning and, more specifically, to techniques for synergistic planning, imitation, and reinforcement learning for robot control.

Robot control generally refers to the use of automated systems, such as robotic arms, to execute movement and manipulation tasks in a variety of settings. In robot control, a control algorithm is employed to determine the commands that drive a robotic end-effector (e.g., a robot gripper) through a desired motion, often relying on sensor data, such as force feedback, camera images, joint encoders, and/or the like. Typical robot control tasks can include precisely positioning an end-effector, grasping and manipulating objects, tracking desired trajectories (e.g., a set of robot motions), and reacting to changes in the environment. In some cases, robot control is integrated into larger systems that handle multi-step procedures—such as assembling products, dispensing materials, or performing inspection—where each step can require distinct control strategies or tool configurations. Moreover, certain robotic tasks include one or more skills that have to be sequenced or combined, such as picking an item from a conveyor, inspecting the item under a camera, reorienting the item in the gripper, and then placing the item accurately onto a moving fixture.

Conventional approaches for robotic control oftentimes use reinforcement learning (RL). In an RL-based robot control system, the robot explores various robot actions in a given environment and a control policy, which is a machine learning model for controlling the robot, is refined based on a numerical reward that indicates successful outcomes. For example, an RL-based robot control system can assign a numerical reward for beneficial behaviors (e.g., accurately inserting a peg into a hole) and could assign a lower or zero reward for unproductive or failed behaviors. Through repeated trials and by tracking the rewards, the RL algorithm gradually refines the robot control policy. For example, if the robot presses down at an incorrect angle, the robot could receive a low reward, prompting a policy update to avoid that action in future attempts. On the other hand, other conventional approaches for robot control use imitation learning (e.g., behavior cloning (BC)), which is based on demonstrations. For example, a human operator could provide examples (e.g., demonstrations) of the correct way to manipulate an object, and the robot can then clone or replicate the demonstrated actions to learn a robot control policy. As a specific example, the human operator could teleoperate the robot end-effector to align and insert a component into a slot. The robot state-action pairs from the demonstrations (e.g., sensor readings at each step and the corresponding operator actions) can be recorded. A robot policy can then be trained to replicate the recorded actions when facing similar inputs (e.g., object positions or force readings). In some examples, demonstrations can be collected using virtual reality controllers or exoskeleton suits, giving the robot examples of human-like dexterous maneuvers.

One drawback of the RL-based approaches for robot control is that RL-based robot control systems often need carefully designed rewards. Those rewards can be challenging to design when the robot handles multiple subtasks or interacts carefully with objects the robot touches or pushes. In such scenarios, an RL agent may struggle to discover effective actions unless the rewards provide specific guidance for each stage of the task.

One drawback of BC-based approaches for robot control is that training a robot control policy through BC typically requires access to extensive and high-quality demonstration data. Whenever the demonstrations fail to cover certain variations or edge cases, the learned robot control policy can become unreliable or unable to handle new situations, limiting the adaptability of the robot when new or slightly altered subtasks are introduced.

In addition, both RL and BC can require carefully designing rewards or collecting many example demonstrations to teach the robot what to perform. Accordingly, these approaches are oftentimes unsuitable for training robot control policies that control robots to perform long horizon robotic tasks that can include interactions with various object properties or intricate sequences of actions.

As the foregoing illustrates, what is needed in the art are more effective techniques for robot control.

According to some embodiments, a computer-implemented method for training one or more robot control models includes performing, based on one or more demonstration trajectories of a robot performing one or more skills associated with a task, one or more training operations to generate one or more first trained machine learning models for controlling the robot. The method further includes performing one or more reinforcement learning operations using the one or more first trained machine learning models to generate one or more second trained machine learning models for controlling the robot.

According to some embodiments, a computer-implemented method for training one or more robot control models includes scheduling a plurality of workers based on a sampling strategy for sampling workers to execute and a queue that stores indications of workers that require scheduling. The method further includes executing the plurality of workers based on the scheduling to generate a plurality of trained machine learning models for controlling a robot to perform a plurality of skills associated with a task.

Further embodiments provide, among other things, non-transitory computer-readable storage media storing instructions and systems configured to implement the method set forth above.

At least one technical advantage of the disclosed techniques relative to the prior art is that the disclosed techniques combine task-and-motion planning (TAMP) with behavior cloning and reinforcement learning (RL) in a synergistic framework that overcomes limitations of either approach alone. Unlike conventional RL techniques which require finely tuned, dense reward functions, the disclosed techniques restrict RL to predefined handoff sections determined by TAMP. The restriction simplifies reward design by allowing sparse, success-based rewards to be used. Another advantage of the disclosed techniques is that, rather than learning entire task behaviors end-to-end, the disclosed techniques use TAMP to handle routine skills included in the task, while reinforcement learning is used to fine-tune residual corrections for more challenging skills. The disclosed techniques also reduce the need for large, high-quality demonstration datasets by limiting the scope of behavior cloning to a subset of skills, where skills that are easier to model are delegated to TAMP. Yet another advantage of the disclosed techniques is that, by leveraging a scheduler that coordinates multiple TAMP workers and selectively allocates RL training opportunities to skills that are ready for training, the disclosed techniques permit scalable training for long-horizon tasks. These technical advantages provide one or more technological improvements over prior art approaches.
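As an illustration of the sparse, success-based reward described above, the following is a minimal sketch rather than the disclosed implementation. The function names, the displacement threshold, and the penalty weight are hypothetical; the structure (a reward of one on skill success and zero otherwise, optionally reduced by penalty terms for large movements and collisions) follows the reward types recited in the claims.

```python
def sparse_reward(skill_succeeded: bool) -> float:
    """Return 1.0 when the skill completed successfully and 0.0 otherwise."""
    return 1.0 if skill_succeeded else 0.0


def penalized_reward(
    skill_succeeded: bool,
    step_displacement: float,
    collided: bool,
    max_displacement: float = 0.05,  # hypothetical movement threshold
    penalty: float = 0.1,            # hypothetical penalty weight
) -> float:
    """Sparse success reward minus penalties for large movements and collisions."""
    reward = sparse_reward(skill_succeeded)
    if step_displacement > max_displacement:
        reward -= penalty
    if collided:
        reward -= penalty
    return reward
```

Because reinforcement learning is confined to individual handoff sections, a per-skill success signal of this kind can stand in for a hand-tuned dense reward over the whole task.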

In the following description, numerous specific details are set forth to provide a more thorough understanding of the various embodiments. However, it will be apparent to one skilled in the art that the concepts can be practiced without one or more of these specific details.

Embodiments of the present disclosure provide techniques for controlling robots using a task and motion planner (TAMP) and robot control models that are trained using behavior cloning and reinforcement learning. The robot control models are machine learning models, such as neural networks, that process robot states and generate robot actions to perform at least part of a robotic skill. Each robot control model includes a policy model that generates robot actions and a residual model that generates modifications to the robot actions output by the policy model.

In some embodiments, TAMP is used to generate demonstration data based on user inputs. Given a robotic task and a demonstration skillset, TAMP divides the task into various skills and checks whether the current skill is in the demonstration skillset, which includes the set of skills for which one or more robot control models need to be trained. Whenever the skill is not in the demonstration skillset, TAMP causes the robot to perform the skill until reaching a handoff section, which implies that the next skill is in the demonstration skillset. Then, the robot receives one or more user inputs, which cause the robot to perform the skill in the demonstration skillset. A trajectory recorder records the trajectory of the robot, which is stored in the demonstration data. The foregoing process continues until the task is complete.

In some embodiments, a model trainer uses the demonstration data to train the policy models included in the robot control models. For each skill in the demonstration skillset, the model trainer uses a trajectory for that skill included in the demonstration data to train a policy model. In various embodiments, the robot control model generates robot actions which are applied within a simulator. The simulator generates the next robot state and roll-out data based on the robot actions. A loss calculator calculates a behavior cloning loss based on the roll-out data and the demonstration data, which includes robot states and robot actions, and the trajectory. The model trainer then iteratively updates the parameters of the policy model based on the calculated losses until one or more stopping criteria are met. The model trainer then trains another policy model for another skill until training for all skills in the demonstration skillset has completed.

In various embodiments, the model trainer uses the robot control model with the trained policy models to train residual models using reinforcement learning. During the reinforcement learning, the robot control model generates robot actions which are applied to the simulator. The simulator generates the next robot states and the roll-out data based on the robot actions. The loss calculator uses a reinforcement learning reward calculator to calculate a reinforcement learning reward based on robot actions, robot state, and robot actions generated using the trained policy model. The model trainer then uses a reinforcement learning module to iteratively update the parameters of the residual model based on the calculated rewards until one or more stopping criteria are met. The reinforcement learning module can use a reward that includes a Kullback-Leibler (KL) divergence term that limits deviations of robot actions generated by the robot control model from the robot actions generated by the previously trained policy model. Once the residual models for all skills in the demonstration skillset are trained, the robot control models, which each include a trained policy model that generates robot actions and a trained residual model that generates modifications to the robot actions, can be used along with TAMP to process sensor data and a task, and generate actions to cause a robot to perform at least part of the task that includes multiple skills.
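The relationship between the policy model, the residual model, and the KL divergence term can be sketched as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: it assumes both action distributions are diagonal Gaussians with the same fixed standard deviation, in which case the KL divergence reduces to a scaled squared distance between the means. The function names, `std`, and `beta` are hypothetical.

```python
from typing import List, Sequence


def combine_actions(base_action: Sequence[float], delta_action: Sequence[float]) -> List[float]:
    """Apply the residual model's delta to the policy model's base action."""
    return [b + d for b, d in zip(base_action, delta_action)]


def gaussian_kl_same_std(mean_p: Sequence[float], mean_q: Sequence[float], std: float) -> float:
    """KL(N(mean_p, std^2 I) || N(mean_q, std^2 I)) for a shared, fixed std (assumption)."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(mean_p, mean_q))
    return squared_distance / (2.0 * std ** 2)


def kl_regularized_reward(
    task_reward: float,
    base_action: Sequence[float],
    delta_action: Sequence[float],
    std: float = 0.1,    # hypothetical action noise scale
    beta: float = 0.01,  # hypothetical weight of the KL term
) -> float:
    """Task reward minus a penalty for deviating from the trained policy model."""
    combined = combine_actions(base_action, delta_action)
    kl = gaussian_kl_same_std(combined, base_action, std)
    return task_reward - beta * kl
```

With a zero delta the combined action equals the base action and the penalty vanishes, so the residual model starts from the behavior-cloned policy and is only rewarded for departing from it when the task reward outweighs the KL cost.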

In some embodiments, the model trainer uses a scheduler to schedule training of robot control models during the reinforcement learning. In various embodiments, the scheduler receives a sampling strategy, one or more workers, and a status queue. Each worker includes a TAMP environment that performs skills not in the demonstration skillset by default and reports a section request to the status queue whenever TAMP reaches a handoff section. Each status queue element includes a worker identifier (ID) and a section ID. To begin with, the scheduler pops the status queue. The scheduler then checks whether a section from the status queue is acceptable based on the sampling strategy. Whenever the section is determined not to be acceptable, the scheduler resets the worker, thereby skipping reinforcement learning for that particular section. Otherwise, the scheduler interacts with the model trainer, which performs reinforcement learning to train the residual model for the section. The reinforcement learning continues until the worker indicates to the scheduler the completion of the skill, at which point the worker uses TAMP to perform a next skill, if any, until a handoff section is reached and the worker reports another section request to the status queue. The scheduler also checks whether the status queue is empty. Whenever the scheduler determines that the status queue is not empty, the scheduler pops the status queue again and repeats the process. Whenever the scheduler determines that the status queue is empty, the model trainer stores the trained residual models.
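The scheduling loop described above can be sketched with a plain queue. The sketch below is a simplified illustration, not the disclosed scheduler: the `Worker` class, `run_scheduler`, and the `accept` and `train_section` callbacks are hypothetical names, each worker is modeled as a fixed list of handoff sections, and resetting a worker is simplified to ending that worker's episode.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Optional, Tuple

Request = Tuple[int, str]  # (worker ID, section ID)


@dataclass
class Worker:
    """Stand-in for a TAMP environment that reports handoff sections in order."""
    worker_id: int
    sections: List[str]
    cursor: int = 0

    def next_request(self) -> Optional[Request]:
        """Run TAMP until the next handoff section, if any, and report it."""
        if self.cursor >= len(self.sections):
            return None
        request = (self.worker_id, self.sections[self.cursor])
        self.cursor += 1
        return request

    def reset(self) -> None:
        """Simplification: a reset ends this worker's episode."""
        self.cursor = len(self.sections)


def run_scheduler(
    workers: Dict[int, Worker],
    accept: Callable[[str], bool],              # sampling strategy (assumption)
    train_section: Callable[[int, str], None],  # hands the section to the model trainer
) -> Tuple[List[Request], List[Request]]:
    status_queue: Deque[Request] = deque()
    for worker in workers.values():
        request = worker.next_request()
        if request is not None:
            status_queue.append(request)

    trained: List[Request] = []
    skipped: List[Request] = []
    while status_queue:  # stop once the status queue is empty
        worker_id, section_id = status_queue.popleft()
        worker = workers[worker_id]
        if not accept(section_id):
            skipped.append((worker_id, section_id))
            worker.reset()  # skip reinforcement learning for this section
            continue
        train_section(worker_id, section_id)
        trained.append((worker_id, section_id))
        request = worker.next_request()  # TAMP continues to the next handoff
        if request is not None:
            status_queue.append(request)
    return trained, skipped
```

For example, with a strategy that accepts only an "insert" section, a worker that first reaches a "grasp" section is reset, while a worker that reaches "insert" is passed to the trainer.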

The robot control techniques of the present disclosure have many real-world applications. For example, the robot control techniques could be used to control a physical robot in a real-world environment or a simulated robot in a virtual environment. As another example, the robot control techniques could be used to control a robot to perform a task which requires multiple skills.

The above examples are not in any way intended to be limiting. As persons skilled in the art will appreciate, as a general matter, the robot control techniques described herein can be implemented in any suitable application.

FIG. 1 illustrates a block diagram of a computer-based system 100 configured to implement one or more aspects of at least one embodiment. As shown, system 100 includes a machine learning server 110, a data store 120, and a computing device 140 in communication over a network 130, which can be a wide area network (WAN) such as the Internet, a local area network (LAN), a cellular network, and/or any other suitable network. Machine learning server 110 includes, without limitation, processor(s) 112 and a memory 114. Memory 114 includes, without limitation, a trajectory recorder 115, a model trainer 116, a simulator 117, a loss calculator 118, and a scheduler 119. Data store 120 stores, without limitation, one or more robot control models 121, a task and motion planner (TAMP) 122, and demonstration data 123. Computing device 140 includes, without limitation, processor(s) 142 and a memory 144. Memory 144 includes, without limitation, a robot control application 146.

Processor(s) 112 receive user input from input devices, such as a keyboard or a mouse. Processor(s) 112 may include one or more primary processors of machine learning server 110, controlling and coordinating operations of other system components. In particular, processor(s) 112 can issue commands that control the operation of one or more graphics processing units (GPUs) (not shown) and/or other parallel processing circuitry (e.g., parallel processing units, deep learning accelerators, etc.) that incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry. The GPU(s) can deliver pixels to a display device that can be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, and/or the like.

System memory 114 of machine learning server 110 stores content, such as software applications and data, for use by processor(s) 112 and the GPU(s) and/or other processing units. System memory 114 can be any type of memory capable of storing data and software applications, such as a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash ROM), or any suitable combination of the foregoing. In some embodiments, a storage (not shown) can supplement or replace the system memory 114. The storage can include any number and type of external memories that are accessible to processor 112 and/or the GPU. For example, and without limitation, the storage can include a Secure Digital Card, an external Flash memory, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, and/or any suitable combination of the foregoing.

Machine learning server 110 shown herein is for illustrative purposes only, and variations and modifications are possible without departing from the scope of the present disclosure. For example, the number of processors 112, the number of GPUs and/or other processing unit types, the number of system memories 114, and/or the number of applications included in system memory 114 can be modified as desired. Further, the connection topology between the various units in FIG. 1 can be modified as desired. In some embodiments, any combination of processor(s) 112, system memory 114, and/or GPU(s) can be included in and/or replaced with any type of virtual computing system, distributed computing system, and/or cloud computing environment, such as a public, private, or a hybrid cloud system.

As shown, trajectory recorder 115 executes on one or more processors 112 of machine learning server 110 and is stored in system memory 114 of machine learning server 110. In various embodiments, trajectory recorder 115 is an application that records one or more trajectories of robot 160 based on one or more user inputs received from one or more I/O devices (not shown) to generate demonstration data 123. Demonstration data 123, which can be stored in data store 120 or elsewhere (e.g., in memory 114), includes trajectories (e.g., time-ordered sequences of robot end-effector positions, velocities, and accelerations) and related information describing how a robot performs at least part of a task. Trajectory recorder 115 is described in greater detail below in conjunction with FIGS. 3 and 7.
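One possible in-memory layout for recorded demonstrations is sketched below. The `DemoRecorder` class, its method names, and the choice to key trajectories by skill name are assumptions made for illustration; the text above specifies only that demonstration data holds time-ordered trajectories and related information.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple

# One recorded time step: (robot state, operator action).
Step = Tuple[Tuple[float, ...], Tuple[float, ...]]


@dataclass
class DemoRecorder:
    """Collects time-ordered state-action steps and groups them by skill."""
    demonstration_data: Dict[str, List[List[Step]]] = field(default_factory=dict)
    _current: List[Step] = field(default_factory=list)

    def record_step(self, state: Sequence[float], action: Sequence[float]) -> None:
        """Append one time step to the trajectory currently being recorded."""
        self._current.append((tuple(state), tuple(action)))

    def finish_trajectory(self, skill: str) -> None:
        """Store the current trajectory under its skill and start a new one."""
        self.demonstration_data.setdefault(skill, []).append(self._current)
        self._current = []
```

Grouping by skill makes it straightforward to later select the trajectories for one skill when training that skill's policy model.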

As shown, simulator 117 is an application that executes on one or more processors 112 of machine learning server 110 and is stored in system memory 114 of machine learning server 110. In various embodiments, simulator 117 is an application that processes robot actions generated by robot control models 121 and generates the next robot states and roll-out data.

As shown, loss calculator 118 is an application that executes on one or more processors 112 of machine learning server 110 and is stored in system memory 114 of machine learning server 110. In various embodiments, loss calculator 118 is an application that calculates a behavior cloning loss based on demonstration data 123 and the roll-out data from simulator 117. In some embodiments, loss calculator 118 generates a reinforcement learning reward based on roll-out data using simulator 117.
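A behavior cloning loss of the kind computed by the loss calculator can be sketched as a difference between the actions a policy produces and the actions recorded in a demonstration. The mean squared error used below is an assumption chosen for illustration; the text states only that the loss is based on roll-out data and demonstration data, and the function name is hypothetical.

```python
from typing import Sequence


def behavior_cloning_loss(
    predicted_actions: Sequence[Sequence[float]],
    demonstrated_actions: Sequence[Sequence[float]],
) -> float:
    """Mean squared difference between predicted and demonstrated actions."""
    if len(predicted_actions) != len(demonstrated_actions):
        raise ValueError("action sequences must have the same number of steps")
    total = 0.0
    count = 0
    for predicted, demonstrated in zip(predicted_actions, demonstrated_actions):
        if len(predicted) != len(demonstrated):
            raise ValueError("actions must have the same dimensionality")
        for p, d in zip(predicted, demonstrated):
            total += (p - d) ** 2
            count += 1
    if count == 0:
        raise ValueError("at least one action component is required")
    return total / count
```

The loss is zero when the policy reproduces the demonstrated actions exactly and grows as the predicted actions drift from them, which is the signal used to update the policy model's parameters.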

As shown, scheduler 119 is an application that executes on one or more processors 112 of machine learning server 110 and is stored in system memory 114 of machine learning server 110. In various embodiments, scheduler 119 is an application that interacts with model trainer 116 and robot control models 121 to schedule reinforcement learning to train robot control models 121 using simulator 117. Scheduler 119 is described in greater detail below in conjunction with FIGS. 4B and 9.

As shown, TAMP 122 is an application that is stored in data store 120. Although shown as being stored in data store 120 in FIG. 1, TAMP 122 can be stored in memory 114 during the training of robot control models 121 or can be stored in memory 144 during inference. In various embodiments, TAMP 122 receives a task and a demonstration skillset via one or more I/O devices. The demonstration skillset includes one or more skills that have to be performed to complete the task. One or more skills included in the demonstration skillset are performed using at least one of user inputs or one or more trained robot control models 121. TAMP 122 generates robot actions to perform a skill that is not part of the demonstration skillset. Once TAMP 122 determines a handoff section has been reached based on robot states corresponding to a skill in the demonstration skillset, TAMP 122 defers robot action generation to user inputs or to an appropriate trained robot control model 121.
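The division of labor between TAMP and the learned models can be sketched as a per-skill dispatch. This is a deliberately simplified illustration: the function name, the skill names, and the "tamp" and "learned" labels are hypothetical, and a real handoff would be detected from robot states rather than from a precomputed skill list.

```python
from typing import Iterable, List, Set, Tuple


def plan_controllers(
    task_skills: Iterable[str],
    demonstration_skillset: Set[str],
) -> List[Tuple[str, str]]:
    """Label each skill with the component that generates its robot actions."""
    plan = []
    for skill in task_skills:
        # Skills in the demonstration skillset are handed off to user inputs
        # (during data collection) or a trained robot control model (afterward).
        controller = "learned" if skill in demonstration_skillset else "tamp"
        plan.append((skill, controller))
    return plan
```

For a task split into reach, grasp, transport, and insert skills with grasp and insert in the demonstration skillset, TAMP would generate the reach and transport motions and hand off control at the start of the grasp and insert skills.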

As shown, model trainer 116 is an application that executes on one or more processors 112 of machine learning server 110 and is stored in a system memory 114 of machine learning server 110. Although shown as distinct from loss calculator 118 for illustrative purposes, in some embodiments, functionality of the loss calculator 118 and the model trainer 116 can be combined into a single application.

In some embodiments, model trainer 116 is configured to train one or more machine learning models, including robot control models 121 (referred to herein collectively as robot control models 121 and individually as a robot control model 121). Robot control models 121 are machine learning models, such as neural networks, which are trained to generate actions for a robot (e.g., robot 160) to perform at least part of a task based on one or more observations acquired via one or more sensors 180₁ (referred to herein collectively as sensors 180 and individually as a sensor 180), as discussed in greater detail below in conjunction with FIGS. 5 and 11. For example, in at least one embodiment, sensors 180 can include one or more cameras, one or more RGB-D cameras (e.g., cameras using time-of-flight sensors), such as a wrist-mounted RGB-D camera, one or more LiDAR sensors, any combination thereof, etc. Techniques for training robot control models 121, based on demonstration data 123 and using reinforcement learning are discussed in greater detail herein in conjunction with at least FIGS. 4A and 6-10. Robot control models 121 can be stored in data store 120. Although shown as being stored in data store 120 in FIG. 1, robot control models 121 can be stored in memory 114 during training or can be stored in memory 144 during inference. In some embodiments, the same computing device(s) can be used for training and inference after training, rather than the separate machine learning server 110 and computing device 140. In some embodiments, data store 120 can include any storage device or devices, such as fixed disc drive(s), flash drive(s), optical storage, network attached storage (NAS), and/or a storage area-network (SAN). Although shown as accessible over network 130, in at least one embodiment machine learning server 110 can include data store 120.

As shown, a robot control application 146 that uses robot control models 121 and TAMP 122 is stored in data store 120 accessed over network 130, and executes on processor(s) 142 of computing device 140. Once trained, trained robot control models 121 can be deployed, such as via robot control application 146, to control a physical robot in a real-world environment, such as robot 160, to perform one or more skills as a part of a task. In various embodiments, trained robot control models 121 are deployed for use with virtual environments, such as in a simulator (not shown), where a virtual model of robot 160 is simulated within a virtual environment, such as a digital twin or a simulation platform. In the virtual deployment, robot control application 146 interfaces with a virtual representation of robot 160, which can enable testing, validation, and refinement of robot plans. Memory 144 and the processor(s) 142 can be similar to memory 114 and processor(s) 112 of machine learning server 110, described above. Robot control application 146 is discussed in greater detail below in conjunction with FIG. 5.

As shown, robot 160 includes multiple links 161, 163, and 165 that are rigid members, as well as joints 162, 164, and 166 that are movable components that can be actuated to cause relative motion between adjacent links. In addition, robot 160 includes multiple fingers 168_1 (referred to herein collectively as fingers 168 and individually as a finger 168) that can be controlled to grasp an object. For example, in at least one embodiment, robot 160 can include a locked wrist and multiple (e.g., four) fingers. Although an example robot 160 is shown for illustrative purposes, in at least one embodiment, techniques disclosed herein can be applied to control any suitable robot.

FIG. 2A is a block diagram illustrating machine learning server 110 of FIG. 1 in greater detail, according to various embodiments. Machine learning server 110 may include any type of computing system, including, without limitation, a server machine, a server platform, a desktop machine, a laptop machine, a hand-held/mobile device, a digital kiosk, an in-vehicle infotainment system, and/or a wearable device. In some embodiments, machine learning server 110 is a server machine operating in a data center or a cloud computing environment that provides scalable computing resources as a service over a network.

In various embodiments, machine learning server 110 includes, without limitation, processor(s) 112 and memory(ies) 114 coupled to a parallel processing subsystem 212 via a memory bridge 205 and a communication path 213. Memory bridge 205 is further coupled to an I/O (input/output) bridge 207 via a communication path 206, and I/O bridge 207 is, in turn, coupled to a switch 216.

In one embodiment, I/O bridge 207 is configured to receive user input information from optional input devices 208, such as a keyboard, mouse, touch screen, sensor data analysis (e.g., evaluating gestures, speech, or other information about one or more users in a field of view or sensory field of one or more sensors), and/or the like, and forward the input information to processor(s) 112 for processing. In some embodiments, machine learning server 110 may be a server machine in a cloud computing environment. In such embodiments, machine learning server 110 may not include input devices 208, but may receive equivalent input information by receiving commands (e.g., responsive to one or more inputs from a remote computing device) in the form of messages transmitted over a network and received via network adapter 218. In some embodiments, switch 216 is configured to provide connections between I/O bridge 207 and other components of machine learning server 110, such as a network adapter 218 and various add-in cards 220 and 221.

In some embodiments, I/O bridge 207 is coupled to a system disk 214 that may be configured to store content and applications and data for use by processor(s) 142 and parallel processing subsystem 212. In one embodiment, system disk 214 provides non-volatile storage for applications and data and may include fixed or removable hard disk drives, flash memory devices, and CD-ROM (compact disc read-only-memory), DVD-ROM (digital versatile disc-ROM), Blu-ray, HD-DVD (high-definition DVD), or other magnetic, optical, or solid state storage devices. In various embodiments, other components, such as universal serial bus or other port connections, compact disc drives, digital versatile disc drives, film recording devices, and the like, may be connected to I/O bridge 207 as well.

In various embodiments, memory bridge 205 may be a Northbridge chip, and I/O bridge 207 may be a Southbridge chip. In addition, communication paths 206 and 213, as well as other communication paths within machine learning server 110, may be implemented using any technically suitable protocols, including, without limitation, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol known in the art.

In some embodiments, parallel processing subsystem 212 comprises a graphics subsystem that delivers pixels to an optional display device 210 that may be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, and/or the like. In such embodiments, parallel processing subsystem 212 may incorporate circuitry optimized for graphics and video processing, including, for example, video output circuitry. Such circuitry may be incorporated across one or more parallel processing units (PPUs), also referred to herein as parallel processors, included within parallel processing subsystem 212.

In some embodiments, parallel processing subsystem 212 incorporates circuitry optimized (e.g., that undergoes optimization) for general purpose and/or compute processing. Again, such circuitry may be incorporated across one or more PPUs included within parallel processing subsystem 212 that are configured to perform such general purpose and/or compute operations. In yet other embodiments, the one or more PPUs included within parallel processing subsystem 212 may be configured to perform graphics processing, general purpose processing, and/or compute processing operations. System memory 114 includes at least one device driver configured to manage the processing operations of the one or more PPUs within parallel processing subsystem 212. In addition, system memory 114 includes, without limitation, trajectory recorder 115, model trainer 116, simulator 117, loss calculator 118, and scheduler 119. Although described herein primarily with respect to trajectory recorder 115, model trainer 116, simulator 117, loss calculator 118, and scheduler 119, techniques disclosed herein can also be implemented, either entirely or in part, in other software and/or hardware, such as in parallel processing subsystem 212.

In various embodiments, parallel processing subsystem 212 may be integrated with one or more of the other elements of FIG. 2A to form a single system. For example, parallel processing subsystem 212 may be integrated with processor 142 and other connection circuitry on a single chip to form a system on a chip (SoC).

In some embodiments, processor(s) 112 include the primary processor of machine learning server 110, controlling and coordinating operations of other system components. In some embodiments, processor(s) 112 issue commands that control the operation of PPUs. In some embodiments, communication path 213 is a PCI Express link, in which dedicated lanes are allocated to each PPU. Other communication paths may also be used. The PPU advantageously implements a highly parallel processing architecture, and the PPU may be provided with any amount of local parallel processing memory (PP memory).

It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of processor(s) 112, and the number of parallel processing subsystems 212, may be modified as desired. For example, in some embodiments, system memory 114 could be connected to the processor(s) 112 directly rather than through memory bridge 205, and other devices may communicate with system memory 114 via memory bridge 205 and processor 112. In other embodiments, parallel processing subsystem 212 may be connected to I/O bridge 207 or directly to processor 112, rather than to memory bridge 205. In still other embodiments, I/O bridge 207 and memory bridge 205 may be integrated into a single chip instead of existing as one or more discrete devices. In certain embodiments, one or more components shown in FIG. 2A may not be present. For example, switch 216 could be eliminated, and network adapter 218 and add-in cards 220, 221 would connect directly to I/O bridge 207. Lastly, in certain embodiments, one or more components shown in FIG. 2A may be implemented as virtualized resources in a virtual computing environment, such as a cloud computing environment. In particular, the parallel processing subsystem 212 may be implemented as a virtualized parallel processing subsystem in at least one embodiment. For example, the parallel processing subsystem 212 may be implemented as a virtual graphics processing unit(s) (vGPU(s)) that renders graphics on a virtual machine(s) (VM(s)) executing on a server machine(s) whose GPU(s) and other physical resources are shared across one or more VMs.

FIG. 2B is a block diagram illustrating computing device 140 of FIG. 1 in greater detail, according to various embodiments. Computing device 140 may include any type of computing system, including, without limitation, a server machine, a server platform, a desktop machine, a laptop machine, a hand-held/mobile device, a digital kiosk, an in-vehicle infotainment system, and/or a wearable device. In some embodiments, computing device 140 is a server machine operating in a data center or a cloud computing environment that provides scalable computing resources as a service over a network. In some embodiments, machine learning server 110 can include one or more similar components as computing device 140.

In various embodiments, computing device 140 includes, without limitation, processor(s) 142 and memory(ies) 144 coupled to a parallel processing subsystem 262 via a memory bridge 255 and a communication path 263. Memory bridge 255 is further coupled to an I/O (input/output) bridge 257 via a communication path 256, and I/O bridge 257 is, in turn, coupled to a switch 266.

In one embodiment, I/O bridge 257 is configured to receive user input information from optional input devices 258, such as a keyboard, mouse, touch screen, sensor data analysis (e.g., evaluating gestures, speech, or other information about one or more users in a field of view or sensory field of one or more sensors), and/or the like, and forward the input information to processor(s) 142 for processing. In some embodiments, computing device 140 may be a server machine in a cloud computing environment. In such embodiments, computing device 140 may not include input devices 258, but may receive equivalent input information by receiving commands (e.g., responsive to one or more inputs from a remote computing device) in the form of messages transmitted over a network and received via network adapter 268. In some embodiments, switch 266 is configured to provide connections between I/O bridge 257 and other components of computing device 140, such as a network adapter 268 and various add-in cards 270 and 271.

In some embodiments, I/O bridge 257 is coupled to a system disk 264 that may be configured to store content and applications and data for use by processor(s) 142 and parallel processing subsystem 262. In one embodiment, system disk 264 provides non-volatile storage for applications and data and may include fixed or removable hard disk drives, flash memory devices, and CD-ROM (compact disc read-only-memory), DVD-ROM (digital versatile disc-ROM), Blu-ray, HD-DVD (high-definition DVD), or other magnetic, optical, or solid state storage devices. In various embodiments, other components, such as universal serial bus or other port connections, compact disc drives, digital versatile disc drives, film recording devices, and the like, may be connected to I/O bridge 257 as well.

In various embodiments, memory bridge 255 may be a Northbridge chip, and I/O bridge 257 may be a Southbridge chip. In addition, communication paths 256 and 263, as well as other communication paths within computing device 140, may be implemented using any technically suitable protocols, including, without limitation, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol known in the art.

In some embodiments, parallel processing subsystem 262 comprises a graphics subsystem that delivers pixels to an optional display device 260 that may be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, and/or the like. In such embodiments, parallel processing subsystem 262 may incorporate circuitry optimized for graphics and video processing, including, for example, video output circuitry. Such circuitry may be incorporated across one or more parallel processing units (PPUs), also referred to herein as parallel processors, included within parallel processing subsystem 262.

In some embodiments, parallel processing subsystem 262 incorporates circuitry optimized (e.g., that undergoes optimization) for general purpose and/or compute processing. Again, such circuitry may be incorporated across one or more PPUs included within parallel processing subsystem 262 that are configured to perform such general purpose and/or compute operations. In yet other embodiments, the one or more PPUs included within parallel processing subsystem 262 may be configured to perform graphics processing, general purpose processing, and/or compute processing operations. System memory 144 includes at least one device driver configured to manage the processing operations of the one or more PPUs within parallel processing subsystem 262. In addition, system memory 144 includes robot control application 146. Although described herein primarily with respect to robot control application 146, techniques disclosed herein can also be implemented, either entirely or in part, in other software and/or hardware, such as in parallel processing subsystem 262.

In various embodiments, parallel processing subsystem 262 may be integrated with one or more of the other elements of FIG. 2B to form a single system. For example, parallel processing subsystem 262 may be integrated with processor 142 and other connection circuitry on a single chip to form a system on a chip (SoC).

In some embodiments, processor(s) 142 include the primary processor of computing device 140, controlling and coordinating operations of other system components. In some embodiments, processor(s) 142 issue commands that control the operation of PPUs. In some embodiments, communication path 263 is a PCI Express link, in which dedicated lanes are allocated to each PPU. Other communication paths may also be used. The PPU advantageously implements a highly parallel processing architecture, and the PPU may be provided with any amount of local parallel processing memory (PP memory).

It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of processor(s) 142, and the number of parallel processing subsystems 262, may be modified as desired. For example, in some embodiments, system memory 144 could be connected to processor(s) 142 directly rather than through memory bridge 255, and other devices may communicate with system memory 144 via memory bridge 255 and processor 142. In other embodiments, parallel processing subsystem 262 may be connected to I/O bridge 257 or directly to processor 142, rather than to memory bridge 255. In still other embodiments, I/O bridge 257 and memory bridge 255 may be integrated into a single chip instead of existing as one or more discrete devices. In certain embodiments, one or more components shown in FIG. 2B may not be present. For example, switch 266 could be eliminated, and network adapter 268 and add-in cards 270, 271 would connect directly to I/O bridge 257. Lastly, in certain embodiments, one or more components shown in FIG. 2B may be implemented as virtualized resources in a virtual computing environment, such as a cloud computing environment. In particular, parallel processing subsystem 262 may be implemented as a virtualized parallel processing subsystem in at least one embodiment. For example, parallel processing subsystem 262 may be implemented as a virtual graphics processing unit(s) (vGPU(s)) that renders graphics on a virtual machine(s) (VM(s)) executing on a server machine(s) whose GPU(s) and other physical resources are shared across one or more VMs.

FIG. 3 is a more detailed illustration of the trajectory recorder 115 of FIG. 1, according to various embodiments. As shown, TAMP 122 receives task 302 and demonstration skillset 304 and generates robot actions 305 based on robot states 306 to cause robot 160 to perform a skill which is not in demonstration skillset 304. Once TAMP 122 determines that a handoff section has been reached, based on robot states 306 corresponding to a skill in demonstration skillset 304, robot 160 receives user inputs 301 that cause robot 160 to perform a skill in demonstration skillset 304. Trajectory recorder 115 interacts with TAMP 122 and robot 160 and records trajectory 303, which is stored in demonstration data 123.

TAMP 122 is an application which processes robot states 306 and generates robot actions 305. In various embodiments, TAMP 122 receives a task 302 and demonstration skillset 304. TAMP 122 processes robot states 306, task 302, and a demonstration skillset and generates robot actions 305 to cause robot 160 to perform a skill which is not in demonstration skillset 304 until reaching a handoff section. For example, TAMP 122 could generate actions 305 causing robot 160 to move an arm from the resting position to a coffee machine or to retrieve a coffee cup from a known location. TAMP 122 determines that a handoff section has been reached based on robot states 306 corresponding to a skill which is not in demonstration skillset 304. For example, pouring hot water into a coffee filter or placing a capsule into a coffee machine, which are skills that require fine manipulation or force control, could be delegated to a trained robot control model 121 or a user via teleoperation. Upon determining a handoff section, TAMP 122 pauses generating robot actions 305 and triggers a transition to using at least one of a trained robot control model 121 or user inputs 301 to perform the skill which is not in demonstration skillset 304. After the skill is completed, TAMP 122 resumes generating robot actions 305 based on robot states 306 until reaching the next handoff section. TAMP 122 continues the process until task 302 is complete. In some embodiments, TAMP 122 includes a model-based approach for synthesizing long-horizon robot behavior. TAMP 122 integrates discrete (e.g., symbolic) planning with continuous (e.g., motion) planning to plan hybrid discrete-continuous robot actions 305. In some embodiments, TAMP 122 uses a model of robot actions that a planner can apply, and the robot actions 305 modify the current robot states 306. Using the model, TAMP 122 can search over the space of plans to find a sequence of robot actions 305 and the associated parameters that satisfies a skill.
In some embodiments, each task 302 includes a series of alternating TAMP sections and handoff sections, where TAMP 122 delegates generating robot actions 305 to a trained agent π. The sections are TAMP-gated (e.g., the sections are chosen at the discretion of TAMP 122) and typically include skills that are difficult to automate with model-based planning. In various embodiments, a TAMP-gated policy learning problem can be modeled as a series of Markov Decision Processes (MDPs),

$$\left(\mathcal{S}, \mathcal{A}, T, \{r_i(s), \rho_i^{0}\}_{i=1}^{N}, \gamma\right)$$

where N is the number of MDPs (each corresponding to a handoff section), $\mathcal{S}$ and $\mathcal{A}$ are the state and action space, T is the transition dynamics, $r_i(s)$ and $\rho_i^{0}$ are the i-th reward function and initial state distribution, and γ is the discount factor. The start and end of each handoff section is chosen by TAMP 122. That is, TAMP 122 determines the initial state distribution $\rho_i^{0}$ for each handoff section, and the reward function $r_i(s)$.

In various embodiments, demonstration skillset 304 includes one or more skills that are impractical to manually model. For example, skills such as gently stirring a cup without spilling or attaching a lid that requires precise alignment and force application could be included in demonstration skillset 304 due to fine-grained dynamics and sensitivity to small variations. During data generation, whenever TAMP 122 determines that a handoff section has been reached, robot 160 performs the skill in demonstration skillset 304 based on user inputs 301. In various embodiments, user inputs 301 include teleoperation commands provided by a human operator using various input devices, such as a joystick, a VR controller, a kinesthetic teaching interface, and/or the like.
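The alternation between TAMP sections and handoff sections described above can be illustrated with a minimal Python sketch. The `Section` class, the `run_task` function, and both controller callables are hypothetical names introduced only for illustration. The sketch also assumes the section boundaries are known in advance, whereas in the description TAMP 122 determines them from robot states 306.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Section:
    """One section of a task (hypothetical structure, not from the source)."""
    skill: str
    is_handoff: bool  # True when the skill is delegated away from the planner


def run_task(
    sections: List[Section],
    tamp_controller: Callable[[str], str],
    handoff_controller: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Dispatch each section to the planner or to the handoff controller.

    The handoff controller stands in for either teleoperation inputs
    (during data collection) or a trained agent (after training).
    """
    log = []
    for section in sections:
        controller = handoff_controller if section.is_handoff else tamp_controller
        log.append((section.skill, controller(section.skill)))
    return log


# Assumed example task: planner-driven motion alternating with delegated skills.
task = [
    Section("move_arm_to_machine", is_handoff=False),
    Section("insert_capsule", is_handoff=True),
    Section("retrieve_cup", is_handoff=False),
    Section("pour_water", is_handoff=True),
]
result = run_task(task, lambda skill: "planner", lambda skill: "agent")
```

In this sketch, swapping the `handoff_controller` argument is the only change needed to move from collecting demonstrations to running a trained agent.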

Trajectory recorder 115 is an application that records trajectory 303 (e.g., demonstration trajectory) of robot 160, which is generated when user inputs 301 cause robot 160 to perform a skill in the demonstration skillset 304. Trajectory recorder 115 then stores trajectory 303 in demonstration data 123. In various embodiments, demonstration data 123 can be represented as

$$\mathcal{D} = \left\{\left(s_0^{i}, a_0^{i}, s_1^{i}, a_1^{i}, \ldots, s_{H_i}^{i}, g_i\right)\right\}_{i}$$

where $s_t \in \mathcal{S}$, $a_t \in \mathcal{A}$, $H_i$ is the horizon, and $g_i$ is the handoff section of the i-th trajectory 303. In some examples, $\mathcal{A}$ is a 7-dimensional continuous action space that models 6-degree-of-freedom delta movement of the end-effector of robot 160 along with 1 dimension for finger control, and the policy π is modeled as a normal distribution with a scheduled standard deviation.

FIG. 4A illustrates how the model trainer 116 of FIG. 1 trains policy models 420, according to various embodiments. As shown, robot control model 121 includes, without limitation, a policy model 420 and a residual model 421. In operation, policy model 420 of robot control model 121 generates robot actions 402 based on robot states 401. Robot states 401 are applied to simulator 117, which generates the next robot states 401 and roll-out data 403. Loss calculator 118 uses a behavior cloning loss calculator 405 to calculate a behavior cloning loss 404 based on roll-out data 403 and demonstration data 123. Model trainer 116 updates the parameters of policy model 420 based on behavior cloning loss 404. In various embodiments, the training of robot control models 121 is carried out in two steps. In the first step, model trainer 116 iteratively trains one or more policy models 420 included in robot control models 121 using behavior cloning. The first step is described in conjunction with FIG. 4A. In the second step, model trainer 116 trains one or more residual models 421 included in robot control models 121 using reinforcement learning. The training of residual models 421 is described in greater detail in conjunction with FIG. 4B.

Robot control models 121 are machine learning models, such as neural networks, which process robot states 401 and generate robot actions 402. In some embodiments, each robot control model 121 is associated with a skill from demonstration skillset 304 and is configured to generate robot actions 402 that guide robot 160 in performing that skill. Although described herein primarily with respect to robot control models that are each associated with a single skill, in some embodiments, a robot control model can be trained to perform multiple skills. Robot control models 121 include, without limitation, policy models 420 and residual models 421. In some examples, each policy model 420 can be represented by a base policy $\pi_{\varphi}(s)$ which maps robot states 401 to robot actions 402 and is parameterized by parameters φ. Residual model 421 can also be represented by a residual policy

$$\pi^{\mathrm{res}}_{\theta}(s)$$

which maps robot states 401 to robot actions 402 and is parameterized by parameters θ. Robot control model 121 then maps robot states 401 to robot actions 402 based on the base policy and the residual policy. Accordingly, the residual policy generates a delta action that is a modification/correction to the base action generated by the base policy. For example, robot control model 121 can be represented by a policy

$$\pi_{\varphi,\theta}(s) = \pi_{\varphi}(s) + \pi^{\mathrm{res}}_{\theta}(s)$$

which is parameterized by both parameters φ, θ. In some examples, robot control models 121 are convolutional neural networks. In some embodiments, the residual policy shares the same action space as the base policy but is initialized close to zero. Although described herein primarily with respect to training separate base and residual policies, in some embodiments, a single policy can be trained using behavior cloning and re-trained using reinforcement learning in a manner similar to the training of the base and residual policies.
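The relationship between the base action and the delta action can be sketched in a few lines of Python. This is an illustrative sketch only: `compose_policy`, `base_policy`, `zero_residual`, and `small_residual` are hypothetical names, and the toy linear base policy stands in for a trained neural network.

```python
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float]], List[float]]


def compose_policy(base_policy: Policy, residual_policy: Policy) -> Policy:
    """Return a policy whose action is the base action plus a delta action."""
    def combined(state: Sequence[float]) -> List[float]:
        base_action = base_policy(state)
        delta_action = residual_policy(state)
        return [b + d for b, d in zip(base_action, delta_action)]
    return combined


def base_policy(state: Sequence[float]) -> List[float]:
    # Assumed toy base policy: a small step toward the origin per dimension.
    return [-0.1 * x for x in state]


def zero_residual(state: Sequence[float]) -> List[float]:
    # A residual initialized at zero leaves the base action unchanged.
    return [0.0] * len(state)


def small_residual(state: Sequence[float]) -> List[float]:
    # Assumed constant correction, standing in for a learned delta action.
    return [0.05] * len(state)


# 7-dimensional example: 6 end-effector deltas plus 1 finger dimension.
state = [1.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.5]
unchanged = compose_policy(base_policy, zero_residual)(state)
corrected = compose_policy(base_policy, small_residual)(state)
```

With a zero residual the combined policy reproduces the base policy, which is one way to read the remark that the residual policy is initialized close to zero.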

Simulator 117 processes robot actions 402 and generates robot states 401 and roll-out data 403. In various embodiments, simulator 117 includes a robot model that represents the kinematics, dynamics, geometry, and actuation properties of robot 160. The robot model permits simulator 117 to simulate the physical behavior of robot 160 in response to robot actions 402, including joint movements, end-effector movements, and interactions with objects in the environment. In some embodiments, simulator 117 also models external factors such as gravity, collisions, contact forces, sensor noise, and/or the like, permitting realistic simulation of various skills. Simulator 117 generates robot states 401 which reflect updated observations, such as joint angles, gripper positions, camera images, force readings, and/or the like. Roll-out data 403 includes sequences of state-action pairs, which are used for training and evaluating robot control models 121.

Loss calculator 118 calculates behavior cloning loss 404 based on roll-out data 403 and demonstration data 123. As shown, loss calculator 118 includes, without limitation, a behavior cloning loss calculator 405. In some embodiments, loss calculator 118 uses behavior cloning loss calculator 405 to calculate behavior cloning loss 404. In various embodiments, behavior cloning loss 404 is calculated as the negative log-likelihood of the robot actions included in demonstration data 123 given the observed robot states 401 under the policy model 420 $\pi_{\varphi}$. In some examples, for each state-action pair (s, a) in the demonstration data 123, behavior cloning loss 404 is calculated as:

$$\mathcal{L}_{\mathrm{BC}}(\varphi) = -\log \pi_{\varphi}(a \mid s) \tag{1}$$

where $\pi_{\varphi}(a \mid s)$ is the probability that the base policy assigns to action a given state s. Equation 1 measures how well the base policy $\pi_{\varphi}$ aligns with the demonstration trajectory included in demonstration data 123.

Model trainer 116 updates the parameters of policy models 420 based on behavior cloning loss 404. In various embodiments, model trainer 116 uses various optimization techniques such as stochastic gradient descent (SGD), adaptive moment estimation (Adam), and/or the like, to minimize behavior cloning loss 404 and adjust the parameters φ of policy models 420 accordingly. In some examples, model trainer 116 solves the following optimization problem:

$$\varphi^{*} = \arg\min_{\varphi} \; \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[-\log \pi_{\varphi}(a \mid s)\right]$$

For each training epoch, model trainer 116 computes the gradient of behavior cloning loss 404 with respect to φ. The parameters are then updated in the direction that reduces behavior cloning loss 404. Model trainer 116 updates the parameters of policy model 420 over multiple training epochs until one or more stopping criteria are met, such as the behavior cloning loss 404 converging, reaching a predefined number of training epochs, and/or the like.
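The behavior cloning procedure can be made concrete with a small, self-contained Python sketch. It assumes a one-dimensional Gaussian policy with a linear mean and a fixed standard deviation, trained with plain gradient descent. The functions `nll` and `train_bc`, the toy demonstrations, and all hyperparameter values are illustrative assumptions rather than details taken from the description.

```python
import math


def nll(w: float, b: float, sigma: float, s: float, a: float) -> float:
    """Negative log-likelihood of action a under N(w * s + b, sigma^2)."""
    mu = w * s + b
    return 0.5 * ((a - mu) / sigma) ** 2 + math.log(sigma) + 0.5 * math.log(2 * math.pi)


def train_bc(demos, sigma=0.1, lr=0.001, max_epochs=2000, tol=1e-9):
    """Minimize the mean negative log-likelihood over (state, action) pairs."""
    w, b = 0.0, 0.0
    prev_loss = float("inf")
    loss = prev_loss
    n = len(demos)
    for _ in range(max_epochs):
        loss = grad_w = grad_b = 0.0
        for s, a in demos:
            err = a - (w * s + b)
            loss += nll(w, b, sigma, s, a)
            grad_w += -err * s / sigma ** 2  # d(nll)/dw
            grad_b += -err / sigma ** 2      # d(nll)/db
        loss /= n
        # Step against the gradient to reduce the loss.
        w -= lr * grad_w / n
        b -= lr * grad_b / n
        # Stopping criterion: the loss has stopped changing.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return w, b, loss


# Assumed toy demonstrations following a = 2 * s + 0.5.
demos = [(s, 2.0 * s + 0.5) for s in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b, final_loss = train_bc(demos)
```

The loop stops either when the loss change falls below `tol` or when `max_epochs` is reached, mirroring the two stopping criteria named above.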

FIG. 4B illustrates how the model trainer 116 of FIG. 1 trains residual models 421 using a scheduler 119, according to various embodiments. As shown, robot control model 121 uses the trained policy model 420 and the untrained residual model 421 to process robot states 401 and generate robot actions 402. Simulator 117 processes robot actions 402 and generates the next robot states 401 and roll-out data 407. Loss calculator 118 uses reinforcement learning reward calculator 411 to process roll-out data 407 and generate reinforcement learning reward 408. Model trainer 116 uses reinforcement learning module 410 to update the parameters of residual model 421. Scheduler 119 interacts with model trainer 116 and robot control models 121 to schedule training of residual models 421 during the reinforcement learning.

Robot control models 121 process robot states 401 and generate robot actions 402. As shown, robot control models 121 include, without limitation, the trained policy models 420 and residual models 421. In various embodiments, robot control model 121 generates robot actions 402 based on a policy

$$\pi_{\theta}(s) = \pi_{\varphi^{*}}(s) + \pi^{\mathrm{res}}_{\theta}(s)$$

where $\pi_{\varphi^{*}}$ is the trained base policy, $\pi^{\mathrm{res}}_{\theta}$ is the residual policy, and θ are the parameters of residual model 421 to be trained. In some embodiments, only the mean of the trained base policy is added to the residual policy.

Simulator 117 processes robot actions 402 and generates the next robot states 401 and roll-out data 407. Simulator 117 generates robot states 401, which reflect updated observations, such as joint angles, gripper positions, camera images, force readings, and/or the like. Roll-out data 403 includes sequences of state-action pairs, which are used for training and evaluating robot control models 121.

Loss calculator 118 processes roll-out data 407 and generates a reinforcement learning reward 408. As shown, loss calculator 118 includes, without limitation, a reinforcement learning reward calculator 411. In various embodiments, reinforcement learning reward calculator 411 evaluates the outcome of each robot action 402 based on skill-specific success criteria and assigns a numerical reward accordingly. In some embodiments, reinforcement learning reward 408 is sparse, such as providing a reward of 1 only upon successful completion of a skill (e.g., successfully placing a cup into a machine) and zero otherwise. In some embodiments, reinforcement learning reward 408 is dense, providing incremental rewards based on progress toward task goals (e.g., reducing positional error or maintaining alignment with a target object). In some embodiments, reinforcement learning reward 408 includes penalty terms, such as penalty terms for excessive movement, collisions, and/or the like.
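The sparse, dense, and penalized reward variants can be contrasted with a short Python sketch. The `StepOutcome` fields, the function names, and the coefficient values are assumptions chosen for illustration; the description does not specify how progress or penalties are measured.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Hypothetical summary of one simulated step for a skill."""
    success: bool
    prev_distance_to_goal: float
    distance_to_goal: float
    collided: bool
    action_norm: float


def sparse_reward(outcome: StepOutcome) -> float:
    """Reward of 1 only on successful completion, zero otherwise."""
    return 1.0 if outcome.success else 0.0


def dense_reward(
    outcome: StepOutcome,
    progress_scale: float = 1.0,
    collision_penalty: float = 0.5,
    movement_penalty: float = 0.01,
) -> float:
    """Incremental reward for progress, minus assumed penalty terms."""
    progress = outcome.prev_distance_to_goal - outcome.distance_to_goal
    reward = progress_scale * progress
    if outcome.collided:
        reward -= collision_penalty
    reward -= movement_penalty * outcome.action_norm
    return reward + sparse_reward(outcome)


# Assumed example: the gripper moves 0.1 closer to the goal without colliding.
step = StepOutcome(False, 0.5, 0.4, collided=False, action_norm=0.0)
bumped = StepOutcome(False, 0.5, 0.4, collided=True, action_norm=0.0)
```

Here `dense_reward(step)` is about 0.1 and `dense_reward(bumped)` is about -0.4, while `sparse_reward` returns zero for both because neither step completes the skill.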

Model trainer 116 uses reinforcement learning reward 408 to train residual models 421. As shown, model trainer 116 includes, without limitation, a reinforcement learning module 410. Reinforcement learning (RL) is a learning framework in which an agent (e.g., the fixed trained base policy plus the to-be-trained residual policy) interacts with an environment by selecting robot actions 402, observing the resulting robot states 401, and receiving reinforcement learning rewards 408 that reflect skill performance. The goal is to learn a policy that maximizes the expected cumulative reward over time. In some examples, the expected return under policy $\pi_{\theta}$ is defined as:

$$J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} \gamma^{t}\, r_i(s_t)\right] \tag{3}$$

where $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$ is a trajectory generated by following policy $\pi_{\theta}$, and $r_i(s_t)$ denotes the reinforcement learning reward 408 of skill i received at time step t. In some embodiments, due to the sparsity of reinforcement learning reward 408, the reinforcement learning objective in Equation 3 can exhibit high variance, which could cause the policy $\pi_{\theta}$ to drift significantly from the base policy $\pi_{\varphi^{*}}$ trained via behavior cloning. The drift can result in the loss of useful behavior learned from demonstration data 123 and reduce overall training stability. In some embodiments, to mitigate the issue, reinforcement learning module 410 uses a Kullback-Leibler (KL) divergence penalty between the policy $\pi_{\theta}$ and the base policy $\pi_{\varphi^{*}}$. The KL divergence penalty provides a soft constraint that keeps the output of the robot control model 121 close to the base policy throughout the RL fine-tuning process. In some examples, the final reinforcement learning objective used to guide RL training is described as

$$\max_{\theta} \; J(\pi_{\theta}) - \alpha\, D_{\mathrm{KL}}\left(\pi_{\theta} \,\|\, \pi_{\varphi^{*}}\right)$$

where J(π_θ) is the expected task reward obtained by following the policy π_θ, and D_KL(π_θ∥π_φ*) is the KL-divergence measuring how much the current policy deviates from the base policy. The KL term can be computed as:
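As a sketch, the expected return, the KL-regularized objective, and the KL term defined in this passage can be written in the following form. The reward symbol r_t^i, the state distribution ρ, the use of a maximization over θ, and the absence of a discount factor are assumptions chosen to match the surrounding definitions, not the filed notation.

```latex
% Expected return under the policy (assumed form of Equation 3)
J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} r_t^{i}\Big]

% KL-regularized objective with weighting factor \alpha (assumed form of Equation 4)
\max_{\theta}\; J(\pi_\theta) - \alpha\, D_{KL}\big(\pi_\theta \,\|\, \pi_{\varphi^*}\big)

% KL term as an expectation over states and actions of the current policy
D_{KL}\big(\pi_\theta \,\|\, \pi_{\varphi^*}\big)
  = \mathbb{E}_{s \sim \rho,\; a \sim \pi_\theta(\cdot \mid s)}
    \big[\log \pi_\theta(a \mid s) - \log \pi_{\varphi^*}(a \mid s)\big]
```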

The weighting factor α controls the strength of the KL divergence penalty term. In various embodiments, reinforcement learning module 410 applies any suitable RL algorithm, such as policy gradient, Q-learning, actor-critic techniques, and/or the like, to compute the gradient of the expected return with respect to θ and to update the residual model 421. In some embodiments, training continues until residual model 421 performance converges or a predefined stopping criterion is met, such as reaching a maximum number of training epochs. Any technically feasible RL technique, including known RL algorithms, can be used in some embodiments. Advantageously, the RL permits the residual model to explore different ways to perform a skill, which can result in better performing robot control models than if only behavior cloning were used. Further, use of a base policy that is trained using behavior cloning enables efficient RL training by guiding the exploration process, which can result in higher quality robot control models.
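To make the role of the weighting factor α concrete, the following Python sketch evaluates a KL-penalized objective. It assumes both the current policy and the base policy output diagonal Gaussian action distributions, for which the KL divergence has a closed form; the disclosure does not state which distribution family is used, and the function names are illustrative.

```python
import math
from typing import Sequence


def gaussian_kl(mu_p: Sequence[float], std_p: Sequence[float],
                mu_q: Sequence[float], std_q: Sequence[float]) -> float:
    """KL(p || q) for diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return total


def regularized_objective(expected_return: float, kl: float,
                          alpha: float = 0.1) -> float:
    # Task return minus the weighted KL penalty toward the base policy.
    return expected_return - alpha * kl
```

A larger α keeps the fine-tuned policy closer to the behavior-cloned base policy, while a smaller α leaves more room for exploration.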

Scheduler 119 interacts with model trainer 116 and robot control models 121 and schedules the training of residual models 421 using reinforcement learning. In various embodiments, scheduler 119 is implemented as a centralized control loop that coordinates a pool of TAMP workers, a shared status queue, and a sampling strategy. Each TAMP worker executes a TAMP planner in an environment instance. When a TAMP worker reaches a handoff section that requires reinforcement learning, the TAMP worker submits a tuple (i,j) to the status queue, where i identifies the worker and j identifies the section index. The worker then enters an idle state until the worker receives a command from scheduler 119. In some embodiments, the status queue is a first-in-first-out (FIFO) queue that tracks the availability of handoff sections from across all workers. In various embodiments, scheduler 119 continuously monitors the status queue. Upon retrieving an entry (i,j) from the status queue, scheduler 119 queries a strategy object, which provides the sampling strategy, to determine whether the section j is suitable for training. If the sampling strategy accepts the section, scheduler 119 initiates an RL episode with worker i. In some embodiments, the sampling strategy upsamples later sections so that later skills in a task, which may not be reached as frequently as earlier skills, are also learned. If the section is accepted, at each step t in the RL episode, scheduler 119 receives the current robot states 401 s_t by calling observe() on the worker. Scheduler 119 then receives robot actions 402 a_t ∼ π_θ(s_t) using the residual policy model 421 under training. The robot actions 402 are sent back to the worker, which advances the environment in simulator 117 by calling step(a_t). Scheduler 119 continues the process in a loop until the worker indicates the current section is done by returning done() = True.
In some embodiments, whether the current section is done depends on whether the skill corresponding to the current section has been completed in the worker, meaning the robot reached the goal condition for that handoff section (e.g., successfully placing a cup, inserting an object, or aligning with a fixture). In some embodiments, after the current section is solved, the worker sends the done() success notification to scheduler 119 and runs TAMP until reaching the next handoff section that requires further reinforcement learning. When the next handoff section is reached, the worker submits another tuple (i,j) to the status queue and then waits in the idle state until a command is received from scheduler 119 again. Whenever, on the other hand, the sampling strategy does not accept section j, scheduler 119 issues a reset command to worker i, prompting the worker to restart the TAMP until TAMP reaches a handoff section and submits a tuple (i,j) to the status queue. The rejection-and-reset mechanism prevents training on sections that could be too difficult for the current policy or not yet ready according to a curriculum logic. The execution flow of scheduler 119 can be expressed more formally as:

Algorithm 1: Scheduler
Procedure: Scheduler(Workers, StatusQueue, Policy, Strategy)
  while True do
    (i, j) ← StatusQueue.pop()
    if Strategy.accepts(j) then
      while not Workers[i].done() do
        s_obs ← Workers[i].observe()
        a ← Policy.act(s_obs)
        Workers[i].step(a)
    else
      Workers[i].reset()

In some embodiments, the strategy used by scheduler 119 includes a curriculum learning mechanism. For example, in a sequential strategy, scheduler 119 accepts section j whenever the average success rate over all previous sections from 0 to j−1 exceeds a predefined threshold τ. In some examples, the acceptance condition in a sequential strategy can be described as:
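The sequential acceptance rule described above, together with the permissive alternative, can be sketched in Python as follows. The class names, the `record` bookkeeping method, and the choice to always accept section 0 (which has no earlier sections to gate on) are assumptions for illustration, not details taken from the disclosure.

```python
from collections import defaultdict
from typing import Dict, List


class PermissiveStrategy:
    """Accepts every section unconditionally."""

    def accepts(self, j: int) -> bool:
        return True


class SequentialStrategy:
    """Accepts section j once sections 0..j-1 succeed often enough on average."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.outcomes: Dict[int, List[bool]] = defaultdict(list)

    def record(self, j: int, success: bool) -> None:
        # Hypothetical bookkeeping hook called after each RL episode.
        self.outcomes[j].append(success)

    def success_rate(self, j: int) -> float:
        results = self.outcomes.get(j, [])
        return sum(results) / len(results) if results else 0.0

    def accepts(self, j: int) -> bool:
        if j == 0:
            return True  # assumption: the first section is never gated
        mean_rate = sum(self.success_rate(k) for k in range(j)) / j
        return mean_rate > self.threshold
```

Either object can be passed wherever Algorithm 1 calls `Strategy.accepts(j)`.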

The curriculum permits the RL agent to train on simpler or earlier skills before progressing to more difficult skills, resulting in more stable and sample-efficient RL training. In some embodiments, scheduler 119 uses a permissive strategy which accepts all sections unconditionally, allowing scheduler 119 to optimize utilization of the workers but without any enforced learning progression. In some embodiments, scheduler 119 also improves the throughput of RL training of residual models 421 using parallelization. Whenever the TAMP planning time per worker is bounded by T seconds, each RL interaction step takes at least t seconds, and each handoff segment spans at least H steps, then scheduler 119 with n workers achieves a throughput of at least 1/t frames per second, provided n≥T/H. In contrast, a single-worker scheduler limited by sequential planning and interaction has a worst-case throughput of only H/(T+tH). When TAMP planning dominates interaction time, such as when T=k·tH for some constant k, scheduler 119 with n workers improves training speed by a factor of approximately k+1 compared to the single-worker baseline.
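The k+1 speedup follows directly from dividing the two throughput expressions: (1/t) / (H/(T+tH)) = (T+tH)/(tH) = T/(tH) + 1. The short Python sketch below checks this arithmetic; the function names and the numeric values used with it are illustrative assumptions.

```python
def single_worker_throughput(T: float, t: float, H: int) -> float:
    # One worker plans for T seconds, then interacts for H steps of t seconds each.
    return H / (T + t * H)


def parallel_throughput_bound(t: float) -> float:
    # With planning overlapped across workers, interaction time is the limit.
    return 1.0 / t


def speedup(T: float, t: float, H: int) -> float:
    return parallel_throughput_bound(t) / single_worker_throughput(T, t, H)
```

For example, with t = 0.05 s, H = 100 steps, and k = 3 (so T = 15 s), `speedup(15.0, 0.05, 100)` evaluates to 4, matching k+1.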

FIG. 5 is a more detailed illustration of the robot control application 146 of FIG. 1, according to various embodiments. As shown, robot control application 146 includes, without limitation, trained robot control models 121 and TAMP 122. Robot control application 146 processes sensor data 501 acquired via sensors 180 and task 502 received from one or more I/O devices to generate controls for robot 160 to perform at least part of task 502, which includes one or more skills.

In some embodiments, robot control application 146 controls robot 160 using a hybrid execution strategy, called Synergistic Planning, Imitation, and Reinforcement (SPIRE), based on TAMP 122 and trained robot control models 121. At each timestep, robot control application 146 receives sensor data 501 from sensors 180, including joint positions, end-effector pose, force and torque signals, visual observations, and/or the like, to estimate the current robot states s. Robot control application 146 then determines whether the current robot state satisfies the goal condition G of the current handoff section, where G represents the set of terminal robot states of a skill. Whenever s ∈ G, meaning the current skill has successfully completed, the control loop exits. Whenever the current skill goal has not yet been achieved and the skill is tagged as being TAMP-based, robot control application 146 uses a motion planner in TAMP 122 to generate robot actions a⃗ = PLAN-TAMP(s, G), which returns a sequence of robot actions expected to guide robot 160 from the current robot states toward a terminal handoff section. Any technically feasible motion planner, including known motion planners, can be used in some embodiments. Whenever the robot actions are instead tagged as RL-based (e.g., a.type = “RL”), robot control application 146 uses a trained policy π = a.policy that can be one of trained robot control models 121 corresponding to the current skill. Robot control application 146 generates robot actions based on the trained policy and generates controls to robot 160 until the handoff goal G for that skill is reached. For robot actions that are not RL-based (e.g., a.type ≠ “RL”), robot control application 146 instead executes a trajectory τ = a.trajectory that is generated by the motion planner in TAMP 122, described above. The workflow of robot control application 146 can be described as:

Algorithm 2: SPIRE
procedure: SPIRE(G)
  while True do
    s ← OBSERVE()
    if s ∈ G then return True
    a⃗ ← PLAN-TAMP(s, G)
    for a ∈ a⃗ do
      if a.type = “RL” then
        π ← a.policy
        EXECUTE-POLICY(π)
        break
      else
        τ ← a.trajectory
        EXECUTE-TRAJECTORY(τ)
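Algorithm 2 can be rendered as a small runnable Python sketch. The callable-based interface (`observe`, `in_goal`, `plan_tamp`, `execute_trajectory`), the `PlannedAction` type, and the `max_iters` bound that replaces the unbounded loop are assumptions made so the example is self-contained; they are not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PlannedAction:
    type: str                                    # "RL" or "TAMP"
    policy: Optional[Callable[[], None]] = None  # runs a learned skill to its goal
    trajectory: Optional[List[int]] = None       # planner-generated motion


def spire(observe: Callable[[], int],
          in_goal: Callable[[int], bool],
          plan_tamp: Callable[[int], List[PlannedAction]],
          execute_trajectory: Callable[[List[int]], None],
          max_iters: int = 100) -> bool:
    """Replan until the goal holds; hand off to a learned policy for RL actions."""
    for _ in range(max_iters):           # bounded stand-in for "while True"
        s = observe()
        if in_goal(s):
            return True
        for a in plan_tamp(s):
            if a.type == "RL":
                a.policy()               # execute the learned skill, then replan
                break
            execute_trajectory(a.trajectory)
    return False
```

The `break` after an RL-based action mirrors the algorithm: once a learned skill has run, the remaining plan is discarded and the planner is invoked again from the newly observed state.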

In some embodiments, robot control application 146 uses various motion planning techniques, such as inverse kinematics and/or the like, to generate one or more controls based on the robot actions. The controls can include joint position commands, velocity commands, or torque commands, depending on the specific motion control architecture of robot 160. In some embodiments, robot control application 146 includes real-time feedback from sensors 180 to dynamically adjust the robot actions based on unexpected changes in the environment, such as the displacement of objects or obstacles. In some embodiments, robot control application 146 sends low-level motor commands to the actuators of robot 160 based on the controls, or sends commands based on the controls to a low-level controller that generates low-level motor commands, enabling precise execution of the controls.

FIG. 6 is a flow diagram of method steps for training robot control models 121, according to various embodiments. Although the method steps are described in conjunction with the systems of FIGS. 1-5, persons skilled in the art will understand that any system configured to perform the method steps in any order falls within the scope of the present disclosure.

As shown, a method 600 begins with step 602, where model trainer 116 and TAMP 122 are initialized. In various embodiments, initializing model trainer 116 includes setting up the training configuration for reinforcement learning and behavior cloning. For example, model trainer 116 can initialize the learning rate (e.g., 1×10^−4), discount factor (e.g., γ=0.99), and batch size (e.g., 256). In some embodiments, model trainer 116 initializes RL-specific settings, such as n-step returns (e.g., 3), action repeat (e.g., 1), and the number of seed frames (e.g., 4000) to influence sample efficiency and training stability. Model trainer 116 also initializes the neural network architecture of robot control models 121, such as the feature dimension (e.g., 50), hidden layer size (e.g., 1024), and network structure (e.g., convolutional neural network), and selects an optimizer (e.g., Adam) to update model weights during training. In some embodiments, model trainer 116 also initializes a penalty weight α used in the KL-divergence regularization as described in Equation 4. In some examples, the value of α (e.g., 0.1) can be set depending on the trade-off between exploration and adherence to demonstration behavior.
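The example settings listed above could be collected in a single configuration object, as in the Python sketch below. The class and field names are hypothetical; only the default values come from the examples in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainerConfig:
    """Hypothetical container for the example hyperparameters named in the text."""
    learning_rate: float = 1e-4
    discount: float = 0.99
    batch_size: int = 256
    n_step_returns: int = 3
    action_repeat: int = 1
    num_seed_frames: int = 4000
    feature_dim: int = 50
    hidden_dim: int = 1024
    network: str = "cnn"      # convolutional neural network
    optimizer: str = "adam"
    kl_weight: float = 0.1    # penalty weight alpha in the KL regularization
```

Making the dataclass frozen keeps the configuration immutable once training starts, which is a design choice of this sketch rather than a requirement of the method.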

At step 604, trajectory recorder 115 generates demonstration data 123, using TAMP 122, based on user inputs 301. In various embodiments, TAMP 122 receives task 302 and demonstration skillset 304 and generates robot actions 305 based on robot states 306 to cause robot 160 to perform a skill which is not in demonstration skillset 304. Once TAMP 122 determines reaching a handoff section based on robot states 306 and/or states of other objects corresponding to completion of a skill in demonstration skillset 304, robot 160 receives user inputs 301 that cause robot 160 to perform a skill in demonstration skillset 304. Trajectory recorder 115 interacts with TAMP 122 and robot 160 and records trajectory 303 which is stored in demonstration data 123. Step 604 is described in greater detail in conjunction with FIG. 7.

At step 606, model trainer 116 performs behavior cloning to train robot control models 121 based on demonstration data 123. In some embodiments, policy model 420 of robot control model 121 generates robot actions 402 based on robot states 401. Robot states 401 are applied to simulator 117 which generates the next robot states 401 and roll-out data 403. Loss calculator 118 uses behavior cloning loss calculator 405 to calculate a behavior cloning loss 404 based on roll-out data 403 and demonstration data 123. Model trainer 116 updates the parameters of policy model 420 based on behavior cloning loss 404. Step 606 is described in greater detail in conjunction with FIG. 8.

At step 608, model trainer 116 performs reinforcement learning to re-train robot control models 121 using simulator 117 and scheduler 119. In some embodiments, robot control model 121 uses the trained policy model 420 and the untrained residual model 421 to process robot states 401 and generates robot actions 402. Simulator 117 processes robot actions 402 and generates the next robot states 401 and roll-out data 407. Loss calculator 118 uses reinforcement learning reward calculator 411 to process roll-out data 407 and generate reinforcement learning reward 408. Model trainer 116 uses reinforcement learning module 410 to update the parameters of residual model 421. Scheduler 119 interacts with model trainer 116 and robot control models 121 to schedule training of residual models 421 during the reinforcement learning. Step 608 is described in greater detail in conjunction with FIGS. 9 and 10.

FIG. 7 is a flow diagram of method steps for generating demonstration data 123, according to various embodiments. Although the method steps are described in conjunction with the systems of FIGS. 1-5, persons skilled in the art will understand that any system configured to perform the method steps in any order falls within the scope of the present disclosure.

As shown, step 604 of the method 600 begins with step 701, where TAMP 122 determines the current skill based on task 302. In some embodiments, TAMP 122 receives a task 302 and demonstration skillset 304. TAMP 122 processes robot states 306, task 302, and a demonstration skillset and generates robot actions 305 to cause robot 160 to perform a skill which is not in demonstration skillset 304 until reaching a handoff section.

At step 702, TAMP 122 checks whether the skill is in a demonstration skillset. In some embodiments, TAMP 122 determines reaching a handoff section based on robot states 306 and/or states of other objects corresponding to completion of a skill which is not in demonstration skillset 304. Whenever TAMP 122 determines the skill is in a demonstration skillset, step 604 of the method 600 proceeds to step 704. Whenever TAMP 122 determines the skill is not in a demonstration skillset, step 604 of the method 600 proceeds to step 703.

At step 703, TAMP 122 causes robot 160 to perform a skill. In some embodiments, TAMP 122 processes robot states 306, task 302, and a demonstration skillset and generates robot actions 305 to cause robot 160 to perform a skill which is not in demonstration skillset 304 until reaching a handoff section. In some embodiments, TAMP 122 includes a model-based approach for synthesizing long-horizon robot behavior. TAMP 122 integrates discrete (e.g., symbolic) planning with continuous (e.g., motion) planning to plan hybrid discrete-continuous robot actions 305. In some embodiments, TAMP 122 uses a model of robot actions that a planner can apply and how the robot actions 305 modify the current robot states 306. Using the model, TAMP 122 can search over the space of plans to find a sequence of robot actions 305 and the associated parameters that satisfies a skill.

At step 704, robot 160 receives one or more user inputs 301. In various embodiments, user inputs 301 include teleoperation commands provided by a human operator using various input devices, such as a joystick, a VR controller, a kinesthetic teaching interface, and/or the like.

At step 705, TAMP 122 causes robot 160 to perform the skill based on user inputs 301. In some embodiments, whenever TAMP 122 determines a handoff section has been reached, robot 160 performs the skill in demonstration skillset 304 based on user inputs 301.

At step 706, trajectory recorder 115 records trajectory 303 of robot 160 performing the skill. In some embodiments, trajectory recorder 115 records trajectory 303 of robot 160, which is generated when user inputs 301 cause robot 160 to perform a skill in the demonstration skillset 304.

At step 707, trajectory recorder 115 stores trajectory 303 in demonstration data 123. In various embodiments, demonstration data 123 can be represented as

where s_t ∈ S, a_t ∈ A, H_i is the horizon, and g_i is the handoff section of the i-th trajectory 303.
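One way to hold such demonstration data in code is sketched below in Python: each trajectory stores its state-action pairs together with the handoff section it belongs to, and the horizon is the number of recorded pairs. The type aliases, class name, and the `by_section` helper are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


@dataclass
class Demonstration:
    """One recorded trajectory, tagged with its handoff section."""
    handoff_section: int
    steps: List[Tuple[State, Action]] = field(default_factory=list)

    def record(self, state: State, action: Action) -> None:
        self.steps.append((state, action))

    @property
    def horizon(self) -> int:
        return len(self.steps)


def by_section(data: List[Demonstration], j: int) -> List[Demonstration]:
    # Select the trajectories that belong to one handoff section.
    return [d for d in data if d.handoff_section == j]
```

Grouping by handoff section makes it straightforward to train one policy model per skill from the matching subset of trajectories.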

At step 708, TAMP 122 determines whether task 302 is complete. Whenever TAMP 122 determines task 302 is complete, method 600 proceeds to step 606. Whenever TAMP 122 determines task 302 is not complete, step 604 of method 600 returns to step 701 to process the next skill.

FIG. 8 is a flow diagram of method steps for training policy models 420, according to various embodiments. Although the method steps are described in conjunction with the systems of FIGS. 1-5, persons skilled in the art will understand that any system configured to perform the method steps in any order falls within the scope of the present disclosure.

As shown, step 606 of the method 600 begins with step 801, where model trainer 116 receives a skill from demonstration skillset 304. In various embodiments, demonstration skillset 304 includes one or more skills that are impractical to manually model or to be performed by TAMP 122. In some embodiments, the skills that are impractical to manually model or be performed by TAMP 122 can be specified by a user.

At step 802, robot control model 121 generates robot actions 402 using policy model 420. In some embodiments, each robot control model 121 is associated with a skill from demonstration skillset 304 and is configured to generate robot actions 402 that guide robot 160 in performing that skill. In some examples, policy model 420 can be represented by a base policy π_φ(s) which maps robot states 401 to robot actions 402 and is parameterized by parameters φ.

At step 803, simulator 117 generates robot states 401 and roll-out data 403 based on robot actions 402. In various embodiments, simulator 117 includes a robot model that represents the kinematics, dynamics, geometry, and actuation properties of robot 160. The robot model permits simulator 117 to simulate the physical behavior of robot 160 in response to robot actions 402, including joint movements, end-effector movements, and interactions with objects in the environment. In some embodiments, simulator 117 also models external factors such as gravity, collisions, contact forces, sensor noise, and/or the like, permitting realistic simulation of various skills. Simulator 117 generates robot states 401 which reflect updated observations, such as joint angles, gripper positions, camera images, force readings, and/or the like. Roll-out data 403 includes sequences of state-action pairs, which are used for training and evaluating robot control models 121.

At step 804, loss calculator 118 calculates behavior cloning loss 404 based on roll-out data 403 and demonstration data 123. In some embodiments, loss calculator 118 uses behavior cloning loss calculator 405 to calculate behavior cloning loss 404. In various embodiments, behavior cloning loss 404 is calculated as the negative log-likelihood of the robot actions included in demonstration data 123 given the observed robot states 401 under the policy model 420 π_φ. In some examples, for each state-action pair (s, a) in the demonstration data 123, behavior cloning loss 404 is calculated as described in Equation 1 that measures how well the base policy π_φ aligns with the demonstration trajectory included in demonstration data 123.

At step 805, model trainer 116 updates parameters of policy model 420 based on behavior cloning loss 404. In various embodiments, model trainer 116 uses various optimization techniques such as SGD, Adam, and/or the like, to minimize behavior cloning loss 404 and adjust the parameters φ of policy models 420 accordingly. In some examples, model trainer 116 solves the optimization problem described in Equation 2. For each training epoch, model trainer 116 computes the gradient of behavior cloning loss 404 with respect to φ. The parameters are then updated in the direction that reduces behavior cloning loss 404.
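The loss-then-update cycle of steps 804 and 805 can be illustrated with a deliberately small Python sketch. It assumes a one-dimensional Gaussian policy whose mean is a single weight times the state and whose standard deviation is fixed, and it uses plain full-batch gradient descent in place of SGD or Adam; these simplifications and all function names are assumptions for illustration.

```python
import math
from typing import List, Tuple

Pair = Tuple[float, float]  # (state, action), both scalars for brevity


def nll(w: float, data: List[Pair], std: float = 1.0) -> float:
    """Mean negative log-likelihood of demo actions under N(w * s, std^2)."""
    const = math.log(std) + 0.5 * math.log(2 * math.pi)
    return sum(0.5 * ((a - w * s) / std) ** 2 + const for s, a in data) / len(data)


def nll_grad(w: float, data: List[Pair], std: float = 1.0) -> float:
    # Derivative of the mean negative log-likelihood with respect to w.
    return sum(-(a - w * s) * s / std ** 2 for s, a in data) / len(data)


def behavior_clone(data: List[Pair], lr: float = 0.1, epochs: int = 200) -> float:
    w = 0.0
    for _ in range(epochs):
        w -= lr * nll_grad(w, data)  # step in the direction that lowers the loss
    return w
```

On demonstrations generated by the rule a = 2s, the learned weight converges to 2 and the loss falls below its initial value.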

At step 806, model trainer 116 determines whether to continue training. In various embodiments, model trainer 116 updates the parameters of policy model 420 over multiple training epochs until one or more stopping criteria are met, such as the behavior cloning loss 404 converging, reaching a predefined number of training epochs, and/or the like. Whenever model trainer 116 determines not to continue training, step 606 of method 600 proceeds to step 807. Whenever model trainer 116 determines to continue training, step 606 of method 600 returns to step 802.

At step 807, model trainer 116 determines whether policy models 420 are trained for all skills in demonstration skillset 304. Whenever model trainer 116 determines policy models 420 are trained for all skills in demonstration skillset 304, method 600 proceeds to step 608. Whenever model trainer 116 determines policy models 420 are not trained for all skills in demonstration skillset 304, step 606 of method 600 returns to step 801 to receive the next skill in demonstration skillset 304.

FIG. 9 is a flow diagram of method steps for training residual models 421 using scheduler 119, according to various embodiments. Although the method steps are described in conjunction with the systems of FIGS. 1-5, persons skilled in the art will understand that any system configured to perform the method steps in any order falls within the scope of the present disclosure.

As shown, step 608 of method 600 begins with step 901, where scheduler 119 receives a sampling strategy, workers, and a status queue. In various embodiments, scheduler 119 is implemented as a centralized control loop that coordinates a pool of TAMP workers, a shared status queue, and a sampling strategy. Each TAMP worker executes a TAMP planner in an environment instance. When a TAMP worker reaches a handoff section, the TAMP worker submits a tuple (i,j) to the status queue, where i identifies the worker and j identifies the section index. In various embodiments, scheduler 119 continuously monitors the status queue.

At step 902, scheduler 119 pops the status queue. In some embodiments, the status queue is a FIFO queue that tracks the availability of handoff sections from across all workers, and an element is popped from the status queue.

At step 903, scheduler 119 determines whether to accept a section from the status queue based on the sampling strategy. In some embodiments, the sampling strategy upsamples later sections so that later skills in a task, which may not be reached as frequently as earlier skills, are also learned. In some embodiments, upon retrieving an entry (i,j) from the status queue, scheduler 119 queries a sampling strategy object to determine whether the section j is suitable for training. In some embodiments, the sampling strategy used by scheduler 119 includes a curriculum learning mechanism. For example, in a sequential strategy, scheduler 119 accepts section j whenever the average success rate over all previous sections from 0 to j−1 exceeds a predefined threshold τ. In some examples, the acceptance condition in a sequential strategy can be described by Equation 6. The curriculum permits the RL agent to train on simpler or earlier skills before progressing to more difficult skills, resulting in more stable and sample-efficient RL training. In some embodiments, scheduler 119 uses a permissive strategy which accepts all sections unconditionally, allowing scheduler 119 to optimize utilization of the workers but without any enforced learning progression.

Whenever scheduler 119 determines not to accept the section from the status queue based on the sampling strategy, step 608 of method 600 proceeds to step 904. At step 904, scheduler 119 resets the worker. In some embodiments, whenever the sampling strategy does not accept section j, scheduler 119 issues a reset command to worker i, prompting the worker to restart the TAMP until TAMP reaches a new handoff section and the worker submits another tuple (i,j) to the status queue and then waits in an idle state until a command is received from scheduler 119.

On the other hand, whenever scheduler 119 determines to accept the section from the status queue based on the sampling strategy, step 608 of method 600 proceeds to step 905. At step 905, scheduler 119 causes reinforcement learning to be performed to train a residual model 421 for the section using simulator 117. In some embodiments, causing the reinforcement learning can include allocating a thread for executing a worker. In some embodiments, scheduler 119 initiates an RL episode with TAMP worker i. At each step t in the RL episode, scheduler 119 receives the current robot states 401 s_t by calling observe() on the TAMP worker. Scheduler 119 then receives robot actions 402 a_t ∼ π_θ(s_t) using the residual policy model 421 under training. The robot actions 402 are sent back to the worker, which advances the environment in simulator 117 by calling step(a_t). Step 905 is described in greater detail in conjunction with FIG. 10.

At step 906, scheduler 119 determines whether the worker is done with the current section. Scheduler 119 continues the process in a loop until the worker indicates the current section is solved by returning done() = True. In some embodiments, at each step, scheduler 119 checks whether the skill has been completed in the worker, meaning the robot reached a goal condition for the skill. In some embodiments, after the current section is solved, the worker sends a success notification to scheduler 119 and runs TAMP until reaching the next handoff section that requires further reinforcement learning. When the next handoff section is reached, the worker submits another tuple (i,j) to the status queue and then waits in an idle state until a command is received from scheduler 119.

Whenever scheduler 119 determines the TAMP worker is not done with the current section, step 608 of method 600 returns to step 905. On the other hand, whenever scheduler 119 determines the TAMP worker is done with the current section, step 608 of method 600 proceeds to step 907. At step 907, scheduler 119 determines whether the status queue is empty. In various embodiments, scheduler 119 continuously monitors the status queue. Whenever scheduler 119 determines the status queue is empty, method 600 terminates. Whenever scheduler 119 determines the status queue is not empty, step 608 of method 600 returns to step 902. Although step 907 is shown as occurring after step 906 for illustrative purposes, in some embodiments, scheduler 119 can pop the status queue multiple times and cause workers for selected sections to execute in parallel across different processors according to steps 902-906, as described above in conjunction with FIG. 4B.
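The flow of steps 902 through 907 can be sketched as a single-threaded Python loop. `ToyWorker` is a hypothetical stand-in for a TAMP worker paused at a handoff section, and the loop drains a pre-filled queue once; in the described system, workers would re-enqueue themselves after reaching the next handoff section and could run in parallel, neither of which is modeled here.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


class ToyWorker:
    """Hypothetical stand-in for a TAMP worker paused at a handoff section."""

    def __init__(self, steps_needed: int) -> None:
        self.steps_needed = steps_needed
        self.progress = 0
        self.resets = 0

    def observe(self) -> int:
        return self.progress

    def step(self, action: int) -> None:
        self.progress += action

    def done(self) -> bool:
        return self.progress >= self.steps_needed

    def reset(self) -> None:
        self.progress = 0
        self.resets += 1


def run_scheduler(workers: List[ToyWorker],
                  status_queue: Deque[Tuple[int, int]],
                  act: Callable[[int], int],
                  accepts: Callable[[int], bool]) -> int:
    """Drain the queue once; returns the number of environment steps taken."""
    frames = 0
    while status_queue:                      # step 907: stop when the queue is empty
        i, j = status_queue.popleft()        # step 902: FIFO pop
        if accepts(j):                       # step 903: query the sampling strategy
            while not workers[i].done():     # steps 905-906: RL episode loop
                workers[i].step(act(workers[i].observe()))
                frames += 1
        else:
            workers[i].reset()               # step 904: reject and reset
    return frames
```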

FIG. 10 is a flow diagram of method steps for training a residual model using reinforcement learning, according to various embodiments. Although the method steps are described in conjunction with the systems of FIGS. 1-5, persons skilled in the art will understand that any system configured to perform the method steps in any order falls within the scope of the present disclosure.

As shown, step 905 begins with step 1011, where model trainer 116 receives robot control model 121 with trained policy model 420. In some embodiments, each robot control model 121 is associated with a skill from demonstration skillset 304 and is configured to generate robot actions 402 that guide robot 160 in performing that skill. Initially, each of robot control models 121 includes, without limitation, a trained policy model 420 and an untrained residual model 421. In some examples, each trained policy model 420 can be represented by a base policy π_φ*(s) which maps robot states 401 to robot actions 402 and is parameterized by trained parameters φ*. Residual model 421 can also be represented by a residual policy, which maps robot states 401 to robot actions 402 and is parameterized by untrained parameters θ. Robot control model 121 then maps robot states 401 to robot actions 402 based on the base policy and the residual policy.

At step 1012, robot control model 121 generates robot actions 402 using trained policy model 420 and residual model 421. In various embodiments, robot control model 121 generates robot actions 402 based on a policy

where θ are the parameters of residual model 421 to be trained. In some embodiments, only the mean of the trained base policy is added to the residual policy.
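The combination of a frozen base policy and a trainable residual can be sketched in Python as a function that adds the residual's delta action to the base policy's mean action. The `compose` helper and the list-of-floats action representation are assumptions for illustration; the exact form of the combined policy is given by the equation referenced in the text.

```python
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float]], List[float]]


def compose(base_mean: Policy, residual: Policy) -> Policy:
    """Return a policy whose action is the base action plus a residual delta."""
    def policy(state: Sequence[float]) -> List[float]:
        base = base_mean(state)    # frozen, trained by behavior cloning
        delta = residual(state)    # trainable correction learned with RL
        return [b + d for b, d in zip(base, delta)]
    return policy
```

Because the residual starts untrained, a residual that initially outputs values near zero leaves the combined policy close to the behavior-cloned base policy at the start of RL.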

At step 1013, simulator 117 generates robot states 401 and roll-out data 407 based on robot actions 402. In some embodiments, simulator 117 generates robot states 401, which reflect updated observations, such as joint angles, gripper positions, camera images, force readings, and/or the like. Roll-out data 403 includes sequences of state-action pairs, which are used for training and evaluating robot control models 121.

At step 1014, loss calculator 118 calculates reinforcement learning reward 408 based on roll-out data 407. In various embodiments, loss calculator 118 uses reinforcement learning reward calculator 411 to evaluate the outcome of each robot action 402 based on skill-specific success criteria and assigns a numerical reward accordingly. In some embodiments, reinforcement learning reward 408 is sparse, such as providing a reward of 1 only upon successful completion of a skill and zero otherwise. In some embodiments, reinforcement learning reward 408 is dense, providing incremental rewards based on progress toward task goals. In some embodiments, reinforcement learning reward 408 includes penalty terms, such as penalty terms for excessive movement, collisions, and/or the like.

At step 1015, model trainer 116 updates parameters of residual model 421 based on reinforcement learning reward 408. In some embodiments, model trainer 116 includes a reinforcement learning module 410. RL is a learning framework in which an agent (e.g., the fixed trained base policy plus the to-be-trained residual policy) interacts with an environment by selecting robot actions 402, observing the resulting robot states 401, and receiving reinforcement learning rewards 408 that reflect skill performance. The goal is to learn a policy that maximizes the expected cumulative reward over time. In some examples, the expected return under policy π_θ is defined as given in Equation 3. In some embodiments, due to the sparsity of reinforcement learning reward 408, the reinforcement learning objective in Equation 3 can exhibit high variance, which could cause the policy π_θ to drift significantly from the base policy π_φ* trained via behavior cloning. The drift can result in the loss of useful behavior learned from demonstration data 123 and reduce overall training stability. In some embodiments, to mitigate the issue, reinforcement learning module 410 uses a KL divergence penalty between the policy π_θ and the base policy π_φ*. The KL divergence penalty constrains the output of the robot control model 121 to remain close to the base policy throughout the RL fine-tuning process. In some examples, the final reinforcement learning objective used to guide RL training is described as given in Equation 4. In various embodiments, reinforcement learning module 410 applies any suitable RL algorithm, such as policy gradient, Q-learning, actor-critic techniques, and/or the like, to compute the gradient of the expected return with respect to θ and to update the residual model 421.

At step 1016, model trainer 116 determines whether to continue training. In some embodiments, training continues until residual model 421 performance converges or meets a predefined threshold, such as reaching a maximum number of training epochs. Whenever model trainer 116 determines to continue training, step 905 returns to step 1012. Whenever model trainer 116 determines not to continue training, the method 900 proceeds to step 906.

FIG. 11 is a flow diagram of method steps for controlling robot 160, according to various embodiments. Although the method steps are described in conjunction with the systems of FIGS. 1-5, persons skilled in the art will understand that any system configured to perform the method steps in any order falls within the scope of the present disclosure.

As shown, a method 1100 begins with step 1101, where robot control application 146 receives sensor data 501 and task 502. In some embodiments, robot control application 146 receives sensor data 501 via sensors 180 and task 502 from one or more I/O devices.

At step 1102, robot control application 146 selects a skill from task 502. As described, task 502 can include multiple skills, and robot control application 146 sequentially selects skills and controls robot 160 to perform the selected skills.

At step 1103, robot control application 146 processes sensor data 501 using either TAMP 122 or a trained robot control model 121, to generate an action for robot 160 to perform at least part of the skill. In various embodiments, at each timestep, robot control application 146 processes sensor data 501 from sensors 180 to estimate the current robot state s. Whenever the robot actions are tagged as RL-based (e.g., a.type="RL"), robot control application 146 uses a trained policy π including a base policy and a residual policy from one of trained robot control models 121 corresponding to the current skill. Robot control application 146 generates robot actions based on the trained policy and generates controls to robot 160 until the handoff goal G for that skill is reached. For robot actions that are not RL-based (e.g., a.type≠"RL"), robot control application 146 instead executes a trajectory τ=a.trajectory that is generated by a motion planner in TAMP 122. The workflow of robot control application 146 can be as described by Algorithm 2.
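One way to picture a trained policy that includes a base policy and a residual policy is as a function that adds a delta action to the base action. The sketch below assumes additive composition and a hypothetical `residual_scale` factor; the disclosure states only that the residual modifies the base action, so both choices are illustrative.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Policy = Callable[[Sequence[float]], Vector]


def compose_residual_policy(base_policy: Policy,
                            residual_policy: Policy,
                            residual_scale: float = 1.0) -> Policy:
    """Return a policy whose action is the base action plus a scaled delta action."""
    def policy(state: Sequence[float]) -> Vector:
        base_action = base_policy(state)       # from the behavior-cloned base policy
        delta_action = residual_policy(state)  # correction from the residual policy
        return [b + residual_scale * d for b, d in zip(base_action, delta_action)]
    return policy
```

For example, with a base policy that doubles each state component and a residual that always outputs 0.5, the composed policy maps the state `[1.0, 2.0]` to `[2.5, 4.5]`.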

At step 1104, robot control application 146 generates controls for robot 160 to perform, based on the action, at least part of the skill. In some embodiments, robot control application 146 uses various motion planning techniques, such as inverse kinematics and/or the like, to generate one or more controls based on the robot actions. The controls can include joint position commands, velocity commands, or torque commands, depending on the specific motion control architecture of robot 160. In some embodiments, robot control application 146 includes real-time feedback from sensors 180 to dynamically adjust the robot actions based on unexpected changes in the environment, such as the displacement of objects or obstacles.

At step 1105, robot control application 146 causes robot 160 to move based on the controls. In some embodiments, robot control application 146 sends low-level motor commands to the actuators of robot 160 based on the controls, or sends commands based on the controls to a low-level controller that generates low-level motor commands, enabling precise execution of the controls.

At step 1106, if robot control application 146 determines that the skill has not been completed, then method 1100 returns to step 1103, where robot control application 146 processes additional sensor data 501 using either TAMP 122 or the trained robot control model 121, to generate another action for robot 160 to perform at least part of the skill. In some embodiments, robot control application 146 determines whether the current robot state satisfies the goal condition G of the current handoff section, where G represents the set of terminal robot states of a skill in the demonstration skillset. Whenever s ∈ G, meaning the current skill has successfully completed, the control loop exits. As long as the current skill goal has not yet been achieved, robot control application 146 uses TAMP 122 or the trained robot control model 121 to generate robot actions.

On the other hand, if robot control application 146 determines at step 1106 that the skill has been completed, then method 1100 proceeds directly to step 1107. At step 1107, if robot control application 146 determines that there are no more skills in the task, then method 1100 ends. On the other hand, if robot control application 146 determines that there are more skills in the task, then method 1100 returns to step 1102, where robot control application 146 selects a next skill from the task.
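The control flow described above (select a skill, generate actions with either a learned policy or a planner trajectory, repeat until the skill's goal is reached, then move on to the next skill) can be sketched as follows. This is a simplified illustration rather than a reproduction of Algorithm 2: the `PlanAction` fields, the callback names, the one-dimensional state, and the `max_steps` safeguard are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

State = float  # toy one-dimensional robot state, for illustration only


@dataclass
class PlanAction:
    """Hypothetical plan entry: one skill, executed by a policy or by a planner."""
    skill: str
    type: str                                        # "RL" or planner-executed otherwise
    trajectory: List[float] = field(default_factory=list)
    goal: Optional[Callable[[State], bool]] = None   # membership test for handoff goal G


def run_task(plan: List[PlanAction],
             policies: Dict[str, Callable[[State], float]],
             read_state: Callable[[], State],
             apply_action: Callable[[float], None],
             max_steps: int = 1000) -> List[str]:
    """Execute each skill in order and return the names of the completed skills."""
    completed = []
    for action in plan:                              # select the next skill
        if action.type == "RL":
            steps = 0
            while not action.goal(read_state()):     # loop until the state is in G
                if steps >= max_steps:
                    raise RuntimeError(f"skill {action.skill!r} did not reach its goal")
                apply_action(policies[action.skill](read_state()))
                steps += 1
        else:
            for command in action.trajectory:        # follow the planner's trajectory
                apply_action(command)
        completed.append(action.skill)
    return completed
```

In a real system `read_state` would estimate the state from sensor data and `apply_action` would convert actions into low-level controls; here they are left as plain callbacks so the loop structure stays visible.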

In sum, techniques are disclosed for controlling robots using TAMP and robot control models that are trained using behavior cloning and reinforcement learning. The robot control models are machine learning models, such as neural networks, that process robot states and generate robot actions to perform at least part of a robotic skill. Each robot control model includes a policy model that generates robot actions and a residual model that generates modifications to the robot actions output by the policy model.

In some embodiments, TAMP is used to generate demonstration data based on user inputs. Given a robotic task and a demonstration skillset, TAMP divides the task into various skills and checks whether the current skill is in the demonstration skillset, which includes the set of skills for which one or more robot control models need to be trained. Whenever the skill is not in the demonstration skillset, TAMP causes the robot to perform the skill until reaching a handoff section, which implies that the next skill is in the demonstration skillset. Then, the robot receives one or more user inputs, which cause the robot to perform the skill in the demonstration skillset. A trajectory recorder records the trajectory of the robot, which is stored in the demonstration data. The foregoing process continues until the task is complete.

In some embodiments, a model trainer uses the demonstration data to train the policy models included in the robot control models. For each skill in the demonstration skillset, the model trainer uses a trajectory for that skill included in the demonstration data to train a policy model. In various embodiments, the robot control model generates robot actions which are applied within a simulator. The simulator generates the next robot state and roll-out data based on the robot actions. A loss calculator calculates a behavior cloning loss based on the roll-out data and the demonstration data, which includes robot states and robot actions, and the trajectory. The model trainer then iteratively updates the parameters of the policy model based on the calculated losses until one or more stopping criteria are met. The model trainer then trains another policy model for another skill until training for all skills in the demonstration skillset has completed.

In various embodiments, the model trainer uses the robot control model with the trained policy models to train residual models using reinforcement learning. During the reinforcement learning, the robot control model generates robot actions which are applied to the simulator. The simulator generates the next robot states and the roll-out data based on the robot actions. The loss calculator uses a reinforcement learning reward calculator to calculate a reinforcement learning reward based on robot actions, robot state, and robot actions generated using the trained policy model. The model trainer then uses a reinforcement learning module to iteratively update the parameters of the residual model based on the calculated rewards until one or more stopping criteria are met. The reinforcement learning module can use a reward that includes a Kullback-Leibler (KL) divergence term that limits deviations of robot actions generated by the robot control model from the robot actions generated by the previously trained policy model. Once the residual models for all skills in the demonstration skillset are trained, the robot control models, which each include a trained policy model that generates robot actions and a trained residual model that generates modifications to the robot actions, can be used along with TAMP to process sensor data and a task, and generate actions to cause a robot to perform at least part of the task that includes multiple skills.
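To make the behavior cloning step more concrete, the sketch below computes a loss as the mean squared difference between the actions a policy produces and the actions recorded in a demonstration. The squared-error form and the function name are assumptions, and the sketch evaluates the policy directly on demonstration states rather than on simulator roll-outs, which is a simplification of the process summarized above.

```python
from typing import Callable, List, Sequence, Tuple

Demonstration = List[Tuple[Sequence[float], Sequence[float]]]  # (state, action) pairs


def behavior_cloning_loss(policy: Callable[[Sequence[float]], Sequence[float]],
                          demonstration: Demonstration) -> float:
    """Mean squared difference between predicted and demonstrated actions."""
    total, count = 0.0, 0
    for state, demo_action in demonstration:
        predicted = policy(state)
        for p, d in zip(predicted, demo_action):
            total += (p - d) ** 2
            count += 1
    if count == 0:
        raise ValueError("demonstration contains no action values")
    return total / count
```

A trainer would repeatedly adjust the policy parameters to reduce this value until a stopping criterion is met.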

In some embodiments, the model trainer uses a scheduler to schedule training of robot control models during the reinforcement learning. In various embodiments, the scheduler receives a sampling strategy, one or more workers, and a status queue. Each worker includes a TAMP environment that performs skills not in the demonstration skillset by default and reports a section request to the status queue whenever TAMP reaches a handoff section. Each status queue element includes a worker identifier (ID) and a section ID. To begin with, the scheduler pops the status queue. The scheduler then checks whether a section from the status queue is acceptable based on the sampling strategy. Whenever the section is determined not to be acceptable, the scheduler resets the worker, thereby skipping reinforcement learning for that particular section. Otherwise, the scheduler interacts with the model trainer, which performs reinforcement learning to train the residual model for the section. The reinforcement learning continues until the worker indicates to the scheduler the completion of the skill, at which point the worker uses TAMP to perform a next skill, if any, until a handoff section is reached and the worker reports another section request to the status queue. The scheduler also checks whether the status queue is empty. Whenever the scheduler determines that the status queue is not empty, the scheduler pops the status queue again and repeats the process. Whenever the scheduler determines that the status queue is empty, the model trainer stores the trained residual models.
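The scheduling loop described above can be sketched with a queue of (worker ID, section ID) elements and a pluggable sampling strategy. The function names, the callbacks `train_section` and `reset_worker`, the first-in-first-out pop order, and the rule that a section with no preceding sections is always accepted are assumptions for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

StatusElement = Tuple[int, int]                 # (worker ID, section ID)
Strategy = Callable[[int, Dict[int, float]], bool]


def accept_all(section_id: int, success_rates: Dict[int, float]) -> bool:
    """Sampling strategy that accepts every section unconditionally."""
    return True


def make_threshold_strategy(threshold: float) -> Strategy:
    """Accept a section when the average success rate of earlier sections exceeds threshold."""
    def strategy(section_id: int, success_rates: Dict[int, float]) -> bool:
        previous = [success_rates.get(i, 0.0) for i in range(section_id)]
        return not previous or sum(previous) / len(previous) > threshold
    return strategy


def run_scheduler(status_queue: Deque[StatusElement],
                  strategy: Strategy,
                  success_rates: Dict[int, float],
                  train_section: Callable[[int, int], None],
                  reset_worker: Callable[[int], None]) -> List[Tuple[str, int, int]]:
    """Pop elements until the queue is empty, training or resetting per the strategy."""
    log = []
    while status_queue:
        worker_id, section_id = status_queue.popleft()
        if strategy(section_id, success_rates):
            train_section(worker_id, section_id)   # RL on the accepted section
            log.append(("train", worker_id, section_id))
        else:
            reset_worker(worker_id)                # skip RL for this section
            log.append(("reset", worker_id, section_id))
    return log
```

In this sketch a worker that finishes a skill would report its next section by appending to `status_queue` from inside `train_section`, which keeps the loop running until no requests remain.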

At least one technical advantage of the disclosed techniques relative to the prior art is that the disclosed techniques combine TAMP with behavior cloning and reinforcement learning in a synergistic framework that overcomes limitations of either approach alone. Unlike conventional RL techniques which require finely tuned, dense reward functions, the disclosed techniques restrict RL to predefined handoff sections determined by TAMP. The restriction simplifies reward design by allowing sparse, success-based rewards to be used. Another advantage of the disclosed techniques is that, rather than learning entire task behaviors end-to-end, the disclosed techniques use TAMP to handle routine skills included in the task, while reinforcement learning is used to fine-tune residual corrections for more challenging skills. The disclosed techniques also reduce the need for large, high-quality demonstration datasets by limiting the scope of behavior cloning to a subset of skills, where skills that are easier to model are delegated to TAMP. Yet another advantage of the disclosed techniques is that, by leveraging a scheduler that coordinates multiple TAMP workers and selectively allocates RL training opportunities to skills that are ready for training, the disclosed techniques permit scalable training for long-horizon tasks. These technical advantages provide one or more technological improvements over prior art approaches.

1. In some embodiments, a computer-implemented method for training one or more robot control models comprises performing, based on one or more demonstration trajectories of a robot performing one or more skills associated with a task, one or more training operations to generate one or more first trained machine learning models for controlling the robot, and performing one or more reinforcement learning operations using the one or more first trained machine learning models to generate one or more second trained machine learning models for controlling the robot.

2. The computer-implemented method of clause 1, wherein each first trained machine learning model included in the one or more first trained machine learning models is trained to control the robot to perform a different skill included in the one or more skills, and wherein each second trained machine learning model included in the one or more second trained machine learning models is trained to control the robot to perform a different skill included in the one or more skills.

3. The computer-implemented method of clauses 1 or 2, wherein each first trained machine learning model included in the one or more first trained machine learning models is trained to generate a base action to control the robot, and each second trained machine learning model included in the one or more second trained machine learning models is trained to generate a delta action that modifies the base action generated by a corresponding first trained machine learning model included in the one or more first trained machine learning models.

4. The computer-implemented method of any of clauses 1-3, further comprising generating the one or more demonstration trajectories based on one or more user inputs to control the robot via one or more input/output devices.

5. The computer-implemented method of any of clauses 1-4, wherein performing one or more training operations to generate the one or more first trained machine learning models comprises generating, using an untrained machine learning model, one or more robot actions, generating, based on the one or more robot actions and using a simulator, one or more state-action pairs, calculating, based on the one or more state-action pairs and at least one trajectory included in the one or more demonstration trajectories, a loss, and updating, based on the loss, one or more parameters of the untrained machine learning model.

6. The computer-implemented method of any of clauses 1-5, wherein performing one or more reinforcement learning operations to generate the one or more second trained machine learning models comprises generating, using a first trained machine learning model included in the one or more first trained machine learning models and an untrained machine learning model, one or more actions, generating, based on the one or more actions and using a simulator, one or more state-action pairs, calculating, based on the one or more state-action pairs, a reward, and updating, based on the reward, the one or more parameters of the untrained machine learning model to generate a second trained machine learning model included in the one or more second trained machine learning models.

7. The computer-implemented method of any of clauses 1-6, wherein the reward comprises at least one of a sparse reward of one upon successful completion of a first skill included in the one or more skills and zero otherwise, a dense reward based on progress toward one or more goals associated with the first skill, or one or more penalty terms for movements greater than a threshold and collisions.

8. The computer-implemented method of any of clauses 1-7, wherein performing one or more reinforcement learning operations comprises updating one or more parameters of an untrained machine learning model based on a Kullback-Leibler (KL) divergence term that penalizes differences between one or more first actions generated using a first trained machine learning model included in the one or more first trained machine learning models and one or more second actions generated using the first trained machine learning model and the untrained machine learning model.

9. The computer-implemented method of any of clauses 1-8, wherein performing one or more reinforcement learning operations comprises scheduling a plurality of workers based on a sampling strategy for sampling workers to execute and a queue that stores indications of workers that require scheduling, and executing the plurality of workers based on the scheduling to generate the one or more second trained machine learning models.

10. The computer-implemented method of any of clauses 1-9, further comprising receiving sensor data from one or more sensors, generating, based on the sensor data and using the one or more first trained machine learning models and the one or more second trained machine learning models, one or more actions, and causing the robot to perform one or more first movements based on the one or more actions.

11. In some embodiments, one or more non-transitory computer readable media store instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of performing, based on one or more demonstration trajectories of a robot performing one or more skills associated with a task, one or more training operations to generate one or more first trained machine learning models for controlling the robot, and performing one or more reinforcement learning operations using the one or more first trained machine learning models to generate one or more second trained machine learning models for controlling the robot.

12. The one or more non-transitory computer-readable media of clause 11, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the step of generating the one or more demonstration trajectories based on one or more user inputs to control the robot via one or more input/output devices.

13. The one or more non-transitory computer-readable media of clauses 11 or 12, wherein performing one or more training operations to generate the one or more first trained machine learning models comprises generating, using an untrained machine learning model, one or more robot actions, generating, based on the one or more robot actions and using a simulator, one or more state-action pairs, calculating, based on the one or more state-action pairs and at least one trajectory included in the one or more demonstration trajectories, a loss, and updating, based on the loss, one or more parameters of the untrained machine learning model.

14. The one or more non-transitory computer-readable media of any of clauses 11-13, wherein the loss comprises a difference between one or more first robot actions generated using the untrained machine learning model and one or more second robot actions included in the one or more demonstration trajectories.

15. The one or more non-transitory computer-readable media of any of clauses 11-14, wherein performing one or more reinforcement learning operations to generate the one or more second trained machine learning models comprises generating, using a first trained machine learning model included in the one or more first trained machine learning models and an untrained machine learning model, one or more actions, generating, based on the one or more actions and using a simulator, one or more state-action pairs, calculating, based on the one or more state-action pairs, a reward, and updating, based on the reward, the one or more parameters of the untrained machine learning model to generate a second trained machine learning model included in the one or more second trained machine learning models.

16. The one or more non-transitory computer-readable media of any of clauses 11-15, wherein performing one or more reinforcement learning operations comprises updating one or more parameters of an untrained machine learning model based on a Kullback-Leibler (KL) divergence term that penalizes differences between one or more first actions generated using a first trained machine learning model included in the one or more first trained machine learning models and one or more second actions generated using the first trained machine learning model and the untrained machine learning model.

17. The one or more non-transitory computer-readable media of any of clauses 11-16, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the step of controlling the robot to perform one or more other skills associated with the task using a task and motion planner (TAMP).

18. The one or more non-transitory computer-readable media of any of clauses 11-17, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the steps of receiving sensor data from one or more sensors, generating, based on the sensor data and using the one or more first trained machine learning models and the one or more second trained machine learning models, one or more actions, and causing the robot to perform one or more first movements based on the one or more actions.

19. The one or more non-transitory computer-readable media of any of clauses 11-18, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the step of causing the robot to perform one or more second movements based on one or more motions generated using a motion planning technique.

20. In some embodiments, a system comprises one or more memories storing instructions, and one or more processors that are coupled to the one or more memories and, when executing the instructions, are configured to perform, based on one or more demonstration trajectories of a robot performing one or more skills associated with a task, one or more training operations to generate one or more first trained machine learning models for controlling the robot, and perform one or more reinforcement learning operations using the one or more first trained machine learning models to generate one or more second trained machine learning models for controlling the robot.

1. In some embodiments, a computer-implemented method for training one or more robot control models comprises scheduling a plurality of workers based on a sampling strategy for sampling workers to execute and a queue that stores indications of workers that require scheduling, and executing the plurality of workers based on the scheduling to generate a plurality of trained machine learning models for controlling a robot to perform a plurality of skills associated with a task.

2. The computer-implemented method of clause 1, wherein scheduling the plurality of workers comprises popping an element from the queue, determining, based on the sampling strategy, to accept a section indicated by the element, wherein the section corresponds to a first skill included in the plurality of skills, and causing a first worker included in the plurality of workers to perform one or more reinforcement learning operations to train a first machine learning model to perform the first skill.

3. The computer-implemented method of clauses 1 or 2, further comprising receiving, from the first worker, a notification that the first skill has been completed.

4. The computer-implemented method of any of clauses 1-3, wherein, after the first worker completes the first skill, the first worker completes a second skill included in the plurality of skills and adds another element to the queue.

5. The computer-implemented method of any of clauses 1-4, wherein the first worker completes the second skill using a task and motion planner (TAMP).

6. The computer-implemented method of any of clauses 1-5, wherein scheduling the plurality of workers comprises popping an element from the queue, determining, based on the sampling strategy, to not accept a section indicated by the element, wherein the section corresponds to a first skill included in the plurality of skills, and resetting a worker indicated by the element.

7. The computer-implemented method of any of clauses 1-6, wherein the sampling strategy accepts a section indicated by an element of the queue when an average success rate of all previous sections before the section exceeds a predefined threshold.

8. The computer-implemented method of any of clauses 1-7, wherein the sampling strategy accepts all sections indicated by elements of the queue unconditionally, wherein each section corresponds to a skill included in the plurality of skills.

9. The computer-implemented method of any of clauses 1-8, wherein the plurality of trained machine learning models are generated using reinforcement learning.

10. The computer-implemented method of any of clauses 1-9, further comprising receiving sensor data from one or more sensors, generating, based on the sensor data and using the plurality of trained machine learning models, one or more actions, and causing the robot to move based on the one or more actions.

11. In some embodiments, one or more non-transitory computer readable media store instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of scheduling a plurality of workers based on a sampling strategy for sampling workers to execute and a queue that stores indications of workers that require scheduling, and executing the plurality of workers based on the scheduling to generate a plurality of trained machine learning models for controlling a robot to perform a plurality of skills associated with a task.

12. The one or more non-transitory computer-readable media of clause 11, wherein scheduling the plurality of workers comprises popping an element from the queue, determining, based on the sampling strategy, to accept a section indicated by the element, wherein the section corresponds to a first skill included in the plurality of skills, and causing a first worker included in the plurality of workers to perform one or more reinforcement learning operations to train a first machine learning model to perform the first skill.

13. The one or more non-transitory computer-readable media of clauses 11 or 12, wherein scheduling the plurality of workers comprises popping an element from the queue, determining, based on the sampling strategy, to not accept a section indicated by the element, wherein the section corresponds to a first skill included in the plurality of skills, and resetting a worker indicated by the element.

14. The one or more non-transitory computer-readable media of any of clauses 11-13, wherein the sampling strategy accepts a section indicated by an element of the queue when an average success rate of all previous sections before the section exceeds a predefined threshold.

15. The one or more non-transitory computer-readable media of any of clauses 11-14, wherein the sampling strategy accepts all sections indicated by elements of the queue unconditionally, wherein each section corresponds to a skill included in the plurality of skills.

16. The one or more non-transitory computer-readable media of any of clauses 11-15, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the steps of receiving sensor data from one or more sensors, generating, based on the sensor data and using the plurality of trained machine learning models, one or more actions, and causing the robot to move based on the one or more actions.

17. The one or more non-transitory computer-readable media of any of clauses 11-16, wherein at least two of the plurality of workers are executed in parallel.

18. The one or more non-transitory computer-readable media of any of clauses 11-17, wherein the sampling strategy upsamples one or more sections corresponding to one or more later skills included in the plurality of skills.

19. The one or more non-transitory computer-readable media of any of clauses 11-18, wherein the plurality of trained machine learning models comprise a plurality of trained convolutional neural networks.

20. In some embodiments, a system comprises one or more memories storing instructions, and one or more processors that are coupled to the one or more memories and, when executing the instructions, are configured to schedule a plurality of workers based on a sampling strategy for sampling workers to execute and a queue that stores indications of workers that require scheduling, and execute the plurality of workers based on the scheduling to generate a plurality of trained machine learning models for controlling a robot to perform a plurality of skills associated with a task.

Any and all combinations of any of the claim elements recited in any of the claims and/or any elements described in this application, in any fashion, fall within the contemplated scope of the present disclosure and protection.

The descriptions of the various embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments.

Aspects of the present embodiments may be embodied as a system, method or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “module” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Aspects of the present disclosure are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine. The instructions, when executed via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such processors may be, without limitation, general purpose processors, special-purpose processors, application-specific processors, or field-programmable gate arrays.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

While the preceding is directed to embodiments of the present disclosure, other and further embodiments of the disclosure may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Patent Metadata

Filing Date

April 16, 2025

Publication Date

June 11, 2026

Inventors

Calen Reed GARRETT
Ajay Uday MANDLEKAR
Dieter FOX
Animesh GARG
Zihan ZHOU

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “TECHNIQUES FOR SYNERGISTIC PLANNING, IMITATION, AND REINFORCEMENT LEARNING FOR ROBOT CONTROL” (US-20260158647-A1). https://patentable.app/patents/US-20260158647-A1
