Task Prioritized Experience Replay Algorithm for Reinforcement Learning

Published: April 15, 2025

Assignee: not available in USPTO data

Inventors: Varun Kompella, James MacGlashan, Peter Wurman, Peter Stone

Patent Claims

18 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of training an agent in a control loop, comprising: performing, by the agent, an action (at) sampled from a behavior policy (πb) for an observation (st), wherein the observation comprises information the agent receives, by any means, about an environment of the agent or the agent itself, wherein the information includes one or more of sensory information or signals received through sensory devices; compiled, abstract, or situational information compiled from a collection of the sensory devices combined with stored information; information about people or customers, or to characteristics of the people or the customers; information about internal parts of the agent; proprioceptive information; information regarding current or past actions of the agent; information about an internal state of the agent; information already computed or processed by the agent; and a termination value for each task of a plurality of tasks for which the agent is being trained; storing a transition tuple in a main buffer of the agent, the transition tuple including {(st, at, {right arrow over (rt)}, st+1)}, where {right arrow over (rt)} is a reward vector for each task of the plurality of tasks for the agent in an environment and st+1 is a next environment state after action (at); storing a priority value p(i), of the transition tuple with index i in the main buffer; determining a probability, P(i) of sampling the transition tuple with the index i from the main buffer; updating transiting priorities for each transition tuple stored in the main buffer; sampling a minibatch of transition tuples to update the task networks based on the stored priority value p(i) thereof; determining an action probability distribution parameter, πi(st), of updating task policies for the observation st; and optimizing the task policies from the updated task networks with an off-policy algorithm, wherein: data that is prioritized for one task is shared with one or more other tasks to transfer learning between multiple tasks.
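The storage and sharing steps of claim 1 can be pictured with a minimal Python sketch. It is not taken from the patent: the names `Transition`, `MainBuffer`, `store`, `sample`, and `task_batch` are hypothetical, and sampling in direct proportion to p(i) is an assumption, since the claim does not state the formula for P(i). The point illustrated is that one buffer holds a reward vector with an entry per task, so a transition sampled because of its priority can be reused to update any task.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Transition:
    state: Sequence[float]       # s_t
    action: int                  # a_t
    rewards: Sequence[float]     # reward vector, one entry per task
    next_state: Sequence[float]  # s_{t+1}


class MainBuffer:
    """Hypothetical main buffer: transitions plus one priority p(i) per index."""

    def __init__(self) -> None:
        self.transitions: List[Transition] = []
        self.priorities: List[float] = []

    def store(self, transition: Transition, priority: float = 1.0) -> int:
        """Append a transition with its priority and return its index i."""
        self.transitions.append(transition)
        self.priorities.append(priority)
        return len(self.transitions) - 1

    def sample(self, batch_size: int, rng: random.Random) -> List[int]:
        """Draw indices with probability proportional to p(i) (an assumption)."""
        indices = range(len(self.transitions))
        return rng.choices(indices, weights=self.priorities, k=batch_size)

    def task_batch(self, indices: Sequence[int], task: int) -> List[Tuple]:
        """View the same sampled transitions through the reward of one task."""
        batch = []
        for i in indices:
            t = self.transitions[i]
            batch.append((t.state, t.action, t.rewards[task], t.next_state))
        return batch
```

Calling `task_batch` with the same indices and different `task` values is how this sketch shares prioritized data between tasks; each task network would then be updated by whichever off-policy algorithm is chosen.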

2. The method of claim 1, further comprising continuing the control loop until all the tasks in the environment are solved.

3. The method of claim 1, wherein the tasks in the environment are unknown to the agent.

4. The method of claim 1, wherein the control loop is episodic, and a state of the agent is reset to an initial state after each episode.

5. The method of claim 1, wherein the control loop is continual, where the agent executes actions without resetting a state of the agent.

6. The method of claim 1, wherein the behavior policy is a uniform-random policy.

7. The method of claim 1, wherein the behavior policy is selected from a policy using optimistic biases for unseen regions or a human-designed policy.

8. The method of claim 1, wherein transitions that belong to J are given a priority greater than transitions that are not in J.

9. The method of claim 8, wherein a priority value to the transitions that belong to J are given a constant value.

10. The method of claim 9, wherein transitions that are not in J are given a non-zero priority value.

11. The method of claim 8, wherein a priority value to the transitions that belong to J are given a variable value based on a magnitude of each transition's temporal-difference error.
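The two priority schemes in claims 8 through 11 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names, the default values `high`, `low`, and `eps`, and the choice to add the temporal-difference magnitude on top of `low` (so that task-achieving transitions always outrank the rest, as claim 8 requires) are all hypothetical. The argument `achieving_indices` stands for the set of transition indices that led to a task being achieved.

```python
from typing import List, Sequence, Set


def constant_priorities(
    num_transitions: int,
    achieving_indices: Set[int],
    high: float = 1.0,   # assumed constant priority for task-achieving transitions
    low: float = 0.01,   # assumed small non-zero priority for all other transitions
) -> List[float]:
    """Constant scheme: `high` inside the achieving set, non-zero `low` outside."""
    return [high if i in achieving_indices else low for i in range(num_transitions)]


def td_error_priorities(
    td_errors: Sequence[float],
    achieving_indices: Set[int],
    low: float = 0.01,   # assumed non-zero floor for transitions outside the set
    eps: float = 1e-6,   # keeps achieving transitions strictly above the floor
) -> List[float]:
    """Variable scheme: priority grows with |TD error| inside the achieving set."""
    priorities = []
    for i, delta in enumerate(td_errors):
        if i in achieving_indices:
            priorities.append(low + abs(delta) + eps)
        else:
            priorities.append(low)
    return priorities
```

Keeping the priority of other transitions non-zero means every stored transition still has some chance of being sampled.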

12. The method of claim 1, wherein the sampling of the minibatch is performed using a stochastic prioritization approach of interpolating between greedy prioritization.
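For orientation, the commonly used form of stochastic prioritization raises each priority to an exponent before normalizing, P(i) = p(i)^α / Σ_k p(k)^α. Whether the patent uses exactly this formula is an assumption; the claim text does not give it. The sketch below uses a hypothetical function name and assumes all priorities are strictly positive.

```python
from typing import List, Sequence


def sampling_probabilities(priorities: Sequence[float], alpha: float) -> List[float]:
    """Return P(i) = p(i)**alpha / sum_k p(k)**alpha for positive priorities.

    alpha = 0 gives uniform sampling; larger alpha concentrates probability
    on the highest-priority transitions, approaching greedy selection.
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]
```

With `alpha=1.0` the probabilities are directly proportional to the stored priorities, and with `alpha=0.0` every transition is equally likely regardless of priority.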

13. The method of claim 1, wherein the step of optimizing the task policies is agnostic to a choice of the off-policy algorithm.

14. A method of training an agent, comprising: performing, by the agent, an action (at) sampled from a behavior policy (πb) for an observation (st), wherein the observation comprises information the agent receives, by any means, about an environment of the agent or the agent itself, wherein the information includes one or more of sensory information or signals received through sensory devices; compiled, abstract, or situational information compiled from a collection of the sensory devices combined with stored information; information about people or customers, or to characteristics of the people or the customers; information about internal parts of the agent; proprioceptive information; information regarding current or past actions of the agent; information about an internal state of the agent; and information already computed or processed by the agent; and a termination value for each task of a plurality of tasks for which the agent is being trained; storing a transition tuple in a main buffer of the agent, the transition tuple including {(st, at, {right arrow over (rt)}, st+1)}, where {right arrow over (rt)} is a reward vector for each task of the plurality of tasks for the agent in an environment and st+1 is a next environment state after action (at); storing a priority value p(i), of the transition tuple with index i in the main buffer; determining a probability, P(i) of sampling the transition tuple with the index i from the main buffer; determining an action probability distribution parameter, πi(st), of updating task policies for the observation st; updating transiting priorities for each transition tuple stored in the main buffer; sampling a minibatch of transition tuples to update the task networks based on the stored priority value p(i) thereof; and optimizing task policies from the updated task networks with an off-policy algorithm, wherein transitions that belong to a set of transition indices that result in achievement of task-j during an ith episode are given a priority greater than transitions that do not result in achievement of task-j during the ith episode; and data that is prioritized for one task is shared with one or more other tasks to transfer learning between multiple tasks.

15. The method of claim 14, wherein a priority value to the transitions that belong to the set of transition indices that result in achievement of task-j during the ith episode are given a constant value and transitions that are not in the set of transition indices that result in achievement of task-j during the ith episode are given a non-zero priority value.

16. The method of claim 14, wherein a priority value to the transitions that belong to the set of transition indices that result in achievement of task-j during the ith episode are given a variable value based on a magnitude of each transition's temporal-difference error.

17. A non-transitory computer-readable storage medium with an executable program stored thereon, wherein the program instructs one or more processors to perform the following steps: performing, by the agent, an action (at) sampled from a behavior policy (πb) for an observation (st), wherein the observation comprises information the agent receives, by any means, about an environment of the agent or the agent itself, wherein the information includes one or more of sensory information or signals received through sensory devices; compiled, abstract, or situational information compiled from a collection of the sensory devices combined with stored information; information about people or customers, or to characteristics of the people or the customers; information about internal parts of the agent; proprioceptive information; information regarding current or past actions of the agent; information about an internal state of the agent; and information already computed or processed by the agent; and a termination value for each task of a plurality of tasks for which the agent is being trained; storing a transition tuple in a main buffer of the agent, the transition tuple including {(st, at, {right arrow over (rt)}, st+1)}, where {right arrow over (rt)} is a reward vector for each task of the plurality of tasks for the agent in an environment and st+1 is a next environment state after action (at); storing a priority value p(i), of the transition tuple with index i in the main buffer; determining a probability, P(i) of sampling the transition tuple with the index i from the main buffer; determining an action probability distribution parameter, πi(st), of updating task policies for the observation st, updating transiting priorities for each transition tuple stored in the main buffer; sampling a minibatch of transition tuples to update the task networks based on the stored priority value p(i) thereof; and optimizing task policies from the updated task networks with an off-policy algorithm, wherein transitions that belong to a set of transition indices that result in achievement of task-j during an ith episode are given a priority greater than transitions that do not result in achievement of task-j during the ith episode; and data that is prioritized for one task is shared with one or more other tasks to transfer learning between multiple tasks.

18. The non-transitory computer-readable storage medium of claim 17, wherein either (a) a priority value to the transitions that belong to the set of transition indices that result in achievement of task-j during the ith episode are given a constant value and transitions that are not in the set of transition indices that result in achievement of task-j during the ith episode are given a non-zero priority value; or (b) the priority value to the transitions that belong to the set of transition indices that result in achievement of task-j during the ith episode are given a variable value based on a magnitude of each transition's temporal-difference error.

Patent Metadata

Filing Date: Unknown

Publication Date: April 15, 2025

Inventors: Varun Kompella, James MacGlashan, Peter Wurman, Peter Stone
