US-20250336134-A1

Modular Pipeline for High-Fidelity Hand-Arm Motion Synthesis and Multi-View Rendering

Published: October 30, 2025
Assignee: not available in the USPTO data
Inventors: not available in the USPTO data
Technical Abstract

A computer-implemented method of generating a synthetic dataset of hand and arm gestures includes generating, from a first conditional variational autoencoder comprising a first latent space and a first transformer decoder, a set of finger poses; generating, from a second conditional variational autoencoder comprising a second latent space and a second transformer decoder, a set of wrist motions; and combining the set of finger poses and the set of wrist motions to generate the synthetic dataset of hand and arm gestures.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method comprising:

2. The method of, wherein the combining comprises performing, by the processor, a Cartesian product of the set of finger poses and the set of wrist motions.

3. The method of, wherein the first conditional variational autoencoder is different than the second conditional variational autoencoder.

4. The method of, wherein the first transformer decoder has eight layers and the second transformer decoder has two layers.

5. The method of, further comprising generating, by the processor, a hand-mesh model of the hand and arm gestures.

6. The method of, wherein the set of finger poses comprises at least one number gesture, at least one trigger gesture, and at least one special gesture.

7. A method comprising:

8. The method of, further comprising removing, by the processor, overlapping faces between the hand mesh model and the arm mesh model at a wrist of the hand-arm mesh model.

9. The method of, further comprising interpolating, by the processor, between the hand mesh model and the arm mesh model at the wrist to prevent visual seams between the hand mesh model and the arm mesh model.

10. The method of, further comprising:

11. The method of, wherein the hand mesh model comprises a NIMBLE model.

12. The method of, wherein the arm mesh model comprises a SMPL-X model.

13. The method of, further comprising applying, by the processor, a global transformation to the hand-arm mesh model.

14. The method of, wherein the generating the hand mesh model comprises converting, by the processor, a MANO hand model to a NIMBLE hand model.

15. The method of, wherein the hand mesh model comprises a Handy model.

16. A method of simulating real-world camera configurations, the method comprising:

17. The method of, wherein the plurality of cameras comprises a plurality of static cameras.

18. The method of, wherein the plurality of cameras comprises a plurality of dynamic cameras.

19. The method of, wherein the plurality of dynamic cameras comprises a first camera having a close-up lens facing a palm side of the hand-arm mesh model, and a pair of stereo cameras facing a back side of the hand-arm mesh model.

20. The method of, further comprising generating the hand-arm mesh model, comprising:

Detailed Description

Complete technical specification and implementation details from the patent document.

CROSS-REFERENCE TO RELATED APPLICATION(S)

The present application claims priority to and the benefit of U.S. Provisional Application No. 63/639,339, filed Apr. 26, 2024, the entire content of which is incorporated herein by reference.

The present disclosure relates to hand gesture recognition and hand-arm mesh models.

Hand gesture databases are important to address the research and development needs in Extended Reality (XR), Human-Computer Interaction (HCI), and other domains that require data to train and evaluate hand-related models. Synthetic hand gesture databases are less costly than 3D capturing and annotating real-world data. However, related art dynamic hand gesture datasets often constrain gestures to fixed combinations of global wrist motions and specific finger poses (i.e., a rigid definition of hand gestures and a lack of motion modularity). Some synthetic hand pipelines may focus on limited 3D hands with random poses under limited viewpoints. Accordingly, some synthetic hand datasets may lack semantically meaningful gestures, motion dynamism, and data variation. For example, in some systems, simple wrist movements like moving a fist left, right, up, or down may be treated as distinct, unrelated gestures. Such rigid definitions may fail to capture the semantic meaning, and the potential variability and flexibility, of hand motions. Some synthetic hand datasets may therefore lack sufficient variation in hand shapes, gestures, dynamics, and viewpoints to robustly train and test 3D hand pose estimation (HPE) and hand gesture recognition (HGR) systems.

Additionally, some hand gesture databases may lack full hand-arm dynamics (i.e., realistic coordination between forearm, wrist, and fingers may be missing from datasets). For instance, unless specifically designed for very limited 3D models, the forearms may not be dynamically aligned with the hands.

The above information disclosed in this Background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not constitute prior art.

The present disclosure relates to various embodiments of a computer-implemented method of generating a synthetic dataset of hand and arm gestures. In one embodiment, the method includes generating, from a first conditional variational autoencoder including a first latent space and a first transformer decoder, a set of finger poses; generating, from a second conditional variational autoencoder including a second latent space and a second transformer decoder, a set of wrist motions; and combining the set of finger poses and the set of wrist motions to generate the synthetic dataset of hand and arm gestures.

The combining may include performing a Cartesian product of the set of finger poses and the set of wrist motions.

The first conditional variational autoencoder may be different than the second conditional variational autoencoder.

The first transformer decoder may have eight layers, and the second transformer decoder may have two layers.

The method may also include generating a hand-mesh model of the hand and arm gestures.

The set of finger poses may include at least one number gesture, at least one trigger gesture, and at least one special gesture.

The present disclosure also relates to various embodiments of a computer-based method of generating a hand-arm mesh model. In one embodiment, the method includes generating a hand mesh model; and joining an arm mesh model to the hand mesh model.

Joining the arm mesh model to the hand mesh model includes identifying wrist boundary vertices of the hand mesh model and the arm mesh model; ensuring a number of the wrist boundary vertices of the hand mesh model is equal to a number of the wrist boundary vertices of the arm mesh model; and applying a wrist rotation matrix to the hand mesh model.
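The joining steps above can be illustrated with a minimal sketch. It assumes that the two wrist boundaries are closed loops of 3D points, that the vertex counts are matched by resampling the hand loop (the disclosure does not say how the counts are made equal, nor which mesh is adjusted), and that the wrist rotation is a rotation about the z-axis. The helper names `resample_loop`, `rotation_z`, and `apply_rotation` are hypothetical and are not taken from NIMBLE or SMPL-X.

```python
import math

def resample_loop(loop, n):
    """Resample a closed loop of 3D points to exactly n points.

    Assumption: linear interpolation along the loop's index parameter.
    """
    m = len(loop)
    out = []
    for i in range(n):
        t = i * m / n                 # fractional index into the source loop
        j = int(math.floor(t)) % m
        k = (j + 1) % m               # wrap around because the loop is closed
        w = t - math.floor(t)
        out.append(tuple((1 - w) * a + w * b for a, b in zip(loop[j], loop[k])))
    return out

def rotation_z(angle_rad):
    """3x3 rotation matrix about the z-axis (illustrative wrist rotation)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_rotation(R, vertices):
    """Apply a 3x3 rotation matrix to every vertex."""
    return [tuple(sum(R[r][c] * v[c] for c in range(3)) for r in range(3))
            for v in vertices]

# Toy boundaries: 4 vertices on the hand side, 6 on the arm side.
hand_boundary = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
arm_boundary = [(math.cos(2 * math.pi * i / 6), math.sin(2 * math.pi * i / 6), 0.0)
                for i in range(6)]

hand_matched = resample_loop(hand_boundary, len(arm_boundary))  # equalize counts
hand_rotated = apply_rotation(rotation_z(math.pi / 2), hand_matched)
```

Once the counts match, each hand boundary vertex has exactly one counterpart on the arm boundary, which is what makes a one-to-one stitch possible.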

The method may also include removing overlapping faces between the hand mesh model and the arm mesh model at a wrist of the hand-arm mesh model.

The method may also include interpolating between the hand mesh model and the arm mesh model at the wrist to prevent visual seams between the hand mesh model and the arm mesh model.

The method may also include applying a skin texture to the hand mesh model; and propagating the skin texture of the hand mesh model to the arm mesh model.

The hand mesh model may be a NIMBLE model.

The arm mesh model may be a SMPL-X model.

The method may include applying a global transformation to the hand-arm mesh model.

Generating the hand mesh model may include converting a MANO hand model to a NIMBLE hand model.

The hand mesh model may be a Handy model.

The present disclosure also relates to various embodiments of simulating real-world camera configurations. The method may include arranging cameras in a hemispherical configuration around a hand-arm mesh model; and capturing hand motions of the hand-arm mesh model from different perspectives with the cameras.

The cameras may include static cameras.

The cameras may include dynamic cameras.

The dynamic cameras may include a first camera having a close-up lens facing a palm side of the hand-arm mesh model, and a pair of stereo cameras facing a back side of the hand-arm mesh model.

The method may also include generating the hand-arm mesh model, which may include generating, by a processor, a hand mesh model; and joining, by the processor, an arm mesh model to the hand mesh model to generate the hand-arm mesh model. Joining the arm mesh model to the hand mesh model may include identifying, by the processor, wrist boundary vertices of the hand mesh model and the arm mesh model; controlling, by the processor, a number of the wrist boundary vertices of the hand mesh model to be equal to a number of the wrist boundary vertices of the arm mesh model; and applying, by the processor, a wrist rotation matrix to the hand mesh model.

This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in limiting the scope of the claimed subject matter. One or more of the described features may be combined with one or more other described features to provide a workable method or device.

Hereinafter, example embodiments will be described in more detail with reference to the accompanying drawings, in which like reference numbers refer to like elements throughout. The present invention, however, may be embodied in various different forms, and should not be construed as being limited to only the illustrated embodiments herein. Rather, these embodiments are provided as examples so that this disclosure will be thorough and complete, and will fully convey the aspects and features of the present invention to those skilled in the art. Accordingly, processes, elements, and techniques that are not necessary to those having ordinary skill in the art for a complete understanding of the aspects and features of the present invention may not be described. Unless otherwise noted, like reference numerals denote like elements throughout the attached drawings and the written description, and thus, descriptions thereof may not be repeated.

In the drawings, the relative sizes of elements, layers, and regions may be exaggerated and/or simplified for clarity. Spatially relative terms, such as “beneath,” “below,” “lower,” “under,” “above,” “upper,” and the like, may be used herein for ease of explanation to describe one element or feature's relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or in operation, in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as “below” or “beneath” or “under” other elements or features would then be oriented “above” the other elements or features. Thus, the example terms “below” and “under” can encompass both an orientation of above and below. The device may be otherwise oriented (e.g., rotated degrees or at other orientations) and the spatially relative descriptors used herein should be interpreted accordingly.

It will be understood that, although the terms “first,” “second,” “third,” etc., may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section described below could be termed a second element, component, region, layer or section, without departing from the spirit and scope of the present invention.

It will be understood that when an element or layer is referred to as being “on,” “connected to,” or “coupled to” another element or layer, it can be directly on, connected to, or coupled to the other element or layer, or one or more intervening elements or layers may be present. In addition, it will also be understood that when an element or layer is referred to as being “between” two elements or layers, it can be the only element or layer between the two elements or layers, or one or more intervening elements or layers may also be present.

The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting of the present invention. As used herein, the singular forms “a” and “an” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and “including,” when used in this specification, specify the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list.

As used herein, the term “substantially,” “about,” and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent variations in measured or calculated values that would be recognized by those of ordinary skill in the art. Further, the use of “may” when describing embodiments of the present invention refers to “one or more embodiments of the present invention.” As used herein, the terms “use,” “using,” and “used” may be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively. Also, the term “exemplary” is intended to refer to an example or illustration.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification, and should not be interpreted in an idealized or overly formal sense, unless expressly so defined herein.

The present disclosure relates to various embodiments of a method of generating a high-fidelity synthetic dataset of hand and arm gestures utilizing a phase-aware conditional variational autoencoder (CVAE) framework. The dataset of hand and arm gestures may be utilized in applications such as Hand Pose Estimation (HPE), Hand Gesture Recognition (HGR), Extended Reality (XR), Human-Computer Interaction (HCI), or other domains that require realistic synthetic data to train and evaluate hand-related models. A synthetic hand gesture dataset that includes semantically meaningful gestures, motion dynamism, and data variation is configured to improve the training of hand-related models (e.g., hand-related models utilized in HPE, HGR, XR, or HCI applications). Moreover, generating a synthetic hand gesture dataset is less costly than 3D capturing and annotating real-world data (e.g., utilizing an array of cameras to capture real-world hand poses and then annotating those poses before utilizing the captured images to train a hand-related model). In one or more embodiments, the synthetic dataset generation is split into two CVAE streams (i.e., a dual-CVAE architecture): one for finger gestures (i.e., local gestures) and another for wrist motions (i.e., global motions). The two streams are combined via a Cartesian product to create a diverse and flexible hand gesture database that includes gesture categories beyond those in existing related art datasets.

The present disclosure also relates to various embodiments of a cut-and-stitch method of generating a hand-arm mesh model by stitching an arm mesh template (e.g., a SMPL-X arm mesh model) to a NIMBLE hand mesh model. In one or more embodiments, the cut-and-stitch method is configured to enable dynamic hand articulation while keeping the arm's attachment stable. Additionally, in one or more embodiments, the cut-and-stitch method is configured to minimize (or at least reduce) visual seams between the hand mesh model and the arm model and to propagate the skin texture/tone of the hand model to the arm to ensure a uniform (or substantially uniform) skin tone between the hand and the arm and thereby maintain realism.
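As a rough illustration of how interpolation at the wrist can reduce a visible seam, the sketch below linearly blends two matched rings of boundary vertices. It assumes the hand and arm rings already have equal vertex counts and one-to-one correspondence, and it uses plain linear interpolation, which the disclosure does not confirm as the actual scheme. The names `blend_rings` and `transition_rings` are hypothetical.

```python
def blend_rings(arm_ring, hand_ring, weight):
    """Linearly interpolate matched vertex pairs.

    weight = 0.0 returns the arm ring, weight = 1.0 returns the hand ring.
    """
    if len(arm_ring) != len(hand_ring):
        raise ValueError("rings must have the same number of vertices")
    return [tuple((1 - weight) * a + weight * h for a, h in zip(av, hv))
            for av, hv in zip(arm_ring, hand_ring)]

def transition_rings(arm_ring, hand_ring, steps):
    """Build `steps` intermediate rings going smoothly from arm to hand."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [blend_rings(arm_ring, hand_ring, s / (steps - 1)) for s in range(steps)]

# Toy rings with a deliberate gap between arm and hand boundaries.
arm_ring = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
hand_ring = [(2.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
rings = transition_rings(arm_ring, hand_ring, 3)
```

The same weighting idea could in principle be applied to per-vertex colors when propagating the hand's skin tone onto the arm, although that is an extrapolation rather than something the text specifies.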

The present disclosure also relates to various embodiments of simulating a real-world setup of cameras in a hemispherical configuration around a hand-arm mesh model. The cameras are configured to capture diverse perspectives of the hand gestures articulated by the hand-arm mesh model. The cameras may include static cameras and/or dynamic cameras.
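A hemispherical camera arrangement can be sketched as follows. The sketch assumes the hand-arm mesh sits at the origin, that cameras are placed on evenly spaced rings of elevation and azimuth, and that every camera looks at the origin. The ring counts, the radius, and the function name `hemisphere_cameras` are illustrative assumptions, not values or names from the disclosure.

```python
import math

def hemisphere_cameras(n_azimuth, n_elevation, radius):
    """Place cameras on the upper hemisphere of a sphere around the origin."""
    cameras = []
    for e in range(n_elevation):
        # Elevation strictly between the horizon (0) and the zenith (pi/2).
        elev = (e + 1) / (n_elevation + 1) * (math.pi / 2)
        for a in range(n_azimuth):
            azim = 2 * math.pi * a / n_azimuth
            position = (
                radius * math.cos(elev) * math.cos(azim),
                radius * math.cos(elev) * math.sin(azim),
                radius * math.sin(elev),
            )
            cameras.append({"position": position, "look_at": (0.0, 0.0, 0.0)})
    return cameras

# Example: 3 rings of 8 cameras each, half a unit from the mesh.
cameras = hemisphere_cameras(n_azimuth=8, n_elevation=3, radius=0.5)
```

Static cameras would keep these positions for a whole sequence, whereas dynamic cameras would update `position` and `look_at` per frame.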

A flowchart in the drawings illustrates aspects of a method of generating a high-fidelity synthetic dataset of hand and arm gestures according to one embodiment of the present disclosure. Although the flowchart illustrates various operations in a method of generating a high-fidelity synthetic dataset of hand and arm gestures according to some embodiments, embodiments according to the present disclosure are not limited thereto. For example, according to various embodiments, the method may include additional operations, or fewer operations, or the order of operations may vary, unless otherwise stated or implied, without departing from the spirit and scope of embodiments according to the present disclosure.

As illustrated and described below, embodiments according to the present disclosure may utilize a dual conditional variational autoencoder (CVAE) architecture that separately models global wrist motions and local finger gestures, which may then be combined to create relatively diverse and flexible hand gestures. For example, in the illustrated embodiment, the method includes a task of generating a set of finger poses (e.g., finger poses that carry semantic meaning, such as an extended index finger to represent the number “1” or the thumb and index finger closed in a circle to represent “ok”). In one or more embodiments, the task of generating the finger poses utilizes a first conditional variational autoencoder (CVAE). CVAEs are unsupervised generative models that are configured to generate samples from an input by encoding the input data into a latent representation and then reconstructing the input from the latent space. CVAEs extend variational autoencoders (VAEs) by incorporating conditional information, such as class labels, during training and inference, which enables the controlled generation of data based on specific attributes or labels.

Additionally, in the illustrated embodiment, the method also includes a task of generating a set of wrist motions. These wrist motions are indicative or representative of global hand motions. In one or more embodiments, the task of generating the wrist motions utilizes a second CVAE. The second CVAE may be different than the first CVAE utilized to generate the set of finger poses.

Thus, in contrast to some systems or datasets that may constrain gestures to fixed combinations of global wrist motions and specific finger poses, embodiments according to the present disclosure may be capable of separately modeling global wrist motions and local finger gestures, as shown in the finger-pose and wrist-motion generation tasks and discussed in more detail below.

A schematic representation in the drawings shows a CVAE utilized in the finger-pose and wrist-motion generation tasks. The left side of the diagram depicts the CVAE during training and the right side of the diagram depicts the CVAE during inference (e.g., generation of the finger poses or the wrist motions). As illustrated, the CVAE includes a transformer encoder and a transformer decoder. During training of the CVAE, gesture labels, 3D joints, pose parameters, and phase labels are linearized and tokenized and then input into the transformer encoder, which encodes the input data and outputs parameters of a probability distribution (e.g., the mean and variance of a Gaussian distribution) over a latent space, which is a lower-dimensional, continuous space where the input data is encoded. The pose parameters refer to the configuration of the fingers (e.g., index finger pointing; thumb up; ok shape) and the phase labels refer to the extent to which the finger configuration has transitioned into the final finger pose (e.g., initial position; transitioning; or final position). The transformer decoder is configured to create new data that resembles the input data by sampling (e.g., utilizing a reparameterization gradient estimator, also known as the reparameterization trick) from the distribution in the latent space. In one or more embodiments, the CVAE utilized in the finger-pose generation task and/or the wrist-motion generation task may be the same as or similar to the CVAE described in U.S. Provisional Application No. 63/707,422, the entire contents of which are incorporated herein by reference.
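The reparameterization trick mentioned above can be shown in a few lines. Instead of sampling a latent vector directly from a Gaussian with the encoder's mean and variance, standard normal noise is drawn and then shifted and scaled, which keeps the sample differentiable with respect to the encoder outputs. The sketch assumes the encoder emits a mean and a log-variance per latent dimension (log-variance is a common parameterization, not one stated in the disclosure), and the function name `reparameterize` is hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) per latent dimension.

    mu and log_var are equal-length lists; sigma = exp(0.5 * log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)
mu = [0.5, -1.0, 2.0]          # illustrative encoder means
log_var = [0.0, -2.0, 1.0]     # illustrative encoder log-variances
z = reparameterize(mu, log_var, rng)
```

Because the randomness lives entirely in `eps`, gradients can flow through `mu` and `log_var` during training in an automatic differentiation framework.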

In one or more embodiments, the transformer decoder of the first CVAE, utilized to generate the finger poses, has more layers than the transformer decoder of the second CVAE, utilized to generate the wrist motions. For instance, in one or more embodiments, the transformer decoder of the first CVAE has eight (8) layers and the transformer decoder of the second CVAE has two (2) layers.

During the finger-pose generation task, inference of the first CVAE is performed by inputting a text-based finger gesture label (e.g., “finger pinch”; “finger swipe”; or “finger snap”) into the first CVAE. The latent space is then randomly sampled and this sample is input into the transformer decoder. The transformer decoder generates a projection from the latent sample, and the projection outputs three-dimensional joints, pose parameters, and phase labels. In one or more embodiments, the output is a skeleton of joints in a configuration corresponding to the input finger gesture label (e.g., a skeleton of finger joints with the thumb and the index fingertips touching each other in response to the input finger gesture label being “finger pinch”). In this manner, the inference process in this task is configured to synthesize diverse three-dimensional finger gesture sequences (i.e., by sampling from the latent space, a variety of different 3D joints/mesh corresponding to the input text-based finger gesture label are generated).

During the wrist-motion generation task, inference of the second CVAE is performed by inputting a text-based global (wrist) gesture label (e.g., “circle,” “cross,” or “upward movement”) into the second CVAE. The latent space is then randomly sampled and this sample is then input into the transformer decoder. The transformer decoder generates a projection from the latent sample, and the projection outputs three-dimensional joints/mesh, pose parameters, and phase labels. In this manner, the inference process in this task is configured to synthesize diverse wrist gesture sequences (i.e., by sampling from the latent space, a variety of different 3D joints/mesh corresponding to the input text-based wrist gesture label are generated).
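The inference flow described for both streams (condition on a text label, sample the latent space, decode) can be sketched with a placeholder decoder. Everything below other than the eight-layer and two-layer decoder depths is an assumption made for illustration: the latent dimension, the standard normal prior, the names `StreamConfig` and `sample_gestures`, and especially `toy_decoder`, which is a stand-in and not a trained transformer.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamConfig:
    name: str
    decoder_layers: int   # from the text: 8 for fingers, 2 for wrist
    latent_dim: int       # hypothetical; not specified in the source

FINGER_STREAM = StreamConfig("finger", decoder_layers=8, latent_dim=16)
WRIST_STREAM = StreamConfig("wrist", decoder_layers=2, latent_dim=16)

def toy_decoder(label, z):
    """Stand-in for the transformer decoder and projection (not a real model)."""
    offset = sum(ord(ch) for ch in label) % 7   # crude label conditioning
    return {"label": label, "pose_parameters": [offset + v for v in z]}

def sample_gestures(label, decoder, config, n_samples, rng):
    """Generate n_samples outputs for one label by sampling the latent space."""
    outputs = []
    for _ in range(n_samples):
        # Assumption: latent samples are drawn from a standard normal prior.
        z = [rng.gauss(0.0, 1.0) for _ in range(config.latent_dim)]
        outputs.append(decoder(label, z))
    return outputs

rng = random.Random(0)
pinches = sample_gestures("finger pinch", toy_decoder, FINGER_STREAM, 3, rng)
circles = sample_gestures("circle", toy_decoder, WRIST_STREAM, 3, rng)
```

The point of the sketch is only that one label yields many distinct outputs, because each call draws a fresh latent sample.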

With reference again to the embodiment illustrated in the flowchart, the method also includes a task of combining the set of finger poses with the set of wrist motions to generate the synthetic dataset of hand and arm gestures. Accordingly, the finger gestures and the global wrist motions are synthesized separately in their respective generation tasks and then combined. Together, the dual process streams (i.e., the finger gestures generated from the first CVAE and the wrist motions generated from the second CVAE) generate diverse three-dimensional hand gesture sequences (i.e., diverse finger and wrist combinations are integrated to form a dataset of diverse hand gestures). In this manner, combining the two streams generates a wide range of meaningful hand gestures that extends beyond alternative datasets that may have rigid definitions or constrained combinations of hand poses, gestures, or movement trajectories. That is, the dataset of diverse hand gestures generated according to embodiments of the present disclosure provides an improvement over related art datasets that have limited hand gestures, and these enriched datasets represent a broad spectrum of gesture classes.

In one or more embodiments, the method may include a task of utilizing the hand and arm gesture database generated in the combining task to train a hand pose estimation model (e.g., Mobile-StereoHPE) and/or to train a hand gesture recognition model (e.g., Fast-DNN). In one or more embodiments, the trained models (e.g., the trained hand pose estimation model and/or the trained hand gesture recognition model) may be incorporated in an extended reality (XR) device, such as an augmented reality (AR) device, a virtual reality (VR) device, or a mixed reality device. In one or more embodiments, the hand and arm gesture database generated in the combining task may be utilized in a synthetic saliency-bokeh video dataset for a cinematic video project.

One of the drawings depicts a set of local gestures (i.e., finger gestures) and a set of global wrist motions. The finger gestures include gestures representing a number (e.g., the thumb or index finger extending to represent the number one, two digits extending to represent the number two, etc.), finger trigger gestures (e.g., tips of the thumb and index finger touching, tips of the thumb and middle finger touching, tips of the thumb and ring finger touching, or tips of the thumb and pinky finger touching), and finger special poses (e.g., a snap, a heart shape, a phone-call gesture, an OK gesture, etc.). The global wrist motions include left movement, right movement, a circular movement, a cross-shaped movement, forward movement, backward movement, upward movement, downward movement, etc. The right side of that drawing depicts the Cartesian product of these finger gestures and the wrist motions, such as an index finger pointing and the wrist moving in a circle, fingers closing into a fist and sliding (translating) to the right, an open palm rotating to the left, an open palm moving in a circular motion, etc.
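The Cartesian product of finger gestures and wrist motions can be sketched at the label level with the standard library. The gesture names below are taken from the examples in the text, but the dictionary layout and the function name `combine` are illustrative assumptions. An actual pipeline would presumably pair generated motion sequences rather than labels alone.

```python
from itertools import product

# Finger gestures grouped by the three categories named in the text.
FINGER_GESTURES = {
    "number": ["one", "two"],
    "trigger": ["thumb-index touch", "thumb-middle touch"],
    "special": ["snap", "heart", "phone call", "ok"],
}
WRIST_MOTIONS = ["left", "right", "circle", "cross",
                 "forward", "backward", "up", "down"]

def combine(finger_gestures, wrist_motions):
    """Pair every finger gesture with every wrist motion."""
    fingers = [(category, name)
               for category, names in finger_gestures.items()
               for name in names]
    return [{"category": category, "finger": name, "wrist": motion}
            for (category, name), motion in product(fingers, wrist_motions)]

dataset = combine(FINGER_GESTURES, WRIST_MOTIONS)
print(len(dataset))  # 8 finger gestures x 8 wrist motions = 64
```

Because the two sets are modeled independently, adding one new wrist motion adds one new combined gesture for every existing finger gesture without retraining the finger stream.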



Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “MODULAR PIPELINE FOR HIGH-FIDELITY HAND-ARM MOTION SYNTHESIS AND MULTI-VIEW RENDERING” (US-20250336134-A1). https://patentable.app/patents/US-20250336134-A1
