US-20260087369-A1

Method and Apparatus for Continuous Learning Using Asymmetric Structures

Published: March 26, 2026

Assignee: not available in the USPTO data

Inventors: Min Jae JUNG, Joo Hee KIM, Seung Jin OH, Seung Taek KIM, Kyung Shil KANG

Technical Abstract

The present invention aims to minimize catastrophic forgetting that occurs during the continual learning process of a large language model (LLM), and to improve the model's performance by efficiently acquiring new knowledge. A memory of a continual learning apparatus using an asymmetric structure according to one embodiment of the present invention may store datasets used for continual learning, model parameters, a router, and adapters. At least one processor may be configured to perform continual learning using a neural network model comprising a shallow layer and a deep layer; to add a new adapter corresponding to new learning to the deep layer whenever new data is learned; and to distribute input text to at least one of the shallow layer, the deep layer, and the adapter via a router provided between the shallow layer and the deep layer.

Patent Claims

1. A continual learning apparatus using an asymmetric structure, the continual learning apparatus comprising: a memory configured to store a dataset, model parameters, a router, and adapters used for continual learning; and at least one processor configured to communicate with the memory, wherein the at least one processor is configured to: perform continual learning using a neural network model comprising a shallow layer and a deep layer, add an adapter corresponding to new learning to the deep layer whenever new data is learned, distribute input text to at least one of the shallow layer, the deep layer, and the adapter via the router provided between the shallow layer and the deep layer, perform regularization on the adapter to improve learning efficiency of the neural network model, perform the regularization using an orthogonal loss function to maintain independence of information between different adapters, perform the regularization using the orthogonal loss function by calculating orthogonality between output data of different adapters and projecting the output data onto different planes, perform Sparse Low-rank Adaptation (SoRA) based on a gate vector provided between the adapters, and determine a capacity of each of the adapters based on characteristics of the input text, wherein the gate vector is composed of a vector in a rank dimension, wherein the gate vector adjusts the rank by performing a Hadamard product with a feature vector that has passed through a predetermined layer, wherein the orthogonal loss function is determined by Equation 1 as follows: [Equation 1] wherein L denotes a loss function, and A_i and A_j denote an i-th and a j-th A adapter, respectively, wherein the Hadamard product is performed based on Equation 2 as follows: [Equation 2] wherein h represents a forward pass of each adapter, A and B represent parameters constituting each adapter, g represents the vector in the rank dimension, and x represents the feature vector that has passed through the predetermined layer.

2. The continual learning apparatus using the asymmetric structure of claim 1, wherein the at least one processor is configured to: distribute the input text based on a feature corresponding to the deep layer via the router.

3. The continual learning apparatus using the asymmetric structure of claim 1, wherein the at least one processor is configured to: improve parameter efficiency by using Low-Rank Adaptation (LoRA) through the adapter provided in the deep layer and using at least one portion of the neural network model.

4. The continual learning apparatus using the asymmetric structure of claim 1, wherein the at least one processor is configured to: increase a learning speed of an added adapter by adjusting an update of parameters learned prior to a reference point at a predetermined ratio using a Gradient Decoupling Layer (GDL).

5. The continual learning apparatus using the asymmetric structure of claim 1, wherein the at least one processor is configured to: store data samples from tasks prior to a reference point in the memory, and perform replay based on the data samples.

6. The continual learning apparatus using the asymmetric structure of claim 1, wherein the at least one processor is configured to: determine a cross-entropy loss between the input text and a corresponding task identifier (ID), and train the router to assign the input text to the adapter based on the cross-entropy loss.

7. A method of a continual learning using an asymmetric structure, the method comprising: storing a dataset, model parameters, a router, and adapters used for continual learning; performing the continual learning using a neural network model comprising a shallow layer and a deep layer; adding an adapter corresponding to new learning to the deep layer whenever new data is learned; and distributing input text to at least one of the shallow layer, the deep layer, and the adapter via the router provided between the shallow layer and the deep layer, wherein the performing the continual learning comprises: performing regularization on the adapter to improve learning efficiency of the neural network model; performing the regularization using an orthogonal loss function to maintain independence of information between different adapters; performing the regularization using the orthogonal loss function by calculating orthogonality between output data of different adapters and projecting the output data onto different planes; performing Sparse Low-rank Adaptation (SoRA) based on a gate vector provided between the adapters; and determining a capacity of each of the adapters based on characteristics of the input text, wherein the gate vector is composed of a vector in a rank dimension, wherein the gate vector adjusts the rank by performing a Hadamard product with a feature vector that has passed through a predetermined layer, wherein the orthogonal loss function is determined by Equation 1 as follows: [Equation 1] wherein L denotes a loss function, and A_i and A_j denote an i-th and a j-th A adapter, respectively, wherein the Hadamard product is performed based on Equation 2 as follows: [Equation 2] wherein h represents a forward pass of each adapter, A and B represent parameters constituting each adapter, g represents the vector in the rank dimension, and x represents the feature vector that has passed through the predetermined layer.

8. The method of claim 7, further comprising: distributing the input text based on a feature corresponding to the deep layer via the router.

9. The method of claim 7, further comprising: increasing a learning speed of an added adapter by adjusting the update of parameters learned prior to a reference point at a predetermined ratio using a Gradient Decoupling Layer (GDL).

10. The method of claim 7, further comprising: determining a cross-entropy loss between the input text and a corresponding task identifier (ID), and training the router to assign the input text to the adapter based on the cross-entropy loss.

Detailed Description

This application claims the benefit of Korean Patent Application No. 10-2024-0129310, filed on Sep. 24, 2024, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.

The present invention relates to a technology in the field of deep learning, and more particularly, to a method and an apparatus for alleviating catastrophic forgetting occurring during continual learning in a large language model (LLM) and for effectively acquiring new knowledge during the continual learning process.

With recent advances in deep learning technology, large language models (LLMs) have demonstrated remarkable performance in various natural language processing tasks, including text generation, translation, and question answering. However, LLMs suffer from a problem of catastrophic forgetting, in which previously acquired knowledge is lost during the process of learning new tasks.

To utilize LLMs efficiently, it is essential to enable continual learning capabilities that allow the model to retain previously learned knowledge while learning new tasks.

Conventional continual learning methods have limitations in that they fail to completely resolve the issue of catastrophic forgetting or exhibit degraded learning performance on new tasks.

A memory of a continual learning apparatus using an asymmetric structure according to one embodiment of the present invention may store a dataset used for continual learning, model parameters, a router, and adapters.

At least one processor may be configured to perform continual learning using a neural network model comprising a shallow layer and a deep layer, to add a new adapter corresponding to new learning to the deep layer whenever new data is learned, and to distribute input text to at least one of the shallow layer, the deep layer, and the adapter via a router provided between the shallow layer and the deep layer.

The at least one processor may be configured to distribute the input text based on a feature corresponding to the deep layer via the router.

The at least one processor may be configured to improve parameter efficiency by using Low-Rank Adaptation (LoRA) through the adapter provided in the deep layer, utilizing at least a portion of the neural network model.

The at least one processor may be configured to perform regularization on the adapter to improve the learning efficiency of the neural network model.

The at least one processor may be configured to perform the regularization using an orthogonal loss function in order to maintain the independence of information between different adapters.

The at least one processor may be configured to perform Sparse Low-rank Adaptation (SoRA) based on a gate vector provided between the adapters, and to determine the capacity of each adapter based on the characteristics of the input text.

The at least one processor may be configured to increase the learning speed of the added adapter by adjusting the update of parameters learned prior to a reference point at a predetermined ratio using a Gradient Decoupling Layer (GDL).

The memory may store data samples from tasks prior to a reference point, and the at least one processor may be configured to perform replay based on the data samples.

The at least one processor may be configured to determine a cross-entropy loss between the input text and a corresponding task ID, and to train the router to assign the input text to the adapter based on the cross-entropy loss.
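As a rough illustration of the router training described above, the following sketch computes a cross-entropy loss between the router's output for one input and that input's task ID. It assumes the router emits one score per adapter and that a task ID directly indexes the adapter for that task; the function names and example values are hypothetical and do not come from the specification.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def router_cross_entropy(logits, task_id):
    """Cross-entropy between the router distribution and a one-hot task ID."""
    probs = softmax(logits)
    return -math.log(probs[task_id])

# Hypothetical router scores for three adapters.
logits = [0.2, 2.0, -1.0]
loss_correct = router_cross_entropy(logits, task_id=1)  # router favors adapter 1
loss_wrong = router_cross_entropy(logits, task_id=2)    # router disfavors adapter 2
```

Minimizing this loss over labeled inputs pushes the router to place most of its probability on the adapter whose index matches the task ID, which is why the loss is lower when the label agrees with the highest score.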

A continual learning method using an asymmetric structure according to one embodiment of the present invention includes storing, by a memory, a dataset used for continual learning, model parameters, a router, and adapters, and performing, by a processor, continual learning using a neural network model comprising a shallow layer and a deep layer.

The performing of the continual learning includes: learning, by the processor, general data in the shallow layer; learning, by the processor, new data, which differs from the general data, in the deep layer; adding, by the processor, a new adapter corresponding to the new learning to the deep layer whenever the new data is learned; and distributing, by the processor, input text to at least one of the shallow layer, the deep layer, and the adapter via a router provided between the shallow layer and the deep layer.

Throughout the present disclosure, like reference numerals refer to like components. The present disclosure does not describe all elements of the embodiments, and general matters in the relevant technical field or redundant contents among embodiments are omitted.

The terms “unit,” “module,” “element,” and “block” used in the specification may be implemented in software or hardware, and in some embodiments, a plurality of such units, modules, elements, or blocks may be implemented as a single component, or a single unit, module, element, or block may include a plurality of components.

In the present specification, when a part is described as being “connected” to another part, it includes not only direct connections but also indirect connections, where the indirect connection may include a connection through a wireless communication network.

In addition, when a part is described as “including” a certain component, unless explicitly stated otherwise, it does not exclude other components and may further include additional components.

In the present specification, when one element is described as being “on” another element, it includes not only the case where the one element is in contact with the other element, but also the case where another element is interposed between the two elements.

The terms such as first, second, and the like are used merely to distinguish one component from another component, and the components are not limited by the terms.

The singular expressions are intended to include plural forms as well, unless the context clearly indicates otherwise.

In each step, reference numerals are used merely for convenience of explanation, and do not indicate the order of the steps. Unless a specific order is clearly described in the context, the steps may be performed in an order different from that explicitly stated.

The present invention aims to minimize catastrophic forgetting that occurs during the continual learning process of an LLM and to enhance model performance by efficiently learning new knowledge.

The present invention preserves previously acquired knowledge by alleviating catastrophic forgetting occurring during the continual learning process of an LLM.

The present invention improves model performance by efficiently acquiring knowledge for new tasks.

By suppressing catastrophic forgetting and effectively acquiring new knowledge, the present invention facilitates the reuse of a well-trained pretrained model.

The present invention enhances model reusability, thereby saving computational resources and energy required for training LLMs, and improving economic and environmental efficiency.

Hereinafter, the operating principles and embodiments of the present disclosure will be described with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating the overall configuration of a continual learning apparatus using an asymmetric structure, presenting the connections among a memory, a processor, a shallow layer, a deep layer, a router, and adapters, as well as the flow of data.

The figure visually represents the process in which new adapters (a1 . . . aN) are added to the deep layer (Ld) and the process in which input text is distributed to each layer or adapter (a1 . . . aN) via the router (R).

FIG. 1 is a diagram illustrating the overall structure of a Progressive Mixture of Experts with Asymmetric Transformer (PMoE) model and its operation in a continual learning process.

The memory serves to store various types of information required for continual learning. Specifically, it stores datasets for each task, model parameters including pretrained weights and adapters, and the weights of the router (R).

The processor is a core component that reads the data and model information stored in the memory and performs the actual continual learning process.

The processor performs continual learning by utilizing a neural network model composed of a shallow layer and a deep layer.

Within the neural network model, the shallow layers (Ls) are specialized in retaining general knowledge or content learned from previous tasks. Even during the continual learning process, the model structure remains fixed, which helps prevent forgetting of previously acquired knowledge.

Each transformer block may be connected to one or more adapters (LoRA).

The deep layers (Ld) are specialized in learning task-specific knowledge for new tasks.

Each time a new task is learned, an adapter responsible for the specialized knowledge of that task may be added to the deep layers (Ld).

Each transformer block may be connected to multiple adapters, and the router can select an appropriate adapter for a given input text to process the information.

The router (R) is positioned between the shallow layers and the deep layers, and it analyzes the input text and distributes it to a suitable layer or adapter.

The distribution by the router may be determined based on the following equation.

Referring to Equation 1, G(x) may represent a probability distribution determined by the router, and x may represent an input sequence. Wg may represent a linear layer forming the network of the router.
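The equation itself is not reproduced in this text. As a minimal sketch consistent with the definitions above, the following assumes the common form G(x) = softmax(Wg x), where x is treated as a single pooled feature vector and each row of Wg scores one adapter. The softmax, the shapes, and the example values are assumptions made for illustration.

```python
import math

def router_probs(W_g, x):
    """Assumed form G(x) = softmax(W_g x): one probability per adapter."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in W_g]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sizes: 3 adapters, 4-dimensional feature vector.
W_g = [[0.5, -0.2, 0.1, 0.0],
       [0.1, 0.9, -0.3, 0.2],
       [-0.4, 0.0, 0.6, 0.3]]
x = [1.0, 2.0, 0.5, -1.0]

G = router_probs(W_g, x)
# Index of the adapter with the highest assignment probability.
chosen = max(range(len(G)), key=G.__getitem__)
```

With these values the second row of Wg produces the largest score, so the input would be routed to the adapter at index 1.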

The router may utilize features of the deep layer to identify the characteristics of the input text and, based on this, distribute the input text to the shallow layer, the deep layer, or a specific adapter.

Each adapter may be specialized for a specific task or type of knowledge and may be responsible for processing information related to that task.

In the deep layer, a new adapter may be added whenever a new task is learned, thereby extending the representational capacity of the model.

Adapters may use Low-Rank Adaptation (LoRA) to increase parameter efficiency. That is, instead of updating the parameters of the entire model, only a small number of parameters are updated and applied to the new task.

The PMoE model can efficiently learn new knowledge while preserving previously acquired knowledge through an asymmetric structure between the shallow layers and the deep layers.

Meanwhile, the self-attention layer constituting a transformer block may be determined based on the following equation.

F may represent a layer constituting the model, and h may represent a forward pass based on the input sequence. τ denotes the layer index that separates the shallow layers from the deep layers.

Meanwhile, W_l and W′_l may be determined based on the following equations.

Equation 3 may represent the configuration of the self-attention layer corresponding to the shallow layer, and Equation 4 may represent the configuration of the self-attention layer corresponding to the deep layer.

A and B, as presented in Equations 3 and 4, may include trainable parameters that constitute the adapter.
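Equations 3 and 4 are not reproduced in this text. One reading consistent with the surrounding definitions, assuming standard LoRA notation, is sketched below. The symbol W_0 for the frozen pretrained weight, the count N of adapters in the deep layer, and the weighting of each deep-layer adapter by the router probability G(x)_i are assumptions, not statements from the specification.

```latex
W_l = W_0 + B A \quad (l \le \tau),
\qquad
W'_l = W_0 + \sum_{i=1}^{N} G(x)_i \, B_i A_i \quad (l > \tau)
```

Under this reading, a shallow layer carries a single low-rank update, while a deep layer combines the updates of all of its adapters according to the router output.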

By progressively adding adapters to the deep layer (Ld) and distributing input text to appropriate adapters via the router, catastrophic forgetting that occurs during continual learning can be alleviated, and model performance can be improved.

FIG. 2 is a diagram illustrating a process in which the router (220) distributes input text by utilizing features of the deep layer (210). A feature vector (f) extracted from the deep layer (210) is delivered to the router, and the router (220) distributes the input text based on the feature vector.

FIG. 2 is a diagram visually illustrating the process in which the router (220) of the PMoE model distributes input text to appropriate adapters by utilizing a feature vector (f) extracted from the deep layer (210). In this figure, the feature vector (f) extracted from the deep layer (210) is delivered to the router (220), and the router (220) distributes the input text based on this feature.

FIG. 2 also explains how the router effectively distributes the input text to adapters by leveraging the features of the deep layer.

The final transformer block of the deep layer (210) processes the input text and generates a feature vector (f) as a result. This feature vector contains the essential semantics and contextual information of the text, playing a crucial role in enabling the router (220) to identify the nature of the input.

The router (220) receives the feature vector (f) generated from the deep layer as input and calculates assignment probabilities for each adapter. This process is performed by analyzing the feature vector and serves as a basis for evaluating the suitability of each adapter.

The router (220) distributes the input text to the most appropriate adapter based on the calculated assignment probabilities (see Equation 1). In this step, the router analyzes the characteristics of the input text using the feature vector (f) and assigns it to an adapter that possesses the most suitable domain knowledge. This process allows the router to recognize the fine-grained characteristics of the input text and effectively select the optimal adapter accordingly.

The PMoE model has a structure in which information is processed more effectively as it flows from the shallow layer to the deep layer through abstraction and integration. This structure enables the feature vector (f) from the deep layer to better reflect the meaning of the input text, allowing the router (220) to distribute the input text more accurately based on this information.

The feature vector (f) extracted from the deep layer enables a clearer interpretation of the complex semantics and context of the text, which allows the router to make more precise distribution decisions.

This feature-based routing mechanism significantly enhances the performance of the router (220), ultimately contributing to the improvement of the continual learning capability of the PMoE model. By precisely analyzing the deep-layer features and accurately assigning the input to the most appropriate adapter, the model can quickly and effectively adapt to continuously incoming data and changing situations. As a result, the overall learning efficiency and performance are improved.

This capability of the PMoE model to demonstrate high flexibility and adaptability in various learning scenarios is one of its core strengths and plays a key role in enhancing the quality of continual learning.

FIG. 3 is a diagram illustrating in detail the process by which the adapter (320) enhances parameter efficiency using Low-Rank Adaptation (LoRA) technology. This figure depicts the pretrained weights (310) and the internal LoRA structure within the adapter (320), explaining how the structure reduces the number of parameters and improves learning efficiency. FIG. 3 shows how the LoRA technique optimizes the structure and function of the adapter (320), thereby enhancing the overall performance of the model.

The figure describes how LoRA is applied in the continual learning apparatus of the present invention, and illustrates how this technique decomposes a large weight matrix into two smaller low-rank matrices. This low-rank matrix decomposition contributes to reducing model training time.

The processor can significantly improve the overall learning efficiency of the model by substantially reducing the number of parameters in the adapter (320) through LoRA. This allows for more efficient use of computational resources by selectively updating only the necessary parts of the model instead of retraining the entire model. As a result, the model becomes capable of rapidly adapting to new tasks while preserving previously learned knowledge and integrating new information.
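To make the parameter reduction concrete, the sketch below compares the number of trainable parameters in a dense weight matrix with the number in a low-rank pair B and A. The matrix sizes and the rank are hypothetical values chosen only for illustration; the specification does not state them.

```python
def full_params(d_out, d_in):
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Parameters in the low-rank pair B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

# Hypothetical sizes, not taken from the specification.
d_out, d_in, rank = 4096, 4096, 8
dense = full_params(d_out, d_in)            # 16,777,216
low_rank = lora_params(d_out, d_in, rank)   # 65,536
ratio = low_rank / dense                    # 1/256
```

For these sizes the adapter trains fewer than half a percent of the parameters of the dense matrix, and the saving grows as the rank shrinks relative to the matrix dimensions.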

Since continual learning requires the model to continuously learn new tasks, efficiency is a key factor in its success. By improving parameter efficiency through LoRA, the learning apparatus can accelerate the training process for new tasks, reduce the required storage space, and mitigate overfitting issues. These enhancements allow the model to operate more responsively and effectively during the continual learning process, ultimately delivering better performance to the user.

In conclusion, the adapter (320) of the present invention, as illustrated in FIG. 3, effectively supports the continual learning process by maximizing parameter efficiency through LoRA. This approach addresses various challenges faced by larger and more complex models, providing an efficient and cost-effective learning solution that greatly expands the potential of continual learning. It offers a robust foundation that can evolve alongside advancements in technology and may have a significant impact on future continual learning applications.

FIG. 4 is a diagram illustrating the process in which the processor applies regularization to adapters (410, 420) and the resulting effects, visually representing how regularization alleviates catastrophic forgetting.

The figure shows how the processor in the PMoE model applies a regularization technique to the adapters (410, 420) located in the deep layer, and explains how this process is effectively carried out. The processor applies a specific regularization method to each adapter (410, 420), thereby enabling the adapters to process different types of information independently.

Through regularization, the processor controls the learning process of the adapters (410, 420), which plays a key role in preventing previously learned knowledge from being excessively overwritten when new tasks are learned. In this process, the processor supports each adapter in maintaining independent information while effectively acquiring new knowledge.

This approach contributes to enhancing the continual learning capability of the PMoE model.

According to one embodiment of the present invention, the regularization techniques used may include L1 or L2 regularization, orthogonal loss functions, and the like.

The orthogonal loss function may be defined as follows.

Referring to Equation 5, θ_a and θ_b are parameters of the model with a magnitude (norm) of 1.

The orthogonal loss function may be defined as a regularization term that does not require input data.

In the present system, the use of such an orthogonal loss function involves computing the orthogonality between different adapters (410, 420).

Specifically, the role of this loss function in the system is to project the hidden features passing through different adapters onto distinct planes.

An equation related to this projection may be expressed as follows.

Referring to Equation 6, A_i and A_j may represent the i-th and j-th A adapters, respectively.

In particular, the orthogonal loss function guides the information across adapters to be learned in independent directions, enabling each adapter (410, 420) to process task-specific knowledge more effectively. This technique enhances the overall learning efficiency of the model and allows the adapters to function independently without interfering with one another.
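The loss equations are not reproduced in this text. The following sketch shows one common way to penalize overlap between adapters using parameters only, which matches the statement that the regularization term does not require input data: it sums the squared dot products between the rows of A_i and the rows of A_j for every pair of adapters. This particular form, the function name, and the example matrices are assumptions, not the formula of the specification.

```python
def orthogonal_loss(adapters):
    """Sum over adapter pairs of the squared entries of A_i times A_j transposed.

    Each adapter is a list of rows (rank x d_in). The loss is zero when every
    row of A_i is orthogonal to every row of A_j.
    """
    loss = 0.0
    for i in range(len(adapters)):
        for j in range(i + 1, len(adapters)):
            for row_i in adapters[i]:
                for row_j in adapters[j]:
                    dot = sum(a * b for a, b in zip(row_i, row_j))
                    loss += dot * dot
    return loss

# Hypothetical rank-1 adapters in a 3-dimensional input space.
A_1 = [[1.0, 0.0, 0.0]]
A_2 = [[0.0, 1.0, 0.0]]   # orthogonal to A_1
A_3 = [[0.6, 0.8, 0.0]]   # overlaps with both A_1 and A_2

loss_orthogonal = orthogonal_loss([A_1, A_2])
loss_overlap = orthogonal_loss([A_1, A_2, A_3])
```

The first loss is zero because the two adapters span perpendicular directions. The second is close to 1.0 (0.36 + 0.64) because the third adapter shares directions with both of the others, so minimizing the term pushes it toward an unused direction.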

Referring to FIG. 4, one can observe the structure of adapters that effectively retain information from previous tasks even after learning new tasks. This demonstrates how regularization enables the long-term preservation of acquired knowledge within the PMoE model.

Such a regularization-based approach significantly improves the stability and reliability of the model in a continual learning environment, allowing for rapid adaptation to new tasks while maintaining previously learned knowledge.

This mechanism plays a crucial role in maximizing the continual learning capability of the PMoE model and in ensuring robust performance across diverse learning scenarios.

Meanwhile, the processor analyzes interactions among the various adapters (410, 420) located in the deep layer and computes the orthogonal loss function to ensure that each adapter processes information independently without interference from others. Through this process, the adapters learn different information independently, which contributes significantly to improving the overall efficiency and performance of the system.

The orthogonal loss function encourages the adapters to learn information in mutually independent directions, effectively making the feature vectors extracted from each adapter orthogonal to one another. This structure prevents interference during the learning processes of different adapters, allowing each adapter to handle information optimized for its corresponding task.

In addition, by computing the orthogonal loss function, the processor encourages each adapter to handle different types of input text. This allows each adapter (410, 420) to perform more specialized learning by processing distinct categories of data, thereby increasing the overall diversity and flexibility of the model.

In conclusion, the application of the orthogonal loss function, as illustrated in FIG. 4, represents a core component of the PMoE model. It plays a crucial role in improving the model's overall learning capability and adaptability by enabling the adapters to learn independently and efficiently.

This process supports the adapters (410, 420) in learning different information while functioning as part of an integrated system, allowing the model to perform various tasks effectively during continual learning without degradation in performance.

FIG. 5 is a diagram illustrating the process of automatically adjusting the capacity of an adapter according to the characteristics of a dataset using Sparse Low-rank Adaptation (SoRA). This figure visually represents the rank adjustment mechanism through the gate layer (520) in SoRA and precisely shows how the adapter's capacity changes flexibly depending on the dataset.

In the PMoE model, SoRA provides a key functionality that allows the capacity of the adapter (5) to be efficiently adjusted in response to diverse requirements of the dataset.

As shown in FIG. 5, the detailed operation of the SoRA technique applied to each adapter (5) can be clearly understood. SoRA is an evolved version of the existing LoRA (Low-Rank Adaptation) structure, designed by inserting a gate layer (520) into the bottleneck structure of LoRA to control the rank.

The gate layer (520) is composed of a rank-dimensional vector and is capable of performing a Hadamard product with the features passed through the lower layer of LoRA.

Meanwhile, the Hadamard product performed by the gate may be based on the following equation.

Referring to Equation 7, h may represent the forward pass of each adapter, and A and B may represent the parameters constituting the adapter. g may represent a vector in the rank dimension.

The primary function of the gate layer (520) is to deactivate specific dimensions in the rank space by setting their values to zero. This effectively reduces the rank of LoRA, thereby allowing the capacity of the adapter to be adjusted.

FIG. 5 visually illustrates how the dimensionality of LoRA is reduced depending on the number of zero values in the gate vector.
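The gating can be sketched as follows, assuming Equation 7 has the usual SoRA form h = B(g ⊙ (A x)), where ⊙ is the Hadamard (elementwise) product. The equation is not reproduced in this text, so this form, the helper names, and the small example matrices are assumptions for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sora_forward(A, B, g, x):
    """Assumed form h = B (g * (A x)), with * the elementwise product."""
    down = matvec(A, x)                            # project to the rank dimension
    gated = [gi * di for gi, di in zip(g, down)]   # a zero in g switches a rank off
    return matvec(B, gated)                        # project back to the output size

def effective_rank(g):
    """Number of rank dimensions left active by the gate vector."""
    return sum(1 for gi in g if gi != 0.0)

# Hypothetical shapes: d_in = 3, rank = 2, d_out = 2.
A = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]
B = [[1.0, 1.0],
     [0.0, 1.0]]
x = [1.0, 1.0, 1.0]

h_full = sora_forward(A, B, [1.0, 1.0], x)     # both rank dimensions active
h_pruned = sora_forward(A, B, [1.0, 0.0], x)   # second rank dimension deactivated
```

Setting the second gate entry to zero removes the contribution of the second row of A and the second column of B, so the adapter behaves like a rank-1 adapter without changing the shapes of A and B.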

The SoRA technique analyzes the characteristics of a dataset to automatically determine and adjust the most suitable rank for each adapter. Through this process, the capacity of the adapter can be increased or decreased according to the requirements of the dataset, which plays a significant role in optimizing the overall performance and efficiency of the model.

The approach illustrated in FIG. 5 highlights the ability to flexibly adjust the adapter's capacity in response to the diversity and complexity of the dataset, thereby enabling the PMoE model to operate in a more refined and effective manner.

6 6 FIGS.A andB are diagrams visually illustrating the process of adjusting gradient values and differentiating learning speeds using a Gradient Decoupling Layer (GDL). These figures effectively depict how GDL suppresses updates to existing parameters while reinforcing the training of newly introduced adapters. GDL plays a crucial role in the PMoE model, particularly by focusing on reducing the update of previously learned parameters and accelerating the learning speed of newly added adapters.

6 6 FIGS.A andB According to, GDL functions by controlling the flow of backpropagated gradients between the shallow and deep layers of the network, or within the deep layer itself. Through this mechanism, gradients are selectively modulated between or within layers, which plays a critical role in the learning process.

For example, GDL can reduce the magnitude of gradient values by a specific ratio (e.g., 0.1), thereby decreasing the update strength of existing parameters. In FIG. 6B, this is visually represented by the smaller gradient arrows pointing toward the shallow layer.

By utilizing GDL, it is possible to suppress parameter updates in the shallow layer while delivering larger gradient values to the newly added adapters, thereby increasing their learning speed. In this process, GDL acts as a layer positioned between the shallow and deep layers, or within the deep layer, enabling efficient control of gradient flow and enhancing the overall learning efficiency of the network.
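
The gradient scaling described above can be sketched as follows. This is a simplified illustration, not the patented implementation: it assumes the GDL is an identity in the forward pass, that the ratio (here 0.1) is a multiplicative factor applied to the backpropagated gradient, and that the same upstream gradient reaches both parameter groups. The names `gdl_backward` and `sgd_step` are hypothetical.

```python
def gdl_backward(upstream_grad, ratio=0.1):
    """Backward pass of a hypothetical gradient decoupling layer.

    The forward pass is assumed to be the identity; only the gradient
    flowing back toward earlier layers is multiplied by `ratio`.
    """
    return [ratio * g for g in upstream_grad]


def sgd_step(params, grads, lr):
    """Plain SGD update: p <- p - lr * g."""
    return [p - lr * g for p, g in zip(params, grads)]


# Gradient arriving at the boundary between the deep and shallow layers.
grad_at_boundary = [1.0, -2.0]

# The new adapter sits after the GDL, so it receives the full gradient.
adapter = sgd_step([0.5, 0.5], grad_at_boundary, lr=0.5)

# The shallow layer sits before the GDL, so its gradient is shrunk first.
shallow_grad = gdl_backward(grad_at_boundary, ratio=0.1)
shallow = sgd_step([0.5, 0.5], shallow_grad, lr=0.5)
```

With identical starting values and learning rate, the adapter parameters move ten times further than the shallow-layer parameters, which is the asymmetry in learning speed that the description attributes to GDL.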

In other words, GDL optimizes the learning process by modulating gradient flow and contributes to improving the overall performance of the model by adjusting the update intensity of specific parameters.

Thus, GDL plays a central role within the PMoE model, functioning as a key component for controlling the flow of gradients. This structure enables more effective management of the learning process and plays an important role in enhancing both the learning speed and performance of the model. Accordingly, the use of GDL can be an essential strategy for simultaneously protecting previously learned parameters and enabling efficient learning of newly introduced adapters.

FIG. 7 is a diagram illustrating the process of performing replay using data samples from previous tasks stored in the memory (710), visually explaining one of the core functions of the PMoE model.

As shown in FIG. 7, data samples are transferred from the memory (710) to the processor (720), and the processor utilizes these samples to retrain the model.

This process is designed to prevent the problem of “catastrophic forgetting,” which frequently occurs in continual learning models—where previously acquired knowledge is lost as new information is learned.

FIG. 7 is a diagram showing how past data samples stored in the memory (710) are utilized to prevent the loss of previously acquired knowledge and to strengthen the model's continual learning capability. The figure visually represents the “replay” process, which is one of the core mechanisms of the PMoE model. The memory (710) functions as a key storage unit, similar to human memory, that retains data samples used in prior tasks. These data samples (D) help the model maintain prior knowledge during the learning of new tasks.

The processor (720) not only retrieves data samples from previous tasks stored in the memory (710) to retrain the model, but also performs this process repeatedly to reinforce learning on new tasks. This replay mechanism is essential for preserving and reinforcing previously learned knowledge while acquiring new information.

By performing replay, the processor (720) re-learns information from earlier tasks, effectively mitigating catastrophic forgetting that may occur during the learning of new tasks. Through this process, the overall performance of the model can be enhanced.
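
A replay step of this kind can be sketched as a bounded sample store plus a batch builder that mixes old and new samples. The class `ReplayMemory`, the function `build_batch`, the random eviction policy, and the capacity are all assumptions made for illustration; the description does not specify how samples are selected or evicted.

```python
import random


class ReplayMemory:
    """Keeps a bounded set of samples from previous tasks (hypothetical helper)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self._rng = random.Random(seed)

    def store(self, task_samples):
        """Add samples from a finished task, evicting random old ones if full."""
        for sample in task_samples:
            if len(self.samples) >= self.capacity:
                self.samples.pop(self._rng.randrange(len(self.samples)))
            self.samples.append(sample)

    def draw(self, n):
        """Return up to n stored samples, drawn without replacement."""
        return self._rng.sample(self.samples, min(n, len(self.samples)))


def build_batch(new_samples, memory, n_replay):
    """Mix new-task samples with replayed samples from earlier tasks."""
    return list(new_samples) + memory.draw(n_replay)


memory = ReplayMemory(capacity=4)
memory.store([("task1", i) for i in range(3)])
memory.store([("task2", i) for i in range(3)])

# Each training batch for task 3 also carries two samples from earlier tasks.
batch = build_batch([("task3", 0), ("task3", 1)], memory, n_replay=2)
```

Training on `batch` rather than on the new samples alone is what exposes the model to earlier tasks again while it learns the new one.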

Ultimately, FIG. 7 illustrates a core mechanism that enables continual learning in the PMoE model. This process can be seen as a machine-implemented analogy to how humans periodically review previously learned content to retain memory. Such a replay process allows the model to effectively integrate new information while preserving prior knowledge, thereby achieving more robust and sustainable learning outcomes.

FIG. 8 is a diagram illustrating the process of calculating the cross-entropy loss between input text and task ID, and using it to train the router. It shows how the processor calculates the cross-entropy loss and delivers it to the router so that the router can learn to assign the input text to the appropriate adapter.

Meanwhile, the cross-entropy loss between the input text and the task ID may be calculated based on the following equation.

Referring to Equation 8, x may represent the input, and k may represent the task ID. G_k(x) may indicate the probability, determined by the router, of assigning the input x to the corresponding task ID k. L(x, k) may represent the cross-entropy loss between the input text and the task ID.
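
Equation 8 itself does not survive in this text. For a single input x with correct task ID k, the standard cross-entropy loss against a one-hot target reduces to the negative log of the probability the router assigns to k. Writing that probability as G_k(x), and assuming the router outputs a normalized distribution over task IDs (an assumption, not stated in the source), a plausible form is:

```latex
\mathcal{L}(x, k) = -\log G_k(x),
\qquad \sum_{j} G_j(x) = 1
```

The loss is zero when the router assigns probability 1 to the correct task ID and grows without bound as that probability approaches 0.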

FIG. 8 focuses on the training process of the router in the PMoE model. Specifically, it visually illustrates the process in which the cross-entropy loss is calculated between the input text (810) and the corresponding task ID (820), and how this loss is used to train the router (830) to assign the input text to the appropriate adapter.

The processor may calculate a cross-entropy loss by receiving the input text and the corresponding task ID as inputs. As shown in FIG. 8, the input text (810) and task ID (820) are provided to the processor, which is configured to compute the loss value based on this information.

According to one embodiment of the present invention, the calculated cross-entropy loss may be used for training the router (830). That is, the router can improve its ability to assign input text to the correct adapter using this loss value. As illustrated in FIG. 8, the loss value computed by the processor is delivered to the router (830), which then updates its internal weights based on this value.

Through training, the router (830) analyzes the characteristics of the input text (810), predicts the task ID (820) most relevant to the input, and assigns the input text to the appropriate adapter. This process is essential for enabling the router (830) to develop the capability to identify the text's properties and make optimal adapter selection decisions.
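
The router training loop described above can be sketched with a tiny linear router in plain Python. This is an illustration under stated assumptions: the router is taken to be a single linear layer followed by a softmax, the input text is represented by a made-up feature vector, and the names `router_probs`, `cross_entropy`, and `train_step` are hypothetical. The patent does not specify the router architecture or the optimizer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_probs(W, features):
    """Router output: one probability per task ID (one row of W per task)."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in W]
    return softmax(logits)


def cross_entropy(probs, task_id):
    """Loss for a single example whose correct task ID is `task_id`."""
    return -math.log(probs[task_id])


def train_step(W, features, task_id, lr):
    """One gradient-descent step on the weights of the linear router."""
    probs = router_probs(W, features)
    new_W = []
    for j, row in enumerate(W):
        # d(loss)/d(logit_j) is p_j - 1 for the correct task, p_j otherwise
        err = probs[j] - (1.0 if j == task_id else 0.0)
        new_W.append([w - lr * err * f for w, f in zip(row, features)])
    return new_W


features = [1.0, 0.5]                  # hypothetical features of the input text
task_id = 2                            # correct task ID for this input
W = [[0.0, 0.0] for _ in range(3)]     # three tasks, untrained router

loss_before = cross_entropy(router_probs(W, features), task_id)
for _ in range(20):
    W = train_step(W, features, task_id, lr=0.5)
probs = router_probs(W, features)
loss_after = cross_entropy(probs, task_id)

# After training, the input is routed to the adapter with the highest probability.
selected_adapter = max(range(len(probs)), key=probs.__getitem__)
```

An untrained router spreads probability evenly over the three task IDs; repeated updates with the cross-entropy loss shift probability toward the correct ID, so the argmax selects the matching adapter.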

The cross-entropy loss between the input text (810) and the task ID (820) supports this decision-making process, contributing to improved learning efficiency and enhanced overall performance of the PMoE model.

Through this process, the PMoE model can handle input text more accurately and efficiently, resulting in improved adaptability across a wide range of tasks. The interaction between the processor that calculates the cross-entropy loss and the router (830) that utilizes it for training equips the PMoE model with the capability to process more complex and diverse data.

FIG. 9 is a conceptual block diagram illustrating components of a continual learning apparatus using an asymmetric structure according to various embodiments of the present disclosure.

The continual learning apparatus (100) using an asymmetric structure may include a processor (110), a memory (120), a communication unit (130), and an input/output interface (140). The internal components that may be included in the continual learning apparatus (100) are not limited to these.

The continual learning apparatus (100) of the present disclosure may also perform the functions of the processor (110) via a separate processing server or a cloud server in place of the local processor.

Referring to FIG. 9, the processor (110) may be implemented to perform operations of the continual learning apparatus (100) using an asymmetric structure by utilizing data stored in the memory (120), which stores algorithms or programs replicating such algorithms for controlling the operations of the components within the apparatus (100). In this case, the processor (110) and the memory (120) may be implemented as separate chips. Alternatively, the processor (110) and the memory (120) may be integrated into a single chip.

The processor (110) may control one or more of the components described above in combination to implement various embodiments of the present disclosure described with reference to FIGS. 1 through 8 in the continual learning apparatus (100) using an asymmetric structure.

The memory (120) according to the embodiment may store data supporting various functions of the continual learning apparatus (100) using an asymmetric structure, as well as programs for operating the processor (110). It may also store input/output data (e.g., images, videos), multiple application programs executed on the continual learning apparatus (100), data for operating the apparatus, and instructions. At least some of these application programs may be downloaded from an external server via wireless communication.

The memory (120) may include at least one type of storage medium such as: flash memory type, hard disk type, solid state disk (SSD) type, silicon disk drive (SDD) type, multimedia card micro type, card-type memory (e.g., SD or XD memory), random access memory (RAM), static RAM (SRAM), read-only memory (ROM), electrically erasable programmable ROM (EEPROM), programmable ROM (PROM), magnetic memory, magnetic disk, and optical disk.

Furthermore, the memory may be a database that is physically separate from the continual learning apparatus (100) but connected to it via a wired or wireless interface.

The communication unit (130) according to the embodiment may include one or more components that enable communication with external devices. For example, it may include at least one of a broadcast reception module, a wired communication module, a wireless communication module, a short-range communication module, and a location information module.

The wired communication module may include various types of wired communication interfaces such as a Local Area Network (LAN) module, a Wide Area Network (WAN) module, or a Value Added Network (VAN) module. It may also include various cable-based communication interfaces such as USB (Universal Serial Bus), HDMI (High Definition Multimedia Interface), DVI (Digital Visual Interface), RS-232 (Recommended Standard 232), power line communication, or POTS (Plain Old Telephone Service).

The wireless communication module may include not only a Wi-Fi module and a Wireless Broadband (WiBro) module, but also wireless communication modules supporting various communication standards such as GSM (Global System for Mobile Communication), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), UMTS (Universal Mobile Telecommunications System), TDMA (Time Division Multiple Access), LTE (Long Term Evolution), 4G, 5G, and 6G.

The short-range communication module is intended for short-range communication and may support communication using at least one of the following technologies: Bluetooth, RFID (Radio Frequency Identification), infrared communication (IrDA; Infrared Data Association), UWB (Ultra Wideband), ZigBee, NFC (Near Field Communication), Wi-Fi (Wireless Fidelity), Wi-Fi Direct, and Wireless USB (Wireless Universal Serial Bus).

The input/output interface (140) according to the embodiment serves as a communication channel with various types of external devices connected to the continual learning apparatus (100) using an asymmetric structure. The input/output interface (140) may include at least one of a wired/wireless headset port, external charger port, wired/wireless data port, memory card port, a port for connecting a device equipped with an identification module (SIM), audio input/output port, video input/output port, and earphone port. The continual learning apparatus (100) of the present disclosure may perform appropriate control related to external devices connected through the input/output interface (140).

In correspondence with the performance of the components shown in FIG. 9, at least one component may be added or omitted. In addition, the relative positions of the components may be modified depending on the performance or structure of the apparatus, which will be readily understood by those skilled in the art.

Meanwhile, each of the components shown in FIG. 9 represents a software component and/or a hardware component such as a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC).

As a result, such interactions improve the model's accuracy and responsiveness and, in particular, enhance the model's capability to process tasks in highly dynamic environments.

The disclosed embodiments may be implemented in the form of a computer-readable recording medium storing instructions executable by a computer. The instructions may be stored in the form of program code, and when executed by a processor, may generate program modules to perform the operations of the disclosed embodiments.

The recording medium may be implemented as a computer-readable medium. A computer-readable medium includes all types of media that store instructions that can be decoded by a computer, such as read-only memory (ROM), random access memory (RAM), magnetic tapes, magnetic disks, flash memory, and optical data storage devices.

As described above, the disclosed embodiments have been described with reference to the accompanying drawings. However, those skilled in the art will appreciate that the present disclosure may be implemented in different forms without changing the essential characteristics or technical spirit of the disclosure. The disclosed embodiments are illustrative and should not be construed as limiting.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06N G06N3/96

Patent Metadata

Filing Date

June 30, 2025

Publication Date

March 26, 2026

Inventors

Min Jae JUNG

Joo Hee KIM

Seung Jin OH

Seung Taek KIM

Kyung Shil KANG
