A method for training a model based on data parallelism and a terminal. The model comprises local models trained at training terminals, respectively, and the method comprises: obtaining, by a first terminal, respective training losses of the training terminals; and calculating, by the first terminal, a weighted average of the training losses to obtain a weighted training loss, wherein the weighted training loss is for updating a parameter of the local model trained at each of the training terminals.
Legal claims defining the scope of protection, as filed with the USPTO.
. A method for training a model based on data parallelism, wherein the model comprises local models trained at training terminals, respectively, and the method comprises:
. The method according to, wherein each of the training terminals obtains the respective training loss of said training terminals through:
. The method according to, wherein:
. The method according to, further comprising:
. The method according to, wherein:
. The method according to, further comprising:
. The method according to, wherein the model comprises a plurality of network layers, a first layer among the plurality of network layers is deployed among respective first instances of the training terminals, and the method comprises:
. The method according to, wherein the respective backpropagation gradient of the first layer at the first instance of each of the training terminals is a gradient of the weighted training loss with respect to:
. The method according to, wherein:
. The method according to, wherein:
. The method according to, wherein:
. The method according to, wherein the third layer is the second layer, and the single third instance is the single second instance.
. A terminal, comprising:
. A non-transitory computer-readable storage medium, storing computer-readable instructions, wherein the computer readable instructions when executed by a processor implement a method comprising:
Complete technical specification and implementation details from the patent document.
This application claims priority to Chinese Patent Application No. 202410650013.7, titled “METHOD FOR TRAINING MODEL BASED ON DATA PARALLELISM AND RELATED DEVICE” and filed with the China National Intellectual Property Administration on May 23, 2024, Chinese Patent Application No. 202411998647.8, titled “METHOD FOR TRAINING MODEL AND RELATED DEVICE” and filed with the China National Intellectual Property Administration on Dec. 31, 2024, and Chinese Patent Application No. 202510495536.3, titled “METHOD FOR TRAINING MODEL USING DATA PARALLELISM, METHOD FOR TRAINING MODEL, AND RELATED DEVICES” and filed with the China National Intellectual Property Administration on Apr. 18, 2025, all of which are incorporated herein by reference in their entireties.
The present disclosure relates to the field of model training, and in particular to a method for training a model based on data parallelism, a terminal, and a non-transitory computer-readable storage medium.
Large deep-learning models are bringing rapid and profound changes to various aspects of modern society, work, and daily life. As model sizes (i.e., quantities of parameters) keep increasing, it is no longer feasible to train such models on a single computing device (such as a graphics processing unit, GPU) or a single computing node. Various parallel algorithms have been proposed in industry and academia to address this issue, the most representative of which are data parallelism and model parallelism (e.g., pipeline parallelism and tensor parallelism).
In conventional technology, training a large model is extremely costly. For example, training a large model with hundreds or thousands of billions of parameters requires hardware investment and daily maintenance that can easily cost tens of millions or even hundreds of millions of US dollars. At present, how to improve the efficiency of training large models has become an urgent problem.
A method for training a model based on data parallelism, a terminal, and a non-transitory computer-readable storage medium are provided according to embodiments of the present disclosure. Efficiency of model training is improved.
In a first aspect, a method for training a model based on data parallelism is provided. The model comprises local models trained at training terminals, respectively. The method comprises: obtaining, by a first terminal, respective training losses of the training terminals; and calculating, by the first terminal, a weighted average of the training losses to obtain a weighted training loss, where the weighted training loss is for updating a parameter of the local model trained at each of the training terminals.
In an embodiment, each of the training terminals obtains the respective training loss of said training terminals through: training a current version of the local model at said training terminal using training sub-data of said training terminal, where the training sub-data is a part of a current batch of training data, and the training loss of said training terminal is determined according to a predetermined loss function and a result of forward-propagating the training sub-data of said training terminal through the current version of the local model at said training terminal.
In an embodiment, the first terminal is one of the training terminals. Obtaining the respective training losses of the training terminals comprises: receiving the training sub-data of the first terminal; training the current version of the local model at the first terminal using the training sub-data of the first terminal to obtain the training loss of the first terminal; and receiving, from the training terminals other than the first terminal, the respective training losses of the training terminals other than the first terminal. The method further comprises: adjusting the current version of the local model at the first terminal with backpropagation according to the weighted training loss to obtain an updated version of the local model at the first terminal.
In an embodiment, the method further comprises: in response to determining that the current training terminal meets a predetermined aggregation condition, transmitting a parameter of the local model at the first terminal to an aggregating terminal, receiving an aggregated parameter transmitted from the aggregating terminal, where the aggregated parameter is a weighted average of respective parameters of all the local models, and the parameter of the local model at the first terminal is one of the parameters, and overwriting the parameters of the local model at the first terminal using the aggregated parameter.
In an embodiment, the first terminal is an aggregating terminal different from the training terminals. Obtaining the respective training losses of all the training terminals comprises receiving, from all the training terminals, the respective training losses of the training terminals. The method further comprises: transmitting the weighted training loss to each of the training terminals to enable said training terminal to adjust the current version of the local model at said training terminal according to the weighted training loss to obtain an updated version of the local model at said training terminal.
In an embodiment, the method further comprises: receiving, from each of the training terminals, a respective parameter of the local model at said training terminal; calculating a weighted average of the respective parameters of all the local models at the training terminals to obtain an aggregated parameter; and transmitting the aggregated parameter to each of the training terminals to enable said training terminal to overwrite the respective parameter of the local model at said training terminal using the aggregated parameter.
In an embodiment, the model comprises a plurality of network layers, and a first layer among the plurality of network layers is deployed among respective first instances of the training terminals. The method comprises: obtaining respective backpropagated gradients of the first layer at the first instances of the training terminals; and calculating a weighted average of the backpropagated gradients to obtain a weighted gradient, where the weighted gradient is for calculating a gradient of a second layer among the plurality of network layers, the second layer is an immediately previous layer of the first layer along a direction of forward propagation, and the second layer is deployed on a single second instance of which layer parameters are shared by the local models of all the training terminals.
In an embodiment, the respective backpropagated gradient of the first layer at the first instance of each of the training terminals is a gradient of the weighted training loss with respect to: an input of the first layer at the first instance of said training terminal during forward propagation of the training sub-data of said training terminal through the current version of the local model at said training terminal, where the input of the first layer is fed from the second layer.
In an embodiment, the gradient of the second layer is calculated through calculating a product of the weighted gradient and a Jacobian matrix of the second layer. Updating the parameter of the local model trained at each of the training terminals comprises: updating a parameter of the second layer using a parameter gradient (hereinafter called a gradient for short) of the second layer at the single second instance.
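The two operations named above, a weighted average of the per-terminal backpropagated gradients followed by a product with a Jacobian matrix, can be illustrated with a minimal sketch. The function names, the gradient values, the 60/40 weights, and the 2×3 matrix `J` are all hypothetical; the text does not state which variables the Jacobian is taken with respect to, so the matrix here is only a placeholder.

```python
from typing import List, Sequence


def weighted_gradient(grads: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> List[float]:
    """Weighted average of per-terminal gradient vectors of equal length."""
    total = sum(weights)
    dim = len(grads[0])
    return [sum(w * g[k] for g, w in zip(grads, weights)) / total
            for k in range(dim)]


def vector_jacobian_product(g: Sequence[float],
                            jacobian: Sequence[Sequence[float]]) -> List[float]:
    """Row vector g (length m) times an m x n Jacobian, giving a length-n vector."""
    n = len(jacobian[0])
    return [sum(g[i] * jacobian[i][j] for i in range(len(g))) for j in range(n)]


# Hypothetical gradients reported by two first instances (weights 60 and 40).
grads = [[1.0, 2.0], [3.0, 4.0]]
g_bar = weighted_gradient(grads, [60, 40])

# Hypothetical Jacobian of the shared second layer (placeholder values).
J = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]
grad_second = vector_jacobian_product(g_bar, J)
```

Only the averaged vector `g_bar` reaches the single shared instance in this sketch, so the shared layer performs one product per batch regardless of how many terminals contribute.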
In an embodiment, the plurality of network layers further comprises a third layer and a fourth layer, the fourth layer is an immediately previous layer of the third layer in the direction of forward propagation, the third layer is deployed on a single third instance of which layer parameters are shared by the local models of all the training terminals, and the fourth layer is deployed among respective fourth instances of the training terminals. The method further comprises: calculating a gradient of the fourth layer at the fourth instance of each of the training terminals according to a backpropagated gradient of the third layer, where the backpropagated gradient of the third layer is a gradient of the weighted training loss with respect to an input of the third layer during forward propagation of the current batch of training data.
In an embodiment, the gradient of the fourth layer at the fourth instance of each of the training terminals is calculated at the fourth instance of said training terminal through calculating a product of the backpropagated gradient of the third layer and a Jacobian matrix of the fourth layer at the fourth instance of said training terminal. The method further comprises: updating a parameter of the fourth layer at the fourth instance of each of the training terminals using a parameter gradient (hereinafter called a gradient for short) of the fourth layer at the fourth instance of said training terminal.
In an embodiment, the third layer is the second layer, and the single third instance is the single second instance.
In a second aspect, a terminal is provided. The terminal comprises: a memory storing computer-readable instructions, and a processor. The computer readable instructions when executed by the processor implement a method comprising: obtaining respective training losses of training terminals, where the training terminals are configured to train local models, respectively, of a model; and calculating a weighted average of the training losses to obtain a weighted training loss, where the weighted training loss is for updating a parameter of the local model trained at each of the training terminals.
In a third aspect, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium stores computer-readable instructions. The computer-readable instructions when executed by a processor implement a method comprising: obtaining respective training losses of training terminals, where the training terminals are configured to train local models, respectively, of a model; and calculating a weighted average of the training losses to obtain a weighted training loss, where the weighted training loss is for updating a parameter of the local model trained at each of the training terminals.
Hereinafter technical solutions in embodiments of the present disclosure are described clearly and completely in conjunction with the drawings in embodiments of the present disclosure. Apparently, the described embodiments are only some rather than all of the embodiments of the present disclosure. Any other embodiments obtained based on the embodiments of the present disclosure by those skilled in the art without any creative effort fall within the scope of protection of the present disclosure.
A method for training a model based on data parallelism, a method for training a model, and related devices are provided according to embodiments of the present disclosure. Efficiency for model training is improved.
As shown in, architecture of a system for model training is provided according to an embodiment of the present disclosure to facilitate understanding of the training method based on data parallelism. As shown in, the architecture comprises multiple training terminals. Since the architecture is based on data parallelism, each training terminal (such as a current training terminal or another training terminal) participating in the training is embodied as one of multiple parallel units configured to implement a complete data parallelism scheme. In some embodiments, the training terminal itself may be responsible for implementing a certain model parallel scheme (such as pipeline parallelism, tensor parallelism, or the like). In such cases, model training adopts hybrid parallelism. In some embodiments, after the model is trained with each batch of training data, a weighted average of respective training losses of the training terminals is calculated to obtain a weighted training loss for such batch. Afterwards, each training terminal updates a parameter of its local model using the weighted training loss for such batch to obtain a new version of the local model. In a case that the updated new version of the local model meets a convergence condition, the training is terminated. Otherwise, the training would continue iteratively. The weighted average may be calculated separately at each training terminal or may be calculated at an aggregating terminal, which is not limited herein.
In some embodiments, training data in the training data set is distributed among the training terminals participating in the training to conform to data parallelism. For every batch of training data, each training terminal determines its own training data according to the data distributed to it.
Reference is made to. In an embodiment, a method for training a model based on data parallelism may be implemented using the above architecture. The method comprises following steps ofto.
In step, training sub-data for the current training terminal is obtained from a current batch of training data.
For the sake of clear illustration, the current training terminal refers to an arbitrary training terminal participating in the training based on data parallelism, and the training terminal(s) participating in the training other than the current training terminal is called “other training terminal(s)”. Hereinafter, the method would be first described from the perspective of the current training terminal. Since any training terminal may utilize the method in this embodiment and/or related embodiment(s) to achieve more efficient model training, each training terminal participating in the training based on data parallelism may serve as the current training terminal.
Generally, when training a model having a tremendous quantity of parameters, training data in a training data set are distributed among the training terminals. For example, a certain batch (or mini-batch) of training data is directly transmitted to the training terminals. Alternatively, multiple batches of training data are transmitted together to the training terminals, and each training terminal picks its training sub-data in a per-batch manner from the training data distributed to it. Thereby, computational parallelism can be improved in model training.
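As a minimal sketch of how one batch might be divided into training sub-data, the following splits a batch into contiguous, near-equal shards, one per training terminal. The contiguous split and the name `split_batch` are assumptions made for illustration; the text does not prescribe a particular partitioning rule.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_batch(batch: Sequence[T], num_terminals: int) -> List[List[T]]:
    """Split one batch into contiguous shards whose sizes differ by at most one."""
    base, extra = divmod(len(batch), num_terminals)
    shards: List[List[T]] = []
    start = 0
    for rank in range(num_terminals):
        # The first `extra` terminals take one additional sample each.
        size = base + (1 if rank < extra else 0)
        shards.append(list(batch[start:start + size]))
        start += size
    return shards


batch = list(range(10))          # ten hypothetical sample identifiers
shards = split_batch(batch, 3)   # one shard per training terminal
```

Because shard sizes may differ, the per-terminal sample counts are worth keeping: they are a natural basis for the loss weights used later.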
Hence, for the current training terminal, the training sub-data for the current batch needs to be obtained before the model is trained with the current batch of training data.
In step S, a current version of a local model at the current training terminal is trained using the training sub-data to obtain a training loss of the current training terminal for the current batch.
The obtained training sub-data for the current batch may be inputted into the current version of the local model at the current training terminal and forward-propagated, and the training loss of the current training terminal for the current batch is determined from a result of the forward propagation. Detailed operations in this step may follow conventional means of obtaining a training loss and are not illustrated herein.
In a broad sense, a process of model training comprises forward propagation, loss calculation, backward propagation, and parameter update (or model update). The training in this step is interpreted in a narrow sense, that is, it refers to forward propagation and loss calculation.
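Training in this narrow sense (forward propagation followed by loss calculation) can be sketched as below. The one-feature linear model and the mean-squared-error loss are stand-ins chosen for brevity; the text only requires a predetermined loss function, and all names here are hypothetical.

```python
from typing import List, Sequence, Tuple


def forward(params: Tuple[float, float], xs: Sequence[float]) -> List[float]:
    """Forward propagation through a toy linear model y = w * x + b."""
    w, b = params
    return [w * x + b for x in xs]


def mse_loss(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error, used here as the predetermined loss function."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def local_training_loss(params: Tuple[float, float],
                        xs: Sequence[float],
                        ys: Sequence[float]) -> float:
    """Narrow-sense training step: forward pass plus loss, with no parameter update."""
    return mse_loss(forward(params, xs), ys)


loss = local_training_loss((2.0, 0.0), [1.0, 2.0], [2.0, 5.0])
```

The result is a single scalar per terminal, which is the only value that needs to be shared for the current batch.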
In step, a weighted training loss for the current batch is obtained, where the weighted training loss is a weighted average of respective training losses of multiple training terminals for the current batch.
Here the multiple training terminals comprise the current training terminal. The respective training loss of each training terminal is determined according to a predetermined loss function and a result of forward-propagating respective training sub-data of such training terminal through a current version of a respective local model at such training terminal. The current version of the local model is an initial version when the current batch is the foremost batch; otherwise, the current version of the local model is obtained through training an immediately previous version of the local model using an immediately previous batch of training data. In other words, the local model at each training terminal may be a universal initial model before being trained with the foremost batch of training data.
A weight of the respective training loss of each training terminal may be determined according to a volume of the training sub-data of such training terminal for the current batch. For example, the weight is positively correlated with the data amount. When the weights are identical across all training terminals, calculating the weighted average is equivalent to calculating a mean.
For example, there are two training terminals participating in the training. The current training terminal trains with 60 samples for the current batch, while the other training terminal trains with 40 samples for the current batch. When the weights are determined according to the data amount, the weighted training loss for the current batch is equal to 0.6 times the training loss of the current training terminal for the current batch plus 0.4 times the training loss of the other training terminal for the current batch. When the weights are equal, the weighted training loss for the current batch is equal to 0.5 times the training loss of the current training terminal for the current batch plus 0.5 times the training loss of the other training terminal for the current batch.
In step, the current version of the local model at the current training terminal is adjusted according to the weighted training loss for the current batch to obtain an updated version of the local model at the current training terminal, and training of the local model at the current training terminal is terminated in response to the updated version of the local model at the current training terminal being a target model.
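The two-terminal example can be reproduced with a short sketch. The loss values 0.50 and 0.80 are hypothetical, as is the function name; only the sample counts 60 and 40 come from the example.

```python
from typing import Sequence


def weighted_training_loss(losses: Sequence[float],
                           weights: Sequence[float]) -> float:
    """Weighted average of per-terminal training losses."""
    if not losses or len(losses) != len(weights):
        raise ValueError("losses and weights must be non-empty and equally long")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * l for w, l in zip(weights, losses)) / total


# Weights proportional to sample counts: 0.6 * 0.50 + 0.4 * 0.80.
by_volume = weighted_training_loss([0.50, 0.80], [60, 40])

# Equal weights: 0.5 * 0.50 + 0.5 * 0.80, i.e. the plain mean.
equal = weighted_training_loss([0.50, 0.80], [1, 1])
```

Normalizing by the sum of the weights means raw sample counts can be passed directly, without first converting them to fractions.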
The obtained weighted training loss for the current batch is backpropagated, and parameter(s) of the local model are updated according to a result of the backward propagation to obtain the new version of the local model. When the new version of the local model meets a requirement, the local model is determined to be the target model, and the iterative training is terminated.
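One possible reading of this step is sketched below for a toy one-parameter model y = theta * x per terminal. It assumes that the local parameters are independent across terminals, so that the gradient of the weighted training loss with respect to a terminal's own parameter equals that terminal's weight times the gradient of its own loss. That assumption, the loss threshold used as a stand-in for "being a target model", and all names are illustrative and not taken from the text.

```python
from typing import List, Sequence, Tuple

Shard = Tuple[Sequence[float], Sequence[float]]  # (inputs, targets) of one terminal


def local_loss_and_grad(theta: float, shard: Shard) -> Tuple[float, float]:
    """MSE loss of y = theta * x on one shard and its derivative w.r.t. theta."""
    xs, ys = shard
    n = len(xs)
    loss = sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad


def train(shards: List[Shard], lr: float = 0.1, tol: float = 1e-8,
          max_steps: int = 500) -> Tuple[List[float], float]:
    thetas = [0.0] * len(shards)                      # shared initial model
    total = sum(len(xs) for xs, _ in shards)
    weights = [len(xs) / total for xs, _ in shards]   # weights by data volume
    weighted = float("inf")
    for _ in range(max_steps):
        results = [local_loss_and_grad(t, s) for t, s in zip(thetas, shards)]
        # Single synchronization point: only the scalar losses are combined.
        weighted = sum(w * loss for w, (loss, _) in zip(weights, results))
        if weighted < tol:   # stand-in for the convergence condition
            break
        # Assumption: d(weighted)/d(theta_i) = w_i * d(loss_i)/d(theta_i).
        thetas = [t - lr * w * g
                  for t, w, (_, g) in zip(thetas, weights, results)]
    return thetas, weighted


thetas, final_loss = train([([1.0, 2.0], [2.0, 4.0]), ([3.0], [6.0])])
```

In this toy setting both local parameters approach 2.0, the value that fits every shard, even though no gradient is exchanged between terminals.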
In a conventional training scheme using data parallelism, such as the all-reduce scheme, the gradients obtained for each layer of the entire model are collected, aggregated, and averaged one or more times through message passing interface (MPI) techniques during the training with each batch of training data. In comparison, herein it is not the gradients but the training losses (i.e., values obtained through the loss function) of the local models obtained through forward propagation that are averaged for each batch of training data. Each time the training synchronization occurs, only one piece of data, i.e., the training loss, is transmitted. Since the conventional training schemes need to transfer model parameters or gradients that usually comprise hundreds of millions of pieces of data, embodiments of the present disclosure can achieve higher training synchronization efficiency and higher accuracy. Data synchronization efficiency is also improved.
Moreover, embodiments of the present disclosure achieve at least the following effects.
First, time is significantly saved. The large models have a tremendous quantity of parameters (up to hundreds or thousands of billions of parameters), and hence a huge amount of data needs to be transmitted through network(s) (approximately 2×quantity of training terminals×quantity of parameters×2, depending on which optimization algorithm is used). Accordingly, network transmission becomes the key bottleneck in the performance of the model training. In comparison with the conventional scheme, the amount of data required to be transmitted in embodiments of the present disclosure (approximately 2×quantity of training terminals×4) is reduced to a level of one billionth or one ten-billionth and can even be ignored. Overall efficiency of training the large models is greatly improved.
Second, the cost of hardware is greatly reduced. On the one hand, the conventional schemes using data parallelism require dedicated high-performance network transmission devices and techniques, such as an RDMA network, to transmit a large amount of data in a short time. In embodiments of the present disclosure, common networks may be utilized since only a small amount of data needs to be transmitted. On the other hand, the conventional schemes usually require a separate parameter server due to the huge amount of calculation in averaging the parameters. In embodiments of the present disclosure, a common computing device may be utilized because the amount of calculation is also small.
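The two approximate expressions can be evaluated for a concrete case. The sketch below assumes the trailing factors 2 and 4 are sizes in bytes (for example, a 2-byte parameter or gradient value and a 4-byte loss value); that interpretation, the function names, and the figures of 8 terminals and 100 billion parameters are illustrative assumptions.

```python
def conventional_volume(num_terminals: int, num_params: int,
                        bytes_per_value: int = 2) -> int:
    """Approximate traffic per synchronization: 2 x terminals x parameters x 2."""
    return 2 * num_terminals * num_params * bytes_per_value


def loss_sync_volume(num_terminals: int, bytes_per_loss: int = 4) -> int:
    """Approximate traffic per synchronization: 2 x terminals x 4."""
    return 2 * num_terminals * bytes_per_loss


n_terminals = 8                     # hypothetical
n_params = 100_000_000_000          # hypothetical 100-billion-parameter model

conventional = conventional_volume(n_terminals, n_params)
proposed = loss_sync_volume(n_terminals)
ratio = proposed / conventional
```

Under these assumptions the ratio simplifies to 2 divided by the quantity of parameters, so it does not depend on the number of terminals.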
Third, the main approach of optimizing the conventional scheme using data parallelism is the parallelism between calculation and transmission. That is, the averaging of gradients of a certain layer is performed at the same time when calculating gradients of an immediate previous layer. The above synchronization shall be applied to every two adjacent layers of the local model at each training terminal. Even if hardware configuration is exactly the same throughout the training terminals, calculation with respect to each layer does not cost exactly identical time among the respective local models at all training terminals. Hence, configuration of synchronization points would force the local model(s) of some training terminal(s) to stay in a waiting state during calculation for each layer. In embodiments of the present disclosure, only one synchronization point is required to average the training losses of the local models at the training terminals, and the local model of each training terminal runs independently and is trained independently at other moments. Thereby, the whole scheme is simpler and costs less time.
Fourth, the design of the software system is significantly simplified, and the costs of development and operation of the software system are significantly reduced. In the conventional schemes, the gradients of multiple operators in multiple training terminals shall be transmitted and synchronized. Thus, the design is extremely complicated, and the costs of development and operation are extremely high. In embodiments of the present disclosure, the scheme may be achieved by modifying a common training scheme, and a special software system is not necessary.
Fifth, the scheme proposed herein is easy to integrate with other schemes. All conventional large models can be treated as a hybrid of a certain data parallelism scheme and a certain model parallelism scheme, and the data parallelism scheme is difficult to integrate with another model parallelism scheme due to complexities of both parallelism implementations. In embodiments of the present disclosure, the scheme may achieve a hybrid deployment with the model parallelism scheme with almost zero modification on the latter.
In some embodiments, the current training terminal obtaining the weighted training loss for the current batch may comprise, but is not limited to, following steps. The current training terminal transmits the training loss of the current training terminal for the current batch to a first aggregating terminal, and then receives the weighted training loss for the current batch from the first aggregating terminal, where the weighted training loss for the current batch is calculated by the first aggregating terminal. Alternatively, the current training terminal receives the training loss(es) from the other training terminal(s), respectively, and then calculates the weighted average of the training losses of the multiple training terminals for the current batch to obtain the weighted training loss for the current batch.
The difference between the above two manners lies in whether the weighted training loss for the current batch is calculated at the training terminal (e.g., the current training terminal) or at the first aggregating terminal. In the former case, the current training terminal needs to receive the training loss for the current batch transmitted from each other training terminal. Generally, since the data volume of the training loss is extremely small (such as several hundreds of bytes), both manners consume little time, and hence either may be selected as required.
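The two manners can be compared in a small in-process simulation, with no real networking. Each report pairs a hypothetical training loss with a sample count used as the weight; the function names and the message counts (one upload plus one download per terminal, versus one message per ordered pair of terminals) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Report = Tuple[float, int]  # (training loss, number of samples in the sub-data)


def weighted_average(reports: Sequence[Report]) -> float:
    total = sum(count for _, count in reports)
    return sum(loss * count for loss, count in reports) / total


def via_aggregating_terminal(reports: Sequence[Report]) -> Tuple[List[float], int]:
    """Terminals upload their losses; the aggregating terminal returns one value."""
    weighted = weighted_average(reports)
    messages = 2 * len(reports)        # one upload and one download per terminal
    return [weighted] * len(reports), messages


def via_peer_exchange(reports: Sequence[Report]) -> Tuple[List[float], int]:
    """Every terminal receives all other losses and averages locally."""
    results = []
    for i, own in enumerate(reports):
        received = [r for j, r in enumerate(reports) if j != i]
        results.append(weighted_average([own, *received]))
    messages = len(reports) * (len(reports) - 1)   # one per ordered pair
    return results, messages


reports = [(0.50, 60), (0.80, 40), (0.65, 100)]   # hypothetical values
central, central_msgs = via_aggregating_terminal(reports)
peer, peer_msgs = via_peer_exchange(reports)
```

Both manners yield the same weighted training loss at every terminal up to floating-point rounding; under the counting assumed here, the peer exchange needs more messages as the number of terminals grows, although each message remains a single scalar.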
In some embodiments, parameters of the local models of the multiple training terminals may be aggregated under a certain condition to obtain an aggregation parameter, and the local model of each training terminal is updated using the aggregation parameter. Efficiency of training the initial model can be improved to obtain an accurate target model more rapidly. The method may comprise following steps. When determining that the current training terminal meets a predetermined aggregation condition, a parameter of the local model at the current training terminal is transmitted to a second aggregating terminal, and the parameter is overwritten using an aggregation parameter transmitted from the second aggregating terminal to obtain an updated local model at the current training terminal. The aggregation parameter is obtained by the second aggregating terminal through calculating a weighted average of respective parameters of the local models at the multiple training terminals.
Different from the training loss, the model parameter is usually very large. Hence, when calculating the aggregation parameter, an aggregating terminal (such as the second aggregating terminal) is utilized to ensure efficiency of parameter aggregation. The predetermined aggregation condition may be that a quantity of epochs that have elapsed reaches a threshold for triggering system aggregation, a quantity of batches of training data that have been used by the training terminals reaches a threshold for triggering system aggregation, or a quantity of pieces of training data (e.g., a quantity of training samples) used by each training terminal is identical, which is not limited herein.
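A minimal sketch of condition-triggered parameter aggregation follows. It assumes a batch-count trigger that fires periodically and equal weights for the parameter average; both choices, along with the function names and the parameter values, are hypothetical, since the text leaves the condition and the weights open.

```python
from typing import List, Sequence


def should_aggregate(batches_done: int, every_n_batches: int) -> bool:
    """Assumed trigger: fire each time another `every_n_batches` batches are used."""
    return batches_done > 0 and batches_done % every_n_batches == 0


def aggregate_parameters(params_per_terminal: Sequence[Sequence[float]],
                         weights: Sequence[float]) -> List[float]:
    """Element-wise weighted average of the local models' parameter vectors."""
    total = sum(weights)
    dim = len(params_per_terminal[0])
    return [sum(w * p[k] for p, w in zip(params_per_terminal, weights)) / total
            for k in range(dim)]


local_params = [[1.0, 2.0], [3.0, 6.0]]   # two hypothetical local models
if should_aggregate(batches_done=100, every_n_batches=50):
    aggregated = aggregate_parameters(local_params, [1, 1])
    # Every terminal overwrites its own parameters with the aggregated ones.
    local_params = [list(aggregated) for _ in local_params]
```

Unlike the per-batch loss exchange, this step moves full parameter vectors, so it is run only when the condition is met.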
In some alternative embodiments, when determining that a training terminal (such as the current training terminal or another training terminal) meets the predetermined aggregation condition, the local model at each training terminal may be tested using a test data set to obtain a model evaluation index. The local model having the best test performance is determined according to the respective model evaluation index of each local model. The parameter of the local model having the best test performance is determined to serve as the aggregation parameter, and each training terminal overwrites the parameter of its own local model using the aggregation parameter. Efficiency of training the initial model can be improved to obtain an accurate target model more rapidly.
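The alternative of adopting the best-performing local model can be sketched as follows. The evaluation scores are hypothetical, and whether a higher or lower model evaluation index is better depends on the index chosen, so the direction is left as a parameter.

```python
from typing import List, Sequence, Tuple


def select_best(params_per_terminal: Sequence[Sequence[float]],
                scores: Sequence[float],
                higher_is_better: bool = True) -> Tuple[int, List[float]]:
    """Return the index and a copy of the parameters of the best-scoring model."""
    pick = max if higher_is_better else min
    best = pick(range(len(scores)), key=lambda i: scores[i])
    return best, list(params_per_terminal[best])


local_params = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # hypothetical local models
accuracy = [0.81, 0.93, 0.90]                          # hypothetical test scores

best_index, aggregation_parameter = select_best(local_params, accuracy)
# Every terminal overwrites its own parameters with the selected ones.
local_params = [list(aggregation_parameter) for _ in local_params]
```

Returning a copy avoids terminals sharing one mutable list after the overwrite.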
Both the first aggregating terminal and the second aggregating terminal are the aggregating terminals. The first aggregating terminal and the second aggregating terminal may be the same or different, which is not limited herein.
November 27, 2025