Disclosed are an apparatus and a method for multi-phase pruning of a neural network with multiple sparsity levels, and a SIMD-based neural network pruning method. The SIMD-based neural network pruning method according to an exemplary embodiment of the present disclosure includes GEMM-transforming an inter-node weight kernel applied to a layer of a neural network, and pruning the GEMM-transformed weight kernel in units of a predetermined SIMD width.
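The SIMD-width pruning step described in the abstract can be illustrated with a small sketch. This is not the patented implementation: the kernel layout `[out_ch][in_ch][kh][kw]`, the function names `gemm_transform` and `prune_simd_groups`, the L1-norm group score, and the `sparsity` parameter are all assumptions made for illustration. The source states only that the GEMM-transformed kernel is pruned with a predetermined SIMD width as the unit.

```python
def gemm_transform(kernel):
    """Flatten a [out_ch][in_ch][kh][kw] kernel into a 2-D matrix.

    Assumed layout: one row per output filter, with in_ch*kh*kw columns.
    """
    return [[w for channel in filt for row in channel for w in row]
            for filt in kernel]


def prune_simd_groups(matrix, simd_width, sparsity):
    """Zero whole groups of `simd_width` consecutive weights.

    Groups are ranked by L1 norm (an assumed criterion) and the lowest
    `sparsity` fraction of groups is set to zero.
    """
    groups = []
    for r, row in enumerate(matrix):
        for start in range(0, len(row), simd_width):
            score = sum(abs(w) for w in row[start:start + simd_width])
            groups.append((score, r, start))
    groups.sort()

    n_prune = int(len(groups) * sparsity)
    pruned = [list(row) for row in matrix]  # leave the input untouched
    for _, r, start in groups[:n_prune]:
        end = min(start + simd_width, len(pruned[r]))
        for c in range(start, end):
            pruned[r][c] = 0.0
    return pruned


# Hypothetical example: 2 filters, 1 input channel, 2x2 spatial kernel.
kernel = [
    [[[0.1, -0.2], [3.0, 4.0]]],
    [[[5.0, -6.0], [0.0, 0.5]]],
]
matrix = gemm_transform(kernel)
pruned = prune_simd_groups(matrix, simd_width=2, sparsity=0.5)
```

Because zeros fall on SIMD-width boundaries, a vectorized kernel could skip an entire vector load and multiply for each pruned group, which is the presumed motivation for choosing the SIMD width as the pruning unit.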
Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
5. The neural network multi-phase pruning method according to claim 4, wherein in the performing of coarse-grain pruning, at least some continuous regions of an original weight kernel which is not GEMM-transformed are removed from the original weight kernel.
This invention relates to neural network optimization, specifically a multi-phase pruning method for reducing computational complexity and improving efficiency. The method addresses the problem of excessive computational overhead in neural networks by selectively removing weight parameters while maintaining model accuracy. The technique involves a multi-phase pruning process, including coarse-grain pruning and fine-grain pruning. During coarse-grain pruning, continuous regions of an original weight kernel that has not undergone a GEMM (General Matrix-Matrix Multiplication) transformation are removed. The GEMM transformation is a pre-processing step that restructures the weight kernel into matrix form to facilitate efficient pruning. The coarse-grain pruning phase targets larger, contiguous sections of the weight kernel, significantly reducing the parameter count early in the process. This is followed by fine-grain pruning, which further refines the network by removing individual weights or smaller groups of weights based on their importance to the model's performance. The method is intended to make pruning both aggressive and selective, balancing computational efficiency against accuracy preservation. The approach is particularly useful in applications where real-time processing and low power consumption are critical, such as edge computing and mobile devices.
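As a rough illustration of the coarse-grain phase, the sketch below removes whole filters from the original (not GEMM-transformed) kernel. The claim says only that "continuous regions" are removed; treating a region as an entire output filter, ranking filters by L1 norm, and the names `filter_l1_norm` and `coarse_prune_filters` are assumptions for this example.

```python
def filter_l1_norm(filt):
    """Sum of absolute weights in one [in_ch][kh][kw] filter."""
    return sum(abs(w) for channel in filt for row in channel for w in row)


def coarse_prune_filters(kernel, n_remove):
    """Remove the `n_remove` filters with the smallest L1 norm.

    `kernel` is assumed to be laid out as [out_ch][in_ch][kh][kw].
    The surviving filters keep their original relative order, and the
    returned kernel is physically smaller, not merely zero-filled.
    """
    ranked = sorted(range(len(kernel)),
                    key=lambda i: filter_l1_norm(kernel[i]))
    drop = set(ranked[:n_remove])
    return [filt for i, filt in enumerate(kernel) if i not in drop]


# Hypothetical example: 3 filters, 1 input channel, 1x2 spatial kernel.
original = [
    [[[1.0, 2.0]]],
    [[[0.1, -0.1]]],   # smallest L1 norm, removed first
    [[[-3.0, 0.5]]],
]
coarse = coarse_prune_filters(original, n_remove=1)
```

Dropping whole filters shrinks the kernel before any matrix transformation, so a later fine-grain phase would have fewer weights to examine.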
9. The multi-phase pruning apparatus according to claim 8, wherein the processor is further configured to remove at least some continuous regions of an original weight kernel which is not GEMM-transformed from the original weight kernel.
This invention relates to a multi-phase pruning apparatus for optimizing neural network models, specifically targeting the reduction of computational overhead by selectively removing redundant or less significant weight values from convolutional kernels. The apparatus addresses the problem of inefficient computation in deep learning models, where many weight values in convolutional layers contribute minimally to the model's performance, leading to unnecessary computational and memory costs. The apparatus includes a processor configured to perform multi-phase pruning, where one phase involves removing continuous regions of an original weight kernel that has not been transformed into a General Matrix-Matrix Multiplication (GEMM) format. This selective removal targets contiguous blocks of weights that are deemed insignificant, thereby reducing the kernel's size and computational load without significantly degrading model accuracy. The pruning process is designed to preserve critical weight values while eliminating redundant or low-impact regions, improving inference speed and efficiency. The apparatus may also include additional pruning phases, such as structured pruning or magnitude-based pruning, to further optimize the model. The overall goal is to enhance the efficiency of neural network inference by minimizing unnecessary computations while maintaining model performance.
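The description above mentions magnitude-based pruning as a possible additional phase. A minimal sketch of that idea follows, assuming a `[out_ch][in_ch][kh][kw]` kernel layout and a fixed absolute threshold; the function names `magnitude_prune` and `sparsity` and the threshold value are hypothetical and do not come from the patent text.

```python
def magnitude_prune(kernel, threshold):
    """Zero every weight whose absolute value is below `threshold`."""
    return [[[[w if abs(w) >= threshold else 0.0 for w in row]
              for row in channel]
             for channel in filt]
            for filt in kernel]


def sparsity(kernel):
    """Fraction of weights in the kernel that are exactly zero."""
    weights = [w for filt in kernel for channel in filt
               for row in channel for w in row]
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical example: 1 filter, 1 input channel, 2x2 spatial kernel.
kernel_before = [[[[0.05, -0.9], [0.4, -0.01]]]]
kernel_after = magnitude_prune(kernel_before, threshold=0.1)
level = sparsity(kernel_after)
```

Unlike the removal of contiguous regions, this phase leaves the kernel shape unchanged and only raises the proportion of zero weights, so any speedup depends on the runtime exploiting that sparsity.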
November 18, 2020
May 28, 2024