  • Network pruning reduces model size by trimming unimportant weights or connections while preserving model capacity. It may or may not require re-training.

  • Pruning can be unstructured or structured.

    • Unstructured pruning may drop any weight or connection, so it does not retain the original network architecture. The resulting irregular sparsity pattern often does not work well with modern hardware and doesn’t lead to actual inference speedup.
    • Structured pruning aims to keep the dense matrix multiplication form, with some elements set to zero. The pruning pattern may need to follow restrictions that hardware kernels support (one common pattern is sketched below).
  • We focus on structured pruning to achieve high sparsity in models.
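
To make the hardware-pattern point concrete, here is a toy sketch (my own illustration, not from the original sources) of a 2:4 semi-structured sparsity mask: every block of 4 consecutive weights keeps only the 2 with the largest magnitude, the kind of regular pattern that some sparse matrix-multiplication kernels can exploit.

```python
import torch

def two_four_mask(weight: torch.Tensor) -> torch.Tensor:
    """Keep only the 2 largest-magnitude weights in every block of 4."""
    groups = weight.reshape(-1, 4)              # consecutive groups of 4 weights
    top2 = groups.abs().topk(2, dim=1).indices  # indices of the 2 largest per group
    mask = torch.zeros_like(groups)
    mask.scatter_(1, top2, 1.0)                 # 1.0 where a weight is kept
    return mask.reshape(weight.shape)

W = torch.randn(8, 16)
W_sparse = W * two_four_mask(W)                 # 50% sparse, but in a regular pattern
```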

  • A routine workflow to construct a pruned network has three steps:

  1. Train a dense network until convergence;
  2. Prune the network to remove unwanted structure;
  3. Optionally retrain the network to recover the performance with new weights.
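
As a hedged, minimal sketch of this workflow, the snippet below uses PyTorch's `torch.nn.utils.prune` utilities; the commented-out `train(...)` calls stand in for an assumed training loop and dataset.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# 1. Train the dense network until convergence.
# train(model, steps=10_000)

# 2. Prune: drop 50% of output units per Linear layer by L2 norm
#    (structured pruning along dim=0 of each weight matrix).
for m in model.modules():
    if isinstance(m, nn.Linear):
        prune.ln_structured(m, name="weight", amount=0.5, n=2, dim=0)

# 3. Optionally retrain with the mask in place; pruned rows stay at zero
#    because the mask is re-applied in every forward pass.
# train(model, steps=2_000)

# Fold the masks into the weight tensors once retraining is done.
for m in model.modules():
    if isinstance(m, nn.Linear):
        prune.remove(m, "weight")
```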

How to prune?

  • Magnitude pruning is the simplest yet quite effective pruning method: weights with the smallest absolute values are trimmed.

    • Magnitude pruning is simple to apply to large models and achieves reasonably consistent performance across a wide range of hyperparameters.
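
A minimal sketch of (unstructured) global magnitude pruning, assuming we pool all `nn.Linear` weights and apply a single magnitude threshold; the helper name and the 90% sparsity default are my own choices for illustration.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.9) -> dict:
    """Zero out the `sparsity` fraction of Linear weights with the smallest |w|."""
    all_w = torch.cat([m.weight.detach().abs().flatten()
                       for m in model.modules() if isinstance(m, nn.Linear)])
    k = max(1, int(sparsity * all_w.numel()))
    threshold = all_w.kthvalue(k).values            # global magnitude threshold

    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.Linear):
            mask = (m.weight.abs() > threshold).float()
            m.weight.data.mul_(mask)                # trim the smallest weights
            masks[name] = mask                      # keep masks for retraining
    return masks

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
masks = magnitude_prune(model, sparsity=0.9)
```
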
  • Zhu & Gupta (2017) found that large sparse models were able to achieve better performance than their small but dense counterparts.

    • They proposed Gradual Magnitude Pruning (GMP) algorithm that increases the sparsity of a network gradually over the course of training.
    • At each training step, the weights with the smallest absolute values are masked to zero to reach the desired sparsity level, and masked weights do not receive gradient updates during back-propagation.
    • The desired sparsity level increases over the course of training (see the schedule sketch below). GMP is sensitive to the learning rate schedule, which should be higher than what is used in dense network training, but not so high that the model fails to converge.
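
A sketch of the cubic sparsity schedule from Zhu & Gupta (2017): sparsity ramps from an initial value s_i to the final target s_f over n pruning steps spaced Δt apart, starting at step t_0 (the default values below are illustrative, not theirs).

```python
def gmp_sparsity(step: int,
                 s_i: float = 0.0,     # initial sparsity
                 s_f: float = 0.9,     # final (target) sparsity
                 t_0: int = 0,         # step at which pruning starts
                 n: int = 100,         # number of pruning steps
                 delta_t: int = 100    # training steps between sparsity updates
                 ) -> float:
    """Cubic sparsity ramp used by Gradual Magnitude Pruning."""
    progress = min(1.0, max(0.0, (step - t_0) / (n * delta_t)))
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3
```

Every Δt steps, one would recompute the magnitude mask at the current target sparsity (e.g. with a helper like `magnitude_prune` above) and keep masked weights at zero during back-propagation.
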
  • Iterative pruning (Renda et al. 2020) iterates step 2 (prune) & step 3 (retrain) multiple times: Only a small fraction of weights are pruned and the model is retrained in each iteration. The process repeats until a desired sparsity level is reached.
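
A rough sketch of the iterative prune-retrain loop, reusing the hypothetical `magnitude_prune` helper and the placeholder `train(...)` from the sketches above.

```python
target_sparsity, prune_fraction = 0.9, 0.2
current_sparsity = 0.0

while current_sparsity < target_sparsity:
    # Prune a small additional fraction of the remaining weights...
    current_sparsity = min(target_sparsity,
                           1.0 - (1.0 - current_sparsity) * (1.0 - prune_fraction))
    masks = magnitude_prune(model, sparsity=current_sparsity)
    # ...then retrain to recover accuracy before pruning further.
    # train(model, steps=1_000, masks=masks)
```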

How to retrain?

  • The retraining step can be simple fine-tuning using the same pre-training data or other task-specific datasets.

  • Lottery Ticket Hypothesis proposed a weight rewinding retraining technique:

    • After pruning, the unpruned weights are reset to their values from earlier in the original training run, and the network is then retrained with the same learning rate schedule.
  • Learning rate rewinding (Renda et al. 2020) only resets the learning rate back to its early value, while the unpruned weights stay unchanged since the end of the last training stage.

    • They observed that
      • (1) retraining with weight rewinding outperforms retraining with fine-tuning across networks and datasets and
      • (2) learning rate rewinding matches or outperforms weight rewinding in all tested scenarios.
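
A conceptual sketch (not the papers' code) of what differs between the three retraining strategies; the toy model, linear learning-rate decay, and step counts are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(64, 64)
ckpt_early = copy.deepcopy(model.state_dict())       # weights saved early in training
# ... dense training to convergence happens here, then pruning yields a mask ...
mask = (torch.rand_like(model.weight) > 0.5).float()

TOTAL_STEPS, REWIND_STEP = 10_000, 2_000

def lr_schedule(step: int, peak: float = 0.1) -> float:
    return peak * (1.0 - step / TOTAL_STEPS)         # assumed linear decay

def retraining_lrs(mode: str, retrain_steps: int = 1_000) -> list:
    if mode == "weight_rewinding":                   # Lottery Ticket Hypothesis
        model.load_state_dict(ckpt_early)            # reset unpruned weights to early values
        return [lr_schedule(REWIND_STEP + t) for t in range(retrain_steps)]
    if mode == "lr_rewinding":                       # Renda et al. 2020
        # keep the final weights; only the learning rate schedule is replayed
        return [lr_schedule(REWIND_STEP + t) for t in range(retrain_steps)]
    # plain fine-tuning: keep final weights and the small final learning rate
    return [lr_schedule(TOTAL_STEPS - 1)] * retrain_steps

# In every case, pruned weights are kept at zero during retraining,
# e.g. by re-applying the mask after each optimizer step:
# model.weight.data.mul_(mask)
```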