Efficient Vision Transformer on the Edge
EN.520.625 Course Project

  • Juyang Bai

Abstract


Recent advancements in transformer-based models have revolutionized fields such as natural language processing, computer vision, and speech recognition due to their superior ability to model complex patterns and long-range dependencies. Despite their success, deploying these models on resource-constrained edge devices presents significant challenges, primarily due to their high computational demands and substantial memory requirements. This project report examines the practical aspects of adapting transformers for edge deployment, focusing on the EfficientViT model, a derivative of the Vision Transformer tailored for efficient processing. We explore two primary model optimization techniques, pruning and quantization, as well as their combined application, to enhance the feasibility of deploying transformers on edge devices. Experiments applying these techniques to the EfficientViT model show significant improvements in inference speed and efficiency, making it viable for real-time applications on edge devices. The accompanying analysis provides insight into the trade-offs involved in model optimization and offers a path forward for deploying high-performing AI models in resource-limited environments.


Approach


Backbone Model

EfficientViT [1] is a family of efficient, high-resolution vision transformer models that tackle the challenges of high-resolution dense prediction tasks through a novel combination of methods. The architecture integrates a multi-scale ReLU linear attention mechanism that captures a global receptive field while enabling multi-scale learning. By using lightweight and hardware-efficient operations, it eliminates the reliance on computationally expensive softmax attention or large-kernel convolutions. Specifically, EfficientViT replaces softmax attention with a ReLU linear attention mechanism, reducing computational complexity from quadratic to linear while still efficiently capturing global context (Figure 1).

Figure 1. Latency Comparison Between Softmax Attention and ReLU Linear Attention.
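To make this concrete, the sketch below shows one way ReLU linear attention can be computed in PyTorch: because ReLU is applied elementwise to the queries and keys, the associativity of matrix multiplication lets the key-value summary be formed first, so the cost grows linearly with the number of tokens rather than quadratically. This is an illustrative sketch under assumed tensor shapes, not the official EfficientViT implementation.

    import torch

    def relu_linear_attention(q, k, v, eps=1e-6):
        # q, k, v: (batch, heads, tokens, dim). Softmax attention would build
        # a (tokens x tokens) matrix, i.e. O(N^2); here K^T V is computed
        # first, keeping the cost linear in the number of tokens.
        q = torch.relu(q)                     # non-negative query features
        k = torch.relu(k)                     # non-negative key features
        kv = torch.einsum("bhnd,bhne->bhde", k, v)                         # (dim x dim) summary
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)   # per-token normalizer
        return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

    # Example: 8 heads over a 64x64 feature map (4096 tokens), 16-dim heads
    q = torch.randn(1, 8, 4096, 16)
    k = torch.randn(1, 8, 4096, 16)
    v = torch.randn(1, 8, 4096, 16)
    print(relu_linear_attention(q, k, v).shape)   # torch.Size([1, 8, 4096, 16])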
To further enhance its representational capacity, EfficientViT employs small-kernel convolutions that effectively generate multi-scale tokens. Attention is then applied across these tokens, ensuring that the model can efficiently handle dense prediction tasks at high resolution (Figure 2). The attention mechanism maintains the ability to learn global features while benefiting from multi-scale hierarchical token representations. This combination allows EfficientViT to achieve high performance while reducing computation, storage, and latency.
Figure 2. EfficientViT’s Building Block (left) and Multi-Scale Linear Attention (right).
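As a rough illustration of the multi-scale token idea, the sketch below aggregates nearby tokens with small-kernel depthwise convolutions to produce coarser-scale tokens alongside the original ones; the kernel sizes and channel counts are assumptions chosen for illustration and do not reproduce EfficientViT's exact block.

    import torch
    import torch.nn as nn

    class MultiScaleTokens(nn.Module):
        # Illustrative sketch: small-kernel depthwise convolutions aggregate
        # neighboring tokens into coarser-scale tokens at low cost.
        def __init__(self, dim):
            super().__init__()
            self.dw3 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
            self.dw5 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)

        def forward(self, x):                 # x: (B, C, H, W) token map
            return torch.cat([x, self.dw3(x), self.dw5(x)], dim=1)

    tokens = torch.randn(1, 64, 32, 32)
    print(MultiScaleTokens(64)(tokens).shape)  # torch.Size([1, 192, 32, 32])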
The architecture is further optimized by balancing the complexity of each layer and incorporating efficient operations, ensuring the model remains practical for various edge computing platforms. In empirical evaluations across different computing environments (mobile CPUs, edge GPUs, and cloud GPUs), EfficientViT achieves substantial speedups over previous models like SegFormer and SegNeXt, particularly in semantic segmentation and super-resolution tasks. This demonstrates its ability to balance performance and speed across a diverse range of dense prediction applications, making it an ideal solution for high-resolution vision tasks on resource-constrained devices.

Pruning

Structured pruning refers to the systematic removal of specific groups of parameters or neurons in a neural network, as opposed to unstructured pruning, which removes individual weights. In channel-wise pruning, entire channels (filters) of convolutional layers are pruned based on their contribution to the network's overall performance.

Figure 3. Channel Pruning in a Convolutional Layer. [2]
Channel-wise pruning begins by ranking the channels of each convolutional layer by the Frobenius norm of their filters, a measure of a filter's magnitude that indicates its relative importance:

\[ \|W\|_F = \sqrt{\sum_{i,j} W_{ij}^2} \]

Filters are then ranked in ascending order according to their Frobenius norm, with smaller values suggesting that the corresponding filters contribute less to the network's output.
After ranking the filters, selective removal is conducted based on a predefined pruning ratio. This ratio determines the proportion of channels to be pruned, such as removing the 30% of channels with the lowest norms. The less important filters identified from the ranking are then pruned from the network. Removing these channels reduces both computational load and memory usage.
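A minimal sketch of this ranking and selection step for a single convolutional layer is shown below; the helper name and tensor shapes are illustrative assumptions, and in a full network the matching input channels of downstream layers must be removed as well before fine-tuning.

    import torch

    def select_channels_to_keep(conv_weight, prune_ratio):
        # conv_weight: (out_channels, in_channels, kH, kW) of a Conv2d layer.
        # Frobenius norm of each output filter = sqrt(sum of squared weights).
        norms = conv_weight.flatten(start_dim=1).norm(p=2, dim=1)
        num_prune = int(prune_ratio * conv_weight.shape[0])
        order = torch.argsort(norms)          # ascending: smallest norms first
        return torch.sort(order[num_prune:]).values

    # Example: prune 30% of a 64-filter 3x3 convolution
    w = torch.randn(64, 32, 3, 3)
    keep = select_channels_to_keep(w, prune_ratio=0.3)
    pruned_w = w[keep]                        # shape (45, 32, 3, 3)
    print(pruned_w.shape)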
Once the pruning process is complete, retraining is necessary because this structural modification temporarily reduces the model's accuracy. To restore performance, the pruned network undergoes fine-tuning with the original training data, allowing the remaining weights to adjust and compensate for the changes. After retraining, the model is evaluated on a validation set to confirm its performance, ensuring that the pruning ratio and other hyperparameters are optimized to maintain acceptable accuracy levels while significantly reducing computational requirements.

Quantization

Linear quantization is a technique used to reduce the precision of neural network parameters, mapping floating-point values to a fixed, lower-precision integer range. The method follows a two-step process involving uniform scaling and mapping, which significantly reduces the model size and computational complexity, making it ideal for deployment on resource-constrained devices.

Figure 4. Symmetric Linear Quantization. [2]
Uniform scaling proportionally maps the full range of parameter values onto the target integer range. This is achieved by analyzing the range of floating-point values used in the model's weights and activations. By identifying the minimum and maximum values across all parameters or activations, a scaling factor can be computed that proportionally maps this range to fit within the narrower bounds of the target integer type (e.g., 8-bit or 16-bit). For instance, if an 8-bit integer range is desired, the scaling factor is determined such that the floating-point range is evenly distributed across 256 discrete levels. This scaling factor ensures that the entire dynamic range of the original parameters is retained in a compressed form.
Once the scaling factor is calculated, each parameter or activation value undergoes a linear transformation to align with the chosen integer range. This transformation typically involves a combination of scaling and shifting:
  • Scaling: Each floating-point value is multiplied by the scaling factor to adjust it proportionally within the target range.
  • Shifting: To ensure the integer representation starts at a desired base value, an offset is applied to the scaled values so that they fall within the bounds of the target integer type (e.g., -128 to 127 for signed 8-bit integers). For example, if the floating-point weights range from -1.0 to +1.0, they are scaled and shifted to fit within the -128 to 127 range.

By uniformly scaling and mapping weights and activations, the resulting integer representations retain the approximate relationships and characteristics of the original floating-point parameters. This mapping enables efficient inference on hardware optimized for integer operations while ensuring minimal loss in model accuracy.
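A minimal sketch of such a symmetric INT8 mapping is given below; the bit width, clamping range, and helper names are illustrative assumptions rather than the exact procedure used in our experiments, and with a symmetric range the shift (zero-point) reduces to zero.

    import torch

    def symmetric_quantize(x, num_bits=8):
        # Map floating-point values onto [-qmax, qmax] with one scale factor;
        # with a symmetric range no zero-point (shift) is needed.
        qmax = 2 ** (num_bits - 1) - 1                    # 127 for INT8
        scale = x.abs().max() / qmax
        q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int8)
        return q, scale

    def dequantize(q, scale):
        # Recover an approximation of the original floating-point values.
        return q.float() * scale

    # Example: weights in [-1.0, 1.0] mapped onto the INT8 grid
    w = torch.empty(256).uniform_(-1.0, 1.0)
    q, scale = symmetric_quantize(w)
    print(q.min().item(), q.max().item())                 # within -127 .. 127
    print((w - dequantize(q, scale)).abs().max())         # small rounding error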
After the quantization process, the neural network model typically undergoes fine-tuning to adjust for any minor discrepancies caused by the reduction in precision. This post-quantization training helps refine the model's parameters to operate effectively within the new quantized framework, ensuring that the model maintains an acceptable level of accuracy while significantly reducing computational demands and memory usage.


Experiment Setup


We evaluate the performance of all optimized EfficientViT models on the NVIDIA Jetson Orin Nano, which features a balanced CPU-GPU architecture well-suited for running efficient models. Model inference is performed with the NVIDIA TensorRT inference framework to maximize computational efficiency.
All optimized EfficientViT models are first converted to the ONNX (Open Neural Network Exchange) format for compatibility with TensorRT; this conversion enables deployment of the models onto the Jetson Orin Nano.
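As a rough sketch of this conversion step, the snippet below exports a stand-in model to ONNX with torch.onnx.export and notes how a TensorRT engine can then be built with trtexec on the device; the stand-in model, file names, input resolution, and opset version are assumptions for illustration, not the exact commands used in our setup.

    import torch
    import torch.nn as nn

    # Stand-in model for illustration only; in practice this would be the
    # optimized (pruned and/or quantized) EfficientViT checkpoint.
    model = nn.Sequential(
        nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1000),
    ).eval()

    dummy = torch.randn(1, 3, 224, 224)       # assumed input resolution

    # Export the graph to ONNX so TensorRT can ingest it
    torch.onnx.export(model, dummy, "model.onnx",
                      input_names=["input"], output_names=["logits"],
                      opset_version=13)

    # A TensorRT engine can then be built on the Jetson, for example:
    #   trtexec --onnx=model.onnx --int8 --saveEngine=model.engine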
For evaluation, we use a subset of high-resolution images from the ImageNet validation set to measure inference latency, and we assess accuracy via top-1 classification accuracy to ensure minimal loss of predictive performance during optimization.


Results


Pruning

We evaluated the impact of pruning on the EfficientViT model deployed on the NVIDIA Jetson Orin Nano. Pruning was implemented at varying levels (20%, 40%, and 60%) to observe how reductions in model complexity affect performance metrics such as top-1 accuracy, parameter count, multiply-accumulate operations (MACs), and inference latency.

Table 1. Performance evaluation of different pruning ratios.

The results in Table 1 indicate a trade-off between model accuracy and computational efficiency. Lower levels of pruning (20% and 40%) manage to preserve more of the model's accuracy while still achieving noticeable improvements in latency and computational overhead. In contrast, higher levels of pruning (60%) significantly enhance computational efficiency but at a substantial cost to accuracy.

Quantization

To assess the impact of quantization on the EfficientViT model when deployed on the NVIDIA Jetson Orin Nano, we compare the original floating-point model (EfficientViT) with its INT8 quantized version to determine the effects on accuracy, model size, computational complexity, and inference latency.
As shown in Table 2, quantizing the EfficientViT model to INT8 on the Jetson Orin Nano yields substantial improvements in computational efficiency, particularly in inference latency, while preserving the model's accuracy. These results underline the effectiveness of quantization for deploying deep learning models on edge devices, where computational resources and power budgets are limited. By converting the model to a lower precision, we achieve faster processing times, which is crucial for real-time applications, without a significant compromise in model performance. This makes INT8 quantization a compelling option for deploying neural network models in resource-constrained environments.

Table 2. Performance evaluation of linear quantization.

Pruning and Quantization

In our experiments, we combined pruning with INT8 quantization to assess their cumulative effect on the EfficientViT model when deployed on the NVIDIA Jetson Orin Nano. This combination aims to maximize the efficiency of the model while minimizing the compromise on accuracy and performance.
The results demonstrate a clear trend: as the level of pruning increases, combined with quantization, there is a significant decrease in both parameter count and MACs, leading to substantially lower latency. However, this efficiency comes at the cost of reduced accuracy. The 20% pruning level offers a reasonable balance, showing a minimal drop in accuracy with considerable gains in efficiency. In contrast, 60% pruning, although highly efficient, might be less desirable due to the substantial accuracy loss.

Table 3. Performance evaluation of pruning + quantization.

This combination of pruning and quantization provides valuable insights into the trade-offs between efficiency and accuracy. For edge deployments, where computational resources and response times are critical, a moderate level of pruning combined with quantization could offer the optimal balance, especially in scenarios where slight decreases in accuracy are acceptable for significant gains in speed and resource utilization.


Conclusion


In this work, we systematically explored the adaptation of the EfficientViT model for edge deployment through the implementation of pruning, quantization, and their combined application, revealing the trade-offs between performance and computational efficiency. The findings confirm that while each technique independently contributes to model optimization, their integration is particularly potent, significantly enhancing the practical viability of deploying sophisticated transformer models on resource-constrained edge devices.
Pruning alone was effective in reducing model size and computational requirements, but with varying impacts on accuracy depending on the intensity of the pruning applied. Quantization, while maintaining model structure, dramatically improved processing speed by reducing the precision of the computations, which is crucial for real-time applications. Notably, the combined approach of pruning and quantization provided a balanced solution, substantially lowering the latency and computational demands without a drastic reduction in accuracy.
The outcomes of this project underscore the necessity of considering both pruning and quantization not just as tools for optimization but as essential components of a strategic approach to model deployment on edge devices. These techniques enable the retention of transformative capabilities of AI models like EfficientViT, making them accessible and functional in scenarios where computing power is limited.
For future work, we plan to explore advanced methods such as dynamic quantization and automated pruning algorithms to further refine the deployment process. Additionally, applying these optimized models to more diverse real-world scenarios could illustrate their broader impact and facilitate wider adoption in edge computing environments. This would not only enhance the functionality of edge devices but also open new avenues for AI deployment across various sectors.


References


[1] Cai, H., Li, J., Hu, M., Gan, C., & Han, S. (2022). EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction. arXiv:2205.14756.
[2] Han, S. Lecture Slides from TinyML and Efficient Deep Learning Computing.