In recent years, in order to facilitate the efficient application of deep convolutional neural networks, it has become increasingly important to accelerate their inference stage. But with the spread of numerous heterogeneous computing devices, today's popular deep learning inference tools support only specific devices and therefore cannot effectively use different GPU devices to accelerate DNN inference. To address this issue, we propose an OpenCL-based parallel deep convolutional neural network inference algorithm. Firstly, we design and implement parallel kernel code in OpenCL to accelerate depthwise separable convolution, and implement parallel matrix multiplication combined with clBLAS to accelerate traditional convolution. Meanwhile, we design OpenCL parallel kernels for the other operations in the inference stage of deep convolutional neural networks. Secondly, we further improve inference performance by means of kernel fusion and by increasing the workload per core. Finally, the MobileNet v1 network and a 21-layer residual network implemented in OpenCL are run on an AMD Radeon Vega Frontier GPU and an Nvidia GeForce GTX 1070 GPU. Compared to the Caffe implementation, speedups of 40.16x and 1.67x are achieved on the AMD GPU, and speedups of 14.95x and 1.11x are achieved on the Nvidia GPU.
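The abstract does not include the authors' kernels, but the depthwise stage it describes can be sketched in plain OpenCL C. In the minimal sketch below, each work-item computes one output element and each channel is convolved only with its own filter, which is what distinguishes depthwise from standard convolution. The kernel name, argument list, data layouts, and the fixed 3x3 filter size are illustrative assumptions, not the paper's actual implementation.

```c
// Hypothetical sketch of a depthwise 3x3 convolution in OpenCL C.
// One work-item per output element; channel c uses only filter c.
__kernel void depthwise_conv3x3(__global const float *input,   // [C][H][W]
                                __global const float *filter,  // [C][3][3]
                                __global const float *bias,    // [C]
                                __global float *output,        // [C][OH][OW]
                                const int C, const int H, const int W,
                                const int OH, const int OW,
                                const int stride, const int pad)
{
    const int ow = get_global_id(0);   // output column
    const int oh = get_global_id(1);   // output row
    const int c  = get_global_id(2);   // channel (= filter index)
    if (ow >= OW || oh >= OH || c >= C) return;

    float acc = bias[c];
    for (int kh = 0; kh < 3; ++kh) {
        for (int kw = 0; kw < 3; ++kw) {
            const int ih = oh * stride - pad + kh;
            const int iw = ow * stride - pad + kw;
            if (ih >= 0 && ih < H && iw >= 0 && iw < W) {
                acc += input[(c * H + ih) * W + iw]
                     * filter[(c * 3 + kh) * 3 + kw];
            }
        }
    }
    output[(c * OH + oh) * OW + ow] = acc;
}
```

Kernel fusion, the second optimization the abstract mentions, generally means folding elementwise post-processing (bias add, activation) into the producing kernel so the intermediate tensor never makes a round trip through global memory. The following sketch assumes a 1x1 (pointwise) convolution with a fused bias and ReLU6 activation, in the spirit of MobileNet's pointwise layers; again, all names and layouts are assumptions for illustration.

```c
// Hypothetical fused pointwise convolution: 1x1 conv + bias + ReLU6 in one kernel,
// avoiding a separate elementwise activation kernel and its extra memory traffic.
__kernel void pointwise_conv_bias_relu6(__global const float *input,   // [Cin][H][W]
                                        __global const float *weights, // [Cout][Cin]
                                        __global const float *bias,    // [Cout]
                                        __global float *output,        // [Cout][H][W]
                                        const int Cin, const int H, const int W,
                                        const int Cout)
{
    const int x  = get_global_id(0);   // column
    const int y  = get_global_id(1);   // row
    const int co = get_global_id(2);   // output channel
    if (x >= W || y >= H || co >= Cout) return;

    // 1x1 convolution: dot product over input channels at one spatial position.
    float acc = bias[co];
    for (int ci = 0; ci < Cin; ++ci) {
        acc += input[(ci * H + y) * W + x] * weights[co * Cin + ci];
    }

    // Fused ReLU6 activation applied in registers.
    acc = fmin(fmax(acc, 0.0f), 6.0f);

    output[(co * H + y) * W + x] = acc;
}
```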