Hardware Architecture Design for Regular Convolutional Neural Networks Targeting Resource-Constrained Devices with an Automated Framework
[Thesis]
Hailesellasie, Muluken T.
Hasan, Syed Rafay
Tennessee Technological University
2019
158 p.
Ph.D.
Tennessee Technological University
2019
The popularity of deep learning has increased sharply in the past few years owing to its promising results. The Convolutional Neural Network (CNN) is one of the most widely used deep learning algorithms in computer vision applications. While the performance of CNNs is impressive, deploying them on current embedded technology is a challenge because CNN models are both computation-intensive and memory-intensive. Hence, there is a growing need for hardware-based solutions that can make real-time computer vision on resource-constrained devices a reality. Owing to their reconfigurability, high performance, and low power consumption, embedded systems with Field Programmable Gate Arrays (FPGAs) are becoming a hardware platform of choice for many deep learning applications. In this work, we explore various hardware architectures and design strategies to improve computation time and minimize the required computing resources. To alleviate the computational intensity of CNN models, we first propose an efficient convolutional layer architecture with improved computation time per convolution. The proposed architecture balances latency against resource consumption by distributing the input data across a number of memory blocks. Because each memory block can be read simultaneously, the number of clock cycles is reduced. Subsequently, we propose an architecture that computes feature maps in parallel using a custom-designed data flow. The proposed strategy achieves a substantial computation speedup over the state of the art for the same CNN model. However, although there is a need for flexible architectures that can serve various models, existing architectures are tailored or optimized to a particular CNN architecture.
To address this limitation, we propose a novel and highly flexible hardware architecture that can process most regular CNN variants and achieves better resource utilization. We propose processing cores implemented both with and without multipliers. Fixed-point and power-of-2 quantization schemes are also developed to significantly reduce the on-chip memory space and the logic needed in the targeted device. With a substantial on-chip memory reduction and an increase in performance and power efficiency, our results demonstrate that the proposed architecture is well suited to resource-constrained devices. To enhance the usability of the proposed architecture for deep learning practitioners and to improve the scalability of the design, we propose a framework that auto-generates a CNN processor in the form of a synthesized hardware intellectual property (IP). The framework optimizes the hardware IP based on the model workload and the target device specifications. It employs a memory traffic optimization algorithm that yields higher performance and an on-chip fitting optimization that yields higher resource utilization efficiency. Our results demonstrate that the proposed framework is effective in reducing the design time and optimizing the performance and resource consumption of the hardware IP.
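As an illustration of why power-of-2 quantization allows a processing core to work without multipliers, the following minimal Python sketch rounds each weight to a signed power of two and replaces the multiplication with a bit shift. It is not taken from the thesis: the function names `quantize_pow2` and `shift_multiply`, the exponent range, and the choice to round in the log2 domain are assumptions made for illustration only.

```python
import math


def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Approximate weight w as sign * 2**exp and return (sign, exp).

    The exponent range [min_exp, max_exp] is an assumed example range.
    Rounding is done on log2(|w|), which is one common choice.
    """
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))  # clamp to the representable range
    return sign, exp


def shift_multiply(x, sign, exp):
    """Multiply integer activation x by sign * 2**exp using shifts only."""
    if sign == 0:
        return 0
    # A negative exponent becomes an arithmetic right shift.
    shifted = x << exp if exp >= 0 else x >> -exp
    return sign * shifted


def dot_pow2(activations, weights):
    """Dot product in which every multiply is replaced by a shift."""
    total = 0
    for x, w in zip(activations, weights):
        sign, exp = quantize_pow2(w)
        total += shift_multiply(x, sign, exp)
    return total


if __name__ == "__main__":
    acts = [64, 32, 16]         # integer (fixed-point) activations
    wts = [0.3, -0.5, 0.12]     # floating-point weights before quantization
    print([quantize_pow2(w) for w in wts])
    print(dot_pow2(acts, wts))
```

In this sketch each weight is stored as a sign and a small exponent rather than a full-precision value, which is the kind of saving in memory and logic that the abstract attributes to power-of-2 quantization. The actual bit widths and rounding rules used in the thesis are not stated here.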