On the 25th of September, the Customized Parallel Computing (CPC) group presented an FPGA demo at the Centre for Immersive Visual Technologies (CIVIT) of AivoTTA, our application-specific processor tailored for convolutional neural network (CNN) inference. In the demo session, the processor design ran on an FPGA, executing CNN-based face detection software in real time.
For those unfamiliar with the topic, FPGAs are reconfigurable logic devices that can implement digital systems such as software-programmable processors (“soft cores”) or fixed-function video encoders. In addition to production use, they are also popular for prototyping and testing hardware designs targeted for ASIC implementation. The FPGA chip we used in this demo was a Xilinx Zynq 7020, which also has a dual-core ARM “hard processor” on the same chip as the FPGA fabric. The chip, integrated on the pretty PYNQ-Z1 development board, is already in use in some of the faculty’s introductory programming courses, with plans to extend its use to various other digital design and computer architecture courses given by our laboratory.
For the demo at CIVIT, we took our previously designed CNN processor, originally targeted for ASIC implementation, and synthesized it for an FPGA chip. The original processor design was a custom DSP targeted at low-power usage scenarios such as nano-form-factor drones (brains for spy bees!) or battery-powered smart cameras. The processor, which we named “AivoTTA” (Finnish for “without brains” — har har har!), was originally designed by a visiting master’s student, Mr. Jos IJzerman from Eindhoven University of Technology, the Netherlands. The face detection network, along with its training data, also came from our friends at TU Eindhoven, more specifically from its PARsE research group led by Prof. Henk Corporaal.
The AivoTTA processor design was created using the TTA-Based Co-design Environment (TCE), a toolset for the design and programming of application-specific processors. It is based on the transport-triggered architecture (TTA) processor design paradigm, proposed by Prof. Henk Corporaal and colleagues in the 1990s as a solution to scaling bottlenecks in VLIW-style processors. TCE has been continuously developed by CPC since the early 2000s in various research projects. It has a graphical user interface for defining which operations the processor can execute, how these operations are grouped into function units, which register files store intermediate results, and how the components are connected together. The toolset includes a Clang/LLVM-based, runtime-retargetable compiler that takes in C, C++ or OpenCL programs and produces fine-grained parallelized code for the designed processor architectures. TCE can also simulate the execution of the programs at the architectural level to produce instruction cycle counts and utilization statistics, without the need for slower, more detailed hardware simulations. From the user-defined processor architecture descriptions, TCE can produce a register transfer level (RTL) description of the hardware in VHDL or Verilog. Finally, the RTL can be synthesized for an ASIC or, as in this demo, an FPGA chip.
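To give a flavor of the programmer’s view: since TCE compiles plain C, a CNN building block can be written as an ordinary multiply-accumulate loop and left to the retargetable compiler to map onto the processor’s function units. Here is a minimal, purely illustrative sketch of such a kernel (not the actual demo source code; names and sizes are ours):

```c
#include <stdint.h>

/* A plain-C 3x3 convolution window of the kind TCE's compiler can
 * map onto a designed processor. Illustrative only: the operand
 * widths mirror those mentioned in the text (8-bit weights,
 * 16-bit pixels, 32-bit accumulator). */
int32_t conv3x3(const int16_t img[3][3], const int8_t k[3][3])
{
    int32_t acc = 0;
    for (int y = 0; y < 3; ++y)
        for (int x = 0; x < 3; ++x)
            acc += (int32_t)k[y][x] * (int32_t)img[y][x]; /* MAC */
    return acc;
}
```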
For the FPGA demo, in addition to the AivoTTA “soft custom DSP core”, we added a bit of logic on the FPGA that converts the incoming HDMI video stream to grayscale. The stream is written to memory, from which AivoTTA reads it to perform its face detection task. Based on the results of the face detection application, rectangles are drawn on top of the original full-color video stream to mark the detected faces, and the result is sent to a monitor over HDMI for visualization.
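As an aside, a grayscale conversion of this kind typically amounts to a fixed-point weighted sum of the RGB channels. A common formulation uses the ITU-R BT.601 luma weights; the following C sketch illustrates the arithmetic (the demo performs the equivalent step in FPGA logic, so this is an assumption about the weighting, not the actual demo implementation):

```c
#include <stdint.h>

/* Illustrative fixed-point RGB-to-grayscale conversion using BT.601
 * luma weights (0.299 R + 0.587 G + 0.114 B), scaled by 256 so the
 * weights 77 + 150 + 29 sum to exactly 256. */
uint8_t rgb_to_gray(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((77u * r + 150u * g + 29u * b) >> 8);
}
```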
While AivoTTA was originally designed with ASIC implementation in mind, the FPGA-optimized implementation is interesting in its own right because the processor design has a highly customized parallel datapath and also uses “special arithmetics”. The number-crunching power of the processor boils down to a customized vector function unit for multiply-accumulates (which form the core of the computation needed in CNN inference) that processes vector operands with 16-bit and 8-bit elements and outputs a vector of 32-bit elements, an arrangement that is not typical of more general processors one can purchase off the shelf. This specialization and the processor’s carefully designed parallel datapath allow us to exploit the benefits of customized computing and fine-grained parallelism offered by the flexible FPGA as an implementation platform.
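In scalar terms, the mixed-width arithmetic of such a vector MAC unit can be modeled behaviorally as below. This is a software sketch under our own assumptions (the vector width here is an illustrative constant, not the actual hardware parameter, and the real unit processes all elements in parallel in a single operation):

```c
#include <stdint.h>

#define VEC_WIDTH 8 /* illustrative width, not the real hardware value */

/* Behavioral model of a vector multiply-accumulate with the element
 * widths described in the text: 8-bit weights times 16-bit pixels,
 * accumulated into 32-bit results, so each product (at most 23 bits
 * plus sign) fits comfortably in its accumulator element. */
void vec_mac(int32_t acc[VEC_WIDTH],
             const int8_t w[VEC_WIDTH],
             const int16_t x[VEC_WIDTH])
{
    for (int i = 0; i < VEC_WIDTH; ++i)
        acc[i] += (int32_t)w[i] * (int32_t)x[i];
}
```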
As custom ASIC runs are very expensive and demanding in engineering and verification time, it is extremely rare for academic processor design case studies to be manufactured on new silicon. Typically the designs are only synthesized and simulated at the gate level to produce numbers for academic papers, which most publication forums consider proof enough. From this perspective, FPGA demonstrators also serve to boost the motivation of the research group and its followers, since they make it possible to show running demos that both utilize the techniques developed by the group and perform some “sensible task” in real time (although usually not as efficiently as the envisioned ASIC would). In the case of CPC’s work, FPGA demonstrators show that the developed hardware and software components work correctly together and that no pieces needed to interact with a full hardware system are missing. In addition, when the performance of the FPGA implementation can be visually inspected (in terms of increasing frame rates, for example), we can concretely experience the fruits of enhancements made to the hardware implementation, the runtime system and the compiler, which is a great way to “gamify” our work.
The demonstrator will also be shown at Tietotekniikan yö on the 11th of October. So if you attend the event, please come by and check it out!
The FPGA demo was mostly created and this blog post written by Aleksi Tervo, Timo Viitanen, Lasse Lehtonen and Pekka Jääskeläinen of the Customized Parallel Computing research group.