onnx2versal
Presentation slides: https://docs.google.com/presentation/d/1xq7Fp-YRgMAOc_wpQ1FjY7M9irCAVrBn46ohUy1YuEw/edit?usp=sharing

The slides cover an introduction to AI Engines and their architecture, onnx2versal, and benchmark results on MLPerf Tiny.
The AI Engine is part of the Versal Adaptive Compute Acceleration Platform (ACAP) architecture, designed for high compute density, deterministic timing, and high-performance applications. It comprises an array of tiles with Very Long Instruction Word (VLIW) parallelism and SIMD fixed-point and floating-point processors. Below are images taken from Xilinx docs showing tile components and interfaces.
For more details:
This repo holds AIE kernels/graphs and generator scripts that create a system-level design for AI Engines given a `some_model.onnx` and `some_data.npy`. It is built on top of the AI Engine ISA, the AIE programming API and the ADF graph programming model. Verify, profile and run your ONNX models on AI Engine devices!
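The expected inputs are plain ONNX and NumPy artifacts. A minimal sketch of preparing them (the shapes, sample count and the commented-out export call are illustrative assumptions, not pipeline requirements):

```python
import numpy as np

# Illustrative verification batch: 100 samples of a 490-feature input.
# (Shape is an assumption; use whatever your model expects.)
some_data = np.random.rand(100, 490).astype(np.float32)
np.save("some_data.npy", some_data)

# Exporting a trained PyTorch model would look like this (assumes `model` exists):
# import torch
# torch.onnx.export(model, torch.from_numpy(some_data[:1]), "some_model.onnx")
```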
The pipeline has been tested on the MLPerf Tiny models. The models below are trained from the hls4ml-finn repositories for direct comparison with the hls4ml implementation. Compared to hls4ml, it shows latency improvements of 5x (Keyword Spotting), 8x (Image Classification) and 18x (Anomaly Detection) at 11-15% utilization. For details on the hls4ml implementation, see the hls4ml MLPerf Tiny paper.
Use Case | Dtype | Latency (cycles or ns) | Throughput (samples/s) | Resource Utilization (Kernels/Buffers/Stream/PLIO/GMIO) | Accuracy (first 1k) | Quality Target | Model
---|---|---|---|---|---|---|---
Keyword Spotting | fp32 / uint8 | 35076 / 3159 | 75369 / 1157407 | 46/56/116/5/24; 48/51/83/7/0 | 84.8% (Top 1) | 82.5% (Top 1) | MLP
Anomaly Detection | fp32 / uint8 | 3165 / 1014 | 3205128 / 7142857 | 44/58/128/7/0; 46/48/76/2/0 | 0.830 (AUC) | 0.83 (AUC) | AutoEncoder
Image Classification | fp32 / uint8 | 739274 / 174992 | 4324 / 22258 | 62/68/125/9/7; 90/95/144/2/5 | 84.1% (Top 1) | 83.5% (Top 1) | CNN
Notes on the table:

* Latency is measured with `aie::tile::current().cycles()`, obtained through aiesimulator logs.
* Throughput is computed by `throughput.py` on aiesimulator output files and assumes the AI Engine is clocked at 1 GHz.

Below are certain issues that may arise when using the pipeline.
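Since the clock is assumed to be 1 GHz, one cycle corresponds to one nanosecond, and throughput follows directly from simulated timestamps. A minimal sketch of that arithmetic (hypothetical; not the actual `throughput.py`):

```python
def throughput_samples_per_s(first_ns: float, last_ns: float, n_samples: int) -> float:
    """Samples/s from aiesimulator timestamps; at 1 GHz, 1 cycle == 1 ns."""
    return n_samples * 1e9 / (last_ns - first_ns)

# 1000 samples over 1 ms of simulated time -> 1,000,000 samples/s
rate = throughput_samples_per_s(0.0, 1_000_000.0, 1000)
```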
To test a custom kernel `my_op`, provide the following files:

* `design/aie_src/my_op.cc`
* `design/aie_src/my_op.h`
* `design/aie_src/graph_my_op.cpp`
* `design/aie_src/graph_my_op.h`
* `data/my_op_in.txt`
* `data/my_op_golden.txt`
```
TARGET=sw_emu GRAPH=my_op make clean_reports graph aiesim # x86 graph simulation
TARGET=hw_emu GRAPH=my_op make clean_reports graph aiesim # SystemC graph simulation
```
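After aiesim completes, the emitted output can be diffed against the golden file. A minimal sketch of such a check, assuming whitespace-separated float dumps (the output path and tolerance are assumptions):

```python
import numpy as np

def check_golden(out_txt: str, golden_txt: str, rtol: float = 1e-3) -> bool:
    """Compare a simulator output dump against the golden reference element-wise."""
    out, golden = np.loadtxt(out_txt), np.loadtxt(golden_txt)
    return out.shape == golden.shape and bool(np.allclose(out, golden, rtol=rtol))
```

Usage would look like `check_golden("reports/my_op_out.txt", "data/my_op_golden.txt")`, where the `reports/` output path is hypothetical.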