Presentation Slides: https://docs.google.com/presentation/d/1xq7Fp-YRgMAOc_wpQ1FjY7M9irCAVrBn46ohUy1YuEw/edit?usp=sharing
This covers an introduction to AI engines and their architecture, onnx2versal, and benchmark results on MLPerf Tiny.

What are AI engines?

The AI engine is part of the Versal Adaptive Compute Acceleration Platform (ACAP) architecture, designed for applications that need high compute density, deterministic timing and high performance. It comprises an array of tiles that support Very Long Instruction Word (VLIW) parallelism, with SIMD fixed-point and floating-point processors. Below are images taken from Xilinx docs showing tile components and interfaces.

For more details

Architecture documentation: https://docs.xilinx.com/r/en-US/am009-versal-ai-engine/Overview
Main website: https://www.xilinx.com/products/technology/ai-engine.html
White paper: https://www.xilinx.com/content/dam/xilinx/support/documents/white_papers/wp506-ai-engine.pdf

What is onnx2versal?

This repo holds AIE kernels/graphs and generator scripts that create a system-level design for AI engines, given some_model.onnx and some_data.npy. It is built on top of the AI Engine ISA, the AIE programming API and the ADF graph programming model. Verify, profile and run your ONNX models on AI engine machines!

TLDR CLI commands

GRAPH=tiny_kws
 
# Step 1: fuse
python fuse_onnx.py ../models/${GRAPH}.onnx ../models/${GRAPH}.onnx
 
# Step 2: quantize
python -m onnxruntime.quantization.preprocess --input ../models/${GRAPH}.onnx --output ../models/${GRAPH}_infer.onnx
python quantize_onnx.py ../models/${GRAPH}_infer.onnx ../models/${GRAPH}_int8.onnx ../data/$GRAPH/X_test.npy
 
# Step 3: generate
python generate.py ../models/${GRAPH}_int8.onnx ../data/$GRAPH/X_test.npy
 
# Step 4: Test latency
TARGET=hw_emu GRAPH=${GRAPH}_int8 make graph aiesim_profile
 
# Step 5: Test throughput
TARGET=hw_emu DOUT=0 DLOG=0 GRAPH=${GRAPH} make graph clean_reports aiesim ITER_CNT=2
python throughput.py reports_dir/$GRAPH/hw_emu/aiesimulator_output/k*
 
# Step 6: Build for hardware
TARGET=hw DOUT=0 DLOG=0 GRAPH=${GRAPH} make graph kernels xsa application package
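
The commands above derive every file name from a single GRAPH variable. As a reading aid, here is a minimal Python sketch that builds the same command lines for a given graph name without running anything. The function name `tldr_commands` is hypothetical and not part of the repo; the command strings are copied from the listing above.

```python
def tldr_commands(graph):
    """Return the TLDR command lines for `graph` as strings (dry run only).

    Hypothetical helper for illustration; it only mirrors the commands
    listed in this README and executes nothing.
    """
    model = f"../models/{graph}.onnx"
    infer = f"../models/{graph}_infer.onnx"
    int8 = f"../models/{graph}_int8.onnx"
    data = f"../data/{graph}/X_test.npy"
    return [
        # Step 1: fuse (input and output are the same path, as in the listing)
        f"python fuse_onnx.py {model} {model}",
        # Step 2: quantize
        f"python -m onnxruntime.quantization.preprocess --input {model} --output {infer}",
        f"python quantize_onnx.py {infer} {int8} {data}",
        # Step 3: generate
        f"python generate.py {int8} {data}",
        # Step 4: test latency
        f"TARGET=hw_emu GRAPH={graph}_int8 make graph aiesim_profile",
        # Step 5: test throughput
        f"TARGET=hw_emu DOUT=0 DLOG=0 GRAPH={graph} make graph clean_reports aiesim ITER_CNT=2",
        f"python throughput.py reports_dir/{graph}/hw_emu/aiesimulator_output/k*",
        # Step 6: build for hardware
        f"TARGET=hw DOUT=0 DLOG=0 GRAPH={graph} make graph kernels xsa application package",
    ]


if __name__ == "__main__":
    for line in tldr_commands("tiny_kws"):
        print(line)
```

Printing the list for another graph name shows which intermediate files (`_infer.onnx`, `_int8.onnx`) each step expects to exist before it runs.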

Usage

For setup and run instructions, see docs/md/usage.md.
For an end-to-end example, see Lenet Example. TODO: this example has not been updated or tested for a while.
Jupyter notebooks that run through the same examples:
- Conversion to ONNX: pytorch2onnx notebook, tf2onnx notebook
- onnx2versal: onnx2versal notebook

How good are the AI engines?

See details at docs/md/profile.md

The pipeline has been tested on MLPerf Tiny models. The models below were trained using the hls4ml-finn repositories for direct comparison with the hls4ml implementation. The pipeline shows latency improvements of 5x (Keyword Spotting), 8x (Image Classification) and 18x (Anomaly Detection) at 11-15% utilization. For details on the hls4ml implementation, see the hls4ml MLPerf Tiny paper.

Use Case	Dtype	Latency (cycles or ns)	Throughput (samples/s)	Resource Utilization (Kernels/Buffers/Stream/PLIO/GMIO)	Accuracy (first 1k)	Quality Target	Model
Keyword Spotting	fp32; uint8	35076; 3159	75369; 1157407	46/56/116/5/24; 48/51/83/7/0	84.8% (Top 1)	82.5% (Top 1)	MLP
Anomaly Detection	fp32; uint8	3165; 1014	3205128; 7142857	44/58/128/7/0; 46/48/76/2/0	0.830 (AUC)	0.83 (AUC)	AutoEncoder
Image Classification	fp32; uint8	739274; 174992	4324; 22258	62/68/125/9/7; 90/95/144/2/5	84.1% (Top 1)	83.5% (Top 1)	CNN

Where a cell holds two values separated by a semicolon, they follow the order of the Dtype column (fp32; uint8).

Latency is the cycle count reported by the cycle-accurate aiesimulator, read through the AI Engine programming logging API (specifically aie::tile::current().cycles()) and collected from the aiesimulator logs.
Throughput is calculated from the output bandwidth over multiple iterations. It is obtained by running throughput.py on the aiesimulator output files and assumes the AI engine is clocked at 1 GHz.
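
To make the units concrete, the sketch below converts the latency cycle counts from the table into microseconds under the same 1 GHz clock assumption, and computes the throughput a design would reach if it handled strictly one sample at a time. The function names are hypothetical and this is not how throughput.py works internally; it is only arithmetic on the published numbers.

```python
CLOCK_HZ = 1e9  # assumption stated in the text: AI engine clocked at 1 GHz

# Latency in cycles (fp32, uint8), copied from the table above.
LATENCY_CYCLES = {
    "Keyword Spotting": (35076, 3159),
    "Anomaly Detection": (3165, 1014),
    "Image Classification": (739274, 174992),
}


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_hz * 1e6


def serial_throughput(cycles, clock_hz=CLOCK_HZ):
    """Samples/s if each sample had to finish before the next one started."""
    return clock_hz / cycles


if __name__ == "__main__":
    for name, (fp32, uint8) in LATENCY_CYCLES.items():
        print(
            f"{name}: fp32 {cycles_to_us(fp32):.3f} us, "
            f"uint8 {cycles_to_us(uint8):.3f} us, "
            f"fp32/uint8 latency ratio {fp32 / uint8:.2f}x"
        )
```

At 1 GHz one cycle is 1 ns, which is why the latency column can be read as either unit. For fp32 Keyword Spotting, the one-sample-at-a-time figure is about 28.5k samples/s, lower than the reported 75369 samples/s; a likely reading is that consecutive samples overlap in the kernel pipeline, so throughput cannot be derived from latency alone.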

Issues

Below are some issues that may arise when using the pipeline.

Certain operations, input shapes or parameter sizes are not supported

Write an op. Files required:
- design/aie_src/my_op.cc
- design/aie_src/my_op.h
- design/aie_src/graph_my_op.cpp
- design/aie_src/graph_my_op.h
Add data. Files required:
- data/my_op_in.txt
- data/my_op_golden.txt
Test it:
- x86 Graph test: TARGET=sw_emu GRAPH=my_op make clean_reports graph aiesim # X86 GRAPH
- SysC Graph test: TARGET=hw_emu GRAPH=my_op make clean_reports graph aiesim # SYSC GRAPH
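
Before running the tests, it can help to confirm that all six files listed above exist for the new op. The following is a minimal sketch using only the Python standard library; `required_files` and `missing_files` are hypothetical helpers, not part of the repo, and the paths are taken from the lists above relative to an assumed repository root.

```python
from pathlib import Path


def required_files(op_name, root="."):
    """Paths the README lists for a new op, relative to `root`."""
    src = Path(root) / "design" / "aie_src"
    data = Path(root) / "data"
    return [
        src / f"{op_name}.cc",
        src / f"{op_name}.h",
        src / f"graph_{op_name}.cpp",
        src / f"graph_{op_name}.h",
        data / f"{op_name}_in.txt",
        data / f"{op_name}_golden.txt",
    ]


def missing_files(op_name, root="."):
    """Return the required paths that do not exist yet."""
    return [p for p in required_files(op_name, root) if not p.is_file()]


if __name__ == "__main__":
    for path in missing_files("my_op"):
        print(f"missing: {path}")
```

An empty result means the kernel sources, graph sources and data files are all in place, and the sw_emu and hw_emu graph tests can be attempted.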

There are issues generating the ADF graph

See the reference generated graph from the example.
See the documentation for any dimension restrictions on kernels/graphs.
Write the high-level ADF graph for the network yourself.