onnx2versal
Vector implementation of the MxK by KxN matrix product; stores weights and biases in the kernel; requires N % 16 == 0. Qgemm<a,a,1,80,32> takes 227 cycles.
#include <qgemm.h>
Public Member Functions
    Qgemm (TTPARAM (&w)[K*N], int32_t (&b)[N], float x_scale, float w_scale, float y_scale, TT x_zero, TTPARAM w_zero, TT y_zero)
    void filter (input_stream<TT> *in, output_stream<TT> *out)

Static Public Member Functions
    static void registerKernelClass ()
void Qgemm<TT, TTPARAM, M, K, N>::filter (input_stream<TT> *in, output_stream<TT> *out)
Qgemm<28,32,24,32,1,1,6,5>
See https://docs.xilinx.com/r/en-US/ug1079-ai-engine-kernel-coding/MAC-on-8x8-bits: int8 * int8 MACs require x indexing to be a multiple of 4 and z indexing a multiple of 2.
Calling reduce_add separately is slower, since it forfeits instruction-level parallelism.
           z0   z1   z2   z3   z4   z5   z6   z7
acc0  +=   x0   x16  x32  x48  x64  x80  x96  x112
acc1  +=   x1   x17  x33  x49  x65  x81  x97  x113
acc2  +=   x2   x18  ...
...
acc15 +=   x15  x31  ...  x127

i.e. acc[i] += x[i + 16*j] * z[j] for j = 0..7: sixteen accumulators, x read with stride 16, all eight z lanes shared across accumulators.
xoffsets: a 4-bit offset for every four lanes. E.g. offsets 1, 2: the first group starts at 2*4 = 8 and the second at (1+2+1)*4 = 16, selecting lanes 8 9 10 11 and 16 17 18 19. The square executes on a 4x2 matrix.
zoffsets: a 2-bit offset for every lane. E.g. 4 => 4*2 = 8 selects lanes 8, 9; adjacent lane pairs are duplicated (lane0, lane1, lane0, lane1). The square executes on a 2x2 matrix.