onnx2versal
Vector implementation of the MxK by KxN matrix product; stores weights and biases in the kernel; requires N % 16 == 0. Qgemm<a,a,1,80,32> takes 227 cycles.
#include <qgemm.h>
Public Member Functions
    Qgemm (TTPARAM (&w)[K*N], int32_t (&b)[N], float x_scale, float w_scale, float y_scale, TT x_zero, TTPARAM w_zero, TT y_zero)
    void filter (input_stream<TT> *in, output_stream<TT> *out)

Static Public Member Functions
    static void registerKernelClass ()
void Qgemm<TT, TTPARAM, M, K, N>::filter (input_stream<TT> *in, output_stream<TT> *out)
Qgemm<28,32,24,32,1,1,6,5>
See https://docs.xilinx.com/r/en-US/ug1079-ai-engine-kernel-coding/MAC-on-8x8-bits: int8 * int8 MACs require x indexing to be a multiple of 4 and z indexing a multiple of 2.
Calling reduce_add separately is slower, since it forfeits instruction-level parallelism.
           z0   z1   z2   z3   z4   z5   z6   z7
acc0  +=   x0   x16  x32  x48  x64  x80  x96  x112
acc1  +=   x1   x17  x33  x49  x65  x81  x97  x113
acc2  +=   x2   x18  ...
...
acc15 +=   x15  x31  ...  x127

i.e. acc[i] += x[i + 16*j] * z[j] for j = 0..7: sixteen accumulators, x read with stride 16, all eight z lanes shared across accumulators.
xoffsets: a 4-bit offset for every four lanes. E.g. offsets 1, 2: the first group starts at 2*4 = 8 and the second at (1+2+1)*4 = 16, selecting lanes 8 9 10 11 and 16 17 18 19. The square executes on a 4x2 matrix.
zoffsets: a 2-bit offset for every lane. E.g. 4 => 4*2 = 8 selects lanes 8, 9; adjacent lane pairs are duplicated (lane0, lane1, lane0, lane1). The square executes on a 2x2 matrix.