See https://github.com/onnx/onnx/blob/main/docs/Operators.md#QLinearConv.
- qy = saturate((y / qy_scale) + qy_zero)
- Bias must be quantized using scale = qx_scale * qw_scale and zero_point = 0
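The bias rule above can be sketched as follows (hypothetical helper name; assumes the quantized bias is stored as int32, as in ONNX QLinearConv):

```python
def quantize_bias(bias, qx_scale, qw_scale):
    # Bias scale is fixed to qx_scale * qw_scale, zero_point is 0.
    q = round(bias / (qx_scale * qw_scale))
    # Saturate to the int32 range used for the quantized bias.
    return max(-2**31, min(2**31 - 1, q))
```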
Computation
- x = (qx - qx_zero) * qx_scale
- bias = qbias * qx_scale * qw_scale
- y = x*w + bias =>
- (qy - qy_zero)*qy_scale = (qx - qx_zero)*qx_scale * (qw - qw_zero)*qw_scale + qbias*qx_scale*qw_scale
-                         = [(qx - qx_zero)*(qw - qw_zero) + qbias] * qx_scale*qw_scale
- qy = qy_zero + [(qx-qx_zero)*(qw-qw_zero) + qbias] * qx_scale*qw_scale/qy_scale
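The final formula can be checked numerically with a float reference (hypothetical names; `acc` stands for the bracketed sum over the receptive field):

```python
def requantize(acc, qbias, qx_scale, qw_scale, qy_scale, qy_zero):
    # acc = sum over the receptive field of (qx - qx_zero) * (qw - qw_zero)
    y_real = (acc + qbias) * qx_scale * qw_scale / qy_scale
    # Round, add the output zero point, then saturate to int8.
    qy = qy_zero + round(y_real)
    return max(-128, min(127, qy))
```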
Implementation
- only the term -qx_zero*(qw - qw_zero) is precomputed (folded into the bias); rounding is done before adding qy_zero
- int32 bias: holds qbias plus the folded -qx_zero*(qw - qw_zero) term; a sum of k int8*int8 products can exceed 16 bits, so 32 bits are needed
- int8 shifted qy_zero: the qy_zero shift is added directly into the accumulator
- int16 scale: the combined scale qx_scale*qw_scale/qy_scale is applied as an int16 fixed-point multiplier; the result is saturated to 8 bits
- for kernels that allow GROUP != 1, each of the M kernels of shape (1, C_PER_M, K, K) is applied to an input slice of shape (1, C_PER_M, H, W)
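Putting the Implementation bullets together, a minimal sketch of one output point (hypothetical names; `scale_q15` is an assumed Q15 fixed-point encoding of qx_scale*qw_scale/qy_scale, and the patch/kernel are flattened lists):

```python
def conv_point_int(qx_patch, qw_kernel, qbias, qx_zero, qw_zero,
                   scale_q15, qy_zero):
    # Runtime int32 accumulation of qx * (qw - qw_zero); the remaining
    # cross term is handled by the precomputed correction below.
    acc = sum(x * (w - qw_zero) for x, w in zip(qx_patch, qw_kernel))
    # Precomputable per output channel: -qx_zero * sum(qw - qw_zero),
    # folded into the int32 bias at init time.
    acc += -qx_zero * sum(w - qw_zero for w in qw_kernel) + qbias
    # Fixed-point rescale by the combined scale, round to nearest,
    # then add the output zero point.
    qy = ((acc * scale_q15 + (1 << 14)) >> 15) + qy_zero
    # Saturate to int8.
    return max(-128, min(127, qy))
```

The rounding constant `1 << 14` implements round-to-nearest for the arithmetic right shift by 15, matching "rounding is done before adding qy_zero".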