[Review] XNOR-Nets: ImageNet Classification Using Binary Convolutional Neural Networks



The two models presented:

In Binary-Weight-Networks, the (convolution) filters are approximated with binary values resulting in 32 x memory saving.

In XNOR-Networks, both the filters and the input to convolutional layers are binary. … This results in 58 x faster convolutional operations…


XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time.

Binary Convolutional Neural Networks

For future discussions we use the following mathematical notation for a CNN layer:

\(\mathcal{I}_{l(l=1,...,L)} = \mathbf{I}\in \mathbb{R} ^{c \times w_{\text{in}} \times h_{\text{in}}}\)
\(\mathcal{W}_{lk(k=1,...,K^l)}=\mathbf{W} \in \mathbb{R} ^{c \times w \times h}\)
\(\ast\text{ : convolution}\)
\(\oplus\text{ : convolution without multiplication}\)
\(\otimes \text{ : convolution with XNOR and bitcount}\)
\(\odot \text{ : elementwise multiplication}\)

Convolution with binary weights

In binary convolutional networks, we estimate the convolution filter weight as \(\mathbf{W}\approx\alpha \mathbf{B}\), where \(\alpha\) is a scalar scaling factor and \(\mathbf{B} \in \{+1, -1\} ^{c \times w \times h}\). Hence, we estimate the convolution operation as follows:

\[\mathbf{I} \ast \mathbf{W}\approx (\mathbf{I}\oplus \mathbf{B})\alpha\]

To find an optimal estimation for \(\mathbf{W}\approx\alpha \mathbf{B}\) we solve the following problem:

\[J(\mathbf{B},\alpha)=\Vert \mathbf{W}-\alpha \mathbf{B}\Vert^2\] \[\alpha ^*,\mathbf{B}^* =\underset{\alpha, \mathbf{B}}{\text{argmin}}J(\mathbf{B},\alpha)\]

Going straight to the answer:

\[\alpha^* = \frac{1}{n}\Vert \mathbf{W}\Vert_{l1}\] \[\mathbf{B}^*=\text{sign}(\mathbf{W})\]


The gradients are computed as follows:

\[\frac{\partial \text{sign}}{\partial r}=r \text{1}_{\vert r \vert \le1}\] \[\frac{\partial L}{\partial \mathbf{W}_i}=\frac{\partial L}{\partial \widetilde{\mathbf{W}_i}}\left(\frac{1}{n} + \frac{\partial \text{sign}}{\partial \mathbf{W}_i}\alpha \right)\]

where \(\widetilde{\mathbf{W}}=\alpha \mathbf{B}\), the estimated value of \(\mathbf{W}\).

The gradient values are kepted as real values; they cannot be binarized due to excessive information loss. Optimization is done by either SGD with momentum or ADAM.


Convolutions are a set of dot products between a submatrix of the input and a filter. Thus we attempt to express dot products in terms of binary operations.

Binary Dot Product

For vectors \(\mathbf{X}, \mathbf{W} \in \mathbb{R}^n\) and \(\mathbf{H}, \mathbf{B} \in \{+1,-1\}^n\), we approximate the dot product between \(\mathbf{X}\) and \(\mathbf{W}\) as

\[\mathbf{X}^\top \mathbf{W} \approx \beta \mathbf{H}^\top \alpha \mathbf{B}\]

We solve the following optimization problem:

\[\alpha^*, \mathbf{H}^*, \beta^*, \mathbf{B}^*=\underset{\alpha, \mathbf{H}, \beta, \mathbf{B}}{\text{argmin}} \Vert \mathbf{X} \odot \mathbf{W} - \beta \alpha \mathbf{H} \odot \mathbf{B} \Vert\]

Going straight to the answer:

\[\alpha^* \beta^*=\left(\frac{1}{n}\Vert \mathbf{X} \Vert_{l1}\right)\left(\frac{1}{n}\Vert \mathbf{W} \Vert_{l1}\right)\] \[\mathbf{H}^* \odot \mathbf{B}^*=\text{sign}(\mathbf{X}) \odot \text{sign}(\mathbf{W})\]

Convolution with binary inputs and weights

Calculating \(\alpha^* \beta^*\) for every submatrix in input tensor \(\mathbf{I}\) involves a large number of redundant computations. To overcome this inefficiency we first calculate

\[\mathbf{A}=\frac{\sum{\vert \mathbf{I}_{:,:,i} \vert}}{c}\]

which is an average over absolute values of \(\mathbf{I}\) along its channel. Then, we convolve \(\mathbf{A}\) with a 2D filter \(\mathbf{k} \in \mathbb{R}^{w \times h}\) where \(\forall ij \ \mathbf{k}_{ij}=\frac{1}{w \times h}\):

\[\mathbf{K}=\mathbf{A} \ast \mathbf{k}\]

This \(\mathbf{K}\) acts as a global \(\beta\) spatially across the submatrices. Now we can estimate our convolution with binary inputs and weights as:

\[\mathbf{I} \ast \mathbf{W} \approx (\text{sign}(\mathbf{I}) \otimes \text{sign}(\mathbf{W}) \odot \mathbf{K} \alpha\]


A CNN block in XNOR-Net has the following structure:

[Binary Normalization] - [Binary Activation] - [Binary Convolution] - [Pool]

The BinNorm layer normalizes the input batch by its mean and variance. The BinActiv layer calculates \(\mathbf{K}\) and \(\text{sign}(\mathbf{I})\). We may insert a non-linear activation function between the BinConv layer and the Pool layer.


The paper implemented the AlexNet, the Residual Net, and a GoogLenet variant(Darknet) with binary convolutions. This resulted in a few percent point of accuracy decrease, but overall worked fairly well. Refer to the paper for details.


Binary convolutions were not at all entirely binary; the gradients had to be real values. It would be fascinating if even the gradient is binarizable.

Categories: ,


Leave a comment