[Review] XNOR-Nets: ImageNet Classification Using Binary Convolutional Neural Networks

Resources

Abstract/Introduction

The two models presented:

In Binary-Weight-Networks, the (convolution) filters are approximated with binary values resulting in 32 x memory saving.

In XNOR-Networks, both the filters and the input to convolutional layers are binary. … This results in 58 x faster convolutional operations…

Implications:

XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time.

Binary Convolutional Neural Networks

For future discussions we use the following mathematical notation for a CNN layer:

\(\mathcal{I}_{l(l=1,...,L)} = \mathbf{I}\in \mathbb{R} ^{c \times w_{\text{in}} \times h_{\text{in}}}\)
\(\mathcal{W}_{lk(k=1,...,K^l)}=\mathbf{W} \in \mathbb{R} ^{c \times w \times h}\)
\(\ast\text{ : convolution}\)
\(\oplus\text{ : convolution without multiplication}\)
\(\otimes \text{ : convolution with XNOR and bitcount}\)
\(\odot \text{ : elementwise multiplication}\)

Convolution with binary weights

In binary convolutional networks, we estimate the convolution filter weight as \(\mathbf{W}\approx\alpha \mathbf{B}\), where \(\alpha\) is a scalar scaling factor and \(\mathbf{B} \in \{+1, -1\} ^{c \times w \times h}\). Hence, we estimate the convolution operation as follows:

\[\mathbf{I} \ast \mathbf{W}\approx (\mathbf{I}\oplus \mathbf{B})\alpha\]

To find an optimal estimation for \(\mathbf{W}\approx\alpha \mathbf{B}\) we solve the following problem:

\[J(\mathbf{B},\alpha)=\Vert \mathbf{W}-\alpha \mathbf{B}\Vert^2\] \[\alpha ^*,\mathbf{B}^* =\underset{\alpha, \mathbf{B}}{\text{argmin}}J(\mathbf{B},\alpha)\]

Going straight to the answer:

\[\alpha^* = \frac{1}{n}\Vert \mathbf{W}\Vert_{l1}\] \[\mathbf{B}^*=\text{sign}(\mathbf{W})\]

Training

The gradients are computed as follows:

\[\frac{\partial \text{sign}}{\partial r}=r \text{1}_{\vert r \vert \le1}\] \[\frac{\partial L}{\partial \mathbf{W}_i}=\frac{\partial L}{\partial \widetilde{\mathbf{W}_i}}\left(\frac{1}{n} + \frac{\partial \text{sign}}{\partial \mathbf{W}_i}\alpha \right)\]

where \(\widetilde{\mathbf{W}}=\alpha \mathbf{B}\), the estimated value of \(\mathbf{W}\).

The gradient values are kepted as real values; they cannot be binarized due to excessive information loss. Optimization is done by either SGD with momentum or ADAM.

XNOR-Networks

Convolutions are a set of dot products between a submatrix of the input and a filter. Thus we attempt to express dot products in terms of binary operations.

Binary Dot Product

For vectors \(\mathbf{X}, \mathbf{W} \in \mathbb{R}^n\) and \(\mathbf{H}, \mathbf{B} \in \{+1,-1\}^n\), we approximate the dot product between \(\mathbf{X}\) and \(\mathbf{W}\) as

\[\mathbf{X}^\top \mathbf{W} \approx \beta \mathbf{H}^\top \alpha \mathbf{B}\]

We solve the following optimization problem:

\[\alpha^*, \mathbf{H}^*, \beta^*, \mathbf{B}^*=\underset{\alpha, \mathbf{H}, \beta, \mathbf{B}}{\text{argmin}} \Vert \mathbf{X} \odot \mathbf{W} - \beta \alpha \mathbf{H} \odot \mathbf{B} \Vert\]

Going straight to the answer:

\[\alpha^* \beta^*=\left(\frac{1}{n}\Vert \mathbf{X} \Vert_{l1}\right)\left(\frac{1}{n}\Vert \mathbf{W} \Vert_{l1}\right)\] \[\mathbf{H}^* \odot \mathbf{B}^*=\text{sign}(\mathbf{X}) \odot \text{sign}(\mathbf{W})\]

Convolution with binary inputs and weights

Calculating \(\alpha^* \beta^*\) for every submatrix in input tensor \(\mathbf{I}\) involves a large number of redundant computations. To overcome this inefficiency we first calculate

\[\mathbf{A}=\frac{\sum{\vert \mathbf{I}_{:,:,i} \vert}}{c}\]

which is an average over absolute values of \(\mathbf{I}\) along its channel. Then, we convolve \(\mathbf{A}\) with a 2D filter \(\mathbf{k} \in \mathbb{R}^{w \times h}\) where \(\forall ij \ \mathbf{k}_{ij}=\frac{1}{w \times h}\):

\[\mathbf{K}=\mathbf{A} \ast \mathbf{k}\]

This \(\mathbf{K}\) acts as a global \(\beta\) spatially across the submatrices. Now we can estimate our convolution with binary inputs and weights as:

\[\mathbf{I} \ast \mathbf{W} \approx (\text{sign}(\mathbf{I}) \otimes \text{sign}(\mathbf{W}) \odot \mathbf{K} \alpha\]

Training

A CNN block in XNOR-Net has the following structure:

[Binary Normalization] - [Binary Activation] - [Binary Convolution] - [Pool]

The BinNorm layer normalizes the input batch by its mean and variance. The BinActiv layer calculates \(\mathbf{K}\) and \(\text{sign}(\mathbf{I})\). We may insert a non-linear activation function between the BinConv layer and the Pool layer.

Experiments

The paper implemented the AlexNet, the Residual Net, and a GoogLenet variant(Darknet) with binary convolutions. This resulted in a few percent point of accuracy decrease, but overall worked fairly well. Refer to the paper for details.

Discussion

Binary convolutions were not at all entirely binary; the gradients had to be real values. It would be fascinating if even the gradient is binarizable.

Share on

Twitter Facebook Google+ LinkedIn

Jae-Won Chung

[Review] XNOR-Nets: ImageNet Classification Using Binary Convolutional Neural Networks

Resources

Abstract/Introduction

Binary Convolutional Neural Networks

Convolution with binary weights

Training

XNOR-Networks

Binary Dot Product

Convolution with binary inputs and weights

Training

Experiments

Discussion

Share on

You may also enjoy

Continuously Monomorphized Systems

Prospective PhD Students, Intern Abroad!

Vibe Coding is Speculative Decoding

Useful Matplotlib Tricks for Research