# Abstract/Introduction

The paper presents two models:

In Binary-Weight-Networks, the (convolution) filters are approximated with binary values, resulting in a 32× memory saving.

In XNOR-Networks, both the filters and the inputs to convolutional layers are binary. … This results in 58× faster convolutional operations…

Implications:

XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real time.

# Binary Convolutional Neural Networks

For the discussion that follows, we use this mathematical notation for a CNN layer:

$\mathcal{I}_{l(l=1,...,L)} = \mathbf{I}\in \mathbb{R} ^{c \times w_{\text{in}} \times h_{\text{in}}}$
$\mathcal{W}_{lk(k=1,...,K^l)}=\mathbf{W} \in \mathbb{R} ^{c \times w \times h}$
$\ast\text{ : convolution}$
$\oplus\text{ : convolution without multiplication}$
$\otimes \text{ : convolution with XNOR and bitcount}$
$\odot \text{ : elementwise multiplication}$

## Convolution with binary weights

In binary convolutional networks, we estimate the convolution filter weight as $\mathbf{W}\approx\alpha \mathbf{B}$, where $\alpha$ is a scalar scaling factor and $\mathbf{B} \in \{+1, -1\} ^{c \times w \times h}$. Hence, we estimate the convolution operation as follows:

$\mathbf{I} \ast \mathbf{W} \approx (\mathbf{I} \oplus \mathbf{B})\,\alpha$

To find an optimal estimate $\mathbf{W}\approx\alpha \mathbf{B}$ we solve the following problem:

$J(\mathbf{B}, \alpha) = \lVert \mathbf{W} - \alpha\mathbf{B} \rVert^2$
$\alpha^*, \mathbf{B}^* = \underset{\alpha, \mathbf{B}}{\operatorname{argmin}}\ J(\mathbf{B}, \alpha)$

Going straight to the answer:

$\mathbf{B}^* = \operatorname{sign}(\mathbf{W})$
$\alpha^* = \frac{1}{n}\lVert\mathbf{W}\rVert_{\ell 1}, \text{ where } n = c \times w \times h$
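
As a quick sanity check, here is a minimal NumPy sketch of this closed-form solution (the helper name `binarize_weights` is mine, not the paper's):

```python
import numpy as np

def binarize_weights(W):
    """Approximate a real-valued filter W by alpha * B using the
    closed-form solution above: B = sign(W), alpha = mean of |W|."""
    B = np.where(W >= 0, 1.0, -1.0)  # sign(W), mapping 0 to +1
    alpha = np.abs(W).mean()         # alpha* = (1/n) * ||W||_l1
    return alpha, B

# Example: a filter with c=4 channels and a 3x3 spatial extent.
W = np.random.randn(4, 3, 3)
alpha, B = binarize_weights(W)
print("approximation error:", np.linalg.norm(W - alpha * B))
```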

## Training

The gradients are computed as follows:

$\frac{\partial C}{\partial W_i}=\frac{\partial C}{\partial \widetilde{W}_i}\left(\frac{1}{n}+\frac{\partial \operatorname{sign}}{\partial W_i}\,\alpha\right)$

where $\widetilde{\mathbf{W}}=\alpha \mathbf{B}$ is the estimated value of $\mathbf{W}$. Since $\operatorname{sign}$ has zero derivative almost everywhere, its derivative is approximated with a straight-through estimator: the gradient is passed through where $|W_i| \le 1$ and zeroed elsewhere.

The gradient values are kept as real values; they cannot be binarized due to excessive information loss. Optimization is done with either SGD with momentum or Adam.
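
A minimal NumPy sketch of this backward rule, assuming the straight-through estimator for $\operatorname{sign}$ (in practice this lives inside an autograd framework; `grad_W_tilde` is the upstream gradient $\partial C/\partial \widetilde{\mathbf{W}}$):

```python
import numpy as np

def grad_real_weights(W, grad_W_tilde):
    """dC/dW given dC/dW~, following the formula above.

    The 1/n term flows through alpha; the sign() derivative is
    approximated by the straight-through estimator: 1 where
    |W_i| <= 1, and 0 elsewhere.
    """
    n = W.size
    alpha = np.abs(W).mean()
    ste = (np.abs(W) <= 1.0).astype(W.dtype)  # d sign(W_i) / d W_i
    return grad_W_tilde * (1.0 / n + alpha * ste)
```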

# XNOR-Networks

A convolution is a set of dot products between submatrices of the input and a filter. Thus we attempt to express dot products in terms of binary operations.

## Binary Dot Product

For vectors $\mathbf{X}, \mathbf{W} \in \mathbb{R}^n$ and $\mathbf{H}, \mathbf{B} \in \{+1,-1\}^n$, we approximate the dot product between $\mathbf{X}$ and $\mathbf{W}$ as

$\mathbf{X}^{\top}\mathbf{W} \approx \beta\,\mathbf{H}^{\top}\,\alpha\,\mathbf{B}$

We solve the following optimization problem:

$\alpha^*, \mathbf{B}^*, \beta^*, \mathbf{H}^* = \underset{\alpha,\mathbf{B},\beta,\mathbf{H}}{\operatorname{argmin}} \lVert \mathbf{X}\odot\mathbf{W} - \beta\alpha\,\mathbf{H}\odot\mathbf{B} \rVert$

Going straight to the answer:

$\mathbf{H}^* = \operatorname{sign}(\mathbf{X}), \quad \mathbf{B}^* = \operatorname{sign}(\mathbf{W})$
$\beta^* = \frac{1}{n}\lVert\mathbf{X}\rVert_{\ell 1}, \quad \alpha^* = \frac{1}{n}\lVert\mathbf{W}\rVert_{\ell 1}$
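
A small NumPy sketch of this approximation (`binary_dot` is a name I made up). On binary hardware the $\mathbf{H}^{\top}\mathbf{B}$ term would be an XNOR followed by a bitcount; here it is emulated with floating-point $\pm 1$ values:

```python
import numpy as np

def binary_dot(X, W):
    """Approximate X . W as beta * alpha * (H . B)."""
    beta = np.abs(X).mean()            # beta*  = (1/n) ||X||_l1
    alpha = np.abs(W).mean()           # alpha* = (1/n) ||W||_l1
    H = np.where(X >= 0, 1.0, -1.0)    # H* = sign(X)
    B = np.where(W >= 0, 1.0, -1.0)    # B* = sign(W)
    return beta * alpha * np.dot(H, B)

X, W = np.random.randn(256), np.random.randn(256)
print(np.dot(X, W), "vs", binary_dot(X, W))
```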

## Convolution with binary inputs and weights

Calculating $\alpha^* \beta^*$ for every submatrix of the input tensor $\mathbf{I}$ involves a large number of redundant computations, since neighboring submatrices overlap. To overcome this inefficiency we first calculate

$\mathbf{A} = \frac{1}{c}\sum_{i=1}^{c}\lvert\mathbf{I}_{i,:,:}\rvert$

which is the average of the absolute values of $\mathbf{I}$ across its channels. Then, we convolve $\mathbf{A}$ with a 2D filter $\mathbf{k} \in \mathbb{R}^{w \times h}$ where $\forall ij \ \mathbf{k}_{ij}=\frac{1}{w \times h}$:

$\mathbf{K} = \mathbf{A} \ast \mathbf{k}$

Each entry of $\mathbf{K}$ holds the $\beta$ value for the submatrix at that spatial location. Now we can estimate our convolution with binary inputs and weights as:

$\mathbf{I} \ast \mathbf{W} \approx \left(\operatorname{sign}(\mathbf{I}) \otimes \operatorname{sign}(\mathbf{W})\right) \odot \mathbf{K}\,\alpha$
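
Putting the pieces together, here is a naive NumPy sketch of the full estimate for a single filter (names are mine; the binary convolution is emulated with $\pm 1$ floats rather than real XNOR/bitcount instructions):

```python
import numpy as np
from scipy.signal import convolve2d

def xnor_conv2d(I, W):
    """Estimate I * W as (sign(I) (x) sign(W)) . K * alpha.

    I: input of shape (c, w_in, h_in); W: one filter of shape (c, w, h).
    Returns the estimated valid-convolution output.
    """
    c, w, h = W.shape
    alpha = np.abs(W).mean()            # alpha* for this filter
    A = np.abs(I).mean(axis=0)          # channel-wise average of |I|
    k = np.full((w, h), 1.0 / (w * h))  # uniform averaging filter
    K = convolve2d(A, k, mode="valid")  # beta* for every submatrix
    sI = np.where(I >= 0, 1.0, -1.0)    # sign(I)
    sW = np.where(W >= 0, 1.0, -1.0)    # sign(W)
    out = np.empty_like(K)
    for x in range(out.shape[0]):       # emulated binary convolution
        for y in range(out.shape[1]):
            out[x, y] = np.sum(sI[:, x:x + w, y:y + h] * sW)
    return out * K * alpha
```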

## Training

A CNN block in XNOR-Net has the following structure:

[Batch Normalization] - [Binary Activation] - [Binary Convolution] - [Pool]

The BatchNorm layer normalizes the input batch by its mean and variance. The BinActiv layer calculates $\mathbf{K}$ and $\operatorname{sign}(\mathbf{I})$. A non-linear activation function (e.g., ReLU) may be inserted between the BinConv layer and the Pool layer.
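
For concreteness, a simplified PyTorch sketch of this block ordering (my own illustration, not the paper's code; the scaling factors $\alpha$ and $\mathbf{K}$ and the straight-through gradient machinery are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XnorBlock(nn.Module):
    """BatchNorm -> BinActiv -> BinConv -> Pool, per the structure above."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.norm = nn.BatchNorm2d(c_in)
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        x = torch.sign(self.norm(x))          # BinActiv: binarize the input
        w_bin = torch.sign(self.conv.weight)  # binarize the weights
        x = F.conv2d(x, w_bin, padding=1)     # BinConv (emulated with floats)
        return self.pool(x)                   # optionally ReLU before pooling
```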

# Experiments

The paper implements binary convolutions in AlexNet, ResNet, and a GoogLenet variant (Darknet). This costs a few percentage points of accuracy, but overall the binarized networks work fairly well. Refer to the paper for details.

# Discussion

Binary convolutions were not entirely binary; the gradients still had to be real-valued. It would be fascinating if even the gradients could be binarized.
