<p>Jaewon’s Blog, by Jaewon Chung</p>
<p>Halide: a language and compiler for image processing and deep learning. Published 2020-04-15. https://jaewonchung.me/study/code-generation/Halide</p>
<h1 id="halide">Halide</h1>
<h2 id="resources">Resources</h2>
<ul>
<li><a href="https://halide-lang.org/">https://halide-lang.org</a></li>
<li><a href="https://github.com/halide/Halide">https://github.com/halide/Halide</a></li>
<li>Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines (PLDI ’13)</li>
<li>Automatically Scheduling Halide Image Processing Pipelines (SIGGRAPH ’16)</li>
<li>Loop Transformations Leveraging Hardware Prefetching (CGO ’18)</li>
<li>Differentiable Programming for Image Processing and Deep Learning in Halide (SIGGRAPH ’18)</li>
<li>Schedule Synthesis for Halide Pipelines through Reuse Analysis (TACO ’19)</li>
<li>Learning to Optimize Halide with Tree Search and Random Programs (SIGGRAPH ’19)</li>
</ul>
<h2 id="paper-summary">Paper Summary</h2>
<h3 id="halide-a-language-and-compiler-for-optimizing-parallelism-locality-and-recomputation-in-image-processing-pipelines"><strong>Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines</strong></h3>
<ul>
<li><strong>Motivation.</strong> Image processing pipelines are often graphs of different stencil computations with low arithmetic intensity and inherent data parallelism. This introduces complex tradeoffs among locality, parallelism, and recomputation. As a result, hand-crafted code, even when produced with tedious effort, is often neither portable nor optimal.</li>
<li><strong>Solution.</strong> Halide decouples the <em>algorithm</em> (what is computed?) and the <em>schedule</em> (when and where?). From each schedule, the compiler produces parallel vector code and measures its runtime. It then searches the tradeoff space for the best schedule using stochastic search based on a genetic algorithm.</li>
<li><strong>Results.</strong> Generated code is an order of magnitude faster than its hand-crafted counterparts. Automatic scheduling, however, is quite slow and lacks robustness.</li>
<li>
<p><strong>Detail.</strong> Two-stage decision for <em>determining the schedule</em> of <em>each function</em>:</p>
<ul>
<li>Domain Order: the order in which the required region is traversed
<ul>
<li>sequential/parallel, unrolled/vectorized, dimension reorder, dimension split</li>
</ul>
</li>
<li>Call Schedule: when to compute its inputs; the granularity of store and computation
<ul>
<li>breadth-first/total fusion/sliding window</li>
</ul>
</li>
</ul>
</li>
<li><strong>Detail.</strong> <em>Compile steps</em> (all decisions directed by the schedule):
<ul>
<li>Lowering and Loop Synthesis: create nested loops of the entire process, insert allocations and callee computations at specified locations in the loop</li>
<li>Bounds Inference: from the output size, the bounds of each dimension are determined</li>
<li>Sliding Window Optimization and Storage Folding: look for specific conditions and apply</li>
<li>Flattening: flatten multi-dimensional addressing and allocation</li>
<li>Vectorization and Unrolling</li>
<li>Back-end Code Generation - only note GPU:
<ul>
<li>outer loop → inner loops divided into GPU kernel launches</li>
<li>inner loops are annotated in the schedule with block and thread dimensions</li>
</ul>
</li>
</ul>
</li>
<li><strong>Detail.</strong> Stochastic <em>search</em> based on genetic algorithm
<ul>
<li>Hint hand-crafted optimization styles through mutation rules. These include mutating one or more function schedules to a well-known template.</li>
</ul>
</li>
<li><strong>Thoughts.</strong>
<ul>
<li>The increase in performance is natural, since Halide invests a lot of time in optimization. The real contribution seems to be that Halide formulated the axes of optimization and exposed an easy handle that helps users search the space.</li>
<li>Generated CUDA kernels don’t seem to use CUDA streams or asynchronous copies.</li>
<li>Requires block and thread annotations provided by the programmer.</li>
<li>Without the hand-crafted mutation, I suspect that performance will greatly suffer.</li>
<li>Schedule search could be learned. Monte Carlo tree search maybe? RL will work too, as in NAS.</li>
</ul>
</li>
</ul>
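<p>The three call schedules listed above (breadth-first, total fusion, sliding window) can be illustrated with a toy two-stage pipeline. This is a plain Python sketch, not Halide code: the function names <code class="language-plaintext highlighter-rouge">producer</code>, <code class="language-plaintext highlighter-rouge">breadth_first</code>, <code class="language-plaintext highlighter-rouge">total_fusion</code>, and <code class="language-plaintext highlighter-rouge">sliding_window</code> are made up for illustration, and the pipeline (two chained 3-tap sums over a 1D signal) is an assumed example. All three variants compute the same output; they differ only in how many times the first stage is evaluated and how much of it is stored.</p>

```python
def producer(inp, i):
    """First stage: 3-tap sum centered at position i."""
    return inp[i - 1] + inp[i] + inp[i + 1]


def breadth_first(inp):
    """Compute and store the whole first stage, then run the second stage."""
    n = len(inp)
    stage1 = {i: producer(inp, i) for i in range(1, n - 1)}  # full intermediate buffer
    out = [stage1[i - 1] + stage1[i] + stage1[i + 1] for i in range(2, n - 2)]
    return out, len(stage1)


def total_fusion(inp):
    """Recompute first-stage values inside the consumer loop; nothing is stored."""
    n = len(inp)
    evaluations = 0
    out = []
    for i in range(2, n - 2):
        acc = 0
        for k in (i - 1, i, i + 1):
            acc += producer(inp, k)  # neighbors are recomputed for every output
            evaluations += 1
        out.append(acc)
    return out, evaluations


def sliding_window(inp):
    """Keep only the last three first-stage values; each is computed once."""
    n = len(inp)
    window = [producer(inp, 1), producer(inp, 2)]
    evaluations = 2
    out = []
    for i in range(2, n - 2):
        window.append(producer(inp, i + 1))
        evaluations += 1
        out.append(sum(window))
        window.pop(0)  # the oldest value is no longer needed
    return out, evaluations


signal = list(range(10))
for variant in (breadth_first, total_fusion, sliding_window):
    result, producer_calls = variant(signal)
    print(variant.__name__, result, producer_calls)
```

<p>On this input, breadth-first and sliding window each evaluate the first stage 8 times, while total fusion evaluates it 18 times. Sliding window gets the low evaluation count with a three-element buffer, but only because the consumer loop runs sequentially, which is the locality versus parallelism versus recomputation tradeoff the paper describes.</p>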
<h3 id="differentiable-programming-for-image-processing-and-deep-learning-in-halide">Differentiable Programming for Image Processing and Deep Learning in Halide</h3>
<ul>
<li><strong>Motivation.</strong> Existing deep learning libraries are inefficient in terms of computation and memory. Also, in order to implement custom operations, the user must manually provide both the forward and backward CUDA kernels.</li>
<li><strong>Solution.</strong> Extend Halide with automatic differentiation (<code class="language-plaintext highlighter-rouge">propagate_adjoints</code>).</li>
<li><strong>Results.</strong> GPU tensor operations 0.8x, 2.7x, and 20x faster than PyTorch, measured with batch size 4.</li>
<li><strong>Detail.</strong> Two special cases of note when <em>creating backward operations</em>:
<ul>
<li>Scatter-gather Conversion: When the forward of a function is a <em>gather</em> operation, its backward is a <em>scatter</em>, e.g. convolutions. This leads to race conditions when parallelized. Thus, the scatter operation is converted to a gather operation.</li>
<li>Handling Partial Updates: When a function is partially updated, dependency is removed for some indices. If two consecutive function updates have different update arguments, the former’s gradient is masked to zero using the update argument of the latter.</li>
</ul>
</li>
<li><strong>Detail.</strong> <em>Checkpointing</em> is already supported, in a more fine-grained manner, through schedules: <code class="language-plaintext highlighter-rouge">compute_root</code> for checkpointing, <code class="language-plaintext highlighter-rouge">compute_inline</code> for recomputation, and <code class="language-plaintext highlighter-rouge">compute_at</code> for something in between, e.g. tiling.</li>
<li>
<p><strong>Detail.</strong> <em>Automatic scheduling</em> (only note GPU, ordered by high priority)</p>
<ol>
<li>Always checkpoint scatter/reduce operations, tile their first two dimensions, and parallelize computation over the tiles. Other types of operations are not checkpointed at all.</li>
<li>Apply <code class="language-plaintext highlighter-rouge">rfactor</code> for large associative reductions with domains too small to tile.</li>
<li>If parallelization unavoidably leads to race conditions, parallelize anyway and use atomic operations.</li>
</ol>
</li>
<li><strong>Thoughts.</strong>
<ul>
<li>Again, automatic scheduling could be better. The scheduler in this work is filled with hand-crafted heuristics.</li>
<li>The paper doesn’t talk about the time needed for automatic scheduling. Probably it took pretty long. Then we can’t use this for deep learning research; training just a single hyperparameter configuration is already burdensome. Deployment has some hope though.</li>
<li>The ‘deep learning operations’ this paper conducted experiments on (grid_sample, affine_grid, optical flow warp, and bilateral slicing) are relatively uncommon compared with matrix multiplication or convolution. This aligns with their claim that Halide is advantageous when you have to <em>implement custom operations</em>.</li>
</ul>
</li>
</ul>
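<p>The scatter-gather conversion mentioned above can be sketched on a 1D convolution. This is an illustrative Python sketch under assumed conventions (a "valid" convolution without padding, hypothetical function names), not the code Halide generates. The scatter form lets different iterations write to the same gradient slot, which is what causes races when the loop is parallelized; the gather form gives each gradient element exactly one writer.</p>

```python
def conv_forward(inp, w):
    """Forward pass is a gather: out[i] reads inp[i], inp[i + 1], ..."""
    n_out = len(inp) - len(w) + 1
    return [sum(w[k] * inp[i + k] for k in range(len(w))) for i in range(n_out)]


def backward_scatter(d_out, w, n_in):
    """Direct transpose of the forward pass: each d_out[i] scatters into d_inp."""
    d_inp = [0.0] * n_in
    for i in range(len(d_out)):
        for k in range(len(w)):
            d_inp[i + k] += w[k] * d_out[i]  # several i can hit the same slot
    return d_inp


def backward_gather(d_out, w, n_in):
    """Same gradient, rewritten so each d_inp[j] is produced by a single iteration."""
    d_inp = []
    for j in range(n_in):
        total = 0.0
        for k in range(len(w)):
            i = j - k
            if 0 <= i < len(d_out):  # skip contributions that fall outside d_out
                total += w[k] * d_out[i]
        d_inp.append(total)
    return d_inp


weights = [1.0, 2.0, 3.0]
upstream = [1.0, 10.0, 100.0]
print(backward_scatter(upstream, weights, 5))
print(backward_gather(upstream, weights, 5))
```

<p>Both functions return the same gradient, so the outer loop over <code class="language-plaintext highlighter-rouge">j</code> in the gather version can be parallelized without atomics.</p>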
<h3 id="learning-to-optimize-halide-with-tree-search-and-random-programs">Learning to Optimize Halide with Tree Search and Random Programs</h3>
<ul>
<li><strong>Motivation.</strong> Existing autoschedulers are limited because 1) their search space is small, 2) their search procedures are coupled with the schedule type, and 3) their cost models are inaccurate and hand-crafted.</li>
<li><strong>Solution.</strong> Use 1) a new parametrization of the schedule space, 2) beam search, and 3) a learned cost model trained on randomly generated programs.</li>
<li><strong>Results.</strong> Deep learning benchmarks on GPU were not reported at all! Those on CPU with image size 1 x 3 x 2560 x 1920 are claimed to outperform TF and PT and be competitive with MXNet + MKL, but the paper mentions no concrete numbers.</li>
<li>
<p><strong>Detail.</strong> <em>Parameters</em> of the schedule (emphasized). Beginning from the <em>final</em> stage, make two decisions per stage to build a complete schedule:</p>
<ol>
<li><em>Compute and storage granularity</em> of new stage. An existing stage can be split, creating an extra level of tiling. <em>Tile sizes</em> are also parameters that should be determined.</li>
<li>For the newly added stage, we may parallelize outer tilings and/or vectorize inner tilings and <em>annotate</em>.</li>
</ol>
</li>
<li><strong>Detail.</strong> <em>Beam search</em> with pruning (just kill schedules that fail hand-crafted asserts). Run multiple passes that gradually select good schedules from coarse to fine.</li>
<li>
<p><strong>Detail.</strong> <em>Predicting runtime</em>, which beam search minimizes, with a neural network.</p>
<ol>
<li>Schedule to feature: algorithm-specific + schedule-specific</li>
<li>Runtime prediction: design 27 runtime-related terms and have a small model predict the coefficients of each term, use L2 loss between predicted and target <em>throughput</em></li>
<li>Training data generation: use the system itself, iterate between training the model and generating data with the system</li>
</ol>
</li>
<li>
<p><strong>Detail.</strong> Given more time, <em>benchmark</em> several candidates (instead of predicting runtime) and select best. Given even more time, fine-tune the neural network on the benchmark results and repeat beam search (<em>autotuning</em>).</p>
</li>
<li><strong>Thoughts.</strong>
<ul>
<li>A loop nest is a graph. Can we use graph embedding and pooling on schedules to predict runtime?</li>
<li>No comparisons with deep learning frameworks on GPUs. Maybe I have to check this myself.</li>
<li>This paper seems just to incorporate tremendous amounts of manual hand-crafted optimizations and tedious engineering. I cannot find any core novel ideas in this paper; I don’t think there’s anything new.</li>
</ul>
</li>
</ul>
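<p>The interplay of beam search and a cost model can be sketched generically. The Python below is a toy under loud assumptions: a schedule is reduced to one tile size per stage, <code class="language-plaintext highlighter-rouge">toy_cost</code> is an invented stand-in for the learned runtime predictor, and the paper's pruning asserts and coarse-to-fine passes are omitted. It only shows the mechanism of extending partial schedules one stage at a time and keeping the cheapest few.</p>

```python
def beam_search(num_stages, options, cost, beam_width):
    """Build a schedule stage by stage, keeping the beam_width cheapest partial schedules."""
    beam = [()]  # each partial schedule is a tuple of per-stage choices
    for _ in range(num_stages):
        candidates = [partial + (opt,) for partial in beam for opt in options]
        candidates.sort(key=cost)  # the cost model ranks partial schedules
        beam = candidates[:beam_width]
    return beam[0]


def toy_cost(schedule):
    """Hypothetical cost: prefers tile size 32 and penalizes mismatched neighbors."""
    cost = sum(abs(tile - 32) for tile in schedule)
    cost += sum(8 for a, b in zip(schedule, schedule[1:]) if a != b)
    return cost


best = beam_search(num_stages=3, options=[8, 16, 32, 64], cost=toy_cost, beam_width=2)
print(best)
```

<p>Swapping <code class="language-plaintext highlighter-rouge">toy_cost</code> for actual benchmarking of the top candidates corresponds to the "given more time" mode described above.</p>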
<h2 id="code-peek">Code Peek</h2>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include "Halide.h" // all of Halide
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="c1">// Symbolic definition of the algorithm 'index_sum'.</span>
    <span class="n">Halide</span><span class="o">::</span><span class="n">Var</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">;</span> <span class="c1">// think of these as for loop iterators</span>
    <span class="n">Halide</span><span class="o">::</span><span class="n">Func</span> <span class="n">index_sum</span><span class="p">;</span> <span class="c1">// each Func represents one pipeline stage</span>
    <span class="n">index_sum</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="p">;</span> <span class="c1">// operation defined in an arbitrary point</span>
    <span class="c1">// Manually schedule our algorithm.</span>
    <span class="n">Halide</span><span class="o">::</span><span class="n">Var</span> <span class="n">x_outer</span><span class="p">,</span> <span class="n">x_inner</span><span class="p">,</span> <span class="n">y_outer</span><span class="p">,</span> <span class="n">y_inner</span><span class="p">,</span> <span class="c1">// divide loop into tiles</span>
                <span class="n">tile_index</span><span class="p">,</span> <span class="c1">// fuse and parallelize</span>
                <span class="n">x_inner_outer</span><span class="p">,</span> <span class="n">y_inner_outer</span><span class="p">,</span> <span class="c1">// tile each tile again</span>
                <span class="n">x_vectors</span><span class="p">,</span> <span class="n">y_pairs</span><span class="p">;</span> <span class="c1">// vectorize and unroll</span>
    <span class="n">index_sum</span>
        <span class="c1">// tile with size (64, 64)</span>
        <span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x_outer</span><span class="p">,</span> <span class="n">x_inner</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span>
        <span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y_outer</span><span class="p">,</span> <span class="n">y_inner</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span>
        <span class="p">.</span><span class="n">reorder</span><span class="p">(</span><span class="n">x_inner</span><span class="p">,</span> <span class="n">y_inner</span><span class="p">,</span> <span class="n">x_outer</span><span class="p">,</span> <span class="n">y_outer</span><span class="p">)</span>
        <span class="c1">// fuse the two outer loops and parallelize</span>
        <span class="p">.</span><span class="n">fuse</span><span class="p">(</span><span class="n">x_outer</span><span class="p">,</span> <span class="n">y_outer</span><span class="p">,</span> <span class="n">tile_index</span><span class="p">)</span>
        <span class="p">.</span><span class="n">parallel</span><span class="p">(</span><span class="n">tile_index</span><span class="p">)</span>
        <span class="c1">// tile with size (4, 2), use shorthand this time!</span>
        <span class="p">.</span><span class="n">tile</span><span class="p">(</span><span class="n">x_inner</span><span class="p">,</span> <span class="n">y_inner</span><span class="p">,</span> <span class="n">x_inner_outer</span><span class="p">,</span> <span class="n">y_inner_outer</span><span class="p">,</span> <span class="n">x_vectors</span><span class="p">,</span> <span class="n">y_pairs</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
        <span class="c1">// vectorize over x_vectors (vector length is 4)</span>
        <span class="p">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">x_vectors</span><span class="p">)</span>
        <span class="c1">// unroll loop over y_pairs (2 duplications)</span>
        <span class="p">.</span><span class="n">unroll</span><span class="p">(</span><span class="n">y_pairs</span><span class="p">);</span>
    <span class="c1">// Run the algorithm. Loop bounds are automatically inferred by Halide!</span>
    <span class="n">Halide</span><span class="o">::</span><span class="n">Buffer</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">result</span> <span class="o">=</span> <span class="n">index_sum</span><span class="p">.</span><span class="n">realize</span><span class="p">(</span><span class="mi">350</span><span class="p">,</span> <span class="mi">250</span><span class="p">);</span>
    <span class="c1">// Print nested loop in pseudo-code.</span>
    <span class="n">index_sum</span><span class="p">.</span><span class="n">print_loop_nest</span><span class="p">();</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ g++ peek.cpp -g -I ../include -L ../bin -lHalide -lpthread -ldl -o peek -std=c++11
$ LD_LIBRARY_PATH=../bin ./peek
produce index_sum:
  parallel x.x_outer.tile_index:
    for y.y_inner.y_inner_outer:
      for x.x_inner.x_inner_outer:
        unrolled y.y_inner.y_pairs in [0, 1]:
          vectorized x.x_inner.x_vectors in [0, 3]:
            index_sum(...) = ...
</code></pre></div></div>
<p>Summary of the Halide post (Jaewon Chung): Introduction to Halide and review of several related papers. Halide aims to generate efficient domain-specific languages automatically from user-defined algorithms.</p>
<p>The autoencoder family. Published 2019-01-31. https://jaewonchung.me/study/machine-learning/Autoencoders</p>
<p>Vanilla autoencoders (AE), denoising autoencoders (DAE), variational autoencoders (VAE), and conditional variational autoencoders (CVAE) are explained in this post. Referring to the <a href="https://jaywonchung.github.io/study/machine-learning/MLE-and-ML/">previous post</a> on Bayesian statistics may help your understanding.</p>
<h1 id="autoencoders-ae">Autoencoders (AE)</h1>
<h2 id="structure">Structure</h2>
<p><img src="/assets/images/posts/2019-01-31-AE.png" alt="Autoencoders" /></p>
<p>As seen in the above structure, autoencoders have the same input and output size. Ultimately, we want the output to be the same as the input. We penalize the difference of the input <script type="math/tex">x</script> and the output <script type="math/tex">y</script>.</p>
<p>We can formulate the simplest autoencoder (with a single fully connected layer at each side) as:</p>
<script type="math/tex; mode=display">x, y \in [0,1]^d</script>
<script type="math/tex; mode=display">z = h_\theta(x) = \text{sigmoid}(Wx+b) ~~~ (\theta = \{W, b\})</script>
<script type="math/tex; mode=display">y = g_{\theta^\prime}(z) = \text{sigmoid}(W^\prime z+b^\prime) ~~~ (\theta^\prime = \{W^\prime, b^\prime\})</script>
<p>Since we want <script type="math/tex">x=y</script>, we get the following optimization problem:</p>
<script type="math/tex; mode=display">\theta^*, \theta^{\prime *} = \underset{\theta, \theta^\prime}{\text{argmin}} \frac{1}{N} \sum_{i=1}^N l(x^{(i)}, y^{(i)})</script>
<p>Here <script type="math/tex">l(x,y)</script> is the loss function, which measures the difference between <script type="math/tex">x</script> and <script type="math/tex">y</script>. We can use squared error or cross-entropy, which are written as:</p>
<script type="math/tex; mode=display">l(x, y) = \Vert x-y \Vert^2</script>
<script type="math/tex; mode=display">l(x, y) = - \sum_{k=1}^d [x_k \log(y_k) + (1-x_k)\log(1-y_k)]</script>
<p>We will use cross-entropy error, which we will specially denote as <script type="math/tex">l(x, y) = L_H(x, y)</script>.</p>
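<p>As a concrete reading of the formulas above, here is a minimal pure-Python sketch of the single-layer autoencoder and both loss functions. The helper names, the layer sizes (4 inputs, 2 hidden units), and the random initialization are assumptions for illustration; a real implementation would use a tensor library and train the weights.</p>

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine_sigmoid(W, b, x):
    """Compute sigmoid(Wx + b), with W given as a list of rows."""
    return [sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]


def squared_error(x, y):
    return sum((x_k - y_k) ** 2 for x_k, y_k in zip(x, y))


def cross_entropy(x, y):
    """L_H(x, y); assumes every y_k lies strictly between 0 and 1."""
    return -sum(x_k * math.log(y_k) + (1 - x_k) * math.log(1 - y_k)
                for x_k, y_k in zip(x, y))


def autoencoder(x, W, b, W_prime, b_prime):
    z = affine_sigmoid(W, b, x)              # encoder h_theta
    y = affine_sigmoid(W_prime, b_prime, z)  # decoder g_theta'
    return z, y


random.seed(0)
d, hidden = 4, 2  # assumed sizes for the example
W = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(hidden)]
b = [0.0] * hidden
W_prime = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(d)]
b_prime = [0.0] * d

x = [0.0, 1.0, 1.0, 0.0]
z, y = autoencoder(x, W, b, W_prime, b_prime)
print("squared error:", squared_error(x, y))
print("cross-entropy:", cross_entropy(x, y))
```

<p>Because the decoder ends in a sigmoid, every output lies strictly inside (0, 1), which keeps the logarithms in the cross-entropy finite.</p>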
<h2 id="statistical-viewpoint">Statistical viewpoint</h2>
<p>We can view this loss function in terms of expectation:</p>
<script type="math/tex; mode=display">\theta^*, \theta^{\prime *} = \underset{\theta, \theta^\prime}{\text{argmin}} \mathbb{E}_{q^0(X)}[L_H(X, g_{\theta^\prime}(h_\theta(X)))]</script>
<p>where <script type="math/tex">q^0(X)</script> denotes the empirical distribution associated with our <script type="math/tex">N</script> training examples.</p>
<h1 id="denoising-autoencoders-dae">Denoising Autoencoders (DAE)</h1>
<h2 id="structure-1">Structure</h2>
<p><img src="/assets/images/posts/2019-01-31-DAE.png" alt="Denoising Autoencoders" /></p>
<p>With the same encoder and decoder formulas, denoising autoencoders intentionally set a specific portion of the pixels of the input <script type="math/tex">x</script> to zero, creating <script type="math/tex">\tilde{x}</script>. Formally, we are sampling <script type="math/tex">\tilde{x}</script> from a stochastic mapping <script type="math/tex">q_D(\tilde{x}\vert x)</script>. We then compute the loss between the original <script type="math/tex">x</script> and the output <script type="math/tex">y</script>.</p>
<p>In formulating our objective function, we cannot use that of the vanilla autoencoder since now <script type="math/tex">g_{\theta^\prime}(h_\theta(\tilde{x}))</script> is a deterministic function of <script type="math/tex">\tilde{x}</script>, not <script type="math/tex">x</script>. Thus we need to take into account the connection between <script type="math/tex">\tilde{x}</script> and <script type="math/tex">x</script>, which is <script type="math/tex">q_D(\tilde{x}\vert x)</script>. Then we can write our optimization problem and expand it as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\theta^*,\theta^{\prime *}
&= \underset{\theta, \theta^\prime}{\text{argmin}} \mathbb{E}_{q^0(X, \tilde{X})}[L_H(X, g_{\theta^\prime}(h_\theta(\tilde{X})))]\\
&= \underset{\theta, \theta^\prime}{\text{argmin}} \frac{1}{N} \sum_{x\in D} \mathbb{E}_{q_D(\tilde{x}\vert x)}[L_H(x, g_{\theta^\prime}(h_\theta(\tilde{x})))]\\
&\approx \underset{\theta, \theta^\prime}{\text{argmin}}\frac{1}{N} \sum_{x\in D} \frac{1}{L} \sum_{i=1}^L L_H(x, g_{\theta^\prime}(h_\theta(\tilde{x}_i)))
\end{aligned} %]]></script>
<p>where <script type="math/tex">q^0(X, \tilde{X}) = q^0(X)q_D(\tilde{X}\vert X)</script>. Since we cannot compute the expectation in the second line, we approximate it with the Monte Carlo technique by drawing <script type="math/tex">L</script> samples and computing their mean loss.</p>
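<p>The corruption step and the Monte Carlo average can be sketched as follows. This is a minimal Python illustration with assumed names: <code class="language-plaintext highlighter-rouge">corrupt</code> plays the role of sampling from the stochastic mapping by zeroing a fixed fraction of randomly chosen pixels, and <code class="language-plaintext highlighter-rouge">reconstruct</code> stands for the composed encoder and decoder, passed in as any callable.</p>

```python
import random


def corrupt(x, drop_fraction, rng):
    """Sample a corrupted copy of x by zeroing a fixed fraction of its pixels."""
    num_drop = round(drop_fraction * len(x))
    dropped = set(rng.sample(range(len(x)), num_drop))
    return [0.0 if k in dropped else x_k for k, x_k in enumerate(x)]


def monte_carlo_loss(x, reconstruct, loss, drop_fraction, num_samples, rng):
    """Approximate the expected loss over corruptions with num_samples draws."""
    total = 0.0
    for _ in range(num_samples):
        x_tilde = corrupt(x, drop_fraction, rng)
        total += loss(x, reconstruct(x_tilde))  # compare against the clean x
    return total / num_samples


def squared_error(x, y):
    return sum((x_k - y_k) ** 2 for x_k, y_k in zip(x, y))


rng = random.Random(0)
clean = [1.0] * 10
# An identity "network" cannot undo the corruption, so the loss counts dropped pixels.
print(monte_carlo_loss(clean, lambda v: v, squared_error, 0.3, 5, rng))
```

<p>Note that the loss always compares the reconstruction with the clean input, never with the corrupted one; this is what forces the network to learn to denoise.</p>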
<h1 id="variational-autoencoders-vae">Variational Autoencoders (VAE)</h1>
<h2 id="structure-2">Structure</h2>
<p>VAEs have the same network structure as AEs: an encoder that calculates the latent variable <script type="math/tex">z</script> and a decoder that generates the output image <script type="math/tex">y</script>. Also, we train both networks such that the output image matches the input image. What differs is their goal. The goal of an autoencoder is to generate the best feature vector <script type="math/tex">z</script> from an image, whereas the goal of a variational autoencoder is to generate realistic images from the vector <script type="math/tex">z</script>.</p>
<p>Also, the network structures of AEs and VAEs are not exactly the same. The encoder of an AE directly calculates the latent variable <script type="math/tex">z</script> from the input. On the other hand, the encoder of a VAE calculates the parameters of a Gaussian distribution (<script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>), from which we then sample <script type="math/tex">z</script>. The same holds for the decoder. AEs output the image itself, but VAEs output the parameters of the image pixel distribution. Let us put this more formally.</p>
<ul>
<li>
<p><strong>Encoder</strong><br />
Let a standard normal distribution <script type="math/tex">p(z)</script> be the prior distribution of latent variable <script type="math/tex">z</script>.
Given an input image <script type="math/tex">x</script>, we have our encoder network calculate the posterior distribution <script type="math/tex">p(z \vert x)</script>. Then we sample our latent variable <script type="math/tex">z</script> from the posterior distribution.</p>
</li>
<li>
<p><strong>Decoder</strong><br />
Given a latent variable <script type="math/tex">z</script>, the likelihood of our decoder outputting <script type="math/tex">x</script> (the input image) is <script type="math/tex">p(x \vert z)</script>. We usually interpret this as a multivariate Bernoulli distribution where each pixel of the image corresponds to a dimension.</p>
</li>
</ul>
<h2 id="the-optimization-problem">The Optimization Problem</h2>
<p>We want to sample <script type="math/tex">z</script> from the posterior <script type="math/tex">p(z \vert x)</script>, which can be expanded with Bayes’ rule.</p>
<script type="math/tex; mode=display">p(z \vert x) = \frac{p(x \vert z)p(z)}{p(x)}</script>
<p>However <script type="math/tex">p(x) = \int p(x \vert z ) p(z) dz</script>, the evidence, is intractable since we need to integrate over all possible <script type="math/tex">z</script>. Thus without calculating the posterior <script type="math/tex">p(z \vert x)</script>, we’ll try to approximate it with a Gaussian distribution <script type="math/tex">q_\lambda (z \vert x)</script>. We call this <strong>variational inference</strong>.</p>
<p>Since we want the two distributions <script type="math/tex">q_\lambda (z \vert x)</script> and <script type="math/tex">p(z \vert x)</script> to be similar, we adopt the Kullback-Leibler Divergence and try to minimize it with respect to parameter <script type="math/tex">\lambda</script>.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
D_{KL}(q_\lambda(z \vert x) \vert \vert p(z \vert x))
&= \int_{-\infty}^{\infty} q_\lambda (z \vert x)\log \left( \frac{q_\lambda (z \vert x)}{p(z \vert x)} \right) dz\\
&=\mathbb{E}_q\left[ \log(q_\lambda (z \vert x)) \right] - \mathbb{E}_q \left[ \log (p(z \vert x)) \right] \\
&=\mathbb{E}_q\left[ \log(q_\lambda (z \vert x)) \right] - \mathbb{E}_q \left[ \log (p(z, x)) \right] + \log(p(x))\\
\end{aligned} %]]></script>
<p>The problem here is that the intractable <script type="math/tex">p(x)</script> term is still present. Now let us write the above equation in terms of <script type="math/tex">\log(p(x))</script>.</p>
<script type="math/tex; mode=display">\log(p(x)) = D_{KL}(q_\lambda(z \vert x) \vert \vert p(z \vert x)) + \text{ELBO}(\lambda)</script>
<p>where</p>
<script type="math/tex; mode=display">\text{ELBO}(\lambda) = \mathbb{E}_q \left[ \log (p(z, x)) \right] - \mathbb{E}_q\left[ \log(q_\lambda (z \vert x)) \right]</script>
<p>The log evidence <script type="math/tex">\log(p(x))</script> does not depend on <script type="math/tex">\lambda</script>, so minimizing the KL divergence with respect to <script type="math/tex">\lambda</script> is equivalent to <strong>maximizing the ELBO</strong> with respect to <script type="math/tex">\lambda</script>. Moreover, KL divergences are always non-negative, so the ELBO never exceeds <script type="math/tex">\log(p(x))</script>. This reveals the abbreviation: <strong>E</strong>vidence <strong>L</strong>ower <strong>BO</strong>und. Maximizing the ELBO can also be understood as pushing up the evidence <script type="math/tex">p(x)</script>, that is, the probability that the model generates the exact input image.</p>
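<p>The decomposition of the log evidence into a KL term and the ELBO can be checked numerically. The sketch below is a toy, assuming a latent variable with only three discrete values so that the integral over <script type="math/tex">z</script> becomes a finite sum; all probabilities are made-up numbers. In this tiny setting the evidence is tractable, which is exactly what lets us verify the identity.</p>

```python
import math

prior = [0.5, 0.3, 0.2]       # p(z) over three latent values (assumed numbers)
likelihood = [0.9, 0.2, 0.4]  # p(x | z) for one fixed observation x
q = [0.6, 0.1, 0.3]           # an arbitrary approximate posterior q(z | x)

joint = [p_z * p_x for p_z, p_x in zip(prior, likelihood)]  # p(z, x)
evidence = sum(joint)                                       # p(x)
posterior = [j / evidence for j in joint]                   # exact p(z | x)

# KL(q || p(z | x)) and ELBO = E_q[log p(z, x)] - E_q[log q(z | x)]
kl = sum(q_z * math.log(q_z / p_z) for q_z, p_z in zip(q, posterior))
elbo = sum(q_z * (math.log(j) - math.log(q_z)) for q_z, j in zip(q, joint))

print("log evidence:", math.log(evidence))
print("KL + ELBO:   ", kl + elbo)
```

<p>Changing <code class="language-plaintext highlighter-rouge">q</code> moves the KL term and the ELBO in opposite directions while their sum stays fixed, and setting <code class="language-plaintext highlighter-rouge">q</code> equal to the exact posterior drives the KL term to zero.</p>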
<h2 id="elbo">ELBO</h2>
<p>Let’s inspect the <script type="math/tex">\text{ELBO}</script> term. Since no two input images share the same latent variable <script type="math/tex">z</script>, we can write <script type="math/tex">\text{ELBO}_i (\lambda)</script> for a single input image <script type="math/tex">x_i</script>.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}_i (\lambda)
&= \mathbb{E}_q \left[ \log (p(z, x_i)) \right] - \mathbb{E}_q\left[ \log(q_\lambda (z \vert x_i)) \right] \\
&= \int \log(p(z, x_i)) q_\lambda(z \vert x_i) dz - \int \log(q_\lambda(z \vert x_i))q_\lambda(z \vert x_i) dz \\
&= \int \log(p(x_i \vert z)p(z)) q_\lambda(z \vert x_i) dz - \int \log(q_\lambda(z \vert x_i))q_\lambda(z \vert x_i) dz \\
&= \int \log(p(x_i \vert z)) q_\lambda(z \vert x_i) dz - \int q_\lambda(z \vert x_i) \log\left(\frac{q_\lambda(z \vert x_i)}{p(z)}\right)dz \\
&= \mathbb{E}_q \left[ \log (p(x_i \vert z)) \right] - D_{KL}(q_\lambda(z \vert x_i) \vert \vert p(z))
\end{aligned} %]]></script>
<p>Now shifting our attention back to the network structure, our encoder network calculates the parameters of <script type="math/tex">q_\lambda(z \vert x_i)</script>, and our decoder network calculates the likelihood <script type="math/tex">p(x_i \vert z)</script>. Thus we can rewrite the above results so that the parameters match those of the autoencoder described above.</p>
<script type="math/tex; mode=display">\text{ELBO}_i(\phi, \theta) = \mathbb{E}_{q_\phi} \left[ \log(p_\theta(x_i \vert z)) \right] - D_{KL}(q_\phi(z \vert x_i) \vert \vert p(z))</script>
<p>Negating <script type="math/tex">\text{ELBO}_i(\phi, \theta)</script>, we obtain our loss function for sample <script type="math/tex">x_i</script>.</p>
<script type="math/tex; mode=display">l_i(\phi, \theta) = -\text{ELBO}_i(\phi, \theta)</script>
<p>Thus our optimization problem becomes</p>
<script type="math/tex; mode=display">\phi^*, \theta^* = \underset{\phi, \theta}{\text{argmin}} \sum_{i=1}^N \left[ -\mathbb{E}_{q_\phi} \left[ \log(p_\theta(x_i \vert z)) \right] + D_{KL}(q_\phi(z \vert x_i) \vert \vert p(z)) \right]</script>
<h2 id="understanding-the-loss-function">Understanding the loss function</h2>
<script type="math/tex; mode=display">l_i(\phi, \theta) = -\underline{\mathbb{E}_{q_\phi} \left[ \log(p_\theta(x_i \vert z)) \right]} + \underline{D_{KL}(q_\phi(z \vert x_i) \vert \vert p(z))}</script>
<p>The first underlined part (excluding the negative sign) is to be maximized. This is called the reconstruction loss: how similar the reconstructed image is to the input image. For each latent variable <script type="math/tex">z</script> we sample from the approximated posterior <script type="math/tex">q_\phi(z \vert x_i)</script>, we calculate the log-likelihood of the decoder producing <script type="math/tex">x_i</script>. Thus maximizing this term is equivalent to the maximum likelihood estimation.</p>
<p>The second term is the Kullback-Leibler Divergence between the approximated posterior <script type="math/tex">q_\phi(z \vert x_i)</script> and the prior <script type="math/tex">p(z)</script>. This acts as a regularizer, forcing the approximated posterior to be similar to the prior distribution, which is a standard normal distribution.</p>
<p><img src="/assets/images/posts/2019-01-31-Learned-Manifold.JPG" alt="Learned Manifold" /></p>
<p>The above plots 2-dimensional latent variables of 500 test images for an AE and a VAE. As you can see, the distribution of latent variables of VAEs is close to the standard normal distribution, which is due to the regularizer. This is a virtue because, with this property, we can just easily sample a vector <script type="math/tex">z</script> from the standard normal distribution and feed it to the decoder network to generate a reasonable image. This is ideal because VAEs were intended as a generator.</p>
<h2 id="calculating-the-loss-function">Calculating the loss function</h2>
<p>To train our VAE, we should be able to calculate the loss. Let’s start with the <strong>regularizer</strong> term.</p>
<p><img src="/assets/images/posts/2019-01-31-Gaussian-Encoder.JPG" alt="Gaussian Encoder" /></p>
<p>We create our encoder network such that it calculates the mean and standard deviation of <script type="math/tex">q_\phi(z \vert x_i)</script>. We then sample vector <script type="math/tex">z</script> from this Multivariate Gaussian distribution: <script type="math/tex">z \sim \mathcal{N}(\mu, \sigma^2 I)</script>.</p>
<p>The KL divergence between two normal distributions is <a href="https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Multivariate_normal_distributions">known</a>. We can calculate the regularizer term as:</p>
<script type="math/tex; mode=display">D_{KL}(q_\phi(z \vert x_i) \vert \vert p(z)) = \frac{1}{2}\sum_{j=1}^J \left( \mu_{i,j}^2 + \sigma_{i,j}^2 - \log(\sigma_{i,j}^2)-1\right)</script>
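<p>The closed-form regularizer is straightforward to compute from the encoder outputs. Here is a minimal Python sketch, assuming the mean and standard deviation of one sample are given as plain lists of length <script type="math/tex">J</script>; the function name is chosen for illustration.</p>

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL divergence from N(mu, diag(sigma^2)) to the standard normal prior."""
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))


print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # posterior equals the prior
print(kl_to_standard_normal([1.0, 0.0], [1.0, 1.0]))  # one mean shifted by 1
print(kl_to_standard_normal([0.0, 0.0], [0.5, 2.0]))  # variances off from 1
```

<p>The term vanishes only when every mean is 0 and every standard deviation is 1, so any deviation of the approximate posterior from the prior is penalized.</p>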
<p>Now let’s look at the <strong>reconstruction loss</strong> term. To calculate the log-likelihood of our image <script type="math/tex">\log(p_\theta(x_i \vert z))</script>, we should choose how to model our output. We have two choices.</p>
<ol>
<li>
<p>Multivariate Bernoulli Distribution<br />
<img src="/assets/images/posts/2019-01-31-Bernoulli-Decoder.JPG" alt="Bernoulli Decoder" /></p>
<p>This is often reasonable for black and white images like those from MNIST. We binarize the training and testing images with threshold 0.5. We can implement this easily with PyTorch:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">image</span> <span class="o">=</span> <span class="p">(</span><span class="n">image</span> <span class="o">>=</span> <span class="mf">0.5</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span>
</code></pre></div> </div>
<p>Each output of the decoder corresponds to a single pixel of the image, denoting the probability of the pixel being white. Then we can use the Bernoulli probability mass function <script type="math/tex">f(x_{i,j};p_{i,j}) = p_{i,j}^{x_{i,j}} (1-p_{i,j})^{1-x_{i,j}}</script> as our likelihood.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\log p(x_i \vert z)
&= \sum_{j=1}^D \log(p_{i,j}^{x_{i,j}} (1-p_{i,j})^{1-x_{i,j}}) \\
&= \sum_{j=1}^D \left[x_{i,j} \log(p_{i,j}) + (1-x_{i,j})\log(1-p_{i,j}) \right]
\end{aligned} %]]></script>
<p>This is the negative of the binary cross-entropy loss, so maximizing this log-likelihood is equivalent to minimizing the cross-entropy.</p>
</li>
<li>
<p>Multivariate Gaussian Distribution<br />
<img src="/assets/images/posts/2019-01-31-Gaussian-Decoder.JPG" alt="Gaussian Decoder" /></p>
<p>The probability density function of a Gaussian distribution is as follows.</p>
<script type="math/tex; mode=display">f(x_{i,j};\mu_{i,j}, \sigma_{i,j}) = \frac{1}{\sqrt{2\pi\sigma_{i,j}^2}}e^{-\frac{(x_{i,j}-\mu_{i,j})^2}{2\sigma_{i,j}^2}}</script>
<p>Using this in our likelihood and dropping the additive constant <script type="math/tex">-\frac{D}{2}\log(2\pi)</script>,</p>
<script type="math/tex; mode=display">\log p(x_i \vert z) = -\sum_{j=1}^D \left[ \frac{1}{2}\log(\sigma_{i,j}^2)+\frac{(x_{i,j}-\mu_{i,j})^2}{2\sigma_{i,j}^2} \right]</script>
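<p>Notice that if we fix <script type="math/tex">\sigma_{i,j} = 1</script>, the log-likelihood reduces to the negative squared error scaled by <script type="math/tex">\frac{1}{2}</script>, so maximizing it is the same as minimizing the squared error.</p>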
</li>
</ol>
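<p>The two decoder likelihoods above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the plain-list inputs, and the <code class="language-plaintext highlighter-rouge">eps</code> clamp that guards against <code class="language-plaintext highlighter-rouge">log(0)</code> are not part of the original derivation.</p>

```python
import math

def bernoulli_log_likelihood(x, p, eps=1e-12):
    """Sum over pixels of x*log(p) + (1-x)*log(1-p); x is binarized."""
    total = 0.0
    for x_j, p_j in zip(x, p):
        p_j = min(max(p_j, eps), 1.0 - eps)  # assumed clamp to avoid log(0)
        total += x_j * math.log(p_j) + (1 - x_j) * math.log(1.0 - p_j)
    return total

def gaussian_log_likelihood(x, mu, sigma):
    """Gaussian log-likelihood with the constant -(D/2)*log(2*pi) dropped."""
    return -sum(0.5 * math.log(s * s) + (x_j - m) ** 2 / (2 * s * s)
                for x_j, m, s in zip(x, mu, sigma))

# With sigma fixed to 1 the Gaussian case is minus half the squared error.
print(gaussian_log_likelihood([1.0, 2.0], [0.0, 0.0], [1.0, 1.0]))  # -2.5
```

<p>Negating either function gives a loss to minimize: binary cross-entropy in the first case, and half the squared error in the second when every standard deviation is 1.</p>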
<p>Now that we’ve calculated the likelihood <script type="math/tex">p_\theta(x_i \vert z)</script>, we can look at the whole reconstruction loss term. Unfortunately, the expectation is difficult to compute since it takes into account every possible <script type="math/tex">z</script>. So we use a Monte Carlo approximation of the expectation: we sample <script type="math/tex">L</script> latent vectors <script type="math/tex">z_l</script> from <script type="math/tex">q_\phi(z \vert x_i)</script> and take the mean of their log-likelihoods.</p>
<script type="math/tex; mode=display">\mathbb{E}_{q_\phi} \left[ \log p_\theta(x_i \vert z) \right] \approx \frac{1}{L} \sum_{l=1}^L \log p_\theta(x_i \vert z_l )</script>
<p>For convenience, we use <script type="math/tex">L = 1</script> in implementation.</p>
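<p>A rough sketch of this Monte Carlo estimate follows. The sampler draws <script type="math/tex">z = \mu + \sigma \epsilon</script> with standard normal <script type="math/tex">\epsilon</script>, and <code class="language-plaintext highlighter-rouge">toy_log_likelihood</code> is a hypothetical stand-in for the decoder term <script type="math/tex">\log p_\theta(x_i \vert z)</script>; none of these names come from the original implementation.</p>

```python
import random

def sample_latent(mu, sigma, rng):
    """Draw z ~ N(mu, diag(sigma^2)) as mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def monte_carlo_expectation(log_likelihood, mu, sigma, num_samples=1, seed=0):
    """Average log_likelihood(z_l) over num_samples draws from q(z | x_i)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += log_likelihood(sample_latent(mu, sigma, rng))
    return total / num_samples

def toy_log_likelihood(z):
    """Hypothetical decoder term; its exact expectation under N(0, I) in 2-D is -1."""
    return -0.5 * sum(z_k * z_k for z_k in z)

# More samples tighten the estimate; num_samples=1 is the cheap choice used in training.
estimate = monte_carlo_expectation(toy_log_likelihood, [0.0, 0.0], [1.0, 1.0],
                                   num_samples=10000)
print(estimate)
```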
<h1 id="conditional-variational-autoencoders-cvae">Conditional Variational Autoencoders (CVAE)</h1>
<h2 id="structure-3">Structure</h2>
<p>The CVAE has the same structure and loss function as the VAE, but the input data is different. Notice that in VAEs, we never used the labels of our training data. If we have labels, why don’t we use them?</p>
<p><img src="/assets/images/posts/2019-01-31-CVAE.png" alt="Conditional Variational Autoencoders" /></p>
<p>Now in conditional variational autoencoders, we concatenate the one-hot labels with the input images, and also with the latent variables. Everything else is the same.</p>
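<p>The concatenation step can be sketched as below, using plain lists in place of tensors. The helper names and the toy sizes are assumptions made for illustration only.</p>

```python
def one_hot(label, num_classes):
    """Return a one-hot list with a 1.0 at index `label`."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def concat_condition(vector, label, num_classes):
    """Append the one-hot label to a flattened image or a latent vector."""
    return list(vector) + one_hot(label, num_classes)

# Encoder input: flattened image (4 toy pixels) followed by the label.
encoder_input = concat_condition([0.0, 1.0, 1.0, 0.0], label=2, num_classes=3)
# Decoder input: latent vector followed by the same label.
decoder_input = concat_condition([0.3, -1.2], label=2, num_classes=3)
print(encoder_input)  # [0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0]
print(decoder_input)  # [0.3, -1.2, 0.0, 0.0, 1.0]
```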
<h2 id="implications">Implications</h2>
<p>What do we get by doing this? One benefit is that the latent variable no longer needs to encode the label of the input. It only needs to encode the style, or the <strong>class-invariant features</strong>, of that image.</p>
<p>Then, we can concatenate any one-hot vector to generate an image of the intended class with the specific style encoded by the latent variable.</p>
<p>For more generated images, check out <a href="https://github.com/jaywonchung/Learning-ML/tree/master/Implementations/Conditional-Variational-Autoencoder">my repository</a>’s README file.</p>
<h1 id="acknowledgements">Acknowledgements</h1>
<ul>
<li>
<p>Images in this post were borrowed from the <a href="https://www.slideshare.net/NaverEngineering/ss-96581209">presentation by Hwalsuk Lee</a>.</p>
</li>
<li>
<p>I’ve implemented everything discussed here. Check out <a href="https://github.com/jaywonchung/Learning-ML">my GitHub repository</a>.</p>
</li>
</ul>Jaewon ChungAE, DAE, VAE, and CVAE explained. The previous post on Bayesian statistics may help your understanding.Bayesian Statistics, Maximum Likelihood Estimation, and Machine Learning2019-01-29T00:00:00+09:002019-01-29T00:00:00+09:00https://jaewonchung.me/study/machine-learning/MLE-and-ML<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Prior_probability">Wikipedia: Prior Probability</a></li>
<li><a href="https://en.wikipedia.org/wiki/Posterior_probability">Wikipedia: Posterior Probability</a></li>
<li><a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">Wikipedia: Maximum Likelihood Estimation</a></li>
<li><a href="https://www.youtube.com/watch?v=o_peo6U7IRM">YouTube: 오토인코더의 모든 것 1/3 (Everything About Autoencoders, 1/3)</a></li>
</ul>
<h1 id="prior-probability">Prior probability</h1>
<p>The prior probability distribution of an uncertain quantity is the probability distribution about that quantity <strong>before</strong> some evidence is taken into account. This is often expressed as <script type="math/tex">p(\theta)</script>.</p>
<h1 id="posterior-probability">Posterior probability</h1>
<p>The posterior probability of a random event is the conditional probability that is assigned <strong>after</strong> relevant evidence is taken into account. This is often expressed as <script type="math/tex">p(\theta | x)</script>. The prior and posterior probabilities are related by Bayes’ theorem as follows:</p>
<script type="math/tex; mode=display">p(\theta | x) = \frac{p(x|\theta)p(\theta)}{p(x)}</script>
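<p>A small worked example may make the relation concrete. Assume, purely for illustration, two candidate coins (a fair one and one that lands heads 90% of the time), equal prior belief in each, and a single observed head.</p>

```python
def posterior(prior, likelihood):
    """Apply Bayes' theorem over a discrete set of hypotheses."""
    evidence = sum(prior[t] * likelihood[t] for t in prior)  # p(x)
    return {t: prior[t] * likelihood[t] / evidence for t in prior}

prior = {"fair": 0.5, "biased": 0.5}             # p(theta), assumed
likelihood_heads = {"fair": 0.5, "biased": 0.9}  # p(x = heads | theta), assumed

post = posterior(prior, likelihood_heads)
# Seeing a head shifts belief toward the biased coin: 0.45 / 0.70.
print(post["biased"])
```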
<h1 id="maximum-likelihood-estimation-mle">Maximum Likelihood Estimation (MLE)</h1>
<p>MLE is a method of estimating the parameters of a statistical model, given observations. Intuitively, we are trying to find the model parameters that make the observed data most probable. This is done by finding the parameters that maximize the likelihood function <script type="math/tex">\mathcal{L}(\theta;x)</script>. When we are dealing with discrete random variables, the likelihood function is the probability of the observed data. On the other hand, when we are dealing with continuous random variables, the likelihood function is the value of the probability density function at the observed data.</p>
<p>We can formulate the MLE problem as follows:</p>
<script type="math/tex; mode=display">\theta^* \in \{\underset{\theta}{\text{argmax}} \mathcal{L}(\theta;x)\}</script>
<p>where <script type="math/tex">\theta</script> is the model parameters and <script type="math/tex">x</script> is the observed data.</p>
<p>We often use the average log-likelihood function</p>
<script type="math/tex; mode=display">\hat{\ell}(\theta;x) = \frac{1}{n} \log \mathcal{L}(\theta;x)</script>
<p>where <script type="math/tex">n</script> is the number of observations, since it has preferable qualities. One of these is illustrated later in this document.</p>
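<p>As a minimal sketch of MLE, consider coin flips modeled as Bernoulli trials with unknown heads probability. The data, the grid of candidate parameters, and the function names below are all assumptions chosen for illustration; the grid search simply stands in for the argmax.</p>

```python
import math

def average_log_likelihood(theta, data):
    """(1/n) * log L(theta; x) for i.i.d. Bernoulli observations (1 = heads)."""
    n = len(data)
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data) / n

# Hypothetical observations: 7 heads out of 10 flips.
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# Evaluate candidates 0.01, 0.02, ..., 0.99 and keep the maximizer.
grid = [k / 100 for k in range(1, 100)]
theta_star = max(grid, key=lambda t: average_log_likelihood(t, data))
print(theta_star)  # 0.7, the observed fraction of heads
```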
<h2 id="machine-learning-in-the-mle-perspective">Machine Learning in the MLE perspective</h2>
<p><img src="https://raw.githubusercontent.com/jaywonchung/jaywonchung.github.io/master/assets/images/posts/2019-01-29-ML-model-traditional.png" alt="Traditional machine learning models" /></p>
<p>A traditional machine learning model for classification is visualized above: we receive an input image <script type="math/tex">x</script> and our model calculates <script type="math/tex">f_\theta (x)</script>, which is a vector denoting the probability for each class. Then based on our label, we calculate the loss function, which is then optimized using gradient descent. Now, let us view this in a maximum likelihood perspective.</p>
<p><img src="https://raw.githubusercontent.com/jaywonchung/jaywonchung.github.io/master/assets/images/posts/2019-01-29-ML-model-MLE.png" alt="Machine learning models in an MLE perspective" /></p>
<p>Now, when we create an ML model, we choose a statistical model that our output may follow. Then, our ML model function calculates the parameters of that statistical model. For example, let us assume that our output <script type="math/tex">y</script> is one dimensional and has a Gaussian distribution. Then we set <script type="math/tex">f_\theta(x)</script> to a two-dimensional vector and interpret it as</p>
<script type="math/tex; mode=display">f_\theta(x) =\begin{bmatrix}\mu\\\sigma\end{bmatrix}</script>
<p>Thus for each input <script type="math/tex">x</script> we obtain a Gaussian distribution for <script type="math/tex">y</script>. Using negative log-likelihood, our optimization problem is the following:</p>
<script type="math/tex; mode=display">\theta^* = \underset{\theta}{\text{argmin}}[-\log p(y|f_\theta(x))]</script>
<p>If we assume that our data points are independent and identically distributed (i.i.d.), we can obtain the following:</p>
<script type="math/tex; mode=display">p(y|f_\theta(x)) = \prod_i p(y_i|f_\theta(x_i))</script>
<p>Rewriting our optimization problem:</p>
<script type="math/tex; mode=display">\theta^* = \underset{\theta}{\text{argmin}}[-\sum_i\log p(y_i|f_\theta(x_i))]</script>
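<p>The summed negative log-likelihood above can be sketched for the Gaussian example. Here <code class="language-plaintext highlighter-rouge">linear_model</code> is a hypothetical stand-in for <script type="math/tex">f_\theta</script> that returns a mean and a standard deviation; the toy data are likewise assumed.</p>

```python
import math

def gaussian_nll(y, mu, sigma):
    """-log p(y | mu, sigma) for a one-dimensional Gaussian."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def dataset_nll(xs, ys, model):
    """Sum of per-example negative log-likelihoods under the i.i.d. assumption."""
    return sum(gaussian_nll(y, *model(x)) for x, y in zip(xs, ys))

def linear_model(x, a=2.0, b=0.0, sigma=1.0):
    """Hypothetical f_theta: maps an input to the parameters (mu, sigma)."""
    return a * x + b, sigma

xs = [0.0, 1.0, 2.0]
ys = [0.5, 2.0, 3.5]
total = dataset_nll(xs, ys, linear_model)

# With sigma fixed to 1, the loss is half the sum of squared errors plus a constant.
sse = sum((y - linear_model(x)[0]) ** 2 for x, y in zip(xs, ys))
print(total, 0.5 * sse + len(xs) * 0.5 * math.log(2 * math.pi))
```

<p>Because the constant does not depend on the parameters, minimizing this loss with a fixed standard deviation is the same as minimizing the squared error.</p>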
<p>When we perform inference with our model, we no longer get deterministic outputs as we did in traditional machine learning models. We now get a distribution of <script type="math/tex">y_\text{new}</script>,</p>
<script type="math/tex; mode=display">y_\text{new} \sim f_{\theta^*}(x_\text{new})</script>
<p>from which we sample a single <script type="math/tex">y_\text{new}</script>.</p>
<h2 id="loss-functions-in-the-mle-perspective">Loss Functions in the MLE perspective</h2>
<p>Two famous loss functions, mean squared error and cross-entropy error, can be derived using the MLE perspective.</p>
<p><img src="https://raw.githubusercontent.com/jaywonchung/jaywonchung.github.io/master/assets/images/posts/2019-01-29-Loss-functions-MLE.png" alt="Loss function derived" />
(<a href="https://www.slideshare.net/NaverEngineering/ss-96581209">https://www.slideshare.net/NaverEngineering/ss-96581209</a>)</p>Jaewon ChungSome basic content I encountered while studying machine learning. A very brief explanation of prior probabilities, posterior probabilities, maximum likelihood estimation, and how they provide a new viewpoint for machine learning models.[Review] XNOR-Nets: ImageNet Classification Using Binary Convolutional Neural Networks2019-01-18T00:00:00+09:002019-01-18T00:00:00+09:00https://jaewonchung.me/read/papers/XNOR-Nets<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://arxiv.org/abs/1603.05279">arXiv</a></li>
<li><a href="http://allenai.org/plato/xnornet">Official XNOR implementation of AlexNet</a></li>
</ul>
<h1 id="abstractintroduction">Abstract/Introduction</h1>
<p>The two models presented:</p>
<blockquote>
<p>In Binary-Weight-Networks, the (convolution) filters are approximated with binary values resulting in 32 x memory saving.</p>
</blockquote>
<blockquote>
<p>In XNOR-Networks, both the filters and the input to convolutional layers are binary. … This results in 58 x faster convolutional operations…</p>
</blockquote>
<p>Implications:</p>
<blockquote>
<p>XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time.</p>
</blockquote>
<h1 id="binary-convolutional-neural-networks">Binary Convolutional Neural Networks</h1>
<p>For future discussions we use the following mathematical notation for a CNN layer:</p>
<p><script type="math/tex">\mathcal{I}_{l(l=1,...,L)} = \mathbf{I}\in \mathbb{R} ^{c \times w_{\text{in}} \times h_{\text{in}}}</script><br />
<script type="math/tex">\mathcal{W}_{lk(k=1,...,K^l)}=\mathbf{W} \in \mathbb{R} ^{c \times w \times h}</script><br />
<script type="math/tex">\ast\text{ : convolution}</script><br />
<script type="math/tex">\oplus\text{ : convolution without multiplication}</script><br />
<script type="math/tex">\otimes \text{ : convolution with XNOR and bitcount}</script><br />
<script type="math/tex">\odot \text{ : elementwise multiplication}</script></p>
<h2 id="convolution-with-binary-weights">Convolution with binary weights</h2>
<p>In binary convolutional networks, we estimate the convolution filter weight as <script type="math/tex">\mathbf{W}\approx\alpha \mathbf{B}</script>, where <script type="math/tex">\alpha</script> is a scalar scaling factor and <script type="math/tex">\mathbf{B} \in \{+1, -1\} ^{c \times w \times h}</script>. Hence, we estimate the convolution operation as follows:</p>
<script type="math/tex; mode=display">\mathbf{I} \ast \mathbf{W}\approx (\mathbf{I}\oplus \mathbf{B})\alpha</script>
<p>To find an optimal estimation for <script type="math/tex">\mathbf{W}\approx\alpha \mathbf{B}</script> we solve the following problem:</p>
<script type="math/tex; mode=display">J(\mathbf{B},\alpha)=\Vert \mathbf{W}-\alpha \mathbf{B}\Vert^2</script>
<script type="math/tex; mode=display">\alpha ^*,\mathbf{B}^* =\underset{\alpha, \mathbf{B}}{\text{argmin}}J(\mathbf{B},\alpha)</script>
<p>Going straight to the answer:</p>
<script type="math/tex; mode=display">\alpha^* = \frac{1}{n}\Vert \mathbf{W}\Vert_{l1}</script>
<script type="math/tex; mode=display">\mathbf{B}^*=\text{sign}(\mathbf{W})</script>
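<p>The closed-form solution above can be sketched on a flattened filter. The function names are hypothetical, and mapping a weight of exactly zero to +1 is an assumption of this sketch.</p>

```python
def binarize_weights(weights):
    """Return (alpha, B) with alpha = mean |w| and B = sign(w) for a flat filter."""
    n = len(weights)
    alpha = sum(abs(w) for w in weights) / n
    signs = [1 if w >= 0 else -1 for w in weights]  # assumption: sign(0) = +1
    return alpha, signs

def squared_error(weights, alpha, signs):
    """J(B, alpha) = ||W - alpha * B||^2."""
    return sum((w - alpha * b) ** 2 for w, b in zip(weights, signs))

w = [0.5, -1.5, 1.0, -1.0]
alpha, b = binarize_weights(w)
print(alpha, b)                      # 1.0 [1, -1, 1, -1]
print(squared_error(w, alpha, b))    # 0.5
```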
<h2 id="training">Training</h2>
<p>The gradients are computed as follows:</p>
<script type="math/tex; mode=display">\frac{\partial \text{sign}}{\partial r}=r \text{1}_{\vert r \vert \le1}</script>
<script type="math/tex; mode=display">\frac{\partial L}{\partial \mathbf{W}_i}=\frac{\partial L}{\partial \widetilde{\mathbf{W}_i}}\left(\frac{1}{n} + \frac{\partial \text{sign}}{\partial \mathbf{W}_i}\alpha \right)</script>
<p>where <script type="math/tex">\widetilde{\mathbf{W}}=\alpha \mathbf{B}</script>, the estimated value of <script type="math/tex">\mathbf{W}</script>.</p>
<p>The gradient values are kept as real values; they cannot be binarized due to excessive information loss. Optimization is done by either SGD with momentum or Adam.</p>
<h1 id="xnor-networks">XNOR-Networks</h1>
<p>Convolutions are a set of dot products between a submatrix of the input and a filter. Thus we attempt to express dot products in terms of binary operations.</p>
<h2 id="binary-dot-product">Binary Dot Product</h2>
<p>For vectors <script type="math/tex">\mathbf{X}, \mathbf{W} \in \mathbb{R}^n</script> and <script type="math/tex">\mathbf{H}, \mathbf{B} \in \{+1,-1\}^n</script>, we approximate the dot product between <script type="math/tex">\mathbf{X}</script> and <script type="math/tex">\mathbf{W}</script> as</p>
<script type="math/tex; mode=display">\mathbf{X}^\top \mathbf{W} \approx \beta \mathbf{H}^\top \alpha \mathbf{B}</script>
<p>We solve the following optimization problem:</p>
<script type="math/tex; mode=display">\alpha^*, \mathbf{H}^*, \beta^*, \mathbf{B}^*=\underset{\alpha, \mathbf{H}, \beta, \mathbf{B}}{\text{argmin}} \Vert \mathbf{X} \odot \mathbf{W} - \beta \alpha \mathbf{H} \odot \mathbf{B} \Vert</script>
<p>Going straight to the answer:</p>
<script type="math/tex; mode=display">\alpha^* \beta^*=\left(\frac{1}{n}\Vert \mathbf{X} \Vert_{l1}\right)\left(\frac{1}{n}\Vert \mathbf{W} \Vert_{l1}\right)</script>
<script type="math/tex; mode=display">\mathbf{H}^* \odot \mathbf{B}^*=\text{sign}(\mathbf{X}) \odot \text{sign}(\mathbf{W})</script>
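<p>To see why this helps, note that the dot product of two ±1 vectors can be computed with XNOR and a bit count: if the vectors agree in <script type="math/tex">m</script> of <script type="math/tex">n</script> positions, the dot product is <script type="math/tex">2m - n</script>. The sketch below packs signs into Python integers; the packing scheme and all function names are assumptions made for illustration, not the paper’s implementation.</p>

```python
def signs_of(values):
    """Elementwise sign, mapping zero to +1 (an assumption of this sketch)."""
    return [1 if v >= 0 else -1 for v in values]

def pack_signs(signs):
    """Pack a +1/-1 vector into an int: bit k is set where signs[k] == +1."""
    bits = 0
    for k, s in enumerate(signs):
        if s == 1:
            bits |= 1 << k
    return bits

def xnor_dot(h_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(h_bits ^ b_bits) & mask   # XNOR: 1 where the signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n

def approx_dot(x, w):
    """Approximate x . w as beta * alpha * (H . B)."""
    n = len(x)
    beta = sum(abs(v) for v in x) / n
    alpha = sum(abs(v) for v in w) / n
    return beta * alpha * xnor_dot(pack_signs(signs_of(x)), pack_signs(signs_of(w)), n)

x = [2.0, -2.0, 2.0, 2.0]
w = [0.5, -0.5, 0.5, 0.5]
print(approx_dot(x, w), sum(a * b for a, b in zip(x, w)))  # 4.0 4.0
```

<p>The approximation is exact in this example only because every element has the same magnitude; in general it is an estimate.</p>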
<h2 id="convolution-with-binary-inputs-and-weights">Convolution with binary inputs and weights</h2>
<p>Calculating <script type="math/tex">\alpha^* \beta^*</script> for every submatrix in input tensor <script type="math/tex">\mathbf{I}</script> involves a large number of redundant computations. To overcome this inefficiency we first calculate</p>
<script type="math/tex; mode=display">\mathbf{A}=\frac{\sum{\vert \mathbf{I}_{:,:,i} \vert}}{c}</script>
<p>which is an average over absolute values of <script type="math/tex">\mathbf{I}</script> across its channels. Then, we convolve <script type="math/tex">\mathbf{A}</script> with a 2D filter <script type="math/tex">\mathbf{k} \in \mathbb{R}^{w \times h}</script> where <script type="math/tex">\forall ij \ \mathbf{k}_{ij}=\frac{1}{w \times h}</script>:</p>
<script type="math/tex; mode=display">\mathbf{K}=\mathbf{A} \ast \mathbf{k}</script>
<p>This <script type="math/tex">\mathbf{K}</script> holds the scaling factor <script type="math/tex">\beta</script> for every submatrix of <script type="math/tex">\mathbf{I}</script> at once. Now we can estimate our convolution with binary inputs and weights as:</p>
<script type="math/tex; mode=display">\mathbf{I} \ast \mathbf{W} \approx (\text{sign}(\mathbf{I}) \otimes \text{sign}(\mathbf{W})) \odot \mathbf{K} \alpha</script>
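<p>The computation of the scaling map can be sketched with nested lists, assuming stride 1 and no padding (neither is specified above). The input is indexed as <code class="language-plaintext highlighter-rouge">tensor[channel][row][column]</code>, and the function names are hypothetical.</p>

```python
def channel_mean_abs(tensor):
    """A[y][x] = mean over channels of |I[ch][y][x]|."""
    c = len(tensor)
    rows, cols = len(tensor[0]), len(tensor[0][0])
    return [[sum(abs(tensor[ch][y][x]) for ch in range(c)) / c
             for x in range(cols)] for y in range(rows)]

def box_filter(a, h, w):
    """Convolve A with a constant h-by-w filter whose entries are 1/(h*w)."""
    rows, cols = len(a), len(a[0])
    return [[sum(a[y + dy][x + dx] for dy in range(h) for dx in range(w)) / (h * w)
             for x in range(cols - w + 1)] for y in range(rows - h + 1)]

# Two 3x3 channels: one filled with 1, the other with -3.
tensor = [[[1.0] * 3 for _ in range(3)],
          [[-3.0] * 3 for _ in range(3)]]
a = channel_mean_abs(tensor)  # every entry is (1 + 3) / 2 = 2.0
k_map = box_filter(a, 2, 2)   # one scaling factor per 2x2 window
print(k_map)                  # [[2.0, 2.0], [2.0, 2.0]]
```

<p>Each entry of the result is shared by all filters at that spatial position, which is what removes the redundant per-submatrix computation.</p>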
<h2 id="training-1">Training</h2>
<p>A CNN block in XNOR-Net has the following structure:</p>
<p><code class="language-plaintext highlighter-rouge">[Batch Normalization] - [Binary Activation] - [Binary Convolution] - [Pool]</code></p>
<p>The BatchNorm layer normalizes the input batch by its mean and variance. The BinActiv layer calculates <script type="math/tex">\mathbf{K}</script> and <script type="math/tex">\text{sign}(\mathbf{I})</script>. We may insert a non-linear activation function between the BinConv layer and the Pool layer.</p>
<h1 id="experiments">Experiments</h1>
<p>The paper implemented AlexNet, ResNet, and a GoogLeNet variant (Darknet) with binary convolutions. This resulted in an accuracy drop of a few percentage points, but overall worked fairly well. Refer to the paper for details.</p>
<h1 id="discussion">Discussion</h1>
<p>Binary convolutions were not entirely binary; the gradients had to be real values. It would be fascinating if even the gradients could be binarized.</p>Jaewon ChungA model that binarizes both the input and convolution filters, offering the possibility of running SOTA networks on CPUs.Sharing My Engineering Mathematics 1 and 2 Notes2018-10-31T00:00:00+09:002018-10-31T00:00:00+09:00https://jaewonchung.me/study/lectures/EM-notes<p>These notes were written on an iPad Pro with an Apple Pencil, and they are what I actually wrote while attending the lectures or afterwards. Feel free to share them as long as you keep the email address in the bottom right corner of the notes. Commercial use, however, is not allowed.</p>
<p>Google Drive link: <a href="https://drive.google.com/open?id=1fJDoA_5gIAPB1BeIXKLVpGXyFZlNS0BA">https://drive.google.com/open?id=1fJDoA_5gIAPB1BeIXKLVpGXyFZlNS0BA</a></p>
<p><img src="/assets/images/posts/2018-10-31-EM1.png" alt="example page 1" /></p>
<p><img src="/assets/images/posts/2018-10-31-EM2.png" alt="example page 2" /></p>
<p><img src="/assets/images/posts/2018-10-31-EM3.png" alt="example page 3" /></p>
<p><img src="/assets/images/posts/2018-10-31-EM4.png" alt="example page 4" /></p>
<p><img src="/assets/images/posts/2018-10-31-EM5.png" alt="example page 5" /></p>Jaewon ChungGiving out handwritten notes for Engineering Mathematics 1 and 2[Review] There’s No Such Thing as a General-Purpose Processor2018-09-30T00:00:00+09:002018-09-30T00:00:00+09:00https://jaewonchung.me/read/magazines/No-general-purpose-processor<h2 id="article">Article</h2>
<p><a href="https://dl.acm.org/citation.cfm?id=2687011">There’s No Such Thing as a General-Purpose Processor (ACM Queue, October 2018)</a></p>
<h2 id="background-knowledge--summary">Background Knowledge & Summary</h2>
<p>What does it mean to be a ‘general-purpose’ processor? It should be able to run any given algorithm, and thus be Turing-complete. However, considering only Turing completeness neglects the performance aspect that has been driving the whole processor industry. In other words, a general-purpose processor should be able to run all programs efficiently.</p>
<p>The article examines past and recent processors and their trends in many aspects: memory virtualization and management by the operating system, how they predict branches and how much they rely on the compiler to generate efficient code, the use of cache memory that may bias performance in favor of specific algorithms, and how various models of parallelism make generalization difficult. Through this investigation, the author concludes that no processor has ever been general-purpose, and that none will be or should be.</p>
<p>For details on the comparison and examination made in each area, refer to the following keynote presentation I’ve made:</p>
<p><a class="embedly-card" data-card-controls="0" href="https://www.icloud.com/keynote/0sKLzLALYEPL4VKW9HVfRcUZQ">Keynote Presentation</a></p>Jaewon ChungACM Queue, October 2018. There were, are, and will be no general purpose processors, according to the author.[Review] Digital Nudging: Guiding Online User Choices through Interface Design2018-08-05T00:00:00+09:002018-08-05T00:00:00+09:00https://jaewonchung.me/read/magazines/Digital-nudging<h2 id="article">Article</h2>
<p><a href="https://dl.acm.org/citation.cfm?id=3213765">Digital Nudging: Guiding Online User Choices through Interface Design (Communications of the ACM, July 2018)</a></p>
<h2 id="background-knowledge--summary">Background Knowledge & Summary</h2>
<p>The premise of the article is that “What is chosen often depends on how the choice is presented,” or more specifically, that what is chosen in a choice environment often depends on how the choice interface is designed. People have limited cognitive capabilities and thus cannot exhaustively weigh every option, which makes them rely on heuristics and biases.</p>
<p>Today, most choices are being transferred to digital/internet environments. In such environments, real-time tracking and analysis of user behavior is possible. Also, the administrator can implement changes at a relatively low cost. Most importantly, the choice architect can select a default value for the choice maker. Due to these capabilities, many digital nudging techniques can be applied.</p>
<p>For details on such techniques and which heuristic they exploit, refer to the following keynote presentation I’ve made:</p>
<p><a class="embedly-card" data-card-controls="0" href="https://www.icloud.com/keynote/0MluSw-ZDWvE1jUBvAnJkn8oQ">Keynote Presentation</a></p>Jaewon ChungCACM, July 2018. What is chosen often depends on how the choice is presented. What are the techniques and heuristics we can exploit?[Review] Speech Emotion Recognition: Two Decades in a Nutshell, Benchmarks, and Ongoing Trends2018-07-15T00:00:00+09:002018-07-15T00:00:00+09:00https://jaewonchung.me/read/magazines/Speech-emotion-recognition<h2 id="article">Article</h2>
<p><a href="https://dl.acm.org/citation.cfm?id=3129340">Speech Emotion Recognition: Two Decades in a Nutshell, Benchmarks, and Ongoing Trends (Communications of the ACM, May 2018)</a></p>
<h2 id="background-knowledge--summary">Background Knowledge & Summary</h2>
<p>Recognizing a speaker’s emotion from their speech can be a key part of seamless human-computer interaction (HCI). In the traditional approach, we first model what emotions are and how to represent them. Then, according to the emotion model, we divide the speech into clusters and label them, and extract speech and textual features that reflect emotion. Finally, with the data and features, we can apply machine learning.</p>
<p>The problem with this process is that emotions have no definitive answer, which makes labeling very difficult even for the speakers themselves. Partly because of this, good-quality labeled data is always scarce and becomes a bottleneck.</p>
<p>Further, the article explains the upcoming and moonshot challenges of Speech Emotion Recognition, which include holistic speaker modeling, and handling atypical speech situations. Refer to the following keynote presentation I’ve made for more detail:</p>
<p><a class="embedly-card" data-card-controls="0" href="https://www.icloud.com/keynote/0pfl_o-tu9ZyNg9pEdMbiS05g">Keynote Presentation</a></p>Jaewon ChungCACM, May 2018. How do we model emotions, and how do we recognize them from speeches? What have we achieved until now and what do we aim for the future?[Review] Algorithms Behind Modern Storage Systems2018-06-24T00:00:00+09:002018-06-24T00:00:00+09:00https://jaewonchung.me/read/magazines/Algorithms-behind-modern-storage-systems<h2 id="article">Article</h2>
<ul>
<li>
<p><a href="https://dl.acm.org/citation.cfm?id=3220266">Algorithms Behind Modern Storage Systems (ACM Queue, Mar+Apr 2018)</a></p>
</li>
<li><a href="https://medium.com/databasss/on-disk-io-part-1-flavours-of-io-8e1ace1de017">On Disk IO, Part 1: Flavors of IO</a></li>
<li><a href="https://medium.com/databasss/on-disk-io-part-2-more-flavours-of-io-c945db3edb13">On Disk IO, Part 2: More Flavors of IO</a></li>
<li><a href="https://medium.com/databasss/on-disk-io-part-3-lsm-trees-8b2da218496f">On Disk IO, Part 3: LSM Trees</a></li>
<li><a href="https://medium.com/databasss/on-disk-storage-part-4-b-trees-30791060741">On Disk IO, Part 4: B-Trees and RUM Conjecture</a></li>
<li><a href="https://medium.com/databasss/on-disk-io-access-patterns-in-lsm-trees-2ba8dffc05f9">On Disk IO, Part 5: Access Patterns in LSM Trees</a></li>
</ul>
<h2 id="background-knowledge--summary">Background Knowledge & Summary</h2>
<p>The main article focuses on data structures and algorithms that are used in modern database systems. Through a concise overview of B-trees and LSM trees, the author extends the trade-offs of each data structure to the RUM conjecture, which suggests that you can try to balance the read/update/memory overheads, but there isn’t a perfectly optimal structure.</p>
<p>For further information, refer to the following keynote presentation I’ve made. Explained here are the basics of I/O (including virtual memory, paging, and page swapping), B-Trees, LSM trees, Write Ahead Log and the RUM Conjecture.</p>
<p><a class="embedly-card" data-card-controls="0" href="https://www.icloud.com/keynote/0dqsZt83Icufku4HPiRwe8dbQ">Keynote Presentation</a></p>Jaewon ChungACM Queue March+April 2018. Technical article on the data structures and algorithms used in modern storage systems.[Review] High Performance Synthetic Information Environments2018-05-17T00:00:00+09:002018-05-17T00:00:00+09:00https://jaewonchung.me/read/magazines/High-performance-synthetic-information-environments<h2 id="article">Article</h2>
<p><a href="https://dl.acm.org/citation.cfm?id=3158342">High Performance Synthetic Information Environments (ACM Ubiquity Symposium, February 2018)</a></p>
<h2 id="background-knowledge--summary">Background Knowledge & Summary</h2>
<p>Refer to the following keynote presentation I’ve made:</p>
<p><a class="embedly-card" data-card-controls="0" href="https://www.icloud.com/keynote/042Q2PJU2wK4AaxBdTc50_bkQ">Keynote Presentation</a></p>Jaewon ChungACM Ubiquity Symposium, February 2018. Technical article on the design of synthetic information environments.