Jae-Won’s Blog

Useful Matplotlib Tricks for Research

2024-09-20T00:00:00-04:00

I believe Matplotlib is the de facto standard for plotting. Because it’s entirely Python code, it’s reproducible (given the same input data), easy to version control, and easy to generate and iterate with ChatGPT.

I’ll go over a couple tricks with Matplotlib that I’ve found useful, especially for research.

Making plots compatible with PowerPoint

PNG files exported with Matplotlib don’t look very nice in PowerPoint slides, especially when you enlarge them. Also, you can’t, say, change the color of your lines or remove unnecessary components, because everything is flattened into a single raster image.

Instead, export your plot in SVG as well. All you need to do is call fig.savefig once more with a file path that ends with .svg. Then, drag the SVG file onto PowerPoint, right click, and click Convert to Shape. Now, every component (e.g., line, text) will be converted to PowerPoint-native objects that you can tweak freely. See also Microsoft’s documentation on this.

Before you actually call savefig, make sure you have the following configuration:

import matplotlib as mpl
mpl.rcParams["svg.fonttype"] = "none"

Without this, Matplotlib draws text as paths, so every character in your plot ends up as a separate shape instead of editable text. That’s beyond annoying.
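
As a quick sanity check, here is a minimal sketch that renders a figure to an in-memory SVG and looks for real text elements. The plotted data and the helper name render_svg are made up for illustration.

```python
import io

import matplotlib as mpl
from matplotlib.figure import Figure

# Keep text as editable <text> elements instead of converting glyphs to paths.
mpl.rcParams["svg.fonttype"] = "none"


def render_svg() -> str:
    """Render a toy line plot and return the SVG source as a string."""
    fig = Figure(figsize=(3, 2))
    ax = fig.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    ax.set_xlabel("iteration")
    buf = io.BytesIO()
    fig.savefig(buf, format="svg")
    return buf.getvalue().decode("utf-8")


svg_source = render_svg()
# True when labels were written as real text rather than as outlines.
print("<text" in svg_source)
```

If the check prints False, the setting was probably applied after the figure was already saved.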

Making plot files Git-friendly

Let’s say you’re exporting your plot in PDF or SVG and committing it into a Git repository for paper-writing. Normally, if you re-export the exact same plot (e.g., because you had to restart and re-run every Jupyter Notebook cell), Git will recognize the file as modified. This is because by default, the contents of the PDF/SVG file change slightly on every export. This is particularly bad for PDF files because they are binary: Git has to store the whole PDF file again because it doesn’t know how to do line-level diffs on binary files. However, it is possible to make Matplotlib generate the exact same file each time deterministically.

PDF files

fig.savefig("plot.pdf", metadata={"CreationDate": None})

Notice that we’re removing the file creation date from the PDF metadata.

SVG files

fig.savefig("plot.svg", metadata={"Date": None})

The savefig call is basically the same for SVG, except that it needs a slightly different metadata key for the creation date. However, we do need one more thing before we call savefig:

import matplotlib as mpl
mpl.rcParams["svg.hashsalt"] = "42"

Without fixing the hash salt, SVG generation is non-deterministic and git will (rightfully) detect the file as modified.
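
To convince yourself that the exports really are reproducible, you can hash two consecutive exports and compare the digests. This is only a sketch: the toy plot, the TIMESTAMP_KEYS table, and the export_digest helper are names I made up for illustration.

```python
import hashlib
import io

import matplotlib as mpl
from matplotlib.figure import Figure

# Fixed salt so that generated element IDs in the SVG are stable.
mpl.rcParams["svg.hashsalt"] = "42"

# Metadata key that carries the creation timestamp in each format.
TIMESTAMP_KEYS = {"pdf": "CreationDate", "svg": "Date"}


def export_digest(fmt: str) -> str:
    """Export a toy plot in the given format and return its SHA-256 digest."""
    fig = Figure(figsize=(3, 2))
    ax = fig.subplots()
    ax.plot([0, 1, 2], [0, 1, 4], marker="o")
    buf = io.BytesIO()
    # Setting the timestamp key to None drops it from the file.
    fig.savefig(buf, format=fmt, metadata={TIMESTAMP_KEYS[fmt]: None})
    return hashlib.sha256(buf.getvalue()).hexdigest()


for fmt in ("pdf", "svg"):
    # Compare two independent exports of the same plot.
    print(fmt, export_digest(fmt) == export_digest(fmt))
```

The same comparison works on files on disk, which is handy before committing a re-exported figure.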

Avoiding Type 3 fonts

This may not apply to all publishers, but at least USENIX and ACM papers require final PDFs to not have any Type 3 fonts. They want outline fonts such as Type 1 or TrueType instead. So, when we export our plots into PDF files and embed them into our LaTeX document, it’s better to make sure they don’t contain any Type 3 fonts in the first place.

import matplotlib as mpl
mpl.rcParams["pdf.fonttype"] = 42  # embed TrueType (Type 42) fonts in PDF output
mpl.rcParams["ps.fonttype"] = 42   # same for PostScript output

Unlike svg.hashsalt, where 42 was just an arbitrary value, 42 for the font type actually means TrueType fonts (Type 42). The default is 3, which means Type 3 fonts.

An `mplstyle` file

You will definitely forget all of these next time. Instead of trying to remember, put everything in a style file like this (call it something like paper.mplstyle):

# Fonts
font.size : 9

# SVG export now doesn't render font as path, but saves fonts as text objects.
# This allows for easier integration with MS Office.
svg.fonttype : none
# Make SVG generation deterministic
svg.hashsalt : 42

# Avoid type 3 font usage
pdf.fonttype : 42
ps.fonttype : 42

Then, import the file with a one-liner:

import matplotlib.pyplot as plt
plt.style.use("./paper.mplstyle")
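
If you only want the style for some figures, a context manager keeps it scoped. The sketch below writes a trimmed-down copy of the style file to a temporary directory just so that it is self-contained. In practice you would point at your real paper.mplstyle; the helper name load_style_settings is made up.

```python
import tempfile
from pathlib import Path

import matplotlib as mpl
import matplotlib.style

# Trimmed-down copy of the style file above, inlined to keep the sketch runnable.
STYLE_TEXT = """\
svg.fonttype : none
svg.hashsalt : 42
pdf.fonttype : 42
ps.fonttype : 42
"""

KEYS = ("svg.fonttype", "svg.hashsalt", "pdf.fonttype", "ps.fonttype")


def load_style_settings() -> dict:
    """Apply the style temporarily and report the values Matplotlib parsed."""
    with tempfile.TemporaryDirectory() as tmp:
        style_path = Path(tmp) / "paper.mplstyle"
        style_path.write_text(STYLE_TEXT)
        # The style is active only inside this block.
        with mpl.style.context(str(style_path)):
            return {key: mpl.rcParams[key] for key in KEYS}


settings = load_style_settings()
print(settings)
```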

Structured Slacking

2023-12-12T00:00:00-05:00

Sometimes, when I start playing a YouTube video on one of my monitors, I magically start working on the other monitor without paying attention to the video. Why do I do this?

I think it’s because it gives me a sense of scoring extra points in a game. When I hit play, I had already decided to spend some time chilling instead of working. So if I actually make progress at work during a block of time that was going to be wasted anyway, I feel like I’m going beyond a set goal. And that feels good.

However, the YouTube trick is not always useful because I can’t really accomplish challenging tasks that require full focus with a YouTube video on. That’s how I came to think of Structured Slacking, designed to exploit this fake sense of extra accomplishment.

It works like this. Sometimes, you don’t want to do anything today. Then think to yourself, “Okay, then let’s actually not do anything today. After all, I can use some rest from time to time. But, I’ll just do this very small thing today. It’s like an extra point I can score on a day off to feel good about myself.” Then suddenly you’re motivated to do that small thing. And once you get that started, you’ll probably continue working. That said, it’s also important to set up that tiny task to be the tip of the iceberg of a bigger set of tasks.

As many would guess from the name, Structured Slacking was also inspired by Structured Procrastination. The two are similar in that both aim to enhance the productivity of an individual and operate on top of self-deception. Deep inside your heart you know Structured Slacking is bullshit, but you manage to trick yourself into almost believing that you will take today off, and that you will actually stop working after accomplishing that small task. The difference is that Structured Procrastination haunts you with a pending important task that you’re procrastinating on, which is honestly not the best feeling. On the other hand, with Structured Slacking, you’re pretty chill. Of course you are – you’re taking the day off!

Using Microsoft python-type-stubs with Pyright

2023-09-17T00:00:00-04:00

Python type annotations allow static type checking, so that you can catch obvious errors like AttributeError: 'NoneType' object has no attribute ... in your editor. They also allow better code completion, because in many cases, type checking tools can infer the type of the object based on return type annotations of functions or methods. However, not all libraries (especially the ones that were created before Python type annotations became established) have type annotations.

That’s why type stubs exist. They have .pyi extensions and are like C headers, which only declare class, function, or method names and their parameter types without implementation. These type stubs do not have to be coupled with the actual library. So virtually anyone can just create type stubs for an existing library and ship it separately.
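
To make the C header analogy concrete, here is what a stub might look like. The module name geometry and everything declared in it are hypothetical. The point is only that bodies are replaced with `...` while the signatures carry the types.

```python
# geometry.pyi: hypothetical stub for an untyped module named "geometry".
# Only names and types are declared; every body is just "...".


class Circle:
    radius: float

    def __init__(self, radius: float) -> None: ...
    def area(self) -> float: ...


def scale(circle: Circle, factor: float) -> Circle: ...
```

A type checker that finds this file next to (or instead of) the real module uses these signatures for checking and completion, and never looks at the implementation.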

Microsoft’s pylance ships with type stubs bundled for popular libraries without native type annotations. But especially for large libraries type stubs cannot be perfect, so they have a repository called python-type-stubs to collaboratively work on creating type stubs together with the community, and these stubs are bundled together with pylance.

However, pylance is closed source, and is only available inside VS Code. As a Neovim person, I instead have to use the open-source version of pylance, which is pyright. However, by default, pyright doesn’t ship with pylance’s type stubs.

So the question is, how do I use python-type-stubs with pyright? It’s actually simple enough, but at the time of writing, there doesn’t seem to be a straightforward guide on this anywhere on the Internet.

Say you have a Python project proj managed with git.

Add python-type-stubs as a git submodule under the directory stubs:

$ cd proj
# Assuming you have GitHub SSH authentication set up.
$ git submodule add git@github.com:microsoft/python-type-stubs stubs

Then, point pyright to the stubs inside the submodule.

If you’re using pyproject.toml:

[tool.pyright]
stubPath = "./stubs/stubs"

If you’re not using pyproject.toml, you need to have pyrightconfig.json in the root of your workspace:

{
    "stubPath": "./stubs/stubs"
}

When you see glitches in the type stubs provided by python-type-stubs, just open a PR fixing the issue. When the PR gets merged, update the submodule to the latest upstream commit (e.g., git submodule update --remote).

Advisors are like GPT

2023-05-11T00:00:00-04:00

I mean, my advisor is not a GPT model of course. However, talking with my advisor is not like just talking to my friend or colleague. Efficiently getting advice from him requires a certain mental model of how he thinks and acts, and I realized that it’s sort of similar to prompting GPT models.

GPT is Stateless

ChatGPT will remember the details of the conversation in the same thread, but that’s only because it crams the entire conversation into its context window. Outside that window, it is basically stateless; when you start a new conversation with ChatGPT, it’ll have an empty context and won’t remember anything about your previous conversations.

Advisors are often also stateless. Not that they’re stateless intentionally or by design, but due to the sheer amount of things going on around them, it’s more convenient for students to assume them to be stateless and forget everything. It’s just like how you don’t ask ChatGPT, “Hey I think I asked you something about NP-Hardness in another thread last month, do you remember that?”

Context is Important

That’s why initializing GPT’s context right is important. You have probably had conversations with ChatGPT where you screwed up the initial description of your problem, and correcting ChatGPT’s understanding took more words than it would have taken to describe the problem right in the first place. In such cases you can just mutter “Crap.” and click ‘New Chat’ (because ChatGPT is stateless). However, unfortunately, that’s not so easy if you were talking to your advisor. Therefore, I try to make sure my advisor’s context is initialized with a concise and accurate picture of where my research is.

I think this is especially observable when I sometimes hear contradicting advice from my advisor. Not contradicting my opinion, but his own advice in the past. That’s probably because the contexts I gave to my advisor that led to those contradictory pieces of advice were inconsistent in some way. Therefore, usually my next action is to prompt my advisor further to find out if there are any misunderstandings, either in the previous meeting or this one.

Fine-Tuning GPT

I’ve been saying all along that advisors are stateless, but we all know that they’re not completely stateless all the time. So, I like to think that they take one fine-tuning step at the end of every meeting, and their learning rate depends on how interesting the meeting was (and also on other things that I can’t control). If I excite my advisor with some interesting observation or good result, they’re more likely to remember. Otherwise, they probably won’t remember what happened during the last meeting.

In that sense, I think it’s an effective strategy to present my advisor with a concise summary at the end of the meeting. That way they don’t have to summarize the entire meeting on their own for fine-tuning, but rather just directly use the takeaway messages I present. For that, I also try to set forth a couple of TODO bullets that are rooted in the core takeaways of this meeting and roughly represent what my advisor can expect for the next meeting.

The Importance of Mentoring as a PhD Student

2022-10-11T00:00:00-04:00

Two months ago, I began mentoring a Master’s student who reached out to my advisor asking for collaboration opportunities. I suggested an implementation project that aims to slightly expand the scope of my previous research work Zeus, and the student decided to go for it.

As a mentor, my task was to collaboratively design solutions with my mentee, answer technical questions, provide feedback on design decisions and code, and figure out what pieces of knowledge the student was missing and either provide study material or come up with Google search keywords. While I wasn’t doing any actual work, in that I wasn’t studying relevant technology or writing and testing code myself, the sequence of tasks was very difficult to perform satisfactorily because of the very fact that I wasn’t doing any actual work. That is, without a complete picture of what’s going on, I was supposed to have better foresight than my mentee about what would happen if we were to proceed in a certain direction, or my mentee risked hitting a hard wall. Eventually, when we officially merged the feature into Zeus, I felt very happy and proud that our collaboration worked out.

While the mentorship itself did not specifically push my ongoing research project forward, it indirectly helped with doing research, because I learned two important things from this experience.

First, I came to respect the weight of the advice my advisor gives me and understand the mental pressure of providing advice. My mentee and I were simply developing a moderately-sized software feature, and I could reasonably predict what would happen along the way and what things would look like in the end. Still, sometimes I wasn’t entirely sure about the advice I was giving to my mentee, but I anyway needed to at least look confident in order to give faith and motivation to my mentee. Moreover, it was obvious that if I ended up pointing my mentee to a wrong path, my mentee would face frustration.

What makes it more difficult for my advisor is that we are doing research, which is inherently uncertain; you never know if the hole you’re digging is your grave. Yet, PhD students expect their advisors to still provide advice that is roughly in the right direction. I have come to understand that not even the best professors can know with confidence that the way the student is headed is a good direction, and how incredibly good my advisor is in that he usually provides advice that ends up being correct.

Second, mentoring helped me ask better questions to my advisor. Recalling my mentoring experience, when my mentee asked for clarification about my suggestion, I sometimes failed to give good answers because I myself was operating based on logical guesses and inexplicable instincts. I figured this would be the case for my advisor, too. Thus, instead of interrogating my advisor with a bunch of why-questions, I started to instead ask questions such as “Is it because of A and B that you asked me to do X?”, essentially putting forth my guess of the reason behind his advice. Then, my advisor would answer whether or not he agrees with what I said, or sometimes even make better suggestions when the guess I provided sparked something, and both cases led to highly productive conversations.

All in all, before this mentorship experience, my perception of mentoring was that it’s only good for me if it ends with some nice tangible outcome in a reasonable amount of time. However, now I feel like the process of mentorship taught me a lot about the relationship between me and my advisor, and allowed me to improve the productivity of the meetings with my advisor. Moreover, mentoring itself was an extremely rewarding process, where I could interact with enthusiastic junior researchers.

On Staying Confident

2022-09-14T00:00:00-04:00

Last April, I submitted the manuscript of Zeus to the Spring deadline of NSDI. I was fairly confident about my research capability, since I managed to submit something that looks like a good paper by a very stringent deadline.

After a good three-week break, I came back to research in late May, exploring potential future directions. I suggested some ideas and directions during meetings with my advisor, and basically everything was killed. They were not just killed, but killed with very good reasons that seemed obvious in retrospect. Within several weeks, I lost so much confidence that I was afraid to talk with my advisor, although I knew logically that killing bad ideas in early stages only saves me time.

Then suddenly in mid-July, we were notified that Zeus was accepted to appear at NSDI. My mood swung to the other end of confidence, especially because this was our first submission of the paper to any conference. I began to prepare the camera-ready version, together with Zeus’s open source repository and homepage. Finding new research directions was paused for a week or two, since we wanted Zeus to be posted on arXiv as soon as possible and have it collect citations from early on.

While polishing Zeus, I was thinking about what could be a follow-up work of it. Then I came up with another idea, an idea I liked no less than all the dead ideas in June, and presented that to my advisor. And he said it was very good. Now I’m working on that idea, and just found last week that my hypothesis was true at least for a limited set of workloads, and the potential gains can be quite large.

Looking back starting from April, when I submitted Zeus to NSDI, me as a researcher did not change that much. Most of the time I was on vacation, or was pouring time into polishing Zeus. However, my level of confidence in terms of research capability fluctuated greatly, which didn’t make any sense. If me in April and me today are similar researchers, there is no reason to be sometimes confident and sometimes not. Rather, my confidence level should be determined by my best times, which is currently mid-July, when Zeus was accepted. External factors, for example whether my ideas are well accepted by people, do not change my potential to do great work.

Appending the PhD Mindset

2022-05-12T00:00:00-04:00

Say that your coworker makes this statement:

The pizza served at Mani’s is literally the best in the world. They’re so good.

What is the most appropriate answer?

Hmm, how do you quantify the goodness of a restaurant? Is Google Maps stars a sufficient metric?
No, “good” does not mean “the best”. You should always think about the exact meaning of words when you speak.
You can’t just make such a statement. Do you have an argument to back that?
Lol yeah.

After spending half a year as a PhD student, I am starting to understand what people mean when they say:

Getting a PhD is not about acquiring a set of technical skills, but rather a specific mindset.

I suppose there are many elements that constitute such a mindset, including but not limited to:

Excavating meaningful problems to solve and navigating the uncertain process of solving it (#1 above).
Communicating facts and arguments in precise language (#2 above).
Maintaining a critical view of relevant matter and accepting arguments after rigorous reasoning and observation (#3 above).

I believe many will agree that none of these is easy or quick to acquire. One must imbue oneself with principles and constantly self-reflect and self-correct. However, I think one should go one step further. One should make a conscious effort so that no existing mindset is completely replaced by the PhD mindset; the PhD mindset must only be appended to the list of existing mindsets. Then, one must distinguish, in a fine-grained manner, the situations where it is appropriate to apply the PhD mindset from those where it is not.

Life, at least partly, can be viewed as a multi-task learning problem. While acquiring new capabilities is important, one must make sure not to catastrophically forget other important things in the process, which may not always be easy especially when the new capability requires an immense amount of concentrated effort to learn. However, I believe that such an effort is meaningful in advancing one’s maturity as a person.

Halide: a language and compiler for image processing and deep learning

2020-04-15T00:00:00-04:00

Halide

Resources

https://halide-lang.org
https://github.com/halide/Halide
Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines (PLDI ’13)
Automatically Scheduling Halide Image Processing Pipelines (SIGGRAPH ’16)
Loop Transformations Leveraging Hardware Prefetching (CGO ’18)
Differentiable Programming for Image Processing and Deep Learning in Halide (SIGGRAPH ’18)
Schedule Synthesis for Halide Pipelines through Reuse Analysis (TACO ’19)
Learning to Optimize Halide with Tree Search and Random Programs (SIGGRAPH ’19)

Paper Summary

Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines

Motivation. Image processing pipelines are often graphs of different stencil computations with low arithmetic intensity and inherent data parallelism. This introduces complex tradeoffs involving locality, parallelism, and recomputation. Thus, hand-crafted code produced with tedious effort is often neither portable nor optimal.
Solution. Halide decouples the algorithm (what is computed?) from the schedule (when and where?). For each schedule, the compiler produces parallel vector code and measures its runtime. It then searches for the best schedule in the tradeoff space using a stochastic search based on a genetic algorithm.
Results. Generated code is an order of magnitude faster than its hand-crafted counterparts. Automatic scheduling is quite slow and lacks robustness.
Detail. Two-stage decision for determining the schedule of each function:
- Domain Order: the order in which the required region is traversed
  - sequential/parallel, unrolled/vectorized, dimension reorder, dimension split
- Call Schedule: when to compute its inputs; the granularity of store and computation
  - breadth-first/total fusion/sliding window
Detail. Compile steps (all decisions directed by the schedule):
- Lowering and Loop Synthesis: create nested loops of the entire process, insert allocations and callee computations at specified locations in the loop
- Bounds Inference: from the output size, the bounds of each dimension are determined
- Sliding Window Optimization and Storage Folding: look for specific conditions and apply
- Flattening: flatten multi-dimensional addressing and allocation
- Vectorization and Unrolling
- Back-end Code Generation - only note GPU:
  - outer loop → inner loops divided into GPU kernel launches
  - inner loops are annotated in the schedule with block and thread dimensions
Detail. Stochastic search based on genetic algorithm
- Hint hand-crafted optimization styles through mutation rules. These include mutating one or more function schedules to a well-known template.
Thoughts.
- The increase in performance is natural, since Halide invests a lot of time in optimization. The real contribution seems to be that Halide formulated the axes of optimization and exposed an easy handle that helps users search the space.
- Generated CUDA kernels don’t seem to use CUDA streams or asynchronous copies.
- Requires block and thread annotations provided by the programmer.
- Without the hand-crafted mutation, I suspect that performance will greatly suffer.
- Schedule search could be learned. Monte Carlo tree search maybe? RL will work too, as in NAS.

Differentiable Programming for Image Processing and Deep Learning in Halide

Motivation. Existing deep learning libraries are inefficient in terms of computation and memory. Also, in order to implement custom operations, the user must manually provide both the forward and backward CUDA kernels.
Solution. Extend Halide with automatic differentiation (propagate_adjoints).
Results. GPU tensor operations 0.8x, 2.7x, and 20x faster than PyTorch, measured with batch size 4.
Detail. Two special cases of note when creating backward operations:
- Scatter-gather Conversion: When the forward of a function is a gather operation, its backward is a scatter, e.g. convolutions. This leads to race conditions when parallelized. Thus, the scatter operation is converted to a gather operation.
- Handling Partial Updates: When a function is partially updated, dependency is removed for some indices. If two consecutive function updates have different update arguments, the former’s gradient is masked to zero using the update argument of the latter.
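
The scatter-to-gather conversion can be illustrated with a 1D convolution in plain Python. This is my own toy sketch, not Halide’s implementation, and all function names are made up. The forward pass gathers from the input; the naive backward pass scatters into the input gradient (several iterations write to the same slot, hence the race when parallelized); the converted backward pass has each gradient element gather its own contributions.

```python
def conv_forward(signal, kernel):
    """Forward pass (gather): each output reads a window of the input."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + k] * kernel[k] for k in range(len(kernel)))
        for i in range(n_out)
    ]


def grad_input_scatter(grad_out, kernel, n_in):
    """Backward pass as a scatter: loops over outputs, writes into inputs."""
    grad_in = [0] * n_in
    for i, g in enumerate(grad_out):
        for k, w in enumerate(kernel):
            grad_in[i + k] += g * w  # different (i, k) pairs hit the same slot
    return grad_in


def grad_input_gather(grad_out, kernel, n_in):
    """Backward pass as a gather: each input slot sums what it needs."""
    grad_in = []
    for j in range(n_in):
        total = 0
        for k, w in enumerate(kernel):
            i = j - k  # the output index that read input j through tap k
            if 0 <= i < len(grad_out):
                total += grad_out[i] * w
        grad_in.append(total)  # exactly one write per slot, so no race
    return grad_in


kernel = [1, 2, 3]
grad_out = [1, 0, -1, 2]
n_in = len(grad_out) + len(kernel) - 1
print(grad_input_scatter(grad_out, kernel, n_in))
print(grad_input_gather(grad_out, kernel, n_in))
```

Both functions produce the same gradient; only the gather version can be parallelized over j without atomics.
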
Detail. Checkpointing is already supported but in a more fine-grained manner through schedules: compute_root for checkpointing, compute_inline for recomputation, and compute_at is something in between, e.g. tiling.
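
The checkpoint versus recompute tradeoff can be seen in a toy two-stage pipeline. The sketch below is plain Python that only mimics the effect of compute_root and compute_inline; it does not use Halide, and run_pipeline is a made-up name.

```python
def run_pipeline(n, inline):
    """Evaluate consumer(x) = producer(x-1) + producer(x) + producer(x+1)."""
    evaluations = 0

    def producer(x):
        nonlocal evaluations
        evaluations += 1
        return x * x

    if inline:
        # compute_inline-like: store nothing, recompute the producer per use.
        stored = 0
        output = [producer(x - 1) + producer(x) + producer(x + 1) for x in range(n)]
    else:
        # compute_root-like: materialize the whole producer first (a checkpoint).
        table = {x: producer(x) for x in range(-1, n + 1)}
        stored = len(table)
        output = [table[x - 1] + table[x] + table[x + 1] for x in range(n)]
    return output, evaluations, stored


for inline in (True, False):
    _, evaluations, stored = run_pipeline(8, inline)
    print(f"inline={inline}: {evaluations} producer calls, {stored} values stored")
```

Inlining triples the producer work but needs no storage, while the root version computes each value once at the cost of a buffer. Tiling with compute_at sits between the two.
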
Detail. Automatic scheduling (only note GPU, ordered by high priority)
1. For all scatter/reduce operations, always checkpoint them and tile the first two dimensions and parallelize computation over tiles. Other types of operations are not checkpointed at all.
2. Apply rfactor for large associative reductions with domains too small to tile.
3. If parallelizing inevitably leads to race conditions, use atomic operations and parallelize.
Thoughts.
- Again, automatic scheduling could be better. The scheduler in this work is filled with hand-crafted heuristics.
- The paper doesn’t talk about the time needed for automatic scheduling. Probably it took pretty long. Then we can’t use this for deep learning research; training just a single hyperparameter configuration is already burdensome. Deployment has some hope though.
- The ‘deep learning operations’ this paper conducted experiments on (grid_sample, affine_grid, optical flow warp, and bilateral slicing) are relatively uncommon compared with matrix multiplication or convolution. This aligns with their claim that Halide is advantageous when you have to implement custom operations.

Learning to Optimize Halide with Tree Search and Random Programs

Motivation. Existing autoschedulers are limited because 1) their search space is small, 2) their search procedures are coupled with the schedule type, and 3) their cost models are inaccurate and hand-crafted.
Solution. Use 1) a new parametrization of the schedule space, 2) beam search, and 3) a learned cost model trained on randomly generated programs.
Results. Deep learning benchmarks on GPU were not reported at all! Those on CPU with image size 1 x 3 x 2560 x 1920 are claimed to outperform TF and PT and be competitive with MXNet + MKL, but the paper mentions no concrete numbers.
Detail. Parameters of the schedule. Beginning from the final stage, make two decisions per stage to build a complete schedule:
1. Compute and storage granularity of new stage. An existing stage can be split, creating an extra level of tiling. Tile sizes are also parameters that should be determined.
2. For the newly added stage, we may parallelize outer tilings and/or vectorize inner tilings and annotate.
Detail. Beam search with pruning (just kill schedules that fail hand-crafted asserts). Run multiple passes that gradually select good schedules from coarse to fine.
Detail. Predicting runtime, which beam search minimizes, with a neural network.
1. Schedule to feature: algorithm-specific + schedule-specific
2. Runtime prediction: design 27 runtime-related terms and have a small model predict the coefficients of each term, use L2 loss between predicted and target throughput
3. Training data generation: use the system itself, iterate between training the model and generating data with the system
Detail. Given more time, benchmark several candidates (instead of predicting runtime) and select best. Given even more time, fine-tune the neural network on the benchmark results and repeat beam search (autotuning).
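
The search loop described above can be sketched as a generic beam search. This is a toy under loud assumptions: a schedule is reduced to one hypothetical tile size per stage, and predicted_runtime is a hand-made stand-in with two fixed coefficients, not the paper’s 27-term learned model.

```python
from typing import Callable, Sequence, Tuple

Schedule = Tuple[int, ...]  # hypothetical: one tile size per pipeline stage


def predicted_runtime(schedule: Schedule) -> float:
    """Toy cost model: a weighted sum of hand-made runtime terms."""
    coefficients = (1.0, 0.5)  # fixed here; the paper predicts them with a model
    terms = (
        sum(abs(tile - 32) for tile in schedule),  # distance from a sweet spot
        sum(a != b for a, b in zip(schedule, schedule[1:])),  # mismatched neighbors
    )
    return sum(c * t for c, t in zip(coefficients, terms))


def beam_search(
    num_stages: int,
    choices: Sequence[int],
    cost: Callable[[Schedule], float],
    beam_width: int,
) -> Schedule:
    """Extend partial schedules one stage at a time, keeping the cheapest few."""
    beam = [()]
    for _ in range(num_stages):
        candidates = [partial + (choice,) for partial in beam for choice in choices]
        candidates.sort(key=cost)
        beam = candidates[:beam_width]  # everything else is discarded
    return beam[0]


best = beam_search(3, (8, 16, 32, 64), predicted_runtime, beam_width=2)
print(best, predicted_runtime(best))
```

Swapping predicted_runtime for real benchmarking of the final candidates corresponds to the "given more time" mode.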
Thoughts.
- A loop nest is a graph. Can we use graph embedding & pooling on schedules to predict runtime?
- No comparisons with deep learning frameworks on GPUs. Maybe I have to check this myself.
- This paper seems just to incorporate tremendous amounts of manual hand-crafted optimizations and tedious engineering. I cannot find any core novel ideas in this paper; I don’t think there’s anything new.

Code Peek

#include "Halide.h"          // all of Halide

int main() {

  // Symbolic definition of the algorithm 'index_sum'.
  Halide::Var x, y;          // think of these as for loop iterators
  Halide::Func index_sum;    // each Func represents one pipeline stage
  index_sum(x, y) = x + y;   // operation defined in an arbitrary point

  // Manually schedule our algorithm.
  Halide::Var x_outer, x_inner, y_outer, y_inner,  // divide loop into tiles
              tile_index,                          // fuse and parallelize
              x_inner_outer, y_inner_outer,        // tile each tile again
              x_vectors, y_pairs;                  // vectorize and unroll
  index_sum
    // tile with size (64, 64)
    .split(x, x_outer, x_inner, 64)
    .split(y, y_outer, y_inner, 64)
    .reorder(x_inner, y_inner, x_outer, y_outer)
    // fuse the two outer loops and parallelize
    .fuse(x_outer, y_outer, tile_index)
    .parallel(tile_index)
    // tile with size (4, 2), use shorthand this time!
    .tile(x_inner, y_inner, x_inner_outer, y_inner_outer, x_vectors, y_pairs, 4, 2)
    // vectorize over x_vectors (vector length is 4)
    .vectorize(x_vectors)
    // unroll loop over y_pairs (2 duplications)
    .unroll(y_pairs);

  // Run the algorithm. Loop bounds are automatically inferred by Halide!
  Halide::Buffer<int> result = index_sum.realize(350, 250);

  // Print nested loop in pseudo-code.
  index_sum.print_loop_nest();

  return 0;
}

$ g++ peek.cpp -g -I ../include -L ../bin -lHalide -lpthread -ldl -o peek -std=c++11
$ LD_LIBRARY_PATH=../bin ./peek
produce index_sum:
  parallel x.x_outer.tile_index:
    for y.y_inner.y_inner_outer:
      for x.x_inner.x_inner_outer:
        unrolled y.y_inner.y_pairs in [0, 1]:
          vectorized x.x_inner.x_vectors in [0, 3]:
            index_sum(...) = ...
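
To see that the schedule changes only the loop order and not the result, the printed loop nest can be emulated in plain Python and compared against the naive loops. This is my own sketch, not Halide code. In particular, shifting the last tile inward to stay in bounds is an assumption of the sketch, and it requires the output to be at least one tile wide and tall.

```python
def index_sum_naive(width, height):
    """The algorithm alone: out(x, y) = x + y, in the default loop order."""
    return [[x + y for x in range(width)] for y in range(height)]


def index_sum_tiled(width, height, tile=64, vec=4, pairs=2):
    """Same algorithm, traversed in the order of the scheduled loop nest."""
    out = [[None] * width for _ in range(height)]
    x_tiles = -(-width // tile)  # ceiling division
    y_tiles = -(-height // tile)
    for tile_index in range(x_tiles * y_tiles):  # fused loop (parallel in Halide)
        x_outer = tile_index % x_tiles
        y_outer = tile_index // x_tiles
        # Shift the last tile inward so that it never runs past the edge.
        x_base = min(x_outer * tile, width - tile)
        y_base = min(y_outer * tile, height - tile)
        for y_inner_outer in range(tile // pairs):
            for x_inner_outer in range(tile // vec):
                for y_pairs in range(pairs):  # unrolled in Halide
                    for x_vectors in range(vec):  # vectorized in Halide
                        x = x_base + x_inner_outer * vec + x_vectors
                        y = y_base + y_inner_outer * pairs + y_pairs
                        out[y][x] = x + y
    return out


print(index_sum_tiled(350, 250) == index_sum_naive(350, 250))
```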

The autoencoder family

2019-01-31T00:00:00-05:00

Vanilla autoencoders (AE), denoising autoencoders (DAE), variational autoencoders (VAE), and conditional variational autoencoders (CVAE) are explained in this post. Referring to the previous post on Bayesian statistics may help your understanding.

Autoencoders (AE)

Structure

As seen in the above structure, autoencoders have the same input and output size. Ultimately, we want the output to be the same as the input. We penalize the difference of the input \(x\) and the output \(y\).

We can formulate the simplest autoencoder (with a single fully connected layer at each side) as:

\[x, y \in [0,1]^d\] \[z = h_\theta(x) = \text{sigmoid}(Wx+b) ~~~ (\theta = \{W, b\})\] \[y = g_{\theta^\prime}(z) = \text{sigmoid}(W^\prime z+b^\prime) ~~~ (\theta^\prime = \{W^\prime, b^\prime\})\]

Since we want \(x=y\), we get the following optimization problem:

\[\theta^*, \theta^{\prime *} = \underset{\theta, \theta^\prime}{\text{argmin}} \frac{1}{N} \sum_{i=1}^N l(x^{(i)}, y^{(i)})\]

The \(l(x,y)\) is the loss function, which calculates the difference between \(x\) and \(y\). We can use square error or cross-entropy, which are written as:

\[l(x, y) = \Vert x-y \Vert^2\] \[l(x, y) = - \sum_{k=1}^d [x_k \log(y_k) + (1-x_k)\log(1-y_k)]\]

We will use cross-entropy error, which we will specially denote as \(l(x, y) = L_H(x, y)\).
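
The formulas above translate almost line by line into code. Below is a minimal pure-Python sketch of the forward pass and both losses. The layer sizes, the random initialization, and all helper names are assumptions made for illustration, and no training is shown.

```python
import math
import random


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def affine(weights, vector, bias):
    """Compute weights @ vector + bias for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def encode(x, W, b):
    """z = h_theta(x) = sigmoid(Wx + b)"""
    return sigmoid(affine(W, x, b))


def decode(z, W_prime, b_prime):
    """y = g_theta'(z) = sigmoid(W'z + b')"""
    return sigmoid(affine(W_prime, z, b_prime))


def squared_error(x, y):
    return sum((xk - yk) ** 2 for xk, yk in zip(x, y))


def cross_entropy(x, y):
    return -sum(xk * math.log(yk) + (1 - xk) * math.log(1 - yk) for xk, yk in zip(x, y))


rng = random.Random(0)
d, hidden = 4, 2  # input size and latent size, chosen arbitrarily
W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
b = [0.0] * hidden
W_prime = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(d)]
b_prime = [0.0] * d

x = [0.0, 1.0, 1.0, 0.0]
y = decode(encode(x, W, b), W_prime, b_prime)
print(squared_error(x, y), cross_entropy(x, y))
```

Training would then adjust W, b, W', and b' to push either loss toward zero over the whole dataset.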

Statistical viewpoint

We can view this loss function in terms of expectation:

\[\theta^*, \theta^{\prime *} = \underset{\theta, \theta^\prime}{\text{argmin}} \mathbb{E}_{q^0(X)}[L_H(X, g_{\theta^\prime}(h_\theta(X)))]\]

where \(q^0(X)\) denotes the empirical distribution associated with our \(N\) training examples.

Denoising Autoencoders (DAE)

Structure

With the encoder and decoder formulas unchanged, denoising autoencoders intentionally set a fraction of the pixels of the input \(x\) to zero, creating \(\tilde{x}\). Formally, we are sampling \(\tilde{x}\) from a stochastic mapping \(q_D(\tilde{x}\vert x)\). We then compute the loss between the original \(x\) and the output \(y\).

In formulating our objective function, we cannot reuse that of the vanilla autoencoder, since \(g_{\theta^\prime}(h_\theta(\tilde{x}))\) is now a deterministic function of \(\tilde{x}\), not \(x\). Thus we need to take into account the connection between \(\tilde{x}\) and \(x\), which is \(q_D(\tilde{x}\vert x)\). Then we can write our optimization problem and expand it as:

\[\begin{aligned} \theta^*,\theta^{\prime *} &= \underset{\theta, \theta^\prime}{\text{argmin}} \mathbb{E}_{q^0(X, \tilde{X})}[L_H(X, g_{\theta^\prime}(h_\theta(\tilde{X})))]\\ &= \underset{\theta, \theta^\prime}{\text{argmin}} \frac{1}{N} \sum_{x\in D} \mathbb{E}_{q_D(\tilde{x}\vert x)}[L_H(x, g_{\theta^\prime}(h_\theta(\tilde{x})))]\\ &\approx \underset{\theta, \theta^\prime}{\text{argmin}}\frac{1}{N} \sum_{x\in D} \frac{1}{L} \sum_{i=1}^L L_H(x, g_{\theta^\prime}(h_\theta(\tilde{x}_i))) \end{aligned}\]

where \(q^0(X, \tilde{X}) = q^0(X)q_D(\tilde{X}\vert X)\). Since we cannot compute the expectation in the second line, we approximate it with the Monte Carlo technique by drawing \(L\) samples and computing their mean loss.
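The Monte Carlo approximation can be sketched as follows. This is only an illustration under a few assumptions: the corruption \(q_D\) zeroes each pixel independently with probability `drop_prob`, and `toy_reconstruct` is a hypothetical stand-in for the trained network \(g_{\theta^\prime}(f_\theta(\cdot))\).

```python
import math
import random

def corrupt(x, drop_prob, rng):
    """Sample x_tilde ~ q_D(x_tilde | x): zero each pixel with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xk for xk in x]

def cross_entropy(x, y, eps=1e-12):
    """Cross-entropy L_H(x, y); eps is an assumed guard against log(0)."""
    total = 0.0
    for xk, yk in zip(x, y):
        yk = min(max(yk, eps), 1.0 - eps)
        total += xk * math.log(yk) + (1.0 - xk) * math.log(1.0 - yk)
    return -total

def toy_reconstruct(x_tilde):
    """Hypothetical stand-in for g(f(x_tilde)); maps pixels into (0, 1)."""
    return [0.1 + 0.8 * v for v in x_tilde]

def dae_loss(dataset, reconstruct, drop_prob=0.3, num_samples=5, seed=0):
    """Average over the dataset of the mean loss over num_samples corruptions."""
    rng = random.Random(seed)
    total = 0.0
    for x in dataset:
        losses = [
            cross_entropy(x, reconstruct(corrupt(x, drop_prob, rng)))
            for _ in range(num_samples)
        ]
        total += sum(losses) / num_samples
    return total / len(dataset)

dataset = [[1.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
loss = dae_loss(dataset, toy_reconstruct)
```

Note that the loss is always computed against the clean \(x\), while the network only ever sees the corrupted \(\tilde{x}\).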

Variational Autoencoders (VAE)

Structure

VAEs have the same overall network structure as AEs: an encoder that calculates the latent variable \(z\) and a decoder that generates the output image \(y\). As with AEs, we train both networks so that the output image matches the input image. What differs is the goal. The goal of an autoencoder is to generate the best feature vector \(z\) from an image, whereas the goal of a variational autoencoder is to generate realistic images from the vector \(z\).

The network structures of AEs and VAEs are not exactly the same, however. The encoder of an AE directly calculates the latent variable \(z\) from the input. On the other hand, the encoder of a VAE calculates the parameters of a Gaussian distribution (\(\mu\) and \(\sigma\)), from which we then sample \(z\). The same holds for the decoder: AEs output the image itself, whereas VAEs output the parameters of a distribution over image pixels. Let us put this more formally.

Encoder
Let a standard normal distribution \(p(z)\) be the prior distribution of latent variable \(z\). Given an input image \(x\), we have our encoder network calculate the posterior distribution \(p(z \vert x)\). Then we sample our latent variable \(z\) from the posterior distribution.
Decoder
Given a latent variable \(z\), the likelihood of our decoder outputting \(x\) (the input image) is \(p(x \vert z)\). We usually model this as a multivariate Bernoulli distribution in which each pixel of the image corresponds to one dimension.

The Optimization Problem

We want to sample \(z\) from the posterior \(p(z \vert x)\), which can be expanded with Bayes' rule.

\[p(z \vert x) = \frac{p(x \vert z)p(z)}{p(x)}\]

However, the evidence \(p(x) = \int p(x \vert z ) p(z) dz\) is intractable since we need to integrate over all possible \(z\). Thus, instead of calculating the posterior \(p(z \vert x)\) exactly, we approximate it with a Gaussian distribution \(q_\lambda (z \vert x)\). This is called variational inference.

Since we want the two distributions \(q_\lambda (z \vert x)\) and \(p(z \vert x)\) to be similar, we adopt the Kullback-Leibler Divergence and try to minimize it with respect to parameter \(\lambda\).

\[\begin{aligned} D_{KL}(q_\lambda(z \vert x) \vert \vert p(z \vert x)) &= \int_{-\infty}^{\infty} q_\lambda (z \vert x)\log \left( \frac{q_\lambda (z \vert x)}{p(z \vert x)} \right) dz\\ &=\mathbb{E}_q\left[ \log(q_\lambda (z \vert x)) \right] - \mathbb{E}_q \left[ \log (p(z \vert x)) \right] \\ &=\mathbb{E}_q\left[ \log(q_\lambda (z \vert x)) \right] - \mathbb{E}_q \left[ \log (p(z, x)) \right] + \log(p(x))\\ \end{aligned}\]

The problem here is that the intractable \(p(x)\) term is still present. Now let us write the above equation in terms of \(\log(p(x))\).

\[\log(p(x)) = D_{KL}(q_\lambda(z \vert x) \vert \vert p(z \vert x)) + \text{ELBO}(\lambda)\]

where

\[\text{ELBO}(\lambda) = \mathbb{E}_q \left[ \log (p(z, x)) \right] - \mathbb{E}_q\left[ \log(q_\lambda (z \vert x)) \right]\]

The KL divergence is always non-negative, so \(\log(p(x)) \geq \text{ELBO}(\lambda)\): the ELBO is a lower bound on the log evidence, which explains the abbreviation, Evidence Lower BOund. Since \(\log(p(x))\) does not depend on \(\lambda\), minimizing the KL divergence with respect to \(\lambda\) is equivalent to maximizing the ELBO with respect to \(\lambda\). This can also be understood as pushing up a lower bound on the evidence \(p(x)\), since we want to maximize the probability of getting the exact input image from the output.
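To make the decomposition concrete, here is a small numerical check in plain Python. It assumes a toy model with a binary latent variable, so that the evidence is a two-term sum rather than an intractable integral; all probabilities are made up for illustration.

```python
import math

# Toy model with a binary latent z (all numbers are illustrative assumptions).
prior = [0.5, 0.5]        # p(z)
likelihood = [0.2, 0.7]   # p(x | z) for one fixed observation x
q = [0.3, 0.7]            # an arbitrary approximation q(z | x)

# Evidence p(x) = sum_z p(x | z) p(z), tractable here because z is discrete.
evidence = sum(pz * px for pz, px in zip(prior, likelihood))

# Exact posterior p(z | x) from Bayes' rule.
posterior = [pz * px / evidence for pz, px in zip(prior, likelihood)]

# KL(q || p(z | x)) and ELBO = E_q[log p(z, x)] - E_q[log q(z | x)].
kl = sum(qz * math.log(qz / pzx) for qz, pzx in zip(q, posterior))
elbo = sum(
    qz * (math.log(pz * px) - math.log(qz))
    for qz, pz, px in zip(q, prior, likelihood)
)

gap = math.log(evidence) - (kl + elbo)   # zero up to rounding
```

Changing `q` moves `kl` and `elbo` in opposite directions while their sum stays at \(\log(p(x))\), which is why maximizing the ELBO minimizes the KL divergence.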

ELBO

Let’s inspect the \(\text{ELBO}\) term. Since no two input images share the same latent variable \(z\), we can write \(\text{ELBO}_i (\lambda)\) for a single input image \(x_i\).

\[\begin{aligned} \text{ELBO}_i (\lambda) &= \mathbb{E}_q \left[ \log (p(z, x_i)) \right] - \mathbb{E}_q\left[ \log(q_\lambda (z \vert x_i)) \right] \\ &= \int \log(p(z, x_i)) q_\lambda(z \vert x_i) dz - \int \log(q_\lambda(z \vert x_i))q_\lambda(z \vert x_i) dz \\ &= \int \log(p(x_i \vert z)p(z)) q_\lambda(z \vert x_i) dz - \int \log(q_\lambda(z \vert x_i))q_\lambda(z \vert x_i) dz \\ &= \int \log(p(x_i \vert z)) q_\lambda(z \vert x_i) dz - \int q_\lambda(z \vert x_i) \log\left(\frac{q_\lambda(z \vert x_i)}{p(z)}\right)dz \\ &= \mathbb{E}_q \left[ \log (p(x_i \vert z)) \right] - D_{KL}(q_\lambda(z \vert x_i) \vert \vert p(z)) \end{aligned}\]

Now shifting our attention back to the network structure, our encoder network calculates the parameters of \(q_\lambda(z \vert x_i)\), and our decoder network calculates the likelihood \(p(x_i \vert z)\). Thus we can rewrite the above results so that the parameters match those of the autoencoder described above.

\[\text{ELBO}_i(\phi, \theta) = \mathbb{E}_{q_\phi} \left[ \log(p_\theta(x_i \vert z)) \right] - D_{KL}(q_\phi(z \vert x_i) \vert \vert p(z))\]

Negating \(\text{ELBO}_i(\phi, \theta)\), we obtain our loss function for sample \(x_i\).

\[l_i(\phi, \theta) = -\text{ELBO}_i(\phi, \theta)\]

Thus our optimization problem becomes

\[\phi^*, \theta^* = \underset{\phi, \theta}{\text{argmin}} \sum_{i=1}^N \left[ -\mathbb{E}_{q_\phi} \left[ \log(p_\theta(x_i \vert z)) \right] + D_{KL}(q_\phi(z \vert x_i) \vert \vert p(z)) \right]\]

Understanding the loss function

\[l_i(\phi, \theta) = -\underline{\mathbb{E}_{q_\phi} \left[ \log(p_\theta(x_i \vert z)) \right]} + \underline{D_{KL}(q_\phi(z \vert x_i) \vert \vert p(z))}\]

The first underlined part (excluding the negative sign) is to be maximized. This is called the reconstruction loss: how similar the reconstructed image is to the input image. For each latent variable \(z\) we sample from the approximated posterior \(q_\phi(z \vert x_i)\), we calculate the log-likelihood of the decoder producing \(x_i\). Thus maximizing this term is equivalent to the maximum likelihood estimation.

The second term is the Kullback-Leibler Divergence between the approximated posterior \(q_\phi(z \vert x_i)\) and the prior \(p(z)\). This acts as a regularizer, forcing the approximated posterior to be similar to the prior distribution, which is a standard normal distribution.

The figure above plots the 2-dimensional latent variables of 500 test images for an AE and a VAE. As you can see, the distribution of the latent variables of the VAE is close to the standard normal distribution, which is due to the regularizer. This property is useful because we can simply sample a vector \(z\) from the standard normal distribution and feed it to the decoder network to generate a reasonable image, which is exactly what VAEs were intended for: generation.

Calculating the loss function

To train our VAE, we should be able to calculate the loss. Let’s start with the regularizer term.

We create our encoder network such that it calculates the mean and standard deviation of \(q_\phi(z \vert x_i)\). We then sample vector \(z\) from this Multivariate Gaussian distribution: \(z \sim \mathcal{N}(\mu, \sigma^2 I)\).

The KL divergence between two normal distributions is known. We can calculate the regularizer term as:

\[D_{KL}(q_\phi(z \vert x_i) \vert \vert p(z)) = \frac{1}{2}\sum_{j=1}^J \left( \mu_{i,j}^2 + \sigma_{i,j}^2 - \log(\sigma_{i,j}^2)-1\right)\]
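A minimal sketch of this closed form in plain Python, assuming the encoder hands us the mean and standard deviation as two lists of length \(J\) (the function name is my own):

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + s * s - math.log(s * s) - 1.0
        for m, s in zip(mu, sigma)
    )

kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])   # posterior equals prior
kl_shifted = kl_to_standard_normal([1.0], [1.0])          # mean shifted by one
```

The term vanishes exactly when the approximated posterior equals the prior, and grows as the mean drifts from zero or the standard deviation drifts from one.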

Now let’s look at the reconstruction loss term. To calculate the log-likelihood of our image \(\log(p_\theta(x_i \vert z))\), we should choose how to model our output. We have two choices.

Multivariate Bernoulli Distribution

This is often reasonable for black and white images like those from MNIST. We binarize the training and testing images with threshold 0.5. We can implement this easily with PyTorch:
```
image = (image >= 0.5).float()
```
Each output of the decoder corresponds to a single pixel of the image, denoting the probability of the pixel being white. Then we can use the Bernoulli probability mass function \(f(x_{i,j};p_{i,j}) = p_{i,j}^{x_{i,j}} (1-p_{i,j})^{1-x_{i,j}}\) as our likelihood.
\[\begin{aligned} \log p(x_i \vert z) &= \sum_{j=1}^D \log(p_{i,j}^{x_{i,j}} (1-p_{i,j})^{1-x_{i,j}}) \\ &= \sum_{j=1}^D \left[x_{i,j} \log(p_{i,j}) + (1-x_{i,j})\log(1-p_{i,j}) \right] \end{aligned}\]
This is the negative of the cross-entropy loss, so maximizing this log-likelihood is equivalent to minimizing the cross-entropy.
Multivariate Gaussian Distribution

The probability density function of a Gaussian distribution is as follows.
\[f(x_{i,j};\mu_{i,j}, \sigma_{i,j}) = \frac{1}{\sqrt{2\pi\sigma_{i,j}^2}}e^{-\frac{(x_{i,j}-\mu_{i,j})^2}{2\sigma_{i,j}^2}}\]
Using this in our likelihood,
\[\log p(x_i \vert z) = -\sum_{j=1}^D \left[ \frac{1}{2}\log(2\pi\sigma_{i,j}^2)+\frac{(x_{i,j}-\mu_{i,j})^2}{2\sigma_{i,j}^2} \right]\]
Notice that if we fix \(\sigma_{i,j} = 1\), the log-likelihood reduces to the negative squared error, up to a factor of \(\frac{1}{2}\) and an additive constant.
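The relationship to the squared error can be checked numerically. The sketch below is plain Python with a function name of my own choosing, and it keeps the constant \(\frac{1}{2}\log(2\pi)\) term of the Gaussian density, which does not depend on the network parameters.

```python
import math

def gaussian_log_likelihood(x, mu, sigma):
    """Sum over pixels of log N(x_j; mu_j, sigma_j^2)."""
    return -sum(
        0.5 * math.log(2.0 * math.pi * s * s) + (xj - mj) ** 2 / (2.0 * s * s)
        for xj, mj, s in zip(x, mu, sigma)
    )

x = [0.2, 0.9]           # toy target pixels
mu = [0.0, 1.0]          # toy decoder means
sigma = [1.0, 1.0]       # fixed standard deviation of one

log_lik = gaussian_log_likelihood(x, mu, sigma)
sse = sum((xj - mj) ** 2 for xj, mj in zip(x, mu))
constant = 0.5 * len(x) * math.log(2.0 * math.pi)
```

With \(\sigma_{i,j} = 1\), `log_lik` equals `-0.5 * sse - constant`, so maximizing the likelihood is the same as minimizing the squared error.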

Now that we have the likelihood \(p_\theta(x_i \vert z)\), we can look at the whole reconstruction loss term. Unfortunately, the expectation is difficult to compute since it takes into account every possible \(z\). So we use a Monte Carlo approximation of the expectation: we sample \(L\) latent variables \(z_l\) from \(q_\phi(z \vert x_i)\) and take their mean log-likelihood.

\[\mathbb{E}_{q_\phi} \left[ \log p_\theta(x_i \vert z) \right] \approx \frac{1}{L} \sum_{l=1}^L \log p_\theta(x_i \vert z_l )\]

For convenience, we use \(L = 1\) in implementation.
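Here is a rough sketch of the Monte Carlo estimate in plain Python. `toy_decoder` is a hypothetical stand-in for the decoder network with Bernoulli outputs, and the `eps` clamp is an added guard against \(\log(0)\); neither comes from the derivation above.

```python
import math
import random

def bernoulli_log_likelihood(x, p, eps=1e-12):
    """log p(x | z) for binary pixels x and pixel probabilities p."""
    total = 0.0
    for xj, pj in zip(x, p):
        pj = min(max(pj, eps), 1.0 - eps)  # assumed guard against log(0)
        total += xj * math.log(pj) + (1.0 - xj) * math.log(1.0 - pj)
    return total

def toy_decoder(z):
    """Hypothetical decoder: fixed logits squashed by a sigmoid."""
    logits = [z[0], z[1], z[0] - z[1]]
    return [1.0 / (1.0 + math.exp(-a)) for a in logits]

def reconstruction_term(x, mu, sigma, decoder, num_samples=1, seed=0):
    """Monte Carlo estimate of E_q[log p(x | z)] with z ~ N(mu, sigma^2 I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        total += bernoulli_log_likelihood(x, decoder(z))
    return total / num_samples

x = [1.0, 0.0, 1.0]
estimate = reconstruction_term(
    x, mu=[0.5, -0.5], sigma=[0.1, 0.1], decoder=toy_decoder
)
```

The default `num_samples=1` mirrors the \(L = 1\) choice; a larger value reduces the variance of the estimate at the cost of more decoder evaluations.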

Conditional Variational Autoencoders (CVAE)

Structure

The CVAE has the same structure and loss function as the VAE, but the input data is different. Notice that in VAEs, we never used the labels of our training data. If we have labels, why don’t we use them?

In conditional variational autoencoders, we concatenate the one-hot labels with the input images, and also with the latent variables. Everything else is the same.

Implications

What do we get by doing this? One good thing is that the latent variable no longer needs to encode the label of the input. It only needs to encode its style, that is, the class-invariant features of that image.

Then, we can concatenate any one-hot vector to generate an image of the intended class with the specific style encoded by the latent variable.
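The conditioning step amounts to plain concatenation. A minimal sketch, assuming flattened images stored as Python lists and using helper names of my own:

```python
def one_hot(label, num_classes):
    """One-hot encoding of an integer class label."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def condition(vector, label, num_classes):
    """Append the one-hot label to an image or latent vector."""
    return list(vector) + one_hot(label, num_classes)

image = [0.0, 1.0, 1.0, 0.0]   # flattened toy image
z = [0.3, -1.2]                # toy latent variable from the encoder

encoder_input = condition(image, label=2, num_classes=3)
decoder_input = condition(z, label=2, num_classes=3)

# Same style z, different intended class at generation time.
restyled_input = condition(z, label=0, num_classes=3)
```

Feeding `restyled_input` to the decoder asks for class 0 drawn in the style that `z` captured from an image of class 2.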

For more images on generation, check out my repository’s README file.

Acknowledgements

Images in this post were borrowed from the presentation by Hwalsuk Lee.
I’ve implemented everything discussed here. Check out my GitHub repository.

Bayesian Statistics, Maximum Likelihood Estimation, and Machine Learning

2019-01-29T00:00:00-05:00

Resources

Prior probability

The prior probability distribution of an uncertain quantity is the probability distribution about that quantity before some evidence is taken into account. This is often expressed as \(p(\theta)\).

Posterior probability

The posterior probability of a random event is the conditional probability that is assigned after relevant evidence is taken into account. This is often expressed as \(p(\theta | X)\). The prior and posterior probabilities are related by Bayes' theorem as follows:

\[p(\theta | x) = \frac{p(x|\theta)p(\theta)}{p(x)}\]

Maximum Likelihood Estimation (MLE)

MLE is a method of estimating the parameters of a statistical model, given observations. Intuitively, we are trying to find the model parameters that make the observed data most probable. This is done by finding the parameters that maximize the likelihood function \(\mathcal{L}(\theta;x)\). When we are dealing with discrete random variables, the likelihood function is the probability of the observed data. On the other hand, when we are dealing with continuous random variables, the likelihood function is the value of the probability density function.

We can formulate the MLE problem as follows:

\[\theta^* \in \{\underset{\theta}{\text{argmax}} \mathcal{L}(\theta;x)\}\]

where \(\theta\) denotes the model parameters and \(x\) is the observed data.

We often use the average log-likelihood function

\[\hat{\ell}(\theta;x) = \frac{1}{n} \log \mathcal{L}(\theta;x)\]

since it has preferable qualities. One of these is illustrated later in this document.
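As a small worked example, consider estimating the bias of a coin by MLE. The sketch below uses made-up data and a simple grid search, both of which are assumptions for illustration only; it maximizes the average log-likelihood over candidate values of \(\theta\).

```python
import math

def average_log_likelihood(theta, data):
    """Average log-likelihood of coin flips (1 = heads) under P(heads) = theta."""
    return sum(
        math.log(theta if x == 1 else 1.0 - theta) for x in data
    ) / len(data)

data = [1, 1, 0, 1, 0, 1, 1, 0]             # 5 heads out of 8 flips
grid = [k / 1000 for k in range(1, 1000)]   # candidate values of theta in (0, 1)

theta_star = max(grid, key=lambda t: average_log_likelihood(t, data))
sample_mean = sum(data) / len(data)
```

The maximizer lands on the fraction of heads in the data, which is the known closed-form MLE for a Bernoulli parameter.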

Machine Learning in the MLE perspective

A traditional machine learning model for classification is visualized above: we receive an input image \(x\) and our model calculates \(f_\theta (x)\), which is a vector denoting the probability for each class. Then, based on our label, we calculate the loss function, which is then optimized using gradient descent. Now, let us view this from a maximum likelihood perspective.

Now, when we create an ML model, we choose a statistical model that our output may follow. Then, our ML model function calculates the parameters of that statistical model. For example, let us assume that our output \(y\) is one dimensional and has a Gaussian distribution. Then we set \(f_\theta(x)\) to a two-dimensional vector and interpret it as

\[f_\theta(x) =\begin{bmatrix}\mu\\\sigma\end{bmatrix}\]

Thus for each input \(x\) we obtain a Gaussian distribution for \(y\). Using negative log-likelihood, our optimization problem is the following:

\[\theta^* = \underset{\theta}{\text{argmin}}[-\log p(y|f_\theta(x))]\]

If we assume that our inputs are independent and identically distributed (i.i.d.), we can obtain the following:

\[p(y|f_\theta(x)) = \prod_i p(y_i|f_\theta(x_i))\]

Rewriting our optimization problem:

\[\theta^* = \underset{\theta}{\text{argmin}}[-\sum_i\log p(y_i|f_\theta(x_i))]\]

When we perform inference with our model, we no longer get deterministic outputs as we did in traditional machine learning models. We now get a distribution over \(y_\text{new}\),

\[y_\text{new} \sim f_{\theta^*}(x_\text{new})\]

from which we sample a single \(y_\text{new}\).
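The whole pipeline can be sketched in plain Python. Here `model` is a hypothetical \(f_\theta\) with a linear mean and a constant standard deviation, chosen only to keep the example short; the data and parameter values are made up.

```python
import math
import random

def model(theta, x):
    """Hypothetical f_theta: returns (mu, sigma) of a Gaussian over y."""
    slope, intercept, log_sigma = theta
    return slope * x + intercept, math.exp(log_sigma)

def negative_log_likelihood(theta, xs, ys):
    """Sum over i.i.d. examples of -log p(y_i | f_theta(x_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu, sigma = model(theta, x)
        total += 0.5 * math.log(2.0 * math.pi * sigma ** 2)
        total += (y - mu) ** 2 / (2.0 * sigma ** 2)
    return total

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]            # generated by y = 2x + 1

good_theta = (2.0, 1.0, 0.0)         # matches the data, sigma = 1
bad_theta = (0.0, 0.0, 0.0)          # ignores the input

nll_good = negative_log_likelihood(good_theta, xs, ys)
nll_bad = negative_log_likelihood(bad_theta, xs, ys)

# Inference: the model gives a distribution for y_new, from which we sample.
rng = random.Random(0)
mu_new, sigma_new = model(good_theta, 4.0)
y_new = rng.gauss(mu_new, sigma_new)
```

In practice \(\theta\) would be found by gradient descent on the negative log-likelihood; comparing two hand-picked parameter settings is just the simplest way to show that the better fit has the lower loss.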

Loss Functions in the MLE perspective

Two famous loss functions, mean squared error and cross-entropy error, can be derived using the MLE perspective.

(https://www.slideshare.net/NaverEngineering/ss-96581209)

An `mplstyle` file