Halide: a language and compiler for image processing and deep learning

Resources

  • https://halide-lang.org
  • https://github.com/halide/Halide
  • Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines (PLDI ’13)
  • Automatically Scheduling Halide Image Processing Pipelines (SIGGRAPH ’16)
  • Loop Transformations Leveraging Hardware Prefetching (CGO ’18)
  • Differentiable Programming for Image Processing and Deep Learning in Halide (SIGGRAPH ’18)
  • Schedule Synthesis for Halide Pipelines through Reuse Analysis (TACO ’19)
  • Learning to Optimize Halide with Tree Search and Random Programs (SIGGRAPH ’19)

Paper Summary

Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines

  • Motivation. Image processing pipelines are typically graphs of stencil computations with low arithmetic intensity and abundant data parallelism. Optimizing them involves complex tradeoffs between locality, parallelism, and recomputation, so hand-crafted code produced with tedious effort is often neither portable nor optimal.
  • Solution. Halide decouples the algorithm (what is computed) from the schedule (when and where it is computed). From each candidate schedule, the compiler produces parallel, vectorized code and measures its runtime; it then searches the tradeoff space for the best schedule using stochastic search based on a genetic algorithm.
  • Results. Generated code is an order of magnitude faster than its hand-crafted counterparts. Automatic scheduling, however, is quite slow and lacks robustness.
  • Detail. Two-stage decision for determining the schedule of each function:

    • Domain Order: the order in which the required region is traversed
      • sequential/parallel, unrolled/vectorized, dimension reorder, dimension split
    • Call Schedule: when to compute its inputs; the granularity of storage and computation (see the blur sketch after this summary)
      • breadth-first/total fusion/sliding window
  • Detail. Compile steps (all decisions directed by the schedule):
    • Lowering and Loop Synthesis: create the nested loops for the entire pipeline, inserting allocations and callee computations at the locations specified by the schedule
    • Bounds Inference: starting from the output size, the bounds of each dimension of every stage are determined
    • Sliding Window Optimization and Storage Folding: detect where the required conditions hold and apply the corresponding rewrites
    • Flattening: flatten multi-dimensional addressing and allocation
    • Vectorization and Unrolling
    • Back-end Code Generation (only noting the GPU):
      • the loop nest is split: outer loops stay on the host, inner loops become GPU kernel launches
      • inner loops are annotated in the schedule with block and thread dimensions
  • Detail. Stochastic search based on a genetic algorithm
    • Hand-crafted optimization styles are hinted at through mutation rules, e.g. mutating one or more function schedules toward a well-known template.
  • Thoughts.
    • The increase in performance is natural, since Halide invests a lot of time in optimization. The real contribution seems to be that Halide formulated the axes of optimization and exposed an easy handle that helps users search the space.
    • Generated CUDA kernels don’t seem to use CUDA streams or asynchronous copies.
    • Requires block and thread annotations provided by the programmer.
    • Without the hand-crafted mutation rules, I suspect performance would suffer greatly.
    • Schedule search could be learned. Monte Carlo tree search maybe? RL will work too, as in NAS.
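
To make the call-schedule axis concrete, here is a minimal sketch of a two-stage blur (my own example, not code from the paper) showing how the granularity choices above map onto scheduling directives; it uses the same older-style API as the Code Peek below, so exact signatures may vary across Halide versions.

#include "Halide.h"

int main() {
  // Algorithm: a horizontal blur feeding a vertical blur.
  Halide::Var x, y, yo, yi;
  Halide::Func input, blur_x, blur_y;
  input(x, y) = x + y;
  blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
  blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

  // Call schedule for the producer blur_x:
  //   blur_x.compute_root();   // breadth-first: compute all of blur_x before blur_y
  //   (leave unscheduled)      // total fusion: blur_x is inlined and recomputed at every use
  blur_y.split(y, yo, yi, 8).parallel(yo);
  blur_x.compute_at(blur_y, yo)   // in between: compute blur_x per 8-row strip of blur_y
        .store_at(blur_y, yo);    // storage granularity is chosen separately

  Halide::Buffer<int> out = blur_y.realize(256, 256);
  blur_y.print_loop_nest();
  return 0;
}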

Differentiable Programming for Image Processing and Deep Learning in Halide

  • Motivation. Existing deep learning libraries are inefficient in terms of computation and memory. Also, in order to implement custom operations, the user must manually provide both the forward and backward CUDA kernels.
  • Solution. Extend Halide with automatic differentiation (propagate_adjoints).
  • Results. GPU tensor operations run at 0.8x, 2.7x, and 20x the speed of PyTorch, measured with batch size 4.
  • Detail. Two special cases of note when creating backward operations:
    • Scatter-gather Conversion: When the forward of a function is a gather operation, its backward is a scatter, e.g. convolutions. This leads to race conditions when parallelized. Thus, the scatter operation is converted to a gather operation.
    • Handling Partial Updates: When a function is partially updated, the dependency is removed for some indices. If two consecutive updates have different update arguments, the former’s gradient is masked to zero using the update argument of the latter.
  • Detail. Checkpointing is already supported, in an even more fine-grained manner, through schedules: compute_root corresponds to checkpointing, compute_inline to recomputation, and compute_at is something in between, e.g. tiling (see the sketch after this summary).
  • Detail. Automatic scheduling (only noting the GPU, ordered from highest priority)

    1. All scatter/reduce operations are always checkpointed; their first two dimensions are tiled and computation is parallelized over tiles. Other types of operations are not checkpointed at all.
    2. Apply rfactor for large associative reductions with domains too small to tile.
    3. If parallelizing inevitably leads to race conditions, use atomic operations and parallelize anyway.
  • Thoughts.
    • Again, automatic scheduling could be better. The scheduler in this work is filled with hand-crafted heuristics.
    • The paper doesn’t discuss the time needed for automatic scheduling. It probably takes quite long, which would make this impractical for deep learning research, where training even a single hyperparameter configuration is already burdensome. Deployment has some hope, though.
    • The ‘deep learning operations’ this paper conducted experiments on (grid_sample, affine_grid, optical flow warp, and bilateral slicing) are relatively uncommon compared with matrix multiplication or convolution. This aligns with their claim that Halide is advantageous when you have to implement custom operations.
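
As a rough picture of what the extension looks like in use, here is a minimal sketch built around the propagate_adjoints entry point named above. The tiny pipeline, the names, and the Derivative usage are my own assumptions and may not match the exact API of a given Halide version.

#include "Halide.h"
#include <cstdio>

int main() {
  using namespace Halide;

  // Forward pipeline: a trivial 1-D "layer" followed by a scalar L2 loss.
  Var x;
  Buffer<float> input(128), target(128);
  for (int i = 0; i < 128; i++) { input(i) = 1.0f; target(i) = 2.0f; }

  Func weighted("weighted"), loss("loss");
  weighted(x) = 3.0f * input(x);
  RDom r(0, 128);
  loss() = 0.0f;
  Expr diff = weighted(r) - target(r);
  loss() += diff * diff;

  // Reverse-mode AD: propagate adjoints backward from the scalar loss.
  // (Entry point as named in the paper; signatures may differ by version.)
  Derivative d = propagate_adjoints(loss);
  Func d_weighted = d(weighted);   // gradient of the loss w.r.t. the intermediate stage

  // Checkpointing granularity is just scheduling of the forward stage:
  weighted.compute_root();         // "checkpoint": store the forward activation
  // weighted.compute_inline();    // "recompute": rematerialize it in the backward pass

  Buffer<float> grad = d_weighted.realize(128);
  printf("d loss / d weighted[0] = %f\n", grad(0));
  return 0;
}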

Learning to Optimize Halide with Tree Search and Random Programs

  • Motivation. Existing autoschedulers are limited because 1) their search space is small, 2) their search procedures are coupled with the schedule type, and 3) their cost models are inaccurate and hand-crafted.
  • Solution. Use 1) a new parametrization of the schedule space, 2) beam search, and 3) additionally, a learned cost model trained on randomly generated programs.
  • Results. Deep learning benchmarks on GPU were not reported at all! Those on CPU with input size 1 x 3 x 2560 x 1920 are claimed to outperform TensorFlow and PyTorch and be competitive with MXNet + MKL, but the paper gives no concrete numbers.
  • Detail. Parameters of the schedule. Beginning from the final stage, make two decisions per stage to build a complete schedule:

    1. Compute and storage granularity of the new stage. An existing stage can be split, creating an extra level of tiling; tile sizes are also parameters that must be determined.
    2. For the newly added stage, optionally parallelize the outer tile loops and/or vectorize the inner tile loops, and annotate accordingly.
  • Detail. Beam search with pruning (schedules that fail hand-crafted asserts are simply discarded). Run multiple passes that gradually narrow in on good schedules, from coarse to fine (a generic sketch of the search loop appears after this summary).
  • Detail. Predicting runtime, which beam search minimizes, with a neural network.

    1. Schedule to feature: algorithm-specific + schedule-specific
    2. Runtime prediction: design 27 runtime-related terms and have a small model predict the coefficient of each term; use an L2 loss between predicted and target throughput
    3. Training data generation: use the system itself, iterating between training the model and generating data with the system
  • Detail. Given more time, benchmark several candidates (instead of predicting runtime) and select the best. Given even more time, fine-tune the neural network on the benchmark results and repeat beam search (autotuning).

  • Thoughts.
    • A loop nest is a graph. Can we use graph embedding & pooling on schedules to predict runtime?
    • No comparisons with deep learning frameworks on GPUs. Maybe I have to check this myself.
    • This paper seems mostly to incorporate a tremendous amount of hand-crafted optimization and tedious engineering; I cannot find a core novel idea in it.
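
To illustrate the shape of the search loop (my own toy, not the paper’s implementation), here is a generic, self-contained beam-search sketch: the State struct, the expansion rule, the pruning predicate, and the toy cost function are all placeholders standing in for the schedule representation, the hand-crafted asserts, and the learned cost model.

#include <algorithm>
#include <cstdio>
#include <vector>

// A partial schedule: one decision index per stage scheduled so far,
// plus the cost model's prediction for the partial schedule.
struct State {
  std::vector<int> decisions;
  double predicted_cost = 0.0;
};

// Expand a partial schedule by enumerating the choices for the next stage.
std::vector<State> expand(const State &s, int num_choices) {
  std::vector<State> children;
  for (int c = 0; c < num_choices; c++) {
    State child = s;
    child.decisions.push_back(c);
    // Toy stand-in for the learned cost model (in the paper, a neural network
    // predicting runtime from algorithm- and schedule-specific features).
    child.predicted_cost = s.predicted_cost + 1.0 + 0.5 * (c % 3);
    children.push_back(child);
  }
  return children;
}

// Stand-in for the hand-crafted asserts: reject obviously bad schedules.
bool prune(const State &s) { return s.predicted_cost > 100.0; }

State beam_search(int num_stages, int num_choices, size_t beam_width) {
  std::vector<State> beam = {State{}};
  for (int stage = 0; stage < num_stages; stage++) {
    std::vector<State> candidates;
    for (const State &s : beam)
      for (const State &child : expand(s, num_choices))
        if (!prune(child)) candidates.push_back(child);
    // Keep only the beam_width candidates with the lowest predicted cost.
    std::sort(candidates.begin(), candidates.end(),
              [](const State &a, const State &b) {
                return a.predicted_cost < b.predicted_cost;
              });
    if (candidates.size() > beam_width) candidates.resize(beam_width);
    beam = std::move(candidates);
  }
  return beam.front();
}

int main() {
  State best = beam_search(/*num_stages=*/5, /*num_choices=*/4, /*beam_width=*/8);
  printf("best predicted cost: %f\n", best.predicted_cost);
  return 0;
}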

Code Peek

#include "Halide.h"          // all of Halide

int main() {

  // Symbolic definition of the algorithm 'index_sum'.
  Halide::Var x, y;          // think of these as for loop iterators
  Halide::Func index_sum;    // each Func represents one pipeline stage
  index_sum(x, y) = x + y;   // value defined at an arbitrary point (x, y)

  // Manually schedule our algorithm.
  Halide::Var x_outer, x_inner, y_outer, y_inner,  // divide loop into tiles
              tile_index,                          // fuse and parallelize
              x_inner_outer, y_inner_outer,        // tile each tile again
              x_vectors, y_pairs;                  // vectorize and unroll
  index_sum
    // tile with size (64, 64)
    .split(x, x_outer, x_inner, 64)
    .split(y, y_outer, y_inner, 64)
    .reorder(x_inner, y_inner, x_outer, y_outer)
    // fuse the two outer loops and parallelize
    .fuse(x_outer, y_outer, tile_index)
    .parallel(tile_index)
    // tile with size (4, 2), use shorthand this time!
    .tile(x_inner, y_inner, x_inner_outer, y_inner_outer, x_vectors, y_pairs, 4, 2)
    // vectorize over x_vectors (vector length is 4)
    .vectorize(x_vectors)
    // unroll loop over y_pairs (2 duplications)
    .unroll(y_pairs);

  // Run the algorithm. Loop bounds are automatically inferred by Halide!
  Halide::Buffer<int> result = index_sum.realize(350, 250);

  // Print nested loop in pseudo-code.
  index_sum.print_loop_nest();

  return 0;
}
$ g++ peek.cpp -g -I ../include -L ../bin -lHalide -lpthread -ldl -o peek -std=c++11
$ LD_LIBRARY_PATH=../bin ./peek
produce index_sum:
  parallel x.x_outer.tile_index:
    for y.y_inner.y_inner_outer:
      for x.x_inner.x_inner_outer:
        unrolled y.y_inner.y_pairs in [0, 1]:
          vectorized x.x_inner.x_vectors in [0, 3]:
            index_sum(...) = ...
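
For the GPU back end mentioned in the first summary, the block/thread mapping is also expressed in the schedule. Below is a sketch of my own (not from the original example); it assumes a CUDA-capable Halide build and device, and follows the same older API style as above, so exact signatures may vary across versions.

#include "Halide.h"

int main() {
  Halide::Var x, y, bx, by, tx, ty;
  Halide::Func index_sum;
  index_sum(x, y) = x + y;

  // Map 16x16 tiles onto GPU blocks (outer) and threads (inner),
  // i.e. the block and thread annotations from the back-end notes above.
  index_sum.gpu_tile(x, y, bx, by, tx, ty, 16, 16);

  // Target the host plus CUDA so the inner loops become kernel launches.
  Halide::Target target = Halide::get_host_target();
  target.set_feature(Halide::Target::CUDA);

  Halide::Buffer<int> result = index_sum.realize(350, 250, target);
  index_sum.print_loop_nest();
  return 0;
}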
