Notes on Kaiming He's MIT Lecture "Learning Deep Representations"

To represent the world by general & simple modules.

Deep Learning = Representation Learning

The top conference in the deep learning domain, the International Conference on Learning Representations (ICLR), is named after representation learning.

Representation learning: to represent raw data (e.g., pixels, words, waves, game boards, DNA...) in different forms in order to solve complex problems (via compression, abstraction, conceptualization...).

From bad representations to good ones

Go: Analyze \(3^{361}\) states? No! \(\rightarrow\) AlphaGo outperforms the best human players, without human knowledge, by using a better representation.

Image: How to represent it?

  • Image \(\rightarrow\) class
  • Image \(\rightarrow\) edge \(\rightarrow\) class
  • Image \(\rightarrow\) edge \(\rightarrow\) orientation \(\rightarrow\) class
  • Image \(\rightarrow\) edge \(\rightarrow\) orientation \(\rightarrow\) histogram \(\rightarrow\) class

Deeper and more robust, but requiring more and more domain knowledge! \(\rightarrow\) Feature design becomes extremely difficult once we want to define high-level representations.

For example, what is a cat?

Another methodology: deep learning uses general modules instead of specialized features, composing simple modules into complex functions. It:

  • builds multiple levels of abstraction
  • learns by back-propagation
  • learns from data
  • reduces domain knowledge and feature engineering

The research problem thus shifts from engineering the features to collecting data related to the problem.

Simple modules used (composed in the sketch below):

  • locally connected layers: greatly reduce the number of trainable parameters
  • weight sharing
  • pooling: produces a smaller feature map and achieves local invariance (a more abstract representation)
  • fc layers
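
A minimal PyTorch sketch (my choice of framework; the LeNet-style layout and sizes are illustrative, not the lecture's exact model) composing these modules:

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Conv = local connectivity + weight sharing; pooling shrinks the
    feature map for local invariance; fc layers map features to classes."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # locally connected, shared weights
            nn.ReLU(),
            nn.MaxPool2d(2),                             # smaller map, local invariance
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 7 * 7, num_classes)  # fc layer

    def forward(self, x):
        x = self.features(x)                 # e.g., 1x28x28 -> 16x7x7
        return self.classifier(x.flatten(1))

logits = TinyConvNet()(torch.randn(4, 1, 28, 28))  # -> shape (4, 10)
```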

Milestones

  • 1989: LeNet (LeCun Y et al. Backpropagation applied to handwritten zip code recognition.)
    • Data: MNIST, which was small and lacked persuasiveness
    • Sigmoid
  • 2012: AlexNet (Krizhevsky A et al. Imagenet classification with deep convolutional neural networks.)
    • Scale up the data (ImageNet: 1.28 million images, 1000 classes)
    • Scale up architecture (60 million parameters)
    • Introduce data augmentation and dropout to reduce overfitting
    • Explore GPU training (data parallelism: small batches; model parallelism: the network split across two GPUs)
    • Explore ReLU (avoids saturated, near-zero gradients and so supports deeper networks; a revolution in deep learning)
  • 2013: Visualizing (Zeiler M D, Fergus R. Visualizing and understanding convolutional networks.)
    • Understand representations by visualization: find what input produces a given feature
    • Set a one-hot feature map and back-propagate to the pixels (see the sketch after this list)
    • The single most important discovery of the DL revolution: deep representations are transferable (only fine-tuning and a small amount of related data are needed)
  • 2014: VGG-Net (Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition.)
    • Very deep convnets
    • Uses only conv (all \(3 \times 3\)), pooling, and fc layers
    • Deeper is better (elegant design: just stack more and more \(3 \times 3\) convs)
    • Not trained end-to-end (shallower configurations were trained first to initialize the deeper ones)
  • 2014: GoogLeNet (Szegedy et al. Going deeper with convolutions.) / Inception (Szegedy et al. Rethinking the inception architecture for computer vision.)
    • Deep and economical ConvNets
    • \(1 \times 1\) convolutions used as bottlenecks to cut computation
    • Many variants
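
A simplified stand-in for the Zeiler & Fergus procedure, continuing with PyTorch: the paper uses a deconvnet, but plain gradients are a common approximation of "set a one-hot feature map and back-prop to pixels". The helper name and arguments are mine:

```python
import torch

def visualize_unit(model, x, layer, channel):
    """Back-propagate one feature-map channel's activation to the input
    pixels and return the resulting pixel-space saliency."""
    acts = {}
    hook = layer.register_forward_hook(lambda m, i, o: acts.setdefault("a", o))
    x = x.clone().requires_grad_(True)
    model(x)          # forward pass; the hook captures the layer's output
    hook.remove()
    # "One-hot" objective: the mean activation of a single channel.
    acts["a"][0, channel].mean().backward()
    return x.grad[0]

# e.g., sal = visualize_unit(net, img, net.features[0], channel=3)  # hypothetical names
```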

Difficulties of going deeper:

Forward:

\[Var[y]=\prod_{d}n_{d}Var[w_d]Var[x].\]

Backward:

\[Var[\frac{\partial \epsilon}{\partial x}]=\prod_{d}m_{d}Var[w_d]Var[\frac{\partial \epsilon}{\partial y}].\]

Exploding (factor > 1) or vanishing (factor < 1) signals accumulate during propagation, so the signal variance should be preserved across layers.
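
A quick numerical check of the forward formula (a PyTorch sketch with illustrative sizes; linear layers only, no nonlinearity): the per-layer factor \(n_d Var[w_d]\) compounds multiplicatively.

```python
import torch

n, depth = 256, 50
x = torch.randn(1024, n)                         # unit-variance input signal
for factor in (0.5, 1.0, 2.0):                   # factor = n * Var[w] per layer
    h = x
    for _ in range(depth):
        w = torch.randn(n, n) * (factor / n) ** 0.5   # Var[w] = factor / n
        h = h @ w
    print(factor, h.var().item())  # <1: vanishes, =1: preserved, >1: explodes
```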

  • 2015: Network initialization
    • Xavier initialization: set the scaling factor \(n_d Var[w_d]\) to 1 for every layer
    • Kaiming initialization: account for the extra factor of \(1/2\) that ReLU introduces (it zeroes half the signal), i.e., set \(\frac{1}{2} n_d Var[w_d] = 1\) (He K et al. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.)
    • Norm modules applied to layers (another simple but general module; Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift.), given by \(\hat{x}=\frac{x-E[x]}{\sqrt{Var[x]}}\) and \(y=a\hat{x}+b\); see the sketch after this list. Normalization modules can
      1. enable models to train at all (otherwise they may not be trainable)
      2. speed up convergence
      3. improve accuracy
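
A minimal PyTorch sketch of both fixes (layer sizes are illustrative): Kaiming init sets \(Var[w]=2/n_d\) to cancel ReLU's factor of \(1/2\), and the norm layer implements the standardize-then-affine map above:

```python
import torch.nn as nn

layer = nn.Linear(256, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")  # Var[w] = 2/256
block = nn.Sequential(
    layer,
    nn.BatchNorm1d(256),  # x_hat = (x - E[x]) / sqrt(Var[x]); y = a*x_hat + b
    nn.ReLU(),
)
```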

Although we have good initialization and normalization, a plain network still degrades beyond about 20 layers, and not due to overfitting: it simply becomes hard to train.

  • 2015: ResNet (He K et al. Deep residual learning for image recognition.)
    • Enable networks with hundreds of layers via identity shortcuts: \[H(x)=F(x)+x,\] where the last \(x\) is the identity mapping.
    • Idea: encourage each block to make small, conservative, incremental changes (see the sketch after this list).
    • New generic module
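
A minimal residual block in the spirit of the paper, sketched in PyTorch (channel count is illustrative; the real BasicBlock also handles stride and width changes):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """H(x) = F(x) + x: the identity shortcut lets F learn a small,
    incremental change rather than the whole mapping."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)  # F(x) + identity shortcut
```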

A checklist for training DNNs

All about "healthy" signal propagation!

  • ReLU
  • Init
  • Norm
  • Res

RNN vs. CNN for sequence modeling

Similarities

  • Weight-sharing (across time dimension)
  • Locally-connected
  • We can again enjoy benefits from common DL methodologies (e.g., ResNet)

Differences

  • An RNN uses the full context to produce the last state, but it is not feedforward and is inefficient on GPUs: to get the final result, we must wait for all the hidden-state computations to finish sequentially.
  • A CNN uses only limited (local) context and is hence feedforward (see the sketch after this list).
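
A small PyTorch sketch of the contrast (shapes are illustrative): the RNN consumes the sequence step by step, while the 1D convolution computes every time step in one parallel, feedforward pass over a local window:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 100, 32)  # (batch, time, features)

# RNN: each hidden state depends on the previous one -> sequential in time.
rnn_out, _ = nn.RNN(32, 64, batch_first=True)(x)

# Conv1d: each output sees only a local window -> all steps in parallel.
conv_out = nn.Conv1d(32, 64, kernel_size=3, padding=1)(x.transpose(1, 2))
```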

In the attention mechanism, every node can see every other node (full context), and it is still feedforward. Hence the Transformer (Vaswani et al. Attention is all you need) in 2017, followed by GPT (where the transfer-learning paradigm is still widely used), AlphaFold, and the Vision Transformer (ViT, which treats images as sequences of patches).
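
A minimal single-head, unmasked self-attention sketch in PyTorch (the function and weight names are mine): full context, computed in one feedforward pass.

```python
import torch

def self_attention(x, wq, wk, wv):
    """Every position attends to every other position (full context)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled dot product
    return scores.softmax(dim=-1) @ v

x = torch.randn(8, 100, 32)                  # (batch, time, features)
wq, wk, wv = (torch.randn(32, 32) for _ in range(3))
out = self_attention(x, wq, wk, wv)          # -> (8, 100, 32)
```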
