Domain Adaptation Papers
- ICML2015, Unsupervised Domain Adaptation by Backpropagation (DANN)
- Learn features that are both domain-invariant (fooling a domain classifier) and discriminative (useful for the label predictor)
- Introduce a new layer, the Gradient Reversal Layer (GRL), so the adversarial objective can be trained with plain SGD/backpropagation (an implementation trick; a minimal sketch follows below)
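A minimal PyTorch sketch of the GRL idea, assuming a scalar weight `lambd` for the reversed gradient; the names `GradReverse` / `grad_reverse` are illustrative, not the paper's code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient flowing back into the feature extractor is negated,
        # pushing the features toward confusing the domain classifier.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Typical wiring: domain_logits = domain_classifier(grad_reverse(features, lambd))
```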
- NIPS2016, Unsupervised Domain Adaptation with Residual Transfer Networks (RTN)
- learn discriminative features via deep architectures (CNN)
- minimize the Maximum Mean Discrepancy (MMD) between source and target features, fusing the features of several layers via a tensor (Kronecker) product (a simplified MMD sketch follows this entry)
- insert residual layers between the source classifier and the target classifier, learning a perturbation (residual) function so the source classifier equals the target classifier plus the residual
- Q: cross-entropy at target classifier
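A simplified sketch of the discrepancy term being minimized: a single-Gaussian-kernel MMD between two feature batches in PyTorch. RTN itself uses a multi-kernel MMD over tensor-product-fused features; `sigma` here is an arbitrary placeholder:

```python
import torch

def gaussian_mmd(source_feats, target_feats, sigma=1.0):
    """Squared MMD between a source batch (n, d) and a target batch (m, d)."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances -> Gaussian kernel matrix.
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))

    return (kernel(source_feats, source_feats).mean()
            + kernel(target_feats, target_feats).mean()
            - 2 * kernel(source_feats, target_feats).mean())
```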
- ICML2017, Deep Transfer Learning with Joint Adaptation Networks (JAN)
- The joint distribution of activations in the task-specific (hard-to-transfer) layers is aligned by minimizing a Joint Maximum Mean Discrepancy (JMMD); this is the base model (a product-of-kernels JMMD sketch follows this entry)
- An adversarial variant (JAN-A) then adds a network trained to maximize the JMMD while the features are trained to minimize it, making the discrepancy measure harder to fool
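A rough PyTorch sketch of the JMMD idea, where the joint kernel is the element-wise product of per-layer kernels; the `sigmas` bandwidths are placeholders rather than the paper's settings:

```python
import torch

def joint_mmd(source_acts, target_acts, sigmas):
    """Joint MMD over several layers' activations (lists of (n, d_l) / (m, d_l) tensors)."""
    def kernel(a, b, sigma):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))

    k_ss, k_tt, k_st = 1.0, 1.0, 1.0
    for s, t, sigma in zip(source_acts, target_acts, sigmas):
        # Product of per-layer kernels acts as a kernel on the joint distribution.
        k_ss = k_ss * kernel(s, s, sigma)
        k_tt = k_tt * kernel(t, t, sigma)
        k_st = k_st * kernel(s, t, sigma)
    return k_ss.mean() + k_tt.mean() - 2 * k_st.mean()
```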
- NIPS2017, Mean teachers are better role models (Mean Teacher), semi-supervised
- Use back-prop to train the student as the label classifier
- Use an Exponential Moving Average (EMA) of the student's weights as the teacher
- Use the MSE between the predictions of teacher & student as the consistency loss
- Ramping up the weight of this unsupervised loss over time is necessary (a sketch of all three pieces follows below)
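A PyTorch sketch of the three ingredients above (EMA teacher update, MSE consistency loss, time-dependent ramp-up). Function names and the ramp-up hyperparameters are assumptions; the exp(-5(1-t)^2) shape follows the schedule commonly used with Mean Teacher:

```python
import math
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.99):
    """Teacher weights = exponential moving average of the student weights."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def consistency_loss(student_logits, teacher_logits):
    """MSE between the two models' class probabilities; no gradient through the teacher."""
    return F.mse_loss(F.softmax(student_logits, dim=1),
                      F.softmax(teacher_logits.detach(), dim=1))

def consistency_weight(step, rampup_steps=40000, max_weight=1.0):
    """Sigmoid-shaped ramp-up of the unsupervised-loss weight over training time."""
    t = min(step, rampup_steps) / rampup_steps
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)
```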
- arxiv: Self-ensembling for domain adaptation (Self-Ensembling)
- Applies Mean Teacher in the domain adaptation setting
- Each iteration the data is split into two paths: cross-entropy for classification on the source domain & an unsupervised self-ensembling (consistency) loss on the target domain
- Two batches, a source batch and a target batch, are fed each iteration, with separate BN statistics kept per domain
- SGD is performed jointly on both losses (one step is sketched below)
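A PyTorch sketch of one joint training step under these assumptions: augmentation is assumed to happen in the data pipeline, and the per-domain BN statistics and confidence thresholding from the paper are omitted:

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, optimizer, src_x, src_y,
               tgt_x_student, tgt_x_teacher, cons_weight):
    """One joint SGD step: supervised loss on the source batch, consistency loss on the target batch.

    tgt_x_student / tgt_x_teacher are two differently augmented views of the same target images.
    """
    optimizer.zero_grad()

    # Path 1: source batch, cross-entropy against the labels.
    sup_loss = F.cross_entropy(student(src_x), src_y)

    # Path 2: target batch, student must match the EMA teacher's prediction.
    stu_probs = F.softmax(student(tgt_x_student), dim=1)
    with torch.no_grad():
        tea_probs = F.softmax(teacher(tgt_x_teacher), dim=1)
    cons_loss = F.mse_loss(stu_probs, tea_probs)

    loss = sup_loss + cons_weight * cons_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```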
- ICLR2018: A DIRT-T Approach to Unsupervised Domain Adaptation (DIRT-T)
- Unsupervised, non-conservative domain adaptation: (a) the source is fully labeled, the target is unlabeled; (b) a classifier that works well on both source and target is not guaranteed to exist
- Introduce the cluster assumption on the data distribution, enforced via a violation term (conditional entropy of the predictions) in the objective function (sketched below)
- VADA: applies the violation penalty for both the source & target domains, on top of domain-adversarial training
- DIRT-T (Decision-boundary Iterative Refinement Training, with a Teacher): initialize with VADA, then keep minimizing the violation penalty on the target domain only, to improve performance on the target domain
- Each refinement step is a constrained optimization (stay close to the previous-step teacher model), relaxed with a Lagrange multiplier
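A minimal PyTorch sketch of the violation term (conditional entropy of the classifier's predictions) used by VADA and DIRT-T; the virtual adversarial training part is not shown:

```python
import torch.nn.functional as F

def conditional_entropy(logits):
    """Cluster-assumption violation term: mean entropy of the predicted class distribution.

    Minimizing it pushes decision boundaries away from dense regions of the data.
    VADA adds it to the adversarial objective on both domains; DIRT-T keeps
    minimizing it on target data only.
    """
    p = F.softmax(logits, dim=1)
    log_p = F.log_softmax(logits, dim=1)
    return -(p * log_p).sum(dim=1).mean()
```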
- ICLR2017: Temporal Ensembling for Semi-Supervised Learning (Temporal Ensembling):
- An ensemble of networks generally predicts better than a single model, so its predictions are more likely to be right
- Introduces self-ensembling: the Pi model passes each input through two different translation/noise/dropout paths that should extract the same information, and penalizes disagreement between the two predictions
- Temporal ensembling instead takes an Exponential Moving Average of each sample's predictions over past epochs as the consistency target (update step sketched below)
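A PyTorch-style sketch of the temporal-ensembling accumulation step, assuming `ensemble` is a persistent (N, C) buffer and `epoch` counts from 0; `alpha` is a momentum value chosen for illustration:

```python
import torch

def update_ensemble_targets(ensemble, batch_probs, batch_idx, epoch, alpha=0.6):
    """One accumulation step of temporal ensembling.

    ensemble: (N, C) running EMA of every sample's predictions across epochs.
    batch_probs: (B, C) softmax outputs of the current epoch for this mini-batch.
    Returns bias-corrected targets (divide by 1 - alpha ** (epoch + 1)) for the
    unsupervised MSE term.
    """
    ensemble[batch_idx] = alpha * ensemble[batch_idx] + (1.0 - alpha) * batch_probs.detach()
    targets = ensemble[batch_idx] / (1.0 - alpha ** (epoch + 1))
    return targets
```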