A major obstacle to deploying machine learning systems in practice is domain adaptation: given a large auxiliary supervised dataset and a smaller dataset of interest, use the auxiliary dataset to improve performance on the smaller one. This paper considers the setting where we have K datasets from distinct domains and want to adapt quickly to a new one. It trains one model on each of the K datasets and treats each model as an expert. Given a new domain, it trains another model for that domain which, in addition, computes attention over the experts. The attention comes from a dot product that measures the similarity between the new domain’s hidden representation and each of the K experts’ representations.
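To make the attention step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the softmax normalization, the weighted sum of expert representations, and the names `attend_over_experts`, `target_hidden`, and `expert_hiddens` are all assumptions made for illustration. The summary above only states that similarity is computed with a dot product.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend_over_experts(target_hidden, expert_hiddens):
    """Score each expert by dot-product similarity with the target model's
    hidden representation, then mix the expert representations.

    Assumption: scores are normalized with a softmax and the experts are
    combined by a weighted sum; the summary does not specify either choice.
    """
    scores = [dot(target_hidden, h) for h in expert_hiddens]
    weights = softmax(scores)
    dim = len(target_hidden)
    mixed = [
        sum(w * h[i] for w, h in zip(weights, expert_hiddens))
        for i in range(dim)
    ]
    return weights, mixed


if __name__ == "__main__":
    # Toy example: three experts with 2-dimensional hidden states.
    target = [1.0, 0.0]
    experts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    weights, mixed = attend_over_experts(target, experts)
    print(weights, mixed)
```

In this toy example the first expert, whose representation is most aligned with the target's, receives the largest weight.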
In addition to this core idea, the authors propose a couple of modifications to improve performance. First, they compute an additional form of attention: the inner product between the new domain’s hidden layer and the label embedding of each expert network’s most likely output. Second, they find that selecting experts sparsely, by keeping only the top K’ experts’ attention weights non-zero, improves performance. K’ is chosen via grid search on the validation set.
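The two modifications can be sketched as follows. This is a hedged illustration, not the paper's code: the function names, the assumption that each expert has its own label-embedding table, and the renormalization of the surviving weights after top-K’ selection are all choices made here for the example.

```python
def label_attention_scores(target_hidden, expert_label_scores, expert_label_embeddings):
    """Score each expert by the inner product between the target hidden state
    and the embedding of that expert's most likely label.

    Assumption: expert_label_embeddings[k][j] is the embedding of label j
    for expert k (a per-expert embedding table).
    """
    scores = []
    for label_scores, embeddings in zip(expert_label_scores, expert_label_embeddings):
        # Index of the expert's highest-scoring (most likely) label.
        best = max(range(len(label_scores)), key=label_scores.__getitem__)
        scores.append(sum(a * b for a, b in zip(target_hidden, embeddings[best])))
    return scores


def keep_top_k(weights, k_prime):
    """Zero out all but the k_prime largest attention weights.

    Assumption: the surviving weights are renormalized to sum to 1.
    """
    if not 1 <= k_prime <= len(weights):
        raise ValueError("k_prime must be between 1 and the number of experts")
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = set(order[:k_prime])
    sparse = [w if i in keep else 0.0 for i, w in enumerate(weights)]
    total = sum(sparse)
    return [w / total for w in sparse] if total > 0 else sparse


if __name__ == "__main__":
    print(keep_top_k([0.5, 0.3, 0.2], 2))
```

A grid search over K’ would then amount to calling `keep_top_k` with each candidate value and keeping the one with the best validation score.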
Authors: Young-Bum Kim, Karl Stratos, Dongchan Kim