What We’re Reading: Domain Attention with an Ensemble of Experts

by Spence Green
1 Minute Read

A major problem in deploying machine learning systems in practice is domain adaptation: given a large auxiliary supervised dataset and a smaller dataset of interest, use the auxiliary dataset to improve performance on the smaller one. This paper considers the case where we have K datasets from distinct domains and want to adapt quickly to a new domain. It trains a separate model on each of the K datasets and treats each one as an expert. Given a new domain, it trains another model for that domain, but in addition computes attention over the experts, scoring each expert by the dot product between the new domain's hidden representation and that expert's representation.
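A minimal sketch of that expert attention, in NumPy. The function name, the softmax normalization, and the shapes are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def domain_attention(h_new, expert_reps):
    """Weight K expert-domain representations by their similarity to the
    new domain's hidden representation.

    h_new:       (d,)   hidden representation from the new-domain model
    expert_reps: (K, d) hidden representations from the K expert models
    Returns an attention-weighted combination of the expert representations.
    """
    scores = expert_reps @ h_new            # dot-product similarity, shape (K,)
    scores = scores - scores.max()          # numerical stability for the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ expert_reps            # convex combination, shape (d,)
```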

In addition to this core idea, the authors propose a couple of modifications to improve performance. First, they compute an additional form of attention: the inner product between the new domain's hidden layer and the label embedding of the most likely output of each expert network. They also find that sparsifying the attention, keeping only the top K' experts non-zero, improves performance; K' is chosen by grid search on the validation set.
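A sketch of how the two signals and the top-K' sparsification might fit together. The names, the additive combination of the two scores, and the assumption that label embeddings share the hidden dimension are all illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def sparse_expert_attention(h_new, expert_reps, label_embs, k_prime):
    """Combine hidden-state and label-embedding attention, keep top K' experts.

    h_new:       (d,)   new domain's hidden representation
    expert_reps: (K, d) expert hidden representations
    label_embs:  (K, d) embeddings of each expert's most likely output label
    k_prime:     number of experts to keep non-zero
    """
    # Two dot-product signals, combined additively (an assumption here).
    scores = expert_reps @ h_new + label_embs @ h_new
    # Zero out all but the K' highest-scoring experts before normalizing.
    keep = np.argsort(scores)[-k_prime:]
    mask = np.full_like(scores, -np.inf)
    mask[keep] = 0.0
    scores = scores + mask
    weights = np.exp(scores - scores[keep].max())   # exp(-inf) -> 0 for pruned experts
    weights /= weights.sum()
    return weights @ expert_reps
```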

Paper: Domain Attention with an Ensemble of Experts

Authors: Young-Bum Kim, Karl Stratos, Dongchan Kim

What We’re Reading: Single-Queue Decoding for Neural Machine Translation

1 Minute Read

The most popular way of finding a translation for a source sentence with a neural sequence-to-sequence model is a simple beam search. The target sentence is predicted one word at a time, and after each prediction, a fixed number of possibilities (typically between 4 and 10) is retained for further exploration. This strategy can be suboptimal because these local hard decisions do not take the remainder of the translation into account and cannot be reverted later on.
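For concreteness, here is a minimal beam search sketch. `score_next` is a stand-in callback for the model's next-word distribution (returning a dict of token to log-probability); it and the other names are assumptions, not the paper's decoder.

```python
def beam_search(score_next, start_token, end_token, beam_size=5, max_len=50):
    """Extend every hypothesis one word, then keep only the top `beam_size`."""
    beams = [([start_token], 0.0)]          # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for tok, lp in score_next(tokens).items():
                hyp = (tokens + [tok], logp + lp)
                (finished if tok == end_token else candidates).append(hyp)
        if not candidates:
            break
        # Hard pruning: anything outside the top `beam_size` is discarded
        # and can never be revisited later.
        beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
    return max(finished + beams, key=lambda h: h[1])
```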

Read More

What We’re Reading: Neural Machine Translation with Reconstruction

1 Minute Read

Neural MT systems generate translations one word at a time. They can still produce fluent translations because they choose each word based on all of the words generated so far. Typically, these systems are trained only to generate the next word correctly, given all previous words. One systematic problem with this word-by-word approach to training and translation is that the translations are often too short and omit important content. In the paper Neural Machine Translation with Reconstruction, the authors describe a clever new way to train and translate. During training, their system is encouraged not only to generate each next word correctly but also to correctly regenerate the original source sentence from the translation it produced. In this way, the model is rewarded for generating a translation that is sufficient to describe all of the content in the original source.
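In loss terms, the idea amounts to adding a reconstruction term to the usual translation likelihood. A minimal sketch, assuming a single interpolation weight `lam`; the function and parameter names are placeholders rather than the paper's notation.

```python
def training_loss(translation_logprob, reconstruction_logprob, lam=1.0):
    """Combined objective: the usual word-by-word translation likelihood plus a
    reconstruction term scoring how well the source sentence can be recovered
    from the generated translation. `lam` trades off the two terms."""
    return -(translation_logprob + lam * reconstruction_logprob)
```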

Read More