When doing beam search in sequence-to-sequence models, one explores next words in order of their likelihood under the model. During decoding, however, we may have other constraints or objectives we wish to satisfy or maximize, such as sequence length, BLEU score, or mutual information between the target and source sentences. To accommodate these additional desiderata, the authors add a term Q to the likelihood that captures the criterion of interest, and then choose words based on this combined objective (see the sketch below).
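A minimal sketch of what one step of such a modified beam search might look like: candidates are ranked by cumulative log-likelihood plus a weighted Q term rather than by likelihood alone. The names `beam_step`, the interpolation weight `lam`, and the toy `q_fn` are illustrative assumptions, not details taken from the paper.

```python
import math

def beam_step(beams, log_probs_fn, q_fn, beam_size, lam=0.5):
    """Expand each beam by one token, ranking candidates by
    cumulative log-likelihood plus a weighted future-outcome term Q.

    beams: list of (prefix, cumulative_log_prob) pairs.
    """
    candidates = []
    for prefix, cum_logp in beams:
        # log_probs_fn returns {token: log p(token | prefix)} under the model.
        for tok, logp in log_probs_fn(prefix).items():
            candidates.append((prefix + [tok], cum_logp + logp))
    # Rank by likelihood *plus* the predicted future quantity Q,
    # instead of by likelihood alone.
    candidates.sort(key=lambda c: c[1] + lam * q_fn(c[0]), reverse=True)
    return candidates[:beam_size]

# Toy usage: a fixed next-token distribution and a hypothetical Q that
# rewards prefixes close to a target length of 3 tokens.
vocab_logps = {"a": math.log(0.5), "b": math.log(0.3), "</s>": math.log(0.2)}
log_probs_fn = lambda prefix: vocab_logps
q_fn = lambda prefix: -abs(3 - len(prefix))
beams = beam_step([([], 0.0)], log_probs_fn, q_fn, beam_size=2)
```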
The difficulty is that we don't know the values of these quantities until decoding has completed. E.g., we don't know how long the output sequence will be until we have actually finished decoding the sentence. To solve this, the authors learn Q as a function of the following inputs: the source sentence, the prefix of previously emitted target symbols, and the current hidden state of the decoder. From this information, it predicts the quantity in question; in the sequence-length example, it predicts the number of output tokens the decoder will eventually generate (sketched below).
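A minimal sketch of such a Q predictor for the sequence-length case, assuming the decoder hidden state already summarizes the source and the emitted prefix. The name `FutureLengthPredictor`, the one-layer MLP head, and the MSE training target are illustrative choices, not the authors' architecture.

```python
import torch
import torch.nn as nn

class FutureLengthPredictor(nn.Module):
    """Regress the eventual total output length from the decoder's
    current hidden state."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, decoder_hidden):
        # decoder_hidden: (batch, hidden_dim) -> predicted length: (batch,)
        return self.head(decoder_hidden).squeeze(-1)

predictor = FutureLengthPredictor(hidden_dim=512)
h = torch.randn(4, 512)   # decoder hidden states for a batch of prefixes
pred_len = predictor(h)   # predicted total output lengths
# The training target is the true final length, which is known once
# decoding of the training example has finished; fit with e.g. MSE loss.
loss = nn.functional.mse_loss(pred_len, torch.tensor([7.0, 12.0, 9.0, 15.0]))
```

At decode time this predictor plays the role of `q_fn` in the beam-search sketch above, supplying an estimate of the final quantity before decoding is complete.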
Paper: Learning to Decode for Future Success
Authors: Jiwei Li, Will Monroe, Dan Jurafsky
Affiliation: Stanford University