Attention mechanisms and grid features are widely used in current vision-language tasks such as image captioning. The attention scores are the key factor in the success of the attention mechanism. However, because the Transformer is a hierarchical structure, the connection between attention scores in different layers is not strong. Additionally, geometric information is inevitably lost when grid features are flattened to be fed into a Transformer model, so bias scores carrying geometric position information should be added to the attention scores. Considering that there are three different kinds of attention modules in the Transformer architecture, we build three independent paths (residual attention paths, RAPs) to propagate the attention scores from the previous layer as a prior for the attention computation. This operation acts like a residual connection between attention scores: it strengthens the connection across layers and lets each attention layer obtain a more global comprehension. Then, in the encoder, we replace the traditional attention module with a novel residual attention with relative position module that incorporates relative position scores into the attention scores. Since residual attention may increase internal covariate shift, we introduce a residual attention with layer normalization on query vectors module in the decoder to stabilize the data distribution. Finally, we build our Residual Attention Transformer with three RAPs (Tri-RAT) for the image captioning task. The proposed model achieves performance competitive with all the state-of-the-art models on the MS COCO benchmark, gaining 135.8% CIDEr on the "Karpathy" offline test split and 135.3% CIDEr on the online testing server.
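Since the abstract only names the mechanisms, a minimal PyTorch sketch may help make them concrete. Assuming the residual attention path simply adds the previous layer's pre-softmax scores to the current ones, and the relative position module adds a learned per-head bias indexed by the 2-D offset between grid cells, an encoder layer could look roughly like this (the class name, the bias parameterization, and the decision to propagate the accumulated scores are our assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualRelPosAttention(nn.Module):
    """Encoder self-attention sketch: previous-layer scores are added as a
    prior (a residual attention path, RAP), plus a learned relative-position
    bias over the 2-D grid of flattened visual features."""

    def __init__(self, d_model, n_heads, grid_size):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # one learnable bias per head for every relative (dy, dx) offset
        num_rel = (2 * grid_size - 1) ** 2
        self.rel_bias = nn.Parameter(torch.zeros(n_heads, num_rel))
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
        )).flatten(1)                                   # (2, N) grid coords
        rel = coords[:, :, None] - coords[:, None, :]   # (2, N, N) offsets
        rel += grid_size - 1                            # shift to >= 0
        idx = rel[0] * (2 * grid_size - 1) + rel[1]     # (N, N) bias index
        self.register_buffer("rel_idx", idx)

    def forward(self, x, prev_scores=None):
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        scores = scores + self.rel_bias[:, self.rel_idx]  # position prior
        if prev_scores is not None:                       # residual path
            scores = scores + prev_scores
        attn = F.softmax(scores, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(y), scores  # hand the scores to the next layer
```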
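The decoder-side module can be sketched the same way: LayerNorm is applied to the query vectors before the scores are computed, which is one plausible way to counteract the distribution shift that the residual path introduces. The class name QueryNormAttention and the placement of the normalization after the query projection are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryNormAttention(nn.Module):
    """Decoder attention sketch: LayerNorm on the query vectors stabilizes
    the score distribution that the residual attention path would shift."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)
        self.q_norm = nn.LayerNorm(d_model)  # normalization on queries

    def forward(self, x, mem, prev_scores=None):
        B, Nq, _ = x.shape
        Nk = mem.shape[1]
        q = self.q_norm(self.q_proj(x))      # normalized query vectors
        q = q.view(B, Nq, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(mem).view(B, Nk, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(mem).view(B, Nk, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if prev_scores is not None:          # residual attention path
            scores = scores + prev_scores
        attn = F.softmax(scores, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, Nq, -1)
        return self.out(y), scores
```

In a Tri-RAT-style model, each of the three attention kinds (encoder self-attention, decoder self-attention, decoder cross-attention) would keep its own independent score path, so priors are propagated within a module kind but never shared across kinds.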