| Model | Test Acc. (%) | Train Acc. (%) | Parameters | Paper / Source | Code |
| --- | --- | --- | --- | --- | --- |
| RoBERTa-large + self-explaining layer | 92.3 | ? | 355m+ | Self-Explaining Structures Improve NLP Models | - |
| Distance-based Self-Attention Network | 86.3 | 89.6 | 4.7m | Distance-based Self-Attention Network for Natural Language Inference | - |
| Stacked Bi-LSTMs (shortcut connections, max-pooling, attention) | 84.4 | - | - | Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News | - |
| 300D Gumbel TreeLSTM encoders | 85.6 | 91.2 | 2.9m | Learning to Compose Task-Specific Tree Structures | - |
| SJRC (BERT-Large + SRL) | 91.3 | 95.7 | 308m | Explicit Contextual Semantics for Text Comprehension | - |
| 1024D GRU encoders w/ unsupervised 'skip-thoughts' pre-training | 81.4 | 98.8 | 15m | Order-Embeddings of Images and Language | - |
| 200D decomposable attention model with intra-sentence attention | 86.8 | 90.5 | 580k | A Decomposable Attention Model for Natural Language Inference | - |
| 600D (300+300) BiLSTM encoders with intra-attention and symbolic preproc. | 85.0 | 85.9 | 2.8m | Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention | - |
| 600D BiLSTM with generalized pooling | 86.6 | 94.9 | 65m | Enhancing Sentence Embedding with Generalized Pooling | - |
| Enhanced Sequential Inference Model (ESIM; Chen et al., 2017a) | 88.0 | - | - | Enhanced LSTM for Natural Language Inference | - |
| 300D Reinforced Self-Attention Network | 86.3 | 92.6 | 3.1m | Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling | - |
| 600D (300+300) Deep Gated Attn. BiLSTM encoders | 85.5 | 90.5 | 12m | Recurrent Neural Network-Based Sentence Encoder with Gated Attention for Natural Language Inference | - |
| 300D Residual stacked encoders | 85.7 | 89.8 | 9.7m | Shortcut-Stacked Sentence Encoders for Multi-Domain Inference | - |
| ESIM + ELMo Ensemble | 89.3 | 92.1 | 40m | Deep contextualized word representations | - |
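
For reference, the sketch below shows how test-accuracy figures like those in the table are typically computed: run an NLI classifier over the SNLI test split and score predictions against the gold labels. This is a minimal, illustrative example, not the evaluation code of any paper listed above; the checkpoint name (`roberta-large-mnli`) and its output-label order are assumptions, so check the model card of whatever model you actually evaluate.

```python
# Minimal SNLI test-set evaluation sketch (assumptions: checkpoint name and
# its label order; verify against the model card before relying on results).
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "roberta-large-mnli"  # assumed example checkpoint, not from the table
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

# SNLI gold labels: 0 = entailment, 1 = neutral, 2 = contradiction;
# examples without annotator consensus carry label -1 and are skipped.
test = load_dataset("snli", split="test").filter(lambda ex: ex["label"] != -1)

# Assumed mapping from this checkpoint's output order
# (contradiction, neutral, entailment) to SNLI's label ids.
to_snli = {0: 2, 1: 1, 2: 0}

correct = 0
with torch.no_grad():
    for ex in test:  # one example at a time for clarity; batch in practice
        enc = tokenizer(ex["premise"], ex["hypothesis"],
                        truncation=True, return_tensors="pt")
        pred = model(**enc).logits.argmax(dim=-1).item()
        correct += int(to_snli[pred] == ex["label"])

print(f"SNLI test accuracy: {correct / len(test):.3f}")
```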