Fastai community entry to the 2020 Reproducibility Challenge
- Reformer Paper
- Authors’ ICLR video
- Google Blog
- Authors’ code (Trax)
- Reformer enwik8 model and training config
- @lucidrains’ Reformer code
- HuggingFace: Reformer source code
- HuggingFace: Reformer notebook example
- HuggingFace: long sequences
- HuggingFace: Pretraining
enwik8
- enwik8.zip, raw data, 100 MB
- Tensor2Tensor enwik8 data generator code, with train/dev/test split (see the sketch after this list). File lengths in bytes:
- Train: 89,621,832
- Eval: 5,000,000
- Test: 5,000,000
- Tensor2Tensor enwik8 notebook
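The byte lengths above can be checked by slicing the raw file directly. A minimal sketch, assuming `enwik8.zip` is in the working directory; `split_enwik8` and the output paths are hypothetical, and the Tensor2Tensor generator may process the text beyond this raw byte slice (the three lengths sum to 99,621,832, slightly under the 100 MB raw file):

```python
# Hypothetical sketch: carve raw enwik8 bytes into train/eval/test
# files with the lengths listed above. This is a plain byte slice;
# the actual Tensor2Tensor generator may do more.
import zipfile
from pathlib import Path

SPLITS = {"train": 89_621_832, "eval": 5_000_000, "test": 5_000_000}

def split_enwik8(zip_path: str = "enwik8.zip", out_dir: str = "data") -> None:
    with zipfile.ZipFile(zip_path) as zf:
        data = zf.read("enwik8")  # the archive contains a single file, "enwik8"
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    offset = 0
    for name, length in SPLITS.items():  # dicts preserve insertion order
        (out / f"{name}.txt").write_bytes(data[offset:offset + length])
        offset += length

if __name__ == "__main__":
    split_enwik8()
```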
WMT14
- WMT on HuggingFace Datasets
- Reformer WMT14 vocab
- Reformer.input_vocab_size = 33300, from the WMT14 model config
- Train/test split (guess): newstest2013 for validation and newstest2014 for test, consistent with Vaswani et al. (2017); from https://arxiv.org/pdf/2009.02070.pdf
- Tokenizer: Tensor2Tensor SubwordTextEncoder (see the sketch below)
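To make the split and vocab concrete: a minimal sketch, assuming the HuggingFace `wmt14` de-en dataset (whose validation/test splits should correspond to newstest2013/newstest2014) and a local copy of the vocab file from the link above; the `vocab.ende.subwords` filename is a placeholder, not a confirmed artifact name:

```python
# Hypothetical sketch tying together the WMT14 details above:
# the train/validation/test split and the subword tokenizer.
from datasets import load_dataset
from tensor2tensor.data_generators.text_encoder import SubwordTextEncoder

# German-English WMT14; validation/test should correspond to
# newstest2013/newstest2014, matching the split guess above.
wmt14 = load_dataset("wmt14", "de-en")
print({name: len(split) for name, split in wmt14.items()})

# Placeholder filename (assumption) -- point this at the file from
# the "Reformer WMT14 vocab" link above.
encoder = SubwordTextEncoder("vocab.ende.subwords")
print(encoder.vocab_size)  # expected to match input_vocab_size = 33300

sample = wmt14["validation"][0]["translation"]["en"]
ids = encoder.encode(sample)
print(encoder.decode(ids))  # should round-trip the sentence
```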