PyTorch/TensorFlow model code used by the 1st-place winner of the 2020 Jigsaw Multilingual Kaggle competition. Our solution is detailed in this Kaggle forum post.
This is one half of the overall solution, designed for and executed on local hardware. The other half, created by my teammate @rafiko1 and designed for and executed on Kaggle TPUv3 instances, can be found in a separate GitHub repo.
1. Bootstrap test-set predictions (pseudo-labels) for all languages using the XLM-Roberta-Large multilingual model
2. Combine pseudo-labels with training data, train a monolingual/multilingual model, and predict test-set samples in the relevant language(s)
3. Blend predictions from the current model with the previous ensemble predictions and update the pseudo-labels
4. Repeat steps 2 and 3 with various pretrained monolingual and multilingual models
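The loop above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: `train_and_predict` callables are hypothetical stand-ins for a full training run, predictions are kept as a plain `id -> probability` dictionary, and the blend is assumed to be a weighted average in which `ensemble_weight` is the share kept by the previous ensemble.

```python
from typing import Callable, Dict, Iterable

Predictions = Dict[int, float]  # test-set id -> predicted toxicity probability


def blend(previous: Predictions, current: Predictions, ensemble_weight: float) -> Predictions:
    """Weighted average of the previous ensemble and the current model.

    Assumption: ensemble_weight is the share given to the previous ensemble.
    Ids the current model did not predict (e.g. other languages) keep their
    previous value.
    """
    return {
        sample_id: ensemble_weight * prev_score
        + (1.0 - ensemble_weight) * current.get(sample_id, prev_score)
        for sample_id, prev_score in previous.items()
    }


def pseudo_label_loop(
    bootstrap: Predictions,
    models: Iterable[Callable[[Predictions], Predictions]],
    ensemble_weight: float = 0.5,
) -> Predictions:
    """Steps 2-4: train on the current pseudo-labels, predict, blend, repeat."""
    pseudo_labels = dict(bootstrap)
    for train_and_predict in models:  # hypothetical stand-in for a training run
        current = train_and_predict(pseudo_labels)
        pseudo_labels = blend(pseudo_labels, current, ensemble_weight)
    return pseudo_labels


# Toy usage: a "monolingual model" that only predicts sample 1.
bootstrap_labels = {1: 0.2, 2: 0.8}
toy_model = lambda labels: {1: 0.4}
final_labels = pseudo_label_loop(bootstrap_labels, [toy_model])
```

In the toy run, sample 1 moves halfway toward the new prediction while sample 2, which the model did not score, is left unchanged.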
We provide 3 sets of sample pseudo-labels (scoring on public LB: 9372, 9500, 9537) for you to test training monolingual and multilingual models against (refer to the Data and model files section below).
You can use the provided 9372 pseudo-labels to bootstrap your training process or alternatively, run the public Kaggle kernel (created by Kaggle user @xhlulu) to generate test-set predictions/pseudo-labels using a vanilla XLM-Roberta-Large multilingual model.
| Model | Comment |
|---|---|
| FastText BiGRU | monolingual RNN approach using non-contextualized FastText embeddings |
| HuggingFace Transformer | mono/multilingual Transformer approach |
| Helper modules | Comment |
|---|---|
| prepare_data | Generates the prerequisite train/test/validation data necessary for training |
| prepare_predictions | Blends current run predictions with the previous ensemble |
| preprocessor | Includes helper functions to extract raw strings and labels from training CSVs |
| postprocessor | Includes helper functions to ensemble multiple CSV predictions |
- HuggingFace models are downloaded directly via API so there is no need to manually download them.
- FastText monolingual models can be downloaded here. Our model code uses bin model files (e.g., cc.es.300.bin).
- Translations of the Toxic 2018 dataset and pseudo-labels for public LB 9372, public LB 9500, public LB 9537 (used as sample inputs to training) can be found here.
- A Dockerfile is provided that builds on an Nvidia-provided Docker image and installs the necessary Nvidia, system, and Python prerequisites. IMPORTANT: the Dockerfile installs an SSH server in the container with a default password (root/testdocker).
- Sample Docker build and run invocation: docker build -t jigsaw-multi . && docker run -p 8888:22 -p 8889:8888 -v /home/leecming/PycharmProjects/jigsaw-multilingual/data:/root/data -v /home/leecming/PycharmProjects/jigsaw-multilingual/models:/root/models -v /home/leecming/PycharmProjects/jigsaw-multilingual/notebooks:/root/notebooks --name jigsaw-multi jigsaw-multi
- Various functions read and write files, so we suggest mounting the relevant directories as container volumes to move files in and out of the container easily
- The container runs an SSH server and a Jupyter notebook server, so we suggest binding ports to enable external connections
- Training I/O configuration: update SETTINGS.json to point to locations of the 2018 training CSV, pseudo-labels CSV, and various other relevant paths.
- To configure which languages to generate training data for, update the LANG_LIST global variable in prepare_data.py (accepts one or more of the six test-set languages as ISO codes)
- For Transformer model settings (including which pretrained model to use), update the global variables in classifier_baseline.py
- For FastText classifier model settings (including which language model to use), update the global variables in classifier_bigru_fasttext_tf.py
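Before launching a run, it can help to confirm that the paths in SETTINGS.json actually exist. The sketch below is only an illustration: the helper names are hypothetical, and it assumes every string value in the file is a filesystem path, which may not hold for all keys.

```python
import json
import os
from typing import Dict, List


def load_settings(path: str = "SETTINGS.json") -> Dict[str, str]:
    """Read the JSON settings file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_missing_paths(settings: Dict[str, str]) -> List[str]:
    """Return the keys whose values do not exist on disk.

    Assumption: every string value in the file is a filesystem path.
    """
    return [
        key
        for key, value in settings.items()
        if isinstance(value, str) and not os.path.exists(value)
    ]


# Only runs the check when a settings file is present in the working directory.
if os.path.exists("SETTINGS.json"):
    for key in find_missing_paths(load_settings()):
        print(f"Warning: {key} points to a path that does not exist")
```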
Example 1: Running a monolingual Transformer model (Spanish)
- We use the pretrained Spanish model: mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es
- Update SETTINGS.json so that PSEUDO_LABELS_PATH and other paths are updated for your setup.
- Run prepare_data.py
- Run classifier_baseline.py
- Run prepare_predictions.py
A submission file should be generated at $TRAIN_DATA_DIR/curr_run_submission.csv. With this setup, we generated a submission scoring 9502 on the public LB. Because of training variability, your results may differ. To use the predictions as pseudo-labels for another model run, you'll have to merge the predicted labels with test.csv.
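The merge with test.csv could look like the following sketch. It uses only the standard library and assumes the competition's column layout (`id`, `content`, `lang` in test.csv; `id`, `toxic` in the submission file); the function name is hypothetical, and the format the training scripts expect for pseudo-labels should be checked against the provided sample files.

```python
import csv
import io
from typing import Dict, Iterable, List


def merge_predictions(
    test_rows: Iterable[Dict[str, str]],
    prediction_rows: Iterable[Dict[str, str]],
    label_column: str = "toxic",
) -> List[Dict[str, str]]:
    """Attach each predicted label to its test row, matching on 'id'.

    Test rows without a prediction are skipped.
    """
    labels = {row["id"]: row[label_column] for row in prediction_rows}
    return [
        {**row, label_column: labels[row["id"]]}
        for row in test_rows
        if row["id"] in labels
    ]


# Toy data standing in for test.csv and curr_run_submission.csv.
TEST_CSV = "id,content,lang\n0,hola,es\n1,merhaba,tr\n"
PRED_CSV = "id,toxic\n0,0.12\n1,0.87\n"

merged = merge_predictions(
    csv.DictReader(io.StringIO(TEST_CSV)),
    csv.DictReader(io.StringIO(PRED_CSV)),
)
```

With real files, replace the `io.StringIO` objects with opened file handles and write `merged` back out using `csv.DictWriter`.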
Example 2: Running a monolingual FastText BiGRU model (Spanish)
- We use the official pretrained FastText embeddings for Spanish (cc.es.300.bin)
- Update SETTINGS.json so that PSEUDO_LABELS_PATH and other paths are updated for your setup.
- Run prepare_data.py
- Run classifier_bigru_fasttext_tf.py
- Update prepare_predictions.py so that ENSEMBLE_WEIGHT=0.8 (give less weight to predictions from this model)
- Run prepare_predictions.py
A submission file should be generated at $TRAIN_DATA_DIR/curr_run_submission.csv. With this setup, we generated a submission scoring 9540 on the public LB. Because of training variability, your results may differ. To use the predictions as pseudo-labels for another model run, you'll have to merge the predicted labels with test.csv.
Example 3: Running a multilingual Transformer model (XLM-Roberta-Large) using public LB9372 pseudo-labels
- We use the pretrained multilingual model: xlm-roberta-large
- Update SETTINGS.json so that PSEUDO_LABELS_PATH and other paths are updated for your setup.
- Update prepare_data.py so that LANG_LIST = ['es', 'fr', 'it', 'pt', 'ru', 'tr'] and SAMPLE_FRAC=0.1
- Update classifier_baseline.py so that PRETRAINED_MODEL = 'xlm-roberta-large' and BASE_MODEL_OUTPUT_DIM=1024. You'll likely need to adjust ACCUM_FOR and BATCH_SIZE given the model size (we used ACCUM_FOR=2 and BATCH_SIZE=24 on a single RTX Titan)
- Run classifier_baseline.py
- Update prepare_predictions.py so that ENSEMBLE_WEIGHT=0.2 (give more weight to predictions from this model)
- Run prepare_predictions.py
A submission file should be generated at $TRAIN_DATA_DIR/curr_run_submission.csv. With this setup, we generated a submission scoring 9409 on the public LB. Because of training variability, your results may differ. To use the predictions as pseudo-labels for another model run, you'll have to merge the predicted labels with test.csv.
The code and container environment were run on a fairly heavyweight workstation (24C/48T Threadripper + 64GB RAM + 2x RTX Titans with 24GB GPU RAM each) using mixed-precision training and gradient accumulation. It may not be feasible to fine-tune larger models such as XLM-Roberta-Large on smaller machines, especially with entry-level Nvidia cards. You can adjust the BATCH_SIZE and ACCUM_FOR flags in classifier_baseline.py to fit in memory, but doing so may affect model performance.
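To illustrate the trade-off between BATCH_SIZE and ACCUM_FOR, here is a toy, pure-Python sketch of gradient accumulation on a one-parameter linear model. It is not the repository's training loop: the helper functions and data are hypothetical, and the exact equivalence shown holds only for a simple mean loss with equal-sized micro-batches. Real training with mixed precision or batch-dependent layers can behave differently.

```python
import math
from typing import List, Tuple

BATCH_SIZE = 24  # micro-batch size that fits in GPU memory
ACCUM_FOR = 2    # micro-batches accumulated before each optimizer step
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * ACCUM_FOR  # samples per optimizer step


def mse_gradient(weight: float, batch: List[Tuple[float, float]]) -> float:
    """Gradient of mean((weight * x - y) ** 2) with respect to weight."""
    return sum(2.0 * (weight * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(weight: float, batch: List[Tuple[float, float]], accum_for: int) -> float:
    """Sum micro-batch gradients, each scaled by 1 / accum_for."""
    micro_size = len(batch) // accum_for
    total = 0.0
    for step in range(accum_for):
        micro_batch = batch[step * micro_size:(step + 1) * micro_size]
        total += mse_gradient(weight, micro_batch) / accum_for
    return total


# Toy data: y = 3x, one full effective batch of samples.
data = [(float(i), 3.0 * i) for i in range(1, EFFECTIVE_BATCH_SIZE + 1)]
full_batch_grad = mse_gradient(0.5, data)
accumulated_grad = accumulated_gradient(0.5, data, ACCUM_FOR)
gradients_match = math.isclose(full_batch_grad, accumulated_grad, rel_tol=1e-9)
```

Halving BATCH_SIZE while doubling ACCUM_FOR keeps the effective batch size constant, at the cost of more forward and backward passes per optimizer step.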
The table below lists the pretrained HuggingFace Transformer models we used:
| Language | Models |
|---|---|
| All | xlm-roberta-large |
| French | camembert-base, camembert/camembert-large, flaubert/flaubert_large_cased |
| Italian | dbmdz/bert-base-italian-xxl-cased, dbmdz/bert-base-italian-xxl-uncased, m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0 |
| Portuguese | neuralmind/bert-large-portuguese-cased |
| Russian | DeepPavlov/rubert-base-cased-conversational |
| Spanish | dccuchile/bert-base-spanish-wwm-cased, dccuchile/bert-base-spanish-wwm-uncased, mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es |
| Turkish | dbmdz/bert-base-turkish-128k-cased, dbmdz/bert-base-turkish-128k-uncased, dbmdz/electra-base-turkish-cased-v0-discriminator |