GitHub - leecming82/jigsaw-multilingual: PyTorch code used by the 1st place winner of the 2020 Kaggle multilingual Toxic Comments competition

Introduction

PyTorch/Tensorflow model code used by the 1st place winner for the 2020 Jigsaw Multilingual Kaggle competition. Our solution is detailed in this Kaggle forum post.

This is "one-half" of the overall solution designed for and executed on local hardware. The other-half, created by my teammate @rafiko1 and designed for and executed on Kaggle TPUv3 instances, can be found at the GitHub repo: here.

Recipe for training:

Bootstrap test-set predictions (pseudo-labels) for all languages using the XLM-Roberta-Large multilingual model
Combine pseudo-labels with training data, train a monolingual/multilingual model, and predict test-set samples in the relevant language(s)
Blend predictions from the current model with the previous ensemble predictions - update pseudo-labels
Repeat steps 2 and 3 with various pretrained monolingual and multilingual models

We provide 3 sets of sample pseudo-labels (scoring on public LB: 9372, 9500, 9537) for you to test training monolingual and multilingual models against (refer to the Data and model files section below).

You can use the provided 9372 pseudo-labels to bootstrap your training process or alternatively, run the public Kaggle kernel (created by Kaggle user @xhlulu) to generate test-set predictions/pseudo-labels using a vanilla XLM-Roberta-Large multilingual model.

Code

Model	Comment
FastText BiGRU	monolingual RNN approach using non-contextualized FastText embeddings
HuggingFace Transformer	mono/multilingual Transformer approach

Helper modules	Comment
prepare_data	Generates the prerequisite train/test/validation data necessary for training
prepare_predictions	Blends current run predictions with the previous ensemble
preprocessor	Includes helper functions to extract raw strings and labels from training CSVs
postprocessor	Includes helper functions to ensemble multiple CSV predictions

Data and model files

HuggingFace models are downloaded directly via API so there is no need to manually download them.
FastText monolingual models can be downloaded here. Our model code uses bin model files (e.g., cc.es.300.bin).
Translations of the Toxic 2018 dataset and pseudo-labels for public LB 9372, public LB 9500, public LB 9537 (used as sample inputs to training) can be found here.

Setup

A DockerFile is provided which builds against an Nvidia-provided Docker image, and installs the necessary Nvidia, system, and Python prerequisites - IMPORTANT: the Dockerfile installs an SSH server in-container with a default password (root/testdocker).
Sample Docker build and run invocation: docker build -t jigsaw-multi . && docker run -p 8888:22 -p 8889:8888 -v /home/leecming/PycharmProjects/jigsaw-multilingual/data:/root/data -v /home/leecming/PycharmProjects/jigsaw-multilingual/models:/root/models -v /home/leecming/PycharmProjects/jigsaw-multilingual/notebooks:/root/notebooks --name jigsaw-multi jigsaw-multi
Various functions ingest and generate files - it is suggested that you mount them within container volumes to allow for smooth movement of files in & out of the container
With an SSH server and Jupyter notebook server within the container - it is suggested that you bind ports to enable external connections
Training I/O configuration: update SETTINGS.json to point to locations of the 2018 training CSV, pseudo-labels CSV, and various other relevant paths.
To configure which languages to generate training data for, update the LANG_LIST global variable in prepare_data.py (accepts 1 or more of the 6 test-set languages in ISO code)
For Transformer model settings (including which pretrained model to use), update the global variables in classifier_base.py
For FastText classifier model settings (including which language model to use), update the global variables in classifier_bigru_fasttext_tf.py

Example 1: Running a spanish monolingual Transformer model using public LB 9500 pseudo-labels

We use the pretrained spanish model: mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es
Update SETTINGS.json so that PSEUDO_LABELS_PATH and other paths are updated for your setup.
Run prepare_data.py
Run classifier_baseline.py
Run prepare_predictions.py

A submission file should be generated at $TRAIN_DATA_DIR/curr_run_submission.csv. For this, we were able to generate a 9502 public LB submission. Due to training variability, you may get different results. To use the predictions as pseudo-labels for another model run, note that you'll have to merge the predicted labels with test.csv.

Example 2: Running a spanish monolingual FastText model using public LB9537 pseudo-labels

We use the pretrained official FastText embeddings for spanish (cc.es.300.bin)
Update SETTINGS.json so that PSEUDO_LABELS_PATH and other paths are updated for your setup.
Run prepare_data.py
Run classifier_bigru_fasttext_tf.py
Update prepare_predictions.py so that ENSEMBLE_WEIGHT=0.8 (give less weight to predictions from this model)
Run prepare_predictions.py

A submission file should be generated at $TRAIN_DATA_DIR/curr_run_submission.csv. For this, we were able to generate a 9540 public LB submission. Due to training variability, you may get different results. To use the predictions as pseudo-labels for another model run, note that you'll have to merge the predicted labels with test.csv

Example 3: Running a multilingual Transformer model (XLM-Roberta-Large) using public LB9372 pseudo-labels

We use the pretrained multilingual model: xlm-roberta-large
Update SETTINGS.json so that PSEUDO_LABELS_PATH and other paths are updated for your setup.
Update prepare_data.py so that the LANG_LIST = ['es', 'fr', 'it', 'pt', 'ru', 'tr'] and SAMPLE_FRAC=0.1
Update classifier_base.py so that PRETRAINED_MODEL = 'xlm-roberta-large' and BASE_MODEL_OUTPUT_DIM=1024. You'll likely need to adjust the ACCUM_FLAG and BATCH_SIZE given the model size (ACCUM_FLAG=2 and BATCH_SIZE=24 on a single RTX Titan)
Run classifier_base.py
Update prepare_predictions.py so that ENSEMBLE_WEIGHT=0.2 (give more weight to predictions from this model)
Run prepare_predictions.py

A submission file should be generated at $TRAIN_DATA_DIR/curr_run_submission.csv. For this, we were able to generate a 9409 public LB submission. Due to training variability, you may get different results. To use the predictions as pseudo-labels for another model run, note that you'll have to merge the predicted labels with test.csv

Notes

The code and container environment were run on a fairly heavy-weight workstation (24C/48T Threadripper + 64GB RAM + 2x RTX Titans w/ 24GB GPU RAM each) and using mixed precision training + gradient accumulation. It may not be feasible to finetune larger models such as XLM-Roberta-Large on smaller-scale machines especially with entry-level Nvidia cards. You can adjust the BATCH_SIZE and ACCUM_FOR flags in classifier_baseline.py to fit in memory but it may impact model performance.
Below lists the various pretrained HuggingFace transformer models we used -

Language	Models
All	xlm-roberta-large
French	camembert-base, camembert/camembert-large, flaubert/flaubert_large_cased
Italian	dbmdz/bert-base-italian-xxl-cased, dbmdz/bert-base-italian-xxl-uncased, m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0
Portuguese	neuralmind/bert-large-portuguese-cased
Russian	DeepPavlov/rubert-base-cased-conversational
Spanish	dccuchile/bert-base-spanish-wwm-cased, dccuchile/bert-base-spanish-wwm-uncased, mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es
Turkish	dbmdz/bert-base-turkish-128k-cased, dbmdz/bert-base-turkish-128k-uncased, dbmdz/electra-base-turkish-cased-v0-discriminator

Name		Name	Last commit message	Last commit date
Latest commit History 106 Commits
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
SETTINGS.json		SETTINGS.json
classifier_baseline.py		classifier_baseline.py
classifier_bigru_fasttext_tf.py		classifier_bigru_fasttext_tf.py
entry_points.md		entry_points.md
postprocessor.py		postprocessor.py
prepare_data.py		prepare_data.py
prepare_predictions.py		prepare_predictions.py
preprocessor.py		preprocessor.py
swa.py		swa.py
torch_helpers.py		torch_helpers.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Recipe for training:

Code

Data and model files

Setup

Example 1: Running a spanish monolingual Transformer model using public LB 9500 pseudo-labels

Example 2: Running a spanish monolingual FastText model using public LB9537 pseudo-labels

Example 3: Running a multilingual Transformer model (XLM-Roberta-Large) using public LB9372 pseudo-labels

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Introduction

Recipe for training:

Code

Data and model files

Setup

Example 1: Running a spanish monolingual Transformer model using public LB 9500 pseudo-labels

Example 2: Running a spanish monolingual FastText model using public LB9537 pseudo-labels

Example 3: Running a multilingual Transformer model (XLM-Roberta-Large) using public LB9372 pseudo-labels

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages