Open solution to the Toxic Comment Classification Challenge

Overview

Starter code: Kaggle Toxic Comment Classification Challenge

More competitions 🎇

Check collection of public projects 🎁 , where you can find multiple Kaggle competitions with code, experiments and outputs.

Here, at Neptune we enjoy participating in the Kaggle competitions. Toxic Comment Classification Challenge is especially interesting because it touches important issue of online harassment.

Ensemble our predictions in the cloud!

You need to be registered to neptune.ml to be able to use our predictions for your ensemble models.

  • click start notebook
  • choose browse button
  • select the neptune_ensembling.ipynb file from this repository.
  • choose worker type: gcp-large is the recommended one.
  • run first few cells to load our predictions on the held out validation set along with the labels
  • grid search over many possible parameter options. The more runs you choose the longer it will run.
  • train your second level, ensemble model (it should take less than an hour once you have the parameters)
  • load our predictions on the test set
  • feed our test set predictions to your ensemble model and get final predictions
  • save your submission file
  • click on browse files and find your submission file to download it.

Running the notebook as is got 0.986+ on the LB.

Disclaimer

In this open source solution you will find references to the neptune.ml. It is free platform for community Users, which we use daily to keep track of our experiments. Please note that using neptune.ml is not necessary to proceed with this solution. You may run it as plain Python script 😉 .

The idea

We are contributing starter code that is easy to use and extend. We did it before with Cdiscount’s Image Classification Challenge and we believe that it is correct way to open data science to the wider community and encourage more people to participate in Challenges. This starter is ready-to-use end-to-end solution. Since all computations are organized in separate steps, it is also easy to extend. Check devbook.ipynb for more information about different pipelines.

Now we want to go one step further and invite you to participate in the development of this analysis pipeline. At the later stage of the competition (early February) we will invite top contributors to join our team on Kaggle.

Contributing

You are welcome to extend this pipeline and contribute your own models or procedures. Please refer to the CONTRIBUTING for more details.

Installation

option 1: Neptune cloud

on the neptune site

  • log in: neptune accound login
  • create new project named toxic: Follow the link Projects (top bar, left side), then click New project button. This action will generate project-key TOX, which is already listed in the neptune.yaml.

run setup commands

$ git clone https://github.com/neptune-ml/kaggle-toxic-starter.git
$ pip3 install neptune-cli
$ neptune login

start experiment

$ neptune send --environment keras-2.0-gpu-py3 --worker gcp-gpu-medium --config best_configs/fasttext_gru.yaml -- train_evaluate_predict_cv_pipeline --pipeline_name fasttext_gru --model_level first

This should get you to 0.9852 Happy Training :)

Refer to Neptune documentation and Getting started: Neptune Cloud for more.

option 2: local install

Please refer to the Getting started: local instance for installation procedure.

Solution visualization

Below end-to-end pipeline is visualized. You can run exactly this one! pipeline_001

We have also prepared something simpler to just get you started:

pipeline_002

User support

There are several ways to seek help:

  1. Read project's Wiki, where we publish descriptions about the code, pipelines and neptune.
  2. Kaggle discussion is our primary way of communication.
  3. You can submit an issue directly in this repo.
Awesome-NLP-Research (ANLP)

Awesome-NLP-Research (ANLP)

Language, Information, and Learning at Yale 72 Dec 19, 2022
🌐 Translation microservice powered by AI

Dot Translate 🌐 A microservice for quick and local translation using A.I. This service starts a local webserver used for neural machine translation.

Dot HQ 48 Nov 22, 2022
Mednlp - Medical natural language parsing and utility library

Medical natural language parsing and utility library A natural language medical

Paul Landes 3 Aug 24, 2022
Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

Tensor2Tensor Tensor2Tensor, or T2T for short, is a library of deep learning models and datasets designed to make deep learning more accessible and ac

12.9k Jan 07, 2023
simpleT5 is built on top of PyTorch-lightning⚡️ and Transformers🤗 that lets you quickly train your T5 models.

Quickly train T5 models in just 3 lines of code + ONNX support simpleT5 is built on top of PyTorch-lightning ⚡️ and Transformers 🤗 that lets you quic

Shivanand Roy 220 Dec 30, 2022
An Open-Source Package for Neural Relation Extraction (NRE)

OpenNRE We have a DEMO website (http://opennre.thunlp.ai/). Try it out! OpenNRE is an open-source and extensible toolkit that provides a unified frame

THUNLP 3.9k Jan 03, 2023
Practical Machine Learning with Python

Master the essential skills needed to recognize and solve complex real-world problems with Machine Learning and Deep Learning by leveraging the highly popular Python Machine Learning Eco-system.

Dipanjan (DJ) Sarkar 2k Jan 08, 2023
A python script that will use hydra to get user and password to login to ssh, ftp, and telnet

Hydra-Auto-Hack A python script that will use hydra to get user and password to login to ssh, ftp, and telnet Project Description This python script w

2 Jan 16, 2022
(ACL 2022) The source code for the paper "Towards Abstractive Grounded Summarization of Podcast Transcripts"

Towards Abstractive Grounded Summarization of Podcast Transcripts We provide the source code for the paper "Towards Abstractive Grounded Summarization

10 Jul 01, 2022
A list of NLP(Natural Language Processing) tutorials built on Tensorflow 2.0.

A list of NLP(Natural Language Processing) tutorials built on Tensorflow 2.0.

Won Joon Yoo 335 Jan 04, 2023
CorNet Correlation Networks for Extreme Multi-label Text Classification

CorNet Correlation Networks for Extreme Multi-label Text Classification Prerequisites python==3.6.3 pytorch==1.2.0 torchgpipe==0.0.5 click==7.0 ruamel

Guangxu Xun 38 Dec 31, 2022
Simple GUI where you can enter an article and get a crisp summarized version.

Text-Summarization-using-TextRank-BART Simple GUI where you can enter an article and get a crisp summarized version. How to run: Clone the repo Instal

Rohit P 4 Sep 28, 2022
Turkish Stop Words Türkçe Dolgu Sözcükleri

trstop Turkish Stop Words Türkçe Dolgu Sözcükleri In this repository I put Turkish stop words that is contained in the first 10 thousand words with th

Ahmet Aksoy 103 Nov 12, 2022
Code for EmBERT, a transformer model for embodied, language-guided visual task completion.

Code for EmBERT, a transformer model for embodied, language-guided visual task completion.

41 Jan 03, 2023
Using context-free grammar formalism to parse English sentences to determine their structure to help computer to better understand the meaning of the sentence.

Sentance Parser Executing the Program Make sure Python 3.6+ is installed. Install requirements $ pip install requirements.txt Run the program:

Vaibhaw 12 Sep 28, 2022
Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART Code pre-release of our work, Unified Pre-training for Program Understanding and Generation accepted at NAACL 2021. Note. A detailed documentat

Wasi Ahmad 138 Dec 30, 2022
Code for the project carried out fulfilling the course requirements for Fall 2021 NLP at NYU

Introduction Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization,

Sai Himal Allu 1 Apr 25, 2022
🚀Clone a voice in 5 seconds to generate arbitrary speech in real-time

English | 中文 Features 🌍 Chinese supported mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, data_aishell, and etc. ?

Vega 25.6k Dec 31, 2022
LeBenchmark: a reproducible framework for assessing SSL from speech

LeBenchmark: a reproducible framework for assessing SSL from speech

11 Nov 30, 2022
Large-scale open domain KNOwledge grounded conVERsation system based on PaddlePaddle

Knover Knover is a toolkit for knowledge grounded dialogue generation based on PaddlePaddle. Knover allows researchers and developers to carry out eff

606 Dec 28, 2022