Winning solution for the Galaxy Challenge on Kaggle

Overview

kaggle-galaxies

Winning solution for the Galaxy Challenge on Kaggle (http://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge).

Documentation about the method and the code is available in doc/documentation.pdf. Information on how to generate the solution file can also be found below.

Generating the solution

Install the dependencies

Instructions for installing Theano and getting it to run on the GPU can be found here. It should be possible to install NumPy, SciPy, scikit-image and pandas using pip or easy_install. To install pylearn2, simply run:

git clone git://github.com/lisa-lab/pylearn2.git

and add the resulting directory to your PYTHONPATH.

The optional dependencies listed in the documentation don't have to be installed to reproduce the winning solution: the generated data files are already provided, so they don't have to be regenerated (but of course you can if you want to). If you want to install them, please refer to their respective documentation.

Download the code

To download the code, run:

git clone git://github.com/benanne/kaggle-galaxies.git

A bunch of data files (extracted sextractor parameters, IDs files, training labels in NumPy format, ...) are also included. I decided to include these since generating them is a bit tedious and requires extra dependencies. It's about 20MB in total, so depending on your connection speed it could take a minute. Cloning the repository should also create the necessary directory structure (see doc/documentation.pdf for more info).

Download the training data

Download the data files from Kaggle. Place and extract the files in the following locations:

  • data/raw/training_solutions_rev1.csv
  • data/raw/images_train_rev1/*.jpg
  • data/raw/images_test_rev1/*.jpg

Note that the zip file with the training images is called images_training_rev1.zip, but they should go in a directory called images_train_rev1. This is just for consistency.

Create data files

This step may be skipped. The necessary data files have been included in the git repository. Nevertheless, if you wish to regenerate them (or make changes to how they are generated), here's how to do it.

  • create data/train_ids.npy by running python create_train_ids_file.py.
  • create data/test_ids.npy by running python create_test_ids_file.py.
  • create data/solutions_train.npy by running python convert_training_labels_to_npy.py.
  • create data/pysex_params_extra_*.npy.gz by running python extract_pysex_params_extra.py.
  • create data/pysex_params_gen2_*.npy.gz by running python extract_pysex_params_gen2.py.

Copy data to RAM

Copy the train and test images to /dev/shm by running:

python copy_data_to_shm.py

If you don't want to do this, you'll need to modify the realtime_augmentation.py file in a few places. Please refer to the documentation for more information.

Train the networks

To train the best single model, run:

python try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py

On a GeForce GTX 680, this took about 67 hours to run to completion. The prediction file generated by this script, predictions/final/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.csv.gz, should get you a score that's good enough to land in the #1 position (without any model averaging). You can similarly run the other try_*.py scripts to train the other models I used in the winning ensemble.

If you have more than 2GB of GPU memory, I recommend disabling Theano's garbage collector with allow_gc=False in your .theanorc file or in the THEANO_FLAGS environment variable, for a nice speedup. Please refer to the Theano documentation for more information on how to get the most out Theano's GPU support.

Generate augmented predictions

To generate predictions which are averaged across multiple transformations of the input, run:

python predict_augmented_npy_maxout2048_extradense.py

This takes just over 4 hours on a GeForce GTX 680, and will create two files predictions/final/augmented/valid/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.npy.gz and predictions/final/augmented/test/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.npy.gz. You can similarly run the corresponding predict_augmented_npy_*.py files for the other models you trained.

Blend augmented predictions

To generate blended prediction files from all the models for which you generated augmented predictions, run:

python ensemble_predictions_npy.py

The script checks which files are present in predictions/final/augmented/test/ and uses this to determine the models for which predictions are available. It will create three files:

  • predictions/final/blended/blended_predictions_uniform.npy.gz: uniform blend.
  • predictions/final/blended/blended_predictions.npy.gz: weighted linear blend.
  • predictions/final/blended/blended_predictions_separate.npy.gz: weighted linear blend, with separate weights for each question.

Convert prediction file to CSV

Finally, in order to prepare the predictions for submission, the prediction file needs to be converted from .npy.gz format to .csv.gz. Run the following to do so (or similarly for any other prediction file in .npy.gz format):

python create_submission_from_npy.py predictions/final/blended/blended_predictions_uniform.npy.gz

Submit predictions

Submit the file predictions/final/blended/blended_predictions_uniform.csv.gz on Kaggle to get it scored. Note that the process of generating this file involves considerable randomness: the weights of the networks are initialised randomly, the training data for each chunk is randomly selected, ... so I cannot guarantee that you will achieve the same score as I did. I did not use fixed random seeds. This might not have made much of a difference though, since different GPUs and CUDA toolkit versions will also introduce different rounding errors.

Owner
Sander Dieleman
Sander Dieleman
Factorization machines in python

Factorization Machines in Python This is a python implementation of Factorization Machines [1]. This uses stochastic gradient descent with adaptive re

Corey Lynch 892 Jan 03, 2023
Machine learning template for projects based on sklearn library.

Machine learning template for projects based on sklearn library.

Janez Lapajne 17 Oct 28, 2022
Deploy AutoML as a service using Flask

AutoML Service Deploy automated machine learning (AutoML) as a service using Flask, for both pipeline training and pipeline serving. The framework imp

Chris Rawles 221 Nov 04, 2022
A Streamlit demo to interactively visualize Uber pickups in New York City

Streamlit Demo: Uber Pickups in New York City A Streamlit demo written in pure Python to interactively visualize Uber pickups in New York City. View t

Streamlit 230 Dec 28, 2022
jaxfg - Factor graph-based nonlinear optimization library for JAX.

Factor graphs + nonlinear optimization in JAX

Brent Yi 134 Dec 21, 2022
CyLP is a Python interface to COIN-OR’s Linear and mixed-integer program solvers (CLP, CBC, and CGL)

CyLP CyLP is a Python interface to COIN-OR’s Linear and mixed-integer program solvers (CLP, CBC, and CGL). CyLP’s unique feature is that you can use i

COIN-OR Foundation 161 Dec 14, 2022
Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and many other libraries. Documenta

2.5k Jan 07, 2023
A collection of machine learning examples and tutorials.

machine_learning_examples A collection of machine learning examples and tutorials.

LazyProgrammer.me 7.1k Jan 01, 2023
Short PhD seminar on Machine Learning Security (Adversarial Machine Learning)

Short PhD seminar on Machine Learning Security (Adversarial Machine Learning)

141 Dec 27, 2022
Firebase + Cloudrun + Machine learning

A simple end to end consumer lending decision engine powered by Google Cloud Platform (firebase hosting and cloudrun)

Emmanuel Ogunwede 8 Aug 16, 2022
QML: A Python Toolkit for Quantum Machine Learning

QML is a Python2/3-compatible toolkit for representation learning of properties of molecules and solids.

176 Dec 09, 2022
CorrProxies - Optimizing Machine Learning Inference Queries with Correlative Proxy Models

CorrProxies - Optimizing Machine Learning Inference Queries with Correlative Proxy Models

ZhihuiYangCS 8 Jun 07, 2022
Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way

Apache Liminals goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validat

The Apache Software Foundation 121 Dec 28, 2022
MiniTorch - a diy teaching library for machine learning engineers

This repo is the full student code for minitorch. It is designed as a single repo that can be completed part by part following the guide book. It uses

1.1k Jan 07, 2023
Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Python Extreme Learning Machine (ELM) Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Augusto Almeida 84 Nov 25, 2022
Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by

Robustness Gym 115 Dec 12, 2022
BioPy is a collection (in-progress) of biologically-inspired algorithms written in Python

BioPy is a collection (in-progress) of biologically-inspired algorithms written in Python. Some of the algorithms included are mor

Jared M. Smith 40 Aug 26, 2022
Winning solution for the Galaxy Challenge on Kaggle

Winning solution for the Galaxy Challenge on Kaggle

Sander Dieleman 483 Jan 02, 2023
Required for a machine learning pipeline data preprocessing and variable engineering script needs to be prepared

Feature-Engineering Required for a machine learning pipeline data preprocessing and variable engineering script needs to be prepared. When the dataset

kemalgunay 5 Apr 21, 2022
A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.

Machine Learning Notebooks, 3rd edition This project aims at teaching you the fundamentals of Machine Learning in python. It contains the example code

Aurélien Geron 1.6k Jan 05, 2023