ml4ir: Machine Learning for Information Retrieval

Overview

CircleCI | changelog

Quickstart → ml4ir Read the Docs | ml4ir pypi | python ReadMe

ml4ir is an open source library for training and deploying deep learning models for search applications. ml4ir is built on top of python3 and tensorflow 2.x for training and evaluation. It also comes packaged with scala utilities for JVM inference.

ml4ir is designed as modular subcomponents which can each be combined and customized to build a variety of search ML models such as:

  • Learning to Rank
  • Query Auto Completion
  • Document Classification
  • Query Classification
  • Named Entity Recognition
  • Top Results
  • Query2SQL
  • add your application here

Motivation

Search is a complex data space with lots of different types of ML tasks working on a combination of structured and unstructured data sources. There existed no single library that

  • provides an end-to-end training and serving solution for a variety of search applications
  • allows training of models with limited coding expertise
  • allows easy customization to build complex models to tackle the search domain
  • focuses on performance and robustness
  • enables fast prototyping

So, we built ml4ir to do all of the above.

Guiding Principles

Customizable Library

First, we want ml4ir to be an easy-to-use and highly customizable library so that you can build the search application you need. ml4ir allows each of its subcomponents to be overridden, or mixed and matched with other custom modules, to create and deploy models.

Configurable Toolkit

While ml4ir can be used as a library, it also comes prepackaged with all the popular search based losses, metrics, embeddings, layers, etc. to enable someone with limited tensorflow expertise to quickly load their training data and train models for the task of interest. ml4ir achieves this by following a hybrid approach which allows each subcomponent to be completely controlled through configurations alone. Most search based ML applications can be built this way.
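As a rough illustration of the configuration-driven idea, the sketch below resolves a loss function from a config entry through a small registry, and shows how a custom component could be plugged in alongside the packaged ones. The names `LOSS_REGISTRY`, `register_loss`, `build_loss`, and the config keys are hypothetical and are not ml4ir's actual API.

```python
import math
from typing import Callable, Dict, List

# Hypothetical registry mapping config keys to loss functions.
# This illustrates the pattern only; it is not ml4ir's API.
LossFn = Callable[[List[float], List[float]], float]
LOSS_REGISTRY: Dict[str, LossFn] = {}


def register_loss(key: str) -> Callable[[LossFn], LossFn]:
    """Decorator that makes a loss selectable by name from a config."""
    def decorator(fn: LossFn) -> LossFn:
        LOSS_REGISTRY[key] = fn
        return fn
    return decorator


@register_loss("sigmoid_cross_entropy")
def sigmoid_cross_entropy(labels: List[float], logits: List[float]) -> float:
    """Mean binary cross entropy computed from raw scores (logits)."""
    total = 0.0
    for y, z in zip(labels, logits):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(labels)


@register_loss("mean_squared_error")
def mean_squared_error(labels: List[float], scores: List[float]) -> float:
    """Mean squared difference between labels and scores."""
    return sum((y - s) ** 2 for y, s in zip(labels, scores)) / len(labels)


def build_loss(model_config: dict) -> LossFn:
    """Look up the loss named in the (hypothetical) model config."""
    key = model_config["loss"]["key"]
    if key not in LOSS_REGISTRY:
        raise KeyError(f"unknown loss key: {key}")
    return LOSS_REGISTRY[key]


# Switching the loss only requires editing the config, not the training code.
config = {"loss": {"key": "mean_squared_error"}}
loss_fn = build_loss(config)
print(loss_fn([1.0, 0.0], [0.5, 0.5]))
```

A user-defined loss registered with the same decorator becomes selectable from the config in the same way, which is the library-style customization described under Customizable Library.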

Performance First

ml4ir is built using the TFRecord data pipeline, which is the recommended data format for tensorflow data loading. We combine ml4ir's high configurability with out of the box tensorflow data optimization utilities to define model features and build a data pipeline that easily allows training on huge amounts of data. ml4ir also comes packaged with utilities to convert data from CSV and libsvm format to TFRecord.
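To make the CSV conversion step more concrete, here is a minimal sketch of one common layout for ranking data: rows are grouped per query into query-level (context) features and per-document (sequence) features before serialization. The column names and the `group_rows_by_query` helper are assumptions made for illustration; the actual serialization to TFRecord is omitted, and this is not ml4ir's own converter.

```python
import csv
import io
from collections import OrderedDict

# Hypothetical CSV with one row per (query, document) pair.
CSV_TEXT = """query_id,query_text,doc_id,popularity,clicked
q1,running shoes,d1,0.9,1
q1,running shoes,d2,0.4,0
q2,rain jacket,d3,0.7,0
q2,rain jacket,d4,0.2,1
q2,rain jacket,d5,0.1,0
"""

# Assumed split between query-level and document-level columns.
CONTEXT_COLUMNS = ["query_text"]
SEQUENCE_COLUMNS = ["doc_id", "popularity", "clicked"]


def group_rows_by_query(csv_file):
    """Group CSV rows into one record per query_id, preserving input order."""
    groups = OrderedDict()
    for row in csv.DictReader(csv_file):
        record = groups.setdefault(
            row["query_id"],
            {
                "context": {c: row[c] for c in CONTEXT_COLUMNS},
                "sequence": {c: [] for c in SEQUENCE_COLUMNS},
            },
        )
        # Append this document's values to the per-query feature lists.
        for column in SEQUENCE_COLUMNS:
            record["sequence"][column].append(row[column])
    return groups


records = group_rows_by_query(io.StringIO(CSV_TEXT))
for query_id, record in records.items():
    print(query_id, record["context"], len(record["sequence"]["doc_id"]))
```

Each grouped record would then be serialized as a single example, so that all documents for a query travel together through the data pipeline.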

Training-Serving Handshake

Because ml4ir is a common library for training and serving deep learning models, we can build tight integration and fault tolerance into the models that are trained. ml4ir also uses the same configuration files for both training and inference, keeping the end-to-end handshake clean. This allows users to easily plug any feature store (or Solr) into ml4ir's serving utilities to deploy models in their production environments.

Search Model Hub

The goal of ml4ir is to form a common hub for the most popular deep learning layers, losses, metrics, and embeddings used in the search domain. We've built ml4ir with a focus on quick prototyping with a wide variety of network architectures and optimizations. We encourage contributors to add to ml4ir's arsenal of search deep learning utilities as we continue to do so ourselves.

Continuous Integration

We use CircleCI for running tests. Both jvm and python tests will run on each commit and pull request. You can find both the CI pipelines here.

Unit tests can be run directly from a Python or Java IDE, or with the dedicated mvn or python commands.

For integration tests, run one of the following in the jvm directory:

  • mvn verify -Pintegration_tests after enabling your Python environment as described in the python README.md
  • or, if you prefer running the Python training in Docker, mvn verify -Pintegration_tests -DuseDocker

Alternatively, you can repurpose the e2e test to check the jvm inference against a custom directory with this command: mvn test -Dtest=TensorFlowInferenceIT#testRankingSavedModelBundleWithCSVData -DbundleLocation=/path/to/my/trained/model -DrunName=myRunName

Documentation

We use sphinx for ml4ir documentation. The documentation is hosted using Read the Docs at ml4ir.readthedocs.io/en/latest.

For python doc strings, please use the numpy docstring format specified here.

Comments
  • Reverting all changes from SequenceExample restructuring

    Reverting all the changes made in https://github.com/salesforce/ml4ir/pull/91

    Most of the reverts were straightforward. I got stuck on inconsistencies and on debugging weird shape errors, but in the end it is a clean swap.

    We should really re-revert this once we add the end-to-end tests, because this is the intended use of SequenceExamples and it also gives us more freedom to play with the tensor shapes.

    opened by lastmansleeping 10
  • trying to add new metric (TopKCategoricalAccuracy)

    This PR tries to add TOP_K_CATEGORICAL_ACCURACY (TopKCategoricalAccuracy from tf.keras.metrics)

    Unfortunately, the tests fail with incompatible tensor shapes, but I do not understand why, as CategoricalAccuracy and TopKCategoricalAccuracy have the same APIs.

    The error I get:

    ValueError: Shape must be rank 2 but is rank 3 for 'in_top_k/InTopKV2' (op: 'InTopKV2') with input shapes: [?,1,9], [?,?], [].

    The thing is, TopKCategoricalAccuracy has exactly the same API as CategoricalAccuracy.

    I cannot understand how one can succeed and the other fail. As a matter of fact, my implementation follows exactly the implementation of CategoricalAccuracy.

    If anyone can help, it is appreciated.
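One possible reading of the error, offered as an assumption rather than something confirmed in this thread: the predictions reach the metric with shape [batch, 1, classes], and a top-k membership check expects rank-2 predictions, whereas comparing argmax values works at any rank. The plain-Python sketch below mimics that situation with nested lists and shows that squeezing the middle axis first makes a top-k computation well defined. The helper names are hypothetical and this is not the TensorFlow implementation.

```python
def squeeze_middle_axis(batch):
    """Turn a [batch, 1, classes] nested list into [batch, classes]."""
    squeezed = []
    for item in batch:
        if len(item) != 1:
            raise ValueError("expected a middle axis of size 1")
        squeezed.append(item[0])
    return squeezed


def top_k_categorical_accuracy(y_true, y_pred, k):
    """Fraction of rows whose true class is among the k highest scores.

    Both inputs are rank-2 nested lists of shape [batch, classes];
    y_true is one-hot encoded.
    """
    hits = 0
    for true_row, pred_row in zip(y_true, y_pred):
        true_class = true_row.index(max(true_row))
        ranked = sorted(range(len(pred_row)), key=lambda i: pred_row[i], reverse=True)
        hits += true_class in ranked[:k]
    return hits / len(y_true)


# Rank-3 inputs, analogous to the [?, 1, 9] shape in the error message.
y_true_rank3 = [[[0, 0, 1]], [[0, 0, 1]]]
y_pred_rank3 = [[[0.5, 0.1, 0.4]], [[0.2, 0.7, 0.1]]]

# Squeeze to rank 2 before computing the metric.
y_true = squeeze_middle_axis(y_true_rank3)
y_pred = squeeze_middle_axis(y_pred_rank3)
print(top_k_categorical_accuracy(y_true, y_pred, k=2))
```

In TensorFlow the analogous step would be to reshape or squeeze the tensors inside the metric before the top-k check, if the shapes are indeed the cause.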

    opened by balikasg 10
  • Allowing the user to save a model bundle and vocabulary files

    I am adding code to create the necessary files when training a model for deployment. Essentially, apart from the SUCCESS/FAILURE files, we create a model bundle and also pack the files used for vocabulary lookups.

    opened by balikasg 5
  • Adding cyclic learning rate schedule @rev lastmansleeping@

    • Adding support for a cyclic learning rate schedule.
    • Refactoring how to create the optimizer by moving both the optimizer and learning rate schedulers to the model config file.
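For readers unfamiliar with the idea, a cyclic schedule moves the learning rate back and forth between a lower and an upper bound as training progresses. The sketch below implements the common triangular policy in plain Python; the function name and parameters are illustrative assumptions and do not describe the schedule or the config keys added in this PR.

```python
import math


def triangular_cyclic_lr(step, base_lr, max_lr, step_size):
    """Triangular cyclic learning rate.

    The rate rises linearly from base_lr to max_lr over step_size steps,
    then falls back to base_lr over the next step_size steps, and repeats.
    """
    cycle = math.floor(1 + step / (2 * step_size))
    # Position within the current cycle: 1 at the ends, 0 at the peak.
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# One full cycle spans 2 * step_size = 8 steps in this example.
for step in range(0, 9, 2):
    print(step, triangular_cyclic_lr(step, base_lr=0.001, max_lr=0.01, step_size=4))
```

Keeping the schedule parameters next to the optimizer settings in a config file means the policy can be changed without touching the training code.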
    cla:signed 
    opened by mohazahran 4
  • Scala yaml parsing

    scala makes simple data structs very concise, and it plays nicely with jackson, including the nice:

    @JsonIgnoreProperties(ignoreUnknown = true)
    

    annotation!

    cla:signed 
    opened by jakemannix 4
  • Defining a serving signature with inference specific parsing_fn

    Things covered in the PR:

    • create dummy values for fields that are not required
    • do not pad records for inference
    • regenerate the models with an updated serving signature (without retraining)
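The first item can be pictured with a small sketch: at inference time, features that only exist during training (such as a label) are filled with dummy values, while features the model truly needs must be present in the request. The config keys and the `parse_for_inference` helper are hypothetical and are not the parsing function introduced in this PR.

```python
# Hypothetical feature config; the keys are illustrative, not ml4ir's schema.
FEATURE_CONFIG = [
    {"name": "query_text", "required_at_inference": True, "default": ""},
    {"name": "popularity", "required_at_inference": True, "default": 0.0},
    # The label only exists at training time, so serving fills in a dummy value.
    {"name": "clicked", "required_at_inference": False, "default": 0.0},
]


def parse_for_inference(raw_record, feature_config):
    """Build a complete feature dict from a serving request.

    Required features must be present; optional ones fall back to defaults.
    """
    parsed = {}
    for feature in feature_config:
        name = feature["name"]
        if name in raw_record:
            parsed[name] = raw_record[name]
        elif feature["required_at_inference"]:
            raise KeyError(f"missing required feature: {name}")
        else:
            parsed[name] = feature["default"]
    return parsed


request = {"query_text": "rain jacket", "popularity": 0.7}
print(parse_for_inference(request, FEATURE_CONFIG))
```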
    cla:signed 
    opened by lastmansleeping 4
  • Adding LinearRankingModel @rev asaintrequier@

    This PR adds the changes needed to train (no changes necessary) and save a simple one layer linear model for ranking. The PR also contains accompanying config files and unit tests.

    opened by lastmansleeping 3
  • Enabling Spark based HDFS I/O for running ml4ir @rev balikasg@

    The PR includes changes required to read the following:

    • data (any directory -> copied over to local FS)
    • feature config + model config (any YAML/JSON file)
    • dataframes (e.g., vocabulary files)

    and writing the following back to HDFS:

    • models
    • logs, metrics, etc
    cla:signed 
    opened by lastmansleeping 3
  • updates of tf-writer

    This PR updates the code needed for writing TFRecords from CSV to files. I started diving into the library while trying to write TFRecords for the iris dataset (see also the notebook). The current version of the code has a few bugs that I try to fix here. The same goes for the notebook example PointwiseRankingDemo.ipynb, which does not run anymore.

    My idea, if these are merged, is to have a few such notebooks with public data that document the process. Also, my first PR to ml4ir!!

    opened by balikasg 3
  • Bump tensorflow from 2.0.4 to 2.9.3 in /python

    Bumps tensorflow from 2.0.4 to 2.9.3.

    dependencies python 
    opened by dependabot[bot] 0
  • Bump jackson-databind from 2.10.0 to 2.12.7.1 in /jvm

    Bumps jackson-databind from 2.10.0 to 2.12.7.1.

    dependencies java 
    opened by dependabot[bot] 0
  • Bump pyspark from 3.0.1 to 3.2.2 in /python

    Bumps pyspark from 3.0.1 to 3.2.2.

    dependencies python 
    opened by dependabot[bot] 0
  • Bump numpy from 1.18.5 to 1.22.0 in /python

    Bumps numpy from 1.18.5 to 1.22.0.

    dependencies python 
    opened by dependabot[bot] 0
Releases(v0.1.11)
Owner
Salesforce
A variety of vendor agnostic projects which power Salesforce