Kroomsa: A search engine for the curious

Overview

Kroomsa

Kroomsa

A search engine for the curious. It is a search algorithm designed to engage users by exposing them to relevant yet interesting content during their session.

Description

The search algorithm implemented in your website greatly influences visitor engagement. A decent implementation can significantly reduce dependency on standard search engines like Google for every query thus, increasing engagement. Traditional methods look at terms or phrases in your query to find relevant content based on syntactic matching. Kroomsa uses semantic matching to find content relevant to your query. There is a blog post expanding upon Kroomsa's motivation and its technical aspects.

Getting Started

Prerequisites

  • Python 3.6.5
  • Run the project directory setup: python3 ./setup.py in the root directory.
  • Tensorflow's Universal Sentence Encoder 4
    • The model is available at this link. Download the model and extract the zip file in the /vectorizer directory.
  • MongoDB is used as the database to collate Reddit's submissions. MongoDB can be installed following this link.
  • To fetch comments of the reddit submissions, PRAW is used. To scrape credentials are needed that authorize the script for the same. This is done by creating an app associated with a reddit account by following this link. For reference you can follow this tuorial written by Shantnu Tiwari.
    • Register multiple instances and retrieve their credentials, then add them to the /config under bot_codes parameter in the following format: "client_id client_secret user_agent" as list elements separated by ,.
  • Docker-compose (For dockerized deployment only): Install the latest version following this link.

Installing

  • Create a python environment and install the required packages for preprocessing using: python3 -m pip install -r ./preprocess_requirements.txt
  • Collating a dataset of Reddit submissions
    • Scraping posts
      • Pushshift's API is being used to fetch Reddit submissions. In the root directory, run the following command: python3 ./pre_processing/scraping/questions/scrape_questions.py. It launches a script that scrapes the subreddits sequentially till their inception and stores the submissions as JSON objects in /pre_processing/scraping/questions/scraped_questions. It then partitions the scraped submissions into as many equal parts as there are registered instances of bots.
    • Scraping comments
      • After populating the configuration with bot_codes, we can begin scraping the comments using the partitioned submission files created while scraping submissions. Using the following command: python3 ./pre_processing/scraping/comments/scrape_comments.py multiple processes are spawned that fetch comment streams simultaneously.
    • Insertion
      • To insert the submissions and associated comments, use the following commands: python3 ./pre_processing/db_insertion/insertion.py. It inserts the posts and associated comments in mongo.
      • To clean the comments and tag the posts that aren't public due to any reason, Run python3 ./post_processing/post_processing.py. Apart from cleaning, it also adds emojis to each submission object (This behavior is configurable).
  • Creating a FAISS Index
    • To create a FAISS index, run the following command: python3 ./index/build_index.py. By default, it creates an exhaustive IDMap, Flat index but is configurable through the /config.
  • Database dump (For dockerized deployment)
    • For dockerized deployment, a database dump is required in /mongo_dump. Use the following command at the root dir to create a database dump. mongodump --db database_name(default: red) --collection collection_name(default: questions) -o ./mongo_dump.

Execution

  • Local deployment (Using Gunicorn)
    • Create a python environment and install the required packages using the following command: python3 -m pip install -r ./inference_requirements.txt
    • A local instance of Kroomsa can be deployed using the following command: gunicorn -c ./gunicorn_config.py server:app
  • Dockerized demo
    • Set the demo_mode to True in /config.
    • Build images: docker-compose build
    • Deploy: docker-compose up

Authors

License

This project is licensed under the Apache License Version 2.0

An open-source Kazakh named entity recognition dataset (KazNERD), annotation guidelines, and baseline NER models.

Kazakh Named Entity Recognition This repository contains an open-source Kazakh named entity recognition dataset (KazNERD), named entity annotation gui

ISSAI 9 Dec 23, 2022
Code for the paper Open Sesame: Getting Inside BERT's Linguistic Knowledge.

Open Sesame This repository contains the code for the paper Open Sesame: Getting Inside BERT's Linguistic Knowledge. Credits We built the project on t

9 Jul 24, 2022
Evidential Softmax for Sparse Multimodal Distributions in Deep Generative Models

Evidential Softmax for Sparse Multimodal Distributions in Deep Generative Models Abstract Many applications of generative models rely on the marginali

Stanford Intelligent Systems Laboratory 9 Jun 06, 2022
A python library for face detection and features extraction based on mediapipe library

FaceAnalyzer A python library for face detection and features extraction based on mediapipe library Introduction FaceAnalyzer is a library based on me

Saifeddine ALOUI 14 Dec 30, 2022
Autoencoder - Reducing the Dimensionality of Data with Neural Network

autoencoder Implementation of the Reducing the Dimensionality of Data with Neural Network – G. E. Hinton and R. R. Salakhutdinov paper. Notes Aim to m

Jordan Burgess 13 Nov 17, 2022
HyDiff: Hybrid Differential Software Analysis

HyDiff: Hybrid Differential Software Analysis This repository provides the tool and the evaluation subjects for the paper HyDiff: Hybrid Differential

Yannic Noller 22 Oct 20, 2022
Third party Pytorch implement of Image Processing Transformer (Pre-Trained Image Processing Transformer arXiv:2012.00364v2)

ImageProcessingTransformer Third party Pytorch implement of Image Processing Transformer (Pre-Trained Image Processing Transformer arXiv:2012.00364v2)

61 Jan 01, 2023
Instance-based label smoothing for improving deep neural networks generalization and calibration

Instance-based Label Smoothing for Neural Networks Pytorch Implementation of the algorithm. This repository includes a new proposed method for instanc

Mohamed Maher 1 Aug 13, 2022
[arXiv] What-If Motion Prediction for Autonomous Driving ❓🚗💨

WIMP - What If Motion Predictor Reference PyTorch Implementation for What If Motion Prediction [PDF] [Dynamic Visualizations] Setup Requirements The W

William Qi 96 Dec 29, 2022
🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer"

SGLKT-VisDial Pytorch Implementation for the paper: Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer Gi-Cheon Kang, Junseok P

Gi-Cheon Kang 9 Jul 05, 2022
TorchMD-Net provides state-of-the-art graph neural networks and equivariant transformer neural networks potentials for learning molecular potentials

TorchMD-net TorchMD-Net provides state-of-the-art graph neural networks and equivariant transformer neural networks potentials for learning molecular

TorchMD 104 Jan 03, 2023
Machine Learning Models were applied to predict the mass of the brain based on gender, age ranges, and head size.

Brain Weight in Humans Variations of head sizes and brain weights in humans Kaggle dataset obtained from this link by Anubhab Swain. Image obtained fr

Anne Livia 1 Feb 02, 2022
This is an open source library implementing hyperbox-based machine learning algorithms

hyperbox-brain is a Python open source toolbox implementing hyperbox-based machine learning algorithms built on top of scikit-learn and is distributed

Complex Adaptive Systems (CAS) Lab - University of Technology Sydney 21 Dec 14, 2022
A font family with a great monospaced variant for programmers.

Fantasque Sans Mono A programming font, designed with functionality in mind, and with some wibbly-wobbly handwriting-like fuzziness that makes it unas

Jany Belluz 6.3k Jan 08, 2023
Chunkmogrify: Real image inversion via Segments

Chunkmogrify: Real image inversion via Segments Teaser video with live editing sessions can be found here This code demonstrates the ideas discussed i

David Futschik 112 Jan 04, 2023
Awesome Deep Graph Clustering is a collection of SOTA, novel deep graph clustering methods

ADGC: Awesome Deep Graph Clustering ADGC is a collection of state-of-the-art (SOTA), novel deep graph clustering methods (papers, codes and datasets).

yueliu1999 297 Dec 27, 2022
Train the HRNet model on ImageNet

High-resolution networks (HRNets) for Image classification News [2021/01/20] Add some stronger ImageNet pretrained models, e.g., the HRNet_W48_C_ssld_

HRNet 866 Jan 04, 2023
MonoScene: Monocular 3D Semantic Scene Completion

MonoScene: Monocular 3D Semantic Scene Completion MonoScene: Monocular 3D Semantic Scene Completion] [arXiv + supp] | [Project page] Anh-Quan Cao, Rao

298 Jan 08, 2023
Code for the ECCV2020 paper "A Differentiable Recurrent Surface for Asynchronous Event-Based Data"

A Differentiable Recurrent Surface for Asynchronous Event-Based Data Code for the ECCV2020 paper "A Differentiable Recurrent Surface for Asynchronous

Marco Cannici 21 Oct 05, 2022
Functional deep learning

Pipeline abstractions for deep learning. Full documentation here: https://lf1-io.github.io/padl/ PADL: is a pipeline builder for PyTorch. may be used

LF1 101 Nov 09, 2022