Kroomsa: A search engine for the curious

Overview

Kroomsa

Kroomsa

A search engine for the curious. It is a search algorithm designed to engage users by exposing them to relevant yet interesting content during their session.

Description

The search algorithm implemented in your website greatly influences visitor engagement. A decent implementation can significantly reduce dependency on standard search engines like Google for every query thus, increasing engagement. Traditional methods look at terms or phrases in your query to find relevant content based on syntactic matching. Kroomsa uses semantic matching to find content relevant to your query. There is a blog post expanding upon Kroomsa's motivation and its technical aspects.

Getting Started

Prerequisites

  • Python 3.6.5
  • Run the project directory setup: python3 ./setup.py in the root directory.
  • Tensorflow's Universal Sentence Encoder 4
    • The model is available at this link. Download the model and extract the zip file in the /vectorizer directory.
  • MongoDB is used as the database to collate Reddit's submissions. MongoDB can be installed following this link.
  • To fetch comments of the reddit submissions, PRAW is used. To scrape credentials are needed that authorize the script for the same. This is done by creating an app associated with a reddit account by following this link. For reference you can follow this tuorial written by Shantnu Tiwari.
    • Register multiple instances and retrieve their credentials, then add them to the /config under bot_codes parameter in the following format: "client_id client_secret user_agent" as list elements separated by ,.
  • Docker-compose (For dockerized deployment only): Install the latest version following this link.

Installing

  • Create a python environment and install the required packages for preprocessing using: python3 -m pip install -r ./preprocess_requirements.txt
  • Collating a dataset of Reddit submissions
    • Scraping posts
      • Pushshift's API is being used to fetch Reddit submissions. In the root directory, run the following command: python3 ./pre_processing/scraping/questions/scrape_questions.py. It launches a script that scrapes the subreddits sequentially till their inception and stores the submissions as JSON objects in /pre_processing/scraping/questions/scraped_questions. It then partitions the scraped submissions into as many equal parts as there are registered instances of bots.
    • Scraping comments
      • After populating the configuration with bot_codes, we can begin scraping the comments using the partitioned submission files created while scraping submissions. Using the following command: python3 ./pre_processing/scraping/comments/scrape_comments.py multiple processes are spawned that fetch comment streams simultaneously.
    • Insertion
      • To insert the submissions and associated comments, use the following commands: python3 ./pre_processing/db_insertion/insertion.py. It inserts the posts and associated comments in mongo.
      • To clean the comments and tag the posts that aren't public due to any reason, Run python3 ./post_processing/post_processing.py. Apart from cleaning, it also adds emojis to each submission object (This behavior is configurable).
  • Creating a FAISS Index
    • To create a FAISS index, run the following command: python3 ./index/build_index.py. By default, it creates an exhaustive IDMap, Flat index but is configurable through the /config.
  • Database dump (For dockerized deployment)
    • For dockerized deployment, a database dump is required in /mongo_dump. Use the following command at the root dir to create a database dump. mongodump --db database_name(default: red) --collection collection_name(default: questions) -o ./mongo_dump.

Execution

  • Local deployment (Using Gunicorn)
    • Create a python environment and install the required packages using the following command: python3 -m pip install -r ./inference_requirements.txt
    • A local instance of Kroomsa can be deployed using the following command: gunicorn -c ./gunicorn_config.py server:app
  • Dockerized demo
    • Set the demo_mode to True in /config.
    • Build images: docker-compose build
    • Deploy: docker-compose up

Authors

License

This project is licensed under the Apache License Version 2.0

BiSeNet based on pytorch

BiSeNet BiSeNet based on pytorch 0.4.1 and python 3.6 Dataset Download CamVid dataset from Google Drive or Baidu Yun(6xw4). Pretrained model Download

367 Dec 26, 2022
Implementation of Invariant Point Attention, used for coordinate refinement in the structure module of Alphafold2, as a standalone Pytorch module

Invariant Point Attention - Pytorch Implementation of Invariant Point Attention as a standalone module, which was used in the structure module of Alph

Phil Wang 113 Jan 05, 2023
Implementation of neural class expression synthesizers

NCES Implementation of neural class expression synthesizers (NCES) Installation Clone this repository: https://github.com/ConceptLengthLearner/NCES.gi

NeuralConceptSynthesis 0 Jan 06, 2022
Official implementation of "Open-set Label Noise Can Improve Robustness Against Inherent Label Noise" (NeurIPS 2021)

Open-set Label Noise Can Improve Robustness Against Inherent Label Noise NeurIPS 2021: This repository is the official implementation of ODNL. Require

Hongxin Wei 12 Dec 07, 2022
Repository for MuSiQue: Multi-hop Questions via Single-hop Question Composition

🎵 MuSiQue: Multi-hop Questions via Single-hop Question Composition This is the repository for our paper "MuSiQue: Multi-hop Questions via Single-hop

21 Jan 02, 2023
Based on Stockfish neural network(similar to LcZero)

MarcoEngine Marco Engine - interesnaya neyronnaya shakhmatnaya set', kotoraya ispol'zuyet metod samoobucheniya(dostizheniye khoroshoy igy putem proboy

Marcus Kemaul 4 Mar 12, 2022
Codes for [NeurIPS'21] You are caught stealing my winning lottery ticket! Making a lottery ticket claim its ownership.

You are caught stealing my winning lottery ticket! Making a lottery ticket claim its ownership Codes for [NeurIPS'21] You are caught stealing my winni

VITA 8 Nov 01, 2022
Official PyTorch implementation of Retrieve in Style: Unsupervised Facial Feature Transfer and Retrieval.

Retrieve in Style: Unsupervised Facial Feature Transfer and Retrieval PyTorch This is the PyTorch implementation of Retrieve in Style: Unsupervised Fa

60 Oct 12, 2022
Builds a LoRa radio frequency fingerprint identification (RFFI) system based on deep learning techiniques

This project builds a LoRa radio frequency fingerprint identification (RFFI) system based on deep learning techiniques.

20 Dec 30, 2022
[ICCV 2021] Official PyTorch implementation for Deep Relational Metric Learning.

Ranking Models in Unlabeled New Environments Prerequisites This code uses the following libraries Python 3.7 NumPy PyTorch 1.7.0 + torchivision 0.8.1

Borui Zhang 39 Dec 10, 2022
Home repository for the Regularized Greedy Forest (RGF) library. It includes original implementation from the paper and multithreaded one written in C++, along with various language-specific wrappers.

Regularized Greedy Forest Regularized Greedy Forest (RGF) is a tree ensemble machine learning method described in this paper. RGF can deliver better r

RGF-team 364 Dec 28, 2022
Multivariate Boosted TRee

Multivariate Boosted TRee What is MBTR MBTR is a python package for multivariate boosted tree regressors trained in parameter space. The package can h

SUPSI-DACD-ISAAC 61 Dec 19, 2022
A Home Assistant custom component for Lobe. Lobe is an AI tool that can classify images.

Lobe This is a Home Assistant custom component for Lobe. Lobe is an AI tool that can classify images. This component lets you easily use an exported m

Kendell R 4 Feb 28, 2022
Implementation of the paper ''Implicit Feature Refinement for Instance Segmentation''.

Implicit Feature Refinement for Instance Segmentation This repository is an official implementation of the ACM Multimedia 2021 paper Implicit Feature

Lufan Ma 17 Dec 28, 2022
My implementation of transformers related papers for computer vision in pytorch

vision_transformers This is my personnal repo to implement new transofrmers based and other computer vision DL models I am currenlty working without a

samsja 1 Nov 10, 2021
A custom DeepStack model that has been trained detecting ONLY the USPS logo

This repository provides a custom DeepStack model that has been trained detecting ONLY the USPS logo. This was created after I discovered that the Deepstack OpenLogo custom model I was using did not

Stephen Stratoti 9 Dec 27, 2022
Workshop Materials Delivered on 28/02/2022

intro-to-cnn-p1 Repo for hosting workshop materials delivered on 28/02/2022 Questions you will answer in this workshop Learning Objectives What are co

Beginners Machine Learning 5 Feb 28, 2022
A simple, clean TensorFlow implementation of Generative Adversarial Networks with a focus on modeling illustrations.

IllustrationGAN A simple, clean TensorFlow implementation of Generative Adversarial Networks with a focus on modeling illustrations. Generated Images

268 Nov 27, 2022
The missing CMake project initializer

cmake-init - The missing CMake project initializer Opinionated CMake project initializer to generate CMake projects that are FetchContent ready, separ

1k Jan 01, 2023