Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Note, this requires Jina>=2.2.4.

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find top-K similar sentences.

from jina import Document, DocumentArray, Flow

# load 
   
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# cut into non-empty sentences store in a DA
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           [email protected][I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:52628
	🔒 Private network:	192.168.178.31:52628
	🌐 Public address:	217.70.138.123:52628
⠹       DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

   
     *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

    
      *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

     
       production, promotion and distribution of Project Gutenberg-tm

      
        Pride and Prejudice

       
         By Jane Austen This eBook is for the use of anyone anywhere in the United States and 
        
          This eBook is for the use of anyone anywhere in the United States and 
         
           by the awkwardness of the application, and at length wholly 
          
            Elizabeth passed the chief of the night in her sister’s room, and 
           
             the happiest memories in the world. Nothing of the past was 
            
              charities and charitable donations in all 50 states of the United 
            
           
          
         
        
       
      
     
    
   

In practice, you can implement matching and storing via an indexer inside Flow. This example is only for demo purpose so any non-feature hashing related ops are implemented outside the Flow to avoid distraction.

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
A tool to brute force a gmail account. Use this tool to crack multiple accounts

A tool to brute force a gmail account. Use this tool to crack multiple accounts. This tool is developed to crack multiple accounts

Saad 12 Dec 30, 2022
Simple yara rule manager

Yara Manager A simple program to manage your yara ruleset in a (sqlite) database. Todos Search rules and descriptions Cluster rules in rulesets Enforc

Nils Kuhnert 65 Nov 17, 2022
MITMSDR for INDIAN ARMY cybersecurity hackthon

There mainly three things here: MITMSDR spectrum Manual reverse shell MITMSDR Installation Clone the project and run the setup file: ./setup One of th

2 Jul 26, 2022
Crypto Meta Extractor

Crypto Meta Extractor This repository contains the code which extracts some metadata of all the cryptocurrencies listed (9K) on CoinMarketCap. Coding

Samyak Jain 3 Jul 03, 2022
Open-source keylogger write in python

Python open-source keylogger Language Python open-source keylogger using pynput module Using Install dependences in archive setup.py or install.sh in

Dio brando 4 Jan 15, 2022
DNSSEQ: PowerDNS with FALCON Signature Scheme

PowerDNS-based proof-of-concept implementation of DNSSEC using the post-quantum FALCON signature scheme.

Nils Wisiol 4 Feb 03, 2022
Static Token And Credential Scanner

Static Token And Credential Scanner What is it? STACS is a YARA powered static credential scanner which suports binary file formats, analysis of neste

STACS 81 Dec 27, 2022
SEBUAH TOOLS TERMUX CRACK AKUN FF HOMKI AKUN EPEP DAH SATU FOLLOW AE YA BROO AWOKWOK

print " INSTALL TOOLS " $ pkg update && upgrade $ pkg install python2 $ pkg install git $ pip2 install lolcat $ pip2 install bs4 $ pip2 install reques

Jeeck 2 Nov 29, 2021
Log4j command generator: Generate commands for CVE-2021-44228

Log4j command generator Generate commands for CVE-2021-44228. Description The vulnerability exists due to the Log4j processor's handling of log messag

1 Jan 03, 2022
Python implementation for PrintNightmare using standard Impacket.

PrintNightmare Python implementation for PrintNightmare (CVE-2021-1675 / CVE-2021-34527) using standard Impacket. Installtion $ pip3 install impacket

ollypwn 141 Dec 31, 2022
RCE Exploit for Gitlab < 13.9.4

GitLab-Wiki-RCE RCE Exploit for Gitlab 13.9.4 RCE via unsafe inline Kramdown options when rendering certain Wiki pages Allows any user with push acc

Enox 52 Nov 09, 2022
Exploit for CVE-2017-17562 vulnerability, that allows RCE on GoAhead (< v3.6.5) if the CGI is enabled and a CGI program is dynamically linked.

GoAhead RCE Exploit Exploit for CVE-2017-17562 vulnerability, that allows RCE on GoAhead ( v3.6.5) if the CGI is enabled and a CGI program is dynamic

Francisco Spínola 2 Dec 12, 2021
自动化爆破子域名,并遍历所有端口寻找http服务,并使用crawlergo、dirsearch、xray等工具扫描并集成报告;支持动态添加扫描到的域名至任务;

AutoScanner AutoScanner是什么 AutoScanner是一款自动化扫描器,其功能主要是遍历所有子域名、及遍历主机所有端口寻找出所有http服务,并使用集成的工具进行扫描,最后集成扫描报告; 工具目前有:oneforall、masscan、nmap、crawlergo、dirse

633 Dec 30, 2022
Brute Force Guess the password for Instgram accounts with python

Brute-Force-instagram Guess the password for Instgram accounts Tool features : It has two modes: 1- Combo system from you 2- Automatic (random) system

45 Dec 11, 2022
A GitHub action for organizations that enables advanced security code scanning on all new repos

Advanced-Security-Enforcer What this repository does This code is for an active GitHub Action written in Python to check (on a schedule) for new repos

Zack Koppert 30 May 17, 2022
This is an injection tool that can inject any xposed modules apk into the debug android app

This is an injection tool that can inject any xposed modules apk into the debug android app, the native code in the xposed module can also be injected.

Windy 32 Nov 05, 2022
A script based on sqlmap that uses sql injection vulnerabilities to traverse the existence of a file

A script based on sqlmap that uses sql injection vulnerabilities to traverse the existence o

2 Nov 09, 2022
Python Password Generator

This is a console-based version of a password generator written with Python. The program generates a password based on numbers of letters, numbers, and symbols specified by the user. This is a simple

p.katekomol 1 Jan 24, 2022
Dome - Subdomain Enumeration Tool. Fast and reliable python script that makes active and/or passive scan to obtain subdomains and search for open ports.

DOME - A subdomain enumeration tool Check the Spanish Version Dome is a fast and reliable python script that makes active and/or passive scan to obtai

Vadi 329 Jan 01, 2023
Tenssens framework focused on gathering information from free tools or resources. The intention is to help people find free OSINT resources.

Tenssens framework focused on gathering information from free tools or resources. The intention is to help people find free OSINT resources.

Md. Nur habib 31 Oct 21, 2022