Data preprocessing rosetta parser for python

Overview

datapreprocessing_rosetta_parser

I've never done any NLP or text data processing before, so I wanted to use this hackathon as a learning opportunity, specifically targeting popular packages like pandas, beautifulsoup and spacy.

The main idea of my project is to recreate Jelle Teijema's preprocessing pipeline and then try to run Dutch language model on each document to extract things of interest, such as emails, urls, organizations, people and dates. Maybe at this point, it shouldn't be considered just pre-processing, hmmm. Anyway, I've used nl_core_news_lg model. It is not very reliable, especially for organization and person names, however, it still allows for interesting queries.

Moreover, I've decided to try to do a summarization and collection of the most frequent words in the documents. My script tries to find N_SUMMARY_SENTENCES most important sentences and store it in the summary column. Please note, my Dutch is not very strong, so I can't really judge how well it works :)

Finally, the script also saves cleaned title and file contents, as per track anticipated output.

Output file

generate.py reads .csv files from input_data folder and produces output .csv file with | separator. It is pretty heavy (about x1.8 of input csv, ~75MB) and has a total of 15 columns:

Column name Description
filename Original filename provided in the input file
file_content Original file contents provided in the input file
id The dot separated numbers from the filename
category Type of a file
filename_date Date extracted from a filename
parsed_date Date extracted from file contents
found_emails Emails found in the file contents
found_urls URLs found in the file contents
found_organizations Organizations found in the file contents
found_people People found in the file contents
found_dates Dates found in the file contents
summary Summary of the document
top5words Top 5 most frequently used words in the file contents
title Somewhat cleaned title
abstract Somewhat cleaned file contents

Some interesting queries that I could think of at 12pm

  1. Load the output processed .csv file:
import pandas as pd
df = pd.read_csv('./output_data/processed_data.csv', sep='|',
                 index_col=0, dtype=str)
  1. All unique emails found in the documents:
import ast
emails = sum([ast.literal_eval(x) for x in df['found_emails']], [])
unique_emails = set(emails)
  1. Top 10 communicated domains in the documents:
from collections import Counter
domains = [x.split('@')[1] for x in emails]
d_counter = Counter(domains)
print(d_counter.most_common(10))
  1. Top 10 organizations mentioned in the documents:
orgs = sum([ast.literal_eval(x) for x in df['found_organizations']], [])
o_counter = Counter(orgs)
print(o_counter.most_common(10))
  1. Find IDs of documents that contain word "confidential" in them:
df['id'][df['abstract'].str.contains('confidential')]
  1. How many documents and categories there are in the dataset:
print(f'Total number of documents: {len(df)}')
print('Documents by category:')
df['category'].value_counts()

and I am sure you can be significantly more creative with this :)

How to generate output data

  1. Install dependencies with conda and switch to the environment:
conda env create -f environment.yml
conda activate ftm_hackathon

Alternatively (not tested), you can install packages to your current environment manually:

pip install spacy tqdm pandas bs4
  1. Download Dutch spacy model, ~500MB:
python -m spacy download nl_core_news_lg
  1. Put your raw .csv files into input_data folder.

  2. Run generate.py. On my 6yo laptop it takes ~17 minutes.

  3. The result will be written in output_data/processed_data.csv

Owner
ASReview hackathon for Follow the Money
ASReview hackathon for Follow the Money
A repo for open resources & information for people to succeed in PhD in CS & career in AI / NLP

A repo for open resources & information for people to succeed in PhD in CS & career in AI / NLP

420 Dec 28, 2022
Shared code for training sentence embeddings with Flax / JAX

flax-sentence-embeddings This repository will be used to share code for the Flax / JAX community event to train sentence embeddings on 1B+ training pa

Nils Reimers 23 Dec 30, 2022
Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

Chinese real time voice cloning (VC) and Chinese text to speech (TTS). 好用的中文语音克隆兼中文语音合成系统,包含语音编码器、语音合成器、声码器和可视化模块。

Kuang Dada 6 Nov 08, 2022
Creating an LSTM model to generate music

Music-Generation Creating an LSTM model to generate music music-generator Used to create basic sin wave sounds music-ai Contains the functions to conv

Jerin Joseph 2 Dec 02, 2021
Lumped-element impedance calculator and frequency-domain plotter.

fastZ: Lumped-Element Impedance Calculator fastZ is a small tool for calculating and visualizing electrical impedance in Python. Features include: Sup

Wesley Hileman 47 Nov 18, 2022
Control the classic General Instrument SP0256-AL2 speech chip and AY-3-8910 sound generator with a Raspberry Pi and this Python library.

GI-Pi Control the classic General Instrument SP0256-AL2 speech chip and AY-3-8910 sound generator with a Raspberry Pi and this Python library. The SP0

Nick Bild 8 Dec 15, 2021
Script to generate VAD dataset used in Asteroid recipe

About the dataset LibriVAD is an open source dataset for voice activity detection in noisy environments. It is derived from LibriSpeech signals (clean

11 Sep 15, 2022
A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

Crosslingual Coreference Coreference is amazing but the data required for training a model is very scarce. In our case, the available training for non

Pandora Intelligence 71 Jan 04, 2023
ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in

241 Jan 04, 2023
A library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer specific models.

blurr A library that integrates huggingface transformers with version 2 of the fastai framework Install You can now pip install blurr via pip install

ohmeow 253 Dec 31, 2022
customer care chatbot made with Rasa Open Source.

Customer Care Bot Customer care bot for ecomm company which can solve faq and chitchat with users, can contact directly to team. 🛠 Features Basic E-c

Dishant Gandhi 23 Oct 27, 2022
CoSENT、STS、SentenceBERT

CoSENT_Pytorch 比Sentence-BERT更有效的句向量方案

102 Dec 07, 2022
Interactive Jupyter Notebook Environment for using the GPT-3 Instruct API

gpt3-instruct-sandbox Interactive Jupyter Notebook Environment for using the GPT-3 Instruct API Description This project updates an existing GPT-3 san

312 Jan 03, 2023
Python generation script for BitBirds

BitBirds generation script Intro This is published under MIT license, which means you can do whatever you want with it - entirely at your own risk. Pl

286 Dec 06, 2022
Machine Psychology: Python Generated Art

Machine Psychology: Python Generated Art A limited collection of 64 algorithmically generated artwork. Each unique piece is then given a title by the

Pixegami Team 67 Dec 13, 2022
Comprehensive-E2E-TTS - PyTorch Implementation

A Non-Autoregressive End-to-End Text-to-Speech (text-to-wav), supporting a family of SOTA unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultima

Keon Lee 114 Nov 13, 2022
Knowledge Oriented Programming Language

KoPL: 面向知识的推理问答编程语言 安装 | 快速开始 | 文档 KoPL全称 Knowledge oriented Programing Language, 是一个为复杂推理问答而设计的编程语言。我们可以将自然语言问题表示为由基本函数组合而成的KoPL程序,程序运行的结果就是问题的答案。目前,

THU-KEG 62 Dec 12, 2022
Example code for "Real-World Natural Language Processing"

Real-World Natural Language Processing This repository contains example code for the book "Real-World Natural Language Processing." AllenNLP (2.5.0 or

Masato Hagiwara 303 Dec 17, 2022
GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning GrammarTagger is an open-source toolkit for grammatical profiling for lan

Octanove Labs 27 Jan 05, 2023
Multilingual word vectors in 78 languages

Aligning the fastText vectors of 78 languages Facebook recently open-sourced word vectors in 89 languages. However these vectors are monolingual; mean

Babylon Health 1.2k Dec 17, 2022