Journalism AI – Quotes extraction for modular journalism

This repo contains the code for the Guardian and AFP contribution to the JournalismAI Festival 2021.

Further reading can be found in our blog post.

The project aims to extract quotes from news articles using Named Entity Recognition (NER), add coreference information, and format the results for an exploratory search tool.

The contribution consists of several self-contained pieces of work, namely:

a regular expression pipeline attempting to extract quotes by matching patterns
a rule set to define different types of quotes and guide the quote annotation
custom annotation recipes for the Prodigy software enabling quick and efficient data annotation
a post-processing pipeline for extracting quotes using a trained spaCy model and adding coreference information
example data and data schema for displaying the extracted quote information in a search tool
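
To illustrate the pattern-matching idea behind the regular expression pipeline, here is a minimal sketch in Python. It is not the code in regex_pipeline/: the pattern, the list of speech verbs, and the function name `extract_quotes` are all assumptions made for this example, and it covers only one quote shape (a quoted passage followed by a speech verb and a capitalised name).

```python
import re

# Hypothetical pattern (not taken from the repo): a quoted passage,
# then a speech verb, then one or more capitalised words as the speaker.
QUOTE_THEN_SPEAKER = re.compile(
    r'["“](?P<quote>[^"”]+)["”],?\s+'
    r'(?P<verb>said|says|told|added)\s+'
    r'(?P<speaker>[A-Z][\w-]+(?:\s+[A-Z][\w-]+)*)'
)


def extract_quotes(text):
    """Return one dict per match with the quote, speech verb and speaker."""
    results = []
    for match in QUOTE_THEN_SPEAKER.finditer(text):
        results.append(
            {
                # Drop punctuation that sits just inside the closing quote mark.
                "quote": match.group("quote").rstrip(",. "),
                "verb": match.group("verb"),
                "speaker": match.group("speaker"),
            }
        )
    return results


sample = '"We are very pleased with the outcome," said Jane Smith on Tuesday.'
print(extract_quotes(sample))
```

A single pattern like this misses many common forms, such as speaker-first quotes, indirect speech, or quotes split across sentences, which is one reason a rule set and a trained model are also part of the project.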

Repo structure

Each folder in this repo reflects one of the pieces of work mentioned above.

regex_pipeline/ – code to run the regular expression-based quote extraction
annotation_rules/ – document with rules and definitions to guide the quote annotation step
annotation_scripts/ – custom annotation scripts for Prodigy
coreference/ – proof of concept for a rule-based coreferencing tool
schema/ – data output schema and example data

Each folder contains its own README file with instructions to set up and run that piece of work.
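
As a rough picture of what rule-based coreferencing can mean for quotes, the sketch below applies a single rule: a pronoun speaker is replaced by the most recently named speaker. This is an assumption for illustration only and does not describe the code in coreference/; the function name `resolve_speakers`, the `resolved_speaker` field, and the pronoun list are all hypothetical, and the rule ignores gender and number agreement.

```python
# Hypothetical pronoun list used by this sketch only.
PRONOUNS = {"he", "she", "they"}


def resolve_speakers(quotes):
    """Attach a resolved speaker to each quote, in document order.

    Each input item is a dict with at least a "speaker" key. A pronoun
    speaker is resolved to the most recent named speaker, or to None
    when no named speaker has appeared yet.
    """
    resolved = []
    last_named = None
    for quote in quotes:
        speaker = quote["speaker"]
        if speaker.lower() in PRONOUNS:
            resolved_speaker = last_named
        else:
            last_named = speaker
            resolved_speaker = speaker
        resolved.append({**quote, "resolved_speaker": resolved_speaker})
    return resolved


quotes = [
    {"quote": "We are very pleased", "speaker": "Jane Smith"},
    {"quote": "It took a long time", "speaker": "she"},
]
for item in resolve_speakers(quotes):
    print(item["speaker"], "->", item["resolved_speaker"])
```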
