topic modeling on unstructured data in Space news articles retrieved from the Guardian (UK) newspaper using API

Overview

NLP Space News Topic Modeling

Photos by nasa.gov (1, 2, 3, 4, 5) and extremetech.com

Binder Open In Colab nbviewer pre-commit CI CodeQL License: MIT OpenSource Code style: black prs-welcome pyup

Table of Contents

  1. Project Idea
  2. Data acquisition
  3. Analysis
  4. Usage
  5. Project Organization

Project Idea

This project aims to learn topics published in Space news from the Guardian (UK) news publication1.

1: articles were also retrieved from the blog Space.com (web scraping), the New York Times (space news from the science section) and from the Hubble Telescope news archive, but these data sources were not used in analysis

Data acquisition

Primary data source

News articles are retrieved using the official API provided by the Guardian.

Supplementary data sources

Data is also acquired from articles published by the Hubble Telescope, the New York Times (US) and blog publication Space.com

Although these articles were acquired, they were not used in analysis.

Data file creation

  1. Use 1_get_list_of_urls.ipynb
    • programmatically retrieves urls from API or archive of publication
    • retrieves metadata such as date and time, section, sub-section, headline/abstract/short summary, etc.
  2. Use 2_scrape_urls.ipynb
    • scrapes news article text from publication url
  3. Use 3_merge_scraped_and_filter.ipynb
    • merge metadata (1_get_list_of_urls.ipynb) with scraped article text (2_scrape_urls.ipynb)

Analysis

Analysis will be performed using an un-supervised learning model. Details are included in the 8_gensim_coherence_nlp_trials_v3.ipynb notebook in the root directory.

Usage

  1. Clone this repository
    $ git clone
  2. Create Python virtual environment, install packages and launch interactive Python platform
    $ make build
  3. Run notebooks in the following order
    • 3_merge_scraped_and_filter.ipynb (view) (covers data from the Hubble news feed, New York Times and Space.com)
      • merge multiple files of articles text data retrieved from news publications API or archive
      • filter out articles of less than 500 words
      • export to *.csv file for use in unsupervised machine learning models
    • 8_gensim_coherence_nlp_trials_v3.ipynb (view) (does not cover data from the Hubble news feed, New York Times and Space.com)
      • experiments in selecting number of topics using
        • coherence score from built-in coherence model to score Gensim's NMF
        • sklearn's implementation of TFIDF + NMF, using best number of topics found using Gensim's NMF
      • manually reading articles that NMF associates with each topic
    • 9_nlp_workflow.ipynb (view)
      • code-only version of 9_gensim_coherence_nlp_trials_v3.ipynb, with necessary considerations for deployment of topic model

Project Organization

├── .pre-commit-config.yaml       <- configuration file for pre-commit hooks
├── .github
│   ├── workflows
│       └── integrate.yml         <- configuration file for Github Actions
├── LICENSE
├── environment.yml               <- configuration file to create environment to run project on Binder
├── Makefile                      <- Makefile with commands like `make lint` or `make build`
├── README.md                     <- The top-level README for developers using this project.
├── app
│   ├── data                      <- data exported from training topic modeler, for use with API
|   └── tests                     <- Source code for use in API tests
|       ├── test-logs             <- Reports from running unit tests on API
|       └── testing_utils         <- Source code for use in unit tests
|           └── *.py              <- Scripts to use in testing API routes
|       ├── __init__.py           <- Allows Python modules to be imported from testing_utils
|       └── test_api.py           <- Unit tests for API
├── api.py                        <- Defines API routes
├── pytest.ini                    <- Test configuration
├── requirements.txt              <- Packages required to run and test API
├── s*,t*.py                      <- Scripts to use in defining API routes
├── data
│   ├── raw                       <- raw data retrieved from news publication
|   └── processed                 <- merged and filtered data
├── executed-notebooks            <- Notebooks with output.
├── *.ipynb                       <- Jupyter notebooks. Naming convention is a number (for ordering),
│                                    and a short `-` delimited description
├── requirements.txt              <- packages required to execute all Jupyter notebooks interactively (not from CI)
├── setup.py                      <- makes project pip installable (pip install -e .) so `src` can be imported
├── src                           <- Source code for use in this project.
│   ├── __init__.py               <- Makes src a Python module
│   └── *.py                      <- Scripts to use in analysis for pre-processing, training, etc.
├── papermill_runner.py           <- Python functions that execute system shell commands.
└── tox.ini                       <- tox file with settings for running tox; see tox.testrun.org

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Owner
edesz
edesz
Gathers machine learning and Tensorflow deep learning models for NLP problems, 1.13 < Tensorflow < 2.0

NLP-Models-Tensorflow, Gathers machine learning and tensorflow deep learning models for NLP problems, code simplify inside Jupyter Notebooks 100%. Tab

HUSEIN ZOLKEPLI 1.7k Dec 30, 2022
Create a machine learning model which will predict if the mortgage will be approved or not based on 5 variables

Mortgage-Application-Analysis Create a machine learning model which will predict if the mortgage will be approved or not based on 5 variables: age, in

1 Jan 29, 2022
Code for lyric-section-to-comment generation based on huggingface transformers.

CommentGeneration Code for lyric-section-to-comment generation based on huggingface transformers. Migrate Guyu model and code (both 12-layers and 24-l

Yawei Sun 8 Sep 04, 2021
Learn meanings behind words is a key element in NLP. This project concentrates on the disambiguation of preposition senses. Therefore, we train a bert-transformer model and surpass the state-of-the-art.

New State-of-the-Art in Preposition Sense Disambiguation Supervisor: Prof. Dr. Alexander Mehler Alexander Henlein Institutions: Goethe University TTLa

Dirk Neuhäuser 4 Apr 06, 2022
Translation for Trilium Notes. Trilium Notes 中文版.

Trilium Translation 中文说明 This repo provides a translation for the awesome Trilium Notes. Currently, I have translated Trilium Notes into Chinese. Test

743 Jan 08, 2023
Some embedding layer implementation using ivy library

ivy-manual-embeddings Some embedding layer implementation using ivy library. Just for fun. It is based on NYCTaxiFare dataset from kaggle (cut down to

Ishtiaq Hussain 2 Feb 10, 2022
Code to reproduce the results of the paper 'Towards Realistic Few-Shot Relation Extraction' (EMNLP 2021)

Realistic Few-Shot Relation Extraction This repository contains code to reproduce the results in the paper "Towards Realistic Few-Shot Relation Extrac

Bloomberg 8 Nov 09, 2022
A Facebook Messenger Chatbot using NLP

A Facebook Messenger Chatbot using NLP This project is about creating a messenger chatbot using basic NLP techniques and models like Logistic Regressi

6 Nov 20, 2022
Bpe algorithm can finetune tokenizer - Bpe algorithm can finetune tokenizer

"# bpe_algorithm_can_finetune_tokenizer" this is an implyment for https://github

张博 1 Feb 02, 2022
The swas programming language

The Swas programming language This is a language that was made for fun. Installation Step 0: Make sure you have python installed Step 1. Clone this re

Swas.py 19 Jul 18, 2022
Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch

COCO LM Pretraining (wip) Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch. They were a

Phil Wang 44 Jul 28, 2022
SentAugment is a data augmentation technique for semi-supervised learning in NLP.

SentAugment SentAugment is a data augmentation technique for semi-supervised learning in NLP. It uses state-of-the-art sentence embeddings to structur

Meta Research 363 Dec 30, 2022
Large-scale open domain KNOwledge grounded conVERsation system based on PaddlePaddle

Knover Knover is a toolkit for knowledge grounded dialogue generation based on PaddlePaddle. Knover allows researchers and developers to carry out eff

606 Dec 28, 2022
TruthfulQA: Measuring How Models Imitate Human Falsehoods

TruthfulQA: Measuring How Models Imitate Human Falsehoods

69 Dec 25, 2022
GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub.

Nathan Cooper 2.3k Jan 01, 2023
Korean stereoypte detector with TUNiB-Electra and K-StereoSet

Korean Stereotype Detector Korean stereotype sentence classifier using K-StereoSet with TUNiB-Electra Web demo you can test this model easily in demo

Sae_Chan_Oh 11 Feb 18, 2022
Multilingual text (NLP) processing toolkit

polyglot Polyglot is a natural language pipeline that supports massive multilingual applications. Free software: GPLv3 license Documentation: http://p

RAMI ALRFOU 2.1k Jan 07, 2023
A versatile token stream for handwritten parsers.

Writing recursive-descent parsers by hand can be quite elegant but it's often a bit more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. Th

Valentin Berlier 8 Nov 30, 2022
A CSRankings-like index for speech researchers

Speech Rankings This project mimics CSRankings to generate an ordered list of researchers in speech/spoken language processing along with their possib

Mutian He 19 Nov 26, 2022
문장단위로 분절된 나무위키 데이터셋. Releases에서 다운로드 받거나, tfds-korean을 통해 다운로드 받으세요.

Namuwiki corpus 문장단위로 미리 분절된 나무위키 코퍼스. 목적이 LM등에서 사용하기 위한 데이터셋이라, 링크/이미지/테이블 등등이 잘려있습니다. 문장 단위 분절은 kss를 활용하였습니다. 라이선스는 나무위키에 명시된 바와 같이 CC BY-NC-SA 2.0

Jeong Ukjae 16 Apr 02, 2022