Open Crawl Vietnamese Text

Last update: Jan 05, 2022

This repo contains crawled Vietnamese text from multiple sources.

This is a list of high-quality, topic-centric public data sources. We collected and cleaned them from multiple sources. All of the datasets listed below are free.

Here are the ways we clean the data:

Removal of emojis
Removal of emoticons
Removal of URLs
Removal of HTML tags
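
The four cleaning steps above can be sketched in Python as follows. This is a minimal illustration, not the script used to produce the released data: the function name `clean_text`, the regular expressions, and the order of the steps are all assumptions. In particular, the emoji pattern covers only a few common Unicode blocks, and the emoticon pattern only matches simple standalone ASCII emoticons such as `:)` or `=))`.

```python
import re

# Assumption: a simple tag pattern; a real HTML parser is more robust.
HTML_TAG_RE = re.compile(r"<[^>]+>")
# Assumption: URLs start with http(s):// or www. and run to the next whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Assumption: a handful of Unicode blocks commonly used for emojis.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U0000FE0F"             # variation selector
    "\U0000200D"             # zero-width joiner
    "]+"
)
# Assumption: standalone ASCII emoticons only (eyes, optional nose, mouth).
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][-o*']?[)(\]\[dDpP/\\|3]+(?!\S)")


def clean_text(text: str) -> str:
    """Remove HTML tags, URLs, emojis and emoticons, then normalize spaces."""
    # Tags go first so that URLs inside attributes disappear with the tag.
    text = HTML_TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    # Collapse the gaps left by the removals.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_text("<p>Xin chào :) 😀 xem https://example.com/a?b=1 nhé</p>"))
```

Requiring whitespace on both sides of an emoticon keeps timestamps such as `10:30` intact, at the cost of missing emoticons glued to a word.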

1. Binhvq News Corpus:

The Binhvq News Corpus was crawled from online news sources and contains 50 GB of text.

link_raw, link_clean

2. OSCAR corpus Vietnamese crawl:

OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. OSCAR contains about 32 GB of Vietnamese text after duplicates are discarded.

link_raw, link_clean
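
The description above does not say how duplicates were discarded. As a rough illustration only, one common approach is exact line-level deduplication by hashing each line; the helper name `dedup_lines` and the choice of SHA-1 are assumptions, not the method used for this dataset.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct non-empty line once, in first-seen order."""
    seen = set()
    for line in lines:
        key = line.strip()
        if not key:
            continue  # skip blank lines
        # Store a fixed-size digest instead of the full line to save memory.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield key


if __name__ == "__main__":
    sample = ["Xin chào\n", "Tạm biệt\n", "Xin chào  \n"]
    for line in dedup_lines(sample):
        print(line)
```

Because the function is a generator, it can be fed a file object directly and never holds more than the set of digests in memory, which matters for a corpus of this size.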

3. Vietnamese story dataset:

Short and long stories, 10 GB of text in total, crawled from the internet by QAI.

link_clean

4. Vietnamese poem dataset:

More than 1 million sentences collected from the internet by QAI.

link_clean

Owner

QAI Research
