Crawl BookCorpus

Overview

Homemade BookCorpus

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Clawling could be difficult due to some issues of the website. Also, please consider another option such as using publicly available files at your own risk.

For example,

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


These are scripts to reproduce BookCorpus by yourself.

BookCorpus is a popular large-scale text corpus, espetially for unsupervised learning of sentence encoders/decoders. However, BookCorpus is no longer distributed...

This repository includes a crawler collecting data from smashwords.com, which is the original source of BookCorpus. Collected sentences may partially differ but the number of them will be larger or almost the same. If you use the new corpus in your work, please specify that it is a replica.

How to use

Prepare URLs of available books. However, this repository already has a list as url_list.jsonl which was a snapshot I (@soskek) collected on Jan 19-20, 2019. You can use it if you'd like.

python -u download_list.py > url_list.jsonl &

Download their files. Downloading is performed for txt files if possible. Otherwise, this tries to extract text from epub. The additional argument --trash-bad-count filters out epub files whose word count is largely different from its official stat (because it may imply some failure).

python download_files.py --list url_list.jsonl --out out_txts --trash-bad-count

The results are saved into the directory of --out (here, out_txts).

Postprocessing

Make concatenated text with sentence-per-line format.

python make_sentlines.py out_txts > all.txt

If you want to tokenize them into segmented words by Microsoft's BlingFire, run the below. You can use another choices for this by yourself.

python make_sentlines.py out_txts | python tokenize_sentlines.py > all.tokenized.txt

Disclaimer

For example, you can refer to terms of smashwords.com. Please use the code responsibly and adhere to respective copyright and related laws. I am not responsible for any plagiarism or legal implication that rises as a result of this repository.

Requirement

  • python3 is recommended
  • beautifulsoup4
  • progressbar2
  • blingfire
  • html2text
  • lxml
pip install -r requirements.txt

Note on Errors

  • It is expected some error messages are shown, e.g., Failed: epub and txt, File is not a zip file or Failed to open. But, the number of failures will be much less than one of successes. Don't worry.

Acknowledgement

epub2txt.py is derived and modified from https://github.com/kevinxiong/epub2txt/blob/master/epub2txt.py

Citation

If you found this code useful, please cite it with the URL.

@misc{soskkobayashi2018bookcorpus,
    author = {Sosuke Kobayashi},
    title = {Homemade BookCorpus},
    howpublished = {\url{https://github.com/BIGBALLON/cifar-10-cnn}},
    year = {2018}
}

Also, the original papers which made the original BookCorpus are as follows:

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler. "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books." arXiv preprint arXiv:1506.06724, ICCV 2015.

@InProceedings{Zhu_2015_ICCV,
    title = {Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books},
    author = {Zhu, Yukun and Kiros, Ryan and Zemel, Rich and Salakhutdinov, Ruslan and Urtasun, Raquel and Torralba, Antonio and Fidler, Sanja},
    booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
    month = {December},
    year = {2015}
}
@inproceedings{moviebook,
    title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
    author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
    booktitle = {arXiv preprint arXiv:1506.06724},
    year = {2015}
}

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. "Skip-Thought Vectors." arXiv preprint arXiv:1506.06726, NIPS 2015.

@article{kiros2015skip,
    title={Skip-Thought Vectors},
    author={Kiros, Ryan and Zhu, Yukun and Salakhutdinov, Ruslan and Zemel, Richard S and Torralba, Antonio and Urtasun, Raquel and Fidler, Sanja},
    journal={arXiv preprint arXiv:1506.06726},
    year={2015}
}
Comments
  • Could you share the processed all.txt?

    Could you share the processed all.txt?

    Hi Sosuke,

    Thanks a lot for the wonderful work! I expect to obtain the bookcorpus dataset with your crawler, but I failed to crawl the articles owing to some network errors. I am afraid that I cannot achieve a complete dataset. So could you please share with me the dataset you have got, e.g. the all.txt. My email address is [email protected]. Thanks!

    Zhijie

    opened by thudzj 9
  • Fix merging sentences in one paragraph

    Fix merging sentences in one paragraph

    This PR simply merges sentences in stack whenever it met an empty line. I am not sure why blank was necessary at the first place, so let't discuss about it if I'm missing some thing here.

    Consider one example from starting section of out_txts/100021__three-plays.txt. Current implementation output:

    Three Plays Published by Mike Suttons at Smashwords Copyright 2011 Mike Sutton ISBN 978-1-4659-8486-9 Tripping on Nothing
    

    It obviously merged the paragraph title Tripping on Nothing into stack incorrectly. With this PR, output is:

    Three Plays Published by Mike Suttons at Smashwords Copyright 2011 Mike Sutton ISBN 978-1-4659-8486-9
    
    
    Tripping on Nothing
    
    opened by yoquankara 4
  • intermittent issues with connections and file names

    intermittent issues with connections and file names

    example: python3.6 download_files.py --list url_list.jsonl --out out_txts --trash-bad-count 0 files had already been saved in out_txts. File is not a zip file | File is not a zip file File is not a zip file File is not a zip file File is not a zip file Failed to open https://www.smashwords.com/books/download/490185/8/latest/0/0/existence.epub HTTPError: HTTP Error 503: Service Temporarily Unavailable Succeeded in opening https://www.smashwords.com/books/download/490185/8/latest/0/0/existence.epub File is not a zip file File is not a zip file | File is not a zip file File is not a zip file File is not a zip file | File is not a zip file File is not a zip file "There is no item named '' in the archive" File is not a zip file File is not a zip file "There is no item named 'OPS/' in the archive" File is not a zip file File is not a zip file | File is not a zip file Failed to open https://www.smashwords.com/books/download/793264/8/latest/0/0/jaynells-wolf.epub HTTPError: HTTP Error 503: Service Temporarily Unavailable Succeeded in opening https://www.smashwords.com/books/download/793264/8/latest/0/0/jaynells-wolf.epub Failed to open https://www.smashwords.com/books/download/479710/6/latest/0/0/tainted-ava-delaney-lost-souls-1.txt HTTPError: HTTP Error 503: Service Temporarily Unavailable Succeeded in opening https://www.smashwords.com/books/download/479710/6/latest/0/0/tainted-ava-delaney-lost-souls-1.txt File is not a zip file "There is no item named 'OPS/' in the archive" File is not a zip file Failed to open https://www.smashwords.com/books/download/496160/8/latest/0/0/royal-blood-royal-blood-1.epub HTTPError: HTTP Error 404: Not Found Failed to open https://www.smashwords.com/books/download/496160/8/latest/0/0/royal-blood-royal-blood-1.epub HTTPError: HTTP Error 404: Not Found Gave up to open https://www.smashwords.com/books/download/496160/8/latest/0/0/royal-blood-royal-blood-1.epub [Errno 2] No such file or directory: 'out_txts/royal-blood-royal-blood-1.epub'

    opened by David-Levinthal 3
  • Network Error

    Network Error

    Hi,Thanks for your code, it's really useful for most nlp researchers and thank you again.

    And when I run this code, it's often interrupted by network error after download a little files, I thought this maybe caused by my network. so, could you please send me a email attached with the crawled BookCorpus datasets if you have ?

    My email is: [email protected]. Thank you very much.

    Best,

    opened by SummmerSnow 3
  • HTTPError: HTTP Error 401: Authorization Required

    HTTPError: HTTP Error 401: Authorization Required

    Thanks for you code, but I got some network trouble when I run the download_list script. The full error message is Failed to open https://www.smashwords.com/books/category/1/downloads/0/free/medium/0 HTTPError: HTTP Error 401: Authorization Required

    What's more, when I use your url_list.jsonl to download file, the download_filles script gaves the same error message: Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt HTTPError: HTTP Error 401: Authorization Required

    And I tried to open the url in my chrome, and I can see that page without error 401. Could help to find a solution? Thanks a lot~

    opened by NotToday 2
  • smashwords.com forbids this; readme should tell people to get permission first

    smashwords.com forbids this; readme should tell people to get permission first

    The code in this repo violates both the robots.txt of smashwords.com:

    $ curl -s https://www.smashwords.com/robots.txt | tail -4
    User-agent: *
    Disallow: /books/search?
    Disallow: /books/download/
    Crawl-delay: 4
    

    and their terms of service, as far as I can see: “Third parties are not authorized to download, host and otherwise redistribute Smashwords books without prior written agreement from Smashwords” (you could imagine that this only prohibits downloading for subsequent hosting or redistribution, but I think that would be an opportunistic interpretation :) ).

    The readme should tell people very clearly that they must get permission from smashwords.com before running this stuff against their site.

    opened by gthb 1
  • How to resolve URLError SSL: CERTIFICATE_VERIFY_FAILED

    How to resolve URLError SSL: CERTIFICATE_VERIFY_FAILED

    If you get the following error:

    URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)>
    

    Adding this block of code at the top of the script at download_files.py will resolve it.

    import os, ssl
    if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
        getattr(ssl, '_create_unverified_context', None)):
        ssl._create_default_https_context = ssl._create_unverified_context
    
    opened by delzac 1
  • add: utf8 encoding for all file opens

    add: utf8 encoding for all file opens

    First of all, Thank you for sharing your work.

    There were some errors about encoding like below. image

    It's been resolved by adding encoding='utf8' for every opens.

    Have a beautiful day.

    opened by YongWookHa 1
  • download_list.py not working due to title change.

    download_list.py not working due to title change.

    Apparently the titles on smashwords changed. txt is now found under "Plain text; contains no formatting" and epub under "Supported by many apps and devices (e.g., Apple Books, Barnes and Noble Nook, Kobo, Google Play, etc.)"

    opened by 1227505 1
  • add strip for genre scraping

    add strip for genre scraping

    It was dirty. I added strip.

    "genres": ["\n                            Category: Fiction \u00bb Mystery & detective \u00bb Women Sleuths ", "\n                            Category: Fiction \u00bb Fantasy \u00bb Paranormal "]
    

    will be

    "genres": ["Category: Fiction \u00bb Mystery & detective \u00bb Women Sleuths", "Category: Fiction \u00bb Fantasy \u00bb Paranormal"]
    
    opened by soskek 0
  • Update on the `url_list.jsonl`

    Update on the `url_list.jsonl`

    Hello, on 2022-12-17 I run the script download_list.py with modified number to page to 31430 which covered the last search page. Here is the updated url_list.jsonl.zip

    There are 4544 entries loss, and 8475 entries added from the original file

    Hope this help

    opened by thipokKub 0
  • Here’s a download link for all of bookcorpus as of Sept 2020

    Here’s a download link for all of bookcorpus as of Sept 2020

    You can download it here: https://twitter.com/theshawwn/status/1301852133319294976?s=21

    it contains 18k plain text files. The results are very high quality. I spent about a week fixing the epub2txt script, which you can find at https://github.com/shawwn/scrap named “epub2txt-all”. (not epub2txt.)

    The new script:

    1. Correctly preserves structure, matching the table of contents very closely;

    2. Correctly renders tables of data (by default html2txt produces mostly garbage-looking results for tables),

    3. Correctly preserves code structure, so that source code and similar things are visually coherent,

    4. Converts numbered lists from “1\.” to “1.”

    5. Runs the full text through ftfy.fix_text() (which is what OpenAI does for GPT), replacing Unicode apostrophes with ascii apostrophes;

    6. Expands Unicode ellipses to “...” (three separate ascii characters).

    The tarball download link (see tweet above) also includes the original ePub URLs, updated for September 2020, which ended up being about 2k more than the URLs in this repo. But they’re hard to crawl. I do have the epub files, but I’m reluctant to distribute them for obvious reasons.

    opened by shawwn 13
  • epub2txt.py produces incorrect results for many epubs

    epub2txt.py produces incorrect results for many epubs

    Specifically this line: https://github.com/soskek/bookcorpus/blob/05a3f227d9748c2ee7ccaf93819d0e0236b6f424/epub2txt.py#L149

    image

    When I tried to convert a book on Tensorflow to text using this script, I noticed chapter 1 was being repeated multiple times.

    The reason is that the Table of Contents looks similar to this:

    ch1.html#section1
    ch1.html#section2
    ch1.html#section3
    ... ch2.html#section1 ch2.html#section2 ...

    The epub2txt script iterates over this table of contents, splits "ch1.html#section1" to "ch1.html", then converts that to text. Then repeats for "ch1.html#section2", which converts the same chapter into text.

    I have a fixed version here: https://github.com/shawwn/scrap/blob/afb699ee9c8181b3728b81fc410a31b66311f0d8/epub2txt#L158-L206

    opened by shawwn 1
  • Can anyone download all the files in the url list file?

    Can anyone download all the files in the url list file?

    I tried to download the bookscorpus data. So far I just downloaded around 5000 books. Can anyone get all the books? I met a lot HTTP Error: 403 Forbidden How to fix this ? Or can i get the all the bookscorpus data from somewhere ?

    Thanks

    opened by wxp16 13
Releases(v1.0)
Owner
Sosuke Kobayashi
[email protected] ML, NLP, CV
Sosuke Kobayashi
Visual scraping for Scrapy

Portia Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web pag

Scrapinghub 8.7k Jan 05, 2023
Binance harvester - A Python 3 script to harvest data from the Binance socket stream and calculate popular TA indicators and produce lists of top trending coins

Binance harvester - A Python 3 script to harvest data from the Binance socket stream and calculate popular TA indicators and produce lists of top trending coins

68 Oct 08, 2022
用python爬取江苏几大高校的就业网站,并提供3种方式通知给用户,分别是通过微信发送、命令行直接输出、windows气泡通知。

crawler_for_university 用python爬取江苏几大高校的就业网站,并提供3种方式通知给用户,分别是通过微信发送、命令行直接输出、windows气泡通知。 环境依赖 wxpy,requests,bs4等库 功能描述 该项目基于python,通过爬虫爬各高校的就业信息网,爬取招聘信

8 Aug 16, 2021
SearchifyX, predecessor to Searchify, is a fast Quizlet, Quizizz, and Brainly webscraper with various stealth features.

SearchifyX SearchifyX, predecessor to Searchify, is a fast Quizlet, Quizizz, and Brainly webscraper with various stealth features. SearchifyX lets you

28 Dec 20, 2022
Dude is a very simple framework for writing web scrapers using Python decorators

Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-lea

Ronie Martinez 326 Dec 15, 2022
Scraping Top Repositories for Topics on GitHub,

0.-Webscrapping-using-python Scraping Top Repositories for Topics on GitHub, Web scraping is the process of extracting and parsing data from websites

Dev Aravind D Satprem 2 Mar 18, 2022
Bulk download tool for the MyMedia platform

MyMedia Bulk Content Downloader This is a bulk download tool for the MyMedia platform. USE ONLY WHERE ALLOWED BY THE COPYRIGHT OWNER. NOT AFFILIATED W

Ege Feyzioglu 3 Oct 14, 2022
A high-level distributed crawling framework.

Cola: high-level distributed crawling framework Overview Cola is a high-level distributed crawling framework, used to crawl pages and extract structur

Xuye (Chris) Qin 1.5k Jan 04, 2023
Scrape all the media from an OnlyFans account - Updated regularly

Scrape all the media from an OnlyFans account - Updated regularly

CRIMINAL 3.2k Dec 29, 2022
A tool can scrape product in aliexpress: Title, Price, and URL Product.

Scrape-Product-Aliexpress A tool can scrape product in aliexpress: Title, Price, and URL Product. Usage: 1. Install Python 3.8 3.9 padahal halaman ins

Rahul Joshua Damanik 1 Dec 30, 2021
腾讯课堂,模拟登陆,获取课程信息,视频下载,视频解密。

腾讯课堂脚本 要学一些东西,但腾讯课堂不支持自定义变速,播放时有水印,且有些老师的课一遍不够看,于是这个脚本诞生了。 时间比较紧张,只会不定时修复重大bug。多线程下载之类的功能更新短期内不会有,如果你想一起完善这个脚本,欢迎pr 2020.5.22测试可用 使用方法 很简单,三部完成 下载代码,

163 Dec 30, 2022
爬取各大SRC当日公告 | 通过微信通知的小工具 | 赏金工具

OnTimeHacker V1.0 OnTimeHacker 是一个爬取各大SRC当日公告,并通过微信通知的小工具 OnTimeHacker目前版本为1.0,已支持24家SRC,列表如下 360、爱奇艺、阿里、百度、哔哩哔哩、贝壳、Boss、58、菜鸟、滴滴、斗鱼、 饿了么、瓜子、合合、享道、京东、

Bywalks 95 Jan 07, 2023
This app will let you continuously scrape certain parts of LeasePlan and extract data of cars becoming available for lease.

LeasePlan - Scraper This app will let you continuously scrape certain parts of LeasePlan and extract data of cars becoming available for lease. It has

Rodney 4 Nov 18, 2022
Console application for downloading images from Reddit in Python

RedditImageScraper Console application for downloading images from Reddit in Python Introduction This short Python script was created for the mass-dow

James 0 Jul 04, 2021
A simple Discord scraper for discord bots

A simple Discord scraper for discord bots. That includes sending an guild members ids to an file, Mass inviter for joining servers your bot is in and Fetching all the servers of the bot (w/MemberCoun

3zg 1 Jan 06, 2022
A web scraper which checks price of a product regularly and sends price alerts by email if price reduces.

Amazon-Web-Scarper Created a web scraper using simple functions to check price of a product on amazon (can be duplicated to check price at other marke

Swaroop Todankar 1 Jan 17, 2022
jd_maotai rpa 基于selenium驱动的jd抢购rpa机器人

jd_maotai rpa 基于selenium驱动的jd抢购rpa机器人, 照顾我们这样的马大哈, 不会忘记抢购了, 祝大家过年都能喝上茅台. 特别声明: 本仓库发布的jd_maotai_rpa项目定义为自动化rpa项目, 是用于防止忘记参与jd茅台的活动(由于本人时常忘记), 而不是为了秒杀和抢

35 Nov 18, 2022
feapder 是一款简单、快速、轻量级的爬虫框架。以开发快速、抓取快速、使用简单、功能强大为宗旨。支持分布式爬虫、批次爬虫、多模板爬虫,以及完善的爬虫报警机制。

feapder 是一款简单、快速、轻量级的爬虫框架。起名源于 fast、easy、air、pro、spider的缩写,以开发快速、抓取快速、使用简单、功能强大为宗旨,历时4年倾心打造。支持轻量爬虫、分布式爬虫、批次爬虫、爬虫集成,以及完善的爬虫报警机制。 之

boris 1.4k Dec 29, 2022
crypto currency scraping

SCRYPTO What ? Crypto currencies scraping (At the moment, only bitcoin and ethereum crypto currencies are supported) How ? A python script is running

15 Sep 01, 2022
Simple tool to scrape and download cross country ski timings and results from live.skidor.com

LiveSkidorDownload Simple tool to scrape and download cross country ski timings

0 Jan 07, 2022