A Smart, Automatic, Fast and Lightweight Web Scraper for Python

Last update: Jan 04, 2023

Overview

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python

This project makes web scraping automatic and easy. It takes a URL or the HTML content of a web page, plus a list of sample data that you want to scrape from that page. The data can be text, a URL, or any HTML tag value on that page. It learns the scraping rules and returns similar elements. You can then use the learned object with new URLs to get similar content, or the exact same element, from those pages.

Installation

It's compatible with Python 3.

Install the latest version from the Git repository using pip:

$ pip install git+https://github.com/alirezamika/autoscraper.git

Install from PyPI:

$ pip install autoscraper

Install from source:

$ python setup.py install

How to use

Getting similar results

Say we want to fetch all related post titles on a Stack Overflow page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

Here's the output:

[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?', 
    'How to call an external command?', 
    'What are metaclasses in Python?', 
    'Does Python have a ternary conditional operator?', 
    'How do you remove duplicates from a list whilst preserving order?', 
    'Convert bytes to a string', 
    'How to get line count of a large file cheaply in Python?', 
    "Does Python have a string 'contains' substring method?", 
    'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]

Now you can use the scraper object to get the related topics of any Stack Overflow page:

scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')

Getting exact result

Say we want to scrape live stock prices from Yahoo Finance:

from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'

wanted_list = ["124.81"]

scraper = AutoScraper()

# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)

Note that you should update the wanted_list if you want to copy this code, as the content of the page dynamically changes.

You can also pass any custom requests module parameter. For example, you may want to use proxies or custom headers:

proxies = {
    "http": 'http://127.0.0.1:8001',
    "https": 'https://127.0.0.1:8001',
}

result = scraper.build(url, wanted_list, request_args=dict(proxies=proxies))

Now we can get the price of any symbol:

scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/')

You may want to get other info as well. For example, if you want to get the market cap too, you can just append it to the wanted list. The get_result_exact method retrieves the data in exactly the same order as the wanted list.

Another example: say we want to scrape the about text, the number of stars, and the link to the issues of GitHub repo pages:

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'

wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '2.5k', 'https://github.com/alirezamika/autoscraper/issues']

scraper = AutoScraper()
scraper.build(url, wanted_list)

Simple, right?

Saving the model

We can now save the built model to use it later. To save:

# Give it a file path
scraper.save('yahoo-finance')

And to load:

scraper.load('yahoo-finance')

Tutorials

See this gist for more advanced usages.
AutoScraper and Flask: Create an API From Any Website in Less Than 5 Minutes

Issues

Feel free to open an issue if you have any problem using the module.

Support the project

Happy Coding ♥️

Comments

Pulling tables would be awesome

Perhaps I missed it somewhere, but it would be great to go here: https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6829/Stages/15151/PlayerStatistics/England-Premier-League-2017-2018

And grab the entire table(s): Premier League Player Statistics Premier League Assist to Goal Scorer
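The README examples above return flat lists of matches, so pulling a whole table would need a separate step. As a rough sketch using only Python's standard library (the `TableParser` class, the `parse_table` helper and the sample HTML are hypothetical and not part of AutoScraper), table rows could be collected like this:

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being filled, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    parser = TableParser()
    parser.feed(html)
    return parser.rows


# Made-up sample data, not taken from the linked site.
sample = """
<table>
  <tr><th>Player</th><th>Goals</th></tr>
  <tr><td>Player A</td><td>32</td></tr>
  <tr><td>Player B</td><td>30</td></tr>
</table>
"""
print(parse_table(sample))
```

This only handles simple, non-nested tables in static HTML; a table filled in by JavaScript would not appear in the downloaded source at all.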

opened by craine 11
Asynchronous methods for fetching URLs, parsing HTML, and exporting data
Introduction

I was looking over the code for this project and am impressed with its simplicity in design and brilliant approach to this problem. However, one thing that jumped out at me was the lack of asynchronous methods, which could allow for a large speed increase, especially as the number of pages to scrape grows. I am quite familiar with the standard libraries used to meet this goal and propose the following changes:

Let me know your thoughts and if you're interested in the idea. The performance gains would be immense! Thanks!

Technical changes and additions proposal

[ ] 1. Subclass AutoScraper with AsyncAutoScraper, which would require the packages aiohttp, aiofiles, and aiosql along with a few others purely optionally to increase speed - uvloop, brotlipy, cchardet, and aiodns

[ ] 2. Refactor the _get_soup method by extracting an async method to download HTML asynchronously using aiohttp

[ ] 3. Refactor the get_results* and _build* functions to also be async (simply adding the keyword) and then making sure to call them by using a multiprocessing/threading pool

[ ] a. The get_* functions should handle the calling of these in an executor set to aforementioned pool

[ ] b. Pools are created using concurrent.futures.*

[ ] c. Inner-method logic should remain untouched since parsing is a CPU-bound task

[ ] 4. Use aiofiles for the save method to be able to export many individual JSON files quickly if desired, same for the load method if multiple sources are being used

[ ] 5. Add functionality for exporting to an SQL database asynchronously using aiosql
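The core of items 2 and 3 can be sketched with the standard library alone: an async download step, with the unchanged synchronous parsing handed to an executor. This is a minimal illustration of the pattern, not the proposed implementation. `fetch_html`, `parse_titles` and `FAKE_PAGES` are hypothetical stand-ins (a real version would download with aiohttp, as the proposal suggests), and the sleep only simulates network latency.

```python
import asyncio
import re
from concurrent.futures import ThreadPoolExecutor

# Stand-in for real pages so the sketch runs without network access.
FAKE_PAGES = {
    "https://example.com/a": "<h1>Alpha</h1>",
    "https://example.com/b": "<h1>Beta</h1>",
}


async def fetch_html(url):
    """Placeholder for an asynchronous download (hypothetical helper)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return FAKE_PAGES[url]


def parse_titles(html):
    """Synchronous parsing step; its logic stays untouched, as item 3c suggests."""
    return re.findall(r"<h1>(.*?)</h1>", html)


async def scrape(url, pool):
    html = await fetch_html(url)
    loop = asyncio.get_running_loop()
    # Run the blocking parse in the pool so the event loop stays free.
    return await loop.run_in_executor(pool, parse_titles, html)


async def scrape_all(urls):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # gather() returns results in the same order as the input URLs.
        return await asyncio.gather(*(scrape(u, pool) for u in urls))


results = asyncio.run(scrape_all(list(FAKE_PAGES)))
print(results)
```

A thread pool keeps the event loop responsive but does not parallelize CPU-bound parsing in CPython; swapping in `concurrent.futures.ProcessPoolExecutor` is one option if parsing turns out to dominate.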

References

aiohttp

aiofiles

aiosql

@alirezamika
opened by tarasivashchuk 10
About removing duplicate result

I'm sorry to add this issue; I don't know whether this really is an issue.

In my code, I don't want duplicate results to be removed. I tried commenting out some code, but it doesn't seem to work, so I'm opening this issue.

Sorry again. Please tell me if this is not an issue and I will delete it.

opened by Mervyen 9
Added metadata field
This new PR allows users to add metadata dictionary and save/load it. Since metadata is a generic dict, users are free to add any kind of metadata. Some examples include - Author, license, description etc. This provides an identity to the learnt rules. (would be useful for those who publish their work)

Added set_metadata() and get_metadata() to bring in these features.

Changes are made to load() and save()

Updated docs reflecting these features.

Metadata field would be useful, we can save any sort of information along with the rules. In future if you try to add any other fields to the saved representation you can include them in metadata field, without making any major change to the codebase.
opened by Narasimha1997 8
Defining large block of text as wanted list

When our target value is a large block of text, it becomes messy. Instead can a feature be added so that we can define the text shortly?

For example: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum

can be defined as: Lorem ipsum(...)est laborum
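One way such a shorthand could be interpreted is as a prefix and a suffix with anything in between. The sketch below is only an illustration of the requested behaviour; `matches_abbreviation` and the `(...)` marker handling are hypothetical and not an AutoScraper feature.

```python
ELLIPSIS = "(...)"


def matches_abbreviation(pattern, text):
    """Return True if text matches pattern, where "(...)" stands for any run of characters."""
    if ELLIPSIS not in pattern:
        return pattern == text
    head, tail = pattern.split(ELLIPSIS, 1)
    return (
        len(text) >= len(head) + len(tail)  # head and tail must not overlap
        and text.startswith(head)
        and text.endswith(tail)
    )


candidates = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit ... anim id est laborum",
    "Lorem ipsum is placeholder text",
]
wanted = "Lorem ipsum(...)est laborum"
print([c for c in candidates if matches_abbreviation(wanted, c)])
```

A helper like this could be used to look up the full text on the page first, and the full text could then be passed in the wanted list as usual.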

opened by ohidurbappy 6
Training text with extra spaces before and after while predicted text does not

I am dealing with Q&A pages where some paragraphs contain extra spaces before and after the span (on inspecting the source), while other spans do not. E.g.: (With extra space) https://www.sfc.hk/en/faqs/intermediaries/licensing/Associated-entities#0FCC1339F7B94DF69DD1DF73DB5F7DCA (No extra space) https://www.sfc.hk/en/faqs/intermediaries/licensing/Family-Offices#F919B6DCE05349D8A9E8CEE8CA9C7750

As a result, it seems that a model trained on the former does not treat the latter as similar. In fact, even during the "build" process, a question with extra spaces does not treat other questions without spaces as similar.
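The mismatch described here is what a plain string comparison produces when one side carries padding. A minimal sketch of a whitespace-tolerant comparison follows; `normalize_ws` and `same_text` are hypothetical helpers, the sample strings are made up, and whether AutoScraper compares text this way internally is not something this sketch assumes.

```python
def normalize_ws(text):
    """Trim both ends and collapse internal whitespace runs to single spaces."""
    return " ".join(text.split())


def same_text(a, b):
    return normalize_ws(a) == normalize_ws(b)


with_padding = "\n   What is an associated entity?   \n"
without_padding = "What is an associated entity?"

print(with_padding == without_padding)           # strict comparison
print(same_text(with_padding, without_padding))  # whitespace-tolerant comparison
```

Applying the same normalization to page text before building the wanted list is one way to make the two page variants look alike.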

Another question concerns the expanded part of the text (the "A: " answer text). It doesn't expand unless a "+" sign is clicked. In that case, is there any way to get an exact result that includes the answer part?

Thanks for the great work.

opened by predoctech 6
Ignores duplicate value

Hi,

I was trying to fetch data from a website that had some duplicate values; for example, item A and item B had the same price, $1.0. AutoScraper simply ignored the duplicate values and returned only unique items in the result list.

The website had 18 items, but the result list had only 5, all unique. I hope you can fix this issue, thanks.
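The effect reported here matches what order-preserving de-duplication does to a list with repeated values. The sketch below only illustrates that effect on made-up data; it does not claim to show how AutoScraper removes duplicates internally.

```python
# Made-up prices with repeated values, as on a page where items share a price.
prices = ["$1.0", "$2.5", "$1.0", "$3.0", "$2.5", "$1.0"]

# Order-preserving de-duplication: repeated values collapse into one entry.
unique_prices = list(dict.fromkeys(prices))
print(unique_prices)

# Pairing each value with its position keeps every occurrence distinct.
indexed_prices = list(enumerate(prices))
print(len(prices), len(unique_prices), len(indexed_prices))
```

Keeping a position or an item identifier next to each value is one way to retain all 18 entries in a case like the one described.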

opened by thouravi 4
Extracting webpages with a collections of items (structurally)
Hi, How do I extract a list of a list of text from a webpage with:

Name: Amy, Age: 13 Name: Bobby, Age: 33 Name: Chris, Age: 54

Ideally I would like the results to be:

[['Amy', '13'], ['Bobby', '33'], ['Chris', '54']]
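Two simple ways to arrive at that nested structure are sketched below with the standard library. Both are illustrations on the sample text from this issue, not AutoScraper features: the first parses the records directly, the second assumes names and ages were already scraped as two flat lists of equal length.

```python
import re

text = "Name: Amy, Age: 13 Name: Bobby, Age: 33 Name: Chris, Age: 54"

# Option 1: parse the records directly from the page text.
rows = [list(match) for match in re.findall(r"Name: (\w+), Age: (\d+)", text)]
print(rows)

# Option 2: pair up two flat result lists by position.
names = ["Amy", "Bobby", "Chris"]
ages = ["13", "33", "54"]
paired = [list(pair) for pair in zip(names, ages)]
print(paired)
```

Option 2 only works if both lists are complete and in the same order; a missing age for one person would shift every later pair.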
opened by ws1088 4
ERROR: Package 'autoscraper' requires a different Python: 2.7.16 not in '>=3.6'

All 3 listed installation methods return the error shown in the issue title & cause an installation failure. No change when using pip or pip3 command. I tried running the following 2 commands to get around the pre-commit issue but with no change in the result:

$ pip uninstall pre-commit  # uninstall from Python2.7
$ pip3 install pre-commit   # install with Python3
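The error message suggests that the pip being invoked belongs to a Python 2.7 interpreter. A small check like the following, run with the interpreter you intend to use, shows which executable is active and whether it meets the `>=3.6` requirement quoted in the title. `MIN_VERSION` and `is_supported` are names made up for this sketch.

```python
import sys

MIN_VERSION = (3, 6)  # taken from the '>=3.6' requirement in the error message


def is_supported(version_info=sys.version_info):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION


print(sys.executable)            # path of the interpreter actually running
print(is_supported())            # result for the current interpreter
print(is_supported((2, 7, 16)))  # the version from the error message
```

If the check passes, running pip through that same interpreter (for example `python3 -m pip install autoscraper`) avoids picking up a pip that is bound to Python 2.7.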

opened by mechengineermike 4
Nonbreaking spaces lead to surprising behavior

I tried using autoscraper to scrape items from the Hacker News home page. The scraper had issues with the nonbreaking space in the comments link on each list item. I was eventually able to work around the issue by using '\xa0' in the wanted_list string. That matched the comments field but then returned incorrect results anyway. My guess is that something is not matching the nonbreaking space in the "stack" analysis (but I didn't invest the time to find the root cause).
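The mismatch is easy to reproduce outside the library: a no-break space (U+00A0) looks like a space but is a different character, so a typed string does not equal the scraped one. The sketch below shows the difference and one possible normalization with the standard `unicodedata` module. The sample strings and the `normalize` helper are made up, and this is not a description of AutoScraper's matching code.

```python
import unicodedata

scraped = "42\xa0comments"  # text as it may appear in the page (U+00A0)
typed = "42 comments"       # what a user would naturally type


def normalize(text):
    # NFKC maps the no-break space (U+00A0) to a plain space (U+0020).
    return unicodedata.normalize("NFKC", text)


print(scraped == typed)                        # raw comparison
print(normalize(scraped) == normalize(typed))  # comparison after normalization
```

NFKC also rewrites other compatibility characters (for example ligatures and full-width forms), so a narrower `text.replace("\xa0", " ")` may be preferable when only this one character is the problem.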

This project is an interesting idea, but I recommend unit tests and some documentation about the matching algorithm to help users help you with diagnosing bugs.

opened by steve-bate 4
Add support for incremental learning

As of now, the rules are formed at once based on the targets specified in wanted_list and the stack list is generated for those targets. Sometimes there will be scenarios where I have to update the existing stack list with new rules learnt from different set of targets on the same URL. As seen in the build method, you create a new stack list every time a build method is called. Provide an update method, that updates the stack list simply by appending the new rules learnt from new set of targets. This will be very useful functionality because it will allow developers to incrementally add new targets by retaining the older rules.
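The requested behaviour can be pictured with a toy model of the stack list. Everything in this sketch is hypothetical: `RuleStore` is not AutoScraper's class, the rule strings are placeholders, and the `update` flag is just one possible shape for the feature.

```python
class RuleStore:
    """Toy stand-in for a scraper's learned rules (hypothetical)."""

    def __init__(self):
        self.stack_list = []

    def build(self, new_rules, update=False):
        """Learn rules; discard the old ones unless update=True."""
        if not update:
            self.stack_list = []
        for rule in new_rules:
            if rule not in self.stack_list:  # skip rules already learned
                self.stack_list.append(rule)
        return self.stack_list


store = RuleStore()
store.build(["rule-for-title"])
store.build(["rule-for-price"], update=True)  # keeps the title rule
print(store.stack_list)

store.build(["rule-for-author"])              # default: starts over
print(store.stack_list)
```

Keeping the default as "start over" would preserve the current behaviour for existing callers while letting new callers opt in to appending.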

opened by Narasimha1997 4
Scrapping output is zero
I tried to scrape the webpage, but the results are empty:

from autoscraper import AutoScraper

url = 'https://trade.mango.markets/account?pubkey=8zJHqNa9sVvyLmVBQwY2vch5729dqfmzF3cxE25ZYVn'

wanted_list = ['Futures Positions', 'Notion Size']

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

The output is empty.

@alirezamika, can you tell me what the issue is? Is it because the webpage is built with Node.js?
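One possible cause of empty results is that the page fills in its content with JavaScript after loading, so the wanted text never appears in the HTML the server returns. This has not been verified for the site above. A quick way to test the idea is to check whether each wanted string occurs in the raw HTML at all; `missing_targets` and the sample HTML below are hypothetical.

```python
def missing_targets(html, wanted_list):
    """Return the wanted strings that do not occur in the raw HTML."""
    return [wanted for wanted in wanted_list if wanted not in html]


# Made-up response from a page that renders its content with JavaScript.
raw_html = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)
print(missing_targets(raw_html, ["Futures Positions", "Notion Size"]))
```

If every target is reported missing, the HTML would have to be obtained after rendering (for example from a browser) and then passed to `build` through the `html` parameter mentioned in the README.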
opened by sbhadade 2
How to scrape a dynamic website?

I am trying to export a localhost website that is generated with this project:

https://github.com/HBehrens/puncover

The project generates a localhost website, and each time the user interacts clicks a link the project receives a GET request and the website generates the HTML. This means that the HTML is generated each time the user access a link through their browser. At the moment the project does not export the website to html or pdf. For this reason I want to know how could I recursively get all the hyperlinks and then generate the HTML version. Would this be possible with autoscraper?
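Collecting every linked page is a crawling task rather than something the README examples cover. A rough standard-library sketch of a breadth-first crawl is shown below. `LinkParser`, `crawl`, the in-memory `SITE` dictionary and the port number are all assumptions made so the example runs offline; a real run would replace `SITE.get` with a function that downloads each URL from the local server.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch):
    """Breadth-first walk over pages reachable from start_url; returns {url: html}."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            # Stay on the same site and visit each page only once.
            if target.startswith(start_url) and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# In-memory stand-in for the local server (hypothetical content and port).
SITE = {
    "http://localhost:5000/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://localhost:5000/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://localhost:5000/b": '<a href="https://example.com/">external</a>',
}
pages = crawl("http://localhost:5000/", SITE.get)
print(sorted(pages))
```

The returned dictionary maps each URL to its HTML, which can then be written to files to produce a static copy of the site.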

opened by vChavezB 2

Getting the candidate value back when scraping

This is my code

from autoscraper import AutoScraper

url = 'https://www.thedailystar.net/news/bangladesh/diplomacy/news/rohingya-repatriation-countries-should-impose-sanctions-pressurise-myanmar-2922581'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["Many of our development partners are selling arms to Myanmar: Foreign Minister"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

This is the result

I am getting the candidate value, i.e. wanted_list = ["Many of our development partners are selling arms to Myanmar: Foreign Minister"], back as the result. I am new to autoscraper (I only started trying it out today). Is this the usual result I should expect, or should I get the content of the whole webpage?

opened by p0l4r 0

Releases(v1.1.14)

v1.1.14(Jul 17, 2022)
v1.1.12(Jan 23, 2021)
v1.1.11(Jan 10, 2021)
v1.1.10(Nov 29, 2020)
v1.1.9(Nov 5, 2020)
v1.1.8(Oct 26, 2020)
v1.1.7(Oct 15, 2020)
v1.1.6(Oct 4, 2020)
v1.1.5(Sep 17, 2020)
v1.1.4(Sep 16, 2020)
v1.1.3(Sep 14, 2020)
v1.1.2(Sep 13, 2020)
v1.1.1(Sep 11, 2020)
v1.1.0(Sep 11, 2020)
v1.0.4(Sep 8, 2020)

Owner

Mika

GitHub Repository

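Automatic web check-in implemented with python+selenium + daily email delivery + Kingsoft PowerWord (iCIBA) sentence of the day + "poisonous chicken soup" quotes (running stably since February)

Automatic web check-in implemented with python+selenium. Description: this check-in script is intended for the Zhengzhou University health check-in; other web check-ins can also use it as a reference. (For my own use; running stably since February.) For learning and exchange only; please do not rely on it. The developer takes no responsibility for problems caused by using this script, makes no guarantee about its results, and in principle provides no technical support of any kind. To prevent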

1 Aug 27, 2022

A web service for scanning media hosted by a Matrix media repository

Matrix Content Scanner A web service for scanning media hosted by a Matrix media repository Installation TODO Development In a virtual environment wit

5 Dec 01, 2022

Web crawling framework based on asyncio.

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp. Requirements Python3.5+ Installation pip install gain pip install uvloo

2k Jan 05, 2023

A Python library for automating interaction with websites.

Home page https://mechanicalsoup.readthedocs.io/ Overview A Python library for automating interaction with websites. MechanicalSoup automatically stor

4.3k Jan 07, 2023

An Automated udemy coupons scraper which scrapes coupons and autopost the result in blogspot post

Autoscraper-n-blogger An Automated udemy coupons scraper which scrapes coupons and autopost the result in blogspot post and notifies via Telegram bot

13 Dec 21, 2022

Scrape Twitter for Tweets

Backers Thank you to all our backers! 🙏 [Become a backer] Sponsors Support this project by becoming a sponsor. Your logo will show up here with a lin

2.2k Jan 05, 2023

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python This project is made for automatic web scraping to make scraping easy. It

4.8k Jan 04, 2023

Crawler for the Fundamentus.com site using the scrapy framework, covering both the detailed tab and the summary tab.

Crawler for the Fundamentus.com site using the scrapy framework, covering both the detailed tab and the summary tab. (All the information)

3 Oct 04, 2022

PS5 bot to find a console in France for Christmas 🎄🎅🏻 NOT FOR SCALPERS

Une PS5 pour Noël (A PS5 for Christmas) Python + Chrome --headless = a PS5 for Christmas MacOS Install Chrome Tweak the .yaml for the list of sites to scrape and the criteria

3 Feb 13, 2022

Pro Football Reference Game Data Webscraper

Pro Football Reference Game Data Webscraper Code Copyright Yeetzsche This is a simple Pro Football Reference Webscraper that can either collect all ga

6 Dec 21, 2022

Simple python tool for the purpose of swapping latinic letters with cirilic ones and vice versa in txt, docx and pdf files in Serbian language

Alpha Swap English This is a simple python tool for the purpose of swapping latinic letters with cirylic ones and vice versa, in txt, docx and pdf fil

3 May 31, 2022

Twitter Claimer / Swapper / Turbo - Proxyless - Multithreading

Twitter Turbo / Auto Claimer / Swapper Version: 1.0 Last Update: 01/26/2022 Use this at your own descretion. I've only used this on test accounts and

6 May 02, 2022

Get paper names from dblp.org

scraper-dblp Get paper names from dblp.org and store them in a .txt file Useful for a related literature :) Install libraries pip3 install -r requirem

1 Dec 07, 2021

Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website by form number and returns the results as json

Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website (prior form publication) by form number and returns the results as json. It provides the option to download pdfs over a ra

1 Jan 04, 2022

