robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

Last update: Dec 27, 2022

Overview

RoboBrowser: Your friendly neighborhood web scraper

Homepage: http://robobrowser.readthedocs.org/

RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser can fetch a page, click on links and buttons, and fill out and submit forms. If you need to interact with web services that don't have APIs, RoboBrowser can help.

import re
from robobrowser import RoboBrowser

# Browse to Genius
browser = RoboBrowser(history=True)
browser.open('http://genius.com/')

# Search for Porcupine Tree
form = browser.get_form(action='/search')
form                # <RoboForm q=>
form['q'].value = 'porcupine tree'
browser.submit_form(form)

# Look up the first song
songs = browser.select('.song_link')
browser.follow_link(songs[0])
lyrics = browser.select('.lyrics')
lyrics[0].text      # \nHear the sound of music ...

# Back to results page
browser.back()

# Look up my favorite song
song_link = browser.get_link('trains')
browser.follow_link(song_link)

# Can also search HTML using regex patterns
lyrics = browser.find(class_=re.compile(r'\blyrics\b'))
lyrics.text         # \nTrain set and match spied under the blind...
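
The `\b` anchors in the pattern above restrict the match to `lyrics` as a whole word. The following sketch uses only the standard `re` module on made-up class strings (they are illustrative and not taken from Genius) to show which values such a pattern accepts. It demonstrates the pattern itself, not how BeautifulSoup applies it to an element's classes.

```python
import re

# Same pattern as in the example above: "lyrics" as a whole word.
pattern = re.compile(r'\blyrics\b')

# Hypothetical class attribute values, for illustration only.
candidates = ['lyrics', 'song lyrics', 'lyrics-body', 'lyricsheet', 'Lyrics']

# Record whether the pattern is found anywhere in each candidate.
matches = {value: bool(pattern.search(value)) for value in candidates}

for value, found in matches.items():
    print(value, found)

# A case-insensitive variant, using the same re.I flag as the 'column' search.
loose = re.compile(r'\blyrics\b', re.I)
```

Here `lyricsheet` is rejected because no word boundary follows `lyrics`, and `Lyrics` is rejected only by the case-sensitive pattern.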

RoboBrowser combines the best of two excellent Python libraries: Requests and BeautifulSoup. RoboBrowser represents browser sessions using Requests and HTML responses using BeautifulSoup, transparently exposing methods of both libraries:

import re
from robobrowser import RoboBrowser

browser = RoboBrowser(user_agent='a python robot')
browser.open('https://github.com/')

# Inspect the browser session
browser.session.cookies['_gh_sess']         # BAh7Bzo...
browser.session.headers['User-Agent']       # a python robot

# Search the parsed HTML
browser.select('div.teaser-icon')       # [<div class="teaser-icon">
                                        # <span class="mega-octicon octicon-checklist"></span>
                                        # </div>,
                                        # ...
browser.find(class_=re.compile(r'column', re.I))    # <div class="one-third column">
                                                    # <div class="teaser-icon">
                                                    # <span class="mega-octicon octicon-checklist"></span>
                                                    # ...

You can also pass a custom Session instance for lower-level configuration:

from requests import Session
from robobrowser import RoboBrowser

session = Session()
session.verify = False  # Skip SSL verification
session.proxies = {'http': 'http://custom.proxy.com/'}  # Set default proxies
browser = RoboBrowser(session=session)

RoboBrowser also includes tools for working with forms, inspired by WebTest and Mechanize.

from robobrowser import RoboBrowser

browser = RoboBrowser()
browser.open('http://twitter.com')

# Get the signup form
signup_form = browser.get_form(class_='signup')
signup_form         # <RoboForm user[name]=, user[email]=, ...

# Inspect its values
signup_form['authenticity_token'].value     # 6d03597 ...

# Fill it out
signup_form['user[name]'].value = 'python-robot'
signup_form['user[user_password]'].value = 'secret'

# Submit the form
browser.submit_form(signup_form)

Checkboxes:

from robobrowser import RoboBrowser

# Browse to a page with checkbox inputs
browser = RoboBrowser()
browser.open('http://www.w3schools.com/html/html_forms.asp')

# Find the form
form = browser.get_forms()[3]
form                            # <RoboForm vehicle=[]>
form['vehicle']                 # <robobrowser.forms.fields.Checkbox...>

# Checked values can be read and set like lists
form['vehicle'].options         # [u'Bike', u'Car']
form['vehicle'].value           # []
form['vehicle'].value = ['Bike']
form['vehicle'].value = ['Bike', 'Car']

# Values can also be set using input labels
form['vehicle'].labels          # [u'I have a bike', u'I have a car \r\n']
form['vehicle'].value = ['I have a bike']
form['vehicle'].value           # [u'Bike']

# Only values that correspond to checkbox values or labels can be set;
# this will raise a `ValueError`
form['vehicle'].value = ['Hot Dogs']
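
The value-or-label rule described in the comments above can be modelled in a few lines. The `CheckboxSketch` class below is a hypothetical stand-in written for this illustration. It is not RoboBrowser's implementation and only mirrors the behavior shown in the example: labels are mapped to their values, and anything else raises `ValueError`. Matching labels exactly, with no whitespace handling, is an assumption.

```python
class CheckboxSketch:
    """Hypothetical model of a checkbox group; not RoboBrowser's own class."""

    def __init__(self, options, labels):
        self.options = list(options)
        # Map each label to the option value it belongs to.
        self._by_label = dict(zip(labels, self.options))
        self._value = []

    @property
    def value(self):
        return list(self._value)

    @value.setter
    def value(self, selected):
        resolved = []
        for item in selected:
            if item in self.options:
                resolved.append(item)
            elif item in self._by_label:
                resolved.append(self._by_label[item])
            else:
                # Reject anything that is neither a value nor a label.
                raise ValueError('Option {0!r} not found'.format(item))
        self._value = resolved


vehicle = CheckboxSketch(['Bike', 'Car'], ['I have a bike', 'I have a car'])
vehicle.value = ['I have a bike']   # set by label, stored as the value 'Bike'

try:
    vehicle.value = ['Hot Dogs']    # neither a value nor a label
except ValueError as error:
    rejected = str(error)
```

In this sketch a rejected assignment leaves the previous selection untouched, because the error is raised before the stored value is replaced.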

Uploading files:

from robobrowser import RoboBrowser

# Browse to a page with an upload form
browser = RoboBrowser()
browser.open('http://cgi-lib.berkeley.edu/ex/fup.html')

# Find the form
upload_form = browser.get_form()
upload_form                     # <RoboForm upfile=, note=>

# Choose a file to upload
upload_form['upfile']           # <robobrowser.forms.fields.FileInput...>
upload_form['upfile'].value = open('path/to/file.txt', 'r')

# Submit
browser.submit_form(upload_form)
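
The file object assigned in the example above is never closed. Below is a minimal sketch of one way to scope it, assuming the `get_form` and `submit_form` methods used in the earlier examples. The helper `upload_file` and its parameters are hypothetical names introduced here, and opening the file in binary mode is an assumption rather than something the original example does.

```python
def upload_file(browser, path, field='upfile'):
    """Attach the file at `path` to `field` of the page's form and submit it.

    `browser` is expected to be a RoboBrowser that has already opened a
    page containing an upload form. This helper is illustrative only.
    """
    form = browser.get_form()
    # Keep the file open only for as long as the submission needs it.
    with open(path, 'rb') as handle:
        form[field].value = handle
        browser.submit_form(form)
```

With a browser that has already opened the upload page, the call would be `upload_file(browser, 'path/to/file.txt')`.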

By default, creating a browser instantiates a new requests Session.

Requirements

Python >= 2.6 or >= 3.3

License

MIT licensed. See the bundled LICENSE file for more details.

Owner

Joshua Carp

A command-line program to download media, like and unlike posts, and more from creators on OnlyFans.

This project was created using Python technology and flask tools to scrape a music site

Automatically submits the daily body temperature report (GitHub Actions)

A dead simple crawler to get books information from Douban.

Simply scrape / download all the media from a Fansly account.

Pelican plugin that adds site search capability

DaProfiler allows you to get emails, social media, addresses, works and more on your target using web scraping and Google dorking techniques

This was supposed to be a web scraping project, but somehow I've turned it into a spamming project

Google Developer Profile Badge Scraper

Telegram group scraper tool

HTML content / article extractor, web scraping library in Python

An automated Udemy coupon scraper that scrapes coupons and auto-posts the results in a Blogspot post

Explore scraping with BeautifulSoup!

Web Scraping COVID 19 Meta Portal with Python

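Crawls the day's announcements from the major SRCs (security response centers) | A small tool that sends notifications via WeChat | Bounty tool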

A python module to parse the Open Graph Protocol

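Automated web check-in implemented with Python + Selenium + daily email sending + Kingsoft PowerWord daily sentence + "poisonous chicken soup" quotes (running stably since February)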

This tool can be used to extract information from any website

Automated LinkedIn bot that will improve your visibility and increase your network.
