Web-scraping - A bot using Python with BeautifulSoup that scrapes the IRS website by form number and returns the results as JSON

Last update: Jan 04, 2022

Related tags

Web Crawling Web-scraping

Overview

Extract Data from the IRS website

A bot using Python with BeautifulSoup that scrapes the IRS website (prior form publication) by form number and returns the results as JSON. It provides the option to download PDFs over a range of years.
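As a rough illustration of the scraping step, the sketch below parses a results table with BeautifulSoup. The HTML snippet, its three-column layout, the link paths, and the function name parse_rows are all assumptions made for illustration; the real IRS page markup and the code in Script.py may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one page of search results.
# The real IRS page may use different tags, classes, and link paths.
SAMPLE_HTML = """
<table>
  <tr><th>Product Number</th><th>Title</th><th>Revision Date</th></tr>
  <tr>
    <td><a href="/files/fw2-2020.pdf">Form W-2</a></td>
    <td>Wage and Tax Statement (Info Copy Only)</td>
    <td>2020</td>
  </tr>
  <tr>
    <td><a href="/files/fw2-2019.pdf">Form W-2</a></td>
    <td>Wage and Tax Statement (Info Copy Only)</td>
    <td>2019</td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return one dict per data row of a three-column results table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) != 3:
            continue  # skip the header row and any malformed rows
        link = cells[0].find("a")
        rows.append({
            "form_number": cells[0].get_text(strip=True),
            "form_title": cells[1].get_text(strip=True),
            "year": cells[2].get_text(strip=True),
            "href": link["href"] if link else None,
        })
    return rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE_HTML):
        print(row)
```

Returning plain dictionaries keeps the parsing step separate from any later filtering by form number or year.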

How to run the script?

This script runs on Python 3.8. Install the libraries listed in requirements.txt into a new environment, then run 'Script.py'.

What should I expect?

The script will ask you for the form number(s), then scrape the IRS website.
--> Please enter the complete tax form number separated by a comma followed by a space (not case sensitive): (ie. Form W-2, Form 1095-C, Form W-3, etc)
--> Form W-2, Form 1095-C
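One way the comma-separated answer could be turned into a list of form numbers is sketched below. The function names parse_form_numbers and same_form are hypothetical, and the sketch is slightly more lenient than the prompt, since it accepts commas with or without a following space.

```python
def parse_form_numbers(raw):
    """Split a comma-separated answer into cleaned, de-duplicated form numbers."""
    forms = []
    seen = set()
    for part in raw.split(","):
        name = " ".join(part.split())  # trim and collapse inner whitespace
        if not name:
            continue  # ignore empty entries such as a trailing comma
        key = name.casefold()  # the prompt says input is not case sensitive
        if key not in seen:
            seen.add(key)
            forms.append(name)
    return forms


def same_form(a, b):
    """Case-insensitive comparison of two form numbers."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(parse_form_numbers("Form W-2, Form 1095-C"))
```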

Then the bot will ask whether the user would like to download the forms.
--> Would you like to download all related pdfs? (Y/N)

If the user answers yes, the bot will follow up by asking for a year range.
--> Please provide the year range by using a dash in between the years (starting year must be smaller than ending year): (ie. 2018-2020)
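The year-range answer can be validated before any download starts. The sketch below is one possible approach, not the script's actual code; the function name parse_year_range and the choice to raise ValueError are assumptions.

```python
import re


def parse_year_range(raw):
    """Parse an answer such as '2018-2020' into (start, end) integers."""
    match = re.fullmatch(r"\s*(\d{4})\s*-\s*(\d{4})\s*", raw)
    if match is None:
        raise ValueError(f"expected a range such as 2018-2020, got {raw!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start >= end:
        # The prompt requires the starting year to be smaller than the ending year.
        raise ValueError("starting year must be smaller than ending year")
    return start, end


if __name__ == "__main__":
    start, end = parse_year_range("2018-2020")
    print(list(range(start, end + 1)))  # every year covered by the range
```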

Once executed, the bot will automatically create a folder and download the relevant PDFs into it.
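The folder-creation step might look roughly like the sketch below, which uses only the standard library. The one-folder-per-form layout, the "<form> - <year>.pdf" file naming, and the helper names pdf_path and save_pdf are assumptions; the original script may organize its downloads differently. Fetching the PDF bytes over the network is left out.

```python
from pathlib import Path


def pdf_path(root, form_number, year):
    """Create the folder for a form if needed and return the target file path."""
    folder = Path(root) / form_number
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder / f"{form_number} - {year}.pdf"


def save_pdf(root, form_number, year, content):
    """Write already-downloaded PDF bytes to disk and return the file path."""
    target = pdf_path(root, form_number, year)
    target.write_bytes(content)
    return target


if __name__ == "__main__":
    # Placeholder bytes stand in for a real download.
    print(save_pdf("downloads", "Form W-2", 2020, b"%PDF-1.4 placeholder"))
```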

Finally, the results will be returned as a JSON string. If there are no results, the user will get 'No results' instead.

Sample output:
[
  {'form_number': 'Form W-2', 'form_title': 'Wage and Tax Statement (Info Copy Only)', 'min_year': '1954', 'max_year': '2022'},
  {'form_number': 'Form 1095-C', 'form_title': 'Employer-Provided Health Insurance Offer and Coverage', 'min_year': '2014', 'max_year': '2022'},
  {'form_number': 'Form W-3', 'form_title': 'Transmittal of Wage and Tax Statements (Info Copy Only)', 'min_year': '1990', 'max_year': '2022'}
]
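A summary with this shape could be produced by grouping scraped rows by form number and tracking the earliest and latest year for each. The sketch below assumes each row is a dictionary with form_number, form_title, and year keys; that row format and the function name summarize are hypothetical. Note that json.dumps emits double-quoted strings, whereas the sample output above uses single quotes, as a printed Python list would.

```python
import json


def summarize(rows):
    """Collapse per-year rows into one entry per form, returned as a JSON string."""
    summary = {}
    for row in rows:
        year = int(row["year"])
        entry = summary.setdefault(
            row["form_number"],
            {
                "form_number": row["form_number"],
                "form_title": row["form_title"],
                "min_year": year,
                "max_year": year,
            },
        )
        entry["min_year"] = min(entry["min_year"], year)
        entry["max_year"] = max(entry["max_year"], year)
    if not summary:
        return "No results"
    # Years are emitted as strings to match the sample output.
    results = [
        {**entry, "min_year": str(entry["min_year"]), "max_year": str(entry["max_year"])}
        for entry in summary.values()
    ]
    return json.dumps(results, indent=2)


if __name__ == "__main__":
    demo_rows = [
        {"form_number": "Form W-2", "form_title": "Wage and Tax Statement", "year": "2019"},
        {"form_number": "Form W-2", "form_title": "Wage and Tax Statement", "year": "2021"},
    ]
    print(summarize(demo_rows))
```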

Note: To keep users engaged, the bot will display which task it is performing and which URL it is currently searching.

mlscraper: Scrape data from HTML pages automatically with Machine Learning

A crawler written in Python that automatically fetches the lyrics of all songs by a given artist on QQ Music

Web Scraping images using Selenium and Python

This scraper scrapes the email IDs of faculty members from a given link/page and stores them in a CSV file

The core packages of security analyzer web crawler

👁️ Tool for Data Extraction and Web Requests.

Auto Join: A GitHub action script to automatically invite everyone who stars your repository to the organization.

Demonstration on how to use async python to control multiple playwright browsers for web-scraping

robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

A package that provides you Latest Cyber/Hacker News from website using Web-Scraping.

Docker containerized Python Flask API that uses selenium to scrape and interact with websites

A simple Discord scraper for discord bots

A distributed crawler for weibo, built with celery and requests.

This project was created using Python technology and flask tools to scrape a music site

GitHub scraper app, built with the streamlit and BeautifulSoup Python packages, that scrapes data for a specific user profile

This tool crawls a list of websites and downloads all PDF and office documents

Download images from forum threads

WebScrapping Project - G1 Latest News

Command line program to download documents from web portals.

DaProfiler allows you to get emails, social media accounts, addresses, works and more on your target using web scraping and google dorking techniques