Python tutorial for implementing Oxylabs' Residential Proxies with AIOHTTP

Overview

Integrating Oxylabs' Residential Proxies with AIOHTTP

Requirements for the Integration

For the integration to work you'll need to install aiohttp library, use Python 3.6 version or higher and Residential Proxies.
If you don't have aiohttp library, you can install it by using pip command:

pip install aiohttp

You can get Residential Proxies here: https://oxylabs.io/products/residential-proxy-pool

Proxy Authentication

There are 2 ways to authenticate proxies with aiohttp.
The first way is to authorize and pass credentials along with the proxy URL using aiohttp.BasicAuth:

import aiohttp

USER = "user"
PASSWORD = "pass"
END_POINT = "pr.oxylabs.io:7777"
 
async def fetch():
    async with aiohttp.ClientSession() as session:
        proxy_auth = aiohttp.BasicAuth(USER, PASSWORD)
        async with session.get(
                "http://ip.oxylabs.io", 
                proxy="http://pr.oxylabs.io:7777", 
                proxy_auth=proxy_auth ,
        ) as resp:
            print(await resp.text())

The second one is by passing authentication credentials in proxy URL:

import aiohttp

USER = "user"
PASSWORD = "pass"
END_POINT = "pr.oxylabs.io:7777"

async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get(
                "http://ip.oxylabs.io", 
                proxy=f"http://{USER}:{PASSWORD}@{END_POINT}",
        ) as resp: 
            print(await resp.text())

In order to use your own proxies, adjust user and pass fields with your Oxylabs account credentials.

Testing Proxies

To see if the proxy is working, try visiting https://ip.oxylabs.io. If everything is working correctly, it will return an IP address of a proxy that you're currently using.

Sample Project: Extracting Data From Multiple Pages

To better understand how residential proxies can be utilized for asynchronous data extracting operations, we wrote a sample project to scrape product listing data and save the output to a CSV file. The proxy rotation allows us to send multiple requests at once risk-free – meaning that we don't need to worry about CAPTCHA or getting blocked. This makes the web scraping process extremely fast and efficient – now you can extract data from thousands of products in a matter of seconds!

li > article.product_pod"): data = { "title": product_data.select_one("h3 > a")["title"], "url": product_data.select_one("h3 > a").get("href")[5:], "product_price": product_data.select_one("p.price_color").text, "stars": product_data.select_one("p")["class"][1], } results_list.append(data) # Fill results_list by reference. print(f"Extracted data for a book: {data['title']}") async def fetch(session, sem, url, results_list): async with sem: async with session.get( url, proxy=f"http://{USER}:{PASSWORD}@{END_POINT}", ) as response: await parse_data(await response.text(), results_list) async def create_jobs(results_list): sem = asyncio.Semaphore(4) async with aiohttp.ClientSession() as session: await asyncio.gather( *[fetch(session, sem, url, results_list) for url in url_list] ) if __name__ == "__main__": results = [] start = time.perf_counter() # Different EventLoopPolicy must be loaded if you're using Windows OS. # This helps to avoid "Event Loop is closed" error. if sys.platform.startswith("win") and sys.version_info.minor >= 8: asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy()) try: asyncio.run(create_jobs(results)) except Exception as e: print(e) print("We broke, but there might still be some results") print( f"\nTotal of {len(results)} products from {len(url_list)} pages " f"gathered in {time.perf_counter() - start:.2f} seconds.", ) df = pd.DataFrame(results) df["url"] = df["url"].map( lambda x: "".join(["https://books.toscrape.com/catalogue", x]) ) filename = "scraped-books.csv" df.to_csv(filename, encoding="utf-8-sig", index=False) print(f"\nExtracted data can be found at {os.path.join(os.getcwd(), filename)}") ">
import asyncio
import time
import sys
import os

import aiohttp
import pandas as pd
from bs4 import BeautifulSoup

USER = "user"
PASSWORD = "pass"
END_POINT = "pr.oxylabs.io:7777"

# Generate a list of URLs to scrape.
url_list = [
    f"https://books.toscrape.com/catalogue/category/books_1/page-{page_num}.html"
    for page_num in range(1, 51)
]


async def parse_data(text, results_list):
    soup = BeautifulSoup(text, "lxml")
    for product_data in soup.select("ol.row > li > article.product_pod"):
        data = {
            "title": product_data.select_one("h3 > a")["title"],
            "url": product_data.select_one("h3 > a").get("href")[5:],
            "product_price": product_data.select_one("p.price_color").text,
            "stars": product_data.select_one("p")["class"][1],
        }
        results_list.append(data)  # Fill results_list by reference.
        print(f"Extracted data for a book: {data['title']}")


async def fetch(session, sem, url, results_list):
    async with sem:
        async with session.get(
            url,
            proxy=f"http://{USER}:{PASSWORD}@{END_POINT}",
        ) as response:
            await parse_data(await response.text(), results_list)


async def create_jobs(results_list):
    sem = asyncio.Semaphore(4)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *[fetch(session, sem, url, results_list) for url in url_list]
        )


if __name__ == "__main__":
    results = []
    start = time.perf_counter()

    # Different EventLoopPolicy must be loaded if you're using Windows OS.
    # This helps to avoid "Event Loop is closed" error.
    if sys.platform.startswith("win") and sys.version_info.minor >= 8:
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

    try:
        asyncio.run(create_jobs(results))
    except Exception as e:
        print(e)
        print("We broke, but there might still be some results")

    print(
        f"\nTotal of {len(results)} products from {len(url_list)} pages "
        f"gathered in {time.perf_counter() - start:.2f} seconds.",
    )
    df = pd.DataFrame(results)
    df["url"] = df["url"].map(
        lambda x: "".join(["https://books.toscrape.com/catalogue", x])
    )
    filename = "scraped-books.csv"
    df.to_csv(filename, encoding="utf-8-sig", index=False)
    print(f"\nExtracted data can be found at {os.path.join(os.getcwd(), filename)}")

If you want to test the project's script by yourself, you'll need to install some additional packages. To do that, simply download requirements.txt file and use pip command:

pip install -r requirements.txt

If you're having any trouble integrating proxies with aiohttp and this guide didn't help you - feel free to contact Oxylabs customer support at [email protected].

Owner
Oxylabs.io
Oxylabs.io
Repo for investigation of timeouts that happens with prolonged training on clients

Flower-timeout Repo for investigation of timeouts that happens with prolonged training on clients. This repository is meant purely for demonstration o

1 Jan 21, 2022
wireguard-config-benchmark is a python script that benchmarks the download speeds for the connections defined in one or more wireguard config files

wireguard-config-benchmark is a python script that benchmarks the download speeds for the connections defined in one or more wireguard config files. If multiple configs are benchmarked it will output

Sal 12 May 07, 2022
boofuzz: Network Protocol Fuzzing for Humans

boofuzz: Network Protocol Fuzzing for Humans Boofuzz is a fork of and the successor to the venerable Sulley fuzzing framework. Besides numerous bug fi

Joshua Pereyda 1.7k Dec 31, 2022
Monitoring plugin to check network interfaces with Icinga, Nagios and other compatible monitoring solutions

check_network_interface - Monitor network interfaces This is a monitoring plugin for Icinga, Nagios and other compatible monitoring solutions to check

DinoTools 3 Nov 15, 2022
A simple python script that parses the MSFT Teams log file for the users current Teams status and then outputs the status color to a MQTT connected light.

Description A simple python script that parses the MSFT Teams log file for the users current Teams status and then outputs the status color to a MQTT

Lorentz Factr 8 Dec 16, 2022
netpy - more than implementation of netcat 🐍🔥

netpy - more than implementation of netcat 🐍🔥

Mahmoud S. ElGammal 1 Jan 26, 2022
A simple and lightweight server that allows clients to connect and launch a shell remotely through a browser.

carrotsh A simple and lightweight server that allows clients to connect and launch a shell remotely through a browser. Uses xterm.js for the frontend

V9 31 Dec 27, 2022
A Python library to utilize AWS API Gateway's large IP pool as a proxy to generate pseudo-infinite IPs for web scraping and brute forcing.

A Python library to utilize AWS API Gateway's large IP pool as a proxy to generate pseudo-infinite IPs for web scraping and brute forcing.

George O 929 Jan 01, 2023
This is a small python code that I use with my NAS server connected to Plex

Spotifarr This is a small python code that I use with my NAS server connected to Plex I didn't appreciate how Lidarr works because it downloads a full

Automator 35 Oct 04, 2022
BLE parser for passive BLE advertisements

This pypi package is parsing BLE advertisements to readable data for several sensors and can be used for device tracking, as long as the MAC address is static. The parser was originally developed as

Ernst Klamer 19 Dec 26, 2022
This tool will scans your wi-fi/wlan and show you the connected clients

This tool will scans your wi-fi/wlan and show you the connected clients

VENKAT SAI SAGAR 3 Mar 24, 2022
ASC - Api Server Controller

ASC - Api Server Controller

Uriel Alves 1 Jan 03, 2022
This program ingests a Cisco "sh ip arp" as a text file and produces the list of vendors seen in the file

IP-ARP-Vendor_lookup This program ingests a Cisco "sh ip arp" as a text file and produces the list of vendors seen in the file Why? Answers the questi

Stew Alexander 1 Dec 24, 2022
A TrueCharts automatic and bulk update utility

trueupdate A TrueCharts automatic and bulk update utility How to install run pip install trueupdate Please be aware you will need to reinstall after e

TrueCharts 125 Jan 04, 2023
A fire and forget command-line tool to allow for easy transitions of VPN connections between a pool of AWS machines.

VPN Swapper A fire and forget command-line tool to allow for easy transitions of VPN connections between a pool of AWS machines. Dependencies poetry -

Workday 5 Jul 07, 2022
Automatic Proxy scraper and Proxy-rotating Nitro Generator.

Automatic Proxy scraper and Proxy-rotating Nitro Generator.

Tawren007 2 Nov 08, 2021
Tool for ROS 2 IP Discovery + System Monitoring

Monitor the status of computers on a network using the DDS function of ROS2.

Ar-Ray 33 Apr 03, 2022
GhostVPN - Simple and lightweight TUI application for CyberGhostVPN

GhostVPN Simple and lightweight TUI application for CyberGhostVPN. Screenshot Us

Mehmet Ali KERİMOĞLU 5 Jul 27, 2022
Use Raspberry Pi and CircuitSetup's power monitor hardware to publish electrical usage to MQTT

This repo has code and notes for whole home electrical power monitoring using a Raspberry Pi and CircuitSetup modules. Beyond just collecting data, it

Eric Tsai 10 Jul 25, 2022
Simple P2P application for sending files over open and forwarded network ports.

FileShareV2 A major overhaul to the V1 (now deprecated) FileShare application. V2 brings major improvements in both UI and performance. V2 is now base

Michael Wang 1 Nov 23, 2021