Downloader Middleware to support Playwright in Scrapy & Gerapy

Last update: Dec 31, 2022

Related tags

Overview

Gerapy Playwright

This is a package for supporting Playwright in Scrapy, also this package is a module in Gerapy.

Installation

pip3 install gerapy-playwright

Usage

You can use PlaywrightRequest to specify a request which uses playwright to render.

For example:

yield PlaywrightRequest(detail_url, callback=self.parse_detail)

And you also need to enable PlaywrightMiddleware in DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware': 543,
}

Congratulate, you've finished the all of the required configuration.

If you run the Spider again, Playwright will be started to render every web page which you configured the request as PlaywrightRequest.

Settings

GerapyPlaywright provides some optional settings.

Concurrency

You can directly use Scrapy's setting to set Concurrency of Playwright, for example:

CONCURRENT_REQUESTS = 3

Pretend as Real Browser

Some website will detect WebDriver or Headless, GerapyPlaywright can pretend Chromium by inject scripts. This is enabled by default.

You can close it if website does not detect WebDriver to speed up:

GERAPY_PLAYWRIGHT_PRETEND = False

Also you can use pretend attribute in PlaywrightRequest to overwrite this configuration.

Logging Level

By default, Playwright will log all the debug messages, so GerapyPlaywright configured the logging level of Playwright to WARNING.

If you want to see more logs from Playwright, you can change the this setting:

import logging
GERAPY_PLAYWRIGHT_LOGGING_LEVEL = logging.DEBUG

Download Timeout

Playwright may take some time to render the required web page, you can also change this setting, default is 30s:

# playwright timeout
GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT = 30

Headless

By default, Playwright is running in Headless mode, you can also change it to False as you need, default is True:

GERAPY_PLAYWRIGHT_HEADLESS = False

Window Size

You can also set the width and height of Playwright window:

GERAPY_PLAYWRIGHT_WINDOW_WIDTH = 1400
GERAPY_PLAYWRIGHT_WINDOW_HEIGHT = 700

Default is 1400, 700.

Proxy

You can set a proxy channel via below this config:

GERAPY_PLAYWRIGHT_PROXY = 'http://tps254.kdlapi.com:15818'
GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL = {
  'username': 'xxx',
  'password': 'xxxx'
}

Screenshot

You can get screenshot of loaded page, you can pass screenshot args to PlaywrightRequest as dict:

Below are the supported args:

type (str): Specify screenshot type, can be either jpeg or png. Defaults to png.
quality (int): The quality of the image, between 0-100. Not applicable to png image.
full_page (bool): When true, take a screenshot of the full scrollable page. Defaults to False.
clip (dict): An object which specifies clipping region of the page. This option should have the following fields:
- x (int): x-coordinate of top-left corner of clip area.
- y (int): y-coordinate of top-left corner of clip area.
- width (int): width of clipping area.
- height (int): height of clipping area.
omit_background (bool): Hide default white background and allow capturing screenshot with transparency.
timeout (str): Maximum time in milliseconds, defaults to 30 seconds, pass 0 to disable timeout.

For example:

yield PlaywrightRequest(start_url, callback=self.parse_index, wait_for='.item .name', screenshot={
            'type': 'png',
            'full_page': True
        })

then you can get screenshot result in response.meta['screenshot']:

Simplest save it to file:

def parse_index(self, response):
    with open('screenshot.png', 'wb') as f:
        f.write(response.meta['screenshot'].getbuffer())

If you want to enable screenshot for all requests, you can configure it by GERAPY_PLAYWRIGHT_SCREENSHOT.

For example:

GERAPY_PLAYWRIGHT_SCREENSHOT = {
    'type': 'png',
    'full_page': True
}

PlaywrightRequest

PlaywrightRequest provide args which can override global settings above.

url: request url
callback: callback
wait_until: one of "load", "domcontentloaded", "networkidle" see https://playwright.dev/python/docs/api/class-page#page-wait-for-load-state, default is domcontentloaded
wait_for: wait for some element to load, also supports dict
script: script to execute
actions: actions defined for execution of Page object
proxy: use proxy for this time, like http://x.x.x.x:x
proxy_credential: the proxy credential, like {'username': 'xxxx', 'password': 'xxxx'}
sleep: time to sleep after loaded, override GERAPY_PLAYWRIGHT_SLEEP
timeout: load timeout, override GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT
ignore_resource_types: ignored resource types, override GERAPY_PLAYWRIGHT_IGNORE_RESOURCE_TYPES
pretend: pretend as normal browser, override GERAPY_PLAYWRIGHT_PRETEND
screenshot: ignored resource types, see https://playwright.dev/python/docs/api/class-page#page-screenshot, override GERAPY_PLAYWRIGHT_SCREENSHOT

For example, you can configure PlaywrightRequest as:

from gerapy_playwright import PlaywrightRequest

def parse(self, response):
    yield PlaywrightRequest(url,
        callback=self.parse_detail,
        wait_until='domcontentloaded',
        wait_for='title',
        script='() => { return {name: "Germey"} }',
        sleep=2)

Then Playwright will:

wait for document to load
wait for title to load
execute console.log(document) script
sleep for 2s
return the rendered web page content, get from response.meta['screenshot']
return the script executed result, get from response.meta['script_result']

For waiting mechanism controlled by JavaScript, you can use await in script, for example:

js = '''async () => {
    await new Promise(resolve => setTimeout(resolve, 10000));
    return {
        'name': 'Germey'
    }
}
'''
yield PlaywrightRequest(url, callback=self.parse, script=js)

Then you can get the script result from response.meta['script_result'], result is {'name': 'Germey'}.

If you think the JavaScript is wired to write, you can use actions argument to define a function to execute Python based functions, for example:

async def execute_actions(page):
    await page.evaluate('() => { document.title = "Hello World"; }')
    return 1
yield PlaywrightRequest(url, callback=self.parse, actions=execute_actions)

Then you can get the actions result from response.meta['actions_result'], result is 1.

Also you can define proxy and proxy_credential for each Reqest, for example:

yield PlaywrightRequest(
  self.base_url,
  callback=self.parse_index,
  priority=10,
  proxy='http://tps254.kdlapi.com:15818',
  proxy_credential={
      'username': 'xxxx',
      'password': 'xxxx'
})

proxy and proxy_credential will override the settings GERAPY_PLAYWRIGHT_PROXY and GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL.

Example

For more detail, please see example.

Also you can directly run with Docker:

docker run germey/gerapy-playwright-example

Outputs:

2021-12-27 16:54:14 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2021-12-27 16:54:14 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.7.9 (default, Aug 31 2020, 07:22:35) - [Clang 10.0.0 ], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 35.0.0, Platform Darwin-21.1.0-x86_64-i386-64bit
2021-12-27 16:54:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2021-12-27 16:54:14 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
 'CONCURRENT_REQUESTS': 1,
 'NEWSPIDER_MODULE': 'example.spiders',
 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504],
 'SPIDER_MODULES': ['example.spiders']}
2021-12-27 16:54:14 [scrapy.extensions.telnet] INFO: Telnet Password: e931b241390ad06a
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-12-27 16:54:14 [gerapy.playwright] INFO: playwright libraries already installed
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-12-27 16:54:14 [scrapy.core.engine] INFO: Spider opened
2021-12-27 16:54:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-27 16:54:14 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-27 16:54:14 [example.spiders.movie] DEBUG: start url https://antispider1.scrape.center/page/1
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/page/1>
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: set options {'headless': False}
cookies []
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: PRETEND_SCRIPTS is run
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: timeout 10
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: crawling https://antispider1.scrape.center/page/1
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: request https://antispider1.scrape.center/page/1 with options {'url': 'https://antispider1.scrape.center/page/1', 'wait_until': 'domcontentloaded'}
2021-12-27 16:54:18 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:18 [gerapy.playwright] DEBUG: sleep for 1s
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: close playwright
2021-12-27 16:54:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://antispider1.scrape.center/page/1> (referer: None)
2021-12-27 16:54:20 [example.spiders.movie] DEBUG: start url https://antispider1.scrape.center/page/2
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/page/2>
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: set options {'headless': False}
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/1
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/2
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/3
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/4
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/5
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/6
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/7
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/8
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/9
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/10
cookies []
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: PRETEND_SCRIPTS is run
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: timeout 10
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: crawling https://antispider1.scrape.center/page/2
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: request https://antispider1.scrape.center/page/2 with options {'url': 'https://antispider1.scrape.center/page/2', 'wait_until': 'domcontentloaded'}
2021-12-27 16:54:23 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:24 [gerapy.playwright] DEBUG: sleep for 1s
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: close playwright
2021-12-27 16:54:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://antispider1.scrape.center/page/2> (referer: None)
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/detail/10>
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: set options {'headless': False}
...

Comments

twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

python: 3.9 GerapyPlaywright: 0.2.4 os: mac 11.6

运行scrapy crawl spider的时候直接报错:

Traceback (most recent call last):
  File "/Users/zz/.virtualenvs/crawler-apk/bin/scrapy", line 8, in <module>
    sys.exit(execute())
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/cmdline.py", line 145, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/cmdline.py", line 100, in _run_print_help
    func(*a, **kw)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/cmdline.py", line 153, in _run_command
    cmd.run(args, opts)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/commands/crawl.py", line 22, in run
    crawl_defer = self.crawler_process.crawl(spname, **opts.spargs)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/crawler.py", line 82, in __init__
    default.install()
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/twisted/internet/selectreactor.py", line 194, in install
    installReactor(reactor)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/twisted/internet/main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed"
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

这个应该改哪里呢

opened by yyyy777 3

Fix leaking file descriptors by using the context manager

I was running into OSError: [Errno 24] Too many open files while using this with scrapy for scraping a domain.

By using async_playwright() as a context manager, we ensure it's closed once finished. This fixes the issue.

opened by xolan 2
raise BadGzipFile('Not a gzipped file (%r)' % magic) gzip.BadGzipFile: Not a gzipped file (b'

崔佬，我这边也不能用。一启动scrapy，就会报这个。配置： python:3.9.4 macOs: 11.5.2

Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1445, in _inlineCallbacks result = current_context.run(g.send, result) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_response response = yield deferred_from_coro(method(request=request, response=response, spider=spider)) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 62, in process_response decoded_body = self._decode(response.body, encoding.lower()) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 82, in _decode body = gunzip(body) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/scrapy/utils/gz.py", line 27, in gunzip chunk = f.read1(8196) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/gzip.py", line 313, in read1 return self._buffer.read1(size) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/_compression.py", line 68, in readinto data = self.read(len(byte_view)) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/gzip.py", line 487, in read if not self._read_gzip_header(): File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/gzip.py", line 435, in _read_gzip_header raise BadGzipFile('Not a gzipped file (%r)' % magic) gzip.BadGzipFile: Not a gzipped file (b'<!')

opened by wf4867612 2

Method is not Json serializable for actions

Hello,

I am running into an issue with 'yield' for gerapy-playwright when I need to access a login page. I try to run: yield PlaywrightRequest(login_page_url, self.parse_login, actions = self.login_action) in order to first use playwright to login and then access data that can only be accessed when logging in with self.parse_login. I am getting a: builtins.TypeError: is not JSON serializable.

I am using scrapy cluster along with gerapy-playwright in order to run a scheduler for all the spiders that I have: https://github.com/istresearch/scrapy-cluster

It seems that the action is saved in the meta data as a method and cannot be passed to the scheduler. Is it possible to type cast the action as a string and then when the action is called later, to do a method call on the string? If I understand correctly, the action is produced on line 339 of the downloadermiddlewares.py inside of gerepy-playwright. Would it be possible to evaluate the string as a method so that the scrapy-cluster scheduler can pass the string but gerapy-playwright still calls the self.login_action method prior to the self.parse_login?

opened by BenzTivianne 0

出现gzip.BadGzipFile: Not a gzipped file (b'

如何爬去的一个网站返回的response里面的headers包含了 content-encoding: "gzip"的话，那么就会报上述错误，虽然作者在 downloadermiddlewares.py 的代码段中去掉了这个属性：

Necessary to bypass the compression middleware

        # 这个地方只能去掉 headers 中的content-encoding，但是response.headers中的依然存在，所以下面应该直接改为  headers=headers,
        headers = response.headers
        headers.pop('content-encoding', None)
        headers.pop('Content-Encoding', None)

        response = HtmlResponse(
            page.url,
            status=response.status,
            headers=response.headers,    # 解决办法就是改为： headers=headers, 
            body=content,
            encoding='utf-8',
            request=request
        )

但是很可惜的是，去不掉，只有把 headers=response.headers, 改为headers才可以。

opened by legend-zl 3

gzip.BadGzipFile: Not a gzipped file (b'

使用Sscrapy的时候，利用

yield PlaywrightRequest(article_url, callback=self.parse_result, wait_until='domcontentloaded', headers=self.headers)

出现错误：gzip.BadGzipFile: Not a gzipped file (b'<!')

opened by legend-zl 1

playwright._impl._api_types.Error: Browser closed.

这种报错会是什么原因呢...

2022-03-23 06:21:06 [scrapy.core.scraper] ERROR: Error downloading <GET https://apkpure.com/bikers-men-women-bike-photo-editor-future-trends/com.dsrtech.bikers> Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks result = current_context.run( File "/usr/local/lib/python3.8/dist-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator return g.throw(self.type, self.value, self.tb) File "/usr/local/lib/python3.8/dist-packages/scrapy/core/downloader/middleware.py", line 41, in process_request response = yield deferred_from_coro(method(request=request, spider=spider)) File "/usr/local/lib/python3.8/dist-packages/twisted/internet/defer.py", line 1030, in adapt extracted = result.result() File "/usr/local/lib/python3.8/dist-packages/gerapy_playwright/downloadermiddlewares.py", line 243, in _process_request context = await browser.new_context( File "/usr/local/lib/python3.8/dist-packages/playwright/async_api/_generated.py", line 11254, in new_context await self._async( File "/usr/local/lib/python3.8/dist-packages/playwright/_impl/_browser.py", line 117, in new_context channel = await self._channel.send("newContext", params) File "/usr/local/lib/python3.8/dist-packages/playwright/_impl/_connection.py", line 39, in send return await self.inner_send(method, params, False) File "/usr/local/lib/python3.8/dist-packages/playwright/_impl/_connection.py", line 63, in inner_send result = next(iter(done)).result() playwright._impl._api_types.Error: Browser closed. ==================== Browser output: ==================== /ms-playwright/chromium-978106/chrome-linux/chrome --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,AcceptCHFrame,AutoExpandDetailsElement --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-sync --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --disable-extensions --hide-scrollbars --mute-audio --no-sandbox --disable-setuid-sandbox --disable-gpu --user-data-dir=/tmp/playwright_chromiumdev_profile-LGppgb --remote-debugging-pipe --no-startup-window pid=1185 [pid=1185][err] [0323/062041.773971:ERROR:platform_thread_posix.cc(151)] pthread_create: Resource temporarily unavailable (11) [pid=1185][err] [0323/062041.774268:ERROR:platform_thread_posix.cc(151)] pthread_create: Resource temporarily unavailable (11) [pid=1185][err] [0323/062041.778128:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child [pid=1185][err] [0323/062041.778090:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1 [pid=1185][err] [0323/062041.778548:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002 [pid=1185][err] [0323/062041.778568:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 1 time(s) [pid=1185][err] [0323/062041.778540:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser [pid=1185][err] [0323/062041.780496:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child [pid=1185][err] [0323/062041.785120:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1 [pid=1185][err] [0323/062041.785947:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002 [pid=1185][err] [0323/062041.785963:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 2 time(s) [pid=1185][err] [0323/062041.786835:ERROR:bus.cc(397)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory [pid=1185][err] [0323/062041.786892:ERROR:bus.cc(397)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory [pid=1185][err] [0323/062041.787157:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable. [pid=1185][err] [0323/062041.787335:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser [pid=1185][err] [0323/062041.816287:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child [pid=1185][err] [0323/062041.815965:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1 [pid=1185][err] [0323/062041.816903:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002 [pid=1185][err] [0323/062041.816914:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 3 time(s) [pid=1185][err] [0323/062041.816721:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser [pid=1185][err] [0323/062041.821091:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child [pid=1185][err] [0323/062041.821123:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1 [pid=1185][err] [0323/062041.821310:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002 [pid=1185][err] [0323/062041.821321:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 4 time(s) [pid=1185][err] [0323/062041.822089:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser [pid=1185][err] [0323/062041.823058:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child [pid=1185][err] [0323/062041.823172:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1 [pid=1185][err] [0323/062041.823358:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002 [pid=1185][err] [0323/062041.823369:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 5 time(s) [pid=1185][err] [0323/062041.824213:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser [pid=1185][err] [0323/062041.825010:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child [pid=1185][err] [0323/062041.825129:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1 [pid=1185][err] [0323/062041.825312:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002 [pid=1185][err] [0323/062041.825323:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 6 time(s) [pid=1185][err] [0323/062041.825608:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser [pid=1185][err] [0323/062041.825332:FATAL:gpu_data_manager_impl_private.cc(447)] GPU process isn't usable. Goodbye. [pid=1185][err] #0 0x55f9da32a369 base::debug::CollectStackTrace() [pid=1185][err] #1 0x55f9da2908c3 base::debug::StackTrace::StackTrace() [pid=1185][err] #2 0x55f9da2a3650 logging::LogMessage::~LogMessage() [pid=1185][err] #3 0x55f9d7e92bf7 content::(anonymous namespace)::IntentionallyCrashBrowserForUnusableGpuProcess() [pid=1185][err] #4 0x55f9d7e903fe content::GpuDataManagerImplPrivate::FallBackToNextGpuMode() [pid=1185][err] #5 0x55f9d7e8f303 content::GpuDataManagerImpl::FallBackToNextGpuMode() [pid=1185][err] #6 0x55f9d7e99d13 content::GpuProcessHost::RecordProcessCrash() [pid=1185][err] #7 0x55f9d7e9af44 content::GpuProcessHost::OnProcessLaunchFailed() [pid=1185][err] #8 0x55f9d7d15421 content::BrowserChildProcessHostImpl::OnProcessLaunchFailed() [pid=1185][err] #9 0x55f9d7d6faf5 content::internal::ChildProcessLauncherHelper::PostLaunchOnClientThread() [pid=1185][err] #10 0x55f9d7d6fd15 base::internal::Invoker<>::RunOnce() [pid=1185][err] #11 0x55f9da2e8bb0 base::TaskAnnotator::RunTaskImpl() [pid=1185][err] #12 0x55f9da2fca99 base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWorkImpl() [pid=1185][err] #13 0x55f9da2fc7bc base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWork() [pid=1185][err] #14 0x55f9da2fcf92 base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWork() [pid=1185][err] #15 0x55f9da2ac06b base::(anonymous namespace)::WorkSourceDispatch() [pid=1185][err] #16 0x7fe589c2b17d g_main_context_dispatch [pid=1185][err] #17 0x7fe589c2b400 (/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6+0x523ff) [pid=1185][err] #18 0x7fe589c2b4a3 g_main_context_iteration [pid=1185][err] #19 0x55f9da2abeb3 base::MessagePumpGlib::Run() [pid=1185][err] #20 0x55f9da2fd1fe base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::Run() [pid=1185][err] #21 0x55f9da2ca3ed base::RunLoop::Run() [pid=1185][err] #22 0x55f9d7d2d2ad content::BrowserMainLoop::RunMainMessageLoop() [pid=1185][err] #23 0x55f9d7d2eb62 content::BrowserMainRunnerImpl::Run() [pid=1185][err] #24 0x55f9df75683e headless::HeadlessContentMainDelegate::RunProcess() [pid=1185][err] #25 0x55f9d9e42862 content::RunBrowserProcessMain() [pid=1185][err] #26 0x55f9d9e43d0f content::ContentMainRunnerImpl::RunBrowser() [pid=1185][err] #27 0x55f9d9e4389f content::ContentMainRunnerImpl::Run() [pid=1185][err] #28 0x55f9d9e40cb4 content::RunContentProcess() [pid=1185][err] #29 0x55f9d9e415ce content::ContentMain() [pid=1185][err] #30 0x55f9d9e9cc5a headless::(anonymous namespace)::RunContentMain() [pid=1185][err] #31 0x55f9d9e9c965 headless::HeadlessShellMain() [pid=1185][err] #32 0x55f9d6961fa8 ChromeMain [pid=1185][err] #33 0x7fe588ea60b3 __libc_start_main [pid=1185][err] #34 0x55f9d6961dea _start [pid=1185][err] Task trace: [pid=1185][err] #0 0x55f9d7d6f9ac content::internal::ChildProcessLauncherHelper::PostLaunchOnLauncherThread() [pid=1185][err] #1 0x55f9d7d6f3aa content::internal::ChildProcessLauncherHelper::StartLaunchOnClientThread() [pid=1185][err] #2 0x55f9da682456 mojo::SimpleWatcher::Context::Notify() [pid=1185][err] #3 0x55f9d7d6f3aa content::internal::ChildProcessLauncherHelper::StartLaunchOnClientThread() [pid=1185][err] #4 0x55f9da682456 mojo::SimpleWatcher::Context::Notify() [pid=1185][err] Task trace buffer limit hit, update PendingTask::kTaskBacktraceLength to increase. [pid=1185][err]

opened by yyyy777 0

Releases(v0.2.3)

v0.2.3(Jan 11, 2022)
Fix bug: https://github.com/Gerapy/GerapyPlaywright/issues/3

Source code(tar.gz)
Source code(zip)
v0.2.0(Dec 28, 2021)
New Feature: Add support for:

Specifying channel for launching

Specifying executablePath for launching

Specifying slowMo for launching

Specifying devtools for launching

Specifying --disable-extensions in args for launching

Specifying --hide-scrollbars in args for launching

Specifying --no-sandbox in args for launching

Specifying --disable-setuid-sandbox in args for launching

Specifying --disable-gpu in args for launching

Update: change GERAPY_PLAYWRIGHT_SLEEP default to 0

Source code(tar.gz)
Source code(zip)
v0.1.2(Dec 28, 2021)

Fix bug: Add error handling logic for playwright._impl._api_types.Error, to add retrying logic and avoid unexpected memory increase.
Source code(tar.gz)
Source code(zip)
v0.1.1(Dec 27, 2021)
First version of Playwright, add basic support for:

Auto Installation

Render with Playwright

Setting Concurrency

Setting Proxy

Setting Cookies

Screenshot

Evaluating Script

Wait for Elements

Wait loading control

Setting Timeout

Pretending Webdriver

Source code(tar.gz)
Source code(zip)

Owner

Gerapy

Distributed Crawler Management Framework Based on Scrapy, Scrapyd.

GitHub Repository

Youtube video downloader and info extractor for python.

tube_dl Tube_dl is a Simple Youtube video downloader for Python. A Modular approach to bypass and download Youtube Videos and Playlist from Youtube us

16 Jul 09, 2022

Download songs and playlists from Spotify for free!

spotify-to-mp3-converter You can basically understand the process with just this image but for clarity, these are the steps. Before using the exe down

2 Jan 25, 2022

Python script to download (TCR) genes from IMGT/GENE-DB

IMGTgeneDL 0.1.0 Jamie Heather | CCR @ MGH | 2021 This script provides an alternative way to access TCR and IG genes stored in IMGT/GENE-DB. It's prim

1 Mar 30, 2022

A manga download script written in python.

manga-dlp python script to download mangas Description A manga download script written in python. It only supports mangadex.org for now. But support f

15 Nov 28, 2022

ASF Sentinel-1 Metadata Download tool

ASF Sentinel-1 Metadata Download tool Copyright: 2021-2022 Antonio Valentino Small Python tool (asfsmd) that allows to download XML files containing S

9 Dec 04, 2022

PyDownloader - Downloads files and folders at high speed (based on your interent speed).

4 Feb 24, 2022

A program that can download animations from myself website

MYD A program that can download animations from myself website 一個可以用來下載Myself網站上動漫的程式 Quick Start [無GUI版本] 確定電腦內包含 ffmpeg 並設為環境變數 (Environment Variabl

1 Nov 07, 2021

Convert BMS songs to osu! With options to convert keysounds and convert to 7key.

bmx2osu Convert BMS to osu! With options to: convert keysounds to one song file using BMX2WAV include 7k version change Overall Difficulty and HP Drai

7 Nov 28, 2022

A simple kemono.party downloader using python.

kemono-dl This is a simple kemono.party downloader. How to use Install python Download source code from releases and extract it Then install requireme

318 Dec 27, 2022

Ripurei is a free-to-use osu! replay downloader, that can be configured to download from any osu! server.

Ripurei Ripurei is a fully functional osu! replay downloader, fully capable of downloading from almost any osu! server. Functionality Timeline ✔️ Able

0 Feb 11, 2022

Bulk Downloader for Reddit

saveddit is a bulk media downloader for reddit pip3 install saveddit Setting up authorization Register an application with Reddit Write down your clie

136 Jan 03, 2023

Noto fonts go universal! Download Noto fonts combined to suit your region

noto-cjk Noto CJK fonts Noto Serif CJK update was released on 25 October 2021. We moved the release history and other notes into both Sans and Serif s

2k Jan 02, 2023

Python-Youtube-Downloader - An Open Source Python Youtube Downloader

Python-Youtube-Downloader Hello There This Is An Open Source Python Youtube Down

3 Jun 14, 2022

A Quick demo of how to use the youtube_dl module in python.

youtube_dl python module demo A Quick demo of how to use the youtube_dl module in python. Whole documentation for the youtube_dl Installation git

7 Aug 27, 2021

A simple Python program which uses youtube-dl for downloading YouTube videos as mp3 files.

yt-mp3 converter This is a simple Python program which uses youtube-dl for downloading YouTube videos as mp3 files. This program is for you if you are

1 Oct 24, 2021

Download Apple Music Cover Artwork in the best Quality by providing an Apple Music Link. It downloads the jpg, png and webp version since they often differ from another.

amogus.py - Version 0.0.5 amogus - Apple Music Hi-Res Artwork Fetcher this is my first real python tool so sorry if its bad amogus is a Python script

46 Jan 09, 2023

A collection of modules I have created to programmatically search for/download imagery from live cam feeds across the state of California.

A collection of modules that I have created to programmatically search for/download imagery from all publicly available live cam feeds across the state of California. In no way am I affiliated with a

5 Nov 21, 2022

Download India Stocks Historical Data

Kite Helper - Download Stock Market Data 🌎 Website Simple Application to Download any stock market data in .csv format using Kite 🏃‍♂️ Running Serve

12 Dec 06, 2022

Download YouTube videos that are available in the given playlist

Youtube-Playlist-Downloader Download YouTube videos that are in a playlist Project assets: music downloaded music folder. (will be generated) music.db

1 Dec 22, 2021

FireDM is a python open source (Internet Download Manager) with multi-connections, high speed engine, it downloads general files and videos from youtube and tons of other streaming websites .

python open source (Internet Download Manager) with multi-connections, high speed engine, based on python, LibCurl, and youtube_dl https://github.com/firedm/FireDM

1.6k Apr 12, 2022

Downloader Middleware to support Playwright in Scrapy & Gerapy

Related tags

Overview

Gerapy Playwright

Installation

Usage

Settings

Concurrency

Pretend as Real Browser

Logging Level

Download Timeout

Headless

Window Size

Proxy

Screenshot

PlaywrightRequest

Example

Comments

twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

Fix leaking file descriptors by using the context manager

raise BadGzipFile('Not a gzipped file (%r)' % magic) gzip.BadGzipFile: Not a gzipped file (b'

Method is not Json serializable for actions

出现gzip.BadGzipFile: Not a gzipped file (b'

Necessary to bypass the compression middleware

gzip.BadGzipFile: Not a gzipped file (b'

playwright._impl._api_types.Error: Browser closed.

Releases(v0.2.3)

v0.2.3(Jan 11, 2022)

v0.2.0(Dec 28, 2021)

v0.1.2(Dec 28, 2021)

v0.1.1(Dec 27, 2021)

Owner

Gerapy

Youtube video downloader and info extractor for python.

Download songs and playlists from Spotify for free!

Python script to download (TCR) genes from IMGT/GENE-DB

A manga download script written in python.

ASF Sentinel-1 Metadata Download tool

PyDownloader - Downloads files and folders at high speed (based on your interent speed).

A program that can download animations from myself website

Convert BMS songs to osu! With options to convert keysounds and convert to 7key.

A simple kemono.party downloader using python.

Ripurei is a free-to-use osu! replay downloader, that can be configured to download from any osu! server.

Bulk Downloader for Reddit

Noto fonts go universal! Download Noto fonts combined to suit your region

Python-Youtube-Downloader - An Open Source Python Youtube Downloader

A Quick demo of how to use the youtube_dl module in python.

A simple Python program which uses youtube-dl for downloading YouTube videos as mp3 files.

Download Apple Music Cover Artwork in the best Quality by providing an Apple Music Link. It downloads the jpg, png and webp version since they often differ from another.

A collection of modules I have created to programmatically search for/download imagery from live cam feeds across the state of California.

Download India Stocks Historical Data

Download YouTube videos that are available in the given playlist

FireDM is a python open source (Internet Download Manager) with multi-connections, high speed engine, it downloads general files and videos from youtube and tons of other streaming websites .