Extract embedded metadata from HTML markup

Overview

extruct

Build Status Coverage report PyPI Version

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

The microdata algorithm is a revisit of this Scrapinghub blog post showing how to use EXSLT extensions.

Installation

pip install extruct

Usage

All-in-one extraction

The simplest example how to use extruct is to call extruct.extract(htmlstring, base_url=base_url) with some HTML string and an optional base URL.

Let's try this on a webpage that uses all the syntaxes supported (RDFa with ogp).

First fetch the HTML using python-requests and then feed the response body to extruct:

>>> import extruct
>>> import requests
>>> import pprint
>>> from w3lib.html import get_base_url
>>>
>>> pp = pprint.PrettyPrinter(indent=2)
>>> r = requests.get('https://www.optimizesmart.com/how-to-use-open-graph-protocol/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>>
>>> pp.pprint(data)
{ 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',
                                      'content': 'What is Open Graph Protocol '
                                                 'and why you need it? Learn to '
                                                 'implement Open Graph Protocol '
                                                 'for Facebook on your website. '
                                                 'Open Graph Protocol Meta Tags.',
                                      'name': 'description'}],
                      'namespaces': {},
                      'terms': []}],

'json-ld': [ { '@context': 'https://schema.org',
                 '@id': '#organization',
                 '@type': 'Organization',
                 'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',
                 'name': 'Optimize Smart',
                 'sameAs': [ 'https://www.facebook.com/optimizesmart/',
                             'https://uk.linkedin.com/in/analyticsnerd',
                             'https://www.youtube.com/user/optimizesmart',
                             'https://twitter.com/analyticsnerd'],
                 'url': 'https://www.optimizesmart.com/'}],
  'microdata': [ { 'properties': {'headline': ''},
                   'type': 'http://schema.org/WPHeader'}],
  'microformat': [ { 'children': [ { 'properties': { 'category': [ 'specialized-tracking'],
                                                     'name': [ 'Open Graph '
                                                               'Protocol for '
                                                               'Facebook '
                                                               'explained with '
                                                               'examples\n'
                                                               '\n'
                                                               'Specialized '
                                                               'Tracking\n'
                                                               '\n'
                                                               '\n'
                                                               (...)
                                                               'Follow '
                                                               '@analyticsnerd\n'
                                                               '!function(d,s,id){var '
                                                               "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                                               "'script', "
                                                               "'twitter-wjs');"]},
                                     'type': ['h-entry']}],
                     'properties': { 'name': [ 'Open Graph Protocol for '
                                               'Facebook explained with '
                                               'examples\n'
                                               (...)
                                               'Follow @analyticsnerd\n'
                                               '!function(d,s,id){var '
                                               "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                               "'script', 'twitter-wjs');"]},
                     'type': ['h-feed']}],
  'opengraph': [ { 'namespace': {'og': 'http://ogp.me/ns#'},
                   'properties': [ ('og:locale', 'en_US'),
                                   ('og:type', 'article'),
                                   ( 'og:title',
                                     'Open Graph Protocol for Facebook '
                                     'explained with examples'),
                                   ( 'og:description',
                                     'What is Open Graph Protocol and why you '
                                     'need it? Learn to implement Open Graph '
                                     'Protocol for Facebook on your website. '
                                     'Open Graph Protocol Meta Tags.'),
                                   ( 'og:url',
                                     'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'),
                                   ('og:site_name', 'Optimize Smart'),
                                   ( 'og:updated_time',
                                     '2018-03-09T16:26:35+00:00'),
                                   ( 'og:image',
                                     'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'),
                                   ( 'og:image:secure_url',
                                     'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg')]}],
  'rdfa': [ { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/#header',
              'http://www.w3.org/1999/xhtml/vocab#role': [ { '@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},
            { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/',
              'article:modified_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
              'article:published_time': [ { '@value': '2010-07-02T18:57:23+00:00'}],
              'article:publisher': [ { '@value': 'https://www.facebook.com/optimizesmart/'}],
              'article:section': [{'@value': 'Specialized Tracking'}],
              'http://ogp.me/ns#description': [ { '@value': 'What is Open '
                                                            'Graph Protocol '
                                                            'and why you need '
                                                            'it? Learn to '
                                                            'implement Open '
                                                            'Graph Protocol '
                                                            'for Facebook on '
                                                            'your website. '
                                                            'Open Graph '
                                                            'Protocol Meta '
                                                            'Tags.'}],
              'http://ogp.me/ns#image': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
              'http://ogp.me/ns#image:secure_url': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
              'http://ogp.me/ns#locale': [{'@value': 'en_US'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Optimize Smart'}],
              'http://ogp.me/ns#title': [ { '@value': 'Open Graph Protocol for '
                                                      'Facebook explained with '
                                                      'examples'}],
              'http://ogp.me/ns#type': [{'@value': 'article'}],
              'http://ogp.me/ns#updated_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'}],
              'https://api.w.org/': [ { '@id': 'https://www.optimizesmart.com/wp-json/'}]}]}

Select syntaxes

It is possible to select which syntaxes to extract by passing a list with the desired ones to extract. Valid values: 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. If no list is passed all syntaxes will be extracted and returned:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
>>>
>>> pp.pprint(data)
{ 'microdata': [],
  'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                  'fb': 'http://www.facebook.com/2008/fbml',
                                  'og': 'http://ogp.me/ns#'},
                   'properties': [ ('fb:app_id', '308540029359'),
                                   ('og:site_name', 'Songkick'),
                                   ('og:type', 'songkick-concerts:artist'),
                                   ('og:title', 'Elysian Fields'),
                                   ( 'og:description',
                                     'Find out when Elysian Fields is next '
                                     'playing live near you. List of all '
                                     'Elysian Fields tour dates and concerts.'),
                                   ( 'og:url',
                                     'https://www.songkick.com/artists/236156-elysian-fields'),
                                   ( 'og:image',
                                     'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],
  'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
              'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
              'al:ios:app_store_id': [{'@value': '438690886'}],
              'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
              'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                            'Elysian Fields is '
                                                            'next playing live '
                                                            'near you. List of '
                                                            'all Elysian '
                                                            'Fields tour dates '
                                                            'and concerts.'}],
              'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
              'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
              'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
              'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

Uniform

Another option is to uniform the output of microformat, opengraph, microdata, dublincore and json-ld syntaxes to the following structure:

{'@context': 'http://example.com',
             '@type': 'example_type',
             /* All other the properties in keys here */
             }

To do so set uniform=True when calling extract, it's false by default for backward compatibility. Here the same example as before but with uniform set to True:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [],
  'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                               'fb': 'http://www.facebook.com/2008/fbml',
                               'og': 'http://ogp.me/ns#'},
                 '@type': 'songkick-concerts:artist',
                 'fb:app_id': '308540029359',
                 'og:description': 'Find out when Elysian Fields is next '
                                   'playing live near you. List of all '
                                   'Elysian Fields tour dates and concerts.',
                 'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',
                 'og:site_name': 'Songkick',
                 'og:title': 'Elysian Fields',
                 'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],
  'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
              'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
              'al:ios:app_store_id': [{'@value': '438690886'}],
              'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
              'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                            'Elysian Fields is '
                                                            'next playing live '
                                                            'near you. List of '
                                                            'all Elysian '
                                                            'Fields tour dates '
                                                            'and concerts.'}],
              'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
              'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
              'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
              'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

NB rdfa structure is not uniformed yet

Returning HTML node

It is also possible to get references to HTML node for every extracted metadata item. The feature is supported only by microdata syntax.

To use that, just set the return_html_node option of extract method to True. As the result, an additional key "nodeHtml" will be included in the result for every item. Each node is of lxml.etree.Element type:

>>> r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [ { 'htmlNode': <Element div at 0x7f10f8e6d3b8>,
                   'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\n'
                                                  'Not your thin sticky pad, '
                                                  'No-Muv is truly the best!',
                                   'image': ['', ''],
                                   'name': ['No-Muv', 'No-Muv'],
                                   'offers': [ { 'htmlNode': <Element div at 0x7f10f8e6d138>,
                                                 'properties': { 'availability': 'http://schema.org/InStock',
                                                                 'price': 'Price:  '
                                                                          '$45'},
                                                 'type': 'http://schema.org/Offer'},
                                               { 'htmlNode': <Element div at 0x7f10f8e60f48>,
                                                 'properties': { 'availability': 'http://schema.org/InStock',
                                                                 'price': '(Select '
                                                                          'Size/Shape '
                                                                          'for '
                                                                          'Pricing)'},
                                                 'type': 'http://schema.org/Offer'}],
                                   'ratingValue': ['5.00', '5.00']},
                   'type': 'http://schema.org/Product'}]}

Single extractors

You can also use each extractor individually. See below.

Microdata extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Photo gallery</title>
...  </head>
...  <body>
...   <h1>My photos</h1>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
...    <figcaption itemprop="title">The house I found.</figcaption>
...   </figure>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
...    <figcaption itemprop="title">The mailbox.</figcaption>
...   </figure>
...   <footer>
...    <p id="licenses">All images licensed under the <a itemprop="license"
...    href="http://www.opensource.org/licenses/mit-license.php">MIT
...    license</a>.</p>
...   </footer>
...  </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html)
>>> pp.pprint(data)
[{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The house I found.',
                 'work': 'http://www.example.com/images/house.jpeg'},
  'type': 'http://n.whatwg.org/work'},
 {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The mailbox.',
                 'work': 'http://www.example.com/images/mailbox.jpeg'},
  'type': 'http://n.whatwg.org/work'}]

JSON-LD extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.jsonld import JsonLdExtractor
>>>
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Some Person Page</title>
...  </head>
...  <body>
...   <h1>This guys</h1>
...     <script type="application/ld+json">
...     {
...       "@context": "http://schema.org",
...       "@type": "Person",
...       "name": "John Doe",
...       "jobTitle": "Graduate research assistant",
...       "affiliation": "University of Dreams",
...       "additionalName": "Johnny",
...       "url": "http://www.example.com",
...       "address": {
...         "@type": "PostalAddress",
...         "streetAddress": "1234 Peach Drive",
...         "addressLocality": "Wonderland",
...         "addressRegion": "Georgia"
...       }
...     }
...     </script>
...  </body>
... </html>"""
>>>
>>> jslde = JsonLdExtractor()
>>>
>>> data = jslde.extract(html)
>>> pp.pprint(data)
[{'@context': 'http://schema.org',
  '@type': 'Person',
  'additionalName': 'Johnny',
  'address': {'@type': 'PostalAddress',
              'addressLocality': 'Wonderland',
              'addressRegion': 'Georgia',
              'streetAddress': '1234 Peach Drive'},
  'affiliation': 'University of Dreams',
  'jobTitle': 'Graduate research assistant',
  'name': 'John Doe',
  'url': 'http://www.example.com'}]

RDFa extraction (experimental)

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available
INFO:rdflib:RDFLib Version: 4.2.1
/home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
  'parsers will not be available.')
>>>
>>> html = """<html>
...  <head>
...    ...
...  </head>
...  <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
...    <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
...       <h2 property="dc:title">The trouble with Bob</h2>
...       ...
...       <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
...       <div property="schema:articleBody">
...         <p>The trouble with Bob is that he takes much better photos than I do:</p>
...       </div>
...      ...
...    </div>
...  </body>
... </html>
... """
>>>
>>> rdfae = RDFaExtractor()
>>> pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))
[{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
  '@type': ['http://schema.org/BlogPosting'],
  'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
  'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
  'http://schema.org/articleBody': [{'@value': '\n'
                                               '        The trouble with Bob '
                                               'is that he takes much better '
                                               'photos than I do:\n'
                                               '      '}],
  'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]

You'll get a list of expanded JSON-LD nodes.

Open Graph extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.opengraph import OpenGraphExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
...  <head>
...   <title>Himanshu's Open Graph Protocol</title>
...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
...   <meta http-equiv="Content-Language" content="en-us" />
...   <link rel="stylesheet" type="text/css" href="event-education.css" />
...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
...   <meta property="og:type" content="article"/>
...   <meta property="og:url" content="https://www.eventeducation.com/test.php"/>
...   <meta property="og:image" content="https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"/>
...   <meta property="fb:admins" content="himanshu160"/>
...   <meta property="og:site_name" content="Event Education"/>
...   <meta property="og:description" content="Event Education provides free courses on event planning and management to event professionals worldwide."/>
...  </head>
...  <body>
...   <div id="fb-root"></div>
...   <script>(function(d, s, id) {
...               var js, fjs = d.getElementsByTagName(s)[0];
...               if (d.getElementById(id)) return;
...                  js = d.createElement(s); js.id = id;
...                  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=501839739845103";
...                  fjs.parentNode.insertBefore(js, fjs);
...                  }(document, 'script', 'facebook-jssdk'));</script>
...  </body>
... </html>"""
>>>
>>> opengraphe = OpenGraphExtractor()
>>> pp.pprint(opengraphe.extract(html))
[{"namespace": {
      "og": "http://ogp.me/ns#"
  },
  "properties": [
      [
          "og:title",
          "Himanshu's Open Graph Protocol"
      ],
      [
          "og:type",
          "article"
      ],
      [
          "og:url",
          "https://www.eventeducation.com/test.php"
      ],
      [
          "og:image",
          "https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"
      ],
      [
          "og:site_name",
          "Event Education"
      ],
      [
          "og:description",
          "Event Education provides free courses on event planning and management to event professionals worldwide."
      ]
    ]
 }]

Microformat extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.microformat import MicroformatExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
...  <head>
...   <title>Himanshu's Open Graph Protocol</title>
...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
...   <meta http-equiv="Content-Language" content="en-us" />
...   <link rel="stylesheet" type="text/css" href="event-education.css" />
...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
...   <article class="h-entry">
...    <h1 class="p-name">Microformats are amazing</h1>
...    <p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
...       on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time></p>
...    <p class="p-summary">In which I extoll the virtues of using microformats.</p>
...    <div class="e-content">
...     <p>Blah blah blah</p>
...    </div>
...   </article>
...  </head>
...  <body></body>
... </html>"""
>>>
>>> microformate = MicroformatExtractor()
>>> data = microformate.extract(html)
>>> pp.pprint(data)
[{"type": [
      "h-entry"
  ],
  "properties": {
      "name": [
          "Microformats are amazing"
      ],
      "author": [
          {
              "type": [
                  "h-card"
              ],
              "properties": {
                  "name": [
                      "W. Developer"
                  ],
                  "url": [
                      "http://example.com"
                  ]
              },
              "value": "W. Developer"
          }
      ],
      "published": [
          "2013-06-13 12:00:00"
      ],
      "summary": [
          "In which I extoll the virtues of using microformats."
      ],
      "content": [
          {
              "html": "\n<p>Blah blah blah</p>\n",
              "value": "\nBlah blah blah\n"
          }
      ]
    }
 }]

DublinCore extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.dublincore import DublinCoreExtractor
>>> html = '''<head profile="http://dublincore.org/documents/dcq-html/">
... <title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
... <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
... <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
...
...
... <meta name="DC.title" lang="en" content="Expressing Dublin Core
... in HTML/XHTML meta and link elements" />
... <meta name="DC.creator" content="Andy Powell, UKOLN, University of Bath" />
... <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF" content="2003-11-01" />
... <meta name="DC.identifier" scheme="DCTERMS.URI"
... content="http://dublincore.org/documents/dcq-html/" />
... <link rel="DCTERMS.replaces" hreflang="en"
... href="http://dublincore.org/documents/2000/08/15/dcq-html/" />
... <meta name="DCTERMS.abstract" content="This document describes how
... qualified Dublin Core metadata can be encoded
... in HTML/XHTML &lt;meta&gt; elements" />
... <meta name="DC.format" scheme="DCTERMS.IMT" content="text/html" />
... <meta name="DC.type" scheme="DCTERMS.DCMIType" content="Text" />
... <meta name="DC.Date.modified" content="2001-07-18" />
... <meta name="DCTERMS.modified" content="2001-07-18" />'''
>>> dublinlde = DublinCoreExtractor()
>>> data = dublinlde.extract(html)
>>> pp.pprint(data)
[ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',
                    'content': 'Expressing Dublin Core\n'
                               'in HTML/XHTML meta and link elements',
                    'lang': 'en',
                    'name': 'DC.title'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/creator',
                    'content': 'Andy Powell, UKOLN, University of Bath',
                    'name': 'DC.creator'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/identifier',
                    'content': 'http://dublincore.org/documents/dcq-html/',
                    'name': 'DC.identifier',
                    'scheme': 'DCTERMS.URI'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/format',
                    'content': 'text/html',
                    'name': 'DC.format',
                    'scheme': 'DCTERMS.IMT'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/type',
                    'content': 'Text',
                    'name': 'DC.type',
                    'scheme': 'DCTERMS.DCMIType'}],
    'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',
                    'DCTERMS': 'http://purl.org/dc/terms/'},
    'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',
                 'content': '2003-11-01',
                 'name': 'DCTERMS.issued',
                 'scheme': 'DCTERMS.W3CDTF'},
               { 'URI': 'http://purl.org/dc/terms/abstract',
                 'content': 'This document describes how\n'
                            'qualified Dublin Core metadata can be encoded\n'
                            'in HTML/XHTML <meta> elements',
                 'name': 'DCTERMS.abstract'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DC.Date.modified'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DCTERMS.modified'},
               { 'URI': 'http://purl.org/dc/terms/replaces',
                 'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',
                 'hreflang': 'en',
                 'rel': 'DCTERMS.replaces'}]}]

Command Line Tool

extruct provides a command line tool that allows you to fetch a page and extract the metadata from it directly from the command line.

Dependencies

The command line tool depends on requests, which is not installed by default when you install extruct. In order to use the command line tool, you can install extruct with the cli extra requirements:

pip install extruct[cli]

Usage

extruct "http://example.com"

Downloads "http://example.com" and outputs the Microdata, JSON-LD and RDFa, Open Graph and Microformat metadata to stdout.

Supported Parameters

By default, the command line tool will try to extract all the supported metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph and Microformat). If you want to restrict the output to just one or a subset of those, you can pass their individual names collected in a list through 'syntaxes' argument.

For example, this command extracts only Microdata and JSON-LD metadata from "http://example.com":

extruct "http://example.com" --syntaxes microdata json-ld

NB syntaxes names passed must correspond to these: microdata, json-ld, rdfa, opengraph, microformat

Development version

mkvirtualenv extruct
pip install -r requirements-dev.txt

Tests

Run tests in current environment:

py.test tests

Use tox to run tests with different Python versions:

tox
Owner
Scrapinghub
Turn web content into useful data
Scrapinghub
哔哩哔哩爬取器:以个人为中心

Open Bilibili Crawer 哔哩哔哩是一个信息非常丰富的社交平台,我们基于此构造社交网络。在该网络中,节点包括用户(up主),以及视频、专栏等创作产物;关系包括:用户之间,包括关注关系(following/follower),回复关系(评论区),转发关系(对视频or动态转发);用户对创

Boshen Shi 3 Oct 21, 2021
Scraping followers of an instagram account

ScrapInsta A script to scraping data from Instagram Install First of all you can run: pip install scrapinsta After that you need to install these requ

Matheus Kolln 1 Sep 05, 2021
Scrapes the Sun Life of Canada Philippines web site for historical prices of their investment funds and then saves them as CSV files.

slocpi-scraper Sun Life of Canada Philippines Inc Investment Funds Scraper Install dependencies pip install -r requirements.txt Usage General format:

Daryl Yu 2 Jan 07, 2022
Scraping and visualising India's real-time COVID-19 data from the MOHFW dataset.

COVID19-WEB-SCRAPER Open Source Tech Lab - Project [SEMESTER IV] OSTL Assignments OSTL Assignments - 1 OSTL Assignments - 2 Project COVID19 India Data

AMEY THAKUR 8 Apr 28, 2022
This was supposed to be a web scraping project, but somehow I've turned it into a spamming project

Introduction This was supposed to be a web scraping project, but somehow I've turned it into a spamming project.

Boss Perry (Pez) 1 Jan 23, 2022
This scrapper scrapes the mail ids of faculty members from a given linl/page and stores it in a csv file

This scrapper scrapes the mail ids of faculty members from a given linl/page and stores it in a csv file

Devansh Singh 1 Feb 10, 2022
An arxiv spider

An Arxiv Spider 做为一个cser,杰出男孩深知内核对连接到计算机上的硬件设备进行管理的高效方式是中断而不是轮询。每当小伙伴发来一篇刚挂在arxiv上的”热乎“好文章时,杰出男孩都会感叹道:”师兄这是每天都挂在arxiv上呀,跑的好快~“。于是杰出男孩找了找 github,借鉴了一下其

Jie Liu 11 Sep 09, 2022
Scrapes mcc-mnc.com and outputs 3 files with the data (JSON, CSV & XLSX)

mcc-mnc.com-webscraper Scrapes mcc-mnc.com and outputs 3 files with the data (JSON, CSV & XLSX) A Python script for web scraping mcc-mnc.com Link: mcc

Anton Ivarsson 1 Nov 07, 2021
A high-level distributed crawling framework.

Cola: high-level distributed crawling framework Overview Cola is a high-level distributed crawling framework, used to crawl pages and extract structur

Xuye (Chris) Qin 1.5k Jan 04, 2023
Web3 Pancakeswap Sniper bot written in python3

Pancakeswap_BSC_Sniper_Bot Web3 Pancakeswap Sniper bot written in python3, Please note the license conditions! The first Binance Smart Chain sniper bo

Treading-Tigers 295 Dec 31, 2022
An Web Scraping API for MDL(My Drama List) for Python.

PyMDL An API for MyDramaList(MDL) based on webscraping for python. Description An API for MDL to make your life easier in retriving and working on dat

6 Dec 10, 2022
Script for scrape user data like "id,username,fullname,followers,tweets .. etc" by Twitter's search engine .

TwitterScraper Script for scrape user data like "id,username,fullname,followers,tweets .. etc" by Twitter's search engine . Screenshot Data Users Only

Remax Alghamdi 19 Nov 17, 2022
a high-performance, lightweight and human friendly serving engine for scrapy

a high-performance, lightweight and human friendly serving engine for scrapy

Speakol Ads 30 Mar 01, 2022
High available distributed ip proxy pool, powerd by Scrapy and Redis

高可用IP代理池 README | 中文文档 本项目所采集的IP资源都来自互联网,愿景是为大型爬虫项目提供一个高可用低延迟的高匿IP代理池。 项目亮点 代理来源丰富 代理抓取提取精准 代理校验严格合理 监控完备,鲁棒性强 架构灵活,便于扩展 各个组件分布式部署 快速开始 注意,代码请在release

SpiderClub 5.2k Jan 03, 2023
This app will let you continuously scrape certain parts of LeasePlan and extract data of cars becoming available for lease.

LeasePlan - Scraper This app will let you continuously scrape certain parts of LeasePlan and extract data of cars becoming available for lease. It has

Rodney 4 Nov 18, 2022
Find thumbnails and original images from URL or HTML file.

Haul Find thumbnails and original images from URL or HTML file. Demo Hauler on Heroku Installation on Ubuntu $ sudo apt-get install build-essential py

Vinta Chen 150 Oct 15, 2022
This project was created using Python technology and flask tools to scrape a music site

python-scrapping This project was created using Python technology and flask tools to scrape a music site You need to install the following packages to

hosein moradi 1 Dec 07, 2021
一个m3u8视频流下载脚本

一个Python的m3u8流视频下载脚本 介绍 m3u8流视频日益常见,目前好用的下载器也有很多,我把之前自己写的一个小脚本分享出来,供广大网友使用。写此程序的目的在于给视频下载爱好者提供一个下载样例,可直接调用,勿再重复造轮子。 使用方法 在python中直接运行程序或进行外部调用 import

Nchu 0 Oct 10, 2021
原神爬虫 抓取原神界面圣遗物信息

原神圣遗物半自动爬虫 说明 直接抓取原神界面中的圣遗物数据 目前只适配了背包页面的抓取 准确率:97.5%(普通通用接口,对 40 件随机圣遗物识别,统计完全正确的数量为 39) 准确率:100%(4k 屏幕,普通通用接口,对 110 件圣遗物识别,统计完全正确的数量为 110) 不排除还有小错误的

hwa 28 Oct 10, 2022
a way to scrape a database of all of the isef projects

ISEF Database This is a simple web scraper which gets all of the projects and abstract information from here. My goal for this is for someone to get i

William Kaiser 1 Mar 18, 2022