A webmining CLI tool & library for python.

Overview

Build Status DOI download number

Minet

minet is a webmining command line tool & library for python (>= 3.6) that can be used to collect and extract data from a large variety of web sources such as raw webpages, Facebook, CrowdTangle, YouTube, Twitter, Media Cloud etc.

It adopts a very simple approach to various webmining problems by letting you perform a variety of actions from the comfort of the command line. No database needed: raw CSV files should be sufficient to do most of the work.

In addition, minet also exposes its high-level programmatic interface as a python library so you can tweak its behavior at will.

Shortcuts: Command line documentation, Python library documentation.

Summary

What it does

Minet can single-handedly:

  • Extract URLs from a text file (or a table)
  • Parse URLs (get useful information, with Facebook- and Youtube-specific stuff)
  • Join two CSV files by matching the columns containing URLs
  • From a list of URLs, resolve their redirections
    • ...and check their HTTP status
    • ...and download the HTML
    • ...and extract hyperlinks
    • ...and extract the text content and other metadata (title...)
    • ...and scrape structured data (using a declarative language to define your heuristics)
  • Crawl (using a declarative language to define a browsing behavior, and what to harvest)
  • Mine or search:
  • Scrape (without requiring special access):
  • Grab & dump cookies from your browser
  • Dump Hyphe data

Documented use cases

Features (from a technical standpoint)

  • Multithreaded, memory-efficient fetching from the web.
  • Multithreaded, scalable crawling using a comfy DSL.
  • Multiprocessed raw text content extraction from HTML pages.
  • Multiprocessed scraping from HTML pages using a comfy DSL.
  • URL-related heuristics utilities such as extraction, normalization and matching.
  • Data collection from various APIs such as CrowdTangle.

Installation

minet can be installed as a standalone CLI tool (currently only on mac >= 10.14, ubuntu & similar) by running the following command in your terminal:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

Don't trust us enough to pipe the result of a HTTP request into bash? We wouldn't either, so feel free to read the installation script here and run it on your end if you prefer.

On ubuntu & similar you might need to install curl and unzip before running the installation script if you don't already have it:

sudo apt-get install curl unzip

Else, minet can be installed directly as a python CLI tool and library using pip:

pip install minet

If you need more help to install and use minet from scratch, you can check those installation documents.

Finally if you want to install the standalone binaries by yourself (even for windows) you can find them in each release here.

Upgrading

To upgrade the standalone version, simply run the install script once again:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

To upgrade the python version you can use pip thusly:

pip install -U minet

Uninstallation

To uninstall the standalone version:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/uninstall.sh | bash

To uninstall the python version:

pip uninstall minet

Documentation

Contributing

To contribute to minet you can check out this documentation.

How to cite

minet is published on Zenodo as DOI

You can cite it thusly:

Guillaume Plique, Pauline Breteau, Jules Farjas, Héloïse Théro, Jean Descamps, & Amélie Pellé. (2019, October 14). Minet, a webmining CLI tool & library for python. Zenodo. http://doi.org/10.5281/zenodo.4564399

Comments
  • casanova.exceptions.EmptyFileError

    casanova.exceptions.EmptyFileError

    I am trying to run minet in a github action. It fails with the following message:

      minet tw scrape tweets -o tweets.csv "from:@taniki #tutotal2022"
      shell: /usr/bin/bash -e {0}
      env:
        pythonLocation: /opt/hostedtoolcache/Python/3.9.5/x64
        LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.5/x64/lib
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/reader.py", line 151, in __init__
        fieldnames = next(self.reader)
    StopIteration
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.9.5/x64/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/__main__.py", line 218, in main
        fn(cli_args)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/twitter/__init__.py", line 33, in twitter_action
        twitter_scrape_action(cli_args)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/twitter/scrape.py", line 45, in twitter_scrape_action
        enricher = casanova.enricher(
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/enricher.py", line 31, in __init__
        super().__init__(input_file, no_headers=no_headers, **kwargs)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/reader.py", line 157, in __init__
        raise EmptyFileError
    casanova.exceptions.EmptyFileError
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]
    Error: Process completed with exit code 1.
    
    opened by taniki 16
  • Get Retweeters

    Get Retweeters

    Hi, thanks for the last release, I'm glad to see there is a Retweeters tool but I went through some issues with it... for a few days.. I may not understood how it should implemented ? I run it and I get this error : image May someone who manage with it help me ?

    Thank you

    opened by jlbreeeez 15
  • Twitter API scraper: acquire guest_token by API

    Twitter API scraper: acquire guest_token by API

    new method to acquire the guest_token through activate API relates #384 #382

    Method taken from @JustAnotherArchivist in snscrape see: https://github.com/JustAnotherArchivist/snscrape/commit/0336ce13edbd195b3e91487061a0e7a2857f0c68 Thanks for sharing the solution.

    For now this edit is simply a new method to acquire the token. The token is used as a cookie as before but it's not preserved on disk in case of multiple calls.

    opened by paulgirard 11
  • tw scrape fails on some queries due to Over capacity error

    tw scrape fails on some queries due to Over capacity error

    minet tw scrape tweets '#5gcovid' > tweets.csv

    <class 'minet.twitter.exceptions.TwitterPublicAPIInvalidResponseError'>

    {'errors': [{'message': 'Over capacity', 'code': 130}]} 503

    bug 
    opened by Yomguithereal 10
  • [retweeters] KeyError: 'url'

    [retweeters] KeyError: 'url'

    Hi, when I try to retrieve the retweeters list from a file containing tweets previously extracted from Twitter using minet scrapper, I get this error after scanning a few tweets from my list (after 7, 10, or 30 tweets scanned... it depend of the database...). Does anyone encountered this error before ? Thanks for helping :-) image

    opened by tloops329384 8
  • impossible d'extraire totalité des tweets d'une requête

    impossible d'extraire totalité des tweets d'une requête

    Lorsque je lance une requête, avec comme critère un mot clé + un utilisateur, le résultat est très aléatoire : une fois 0 tweet, une fois 1 tweet, une fois 20 tweets, une fois 80 tweets etc sans jamais arriver à une extraction totale (qui est d'environ seulement 200 tweets pourtant). J'ai relancé cette requête de nombreuses fois, sans jamais extraire l'ensemble des tweets en question.

    Que dois-je faire pour y parvenir ? Merci

    opened by parisGH 8
  • [twitter] unable to get user tweets

    [twitter] unable to get user tweets

    Hello,

    Thanks for sharing the lib with the community. I am not able to get user tweets , I got the error:

    Traceback (most recent call last):
      File "/home/bafou/.local/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/__main__.py", line 198, in main
        to_close = resolve_arg_dependencies(cli_args, config)
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/argparse.py", line 290, in resolve_arg_dependencies
        setattr(cli_args, name, value.resolve(config))
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/argparse.py", line 253, in resolve
        return getpath(config, self.key, self.default)
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/ebbe/utils.py", line 72, in getpath
        target = target[step]
    TypeError: string indices must be integers
    

    when executingminet tw user-tweets screen_name users.csv > tweets.csv with users.csv

    Regards.

    bug 
    opened by billmetangmo 6
  • GH actions + Minet Scrap Twitter fail.

    GH actions + Minet Scrap Twitter fail.

    hi,

    i have this GH action to generate a twitter scrap csv (written by @taniki) :

    name: scrape bfm
    
    on:
      workflow_dispatch:
      schedule:
        - cron:  '0 9 * * *'
    
    jobs:
      scrape_bfm:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/[email protected]
          - uses: actions/[email protected]
            with:
              python-version: '3.x'
          - name: install minet
            run: |
              python -m pip install --upgrade pip
              pip install minet==0.56.2
          - name: scrape @BFMTV tweets
            shell: bash
            run: |
              minet tw scrape tweets "from:@BFMTV since:2021-09-01" > bfmtv-tweets.csv
          - name: commit
            uses: ./.github/actions/commit
            with:
              message: lol @bfmtv
    

    Sometimes, no problem. Sometimes, GH return error log :

    Run minet tw scrape tweets "from:@CNEWS since:2021-09-01" > cnews-tweets.csv
    Collecting tweets: 0 tweets [00:00, ? tweets/s]                            
    Collecting tweets: 0 tweets [00:00, ? tweets/s]                   
    Searching for "from:@CNEWS since:2021-09-01"
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]
    Collecting tweets: 0 tweets [00:00, ? tweets/s, queries=1, tokens=1]Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.10.1/x64/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/__main__.py", line 218, in main
        fn(cli_args)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/twitter/__init__.py", line 31, in twitter_action
        twitter_scrape_action(cli_args)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/twitter/scrape.py", line 69, in twitter_scrape_action
        for tweet, meta in iterator:
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 370, in search
        new_cursor, tweets = retryer(self.request_search, query, cursor, refs=refs)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
        do = self.iter(retry_state=retry_state)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 349, in iter
        return fut.result()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/concurrent/futures/_base.py", line 438, in result
        return self.__get_result()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
        raise self._exception
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 407, in __call__
        result = fn(*args, **kwargs)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 72, in wrapped
        self.acquire_guest_token()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 261, in acquire_guest_token
        raise TwitterGuestTokenError
    minet.twitter.exceptions.TwitterGuestTokenError
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s, queries=1, tokens=1]
    Error: Process completed with exit code 1.
    

    Dont understand. Did anyone have the same problem Twitter ban GH sometimes ?

    Thanks for Minet, super outil !

    opened by stefw 6
  • Access denied

    Access denied

    Forewords : sorry, new on GitHub, and I'm not sure it is the appropriate place to post my question... Is it ?

    Hi, First, thank you for the tool which will help me a lot in my research ! I got a problem, which I think is not that complicated, but when I run Minet in order to get the "friends" of the twitter_users contained in the data_users.csv file, I don't manage to get access to the file : "Permission Denied"... I tried to open the CMD as an Administrator but it didn't solve the problem. Can you help me ?

    Capture

    opened by jlbreeeez 6
  • error in installing pip install mineit

    error in installing pip install mineit

    while installing mineit via pip it does not work. says, "" Collecting mineit Could not install packages due to an EnvironmentError: 404 Client Error: Not Found for url: https://pypi.org/simple/mineit/

    ""

    is this issue already solved?

    opened by moonisali 6
  • Twitter scrape: systematic TwitterGuestTokenError with v0.56.2 or v0.56.1

    Twitter scrape: systematic TwitterGuestTokenError with v0.56.2 or v0.56.1

    As in #382 I experience systematic TwitterGuestTokenError exceptions. Was not the case a few weeks ago. I didn't test other versions than 0.56.1 and 0.56.2.

    Looks like we need to review the twitter scrape heuristic. I will try to have a look later today or tomorrow.

    bug 
    opened by paulgirard 5
  • instagram

    instagram

    • [ ] get comments from a post id: https://www.instagram.com/api/v1/media/POST_ID/comments/?can_support_threading=true&permalink_enabled=false
    • [x] get user info from username: https://i.instagram.com/api/v1/users/web_profile_info/?username=USERNAME
    • [ ] other route for posts associated with hashtag (more info but don't know how to change page): https://www.instagram.com/api/v1/tags/web_info/?tag_name=HASHTAG
    • [ ] get post info from post id: https://www.instagram.com/api/v1/media/POST_ID/info/
    • [ ] get post likers from post id (it seems that we can only have access to a limited number of them): https://www.instagram.com/api/v1/media/POST_ID/likers/

    Need 'cookie' and 'x-ig-app-id'

    enhancement 
    opened by MiguelLaura 0
Releases(0.66.1)
Owner
médialab Sciences Po
SciencesPo's médialab is an interdisciplinary research laboratory gathering engineers, designers & social science researchers.
médialab Sciences Po
A command line utility to export Google Keep notes to markdown.

Keep-Exporter A command line utility to export Google Keep notes to markdown files with metadata stored as a frontmatter header. Supports exporting: S

Nathan Beals 85 Dec 17, 2022
Redial is a simple shell application that manages your SSH sessions on Unix terminal.

redial redial is a simple shell application that manages your SSH sessions on Unix terminal. What's New 0.7 (19.12.2019) Basic support for adding ssh

Bahadır Yağan 186 Oct 28, 2022
Custom function scheduler TUI (text-based user interface) in the console

Custom function scheduler TUI (text-based user interface) in the console

Luke 1 Oct 26, 2022
[WIP]An ani-cli like cli tool for movies and webseries

mov-cli A cli to browse and watch movies. Installation This project is a work in progress. However, you can try it out python git clone https://github

166 Dec 30, 2022
A CLI application that downloads your AC submissions from OJ's like Atcoder,Codeforces,CodeChef and distil it into beautiful Submission HeatMap.

Yoda A CLI that takes away the hassle of managing your submission files on different online-judges by automating the entire process of collecting and organizing your code submissions in one single Di

Nikhar Manchanda 1 Jul 28, 2022
3DigitDev 29 Jan 17, 2022
Python3 command-line tool for the inference of Boolean rules and pathway analysis on omics data

BONITA-Python3 BONITA was originally written in Python 2 and tested with Python 2-compatible packages. This version of the packages ports BONITA to Py

1 Dec 22, 2021
A CLI password generator

passgen - A CLI password generator Usage python3 main.py arguments Arguments Argument Short Description --length -l The length of the password to ge

1 Nov 13, 2021
jenkins-tui is a terminal based user interface for Jenkins.

jenkins-tui 📦 jenkins-tui is a terminal based user interface for Jenkins. 🚧 ⚠️ This app is a prototype and in very early stages of development. Ther

Craig Gumbley 22 Oct 24, 2022
Standalone script written in Python 3 for generating Reverse Shell one liner snippets and handles the communication between target and client using custom Netcat binaries

Standalone script written in Python 3 for generating Reverse Shell one liner snippets and handles the communication between target and client using custom Netcat binaries. It automates the boring stu

Yash Bhardwaj 3 Sep 27, 2022
Shazam is a Command Line Application that checks the integrity of the file by comparing it with a given hash.

SHAZAM - Check the file's integrity Shazam is a Command Line Application that checks the integrity of the file by comparing it with a given hash. Crea

Anaxímeno Brito 1 Aug 21, 2022
Terminal epub reader with inline images

nuber Inspired by epy, nuber is an Epub terminal reader with inline images written with Rust and Python using Überzug. Features Display images in term

Moshe Sherman 73 Oct 12, 2022
Modern line-oriented terminal emulator without support for TUIs.

Modern line-oriented terminal emulator without support for TUIs.

10 Jun 12, 2022
A CLI tool that scans through a directory and organizes all loose files into folders by file type.

Organizer CLI Organizer CLI is a python command line tool that goes through a given directory and organizes all un-folder bound files into folders by

Mulaza Jacinto 6 Dec 14, 2022
CLI Web-CAT interface for people who use VIM.

CLI Web-CAT CLI Web-CAT interface. Installation git clone https://github.com/phuang1024/cliwebcat cd cliwebcat python setup.py bdist_wheel sdist cd di

Patrick 4 Apr 11, 2022
Splitgraph command line client and python library

Splitgraph Overview Splitgraph is a tool for building, versioning and querying reproducible datasets. It's inspired by Docker and Git, so it feels fam

Splitgraph 313 Dec 24, 2022
Doro is a CLI based pomodoro app and countdown timer application built using python.

Doro - CLI based pomodoro app Doro is a CLI based pomodoro app and countdown timer application built using python. Install $ pip install doro Usage Po

Suresh Kumar 14 May 23, 2022
Custom 64 bit shellcode encoder that evades detection and removes some common badchars (\x00\x0a\x0d\x20)

x64-shellcode-encoder Custom 64 bit shellcode encoder that evades detection and removes some common badchars (\x00\x0a\x0d\x20) Usage Using a generato

Cole Houston 2 Jan 26, 2022
This is a CLI utility that allows you to view RedFlagDeals.com on the command line.

RFD Description Motivation Installation Usage View Hot Deals View and Sort Hot Deals Search Advanced View Posts Shell Completion bash zsh Description

Dave G 8 Nov 29, 2022
👻 Ghoul is an easy to use information service, allowing you to get/add information on someone or something directly from your terminal.

👻 Ghoul is an easy to use information service, allowing you to get/add information on someone or something directly from your terminal. It c

Billy 11 Nov 10, 2021