A Python module and command line utility for working with web archive data using the WACZ format specification

Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.

Install

Use pip to install the module and a command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz 
   

   

The resulting myfile.wacz should be loadable via ReplayWeb.page.

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the wacz being created

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates pages.jsonl page index with a full-text index, must be run in conjunction with --detect-pages. Will have no effect if run alone

wacz create tests/fixtures/example-collection.warc -t

--detect-pages

Generates pages.jsonl page index without a full-text index

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Overrides the pages index generation with the passed jsonl pages.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

You can add a full text index by including the --text tag

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

--ts

Overrides the ts metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the titles metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Allows the user to specify the hash type used: (sha256 or md5):

wacz create tests/fixtures/example-collection.warc --hash-type md5

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f tests/fixtures/example-collection.warc

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests
Owner
Webrecorder
Webrecorder Project
Webrecorder
🌍 Harness the power of whatsmydns from the command-line.

chkdns Harness the power of whatsmydns from the command-line. Installing with pip pip install chkdns Run chkdns --host github.com Alternatively you ca

Craig Gumbley 3 Oct 29, 2022
kitty - the fast, feature-rich, cross-platform, GPU based terminal

kitty - the fast, feature-rich, cross-platform, GPU based terminal

Kovid Goyal 17.3k Jan 04, 2023
Todo list console based application. Todo's save to a seperate file.

Todo list console based application. Todo's save to a seperate file.

1 Dec 24, 2021
Sink is a CLI tool that allows users to synchronize their local folders to their Google Drives. It is similar to the Git CLI and allows fast and reliable syncs with the drive.

Sink is a CLI synchronisation tool that enables a user to synchronise local system files and folders with their Google Drives. It follows a git C

Yash Thakre 16 May 29, 2022
'rl_UK' is an open-source command-line tool in Python for calculating the shortest path between BUS stop sequences in the UK

'rl_UK' is an open-source command-line tool in Python for calculating the shortest path between BUS stop sequences in the UK. As input files, it uses an ATCO-CIF file and 'OS Open Roads' dataset from

Nesh P. 0 Feb 16, 2022
A powerful Minecraft command library.

Mecha A powerful Minecraft command library. from mecha import Mecha

32 Dec 10, 2022
A mini command line tool to spellcheck text files using tadqeek.alsharekh.org

tadqeek_sakhr A mini command line tool to spellcheck text files using tadqeek.alsharekh.org Usage usage: python tadqeek_sakhr.py [-h] -i INPUT [-o OUT

Youssif Shaaban Alsager 5 Dec 11, 2022
An awesome Python wrapper for an awesome Docker CLI!

An awesome Python wrapper for an awesome Docker CLI!

Gabriel de Marmiesse 303 Jan 03, 2023
A python CLI app that converts a mp4 file into a gif with ASCII effect added.

Video2ASCIIgif This CLI app takes in a mp4 format video, converts it to a gif with ASCII effect applied. This also includes full control over: backgro

Sriram R 6 Dec 31, 2021
bsp_tool provides a Command Line Interface for analysing .bsp files

bsp_tool Python library for analysing .bsp files bsp_tool provides a Command Line Interface for analysing .bsp files Current development is focused on

Jared Ketterer 64 Dec 28, 2022
πŸ’₯ Share files easily over your local network from the terminal!

Fileshare πŸ“¨ Share files easily over your local network from the terminal! πŸ“¨ Installation # clone the repo $ git clone https://github.com/dopevog/fil

Dopevog 11 Sep 10, 2021
Professor Wordlist is a free open source command line tool written in python

Professor Wordlist is a free open source command line tool written in python, With the aim of generating custom wordlists with a variety of unique parameters and functions providing many possibilitie

γ‚ͺγƒΌγ‚―O A K Z E H γ‚ͺγƒΌγ‚― 1 Oct 28, 2021
A simple python script to execute a command when a YubiKey is disconnected

YubiKeyExecute A python script to execute a command when a YubiKey / YubiKeys are disconnected. β€β€β€Ž β€Ž How to use: 1. Download the latest release and d

6 Mar 12, 2022
A begginer reverse shell tool python.

A begginer tools for hacking. The theme of this repository is to bring some ready-made open-source tools for anyone new to the world of hacking. This

Dio brando 2 Jan 05, 2022
A dashboard for your Terminal written in the Python 3 language,

termDash is a handy little program, written in the Python 3 language, and is a small little dashboard for your terminal, designed to be a utility to help people, as well as helping new users get used

Rebecca White 2 Dec 03, 2021
A command line utility to export Google Keep notes to markdown.

Keep-Exporter A command line utility to export Google Keep notes to markdown files with metadata stored as a frontmatter header. Supports exporting: S

Nathan Beals 85 Dec 17, 2022
🌈 Generate color palettes based on Neovim colorschemes.

Iris Iris is a Neovim plugin that generates a normalized color palette based on your colorscheme. It is named for the goddess Iris of Greek mythology,

N. G. Scheurich 45 Jul 28, 2022
Python-Stock-Info-CLI: Get stock info through CLI by passing stock ticker.

Python-Stock-Info-CLI Get stock info through CLI by passing stock ticker. Installation Use the following command to install the required modules at on

Ayush Soni 1 Nov 05, 2021
Collection of useful command line utilities and snippets to help you organise your workspace and improve workflow.

Collection of useful command line utilities and snippets to help you organise your workspace and improve workflow.

Dominik Tarnowski 3 Dec 26, 2021
Tarstats - A simple Python commandline application that collects statistics about tarfiles

A simple Python commandline application that collects statistics about tarfiles.

Kristian Koehntopp 13 Feb 20, 2022