A Python module and command line utility for working with web archive data using the WACZ format specification

Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.

Install

Use pip to install the module and a command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz 
   

   

The resulting myfile.wacz should be loadable via ReplayWeb.page.

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the wacz being created

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates pages.jsonl page index with a full-text index, must be run in conjunction with --detect-pages. Will have no effect if run alone

wacz create tests/fixtures/example-collection.warc -t

--detect-pages

Generates pages.jsonl page index without a full-text index

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Overrides the pages index generation with the passed jsonl pages.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

You can add a full text index by including the --text tag

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

--ts

Overrides the ts metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the titles metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Allows the user to specify the hash type used: (sha256 or md5):

wacz create tests/fixtures/example-collection.warc --hash-type md5

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f tests/fixtures/example-collection.warc

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests
Owner
Webrecorder
Webrecorder Project
Webrecorder
alternative cli util for update-alternatives

altb altb is a cli utility influenced by update-alternatives of ubuntu. Linked paths are added to $HOME/.local/bin according to XDG Base Directory Spe

Elran Shefer 8 Dec 07, 2022
dbt-subdocs is a python CLI you can used to generate a dbt-docs for a subset of your dbt project

dbt-subdocs dbt-subdocs is a python CLI you can used to generate a dbt-docs for a subset of your dbt project 🤔 Description This project is useful if

Jambe 6 Jan 03, 2023
🌈 Generate color palettes based on Neovim colorschemes.

Iris Iris is a Neovim plugin that generates a normalized color palette based on your colorscheme. It is named for the goddess Iris of Greek mythology,

N. G. Scheurich 45 Jul 28, 2022
CmdTube is a Python CLI library for searching, downloading, and watching YouTube tutorials

CmdTube is a Python CLI library for searching, downloading, and watching YouTube tutorials. This library was made with programmers in mind and it's dedicated to every programmer who watches YouTube v

Samuel Ayomide Ogunleke 2 Aug 22, 2022
A useful and easy to use Terminal Timer made with Python.

Terminal SpeedCubeTimer Installation ¡No requirements! Just Download and play Usage Starts timer.py and you will see this. python timer.py Scramble

Achalogy 5 Dec 22, 2022
A simple discord slash command handler for for discord.py.

A simple discord slash command handler for discord.py About ⦿ Installation ⦿ Disclaimer ⦿ Examples ⦿ Documentation ⦿ Discussions Note that master bran

641 Jan 03, 2023
Create animated ASCII-art for the command line almost instantly!

clippy Create and play colored 🟥 🟩 🟦 or colorless ⬛️ ⬜️ animated, or static, ASCII-art in the command line! clippy can help if you are wanting to;

Connor 10 Jun 26, 2022
Python package with library and CLI tool for analyzing SeaFlow data

Seaflowpy A Python package for SeaFlow flow cytometer data. Table of Contents Install Read EVT/OPP/VCT Files Command-line Interface Configuration Inte

<a href=[email protected]"> 3 Nov 03, 2021
Tablicate - Python library for easy table creation and output to terminal

Tablicate Tablicate - Python library for easy table creation and output to terminal Features Column-wise justification alignment (left, right, center)

3 Dec 14, 2022
This CLI give the possibility to do a queries in Star Wars API and returns a JSON in a terminal.

Star Wars CLI (swcli) This CLI give the possibility to do a queries in Star Wars API and returns a JSON in a terminal. Install $ pip install swcli Qu

Pery Lemke 5 Nov 05, 2021
A command line tool to hide and reveal information inside images (works for both PNGs and JPGs)

Imgrerite A command line tool to hide and reveal information inside images (works for both PNGs and JPGs) Dependencies Python 3 Git Most of the Linux

Jigyasu 10 Jul 27, 2022
A simple command line tool written in python to manage a to-do list

A simple command line tool written in python to manage a to-do list Dependencies: python Commands: todolist (-a | --add) [(-p | --priority)] [(-l | --

edwloef 0 Nov 02, 2021
This is a CLI program which can help you generate your own QR Code.

Python-QR-code-generator This is a CLI program which can help you generate your own QR Code. Single.py This will allow you only to input a single mess

1 Dec 24, 2021
Objexplore is an interactive Python object explorer for the terminal.

Objexplore is an interactive Python object explorer for the terminal. Use it while debugging, or exploring a new library, or whatever! 9D1FAC73-B2A5-4

kylepollina 249 Dec 23, 2022
Ralph is a command-line tool to fetch, extract, convert and push your tracking logs from various storage backends to your LRS or any other compatible storage or database backend.

Ralph is a command-line tool to fetch, extract, convert and push your tracking logs (aka learning events) from various storage backends to your

France Université Numérique 18 Jan 05, 2023
Command-line script to upload videos to Youtube using theYoutube APIv3.

Introduction Command-line script to upload videos to Youtube using theYoutube APIv3. It should work on any platform (GNU/Linux, BSD, OS X, Windows, ..

Arnau Sanchez 1.9k Jan 09, 2023
A selfbot made with DPY, doesn't have much commands but there's some useful commands to use.

Phantom Selfbot A selfbot made in DPY, made by Zenith. How to use Add your token in token = 'YOUR-MOMS-TOKEN-HERE' Change the prefix in prefix = If

[Ͼ⁴] Ƶephyr 2 Dec 02, 2021
stonky is a simple command line dashboard for monitoring stocks.

stonky is a simple command line dashboard for monitoring stocks.

Jessy Williams 228 Dec 14, 2022
A CLI tools to get you started on any project in any language

Any Template A faster easier to Quick start any programming project. Installation pip3 install any-template Features No third party dependencies. Tem

Adwaith Rajesh 2 Jan 11, 2022
AthenaCLI is a CLI tool for AWS Athena service that can do auto-completion and syntax highlighting.

Introduction AthenaCLI is a command line interface (CLI) for the Athena service that can do auto-completion and syntax highlighting, and is a proud me

dbcli 192 Jan 07, 2023