Tools for working with MARC data in Catalogue Bridge.

Overview

catbridge_tools

Tools for working with MARC data in Catalogue Bridge.

Borrows heavily from PyMarc (https://pypi.org/project/pymarc/).

Requirements

Requires the regex module from https://bitbucket.org/mrabarnett/mrab-regex. The built-in re module is not sufficient.

Also requires py2exe.

Installation

From GitHub:

git clone https://github.com/victoriamorris/catbridge_tools
cd catbridge_tools

To install as a Python package:

python setup.py install

To create stand-alone executable (.exe) files for individual scripts:

python setup.py py2exe 

Executable files will be created in the folder \dist, and should be copied to an executable path.

Both of the above commands can be carried out by running the shell script:

compile_catbridge_tools.sh

Scripts

The scripts listed below can be run from anywhere, once the package is installed and the .exe files have been copied to an executable path.

Correspondence with original Catalogue Bridge tools

Original Catalogue Bridge tool New tool Original syntax Corresponding new syntax
cn-find cn-find CN-FIND cn_find -i -o -c
cn-tidy cn-find CN-FIND cn_find -i -o -c --tidy

Features common to all scripts

File formats

Unless otherwise specified, MARC files are in MARC 21 format, with .lex file extensions. Unless otherwise specified, text files are UTF-8-encoded, with .txt, .csv or .tsv file extensions. Config files are also text files, but may have the file extension .cfg for convenience.

Help

For any script, use the option --help to display help text.

Logs and debugging

Logs will be written to catbridge.log within the working directory. This is a UTF-8 encoded text field and can be read in any text editor. The default logging level is INFO; if option --debug is set, the logging level is changed to DEBUG. See https://docs.python.org/3/library/logging.html#levels for information about logging levels.

cn_find

cn_find is a utility which extracts extract control numbers from specified fields and subfields within a file of MARC records.

The fields and subfields to be extracted are specified in a config file.

Usage: cn_find -i 
   
     -o 
    
      -c 
     
       [options]

Options:
    --conv  Convert 10-digit ISBNs to 13-digit form where possible
    --rid   Include record ID as the first column of the output file
    --tidy  Sort and de-duplicate list

    --debug	Debug mode.
    --help	Show help message and exit.

     
    
   

Files

is the name of the input file, which must be a file of MARC 21 records.

is the name of the file to which the control numbers will be written. This should be a text file.

is the name of the file containing the configuration directives.

The config file

The format of the configuration file is as follows, with one entry per line

FIELD TAG $ subfield character [tab] control number specification

Each line must match the regular expression

^([0-9A-Z]{3})\s*\$?\s*([a-z0-9]?)\s*\t(.*?)\s*$

The field tag is specified using three numbers or UPPERCASE letters.

The subfield code are specified using a single number or lowercase letter. If '$' appears without any following subfield characters, all subfields will be searched for control numbers.

The control number specification tells the script what kind of control number to search for within the subfield. This can either take a value from a pre-defined list, or a regular expression can be used to search for control numbers with any other structure. Regular expressions are case-sensitive.

Control number specification Description Regular expression
ISBN Any structurally plausible ISBN* \b(?=(?:[0-9]+[- ]?){10})[0-9]{9}[0-9Xx]\b|\b(?=(?:[0-9]+[- ]?){13})[0-9]{1,5}[- ][0-9]+[- ][0-9]+[- ][0-9Xx]\b|\b97[89][0-9]{10}\b|\b(?=(?:[0-9]+[- ]){4})97[89][- 0-9]{13}[0-9]\b
ISBN10 Any structurally plausible 10-digit ISBN* \b(?=(?:[0-9]+[- ]?){10})[0-9]{9}[0-9Xx]\b|\b(?=(?:[0-9]+[- ]?){13})[0-9]{1,5}[- ][0-9]+[- ][0-9]+[- ][0-9Xx]\b
ISBN13 Any structurally plausible 13-digit ISBN* \b97[89][0-9]{10}\b|\b(?=(?:[0-9]+[- ]){4})97[89][- 0-9]{13}[0-9]\b
ISSN 8 digits with a hyphen in the middle, where the last digit may be an X \b[0-9]{4}[ -]?[0-9]{3}[0-9Xx]\b
BL001 9 digits \b[0-9]{9}\b
BNB See https://www.bl.uk/collection-metadata/metadata-services/structure-of-the-bnb-number \bGB([0-9]{7}|[A-Z][0-9][A-Z0-9][0-9]{4})\b
LCCN See https://www.loc.gov/marc/bibliographic/bd010.html \b[a-z][a-z ][a-z ]?[0-9]{2}[0-9]{6} ?\b
OCLC "(OCoLC)" followed by digits (OCoLC)[0-9]+\b
ISNI 16 digits separated into groups of 4 with spaces or hyphens \b[0]{4}[ -]?[0-9]{4}[ -]?[0-9]{4}[ -]?[0-9]{3}[0-9Xx]\b
FAST "fst" followed by digits \bfst[0-9]{8}\b

*Note: The ISBN check digit is not validated.

Multiple fields and subfields may be specified. Fields may be repeated with different subfields.

Example:

001 BL001
015$a	BNB
020	ISBN
020$z	ISBN10
500$a	\b[a-z]{7}\b
035$a	OCLC

In the example above, field 500 subfield $a is being searched for 7-character words.

Options

--conv

If option --conv is used, 10-digit ISBNs will be converted to 13-digit form whenever possible (i.e. whenever they are valid ISBNs).

--rid

By default, the output file consists of a single column of strings. If option --rid is used, the output file will consist of two columns: the first column will be the record control number from field 001 and the second column will be as per the default output.

--tidy

If option --tidy is used, the list of control numbers in the output file will be sorted and de-duplicated. Any duplicate control numbers will be written to an additional output file named with the prefix "dp-".

Note: option --tidy cannot be used at the same time as option --rid

follow-analyzer helps GitHub users analyze their following and followers relationship

follow-analyzer follow-analyzer helps GitHub users analyze their following and followers relationship by providing a report in html format which conta

Yin-Chiuan Chen 2 May 02, 2022
Full ELT process on GCP environment.

Rent Houses Germany - GCP Pipeline Project: The goal of the project is to extract data about house rentals in Germany, store, process and analyze it u

Felipe Demenech Vasconcelos 2 Jan 20, 2022
Python-based Space Physics Environment Data Analysis Software

pySPEDAS pySPEDAS is an implementation of the SPEDAS framework for Python. The Space Physics Environment Data Analysis Software (SPEDAS) framework is

SPEDAS 98 Dec 22, 2022
CS50 pset9: Using flask API to create a web application to exchange stocks' shares.

C$50 Finance In this guide we want to implement a website via which users can “register”, “login” “buy” and “sell” stocks, like below: Background If y

1 Jan 24, 2022
Active Learning demo using two small datasets

ActiveLearningDemo How to run step one put the dataset folder and use command below to split the dataset to the required structure run utils.py For ea

3 Nov 10, 2021
A set of procedures that can realize covid19 virus detection based on blood.

A set of procedures that can realize covid19 virus detection based on blood.

Nuyoah-xlh 3 Mar 07, 2022
Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data.

PremiershipPlayerAnalysis Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data. No

5 Sep 06, 2021
A data structure that extends pyspark.sql.DataFrame with metadata information.

MetaFrame A data structure that extends pyspark.sql.DataFrame with metadata info

Invent Analytics 8 Feb 15, 2022
Pypeln is a simple yet powerful Python library for creating concurrent data pipelines.

Pypeln Pypeln (pronounced as "pypeline") is a simple yet powerful Python library for creating concurrent data pipelines. Main Features Simple: Pypeln

Cristian Garcia 1.4k Dec 31, 2022
Data analysis and visualisation projects from a range of individual projects and applications

Python-Data-Analysis-and-Visualisation-Projects Data analysis and visualisation projects from a range of individual projects and applications. Python

Tom Ritman-Meer 1 Jan 25, 2022
This program analyzes a DNA sequence and outputs snippets of DNA that are likely to be protein-coding genes.

This program analyzes a DNA sequence and outputs snippets of DNA that are likely to be protein-coding genes.

1 Dec 28, 2021
A set of functions and analysis classes for solvation structure analysis

SolvationAnalysis The macroscopic behavior of a liquid is determined by its microscopic structure. For ionic systems, like batteries and many enzymes,

MDAnalysis 19 Nov 24, 2022
Average time per match by division

HW_02 Unzip matches.rar to access .json files for matches. Get an API key to access their data at: https://developer.riotgames.com/ Average time per m

11 Jan 07, 2022
wikirepo is a Python package that provides a framework to easily source and leverage standardized Wikidata information

Python based Wikidata framework for easy dataframe extraction wikirepo is a Python package that provides a framework to easily source and leverage sta

Andrew Tavis McAllister 35 Jan 04, 2023
Containerized Demo of Apache Spark MLlib on a Data Lakehouse (2022)

Spark-DeltaLake-Demo Reliable, Scalable Machine Learning (2022) This project was completed in an attempt to become better acquainted with the latest b

8 Mar 21, 2022
Data Analysis for First Year Laboratory at Imperial College, London.

Data Analysis for First Year Laboratory at Imperial College, London. For personal reference only, and to reference in lab reports and lab books.

Martin He 0 Aug 29, 2022
First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we want to understand column level lineage and automate impact analysis.

dbt-osmosis First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we wan

Alexander Butler 150 Jan 06, 2023
OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere.

opendrift OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere. Do

OpenDrift 167 Dec 13, 2022
Semi-Automated Data Processing

Perform semi automated exploratory data analysis, feature engineering and feature selection on provided dataset by visualizing every possibilities on each step and assisting the user to make a meanin

Arun Singh Babal 1 Jan 17, 2022