These scripts look for non-printable unicode characters in all text files in a source tree

Overview

find-unicode-control

These scripts look for non-printable unicode characters in all text files in a source tree. find_unicode_control.py should work with python2 as well as python3. It uses python-magic if available to determine file type, or simply spawns the file --mime-type command. They should be functionally the same and find_unicode_control.py could eventually get disposed.

usage: find_unicode_control.py [-h] [-p {all,bidi}] [-v] [-c CONFIG] path [path ...]

Look for Unicode control characters

positional arguments:
  path                  Sources to analyze

optional arguments:
  -h, --help            show this help message and exit
  -p {all,bidi}, --nonprint {all,bidi}
                        Look for either all non-printable unicode characters or bidirectional control characters.
  -v, --verbose         Verbose mode.
  -d, --detailed        Print line numbers where characters occur.
  -t, --notests         Exclude tests (basically test.* as a component of path).
  -c CONFIG, --config CONFIG
                        Configuration file to read settings from.

If unicode BIDI control characters or non-printable characters are found in a file, it will print output as follows:

$ python3 find_unicode_control.py -p bidi *.c
commenting-out.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}
early-return.c: bidirectional control characters: {'\u2067'}
stretched-string.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}

Using the -d flag, the output is more detailed, showing line numbers in files, but this mode is also slower:

find_unicode_control.py -p bidi -d .
./commenting-out.c:4 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']
./commenting-out.c:6 bidirectional control characters: ['\u202e', '\u2066']
./early-return.c:4 bidirectional control characters: ['\u2067']
./stretched-string.c:6 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']

The optimal workflow would be to do a quick scan through a source tree and if any issues are found, do a detailed scan on only those files.

Configuration file

If files need to be excluded from the scan, make a configuration file and define a scan_exclude variable to a list of regular expressions that match the files or paths to exclude. Alternatively, add a scan_exclude_mime list with the list of mime types to ignore; this can again be a regular expression. Here is an example configuration that glibc uses:

scan_exclude = [
        # Iconv test data
        r'/iconvdata/testdata/',
        # Test case data
        r'libio/tst-widetext.input$',
        # Test script.  This is to silence the warning:
        # 'utf-8' codec can't decode byte 0xe9 in position 2118: invalid continuation byte
        # since the script tests mixed encoding characters.
        r'localedata/tst-langinfo.sh$']

Notes

This script was quickly hacked together to scan repositories with mostly LTR, unicode content. If you have RTL content (either in comments, literals or even identifiers in code), it will give false warnings that you need to weed out. For now you need to exclude such RTL code using scan_exclude but a long term wish list (if this remains relevant, hopefully more sophisticated RTL diagnostics will make it obsolete!) is to handle RTL a bit more intelligently.

Owner
Siddhesh Poyarekar
Toolchain hacker and all round nice guy. My openhub profile will probably tell you more about my work: https://www.openhub.net/accounts/siddhesh
Siddhesh Poyarekar
A program to convert celcius to faranheit. made with python

Temp-Converter What is Temp-Converter Temp-Converter is little program made with pyhton to convert celcius to faranheit. Needed A python interpreter P

Chandula Janith 0 Nov 27, 2021
Script for generating Hearthstone card spoilers & checklists

This is a script for generating text spoilers and set checklists for Hearthstone. Installation & Running Python 3.6 or higher is required. Copy/clone

John T. Wodder II 1 Oct 11, 2022
Every 2 minutes, check for visa slots at VFS website

vfs-visa-slot-germany Every 2 minutes, check for visa slots at VFS website. If there are any, send a call and a message of the format: Sent from your

12 Dec 15, 2022
Macro recording and metaprogramming in Python

macro-kit is a package for efficient macro recording and metaprogramming in Python using abstract syntax tree (AST).

8 Aug 31, 2022
This code renames subtitle file names to your video files names, so you don't need to rename them manually.

Rename Subtitle This code renames your subtitle file names to your video file names so you don't need to do it manually Note: It only works for series

Mostafa Kazemi 4 Sep 12, 2021
Go through a random file in your favourite open source projects!

Random Source Codes Never be bored again! Staring at your screen and just scrolling the great world wide web? Would you rather read through some code

Mridul Seth 1 Nov 03, 2022
Dependency injection lib for Python 3.8+

PyDI Dependency injection lib for python How to use To define the classes that should be injected and stored as bean use decorator @component @compone

Nikita Antropov 2 Nov 09, 2021
A set of Python scripts to surpass human limits in accomplishing simple tasks.

Human benchmark fooler Summary A set of Python scripts with Selenium designed to surpass human limits in accomplishing simple tasks available on https

Bohdan Dudchenko 3 Feb 10, 2022
A python script to generate wallpaper

wallpaper eits Warning You need to set the path to Robot Mono font in the source code. (Settings are in the main function) Usage A script that given a

Henrique Tsuyoshi Yara 5 Dec 02, 2021
Michael Vinyard's utilities

Install vintools To download this package from pypi: pip install vintools Install the development package To download and install the developmen

Michael Vinyard 2 May 22, 2022
A collection of resources/tools and analyses for the angr binary analysis framework.

Awesome angr A collection of resources/tools and analyses for the angr binary analysis framework. This page does not only collect links and external r

105 Jan 02, 2023
🦩 A Python tool to create comment-free Jupyter notebooks.

Pelikan Pelikan lets you convert notebooks to comment-free notebooks. In other words, It removes Python block and inline comments from source cells in

Hakan Özler 7 Nov 20, 2021
A simple language and reference decompiler/compiler for MHW THK Files

Leviathon A simple language and reference decompiler/compiler for MHW THK Files. Project Goals The project aims to define a language specification for

11 Jan 07, 2023
Display your calendar on the wallpaper.

wallCal Have your calendar appear as the wallpaper. disclaimer Use at your own risk. Don't blame me if you miss a meeting :-) Some parts of the script

7 Jun 14, 2022
Python Random Number Genrator

This Genrates Random Numbers. This Random Number Generator was made using python. This software uses Time and Random extension. Download the EXE file and run it to get your answer.

Krish Sethi 2 Feb 03, 2022
Python 3 script unpacking statically x86 CryptOne packer.

Python 3 script unpacking statically x86 CryptOne packer. CryptOne versions: 2021/08 until now (2021/12)

5 Feb 23, 2022
Set of scripts for some automation during Magic Lantern development

~kitor Magic Lantern scripts A few automation scripts I wrote to automate some things in my ML development efforts. Used only on Debian running over W

Kajetan Krykwiński 1 Jan 03, 2022
Tool for generating Memory.scan() compatible instruction search patterns

scanpat Tool for generating Frida Memory.scan() compatible instruction search patterns. Powered by r2. Examples $ ./scanpat.py arm.ks:64 'sub sp, sp,

Ole André Vadla Ravnås 13 Sep 19, 2022
Python tool to check a web applications compliance with OWASP HTTP response headers best practices

Check Your Head A quick and easy way to check a web applications response headers!

Zak 6 Nov 09, 2021
Simple tool for creating changelogs

Description Simple utility for quickly generating changelogs, assuming your commits are ordered as they should be. This tool will simply log all lates

2 Jan 05, 2022