CDLA: A Chinese document layout analysis (CDLA) dataset

Last update: Dec 28, 2022

Introduction

CDLA is a Chinese document layout analysis dataset aimed at Chinese academic literature (papers). It contains the following 10 labels:

正文	标题	图片	图片标题	表格	表格标题	页眉	页脚	注释	公式
Text	Title	Figure	Figure caption	Table	Table caption	Header	Footer	Reference	Equation

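The dataset contains 5,000 training images and 1,000 validation images, stored in the train and val directories respectively. Each image has an annotation file (.json) with the same base name.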
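The directory layout described above (each image paired with a same-named .json file) can be checked with a short script. This is a minimal sketch, not part of the dataset tooling: the function name `pair_images_with_annotations` and the set of accepted image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: images use one of these extensions (the sample annotation refers to a .jpg file).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_images_with_annotations(split_dir):
    """Return (image, json) path pairs and a list of images that lack a .json file."""
    split_dir = Path(split_dir)
    pairs, missing = [], []
    for image_path in sorted(split_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip the .json files themselves and anything else
        json_path = image_path.with_suffix(".json")
        if json_path.exists():
            pairs.append((image_path, json_path))
        else:
            missing.append(image_path)
    return pairs, missing
```

Calling `pair_images_with_annotations("CDLA_dir/train")` on an intact copy should yield 5,000 pairs and an empty `missing` list, given the counts stated above; `CDLA_dir` is the placeholder name used in the conversion commands.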

Download links

Baidu Cloud: https://pan.baidu.com/s/1449mhds2ze5JLk-88yKVAA (extraction code: tp0d)
Google Drive: https://drive.google.com/file/d/14SUsp_TG8OPdK0VthRXBcAbYzIBjSNLm/view?usp=sharing

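Annotation format

We annotated the data with labelme, so the annotation files follow the labelme format. The most important fields are described below.

"shapes": a list of dicts, each dict describing one annotated instance.

"label": the class of the instance.

"points": the outline of the instance. Because the annotations are polygons, points may contain more than 4 coordinate pairs.

"shape_type": "polygon"

"imagePath": the image path or file name.

"imageHeight": the image height in pixels.

"imageWidth": the image width in pixels.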
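The fields listed above can be read with the standard json module. The following is a minimal sketch; the helper names `load_shapes` and `count_labels` are hypothetical and not part of labelme or of this repository.

```python
import json
from collections import Counter


def load_shapes(json_path):
    """Read one annotation file; return its shapes and the (width, height) of the image."""
    with open(json_path, encoding="utf-8") as f:
        annotation = json.load(f)
    size = (annotation["imageWidth"], annotation["imageHeight"])
    return annotation["shapes"], size


def count_labels(shapes):
    """Count how many annotated instances carry each label."""
    return Counter(shape["label"] for shape in shapes)
```

Summing the counters over all files in a split gives a quick view of the class balance across the 10 labels.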

A complete annotation example:

{
  "version":"4.5.6",
  "flags":{},
  "shapes":[
    {
      "label":"Title",
      "points":[
        [
          553.1111111111111,
          166.59259259259258
        ],
        [
          553.1111111111111,
          198.59259259259258
        ],
        [
          686.1111111111111,
          198.59259259259258
        ],
        [
          686.1111111111111,
          166.59259259259258
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    },
    {
      "label":"Text",
      "points":[
        [
          250.5925925925925,
          298.0740740740741
        ],
        [
          250.5925925925925,
          345.0740740740741
        ],
        [
          188.5925925925925,
          345.0740740740741
        ],
        [
          188.5925925925925,
          410.0740740740741
        ],
        [
          188.5925925925925,
          456.0740740740741
        ],
        [
          324.5925925925925,
          456.0740740740741
        ],
        [
          324.5925925925925,
          410.0740740740741
        ],
        [
          1051.5925925925926,
          410.0740740740741
        ],
        [
          1051.5925925925926,
          345.0740740740741
        ],
        [
          1052.5925925925926,
          345.0740740740741
        ],
        [
          1052.5925925925926,
          298.0740740740741
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    },
    {
      "label":"Footer",
      "points":[
        [
          1033.7407407407406,
          1634.5185185185185
        ],
        [
          1033.7407407407406,
          1646.5185185185185
        ],
        [
          1052.7407407407406,
          1646.5185185185185
        ],
        [
          1052.7407407407406,
          1634.5185185185185
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    }
  ],
  "imagePath":"val_0031.jpg",
  "imageData":null,
  "imageHeight":1754,
  "imageWidth":1240
}
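Many detection pipelines expect axis-aligned boxes rather than polygons. As a rough sketch, a polygon from "points" can be reduced to a COCO-style box [x, y, width, height], and its area computed with the shoelace formula. The function names here are hypothetical, and this is not necessarily how labelme2coco.py computes these values.

```python
def polygon_to_bbox(points):
    """Axis-aligned bounding box of a polygon as [x_min, y_min, width, height]."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


def polygon_area(points):
    """Area of a simple (non-self-intersecting) polygon via the shoelace formula."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first vertex.
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0
```

For the "Title" shape in the example above, the box is about 133 pixels wide and 32 pixels tall. The "Text" shape has 11 vertices and is not a rectangle, so its polygon area is smaller than the area of its bounding box.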

Converting to COCO format

Run the following commands:

# train
python3 labelme2coco.py CDLA_dir/train train_save_path  --labels labels.txt

# val
python3 labelme2coco.py CDLA_dir/val val_save_path  --labels labels.txt

The converted results are saved under train_save_path and val_save_path.

labelme2coco.py is taken from labelme; see the official labelme project for more information.
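After conversion, it can be useful to sanity-check the result. The sketch below summarises a COCO-format annotation dict using only the standard top-level keys "images", "annotations", and "categories". The function name `summarize_coco` is hypothetical, and the output file name in the usage comment is an assumption about what the script writes; check the save directory for the actual name.

```python
import json
from collections import Counter


def summarize_coco(coco):
    """Count images, annotations, and annotations per category name."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    per_category = Counter(
        id_to_name[ann["category_id"]] for ann in coco["annotations"]
    )
    return {
        "images": len(coco["images"]),
        "annotations": len(coco["annotations"]),
        "per_category": dict(per_category),
    }


# Usage (the file name is an assumption, adjust it to the actual output):
# with open("train_save_path/annotations.json", encoding="utf-8") as f:
#     print(summarize_coco(json.load(f)))
```

The image count should match the number of images in the corresponding split.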

Owner

buptlihang
