BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

Related tags

Text Data & NLPbros
Overview

BROS

Introduction

BROS (BERT Relying On Spatiality) is a pre-trained language model focusing on text and layout for better key information extraction from documents. Given the OCR results of the document image, which are text and bounding box pairs, it can perform various key information extraction tasks, such as extracting an ordered item list from receipts. For more details, please refer to our paper:

BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents
Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park
AAAI 2022 (to appear)

Pre-trained models

name # params Hugging Face - Models
bros-base-uncased < 110M naver-clova-ocr/bros-base-uncased
bros-large-uncased < 340M naver-clova-ocr/bros-large-uncased

Model usage

The example code below is written with reference to LayoutLM.

import torch
from bros import BrosTokenizer, BrosModel


tokenizer = BrosTokenizer.from_pretrained("naver-clova-ocr/bros-base-uncased")
model = BrosModel.from_pretrained("naver-clova-ocr/bros-base-uncased")


width, height = 1280, 720

words = ["to", "the", "moon!"]
quads = [
    [638, 451, 863, 451, 863, 569, 638, 569],
    [877, 453, 1190, 455, 1190, 568, 876, 567],
    [632, 566, 1107, 566, 1107, 691, 632, 691],
]

bbox = []
for word, quad in zip(words, quads):
    n_word_tokens = len(tokenizer.tokenize(word))
    bbox.extend([quad] * n_word_tokens)

cls_quad = [0.0] * 8
sep_quad = [width, height] * 4
bbox = [cls_quad] + bbox + [sep_quad]

encoding = tokenizer(" ".join(words), return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]

bbox = torch.tensor([bbox])
bbox[:, :, [0, 2, 4, 6]] = bbox[:, :, [0, 2, 4, 6]] / width
bbox[:, :, [1, 3, 5, 7]] = bbox[:, :, [1, 3, 5, 7]] / height

outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask)
last_hidden_state = outputs.last_hidden_state

print("- last_hidden_state")
print(last_hidden_state)
print()
print("- last_hidden_state.shape")
print(last_hidden_state.shape)

Result

- last_hidden_state
tensor([[[-0.0342,  0.2487, -0.2819,  ...,  0.1495,  0.0218,  0.0484],
         [ 0.0792, -0.0040, -0.0127,  ..., -0.0918,  0.0810,  0.0419],
         [ 0.0808, -0.0918,  0.0199,  ..., -0.0566,  0.0869, -0.1859],
         [ 0.0862,  0.0901,  0.0473,  ..., -0.1328,  0.0300, -0.1613],
         [-0.2925,  0.2539,  0.1348,  ...,  0.1988, -0.0148, -0.0982],
         [-0.4160,  0.2135, -0.0390,  ...,  0.6908, -0.2985,  0.1847]]],
       grad_fn=
   
    )

- last_hidden_state.shape
torch.Size([1, 6, 768])

   

Fine-tuning examples

Please refer to docs/finetuning_examples.md.

Acknowledgements

We referenced the code of LayoutLM when implementing BROS in the form of Hugging Face - transformers.
In this repository, we used two public benchmark datasets, FUNSD and SROIE.

License

Copyright 2022-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Owner
Clova AI Research
Open source repository of Clova AI Research, NAVER & LINE
Clova AI Research
Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Japanese-LUW-Tokenizer Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫 Basic Usage from transformers import RemBertToken

Koichi Yasuoka 3 Dec 22, 2021
Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Ubiquitous Knowledge Processing Lab 59 Dec 01, 2022
ChainKnowledgeGraph, 产业链知识图谱包括A股上市公司、行业和产品共3类实体

ChainKnowledgeGraph, 产业链知识图谱包括A股上市公司、行业和产品共3类实体,包括上市公司所属行业关系、行业上级关系、产品上游原材料关系、产品下游产品关系、公司主营产品、产品小类共6大类。 上市公司4,654家,行业511个,产品95,559条、上游材料56,824条,上级行业480条,下游产品390条,产品小类52,937条,所属行业3,946条。

liuhuanyong 415 Jan 06, 2023
OCR을 이용하여 인원수를 인식 후 줌을 Kill 해줍니다

How To Use killtheZoom-2.0 Windows 0. https://joyhong.tistory.com/79 이 글을 보면서 tesseract를 C:\Program Files\Tesseract-OCR 경로로 설치해주세요(한국어 언어 추가 필요) 상단의 초

김정인 9 Sep 13, 2021
Input english text, then translate it between languages n times using the Deep Translator Python Library.

mass-translator About Input english text, then translate it between languages n times using the Deep Translator Python Library. How to Use Install dep

2 Mar 04, 2022
Training RNNs as Fast as CNNs

News SRU++, a new SRU variant, is released. [tech report] [blog] The experimental code and SRU++ implementation are available on the dev branch which

Tao Lei 14 Dec 12, 2022
A fast and lightweight python-based CTC beam search decoder for speech recognition.

pyctcdecode A fast and feature-rich CTC beam search decoder for speech recognition written in Python, providing n-gram (kenlm) language model support

Kensho 315 Dec 21, 2022
Module for automatic summarization of text documents and HTML pages.

Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim

Mišo Belica 3k Jan 08, 2023
Anomaly Detection 이상치 탐지 전처리 모듈

Anomaly Detection 시계열 데이터에 대한 이상치 탐지 1. Kernel Density Estimation을 활용한 이상치 탐지 train_data_path와 test_data_path에 존재하는 시점 정보를 포함하고 있는 csv 형태의 train data와

CLUST-consortium 43 Nov 28, 2022
CPC-big and k-means clustering for zero-resource speech processing

The CPC-big model and k-means checkpoints used in Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing.

Benjamin van Niekerk 5 Nov 23, 2022
Accurately generate all possible forms of an English word e.g "election" --> "elect", "electoral", "electorate" etc.

Accurately generate all possible forms of an English word Word forms can accurately generate all possible forms of an English word. It can conjugate v

Dibya Chakravorty 570 Dec 31, 2022
This is a project built for FALLABOUT2021 event under SRMMIC, This project deals with NLP poetry generation.

FALLABOUT-SRMMIC 21 POETRY-GENERATION HINGLISH DESCRIPTION We have developed a NLP(natural language processing) model which automatically generates a

7 Sep 28, 2021
端到端的长本文摘要模型(法研杯2020司法摘要赛道)

端到端的长文本摘要模型(法研杯2020司法摘要赛道)

苏剑林(Jianlin Su) 334 Jan 08, 2023
A Paper List for Speech Translation

Keyword: Speech Translation, Spoken Language Processing, Natural Language Processing

138 Dec 24, 2022
TFPNER: Exploration on the Named Entity Recognition of Token Fused with Part-of-Speech

TFPNER TFPNER: Exploration on the Named Entity Recognition of Token Fused with Part-of-Speech Named entity recognition (NER), which aims at identifyin

1 Feb 07, 2022
A benchmark for evaluation and comparison of various NLP tasks in Persian language.

Persian NLP Benchmark The repository aims to track existing natural language processing models and evaluate their performance on well-known datasets.

Mofid AI 68 Dec 19, 2022
TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (S

InstaDeep Ltd 72 Dec 09, 2022
ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

ThinkTwice ThinkTwice is a retriever-reader architecture for solving long-text machine reading comprehension. It is based on the paper: ThinkTwice: A

Walle 4 Aug 06, 2021
NumPy String-Indexed is a NumPy extension that allows arrays to be indexed using descriptive string labels

NumPy String-Indexed NumPy String-Indexed is a NumPy extension that allows arrays to be indexed using descriptive string labels, rather than conventio

Aitan Grossman 1 Jan 08, 2022