Code and data for "TURL: Table Understanding through Representation Learning"

Related tags

Deep LearningTURL
Overview

TURL

This Repo contains code and data for "TURL: Table Understanding through Representation Learning".

overview_0

Environment and Setup

The model is mainly developped using PyTorch and Transformers. You can access the docker image we used here docker pull xdeng/transformers:latest

Data

Link for processed pretraining and evaluation data, as well as the model checkpoints can be accessed here. This is created based on the original WikiTables corpus (http://websail-fe.cs.northwestern.edu/TabEL/)

TODO: Instruction for preparing code from original WikiTable Corpus

Pretraining

Data

The [split]_tables.jsonl files are used for pretraining and creation of all test datasets, with 570171 / 5036 / 4964 tables for training/validation/testing.

'_id': '27289759-6', # table id
'pgTitle': '2010 Santos FC season', # page title
'sectionTitle': 'Out', # section title
'tableCaption': '', # table caption
'pgId': 27289759, # wikipedia page id
'tableId': 6, # index of the table in the wikipedia page
'tableData': [[{'text': 'DF', # cell value
    'surfaceLinks': [{'surface': 'DF',
      'locType': 'MAIN_TABLE',
      'target': {'id': 649702,
       'language': 'en',
       'title': 'Defender_(association_football)'},
      'linkType': 'INTERNAL'}] # urls in the cell
      } # one for each cell,...]
      ...]
'tableHeaders': [['Pos.', 'Name', 'Moving to', 'Type', 'Source']], # row headers
'processed_tableHeaders': ['pos.', 'name', 'moving to', 'type', 'source'], # processed headers that will be used
'merged_row': [], # merged rows, we identify them by comparing the cell values
'entityCell': [[1, 1, 1, 0, 0],...], # whether the cell is an entity cell, get by checking the urls inside
'entityColumn': [0, 1, 2], # whether the column is an entity column
'column_type': [0, 0, 0, 4, 2], # more finegrained column type for debug, here we only use 0: entity columns
'unique': [0.16, 1.0, 0.75, 0, 0], # the ratio of unique entities in that column
'entity_count': 72, # total number of entities in the table
'subject_column': 1 # the column index of the subject column

Each line represents a Wikipedia table. Table content is stored in the field tableData, where the target is the actual entity links to the cell, and is also the entity to retrieve. The id and title are the Wikipedia_id and Wikipedia_title of the entity. entityCell and entityColumn shows the cells and columns that pass our filtering and are identified to contain entity information.

There is also an entity_vocab.txt file contains all the entities we used in all experiments (these are the entities shown in pretraining). Each line contains vocab_id, Wikipedia_id, Wikipedia_title, freebase_mid, count of an entity.

Get representation for a given table To use the pretrained model as a table encoder, use the HybridTableMaskedLM model class. There is a example in evaluate_task.ipynb for cell filling task, which also shows how to get representation for arbitrary table.

Finetuning & Evaluation

To systematically evaluate our pre-trained framework as well as facilitate research, we compile a table understanding benchmark consisting of 6 widely studied tasks covering table interpretation (e.g., entity linking, column type annotation, relation extraction) and table augmentation (e.g., row population, cell filling, schema augmentation).

Please see evaluate_task.ipynb for running evaluation for different tasks.

Entity Linking

We use two datasets for evaluation in entity linking. One is based on our train/dev/test split, the linked entity to each cell is the target for entity linking. For the WikiGS corpus, please find the original release here http://www.cs.toronto.edu/~oktie/webtables/ .

We use entity name, together with entity description and entity type to get KB entity representation for entity linking. There are three variants for the entity linking: 0: name + description + type, 1: name + type, 2: name + description.

Evaluation

Please see EL in evaluate_task.ipynb

Data

Data are stored in [split].table_entity_linking.json

'23235546-1', # table id
'Ivan Lendl career statistics', # page title
'Singles: 19 finals (8 titles, 11 runner-ups)', # section title
'', # caption
['outcome', 'year', ...], # headers
[[[0, 4], 'Björn Borg'], [[9, 2], 'Wimbledon'], ...], # cells, [index, entity mention (cell text)]
[['Björn Borg', 'Swedish tennis player', []], ['Björn Borg', 'Swedish swimmer', ['Swimmer']], ...], # candidate entities, this the merged set for all cells. [entity name, entity description, entity types]
[0, 12, ...] # labels, this is the index of the gold entity in the candidate entities
[[0, 1, ...], [11, 12, 13, ...], ...] # candidates for each cell

Column Type Annotation

We divide the information available in the table for column type annotation as: entity mention, table metadata and entity embedding. We experiment under 6 settings: 0: all information, 1: only entity related, 2: only table metadata, 3: no entity embedding, 4: only entity mention, 5: only entity embedding.

Data

Data are stored in [split].table_col_type.json. There is a type_vocab.txt store the target types.

'27295818-29', # table id
 '2010–11 rangers f.c. season', # page title
 27295818, # Wikipedia page id
 'overall', # section title
 '', # caption
 ['competition', 'started round', 'final position / round'], # headers
 [[[[0, 0], [26980923, 'Scottish Premier League']],
   [[1, 0], [18255941, 'UEFA Champions League']],
   ...],
  ...,
  [[[1, 2], [18255941, 'Group stage']],
   [[2, 2], [20795986, 'Round of 16']],
   ...]], # cells, [index, [entity id, entity mention (cell text)]]
 [['time.event'], ..., ['time.event']] # column type annotations, a column may have multiple types.

Relation Extraction

There is a relation_vocab.txt store the target relations. In the [split].table_rel_extraction.json file, each example contains table_id, pgTitle, pgId, secTitle, caption, valid_headers, entities, relations similar to column type classification. Note here the relation is between the subject column (leftmost) and each of the object columns (the rest). We do this to avoid checking all column pairs in the table.

Row Population

For row population, the task is to predict the entities linked to the entity cells in the leftmost entity column. A small amount of tables is further filtered out from test_tables.jsonl which results in the final 4132 tables for testing.

Cell Filling

Please see Pretrained and CF in evaluate_task.ipynb. You can directly load the checkpoint under pretrained, as we do not finetune the model for cell filling.

We have three baselines for cell filling: Exact, H2H, H2V. The header vectors and co-occurrence statistics are pre-computed, please see baselines/cell_filling/cell_filling.py for details.

Schema Augmentation

TODO: Refactoring the evaluation scripts and add instruction.

Acknowledgement

We use the WikiTable corpus for developing the dataset for pretraining and most of the evaluation. We also adopt the WikiGS for evaluation of entity linking.

We use multiple existing systems as baseline for evaluation. We took the code released by the author and made minor changes to fit our setting, please refer to the paper for more details.

Owner
SunLab-OSU
SunLab-OSU
Interpretation of T cell states using reference single-cell atlases

Interpretation of T cell states using reference single-cell atlases ProjecTILs is a computational method to project scRNA-seq data into reference sing

Cancer Systems Immunology Lab 139 Jan 03, 2023
On the Limits of Pseudo Ground Truth in Visual Camera Re-Localization

On the Limits of Pseudo Ground Truth in Visual Camera Re-Localization This repository contains the evaluation code and alternative pseudo ground truth

Torsten Sattler 36 Dec 22, 2022
DziriBERT: a Pre-trained Language Model for the Algerian Dialect

DziriBERT DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect. It handles Algerian

117 Jan 07, 2023
VGG16 model-based classification project about brain tumor detection.

Brain-Tumor-Classification-with-MRI VGG16 model-based classification project about brain tumor detection. First, you can check what people are doing o

Atakan Erdoğan 2 Mar 21, 2022
Codes to calculate solar-sensor zenith and azimuth angles directly from hyperspectral images collected by UAV. Works only for UAVs that have high resolution GNSS/IMU unit.

UAV Solar-Sensor Angle Calculation Table of Contents About The Project Built With Getting Started Prerequisites Installation Datasets Contributing Lic

Sourav Bhadra 1 Jan 15, 2022
A JAX-based research framework for writing differentiable numerical simulators with arbitrary discretizations

jaxdf - JAX-based Discretization Framework Overview | Example | Installation | Documentation ⚠️ This library is still in development. Breaking changes

UCL Biomedical Ultrasound Group 65 Dec 23, 2022
Pacman-AI - AI project designed by UC Berkeley. Designed reflex and minimax agents for the game Pacman.

Pacman AI Jussi Doherty CAP 4601 - Introduction to Artificial Intelligence - Fall 2020 Python version 3.0+ Source of this project This repo contains a

Jussi Doherty 1 Jan 03, 2022
Some tentative models that incorporate label propagation to graph neural networks for graph representation learning in nodes, links or graphs.

Some tentative models that incorporate label propagation to graph neural networks for graph representation learning in nodes, links or graphs.

zshicode 1 Nov 18, 2021
Unofficial implementation (replicates paper results!) of MINER: Multiscale Implicit Neural Representations in pytorch-lightning

MINER_pl Unofficial implementation of MINER: Multiscale Implicit Neural Representations in pytorch-lightning. 📖 Ref readings Laplacian pyramid explan

AI葵 51 Nov 28, 2022
Official Implementation for Fast Training of Neural Lumigraph Representations using Meta Learning.

Fast Training of Neural Lumigraph Representations using Meta Learning Project Page | Paper | Data Alexander W. Bergman, Petr Kellnhofer, Gordon Wetzst

Alex 39 Oct 08, 2022
Deploying PyTorch Model to Production with FastAPI in CUDA-supported Docker

Deploying PyTorch Model to Production with FastAPI in CUDA-supported Docker A example FastAPI PyTorch Model deploy with nvidia/cuda base docker. Model

Ming 68 Jan 04, 2023
Only a Matter of Style: Age Transformation Using a Style-Based Regression Model

Only a Matter of Style: Age Transformation Using a Style-Based Regression Model The task of age transformation illustrates the change of an individual

444 Dec 30, 2022
RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching

RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching This repository contains the source code for our paper: RAFT-Stereo: Multilevel

Princeton Vision & Learning Lab 328 Jan 09, 2023
A simple version for graphfpn

GraphFPN: Graph Feature Pyramid Network for Object Detection Download graph-FPN-main.zip For training , run: python train.py For test with Graph_fpn

WorldGame 67 Dec 25, 2022
Set of models for classifcation of 3D volumes

Classification models 3D Zoo - Keras and TF.Keras This repository contains 3D variants of popular CNN models for classification like ResNets, DenseNet

69 Dec 28, 2022
Generative Flow Networks

Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation Implementation for our paper, submitted to NeurIPS 2021 (also chec

Emmanuel Bengio 381 Jan 04, 2023
Art Project "Schrödinger's Game of Life"

Repo of the project "Team Creative Quantum AI: Schrödinger's Game of Life" Installation new conda env: conda create --name qcml python=3.8 conda activ

ℍ◮ℕℕ◭ℍ ℝ∈ᛔ∈ℝ 2 Sep 15, 2022
PyTorch implementation of "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context" (INTERSPEECH 2020)

ContextNet ContextNet has CNN-RNN-transducer architecture and features a fully convolutional encoder that incorporates global context information into

Sangchun Ha 24 Nov 24, 2022
Myia prototyping

Myia Myia is a new differentiable programming language. It aims to support large scale high performance computations (e.g. linear algebra) and their g

Mila 456 Nov 07, 2022
Dynamic View Synthesis from Dynamic Monocular Video

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer This repository contains code to compute depth from a

Intelligent Systems Lab Org 2.3k Jan 01, 2023