Fluency ENhanced Sentence-bert Evaluation (FENSE), a metric for audio caption evaluation, and the benchmark datasets AudioCaps-Eval and Clotho-Eval.

Last update: Dec 23, 2022

Overview

FENSE

Fluency ENhanced Sentence-bert Evaluation (FENSE) is a metric for audio caption evaluation, proposed in the paper "Can Audio Captions Be Evaluated with Image Caption Metrics?"

The main branch contains an easy-to-use interface for fast evaluation of an audio captioning system.

An online demo is available at https://share.streamlit.io/blmoistawinde/fense/main/streamlit_demo/app.py .

To get the datasets (AudioCaps-Eval and Clotho-Eval) and the code to reproduce the results, please refer to the experiment-code branch.

Installation

Clone the repository and pip install it.

git clone https://github.com/blmoistawinde/fense.git
cd fense
pip install -e .

Usage

Single Sentence

To get the detailed scores of each component for a single sentence:

from fense.evaluator import Evaluator

print("----Using tiny models----")
evaluator = Evaluator(device='cpu', sbert_model='paraphrase-MiniLM-L6-v2', echecker_model='echecker_clotho_audiocaps_tiny')

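# Candidate caption to evaluate (note the disfluent, incomplete ending)
eval_cap = "An engine in idling and a man is speaking and then"
# Reference caption to compare against
ref_cap = "A machine makes stitching sounds while people are talking in the background"

# With return_error_prob=True, three values are returned:
# the SBERT similarity, the error probability, and the penalized score
score, error_prob, penalized_score = evaluator.sentence_score(eval_cap, [ref_cap], return_error_prob=True)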

print("Cand:", eval_cap)
print("Ref:", ref_cap)
print(f"SBERT sim: {score:.4f}, Error Prob: {error_prob:.4f}, Penalized score: {penalized_score:.4f}")
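The relationship between the three printed values can be sketched as follows. This is a minimal illustration, assuming the penalized score scales the SBERT similarity down by a fixed factor whenever the error probability exceeds a threshold. The function name `penalize` and the default values of `threshold` and `penalty` are assumptions made for this sketch, not values read from the library code.

```python
def penalize(sbert_sim: float, error_prob: float,
             threshold: float = 0.9, penalty: float = 0.9) -> float:
    """Illustrative fluency penalty (assumed form, not the library's code).

    If the error detector is confident that the caption contains a fluency
    error, the similarity is multiplied by (1 - penalty); otherwise it is
    returned unchanged.
    """
    if error_prob > threshold:
        return sbert_sim * (1.0 - penalty)
    return sbert_sim


# A caption flagged as very likely erroneous is scaled down sharply.
print(penalize(0.5, 0.99))
# A caption judged fluent keeps its similarity score.
print(penalize(0.5, 0.10))
```

Under these assumptions, a disfluent candidate such as the one above would keep only a small fraction of its SBERT similarity, while a fluent one would be unaffected.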

System Score

To get a system's overall score on a dataset by averaging sentence-level FENSE, use eval_system.py, with your system outputs prepared in the same format as test_data/audiocaps_cands.csv or test_data/clotho_cands.csv .

For the AudioCaps test set:

python eval_system.py --device cuda --dataset audiocaps --cands_dir ./test_data/audiocaps_cands.csv

For the Clotho Eval set:

python eval_system.py --device cuda --dataset clotho --cands_dir ./test_data/clotho_cands.csv
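The averaging step behind a system score can be sketched in a few lines. This is not the code of eval_system.py; `system_score`, `toy_overlap`, and the dictionary layout keyed by clip ID are hypothetical names and structures chosen for illustration, and the word-overlap scorer is only a stand-in for a real sentence-level metric.

```python
from statistics import mean
from typing import Callable, Dict, List


def system_score(cands: Dict[str, str],
                 refs: Dict[str, List[str]],
                 score_fn: Callable[[str, List[str]], float]) -> float:
    """Average a sentence-level score over every candidate caption."""
    return mean(score_fn(cands[key], refs[key]) for key in cands)


def toy_overlap(cand: str, ref_caps: List[str]) -> float:
    """Stand-in scorer: best Jaccard word overlap with any reference."""
    cand_words = set(cand.lower().split())
    best = 0.0
    for ref in ref_caps:
        ref_words = set(ref.lower().split())
        union = cand_words | ref_words
        if union:
            best = max(best, len(cand_words & ref_words) / len(union))
    return best


# Hypothetical clip IDs mapped to one candidate and its references.
cands = {"clip1": "a dog barks", "clip2": "rain falls"}
refs = {"clip1": ["a dog barks"], "clip2": ["wind blows"]}
print(system_score(cands, refs, toy_overlap))
```

In practice, `score_fn` would wrap a call to the evaluator's `sentence_score` from the single-sentence example rather than the toy scorer used here.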

Performance Benchmark

We benchmark FENSE with different choices of SBERT model and error detector (echecker) on the two benchmark datasets, AudioCaps-Eval and Clotho-Eval. (*) marks the combination reported in the paper.

AudioCaps-Eval

SBERT	echecker	HC	HI	HM	MM	total
paraphrase-MiniLM-L6-v2	none	62.1	98.8	93.7	75.4	80.4
paraphrase-MiniLM-L6-v2	tiny	57.6	94.7	89.5	82.6	82.3
paraphrase-MiniLM-L6-v2	base	62.6	98	82.5	85.4	85.5
paraphrase-TinyBERT-L6-v2	none	64	99.2	92.5	73.6	79.6
paraphrase-TinyBERT-L6-v2	tiny	58.6	95.1	88.3	82.2	82.1
paraphrase-TinyBERT-L6-v2	base	64.5	98.4	91.6	84.6	85.3(*)
paraphrase-mpnet-base-v2	none	63.1	98.8	94.1	74.1	80.1
paraphrase-mpnet-base-v2	tiny	58.1	94.3	90	83.2	82.7
paraphrase-mpnet-base-v2	base	63.5	98	92.5	85.9	85.9

Clotho-Eval

SBERT	echecker	HC	HI	HM	MM	total
paraphrase-MiniLM-L6-v2	none	59.5	95.1	76.3	66.2	71.3
paraphrase-MiniLM-L6-v2	tiny	56.7	90.6	79.3	70.9	73.3
paraphrase-MiniLM-L6-v2	base	60	94.3	80.6	72.3	75.3
paraphrase-TinyBERT-L6-v2	none	60	95.5	75.9	66.9	71.8
paraphrase-TinyBERT-L6-v2	tiny	59	93	79.7	71.5	74.4
paraphrase-TinyBERT-L6-v2	base	60.5	94.7	80.2	72.8	75.7(*)
paraphrase-mpnet-base-v2	none	56.2	96.3	77.6	65.2	70.7
paraphrase-mpnet-base-v2	tiny	54.8	91.8	80.6	70.1	73
paraphrase-mpnet-base-v2	base	57.1	95.5	81.9	71.6	74.9
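The tables do not state how each figure is computed. One plausible reading, common for caption-metric benchmarks, is that each number is the percentage of caption pairs on which the metric prefers the same caption as the human annotators. The sketch below illustrates that computation under this assumption; `pairwise_agreement` and the tuple layout are hypothetical, and the scores are made-up toy values, not benchmark data.

```python
from typing import List, Tuple


def pairwise_agreement(pairs: List[Tuple[float, float, int]]) -> float:
    """Percentage of pairs where the metric agrees with the human choice.

    Each pair holds (metric score of caption A, metric score of caption B,
    human preference), where the preference is 0 for A and 1 for B.
    Ties in the metric scores are counted as a preference for B.
    """
    agree = 0
    for score_a, score_b, human_pick in pairs:
        metric_pick = 0 if score_a > score_b else 1
        agree += int(metric_pick == human_pick)
    return 100.0 * agree / len(pairs)


# Toy values only: four caption pairs with invented scores.
toy_pairs = [(0.9, 0.2, 0), (0.3, 0.7, 1), (0.6, 0.4, 1), (0.1, 0.5, 1)]
print(pairwise_agreement(toy_pairs))
```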

Reference

If you use FENSE in your research, please cite:

@misc{zhou2021audio,
      title={Can Audio Captions Be Evaluated with Image Caption Metrics?}, 
      author={Zelin Zhou and Zhiling Zhang and Xuenan Xu and Zeyu Xie and Mengyue Wu and Kenny Q. Zhu},
      year={2021},
      eprint={2110.04684},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Releases(V0.1)

V0.1(Oct 2, 2021)

Owner

Zhiling Zhang
