Text Normalization

Overview


Task description: use machine-learning algorithms to convert English text from its "written" form into its "spoken" form, e.g. "6ft" becomes "six feet".

Experimental results

  1. XGBoost + bag-of-words: 0.99159
  2. XGBoost + weights + rules: 0.99002
  3. Improved solve function (using 6 output files): 0.98939
  4. Basic solve function: 0.98277
  5. RandomTree + rules: 0.95304
  6. XGBoost: 0.92605

Reference GitHub repository:

Data source:

EDA (exploratory data analysis) link (helps to quickly understand the data's characteristics):

Areas for improvement

1. Class imbalance

For balanced data we normally use accuracy as the evaluation metric, but when the classes are imbalanced, accuracy becomes misleading and carries little information. The following metrics are the common alternatives (a small sketch of computing them follows the list):

  • Receiver operating characteristic (ROC) curve: compute the area under the curve (binary, PLAIN vs. non-PLAIN)
  • Precision-recall curve: compute the area under this curve
  • Precision
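
A minimal sketch of these metrics using scikit-learn; the labels and probabilities below are made-up placeholders, not values from this project.

```python
from sklearn.metrics import roc_auc_score, average_precision_score, precision_score

# y_true: 1 = non-PLAIN, 0 = PLAIN; proba/pred would come from any classifier.
y_true = [0, 0, 1, 1, 0, 1]
proba  = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
pred   = [0, 0, 0, 1, 0, 1]

print(roc_auc_score(y_true, proba))            # area under the ROC curve
print(average_precision_score(y_true, proba))  # area under the precision-recall curve
print(precision_score(y_true, pred))           # precision at a fixed threshold
```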

- A simple, general approach

Threshold moving: shift the default decision threshold of 0.5 to minority count / (minority count + majority count). Use an existing ensemble classifier such as random forest or XGBoost and adjust its classification threshold, as in the sketch below.
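
A minimal sketch of threshold moving, assuming a binary PLAIN vs. non-PLAIN setup; the synthetic data and the XGBoost parameters are placeholders for the real context-word features.

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the PLAIN / non-PLAIN split
# (the real features would be built from the context words around each token).
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.95, 0.05], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

clf = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
clf.fit(X_train, y_train)

# Threshold moving: use minority / (minority + majority) instead of the default 0.5.
# With y encoded as 1 for the minority class, y_train.mean() is exactly that ratio.
threshold = y_train.mean()
proba = clf.predict_proba(X_valid)[:, 1]
pred = (proba >= threshold).astype(int)
```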

- Handling imbalanced data with XGBoost

Handle the imbalance through per-class sample weights (in general, classes with few samples get a high weight and classes with many samples get a low weight), then train and evaluate as usual.

- A simple, effective recipe

  1. Do not over- or under-sample the data; use an existing ensemble model such as random forest or XGBoost (LightGBM)
  2. Output the model's predicted probabilities and adjust the threshold to obtain the final predictions
  3. Choose suitable evaluation metrics, such as precision and recall
  4. The text normalization task predicts 16 target classes on the test set. The largest class in the training set is PLAIN with 7,353,693 samples and the smallest is ADDRESS with 522, so tentatively set PLAIN's weight to 0.01 and all other classes' weights to 1 (then run a second classification over the remaining 15 classes with PLAIN excluded); see the weighting sketch after this list.
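
A minimal sketch of the class weighting described in point 4, assuming a pandas frame with a `class` column; the toy features and class proportions are placeholders, not the real training data.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# Hypothetical training frame: feature columns plus the target label `class`.
train = pd.DataFrame({
    "feat_0": np.random.rand(1000),
    "feat_1": np.random.rand(1000),
    "class": np.random.choice(["PLAIN", "DATE", "ADDRESS"], size=1000, p=[0.9, 0.09, 0.01]),
})

# Down-weight the dominant PLAIN class as described above.
weights = np.where(train["class"] == "PLAIN", 0.01, 1.0)

X = train[["feat_0", "feat_1"]]
y = train["class"].astype("category").cat.codes

clf = xgb.XGBClassifier(objective="multi:softprob", n_estimators=100)
clf.fit(X, y, sample_weight=weights)
```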

2. Hyperparameter optimization (time and space complexity)

How do we choose good hyperparameters? Each model has its own optimal hyperparameter combination, and in practice people find it by experience or by random trial. However, it is possible to use a mathematical or machine-learning model to solve the hyperparameter-selection problem itself.

Background

  • Hyperparameter tuning is generally treated as a black-box optimization problem: during tuning we only observe the model's inputs and outputs, we cannot obtain gradient information from the training process, and we cannot assume that the mapping from hyperparameters to the final metric is convex
  • Training the model is expensive, in both time and money

Automatic tuning methods

Grid search, random search, genetic algorithms, Particle Swarm Optimization (PSO), Bayesian optimization, TPE, SMAC, etc.

  • Genetic algorithms and PSO are classic black-box, population-based optimization algorithms. They are not particularly well suited to hyperparameter tuning, because they need a sufficiently large number of initial sample points and their optimization efficiency is low
  • Grid search is easy to understand and implement, but it has to enumerate every hyperparameter combination, and continuous values must be sampled at fixed intervals. The enumerated combinations do not necessarily contain the global optimum, and the number of combinations explodes quickly, so it is not an efficient tuning method.
  • Random search is generally considered better than grid search. The sampled combinations are random, so individual trials may turn out very good or very bad, but with the same number of trials as grid search the best result found is usually better; the variance is larger too, yet that does not affect the final outcome.

Still, when compute is limited, grid search and random search are not necessarily better than an experienced engineer's manual tuning.
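
A minimal random-search sketch with scikit-learn's RandomizedSearchCV; the parameter ranges and the synthetic data are illustrative assumptions, not tuned values from this project.

```python
import xgboost as xgb
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Illustrative search space; real ranges would depend on the data.
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    xgb.XGBClassifier(),
    param_distributions=param_dist,
    n_iter=20,          # budget of random trials
    scoring="f1",       # pick a metric suited to imbalanced data
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```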

  • Bayesian Optimization
    Suitable scenarios:
    (1) the function to be optimized is very expensive to evaluate, e.g. the neural-network hyperparameter problem mentioned above, where every training run burns a lot of GPU time
    (2) the function to be optimized provides no gradient information
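
For comparison, a minimal Bayesian-style tuning sketch using hyperopt's TPE; the search space and cross-validation setup are assumptions, not the configuration actually used here.

```python
import xgboost as xgb
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Illustrative search space; the ranges are assumptions, not tuned values.
space = {
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.3),
    "subsample": hp.uniform("subsample", 0.6, 1.0),
}

def objective(params):
    model = xgb.XGBClassifier(
        n_estimators=100,
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        subsample=params["subsample"],
    )
    # TPE minimizes the objective, so return the negated CV score.
    return -cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30, trials=Trials())
print(best)
```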

3. Explainability tools (https://www.kaggle.com/learn/machine-learning-explainability)

XGBoost usually achieves better predictive accuracy than linear models, but it loses the linear model's interpretability, so XGBoost is generally treated as a black-box model. The classic remedy is a global feature-importance ranking.

In 2017, Lundberg and Lee's [paper](../文献阅读/A Unified Approach to Interpreting Model Predictions.pdf) proposed SHAP values, a widely applicable method for explaining all kinds of models (classification and regression). The biggest beneficiaries are the previously hard-to-interpret black-box models such as boosting and neural networks.
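
A minimal SHAP sketch for a tree model, using synthetic data in place of the real features; it only illustrates the TreeExplainer + summary-plot workflow.

```python
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
model = xgb.XGBClassifier(n_estimators=100).fit(X, y)

# TreeExplainer gives exact SHAP values for tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features push predictions towards each class.
shap.summary_plot(shap_values, X)
```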

  1. Binary classification (PLAIN vs. non-PLAIN): check whether the accuracy is high
  2. Ensemble XGBoost, LightGBM, and random forest
  3. Explainability with SHAP (SHAP values can only analyze features)
  4. Remove PLAIN and check the effect, using ROC

Follow-up improvements

1. Setting PLAIN's weight to 0 gives a score of 0.98991. Removing PLAIN entirely and not predicting it cannot be scored officially, and because the experiment uses the surrounding context words as the training unit, training is not possible once PLAIN is removed. So the only option is to set its weight to 0 and check whether each class's prediction accuracy stays high; this at least lets us inspect the effect on the training set.