NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 09.05.21
Omega
Hey, welcome back! A flood of EMNLP 2021 papers came in this week, so today’s newsletter should be loads of fun! 😋
But first, a meme search engine:
The Missing Text Phenomenon
An article on The Gradient had an interesting take on NLU. It describes how a neural network’s capacity for NLU inference is inherently bounded by the background knowledge it holds (which is usually highly limited relative to a human’s). I’d add a bit more nuance: this is only a problem for a model that isn’t localized for its user, meaning one that wasn’t fine-tuned or prompted (localized) for a specific user. For information that is general and has a ground truth (e.g., rain is wet, rain falls to the ground), the missing text phenomenon (MTP) isn’t a big issue given large enough data and models.
I think a bigger issue in NLU (using text only) arises when the data doesn’t match the complexity of the real world, i.e., there isn’t enough information in the text-only modality. Humans by default use a multi-modal approach (text, audio, visual, etc.) when interpreting the world around us, which helps us with inference. Multi-modal learning can be a viable approach to the MTP examples discussed in the article.
Document Parsing Goes Multi-Lingual
For those into document (PDF) parsing 👇. Includes the second version of LayoutLM (LayoutLMv2) and its multi-lingual cousin, LayoutXLM.
…And there’s already a repo built on top of these models! 👌
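If you want to poke at these models directly, here’s a minimal sketch of running LayoutLMv2 through Hugging Face transformers (the checkpoint name is the one on the Hub; per the model card you’ll also need detectron2 and an OCR backend such as pytesseract installed):

```python
# Minimal LayoutLMv2 sketch: the processor runs OCR and extracts layout
# boxes from the page image, then the model embeds text + visual tokens.
from PIL import Image
from transformers import LayoutLMv2Model, LayoutLMv2Processor

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2Model.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("invoice.png").convert("RGB")  # any document page image
encoding = processor(image, return_tensors="pt")  # OCR + bounding boxes
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)  # contextual text + image embeddings
```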
Papers to Read 📚
StackOverflow Survey Full Dataset Released
We previously mentioned the highlights/shorter version in an earlier newsletter; now you can get the full dataset:
GNN Intro
A long and awesome introduction to graph neural networks.
The Machine & Deep Learning Compendium
Holy Moly 🤯
The Compendium covers over 500 topics in ML and has been in the making for over four years. It’s now offered in an interactive web-based format.
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
AEDA: An Easier Data Augmentation Technique for Text Classification
AEDA augments text classification data by nothing more than randomly inserting punctuation marks into the original text.
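The method is simple enough to re-implement in a few lines. Here’s a minimal sketch of the idea (the six punctuation marks and the 1-to-n/3 insertion count follow the paper; check the official repo for the exact details):

```python
import random

PUNCTUATIONS = [".", ";", "?", ":", "!", ","]  # the six marks from the paper

def aeda(sentence: str, punc_ratio: float = 1 / 3) -> str:
    """Randomly insert punctuation marks between words (AEDA)."""
    words = sentence.split()
    n_insertions = random.randint(1, max(1, int(punc_ratio * len(words))))
    for _ in range(n_insertions):
        position = random.randint(0, len(words))  # anywhere, incl. the ends
        words.insert(position, random.choice(PUNCTUATIONS))
    return " ".join(words)

print(aeda("a sad superior human comedy played out on the back roads of life"))
```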
Causal Inference Papers and Language
A collection of papers and codebases about influence, causality, and language.
FinQA | Financial Dataset
The dataset contains 8,281 financial QA pairs, along with their numerical reasoning processes. Eleven finance professionals collectively constructed FinQA based on the earnings reports of S&P 500 companies.
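Each reasoning process is a small program over numbers pulled from the report. As a toy illustration only (the dataset’s real DSL has more operations and table references than this assumed, simplified format):

```python
# Toy interpreter for a simplified FinQA-style program: a list of
# (op, arg1, arg2) steps, where "#i" refers to the result of step i.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def run_program(steps):
    results = []
    def val(x):
        return results[int(x[1:])] if x.startswith("#") else float(x)
    for op, a, b in steps:
        results.append(OPS[op](val(a), val(b)))
    return results[-1]

# e.g., a percentage change: (new - old) / old
print(run_program([("subtract", "608", "489"), ("divide", "#0", "489")]))
```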
Thermostat
Thermostat is a large collection of NLP model explanations and accompanying analysis tools.
- Combines explainability methods from the captum library with Hugging Face’s datasets and transformers.
- Mitigates repetitive execution of common experiments in Explainable NLP, and thus reduces the environmental impact and financial roadblocks.
- Increases comparability and replicability of research.
- Reduces the implementational burden.
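For context, the attributions Thermostat pre-computes are the kind you’d otherwise run yourself with captum. A rough sketch of one such run using Layer Integrated Gradients on a Hub sequence classifier (the checkpoint name and target label here are illustrative assumptions, not Thermostat’s code):

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "textattack/bert-base-uncased-imdb"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

def forward(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

enc = tokenizer("A surprisingly moving film.", return_tensors="pt")
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

# Attribute the positive-class logit back to the embedding layer.
lig = LayerIntegratedGradients(forward, model.bert.embeddings)
attributions = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline,
    additional_forward_args=(enc["attention_mask"],),
    target=1,  # assumed positive-class index
)
token_scores = attributions.sum(dim=-1).squeeze(0)  # one score per token
print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]),
               token_scores.tolist())))
```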
G2R: Distilling the Knowledge of Large-Scale Generative Models into Retrieval Models for Efficient Open-domain Conversation
A new training method that preserves the efficiency of a retrieval model while leveraging the conversational ability of a large-scale generative model, achieved by infusing the generative model’s knowledge into the retrieval model.
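To make the idea concrete, here’s a conceptual sketch (not the paper’s code; all names are illustrative) of model-level distillation: nudging a retriever’s candidate scores toward the generative teacher’s likelihoods:

```python
import torch.nn.functional as F

def distillation_loss(retrieval_scores, generator_loglikes, temperature=1.0):
    """KL(teacher || student) over a batch of candidate responses.

    retrieval_scores:   (batch, n_candidates) scores from the retriever
    generator_loglikes: (batch, n_candidates) log p(candidate | context)
                        under the large generative model (the teacher)
    """
    teacher = F.softmax(generator_loglikes / temperature, dim=-1)
    student_log = F.log_softmax(retrieval_scores / temperature, dim=-1)
    return F.kl_div(student_log, teacher, reduction="batchmean")
```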
Text-AutoAugment (TAA)
Automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples.
MWPToolkit
MWPToolkit is a PyTorch-based toolkit for Math Word Problem (MWP) solving.
Emotion Recognition in Conversation (ERC)
EmoBERTa can learn intra- and inter-speaker states and context to predict the emotion of the current speaker, in an end-to-end manner.
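A toy illustration of speaker-aware input formatting in this style: prepend each speaker’s name to their utterance and concatenate the surrounding turns as context (the separator choice below is an assumption, not the paper’s exact format):

```python
def build_input(turns, current_idx, sep=" </s> "):
    """turns: list of (speaker, utterance) pairs; returns one flat string
    ending at the utterance whose emotion we want to predict."""
    context = [f"{spk}: {utt}" for spk, utt in turns[: current_idx + 1]]
    return sep.join(context)

turns = [("Ross", "I'm fine."), ("Rachel", "You don't sound fine...")]
print(build_input(turns, 1))
# Ross: I'm fine. </s> Rachel: You don't sound fine...
```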
SummerTime — Text Summarization Toolkit for Non-experts
A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.
WebQA: Multihop and Multimodal QA
WebQA is a new benchmark for multi-modal, multi-hop reasoning in which systems are presented with the same style of data humans see when searching the web: snippets and images.
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat