NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 09.05.21
Omega
Hey, welcome back! A flood of EMNLP 2021 papers came in this week, so today’s newsletter should be loads of fun! 😋
But first, a meme search engine:
The Missing Text Phenomenon
An article on The Gradient had an interesting take on NLU. It describes how a neural network’s capacity for NLU inference is inherently bounded by the background knowledge it holds (which is usually highly limited relative to a human’s). I’d add a bit more nuance: this is only a problem for a model that isn’t localized for its user, meaning one that wasn’t fine-tuned or prompted (localized) for a specific user. For information that is general and has a ground truth (e.g., rain is wet, rain falls to the ground), the missing text phenomenon (MTP) isn’t a big issue given large enough data and models.
I think a bigger issue in NLU (using text only) arises when the data doesn’t match the complexity of the real world, i.e., there isn’t enough information in the text-only modality. Humans by default use a multi-modal approach (text, audio, visual, etc.) when interpreting the world around us, which helps us with inference. Multi-modal learning can be a viable approach to the MTP examples discussed in the article.
Document Parsing Goes Multi-Lingual
For those into document (PDF) parsing 👇. Includes the second version of LayoutLM (LayoutLMv2) and its multi-lingual cousin, LayoutXLM.
…And there’s already a repo built on top of these models! 👌
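If you want to poke at these models directly, here’s a minimal sketch of running LayoutLMv2 through Hugging Face transformers (the checkpoint name is the one on the Hub; per the model card you’ll also need detectron2 and an OCR backend such as pytesseract installed):

```python
# Minimal LayoutLMv2 sketch: the processor runs OCR and extracts layout
# boxes from the page image, then the model embeds text + visual tokens.
from PIL import Image
from transformers import LayoutLMv2Model, LayoutLMv2Processor

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2Model.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("invoice.png").convert("RGB")  # any document page image
encoding = processor(image, return_tensors="pt")  # OCR + bounding boxes
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)  # contextual text + image embeddings
```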
Papers to Read 📚
StackOverflow Survey Full Dataset Released
We previously mentioned the highlights/shorter version in an earlier newsletter; now you can get the full dataset:
GNN Intro
A long and awesome introduction to graph neural networks.
The Machine & Deep Learning Compendium
Holy Moly 🤯
The Compendium covers over 500 topics in ML and has been in the making for over four years. It’s now offered in an interactive web-based format.
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
AEDA: An Easier Data Augmentation Technique for Text Classification
AEDA augments text classification data by nothing more than randomly inserting punctuation marks into the original text.
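The method is simple enough to re-implement in a few lines. Here’s a minimal sketch of the idea (the six punctuation marks and the 1-to-n/3 insertion count follow the paper; check the official repo for the exact details):

```python
import random

PUNCTUATIONS = [".", ";", "?", ":", "!", ","]  # the six marks from the paper

def aeda(sentence: str, punc_ratio: float = 1 / 3) -> str:
    """Randomly insert punctuation marks between words (AEDA)."""
    words = sentence.split()
    n_insertions = random.randint(1, max(1, int(punc_ratio * len(words))))
    for _ in range(n_insertions):
        position = random.randint(0, len(words))  # anywhere, incl. the ends
        words.insert(position, random.choice(PUNCTUATIONS))
    return " ".join(words)

print(aeda("a sad superior human comedy played out on the back roads of life"))
```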
Causal Inference Papers and Language
A collection of papers and codebases about influence, causality, and language.
FinQA | Financial Dataset
The dataset contains 8,281 financial QA pairs, along with their numerical reasoning processes. Eleven finance professionals collectively constructed FinQA based on the earnings reports of S&P 500 companies.
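Each reasoning process is a small program over numbers pulled from the report. As a toy illustration only (the dataset’s real DSL has more operations and table references than this assumed, simplified format):

```python
# Toy interpreter for a simplified FinQA-style program: a list of
# (op, arg1, arg2) steps, where "#i" refers to the result of step i.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def run_program(steps):
    results = []
    def val(x):
        return results[int(x[1:])] if x.startswith("#") else float(x)
    for op, a, b in steps:
        results.append(OPS[op](val(a), val(b)))
    return results[-1]

# e.g., a percentage change: (new - old) / old
print(run_program([("subtract", "608", "489"), ("divide", "#0", "489")]))
```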
Thermostat
Thermostat is a large collection of NLP model explanations and accompanying analysis tools.
- Combines explainability methods from the captum library with Hugging Face’s datasets and transformers.
- Mitigates repetitive execution of common experiments in Explainable NLP, and thus reduces the environmental impact and financial roadblocks.
- Increases comparability and replicability of research.
- Reduces the implementational burden.
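For context, the attributions Thermostat pre-computes are the kind you’d otherwise run yourself with captum. A rough sketch of one such run using Layer Integrated Gradients on a Hub sequence classifier (the checkpoint name and target label here are illustrative assumptions, not Thermostat’s code):

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "textattack/bert-base-uncased-imdb"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

def forward(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

enc = tokenizer("A surprisingly moving film.", return_tensors="pt")
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

# Attribute the positive-class logit back to the embedding layer.
lig = LayerIntegratedGradients(forward, model.bert.embeddings)
attributions = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline,
    additional_forward_args=(enc["attention_mask"],),
    target=1,  # assumed positive-class index
)
token_scores = attributions.sum(dim=-1).squeeze(0)  # one score per token
print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]),
               token_scores.tolist())))
```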
G2R: Distilling the Knowledge of Large-Scale Generative Models into Retrieval Models for Efficient Open-domain Conversation
A new training method that preserves the efficiency of a retrieval model while leveraging the conversational ability of a large-scale generative model, achieved by infusing the generative model’s knowledge into the retrieval model.
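To make the idea concrete, here’s a conceptual sketch (not the paper’s code; all names are illustrative) of model-level distillation: nudging a retriever’s candidate scores toward the generative teacher’s likelihoods:

```python
import torch.nn.functional as F

def distillation_loss(retrieval_scores, generator_loglikes, temperature=1.0):
    """KL(teacher || student) over a batch of candidate responses.

    retrieval_scores:   (batch, n_candidates) scores from the retriever
    generator_loglikes: (batch, n_candidates) log p(candidate | context)
                        under the large generative model (the teacher)
    """
    teacher = F.softmax(generator_loglikes / temperature, dim=-1)
    student_log = F.log_softmax(retrieval_scores / temperature, dim=-1)
    return F.kl_div(student_log, teacher, reduction="batchmean")
```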
Text-AutoAugment (TAA)
Automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples.
MWPToolkit
MWPToolkit is a PyTorch-based toolkit for Math Word Problem (MWP) solving.
Emotion Recognition in Conversation (ERC)
EmoBERTa can learn intra- and inter-speaker states and context to predict the emotion of the current speaker, in an end-to-end manner.
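A toy illustration of speaker-aware input formatting in this style: prepend each speaker’s name to their utterance and concatenate the surrounding turns as context (the separator choice below is an assumption, not the paper’s exact format):

```python
def build_input(turns, current_idx, sep=" </s> "):
    """turns: list of (speaker, utterance) pairs; returns one flat string
    ending at the utterance whose emotion we want to predict."""
    context = [f"{spk}: {utt}" for spk, utt in turns[: current_idx + 1]]
    return sep.join(context)

turns = [("Ross", "I'm fine."), ("Rachel", "You don't sound fine...")]
print(build_input(turns, 1))
# Ross: I'm fine. </s> Rachel: You don't sound fine...
```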
SummerTime — Text Summarization Toolkit for Non-experts
A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.
WebQA: Multihop and Multimodal QA
WebQA is a new benchmark for multi-modal, multi-hop reasoning in which systems are presented with the same style of data humans see when searching the web: snippets and images.
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat