NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 10.31.21

The Localization Problem, and the Pale Blue Dot

8 min read · Oct 31, 2021

The Localization Problem (LP) is a dark cloud hanging over applied deep learning. Acknowledging it, I believe, will help us make better use of applied AI and better understand how the business market around it will form.

Defining LP: There is a limit to how well large centralized language models can generalize at scale, given 1) that different users inherently hold different definitions of ground truth, because each depends on their own real-world environment, and 2) whether or not model performance is mission-critical. In other words, when a model can’t afford to be wrong too often, optimizing its accuracy for a given user requires “localizing” it to the ground truth in that user’s data.

Example: Imagine a kazillion-parameter encoder transformer called Hal9000. This AGI model knows everything there is to know in the world. Hal9000 has 2 big customers: John, who works for Meta, and Jane, who works for CyberDyne Systems. John and Jane don’t know each other, but both are active commodity traders in their spare time and depend on Hal9000 to classify finance-related tweets for sentiment (positive, negative, neutral). They are trading in real time when a tweet is published on the wire: “gold is up 150% in after-hours trading.”

Some background: John is bullish on gold (owns gold call options and wants the gold price to go up in order to make money) and Jane is bearish on gold (owns gold put options and wants the price to go down in order to make money).

It’s time for Hal9000 to do its magic and classify this tweet so John and Jane can execute a trade. But Hal has a big problem. It can’t generalize to both John and Jane’s definition of ‘positive’ and ‘negative’. The model needs to classify this tweet as ‘positive’ for John and ‘negative’ for Jane given the same input text.

This is the LP manifesting itself in the real world. Hal needs to localize itself to John’s ground truth and to Jane’s ground truth of sentiment. Currently, the way we localize models is by fine-tuning them. And fine-tuning isn’t a hindrance to AI performance (as some who are obsessed with zero-shot may suggest); it’s a prerequisite. All the software and hardware improvements in the world can’t improve a model’s accuracy if it is not localized to its user. However, not all use cases encounter LP.
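
A toy sketch of the John and Jane conflict, assuming labels are plain strings and a "model" is any deterministic function. The names `accuracy`, `localized_model`, and the `POSITIONS` lookup are hypothetical, and the lookup only stands in for fine-tuning rather than implementing it. It illustrates that no single text-to-label function can match both users on the same tweet, while a function that also conditions on the user can.

```python
TWEET = "gold is up 150% in after-hours trading."
LABELS = ("positive", "negative", "neutral")

# Each user's own ground truth for the same tweet (taken from the example above).
GROUND_TRUTH = {"John": "positive", "Jane": "negative"}

def accuracy(predict):
    """Fraction of users whose ground truth the predictor matches on TWEET."""
    hits = sum(predict(TWEET, user) == label for user, label in GROUND_TRUTH.items())
    return hits / len(GROUND_TRUTH)

# A shared model ignores the user, so for one tweet it can emit only one label.
# Try every possible label and keep the best score it could ever reach.
best_shared = max(accuracy(lambda text, user, l=label: l) for label in LABELS)

# Hypothetical stand-in for localization: condition on the user's position.
POSITIONS = {"John": "long", "Jane": "short"}

def localized_model(text, user):
    price_up = " up " in f" {text} "
    favorable = price_up == (POSITIONS[user] == "long")
    return "positive" if favorable else "negative"

print(best_shared)                # 0.5
print(accuracy(localized_model))  # 1.0
```

The shared model tops out at 50% here no matter which label it picks, which is the point of the example: the ceiling comes from the conflicting ground truths, not from the model's capacity.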

There is a market where non-local language models can thrive: one where users can leverage a community-accepted NLP task for which model error is not mission-critical. This type of task and environment isn’t affected by LP.

(Think of a screenwriter using GPT-3 to write a screenplay. The model generates 20 screenplays; 19 of the 20 inferences suck, but the writer likes 1 of them. That is a high error rate of 95%. However, the error is not mission-critical, and the user still finds value in the model’s output.)
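
The arithmetic behind that tolerance, as a small sketch. It assumes each generation is independently acceptable with probability 1/20; that is an assumption for illustration, since the 1-in-20 figure above is a single outcome, not a measured rate. The function name `p_at_least_one` is hypothetical.

```python
def p_at_least_one(p_success: float, n_samples: int) -> float:
    """Probability that at least one of n independent samples succeeds."""
    return 1.0 - (1.0 - p_success) ** n_samples

# Screenwriter: browses 20 drafts and only needs one good one.
print(round(p_at_least_one(0.05, 20), 3))  # 0.642

# Trader: acts on every single inference, so n = 1 and the 5% hit rate is all there is.
print(round(p_at_least_one(0.05, 1), 3))   # 0.05
```

Under these assumptions, the same per-sample quality gives the screenwriter a usable result roughly two times out of three, while a user who must act on each output gets no such cushion.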

Conversely, dealing with LP becomes necessary in mission-critical use cases like our John and Jane example. Neither John nor Jane can tolerate a 95% error rate, as they would go bust over time (in addition to the model not being local to their ground truth).

Some in deep learning believe that simply scaling large language models will achieve AGI. However, in mission-critical situations, generalization without localization will actually hinder performance for some users due to the localization problem, especially when there is no window for the model to be wrong…

like Tesla’s Autopilot.

In the end, we are only a pale blue dot…

If you enjoyed this read, don’t forget to give it a 👏👏 … 🎃

Google’s Pathway to More Intelligent Models

Microsoft’s 2 New Repos for Speech and Seq2Seq

WavLM: Large-scale self-supervised pre-training for full stack speech processing.

unilm/wavlm at master · microsoft/unilm

WavLM : WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing Official PyTorch…

github.com

S2S-FT A PyTorch package used to fine-tune pre-trained Transformers for sequence-to-sequence language generation.

unilm/s2s-ft at master · microsoft/unilm

The recommended way to run the code is using docker: docker run -it --rm --runtime=nvidia --ipc=host --privileged…

github.com

DockerFiles Homies

A long list of dockerfiles for assorted sorties in the digital space.

GitHub - jessfraz/dockerfiles: Various Dockerfiles I use on the desktop and on servers.

This is a repo to hold various Dockerfiles for images I create. Table of Contents Almost all of these live on dockerhub…

github.com

Kerla | OS Written in Rust

A new OS with compatibility with Linux binaries.

GitHub - nuta/kerla: A new operating system kernel with Linux binary compatibility written in Rust.

Kerla is a monolithic operating system kernel written from scratch in Rust which aims to be compatible with the Linux…

github.com

Taichi | Parallel Programming

A Python compiler that can parallelize tasks to multi-core CPUs and parallel GPUs.

GitHub - taichi-dev/taichi: Parallel programming for everyone.

Taichi (太极) is a parallel programming language for high-performance numerical computations. It is embedded in Python…

github.com

Oldie But Goodie

Real-time inference TensorRT.

Real-Time Natural Language Processing with BERT Using NVIDIA TensorRT (Updated) | NVIDIA Developer…

Large-scale language models (LSLMs) such as BERT, GPT-2, and XL-Net have brought exciting leaps in accuracy for many…

developer.nvidia.com

SpikeX | Knowledge Extraction

Insanely useful library for NLP developers. Built on top of spaCy.

Pipelines:

WikiPageX links Wikipedia pages to chunks in text
ClusterX picks noun chunks in a text and clusters them based on a revisiting of the Ball Mapper algorithm, Radial Ball Mapper
AbbrX detects abbreviations and acronyms, linking them to their long form. It is based on scispacy’s one with improvements
LabelX takes labelings of pattern matching expressions and catches them in a text, solving overlappings, abbreviations and acronyms
PhraseX creates a Doc's underscore extension based on a custom attribute name and phrase patterns. Examples are NounPhraseX and VerbPhraseX, which extract noun phrases and verb phrases, respectively
SentX detects sentences in a text, based on Splitta with refinements

GitHub - erre-quadro/spikex: SpikeX - SpaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge…

github.com

The Read 📚

https://arxiv.org/pdf/2110.13041.pdf

Sentence Splitting and NLP… Comparing Libraries w/ Colab

Colab of the Week 🔥

Google Colaboratory

colab.research.google.com

Model Convergence in BigScience

Better Way for Evaluating QA Systems

In other words, F1 sucks bruh.

Semantic Answer Similarity: Smarter Metric to Evaluate Question Answering Systems

In our recent post on evaluating a question answering model, we discussed the most commonly used metrics for evaluating…

www.deepset.ai
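
To see why lexical overlap can mislead, here is a minimal sketch of token-level F1 in the style commonly used for extractive QA evaluation. The normalization is deliberately simplified (lowercasing and whitespace splitting only), the function name `token_f1` is hypothetical, and the example answers are made up for illustration; this is not the code from the linked post.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: a shared token counts only as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Same meaning, no shared tokens: F1 scores it as completely wrong.
print(token_f1("one hundred percent", "100%"))               # 0.0

# Opposite meaning, high overlap: F1 scores it as mostly right.
print(token_f1("the price went up", "the price went down"))  # 0.75
```

A similarity metric that compares meaning rather than tokens is meant to address both failure cases.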

🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻🐱‍💻

How are hackers scanning the whole Internet in just a few minutes?

Masscan is known for super performance in scanning the Internet - let's look at how easy and feasible it is Image…

cooltechzone.com

GoEmotions Dataset

58k Reddit comments extracted from popular English-language subreddits and labeled with 27 emotion categories.

GoEmotions: A Dataset for Fine-Grained Emotion Classification

Emotions are a key aspect of social interactions, influencing the way people behave and shaping relationships. This is…

ai.googleblog.com

Software Updates

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

Open Rule Induction

Aims to induce open rules utilizing the knowledge in language models.

GitHub - chenxran/Orion: This repository is the official implementation of Open Rule Induction…

This repository is the official implementation of Open Rule Induction. This paper has been accepted to NeurIPS 2021. …

github.com

HourglassLM

Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently.

trax/hourglass.py at master · google/trax

Trax - Deep Learning with Clear Code and Speed. Contribute to google/trax development by creating an account on GitHub.

github.com

SCENIC: JAX Library for Computer Vision Research and Beyond

A codebase with a focus on research around attention-based models for computer vision. Scenic has been successfully used to develop classification, segmentation, and detection models for multiple modalities including images, video, audio, and multimodal combinations of them.

GitHub - google-research/scenic: Scenic: A Jax Library for Computer Vision Research and Beyond

More precisely, Scenic is a (i) set of shared light-weight libraries solving commonly encountered tasks when…

github.com

Persona Authentication through Generative Dialogue

An open-domain conversational agent capable of decoding personalized and controlled responses based on user input. It is built on the pretrained DialoGPT-medium model, following the GPT-2 architecture. PersonaGPT is fine-tuned on the Persona-Chat dataset.

GitHub - illidanlab/personaGPT: Implementation of PersonaGPT Dialog Model

PersonaGPT is an open-domain conversational agent capable of decoding personalized and controlled responses based on…

github.com

IconQA

An abstract visual question answering dataset that highlights the importance of abstract diagram understanding and comprehensive cognitive reasoning in real-world problems.
There are three different sub-tasks in IconQA:

57,672 image choice MC questions
31,578 text choice MC questions
18,189 fill-in-the-blank questions

GitHub - lupantech/IconQA: A new benchmark for Icon Question Answering (IconQA) and a large-scale…

IconQA is a new diverse abstract visual question answering dataset that highlights the importance of abstract diagram…

github.com

Grade School Math Dataset

The dataset consists of 8.5K high-quality grade school math problems created by human problem writers. The authors segmented these into 7.5K training problems and 1K test problems. The problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ - / *) to reach the final answer.

GitHub - openai/grade-school-math

Status: Archive (code is provided as-is, no updates expected) State-of-the-art language models can match human…

github.com

SCICAP: Scientific Figures Dataset

A large-scale figure caption dataset based on Computer Science arXiv papers published between 2010 and 2020. SCICAP contains 410k figures of one dominant figure type, the graph plot, extracted from over 290,000 papers.

GitHub - tingyaohsu/SciCap: SciCap Dataset

SCICAP a large-scale figure caption dataset based on Computer Science arXiv papers published between 2010 and 2020…

github.com

… … … and …🎃 Happy Halloween! 🎃

Quantum Stat

Written by Ricky Costa