NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 10.31.21
The Localization Problem, and the Pale Blue Dot
The Localization Problem (LP) is a dark cloud hanging over the state of affairs in applied deep learning. Acknowledging this problem, I believe, will enable us to make better use of applied AI and to better understand how the business market will form.
Defining LP: There is a limit to how well large centralized language models can generalize at scale, given that 1) different users inherently have different definitions of ground truth, because each depends on their own real-world environment, and 2) model performance may or may not be mission-critical. In other words, under certain conditions, for a model to be optimized for accuracy for a given user, it needs to be “localized” to the ground truth in that user’s data, assuming the model can’t afford to be wrong too often.
Example: Imagine there is a kazillion-parameter encoder transformer called Hal9000. This AGI model knows everything there is to know in the world. Now Hal9000 has 2 big customers: John, who works for Meta, and Jane, who works for CyberDyne Systems. John and Jane don’t know each other, but both are active commodity traders in their spare time who depend on Hal9000 to classify finance-related tweets for sentiment analysis (positive, negative, neutral). John and Jane are trading in real time when a tweet is published on the wire: “gold is up 150% in after-hours trading.”
Some background: John is bullish on gold (owns gold call options and wants the gold price to go up in order to make money) and Jane is bearish on gold (owns gold put options and wants the price to go down in order to make money).
It’s time for Hal9000 to do its magic and classify this tweet so John and Jane can execute a trade. But Hal has a big problem. It can’t generalize to both John’s and Jane’s definitions of ‘positive’ and ‘negative’. The model needs to classify this tweet as ‘positive’ for John and ‘negative’ for Jane given the same input text.
This is the LP manifesting itself in the real world. Hal needs to localize itself to John’s ground truth and Jane’s ground truth of sentiment. Currently, the way we localize models is by fine-tuning them. And fine-tuning isn’t a hindrance to AI performance (as some who are obsessed with zero-shot may suggest); in actuality, it’s a prerequisite. All the software and hardware improvements in the world can’t improve a model’s accuracy if it is not localized to its user. However, not all use cases encounter LP.
There is a market where non-local language models can thrive: one where users can leverage a community-accepted NLP task for which model error is not mission-critical. This type of task and non-mission-critical environment isn’t concerned with LP.
(Think of a screenwriter using GPT-3 to write a screenplay. The model generates 20 screenplays, 19 of the 20 inferences suck, but the writer likes 1 of the 20 scripts generated. That is a high error rate of 95%. However, the error rate is not mission-critical, and the user still finds value in the model’s output.)
Conversely, avoiding LP becomes more important, and necessary, in mission-critical use cases for AI models, like our John and Jane example. Neither John nor Jane can tolerate a 95% error rate, as they would go bust over time (in addition to the model not being local to their ground truth).
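The John and Jane example can be made concrete with a toy sketch. Nothing here is a real model: `make_global_model`, `localized_model`, and the `POSITION` table are hypothetical names made up for illustration, and the keyword rule is only a stand-in for what fine-tuning on each user's data would learn. The point it tries to show is narrow: a model whose output depends on the text alone can match at most one of the two users on this tweet, while a model that also conditions on the user can match both.

```python
from typing import Callable, Dict

TWEET = "gold is up 150% in after-hours trading."
LABELS = ("positive", "negative", "neutral")

# Per-user ground truth for the SAME tweet, taken from the example above.
GROUND_TRUTH: Dict[str, str] = {"john": "positive", "jane": "negative"}

# Hypothetical per-user context: John is long gold, Jane is short gold.
POSITION: Dict[str, str] = {"john": "long", "jane": "short"}

Model = Callable[[str, str], str]  # (user, text) -> label


def accuracy(model: Model) -> float:
    """Fraction of users whose ground truth the model matches on TWEET."""
    hits = sum(model(user, TWEET) == label for user, label in GROUND_TRUTH.items())
    return hits / len(GROUND_TRUTH)


def make_global_model(label: str) -> Model:
    """A non-localized model: its output cannot depend on the user."""
    return lambda user, text: label


def localized_model(user: str, text: str) -> str:
    """Toy stand-in for a fine-tuned model: the label depends on the user."""
    words = text.lower().split()
    if "up" in words:
        move = "up"
    elif "down" in words:
        move = "down"
    else:
        return "neutral"
    good_move = "up" if POSITION[user] == "long" else "down"
    return "positive" if move == good_move else "negative"


# Whatever single label a global model assigns to this tweet, it is one of LABELS,
# so trying all three covers every possible global model on this input.
best_global = max(accuracy(make_global_model(label)) for label in LABELS)
print(f"best global accuracy: {best_global:.0%}")
print(f"localized accuracy:   {accuracy(localized_model):.0%}")
```

The ceiling on the global model does not come from model size or training data. It comes from the two users disagreeing on the label for identical input, which is the situation LP describes.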
Some in deep learning believe that simply scaling large language models will achieve AGI. However, in mission-critical situations, generalization without localization will actually hinder performance for some users due to the localization problem, especially when there is no window for the model to be wrong…
like Tesla’s Autopilot.
In the end, we are only a pale blue dot…
If you enjoy this read, don’t forget to give it a 👏👏 … 🎃
Google’s Pathway to More Intelligent Models
Microsoft’s 2 New Repos for Speech and Seq2Seq
WavLM: Large-scale self-supervised pre-training for full stack speech processing.
unilm/wavlm at master · microsoft/unilm
WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing Official PyTorch…
S2S-FT A PyTorch package used to fine-tune pre-trained Transformers for sequence-to-sequence language generation.
unilm/s2s-ft at master · microsoft/unilm
The recommended way to run the code is using docker: docker run -it --rm --runtime=nvidia --ipc=host --privileged…
A long list of dockerfiles for assorted sorties in the digital space.
GitHub - jessfraz/dockerfiles: Various Dockerfiles I use on the desktop and on servers.
This is a repo to hold various Dockerfiles for images I create. Table of Contents Almost all of these live on dockerhub…
Kerla | OS Written in Rust
A new OS kernel that is compatible with Linux binaries.
GitHub - nuta/kerla: A new operating system kernel with Linux binary compatibility written in Rust.
Kerla is a monolithic operating system kernel written from scratch in Rust which aims to be compatible with the Linux…
Taichi | Parallel Programming
A Python compiler that can parallelize tasks across multi-core CPUs and GPUs.
GitHub - taichi-dev/taichi: Parallel programming for everyone.
Taichi (太极) is a parallel programming language for high-performance numerical computations. It is embedded in Python…
Oldie But Goodie
Real-time inference with TensorRT.
Real-Time Natural Language Processing with BERT Using NVIDIA TensorRT (Updated) | NVIDIA Developer…
Large-scale language models (LSLMs) such as BERT, GPT-2, and XL-Net have brought exciting leaps in accuracy for many…
SpikeX | Knowledge Extraction
Insanely useful library for NLP developers. Built on top of spaCy.
- WikiPageX links Wikipedia pages to chunks in text
- ClusterX picks noun chunks in a text and clusters them based on a revisiting of the Ball Mapper algorithm, Radial Ball Mapper
- AbbrX detects abbreviations and acronyms, linking them to their long form. It is based on scispacy’s one with improvements
- LabelX takes labelings of pattern matching expressions and catches them in a text, solving overlappings, abbreviations and acronyms
- PhraseX creates a Doc's underscore extension based on a custom attribute name and phrase patterns. Examples are NounPhraseX and VerbPhraseX, which extract noun phrases and verb phrases, respectively
- SentX detects sentences in a text, based on Splitta with refinements
GitHub - erre-quadro/spikex: SpikeX - SpaCy Pipes for Knowledge Extraction
SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge…
The Read 📚
Sentence Splitting and NLP… Comparing Libraries w/ Colab
Colab of the Week 🔥
Model Convergence in BigScience
Better Way for Evaluating QA Systems
In other words, F1 sucks bruh.
Semantic Answer Similarity: Smarter Metric to Evaluate Question Answering Systems
In our recent post on evaluating a question answering model, we discussed the most commonly used metrics for evaluating…
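A rough sketch of why token-overlap F1 can mislead when scoring QA answers. It assumes the F1 in question is the SQuAD-style token-overlap score, simplified here to lowercasing and whitespace splitting (no punctuation or article stripping). `token_f1` is a hypothetical helper name and the answer pairs are made-up examples, not taken from the linked post.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection: each shared token counts as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Same meaning, no shared tokens: the metric gives no credit at all.
print(token_f1("thirty", "30"))
# One extra word costs a third of the score even though the answer is right.
print(token_f1("in 1969", "1969"))
```

A semantic similarity metric is meant to give credit in the first case, where the strings differ but the answers mean the same thing.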
How are hackers scanning the whole Internet in just a few minutes?
Masscan is known for super performance in scanning the Internet - let's look at how easy and feasible it is Image…
58k Reddit comments extracted from popular English-language subreddits and labeled with 27 emotion categories.
GoEmotions: A Dataset for Fine-Grained Emotion Classification
Emotions are a key aspect of social interactions, influencing the way people behave and shaping relationships. This is…
Repo Cypher 👨💻
A collection of recently released repos that caught our 👁
Open Rule Induction
Aims to induce open rules utilizing the knowledge in language models.
Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently.
SCENIC: JAX Library for Computer Vision Research and Beyond
A codebase with a focus on research around attention-based models for computer vision. Scenic has been successfully used to develop classification, segmentation, and detection models for multiple modalities including images, video, audio, and multimodal combinations of them.
Persona Authentication through Generative Dialogue
An open-domain conversational agent capable of decoding personalized and controlled responses based on user input. It is built on the pretrained DialoGPT-medium model, following the GPT-2 architecture. PersonaGPT is fine-tuned on the Persona-Chat dataset.
An abstract visual question answering dataset that highlights the importance of abstract diagram understanding and comprehensive cognitive reasoning in real-world problems.
There are three different sub-tasks in IconQA:
- 57,672 image choice MC questions
- 31,578 text choice MC questions
- 18,189 fill-in-the-blank questions
Grade School Math Dataset
Dataset consists of 8.5K high quality grade school math problems created by human problem writers. Authors segmented these into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ - / *) to reach the final answer.
SCICAP: Scientific Figures Dataset
A large-scale figure caption dataset based on Computer Science arXiv papers published between 2010 and 2020. SCICAP contains 410k figures of one dominant figure type, the graph plot, extracted from over 290,000 papers.