Going West | Pollock


The NLP Cypher | 10.31.21

The Localization Problem, and the Pale Blue Dot

Ricky Costa
8 min read · Oct 31, 2021


The Localization Problem (LP) is a glaring dark cloud hanging over the state of affairs in applied deep learning. Acknowledging this problem, I believe, will enable us to make better use of applied AI and expand our understanding of how the business market will form.

Defining LP: There is a limit to how much large centralized language models can generalize at scale, given 1) that different users inherently have varying definitions of ground truth due to interdependencies with their unique real-world environment, and 2) whether or not model performance is mission-critical. In other words, under certain conditions, for a model to be optimized for accuracy for a given user, the model needs to be “localized” to that user’s ground truth in their data, assuming the model can’t afford to be wrong too many times.

Example: Imagine there is a kazillion-parameter encoder transformer called Hal9000. This AGI model knows everything there is to know in the world. Now Hal9000 has 2 big customers: John, who works for Meta, and Jane, who works for CyberDyne Systems. John and Jane don’t know each other, but both are active commodity traders in their spare time and depend on Hal9000 to classify finance-related tweets for a sentiment analysis task (positive, negative, neutral). John and Jane are trading in real time when a tweet is published on the wire: “gold is up 150% in after-hours trading.”

Some background: John is bullish on gold (owns gold call options and wants the gold price to go up in order to make money) and Jane is bearish on gold (owns gold put options and wants the price to go down in order to make money).

It’s time for Hal9000 to do its magic and classify this tweet so John and Jane can execute a trade. But Hal has a big problem. It can’t generalize to both John and Jane’s definition of ‘positive’ and ‘negative’. The model needs to classify this tweet as ‘positive’ for John and ‘negative’ for Jane given the same input text.
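To make the conflict concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `GROUND_TRUTH` table, the two "global" classifiers, and the `localize` helper are toy stand-ins, not a real model or API. The lookup table simply plays the role of whatever per-user adaptation a real system would use.

```python
from typing import Callable, Dict

TWEET = "gold is up 150% in after-hours trading."

# Hypothetical per-user ground truth: the same tweet, two different labels.
GROUND_TRUTH: Dict[str, Dict[str, str]] = {
    "John": {TWEET: "positive"},  # holds gold call options
    "Jane": {TWEET: "negative"},  # holds gold put options
}

Classifier = Callable[[str], str]


def accuracy(model: Classifier, user: str) -> float:
    """Fraction of a user's labeled examples that the model gets right."""
    examples = GROUND_TRUTH[user]
    hits = sum(model(text) == label for text, label in examples.items())
    return hits / len(examples)


def global_positive(text: str) -> str:
    """A single shared model that maps the tweet to 'positive'."""
    return "positive"


def global_negative(text: str) -> str:
    """A single shared model that maps the tweet to 'negative'."""
    return "negative"


def localize(user: str) -> Classifier:
    """Toy stand-in for per-user adaptation: answer from that user's labels."""
    labels = GROUND_TRUTH[user]
    return lambda text: labels.get(text, "neutral")


if __name__ == "__main__":
    # One text -> label function can satisfy at most one of the two traders.
    for model in (global_positive, global_negative):
        scores = {user: accuracy(model, user) for user in GROUND_TRUTH}
        print(model.__name__, scores)

    # A model localized to each user satisfies both.
    for user in GROUND_TRUTH:
        print("localized for", user, accuracy(localize(user), user))
```

Whichever label the shared model picks, one trader scores 0. Only the per-user versions get both right, because the input text alone does not carry the information that separates John from Jane.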

This is the LP manifesting itself in the real world. Hal needs to localize itself to John’s ground truth and Jane’s ground truth of sentiment. Currently, the way we localize models is by fine-tuning them. And fine-tuning isn’t a hindrance to AI performance (as some who are obsessed with zero-shot may suggest); in actuality, it’s a prerequisite. All the software and hardware improvements in the world can’t improve a model’s accuracy if it is not localized to its user. However, not all use cases encounter LP.

There is a market where non-local language models can thrive: one where users can leverage a community-accepted NLP task for which model error is not mission-critical. This type of task and non-mission-critical environment isn’t concerned with LP.

(Think of a screenwriter using GPT-3 to write a screenplay. The model can generate 20 screenplays; 19 of the 20 inferences suck, but the writer likes 1 of the 20 scripts generated. That is a high error rate of 95%. However, the error rate is not mission-critical, and the user still finds value in the model’s output.)

Conversely, addressing LP becomes more pressing, and in fact required, in mission-critical use cases for AI models like our John and Jane example. Neither John nor Jane can tolerate a 95% error rate, as they would go bust over time (on top of the model not being local to their ground truth).
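A rough back-of-the-envelope sketch of why the same 95% error rate is tolerable in one setting and fatal in the other. The assumptions here are mine, not part of the example above: the screenwriter's drafts are treated as independent, and each trade is assumed to win or lose one unit.

```python
def p_at_least_one_hit(error_rate: float, n_samples: int) -> float:
    """Chance that at least one of n independent samples is usable."""
    return 1.0 - error_rate ** n_samples


def expected_pnl_per_trade(error_rate: float, win: float = 1.0, loss: float = 1.0) -> float:
    """Expected profit per trade when every single prediction is acted on."""
    return (1.0 - error_rate) * win - error_rate * loss


if __name__ == "__main__":
    # Screenwriter: samples 20 drafts and only needs one keeper.
    print(round(p_at_least_one_hit(0.95, 20), 2))   # about 0.64

    # Trader: acts on every classification, so errors compound.
    print(round(expected_pnl_per_trade(0.95), 2))   # -0.9 units per trade
```

The screenwriter gets to throw away the misses; the trader pays for every one of them.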

Some in deep learning believe that simply scaling large language models will achieve AGI. However, in mission-critical situations, generalization without localization will actually hinder performance for some users due to the localization problem, especially when there is no window for the model to be wrong…

like Tesla’s Autopilot.

In the end, we are only a pale blue dot…

If you enjoyed this read, don’t forget to give it a 👏👏🎃

Google’s Pathway to More Intelligent Models

Microsoft’s 2 New Repos for Speech and Seq2Seq

WavLM: Large-scale self-supervised pre-training for full stack speech processing.

S2S-FT: A PyTorch package used to fine-tune pre-trained Transformers for sequence-to-sequence language generation.

DockerFiles Homies

A long list of dockerfiles for assorted sorties in the digital space.

Kerla | OS Written in Rust

A new OS with compatibility with Linux binaries.

Taichi | Parallel Programming

A Python compiler that can parallelize tasks to multi-core CPUs and parallel GPUs.

Oldie But Goodie

Real-time inference with TensorRT.

SpikeX | Knowledge Extraction

Insanely useful library for NLP developers. Built on top of spaCy.


  • WikiPageX links Wikipedia pages to chunks in text
  • ClusterX picks noun chunks in a text and clusters them based on a revisiting of the Ball Mapper algorithm, Radial Ball Mapper
  • AbbrX detects abbreviations and acronyms, linking them to their long form. It is based on scispacy’s abbreviation detector, with improvements
  • LabelX takes labelings of pattern matching expressions and catches them in a text, resolving overlaps, abbreviations and acronyms
  • PhraseX creates a Doc's underscore extension based on a custom attribute name and phrase patterns. Examples are NounPhraseX and VerbPhraseX, which extract noun phrases and verb phrases, respectively
  • SentX detects sentences in a text, based on Splitta with refinements

The Read 📚


Sentence Splitting and NLP… Comparing Libraries w/ Colab

Colab of the Week 🔥

Model Convergence in BigScience

Better Way for Evaluating QA Systems

In other words, F1 sucks bruh.
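For a feel of the complaint, here is a small sketch of token-overlap F1. I am assuming the F1 in question is the one commonly used to score extractive QA answers; the example answers below are made up for illustration.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts the tokens the two answers share.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Same meaning, no shared tokens: scored as a complete miss.
    print(token_f1("one million", "1,000,000"))                # 0.0

    # Different answers, shared filler words: scored fairly high.
    print(round(token_f1("in the park", "in the city"), 2))   # 0.67
```

The metric only sees surface tokens, so it can punish a correct paraphrase and reward a wrong answer that happens to share words with the reference.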


GoEmotions Dataset

58k Reddit comments extracted from popular English-language subreddits and labeled with 27 emotion categories.

Software Updates

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁


Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently.

SCENIC: JAX Library for Computer Vision Research and Beyond

A codebase with a focus on research around attention-based models for computer vision. Scenic has been successfully used to develop classification, segmentation, and detection models for multiple modalities including images, video, audio, and multimodal combinations of them.

Persona Authentication through Generative Dialogue

An open-domain conversational agent capable of decoding personalized and controlled responses based on user input. It is built on the pretrained DialoGPT-medium model, following the GPT-2 architecture. PersonaGPT is fine-tuned on the Persona-Chat dataset.


IconQA is an abstract visual question answering dataset that highlights the importance of abstract diagram understanding and comprehensive cognitive reasoning in real-world problems.

There are three different sub-tasks in IconQA:

  • 57,672 image choice MC questions
  • 31,578 text choice MC questions
  • 18,189 fill-in-the-blank questions

Grade School Math Dataset

The dataset consists of 8.5K high-quality grade school math problems created by human problem writers. The authors segmented these into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ - / *) to reach the final answer.

SCICAP: Scientific Figures Dataset

A large-scale figure caption dataset based on Computer Science arXiv papers published between 2010 and 2020. SCICAP contains 410k figures focused on one of the dominant figure types, graph plots, extracted from over 290,000 papers.

… … … and …🎃 Happy Halloween! 🎃




Subscribe to the NLP Cypher newsletter for the latest in NLP & ML code/research. 🤟