NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 09.19.21
Vintage Vectors
Welcome back! We have a long newsletter this week, as many new NLP repos were published while tech nerds returned from their summer vacation. 😁
I'll be adding close to 150 new NLP repos to the NLP Index, so stay tuned: the update drops this week.
Welcome to the Matrix
Six Degrees of Wikipedia
just explore…
EmbeddingHub
Embeddinghub is a database built for machine learning embeddings, designed with four goals in mind (a quick usage sketch follows the list):
- Store embeddings durably and with high availability
- Allow for approximate nearest neighbor operations
- Enable other operations like partitioning, sub-indices, and averaging
- Manage versioning, access control, and rollbacks painlessly
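Here is a minimal sketch of storing and querying vectors, loosely based on the project's README; the client calls shown here (connect, create_space, multiset, nearest_neighbors) are assumptions and may differ between versions.

```python
import embeddinghub as eh

# Connect to a locally running Embeddinghub instance (assumed default config).
hub = eh.connect(eh.Config())

# Create a named space that holds 3-dimensional vectors.
space = hub.create_space("quizzes", dims=3)

# Store a batch of embeddings keyed by id.
space.multiset({
    "system design": [1.0, 2.0, 3.0],
    "robustness": [1.0, 1.0, 1.0],
})

# Approximate nearest-neighbor lookup for an existing key.
neighbors = space.nearest_neighbors(key="system design", num=2)
print(neighbors)
```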
Rubrix | Open Sourced NLP Data Explorer/Annotator
This library is compatible with the usual suspects in NLP: Hugging Face Transformers, spaCy, Stanford Stanza, Flair, etc.
Rubrix can (a minimal logging sketch follows the list):
- Monitor the predictions of deployed models.
- Collect ground-truth data for starting up a project or evolving an existing one.
- Iterate on ground-truth data and predictions to debug, track and improve your models over time.
- Build custom applications and dashboards on top of your model predictions and ground-truth data.
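To illustrate the monitoring use case, here is a minimal sketch that logs a text classification prediction to a locally running Rubrix instance; the model name and scores are made up, and the record fields reflect the 2021-era API, which may have changed since.

```python
import rubrix as rb

# A single prediction from a (hypothetical) sentiment model.
record = rb.TextClassificationRecord(
    inputs={"text": "The new NLP repos this week look promising"},
    prediction=[("positive", 0.92), ("negative", 0.08)],
    prediction_agent="my-sentiment-model-v1",  # hypothetical model id
)

# Log it to a dataset on the Rubrix server (assumed running at the default URL).
rb.log(records=record, name="sentiment-monitoring")
```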
AI100 Survey
After five years, the AI100 study (the One Hundred Year Study on Artificial Intelligence) is back with a new report.
Beyond “Vanilla” Question Answering
A deepset blog post on how to enhance a QA system by adding capabilities such as classification, summarization, and generative QA.
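For reference, the "vanilla" baseline the post builds on is plain extractive QA, where the answer is a span copied from the context. A minimal sketch with the Hugging Face pipeline API (using its default QA checkpoint, not necessarily what deepset recommends):

```python
from transformers import pipeline

# Plain extractive QA: the answer is a span lifted from the context.
qa = pipeline("question-answering")

result = qa(
    question="What does the blog post cover?",
    context=(
        "The post shows how to go beyond extractive question answering by "
        "adding classification, summarization, and generative QA to a system."
    ),
)
print(result["answer"], result["score"])
```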
Papers to Read 📚
Mistakes Made in AWS
Learning from failure is often more informative than learning from success.
New Models for Sentence Transformers
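Loading one of the new checkpoints is a one-liner; in this sketch "all-MiniLM-L6-v2" is assumed to be among the newly released models, so substitute whichever checkpoint you want to try.

```python
from sentence_transformers import SentenceTransformer, util

# Load a pretrained checkpoint (name assumed; see the released model list).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Vintage vectors never go out of style.",
    "Old embeddings still look good.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentences.
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))
```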
Comparing Language Identification Libraries
A deep dive into the leading language detection libraries, comparing accuracy, language coverage, speed, and memory consumption.
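As a tiny taste of what such a comparison involves, here is a sketch that runs two commonly compared libraries, langid and langdetect, on the same snippet; both are assumed to be pip-installed, and neither is necessarily the post's winner.

```python
import langid                   # pip install langid
from langdetect import detect   # pip install langdetect

text = "Das ist ein kurzer deutscher Beispielsatz."

# langid returns a (language_code, score) tuple.
print("langid:", langid.classify(text))

# langdetect returns just the language code.
print("langdetect:", detect(text))
```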
AWESOME NOTEBOOKS
A very handy collection of notebooks for everyday data engineering tasks.
CodeT5 from Salesforce on Hugging Face Model Hub
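The checkpoints load with vanilla transformers classes; a minimal masked-span infilling sketch, mirroring the model card and assuming the Salesforce/codet5-base checkpoint:

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Ask the model to fill in the masked span <extra_id_0>.
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```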
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
Macaw | Multi-Angle C(Q)uestion Answering
A model capable of general question answering, showing robustness outside the domains it was trained on. It was trained in “multi-angle” fashion, meaning it can handle a flexible set of input and output “slots” (such as question, answer, and explanation). Built on top of T5.
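A hedged sketch loading the allenai/macaw-large checkpoint through transformers; the slot syntax in the prompt is taken from the repo's README, so treat the exact format as an assumption.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("allenai/macaw-large")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/macaw-large")

# Ask for the "answer" slot given a "question" slot (multi-angle format).
prompt = "$answer$ ; $question$ = What is the color of a cloudy sky?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(input_ids, max_length=200)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```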
Generating Out-of-scope Labels with Data augmentation (GOLD)
A technique that augments existing data to train better out-of-scope detectors operating in low-data regimes. GOLD generates pseudo-labeled candidates using samples from an auxiliary dataset and keeps only the most beneficial candidates for training through a novel filtering mechanism.
STaCK: Sentence Ordering with Temporal Commonsense Knowledge
A framework based on graph neural networks and temporal commonsense knowledge to model global information and predict the relative order of sentences.
The Emory Language and Information Toolkit (ELIT)
The Emory Language and Information Toolkit (ELIT) provides state-of-the-art NLP models for the following tasks:
- Tokenization
- Part-of-Speech Tagging
- Named Entity Recognition
- Constituency Parsing
- Dependency Parsing
- Semantic Role Labeling
- AMR Parsing
- Coreference Resolution
- Emotion Detection
Finetuned Language Models are Zero-Shot Learners
A method for improving the zero-shot learning abilities of language models via instruction tuning.
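To make "instruction tuning" concrete: ordinary supervised examples are rephrased as natural language instructions before fine-tuning, so the model learns to follow task descriptions rather than task-specific heads. The template below is a hypothetical illustration, not the paper's exact wording.

```python
# Hypothetical instruction template for a sentiment classification example.
TEMPLATE = (
    "Is the sentiment of the following movie review positive or negative?\n"
    "Review: {text}\n"
    "Answer:"
)

def to_instruction_example(text: str, label: str) -> dict:
    """Turn a plain (text, label) pair into an instruction-tuning example."""
    return {"input": TEMPLATE.format(text=text), "target": label}

example = to_instruction_example("A heartfelt and funny film.", "positive")
print(example["input"])
print(example["target"])
```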
xGQA Dataset
Extending the English GQA dataset to 7 typologically diverse languages for cross-lingual visual question answering.
AliceMind: ALIbaba’s Collection of Encoder-Decoders from MinD Lab
The repo contains the AliceMind model family:
- Language understanding model: StructBERT (ICLR 2020)
- Generative language model: PALM (EMNLP 2020)
- Cross-lingual language model: VECO (ACL 2021)
- Cross-modal language model: StructVBERT (CVPR 2020 VQA Challenge Runner-up)
- Structural language model: StructuralLM (ACL 2021)
- Chinese language understanding model with multi-granularity inputs: LatticeBERT (NAACL 2021)
SEW (Squeezed and Efficient Wav2vec)
A repo focused on the wav2vec 2.0 model, formalizing several architecture designs that influence both model performance and efficiency.
BioLAMA Benchmark
The BioLAMA benchmark comprises 49K biomedical factual knowledge triples for probing biomedical language models.
Zero-Shot Dialogue State Tracking via Cross-Task Transfer
TransferQA is a transferable generative QA model that seamlessly combines extractive QA and multiple-choice QA in a text-to-text transformer framework, and tracks both categorical and non-categorical slots in dialogue state tracking.
BenchIE: Benchmark for Open Information Extraction
BenchIE is a benchmark for measuring the performance of Open Information Extraction (OIE) systems. Given manual annotations and a set of OIE extractions from different OIE systems, BenchIE measures precision, recall, and F1 score using a fact-based evaluation approach.
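As a generic illustration (not BenchIE's actual fact-matching logic), precision, recall, and F1 fall out of counting which extracted facts match a gold fact:

```python
def precision_recall_f1(extracted: set, gold: set) -> tuple:
    """Score extractions against gold facts (exact-match illustration only)."""
    matched = extracted & gold
    precision = len(matched) / len(extracted) if extracted else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1

gold = {("Michael", "born in", "Sarajevo"), ("Michael", "lives in", "Zagreb")}
extracted = {("Michael", "born in", "Sarajevo"), ("Michael", "works in", "Zagreb")}
print(precision_recall_f1(extracted, gold))  # (0.5, 0.5, 0.5)
```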
Box Embeddings
Open-source library for Box Embeddings and Box Representations, built on PyTorch & TensorFlow.
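A conceptual PyTorch sketch of the core idea, treating boxes as axis-aligned hyperrectangles whose intersection volume models overlap; this does not use the library's own API, whose classes and parameterizations differ.

```python
import torch

def box_volume(lower: torch.Tensor, upper: torch.Tensor) -> torch.Tensor:
    """Volume of an axis-aligned box given per-dimension lower/upper corners."""
    return torch.clamp(upper - lower, min=0.0).prod(dim=-1)

def intersection(lo_a, hi_a, lo_b, hi_b):
    """The intersection of two boxes is again a box (possibly empty)."""
    return torch.maximum(lo_a, lo_b), torch.minimum(hi_a, hi_b)

# Two 2-D boxes: A = [0, 2] x [0, 2], B = [1, 3] x [1, 3]
lo_a, hi_a = torch.tensor([0.0, 0.0]), torch.tensor([2.0, 2.0])
lo_b, hi_b = torch.tensor([1.0, 1.0]), torch.tensor([3.0, 3.0])

lo_i, hi_i = intersection(lo_a, hi_a, lo_b, hi_b)
# Overlap of B given A ~ Vol(A ∩ B) / Vol(A) = 1 / 4
print(box_volume(lo_i, hi_i) / box_volume(lo_a, hi_a))
```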
Art Description Generation for Paintings
A repo with a model for generating descriptions of fine-art paintings.