NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 09.19.21
Vintage Vectors
Welcome back! We have a long newsletter this week, as many new NLP repos were published while tech nerds returned from their summer vacation. 😁
I'll be adding close to 150 new NLP repos to the NLP Index, so stay tuned: the update drops this week.
Welcome to the Matrix
Six Degrees of Wikipedia
just explore…
EmbeddingHub
Embeddinghub is a database built for machine learning embeddings, designed with four goals in mind (a quick usage sketch follows the list):
- Store embeddings durably and with high availability
- Allow for approximate nearest neighbor operations
- Enable other operations like partitioning, sub-indices, and averaging
- Manage versioning, access control, and rollbacks painlessly
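Here is a minimal sketch of storing and querying vectors, loosely based on the project's README; the client calls shown here (connect, create_space, multiset, nearest_neighbors) are assumptions and may differ between versions.

```python
import embeddinghub as eh

# Connect to a locally running Embeddinghub instance (assumed default config).
hub = eh.connect(eh.Config())

# Create a named space that holds 3-dimensional vectors.
space = hub.create_space("quizzes", dims=3)

# Store a batch of embeddings keyed by id.
space.multiset({
    "system design": [1.0, 2.0, 3.0],
    "robustness": [1.0, 1.0, 1.0],
})

# Approximate nearest-neighbor lookup for an existing key.
neighbors = space.nearest_neighbors(key="system design", num=2)
print(neighbors)
```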
Rubrix | Open Sourced NLP Data Explorer/Annotator
This library is compatible with the usual suspects in NLP: Hugging Face Transformers, spaCy, Stanford Stanza, Flair, etc.
Rubrix can (a minimal logging sketch follows the list):
- Monitor the predictions of deployed models.
- Collect ground-truth data for starting up a project or evolving an existing one.
- Iterate on ground-truth data and predictions to debug, track and improve your models over time.
- Build custom applications and dashboards on top of your model predictions and ground-truth data.
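To illustrate the monitoring use case, here is a minimal sketch that logs a text classification prediction to a locally running Rubrix instance; the model name and scores are made up, and the record fields reflect the 2021-era API, which may have changed since.

```python
import rubrix as rb

# A single prediction from a (hypothetical) sentiment model.
record = rb.TextClassificationRecord(
    inputs={"text": "The new NLP repos this week look promising"},
    prediction=[("positive", 0.92), ("negative", 0.08)],
    prediction_agent="my-sentiment-model-v1",  # hypothetical model id
)

# Log it to a dataset on the Rubrix server (assumed running at the default URL).
rb.log(records=record, name="sentiment-monitoring")
```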
AI100 Survey
After five years, the AI100 study (the One Hundred Year Study on Artificial Intelligence) is back with a new report.
Beyond “Vanilla” Question Answering
A deepset blog post on how to enhance a QA system by adding capabilities such as classification, summarization, and generative QA.
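For reference, the "vanilla" baseline the post builds on is plain extractive QA, where the answer is a span copied from the context. A minimal sketch with the Hugging Face pipeline API (using its default QA checkpoint, not necessarily what deepset recommends):

```python
from transformers import pipeline

# Plain extractive QA: the answer is a span lifted from the context.
qa = pipeline("question-answering")

result = qa(
    question="What does the blog post cover?",
    context=(
        "The post shows how to go beyond extractive question answering by "
        "adding classification, summarization, and generative QA to a system."
    ),
)
print(result["answer"], result["score"])
```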
Papers to Read 📚
Mistakes Made in AWS
Learning from failure is often more informative than learning from success.
New Models for Sentence Transformers
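Loading one of the new checkpoints is a one-liner; in this sketch "all-MiniLM-L6-v2" is assumed to be among the newly released models, so substitute whichever checkpoint you want to try.

```python
from sentence_transformers import SentenceTransformer, util

# Load a pretrained checkpoint (name assumed; see the released model list).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Vintage vectors never go out of style.",
    "Old embeddings still look good.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentences.
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))
```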
Comparing Language Identification Libraries
A deep dive into the leading language detection libraries, comparing accuracy, language coverage, speed, and memory consumption.
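As a tiny taste of what such a comparison involves, here is a sketch that runs two commonly compared libraries, langid and langdetect, on the same snippet; both are assumed to be pip-installed, and neither is necessarily the post's winner.

```python
import langid                   # pip install langid
from langdetect import detect   # pip install langdetect

text = "Das ist ein kurzer deutscher Beispielsatz."

# langid returns a (language_code, score) tuple.
print("langid:", langid.classify(text))

# langdetect returns just the language code.
print("langdetect:", detect(text))
```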
AWESOME NOTEBOOKS
A very handy collection of notebooks for everyday data engineering tasks.
CodeT5 from Salesforce on Hugging Face Model Hub
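The checkpoints load with vanilla transformers classes; a minimal masked-span infilling sketch, mirroring the model card and assuming the Salesforce/codet5-base checkpoint:

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Ask the model to fill in the masked span <extra_id_0>.
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```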
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
Macaw | Multi-Angle C(Q)uestion Answering
A model capable of general question answering, showing robustness outside the domains it was trained on. It was trained in “multi-angle” fashion, meaning it can handle a flexible set of input and output “slots” (such as question, answer, and explanation). Built on top of T5.
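A hedged sketch loading the allenai/macaw-large checkpoint through transformers; the slot syntax in the prompt is taken from the repo's README, so treat the exact format as an assumption.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("allenai/macaw-large")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/macaw-large")

# Ask for the "answer" slot given a "question" slot (multi-angle format).
prompt = "$answer$ ; $question$ = What is the color of a cloudy sky?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(input_ids, max_length=200)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```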
Generating Out-of-scope Labels with Data augmentation (GOLD)
A technique that augments existing data to train better out-of-scope detectors operating in low-data regimes. GOLD generates pseudo-labeled candidates using samples from an auxiliary dataset and keeps only the most beneficial candidates for training through a novel filtering mechanism.
STaCK: Sentence Ordering with Temporal Commonsense Knowledge
A framework based on graph neural networks and temporal commonsense knowledge to model global information and predict the relative order of sentences.
The Emory Language and Information Toolkit (ELIT)
The Emory Language and Information Toolkit (ELIT) provides state-of-the-art NLP models for the following tasks:
- Tokenization
- Part-of-Speech Tagging
- Named Entity Recognition
- Constituency Parsing
- Dependency Parsing
- Semantic Role Labeling
- AMR Parsing
- Coreference Resolution
- Emotion Detection
Finetuned Language Models are Zero-Shot Learners
A method for improving the zero-shot learning abilities of language models via instruction tuning.
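To make "instruction tuning" concrete: ordinary supervised examples are rephrased as natural language instructions before fine-tuning, so the model learns to follow task descriptions rather than task-specific heads. The template below is a hypothetical illustration, not the paper's exact wording.

```python
# Hypothetical instruction template for a sentiment classification example.
TEMPLATE = (
    "Is the sentiment of the following movie review positive or negative?\n"
    "Review: {text}\n"
    "Answer:"
)

def to_instruction_example(text: str, label: str) -> dict:
    """Turn a plain (text, label) pair into an instruction-tuning example."""
    return {"input": TEMPLATE.format(text=text), "target": label}

example = to_instruction_example("A heartfelt and funny film.", "positive")
print(example["input"])
print(example["target"])
```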
xGQA Dataset
Extending the English GQA dataset to 7 typologically diverse languages for cross-lingual visual question answering.
AliceMind: ALIbaba’s Collection of Encoder-Decoders from MinD Lab
The repo contains the AliceMind model family:
- Language understanding model: StructBERT (ICLR 2020)
- Generative language model: PALM (EMNLP 2020)
- Cross-lingual language model: VECO (ACL 2021)
- Cross-modal language model: StructVBERT (CVPR 2020 VQA Challenge Runner-up)
- Structural language model: StructuralLM (ACL 2021)
- Chinese language understanding model with multi-granularity inputs: LatticeBERT (NAACL 2021)
SEW (Squeezed and Efficient Wav2vec)
A repo focused on the wav2vec 2.0 model, formalizing several architecture designs that influence both model performance and efficiency.
BioLAMA Benchmark
The BioLAMA benchmark comprises 49K biomedical factual knowledge triples for probing biomedical language models.
Zero-Shot Dialogue State Tracking via Cross-Task Transfer
TransferQA is a transferable generative QA model that seamlessly combines extractive QA and multiple-choice QA in a text-to-text transformer framework, and tracks both categorical and non-categorical slots in dialogue state tracking.
BenchIE: Benchmark for Open Information Extraction
BenchIE is a benchmark for measuring the performance of Open Information Extraction (OIE) systems. Given manual annotations and a set of OIE extractions from different OIE systems, BenchIE measures precision, recall, and F1 score using a fact-based evaluation approach.
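As a generic illustration (not BenchIE's actual fact-matching logic), precision, recall, and F1 fall out of counting which extracted facts match a gold fact:

```python
def precision_recall_f1(extracted: set, gold: set) -> tuple:
    """Score extractions against gold facts (exact-match illustration only)."""
    matched = extracted & gold
    precision = len(matched) / len(extracted) if extracted else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1

gold = {("Michael", "born in", "Sarajevo"), ("Michael", "lives in", "Zagreb")}
extracted = {("Michael", "born in", "Sarajevo"), ("Michael", "works in", "Zagreb")}
print(precision_recall_f1(extracted, gold))  # (0.5, 0.5, 0.5)
```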
Box Embeddings
Open-source library for Box Embeddings and Box Representations, built on PyTorch & TensorFlow.
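A conceptual PyTorch sketch of the core idea, treating boxes as axis-aligned hyperrectangles whose intersection volume models overlap; this does not use the library's own API, whose classes and parameterizations differ.

```python
import torch

def box_volume(lower: torch.Tensor, upper: torch.Tensor) -> torch.Tensor:
    """Volume of an axis-aligned box given per-dimension lower/upper corners."""
    return torch.clamp(upper - lower, min=0.0).prod(dim=-1)

def intersection(lo_a, hi_a, lo_b, hi_b):
    """The intersection of two boxes is again a box (possibly empty)."""
    return torch.maximum(lo_a, lo_b), torch.minimum(hi_a, hi_b)

# Two 2-D boxes: A = [0, 2] x [0, 2], B = [1, 3] x [1, 3]
lo_a, hi_a = torch.tensor([0.0, 0.0]), torch.tensor([2.0, 2.0])
lo_b, hi_b = torch.tensor([1.0, 1.0]), torch.tensor([3.0, 3.0])

lo_i, hi_i = intersection(lo_a, hi_a, lo_b, hi_b)
# Overlap of B given A ~ Vol(A ∩ B) / Vol(A) = 1 / 4
print(box_volume(lo_i, hi_i) / box_volume(lo_a, hi_a))
```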
Art Description Generation for Paintings
A repo with a model for generating descriptions of fine-art paintings.