NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 01.23.22
Desiderata
🕵️‍♂️ Has AI interest peaked?
https://trends.google.com/trends/explore?date=all&q=deep%20learning,Artificial%20Intelligence
If you’re bummed, you can always… 👇
Graph ML in 2022: Where Are We Now?
The State of Web-Scraping 2022
DARPA and OSS 🕵️‍♀️
Press Release: https://www.darpa.mil/news-events/2021-12-21
The DARPA GARD program seeks to establish theoretical ML system foundations to identify system vulnerabilities, characterize properties that will enhance system robustness, and encourage the creation of effective defenses. Currently, ML defenses tend to be highly specific and are effective only against particular attacks. GARD seeks to develop defenses capable of defending against broad categories of attacks. Furthermore, current evaluation paradigms of AI robustness often focus on simplistic measures that may not be relevant to security. To verify relevance to security and wide applicability, defenses generated under GARD will be measured in a novel testbed employing scenario-based evaluations.
Repos mentioned in the press release:
State of Machine Learning in Julia
For Those Interested in Semantic Similarity
Free CS Classes
Google Style Guide for Python
From the Creator of FastAPI 👉 Asyncer
“The main goal of Asyncer is to improve developer experience by providing better support for autocompletion and inline errors in the editor, and more certainty that the code is bug-free by providing better support for type checking tools like mypy.”
Real-Time Machine Learning
Handling Large Messages with Kafka
Sentence Segmentation
Kaggle Solutions Repo
Happy Transformer
OSLO: Extending the Training Capability for Transformers
SeaTunnel
Problems it attempts to solve:
- Data loss and duplication
- Task accumulation and delay
- Low throughput
- Long cycle to be applied in the production environment
- Lack of application running status monitoring
Cresset — A PyTorch Universal Docker Template
Papers to Read📚
From the Lex Fridman podcast featuring Yann LeCun as guest:
It’s cued up to the moment Yann mentions the paper above.
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
COPA-SSE
COPA-SSE contains crowdsourced explanations for the Balanced COPA dataset, a variant of the Choice of Plausible Alternatives (COPA) benchmark. The explanations are formatted as a set of triple-like common sense statements with ConceptNet relations but freely written concepts.
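One way to picture a "triple-like statement with ConceptNet relations but freely written concepts" is the sketch below. The `Statement` class, its field names, and the example explanation are invented for illustration and are not taken from the dataset or its file format. The relation set is a small subset of ConceptNet's relation labels.

```python
from dataclasses import dataclass

# Illustrative subset of ConceptNet relation labels (not the full list).
CONCEPTNET_RELATIONS = {"Causes", "HasPrerequisite", "MotivatedByGoal", "UsedFor", "IsA"}


@dataclass(frozen=True)
class Statement:
    """Hypothetical container for one triple-like statement."""

    head: str      # freely written concept
    relation: str  # restricted to a ConceptNet relation
    tail: str      # freely written concept

    def __post_init__(self) -> None:
        # Only the relation is constrained; head and tail are free text.
        if self.relation not in CONCEPTNET_RELATIONS:
            raise ValueError(f"unknown relation: {self.relation}")

    def as_text(self) -> str:
        return f"{self.head} -[{self.relation}]-> {self.tail}"


# An invented explanation, modeled as a set of statements.
explanation = [Statement("dropping a hammer on a foot", "Causes", "a broken toe")]
rendered = [s.as_text() for s in explanation]
```

The point of the shape: the relation slot is a closed vocabulary, while the concept slots accept whatever the crowdworker wrote.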
SQUIRE: A Sequence-to-sequence Framework for Multi-hop Knowledge Graph Reasoning
SQUIRE is presented as the first sequence-to-sequence multi-hop reasoning framework. It uses an encoder-decoder structure to translate a triple query into a multi-hop path.
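For a rough sense of what "translating a query into a path" means as data, the toy sketch below serializes a query and a path as token sequences. The serialization, the function names, and the example entities and relations are all assumptions made for illustration; no model is involved, and the repo's actual format may differ.

```python
from typing import List, Tuple

Hop = Tuple[str, str]  # (relation followed, entity reached)


def encode_query(head: str, relation: str) -> List[str]:
    """Source sequence: the known parts of the query (head, relation, ?)."""
    return [head, relation]


def encode_path(hops: List[Hop]) -> List[str]:
    """Target sequence: relations and entities interleaved, hop by hop."""
    tokens: List[str] = []
    for relation, entity in hops:
        tokens.extend([relation, entity])
    return tokens


def answer_from_path(tokens: List[str]) -> str:
    """The last entity on the path is read off as the predicted answer."""
    return tokens[-1]


# Invented example: (Oslo, located_on_continent, ?) answered in two hops.
source = encode_query("Oslo", "located_on_continent")
target = encode_path([("capital_of", "Norway"), ("part_of", "Europe")])
answer = answer_from_path(target)
```

In a sequence-to-sequence setup, an encoder would read something like `source` and a decoder would generate something like `target` one token at a time.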
CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus
CVSS is a massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel speech-to-speech translation pairs from 21 languages into English.
Datasheet for the Pile
This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.
UnifiedSKG📚: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models
The UnifiedSKG framework unifies 21 SKG tasks into a text-to-text format, aiming to promote systematic SKG research rather than work that is exclusive to a single task, domain, or dataset. The authors show that large language models like T5, with simple modifications when necessary, achieve state-of-the-art performance on nearly all 21 tasks.
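A minimal sketch of what "text-to-text" can look like for structured input: flatten the table into a string, prepend the user request, and pair it with a text target. The linearization format, the separator, and the function names here are assumptions for illustration and may not match what the UnifiedSKG repo actually does.

```python
from typing import List, Sequence, Tuple


def linearize_table(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Flatten a table into one string (hypothetical format)."""
    parts: List[str] = ["col : " + " | ".join(header)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"row {i} : " + " | ".join(str(cell) for cell in row))
    return " ".join(parts)


def to_text_to_text(
    request: str,
    header: Sequence[str],
    rows: Sequence[Sequence[object]],
    answer: str,
) -> Tuple[str, str]:
    """Build a (source, target) string pair for a seq2seq model."""
    source = f"{request} ; structured knowledge: {linearize_table(header, rows)}"
    return source, answer


# Invented toy example.
source, target = to_text_to_text(
    "which city is listed?",
    ["city", "population"],
    [["Oslo", 700000]],
    "Oslo",
)
```

Once every task is reduced to string pairs like this, the same model and training loop can be reused across tables, databases, and knowledge graphs.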
TweebankNLP
Tweebank-NER is an NER corpus of tweets based on the Tweebank V2 (TB2) dataset.