
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 01.23.22

Desiderata

Ricky Costa

6 min read · Jan 23, 2022

🕵️‍♂️Has AI interest peaked?

https://trends.google.com/trends/explore?date=all&q=deep%20learning,Artificial%20Intelligence

If you’re bummed, you can always… 👇

Make Frontend Shit Again

makefrontendshitagain.party

Graph ML in 2022: Where Are We Now?

Hot trends and major advancements

towardsdatascience.com

The State of Web-Scraping 2022

The State of Web Scraping 2022 | ScrapeOps

With 2021 having come to an end, now is the time to look back at the big events & trends in the world of web scraping…

scrapeops.io

DARPA and OSS 🕵️‍♀️

Press Release: https://www.darpa.mil/news-events/2021-12-21

The DARPA GARD program seeks to establish theoretical ML system foundations to identify system vulnerabilities, characterize properties that will enhance system robustness, and encourage the creation of effective defenses. Currently, ML defenses tend to be highly specific and are effective only against particular attacks. GARD seeks to develop defenses capable of defending against broad categories of attacks. Furthermore, current evaluation paradigms of AI robustness often focus on simplistic measures that may not be relevant to security. To verify relevance to security and wide applicability, defenses generated under GARD will be measured in a novel testbed employing scenario-based evaluations.

Repos mentioned in the press release:

GitHub - twosixlabs/armory: ARMORY Adversarial Robustness Evaluation Test Bed

ARMORY is a test bed for running scalable evaluations of adversarial defenses. Configuration files are used to launch…

github.com

GitHub - Trusted-AI/adversarial-robustness-toolbox: Adversarial Robustness Toolbox (ART) - Python…

Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction…

github.com

State of Machine Learning in Julia

State of machine learning in Julia

After the Twitter space Q&A @logankilpatrick hosted yesterday on "The future of machine learning and why it looks a lot…

discourse.julialang.org

For Those Interested in Semantic Similarity

Euclidean vs. Cosine Distance

When to use the cosine similarity? Let's compare two different measures of distance in a vector space, and why either…

cmry.github.io
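To make the comparison concrete, here is a minimal sketch in plain Python (no NumPy). The three vectors are made up for illustration and are not taken from the linked post. The point: two vectors pointing the same way but with very different lengths are far apart in Euclidean terms, yet have (near) zero cosine distance.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector magnitude."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical term-count vectors for three documents.
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction as short_doc, 10x the magnitude
other_doc = [2.0, 1.0, 0.0]    # similar magnitude, different direction

print("euclidean short vs long :", euclidean_distance(short_doc, long_doc))
print("euclidean short vs other:", euclidean_distance(short_doc, other_doc))
print("cosine    short vs long :", cosine_distance(short_doc, long_doc))
print("cosine    short vs other:", cosine_distance(short_doc, other_doc))
```

Euclidean distance ranks `other_doc` as the closer neighbor of `short_doc`, while cosine distance ranks `long_doc` as closer. Which one you want depends on whether magnitude (e.g. document length) should count.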

Free CS Classes

Google Style Guide for Python

styleguide

Python is the main dynamic language used at Google. This style guide is a list of dos and don'ts for Python programs…

google.github.io

From the Creator of FastAPI 👉 Asyncer

“The main goal of Asyncer is to improve developer experience by providing better support for autocompletion and inline errors in the editor, and more certainty that the code is bug-free by providing better support for type checking tools like mypy.”

Asyncer

Asyncer, async and await, focused on developer experience. Documentation: https://asyncer.tiangolo.com Source Code…

asyncer.tiangolo.com

Real-Time Machine Learning

Real-time machine learning: challenges and solutions

In the last year, I've talked to ~30 companies in different industries about their challenges with real-time machine…

huyenchip.com

Handling Large Messages with Kafka

Handling Large Messages with Apache Kafka (CSV, XML, Image, Video, Audio, Files) - Kai Waehner

Kafka was not built for large messages. Period. Nevertheless, more and more projects send and process 1Mb, 10Mb, and…

www.kai-waehner.de

Sentence Segmentation

GitHub - nipunsadvilkar/pySBD: 🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based…

pySBD - python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detection module that works…

github.com
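For a feel of what "rule-based" means here, below is a toy sketch. It is not pySBD's implementation or API; the abbreviation list and the `split_sentences` helper are hypothetical and far simpler than a real rule set. The idea is to treat `.`, `!` or `?` followed by whitespace as a candidate boundary, then veto the candidate when the preceding token is a known abbreviation.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (lowercase, with trailing dot).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "vs.", "etc."}


def split_sentences(text):
    """Split text into sentences using one boundary rule and one veto rule."""
    sentences = []
    start = 0
    # Candidate boundary: terminal punctuation followed by whitespace.
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1].lower()
        # Veto: the period belongs to an abbreviation, not a sentence end.
        if last_token in ABBREVIATIONS:
            continue
        sentences.append(candidate.strip())
        start = match.end()
    # Whatever is left after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Dr. Smith met Mr. Jones. They talked, e.g. about NLP. Was it useful? Yes!"
for sentence in split_sentences(sample):
    print(sentence)
```

A naive split on every period would break after "Dr." and "Mr."; the veto rule keeps those inside the first sentence. Real-world text needs many more rules (numbers, initials, quotes, ellipses), which is the gap libraries like pySBD aim to fill.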

GitHub - SHI-Labs/Rethinking-Text-Segmentation: [CVPR 2021] Rethinking Text Segmentation: A Novel…

This is the repo to host the dataset TextSeg and code for TexRNet from the following paper: Xingqian Xu, Zhifei Zhang…

github.com

Kaggle Solutions Repo

GitHub - faridrashidi/kaggle-solutions: 🏅 Collection of Kaggle Solutions and Ideas 🏅

This repo consists of almost all available solutions and ideas shared by top performers in the past Kaggle…

github.com

Happy Transformer

GitHub - EricFillion/happy-transformer: A package built on top of Hugging Face's transformers…

A package built on top of Hugging Face's transformers library that makes it easy to utilize state-of-the-art NLP models…

github.com

OSLO: Extending the Training Capability for Transformers

GitHub - tunib-ai/oslo: OSLO: Open Source framework for Large-scale transformer Optimization

OSLO is a framework that provides various GPU based optimization technologies for large-scale modeling. 3D Parallelism…

github.com

SeaTunnel

Problems it attempts to solve:

Data loss and duplication
Task accumulation and delay
Low throughput
Long cycle to be applied in the production environment
Lack of application running status monitoring

GitHub - apache/incubator-seatunnel: SeaTunnel is a distributed, high-performance data integration…

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of…

github.com

Cresset — A PyTorch Universal Docker Template

GitHub - veritas9872/PyTorch-Universal-Docker-Template: Template repository to build PyTorch…

Translations: 한국어 Notice The project will soon be renamed Cresset. Please be aware that the project URL will also…

github.com

Papers to Read📚

LaMDA: Language Models for Dialog Applications

We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language…

arxiv.org

Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning

Pretrained language models have achieved state-of-the-art performance when adapted to a downstream NLP task. However…

arxiv.org

https://arxiv.org/pdf/2110.03742.pdf

From the Lex Fridman podcast featuring Yann LeCun as guest:

It’s cued up to the moment Yann mentions the paper above.

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

COPA-SSE

COPA-SSE contains crowdsourced explanations for the Balanced COPA dataset, a variant of the Choice of Plausible Alternatives (COPA) benchmark. The explanations are formatted as a set of triple-like common sense statements with ConceptNet relations but freely written concepts.

GitHub - a-brassard/copa-sse: Repository for Semi-Structured Explanations for COPA (COPA-SSE)

Repository for COPA-SSE: Semi-Structured Explanations for Commonsense Reasoning . COPA-SSE contains crowdsourced…

github.com

Connected Papers 📈

SQUIRE: A Sequence-to-sequence Framework for Multi-hop Knowledge Graph Reasoning

SQUIRE is presented as the first sequence-to-sequence based multi-hop reasoning framework. It uses an encoder-decoder structure to translate the triple query into a multi-hop path.

GitHub - bys0318/SQUIRE

This is the official codebase of the SQUIRE framework for multi-hop reasoning, proposed in SQUIRE: A…

github.com

Connected Papers 📈

CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

CVSS is a massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel speech-to-speech translation pairs from 21 languages into English.

GitHub - google-research-datasets/cvss: CVSS: A Massively Multilingual Speech-to-Speech Translation…

CVSS is a massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel…

github.com

Connected Papers 📈

Datasheet for the Pile

This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.

GitHub - EleutherAI/the-pile

The Pile is a large, diverse, open source language modelling data set that consists of many smaller datasets combined…

github.com

Connected Papers 📈

UnifiedSKG📚: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models

The UnifiedSKG framework unifies 21 SKG tasks into the text-to-text format, aiming to promote systematic SKG research instead of being exclusive to a single task, domain, or dataset. It shows that large language models like T5, with simple modifications when necessary, achieve state-of-the-art performance on nearly all 21 tasks.
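As a rough illustration of what "text-to-text format" can mean for structured data, here is a minimal sketch that flattens a small table plus a question into a single input string a seq2seq model could consume. The separators, the `linearize_table` and `build_input` helpers, and the sample table are all assumptions made up for this example, not the exact serialization used by UnifiedSKG.

```python
def linearize_table(header, rows):
    """Flatten a table into one string: header first, then numbered rows."""
    parts = ["col : " + " | ".join(header)]
    for index, row in enumerate(rows, start=1):
        cells = " | ".join(str(cell) for cell in row)
        parts.append(f"row {index} : {cells}")
    return " ".join(parts)


def build_input(question, header, rows):
    """Join the user request and the linearized table into one model input."""
    return f"{question} ; structured knowledge: {linearize_table(header, rows)}"


# Hypothetical table and question.
header = ["city", "population"]
rows = [["Oslo", 700000], ["Bergen", 285000]]

print(build_input("Which city is larger?", header, rows))
```

Once every task is expressed as "string in, string out" like this, the same model and training loop can be reused across tasks that would otherwise each need a custom architecture.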