NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 02.06.21

NeoX

Ricky Costa

Feb 6, 2022

A Vendetta and 404s

North Korea Hacked Him. So He Took Down Its Internet

Disappointed with the lack of US response to the Hermit Kingdom's attacks against US security researchers, one hacker…

www.wired.com

Meanwhile, everything dropped this week…

DeepMind’s AlphaCode

AlphaCode

Solutions were selected randomly, keeping at most one correct (passes all test cases in our dataset) and one incorrect…

alphacode.deepmind.com

Meanwhile back at the ranch…

Math Olympiad solver from OpenAI:

Blog:

Solving (Some) Formal Math Olympiad Problems

We built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems…

openai.com

Meanwhile back at the ranch… again…

The new 20B-parameter GPT-NeoX dropped:

Announcing GPT-NeoX-20B | Hacker News

That said, besides being overall "dumber" than 175B GPT-3, the 6B model was missing a critical feature: prompting. 175B…

news.ycombinator.com

New Transformer Book Repo with Colab Notebooks!

GitHub - nlp-with-transformers/notebooks: Jupyter notebooks for the Natural Language Processing…

This repository contains the example code from our O'Reilly book Natural Language Processing with Transformers: You can…

github.com

Parsr: A PDF Parser that doesn’t suck 😎

GitHub - axa-group/Parsr: Transforms PDF, Documents and Images into Enriched Structured Data

Parsr is a minimal-footprint document (image, pdf, docx, eml) cleaning, parsing and extraction toolchain which…

github.com

SBERT Author Shreds the New GPT-3 Embeddings Offering 🥶🥶

“The biggest downside for the OpenAI embeddings endpoint is the high costs (about 8,000–600,000 times more expensive than open models on your infrastructure), the high dimensionality of up to 12288 dimensions (making downstream applications slow), and the extreme latency when computing embeddings. This hinders the actual usage of the embeddings for any search applications.”

OpenAI GPT-3 Text Embeddings — Really a new state-of-the-art in dense text embeddings?

This week, OpenAI announced an embeddings endpoint (paper) for GPT-3 that allows users to derive dense text embeddings…

medium.com
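To make the dimensionality complaint concrete, here is a rough sketch of how raw index size grows with vector width. Only the 12288 figure comes from the quote above; the float32 storage, the one-million-passage corpus, and the 768-dimension comparison model are assumptions for illustration.

```python
# Back-of-the-envelope storage cost of a dense embedding index.
# Assumption (not from the article): vectors are stored as float32,
# i.e. 4 bytes per value, with no compression or index overhead.
BYTES_PER_FLOAT32 = 4


def index_size_gb(num_vectors: int, dims: int) -> float:
    """Return the raw size in gigabytes (10**9 bytes) of num_vectors embeddings."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1e9


if __name__ == "__main__":
    corpus_size = 1_000_000  # hypothetical corpus of one million passages
    models = [
        ("12288-dim (largest size named in the quote)", 12288),
        ("768-dim (assumed size for a typical open model)", 768),
    ]
    for name, dims in models:
        print(f"{name}: {index_size_gb(corpus_size, dims):.2f} GB")
```

Under these assumptions the widest vectors need 16 times the storage of the 768-dimension ones, and a brute-force dot-product search does 16 times the arithmetic per comparison.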

🥶 Oops: Exposed databases on AWS

FYI: I wrote about this issue over a year ago and even provided a search engine. It seems more peeps are on top of it now.

How I Discovered Thousands of Open Databases on AWS

My journey on finding and reporting databases with sensitive data about Fortune-500 companies, Hospitals, Crypto…

infosecwriteups.com

Scan the entire internet in under 5 minutes:

GitHub - robertdavidgraham/masscan: TCP port scanner, spews SYN packets asynchronously, scanning…

This is an Internet-scale port scanner. It can scan the entire Internet in under 5 minutes, transmitting 10 million…

github.com

or…

Recovering redacted information from pixelated videos | Positive Security

Information that has been redacted is often the most interesting. It's therefore no wonder that some people might have…

positive.security

How I got an FBI record at age 11 from dabbling in cryptography then got into more trouble 😭😭

I Spy

Les Earnest Growing up in , my first encounter with advanced technology was the gift of a one-speed fat tired bicycle…

web.stanford.edu

ViLT Notebook for Visual Question Answering

author: Niels Rogge @ Hugging Face

Colab:

Google Colaboratory


colab.research.google.com

Space:

Vilt Vqa - a Hugging Face Space by nielsr

Discover amazing ML apps made by the community

huggingface.co

How to Improve User Experience (and Behavior): Three Papers from Stanford’s Alexa Prize Team

How to Improve User Experience (and Behavior): Three Papers from Stanford's Alexa Prize Team

In 2019, Stanford entered the Alexa Prize Socialbot Grand Challenge 3 for the first time, with its bot Chirpy Cardinal…

ai.stanford.edu

For Practitioners: How GPUs Work | A Thread

Twitter conversation started by @marktenenholtz

Read the Twitter conversation started by @marktenenholtz and the replies to it on ThreadReaderApp

threadreaderapp.com

Data Engineering in Julia

Work with massive datasets to design data models and automate data pipelines using Julia

towardsdatascience.com

DeepChecks

A repo for validating models and data.

GitHub - deepchecks/deepchecks: Test Suites for Validating ML Models & Data. Deepchecks is a Python…

Test Suites for Validating ML Models & Data. Deepchecks is a Python package for comprehensively validating your machine…

github.com

Task-Specific Knowledge Distillation for BERT using Transformers & Amazon SageMaker

Task-specific knowledge distillation for BERT using Transformers & Amazon SageMaker

Welcome to this end-to-end task-specific knowledge distillation Text-Classification…

www.philschmid.de

Papers to Read 📚

https://arxiv.org/pdf/2201.05596.pdf

https://arxiv.org/pdf/2201.11990.pdf

https://arxiv.org/pdf/2202.01110.pdf

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

TableQuery: Querying tabular data with natural language

An AI tool for querying tabular data with natural language. The tabular data can be dataframes or CSV files.

GitHub - abhijithneilabraham/tableQA: AI Tool for querying natural language on tabular data.

AI Tool for querying natural language on tabular data. Built using QA models from transformers. This work is described…

github.com

Connected Papers 📈

POTATO: exPlainable infOrmation exTrAcTion framewOrk

POTATO is a human-in-the-loop XAI framework for extracting and evaluating interpretable graph features for any classification problem in NLP.

GitHub - adaamko/POTATO: XAI based human-in-the-loop framework for automatic rule-learning.

POTATO is a human-in-the-loop XAI framework for extracting and evaluating interpretable graph features for any…

github.com

Connected Papers 📈

PromptSource

PromptSource is a toolkit for creating, sharing and using natural language prompts. Work from the BigScience initiative.

GitHub - bigscience-workshop/promptsource: Toolkit for creating, sharing and using natural language…

Recent work has shown that large language models exhibit the ability to perform reasonable zero-shot generalization to…

github.com

Connected Papers 📈

Text Anonymization Benchmark (TAB)

The Text Anonymization Benchmark (TAB) is a new, open-source corpus for text anonymization. It comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) manually annotated with:

semantic categories for personal identifiers,
masking decisions (in regard to the re-identification risk for the person to protect),
confidential attributes,
co-reference relations.

GitHub - NorskRegnesentral/text-anonymisation-benchmark: Annotated corpus + evaluation metrics for…

The Text Anonymization Benchmark (TAB) is a new, open-source corpus for text anonymization. It comprises 1,268…

github.com

Connected Papers 📈
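As a rough illustration of how the four annotation layers listed above could sit together on one text span, here is a minimal sketch. The `Mention` class, its field names, the `mask` helper, and the sample sentence are all hypothetical and are not the corpus's actual schema or data; check the repo for the real format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    """One annotated span. All field names are hypothetical, not TAB's schema."""
    start: int                    # character offset where the span begins
    end: int                      # character offset just past the span
    category: str                 # semantic category of the personal identifier
    needs_masking: bool           # masking decision based on re-identification risk
    confidential: Optional[str]   # confidential attribute, if any
    entity_id: str                # co-reference: same id means same entity


def mask(text: str, mentions: List[Mention], placeholder: str = "***") -> str:
    """Replace every span marked for masking; assumes spans do not overlap."""
    pieces = []
    cursor = 0
    for m in sorted(mentions, key=lambda m: m.start):
        if not m.needs_masking:
            continue
        pieces.append(text[cursor:m.start])
        pieces.append(placeholder)
        cursor = m.end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Made-up sentence and annotations, for illustration only.
text = "Mr Smith was born in Oslo. Smith is a teacher."
first = text.index("Mr Smith")
city = text.index("Oslo")
second = text.index("Smith", city)
mentions = [
    Mention(first, first + len("Mr Smith"), "PERSON", True, None, "e1"),
    Mention(city, city + len("Oslo"), "LOC", False, None, "e2"),
    Mention(second, second + len("Smith"), "PERSON", True, None, "e1"),
]
print(mask(text, mentions))
```

The two mentions that share `entity_id` "e1" are masked together, which is the point of carrying co-reference relations next to the masking decision.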

FiNCAT: Financial Numeral Claim Analysis Tool

A tool to detect whether numerals present in Financial Texts are in-claim or out-of-claim.

GitHub - sohomghosh/FiNCAT_Financial_Numeral_Claim_Analysis_Tool: A tool to detect whether numerals…

A tool to detect whether numerals present in Financial Texts are in-claim or out-of-claim. Please refer to…

github.com

Connected Papers 📈

Quantum Stat

Written by Ricky Costa