NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 02.06.22
A Vendetta and 404s
North Korea Hacked Him. So He Took Down Its Internet
Disappointed with the lack of US response to the Hermit Kingdom's attacks against US security researchers, one hacker…
Meanwhile, everything dropped this week…
Solutions were selected randomly, keeping at most one correct (passes all test cases in our dataset) and one incorrect…
Meanwhile back at the ranch…
Math Olympiad solver from OpenAI:
Solving (Some) Formal Math Olympiad Problems
We built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems…
Meanwhile back at the ranch… again…
New GPT-NeoX 20B params dropped:
Announcing GPT-NeoX-20B | Hacker News
That said, besides being overall "dumber" than 175B GPT-3, the 6B model was missing a critical feature: prompting. 175B…
New Transformer Book Repo with Colab Notebooks!
GitHub - nlp-with-transformers/notebooks: Jupyter notebooks for the Natural Language Processing…
This repository contains the example code from our O'Reilly book Natural Language Processing with Transformers: You can…
Parsr: A PDF Parser that doesn’t suck 😎
GitHub - axa-group/Parsr: Transforms PDF, Documents and Images into Enriched Structured Data
Parsr is a minimal-footprint document (image, pdf, docx, eml) cleaning, parsing and extraction toolchain which…
SBERT Author Shreds the New GPT-3 Embeddings Offering 🥶🥶
“The biggest downside for the OpenAI embeddings endpoint is the high costs (about 8,000–600,000 times more expensive than open models on your infrastructure), the high dimensionality of up to 12288 dimensions (making downstream applications slow), and the extreme latency when computing embeddings. This hinders the actual usage of the embeddings for any search applications.”
OpenAI GPT-3 Text Embeddings — Really a new state-of-the-art in dense text embeddings?
This week, OpenAI announced an embeddings endpoint (paper) for GPT-3 that allows users to derive dense text embeddings…
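To put the dimensionality complaint in numbers, here's a back-of-the-envelope sketch of raw index size. The corpus size of one million documents and the 768-dimension comparison point are assumptions picked for illustration (768 is a common size for open sentence-embedding models); only the 12288 figure comes from the quote above. It counts float32 storage only and ignores any index overhead.

```python
# Rough memory footprint of a dense vector index stored as float32.
BYTES_PER_FLOAT32 = 4


def index_size_gb(num_vectors: int, dim: int) -> float:
    """Raw storage in gigabytes (10**9 bytes), ignoring index overhead."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 1e9


NUM_DOCS = 1_000_000   # hypothetical corpus size (assumption)
GPT3_MAX_DIM = 12288   # upper bound quoted in the article
SMALL_DIM = 768        # assumed size for a typical open model

large = index_size_gb(NUM_DOCS, GPT3_MAX_DIM)
small = index_size_gb(NUM_DOCS, SMALL_DIM)

print(f"{GPT3_MAX_DIM} dims: {large:.1f} GB")
print(f"{SMALL_DIM} dims: {small:.1f} GB")
print(f"ratio: {large / small:.0f}x")
```

Storage and brute-force dot-product cost both grow linearly with dimension, so under these assumptions the larger vectors need roughly 16 times the memory and compute per query.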
🥶 Oops: Exposed databases on AWS
FYI: I wrote about this issue over a year ago and even provided a search engine. It seems more peeps are on top of it now.
How I Discovered Thousands of Open Databases on AWS
My journey on finding and reporting databases with sensitive data about Fortune-500 companies, Hospitals, Crypto…
Scan the entire internet in under 5 minutes:
GitHub - robertdavidgraham/masscan: TCP port scanner, spews SYN packets asynchronously, scanning…
This is an Internet-scale port scanner. It can scan the entire Internet in under 5 minutes, transmitting 10 million…
Recovering redacted information from pixelated videos | Positive Security
Information that has been redacted is often the most interesting. It's therefore no wonder that some people might have…
How I got an FBI record at age 11 from dabbling in cryptography then got into more trouble 😭😭
Les Earnest Growing up in , my first encounter with advanced technology was the gift of a one-speed fat tired bicycle…
ViLT Notebook for Visual Question Answering
author: Niels Rogge @ Hugging Face
Vilt Vqa - a Hugging Face Space by nielsr
How to Improve User Experience (and Behavior): Three Papers from Stanford’s Alexa Prize Team
In 2019, Stanford entered the Alexa Prize Socialbot Grand Challenge 3 for the first time, with its bot Chirpy Cardinal…
For Practitioners: How GPUs Work | A Thread
Twitter conversation started by @marktenenholtz
Read the Twitter conversation started by @marktenenholtz and the replies to it on ThreadReaderApp
Data Engineering in Julia
Work with massive datasets to design data models and automate data pipelines using Julia
A repo for validating models and data.
GitHub - deepchecks/deepchecks: Test Suites for Validating ML Models & Data. Deepchecks is a Python…
Test Suites for Validating ML Models & Data. Deepchecks is a Python package for comprehensively validating your machine…
Task-Specific Knowledge Distillation for BERT using Transformers & Amazon SageMaker
Photo by Paul Byrne on Unsplash Welcome to this end-to-end task-specific knowledge distillation Text-Classification…
Papers to Read 📚
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
TableQuery: Querying tabular data with natural language
AI Tool for querying natural language on tabular data.
Tabular data can be:
GitHub - abhijithneilabraham/tableQA: AI Tool for querying natural language on tabular data.
AI Tool for querying natural language on tabular data. Built using QA models from transformers. This work is described…
POTATO: exPlainable infOrmation exTrAcTion framewOrk
POTATO is a human-in-the-loop XAI framework for extracting and evaluating interpretable graph features for any classification problem in NLP.
GitHub - adaamko/POTATO: XAI based human-in-the-loop framework for automatic rule-learning.
POTATO is a human-in-the-loop XAI framework for extracting and evaluating interpretable graph features for any…
PromptSource is a toolkit for creating, sharing and using natural language prompts. Work from the BigScience initiative.
GitHub - bigscience-workshop/promptsource: Toolkit for creating, sharing and using natural language…
Recent work has shown that large language models exhibit the ability to perform reasonable zero-shot generalization to…
Text Anonymization Benchmark (TAB)
The Text Anonymization Benchmark (TAB) is a new, open-source corpus for text anonymization. It comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) manually annotated with:
- semantic categories for personal identifiers,
- masking decisions (in regard to the re-identification risk for the person to protect),
- confidential attributes,
- co-reference relations.
GitHub - NorskRegnesentral/text-anonymisation-benchmark: Annotated corpus + evaluation metrics for…
The Text Anonymization Benchmark (TAB) is a new, open-source corpus for text anonymization. It comprises 1,268…
FiNCAT: Financial Numeral Claim Analysis Tool
A tool to detect whether numerals present in Financial Texts are in-claim or out-of-claim.
GitHub - sohomghosh/FiNCAT_Financial_Numeral_Claim_Analysis_Tool: A tool to detect whether numerals…
A tool to detect whether numerals present in Financial Texts are in-claim or out-of-claim. Please refer to…