NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 02.06.22
NeoX
A Vendetta and 404s
Meanwhile Everything dropped this week…
DeepMind’s AlphaCode
Meanwhile back at the ranch…
Math Olympiad solver from OpenAI:
Blog:
Meanwhile back at the ranch… again…
New GPT-NeoX 20B params dropped:
New Transformer Book Repo with Colab Notebooks!
Parsr: A PDF Parser that doesn’t suck 😎
SBERT Author Shreds the New GPT-3 Embeddings Offering 🥶🥶
“The biggest downside for the OpenAI embeddings endpoint is the high costs (about 8,000–600,000 times more expensive than open models on your infrastructure), the high dimensionality of up to 12288 dimensions (making downstream applications slow), and the extreme latency when computing embeddings. This hinders the actual usage of the embeddings for any search applications.”
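To put the dimensionality complaint in perspective, here is a rough back-of-the-envelope sketch of what an embedding index weighs in memory. The 12288 figure comes from the quote above; everything else (float32 storage, a 1M-document corpus, and 768 dimensions as a stand-in for a typical open model) is an assumption picked for illustration, not something from the original post.

```python
# Back-of-the-envelope memory footprint of a dense embedding index.
# Assumptions (not from the quote): float32 storage, 1M documents,
# and a 768-dimensional open model as the comparison point.

BYTES_PER_FLOAT32 = 4


def index_size_gib(num_vectors: int, dim: int) -> float:
    """Raw size in GiB of num_vectors float32 vectors with dim dimensions."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 2**30


num_docs = 1_000_000                     # hypothetical corpus size
large = index_size_gib(num_docs, 12288)  # largest dimensionality cited in the quote
small = index_size_gib(num_docs, 768)    # assumed open-model dimensionality

print(f"12288-dim index: {large:.1f} GiB")
print(f"768-dim index:   {small:.1f} GiB")
print(f"ratio: {large / small:.0f}x")
```

Storage grows linearly with dimensionality, and so does the cost of a brute-force dot-product search, so the ratio between the two index sizes is also roughly the slowdown you would expect per query under these assumptions.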
🥶 Oops: Exposed databases on AWS
FYI: I wrote about this issue over a year ago and even provided a search engine. It seems more peeps are now on top of it.
Scan the entire internet under 5 minutes:
or…
How I got an FBI record at age 11 from dabbling in cryptography then got into more trouble 😭😭
ViLT Notebook for Visual Question Answering
author: Niels Rogge @ Hugging Face
Colab:
Space:
How to Improve User Experience (and Behavior): Three Papers from Stanford’s Alexa Prize Team
For Practitioners: How GPUs Work | A Thread
Data Engineering in Julia
DeepChecks
A repo for validating models and data.
Task-Specific Knowledge Distillation for BERT using Transformers & Amazon SageMaker
Papers to Read 📚
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
TableQuery: Querying tabular data with natural language
AI tool for querying tabular data with natural language.
Tabular data can be:
- DataFrames
- CSV files
POTATO: exPlainable infOrmation exTrAcTion framewOrk
POTATO is a human-in-the-loop XAI framework for extracting and evaluating interpretable graph features for any classification problem in NLP.
PromptSource
PromptSource is a toolkit for creating, sharing and using natural language prompts. Work from the BigScience initiative.
Text Anonymization Benchmark (TAB)
The Text Anonymization Benchmark (TAB) is a new, open-source corpus for text anonymization. It comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) manually annotated with:
- semantic categories for personal identifiers,
- masking decisions (in regard to the re-identification risk for the person to protect),
- confidential attributes,
- co-reference relations.
FiNCAT: Financial Numeral Claim Analysis Tool
A tool to detect whether numerals present in Financial Texts are in-claim or out-of-claim.