🕵️‍♂️ Has AI interest peaked?
If you’re bummed, you can always… 👇
Graph ML in 2022: Where Are We Now?
The State of Web-Scraping 2022
The State of Web Scraping 2022 | ScrapeOps
With 2021 having come to an end, now is the time to look back at the big events & trends in the world of web scraping…
DARPA and OSS 🕵️‍♀️
Press Release: https://www.darpa.mil/news-events/2021-12-21
The DARPA GARD program seeks to establish theoretical ML system foundations to identify system vulnerabilities, characterize properties that will enhance system robustness, and encourage the creation of effective defenses. Currently, ML defenses tend to be highly specific and are effective only against particular attacks. GARD seeks to develop defenses capable of defending against broad categories of attacks. Furthermore, current evaluation paradigms of AI robustness often focus on simplistic measures that may not be relevant to security. To verify relevance to security and wide applicability, defenses generated under GARD will be measured in a novel testbed employing scenario-based evaluations.
Repos mentioned in the press release:
GitHub - twosixlabs/armory: ARMORY Adversarial Robustness Evaluation Test Bed
ARMORY is a test bed for running scalable evaluations of adversarial defenses. Configuration files are used to launch…
GitHub - Trusted-AI/adversarial-robustness-toolbox: Adversarial Robustness Toolbox (ART) - Python…
Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction…
State of Machine Learning in Julia
State of machine learning in Julia
After the Twitter space Q&A @logankilpatrick hosted yesterday on "The future of machine learning and why it looks a lot…
For Those Interested in Semantic Similarity
Euclidean vs. Cosine Distance
When to use the cosine similarity? Let's compare two different measures of distance in a vector space, and why either…
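To make the comparison concrete, here is a minimal sketch in plain Python. The three toy vectors and the function names are our own assumptions for illustration and are not taken from the linked article. It shows two vectors that point in the same direction but differ in magnitude: cosine similarity treats them as nearly identical, while Euclidean distance puts them far apart.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (ignores magnitude)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", chosen only for illustration.
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction as short_doc, 10x the magnitude
other_doc = [2.0, 1.0, 0.0]    # similar magnitude, different direction

# Same direction: cosine similarity is close to 1, yet the Euclidean distance is large.
print(cosine_similarity(short_doc, long_doc), euclidean_distance(short_doc, long_doc))

# Different direction: lower cosine similarity, yet a small Euclidean distance.
print(cosine_similarity(short_doc, other_doc), euclidean_distance(short_doc, other_doc))
```

If vector length mostly reflects something you don't care about (document length, for example), cosine is usually the safer pick. If magnitude carries meaning, Euclidean distance keeps it.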
Free CS Classes
Google Style Guide for Python
Python is the main dynamic language used at Google. This style guide is a list of dos and don'ts for Python programs…
From the Creator of FastAPI 👉 Asyncer
“The main goal of Asyncer is to improve developer experience by providing better support for autocompletion and inline errors in the editor, and more certainty that the code is bug-free by providing better support for type checking tools like mypy.”
Asyncer, async and await, focused on developer experience. Documentation: https://asyncer.tiangolo.com Source Code…
Real-Time Machine Learning
Real-time machine learning: challenges and solutions
In the last year, I've talked to ~30 companies in different industries about their challenges with real-time machine…
Handling Large Messages with Kafka
Handling Large Messages with Apache Kafka (CSV, XML, Image, Video, Audio, Files) - Kai Waehner
Kafka was not built for large messages. Period. Nevertheless, more and more projects send and process 1Mb, 10Mb, and…
GitHub - nipunsadvilkar/pySBD: 🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based…
pySBD - python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detection module that works…
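For a feel of what "rule-based" means here, below is a toy sketch using only the standard library. It is not pySBD's rule set or API: the `split_sentences` helper and the tiny abbreviation list are hypothetical and exist only to show why splitting on every period breaks, and how a single rule can patch it.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (illustration only).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc"}

def split_sentences(text):
    """Split text at ., ! or ? followed by whitespace and an uppercase letter,
    unless the period ends a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        # Rule: a period right after an abbreviation is not a sentence boundary.
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith arrived at 5 p.m. on Monday. He left early."))
```

A real rule-based library needs far more rules than this (numbers, quotes, ellipses, other languages), which is exactly the work a dedicated package takes off your hands.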
GitHub - SHI-Labs/Rethinking-Text-Segmentation: [CVPR 2021] Rethinking Text Segmentation: A Novel…
This is the repo to host the dataset TextSeg and code for TexRNet from the following paper: Xingqian Xu, Zhifei Zhang…
Kaggle Solutions Repo
GitHub - faridrashidi/kaggle-solutions: 🏅 Collection of Kaggle Solutions and Ideas 🏅
This repo consists of almost all available solutions and ideas shared by top performers in the past Kaggle…
GitHub - EricFillion/happy-transformer: A package built on top of Hugging Face's transformers…
A package built on top of Hugging Face's transformers library that makes it easy to utilize state-of-the-art NLP models…
OSLO: Extending the Training Capability for Transformers
GitHub - tunib-ai/oslo: OSLO: Open Source framework for Large-scale transformer Optimization
OSLO is a framework that provides various GPU based optimization technologies for large-scale modeling. 3D Parallelism…
Problems SeaTunnel attempts to solve:
- Data loss and duplication
- Task accumulation and delay
- Low throughput
- Long cycle to be applied in the production environment
- Lack of application running status monitoring
GitHub - apache/incubator-seatunnel: SeaTunnel is a distributed, high-performance data integration…
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of…
Cresset — A PyTorch Universal Docker Template
GitHub - veritas9872/PyTorch-Universal-Docker-Template: Template repository to build PyTorch…
Translations: 한국어 Notice The project will soon be renamed Cresset. Please be aware that the project URL will also…
Papers to Read📚
LaMDA: Language Models for Dialog Applications
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language…
Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning
Pretrained language models have achieved state-of-the-art performance when adapted to a downstream NLP task. However…
From the Lex Fridman podcast featuring Yann LeCun as guest:
It’s cued up to the moment Yann mentions the paper above.
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
COPA-SSE contains crowdsourced explanations for the Balanced COPA dataset, a variant of the Choice of Plausible Alternatives (COPA) benchmark. The explanations are formatted as a set of triple-like common sense statements with ConceptNet relations but freely written concepts.
GitHub - a-brassard/copa-sse: Repository for Semi-Structured Explanations for COPA (COPA-SSE)
Repository for COPA-SSE: Semi-Structured Explanations for Commonsense Reasoning . COPA-SSE contains crowdsourced…
SQUIRE is described as the first sequence-to-sequence based multi-hop reasoning framework. It uses an encoder-decoder structure to translate a triple query into a multi-hop path.
GitHub - bys0318/SQUIRE
This is the official codebase of the SQUIRE framework for multi-hop reasoning, proposed in SQUIRE: A…
CVSS is a massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel speech-to-speech translation pairs from 21 languages into English.
GitHub - google-research-datasets/cvss: CVSS: A Massively Multilingual Speech-to-Speech Translation…
CVSS is a massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel…
This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.
GitHub - EleutherAI/the-pile
The Pile is a large, diverse, open source language modelling data set that consists of many smaller datasets combined…
UnifiedSKG📚: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models
The UnifiedSKG framework unifies 21 SKG tasks into a text-to-text format, aiming to promote systematic SKG research rather than work that is exclusive to a single task, domain, or dataset. The authors report that large language models like T5, with simple modification when necessary, achieve state-of-the-art performance on nearly all 21 tasks.
GitHub - HKUNLP/UnifiedSKG: A Unified Framework and Analysis for Structured Knowledge Grounding
Code for paper UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models…
Tweebank-NER is an NER corpus of tweets based on the Tweebank V2 (TB2) dataset.
GitHub - social-machines/TweebankNLP: An off-the-shelf pre-trained Tweet NLP pipeline (NER…
This repo contains the new Tweebank-NER dataset and off-the-shelf Twitter-Stanza pipeline for state-of-the-art Tweet…