The NLP Cypher | 11.21.21

Inference Prime

Ricky Costa
Nov 21, 2021


Hey … so have you ever deployed a state-of-the-art, production-level inference server? Don’t know how to do it?

Well… last week, Michael Benesty dropped a bomb when he published one of the first detailed blog posts on how to not only deploy a production-level inference API but also benchmark some of the most widely used serving frameworks, such as FastAPI and Triton, and runtime engines, such as ONNX Runtime (ORT) and TensorRT (TRT). In the end, Michael reproduced the 1–2 ms inference latency that Hugging Face reached with miniLM on a T4 GPU. 👀
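To make the benchmarking idea concrete, here is a minimal latency harness using only the Python standard library. It is a sketch, not Michael's setup: `fake_inference` is a hypothetical stand-in for a real model call (for example, a request to a FastAPI or Triton endpoint), and the warm-up and run counts are arbitrary assumptions.

```python
import statistics
import time


def benchmark(fn, warmup=10, runs=100):
    """Call fn() repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    # Nearest-rank style index for the 95th percentile
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
    }


def fake_inference():
    # Hypothetical placeholder workload; swap in a real model or HTTP call
    return sum(i * i for i in range(1000))


stats = benchmark(fake_inference)
print(stats)
```

Reporting a tail percentile next to the mean matters for serving, since a handful of slow requests can hide behind a good average.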



Another Tutorial for Triton and Hugging Face Inference

NVIDIA’s Triton Server Update

PyTorch LIT (talkin’ bout Inference)

PyTorch Lite Inference Toolkit: works with Hugging Face pipeline.

Here’s an example for text generation with GPT-J (6 Billi param model)

Model Size x 18 = Model Memory Required
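The rule of thumb above is plain arithmetic, sketched below. The 2 GB model size is a hypothetical input chosen for illustration, not a measured checkpoint size, and the source does not spell out which kind of memory the factor of 18 accounts for.

```python
def estimate_memory_gb(model_size_gb, factor=18):
    """Apply the 'model size x 18' rule of thumb quoted above."""
    return model_size_gb * factor


example_size_gb = 2.0  # hypothetical model size, for illustration only
print(estimate_memory_gb(example_size_gb))  # 36.0
```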


A Convenient Collection of Simple Python Code Snippets

Some Examples:

1 Hello World


3 Random Password Generator

4 Instagram Profile Info

6 Fetch links from Webpage

7 Todo App With Flask

8 Add Watermark on Images

9 WishList App Using Django

10 Split Folders into Subfolders
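To give a flavor of what a snippet like the random password generator could look like, here is a minimal sketch using only the standard library. It is not taken from the collection itself; the function name, default length, and alphabet are assumptions.

```python
import secrets
import string

# Characters to draw from: letters, digits, and punctuation
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_password(length=16):
    """Return a random password of the given length."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # secrets.choice uses a cryptographically strong random source
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


password = generate_password()
print(password)
```

The `secrets` module is used rather than `random` because it is intended for security-sensitive values such as passwords and tokens.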

OpenAI’s API Goes Open Range

G5 Instances at AWS w/ A10G GPUs

Hop: Reading Files without Extracting Archive

“25x faster than unzip and 10x faster than tar at reading individual files (uncompressed)”
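For readers new to the idea of reading a single file out of an archive without extracting everything, here is a sketch using Python's standard `zipfile` module. This is not Hop and says nothing about Hop's format or speed; it only illustrates the general pattern, and the archive contents are made up for the example.

```python
import io
import zipfile


def build_example_archive():
    """Create a small in-memory zip archive with made-up contents."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        archive.writestr("docs/readme.txt", "hello from inside the archive")
        archive.writestr("data/values.csv", "a,b\n1,2\n")
    buffer.seek(0)
    return buffer


def read_member(archive_file, member_name):
    """Read one member's bytes without extracting the archive to disk."""
    with zipfile.ZipFile(archive_file) as archive:
        with archive.open(member_name) as member:
            return member.read()


content = read_member(build_example_archive(), "docs/readme.txt")
print(content.decode("utf-8"))
```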

InfraNodus | Text Analysis Software

Create graphs with your text data.

Free Version:

Distributed Training w/ PyTorch Lightning and Ray

pip install ray-lightning

Papers to Read 📚

Stanford’s Papers at EMNLP/CoNLL

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁


Improves DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. On GLUE, it achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art (SOTA) among models with a similar structure.

Includes a multi-lingual variant. :)

🤗 Model Pages

Connected Papers 📈

DataCLUE: A Benchmark Suite for Data-centric NLP

A benchmark for data-centric AI. It measures how modifying the data can impact a model’s performance. You can modify the training set and validation set, re-split the training set and validation set, or add data by non-crawler methods. The modifications can be made by algorithms or programs, or in combination with manual methods.

Connected Papers 📈

Dynamic-TinyBERT: Boost TinyBERT’s Inference Efficiency by Dynamic Sequence Length

Dynamic-TinyBERT is a TinyBERT model that uses sequence-length reduction and hyperparameter optimization to improve inference efficiency under any computational budget. It is trained only once, performs on par with BERT, and achieves an accuracy-speedup trade-off superior to other efficient approaches (up to 3.3x with <1% loss drop).

Connected Papers 📈




Subscribe to the NLP Cypher newsletter for the latest in NLP & ML code/research. 🤟