DeepAuto Lightweight LLM
Our efficient LLM serving framework drastically accelerates long-context Transformer inference in a plug-and-play manner using our novel sub-quadratic-complexity attention mechanism, Hierarchical Pruned Attention (HiP).
Examples
Source Model: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
DeepAuto.ai Optimized Model: DeepAuto/Meta-Llama-3.1-8B-Instruct-AWQ-INT4-HiP
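Below is a minimal sketch of how the optimized checkpoint might be used, assuming it is published on the Hugging Face Hub and loads through the standard transformers API; the loading flags (e.g. trust_remote_code) and the generation settings are assumptions for illustration, not confirmed usage.

```python
# Sketch only: assumes the HiP-optimized checkpoint loads via the standard
# Hugging Face transformers API. trust_remote_code is an assumption, on the
# guess that custom HiP attention modules may ship with the model repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DeepAuto/Meta-Llama-3.1-8B-Instruct-AWQ-INT4-HiP"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # place the quantized weights on available GPUs
    trust_remote_code=True,  # assumed: may be needed for custom attention code
)

# Generate against a long prompt; HiP's sub-quadratic attention is intended
# to keep latency manageable as the context length grows.
prompt = "Summarize the following document: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```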