DeepAuto Lightweight LLM
Our efficient LLM serving framework drastically accelerates long-context Transformer inference in a plug-and-play manner using our novel sub-quadratic-complexity attention mechanism, Hierarchical Pruned Attention (HiP).
Examples
Source Model: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
DeepAuto.ai Optimized Model: DeepAuto/Meta-Llama-3.1-8B-Instruct-AWQ-INT4-HiP
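Below is a minimal sketch of how the optimized checkpoint might be used, assuming it is published on the Hugging Face Hub and loads through the standard transformers API; the loading flags (e.g. trust_remote_code) and the generation settings are assumptions for illustration, not confirmed usage.

```python
# Sketch only: assumes the HiP-optimized checkpoint loads via the standard
# Hugging Face transformers API. trust_remote_code is an assumption, on the
# guess that custom HiP attention modules may ship with the model repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DeepAuto/Meta-Llama-3.1-8B-Instruct-AWQ-INT4-HiP"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # place the quantized weights on available GPUs
    trust_remote_code=True,  # assumed: may be needed for custom attention code
)

# Generate against a long prompt; HiP's sub-quadratic attention is intended
# to keep latency manageable as the context length grows.
prompt = "Summarize the following document: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```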