Original post link

Intro (on cheese)

Is vLLM delivering the same inference quality as mistral.rs? How does in-situ quantization stack up against bpw in EXL2? Is running q8 in Ollama the same as fp8 in Aphrodite? Which model suggests the classic mornay sauce for a lasagna?

Sadly, there weren’t enough answers to questions like these in the community. Most cross-backend benchmarks are (reasonably) focused on speed as the main metric. But for a local setup… sometimes you would just run the model that knows its cheese better, even if it means pausing while reading its responses. Often you would trade off some TPS for a better quant that knows the difference between a béchamel and a mornay sauce better than you do.

The test

Based on a selection of 256 MMLU Pro questions from the “other” category (a sampling sketch follows this list):

  • Running the whole MMLU Pro suite would take too much time, so running a selection of questions was the only option
  • The selection isn’t scientific in terms of distribution, so results are only representative relative to each other
  • The questions were chosen to leave enough headroom for the models to show their differences
  • Question categories reflect what ended up in the selection, not any specific benchmark goals
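
For reference, here’s a minimal sketch of how such a selection could be reproduced, assuming the TIGER-Lab/MMLU-Pro dataset on Hugging Face; the exact split, column names, sampling method and seed are placeholders, since the post doesn’t specify them.

```python
# Sketch: sample 256 questions from the MMLU Pro "other" category.
# Dataset id, split and column names are assumed from the public dataset card,
# not taken from the post; the seed is arbitrary.
import random
from datasets import load_dataset

dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
other = [row for row in dataset if row["category"] == "other"]

random.seed(42)  # arbitrary seed, not from the post
selection = random.sample(other, 256)

for q in selection[:2]:
    print(q["question"], "->", q["answer"])
```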

Here are a couple of questions that made it into the test:

- How many water molecules are in a human head?
  A: 8*10^25

- Which of the following words cannot be decoded through knowledge of letter-sound relationships?
  F: Said

- Walt Disney, Sony and Time Warner are examples of:
  F: transnational corporations

Initially, I tried to base the benchmark on Misguided Attention prompts (shout out to Tim!), but those are simply too hard. None of the existing LLMs can solve them consistently, so the results are too noisy.

Engines

LLM and quants

There’s one model that is the gold standard in terms of engine support: Meta’s Llama 3.1, of course. We’re using the 8B version for the benchmark, as most of the tests are done on a 16GB VRAM GPU.

We’ll run quants at 8-bit precision and below, with the exception of fp16 in Ollama.

Here’s a full list of the quants used in the test:

  • Ollama: q2_K, q4_0, q6_K, q8_0, fp16
  • llama.cpp: Q8_0, Q4_K_M
  • Mistral.rs (ISQ): Q8_0, Q6K, Q4K
  • TabbyAPI: 8bpw, 6bpw, 4bpw
  • Aphrodite: fp8
  • vLLM: fp8, bitsandbytes (default), awq (results added after the post)
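
As a rough illustration of how a single benchmark question reaches any of these backends, here’s a sketch using an OpenAI-compatible chat endpoint, which all of the engines above can expose. The base URL, port, model id and the truncated option list are placeholder assumptions, not details from the post.

```python
# Sketch: send one multiple-choice question to a locally served model via an
# OpenAI-compatible API (Ollama, llama.cpp server, mistral.rs, TabbyAPI,
# Aphrodite and vLLM can all expose one). URL, port and model id are
# placeholders, not values from the post.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

question = "Walt Disney, Sony and Time Warner are examples of:"
options = [
    "A: ...",                        # remaining options omitted here
    "F: transnational corporations",
]
prompt = question + "\n" + "\n".join(options) + "\nAnswer with the letter only."

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",   # placeholder model id
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
    max_tokens=4,
)
print(resp.choices[0].message.content.strip())
```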

Results

Let’s start with our baselines: Llama 3.1 8B, 70B and Claude 3.5 Sonnet, served via OpenRouter’s API. This should give us a sense of where we are “globally” on the charts that follow.

[Baseline chart: Llama 3.1 8B, 70B and Claude 3.5 Sonnet via OpenRouter]

Unsurprisingly, Sonnet is completely dominating here.

Before we begin, here’s a boxplot showing the distribution of scores per engine and per tested temperature setting, to give you an idea of the spread in the numbers.

Left: distribution of scores by category per engine. Right: distribution of scores by category per temperature setting (across all engines)

Let’s take a look at our engines, starting with Ollama

https://preview.redd.it/ykbgk6e3ieod1.png?width=773&format=png&auto=webp&s=82025d34657be1cc7a35d8760d82fa82939fc932

Note that the axis is truncated compared to the reference chart; this applies to the following charts as well. One surprising result is that the fp16 quant isn’t doing particularly well in some areas, which can of course be attributed to the tasks specific to this benchmark.

Moving on to llama.cpp

https://preview.redd.it/qrvy9iul8god1.png?width=776&format=png&auto=webp&s=930bdecb3803864cf72dcc00dfca064e8fb8c92c

Here we also see a somewhat surprising picture. I promise we’ll talk about it in more detail later. Note how enabling KV cache quantization drastically impacts the performance.

Next, Mistral.rs and its interesting In-Situ-Quantization approach

https://preview.redd.it/71xue0xwieod1.png?width=773&format=png&auto=webp&s=39738e38192db4e934924beadd9b11a994473fc8

TabbyAPI

https://preview.redd.it/0r4n7ck1jeod1.png?width=773&format=png&auto=webp&s=af7ae74af88c71ee49d28f20861708a0aa69fe40

Here, results are more in line with what we’d expect: lower quants are losing to the higher ones.

And finally, vLLM

https://preview.redd.it/kl8rrszwxeod1.png?width=783&format=png&auto=webp&s=c594e27e449d5ee87b5a7114cca99f874dbfa4c7

It’d be safe to say that these results do not fit well into the mental model of lower quants always losing to the higher ones in terms of quality.

And, in fact, that’s true. LLMs are very susceptible to even the tiniest changes in weights, which can nudge the outputs slightly. We’re not talking about catastrophic forgetting, but rather something along the lines of fine-tuning.

For most tasks, you’ll never know which specific version works best for you until you test it with your data and under the conditions you’re going to run it in. We’re not talking about differences of orders of magnitude, of course, but the differential in quality is still measurable and sometimes meaningful.
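
A minimal version of that kind of check might look like the sketch below: point the same question set at two locally served engine/quant combinations and compare raw accuracy. The endpoints, model id and letter-matching scoring rule are illustrative assumptions, not the harness used for this post.

```python
# Sketch: compare two local engine/quant endpoints on your own question set.
# Endpoints, ports, model id and the scoring rule are assumptions.
from openai import OpenAI

BACKENDS = {
    "llama.cpp Q4_K_M": "http://localhost:8080/v1",  # placeholder ports
    "vLLM AWQ": "http://localhost:8000/v1",
}

# (prompt, expected answer letter) pairs drawn from your own data.
EVAL_SET = [
    ("How many water molecules are in a human head?\n"
     "A: 8*10^25\nB: ...\nAnswer with the letter only.", "A"),
]

for name, base_url in BACKENDS.items():
    client = OpenAI(base_url=base_url, api_key="none")
    correct = 0
    for prompt, expected in EVAL_SET:
        resp = client.chat.completions.create(
            model="llama-3.1-8b-instruct",  # placeholder model id
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            max_tokens=4,
        )
        answer = resp.choices[0].message.content.strip().upper()
        correct += answer.startswith(expected)
    print(f"{name}: {correct}/{len(EVAL_SET)} correct")
```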

Here’s the chart that you should be very wary of.

https://preview.redd.it/fnd1pptq8god1.png?width=971&format=png&auto=webp&s=8d92c729322c44ccf4633c3b2daa099336e27468

https://preview.redd.it/1i0hj9o89god1.png?width=2378&format=png&auto=webp&s=02ef4ef0fe5c45f06082e68b1b58fe6613fd3b19

Does it mean that vLLM AWQ is the best local Llama you can get? Most definitely not; however, it’s the configuration that performed best on the 256 questions specific to this test. It’s very likely there’s also a “sweet spot” out there for your specific data and workflows.

Materials

P.S. Cheese bench

I wasn’t kidding that I need an LLM that knows its cheese. So I’m also introducing CheeseBench: the first (and only?) LLM benchmark measuring knowledge about cheese. It’s very small, at just four questions, but I can already feel my sauce getting thicker with recipes from the winning LLMs.

Can you guess which LLM knows its cheese best? Why, Mixtral, of course!

https://preview.redd.it/nbicd3uzqeod1.png?width=441&format=png&auto=webp&s=c7eb3c0c3a5e229aba1561be9ba35441f8bce81f

Edit 1: fixed a few typos

Edit 2: updated vllm chart with results for AWQ quants

Edit 3: added Q6_K_L quant for llama.cpp

Edit 4: added kv cache measurements for Q4_K_M llama.cpp quant

Edit 5: added all measurements as a table

Discussion summary

The discussion revolves around comparing the performance of mainstream LLM inference engines (llama.cpp, Ollama, vLLM, mistral.rs, TabbyAPI, Aphrodite Engine). It covers how different quantization methods (Q6_K_L, Q8_0, fp16, etc.) affect engine performance, with radar charts showing how the engines fare across task categories. Participants also discussed the impact of quantized k/v cache on performance and suggested further tests. Overall, the tone is technical and data-driven, with participants analyzing the performance of the different engines and quantization methods in depth.

Main points

  1. 👍 The quantization method has a significant impact on performance

    • Supporting argument: even at similar bit counts, quantization methods can differ noticeably in real-world quality and precision. For example, EXL2 at 4.0bpw and Q4_K_M at 4.84bpw behave differently in practice (a back-of-the-envelope comparison is sketched after this list).
    • Dissenting view: a quant’s bit count doesn’t directly reflect its performance; in practice, the right quantization method should be chosen based on the specific task and how the model actually behaves.
  2. 🔥 The impact of quantized k/v cache on performance

    • Pro: the impact of cache quantization deserves further study; some quants (such as fp16) underperform expectations on certain tasks.
    • Con: the impact of cache quantization may vary by task, and more testing is needed to pin it down.
  3. 💡 Inference engines perform differently on different tasks

    • Explanation: radar charts show how the engines perform across task categories; some engines do better than others on specific tasks.
  4. 👍 Higher-precision quants don’t always mean better performance

    • Supporting argument: higher-precision quants can underperform expectations on some tasks; the quantization method should be chosen for the task at hand.
    • Dissenting view: higher-precision quants may still beat lower ones on some tasks, depending on what the task requires.
  5. 💡 More testing is needed to determine the best quantization method

    • Explanation: quantization methods perform differently across tasks, and more testing is needed to find the one that best fits a particular dataset and workload.
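
To make the bpw comparison in point 1 concrete, here’s a back-of-the-envelope sketch of what those two bit rates mean for the weight footprint of an 8B model. It ignores KV cache, activations and runtime overhead, and treats the parameter count as a round 8 billion, so it is illustrative only.

```python
# Back-of-the-envelope: approximate weight memory for an ~8B-parameter model
# at 4.0 bpw (EXL2) vs ~4.84 bpw effective (GGUF Q4_K_M). KV cache,
# activations and runtime overhead are ignored; 8e9 is a rounded count.
PARAMS = 8e9

for label, bpw in [("EXL2 4.0 bpw", 4.0), ("GGUF Q4_K_M (~4.84 bpw)", 4.84)]:
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{label}: ~{gib:.1f} GiB of weights")

# -> roughly 3.7 GiB vs 4.5 GiB: similar nominal "4-bit" labels, but a
#    noticeably different memory footprint, before any quality differences.
```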

Notable quotes and interesting comments

  1. “😂 Before anyone else steals it - I know this post is cheesy”

    • Highlight: the author jokes about their own topic, hinting that the post’s content may be a bit “corny” or “cheesy”.
  2. “🤔 A quantization method’s bit count doesn’t directly reflect its performance; in practice, the right quant should be chosen based on the specific task and the model’s actual behavior.”

    • Highlight: stresses that quant selection should be based on the actual task and model behavior rather than on bit count alone.
  3. “👀 Ollama and llama.cpp performed differently in the test; higher-precision quants don’t always mean better performance.”

    • Highlight: points out how the engines’ results differ in this test and underlines the limits of higher-precision quants.

Sentiment analysis

The overall tone of the discussion is technical and data-driven, with participants analyzing the performance of the different engines and quantization methods in depth. The main points of disagreement are the real-world quality and precision of the various quantization methods, and the concrete impact of k/v cache quantization on performance. Likely reasons include differences in task requirements, in the implementation details of the quantization methods, and in test environments.

Trends and predictions

  • Emerging topics: further study of how k/v cache quantization affects performance, and how different quantization methods behave in real-world use.
  • Potential impact: a more scientific basis for optimizing and choosing LLM inference engines, pushing quantization methods forward in development and adoption.

Details:

“A Showdown of Mainstream LLM Inference Engines”

A recent Reddit post about mainstream LLM inference engines drew wide attention. Posted by [u/Everlier], it revolves around testing and analyzing six mainstream LLM inference engines, and it collected plenty of upvotes and a large number of comments.

The post focuses on how different inference engines, including Ollama, llama.cpp, vLLM, mistral.rs, TabbyAPI and Aphrodite Engine, perform on specific tasks. The tests are based on 256 MMLU Pro questions and use a variety of quantization formats and model parameters.

Discussion focus and viewpoints:

  • Some felt that Sonnet performed outstandingly and completely dominated the test.
  • Others pointed out that in some cases fp16 underperformed expectations, possibly due to the nature of the tasks.
  • Some mentioned that enabling k/v cache quantization has a noticeable impact on performance, and others were surprised by differences between engines, such as Ollama and llama.cpp producing different results.
  • Some wondered whether the same bpw/quant could be visualized as spider charts across engines to compare similar models more clearly.
  • Someone raised the question of using the Triton TensorRT-LLM backend, sparking a discussion about its setup difficulty and paid licensing.
  • Some felt the vLLM results break the assumption that lower quants always lose to higher ones.

Overall, this discussion of LLM inference engines shows their complexity and diversity and provides a valuable reference for related research and applications. The clash of viewpoints also gave people a deeper understanding of inference-engine performance. Still, more testing and research are needed to fit the needs of specific data and workflows.