Llama.cpp behind a load balancer is faster than Aphrodite??

Hi everyone!

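I'm relatively new to running large language models (LLMs) locally, but I went all in and built a small GPU cluster at home, specifically for running RAG pipelines on data that has to stay local. While researching model-serving options, vLLM and Aphrodite kept coming up as the fastest candidates for handling requests in parallel. That is not what I see in my own tests, though, and I'm puzzled about what I could tune or whether I'm doing something wrong...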

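For models that fit entirely in a single card's memory, running multiple llama.cpp instances behind a load balancer (Paddler) appears to be much faster: almost 100% faster, from what I've observed!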

I'm running these tests on a Lenovo P520 workstation with a Xeon W-2133 CPU, 64GB of RAM, and 2x 3060 12GB GPUs. Both GPUs are in x16 slots.

Is this result expected? Surprising? Could you look over the setups listed below and tell me if there's anything I've missed, misconfigured, or should tune?

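Note: I'm currently downloading the full, unquantized Llama-3.1-8B-Instruct so I can repeat these tests and see whether quantization is the bottleneck in the setups that use tensor parallelism (i.e. setups #1 and #2)...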

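Test case:

Process 25 Ruby source files, checking each file's relevance to a given prompt.
All test cases use exactly the same files and prompt.
The prompt is along the lines of: return a relevance score for how relevant [FILE] is to [PROMPT]
The full prompt is here: https://gist.github.com/aarongough/301d459537e7d42b6c44f2011030fb4d
The time listed for each test case is the best of 5 runs, taken after 1 warm-up run.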
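
The "best of 5 runs after 1 warm-up" methodology could be sketched roughly as follows. This is not the harness used for the numbers in this post; `run_batch`, `best_of`, and `fake_send` are hypothetical names, and the client-side concurrency of 10 is only an assumption chosen to mirror the `--parallel 10` server flag used in the setups.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def run_batch(send: Callable[[str], str], prompts: Sequence[str],
              concurrency: int = 10) -> float:
    """Send every prompt through `send` in parallel; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() forces every request to finish before the clock stops.
        list(pool.map(send, prompts))
    return time.perf_counter() - start


def best_of(send: Callable[[str], str], prompts: Sequence[str],
            runs: int = 5, warmup: int = 1, concurrency: int = 10) -> float:
    """Discard `warmup` batches, then return the fastest of `runs` batches."""
    for _ in range(warmup):
        run_batch(send, prompts, concurrency)
    return min(run_batch(send, prompts, concurrency) for _ in range(runs))


def fake_send(prompt: str) -> str:
    """Stand-in for an HTTP request to the server under test (hypothetical)."""
    time.sleep(0.01)
    return "0"


if __name__ == "__main__":
    prompts = [f"file_{i}.rb" for i in range(25)]
    print(f"best of 5: {best_of(fake_send, prompts):.3f} s")
```

In a real run, `send` would post one prompt to whatever is listening on port 8080 (Aphrodite, llama-server, or the Paddler reverse proxy), so the same client code exercises all three setups.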

Setup 1 - Aphrodite:

aphrodite run Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf --port 8080 --host 0.0.0.0 --tokenizer meta-llama/Meta-Llama-3-8B-Instruct --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager true --tensor-parallel-size 2
Memory usage: 11.30GB per GPU
Time to complete all prompts: 24.23 seconds
Note: the FlashAttention backend is in use (the log reports: INFO: Using FlashAttention backend.)

Setup 2 - Llama.cpp:

./llama.cpp/llama-server --model Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf --prompt "You are a helpful assistant" --n-gpu-layers 2000000 --threads 6 --flash-attn --ctx-size 8192 --port 8080 --host 0.0.0.0 --cont-batching --parallel 10 --threads-http 10 --tensor-split 1,1
Memory usage: 4.23GB per GPU
Time to complete all prompts: 18.59 seconds

Setup 3 - 2x Llama.cpp load balanced w/ Paddler:

CUDA_VISIBLE_DEVICES=0 ./llama.cpp/llama-server --model Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf --prompt "You are a helpful assistant" --n-gpu-layers 2000000 --threads 6 --flash-attn --ctx-size 8192 --port 8088 --host 0.0.0.0 --cont-batching --parallel 10 --threads-http 10

CUDA_VISIBLE_DEVICES=1 ./llama.cpp/llama-server --model Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf --prompt "You are a helpful assistant" --n-gpu-layers 2000000 --threads 6 --flash-attn --ctx-size 8192 --port 8089 --host 0.0.0.0 --cont-batching --parallel 10 --threads-http 10


./paddler-bin-linux-amd64 agent \
    --external-llamacpp-host 127.0.0.1 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host 127.0.0.1 \
    --management-port 8085

./paddler-bin-linux-amd64 agent \
    --external-llamacpp-host 127.0.0.1 \
    --external-llamacpp-port 8089 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8089 \
    --management-host 127.0.0.1 \
    --management-port 8085

./paddler-bin-linux-amd64 balancer \
    --management-host 127.0.0.1 \
    --management-port 8085 \
    --reverseproxy-host 0.0.0.0 \
    --reverseproxy-port 8080

Memory usage: 7.5GB per GPU
Time to complete all prompts: 11.26 seconds
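
To put the three timings side by side, here is a small sketch that turns them into throughput and speedup ratios. The times and the file count come from this post; the dictionary keys and function names are just illustrative labels.

```python
# Wall-clock times reported above, in seconds, for the 25-file test case.
TIMES = {
    "setup1_aphrodite_tp2": 24.23,
    "setup2_llamacpp_split": 18.59,
    "setup3_llamacpp_x2_paddler": 11.26,
}
FILES = 25


def throughput(name: str) -> float:
    """Files processed per second for one setup."""
    return FILES / TIMES[name]


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster `candidate` finished than `baseline`."""
    return TIMES[baseline] / TIMES[candidate]


if __name__ == "__main__":
    for name in TIMES:
        print(f"{name}: {throughput(name):.2f} files/s")
    s31 = speedup("setup1_aphrodite_tp2", "setup3_llamacpp_x2_paddler")
    s32 = speedup("setup2_llamacpp_split", "setup3_llamacpp_x2_paddler")
    print(f"Setup 3 vs Setup 1: {s31:.2f}x")
    print(f"Setup 3 vs Setup 2: {s32:.2f}x")
```

By these numbers Setup 3 finishes about 2.15x as fast as Setup 1 and about 1.65x as fast as Setup 2, so the "almost 100% faster" figure fits the comparison against Aphrodite more closely than the comparison against single-instance llama.cpp.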

Notes:

Looking back at the screenshots I took during testing, I noticed that Aphrodite is running the cards at PCIe Gen1, while llama.cpp is running them at Gen3:

Nvtop Aphrodite: https://i.imgur.com/tMF8ZJv.png

Nvtop Llama.cpp: https://i.imgur.com/IMHqKyo.png

This seems like it would slow Aphrodite down. How can I fix it?
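
One way to check whether the link really stays at Gen1 under load is to poll `nvidia-smi` while a benchmark is running, rather than relying on a single screenshot. The sketch below assumes `nvidia-smi` is on the PATH and uses its `pcie.link.gen.current`, `pcie.link.gen.max`, and `pcie.link.width.current` query fields; the helper names are hypothetical. As far as I know, GPUs commonly lower the link generation when idle to save power, so a Gen1 reading may only reflect the moment it was captured, but treat that as something to verify rather than a given.

```python
import subprocess
from typing import Dict, List

QUERY = "index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current"


def parse_links(csv_text: str) -> List[Dict[str, int]]:
    """Parse `nvidia-smi --format=csv,noheader` rows into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, gen_now, gen_max, width = (f.strip() for f in line.split(","))
        rows.append({
            "gpu": int(index),
            "gen": int(gen_now),
            "gen_max": int(gen_max),
            "width": int(width),
        })
    return rows


def read_links() -> List[Dict[str, int]]:
    """Query the current PCIe link state of every GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_links(out)


def below_max(rows: List[Dict[str, int]]) -> List[int]:
    """Return the indices of GPUs whose link is below its maximum generation."""
    return [r["gpu"] for r in rows if r["gen"] < r["gen_max"]]
```

Calling `read_links()` every second or so during a run of each setup, and comparing what `below_max` returns, would show whether the Gen1 state persists while requests are actually being processed.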

Links:

Aphrodite: https://github.com/PygmalionAI/aphrodite-engine
Aphrodite server argument docs: https://github.com/PygmalionAI/aphrodite-engine/wiki/3.-Engine-Options
Llama.cpp: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file
Llama.cpp server argument docs: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
Paddler: https://github.com/distantmagic/paddler

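Discussion Summary

This discussion centers on a user running large language models (LLMs) locally who found that llama.cpp behind the Paddler load balancer was faster than Aphrodite. The user tested on a Lenovo P520 workstation with two 3060 12GB GPUs and compared three setups. On the test that processes 25 Ruby source files, the load-balanced llama.cpp setup was roughly twice as fast as Aphrodite. The discussion also covered quantization, PCIe configuration, batch size, and other technical details, along with suggestions from community members and a direct reply from the maintainer.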

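Main Points

👍 llama.cpp behind the Paddler load balancer is faster than Aphrodite
- Supporting argument: the test results show a speedup of nearly 100%.
- Counterpoint: some users suggested trying other quantization methods or adjusting the configuration.
🔥 Aphrodite runs the GPUs at PCIe Gen1, while llama.cpp keeps them at Gen3
- In favor: this may be the cause of the performance difference.
- Against: it needs further testing and verification.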
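💡 Running the model unquantized may change the performance picture
- Explanation: the user plans to repeat the tests with the unquantized model.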
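👀 Community members suggested trying other quantization methods, such as GPTQ or AWQ
- Explanation: these methods may be faster, and the suggestion is backed by community members' hands-on experience.
🚀 The Aphrodite maintainer replied and suggested trying the upcoming optimized release
- Explanation: the next version (0.5.4) is expected to improve performance significantly.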

Memorable Quotes and Interesting Comments

“😂 FullOf_Bad_Ideas：Seems like you’re running GGUF models in Aphrodite. I’ve didn’t try that so I don’t know about speeds, but my best results with vLLM/Aphrodite were with models that fit into VRAM unquantized, in .safetensors format.”
- Highlight: stresses the advantage of unquantized models in some cases.
“🤔 Ragecommie：You can’t change PCIe configuration settings from your OS… Is Aphrodite communicating with the BIOS / UEFI by any chance?”
- Highlight: raises a deeper question about PCIe configuration and how software communicates with the system.
“👀 Pedalnomica：Aphrodite is based on VLLM. They’ve added support for more quantization options than VLLM (e.g. exl2, gguf), but I don’t think those are as optimized.”
- Highlight: points out that Aphrodite supports more quantization options, and that those may be less optimized.

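Sentiment Analysis

The overall tone of the discussion is positive: users engaged actively and offered useful suggestions and feedback. The main points of disagreement are the performance comparison between tools and configurations, and the choice of quantization method. These differences likely stem from differing test environments and configurations, and from people still exploring and adapting to new technology.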

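Trends and Predictions

Emerging topics: trying quantization methods such as GPTQ and AWQ, and tuning Aphrodite's configuration and version.
Potential impact: these optimizations and suggestions could noticeably improve the performance of locally run large language models and encourage further development and adoption of the related tooling.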