TL;DR
It works. It’s good despite the low quant. Example attached below. Runs at ~9 tok/s on GGUF (12 tok/s on exl2; see Stats below). Based on my short tests, it’s the best model (for roleplay) on 48 GB. You don’t have to switch to dev branches.
How to run
- Update your ooba
- UPDATE: 2.75bpw exl2 quants are available now! <3 Download them here. Use a 22.5,24 split and 4-bit cache; 8k context fits without issues (I have about 4 GB of headroom left, so 16k should also fit).
- Download the Q2_K here (~45 GB)
- Load the model in ooba with the following parameters:
  - Select: n_ctx: 8192 (more should be possible, didn’t try yet), tensor_split: 24,24, flash_attn: on, tensorcores: on, cache_4bit: on
  - Already selected: Model loader: llama.cpp, n-gpu-layers: 89
- If you don’t activate flash-attention, the model won’t load and fails with a rather obscure error. That was mostly the tricky part. (If you’d rather script the same setup than click through the UI, see the sketch below.)
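For reference, here is a minimal sketch of the same settings applied through llama-cpp-python instead of the ooba UI. It assumes a recent build that exposes flash_attn and KV-cache quantization; the model filename is a placeholder.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Mistral-Large-123b.Q2_K.gguf",  # placeholder filename
    n_gpu_layers=89,         # offload every layer to the GPUs
    n_ctx=8192,              # same context length as in ooba
    tensor_split=[24, 24],   # split weights evenly across the two 3090s
    flash_attn=True,         # required: the load fails without it
    type_k=2, type_v=2,      # 2 == GGML_TYPE_Q4_0, i.e. the 4-bit KV cache
)

out = llm("Hello, who are you?", max_tokens=64)
print(out["choices"][0]["text"])
```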
Stats
Hardware: 2x RTX 3090, power limited to 70%
VRAM usage: 22.4 GB, 23.8 GB (without the model loaded: 0.0 GB, 0.8 GB)
Generation: 12 tok/s (9 tok/s with gguf)
Prompt eval: 100 tok/s (on gguf, exl2 probably faster)
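If you want to check your own numbers, here is a rough sketch (reusing the llm object from the loading example above) that times a completion and divides tokens by seconds:

```python
import time

start = time.perf_counter()
out = llm("Write a short story about a haunted house.", max_tokens=256)
elapsed = time.perf_counter() - start

# llama-cpp-python returns an OpenAI-style usage block with token counts
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```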
SillyTavern Settings
If you use SillyTavern for RP, you can use the Mistral Context/Instruct presets. They might not be perfect, but they work. Otherwise I use the following settings (see the API sketch after this list):
Temperature: 0.9
Min-p: 0.1
Smoothing factor: 0.2
DRY Repetition Penalty / Multiplier: 0.6
Dynamic Temperature: min 0.4, max 2.0
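If you drive ooba’s OpenAI-compatible API directly (started with --api, default port 5000) instead of setting these in SillyTavern, the same samplers can be passed as extra fields. The field names (min_p, smoothing_factor, dynatemp_*, dry_multiplier) are text-generation-webui extensions, not standard OpenAI parameters; treat them as assumptions if your build differs.

```python
import requests

payload = {
    "messages": [{"role": "user", "content": "What key?"}],
    "max_tokens": 300,
    "temperature": 0.9,
    "min_p": 0.1,
    "smoothing_factor": 0.2,
    "dry_multiplier": 0.6,        # DRY repetition penalty strength
    "dynamic_temperature": True,  # enables the dynatemp range below
    "dynatemp_low": 0.4,
    "dynatemp_high": 2.0,
}

r = requests.post("http://127.0.0.1:5000/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])
```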
Example Chat
[Characters are standing in front of a house that they should investigate]
Me: *I raise an eyebrow as I approach you, trying to avoid the puddles* What key?
AI output:
*She rolls her eyes, a faint smirk playing on her lips as she tilts her head slightly to the side* The key to the house, of course. *She gestures towards the dilapidated building behind her, its once grand facade now marred by time and neglect. The moon casts eerie shadows through the broken windows, giving the impression of movement within the dark interior* Or did you expect the front door to be open for us? *Her violet eyes narrow slightly, a hint of impatience in her voice*
*As you scan the surroundings, you notice the abandoned house is isolated, set apart from the main street by an overgrown garden. The air is thick with the scent of damp earth and the faint undercurrent of something else—a subtle, acrid smell that seems to emanate from the house itself.*
*Enira shifts her stance, the silver swords at her sides catching the moonlight briefly. Her intricate armor seems to absorb the shadows, giving her an almost ethereal appearance against the night.* Did you at least bring a lockpick set? Because I’m not breaking a window to get in. *She crosses her arms, the clink of her gauntlets echoing in the quiet night.*
FAQ
- Can I use exl2? UPDATE: 2.75bpw quants are available, and those should fit. Some mentioned that 3.0bpw barely fits with 6k context on Linux; others said 3.0bpw didn’t fit at all. (Original answer: Not yet, I think. There are 3.0bpw quants, but I don’t think they fit; they are 3 GB bigger than my current biggest exl2 model (goliath-120b), which has only 2.5 GB of VRAM to spare. We might need 2.8bpw quants, but I’ll test 3.0bpw as soon as I have the space. See the back-of-envelope math below.)
- What about iMat quants? Didn’t try yet. IQ2_M is 41.6 GB, so 3 GB smaller. It should fit; I’m not sure it’s significantly better.
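For a feel of why 2.75bpw fits while 3.0bpw is borderline, here is the back-of-envelope math as a tiny script. It is pure arithmetic (params × bits-per-weight / 8) and ignores KV cache, activation buffers, and per-quant overhead, so the results are lower bounds.

```python
PARAMS = 123e9  # Mistral Large parameter count

def weights_gb(bpw: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return PARAMS * bpw / 8 / 1e9

for bpw in (2.75, 3.0):
    print(f"{bpw}bpw -> ~{weights_gb(bpw):.1f} GB of weights")

# 2.75bpw -> ~42.3 GB: leaves room for the 4-bit cache and context.
# 3.0bpw  -> ~46.1 GB: almost nothing left on 48 GB, matching reports
#            that it barely fits (6k context on Linux) or not at all.
```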
If you have any questions or issues, just post them. :)
Otherwise: Have fun!
Discussion Summary
This discussion centers on successfully running the Mistral Large (123b) model on hardware with 48 GB of VRAM. Participants shared detailed setup steps, hardware configurations, model download links, and roleplay settings for SillyTavern. The discussion also covered performance statistics, an example chat, and how well different quantized versions of the model fit. In addition, some comments explored how the model behaves across different cultural backgrounds, and how to improve processing speed at specific context lengths.
Main Points
- 👍 Successfully running Mistral Large (123b) on 48 GB of VRAM
  - In favor: Detailed setup steps and settings were provided, including updating ooba and downloading a specific quantized version of the model.
  - Against: No notable objections; most users approved.
- 🔥 Using the model for roleplay
  - Pro: Settings for SillyTavern were shared, along with an example chat demonstrating the use case.
  - Con: No notable counterpoints; most users expressed interest.
- 💡 Fit of different quantized versions
  - Explanation: The discussion covered which quantized versions fit in VRAM, with relevant numbers and links.
- 🚀 Performance optimization
  - Explanation: Some users shared their experience optimizing generation speed, particularly at longer context lengths.
- 🌍 Cultural influence on AI model training
  - Explanation: The discussion touched on how the cultures of different countries shape model training, particularly France, the US, and China.
Notable Quotes and Interesting Comments
- “😂 The slop is strong in this one.”
  - Highlight: The commenter critiqued the descriptive writing in the AI roleplay chat as detailed but possibly overwrought.
- “🤔 I’ve got IQ quants of some small sizes here if anyone wants to go even lower.”
  - Highlight: The commenter shared smaller quantized versions they had made, encouraging others to try them.
- “👀 Exl2 3.0bpw of Mistral Large 123b runs on two 24GB GPUs, but context is limited to 6k.”
  - Highlight: The commenter shared their experience running the model on two 24GB GPUs, including the specific context limit.
Sentiment Analysis
Overall sentiment is positive: most users approved of running Mistral Large (123b) on 48 GB of VRAM and appreciated the detailed setup steps and settings. Some users explored performance optimization and cultural influences in more depth; the overall tone was technical and constructive.
Trends and Predictions
- Emerging topics: Performance optimization and differences in model behavior across cultural backgrounds may spark follow-up discussions.
- Potential impact: As the technology advances and user needs diversify, how well models fit and perform across hardware configurations and cultural contexts will likely become a focus of future discussions.