Creative Writing LLM Mega-Comparison


I’m a writer, and I use AI to help brainstorm by generating alternative versions of passages I’m struggling with. With all due respect to the horde of AI girlfriends out there, I feel like a lot of models aren’t analyzed enough in non-RP creative writing use cases. This post is my attempt to share my experiences using various models, usually larger ones, in my writing workflow.

Using fully neutral settings instead of my typical ones, I tested a simple two-sentence example of my own “rough-draft” writing style: “Her gaze snapped leftwards, in the direction of an assault to her senses. Beyond the dimly lit parking lot, a rusty pickup truck had crashed into a lamppost, with sparks erupting in its wake.”

Llama 3.1 70b Test: When asked to re-imagine the passage, this is what L3.1 70b responded with: “As a jarring assault on her senses unfolded, her head whipped sharply to the left. In the distance, beyond the faintly lit parking lot, a decrepit pickup truck had smashed into a lamppost, its impact triggering a cascade of sparks that illuminated the darkness.”

I still prefer my writing, because I find Llama’s style to be overly verbose when it’s not necessary for that scene. It also took my implied meaning and stated it outright. Instead of letting the audience infer that the experience was jarring from the “assault” on her senses, Llama-3.1 retained my description but added “jarring” to it. Also, it took my description of a “rusty pickup truck” and changed it to “decrepit,” which sorta works, but is more ambiguous. The only part I feel improves the writing is “sparks that illuminated the darkness,” though it’s a bit unnecessary to reiterate the “faintly lit parking lot” as being shrouded in “darkness.” Nonetheless, this is passable, and far better than most AI models. It’s not objectively worse than my writing in any major way, but I’m just not a fan of its style.

Mistral Large 123B Test: This has become a new daily-driver for me, potentially dethroning New Dawn Llama. Only time will tell, but it’s been solid thus far. This was its response: “Her eyes darted to the left, drawn by the sudden sensory assault. In the faintly lit parking lot, an old pickup truck had collided with a lamppost, sending a cascade of sparks into the air.”

I think this mostly improved upon my original writing, and I would use pieces of this in my own work. I particularly like the “drawn by the sudden sensory assault” line, as it captured what I was going for, but with cleaner verbiage than my “in the direction of an assault to her senses.” If I were to use this, I’d rephrase it to “Her gaze snapped leftwards, drawn by the abrupt sensory assault.” My writing tends to get wordy, so I highly appreciate it when a model can shorten a passage while retaining its full meaning like Mistral Large 2 seems to be capable of. However, I do think it misunderstood the 2nd sentence here. The truck isn’t inside the parking lot, but beyond it. I’m 90% sure this misunderstanding is due to the low quant I’m using, but I haven’t tested any higher quants of this model.

Command R Plus 103B Test: Command R Plus has always been a mixed bag, but good for ideas. I find it struggles to maintain clarity similarly to Llama-3, but while using better diction. Here’s what it gave me (3.5bpw, 16384ctx): “Startled, she whipped her head to the left, attracted by the assault on her senses. In the gloom of the parking lot, she spotted a rusty pickup truck that had collided with a lamppost, sending sparks flying through the air.”

This is a great re-imagination for the sake of new ideas, but nothing else. This text isn’t as clear to the audience. Like, it starts the 2nd sentence with “In the gloom of the parking lot,” when the subject of the sentence is the crashed truck. Starting the sentence like that implies the entire sentence will take place inside the parking lot, instead of beyond it, like my writing clearly stated. Either way, it’s unclear if it misunderstood the truck’s location from my prompt, because the writing is simply too vague. This issue persisted in my test of the Q5 version, which runs agonizingly slowly on my PC. Also, describing the witness of a car accident as “attracted by it” seems bizarre. Llama-3.1 will probably dethrone Command R Plus for me as a backup option to my daily driver(s).

Command R 35B Test: I don’t use this model anymore, but it used to be a daily contender for me. This was its response (6bpw, 32768ctx): “Her eyes swiftly shifted towards the chaotic scene unfolding in the parking lot. A rusty pickup truck had collided with a lamppost, causing a shower of sparks to light up the otherwise dimly lit area.”

The first sentence is okay, but the second is kinda rough. Like its bigger sibling, it didn’t understand the truck’s location, and the quality is noticeably worse than the 103B output.

New Dawn Llama 70B Test: New-Dawn-Llama 70B is a creative writing finetune of Llama-3 by the incredibly talented Sophosympatheia, creator of Midnight Miqu 1.5. This is the only L3 70b finetune I truly believe is better than the original. I use this model every day, and I may go back to it depending on how I feel about Mistral Large 2 long term. This was its attempt at my prompt (4.5bpw, 32768ctx - lower quant only because I couldn’t find a 5bpw or 5.5bpw version at the time I downloaded it): “Her eyes darted to the left, where her senses were under siege. Beyond the dimly lit parking lot, a rusted pickup truck had careened into a lamppost, showering the area in a cascade of sparks.”

This easily trades blows with Mistral Large 2, but in my experience Mistral Large has better logical consistency and general knowledge of concepts. It’s not quite as “smooth” or condensed as Mistral Large, but I love how it interpreted my “assault to her senses” as “her senses were under siege.” I also love how it clarifies the physical orientation of subjects in the writing by saying “showering the area in a cascade of sparks.” This is easily the best version of the “sparks” line from any of these tests, and I’d definitely use it in my writing. New-Dawn-Llama can really paint a mental picture, and considering how little I enjoy using L3 (3.1 being a huge improvement), it’s beyond me how Sophosympatheia managed to optimize it so well. If Mistral Large didn’t have such a strong sense of logic, this would remain the best open model I know of for creative writing.

WizardLM 2 8x22b Test: I opt for this model whenever I need fresh ideas and totally new writing. I’ll ask it to “expand upon” things, but seldom to “reimagine” something. Perhaps this is due to my computer being no match for a model of this size. I can barely run Mistral Large as it is, and this model is even bigger. Here was its attempt (3bpw, 16384ctx): “Her eyes darted abruptly to the left, drawn by a sudden commotion. In the shadowy expanse of the parking area, a weathered pickup truck had violently collided with a lamppost, sending a cascade of sparks shooting into the air from the impact zone.”

I really like the “drawn by a sudden commotion” phrasing here. Like Mistral and CommandR, it mistakenly thought the truck was inside the parking lot. Also, this is too long, but I can appreciate the overall writing quality.
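The truck-location slip that keeps coming up in these tests can be spot-checked mechanically. Here is a crude sketch, assuming that a rewrite which keeps the word “beyond” has kept the truck outside the parking lot. That is only a keyword heuristic, not a real test of understanding, and the function name `keeps_beyond` is made up for illustration. The strings are the second sentences of the outputs quoted above.

```python
import re

# Second sentences of the rewrites quoted in this post, keyed by model name.
REWRITES = {
    "Llama 3.1 70b": "In the distance, beyond the faintly lit parking lot, a decrepit pickup truck had smashed into a lamppost, its impact triggering a cascade of sparks that illuminated the darkness.",
    "Mistral Large 123B": "In the faintly lit parking lot, an old pickup truck had collided with a lamppost, sending a cascade of sparks into the air.",
    "Command R Plus 103B": "In the gloom of the parking lot, she spotted a rusty pickup truck that had collided with a lamppost, sending sparks flying through the air.",
    "Command R 35B": "A rusty pickup truck had collided with a lamppost, causing a shower of sparks to light up the otherwise dimly lit area.",
    "New Dawn Llama 70B": "Beyond the dimly lit parking lot, a rusted pickup truck had careened into a lamppost, showering the area in a cascade of sparks.",
    "WizardLM 2 8x22b": "In the shadowy expanse of the parking area, a weathered pickup truck had violently collided with a lamppost, sending a cascade of sparks shooting into the air from the impact zone.",
}


def keeps_beyond(text: str) -> bool:
    """ASSUMPTION: the whole word 'beyond' is a proxy for 'truck stays outside the lot'."""
    return re.search(r"\bbeyond\b", text, flags=re.IGNORECASE) is not None


# Names of the models whose rewrite passes the keyword check.
kept = [name for name, text in REWRITES.items() if keeps_beyond(text)]

if __name__ == "__main__":
    for name, text in REWRITES.items():
        status = "kept 'beyond'" if keeps_beyond(text) else "dropped 'beyond'"
        print(f"{name}: {status}")
```

A check like this only flags candidates for a closer read. A rewrite could keep the location with different wording, or keep the word and still muddle the scene.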

Final Ratings:

Mistral Large 2 123B EXL2 3.5bpw: 8.5/10, better than New Dawn only because of my broader experiences with it, where I’ve found superior logical reasoning and awareness. I wish it understood prompts and user intent better, but I think that might be due to the low quant. Unfortunately, it can be a bit plain, similar to Mistral 8x7b - but I wouldn’t say it’s an assistant-coded model like prior Mistral releases. The biggest drawback I’m noticing in Mistral Large so far is its tendency to degrade past ~600 tokens when responding.

New-Dawn-Llama 70B EXL2 4.5bpw: 8.5/10, the best L3 model for creative writing and it’s not even close, might be the best writing model entirely. Unfortunately, without existing writing to work from, it can struggle. This model is more of a “mirror” than other models I’ve seen, very much at its best when it can improve upon something instead of generating new content. Nonetheless, its strengths, while limited in scope, are very solid.

Llama 3.1 70b EXL2 5bpw: 7.5/10, a huge improvement over L3 but still a bit “spacey” in my view. Like a very intelligent person with brain damage, and you never know when its subtle traces will appear.

WizardLM 2 8x22B EXL2 3bpw: 7.5/10. I feel like if I could run this model at a reasonable quant (3.5 at least) it might outperform everything else. If only I had a spare kidney to sell.

Command R Plus 103B EXL2 3.5bpw: 6/10, really great for its time, but new models have been a lot better for me.

Honorable Mentions:

Llama 3.1 8b EXL2 6bpw: 9/10. Rated so high because of its size. As much as I dislike L3, I don’t think you can do better than this if you can’t run anything larger. Gemma 9b would probably be a better option? But I haven’t tested that one yet. The writing quality of L3.1 8B is actually kinda close to the 70b version, at least in my limited testing.

Gemma 2 27b EXL2 6bpw: 7.5/10. I hope someone can abliterate this or something. 8k context is rough, but I love how it’s retained a Gemini Advanced writing style. This could revolutionize the “small” model space. Google has been leading in writing quality for almost a year now, and I’m not surprised, considering how competent this model is.

Midnight-Miqu 1.5 103B EXL2 3.5bpw: 7.25/10, I liked this better than the 70b version, but with it being a self-merge I don’t really trust that it truly is better. I ran it at 3.5bpw, and so it wasn’t as logical as the 70b version at 6bpw.

Command R 35B EXL2 6bpw: 6.5/10, This was one of the models that REALLY got me involved with LLMs. Cohere really popped off with this one, but I hope Gemma 2 27B can overthrow it. Perhaps quantization really messes with CmdR, because I actually like this one more than its 103b counterpart.

Midnight-Miqu 1.5 70B EXL2 6bpw: 7/10, this has to be the most inconsistent model I’ve ever used. It will spit out something beautiful, then follow up with 5+ attempts of pure garbage. This model had me hitting “retry” like there was no tomorrow, but when it worked, it all felt worth it. This model also needs HELP. Like custom prompting and everything.

Llama 3 70b EXL2 5bpw: 6.75/10. It’s not a bad model, but I’m unsure why people think it’s great.

Fimbulvetr Kuro Lotus 10.7B EXL2 6bpw: 5.5/10. This is one of my favorite tiny models despite being like 10,000 years old. It’s very RP centric but in general writing tasks it’s actually okay. I don’t really care about models this small, so maybe I’m not giving it a fair rating. I used this a lot 6 months ago when temporarily without my workstation.

RPStew 2 34B EXL2 6bpw: 4.5/10. Just use Command R I guess? It’s pretty bad at writing, but maybe that’s the RP alignment.

Magnum 72b EXL2 4.5bpw: 4/10. I tried this out of curiosity, and I don’t get the appeal. It’s not unusable, but it’s very sloppy at writing. To prove my point, I loaded it up to test it with the prompt I used for this post. It said, “Her eyes darted to her left, drawn by an onslaught on her senses. In the poorly illuminated parking lot, a dilapidated pickup truck had careened into a streetlight, igniting a shower of sparks in its aftermath.” It’s not horrible, but it’s definitely not good. Trying it again, I got “Across the shadowy parking lot, a dilapidated pickup truck had slammed into a streetlight, trailing a cascade of sparking electricity in its destructive path.” Go girl, give us nothing! Its need to specify “destructive path” makes this feel like a story written for a 5-year-old, where authors have to spell out every implication. Qwen & its finetunes always give me Cocomelon-ass responses, and I don’t get why people like these models?

For the sake of disclosure, here are my workstation specs. It took me over a year of saving to build it, hence why I didn’t even bother testing anything larger like 200-400B. If my PC can’t run something at realtime speeds, I’d assume almost nobody normal could afford a system that can. When I built it, I was still working a job in video editing, hence why I didn’t go for something more cost-effective and AI-friendly: RTX 3090 FE (PCIE X16), RTX 3090 (PCIE X4), RTX 3060 (PCIE X1), Ryzen 7950X, 128GB DDR5. All models were run within Oobabooga WebUI in EXL2 format, under a Windows 11 host.
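For anyone wondering why the quants above are so low, a rough way to estimate whether a quant fits is parameters × bits-per-weight ÷ 8 bytes. The sketch below applies that to the models and quants quoted in this post. It assumes 24GB per RTX 3090 and the 12GB variant of the RTX 3060 (the post does not state VRAM sizes), and it ignores the context cache and runtime overhead, which need extra room on top. The helper names are made up for illustration.

```python
# Rough estimate of quantized weight size: parameters * bits-per-weight / 8 bytes.
# It ignores the context (KV) cache and runtime overhead.

# Model sizes (billions of parameters) and quants (bpw) as quoted in this post.
MODELS = {
    "Mistral Large 2 123B": (123, 3.5),
    "Command R Plus 103B": (103, 3.5),
    "New-Dawn-Llama 70B": (70, 4.5),
    "Llama 3.1 70b": (70, 5.0),
    "Command R 35B": (35, 6.0),
}

# ASSUMPTION: VRAM per card in GiB (24 per RTX 3090, 12 for the RTX 3060).
VRAM_GIB = [24, 24, 12]


def weight_gib(params_billion: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    total_bytes = params_billion * 1e9 * bpw / 8
    return total_bytes / 2**30


def report() -> list[str]:
    """One line per model: estimated weight size and what is left of total VRAM."""
    total = sum(VRAM_GIB)
    lines = []
    for name, (params, bpw) in MODELS.items():
        size = weight_gib(params, bpw)
        lines.append(
            f"{name} @ {bpw}bpw: ~{size:.1f} GiB weights, "
            f"~{total - size:.1f} GiB left of {total}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Under those assumptions, the 123B model at 3.5bpw leaves only about 10 GiB across all three cards for context and overhead, which is consistent with it being described as barely runnable here.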

TLDR: New-Dawn-Llama 70B & Mistral Large 2 123B are my favorite writing models, even compared to Llama 3.1 70b. I don’t like Llama 3 70b at all though, and Command R Plus is showing its age (or maybe my quant is too low). WizardLM 2 8x22b is great, but it runs a bit verbose and can struggle with logical consistency. Then again, my PC can barely run it, so maybe that’s why.

EDIT #1: There’s a very important consideration I forgot to mention with New Dawn Llama and Mistral Large 2. One of the things that makes New Dawn Llama so writer-friendly is its ability to output extremely long responses without going off the rails in them. I usually have it output between 768-1024 tokens at a time, and it has zero issues doing that. DRY sampler + repetition penalty goes a long way here too. I can’t say the same for Mistral Large 2, which needs to be coaxed beyond 500 tokens regardless of settings. Thankfully, ML2 is very receptive to being asked to slow down and lengthen its responses, but sometimes, this causes it to over-correct. When this happens, it will over-explain and over-fill a response until it feels like a patience exercise. Most of the time, it doesn’t do this, but I’ve noticed it’s more common when the prompt is 16k tokens long. Perhaps it wasn’t optimized to receive or generate long strings of text. ML2 also goes totally off the rails with repetition penalty, but benefits from a touch of DRY repetition sampling. Nonetheless, I still think these models are on a similarly top-tier level. Both have major weaknesses and strengths.

EDIT #2: After reading through the comments, I decided to add a few new test runs. I won’t add these models to my rankings if they aren’t there already, because I don’t have a lot of experience with them.

Lumimaid v0.2 (Mistral Large 2) 123B: I hope someone makes a 3.5bpw EXL2 quant soon, because what I’m seeing is quite promising. Here’s its attempt (3bpw, 32768ctx): “Her eyes darted to the left, drawn by a sudden disturbance. Beyond the poorly-lit parking area, an old, rusted pickup truck had collided with a lamppost, sending a shower of sparks flying in its wake.”

I didn’t expect the first finetune of this model I’ve heard of to perform this well? It understood the truck’s location, unlike Large 2 at 3.5bpw, and most other models regardless of quant. Additionally, it didn’t add any slop, and retained the exact meaning of the original passage while substantially modifying it. To ensure this wasn’t a fluke, I ran it 5 more times, and it remained perfectly consistent in meaning between iterations. Also, it’s worth noting how it actually improved the flow of the second sentence. Genuinely knocked it out of the park here, and I’m very impressed for a 3bpw quant especially.

Big Tiger Gemma (Gemma 2) 27B: As a big Gemini Advanced fan, I have not given the local models enough attention. This is mostly due to the 8k context window. Nonetheless, I applaud this finetune for retaining Gemma’s phenomenal writing instincts while stripping its corporate alignment. This is its attempt (8bpw, 8192ctx): “Her eyes darted left, drawn by a chaotic scene that assaulted her senses. In a dimly lit parking lot, a rusty pickup truck had careened into a lamppost, sending sparks flying in all directions.”

This messed up the truck location, but as we’ve seen, few models don’t do that with my test prompt. Looking past that, though, this is quite good. I didn’t expect a 27B model to be capable of enhancing the existing tone in non-obvious ways. Unlike other models, which like to do something along the lines of “it was big scary and bad!!” this model took the frenzied tone of the scene and applied it to the sparks, describing them as “flying in all directions.” That change added much needed movement and chaos to the scene - elements which my prompt downplayed. It didn’t merely SAY the scene was “chaotic,” though it did, it also adjusted how elements were described to reflect this. Very impressive considering Magnum 72B couldn’t practice restraint like this, instead opting to explain to the reader that a car crash would leave a “destructive path.” Even WizardLM2 8x22b, which is a very well respected model, decided to explain how the sparks indeed emanated “from the impact zone,” as if the reader would otherwise assume the sparks flew from her ass or something. Very impressive, Google! I hope they take a hint from Facebook and give us something with usable context, and larger models.

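Discussion Summary

The author shares their experience using different AI models for creative writing, focusing on tests and comparisons of several large models. The post provides a concrete writing sample and describes each model’s output in detail, including the author’s preferences and assessment of each one. The author also discusses the models’ logical consistency, word choice, and ability to understand user intent, along with hopes for future models.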

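Main Points

👍 Mistral Large 2 123B
- In favor: Performs well in logical reasoning and conceptual understanding, but is sometimes too plain.
- Against: Sometimes misunderstands prompts and needs a better grasp of user intent.
🔥 New-Dawn-Llama 70B
- In favor: Excels at creative writing, especially when improving existing text rather than generating new content.
- Against: Can struggle when it has no existing writing to work from.
💡 Llama 3.1 70b
- Improved, but still feels a bit “spacey” and is sometimes overly verbose.
🚀 WizardLM 2 8x22B
- Good at supplying new ideas, but its logical consistency needs work.
🌟 Command R Plus 103B
- Performed well in the past, but newer models have surpassed it, and it needs to improve at keeping text clear.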

Notable Quotes and Interesting Comments

“😂 I still prefer my writing, because I find Llama’s style to be overly verbose when it’s not necessary for that scene.”
- Highlight: The author’s direct criticism of the Llama model, stressing the importance of concision.
“🤔 Mistral Large 2 123B EXL2 3.5bpw: 8.5/10, better than New Dawn only because of my broader experiences with it, where I’ve found superior logical reasoning and awareness.”
- Highlight: The author’s assessment of Mistral Large 2, stressing its strengths in logical reasoning and awareness.
“👀 New-Dawn-Llama 70B EXL2 4.5bpw: 8.5/10, the best L3 model for creative writing and it’s not even close, might be the best writing model entirely.”
- Highlight: The author’s high praise for New-Dawn-Llama, which they consider possibly the best creative writing model.

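Sentiment Analysis

The overall sentiment of the discussion is positive, with the author rating several AI models highly for creative writing. The main points of disagreement concern how the models differ in logical consistency, understanding of user intent, and quality of generated text. Possible causes include the models’ training data, quantization settings, and the influence of user prompts.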

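Trends and Predictions

Emerging topics: More AI models optimized for creative writing may appear in the future, particularly with better understanding of user intent and higher-quality output.
Potential impact: Improvements to these models will greatly raise the efficiency and quality of creative writing, benefiting writers and content creators.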
