This text-to-speech (TTS) method was built with Qwen 2.5. I think it is similar to Llasa. I'm not sure whether anyone has posted it before.
- Hugging Face Space: https://huggingface.co/spaces/Mobvoi/Offical-Spark-TTS
- Paper: https://arxiv.org/pdf/2503.01710
- GitHub repository: https://github.com/SparkAudio/Spark-TTS
- Weights: https://huggingface.co/SparkAudio/Spark-TTS-0.5B
- Demos: https://sparkaudio.github.io/spark-tts/
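This is a discussion of the Spark-TTS model. Participants covered several aspects of it, including comparisons with other models, voice cloning quality, generation speed, audio quality, and some amusing pronunciation effects across languages. The overall tone is positive: commenters show strong interest in the model and appreciation for the development team's work.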
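- Emerging topic: whether the model can support streaming may prompt follow-up discussion.
- Potential impact: if the model keeps improving in speed and audio quality, it could benefit the speech synthesis field, for example through use in more voice-interaction devices.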
Title: Spark-TTS, an Efficient LLM-Based Text-to-Speech Model, Sparks Lively Discussion
A recent Reddit post titled "Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens" drew considerable attention. The post explains that this TTS method is built on Qwen 2.5 and links to the Hugging Face Space, paper, GitHub repository, weights, and demos. So far it has collected many upvotes and a large number of comments.
The discussion centers on Spark-TTS's performance and characteristics. One commenter praised it: "Holy shit this is as good as llasa using half the size (of their smallest llm model) and has better license. Like why does it feel like it’s christmas every week in this space?" Another wrote: "The voice cloning is so fantastic that it is shocking how I would mistaken it for the real person. English is good with only a little AI quality to it. But the Chinese voices are so realistic." Someone else said: "Demos sounds very strong. Thank you. 👍" One user found the source notable: "Interesting that a company known for smartwatches is releasing TTS models." Another shared a personal impression: "It’s funny to listen to the American voices speak Chinese. They sound like fluent non-native speakers. It’s hard to put your finger on why. You get a similar effect with the native Chinese input samples being used to produce English."
On generation speed, one user reported: "Took about 35 seconds to generate 46 seconds of speech on a 4090 with a 27 second long cloning sample." Another said: "Without a cloning sample, it takes 46 seconds to generate 56 seconds of speech." A further comment pointed out: "When cloning, the output appears to match the cadence of the input sample, while non-clone generation takes speed and pitch as tunable inputs. This could be a factor." Someone else gave this feedback: "On my rtx 3060, it took 48s to make 23s audio. The quality is really good, the only issue for me is it created pauses at odd positions in the audio file. A normal person would never use pauses like that."
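These timings are easier to compare as a real-time factor (RTF), meaning generation time divided by the duration of the audio produced, where a value below 1 means faster than real time. The sketch below applies that ratio to the three figures quoted above. The function name and the labels are illustrative, and the second report does not state which GPU was used.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return generation time divided by audio duration (below 1 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Timings as reported by commenters: (label, generation seconds, audio seconds).
# Labels are illustrative; the thread gives no further hardware or settings details.
REPORTED = [
    ("RTX 4090, with cloning sample", 35.0, 46.0),
    ("no cloning sample (GPU not stated)", 46.0, 56.0),
    ("RTX 3060", 48.0, 23.0),
]

for label, gen_s, audio_s in REPORTED:
    rtf = real_time_factor(gen_s, audio_s)
    verdict = "faster" if rtf < 1 else "slower"
    print(f"{label}: RTF = {rtf:.2f} ({verdict} than real time)")
```

By this measure the first two reports come out somewhat faster than real time and the RTX 3060 report roughly twice as slow. An RTF below 1 is necessary for seamless streaming but not sufficient, since the delay before the first audio chunk also matters and none of these reports measure it.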
One user who could not run the model at the time was curious: "I can’t run this right now away from oc. I’m wondering is it faster than realtime? The demos sound incredible. Would it work for streaming to have a seamless convo? None the less amazing qork to the team!"
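Throughout the thread, participants voiced a range of views on Spark-TTS's performance and characteristics, from praise for its strong results to attention to specific issues such as oddly placed pauses and generation speed. Taken together, these perspectives give a fuller picture of Spark-TTS.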