Commit bed01bd by zhzluke96
1 Parent(s): 37195a7
update
Browse files
- CHANGELOG.md +150 -7
- launch.py +49 -11
- modules/ChatTTS/ChatTTS/core.py +124 -93
- modules/ChatTTS/ChatTTS/infer/api.py +4 -0
- modules/ChatTTS/ChatTTS/model/gpt.py +123 -74
- modules/ChatTTS/ChatTTS/utils/infer_utils.py +31 -6
- modules/Enhancer/ResembleEnhance.py +1 -1
- modules/SentenceSplitter.py +32 -1
- modules/SynthesizeSegments.py +30 -17
- modules/api/api_setup.py +5 -98
- modules/api/impl/handler/AudioHandler.py +19 -1
- modules/api/impl/handler/SSMLHandler.py +5 -0
- modules/api/impl/handler/TTSHandler.py +60 -1
- modules/api/impl/model/audio_model.py +4 -0
- modules/api/impl/tts_api.py +18 -4
- modules/api/impl/xtts_v2_api.py +97 -37
- modules/api/worker.py +3 -7
- modules/devices/devices.py +7 -1
- modules/finetune/train_speaker.py +2 -2
- modules/generate_audio.py +84 -8
- modules/models.py +11 -2
- modules/models_setup.py +74 -0
- modules/normalization.py +35 -24
- modules/refiner.py +8 -0
- modules/repos_static/resemble_enhance/inference.py +5 -5
- modules/speaker.py +6 -0
- modules/synthesize_audio.py +0 -2
- modules/synthesize_stream.py +42 -0
- modules/utils/HomophonesReplacer.py +39 -0
- modules/utils/audio.py +64 -58
- modules/utils/detect_lang.py +27 -0
- modules/utils/html.py +26 -0
- modules/utils/ignore_warn.py +9 -0
- modules/utils/markdown.py +1 -0
- modules/webui/localization_runtime.py +22 -0
- modules/webui/speaker/speaker_creator.py +6 -6
- modules/webui/ssml/podcast_tab.py +1 -1
- modules/webui/ssml/ssml_tab.py +61 -14
- modules/webui/tts_tab.py +49 -2
- modules/webui/webui_utils.py +83 -31
- requirements.txt +11 -9
- webui.py +5 -6
CHANGELOG.md
CHANGED
@@ -1,22 +1,150 @@
 # Changelog
 
-<a name="0.6.0"></a>
-## 0.6.0 (2024-06-12)
+<a name="0.6.2-rc"></a>
+## 0.6.2-rc (2024-06-23)
 
 ### Added
 
+- ✨ add adjuster to webui [[01f09b4](https://github.com/lenML/ChatTTS-Forge/commit/01f09b4fad2eb8b24a16b7768403de4975d51774)]
+- ✨ stream mode support adjuster [[585d2dd](https://github.com/lenML/ChatTTS-Forge/commit/585d2dd488d8f8387e0d9435fb399f090a41b9cc)]
+- ✨ improve xtts_v2 api [[fec66c7](https://github.com/lenML/ChatTTS-Forge/commit/fec66c7c00939a3c7c15e007536e037ac01153fa)]
+- ✨ improve normalize [[d0da37e](https://github.com/lenML/ChatTTS-Forge/commit/d0da37e43f1de4088ef638edd90723f93894b1d2)]
+- ✨ improve normalize/spliter [[163b649](https://github.com/lenML/ChatTTS-Forge/commit/163b6490e4d453c37cc259ce27208f55d10a9084)]
+- ✨ add loudness equalization [[bc8bda7](https://github.com/lenML/ChatTTS-Forge/commit/bc8bda74825c31985d3cc1a44366ad92af1b623a)]
+- ✨ support `--use_cpu=chattts,enhancer,trainer,all` [[23023bc](https://github.com/lenML/ChatTTS-Forge/commit/23023bc610f6f74a157faa8a6c6aacf64d91d870)]
+- ✨ improve normalizetion.py [[1a7c0ed](https://github.com/lenML/ChatTTS-Forge/commit/1a7c0ed3923234ceadb79f397fa7577f9e682f2d)]
+- ✨ ignore_useless_warnings [[4b9a32e](https://github.com/lenML/ChatTTS-Forge/commit/4b9a32ef821d85ceaf3d62af8f871aeb5088e084)]
+- ✨ enhance logger, info => debug [[73bc8e7](https://github.com/lenML/ChatTTS-Forge/commit/73bc8e72b40146debd0a59100b1cca4cc42f5029)]
+- ✨ add playground.stream page [[31377b0](https://github.com/lenML/ChatTTS-Forge/commit/31377b060c182519d74a12d81e66c8e73686bcd8)]
+- ✨ tts api support stream [#5](https://github.com/lenML/ChatTTS-Forge/issues/5) [[15e0b2c](https://github.com/lenML/ChatTTS-Forge/commit/15e0b2cb051ba39dcf99f60f1faa11941f6dc656)]
+
+### Changed
+
+- 🍱 add _p_en [[56f1fbf](https://github.com/lenML/ChatTTS-Forge/commit/56f1fbf1f3fff6f76ca8c29aa12a6ddef665cf9f)]
+- 🍱 update prompt [[4f95b31](https://github.com/lenML/ChatTTS-Forge/commit/4f95b31679225e1ee144a411a9cfa9b30c598450)]
+- ⚡ Reduce popping sounds [[2d0fd68](https://github.com/lenML/ChatTTS-Forge/commit/2d0fd688ad1a5cff1e6aafc0502aee26de3f1d75)]
+- ⚡ improve `apply_character_map` [[ea7399f](https://github.com/lenML/ChatTTS-Forge/commit/ea7399facc5c29327a7870bd66ad6222f5731ce3)]
+
+### Fixed
+
+- 🐛 fix `apply_normalize` missing `sr` [[2db6d65](https://github.com/lenML/ChatTTS-Forge/commit/2db6d65ef8fbf8a3a213cbdc3d4b1143396cc165)]
+- 🐛 fix sentence spliter [[5d8937c](https://github.com/lenML/ChatTTS-Forge/commit/5d8937c169d5f7784920a93834df0480dd3a67b3)]
+- 🐛 fix playground url_join [[53e7cbc](https://github.com/lenML/ChatTTS-Forge/commit/53e7cbc6103bc0e3bb83767a9233c45285b77e75)]
+- 🐛 fix generate_audio args [[a7a698c](https://github.com/lenML/ChatTTS-Forge/commit/a7a698c760b5bc97c90a144a4a7afb5e17414995)]
+- 🐛 fix infer func [[b0de527](https://github.com/lenML/ChatTTS-Forge/commit/b0de5275342c02d332a50d0ab5ac171a7007b300)]
+- 🐛 fix webui logging format [[4adc29e](https://github.com/lenML/ChatTTS-Forge/commit/4adc29e6c06fa806a8178f445399bbac8ed57911)]
+- 🐛 fix webui speaker_tab missing progress [[fafe242](https://github.com/lenML/ChatTTS-Forge/commit/fafe242e69ea8019729a62e52f6c0b3c0d6a63ad)]
+
+### Miscellaneous
+
+- 📝 添加整合包地址 [[26122d4](https://github.com/lenML/ChatTTS-Forge/commit/26122d4cfd975206211fc37491348cf40aa39561)]
+- 📝 details `.env` file and cli usage docs [[ec3d36f](https://github.com/lenML/ChatTTS-Forge/commit/ec3d36f8a67215e243e6b8225aa9144ac888313a)]
+- 📝 update changelog [[22996e9](https://github.com/lenML/ChatTTS-Forge/commit/22996e9f0c42d9cad59950aecfe6b16413f2ab40)]
+- Windows not yet supported for torch.compile fix [[74ac27d](https://github.com/lenML/ChatTTS-Forge/commit/74ac27d56a370f87560329043c42be27022ca0f5)]
+- fix: replace mispronounced words in TTS [[de66e6b](https://github.com/lenML/ChatTTS-Forge/commit/de66e6b8f7f8b5c10e7ac54f7b2488c798e5ef81)]
+- feat: support stream mode [[3da0f0c](https://github.com/lenML/ChatTTS-Forge/commit/3da0f0cb7f213dee40d00a89093166ad9e1d17a0)]
+- optimize: mps audio quality by contiguous scores [[1e4d79f](https://github.com/lenML/ChatTTS-Forge/commit/1e4d79f1a81a3ac8697afff0e44f0cfd2608599a)]
+- 📝 update changelog [[ab55c14](https://github.com/lenML/ChatTTS-Forge/commit/ab55c149d48edc52f1de9c6d4fe6e6ed78b3b134)]
+
+
+<a name="0.6.1"></a>
+
+## 0.6.1 (2024-06-18)
+
+### Added
+
+- ✨ add `--preload_models` [[73a41e0](https://github.com/lenML/ChatTTS-Forge/commit/73a41e009cd4426dfe4b0a35325da68189966390)]
+- ✨ add webui progress [[778802d](https://github.com/lenML/ChatTTS-Forge/commit/778802ded12de340520f41a3e1bdb852f00bd637)]
+- ✨ add merger error [[51060bc](https://github.com/lenML/ChatTTS-Forge/commit/51060bc343a6308493b7d582e21dca62eacaa7cb)]
+- ✨ tts prompt => experimental [[d3e6315](https://github.com/lenML/ChatTTS-Forge/commit/d3e6315a3cb8b1fa254cefb2efe2bae7c74a50f8)]
+- ✨ add 基本的 speaker finetune ui [[5f68f19](https://github.com/lenML/ChatTTS-Forge/commit/5f68f193e78f470bd2c3ca4b9fa1008cf809e753)]
+- ✨ add speaker finetune [[5ce27ed](https://github.com/lenML/ChatTTS-Forge/commit/5ce27ed7e4da6c96bb3fd016b8b491768faf319d)]
+- ✨ add `--ino_half` remove `--half` [[5820e57](https://github.com/lenML/ChatTTS-Forge/commit/5820e576b288df50b929fbdfd9d0d6b6f548b54e)]
+- ✨ add webui podcast 默认值 [[dd786a8](https://github.com/lenML/ChatTTS-Forge/commit/dd786a83733a71d005ff7efe6312e35d652b2525)]
+- ✨ add webui 分割器配置 [[589327b](https://github.com/lenML/ChatTTS-Forge/commit/589327b729188d1385838816b9807e894eb128b0)]
+- ✨ add `eos` params to all api [[79c994f](https://github.com/lenML/ChatTTS-Forge/commit/79c994fadf7d60ea432b62f4000b62b67efe7259)]
+
+### Changed
+
+- ⬆️ Bump urllib3 from 2.2.1 to 2.2.2 [[097c15b](https://github.com/lenML/ChatTTS-Forge/commit/097c15ba56f8197a4f26adcfb77336a70e5ed806)]
+- 🎨 run formatter [[8c267e1](https://github.com/lenML/ChatTTS-Forge/commit/8c267e151152fe2090528104627ec031453d4ed5)]
+- ⚡ Optimize `audio_data_to_segment` [#57](https://github.com/lenML/ChatTTS-Forge/issues/57) [[d33809c](https://github.com/lenML/ChatTTS-Forge/commit/d33809c60a3ac76a01f71de4fd26b315d066c8d3)]
+- ⚡ map_location="cpu" [[0f58c10](https://github.com/lenML/ChatTTS-Forge/commit/0f58c10a445efaa9829f862acb4fb94bc07f07bf)]
+- ⚡ colab use default GPU [[c7938ad](https://github.com/lenML/ChatTTS-Forge/commit/c7938adb6d3615f37210b1f3cbe4671f93d58285)]
+- ⚡ improve hf calling [[2dde612](https://github.com/lenML/ChatTTS-Forge/commit/2dde6127906ce6e77a970b4cd96e68f7a5417c6a)]
+- 🍱 add `bob_ft10.pt` [[9eee965](https://github.com/lenML/ChatTTS-Forge/commit/9eee965425a7d6640eba22d843db4975dd3e355a)]
+- ⚡ enhance SynthesizeSegments [[0bb4dd7](https://github.com/lenML/ChatTTS-Forge/commit/0bb4dd7676c38249f10bf0326174ff8b74b2abae)]
+- 🍱 add `bob_ft10.pt` [[bef1b02](https://github.com/lenML/ChatTTS-Forge/commit/bef1b02435c39830612b18738bb31ac48e340fc6)]
+- ♻️ refactor api [[671fcc3](https://github.com/lenML/ChatTTS-Forge/commit/671fcc38a570d0cb7de0a214d318281084c9608c)]
+- ⚡ improve xtts_v2 api [[206fabc](https://github.com/lenML/ChatTTS-Forge/commit/206fabc76f1dbad261c857cb02f8c99c21e99eef)]
+- ⚡ train text => just text [[e2037e0](https://github.com/lenML/ChatTTS-Forge/commit/e2037e0f97f15ff560fce14bbdc3926e3261bff9)]
+- ⚡ improve TN [[a0069ed](https://github.com/lenML/ChatTTS-Forge/commit/a0069ed2d0c3122444e873fb13b9922f9ab88a79)]
+
+### Fixed
+
+- 🐛 fix webui speaker_editor missing `describe` [[2a2a36d](https://github.com/lenML/ChatTTS-Forge/commit/2a2a36d62d8f253fc2e17ccc558038dbcc99d1ee)]
+- 💚 Dependabot alerts [[f501860](https://github.com/lenML/ChatTTS-Forge/commit/f5018607f602769d4dda7aa00573b9a06e659d91)]
+- 🐛 fix `numpy<2` [#50](https://github.com/lenML/ChatTTS-Forge/issues/50) [[e4fea4f](https://github.com/lenML/ChatTTS-Forge/commit/e4fea4f80b31d962f02cd1146ce8c73bf75b6a39)]
+- 🐛 fix Box() index [#49](https://github.com/lenML/ChatTTS-Forge/issues/49) add testcase [[d982e33](https://github.com/lenML/ChatTTS-Forge/commit/d982e33ed30749d7ae6570ade5ec7b560a3d1f06)]
+- 🐛 fix Box() index [#49](https://github.com/lenML/ChatTTS-Forge/issues/49) [[1788318](https://github.com/lenML/ChatTTS-Forge/commit/1788318a96c014a53ee41c4db7d60fdd4b15cfca)]
+- 🐛 fix `--use_cpu` [#47](https://github.com/lenML/ChatTTS-Forge/issues/47) update conftest [[4095b08](https://github.com/lenML/ChatTTS-Forge/commit/4095b085c4c6523f2579e00edfb1569d65608ca2)]
+- 🐛 fix `--use_cpu` [#47](https://github.com/lenML/ChatTTS-Forge/issues/47) [[221962f](https://github.com/lenML/ChatTTS-Forge/commit/221962fd0f61d3f269918b26a814cbcd5aabd1f0)]
+- 🐛 fix webui speaker args [[3b3c331](https://github.com/lenML/ChatTTS-Forge/commit/3b3c3311dd0add0e567179fc38223a3cc5e56f6e)]
+- 🐛 fix speaker trainer [[52d473f](https://github.com/lenML/ChatTTS-Forge/commit/52d473f37f6a3950d4c8738c294f048f11198776)]
+- 🐛 兼容 win32 [[7ffa37f](https://github.com/lenML/ChatTTS-Forge/commit/7ffa37f3d36fb9ba53ab051b2fce6229920b1208)]
+- 🐛 fix google api ssml synthesize [#43](https://github.com/lenML/ChatTTS-Forge/issues/43) [[1566f88](https://github.com/lenML/ChatTTS-Forge/commit/1566f8891c22d63681d756deba70374e2b75d078)]
+
+### Miscellaneous
+
+- Merge pull request [#58](https://github.com/lenML/ChatTTS-Forge/issues/58) from lenML/dependabot/pip/urllib3-2.2.2 [[f259f18](https://github.com/lenML/ChatTTS-Forge/commit/f259f180af57f9a6938b14bf263d0387b6900e57)]
+- 📝 update changelog [[b9da7ec](https://github.com/lenML/ChatTTS-Forge/commit/b9da7ec1afed416a825e9e4a507b8263f69bf47e)]
+- 📝 update [[8439437](https://github.com/lenML/ChatTTS-Forge/commit/84394373de66b81a9f7f70ef8484254190e292ab)]
+- 📝 update [[ef97206](https://github.com/lenML/ChatTTS-Forge/commit/ef972066558d0b229d6d0b3d83bb4f8e8517558f)]
+- 📝 improve readme.md [[7bf3de2](https://github.com/lenML/ChatTTS-Forge/commit/7bf3de2afb41b9a29071bec18ee6306ce8e70183)]
+- 📝 add bug report forms [[091cf09](https://github.com/lenML/ChatTTS-Forge/commit/091cf0958a719236c77107acf4cfb8c0ba090946)]
+- 📝 update changelog [[3d519ec](https://github.com/lenML/ChatTTS-Forge/commit/3d519ec8a20098c2de62631ae586f39053dd89a5)]
+- 📝 update [[66963f8](https://github.com/lenML/ChatTTS-Forge/commit/66963f8ff8f29c298de64cd4a54913b1d3e29a6a)]
+- 📝 update [[b7a63b5](https://github.com/lenML/ChatTTS-Forge/commit/b7a63b59132d2c8dbb4ad2e15bd23713f00f0084)]
+
+<a name="0.6.0"></a>
+
+## 0.6.0 (2024-06-12)
+
+### Added
+
+- ✨ add XTTSv2 api [#42](https://github.com/lenML/ChatTTS-Forge/issues/42) [[d1fc63c](https://github.com/lenML/ChatTTS-Forge/commit/d1fc63cd1e847d622135c96371bbfe2868a80c19)]
+- ✨ google api 支持 enhancer [[14fecdb](https://github.com/lenML/ChatTTS-Forge/commit/14fecdb8ea0f9a5d872a4c7ca862e901990076c0)]
+- ✨ 修改 podcast 脚本默认 style [[98186c2](https://github.com/lenML/ChatTTS-Forge/commit/98186c25743cbfa24ca7d41336d4ec84aa34aacf)]
+- ✨ playground google api [[4109adb](https://github.com/lenML/ChatTTS-Forge/commit/4109adb317be215970d756b4ba7064c9dc4d6fdc)]
+- ✨ 添加 unload api [[ed9d61a](https://github.com/lenML/ChatTTS-Forge/commit/ed9d61a2fe4ba1d902d91517148f8f7dea47b51b)]
+- ✨ support api workers [[babdada](https://github.com/lenML/ChatTTS-Forge/commit/babdada50e79e425bac4d3074f8e42dfb4c4c33a)]
+- ✨ add ffmpeg version to webui footer [[e9241a1](https://github.com/lenML/ChatTTS-Forge/commit/e9241a1a8d1f5840ae6259e46020684ba70a0efb)]
+- ✨ support use internal ffmpeg [[0e02ab0](https://github.com/lenML/ChatTTS-Forge/commit/0e02ab0f5d81fbfb6166793cb4f6d58c5f17f34c)]
+- ✨ 增加参数 debug_generate [[94e876a](https://github.com/lenML/ChatTTS-Forge/commit/94e876ae3819c3efbde4a239085f91342874bd5a)]
+- ✨ 支持 api 服务与 webui 并存 [[4901491](https://github.com/lenML/ChatTTS-Forge/commit/4901491eced3955c51030388d1dcebf049cd790e)]
+- ✨ refiner api support normalize [[ef665da](https://github.com/lenML/ChatTTS-Forge/commit/ef665dad5a5517c610f0b430bc52a5b0ba3c2d96)]
+- ✨ add webui 音色编辑器 [[fb4c7b3](https://github.com/lenML/ChatTTS-Forge/commit/fb4c7b3b0949ac669da0d069c739934f116b83e2)]
 - ✨ add localization [[c05035d](https://github.com/lenML/ChatTTS-Forge/commit/c05035d5cdcc5aa7efd995fe42f6a2541abe718b)]
 - ✨ SSML 支持 enhancer [[5c2788e](https://github.com/lenML/ChatTTS-Forge/commit/5c2788e04f3debfa8bafd8a2e2371dde30f38d4d)]
 - ✨ webui 增加 podcast 工具 tab [[b0b169d](https://github.com/lenML/ChatTTS-Forge/commit/b0b169d8b49c8e013209e59d1f8b637382d8b997)]
-- ✨ 完善enhancer [[205ebeb](https://github.com/lenML/ChatTTS-Forge/commit/205ebebeb7530c81fde7ea96c7e4c6a888a29835)]
+- ✨ 完善 enhancer [[205ebeb](https://github.com/lenML/ChatTTS-Forge/commit/205ebebeb7530c81fde7ea96c7e4c6a888a29835)]
 
 ### Changed
 
+- ⚡ improve synthesize_audio [[759adc2](https://github.com/lenML/ChatTTS-Forge/commit/759adc2ead1da8395df62ea1724456dad6894eb1)]
+- ⚡ reduce enhancer chunk vram usage [[3464b42](https://github.com/lenML/ChatTTS-Forge/commit/3464b427b14878ee11e03ebdfb91efee1550de59)]
+- ⚡ 增加默认说话人 [[d702ad5](https://github.com/lenML/ChatTTS-Forge/commit/d702ad5ad585978f8650284ab99238571dbd163b)]
+- 🍱 add `podcast` `podcast_p` style [[2b9e5bf](https://github.com/lenML/ChatTTS-Forge/commit/2b9e5bfd8fe4700f802097b995f5b68bf1097087)]
+- 🎨 improve code [[317951e](https://github.com/lenML/ChatTTS-Forge/commit/317951e431b16c735df31187b1af7230a1608c41)]
 - 🍱 update banner [[dbc293e](https://github.com/lenML/ChatTTS-Forge/commit/dbc293e1a7dec35f60020dcaf783ba3b7c734bfa)]
 - ⚡ 增强 TN [[092c1b9](https://github.com/lenML/ChatTTS-Forge/commit/092c1b94147249880198fe2ad3dfe3b209099e19)]
 - ⚡ enhancer 支持 off_tqdm [[94d34d6](https://github.com/lenML/ChatTTS-Forge/commit/94d34d657fa3433dae9ff61775e0c364a6f77aff)]
 - ⚡ 增加 git env [[43d9c65](https://github.com/lenML/ChatTTS-Forge/commit/43d9c65877ff68ad94716bc2e505ccc7ae8869a8)]
-- ⚡ 修改webui保存文件格式 [[2da41c9](https://github.com/lenML/ChatTTS-Forge/commit/2da41c90aa81bf87403598aefaea3e0ae2e83d79)]
+- ⚡ 修改 webui 保存文件格式 [[2da41c9](https://github.com/lenML/ChatTTS-Forge/commit/2da41c90aa81bf87403598aefaea3e0ae2e83d79)]
+
+### Breaking changes
+
+- 💥 enhancer support --half [[fef2ed6](https://github.com/lenML/ChatTTS-Forge/commit/fef2ed659fd7fe5a14807d286c209904875ce594)]
 
 ### Removed
 
@@ -24,6 +152,17 @@
 
 ### Fixed
 
+- 🐛 fix worker env loader [[5b0bf4e](https://github.com/lenML/ChatTTS-Forge/commit/5b0bf4e93738bcd115f006376691c4eaa89b66de)]
+- 🐛 fix colab default lang missing [[d4e5190](https://github.com/lenML/ChatTTS-Forge/commit/d4e51901856305fc039d886a92e38eea2a2cd24d)]
+- 🐛 fix "reflection_pad1d" not implemented for 'Half' [[536c19b](https://github.com/lenML/ChatTTS-Forge/commit/536c19b7f6dc3f1702fcc2a90daa3277040e70f0)]
+- 🐛 fix [#33](https://github.com/lenML/ChatTTS-Forge/issues/33) [[76e0b58](https://github.com/lenML/ChatTTS-Forge/commit/76e0b5808ede71ebb28edbf0ce0af7d9da9bcb27)]
+- 🐛 fix localization error [[507dbe7](https://github.com/lenML/ChatTTS-Forge/commit/507dbe7a3b92d1419164d24f7804295f6686b439)]
+- 🐛 block main thread [#30](https://github.com/lenML/ChatTTS-Forge/issues/30) [[3a7cbde](https://github.com/lenML/ChatTTS-Forge/commit/3a7cbde6ccdfd20a6c53d7625d4e652007367fbf)]
+- 🐛 fix webui skip no-translate [[a8d595e](https://github.com/lenML/ChatTTS-Forge/commit/a8d595eb490f23c943d6efc35b65b33266c033b7)]
+- 🐛 fix hf.space force abort [[f564536](https://github.com/lenML/ChatTTS-Forge/commit/f5645360dd1f45a7bf112f01c85fb862ee57df3c)]
+- 🐛 fix missing device [#25](https://github.com/lenML/ChatTTS-Forge/issues/25) [[07cf6c1](https://github.com/lenML/ChatTTS-Forge/commit/07cf6c1386900999b6c9436debbfcbe59f6b692a)]
+- 🐛 fix Chat.refiner_prompt() [[0839863](https://github.com/lenML/ChatTTS-Forge/commit/083986369d0e67fcb4bd71930ad3d2bc3fc038fb)]
+- 🐛 fix --language type check [[50d354c](https://github.com/lenML/ChatTTS-Forge/commit/50d354c91c659d9ae16c8eaa0218d9e08275fbb2)]
 - 🐛 fix hparams config [#22](https://github.com/lenML/ChatTTS-Forge/issues/22) [[61d9809](https://github.com/lenML/ChatTTS-Forge/commit/61d9809804ad8c141d36afde51a608734a105662)]
 - 🐛 fix enhance 下载脚本 [[d2e14b0](https://github.com/lenML/ChatTTS-Forge/commit/d2e14b0a4905724a55b03493fa4b94b5c4383c95)]
 - 🐛 fix 'trange' referenced [[d1a8dae](https://github.com/lenML/ChatTTS-Forge/commit/d1a8daee61e62d14cf5fd7a17fab4424e24b1c41)]
@@ -33,10 +172,14 @@
 
 ### Miscellaneous
 
+- 🐳 fix docker / 兼容 py 3.9 [[ebb096f](https://github.com/lenML/ChatTTS-Forge/commit/ebb096f9b1b843b65d150fb34da7d3b5acb13011)]
+- 🐳 add .dockerignore [[57262b8](https://github.com/lenML/ChatTTS-Forge/commit/57262b81a8df3ed26ca5da5e264d5dca7b022471)]
+- 🧪 add tests [[a807640](https://github.com/lenML/ChatTTS-Forge/commit/a80764030b790baee45a10cbe2d4edd7f183ef3c)]
+- 🌐 fix [[b34a0f8](https://github.com/lenML/ChatTTS-Forge/commit/b34a0f8654467f3068e43056708742ab69e3665b)]
+- 🌐 remove chat limit desc [[3f81eca](https://github.com/lenML/ChatTTS-Forge/commit/3f81ecae6e4521eeb4e867534defc36be741e1e2)]
+- 🧪 add tests [[7a54225](https://github.com/lenML/ChatTTS-Forge/commit/7a542256a157a281a15312bbf987bc9fb16876ee)]
+- 🔨 improve model downloader [[79a0c59](https://github.com/lenML/ChatTTS-Forge/commit/79a0c599f03b4e47346315a03f1df3d92578fe5d)]
 - 🌐 更新翻译文案 [[f56caa7](https://github.com/lenML/ChatTTS-Forge/commit/f56caa71e9186680b93c487d9645186ae18c1dc6)]
-- 📝 update [[7cacf91](https://github.com/lenML/ChatTTS-Forge/commit/7cacf913541ee5f86eaa80d8b193b94b3db2b67c)]
-- 📝 update webui document [[7f2bb22](https://github.com/lenML/ChatTTS-Forge/commit/7f2bb227027cc0eff312c37758a20916c1ebade6)]
 
 <a name="0.5.5"></a>
launch.py
CHANGED
@@ -5,6 +5,7 @@ from modules.ffmpeg_env import setup_ffmpeg_path
 
 try:
     setup_ffmpeg_path()
+    # NOTE: 因为 logger 都是在模块中初始化,所以这个 config 必须在最前面
     logging.basicConfig(
         level=os.getenv("LOG_LEVEL", "INFO"),
         format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
@@ -16,26 +17,44 @@ import argparse
 
 import uvicorn
 
 from modules.api.api_setup import setup_api_args
+from modules.models_setup import setup_model_args
 from modules.utils import env
+from modules.utils.ignore_warn import ignore_useless_warnings
+
+ignore_useless_warnings()
 
 logger = logging.getLogger(__name__)
 
-if __name__ == "__main__":
-    import dotenv
 
-    dotenv.load_dotenv(
-        dotenv_path=os.getenv("ENV_FILE", ".env.api"),
-    )
-    parser = argparse.ArgumentParser(
-        description="Start the FastAPI server with command line arguments"
+def setup_uvicon_args(parser: argparse.ArgumentParser):
+    parser.add_argument("--host", type=str, help="Host to run the server on")
+    parser.add_argument("--port", type=int, help="Port to run the server on")
+    parser.add_argument(
+        "--reload", action="store_true", help="Enable auto-reload for development"
     )
-    setup_api_args(parser)
-    setup_model_args(parser)
-    setup_uvicon_args(parser=parser)
-
-    args = parser.parse_args()
+    parser.add_argument("--workers", type=int, help="Number of worker processes")
+    parser.add_argument("--log_level", type=str, help="Log level")
+    parser.add_argument("--access_log", action="store_true", help="Enable access log")
+    parser.add_argument(
+        "--proxy_headers", action="store_true", help="Enable proxy headers"
+    )
+    parser.add_argument(
+        "--timeout_keep_alive", type=int, help="Keep-alive timeout duration"
+    )
+    parser.add_argument(
+        "--timeout_graceful_shutdown",
+        type=int,
+        help="Graceful shutdown timeout duration",
+    )
+    parser.add_argument("--ssl_keyfile", type=str, help="SSL key file path")
+    parser.add_argument("--ssl_certfile", type=str, help="SSL certificate file path")
+    parser.add_argument(
+        "--ssl_keyfile_password", type=str, help="SSL key file password"
+    )
 
+
+def process_uvicon_args(args):
     host = env.get_and_update_env(args, "host", "0.0.0.0", str)
     port = env.get_and_update_env(args, "port", 7870, int)
     reload = env.get_and_update_env(args, "reload", False, bool)
@@ -68,3 +87,22 @@ if __name__ == "__main__":
         ssl_certfile=ssl_certfile,
         ssl_keyfile_password=ssl_keyfile_password,
     )
+
+
+if __name__ == "__main__":
+    import dotenv
+
+    dotenv.load_dotenv(
+        dotenv_path=os.getenv("ENV_FILE", ".env.api"),
+    )
+    parser = argparse.ArgumentParser(
+        description="Start the FastAPI server with command line arguments"
+    )
+    # NOTE: 主进程中不需要处理 model args / api args,但是要接收这些参数, 具体处理在 worker.py 中
+    setup_api_args(parser=parser)
+    setup_model_args(parser=parser)
+    setup_uvicon_args(parser=parser)
+
+    args = parser.parse_args()
+
+    process_uvicon_args(args)
modules/ChatTTS/ChatTTS/core.py
CHANGED
@@ -1,6 +1,7 @@
 import logging
 import os
 
+import numpy as np
 import torch
 from huggingface_hub import snapshot_download
 from omegaconf import OmegaConf
@@ -142,9 +143,12 @@ class Chat:
         gpt.load_state_dict(torch.load(gpt_ckpt_path, map_location=map_location))
         if compile and "cuda" in str(device):
             self.logger.info("compile gpt model")
-            gpt.gpt.forward = torch.compile(
-                gpt.gpt.forward, backend="inductor", dynamic=True
-            )
+            try:
+                gpt.gpt.forward = torch.compile(
+                    gpt.gpt.forward, backend="inductor", dynamic=True
+                )
+            except RuntimeError as e:
+                logging.warning(f"Compile failed,{e}. fallback to normal mode.")
         self.pretrain_models["gpt"] = gpt
         spk_stat_path = os.path.join(os.path.dirname(gpt_ckpt_path), "spk_stat.pt")
         assert os.path.exists(
@@ -173,7 +177,7 @@ class Chat:
 
         self.check_model()
 
-    def infer(
+    def _infer(
         self,
         text,
         skip_refine_text=False,
@@ -181,9 +185,11 @@ class Chat:
         params_refine_text={},
         params_infer_code={"prompt": "[speed_5]"},
         use_decoder=True,
+        stream=False,
+        stream_text=False,
     ):
 
-        assert self.check_model(use_decoder=use_decoder)
+        # assert self.check_model(use_decoder=use_decoder)
 
         if not isinstance(text, list):
             text = [text]
@@ -192,122 +198,147 @@ class Chat:
         reserved_tokens = self.pretrain_models[
             "tokenizer"
         ].additional_special_tokens
-        invalid_characters = count_invalid_characters(t)
+        invalid_characters = count_invalid_characters(
+            t, reserved_tokens=reserved_tokens
+        )
         if len(invalid_characters):
             self.logger.log(
                 logging.WARNING, f"Invalid characters found! : {invalid_characters}"
             )
-            text[i] = apply_character_map(t)
+            text[i] = apply_character_map(t, reserved_tokens=reserved_tokens)
 
         if not skip_refine_text:
-            text_tokens = refine_text(
-                self.pretrain_models, text, **params_refine_text
-            )["ids"]
-            text_tokens = [
-                i[
-                    i
-                    < self.pretrain_models["tokenizer"].convert_tokens_to_ids(
-                        "[break_0]"
-                    )
-                ]
-                for i in text_tokens
-            ]
-            text = self.pretrain_models["tokenizer"].batch_decode(text_tokens)
+            text_tokens_gen = refine_text(
+                self.pretrain_models, text, stream=stream, **params_refine_text
+            )
+
+            def decode_text(text_tokens):
+                text_tokens = [
+                    i[
+                        i
+                        < self.pretrain_models["tokenizer"].convert_tokens_to_ids(
+                            "[break_0]"
+                        )
+                    ]
+                    for i in text_tokens
+                ]
+                text = self.pretrain_models["tokenizer"].batch_decode(text_tokens)
+                return text
+
+            if stream_text:
+                for result in text_tokens_gen:
+                    text_incomplete = decode_text(result["ids"])
+                    if refine_text_only and stream:
+                        yield text_incomplete
+                if refine_text_only:
+                    return
+            else:
+                result = next(text_tokens_gen)
+                text = decode_text(result["ids"])
+                if refine_text_only:
+                    yield text
         if refine_text_only:
-            return text
+            return
 
         text = [params_infer_code.get("prompt", "") + i for i in text]
         params_infer_code.pop("prompt", "")
-        result = infer_code(
-            self.pretrain_models,
-            text,
-            **params_infer_code,
-            return_hidden=use_decoder,
-        )
+        result_gen = infer_code(
+            self.pretrain_models,
+            text,
+            **params_infer_code,
+            return_hidden=use_decoder,
+            stream=stream,
+        )
         if use_decoder:
-            mel_spec = [
-                self.pretrain_models["decoder"](i[None].permute(0, 2, 1))
-                for i in result["hiddens"]
-            ]
+            field = "hiddens"
+            docoder_name = "decoder"
         else:
-            mel_spec = [
-                self.pretrain_models["dvae"](i[None].permute(0, 2, 1))
-                for i in result["ids"]
-            ]
-
-        wav = [self.pretrain_models["vocos"].decode(i).cpu().numpy() for i in mel_spec]
-
-        return wav
+            field = "ids"
+            docoder_name = "dvae"
+        vocos_decode = lambda spec: [
+            self.pretrain_models["vocos"]
+            .decode(i.cpu() if torch.backends.mps.is_available() else i)
+            .cpu()
+            .numpy()
+            for i in spec
+        ]
+        if stream:
+
+            length = 0
+            for result in result_gen:
+                chunk_data = result[field][0]
+                assert len(result[field]) == 1
+                start_seek = length
+                length = len(chunk_data)
+                self.logger.debug(
+                    f"{start_seek=} total len: {length}, new len: {length - start_seek = }"
+                )
+                chunk_data = chunk_data[start_seek:]
+                if not len(chunk_data):
+                    continue
+                self.logger.debug(f"new hidden {len(chunk_data)=}")
+                mel_spec = [
+                    self.pretrain_models[docoder_name](i[None].permute(0, 2, 1))
+                    for i in [chunk_data]
+                ]
+                wav = vocos_decode(mel_spec)
+                self.logger.debug(f"yield wav chunk {len(wav[0])=} {len(wav[0][0])=}")
+                yield wav
+            return
+        mel_spec = [
+            self.pretrain_models[docoder_name](i[None].permute(0, 2, 1))
+            for i in next(result_gen)[field]
+        ]
+        yield vocos_decode(mel_spec)
 
-    def refiner_prompt(
+    def infer(
         self,
         text,
+        skip_refine_text=False,
+        refine_text_only=False,
         params_refine_text={},
-    ):
-        return text[0]
+        params_infer_code={"prompt": "[speed_5]"},
+        use_decoder=True,
+        stream=False,
+    ):
+        res_gen = self._infer(
+            text=text,
+            skip_refine_text=skip_refine_text,
+            refine_text_only=refine_text_only,
+            params_refine_text=params_refine_text,
+            params_infer_code=params_infer_code,
+            use_decoder=use_decoder,
+            stream=stream,
+        )
+        if stream:
+            return res_gen
+        else:
+            return next(res_gen)
+
+    def refiner_prompt(self, text, params_refine_text={}, stream=False):
+        return self.infer(
+            text=text,
+            skip_refine_text=False,
+            refine_text_only=True,
+            params_refine_text=params_refine_text,
+            stream=stream,
+        )
 
     def generate_audio(
         self,
         prompt,
         params_infer_code={"prompt": "[speed_5]"},
         use_decoder=True,
+        stream=False,
     ):
-        if not isinstance(prompt, list):
-            prompt = [prompt]
-
-        prompt = [params_infer_code.get("prompt", "") + i for i in prompt]
-        params_infer_code.pop("prompt", "")
-        result = infer_code(
-            self.pretrain_models,
+        return self.infer(
             prompt,
-            **params_infer_code,
-            return_hidden=use_decoder,
-        )
-
-        if use_decoder:
-            mel_spec = [
-                self.pretrain_models["decoder"](i[None].permute(0, 2, 1))
-                for i in result["hiddens"]
-            ]
-        else:
-            mel_spec = [
-                self.pretrain_models["dvae"](i[None].permute(0, 2, 1))
-                for i in result["ids"]
-            ]
-
-        wav = [self.pretrain_models["vocos"].decode(i).cpu().numpy() for i in mel_spec]
-
-        return wav
+            skip_refine_text=True,
+            params_infer_code=params_infer_code,
+            use_decoder=use_decoder,
+            stream=stream,
+        )
 
     def sample_random_speaker(
         self,
     ) -> torch.Tensor:
modules/ChatTTS/ChatTTS/infer/api.py
CHANGED
@@ -17,6 +17,7 @@ def infer_code(
     prompt1="",
     prompt2="",
     prefix="",
+    stream=False,
     **kwargs,
 ):
 
@@ -83,6 +84,7 @@ def infer_code(
         eos_token=num_code,
         max_new_token=max_new_token,
         infer_text=False,
+        stream=stream,
         **kwargs,
     )
 
@@ -98,6 +100,7 @@ def refine_text(
     repetition_penalty=1.0,
     max_new_token=384,
     prompt="",
+    stream=False,
     **kwargs,
 ):
     device = next(models["gpt"].parameters()).device
@@ -152,6 +155,7 @@ def refine_text(
         )[None],
         max_new_token=max_new_token,
         infer_text=True,
+        stream=stream,
         **kwargs,
     )
     return result
modules/ChatTTS/ChatTTS/model/gpt.py
CHANGED
@@ -3,6 +3,7 @@ import os
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
 import logging
+from functools import partial
 
 import torch
 import torch.nn as nn
@@ -37,7 +38,6 @@ class GPT_warpper(nn.Module):
         num_audio_tokens,
         num_text_tokens,
         num_vq=4,
-        **kwargs,
     ):
         super().__init__()
 
@@ -211,12 +211,13 @@ class GPT_warpper(nn.Module):
         infer_text=False,
         return_attn=False,
         return_hidden=False,
+        stream=False,
         disable_tqdm=False,
     ):
+        from tqdm import tqdm
+
         if disable_tqdm:
-            tqdm =
-        else:
-            from tqdm import tqdm
+            tqdm = partial(tqdm, disable=True)
 
         with torch.no_grad():
 
@@ -242,90 +243,136 @@ class GPT_warpper(nn.Module):
         if attention_mask is not None:
             attention_mask_cache[:, : attention_mask.shape[1]] = attention_mask
 
-            if finish.all():
-                continue
-
-            model_input = self.prepare_inputs_for_generation(
-                inputs_ids,
-                outputs.past_key_values if i != 0 else None,
-                attention_mask_cache[:, : inputs_ids.shape[1]],
-                use_cache=True,
-            )
-            model_input["inputs_embeds"] = torch.stack(code_emb, 3).sum(3)
-            model_input["input_ids"] = None
-            outputs = self.gpt.forward(**model_input, output_attentions=return_attn)
-            attentions.append(outputs.attentions)
-            hidden_states = outputs[0]  # 🐻
-            if return_hidden:
-                hiddens.append(hidden_states[:, -1])
-            with P.cached():
-                if infer_text:
-                    logits = self.head_text(hidden_states)
-                else:
-                    logits = torch.stack(
-                        [self.head_code[i](hidden_states) for i in range(self.num_vq)],
-                        3,
-                    )
-            logits_token = rearrange(inputs_ids[:, start_idx:], "b c n -> (b n) c")
-            logits = logitsProcessors(logits_token, logits)
-            logits[:, eos_token] = -torch.inf
+        with tqdm(total=max_new_token) as pbar:
+
+            past_key_values = None
+
+            for i in range(max_new_token):
+                pbar.update(1)
+                model_input = self.prepare_inputs_for_generation(
+                    inputs_ids,
+                    past_key_values,
+                    attention_mask_cache[:, : inputs_ids.shape[1]],
+                    use_cache=True,
+                )
+
+                if i == 0:
+                    model_input["inputs_embeds"] = emb
+                else:
+                    if infer_text:
+                        model_input["inputs_embeds"] = self.emb_text(
+                            model_input["input_ids"][:, :, 0]
+                        )
+                    else:
+                        code_emb = [
+                            self.emb_code[i](model_input["input_ids"][:, :, i])
+                            for i in range(self.num_vq)
+                        ]
+                        model_input["inputs_embeds"] = torch.stack(code_emb, 3).sum(
+                            3
+                        )
+
+                model_input["input_ids"] = None
+                outputs = self.gpt.forward(
+                    **model_input, output_attentions=return_attn
+                )
+                del model_input
+                attentions.append(outputs.attentions)
+                hidden_states = outputs[0]  # 🐻
+                past_key_values = outputs.past_key_values
+                del outputs
+                if return_hidden:
+                    hiddens.append(hidden_states[:, -1])
+
+                with P.cached():
+                    if infer_text:
+                        logits = self.head_text(hidden_states)
+                    else:
+                        logits = torch.stack(
+                            [
+                                self.head_code[i](hidden_states)
+                                for i in range(self.num_vq)
+                            ],
+                            3,
+                        )
+
+                logits = logits[:, -1].float()
+
+                if not infer_text:
+                    logits = rearrange(logits, "b c n -> (b n) c")
+                    logits_token = rearrange(
+                        inputs_ids[:, start_idx:], "b c n -> (b n) c"
+                    )
+                else:
+                    logits_token = inputs_ids[:, start_idx:, 0]
+
+                logits = logits / temperature
+
+                for logitsProcessors in LogitsProcessors:
+                    logits = logitsProcessors(logits_token, logits)
+
+                for logitsWarpers in LogitsWarpers:
+                    logits = logitsWarpers(logits_token, logits)
+
+                del logits_token
+
+                if i < min_new_token:
+                    logits[:, eos_token] = -torch.inf
+
+                scores = F.softmax(logits, dim=-1)
+
+                del logits
+
+                idx_next = torch.multinomial(scores, num_samples=1)
+
+                if not infer_text:
+                    idx_next = rearrange(idx_next, "(b n) 1 -> b n", n=self.num_vq)
+                    finish_or = (idx_next == eos_token).any(1)
+                    finish |= finish_or
+                    del finish_or
+                    inputs_ids = torch.cat([inputs_ids, idx_next.unsqueeze(1)], 1)
+                else:
+                    finish_or = (idx_next == eos_token).any(1)
+                    finish |= finish_or
+                    del finish_or
+                    inputs_ids = torch.cat(
+                        [
+                            inputs_ids,
+                            idx_next.unsqueeze(-1).expand(-1, -1, self.num_vq),
+                        ],
+                        1,
+                    )
+
+                del idx_next
+
+                end_idx += (~finish).int().to(end_idx.device)
+                if stream:
+                    if end_idx % 24 and not finish.all():
+                        continue
+                    y_inputs_ids = [
+                        inputs_ids[idx, start_idx : start_idx + i]
+                        for idx, i in enumerate(end_idx.int())
+                    ]
+                    y_inputs_ids = (
+                        [i[:, 0] for i in y_inputs_ids]
+                        if infer_text
+                        else y_inputs_ids
+                    )
+                    y_hiddens = [[]]
+                    if return_hidden:
+                        y_hiddens = torch.stack(hiddens, 1)
+                        y_hiddens = [
+                            y_hiddens[idx, :i]
+                            for idx, i in enumerate(end_idx.int())
+                        ]
+                    yield {
+                        "ids": y_inputs_ids,
+                        "attentions": attentions,
+                        "hiddens": y_hiddens,
+                    }
+                if finish.all():
+                    pbar.update(max_new_token - i - 1)
+                    break
 
         inputs_ids = [
             inputs_ids[idx, start_idx : start_idx + i]
@@ -342,7 +389,9 @@ class GPT_warpper(nn.Module):
                 f"Incomplete result. hit max_new_token: {max_new_token}"
             )
 
-        return {
+        del finish
+
+        yield {
             "ids": inputs_ids,
             "attentions": attentions,
             "hiddens": hiddens,
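The `if end_idx % 24 and not finish.all(): continue` guard is the streaming throttle: a snapshot of the generated ids (and hiddens) is yielded only once every 24 new tokens, plus a final emission once every sequence in the batch has finished. The same cadence in isolation, as a plain-Python sketch:

def stream_every(gen, every=24):
    # yield a growing snapshot every `every` items, then a final flush,
    # the same rhythm GPT_warpper.generate uses in stream mode
    buf = []
    for item in gen:
        buf.append(item)
        if len(buf) % every == 0:
            yield list(buf)
    yield list(buf)  # final flush with whatever remains

for snapshot in stream_every(range(100)):
    print(len(snapshot))  # 24, 48, 72, 96, 100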
modules/ChatTTS/ChatTTS/utils/infer_utils.py
CHANGED
@@ -24,6 +24,7 @@ class CustomRepetitionPenaltyLogitsProcessorRepeat:
         freq = F.one_hot(input_ids, scores.size(1)).sum(1)
         freq[self.max_input_ids :] = 0
         alpha = self.penalty**freq
+        scores = scores.contiguous()
         scores = torch.where(scores < 0, scores * alpha, scores / alpha)
 
         return scores
@@ -145,11 +146,35 @@ halfwidth_2_fullwidth_map = {
 }
 
 
-def apply_half2full_map(text):
+def replace_unsupported_chars(text, replace_dict, reserved_tokens: list = []):
+    escaped_tokens = [re.escape(token) for token in reserved_tokens]
+    special_tokens_pattern = "|".join(escaped_tokens)
+    tokens = re.split(f"({special_tokens_pattern})", text)
+
+    def replace_chars(segment):
+        for old_char, new_char in replace_dict.items():
+            segment = segment.replace(old_char, new_char)
+        return segment
+
+    result = "".join(
+        (
+            replace_chars(segment)
+            if not re.match(special_tokens_pattern, segment)
+            else segment
+        )
+        for segment in tokens
+    )
+
+    return result
+
+
+def apply_half2full_map(text, reserved_tokens: list = []):
+    return replace_unsupported_chars(
+        text, halfwidth_2_fullwidth_map, reserved_tokens=reserved_tokens
+    )
 
 
-def apply_character_map(text):
+def apply_character_map(text, reserved_tokens: list = []):
+    return replace_unsupported_chars(
+        text, character_map, reserved_tokens=reserved_tokens
+    )
modules/Enhancer/ResembleEnhance.py
CHANGED
@@ -85,7 +85,7 @@ def load_enhancer() -> ResembleEnhance:
     if resemble_enhance is None:
         logger.info("Loading ResembleEnhance model")
         resemble_enhance = ResembleEnhance(
-            device=devices.device, dtype=devices.dtype
+            device=devices.get_device_for("enhancer"), dtype=devices.dtype
         )
         resemble_enhance.load_model()
         logger.info("ResembleEnhance model loaded")
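`get_device_for("enhancer")` is what makes the new `--use_cpu=enhancer` flag effective: device selection becomes per-module rather than global. The real logic lives in modules/devices/devices.py; the sketch below is an assumption inferred from the flag's semantics, not a copy of it:

import torch

use_cpu: list = []  # populated from --use_cpu, e.g. ["enhancer"]

def get_device_for(module: str) -> torch.device:
    # assumed behavior: send the named module to CPU when requested,
    # otherwise fall back to the globally selected device
    if module in use_cpu or "all" in use_cpu:
        return torch.device("cpu")
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")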
modules/SentenceSplitter.py
CHANGED
@@ -2,6 +2,8 @@ import re
 
 import zhon
 
+from modules.utils.detect_lang import guess_lang
+
 
 def split_zhon_sentence(text):
     result = []
@@ -21,6 +23,35 @@ def split_zhon_sentence(text):
     return result
 
 
+def split_en_sentence(text):
+    """
+    Split English text into sentences.
+    """
+    # Define a regex pattern for English sentence splitting
+    pattern = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s")
+    result = pattern.split(text)
+
+    # Filter out any empty strings or strings that are just whitespace
+    result = [sentence.strip() for sentence in result if sentence.strip()]
+
+    return result
+
+
+def is_eng_sentence(text):
+    return guess_lang(text) == "en"
+
+
+def split_zhon_paragraph(text):
+    lines = text.split("\n")
+    result = []
+    for line in lines:
+        if is_eng_sentence(line):
+            result.extend(split_en_sentence(line))
+        else:
+            result.extend(split_zhon_sentence(line))
+    return result
+
+
 # 解析文本 并根据停止符号分割成句子
 # 可以设置最大阈值,即如果分割片段小于这个阈值会与下一段合并
 class SentenceSplitter:
@@ -28,7 +59,7 @@ class SentenceSplitter:
         self.sentence_threshold = threshold
 
     def parse(self, text):
-        sentences = split_zhon_sentence(text)
+        sentences = split_zhon_paragraph(text)
 
         # 合并小于最大阈值的片段
         merged_sentences = []
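`parse` now routes each line through `guess_lang`: English lines go to the new regex splitter, everything else to the zhon-based splitter. The English regex on its own, which deliberately refuses to split after abbreviations like "Dr.":

import re

pattern = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s")

text = "Dr. Smith went home. It was late! Was it?"
print([s.strip() for s in pattern.split(text) if s.strip()])
# -> ['Dr. Smith went home.', 'It was late!', 'Was it?']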
modules/SynthesizeSegments.py
CHANGED
@@ -1,4 +1,5 @@
 import copy
+import io
 import json
 import logging
 import re
@@ -7,6 +8,7 @@ from typing import List, Union
 import numpy as np
 from box import Box
 from pydub import AudioSegment
+from scipy.io import wavfile
 
 from modules import generate_audio
 from modules.api.utils import calc_spk_style
@@ -15,15 +17,39 @@ from modules.SentenceSplitter import SentenceSplitter
 from modules.speaker import Speaker
 from modules.ssml_parser.SSMLParser import SSMLBreak, SSMLContext, SSMLSegment
 from modules.utils import rng
-from modules.utils.audio import pitch_shift, time_stretch
+from modules.utils.audio import apply_prosody_to_audio_segment
 
 logger = logging.getLogger(__name__)
 
 
+def audio_data_to_segment_slow(audio_data, sr):
+    byte_io = io.BytesIO()
+    wavfile.write(byte_io, rate=sr, data=audio_data)
+    byte_io.seek(0)
+
+    return AudioSegment.from_file(byte_io, format="wav")
+
+
+def clip_audio(audio_data: np.ndarray, threshold: float = 0.99):
+    audio_data = np.clip(audio_data, -threshold, threshold)
+    return audio_data
+
+
+def normalize_audio(audio_data: np.ndarray, norm_factor: float = 0.8):
+    max_amplitude = np.max(np.abs(audio_data))
+    if max_amplitude > 0:
+        audio_data = audio_data / max_amplitude * norm_factor
+    return audio_data
+
+
 def audio_data_to_segment(audio_data: np.ndarray, sr: int):
     """
     optimize: https://github.com/lenML/ChatTTS-Forge/issues/57
     """
+
+    audio_data = normalize_audio(audio_data)
+    audio_data = clip_audio(audio_data)
+
     audio_data = (audio_data * 32767).astype(np.int16)
     audio_segment = AudioSegment(
         audio_data.tobytes(),
@@ -41,21 +67,6 @@ def combine_audio_segments(audio_segments: list[AudioSegment]) -> AudioSegment:
     return combined_audio
 
 
-def apply_prosody(
-    audio_segment: AudioSegment, rate: float, volume: float, pitch: float
-) -> AudioSegment:
-    if rate != 1:
-        audio_segment = time_stretch(audio_segment, rate)
-
-    if volume != 0:
-        audio_segment += volume
-
-    if pitch != 0:
-        audio_segment = pitch_shift(audio_segment, pitch)
-
-    return audio_segment
-
-
 def to_number(value, t, default=0):
     try:
         number = t(value)
@@ -202,7 +213,9 @@ class SynthesizeSegments:
         pitch = float(segment.get("pitch", "0"))
 
         audio_segment = audio_data_to_segment(audio_data, sr)
-        audio_segment = apply_prosody(audio_segment, rate, volume, pitch)
+        audio_segment = apply_prosody_to_audio_segment(
+            audio_segment, rate=rate, volume=volume, pitch=pitch
+        )
         # compare by Box object
         original_index = src_segments.index(segment)
         audio_segments[original_index] = audio_segment
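`normalize_audio` followed by `clip_audio` is the loudness equalization mentioned in the changelog: each synthesized segment is peak-normalized to 0.8 and hard-clipped at 0.99 before the int16 conversion, so quiet and loud segments land at a comparable level and overdriven samples no longer wrap around. A small numeric check:

import numpy as np

quiet = np.sin(np.linspace(0, 20, 1000)) * 0.05  # peak ~0.05
loud = np.sin(np.linspace(0, 20, 1000)) * 1.40   # peak ~1.4, would overflow int16

for x in (quiet, loud):
    y = x / np.max(np.abs(x)) * 0.8  # normalize_audio(...)
    y = np.clip(y, -0.99, 0.99)      # clip_audio(...)
    print(round(float(np.max(np.abs(y))), 3))  # 0.8 for both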
modules/api/api_setup.py
CHANGED

```diff
@@ -1,7 +1,9 @@
 import argparse
 import logging

-from …
+from fastapi import FastAPI
+
+from modules import config
 from modules.api.Api import APIManager
 from modules.api.impl import (
     google_api,
@@ -15,15 +17,12 @@ from modules.api.impl import (
     tts_api,
     xtts_v2_api,
 )
-from modules.devices import devices
-from modules.Enhancer.ResembleEnhance import load_enhancer
-from modules.models import load_chat_tts
 from modules.utils import env

 logger = logging.getLogger(__name__)


-def create_api(app, exclude=[]):
+def create_api(app: FastAPI, exclude=[]):
     app_mgr = APIManager(app=app, exclude_patterns=exclude)

     ping_api.setup(app_mgr)
@@ -40,98 +39,6 @@ def create_api(app, exclude=[]):
     return app_mgr


-def setup_model_args(parser: argparse.ArgumentParser):
-    parser.add_argument("--compile", action="store_true", help="Enable model compile")
-    parser.add_argument(
-        "--no_half",
-        action="store_true",
-        help="Disable half precision for model inference",
-    )
-    parser.add_argument(
-        "--off_tqdm",
-        action="store_true",
-        help="Disable tqdm progress bar",
-    )
-    parser.add_argument(
-        "--device_id",
-        type=str,
-        help="Select the default CUDA device to use (export CUDA_VISIBLE_DEVICES=0,1,etc might be needed before)",
-        default=None,
-    )
-    parser.add_argument(
-        "--use_cpu",
-        nargs="+",
-        help="use CPU as torch device for specified modules",
-        default=[],
-        type=str.lower,
-    )
-    parser.add_argument(
-        "--lru_size",
-        type=int,
-        default=64,
-        help="Set the size of the request cache pool, set it to 0 will disable lru_cache",
-    )
-    parser.add_argument(
-        "--debug_generate",
-        action="store_true",
-        help="Enable debug mode for audio generation",
-    )
-    parser.add_argument(
-        "--preload_models",
-        action="store_true",
-        help="Preload all models at startup",
-    )
-
-
-def process_model_args(args):
-    lru_size = env.get_and_update_env(args, "lru_size", 64, int)
-    compile = env.get_and_update_env(args, "compile", False, bool)
-    device_id = env.get_and_update_env(args, "device_id", None, str)
-    use_cpu = env.get_and_update_env(args, "use_cpu", [], list)
-    no_half = env.get_and_update_env(args, "no_half", False, bool)
-    off_tqdm = env.get_and_update_env(args, "off_tqdm", False, bool)
-    debug_generate = env.get_and_update_env(args, "debug_generate", False, bool)
-    preload_models = env.get_and_update_env(args, "preload_models", False, bool)
-
-    generate_audio.setup_lru_cache()
-    devices.reset_device()
-    devices.first_time_calculation()
-
-    if debug_generate:
-        generate_audio.logger.setLevel(logging.DEBUG)
-
-    if preload_models:
-        load_chat_tts()
-        load_enhancer()
-
-
-def setup_uvicon_args(parser: argparse.ArgumentParser):
-    parser.add_argument("--host", type=str, help="Host to run the server on")
-    parser.add_argument("--port", type=int, help="Port to run the server on")
-    parser.add_argument(
-        "--reload", action="store_true", help="Enable auto-reload for development"
-    )
-    parser.add_argument("--workers", type=int, help="Number of worker processes")
-    parser.add_argument("--log_level", type=str, help="Log level")
-    parser.add_argument("--access_log", action="store_true", help="Enable access log")
-    parser.add_argument(
-        "--proxy_headers", action="store_true", help="Enable proxy headers"
-    )
-    parser.add_argument(
-        "--timeout_keep_alive", type=int, help="Keep-alive timeout duration"
-    )
-    parser.add_argument(
-        "--timeout_graceful_shutdown",
-        type=int,
-        help="Graceful shutdown timeout duration",
-    )
-    parser.add_argument("--ssl_keyfile", type=str, help="SSL key file path")
-    parser.add_argument("--ssl_certfile", type=str, help="SSL certificate file path")
-    parser.add_argument(
-        "--ssl_keyfile_password", type=str, help="SSL key file password"
-    )
-
-
 def setup_api_args(parser: argparse.ArgumentParser):
     parser.add_argument(
         "--cors_origin",
@@ -156,7 +63,7 @@ def setup_api_args(parser: argparse.ArgumentParser):
     )


-def process_api_args(args, app):
+def process_api_args(args: argparse.Namespace, app: FastAPI):
     cors_origin = env.get_and_update_env(args, "cors_origin", "*", str)
     no_playground = env.get_and_update_env(args, "no_playground", False, bool)
     no_docs = env.get_and_update_env(args, "no_docs", False, bool)
```
modules/api/impl/handler/AudioHandler.py
CHANGED

```diff
@@ -1,5 +1,6 @@
 import base64
 import io
+from typing import Generator

 import numpy as np
 import soundfile as sf
@@ -10,7 +11,24 @@ from modules.api.impl.model.audio_model import AudioFormat

 class AudioHandler:
     def enqueue(self) -> tuple[np.ndarray, int]:
-        raise NotImplementedError
+        raise NotImplementedError("Method 'enqueue' must be implemented by subclass")
+
+    def enqueue_stream(self) -> Generator[tuple[np.ndarray, int], None, None]:
+        raise NotImplementedError(
+            "Method 'enqueue_stream' must be implemented by subclass"
+        )
+
+    def enqueue_to_stream(self, format: AudioFormat) -> Generator[bytes, None, None]:
+        for audio_data, sample_rate in self.enqueue_stream():
+            buffer = io.BytesIO()
+            sf.write(buffer, audio_data, sample_rate, format="wav")
+            buffer.seek(0)
+
+            if format == AudioFormat.mp3:
+                buffer = api_utils.wav_to_mp3(buffer)
+
+            binary = buffer.read()
+            yield binary

     def enqueue_to_buffer(self, format: AudioFormat) -> io.BytesIO:
         audio_data, sample_rate = self.enqueue()
```
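For context, here is a minimal standalone sketch of the contract this diff establishes: `enqueue` returns one full clip, `enqueue_stream` yields chunks, and the streaming wrapper packages each chunk as a self-contained payload. The `ToneHandler` stub and its sine-wave source are hypothetical; the real subclasses are `TTSHandler`/`SSMLHandler`, and the real wrapper also handles the mp3 branch via `api_utils.wav_to_mp3`:

```python
import io
from typing import Generator

import numpy as np
import soundfile as sf


class ToneHandler:
    """Hypothetical stand-in that follows the AudioHandler contract above."""

    def enqueue(self) -> tuple[np.ndarray, int]:
        # One-shot path: return the whole clip at once.
        sr = 24000
        t = np.linspace(0, 1, sr, endpoint=False)
        return (0.3 * np.sin(2 * np.pi * 440 * t)).astype(np.float32), sr

    def enqueue_stream(self) -> Generator[tuple[np.ndarray, int], None, None]:
        # Streaming path: yield the clip in roughly 0.25 s chunks.
        wav, sr = self.enqueue()
        for chunk in np.array_split(wav, 4):
            yield chunk, sr

    def enqueue_to_stream(self) -> Generator[bytes, None, None]:
        # Each chunk becomes a self-contained WAV payload, mirroring the
        # diff's enqueue_to_stream (minus the mp3 branch).
        for wav, sr in self.enqueue_stream():
            buf = io.BytesIO()
            sf.write(buf, wav, sr, format="wav")
            buf.seek(0)
            yield buf.read()


for i, payload in enumerate(ToneHandler().enqueue_to_stream()):
    print(f"chunk {i}: {len(payload)} bytes")
```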
modules/api/impl/handler/SSMLHandler.py
CHANGED

```diff
@@ -91,4 +91,9 @@ class SSMLHandler(AudioHandler):
             sr=sample_rate,
         )

+        if adjust_config.normalize:
+            sample_rate, audio_data = audio.apply_normalize(
+                audio_data=audio_data, headroom=adjust_config.headroom, sr=sample_rate
+            )
+
         return audio_data, sample_rate
```
modules/api/impl/handler/TTSHandler.py
CHANGED

```diff
@@ -1,3 +1,6 @@
+import logging
+from typing import Generator
+
 import numpy as np

 from modules.api.impl.handler.AudioHandler import AudioHandler
@@ -8,7 +11,10 @@ from modules.Enhancer.ResembleEnhance import apply_audio_enhance_full
 from modules.normalization import text_normalize
 from modules.speaker import Speaker
 from modules.synthesize_audio import synthesize_audio
-from modules.…
+from modules.synthesize_stream import synthesize_stream
+from modules.utils.audio import apply_normalize, apply_prosody_to_audio_data
+
+logger = logging.getLogger(__name__)


 class TTSHandler(AudioHandler):
@@ -94,4 +100,57 @@ class TTSHandler(AudioHandler):
             sr=sample_rate,
         )

+        if adjust_config.normalize:
+            sample_rate, audio_data = apply_normalize(
+                audio_data=audio_data,
+                headroom=adjust_config.headroom,
+                sr=sample_rate,
+            )
+
         return audio_data, sample_rate
+
+    def enqueue_stream(self) -> Generator[tuple[np.ndarray, int], None, None]:
+        text = text_normalize(self.text_content)
+        tts_config = self.tts_config
+        infer_config = self.infer_config
+        adjust_config = self.adjest_config
+        enhancer_config = self.enhancer_config
+
+        if enhancer_config.enabled:
+            logger.warning(
+                "enhancer_config is enabled, but it is not supported in stream mode"
+            )
+
+        gen = synthesize_stream(
+            text,
+            spk=self.spk,
+            temperature=tts_config.temperature,
+            top_P=tts_config.top_p,
+            top_K=tts_config.top_k,
+            prompt1=tts_config.prompt1,
+            prompt2=tts_config.prompt2,
+            prefix=tts_config.prefix,
+            infer_seed=infer_config.seed,
+            spliter_threshold=infer_config.spliter_threshold,
+            end_of_sentence=infer_config.eos,
+        )
+
+        # FIXME: Strangely, each chunk of the merged audio is preceded by a short
+        # glitch. The cause hasn't been tracked down yet; possibly the decode-time
+        # split drops samples, or keeps too many?
+        for sr, wav in gen:
+            wav = apply_prosody_to_audio_data(
+                audio_data=wav,
+                rate=adjust_config.speed_rate,
+                pitch=adjust_config.pitch,
+                volume=adjust_config.volume_gain_db,
+                sr=sr,
+            )
+
+            if adjust_config.normalize:
+                sr, wav = apply_normalize(
+                    audio_data=wav,
+                    headroom=adjust_config.headroom,
+                    sr=sr,
+                )
+
+            yield wav, sr
```
modules/api/impl/model/audio_model.py
CHANGED

```diff
@@ -12,3 +12,7 @@ class AdjustConfig(BaseModel):
     pitch: float = 0
     speed_rate: float = 1
     volume_gain_db: float = 0
+
+    # loudness normalization
+    normalize: bool = True
+    headroom: float = 1
```
modules/api/impl/tts_api.py
CHANGED

```diff
@@ -1,3 +1,5 @@
+import logging
+
 from fastapi import Depends, HTTPException, Query
 from fastapi.responses import FileResponse, StreamingResponse
 from pydantic import BaseModel
@@ -10,6 +12,8 @@ from modules.api.impl.model.chattts_model import ChatTTSConfig, InferConfig
 from modules.api.impl.model.enhancer_model import EnhancerConfig
 from modules.speaker import Speaker

+logger = logging.getLogger(__name__)
+

 class TTSParams(BaseModel):
     text: str = Query(..., description="Text to synthesize")
@@ -44,6 +48,8 @@ class TTSParams(BaseModel):
     pitch: float = Query(0, description="Pitch of the audio")
     volume_gain: float = Query(0, description="Volume gain of the audio")

+    stream: bool = Query(False, description="Stream the audio")
+

 async def synthesize_tts(params: TTSParams = Depends()):
     try:
@@ -132,14 +138,22 @@ async def synthesize_tts(params: TTSParams = Depends()):
             adjust_config=adjust_config,
             enhancer_config=enhancer_config,
         )
-
-        buffer = handler.enqueue_to_buffer(format=AudioFormat(params.format))
-
         media_type = f"audio/{params.format}"
         if params.format == "mp3":
             media_type = "audio/mpeg"
-        return StreamingResponse(buffer, media_type=media_type)

+        if params.stream:
+            if infer_config.batch_size != 1:
+                # streaming only supports a batch size of 1; the requested value is ignored
+                logger.warning(
+                    f"Batch size {infer_config.batch_size} is not supported in streaming mode, will set to 1"
+                )
+
+            buffer_gen = handler.enqueue_to_stream(format=AudioFormat(params.format))
+            return StreamingResponse(buffer_gen, media_type=media_type)
+        else:
+            buffer = handler.enqueue_to_buffer(format=AudioFormat(params.format))
+            return StreamingResponse(buffer, media_type=media_type)
     except Exception as e:
         import logging
```
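A hypothetical client for the new `stream` flag. The base URL and the `/v1/tts` route are assumptions; substitute your deployment's values. `mp3` is used here because concatenated MP3 frames generally stay playable, while back-to-back standalone WAV payloads do not:

```python
import requests  # pip install requests

params = {
    "text": "你好,世界 [lbreak]",
    "format": "mp3",
    "stream": "true",  # enables the new chunked response path
}

with requests.get("http://localhost:7870/v1/tts", params=params, stream=True) as resp:
    resp.raise_for_status()
    with open("out.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=None):
            f.write(chunk)  # write audio as it arrives instead of waiting for the end
```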
modules/api/impl/xtts_v2_api.py
CHANGED

```diff
@@ -1,18 +1,15 @@
-import io
 import logging

-import …
-from fastapi import HTTPException
+from fastapi import HTTPException, Query, Request
 from fastapi.responses import StreamingResponse
 from pydantic import BaseModel

-from modules import config
-from modules.api import utils as api_utils
 from modules.api.Api import APIManager
-from modules.…
+from modules.api.impl.handler.TTSHandler import TTSHandler
+from modules.api.impl.model.audio_model import AdjustConfig, AudioFormat
+from modules.api.impl.model.chattts_model import ChatTTSConfig, InferConfig
+from modules.api.impl.model.enhancer_model import EnhancerConfig
 from modules.speaker import speaker_mgr
-from modules.synthesize_audio import synthesize_audio
-from modules.utils.audio import apply_prosody_to_audio_data

 logger = logging.getLogger(__name__)

@@ -22,8 +19,11 @@ class XTTS_V2_Settings:
         self.stream_chunk_size = 100
         self.temperature = 0.3
         self.speed = 1
+
+        # TODO: these two params are unused for now... but the GPT can actually use them; consider wiring them up
         self.length_penalty = 0.5
         self.repetition_penalty = 1.0
+
         self.top_p = 0.7
         self.top_k = 20
         self.enable_text_splitting = True
@@ -37,6 +37,7 @@ class XTTS_V2_Settings:
         self.prompt2 = ""
         self.prefix = ""
         self.spliter_threshold = 100
+        self.style = ""


 class TTSSettingsRequest(BaseModel):
@@ -58,6 +59,7 @@ class TTSSettingsRequest(BaseModel):
     prompt2: str = None
     prefix: str = None
     spliter_threshold: int = None
+    style: str = None


 class SynthesisRequest(BaseModel):
@@ -95,45 +97,101 @@ def setup(app: APIManager):
         if spk is None:
             raise HTTPException(status_code=400, detail="Invalid speaker id")

-        …
-        # TODO: these two params are unused for now... but the GPT can actually use them
-        # length_penalty=XTTSV2.length_penalty,
-        # repetition_penalty=XTTSV2.repetition_penalty,
-            text=text,
-            temperature=XTTSV2.temperature,
-            …
-            spliter_threshold=XTTSV2.spliter_threshold,
-            batch_size=XTTSV2.batch_size,
-            end_of_sentence=XTTSV2.eos,
-            infer_seed=XTTSV2.infer_seed,
-            use_decoder=XTTSV2.use_decoder,
-            prompt1=XTTSV2.prompt1,
-            prompt2=XTTSV2.prompt2,
-            …
-        )
-
-        …
-        buffer = io.BytesIO()
-        sf.write(buffer, audio_data, sample_rate, format="wav")
-        buffer.seek(0)
-
-        buffer = …
+        tts_config = ChatTTSConfig(
+            style=XTTSV2.style,
+            temperature=XTTSV2.temperature,
+            top_k=XTTSV2.top_k,
+            top_p=XTTSV2.top_p,
+            prefix=XTTSV2.prefix,
+            prompt1=XTTSV2.prompt1,
+            prompt2=XTTSV2.prompt2,
+        )
+        infer_config = InferConfig(
+            batch_size=XTTSV2.batch_size,
+            spliter_threshold=XTTSV2.spliter_threshold,
+            eos=XTTSV2.eos,
+            seed=XTTSV2.infer_seed,
+        )
+        adjust_config = AdjustConfig(
+            speed_rate=XTTSV2.speed,
+        )
+        # TODO: support enhancer
+        enhancer_config = EnhancerConfig(
+            # enabled=params.enhance or params.denoise or False,
+            # lambd=0.9 if params.denoise else 0.1,
+        )
+
+        handler = TTSHandler(
+            text_content=text,
+            spk=spk,
+            tts_config=tts_config,
+            infer_config=infer_config,
+            adjust_config=adjust_config,
+            enhancer_config=enhancer_config,
+        )
+
+        buffer = handler.enqueue_to_buffer(AudioFormat.mp3)

         return StreamingResponse(buffer, media_type="audio/mpeg")

     @app.get("/v1/xtts_v2/tts_stream")
-    async def tts_stream(
-        …
+    async def tts_stream(
+        request: Request,
+        text: str = Query(),
+        speaker_wav: str = Query(),
+        language: str = Query(),
+    ):
+        # speaker_wav is just the speaker id...
+        voice_id = speaker_wav
+
+        spk = speaker_mgr.get_speaker_by_id(voice_id) or speaker_mgr.get_speaker(
+            voice_id
+        )
+        if spk is None:
+            raise HTTPException(status_code=400, detail="Invalid speaker id")
+
+        tts_config = ChatTTSConfig(
+            style=XTTSV2.style,
+            temperature=XTTSV2.temperature,
+            top_k=XTTSV2.top_k,
+            top_p=XTTSV2.top_p,
+            prefix=XTTSV2.prefix,
+            prompt1=XTTSV2.prompt1,
+            prompt2=XTTSV2.prompt2,
+        )
+        infer_config = InferConfig(
+            batch_size=XTTSV2.batch_size,
+            spliter_threshold=XTTSV2.spliter_threshold,
+            eos=XTTSV2.eos,
+            seed=XTTSV2.infer_seed,
+        )
+        adjust_config = AdjustConfig(
+            speed_rate=XTTSV2.speed,
+        )
+        # TODO: support enhancer
+        enhancer_config = EnhancerConfig(
+            # enabled=params.enhance or params.denoise or False,
+            # lambd=0.9 if params.denoise else 0.1,
+        )
+
+        handler = TTSHandler(
+            text_content=text,
+            spk=spk,
+            tts_config=tts_config,
+            infer_config=infer_config,
+            adjust_config=adjust_config,
+            enhancer_config=enhancer_config,
+        )
+
+        async def generator():
+            for chunk in handler.enqueue_to_stream(AudioFormat.mp3):
+                disconnected = await request.is_disconnected()
+                if disconnected:
+                    break
+
+                yield chunk
+
+        return StreamingResponse(generator(), media_type="audio/mpeg")

     @app.post("/v1/xtts_v2/set_tts_settings")
     async def set_tts_settings(request: TTSSettingsRequest):
@@ -195,6 +253,8 @@ def setup(app: APIManager):
                 XTTSV2.prefix = request.prefix
             if request.spliter_threshold:
                 XTTSV2.spliter_threshold = request.spliter_threshold
+            if request.style:
+                XTTSV2.style = request.style

             return {"message": "Settings successfully applied"}
         except Exception as e:
```
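A hypothetical client for the new `/v1/xtts_v2/tts_stream` route. As the diff notes, `speaker_wav` carries a ChatTTS-Forge speaker id rather than a wav path; the host/port and the `"female2"` id are placeholders:

```python
import requests  # pip install requests

params = {"text": "Hello there.", "speaker_wav": "female2", "language": "zh"}

with requests.get(
    "http://localhost:7870/v1/xtts_v2/tts_stream", params=params, stream=True
) as resp:
    resp.raise_for_status()
    with open("stream.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=4096):
            f.write(chunk)  # server stops generating if the client disconnects
```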
modules/api/worker.py
CHANGED

```diff
@@ -5,7 +5,9 @@ import os
 import dotenv
 from fastapi import FastAPI

+from launch import setup_uvicon_args
 from modules.ffmpeg_env import setup_ffmpeg_path
+from modules.models_setup import process_model_args, setup_model_args

 setup_ffmpeg_path()
 logging.basicConfig(
@@ -14,13 +16,7 @@ logging.basicConfig(
 )

 from modules import config
-from modules.api.api_setup import (
-    process_api_args,
-    process_model_args,
-    setup_api_args,
-    setup_model_args,
-    setup_uvicon_args,
-)
+from modules.api.api_setup import process_api_args, setup_api_args
 from modules.api.app_config import app_description, app_title, app_version
 from modules.utils.torch_opt import configure_torch_optimizations
```
modules/devices/devices.py
CHANGED

```diff
@@ -92,7 +92,10 @@ def get_optimal_device():


 def get_device_for(task):
-    if …
+    if (
+        task in config.runtime_env_vars.use_cpu
+        or "all" in config.runtime_env_vars.use_cpu
+    ):
         return cpu

     return get_optimal_device()
@@ -128,6 +131,9 @@ def reset_device():
     global dtype_gpt
     global dtype_decoder

+    if config.runtime_env_vars.use_cpu is None:
+        config.runtime_env_vars.use_cpu = []
+
     if "all" in config.runtime_env_vars.use_cpu and not config.runtime_env_vars.no_half:
         logger.warning(
             "Cannot use half precision with CPU, using full precision instead"
```
modules/finetune/train_speaker.py
CHANGED

```diff
@@ -255,7 +255,7 @@ if __name__ == "__main__":
         vocos_model=chat.pretrain_models["vocos"],
         tar_path=tar_path,
         tar_in_memory=tar_in_memory,
-        device=devices.…
+        device=devices.get_device_for("trainer"),
         # speakers=None, # set(['speaker_A', 'speaker_B'])
     )
@@ -267,7 +267,7 @@ if __name__ == "__main__":
     speaker_embeds = {
         speaker: torch.tensor(
             spk.emb,
-            device=devices.…
+            device=devices.get_device_for("trainer"),
             requires_grad=True,
         )
         for speaker in dataset.speakers
```
modules/generate_audio.py
CHANGED

```diff
@@ -1,11 +1,12 @@
 import gc
 import logging
-from typing import Union
+from typing import Generator, Union

 import numpy as np
 import torch

 from modules import config, models
+from modules.ChatTTS import ChatTTS
 from modules.devices import devices
 from modules.speaker import Speaker
 from modules.utils.cache import conditional_cache
@@ -13,6 +14,8 @@ from modules.utils.SeedContext import SeedContext

 logger = logging.getLogger(__name__)

+SAMPLE_RATE = 24000
+

 def generate_audio(
     text: str,
@@ -42,20 +45,18 @@ def generate_audio(
     return (sample_rate, wav)


-
-def generate_audio_batch(
+def parse_infer_params(
     texts: list[str],
+    chat_tts: ChatTTS.Chat,
     temperature: float = 0.3,
     top_P: float = 0.7,
     top_K: float = 20,
     spk: Union[int, Speaker] = -1,
     infer_seed: int = -1,
-    use_decoder: bool = True,
     prompt1: str = "",
     prompt2: str = "",
     prefix: str = "",
 ):
-    chat_tts = models.load_chat_tts()
     params_infer_code = {
         "spk_emb": None,
         "temperature": temperature,
@@ -97,18 +98,93 @@ def parse_infer_params(
         }
     )

+    return params_infer_code
+
+
+@torch.inference_mode()
+def generate_audio_batch(
+    texts: list[str],
+    temperature: float = 0.3,
+    top_P: float = 0.7,
+    top_K: float = 20,
+    spk: Union[int, Speaker] = -1,
+    infer_seed: int = -1,
+    use_decoder: bool = True,
+    prompt1: str = "",
+    prompt2: str = "",
+    prefix: str = "",
+):
+    chat_tts = models.load_chat_tts()
+    params_infer_code = parse_infer_params(
+        texts=texts,
+        chat_tts=chat_tts,
+        temperature=temperature,
+        top_P=top_P,
+        top_K=top_K,
+        spk=spk,
+        infer_seed=infer_seed,
+        prompt1=prompt1,
+        prompt2=prompt2,
+        prefix=prefix,
+    )
+
     with SeedContext(infer_seed, True):
         wavs = chat_tts.generate_audio(
-            texts, params_infer_code, use_decoder=use_decoder
+            prompt=texts, params_infer_code=params_infer_code, use_decoder=use_decoder
         )

     if config.auto_gc:
         devices.torch_gc()
         gc.collect()

-    return …
+    return [(SAMPLE_RATE, np.array(wav).flatten().astype(np.float32)) for wav in wavs]
+
+
+# TODO: generate_audio_stream should also support lru cache
+@torch.inference_mode()
+def generate_audio_stream(
+    text: str,
+    temperature: float = 0.3,
+    top_P: float = 0.7,
+    top_K: float = 20,
+    spk: Union[int, Speaker] = -1,
+    infer_seed: int = -1,
+    use_decoder: bool = True,
+    prompt1: str = "",
+    prompt2: str = "",
+    prefix: str = "",
+) -> Generator[tuple[int, np.ndarray], None, None]:
+    chat_tts = models.load_chat_tts()
+    texts = [text]
+    params_infer_code = parse_infer_params(
+        texts=texts,
+        chat_tts=chat_tts,
+        temperature=temperature,
+        top_P=top_P,
+        top_K=top_K,
+        spk=spk,
+        infer_seed=infer_seed,
+        prompt1=prompt1,
+        prompt2=prompt2,
+        prefix=prefix,
+    )
+
+    with SeedContext(infer_seed, True):
+        wavs_gen = chat_tts.generate_audio(
+            prompt=texts,
+            params_infer_code=params_infer_code,
+            use_decoder=use_decoder,
+            stream=True,
+        )
+
+    for wav in wavs_gen:
+        yield [SAMPLE_RATE, np.array(wav).flatten().astype(np.float32)]

     if config.auto_gc:
         devices.torch_gc()
         gc.collect()

+    return

 lru_cache_enabled = False
```
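A consumption sketch for the new streaming generator. It assumes the ChatTTS weights sit under `./models/ChatTTS` so `load_chat_tts()` succeeds, and `spk=2` is an arbitrary seed-based speaker; note the FIXME recorded in `TTSHandler.enqueue_stream` about short glitches at chunk boundaries when chunks are concatenated:

```python
import numpy as np
import soundfile as sf

from modules.generate_audio import generate_audio_stream

chunks = []
for sr, wav in generate_audio_stream("你好,世界", spk=2, infer_seed=42):
    chunks.append(wav)  # each wav is a float32 chunk at the fixed 24 kHz SAMPLE_RATE
    print(f"received {wav.shape[0] / sr:.2f}s of audio")

sf.write("streamed.wav", np.concatenate(chunks), 24000)
```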
modules/models.py
CHANGED

```diff
@@ -21,18 +21,27 @@ def load_chat_tts_in_thread():

     logger.info("Loading ChatTTS models")
     chat_tts = ChatTTS.Chat()
+    device = devices.get_device_for("chattts")
+    dtype = devices.dtype
     chat_tts.load_models(
         compile=config.runtime_env_vars.compile,
         source="local",
         local_path="./models/ChatTTS",
-        device=…
-        dtype=…
+        device=device,
+        dtype=dtype,
         dtype_vocos=devices.dtype_vocos,
         dtype_dvae=devices.dtype_dvae,
         dtype_gpt=devices.dtype_gpt,
         dtype_decoder=devices.dtype_decoder,
     )

+    # If the device is CPU while dtype == float16, warn that inference may not
+    # work properly and recommend float32, i.e. enabling the `--no_half` flag.
+    if device == devices.cpu and dtype == torch.float16:
+        logger.warning(
+            "The device is CPU and dtype is float16, which may not work properly. It is recommended to use float32 by enabling the `--no_half` parameter."
+        )
+
     devices.torch_gc()
     logger.info("ChatTTS models loaded")
```
modules/models_setup.py
ADDED

```python
import argparse
import logging

from modules import generate_audio
from modules.devices import devices
from modules.Enhancer.ResembleEnhance import load_enhancer
from modules.models import load_chat_tts
from modules.utils import env


def setup_model_args(parser: argparse.ArgumentParser):
    parser.add_argument("--compile", action="store_true", help="Enable model compile")
    parser.add_argument(
        "--no_half",
        action="store_true",
        help="Disable half precision for model inference",
    )
    parser.add_argument(
        "--off_tqdm",
        action="store_true",
        help="Disable tqdm progress bar",
    )
    parser.add_argument(
        "--device_id",
        type=str,
        help="Select the default CUDA device to use (export CUDA_VISIBLE_DEVICES=0,1,etc might be needed before)",
        default=None,
    )
    parser.add_argument(
        "--use_cpu",
        nargs="+",
        help="use CPU as torch device for specified modules",
        default=[],
        type=str.lower,
        choices=["all", "chattts", "enhancer", "trainer"],
    )
    parser.add_argument(
        "--lru_size",
        type=int,
        default=64,
        help="Set the size of the request cache pool, set it to 0 will disable lru_cache",
    )
    parser.add_argument(
        "--debug_generate",
        action="store_true",
        help="Enable debug mode for audio generation",
    )
    parser.add_argument(
        "--preload_models",
        action="store_true",
        help="Preload all models at startup",
    )


def process_model_args(args: argparse.Namespace):
    lru_size = env.get_and_update_env(args, "lru_size", 64, int)
    compile = env.get_and_update_env(args, "compile", False, bool)
    device_id = env.get_and_update_env(args, "device_id", None, str)
    use_cpu = env.get_and_update_env(args, "use_cpu", [], list)
    no_half = env.get_and_update_env(args, "no_half", False, bool)
    off_tqdm = env.get_and_update_env(args, "off_tqdm", False, bool)
    debug_generate = env.get_and_update_env(args, "debug_generate", False, bool)
    preload_models = env.get_and_update_env(args, "preload_models", False, bool)

    generate_audio.setup_lru_cache()
    devices.reset_device()
    devices.first_time_calculation()

    if debug_generate:
        generate_audio.logger.setLevel(logging.DEBUG)

    if preload_models:
        load_chat_tts()
        load_enhancer()
```
modules/normalization.py
CHANGED

````diff
@@ -1,39 +1,21 @@
+import html
 import re
-from functools import lru_cache

 import emojiswitch
+import ftfy

 from modules import models
+from modules.utils.detect_lang import guess_lang
+from modules.utils.HomophonesReplacer import HomophonesReplacer
+from modules.utils.html import remove_html_tags as _remove_html_tags
 from modules.utils.markdown import markdown_to_text
-from modules.utils.zh_normalization.text_normlization import …
+from modules.utils.zh_normalization.text_normlization import TextNormalizer

 # whether to disable the unk token check
 # NOTE: lets unit tests skip model loading
 DISABLE_UNK_TOKEN_CHECK = False


-@lru_cache(maxsize=64)
-def is_chinese(text):
-    # Chinese characters occupy the Unicode range \u4e00-\u9fff
-    chinese_pattern = re.compile(r"[\u4e00-\u9fff]")
-    return bool(chinese_pattern.search(text))
-
-
-@lru_cache(maxsize=64)
-def is_eng(text):
-    eng_pattern = re.compile(r"[a-zA-Z]")
-    return bool(eng_pattern.search(text))
-
-
-@lru_cache(maxsize=64)
-def guess_lang(text):
-    if is_chinese(text):
-        return "zh"
-    if is_eng(text):
-        return "en"
-    return "zh"
-
-
 post_normalize_pipeline = []
 pre_normalize_pipeline = []

@@ -184,9 +166,32 @@ def replace_unk_tokens(text):
     return output_text


+homo_replacer = HomophonesReplacer(map_file_path="./data/homophones_map.json")
+
+
+@post_normalize()
+def replace_homophones(text):
+    lang = guess_lang(text)
+    if lang == "zh":
+        text = homo_replacer.replace(text)
+    return text
+
+
 ## ---------- pre normalize ----------


+@pre_normalize()
+def html_unescape(text):
+    text = html.unescape(text)
+    text = html.unescape(text)
+    return text
+
+
+@pre_normalize()
+def fix_text(text):
+    return ftfy.fix_text(text=text)
+
+
 @pre_normalize()
 def apply_markdown_to_text(text):
     if is_markdown(text):
@@ -194,6 +199,11 @@ def apply_markdown_to_text(text):
     return text


+@pre_normalize()
+def remove_html_tags(text):
+    return _remove_html_tags(text)
+
+
 # turn "xxx" => \nxxx\n
 # turn 'xxx' => \nxxx\n
 @pre_normalize()
@@ -293,6 +303,7 @@ if __name__ == "__main__":
         " [oral_9] [laugh_0] [break_0] 电 [speed_0] 影 [speed_0] 中 梁朝伟 [speed_9] 扮演的陈永仁的编号27149",
         " 明天有62%的概率降雨",
         "大🍌,一条大🍌,嘿,你的感觉真的很奇妙 [lbreak]",
+        "I like eating 🍏",
         """
 # 你好,世界
 ```js
````
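Calling `html.unescape` twice in the new `html_unescape` step is deliberate: it also repairs text that was entity-escaped twice upstream, while a second pass over already-clean text is a no-op. A standalone illustration:

```python
import html

# Text that was escaped twice upstream: "&" -> "&amp;" -> "&amp;amp;".
raw = "Tom &amp;amp; Jerry say &amp;lt;hi&amp;gt;"
once = html.unescape(raw)    # 'Tom &amp; Jerry say &lt;hi&gt;'
twice = html.unescape(once)  # 'Tom & Jerry say <hi>'
print(twice)
```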
modules/refiner.py
CHANGED

```diff
@@ -1,3 +1,5 @@
+from typing import Generator
+
 import numpy as np
 import torch
@@ -31,4 +33,10 @@ def refine_text(
             "disable_tqdm": config.runtime_env_vars.off_tqdm,
         },
     )
+    if isinstance(refined_text, Generator):
+        raise NotImplementedError(
+            "Refiner is not yet implemented for generator output"
+        )
+    if isinstance(refined_text, list):
+        refined_text = "\n".join(refined_text)
     return refined_text
```
modules/repos_static/resemble_enhance/inference.py
CHANGED

```diff
@@ -1,12 +1,12 @@
 import logging
 import time
+from functools import partial

 import torch
 import torch.nn.functional as F
 from torch.nn.utils.parametrize import remove_parametrizations
 from torchaudio.functional import resample
 from torchaudio.transforms import MelSpectrogram
-from tqdm import trange

 from modules import config
 from modules.devices import devices
@@ -142,10 +142,10 @@ def inference(
     chunk_seconds: float = 30.0,
     overlap_seconds: float = 1.0,
 ):
+    from tqdm import trange
+
     if config.runtime_env_vars.off_tqdm:
-        trange = …
-    else:
-        from tqdm import trange
+        trange = partial(trange, disable=True)

     remove_weight_norm_recursively(model)

@@ -188,7 +188,7 @@ def inference(
         torch.cuda.synchronize()

     elapsed_time = time.perf_counter() - start_time
-    logger.…(
+    logger.debug(
         f"Elapsed time: {elapsed_time:.3f} s, {hwav.shape[-1] / elapsed_time / 1000:.3f} kHz"
     )
     devices.torch_gc()
```
modules/speaker.py
CHANGED

```diff
@@ -29,6 +29,12 @@ class Speaker:
         speaker.emb = tensor
         return speaker

+    @staticmethod
+    def from_seed(seed: int):
+        speaker = Speaker(seed_or_tensor=seed)
+        speaker.emb = create_speaker_from_seed(seed)
+        return speaker
+
     def __init__(
         self, seed_or_tensor: Union[int, torch.Tensor], name="", gender="", describe=""
     ):
```
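A sketch of the new factory in use; it assumes the repo environment, since the seed is expanded into an embedding through `create_speaker_from_seed`:

```python
import torch

from modules.speaker import Speaker

spk = Speaker.from_seed(42)  # deterministic speaker embedding from a seed
spk.name = "seed-42"
torch.save(spk, "seed_42.pt")  # same .pt format the WebUI speaker creator writes
```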
modules/synthesize_audio.py
CHANGED

```diff
@@ -1,7 +1,5 @@
-import io
 from typing import Union

-from modules import generate_audio as generate
 from modules.SentenceSplitter import SentenceSplitter
 from modules.speaker import Speaker
 from modules.ssml_parser.SSMLParser import SSMLSegment
```
modules/synthesize_stream.py
ADDED

```python
import io
from typing import Generator, Union

import numpy as np

from modules import generate_audio as generate
from modules.SentenceSplitter import SentenceSplitter
from modules.speaker import Speaker


def synthesize_stream(
    text: str,
    temperature: float = 0.3,
    top_P: float = 0.7,
    top_K: float = 20,
    spk: Union[int, Speaker] = -1,
    infer_seed: int = -1,
    use_decoder: bool = True,
    prompt1: str = "",
    prompt2: str = "",
    prefix: str = "",
    spliter_threshold: int = 100,
    end_of_sentence="",
) -> Generator[tuple[int, np.ndarray], None, None]:
    spliter = SentenceSplitter(spliter_threshold)
    sentences = spliter.parse(text)

    for sentence in sentences:
        wav_gen = generate.generate_audio_stream(
            text=sentence + end_of_sentence,
            temperature=temperature,
            top_P=top_P,
            top_K=top_K,
            spk=spk,
            infer_seed=infer_seed,
            use_decoder=use_decoder,
            prompt1=prompt1,
            prompt2=prompt2,
            prefix=prefix,
        )
        for sr, wav in wav_gen:
            yield sr, wav
```
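A usage sketch of the sentence-level streaming pipeline, assuming the repo environment is available; the `"[lbreak]"` end-of-sentence token is an illustrative choice, not a prescribed default:

```python
import numpy as np
import soundfile as sf

from modules.synthesize_stream import synthesize_stream

text = "第一句话。第二句话。第三句话。" * 10  # long enough to trigger splitting
chunks = [
    wav
    for _, wav in synthesize_stream(
        text, spk=2, infer_seed=42, spliter_threshold=100, end_of_sentence="[lbreak]"
    )
]
sf.write("long_form.wav", np.concatenate(chunks), 24000)
```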
modules/utils/HomophonesReplacer.py
ADDED

```python
import json


# ref: https://github.com/2noise/ChatTTS/commit/ce1c962b6235bd7d0c637fbdcda5e2dccdbac80d
class HomophonesReplacer:
    """
    Homophones Replacer

    Replace the mispronounced characters with correctly pronounced ones.

    Creation process of homophones_map.json:

    1. Establish a word corpus using the [Tencent AI Lab Embedding Corpora v0.2.0 large] with 12 million entries. After cleaning, approximately 1.8 million entries remain. Use ChatTTS to infer the text.
    2. Record discrepancies between the inferred and input text, identifying about 180,000 misread words.
    3. Create a pinyin to common characters mapping using correctly read characters by ChatTTS.
    4. For each discrepancy, extract the correct pinyin using [python-pinyin] and find homophones with the correct pronunciation from the mapping.

    Thanks to:
    [Tencent AI Lab Embedding Corpora for Chinese and English Words and Phrases](https://ai.tencent.com/ailab/nlp/en/embedding.html)
    [python-pinyin](https://github.com/mozillazg/python-pinyin)
    """

    def __init__(self, map_file_path):
        self.homophones_map = self.load_homophones_map(map_file_path)

    def load_homophones_map(self, map_file_path):
        with open(map_file_path, "r", encoding="utf-8") as f:
            homophones_map = json.load(f)
        return homophones_map

    def replace(self, text):
        result = []
        for char in text:
            if char in self.homophones_map:
                result.append(self.homophones_map[char])
            else:
                result.append(char)
        return "".join(result)
```
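A toy demonstration of the mechanics with an inline one-entry map; the real `./data/homophones_map.json` is the mined character table described in the docstring, and the `"幺" -> "妖"` pair below is a hypothetical same-pinyin example:

```python
import json
import tempfile

from modules.utils.HomophonesReplacer import HomophonesReplacer

with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as f:
    json.dump({"幺": "妖"}, f, ensure_ascii=False)  # hypothetical mapping entry
    path = f.name

replacer = HomophonesReplacer(map_file_path=path)
print(replacer.replace("幺蛾子"))  # character-by-character lookup -> 妖蛾子
```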
modules/utils/audio.py
CHANGED

```diff
@@ -2,14 +2,14 @@ import sys
 from io import BytesIO

 import numpy as np
-import pyrubberband as pyrb
 import soundfile as sf
-from pydub import AudioSegment
+from pydub import AudioSegment, effects
+import pyrubberband as pyrb

 INT16_MAX = np.iinfo(np.int16).max


-def audio_to_int16(audio_data):
+def audio_to_int16(audio_data: np.ndarray) -> np.ndarray:
     if (
         audio_data.dtype == np.float32
         or audio_data.dtype == np.float64
@@ -20,6 +20,23 @@ def audio_to_int16(audio_data):
     return audio_data


+def pydub_to_np(audio: AudioSegment) -> tuple[int, np.ndarray]:
+    """
+    Converts pydub audio segment into np.float32 of shape [duration_in_seconds*sample_rate, channels],
+    where each value is in range [-1.0, 1.0].
+    Returns tuple (audio_np_array, sample_rate).
+    """
+    nd_array = np.array(audio.get_array_of_samples(), dtype=np.float32)
+    if audio.channels != 1:
+        nd_array = nd_array.reshape((-1, audio.channels))
+    nd_array = nd_array / (1 << (8 * audio.sample_width - 1))
+
+    return (
+        audio.frame_rate,
+        nd_array,
+    )
+
+
 def audiosegment_to_librosawav(audiosegment: AudioSegment) -> np.ndarray:
     """
     Converts pydub audio segment into np.float32 of shape [duration_in_seconds*sample_rate, channels],
@@ -35,64 +52,42 @@ def audiosegment_to_librosawav(audiosegment: AudioSegment) -> np.ndarray:
     return fp_arr


-def pydub_to_np(audio: AudioSegment) -> …
-    """
-    …
-    where each value is in range [-1.0, 1.0].
-    Returns tuple (audio_np_array, sample_rate).
-    """
-    return (
-        audio.frame_rate,
-        np.array(audio.get_array_of_samples(), dtype=np.float32).reshape(
-            (-1, audio.channels)
-        )
-        / (1 << (8 * audio.sample_width - 1)),
-    )
-
-
-def ndarray_to_segment(ndarray, frame_rate):
+def ndarray_to_segment(
+    ndarray: np.ndarray, frame_rate: int, sample_width: int = None, channels: int = None
+) -> AudioSegment:
     buffer = BytesIO()
-    sf.write(buffer, ndarray, frame_rate, format="wav")
+    sf.write(buffer, ndarray, frame_rate, format="wav", subtype="PCM_16")
     buffer.seek(0)
-    sound = AudioSegment.from_wav(
-        buffer,
-    )
-    return sound
+    sound: AudioSegment = AudioSegment.from_wav(buffer)

+    if sample_width is None:
+        sample_width = sound.sample_width
+    if channels is None:
+        channels = sound.channels

-def …
-    time_factor = np.clip(time_factor, 0.2, 10)
-    sr = input_segment.frame_rate
-    y = audiosegment_to_librosawav(input_segment)
-    y_stretch = pyrb.time_stretch(y, sr, time_factor)
-
-    sound = ndarray_to_segment(
-        y_stretch,
-        frame_rate=sr,
-    )
-    return sound
+    return (
+        sound.set_frame_rate(frame_rate)
+        .set_sample_width(sample_width)
+        .set_channels(channels)
+    )


-def …
-    …
-    y_shift = pyrb.pitch_shift(y, sr, pitch_shift_factor)
-
-    sound = ndarray_to_segment(
-        y_shift,
-        frame_rate=sr,
-    )
-    return sound
+def apply_prosody_to_audio_segment(
+    audio_segment: AudioSegment,
+    rate: float = 1,
+    volume: float = 0,
+    pitch: int = 0,
+    sr: int = 24000,
+) -> AudioSegment:
+    audio_data = audiosegment_to_librosawav(audio_segment)
+
+    audio_data = apply_prosody_to_audio_data(audio_data, rate, volume, pitch, sr)
+
+    audio_segment = ndarray_to_segment(
+        audio_data, sr, audio_segment.sample_width, audio_segment.channels
+    )
+
+    return audio_segment


 def apply_prosody_to_audio_data(
@@ -114,6 +109,17 @@ def apply_prosody_to_audio_data(
     return audio_data


+def apply_normalize(
+    audio_data: np.ndarray,
+    headroom: float = 1,
+    sr: int = 24000,
+):
+    segment = ndarray_to_segment(audio_data, sr)
+    segment = effects.normalize(seg=segment, headroom=headroom)
+
+    return pydub_to_np(segment)
+
+
 if __name__ == "__main__":
     input_file = sys.argv[1]

@@ -123,11 +129,11 @@ if __name__ == "__main__":
     input_sound = AudioSegment.from_mp3(input_file)

     for time_factor in time_stretch_factors:
-        output_wav = f"…
-        …
-        …
+        output_wav = f"{input_file}_time_{time_factor}.wav"
+        output_sound = apply_prosody_to_audio_segment(input_sound, rate=time_factor)
+        output_sound.export(output_wav, format="wav")

     for pitch_factor in pitch_shift_factors:
-        output_wav = f"…
-        …
-        …
+        output_wav = f"{input_file}_pitch_{pitch_factor}.wav"
+        output_sound = apply_prosody_to_audio_segment(input_sound, pitch=pitch_factor)
+        output_sound.export(output_wav, format="wav")
```
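A round-trip sketch of the new `apply_normalize` helper: ndarray to `AudioSegment`, through `pydub.effects.normalize`, and back to `(sr, ndarray)`. It needs pydub (with ffmpeg) and soundfile installed; the quiet test tone is invented:

```python
import numpy as np

from modules.utils.audio import apply_normalize

t = np.linspace(0, 1, 24000, endpoint=False)
quiet = (0.05 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)

sr, loud = apply_normalize(quiet, headroom=1, sr=24000)
print(np.abs(quiet).max(), np.abs(loud).max())  # peak raised to ~1 dB below full scale
```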
modules/utils/detect_lang.py
ADDED

```python
from functools import lru_cache
from typing import Literal


@lru_cache(maxsize=64)
def is_chinese(text):
    for char in text:
        if "\u4e00" <= char <= "\u9fff":
            return True
    return False


@lru_cache(maxsize=64)
def is_eng(text):
    for char in text:
        if "a" <= char.lower() <= "z":
            return True
    return False


@lru_cache(maxsize=64)
def guess_lang(text) -> Literal["zh", "en"]:
    if is_chinese(text):
        return "zh"
    if is_eng(text):
        return "en"
    return "zh"
```
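With the repo on `PYTHONPATH`, the heuristic behaves like this (a quick illustration, not a test suite):

```python
from modules.utils.detect_lang import guess_lang

print(guess_lang("你好 world"))  # "zh": any CJK character wins over Latin letters
print(guess_lang("hello"))       # "en"
print(guess_lang("12345"))       # "zh": the fallback when neither class matches
```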
modules/utils/html.py
ADDED

```python
from html.parser import HTMLParser


class HTMLTagRemover(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.fed = []

    def handle_data(self, data):
        self.fed.append(data)

    def get_data(self):
        return "\n".join(self.fed)


def remove_html_tags(text):
    parser = HTMLTagRemover()
    parser.feed(text)
    return parser.get_data()


if __name__ == "__main__":
    input_text = "<h1>一个标题</h1> 这是一段包含<code>标签</code>的文本。"
    output_text = remove_html_tags(input_text)
    print(output_text)  # Output: 一个标题 这是一段包含标签的文本。
```
modules/utils/ignore_warn.py
ADDED

```python
import warnings


def ignore_useless_warnings():
    # NOTE: the trigger site is `vocos/heads.py:60`, which we can't modify... so ignore it
    warnings.filterwarnings(
        "ignore", category=UserWarning, message="ComplexHalf support is experimental"
    )
```
modules/utils/markdown.py
CHANGED

```diff
@@ -36,6 +36,7 @@ class PlainTextRenderer(mistune.HTMLRenderer):
             return html + "\n" + text + "\n"
         return "\n" + text + "\n"

+    # FIXME: the current list conversion cannot preserve item numbering
     def list_item(self, text):
         return "" + text + "\n"
```
modules/webui/localization_runtime.py
CHANGED

```diff
@@ -90,6 +90,28 @@ class ZHLocalizationVars(LocalizationVars):
         ]

         self.tts_examples = [
+            {
+                "text": """
+Fear is the path to the dark side. Fear leads to anger. Anger leads to hate. Hate leads to suffering.
+恐惧是通向黑暗之路。恐惧导致愤怒。愤怒引发仇恨。仇恨造成痛苦。 [lbreak]
+Do. Or do not. There is no try.
+要么做,要么不做,没有试试看。[lbreak]
+Peace is a lie, there is only passion.
+安宁即是谎言,激情方为王道。[lbreak]
+Through passion, I gain strength.
+我以激情换取力量。[lbreak]
+Through strength, I gain power.
+以力量赚取权力。[lbreak]
+Through power, I gain victory.
+以权力赢取胜利。[lbreak]
+Through victory, my chains are broken.
+于胜利中超越自我。[lbreak]
+The Force shall free me.
+原力任我逍遥。[lbreak]
+May the force be with you!
+愿原力与你同在![lbreak]
+""".strip()
+            },
             {
                 "text": "大🍌,一条大🍌,嘿,你的感觉真的很奇妙 [lbreak]",
             },
```
modules/webui/speaker/speaker_creator.py
CHANGED
@@ -62,11 +62,10 @@ def create_spk_from_seed(
     gender: str,
     desc: str,
 ):
-
-
-
-    spk =
-    spk.emb = emb
+    spk = Speaker.from_seed(seed)
+    spk.name = name
+    spk.gender = gender
+    spk.describe = desc
 
     with tempfile.NamedTemporaryFile(delete=False, suffix=".pt") as tmp_file:
         torch.save(spk, tmp_file)
@@ -82,7 +81,8 @@ def test_spk_voice(
     text: str,
     progress=gr.Progress(track_tqdm=True),
 ):
-
+    spk = Speaker.from_seed(seed)
+    return tts_generate(spk=spk, text=text, progress=progress)
 
 
 def random_speaker():
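Both fixes route through the same factory: Speaker.from_seed deterministically derives a speaker embedding from an integer seed, and the metadata fields are plain attributes set afterwards (names taken verbatim from the hunk above; the attribute values are illustrative):

from modules.speaker import Speaker

spk = Speaker.from_seed(42)  # same seed -> same voice
spk.name = "demo"
spk.gender = "female"
spk.describe = "a speaker created for testing"

Note the attribute is spelled `describe`, not `description`.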
modules/webui/ssml/podcast_tab.py
CHANGED
@@ -124,7 +124,7 @@ def create_ssml_podcast_tab(ssml_input: gr.Textbox, tabs1: gr.Tabs, tabs2: gr.Tabs):
 
     def send_to_ssml(msg, spk, style, sheet: pd.DataFrame):
         if sheet.empty:
-
+            raise gr.Error("Please add some text to the script table.")
         msg, spk, style, ssml = merge_dataframe_to_ssml(msg, spk, style, sheet)
         return [
             msg,
modules/webui/ssml/ssml_tab.py
CHANGED
@@ -6,19 +6,6 @@ from modules.webui.webui_utils import synthesize_ssml
 
 def create_ssml_interface():
     with gr.Row():
-        with gr.Column(scale=3):
-            with gr.Group():
-                gr.Markdown("📝SSML Input")
-                gr.Markdown("SSML_TEXT_GUIDE")
-                ssml_input = gr.Textbox(
-                    label="SSML Input",
-                    lines=10,
-                    value=webui_config.localization.DEFAULT_SSML_TEXT,
-                    placeholder="输入 SSML 或选择示例",
-                    elem_id="ssml_input",
-                    show_label=False,
-                )
-            ssml_button = gr.Button("🔊Synthesize SSML", variant="primary")
         with gr.Column(scale=1):
             with gr.Group():
                 gr.Markdown("🎛️Parameters")
@@ -44,11 +31,64 @@ def create_ssml_interface():
                     step=1,
                 )
 
+            with gr.Group():
+                gr.Markdown("🎛️Adjuster")
+                # adjust speed / pitch / volume
+                # loudness normalization can optionally be enabled
+
+                speed_input = gr.Slider(
+                    label="Speed",
+                    value=1.0,
+                    minimum=0.5,
+                    maximum=2.0,
+                    step=0.1,
+                )
+                pitch_input = gr.Slider(
+                    label="Pitch",
+                    value=0,
+                    minimum=-12,
+                    maximum=12,
+                    step=0.1,
+                )
+                volume_up_input = gr.Slider(
+                    label="Volume Gain",
+                    value=0,
+                    minimum=-12,
+                    maximum=12,
+                    step=0.1,
+                )
+
+                enable_loudness_normalization = gr.Checkbox(
+                    value=True,
+                    label="Enable Loudness EQ",
+                )
+                headroom_input = gr.Slider(
+                    label="Headroom",
+                    value=1,
+                    minimum=0,
+                    maximum=12,
+                    step=0.1,
+                )
+
             with gr.Group():
                 gr.Markdown("💪🏼Enhance")
                 enable_enhance = gr.Checkbox(value=True, label="Enable Enhance")
                 enable_de_noise = gr.Checkbox(value=False, label="Enable De-noise")
 
+        with gr.Column(scale=3):
+            with gr.Group():
+                gr.Markdown("📝SSML Input")
+                gr.Markdown("SSML_TEXT_GUIDE")
+                ssml_input = gr.Textbox(
+                    label="SSML Input",
+                    lines=10,
+                    value=webui_config.localization.DEFAULT_SSML_TEXT,
+                    placeholder="输入 SSML 或选择示例",
+                    elem_id="ssml_input",
+                    show_label=False,
+                )
+            ssml_button = gr.Button("🔊Synthesize SSML", variant="primary")
+
             with gr.Group():
                 gr.Markdown("🎄Examples")
                 gr.Examples(
@@ -56,7 +96,9 @@ def create_ssml_interface():
                     inputs=[ssml_input],
                 )
 
-
+            with gr.Group():
+                gr.Markdown("🎨Output")
+                ssml_output = gr.Audio(label="Generated Audio", format="mp3")
 
     ssml_button.click(
         synthesize_ssml,
@@ -67,6 +109,11 @@ def create_ssml_interface():
            enable_de_noise,
            eos_input,
            spliter_thr_input,
+           pitch_input,
+           speed_input,
+           volume_up_input,
+           enable_loudness_normalization,
+           headroom_input,
        ],
        outputs=ssml_output,
    )
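The five new Adjuster inputs feed the AdjustConfig built in webui_utils.py below: pitch and speed go into the prosody step, volume gain is applied in dB, and loudness normalization can be enabled with a headroom in dB. A rough sketch of those semantics in pydub terms, assuming the prosody helper from modules/utils/audio.py accepts a `rate` keyword alongside the confirmed `pitch`:

from pydub import AudioSegment, effects

from modules.utils.audio import apply_prosody_to_audio_segment

def adjust(seg: AudioSegment, pitch=0.0, speed_rate=1.0, volume_gain_db=0.0,
           normalize=True, headroom=1.0) -> AudioSegment:
    # time-stretch and pitch-shift first (keyword `rate` is an assumption)
    seg = apply_prosody_to_audio_segment(seg, rate=speed_rate, pitch=pitch)
    seg = seg.apply_gain(volume_gain_db)  # pydub gain is expressed in dB
    if normalize:
        # scale peaks to sit `headroom` dB below full scale
        seg = effects.normalize(seg, headroom=headroom)
    return seg

The same block is duplicated in tts_tab.py below, so both tabs expose identical adjustment controls.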
modules/webui/tts_tab.py
CHANGED
@@ -228,14 +228,56 @@ def create_tts_interface():
                label="prompt_audio", visible=webui_config.experimental
            )
 
+        with gr.Group():
+            gr.Markdown("🎛️Adjuster")
+            # adjust speed / pitch / volume
+            # loudness normalization can optionally be enabled
+
+            speed_input = gr.Slider(
+                label="Speed",
+                value=1.0,
+                minimum=0.5,
+                maximum=2.0,
+                step=0.1,
+            )
+            pitch_input = gr.Slider(
+                label="Pitch",
+                value=0,
+                minimum=-12,
+                maximum=12,
+                step=0.1,
+            )
+            volume_up_input = gr.Slider(
+                label="Volume Gain",
+                value=0,
+                minimum=-12,
+                maximum=12,
+                step=0.1,
+            )
+
+            enable_loudness_normalization = gr.Checkbox(
+                value=True,
+                label="Enable Loudness EQ",
+            )
+            headroom_input = gr.Slider(
+                label="Headroom",
+                value=1,
+                minimum=0,
+                maximum=12,
+                step=0.1,
+            )
+
        with gr.Group():
            gr.Markdown("🔊Generate")
            disable_normalize_input = gr.Checkbox(
-                value=False,
+                value=False,
+                label="Disable Normalize",
+                # no longer needed
+                visible=False,
            )
 
        with gr.Group():
-            gr.Markdown("💪🏼Enhance")
+            # gr.Markdown("💪🏼Enhance")
            enable_enhance = gr.Checkbox(value=True, label="Enable Enhance")
            enable_de_noise = gr.Checkbox(value=False, label="Enable De-noise")
            tts_button = gr.Button(
@@ -271,6 +313,11 @@ def create_tts_interface():
            spk_file_upload,
            spliter_thr_input,
            eos_input,
+           pitch_input,
+           speed_input,
+           volume_up_input,
+           enable_loudness_normalization,
+           headroom_input,
        ],
        outputs=tts_output,
    )
modules/webui/webui_utils.py
CHANGED
@@ -6,6 +6,11 @@ import torch
 import torch.profiler
 
 from modules import refiner
+from modules.api.impl.handler.SSMLHandler import SSMLHandler
+from modules.api.impl.handler.TTSHandler import TTSHandler
+from modules.api.impl.model.audio_model import AdjustConfig
+from modules.api.impl.model.chattts_model import ChatTTSConfig, InferConfig
+from modules.api.impl.model.enhancer_model import EnhancerConfig
 from modules.api.utils import calc_spk_style
 from modules.data import styles_mgr
 from modules.Enhancer.ResembleEnhance import apply_audio_enhance as _apply_audio_enhance
@@ -13,8 +18,6 @@ from modules.normalization import text_normalize
 from modules.SentenceSplitter import SentenceSplitter
 from modules.speaker import Speaker, speaker_mgr
 from modules.ssml_parser.SSMLParser import SSMLBreak, SSMLSegment, create_ssml_parser
-from modules.synthesize_audio import synthesize_audio
-from modules.SynthesizeSegments import SynthesizeSegments, combine_audio_segments
 from modules.utils import audio
 from modules.utils.hf import spaces
 from modules.webui import webui_config
@@ -89,6 +92,11 @@ def synthesize_ssml(
     enable_denoise=False,
     eos: str = "[uv_break]",
     spliter_thr: int = 100,
+    pitch: float = 0,
+    speed_rate: float = 1,
+    volume_gain_db: float = 0,
+    normalize: bool = True,
+    headroom: float = 1,
     progress=gr.Progress(track_tqdm=True),
 ):
     try:
@@ -99,7 +107,7 @@
         ssml = ssml.strip()
 
         if ssml == "":
-
+            raise gr.Error("SSML is empty, please input some SSML")
 
         parser = create_ssml_parser()
         segments = parser.parse(ssml)
@@ -107,22 +115,36 @@
         segments = segments_length_limit(segments, max_len)
 
         if len(segments) == 0:
-
+            raise gr.Error("No valid segments in SSML")
 
-
-            batch_size=batch_size,
+        infer_config = InferConfig(
+            batch_size=batch_size,
+            spliter_threshold=spliter_thr,
+            eos=eos,
+            # NOTE: SSML does not support `infer_seed` control
+            # seed=42,
+        )
+        adjust_config = AdjustConfig(
+            pitch=pitch,
+            speed_rate=speed_rate,
+            volume_gain_db=volume_gain_db,
+            normalize=normalize,
+            headroom=headroom,
+        )
+        enhancer_config = EnhancerConfig(
+            enabled=enable_denoise or enable_enhance or False,
+            lambd=0.9 if enable_denoise else 0.1,
         )
-
-
-
-
-
-
-            sr,
-            enable_denoise,
-            enable_enhance,
+
+        handler = SSMLHandler(
+            ssml_content=ssml,
+            infer_config=infer_config,
+            adjust_config=adjust_config,
+            enhancer_config=enhancer_config,
         )
 
+        audio_data, sr = handler.enqueue()
+
         # NOTE: this is required, otherwise gradio cannot decode the output as mp3
         audio_data = audio.audio_to_int16(audio_data)
 
@@ -150,6 +172,11 @@ def tts_generate(
     spk_file=None,
     spliter_thr: int = 100,
     eos: str = "[uv_break]",
+    pitch: float = 0,
+    speed_rate: float = 1,
+    volume_gain_db: float = 0,
+    normalize: bool = True,
+    headroom: float = 1,
     progress=gr.Progress(track_tqdm=True),
 ):
     try:
@@ -161,10 +188,10 @@
         text = text.strip()[0:max_len]
 
         if text == "":
-
+            raise gr.Error("Text is empty, please input some text")
 
         if style == "*auto":
-            style =
+            style = ""
 
         if isinstance(top_k, float):
             top_k = int(top_k)
@@ -181,31 +208,56 @@
         infer_seed = np.clip(infer_seed, -1, 2**32 - 1, out=None, dtype=np.float64)
         infer_seed = int(infer_seed)
 
-        if
-
+        if isinstance(spk, int):
+            spk = Speaker.from_seed(spk)
 
         if spk_file:
-
+            try:
+                spk: Speaker = Speaker.from_file(spk_file)
+            except Exception:
+                raise gr.Error("Failed to load speaker file")
 
-
-
+        if not isinstance(spk.emb, torch.Tensor):
+            raise gr.Error("Speaker file is not supported")
+
+        tts_config = ChatTTSConfig(
+            style=style,
             temperature=temperature,
-
-
-
-            infer_seed=infer_seed,
-            use_decoder=use_decoder,
+            top_k=top_k,
+            top_p=top_p,
+            prefix=prefix,
             prompt1=prompt1,
             prompt2=prompt2,
-
+        )
+        infer_config = InferConfig(
            batch_size=batch_size,
-            end_of_sentence=eos,
            spliter_threshold=spliter_thr,
+            eos=eos,
+            seed=infer_seed,
+        )
+        adjust_config = AdjustConfig(
+            pitch=pitch,
+            speed_rate=speed_rate,
+            volume_gain_db=volume_gain_db,
+            normalize=normalize,
+            headroom=headroom,
+        )
+        enhancer_config = EnhancerConfig(
+            enabled=enable_denoise or enable_enhance or False,
+            lambd=0.9 if enable_denoise else 0.1,
        )
 
-
-
+        handler = TTSHandler(
+            text_content=text,
+            spk=spk,
+            tts_config=tts_config,
+            infer_config=infer_config,
+            adjust_config=adjust_config,
+            enhancer_config=enhancer_config,
        )
+
+        audio_data, sample_rate = handler.enqueue()
+
        # NOTE: this is required, otherwise gradio cannot decode the output as mp3
        audio_data = audio.audio_to_int16(audio_data)
        return sample_rate, audio_data
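Since the webui now builds the same config objects as the HTTP handlers, the pipeline can also be driven from a plain script; a minimal sketch using only names visible in this diff (field values and omitted defaults are illustrative, not confirmed):

from modules.api.impl.handler.TTSHandler import TTSHandler
from modules.api.impl.model.audio_model import AdjustConfig
from modules.api.impl.model.chattts_model import ChatTTSConfig, InferConfig
from modules.api.impl.model.enhancer_model import EnhancerConfig
from modules.speaker import Speaker

handler = TTSHandler(
    text_content="Hello from the unified handler path",
    spk=Speaker.from_seed(42),
    tts_config=ChatTTSConfig(style="", temperature=0.3, top_k=20, top_p=0.7, prefix=""),
    infer_config=InferConfig(batch_size=4, spliter_threshold=100, eos="[uv_break]", seed=42),
    adjust_config=AdjustConfig(pitch=0, speed_rate=1.0, volume_gain_db=0, normalize=True, headroom=1),
    enhancer_config=EnhancerConfig(enabled=False, lambd=0.1),
)
audio_data, sample_rate = handler.enqueue()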
requirements.txt
CHANGED
@@ -1,27 +1,29 @@
-numpy
+numpy==1.26.4
 scipy
 lxml
 pydub
 fastapi
 soundfile
-pyrubberband
 omegaconf
 pypinyin
+vocos
 pandas
 vector_quantize_pytorch
 einops
+transformers~=4.41.1
 omegaconf~=2.3.0
 tqdm
-
-
-
-torch
-torchvision
-torchaudio
+# torch
+# torchvision
+# torchaudio
 gradio
 emojiswitch
 python-dotenv
 zhon
 mistune==3.0.2
 cn2an
-
+# audio_denoiser
+python-box
+ftfy
+librosa
+pyrubberband
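Note that pyrubberband (dropped near the top and re-added at the bottom) is only a thin wrapper around the `rubberband` command-line tool, which must be installed separately through the system package manager; pip alone is not enough. torch/torchvision/torchaudio are commented out, presumably so the appropriate CUDA or CPU build can be installed explicitly rather than resolved by pip.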
webui.py
CHANGED
@@ -6,6 +6,7 @@ from modules.ffmpeg_env import setup_ffmpeg_path
 
 try:
     setup_ffmpeg_path()
+    # NOTE: loggers are initialized at module import time, so this config must come first
     logging.basicConfig(
         level=os.getenv("LOG_LEVEL", "INFO"),
         format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
@@ -16,20 +17,18 @@ except BaseException:
 import argparse
 
 from modules import config
-from modules.api.api_setup import (
-    process_api_args,
-    process_model_args,
-    setup_api_args,
-    setup_model_args,
-)
+from modules.api.api_setup import process_api_args, setup_api_args
 from modules.api.app_config import app_description, app_title, app_version
 from modules.gradio_dcls_fix import dcls_patch
+from modules.models_setup import process_model_args, setup_model_args
 from modules.utils.env import get_and_update_env
+from modules.utils.ignore_warn import ignore_useless_warnings
 from modules.utils.torch_opt import configure_torch_optimizations
 from modules.webui import webui_config
 from modules.webui.app import create_interface, webui_init
 
 dcls_patch()
+ignore_useless_warnings()
 
 
 def setup_webui_args(parser: argparse.ArgumentParser):
|