git clone https://code.lsong.org/WhisperSpeech
Commit: 0b8912c808814f22dc3ffece479de3cb28b7a2f3
Author: Jakub Piotr Cłapa <[email protected]>
Date: 2023-04-13 14:06:52 +0200
Diffstat:
README.md | 14 +++++++++++++-
Added a new end-to-end TTS sample

Models and code coming later today
diff --git a/README.md b/README.md
index 04b5c05c8b9a75feb6ad1673161b204509290943..00f3825d7e4c5a691f30bd7661d1d997a3d2757d 100644
--- a/README.md
+++ b/README.md
@@ -13,8 +13,20 @@ we want to target multiple languages (Whisper and EnCodec are both multilanguage).
 
 ## Progress updates
 
-[![](https://dcbadge.vercel.app/api/server/FANw4rHD5E)](https://discord.gg/FANw4rHD5E)
+**UPDATE 2023-04-13**: We have trained a preliminary T->S model and a new 3kbps S->A model which improves the speech quality. Both models are far from perfect yet but we are clearly moving in the right direction (to the moon 🚀🌖!).
+
+End-to-end TTS model with ≈ 6% WER (both T->S and S->A sampled with simple multinomial sampling at T = 0.7, no beam search) see https://github.com/collabora/spear-tts-pytorch/issues/9 for more details:
+
+[Whisper](https://github.com/openai/whisper) encoder to generate semantic tokens and [EnCodec](https://github.com/facebookresearch/encodec) for acoustic modeling. [![](https://dcbadge.vercel.app/api/server/FANw4rHD5E)](https://discord.gg/FANw4rHD5E)
+
+https://user-images.githubusercontent.com/107984/231753132-e87bc3e2-3b22-42c0-a7fc-ef525eff4e06.mp4
+
+Ground truth:
+
+https://user-images.githubusercontent.com/107984/231753161-35179c15-0d63-4149-8e0b-8d3578ad2617.mp4
+
+**UPDATE 2023-04-03**: We have trained a working S->A model. It does not sound amazing but that is mostly because of EnCodec quality at 1.5kbps. Validation set ground truth (don't forget to unmute):
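The diff above mentions that both the T->S and S->A models were sampled "with simple multinomial sampling at T = 0.7, no beam search". As a rough illustration only (not the project's actual code), this is what temperature-scaled multinomial sampling over an autoregressive model's output logits typically looks like in PyTorch; the function name and the vocabulary size are made up for the sketch:

```python
import torch

def sample_multinomial(logits: torch.Tensor, temperature: float = 0.7) -> torch.Tensor:
    """Pick one token id per batch row by sampling, not argmax or beam search.

    `logits` is a (batch, vocab) tensor of unnormalized scores for the next
    token; dividing by the temperature before softmax flattens (T > 1) or
    sharpens (T < 1) the distribution. T = 0.7 matches the value quoted
    in the update above.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # shape (batch, 1)

# Usage sketch with random logits standing in for a model's output.
logits = torch.randn(2, 1024)  # hypothetical batch of 2, vocab of 1024
tokens = sample_multinomial(logits)
print(tokens.shape)  # torch.Size([2, 1])
```

In a full decode loop this would be called once per step, feeding each sampled token back into the model, which is what "no beam search" implies: a single greedy-stochastic trajectory rather than a pruned search over many.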