git clone https://code.lsong.org/WhisperSpeech

Commit

0b8912c808814f22dc3ffece479de3cb28b7a2f3
Author
Jakub Piotr Cłapa <[email protected]>
Date
2023-04-13 14:06:52 +0200
Diffstat
 README.md | 14 +++++++++++++-

Added a new end-to-end TTS sample

Models and code coming later today


diff --git a/README.md b/README.md
index 04b5c05c8b9a75feb6ad1673161b204509290943..00f3825d7e4c5a691f30bd7661d1d997a3d2757d 100644
--- a/README.md
+++ b/README.md
@@ -13,8 +13,20 @@ we want to target multiple languages (Whisper and EnCodec are both multilanguage).
 
 ## Progress updates
 
-[![](https://dcbadge.vercel.app/api/server/FANw4rHD5E)](https://discord.gg/FANw4rHD5E)  
+**UPDATE 2023-04-13**: We have trained a preliminary T->S model and a new 3kbps S->A model that improves the speech quality. Both models are still far from perfect, but we are clearly moving in the right direction (to the moon 🚀🌖!).
+
+End-to-end TTS model with ≈ 6% WER (both T->S and S->A sampled with simple multinomial sampling at T = 0.7, no beam search); see https://github.com/collabora/spear-tts-pytorch/issues/9 for more details:
+
+We use the [Whisper](https://github.com/openai/whisper) encoder to generate the semantic tokens and [EnCodec](https://github.com/facebookresearch/encodec) for acoustic modeling.
 [![](https://dcbadge.vercel.app/api/server/FANw4rHD5E)](https://discord.gg/FANw4rHD5E)  
+
+https://user-images.githubusercontent.com/107984/231753132-e87bc3e2-3b22-42c0-a7fc-ef525eff4e06.mp4
+
+Ground truth:
+
+https://user-images.githubusercontent.com/107984/231753161-35179c15-0d63-4149-8e0b-8d3578ad2617.mp4
+
+**UPDATE 2023-04-03**: We have trained a working S->A model. It does not sound amazing yet, but that is mostly due to the EnCodec quality at 1.5kbps.
 
 Validation set ground truth (don't forget to unmute):
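The commit notes that both models were sampled with "simple multinomial sampling at T = 0.7, no beam search". The actual sampling code is not part of this commit, but the idea is standard: divide the logits by the temperature, softmax, and draw one token from the resulting distribution. A minimal sketch (the function name `sample_multinomial` and the use of NumPy are assumptions for illustration, not code from the repo):

```python
import numpy as np

def sample_multinomial(logits, temperature=0.7, rng=None):
    """Draw one token index from temperature-scaled logits.

    temperature < 1 sharpens the distribution toward the argmax;
    temperature = 1 samples from the model's raw distribution.
    """
    rng = rng or np.random.default_rng()
    scaled = logits / temperature      # temperature scaling
    scaled = scaled - scaled.max()     # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()               # softmax -> valid probability vector
    return rng.choice(len(probs), p=probs)
```

At each decoding step the model's logit vector would be passed through this function and the sampled token fed back in, with no beam maintained across steps.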