Hi,
I've just spent a few weeks taking apart a number of text-to-speech apps. Generally there are two options: phoneme-based (not great but small) and sample-based (great but not small). I've revisited the Speak and Spell, ESS & LPC10 patents and come to the conclusion that compressing to a given quality or to a given size is likely to be the most usable approach.
I'm interested to know if anyone would see utility in such a system. I'm currently using 16-bit, 16 kHz mono samples as the source data. If I use 3 scaled & offset sine waves I can remove >95% of the energy in voiced speech. The residue can be stored as 2-bit ADPCM (using the otherwise unused 10b symbol as an escape code). Sibilants can simply be rendered by stepping through the sine function very quickly, so that it gives pseudo-random data.
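To make the sine-removal idea concrete, here is a minimal sketch rather than the actual encoder. It assumes that "scaled & offset" means an amplitude and a phase per wave, that the three frequencies are already known, and that the frame holds a whole number of cycles of each one. The names fit_sine, remove_sines and demo_frame are made up for illustration, and the test frame is synthetic, not real speech.

```python
import math

SAMPLE_RATE = 16000  # Hz, matching the source data described above


def fit_sine(frame, freq_hz):
    """Least-squares amplitude and phase of one sine at a known frequency.

    Exact only when the frame holds a whole number of cycles of freq_hz.
    """
    n = len(frame)
    w = 2.0 * math.pi * freq_hz / SAMPLE_RATE
    c = sum(x * math.cos(w * i) for i, x in enumerate(frame)) * 2.0 / n
    s = sum(x * math.sin(w * i) for i, x in enumerate(frame)) * 2.0 / n
    return math.hypot(c, s), math.atan2(c, s)


def remove_sines(frame, freqs_hz):
    """Subtract one fitted sine per frequency; return (params, residue)."""
    residue = list(frame)
    params = []
    for f in freqs_hz:
        amp, phase = fit_sine(residue, f)
        w = 2.0 * math.pi * f / SAMPLE_RATE
        residue = [x - amp * math.sin(w * i + phase)
                   for i, x in enumerate(residue)]
        params.append((f, amp, phase))
    return params, residue


def energy(samples):
    return sum(x * x for x in samples)


def demo_frame(n=80):
    """Synthetic 5 ms frame: 200/400/600 Hz tones plus a small 3200 Hz ripple."""
    out = []
    for i in range(n):
        t = 2.0 * math.pi * i / SAMPLE_RATE
        out.append(8000 * math.sin(200 * t + 0.3)
                   + 4000 * math.sin(400 * t + 1.1)
                   + 2000 * math.sin(600 * t - 0.7)
                   + 300 * math.sin(3200 * t))
    return out


frame = demo_frame()
params, residue = remove_sines(frame, [200, 400, 600])
removed = 1.0 - energy(residue) / energy(frame)
print(f"energy removed: {removed:.1%}")
```

On real speech the frequencies would have to be estimated as well, and the fraction removed would depend on the speaker and the frame, so the number printed here says nothing about the >95% figure.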
Rather than assigning fixed values to the sine waves, I use 5 ms blocks, each giving a vector for the 3 waves, together with a second-order filter; this results in a close match to human speech. I don't know how much speech people want, and I don't want to spend months producing an extension for MBed that nobody uses, but if people can find a use for it then I will finish it. If you are interested, please tell me what you want.
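For anyone trying to judge the size side of this, a rough back-of-the-envelope sketch of the per-block budget follows. The sample rate, 16-bit source, 5 ms block and 2-bit residue come from the figures above. The 24 bits per wave for the parameter vector is purely a placeholder assumption, and escape codes are ignored, so treat the result as an order-of-magnitude guide only.

```python
SAMPLE_RATE_HZ = 16000  # from the post
SOURCE_BITS = 16        # from the post
BLOCK_MS = 5            # from the post
RESIDUE_BITS = 2        # 2-bit ADPCM residue, from the post


def block_budget(param_bits_per_wave=24, waves=3):
    """Bits per second for the raw source and for a hypothetical coded stream.

    param_bits_per_wave is an assumed value, not a figure from the post.
    """
    samples_per_block = SAMPLE_RATE_HZ * BLOCK_MS // 1000
    blocks_per_second = 1000 // BLOCK_MS
    source_bits = samples_per_block * SOURCE_BITS
    coded_bits = (samples_per_block * RESIDUE_BITS
                  + waves * param_bits_per_wave)
    return {
        "samples_per_block": samples_per_block,
        "source_bps": source_bits * blocks_per_second,
        "coded_bps": coded_bits * blocks_per_second,
    }


budget = block_budget()
print(budget)
```

Under these assumptions the residue dominates the coded stream, so the bits spent on the residue matter far more than the exact size of the per-block vector.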
With thanks,
Sean