Text-to-Speech

09 Sep 2017

Hi, I've just spent a few weeks taking apart a number of text-to-speech apps. Generally there are the phoneme-based options (not great but small) and the sample-based ones (great but not small). I've revisited the Speak and Spell, ESS & LPC10 patents and come to the conclusion that compressing to a given quality or a given size is likely to be the most usable approach.

I'm interested to know if anyone would see utility in such a system. I'm currently using 16-bit, 16 kHz mono samples as the source data. If I use 3 scaled & offset sine waves I can remove >95% of the energy in voiced speech. The residue can be stored as 2-bit ADPCM (using an escape code for the otherwise unused 10b symbol). Sibilants can simply be rendered by stepping through the sine function very quickly, so that it produces pseudo-random data.

Rather than assigning fixed values to the sine waves, using 5 ms blocks that each give a vector of parameters to the 3 waves, smoothed with a second-order filter, results in a close match to human speech. I don't know how much demand there is for speech and I don't want to waste months producing an extension to MBed that nobody uses, but if people can find a use for it then I will finish it. If you are interested then please tell me what you want.

With thanks, Sean

10 Sep 2017

If smaller samples would help you, then 16-bit, 16 kHz is extravagant. The range of frequencies used in human speech is much more restricted than in, say, music. I would start with 8-bit, 8 kHz samples and see how that goes. You can probably reduce the sampling frequency even further without losing necessary information.

10 Sep 2017

Chris Burrows

That is the input format, not how the output is stored. The human voice goes up to 8 kHz and you need to sample at twice the highest frequency you want to record (the Nyquist rate). The speech is decomposed into 3 sine waves with frequency and amplitude vectors, with the residue stored as ADPCM or as pseudo-random data with a vector. The residue is only stored when the SNR is <20 dB; I'm using a psychoacoustic model to get the SNR value.

You can get good quality speech that works out at <1 bit per sample and only uses about 1.5 MIPS on a Cortex-M0+. It's the 'middle way'. General audio codecs such as MP3 are complex, as are LPC10 & MELP, but since all of the hard work is performed on the encode side, decoding is possible with very little CPU bandwidth.

I'm obviously going to design an open format carefully. Reference encoder & decoder code in C will be the basic software, although I'm actually writing the decoder in assembly language. I've cut out the BBC Micro:Bit's built-in sample playback so I can output directly to a 10-bit DAC, because I don't see any cheaper or simpler CPU running MBed.

10 Sep 2017

8 kHz might be a theoretical maximum for a soprano, but you are only concerned with average speech, not singing, aren't you? The smaller your input samples, the less processing you will have to do.