Synthetic Speech - Tools and Processes

Original Author(s): Lynn Leith
The TTS (Text To Speech) tools presented here are not an all-inclusive list of the tools that may be available. Thanks to those members of the DAISY Technical Developments List who contributed to this collection of information. At the time of writing, all links provided are functional.

Information on Loquendo TTS, including technical specifications, is available in a PDF document. There is an interactive TTS demo on the Loquendo home page.

Information and specifications are provided on the AT&T Web site for Natural Voices Text-to-Speech Engine.

One comment about Nuance's RealSpeak was that the quality of its speech is comparable to or better than that of Loquendo. The site includes an interactive demo. Rhetorical, a high quality TTS program, has "become" Nuance. It has been reported that rVoice (Rhetorical) is being discontinued and will be replaced by the next version of RealSpeak. The new version of RealSpeak will include conversion protocols that enable it to replace rVoice with no change to applications which may have been developed specifically for rVoice.

There are samples including multiple voice options for AT&T Natural Voices, NeoSpeech, Cepstral and RealSpeak.

Acapela Group (formerly "Babel") and SVOX are two others that may be worth reviewing.

Free TTS software programs are available; however, some may not compare in speech quality with the commercial systems (which can be very expensive). NaturalReader is one of the free TTS software programs available online. Information and demos are available from NaturalReader.

Three of the DAISY Consortium's member organizations have provided a brief outline of how they have integrated TTS into their production: Vision Australia of ANZAIG (Australia New Zealand Accessible Information Group), RNIB, and CNIB of the Canadian DAISY Consortium.

Vision Australia: TTS Newsletters

Vision Australia uses rVoice (Rhetorical) and is presently integrating it into an online daily newspaper service, which will allow clients to choose between downloading a text-only DAISY version of daily newspapers and a synthetic voice version produced with rVoice. Concerning TTS quality, they reported: "Our testing of various synthetic voice systems, showed that the Rhetorical product (now Nuance) was indeed the highest quality speech engine for English (many other languages are available, but we did not trial these)."

RNIB: Weekly Television Listings

The current four-step process for producing the DAISY TTS TV listings is as follows:

  • The XML files are received from the UK Publishers Association
  • These are converted to the RNIB XML format
  • The RNIB XML files are processed through in-house software (called transforms) which create the DAISY structure and deal with acronyms, "videoplus" numbers and other "awkward" content
  • The file is then processed through another piece of in-house software called WinDiss which creates the audio using rVoice TTS (Rhetorical) and deals with the SMIL synchronisation
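
The handling of acronyms and numeric codes in the third step can be illustrated with a minimal Python sketch. This is not RNIB's in-house software: the acronym table, the function names and the rule for spelling out long digit runs are all assumptions made for illustration only.

```python
import re

# Hypothetical acronym table; a production dictionary would be far larger.
ACRONYMS = {"BBC": "B B C", "ITV": "I T V"}

DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Match any listed acronym as a whole word.
ACRONYM_RE = re.compile(r"\b(" + "|".join(map(re.escape, ACRONYMS)) + r")\b")

def spell_digits(match):
    """Read a numeric code digit by digit instead of as one large number."""
    return " ".join(DIGIT_NAMES[int(d)] for d in match.group(0))

def normalise_for_tts(text):
    """Rewrite 'awkward' content so a TTS engine pronounces it predictably."""
    text = ACRONYM_RE.sub(lambda m: ACRONYMS[m.group(1)], text)
    # Assumption: runs of four or more digits are codes, not quantities.
    return re.sub(r"\b\d{4,}\b", spell_digits, text)

print(normalise_for_tts("BBC News, code 40213"))
```

In a real workflow, rules like these would be applied to the text content of the XML before it is passed to the speech engine.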

The DAISY TV Listings are completed with 25 hours of audio on one CD covering UK terrestrial, cable and satellite channels. This process takes approximately three hours of production time. RNIB is currently reviewing many of the tools used within this process.

CNIB: Indices

The CNIB produces indices with synthetic speech. These are then "joined" to the body of the book, which has been produced with human narration. Because the index is produced from marked up text, the resulting DAISY book offers the user word search options which otherwise would not be possible. Production time required to create the indices with TTS is about one tenth of the time required to narrate them with human voice.

In summary, the process is as follows. EasePublisher is used to create a full text DAISY DTB of the scanned index. Following this, during "pre-production", an in-house developed XSLT is used to insert the Loquendo codes, add punctuation to improve the prosody, and allow use of the custom dictionary, which is under ongoing development. Following the production of the synthetic speech with Loquendo, a post-production process (again using in-house developed tools) removes the Loquendo codes. The DTB is encoded, regenerated and validated.
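
The round trip of inserting control codes before synthesis and removing them afterwards can be sketched in Python. This is only an illustration under stated assumptions: CNIB's actual tools are an XSLT and other in-house software, and the `[[tts:pause]]` marker below is a made-up placeholder, not real Loquendo syntax.

```python
import re

# Hypothetical control marker; NOT actual Loquendo code syntax.
PAUSE = "[[tts:pause]]"
MARKER_RE = re.compile(r"\[\[tts:[a-z]+\]\]")

def add_markup(entry):
    """Pre-production: add a pause after the index term and a final period."""
    term, sep, pages = entry.partition(",")
    marked = term + sep + PAUSE + pages if sep else entry
    if not marked.endswith("."):
        marked += "."  # terminal punctuation tends to improve prosody
    return marked

def strip_markup(text):
    """Post-production: remove every control marker from the text."""
    return MARKER_RE.sub("", text)

marked = add_markup("Braille, 12, 45")
print(marked)
print(strip_markup(marked))
```

Keeping the markers in one easily matched form is what makes the post-production removal a single, safe substitution.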

Once the DTB of the index is complete, it is manually "joined" with the completed, human narrated, book body. The completed DAISY DTB is then regenerated and validated. This process is somewhat labour intensive, and it is hoped that one or more of the DAISY Pipeline transformers will streamline it.

This page was last edited by PVerma on Thursday, August 26, 2010 09:50
Text is available under the terms of the DAISY Consortium Intellectual Property Policy, Licensing, and Working Group Process.