Speechgen2 multi-language TTS configuration

This document describes the per-language configuration of TTS engines in the se_tpb_speechgen2 transformer. It targets Pipeline users who want to refine the configuration by manually editing the internal XML configuration files.

Overview

The Pipeline uses a set of declarative rules to associate TTS voices with language codes. The Narrator automatically selects TTS voices depending on the value of the xml:lang attributes found in the DTBook. If there is no specific rule for a given language, the Narrator falls back to the default system voice.

The TTS-related configuration is part of the se_tpb_speechgen2 transformer, which uses a simple factory/builder to obtain TTS implementations. The builder is configured in an XML file named ttsbuilder.xml in the se_tpb_speechgen2/tts/ directory.

Structure of the ttsbuilder.xml configuration file

The configuration consists of operating-system-specific sections. Within each OS section, language-specific sub-sections each declare the single TTS engine to use for that language:

<ttsbuilder>
	
	<os>
		<property name="os.name" match="[Ww]indows.*" />
		<lang lang="__">
			<tts default="true">...</tts>
		</lang>
		<lang lang="en">
			<tts>...</tts>
		</lang>
		<lang lang="fr">
			<tts>...</tts>
		</lang>
	</os>
	
	<os>
		<property name="os.name" match="[Ll]inux.*" />		
		<lang lang="en">
			<tts default="true">...</tts>
		</lang>
		...
	</os>
	
	<os>
		<property name="os.name" match="Mac OS X" />
		<lang lang="en">
			<tts default="true">...</tts>
		</lang>
		...
	</os>
	
</ttsbuilder>

For each OS, there can be one (and only one) descendant tts element with the attribute default="true", which is used as the fallback. Note that this default TTS can be configured in a "dummy" language section (with a fake language code), as is done for the Windows section in the example above.

Voice selection mechanism

When the Narrator must generate the audio for a DTBook element, it first looks at the value of the xml:lang attribute of the element or of its closest ancestor. It then tries to instantiate a TTS engine based on the configuration in the tts element found in the language section corresponding to the xml:lang value, within the OS section corresponding to the user's OS. For instance, if the document locale is en-US, it picks the best match in this order:

  1. the section with the lang attribute equal to 'en_US'
  2. the section with the lang attribute equal to 'en'
  3. the first section with the lang attribute starting with 'en_'
  4. the section with the default attribute set to 'true'

Note that the configuration uses underscores to separate the language and country codes (en_US rather than en-US), as done by the java.util.Locale#toString() method.
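
As an illustration of the matching order above, here is a hypothetical Windows section. The choice of language sections is an assumption made for this example, the rules are assumed to apply to other en-XX locales in the same way as to en-US, and "first section" is assumed to mean the first in document order. The tts contents are elided as in the other examples.

```xml
<os>
	<property name="os.name" match="[Ww]indows.*" />
	<!-- xml:lang="en-US": exact match (rule 1).
	     xml:lang="en-AU": there is no en_AU or en section, so this section is
	     picked as the first one whose lang starts with en_ (rule 3). -->
	<lang lang="en_US">
		<tts>...</tts>
	</lang>
	<!-- xml:lang="en-GB": exact match (rule 1). -->
	<lang lang="en_GB">
		<tts>...</tts>
	</lang>
	<!-- xml:lang="de": no German section, so the default is used (rule 4). -->
	<lang lang="__">
		<tts default="true">...</tts>
	</lang>
</os>
```

Adding a plain lang="en" section to this sketch would change the outcome for en-AU content, since rule 2 is tried before rule 3.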

This multi-language support can be disabled with the script parameter named "Multi-language support". If this option is disabled, the TTS engine configured in the default section is always used.

Configuration on Windows

On Windows, the actual voice selection is by default delegated to the Microsoft Speech API (SAPI5), which means that only SAPI-compliant TTS engines can be used.

The text sent to the default SAPI TTS adapter is wrapped in a SAPI XML voice tag whose selection criteria are declared in the sapiVoiceSelection parameter of the tts section in ttsbuilder.xml.

For instance, if the TTS configuration contains the following section:

<lang lang="en">
	<tts>
		<param name="class" value="se_tpb_speechgen2.external.win.DefaultSapiTTS"/>
		<param name="sapiVoiceSelection" value="Language=409"/>
		...
	</tts>
</lang>

The text "This is a sentence." is transformed into the following SAPI XML tag before being sent to the TTS:

<voice optional="Language=409">This is a sentence.</voice>

The default ttsbuilder.xml configuration uses Microsoft language codes to select the voice matching a language section, but it is possible to refine the selection with queries such as "Gender=Female;Age!=Child;Language=409". It is even possible to explicitly name the voice to use for a particular language section. For more information on the TTS selection attributes, refer to the Microsoft XML TTS tutorial and to the list of language codes.
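
A refined selection could look like the following sketch, which reuses the query quoted above in an English language section. The other parameters of the tts element are elided, and whether a matching voice exists depends on the SAPI voices installed on the system.

```xml
<lang lang="en">
	<tts>
		<param name="class" value="se_tpb_speechgen2.external.win.DefaultSapiTTS"/>
		<!-- Prefer a female, non-child voice for US English (language code 409) -->
		<param name="sapiVoiceSelection" value="Gender=Female;Age!=Child;Language=409"/>
		...
	</tts>
</lang>
```

Following the wrapping described earlier, the text would then be sent to the TTS inside a voice tag whose optional attribute carries this whole query.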

Configuration on Mac OS X

On Mac OS X, the TTS engine is selected directly by the name of the voice specified in the voice parameter of the TTS section of the language. The parameter accepts a comma-separated list of voice names, and the first name corresponding to a voice existing on the user's system is selected.

For instance, if the TTS configuration contains the following section:

<lang lang="en">
	<tts>
		<param name="class" value="se_tpb_speechgen2.external.MacOS.MacSayTTS"/>
		<param name="voice" value="Alex, Vicky"/>
		...
	</tts>
</lang>

The voice selected to speak English content will be Apple's Alex voice on Mac OS X 10.5 Leopard and later, and Apple's Vicky voice on Mac OS X 10.4 Tiger (where Alex is not available).
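
The same pattern should apply to other languages. The sketch below is a hypothetical French section: the voice names Thomas and Amelie are assumptions, so check them against the voices actually installed (for example with say -v '?') before relying on them.

```xml
<lang lang="fr">
	<tts>
		<param name="class" value="se_tpb_speechgen2.external.MacOS.MacSayTTS"/>
		<!-- Assumed voice names: Thomas is used if installed, otherwise Amelie -->
		<param name="voice" value="Thomas, Amelie"/>
		...
	</tts>
</lang>
```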

Configuration on Linux

On Linux, the default TTS adapter for the eSpeak engine selects the voice from a two-letter language code configured in the eSpeakVoiceFile parameter of the TTS section of the language.

For instance, for English the TTS configuration would be:

<lang lang="en">
	<tts default="true">
		<param name="class" value="se_tpb_speechgen2.external.linux.ESpeakTTS"/>
		<param name="eSpeakVoiceFile" value="en"/>
		...
	</tts>
</lang>
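
An additional, non-default language section could be sketched the same way. The French section below is an assumption for illustration: it presumes that the installed eSpeak provides a voice file named fr, which can be checked with espeak --voices.

```xml
<lang lang="fr">
	<tts>
		<param name="class" value="se_tpb_speechgen2.external.linux.ESpeakTTS"/>
		<!-- Two-letter eSpeak voice code, assumed to be available on the system -->
		<param name="eSpeakVoiceFile" value="fr"/>
		...
	</tts>
</lang>
```

Because this section does not carry default="true", it would only be used for content whose xml:lang matches fr, leaving the English section as the fallback.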