MaryTTS - Short Review

Product Overview: MaryTTS

Introduction

MaryTTS (Modular Architecture for Research on speech sYnthesis) is an open-source, multilingual text-to-speech (TTS) synthesis platform written in Java. It began as a collaboration between DFKI’s Language Technology Lab and the Institute of Phonetics at Saarland University, and is now maintained by the Multimodal Speech Processing Group in the Cluster of Excellence MMCI and DFKI.

What MaryTTS Does

MaryTTS converts written text into speech in a range of languages. It is designed to be flexible and extensible, which makes it useful to both developers and speech synthesis researchers.

Key Features and Functionality

Multilingual Support

MaryTTS supports German, British and American English, French, Italian, Luxembourgish, Russian, Swedish, Telugu, and Turkish. Additional languages are in preparation.

Modular Architecture

The platform splits the synthesis process into separate processing modules, a design intended to support research on speech synthesis. New components and features can be added or swapped in at the module level.

Voice Building Capabilities

MaryTTS comes with toolkits for quickly adding support for new languages and for building unit selection and Hidden Markov Model (HMM)-based synthesis voices, so users can extend the platform with their own voices and languages.

Client-Server System

MaryTTS operates as a client-server system. It consists of a main server or “manager” program, several processing modules, and a client that sends input data and receives the processing results. The server handles text-to-speech synthesis and related tasks on behalf of the client.
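To make the client side concrete, here is a minimal sketch of a client that talks to a running MaryTTS server over HTTP. The host, the port 59125, the /process endpoint, and the parameter names follow the HTTP interface of MaryTTS 5.x as it is commonly documented, but they are assumptions here and should be checked against the installed version. The function names are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed defaults: a MaryTTS server running on this machine on port 59125.
MARY_HOST = "localhost"
MARY_PORT = 59125


def build_process_url(text, locale="en_US", host=MARY_HOST, port=MARY_PORT):
    """Build a /process request URL asking the server for WAV audio."""
    params = {
        "INPUT_TEXT": text,      # the text to synthesize
        "INPUT_TYPE": "TEXT",    # plain text input
        "OUTPUT_TYPE": "AUDIO",  # ask for synthesized audio
        "AUDIO": "WAVE_FILE",    # audio container format
        "LOCALE": locale,        # language/region of the input text
    }
    return f"http://{host}:{port}/process?{urlencode(params)}"


def synthesize(text, out_path, locale="en_US"):
    """Send the text to the server and save the returned audio bytes."""
    with urlopen(build_process_url(text, locale)) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With a server running, synthesize("Hello world", "hello.wav") would write the returned audio to hello.wav. Only build_process_url works without a server.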

Natural Language Processing (NLP)

The system performs comprehensive NLP tasks, including:

Text Normalization: Tokenization, abbreviation expansion, and numeral expansion.
Part-of-Speech Labelling and Shallow Parsing: Identifying the parts of speech and performing chunking.
Lexicon Lookup and Phonemisation: Using a pronunciation lexicon, with grapheme-to-phoneme rules for unknown tokens.
Postlexical Phonological Rules: Applying rules to modify phone symbols and intonation symbols based on context.
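The text normalization step in the list above can be illustrated with a toy example. This is not MaryTTS code: the lookup tables, the function names, and the limit of 0 to 99 for numerals are all simplifications chosen for this sketch, and a real system handles far more cases.

```python
import re

# Hypothetical, tiny lookup tables for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "approx.": "approximately"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def expand_number(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def tokenize(text):
    """Split into known abbreviations, digit runs, words, and punctuation."""
    abbrev = "|".join(re.escape(a) for a in ABBREVIATIONS)
    return re.findall(rf"{abbrev}|\d+|\w+|[^\w\s]", text)


def normalize(text):
    """Tokenize, then expand abbreviations and small numerals."""
    out = []
    for token in tokenize(text):
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif token.isdecimal() and int(token) < 100:
            out.append(expand_number(int(token)))
        else:
            out.append(token)  # leave everything else unchanged
    return out
```

For example, normalize("Dr. Smith lives at 42 Main St.") returns the tokens Doctor, Smith, lives, at, forty-two, Main, Street. The later stages (part-of-speech labelling, phonemisation, postlexical rules) would then operate on this expanded token sequence.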

Acoustic Parameter Generation

The output of the NLP component is translated into an acoustic parameter file using models for duration (Klatt rules adapted for German) and intonation (a ToBI-based approach). This step assigns durations and fundamental frequency targets to the phone symbols, which waveform synthesizers such as MBROLA then use to generate audio.
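As an illustration of what such an acoustic parameter file can look like, the sketch below renders phones in the line format that MBROLA reads: a phone symbol, a duration in milliseconds, then optional pairs of position (percent of the phone's duration) and F0 target (Hz). The Phone class, the function names, and all numeric values are invented for this example and do not come from MaryTTS's duration or intonation models.

```python
from dataclasses import dataclass, field


@dataclass
class Phone:
    symbol: str        # phone symbol, e.g. a SAMPA symbol such as "a"
    duration_ms: int   # duration assigned by the duration model
    # (position in percent of the phone's duration, F0 in Hz) pairs
    f0_targets: list = field(default_factory=list)


def to_pho_line(phone):
    """Render one phone as: symbol, duration, then F0 target pairs."""
    parts = [phone.symbol, str(phone.duration_ms)]
    for position_pct, f0_hz in phone.f0_targets:
        parts += [str(position_pct), str(f0_hz)]
    return " ".join(parts)


def to_pho(phones):
    """Render a phone sequence, one phone per line."""
    return "\n".join(to_pho_line(p) for p in phones) + "\n"


# Illustrative values only: silence, then two phones with F0 targets.
example = [
    Phone("_", 100),
    Phone("h", 60),
    Phone("a", 120, [(0, 110), (100, 130)]),
]
```

Here to_pho(example) yields three lines, the last of which is "a 120 0 110 100 130": the phone lasts 120 ms, with F0 rising from 110 Hz at its start to 130 Hz at its end.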

Automation and Integration

MaryTTS can be integrated into automation systems such as the Freedomotic platform, where it can be used in programmable automations (e.g., “IF this THEN say THAT”) and triggered by specific events. In that setup, predefined commands can announce the current time, temperature, or device status.
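The “IF this THEN say THAT” idea can be sketched as a small rule table that maps events to sentences. This is a hypothetical illustration and not the Freedomotic API: the SpeechRule class, the event names, and the payload fields are all invented here, and the resulting sentences would still have to be handed to a TTS engine to be spoken.

```python
from dataclasses import dataclass


@dataclass
class SpeechRule:
    event_type: str  # "IF this": name of the event to match
    template: str    # "THEN say THAT": sentence with {placeholders}


def phrases_for(event_type, payload, rules):
    """Return the sentences to speak for one incoming event."""
    return [
        rule.template.format(**payload)
        for rule in rules
        if rule.event_type == event_type
    ]


# Hypothetical rules for a home automation setup.
RULES = [
    SpeechRule("temperature_changed", "The temperature is {value} degrees"),
    SpeechRule("device_status", "The {device} is now {state}"),
]
```

For example, phrases_for("device_status", {"device": "kitchen light", "state": "on"}, RULES) returns a single sentence, "The kitchen light is now on". An event with no matching rule produces an empty list.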

Conclusion

MaryTTS is an open-source text-to-speech platform suited to both research and practical applications. Its main strengths are its multilingual support, modular architecture, and voice building toolkits.