Lina-Style

Companion webpage of Lina-Style: Word-Level Style Control in TTS via Interleaved Data Distillation.

Abstract

We propose a method for word-level style conditioning in text-to-speech (TTS) based on data distillation, enabling emotion and style control with limited supervision. We first train a TTS model on stylistically unlabeled data. Then, using that base model, we synthesize multiple stylistic renditions of the same sentences by cloning expressive samples from a small labeled corpus. Using cross-attention alignments, we interleave segments from different styles to construct synthetic examples with local style variation. To provide independent control of style intensity, we generate samples with classifier-free guidance at varying strengths and condition the model accordingly. This self-distilled parallel dataset allows the model to learn precise and coherent word-level style control. Despite relying solely on synthetic supervision, our approach performs similarly to fine-tuned baselines while offering greater controllability.

{{ sections[0].label }}

Intensity for each style tag is controlled on a 5-point scale, where each level is associated with a learned embedding. The levels correspond to bins of linearly spaced classifier-free guidance (CFG) factors ranging from 0.5 (least intense) to 2.0 (most intense). This enables independent and simplified conditioning, avoiding the doubled generation cost incurred when applying CFG directly at inference time.
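As a minimal sketch of this mapping (the function names and the use of bin centers are our assumptions, not the paper's implementation), each of the 5 intensity levels can be tied to one of 5 linearly spaced CFG factors in [0.5, 2.0], and a CFG factor used during data generation can be assigned back to its nearest level:

```python
import numpy as np

# Hypothetical sketch: 5 intensity levels, each tied to a CFG factor bin.
NUM_LEVELS = 5
CFG_MIN, CFG_MAX = 0.5, 2.0

# One representative CFG factor per level: [0.5, 0.875, 1.25, 1.625, 2.0].
cfg_factors = np.linspace(CFG_MIN, CFG_MAX, NUM_LEVELS)

def level_to_cfg(level: int) -> float:
    """Return the CFG factor associated with a 1-indexed intensity level."""
    assert 1 <= level <= NUM_LEVELS
    return float(cfg_factors[level - 1])

def cfg_to_level(cfg: float) -> int:
    """Bin a CFG factor used during synthesis to its nearest intensity level."""
    return int(np.argmin(np.abs(cfg_factors - cfg))) + 1
```

At training time, synthetic samples generated with a given CFG strength would be labeled via `cfg_to_level`, so that at inference the model only needs the level embedding, with no second (unconditional) forward pass.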

{{ tag }}

{{ seg.text }}

{{ sections[1].label }}

Style conditioning is disentangled from the semantic content of the text prompt.

Style & Intensity

{{ tag }}

{{ seg.text }}

{{ sections[2].label }}

Various examples with mixed styles and intensities.

Style & Intensity

{{ tag }}

{{ seg.text }}