New method for high-speed synthesis of natural voices

Neural source-filter model uses neural networks to update classical speech-synthesis methods

Date:
February 5, 2019
Source:
Research Organization of Information and Systems
Summary:
A research team has developed neural source-filter (NSF) models, a method for high-speed, high-quality voice synthesis. The technique combines recent deep-learning algorithms with a classical speech-production model dating back to the 1960s. It can generate high-quality voice waveforms that closely resemble the human voice, and its neural networks can be trained stably.

A research team in the Digital Content and Media Sciences Research Division of the National Institute of Informatics (NII; Director General: Dr. Masaru Kitsuregawa; Chiyoda-ku, Tokyo, Japan) has developed neural source-filter (NSF) models, a method for high-speed, high-quality voice synthesis. The team consists of Researcher by Special Appointment Xin Wang, Assistant Professor by Special Appointment Shinji Takaki, and Associate Professor Junichi Yamagishi. The new technique combines recent deep-learning algorithms with a classical speech-production model dating back to the 1960s. It can generate high-quality voice waveforms that closely resemble the human voice, and its neural networks can be trained stably.

To date, many speech-synthesis systems have adopted the vocoder approach, a method for synthesizing speech waveforms that is widely used in cellular-phone networks and other applications. However, the quality of the speech waveforms synthesized by vocoder-based systems has remained inferior to that of the human voice. In 2016, an influential overseas technology company proposed WaveNet, a speech-synthesis method based on deep-learning algorithms, and demonstrated that it could synthesize high-quality speech waveforms resembling the human voice. One drawback of WaveNet, however, is the extremely complex structure of its neural networks. They demand large quantities of voice data for training, and parameter tuning and other laborious trial-and-error procedures must be repeated many times before accurate predictions can be obtained.

One of the best-known vocoders is the source-filter vocoder, which was developed in the 1960s and remains in widespread use today. The NII research team combined the conventional source-filter vocoder method with modern neural-network algorithms to develop a new technique for synthesizing high-quality speech waveforms resembling the human voice. One advantage of this neural source-filter (NSF) method is the simple structure of its neural networks, which can be trained on only about 1 hour of voice data and produce accurate predictions without extensive parameter tuning. Moreover, large-scale listening tests have demonstrated that speech waveforms produced by NSF techniques are comparable in quality to those generated by WaveNet.
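For readers unfamiliar with the classical source-filter idea, the following is a minimal sketch of it in plain Python: a periodic excitation (the "source") is passed through a resonant filter that stands in for the vocal tract. It illustrates only the 1960s-style model, not the NSF networks or the released CURRENNT code. The function names, the single two-pole resonator, and all parameter values (100 Hz pitch, 700 Hz resonance, 16 kHz sampling) are assumptions chosen for illustration.

```python
import math

def pulse_train(f0_hz, sample_rate, num_samples):
    """Voiced excitation (the 'source'): one unit impulse per pitch period."""
    period = int(round(sample_rate / f0_hz))  # samples per pitch period
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole all-pole filter standing in for one vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius (< 1, stable)
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # previous two output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(f0_hz=100.0, formant_hz=700.0, bandwidth_hz=100.0,
               sample_rate=16000, duration_s=0.05):
    """Source-filter synthesis: excitation first, then the filter."""
    num_samples = int(round(sample_rate * duration_s))
    source = pulse_train(f0_hz, sample_rate, num_samples)
    return resonator(source, formant_hz, bandwidth_hz, sample_rate)

waveform = synthesize()
```

In this toy version the filter coefficients are fixed by hand. According to the article, the NSF approach keeps the source-filter structure while using neural networks, so the sketch should be read as background for the classical half of that combination only.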

Because the theoretical basis of NSF differs from the patented technologies used by influential overseas ICT companies, the adoption of NSF techniques is likely to spur new technological advances in speech synthesis. For this reason, the source code implementing the NSF method has been made available to the public at no cost, allowing it to be widely used.

Source code, trained NSF models, and NSF-synthesized speech samples (in both Japanese and English) are available at the following sites:

Source code:

https://github.com/nii-yamagishilab/project-CURRENNT-public

Trained models (may be executed to generate English-language voices):

https://github.com/nii-yamagishilab/project-CURRENNT-scripts

Voice samples (Japanese or English):

https://nii-yamagishilab.github.io/samples-nsf/index.html


Story Source:

Materials provided by Research Organization of Information and Systems. Note: Content may be edited for style and length.


Journal Reference:

  1. Xin Wang, Shinji Takaki, Junichi Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. Submitted to arXiv, 2019.

