
Design and Implementation of Embedded TTS Chinese Speech System

Source:mgfer
Category:Embedded System/ARM Technology
2023-04-29 11:31:46
Language is the primary means by which people exchange information, and making computers, interactive appliances, instruments and similar devices speak like human beings has been a research goal for many years. Text To Speech (TTS) is a technology that automatically converts input text into speech output, aiming to make that output efficient and natural. A TTS system must solve two main problems. The first is text analysis, i.e. linguistic analysis: converting the input text strings into an internal linguistic representation. The second is speech synthesis: generating speech from that internally represented linguistic information. Speech synthesis methods in TTS systems fall into two categories, time-domain and frequency-domain. Frequency-domain methods mainly include LPC parameter synthesis and formant synthesis; their essence is to implement a speech production model in engineering terms, simulating the vocal organs through their terminal characteristics. At present, speech produced by frequency-domain methods sounds unnatural and requires heavy computation, so these methods are not suitable for low-end embedded chips. Waveform editing, the main time-domain method, splices short digital audio segments (synthesis primitives) and smooths the joins to produce a continuous speech stream. It occupies a large amount of storage space but needs little computation, runs fast, and yields highly natural synthesized speech, which makes it well suited to embedded systems with limited processing power.

Thanks to continuous improvements in waveform-modification algorithms and the growing capability of microprocessors and non-volatile storage media, embedded TTS systems based on waveform editing have attracted more and more attention for their low cost, good performance and high naturalness. The system described here uses time-domain waveform editing technology. It collects the pronunciations of all characters in the GB2312 Chinese character encoding set as raw material, compresses them with an improved run-length encoding algorithm to generate a speech library sized for current Flash memory, and uses a multiple-lookup-table design together with pre-stored command words to speed up addressing into the speech library. The TTS system has been successfully implemented on an Atmel AT89S52 microcontroller, and the test results are satisfactory. The system is simple to use, small, low-power, and equipped with a universal serial interface, so it can be widely applied in Chinese speech application systems.
1 System Principle

Figure 1 shows the system block diagram and the main operation flow. The system interacts with the outside world over a serial port, so any device with a standard serial port can be connected to it. To pronounce a Chinese character, its national standard code (GB code) is sent to the MCU through the serial port. The MCU maps the code to the address of the corresponding entry in the Flash memory address table, then obtains the command word stored in that entry. Using this command word, the MCU reads the character's voice data from Flash continuously, decodes it with the run-length decoding algorithm, and plays it through D/A conversion and power amplification at the fixed speech sampling rate, which in this design is 11025 B/s. To meet the application requirements, a speech library that is easy to decode quickly is built first; according to the storage format of the specific Flash memory, it is organized and stored using fast multi-table addressing and pre-stored command words, so as to satisfy the real-time requirements of voice playback. Likewise, the MCU code gives priority to speed, at some expense of modularity and readability. For practical use, sufficient input buffering is added to the system so that multiple characters or whole sentences can be entered while playback continues.

2 Collection and processing of original voice data

This system collects 1335 pronunciations: 1306 Chinese syllable pronunciations, the 26 English letter pronunciations, and 3 pauses. The A/D converter of the voice acquisition card samples at 11025 B/s with 8-bit resolution, a sample range of 0-255, and a silence value of 80H. The original voice data is saved on the PC in WAV file format.

Figure 2 shows the time-domain waveform of a sample pronunciation. Apart from differences in waveform envelope, all samples share the same structure: a complete Chinese pronunciation consists of a leading silent segment, the voiced middle segment, and a trailing silent segment. The sampled silence values are mostly 80H (small disturbances can be treated as noise from the recording process, though the trailing segment has to be handled separately), so they can be unified to 80H to improve the compression ratio. As can also be seen from Figure 2, the extreme edge values such as 00H, 01H, FEH and FFH occur with very small probability; this feature can likewise be exploited by the speech compression algorithm.

Based on the distribution of silence and edge values described above, this paper presents an improved run-length encoding for speech data compression. The byte 00H serves as the run-length start code, followed by the encoded byte value and then a count byte giving the run length; for example, the sequence 80 80 80 80 80 can be represented as 00 80 05. Clearly, runs of length 3 or less are not worth encoding, so count bytes with values 00H, 01H and 02H never occur. Moreover, as noted above, the edge values 00H and 01H hardly appear at all in the original voice files, since a large number of them would indicate a wrongly set dynamic range in the voice acquisition system. Nevertheless, to guarantee that no "stray" edge values remain, the voice files are slightly preprocessed: any 00H or 01H bytes are changed to 02H, which obviously has no audible effect on playback. After this processing, 00H and 01H are free to serve as special control codes. Figure 3 is the flowchart of the improved run-length compression encoding proposed in this paper. Before encoding, the 1335 original speech samples totaled roughly 15 MB; after compression they occupy 7767112 bytes, a compression ratio of better than 50%, so the voice library fits in a Flash memory of 8M bytes.
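The encoding scheme above can be sketched as a pair of C functions. This is an illustrative implementation, not the original firmware code: it assumes the input has already been preprocessed so that 00H and 01H were mapped to 02H, and it appends the 01H end-of-data code described later in the text.

```c
#include <stddef.h>

/* Improved run-length encoding: 00H marks the start of a run, followed
   by the repeated byte value and a count byte. Runs of 3 bytes or fewer
   are stored literally. Input must already be preprocessed so that no
   00H/01H bytes remain (they were mapped to 02H). A 01H end-of-data
   code is appended, as in the voice library format. */
size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255)
            run++;
        if (run >= 4) {                        /* worth encoding */
            out[o++] = 0x00;                   /* run start code */
            out[o++] = in[i];                  /* repeated byte value */
            out[o++] = (unsigned char)run;     /* run length */
        } else {
            for (size_t k = 0; k < run; k++)   /* short run: store literally */
                out[o++] = in[i];
        }
        i += run;
    }
    out[o++] = 0x01;                           /* end-of-data code */
    return o;
}

/* Inverse transform: expands runs and stops at the 01H end code. */
size_t rle_decode(const unsigned char *in, unsigned char *out)
{
    size_t i = 0, o = 0;
    for (;;) {
        unsigned char b = in[i++];
        if (b == 0x01)                         /* end-of-data */
            break;
        if (b == 0x00) {                       /* run: value byte, count byte */
            unsigned char v = in[i++];
            unsigned char cnt = in[i++];
            for (unsigned char k = 0; k < cnt; k++)
                out[o++] = v;
        } else {
            out[o++] = b;                      /* literal sample */
        }
    }
    return o;
}
```

For instance, five consecutive 80H silence bytes followed by two 42H samples encode to the six bytes 00 80 05 42 42 01, and decoding recovers the original seven bytes.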

3 Storage structure of voice library

This paper takes the 8M × 8-bit NAND Flash memory K9F6408U0B as an example to describe the storage structure of the system's voice library.

The voice library has two basic parts: an address lookup table at the front, followed by the compressed voice data. In the address table, every 4 bytes form one entry. Each character in the GB2312 Chinese character encoding set has a corresponding entry, whose content points to the starting address of the voice data for that character's pronunciation. The GB code character set has 94 sections of 94 characters each, i.e. 94 × 94 = 8836 code points for Chinese characters, English letters and other symbols, of which 7445 are actually assigned; the rest are reserved, and the corresponding table entries are likewise reserved for future expansion. The size of the address table is therefore 94 × 94 × 4 = 35344 bytes. The voice data area stores the 1335 pronunciations, compressed by run-length encoding, with 01H appended to the end of each piece of voice data as the end-of-data control code.

For different Flash memories, the voice library needs some device-specific processing. For the K9F6408U0B, its C area requires special treatment. In this chip, each page has three areas, A, B and C: areas A and B hold 256 bytes each, while area C holds only 16 bytes. Area C is not used in this design, so when building the binary voice library file to be written to Flash, the area must be filled with the blank code (FFH). With C-area filling taken into account, the size calculation for the address table portion of the binary file becomes 512 × 69 + 16 = 35344; that is, the 35344 bytes occupy 69 full pages (areas A and B) plus 16 more bytes, so 69 C areas need to be filled, and the actual size of the address table written to Flash is 35344 + 69 × 16 = 36448 bytes. The voice data area is processed in the same way.
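The padding arithmetic above can be captured in a small helper; this is an illustrative sketch, not code from the original PC-side tool:

```c
/* Size of a data block once written to the K9F6408U0B, where each full
   512-byte page (areas A + B) also gets its 16-byte C area filled with
   FFH. Illustrative helper, not from the original tool. */
unsigned long flash_padded_size(unsigned long nbytes)
{
    unsigned long full_pages = nbytes / 512;  /* complete A+B pages */
    return nbytes + full_pages * 16;          /* one C area per full page */
}
```

For the 35344-byte address table this gives 35344 + 69 × 16 = 36448 bytes, matching the calculation in the text.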

When building the Flash data file on the PC, the address table is placed first, then the compressed voice files are written one by one; the starting address of each file is converted into the command word for the Flash memory operation and written into the corresponding entry of the address table. A 01H end code is appended to each file, and the C areas are filled during writing. After combining the 1335 speech files, the address lookup table, the C-area padding codes and the end-of-file codes, the binary image file for the Flash memory is obtained, 8047776 bytes in size. After it is written, about 333 KB of free space remains in the Flash, which, together with the reserved entries in the address table, can be used to further expand the system's speech library. The storage structure of the voice library is shown in Figure 4.

4 Code word conversion and implementation of efficient MCU code

There are two types of code-word conversion in this design. The first converts a GB code into the starting byte address of the corresponding entry in the speech library's address table, so that after the MCU receives a GB code from the serial port it can locate the entry for the corresponding pronunciation. This conversion follows from the GB2312 standard and the structure of the address table; the algorithm is: entry offset = ((GB code high byte − 161) × 94 + (GB code low byte − 161)) × 4. The second converts such an address into the command words with which the Flash memory is read. This conversion depends on the voice library's storage structure and on the read/write operations and timing of the particular Flash memory used. Since the starting address of each piece of voice data is converted into operation command words by the PC and stored in the corresponding address-table entry when the voice library is generated, most of the calculation and sequencing work is already done while building the binary image file on the PC; this avoids a great deal of run-time computation and thus ensures the real-time performance of voice playback. The method of calculating the command words depends on the specific Flash memory model and is cumbersome; for reasons of space it is not given here, and interested readers may consult the K9F6408U0B datasheet.
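The first conversion is simple enough to show directly. The function below is an illustrative transcription of the formula in the text (161 = A1H is the smallest valid GB2312 byte value, and each table entry is 4 bytes):

```c
/* Byte offset of a character's 4-byte entry in the address table,
   computed from the two bytes of its GB2312 code (both in 0xA1..0xFE). */
unsigned long gb_to_offset(unsigned char hi, unsigned char lo)
{
    return ((unsigned long)(hi - 161) * 94 + (lo - 161)) * 4;
}
```

The first code point A1A1 maps to offset 0, and a code such as B0A1 (section 16, position 1) maps to (15 × 94 + 0) × 4 = 5640.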

The MCU used in this design is an AT89S52 with a 22.1184 MHz crystal. According to the AT89S52 datasheet, the number of instruction cycles available between two successive speech samples is (1/11025)/(12/22.1184 MHz) ≈ 167.2. A timer interrupt is therefore set up with a reload value of 256 − 167 = 89, and the following work is done between interrupts:
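The timing budget can be checked with a short calculation; the function names here are illustrative, not from the firmware:

```c
/* Timing budget: instruction cycles available between two successive
   11025 Hz samples on an 8051-family MCU at 22.1184 MHz, where one
   machine cycle takes 12 clock periods. */
double cycles_per_sample(void)
{
    return (1.0 / 11025.0) / (12.0 / 22118400.0);  /* about 167.2 */
}

/* 8-bit timer reload value so the timer overflows every ~167 cycles. */
int timer_reload(void)
{
    return 256 - 167;  /* = 89, as used in the text */
}
```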

(1) Get the GB code from the buffer and convert it to the address of the corresponding entry in the address table;
(2) Get the storage address (command word) of the voice data from that entry;
(3) Read the corresponding voice data;
(4) Complete the run-length decoding and play the sample.
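The decode-and-play step (4) must do a small, bounded amount of work per interrupt, which suggests a streaming decoder that keeps its run state between calls. The sketch below is illustrative: it reads the compressed stream from a RAM buffer, whereas the real firmware would fetch bytes sequentially from Flash.

```c
#include <stddef.h>

/* Decodes one 8-bit sample per call, keeping run state between calls
   so each timer interrupt does constant work. Here the compressed
   stream is read from a buffer; on the real hardware the byte fetch
   would be a sequential Flash read. Returns -1 at the 01H end code. */
static unsigned char run_val;   /* value of the run being expanded */
static unsigned int  run_cnt;   /* samples of that run still to emit */

int next_sample(const unsigned char *data, size_t *pos)
{
    if (run_cnt) {                       /* still inside a run */
        run_cnt--;
        return run_val;
    }
    unsigned char b = data[(*pos)++];
    if (b == 0x01)
        return -1;                       /* end of this character's data */
    if (b == 0x00) {                     /* run start: value byte, count byte */
        run_val = data[(*pos)++];
        run_cnt = (unsigned int)data[(*pos)++] - 1;  /* emit one now */
        return run_val;
    }
    return b;                            /* literal sample */
}
```

Each returned sample would then be written to the D/A converter before the next timer interrupt fires.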

In addition, the serial port works in interrupt mode, because input characters may arrive during voice playback. The serial baud rate is 9600 bps, and the serial interrupt has higher priority than the timer interrupt. The input buffer in this system can hold more than 60 Chinese characters. All of the operations above must complete within roughly 168 instruction cycles, equivalent to about 84 two-cycle instructions, so code efficiency must come first when writing the firmware, and programming techniques must be applied flexibly.
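A minimal sketch of such an input buffer is a byte ring buffer filled by the serial ISR and drained by the playback path. The size and names here are assumptions for illustration; 128 bytes holds over 60 two-byte GB-coded characters, matching the capacity stated above.

```c
#define BUF_SIZE 128   /* room for 60+ two-byte GB-coded characters */

static unsigned char rbuf[BUF_SIZE];
static volatile unsigned char head, tail;

/* Called from the (higher-priority) serial ISR: store one byte,
   silently dropping input if the buffer is full. */
void buf_put(unsigned char c)
{
    unsigned char next = (unsigned char)((head + 1) % BUF_SIZE);
    if (next != tail) {
        rbuf[head] = c;
        head = next;
    }
}

/* Called from the playback path: fetch one byte, or -1 if empty. */
int buf_get(void)
{
    if (head == tail)
        return -1;
    unsigned char c = rbuf[tail];
    tail = (unsigned char)((tail + 1) % BUF_SIZE);
    return c;
}
```

Because the serial interrupt can preempt the timer interrupt, the single-producer/single-consumer discipline (ISR writes only `head`, reader writes only `tail`) keeps the buffer safe without disabling interrupts.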

5 Conclusion

This paper has presented an implementation scheme for an embedded TTS Chinese speech system. Thanks to the easy-to-decode improved run-length algorithm, the multiple lookup tables, and the pre-storage of Flash memory operation command words, the scheme can be implemented on a low-end hardware platform. Unlike PC-based TTS systems, this AT89S52-based embedded system is small, low-power, low-cost and widely applicable. Testing shows that its speech is clear and coherent; it can pronounce all Chinese characters in the GB code set and the 26 English letters, and can accept input of whole sentences of up to 60 Chinese characters, which is sufficient for most applications. On a more capable MCU or an ARM processor, further algorithms could be added to improve system performance.



Source:Xiang Xueqin