Speech

Speech to text

Both the Windows and Android versions support speech-to-text conversion. For the Windows application, this is done through the managed SAPI libraries provided by Microsoft. The same holds for Android: speech input is provided by the operating system.

In both cases, though, the neural network can be used to improve the input quality. Speech libraries usually return a list of possible input statements, each with a ‘weight’ that indicates how likely it is. When the neural network is properly set up (read: not too many regular variables are used), it should be able to pick out the best match.
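As a rough illustration of the idea, the sketch below re-ranks a recognizer's candidate list by blending each candidate's weight with a score for how well it matches what the network knows. The names `pick_best`, `match_score`, `recognizer_bias` and the toy scoring function are assumptions made for this example; they are not part of the chatbot's actual API, and the real network's matching is certainly more involved.

```python
from typing import Callable, List, Tuple

# (recognized text, recognizer weight between 0 and 1)
Hypothesis = Tuple[str, float]


def pick_best(
    hypotheses: List[Hypothesis],
    match_score: Callable[[str], float],
    recognizer_bias: float = 0.5,
) -> str:
    """Return the candidate text with the highest blended score.

    match_score is a hypothetical stand-in for the neural network:
    it returns a value between 0 and 1 for how well a text matches
    the patterns the network knows.
    """
    if not hypotheses:
        raise ValueError("no hypotheses to choose from")

    def combined(hypothesis: Hypothesis) -> float:
        text, weight = hypothesis
        return recognizer_bias * weight + (1.0 - recognizer_bias) * match_score(text)

    return max(hypotheses, key=combined)[0]


def toy_match_score(text: str) -> float:
    """Toy scorer (assumption): 1.0 for an exactly known statement, else 0.0."""
    known_statements = {"what is your name"}
    return 1.0 if text.lower() in known_statements else 0.0


candidates = [
    ("what is you're name", 0.60),  # top choice of the recognizer
    ("what is your name", 0.55),
    ("watt is your name", 0.30),
]
best = pick_best(candidates, toy_match_score)
```

Here the recognizer's top choice loses to the second candidate, because only the second one matches a statement the (toy) network knows. The blend factor is an arbitrary choice for the sketch.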

At the time of writing, speech input is not yet directly supported in the online version, although some browsers (Chrome) already support speech input and there is an effort under way to make it a web standard.

Text to speech

All versions can also speak the output statements generated by the neural network. On Windows, this is again done through the SAPI libraries. Both the managed and unmanaged versions are supported, so the system can work with most voices currently available on the market.

The Android app and the web interface both use the eSpeak library to generate speech. This is a more limited system, since it only offers synthesized voices, so the output sounds more robotic. Still, eSpeak lets you choose between different voices, and both pitch and speed can be adjusted.
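To make the voice, pitch and speed settings concrete, here is a minimal sketch that builds an argument list for the eSpeak command-line tool, using its `-v` (voice), `-p` (pitch, 0 to 99) and `-s` (speed in words per minute) options. The helper name `espeak_command` is an assumption for this example, and the chatbot itself may well drive eSpeak through a different interface than the command line.

```python
from typing import List


def espeak_command(
    text: str,
    voice: str = "en",
    pitch: int = 50,
    speed: int = 175,
) -> List[str]:
    """Build an eSpeak command line (hypothetical helper, not the app's code).

    voice: eSpeak voice name, passed with -v
    pitch: 0 to 99, passed with -p
    speed: words per minute, passed with -s
    """
    if not 0 <= pitch <= 99:
        raise ValueError("pitch must be between 0 and 99")
    if speed <= 0:
        raise ValueError("speed must be a positive number of words per minute")
    return ["espeak", "-v", voice, "-p", str(pitch), "-s", str(speed), text]


# A slightly higher-pitched, slower voice than the defaults.
command = espeak_command("Hello, how are you?", voice="en", pitch=70, speed=140)
```

On a machine with eSpeak installed, the resulting list can be handed to `subprocess.run(command)` to hear the result. The sketch only builds the command, so it runs without eSpeak being present.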

In the case of the Android app, eSpeak is used ‘by default’, as this is usually the speech system that ships with the device. However, Android is a fairly open platform: it allows users to replace the default eSpeak app with a different one. This is no problem for the chatbot application: it will automatically switch to the newly installed voice system.

Future

At the time of writing, the neural network is already able to ‘improve’ input quality by checking multiple possible combinations against its known state. This system can still be improved by adding an extra verification layer.

For speech input, there are plans to build a custom input system that integrates more tightly with the neural network. The idea is that this should provide more accuracy. However, this is still just a theory.

The same goes for speech rendering: with better integration into the neural network, the system should be able to better control things like pitch, accentuation, speed and volume, which should result in a more human-like (or emotionally loaded) voice.

 

 
