Digital Dictation

Dictation tests are how humans are taught to convert speech to text. But how do we teach computers to do the same? We use Speech-to-Text technology.

Remember those dreaded dictation tests we did in our early primary school days? Pencils furiously scratching away as we strained to hear our soft-spoken English teacher over the rattling fans, other classes doing PE in the basketball courts, and the distraction of schoolmates passing in the corridors. Did she say ‘leave’ or ‘leaf’?

Speech-to-Text (STT), also known as automatic speech recognition, is the use of Artificial Intelligence (AI) to transcribe recorded speech into written text. It relies on signal processing and machine learning techniques.

Simple as it sounds, STT has several critical applications in the MHA context. For example, STT can be used to transcribe spoken words in video-recorded interviews, radio communications, focus group discussions and emergency calls to 995 and 999.

When STT is applied to these emergency calls, operators can focus on helping the caller rather than on manually recording important incident information during the call.

There are two general approaches to STT – the statistical approach and the Deep Learning approach.

Statistical Approach

(Image Credit: HTX)

The statistical approach uses signal processing techniques to break the speech down into a digital format that the machine can understand. This digital representation is then passed through three different models: acoustic, pronunciation and language. Respectively, these models recognise the basic units of sound, match those sounds to words, and predict the most likely sequence of words for the transcript.
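To make the three-model idea concrete, here is a minimal toy sketch in Python. It is not HTX's system: the sound sequences, scores and word-pair probabilities are all made-up numbers standing in for trained models, and the names ACOUSTIC, LEXICON, BIGRAM and transcribe are hypothetical. It replays the 'leave' or 'leaf' puzzle: the acoustic model slightly prefers 'leave', but the language model knows that 'the leaf fell' is a far more likely sentence.

```python
from itertools import product

# Acoustic model (made-up scores): for each stretch of audio, how likely
# each candidate sequence of sounds is. The middle stretch is ambiguous.
ACOUSTIC = [
    {"DH AH": 1.0},
    {"L IY V": 0.6, "L IY F": 0.4},
    {"F EH L": 1.0},
]

# Pronunciation model: which word each sequence of sounds spells.
LEXICON = {"DH AH": "the", "L IY V": "leave", "L IY F": "leaf", "F EH L": "fell"}

# Language model (made-up numbers): how likely one word is to follow another.
BIGRAM = {
    ("<s>", "the"): 0.5,
    ("the", "leaf"): 0.2,
    ("the", "leave"): 0.01,
    ("leaf", "fell"): 0.3,
    ("leave", "fell"): 0.001,
}
UNSEEN = 1e-6  # small probability for word pairs not in the table


def transcribe(acoustic=ACOUSTIC):
    """Score every combination of candidates and return the best sentence."""
    best_words, best_score = None, 0.0
    for path in product(*(segment.items() for segment in acoustic)):
        words = [LEXICON[sounds] for sounds, _ in path]
        score = 1.0
        for _, acoustic_prob in path:      # how well the sounds fit the audio
            score *= acoustic_prob
        previous = "<s>"                   # start-of-sentence marker
        for word in words:                 # how plausible the word order is
            score *= BIGRAM.get((previous, word), UNSEEN)
            previous = word
        if score > best_score:
            best_words, best_score = words, score
    return " ".join(best_words)


print(transcribe())
```

Real systems search through vastly more candidates and learn these tables from data, but the division of labour between the three models is the same.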

Deep Learning Approach

The Deep Learning approach feeds recorded speech data into one end-to-end model in the computer and uses AI techniques to process the spoken words into a written transcript.
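The article does not say how an end-to-end model turns audio into letters, so the sketch below is only an assumption: it shows one widely used scheme, in which the model outputs a probability for each character (plus a 'blank' symbol) for every short slice of audio, and a simple decoder merges repeats and drops blanks. The FRAMES numbers are invented and the name greedy_decode is hypothetical; a real model would produce these probabilities itself.

```python
BLANK = "_"
ALPHABET = [BLANK, "l", "e", "a", "f"]

# Invented model output: one row per slice of audio, one probability per
# symbol in ALPHABET (same order).
FRAMES = [
    [0.10, 0.80, 0.05, 0.03, 0.02],
    [0.20, 0.60, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.05, 0.05],
    [0.60, 0.10, 0.10, 0.10, 0.10],
    [0.10, 0.05, 0.05, 0.70, 0.10],
    [0.20, 0.05, 0.05, 0.60, 0.10],
    [0.10, 0.05, 0.05, 0.10, 0.70],
]


def greedy_decode(frames, alphabet=ALPHABET, blank=BLANK):
    """Pick the likeliest symbol per frame, merge repeats, drop blanks."""
    picks = [alphabet[max(range(len(row)), key=row.__getitem__)] for row in frames]
    text = []
    previous = None
    for symbol in picks:
        if symbol != previous and symbol != blank:
            text.append(symbol)
        previous = symbol
    return "".join(text)


print(greedy_decode(FRAMES))
```

Unlike the statistical approach, there are no separate acoustic, pronunciation and language tables here: a single model is responsible for everything up to the per-slice probabilities.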

One key advantage of the Deep Learning approach is that it is better than the statistical approach at recognising dialects, accents and multiple languages. However, neither approach is without its challenges.

Challenges of STT

(Image Credit: HTX)

Beyond the Lines

Done well, STT is useful beyond the mere recording of speech.

Other Natural Language Processing techniques can also be applied to the text that STT produces, allowing us to translate between languages, build tools that automatically summarise text, and create question-answering systems like Siri, which reply to spoken queries on the phone.

The many useful applications of STT in the MHA context are the key motivation for its continual development. Thus, as a first giant step, HTX’s DSAI is collecting datasets within the Home Team so that we can develop better STT models.