More than one in four people currently integrate speech recognition into their daily lives. A new algorithm developed by a University of Copenhagen researcher and his international colleagues makes it possible to interact with digital assistants like "Siri" without any internet connection. The innovation allows for speech recognition to be used anywhere, even in situations where security is paramount.
Talking to a computer was once the stuff of science fiction. Nowadays, saying "Hey Siri," or addressing Alexa, Google or another digital assistant on a smartphone or other interactive gizmo, has become commonplace. Yet in the future, the role of speech recognition may become even more important.
While studies suggest that these technologies are already used by one in four people on a regular basis, should predictions hold true, by 2025 the number of devices equipped with speech recognition will exceed the planet's population. And the technology is still evolving.
Until now, speech recognition has relied upon a device being connected to the internet. This is because the algorithms typically used for this process require significant amounts of temporary random access memory (RAM), which is usually provided by powerful data center servers. Indeed, try switching your smartphone to airplane mode and see how far your voice commands get you. But change is in the air.
A new algorithm developed by Professor Panagiotis Karras from the University of Copenhagen's Department of Computer Science, together with linguist Nassos Katsamanis of the Athena Research Center in Greece, and researchers from Aalto University in Finland and KTH in Sweden, allows even smaller devices like smartphones to decode speech without needing substantial memory or internet access.
The code, recently presented in a scientific article, employs a clever strategy: it "forgets" what it doesn't need in real-time.
Phonemes are the smallest units of sound in a language that cannot be replaced without altering the meaning of what is spoken. According to the Danish Language Council, phonemes are "speech sounds with meaning-distinguishing functions."
Speech recognition algorithms use phonemes as data units to recognize and process linguistic expressions by matching spoken sounds with text.
"Speech recognition fundamentally works by matching the small speech sounds we use to form words and sentences, known as phonemes, with a library of corresponding sounds," explains Panagiotis Karras. "Probabilities are calculated for matches and the subsequent combinations that go on to form our words and sentences. The most likely sequences are calculated and the software translates these sounds into text."
Current algorithms require more memory the longer one speaks, because all alternative combinations must remain open until the final sound is analyzed. The new algorithm does away with this problem.
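The article does not name the standard algorithm. As a rough illustration of why memory grows with the length of the utterance, the sketch below assumes a textbook Viterbi decoder over a hidden Markov model, which is a common way to find the most likely sequence of hidden units. The two toy "phonemes" (`s`, `i`), the two sound labels (`hiss`, `tone`) and all probabilities are invented for this example.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state sequence, number of stored backpointer tables)."""
    # Log-probabilities avoid numerical underflow on long inputs.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    backpointers = []  # one table per time step: this list grows with the input

    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this time step.
            best_prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[best_prev] + math.log(trans_p[best_prev][s])
                            + math.log(emit_p[s][obs]))
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Only after the final sound can the best path be traced back.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, len(backpointers)

# Hypothetical toy model (all names and numbers are assumptions).
STATES = ["s", "i"]
START_P = {"s": 0.5, "i": 0.5}
TRANS_P = {"s": {"s": 0.6, "i": 0.4}, "i": {"s": 0.4, "i": 0.6}}
EMIT_P = {"s": {"hiss": 0.9, "tone": 0.1}, "i": {"hiss": 0.1, "tone": 0.9}}

path, tables = viterbi(["hiss", "hiss", "tone", "tone"],
                       STATES, START_P, TRANS_P, EMIT_P)
print(path, tables)
```

In this sketch, every additional sound adds one more backpointer table, so the stored data grows in step with the length of the input.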
"The algorithm conceived by Panos and developed further by our team, does something entirely new," says co-developer and co-author Nassos Katsamanis. "Unlike the existing gold standard algorithm used since speech recognition's early days, our algorithm only stores a fraction of the processing data, serving as a set of 'coordinates.' With these, an entire sequence can be reconstructed, which makes speech recognition possible with significantly less RAM."
From Keywords to Entire Sentences
This maneuver may sound simple, but it involves entirely new code for which the researchers have sought a patent. The algorithm reduces the amount of memory needed without sacrificing recognition quality. And though it requires slightly more time and computational power, the researchers say the difference is negligible given the muscular capabilities of modern devices.
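The article does not describe how the researchers' algorithm chooses what to store. Purely to illustrate the general trade-off of less memory for slightly more computation, the sketch below uses a simple checkpointing scheme on a Viterbi-style decoder. This is an assumption made for illustration and is not the researchers' method. It keeps score vectors only at every `every`-th step and recomputes the discarded backpointers one segment at a time during traceback. The toy model, the function names and the `every` parameter are all hypothetical.

```python
import math

def step(score, obs, states, trans_p, emit_p):
    """Advance the decoder by one observation; return new scores and backpointers."""
    new_score, pointers = {}, {}
    for s in states:
        best_prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
        new_score[s] = (score[best_prev] + math.log(trans_p[best_prev][s])
                        + math.log(emit_p[s][obs]))
        pointers[s] = best_prev
    return new_score, pointers

def viterbi_checkpointed(observations, states, start_p, trans_p, emit_p, every=2):
    """Most likely state sequence, storing scores only at sparse checkpoints."""
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    checkpoints = {0: score}  # time index -> score vector
    for t in range(1, len(observations)):
        score, _ = step(score, observations[t], states, trans_p, emit_p)
        if t % every == 0:
            checkpoints[t] = score  # backpointers are discarded, not stored

    current = max(states, key=lambda s: score[s])
    path = [current]
    end = len(observations) - 1
    # Walk the segments from last to first, recomputing each one's backpointers.
    for start in sorted(checkpoints, reverse=True):
        seg_score, seg_pointers = checkpoints[start], []
        for t in range(start + 1, end + 1):
            seg_score, pointers = step(seg_score, observations[t],
                                       states, trans_p, emit_p)
            seg_pointers.append(pointers)
        for pointers in reversed(seg_pointers):
            current = pointers[current]
            path.append(current)
        end = start
    path.reverse()
    return path

# Hypothetical toy model (all names and numbers are assumptions).
STATES = ["s", "i"]
START_P = {"s": 0.5, "i": 0.5}
TRANS_P = {"s": {"s": 0.6, "i": 0.4}, "i": {"s": 0.4, "i": 0.6}}
EMIT_P = {"s": {"hiss": 0.9, "tone": 0.1}, "i": {"hiss": 0.1, "tone": 0.9}}
SOUNDS = ["hiss", "hiss", "tone", "tone", "hiss", "hiss"]

decoded = viterbi_checkpointed(SOUNDS, STATES, START_P, TRANS_P, EMIT_P, every=2)
print(decoded)
```

In this sketch, each segment between checkpoints is computed twice (once going forward, once during traceback), which is the extra time paid for holding only one segment's backpointers in memory at once.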
Moreover, the algorithm works without an internet connection, enabling speech recognition anywhere, even in the depths of the Amazon jungle. The researchers hope it could also enable real-time language translation in the future.