Facebook’s parent company Meta may have found a way around one of the biggest hurdles in speech recognition: background noise. The firm has built analysis of visual cues into its speech recognition platform to filter out external chatter.

Background noise remains a major problem for contemporary speech recognition platforms, making it difficult for AI systems to decipher speech in a noisy space. Traditionally, noise-suppression techniques have tried to separate the main sound from the chatter. However, these techniques have not been as effective as the human ability to combine auditory cues with vision.

Taking that into account, Meta AI, the research arm of Facebook’s parent company, has launched its latest conversational AI framework. The Audio-Visual Hidden Unit BERT (AV-HuBERT) is a system designed to train artificial intelligence models on both auditory and visual cues. According to Meta AI, AV-HuBERT learns from a video’s speech and lip movements without requiring transcriptions.

How is AV-HuBERT different from other speech recognition platforms?

Meta describes AV-HuBERT as more advanced than other systems of its kind. The current market for speech recognition largely consists of software that relies solely on audio input. These platforms struggle to tell apart the voices of multiple speakers. AV-HuBERT, by contrast, is designed to combine visual cues with auditory data. The platform studies lip and teeth movement to distinguish between the various input streams. As a result, the program can pick out the speaker’s voice and separate it from background noise or chatter.
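To make the idea of combining the two streams concrete, here is a toy Python sketch of weighted score fusion. It is not Meta’s method and not how AV-HuBERT is built; the word list, the scores, and the `audio_weight` parameter are all made up for illustration. It only shows why a visual signal can rescue a guess when the audio is drowned out by noise.

```python
import math


def log_softmax(scores):
    """Turn raw scores into log-probabilities in a numerically stable way."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


def fuse(audio_scores, visual_scores, audio_weight):
    """Blend per-word log-probabilities from the audio and visual streams.

    audio_weight is a hypothetical knob between 0 and 1: lower it when the
    audio is noisy so that the lip-reading scores count for more.
    """
    audio_lp = log_softmax(audio_scores)
    visual_lp = log_softmax(visual_scores)
    blended = [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_lp, visual_lp)
    ]
    # Renormalise so the result is a probability distribution again.
    return [math.exp(lp) for lp in log_softmax(blended)]


def best_word(words, probs):
    """Return the word with the highest probability."""
    return words[max(range(len(probs)), key=probs.__getitem__)]


# Made-up example: three words that are easy to confuse in a noisy room
# but look different on the lips.
WORDS = ["fine", "sign", "shine"]
AUDIO_SCORES = [1.0, 1.2, 1.1]   # noisy audio: almost no preference
VISUAL_SCORES = [3.0, 0.5, 0.4]  # lip movement clearly suggests "fine"

audio_only = best_word(WORDS, fuse(AUDIO_SCORES, VISUAL_SCORES, 1.0))
audio_visual = best_word(WORDS, fuse(AUDIO_SCORES, VISUAL_SCORES, 0.3))
print(audio_only, audio_visual)
```

With the audio alone, the near-flat scores lean toward the wrong word; once the visual scores carry most of the weight, the fused guess follows the lips. A real system would learn how to weigh the streams rather than use a fixed number.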

How effective is AV-HuBERT?

According to Meta AI, AV-HuBERT is 75% more accurate than the best existing audio-visual speech recognition systems. In addition, the firm says the model needs only one-tenth of the labeled data that other systems require to achieve the same results.

Because it needs so little labeled data, the system is a strong candidate for understanding and decoding visual and auditory cues in different languages. Meta’s speech recognition system could be used to build advanced systems for more languages, including those that lack large-scale labeled datasets.

Applications of AV-HuBERT

The applications for an audio-visual speech recognition system are wide-ranging. For starters, the technology could be used in smartphones and smart-home devices to understand speech accurately in high-noise environments. The system could also help identify deepfakes, since it can analyze subtle associations between auditory and visual cues. Meta’s AV-HuBERT could also be game-changing for VR avatars, making them appear more realistic.

Meta AI also says it will release a batch of pre-trained models to other researchers to speed up progress across the industry.