May 23, 2024
Noise-canceling headphones have become very effective at creating an auditory blank slate. But allowing certain sounds from the wearer’s environment through that erasure remains a challenge for researchers. The latest edition of Apple’s AirPods Pro, for example, automatically adjusts sound levels for users, such as detecting when they’re in a conversation, but the wearer has little control over whom to listen to or when that happens.
A team from the University of Washington has developed an artificial intelligence system that lets a user wearing headphones look at a person speaking for three to five seconds to “enroll” them. The system, called “Target Speech Hearing” (TSH), then cancels out all other sounds in the environment and plays only the enrolled speaker’s voice in real time, even as the listener moves around noisy places and is no longer facing the speaker.
The team presented its findings May 14 in Honolulu at the ACM CHI Conference on Human Factors in Computing Systems. The code for the proof-of-concept device is available for others to build on. The system is not commercially available.
“We now tend to think of AI as web-based chatbots that answer questions,” said lead author Shyam Gollakota, a UW professor in the Paul G. Allen School of Computer Science & Engineering. “But in this project, we are developing AI to modify the auditory perception of anyone wearing headphones, based on their preferences. With our devices, you can now hear a single speaker clearly, even if you’re in a noisy environment with many other people talking.”
To use the system, a person wearing commercial headphones equipped with microphones presses a button while pointing their head toward someone who is speaking. The sound waves from that speaker’s voice must then reach the microphones on both sides of the headset simultaneously; there is a 16-degree margin of error. The headphones send this signal to an onboard computer, where the team’s machine learning software learns the vocal patterns of the desired speaker. The system latches on to that speaker’s voice and continues to play it back to the listener, even as the two move around. The system’s ability to focus on the enrolled voice improves as the speaker keeps talking, giving the system more training data.
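The enrollment-and-playback flow described above can be sketched roughly in code. The outline below is only an illustration under assumptions, not the team’s released implementation: estimate_direction, embed_speaker, and extract_target are hypothetical stand-ins for the direction check and the learned networks the researchers describe, and the sample rate, chunk size, and angle mapping are assumed values.

```python
import numpy as np

SAMPLE_RATE = 16_000      # assumed sample rate for this sketch
ENROLL_SECONDS = 4        # the article cites a three-to-five-second look
CHUNK = 1_024             # assumed number of samples processed per step


def estimate_direction(left: np.ndarray, right: np.ndarray, window: int = 2048) -> float:
    """Hypothetical stand-in: estimate the angle of the dominant source
    from the arrival-time difference between the two ear microphones."""
    a, b = left[:window], right[:window]
    lag = np.argmax(np.correlate(a, b, mode="full")) - (len(b) - 1)
    # Map the inter-ear delay (in samples) to an angle; 12 samples is a
    # rough assumed maximum head delay at a 16 kHz sample rate.
    return float(np.degrees(np.arcsin(np.clip(lag / 12.0, -1.0, 1.0))))


def embed_speaker(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the network that learns the target
    speaker's vocal characteristics from the enrollment snippet."""
    return np.abs(np.fft.rfft(left + right))[:128]


def extract_target(chunk: np.ndarray, embedding: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the real-time network that suppresses
    everything except the enrolled voice."""
    return chunk  # a real system would apply a learned separation model here


def enroll(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Enroll a speaker: accept the snippet only if the voice arrives
    nearly head-on (the article cites a 16-degree margin of error)."""
    if abs(estimate_direction(left, right)) > 16.0:
        raise ValueError("Look more directly at the speaker and try again.")
    return embed_speaker(left, right)


def stream(mic_chunks, embedding):
    """Play back only the enrolled speaker as audio chunks arrive."""
    for chunk in mic_chunks:
        yield extract_target(chunk, embedding)


# Toy usage with synthetic audio in place of real microphone input.
rng = np.random.default_rng(0)
snippet = rng.standard_normal(ENROLL_SECONDS * SAMPLE_RATE)
emb = enroll(snippet, snippet)  # identical channels -> head-on direction
for out in stream((rng.standard_normal(CHUNK) for _ in range(3)), emb):
    pass                        # would be sent to the headphone speakers
```

In the actual device, the extraction step is the team’s neural network running on the onboard computer, and its output improves as more of the enrolled speaker’s speech accumulates.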
The team tested the system on 21 subjects, who on average rated the clarity of the enrolled speaker’s voice nearly twice as high as that of the unfiltered audio.
This work builds on the team’s previous research into “semantic hearing,” which allowed users to select the specific classes of sounds they wanted to hear, such as birds or voices, while canceling other sounds in the environment.
Currently, the TSH system can enroll only one speaker at a time, and it can enroll a speaker only when there is no other loud voice coming from the same direction as the target speaker’s voice. If a user is not satisfied with the sound quality, they can run another enrollment on the speaker to improve clarity.
The team is working to expand the system to earbuds and hearing aids in the future.
Other co-authors on the paper were Bandhav Veluri, Malek Itani and Tuochao Chen, UW doctoral students in the Allen School, and Takuya Yoshioka, research director at AssemblyAI. This research was funded by a Moore Inventor Fellow Award, a Thomas J. Cabel Endowed Chair, and a UW CoMotion Innovation Gap Fund.
For more information, contact [email protected].