NAB 2018: IBM Watson on Refining AI for Closed Captioning

Closed captioning isn’t just for the hard-of-hearing anymore. According to Digiday, 85 percent of Facebook video is viewed without sound. That signals a growing audience that prefers to watch with closed captions, putting the heat on solutions providers to come up with compliant systems that are also accurate and speedy. With artificial intelligence, says IBM Watson Media senior offering manager David Kulczar, closed captioning can be enhanced to go beyond transcription and automatically identify and describe background audio.

Kulczar defined the terms used in closed captioning. “It all starts with automated speech recognition (ASR), which creates a text output from an audio track,” he said. “This is not to be confused with speech-to-text, which extends ASR but is only focused on the spoken word.”

Transcription is the manual process of creating text output from audio. “And closed captioning is a service that allows audiences to interpret content accurately without audio to assist with full understanding of the presented content,” he said.

In addition to ASR, closed captioning systems need to recognize and identify common sounds and noises, know when speakers shift languages, and identify and represent speaker transitions (a process called diarization). For blind and low-vision viewers, audio description conveys the visual characteristics of the content. Closed captioning systems also need editing tools to accurately time-stamp caption data, present captions in an intelligent, coherent layout, and provide an editing interface for manual changes.
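The time-stamping step above can be sketched in a few lines. This is a minimal illustration (not IBM's implementation) that takes hypothetical timed ASR segments and renders them as SubRip (SRT) caption entries, including a bracketed sound cue and a speaker transition:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments) -> str:
    """Render (start_sec, end_sec, text) segments as SRT caption blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)

# Hypothetical ASR output: a sound cue, then a diarized speaker turn.
segments = [
    (0.0, 2.4, "[wind howling]"),
    (2.4, 5.1, ">> REPORTER: The storm made landfall overnight."),
]
print(to_srt(segments))
```

In a real pipeline the segment boundaries would come from the recognizer's word timings, and layout rules (line length, caption position) would be applied before rendering.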

Today’s systems for closed captioning all have pros and cons. Manual or human captioning is the most accurate for VOD files, but it is also the most expensive. Crowdsourced captioning is more cost-effective and likely to cover most subject areas, but it doesn’t support live content, is unpredictable and is open to malicious intent.

Machine-generated captioning is the most cost-effective model, and it is consistent and trainable, but it needs further development and carries the up-front cost of training. Hybrid systems offer a “safety net” that covers some of the other approaches’ flaws, but they don’t eliminate all of them.

When using AI for captioning, said Kulczar, “training makes all the difference.” He broke down such training into three categories: vocabulary, contextual and acoustic.

“We build a corpus of knowledge,” he explained. “In nightly news, for example, the topic changes from weather to politics in an instant, and I have to make sure my models are applied rapidly and consistently and do it in an automated way.”
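The vocabulary training Kulczar describes can be illustrated with a toy sketch (this is an assumption about the general technique, not IBM's actual method): build a word-frequency vocabulary from a domain corpus, then use it to rescore competing ASR hypotheses so domain terms win out over acoustically similar alternatives.

```python
from collections import Counter

def build_vocabulary(corpus: str) -> Counter:
    """Count word frequencies in a domain corpus (e.g., past weather scripts)."""
    return Counter(corpus.lower().split())

def rescore(hypotheses, vocab, weight=0.1) -> str:
    """Pick the hypothesis whose words best match the domain vocabulary.

    Each hypothesis is a (text, acoustic_score) pair. The domain bonus here
    is a simple per-word frequency count; production systems use a full
    language model instead.
    """
    def score(hyp):
        text, acoustic = hyp
        bonus = sum(vocab[w] for w in text.lower().split())
        return acoustic + weight * bonus
    return max(hypotheses, key=score)[0]

# Hypothetical domain corpus and recognizer output.
weather_vocab = build_vocabulary(
    "cold front barometric pressure gale force winds front moving east"
)
hypotheses = [
    ("called front moving east", 0.52),  # acoustically likely, wrong words
    ("cold front moving east", 0.50),    # domain vocabulary favors this
]
print(rescore(hypotheses, weather_vocab))  # -> cold front moving east
```

Swapping in a politics corpus would shift the bias the other way, which is the kind of rapid, automated model switching Kulczar describes for nightly news.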

With these three kinds of training, said Kulczar, IBM Watson has shown five to 10 percent gains in accuracy. “The models give you two scores: relevance and confidence,” he said. “You run through a certain amount of corpus information and it goes to a team that refines it until you get an accuracy model that works. With Live, it took us two weeks to get 95 percent accuracy for the U.S. Open, for example.”
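The article doesn't specify how those accuracy figures are computed, but ASR accuracy is conventionally reported via word error rate (WER): the word-level edit distance between a reference transcript and the system output, divided by the reference length. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a word-level edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick brown fox jumped over the dog"   # one sub, one deletion
print(f"WER: {word_error_rate(ref, hyp):.2f}")    # 2 errors / 9 words
```

Under this metric, "95 percent accuracy" would correspond to a WER of roughly 0.05.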

“Manual transcription also has a training requirement,” he concluded. “But we can build these models really, really quickly.”