Apple’s ReALM AI Advances the Science of Digital Assistants

Apple has developed a large language model it says has advanced screen-reading and comprehension capabilities. ReALM (Reference Resolution as Language Modeling) is artificial intelligence that can see and read computer screens in context, according to Apple, which says it advances technology essential for a true AI assistant “that aims to allow a user to naturally communicate their requirements to an agent, or to have a conversation with it.” Apple claims that in a benchmark against GPT-3.5 and GPT-4, the smallest ReALM model achieved performance “comparable” to GPT-4, with the larger ReALM models “substantially outperforming it.”

Conceding that LLMs are “extremely powerful for a variety of tasks,” Apple researchers say the models struggle to resolve ambiguous references and to grasp context, which is where ReALM comes in.

Context includes both previous turns of conversation as well as “non-conversational entities, such as entities on the user’s screen or those running in the background,” according to a research paper on ReALM. The breakthrough is using LLMs to convert all this “into a language modeling problem.”
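The paper’s core move, serializing on-screen entities into plain text so an ordinary LLM can resolve a reference, can be sketched roughly as follows. This is an illustrative simplification, not Apple’s actual encoding: the entity fields, top-to-bottom/left-to-right ordering, index tagging, and prompt wording are all assumptions for the sketch.

```python
from dataclasses import dataclass

@dataclass
class ScreenEntity:
    text: str   # visible label, e.g. a phone number or button caption
    top: int    # bounding-box position, used to reconstruct reading order
    left: int

def encode_screen(entities):
    """Serialize on-screen entities into plain text in reading order
    (top-to-bottom, then left-to-right), tagging each with an index
    so the model can name its answer."""
    ordered = sorted(entities, key=lambda e: (e.top, e.left))
    return "\n".join(f"[{i}] {e.text}" for i, e in enumerate(ordered))

def build_prompt(screen_text, user_request):
    """Frame reference resolution as a text-completion problem."""
    return (
        "Screen:\n" + screen_text +
        f"\nUser: {user_request}\n"
        "Which entity does the user refer to? Answer with its index."
    )

entities = [
    ScreenEntity("Contact: Alice", top=10, left=5),
    ScreenEntity("555-0123", top=40, left=5),
    ScreenEntity("Call", top=80, left=5),
]
prompt = build_prompt(encode_screen(entities), "call that number")
print(prompt)
```

Once the screen is flattened into text like this, resolving “that number” becomes an ordinary next-token prediction task, which is what lets a comparatively small model handle it.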

“By publishing the research, Apple is signaling its continuing investments in making Siri and other products more conversant and context-aware,” writes VentureBeat, which notes “the work highlights the potential for focused language models to handle tasks like reference resolution in production systems where using massive end-to-end models is infeasible due to latency or compute constraints.”

Tom’s Guide cuts to the point, predicting ReALM “could make Siri way faster and smarter,” adding that the research paper comes “ahead of the launch of iOS 18 in June at WWDC 2024, where we expect a big push behind a new Siri 2.0, though it’s not clear if this model will be integrated into Siri in time.”

After taking a back seat to full-bore AI players like Microsoft and Google, Apple appears to have been quietly busy behind the scenes. Tom’s Guide says “ReALM is the latest announcement from Apple’s rapidly growing AI research team and the first to focus specifically on improving existing models.”

Approaching reference resolution via language modeling “breaks from traditional methods focused on conversational context,” reports ZDNet, explaining “ReALM can convert conversational, onscreen, and background processes into a text format that can then be processed by large language models, leveraging their semantic understanding capabilities.”

But the researchers caution that automated screen parsing currently has its limits, and that resolving more complex visual references will likely require incorporating computer vision and multimodal AI capabilities.

“The paper lists four sizes of the ReALM model: ReALM-80M, ReALM-250M, ReALM-1B, and ReALM-3B,” summarizes ZDNet, noting “the ‘M’ and ‘B’ indicate the number of parameters in millions and billions, respectively. GPT-3.5 has 175 billion parameters while GPT-4 reportedly boasts about 1.5 trillion parameters.”
