ChatGPT Goes Multimodal: OpenAI Adds Vision, Voice Abilities

OpenAI began previewing vision capabilities for GPT-4 in March, and the company is now starting to roll out image input and output to users of its popular ChatGPT. The multimodal expansion also includes audio functionality, with OpenAI proclaiming late last month that “ChatGPT can now see, hear and speak.” The upgrade vaults GPT-4 into the multimodal category with what OpenAI is apparently calling GPT-4V (for “Vision,” though equally applicable to “Voice”). “We’re rolling out voice and images in ChatGPT to Plus and Enterprise users,” OpenAI announced.
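For developers, image input along these lines is also reachable through OpenAI’s API. The sketch below is an illustration rather than a description of ChatGPT’s internals: it assumes the openai Python SDK and a vision-capable model name (“gpt-4-vision-preview” here, which may differ or be unavailable depending on account and timing) to ask GPT-4 about an image supplied by URL.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send a mixed text-plus-image message to a vision-capable GPT-4 model.
# The model name below is an assumption; availability varies by account.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this photo."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```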

An OpenAI blog post dated September 25 indicated voice and images would be made available to Plus and Enterprise users “over the next two weeks.” As of last week, users on OpenAI’s forums were still asking for access, or for word from anyone actually using the new features.

OpenAI said “voice is coming on iOS and Android (opt-in in your settings) and images will be available on all platforms.” Mobile users will be able to “speak with ChatGPT and have it talk back,” OpenAI explained, suggesting the voice function be used to “request a bedtime story for your family, or settle a dinner table debate.”

Unite.AI reports ChatGPT’s new image-generation capabilities are the result of “the integration of DALL-E 3 into ChatGPT.” “This blend allows for a smoother interaction where ChatGPT aids in crafting precise prompts for DALL-E 3, turning user ideas into vivid AI-generated art,” Unite.AI says. “While users can directly interact with DALL-E 3, having ChatGPT in the mix makes the process of creating AI art much more user-friendly.”
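Unite.AI’s description of ChatGPT as a prompt-crafting front end for DALL-E 3 can be approximated with OpenAI’s public API. The following is a minimal sketch, not the actual in-app integration: the two-step refine-then-generate flow and the system prompt are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: have the chat model expand a rough idea into a detailed image prompt.
idea = "a cozy reading nook on a rainy day"
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Rewrite the user's idea as one detailed image-generation prompt."},
        {"role": "user", "content": idea},
    ],
)
detailed_prompt = chat.choices[0].message.content

# Step 2: pass the refined prompt to DALL-E 3.
image = client.images.generate(
    model="dall-e-3",
    prompt=detailed_prompt,
    size="1024x1024",
    n=1,  # DALL-E 3 generates one image per request
)
print(image.data[0].url)
```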

OpenAI’s DALL-E 3 landing page now says “DALL-E 3 understands significantly more nuance and detail than our previous systems, allowing you to easily translate your ideas into exceptionally accurate images,” and advises visitors to “try ChatGPT” with “DALL-E 3 coming soon!” A GPT-4V research paper is available.

“Multimodal is the next generation of these large models, where it can process not just text, but also images, audio, video, and even other modalities,” Nvidia Senior AI Research Scientist Jim Fan tells IEEE Spectrum.

Calling the ChatGPT upgrade “a noteworthy example of a multimodal AI system,” IEEE Spectrum explains that “instead of using a single AI model designed to work with a single form of input, like a large language model (LLM) or speech-to-text model, multiple models work together to create a more cohesive AI tool” for multimodal functionality.
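As a rough illustration of that multiple-models point, here is a sketch that chains three separately available OpenAI API models: Whisper for speech-to-text, GPT-4 for the answer, and a text-to-speech model to speak it back. The model names (“whisper-1,” “tts-1”) and voice (“alloy”) are assumptions here, and the real ChatGPT voice feature runs its own internal pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Speech-to-text: transcribe the user's spoken question with Whisper.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2) Language model: answer the transcribed question with GPT-4.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# 3) Text-to-speech: synthesize the reply as audio (model/voice names assumed).
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.write_to_file("answer.mp3")
```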

In other OpenAI news, Reuters reports the company “is exploring making its own artificial intelligence chips and has gone as far as evaluating a potential acquisition target.” And while TechCrunch observes that growth in demand for ChatGPT has slowed, it says the chatbot’s mobile app notched a “record $4.58 million in revenue last month.”
