
ChatGPT 4 And Its Multimodal Capabilities: Seeing, Hearing, and Speaking

OpenAI’s ongoing quest to improve ChatGPT continues to bear fruit with the rollout of its vision and voice capabilities. The vision component was announced when GPT-4 launched but was postponed due to privacy concerns; it is now ready to impress users with the way ChatGPT 4 interacts with content other than text.


These new features allow users to engage with ChatGPT in a more intuitive and immersive manner. In this article, we’ll explore how to leverage these capabilities and the responsible approach OpenAI is taking to ensure both usefulness and safety.

Latest ChatGPT 4 Features

Vision: The Power to See and Understand

The Vision component of ChatGPT marks a significant leap in AI’s ability to understand and respond to images. This feature enables ChatGPT to process text and image prompts in combination, accommodating multiple images and even allowing users to highlight specific portions of images for focused analysis.

A multimodal model leverages its natural language processing capabilities to see and understand images, including photographs, screenshots, and documents that contain both text and images. This integration of language and visual reasoning enhances the AI’s ability to provide context-aware responses, bridging the gap between textual and visual communication.

Access is still rolling out, but to kickstart your visual conversations with ChatGPT 4, simply tap the photo button to capture or select an image. For mobile users on iOS or Android, tap the plus button to begin. What’s more, you’re not limited to a single image; you can discuss multiple images in a single conversation or even use the built-in drawing tool to provide visual guidance to your AI assistant. This makes ChatGPT’s image recognition capabilities versatile and user-friendly.
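
For developers, the same multimodal behavior is exposed through OpenAI’s API. Below is a minimal sketch using the official openai Python SDK that combines a text prompt and an image URL in a single request; the model name and the example URL are placeholders and may differ depending on what your account has access to.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One user message that mixes text and an image (the URL is a placeholder)
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision-capable model announced at DevDay
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this photo, and is anything unusual about it?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```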

Voice Capabilities

You can now talk to ChatGPT 4 as if you were talking to a human. You can ask it questions, give it instructions, or just chat about whatever is on your mind.
The introduction of voice technology into ChatGPT enables the creation of lifelike synthetic voices from mere seconds of real speech. While this opens the door to creativity and accessibility-focused applications, it also poses new challenges, such as the potential for malicious use. OpenAI has taken a responsible approach by initially using this technology for voice chat, collaborating closely with voice actors to maintain quality and authenticity. Notably, Spotify is harnessing this technology to pilot a Voice Translation feature, expanding the reach of podcast content by translating it into different languages using the original podcaster’s voice.
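
Under the hood, voice chat combines two pieces: speech recognition (Whisper) to understand you, and a text-to-speech model to answer aloud. The sketch below illustrates that pipeline with the openai Python SDK rather than reproducing the app’s actual implementation; the file names and the "alloy" voice are example choices.

```python
from openai import OpenAI

client = OpenAI()

# 1. Speech-to-text: transcribe the user's spoken question with Whisper
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Ask the chat model the transcribed question
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Text-to-speech: read the answer back with one of the preset voices
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # example voice; several presets are available
    input=answer,
)
speech.stream_to_file("answer.mp3")
```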

[Image: Talking with ChatGPT in the ChatGPT 4 voice app]

Will ChatGPT also understand tones and emotions? Imagine ChatGPT 4 detecting the frustration in a user’s voice and responding with empathy or recognizing enthusiasm and responding with enthusiasm in return. Such capabilities could greatly enhance the user experience and make ChatGPT a more emotionally intelligent conversational partner.

Ensuring Vision is Useful and Safe

OpenAI’s approach to vision capabilities has been significantly informed by their collaboration with Be My Eyes, a mobile app aiding blind and low-vision individuals. Users have found value in discussing images containing people in the background, illustrating the potential for everyday applications. OpenAI has implemented technical measures to limit ChatGPT’s analysis of and statements about people to respect privacy, acknowledging that the system isn’t always accurate. Real-world usage and feedback will be instrumental in refining these safeguards while maintaining the tool’s utility.

Gradual Deployment for Safety and Progress

Incorporating vision-based models into ChatGPT 4 introduces complexities, including the potential for hallucinations and the model’s interpretation of images in high-stakes situations. Before broader deployment, OpenAI rigorously tested the model with red teamers to assess risks in areas like extremism and scientific proficiency. They also engaged a diverse group of alpha testers for valuable feedback. OpenAI’s dedication to responsible usage is evident in their efforts to ensure ChatGPT respects individuals’ privacy and remains a useful tool.

OpenAI’s commitment to building safe and beneficial Artificial General Intelligence (AGI) remains steadfast. The deployment of these advanced voice and image capabilities follows a gradual and considered approach. This allows OpenAI to continuously improve the system, refine risk mitigation measures, and prepare users for the increasing power of AI systems in the future.

Transparency and Model Limitations

OpenAI places a strong emphasis on transparency regarding ChatGPT’s limitations. While the model excels at transcribing English text, it may perform poorly with some other languages, particularly those with non-Roman scripts. OpenAI advises users to exercise caution and proper verification for higher-risk use cases, especially in specialized fields like research.

OpenAI DevDay

Exciting news emerged from the latest OpenAI DevDay event, as the leading AI company unveiled a series of groundbreaking enhancements and cost reductions across their platform. Among the key highlights were the introduction of the advanced GPT-4 Turbo with a 128K context window, the Assistants API for streamlined app development, and the integration of new modalities such as vision, DALL·E 3, and text-to-speech (TTS).

GPT-4 Turbo: Redefining Context and Affordability

The highly anticipated GPT-4 Turbo stole the spotlight with its increased capabilities and extended context window, now encompassing the equivalent of more than 300 pages of text in a single prompt. OpenAI also positively surprised developers with a significant reduction in pricing, offering GPT-4 Turbo at a 3x cheaper rate for input tokens and a 2x cheaper rate for output tokens compared to its predecessor, GPT-4.
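
From the API side, switching to the new model is a one-line change: you simply request the GPT-4 Turbo model name. A minimal sketch, assuming the gpt-4-1106-preview identifier announced at DevDay:

```python
from openai import OpenAI

client = OpenAI()

# GPT-4 Turbo: same chat interface, larger 128K-token context window
response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the main announcements from OpenAI DevDay."},
    ],
)

print(response.choices[0].message.content)
```

At the announced rates of $0.01 per 1K input tokens and $0.03 per 1K output tokens (versus $0.03 and $0.06 for GPT-4), the input is indeed 3x cheaper and the output 2x cheaper.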

Function Calling and Enhanced Instruction Following

Enhancements to the function calling feature drew attention, allowing for the integration of multiple functions in a single message. The accuracy of function calling has also been substantially improved, ensuring a more precise selection of the right function parameters. Moreover, the platform now boasts improved instruction following, enabling the generation of specific formats with heightened accuracy.
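
In API terms, this means a single model response can now contain several tool calls. The sketch below, again using the openai Python SDK, declares two hypothetical functions (get_weather and get_local_time) and lets the model request both in one turn; executing the functions and returning their results to the model is left out for brevity.

```python
from openai import OpenAI

client = OpenAI()

# Two functions the model may call within a single response (parallel function calling)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_local_time",
            "description": "Get the current local time for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "What's the weather and the time in Paris right now?"}],
    tools=tools,
)

# The model can return several tool calls in one message
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```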

Assisting Development with New API Capabilities

The introduction of the Assistants API marked a significant milestone for developers seeking to create agent-like experiences within their applications. This development was further augmented by the inclusion of the Code Interpreter and Retrieval tools, empowering developers to build high-quality AI apps with ease. The Assistants API introduces persistent and infinitely long threads, liberating developers from the constraints of context windows.
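
A rough sketch of the Assistants API flow, as exposed in the SDK’s beta namespace: create an assistant with a built-in tool, open a thread, add a message, start a run, and poll until it finishes. The assistant’s name and instructions here are invented purely for illustration.

```python
import time

from openai import OpenAI

client = OpenAI()

# An assistant with the built-in Code Interpreter tool
assistant = client.beta.assistants.create(
    name="Data helper",  # illustrative name
    instructions="Answer questions by writing and running Python code.",
    model="gpt-4-1106-preview",
    tools=[{"type": "code_interpreter"}],
)

# A thread holds the conversation and persists across runs
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What is the standard deviation of 3, 7, 8, 12, 20?",
)

# A run asks the assistant to process the thread
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# Poll until the run finishes, then read the assistant's reply
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
```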

Lower Prices and Enhanced Rate Limits for Developers

In a move to support developers and scale their applications, OpenAI has substantially reduced prices across various models, passing on significant savings to the developer community. Additionally, the company has doubled the tokens per minute limit for all paying GPT-4 customers, facilitating the seamless expansion of their applications.

OpenAI’s commitment to continuous innovation and developer support was prominently showcased during the DevDay event. With these cutting-edge upgrades and cost reductions, the company has reaffirmed its position as a trailblazer in the AI industry, leaving the competition to catch up.

Beyond Text to Images, Voices, and Real-Time Knowledge

Voice and image understanding are impressive features that improve the way we interact with ChatGPT 4, but there have also been improvements to ChatGPT’s temporal limitations. The model cannot update itself without undergoing another round of training, which takes time and resources, so for a long time its knowledge stopped at September 2021.

Now, thanks to a retrained model and the ability to browse the internet, ChatGPT 4’s knowledge extends into 2023, which should produce more accurate, up-to-date answers and reduce hallucinations.


Gemini vs. ChatGPT: The AI Showdown Looms

The timing of these innovations from OpenAI raises an intriguing question: Could this be a response to the looming presence of Gemini? Google’s DeepMind has been quietly working on its own language model, Gemini, in a bid to compete with OpenAI. It is accessible through Bard, but it still doesn’t seem to be at the same level as ChatGPT. However, the public version of Bard uses a lighter version of Gemini; the real power lies in the larger Gemini Ultra model, which is not yet publicly accessible and for which Google has so far provided only benchmarks.

Read more about the main differences between Gemini and ChatGPT in this article.

The fusion of DeepMind and Google’s vast resources could potentially give Gemini a substantial advantage. As the competition heats up, we can expect rapid advancements and exciting developments in the field of AI. The stakes have never been higher, and the possibilities are limitless.
