
OpenAI’s ChatGPT can now see, hear, and speak: Here’s how

ChatGPT voice/image features

American artificial intelligence company OpenAI on Monday announced new voice and image capabilities for its popular AI chatbot ChatGPT. “We are beginning to roll out new voice and image capabilities in ChatGPT. They offer a new, more intuitive type of interface by allowing you to have a voice conversation or show ChatGPT what you’re talking about,” OpenAI said in a blog post.

The Sam Altman-led company will begin rolling out the latest features to Plus and Enterprise users over the next two weeks and plans to expand these capabilities to other groups of users, including developers, soon after. The voice feature will be available on iOS and Android, while the image feature will be available on all platforms.



Voice Conversations with ChatGPT

One of the most exciting features of the latest ChatGPT update is its ability to engage in voice conversations. This functionality greatly enhances the conversational capabilities of the AI chatbot, making it even more versatile and user-friendly. Whether you want to have a casual conversation or seek assistance with complex tasks, ChatGPT is ready to listen and respond in real time.

To initiate a voice conversation with ChatGPT, users need to go to Settings, click on New Features on the mobile app, and opt into voice conversations. ChatGPT offers five different voices and users can tap the headphone button located in the top-right corner of the home screen to choose their preferred voice. The company has collaborated with professional voice actors to create each of the voices and also uses its open-source speech recognition system Whisper to transcribe the spoken words into text.
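For readers curious how such a voice turn fits together, the flow described above (speech transcribed to text, a text reply generated, the reply spoken back) can be sketched in a few lines of Python. This is only an illustrative sketch, not OpenAI's implementation: the names `VoiceTurn`, `run_voice_turn`, and the three `fake_*` functions are hypothetical stand-ins, and a real system would plug a speech recognition model such as Whisper, a chat model, and a text-to-speech model into the three steps.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    """One round of a voice conversation (hypothetical structure)."""
    transcript: str     # what the user said, as text
    reply_text: str     # the assistant's answer, as text
    reply_audio: bytes  # the answer rendered as audio


def run_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> VoiceTurn:
    """Chain the three assumed stages: speech-to-text, reply, text-to-speech."""
    transcript = transcribe(audio)
    reply_text = generate_reply(transcript)
    reply_audio = synthesize(reply_text)
    return VoiceTurn(transcript, reply_text, reply_audio)


# Stand-in stages so the sketch runs without any model or network access.
def fake_transcribe(audio: bytes) -> str:
    # Assumption: pretend the "audio" bytes are already UTF-8 text.
    return audio.decode("utf-8")


def fake_generate_reply(text: str) -> str:
    # Assumption: echo the input instead of calling a chat model.
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    # Assumption: return the text as bytes instead of real speech audio.
    return text.encode("utf-8")


turn = run_voice_turn(
    b"what is for dinner",
    fake_transcribe,
    fake_generate_reply,
    fake_synthesize,
)
print(turn.transcript)
print(turn.reply_text)
```

Keeping each stage as a plain function argument means any one of them can be swapped for a real model without touching the other two.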

Image Sharing and Interpretation

Another interesting feature of the latest ChatGPT update is its ability to process and interpret a wide range of images, such as photographs, screenshots, and documents containing both text and images. Whether you want to describe a scene, identify objects, or receive information related to an image, ChatGPT is up to the task.

To share an image with ChatGPT, users can simply upload a picture using the photo button. iOS or Android users need to first tap the plus button. Furthermore, users can discuss multiple images or use ChatGPT’s drawing tool to guide their assistant. The AI chatbot will then use its advanced image recognition algorithms to analyze the content and provide relevant information or descriptions. This feature is powered by multimodal GPT-3.5 and GPT-4.

“Voice and image give you more ways to use ChatGPT in your life. Snap a picture of a landmark while traveling and have a live conversation about what’s interesting about it. When you’re home, snap pictures of your fridge and pantry to figure out what’s for dinner (and ask follow up questions for a step by step recipe). After dinner, help your child with a math problem by taking a photo, circling the problem set, and having it share hints with both of you,” OpenAI explained.

The AI firm is deploying image and voice capabilities gradually. Its overarching goal is to develop Artificial General Intelligence (AGI) that is safe and beneficial for humanity. To that end, it releases its tools incrementally, which allows it to make continuous improvements and refine risk mitigation measures over time. While these capabilities open up new possibilities for creativity and accessibility, they also introduce new risks, such as the potential for malicious actors to impersonate public figures or commit fraud.

Voice Capabilities

To harness the power of this technology responsibly, OpenAI is initially focusing on a specific use case: voice chat. They’ve collaborated with voice actors and are partnering with other entities like Spotify. For instance, Spotify is leveraging this technology for their Voice Translation feature, which helps podcasters reach a wider audience by translating podcasts into additional languages using the podcasters’ own voices.

Image Input Challenges

Vision-based models present unique challenges, including hallucinations and the model’s interpretation of images in high-stakes domains. Before a broader deployment, OpenAI rigorously tested the model with red teamers in areas like extremism and scientific proficiency, as well as a diverse group of alpha testers. This extensive research has enabled them to establish key guidelines for responsible usage.

Ensuring Vision is Useful and Safe

OpenAI’s vision technology is designed to assist users in their daily lives effectively. To refine this approach, they have collaborated with Be My Eyes, a mobile app for the visually impaired. Feedback from users has emphasized the value of having general conversations about images, even those containing people in the background. OpenAI has also implemented technical measures to restrict ChatGPT’s ability to analyze and make direct statements about individuals, respecting privacy concerns.

Transparency and Limitations

OpenAI is transparent about ChatGPT’s limitations, particularly in specialized topics like research. They discourage high-risk use cases without proper verification. Additionally, ChatGPT excels at transcribing English text but may perform poorly with non-Roman script languages. Non-English users are advised against using ChatGPT for such purposes.

OpenAI’s ChatGPT has taken the world by storm since its launch in November 2022. Short for Chat Generative Pre-trained Transformer, ChatGPT is a popular AI text generator and has been a center of attention for many due to its human-like responses. However, concerns around fake news, plagiarism, bias, manipulation, and privacy have led several authorities globally to study and investigate the impact and potential risks such AI platforms could pose.

Meanwhile, several other tech majors like Google and Baidu are also investing heavily in the AI segment to emulate the success of ChatGPT. In February, Sundar Pichai-led tech major Google announced its experimental conversational artificial intelligence chatbot service Bard. The following month, Baidu chief executive officer Robin Li presented a pre-recorded video in which Ernie Bot demonstrated mathematical logic reasoning, generated a conference poster and video based on a prompt, and answered questions related to Chinese fiction, among other tasks.
