
What if your AI chatbot could see what you’re seeing? That’s now a reality thanks to xAI’s latest update to its chatbot, Grok. The new feature, Grok Vision, brings real-time image analysis to your iPhone, allowing the chatbot to interpret whatever you point your phone’s camera at. Whether it’s a product, a document, or a sign in a foreign language, Grok can identify it and give you contextual answers in seconds.
Think Google Gemini’s vision features or OpenAI’s GPT-4 with Vision, but with xAI’s own spin.
With Grok Vision, available now on the Grok iOS app (but not yet on Android), users can activate voice mode, open the camera, and ask, “What am I looking at?” The AI then analyzes the scene and responds.
Imagine snapping a quick shot of a foreign menu and getting an instant translation, or pointing your phone at a power outlet and asking if it’s compatible with your charger. Grok Vision promises this kind of real-world utility.
Here's a quick look shared by @MarioNawfal:
"The Vision feature on iOS allows the chatbot to analyze real-world objects, text, and environments through your camera."
The update doesn’t stop at computer vision. xAI also rolled out multilingual audio capabilities and real-time search in Grok’s voice mode. These features are currently accessible on Android, but only to users subscribed to the premium SuperGrok plan ($30/month).
Grok now speaks:
- Spanish 🇪🇸
- French 🇫🇷
- Turkish 🇹🇷
- Japanese 🇯🇵
- Hindi 🇮🇳
xAI has been on a roll. Earlier this month, the company introduced a memory feature that lets Grok recall past conversations, similar to what OpenAI and Anthropic are building into their AI systems.
The team also released a canvas-style workspace, allowing users to create documents collaboratively with Grok, a move that positions it closer to tools like Notion AI or Google’s Duet AI.
With Grok Vision, xAI is aiming to close the gap with multimodal giants like Google and OpenAI, but it’s also carving out its own niche: combining AI-powered productivity with real-world visual interaction.