travel-compAnIon

image request for architectural style and history of a buidling
voice recognition and functioncalling
chat for recommendations, directions, historical facts

Inspiration

Exploring new cities, especially in foreign countries with different languages, can be a challenging adventure. You might want to learn about the city or country's history, cultural differences, or other interesting facts. Or maybe you are overwhelmed by the beauty of the city and unsure of what to explore next. A tourist guide that can help you navigate and discover the city can be incredibly helpful, but they are often not always available. That's where Travel Companion can help. It acts as your personal companion, making it easy for you to explore any city that you are interested in.

What it does

This mobile smartphone app helps you explore cities and learn interesting things about them. Whether you are a traveler, a history enthusiast, or simply curious about your surroundings, this app will be your ultimate knowledge guide.

You can get information about any building, its architectural style or its history. Just take a picture of the building!
Or just talk to the AI guide and depending on your interests or needs it will display Points of Interest in your area on a 3D map and through an Augmented Reality overlay.
If you want to dig really deep and learn more about the city or place you are currently at, just chat with the AI model. It will teach you historical facts, give you recommendations, or navigate you through the city.

What’s special about this app is that it adds the geocoded address to the user input, which is then sent to Gemini. This enhances spatial understanding and allows for more accurate answers. For example, this feature enables the model to correctly identify even lesser-known buildings.

How I built it

Developed using the Unity3D Game Engine, ARFoundation and ARCoreExtensions for augmented reality capabilities. I used Google VertexAI REST API to access the gemini-1.5-pro model for chat and for function calling events.

Image processing is handled as follows: when a user takes a photo, it is first uploaded to Google Cloud Storage via the .NET library. The image is then processed using the gemini-1.0-vision model to extract detailed information about the photographed subject.

For security and data privacy, user authentication is managed through OAuth2. The app also leverages Google’s Places API (specifically, the Nearby Search API) and the Geocoding API, both accessed via REST API calls. Although speech recognition capabilities are supported by Gemini, I choose a local model, whisper-tiny, which runs on Sentis - a Unity inference engine.

The photorealistic 3D building model that can be seen in some parts of the video are displayed using CesiumION. The tiles itself are Googles Photorealistic3DTiles and display the buildings surrounding you.

The user's geographic location is acquired using ARCore and Google's Visual Positioning System (VPS). This location data is then geocoded to convert it into a precise address. This address is sent along with user queries to Gemini. Incorporating the geocoded address into requests to Gemini significantly enhances the accuracy of responses to spatial questions, providing users with information that is highly relevant to their specific location.

Challenges i ran into

Bringing all these different technologies into one working application wasn't easy in itself. Also, always keeping in mind usability, performance, and most importantly, keeping AI response time low for a good user experience.

Accomplishments that i am proud of

I think building something that can be potentially used by anyone, anywhere in the world, regardless of their language, knowledge, or location. This app can be used anywhere in the world to explore cities and places.

What I learned

A lot about Google Gemini and multimodal LLMs in general. I also learned that providing the model with specific spatial information like your current position can greatly enhance the accuracy of its responses.

What's next for travel-compAnIon

Fine-tuning the model and finding the best possible prompt to get a desired response is not easy and can definitely be improved in this application. Improving answer accuracy, especially in relation to spatial understanding, will be a topic I will definitely investigate further and is very important for LLM navigation tasks.

Built With

arcore
c#
cesium-ion
google-gemini
google-geocoding-api
google-maps-platform
google-places
google-vertex-ai
unity

Updates

Moritz Cermann started this project — Apr 30, 2024 04:43 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.