October 5, 2024 Anton Cherepov

Enhancing Mobile App UX: AI-Powered Optimization Strategies

Our initial task was to design the architecture for a mobile application that offloads intensive computations to the server and delivers their results to the client when the tour loads. After that, whenever the geoposition changes and at timed intervals, the client runs its own small neural network on the device, using the data provided by the server. Proprietary text alignment and smoothing algorithms then process the output to ensure smooth transitions, alongside protective mechanisms that guarantee the quality and security of the delivered texts. The text must be generated continuously at a speech rate acceptable by tour guide standards; we aimed for approximately 50 words per minute.

This was the challenge presented to us. Training and selecting the neural networks, and choosing the training data and metrics, were handled by our separate AI training department and will be covered in separate articles. Here I want to share the significant difficulties we faced in achieving the required speed and text interactivity at runtime while maintaining the necessary performance and energy efficiency on the device. The goal was for our application, running on an average device (we targeted the iPhone 13 mini), to deliver speech quickly when the geolocation changes while minimizing internet usage.

Diagram of the interaction between the mobile client and the server software

Top-down view of the interaction between the server and the client

The core concept of our system is to mimic the operation of a tour guide as closely as possible. Before creating a tour, a tour guide first gathers information along the tourist’s route, forms the route, selects data suitable for the current tourist, identifies buildings to visit, attaches relevant materials to them, crafts coherent speech, and then improvises based on prepared materials while navigating the route. Our system follows a similar process.

The most intensive tasks, such as route preparation, material collection, building selection, and the creation of coherent speech and multiple templates, are handled on the server by heavy neural networks, specifically large language models (LLMs). Their output is sent to the device as separate materials (text, images, segmented text snippets) together with a small distilled model tuned for the specific route. The server also sends technical rules for the algorithms that smooth the text and form coherent speech according to tourism standards. Compressed packages containing all this data are sent to the client via gRPC, and the materials are stored in a local mobile database. Some of the generated material is also duplicated in a vector database on the server, ready for quick regeneration if the user deviates significantly from the route or if the mobile client needs to regenerate text from existing information. This regeneration runs in a fast mode using the vector database and depends on internet availability. It is triggered periodically through a handshake, subject to the client’s permission, which makes the generated tour more adaptable. Essentially, our system can operate without the internet by guiding the user along a pre-prepared route and adjusting dynamically at specific locations based on predetermined materials.

When the internet is available and permission is granted, our system responds more effectively to route changes: it generates text using large online neural networks and selects new information in response to the user’s movements.
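To make the contents of such a package concrete, here is a minimal sketch of how it might be modeled on the client. Every name in it (TourPackage, TourMaterial, missingParts) is a hypothetical illustration, not our actual gRPC schema; the fields simply mirror the three things the server sends: materials, the distilled model, and the smoothing rules.

```typescript
// Hypothetical client-side shape of a tour package; not the real gRPC schema.
interface TourMaterial {
  id: string;
  kind: "text" | "image" | "snippet";
  payload: Uint8Array; // decompressed content
}

interface TourPackage {
  routeId: string;
  materials: TourMaterial[];
  modelWeights: Uint8Array; // small distilled model tuned for this route
  smoothingRules: string[]; // technical rules for text smoothing
}

// Lists which parts are missing, so the client can decide whether the
// package is complete enough to run a tour without internet access.
function missingParts(pkg: TourPackage): string[] {
  const missing: string[] = [];
  if (pkg.materials.length === 0) missing.push("materials");
  if (pkg.modelWeights.length === 0) missing.push("modelWeights");
  if (pkg.smoothingRules.length === 0) missing.push("smoothingRules");
  return missing;
}
```

A check like this would run once after the package is unpacked and before the user is told that the tour is ready.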

On the client side, two custom neural networks operate—one is responsible for generating connected text based on materials prepared by large neural networks, and the other is responsible for converting text into speech. A third network handles text recognition and invokes corresponding functions in our engine. We will elaborate separately on the third network, but in general, we use the pre-trained RNN-Transducer neural network, supplemented by a simple neural network for DNN-ActionVoice classification (based on CNN Incision).

The first neural model is a custom-generated model sent from the large neural network. The second model is a pre-trained neural network adept at smoothing text according to tourism standards.

The primary work is carried out by the Guide Generated Engine, which receives geolocation coordinates in real time, generates text for them, converts it into speech, and reacts to changes in geolocation by adjusting the text accordingly (making it more detailed or more concise, transitioning to other text, or stopping the speech altogether). To speed up the system, critical components are stored in the mobile vector database.
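The four reactions listed above can be illustrated with a small decision function. This is only a sketch under assumptions: the type names, the 0.5 m/s "lingering" threshold, and the rule itself are hypothetical and chosen for readability; the real engine takes more signals into account.

```typescript
// Hypothetical names and thresholds; the engine's real rules are not shown here.
type NarrationAction = "expand" | "condense" | "transition" | "stop";

interface GeoFix {
  lat: number; // degrees
  lon: number; // degrees
  speedMps: number; // metres per second
}

interface PointOfInterest {
  lat: number;
  lon: number;
  radiusM: number; // distance within which the object counts as "in view"
}

const LINGER_SPEED_MPS = 0.5; // assumed: slower than this means the user is lingering

// Great-circle distance in metres (haversine formula).
function distanceM(aLat: number, aLon: number, bLat: number, bLon: number): number {
  const earthRadiusM = 6371000;
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const sinLat = Math.sin(toRad(bLat - aLat) / 2);
  const sinLon = Math.sin(toRad(bLon - aLon) / 2);
  const h = sinLat * sinLat + Math.cos(toRad(aLat)) * Math.cos(toRad(bLat)) * sinLon * sinLon;
  return 2 * earthRadiusM * Math.asin(Math.sqrt(h));
}

function chooseAction(fix: GeoFix, current: PointOfInterest, next?: PointOfInterest): NarrationAction {
  const toCurrent = distanceM(fix.lat, fix.lon, current.lat, current.lon);
  if (toCurrent <= current.radiusM) {
    // Still at the object: lingering users get more detail, walkers a shorter text.
    return fix.speedMps < LINGER_SPEED_MPS ? "expand" : "condense";
  }
  // The user has left the object: move on if there is a next one, otherwise stop.
  return next !== undefined ? "transition" : "stop";
}
```

Keeping the decision in a pure function of the latest fix makes it cheap to re-evaluate on every geolocation update and on a timer.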

Guide Generated Engine - What It’s Built On

Since our team specializes in JavaScript, we initially built the system using React. Currently, we use React Native to run our client on both Android and iOS. Our main goal was to provide the user with a responsive interface where generation of the tour guide occurs smoothly, intuitively, and predictably. We addressed this by splitting the flow: on the client side the user defines the tour parameters and sends a request to the server, after which the application enters a waiting mode. During this mode, the user can continue to interact with the application without interruption. Once the mobile neural network has finished loading from the server, the user receives a notification that they can now engage with their AI tour guide along the chosen route.

Once the user selects a route and starts following it, our system operates as follows:

First, the engine queries the internal vector storage using user data and geolocation from the geolocation engine, obtaining pre-prepared internal representations with the necessary information. The engine then sends them to the Mobile Text Neural Network, which generates a textual guide for the tour. Given the current information read by the neural network and the new text, the engine calls the Mobile Transition Engine to move from one to the other, generating the text that will be displayed to the user. This text is also used for speech generation, narrating to the user what they see.
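The first step, the lookup in the vector storage, can be sketched as a nearest-neighbour search over embeddings. The sketch below assumes cosine similarity and a brute-force scan, and it assumes the query vector has already been built from user data and geolocation; the names StoredSnippet, cosineSimilarity and topK are illustrative and do not come from our codebase.

```typescript
// Hypothetical in-memory stand-in for the mobile vector storage.
interface StoredSnippet {
  id: string;
  embedding: number[];
  text: string;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns the k snippets most similar to the query, best first.
function topK(query: number[], store: StoredSnippet[], k: number): StoredSnippet[] {
  return store
    .map((snippet) => ({ snippet, score: cosineSimilarity(query, snippet.embedding) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, k)
    .map((entry) => entry.snippet);
}
```

The snippets returned by topK would then be passed to the Mobile Text Neural Network as context. A brute-force scan is a reasonable starting point for a single route's worth of material; an approximate index only becomes relevant for much larger stores.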

Verifiability and factual accuracy of the text play a crucial role. To achieve this, we retrieve all information from the mobile storage pre-provided by the backend during tour generation. If, after processing through the mobile neural network, we find that the text quality falls below the required level—with limited information, references, and factual accuracy—and there is an internet connection, we send additional requests to the backend to obtain supplementary information. This additional information is then sent directly to the engine for smooth transitions in generating the final text.
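The quality gate described above reduces to a small decision. In the sketch below, the QualityReport fields and the minimum counts are assumptions made for illustration; how facts and references are actually counted is outside the scope of this article.

```typescript
// Hypothetical quality report for a generated text.
interface QualityReport {
  factCount: number; // facts traceable to the mobile storage
  referenceCount: number; // references attached to the text
}

type NextStep = "use_local_text" | "request_backend" | "use_fallback_phrases";

// Assumed minimums; real thresholds would be tuned per route.
const MIN_FACTS = 2;
const MIN_REFERENCES = 1;

function decideNextStep(report: QualityReport, isOnline: boolean): NextStep {
  const goodEnough = report.factCount >= MIN_FACTS && report.referenceCount >= MIN_REFERENCES;
  if (goodEnough) return "use_local_text";
  // Below the bar: ask the backend only when a connection exists.
  return isOnline ? "request_backend" : "use_fallback_phrases";
}
```

The offline branch corresponds to falling back on general phrases rather than presenting text that could not be verified.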

Finally, we have the Engine for sound generation from text. The key goal for the entire system is smoothness and expected speed. We achieve this by having most of the complex functionality pre-implemented by the backend and readily available. Additionally, sound and text generation take around 2-3 minutes, aligning with the user’s current geolocation, speed, and direction of movement. In the transition engine, we have a set of different templates for creating intelligent pauses, engaging users’ attention while the text is being generated.

Rules of Tour Generation

The speech of the tour guide in our engine is generated according to the following algorithm:

10% of the information is general, necessary to ensure a smooth entry into the tour and facilitate transitions.
40% of the information is specific to the particular object being discussed.
20% consists of clarifying questions to the user of the application.
30% of the information is additional text used to maintain user attention, with transitions to general information, information related to previous user texts, and the next location and direction of the user.
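As a worked example of this split, the proportions can be turned into a word budget for a stretch of narration. The sketch assumes the target rate of 50 words per minute; the names CONTENT_SHARES and wordBudget are illustrative.

```typescript
// Shares taken from the list above; category names are illustrative.
const CONTENT_SHARES = {
  general: 0.1,
  objectSpecific: 0.4,
  questions: 0.2,
  attention: 0.3,
} as const;

type ContentCategory = keyof typeof CONTENT_SHARES;

// Word budget per category for a stretch of narration of the given length.
function wordBudget(minutes: number, wordsPerMinute: number = 50): Record<ContentCategory, number> {
  const totalWords = minutes * wordsPerMinute;
  return {
    general: Math.round(totalWords * CONTENT_SHARES.general),
    objectSpecific: Math.round(totalWords * CONTENT_SHARES.objectSpecific),
    questions: Math.round(totalWords * CONTENT_SHARES.questions),
    attention: Math.round(totalWords * CONTENT_SHARES.attention),
  };
}

// wordBudget(10) gives { general: 50, objectSpecific: 200, questions: 100, attention: 150 }
```

For ten minutes at 50 words per minute, that is 500 words in total, of which 200 describe the object itself.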

To keep the generated tour within these boundaries, we use alignment and railroad algorithms to guide the mobile device’s neural network in the required direction. In the transition engine, when generating a tour, we use code to keep the information within the necessary time frames.

In normal conditions, the speech of the tour guide should be:

Bright, expressive, and vivid. The tour guide must be able to engage listeners, capture their attention, and evoke positive emotions by using various expressive speech techniques such as metaphors, comparisons, epithets, etc.
Accessible and understandable. The tour guide should speak in simple, understandable language, avoiding the use of complex terms and abbreviations. They should consider the age, interests, and level of preparedness of the listeners.
Informative and substantive. The tour guide must provide listeners with accurate and current information about the tour object. They should be able to structure information for easy assimilation.
The speed of the tour guide’s speech should be optimal for listeners to understand and remember the information, usually around 50 words per minute.
The number of characters in one phrase of the tour guide should not exceed 30-40, allowing them to speak clearly and intelligibly, without stumbling.
Pauses in the tour guide’s speech are necessary for listeners to comprehend information and ask questions. Pauses are usually made after important theses, before transitioning to a new topic, or after a question from the tour guide.
The tour guide’s questions to listeners help them better understand and remember the information. Questions can be both open-ended, encouraging reflection and discussion, and closed-ended, helping them assess knowledge.
Appeals to the user in the tour guide’s speech help create an atmosphere of dialogue and engagement. They can be both formal and informal, depending on the nature of the tour.
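The phrase-length rule from the list above can be enforced mechanically before text reaches speech generation. The following is a minimal sketch that assumes a greedy split on word boundaries with a 40-character limit; a production version would likely prefer to break at punctuation, and the function name is hypothetical.

```typescript
// Greedily packs words into phrases of at most maxChars characters.
// A single word longer than maxChars is kept whole rather than cut.
function splitIntoPhrases(text: string, maxChars: number = 40): string[] {
  const words = text.trim().split(/\s+/).filter((word) => word.length > 0);
  const phrases: string[] = [];
  let current = "";
  for (const word of words) {
    const candidate = current === "" ? word : `${current} ${word}`;
    if (current === "" || candidate.length <= maxChars) {
      current = candidate;
    } else {
      phrases.push(current);
      current = word;
    }
  }
  if (current !== "") phrases.push(current);
  return phrases;
}
```

Each resulting phrase can then be followed by a short pause, which also gives the transition engine natural points at which to switch texts.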

Here are a few specific examples of how the tour guide can use these elements in their speech:

Bright, expressive, and vivid speech: “Here, on this hill, in ancient times, there was an ancient settlement. It was surrounded by mighty oaks that rustled in the wind, as if telling their ancient stories.”
Accessible and understandable speech: “So, we see that this monument is a sculpture of a warrior holding a sword. It is made of bronze and stands about two meters tall.”
Informative and substantive speech: “This church was built in the late 17th century. It is one of the most famous architectural landmarks in the city. The interior of the church is rich and exquisite, featuring frescoes, icons, and other works of art.”

Of course, the specific features of the tour guide’s speech depend on the specific conditions of the tour. For example, if the tour is conducted for children, the tour guide’s speech should be simpler and more understandable. If the tour is for foreign tourists, the tour guide should use slower speech and avoid complex terms.

Temporal Framework

The tour generation speed should be 50 words per minute, equivalent to 50-55 tokens per minute. While we generally maintain this speed online, there are occasional fluctuations of 10-20 tokens caused by delays in the neural network and memory overflow. To address this, we generate ahead of playback, producing around 180-250 tokens per minute in multiple parallel branches. From these, we select the most suitable texts for the user, ensuring the delivered speed stays within the 50-55 token range.
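Generating 180-250 tokens per minute against a delivery rate of 50-55 means producing roughly three to five times more text than is spoken (180/55 ≈ 3.3, 250/50 = 5). The selection step can be sketched as follows; the Candidate shape and the idea of a single numeric score are assumptions for illustration, since the actual suitability criteria are not covered here.

```typescript
// Hypothetical output of one parallel generation branch for the next minute.
interface Candidate {
  text: string;
  tokenCount: number;
  score: number; // assumed suitability score, higher is better
}

// Picks the best-scoring candidate whose length fits the delivery window.
// Returns undefined when no branch fits, so the caller can fall back
// to a cached transition text instead.
function selectCandidate(
  candidates: Candidate[],
  minTokens: number = 50,
  maxTokens: number = 55,
): Candidate | undefined {
  let best: Candidate | undefined;
  for (const candidate of candidates) {
    if (candidate.tokenCount < minTokens || candidate.tokenCount > maxTokens) continue;
    if (best === undefined || candidate.score > best.score) best = candidate;
  }
  return best;
}
```

Filtering by length before comparing scores keeps a long but highly rated branch from pushing the speech rate outside the target range.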

It’s important to note that the mentioned 50% of information is generated from text reservoirs for transitions. These reservoirs, cached in both the mobile vector storage and the general cache, hold texts that are instantly available to cover any gaps. To the user, it appears that the tour guide is thinking something over while still keeping them engaged with interesting content. In our testing, this occurs no less than once an hour and is only related to users deviating from the generated route, which requires additional queries to the backend. If there is no internet connection, we switch to responses indicating insufficient information, or reply with general and additional phrases if statistical maximization suggests that output.

Conclusion

This is an overview of our architecture for smoothly generating tour guides for tourists, maintaining speed, fluidity, responsiveness, fact-checking, and functionality on practically any device, even without internet access. We have described how we developed it on our prototypes, and we are currently testing it on various devices in real-world scenarios. If you would like to get in touch with us and learn more about our work, please contact us via email.