Enhancing Tour Experiences: Which LLM is better for Tour Guide Generation
Kirill Sergeev

The evolution of Large Language Models (LLMs) like Mistral 7B has been remarkable, but their application in specialized domains, such as creating tour guides, reveals significant limitations. General LLMs are trained on a wide array of data, which, while comprehensive, often lacks the depth and specificity needed for producing high-quality, engaging tour guide content. This inadequacy is highlighted when these models are tasked with generating texts that require a unique blend of informational depth, local flavor, and narrative appeal, all crucial for effective tour guiding.

When we subjected common LLMs to real-user evaluations, the feedback revealed a notable disparity. These models managed a comparability rate of merely 30% when measured against authentic, human-crafted tour guide texts. This gap is primarily due to the generic nature of the training data used for these models, which results in outputs that, although factually correct, miss the stylistic and content-specific nuances essential for captivating tour narratives.

Comparing LLMs for Tour Guide Production

[Figure: User satisfaction rate across LLMs, 2023]

The evaluation involved two approaches: zero-shot (0-shot) performance, where the model generates content without prior examples, and few-shot performance, where the model is given a few examples to guide its output. The results were compared against real tour guide texts, using user queries and information from a dataset of 100 tour guides.
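The difference between the two regimes comes down to how the prompt is assembled. The sketch below is an illustration of this, not the exact prompts used in the evaluation; the `build_prompt` helper and all wording in it are hypothetical.

```python
def build_prompt(query, examples=None):
    """Assemble a tour-guide generation prompt.

    examples=None gives the zero-shot case; passing a few
    (request, guide_text) pairs turns it into a few-shot prompt.
    """
    parts = ["You are a professional tour guide. Write an engaging, accurate guide."]
    for ex_request, ex_guide in (examples or []):
        parts.append("Request: %s\nGuide: %s" % (ex_request, ex_guide))
    parts.append("Request: %s\nGuide:" % query)
    return "\n\n".join(parts)

# Zero-shot: the model sees only the instruction and the request.
zero_shot = build_prompt("A walking tour of Old Montreal")

# Few-shot: one worked example precedes the real request.
few_shot = build_prompt(
    "A walking tour of Old Montreal",
    examples=[("A day in Quebec City",
               "Begin at the Chateau Frontenac, then follow the city walls...")],
)
```

The few-shot variant simply prepends completed request/guide pairs, which is what lets the model imitate the target style without any weight updates.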

Compared to these models, the custom Mistral 7B, fine-tuned on a specific dataset, achieved a notable 58% comparability rate in the zero-shot setting. This indicates the significant impact of specialized training and fine-tuning on a model’s ability to produce contextually accurate and engaging content for tour guides. The evaluation underscores the importance of tailored data and fine-tuning in achieving high-quality, AI-generated tour guide scripts.

Evaluation procedure

The evaluation involved a comprehensive comparison in which each LLM’s generated tour guide scripts were measured against authentic tour guide content across several key criteria.

Each model was scored on these criteria, yielding the percentage scores reported for both zero-shot and few-shot scenarios.
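Turning per-criterion judge scores into a single percentage can be as simple as an average. This is an illustrative sketch only: the criterion names below are assumptions, not the actual rubric, and the real evaluation may have weighted criteria differently.

```python
def comparability_score(criterion_scores):
    """Average 0-1 criterion scores into a single percentage.

    criterion_scores: dict mapping criterion name -> score in [0, 1].
    Criterion names here are hypothetical placeholders.
    """
    return 100 * sum(criterion_scores.values()) / len(criterion_scores)

scores = {
    "factual_accuracy": 0.7,
    "narrative_appeal": 0.5,
    "local_flavor": 0.4,
    "structure": 0.6,
}
overall = comparability_score(scores)  # rounds to 55 for these scores
```

An unweighted mean keeps the metric easy to interpret; a weighted mean would let factual accuracy dominate if reliability mattered most.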

Specialized training of our own system for generating tour guides

To address these shortcomings, we embarked on a targeted approach, fine-tuning Mistral 7B with two tailored datasets. The first is the Canadian open dataset, a comprehensive collection encompassing detailed descriptions of various attractions across Canada. The second, our unique tour guide dataset, comprises a rich array of texts and narratives developed by our experienced tour guides in Russia. This blend of local and international data provided a more rounded training experience for Mistral 7B, enabling it to grasp the intricacies of tour guide content better.

[Figure: Comparison of user satisfaction rates across various LLMs]

The training process involved leveraging AWS’s powerful cloud computing capabilities to handle the computational demands of training a Large Language Model (LLM). The first dataset, an open-source collection containing detailed information about Canadian attractions, was sizable at 7GB. This dataset provided a rich source of factual and descriptive content about various locations, landmarks, and cultural elements across Canada, offering a broad spectrum of information relevant to tour guides.

The second dataset, smaller at 700MB, was a proprietary collection comprising real tour guide scripts and narratives. This private dataset was accumulated from our own resources and was instrumental in providing the model with examples of actual tour guide language, storytelling techniques, and the specific structure typical of tour guide texts.

By training on AWS, the model could leverage advanced machine learning algorithms and scalable computing resources to efficiently process and learn from these datasets. The larger Canadian dataset contributed to the model’s understanding of factual accuracy and detail, while the smaller, specialized tour guide dataset imparted the necessary stylistic and contextual nuances needed for authentic tour guide script generation.

This dual-dataset approach, combined with AWS’s robust training environment, ensured that the custom LLM developed a comprehensive understanding of both the factual content required for tour guides and the narrative style that makes such guides engaging and informative.
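One plausible way to combine a 7GB public corpus with a 700MB proprietary one during fine-tuning is weighted sampling, so the small stylistic dataset is not drowned out by the large factual one. The sketch below is an assumption about the data pipeline, not the actual AWS training code; the 10% private-data weight is illustrative.

```python
import random

def sample_batch(public_docs, private_docs, batch_size=4,
                 private_weight=0.1, seed=0):
    """Draw a training batch, taking each document from the small
    proprietary corpus with probability private_weight and from the
    large public corpus otherwise."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        source = private_docs if rng.random() < private_weight else public_docs
        batch.append(rng.choice(source))
    return batch

# Toy stand-ins for the two corpora.
batch = sample_batch(
    ["CN Tower facts", "Banff trails", "Old Quebec history"],  # public
    ["Guide script: Red Square walk"],                         # private
    batch_size=8,
)
```

Oversampling the private corpus above its natural ~9% share is a common trick when the smaller dataset carries the target style rather than the target facts.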

Results

[Figure: Performance comparison, generic LLMs vs custom Mistral 7B]

The specialized training of the model on Amazon Web Services using the dual-dataset approach significantly enhanced its performance in generating tour guide scripts. The model’s output comparability rate with professional tour guide texts reached 60%, a testament to the efficacy of this training method. This notable improvement was evident not only in the factual accuracy of the generated content, which is crucial for reliable and informative tour guides, but also in the style and format of the presentation. The AI-generated scripts began to mirror the engaging, narrative-driven style typical of traditional tour guides.

Quantitatively, the fine-tuned Mistral 7B model scored 62% on relevance, meaning the information it provided was roughly twice as likely to be directly relevant to the specific interests and inquiries of users seeking tour guide information. It also scored 51% on narrative appeal, indicating that the AI-generated content became significantly more engaging and enjoyable to read, approaching the storytelling quality of human tour guides.

[Figure: ROUGE-L scores on tour guide dataset]

The plot above visualizes the ROUGE-L scores for various language models tested on a tour guide dataset. The ROUGE-L metric, which measures the longest common subsequence between the generated summary and a reference summary, is expressed as a percentage, indicating the level of overlap.
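The metric itself is straightforward to compute. The sketch below implements the recall-oriented form of ROUGE-L (LCS length over reference length) on whitespace tokens; production evaluations typically use a library implementation and the F-measure variant, so treat this as a minimal illustration.

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest-common-subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_recall(generated, reference):
    """ROUGE-L recall as a percentage: LCS length / reference length."""
    gen, ref = generated.split(), reference.split()
    return 100 * lcs_length(gen, ref) / len(ref)

score = rouge_l_recall(
    "the castle was built in 1608 by settlers",
    "the castle was built in 1608",
)
```

Because ROUGE-L rewards subsequences rather than exact n-grams, it credits generated text that preserves the reference's ordering even when extra words are interleaved.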

The testing method used 100 examples from our tour guide dataset, comparing model-generated summaries against human-written ones. The results indicate varying levels of proficiency across the models in capturing the essential elements of the tour guide content, with PEFT Mistral showing a notably higher aptitude in this regard.

Perhaps most impressively, the user satisfaction rate with the AI-generated tour guide content saw a dramatic increase. While standard LLM outputs had a user satisfaction rate of around 40%, this figure soared to 71% with the content produced by Mistral 7B. This high level of user satisfaction indicates that the AI was not only providing accurate and relevant information but doing so in a manner that resonated well with the users, offering them an experience akin to interacting with a knowledgeable and engaging human tour guide.

Overall, the fine-tuning of Mistral 7B using specific and rich datasets on a robust platform like AWS resulted in a marked improvement in the AI’s ability to generate high-quality, engaging, and user-centric tour guide content.

[Figure: Updated LLM evaluation on tour guide script generation]

The plot above charts our own model’s performance in generating tour guide scripts across six evaluation criteria. The custom Mistral 7B, fine-tuned on a specific dataset, was notably superior, underscoring the effectiveness of specialized training and adaptation to the tour guide content domain.

Future Directions and the Role of Advanced AI Integration

However, our quest for perfection does not stop here. We are currently experimenting with integrating advanced features into Mistral 7B, such as specialized guardrails and Retrieval-Augmented Generation (RAG) techniques. These, coupled with our vector store and Reinforcement Learning from AI Feedback (RLAIF) systems, are expected to further refine the model’s responses, aiming for a 90% comparability rate with human tour guides.
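As a rough illustration of the retrieval step in such a RAG setup, the sketch below ranks vector-store entries by cosine similarity to a query embedding before handing the top hits to the generator. The store contents and hand-made toy vectors are purely hypothetical; a real system would use a learned embedding model and an indexed vector database.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve(query_vec, store, k=2):
    """Return the texts of the k store entries closest to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
store = [
    {"text": "Chateau Frontenac history", "vec": [0.9, 0.1, 0.0]},
    {"text": "Niagara Falls boat tours",  "vec": [0.0, 0.9, 0.3]},
    {"text": "Old Quebec city walls",     "vec": [0.8, 0.2, 0.1]},
]
hits = retrieve([1.0, 0.1, 0.0], store, k=2)
```

Grounding generation on retrieved passages like these is what lets a RAG pipeline keep attraction details current without retraining the model.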

Despite these advancements, it’s important to acknowledge that AI, in its current state, cannot entirely replace the nuanced expertise of human tour guides. AI systems like Mistral 7B are excellent for replacing basic, often generic tour guide services, especially in scenarios that demand quick and adaptable guidance. For more complex and interactive travel experiences, human guides still hold the edge.

In conclusion, the integration of Mistral 7B with specialized datasets signifies a major stride forward in AI-driven tourism. With continued refinements and strategic enhancements, AI tour guides are poised to offer experiences that rival those of human guides in text quality and interactivity. This synergy of advanced AI and human expertise heralds a new era in travel, where technology enhances and complements the rich tapestry of human-guided tours.