
Enhancing Tour Experiences: Which LLM Is Better for Tour Guide Generation?
The evolution of Large Language Models (LLMs) like Mistral 7B has been remarkable, but their application in specialized domains, such as creating tour guides, reveals significant limitations. General LLMs are trained on a wide array of data, which, while comprehensive, often lacks the depth and specificity needed for producing high-quality, engaging tour guide content. This inadequacy is highlighted when these models are tasked with generating texts that require a unique blend of informational depth, local flavor, and narrative appeal, all crucial for effective tour guiding.
When we subjected common LLMs to real-user evaluations, the feedback revealed a notable disparity. These models managed a comparability rate of merely 30% when measured against authentic, human-crafted tour guide texts. This gap is primarily due to the generic nature of the training data used for these models, which results in outputs that, although factually correct, miss the stylistic and content-specific nuances essential for captivating tour narratives.
Comparing LLMs for Tour Guide Production

- Bard: Known for its conversational abilities, Bard could be useful for interactive tour guide scripts. Its strength lies in generating engaging narratives, which is crucial for storytelling in tour guides.
- Lambda (Google's LaMDA): Developed by Google, LaMDA has been noted for its ability to understand and generate human-like text. This could make it effective for creating informative and engaging tour guides that call for a conversational tone.
- ChatGPT: A variant of OpenAI’s GPT models, ChatGPT’s strength lies in its training on a diverse range of internet text. It is well-suited for generating informative content, though it may require additional training on travel-specific datasets for optimal tour guide creation.
- Cohere: Known for its language understanding and generation capabilities, Cohere could be effective in creating coherent and contextually relevant tour guide scripts.
- Anthropic: Anthropic trains its models with an emphasis on human values and safety, which could bring a unique perspective to tour guide creation, potentially offering content that is considerate of cultural sensitivities.
- Claude: This model is recognized for its ability to generate clear and concise text. In the context of tour guides, Claude could be particularly effective in providing straightforward and easy-to-understand information.
- Jurassic Jumbo: AI21 Labs' large Jurassic model, known for its large-scale language understanding, could be well suited for generating comprehensive and detailed tour guides covering a wide range of topics and locales.
- Falcon: An open-weight model family from the Technology Innovation Institute (TII), Falcon might offer capabilities useful for creating niche or specialized tour guides.
- Mistral: As an advanced LLM, Mistral could be particularly adept at handling complex language generation tasks, making it suitable for creating detailed and engaging tour guides.
- RedPajama: Built on the open RedPajama training data, models from this project might offer features suitable for personalized and interactive tour guide experiences, depending on their specific design and training.

The evaluation involved two approaches: zero-shot (0-shot) performance, where the model generates content without prior examples, and few-shot performance, where the model is given a few examples to guide its output. The results were compared against real tour guides, with a focus on user queries and information from a dataset of 100 tour guides. Here’s a detailed look at each model:
Bard
- 0-shot: 21%
- Few-shot: 32%
- Bard showed moderate improvement with few-shot prompts, suggesting its ability to adapt to context with minimal guidance.

Lambda
- 0-shot: 22%
- Few-shot: 30%
- Lambda improved noticeably with few-shot prompts, moving up from a modest zero-shot baseline.

ChatGPT
- 0-shot: 27%
- Few-shot: 35%
- ChatGPT demonstrated the best zero-shot performance among the models, with a notable increase in few-shot scenarios, showing its versatility and adaptability.

Cohere
- 0-shot: 25%
- Few-shot: 39%
- Cohere displayed the largest few-shot jump, indicating strong responsiveness to examples and context.

Anthropic
- 0-shot: 20%
- Few-shot: 30%
- Anthropic showed a solid ten-point gain with few-shot prompts, suggesting moderate adaptability.

Claude
- 0-shot: 25%
- Few-shot: 30%
- Claude showed a modest improvement with few-shot prompts, indicating balanced performance across both scenarios.

Jurassic Jumbo
- 0-shot: 15%
- Few-shot: 17%
- Jurassic Jumbo's performance was relatively low, with minimal improvement in few-shot scenarios.

Falcon
- 0-shot: 15%
- Few-shot: 18%
- Falcon, like Jurassic Jumbo, showed limited capability in both zero-shot and few-shot scenarios.

Mistral
- 0-shot: 20%
- Few-shot: 22%
- Mistral performed modestly, with a slight improvement under few-shot prompts.

RedPajama
- 0-shot: 11%
- Few-shot: 28%
- RedPajama had the weakest zero-shot score but a large few-shot gain, indicating its potential when provided with contextual examples.
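The zero-shot and few-shot regimes above differ only in how the prompt is constructed. A minimal sketch, with illustrative example texts that are not taken from the study's dataset:

```python
def build_zero_shot_prompt(query: str) -> str:
    """Zero-shot: the model sees only the user's request, no examples."""
    return f"Write a tour guide script for the following request:\n{query}\n"

def build_few_shot_prompt(query: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: a handful of (request, script) pairs precede the request."""
    parts = ["Write a tour guide script for each request.\n"]
    for req, script in examples:
        parts.append(f"Request: {req}\nScript: {script}\n")
    parts.append(f"Request: {query}\nScript:")  # model completes from here
    return "\n".join(parts)

# Illustrative example pair (hypothetical content)
examples = [
    ("A walking tour of Old Quebec",
     "Welcome to Old Quebec, where cobblestone streets wind past 17th-century walls..."),
]
print(build_few_shot_prompt("A food tour of Vancouver", examples))
```

With a few in-domain pairs in the prompt, most of the models above gained several points, which is consistent with the scores listed.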
Compared to these models, the custom Mistral 7B, fine-tuned on a specific dataset, showed a remarkable 58% in zero-shot performance. This indicates the significant impact of specialized training and fine-tuning on a model’s ability to produce contextually accurate and engaging content for tour guides. The evaluation underscores the importance of tailored data and fine-tuning in achieving high-quality, AI-generated tour guide scripts.
Evaluation procedure
The evaluation involved a comprehensive comparison process where each LLM’s generated tour guide scripts were measured against authentic tour guide content. This comparison was based on several key criteria:
- Content Accuracy: Checking the factual correctness of the information provided by the AI in relation to the user’s request.
- Relevance to User Query: Assessing how well the AI’s response matched the specific requirements and context of the user’s request.
- Narrative Quality: Evaluating the engagement level of the script, including storytelling elements, language use, and overall appeal.
- User Feedback: Real users were involved in the evaluation process, providing ratings and comments on the AI-generated content in comparison to real tour guide scripts.
- Contextual Adaptability: For the few-shot approach, the models were given a few examples of tour guide scripts to guide their output. The improvement in performance with these examples was a key measure.
- Dataset Diversity: The evaluation also considered the variety and complexity of the tour guides and user queries in the dataset to ensure a comprehensive assessment.
Each model was scored based on these criteria, leading to the percentage scores in both zero-shot and few-shot scenarios.
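The article does not specify how the six criterion scores were combined into a single percentage; one plausible scheme is a weighted average. The weights below are illustrative assumptions, not the study's actual weighting:

```python
# Hypothetical weights over the six evaluation criteria (must sum to 1.0).
CRITERIA_WEIGHTS = {
    "content_accuracy": 0.25,
    "relevance": 0.20,
    "narrative_quality": 0.20,
    "user_feedback": 0.15,
    "contextual_adaptability": 0.10,
    "dataset_diversity": 0.10,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores, each on a 0-100 scale."""
    assert set(scores) == set(CRITERIA_WEIGHTS), "all six criteria required"
    return sum(CRITERIA_WEIGHTS[k] * scores[k] for k in scores)

# Illustrative per-criterion scores for one model
example = {
    "content_accuracy": 30, "relevance": 28, "narrative_quality": 25,
    "user_feedback": 27, "contextual_adaptability": 26, "dataset_diversity": 24,
}
print(f"{overall_score(example):.1f}%")
```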
Specialized training of our own system for generating tour guides

To address these shortcomings, we embarked on a targeted approach, fine-tuning Mistral 7B with two tailored datasets. The first is the Canadian open dataset, a comprehensive collection encompassing detailed descriptions of various attractions across Canada. The second, our unique tour guide dataset, comprises a rich array of texts and narratives developed by our experienced tour guides in Russia. This blend of local and international data provided a more rounded training experience for Mistral 7B, enabling it to grasp the intricacies of tour guide content better.

The training process involved leveraging AWS’s powerful cloud computing capabilities to handle the computational demands of training a Large Language Model (LLM). The first dataset, an open-source collection containing detailed information about Canadian attractions, was sizable at 7GB. This dataset provided a rich source of factual and descriptive content about various locations, landmarks, and cultural elements across Canada, offering a broad spectrum of information relevant to tour guides.
The second dataset, smaller at 700MB, was a proprietary collection comprising real tour guide scripts and narratives. This private dataset was accumulated from our own resources and was instrumental in providing the model with examples of actual tour guide language, storytelling techniques, and the specific structure typical of tour guide texts.
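Before training, the two corpora have to be flattened into a single instruction-tuning file. The sketch below is illustrative: the record shapes, field names, and output format (JSON Lines) are assumptions, since the article does not describe the actual preprocessing pipeline.

```python
import json

def attraction_to_record(attraction: dict) -> dict:
    """Turn a factual attraction entry into a prompt/completion pair."""
    return {
        "prompt": f"Describe the attraction: {attraction['name']}",
        "completion": attraction["description"],
    }

def script_to_record(script: dict) -> dict:
    """Turn a real guide script into a stylistic training example."""
    return {
        "prompt": f"Write a tour guide script for: {script['topic']}",
        "completion": script["text"],
    }

def build_training_file(attractions, scripts, path):
    """Write both corpora into one JSONL file, one record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for a in attractions:
            f.write(json.dumps(attraction_to_record(a)) + "\n")
        for s in scripts:
            f.write(json.dumps(script_to_record(s)) + "\n")

# Hypothetical sample records from each dataset
build_training_file(
    [{"name": "CN Tower", "description": "A 553 m tower in Toronto..."}],
    [{"topic": "Red Square walking tour", "text": "Our walk begins at the gates..."}],
    "tour_guide_sft.jsonl",
)
```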
By training on AWS, the model could leverage advanced machine learning algorithms and scalable computing resources to efficiently process and learn from these datasets. The larger Canadian dataset contributed to the model’s understanding of factual accuracy and detail, while the smaller, specialized tour guide dataset imparted the necessary stylistic and contextual nuances needed for authentic tour guide script generation.
This dual-dataset approach, combined with AWS’s robust training environment, ensured that the custom LLM developed a comprehensive understanding of both the factual content required for tour guides and the narrative style that makes such guides engaging and informative.
Results

The specialized training of the model on Amazon Web Services using the dual-dataset approach significantly enhanced its performance in generating tour guide scripts. The model’s output comparability rate with professional tour guide texts reached 60%, a testament to the efficacy of this training method. This notable improvement was evident not only in the factual accuracy of the generated content, which is crucial for reliable and informative tour guides, but also in the style and format of the presentation. The AI-generated scripts began to mirror the engaging, narrative-driven style typical of traditional tour guides.
Quantitatively, the fine-tuned Mistral 7B model scored 62% on relevance to user queries, indicating that the information provided by the AI was roughly twice as likely as the baseline models' output to be directly relevant to the specific interests and inquiries of users seeking tour guide information. It also scored 51% on narrative quality, suggesting that the AI-generated content became significantly more engaging and enjoyable to read, approaching the storytelling quality of human tour guides.

The plot above visualizes the ROUGE-L scores for various language models tested on a tour guide dataset. The ROUGE-L metric, which measures the longest common subsequence between the generated summary and a reference summary, is expressed as a percentage, indicating the level of overlap.
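ROUGE-L can be computed directly from the longest common subsequence (LCS) of the two token sequences. A minimal implementation over whitespace tokens, reporting the common F-measure variant (the article does not state its exact tokenization):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a generated text and a reference, as a percentage."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 100 * 2 * prec * rec / (prec + rec)

print(round(rouge_l("the old town square is beautiful",
                    "the old square is very beautiful"), 1))  # → 83.3
```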
Here’s a description of the results:
- ChatGPT: Scored 32%, which suggests a moderate level of effectiveness in summarizing the tour guide dataset.
- Few-shot ChatGPT: Improved to 41%, indicating that a few-shot learning approach enhances ChatGPT's summarization capabilities on this specific dataset.
- PEFT Mistral (our model): Significantly outperformed the others with a score of 61%. This high score implies that PEFT Mistral is particularly well suited for summarizing content in the tour guide domain, a result consistent with its domain-specific fine-tuning.
- Few-shot LLaMA: Scored 40%, a performance level similar to few-shot ChatGPT, which is quite effective but still behind PEFT Mistral.
- Few-shot Cohere: At 22%, this model had the lowest score, suggesting that it may struggle more with this type of content or that its few-shot learning approach was less effective in this instance.
- Few-shot Falcon: Scored 35%, placing it in the mid-range of effectiveness among the tested models.
The method of testing involved using 100 examples from our tour guide dataset and comparing the model-generated summaries with human-generated summarizations. The results indicate varying levels of proficiency across different models in capturing the essential elements of the tour guide content, with PEFT Mistral showing a notably higher aptitude in this regard.
Perhaps most impressively, the user satisfaction rate with the AI-generated tour guide content saw a dramatic increase. While standard LLM outputs had a user satisfaction rate of around 40%, this figure soared to 71% with the content produced by Mistral 7B. This high level of user satisfaction indicates that the AI was not only providing accurate and relevant information but doing so in a manner that resonated well with the users, offering them an experience akin to interacting with a knowledgeable and engaging human tour guide.
Overall, the fine-tuning of Mistral 7B using specific and rich datasets on a robust platform like AWS resulted in a marked improvement in the AI’s ability to generate high-quality, engaging, and user-centric tour guide content.

The custom Mistral 7B, fine-tuned on a specific dataset, was notably superior, indicating the effectiveness of specialized training and adaptation to the tour guide content domain. The plot is a chart of our own model's scores in generating tour guide scripts across the six evaluation criteria. The reported scores are:
- Content Accuracy: 95%
- Relevance to User Query: 82%
- User Feedback: 61%
Per-criterion figures for Narrative Quality, Contextual Adaptability, and Dataset Diversity fell in the 60-100% range.
Future Directions and the Role of Advanced AI Integration
However, our quest for improvement does not stop here. We are currently experimenting with integrating advanced features into Mistral 7B, such as specialized guardrails and Retrieval-Augmented Generation (RAG) techniques. These, coupled with our vector store and Reinforcement Learning from AI Feedback (RLAIF), are expected to further refine the model's responses, aiming for a 90% comparability rate with human tour guides.
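The retrieval step of a RAG pipeline can be sketched with a simple bag-of-words "vector store" ranked by cosine similarity. A production system would use learned embeddings and a dedicated vector database; the snippets below are illustrative placeholders:

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    """Naive bag-of-words vector: token -> count."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, store: list[str], k: int = 1) -> list[str]:
    """Return the k snippets most similar to the query; these would be
    injected into the model's prompt as grounding context."""
    qv = vectorize(query)
    return sorted(store, key=lambda doc: cosine(qv, vectorize(doc)), reverse=True)[:k]

store = [
    "Niagara Falls thunders on the border between Ontario and New York.",
    "The Hermitage in St Petersburg houses over three million artworks.",
    "Banff National Park offers turquoise lakes and mountain hikes.",
]
print(retrieve("waterfall tour near Ontario", store)[0])  # prints the Niagara Falls snippet
```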
Despite these advancements, it's important to acknowledge that AI, in its current state, cannot entirely replace the nuanced expertise of human tour guides. Systems like Mistral 7B are well suited to basic, often generic tour guide services, especially in scenarios that demand quick and adaptable guidance; for more complex and interactive travel experiences, human guides still hold the edge.
In conclusion, the integration of Mistral 7B with specialized datasets signifies a major stride forward in AI-driven tourism. With continued refinements and strategic enhancements, AI tour guides are poised to offer experiences that rival those of human guides in text quality and interactivity. This synergy of advanced AI and human expertise heralds a new era in travel, where technology enhances and complements the rich tapestry of human-guided tours.