SimTube: Generating Simulated Video Comments through Multimodal AI and User Personas

Yu-Kai Hung¹, Yun-Chien Huang¹, Ting-Yu Su¹, Yen-Ting Lin¹, Lung-Pan Cheng², Bryan Wang²*, Shao-Hua Sun¹*

¹National Taiwan University, ²Adobe Research, *Equal advisory contribution

SimTube is a feedback tool that uses a vision-language model (VLM) to digest a video and a large language model (LLM) to generate diverse and helpful comments for video creators before publication. SimTube provides a solid foundation for social computing on video-sharing platforms.

Abstract

Audience feedback is crucial for refining video content, yet it typically comes after publication, limiting creators' ability to make timely adjustments. To bridge this gap, we introduce SimTube, a generative AI system designed to simulate audience feedback in the form of video comments before a video's release. SimTube features a computational pipeline that integrates multimodal data from the video—such as visuals, audio, and metadata—with user personas derived from a broad and diverse corpus of audience demographics, generating varied and contextually relevant feedback. Furthermore, the system's UI allows creators to explore and customize the simulated comments. Through a comprehensive evaluation—comprising quantitative analysis, crowd-sourced assessments, and qualitative user studies—we show that SimTube's generated comments are not only relevant, believable, and diverse but often more detailed and informative than actual audience comments, highlighting its potential to help creators refine their content before release.

System Pipeline

Fig. 1: The system pipeline of SimTube

The pipeline follows the way an audience experiences a video: clicking on a video that interests them, watching it, absorbing the content, and leaving comments. The backend pipeline consists of three primary components, as illustrated in Fig. 1: (a) Video Understanding, which captures the semantics of video content through multimodal summarization; (b) Persona Query, which retrieves relevant user personas for providing feedback on the video; and (c) Comment Generation, which combines video understanding and user persona information to generate and present comments in the UI, allowing user interaction.
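The three stages can be sketched in a few lines of Python. This is only an illustrative outline, not the SimTube implementation: the names `Video`, `Persona`, `understand_video`, `query_personas`, `generate_comments`, and `run_pipeline` are hypothetical, the VLM and speech-to-text outputs are assumed to be precomputed text, persona retrieval is replaced by a simple word-overlap ranking, and the LLM is passed in as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Video:
    title: str
    description: str
    frame_captions: List[str]  # assumed: captions a VLM produced for sampled frames
    transcript: str            # assumed: speech-to-text output for the audio track


@dataclass
class Persona:
    name: str
    profile: str


def understand_video(video: Video) -> str:
    """(a) Video Understanding: merge all modalities into one text summary."""
    visuals = " ".join(video.frame_captions)
    return (
        f"Title: {video.title}\n"
        f"Description: {video.description}\n"
        f"Visuals: {visuals}\n"
        f"Audio: {video.transcript}"
    )


def _tokens(text: str) -> Set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    words = (w.strip(".,!?:;").lower() for w in text.split())
    return {w for w in words if w}


def query_personas(summary: str, pool: List[Persona], k: int) -> List[Persona]:
    """(b) Persona Query: rank personas by word overlap with the summary.

    Word overlap is a placeholder for whatever retrieval method is really used.
    """
    summary_words = _tokens(summary)
    ranked = sorted(
        pool,
        key=lambda p: len(_tokens(p.profile) & summary_words),
        reverse=True,
    )
    return ranked[:k]


def generate_comments(
    summary: str, personas: List[Persona], llm: Callable[[str], str]
) -> List[dict]:
    """(c) Comment Generation: one prompt per persona, combined with the summary."""
    comments = []
    for persona in personas:
        prompt = (
            f"You are {persona.name}: {persona.profile}\n"
            f"Video:\n{summary}\n"
            "Write a comment on this video."
        )
        comments.append({"persona": persona.name, "text": llm(prompt)})
    return comments


def run_pipeline(
    video: Video, pool: List[Persona], llm: Callable[[str], str], k: int = 2
) -> List[dict]:
    summary = understand_video(video)
    personas = query_personas(summary, pool, k)
    return generate_comments(summary, personas, llm)
```

Passing the LLM as a callable keeps the sketch runnable with a stub function and leaves the choice of model open.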

Types of Comments and Interaction Design

Fig. 2: Types of comments and interaction design

Our system supports four types of simulated comments using data from earlier stages. The first two types are automatically generated by the system, while the last two are user-initiated and customizable, offering quick and interactive feedback.

Primary Comments: The initial comments that appear directly under the video.
Thread Comments: The generated responses to existing comments, structured hierarchically as threaded replies beneath primary comments.
Custom Persona Comments: The comments generated based on a user-defined persona, tailored to user specifications.
Response Comments: The comments generated for responding to the user's reply, contributing to a discussion thread under the replied comment.

Users can interact with SimTube through Thread Expansion and Persona Crafting to create Response Comments and Custom Persona Comments respectively.

Thread Expansion

Users can extend existing discussions by replying to any simulated comment. The system then deepens the dialogue by generating a follow-up reply in the voice of the original persona.

Persona Crafting

Users can receive feedback from a specific audience's perspective by defining personas. A new comment is then generated according to the user-defined persona and the video content.
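One way to picture the four comment types and the two interactions is as a small tree of comments. The sketch below is a hypothetical data model, not the one used in SimTube: `CommentType`, `Comment`, `expand_thread`, and `craft_persona_comment` are illustrative names, the prompts are placeholders, and the LLM is again assumed to be any callable that maps a prompt to text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional

LLM = Callable[[str], str]


class CommentType(Enum):
    PRIMARY = "primary"                # generated directly under the video
    THREAD = "thread"                  # generated reply to an existing comment
    CUSTOM_PERSONA = "custom_persona"  # generated from a user-defined persona
    RESPONSE = "response"              # generated answer to the user's reply


@dataclass
class Comment:
    author: str                   # persona name, or "user" for the creator
    text: str
    kind: Optional[CommentType]   # None marks a reply typed by the user
    replies: List["Comment"] = field(default_factory=list)


def expand_thread(target: Comment, user_reply: str, llm: LLM) -> Comment:
    """Thread Expansion: record the user's reply, then add a Response Comment
    written from the persona of the comment being replied to."""
    target.replies.append(Comment("user", user_reply, None))
    prompt = (
        f"You are {target.author} and wrote: '{target.text}'. "
        f"Reply to this message: {user_reply}"
    )
    response = Comment(target.author, llm(prompt), CommentType.RESPONSE)
    target.replies.append(response)
    return response


def craft_persona_comment(persona: str, video_summary: str, llm: LLM) -> Comment:
    """Persona Crafting: a new top-level comment from a user-defined persona."""
    prompt = f"You are {persona}. Comment on this video: {video_summary}"
    return Comment(persona, llm(prompt), CommentType.CUSTOM_PERSONA)
```

Storing replies on the parent comment mirrors the hierarchical threads described above, so a Response Comment always lands under the comment the user replied to.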

User Scenario

SimTube is designed to give video creators feedback on rough-cut versions of their videos.

Many participants (P4, P6, P8) emphasized creating rough-cut versions or teasers during the editing phase, enabling collaboration and feedback from sponsors or team members. As P4 explained, "A rough-cut lets my team and sponsors give feedback before we proceed further".

This suggests that SimTube could enhance this process by providing automatic, diverse feedback on uploaded rough cuts. P6 added, "I can seamlessly integrate SimTube into my workflow and collect more feedback with minimal effort by uploading the rough-cut version or any segments whenever I complete one."

(Case #1) Inspiration for New Video Topic

SimTube can inspire new video topics. Participants (P1, P3, P6) highlighted that AI-generated comments led them to explore new ideas. For instance, P1, a travel vlogger, received recommendations for famous tourist spots like Ikseon-dong Hanok Village after uploading a Korean vlog, even though these places were not featured in the video. "SimTube correctly listed all my itineraries based on my narration, which helped me plan new vlogs," P1 noted, demonstrating the system's ability to generate contextually relevant insights.

(Case #2) Revising the Current Video Edit

SimTube could also influence ongoing video production. P6 uploaded a half-finished street interview video on student lifestyles, and the persona-based comments generated by SimTube extended the discussion between the host and interviewee, introducing new topics such as time management for university students. "It prompted me to explore this theme further and enriched my video," P6 shared, illustrating how SimTube's integration can guide and enhance content creation throughout different stages of the workflow.

Evaluation

Fig. 3: The page evaluation of SimTube

According to the crowd-sourced evaluation and several automatic metrics, the comments generated by our system display superior word-level diversity, while Real Comments show better semantic diversity. Although Real Comments cover several distinct topics, comments within a cluster can be highly similar to one another, as reflected by the Self-BLEU score (a higher score indicates more word overlap among comments). Concerning relevance to video content, generated comments outperform Real Comments in word-level, semantic-level, and LLM evaluations. In comparison to Real Comments, generated comments tend to be more on-topic, authentic, and differentiated, offering a potent source of inspiration. Despite the limited semantic diversity of generated comments, their scalability, rapid production, and pre-publication availability make them an advantageous preliminary source of inspiration and feedback that complements Real Comments.
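For readers unfamiliar with Self-BLEU, the idea is to score each comment with BLEU against all other comments in the same set and average the results, so a lower value means more word-level diversity. The sketch below is a simplified, dependency-free version written for illustration only; it is not the evaluation code behind the reported results. The function names, whitespace tokenization, lack of smoothing, and the default of `max_n=2` (chosen here because comments are short) are all assumptions.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis: List[str], references: List[List[str]], max_n: int = 2) -> float:
    """Unsmoothed BLEU: geometric mean of clipped n-gram precisions times a
    brevity penalty. Returns 0.0 if any n-gram order has no match."""
    if not hypothesis or not references:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the hypothesis.
    hyp_len = len(hypothesis)
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - hyp_len), length))
    penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return penalty * math.exp(sum(log_precisions) / max_n)


def self_bleu(comments: List[str], max_n: int = 2) -> float:
    """Average BLEU of each comment against all the others (lower = more diverse)."""
    tokenized = [c.lower().split() for c in comments]
    if len(tokenized) < 2:
        return 0.0
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)
```

With this sketch, a set of identical comments scores 1.0 and a set with no shared words scores 0.0; a real evaluation would more likely rely on an established BLEU implementation with smoothing.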

Future Work

Expanding SimTube's Pipeline

While SimTube can generate comments for general video content, it currently does not consider inherent variations among videos, such as genre, style, or cultural context. These areas present opportunities to expand SimTube's computational pipeline to accommodate additional contextual information and enable more customized comment generation. Future improvements could also include handling longer video inputs and enhancing the overall quality of language generation to provide more nuanced and useful feedback.

Integration into Video Production Workflows

While we have assessed the quality of our generated comments, the system has yet to be deployed in real-world settings. Future research should explore integrating SimTube into video editing tools or production environments to evaluate its overall impact. Qualitative studies could further investigate how the system complements professional workflows, providing deeper insights into its practical utility. Relatedly, our system currently generates comments based on only a single video version, whereas creators often produce multiple iterations to determine the best result. By analyzing a series of video edits, the system could generate comparative feedback highlighting differences between the current and previous versions, enabling users to refine their work more effectively by leveraging the strengths of each iteration. Handling multiple versions, however, introduces challenges related to system scalability and processing efficiency that warrant future exploration.

Implications of AI-Generated Comments

We recognize the implications of using AI for human-like comment generation, including bias, misuse, and harmful output.

Biases could be inherent in system components such as image captioning models and LLMs. While these models have been aligned to reduce negative impacts, entirely eliminating bias remains challenging.

Overreliance on SimTube's feedback could lead creators to conform too closely to simulated audience preferences, potentially stifling creativity and diversity in content creation.

Harmful AI-generated content could negatively affect users. We are committed to managing this issue responsibly to ensure that tools like SimTube support and enhance creators' work while minimizing risks.

BibTeX

@inproceedings{hung2025simtube,
  title={SimTube: Generating Simulated Video Comments through Multimodal AI and User Personas},
  author={Hung, Yu-Kai and Huang, Yun-Chien and Su, Ting-Yu and Lin, Yen-Ting and Cheng, Lung-Pan and Wang, Bryan and Sun, Shao-Hua},
  booktitle={International Conference on Intelligent User Interfaces},
  year={2025}
}