TnT-LLM for User Intent and Conversational Domain Labeling in Bing Copilot

21 Apr 2025

5 EXPERIMENTS

We showcase the utility of TnT-LLM for two text mining tasks of special interest in today’s LLM era: user intent detection and conversational domain labeling over human-AI chat transcripts.

5.1 Data

Our conversation transcripts are taken from Microsoft’s Bing Consumer Copilot system, a multilingual, open-domain generative search engine that assists users through a chat experience. We randomly sample conversations over 10 weeks, from 8/6/2023 to 10/14/2023. For Phase 1, we draw 1k conversations per week and randomly split them 60%-20%-20% into sets for “learning” the label taxonomy, validation, and testing, respectively. For Phase 2, we sample another 5k conversations per week from the same time range and apply the same train/validation/test split.
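The 60%-20%-20% random split described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_conversations`, the fixed seed, and the use of integer IDs as stand-ins for conversations are all assumptions made for the example.

```python
import random

def split_conversations(conversation_ids, seed=0, fractions=(0.6, 0.2, 0.2)):
    """Shuffle once, then cut into learn/validation/test partitions.

    Hypothetical helper: the paper states only that the split is random
    and 60%-20%-20%; the seed and rounding rule here are assumptions.
    """
    ids = list(conversation_ids)
    random.Random(seed).shuffle(ids)
    n_learn = round(len(ids) * fractions[0])
    n_val = round(len(ids) * fractions[1])
    return {
        "learn": ids[:n_learn],
        "validation": ids[n_learn:n_learn + n_val],
        "test": ids[n_learn + n_val:],  # remainder goes to the test set
    }

# Phase 1 before filtering: 10 weeks x 1k conversations per week.
phase1 = split_conversations(range(10 * 1000))
sizes = {name: len(part) for name, part in phase1.items()}
print(sizes)
```

Shuffling once and slicing keeps the three partitions disjoint by construction, which is the property that matters when the test set is later used to evaluate a taxonomy learned on the first partition.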

We apply two filtering steps to ensure the quality and privacy of the data. First, we apply an in-house privacy filter that scrubs all personal information (e.g., addresses, phone numbers) from the original conversation content. Second, we apply a content filter that removes all conversations containing harmful or inappropriate content that should not be exposed to annotators or downstream analyses. After applying these filters, we obtain 9,592 conversations for Phase 1 and 48,160 conversations for Phase 2. We use the FastText language detector [11, 12] to identify the primary language of each conversation, and find that around half of the conversations in our corpus are in English.
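The shape of this two-step pipeline (scrub, then drop, then tally languages) can be sketched as below. Everything specific in the sketch is a stand-in: the in-house privacy and content filters are not described in detail, so the regular expressions, the blocklist check, and the toy `detect_language` callable are hypothetical placeholders. In practice, a real detector such as a FastText language identification model would be passed in as `detect_language`.

```python
import re
from collections import Counter

# Illustrative patterns only; the actual privacy filter is in-house.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\-\s().]{7,}\d")

def scrub_personal_info(text):
    """Step 1 (stand-in): mask e-mail addresses and phone-like numbers."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def is_safe(text, blocklist):
    """Step 2 (stand-in): reject text containing any blocklisted term."""
    lowered = text.lower()
    return not any(term in lowered for term in blocklist)

def filter_corpus(conversations, blocklist, detect_language):
    """Scrub every conversation, drop unsafe ones, and count languages."""
    kept = []
    for text in conversations:
        scrubbed = scrub_personal_info(text)
        if is_safe(scrubbed, blocklist):
            kept.append(scrubbed)
    languages = Counter(detect_language(text) for text in kept)
    return kept, languages

# Toy inputs; "badword" and the ASCII-based detector are placeholders.
conversations = [
    "Call me at 425-555-0100 about the recipe",
    "this one contains badword",
    "¿Cuál es la capital de Francia?",
]
kept, languages = filter_corpus(
    conversations,
    blocklist={"badword"},
    detect_language=lambda text: "en" if text.isascii() else "other",
)
print(kept)
print(languages)
```

Passing the language detector in as a callable keeps the sketch runnable without any model file and makes the filtering logic independent of the particular detector used.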

In the remainder of this section, we will report results on the following datasets:

• BingChat-Phase1-L-Multi: The test set used in the taxonomy generation phase, which includes around 2k conversations.

• BingChat-Phase2-L-Multi: The test set used in the label assignment phase, which includes around 10k conversations.
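As a quick consistency check on these sizes, the arithmetic below relates the sampled counts, the post-filter counts, and the approximate test-set sizes. It assumes the 20% test fraction applies to the filtered counts; the text does not say whether the split happens before or after filtering, so the test sizes are approximate either way.

```python
# Figures quoted in the text.
sampled = {"Phase 1": 10 * 1_000, "Phase 2": 10 * 5_000}
retained = {"Phase 1": 9_592, "Phase 2": 48_160}
TEST_FRACTION = 0.2  # assumption: applied to the filtered counts

summary = {}
for phase, n_sampled in sampled.items():
    n_retained = retained[phase]
    summary[phase] = {
        "retention_rate": n_retained / n_sampled,
        "approx_test_size": round(TEST_FRACTION * n_retained),
    }

for phase, stats in summary.items():
    print(phase, f"{stats['retention_rate']:.2%}", stats["approx_test_size"])
# Phase 1 95.92% 1918
# Phase 2 96.32% 9632
```

Under this assumption, the filters remove roughly 4% of the sampled conversations in each phase, and the resulting test sets of about 1.9k and 9.6k conversations match the "around 2k" and "around 10k" figures above.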

Besides the above datasets, we also reserve two separate English-only conversation datasets for human evaluation, with the same privacy and content filters applied.

• BingChat-Phase1-S-Eng includes 200 English conversations to evaluate label taxonomy.

• BingChat-Phase2-S-Eng includes 400 English conversations to evaluate label assignment.

This paper is available on arXiv under the CC BY-NC-ND 4.0 Deed license.

Authors:

(1) Mengting Wan, Microsoft Corporation;

(2) Tara Safavi (Corresponding authors), Microsoft Corporation;

(3) Sujay Kumar Jauhar, Microsoft Corporation;

(4) Yujin Kim, Microsoft Corporation;

(5) Scott Counts, Microsoft Corporation;

(6) Jennifer Neville, Microsoft Corporation;

(7) Siddharth Suri, Microsoft Corporation;

(8) Chirag Shah, University of Washington (work done while at Microsoft);

(9) Ryen W. White, Microsoft Corporation;

(10) Longqi Yang, Microsoft Corporation;

(11) Reid Andersen, Microsoft Corporation;

(12) Georg Buscher, Microsoft Corporation;

(13) Dhruv Joshi, Microsoft Corporation;

(14) Nagu Rangan, Microsoft Corporation.
