TnT-LLM: Democratizing Text Mining with Automated Taxonomy and Scalable Classification

Abstract and 1 Introduction

2 Related Work

3 Method and 3.1 Phase 1: Taxonomy Generation

3.2 Phase 2: LLM-Augmented Text Classification

4 Evaluation Suite and 4.1 Phase 1 Evaluation Strategies

4.2 Phase 2 Evaluation Strategies

5 Experiments and 5.1 Data

5.2 Taxonomy Generation

5.3 LLM-Augmented Text Classification

5.4 Summary of Findings and Suggestions

6 Discussion and Future Work, and References

A. Taxonomies

B. Additional Results

C. Implementation Details

D. Prompt Templates

6 DISCUSSION AND FUTURE WORK

This work has the potential to create significant impact for the research and application of AI technologies in text mining. Our framework has demonstrated the ability to use LLMs as taxonomy generators, as well as data labelers and evaluators. These automations could lead to significant efficiency gains and cost savings for a variety of domains and applications that rely on understanding, structuring, and analyzing massive volumes of unstructured text. They could also broadly democratize the process of mining knowledge from text, empowering non-expert users and enterprises to interact with and interpret their data through natural language, thereby leading to better insights and data-driven decision making across a range of industries and sectors. Additionally, our framework and research findings relate to other work that leverages LLMs for taxonomy creation and text clustering, and offer important empirical lessons for the efficient use of instruction-following models in these scenarios.

Despite these initial successes, there are important challenges and future directions worth exploring. As we have already noted, LLMs are expensive and slow. In future work, we hope to improve the speed, efficiency, and robustness of our framework through hybrid approaches that further combine LLMs with embedding-based methods, or through model distillation, in which a smaller model is fine-tuned on instructions from a larger one. Evaluation remains a crucial and open challenge, and we plan to explore more robust LLM-aided evaluation, for example by fine-tuning a model to expand its reasoning capabilities beyond pairwise judgment tasks. While this work has focused largely on text mining in the conversational domain, we also hope to explore the extensibility of our framework to other domains. Finally, many domains carry ethical considerations around privacy and security that must be taken into account when performing large-scale automated text mining, and we hope to engage with these challenges more deeply in future work.
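To make the hybrid and distillation directions above concrete, the following sketch illustrates one way the pieces could fit together: an embedding-based clustering step selects a small, diverse sample of documents, an LLM pseudo-labels only that sample against a generated taxonomy, and a lightweight classifier is distilled from the pseudo-labels to score the full corpus cheaply. The helpers `embed` and `llm_label`, and all parameter choices, are hypothetical placeholders for illustration, not components of the framework described above.

```python
# Minimal sketch of the hybrid "embeddings + LLM + distillation" idea, assuming:
#   embed(texts)              -> list of fixed-size embedding vectors (any text encoder)
#   llm_label(text, taxonomy) -> one label name from the taxonomy (any LLM call)
# Both helpers are hypothetical placeholders, not APIs from this work.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def distill_llm_classifier(corpus, taxonomy, embed, llm_label, n_clusters=50, seed=0):
    """Pseudo-label a small, diverse sample with an LLM, then fit a cheap classifier."""
    X = np.asarray(embed(corpus))  # shape: (n_docs, dim)

    # 1) Embedding-based sampling: cluster the corpus and keep the document closest
    #    to each centroid, so the expensive LLM labels only a compact, diverse subset.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    reps = [int(np.argmin(np.linalg.norm(X - c, axis=1))) for c in km.cluster_centers_]

    # 2) LLM pseudo-labeling: assign each representative document a taxonomy label.
    pseudo_labels = [llm_label(corpus[i], taxonomy) for i in reps]

    # 3) Distillation: a lightweight classifier trained on the pseudo-labels can then
    #    score the entire corpus at near-zero marginal cost per document.
    clf = LogisticRegression(max_iter=1000).fit(X[reps], pseudo_labels)
    return clf, clf.predict(X)
```

The specific choices here are incidental: centroid-based sampling could be replaced by uncertainty- or diversity-based selection, and logistic regression by any lightweight model. The point is simply that the per-document LLM cost is paid once, on a small sample, rather than over the whole corpus.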


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Mengting Wan, Microsoft Corporation;

(2) Tara Safavi (Corresponding authors), Microsoft Corporation;

(3) Sujay Kumar Jauhar, Microsoft Corporation;

(4) Yujin Kim, Microsoft Corporation;

(5) Scott Counts, Microsoft Corporation;

(6) Jennifer Neville, Microsoft Corporation;

(7) Siddharth Suri, Microsoft Corporation;

(8) Chirag Shah, University of Washington (work done while at Microsoft);

(9) Ryen W. White, Microsoft Corporation;

(10) Longqi Yang, Microsoft Corporation;

(11) Reid Andersen, Microsoft Corporation;

(12) Georg Buscher, Microsoft Corporation;

(13) Dhruv Joshi, Microsoft Corporation;

(14) Nagu Rangan, Microsoft Corporation.