A Comprehensive Analysis on LLM-based Node Classification Algorithms

The Chinese University of Hong Kong, Microsoft Research Asia, University of Illinois Urbana-Champaign

📝 Abstract

Node classification is a fundamental task in graph analysis, with broad applications across various fields. Recent breakthroughs in Large Language Models (LLMs) have enabled LLM-based approaches for this task. Although many studies demonstrate the impressive performance of LLM-based methods, the lack of clear design guidelines may hinder their practical application. In this work, we aim to establish such guidelines through a fair and systematic comparison of these algorithms. As a first step, we developed LLMNodeBed, a comprehensive codebase and testbed for node classification using LLMs. It includes ten datasets, eight LLM-based algorithms, and three learning paradigms, and is designed for easy extension with new methods and datasets. Subsequently, we conducted extensive experiments, training and evaluating over 2,200 models, to determine the key settings (e.g., learning paradigms and homophily) and components (e.g., model size) that affect performance. Our findings uncover eight novel insights, e.g., (1) LLM-based methods can significantly outperform traditional methods in a semi-supervised setting, while the advantage is marginal in a supervised setting; (2) Graph Foundation Models can beat open-source LLMs but still fall short of strong LLMs like GPT-4o in a zero-shot setting. We hope that the release of LLMNodeBed, along with our insights, will facilitate reproducible research and inspire future studies in this field.

🌟 Overview

📊 A Testbed We release LLMNodeBed, a PyG-based testbed designed to facilitate reproducible and rigorous research in LLM-based node classification algorithms. The initial release includes ten datasets, eight LLM-based algorithms, and three learning configurations. LLMNodeBed supports easy addition of new algorithms and datasets, and a single command runs all experiments and automatically generates every table in this work.

🔍 Comprehensive Experiments By training and evaluating over 2,200 models, we analyzed how the learning paradigm, homophily, language model type and size, and prompt design impact the performance of each algorithm category.

📚 Insights and Tips Detailed experiments were conducted to analyze each influencing factor. We identified the settings where each algorithm category performs best and the key components for achieving this performance. Our work provides intuitive explanations, practical tips, and insights about the strengths and limitations of each algorithm category.

📊 LLMNodeBed

Datasets LLMNodeBed comprises ten datasets spanning the academic, web-link, social, and e-commerce domains. These datasets vary significantly in scale, ranging from thousands of nodes to millions of edges, and exhibit differing levels of homophily.
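A quick way to place a dataset on that homophily spectrum is the edge-homophily ratio, i.e., the fraction of edges whose endpoints share a label. The sketch below computes it with PyG's built-in utility, using the public Cora dataset purely as a stand-in; LLMNodeBed's own loaders may expose the same statistic differently.

```python
# Minimal sketch: measuring edge homophily of a graph with PyG.
# Cora is used here only as an illustrative stand-in dataset.
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import homophily

dataset = Planetoid(root="data/Planetoid", name="Cora")
data = dataset[0]

# Fraction of edges whose endpoints share a label (1.0 = fully homophilic).
edge_hom = homophily(data.edge_index, data.y, method="edge")
print(f"Cora edge homophily: {edge_hom:.3f}")
```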

Baselines LLMNodeBed includes eight LLM-based baseline algorithms alongside classic methods: (1) LLM-as-Encoder: ENGINE and GNN w/ LLMEmb; (2) LLM-as-Reasoner: TAPE; (3) LLM-as-Predictor: LLM Instruction Tuning, GraphGPT, and LLaGA; (4) LLM Direct Inference: Advanced Prompts and Enriched Prompts; (5) Graph Foundation Models: ZeroG.
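For orientation, a minimal LLM-as-Encoder pipeline in the spirit of GNN w/ LLMEmb is sketched below: node texts are embedded with a frozen language model, and the embeddings are fed to a standard GNN classifier. The encoder checkpoint, mean pooling, and two-layer GCN are illustrative assumptions, not the exact configuration used in LLMNodeBed.

```python
# Minimal LLM-as-Encoder sketch: frozen text encoder -> GCN classifier.
# The checkpoint name and GCN hyperparameters are illustrative choices.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from torch_geometric.nn import GCNConv


@torch.no_grad()
def encode_texts(texts, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Embed raw node texts with a frozen language model (mean pooling).
    For large graphs, encode in mini-batches instead of a single call."""
    tok = AutoTokenizer.from_pretrained(model_name)
    enc = AutoModel.from_pretrained(model_name).eval()
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = enc(**batch).last_hidden_state          # [N, T, d]
    mask = batch["attention_mask"].unsqueeze(-1)     # [N, T, 1]
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


class GCNHead(torch.nn.Module):
    """Two-layer GCN classifier over the frozen text embeddings."""
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)


# Usage: x = encode_texts(node_texts); logits = GCNHead(x.size(1), 128, num_classes)(x, edge_index)
```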

Learning Paradigms Baselines are evaluated under three learning configurations: Semi-supervised, Supervised, and Zero-shot.
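The three configurations differ mainly in how many target-dataset labels the model may see. As an illustration only (the split sizes below, 20 labels per class and a 60% training fraction, are placeholders rather than LLMNodeBed's actual protocol), semi-supervised training uses a small fixed number of labels per class, supervised training uses a large fraction of all nodes, and zero-shot uses no target labels at all.

```python
# Illustrative label splits; the numbers are placeholders, not LLMNodeBed's protocol.
import torch


def semi_supervised_mask(y, num_classes, per_class=20):
    """Small fixed number of labeled nodes per class."""
    train_mask = torch.zeros(y.size(0), dtype=torch.bool)
    for c in range(num_classes):
        idx = (y == c).nonzero(as_tuple=True)[0]
        train_mask[idx[torch.randperm(idx.size(0))[:per_class]]] = True
    return train_mask


def supervised_mask(num_nodes, train_ratio=0.6):
    """Large random fraction of all nodes."""
    perm = torch.randperm(num_nodes)
    train_mask = torch.zeros(num_nodes, dtype=torch.bool)
    train_mask[perm[: int(train_ratio * num_nodes)]] = True
    return train_mask

# Zero-shot: no training mask at all -- the model never sees target-dataset labels.
```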

🔍 Main Experiments

The performance comparison under the semi-supervised and supervised settings, measured by Accuracy (%), is reported above.

📚 Our key takeaways are as follows:

1 Appropriately incorporating LLMs consistently improves performance.
2 LLM-based methods provide greater improvements in semi-supervised settings than in supervised settings.
3 LLM-as-Reasoner methods are highly effective when labels heavily depend on text (see the sketch after this list).
4 LLM-as-Encoder methods balance computational cost and accuracy effectively.
5 LLM-as-Predictor methods are more effective when labeled data is abundant.
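To make takeaway 3 concrete, an LLM-as-Reasoner pipeline in the spirit of TAPE first asks an LLM to classify a node from its raw text and to explain the textual evidence, then encodes the generated explanation as additional node features. The prompt wording, model name, and OpenAI client usage below are assumptions for this sketch, not LLMNodeBed's exact implementation.

```python
# Illustrative LLM-as-Reasoner step (TAPE-style): request a prediction plus an
# explanation, then use the explanation as augmented node text downstream.
# Prompt wording and model name are assumptions for this sketch.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def reason_about_node(title, abstract, class_names):
    prompt = (
        "You are classifying an academic paper.\n"
        f"Title: {title}\nAbstract: {abstract}\n"
        f"Candidate categories: {', '.join(class_names)}\n"
        "Answer with the most likely category and a short explanation of the "
        "textual evidence you relied on."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # The explanation is later encoded (e.g., by a smaller LM) and concatenated
    # with the original text features before GNN training.
    return resp.choices[0].message.content
```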

The performance comparison under the zero-shot setting, measured by Accuracy (%) and Macro-F1 (%), is reported above.

📚 Our key takeaways are as follows:

6 Graph Foundation Models (GFMs) can outperform open-source LLMs but still fall short of strong LLMs like GPT-4o.
7 LLM direct inference can be improved by appropriately incorporating structural information.
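One way to read takeaway 7 in code: an enriched zero-shot prompt can append a compact summary of a node's neighborhood (for example, truncated neighbor texts) to the plain prompt, so the LLM sees structural context without any training. The template below is an assumed illustration, not the exact Enriched Prompt used in LLMNodeBed.

```python
# Illustrative "enriched" zero-shot prompt: node text plus a short neighborhood
# summary. The template is an assumption, not LLMNodeBed's exact prompt.
def build_enriched_prompt(node_text, neighbor_texts, class_names, max_neighbors=5):
    neighbor_block = "\n".join(f"- {t[:200]}" for t in neighbor_texts[:max_neighbors])
    return (
        "Classify the following node into one of these categories: "
        f"{', '.join(class_names)}.\n\n"
        f"Node text:\n{node_text}\n\n"
        f"Texts of its linked neighbors:\n{neighbor_block}\n\n"
        "Reply with the single most likely category."
    )
```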

The comparison of LLM-as-Encoder and LM-as-Encoder, measured by Accuracy (%) under the semi-supervised setting, is reported above.

📚 Takeaway 8 LLM-as-Encoder significantly outperforms LM-as-Encoder on heterophilic graphs.

🚩 Citation

Feel free to cite this work if you find it useful!

@misc{wu2025llmnodebed,
    title={A Comprehensive Analysis on LLM-based Node Classification Algorithms}, 
    author={Xixi Wu and Yifei Shen and Fangzhou Ge and Caihua Shan and Yizhu Jiao and Xiangguo Sun and Hong Cheng},
    year={2025},
    eprint={2502.00829},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2502.00829}, 
}