Zhuohang Jiang

I am currently pursuing a PhD degree at the Hong Kong Polytechnic University. My supervisors are Qing Li and Wenqi Fan.

I studied at Sichuan University (SCU) from 2020 to 2024, where I majored in Computer Science & Technology. Major GPA (CS courses): 3.79/4 (89.39/100); overall GPA: 3.78/4 (89.25/100).

During my time at Sichuan University, I worked as a research assistant at MachineILab from 2022 to 2024, advised by Prof. Jizhe Zhou. I participated in one National Natural Science Foundation of China project and one National Key R&D Program of China project.

Email  /  CV  /  Google Scholar  /  GitHub

profile photo

Research Topics

My research areas include recommender systems (RS), large language models (LLMs), computer vision (CV), and graph neural networks (GNNs). My previous research focused primarily on computer vision topics such as tampering detection and object recognition. Currently, I am diving into large language models (LLMs) and Retrieval-Augmented Generation (RAG).


News

🏆 2025-05-16 - Our benchmark paper HiBench was accepted to the KDD Benchmark Track! 🎉

🎉 2025-05-07 - Our survey paper A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models was accepted to the KDD Tutorial Track! 🎊

📘 2025-03-01 - Completed the HiBench paper and released the code and dataset on GitHub and Hugging Face.

🌟 2025-01-15 - Mesoscopic Insights: Orchestrating Multi-Scale & Hybrid Architecture for Image Manipulation Localization was accepted to AAAI 2025.

🏆 2024-12-01 - IMDL-BenCo was accepted to the NeurIPS 2024 Datasets and Benchmarks Track and selected as a Spotlight.

🎓 2024-09-01 - Began my PhD studies at the Hong Kong Polytechnic University (PolyU).

🎓 2024-06-26 - Graduated from Sichuan University with a bachelor's degree.

🛠️ 2024-06-12 - Completed the collaborative project IMDL-BenCo and finished the paper IMDL-BenCo: A Comprehensive Benchmark and Codebase for Image Manipulation Detection & Localization.

🔍 2024-05-24 - Finished the paper Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning.


Publications
[KDD'25] A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models [arxiv]
Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S. Yu, Qing Li
With their continuous advancement, web techniques have significantly revolutionized various aspects of people's lives. Despite the importance of the web, many tasks performed on it are repetitive and time-consuming, negatively impacting overall quality of life. To efficiently handle these tedious daily tasks, one of the most promising approaches is to advance autonomous agents based on Artificial Intelligence (AI) techniques, referred to as AI Agents, as they can operate continuously without fatigue or performance degradation. In the context of the web, leveraging AI Agents -- termed WebAgents -- to automatically assist people in handling tedious daily tasks can dramatically enhance productivity and efficiency. Recently, Large Foundation Models (LFMs) containing billions of parameters have exhibited human-like language understanding and reasoning capabilities, showing proficiency in performing various complex tasks. This naturally raises the question: "Can LFMs be utilized to develop powerful AI Agents that automatically handle web tasks, providing significant convenience to users?" To fully explore the potential of LFMs, extensive research has emerged on WebAgents designed to complete daily web tasks according to user instructions, significantly enhancing the convenience of daily human life. In this survey, we comprehensively review existing research studies on WebAgents across three key aspects: architectures, training, and trustworthiness. Additionally, several promising directions for future research are explored to provide deeper insights.
[KDD'25] HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning [arxiv] [GitHub] [Hugging Face]
Zhuohang Jiang, Pangjing Wu, Ziran Liang, Peter Q. Chen, Xu Yuan, Ye Jia, Jiancheng Tu, Chen Li, Peter H.F. Ng, Qing Li
Structure reasoning is a fundamental capability of large language models (LLMs), enabling them to reason about structured commonsense and answer multi-hop questions. However, existing benchmarks for structure reasoning mainly focus on horizontal and coordinate structures (e.g., graphs), overlooking the hierarchical relationships within them. Hierarchical structure reasoning is crucial for human cognition, particularly in memory organization and problem-solving. It also plays a key role in various real-world tasks, such as information extraction and decision-making. To address this gap, we propose HiBench, the first framework spanning from initial structure generation to final proficiency assessment, designed to benchmark the hierarchical reasoning capabilities of LLMs systematically. HiBench encompasses six representative scenarios, covering both fundamental and practical aspects, and consists of 30 tasks with varying hierarchical complexity, totaling 39,519 queries. To evaluate LLMs comprehensively, we develop five capability dimensions that depict different facets of hierarchical structure understanding. Through extensive evaluation of 20 LLMs from 10 model families, we reveal key insights into their capabilities and limitations: 1) existing LLMs show proficiency in basic hierarchical reasoning tasks; 2) they still struggle with more complex structures and implicit hierarchical representations, especially in structural modification and textual reasoning. Based on these findings, we create a small yet well-designed instruction dataset, which enhances LLMs' performance on HiBench by an average of 88.84% (Llama-3.1-8B) and 31.38% (Qwen2.5-7B) across all tasks. The HiBench dataset and toolkit are publicly available to encourage further evaluation.
[AAAI'25] Mesoscopic Insights: Orchestrating Multi-Scale & Hybrid Architecture for Image Manipulation Localization [arxiv]
Xuekang Zhu, Xiaochen Ma, Lei Su, Zhuohang Jiang, Bo Du, Xiwen Wang, Zeyu Lei, Wentao Feng, Chi-Man Pun, Jizhe Zhou
The mesoscopic level serves as a bridge between the macroscopic and microscopic worlds, addressing gaps overlooked by both. Image manipulation localization (IML), a crucial technique to pursue truth from fake images, has long relied on low-level (microscopic-level) traces. However, in practice, most tampering aims to deceive the audience by altering image semantics. As a result, manipulation commonly occurs at the object level (macroscopic level), which is equally important as microscopic traces. Therefore, integrating these two levels into the mesoscopic level presents a new perspective for IML research. Inspired by this, our paper explores how to simultaneously construct mesoscopic representations of micro and macro information for IML and introduces the Mesorch architecture to orchestrate both. Specifically, this architecture i) combines Transformers and CNNs in parallel, with Transformers extracting macro information and CNNs capturing micro details, and ii) explores across different scales, assessing micro and macro information seamlessly. Additionally, based on the Mesorch architecture, the paper introduces two baseline models aimed at solving IML tasks through mesoscopic representation. Extensive experiments across four datasets have demonstrated that our models surpass the current state-of-the-art in terms of performance, computational complexity, and robustness.
[NIPS'24] IMDL-BenCo: A Comprehensive Benchmark and Codebase for Image Manipulation Detection & Localization [arxiv]
Xiaochen Ma, Xuekang Zhu, Lei Su, Bo Du, Zhuohang Jiang, Bingkui Tong, Zeyu Lei, Xinyu Yang, Chi-Man Pun, Jiancheng Lv, Jizhe Zhou
A comprehensive benchmark is yet to be established in the Image Manipulation Detection & Localization (IMDL) field. The absence of such a benchmark leads to insufficient and misleading model evaluations, severely undermining the development of this field. However, the scarcity of open-sourced baseline models and inconsistent training and evaluation protocols make conducting rigorous experiments and faithful comparisons among IMDL models challenging. To address these challenges, we introduce IMDL-BenCo, the first comprehensive IMDL benchmark and modular codebase. IMDL-BenCo: i) decomposes the IMDL framework into standardized, reusable components and revises the model construction pipeline, improving coding efficiency and customization flexibility; ii) fully implements or incorporates training code for state-of-the-art models to establish a comprehensive IMDL benchmark; and iii) conducts deep analysis based on the established benchmark and codebase, offering new insights into IMDL model architecture, dataset characteristics, and evaluation standards. Specifically, IMDL-BenCo includes common processing algorithms, 8 state-of-the-art IMDL models (1 of which is reproduced from scratch), 2 sets of standard training and evaluation protocols, 15 GPU-accelerated evaluation metrics, and 3 kinds of robustness evaluation. This benchmark and codebase represent a significant leap forward in calibrating the current progress in the IMDL field and inspiring future breakthroughs.
Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning [arxiv]
Zhuohang Jiang, Bingkui Tong, Xia Du, Ahmed Alhammadi, Jizhe Zhou
To explicitly derive the objects' privacy class from the scene contexts, in this paper we interpret the privacy-sensitive object identification (POI) task as a visual reasoning task aimed at the privacy of each object in the scene. Following this interpretation, we propose the PrivacyGuard framework for POI. PrivacyGuard contains three stages. i) Structuring: an unstructured image is first converted into a structured, heterogeneous scene graph that embeds rich scene contexts. ii) Data Augmentation: a contextual perturbation oversampling strategy is proposed to create slightly perturbed privacy-sensitive objects in a scene graph, thereby balancing the skewed distribution of privacy classes. iii) Hybrid Graph Generation & Reasoning: the balanced, heterogeneous scene graph is then transformed into a hybrid graph by endowing it with extra "node-node" and "edge-edge" homogeneous paths. These homogeneous paths allow direct message passing between nodes or edges, thereby accelerating reasoning and facilitating the capturing of subtle context changes.

Research Projects
Research on Scene Graph Structure Learning Method for Private Object Detection
Advisor: Jizhe Zhou
Participated as an intern
National Natural Science Foundation of China, 2024
The privacy-sensitive object detection problem requires the model to locate private objects with bounding boxes in images or videos. Research on privacy-sensitive object detection has important value for personal privacy protection. Privacy-sensitive object detection is essentially a scene reasoning problem; however, existing privacy-sensitive object detection methods are all built on object detection frameworks. Due to their lack of scene reasoning ability, existing methods suffer in detection accuracy, generalizability, and interpretability. This project builds a set of privacy-sensitive object detection methods with scene reasoning capability through scene graphs. Unlike other tasks, privacy-sensitive object detection requires a non-parametric scene graph structure to keep the graph sparse, dynamic, and interpretable. Therefore, this project correspondingly proposes scene graph structure learning methods. By studying 1) a distillation method for the graph structure to sparsify the scene graph, 2) a transfer method between the scene graphs of different frames to make the scene graph dynamic, and 3) a privacy-rule reasoning method based on the scene graph structure, the project solves the problem of scene graph generation with a non-parametric graph structure, builds a new privacy-sensitive object detection framework based on scene reasoning, and breaks through the bottlenecks of existing privacy-sensitive object detection methods in accuracy, generalizability, and interpretability. This framework also enriches the theoretical framework and application scenarios of neural networks.
Intelligent Control and Full-Lifecycle Feedforward Deduction Technology through Pre-Planning and Post-Evaluation
Advisor: Jizhe Zhou
Participated as an intern
National Key R&D Program of China, 2023 
This subproject adopts a scheme of "information completion first, path reasoning second". Specifically, it refines the initial network graph based on the database established in previous projects, further accounting for the co-occurrence frequency of impact factors. Then, using a distantly supervised causal information completion method based on external knowledge, the network graph is completed and cleaned again, and a path reasoning algorithm based on depth-first traversal of the graph performs feedforward reasoning. Finally, a human-computer interactive network information verification method based on uncertainty reasoning revises the causal relationships in the network graph according to human feedback on the reasoning paths, further improving the accuracy and recall of the path reasoning results.

Education
Sichuan University, Chengdu, Sichuan, China
B.E. in Computer Science and Technology • Sep. 2020 to Jun. 2024

The Hong Kong Polytechnic University, Hong Kong, China
Ph.D. in Computer Science and Technology • Sep. 2024 to Present

Experience
DICALab, Sichuan University
Research Assistant • Sep. 2022 to Jun. 2024
Advisor: Prof. Jizhe Zhou

Covariant Association, Sichuan University
President • Sep. 2022 to Jun. 2023
The Covariant Association has more than 400 members who exchange and discuss computer science technology.

Awards
Computer Design Competition, China, 2023 • Provincial First Prize
Comprehensive First-Class Scholarship, Sichuan University, Sichuan, China, 2022 • Top 1%
Outstanding Student of Sichuan University, Sichuan, China, 2022 • Top 5%

Updated in Apr. 2025
Thanks to Jon Barron for this amazing template.