Research

An Overview of Backdoor Attacks in AI Systems Across Vision and Language Tasks

This blog post explores the concepts of backdoor attacks in AI systems, including their sources, categories, and current defense strategies. Notably, backdoor attacks are prevalent in both vision and language-based tasks.

Hung Du

Jan 27, 2025 • 9 min read

ℹ️

The terminology in this post is framed within the context of computer science, AI and multi-agent systems.

Introduction

Artificial Intelligence (AI) systems leveraging machine learning or deep learning are increasingly used across various fields, including healthcare, surveillance, military operations, and more. These sectors often have stringent requirements for security and privacy, yet many AI systems remain vulnerable to backdoor attacks. Such attacks, which are a type of semantic attack, can exist in both data and models [1, 2]. For instance, in healthcare applications like medication management, a backdoor attack could result in incorrect dosage distribution, jeopardizing patient health and undermining the effectiveness of healthcare missions. Similarly, in military contexts, backdoor attacks on autonomous drones or surveillance systems might cause them to misidentify targets or fail to distinguish between friendly and enemy forces, leading to potentially disastrous consequences during operations.

Recent studies highlight the critical impact of backdoor attacks on both vision and natural language processing tasks, particularly with Large Language Models (LLMs) [1, 2, 3]. While considerable attention has been given to the research and development of platforms for data labeling [4, 5, 6], model development [7, 8, 9, 10], and AI system development [11, 12, 13], less focus has been placed on creating platforms to validate labeled data and AI models specifically to guard against backdoor attacks. It is crucial to research and develop innovative platforms designed to validate labeled data and AI models. This helps to ensure protection against backdoor threats in real-world AI applications.

Backdoor Attacks in AI Systems

Backdoor attacks in AI involve embedding malicious triggers or patterns within AI system operations, including training data, model development, and model deployment [1]. Such attacks allow an attacker to manipulate the system’s behavior under specific conditions without being detected during normal operation [1, 2]. In addition, backdoor attacks tend to be stealthy, resilient to traditional detection methods and transferable. The overview of backdoor attack scenarios is illustrated in Figure 1. Details of each process are provided below.

Where Do They Come From?

A backdoor attack can be injected by the following roles in the AI system operation:

Data labeler: This role aims to collect and label data for training AI models.
Model designer: This role aims to design and develop AI models by following sequential steps, such as data analysis, data pre-processing, model selection and model training and evaluation.
System designer: This role aims to deploy AI models for specific applications.

An attack can be introduced from the following scenarios [2]:

Third-party datasets: When training and evaluating AI models, model designers often rely on open-source datasets obtained from third-party sources. Although these datasets are widely used, they may contain poisoned data samples that attackers can exploit to launch an attack. In addition, during the labeling process, a data labeler might insert an infinitesimal number of poisoned samples.
Third-party platforms: With limited computational resources, model designers often turn to third-party platforms that provide high-performance computing for training and evaluating models. However, this introduces a vulnerability, as attackers or platform providers could potentially gain access to these platforms, manipulate the training process, and inject poisoned data samples or malicious model weights to trigger a backdoor attack. In the case of LLM applications, attackers could exploit a backdoor vulnerability by altering a small part of the instructions in the prompts [3], which are typically configured by system designers during deployment.
Third-party models: A third-party model can be either downloaded or accessed through an application programming interface (API) from external sources. In the first case, open-source models may contain poisoned weights that attackers can exploit to carry out an attack [1, 2]. In the second case, an attacker could modify the pipelines associated with the API, excluding the input and output data schema, such as the data and model pipelines.

Categories of Backdoor Attacks

Backdoor attacks can be broadly classified into two categories: data-level attacks and model-level attacks [1]. Note that a few examples below are just for illustrative purposes. With the advancement of generative models, more sophisticated techniques for backdoor attacks have emerged. Details can be found in these surveys [1, 2, 3, 22].

Data-level Attacks

In data-level attacks, attackers develop AI systems designed to generate poisoned data points or manipulate the latent representations of these data points to target specific labels. For vision-related tasks, backdoor attacks can be implemented using stealthy patterns [14], image scaling techniques [15], or generative models designed with specific poisoning criteria [16] (see Figures 2, 3, and 4). It is worth nothing that backdoor triggers in poisoned images are often imperceptible or challenging for the human visual system to detect.

**Figure 2.** Examples of backdoor signals created by Barni *et al.* [14] to poison CNNs (this figure is captured from [1]).

**Figure 3.** Examples of backdoor attacks created by Quiring and Rieck [15] (this figure is captured from [1]).

**Figure 4.** An example of poisoned image created by WaNet [16] (this figure is captured from [1]).

For language tasks, backdoor signals can be embedded within the documents used for model training, various parts of a prompt (e.g., instructions, examples, and context), or within a chain of prompts (see Figures 5, 6, and 7).

**Figure 5.** An example of backdoor signals embedded in documents used to train LLMs for review classification (this figure is captured from [3]).

**Figure 6.** Examples of backdoor attacks by modifying prompts (examples in this figure are from [3]).

**Figure 7.** Examples of backdoor signals within a chain of prompts (examples in this figure are from [17, 18]).

Model-level Attacks

Model-level attacks involve the development of AI systems that focus on generating poisoned model weights to exploit specific situations. One common approach to generating poisoned models involves mimicking the weights of the target model with minor adjustments (refer to Figure 8). This allows the poisoned model to maintain good performance on normal cases while bypassing human validation.

**Figure 8.** An example of modifying model weights to generate a poisoned vision model (this is inspired by [19])

On the other hand, practitioners often utilize techniques like fine-tuning, knowledge distillation, or low-rank adaptation (LoRA) when applying open-source large language models to their specific problem domains. However, these techniques can also serve as entry points for injecting backdoor signals into the AI system. For example, Yin et al. [20] used LoRA to generate poisoned models designed to attack the global model during the model merging process (see Figure 9). As another example, Cheng et al. [21] demonstrated that backdoor attacks can be transferred between large language models through knowledge distillation (see also Figure 10).

**Figure 9.** An example of backdoor attacks via model merging (this is captured from [20])

**Figure 10.** An example of backdoor attacks via knowledge distillation (this is captured from [21]).

Existing Defense Strategies

Defense strategies against backdoor attacks include rule-based matching algorithms for input verification, anomaly detection, clustering techniques for validating data distribution, model reconstruction, and model fine-tuning or retraining [1, 2, 3]. However, several challenges emerge:

The verification processes can be labor-intensive.
It is difficult and time-consuming for human-engineering to all possible poisoned signals or patterns needed to train existing detection models.
Many defense strategies are not practical for real-world deployment.

To address these challenges, one potential solution is to integrate generative models within a multi-agent deep reinforcement learning framework. For simplicity, we consider a setup with two deep reinforcement learning agents: one agent (i.e., Generator), equipped with generative models, is responsible for synthesizing all possible poisoned data points or their latent representations, while the other agent (i.e., Detector) is tasked with detecting these poisoned data. In addition, deep reinforcement learning algorithms are used to optimize each agent's performance in settings that simulate real-world scenarios. The Detector can be further employed to prevent backdoor attacks on AI systems across domains such as healthcare, surveillance, military operations, and more.

More details about state-of-the-art approaches are provided in the following surveys: vision models [1, 2, 24] and language models [3, 22, 23].

References

Guo, W., Tondi, B. and Barni, M., 2022. An overview of backdoor attacks against deep neural networks and possible defences. IEEE Open Journal of Signal Processing, 3, pp.261-287.
Li, Y., Jiang, Y., Li, Z. and Xia, S.T., 2022. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 35(1), pp.5-22.
Zhao, S., Jia, M., Guo, Z., Gan, L., Xu, X., Wu, X., Fu, J., Feng, Y., Pan, F. and Tuan, L.A., 2024. A survey of backdoor attacks and defenses on large language models: Implications for security measures. arXiv preprint arXiv:2406.06852.
Ashktorab, Z., Desmond, M., Andres, J., Muller, M., Joshi, N.N., Brachman, M., Sharma, A., Brimijoin, K., Pan, Q., Wolf, C.T. and Duesterwald, E., 2021. Ai-assisted human labeling: Batching for efficiency without overreliance. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1), pp.1-27.
Chew, R., Wenger, M., Kery, C., Nance, J., Richards, K., Hadley, E. and Baumgartner, P., 2019. SMART: an open source data labeling platform for supervised learning. Journal of Machine Learning Research, 20(82), pp.1-5.
Woodward, K., Kanjo, E., Oikonomou, A. and Chamberlain, A., 2020. LabelSens: enabling real-time sensor data labelling at the point of collection using an artificial intelligence-based approach. Personal and Ubiquitous Computing, 24(5), pp.709-722.
Hosny, A., Schwier, M., Berger, C., Örnek, E.P., Turan, M., Tran, P.V., Weninger, L., Isensee, F., Maier-Hein, K.H., McKinley, R. and Lu, M.T., 2019. Modelhub. ai: Dissemination platform for deep learning models. arXiv preprint arXiv:1911.13218.
Chard, R., Li, Z., Chard, K., Ward, L., Babuji, Y., Woodard, A., Tuecke, S., Blaiszik, B., Franklin, M.J. and Foster, I., 2019, May. DLHub: Model and data serving for science. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 283-292). IEEE.
Taraghi, M., Dorcelus, G., Foundjem, A., Tambon, F. and Khomh, F., 2024. Deep learning model reuse in the huggingface community: Challenges, benefit and trends. arXiv preprint arXiv:2401.13177.
Ouyang, W., Beuttenmueller, F., Gómez-de-Mariscal, E., Pape, C., Burke, T., Garcia-López-de-Haro, C., Russell, C., Moya-Sans, L., De-La-Torre-Gutiérrez, C., Schmidt, D. and Kutra, D., 2022. Bioimage model zoo: a community-driven resource for accessible deep learning in bioimage analysis. BioRxiv, pp.2022-06.
Martínez-Fernández, S., Bogner, J., Franch, X., Oriol, M., Siebert, J., Trendowicz, A., Vollmer, A.M. and Wagner, S., 2022. Software engineering for AI-based systems: a survey. ACM Transactions on Software Engineering and Methodology (TOSEM), 31(2), pp.1-59.
Lu, Q., Zhu, L., Xu, X., Xing, Z. and Whittle, J., 2024. Towards Responsible AI in the Era of Generative AI: A Reference Architecture for Designing Foundation Model based Systems. IEEE Software.
Lu, Q., Zhu, L., Xu, X., Liu, Y., Xing, Z. and Whittle, J., 2024, April. A taxonomy of foundation model based systems through the lens of software architecture. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering-Software Engineering for AI (pp. 1-6).
Barni, M., Kallas, K. and Tondi, B., 2019, September. A new backdoor attack in cnns by training set corruption without label poisoning. In 2019 IEEE International Conference on Image Processing (ICIP) (pp. 101-105). IEEE.
Quiring, E. and Rieck, K., 2020, May. Backdooring and poisoning neural networks with image-scaling attacks. In 2020 IEEE Security and Privacy Workshops (SPW) (pp. 41-47). IEEE.
Nguyen, A. and Tran, A., 2021. Wanet--imperceptible warping-based backdoor attack. arXiv preprint arXiv:2102.10369.
Xiang, Z., Jiang, F., Xiong, Z., Ramasubramanian, B., Poovendran, R. and Li, B., 2024. Badchain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242.
Cao, Y., Cao, B. and Chen, J., 2023. Stealthy and persistent unalignment on large language models via backdoor injections. arXiv preprint arXiv:2312.00027.
Dumford, J. and Scheirer, W., 2020, September. Backdooring convolutional neural networks via targeted weight perturbations. In 2020 IEEE International Joint Conference on Biometrics (IJCB) (pp. 1-9). IEEE.
Yin, M., Zhang, J., Sun, J., Fang, M., Li, H. and Chen, Y., 2024. LoBAM: LoRA-Based Backdoor Attack on Model Merging. arXiv preprint arXiv:2411.16746.
Cheng, P., Wu, Z., Ju, T., Du, W. and Liu, Z.Z.G., 2024. Transferring backdoors between large language models by knowledge distillation. arXiv preprint arXiv:2408.09878.
Huang, Y., Sun, L., Wang, H., Wu, S., Zhang, Q., Li, Y., Gao, C., Huang, Y., Lyu, W., Zhang, Y. and Li, X., 2024. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561.
Das, B.C., Amini, M.H. and Wu, Y., 2024. Security and privacy challenges of large language models: A survey. ACM Computing Surveys.
Zhang, S., Pan, Y., Liu, Q., Yan, Z., Choo, K.K.R. and Wang, G., 2024. Backdoor attacks and defenses targeting multi-domain ai models: A comprehensive review. ACM Computing Surveys, 57(4), pp.1-35.