Multi-Agent Systems and Compound AI Systems

We explore compound AI systems, highlighting how they can be used to tackle challenges in prompting and real-world problems. We also compare multi-agent systems and compound AI systems based on their shared characteristics.

ℹ️
The terminology in this post is framed within the context of computer science, AI and multi-agent systems.
ℹ️
An agent is a computational entity (refer to Agent and Multi-Agent System).
ℹ️
The concept of compound AI systems is inspired by The Shift from Models to Compound AI Systems.

Introduction

The adoption of AI in businesses and enterprises is growing rapidly, driving a surge in software applications powered by large language models (LLMs) or vision models. However, these applications face several significant challenges:

  • Scaling large models for domain-specific problems is difficult due to issues like data drift [1, 2, 3, 4, 5].
  • Operating large models is expensive, as they require substantial computational resources.
  • Large models can produce hallucinated or inconsistent results on specific problems.
  • Frequent updates to large models (e.g., new versions released annually) make previously designed prompts less effective, requiring considerable effort to redesign them for compatibility [6, 7].
  • The dynamic nature of real-world environments can expose the limitations of large models, reducing robustness and resulting in unpredictable outcomes that negatively affect user experience.

To address these challenges, recent research has focused on developing compound AI systems [8, 9, 10, 11, 12] and multi-agent systems [13, 14, 15, 16, 17], while also exploring techniques for creating smaller, more scalable models [18, 19, 20, 21]. In this post, we discuss the concept of compound AI systems, their potential to tackle the issues mentioned above, and their shared characteristics with multi-agent systems.

Compound AI Systems

Definition

A compound AI system consists of multiple components that interact with one another to solve AI tasks [8]. Interactions include function calls to models and API calls to subsystems or external tools. Several examples of compound AI systems are provided below:

Figure 1. A compound AI system for a question-answering application
  1. Question-Answering Application: This application is designed to answer questions based on its knowledge base. To accomplish this, the system retrieves documents from its data storage that are relevant to the given question. It then summarizes the retrieved information and generates a human-like response. The process involves two key components: a Retrieval-Augmented Generation (RAG) module and a large language model (LLM) (refer to Figure 1).
Figure 2. A compound AI system for identifying potential CVs from requirements
  2. CV Filtering Application: This application focuses on filtering out CVs that do not meet job requirements or qualifications. It works by extracting keywords from CVs, which are then used for filtering and scoring purposes. CVs with scores exceeding a predefined threshold advance to the next stage. The system consists of three main components: (i) a keyword extraction module, (ii) a scoring module, and (iii) a filtering module (refer to Figure 2).
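The CV-filtering pipeline can be sketched as three pluggable modules. The function names and the overlap-based scoring below are illustrative assumptions, not a prescribed design; a production system would likely place an LLM or a dedicated keyword-extraction model behind the same interfaces:

```python
def extract_keywords(cv_text: str) -> set[str]:
    """Module (i): naive keyword extraction by lower-cased tokenization."""
    return {token.strip(".,").lower() for token in cv_text.split()}

def score_cv(cv_keywords: set[str], required: set[str]) -> float:
    """Module (ii): fraction of required keywords found in the CV."""
    if not required:
        return 0.0
    return len(cv_keywords & required) / len(required)

def filter_cvs(cvs: dict[str, str], required: set[str], threshold: float) -> list[str]:
    """Module (iii): keep only CVs whose score exceeds the threshold."""
    return [
        name for name, text in cvs.items()
        if score_cv(extract_keywords(text), required) > threshold
    ]

cvs = {
    "alice": "Python machine learning engineer with Docker experience",
    "bob": "Front-end developer, React and CSS",
}
required = {"python", "docker", "machine"}
print(filter_cvs(cvs, required, threshold=0.5))  # ['alice']
```

Because each stage hides behind a small interface, one module can be swapped (say, for a model-backed extractor) without touching the rest of the system.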

Small Models in Compound AI Systems

Generalization is a fundamental goal for large models. To achieve this, these models typically have a vast number of parameters (e.g., 1B, 7B, 8B, or 405B) and are trained on extensive datasets spanning multiple domains. However, such large models often offer more capacity and complexity than required for many real-world applications, which leads to inefficient scaling. To tackle this challenge, recent research has shifted toward developing small models using techniques like knowledge distillation [22, 23, 24], quantization [25, 26, 27, 28, 29], and neural architecture search [30, 31, 32].
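As a concrete illustration of one of these techniques, the snippet below sketches naive symmetric post-training quantization of a weight matrix to int8. The function names and single-scale scheme are our own simplification; real toolkits (e.g., per-channel or activation-aware methods such as SmoothQuant [29]) are considerably more sophisticated:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric post-training quantization: map floats to int8 with one scale."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float weights."""
    return q.astype(np.float32) * scale

w = np.array([[0.51, -1.27], [0.02, 0.9]], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# int8 storage is 4x smaller than float32; round-off error stays within scale/2
print(np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6)  # True
```

The trade is memory and compute for a bounded reconstruction error, which is the core bet behind running smaller, quantized models in production.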

While both small and large models continue to improve, the issue of hallucination persists. To reduce hallucination, prompt engineers often refine prompt templates with more precise instructions. However, a challenge arises from prompt sensitivity when a new version of a model is released [6, 7]. This tends to result in suboptimal outcomes and situations where existing prompt templates no longer function as expected. To address these challenges, the literature suggests several potential solutions, such as sampling [33, 34], chaining reasoning steps [34, 35, 36], or implementing multi-step validation strategies. These approaches are often integrated into compound AI systems. For instance, multiple verifiers can be trained to assess the correctness of a language model's outputs [cite]. Similarly, the chain-of-thought mechanism [cite] and its variants [37, 38] are frequently employed to evaluate the factual consistency of large language models (LLMs).
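A minimal sketch of the sampling idea, in the spirit of self-consistency [34]: draw several answers and keep the majority vote. The `noisy_model` stub below is a hypothetical stand-in for a stochastic LLM call whose answers vary across samples:

```python
from collections import Counter
from itertools import cycle

def self_consistency(prompt: str, sample_fn, n_samples: int = 5) -> str:
    """Sample several answers and return the majority vote."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for a stochastic LLM: the answer differs from call to call.
samples = cycle(["42", "42", "41", "42", "43"])
def noisy_model(prompt: str) -> str:
    return next(samples)

print(self_consistency("What is 6 * 7?", noisy_model, n_samples=5))  # 42
```

Even though individual samples disagree, the vote converges on the answer the model produces most often, which is why sampling-based aggregation reduces (but does not eliminate) inconsistent outputs.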

When solving a complex task, using a single prompt for all steps tends to yield less consistent or poorer outcomes than breaking the task into multiple steps with separate prompts [39, 40, 41, 42]. This may be due to the limitations of current LLMs in handling multi-goal optimization and complex reasoning. By dividing a problem into steps and addressing them with multiple prompts, a compound AI system is formed. Depending on the nature of the problem, these steps can be organized in a hierarchical or a distributed manner.
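The decomposition idea can be sketched as a two-step prompt chain. `call_llm` below is a hypothetical stand-in for a real model API, answered here from a canned lookup so the pipeline runs end to end:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical model call, stubbed with canned answers for illustration."""
    canned = {
        "Extract the city from: 'Book me a flight to Paris next week.'": "Paris",
        "Name the country whose capital is Paris. Answer with one word.": "France",
    }
    return canned.get(prompt, "unknown")

def pipeline(user_request: str) -> str:
    # Step 1: one prompt, one goal -- extract the city.
    city = call_llm(f"Extract the city from: {user_request!r}")
    # Step 2: a second prompt consumes step 1's output.
    country = call_llm(f"Name the country whose capital is {city}. Answer with one word.")
    return country

print(pipeline("Book me a flight to Paris next week."))  # France
```

Each step carries a single goal and a narrow prompt, and the intermediate result (`city`) can be logged or validated before the next step runs, which is harder to do with one monolithic prompt.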

Effective prompting strategies to achieve better outcomes from most LLMs include: (i) focusing on a single goal at a time, (ii) providing clear and specific instructions, and (iii) limiting the model’s responses to a predefined set of options. Importantly, the models handling these prompts can be dynamic, which opens up a research area in designing LLM systems. This field explores standard design patterns to optimize both computational efficiency and task performance. Furthermore, common system design questions raised by practitioners include:

  1. Should the same model be used for all steps in the system?
  2. How can system-level optimization be balanced with model-level optimization?
  3. What strategies can be employed to effectively debug individual components when issues arise?
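Strategy (iii), limiting responses to a predefined set of options, pairs naturally with output validation and retries. The `classify` function and the stubbed model below are illustrative assumptions rather than any particular library's API:

```python
def classify(text: str, options: list[str], llm_call, max_retries: int = 2) -> str:
    """Ask the model to pick one allowed option; retry if the reply is invalid."""
    prompt = (
        f"Classify the text into exactly one of {options}. "
        f"Reply with the option only.\nText: {text}"
    )
    for _ in range(max_retries + 1):
        reply = llm_call(prompt).strip().lower()
        if reply in options:        # validate against the allowed set
            return reply
    return "unknown"                # fall back instead of propagating free text

# Stub model that answers in the expected format only on the second try.
replies = iter(["Sounds positive to me!", "positive"])
result = classify("Great product", ["positive", "negative"], lambda p: next(replies))
print(result)  # positive
```

Constraining and validating the output this way keeps downstream modules of a compound system from having to parse free-form text, at the cost of a possible extra model call.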

Multi-Agent Systems and Compound AI Systems

Multi-agent systems (MAS) are similar to compound AI systems due to their overlapping concepts. Distinguishing between them helps simplify the requirements of AI tasks when solving real-world problems. Key differences between these systems are highlighted in Table 1.

Table 1. Key differences between multi-agent systems and compound AI systems
|  | Multi-Agent Systems | Compound AI Systems |
| --- | --- | --- |
| Purpose | Utilizing autonomous intelligent agents to perform complex high-level tasks | Integrating various components into a unified system for an AI task |
| Interaction | Independent agents | Modules or subsystems |
| Communication | Either peer-to-peer (decentralized) or via a central authority (centralized) | Modules often communicate via a central architecture |
| Product Stage | When scaling beyond MVP | Proof-of-Concept (PoC) or Minimum Viable Product (MVP) |
| System complexity | Medium - High | Low - Medium |

Does an agent contain compound AI systems? The definition of an agent and the scope of the problem influence how we answer this question. In many cases, an agent relies on multiple AI modules to address different aspects of its functionality. For instance, large language models (LLMs) are often referred to as "LLM agents" because they exhibit characteristics of agency (see Agency, Agentic AI and Multi-Agent Systems). However, some practitioners view LLMs differently, treating them as modules that are activated through function calls rather than standalone agents.

References

  1. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M. and Bouchachia, A., 2014. A survey on concept drift adaptation. ACM computing surveys (CSUR), 46(4), pp.1-37.
  2. Miller, J., Krauth, K., Recht, B. and Schmidt, L., 2020, November. The effect of natural distribution shift on question answering models. In International conference on machine learning (pp. 6905-6916). PMLR.
  3. Lazaridou, A., Kuncoro, A., Gribovskaya, E., Agrawal, D., Liska, A., Terzi, T., Gimenez, M., de Masson d'Autume, C., Kocisky, T., Ruder, S. and Yogatama, D., 2021. Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems, 34, pp.29348-29363.
  4. Liang, J., He, R. and Tan, T., 2024. A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision, pp.1-34.
  5. Yang, J., Zhou, K., Li, Y. and Liu, Z., 2024. Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, 132(12), pp.5635-5662.
  6. Sclar, M., Choi, Y., Tsvetkov, Y. and Suhr, A., 2023. Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324.
  7. Arora, S., Narayan, A., Chen, M.F., Orr, L., Guha, N., Bhatia, K., Chami, I., Sala, F. and Ré, C., 2022. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441.
  8. Zaharia, M., Khattab, O., Chen, L., Davis, J.Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N. and Ghodsi, A., 2024. The shift from models to compound AI systems. Berkeley Artificial Intelligence Research Lab. Available online at: https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/ (accessed February 27, 2024).
  9. Lin, M., Sheng, J., Zhao, A., Wang, S., Yue, Y., Wu, Y., Liu, H., Liu, J., Huang, G. and Liu, Y.J., 2024. LLM-based Optimization of Compound AI Systems: A Survey. arXiv preprint arXiv:2410.16392.
  10. Jain, S., Raju, R., Li, B., Csaki, Z., Li, J., Liang, K., Feng, G., Thakkar, U., Sampat, A., Prabhakar, R. and Jairath, S., 2024. Composition of Experts: A Modular Compound AI System Leveraging Large Language Models. arXiv preprint arXiv:2412.01868.
  11. Sinha, S., Premsri, T. and Kordjamshidi, P., 2024. A Survey on Compositional Learning of AI Models: Theoretical and Experimental Practices. arXiv preprint arXiv:2406.08787.
  12. Han, S., Hu, Z., Shah, A.D., Jin, H., Yao, Y., Stripelis, D., Xu, Z. and He, C., 2024. TorchOpera: A Compound AI System for LLM Safety. arXiv preprint arXiv:2406.10847.
  13. Du, H., Thudumu, S., Vasa, R. and Mouzakis, K., 2024. A Survey on Context-Aware Multi-Agent Systems: Techniques, Challenges and Future Directions. arXiv preprint arXiv:2402.01968.
  14. Dorri, A., Kanhere, S.S. and Jurdak, R., 2018. Multi-agent systems: A survey. Ieee Access, 6, pp.28573-28593.
  15. Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N.V., Wiest, O. and Zhang, X., 2024. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680.
  16. Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E. and Zheng, R., 2023. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864.
  17. Xie, J., Chen, Z., Zhang, R., Wan, X. and Li, G., 2024. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116.
  18. Lu, Z., Li, X., Cai, D., Yi, R., Liu, F., Zhang, X., Lane, N.D. and Xu, M., 2024. Small language models: Survey, measurements, and insights. arXiv preprint arXiv:2409.15790.
  19. Magister, L.C., Mallinson, J., Adamek, J., Malmi, E. and Severyn, A., 2023, July. Teaching Small Language Models to Reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 1773-1781).
  20. Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N. and Sarrazin, N., 2023. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944.
  21. Javaheripi, M., Bubeck, S., Abdin, M., Aneja, J., Bubeck, S., Mendes, C.C.T., Chen, W., Del Giorno, A., Eldan, R. and Gopi, S., 2023. Phi-2: The surprising power of small language models. Microsoft Research Blog, 1(3), p.3.
  22. Gou, J., Yu, B., Maybank, S.J. and Tao, D., 2021. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6), pp.1789-1819.
  23. Xu, X., Li, M., Tao, C., Shen, T., Cheng, R., Li, J., Xu, C., Tao, D. and Zhou, T., 2024. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116.
  24. Yang, C., Zhu, Y., Lu, W., Wang, Y., Chen, Q., Gao, C., Yan, B. and Chen, Y., 2024. Survey on knowledge distillation for large language models: methods, evaluation, and application. ACM Transactions on Intelligent Systems and Technology.
  25. Polino, A., Pascanu, R. and Alistarh, D., 2018. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668.
  26. Zhou, Y., Moosavi-Dezfooli, S.M., Cheung, N.M. and Frossard, P., 2018, April. Adaptive quantization for deep neural network. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1).
  27. Rokh, B., Azarpeyvand, A. and Khanteymoori, A., 2023. A comprehensive survey on model quantization for deep neural networks in image classification. ACM Transactions on Intelligent Systems and Technology, 14(6), pp.1-50.
  28. Li, S., Ning, X., Wang, L., Liu, T., Shi, X., Yan, S., Dai, G., Yang, H. and Wang, Y., 2024. Evaluating quantized large language models. arXiv preprint arXiv:2402.18158.
  29. Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J. and Han, S., 2023, July. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning (pp. 38087-38099). PMLR.
  30. Ren, P., Xiao, Y., Chang, X., Huang, P.Y., Li, Z., Chen, X. and Wang, X., 2021. A comprehensive survey of neural architecture search: Challenges and solutions. ACM Computing Surveys (CSUR), 54(4), pp.1-34.
  31. Chen, A., Dohan, D. and So, D., 2024. EvoPrompting: language models for code-level neural architecture search. Advances in Neural Information Processing Systems, 36.
  32. Javaheripi, M., de Rosa, G., Mukherjee, S., Shah, S., Religa, T., Teodoro Mendes, C.C., Bubeck, S., Koushanfar, F. and Dey, D., 2022. Litetransformersearch: Training-free neural architecture search for efficient language models. Advances in Neural Information Processing Systems, 35, pp.24254-24267.
  33. Semnani, S., Yao, V., Zhang, H. and Lam, M., 2023, December. WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 2387-2413).
  34. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A. and Zhou, D., 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  35. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V. and Zhou, D., 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, pp.24824-24837.
  36. Kim, G., Kim, S., Jeon, B., Park, J. and Kang, J., 2023, December. Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 996-1009).
  37. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R. and Hesse, C., 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  38. Tam, D., Mascarenhas, A., Zhang, S., Kwan, S., Bansal, M. and Raffel, C., 2023, July. Evaluating the Factual Consistency of Large Language Models Through News Summarization. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 5220-5255).
  39. Zverev, E., Abdelnabi, S., Tabesh, S., Fritz, M. and Lampert, C.H., 2024. Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?. arXiv preprint arXiv:2403.06833.
  40. Wallace, E., Xiao, K., Leike, R., Weng, L., Heidecke, J. and Beutel, A., 2024. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208.
  41. Addala, K., Baghel, K.D.P., Kirtani, C., Anand, A. and Shah, R.R., 2024. Steps are all you need: Rethinking STEM Education with Prompt Engineering. arXiv preprint arXiv:2412.05023.
  42. Cheng, K., Ahmed, N.K., Willke, T. and Sun, Y., 2024. Structure Guided Prompt: Instructing Large Language Model in Multi-Step Reasoning by Exploring Graph Structure of the Text. arXiv preprint arXiv:2402.13415.