Evaluating the autonomous and copilot limitations of AI agents for biological discovery

Harvard University

Abstract:

Recent advances in large language models (LLMs) have improved their ability to execute structured analytical workflows, including standard bioinformatic pipelines. However, computational biology rarely consists of deterministic pipeline execution alone. Biological datasets are heterogeneous and noisy, and meaningful discovery often requires open-ended hypothesis generation and iterative reasoning over multimodal evidence. The extent to which emerging agentic AI systems can support this mode of scientific discovery remains poorly characterized. Here, we systematically evaluate the capabilities and limitations of agentic AI for biological discovery using multimodal oncology datasets spanning 15 cancer types. We benchmark 10 analysis tasks designed to vary in biological reasoning complexity, including replication of canonical workflows, tumor-program characterization, tumor-microenvironment analysis, and immune-cell discovery tasks. We also benchmark autonomous and human-copilot agent configurations. Our results delineate the current boundaries of agentic AI in computational biology and provide a framework for evaluating AI systems designed to support scientific discovery.

Biography:

Shreya Johri is a postdoctoral fellow in the Van Allen Lab at the Dana-Farber Cancer Institute. She completed her PhD at Harvard Medical School in 2025, where her dissertation focused on Accelerating Foundation Models for Medical Diagnosis and Biological Discovery. Her research centers on developing and evaluating large language models in complex real-world scientific domains, with a focus on generalizable evaluation frameworks, agentic tool-use systems, and multimodal foundation models. She has led the development of AI systems aimed at enabling more accurate and comprehensive medical diagnosis, as well as supporting data-driven biological discovery in oncology.

