RCSB PDB AI Help Desk: RAG for Protein Deposition Support
Researchers have built an AI Help Desk that uses Retrieval-Augmented Generation (RAG) to assist structural biologists depositing protein structures in the Protein Data Bank (PDB). The system is built on LangChain with a pgvector store (PostgreSQL) and GPT-4.1-mini, and aims to ease the workload of RCSB PDB biocurators, who process over 40% of global depositions. In 2025, the Help Desk received around 19,000 messages concerning roughly 8,000 entries. The RAG pipeline uses pymupdf4llm to extract PDF content as Markdown, two-stage document chunking, Maximal Marginal Relevance retrieval, a topical guardrail that filters off-topic queries, and a tailored system prompt that prevents exposure of internal terminology. The PDB holds over 245,000 experimentally determined 3D structures, and approximately 20 expert biocurators across the wwPDB validate and biocurate incoming data.
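To make the Maximal Marginal Relevance step concrete, here is a minimal, dependency-free sketch of the general technique. It is not the Help Desk's actual code: the function names (`cosine`, `mmr_select`), the `lambda_mult` default, and the toy two-dimensional vectors are all assumptions for illustration. MMR greedily picks chunks that are relevant to the query while penalizing similarity to chunks already selected.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
    lambda_mult: float = 0.5,  # hypothetical default: 1.0 = pure relevance, 0.0 = pure diversity
) -> List[int]:
    """Return indices of up to k documents chosen by Maximal Marginal Relevance."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < k:
        best_idx = candidates[0]
        best_score = -math.inf
        for i in candidates:
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


if __name__ == "__main__":
    # Toy embeddings (assumed): documents 0 and 1 are near-duplicates.
    query = [1.0, 0.5]
    docs = [[1.0, 0.4], [1.0, 0.45], [0.2, 1.0]]
    print(mmr_select(query, docs, k=2, lambda_mult=0.5))
    print(mmr_select(query, docs, k=2, lambda_mult=1.0))
```

With a balanced `lambda_mult`, the second pick skips the near-duplicate in favor of a less similar chunk; with `lambda_mult=1.0` the selection reduces to plain similarity ranking. In LangChain, vector-store retrievers can typically request this behavior with `as_retriever(search_type="mmr")`; whether the Help Desk configures it that way is not stated in the summary.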
Key facts
- AI Help Desk uses Retrieval-Augmented Generation (RAG) for protein structure deposition support.
- System built on LangChain with pgvector store (PostgreSQL) and GPT-4.1-mini.
- RCSB PDB biocurators process over 40% of global depositions.
- Approximately 19,000 messages concerning about 8,000 entries were received in 2025.
- Over 245,000 experimentally determined 3D structures in the PDB.
- ~20 expert biocurators across the wwPDB validate and biocurate incoming data.
- System uses pymupdf4llm for PDF extraction, two-stage chunking, and Maximal Marginal Relevance retrieval.
- Topical guardrail filters off-topic queries; system prompt prevents exposure of internal terminology.
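The summary does not say what the two chunking stages are. One common reading, assumed here purely for illustration, is that Markdown extracted from PDFs is first split at headings and that long sections are then split into size-bounded, overlapping pieces. The sketch below shows that idea; the function names, `max_chars`, and `overlap` are hypothetical and not taken from the Help Desk implementation.

```python
import re
from typing import Dict, List

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Dict[str, str]]:
    """Stage 1 (assumed): one section per Markdown heading, keeping its title."""
    sections: List[Dict[str, str]] = []
    title = ""
    buffer: List[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:  # skip headings with no body text
            sections.append({"title": title, "text": body})

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            buffer = []
        else:
            buffer.append(line)
    flush()
    return sections


def split_by_size(text: str, max_chars: int = 200, overlap: int = 20) -> List[str]:
    """Stage 2 (assumed): fixed-size character windows with overlap."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks


def two_stage_chunks(
    markdown: str, max_chars: int = 200, overlap: int = 20
) -> List[Dict[str, str]]:
    """Run both stages; every chunk keeps the title of its parent section."""
    chunks: List[Dict[str, str]] = []
    for section in split_by_headings(markdown):
        for piece in split_by_size(section["text"], max_chars, overlap):
            chunks.append({"title": section["title"], "text": piece})
    return chunks


if __name__ == "__main__":
    # Made-up sample text, not real deposition documentation.
    sample = "# Overview\nShort intro.\n## Details\n" + "Longer body text. " * 10
    for chunk in two_stage_chunks(sample, max_chars=60, overlap=10):
        print(chunk["title"], "|", len(chunk["text"]))
```

Keeping the section title on each chunk is a design choice of this sketch: it lets a retrieved fragment carry some of its surrounding context into the prompt.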
Entities
Institutions
- RCSB PDB
- Protein Data Bank (PDB)
- wwPDB