ARTFEED — Contemporary Art Intelligence

RCSB PDB AI Help Desk: RAG for Protein Deposition Support

other · 2026-04-29

Researchers have built an AI Help Desk that uses Retrieval-Augmented Generation (RAG) to support structural biologists depositing protein structures in the Protein Data Bank (PDB). Built on LangChain with a pgvector store (PostgreSQL) and GPT-4.1-mini, the system is meant to ease the workload of RCSB PDB biocurators, who process over 40% of global depositions. In 2025, the Help Desk received approximately 19,000 messages concerning about 8,000 entries. The RAG pipeline uses pymupdf4llm to extract PDF content as Markdown, applies two-stage document chunking, and retrieves passages with Maximal Marginal Relevance; a topical guardrail filters off-topic queries, and a tailored system prompt prevents exposure of internal terminology. The PDB holds over 245,000 experimentally determined 3D structures, and approximately 20 expert biocurators across the wwPDB validate and biocurate incoming data.
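
The report does not say what the two chunking stages are. As a rough illustration only, the sketch below assumes that the first stage splits the extracted Markdown at headings and the second packs paragraphs into size-limited chunks. The function names and the `max_chars` limit are hypothetical and are not taken from the Help Desk code.

```python
import re


def split_by_headings(markdown: str) -> list[str]:
    """Stage 1 (assumed): start a new section at every Markdown heading line."""
    sections, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def split_by_size(section: str, max_chars: int = 500) -> list[str]:
    """Stage 2 (assumed): pack whole paragraphs into chunks up to max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for para in section.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def two_stage_chunks(markdown: str, max_chars: int = 500) -> list[str]:
    """Run both stages: heading-level sections, then size-limited chunks."""
    return [
        chunk
        for section in split_by_headings(markdown)
        for chunk in split_by_size(section, max_chars)
    ]


if __name__ == "__main__":
    doc = "# A\n\npara one\n\npara two\n\n## B\n\npara three"
    for chunk in two_stage_chunks(doc, max_chars=15):
        print(repr(chunk))
```

Splitting on headings first keeps each chunk inside one topical section, which is one plausible reason to preserve Markdown structure during PDF extraction.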

Key facts

  • AI Help Desk uses Retrieval-Augmented Generation (RAG) for protein structure deposition support.
  • System built on LangChain with pgvector store (PostgreSQL) and GPT-4.1-mini.
  • RCSB PDB biocurators process over 40% of global depositions.
  • Approximately 19,000 messages concerning about 8,000 entries were received in 2025.
  • Over 245,000 experimentally determined 3D structures in the PDB.
  • ~20 expert biocurators across the wwPDB validate and biocurate incoming data.
  • System uses pymupdf4llm for PDF extraction, two-stage chunking, and Maximal Marginal Relevance retrieval.
  • Topical guardrail filters off-topic queries; system prompt prevents exposure of internal terminology.
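
Maximal Marginal Relevance, named in the key facts above, balances relevance to the query against redundancy among the passages already chosen. The following is a minimal pure-Python sketch of the general technique, not the Help Desk's implementation. The toy vectors, the function names, and the `lambda_mult` weighting parameter are illustrative assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick k document indices by Maximal Marginal Relevance.

    lambda_mult = 1.0 ranks purely by relevance to the query; lower values
    penalize similarity to documents that were already selected.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))

    def score(i: int) -> float:
        # Redundancy is the highest similarity to anything picked so far.
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates
    print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # relevance only
    print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # favors diversity
```

In this toy example, the relevance-only setting returns the two near-duplicate documents, while the lower `lambda_mult` swaps the second one for the more distinct document.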

Entities

Institutions

  • RCSB PDB
  • Protein Data Bank (PDB)
  • wwPDB
