How to Build a Multimodal RAG Application with Gemini API File Search: A Step-by-Step Developer Guide

Introduction

If you're building a retrieval-augmented generation (RAG) application that handles both text and images, the Gemini API File Search tool now makes it easier than ever. With the addition of Gemini Embedding 2, images like charts, product photos, and diagrams can be natively indexed and searched in the same store as your text documents—no separate OCR pipeline needed.

In this step-by-step guide, you'll learn how to set up a multimodal File Search store, upload documents and images, perform queries with grounded generation, and extract image citations from the results. By the end, you'll have a fully functional RAG system that returns both text and visual answers with source references.

What You Need

- A Gemini API key (available from Google AI Studio)
- Python 3.9 or later with the google-genai SDK installed (pip install google-genai)
- A few sample documents and images to index (e.g., PDFs, JPEGs, PNGs)

Step-by-Step Instructions

Step 1: Create a File Search Store

A File Search Store is a managed, persistent container for your document embeddings. Think of it as a vector database that the API handles for you—chunking, embedding, indexing, and retrieval are all automated.

To enable multimodal search, you must specify gemini-embedding-2 as the embedding model. If you omit this parameter, the default gemini-embedding-001 (text-only) is used, and you cannot change it later. Use the following Python code:

from google import genai
from google.genai import types

client = genai.Client()

file_search_store = client.file_search_stores.create(
    config={
        "display_name": "product-catalog",
        "embedding_model": "models/gemini-embedding-2"
    }
)
print(f"Created store: {file_search_store.name}")

Once created, the store is ready to accept files. Note that display_name is only a human-readable label; the store's returned name (e.g., projects/.../fileSearchStores/...) is what you'll use in subsequent steps.
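If you persist store references between sessions, it can help to pull the bare ID out of the full resource name. A minimal sketch, assuming the resource name's last path segment is the store ID (the exact format shown here is illustrative, not guaranteed by the API):

```python
def store_id(resource_name: str) -> str:
    """Extract the final path segment of a File Search store resource name."""
    # Resource names end in ".../fileSearchStores/<id>" (example format assumed);
    # splitting on the last "/" yields the bare ID.
    return resource_name.rsplit("/", 1)[-1]

print(store_id("fileSearchStores/product-catalog-abc123"))  # product-catalog-abc123
```

Storing just the ID (or the full name) in your app's config lets you reuse the same store across runs instead of re-creating and re-indexing it.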

Step 2: Upload Documents and Images

Next, upload your files (PDFs, images, etc.) to the Gemini API and associate them with your store. The API automatically chunks and indexes each file using the embedding model you chose.

First, upload a file using the client, then add it to the store. Here's an example for an image:

# Upload a file (PDF or image) to the Gemini API
image_file = client.files.upload(
    file="path/to/product_photo.jpg",
    config={"display_name": "Product Photo A"}
)

# Associate the file with your File Search store
client.file_search_stores.add_file(
    file_search_store=file_search_store.name,
    file=image_file
)

Repeat for each file you want to index. You can mix PDFs and images in the same store—the model handles both formats. For optimal results, ensure your images are clear and contain visual information the model can embed (avoid text-heavy images that rely solely on OCR).
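When indexing a whole directory, it's worth filtering to the formats you intend to upload before calling the API. A small sketch of that pre-filtering step; the SUPPORTED_SUFFIXES set below is an assumption for illustration, not an exhaustive list of what the API accepts:

```python
from pathlib import Path

# Extensions we plan to upload; adjust to the formats your store should contain.
SUPPORTED_SUFFIXES = {".pdf", ".jpg", ".jpeg", ".png", ".webp"}

def indexable_files(paths):
    """Return only the paths whose extension (case-insensitive) we plan to upload."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_SUFFIXES]

files = ["catalog.pdf", "photo_a.JPG", "notes.txt", "diagram.png"]
print(indexable_files(files))  # ['catalog.pdf', 'photo_a.JPG', 'diagram.png']
```

You would then loop over the filtered list, calling client.files.upload and add_file for each entry as shown above.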

Step 3: Query with Grounded Generation

Now you can ask questions that leverage both text and images from your store. Use the file_search tool in generate_content to let the model automatically retrieve relevant chunks and produce a grounded response.

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What are the key features shown in the product photo?",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                file_search=types.FileSearch(
                    stores=[
                        types.FileSearchStore(
                            name=file_search_store.name
                        )
                    ]
                )
            )
        ]
    )
)
print(response.text)

The model will search the store, retrieve relevant text and image context, and generate an answer that references the source files. If your query is about an image, the model uses the embeddings to match the visual content—without needing OCR.

Step 4: Retrieve Image Citations

One of the key benefits of the File Search tool is built-in citations. When the model uses a specific file (especially images), the response includes grounding metadata with downloadable references. To extract image citations from the response object, use:

# Grounding metadata lives on the candidate, not the top-level response
candidate = response.candidates[0]
if candidate.grounding_metadata:
    for chunk in candidate.grounding_metadata.grounding_chunks:
        ctx = chunk.retrieved_context
        if ctx:
            # Each chunk's retrieved_context carries the source file's URI and metadata
            print(f"File: {ctx.uri}")
            print(f"Title: {ctx.title}")
            # For images, a temporary download URL may be provided
            if ctx.mime_type and ctx.mime_type.startswith("image/"):
                print(f"Download URL: {ctx.signed_url}")

This code iterates through grounding chunks and prints details for each source file. The signed_url (if present) provides a temporary, authenticated link to download the image directly from the API's storage.
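If you render citations in a UI, it helps to pull this logic into a small helper that returns structured data instead of printing. A sketch using SimpleNamespace to mock the metadata shape described above (the mock field names mirror the snippet's assumptions, not a guaranteed response schema):

```python
from types import SimpleNamespace

def image_citations(grounding_metadata):
    """Collect (title, uri) pairs for image sources in grounding metadata."""
    citations = []
    for chunk in grounding_metadata.grounding_chunks:
        ctx = chunk.retrieved_context
        if ctx and (ctx.mime_type or "").startswith("image/"):
            citations.append((ctx.title, ctx.uri))
    return citations

# Mock metadata fragment mimicking the fields used above, for local testing
meta = SimpleNamespace(grounding_chunks=[
    SimpleNamespace(retrieved_context=SimpleNamespace(
        title="Product Photo A", uri="files/abc123", mime_type="image/jpeg")),
    SimpleNamespace(retrieved_context=SimpleNamespace(
        title="Spec Sheet", uri="files/def456", mime_type="application/pdf")),
])
print(image_citations(meta))  # [('Product Photo A', 'files/abc123')]
```

Keeping the extraction pure like this makes it easy to unit-test without hitting the API.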

Conclusion

By following these steps, you have built a powerful multimodal RAG endpoint that goes beyond text-only search. The Gemini API does the heavy lifting—chunking, embedding, indexing, and retrieval—so you can focus on crafting great user experiences.
