Today, we’re excited to announce our new integration with NVIDIA NIM/NeMo. In this blog post, we present a solution concept for an interactive chatbot based on a Retrieval Augmented Generation (RAG) architecture with Couchbase Capella as the vector database. The retrieval and generation phases of the RAG pipeline are accelerated by NVIDIA NIM/NeMo with just a few lines of code.
Enterprises across various verticals strive to offer the best customer service to their customers. To achieve this, they are arming their frontline workers such as ER nurses, store sales associates, and help desk representatives, with AI-powered interactive question-and-answer (QA) chatbots to retrieve relevant and up-to-date information quickly.
Chatbots are usually based on RAG, an AI framework that retrieves facts from the enterprise’s knowledge base to ground LLM responses in the most accurate and recent information. It involves three distinct phases: retrieval of the most relevant context using vector search, augmentation of the user’s query with that context, and, finally, generation of a relevant response using an LLM.
Existing RAG pipelines have three main problems. First, calls to the embedding service in the retrieval phase, which convert user prompts into vectors, can add significant latency, slowing down applications that require interactivity. Second, vectorizing a document corpus consisting of millions of PDFs, docs, and other knowledge bases can take a long time, increasing the likelihood of using stale data for RAG. Third, users find it challenging to accelerate inference (tokens/sec) cost-efficiently to reduce the response time of their chatbot applications.
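The three phases can be sketched in a few lines of plain Python. This is a minimal illustration, not the code of the chatbot described in this post: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and `llm` is any callable you supply. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge-base passages, used only for illustration.
DOCS = [
    "The return window for shoes is 30 days.",
    "Store hours are 9am to 9pm on weekdays.",
    "Gift cards never expire.",
]


def embed(text):
    """Toy embedding: term counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Phase 1: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def augment(query, context):
    """Phase 2: combine the retrieved context and the query into one prompt."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def generate(prompt, llm):
    """Phase 3: pass the augmented prompt to an LLM (any callable here)."""
    return llm(prompt)


if __name__ == "__main__":
    question = "What is the return window for shoes?"
    prompt = augment(question, retrieve(question, DOCS, k=1))
    # A stub callable takes the place of a real LLM endpoint.
    print(generate(prompt, lambda p: "Stub reply for prompt:\n" + p))
```

In a production pipeline, each of the three functions would delegate to a service: an embedding model and a vector index for retrieval, and a hosted or self-hosted LLM for generation.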
Figure 1 depicts a performant stack that will enable you to easily develop an interactive customer service chatbot. It consists of the Streamlit application framework, LangChain for orchestration, Couchbase Capella for indexing and searching vectors, and NVIDIA NIM/NeMo for accelerating the retrieval and generation stages.
Couchbase Capella, a high-performance database-as-a-service (DBaaS), allows you to get started quickly with storing, indexing, and querying operational, vector, text, time series, and geospatial data while leveraging the flexibility of JSON. You can add Capella vector search or semantic search to your production RAG pipeline through an orchestration framework such as LangChain or LlamaIndex, without the need for a separate vector database. Capella also offers hybrid search, which blends vector search with traditional search to improve search performance significantly. Further, you can extend vector search to the edge using Couchbase Mobile for edge AI use cases.
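To make the idea of blending vector search with traditional search concrete, here is a generic sketch of weighted score fusion. It is an assumption for illustration only and does not describe how Capella scores hybrid queries internally: the function names, the min-max normalization, and the `alpha` weight are all hypothetical choices.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, alpha=0.5):
    """Rank documents by a weighted blend of vector and keyword scores.

    alpha=1.0 uses only the vector scores; alpha=0.0 uses only the keyword
    scores. A document missing from one result set contributes 0 for it.
    """
    v = min_max_normalize(vector_scores)
    t = min_max_normalize(keyword_scores)
    combined = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * t.get(doc_id, 0.0)
        for doc_id in set(v) | set(t)
    }
    # Sort by descending blended score, breaking ties by document ID.
    return sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))


if __name__ == "__main__":
    # Hypothetical scores from a vector query and a keyword query.
    vector_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
    keyword_scores = {"b": 5.0, "c": 4.0}
    print(hybrid_rank(vector_scores, keyword_scores))
```

Document `b` ranks first in this example because it scores well in both result sets, while `a` appears only in the vector results.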
Once you have configured Capella Vector Search, you can proceed to choose a performant model from the NVIDIA API Catalog, which offers a broad spectrum of foundation models that span open-source, NVIDIA AI foundation, and custom models, optimized to deliver the best performance on NVIDIA accelerated infrastructure. These models are deployed as NVIDIA NIM either on-prem or in the cloud using easy-to-use prebuilt containers via a single command. NeMo Retriever, a part of NVIDIA NeMo, offers information retrieval with the lowest latency, highest throughput, and maximum data privacy.
The chatbot that we have developed using the aforementioned stack will allow you to upload your PDF documents and ask questions interactively. It uses NV-QA-Embed, a GPU-accelerated text embedding model used for question-answer retrieval, and Llama 3 – 70B, which is packaged as a NIM and accelerated on NVIDIA infrastructure. The langchain-nvidia-ai-endpoints package contains LangChain integrations for building applications with models on NVIDIA NIM. Although we have used NVIDIA-hosted endpoints for prototyping purposes, we recommend that you consider using self-hosted NIM by referring to the NIM documentation for production deployments.
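As a rough sketch of how the langchain-nvidia-ai-endpoints package can be called, the snippet below wraps its `NVIDIAEmbeddings` and `ChatNVIDIA` classes. It is not the application's actual code, which is available in the Couchbase Developer Portal. The model identifiers, the helper names, and the prompt wording are assumptions; check the NVIDIA API Catalog for the exact model names available to your account. The package imports are deferred into the functions, so the prompt helper works even without the package installed, and an `NVIDIA_API_KEY` environment variable is expected for the hosted endpoints.

```python
import os

# Assumed model identifiers; verify them against the NVIDIA API Catalog.
EMBEDDING_MODEL = "NV-Embed-QA"
CHAT_MODEL = "meta/llama3-70b-instruct"


def build_prompt(question, passages):
    """Combine retrieved passages and the user's question into one prompt."""
    context = "\n\n".join(passages)
    return (
        "Use only the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def require_api_key():
    """Fail early with a clear message if the hosted-endpoint key is missing."""
    if not os.environ.get("NVIDIA_API_KEY"):
        raise RuntimeError("Set NVIDIA_API_KEY before calling NVIDIA-hosted endpoints.")


def embed_question(question):
    """Convert a question into a vector with an NVIDIA-hosted embedding model."""
    require_api_key()
    from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

    embedder = NVIDIAEmbeddings(model=EMBEDDING_MODEL)
    return embedder.embed_query(question)


def answer(question, passages):
    """Generate a grounded answer with an NVIDIA-hosted chat model."""
    require_api_key()
    from langchain_nvidia_ai_endpoints import ChatNVIDIA

    llm = ChatNVIDIA(model=CHAT_MODEL)
    return llm.invoke(build_prompt(question, passages)).content
```

The passages passed to `answer` would come from a vector search against Capella; that retrieval step is omitted here to keep the sketch short.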
You can use this solution to support use cases that require quick information retrieval such as:
- Enabling ER nurses to speed up triage through quick access to relevant healthcare information, helping to alleviate overcrowding, long waits for care, and poor patient satisfaction.
- Helping customer service agents discover relevant knowledge quickly via an internal knowledge-base chatbot to reduce caller wait times. This will not only help boost CSAT scores but also allow for managing high call volumes.
- Helping in-store sales associates quickly discover and recommend catalog items similar to the picture or description of an item that a shopper requested but that is currently out of stock (stockout), improving the shopping experience.
 
In conclusion, you can develop an interactive GenAI application, like a chatbot, with grounded and relevant responses using Couchbase Capella-based RAG and accelerate it using NVIDIA NIM/NeMo. This combination provides scalability, reliability, and ease of use. In addition to deploying alongside Capella for a DBaaS experience, NIM/NeMo can be deployed with on-prem or self-managed Couchbase in public clouds within your VPC for use cases that have stricter requirements for security and privacy. Additionally, you can use NeMo Guardrails to control the output of your LLM for content that your company deems objectionable.
The details of the chatbot application can be found in the Couchbase Developer Portal along with the complete code. Please sign up for a Capella trial account and a free NVIDIA NIM account, and start developing your GenAI application.

