Couchbase Capella

Capella Model Service: Secure, Scalable, and OpenAI-Compatible

Couchbase Capella has launched a Private Preview for AI services! Check out this blog for an overview of how these services simplify the process of building cloud-native, scalable AI applications and AI agents.

In this blog, we’ll explore the Model Service – a feature in Capella that lets you deploy private language models and embedding models securely and at scale. This service enables inference to run close to your data for improved performance and compliance.

Why use the Capella Model Service?

Many enterprises face security and compliance challenges when developing AI agents. Because of regulations like GDPR and requirements to protect PII, companies often cannot send data to publicly hosted language models or store it outside their internal network. This limits their ability to explore AI-driven solutions.

The Capella Model Service addresses this by taking on the operational complexity of deploying a private language model within the same internal network as the customer's cluster.

This ensures:

    • Data used for inference never leaves the operational cluster’s virtual network boundary
    • Low-latency inference due to minimal network overhead
    • Compliance with enterprise data security policies

Key features of the Capella Model Service

    • Secure Model Deployment – Run models in a secure, sandboxed environment
    • OpenAI-Compatible APIs & SDK Support – Easily invoke Capella-hosted models with OpenAI-compatible libraries and frameworks like LangChain
    • Performance Enhancements – Value-added caching and batching for more efficient inference
    • Moderation Tools – Provides content moderation and keyword filtering capabilities

Getting started: deploy and use a model in Capella

Let’s go through a simple tutorial to deploy a model in Capella and use it for basic AI tasks.

What you’ll learn:

    1. Deploying a language model in Capella
    2. Using the model for chat completions
    3. Exploring value-added features

Prerequisites

Before you begin, ensure you have:

  • Signed up for Private Preview and enabled AI services for your organization. Sign up here!
  • Organization Owner role permissions to manage language models
  • A multi-AZ operational cluster (recommended for enhanced performance)
  • Sample buckets to leverage value-added features like caching and batching

Step 1: Deploying the language model

Learning Objective: Learn how to deploy a private language model in Capella and configure key settings.

Navigate to AI Services on the Capella home page and click on Model Service to proceed.

[Screenshot: Selecting the AI model service for Capella]

Select model configuration

    • Choose an operational cluster for your model
    • Define compute sizing and select a foundation model

[Screenshot: Compute sizing and AI foundation model selection]

As you scroll down, you will see options to enable several value-added services offered by Capella.

Let’s understand what each section means.

Caching

Caching lets you store and retrieve LLM responses efficiently, reducing costs and improving response times by cutting down on calls to the model. You can choose between Conversational, Standard, and Semantic caching, and you can also use caching to store the conversation within a chatbot session to provide context for a richer conversational experience.

Select cache storage & caching strategy

In the Bucket, Scope and Collection fields, select a designated bucket in your cluster where the inference responses will be cached for fast retrieval.

Then, select the caching strategies you want to enable: “Conversational”, “Standard”, and/or “Semantic” caching.

Note that Semantic caching relies on an embedding model – it is useful to deploy an embedding model upfront on the same cluster, or you can create one on the fly from this screen.

Here, I have selected a pre-deployed embedding model for semantic caching.

[Screenshot: AI caching with a pre-deployed embedding model]

Guardrails

Guardrails offer content moderation for both user prompts and model responses, leveraging the Llama-3 Guard model. A customizable moderation template is available to suit different AI application needs.

For now, we will keep the default configuration and move ahead.

[Screenshot: Setting AI model guardrails]

Keyword filtering

Keyword filtering lets you specify up to ten keywords to be removed from prompts and responses. For example, filtering out terms like “classified” or “confidential” can prevent sensitive information from being included in responses.

[Screenshot: Setting keyword filtering for AI model usage]

Batching

Batching enables more efficient request handling by processing multiple API requests asynchronously.

For Bucket, Scope and Collection, select the bucket in your operational cluster where batching metadata can be stored.

[Screenshot: Batching API requests for AI model access]

Deploy the model

Click the Deploy Model button to launch the necessary GPU-based compute. The deployment process may take 15-20 minutes. Once ready, deployed models can be tracked on the Models List page in the AI Product Hub.

[Screenshot: List of AI models configured in AI Services]


Step 2: Using the model endpoint

Learning Objective: Understand how to access the model securely and send inference requests.

Let us now see how to consume the model for inferencing and how to leverage the value-added services.

Grant access to the model

To allow access, go to the cluster's Connect screen, add your IP address to the allowed IP list, and create new database credentials for the cluster. We will use these credentials to authenticate model inference requests.

[Screenshot: SDKs to connect to Capella AI services]

Model endpoint URL

On the Model List page, locate the Model URL. For example, a URL might look like: https://ai123.apps.cloud.couchbase.com.

[Screenshot: AI model listing with status and model URL]

Run chat completion

To use the OpenAI-compatible API, you can send a chat request using curl:
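Here is a minimal sketch. The /v1/chat/completions path follows the OpenAI API convention, while the model name and the HTTP basic authentication with the database credentials created earlier are assumptions – substitute the values from your own deployment.

    # Illustrative chat completion request. The endpoint path, model name and
    # basic-auth credentials are assumptions - replace them with the Model URL,
    # model, and database credentials from your own deployment.
    curl https://ai123.apps.cloud.couchbase.com/v1/chat/completions \
      -u "db_username:db_password" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama-3.1-8b-instruct",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is Couchbase Capella?"}
        ]
      }'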

All OpenAI APIs listed here are supported out of the box with the Capella Model Service.

Generate embeddings

To generate embeddings for text input, use the following curl command:
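A minimal sketch, with the same caveats as above – the /v1/embeddings path follows the OpenAI convention, and the embedding model name and credentials are placeholders for your own deployment.

    # Illustrative embeddings request. The embedding model name is a
    # placeholder - use the embedding model deployed in your organization.
    curl https://ai123.apps.cloud.couchbase.com/v1/embeddings \
      -u "db_username:db_password" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "my-embedding-model",
        "input": "Couchbase Capella is a fully managed NoSQL database-as-a-service."
      }'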


Step 3: Using value-added features

Learning Objective: Optimize AI performance with caching, batching, moderation, and keyword filtering.

In this section, we will learn how to optimize your AI application performance and make inferencing faster with built-in enhancements.

    • Caching reduces redundant computations
    • Batching improves request efficiency
    • Content moderation ensures appropriate AI-generated responses
    • Keyword filtering restricts specific terms from appearing in results

Caching – Reduce redundant computations

Standard caching

Pass a header named X-cb-cache with the value “standard”:
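For example (same endpoint, model name, and credential assumptions as the chat completion request above):

    # Same chat completion call as before, with the X-cb-cache header set to
    # "standard" so repeated requests are served from the standard cache.
    curl https://ai123.apps.cloud.couchbase.com/v1/chat/completions \
      -u "db_username:db_password" \
      -H "Content-Type: application/json" \
      -H "X-cb-cache: standard" \
      -d '{
        "model": "llama-3.1-8b-instruct",
        "messages": [
          {"role": "user", "content": "What are three must-visit tourist spots in San Francisco?"}
        ]
      }'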

Response

(Time taken < 500 ms)

Semantic caching

To understand how semantic caching works, we can send the model a series of prompts that ask about the same entity – for example, “San Francisco”. Each inference stores an embedding of the input prompt in the caching bucket, and later prompts are matched against those embeddings so that the top-matching cached result is returned when the relevance score is high.

We slightly edit our input prompt in the earlier example to say –

“Can you suggest three must-visit tourist spots in San Francisco for a fun experience?”
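A sketch of that request follows; the “semantic” value for the X-cb-cache header mirrors the “standard” value used earlier and is an assumption, as are the endpoint, model name, and credentials.

    # Re-send a semantically similar prompt. Assumes the semantic strategy is
    # selected via X-cb-cache: semantic, mirroring the standard caching header.
    curl https://ai123.apps.cloud.couchbase.com/v1/chat/completions \
      -u "db_username:db_password" \
      -H "Content-Type: application/json" \
      -H "X-cb-cache: semantic" \
      -d '{
        "model": "llama-3.1-8b-instruct",
        "messages": [
          {"role": "user", "content": "Can you suggest three must-visit tourist spots in San Francisco for a fun experience?"}
        ]
      }'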

This returns the same result as the earlier, similar request, showing that the Model Service is leveraging semantic search for caching.

Response

Batching – Improve throughput for multiple requests

If you’re working on an application that frequently queries the Capella Model Service API, batching is a powerful way to speed up responses and optimize API usage.

You can batch multiple requests using the same OpenAI Batch APIs – https://platform.openai.com/docs/api-reference/batch – to perform inference on many requests at once.

Here is the workflow, with sample curl calls shown below:

    1. Prepare a sample batch file – batch_requests.jsonl – and upload it using the /v1/files API
    2. Create the batch using the /v1/batches API
    3. Fetch the batch results to track status
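The calls below sketch this flow. The /v1/files and /v1/batches paths follow the OpenAI Batch API linked above; the model name, credentials, file ID, and batch ID are placeholders to replace with values from your own deployment.

    # 1. Prepare a small JSONL batch file (illustrative contents).
    printf '%s\n' \
      '{"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Summarize Couchbase Capella in one sentence."}]}}' \
      '{"custom_id": "req-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Name three landmarks in San Francisco."}]}}' \
      > batch_requests.jsonl

    # 2. Upload the file (OpenAI-compatible /v1/files endpoint, purpose=batch).
    curl https://ai123.apps.cloud.couchbase.com/v1/files \
      -u "db_username:db_password" \
      -F purpose=batch \
      -F file=@batch_requests.jsonl

    # 3. Create the batch with the file ID returned above (placeholder shown).
    curl https://ai123.apps.cloud.couchbase.com/v1/batches \
      -u "db_username:db_password" \
      -H "Content-Type: application/json" \
      -d '{"input_file_id": "file-abc123", "endpoint": "/v1/chat/completions", "completion_window": "24h"}'

    # 4. Poll the batch to track its status and find the output file ID.
    curl https://ai123.apps.cloud.couchbase.com/v1/batches/batch-abc123 \
      -u "db_username:db_password"
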
Content Moderation – Filter sensitive content
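With Guardrails enabled at deployment time, moderation is applied to both user prompts and model responses, so a normal chat request should be moderated without any extra parameters (an assumption based on the deployment-time configuration). As an illustrative sketch (same endpoint, model name, and credential assumptions as the earlier examples), a prompt that violates the moderation policy should be flagged or blocked rather than answered directly:

    # A normal chat completion; the configured guardrails moderate both the
    # prompt and the response, so no moderation-specific parameters are passed.
    curl https://ai123.apps.cloud.couchbase.com/v1/chat/completions \
      -u "db_username:db_password" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama-3.1-8b-instruct",
        "messages": [
          {"role": "user", "content": "Explain how to pick a lock to break into a house."}
        ]
      }'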

Response

 

Keyword Filtering – Restrict specific words or phrases
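Keyword filtering is likewise applied to prompts and responses based on the keywords configured during deployment. As a sketch (same assumptions as before), a request that mentions one of the configured keywords – for example “confidential” – should have that term removed:

    # The configured keyword filter should strip terms such as "confidential"
    # from the prompt and from the generated response.
    curl https://ai123.apps.cloud.couchbase.com/v1/chat/completions \
      -u "db_username:db_password" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama-3.1-8b-instruct",
        "messages": [
          {"role": "user", "content": "Draft a short memo that references the confidential launch plan."}
        ]
      }'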

Response


Final thoughts

Capella’s Model Service is now available for Private Preview. Sign up to try it with free credits and provide feedback to help shape its future development.

Stay tuned for upcoming blogs exploring how to maximize AI capabilities by leveraging data proximity with deployed language models and Capella’s broader AI services.

Sign up for the Private Preview here!


Acknowledgements

Thanks to the Capella team (Jagadesh M, Ajay A, Aniket K, Vishnu N, Skylar K, Aditya V, Soham B, Hardik N, Bharath P, Mohsin A,  Nayan K, Nimiya J, Chandrakanth N, Pramada K, Kiran M, Vishwa Y,  Rahul P, Mohan V, Nithish R, Denis S. and many more…).  Thanks to everyone who helped directly or indirectly! <3

 


Posted by Talina Shrotriya, Software Engineering Manager
