Talk to Your Data: A UDF That Speaks Your Language

SELECT u.name, COUNT(o.id) AS total_orders
FROM `commerce`.sales.users AS u
JOIN `commerce`.sales.orders AS o ON u.id = o.user_id
WHERE o.status = "completed"
AND DATE_DIFF_STR(NOW_STR(), o.order_date, "day") <= 30
GROUP BY u.name
ORDER BY total_orders DESC
LIMIT 5;

SELECT u.name, COUNT(o.id) AS total_orders

FROM `commerce`.sales.users AS u

JOIN `commerce`.sales.orders AS o ON u.id = o.user_id

WHERE o.status = "completed"

AND DATE_DIFF_STR(NOW_STR(), o.order_date, "day") <= 30

GROUP BY u.name

ORDER BY total_orders DESC

LIMIT 5;

The query above provides valuable insights from your data that’s stored in Couchbase about your top five users who generated the most completed orders within the past 30 days. But what if you’re not an advanced SQL++ developer and need the answers by 11 p.m. for a report? You then need to wait for a developer to write a SQL++ query and get you the answers.

Alternatively, consider a case where you need to do some ad hoc debugging to address questions like:

Are there any documents where the date the order was delivered is missing?
Does that mean that the order was cancelled? Or did we misplace the order and the order never got delivered? Or was everything ok, but we simply missed adding the order_delivered value in the field?

In this case, you not only need to search the order_delivered field, but also look at order_cancelled or investigate comments to figure out if it was misplaced, etc. So the query to be written isn’t simple or straightforward.

SELECT 
    o.orderId,
    o.orderDate,
    o.order_cancelled,
    o.order_delivered,
    o.comments,
    
    CASE
        WHEN o.order_cancelled = TRUE THEN "Order was cancelled"
        WHEN ANY c IN o.comments SATISFIES LOWER(c) LIKE "%misplac%" 
                                    OR LOWER(c) LIKE "%lost%" THEN "Order may have been misplaced"
        WHEN ANY c IN o.comments SATISFIES LOWER(c) LIKE "%deliver%" THEN "Delivered but field not updated"
        ELSE "Reason unknown — investigate"
    END AS reason
FROM `commerce`.`sales`.`orders` AS o
WHERE o.order_delivered IS MISSING OR o.order_delivered IS NULL;

SELECT

o.orderId,

o.orderDate,

o.order_cancelled,

o.order_delivered,

o.comments,

CASE

WHEN o.order_cancelled = TRUE THEN "Order was cancelled"

WHEN ANY c IN o.comments SATISFIES LOWER(c) LIKE "%misplac%"

OR LOWER(c) LIKE "%lost%" THEN "Order may have been misplaced"

WHEN ANY c IN o.comments SATISFIES LOWER(c) LIKE "%deliver%" THEN "Delivered but field not updated"

ELSE "Reason unknown — investigate"

END AS reason

FROM `commerce`.`sales`.`orders` AS o

WHERE o.order_delivered IS MISSING OR o.order_delivered IS NULL;

In such cases, it would help if you had a reliable assistant available 24×7 to get all these answers. The UDF described in this blog is such an assistant. It accepts your questions in the most natural way and returns results in JSON. Behind the scenes, it connects to a model of your choice, along with your API key, to convert your thoughts into SQL++ and then executes it. And all you need to invoke this assistant is to use the UDF.

SELECT NL2SQL(
  [“`commerce`.`sales`.`orders`”],
  "Are there any documents where the order_delivered date is missing?and if so why?",
  "",
  "https://api.openai.com/v1/chat/completions",
  "gpt-4o-2024-05-13"
) ;

SELECT NL2SQL(

[“`commerce`.`sales`.`orders`”],

"Are there any documents where the order_delivered date is missing?and if so why?",

"",

"https://api.openai.com/v1/chat/completions",

"gpt-4o-2024-05-13"

) ;

How It Works

1. Set up the library.
You first create a JavaScript library used by the UDF.

Library:

/*
input:

keyspaces: an array of strings, each string represents a keyspaces "bucket.scope.collection" with proper escaping using grave-accent quote wherever required
prompt: users natural language request
apikey: your openai api key
model: string representing the model's name, see https://platform.openai.com/docs/api-reference/completions/create#completions-create-model for more details
 
output:
chat-completions api response with the generated sql statement
*/
 
 
 
function inferencer(k) {
   var infq = N1QL("SELECT t.properties FROM(INFER "+k+ ") as t") ;
   var res=[]
   for(const doc of infq) {
       res.push(doc)
   }
  
    return res[0];
}
 
 
function nl2sql(keyspaces, prompt, apikey, modelapi, model) {
  
   
   collectionSchema = {}
   for(const k in keyspaces) {
       c = inferencer(keyspaces[k])
       collectionSchema[keyspaces[k]] = c
   }
  
   collectionSchemaStr = JSON.stringify(collectionSchema)
  
   promptContent = `Information:\nCollection's schema: ${collectionSchemaStr}\n\nPrompt: \"${prompt}\"\n\nThe query context is set.\n\nBased on the above Information, write valid SQL++ and return only the statement and no explanation. For retrieval, use aliases. Use UNNEST from the FROM clause when appropriate. \n\nIf you're sure the Prompt can't be used to generate a query, first say \"#ERR:\" and then explain why not.`
  
    
    data = {"messages":[{"role":"system","content":"You are a Couchbase Capella expert. Your task is to create valid queries to retrieve or create data based on the provided Information.\n\nApproach this task step-by-step and take your time."},{"role":"user","content":promptContent}],
        "model": model,
        "temperature":0,
        "max_tokens":1024,
        "stream":false}

    
    
var dataStr = JSON.stringify(data)
    .replace(/\\/g, "\\\\")    // escape backslashes
    .replace(/"/g, '\\"');     // escape quotes
   
    var completionsurl = modelapi
  
   var q= `SELECT CURL("${completionsurl}", {
        "request": "POST",
        "header": ["Authorization: Bearer ${apikey}", "Content-type: application/json"],
        "data": "${dataStr}"
    }) AS result;`
   
    var completionsq = N1QL(q);
       
    
 
    var res = []
   for(const doc of completionsq) {
       res.push(doc);
     }
   
  try {
    content = res[0].result.choices[0].message.content    
  } catch(e) {
    return res;  
  } 
   
  stmt = content.trim().substring(7, content.length-4)
   
  isSelect = (stmt.substring(0,6).toLowerCase())==="select"
  if(isSelect === false){
      return {
          "generated_statement": stmt
      }
  }
  
  var runq = N1QL(stmt);
   
   var rrun = []
  for(const doc of runq) {
      rrun.push(doc)
  }
  
  return {
      "generated_statement": stmt,
      "results": rrun
  }
    
}

input:

keyspaces: an array of strings, each string represents a keyspaces "bucket.scope.collection" with proper escaping using grave-accent quote wherever required

prompt: users natural language request

apikey: your openai api key

model: string representing the model's name, see https://platform.openai.com/docs/api-reference/completions/create#completions-create-model for more details

output:

chat-completions api response with the generated sql statement

function inferencer(k) {

var infq = N1QL("SELECT t.properties FROM(INFER "+k+ ") as t") ;

var res=[]

for(const doc of infq) {

res.push(doc)

}

return res[0];

}

function nl2sql(keyspaces, prompt, apikey, modelapi, model) {

collectionSchema = {}

for(const k in keyspaces) {

c = inferencer(keyspaces[k])

collectionSchema[keyspaces[k]] = c

}

collectionSchemaStr = JSON.stringify(collectionSchema)

promptContent = `Information:\nCollection's schema: ${collectionSchemaStr}\n\nPrompt: \"${prompt}\"\n\nThe query context is set.\n\nBased on the above Information, write valid SQL++ and return only the statement and no explanation. For retrieval, use aliases. Use UNNEST from the FROM clause when appropriate. \n\nIf you're sure the Prompt can't be used to generate a query, first say \"#ERR:\" and then explain why not.`

data = {"messages":[{"role":"system","content":"You are a Couchbase Capella expert. Your task is to create valid queries to retrieve or create data based on the provided Information.\n\nApproach this task step-by-step and take your time."},{"role":"user","content":promptContent}],

"model": model,

"temperature":0,

"max_tokens":1024,

"stream":false}

var dataStr = JSON.stringify(data)

.replace(/\\/g, "\\\\") // escape backslashes

.replace(/"/g, '\\"'); // escape quotes

var completionsurl = modelapi

var q= `SELECT CURL("${completionsurl}", {

"request": "POST",

"header": ["Authorization: Bearer ${apikey}", "Content-type: application/json"],

"data": "${dataStr}"

}) AS result;`

var completionsq = N1QL(q);

var res = []

for(const doc of completionsq) {

res.push(doc);

}

try {

content = res[0].result.choices[0].message.content

} catch(e) {

return res;

}

stmt = content.trim().substring(7, content.length-4)

isSelect = (stmt.substring(0,6).toLowerCase())==="select"

if(isSelect === false){

return {

"generated_statement": stmt

}

var runq = N1QL(stmt);

var rrun = []

for(const doc of runq) {

rrun.push(doc)

}

return {

"generated_statement": stmt,

"results": rrun

}

2. Upload the library.
Run the curl command after copying the provided library code into a file, i.e., usingailib.js.

curl -X POST https://localhost:9499/evaluator/v1/libraries/usingailib 
--data-binary @usingailib.js -u Administrator:password

1 2	curl -X POST https://localhost:9499/evaluator/v1/libraries/usingailib --data-binary @usingailib.js -u Administrator:password

3. Create the UDF.
Use the create function statement below to create the UDF once you have created the library:

CREATE OR REPLACE FUNCTION NL2SQL(keyspaces, prompt, apikey, modelapi, model) 
LANGUAGE JAVASCRIPT AS "nl2sql" AT "usingailib";

1 2	CREATE OR REPLACE FUNCTION NL2SQL(keyspaces, prompt, apikey, modelapi, model) LANGUAGE JAVASCRIPT AS "nl2sql" AT "usingailib";

NL2SQL() now acts as your multilingual translator between human language and Couchbase’s query engine. You simply give it some context and a natural language request, and it returns a response.

How the UDF Thinks

Under the hood, it uses your preferred model when invoking the UDF to understand your intent and generate a query that Couchbase can execute.

The advantage of using the chat completions API means you could simply plug in a model from other providers that are compliant with the same API spec. You can use your own private LLM or known ones from Open AI, Gemini, Claude, etc.

The invoked UDF requires the following information from you:

keyspaces – An array of strings, each representing a Couchbase keyspace (bucket.scope.collection).Use grave accent quotes where needed to escape special names (like travel-sample.inventory.route). This tells the UDF where to look for your data.
prompt – Your request in plain English (or any other language).
Example: “Show me all users who made a purchase in the last 24 hours.”
apikey – Your API key used for authenticating with the model endpoint.
model endpoint – e.g., Open AI compliant chat completions URL.
model – The name of the model you want to use from the provider.
e.g., “gpt-4o-2024-05-13”

There are also several available functions in the library:

inferencer()

Before generating a query, the UDF first tries to understand your data. The inferencer() helper function calls Couchbase’s INFER statement to retrieve a collection’s schema:

function inferencer(k) {
   var infq = N1QL("SELECT t.properties FROM (INFER " + k + ") AS t");
   var res = [];
   for (const doc of infq) {
       res.push(doc);
   }
   return res[0];
}

function inferencer(k) {

var infq = N1QL("SELECT t.properties FROM (INFER " + k + ") AS t");

var res = [];

for (const doc of infq) {

res.push(doc);

}

return res[0];

}

This schema is used to help the AI understand what kind of data lives inside each collection.

The main function: nl2sql()

Collects all schemas for the given keyspaces using the inferencer(). Constructs a prompt that includes: the inferred schema, your natural language query, and a Couchbase prompt to nudge the LLM.
Sends it to the LLM.
Extracts the generated SQL++ from the model’s response.
Executes it directly if it’s a SELECT statement and returns both the generated SQL++ statement and the query results.

The reason for not executing non-select statements is that you don’t want this UDF to insert, update, or delete documents in a collection without you verifying it. So the SQL++ statement lets you execute it after it’s been verified.

Example use case:

SELECT default:NL2SQL(
  ["`travel-sample`.inventory.hotel"],
  "Give me hotels in San Francisco that have free parking and free breakfast and a rating of more than 3",  "",
  "https://api.openai.com/v1/chat/completions",
  "gpt-4o-2024-05-13"
);

SELECT default:NL2SQL(

["`travel-sample`.inventory.hotel"],

"Give me hotels in San Francisco that have free parking and free breakfast and a rating of more than 3", "",

"https://api.openai.com/v1/chat/completions",

"gpt-4o-2024-05-13"

);

Result:
[{
    "$1": {
        "generated_statement": "SELECT h.name, h.address, h.city, h.state, h.country, h.free_parking, h.free_breakfast, r.ratings.Overall\nFROM `travel-sample`.inventory.hotel AS h\nUNNEST h.reviews AS r\nWHERE h.city = \"San Francisco\"\n  AND h.free_parking = true\n  AND h.free_breakfast = true\n  AND r.ratings.Overall > 3;",
        "results": [{
                "Overall": 4,
                "address": "520 Church St",
                "city": "San Francisco",
                "country": "United States",
                "free_breakfast": true,
                "free_parking": true,
                "name": "Parker House",
                "state": "California"
            },
            {
                "Overall": 4,
                "address": "520 Church St",
                "city": "San Francisco",
                "country": "United States",
                "free_breakfast": true,
                "free_parking": true,
                "name": "Parker House",
                "state": "California"
            },
            {
                "Overall": 5,
                "address": "520 Church St",
                "city": "San Francisco",
                "country": "United States",
                "free_breakfast": true,
                "free_parking": true,
                "name": "Parker House",
                "state": "California"
            },
            {
                "Overall": 4,
                "address": "520 Church St",
                "city": "San Francisco",
                "country": "United States",
                "free_breakfast": true,
                "free_parking": true,
                "name": "Parker House",
                "state": "California"
            },
            {
                "Overall": 5,
                "address": "465 Grant Ave",
                "city": "San Francisco",
                "country": "United States",
                "free_breakfast": true,
                "free_parking": true,
                "name": "Grant Plaza Hotel",
                "state": "California"
            },
            {
                "Overall": 5,
                "address": "465 Grant Ave",
                "city": "San Francisco",
                "country": "United States",
                "free_breakfast": true,
                "free_parking": true,
                "name": "Grant Plaza Hotel",
                "state": "California"
            },
...

Result:

[{

"$1": {

"generated_statement": "SELECT h.name, h.address, h.city, h.state, h.country, h.free_parking, h.free_breakfast, r.ratings.Overall\nFROM `travel-sample`.inventory.hotel AS h\nUNNEST h.reviews AS r\nWHERE h.city = \"San Francisco\"\n AND h.free_parking = true\n AND h.free_breakfast = true\n AND r.ratings.Overall > 3;",

"results": [{

"Overall": 4,

"address": "520 Church St",

"city": "San Francisco",

"country": "United States",

"free_breakfast": true,

"free_parking": true,

"name": "Parker House",

"state": "California"

{

"Overall": 4,

"address": "520 Church St",

"city": "San Francisco",

"country": "United States",

"free_breakfast": true,

"free_parking": true,

"name": "Parker House",

"state": "California"

{

"Overall": 5,

"address": "520 Church St",

"city": "San Francisco",

"country": "United States",

"free_breakfast": true,

"free_parking": true,

"name": "Parker House",

"state": "California"

{

"Overall": 4,

"address": "520 Church St",

"city": "San Francisco",

"country": "United States",

"free_breakfast": true,

"free_parking": true,

"name": "Parker House",

"state": "California"

{

"Overall": 5,

"address": "465 Grant Ave",

"city": "San Francisco",

"country": "United States",

"free_breakfast": true,

"free_parking": true,

"name": "Grant Plaza Hotel",

"state": "California"

{

"Overall": 5,

"address": "465 Grant Ave",

"city": "San Francisco",

"country": "United States",

"free_breakfast": true,

"free_parking": true,

"name": "Grant Plaza Hotel",

"state": "California"

...

Experimenting with models from other providers

The next example uses Gemini’s Open AI-compatible API. You simply change the model provider’s URL from the previous Open AI API to Gemini’s API. Also, be sure to change the model parameter to a model it recognizes. Of course, you need to also update the api-key from Open AI’s key to Gemini’s key.

SELECT NL2SQL(
  ["`travel-sample`.inventory.hotel"],
  "Show me hotels in France",
  "",
  "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions",
  "gemini-2.0-flash"
)as p;

SELECT NL2SQL(

["`travel-sample`.inventory.hotel"],

"Show me hotels in France",

"",

"https://generativelanguage.googleapis.com/v1beta/openai/chat/completions",

"gemini-2.0-flash"

)as p;

The following illustrates the result:

[{
    "p": {
        "generated_statement": "SELECT h.name AS hotel_name, h.city AS hotel_city\nFROM `travel-sample`.inventory.hotel AS h\nWHERE h.country = \"France\";",
        "results": [{
                "hotel_city": "Giverny",
                "hotel_name": "The Robins"
            },
            {
                "hotel_city": "Giverny",
                "hotel_name": "Le Clos Fleuri"
            },
 	…
	     {
                "hotel_city": "Ferney-Voltaire",
                "hotel_name": "Hotel Formule 1"
            }
        ]
    }
}]

[{

"p": {

"generated_statement": "SELECT h.name AS hotel_name, h.city AS hotel_city\nFROM `travel-sample`.inventory.hotel AS h\nWHERE h.country = \"France\";",

"results": [{

"hotel_city": "Giverny",

"hotel_name": "The Robins"

{

"hotel_city": "Giverny",

"hotel_name": "Le Clos Fleuri"

…

{

"hotel_city": "Ferney-Voltaire",

"hotel_name": "Hotel Formule 1"

}

]

}

}]

Conclusion

This blog provides a glimpse into how you can leverage AI to interact with your data in Couchbase. With this UDF, natural language querying becomes a reality – no SQL++ expertise required. It is model-agnostic and safe for production queries.

And this is just the beginning. In the future, we hope to extend it to:

Image → SQL++
Voice → SQL++
Agent-like pipelines

… all running inside Couchbase workflows.

References
Capella IQ: https://docs.couchbase.com/cloud/get-started/capella-iq/get-started-with-iq.html
Chat completions APIs:
https://platform.openai.com/docs/api-reference/chat
https://ai.google.dev/gemini-api/docs/openai#rest

Gaurav Jayaraj - Software Engineer

Share this article

Platform

Self-Managed

Services

Capabilities

Why Couchbase?

Migrate to Capella

By Use Case

By Industry

By Application Need

Popular Docs

By Developer Role

Quickstart

Resource Center

About

Partnerships

Our Services

Partners: Register a Deal

Ready to register a deal with Couchbase?

Marriott

Talk to Your Data: A UDF That Speaks Your Language

How It Works

How the UDF Thinks

Experimenting with models from other providers

Conclusion

Get Couchbase blog updates in your inbox

Author

Posted by Gaurav Jayaraj - Software Engineer

Leave a comment Cancel reply

Ready to get Started with Couchbase Capella?

Start building

Use Capella free

Get in touch