Because who has the time? (also part 1, because it took me further than I expected 😬)

Couchbase recently introduced support for Vector Search, and I have been looking for an excuse to play with it. As it turns out, there was recently a great Twitter thread about Developer Marketing. I can relate to most of what’s in there. It’s a fantastic thread. I could summarize it to make sure my teammates can get the best out of it in a short time. Like, I could write that summary manually. Or that could be the excuse I was looking for.

Let’s ask an LLM, a Large Language Model, to summarize this brilliant thread for me, and for the benefit of others. In theory, things should go as follows:

    1. Getting the tweets
    2. Transforming them into vectors thanks to an LLM
    3. Storing the tweets and vectors in Couchbase
    4. Creating an index to query them
    5. Asking something to the LLM
    6. Transforming that question into a vector
    7. Running a vector search to get some context for the LLM
    8. Creating the LLM prompt from the question and the context
    9. Getting a fantastic answer back

This is basically a RAG workflow. RAG stands for Retrieval Augmented Generation. It allows developers to build more accurate, robust LLM-based applications by providing the model with relevant context alongside the question.

Extracting Twitter Data

First things first: getting data out of Twitter. This is actually the hard part if you don’t subscribe to their API. But with some good old scraping, you can still do something decent. Probably not 100% accurate, but decent. So let’s get to it.

Firing up my favorite IDE, with the Couchbase plugin installed, I create a new Python script and start playing with twikit, a Twitter scraper library. Everything works great until I quickly get an HTTP 429 error: Too Many Requests. I have been scraping too hard. I have been caught. A couple of things can mitigate that:

    1. First, store your auth cookie in a file and reuse it, instead of frantically re-logging in like I did.
    2. Second, switch to an online IDE; it makes changing your IP address easier.
    3. Third, introduce waiting time between requests and make it random. Not sure if the random part helps, but why not, it’s easy.

The final script looks like this:
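Here’s a condensed sketch of it, assuming twikit’s synchronous client (recent releases have moved to an async API); the credentials and tweet IDs are placeholders:

    import json
    import random
    import time

    from twikit import Client  # pip install twikit

    USERNAME = 'my_username'   # placeholder credentials
    EMAIL = 'me@example.com'
    PASSWORD = 'my_password'

    client = Client('en-US')

    # Log in once, save the auth cookie to a file, and reuse it on every
    # subsequent run instead of frantically re-logging in.
    try:
        client.load_cookies('cookies.json')
    except FileNotFoundError:
        client.login(auth_info_1=USERNAME, auth_info_2=EMAIL, password=PASSWORD)
        client.save_cookies('cookies.json')

    # Placeholder IDs for the tweets of the thread.
    tweet_ids = ['1234567890123456789', '1234567890123456790']

    with open('tweets.json', 'w') as out:
        for tweet_id in tweet_ids:
            tweet = client.get_tweet_by_id(tweet_id)
            doc = {'id': tweet.id, 'text': tweet.text, 'created_at': str(tweet.created_at)}
            # One JSON object per line, comma-separated (hence the missing
            # brackets mentioned below).
            out.write(json.dumps(doc) + ',\n')
            # Random waiting time between requests to stay under the rate limit.
            time.sleep(random.uniform(2, 10))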

It was a bit painful to avoid the 429; I went through several iterations, but in the end I got something that mostly works. I just needed to add the opening and closing brackets to turn the output into a valid JSON array:
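Each entry ends up looking roughly like this (placeholder values, not the actual thread):

    [
      {"id": "1234567890123456789", "text": "A dev marketing hot take", "created_at": "…"},
      {"id": "1234567890123456790", "text": "Another dev marketing hot take", "created_at": "…"}
    ]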

Josh is obviously right: socks are at the heart of what we do in developer marketing, alongside irony.

I now have a file containing an array of JSON documents, all with dev marketing hot takes. What’s next?

Turning Tweets into Vectors

To be used by an LLM as additional context, each tweet needs to be transformed into a vector, or embedding: basically, an array of floating-point values. This is what will enable RAG, Retrieval Augmented Generation. Embeddings are not universal; every model has its own vector representation of an object (like text, audio, or video data). Being extremely lazy and unaware of what’s going on in that space, I chose OpenAI/ChatGPT. It feels like there are more models coming out every week than we had JavaScript frameworks in 2017.

Anyway, I created my OpenAI account, created an API key, and added a couple of bucks because apparently you can’t use their API if you don’t, even the free stuff. Then I was ready to transform tweets into vectors. The shortest path to getting an embedding through the API is to use curl. It will look like this:
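Something along these lines, hitting OpenAI’s /v1/embeddings endpoint (the text-embedding-ada-002 model is my pick here):

    curl https://api.openai.com/v1/embeddings \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "input": "Your tweet text goes here",
        "model": "text-embedding-ada-002"
      }'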

Here you can see that the JSON input has an input field containing the text that will be transformed into a vector, and a model field referencing the model to use for the transformation. The output gives back the vector, the model used, and API usage stats.

Fantastic, now what? Turning these into vectors is not cheap. Better to store them in a database so they can be reused later. Plus, you easily get some nice added features like hybrid search.

There are a couple of ways to go about this. There is the tedious manual way, which is great for learning. And then there are libraries and tools that make life easier. I actually went straight for LangChain thinking it would make my life easier, and it did, until I got a ‘little’ lost. So, for our collective learning benefit, let’s start with the manual way. I have an array of JSON documents; I need to vectorize their content and store it in Couchbase, and then I will be able to query them with another vector.

Loading the Tweets into a Vector Store like Couchbase

I am going to use Python because I feel like I have to get better at it, even though LangChain implementations also exist in Java and JavaScript. The first thing I want to address is how to connect to Couchbase:
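A minimal version, assuming the Couchbase Python SDK 4.x and python-dotenv for the environment variables; the variable names are my own convention:

    import os
    from datetime import timedelta

    from dotenv import load_dotenv
    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.options import ClusterOptions

    load_dotenv()  # loads CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, etc.

    def connect_to_couchbase(connection_string, db_username, db_password):
        # Authenticate and open a connection to the cluster.
        auth = PasswordAuthenticator(db_username, db_password)
        cluster = Cluster(connection_string, ClusterOptions(auth))
        # Wait until the cluster is ready before using it.
        cluster.wait_until_ready(timedelta(seconds=5))
        return cluster

    cluster = connect_to_couchbase(
        os.getenv('CB_CONNECTION_STRING'),
        os.getenv('CB_USERNAME'),
        os.getenv('CB_PASSWORD'),
    )
    bucket = cluster.bucket(os.getenv('CB_BUCKET'))
    scope = bucket.scope(os.getenv('CB_SCOPE'))
    collection = scope.collection(os.getenv('CB_COLLECTION'))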

From this code you can see the connect_to_couchbase method that accepts a connection string, a username, and a password, all provided by the environment variables loaded at the beginning. Once we have the cluster object, we can get the associated bucket, scope, and collection. If you are unfamiliar with Couchbase: a collection is similar to an RDBMS table, a scope can contain many collections, and a bucket many scopes. This granularity is useful for a variety of reasons (multi-tenancy, faster sync, backup, etc.).

One more thing before writing to the collection: we need code to transform text into vectors. Using the OpenAI client, it looks like this:
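A minimal sketch with the openai package’s v1 client:

    from openai import OpenAI

    openai_client = OpenAI()  # reads the OPENAI_API_KEY environment variable

    def generate_embedding(text):
        # Ask OpenAI for the embedding of the given text.
        response = openai_client.embeddings.create(
            input=text,
            model='text-embedding-ada-002',
        )
        return response.data[0].embedding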

This does the same job as the earlier curl call. Just make sure you have the OPENAI_API_KEY environment variable set for the client to work.

Now let’s see how to create a Couchbase document out of a JSON tweet, with the generated embedding.
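Something like this, reusing the collection and generate_embedding from above (the file name follows the scraping step):

    import json

    def store_tweet(collection, tweet):
        doc = {
            'metadata': tweet,                # the whole tweet
            'text': tweet['text'],            # the tweet text as a string
            'embedding': generate_embedding(tweet['text']),  # OpenAI embedding
        }
        # The tweet id becomes the document key; upsert updates the doc
        # or inserts it if it does not exist.
        collection.upsert(str(tweet['id']), doc)

    with open('tweets.json') as f:
        for tweet in json.load(f):
            store_tweet(collection, tweet)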

The document has three fields: metadata contains the whole tweet, text is the tweet text as a string, and embedding is the embedding generated with OpenAI. The tweet id is used as the document key, and upsert either updates the doc or inserts it if it does not exist.

If I go ahead and run this, and connect to my Couchbase server, I will see documents being created.

A screenshot of the Couchbase Capella UI showing the list of created documents

At this point I have extracted data from Twitter and uploaded it into Couchbase as one document per tweet, with the OpenAI embedding generated and stored for each. I am ready to ask questions and query for similar documents.

Running Vector Search on Tweets

And now it’s time to talk about Vector Search. How do we search for tweets similar to a given text? The first thing to do is to transform that text into a vector, or embedding. So let’s ask the question:
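Reusing the generate_embedding helper from earlier (the question itself is just an example):

    query = "What is the most important thing in developer marketing?"
    query_embedding = generate_embedding(query)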

That’s it. The query_embedding variable contains a vector representing the query. On to the query:
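Here’s a sketch using the vector search support added in Python SDK 4.2; the index name vector-index and the options are assumptions on my part:

    import couchbase.search as search
    from couchbase.options import SearchOptions
    from couchbase.vector_search import VectorQuery, VectorSearch

    # Build a search request around the query embedding, targeting the
    # 'embedding' field of the stored documents.
    search_req = search.SearchRequest.create(
        VectorSearch.from_vector_query(
            VectorQuery.create('embedding', query_embedding, num_candidates=5)
        )
    )

    # Run it against a search index called 'vector-index' on the scope.
    result = scope.search('vector-index', search_req, SearchOptions(limit=5))
    for row in result.rows():
        print(row)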

Because I want to see what I am doing, I activate the Couchbase SDK logs by setting this environment variable:
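For the Python SDK that’s PYCBC_LOG_LEVEL (check the docs for your SDK version):

    export PYCBC_LOG_LEVEL=DEBUG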

If you have been following along and everything goes well, you should get an error message!

And this is fine, because we get a QueryIndexNotFoundException. It’s looking for an index that does not exist yet, so we need to create it. You can log in to your cluster on Capella and follow along:
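For reference, the interesting part of the index definition maps the embedding field as a 1536-dimension vector, 1536 being the size of text-embedding-ada-002 embeddings. It looks roughly like this; the scope and collection names are placeholders, and the exact JSON varies by server version:

    {
      "types": {
        "my-scope.my-collection": {
          "properties": {
            "embedding": {
              "fields": [
                {
                  "name": "embedding",
                  "type": "vector",
                  "dims": 1536,
                  "similarity": "dot_product"
                }
              ]
            }
          }
        }
      }
    }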

Once you have the index, you can run the query again, and this time it returns results.

We get SearchRow objects that contain the name of the index used, the key of the matching document, its score, and a bunch of empty fields. The results are ordered by score, so the closest tweet to the given query comes first.

How do we know if it worked? The fastest thing to do is to look up the document with our IDE plugin. If you are using VSCode or any JetBrains IDE, it should be pretty easy. You can also log in to Couchbase Capella and find it there.

Or we can modify the search index to store the associated text field and metadata, and rerun the query:
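In the index definition, that means flipping store to true on those fields, something like this for text (same idea for metadata):

    {
      "name": "text",
      "type": "text",
      "store": true
    }

And then asking for the stored fields when rerunning the search:

    result = scope.search(
        'vector-index',
        search_req,
        SearchOptions(limit=5, fields=['*']),  # '*' returns every stored field
    )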


Conclusion

So it worked: Josh’s tweet about socks shows up at the top of the search results. Now you know how to scrape Twitter, transform tweets into vectors, and store, index, and query them in Couchbase. What does that have to do with LLMs and AI? More on that in the next post!

Author

Posted by Laurent Doguin, Developer Advocate, Couchbase

Laurent is a Paris-based Developer Advocate focused on helping Java developers and the French community. He writes code in Java and blog posts in Markdown. Prior to joining Couchbase he was Nuxeo’s community liaison, where he devoted his time and expertise to helping the entire Nuxeo community become more active and efficient.
