PostgreSQL as Vector database: Create LLM Apps with pgvector

Rajnikant Rakesh, November 26, 2023

Introduction

PostgreSQL, a powerful open-source relational database management system, has long been revered for its robustness, extensibility, and SQL compliance. However, in recent times, there has been a paradigm shift in the way databases are utilized, with an increasing focus on vector data processing. This has led to the emergence of PostgreSQL as a formidable player in the realm of vector databases, further strengthened by the innovative pgvector extension.

pgvector is a PostgreSQL extension that provides powerful functionalities for working with vectors in a high-dimensional space. It introduces a dedicated data type, operators, and functions that enable efficient storage, manipulation, and analysis of vector data directly within the PostgreSQL database.
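As a quick, self-contained sketch of what that looks like (the connection parameters, the items table, and the 3-dimensional vectors below are toy assumptions for illustration; the rest of this post builds a real application step by step):

import psycopg2

# Toy example only: connection parameters, table name, and 3-dimensional vectors
# are illustrative assumptions
conn = psycopg2.connect("dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))")
cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")
# pgvector distance operators: <-> Euclidean, <=> cosine distance, <#> negative inner product
cur.execute("SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 1")
print(cur.fetchone())
conn.commit()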

Tessell for PostgreSQL supports pgvector as a first-class integration

At Tessell, like any other company, we have generated a considerable amount of content in various forms, such as blogs, technical documents, FAQs, etc. Some of it is public on our website, some lives on internal Confluence pages and cloud drives, and with that comes the problem of too much information: anyone new has to scan through this abundance of material to figure things out.

So we at Tessell thought, why not eat our own dog food first, and decided to build an internal ChatGPT-style tool for our Sales and Marketing teams to consume content in a Q&A fashion. This has helped our Sales, Sales Engineering, and Marketing teams churn out content much more quickly.

So how did we go about doing it? This article is a step-by-step guide to building our own GPT.

Design

With the task cut out for us, we started designing the architecture of the app.

Tessell GPT consists of three major parts:

  1. Generate Vector embeddings for Content using OpenAI APIs.
  2. Storing Vector embeddings in PostgreSQL with pgvector.
  3. Retrieve embeddings from the vector database to augment LLM generation; this whole process is known as RAG (Retrieval Augmented Generation).

System Flow for Tessell GPT

Now that we have the basic components of the design laid out, let's look at a few important terms in the context of LLM apps.

How to Create and Store OpenAI Embeddings for Your Documents?

But perhaps the prior question is: why generate embeddings of your documents in the first place?

While GPT-3 and GPT-4 are trained on huge, publicly available datasets, they may be missing information needed to give a good answer because that information was not in the dataset used to train the model (for example, it lives in private documents or only became available recently). RAG addresses the problem of a foundation model missing the data it needs to give a correct, precise answer.

RAG’s solution is dead simple: provide additional context to the foundation model in the prompt. For example, if someone asks GPT, “What is Tessell DBaaS?” and the foundation model has never heard of Tessell, you can add context to the prompt: “Tessell DBaaS is a cloud-based platform that simplifies the setup, management, security, and scalability of relational databases such as PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. It also offers a managed service for event processing systems like Kafka. Tessell aims to be the go-to DBaaS platform for startups and enterprises by combining the benefits of cloud elasticity, utility economics, and enterprise-level governance and security.”

The foundation model can then use this context to wax eloquent about Tessell. This technique is insanely powerful: it allows you to “teach” foundation models about things only you know about and use that to create a ChatGPT++ experience for your users.

But what context do you provide to the model? If you have a library of information, how do you know what’s relevant to a given question? Enter embeddings. OpenAI embeddings are a mathematical representation of the semantic meaning of a piece of text, which makes similarity search possible.

This means that if you take a user question and calculate its embedding, you can run a similarity search against the embeddings in your library to find the most relevant information. But that requires having an embedding representation of your library. This is where PostgreSQL as a vector database (with pgvector) forms the foundation for storing these embeddings.
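To make that concrete, here is a minimal sketch of similarity search using cosine similarity (the 3-dimensional vectors and document names are made up for illustration; real text-embedding-ada-002 embeddings have 1,536 dimensions):

import numpy as np

# Toy 3-dimensional vectors; real OpenAI embeddings have 1,536 dimensions
question = np.array([0.2, 0.1, 0.7])
documents = {
    "pricing page": np.array([0.1, 0.0, 0.9]),
    "release notes": np.array([0.9, 0.3, 0.1]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The document whose embedding points in the most similar direction is the most relevant
best = max(documents, key=lambda name: cosine_similarity(question, documents[name]))
print(best)  # "pricing page"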

So with the basics out of the way, let's get moving!

Setup

Set up a Python virtual environment and install the following requirements:


openai
pandas
numpy
tiktoken
psycopg2
pgvector
python-dotenv
 

You’ll need to sign up for an OpenAI developer account and create an OpenAI API key. We recommend getting a paid account to avoid rate limiting, and setting a spending cap so you avoid any surprises on your bill.

Once you have an OpenAI API key, it’s a best practice to store it as an environment variable and then have your Python program read it.


# First, run export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY...

# Get the OpenAI API key by reading the local .env file
import os
import openai
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']
 

Generate Vector embeddings for Content using OpenAI APIs

Embeddings measure how related text strings are. First, we'll create embeddings using the OpenAI API on some text we want the LLM to answer questions on.

We first wanted to scrape all the public blogs, articles, and FAQs from tessell.com. This works as a good starting point for us to build our own LLM app.

There are various ways to scrape the data, from basic Python packages like BeautifulSoup to LLM-specific frameworks like LangChain and LlamaIndex. We wrote our own content scraper using BeautifulSoup, as we want to bring first-class ingestion interfaces to Tessell for our customers in the near future.

The scraped data was then stored in CSV files for further operation.
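As a rough sketch of what such a scraper can look like (the URL and the h1/p selectors below are illustrative assumptions, not our exact implementation; it also needs the requests and beautifulsoup4 packages in addition to the requirements listed above):

import csv
import requests
from bs4 import BeautifulSoup

# Illustrative URL list; the real scraper covers blogs, articles, and FAQs on tessell.com
urls = ["https://www.tessell.com/blogs/example-post"]

with open("blog_posts_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url", "content"])
    for url in urls:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        title = soup.find("h1").get_text(strip=True)
        # Collect the visible paragraph text as the page content
        content = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
        writer.writerow([title, url, content])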


# Load your CSV file into a pandas DataFrame
import pandas as pd

df = pd.read_csv('blog_posts_data.csv')
df.head()
 

The output looks like this:

title: What types of licensing options are available with Tessell for Oracle?
url: https://help.tessell.com/en/articles/6537778-what-types-of-licensing-options-are-available-with-tessell-for-oracle
content: As of now, Tessell supports only one type of licensing option for Oracle. Bring Your Own License (BYOL). In this licensing model, …

Once the content was available, we tokenized it and generated embeddings using the OpenAI API:


# Helper function: get embeddings for a text
def get_embeddings(text):
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text.replace("\n", " ")
    )
    embedding = response['data'][0]['embedding']
    return embedding
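Applying this helper over the scraped dataframe, together with tiktoken to count tokens per row, might look roughly like this (the column names and the cl100k_base encoding used by text-embedding-ada-002 are the assumptions here):

import tiktoken

# cl100k_base is the tokenizer used by text-embedding-ada-002
encoding = tiktoken.get_encoding("cl100k_base")

df_new = df.copy()
df_new['tokens'] = df_new['content'].apply(lambda text: len(encoding.encode(text)))
# Note: content longer than the embedding model's token limit would need to be chunked first
df_new['embeddings'] = df_new['content'].apply(get_embeddings)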
 

The new data frame with embeddings should look like this:

title: What types of licensing options are available with Tessell for Oracle?
url: https://help.tessell.com/en/articles/6537778-what-types-of-licensing-options-are-available-with-tessell-for-oracle
content: As of now, Tessell supports only one type of licensing option for Oracle. Bring Your Own License (BYOL). In this licensing model, you can…
tokens: 126
embeddings: [0.008382752537727356, -0.019573453813791275, 0.009279926307499409, -0.045913375914…]

Storing Vector embeddings in PostgreSQL with pgvector

Now that we have created embedding vectors for our blog content, the next step is to store the embedding vectors in a vector database to help us perform a fast search over many vectors.

We’ll use PostgreSQL with the pgvector extension installed as our vector database. pgvector extends PostgreSQL to handle vector data types and vector similarity search, like nearest neighbor search, which we’ll use to find the k most related embeddings in our database for a given user prompt.

Create a Tessell for PostgreSQL database and install pgvector

First, we’ll create a PostgreSQL database. You can create a Tessell for PostgreSQL database in minutes for free on AWS or Azure or use a local PostgreSQL database for this step.

Once you’ve created your PostgreSQL database, export your connection string as an environment variable, and just like the OpenAI API key, we’ll read it into our Python program from the environment file:


# Tessell for PostgreSQL database connection string
# Found under the Tessell Overview tab in the Tessell console.
import os
connection_string = os.environ['TESSELL_POSTGRESQL']
 

We then connect to our database using the popular psycopg2 Python library and install the pgvector extension as follows:


# Connect to the PostgreSQL database in Tessell using the connection string
import psycopg2

conn = psycopg2.connect(connection_string)
cur = conn.cursor()

# Install the pgvector extension
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
 

Connect to and configure your vector database

Once we’ve installed pgvector, we use the register_vector() function from the pgvector Python package to register the vector type with our connection:


# Register the vector type with psycopg2
from pgvector.psycopg2 import register_vector

register_vector(conn)
 

Once we’ve connected to the database, let’s create a table that we’ll use to store embeddings along with metadata. Our table will look as follows:

id | title | url | content | tokens | embedding
  • id represents the unique ID of each vector embedding in the table.
  • title is the blog title from which the content associated with the embedding is taken.
  • url is the blog URL from which the content associated with the embedding is taken.
  • content is the actual blog content associated with the embedding.
  • tokens is the number of tokens the embedding represents.
  • embedding is the vector representation of the content.

One advantage of using PostgreSQL as a vector database is that you can easily store metadata and embedding vectors in the same database, which is helpful for supplying the user with relevant information related to the response they receive, like links to read more or the specific parts of a blog post that are relevant to them.


# Create a table to store embeddings and metadata
table_create_command = """
CREATE TABLE embeddings (
            id bigserial primary key,
            title text,
            url text,
            content text,
            tokens integer,
            embedding vector(1536)
            );
            """

cur.execute(table_create_command)
conn.commit()
cur.close()
 

Ingest and store vector data into PostgreSQL using pgvector

Now that we’ve created the database and created the table to store the embeddings and metadata, the final step is to insert the embedding vectors into the database.

For this step, it’s a best practice to batch insert the embeddings rather than insert them one by one. Load the data frame that was generated in the first step.


# Batch insert embeddings and metadata from the dataframe into the PostgreSQL database
import numpy as np
from psycopg2.extras import execute_values

register_vector(conn)
cur = conn.cursor()
# Prepare the list of tuples to insert
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings'])) for index, row in df_new.iterrows()]
# Use execute_values to perform batch insertion
execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s", data_list)
# Commit after we insert all embeddings
conn.commit()
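Optionally, you can also add a pgvector index to speed up similarity search as the table grows (a sketch; ivfflat with vector_cosine_ops is standard pgvector syntax, and the lists value is a tuning knob to adjust to your data size):

# Optional: approximate nearest neighbor index for faster cosine-distance searches
cur.execute("""
    CREATE INDEX ON embeddings
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
""")
conn.commit()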
 

Nearest Neighbor Search Using pgvector for RAG

Given a user question, we’ll perform the following steps to use information stored in the vector database to answer their question using Retrieval Augmented Generation:

  1. Create an embedding vector for the user question.
  2. Use pgvector to perform a vector similarity search and retrieve the k nearest neighbors to the question embedding from our embedding vectors representing the blog content. In our example, we’ll use k=3, finding the three most similar embedding vectors and associated content.
  3. Supply the content retrieved from the database as additional context to the model and ask it to perform a completion task to answer the user question.

Here’s the function we use to find the three nearest neighbors to the user question. Note it uses pgvector’s <=> operator, which computes the cosine distance between two embedding vectors (cosine distance is 1 minus cosine similarity, so the smallest distance corresponds to the most similar content).


#Helper function: Get top 3 most similar documents from the database
def get_top3_similar_docs(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    # Register pgvector extension
    register_vector(conn)
    cur = conn.cursor()
    # Get the top 3 most similar documents using the KNN <=> operator
    cur.execute("SELECT content FROM embeddings ORDER BY embedding <=> %s LIMIT 3", (embedding_array,))
    top3_docs = cur.fetchall()
    return top3_docs
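Since the metadata lives in the same table, a variant that also pulls back the title and URL, for example to show "read more" links alongside the answer, is only a small change (a sketch, not part of the Tessell GPT code above):

# Sketch: retrieve metadata alongside content so the UI can link back to sources
def get_top3_similar_docs_with_sources(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    register_vector(conn)
    cur = conn.cursor()
    cur.execute(
        "SELECT content, title, url FROM embeddings ORDER BY embedding <=> %s LIMIT 3",
        (embedding_array,),
    )
    return cur.fetchall()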
 

We supply helper functions to create an embedding for the user question and to get a completion response from an OpenAI model. We use GPT-3.5, but you can use GPT-4 or any other model from OpenAI.

We also specify a number of parameters, such as the maximum number of tokens in the model response and the model temperature (which controls the randomness of the output); you can modify these to your liking:


# Helper function: get text completion from OpenAI API
# Note we're using the latest gpt-3.5-turbo-0613 model
def get_completion_from_messages(messages, model="gpt-3.5-turbo-0613", temperature=0, max_tokens=1000):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens, 
    )
    return response.choices[0].message["content"]

# Helper function: get embeddings for a text
def get_embeddings(text):
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input = text.replace("\n"," ")
    )
    embedding = response['data'][0]['embedding']
    return embedding
 

Alright, so now we have all the pieces of the puzzle. Let's put them together.

We’ll define a function to process the user input by retrieving the most similar documents from our database and passing the user input, along with the relevant retrieved context, to the OpenAI model to generate a completion response.

Note that we modify the system prompt as well in order to influence the tone of the model’s response.

We pass the content associated with the three embeddings most similar to the user input to the model using the assistant role. You can also append the additional context to the user message instead.


# Function to process input with retrieval of most similar documents from the database
def process_input_with_retrieval(user_input):
    delimiter = "```"

    #Step 1: Get documents related to the user input from database
    related_docs = get_top3_similar_docs(get_embeddings(user_input), conn)
    #print(related_docs)

    # Step 2: Get completion from OpenAI API
    # Set system message to help set appropriate tone and context for model
    system_message = f"""
    You are a friendly Tessell chatbot. \
    You can answer questions about Tessell DBaaS, its features and its use cases. \
    You respond in a concise, technically credible tone. \
    """

    # Prepare messages to pass to model
    # We use a delimiter to help the model understand where the user_input starts and ends
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"{delimiter}{user_input}{delimiter}"},
        {"role": "assistant", "content": f"Relevant Tessell case studies information: \n {related_docs[0][0]} \n {related_docs[1][0]} {related_docs[2][0]}"}
    ]

    final_response = get_completion_from_messages(messages)
    return final_response
 

This whole code was then packaged as a REST API, deployed on AWS, and forms the backend of Tessell GPT.
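A minimal sketch of such a wrapper (FastAPI, the /ask route, and the request shape here are illustrative assumptions; the actual service and deployment details differ):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    question: str

@app.post("/ask")
def ask(payload: Question):
    # Reuses the retrieval-augmented helper defined above
    return {"answer": process_input_with_retrieval(payload.question)}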

We also developed a static HTML/JS-based interface to interact with the application just like we do with ChatGPT.

This is how it looks:

Let's start with an easy one.

Let’s make it more specific.

We often get asked by customers about certain technical features of the product, while the information is available in long-form documentation that takes time to read and summarize. Let’s see what Tessell GPT does with one such question.

Pretty impressive!

Conclusion

Retrieval Augmented Generation (RAG) is a powerful method of building applications with LLMs that enables you to teach foundation models about things they were not originally trained on, like private documents or recently published information.

We covered how we created Tessell GPT to answer questions about Tessell DBaaS. We used the content from the Tessell website, internal documents, etc. to show how to create, store, and perform similarity search on OpenAI embeddings. We used Tessell for PostgreSQL and pgvector as our vector database to store and query the embeddings.

And if you’re looking for a production PostgreSQL database for your vector workloads, try Tessell for PostgreSQL.
