Getting Started With MyScale And OpenAI

Using MyScale as a vector database for OpenAI embeddings

This notebook provides a step-by-step guide on using MyScale as a vector database for OpenAI embeddings. The process includes:

  1. Using precomputed embeddings generated by the OpenAI API.
  2. Storing these embeddings in a cloud instance of MyScale.
  3. Converting a raw text query to an embedding using the OpenAI API.
  4. Using MyScale to run a nearest-neighbor search over the stored embeddings.

What is MyScale

MyScale is a database built on ClickHouse that combines vector search and SQL analytics to offer a high-performance, streamlined, and fully managed experience. It is designed to support joint queries and analyses on both structured and vector data, with comprehensive SQL support for all data processing.

Deployment options

  • Deploy a cluster and run vector search with SQL on it within two minutes using the MyScale Console.

Prerequisites

To follow this guide, you will need to have the following:

  1. A MyScale cluster deployed by following the quickstart guide.
  2. The 'clickhouse-connect' library to interact with MyScale.
  3. An OpenAI API key for vectorization of queries.

Install requirements

This notebook requires openai and clickhouse-connect, as well as a few other dependencies (wget and pandas are used in later steps). Install them with pip before continuing.
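
The original install cell is not shown here, so the following is only a minimal sketch of one way to check the environment. The package list is an assumption based on the libraries this guide mentions, and the helper names `missing_packages` and `pip_command` are hypothetical. The sketch prints a pip command rather than running it.

```python
import importlib.util
import sys

# Import name -> pip package name. This list is an assumption based on the
# libraries mentioned in this guide, not the notebook's original cell.
REQUIREMENTS = {
    "openai": "openai",
    "clickhouse_connect": "clickhouse-connect",
    "wget": "wget",
    "pandas": "pandas",
}


def missing_packages(requirements=REQUIREMENTS):
    """Return the pip names of requirements that cannot be imported."""
    return [
        pip_name
        for module, pip_name in requirements.items()
        if importlib.util.find_spec(module) is None
    ]


def pip_command(packages):
    """Build (but do not run) a pip install command for the current interpreter."""
    return [sys.executable, "-m", "pip", "install", *packages]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Run:", " ".join(pip_command(missing)))
    else:
        print("All requirements are already installed.")
```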

Prepare your OpenAI API key

To use the OpenAI API, you'll need to set up an API key. If you don't have one already, you can obtain it from OpenAI.
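
A common way to supply the key is through the `OPENAI_API_KEY` environment variable, which the openai Python library reads by default. The sketch below is an illustration rather than the notebook's original cell; the helper name `load_openai_api_key` is hypothetical.

```python
import os


def load_openai_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, failing early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before running the rest of the guide."
        )
    return key
```

Failing early here avoids a less obvious authentication error later, when the query is embedded.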

Connect to MyScale

Follow the connection details section to retrieve the cluster host, username, and password from the MyScale console, then use them to create a connection to your cluster.
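
A minimal connection sketch with `clickhouse_connect.get_client` could look like the following. The environment variable names (`MYSCALE_HOST`, `MYSCALE_PORT`, `MYSCALE_USERNAME`, `MYSCALE_PASSWORD`) and the default port 8443 are assumptions made for illustration; use the values shown in your MyScale console.

```python
import os


def myscale_connection_params(env=os.environ):
    """Collect connection settings. The variable names are hypothetical."""
    return {
        "host": env["MYSCALE_HOST"],
        "port": int(env.get("MYSCALE_PORT", "8443")),  # assumed HTTPS port
        "username": env["MYSCALE_USERNAME"],
        "password": env["MYSCALE_PASSWORD"],
    }


def connect_to_myscale(params=None):
    """Create a client for the cluster using clickhouse-connect."""
    import clickhouse_connect  # third-party; imported lazily

    return clickhouse_connect.get_client(**(params or myscale_connection_params()))
```

Reading the credentials from the environment keeps them out of the notebook itself; `client = connect_to_myscale()` then returns the client that later steps would use.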

Load data

We need to load the dataset of precomputed vector embeddings for Wikipedia articles provided by OpenAI. Use the wget package to download the dataset.
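
The download step can be sketched as below. The URL is an assumption (the text above does not state it), inferred from the CSV file name used later in this guide, so verify it before relying on it. The helper names are hypothetical.

```python
from pathlib import Path

# Assumed dataset location; not given in the text above.
EMBEDDINGS_URL = (
    "https://cdn.openai.com/API/examples/data/"
    "vector_database_wikipedia_articles_embedded.zip"
)


def archive_path(url=EMBEDDINGS_URL, dest_dir="."):
    """Local path where the archive named in the URL would be stored."""
    return Path(dest_dir) / url.rsplit("/", 1)[-1]


def download_embeddings(url=EMBEDDINGS_URL, dest_dir="."):
    """Download the archive with the wget package unless it already exists."""
    import wget  # third-party; imported lazily

    target = archive_path(url, dest_dir)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        wget.download(url, out=str(target))
    return target
```

Skipping the download when the file already exists makes it cheap to re-run the notebook.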

After the download is complete, extract the file using the zipfile package.
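
Extraction needs only the standard library. In this sketch the destination directory `../data` is an assumption, and `extract_archive` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="../data"):
    """Extract every member of the archive and return the extracted paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]
```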

Now we can load the data from vector_database_wikipedia_articles_embedded.csv into a pandas DataFrame.
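
When embeddings are stored in a CSV file they arrive as strings, so they need to be parsed back into lists of floats after loading. The sketch below assumes the columns `id`, `url`, `title`, `text`, and `content_vector`, and a `../data` location; neither is stated in the text above, so check them against the actual file.

```python
from ast import literal_eval

# Assumed column names; verify against the CSV header.
COLUMNS = ["id", "url", "title", "text", "content_vector"]


def parse_vector(cell):
    """Turn a string such as "[0.1, 0.2]" into a list of floats."""
    return [float(x) for x in literal_eval(cell)]


def load_articles(csv_path="../data/vector_database_wikipedia_articles_embedded.csv"):
    """Read the CSV and convert the embedding column from strings to lists."""
    import pandas as pd  # third-party; imported lazily

    df = pd.read_csv(csv_path, usecols=COLUMNS)
    df["content_vector"] = df["content_vector"].apply(parse_vector)
    return df[COLUMNS]
```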

Index data

We will create a SQL table called articles in MyScale to store the embeddings. The table includes a vector index that uses the cosine distance metric and a constraint on the length of the embeddings. After creating the table, insert the data into it.
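
The table definition and batched insert might be sketched as follows. The column types, the constraint and index names, the `HNSWFLAT` index type with `'metric_type=Cosine'`, and the batch size of 100 are all assumptions; MyScale's vector index syntax may differ between versions, so check it against the MyScale documentation. The `client` argument is expected to be a clickhouse-connect client, and `article_df` a pandas DataFrame with the listed columns.

```python
COLUMNS = ["id", "url", "title", "text", "content_vector"]  # assumed column names


def create_table_sql(embedding_len, table="default.articles"):
    """Build the DDL. The vector index syntax is an assumption; verify it."""
    return f"""
    CREATE TABLE IF NOT EXISTS {table}
    (
        id UInt64,
        url String,
        title String,
        text String,
        content_vector Array(Float32),
        CONSTRAINT cons_vector_len CHECK length(content_vector) = {int(embedding_len)},
        VECTOR INDEX article_content_index content_vector TYPE HNSWFLAT('metric_type=Cosine')
    )
    ENGINE = MergeTree ORDER BY id
    """


def batches(rows, size=100):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def index_articles(client, article_df, table="default.articles"):
    """Create the table, then insert the DataFrame rows in batches."""
    embedding_len = len(article_df["content_vector"].iloc[0])
    client.command(create_table_sql(embedding_len, table))
    rows = list(article_df[COLUMNS].itertuples(index=False, name=None))
    for chunk in batches(rows):
        client.insert(table, chunk, column_names=COLUMNS)
```

Taking the embedding length from the data instead of hard-coding it keeps the length constraint in step with the dataset being loaded.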

Because the vector index is built automatically in the background, we need to check its build status before searching. The index is ready once its status is Built.
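
One way to run this check is sketched below. The `system.vector_indices` table and the index name `article_content_index` are assumptions about MyScale's system tables and about how the index was named, so adjust both to your setup. `client` is expected to be a clickhouse-connect client; the polling helper is an addition for convenience.

```python
import time

COUNT_SQL = "SELECT count(*) FROM default.articles"
# Assumed system table and index name; verify both against your cluster.
STATUS_SQL = (
    "SELECT status FROM system.vector_indices "
    "WHERE name = 'article_content_index'"
)


def index_report(client):
    """Print and return the row count and the current index build status."""
    count = client.command(COUNT_SQL)
    status = client.command(STATUS_SQL)
    print(f"articles count: {count}")
    print(f"index build status: {status}")
    return count, status


def wait_until_built(client, attempts=60, delay_s=5.0, sleep=time.sleep):
    """Poll the status until it is 'Built'; return False if attempts run out."""
    for _ in range(attempts):
        if client.command(STATUS_SQL) == "Built":
            return True
        sleep(delay_s)
    return False
```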

articles count: 25000
index build status: Built

Search data

Once the data is indexed in MyScale, we can run a vector search to find similar content. First, we use the OpenAI API to generate an embedding for the query. Then we run the vector search in MyScale.
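
The two steps can be sketched as follows, assuming openai>=1.0 and a clickhouse-connect client. The embedding model name is an assumption: the query must be embedded with the same model that produced the stored vectors. The `distance()` SQL function and the column names are likewise assumptions about the MyScale setup, to be checked against its documentation.

```python
def embed_query(text, model="text-embedding-ada-002"):
    """Embed the query text. The model name is an assumption; match your data."""
    from openai import OpenAI  # third-party; reads OPENAI_API_KEY by default

    response = OpenAI().embeddings.create(input=text, model=model)
    return response.data[0].embedding


def search_sql(embedding, top_k=10, table="default.articles"):
    """Build a nearest-neighbor query ordered by distance to the embedding."""
    vector = "[" + ",".join(repr(float(x)) for x in embedding) + "]"
    return (
        f"SELECT id, url, title, distance(content_vector, {vector}) AS dist "
        f"FROM {table} ORDER BY dist LIMIT {int(top_k)}"
    )


def search(client, text, top_k=10):
    """Return the titles of the top_k articles closest to the query text."""
    result = client.query(search_sql(embed_query(text), top_k))
    return [row["title"] for row in result.named_results()]


def print_titles(titles):
    """Print a numbered list of titles, one per line."""
    for rank, title in enumerate(titles, start=1):
        print(rank, title)
```

Passing the titles returned by `search(client, ...)` for a query about Scottish battles to `print_titles` would produce a ranked list in the format shown in this section.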

1 Battle of Bannockburn
2 Wars of Scottish Independence
3 1651
4 First War of Scottish Independence
5 Robert I of Scotland
6 841
7 1716
8 1314
9 1263
10 William Wallace