Pinecone Import

Import from object storage

Note: This feature is in public preview and available only on Standard and Enterprise plans.

Scenario: Ingesting Parquet Data from S3 to Pinecone Serverless

In this scenario, we generate JSON data, embed it using the Pinecone Inference API, store the result in AWS S3 as Parquet files, and ingest those files from S3 into a Pinecone Serverless index.

Problem Overview

The goal is to move the data from S3 to Pinecone so that it can be used for future tasks such as semantic search. This process ensures that the data is efficiently searchable and retrievable by applications.

Solution Steps

  1. Generate data: Begin by generating the data that needs to be processed.

  2. Chunk data: Split the generated data into smaller, manageable chunks that can be processed and embedded effectively.

  3. Embed data: Create vector embeddings from the chunked data. These embeddings are crucial for indexing and retrieval in Pinecone.

  4. Create Parquet files: Save the vector embeddings, along with metadata, into Parquet files.

  5. Access S3 bucket: Access the S3 bucket where the Parquet files will be stored.

  6. Upload Parquet files: Upload the Parquet files containing the embeddings to the S3 bucket.

  7. Create Pinecone index: Create a Pinecone index where the embeddings will be stored. This index will allow for efficient similarity search and other tasks.

  8. Load S3 data into Pinecone index: Load the embeddings from the S3 bucket into the Pinecone index.

For additional information, see our official documentation, "Understanding Imports in Pinecone".

The data flow for the notebook is outlined below:

[Figure: Pinecone101_flow.png, the data flow diagram for this notebook]

Import required libraries

First, we need to import all the necessary libraries that will be used for phrase generation, text chunking, embedding, and uploading to S3.

[ ]
[ ]

Generate unique phrases

In this step, we create a list of adjectives, nouns, and verbs, and then randomly combine them to form 100 unique phrases. Each phrase will follow the structure: 'The [adjective] [noun] [verb] over the [adjective] [noun]'. These phrases will be stored in a Pandas DataFrame for further processing.
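
The phrase generation described above can be sketched as follows. This is a minimal illustration, not the notebook's original cell: the word lists, the fixed seed, and the helper name `generate_phrases` are assumptions made for the example.

```python
import random

# Hypothetical word lists; the notebook's actual lists are not shown.
ADJECTIVES = ["quick", "lazy", "bright", "silent", "brave", "curious"]
NOUNS = ["fox", "dog", "river", "mountain", "robot", "sparrow"]
VERBS = ["jumps", "glides", "leaps", "wanders", "races", "drifts"]


def generate_phrases(n=100, seed=42):
    """Return n distinct phrases of the form
    'The [adjective] [noun] [verb] over the [adjective] [noun]'."""
    max_unique = len(ADJECTIVES) ** 2 * len(NOUNS) ** 2 * len(VERBS)
    if n > max_unique:
        raise ValueError(f"only {max_unique} distinct phrases are possible")
    rng = random.Random(seed)
    seen = set()
    phrases = []
    while len(phrases) < n:
        phrase = (
            f"The {rng.choice(ADJECTIVES)} {rng.choice(NOUNS)} "
            f"{rng.choice(VERBS)} over the "
            f"{rng.choice(ADJECTIVES)} {rng.choice(NOUNS)}"
        )
        if phrase not in seen:  # keep only phrases not generated before
            seen.add(phrase)
            phrases.append(phrase)
    return phrases


phrases = generate_phrases()
# One dict per phrase; pandas.DataFrame(records) would turn this list
# into the DataFrame used in the later steps.
records = [{"id": i, "text": p} for i, p in enumerate(phrases)]
```

Collecting phrases in a set while sampling guarantees uniqueness without enumerating every combination up front.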

[ ]
Generating 100 unique phrases...
Unique phrases generated.

Chunk text using LangChain

Here we chunk the generated phrases into smaller pieces using LangChain's RecursiveCharacterTextSplitter. This is useful when working with large texts, as smaller chunks can be processed more efficiently.
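
A hedged sketch of the chunking step follows. The chunk size, the overlap, and the `<doc_id>-<chunk_index>` id scheme are assumptions chosen for illustration; the splitter is imported inside the function so the helper can be loaded even where LangChain is not installed.

```python
def chunk_texts(texts, chunk_size=50, chunk_overlap=10):
    """Split each text and return one row per chunk."""
    # Older LangChain releases expose the same class as
    # langchain.text_splitter.RecursiveCharacterTextSplitter.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    rows = []
    for doc_id, text in enumerate(texts):
        rows.extend(chunk_rows(doc_id, splitter.split_text(text)))
    return rows


def chunk_rows(doc_id, chunks):
    """Give every chunk a stable id of the form '<doc_id>-<chunk_index>'."""
    return [
        {"id": f"{doc_id}-{i}", "doc_id": doc_id, "chunk": chunk}
        for i, chunk in enumerate(chunks)
    ]
```

Calling `chunk_texts(list_of_phrases)` returns a list of dicts that can be loaded into a DataFrame. Keeping the source document id next to each chunk makes it possible to trace a search hit back to the original phrase.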

[ ]

Get your API key

[ ]

Initialize a Pinecone client

We load the Pinecone API key from a configuration file (config.ini) and initialize the Pinecone client. Pinecone will be used to embed the chunked text.
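
One way to read the key and build the client is sketched below. The `[pinecone]` section and `api_key` option names are assumptions about the layout of config.ini; adjust them to match your file.

```python
import configparser


def parse_api_key(ini_text, section="pinecone", option="api_key"):
    """Extract the API key from INI-formatted text.

    The section and option names are assumptions, not a fixed convention.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.get(section, option)


def load_api_key(path="config.ini"):
    """Read config.ini from disk and return the API key it contains."""
    with open(path, encoding="utf-8") as handle:
        return parse_api_key(handle.read())


def make_client(path="config.ini"):
    """Create a Pinecone client (requires the `pinecone` package)."""
    from pinecone import Pinecone

    return Pinecone(api_key=load_api_key(path))
```

Keeping the key in a file that is excluded from version control avoids hard-coding credentials in the notebook.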

[ ]

Embed text using the Pinecone Inference API

Next, we embed the chunked text using Pinecone's embedding model. We process each chunk and store the embedding values back into the DataFrame.
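
The embedding step might look like the sketch below. The model name `multilingual-e5-large` (which returns 1024-dimensional vectors), the batch size, and the `parameters` values are assumptions; `pc` is an initialized Pinecone client.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_chunks(pc, chunks, model="multilingual-e5-large", batch_size=96):
    """Embed text chunks with the Pinecone Inference API.

    Returns one list of floats per input chunk, in input order.
    The model name and batch size are assumptions.
    """
    vectors = []
    for batch in batched(chunks, batch_size):
        response = pc.inference.embed(
            model=model,
            inputs=batch,
            parameters={"input_type": "passage", "truncate": "END"},
        )
        # Each item in the response carries its vector under "values".
        vectors.extend(item["values"] for item in response)
    return vectors
```

Sending the chunks in batches keeps each request within the model's per-request input limit and reduces the number of round trips compared with embedding one chunk at a time.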

[ ]
All embeddings generated.

Upload embedded data to S3

Now, we can upload the embedded DataFrame to an Amazon S3 bucket as Parquet files in chunks. This step uses the boto3 library to interact with S3.

Be sure to replace the bucket name and folder name.
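
A sketch of the upload, assuming pandas, pyarrow, and boto3 are installed and that boto3 can find AWS credentials. The file naming scheme, the `rows_per_file` value, and the per-namespace subfolder are assumptions; check the import documentation for the directory layout Pinecone expects.

```python
import io


def parquet_key(folder, namespace, part):
    """Build an object key such as
    'import-demo/example_namespace/part-0000.parquet'."""
    return f"{folder.strip('/')}/{namespace}/part-{part:04d}.parquet"


def upload_parquet_parts(df, bucket, folder, namespace, rows_per_file=1000):
    """Write `df` to S3 as several Parquet files and return their keys.

    `df` is expected to have id, values, and metadata columns.
    """
    import boto3

    s3 = boto3.client("s3")
    keys = []
    for part, start in enumerate(range(0, len(df), rows_per_file)):
        buffer = io.BytesIO()
        df.iloc[start:start + rows_per_file].to_parquet(buffer, index=False)
        buffer.seek(0)  # rewind before handing the buffer to boto3
        key = parquet_key(folder, namespace, part)
        s3.upload_fileobj(buffer, bucket, key)
        keys.append(key)
    return keys
```

Writing each slice to an in-memory buffer avoids temporary files on disk; for very large DataFrames, writing to local files first may be the safer choice.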

[ ]

Create a serverless index

[ ]
[ ]
{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
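
The stats above describe an empty 1024-dimensional index. A sketch of how such an index might be created is shown here; the cosine metric is an assumption, and the `spec` argument is passed in by the caller, for example `ServerlessSpec(cloud="aws", region="us-east-1")` from the `pinecone` package (cloud and region are also assumptions).

```python
def create_serverless_index(pc, name, spec, dimension=1024, metric="cosine"):
    """Create the index if it does not exist and return a handle to it.

    `pc` is an initialized pinecone.Pinecone client and `spec` is a
    pinecone.ServerlessSpec. The dimension must match the embedding model.
    """
    if name not in pc.list_indexes().names():
        pc.create_index(
            name=name,
            dimension=dimension,
            metric=metric,
            spec=spec,
        )
    return pc.Index(name)
```

Calling `describe_index_stats()` on the returned handle should print stats like those shown above while the index is still empty.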

Start import task

Each file contains:

  • id: Unique identifier
  • values: Embedded vectors
  • metadata: JSON-formatted dictionary with metadata
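
To make the three columns concrete, the sketch below builds one row and runs a local sanity check on it. Storing metadata as a JSON string is an assumption drawn from the "JSON-formatted" description, and `validate_record` is a hypothetical helper, not part of any SDK.

```python
import json


def make_record(record_id, values, metadata):
    """Build one row with the id, values, and metadata columns."""
    return {
        "id": str(record_id),
        "values": [float(v) for v in values],
        "metadata": json.dumps(metadata),  # assumption: stored as a JSON string
    }


def validate_record(record, dimension):
    """Return a list of problems found in `record` (empty if it looks valid)."""
    if set(record) != {"id", "values", "metadata"}:
        return ["expected exactly the keys id, values, metadata"]
    problems = []
    if not record["id"]:
        problems.append("id must be a non-empty string")
    if len(record["values"]) != dimension:
        problems.append(f"values must have {dimension} elements")
    try:
        json.loads(record["metadata"])
    except (TypeError, ValueError):
        problems.append("metadata must be a JSON string")
    return problems
```

Checking rows before writing the Parquet files catches dimension mismatches locally, which is cheaper than discovering them after an import has been running for several minutes.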

Note: This task may take 10 minutes or more to complete. Each import request can import up to 1 TB of data or 100,000,000 records into a maximum of 100 namespaces, whichever limit is reached first.

Specify AWS S3 folder and start task
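
A minimal sketch of starting the task, assuming `index` is a handle obtained from `pc.Index(...)` and that the bucket and folder names are placeholders you replace with your own. The error mode is left at the SDK default.

```python
def import_uri(bucket, folder):
    """Return the s3:// URI of the folder that holds the Parquet files."""
    return f"s3://{bucket}/{folder.strip('/')}"


def start_import_task(index, bucket, folder):
    """Start an import from S3 and return the operation id."""
    operation = index.start_import(uri=import_uri(bucket, folder))
    return operation.id
```

Keep the returned id: it is what the status, describe, and cancel calls in the following sections take as input.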

[ ]

Check the status of the import
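
Because an import can run for several minutes, polling is a natural fit. The sketch below takes the describe call as a parameter (for example `index.describe_import`) so it can be exercised without a live index. The status names, the polling interval, and the timeout are assumptions; verify the status values against what your SDK version returns.

```python
import time

# Assumed terminal status names; confirm against your SDK version.
TERMINAL_STATUSES = {"Completed", "Failed", "Cancelled"}


def wait_for_import(describe, import_id, poll_seconds=30,
                    timeout_seconds=3600, sleep=time.sleep):
    """Poll until the import reaches a terminal status and return it.

    `describe` is a callable such as index.describe_import.
    """
    waited = 0
    while True:
        operation = describe(id=import_id)
        if operation.status in TERMINAL_STATUSES:
            return operation.status
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"import {import_id} is still {operation.status}"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

A call such as `wait_for_import(index.describe_import, import_id)` blocks until the operation finishes or the timeout elapses.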

[ ]

List import operations
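
Listing can be sketched as below, assuming `index.list_imports()` yields one operation at a time and that each operation exposes `id` and `status` attributes.

```python
def summarize_imports(index):
    """Return (id, status) pairs for every import operation on the index."""
    return [(op.id, op.status) for op in index.list_imports()]
```

The summary is useful for finding the id of an import that was started in an earlier session.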

[ ]

Describe a specific import
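
A sketch of describing one operation and printing its progress. The attribute names `percent_complete` and `records_imported` are assumptions about the returned object; `getattr` falls back to "n/a" if they are absent.

```python
def format_progress(operation):
    """Render a one-line progress summary for an import operation."""
    percent = getattr(operation, "percent_complete", None)
    records = getattr(operation, "records_imported", None)
    percent_text = "n/a" if percent is None else f"{percent:.1f}%"
    records_text = "n/a" if records is None else str(records)
    return (
        f"import {operation.id}: {operation.status}, "
        f"{percent_text} complete, {records_text} records"
    )


def describe_import_progress(index, import_id):
    """Fetch one import operation and summarize it."""
    return format_progress(index.describe_import(id=import_id))
```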

[ ]

Cancel the Import (if needed)
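
Cancelling only makes sense while the operation is still running, so this sketch checks the status first. The terminal status names are assumptions; `index` is a handle from `pc.Index(...)`.

```python
def cancel_if_running(index, import_id):
    """Cancel the import unless it has already finished.

    Returns True if a cancel request was sent, False otherwise.
    """
    operation = index.describe_import(id=import_id)
    if operation.status in {"Completed", "Failed", "Cancelled"}:
        return False
    index.cancel_import(id=import_id)
    return True
```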

[ ]

Delete the index
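
Cleanup can be sketched as follows. Checking that the index exists first lets the cell be re-run without raising an error; `pc` is an initialized Pinecone client.

```python
def delete_index_if_exists(pc, name):
    """Delete the named index and return True, or return False if absent."""
    if name in pc.list_indexes().names():
        pc.delete_index(name)
        return True
    return False
```

Deleting the index removes all imported records, so run this only once you no longer need the data.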

[ ]

Conclusion

In this notebook, we generated random phrases, chunked them, embedded the chunks using Pinecone, uploaded the embedded data to Amazon S3 as Parquet files, and imported those files into a Pinecone serverless index. You can customize this notebook for your use case by pointing it at your own S3 bucket or by changing the chunk size or embedding model.