IMDB Metadata JSON
Load JSON data and metadata-filter on a JSON field or array
In this notebook, we are going to use Kaggle IMDB data, available as either raw JSON or CSV. We'll load it into the Milvus vector database, then search with a metadata filter.
Let's get started!
Read JSON data into a pandas dataframe
The JSON data comes from https://www.kaggle.com/datasets/nelepie/imdb-genre-classification
df shape: (7, 10357)
data is valid JSON.
df shape: (100, 8)
movie_index    object
title          object
description    object
poster_url     object
labels         object
Genres         object
film_year       int64
text           object
dtype: object
Example text length: 240
Example text: Black Adam Nearly 5,000 years after he was bestowed with the almighty powers of the Egyptian gods - and imprisoned just as quickly - Black Adam is freed from his earthly tomb, ready to unleash his unique form of justice on the modern world.
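Reading the raw JSON with pandas can be sketched like this. The inline records below stand in for the Kaggle download; the real file path and full column set are assumptions:

```python
import io
import pandas as pd

# Inline records stand in for the Kaggle JSON file; in the notebook you would
# pass the downloaded file's path to pd.read_json instead.
raw = io.StringIO(
    '[{"title": "Black Adam", "film_year": 2022},'
    ' {"title": "Jumper", "film_year": 2008}]'
)
df = pd.read_json(raw)
print(df.shape)  # (2, 2)
```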
Read CSV data into a pandas dataframe
The data used in this notebook is the Kaggle 48K-movies dataset, which contains a lot of metadata in addition to the raw review text.
Usually there is a data-cleaning step, such as replacing empty strings or filling unusual and missing fields with median values. Below, I'll just drop rows with null values.
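The drop-nulls approach can be sketched on a toy frame (the columns here are illustrative; the real cleaning rules depend on your data):

```python
import pandas as pd

# Toy frame with missing values in different columns.
df = pd.DataFrame({
    "title": ["Black Adam", None, "Jumper"],
    "description": ["Superhero origin story", "No title given", None],
})

# Keep only fully-populated rows, then renumber the index.
cleaned = df.dropna().reset_index(drop=True)
print(cleaned.shape)  # (1, 2) - only the row with no nulls survives
```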
Start up a Zilliz free tier cluster.
Code in this notebook uses fully-managed Milvus on Zilliz Cloud free trial.
- Choose the default "Starter" option when you provision > Create collection > Give it a name > Create cluster and collection.
- On the Cluster main page, copy your API Key and store it locally in a .env variable. See note below on how to do that.
- Also on the Cluster main page, copy the Public Endpoint URI.
💡 Note: To keep your tokens private, best practice is to use an env variable. See how to save an API key in an env variable.
👉🏼 In Jupyter, you need a .env file (in same dir as notebooks) containing lines like this:
- ZILLIZ_API_KEY=f370c...
- OPENAI_API_KEY=sk-H...
- VARIABLE_NAME=value...
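Once the .env file is loaded (e.g. with python-dotenv), the variables can be read with the standard library. The helper name `require_env` is an assumption, added to fail fast on a missing key:

```python
import os

def require_env(name: str) -> str:
    """Read a required setting from the environment, failing fast if absent."""
    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Missing environment variable: {name}")
    return value

# Hypothetical usage after load_dotenv() has populated the environment:
# zilliz_api_key = require_env("ZILLIZ_API_KEY")
# zilliz_uri = require_env("ZILLIZ_ENDPOINT")  # variable name is an assumption
```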
Type of server: Zilliz Cloud Vector Database(Compatible with Milvus 2.3)
Load the Embedding Model checkpoint and use it to create vector embeddings
Embedding model: We will use the open-source sentence transformers available on HuggingFace to encode the documentation text. We will download the model from HuggingFace and run it locally.
💡 Tip: A good way to choose a sentence transformer model is to check the MTEB Leaderboard. Sort descending by column "Retrieval Average" and choose the best-performing small model.
Two model parameters of note below:
- EMBEDDING_DIM refers to the dimensionality or length of the embedding vector. In this case, every embedding the model produces (one per text chunk, after pooling) will have the same fixed length = 1024. This embedding size is common in BERT-large-based models, whose embeddings are used for downstream tasks such as classification, question answering, or retrieval.
- MAX_SEQ_LENGTH is the maximum context length the encoder model can handle for input sequences. In this case, if a sequence longer than 512 tokens is given to the model, everything beyond 512 tokens will be (silently!) chopped off. This is the reason a chunking strategy is needed: to segment input texts into chunks whose lengths fit in the model's input.
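The silent truncation described above can be pictured in plain Python (a stand-in for what the encoder does with token ids past MAX_SEQ_LENGTH):

```python
MAX_SEQ_LENGTH = 512

def truncate(token_ids, max_len=MAX_SEQ_LENGTH):
    # Everything past max_len is simply dropped - no warning, no error.
    return token_ids[:max_len]

long_input = list(range(600))               # pretend these are 600 token ids
assert len(truncate(long_input)) == 512     # the last 88 tokens vanished
assert truncate(list(range(10))) == list(range(10))  # short inputs pass through
```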
device: cpu
<class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
model_name: WhereIsAI/UAE-Large-V1
EMBEDDING_DIM: 1024
MAX_SEQ_LENGTH: 512
Create a Milvus collection
You can think of a collection in Milvus like a "table" in SQL databases. The collection will contain the:
- Schema (or no-schema Milvus client). 💡 You'll need the vector EMBEDDING_DIM parameter from your embedding model. Typical values are:
  - 1024 for sbert embedding models
  - 1536 for ada-002 OpenAI embedding models
- Vector index for efficient vector search
- Vector distance metric for measuring nearest neighbor vectors
- Consistency level
In Milvus, transactional consistency is possible; however, according to the CAP theorem, some latency must be sacrificed. 💡 Searching movie reviews is not mission-critical, so eventually consistent is fine here.
Add a Vector Index
The vector index determines the vector search algorithm used to find the closest vectors in your data to the query a user submits.
Most vector indexes use different sets of parameters depending on whether the database is:
- inserting vectors (creation mode) - vs -
- searching vectors (search mode)
Scroll down the docs page to see a table listing different vector indexes available on Milvus. For example:
- FLAT - deterministic exhaustive search
- IVF_FLAT or IVF_SQ8 - Inverted-file, clustering-based index (stochastic approximate search)
- HNSW - Graph index (stochastic approximate search)
- AUTOINDEX - Automatically determined based on OSS vs Zilliz cloud, type of GPU, size of data.
Besides a search algorithm, we also need to specify a distance metric, that is, a definition of what is considered "close" in vector space. In the cell below, the HNSW search index is chosen. Its possible distance metrics are one of:
- L2 - L2-norm
- IP - Dot-product
- COSINE - Angular distance
💡 Most use cases work better with normalized embeddings, in which case L2 is useless (every vector has length=1) and IP and COSINE are the same. Only choose L2 if you plan to keep your embeddings unnormalized.
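A quick numpy check of why IP and COSINE coincide once vectors are normalized to unit length:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1024)
b = rng.normal(size=1024)
a /= np.linalg.norm(a)   # normalize to unit length
b /= np.linalg.norm(b)

ip = float(a @ b)                                        # inner product
cosine = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

assert abs(ip - cosine) < 1e-9           # identical for unit-length vectors
assert abs(np.linalg.norm(a) - 1.0) < 1e-9  # every vector has length 1
```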
Successfully created collection: `imdb_metadata`
{'aliases': [],
'auto_id': True,
'collection_id': 448076879578126663,
'collection_name': 'imdb_metadata',
'consistency_level': 3,
'description': '',
'enable_dynamic_field': True,
'fields': [{'auto_id': True,
'description': '',
'field_id': 100,
'is_primary': True,
'name': 'id',
'params': {},
'type': <DataType.INT64: 5>},
{'description': '',
'field_id': 101,
'name': 'vector',
'params': {'dim': 1024},
'type': <DataType.FLOAT_VECTOR: 101>}],
'num_partitions': 1,
'num_shards': 1,
'properties': {}}
Chunking
Before embedding, it is necessary to decide your chunk strategy, chunk size, and chunk overlap. In this demo, I will use:
- Strategy = Keep movie reviews as single chunks unless they are too long.
- Chunk size = Use the embedding model's parameter MAX_SEQ_LENGTH
- Overlap = Rule-of-thumb 10-15%
- Function = LangChain's convenient RecursiveCharacterTextSplitter to split up long reviews recursively.
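A minimal, pure-Python stand-in for the splitter's behavior: fixed-size chunks with ~10% overlap. The real RecursiveCharacterTextSplitter additionally tries to break on natural separators (paragraphs, sentences, words) rather than mid-character-run:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 51) -> list[str]:
    """Fixed-size character chunks with ~10% overlap between neighbors."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap  # step forward, keeping the overlap
    return chunks

short_review = "A review that already fits in one chunk."
assert chunk_text(short_review) == [short_review]   # short reviews stay whole

pieces = chunk_text("x" * 1200)
assert len(pieces) == 3
assert pieces[0][-51:] == pieces[1][:51]            # neighbors share the overlap
```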
chunk size: 512
original shape: (100, 8)
new shape: (100, 9)
Chunking + embedding time for 100 docs: 6.592205047607422 sec
type embeddings: <class 'pandas.core.series.Series'> of <class 'numpy.ndarray'> of numbers: <class 'numpy.float32'>
Insert data into Milvus
For each original text chunk, we'll write the embedding vector, the chunk text, and the movie's metadata fields (title, genres, film year, etc.) into the database.
The Milvus Client wrapper can only handle loading data from a list of dictionaries.
Otherwise, in general, Milvus supports loading data from:
- pandas dataframes
- list of dictionaries
Below, we use the embedding model provided by HuggingFace, download its checkpoint, and run it locally as the encoder.
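Assembling the list of dictionaries can be sketched as follows. The helper name `build_rows` is an assumption, and the `mc.insert` call is shown commented out since it needs the connected MilvusClient from the earlier steps:

```python
import numpy as np

def build_rows(records: list[dict], embeddings) -> list[dict]:
    """Pair each metadata record with its embedding vector.
    Milvus Client inserts a list of dicts, one per entity."""
    return [{**rec, "vector": np.asarray(vec, dtype=np.float32)}
            for rec, vec in zip(records, embeddings)]

records = [{"movie_index": "tt6443346", "title": "Black Adam", "film_year": 2022}]
rows = build_rows(records, [np.zeros(1024)])   # zero vector stands in for a real embedding

# Hypothetical insert call; `mc` is the MilvusClient connected above.
# mc.insert(collection_name="imdb_metadata", data=rows)
```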
Start inserting entities
Milvus Client insert time for 100 vectors: 0.36271023750305176 seconds
{'movie_index': 'tt6443346',
 'title': 'Black Adam',
 'description': 'Nearly 5,000 years after he was bestowed with the almighty powers of the Egyptian gods - and imprisoned just as quickly - Black Adam is freed from his earthly tomb, ready to unleash his unique form of justice on the modern world.',
 'poster_url': 'https://m.media-amazon.com/images/M/MV5BYzZkOGUwMzMtMTgyNS00YjFlLTg5NzYtZTE3Y2E5YTA5NWIyXkEyXkFqcGdeQXVyMjkwOTAyMDU@._V1_QL75_UX190_CR0,0,190,281_.jpg',
 'labels': 'SuperHero',
 'Genres': ['Action', 'Adventure', 'Fantasy'],
 'film_year': 2022,
 'chunk': 'Black Adam Nearly 5,000 years after he was bestowed with the almighty powers of the Egyptian gods - and imprisoned just as quickly - Black Adam is freed from his earthly tomb, ready to unleash his unique form of justice on the modern world.',
 'vector': array([ 0.00401279, 0.00175461, -0.00104652, ..., -0.00563261,
        -0.00292824, 0.0038763 ], dtype=float32)}
{'aliases': [],
'auto_id': True,
'collection_id': 448076879578126663,
'collection_name': 'imdb_metadata',
'consistency_level': 3,
'description': '',
'enable_dynamic_field': True,
'fields': [{'auto_id': True,
'description': '',
'field_id': 100,
'is_primary': True,
'name': 'id',
'params': {},
'type': <DataType.INT64: 5>},
{'description': '',
'field_id': 101,
'name': 'vector',
'params': {'dim': 1024},
'type': <DataType.FLOAT_VECTOR: 101>}],
'num_partitions': 1,
'num_shards': 1,
'properties': {}}
timing: 0.037996768951416016 seconds
[{'count(*)': 100}]
timing: 0.12284588813781738 seconds
['movie_index', 'title', 'description', 'poster_url', 'labels', 'Genres', 'film_year', 'chunk']
['Action', 'Adventure', 'Crime', 'Horror', 'Sci-Fi', 'Thriller', 'Drama', 'Mystery', 'Comedy', 'Animation', 'Fantasy']
Ask a question about your data
So far in this demo notebook:
- Your custom data has been mapped into a vector embedding space
- Those vector embeddings have been saved into a vector database
Next, you can ask a question about your custom data!
💡 In LLM vocabulary:
Query is the generic term for user questions. A query can be a batch of many individual questions, possibly up to ~1000, submitted together.
Question usually refers to a single user question.
In our example below, the user asks a question about movies, restricted with a metadata filter on genre and film year.
Semantic Search = very fast search of the entire knowledge base to find the TOP_K text chunks with the closest embeddings to the user's query.
💡 The same model should always be used for all the embedded data and the query, for consistency.
query length: 45
Execute a vector search
Search Milvus using PyMilvus API.
💡 By their nature, vector searches are "semantic" searches. For example, if you were to search for "leaky faucet":
Traditional keyword search - either or both of the words "leaky", "faucet" would have to match some text in order to return a web page or link to the document.
Semantic search - results containing words like "drippy" and "taps" would be returned as well, because these words mean the same thing even though they are different words.
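The metadata filter shown in the output below can be built and passed to the search like this. `json_contains` is Milvus filter syntax for membership in a JSON array field; `mc` (the MilvusClient) and `query_embedding` are assumed to exist from the earlier steps, so the call itself is shown commented out:

```python
# Build the metadata filter: Sci-Fi movies released before 2019.
genre = "Sci-Fi"
year_cutoff = 2019
filter_expr = f'json_contains(Genres, "{genre}") and film_year < {year_cutoff}'

# Hypothetical search call against the collection created above.
# results = mc.search(
#     collection_name="imdb_metadata",
#     data=[query_embedding],
#     filter=filter_expr,
#     limit=2,
#     output_fields=["title", "film_year", "Genres", "chunk"],
# )
```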
filter: json_contains(Genres, "Sci-Fi") and film_year < 2019
Milvus Client search time for 100 vectors: 0.3123598098754883 seconds
type: <class 'list'>, count: 2
Retrieved result #1
distance = 0.5802159309387207
movie_index: tt0434409
('Chunk text: V for Vendetta In a future British dystopian society, a shadowy '
'freedom fighter, known only by the alias of "V", plots to overthrow the '
'tyrannical government - with the help of a young woman.')
movie_index: tt0434409
title: V for Vendetta
description: In a future British dystopian society, a shadowy freedom fighter, known only by the alias of "V", plots to overthrow the tyrannical government - with the help of a young woman.
poster_url: https://m.media-amazon.com/images/M/MV5BOTI5ODc3NzExNV5BMl5BanBnXkFtZTcwNzYxNzQzMw@@._V1_QL75_UX190_CR0,0,190,281_.jpg
labels: SuperHero
Genres: ['Action', 'Drama', 'Sci-Fi']
film_year: 2005
Retrieved result #2
distance = 0.5398463606834412
movie_index: tt0489099
('Chunk text: Jumper A teenager with teleportation abilities suddenly finds '
'himself in the middle of an ancient war between those like him and their '
'sworn annihilators.')
movie_index: tt0489099
title: Jumper
description: A teenager with teleportation abilities suddenly finds himself in the middle of an ancient war between those like him and their sworn annihilators.
poster_url: https://m.media-amazon.com/images/M/MV5BMjEwOTkyOTI3M15BMl5BanBnXkFtZTcwNTQxMjU1MQ@@._V1_QL75_UX190_CR0,0,190,281_.jpg
labels: SuperHero
Genres: ['Action', 'Adventure', 'Sci-Fi']
film_year: 2008
Author: Christy Bergman
Python implementation: CPython
Python version : 3.11.8
IPython version : 8.22.2
torch : 2.2.1
transformers : 4.39.1
sentence_transformers: 2.6.0
pymilvus : 2.4.0
langchain : 0.1.13
conda environment: py311