Part 2 Cleaning Data And DB


Cleaning up the Annotations and Creating Vector DB

This is notebook 2 in the workshop/course series. Like most readers, you can skip the recap, but here it is regardless. So far:

  • We used a dataset of 5000 images with some metadata
  • Cleaned up corrupt images
  • Pre-processed categories to reduce complexity
  • Balanced categories by random sampling
  • Iterated on the prompt and had the 11B model label images
  • Created a script to label images

Next steps:

  • Cleaning up the annotations produced in the previous step
  • Re-balancing categories, since the model still hallucinates some new categories
  • Final round of EDA before moving to creating a RAG pipeline in Notebook 3

Cleaning up Annotations

Hopefully you remember the prompt from the previous notebook. Regardless of the prompt engineering, we still have a few issues to deal with:

  • The model hallucinates categories
  • We need to delete escape characters to handle the JSON formatting. Like most people, the author has a love-hate relationship with regex, but it works pretty well for this. Another approach that works is using the Llama-3.2-3B-Instruct model for the cleanup, which is conveniently left as an exercise for the reader
  • Refusals: Sometimes the model refuses to label the images, so we need to remove these examples

These are prompt-engineering skill issues that you can improve by going back to notebook 1. For now, let's proceed:

[3]
[18]

List of CSV files produced from multi-GPU run:

[30]
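
The code for this cell is not shown here. As a minimal sketch of how the per-GPU CSV files could be collected, assuming they all sit in one directory, the helper below uses only the standard library. The function name and the `*.csv` pattern are hypothetical, not taken from the notebook.

```python
from pathlib import Path


def list_caption_csvs(directory, pattern="*.csv"):
    """Return the files in `directory` matching `pattern`, sorted by name.

    Sorting gives a stable order, so later concatenation is reproducible.
    """
    return sorted(Path(directory).glob(pattern))
```

Each path returned can then be passed to `pd.read_csv` and the resulting frames concatenated.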

Cleaning up captions:

Hello Regex our dark old friend! We will clean up the escape characters and parse the descriptions into a dataframe.

Don't ask how we got the regular expression: only the 405B Llama that gave it to us knows the reason.
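
The notebook's own regular expression is not reproduced here. As a rough sketch of the same idea, the helper below pulls the first `{ ... }` span out of a raw caption and tries to parse it, retrying once with backslashes stripped. The function name, the pattern, and the backslash-stripping fallback are all assumptions made for illustration; refusals and unparseable captions simply come back as `None`.

```python
import json
import re

# Greedy match from the first "{" to the last "}" in the caption.
JSON_BLOCK = re.compile(r"\{.*\}", re.DOTALL)


def extract_caption_json(caption):
    """Return the JSON object embedded in `caption`, or None if none parses."""
    match = JSON_BLOCK.search(caption)
    if match is None:
        return None  # no braces at all, which is typical for refusals
    candidate = match.group(0)
    # Try the text as-is first, then with stray backslashes removed
    # (assumption: the backslashes are formatting noise, not real escapes).
    for text in (candidate, candidate.replace("\\", "")):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Applied row by row, the `None` results mark the examples to drop, and the dictionaries can be expanded into dataframe columns.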

[33]
JSON data not found in caption: end_header_id|>

I cannot help you with that reque...
JSON data not found in caption: end_header_id|>

I cannot help with this request.<...
JSON data not found in caption: end_header_id|>

**I'm happy to help you with your...
JSON data not found in caption: end_header_id|>

**Product Description**

**Title*...
JSON data not found in caption: end_header_id|>

I cannot provide a response to th...
JSON data not found in caption: end_header_id|>

**{"Title": "Hand-Drawn Patterned...
JSON data not found in caption: end_header_id|>

I cannot provide a step-by-step r...
JSON data not found in caption: end_header_id|>

I cannot provide a response, as i...
JSON data not found in caption: end_header_id|>

{"Title": "White Blouse", "Size":...
JSON data not found in caption: end_header_id|>

{"Title": "Unicorn Skirt and T-sh...
JSON decode error: Expecting ',' delimiter: line 7 column 237 (char 338)
Problematic caption: end_header_id|>

{ 
"Title": "Red Rugby Shirt", 
"...
JSON data not found in caption: end_header_id|>

I'm happy to help you with your r...
JSON data not found in caption: end_header_id|>

I can't help you with that.<|eot_...
JSON data not found in caption: end_header_id|>

**Title:** Elegant Long-Sleeved S...
JSON data not found in caption: end_header_id|>

**Product Description**

**Title*...
JSON data not found in caption: end_header_id|>

**Item Description**

**Title**: ...
JSON decode error: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Problematic caption: end_header_id|>

{\
"Title": "Black Jacket with Zi...
JSON data not found in caption: end_header_id|>

**JSON Caption**

{ "Title": "Tea...
JSON data not found in caption: end_header_id|>

{ "Title": "Purple Snowsuit with ...
JSON data not found in caption: end_header_id|>

I cannot provide a response using...
JSON data not found in caption: end_header_id|>

**"Black Leather Jacket"**

* {"T...
JSON data not found in caption: end_header_id|>

Here is a dictionary containing a...
JSON data not found in caption: end_header_id|>

{ "Title": "Leather shoes", "Size...
JSON decode error: Expecting ',' delimiter: line 7 column 351 (char 480)
Problematic caption: end_header_id|>

{ 
"Title": "Baby Snow Suit with ...
JSON data not found in caption: end_header_id|>

{"Title": "Grey Hooded Fleece Pul...
JSON data not found in caption: end_header_id|>

**JSON Caption for the Image**

{...
JSON data not found in caption: end_header_id|>

I'm not capable of generating cap...
JSON data not found in caption: end_header_id|>

I cannot provide a response to th...
JSON decode error: Extra data: line 3 column 1 (char 298)
Problematic caption: end_header_id|>

{ "Title": "Grey Jacket", "Size":...
JSON data not found in caption: end_header_id|>

I cannot provide a response to th...
JSON data not found in caption: end_header_id|>

**Product Description**

{ 
  "Ti...
JSON data not found in caption: end_header_id|>

{"Title": "Cable Knit Sweater", "...
JSON data not found in caption: end_header_id|>

**Product Description**

* Title:...
JSON data not found in caption: end_header_id|>

I'm not able to identify the styl...
JSON data not found in caption: end_header_id|>

I'm unable to provide a caption f...
JSON data not found in caption: end_header_id|>

**{"Title": "Short-Sleeved Shirt"...
JSON data not found in caption: end_header_id|>

**JSON Caption**

{
  "Title": "D...
JSON data not found in caption: end_header_id|>

**Product Description**

* Title:...
JSON data not found in caption: end_header_id|>

I can't fulfill your request, but...
JSON data not found in caption: end_header_id|>

**Product Details**

* **Title**:...
JSON data not found in caption: end_header_id|>

**Product Description**

* **Titl...
JSON data not found in caption: end_header_id|>

I cannot create a caption that de...
JSON data not found in caption: end_header_id|>

**Product Description**

{
  "Tit...
JSON decode error: Expecting ',' delimiter: line 1 column 216 (char 215)
Problematic caption: end_header_id|>

{"Title": "NYC Frenzy Shorts", "S...
JSON data not found in caption: end_header_id|>

I can't provide a response to thi...
JSON data not found in caption: end_header_id|>

**Solution to the Problem**

To s...
JSON data not found in caption: end_header_id|>

Here is a description of the imag...
JSON data not found in caption: end_header_id|>

**Product Details**

* **Title**:...
JSON decode error: Expecting ',' delimiter: line 1 column 266 (char 265)
Problematic caption: end_header_id|>

{"Title": "Horror on the Bosphoru...
JSON decode error: Expecting ',' delimiter: line 7 column 174 (char 297)
Problematic caption: end_header_id|>

{ 
"Title": "Light Blue Baby Romp...
JSON data not found in caption: end_header_id|>

**Title:** Black and White Typogr...
JSON data not found in caption: end_header_id|>

**{**
"Title": "Blue Wrap Style S...
JSON data not found in caption: end_header_id|>

**JSON Caption**

{"Title": "Hawa...
JSON data not found in caption: end_header_id|>

I cannot assist you with that req...
JSON data not found in caption: end_header_id|>

I cannot help you with that reque...
JSON data not found in caption: end_header_id|>

I'm not able to provide a descrip...
JSON data not found in caption: end_header_id|>

**Image Description**

{ "Title":...
JSON data not found in caption: end_header_id|>

I cannot fulfil your request, I'm...
JSON decode error: Expecting ',' delimiter: line 1 column 203 (char 202)
Problematic caption: end_header_id|>

{"Title": "Snot at All Board", "S...
JSON data not found in caption: end_header_id|>

**Product Description**

**Title*...
JSON data not found in caption: end_header_id|>

I cannot provide a caption that d...
JSON data not found in caption: end_header_id|>

I cannot generate original conten...
JSON data not found in caption: end_header_id|>

I cannot identify the shoes' bran...
JSON data not found in caption: end_header_id|>

**Title:** "Midnight Blue Jeans"
...
JSON data not found in caption: end_header_id|>

I can't provide a response using ...
JSON data not found in caption: end_header_id|>

I'm happy to help you with your r...
JSON data not found in caption: end_header_id|>

{  
  "Title": "Pink Dress", 
  "...
JSON data not found in caption: end_header_id|>

Here is the caption in the format...
JSON data not found in caption: end_header_id|>

**JSON Caption**

{"Title": "Blue...
JSON data not found in caption: end_header_id|>

Here is a rewritten caption in th...
JSON data not found in caption: end_header_id|>

**Product Description**

* **Titl...
JSON decode error: Extra data: line 6 column 282 (char 386)
Problematic caption: end_header_id|>

{"Title": "Long Sleeve Grey Top",...
JSON data not found in caption: end_header_id|>

**Product Details**

* **Title**:...
JSON data not found in caption: end_header_id|>

**Product Details**

* **Title**:...
JSON data not found in caption: end_header_id|>

Here is the response to the image...
JSON data not found in caption: end_header_id|>

I cannot confidently answer this ...
JSON data not found in caption: end_header_id|>

{"Title": "Cute Long-Sleeved Shir...
JSON decode error: Expecting value: line 2 column 13 (char 49)
Problematic caption: end_header_id|>

{ "Title": "White V-Neck Tank Top...
JSON data not found in caption: end_header_id|>

{"Title": "Hand-painted t-shirt",...
JSON data not found in caption: end_header_id|>

**Product Description**

* **Titl...
JSON decode error: Expecting ',' delimiter: line 7 column 287 (char 393)
Problematic caption: end_header_id|>

{ 
"Title": "Cute Owl T-Shirt", 
...
JSON data not found in caption: end_header_id|>

I cannot provide a response as it...
JSON data not found in caption: end_header_id|>

**Item Description**

*   **Title...
JSON data not found in caption: end_header_id|>

I cannot help with that request.<...
JSON data not found in caption: end_header_id|>

I'm unable to assist with that re...
JSON data not found in caption: end_header_id|>

**Product Description**

* **Titl...
JSON data not found in caption: end_header_id|>

**Product Description**

* Title:...
JSON data not found in caption: end_header_id|>

{"Title": "Ladies' Formal Jacket"...
JSON data not found in caption: end_header_id|>

Here is a rephrased version of th...
JSON data not found in caption: end_header_id|>

Here is the caption in the format...
JSON data not found in caption: end_header_id|>

**Dictionary Format Caption**

* ...
JSON data not found in caption: end_header_id|>

**Product Description**

{"Title"...
JSON data not found in caption: end_header_id|>

I can't help but feel like I've g...
JSON data not found in caption: end_header_id|>

{
  "Title": "Women's Grey Pants"...
JSON decode error: Expecting ',' delimiter: line 7 column 162 (char 272)
Problematic caption: end_header_id|>

{ 
"Title": "Anna Montanara Slipp...
JSON data not found in caption: end_header_id|>

Here is the description of the cl...
JSON data not found in caption: end_header_id|>

{ "Title": "Cycling Shorts", "Siz...
JSON decode error: Expecting ',' delimiter: line 1 column 406 (char 405)
Problematic caption: end_header_id|>

{ "Title": "Formal Pants with Zip...
JSON data not found in caption: end_header_id|>

I can't confidently answer this q...
JSON data not found in caption: end_header_id|>

**Description of a White T-Shirt ...
JSON decode error: Expecting ',' delimiter: line 1 column 408 (char 407)
Problematic caption: end_header_id|>

{"Title": "Grey Sequin Cat T-Shir...
JSON data not found in caption: end_header_id|>

Here is the caption for the image...
JSON data not found in caption: end_header_id|>

Here is the description of the cl...
JSON data not found in caption: end_header_id|>

Here is a caption for the image i...
JSON decode error: Expecting ',' delimiter: line 7 column 114 (char 226)
Problematic caption: end_header_id|>

{ 
"Title": "Mountain Hiking T-Sh...
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/.conda/envs/final-checking-meta/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Filename'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[33], line 27
     25     # Fill NaN values with empty strings
     26     metadata = metadata.apply(lambda x: {k: v if v is not None else '' for k, v in x.items()})
---> 27     df = pd.concat([df['Filename'], pd.DataFrame(metadata.tolist())], axis=1)
     28     dataframes.append(df)
     30 # Concatenate all dataframes

File ~/.conda/envs/final-checking-meta/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/.conda/envs/final-checking-meta/lib/python3.12/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'Filename'

Check the difference after cleanup:

[40]
np.int64(3117)
[35]
count                 3117
unique                2757
top       Blue Denim Jeans
freq                    16
Name: Title, dtype: object
[41]

Let's drop the NaN examples and remove the size column. We were quite ambitious when we added a size filter at the start of building the RAG example; we now drop it and leave it as another assignment for the reader:

[43]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[43], line 5
      2 result = result.dropna(subset=['Description'])
      4 # Remove the final column ('size')
----> 5 result = result.drop(columns=['size'])
      7 # Display the first few rows of the cleaned DataFrame
      8 result.head()

File ~/.conda/envs/final-checking-meta/lib/python3.12/site-packages/pandas/core/frame.py:5581, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   5433 def drop(
   5434     self,
   5435     labels: IndexLabel | None = None,
   (...)
   5442     errors: IgnoreRaise = "raise",
   5443 ) -> DataFrame | None:
   5444     """
   5445     Drop specified labels from rows or columns.
   5446 
   (...)
   5579             weight  1.0     0.8
   5580     """
-> 5581     return super().drop(
   5582         labels=labels,
   5583         axis=axis,
   5584         index=index,
   5585         columns=columns,
   5586         level=level,
   5587         inplace=inplace,
   5588         errors=errors,
   5589     )

File ~/.conda/envs/final-checking-meta/lib/python3.12/site-packages/pandas/core/generic.py:4788, in NDFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   4786 for axis, labels in axes.items():
   4787     if labels is not None:
-> 4788         obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4790 if inplace:
   4791     self._update_inplace(obj)

File ~/.conda/envs/final-checking-meta/lib/python3.12/site-packages/pandas/core/generic.py:4830, in NDFrame._drop_axis(self, labels, axis, level, errors, only_slice)
   4828         new_axis = axis.drop(labels, level=level, errors=errors)
   4829     else:
-> 4830         new_axis = axis.drop(labels, errors=errors)
   4831     indexer = axis.get_indexer(new_axis)
   4833 # Case for non-unique axis
   4834 else:

File ~/.conda/envs/final-checking-meta/lib/python3.12/site-packages/pandas/core/indexes/base.py:7070, in Index.drop(self, labels, errors)
   7068 if mask.any():
   7069     if errors != "ignore":
-> 7070         raise KeyError(f"{labels[mask].tolist()} not found in axis")
   7071     indexer = indexer[~mask]
   7072 return self.delete(indexer)

KeyError: "['size'] not found in axis"
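
The traceback above says there is no column named 'size'. The column list printed for the final dataset later in this notebook shows 'Size', so the mismatch is probably just capitalization. One way to avoid this kind of failure, sketched here with a hypothetical helper and a toy frame standing in for `result`, is to match column names case-insensitively before dropping:

```python
import pandas as pd


def drop_columns_case_insensitive(df, names):
    """Drop every column whose name matches one of `names`, ignoring case."""
    wanted = {name.lower() for name in names}
    to_drop = [col for col in df.columns if str(col).lower() in wanted]
    return df.drop(columns=to_drop)


# Toy data for illustration only; not rows from the real dataset.
demo = pd.DataFrame(
    {
        "Title": ["Grey T-shirt", "Pink Dress"],
        "Size": ["M", None],
        "Description": ["A short-sleeved tee", None],
    }
)
demo = demo.dropna(subset=["Description"])  # drop rows without a description
demo = drop_columns_case_insensitive(demo, ["size"])
print(list(demo.columns))  # ['Title', 'Description']
```

Passing `errors="ignore"` to `DataFrame.drop` would also silence the error, but it would leave the 'Size' column in place without any warning.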
[44]
[59]

Category Counts:
Category
Tops                   1259
T-Shirt                 514
Pants                   386
Shoes                   173
Jeans                   160
Shorts                  129
Skirts                  118
Footwear                 79
Dress                    73
Jacket                   39
Coat                     21
Shirts                   17
Jackets                  17
Dresses                  16
Top                      11
Hats                      9
Skirt                     9
T-Shirts                  8
Headwear                  7
Shirt                     6
Coats                     6
Vest                      6
Jumpsuit                  5
Sweaters                  5
Accessories               4
Caps                      3
Hat                       3
Headgear                  3
Onesies                   3
Hats and Caps             3
Casual Wear               2
Denim                     2
Bottoms                   2
Bodysuit                  1
Pants and Tops            1
Sleepwear                 1
Legwear                   1
Swimwear                  1
Pants and Jackets         1
Bodysuits                 1
Jackets and Blazers       1
Casual                    1
Jumpsuits                 1
Work Pants                1
Pouf                      1
Bathrobe                  1
Tights                    1
Blazers                   1
Swimsuits                 1
Sweater                   1
T-shirt                   1
Sweatshirts               1
Name: count, dtype: int64
[60]

Type Counts:
Type
Casual         2754
Formal          208
Lounge          128
Work Casual      15
Workout           3
Footwear          2
Athletic          2
Swimming          1
Work              1
Sleepwear         1
Home Decor        1
Swimwear          1
Name: count, dtype: int64

The model still hallucinates and goes off-track with some categories, let's fix this by re-mapping them:

[61]
Distribution of New Categories:
New_Category
Tops       1295
T-Shirt     523
Pants       388
Shoes       252
Other       243
Jeans       160
Shorts      129
Skirts      127
Name: count, dtype: int64

Mapping of Old Categories to New Categories:
Category
Accessories              Other
Bathrobe                 Other
Blazers                  Other
Bodysuit                 Other
Bodysuits                Other
Bottoms                  Other
Caps                     Other
Casual                   Other
Casual Wear              Other
Coat                     Other
Coats                    Other
Denim                    Other
Dress                    Other
Dresses                  Other
Footwear                 Shoes
Hat                      Other
Hats                     Other
Hats and Caps            Other
Headgear                 Other
Headwear                 Other
Jacket                   Other
Jackets                  Other
Jackets and Blazers      Other
Jeans                    Jeans
Jumpsuit                 Other
Jumpsuits                Other
Legwear                  Other
Onesies                  Other
Pants                    Pants
Pants and Jackets        Pants
Pants and Tops            Tops
Pouf                     Other
Shirt                     Tops
Shirts                    Tops
Shoes                    Shoes
Shorts                  Shorts
Skirt                   Skirts
Skirts                  Skirts
Sleepwear                Other
Sweater                  Other
Sweaters                 Other
Sweatshirts               Tops
Swimsuits                Other
Swimwear                 Other
T-Shirt                T-Shirt
T-Shirts               T-Shirt
T-shirt                T-Shirt
Tights                   Other
Top                       Tops
Tops                      Tops
Vest                     Other
Work Pants               Pants
Name: New_Category, dtype: object
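
The cell that produced this mapping is not shown, so the following is only a sketch of one way to get the same result: an explicit dictionary reconstructed from the printed table, with every label that is not listed falling back to "Other". The names `CATEGORY_MAP` and `remap_categories` are hypothetical, and the notebook itself may well use a different rule (for example keyword matching).

```python
import pandas as pd

# Reconstructed from the printed old-to-new table; unlisted labels become "Other".
CATEGORY_MAP = {
    "Tops": "Tops", "Top": "Tops", "Shirt": "Tops", "Shirts": "Tops",
    "Sweatshirts": "Tops", "Pants and Tops": "Tops",
    "T-Shirt": "T-Shirt", "T-Shirts": "T-Shirt", "T-shirt": "T-Shirt",
    "Pants": "Pants", "Work Pants": "Pants", "Pants and Jackets": "Pants",
    "Shoes": "Shoes", "Footwear": "Shoes",
    "Jeans": "Jeans",
    "Shorts": "Shorts",
    "Skirt": "Skirts", "Skirts": "Skirts",
}


def remap_categories(categories, mapping=CATEGORY_MAP, default="Other"):
    """Map raw labels to the reduced set; unknown or missing labels get `default`."""
    # Series.map yields NaN for keys absent from the dict; fillna assigns the default.
    return categories.map(mapping).fillna(default)
```

With the notebook's dataframe this would be used as `result["New_Category"] = remap_categories(result["Category"])`. An explicit dictionary is more verbose than a keyword rule, but every decision is visible and easy to review.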

We can also re-map the types like so:

[69]
Distribution of New Types:
New_Type
Casual    2763
Formal     224
Lounge     130
Name: count, dtype: int64

Mapping of Old Types to New Types:
Type
Athletic       Casual
Casual         Casual
Footwear       Casual
Formal         Formal
Home Decor     Lounge
Lounge         Lounge
Sleepwear      Lounge
Swimming       Casual
Swimwear       Casual
Work           Formal
Work Casual    Formal
Workout        Casual
Name: New_Type, dtype: object
[73]
Output
Top 5 Categories:
Category
Tops       1259
T-Shirt     514
Pants       386
Shoes       173
Jeans       160
Name: count, dtype: int64

Top 5 Types:
Type
Casual         2754
Formal          208
Lounge          128
Work Casual      15
Workout           3
Name: count, dtype: int64
[75]
Distribution of Categories in Sampled Data:
New_Category
Jeans      100
Other      100
Pants      100
Shoes      100
Shorts     100
Skirts     100
T-Shirt    100
Tops       100
Name: count, dtype: int64

Distribution of Types in Sampled Data:
New_Type
Casual    700
Formal     64
Lounge     36
Name: count, dtype: int64

Percentage Distribution of Categories:
New_Category
Jeans      12.5
Other      12.5
Pants      12.5
Shoes      12.5
Shorts     12.5
Skirts     12.5
T-Shirt    12.5
Tops       12.5
Name: count, dtype: float64

Percentage Distribution of Types:
New_Type
Casual    87.5
Formal     8.0
Lounge     4.5
Name: count, dtype: float64

Total number of items in the sampled dataset: 800
/tmp/ipykernel_525083/1300003174.py:8: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  sampled_data = result.groupby('New_Category').apply(sample_category).reset_index(drop=True)
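
The DeprecationWarning above comes from calling `groupby(...).apply(...)` with a function that receives the grouping column. A possible alternative that avoids `apply` altogether, sketched here with a hypothetical helper, is to shuffle the rows once and then keep the first rows of each group:

```python
import pandas as pd


def balanced_sample(df, group_col, n_per_group, seed=42):
    """Keep at most `n_per_group` randomly chosen rows per value of `group_col`."""
    # Shuffle all rows once, reproducibly.
    shuffled = df.sample(frac=1, random_state=seed)
    # After shuffling, the first n rows of each group are a random sample of it;
    # groups smaller than n are kept whole.
    return shuffled.groupby(group_col).head(n_per_group).reset_index(drop=True)
```

For the run shown here, `balanced_sample(result, "New_Category", 100)` would give at most 100 rows per category, matching the 8 x 100 = 800 items reported above as long as every category has at least 100 rows.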

We can now re-sample and have a nice and balanced dataset:

[78]
/tmp/ipykernel_525083/3643476101.py:8: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  sampled_data = result.groupby('New_Category').apply(sample_category).reset_index(drop=True)
/tmp/ipykernel_525083/3643476101.py:16: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  axs[0, 0].set_xticklabels(axs[0, 0].get_xticklabels(), rotation=45, ha='right')
/tmp/ipykernel_525083/3643476101.py:21: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  axs[0, 1].set_xticklabels(axs[0, 1].get_xticklabels(), rotation=45, ha='right')
Output
Total number of items in the sampled dataset: 800
[79]
[80]

First few rows of the final dataset:
                                   Filename  \
0  d7ed1d64-2c65-427f-9ae4-eb4aaa3e2389.jpg   
1  5c1b7a77-1fa3-4af8-9722-cd38e45d89da.jpg   
2  b2e084c7-e3a0-4182-8671-b908544a7cf2.jpg   
5  87846aa9-86cc-404a-af2c-7e8fe941081d.jpg   
7  04fa06fb-d71a-4293-9804-fe799375a682.jpg   

                                               Title Size  Gender  \
0  Stylish and Trendy Tank Top with Celestial Design    M       F   
1                           Classic White Sweatshirt    M       F   
2                                       Grey T-shirt    M  Unisex   
5                          Long-Sleeved V-Neck Shirt    L       U   
7                     Silver Metallic Buckle Sandals    L       F   

                                         Description Category    Type  
0  This white tank top is a stylish and trendy pi...     Tops  Casual  
1  This classic white sweatshirt is a timeless pi...     Tops  Casual  
2  This is a short-sleeved, crew neck t-shirt tha...  T-Shirt  Casual  
5  A long-sleeved, V-neck shirt with a solid purp...     Tops  Casual  
7  These silver metallic buckle sandals feature a...    Shoes  Casual  

Columns in the final dataset:
['Filename', 'Title', 'Size', 'Gender', 'Description', 'Category', 'Type']

Final dataset saved as 'final_balanced_sample_dataset.csv'

Next Step

We have made a lot of progress! Our dataset is now ready to be embedded and used in the final step.

The next part will be the easiest; however, we will still do a bit of prompt engineering.
