Cohere Multilingual Search

Multilingual Search with Cohere

Cohere released what might be the most advanced multilingual embedding model back in December 2022.

Cohere's multilingual model supports 100+ languages and, at the time of release, provided 230% better performance than the previous state of the art in multilingual search.

A key advance of this model, beyond pure performance, is its ability to create meaningful embeddings for longer text. Previous multilingual models would not produce quality embeddings for anything longer than a sentence, whereas Cohere's multilingual-22-12 model can do it for paragraphs of text.

Dataset

We'll start by setting up our dataset for multilingual search, using the multilingual Wikipedia dataset.

To download the dataset we do:

[2]
[3]
{'id': 0,
 'title': 'Deaths in 2022',
 'text': 'The following notable deaths occurred in 2022. Names are reported under the date of death, in alphabetical order. A typical entry reports information in the following sequence:',
 'url': 'https://en.wikipedia.org/wiki?curid=69407798',
 'wiki_id': 69407798,
 'views': 5674.4492597435465,
 'paragraph_id': 0,
 'langs': 38}
[6]
{'id': 0,
 'title': 'Italia',
 'text': "LItalia (, ), ufficialmente Repubblica Italiana, è uno Stato membro dell'Unione europea, situato nell'Europa meridionale, il cui territorio coincide in gran parte con l'omonima regione geografica. L'Italia è una repubblica parlamentare unitaria e conta una popolazione di circa 59 milioni di abitanti, che ne fanno il terzo Stato dell'Unione europea per numero di abitanti. La capitale è Roma.",
 'url': 'https://it.wikipedia.org/wiki?curid=2340360',
 'wiki_id': 2340360,
 'views': 3425.779427882056,
 'paragraph_id': 0,
 'langs': 307}

We have 6.46M English records and 1.74M Italian records.

Feel free to use the full dataset if you like, but naturally this will cost money.

For the sake of time and your pocket, in this demo we'll stick with a smaller set of ~100K records from each language. You can modify this number later, when we get to the Indexing step.
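As a rough illustration of capping the number of records per language, here is a minimal sketch. It assumes each language is available as an iterable of record dictionaries (for example a streamed Hugging Face dataset). The helper name `take_records`, the `lang` field it adds, and the tiny stand-in streams are assumptions made for this example and are not taken from the original notebook.

```python
from itertools import islice

def take_records(stream, lang, limit):
    """Yield at most `limit` records from `stream`, tagging each with its language."""
    for record in islice(stream, limit):
        # Copy the record so the source data is left untouched.
        yield {**record, "lang": lang}

# Stand-in streams so the sketch runs offline. In practice these could be
# streamed datasets, e.g. datasets.load_dataset(..., streaming=True) (assumed).
english = ({"id": i, "title": f"en-{i}"} for i in range(1000))
italian = ({"id": i, "title": f"it-{i}"} for i in range(1000))

limit = 5  # the demo would use a much larger per-language limit
subset = list(take_records(english, "en", limit)) + list(take_records(italian, "it", limit))
print(len(subset))
```

Because `islice` stops pulling from the stream once the limit is reached, the full dataset never has to be downloaded or held in memory.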

Encoding with Cohere

To embed our text using Cohere, we first need to initialize our connection to Cohere. This requires an API key:

[7]

Given some text we embed it using the multilingual-22-12 model like so:

[8]
768, 2

This shows that we have two 768-dimensional vectors, one for each text that we just encoded.
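The embedding step and the shape check can be sketched roughly as follows. This assumes `co` is an initialized Cohere client (for instance created with `cohere.Client(api_key)`) whose `embed` call returns a response with an `.embeddings` list. The helper names `embed_texts` and `describe_shape` and the offline stand-in client are hypothetical, added only so the sketch runs without an API key.

```python
from types import SimpleNamespace

def embed_texts(co, texts, model="multilingual-22-12"):
    """Embed a list of texts and return one vector per text.

    Assumes `co.embed(texts=..., model=...)` returns an object
    with an `.embeddings` attribute (a list of float lists).
    """
    response = co.embed(texts=texts, model=model)
    return response.embeddings

def describe_shape(embeddings):
    """Return (vector dimensionality, number of vectors)."""
    return len(embeddings[0]), len(embeddings)

# Offline stand-in that mimics only the call used above (an assumption,
# not the real client): it returns a 768-dimensional zero vector per text.
fake_co = SimpleNamespace(
    embed=lambda texts, model: SimpleNamespace(embeddings=[[0.0] * 768 for _ in texts])
)

embeds = embed_texts(fake_co, ["hello world", "ciao mondo"])
print(describe_shape(embeds))
```

Passing the client in as an argument keeps the helper easy to test and makes it simple to swap in the real client once an API key is available.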

That's it: we've created our embeddings, and it's incredibly easy to do.

Before we embed and index everything, we need to initialize a vector index in Pinecone to store our embeddings.

Creating Vector Index

To create a vector index, we first initialize our connection to Pinecone. This requires a free API key, which we pass below:

[10]

Now we set up our index specification, which lets us define the cloud provider and region where we want to deploy our index. A list of all available providers and regions can be found in the Pinecone documentation.

[ ]

Now we can initialize the vector index. We need a few parameters to do this:

  • dimension: the vector dimensionality, which must match the embedding model's dimensionality (768 for us).

  • metric: the similarity metric used to compare vectors. Different embedding models produce vectors that should be used with different metrics. In this case we need 'dotproduct'.
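A minimal sketch of how these two parameters might be passed when creating the index. It assumes a recent Pinecone Python client in which `pc.list_indexes().names()`, `pc.create_index(...)`, and `pc.Index(...)` are available. The helper name `ensure_index`, the index name, the placeholder spec values, and the in-memory stand-in client are all hypothetical and exist only so the sketch runs offline.

```python
from types import SimpleNamespace

def ensure_index(pc, name, spec, dimension=768, metric="dotproduct"):
    """Create the index if it does not exist yet, then return a handle to it.

    `pc` is assumed to be an initialized Pinecone client and `spec`
    an index specification naming the cloud provider and region.
    """
    if name not in pc.list_indexes().names():
        pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return pc.Index(name)

class InMemoryClient:
    """Offline stand-in that mimics only the three calls used above."""
    def __init__(self):
        self.indexes = {}
        self.create_calls = 0
    def list_indexes(self):
        return SimpleNamespace(names=lambda: list(self.indexes))
    def create_index(self, name, dimension, metric, spec):
        self.create_calls += 1
        self.indexes[name] = {"dimension": dimension, "metric": metric, "spec": spec}
    def Index(self, name):
        return self.indexes[name]

pc = InMemoryClient()
spec = {"cloud": "aws", "region": "us-east-1"}  # hypothetical placeholder values
ensure_index(pc, "multilingual-demo", spec)
index = ensure_index(pc, "multilingual-demo", spec)  # second call reuses the index
print(index["dimension"], index["metric"], pc.create_calls)
```

Checking for an existing index first makes the cell safe to re-run, which is handy in a notebook.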

[ ]
[11]
{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

With our embedding model and vector index set up, we can move on to indexing everything.

Indexing Everything

[12]
  0%|          | 0/34 [00:00<?, ?it/s]
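The indexing loop can be sketched along these lines. It assumes each record carries `id`, `title`, `text`, and `url` fields plus a `lang` code added while loading (the raw records do not include one), that `embed_fn` maps a list of texts to a list of vectors, and that `index.upsert` accepts a list of `(id, values, metadata)` tuples. The names `batched`, `index_records`, and `RecordingIndex` are hypothetical, and the stand-in index only records what would be sent.

```python
def batched(records, batch_size):
    """Yield consecutive lists of at most `batch_size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def index_records(records, embed_fn, index, batch_size=100):
    """Embed records batch by batch and upsert them; return how many were sent."""
    total = 0
    for batch in batched(records, batch_size):
        # Each language numbers its records from 0, so prefix the language
        # code to keep ids unique across languages.
        ids = [f"{r['lang']}-{r['id']}" for r in batch]
        vectors = embed_fn([r["text"] for r in batch])
        metadata = [
            {"title": r["title"], "text": r["text"], "url": r["url"], "lang": r["lang"]}
            for r in batch
        ]
        index.upsert(vectors=list(zip(ids, vectors, metadata)))
        total += len(batch)
    return total

class RecordingIndex:
    """Offline stand-in that stores what would be upserted."""
    def __init__(self):
        self.stored = {}
    def upsert(self, vectors):
        for vec_id, values, meta in vectors:
            self.stored[vec_id] = (values, meta)

records = [
    {"id": i, "lang": lang, "title": f"t{i}", "text": f"text {i}", "url": f"u{i}"}
    for lang in ("en", "it") for i in range(5)
]
index = RecordingIndex()
count = index_records(records, lambda texts: [[0.0] * 768 for _ in texts], index, batch_size=4)
print(count, len(index.stored))
```

Storing the title, text, URL, and language as metadata means query results can be displayed without going back to the original dataset.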

We check for the total number of vectors added to the index:

[13]
{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 20400}},
 'total_vector_count': 20400}

Now we move on to querying.

Making Queries

We first define a search function to handle embedding, querying, and printing results.
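One way such a search function could look is sketched here. It assumes `embed_fn` maps a list of texts to a list of vectors, that `index.query(vector=..., top_k=..., include_metadata=True)` returns a response with a `matches` list, and that each match's metadata holds `title`, `text`, `url`, and `lang`. The helper names and the 100-character preview are assumptions inferred from the layout of the results in this section, not the original code.

```python
from urllib.parse import quote_plus

def translate_url(title, text):
    """Build a Google Translate link for a title and its passage."""
    query = quote_plus(f"{title}\n{text}")
    return "https://translate.google.com/?sl=auto&tl=en&text=" + query

def format_match(rank, meta, preview_chars=100):
    """Render one result: title and language, URL, text preview, translate link."""
    return (
        f"{rank}. {meta['title']} ({meta['lang']})\n"
        f"  {meta['url']}\n"
        f"  {meta['text'][:preview_chars]}...\n"
        f"  Translate: {translate_url(meta['title'], meta['text'])}"
    )

def search(query, embed_fn, index, top_k=3):
    """Embed the query, fetch the nearest passages, and print them."""
    xq = embed_fn([query])[0]
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    lines = [
        format_match(rank, match["metadata"])
        for rank, match in enumerate(res["matches"], start=1)
    ]
    print("\n\n".join(lines))
    return lines

print(translate_url("Mostro di Firenze", "Uno dei testimoni"))
```

Keeping the formatting in its own function makes it easy to change how results are displayed without touching the query logic.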

[22]

Let's try something not well covered by English Wikipedia pages:

[24]
1. Mostro di Firenze (it)
  https://it.wikipedia.org/wiki?curid=658864
  Uno dei testimoni principali dell'accusa contro Pacciani fu Giuseppe Bevilacqua, un funzionario dell...
  Translate: https://translate.google.com/?sl=auto&tl=en&text=Mostro+di+Firenze%0AUno+dei+testimoni+principali+dell%27accusa+contro+Pacciani+fu+Giuseppe+Bevilacqua%2C+un+funzionario+dell%27American+Battle+Monuments+Commission+che+nel+1985+dirigeva+il+cimitero+americano+di+Firenze+in+localit%C3%A0+Falciani%2C+a+poche+centinaia+di+metri+dall%27ultima+scena+del+crimine+del+Mostro+in+Via+degli+Scopeti.

2. Gianluigi Buffon (it)
  https://it.wikipedia.org/wiki?curid=103015
  Nell'estate 2009 viene ingaggiato come testimonial dalla "poker room online" PokerStars. Nell'ottobr...
  Translate: https://translate.google.com/?sl=auto&tl=en&text=Gianluigi+Buffon%0ANell%27estate+2009+viene+ingaggiato+come+testimonial+dalla+%22poker+room+online%22+PokerStars.+Nell%27ottobre+2011%2C+si+ritrova+assieme+ad+Eleonora+Abbagnato%2C+in+uno+spot+pubblicitario+per+Ferrarelle.+Diversi+anni+dopo%2C+nel+2016%2C+subentra+a+Rocco+Siffredi+come+%22testimonial+pubblicitario%22+di+Amica+Chips%3B+mentre+nel+2018+tocca+alla+Birra+Moretti%2C+che+lancia+con+la+sua+immagine%2C+una+campagna+multimediale+chiamata%2C+%22Fai+ridere+Gigi+Buffon%22.

3. Silvio Berlusconi (it)
  https://it.wikipedia.org/wiki?curid=2749106
  Dal 2020 Berlusconi è fidanzato con Marta Fascina (Melito di Porto Salvo, 9 gennaio 1990), deputata ...
  Translate: https://translate.google.com/?sl=auto&tl=en&text=Silvio+Berlusconi%0ADal+2020+Berlusconi+%C3%A8+fidanzato+con+Marta+Fascina+%28Melito+di+Porto+Salvo%2C+9+gennaio+1990%29%2C+deputata+di+Forza+Italia+eletta+nel+2018+nella+circoscrizione+Campania+1.

Once you're done, delete the index to save resources.

[ ]