Retrieval Perf Bge Flag Fp32

[1]
  Building wheel for FlagEmbedding (setup.py) ... done
  Building wheel for sentence_transformers (setup.py) ... done

Dataset Download

We're going to test with a more realistic use case, using messy, imperfect data. We will use the jamescalam/ai-arxiv-chunked dataset.

[2]
Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

First we define our embedding function.

[8]
0

Use this to build a NumPy array of BGE embedding vectors.
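The batching step can be sketched roughly as follows. This is a minimal illustration, not the notebook's own cell: `embed` here is a hypothetical stand-in that returns deterministic pseudo-random vectors so the example runs without downloading a model, and `build_array`, `DIM`, and the batch size are assumptions chosen for the example. In practice `embed` would call the real embedding model.

```python
import hashlib

import numpy as np

DIM = 8  # hypothetical dimensionality; a real embedding model uses far more


def embed(docs):
    """Stand-in embedding function (hypothetical, for illustration only).

    Returns one deterministic pseudo-random vector per document so the
    batching logic can be run without a model.
    """
    rows = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "little")
        rows.append(np.random.default_rng(seed).standard_normal(DIM))
    return np.asarray(rows, dtype=np.float32)


def build_array(chunks, batch_size=2048):
    """Embed `chunks` batch by batch and stack the results into one array."""
    parts = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        parts.append(embed(batch))
    # One row per chunk, in the same order as the input list
    return np.concatenate(parts, axis=0)


chunks = [f"example chunk {i}" for i in range(10)]
arr = build_array(chunks, batch_size=4)
print(arr.shape)
```

Embedding in batches keeps memory use bounded, and stacking at the end keeps row `i` of `arr` aligned with chunk `i` of the dataset, which the query step relies on.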

[9]
  0%|          | 0/21 [00:00<?, ?it/s]

Inference Embeddings: 100%|██████████| 8/8 [00:25<00:00,  3.25s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:29<00:00,  3.63s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:28<00:00,  3.58s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:26<00:00,  3.28s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:27<00:00,  3.38s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:27<00:00,  3.45s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:25<00:00,  3.21s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:25<00:00,  3.24s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:26<00:00,  3.30s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:25<00:00,  3.21s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:28<00:00,  3.59s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:27<00:00,  3.42s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:25<00:00,  3.21s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:26<00:00,  3.31s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:25<00:00,  3.24s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:22<00:00,  2.77s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:26<00:00,  3.27s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:26<00:00,  3.30s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:26<00:00,  3.28s/it]

Inference Embeddings: 100%|██████████| 8/8 [00:26<00:00,  3.29s/it]

Inference Embeddings: 100%|██████████| 3/3 [00:07<00:00,  2.66s/it]

Now we need to create the query mechanism. This is simply a cosine similarity calculation between a query vector and the vectors in our arr array.
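A cosine similarity search over `arr` might look like the sketch below. The function name `query`, its signature, and the `top_k` default are assumptions for illustration; the toy 2-dimensional vectors stand in for real embeddings.

```python
import numpy as np


def query(q_vec, arr, top_k=3):
    """Return indices and cosine similarities of the top_k rows of `arr`."""
    # Normalise rows and query so the dot product equals cosine similarity
    arr_norm = arr / np.linalg.norm(arr, axis=1, keepdims=True)
    q_norm = q_vec / np.linalg.norm(q_vec)
    sims = arr_norm @ q_norm
    # Sort by descending similarity and keep the best top_k
    top = np.argsort(-sims)[:top_k]
    return top, sims[top]


# Toy vectors standing in for document embeddings
arr = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, scores = query(np.array([2.0, 0.1]), arr, top_k=2)
print(idx, scores)
```

The returned indices can then be used to look up the matching `chunk` text in the dataset. If the document vectors are already unit-normalised, the row normalisation can be skipped and the dot product used directly.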

[14]
[15]
PaLM and LaMDA (Thoppilan et al., 2022). PaLM
and LLaMA were trained on datasets that contain
a similar number of code tokens.
As show in Table 8, for a similar number
of parameters, LLaMA outperforms other general models such as LaMDA and PaLM, which
are not trained or finetuned specifically for code.
LLaMA with 13B parameters and more outperforms LaMDA 137B on both HumanEval and
MBPP. LLaMA 65B also outperforms PaLM 62B,
even when it is trained longer. The pass@1 results
reported in this table were obtained by sampling
with temperature 0.1. The pass@100 and pass@80
metrics were obtained with temperature 0.8. We
use the same method as Chen et al. (2021) to obtain
unbiased estimates of the pass@k.
It is possible to improve the performance on code
by finetuning on code-specific tokens. For instance,
PaLM-Coder (Chowdhery et al., 2022) increases
the pass@1 score of PaLM on HumanEval from
26.2% for PaLM to 36%. Other models trained
----------
•Zero-shot. We provide a textual description
of the task and a test example. The model
either provides an answer using open-ended
generation, or ranks the proposed answers.
•Few-shot. We provide a few examples of the
task (between 1 and 64) and a test example.
The model takes this text as input and generates the answer or ranks different options.
We compare LLaMA with other foundation models, namely the non-publicly available language
models GPT-3 (Brown et al., 2020), Gopher (Rae
et al., 2021), Chinchilla (Hoffmann et al., 2022)
and PaLM (Chowdhery et al., 2022), as well as
the open-sourced OPT models (Zhang et al., 2022),
GPT-J (Wang and Komatsuzaki, 2021), and GPTNeo (Black et al., 2022). In Section 4, we also
briefly compare LLaMA with instruction-tuned
models such as OPT-IML (Iyer et al., 2022) and
Flan-PaLM (Chung et al., 2022).We evaluate LLaMA on free-form generation
tasks and multiple choice tasks. In the multiple
choice tasks, the objective is to select the most
----------
but BoolQ. Similarly, this model surpasses PaLM540B everywhere but on BoolQ and WinoGrande.
LLaMA-13B model also outperforms GPT-3 on
most benchmarks despite being 10 smaller.
3.2 Closed-book Question Answering
We compare LLaMA to existing large language
models on two closed-book question answering
benchmarks: Natural Questions (Kwiatkowski
et al., 2019) and TriviaQA (Joshi et al., 2017). For
both benchmarks, we report exact match performance in a closed book setting, i.e., where the models do not have access to documents that contain
evidence to answer the question. In Table 4, we
report performance on NaturalQuestions, and in Table 5, we report on TriviaQA. On both benchmarks,
LLaMA-65B achieve state-of-the-arts performance
in the zero-shot and few-shot settings. More importantly, the LLaMA-13B is also competitive on
these benchmarks with GPT-3 and Chinchilla, despite being 5-10 smaller. This model runs on a
single V100 GPU during inference.
0-shot 1-shot 5-shot 64-shot
Gopher 280B 43.5 - 57.0 57.2
----------
[16]
Ricardo Lopez-Barquilla, Marc Shedroff, Kelly Michelena, Allie Feinstein, Amit Sangani, Geeta
Chauhan,ChesterHu,CharltonGholson,AnjaKomlenovic,EissaJamil,BrandonSpence,Azadeh
Yazdan, Elisa Garcia Anzano, and Natascha Parks.
•ChrisMarra,ChayaNayak,JacquelinePan,GeorgeOrlin,EdwardDowling,EstebanArcaute,Philomena Lobo, Eleonora Presani, and Logan Kerr, who provided helpful product and technical organization support.
46
•Armand Joulin, Edouard Grave, Guillaume Lample, and Timothee Lacroix, members of the original
Llama team who helped get this work started.
•Drew Hamlin, Chantal Mora, and Aran Mun, who gave us some design input on the figures in the
paper.
•Vijai Mohan for the discussions about RLHF that inspired our Figure 20, and his contribution to the
internal demo.
•Earlyreviewersofthispaper,whohelpedusimproveitsquality,includingMikeLewis,JoellePineau,
Laurens van der Maaten, Jason Weston, and Omer Levy.
----------
AI research community could take to build consensus around how to red team andhow to release findings
from red teaming .
For how to red team , we have detailed our initial approach. However, we conducted this effort in isolation, and we would have benefited from participating in a community-based effort to address certain open
questions:
• Who should red team and why?
• What protections should we put in place to ensure the safety of the red team?
• What instructions and information about the models should we provide to the red team?
• How should we annotate and analyze the data we collect?
• What constitutes a successful red team attempt?
We can make progress towards answering these questions by convening a multidisciplinary community to
share different approaches to internal red teaming and drive toward consensus.
The research community lacks shared norms and best practices for how to release findings from red teaming. As a result, we made our decision to release the data largely on our own and likely missed critical
perspectives from experts, other disciplines, and members of the public.14The decision for how to appropriately release findings will ultimately require a subjective judgment call. For our purposes, we reviewed a
sample of our red team dataset and evaluated the pros and cons of a public release (See §A.5). Among them
----------
improved various NLP tasks. The introduction of the Transformer architecture [46] laid the groundwork for the development of these powerful language models (Devlin et al. 11, Radford et al. 34, Lewis
et al. 21, Raffel et al. 35, Brown et al. 6, Chowdhery et al. 8, Zhang et al. 52, Scao et al. 37, Touvron
et al. 45,inter alia ). Among them, GPT-3 [ 6] has been particularly influential, showcasing an
exceptional capacity to adapt to diverse tasks through the in-context learning capabilities of LLMs.
Recently, LLaMA [ 45] has emerged as a pivotal open-source base language model, driving a series
of open-source breakthroughs [43, 7, 15, 23] that strive to keep pace with the closed-source frontier
in the field.
J Experimental Details
J.1 (Topic-Guided Red-Teaming) Self-Instruct
For both Self-Instruct and Topic-Guided Red-Teaming Self-Instruct, we set the maximal number of
new tokens in the generation to 384. The new tokens are generated by nuclear sampling [ 16] with a
top-p threshold p= 0:98and temperature t= 1:0.
J.2 Principle-Driven Self-Alignment
----------
[17]
Rank the {{num}} passages above based on their relevance to the search query. The passages
should be listed in descending order using identifiers, and the most relevant passages should be
listed first, and the output format should be [] > [], e.g., [1] > [2]. Only response the ranking results,
do not say any word or explain.
B Related Work
B.1 Information Retrieval with LLMs
Recently, large language models (LLMs) have found increasing applications in information retrieval.
Several approaches have been proposed to utilize LLMs for passage retrieval. For example, SGPT (Muennighoff, 2022) generates text embeddings using GPT, DSI (Tay et al., 2022) proposes a differentiable
search index, and HyDE (Gao et al., 2022) generates pseudo-documents using GPT-3. In addition, LLMs
have also been used for passage re-ranking tasks. UPR (Sachan et al., 2022a) and SGPT-CE (Muennighoff,
2022) introduce instructional query generation methods, while HELM (Liang et al., 2022) utilizes instruction relevance generation. LLMs are also employed for training data generation. InPars (Bonifacio et al.,
----------
thinking, problem-solving, and analytical skills, making them ideal for evaluating the performance
of large language models in relation to human cognition. More specifically, we collect exams
corresponding to 8 subjects from Chinese Gaokao: history, math, English, Chinese, geography,
biology, chemistry and physics. We select mathematical questions from GRE, select English and
math subjects from SAT to construct the benchmark.
Law School Admission Test: Law school admission tests, such as the LSAT , are intended to measure
the reasoning and analytical skills of prospective law students. These tests include sections on logical
reasoning, reading comprehension, and analytical reasoning, which challenge the test-takers’ ability
to analyze complex information and draw accurate conclusions. Incorporating these tasks in our
benchmark enables us to assess language models’ capabilities in legal reasoning and analysis.
Lawyer Qualification Test: Lawyer qualification tests, such as the bar exam, assess the legal
knowledge, analytical skills, and ethical understanding of individuals pursuing a career in law. These
exams cover a broad range of legal topics, including constitutional law, contract law, criminal law, and
property law, and require candidates to demonstrate their ability to apply legal principles and reason
effectively. By incorporating lawyer qualification tests in our benchmark, we can evaluate language
models’ performance in the context of professional legal expertise and ethical judgment. Specifically,
----------
the LLMs’ ability for searching. For example, New
Bing utilizes GPT-4 to generate responses based on
the retrieved documents (Microsoft, 2023). As a
Figure 1: Average results of ChatGPT and GPT-4
(zero-shot) on passage re-ranking benchmarks (TREC,
BEIR, and Mr.TyDi), compared with BM25 and
previous best supervised systems (SOTA sup., e.g.,
monoT5 (Nogueira et al., 2020)).
result, it is still unclear whether LLMs, e.g., ChatGPT, are good at search.
To this end, this paper aims to investigate the potential of LLMs in relevance ranking for IR. Specifically, we focus on the following two questions:
•(RQ1) How does ChatGPT perform on passage re-ranking tasks?
•(RQ2) How to imitate the ranking capabilities
of ChatGPT to a smaller, specialized model?
To answer the first question, we explore two
strategies (Sachan et al., 2022a; Liang et al., 2022)
to instruct ChatGPT performing on passage reranking tasks, which we named instructional query
generation andinstructional relevance generation .
However, we observe that these methods have limited performance in re-ranking and heavily rely
----------
[18]
ranked from top 1 to top 5. We compare the five ranked groups against the baseline, and show the
relative scores in Figure 4 (a,b). The ChatGPT and GPT-4 evaluation is consistent with the orders
6
60% 70% 80% 90% 100%LLaMA (13B)Alpaca (13B)Vicuna (13B)LLaMA_GPT4 (7B)LLaMA_GPT4 (7B, R1)BardChatGPTGPT4
67% 466 : 69776% 539 : 71293% 639 : 68887% 607 : 70089% 620 : 69392% 624 : 68195% 652 : 684100% 758 : 758(a) All chatbots against GPT-4, whose Chinese responses are translated from English
60% 70% 80% 90% 100%LLaMA (13B)Alpaca (13B)Vicuna (13B)LLaMA_GPT4 (7B)LLaMA_GPT4 (7B, R1)BardChatGPTGPT4
----------
to GPT-3 corresponds to the Stanford Alpaca model. From Figure 3(a), we observe that ( i) For the
“Helpfulness” criterion, GPT-4 is the clear winner with 54.12% of the votes. GPT-3 only wins 19.74%
of the time. ( ii) For the “Honesty” and “Harmlessness” criteria, the largest portion of votes goes
to the tie category, which is substantially higher than the winning categories but GPT-3 (Alpaca) is
slightly superior.
Second, we compare GPT-4-instruction-tuned LLaMA models against the teacher model GPT-4 in
Figure 3(b). The observations are quite consistent over the three criteria: GPT-4-instruction-tuned
LLaMA performs similarly to the original GPT-4. We conclude that learning from GPT-4 generated
5
60% 70% 80% 90% 100%12345BRanking Group 94% 624 : 66792% 614 : 67091% 623 : 68289% 597 : 66989% 605 : 67891% 609 : 666
----------
-0.043
-0.009+0.0132-0.004 +0.0562
+0.0387-0.012
-0.076Alpaca: 0.39 LLaMA-GPT4: 0.34 GPT4: 0.37Figure 6: ROUGE-L on unnatural instructions evaluated with 9K samples. The instructions are
grouped into four subsets based on the ground-truth response length. The mean values are reported in
the legend. The difference with GPT-4 is reported on the bar per group. LLaMA-GPT4 is a closer
proxy to GPT-4 than Alpaca.
closely follow the behavior of GPT-4. When the sequence length is short, both LLaMA-GPT4 and
GPT-4 can generate responses that contains the simple ground truth answers, but add extra words to
make the response more chat-like, which probably leads to lower ROUGE-L scores.
5 R ELATED WORK
Instruction Tuning. Instruction tuning of LLMs is an increasingly popular research direction in
NLP (Zhong et al., 2021; Ouyang et al., 2022; Wei et al., 2021). Existing works aim to improve
the quality and scale of three factors in the development pipeline, including instruction-following
----------