Retrieval Perf Bge Flag Fp32
[1]
Dataset Download
We're going to test with a more real-world use case, with messy, imperfect data. We will use the jamescalam/ai-arxiv-chunked dataset.
[2]
Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})
First we define our embedding function.
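The embedding cell itself is not shown here, so the following is only a minimal sketch of what such a function could look like with the FlagEmbedding package in full precision (fp32). The checkpoint name and the helper names load_bge_model and embed are assumptions made for illustration, not taken from the notebook.

```python
import numpy as np

# Assumption: illustrative checkpoint; the notebook does not say which BGE model it loads.
MODEL_NAME = "BAAI/bge-base-en-v1.5"


def load_bge_model(model_name: str = MODEL_NAME):
    """Load a BGE model through FlagEmbedding with fp16 disabled (fp32 inference)."""
    # Imported lazily so that embed() below can be used with any encoder object.
    from FlagEmbedding import FlagModel

    return FlagModel(model_name, use_fp16=False)


def embed(model, docs):
    """Encode a list of strings into a 2-D float32 array, one row per document.

    `model` is any object with an `encode(list_of_str)` method, such as the
    FlagModel returned by load_bge_model().
    """
    vectors = model.encode(docs)
    return np.asarray(vectors, dtype=np.float32)
```

Passing the model in as an argument keeps the function easy to try with a stand-in encoder before downloading any weights.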
[8]
0
Use this to build a NumPy array of BGE embedding vectors.
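One way to build that array is to embed the chunks batch by batch and stack the results. The sketch below assumes a hypothetical helper build_embedding_array, a caller-supplied embed_fn that maps a list of strings to a list of vectors, and an illustrative batch size of 2048; none of these names or values are confirmed by the notebook.

```python
import numpy as np


def build_embedding_array(chunks, embed_fn, batch_size=2048):
    """Embed `chunks` in batches and stack the results into one 2-D array.

    `embed_fn` is assumed to take a list of strings and return one vector
    per string. `batch_size` is an illustrative value.
    """
    if len(chunks) == 0:
        raise ValueError("chunks must not be empty")

    batches = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        # Convert each batch to float32 so the final array has a single dtype.
        batches.append(np.asarray(embed_fn(batch), dtype=np.float32))

    # Rows stay in the same order as `chunks`, so row i belongs to chunks[i].
    return np.concatenate(batches, axis=0)
```

Keeping row order aligned with the dataset means an index returned by a later similarity search can be used directly to look up the original chunk text.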
[9]
Now we need to create the query mechanism. This is simply a cosine similarity calculation between a query vector and the vectors in arr.
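A minimal sketch of such a query step in NumPy follows. The function name cosine_top_k and the default of three results are assumptions for illustration; the query vector is assumed to come from the same embedding model as the rows of arr.

```python
import numpy as np


def cosine_top_k(query_vec, arr, top_k=3):
    """Return the indices and scores of the `top_k` rows of `arr` most similar to `query_vec`."""
    query_vec = np.asarray(query_vec, dtype=np.float32)
    arr = np.asarray(arr, dtype=np.float32)

    # Cosine similarity is the dot product divided by the product of the norms.
    norms = np.linalg.norm(arr, axis=1) * np.linalg.norm(query_vec)
    sims = (arr @ query_vec) / np.maximum(norms, 1e-12)  # guard against zero vectors

    # Sort by descending similarity and keep the best `top_k` rows.
    top = np.argsort(-sims)[:top_k]
    return top, sims[top]
```

The returned indices can then be used to fetch the matching chunk text from the dataset. If the embeddings are already L2-normalized, the division is redundant and a plain dot product gives the same ranking.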
[14]
[15]
PaLM and LaMDA (Thoppilan et al., 2022). PaLM and LLaMA were trained on datasets that contain a similar number of code tokens. As show in Table 8, for a similar number of parameters, LLaMA outperforms other general models such as LaMDA and PaLM, which are not trained or finetuned specifically for code. LLaMA with 13B parameters and more outperforms LaMDA 137B on both HumanEval and MBPP. LLaMA 65B also outperforms PaLM 62B, even when it is trained longer. The pass@1 results reported in this table were obtained by sampling with temperature 0.1. The pass@100 and pass@80 metrics were obtained with temperature 0.8. We use the same method as Chen et al. (2021) to obtain unbiased estimates of the pass@k. It is possible to improve the performance on code by finetuning on code-specific tokens. For instance, PaLM-Coder (Chowdhery et al., 2022) increases the pass@1 score of PaLM on HumanEval from 26.2% for PaLM to 36%. Other models trained
----------
•Zero-shot. We provide a textual description of the task and a test example. The model either provides an answer using open-ended generation, or ranks the proposed answers. •Few-shot. We provide a few examples of the task (between 1 and 64) and a test example. The model takes this text as input and generates the answer or ranks different options. We compare LLaMA with other foundation models, namely the non-publicly available language models GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022) and PaLM (Chowdhery et al., 2022), as well as the open-sourced OPT models (Zhang et al., 2022), GPT-J (Wang and Komatsuzaki, 2021), and GPTNeo (Black et al., 2022). In Section 4, we also briefly compare LLaMA with instruction-tuned models such as OPT-IML (Iyer et al., 2022) and Flan-PaLM (Chung et al., 2022).We evaluate LLaMA on free-form generation tasks and multiple choice tasks. In the multiple choice tasks, the objective is to select the most
----------
but BoolQ. Similarly, this model surpasses PaLM540B everywhere but on BoolQ and WinoGrande. LLaMA-13B model also outperforms GPT-3 on most benchmarks despite being 10 smaller. 3.2 Closed-book Question Answering We compare LLaMA to existing large language models on two closed-book question answering benchmarks: Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). For both benchmarks, we report exact match performance in a closed book setting, i.e., where the models do not have access to documents that contain evidence to answer the question. In Table 4, we report performance on NaturalQuestions, and in Table 5, we report on TriviaQA. On both benchmarks, LLaMA-65B achieve state-of-the-arts performance in the zero-shot and few-shot settings. More importantly, the LLaMA-13B is also competitive on these benchmarks with GPT-3 and Chinchilla, despite being 5-10 smaller. This model runs on a single V100 GPU during inference. 0-shot 1-shot 5-shot 64-shot Gopher 280B 43.5 - 57.0 57.2
----------
[16]
Ricardo Lopez-Barquilla, Marc Shedroff, Kelly Michelena, Allie Feinstein, Amit Sangani, Geeta Chauhan,ChesterHu,CharltonGholson,AnjaKomlenovic,EissaJamil,BrandonSpence,Azadeh Yazdan, Elisa Garcia Anzano, and Natascha Parks. •ChrisMarra,ChayaNayak,JacquelinePan,GeorgeOrlin,EdwardDowling,EstebanArcaute,Philomena Lobo, Eleonora Presani, and Logan Kerr, who provided helpful product and technical organization support. 46 •Armand Joulin, Edouard Grave, Guillaume Lample, and Timothee Lacroix, members of the original Llama team who helped get this work started. •Drew Hamlin, Chantal Mora, and Aran Mun, who gave us some design input on the figures in the paper. •Vijai Mohan for the discussions about RLHF that inspired our Figure 20, and his contribution to the internal demo. •Earlyreviewersofthispaper,whohelpedusimproveitsquality,includingMikeLewis,JoellePineau, Laurens van der Maaten, Jason Weston, and Omer Levy.
----------
AI research community could take to build consensus around how to red team andhow to release findings from red teaming . For how to red team , we have detailed our initial approach. However, we conducted this effort in isolation, and we would have benefited from participating in a community-based effort to address certain open questions: • Who should red team and why? • What protections should we put in place to ensure the safety of the red team? • What instructions and information about the models should we provide to the red team? • How should we annotate and analyze the data we collect? • What constitutes a successful red team attempt? We can make progress towards answering these questions by convening a multidisciplinary community to share different approaches to internal red teaming and drive toward consensus. The research community lacks shared norms and best practices for how to release findings from red teaming. As a result, we made our decision to release the data largely on our own and likely missed critical perspectives from experts, other disciplines, and members of the public.14The decision for how to appropriately release findings will ultimately require a subjective judgment call. For our purposes, we reviewed a sample of our red team dataset and evaluated the pros and cons of a public release (See §A.5). Among them
----------
improved various NLP tasks. The introduction of the Transformer architecture [46] laid the groundwork for the development of these powerful language models (Devlin et al. 11, Radford et al. 34, Lewis et al. 21, Raffel et al. 35, Brown et al. 6, Chowdhery et al. 8, Zhang et al. 52, Scao et al. 37, Touvron et al. 45,inter alia ). Among them, GPT-3 [ 6] has been particularly influential, showcasing an exceptional capacity to adapt to diverse tasks through the in-context learning capabilities of LLMs. Recently, LLaMA [ 45] has emerged as a pivotal open-source base language model, driving a series of open-source breakthroughs [43, 7, 15, 23] that strive to keep pace with the closed-source frontier in the field. J Experimental Details J.1 (Topic-Guided Red-Teaming) Self-Instruct For both Self-Instruct and Topic-Guided Red-Teaming Self-Instruct, we set the maximal number of new tokens in the generation to 384. The new tokens are generated by nuclear sampling [ 16] with a top-p threshold p= 0:98and temperature t= 1:0. J.2 Principle-Driven Self-Alignment
----------
[17]
Rank the {{num}} passages above based on their relevance to the search query. The passages
should be listed in descending order using identifiers, and the most relevant passages should be
listed first, and the output format should be [] > [], e.g., [1] > [2]. Only response the ranking results,
do not say any word or explain.
B Related Work
B.1 Information Retrieval with LLMs
Recently, large language models (LLMs) have found increasing applications in information retrieval.
Several approaches have been proposed to utilize LLMs for passage retrieval. For example, SGPT (Muennighoff, 2022) generates text embeddings using GPT, DSI (Tay et al., 2022) proposes a differentiable
search index, and HyDE (Gao et al., 2022) generates pseudo-documents using GPT-3. In addition, LLMs
have also been used for passage re-ranking tasks. UPR (Sachan et al., 2022a) and SGPT-CE (Muennighoff,
2022) introduce instructional query generation methods, while HELM (Liang et al., 2022) utilizes instruction relevance generation. LLMs are also employed for training data generation. InPars (Bonifacio et al.,
----------
thinking, problem-solving, and analytical skills, making them ideal for evaluating the performance
of large language models in relation to human cognition. More specifically, we collect exams
corresponding to 8 subjects from Chinese Gaokao: history, math, English, Chinese, geography,
biology, chemistry and physics. We select mathematical questions from GRE, select English and
math subjects from SAT to construct the benchmark.
Law School Admission Test: Law school admission tests, such as the LSAT , are intended to measure
the reasoning and analytical skills of prospective law students. These tests include sections on logical
reasoning, reading comprehension, and analytical reasoning, which challenge the test-takers’ ability
to analyze complex information and draw accurate conclusions. Incorporating these tasks in our
benchmark enables us to assess language models’ capabilities in legal reasoning and analysis.
Lawyer Qualification Test: Lawyer qualification tests, such as the bar exam, assess the legal
knowledge, analytical skills, and ethical understanding of individuals pursuing a career in law. These
exams cover a broad range of legal topics, including constitutional law, contract law, criminal law, and
property law, and require candidates to demonstrate their ability to apply legal principles and reason
effectively. By incorporating lawyer qualification tests in our benchmark, we can evaluate language
models’ performance in the context of professional legal expertise and ethical judgment. Specifically,
----------
the LLMs’ ability for searching. For example, New
Bing utilizes GPT-4 to generate responses based on
the retrieved documents (Microsoft, 2023). As a
Figure 1: Average results of ChatGPT and GPT-4
(zero-shot) on passage re-ranking benchmarks (TREC,
BEIR, and Mr.TyDi), compared with BM25 and
previous best supervised systems (SOTA sup., e.g.,
monoT5 (Nogueira et al., 2020)).
result, it is still unclear whether LLMs, e.g., ChatGPT, are good at search.
To this end, this paper aims to investigate the potential of LLMs in relevance ranking for IR. Specifically, we focus on the following two questions:
•(RQ1) How does ChatGPT perform on passage re-ranking tasks?
•(RQ2) How to imitate the ranking capabilities
of ChatGPT to a smaller, specialized model?
To answer the first question, we explore two
strategies (Sachan et al., 2022a; Liang et al., 2022)
to instruct ChatGPT performing on passage reranking tasks, which we named instructional query
generation andinstructional relevance generation .
However, we observe that these methods have limited performance in re-ranking and heavily rely
----------
[18]
ranked from top 1 to top 5. We compare the five ranked groups against the baseline, and show the relative scores in Figure 4 (a,b). The ChatGPT and GPT-4 evaluation is consistent with the orders 6 60% 70% 80% 90% 100%LLaMA (13B)Alpaca (13B)Vicuna (13B)LLaMA_GPT4 (7B)LLaMA_GPT4 (7B, R1)BardChatGPTGPT4 67% 466 : 69776% 539 : 71293% 639 : 68887% 607 : 70089% 620 : 69392% 624 : 68195% 652 : 684100% 758 : 758(a) All chatbots against GPT-4, whose Chinese responses are translated from English 60% 70% 80% 90% 100%LLaMA (13B)Alpaca (13B)Vicuna (13B)LLaMA_GPT4 (7B)LLaMA_GPT4 (7B, R1)BardChatGPTGPT4
----------
to GPT-3 corresponds to the Stanford Alpaca model. From Figure 3(a), we observe that ( i) For the “Helpfulness” criterion, GPT-4 is the clear winner with 54.12% of the votes. GPT-3 only wins 19.74% of the time. ( ii) For the “Honesty” and “Harmlessness” criteria, the largest portion of votes goes to the tie category, which is substantially higher than the winning categories but GPT-3 (Alpaca) is slightly superior. Second, we compare GPT-4-instruction-tuned LLaMA models against the teacher model GPT-4 in Figure 3(b). The observations are quite consistent over the three criteria: GPT-4-instruction-tuned LLaMA performs similarly to the original GPT-4. We conclude that learning from GPT-4 generated 5 60% 70% 80% 90% 100%12345BRanking Group 94% 624 : 66792% 614 : 67091% 623 : 68289% 597 : 66989% 605 : 67891% 609 : 666
----------
-0.043 -0.009+0.0132-0.004 +0.0562 +0.0387-0.012 -0.076Alpaca: 0.39 LLaMA-GPT4: 0.34 GPT4: 0.37Figure 6: ROUGE-L on unnatural instructions evaluated with 9K samples. The instructions are grouped into four subsets based on the ground-truth response length. The mean values are reported in the legend. The difference with GPT-4 is reported on the bar per group. LLaMA-GPT4 is a closer proxy to GPT-4 than Alpaca. closely follow the behavior of GPT-4. When the sequence length is short, both LLaMA-GPT4 and GPT-4 can generate responses that contains the simple ground truth answers, but add extra words to make the response more chat-like, which probably leads to lower ROUGE-L scores. 5 R ELATED WORK Instruction Tuning. Instruction tuning of LLMs is an increasingly popular research direction in NLP (Zhong et al., 2021; Ouyang et al., 2022; Wei et al., 2021). Existing works aim to improve the quality and scale of three factors in the development pipeline, including instruction-following
----------