Retrieval Perf Cohere
[1]
Dataset Download
We're going to test with a more realistic use case: messy, imperfect data. We will use the jamescalam/ai-arxiv-chunked dataset.
[2]
Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})
First we define our embedding function.
[46]
Use this to build a NumPy array of Cohere embedding vectors.
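A minimal sketch of the batching step, under stated assumptions. The `embed` function is a hypothetical stand-in that returns random unit vectors so the sketch runs without an API key; in the notebook it would wrap the real embedding call. The names `build_index`, `batch_size`, and `dim` are also illustrative. A batch size of 128 is an assumption, chosen because it is consistent with 325 progress-bar steps for 41,584 chunks.

```python
import math
import numpy as np

def embed(texts, dim=8, seed=0):
    """Hypothetical stand-in for the real embedding call: one unit vector per text."""
    rng = np.random.default_rng(seed)
    vecs = rng.normal(size=(len(texts), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def build_index(chunks, batch_size=128):
    """Embed chunks batch by batch and stack the results into one 2-D array."""
    batches = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        batches.append(embed(batch))
    return np.vstack(batches)

chunks = [f"chunk {i}" for i in range(300)]  # toy data, not the real dataset
arr = build_index(chunks)                    # one row per chunk

# Number of batches needed for the full dataset at the assumed batch size.
n_batches = math.ceil(41584 / 128)
```

Batching keeps each request small and lets a progress bar track the work; stacking at the end yields a single array with one row per chunk.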
[47]
0%| | 0/325 [00:00<?, ?it/s]
[48]
[52]
array([18290, 39437, 39445])
Now we need to create the query mechanism. This is simply a cosine similarity calculation between a query vector and the vectors in arr.
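The cosine similarity query can be sketched as follows. This is an illustrative version, not necessarily the notebook's own code: the function name `query`, the `top_k` parameter, and the toy vectors are assumptions. It returns the top matches in ascending score order, which is one ordering consistent with the scores printed in the outputs here.

```python
import numpy as np

def query(q_vec, arr, top_k=3):
    """Return indices and cosine scores of the top_k rows of arr most similar to q_vec."""
    q_norm = np.linalg.norm(q_vec)
    arr_norm = np.linalg.norm(arr, axis=1)
    # Cosine similarity of the query against every row: shape (n_rows,)
    sim = arr @ q_vec / (arr_norm * q_norm)
    # Pick the top_k largest scores without sorting the whole array.
    top_idx = np.argpartition(sim, -top_k)[-top_k:]
    # Order the selected indices by ascending score.
    top_idx = top_idx[np.argsort(sim[top_idx])]
    return top_idx, sim[top_idx]

# Toy vectors, not real embeddings.
arr = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, scores = query(np.array([1.0, 0.0]), arr, top_k=2)
```

With the full array, `sim` has one score per chunk, and the returned indices can be used to look up the matching chunk text.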
[54]
[55]
(41584,)
[0.47466855 0.53013851 0.53044737]
(3,)
Equal contribution. Correspondence: {htouvron,
thibautlav,gizacard,egrave,glample}@meta.com
1https://github.com/facebookresearch/llama
performance, a smaller one trained longer will
ultimately be cheaper at inference. For instance,
although Hoffmann et al. (2022) recommends
training a 10B model on 200B tokens, we find
that the performance of a 7B model continues to
improve even after 1T tokens.
The focus of this work is to train a series of
language models that achieve the best possible performance at various inference budgets, by training
on more tokens than what is typically used. The
resulting models, called LLaMA , ranges from 7B
to 65B parameters with competitive performance
compared to the best existing LLMs. For instance,
LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller. We believe that
this model will help democratize the access and
study of LLMs, since it can be run on a single GPU.
At the higher-end of the scale, our 65B-parameter
model is also competitive with the best large language models such as Chinchilla or PaLM-540B.
Unlike Chinchilla, PaLM, or GPT-3, we only
----------
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov Thomas Scialom
GenAI, Meta
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety
----------
as ChatGPT, BARD, and Claude. These closed product LLMs are heavily fine-tuned to align with human
preferences, which greatly enhances their usability and safety. This step can require significant costs in
compute and human annotation, and is often not transparent or easily reproducible, limiting progress within
the community to advance AI alignment research.
In this work, we develop and release Llama 2, a family of pretrained and fine-tuned LLMs, Llama 2 and
Llama 2-Chat, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested,
Llama 2-Chat models generally perform better than existing open-source models. They also appear to
be on par with some of the closed-source models, at least on the human evaluations we performed (see
----------
[56]
(41584,)
[0.47529661 0.47952211 0.48869599]
(3,)
the training data [13], aiding in disinformation campaigns [12], generating extremist texts [37], spreading
falsehoods [35], and more [9, 10, 18, 57, 22, 51]. As AI systems improve, the scope of possible harms seems
likely to grow [22]. Many strategies have been developed to address some of these harms (e.g., [58, 4, 48,
36, 34, 19, 60]). One potentially useful tool for addressing harm is red teaming—using manual or automated
methods to adversarially probe a language model for harmful outputs, and then updating the model to avoid
such outputs [42, 20, 3, 11]. In this paper, we describe our early efforts to implement manual red teaming to
both make models safer and measure the safety of our models. The models trained with red team data were
described in [4], so here we focus on describing our red team results and techniques in detail in the hope that
others may benefit from and improve on them.
Correspondence to: {deep, liane, jackson, jared, jack}@anthropic.com
Authors above the line break are core contributors. Author contributions are listed in §A.1.arXiv:2209.07858v2 [cs.CL] 22 Nov 2022
2.7B 13B 52B
----------
including limitations and risks that might be exploited by malicious actors. Further, existing
red teaming approaches are insufficient for addressing these concerns in the AI context.
In order for AI developers to make verifiable claims about their AI systems being safe or secure, they need
processes for surfacing and addressing potential safety and security risks. Practices such as red teaming
exercises help organizations to discover their own limitations and vulnerabilities as well as those of the
AI systems they develop, and to approach them holistically, in a way that takes into account the larger
environment in which they are operating.23
A red team exercise is a structured effort to find flaws and vulnerabilities in a plan, organization, or
technical system, often performed by dedicated "red teams" that seek to adopt an attacker’s mindset
and methods. In domains such as computer security, red teams are routinely tasked with emulating
attackers in order to find flaws and vulnerabilities in organizations and their systems. Discoveries made
by red teams allow organizations to improve security and system integrity before and during deployment.
Knowledge that a lab has a red team can potentially improve the trustworthiness of an organization with
----------
Red teaming ChatGPT via Jailbreaking:
Bias, Robustness, Reliability and Toxicity
Terry Yue Zhuo1,2§, Yujin Huang2, Chunyang Chen2, Zhenchang Xing1,3
1CSIRO’s Data61
2Monash University
3Australian National University
Warning: this paper may contain content that is offensive or upsetting.
Abstract—Recent breakthroughs in natural language processing (NLP) have permitted the synthesis and comprehension
of coherent text in an open-ended way, therefore translating
the theoretical algorithms into practical applications. The large
language models (LLMs) have significantly impacted businesses
such as report summarization software and copywriters. Observations indicate, however, that LLMs may exhibit social
prejudice and toxicity, posing ethical and societal dangers
of consequences resulting from irresponsibility. Large-scale
benchmarks for accountable LLMs should consequently be
developed. Although several empirical investigations reveal
the existence of a few ethical difficulties in advanced LLMs,
there is little systematic examination and user study of the
risks and harmful behaviors of current LLM usage. To further
educate future efforts on constructing ethical LLMs responsibly, we perform a qualitative research method called “red
teaming” on OpenAI’s ChatGPT1 to better understand the
----------
[57]
(41584,)
[0.49388744 0.5080906 0.51699355]
(3,)
for fitting LLMs is an enormous training dataset, e.g., the Pile [15], which contains documents from Arxiv, PubMed, Stack Exchange, Wikipedia, as well as a subset of Common Crawl2, and GitHub, among others. For these kinds of LLMs, [16] introduced the terminology of foundation models, which defines training on a very large data basis and the ability to adapt to a variety of downstream tasks.
2.2 ChatGPT
ChatGPT is an LLM developed by OpenAI that was first released on November 30th, 2022. The user can directly prompt the model via an API in a conversational way, e.g., allowing for follow-up questions or admission of mistakes [1]. The backbone of ChatGPT is based on the generative pretrained transformer series (GPT; [17, 18, 19]). Despite the success and capacity of the third GPT iteration (GPT-3) [19] with 175B parameters, the challenge of engineering text prompts for achieving the desired generative output remained. This is due to the autoregressive training procedure, which tasks the model to predict a token based on the previous text and thus is optimized for text completion and not dialogues. To improve the dialogue capabilities of the model as well as to reduce bias and
----------
governments, would never do harms with LLMs. Without access to LLMs, we cannot even realize the potential role of LLMs in harms. Thus, an open LLM can provide access and transparency to all researchers, and facilitate the research developments of reducing the potential harms of LLMs, such as algorithms to identify the synthetic text Gehrmann et al. (2019). In addition, it is known that LLMs can suffer from problems in fairness, bias, privacy, and truthfulness Zhang et al. (2021); Lin et al. (2022); Liang et al. (2021); Bender et al. (2021). Thus, instead of providing APIs to black-box models, an open LLM can help reveal the model parameters and internal states corresponding to specific inputs.
In conclusion, an open LLM empowers us to conduct studies on LLMs’ flaws in depth and to improve LLMs in terms of ethical concerns.
Ethical Evaluation and Improvements. We evaluate GLM-130B on a wide range of ethical evaluation benchmarks, including bias measurement (Nadeem et al., 2021; Nangia et al., 2020), hate speech
----------
and as our pilot experiments have demonstrated the effectiveness of the relevance judgments generated by LLMs, we believe it deserves further exploration.
(2) Instruction-tuning LLMs for a universal information access system. Instruction-tuning LLMs for diverse ranking tasks, such as passage ranking, entity ranking, response ranking, evidence ranking and etc., has great potential toward a more powerful, universal information access system.
(3) End-to-end IR model. Existing multi-stage IR systems always follow a “index-retrieve-rank” pipeline, and the separation of different components makes it hard for end-to-end learning. Considering the remarkable performance of LLMs, it’s possible to use only one LLM covering every component in the IR system, such as retrieval and ranking.
(4) Improving the efficiency of LLMs. Though effective, current LLMs generally have hundreds of billions of parameters, and deploying them to real industrial scenarios is prohibitively expensive. Thus, improving the efficiency of LLMs, such as reducing to small models, and lightweight learning, is very worthy of further exploration.
References
Negar Arabzadeh, Alexandra Vtyurina, Xinyi Yan, and Charles L. A. Clarke. 2021. Shallow pooling for
----------
[58]
(41584,)
[0.63758657 0.63869209 0.64677286]
(3,)
to GPT-3 corresponds to the Stanford Alpaca model. From Figure 3(a), we observe that (i) For the “Helpfulness” criterion, GPT-4 is the clear winner with 54.12% of the votes. GPT-3 only wins 19.74% of the time. (ii) For the “Honesty” and “Harmlessness” criteria, the largest portion of votes goes to the tie category, which is substantially higher than the winning categories but GPT-3 (Alpaca) is slightly superior. Second, we compare GPT-4-instruction-tuned LLaMA models against the teacher model GPT-4 in Figure 3(b). The observations are quite consistent over the three criteria: GPT-4-instruction-tuned LLaMA performs similarly to the original GPT-4. We conclude that learning from GPT-4 generated
5 60% 70% 80% 90% 100%12345BRanking Group 94% 624 : 66792% 614 : 67091% 623 : 68289% 597 : 66989% 605 : 67891% 609 : 666
----------
of the reward model. We compare all the chatbots in Figure 4(c,d). Instruction tuning of LLaMA with GPT-4 often achieves higher performance than tuning with text-davinci-003 (i.e., Alpaca) and no tuning (i.e., LLaMA): The 7B LLaMA GPT4 outperforms the 13B Alpaca and LLaMA. However, there is still a gap compared with large commercial chatbots such as GPT-4. We further study the performance of all the chatbots in Chinese in Figure 5. We first translate English responses of chatbots into Chinese using GPT-4. We also translate English questions into Chinese to obtain answers with GPT-4. The comparisons against translated and generated Chinese responses from GPT-4 are shown in Figure 5 (a) and (b), respectively. There are two interesting observations: (i) we find that the relative score metric of GPT-4 evaluation (Vicuna, 2023) is quite consistent, both in terms of different opponent models (i.e., ChatGPT or GPT-4) and languages (i.e., English or Chinese).
(ii) For GPT-4 results alone, the translated responses show superior performance over the generated
----------
-0.043 -0.009+0.0132-0.004 +0.0562 +0.0387-0.012 -0.076
Alpaca: 0.39 LLaMA-GPT4: 0.34 GPT4: 0.37
Figure 6: ROUGE-L on unnatural instructions evaluated with 9K samples. The instructions are grouped into four subsets based on the ground-truth response length. The mean values are reported in the legend. The difference with GPT-4 is reported on the bar per group. LLaMA-GPT4 is a closer proxy to GPT-4 than Alpaca.
closely follow the behavior of GPT-4. When the sequence length is short, both LLaMA-GPT4 and GPT-4 can generate responses that contains the simple ground truth answers, but add extra words to make the response more chat-like, which probably leads to lower ROUGE-L scores.
5 RELATED WORK
Instruction Tuning. Instruction tuning of LLMs is an increasingly popular research direction in NLP (Zhong et al., 2021; Ouyang et al., 2022; Wei et al., 2021). Existing works aim to improve the quality and scale of three factors in the development pipeline, including instruction-following
----------