Mistral

Overview

Mistral was introduced in this blog post by Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.

The introduction of the blog post says:

Mistral AI team is proud to release Mistral 7B, the most powerful language model for its size to date.

Mistral-7B is the first large language model (LLM) released by Mistral AI.

Architectural details

Mistral-7B is a decoder-only Transformer with the following architectural choices:

  • Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
  • GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
  • Byte-fallback BPE tokenizer - ensures that characters are never mapped to out-of-vocabulary tokens.

For more details refer to the release blog post.
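These choices are also visible in the model configuration exposed by Transformers. As a rough sketch, one can inspect the relevant fields of a default MistralConfig; the printed values are the library's defaults, which are assumed here to mirror the Mistral-7B release:

>>> from transformers import MistralConfig

>>> # Default Mistral configuration; check a checkpoint's config.json for the exact values.
>>> config = MistralConfig()
>>> config.sliding_window       # window size used by sliding window attention
4096
>>> config.num_attention_heads  # number of query heads
32
>>> config.num_key_value_heads  # fewer key/value heads than query heads -> grouped query attention (GQA)
8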

License

Mistral-7B is released under the Apache 2.0 license.

Usage tips

The Mistral team has released 3 checkpoints:

  • a base model, Mistral-7B-v0.1, which has been pre-trained to predict the next token on internet-scale data.
  • an instruction tuned model, Mistral-7B-Instruct-v0.1, which is the base model optimized for chat purposes using supervised fine-tuning (SFT) and direct preference optimization (DPO).
  • an improved instruction tuned model, Mistral-7B-Instruct-v0.2, which improves upon v1.

The base model can be used as follows:

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

>>> prompt = "My favourite condiment is"

>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
"My favourite condiment is to ..."

The instruction tuned model can be used as follows:

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

>>> messages = [
...     {"role": "user", "content": "What is your favourite condiment?"},
...     {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
...     {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]

>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
"Mayonnaise can be made as follows: (...)"

As can be seen, the instruction-tuned model requires a chat template to be applied to make sure the inputs are prepared in the right format.
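For illustration, the chat template can also be applied without tokenization to inspect the exact string the model will see. This is a minimal sketch that reuses the messages list and tokenizer from the snippet above; the [INST]-style tags come from the checkpoint's own chat template:

>>> # Render the conversation as text instead of token ids to see the formatting.
>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> print(text)  # e.g. "<s>[INST] What is your favourite condiment? [/INST] Well, I'm quite partial to ..."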

Speeding up Mistral by using Flash Attention

The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging Flash Attention, which is a faster implementation of the attention mechanism used inside the model.

First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.

pip install -U flash-attn --no-build-isolation

Also make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the flash-attn repository. Also make sure to load your model in half precision (e.g. torch.float16).

To load and run a model using Flash Attention-2, refer to the snippet below:

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

>>> prompt = "My favourite condiment is"

>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
"My favourite condiment is to (...)"

Expected speedups

Below is an expected speedup diagram that compares pure inference time between the native implementation in Transformers using the mistralai/Mistral-7B-v0.1 checkpoint and the Flash Attention 2 version of the model.
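A comparable measurement can be sketched as follows, assuming a CUDA GPU; the prompt, token budget, and the time_generation helper are illustrative choices, not part of the official benchmark:

>>> import time
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> checkpoint = "mistralai/Mistral-7B-v0.1"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(["My favourite condiment is"], return_tensors="pt").to("cuda")

>>> def time_generation(attn_implementation):
...     # Reloads the model for each attention backend and times a fixed-length greedy generation.
...     model = AutoModelForCausalLM.from_pretrained(
...         checkpoint,
...         torch_dtype=torch.float16,
...         attn_implementation=attn_implementation,
...         device_map="auto",
...     )
...     torch.cuda.synchronize()
...     start = time.perf_counter()
...     model.generate(**inputs, max_new_tokens=128, do_sample=False)
...     torch.cuda.synchronize()
...     return time.perf_counter() - start

>>> time_generation("eager")               # native attention implementation
>>> time_generation("flash_attention_2")   # Flash Attention 2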

Sliding window Attention

The current implementation supports the sliding window attention mechanism and memory-efficient cache management. To enable sliding window attention, just make sure to have a flash-attn version that is compatible with sliding window attention (>= 2.3.0).

The Flash Attention 2 model also uses a more memory-efficient cache slicing mechanism. As recommended by the official implementation of the Mistral model, which uses a rolling cache mechanism, we keep the cache size fixed (self.config.sliding_window), support batched generation only for padding_side="left", and use the absolute position of the current token to compute the positional embedding.
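A minimal sketch of batched generation consistent with this note is shown below; the left padding and pad-token assignment are assumptions for the base checkpoint, which ships without a pad token:

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
>>> tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
>>> model.config.sliding_window  # size of the fixed attention window / rolling cache, typically 4096

>>> prompts = ["My favourite condiment is", "The capital of France is"]
>>> inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
>>> generated_ids = model.generate(**inputs, max_new_tokens=32)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)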

Shrinking down Mistral using quantization

As the Mistral model has 7 billion parameters, it requires about 14GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using quantization: at 4 bits (half a byte per parameter), only about 3.5GB of RAM is needed for the weights.
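As a minimal sketch, assuming the bitsandbytes library is installed, the model can be loaded in 4-bit precision through a BitsAndBytesConfig:

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

>>> # 4-bit NF4 quantization with float16 compute; requires `pip install bitsandbytes`.
>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_quant_type="nf4",
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> model = AutoModelForCausalLM.from_pretrained(
...     "mistralai/Mistral-7B-v0.1", quantization_config=quantization_config, device_map="auto"
... )
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

>>> inputs = tokenizer(["My favourite condiment is"], return_tensors="pt").to("cuda")
>>> generated_ids = model.generate(**inputs, max_new_tokens=50)
>>> tokenizer.batch_decode(generated_ids)[0]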
