OpenAI's chatGPT has awakened a collective awareness of what Large Language Models (LLMs) are capable of. With that awakening comes a daily march of LLM news: new products, new features, new models, new applications, (and new worries). It seems we're in the early stages of a Cambrian explosion of LLMs and LLM-powered tools; it's not yet clear how LLMs will impact and influence our professional and personal lives, but it seems clear that they will, in some way.

Since LLMs are here to stay, it's worthwhile to take some time to understand how these models work from a first-principles perspective. Starting with the mechanics can help foster durable intuitions that will inform our usage of these models now and in the future. (Especially if the future is one where LLMs are a staple of the data scientist's toolbox, as common as an lm() function call).

And what better way is there to learn than by doing. So with that preamble, in this post we'll walk through an implementation of an LLM, LLaMA (Touvron et al. 2023) specifically, in TensorFlow and Keras, with the goal being to develop understanding first, capability second.
Why LLaMA? With the sheer volume of LLM-related content and news out there, it can seem daunting to know where to get started. Almost weekly it seems there is a new model announced. Browsing some hubs of LLM activity (HuggingFace, TFHub, reddit, HackerNews) muddies the waters even more. How to pick a specific model?

Of the many LLM-related news items in the past months, one that stands head-and-shoulders above the crowd is the release of LLaMA, a modern, foundational LLM made available to the public by Meta AI in February 2023. On common benchmarks, LLaMA outperforms OpenAI's GPT-3, while being substantially smaller (though still large).
LLaMA is a great starting place because it is a simple and modern architecture, has excellent performance on benchmarks, and is open. The model architecture has had just a few new ideas incorporated into it since the original Transformer architecture first described in "Attention Is All You Need" published from Google (Vaswani et al. 2017). Four different sizes of LLaMA have been released: 7 billion and 13 billion parameter models trained on 1 trillion tokens, and 33 billion and 65 billion parameter models trained on 1.4 trillion tokens. This is an enormous amount of training data these models have seen: the largest 65B model has been trained on approximately the "Chinchilla compute-optimum" (Hoffmann et al. 2022) number of tokens, while the smaller LLaMAs are substantially beyond that optimum. In this blog post we'll focus on the smallest, 7B parameter LLaMA model, which you can comfortably load locally and run on CPU with only 64Gb of RAM.
While not strictly necessary, to follow along locally you'll probably want to acquire the pre-trained LLaMA weights one way or another. Note, the weights do come with their own license, which you can preview here.

So, without further ado, let's get started.
Setup
First, we'll need to install the required R and Python packages, and configure a virtual environment:
remotes::install_github(c("rstudio/reticulate",
                          "rstudio/tensorflow",
                          "rstudio/keras"))

reticulate::virtualenv_create("./.venv", version = "3.10")
tensorflow::install_tensorflow(envname = "./.venv", version = "release")
With that out of the way, let's load some packages and prepare our R session:
library(purrr)
library(envir)
library(tensorflow)
library(tfautograph)
library(keras)

use_virtualenv("./.venv")
options(tensorflow.extract.warn_tensors_passed_asis = FALSE)

attach_eval({
  import_from(glue, glue)
  import_from(jsonlite, read_json)
  import_from(withr, with_dir, with_options)
  import_from(keras$layers, Dense)
  np <- reticulate::import("numpy", convert = FALSE)

  seq_len0 <- function(x) seq.int(from = 0L, length.out = x)
})
If you've acquired the pre-trained weights, it'll be convenient to convert them from the torch checkpoint format to something that's more framework agnostic (you only need to do this once, of course):
# reticulate::py_install("torch", pip = TRUE)
torch <- reticulate::import("torch", convert = FALSE)
with_dir("~/github/facebookresearch/llama/weights/LLaMA/7B", {
  pretrained_weights <- torch$load("consolidated.00.pth",
                                   map_location = "cpu")
  for (name in names(pretrained_weights)) {
    filename <- sprintf("%s.npy", name)
    array <- pretrained_weights[[name]]$numpy()
    np$save(filename, array)
    message(glue(
      "wrote: '{basename(filename)}' with shape: {array$shape}"))
  }
})
We'll also define a helper function so we can avoid having to retype the full path to our weights:
weights_path <- function(filename) normalizePath(file.path(
  "~/github/facebookresearch/llama/weights/LLaMA/",
  glue(filename, .envir = parent.frame())), mustWork = TRUE)
And load the model configuration parameters specific to the 7B LLaMA, which we'll use to build the model.
params <- read_json(weights_path("7B/params.json"))
str(params)
List of 6
 $ dim        : int 4096
 $ multiple_of: int 256
 $ n_heads    : int 32
 $ n_layers   : int 32
 $ norm_eps   : num 1e-06
 $ vocab_size : int -1
Tokenizer
The first component to LLaMA is the tokenizer, which converts text to a sequence of integers. The LLaMA model uses the SentencePiece tokenizer from Google. SentencePiece is available as a TensorFlow graph operation through tf_text.SentencepieceTokenizer, and also as a Keras layer in keras_nlp.tokenizers.SentencepieceTokenizer. By choice of a coin flip, we'll use the lower-level tf_text interface.
tf_text <- reticulate::import("tensorflow_text")
tokenizer_path <- weights_path("tokenizer.model")
tokenizer <- tf_text$SentencepieceTokenizer(
  tf$io$gfile$GFile(tokenizer_path, "rb")$read(),
  add_bos = TRUE, add_eos = FALSE,
)
Let's try it out with a prompt:

prompt <- "The best way to attract bees"
tokenizer$tokenize(prompt)

tf.Tensor([    1   450  1900   982   304 13978   367   267], shape=(8), dtype=int32)

prompt |> tokenizer$tokenize() |> tokenizer$detokenize()

tf.Tensor(b'The best way to attract bees', shape=(), dtype=string)
Let's define a show_tokens() helper function and play with the tokenizer a little.

show_tokens <- function(what) {
  if (is.character(what))
    token_ids <- what |> tokenizer$tokenize() |> as.integer()
  else
    token_ids <- as.integer(what)

  tokens <- token_ids |>
    map_chr(\(id) tokenizer$id_to_string(id) |> as.character())

  names(tokens) <- token_ids
  tokens
}

show_tokens(prompt)

      1     450    1900     982     304    13978     367     267
     ""   "The"  "best"   "way"    "to" "attract"   "be"    "es"
Note that "bees" is two tokens. Not every token corresponds to a word. For example, one non-word token we can reliably expect to show up in a tokenizer trained on a corpus of English text is "ing." However, when the "ing" token shows up will not always follow your intuitions, because common words get their own token id, even if they can be decomposed into multiple tokens.

show_tokens("ing")

    1  2348
   "" "ing"

show_tokens("working")

    1      1985
   "" "working"

show_tokens("flexing")

    1   8525   292
   "" "flex" "ing"

show_tokens("wonking")

    1   2113   9292
   ""  "won" "king"
Another thing to note about the tokenizer is that each token sequence starts with token id 1. This is a special beginning-of-sequence token that we requested be added when we loaded the tokenizer with add_bos = TRUE. There are two other such special tokens that we will encounter later: an end-of-sequence special token with id 2, and an unknown-token with id 0.
as.character(tokenizer$id_to_string(0L))
[1] "<unk>"
as.character(tokenizer$id_to_string(1L))
[1] "<s>"
as.character(tokenizer$id_to_string(2L))
[1] "</s>"
    1     0     2
   ""  " ⁇ "   ""
In total, there are 32,000 tokens.
as.integer(tokenizer$vocab_size())
[1] 32000
One last observation is that the more frequently encountered tokens are assigned lower ids.
show_tokens(seq(50, len = 10))
50 51 52 53 54 55 56 57 58 59
"/" "0" "1" "2" "3" "4" "5" "6" "7" "8"
show_tokens(seq(100, len = 10))
100 101 102 103 104 105 106 107 108 109
"a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
show_tokens(seq(1000, len = 10))
1000 1001 1002 1003 1004 1005 1006 1007 1008 1009
"ied" "ER" "stat" "fig" "me" "von" "inter" "roid" "ater" "their"
show_tokens(seq(10000, len = 10))
10000 10001 10002 10003 10004 10005 10006 10007
"Ã¥ng" "citep" "Unwell" "rank" "sender" "beim" "Ñак" "compat"
10008 10009
"happens" "diese"
show_tokens(seq(20000, len = 10))
20000 20001 20002 20003 20004 20005 20006 20007
"admit" "Remark" "ÑÑÑ" "Vien" "ÑÑ" "permut" "cgi" "crÃt"
20008 20009
"Console" "ctic"
show_tokens(seq(to = as.integer(tokenizer$vocab_size()) - 1, len = 10))
31990 31991 31992 31993 31994 31995 31996 31997 31998 31999
"á½" "ã" "ã¹" "è¾¹" "è¿" "é»" "ì" "æ¶" "å¼" "ç»"
Moving on, the next step after tokenization is embedding. An embedding layer is effectively a dictionary lookup that converts an integer (token id) to a 1-d float array. For this we can use the standard keras Embedding layer.
tok_embeddings <- keras$layers$Embedding(
  input_dim = tokenizer$vocab_size(),
  output_dim = params$dim,
  embeddings_initializer =
    \(...) np$load(weights_path("7B/tok_embeddings.weight.npy"))
)
tok_embeddings(3L) |> str()
<tf.Tensor: shape=(4096), dtype=float32, numpy=…>
prompt |> # "The best way to attract bees"
  tokenizer$tokenize() |>
  tok_embeddings() |>
  str()

<tf.Tensor: shape=(8, 4096), dtype=float32, numpy=…>
TransformerBlock
Once it's tokenized and embedded, the input then passes through the bulk of the model, a sequence of repeating TransformerBlock layers. The 7B model has 32 of these TransformerBlock layers, while the 65B model has 80 of them.
weights_path("7B/params.json") |> read_json() |> _$n_layers
[1] 32
weights_path("65B/params.json") |> read_json() |> _$n_layers
[1] 80
Here is what the transformer block looks like:
TransformerBlock(keras$layers$Layer) %py_class% {
  initialize <- function(attn_head_size, attn_n_heads,
                         norm_eps = k_epsilon(), ...,
                         block_id = NULL) {
    super$initialize(...)

    self$attention <- Attention(attn_head_size, attn_n_heads,
                                block_id = block_id)

    self$feed_forward <- FeedForward(
      hidden_dim = 4 * attn_head_size * attn_n_heads,
      block_id = block_id)

    self$attention_norm <- RMSNorm(eps = norm_eps,
                                   block_id = block_id,
                                   feeds_into = "attention")
    self$feed_forward_norm <- RMSNorm(eps = norm_eps,
                                      block_id = block_id,
                                      feeds_into = "ffn")
  }

  call <- function(x) {
    # normalize the input, pass it through attention, then add the residual
    x2 <- x |>
      self$attention_norm() |>
      self$attention()
    x <- x + x2 # add residual

    # normalize again, pass through the feed-forward layers, add the residual
    x2 <- x |>
      self$feed_forward_norm() |>
      self$feed_forward()
    x <- x + x2

    x
  }
}
While there isn't a lot of code, there are a lot of ideas packed in there. This block forms the main trunk of the model, so it's worth taking the time to go through it slowly.

We implement the TransformerBlock as a subclassed keras.layers.Layer. This gives us some niceties like the ability to compose with other Keras layers, but these are mostly irrelevant to the purpose of this blog post; we could just as easily implement this as, for example, a vanilla R6 class. Our TransformerBlock class has two methods: initialize, called when we first create the block, and call, called when we run the forward pass of the block.

In initialize, we create four layers: an Attention layer, a FeedForward layer, and two RMSNorm layers. We'll take a close look at each of these soon, but even before we do so, we can see how they fit together by looking at the TransformerBlock$call() method.
The call method has a few simple ideas. In no particular order, the first one to observe is the composition pattern of adding residuals.

x2 <- x |> ...
x  <- x + x2 # add residual x to x2

This is a common pattern that helps with model training, and especially to help with the vanishing gradient problem. It's a skip-connection in the otherwise linear sequence of matrix transformations. It reinjects information (during the forward pass), and gradients (during back propagation), back into the trunk. You can think of these residual connections as freeing the learnable layers in-between (the ... in the pseudo code) from the burden of having to "pass-through" or "preserve" information in x, allowing the weights to instead focus on learning transformations that are, (in corporatese vernacular), value-adding.
The next composition pattern to note is the repeating usage of a normalization layer:

x2 <- x |> norm() |> ...
x  <- x + x2

There are many kinds of normalization layers, but to slightly over-generalize, they can all be thought of as a stabilizer that helps with training. Like their deep-learning cousins the regularizers, their main function is to keep values passing through in a sensible range, in the ball park of (-1, 1), typically. We'll take a closer look at RMSNorm soon.
Stripped of the two tricks that are mostly there to help the model train, residuals and normalization, the core of the TransformerBlock is just this:

x |> attention() |> feed_forward()

In a moment we'll see that feed_forward is a slightly fancier variation of a conventional sequence of Dense layers. Before we get there we can safely skip ahead to distill the following intuition: a TransformerBlock is basically an Attention layer followed by a few (fancy) dense layers, with some simple composition patterns (tricks) that help with training. Attention is the heart of the model: it's the most interesting, and also the most involved.
With the framing in place, let's go through and take a closer look at RMSNorm, FeedForward, and then with the foundation in place, we'll turn our attention to Attention.
RMSNorm
RMSNorm(keras$layers$Layer) %py_class% {
  initialize <-
    function(eps = 1e-6, ..., block_id = NULL, feeds_into = NULL) {
      super$initialize(...)
      self$eps <- eps
      self$block_id <- block_id
      self$feeds_into <- feeds_into
    }

  build <- function(input_shape) {
    # input_shape == (batch_size, seqlen, params$dim)
    # self$w will broadcast over batch_size and seqlen dims.
    # w_shape == (1, 1, params$dim)
    w_shape <- rep(1L, length(input_shape))
    w_shape[length(input_shape)] <- as.integer(input_shape) |> tail(1L)

    # define a local initializer that will load
    # the pretrained weights if we supplied `block_id` and `feeds_into`
    import_from({self}, block_id, feeds_into)
    initializer <-
      if (is.null(block_id))
        "ones"
      else if (block_id >= 0) {
        \(...) weights_path("7B/layers.{block_id}.{feeds_into}_norm.weight.npy") |>
          np$load() |> np$expand_dims(0:1)
      } else if (block_id == -1)
        # load weights for the final output normalization layer, which is not
        # part of a TransformerBlock
        \(...) weights_path("7B/norm.weight.npy") |>
          np$load() |> np$expand_dims(0:1)

    self$w <- self$add_weight(shape = w_shape,
                              initializer = initializer,
                              trainable = TRUE)
  }

  rrms <- function(x) {
    # reciprocal root mean square along the last axis
    x %>%                                              # (batch_size, seqlen, n_features)
      tf$math$square() %>%
      tf$reduce_mean(axis = -1L, keepdims = TRUE) %>%  # (batch_size, seqlen, 1)
      tf$math$add(self$eps) %>%                        # for numerical stability
      tf$math$rsqrt()
  }

  call <- function(x) {
    x * self$rrms(x) * self$w
  }
}
RMSNorm() has a single trainable tensor w. In the forward pass, each value in the input is multiplied by the reciprocal-root-mean-square of all the values in the feature axis and by w. Certainly a mouthful, but just a simple sequence of arithmetic transformations in the end, designed for the express purpose of adjusting the range of values passing through.
Let's kick the tires on it:

norm <- RMSNorm()
m <- matrix(c(0, 1,
              2, 3), nrow = 2)
norm(m)

tf.Tensor(
[[0.         1.4142132 ]
 [0.44721353 1.3416406 ]], shape=(2, 2), dtype=float32)

norm(m * 10)

tf.Tensor(
[[0.         1.4142137 ]
 [0.44721362 1.3416408 ]], shape=(2, 2), dtype=float32)

norm(m * 100)

tf.Tensor(
[[0.         1.4142137]
 [0.4472136  1.3416408]], shape=(2, 2), dtype=float32)
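Those numbers can be reproduced with a quick base-R cross-check of the same arithmetic (a sketch that assumes w is still at its "ones" initialization, which is the case here since we didn't supply a block_id):

# RMSNorm by hand: divide each row (the feature axis) by its root-mean-square
m <- matrix(c(0, 1,
              2, 3), nrow = 2)
m / sqrt(rowMeans(m^2) + 1e-6)
#           [,1]     [,2]
# [1,] 0.0000000 1.414213
# [2,] 0.4472134 1.341640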
FeedForward
Next up is FeedForward().
FeedForward(keras$layers$Layer) %py_class% {

  initialize <- function(hidden_dim, multiple_of = 256L,
                         ..., block_id = NULL) {
    super$initialize()

    if (!is.null(multiple_of)) {
      hidden_dim <- hidden_dim %>%
        { as.integer( . * (2/3)) } %>%
        { (. + multiple_of - 1) %/% multiple_of } %>%
        { . * multiple_of }
    }

    self$hidden_dim <- hidden_dim
    self$block_id <- block_id
  }

  build <- function(input_shape) {
    output_dim <- input_shape |> as.integer() |> tail(1)

    if (is.null(self$block_id))
      load_weight <- \(...) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{self$block_id}.feed_forward.{name}.weight.npy"))$`T`

    self$w1 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w1"))
    self$w2 <- Dense(output_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w2"))
    self$w3 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w3"))

    super$build(input_shape)
  }

  call <- function(x) {
    import_from({self}, w1, w2, w3)
    import_from(tf$nn, silu)

    x %>%
      { silu(w1(.)) * w3(.) } %>% # SwiGLU
      w2()
  }
}
FeedForward consists of three Dense layers. initialize does some simple math, munging on the input value hidden_dim to ensure the size is a performant multiple of 256, and build is mostly boilerplate for creating the layers and loading the weights.
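To make that munging concrete, here is the same arithmetic from initialize applied to the 7B configuration loaded earlier (just a sketch tracing the code above, nothing new):

hidden_dim  <- 4L * params$dim    # 16384, as requested by TransformerBlock
multiple_of <- params$multiple_of # 256
hidden_dim  <- as.integer(hidden_dim * (2/3))                                 # 10922
hidden_dim  <- ((hidden_dim + multiple_of - 1) %/% multiple_of) * multiple_of
hidden_dim                        # 11008, the hidden width the 7B feed-forward ends up with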
The novelty of FeedForward() is in the call() method, where rather than composing the Dense layers in a conventional sequential model with, say, ReLU activations in between and maybe some dropout, the layers are composed to form a "SwiGLU" unit. The publication by Shazeer (2020) of SwiGLU and other variations on GLU is an exemplar of the kinds of explorations and improvements around the Transformer architecture since its initial publication in 2017; a steady accretion of enhancements that has brought us to today. The FeedForward$call() is just a single SwiGLU followed by a linear projection. In its essence, it's a clever composition of three (learned) linear projections, an element-wise multiplication, and a silu() activation function.
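If it helps to see the shapes, here is the same SwiGLU composition sketched in plain R, with small random matrices standing in for the learned projections (toy sizes and weights, purely illustrative):

silu <- function(x) x * (1 / (1 + exp(-x)))  # x * sigmoid(x)

x  <- matrix(rnorm(4 * 8),  nrow = 4)   # (seqlen = 4, dim = 8)
W1 <- matrix(rnorm(8 * 16), nrow = 8)   # up-projections to hidden_dim = 16
W3 <- matrix(rnorm(8 * 16), nrow = 8)
W2 <- matrix(rnorm(16 * 8), nrow = 16)  # back down to dim = 8

out <- (silu(x %*% W1) * (x %*% W3)) %*% W2  # SwiGLU, then linear projection
dim(out)  # 4 8, same shape as x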
Perhaps the most surprising observation to make here is the relative dearth of activation functions, or even non-linearities, not just in FeedForward, but overall. The silu() in this feedforward, the reciprocal-root-mean-square in RMSNorm(), and a softmax() in Attention() are the only non-linear transformations in the whole sequence of TransformerBlocks. Everything else is a linear transformation!
Attention
Finally, let's turn our attention to Attention().
Attention(keras$layers$Layer) %py_class% {
  initialize <- function(head_size, n_heads,
                         ..., block_id = NULL) {
    super$initialize(...)

    self$head_size <- head_size
    self$n_heads <- n_heads

    if (is.null(block_id))
      load_weight <- function(name) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{block_id}.attention.{name}.weight.npy"))$`T`

    Dense <- function(name) keras$layers$Dense(
      units = n_heads * head_size,
      use_bias = FALSE,
      kernel_initializer = load_weight(name)
    )

    self$wq <- Dense("wq")
    self$wk <- Dense("wk")
    self$wv <- Dense("wv")
    self$wo <- Dense("wo")
  }

  call <- function(x) {
    c(batch_size, seqlen, n_features) %<-% tf$unstack(tf$shape(x))

    # 1. project (linear transform) x into
    #    query, key, and value tensors
    # 2. reshape q k v, splitting out the last dim (n_features)
    #    into n_heads independent subspaces,
    #    each with size head_size.
    #    (n_features == head_size * n_heads)
    split_heads_shape <- c(batch_size, seqlen,
                           self$n_heads, self$head_size)
    q <- x |> self$wq() |> tf$reshape(split_heads_shape)
    k <- x |> self$wk() |> tf$reshape(split_heads_shape)
    v <- x |> self$wv() |> tf$reshape(split_heads_shape)

    # embed positional information in query and key
    # (bsz, seqlen, n_heads, head_size)
    q %<>% apply_rotary_embedding()
    k %<>% apply_rotary_embedding()

    # reshape:
    #   move heads out of the last 2 axes,
    #   so later matmuls are performed across the subspaces (heads)
    #   between (seqlen, head_size) axes
    v <- tf$transpose(v, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    q <- tf$transpose(q, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    k <- tf$transpose(k, c(0L, 2L, 3L, 1L)) # (bsz, n_heads, head_size, seqlen)

    # calculate and normalize attention scores
    scores <- q %*% k                       # (bsz, n_heads, seqlen, seqlen)
    scores <- scores / sqrt(self$head_size) # scale

    # apply causal mask, so the model can't "look ahead" during training
    mask <- make_mask(seqlen, dtype = scores$dtype)
    scores %<>% { . + mask }

    scores <- tf$nn$softmax(scores, axis = -1L)

    # adjust values tensor with attention scores
    # scores (bsz, n_heads, seqlen, seqlen)
    # v      (bsz, n_heads, seqlen, head_size)
    output <- scores %*% v # (bsz, n_heads, seqlen, head_size)

    # combine heads back into a single features dim,
    # so Attention output_shape == input_shape
    output <- output |>
      tf$transpose(c(0L, 2L, 1L, 3L)) |> # (bsz, seqlen, n_heads, head_size)
      tf$reshape(tf$shape(x))            # (bsz, seqlen, n_heads * head_size)

    # one more trainable linear projection for good luck
    output <- self$wo(output) # (bsz, seqlen, n_heads * head_size)

    output
  }
}
Attention in LLaMA is similar but not identical to the Attention described in the original Transformers paper (and available as a keras builtin under keras$layers$MultiHeadAttention()). The core novelty is the addition of the apply_rotary_embedding() function, which we'll describe shortly. The additional novelty is balanced by the simplicity from the fact that the layer is performing self-attention: we don't need to pass in different query, key, and value tensors (or reason about what that means), since the same input serves all three roles. Note that the conventional MultiHeadAttention() layer is covered quite thoroughly in the 2nd Edition of Deep Learning with R, including a full implementation of attention in base R.
To develop an understanding of the mechanics in a layer like this, it's helpful to temporarily unsee some of the minutia that can act as a fog obscuring the essence of the operation. In this instance, if we temporarily strip out the transpose()s and reshape()s (as clever and important as they are), this is what's left:

call <- function(x) {
  # project x into query, key, and value tensors
  q <- x |> self$wq()
  k <- x |> self$wk()
  v <- x |> self$wv()

  # rotate q,k to inject position information.
  # cross q,k to calculate an attention score for each token pair.
  scores <- rotate(q) %*% rotate(k)

  # normalize the scores, use them to weight the values,
  # and finish with one more linear projection
  (softmax(scores) %*% v) |> self$wo()
}
Returning to the transpose()s and reshape()s, you can observe that their purpose is to make it so that the attention calculations are performed across n_heads independent subspaces, rather than in a single larger space. The same reasoning drives this decision as that driving usage of depthwise-separable convolutions in image models. Empirically, for a fixed compute budget, factoring features into independent subspaces performs better than doing the same core operations in a single larger feature space. As with all things, there is a balance to strike between n_heads (the number of subspaces) and head_dim (the size of each subspace). The LLaMA authors have struck the balance like this at the various model sizes:
lapply(c("7B", "13B", "30B", "65B"), \(size) {
  p <- read_json(weights_path("{size}/params.json"))
  with(p, list(llama_size = size,
               n_heads = n_heads,
               head_dim = dim %/% n_heads))
}) |> dplyr::bind_rows()
# A tibble: 4 × 3
  llama_size n_heads head_dim
  <chr>        <int>    <int>
1 7B              32      128
2 13B             40      128
3 30B             52      128
4 65B             64      128
Next let's turn our attention to the causal attention mask.
make_mask <- function(seqlen, dtype = k_floatx()) {
  x <- tf$range(seqlen)
  mask <- tf$where(x[, tf$newaxis] < x[tf$newaxis, ],
                   tf$constant(-Inf, dtype = dtype),
                   tf$constant(0, dtype = dtype))

  # broadcast over batch and heads dim
  mask[tf$newaxis, tf$newaxis, , ] # (1, 1, seqlen, seqlen)
}
The mask is a strictly upper triangular matrix filled with -Inf values. Adding the mask to the attention scores prevents the model from being able to "look ahead" and see the attention score for a token pairing it hasn't seen yet at a particular position in the sequence. This need for a mask is best thought of as a vestige from training, an apparatus that the model needed to learn with and now it can't function without. During training, gradients are calculated for predictions from all token positions in a sequence, including predictions of tokens where the correct answer is right there, as the very next token in the same sequence. The mask prevents the model from being able to cheat and look ahead into the future, something it won't be able to do once we're running it for inference.
make_mask(seqlen = 5L)

tf.Tensor(
[[[[  0. -inf -inf -inf -inf]
   [  0.   0. -inf -inf -inf]
   [  0.   0.   0. -inf -inf]
   [  0.   0.   0.   0. -inf]
   [  0.   0.   0.   0.   0.]]]], shape=(1, 1, 5, 5), dtype=float32)
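A tiny numeric illustration (toy scores, not from the model) of why adding -Inf does the job: after softmax(), the masked positions receive exactly zero attention weight, so a token can only attend to itself and earlier tokens.

scores_row <- c(2, 1, 0.5)                  # raw scores of token 1 against tokens 1..3
masked     <- scores_row + c(0, -Inf, -Inf) # token 1 may only see itself
exp(masked) / sum(exp(masked))              # 1 0 0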
Rotary Position Embedding
Next let's turn our attention to apply_rotary_embedding(). This core innovation was published by Su et al. (2022) in the paper titled "RoFormer: Enhanced Transformer with Rotary Position Embedding".
Some context:

- The bare Attention() mechanism doesn't leave any possibility for a token's position in a sequence to affect the attention scores, since only token-pairs are scored. Attention treats its input like a bag-of-tokens.

- The position of a token in a sequence is clearly important, and the attention layer should have access to that information.

- The absolute position of a token in a sequence is less important than the relative position between tokens. (Especially so for long sequences.)
Which leads us into the complex plane. If we imagine the features as complex numbers, we can rotate them, and we can calculate angles between them. From the RoFormer paper:

Specifically, incorporating the relative position embedding is straightforward: simply rotate the affine-transformed word embedding vector by amount of angle multiples of its position index and thus interprets the intuition behind Rotary Position Embedding
Expanding slightly: the rotation matrix is designed so that subsequently, after rotating our q and k token sequence embeddings the same way, the angle between token features is a function of the relative distance between those tokens in the token sequence. The relative angle between two tokens is invariant to the absolute position of those tokens in the full sequence.

In short, the rotation injects positional information. The meaning or interpretability of that positional information, or how it is meant to be used, or even extracted from the result of q %*% k, is left to the model to learn.
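A minimal sketch of the key property in plain R (toy numbers; theta here is a single made-up frequency, whereas compute_rotation_matrix() below uses a whole bank of them): after rotating q and k by angles proportional to their positions, the angle between them depends only on the position difference.

theta  <- 0.1                                # one made-up rotation frequency
rotate <- function(z, pos) z * exp(1i * theta * pos)

q <- 1 + 0i                                  # identical toy features,
k <- 1 + 0i                                  # placed at different positions
Arg(rotate(q, 3)  * Conj(rotate(k, 7)))      # -0.4  = theta * (3 - 7)
Arg(rotate(q, 13) * Conj(rotate(k, 17)))     # -0.4 again: only the gap matters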
This is the code:
apply_rotary_embedding <- function(x) {
  c(., seqlen, ., head_size) %<-%
    tf$unstack(tf$shape(x))

  rotation_matrix <- compute_rotation_matrix(seqlen, head_size)

  x %>%
    view_as_complex() %>%
    { . * rotation_matrix } %>%
    view_as_real()
}

compute_rotation_matrix <-
  function(seqlen, feature_dim, theta = 10000) {
    # `feature_dim` here is going to be attention$head_size
    # `seqlen` is going to match the token sequence length.

    t <- tf$range(seqlen, dtype = tf$float32)
    freqs <- tf$range(start = 0, limit = 1, delta = 1 / (feature_dim %/% 2),
                      dtype = tf$float32)
    tf_assert(tf$size(freqs) == feature_dim %/% 2)
    freqs <- 1.0 / (theta ^ freqs)

    # outer product; (seqlen, head_size/2)
    freqs <- tf$einsum('a,b->ab', t, freqs)

    rot_mat <- tf$complex(tf$cos(freqs), tf$sin(freqs))

    # the positional embedding will be broadcast across batch and heads dim
    rot_mat[tf$newaxis, , tf$newaxis, ] #(1, seqlen, 1, headdim/2)
  }

view_as_complex <- function(x) {
  tf$complex(x[all_dims(), `::2`],
             x[all_dims(), `2::2`])
}

view_as_real <- function(x) {
  # xs = (..., f); xs2 = (..., f*2)
  xs <- tf$shape(x)
  xs2 <- tf$concat(list(xs[1:(length(xs)-1)],
                        xs[length(xs), drop = FALSE] * 2L),
                   axis = 0L)

  x2 <- tf$stack(list(Re(x), Im(x)), axis = -1L)

  # (..., f, 2) -> (..., f*2)
  tf$reshape(x2, xs2)
}
As you can see, to imagine the embedding features as existing in the complex plane, we merely treat adjacent pairs of floats in the underlying array as the real and imaginary part of a complex number. We rotate the embeddings in the complex plane, then go back to imagining the features as existing in the real plane. Again, the job of interpreting the meaning of the features after rotation is left to the model to learn.
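In base R terms, the "view as complex" step is nothing more than this kind of pairing of adjacent values (a toy illustration, separate from the tensor code above):

v <- c(0.1, 0.2, 0.3, 0.4)                 # four floats ...
complex(real      = v[c(TRUE, FALSE)],     # ... viewed as two complex numbers
        imaginary = v[c(FALSE, TRUE)])
# [1] 0.1+0.2i 0.3+0.4i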
We can quickly confirm that the rotary embeddings only rotate features and don't scale them:

near <- function(x, y, tol = 1e-6) abs(x - y) < tol
all(near(1, Mod(compute_rotation_matrix(2048L, 128L))))

tf.Tensor(True, shape=(), dtype=bool)
There is one more trick to observe before moving on: because of some of the mathematical properties of the rotation matrix, it's possible to avoid doing a full complex multiply operation and still arrive at the same result. Also, since the rotation matrix never changes, it makes sense to only compute it once and cache it, like so:
precomputed_rotation_matrix <- compute_rotation_matrix(
  seqlen = 2048L,                             # LLaMA max seqlen
  feature_dim = with(params, dim %/% n_heads) # head_size
)

apply_rotary_embedding_faster <- function(x) {

  rotate_every_two <- function(x) {
    x1 <- x[all_dims(), `::2`]
    x2 <- x[all_dims(), `2::2`]
    x_ <- tf$stack(list(-x2, x1), axis = -1L)
    tf$reshape(x_, tf$shape(x))
  }

  repeat_each_twice <- function(x) {
    tf$`repeat`(x, 2L, axis = -1L)
  }

  seqlen <- tf$shape(x)[2]
  rot <- precomputed_rotation_matrix[, NA:seqlen, , ]

  cos <- Re(rot) |> repeat_each_twice()
  sin <- Im(rot) |> repeat_each_twice()

  (x * cos) + (rotate_every_two(x) * sin)
}

rand <- tf$random$uniform(shape(3, 8, params$n_heads, 128))
all(apply_rotary_embedding(rand) ==
    apply_rotary_embedding_faster(rand))

tf.Tensor(True, shape=(), dtype=bool)

apply_rotary_embedding <- apply_rotary_embedding_faster
Finally, note that the rotary positional embeddings are applied within each Attention layer. This is different from the original Transformer implementation, where a positional embedding was only added once at the head of the model. Similar to residual connections, you can think of the presence of these repeated injections of positional information as relieving the remaining trainable layers from the burden of allocating some of their weights to the task of "passing through" or "preserving" the positional information for later layers.
Positional embeddings are a rich subject that also comes up in other deep learning architectures, like denoising diffusion (Falbel and Keydana 2023), so time spent understanding them better is time well spent. For the purposes of this blog post we've covered the points needed and we'll move on to tying all the pieces together. To go deeper and develop a more mathematically informed understanding of RoPE, two excellent starting points are:

- The original paper by Su et al. (2022)

- This blog post by Biderman et al. (2021)
Tying it all together
With Tokenizer, Embedding, TransformerBlock (RMSNorm, Attention, FeedForward, and apply_rotary_embedding) all covered, it's time to tie all the pieces together into a Transformer model. We could do this using %py_class% like with the other layers above, but it's just as easy to move over to using the Keras functional API at this point.
layer_transformer_block <- create_layer_wrapper(TransformerBlock)
layer_rms_norm <- create_layer_wrapper(RMSNorm)

# input to the model will be output from the tokenizer
input <- layer_input(shape(NA)) #, dtype = "int32")

x <- input |>
  tok_embeddings() # instantiated earlier in the blog post

for (block_id in seq_len0(params$n_layers)) {
  x <- x |>
    layer_transformer_block(attn_head_size = params$dim %/% params$n_heads,
                            attn_n_heads = params$n_heads,
                            norm_eps = params$norm_eps,
                            block_id = block_id)
}

# final output projection into logits of output tokens
x <- x |>
  layer_rms_norm(block_id = -1, eps = params$norm_eps) |>
  layer_dense(
    tokenizer$vocab_size(), use_bias = FALSE,
    kernel_initializer = \(...) np$load(weights_path("7B/output.weight.npy"))$`T`
  )

# slice out the logits for the last token
with_options(c(tensorflow.extract.warn_negatives_pythonic = FALSE), {
  output <- x[, -1, ]
})

llama <- keras_model(input, output) %>%
  compile(jit_compile = TRUE)
The input to the model is tokenized text and the output is the (unnormalized) probabilities for each token in tokenizer$vocab_size() being the next token in the sequence.
next_token_probs <- prompt %>%
  tokenizer$tokenize() %>%
  llama()

next_token_probs

tf.Tensor(
[[-2.4503722e+00 -3.4463339e+00  1.3200411e+01 ...  4.8804146e-01
  -1.3277926e+00  9.9985600e-03]], shape=(1, 32000), dtype=float32)
Sampling strategies for selecting a token from the token logits is a rich topic, (also covered thoroughly in the Deep Learning with R book), but this blog post is long enough already. So for now, let's just take the argmax().

sampler <- \(logits) tf$argmax(logits, axis = -1L, output_type = "int32")

(next_token <- sampler(next_token_probs))

tf.Tensor([304], shape=(1), dtype=int32)

tokenizer$detokenize(next_token) |> as.character()
[1] "to"
Let's run it for a few tokens and let LLaMA finish the sentence:
prompt_tokens <- tokenizer$tokenize("The best way to attract bees")

for (i in 1:20) {

  next_token_probs <- prompt_tokens |> llama()
  next_token <- sampler(next_token_probs)

  prompt_tokens %<>% { tf$concat(c(., next_token), axis = -1L) }

  # end of sentence
  if (as.logical(next_token == tokenizer$string_to_id(".")))
    break
}

prompt_tokens |>
  tokenizer$detokenize() |>
  as.character() |>
  strwrap(60) |> writeLines()

The best way to attract bees to your garden is to plant a
variety of flowers that bloom at different times.
Wrapping up
In this blog post we've walked through the LLaMA architecture implemented in R TensorFlow, including how to load pretrained weights, and then run the model to generate a sentence. Note, much of the code in this blog post is tailored for didactic purposes. While the implementation of the LLaMA architecture covered in this blog post is appropriate for training, there are a few modifications you'll want to make before doing a lot of text generation. Those include things like:
- In the Attention layer, caching the k and v tensors. Then, after the first forward pass with the initial prompt, only feeding the model the one new token from the sampler(), rather than feeding the model all the tokens of the full prompt on each forward pass (see the rough sketch after this list).

- Only generating the causal mask make_mask() and rotary_matrix slices once per forward pass, instead of within each Attention call.

- Updating the TransformerBlock to be cache-aware and to pass through the appropriate arguments to Attention().

- Wrapping all the additional book-keeping logic in a custom TransformerDecoder() class.
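To give a flavor of the first item, here is a rough, conceptual sketch of a key/value cache in plain R (toy matrices and invented names, nothing like the fuller implementation linked below): the k and v projections of already-seen tokens are kept around, and each new step only projects the single newest token.

kv_cache <- list(k = NULL, v = NULL)

cache_step <- function(new_token_features, wk, wv) {
  kv_cache$k <<- rbind(kv_cache$k, new_token_features %*% wk) # grow cache by one row
  kv_cache$v <<- rbind(kv_cache$v, new_token_features %*% wv)
  kv_cache # attention for the new token is computed against the full cache
}

wk <- matrix(rnorm(8 * 4), nrow = 8)
wv <- matrix(rnorm(8 * 4), nrow = 8)
str(cache_step(matrix(rnorm(8), nrow = 1), wk, wv)) # cache holds 1 token
str(cache_step(matrix(rnorm(8), nrow = 1), wk, wv)) # cache holds 2 tokens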
The changes required to implement these optimizations for inference balloon the code size and are mostly about book-keeping, so we won't go through them in this blog post. However, you can find a fuller implementation of LLaMA in R TensorFlow, including a cache-aware generate() method that only feeds the model one token at a time during the main inference loop, (and compiles to XLA!), here.
That's all for now. Thanks for reading and happy travels to all exploring this exciting LLM terrain!

Photo by Sébastien Goldberg on Unsplash