Introduction

This article covers the details and code implementation of the LLaMA model, and then, building on that base model, the details and code implementation of [LoRA]().

Note: some of the code in this article differs slightly from the original implementation. LLaMA-7B is used as the running example.

Model Details

LLaMA is based on the Transformer architecture. Taking LLaMA-7B as an example, some of its parameters are: vocabulary size 32000, embedding dimension 4096, 32 attention heads, and 32 Transformer layers. The overall processing flow is as follows; the details are covered in later sections:

  1. Pass the sentence through the tokenizer model, which converts the string into a list of tokens, i.e. str -> List[int], of length seq_len
  2. Pass the token list through the embedding layer (a simple lookup table), which converts it into embedding vectors, i.e. List[int] -> float[seq_len, embedding_dim]
  3. Pass the embeddings through N Transformer layers, producing float[seq_len, embedding_dim]
  4. Pass the result through the output layer to obtain probabilities over the vocabulary, and take the argmax (or sample by probability) as the predicted next token (a minimal sketch follows this list)
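
To make these four steps concrete, here is a minimal runnable sketch with toy sizes and made-up token ids; `embedding` and `output_weight` stand in for the real weights, and the Transformer layers are left as a comment:

import torch

# Toy sizes just to show the shapes; LLaMA-7B uses vocab_size=32000, dim=4096, 32 layers.
vocab_size, dim = 1000, 64
embedding = torch.randn(vocab_size, dim)      # lookup table (step 2)
output_weight = torch.randn(vocab_size, dim)  # output projection (step 4)

tokens = [1, 306, 4658]                       # step 1: token ids from the tokenizer (made up here)
hidden = embedding[torch.tensor(tokens)]      # step 2: [seq_len, dim]
# step 3: the N Transformer layers would transform `hidden` here, keeping its shape
logits = hidden @ output_weight.T             # step 4: [seq_len, vocab_size]
next_token = int(torch.argmax(logits[-1]))    # greedy pick of the predicted next token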

The Transformer decoder and encoder blocks are shown below. The prompt (prefill) computation uses the decoder-style block, i.e. self-attention with a causal mask over the whole prompt, while subsequent one-token-at-a-time prediction uses the encoder-style block, since a newly generated token may attend to every earlier token and needs no mask:

[Figure: transformer_decoder]

[Figure: transformer_encoder]

Linear Layer - Linear

Computation:

$$ output = data * weight^T $$

Code implementation:

class Linear(nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        # weight shape: out_features * in_features
        self.weight_ = weight

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        # output = data * weight^T
        return torch.matmul(data, self.weight_.transpose(0, 1))
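
A short usage sketch of this Linear wrapper. All snippets in this article assume `import torch`, `from torch import nn`, `import torch.nn.functional as F`, `import math`, and `from typing import List`; the sizes below are illustrative:

w = torch.randn(8, 4)        # pretrained weight, shape: out_features * in_features
linear = Linear(w)
x = torch.randn(3, 4)        # data shape: seq_len * in_features
out = linear.forward(x)      # output shape: 3 * 8, identical to x @ w.T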

Normalization Layer - RMSNorm

Computation:

$$ o_i = \frac{a_i * w_i}{\sqrt{\frac{1}{n}\sum_{j=1}^{n}a_j^2 + \epsilon}} $$

Code implementation:

class RMSNorm(nn.Module):
    def __init__(self, weight: torch.Tensor, eps: float = 1e-06):
        super().__init__()
        self.norm_eps_ = eps
        # element-wise scale, shape: dim
        self.weight_ = weight

    def _norm(self, data: torch.Tensor) -> torch.Tensor:
        # divide each element by the root mean square of its row (plus eps for stability)
        return data * torch.rsqrt(data.pow(2).mean(-1, keepdim=True) + self.norm_eps_)

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        # normalize in float32, cast back to the input dtype, then apply the scale
        output = self._norm(data.float()).type_as(data)
        return output * self.weight_
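
A quick sanity check (the scale weight is set to all ones here, which is also its typical initialization): after normalization each row has a root mean square close to 1 before the element-wise scale is applied:

dim = 8
norm = RMSNorm(torch.ones(dim))   # element-wise scale of all ones
x = torch.randn(4, dim) * 5.0     # rows with arbitrary magnitude
y = norm.forward(x)
print(y.pow(2).mean(-1))          # every value is close to 1.0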

Positional Embedding - RotaryEmbedding

Positional information is embedded into the word vectors by rotation. Computation:

$$ \left(\begin{matrix} e_0 \\ e_1 \\ \cdots \\ e_{d-2} \\ e_{d-1} \\ \end{matrix} \right) \times \left(\begin{matrix} cos\theta_0 \\ cos\theta_0 \\ \cdots \\ cos\theta_{d/2-1} \\ cos\theta_{d/2-1} \\ \end{matrix} \right) + \left(\begin{matrix} -e_1 \\ e_0 \\ \cdots \\ -e_{d-1} \\ e_{d-2}\\ \end{matrix} \right) \times \left(\begin{matrix} sin\theta_0 \\ sin\theta_0 \\ \cdots \\ sin\theta_{d/2-1} \\ sin\theta_{d/2-1} \\ \end{matrix} \right) = \left(\begin{matrix} o_0 \\ o_1 \\ \cdots \\ o_{d-2} \\ o_{d-1} \\ \end{matrix} \right) $$

Here the vector dimension d is the dimension of each attention head, i.e. the total embedding dimension / the number of attention heads.

The t-th angle for the i-th token is computed as:

$$ \theta_{i,t} = \frac{i}{10000^{\frac{2t}{d}}} $$

where i is the token position and d is the per-head vector dimension. From the embedding formula above, only d/2 angles are needed, and t indexes these angles.

Converting to complex numbers simplifies the computation:

$$ (cos\theta_0 + sin\theta_0i)\times(e_0+e_1i)=(e_0cos\theta_0-e_1sin\theta_0)+(e_1cos\theta_0+e_0sin\theta_0)i $$

We can precompute a matrix of shape max_seq_len * (head_dim / 2) and multiply it element-wise with the word vector rewritten in complex form:

$$ \left(\begin{matrix} cos\theta_{i,0} + sin\theta_{i,0}i \\ cos\theta_{i,1} + sin\theta_{i,1}i \\ \cdots \\ cos\theta_{i,d/2-2} + sin\theta_{i,d/2-2}i \\ cos\theta_{i,d/2-1} + sin\theta_{i,d/2-1}i \\ \end{matrix} \right) \times \left(\begin{matrix} e_0+e_1i \\ e_2+e_3i \\ \cdots \\ e_{d-4}+e_{d-3}i \\ e_{d-2}+e_{d-1}i \\ \end{matrix} \right) $$

Code implementation:

def precompute_rope_angle(dim: int, seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    angles = 1.0 / (theta ** (torch.arange(0, dim, 2) / dim))
    seq = torch.arange(seq_len)
    freqs = torch.outer(seq, angles)
    # 1*cos(angle) + 1*sin(angle)i
    angle_complex = torch.polar(torch.ones_like(freqs), freqs)
    return angle_complex


def apply_rotary_emb(data: torch.Tensor, angle: torch.Tensor) -> torch.Tensor:
    # data shape is: seq_len * n_head * n_dim
    # convert data shape into complex: seq_len * n_head * (dim // 2)
    seq_len, n_head, dim_head = data.shape

    data = torch.view_as_complex(data.reshape(
        seq_len, n_head, dim_head // 2, 2))

    angle_ = angle[:seq_len].view(seq_len, 1, dim_head // 2)

    return torch.view_as_real(data * angle_).flatten(2)
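
A usage sketch with LLaMA-7B-like sizes (32 heads, head_dim = 4096 / 32 = 128, maximum sequence length 2048):

head_dim, n_head, max_seq_len = 128, 32, 2048
angle = precompute_rope_angle(head_dim, max_seq_len)  # complex tensor, shape: 2048 * 64
x = torch.randn(16, n_head, head_dim)                 # seq_len * n_head * head_dim
x_rot = apply_rotary_emb(x, angle)                    # same shape, positions 0..15 now encoded
assert x_rot.shape == x.shape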

Transformer Layer

Code implementation

class Transformer(nn.Module):
    def __init__(self, layer_id: int, args: LlamaModelArgs):
        super().__init__()
        # multi-head attention configuration
        # (field names on args are assumed to follow the same trailing-underscore convention)
        self.n_heads_ = args.n_heads_
        self.head_dim_ = args.dim_ // args.n_heads_
        # attention
        self.wq_: Linear = None  # dim * dim
        self.wk_: Linear = None  # dim * dim
        self.wv_: Linear = None  # dim * dim
        self.wo_: Linear = None  # dim * dim
        # feed forward
        self.w1_: Linear = None  # also gate FFN * dim
        self.w2_: Linear = None  # also down dim * FFN
        self.w3_: Linear = None  # also up   FFN * dim
        # norm
        self.attention_norm_: RMSNorm = None  # dim
        self.ffn_norm_: RMSNorm = None        # dim

    def forward(self, data: torch.Tensor, rope_angle: torch.Tensor):
        # data shape: seq_len * dim
        seq_len, _ = data.shape

        attention_norm_data = self.attention_norm_.forward(data)

        xq = self.wq_.forward(attention_norm_data)
        xk = self.wk_.forward(attention_norm_data)
        xv = self.wv_.forward(attention_norm_data)

        # convert shape to multi head
        xq = xq.view(seq_len, self.n_heads_, self.head_dim_)
        xk = xk.view(seq_len, self.n_heads_, self.head_dim_)
        xv = xv.view(seq_len, self.n_heads_, self.head_dim_)

        # apply rotary embedding to queries and keys
        rotary_xq = apply_rotary_emb(xq, rope_angle)
        rotary_xk = apply_rotary_emb(xk, rope_angle)

        # shape is: n_head * seq_len * dim_head
        xq = rotary_xq.transpose(0, 1)
        xk = rotary_xk.transpose(0, 1)
        xv = xv.transpose(0, 1)
        scores = torch.matmul(xq, xk.transpose(1, 2)) / \
            math.sqrt(self.head_dim_)
        # need mask the score, score shape is : seq_len * seq_len
        mask = torch.full((seq_len, seq_len), float("-inf"))
        mask = torch.triu(mask, diagonal=1).type_as(scores)
        scores = scores + mask
        scores = F.softmax(scores.float(), dim=-1).type_as(xq)
        # attention output shape is: n_head * seq_len * head_dim
        attention_score = torch.matmul(scores, xv)
        # convert shape to: seq_len * dim
        attention_score = attention_score.transpose(
            0, 1).contiguous().view(seq_len, -1)

        # get output attention score
        attention_output_score = self.wo_.forward(attention_score)
        attention_output = attention_output_score + data

        # feed forward: SwiGLU, i.e. w2(silu(w1(x)) * w3(x)), plus the residual
        score_norm_data = self.ffn_norm_.forward(attention_output)
        w1_data = self.w1_.forward(score_norm_data)
        w3_data = self.w3_.forward(score_norm_data)
        ffn_output = self.w2_.forward(F.silu(w1_data) * w3_data)
        return attention_output + ffn_output
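
The causal mask built in forward is worth looking at on its own: row i keeps columns 0..i at 0 and sets the rest to -inf, so after softmax each position can only attend to itself and earlier positions. For seq_len = 4:

seq_len = 4
mask = torch.full((seq_len, seq_len), float("-inf"))
mask = torch.triu(mask, diagonal=1)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])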

LLaMA Model

Code implementation

class LlamaModel(nn.Module):
    def __init__(self, args: LlamaModelArgs):
        super().__init__()

        self.token_embedding_: torch.Tensor = None

        self.layers_: List[Transformer] = []
        for layer_id in range(args.n_layers_):
            self.layers_.append(Transformer(layer_id, args))

        self.norm_: RMSNorm = None    # dim
        self.output_: Linear = None   # vocab size * dim
        # precomputed rotary angles, shape: max_seq_len * (head_dim // 2)
        self.rope_angle_: torch.Tensor = None

    def forward(self, input_tokens: List[int]):
        tokens = torch.tensor(input_tokens, dtype=torch.int)
        data = F.embedding(tokens, self.token_embedding_)

        for layer in self.layers_:
            data = layer.forward(data, self.rope_angle_)

        data = self.norm_.forward(data)
        output = self.output_.forward(data)

        return output
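
On top of LlamaModel, text generation is a simple loop: run the model on the current token list, take the argmax of the last position, and append it. A hypothetical greedy-decoding sketch (weight loading is omitted, and since this simplified model has no KV cache the whole prefix is recomputed at every step):

def generate(model: LlamaModel, prompt_tokens: List[int], max_new_tokens: int) -> List[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)              # seq_len * vocab_size
        next_token = int(torch.argmax(logits[-1]))  # greedy: most probable next token
        tokens.append(next_token)
    return tokens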

LoRA Model Implementation

LoRA replaces the Linear layers in the Transformer blocks with a version that adds two extra matrices, LoRA_A and LoRA_B, which project the input down to rank r and then back up to the original dimension.
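
Written as a formula consistent with the code below, where A is the r * dim down-projection, B is the dim * r up-projection, and scaling is typically set to lora_alpha / r (the LoRA paper's convention; the code below leaves it as a configurable field):

$$ output = data * weight^T + scaling * (data * A^T) * B^T $$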

Code implementation

class Linear(nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()

        # frozen pretrained weight
        self.weight_ = weight
        # lora weight
        self.lora_a_: torch.Tensor = None  # r * dim
        self.lora_b_: torch.Tensor = None  # dim * r
        # common params
        self.lora_dropout_: float = 0
        self.r_: int = 0
        self.lora_alpha_: int = 0
        self.scaling_: float = 0

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        # frozen path: data * weight^T
        result = torch.matmul(data, self.weight_.transpose(0, 1))

        # lora path: dropout(data) * lora_a^T * lora_b^T * scaling
        data_ = F.dropout(data, self.lora_dropout_)
        data_ = torch.matmul(data_, self.lora_a_.transpose(0, 1))
        data_ = torch.matmul(data_, self.lora_b_.transpose(0, 1))
        data_ = data_ * self.scaling_

        return result + data_
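
A short initialization-and-usage sketch. The initialization scheme shown here (A drawn from a small random distribution, B all zeros, scaling = lora_alpha / r) follows the LoRA paper's convention and is an assumption, not something this article's code sets up:

out_dim, in_dim, r, alpha = 64, 64, 8, 16
lora_linear = Linear(torch.randn(out_dim, in_dim))    # frozen pretrained weight
lora_linear.lora_a_ = torch.randn(r, in_dim) * 0.01   # down-projection, starts near zero
lora_linear.lora_b_ = torch.zeros(out_dim, r)         # up-projection, starts at zero
lora_linear.r_, lora_linear.lora_alpha_ = r, alpha
lora_linear.scaling_ = alpha / r

x = torch.randn(3, in_dim)
y = lora_linear.forward(x)    # equals the frozen Linear output while lora_b_ is still zero
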
Last modified: July 31, 2023