1. Project Introduction
🤗 Transformers: an organization called Hugging Face publishes the Transformers library. The project has the following highlights:
- First of all, it is a model repository for NLP. The remote hub stores the computation-graph source code and pretrained weights of many models, which can be downloaded on demand over the internet.
- A model alone is not enough to run a demo, so the library introduces the pipeline design, which chains {tokenizer preprocessing, model inference, argmax_id-to-text postprocessing} into one workflow and exposes a unified, convenient API for applications, tuning, and learning.
- Beyond prediction, it also provides classes such as Trainer, so users can train & evaluate models in a developer role, with support for single-machine multi-GPU training, TensorBoard logging, and more.
The Transformer is a milestone piece of work in ML and NLP, and hundreds of model variants build on the idea, so the library can also be seen as an NLP community: it gathers the popular model resources, and having everyone learn the same things makes communication easier.
2. Installation on Windows
The library can be installed via pip (recommended) or conda. Common issues include:
- h5py conflict
After I installed via conda, the h5py module raised errors because the pip-installed h5py conflicted with the conda-installed one; uninstalling the former resolved it.
- Model downloads are too slow
When a model is loaded for the first time it is downloaded automatically and cached under C:\Users\yichu\.cache\huggingface\transformers\, but the file names are long base64-like strings and are not intuitive. Besides being slow, a failed download is also hard to resume.
Instead, you can download the corresponding model from the model hub [3] to local disk and pass the full local path as the model argument in your code.
```python
import transformers

# The directory contains the files [config.json, tf_model.h5, tokenizer_config.json, vocab.txt]
model_path = r'D:\model_repository\transformer\distilbert-base-uncased-finetuned-sst-2-english'
model = transformers.pipeline('sentiment-analysis', model=model_path)
```
- PyTorch DataLoader multi-worker errors
Manually change the default to num_workers=0 in transformers.pipelines.base.Pipeline.__call__(self, inputs, *args, num_workers=8, **kwargs).
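Depending on the installed version, the same default can likely be overridden per call instead of patching the library source, since num_workers is an explicit keyword of __call__ in the signature above; a minimal sketch (the example sentences are placeholders):
```python
import transformers

classifier = transformers.pipeline('sentiment-analysis')
# Assumption: this transformers version exposes num_workers in Pipeline.__call__,
# so the Windows-unfriendly default of 8 workers can be overridden at call time.
print(classifier(["great movie", "terrible plot"], num_workers=0))
```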
- Enabling logging
The library has a centralized logging module; all logging is turned off by default, and the level can be set with a single call: transformers.logging.set_verbosity(transformers.logging.INFO).
3. Tokenizer
See the official documentation [4]. The tokenizer handles the data processing from raw input to model input. Viewed as a pipeline, it performs:
- tokenization
- vocabulary lookup
- truncation and padding
- attention-mask generation
- adding special tokens
tokenizers.Tokenizer.encode_batch(self, input), where input is a batch of raw text sequences.
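As a quick illustration of these steps with the transformers-side (fast) tokenizer, here is a minimal sketch; the local checkpoint path is the one assumed in section 2, and the sentences are placeholders:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    r'D:\model_repository\transformer\distilbert-base-uncased-finetuned-sst-2-english')

encoded = tokenizer(
    ["a masterpiece of a film", "the plot makes no sense"],
    padding="max_length",   # pad every sequence up to max_length
    truncation=True,        # truncate anything longer than max_length
    max_length=16,
    return_tensors="tf",    # tensors matching the TF model used later
)
print(encoded["input_ids"])       # vocabulary ids, with [CLS]/[SEP] added
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
```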
Training your own Tokenizer
```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import CharDelimiterSplit
from tokenizers.processors import BertProcessing
from tokenizers.trainers import WordLevelTrainer

def train_tokenizer():
    tokenizer = Tokenizer(model=WordLevel(unk_token="[UNK]"))
    special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[MASK]", "[PAD]"]
    # Observed log: INFO:__main__:Train tokenizer done, vocab size is 4196
    trainer = WordLevelTrainer(special_tokens=special_tokens, min_frequency=5)
    tokenizer.pre_tokenizer = CharDelimiterSplit('|')
    tokenizer.enable_truncation(max_length=100)
    tokenizer.enable_padding(length=100)
    # text_per_line_generator() yields one raw text line at a time (defined elsewhere)
    tokenizer.train_from_iterator(text_per_line_generator(), trainer)
    # tokenizer.train(paths, trainer)  # alternatively, train directly from text files
    tokenizer.post_processor = BertProcessing(
        cls=("[CLS]", tokenizer.token_to_id("[CLS]")),
        sep=("[SEP]", tokenizer.token_to_id("[SEP]")),
    )
    tokenizer.save(path)  # path: destination JSON file (defined elsewhere)
```
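A quick sanity check of the trained tokenizer might look like the following; a hedged sketch that assumes train_tokenizer() is adjusted to return the tokenizer it builds, and that the input uses the '|'-delimited format above:
```python
tok = train_tokenizer()           # assumes the function above is changed to return the tokenizer
enc = tok.encode("hello|world")   # CharDelimiterSplit('|') splits on the '|' character
print(enc.tokens[:5])             # e.g. ['[CLS]', 'hello', 'world', '[SEP]', '[PAD]'] after post-processing and padding
print(enc.ids[:5])
```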
Tokenizer persistence and loading
To serialize a tokenizer to a single file that contains all of its configuration and vocabulary, just use the save() method, and you can reload the tokenizer from that file with the Tokenizer.from_file() class method.
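A minimal sketch of the round trip (the file name is an arbitrary example):
```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(model=WordLevel(unk_token="[UNK]"))   # any Tokenizer instance, e.g. the one trained above
tokenizer.save("my_tokenizer.json")                         # one JSON file: configuration + vocabulary
tokenizer = Tokenizer.from_file("my_tokenizer.json")        # reload later, e.g. at serving time
```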
4. NLP Model
We use the TFDistilBertForSequenceClassification class to walk through the model.
Main class dependencies
```python
# Abridged excerpts from the library source; "..." marks omitted arguments/code.
class TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel, TFSequenceClassificationLoss):
    def __init__(self, config, *inputs, **kwargs):
        self.distilbert = TFDistilBertMainLayer(config, name="distilbert")
        self.pre_classifier = tf.keras.layers.Dense(config.dim, ...)
        self.classifier = tf.keras.layers.Dense(config.num_labels, ...)

    def call(self, ...):
        hidden_state = distilbert_output[0]                 # (bs, seq_len, dim)
        pooled_output = hidden_state[:, 0]                  # (bs, dim)
        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)
        pooled_output = self.dropout(pooled_output, training=inputs["training"])  # (bs, dim)
        logits = self.classifier(pooled_output)             # (bs, num_labels)


# Parent class of the model above; its main job is the tf.function decorator,
# which fixes the model's signature for saved_model export, etc.
class TFDistilBertPreTrainedModel(TFPreTrainedModel):
    @tf.function(
        input_signature=[
            {
                "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
                "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
            }
        ]
    )
    def serving(self, inputs):
        output = self.call(inputs)
        return self.serving_output(output)


class TFSequenceClassificationLoss:
    def compute_loss(self, labels, logits):
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
        return loss_fn(labels, logits)


class TFDistilBertMainLayer(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        # Contains word_embeddings and position_embeddings
        self.embeddings = TFEmbeddings(config, name="embeddings")     # Embeddings
        # Contains a stack of TFTransformerBlock layers
        self.transformer = TFTransformer(config, name="transformer")  # Encoder


class TFTransformer(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        self.layer = [TFTransformerBlock(config, name=f"layer_._{i}") for i in range(config.n_layers)]

    # Returns the following three:
    #   hidden_state: output of the final layer, tf.Tensor of shape (bs, seq_length, dim)
    #   all_hidden_states: the hidden_state of every layer
    #   all_attentions: the attention weights of every layer
    def call(self, x, attn_mask, head_mask, output_attentions, output_hidden_states, return_dict, training=False):
        pass


class TFTransformerBlock(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        self.attention = TFMultiHeadSelfAttention(config, name="attention")
        self.sa_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name="sa_layer_norm")
        self.ffn = TFFFN(config, name="ffn")
        self.output_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name="output_layer_norm")
```
Input / output of the model
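Concretely, the inputs are the integer tensors produced by the tokenizer and the output is a logits tensor of shape (batch_size, num_labels). A minimal sketch, reusing the local checkpoint path assumed in section 2:
```python
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_path = r'D:\model_repository\transformer\distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = TFAutoModelForSequenceClassification.from_pretrained(model_path)

# input: {'input_ids': (bs, seq_len) int32, 'attention_mask': (bs, seq_len) int32}
inputs = tokenizer(["I love this movie"], return_tensors="tf")
# output: logits of shape (bs, num_labels); num_labels == 2 for this SST-2 checkpoint
outputs = model(inputs)
print(outputs.logits.shape)  # (1, 2)
```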
5. Pipeline
As the Model section above shows, the model has its own IO boundary. In real usage, however, the input is text and the output is a label, which is why the pipeline exists: it abstracts the pre- and post-processing and chains them together with the model, so everything works out of the box.
```python
# transformers.pipelines.base.Pipeline (abridged)
class Pipeline:
    def __init__(
        self,
        model: Union["PreTrainedModel", "TFPreTrainedModel"],
        tokenizer: Optional[PreTrainedTokenizer] = None,
        ...
    ):
        pass

    def __call__(self, inputs, *args, num_workers=8, **kwargs):
        return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

    def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
        model_inputs = self.preprocess(inputs, **preprocess_params)
        model_outputs = self.forward(model_inputs, **forward_params)
        outputs = self.postprocess(model_outputs, **postprocess_params)
        return outputs
```
The Pipeline above is an abstract class; TextClassificationPipeline, a concrete subclass that pairs with TFDistilBertForSequenceClassification, is shown below.
```python
class TextClassificationPipeline(Pipeline):
    def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
        return_tensors = self.framework
        return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)

    def _forward(self, model_inputs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs, function_to_apply=None, return_all_scores=False):
        outputs = model_outputs["logits"][0]
        outputs = outputs.numpy()
        if self.model.config.problem_type == "single_label_classification" or self.model.config.num_labels > 1:
            scores = softmax(outputs)
        return {"label": self.model.config.id2label[scores.argmax().item()], "score": scores.max().item()}
```
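End to end, the concrete pipeline is used as follows; a minimal sketch reusing the local checkpoint path assumed in section 2 (the input sentence is a placeholder):
```python
import transformers

model_path = r'D:\model_repository\transformer\distilbert-base-uncased-finetuned-sst-2-english'
classifier = transformers.pipeline('sentiment-analysis', model=model_path)

# __call__ -> run_single -> preprocess / _forward / postprocess, as traced above
print(classifier("This movie was surprisingly good."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```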
6. Model Export
Exporting TFDistilBertForSequenceClassification directly as a SavedModel gives the following signature:
```
structured_input_signature ((), {'input_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_ids'), 'attention_mask': TensorSpec(shape=(None, None), dtype=tf.int32, name='attention_mask')})
structured_outputs {'logits': TensorSpec(shape=(None, 2), dtype=tf.float32, name='logits')}
```
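One way to produce and inspect that SavedModel is sketched below; it assumes a transformers 4.x release where TFPreTrainedModel.save_pretrained accepts a saved_model flag, and the export directory is an arbitrary example:
```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model_path = r'D:\model_repository\transformer\distilbert-base-uncased-finetuned-sst-2-english'
export_dir = r'D:\model_repository\export\distilbert-sst2'   # arbitrary example path

model = TFAutoModelForSequenceClassification.from_pretrained(model_path)
model.save_pretrained(export_dir, saved_model=True)          # writes <export_dir>/saved_model/1

loaded = tf.saved_model.load(export_dir + r'\saved_model\1')
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)
print(serving_fn.structured_outputs)
```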
References
- [1] GitHub: https://github.com/huggingface/transformers
- [2] Official documentation: https://huggingface.co/docs/transformers
- [3] Model hub: https://huggingface.co/models
- [4] Tokenizer official documentation: https://huggingface.co/docs/tokenizers