1. Project Introduction
🤗 Transformers: an organization called Hugging Face publishes the Transformers library. The project has the following highlights:
- First of all, it is a model repository for NLP. The remote hub stores the computation-graph source code and pretrained weights of many models, which can be downloaded on demand over the internet.
- A model alone is not enough to run a demo, so the library introduces the pipeline design, which chains {tokenizer preprocessing, model inference, argmax_id-to-text postprocessing} into one workflow and exposes a unified, convenient API for applications, tuning, and learning.
- Beyond prediction, it also provides classes such as Trainer, so users can train & evaluate models in a developer role, with support for single-machine multi-GPU training, TensorBoard logging, and more.
The Transformer is a milestone piece of work in ML and NLP, and hundreds of model variants build on the idea, so the library can also be seen as an NLP community: it gathers the popular model resources, and having everyone learn the same things makes communication easier.
2. Installation on Windows
The library can be installed via pip (recommended) or conda. Common issues include:
- h5py conflict
After I installed via conda, the h5py module raised errors because the pip-installed h5py conflicted with the conda-installed one; uninstalling the former resolved it.
- Model downloads are too slow
When a model is loaded for the first time it is downloaded automatically and cached under C:\Users\yichu\.cache\huggingface\transformers\, but the file names are long base64-like strings and are not intuitive. Besides being slow, a failed download is also hard to resume.
Instead, you can download the corresponding model from the model hub [3] to local disk and pass the full local path as the model argument in your code.
```python
import transformers

# The directory contains the files [config.json, tf_model.h5, tokenizer_config.json, vocab.txt]
model_path = r'D:\model_repository\transformer\distilbert-base-uncased-finetuned-sst-2-english'
model = transformers.pipeline('sentiment-analysis', model=model_path)
```
- PyTorch DataLoader multi-worker errors
Manually change the default to num_workers=0 in transformers.pipelines.base.Pipeline.__call__(self, inputs, *args, num_workers=8, **kwargs).
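Depending on the installed version, the same default can likely be overridden per call instead of patching the library source, since num_workers is an explicit keyword of __call__ in the signature above; a minimal sketch (the example sentences are placeholders):
```python
import transformers

classifier = transformers.pipeline('sentiment-analysis')
# Assumption: this transformers version exposes num_workers in Pipeline.__call__,
# so the Windows-unfriendly default of 8 workers can be overridden at call time.
print(classifier(["great movie", "terrible plot"], num_workers=0))
```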
- Enabling logging
The library has a centralized logging module; all logging is turned off by default, and the level can be set with a single call: transformers.logging.set_verbosity(transformers.logging.INFO).
3. Tokenizer
See the official documentation [4]. The tokenizer handles the data processing from raw input to model input. Viewed as a pipeline, it performs:
- tokenization
- vocabulary lookup
- truncation and padding
- attention-mask generation
- adding special tokens
tokenizers.Tokenizer.encode_batch(self, input), where input is a batch of raw text sequences.
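As a quick illustration of these steps with the transformers-side (fast) tokenizer, here is a minimal sketch; the local checkpoint path is the one assumed in section 2, and the sentences are placeholders:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    r'D:\model_repository\transformer\distilbert-base-uncased-finetuned-sst-2-english')

encoded = tokenizer(
    ["a masterpiece of a film", "the plot makes no sense"],
    padding="max_length",   # pad every sequence up to max_length
    truncation=True,        # truncate anything longer than max_length
    max_length=16,
    return_tensors="tf",    # tensors matching the TF model used later
)
print(encoded["input_ids"])       # vocabulary ids, with [CLS]/[SEP] added
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
```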
Training your own Tokenizer
```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import CharDelimiterSplit
from tokenizers.processors import BertProcessing
from tokenizers.trainers import WordLevelTrainer

def train_tokenizer():
    tokenizer = Tokenizer(model=WordLevel(unk_token="[UNK]"))
    special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[MASK]", "[PAD]"]
    # Observed log: INFO:__main__:Train tokenizer done, vocab size is 4196
    trainer = WordLevelTrainer(special_tokens=special_tokens, min_frequency=5)
    tokenizer.pre_tokenizer = CharDelimiterSplit('|')
    tokenizer.enable_truncation(max_length=100)
    tokenizer.enable_padding(length=100)
    # text_per_line_generator() yields one raw text line at a time (defined elsewhere)
    tokenizer.train_from_iterator(text_per_line_generator(), trainer)
    # tokenizer.train(paths, trainer)  # alternatively, train directly from text files
    tokenizer.post_processor = BertProcessing(
        cls=("[CLS]", tokenizer.token_to_id("[CLS]")),
        sep=("[SEP]", tokenizer.token_to_id("[SEP]")),
    )
    tokenizer.save(path)  # path: destination JSON file (defined elsewhere)
```
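A quick sanity check of the trained tokenizer might look like the following; a hedged sketch that assumes train_tokenizer() is adjusted to return the tokenizer it builds, and that the input uses the '|'-delimited format above:
```python
tok = train_tokenizer()           # assumes the function above is changed to return the tokenizer
enc = tok.encode("hello|world")   # CharDelimiterSplit('|') splits on the '|' character
print(enc.tokens[:5])             # e.g. ['[CLS]', 'hello', 'world', '[SEP]', '[PAD]'] after post-processing and padding
print(enc.ids[:5])
```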
Tokenizer persistence and loading
To serialize a tokenizer to a single file that contains all of its configuration and vocabulary, just use the save() method, and you can reload the tokenizer from that file with the Tokenizer.from_file() class method.
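A minimal sketch of the round trip (the file name is an arbitrary example):
```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(model=WordLevel(unk_token="[UNK]"))   # any Tokenizer instance, e.g. the one trained above
tokenizer.save("my_tokenizer.json")                         # one JSON file: configuration + vocabulary
tokenizer = Tokenizer.from_file("my_tokenizer.json")        # reload later, e.g. at serving time
```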
4. NLP Model
We use the TFDistilBertForSequenceClassification class to walk through the model.
Main class dependencies
```python
# Abridged excerpts from the library source; "..." marks omitted arguments/code.
class TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel, TFSequenceClassificationLoss):
    def __init__(self, config, *inputs, **kwargs):
        self.distilbert = TFDistilBertMainLayer(config, name="distilbert")
        self.pre_classifier = tf.keras.layers.Dense(config.dim, ...)
        self.classifier = tf.keras.layers.Dense(config.num_labels, ...)

    def call(self, ...):
        hidden_state = distilbert_output[0]                 # (bs, seq_len, dim)
        pooled_output = hidden_state[:, 0]                  # (bs, dim)
        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)
        pooled_output = self.dropout(pooled_output, training=inputs["training"])  # (bs, dim)
        logits = self.classifier(pooled_output)             # (bs, num_labels)


# Parent class of the model above; its main job is the tf.function decorator,
# which fixes the model's signature for saved_model export, etc.
class TFDistilBertPreTrainedModel(TFPreTrainedModel):
    @tf.function(
        input_signature=[
            {
                "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
                "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
            }
        ]
    )
    def serving(self, inputs):
        output = self.call(inputs)
        return self.serving_output(output)


class TFSequenceClassificationLoss:
    def compute_loss(self, labels, logits):
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
        return loss_fn(labels, logits)


class TFDistilBertMainLayer(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        # Contains word_embeddings and position_embeddings
        self.embeddings = TFEmbeddings(config, name="embeddings")     # Embeddings
        # Contains a stack of TFTransformerBlock layers
        self.transformer = TFTransformer(config, name="transformer")  # Encoder


class TFTransformer(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        self.layer = [TFTransformerBlock(config, name=f"layer_._{i}") for i in range(config.n_layers)]

    # Returns the following three:
    #   hidden_state: output of the final layer, tf.Tensor of shape (bs, seq_length, dim)
    #   all_hidden_states: the hidden_state of every layer
    #   all_attentions: the attention weights of every layer
    def call(self, x, attn_mask, head_mask, output_attentions, output_hidden_states, return_dict, training=False):
        pass


class TFTransformerBlock(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        self.attention = TFMultiHeadSelfAttention(config, name="attention")
        self.sa_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name="sa_layer_norm")
        self.ffn = TFFFN(config, name="ffn")
        self.output_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name="output_layer_norm")
```
Input / output of the model
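Concretely, the inputs are the integer tensors produced by the tokenizer and the output is a logits tensor of shape (batch_size, num_labels). A minimal sketch, reusing the local checkpoint path assumed in section 2:
```python
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_path = r'D:\model_repository\transformer\distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = TFAutoModelForSequenceClassification.from_pretrained(model_path)

# input: {'input_ids': (bs, seq_len) int32, 'attention_mask': (bs, seq_len) int32}
inputs = tokenizer(["I love this movie"], return_tensors="tf")
# output: logits of shape (bs, num_labels); num_labels == 2 for this SST-2 checkpoint
outputs = model(inputs)
print(outputs.logits.shape)  # (1, 2)
```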
5. Pipeline
As the Model section above shows, the model has its own IO boundary. In real usage, however, the input is text and the output is a label, which is why the pipeline exists: it abstracts the pre- and post-processing and chains them together with the model, so everything works out of the box.
```python
# transformers.pipelines.base.Pipeline (abridged)
class Pipeline:
    def __init__(
        self,
        model: Union["PreTrainedModel", "TFPreTrainedModel"],
        tokenizer: Optional[PreTrainedTokenizer] = None,
        ...
    ):
        pass

    def __call__(self, inputs, *args, num_workers=8, **kwargs):
        return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

    def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
        model_inputs = self.preprocess(inputs, **preprocess_params)
        model_outputs = self.forward(model_inputs, **forward_params)
        outputs = self.postprocess(model_outputs, **postprocess_params)
        return outputs
```
The Pipeline above is an abstract class; TextClassificationPipeline, a concrete subclass that pairs with TFDistilBertForSequenceClassification, is shown below.
```python
class TextClassificationPipeline(Pipeline):
    def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
        return_tensors = self.framework
        return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)

    def _forward(self, model_inputs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs, function_to_apply=None, return_all_scores=False):
        outputs = model_outputs["logits"][0]
        outputs = outputs.numpy()
        if self.model.config.problem_type == "single_label_classification" or self.model.config.num_labels > 1:
            scores = softmax(outputs)
        return {"label": self.model.config.id2label[scores.argmax().item()], "score": scores.max().item()}
```
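End to end, the concrete pipeline is used as follows; a minimal sketch reusing the local checkpoint path assumed in section 2 (the input sentence is a placeholder):
```python
import transformers

model_path = r'D:\model_repository\transformer\distilbert-base-uncased-finetuned-sst-2-english'
classifier = transformers.pipeline('sentiment-analysis', model=model_path)

# __call__ -> run_single -> preprocess / _forward / postprocess, as traced above
print(classifier("This movie was surprisingly good."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```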
6. Model Export
Exporting TFDistilBertForSequenceClassification directly as a SavedModel gives the following signature:
```
structured_input_signature ((), {'input_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_ids'), 'attention_mask': TensorSpec(shape=(None, None), dtype=tf.int32, name='attention_mask')})
structured_outputs {'logits': TensorSpec(shape=(None, 2), dtype=tf.float32, name='logits')}
```
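One way to produce and inspect that SavedModel is sketched below; it assumes a transformers 4.x release where TFPreTrainedModel.save_pretrained accepts a saved_model flag, and the export directory is an arbitrary example:
```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model_path = r'D:\model_repository\transformer\distilbert-base-uncased-finetuned-sst-2-english'
export_dir = r'D:\model_repository\export\distilbert-sst2'   # arbitrary example path

model = TFAutoModelForSequenceClassification.from_pretrained(model_path)
model.save_pretrained(export_dir, saved_model=True)          # writes <export_dir>/saved_model/1

loaded = tf.saved_model.load(export_dir + r'\saved_model\1')
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)
print(serving_fn.structured_outputs)
```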
References
- [1] GitHub: https://github.com/huggingface/transformers
- [2] Official documentation: https://huggingface.co/docs/transformers
- [3] Model hub: https://huggingface.co/models
- [4] Tokenizer official documentation: https://huggingface.co/docs/tokenizers