[Original] Python sentiment analysis: word segmentation with jieba and sentiment analysis with snownlp

Python集中營, 2023-01-29, published from Gansu


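Sentiment analysis is a computer science and technology term published in 2018.

It judges from the content of a text whether the meaning it conveys is positive or negative, and it can also be used to tell whether the wording is complimentary or derogatory.

A typical use case is analyzing large volumes of e-commerce review data, for example computing the share of positive or negative reviews.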
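As a rough illustration of the review statistics just mentioned, the sketch below computes a positive-review rate from per-review sentiment scores. The scores, the 0.5 cutoff, and the helper name `positive_rate` are made up for illustration; in practice the scores would come from a sentiment model.

```python
def positive_rate(scores, threshold=0.5):
    """Return the fraction of scores strictly above the threshold."""
    if not scores:
        # No reviews: report 0.0 instead of dividing by zero.
        return 0.0
    positive = sum(1 for s in scores if s > threshold)
    return positive / len(scores)


# Hypothetical per-review scores in [0, 1]; not real data.
review_scores = [0.91, 0.12, 0.77, 0.64, 0.30]

rate = positive_rate(review_scores)  # 3 of the 5 scores are above 0.5
print('positive rate: {:.0%}'.format(rate))
print('negative rate: {:.0%}'.format(1 - rate))
```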

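The sentiment analysis module used here is snownlp; to improve its accuracy, word segmentation is handled by the jieba module.

Neither of these Python modules is part of the standard library, so both are installed with pip.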

pip install jieba

pip install snownlp

jieba is a powerful Chinese word segmentation library that covers most Chinese segmentation needs; here it supports snownlp's sentiment analysis.

# Import the jieba module under the alias ja.
import jieba as ja

# Import the SnowNLP class from the snownlp module.
from snownlp import SnowNLP

To help readers avoid version conflicts, the Python version used here is listed below.

Python interpreter version: 3.6.8
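To compare a local environment against the version listed above, a small check like the following may help. It is only a sketch: the helper name `installed_version` is hypothetical, and it relies on `importlib.metadata`, which is in the standard library from Python 3.8 onward, so on older interpreters such as 3.6 it simply reports `None` for the packages.

```python
import platform

try:
    # importlib.metadata is available in the standard library from Python 3.8.
    from importlib import metadata
except ImportError:
    metadata = None


def installed_version(package):
    """Return the installed version of a package, or None if it cannot be determined."""
    if metadata is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


python_version = platform.python_version()
print('python', python_version)
for name in ('jieba', 'snownlp'):
    print(name, installed_version(name))
```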

Next, create the text to be analyzed; the end goal is to determine whether this text expresses a positive or a negative sentiment.

# The review text to analyze. Rough translation: "This is really so easy to use,
# I like it very much, and I will definitely buy it again next time!"
analysis_text = '這個實在是太好用了，我非常的喜歡，下次一定還會購買的！'

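With the text to analyze defined, the next step is word segmentation. Segmentation is done separately because snownlp's own Chinese segmentation is not very reliable.

Take '不好看' ("not good-looking"): segmented directly by snownlp, it will most likely be split into the two words '不' ("not") and '好看' ("good-looking").

A clearly negative expression may then be scored as positive, which is why jieba is used for segmentation first.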
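Why the split matters can be seen with a toy word-level scorer. The lexicon and its scores below are invented purely for illustration and are not values produced by snownlp; the point is only that once '不' and '好看' are counted separately, the positive word '好看' registers on its own, while the single token '不好看' does not.

```python
# Invented scores for illustration only; these are NOT values produced by snownlp.
TOY_LEXICON = {'不好看': 0.1, '好看': 0.9, '不': 0.4}


def count_positive(tokens, lexicon, threshold=0.5):
    """Count tokens whose toy score is strictly above the threshold."""
    # Unknown tokens get the neutral score 0.5, so they never count as positive.
    return sum(1 for t in tokens if lexicon.get(t, 0.5) > threshold)


as_one_word = ['不好看']       # negation kept attached to the adjective
as_two_words = ['不', '好看']  # negation split off from the adjective

print(count_positive(as_one_word, TOY_LEXICON))   # 0
print(count_positive(as_two_words, TOY_LEXICON))  # 1
```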

# Using the jieba module to cut the analysis_text into a list of words.
analysis_list = list(ja.cut(analysis_text))

# Printing the list of words that were cut from the analysis_text.
print(analysis_list)

# ['這個', '實在', '是', '太', '好', '用', '了', '，', '我', '非常', '的', '喜歡', '，', '下次', '一定', '還會', '購買', '的', '！']

Judging from the result above, the segmentation is fairly fine-grained: every token is at most two characters long.
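The granularity claim can be checked directly on the printed result. The sketch below copies the token list from the output above (rather than calling jieba again) and tallies token lengths with `collections.Counter`.

```python
from collections import Counter

# Tokens copied from the jieba.cut() output shown above.
tokens = ['這個', '實在', '是', '太', '好', '用', '了', '，', '我', '非常', '的',
          '喜歡', '，', '下次', '一定', '還會', '購買', '的', '！']

# Map each token length to the number of tokens with that length.
length_counts = Counter(len(token) for token in tokens)

print(max(len(token) for token in tokens))  # 2
print(sorted(length_counts.items()))        # [(1, 11), (2, 8)]
```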

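With jieba's cut() function the text has been split into words; the next step is to extract the main keywords.

Sentiment analysis usually extracts adjective-type keywords, because adjectives carry the emotion the text expresses.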

# Importing the `posseg` module from the `jieba` module and renaming it to `seg`.
import jieba.posseg as seg

# This is a list comprehension that is creating a list of tuples. Each tuple contains the word and the flag.
analysis_words = [(word.word, word.flag) for word in seg.cut(analysis_text)]

# Printing the list of tuples that were created in the list comprehension.
print(analysis_words)

# [('這個', 'r'), ('實在', 'v'), ('是', 'v'), ('太', 'd'), ('好用', 'v'), ('了', 'ul'), ('，', 'x'), ('我', 'r'), ('非常', 'd'), ('的', 'uj'), ('喜歡', 'v'), ('，', 'x'), ('下次', 't'), ('一定', 'd'), ('還', 'd'), ('會', 'v'), ('購買', 'v'), ('的', 'uj'), ('！', 'x')]

The list comprehension above extracts each segmented word together with its part-of-speech tag.
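To see which tags were assigned and to which words, the (word, flag) pairs can be grouped by flag. This sketch works on a copy of the printed posseg output instead of calling jieba again; the variable names are illustrative.

```python
from collections import defaultdict

# (word, flag) pairs copied from the posseg output shown above.
tagged = [('這個', 'r'), ('實在', 'v'), ('是', 'v'), ('太', 'd'), ('好用', 'v'),
          ('了', 'ul'), ('，', 'x'), ('我', 'r'), ('非常', 'd'), ('的', 'uj'),
          ('喜歡', 'v'), ('，', 'x'), ('下次', 't'), ('一定', 'd'), ('還', 'd'),
          ('會', 'v'), ('購買', 'v'), ('的', 'uj'), ('！', 'x')]

# Collect the words under each flag, preserving their order in the sentence.
words_by_flag = defaultdict(list)
for word, flag in tagged:
    words_by_flag[flag].append(word)

for flag in sorted(words_by_flag):
    print(flag, words_by_flag[flag])
```

In this sample no word carries the flag a; '好用' is tagged v.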

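jieba marks every word with a part-of-speech tag; for example, the tag a stands for adjective. The filter below keeps adjectives (a), adverbs (d), and verbs (v).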

# Keep only the (word, flag) tuples whose flag is a (adjective), d (adverb), or v (verb).
keywords = [x for x in analysis_words if x[1] in ['a', 'd', 'v']]

# Printing the list of tuples that were created in the list comprehension.
print(keywords)

# [('實在', 'v'), ('是', 'v'), ('太', 'd'), ('好用', 'v'), ('非常', 'd'), ('喜歡', 'v'), ('一定', 'd'), ('還', 'd'), ('會', 'v'), ('購買', 'v')]
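The same filtering can be wrapped in a small reusable helper. This is a sketch on a copy of the printed posseg output; `filter_by_flag` is a hypothetical name, not a jieba function. It also shows why verbs and adverbs are included here: filtering on adjectives alone would leave nothing for this sentence.

```python
def filter_by_flag(tagged_words, keep_flags):
    """Return the words whose part-of-speech flag is in keep_flags."""
    keep = frozenset(keep_flags)
    return [word for word, flag in tagged_words if flag in keep]


# (word, flag) pairs copied from the posseg output shown above.
tagged = [('這個', 'r'), ('實在', 'v'), ('是', 'v'), ('太', 'd'), ('好用', 'v'),
          ('了', 'ul'), ('，', 'x'), ('我', 'r'), ('非常', 'd'), ('的', 'uj'),
          ('喜歡', 'v'), ('，', 'x'), ('下次', 't'), ('一定', 'd'), ('還', 'd'),
          ('會', 'v'), ('購買', 'v'), ('的', 'uj'), ('！', 'x')]

print(filter_by_flag(tagged, {'a', 'd', 'v'}))
print(filter_by_flag(tagged, {'a'}))  # []: no word in this sample is tagged a
```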

With the keywords selected by their tags, the part-of-speech tags can now be dropped so that only the words remain.

# This is a list comprehension that is creating a list of words.
keywords = [x[0] for x in keywords]

# Printing the list of keywords that were created in the list comprehension.
print(keywords)

# ['實在', '是', '太', '好用', '非常', '喜歡', '一定', '還', '會', '購買']

At this point the segmentation work is done; the next step is to run snownlp on the keywords and get the sentiment results.

# Counters for positive and negative keywords.
pos_num = 0
neg_num = 0

for word in keywords:
    # Wrap the word in a SnowNLP object so it can be scored.
    sl = SnowNLP(word)
    # A score above 0.5 counts as positive; anything else counts as negative.
    if sl.sentiments > 0.5:
        pos_num += 1
    else:
        neg_num += 1
    # Print the word together with its sentiment score.
    print(word, sl.sentiments)

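Below are the sentiment scores for each keyword extracted from the original text. Scores fall between 0 and 1, and the closer a score is to 1, the more positive the sentiment.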

# 實在 0.3047790802524796
# 是 0.5262327818078083
# 太 0.34387502381406
# 好用 0.6558628208940429
# 非常 0.5262327818078083
# 喜歡 0.6994590939824207
# 一定 0.5262327818078083
# 還 0.5746682977321914
# 會 0.5539033457249072
# 購買 0.6502590673575129
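To make the 0.5 cutoff explicit, the sketch below applies the same rule as the loop (strictly above 0.5 is positive, everything else negative) to the scores printed above. The scores are copied from that output rather than recomputed, and the helper name `label` is illustrative.

```python
def label(score, threshold=0.5):
    """Strictly above the threshold is positive; otherwise negative."""
    return 'positive' if score > threshold else 'negative'


# Scores copied from the output shown above.
scores = {
    '實在': 0.3047790802524796, '是': 0.5262327818078083,
    '太': 0.34387502381406, '好用': 0.6558628208940429,
    '非常': 0.5262327818078083, '喜歡': 0.6994590939824207,
    '一定': 0.5262327818078083, '還': 0.5746682977321914,
    '會': 0.5539033457249072, '購買': 0.6502590673575129,
}

for word, score in scores.items():
    print(word, label(score))

# Words that fall on the negative side of the cutoff.
negatives = [word for word, score in scores.items() if label(score) == 'negative']
print(negatives)  # ['實在', '太']
```

Note that a score of exactly 0.5 lands on the negative side under this rule.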

To make the keyword results easier to interpret, the positive and negative keywords can also be counted to produce an overall figure.

# Report how many keywords were scored as positive and as negative.
print('Positive keywords: {}'.format(pos_num))
print('Negative keywords: {}'.format(neg_num))

# Share of positive keywords; guard against an empty keyword list to avoid dividing by zero.
total = pos_num + neg_num
print('Positive share: {}'.format(pos_num / total if total else 0.0))

# Positive keywords: 8
# Negative keywords: 2
# Positive share: 0.8
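The ratio treats every keyword above the cutoff the same, however close it sits to 0.5. As an alternative summary (a sketch only, not part of the procedure above), the per-keyword scores can also be averaged. The scores are copied from the earlier output.

```python
from statistics import mean

# Per-keyword scores copied from the output shown earlier, in the same order.
scores = [0.3047790802524796, 0.5262327818078083, 0.34387502381406,
          0.6558628208940429, 0.5262327818078083, 0.6994590939824207,
          0.5262327818078083, 0.5746682977321914, 0.5539033457249072,
          0.6502590673575129]

# Share of keywords strictly above the 0.5 cutoff, as in the ratio above.
positive_share = sum(1 for s in scores if s > 0.5) / len(scores)
# Arithmetic mean of the raw scores.
average_score = mean(scores)

print(positive_share)           # 0.8
print(round(average_score, 3))  # 0.536
```

The mean sits much closer to 0.5 than the 0.8 ratio suggests, because several keywords score only slightly above the cutoff.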