語音信號(hào)處理之(二)基音周期估計(jì)(Pitch Detection) zouxy09@qq.com http://blog.csdn.net/zouxy09 這學(xué)期有《語音信號(hào)處理》這門課,快考試了,所以也要了解了解相關(guān)的知識(shí)點(diǎn)。呵呵,平時(shí)沒怎么聽課,現(xiàn)在只能抱佛腳了。順便也總結(jié)總結(jié),好讓自己的知識(shí)架構(gòu)清晰點(diǎn),也和大家分享下。下面總結(jié)的是第二個(gè)知識(shí)點(diǎn):基音周期估計(jì)。我們用C++實(shí)現(xiàn)了基于自相關(guān)函數(shù)法的基音周期檢測(cè),并且結(jié)合了OpenCV來顯示語音波形。因?yàn)榛ǖ臅r(shí)間不多,所以可能會(huì)有不少說的不妥的地方,還望大家指正。謝謝。 一、概述 1.1、基音與基音周期估計(jì) 人在發(fā)音時(shí),根據(jù)聲帶是否震動(dòng)可以將語音信號(hào)分為清音跟濁音兩種。濁音又稱有聲語言,攜帶者語言中大部分的能量,濁音在時(shí)域上呈現(xiàn)出明顯的周期性;而清音類似于白噪聲,沒有明顯的周期性。發(fā)濁音時(shí),氣流通過聲門使聲帶產(chǎn)生張弛震蕩式振動(dòng),產(chǎn)生準(zhǔn)周期的激勵(lì)脈沖串。這種聲帶振動(dòng)的頻率稱為基音頻率,相應(yīng)的周期就成為基音周期。 通常,基音頻率與個(gè)人聲帶的長(zhǎng)短、薄厚、韌性、勁度和發(fā)音習(xí)慣等有關(guān)系,在很大程度上反應(yīng)了個(gè)人的特征。此外,基音頻率還跟隨著人的性別、年齡不同而有所不同。一般來說,男性說話者的基音頻率較低,而女性說話者和小孩的基音頻率相對(duì)較高。 基音周期的估計(jì)稱謂基音檢測(cè),基音檢測(cè)的最終目的是為了找出和聲帶振動(dòng)頻率完全一致或盡可能相吻合的軌跡曲線。 基因周期作為語音信號(hào)處理中描述激勵(lì)源的重要參數(shù)之一,在語音合成、語音壓縮編碼、語音識(shí)別和說話人確認(rèn)等領(lǐng)域都有著廣泛而重要的問題,尤其對(duì)漢語更是如此。漢語是一種有調(diào)語言,而基因周期的變化稱為聲調(diào),聲調(diào)對(duì)于漢語語音的理解極為重要。因?yàn)樵跐h語的相互交談中,不但要憑借不同的元音、輔音來辨別這些字詞的意義,還需要從不同的聲調(diào)來區(qū)別它,也就是說聲調(diào)具有辨義作用;另外,漢語中存在著多音字現(xiàn)象,同一個(gè)字的不同的語氣或不同的詞義下具有不同的聲調(diào)。因此準(zhǔn)確可靠地進(jìn)行基音檢測(cè)對(duì)漢語語音信號(hào)的處理顯得尤為重要。 1.2、基音周期估計(jì)的現(xiàn)有方法 到目前為止,基音檢測(cè)的方法大致上可以分為三類: 1)時(shí)域估計(jì)法,直接由語音波形來估計(jì)基音周期,常見的有:自相關(guān)法、并行處理法、平均幅度差法、數(shù)據(jù)減少法等; 2)變換法,它是一種將語音信號(hào)變換到頻域或者時(shí)域來估計(jì)基音周期的方法,首先利用同態(tài)分析方法將聲道的影響消除,得到屬于激勵(lì)部分的信息,然后求取基音周期,最常用的就是倒譜法,這種方法的缺點(diǎn)就是算法比較復(fù)雜,但是基音估計(jì)的效果卻很好; 3)混合法,先提取信號(hào)聲道模型參數(shù),然后利用它對(duì)信號(hào)進(jìn)行濾波,得到音源序列,最后再利用自相關(guān)法或者平均幅度差法求得基因音周期。 三、基于自相關(guān)的基音周期檢測(cè) 3.1、自相關(guān)函數(shù) 能量有限的語音信號(hào)x(n)的短時(shí)自相關(guān)函數(shù)定義為:  此公式表示一個(gè)信號(hào)和延遲m點(diǎn)后該信號(hào)本身的相似性。如果信號(hào)x(n)具有周期性,那么它的自相關(guān)函數(shù)也具有周期性,而且周期與信號(hào)x(n)的周期性相同。自相關(guān)函數(shù)提供了一種獲取周期信號(hào)周期的方法。在周期信號(hào)周期的整數(shù)倍上,它的自相關(guān)函數(shù)可以達(dá)到最大值,因此可以不考慮起始時(shí)間,而從自相關(guān)函數(shù)的第一個(gè)最大值的位置估計(jì)出信號(hào)的基音周期,這使自相關(guān)函數(shù)成為信號(hào)基音周期估計(jì)的一種工具。 3.2、短時(shí)自相關(guān)函數(shù)法 語音信號(hào)是非穩(wěn)態(tài)信號(hào)它的特征是隨時(shí)間變化的,但在一個(gè)很短的時(shí)間段內(nèi)可以認(rèn)為具有相對(duì)穩(wěn)定的特征即短時(shí)平穩(wěn)性。因此語音具有短時(shí)自相關(guān)性。這個(gè)時(shí)間段約5ms-50ms。為其統(tǒng)計(jì)特性和頻譜特性都是對(duì)短時(shí)段而言的。這使得要對(duì)語音信號(hào)作數(shù)字處理必須先按短時(shí)段對(duì)語音信號(hào)分幀。這樣每一幀信號(hào)都具有短時(shí)平穩(wěn)性從而進(jìn)行短時(shí)相關(guān)分析。 能量有限的語音信號(hào)s(n)的短時(shí)自相關(guān)函數(shù)定義為: 
一般要求一幀至少包含2個(gè)以上的周期。一般,基頻最低50Hz,故周期最長(zhǎng)為20ms。而且相鄰幀之間要有足夠的重疊。具體應(yīng)用時(shí),窗口長(zhǎng)度根據(jù)采樣率確定幀長(zhǎng)。 
該幀的自相關(guān)函數(shù)中,除去第一個(gè)最大值后(0處),最大值Kmax= 114,那么該幀對(duì)應(yīng)的基頻16kHz/114=140Hz。 四、基于自相關(guān)的基音周期檢測(cè)算法實(shí)現(xiàn) 這個(gè)實(shí)現(xiàn)課程要求是用C++來實(shí)現(xiàn)的。然后為了畫波形,我用到了我比較熟悉的OpenCV。OpenCV畫出來的波形還是不錯(cuò)的,而且如果是動(dòng)態(tài)的波形平移,挺好看的,就像心電圖那么動(dòng)人。 實(shí)驗(yàn)采用一段男聲讀“播放”兩個(gè)字的聲音wav文件,其為16KHz采樣率,16bit量化。整段語音長(zhǎng)656.7ms,節(jié)點(diǎn)共10508個(gè)。 
我們先要確定幀長(zhǎng)。下面分別是幀長(zhǎng)200,320和400個(gè)節(jié)點(diǎn)時(shí)所包含的周期數(shù)。200時(shí)只有一個(gè)周期,而400有三個(gè)周期,所以我們采用400的幀長(zhǎng)。 
通過計(jì)算短時(shí)能量區(qū)分voice和unvoice。語音信號(hào){x(n)}的某幀信號(hào)的短時(shí)平均能量En的定義為: 
語音中濁音段的短時(shí)平均能量遠(yuǎn)遠(yuǎn)大于清音段的短時(shí)平均能量。因此,短時(shí)平均能量的計(jì)算給出了區(qū)分清音段與濁音段的依據(jù),即En(濁)>En(清)。 
計(jì)算每一幀的過程中,會(huì)顯示在原來波形中的位置,并且實(shí)時(shí)顯示該幀得到的基音周期。另外還會(huì)在另一個(gè)窗口實(shí)時(shí)顯示該幀的原始波形。 
該幀的原始波形圖(以下為不同時(shí)間的兩幀,會(huì)動(dòng)態(tài)變化): 
下面左邊的圖是計(jì)算該語音的所有幀對(duì)應(yīng)的基音周期的點(diǎn),由圖可以看出存在不少的野點(diǎn)。因?yàn)?,需要?duì)此進(jìn)行進(jìn)一步的處理,即去除野點(diǎn)。這里通過中值濾波來除去野點(diǎn),濾波結(jié)果見右圖。 
C++程序如下:(每按一次空格進(jìn)入下一個(gè)步驟) - // Description : Pitch detection
- // Author : Zou Xiaoyi
- // HomePage : http://blog.csdn.net/zouxy09
- // Date : 2013/06/08
- // Rev. : 0.1
-
- #include <iostream>
- #include <fstream>
- #include "opencv2/opencv.hpp"
- #include "ReadWriteWav.h"
- #include <string>
-
- using namespace std;
- using namespace cv;
-
- #define MAXLENGTH 1000
-
- void wav2image(Mat &img, vector<short> wavData, int wav_start, int width, int max_amplitude)
- {
- short max(0), min(0);
- for (int i = 0; i < wavData.size(); i++)
- {
- if (wavData[i] > max)
- max = wavData[i];
- if (wavData[i] < min)
- min = wavData[i];
- }
- cout<<max<<'\t'<<min<<endl;
-
- max_amplitude = max_amplitude > 480 ? 480 : max_amplitude;
-
- // normalize
- for (int i = 0; i < wavData.size(); i++)
- {
- wavData[i] = (wavData[i] - min) * max_amplitude / (max - min);
- }
-
- int j = 0;
- Point prePoint, curPoint;
- if (width >= 400)
- {
- img.create(max_amplitude, width, CV_8UC3);
- img.setTo(Scalar(0, 0, 0));
- for (int i = wav_start; i < wav_start + width; i++)
- {
- prePoint = Point(j, img.rows - (int)wavData[i]);
- if (j)
- line(img, prePoint, curPoint, Scalar(0, 255, 0), 2);
- curPoint = prePoint;
- j++;
- }
-
- if (width > MAXLENGTH)
- {
- cout<<"The wav is too long to show, and it will be resized to 1200"<<endl;
- resize(img, img, Size(MAXLENGTH, img.rows));
- }
- }
- else
- {
- img.create(max_amplitude, 400, CV_8UC3);
- img.setTo(Scalar(0, 0, 0));
- for (int i = wav_start; i < wav_start + width; i++)
- {
- prePoint = Point(j*400/width, img.rows - (int)wavData[i]);
- circle(img, prePoint, 3, Scalar(0, 0, 255), CV_FILLED);
- j++;
- }
- cout<<"The wav is too small to show, and it will be resized to 400"<<endl;
- }
- }
-
- short calOneFrameACF(vector<short> wavFrame, int sampleRate)
- {
- vector<float> acf;
- acf.empty();
-
- // calculate ACF
- for (int k = 0; k < wavFrame.size(); k++)
- {
- float sum = 0.0;
- for (int i = 0; i < wavFrame.size() - k; i++)
- {
- sum = sum + wavFrame[i] * wavFrame[ i + k ];
- }
- acf.push_back(sum);
- }
-
- // find the max one
- float max(-999);
- int index = 0;
- for (int k = 0; k < wavFrame.size(); k++)
- {
- if (k > 25 && acf[k] > max)
- {
- max = acf[k];
- index = k;
- }
- }
- return (short)sampleRate / index;
- }
-
- int main()
- {
- const char *wavFile = "bofang.wav";
- vector<short> data;
- int nodesPerFrame = 400;
-
-
- /************* Write data to file part Start ***************/
- fstream writeFile;
- writeFile.open("statistics.txt", ios::out);
- /************* Write data to file part End ***************/
-
-
- /************* Read and show the input wave part Start ***************/
- int sampleRate;
- int dataLength = wav2allsample(wavFile, data, sampleRate);
- if (!dataLength)
- {
- cout <<"Reading wav file error!"<<endl;
- return -1;
- }
- Mat originalWave;
- wav2image(originalWave, data, 0, dataLength, 400);
- line(originalWave, Point(0, originalWave.rows * 0.5), Point(originalWave.cols, originalWave.rows * 0.5), Scalar(0, 0, 255), 2);
- imshow("originalWave", originalWave);
-
- // write data
- writeFile<<"Filename: "<<wavFile<<endl<<"SampleRate: "<<sampleRate<<"Hz"<<endl<<"dataLength: "<<dataLength<<endl;
-
- cout<<"Press space key to continue"<<endl;
- while (waitKey(30) != ' ');
- /************* Read and show the input wave part End ***************/
-
-
- /******** Calculate energy to separate voice and unvoice part Start *********/
- int nodeCount = 0;
-
- // The sum must be double type
- vector<double> energyTmp;
- double maxEnergy(0);
- while(nodeCount < (dataLength - nodesPerFrame))
- {
- double sum(0);
- for (int i = nodeCount; i < (nodeCount + nodesPerFrame); i++)
- {
- sum += (double)data[i] * data[i];
- }
- if (sum > maxEnergy)
- {
- maxEnergy = sum;
- }
- energyTmp.push_back(sum);
- nodeCount++;
- }
-
- // Transform to short type for show
- vector<short> energy;
-
- // Fill element of boundary
- short tmp = (short)(energyTmp[0] * 400 / maxEnergy);
- for (int i = 0; i < nodesPerFrame * 0.5; i++)
- {
- energy.push_back(tmp);
- }
- for (int i = 0; i < energyTmp.size(); i++)
- {
- energy.push_back((short)(energyTmp[i] * 400 / maxEnergy));
- }
- // Fill element of boundary
- tmp = (short)(energyTmp[energyTmp.size() - 1] * 400 / maxEnergy);
- for (int i = 0; i < nodesPerFrame * 0.5; i++)
- {
- energy.push_back(tmp);
- }
-
- // show
- Mat showEnergy;
- wav2image(showEnergy, energy, 0, energy.size(), 400);
- line(showEnergy, Point(0, showEnergy.rows - 1), Point(showEnergy.cols, showEnergy.rows - 1), Scalar(0, 0, 255), 2);
- imshow("showEnergy", showEnergy);
- while (waitKey(30) != ' ');
-
- // separate voice and unvoice
- float thresVoice = 400 * 0.15;
- line(showEnergy, Point(0, showEnergy.rows - thresVoice), Point(showEnergy.cols, showEnergy.rows - thresVoice), Scalar(0, 255, 255), 2);
- imshow("showEnergy", showEnergy);
- while (waitKey(30) != ' ');
-
- // Find the Transition point and draw them
- bool high = false;
- vector<int> separateNode;
- for (int i = 0; i < energy.size(); i++)
- {
- if ( !high && energy[i] > thresVoice)
- {
- separateNode.push_back(i);
- high = true;
- writeFile<<"UnVoice to Voice: "<<i<<endl;
- line(showEnergy, Point(i * MAXLENGTH / dataLength, 0), Point(i * MAXLENGTH / dataLength, showEnergy.rows), Scalar(255, 255, 255), 2);
- putText(showEnergy, "Voice", Point(i * MAXLENGTH / dataLength, showEnergy.rows * 0.5 + 40), FONT_HERSHEY_SIMPLEX, 1, Scalar(255, 255, 255), 2);
- imshow("showEnergy", showEnergy);
- while (waitKey(30) != ' ');
- }
- if ( high && energy[i] < thresVoice)
- {
- separateNode.push_back(i);
- high = false;
- writeFile<<"Voice to UnVoice: "<<i<<endl;
- line(showEnergy, Point(i * MAXLENGTH / dataLength, 0), Point(i * MAXLENGTH / dataLength, showEnergy.rows), Scalar(255, 0, 0), 2);
- putText(showEnergy, "UnVoice", Point(i * MAXLENGTH / dataLength, showEnergy.rows * 0.5 + 40), FONT_HERSHEY_SIMPLEX, 1, Scalar(255, 0, 0), 2);
- imshow("showEnergy", showEnergy);
- while (waitKey(30) != ' ');
- }
- }
- /******** Calculate energy to separate voice and unvoice part End ***********/
-
-
- /******************* Calculate all frame part Start ***************/
- int frames = 0;
- vector<short> allPitchFre;
- writeFile<<"The pitch frequency is:"<<endl;
- while(frames < 2 * dataLength / nodesPerFrame)
- {
- vector<short> wavFrame;
- wavFrame.empty();
-
- // get one frame, 400 nodes per frame, and shift 200 nodes, or overlap 200 nodes
- int start = frames * nodesPerFrame * 0.5;
- for (int i = start; i < start + nodesPerFrame; i++)
- wavFrame.push_back(data[i]);
-
- // calculate the ACF of this frame
- float pitchFreqency = calOneFrameACF(wavFrame, sampleRate);
- allPitchFre.push_back(pitchFreqency);
-
- cout<<"The pitch frequency is: "<<pitchFreqency <<" Hz"<<endl;
- writeFile<<pitchFreqency<<endl;
-
- // show current frame in the whole wave
- Mat originalWave;
- wav2image(originalWave, data, 0, dataLength, 400);
- line(originalWave, Point(0, originalWave.rows * 0.5), Point(originalWave.cols, originalWave.rows * 0.5), Scalar(0, 0, 255), 2);
- line(originalWave, Point(start * MAXLENGTH / dataLength, 0), Point(start * MAXLENGTH / dataLength, originalWave.rows), Scalar(0, 0, 255), 2);
- line(originalWave, Point((start + nodesPerFrame)* MAXLENGTH / dataLength, 0), Point((start + nodesPerFrame)* MAXLENGTH / dataLength, originalWave.rows), Scalar(0, 0, 255), 2);
-
- // put the pitchFreqency of this frame in the whole wave
- stringstream buf;
- buf << pitchFreqency;
- string num = buf.str();
- putText(originalWave, num, Point(start * MAXLENGTH / dataLength, 30), FONT_HERSHEY_SIMPLEX, 0.7, Scalar(0, 0, 255), 2);
- imshow("originalWave", originalWave);
-
- // show current frame in zoom out model
- Mat oneSelectFrame;
- wav2image(oneSelectFrame, wavFrame, 0, wavFrame.size(), 400);
- imshow("oneSelectFrame", oneSelectFrame);
-
- if (!frames)
- while (waitKey(30) != ' ');
-
- frames++;
- waitKey(50);
- }
- cout<<"Num of frames is: "<<frames<<endl;
- /******************* Calculate all frame part End ***************/
-
-
- // show all pitch frequency before smooth
- Mat showAllPitchFre;
- wav2image(showAllPitchFre, allPitchFre, 0, allPitchFre.size(), 400);
- putText(showAllPitchFre, "Before smooth", Point(10, showAllPitchFre.rows - 20), FONT_HERSHEY_SIMPLEX, 1, Scalar(60, 200, 255), 1);
- imshow("showAllPitchFre", showAllPitchFre);
-
-
- /******************* Smooth by medium filter part Start **************/
- int kernelSize = 5;
- vector<short> afterMedFilter;
- short sum(0);
- afterMedFilter.assign(allPitchFre.size(), allPitchFre[0]);
-
- for (int k = cvFloor(kernelSize/2); k < allPitchFre.size(); k++)
- {
- vector<short> kernelData;
- for (int i = -cvFloor(kernelSize/2); i < cvCeil (kernelSize/2); i++)
- kernelData.push_back(allPitchFre[k+i]);
- nth_element(kernelData.begin(), kernelData.begin() + cvCeil (kernelSize/2), kernelData.end());
- afterMedFilter[k] = kernelData[cvCeil (kernelSize/2)];
- sum += afterMedFilter[k];
- cout<<afterMedFilter[k]<<endl;
- }
-
- // show all pitch frequency and mean pitch frequency after smooth
- Mat showAfterMedFilter;
- wav2image(showAfterMedFilter, afterMedFilter, 0, afterMedFilter.size(), 400);
- putText(showAfterMedFilter, "After smooth", Point(10, showAfterMedFilter.rows - 20), FONT_HERSHEY_SIMPLEX, 1, Scalar(60, 200, 255), 1);
-
- short mean = sum / (afterMedFilter.size() - cvFloor(kernelSize/2));
- writeFile<<"The mean pitch frequency is: "<<mean<<endl;
- stringstream buf;
- buf << mean;
- string num = "Mean: " + buf.str() + "Hz";
- putText(showAfterMedFilter, num, Point(10, 40), FONT_HERSHEY_SIMPLEX, 1, Scalar(255, 200, 255), 2);
- imshow("showAfterMedFilter", showAfterMedFilter);
- /******************* Smooth by medium filter part End ***************/
-
- while (waitKey(30) != 27);
-
- return 0;
- }
|