「手把手教你」使用Python玩轉(zhuǎn)金融時間序列模型

只怕想不到 2019-08-28

展開全文

01引言

上一篇推文【Python量化基礎(chǔ)】時間序列的自相關(guān)性與平穩(wěn)性

著重介紹了時間序列的一些基礎(chǔ)概念，包括自相關(guān)性、偏自相關(guān)性、白噪聲和平穩(wěn)性，以及Python的簡單實現(xiàn)。本文在此基礎(chǔ)上，以滬深300指數(shù)收益率數(shù)據(jù)為例，探討如何使用Python對平穩(wěn)時間序列進(jìn)行建模和預(yù)測分析。時間序列經(jīng)典模型主要有自回歸模型AR，移動回歸模型MA，移動自回歸模型ARMA，以及差分移動自回歸模型ARIMA，今天主要介紹這四種模型的基本原理以及Python的實現(xiàn)步驟。

02AR模型

AR模型全稱為Autoregressive Models，即自回歸模型，用于刻畫因變量能由它的多個滯后項表示。p階自回歸模型可以寫成：

下面模擬一個AR(1)模型。

import pandas as pd

import numpy as np

import statsmodels.tsa.api as smt

#tsa為Time Series analysis縮寫

import statsmodels.api as sm

import scipy.stats as scs

from arch import arch_model

#畫圖

import matplotlib.pyplot as plt

import matplotlib as mpl

%matplotlib inline

#正常顯示畫圖時出現(xiàn)的中文和負(fù)號

from pylab import mpl

mpl.rcParams['font.sans-serif']=['SimHei']

mpl.rcParams['axes.unicode_minus']=False

#先定義一個畫圖函數(shù)，后面都會用到

def ts_plot(data, lags=None,title=''):

if not isinstance(data, pd.Series):

data = pd.Series(data)

#matplotlib官方提供了五種不同的圖形風(fēng)格，

#包括bmh、ggplot、dark_background、fivethirtyeight和grayscale

with plt.style.context('ggplot'):

fig = plt.figure(figsize=(10, 8))

layout = (3, 2)

ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)

acf_ax = plt.subplot2grid(layout, (1, 0))

pacf_ax = plt.subplot2grid(layout, (1, 1))

qq_ax = plt.subplot2grid(layout, (2, 0))

pp_ax = plt.subplot2grid(layout, (2, 1))

data.plot(ax=ts_ax)

ts_ax.set_title(title+'時序圖')

smt.graphics.plot_acf(data, lags=lags, ax=acf_ax, alpha=0.5)

acf_ax.set_title('自相關(guān)系數(shù)')

smt.graphics.plot_pacf(data, lags=lags, ax=pacf_ax, alpha=0.5)

pacf_ax.set_title('偏自相關(guān)系數(shù)')

sm.qqplot(data, line='s', ax=qq_ax)

qq_ax.set_title('QQ 圖')

scs.probplot(data, sparams=(data.mean(),

data.std()), plot=pp_ax)

pp_ax.set_title('PP 圖')

plt.tight_layout()

return

# 模擬AR(1) 過程

#設(shè)置隨機(jī)種子（括號里數(shù)字無意義）

np.random.seed(1)

#模擬次數(shù)

n=5000

#AR模型的參數(shù)

a = 0.8

#擾動項為正態(tài)分布

x = w = np.random.normal(size=n)

for t in range(1,n):

x[t] = a*x[t-1] + w[t]

#畫圖

ts_plot(x, lags=30)

模擬的AR(1)模型是正態(tài)的。自相關(guān)系數(shù)圖（ACF）顯示滯后值之間存在顯著的序列相關(guān)性，偏自相關(guān)系數(shù)圖（PACF）則顯示在滯后1期時截尾（迅速降為0）。下面使用statsmodels構(gòu)建AR(p)模型，先用AR模型擬合上述模擬的數(shù)據(jù)，并返回估計的系數(shù)參數(shù)），然后選擇最佳滯后階數(shù)，最后與原模型設(shè)置對比看是否選擇了正確的滯后項。假如AR模型是正確的，那估計的系數(shù)參數(shù)將很接近真實的系數(shù)0.8，選擇的階數(shù)也會等于1。

#估計數(shù)據(jù)的AR模型參數(shù)和滯后階數(shù)

def simu_ar(data,a,maxlag=30,true_order = 1):

'''data:要擬合的數(shù)據(jù)；a為參數(shù),可以為列表；maxlag:最大滯后階數(shù)'''

# 擬合AR(p)模型

result = smt.AR(data).fit(maxlag=maxlag, ic='aic', trend='nc')

#選擇滯后階數(shù)

est_order = smt.AR(data).select_order(maxlag=maxlag,

ic='aic', trend='nc')

#參數(shù)選擇標(biāo)準(zhǔn)ic : 有四個選擇 {‘a(chǎn)ic’,’bic’,’hqic’,’t-stat’}

#趨勢項：trend：c是指包含常數(shù)項，nc為不含常數(shù)項

#打印結(jié)果

print(f'參數(shù)估計值：{result.params.round(2)}，

估計的滯后階數(shù)：{est_order}')

print(f'真實參數(shù)值：{a}，真實滯后階數(shù) {true_order}')

simu_ar(x,a=0.8)

參數(shù)估計值：[0.8]，估計的滯后階數(shù)：1

真實參數(shù)值：0.8，真實滯后階數(shù) 1

看下如何用AR(p)模型來擬合滬深300的對數(shù)收益

# Select best lag order for hs300 returns

import tushare as ts

token='輸入token'

pro=ts.pro_api(token)

df=pro.index_daily(ts_code='000300.SH')

df.index=pd.to_datetime(df.trade_date)

del df.index.name

df=df.sort_index()

df['ret']=np.log(df.close/df.close.shift(1))

max_lag = 30

Y=df.ret.dropna()

ts_plot(Y,lags=max_lag,title='滬深300')

result = smt.AR(Y.values).fit(maxlag=max_lag, ic='aic', trend='nc')

est_order = smt.AR(Y.values).select_order(maxlag=max_lag,

ic='aic', trend='nc')

print(f'滬深300擬合AR模型的參數(shù)：{result.params.round(2)}')

print(f'滬深300擬合AR模型的最佳滯后階數(shù) {est_order}')

滬深300擬合AR模型的參數(shù)：[ 0.03 -0.03 ...]

滬深300擬合AR模型的最佳滯后階數(shù) 15

最好的階數(shù)選擇是15或者說有15個參數(shù)！任何模型有這么多參數(shù)在實際中不可能有用。顯然有比這個模型更好的模型可以解釋滬深300收益率走勢。

03MA模型

MA(q)模型與AR(p)模型非常相似。不同之處在于，MA(q)模型是對過去的白噪聲誤差項的線性組合，而不是過去觀測的線性組合。MA模型的動機(jī)是我們可以直接通過擬合誤差項的模型來觀察誤差過程中的“沖擊”。在一個AR(p)模型中，通過在一系列過去的觀察中使用ACF間接觀察到這些沖擊。MA(q)模型的公式是：

下面使用Python模擬MA(1) 過程。

#這里使用arma模型進(jìn)行模擬，設(shè)定ar階數(shù)為0，即得到ma模型

alphas = np.array([0.])

betas = np.array([0.6])

ar = np.r_[1, -alphas]

ma = np.r_[1, betas]

#模擬MA的樣本數(shù)據(jù)

ma_sample = smt.arma_generate_sample(ar=ar, ma=ma, nsample=1000)

ts_plot(ma_sample, lags=30,title='MA(1)模型')

ACF函數(shù)顯示滯后1階系數(shù)顯著異于0，表明MA（1）模型適合擬合的數(shù)據(jù)。

# 對上述模擬數(shù)據(jù)進(jìn)行ARMA模型擬合

max_lag = 30

result = smt.ARMA(ma1, order=(0, 1)).fit(maxlag=max_lag,

method='mle', trend='nc')

print(result.summary())

模型估計d 滯后系數(shù)為0.6277，與真實值0.6比較接近。注意到，95％置信區(qū)間確實包含該真實值。

下面嘗試用MA(3)模型去擬合滬深300股價的對數(shù)收益，但這次并不知道真實的參數(shù)值。結(jié)果顯示，擬合的殘差自相關(guān)系數(shù)和偏自相關(guān)系數(shù)比較符合白噪聲過程，但由于存在厚尾，MA模型并不是預(yù)測滬深300未來回報的最佳模型。

max_lag = 30

result=smt.ARMA(Y.values,order(0,3)).fit(maxlag=max_lag,

method='mle', trend='nc')

print(result.summary())

resid=pd.Series(result.resid,index=Y.index)

ts_plot(resid, lags=max_lag,title='滬深300指數(shù)MA擬合殘差')

04ARMA模型

ARMA模型全稱為自回歸移動平均模型Autoregressive Moving Average Models - ARMA(p, q)，是AR（p）和MA（q）模型之間的結(jié)合，從金融的角度理解，AR和MA模型的理論意義在于：AR（p）模型試圖捕捉（解釋）交易市場中經(jīng)常觀察到的動量和均值回復(fù)效應(yīng)。MA（q）模型嘗試捕捉（解釋）在白噪聲條件下觀察到的沖擊效應(yīng)。這些沖擊效應(yīng)可以被認(rèn)為是影響觀察過程的意外事件。ARMA模型的弱點在于忽視了大多數(shù)金融時間序列中的波動聚集效應(yīng)。模型的公式可以表示為:

print(result.summary())

# 下面使用ARMA(2, 2) 模型進(jìn)行模擬分析

max_lag = 30

n = 5000

burn = int(n/10)

alphas = np.array([0.5, -0.25])

betas = np.array([0.5, -0.3])

#注意ar模型1代表0階（自身），然后在其他系數(shù)前加負(fù)號

ar = np.r_[1, -alphas]

ma = np.r_[1, betas]

arma22 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n, burnin=burn)

_ = ts_plot(arma22, lags=max_lag)

result = smt.ARMA(arma22, order=(2, 2)).fit(maxlag=max_lag,

method='mle', trend='nc', burnin=burn)

結(jié)果顯示模型估計的參數(shù)與真實參數(shù)基本上吻合。下面使用ARMA模型來擬合滬深300的收益數(shù)據(jù)。ACF和PACF沒有顯示出明顯的自相關(guān)性。QQ和概率圖顯示殘差大致為正態(tài)分布但厚尾。總體而言，這個模型的殘差看起來不像白噪聲，說明模型還是沒有很好的擬合其波動性特性。

#不事先確定滯后階數(shù)，而是通過信息準(zhǔn)則選擇最佳的滯后階數(shù)

#先將初始值設(shè)置為無窮大

best_aic = np.inf

best_order = None

best_mdl = None

rng = range(5)

for i in rng:

for j in rng:

try:

tmp_mdl = smt.ARMA(Y.values, order=(i,j))

.fit(method='mle', trend='nc')

tmp_aic = tmp_mdl.aic

if tmp_aic < best_aic:

best_aic = tmp_aic

best_order = (i, j)

best_mdl = tmp_mdl

except: continue

print(f'最佳滯后階數(shù):{best_order}')

print(best_mdl.summary())

resid=pd.Series(best_mdl.resid,index=Y.index)

ts_plot(resid, lags=30,title='滬深300指數(shù)ARMA擬合殘差')

最佳滯后階數(shù):(4, 4)

05ARIMA模型

ARIMA模型全稱是差分移動自回歸模型（Autoregressive Integrated Moving Average Models），是ARMA模型的拓展。由于現(xiàn)實中很多時間序列不是平穩(wěn)的，但可以通過差分來實現(xiàn)平穩(wěn)，即通過一階差分可以將非平穩(wěn)機(jī)游走其轉(zhuǎn)化為平穩(wěn)的白噪聲。由于前三個模型都有時間序列平穩(wěn)的假設(shè)在，如果時間序列存在明顯的上升或者下降趨勢，模型預(yù)測的效果大大折扣。對于有明顯下降或者上升趨勢的數(shù)據(jù)集，可以使用差分的方式將其轉(zhuǎn)化為平穩(wěn)序列，然后使用ARMA模型進(jìn)行擬合。假設(shè)模型經(jīng)過d次差分通過了時間序列平穩(wěn)的檢驗，ARMA的系數(shù)為p，q，ARIMA模型為ARIMA(p，d，q）。

下面通過迭代（p，d，q）的不同組合，找到擬合滬深300收益率數(shù)據(jù)的最佳ARIMA模型。通過AIC信息準(zhǔn)則來評估每個模型，最后選取AIC最小的。

#原理與擬合ARMA模型類似

best_aic = np.inf

best_order = None

best_mdl = None

#假定最多滯后4階

pq_rng = range(5)

#假定最多差分一次

d_rng = range(2)

for i in pq_rng:

for d in d_rng:

for j in pq_rng:

try:

tmp_mdl = smt.ARIMA(Y.values, order=(i,d,j))

.fit(method='mle', trend='nc')

tmp_aic = tmp_mdl.aic

if tmp_aic < best_aic:

best_aic = tmp_aic

best_order = (i, d, j)

best_mdl = tmp_mdl

except: continue

print(f'ARIMA模型最佳階數(shù)選擇：{best_order}')

# 對擬合殘差進(jìn)行可視化

print(best_mdl.summary())

resid=pd.Series(best_mdl.resid,index=Y.index)

_ = ts_plot(resid, lags=30,title='滬深300指數(shù)ARIMA殘差')

ARIMA模型最佳階數(shù)選擇：(4, 0, 4)

最好的模型是差分為0，因為我們使用的是收益率數(shù)據(jù)，相對于已經(jīng)采用了第一次對數(shù)差分來計算股票收益率。模型殘差圖結(jié)果與上面使用的ARMA模型基本相同。顯然，ARIMA模型同樣無法解釋時間序列中的條件波動性。到這一步，時間序列的基本模型和建模步驟基本上大家已經(jīng)熟知，下面利用模型的forecast()方法進(jìn)行預(yù)測。

# 對滬深300收益率未來20天進(jìn)行預(yù)測

n_steps = 20

#分別設(shè)置95%和99%的置信度

f, err95, ci95 = best_mdl.forecast(steps=n_steps)

_, err99, ci99 = best_mdl.forecast(steps=n_steps, alpha=0.01)

date=(df.index[-1]).strftime('%Y%m%d')

cal=pro.trade_cal(exchange='', start_date=date)

idx = cal[cal.is_open==1][:20]['cal_date'].values

fc_95 = pd.DataFrame(np.column_stack([f, ci95]),

index=idx,columns=['forecast', 'lower_ci_95', 'upper_ci_95'])

fc_99 = pd.DataFrame(np.column_stack([ci99]),

index=idx, columns=['lower_ci_99', 'upper_ci_99'])

fc_all = fc_95.combine_first(fc_99)

#fc_all.head()

# 對預(yù)測的20日收益率數(shù)據(jù)進(jìn)行可視化

plt.style.use('ggplot')

fig = plt.figure(figsize=(12,7))

ax = plt.gca()

ts = df['ret'][-500:].copy()

ts.plot(ax=ax, label='HS300收益率')

# 樣本內(nèi)預(yù)測

pred=best_mdl.predict(np.arange(len(ts)) [0], np.arange(len(ts))[-1])

pf=pd.Series(pred,index=ts.index)

pf.plot(ax=ax, style='r-', label='樣本內(nèi)預(yù)測')

fc_all.index=pd.to_datetime(fc_all.index)

fc_all.plot(ax=ax)

plt.fill_between(fc_all.index, fc_all.lower_ci_95,

fc_all.upper_ci_95, color='gray', alpha=0.7)

plt.fill_between(fc_all.index, fc_all.lower_ci_99,

fc_all.upper_ci_99, color='gray', alpha=0.2)

plt.title('{} 天HS300收益率預(yù)測ARIMA{}'.format(n_steps,

best_order))

plt.legend(loc='best', fontsize=10)

plt.show()

06結(jié)語

本文主要以滬深300指數(shù)收益率數(shù)據(jù)為例，簡要介紹了時間序列四大經(jīng)典模型的基本原理和Python的簡單應(yīng)用，不難發(fā)現(xiàn)，這些模型在擬合和預(yù)測滬深300指數(shù)收益率上顯得力不從心。實際上，這些模型有一個潛在假設(shè)是干擾項的方差是固定不變的，但是研究者發(fā)現(xiàn)金融經(jīng)濟(jì)數(shù)據(jù)（如股票收益率）大都存在異方差現(xiàn)象，因此傳統(tǒng)的時間序列模型無法獲得可靠的估計結(jié)果。為了解決金融資產(chǎn)收益率序列波動聚集的難題，學(xué)者們提出了ARCH、GARCH以及協(xié)整模型，后續(xù)推文將會對這一方面的應(yīng)用進(jìn)行詳細(xì)介紹。

參考資料：

1. statsmodels官方文檔

2. Time Series Analysis (TSA) in Python - Linear Models to GARCH

著作權(quán)歸作者所有