Essential Pandas Skills: Grouping and Aggregation

立志德美 2019-07-08


TUSHARE Finance and Technology Study Group

Translated and compiled by 一只小綠怪獸

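When working with data, knowing how to group and aggregate a dataset is an essential skill that can greatly speed up analysis.

Grouping is the process of splitting data into several groups according to one or more keys, where a key can be thought of as the grouping condition. Aggregation is any data transformation that produces a scalar value from an array. The two usually appear together, to compute statistics over grouped data or to implement other functionality.

This article shows how to use the groupby functionality in Pandas to group and aggregate datasets flexibly and efficiently.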

[Tools] Python 3

[Data] Tushare

[Note] The examples focus on explaining the methods; adapt them to your own needs.

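How it works

Grouping and aggregating with the Pandas groupby mechanism can be divided into three stages: split, apply, and combine. The figure in the original article illustrates a simple example of this process.

First, the data is split into groups according to one or more keys. Next, a function is applied to each group and produces a new value. Finally, the results of all these function calls are combined into the final result object.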
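To make the three stages concrete, here is a minimal sketch that carries them out by hand on a tiny made-up table and compares the result with a single groupby call. The column names `key` and `value` and the numbers are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical toy data; 'key' and 'value' are illustrative names only
df = pd.DataFrame({'key': ['A', 'B', 'A', 'B', 'C'],
                   'value': [1, 2, 3, 4, 5]})

# Split: one sub-table per distinct key
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}

# Apply: reduce each piece to a single number
sums = {k: piece['value'].sum() for k, piece in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(sums, name='value').sort_index()
manual.index.name = 'key'

# groupby performs the same three stages in one call
auto = df.groupby('key')['value'].sum()

print(manual)
print(auto)
```

Both printed Series should hold the same per-key sums; the groupby version simply hides the intermediate dictionaries.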

The groupby function

The groupby function [1] in Pandas makes it easy to group a table. First, we fetch a table from tushare.pro containing daily price data for three stocks.

import tushare as ts
import pandas as pd


pd.set_option('expand_frame_repr', False)  # show all columns without wrapping
ts.set_token('your token')
pro = ts.pro_api()

code_list = ['000001.SZ', '600000.SH', '000002.SZ']
frames = []
for code in code_list:
    print(code)
    df = pro.daily(ts_code=code, start_date='20180101', end_date='20180104')
    frames.append(df)
# DataFrame.append was removed in pandas 2.0; concatenate the pieces instead
stock_data = pd.concat(frames, ignore_index=True)

print(stock_data)


000001.SZ
600000.SH
000002.SZ
     ts_code trade_date   open   high    low  close  pre_close  change  pct_chg         vol       amount
0  000001.SZ   20180104  13.32  13.37  13.13  13.25      13.33   -0.08    -0.60  1854509.48  2454543.516
1  000001.SZ   20180103  13.73  13.86  13.20  13.33      13.70   -0.37    -2.70  2962498.38  4006220.766
2  000001.SZ   20180102  13.35  13.93  13.32  13.70      13.30    0.40     3.01  2081592.55  2856543.822
3  600000.SH   20180104  12.70  12.73  12.62  12.66      12.66    0.00     0.00   278838.04   353205.838
4  600000.SH   20180103  12.73  12.80  12.66  12.66      12.72   -0.06    -0.47   378391.01   480954.809
5  600000.SH   20180102  12.61  12.77  12.60  12.72      12.59    0.13     1.03   313230.53   398614.966
6  000002.SZ   20180104  32.76  33.53  32.10  33.12      32.33    0.79     2.44   529085.80  1740602.533
7  000002.SZ   20180103  32.50  33.78  32.23  32.33      32.56   -0.23    -0.71   646870.20  2130249.691
8  000002.SZ   20180102  31.45  32.99  31.45  32.56      31.06    1.50     4.83   683433.50  2218502.766

Next, we group the table with groupby, using the stock code column 'ts_code' as the key:

grouped = stock_data.groupby('ts_code')
print(grouped)

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000002B1AD25D4A8>

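Note that what gets printed is not a table but a GroupBy object, because we have not yet computed anything on the groups. In other words, only the first stage described above, the split, has been completed; an aggregation function still has to be called to finish the computation.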
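Although the GroupBy object does not print as a table, it can still be inspected. The sketch below uses a cut-down copy of the table (two columns and four rows, typed in by hand rather than fetched from Tushare) to show a few ways to look inside it.

```python
import pandas as pd

# A cut-down, hand-typed copy of the table above (two columns, for brevity)
stock_data = pd.DataFrame({
    'ts_code': ['000001.SZ', '000001.SZ', '600000.SH', '600000.SH'],
    'close': [13.25, 13.33, 12.66, 12.66],
})
grouped = stock_data.groupby('ts_code')

# Number of groups, and the row labels that belong to each one
print(grouped.ngroups)
print(grouped.groups)

# Iterating yields (key, sub-DataFrame) pairs
for code, frame in grouped:
    print(code, len(frame))

# Pull out a single group by its key
print(grouped.get_group('600000.SH'))
```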

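Aggregation functions

The commonly used aggregation functions are shown below, using the same table as above.

① Group by column 'ts_code' and use .mean() to compute the mean of the closing price column 'close' in each group.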

     ts_code trade_date   open   high    low  close  pre_close  change  pct_chg         vol       amount
0  000001.SZ   20180104  13.32  13.37  13.13  13.25      13.33   -0.08    -0.60  1854509.48  2454543.516
1  000001.SZ   20180103  13.73  13.86  13.20  13.33      13.70   -0.37    -2.70  2962498.38  4006220.766
2  000001.SZ   20180102  13.35  13.93  13.32  13.70      13.30    0.40     3.01  2081592.55  2856543.822
3  600000.SH   20180104  12.70  12.73  12.62  12.66      12.66    0.00     0.00   278838.04   353205.838
4  600000.SH   20180103  12.73  12.80  12.66  12.66      12.72   -0.06    -0.47   378391.01   480954.809
5  600000.SH   20180102  12.61  12.77  12.60  12.72      12.59    0.13     1.03   313230.53   398614.966
6  000002.SZ   20180104  32.76  33.53  32.10  33.12      32.33    0.79     2.44   529085.80  1740602.533
7  000002.SZ   20180103  32.50  33.78  32.23  32.33      32.56   -0.23    -0.71   646870.20  2130249.691
8  000002.SZ   20180102  31.45  32.99  31.45  32.56      31.06    1.50     4.83   683433.50  2218502.766

grouped = stock_data.groupby('ts_code')
print(grouped['close'].mean())

ts_code
000001.SZ    13.426667
000002.SZ    32.670000
600000.SH    12.680000
Name: close, dtype: float64

② Group by column 'ts_code' and use .sum() to compute the sum of the percentage change column 'pct_chg' in each group.

print(grouped['pct_chg'].sum())

ts_code
000001.SZ   -0.29
000002.SZ    6.56
600000.SH    0.56
Name: pct_chg, dtype: float64

③ Group by column 'ts_code' and use .count() to count the non-null values of the closing price column 'close' in each group.

print(grouped['close'].count())

ts_code
000001.SZ    3
000002.SZ    3
600000.SH    3
Name: close, dtype: int64

④ Group by column 'ts_code' and use .max() and .min() to compute the maximum and minimum of the closing price column 'close' in each group.

print(grouped['close'].max())
print(grouped['close'].min())

ts_code
000001.SZ    13.70
000002.SZ    33.12
600000.SH    12.72
Name: close, dtype: float64

ts_code
000001.SZ    13.25
000002.SZ    32.33
600000.SH    12.66
Name: close, dtype: float64

⑤ Group by column 'ts_code' and use .median() to compute the median of the closing price column 'close' in each group.

print(grouped['close'].median())

ts_code
000001.SZ    13.33
000002.SZ    32.56
600000.SH    12.66
Name: close, dtype: float64
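When several of these statistics are needed at once, a single call to .describe() on the grouped column may be enough. The sketch below assumes a hand-typed subset of the table above (two stocks, 'close' only).

```python
import pandas as pd

# Hand-typed subset of the table above
stock_data = pd.DataFrame({
    'ts_code': ['000001.SZ'] * 3 + ['600000.SH'] * 3,
    'close': [13.25, 13.33, 13.70, 12.66, 12.66, 12.72],
})

# describe() returns count, mean, std, min, quartiles and max per group
stats = stock_data.groupby('ts_code')['close'].describe()

# The '50%' column is the median
print(stats[['count', 'min', '50%', 'max']])
```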

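We can also group and aggregate with several keys. This example uses ['ts_code', 'trade_date'] as the keys, grouping by them in order from left to right, and then calls .count() to count the rows in each group.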

by_mult = stock_data.groupby(['ts_code', 'trade_date'])
print(by_mult['close'].count())

ts_code    trade_date
000001.SZ  20180102      1
           20180103      1
           20180104      1
000002.SZ  20180102      1
           20180103      1
           20180104      1
600000.SH  20180102      1
           20180103      1
           20180104      1
Name: close, dtype: int64
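A result grouped by several keys carries a two-level index. As a small sketch of how to work with it, the example below (using a hand-typed subset of the data and .mean() instead of .count()) selects one entry with a tuple of keys and then moves the date level into the columns with .unstack().

```python
import pandas as pd

# Hand-typed subset of the table above
stock_data = pd.DataFrame({
    'ts_code': ['000001.SZ', '000001.SZ', '600000.SH', '600000.SH'],
    'trade_date': ['20180102', '20180103', '20180102', '20180103'],
    'close': [13.70, 13.33, 12.72, 12.66],
})

by_mult = stock_data.groupby(['ts_code', 'trade_date'])['close'].mean()

# Select one entry of the two-level index with a tuple of keys
print(by_mult.loc[('000001.SZ', '20180103')])

# Move the inner level (trade_date) into the columns
wide = by_mult.unstack('trade_date')
print(wide)
```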

If you do not want the grouping keys to become the index, pass as_index=False to groupby.

by_mult = stock_data.groupby(['ts_code', 'trade_date'], as_index=False)
print(by_mult['close'].count())

     ts_code trade_date  close
0  000001.SZ   20180102      1
1  000001.SZ   20180103      1
2  000001.SZ   20180104      1
3  000002.SZ   20180102      1
4  000002.SZ   20180103      1
5  000002.SZ   20180104      1
6  600000.SH   20180102      1
7  600000.SH   20180103      1
8  600000.SH   20180104      1

To apply several aggregation functions at once, call the .agg() [2] method.

aggregated = grouped[['close']].agg(['max', 'median'])  # double brackets keep a DataFrame, giving the two-level header shown below
print(aggregated)

           close       
             max median
ts_code                
000001.SZ  13.70  13.33
000002.SZ  33.12  32.56
600000.SH  12.72  12.66

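Several aggregation functions can also be applied to several columns at once.

aggregated = grouped[['pre_close', 'close']].agg(['max', 'median'])  # select columns with a list; tuple-style selection was removed in pandas 2.0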
print(aggregated)

          pre_close         close       
                max median    max median
ts_code                                 
000001.SZ     13.70  13.33  13.70  13.33
000002.SZ     32.56  32.33  33.12  32.56
600000.SH     12.72  12.66  12.72  12.66

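Different aggregation functions can also be applied to different columns. Here we first define our own aggregation function, spread, which computes the difference between the maximum and minimum values, and then call .agg() with a dictionary that maps column names to functions.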

def spread(series):
    return series.max() - series.min()

aggregator = {'close': 'mean', 'vol': 'sum', 'pct_chg': spread}
aggregated = grouped.agg(aggregator)
print(aggregated)

               close         vol  pct_chg
ts_code                                  
000001.SZ  13.426667  6898600.41     5.71
000002.SZ  32.670000  1859389.50     5.54
600000.SH  12.680000   970459.58     1.50
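A dictionary allows only one output per column name. If you need several statistics of the same column under names of your own choosing, named aggregation (available in pandas 0.25 and later) is one option. The sketch below uses a hand-typed subset of the data; the output names `close_mean`, `close_max` and `pct_chg_spread` are made up for illustration.

```python
import pandas as pd

# Hand-typed subset of the table above
stock_data = pd.DataFrame({
    'ts_code': ['000001.SZ'] * 3 + ['600000.SH'] * 3,
    'close': [13.25, 13.33, 13.70, 12.66, 12.66, 12.72],
    'pct_chg': [-0.60, -2.70, 3.01, 0.00, -0.47, 1.03],
})

def spread(series):
    # Difference between the largest and smallest value in the group
    return series.max() - series.min()

# Named aggregation: output_name=(input_column, function)
summary = stock_data.groupby('ts_code').agg(
    close_mean=('close', 'mean'),
    close_max=('close', 'max'),
    pct_chg_spread=('pct_chg', spread),
)
print(summary)
```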

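Making good use of apply

By passing a custom function to apply [3], you can carry out more general split-apply-combine operations; the custom function can implement whatever you need. A few examples follow.

Fill NaN values with the group mean.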

     ts_code trade_date         vol
0  000001.SZ   20180102  2081592.55
1  000001.SZ   20180103  2962498.38
2  000001.SZ   20180104         NaN
3  600000.SH   20180102   313230.53
4  600000.SH   20180103   378391.01
5  600000.SH   20180104         NaN
6  000002.SZ   20180102   683433.50
7  000002.SZ   20180103   646870.20
8  000002.SZ   20180104         NaN

fill_mean = lambda g: g.fillna(g.mean(numeric_only=True))  # average numeric columns only; the string columns cannot be averaged
stock_data = stock_data.groupby('ts_code', as_index=False, group_keys=False).apply(fill_mean)
print(stock_data)

    ts_code trade_date          vol
0  000001.SZ   20180102  2081592.550
1  000001.SZ   20180103  2962498.380
2  000001.SZ   20180104  2522045.465
6  000002.SZ   20180102   683433.500
7  000002.SZ   20180103   646870.200
8  000002.SZ   20180104   665151.850
3  600000.SH   20180102   313230.530
4  600000.SH   20180103   378391.010
5  600000.SH   20180104   345810.770
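Note that the apply version returns the rows regrouped by stock code. If the original row order matters, one alternative (swapped in here, not the approach used above) is .transform('mean'), which returns the group mean aligned with the original index. A minimal sketch on hand-typed data:

```python
import pandas as pd
import numpy as np

# Hand-typed subset of the table above, with the last day of each stock missing
stock_data = pd.DataFrame({
    'ts_code': ['000001.SZ'] * 3 + ['600000.SH'] * 3,
    'trade_date': ['20180102', '20180103', '20180104'] * 2,
    'vol': [2081592.55, 2962498.38, np.nan, 313230.53, 378391.01, np.nan],
})

# transform('mean') repeats each group's mean for every row of that group,
# aligned with the original index, so row order is unchanged
group_mean = stock_data.groupby('ts_code')['vol'].transform('mean')
stock_data['vol'] = stock_data['vol'].fillna(group_mean)
print(stock_data)
```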

Select the row in each group that has the largest value in a given column.

    ts_code trade_date         vol
0  000001.SZ   20180104  1854509.48
1  000001.SZ   20180103  2962498.38
2  000001.SZ   20180102  2081592.55
3  600000.SH   20180104   278838.04
4  600000.SH   20180103   378391.01
5  600000.SH   20180102   313230.53
6  000002.SZ   20180104   529085.80
7  000002.SZ   20180103   646870.20
8  000002.SZ   20180102   683433.50

def top(df, column='vol'):
    return df.sort_values(by=column)[-1:]

stock_data = stock_data.groupby('ts_code',  as_index=False, group_keys=False).apply(top)
print(stock_data)

     ts_code trade_date         vol
1  000001.SZ   20180103  2962498.38
8  000002.SZ   20180102   683433.50
4  600000.SH   20180103   378391.01
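The same rows can also be picked without sorting each group. The sketch below is an alternative to the `top` function, not a replacement for it: .idxmax() finds the row label of the largest 'vol' per group, and .loc fetches those rows. The data is a hand-typed subset of the table above.

```python
import pandas as pd

# Hand-typed subset of the table above
stock_data = pd.DataFrame({
    'ts_code': ['000001.SZ'] * 3 + ['600000.SH'] * 3,
    'trade_date': ['20180104', '20180103', '20180102'] * 2,
    'vol': [1854509.48, 2962498.38, 2081592.55,
            278838.04, 378391.01, 313230.53],
})

# idxmax gives, per group, the row label of the largest 'vol'
top_labels = stock_data.groupby('ts_code')['vol'].idxmax()

# Use those labels to fetch the full rows
top_rows = stock_data.loc[top_labels]
print(top_rows)
```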

Normalize data within each group.

    ts_code trade_date  close
0  000001.SZ   20180102  13.70
1  000001.SZ   20180103  13.33
2  000001.SZ   20180104  13.25
3  000001.SZ   20180105  13.30
4  600000.SH   20180102  12.72
5  600000.SH   20180103  12.66
6  600000.SH   20180104  12.66
7  600000.SH   20180105  12.69

min_max_tr = lambda x: (x - x.min()) / (x.max() - x.min())
stock_data['close_normalised'] = stock_data.groupby(['ts_code'], group_keys=False)['close'].apply(min_max_tr)  # group_keys=False keeps the original index so the result aligns on assignment
print(stock_data)

     ts_code trade_date  close  close_normalised
0  000001.SZ   20180102  13.70          1.000000
1  000001.SZ   20180103  13.33          0.177778
2  000001.SZ   20180104  13.25          0.000000
3  000001.SZ   20180105  13.30          0.111111
4  600000.SH   20180102  12.72          1.000000
5  600000.SH   20180103  12.66          0.000000
6  600000.SH   20180104  12.66          0.000000
7  600000.SH   20180105  12.69          0.500000
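Min-max scaling is only one way to normalize. As a variation that is not part of the example above, the sketch below computes a per-group z-score with .transform(), which always returns a result aligned with the original rows. The helper name `z_score` and the column name `close_z` are made up for illustration.

```python
import pandas as pd

# Hand-typed copy of the 'close' data used above
stock_data = pd.DataFrame({
    'ts_code': ['000001.SZ'] * 4 + ['600000.SH'] * 4,
    'close': [13.70, 13.33, 13.25, 13.30, 12.72, 12.66, 12.66, 12.69],
})

def z_score(x):
    # Centre on the group mean and scale by the group's sample standard deviation
    return (x - x.mean()) / x.std()

# transform keeps the original index, so the result can be assigned directly
stock_data['close_z'] = stock_data.groupby('ts_code')['close'].transform(z_score)
print(stock_data)
```

After this step each stock's `close_z` values should have mean 0 and sample standard deviation 1, up to floating-point error.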

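Summary

This article showed how to use the groupby functionality in Pandas to group and aggregate datasets flexibly and efficiently. The underlying principle is the split-apply-combine process.

It first covered several common aggregation functions: .mean(), .sum(), .count(), .max(), .min() and .median(). It then covered some more involved operations: grouping by several keys, and calling .agg() to apply several aggregation functions to several columns at once or different functions to different columns.

Finally, a few examples showed what apply can do in grouped operations. Links to the relevant official documentation are attached below; if you are interested, look through all the available parameters to unlock more features!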

END