Hadoop unit testing: using and enhancing MRUnit

風自向前 2011-06-22

1 Preface

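Locating a problem in a Hadoop MapReduce job after it has been submitted to a cluster is tedious. Tracking down even a very small bug can mean modifying the code and adding log output over and over, and with large data volumes each debugging round takes a long time. Good unit tests are therefore worth having to eliminate obvious bugs early (unit tests alone are not enough, of course, since the test environment still differs from the cluster's runtime environment).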

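Unit testing MapReduce has some obstacles, however. Some of the parameter objects passed to Map and Reduce, such as OutputCollector, Reporter and InputSplit, are supplied by the Hadoop framework at run time, so they have to be mocked. When I first wrote MapReduce unit tests I wrote a few simple mocks myself, and they basically met my needs; later I found MRUnit easier to use than my own mocks, so I studied it and adopted it. MRUnit is a unit testing framework written specifically for Hadoop MapReduce, with a concise and practical API. It does have weak spots, though: for example, it does not support MultipleOutputs (which we often use to write multiple output files; a later section describes how to enhance MRUnit to support it).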
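As a rough illustration of the kind of hand-written mock mentioned above, a list-backed collector could look like the sketch below. The `Collector` interface is a hypothetical stand-in for Hadoop's `OutputCollector`, used only so the example compiles without Hadoop on the classpath; `RecordingCollector` and `emitLength` are likewise invented names and are not the mocks the author actually wrote.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MockCollectorSketch {
    // Toy "map" logic under test (hypothetical): emits the line keyed by its length.
    static void emitLength(String line, Collector<Integer, String> out) {
        out.collect(line.length(), line);
    }

    public static void main(String[] args) {
        RecordingCollector<Integer, String> out = new RecordingCollector<>();
        emitLength("abc", out);
        // The test can now inspect what the code under test emitted.
        System.out.println(out.getRecords());
    }
}

// Hypothetical stand-in for Hadoop's OutputCollector (not the real interface).
interface Collector<K, V> {
    void collect(K key, V value);
}

// A mock that simply records every key/value pair it is given, in order.
class RecordingCollector<K, V> implements Collector<K, V> {
    private final List<Map.Entry<K, V>> records = new ArrayList<>();

    @Override
    public void collect(K key, V value) {
        records.add(new SimpleEntry<>(key, value));
    }

    List<Map.Entry<K, V>> getRecords() {
        return records;
    }
}
```

The idea is only that the object normally supplied by the framework is replaced by one the test controls and can inspect afterwards.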

2 MRUnit

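MRUnit provides a different driver for each kind of test target:

- MapDriver, for testing a Map on its own.

- ReduceDriver, for testing a Reduce on its own.

- MapReduceDriver, for testing a Map and a Reduce chained together.

- PipelineMapReduceDriver, for testing several Map-Reduce pairs in sequence.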

MapDriver

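An example of testing a Map on its own. Suppose we want to compute a seller's average shipping speed. The Map collects the time interval of each shipment. The test for the Map: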

    // The Map under test
    private Map mapper;
    private MapDriver<LongWritable, Text, Text, TimeInfo> mapDriver;

    @Before
    public void setUp() {
        mapper = new Map();
        mapDriver = new MapDriver<LongWritable, Text, Text, TimeInfo>();
    }

    @Test
    public void testMap_timeFormat2() {
        String sellerId = "444";
        // Simulate one input line (withInput). Assume this line tells us that
        // one shipment by the seller (sellerId) took 10 hours.
        // We expect the output key to be sellerId and the value to be a
        // TimeInfo object representing 1 shipment of 10 hours (withOutput).
        // The test passes if the Map produces the expected result for this input.
        Text mapInputValue = new Text("……");
        mapDriver.withMapper(mapper)
                .withInput(null, mapInputValue)
                .withOutput(new Text(sellerId), new TimeInfo(1, 10))
                .runTest();
    }

ReduceDriver

Testing the Reduce on its own, with the same example. The Reduce computes the average time from the totals of n time intervals output by the Map or Combiner.

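    private Reduce reducer;
    // Declaration added; the original snippet used reduceDriver without declaring it
    private ReduceDriver<Text, TimeInfo, Text, LongWritable> reduceDriver;

    @Before
    public void setUp() {
        reducer = new Reduce();
        reduceDriver = new ReduceDriver<Text, TimeInfo, Text, LongWritable>(reducer);
    }

    @Test
    public void testReduce() {
        List<TimeInfo> values = new ArrayList<TimeInfo>();
        values.add(new TimeInfo(1, 3)); // 1 shipment, 3 hours
        values.add(new TimeInfo(2, 5)); // 2 shipments, 5 hours in total
        values.add(new TimeInfo(3, 7)); // 3 shipments, 7 hours in total
        // values is the reduce input for seller 444;
        // we expect the computed average to be 2 hours
        reduceDriver.withReducer(reducer)
                .withInput(new Text("444"), values)
                .withOutput(new Text("444"), new LongWritable(2))
                .runTest();
    }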
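The expected value of 2 in the test above can be checked by hand. The sketch below reproduces the arithmetic in plain Java, assuming that the Reduce sums the counts and the total hours and then divides with integer (long) division. The article does not show the Reduce implementation, so that is an assumption, and the `TimeInfo` class here is a hypothetical stand-in with invented field names.

```java
import java.util.Arrays;
import java.util.List;

class AverageSketch {
    // Hypothetical stand-in for the article's TimeInfo:
    // a number of shipments and the total hours they took.
    static final class TimeInfo {
        final long count;
        final long totalHours;

        TimeInfo(long count, long totalHours) {
            this.count = count;
            this.totalHours = totalHours;
        }
    }

    // Sums counts and hours, then divides. Integer division is assumed,
    // since the test expects a LongWritable. An empty list is not handled.
    static long averageHours(List<TimeInfo> values) {
        long count = 0;
        long total = 0;
        for (TimeInfo v : values) {
            count += v.count;
            total += v.totalHours;
        }
        return total / count;
    }

    public static void main(String[] args) {
        List<TimeInfo> values = Arrays.asList(
                new TimeInfo(1, 3), new TimeInfo(2, 5), new TimeInfo(3, 7));
        // 15 hours over 6 shipments: 15 / 6 truncates to 2
        System.out.println(averageHours(values));
    }
}
```

Under that assumption the three inputs give 15 hours over 6 shipments, and 15 / 6 truncates to 2, which matches the `LongWritable(2)` the test expects.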

MapReduceDriver

The following example tests the Map and the Reduce together:

    private MapReduceDriver<LongWritable, Text, Text, TimeInfo, Text, LongWritable> mrDriver;
    private Map mapper;
    private Reduce reducer;

    @Before
    public void setUp() {
        mapper = new Map();
        reducer = new Reduce();
        mrDriver = new MapReduceDriver<LongWritable, Text, Text, TimeInfo, Text, LongWritable>(mapper, reducer);
    }

    @Test
    public void testMapReduce_3record_1user() {
        Text mapInputValue1 = new Text("……");
        Text mapInputValue2 = new Text("……");
        Text mapInputValue3 = new Text("……");
        // From the three Map inputs above, we expect the reduce output
        // to give an average time of 2 hours for seller 444.
        mrDriver.withInput(null, mapInputValue1)
                .withInput(null, mapInputValue2)
                .withInput(null, mapInputValue3)
                .withOutput(new Text("444"), new LongWritable(2))
                .runTest();
    }

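3 Enhancing MRUnit

This section describes several features added to the MRUnit framework to make it more convenient to use: support for MultipleOutputs, loading data sets from files, and automatic wiring.

How to support MultipleOutputs

In many scenarios we need MultipleOutputs to write the reduce output to multiple files, which MRUnit does not support. After analyzing the source code, I extended MRUnit with two drivers, ReduceMultipleOutputsDriver and MapReduceMultipleOutputDriver, to support MultipleOutputs.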

ReduceMultipleOutputsDriver

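ReduceMultipleOutputsDriver is an enhanced version of ReduceDriver. If the Reduce in the earlier example used MultipleOutputs for its output, the Reduce test would fail.

The test case above, reworked to use ReduceMultipleOutputsDriver (the changed parts are marked with comments):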

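    private Reduce reducer;
    // Declaration added; the original snippet used reduceDriver without declaring it
    private ReduceMultipleOutputsDriver<Text, TimeInfo, Text, LongWritable> reduceDriver;

    @Before
    public void setUp() {
        reducer = new Reduce();
        // Note: ReduceMultipleOutputsDriver is used here instead of ReduceDriver
        reduceDriver = new ReduceMultipleOutputsDriver<Text, TimeInfo, Text, LongWritable>(reducer);
    }

    @Test
    public void testReduce() {
        List<TimeInfo> values = new ArrayList<TimeInfo>();
        values.add(new TimeInfo(1, 3)); // 1 shipment, 3 hours
        values.add(new TimeInfo(2, 5)); // 2 shipments, 5 hours in total
        values.add(new TimeInfo(3, 7)); // 3 shipments, 7 hours in total
        // values is the reduce input for seller 444;
        // we expect the computed average to be 2 hours
        reduceDriver.withReducer(reducer)
                .withInput(new Text("444"), values)
                // Note:
                // assume output files are chosen by id (444) % 8;
                // this expects the collector named "somePrefix" + 444 % 8
                // to receive the given record
                .withMutiOutput("somePrefix" + 444 % 8, new Text("444"), new LongWritable(2))
                .runTest();
    }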
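To make the naming in the test above concrete, the sketch below shows how the collector name evaluates and how records could be grouped per named output. It is a plain-Java illustration only: `NamedOutputSketch`, `outputName`, `collect` and `recordsFor` are hypothetical names, and this is not the internals of ReduceMultipleOutputsDriver, which the article does not show.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NamedOutputSketch {
    // Records captured per named output (hypothetical bookkeeping).
    private final Map<String, List<String>> outputs = new HashMap<>();

    // Builds the name the same way the test does: prefix + (id % 8).
    // In Java, % binds tighter than +, so "somePrefix" + 444 % 8 means
    // "somePrefix" + (444 % 8).
    static String outputName(String prefix, long sellerId) {
        return prefix + (sellerId % 8);
    }

    // Stores one key/value record under the given output name.
    void collect(String name, String key, long value) {
        outputs.computeIfAbsent(name, n -> new ArrayList<>()).add(key + "\t" + value);
    }

    // Returns the records captured for a name, or an empty list if none.
    List<String> recordsFor(String name) {
        return outputs.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        NamedOutputSketch sketch = new NamedOutputSketch();
        String name = outputName("somePrefix", 444); // 444 % 8 == 4
        sketch.collect(name, "444", 2);
        System.out.println(name + " -> " + sketch.recordsFor(name));
    }
}
```

Since 444 % 8 is 4, the test expects the record to arrive at the output named `somePrefix4` and at no other.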

MapReduceMultipleOutputDriver

Like ReduceMultipleOutputsDriver, MapReduceMultipleOutputDriver supports joint Map-Reduce tests that use MultipleOutputs. The example from the MapReduceDriver section becomes:

    private MapReduceMultipleOutputDriver<LongWritable, Text, Text, TimeInfo, Text, LongWritable> mrDriver;
    private Map mapper;
    private Reduce reducer;

    @Before
    public void setUp() {
        mapper = new Map();
        reducer = new Reduce();
        // MapReduceMultipleOutputDriver is used here instead of MapReduceDriver
        mrDriver = new MapReduceMultipleOutputDriver<LongWritable, Text, Text, TimeInfo, Text, LongWritable>(mapper, reducer);
    }

    @Test
    public void testMapReduce_3record_1user() {
        Text mapInputValue1 = new Text("……");
        Text mapInputValue2 = new Text("……");
        Text mapInputValue3 = new Text("……");
        // From the three Map inputs above, we expect the reduce output
        // to give an average time of 2 hours for seller 444.
        mrDriver.withInput(null, mapInputValue1)
                .withInput(null, mapInputValue2)
                .withInput(null, mapInputValue3)
                // expects the collector named "somePrefix" + 444 % 8
                // to receive the given record
                .withMutiOutput("somePrefix" + 444 % 8, new Text("444"), new LongWritable(2))
                .runTest();
    }

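How to load input from a file

As the examples above show, using MRUnit means writing a lot of similar code over and over and embedding the input data in the code, which is not very elegant; loading the data from a file would be much more convenient. I therefore enhanced MRUnit with annotations and an extended JUnit runner to solve this problem.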
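The annotation half of that idea can be sketched with plain reflection. The example below is only an illustration under stated assumptions: `InputFile`, `SampleTest` and `inputFileFor` are hypothetical names, the file name is made up, and the author's actual annotations and runner are not shown in the article. The sketch stops at resolving the file name; reading the file and feeding a driver are left out.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

class AnnotationLoadingSketch {
    // Hypothetical annotation naming the file that holds a test's input lines.
    // RUNTIME retention is what makes it visible to reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface InputFile {
        String value();
    }

    // A stand-in test class with one annotated and one plain method.
    static class SampleTest {
        @InputFile("sample-input.txt")
        public void testWithInput() { }

        public void testWithoutInput() { }
    }

    // What a custom runner might do before invoking a test method:
    // look up the annotation and return the file name, or null if absent.
    static String inputFileFor(Class<?> testClass, String methodName) {
        try {
            Method method = testClass.getMethod(methodName);
            InputFile annotation = method.getAnnotation(InputFile.class);
            return annotation == null ? null : annotation.value();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such test method: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(inputFileFor(SampleTest.class, "testWithInput"));
        System.out.println(inputFileFor(SampleTest.class, "testWithoutInput"));
    }
}
```

A runner built this way keeps the test method down to its assertions, because locating and loading the input happens before the method is invoked.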

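Reworking the earlier Map-Reduce test so that the map input is loaded automatically from a file, which also eliminates much of the code that calls the MRUnit API: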

    @RunWith(MRUnitJunit4TestClassRunner.class)
    public class XXXMRUseAnnotationTest {

        // Initialize mrDriver automatically and load data if needed
        @MapInputSet
        @MapReduce(mapper = Map.class, reducer = Reduce.class)
        private MapReduceDriver<LongWritable, Text, Text, TimeInfo, Text, LongWritable> mrDriver;

        @Test
        @MapInputSet("ConsignTimeMRUseAnnotationTest.txt") // input data is loaded from this file
        public void testMapReduce_3record_1user() {
            // Only the verification code needs to be written
            mrDriver.withMutiOutput("somePrefix" + 444 % 8, new Text("444"), new LongWritable(2))