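1. Concepts
select, poll, and epoll are all event-notification mechanisms: the caller waits for events, and handling is triggered as soon as one of them occurs. They are used for I/O multiplexing.

2. Understanding through a simple example

3. The select function
3.1 Function details
    int select(int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, struct timeval *timeout);
    // Returns: the number of ready descriptors, 0 on timeout, -1 on error

1) The first argument, maxfdp1, specifies the number of descriptors to test. Its value is the highest descriptor to be tested plus 1 (hence the name maxfdp1). Descriptors 0, 1, 2, ..., maxfdp1-1 all fall within the tested range (the kernel scans the whole range, even if you are not interested in some descriptors in between).

2) The three middle arguments, readset, writeset, and exceptset, specify the descriptors that the kernel should test for read, write, and exception conditions. If you are not interested in one of the conditions, pass a null pointer for it. An fd_set holds descriptors; it is an array of long used as a bitmap, and it is manipulated through the following four macros:
    void FD_ZERO(fd_set *fdset);          // clear the set
    void FD_SET(int fd, fd_set *fdset);   // add the given descriptor to the set
    void FD_CLR(int fd, fd_set *fdset);   // remove the given descriptor from the set
    int  FD_ISSET(int fd, fd_set *fdset); // test whether the given descriptor is set (after select returns, this means it is ready)

3) timeout tells the kernel how long to wait for any one of the specified descriptors to become ready. The timeval structure gives this time in seconds and microseconds:
    struct timeval {
        long tv_sec;   // seconds
        long tv_usec;  // microseconds
    };
On Linux, select may modify this structure to reflect the time that was not slept, so reinitialize it before every call.
There are three possibilities for this argument:
① Wait forever: return only when a descriptor is ready for I/O. To do this, pass a null pointer, NULL (I return only once you are ready).
② Wait up to a fixed amount of time: return when a descriptor is ready for I/O, but wait no longer than the seconds and microseconds given in the timeval structure (I return when the time is up at the latest).
③ Do not wait at all: return immediately after checking the descriptors. This is called polling. To do this, the argument must point to a timeval structure whose timer value is 0 (I keep checking whether you are ready, and return either way).

3.2 Implementation process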
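As the original figure illustrates, select loops over steps 1 to 7:
1) copy_from_user copies the fd_set (the descriptor set) into the kernel.
2) A callback function, __pollwait, is registered; each file's poll method reaches it through poll_wait().
3) All descriptors are traversed and the poll method of each one is called (for a socket this poll method is sock_poll, which calls tcp_poll, udp_poll, or datagram_poll as appropriate). The main job of the poll method is to put the current process on the wait queue of the device behind the fd; when the fd becomes readable or writable, the processes sleeping on that wait queue are woken up. The poll method returns a mask describing whether reading or writing is ready, and this mask is used to fill in the fd_set.
4) After the traversal, if any mask reports readiness, jump to step 7.
5) Otherwise, schedule_timeout is called to put the current process to sleep.
6) If an fd becomes readable or writable during the sleep, or the sleep time expires, the current process is woken up, gets the CPU, and jumps back to step 3.
7) copy_to_user copies the fd_set from the kernel back to user space.
Finally, the process checks the fd_set in user space, finds the ready fds, and performs I/O on them.

3.3 Drawbacks
1) The number of descriptors select can monitor is small. On Linux the default is 1024, fixed by the macro FD_SETSIZE.
2) Every call to select copies the whole fd set from user space to the kernel, and copies it back on return, which has a cost.
3) Every time the current process is woken up it has to traverse all the fds (that is, poll them), which is inefficient.

3.4 Example
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* readline() and writen() are helper functions defined elsewhere;
 * they are not part of the standard library and are not shown here. */

int max(int a, int b)
{
    return (a >= b ? a : b);
}

void str_cli(FILE *fp, int sockfd)
{
    int maxfdpl;
    fd_set rset;
    char sendline[4096], recvline[4096];

    FD_ZERO(&rset);
    for (;;) {
        /* select overwrites rset, so both descriptors are set again on every pass */
        FD_SET(fileno(fp), &rset);
        FD_SET(sockfd, &rset);
        maxfdpl = max(fileno(fp), sockfd) + 1;
        if (select(maxfdpl, &rset, NULL, NULL, NULL) < 0) {
            perror("select");
            exit(1);
        }
        if (FD_ISSET(sockfd, &rset)) {        /* socket is readable */
            if (readline(sockfd, recvline, 4096) == 0) {
                printf("str_cli: server terminated prematurely\n");
                exit(1);
            }
            fputs(recvline, stdout);
        }
        if (FD_ISSET(fileno(fp), &rset)) {    /* input is readable */
            if (fgets(sendline, 4096, fp) == NULL)
                return;
            writen(sockfd, sendline, strlen(sendline));
        }
    }
}

4. The poll function
4.1 Function details
    #include <poll.h>
    int poll(struct pollfd fds[], nfds_t nfds, int timeout);

1) poll uses an array of structures, fds, to hold the socket descriptors. Each element is a pollfd structure:
    struct pollfd {
        int fd;         // the file descriptor
        short events;   // the events requested
        short revents;  // the events returned after checking; when the state of an fd changes, revents is non-zero
    };
To speed up processing and improve system performance, poll represents all the struct pollfd entries of fds as a linked list of struct poll_list inside the kernel, that is, the kernel keeps the descriptors in a linked list:
    struct poll_list {
        struct poll_list *next;
        int len;
        struct pollfd entries[0];
    };

2) Parameters
fds: holds the socket descriptors whose state is to be checked. Unlike select (which overwrites its descriptor sets on return, so they must be rebuilt before every call), poll does not clear this array after a call. It only updates the revents field of the entries whose state changed, which is more convenient to work with.

3) Return value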
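poll reports its result the same way select does: a positive return value is the number of array entries whose revents is non-zero, 0 means the timeout expired, and -1 means an error (with errno set). The following is a minimal sketch of that convention, not code from the original article. The names wait_readable and poll_pipe_demo are hypothetical, and a pipe stands in for a socket so that the example needs no network setup.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns poll()'s own result: >0 ready entries, 0 timeout, -1 error. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = POLLIN;   /* what we ask the kernel to check */
    pfd.revents = 0;       /* filled in by the kernel on return */
    return poll(&pfd, 1, timeout_ms);
}

/* Hypothetical demo: poll the read end of a pipe before and after one
 * byte is written to it. Returns 0 on success, -1 if setup failed. */
int poll_pipe_demo(int *before, int *after)
{
    int p[2];

    if (pipe(p) < 0)
        return -1;
    *before = wait_readable(p[0], 0);      /* nothing written yet */
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    *after = wait_readable(p[0], 1000);    /* one byte is pending */
    close(p[0]);
    close(p[1]);
    return 0;
}
```

With an empty pipe and a timeout of 0 the call times out at once; after the write, exactly one entry is reported as ready.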
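4.2 Implementation process
The implementation of poll is much the same as that of select.

4.3 Advantages
1) poll has no limit on the maximum number of descriptors. The size of the struct pollfd array fds can be chosen to suit our needs (although performance still drops once the number grows too large).

4.4 Drawbacks
The same as the last two drawbacks of select: the whole descriptor array is copied between user space and the kernel on every call, and every wake-up traverses all the fds.

5. The epoll function
epoll is the Linux improvement on select/poll.

5.1 Function details
epoll involves three functions, as follows: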
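    int epoll_create(int size);
    // size: tells the kernel how many descriptors will be monitored. It must be greater than 0, otherwise the call fails with EINVAL. It is only a hint for the kernel's initial allocation of its internal data structures; judging from the source code, size is not actually used at all.

1) In the kernel, everything is a file. When the kernel is initialized (at system boot), epoll registers a file system, that is, it sets up its own kernel cache for storing the sockets to be monitored. These sockets are kept in the kernel cache as a red-black tree to support fast lookup, insertion, and deletion.
2) When epoll_create is called, a red-black tree is created in the epoll file system (to store the descriptors later passed in by epoll_ctl), together with a ready list (to store the descriptors that are ready).
3) Note: the epoll handle itself occupies an fd (on Linux it can be seen under /proc/<pid>/fd/), so once you are done with epoll you must call close() on it, otherwise fds may eventually be exhausted.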
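The two user-visible points above (size must be positive, and the epoll handle is an ordinary fd that has to be closed) can be checked with a short sketch. This is an illustration under the assumption of a Linux system; the function names open_and_close_epoll and epoll_create_rejects_zero are hypothetical.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create an epoll instance and release it again.
 * Returns 0 on success, -1 on failure. */
int open_and_close_epoll(void)
{
    int epfd = epoll_create(1);   /* any size > 0; the value is only a hint */

    if (epfd < 0)
        return -1;
    return close(epfd);           /* the handle is an fd and must be closed */
}

/* Hypothetical helper: returns 1 if epoll_create(0) fails with EINVAL. */
int epoll_create_rejects_zero(void)
{
    int epfd;

    errno = 0;
    epfd = epoll_create(0);
    if (epfd >= 0) {              /* unexpected success: do not leak the fd */
        close(epfd);
        return 0;
    }
    return errno == EINVAL;
}
```

New code would normally call epoll_create1(0) instead, which drops the unused size argument.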
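    int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
    /*
     * epfd:  the return value of epoll_create().
     * op:    the operation, given by one of three macros: EPOLL_CTL_ADD, EPOLL_CTL_DEL,
     *        and EPOLL_CTL_MOD, which add, delete, and modify the events monitored on fd.
     * fd:    the fd (file descriptor) to monitor.
     * event: tells the kernel which events to monitor; ET mode is also set in this structure.
     */

1) copy_from_user is called to copy the epoll_event structure into kernel space. (Many blog posts claim that epoll uses shared memory. This is completely wrong: read the source and you will find that no shared-memory API is used anywhere.)
2) The socket fd to be monitored is added to the red-black tree (it can also be removed or modified; on an add, if the fd is already present the call returns immediately, otherwise it is inserted into the tree). During insertion a callback function, ep_poll_callback, is registered for this socket. When the socket becomes ready, this callback is executed immediately (instead of the wake-up function default_wake_function used by select/poll).
3) What the callback ep_poll_callback does: it puts the ready fd on the ready list and then wakes up the current process.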
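The add/modify/delete behaviour described above is visible from user space through the error codes of epoll_ctl. The sketch below is an illustration, not part of the original article: epoll_ctl_demo is a hypothetical name, and a pipe is used in place of a socket. It assumes Linux, where a second EPOLL_CTL_ADD of the same fd fails with EEXIST and deleting an fd that is not registered fails with ENOENT.

```c
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if ADD, duplicate ADD, MOD, DEL, and
 * repeated DEL all behave as described, -1 otherwise. */
int epoll_ctl_demo(void)
{
    int p[2], ok = 0;
    struct epoll_event ev;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(p) < 0) {
        close(epfd);
        return -1;
    }

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;          /* monitor readability, LT mode by default */
    ev.data.fd = p[0];

    /* First add inserts the fd into the tree. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) != 0)
        goto out;
    /* Adding the same fd again is rejected. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) != -1 || errno != EEXIST)
        goto out;
    /* ET mode is requested through the same structure. */
    ev.events = EPOLLIN | EPOLLET;
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, p[0], &ev) != 0)
        goto out;
    /* Delete, then delete again: the second attempt finds nothing. */
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], NULL) != 0)
        goto out;
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], NULL) != -1 || errno != ENOENT)
        goto out;
    ok = 1;
out:
    close(p[0]);
    close(p[1]);
    close(epfd);
    return ok ? 0 : -1;
}
```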
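    int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

epoll_wait loops over steps 1 to 6:
1) epoll_wait checks whether the ready list is empty.
2) If it is not empty, jump to step 6.
3) If it is empty, schedule_timeout is called to put the current process to sleep.
4) If an fd becomes ready during the sleep, the callback ep_poll_callback is invoked for it; the callback puts the ready fd on the ready list and wakes up the current process, which then jumps to step 1.
5) If instead the sleep time expires, also jump to step 1.
6) __put_user copies the ready fds to user space.

5.2 The two modes of epoll
5.2.1 Level-triggered mode (LT)
1) LT is the default working mode of epoll. It supports both blocking and non-blocking sockets.
2) The traditional select and poll both work this way.
3) How it works: when an fd becomes ready, the callback puts it on the ready list. A call to epoll_wait then copies this ready fd to user space and clears the ready list. As a last step, epoll_wait checks the fd again: if it has in fact not been fully handled, the fd is put back on the ready list that was just cleared, so the next epoll_wait returns it again.

5.2.2 Edge-triggered mode (ET)
1) The difference between the two: in LT mode, as long as a socket is readable or writable, every epoll_wait returns it, whenever it is called. In ET mode, epoll_wait returns a socket only when the fd changes from unreadable to readable or from unwritable to writable (analogous to the edge between a low and a high signal level).
2) Because of this difference, the only correct way to read and write in ET mode is:
Read: once readable, keep reading until the buffer has been drained.
Write: once writable, keep writing until the buffer is full.
Why? Because epoll_wait reports the fd only on the transition, any data left unread (or buffer space left unused) will not cause another notification.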
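// Read
if (events[i].events & EPOLLIN) {
    n = 0;
    // Keep reading until the socket is drained: a non-blocking read() then
    // returns -1 with errno == EAGAIN (a return of 0 means the peer closed).
    // Assumes buf is large enough to hold everything that arrives.
    while ((nread = read(fd, buf + n, BUFSIZ - 1)) > 0) {
        n += nread;
    }
    if (nread == -1 && errno != EAGAIN) {
        perror("read error");
    }
    ev.data.fd = fd;
    ev.events = events[i].events | EPOLLOUT;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
// Write
if (events[i].events & EPOLLOUT) {
    int nwrite, data_size = strlen(buf);
    n = data_size;
    while (n > 0) {  // keep writing until everything is sent or the send buffer is full
        nwrite = write(fd, buf + data_size - n, n);
        if (nwrite == -1) {
            if (errno != EAGAIN) {
                perror("write error");
            }
            break;  // EAGAIN: the send buffer is full
        }
        n -= nwrite;
    }
    ev.data.fd = fd;
    ev.events = EPOLLIN | EPOLLET;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);  // switch the events handled on this fd back to EPOLLIN
}

3) This way of reading and writing is why ET mode should only be used with non-blocking sockets. With a blocking socket there is a problem: an edge-triggered program usually reads the socket in a loop until the data is exhausted, and once there is nothing left to read a blocking socket simply blocks in read(). The process is then no longer blocked in epoll_wait, and all other sockets starve.
4) LT mode returns a readable socket every time, whereas ET mode returns it only when the edge condition is met, which reduces repeated epoll system calls and can therefore be more efficient than LT. It is, however, more demanding to program: every event has to be handled carefully, otherwise events are easily lost.

5.3 Advantages
1) The number of descriptors epoll can monitor is very large. The upper limit is the maximum number of files that all processes in the system can have open; the actual number can be read with cat /proc/sys/fs/file-max (98875 on Ubuntu 14.04).
2) select/poll copy the whole fd set between user space and the kernel on every call, while epoll only copies the ready fds when it returns, which reduces the copying cost.
3) select/poll and epoll all alternate between sleeping and waking, but select/poll traverse the whole fd set while "awake", whereas epoll only has to check whether the ready list is empty, which improves efficiency considerably.

5.4 epoll source code
/*
 * Before going deeper into the implementation of epoll, let us look at three aspects of the kernel.
 * 1. Wait queues (waitqueue)
 *    A brief explanation of wait queues:
 *    the queue head (wait_queue_head_t) is usually the producer of a resource,
 *    the queue members (wait_queue_t) are usually the consumers of the resource.
 *    When the resource of the head becomes ready, the callback specified by each member
 *    is executed one by one to tell them that the resource is ready.
 *    That is roughly what a wait queue is.
 * 2. The kernel poll mechanism
 *    An fd that is polled must support the kernel's poll technique in its implementation.
 *    For example, if the fd is a character device or a socket, it must implement
 *    the poll operation in file_operations and allocate a wait queue head for itself.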
 *    A process that actively polls the fd must allocate a wait queue member, add it to
 *    the wait queue of the fd, and specify the callback to run when the resource is ready.
 *    Take a socket as an example: it must implement a poll operation, and this poll is what
 *    the code that starts the polling must call itself. That function must call poll_wait(),
 *    and poll_wait adds the caller as a wait queue member to the wait queue of the socket.
 *    When the state of the socket changes, it can then notify every interested process
 *    one by one through the queue head.
 *    This point must be understood clearly, otherwise it is hard to see how epoll
 *    learns that the state of an fd has changed.
 * 3. An epollfd is itself an fd, so it can itself be monitored by epoll.
 *    One may guess whether epoll can be nested without limit...
 *
 * epoll is essentially built from points 1 and 2 above.
 * So epoll does not introduce any particularly complex or sophisticated technique into the kernel;
 * it is a recombination of existing functionality that outperforms select.
 */
/*
 * Other related kernel knowledge:
 * 1. An fd is, as we know, a file descriptor. In kernel space its counterpart is struct file,
 *    which can be regarded as the kernel-side file descriptor.
 * 2. spinlock: a lock that must be used with great care,
 *    especially when calling spin_lock_irqsave(): interrupts are disabled, no process
 *    scheduling takes place, and other CPUs cannot access the protected resource either.
 *    This is a heavy-handed lock, so it may only guard very lightweight operations.
 * 3. Reference counting is a very important concept in the kernel.
 *    Kernel code often has release/free functions that take almost no locks.
 *    This is because they are usually called when the reference count of the object drops to 0:
 *    since no process is using the object any more, no locking is needed.
 *    struct file holds a reference count.
 */

/* --- epoll data structures --- */

/*
 * This structure is stored inside the "private_data" member of the file
 * structure and represents the main data structure for the eventpoll
 * interface.
 */
/* For every epoll handle created, the kernel allocates one eventpoll */
struct eventpoll {
    /* Protect the access to this structure */
    spinlock_t lock;
    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    /* This mutex is held when adding, modifying, or deleting a monitored fd, and when
     * epoll_wait returns and passes data to user space. User space can therefore safely
     * run epoll operations from several threads at once; the kernel already provides the protection.
     */
    struct mutex mtx;
    /* Wait queue used by sys_epoll_wait() */
    /* When epoll_wait() is called, this is the wait queue we "sleep" on... */
    wait_queue_head_t wq;
    /* Wait queue used by file->poll() */
    /* Used when the epollfd itself is being polled... */
    wait_queue_head_t poll_wait;
    /* List of ready file descriptors */
    /* Every epitem that is already ready is on this list */
    struct list_head rdllist;
    /* RB tree root used to store monitored fd structs */
    /* Every epitem to be monitored is in here */
    struct rb_root rbr;
    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->lock.
     */
    struct epitem *ovflist;
    /* The user that created the eventpoll descriptor */
    /* Holds some per-user values, such as the maximum number of monitored fds */
    struct user_struct *user;
};

/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the "rbr" RB tree.
 */
/* An epitem represents one monitored fd */
struct epitem {
    /* RB tree node used to link this structure to the eventpoll RB tree */
    /* rb_node: when epoll_ctl() adds a batch of fds to an epollfd, the kernel allocates
     * a batch of epitems for them and organizes them as an rb_tree whose root is kept
     * in the epollfd, that is, in struct eventpoll.
     * The reason for using an rb_tree here is, I believe, to speed up lookup, insertion,
     * and deletion: all three operations are O(lgN) on an rb_tree. */
    struct rb_node rbn;
    /* List header used to link this structure to the eventpoll ready list */
    /* List node: every epitem that is ready is linked into the rdllist of the eventpoll */
    struct list_head rdllink;
    /*
     * Works together "struct eventpoll"->ovflist in keeping the
     * single linked chain of items.
     */
    /* This one is explained later in the code...
     */
    struct epitem *next;
    /* The file descriptor information this item refers to */
    /* The fd and struct file this epitem corresponds to */
    struct epoll_filefd ffd;
    /* Number of active wait queue attached to poll operations */
    int nwait;
    /* List containing poll wait queues */
    struct list_head pwqlist;
    /* The "container" of this item */
    /* The eventpoll this epitem belongs to */
    struct eventpoll *ep;
    /* List header used to link this item to the "struct file" items list */
    struct list_head fllink;
    /* The structure that describe the interested events and the source fd */
    /* The events this epitem is interested in; passed in from user space by epoll_ctl */
    struct epoll_event event;
};

struct epoll_filefd {
    struct file *file;
    int fd;
};

/* Wait structure used by the poll hooks */
struct eppoll_entry {
    /* List header used to link this structure to the "struct epitem" */
    struct list_head llink;
    /* The "base" pointer is set to the container "struct epitem" */
    struct epitem *base;
    /*
     * Wait queue item that will be linked to the target file wait
     * queue head.
     */
    wait_queue_t wait;
    /* The wait queue head that linked the "wait" wait queue item */
    wait_queue_head_t *whead;
};

/* Wrapper struct used by poll queueing */
struct ep_pqueue {
    poll_table pt;
    struct epitem *epi;
};

/* Used by the ep_send_events() function as callback private data */
struct ep_send_events_data {
    int maxevents;
    struct epoll_event __user *events;
};

// SYSCALL_DEFINE1 is a macro that defines a system call taking one argument.
// This is the real epoll_create: it first checks that size > 0 and, if so, simply calls epoll_create1.
// So the size in int epoll_create(int size); really has no use.
SYSCALL_DEFINE1(epoll_create, int, size)
{
    if (size <= 0)
        return -EINVAL;  // invalid argument, #define EINVAL 22 /* Invalid argument */
    return sys_epoll_create1(0);
}

/* epoll_create1 */
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
    int error;
    struct eventpoll *ep = NULL;  // the main descriptor
    /* Check the EPOLL_* constant for consistency. */
    /* Compile-time check only; it does nothing at run time...
     */
    BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
    /* For epoll, the only valid flag at present is CLOEXEC */
    if (flags & ~EPOLL_CLOEXEC)
        return -EINVAL;
    /*
     * Create the internal data structure ("struct eventpoll").
     */
    /* Allocate a struct eventpoll; the allocation and initialization details come later */
    error = ep_alloc(&ep);
    if (error < 0)
        return error;
    /*
     * Creates all the items needed to setup an eventpoll file. That is,
     * a file structure and a free file descriptor.
     */
    /* This creates an anonymous fd. It is a long story; in short:
     * no real file exists behind an epollfd, so the kernel has to create a
     * "virtual" file and give it a real struct file, along with a real fd.
     * Two of the arguments matter here:
     * eventpoll_fops: fops means file operations. When an operation (such as read) is performed
     *   on this (virtual) file, the function pointers in fops point to the real implementation,
     *   much like virtual functions and subclasses in C++.
     *   epoll only implements the poll and release (that is, close) operations; all other
     *   file system operations are handled entirely by the VFS.
     * ep: ep is the struct eventpoll. It is stored as private data in the private_data pointer of struct file.
     * Put simply, the point is to get from the fd to the struct file, and from the struct file
     * to the eventpoll structure.
     * This should be easy to follow with some knowledge of Linux character device driver development.
     * Recommended reading: "Linux Device Drivers", 3rd edition.
     */
    error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
                             O_RDWR | (flags & O_CLOEXEC));
    if (error < 0)
        ep_free(ep);
    return error;
}

/*
 * Once the epollfd exists, the next step is to add fds to it.
 * Let us look at epoll_ctl:
 * epfd   the epollfd
 * op     ADD, MOD, DEL
 * fd     the descriptor to monitor
 * event  the events we are interested in
 */
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
                struct epoll_event __user *, event)
{
    int error;
    struct file *file, *tfile;
    struct eventpoll *ep;
    struct epitem *epi;
    struct epoll_event epds;

    error = -EFAULT;
    /*
     * Error handling, and copying the epoll_event structure from user space into kernel space.
     */
    if (ep_op_has_event(op) &&
        copy_from_user(&epds, event, sizeof(struct epoll_event)))
        goto error_return;
    /* Get the "struct file *" for the eventpoll file */
    /* Get the struct file. Since epfd is a real fd, kernel space holds
     * a matching struct file for it.
     * That structure was allocated by anon_inode_getfd() in epoll_create1(). */
    error = -EBADF;
    file = fget(epfd);
    if (!file)
        goto error_return;
    /* Get the "struct file *" for the target file */
    /* The fd we want to monitor of course has a struct file too; do not mix the two up */
    tfile = fget(fd);
    if (!tfile)
        goto error_fput;
    /* The target file descriptor must support poll */
    error = -EPERM;
    /* If the monitored file does not support poll, nothing can be done.
     * Do you know in which cases a file does not support poll? */
    if (!tfile->f_op || !tfile->f_op->poll)
        goto error_tgt_fput;
    /*
     * We have to check that the file structure underneath the file descriptor
     * the user passed to us _is_ an eventpoll file. And also we do not permit
     * adding an epoll file descriptor inside itself.
     */
    error = -EINVAL;
    /* epoll cannot monitor itself... */
    if (file == tfile || !is_file_epoll(file))
        goto error_tgt_fput;
    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    /* Fetch our eventpoll structure, allocated in epoll_create1() */
    ep = file->private_data;
    /* The following operations may modify the data structure, so take the lock */
    mutex_lock(&ep->mtx);
    /*
     * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
     * above, we can be sure to be able to use the item looked up by
     * ep_find() till we release the mutex.
     */
    /* For every monitored fd the kernel allocates an epitem,
     * and we also know that epoll does not allow an fd to be added twice,
     * so we first look up whether this fd already exists.
     * ep_find() is simply a red-black tree lookup, much like a C++ STL map, with O(lgn) time complexity. */
    epi = ep_find(ep, tfile, fd);
    error = -EINVAL;
    switch (op) {
    /* First, the add case */
    case EPOLL_CTL_ADD:
        if (!epi) {
            /* The earlier find did not locate a valid epitem, so this is the first insertion: accept it.
             * From this we can see that the kernel always watches for POLLERR and POLLHUP.
             */
            epds.events |= POLLERR | POLLHUP;
            /* Red-black tree insertion; see the analysis of ep_insert() for details.
             * I actually think that, since there is an insert here, the earlier find
             * could have been skipped... */
            error = ep_insert(ep, &epds, tfile, fd);
        } else
            /* Found it: this is a duplicate add */
            error = -EEXIST;
        break;
    /* Delete and modify are both fairly simple */
    case EPOLL_CTL_DEL:
        if (epi)
            error = ep_remove(ep, epi);
        else
            error = -ENOENT;
        break;
    case EPOLL_CTL_MOD:
        if (epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_modify(ep, epi, &epds);
        } else
            error = -ENOENT;
        break;
    }
    mutex_unlock(&ep->mtx);
error_tgt_fput:
    fput(tfile);
error_fput:
    fput(file);
error_return:
    return error;
}

/*
 * ep_insert() is called from epoll_ctl() and does the work of adding a monitored fd to the epollfd.
 * tfile is the kernel-side struct file of the fd.
 */
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
                     struct file *tfile, int fd)
{
    int error, revents, pwake = 0;
    unsigned long flags;
    struct epitem *epi;
    struct ep_pqueue epq;

    /* Check whether the current user has reached the maximum number of watches */
    if (unlikely(atomic_read(&ep->user->epoll_watches) >= max_user_watches))
        return -ENOSPC;
    /* Allocate an epitem from the well-known slab allocator */
    if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;
    /* Item initialization follow here ... */
    /* Initialization of the relevant members... */
    INIT_LIST_HEAD(&epi->rdllink);
    INIT_LIST_HEAD(&epi->fllink);
    INIT_LIST_HEAD(&epi->pwqlist);
    epi->ep = ep;
    /* Store the fd we want to monitor together with its file structure */
    ep_set_ffd(&epi->ffd, tfile, fd);
    epi->event = *event;
    epi->nwait = 0;
    /* Note that the initial value of this pointer is not NULL... */
    epi->next = EP_UNACTIVE_PTR;
    /* Initialize the poll table using the queue callback */
    /* Now we finally get to the heart of poll */
    epq.epi = epi;
    /* Initialize a poll_table.
     * This really just specifies the callback to run when poll_wait (not epoll_wait!) is called,
     * and which events we are interested in.
     * ep_ptable_queue_proc() is our callback; initially all events are of interest. */
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
    /*
     * Attach the item to the poll hooks and get current event bits.
     * We can safely use the file* here because its usage count has
     * been increased by the caller of this function. Note that after
     * this operation completes, the poll callback can start hitting
     * the new item.
     */
    /* This step is crucial and also rather hard to follow; it is entirely a consequence of the kernel poll mechanism.
     * First, f_op->poll() is usually just a wrapper that calls the real poll implementation.
     * Taking a UDP socket as an example, the call path is: f_op->poll(), sock_poll(),
     * udp_poll(), datagram_poll(), sock_poll_wait(), and finally the callback we specified above,
     * ep_ptable_queue_proc() (quite a deep call path).
     * Once this step is done, our epitem is tied to the socket, and when the state of the socket
     * changes we are notified through ep_poll_callback().
     * Finally, this function also checks whether the fd already has some event ready,
     * and if so returns the event. */
    revents = tfile->f_op->poll(tfile, &epq.pt);
    /*
     * We have to check if something went wrong during the poll wait queue
     * install process. Namely an allocation for a wait queue failed due
     * high memory pressure.
     */
    error = -ENOMEM;
    if (epi->nwait < 0)
        goto error_unregister;
    /* Add the current item to the list of active epoll hook for this file */
    /* Each file chains together all the epitems that monitor it */
    spin_lock(&tfile->f_lock);
    list_add_tail(&epi->fllink, &tfile->f_ep_links);
    spin_unlock(&tfile->f_lock);
    /*
     * Add the current item to the RB tree. All RB tree operations are
     * protected by "mtx", and ep_insert() is called with "mtx" held.
     */
    /* With everything set up, insert the epitem into its eventpoll */
    ep_rbtree_insert(ep, epi);
    /* We have to drop the new item inside our item list to keep track of it */
    spin_lock_irqsave(&ep->lock, flags);
    /* If the file is already "ready" we drop it inside the ready list */
    /* At this point, if an event has already occurred on the monitored fd, it has to be handled */
    if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
        /* Add the current epitem to the ready list */
        list_add_tail(&epi->rdllink, &ep->rdllist);
        /* Notify waiting tasks that events are available */
        /* Whoever is in epoll_wait gets woken up...
         */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        /* Whoever is polling this epollfd itself gets woken up too... */
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }
    spin_unlock_irqrestore(&ep->lock, flags);
    atomic_inc(&ep->user->epoll_watches);
    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);
    return 0;

error_unregister:
    ep_unregister_pollwait(ep, epi);
    /*
     * We need to do this because an event could have been arrived on some
     * allocated wait queue. Note that we don't care about the ep->ovflist
     * list, since that is used/cleaned only inside a section bound by "mtx".
     * And ep_insert() is called with "mtx" held.
     */
    spin_lock_irqsave(&ep->lock, flags);
    if (ep_is_linked(&epi->rdllink))
        list_del_init(&epi->rdllink);
    spin_unlock_irqrestore(&ep->lock, flags);
    kmem_cache_free(epi_cache, epi);
    return error;
}

/*
 * This is the key callback. It is called when the state of a monitored fd changes.
 * The key argument is used as an unsigned long integer and carries the events.
 */
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    unsigned long flags;
    struct epitem *epi = ep_item_from_wait(wait);  // get the epitem from the wait queue entry
    struct eventpoll *ep = epi->ep;                // get the eventpoll it belongs to

    spin_lock_irqsave(&ep->lock, flags);
    /*
     * If the event mask does not contain any poll(2) event, we consider the
     * descriptor to be disabled. This condition is likely the effect of the
     * EPOLLONESHOT bit that disables the descriptor when an event is received,
     * until the next EPOLL_CTL_MOD will be issued.
     */
    if (!(epi->event.events & ~EP_PRIVATE_BITS))
        goto out_unlock;
    /*
     * Check the events coming with the callback. At this stage, not
     * every device reports the events in the "key" parameter of the
     * callback. We need to be able to handle both cases here, hence the
     * test for "key" != NULL before the event match test.
     */
    /* None of the events we are interested in...
     */
    if (key && !((unsigned long)key & epi->event.events))
        goto out_unlock;
    /*
     * If we are transferring events to userspace, we can hold no locks
     * (because we're accessing user memory, and because of linux f_op->poll()
     * semantics). All the events that happens during that period of time are
     * chained in ep->ovflist and requeued later on.
     */
    /*
     * This may look puzzling, but what it does is fairly simple:
     * if this callback runs while epoll_wait() is already handing events over,
     * that is, while the application may already be looping over its events,
     * the kernel chains the epitems whose events occur at that moment on a separate list.
     * They are neither delivered to the application nor dropped, but returned
     * to the user by the next epoll_wait.
     */
    if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
        if (epi->next == EP_UNACTIVE_PTR) {
            epi->next = ep->ovflist;
            ep->ovflist = epi;
        }
        goto out_unlock;
    }
    /* If this file is already in the ready list we exit soon */
    /* Put the current epitem on the ready list */
    if (!ep_is_linked(&epi->rdllink))
        list_add_tail(&epi->rdllink, &ep->rdllist);
    /*
     * Wake up ( if active ) both the eventpoll wait list and the ->poll()
     * wait list.
     */
    /* Wake up epoll_wait... */
    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);
    /* If the epollfd itself is being polled, wake up every member of that queue. */
    if (waitqueue_active(&ep->poll_wait))
        pwake++;
out_unlock:
    spin_unlock_irqrestore(&ep->lock, flags);
    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);
    return 1;
}

/*
 * Implement the event wait interface for the eventpoll file. It is the kernel
 * part of the user space epoll_wait(2).
 */
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
                int, maxevents, int, timeout)
{
    int error;
    struct file *file;
    struct eventpoll *ep;

    /* The maximum number of event must be greater than zero */
    if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
        return -EINVAL;
    /* Verify that the area passed by the user is writeable */
    /* This deserves a note:
     * the kernel's policy towards applications is "never trust them",
     * so data exchanged between the kernel and an application is mostly copied;
     * pointer references are not allowed (and sometimes not even possible).
     * epoll_wait() needs the kernel to return data to user space, and the memory is provided
     * by the user program, so the kernel uses some means to verify that this memory region is valid.
     */
    if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event))) {
        error = -EFAULT;
        goto error_return;
    }
    /* Get the "struct file *" for the eventpoll file */
    error = -EBADF;
    /* Get the struct file of the epollfd; an epollfd is a file too */
    file = fget(epfd);
    if (!file)
        goto error_return;
    /*
     * We have to check that the file structure underneath the fd
     * the user passed to us _is_ an eventpoll file.
     */
    error = -EINVAL;
    /* Check that it really is an epollfd... */
    if (!is_file_epoll(file))
        goto error_fput;
    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    /* Get the eventpoll structure */
    ep = file->private_data;
    /* Time to fish for events ... */
    /* OK, go to sleep and wait for events to arrive */
    error = ep_poll(ep, events, maxevents, timeout);
error_fput:
    fput(file);
error_return:
    return error;
}

/* This is the function that actually puts the process calling epoll_wait to sleep... */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
                   int maxevents, long timeout)
{
    int res, eavail;
    unsigned long flags;
    long jtimeout;
    wait_queue_t wait;  // wait queue entry

    /*
     * Calculate the timeout by checking for the "infinite" value (-1)
     * and the overflow condition. The passed timeout is in milliseconds,
     * that why (t * HZ) / 1000.
     */
    /* Compute the sleep time; milliseconds are converted to jiffies (HZ) */
    jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
        MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;
retry:
    spin_lock_irqsave(&ep->lock, flags);
    res = 0;
    /* If the ready list is not empty, skip sleeping and get straight to work... */
    if (list_empty(&ep->rdllist)) {
        /*
         * We don't have any available event to return to the caller.
         * We need to sleep here, and we will be wake up by
         * ep_poll_callback() when events will become available.
         */
        /* OK, initialize a wait queue entry and get ready to suspend ourselves.
         * Note that current is a macro that stands for the current process. */
        init_waitqueue_entry(&wait, current);        // initialize the entry; wait represents the current process
        __add_wait_queue_exclusive(&ep->wq, &wait);  // hook it onto the wait queue of the ep structure
        for (;;) {
            /*
             * We don't want to sleep if the ep_poll_callback() sends us
             * a wakeup in between. That's why we set the task state
             * to TASK_INTERRUPTIBLE before doing the checks.
             */
            /* Mark the current process as sleeping but wakeable by signals.
             * Note that this setting is "future tense": we are not asleep yet! */
            set_current_state(TASK_INTERRUPTIBLE);
            /* If the ready list has members by now,
             * or the sleep time has already run out, do not sleep at all... */
            if (!list_empty(&ep->rdllist) || !jtimeout)
                break;
            /* If a signal is pending, get up as well... */
            if (signal_pending(current)) {
                res = -EINTR;
                break;
            }
            /* Nothing going on: unlock and sleep... */
            spin_unlock_irqrestore(&ep->lock, flags);
            /* We are woken up after jtimeout.
             * If ep_poll_callback() is called in the meantime,
             * we are woken up directly without waiting for the time to pass.
             * To stress it once more: when ep_poll_callback() gets called is decided by
             * the concrete implementation of the monitored fd, such as a socket or a device driver,
             * because they hold the wait queue head; epoll and the current process
             * merely wait. */
            jtimeout = schedule_timeout(jtimeout);  // sleep
            spin_lock_irqsave(&ep->lock, flags);
        }
        __remove_wait_queue(&ep->wq, &wait);
        /* OK, we are awake... */
        set_current_state(TASK_RUNNING);
    }
    /* Is it worth to try to dig for events ? */
    eavail = !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
    spin_unlock_irqrestore(&ep->lock, flags);
    /*
     * Try to transfer events to user space. In case we get 0 events and
     * there's still timeout left over, we go trying again in search of
     * more luck.
     */
    /* If all went well and events occurred, start preparing the data to copy to user space... */
    if (!res && eavail &&
        !(res = ep_send_events(ep, events, maxevents)) && jtimeout)
        goto retry;
    return res;
}
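The ready-list handling walked through above, together with the LT/ET distinction of section 5.2, can be observed from user space. The following is a minimal sketch and not part of the kernel source: count_reports is a hypothetical helper, and a pipe stands in for a socket. It writes one byte, never reads it, and calls epoll_wait twice with a timeout of 0, so each call only inspects the ready list.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register the read end of a pipe with EPOLLIN plus
 * extra_flags (0 for LT, EPOLLET for ET), write one byte, then call
 * epoll_wait() twice WITHOUT reading the byte. Returns how many of the
 * two calls reported the descriptor, or -1 if setup failed. */
int count_reports(uint32_t extra_flags)
{
    int p[2], reports = -1;
    struct epoll_event ev, out;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(p) < 0) {
        close(epfd);
        return -1;
    }

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = p[0];

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        reports = 0;
        for (int i = 0; i < 2; i++) {
            /* timeout 0: look at the ready list, never sleep */
            if (epoll_wait(epfd, &out, 1, 0) == 1)
                reports++;
        }
    }
    close(p[0]);
    close(p[1]);
    close(epfd);
    return reports;
}
```

In LT mode the unread byte keeps the fd on the ready list, so both calls report it. In ET mode only the first call does, which is why an ET handler has to drain the descriptor before calling epoll_wait again.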