Data structures used in process scheduling:

1. The runqueue

The kernel creates one runqueue per CPU; every process on a given queue is executed by that CPU (kernel/sched/core.c):

1 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

This defines a per-CPU set of struct rq instances: each CPU gets its own element, one runqueue per CPU. Now look at struct rq itself (kernel/sched/sched.h):

[struct rq listing omitted]

This structure is the runqueue of all processes on the local CPU. In the Linux kernel, processes are divided into normal processes and real-time processes, and the two are scheduled with different policies. Accordingly, lines 31-32 show that struct rq embeds two sub-runqueues, struct cfs_rq cfs and struct rt_rq rt, which organize normal and real-time processes respectively (normal processes use the completely fair scheduling policy, CFS, while real-time processes use the real-time policies). Line 33, struct dl_rq dl, is the runqueue for deadline (SCHED_DEADLINE) processes and is not discussed here. So when studying the scheduling of normal processes, the queue to care about is struct cfs_rq cfs; when studying real-time processes, only struct rt_rq rt matters.

1.1 The normal-process runqueue struct cfs_rq (kernel/sched/sched.h)

[struct cfs_rq listing omitted]

The cfs_rq runqueue organizes scheduling entities as a red-black tree. The tasks_timeline member on line 12 is the root of that tree. On line 13, rb_leftmost points to the leftmost node of the tree, i.e. the next schedulable entity. On line 19, curr points to the entity currently running, next points to a process that has just been woken, and last points to the process that woke next; the use of next and last comes up later. On line 55, rq points back to the rq that owns this cfs_rq.

1.2 The real-time runqueue struct rt_rq (kernel/sched/sched.h)

[struct rt_rq listing omitted]

2. Scheduling entities (include/linux/sched.h)

2.1 The normal-process scheduling entity sched_entity

 1 struct sched_entity {
 2     struct load_weight      load;           /* for load-balancing */
 3     struct rb_node          run_node;
 4     struct list_head        group_node;
 5     unsigned int            on_rq;
 6
 7     u64                     exec_start;
 8     u64                     sum_exec_runtime;
 9     u64                     vruntime;
10     u64                     prev_sum_exec_runtime;
11
12     u64                     nr_migrations;
13
14 #ifdef CONFIG_SCHEDSTATS
15     struct sched_statistics statistics;
16 #endif
17
18 #ifdef CONFIG_FAIR_GROUP_SCHED
19     int                     depth;
20     struct sched_entity     *parent;
21     /* rq on which this entity is (to be) queued: */
22     struct cfs_rq           *cfs_rq;
23     /* rq "owned" by this entity/group: */
24     struct cfs_rq           *my_q;
25 #endif
26
27 #ifdef CONFIG_SMP
28     /* Per-entity load-tracking */
29     struct sched_avg        avg;
30 #endif
31 };

Every process descriptor embeds one of these structures. The kernel uses it to organize normal processes into the runqueue that uses the completely fair policy (the cfs queue inside struct rq, mentioned above). The structure serves two purposes. First, it carries a process's scheduling information (run time, sleep time and so on, which the scheduler consults when deciding whether to schedule the process). Second, it is what processes are organized by: the run_node member on line 3 is a red-black tree node of type struct rb_node, so sched_entity entities, and therefore normal processes, are organized as a red-black tree. Lines 18-25 are members related to group scheduling and require group scheduling to be enabled. On line 20, parent, as the name suggests, points to the entity one level above the current one; more on that later. On line 22, cfs_rq points to the runqueue this entity is queued on. On line 24, my_q points to the runqueue (scheduling group) owned by this entity; that group, including its member entities, belongs to the next level down, not the same level as this entity, and the parent field of every member entity in that group points back to this entity, which explains parent above. In addition, depth on line 19 records the depth of this queue (scheduling group); each group is one level deeper than its parent group. The kernel relies on the my_q field to implement group scheduling.

2.2 The real-time scheduling entity sched_rt_entity

 1 struct sched_rt_entity {
 2     struct list_head        run_list;
 3     unsigned long           timeout;
 4     unsigned long           watchdog_stamp;
 5     unsigned int            time_slice;
 6
 7     struct sched_rt_entity  *back;
 8 #ifdef CONFIG_RT_GROUP_SCHED
 9     struct sched_rt_entity  *parent;
10     /* rq on which this entity is (to be) queued: */
11     struct rt_rq            *rt_rq;
12     /* rq "owned" by this entity/group: */
13     struct rt_rq            *my_q;
14 #endif
15 };

This structure is analogous to the previous one, except that it organizes real-time processes, which are placed on the rt queue inside struct rq, as mentioned above. Every process descriptor contains one of these as well. Note that it contains no struct rb_node member; instead, line 2 has run_list of type struct list_head, so the real-time runqueue is a doubly linked list, not a red-black tree.

3. The scheduling class (kernel/sched/sched.h)

 1 struct sched_class {
 2     const struct sched_class *next;
 3
 4     void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 5     void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 6     void (*yield_task) (struct rq *rq);
 7     bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);
 8
 9     void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
10
11     /*
12      * It is the responsibility of the pick_next_task() method that will
13      * return the next task to call put_prev_task() on the @prev task or
14      * something equivalent.
15      *
16      * May return RETRY_TASK when it finds a higher prio class has runnable
17      * tasks.
18      */
19     struct task_struct * (*pick_next_task) (struct rq *rq,
20                                             struct task_struct *prev);
21     void (*put_prev_task) (struct rq *rq, struct task_struct *p);
22
23 #ifdef CONFIG_SMP
24     int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
25     void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
26
27     void (*post_schedule) (struct rq *this_rq);
28     void (*task_waking) (struct task_struct *task);
29     void (*task_woken) (struct rq *this_rq, struct task_struct *task);
30
31     void (*set_cpus_allowed)(struct task_struct *p,
32                              const struct cpumask *newmask);
33
34     void (*rq_online)(struct rq *rq);
35     void (*rq_offline)(struct rq *rq);
36 #endif
37
38     void (*set_curr_task) (struct rq *rq);
39     void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
40     void (*task_fork) (struct task_struct *p);
41     void (*task_dead) (struct task_struct *p);
42
43     void (*switched_from) (struct rq *this_rq, struct task_struct *task);
44     void (*switched_to) (struct rq *this_rq, struct task_struct *task);
45     void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
46                           int oldprio);
47
48     unsigned int (*get_rr_interval) (struct rq *rq,
49                                      struct task_struct *task);
50
51 #ifdef CONFIG_FAIR_GROUP_SCHED
52     void (*task_move_group) (struct task_struct *p, int on_rq);
53 #endif
54 };
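sched_class is essentially a table of function pointers: each scheduling policy supplies its own implementations, and the core scheduler dispatches through whichever table a task's sched_class pointer names, falling back to the next class in the list when a class has nothing runnable. A minimal user-space sketch of that dispatch pattern (toy_sched_class and the two toy policies are invented for illustration; this is not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Toy "scheduling class": a vtable of function pointers, like sched_class. */
struct toy_sched_class {
    const struct toy_sched_class *next;   /* classes form a priority-ordered list */
    int (*pick_next_task)(void);          /* returns a task id, or -1 if none runnable */
};

static int rt_pick(void)   { return -1; } /* pretend no real-time task is runnable */
static int fair_pick(void) { return 42; } /* pretend the fair class has task 42 runnable */

static const struct toy_sched_class fair_class = { NULL,        fair_pick };
static const struct toy_sched_class rt_class   = { &fair_class, rt_pick   };

/* Walk the classes from highest priority down, as the core scheduler does. */
int pick_next_task(void)
{
    const struct toy_sched_class *class;

    for (class = &rt_class; class; class = class->next) {
        int task = class->pick_next_task();
        if (task >= 0)
            return task;
    }
    return -1;
}
```

Here the rt class has nothing runnable, so the walk falls through to the fair class and pick_next_task() yields 42; in the kernel, the same ordered walk is what lets a higher-priority class always win over a lower one.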
The kernel declares the structure type sched_class to implement different scheduling policies. As you can see, its members are all function pointers; the functions they point to are the concrete implementation of a scheduling policy, and everything related to process scheduling calls these member functions, directly or indirectly, to do its work. In addition, every process descriptor contains a pointer named sched_class to a structure of this type, identifying the scheduling class the process uses. Now look at the definition and initialization of the completely fair scheduling class (kernel/sched/fair.c):

1 const struct sched_class fair_sched_class;

This defines a global scheduling-class variable, initialized as follows:

 1 const struct sched_class fair_sched_class = {
 2     .next               = &idle_sched_class,
 3     .enqueue_task       = enqueue_task_fair,
 4     .dequeue_task       = dequeue_task_fair,
 5     .yield_task         = yield_task_fair,
 6     .yield_to_task      = yield_to_task_fair,
 7
 8     .check_preempt_curr = check_preempt_wakeup,
 9
10     .pick_next_task     = pick_next_task_fair,
11     .put_prev_task      = put_prev_task_fair,
12
13 #ifdef CONFIG_SMP
14     .select_task_rq     = select_task_rq_fair,
15     .migrate_task_rq    = migrate_task_rq_fair,
16
17     .rq_online          = rq_online_fair,
18     .rq_offline         = rq_offline_fair,
19
20     .task_waking        = task_waking_fair,
21 #endif
22
23     .set_curr_task      = set_curr_task_fair,
24     .task_tick          = task_tick_fair,
25     .task_fork          = task_fork_fair,
26
27     .prio_changed       = prio_changed_fair,
28     .switched_from      = switched_from_fair,
29     .switched_to        = switched_to_fair,
30
31     .get_rr_interval    = get_rr_interval_fair,
32
33 #ifdef CONFIG_FAIR_GROUP_SCHED
34     .task_move_group    = task_move_group_fair,
35 #endif
36 };

This variable has many function members implementing different pieces of functionality; we will analyze them as they come up.

4. The process descriptor task_struct (include/linux/sched.h)

 1 struct task_struct {
 2     volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
 3     .....
 4     int on_rq;
 5
 6     int prio, static_prio, normal_prio;
 7     unsigned int rt_priority;
 8     const struct sched_class *sched_class;
 9     struct sched_entity se;
10     struct sched_rt_entity rt;
11 #ifdef CONFIG_CGROUP_SCHED
12     struct task_group *sched_task_group;
13 #endif
14     struct sched_dl_entity dl;
15     .....
16     .....
17     unsigned int policy;
18     .....
19     .....
20     struct sched_info sched_info;
21     .....
22     .....
23 };

Only the members related to scheduling are listed. The three variables on line 6 are the three priorities of a normal process; the variable on line 7 is the priority of a real-time process. For the concept of process priorities, see the process-management chapter of Professional Linux Kernel Architecture. Lines 8-10 are the structures discussed above, embedded in the process descriptor. On line 17, policy holds the current process's scheduling policy; the kernel provides macros from which it takes its value, and again the process-management part of Professional Linux Kernel Architecture covers the details. Whatever value policy takes, sched_class must point to the matching scheduling class.

Analysis of the scheduling process:

Process scheduling consists of two parts. The first is updating process information, chiefly the information relevant to scheduling: run time, sleep time, process state, CPU load, and so on. The second is the actual process switch. Of all the functions related to process scheduling, only the schedule function performs the switch; the others update scheduling information. The schedule function was discussed in earlier posts, so it is not analyzed again here. The remaining functions are analyzed below, grouped by what they do.

1. scheduler_tick (kernel/sched/core.c)

 1 void scheduler_tick(void)
 2 {
 3     int cpu = smp_processor_id();
 4     struct rq *rq = cpu_rq(cpu);
 5     struct task_struct *curr = rq->curr;
 6
 7     sched_clock_tick();
 8
 9     raw_spin_lock(&rq->lock);
10     update_rq_clock(rq);
11     curr->sched_class->task_tick(rq, curr, 0);
12     update_cpu_load_active(rq);
13     raw_spin_unlock(&rq->lock);
14
15     perf_event_task_tick();
16
17 #ifdef CONFIG_SMP
18     rq->idle_balance = idle_cpu(cpu);
19     trigger_load_balance(rq);
20 #endif
21     rq_last_tick_reset(rq);
22 }

This function is called from the timer interrupt handler. It records the current CPU's load into members of the runqueue struct rq and updates the current process's time accounting. Line 3 gets the current CPU number; line 4 gets that CPU's runqueue rq (one per CPU); line 5 fetches the descriptor of the currently running process from the runqueue. Line 10 updates the runqueue's clock and clock_task members, which represent the current time; clock_task is the one we usually use. Line 11 enters the task_tick function of the current process's scheduling class, which updates the process's runtime accounting; each scheduling class implements it differently, and we analyze the completely fair class's version shortly. Line 12 updates the runqueue's CPU-load information. Line 18 checks whether the current CPU is idle: idle_cpu returns 1 if so, 0 otherwise. Line 19 raises the SCHED_SOFTIRQ softirq to perform periodic load balancing. Line 21 stores the latest jiffies tick count into the runqueue's last_sched_tick field. Now look at task_tick_fair (kernel/sched/fair.c):

 1 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 2 {
 3     struct cfs_rq *cfs_rq;
 4     struct sched_entity *se = &curr->se;
 5
 6     for_each_sched_entity(se) {
 7         cfs_rq = cfs_rq_of(se);
 8         entity_tick(cfs_rq, se, queued);
 9     }
10
11     if (numabalancing_enabled)
12         task_tick_numa(rq, curr);
13
14     update_rq_runnable_avg(rq, 1);
15 }

If a process's scheduling class is the completely fair class, the task_tick function pointer invoked on line 11 of scheduler_tick above points here; look back at the initialization of the fair scheduling class, line 24: .task_tick = task_tick_fair. Back to this function: line 4 takes the current process's normal scheduling entity and stores the pointer in se. Lines 6-8 walk from the current entity up through its parent entity, its parent's parent, and so on, calling entity_tick on each to update the entity's run time and other statistics. The loop is needed because, with group scheduling enabled, an entity's parent field stores the node one level above it, at a different level than the entity itself, so the loop covers every level (group) from the current one up to the top. Without group-scheduling support the loop runs exactly once, updating only the current entity. Now the code of entity_tick (kernel/sched/fair.c):

 1 static void
 2 entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 3 {
 4     /*
 5      * Update run-time statistics of the 'current'.
 6      */
 7     update_curr(cfs_rq);
 8
 9     /*
10      * Ensure that runnable average is periodically updated.
11      */
12     update_entity_load_avg(curr, 1);
13     update_cfs_rq_blocked_load(cfs_rq, 1);
14     update_cfs_shares(cfs_rq);
15
16 #ifdef CONFIG_SCHED_HRTICK
17     /*
18      * queued ticks are scheduled to match the slice, so don't bother
19      * validating it and just reschedule.
20      */
21     if (queued) {
22         resched_task(rq_of(cfs_rq)->curr);
23         return;
24     }
25     /*
26      * don't let the period tick interfere with the hrtick preemption
27      */
28     if (!sched_feat(DOUBLE_TICK) &&
29             hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
30         return;
31 #endif
32
33     if (cfs_rq->nr_running > 1)
34         check_preempt_tick(cfs_rq, curr);
35 }

This function updates the run-time statistics of the scheduling entity (process). On line 7, update_curr updates the current process's run time; it is analyzed below. On line 21, if the queued parameter passed in is nonzero, the current process is unconditionally marked for rescheduling (it may now be preempted). On lines 33-34, if more than one process is waiting on this cfs_rq, check_preempt_tick checks whether the current process's slice has been used up; if so, another process must be scheduled (concretely: when the current process's "real time slice" is exhausted, the function sets its need_resched flag, and schedule can then pick a new process from the runqueue). Now update_curr (kernel/sched/fair.c):

 1 static void update_curr(struct cfs_rq *cfs_rq)
 2 {
 3     struct sched_entity *curr = cfs_rq->curr;
 4     u64 now = rq_clock_task(rq_of(cfs_rq));
 5     u64 delta_exec;
 6
 7     if (unlikely(!curr))
 8         return;
 9
10     delta_exec = now - curr->exec_start;
11     if (unlikely((s64)delta_exec <= 0))
12         return;
13
14     curr->exec_start = now;
15
16     schedstat_set(curr->statistics.exec_max,
17               max(delta_exec, curr->statistics.exec_max));
18
19     curr->sum_exec_runtime += delta_exec;
20     schedstat_add(cfs_rq, exec_clock, delta_exec);
21
22     curr->vruntime += calc_delta_fair(delta_exec, curr);
23     update_min_vruntime(cfs_rq);
24
25     if (entity_is_task(curr)) {
26         struct task_struct *curtask = task_of(curr);
27
28         trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
29         cpuacct_charge(curtask, delta_exec);
30         account_group_exec_runtime(curtask, delta_exec);
31     }
32
33     account_cfs_rq_runtime(cfs_rq, delta_exec);
34 }

This is the core function for updating a process's run time. Line 3 gets the current scheduling entity; line 4 reads the current time from the runqueue's clock_task member (mentioned under scheduler_tick) into now. Line 10 subtracts the process's start time at the previous tick from the current time, giving the interval the process has run, stored in delta_exec. Line 14 makes the current time the process's new start time. Line 19 accumulates delta_exec into the entity's sum_exec_runtime member, which records how long the process has run so far. Line 20 also accumulates delta_exec into the exec_clock member of the fair runqueue cfs_rq. Line 22 is key: calc_delta_fair converts the real run time into virtual run time and accumulates it into the entity's vruntime field. A process's virtual time is central, because the completely fair policy schedules purely by it. The concepts of real versus virtual time and the conversion between them come up later in this article; for full details see the process-management chapter of Professional Linux Kernel Architecture. Line 23 updates the queue's minimum virtual run time, min_vruntime: the smallest virtual time already run among all processes on the queue, including the current one, which may only increase monotonically. We analyze update_min_vruntime shortly; it is fairly important. Lines 25-30: if the scheduling unit is a task (not a group), the run-time statistics in the process descriptor are updated. Line 33 updates the remaining runtime of the cfs_rq and computes the expected runtime, rescheduling the process if necessary. We first analyze update_min_vruntime, then calc_delta_fair (kernel/sched/fair.c):

 1 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 2 {
 3     u64 vruntime = cfs_rq->min_vruntime;
 4
 5     if (cfs_rq->curr)
 6         vruntime = cfs_rq->curr->vruntime;
 7
 8     if (cfs_rq->rb_leftmost) {
 9         struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
10                            struct sched_entity,
11                            run_node);
12
13         if (!cfs_rq->curr)
14             vruntime = se->vruntime;
15         else
16             vruntime = min_vruntime(vruntime, se->vruntime);
17     }
18
19     /* ensure we never gain time by being placed backwards. */
20     cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
21 #ifndef CONFIG_64BIT
22     smp_wmb();
23     cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
24 #endif
25 }

Every cfs_rq has a min_vruntime member holding the smallest virtual time already run among all processes on the queue, including the current one; this function updates it. Lines 5-6: if a process is currently executing, its virtual run time is saved in the vruntime variable. Lines 8-17: if the queue has a next process to schedule (pointed to by rb_leftmost), the if body runs, and lines 13-16 choose the smaller of the current process's and that next process's virtual run times, saving it in vruntime. Line 20 stores the larger of the queue's min_vruntime field and the vruntime variable back into min_vruntime, completing the update. Lines 13-17 are really the crux: they guarantee that min_vruntime holds the smallest virtual run time on the queue; line 20 merely ensures that min_vruntime only ever increases, nothing more. The min_vruntime member matters a great deal: it is used to compensate or penalize virtual time when a sleeping process is woken and when a new process has just been created, analyzed later.

1 static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
2 {
3     if (unlikely(se->load.weight != NICE_0_LOAD))
4         delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
5
6     return delta;
7 }

Line 3 checks whether the current process has a nice value of 0 (weight NICE_0_LOAD); if so, the real run time delta is returned directly, since for nice-0 processes real and virtual run time are equal. Otherwise, line 4 converts real time into virtual time. Now __calc_delta (kernel/sched/fair.c):

 1 static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
 2 {
 3     u64 fact = scale_load_down(weight);
 4     int shift = WMULT_SHIFT;
 5
 6     __update_inv_weight(lw);
 7
 8     if (unlikely(fact >> 32)) {
 9         while (fact >> 32) {
10             fact >>= 1;
11             shift--;
12         }
13     }
14
15     /* hint to use a 32x32->64 mul */
16     fact = (u64)(u32)fact * lw->inv_weight;
17
18     while (fact >> 32) {
19         fact >>= 1;
20         shift--;
21     }
22
23     return mul_u64_u32_shr(delta_exec, fact, shift);
24 }

This function computes delta_exec * weight / lw->weight, implemented in fixed point as (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT. The result is the converted virtual time. That roughly completes the analysis of scheduler_tick. Some details have of course been skipped: process scheduling is a sprawling subject, and analyzing every function would take enormous time and effort, so gaps will be filled in gradually if they come up later. In short, scheduler_tick updates the process's run time, both physical and virtual.

2. Functions that set process priority. Priority is tightly coupled to scheduling: as the analysis above shows, computing a process's virtual run time requires its priority. Priority determines weight, weight determines how fast virtual time advances, and that in the end determines how long the process may run; the larger a process's weight, the longer it can execute. Start with effective_prio (kernel/sched/core.c):

 1 static int effective_prio(struct task_struct *p)
 2 {
 3     p->normal_prio = normal_prio(p);
 4     /*
 5      * If we are RT tasks or we were boosted to RT priority,
 6      * keep the priority unchanged. Otherwise, update priority
 7      * to the normal priority:
 8      */
 9     if (!rt_prio(p->prio))
10         return p->normal_prio;
11     return p->prio;
12 }

This function is called when a process is created and from the user nice system call, to set the process's dynamic priority (a process has three priorities: dynamic priority prio, static priority static_prio, and normal priority normal_prio). It is written with a certain trick: the return value is used to set the dynamic priority, but the function body also sets the normal priority. Line 3 sets the normal priority; normal_prio is analyzed next. Lines 9-11: the dynamic priority tells us whether this is a real-time process. If it is, line 11 returns p->prio, leaving the real-time process's dynamic priority unchanged. Otherwise p->normal_prio is returned, which means a normal process's dynamic priority equals its normal priority. Now normal_prio (kernel/sched/core.c):

 1 static inline int normal_prio(struct task_struct *p)
 2 {
 3     int prio;
 4
 5     if (task_has_dl_policy(p))
 6         prio = MAX_DL_PRIO-1;
 7     else if (task_has_rt_policy(p))
 8         prio = MAX_RT_PRIO-1 - p->rt_priority;
 9     else
10         prio = __normal_prio(p);
11     return prio;
12 }
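The arithmetic in normal_prio is easy to check in isolation. Below is a small user-space sketch; rt_normal_prio is our own name for the real-time branch of the function, and the NICE_TO_PRIO formula mirrors the kernel's mapping from nice values (-20..19) onto internal priorities (100..139):

```c
#include <assert.h>

#define MAX_RT_PRIO 100                         /* rt prio values occupy 0..99 */
#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)  /* nice -20..19 -> 100..139 */

/* Real-time branch of normal_prio: rt_priority runs 0..99 with larger meaning
 * higher priority, and is flipped into prio 99..0 where smaller means higher. */
int rt_normal_prio(unsigned int rt_priority)
{
    return MAX_RT_PRIO - 1 - rt_priority;
}
```

So rt_priority 99 (the strongest real-time priority) maps to prio 0, rt_priority 0 maps to prio 99, and a nice value of 0 maps to prio 120, the static priority a default normal task carries.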
This function sets a process's normal priority. Line 5 checks whether the process uses the deadline policy; if so, line 6 sets its normal priority to MAX_DL_PRIO-1, i.e. -1, which ranks deadline processes above every real-time priority. Otherwise, line 7 checks whether the process is a real-time process; if so, its normal priority is set in the range 0-99 (the smaller the number, the higher the priority). Notice that p->rt_priority is subtracted here, which looks odd at first. The rt_priority member of a real-time process descriptor holds the process's own priority (also numbered 0-99, but there the larger number means the higher priority), so when transferring it into p->prio the scale must be flipped: MAX_RT_PRIO is 100, so MAX_RT_PRIO-1-(0..99) maps onto (99..0). Just a small trick. If the process is not a real-time process either (so it is a normal process), line 10 stores the process's static priority in prio and returns it; in other words, a normal process's normal priority equals its static priority.

To summarize: three kinds of processes appear here (deadline, real-time, and normal), and every process has three priorities: dynamic, normal, and static. The normal priority of a deadline process is always -1, the normal priority of a real-time process is derived from p->rt_priority (0-99), and the normal priority of a normal process equals its static priority (100-139).

3. Functions that set process weight. As mentioned above, a process's priority determines its weight, and the weight in turn determines how long the process runs. First the data structure involved.

struct load_weight (include/linux/sched.h)

1 struct load_weight {
2     unsigned long weight;
3     u32 inv_weight;
4 };

Every process descriptor, scheduling entity, and runqueue contains a variable of this type to hold its weight. The weight member stores the weight corresponding to the process's priority. The inv_weight member caches a fixed-point inverse of the weight (2^32/weight): to convert a run interval delta_exec into virtual time, the kernel multiplies it by NICE_0_LOAD and by inv_weight, then shifts right, instead of dividing by weight each time. Weights exist precisely to compute virtual time, and caching this intermediate result avoids recomputing it on every update.

The array prio_to_weight[40] (kernel/sched/sched.h)

 1 static const int prio_to_weight[40] = {
 2  /* -20 */     88761,     71755,     56483,     46273,     36291,
 3  /* -15 */     29154,     23254,     18705,     14949,     11916,
 4  /* -10 */      9548,      7620,      6100,      4904,      3906,
 5  /*  -5 */      3121,      2501,      1991,      1586,      1277,
 6  /*   0 */      1024,       820,       655,       526,       423,
 7  /*   5 */       335,       272,       215,       172,       137,
 8  /*  10 */       110,        87,        70,        56,        45,
 9  /*  15 */        36,        29,        23,        18,        15,
10 };

This array maps normal-process priorities to weights. Its elements are the weight values for priorities 100-139, equivalently nice values -20 to +19. The nice value (-20 to +19) is a normal process's priority as seen from user space, in one-to-one correspondence with the kernel-space priorities 100-139. The weight member of struct load_weight stores exactly these values.

The intermediate-result array prio_to_wmult[40] (kernel/sched/sched.h)

 1 static const u32 prio_to_wmult[40] = {
 2  /* -20 */     48388,     59856,     76040,     92818,    118348,
 3  /* -15 */    147320,    184698,    229616,    287308,    360437,
 4  /* -10 */    449829,    563644,    704093,    875809,   1099582,
 5  /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 6  /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 7  /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 8  /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 9  /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
10 };

Each element here is 2^32 divided by the corresponding element of the previous array (for nice 0: 2^32/1024 = 4194304), in one-to-one correspondence. The inv_weight member of struct load_weight stores exactly these values.

Now the functions that set weights, starting with set_load_weight (kernel/sched/core.c):

 1 static void set_load_weight(struct task_struct *p)
 2 {
 3     int prio = p->static_prio - MAX_RT_PRIO;
 4     struct load_weight *load = &p->se.load;
 5
 6     /*
 7      * SCHED_IDLE tasks get minimal weight:
 8      */
 9     if (p->policy == SCHED_IDLE) {
10         load->weight = scale_load(WEIGHT_IDLEPRIO);
11         load->inv_weight = WMULT_IDLEPRIO;
12         return;
13     }
14
15     load->weight = scale_load(prio_to_weight[prio]);
16     load->inv_weight = prio_to_wmult[prio];
17 }

This function sets the weight of process p. Line 3 converts the process's static priority into an array index (subtracting 100 maps 100-139 onto 0-39). Line 4 gets a pointer to the load_weight structure inside the process's scheduling entity, stored in load. Lines 9-12: a SCHED_IDLE process gets the minimal weight and matching cached inverse. Lines 15-16 set the weight and cached inverse for all other processes from the two tables above.

Since runqueues also contain a load_weight structure, it must be maintained as well, using the following functions (kernel/sched/fair.c):

1 static inline void update_load_set(struct load_weight *lw, unsigned long w)
2 {
3     lw->weight = w;
4     lw->inv_weight = 0;
5 }

This function sets a runqueue's weight.

1 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
2 {
3     lw->weight += inc;
4     lw->inv_weight = 0;
5 }

When a process joins the runqueue, this function adds its weight to the queue's.

1 static inline void update_load_sub(struct load_weight *lw, unsigned long dec)
2 {
3     lw->weight -= dec;
4     lw->inv_weight = 0;
5 }

When a process leaves the runqueue, this function subtracts its weight. Note that each of these sets inv_weight to 0: that invalidates the cached inverse, and __update_inv_weight (called from __calc_delta, above) recomputes it lazily the next time it is needed.

4. Functions that compute the scheduling latency period. The latency period is the time in which all runnable processes get to execute once. When the number of processes on the runqueue does not exceed a set bound, the kernel uses a fixed latency period for scheduling; once the count exceeds the bound, the period must be stretched (otherwise, as processes multiply, each one's share of execution time would keep shrinking). Look at sched_slice (kernel/sched/fair.c):

 1 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 2 {
 3     u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
 4
 5     for_each_sched_entity(se) {
 6         struct load_weight *load;
 7         struct load_weight lw;
 8
 9         cfs_rq = cfs_rq_of(se);
10         load = &cfs_rq->load;
11
12         if (unlikely(!se->on_rq)) {
13             lw = cfs_rq->load;
14
15             update_load_add(&lw, se->load.weight);
16             load = &lw;
17         }
18         slice = __calc_delta(slice, se->load.weight, load);
19     }
20     return slice;
21 }

Go straight to line 18: __calc_delta computes the share of the latency period the current process may occupy (its "time slice", the time available to it). __calc_delta is quite versatile. Recall that it was also called from entity_tick (entity_tick ---> update_curr ---> calc_delta_fair ---> __calc_delta), there to convert the physical (real) time a process had run into virtual time; here it computes the process's share of the latency period. As noted above, __calc_delta evaluates param1 * param2 / param3->weight (param meaning the function's parameters). When param2 is held fixed at NICE_0_LOAD and param3 (the current process's weight) varies, the function converts between real and virtual time; when param3 (the weight of the whole cfs_rq, cfs_rq->load.weight) is held fixed and param2 (the current process's weight) varies, it computes how long the current process should run. So the result slice on line 18 is the real time the current process should run. One more function, sched_vslice (kernel/sched/fair.c):

1 static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
2 {
3     return calc_delta_fair(sched_slice(cfs_rq, se), se);
4 }

This function converts the real time a process should run into the virtual time it should run, for the scheduler's use. As you can see, it calls calc_delta_fair, analyzed above, to do the conversion.

5. Picking the next process to schedule. The function used is pick_next_task_fair (kernel/sched/fair.c):

[pick_next_task_fair listing omitted]

This function is assigned to the fair scheduling class's pick_next_task member (.pick_next_task = pick_next_task_fair); the schedule function calls it to choose the next process to run, then performs the switch. Lines 11-12: if the runqueue holds no runnable processes, the function bails out. Lines 25-49 iterate over all the scheduling groups to pick the next schedulable process, not limiting themselves to the current group of the current queue. Line 26 gets the current scheduling entity. Lines 34-37: if a current entity (process) exists and is running, its run time is updated; otherwise cfs_rq.curr is set to NULL, meaning nothing is currently running. Line 47 picks the next scheduling entity (that function is analyzed shortly). Line 48 fetches that entity's own runqueue my_q (a scheduling group); if the group is non-empty, the loop continues, picking the next schedulable process inside the new runqueue (group), until an entity with no runqueue of its own is reached (all groups traversed). On loop exit, se is the next runnable entity of the last group traversed. Line 51 obtains the process descriptor for se. Lines 58-77: if the current process and the next one (se's process) are not the same, the if body runs. Line 59 stores the current process's scheduling entity pointer in pse, and if the current process and the next scheduled process (se's process) do not belong to the same scheduling group, the loop at lines 61-72 is entered. Otherwise, lines 75-76 put the current process back on the runqueue and take the next process off it; those two functions involve the runqueue's enqueue and dequeue operations, analyzed below. The loop at lines 61-73 adjusts by the tree depth of the current and next entities (only entities at the same level can be switched). Now the code of pick_next_entity, put_prev_entity, and set_next_entity (kernel/sched/fair.c):

 1 static struct sched_entity *
 2 pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 3 {
 4     struct sched_entity *left = __pick_first_entity(cfs_rq);
 5     struct sched_entity *se;
 6
 7     /*
 8      * If curr is set we have to see if its left of the leftmost entity
 9      * still in the tree, provided there was anything in the tree at all.
10      */
11     if (!left || (curr && entity_before(curr, left)))
12         left = curr;
13
14     se = left; /* ideally we run the leftmost entity */
15
16     /*
17      * Avoid running the skip buddy, if running something else can
18      * be done without getting too unfair.
19      */
20     if (cfs_rq->skip == se) {
21         struct sched_entity *second;
22
23         if (se == curr) {
24             second = __pick_first_entity(cfs_rq);
25         } else {
26             second = __pick_next_entity(se);
27             if (!second || (curr && entity_before(curr, second)))
28                 second = curr;
29         }
30
31         if (second && wakeup_preempt_entity(second, left) < 1)
32             se = second;
33     }
34
35     /*
36      * Prefer last buddy, try to return the CPU to a preempted task.
37      */
38     if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
39         se = cfs_rq->last;
40
41     /*
42      * Someone really wants this to run. If it's not unfair, run it.
43      */
44     if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
45         se = cfs_rq->next;
46
47     clear_buddies(cfs_rq, se);
48
49     return se;
50 }

This function chooses the next scheduling entity. Line 4 assigns the leftmost entity of the red-black tree to left. Lines 11-12: if the tree's leftmost entity is empty, or the current entity's virtual run time is smaller than that of the next entity (so the current entity keeps running), the current entity is assigned to left. In effect, left points at the next process to execute, either still the current process or the next one. Line 14 copies left into se. Lines 20-33: if the se process is marked to be skipped (it must not be scheduled), the if body runs. If se is the current process, the tree's leftmost process is taken as second; otherwise se already is the leftmost process, so the process pointed to by next is taken as second (next points at a just-woken process). Line 32 assigns second back to se; se is now the chosen next process. Lines 38-45 decide once more: either the process last points to or the one next points to may be assigned to se, and the kernel prefers last, because last is the process that woke the next process (usually the current process), so running last means no process switch at all, the most efficient outcome. Line 47 clears the next and last fields. The wakeup_preempt_entity function used on lines 31, 38, and 44 is analyzed together with process wakeup.

 1 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 2 {
 3     /*
 4      * If still on the runqueue then deactivate_task()
 5      * was not called and update_curr() has to be done:
 6      */
 7     if (prev->on_rq)
 8         update_curr(cfs_rq);
 9
10     /* throttle cfs_rqs exceeding runtime */
11     check_cfs_rq_runtime(cfs_rq);
12
13     check_spread(cfs_rq, prev);
14     if (prev->on_rq) {
15         update_stats_wait_start(cfs_rq, prev);
16         /* Put 'current' back into the tree. */
17         __enqueue_entity(cfs_rq, prev);
18         /* in !on_rq case, update occurred at dequeue */
19         update_entity_load_avg(prev, 1);
20     }
21     cfs_rq->curr = NULL;
22 }

This function puts the current scheduling entity back on the runqueue. Lines 7-8: if the current entity is still runnable, its run-time accounting is updated. Line 17 inserts the current entity into the runqueue; __enqueue_entity is analyzed shortly. Line 21 sets the runqueue's curr field to NULL: the current process has been put back on the queue, so no process is currently executing.

 1 static void
 2 set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 3 {
 4     /* 'current' is not kept within the tree. */
 5     if (se->on_rq) {
 6         /*
 7          * Any task has to be enqueued before it get to execute on
 8          * a CPU. So account for the time it spent waiting on the
 9          * runqueue.
10          */
11         update_stats_wait_end(cfs_rq, se);
12         __dequeue_entity(cfs_rq, se);
13     }
14
15     update_stats_curr_start(cfs_rq, se);
16     cfs_rq->curr = se;
17 #ifdef CONFIG_SCHEDSTATS
18     /*
19      * Track our maximum slice length, if the CPU's load is at
20      * least twice that of our own weight (i.e. dont track it
21      * when there are only lesser-weight tasks around):
22      */
23     if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
24         se->statistics.slice_max = max(se->statistics.slice_max,
25             se->sum_exec_runtime - se->prev_sum_exec_runtime);
26     }
27 #endif
28     se->prev_sum_exec_runtime = se->sum_exec_runtime;
29 }

This function takes the entity to be scheduled next off the runqueue. Line 12 performs the removal; that function is analyzed shortly. Line 16 stores the removed entity's pointer in the runqueue's curr, so there is once again a running process. The remaining code updates the current process's time statistics.

6. The runqueue's enqueue and dequeue operations (kernel/sched/fair.c).

 1 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 2 {
 3     struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
 4     struct rb_node *parent = NULL;
 5     struct sched_entity *entry;
 6     int leftmost = 1;
 7
 8     /*
 9      * Find the right place in the rbtree:
10      */
11     while (*link) {
12         parent = *link;
13         entry = rb_entry(parent, struct sched_entity, run_node);
14         /*
15          * We dont care about collisions. Nodes with
16          * the same key stay together.
17 */ 18 if (entity_before(se, entry)) { 19 link = &parent->rb_left; 20 } else { 21 link = &parent->rb_right; 22 leftmost = 0; 23 } 24 } 25 26 /* 27 * Maintain a cache of leftmost tree entries (it is frequently 28 * used): 29 */ 30 if (leftmost) 31 cfs_rq->rb_leftmost = &se->run_node; 32 33 rb_link_node(&se->run_node, parent, link); 34 rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline); 35 } 該函數(shù)實(shí)現(xiàn)入隊(duì)操作。第3行獲取就緒隊(duì)列中紅黑樹(shù)的根節(jié)點(diǎn),將樹(shù)根指針保存在link中。第12行parent暫時(shí)指向樹(shù)根。第13行獲得樹(shù)根節(jié)點(diǎn)的調(diào)度實(shí)體,保存在entry中。第18-22行,比較要入隊(duì)的實(shí)體中的已運(yùn)行虛擬時(shí)間和樹(shù)根實(shí)體中的該信息,如果前者小的話(huà),就要插入到樹(shù)的左子樹(shù)上(link指向樹(shù)根的左孩子,再次進(jìn)入循環(huán),類(lèi)似于遞歸),否則就要插入到樹(shù)的右子樹(shù)上(同上)。這塊就將進(jìn)程的調(diào)度策略展現(xiàn)的淋漓盡致:根據(jù)進(jìn)程已運(yùn)行的虛擬時(shí)間來(lái)決定進(jìn)程的調(diào)度,紅黑樹(shù)的左子樹(shù)比右子樹(shù)要先被調(diào)度,已運(yùn)行的虛擬時(shí)間越小的進(jìn)程越在樹(shù)的左側(cè)。第30-31行,如果入隊(duì)的實(shí)體最終被插在左孩子上(該入隊(duì)實(shí)體仍是就緒隊(duì)列上最靠前的實(shí)體,下次就會(huì)被調(diào)用),那么就要讓就緒隊(duì)列的rb_leftmost指向入隊(duì)實(shí)體。rb_leftmost指針始終指向下次要被調(diào)度的實(shí)體(進(jìn)程)。最后兩行要給紅黑數(shù)重新著色。 1 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 2 { 3 if (cfs_rq->rb_leftmost == &se->run_node) { 4 struct rb_node *next_node; 5 6 next_node = rb_next(&se->run_node); 7 cfs_rq->rb_leftmost = next_node; 8 } 9 10 rb_erase(&se->run_node, &cfs_rq->tasks_timeline); 11 } 該函數(shù)實(shí)現(xiàn)出隊(duì)操作。第3行判斷要出隊(duì)的實(shí)體是不是紅黑樹(shù)最左側(cè)的孩子(rb_leftmost所指向的),如果不是的話(huà),第10行直接刪除該出隊(duì)實(shí)體。否則執(zhí)行if體,第6行找到出隊(duì)實(shí)體的下一個(gè)實(shí)體(中序遍歷的下一個(gè)節(jié)點(diǎn),也就是當(dāng)出隊(duì)實(shí)體刪除后,最左邊的孩子),賦給next_node。第7行讓rb_leftmost指向next_node。在刪除掉要出隊(duì)實(shí)體后,下一個(gè)需要被調(diào)度的實(shí)體也就設(shè)置好了。 7.睡眠進(jìn)程被喚醒后搶占當(dāng)前進(jìn)程。當(dāng)某個(gè)資源空出來(lái)后,等待該資源的進(jìn)程就會(huì)被喚醒,喚醒后也許就要搶占當(dāng)前進(jìn)程,因此這塊的函數(shù)也需要分析(kernel/sched/core.c)。 1 static int 2 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) 3 { 4 
        unsigned long flags;
 5      int cpu, success = 0;
 6
 7      /*
 8       * If we are going to wake up a thread waiting for CONDITION we
 9       * need to ensure that CONDITION=1 done by the caller can not be
10       * reordered with p->state check below. This pairs with mb() in
11       * set_current_state() the waiting thread does.
12       */
13      smp_mb__before_spinlock();
14      raw_spin_lock_irqsave(&p->pi_lock, flags);
15      if (!(p->state & state))
16          goto out;
17
18      success = 1; /* we're going to change ->state */
19      cpu = task_cpu(p);
20
21      if (p->on_rq && ttwu_remote(p, wake_flags))
22          goto stat;
23
24 #ifdef CONFIG_SMP
25      /*
26       * If the owning (remote) cpu is still in the middle of schedule() with
27       * this task as prev, wait until its done referencing the task.
28       */
29      while (p->on_cpu)
30          cpu_relax();
31      /*
32       * Pairs with the smp_wmb() in finish_lock_switch().
33       */
34      smp_rmb();
35
36      p->sched_contributes_to_load = !!task_contributes_to_load(p);
37      p->state = TASK_WAKING;
38
39      if (p->sched_class->task_waking)
40          p->sched_class->task_waking(p);
41
42      cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
43      if (task_cpu(p) != cpu) {
44          wake_flags |= WF_MIGRATED;
45          set_task_cpu(p, cpu);
46      }
47 #endif /* CONFIG_SMP */
48
49      ttwu_queue(p, cpu);
50 stat:
51      ttwu_stat(p, cpu, wake_flags);
52 out:
53      raw_spin_unlock_irqrestore(&p->pi_lock, flags);
54
55      return success;
56 }
This function wakes the process specified by p and puts it on a ready queue to await scheduling. Line 15 checks whether p's state matches the state mask passed in; if not, the function bails out (a mismatch means the condition the process was waiting for has not yet been satisfied). Line 19 obtains the number of the CPU that p previously ran on. Line 37 sets the state of the process being woken to TASK_WAKING. Line 40 invokes the task_waking hook of p's scheduling class; for CFS this pointer refers to task_waking_fair, which mainly compensates the sleeper's accumulated virtual runtime and is analyzed below. Line 42 selects a suitable ready queue for the freshly woken p to join and returns the number of the CPU that queue belongs to. Line 43: if p's original CPU differs from the one just chosen, line 45 updates p's CPU number.
 1 void wake_up_new_task(struct task_struct *p)
 2 {
 3     unsigned long
           flags;
 4     struct rq *rq;
 5
 6     raw_spin_lock_irqsave(&p->pi_lock, flags);
 7 #ifdef CONFIG_SMP
 8     /*
 9      * Fork balancing, do it here and not earlier because:
10      *  - cpus_allowed can change in the fork path
11      *  - any previously selected cpu might disappear through hotplug
12      */
13     set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
14 #endif
15
16     /* Initialize new task's runnable average */
17     init_task_runnable_average(p);
18     rq = __task_rq_lock(p);
19     activate_task(rq, p, 0);
20     p->on_rq = 1;
21     trace_sched_wakeup_new(p, true);
22     check_preempt_curr(rq, p, WF_FORK);
23 #ifdef CONFIG_SMP
24     if (p->sched_class->task_woken)
25         p->sched_class->task_woken(rq, p);
26 #endif
27     task_rq_unlock(rq, p, &flags);
28 }
This function wakes a newly created process, whereas the previous one wakes a sleeping process. Line 13 picks a suitable CPU for the process p about to be woken. Line 17 initializes p's runnable average (loosely analogous to a "time slice"). Line 19: activate_task puts the woken process p on a ready queue, updates the queue's clock, and initializes p's timing fields; its call chain is roughly activate_task->enqueue_task->enqueue_task_fair (via p->sched_class->enqueue_task)->enqueue_entity. Line 22: check_preempt_curr ends up calling check_preempt_wakeup, which tests whether the woken process may preempt the current one; that function is analyzed next (kernel/sched/fair.c).
 1 static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 2 {
 3     struct task_struct *curr = rq->curr;
 4     struct sched_entity *se = &curr->se, *pse = &p->se;
 5     struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 6     int scale = cfs_rq->nr_running >= sched_nr_latency;
 7     int next_buddy_marked = 0;
 8
 9     if (unlikely(se == pse))
10         return;
11
12     /*
13      * This is possible from callers such as move_task(), in which we
14      * unconditionally check_prempt_curr() after an enqueue (which may have
15      * lead to a throttle). This both saves work and prevents false
16      * next-buddy nomination below.
17      */
18     if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
19         return;
20
21     if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
22         set_next_buddy(pse);
23         next_buddy_marked = 1;
24     }
25
26     /*
27      * We can come here with TIF_NEED_RESCHED already set from new task
28      * wake up path.
29      *
30      * Note: this also catches the edge-case of curr being in a throttled
31      * group (e.g. via set_curr_task), since update_curr() (in the
32      * enqueue of curr) will have resulted in resched being set. This
33      * prevents us from potentially nominating it as a false LAST_BUDDY
34      * below.
35      */
36     if (test_tsk_need_resched(curr))
37         return;
38
39     /* Idle tasks are by definition preempted by non-idle tasks. */
40     if (unlikely(curr->policy == SCHED_IDLE) &&
41         likely(p->policy != SCHED_IDLE))
42         goto preempt;
43
44     /*
45      * do not preempt non-idle tasks (their preemption
46      * is driven by the tick):
47      */
48     if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
49         return;
50
51     find_matching_se(&se, &pse);
52     update_curr(cfs_rq_of(se));
53     BUG_ON(!pse);
54     if (wakeup_preempt_entity(se, pse) == 1) {
55         /*
56          * Bias pick_next to pick the sched entity that is
57          * triggering this preemption.
58          */
59         if (!next_buddy_marked)
60             set_next_buddy(pse);
61         goto preempt;
62     }
63
64     return;
65
66 preempt:
67     resched_task(curr);
68     /*
69      * Only set the backward buddy when the current task is still
70      * on the rq. This can happen when a wakeup gets interleaved
71      * with schedule on the ->pre_schedule() or idle_balance()
72      * point, either of which can * drop the rq lock.
73      *
74      * Also, during early boot the idle thread is in the fair class,
75      * for obvious reasons its a bad idea to schedule back to it.
76      */
77     if (unlikely(!se->on_rq || curr == rq->idle))
78         return;
79
80     if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
81         set_last_buddy(se);
82 }
Lines 21-24: if NEXT_BUDDY is enabled and the woken process is not a freshly forked one (i.e. it is a sleeper being woken), line 22 points the cfs_rq's next pointer at the woken process and records that fact in a flag. Line 36: if the current process is already marked for rescheduling, the function returns at once. Otherwise, lines 40-42: if the current process runs under the idle policy and the woken one does not, jump to the preempt label, where the need_resched flag is set and the preemption setup completes. Line 48: if the woken process is an idle-policy or batch process, return; such processes may not preempt others. Line 51: if the current process and the woken one are not at the same level (in the same scheduling group), walk up through their parents until they are at the same level; only then can their virtual runtimes be compared to decide whether preemption is possible. Line 54 compares the two virtual runtimes; if preemption is allowed (the woken process's virtual time is smaller than the current one's by a sufficient margin), the if body runs and we jump to preempt to complete the setup. If none of these conditions hold, the current process cannot be preempted and line 64 returns; the comparison function is analyzed next. Lines 80-81: if LAST_BUDDY is enabled, the cfs_rq's last pointer is set to the entity of the process that performed the wakeup. In pick_next_entity, the processes pointed at by next and last are chosen in preference to the ready queue's leftmost process. Now let us look at wakeup_preempt_entity (kernel/sched/fair.c).
 1 static int
 2 wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 3 {
 4     s64 gran, vdiff = curr->vruntime - se->vruntime;
 5
 6     if (vdiff <= 0)
 7         return -1;
 8
 9     gran = wakeup_gran(curr, se);
10     if (vdiff > gran)
11         return 1;
12
13     return 0;
14 }
This function guarantees that before entity se preempts entity curr, curr has already run for a while (concretely, 1 ms of wall-clock time by default). Line 9: wakeup_gran converts the value of sysctl_sched_wakeup_granularity (1 ms) into virtual time scaled for entity se and stores it in gran. Line 10 compares vdiff with gran, which is effectively comparing curr->vruntime against se->vruntime + gran; in other words, curr is granted an extra gran of virtual time before it may be preempted.
Finally, let us analyze the task_waking function pointer left over from line 40 of try_to_wake_up. It points to task_waking_fair (kernel/sched/fair.c):
 1 static void task_waking_fair(struct task_struct *p)
 2 {
 3     struct sched_entity *se = &p->se;
 4     struct cfs_rq *cfs_rq = cfs_rq_of(se);
 5     u64 min_vruntime;
 6
 7 #ifndef CONFIG_64BIT
 8     u64 min_vruntime_copy;
 9
10     do {
11         min_vruntime_copy =
               cfs_rq->min_vruntime_copy;
12         smp_rmb();
13         min_vruntime = cfs_rq->min_vruntime;
14     } while (min_vruntime != min_vruntime_copy);
15 #else
16     min_vruntime = cfs_rq->min_vruntime;
17 #endif
18
19     se->vruntime -= min_vruntime;
20     record_wakee(p);
21 }
This function compensates the sleeping process's virtual time. Since the sleeper has not run for a long while, line 19 subtracts the ready queue's minimum virtual time from the woken entity's accumulated virtual runtime as a small compensation, allowing it to run a bit longer. (On 32-bit kernels, lines 10-14 retry until min_vruntime and its copy agree, so the 64-bit value is read consistently without a lock.)
8. The handler for new processes (kernel/sched/fair.c):
 1 static void task_fork_fair(struct task_struct *p)
 2 {
 3     struct cfs_rq *cfs_rq;
 4     struct sched_entity *se = &p->se, *curr;
 5     int this_cpu = smp_processor_id();
 6     struct rq *rq = this_rq();
 7     unsigned long flags;
 8
 9     raw_spin_lock_irqsave(&rq->lock, flags);
10
11     update_rq_clock(rq);
12
13     cfs_rq = task_cfs_rq(current);
14     curr = cfs_rq->curr;
15
16     /*
17      * Not only the cpu but also the task_group of the parent might have
18      * been changed after parent->se.parent,cfs_rq were copied to
19      * child->se.parent,cfs_rq. So call __set_task_cpu() to make those
20      * of child point to valid ones.
21      */
22     rcu_read_lock();
23     __set_task_cpu(p, this_cpu);
24     rcu_read_unlock();
25
26     update_curr(cfs_rq);
27
28     if (curr)
29         se->vruntime = curr->vruntime;
30     place_entity(cfs_rq, se, 1);
31
32     if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
33         /*
34          * Upon rescheduling, sched_class::put_prev_task() will place
35          * 'current' within the tree based on its new key value.
36          */
37         swap(curr->vruntime, se->vruntime);
38         resched_task(rq->curr);
39     }
40
41     se->vruntime -= cfs_rq->min_vruntime;
42
43     raw_spin_unlock_irqrestore(&rq->lock, flags);
44 }
This function is called from do_fork--->copy_process to set up the virtual-time information of a newly created process. Line 5 obtains the current CPU number, and line 23 assigns it to the new process. Line 29 copies the current process's (the parent's) virtual runtime to the new process (the child). Line 30: place_entity computes the new process's "time slice" and applies the virtual-time penalty; the new process is inserted into the red-black tree later, when it is enqueued. Line 32: if the flag that makes children run before their parents is set, the current process is non-NULL, and the current process's accumulated virtual time is smaller than the new process's, the if body runs: line 37 swaps the virtual runtimes of the two (the new process's becomes the smaller one, so it lands to the left of the current process in the red-black tree and gets scheduled next), and line 38 sets the rescheduling flag. Line 41 subtracts the queue's minimum virtual time from the new process's virtual runtime (place_entity above added min_vruntime once, so it must be subtracted here; enqueue_entity will add the target queue's min_vruntime back when the task is finally enqueued). Now let us look at place_entity (kernel/sched/fair.c):
 1 static void
 2 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 3 {
 4     u64 vruntime = cfs_rq->min_vruntime;
 5
 6     /*
 7      * The 'current' period is already promised to the current tasks,
 8      * however the extra weight of the new task will slow them down a
 9      * little, place the new task so that it fits in the slot that
10      * stays open at the end.
11      */
12     if (initial && sched_feat(START_DEBIT))
13         vruntime += sched_vslice(cfs_rq, se);
14
15     /* sleeps up to a single latency don't count. */
16     if (!initial) {
17         unsigned long thresh = sysctl_sched_latency;
18
19         /*
20          * Halve their sleep time's effect, to allow
21          * for a gentler effect of sleepers:
22          */
23         if (sched_feat(GENTLE_FAIR_SLEEPERS))
24             thresh >>= 1;
25
26         vruntime -= thresh;
27     }
28
29     /* ensure we never gain time by being placed backwards.
      */
30     se->vruntime = max_vruntime(se->vruntime, vruntime);
31 }
This function computes the new process's "time slice" and applies the virtual-time penalty (the actual insertion into the ready queue happens later, via enqueue_entity). Line 4 stores the ready queue's min_vruntime in vruntime. Lines 12-13: if the initial flag is 1 (we are computing for a newly forked process), the new process's virtual slice is added on top of vruntime; the reason for this addition is that the scheduler wants every process already on the ready queue to run once before the new process does, as explained below. Lines 16-17: if we are not dealing with a new process (i.e. a waking sleeper), a full latency period sysctl_sched_latency (6 ms by default) is placed in thresh; line 24 halves thresh, and line 26 subtracts the halved thresh from the sleeper's target virtual time. Because the sleeper has not run for a long time, it receives this virtual-time compensation, making its accumulated virtual time a bit smaller so it can run a while longer. Line 30 stores the result in the entity's vruntime field, taking the maximum with its existing value so that an entity can never gain time by being placed backwards. Why penalize a new process's virtual time at all? There is really just one reason: the scheduler must guarantee that every process already on the ready queue runs once before the new process does, so the new process must start with vruntime = cfs_rq->min_vruntime + its virtual slice. Only then is it inserted toward the right side of the red-black tree, taking its turn after the existing processes; without this, that guarantee would not hold.
Finally, let us analyze when these scheduling-related functions are invoked.
Most of them were already introduced together with their call sites; here is a systematic summary.
Process scheduling splits into two parts: updating process information, and the process switch itself. Only one function, schedule, performs the switch; its timing is analyzed last. The other functions run as follows:
1. scheduler_tick is called on every timer interrupt to update the current process's runtime accounting.
2. effective_prio is called when a new process is created or when a user sets a process's priority via the nice system call; it computes the process's in-kernel priority (distinct from the nice value).
3. set_load_weight is likewise called when a new process is created or the user sets its priority with nice(); it sets the process's load weight. It works in tandem with the function in item 2: whenever a process's priority is set or changed, the weight corresponding to that priority must be set or changed with it.
4. sched_slice is mainly called from scheduler_tick->...->check_preempt_tick (it has other callers too), hence once per tick, to compute the "time slice" the current process is entitled to; only with that value can the kernel decide whether the process has exceeded its slice and, if so, set the preemption flag and switch to another process.
5. pick_next_task_fair is called from schedule to choose the next process to run, after which the switch can take place. Its timing is therefore that of schedule, analyzed below.
6. __enqueue_entity/__dequeue_entity are called wherever an entity must enter or leave a ready queue, so they have many call sites, e.g. (indirectly) from schedule; we will not enumerate them all.
7. try_to_wake_up/wake_up_new_task: the former is called when the resource a process is waiting for becomes available (typically from interrupt context); the latter once a new process has been created. Both wake a process and put it on a ready queue, the former for sleepers, the latter for new processes.
8. task_fork_fair is also called after a new process has been created, to set up its "time slice" and related information; once that is done the new process can be woken. Its timing therefore coincides with that of wake_up_new_task.
Finally, let us analyze when schedule is called. It is the only function that actually performs a process switch.
Before that, a word about the notion of a "kernel control path". A kernel control path is an execution path entered through an interrupt or exception; in plain terms, while an interrupt or exception handler runs, we are on a kernel control path, still executing on behalf of the current process. Kernel control paths can nest (interrupts can nest), but however deep the nesting, control must eventually return to the current process; that is, the kernel must return from the control path, and no process switch may occur while on it (which is why interrupt handlers must not call functions that can sleep). Ultimately the kernel is either in process context or on a kernel control path (the control path also represents the current process, but it is special enough to treat separately); there is nothing else. A process switch, i.e. a call to schedule, can therefore only originate from one of these two situations.
1. When schedule is called from process context (the "direct invocation" of ULK), the current process needs to suspend itself, waiting for a resource or for some other reason; it voluntarily gives up the CPU and calls schedule to switch to another process.
2. When schedule is "called" from a kernel control path (as noted above, no switch may happen while on the path), it is actually called at the moment the control path returns to the process (the "lazy invocation" of ULK): a more important process is waiting and must preempt the current one. Hence, on return from every interrupt or exception handler, the kernel checks the current process's preemption flag to see whether it can be preempted, and calls schedule if so. This includes returning to user space from a system call, which is consistent because a system call is itself an exception, so returning from a system call amounts to returning from an exception handler (entering the kernel via a system call could also be called a kernel control path, though it usually is not).
That completes the analysis of when a process switch happens. It is easy to remember: either the switch happens in process context (a direct call to schedule), or it happens on return from an interrupt, which is why every interrupt return must check whether preemption can occur, system-call returns included.
That wraps up our analysis of the process-scheduling mechanism (really only of CFS scheduling). Real-time process scheduling will be analyzed another time!
References: Understanding the Linux Kernel; Professional Linux Kernel Architecture.
Article: blog.csdn.net/wudongxu/article/details/8574737