Group tasks by their thread_group leader and distribute their CPU as though
they're one task.

The practical upshot of this is that each application is treated as one task
no matter how many threads it forks. The significance of this is that
massively multithreaded applications such as Java applications do not get any
more CPU than if they were not threaded. The unexpected side effect is that
doing make -j (any number) will, provided you don't run out of RAM, feel like
no more load than make -j1 no matter how many CPUs you have.

Note that this drastically changes the way CPU is proportioned under load, as
each application is seen as only one entity regardless of how many children it
forks or threads it creates. 'nice' is still respected.

For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds at
a rate ever so slightly slower than the make -j4 (which is still optimal).

This will need extensive testing to see what disadvantages may occur, as some
applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a problem.
Note that Firefox, for example, has many threads and is contained as one
application with this patch.

It requires a change in mindset about how CPU is distributed in different
workloads, but I believe it will be ideal for the desktop user. Think of it as
implementing everything you want out of the complex CGROUPS, but without the
overhead or any input or effort on the user's part.
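The deadline arithmetic this patch adds to time_slice_expired() can be sketched in userspace as below. This is only an illustration of the multiplication, not a model of the scheduler: `scaled_deadline`, `now` and `base_diff` are hypothetical names standing in for the new deadline, grq.niffies and the return value of task_deadline_diff(p), and the numbers are made up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch only. Mirrors the arithmetic added to
 * time_slice_expired(): the per-task deadline offset is multiplied by the
 * number of running threads in the group, but only when more than one is
 * running. `now` and `base_diff` are hypothetical stand-ins for
 * grq.niffies and task_deadline_diff(p).
 */
uint64_t scaled_deadline(uint64_t now, uint64_t base_diff,
			 unsigned long threads_running)
{
	uint64_t tdd = base_diff;

	/* The count may be 0 if p was just deactivated, so test for > 1. */
	if (threads_running > 1)
		tdd *= threads_running;
	return now + tdd;
}

/* Hand-checked examples with made-up numbers. */
void self_check(void)
{
	assert(scaled_deadline(1000, 6, 0) == 1006);	/* none running */
	assert(scaled_deadline(1000, 6, 1) == 1006);	/* single thread */
	assert(scaled_deadline(1000, 6, 4) == 1024);	/* 4 threads: 4x offset */
}
```

With four running threads, each thread's next deadline lands four times further out than a lone task's would, which is how the group as a whole is pushed back in the deadline ordering.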
-ck

---
 include/linux/sched.h |    1 +
 kernel/sched_bfs.c    |   31 +++++++++++++++++++++++++++----
 2 files changed, 28 insertions(+), 4 deletions(-)

Index: linux-2.6.35.5-ck1/include/linux/sched.h
===================================================================
--- linux-2.6.35.5-ck1.orig/include/linux/sched.h	2010-10-04 23:46:58.960976832 +1100
+++ linux-2.6.35.5-ck1/include/linux/sched.h	2010-10-04 23:47:13.436239058 +1100
@@ -1195,6 +1195,7 @@ struct task_struct {
 	struct list_head run_list;
 	u64 last_ran;
 	u64 sched_time; /* sched_clock time spent running */
+	unsigned long threads_running;
 
 	unsigned long rt_timeout;
 #else /* CONFIG_SCHED_BFS */
Index: linux-2.6.35.5-ck1/kernel/sched_bfs.c
===================================================================
--- linux-2.6.35.5-ck1.orig/kernel/sched_bfs.c	2010-10-04 23:46:58.950976651 +1100
+++ linux-2.6.35.5-ck1/kernel/sched_bfs.c	2010-10-04 23:56:17.466562156 +1100
@@ -958,11 +958,16 @@ static int effective_prio(struct task_st
 	return p->prio;
 }
 
+static inline u64 task_deadline_diff(struct task_struct *p);
+
 /*
- * activate_task - move a task to the runqueue. Enter with grq locked.
+ * activate_task - move a task to the runqueue. Enter with grq locked. The
+ * number of threads running is stored in the group_leader struct.
  */
 static void activate_task(struct task_struct *p, struct rq *rq)
 {
+	unsigned long *threads_running = &p->group_leader->threads_running;
+
 	update_clocks(rq);
 
 	/*
@@ -980,6 +985,8 @@ static void activate_task(struct task_st
 	if (task_contributes_to_load(p))
 		grq.nr_uninterruptible--;
 	enqueue_task(p);
+	if (++*threads_running > 1)
+		p->deadline += task_deadline_diff(p);
 	grq.nr_running++;
 	inc_qnr();
 }
@@ -990,9 +997,13 @@ static void activate_task(struct task_st
  */
 static inline void deactivate_task(struct task_struct *p)
 {
+	unsigned long *threads_running = &p->group_leader->threads_running;
+
 	if (task_contributes_to_load(p))
 		grq.nr_uninterruptible++;
 	grq.nr_running--;
+	if (--*threads_running > 0)
+		p->deadline -= task_deadline_diff(p);
 }
 
 #ifdef CONFIG_SMP
@@ -2491,8 +2502,20 @@ static inline int ms_longest_deadline_di
  */
 static void time_slice_expired(struct task_struct *p)
 {
+	unsigned long *threads_running = &p->group_leader->threads_running;
+	u64 tdd = task_deadline_diff(p);
+
+	/*
+	 * We proportionately increase the deadline according to how many
+	 * threads are running. This effectively makes a thread group have
+	 * the same CPU as one task, no matter how many threads are running.
+	 * time_slice_expired can be called when there may be none running
+	 * when p is deactivated so we must explicitly test for more than 1.
+	 */
+	if (*threads_running > 1)
+		tdd *= *threads_running;
 	p->time_slice = timeslice();
-	p->deadline = grq.niffies + task_deadline_diff(p);
+	p->deadline = grq.niffies + tdd;
 }
 
 /*
@@ -3446,7 +3469,7 @@ SYSCALL_DEFINE1(nice, int, increment)
  *
  * This is the priority value as seen by users in /proc.
  * RT tasks are offset by -100. Normal tasks are centered around 1, value goes
- * from 0 (SCHED_ISO) up to 82 (nice +19 SCHED_IDLEPRIO).
+ * from 0 (SCHED_ISO) up to ~190 (nice +19 SCHED_IDLEPRIO).
  */
 int task_prio(const struct task_struct *p)
 {
@@ -3459,7 +3482,7 @@ int task_prio(const struct task_struct *
 	/* Convert to ms to avoid overflows */
 	delta = NS_TO_MS(p->deadline - grq.niffies);
 	delta = delta * 40 / ms_longest_deadline_diff();
-	if (delta > 0 && delta <= 80)
+	if (delta > 0)
 		prio += delta;
 	if (idleprio_task(p))
 		prio += 40;