Group tasks by their group leader and distribute their CPU as though
they're one task.

The practical upshot of this is that each application is treated as one
task no matter how many threads it starts or children it forks. The
significance of this is that massively multithreaded applications such as
Java applications do not get any more CPU than if they were not threaded.

The unexpected side effect is that doing make -j (any number) will,
provided you don't run out of RAM, feel like no more load than make -j1 no
matter how many CPUs you have. The same goes for any application with
multiple threads or processes.

Note that this drastically changes the way CPU is proportioned under load,
as each application is seen as only one entity regardless of how many
children it forks or threads it starts. 'nice' is still respected.

For example, on my quad core machine, running make -j128 feels like no more
load than make -j1 except for when disk I/O occurs. The make -j128 proceeds
at a rate ever so slightly slower than the make -j4 (which is still
optimal).

This will need extensive testing to see what disadvantages may occur, as
some applications may have depended on getting more CPU by running multiple
processes. So far I have yet to encounter a workload where this is a
problem. Note that firefox, for example, has many threads and is contained
as one application with this patch.

It requires a change in mindset about how CPU is distributed in different
workloads, but I believe it will be ideal for the desktop user. Think of it
as implementing everything you want out of a more complex group scheduling
policy containing each application as an entity, but without the overhead
or any input or effort on the user's part.

Note that this does not have any effect on throughput either, unlike other
approaches to decreasing latency at load. Increasing jobs up to the number
of CPUs will still increase throughput if they're not competing with other
processes for CPU time.

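To make the arithmetic concrete, here is a minimal userspace sketch of the
deadline scaling this patch does in time_slice_expired() and of the cap it
applies in enqueue_task(). It is not kernel code: struct toy_task, the toy_*
functions and BASE_DEADLINE_DIFF_NS are hypothetical names, and the 6 ms
value is only an assumed stand-in for whatever task_deadline_diff() would
return for the task.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in for task_deadline_diff(p); 6 ms is illustrative only. */
#define BASE_DEADLINE_DIFF_NS 6000000ULL

/* Hypothetical, cut-down version of the two fields the patch relies on. */
struct toy_task {
	uint64_t deadline;       /* virtual deadline */
	uint64_t deadline_niffy; /* time at which the deadline was set */
};

/*
 * Same arithmetic as the patched time_slice_expired(): the deadline offset
 * is multiplied by the number of running threads in the group, but only
 * when more than one thread is running.
 */
static void toy_time_slice_expired(struct toy_task *p, uint64_t now,
				   unsigned long threads_running)
{
	uint64_t tdd = BASE_DEADLINE_DIFF_NS;

	if (threads_running > 1)
		tdd *= threads_running;
	p->deadline = now + tdd;
	p->deadline_niffy = now;
}

/*
 * Same idea as the cap in the patched enqueue_task(): a deadline set while
 * the group had more threads running is pulled back to the furthest offset
 * allowed for the current thread count.
 */
static void toy_cap_deadline(struct toy_task *p, unsigned long threads_running)
{
	uint64_t max_tdd = BASE_DEADLINE_DIFF_NS * threads_running;

	/* Assumption of this sketch: the deadline is not before its set time. */
	assert(p->deadline >= p->deadline_niffy);
	if (p->deadline - p->deadline_niffy > max_tdd)
		p->deadline = p->deadline_niffy + max_tdd;
}
```

With four threads running, each thread's deadline lands four base offsets
away instead of one, so under an earliest-deadline-first pick the group as a
whole should be chosen roughly as often as a single task would be. If two of
those threads then go to sleep, toy_cap_deadline() with a count of 2 pulls a
stale four-offset deadline back to two offsets.
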
To demonstrate the effect this will have, let's use the simplest example of
a dual core machine, and one fully CPU bound single threaded workload such
as a video encode that encodes 1000 frames per minute, competing with a
'make' compilation. Let's assume that 'make -j2' completes in one minute,
and 'make -j1' completes in 2 minutes.

Before this patch:

make -j1 and no encode: make finishes in 2 minutes
make -j2 and no encode: make finishes in 1 minute
make -j128 and no encode: make finishes in 1 minute
encode no make: 1000 frames are encoded per minute
make -j1 and encode: make finishes in 2 minutes, 1000 frames are encoded per minute
make -j2 and encode: make finishes in 1.5 minutes, 500 frames are encoded per minute
make -j4 and encode: make finishes in 1.25 minutes, 200 frames are encoded per minute
make -j24 and encode: make finishes in 1.04 minutes, 40 frames are encoded per minute
make -j128 and encode: make finishes in 1.01 minutes, 7 frames are encoded per minute
make -j2 and nice +19 encode: make finishes in 1.03 minutes, 30 frames are encoded per minute

After this patch:

make -j1 and no encode: make finishes in 2 minutes
make -j2 and no encode: make finishes in 1 minute
make -j128 and no encode: make finishes in 1 minute
encode no make: 1000 frames are encoded per minute
make -j1 and encode: make finishes in 2 minutes, 1000 frames are encoded per minute
make -j2 and encode: make finishes in 2 minutes, 1000 frames are encoded per minute
make -j4 and encode: make finishes in 2 minutes, 1000 frames are encoded per minute
make -j24 and encode: make finishes in 2 minutes, 1000 frames are encoded per minute
make -j128 and encode: make finishes in 1.08 minutes, 150 frames are encoded per minute
make -j2 and nice +19 encode: make finishes in 1.06 minutes, 60 frames are encoded per minute

-ck

---
 include/linux/sched.h |    4 +++-
 kernel/sched_bfs.c    |   50 ++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 49 insertions(+), 5 deletions(-)

Index: linux-2.6.35.7/include/linux/sched.h
===================================================================
--- linux-2.6.35.7.orig/include/linux/sched.h	2010-10-06 08:35:33.607634739 +1100
+++ linux-2.6.35.7/include/linux/sched.h	2010-10-06 08:40:37.475246693 +1100
@@ -1192,10 +1192,12 @@ struct task_struct {
 	unsigned int rt_priority;
 #ifdef CONFIG_SCHED_BFS
 	int time_slice;
-	u64 deadline;
+	/* Virtual deadline in niffies, and when the deadline was set */
+	u64 deadline, deadline_niffy;
 	struct list_head run_list;
 	u64 last_ran;
 	u64 sched_time; /* sched_clock time spent running */
+	unsigned long threads_running;
 
 	unsigned long rt_timeout;
 #else /* CONFIG_SCHED_BFS */
Index: linux-2.6.35.7/kernel/sched_bfs.c
===================================================================
--- linux-2.6.35.7.orig/kernel/sched_bfs.c	2010-10-06 08:35:33.601634349 +1100
+++ linux-2.6.35.7/kernel/sched_bfs.c	2010-10-06 08:56:51.798045442 +1100
@@ -635,11 +635,26 @@ static int isoprio_suitable(void)
 	return !grq.iso_refractory;
 }
 
+static inline u64 task_deadline_diff(struct task_struct *p);
+
 /*
  * Adding to the global runqueue. Enter with grq locked.
  */
 static void enqueue_task(struct task_struct *p)
 {
+	s64 max_tdd;
+
+	max_tdd = task_deadline_diff(p);
+
+	/*
+	 * Make sure that when we're queueing this task again that it
+	 * doesn't have any old deadlines from when the program group was
+	 * being penalised and cap the deadline to the highest it could
+	 * be, based on the current number of threads running.
+	 */
+	max_tdd *= p->group_leader->threads_running;
+	if (p->deadline - p->deadline_niffy > max_tdd)
+		p->deadline = p->deadline_niffy + max_tdd;
 	if (!rt_task(p)) {
 		/* Check it hasn't gotten rt from PI */
 		if ((idleprio_task(p) && idleprio_suitable(p)) ||
@@ -939,10 +954,13 @@ static int effective_prio(struct task_st
 }
 
 /*
- * activate_task - move a task to the runqueue. Enter with grq locked.
+ * activate_task - move a task to the runqueue. Enter with grq locked. The
+ * number of threads running is stored in the group_leader struct.
  */
 static void activate_task(struct task_struct *p, struct rq *rq)
 {
+	unsigned long *threads_running = &p->group_leader->threads_running;
+
 	update_clocks(rq);
 
 	/*
@@ -959,6 +977,13 @@ static void activate_task(struct task_st
 	p->prio = effective_prio(p);
 	if (task_contributes_to_load(p))
 		grq.nr_uninterruptible--;
+	/*
+	 * Adjust deadline according to number of running tasks/threads within
+	 * this program group. This ends up distributing CPU to the program
+	 * group as a single entity.
+	 */
+	if (++*threads_running > 1)
+		p->deadline += task_deadline_diff(p);
 	enqueue_task(p);
 	grq.nr_running++;
 	inc_qnr();
@@ -970,9 +995,13 @@ static void activate_task(struct task_st
  */
 static inline void deactivate_task(struct task_struct *p)
 {
+	unsigned long *threads_running = &p->group_leader->threads_running;
+
 	if (task_contributes_to_load(p))
 		grq.nr_uninterruptible++;
 	grq.nr_running--;
+	if (--*threads_running > 0)
+		p->deadline -= task_deadline_diff(p);
 }
 
 #ifdef CONFIG_SMP
@@ -2471,8 +2500,21 @@ static inline int ms_longest_deadline_di
  */
 static void time_slice_expired(struct task_struct *p)
 {
+	unsigned long *threads_running = &p->group_leader->threads_running;
+	u64 tdd = task_deadline_diff(p);
+
+	/*
+	 * We proportionately increase the deadline according to how many
+	 * threads are running. This effectively makes a thread group have
+	 * the same CPU as one task, no matter how many threads are running.
+	 * time_slice_expired can be called when there may be none running
+	 * when p is deactivated so we must explicitly test for more than 1.
+	 */
+	if (*threads_running > 1)
+		tdd *= *threads_running;
 	p->time_slice = timeslice();
-	p->deadline = grq.niffies + task_deadline_diff(p);
+	p->deadline = grq.niffies + tdd;
+	p->deadline_niffy = grq.niffies;
 }
 
 /*
@@ -3426,7 +3468,7 @@ SYSCALL_DEFINE1(nice, int, increment)
  *
  * This is the priority value as seen by users in /proc.
  * RT tasks are offset by -100. Normal tasks are centered around 1, value goes
- * from 0 (SCHED_ISO) up to 82 (nice +19 SCHED_IDLEPRIO).
+ * from 0 (SCHED_ISO) up to ~900 (nice +19 SCHED_IDLEPRIO).
  */
 int task_prio(const struct task_struct *p)
 {
@@ -3439,7 +3481,7 @@ int task_prio(const struct task_struct *
 	/* Convert to ms to avoid overflows */
 	delta = NS_TO_MS(p->deadline - grq.niffies);
 	delta = delta * 40 / ms_longest_deadline_diff();
-	if (delta > 0 && delta <= 80)
+	if (delta > 0)
 		prio += delta;
 	if (idleprio_task(p))
 		prio += 40;