FAQS about BFS. v0.330

Why did I write it?

After years of using my old kernel and numerous hardware upgrades, I finally
had hardware that needed a newer kernel for drivers and to try out the newer
filesystems. Booting the mainline kernel was relatively reassuring in that
the scheduler behaviour was much better than what was in earlier kernels.
However, it didn't take long before I started being disappointed in that too.
Random stalls in mouse movements, keypresses, strange cpu distribution in
various workloads and unpredictable behaviour all around were exactly what I
was hoping had gone away. So I did what I vowed never to do, looked at the
code. After seeing it had grown into a monster of epic proportions I sat down
and thought about what was wrong. One of the key features of fairness and
interactivity that I always argued for was very simple semantics for how
cpu should be distributed, with guaranteed low latencies so that interactivity
was assured by design instead of bolted on. CFS in essence does that, but it
does something else too. It varies timeslice length to try and preserve some
deadline list and it determines cpu distribution based on a run/sleep
relationship. It also is designed to scale to monster proportion hardware
that the common man will never see. The whole sleep calculation thing is
exactly what I found was responsible for making varied behaviour under
different loads and relative starvation and unfairness. It's not a profound
effect in CFS and that's admirable. It just doesn't behave the way I feel
the scheduler should, which is to be forward looking only (not calculating
sleep), and it doesn't really make the most of a relatively lightly loaded
machine without many many cpus. So I threw it all out and wrote exactly the
opposite.

What is it?

BFS is the Brain Fuck Scheduler. It was designed to be forward looking only,
make the most of lower spec machines, and not scale to massive hardware. ie
it is a desktop orientated scheduler, with extremely low latencies for
excellent interactivity by design rather than "calculated", with rigid
fairness, nice priority distribution and extreme scalability within normal
load levels.

Extreme scalability within normal load levels? Isn't that a contradiction?

For years we've been setting up our workloads on linux to have more jobs than
we had CPUs because we thought that the "jobservers" were limited in their
ability to utilise the CPUs effectively (so we did make -j6 or more on a
quad core machine for example). This scheduler proves that the jobservers
weren't at fault at all, because make -j4 on a quad core machine with BFS
is faster than *any* choice of job numbers on CFS. See reverse scalability
graph courtesy of Serge Belyshev showing various job numbers on a kernel build
on a quad core machine. The problem has always been that the mainline
scheduler can't keep the CPUs busy enough; ie it doesn't make the most of
your hardware in the most common situations on a desktop! Note that the
reverse scalability graph is old; the scalability has improved since then.

Why "Brain Fuck"?

Because it throws out everything about what we know is good about how to
design a modern scheduler in scalability.
Because it's so ridiculously simple.
Because it performs so ridiculously well on what it's good at despite being
that simple.
Because it's designed in such a way that mainline would never be interested
in adopting it, which is how I like it.
Because it will make people sit up and take notice of where the problems are
in the current design.
Because it throws out the philosophy that one scheduler fits all and shows
that you can do a -lot- better with a scheduler designed for a particular
purpose. I don't want to use a steamroller to crack nuts.
Because it actually means that more CPUs means better latencies.
Because I must be fucked in the head to be working on this again.
I'll think of some more becauses later.

How scalable is it?

I don't own the sort of hardware that is likely to suffer from using it, so
I can't find the upper limit. Based on first principles about the overhead
of locking, and the way lookups occur, I'd guess that a machine with 16 CPUs
or more would start to have exponentially less performance (thanks Ingo for
confirming this). Note that the number of logical CPUs is what affects BFS'
scalability, not the physical ones. By that I mean that a hyperthreaded quad
core CPU has 8 (EIGHT) logical CPUs. So it is NOT the same as a quad core
without hyperthreading.

Since version 0.300, scalability improvements have been added that should
further improve performance, including NUMA support! No scalability benchmarks
on very big machines have been performed on the new versions to compare their
performance.

The O(n) lookup of BFS will cause people some concern because of the
notation. However, if the actual overhead is very small, then even with
large values of n, it can be lower overhead than an O(1) design. Testing
this scheduler with the test app "forks", which forks 1000 tasks that
do simple work, shows no difference in time to completion compared to CFS.
That's a load of 1000 on a quad core machine. But note that BFS gets much
faster when the loads are lower and approximate the number of CPUs, which
is much more what you would experience on a desktop.
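
To make the O(n) point concrete, here is a minimal sketch of a linear lookup
over a single queue. It assumes each queued task carries a numeric deadline
and that the task with the earliest deadline gets picked. The names Task and
pick_next are made up for illustration, and this is not the BFS code itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    deadline: int  # hypothetical field: smaller means "should run sooner"


def pick_next(queue: List[Task]) -> Optional[Task]:
    """Linear O(n) scan: one comparison per queued task and nothing else."""
    best = None
    for task in queue:
        if best is None or task.deadline < best.deadline:
            best = task
    return best
```

When n is close to the number of CPUs, as it usually is on a desktop, the loop
body only runs a handful of times per lookup, which is why the notation alone
says little about the real cost.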


What about interbench numbers?

Interbench does too many jobs by default on the burn/compile tests. I've put
up interbench results from a quad core where the jobs (4) is equal to the
number of CPUs so the test is more meaningful, and added comments. It appears
I'm about the only person who understands interbench numbers since I wrote
the benchmark, so don't place too much weight on them. The 'latt' test app
recently written by Jens Axboe is a better source of useful numbers that are
simpler to understand. The last set of interbench numbers I posted comparing
mainline 2.6.35 and 2.6.35-ck1 (containing BFS) showed BFS outperformed
mainline in many areas.

What features does BFS have and not have?

On top of the current scheduler design, it has a SCHED_IDLEPRIO which actually
does only schedule tasks when idle, and SCHED_ISO for unprivileged realtime
performance. BFS does NOT implement CGROUPS. A desktop user should not need
to know about CGROUPS, nor should they need to use them. BFS also does not have
the feature of "lots of tunables I don't understand".

How do you recommend I use this?

It's designed so that you just patch it in and use it. You shouldn't need to
do anything at all. But since people still want to know every last thing...

THESE ARE OPTIONAL FOR LOWEST LATENCY. YOU DO NOT NEED THESE!
Configure your kernel with 1000Hz, preempt ON and disable dynamic ticks.
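
If you want to check an existing kernel .config against the suggestion above,
something like the following rough sketch could do it. It assumes the option
names CONFIG_HZ, CONFIG_PREEMPT and CONFIG_NO_HZ as used by kernels of this
era; your tree may name them differently, so treat the symbols and the helper
name check_low_latency_config as assumptions.

```python
def check_low_latency_config(config_text):
    """Return a list of settings that differ from the low latency suggestion."""
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_NO_HZ is not set" start with "#" and are skipped.
        if line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            settings[key] = value

    problems = []
    if settings.get("CONFIG_HZ") != "1000":
        problems.append("CONFIG_HZ is not 1000")
    if settings.get("CONFIG_PREEMPT") != "y":
        problems.append("CONFIG_PREEMPT is not enabled")
    if settings.get("CONFIG_NO_HZ") == "y":
        problems.append("CONFIG_NO_HZ (dynamic ticks) is enabled")
    return problems
```

An empty list means the config matches the optional low latency settings.
Anything else is only a hint, since none of these settings are required.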

What about other Hz, no preempt and dynamic ticks?

The reason I recommend the above settings is these will have the best possible
latencies which is by far the most important feature of a scheduler on a
desktop. How much each of these options affects your throughput or power
consumption will vary wildly between hardware and workloads. If your major
concern is power consumption, enabling dynamic ticks with 1000Hz should bring
you to power usage similar to that of a 100Hz config without the disadvantage.
100Hz does seriously harm the best possible latencies on BFS and I'd at least
recommend 300. This would be, for example, a suitable android phone setting
(300Hz, preempt on, dynticks on). The only reason for disabling preempt is if
your workload is never latency sensitive. If you were planning on running a
distributed computing client (such as folding@home) and nothing else, then
100Hz, no preempt and no dynticks would be best. Also, turning up the
rr_interval would be beneficial (300 seems to be optimal and higher values seem
to not derive any further benefit). Running it on a real server I'd recommend
the default rr_interval, 100Hz, no preempt and dynticks ON (for power saving).

Other tweaks I can try?

You should virtually never need to tune BFS. The only tunable for the
scheduler itself is the rr_interval value (see documentation). Try 3ms if
latency is everything to you. When compiling software, do not use more jobs
than you have CPUs! So make -j2 on dual core, -j4 on quad core and so on.
Nice levels are strictly obeyed so if you nice your compiles they'll be
virtually unnoticeable. (nice -n 19 make -j2). Run your distributed computing
clients SCHED_IDLEPRIO (eg folding at home, mprime etc):
schedtool -D -e mprime
This will make your distributed computing client *never* cause slowdowns in
your other userspace applications, at the cost of slightly slower progress
of the client.
Run your audio and video apps SCHED_ISO:
schedtool -I -e amarok
This will run amarok as an unprivileged real-time task. Note that if you start
an application that tries to get real-time scheduling (eg jackd) and you are
not starting it as root, BFS will automatically elevate it to SCHED_ISO for
you to give it the next best thing.
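
The commands above can be put together by a small helper. This is only a
sketch: the function names are hypothetical, and it assumes that
os.cpu_count() reports logical CPUs (so hyperthreads are counted), which is
the number that matters for picking a job count here.

```python
import os


def logical_cpus():
    """Logical CPUs as seen by the OS (hyperthreads count), at least 1."""
    return os.cpu_count() or 1


def compile_command(jobs=None, niced=True):
    """Build a make invocation with one job per logical CPU, never more."""
    if jobs is None:
        jobs = logical_cpus()
    cmd = ["make", "-j%d" % jobs]
    if niced:
        cmd = ["nice", "-n", "19"] + cmd
    return cmd


def idleprio_command(program):
    """Wrap a program so schedtool starts it as SCHED_IDLEPRIO (-D)."""
    return ["schedtool", "-D", "-e"] + list(program)


def iso_command(program):
    """Wrap a program so schedtool starts it as SCHED_ISO (-I)."""
    return ["schedtool", "-I", "-e"] + list(program)
```

For example, compile_command(jobs=2) gives the same "nice -n 19 make -j2"
shown above, and the returned lists can be handed to subprocess.run() as is.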

NUMA aware?

It is NOT NUMA aware in the sense that it does any fancy shit on NUMA, but
it will work on NUMA hardware just fine. Only the really big NUMA hardware
is likely to suffer in performance, and this is theoretical only, since
no one has that sort of hardware to prove it to me, but it seems almost
certain. v0.300 onwards have NUMA enhancements.

Multicore processors?

This is where BFS shines.

Single processors?

Single processors benefit a lot from BFS too.

Low power machines? Phones?

BFS is a very low overhead CPU scheduler. These can benefit a lot from it.

Realtime tasks?

Realtime tasks work just like they do on mainline on uniprocessor machines.
However, BFS has the added advantage of dropping unprivileged tasks to
SCHED_ISO as mentioned above. Most people would not be aware, though, that
BFS has a major advantage when it comes to realtime tasks running
SCHED_FIFO or SCHED_RR on SMP machines. Because BFS uses one global runqueue
for all tasks on the system, when a realtime task wants CPU time, BFS will
find the most suitable CPU anywhere in the system to run it on, by kicking
off the lower priority task running anywhere. Unless the realtime tasks are
specifically programmed to bind to discrete CPUs, the separate runqueue
designed schedulers (like mainline CFS) can get into a situation where two
realtime tasks are running on the same CPU, and one will wait for the other
to finish when there may well be a non-realtime task running on another CPU.
This can cause MASSIVE latencies even though the task is running realtime.
Even the realtime-patchset does not prevent this situation from occurring,
whereas BFS avoids this by design.
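
The difference can be shown with a toy model. It is grossly simplified and
rests on assumptions: each CPU is reduced to the importance of the task it is
running (higher wins), and place_per_cpu only models the moment described
above, where a realtime task is queued on one CPU and nothing else is
considered. Both function names are made up for illustration.

```python
def place_global(running, new_prio):
    """One global runqueue: preempt the least important task on any CPU.

    running[cpu] is the importance of the task on that CPU (higher wins).
    Returns the CPU the new task lands on, or None if it has to wait.
    """
    if not running:
        return None
    victim = min(range(len(running)), key=lambda cpu: running[cpu])
    if running[victim] < new_prio:
        return victim
    return None


def place_per_cpu(running, new_prio, home_cpu):
    """Separate runqueues, no balancing: only the home CPU is considered."""
    if running[home_cpu] < new_prio:
        return home_cpu
    return None
```

With running = [50, 0], meaning a realtime task on CPU 0 and an ordinary task
on CPU 1, a second realtime task of importance 50 queued on CPU 0 has to wait
in the per-CPU model, while the global model puts it straight onto CPU 1.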

I found a bug!

Great! Help me debug it. A scheduler is subtle and quick to anger. If you can
code then delve away and see what you can find! I'll take help from anyone.
It's a major ordeal trying to get this thing working on all sorts of hardware.
You can't code? Give me whatever details you've got and I'll see what I can
do. As per usual this stuff comes with no guarantee, and I do not have
infinite time to spend on it. I do NOT get paid to do this and do it just for
the fun of it. I'll do whatever I can to help you but I cannot support this
like a paid project. I'd *love* to see people hacking on the code themselves.

Are you looking at getting this into mainline?

LOL.

No really, are you?

LOL.

Really really, are you?

No. They would be crazy to use this scheduler anyway since it won't scale to
their 4096 cpu machines. The only way is to rewrite it to work that way, or
to have more than one scheduler in the kernel. I don't want to do the former,
and mainline doesn't want to do the latter. Besides, apparently I'm a bad
maintainer, which makes sense since for some reason I seem to want to have
a career, a life, raise a family with kids and have hobbies, all of which
have nothing to do with linux.

Can it be made to scale to 4096 CPUs?

Sure I guess you could run one runqueue per CPU package instead of a global
one and so on, but I have no intention whatsoever of doing that because it
will compromise the performance where *I* care.

Is this stable?

Yes.

Currently known problems?

Nil.

GIT repository?

Sorry, it's not the right tool for me so it's not worth me investing the time
in setting one up.

WINE Games?

Earlier suggestions that wine games would perform poorly on BFS were
unfounded. All the reports I've received are that they perform better.

CPU Accounting is different?

The mainline kernel simply samples cpu accounting per tick. This can be
very inaccurate for very short lived processes. BFS uses accounting of actual
cpu time consumed within the limitations of the current kernel and hardware
and may report very different cpu usage. Do not use the reported cpu usage
between different kernels as some marker of performance. Compare actual work
progress instead (eg. when comparing distributed computing clients).
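
A toy model shows how far the two kinds of accounting can drift apart. It
assumes that tick sampling charges one whole tick to whatever is running at
the instant the tick fires. The function names and the microsecond units are
illustrative only, and this is not kernel code.

```python
def tick_sampled_us(run_intervals, tick_us, total_us):
    """Charge one whole tick each time a tick lands while the task runs."""
    charged = 0
    for t in range(0, total_us, tick_us):
        if any(start <= t < end for start, end in run_intervals):
            charged += tick_us
    return charged


def actual_us(run_intervals):
    """Sum the time the task really spent on the CPU."""
    return sum(end - start for start, end in run_intervals)


# A task that runs for 300us in the middle of every 1000us tick period.
between_ticks = [(k * 1000 + 100, k * 1000 + 400) for k in range(10)]

# A task that runs for only 100us, but always right when the tick fires.
on_ticks = [(k * 1000, k * 1000 + 100) for k in range(10)]
```

Over 10000us the first task really uses 3000us yet is sampled at 0, while the
second really uses 1000us yet is charged the full 10000us. This is why
reported cpu usage is a poor marker of performance between kernels.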

Will I be maintaining this, even though mainline won't have it?

Yes. For the foreseeable future at least. Once the bugs are ironed out, it
shouldn't be too much effort to keep in sync with mainline.

What is the relationship of BFS to SD/RSDL?

BFS implements the same forward-looking rigidly fair timeslicer philosophy
that SD used, in a foreground - background design, but is otherwise a
completely different scheduler. It is an attempt to fuse all the ideas I
have about scheduling so far, without worrying about trying to make it
infinitely scalable.


Thanks to the guys in irc.oftc.net #ck for inspiration to work on this and
early testing! Many of them sat idle in that channel for years while nothing
happened. The xkcd comic supported_features also helped motivate this
work. Yes I know you probably still can't watch full screen videos on youtube,
but that's not entirely the scheduler's fault.


Work in progress.

Last updated: Sun, 5 Sep 2010
Con Kolivas <kernel@kolivas.org>