diff --git a/Documentation/scheduler/sched-BFS.txt b/Documentation/scheduler/sched-BFS.txt index 6470f30..c028200 100644 --- a/Documentation/scheduler/sched-BFS.txt +++ b/Documentation/scheduler/sched-BFS.txt @@ -13,14 +13,14 @@ one workload cause massive detriment to another. Design summary. -BFS is best described as a single runqueue, O(log n) insertion, O(1) lookup, -earliest effective virtual deadline first design, loosely based on EEVDF -(earliest eligible virtual deadline first) and my previous Staircase Deadline -scheduler. Each component shall be described in order to understand the -significance of, and reasoning for it. The codebase when the first stable -version was released was approximately 9000 lines less code than the existing -mainline linux kernel scheduler (in 2.6.31). This does not even take into -account the removal of documentation and the cgroups code that is not used. +BFS is best described as a single runqueue, O(n) lookup, earliest effective +virtual deadline first design, loosely based on EEVDF (earliest eligible virtual +deadline first) and my previous Staircase Deadline scheduler. Each component +shall be described in order to understand the significance of, and reasoning for +it. The codebase when the first stable version was released was approximately +9000 lines less code than the existing mainline linux kernel scheduler (in +2.6.31). This does not even take into account the removal of documentation and +the cgroups code that is not used. Design reasoning. @@ -62,13 +62,12 @@ Design details. Task insertion. -BFS inserts tasks into each relevant queue as an O(log n) insertion into a -customised skip list (as described by William Pugh). At the time of insertion, -*every* running queue is checked to see if the newly queued task can run on any -idle queue, or preempt the lowest running task on the system. This is how the -cross-CPU scheduling of BFS achieves significantly lower latency per extra CPU -the system has. In this case the lookup is, in the worst case scenario, O(k) -where k is the number of online CPUs on the system. +BFS inserts tasks into each relevant queue as an O(1) insertion into a double +linked list. On insertion, *every* running queue is checked to see if the newly +queued task can run on any idle queue, or preempt the lowest running task on the +system. This is how the cross-CPU scheduling of BFS achieves significantly lower +latency per extra CPU the system has. In this case the lookup is, in the worst +case scenario, O(n) where n is the number of CPUs on the system. Data protection. @@ -93,7 +92,7 @@ the virtual deadline mechanism is explained. Virtual deadline. The key to achieving low latency, scheduling fairness, and "nice level" -distribution in BFS is entirely in the virtual deadline mechanism. The related +distribution in BFS is entirely in the virtual deadline mechanism. The one tunable in BFS is the rr_interval, or "round robin interval". This is the maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy) tasks of the same nice level will be running for, or looking at it the other @@ -118,7 +117,7 @@ higher priority than a currently running task on any cpu by virtue of the fact that it has an earlier virtual deadline than the currently running task. The earlier deadline is the key to which task is next chosen for the first and second cases. 
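+
+To make the preemption rule concrete, here is a minimal sketch of the
+comparison described above (the struct and function names are simplified
+illustrations for this document, not the actual BFS code):
+
+	#include <stdbool.h>
+	#include <stdint.h>
+
+	/* Simplified stand-ins for the scheduler's per-task state. */
+	struct task {
+		int prio;		/* static priority, lower is better */
+		uint64_t deadline;	/* virtual deadline */
+	};
+
+	/*
+	 * A newly woken task preempts a running one if it has better static
+	 * priority, or equal priority and an earlier virtual deadline.
+	 */
+	static bool should_preempt(const struct task *waking,
+				   const struct task *running)
+	{
+		if (waking->prio != running->prio)
+			return waking->prio < running->prio;
+		return waking->deadline < running->deadline;
+	}
+
+In BFS proper this comparison is made against the lowest priority currently
+running task on the system at wakeup time, which is how the cross-CPU
+preemption described earlier is achieved.
+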
Once a task is descheduled, it is put back on the queue, and an -O(1) lookup of all queued-but-not-running tasks is done to determine which has +O(n) lookup of all queued-but-not-running tasks is done to determine which has the earliest deadline and that task is chosen to receive CPU next. The CPU proportion of different nice tasks works out to be approximately the @@ -135,40 +134,26 @@ Task lookup. BFS has 103 priority queues. 100 of these are dedicated to the static priority of realtime tasks, and the remaining 3 are, in order of best to worst priority, -SCHED_ISO (isochronous), SCHED_NORMAL/SCHED_BATCH, and SCHED_IDLEPRIO (idle -priority scheduling). - -When a task of these priorities is queued, it is added to the skiplist with a -different sorting value according to the type of task. For realtime tasks and -isochronous tasks, it is their static priority. For SCHED_NORMAL and -SCHED_BATCH tasks it is their virtual deadline value. For SCHED_IDLEPRIO tasks -it is their virtual deadline value offset by an impossibly large value to ensure -they never go before normal tasks. When isochronous or idleprio tasks do not -meet the conditions that allow them to run with their special scheduling they -are queued as per the remainder of the SCHED_NORMAL tasks. - -Lookup is performed by selecting the very first entry in the "level 0" skiplist -as it will always be the lowest priority task having been sorted while being -entered into the skiplist. This is usually an O(1) operation, however if there -are tasks with limited affinity set and they are not able to run on the current -CPU, the next in the list is checked and so on. - -Thus, the lookup for the common case is O(1) and O(n) in the worst case when -the system has nothing but selectively affined tasks that can never run on the -current CPU. - - -Task removal. - -Removal of tasks in the skip list is an O(k) operation where 0 <= k < 16, -corresponding with the "levels" in the skip list. 16 was chosen as the upper -limit in the skiplist as it guarantees O(log n) insertion for up to 64k -currently active tasks and most systems do not usually allow more than 32k -tasks, and 16 levels makes the skiplist lookup components fit in 2 cachelines. -The skiplist level chosen when inserting a task is pseudo-random but a minor -optimisation is used to limit the max level based on the absolute number of -queued tasks since high levels afford no advantage at low numbers of queued -tasks yet increase overhead. +SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority +scheduling). When a task of these priorities is queued, a bitmap of running +priorities is set showing which of these priorities has tasks waiting for CPU +time. When a CPU is made to reschedule, the lookup for the next task to get +CPU time is performed in the following way: + +First the bitmap is checked to see what static priority tasks are queued. If +any realtime priorities are found, the corresponding queue is checked and the +first task listed there is taken (provided CPU affinity is suitable) and lookup +is complete. If the priority corresponds to a SCHED_ISO task, they are also +taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds +to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this +stage, every task in the runlist that corresponds to that priority is checked +to see which has the earliest set deadline, and (provided it has suitable CPU +affinity) it is taken off the runqueue and given the CPU. 
If a task has an +expired deadline, it is taken and the rest of the lookup aborted (as they are +chosen in FIFO order). + +Thus, the lookup is O(n) in the worst case only, where n is as described +earlier, as tasks may be chosen before the whole task list is looked over. Scalability. @@ -191,17 +176,30 @@ when it has been deemed their overhead is so marginal that they're worth adding. The first is the local copy of the running process' data to the CPU it's running on to allow that data to be updated lockless where possible. Then there is deference paid to the last CPU a task was running on, by trying that CPU first -when looking for an idle CPU to use the next time it's scheduled. - -The real cost of migrating a task from one CPU to another is entirely dependant -on the cache footprint of the task, how cache intensive the task is, how long -it's been running on that CPU to take up the bulk of its cache, how big the CPU -cache is, how fast and how layered the CPU cache is, how fast a context switch -is... and so on. In other words, it's close to random in the real world where we -do more than just one sole workload. The only thing we can be sure of is that -it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and -utilising idle CPUs is more important than cache locality, and cache locality -only plays a part after that. +when looking for an idle CPU to use the next time it's scheduled. Finally there +is the notion of cache locality beyond the last running CPU. The sched_domains +information is used to determine the relative virtual "cache distance" that +other CPUs have from the last CPU a task was running on. CPUs with shared +caches, such as SMT siblings, or multicore CPUs with shared caches, are treated +as cache local. CPUs without shared caches are treated as not cache local, and +CPUs on different NUMA nodes are treated as very distant. This "relative cache +distance" is used by modifying the virtual deadline value when doing lookups. +Effectively, the deadline is unaltered between "cache local" CPUs, doubled for +"cache distant" CPUs, and quadrupled for "very distant" CPUs. The reasoning +behind the doubling of deadlines is as follows. The real cost of migrating a +task from one CPU to another is entirely dependant on the cache footprint of +the task, how cache intensive the task is, how long it's been running on that +CPU to take up the bulk of its cache, how big the CPU cache is, how fast and +how layered the CPU cache is, how fast a context switch is... and so on. In +other words, it's close to random in the real world where we do more than just +one sole workload. The only thing we can be sure of is that it's not free. So +BFS uses the principle that an idle CPU is a wasted CPU and utilising idle CPUs +is more important than cache locality, and cache locality only plays a part +after that. Doubling the effective deadline is based on the premise that the +"cache local" CPUs will tend to work on the same tasks up to double the number +of cache local CPUs, and once the workload is beyond that amount, it is likely +that none of the tasks are cache warm anywhere anyway. The quadrupling for NUMA +is a value I pulled out of my arse. When choosing an idle CPU for a waking task, the cache locality is determined according to where the task last ran and then idle CPUs are ranked from best @@ -209,26 +207,31 @@ to worst to choose the most suitable idle CPU based on cache locality, NUMA node locality and hyperthread sibling business. 
They are chosen in the following preference (if idle): - * Same thread, idle or busy cache, idle or busy threads - * Other core, same cache, idle or busy cache, idle threads. - * Same node, other CPU, idle cache, idle threads. - * Same node, other CPU, busy cache, idle threads. - * Other core, same cache, busy threads. - * Same node, other CPU, busy threads. - * Other node, other CPU, idle cache, idle threads. - * Other node, other CPU, busy cache, idle threads. - * Other node, other CPU, busy threads. +* Same core, idle or busy cache, idle threads +* Other core, same cache, idle or busy cache, idle threads. +* Same node, other CPU, idle cache, idle threads. +* Same node, other CPU, busy cache, idle threads. +* Same core, busy threads. +* Other core, same cache, busy threads. +* Same node, other CPU, busy threads. +* Other node, other CPU, idle cache, idle threads. +* Other node, other CPU, busy cache, idle threads. +* Other node, other CPU, busy threads. This shows the SMT or "hyperthread" awareness in the design as well which will choose a real idle core first before a logical SMT sibling which already has -tasks on the physical CPU. Early benchmarking of BFS suggested scalability -dropped off at the 16 CPU mark. However this benchmarking was performed on an -earlier design that was far less scalable than the current one so it's hard to -know how scalable it is in terms of number of CPUs (due to the global -runqueue). Note that in terms of scalability, the number of _logical_ CPUs -matters, not the number of _physical_ CPUs. Thus, a dual (2x) quad core (4X) -hyperthreaded (2X) machine is effectively a 16X. Newer benchmark results are -very promising indeed. Benchmark contributions are most welcome. +tasks on the physical CPU. + +Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark. +However this benchmarking was performed on an earlier design that was far less +scalable than the current one so it's hard to know how scalable it is in terms +of both CPUs (due to the global runqueue) and heavily loaded machines (due to +O(n) lookup) at this stage. Note that in terms of scalability, the number of +_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x) +quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer benchmark +results are very promising indeed, without needing to tweak any knobs, features +or options. Benchmark contributions are most welcome. + Features @@ -241,43 +244,30 @@ and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is support for CGROUPS. The average user should neither need to know what these are, nor should they need to be using them to have good desktop behaviour. -Rudimentary support for the CPU controller CGROUP in the form of filesystem -stubs for the expected CGROUP structure to allow applications that demand their -presence to work but they do not have any functionality. -There are two "scheduler" tunables, the round robin interval and the -interactive flag. These can be accessed in +rr_interval + +There is only one "scheduler" tunable, the round robin interval. This can be +accessed in /proc/sys/kernel/rr_interval - /proc/sys/kernel/interactive - -rr_interval value - -The value is in milliseconds, and the default value is set to 6ms. Valid values -are from 1 to 1000. 
Decreasing the value will decrease latencies at the cost of -decreasing throughput, while increasing it will improve throughput, but at the -cost of worsening latencies. The accuracy of the rr interval is limited by HZ -resolution of the kernel configuration. Thus, the worst case latencies are -usually slightly higher than this actual value. BFS uses "dithering" to try and -minimise the effect the Hz limitation has. The default value of 6 is not an -arbitrary one. It is based on the fact that humans can detect jitter at -approximately 7ms, so aiming for much lower latencies is pointless under most -circumstances. It is worth noting this fact when comparing the latency -performance of BFS to other schedulers. Worst case latencies being higher than -7ms are far worse than average latencies not being in the microsecond range. -Experimentation has shown that rr intervals being increased up to 300 can -improve throughput but beyond that, scheduling noise from elsewhere prevents -further demonstrable throughput. - -interactive flag - -This is a simple boolean that can be set to 1 or 0, set to 1 by default. This -sacrifices some of the interactive performance by giving tasks a degree of -soft affinity for logical CPUs when it will lead to improved throughput, but -enabling it also sacrifices the completely deterministic nature with respect -to latency that BFS otherwise normally provides, and subsequently leads to -slightly higher latencies and a noticeably less interactive system. +The value is in milliseconds, and the default value is set to 6 on a +uniprocessor machine, and automatically set to a progressively higher value on +multiprocessor machines. The reasoning behind increasing the value on more CPUs +is that the effective latency is decreased by virtue of there being more CPUs on +BFS (for reasons explained above), and increasing the value allows for less +cache contention and more throughput. Valid values are from 1 to 1000 +Decreasing the value will decrease latencies at the cost of decreasing +throughput, while increasing it will improve throughput, but at the cost of +worsening latencies. The accuracy of the rr interval is limited by HZ resolution +of the kernel configuration. Thus, the worst case latencies are usually slightly +higher than this actual value. The default value of 6 is not an arbitrary one. +It is based on the fact that humans can detect jitter at approximately 7ms, so +aiming for much lower latencies is pointless under most circumstances. It is +worth noting this fact when comparing the latency performance of BFS to other +schedulers. Worst case latencies being higher than 7ms are far worse than +average latencies not being in the microsecond range. Isochronous scheduling. @@ -358,4 +348,4 @@ of total wall clock time taken and total work done, rather than the reported "cpu usage". -Con Kolivas Tue, 5 Apr 2011 +Con Kolivas Fri Aug 27 2010 diff --git a/Documentation/scheduler/sched-MuQSS.txt b/Documentation/scheduler/sched-MuQSS.txt index 2521d1a..bbd6980 100644 --- a/Documentation/scheduler/sched-MuQSS.txt +++ b/Documentation/scheduler/sched-MuQSS.txt @@ -1,9 +1,10 @@ MuQSS - The Multiple Queue Skiplist Scheduler by Con Kolivas. -See sched-BFS.txt for basic design; MuQSS is a per-cpu runqueue variant with +MuQSS is a per-cpu runqueue variant of the original BFS scheduler with one 8 level skiplist per runqueue, and fine grained locking for much more scalability. + Goals. 
The goal of the Multiple Queue Skiplist Scheduler, referred to as MuQSS from @@ -19,11 +20,11 @@ scalable to many CPUs and processes. Design summary. MuQSS is best described as per-cpu multiple runqueue, O(log n) insertion, O(1) -lookup, earliest effective virtual deadline first design, loosely based on EEVDF -(earliest eligible virtual deadline first) and my previous Staircase Deadline -scheduler, and evolved from the single runqueue O(n) BFS scheduler. Each -component shall be described in order to understand the significance of, and -reasoning for it. +lookup, earliest effective virtual deadline first tickless design, loosely based +on EEVDF (earliest eligible virtual deadline first) and my previous Staircase +Deadline scheduler, and evolved from the single runqueue O(n) BFS scheduler. +Each component shall be described in order to understand the significance of, +and reasoning for it. Design reasoning. @@ -66,13 +67,279 @@ next task scheduling decision and task wakeup CPU choice to allow balancing to happen by virtue of its choices. -Design: +Design details. + +Custom skip list implementation: + +To avoid the overhead of building up and tearing down skip list structures, +the variant used by MuQSS has a number of optimisations making it specific for +its use case in the scheduler. It uses static arrays of 8 'levels' instead of +building up and tearing down structures dynamically. This makes each runqueue +only scale O(log N) up to 256 tasks. However as there is one runqueue per CPU +it means that it scales O(log N) up to 256 x number of logical CPUs which is +far beyond the realistic task limits each CPU could handle. By being 8 levels +it also makes the array exactly one cacheline in size. Additionally, each +skip list node is bidirectional making insertion and removal amortised O(1), +being O(k) where k is 1-8. Uniquely, we are only ever interested in the very +first entry in each list at all times with MuQSS, so there is never a need to +do a search and thus look up is always O(1). + +Task insertion: + +MuQSS inserts tasks into a per CPU runqueue as an O(log N) insertion into +a custom skip list as described above (based on the original design by William +Pugh). Insertion is ordered in such a way that there is never a need to do a +search by ordering tasks according to static priority primarily, and then +virtual deadline at the time of insertion. + +Niffies: + +Niffies are a monotonic forward moving timer not unlike the "jiffies" but are +of nanosecond resolution. Niffies are calculated per-runqueue from the high +resolution TSC timers, and in order to maintain fairness are synchronised +between CPUs whenever both runqueues are locked concurrently. + +Virtual deadline: + +The key to achieving low latency, scheduling fairness, and "nice level" +distribution in MuQSS is entirely in the virtual deadline mechanism. The one +tunable in MuQSS is the rr_interval, or "round robin interval". This is the +maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy) +tasks of the same nice level will be running for, or looking at it the other +way around, the longest duration two tasks of the same nice level will be +delayed for. When a task requests cpu time, it is given a quota (time_slice) +equal to the rr_interval and a virtual deadline. 
The virtual deadline is
+offset from the current time in niffies by this equation:
+
+ niffies + (prio_ratio * rr_interval)
+
+The prio_ratio is determined as a ratio compared to the baseline of nice -20
+and increases by 10% per nice level. The deadline is a virtual one only in that
+no guarantee is placed that a task will actually be scheduled by this time, but
+it is used to compare which task should go next. There are three components to
+how a task is next chosen. First is time_slice expiration. If a task runs out
+of its time_slice, it is descheduled, the time_slice is refilled, and the
+deadline reset to that formula above. Second is sleep, where a task no longer
+is requesting CPU for whatever reason. The time_slice and deadline are _not_
+adjusted in this case and are just carried over for when the task is next
+scheduled. Third is preemption, and that is when a newly waking task is deemed
+higher priority than a currently running task on any cpu by virtue of the fact
+that it has an earlier virtual deadline than the currently running task. The
+earlier deadline is the key to which task is next chosen for the first and
+second cases.
+
+The CPU proportion of different nice tasks works out to be approximately the
+
+ (prio_ratio difference)^2
+
+The reason it is squared is that a task's deadline does not change while it is
+running unless it runs out of time_slice. Thus, even if the time actually
+passes the deadline of another task that is queued, it will not get CPU time
+unless the current running task deschedules, and the time "base" (niffies) is
+constantly moving.
+
+Task lookup:
+
+As tasks are already pre-ordered according to anticipated scheduling order in
+the skip lists, lookup for the next suitable task per-runqueue is always a
+matter of simply selecting the first task in the 0th level skip list entry.
+In order to maintain optimal latency and fairness across CPUs, MuQSS does a
+novel examination of every other runqueue in cache locality order, choosing the
+best task across all runqueues. This provides near-determinism of how long any
+task across the entire system may wait before receiving CPU time. The other
+runqueues are first examined locklessly and then trylocked to minimise the
+potential lock contention if they are likely to have a suitable better task.
+Each other runqueue lock is only held for as long as it takes to examine the
+entry for suitability. In "interactive" mode, the default setting, MuQSS will
+look for the best deadline task across all CPUs, while in !interactive mode,
+it will only select a better deadline task from another CPU if it is more
+heavily laden than the current one.
+
+Lookup is therefore O(k) where k is the number of CPUs.
+
+
+Latency.
+
+Through the use of virtual deadlines to govern the scheduling order of normal
+tasks, queue-to-activation latency per runqueue is guaranteed to be bound by
+the rr_interval tunable which is set to 6ms by default. This means that the
+longest a CPU bound task will wait for more CPU is proportional to the number
+of running tasks and in the common case of 0-2 running tasks per CPU, will be
+under the 7ms threshold for human perception of jitter. Additionally, as newly
+woken tasks will have an early deadline from their previous runtime, the very
+tasks that are usually latency sensitive will have the shortest interval for
+activation, usually preempting any existing CPU bound tasks.
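+
+As a rough worked example of the deadline formula above (a sketch only: the
+baseline value of 128, the compounding of the 10% step per nice level, and
+treating niffies as plain nanoseconds are assumptions made for illustration,
+not values taken from the implementation):
+
+	#include <stdint.h>
+	#include <stdio.h>
+
+	#define RR_INTERVAL_MS	6ULL			/* default rr_interval */
+	#define MS_TO_NS(x)	((x) * 1000000ULL)
+
+	/* prio_ratio relative to nice -20, assumed to compound 10% per level. */
+	static uint64_t prio_ratio(int nice)
+	{
+		uint64_t ratio = 128;			/* assumed baseline */
+		int level;
+
+		for (level = -20; level < nice; level++)
+			ratio = ratio * 11 / 10;
+		return ratio;
+	}
+
+	/* Deadline: niffies + (prio_ratio * rr_interval), scaled to the baseline. */
+	static uint64_t virtual_deadline(uint64_t niffies, int nice)
+	{
+		return niffies + prio_ratio(nice) * MS_TO_NS(RR_INTERVAL_MS) / 128;
+	}
+
+	int main(void)
+	{
+		/* Two tasks queued at the same instant (niffies = 0). */
+		printf("nice  0 deadline offset: %llu ns\n",
+		       (unsigned long long)virtual_deadline(0, 0));
+		printf("nice 19 deadline offset: %llu ns\n",
+		       (unsigned long long)virtual_deadline(0, 19));
+		return 0;
+	}
+
+The nice 19 task simply receives a deadline several times further in the
+future; the roughly squared difference in actual CPU share then follows from
+deadlines only being reset when a time_slice is used up, as explained above.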
+ +Tickless expiry: + +A feature of MuQSS is that it is not tied to the resolution of the chosen tick +rate in Hz, instead depending entirely on the high resolution timers where +possible for sub-millisecond accuracy on timeouts regarless of the underlying +tick rate. This allows MuQSS to be run with the low overhead of low Hz rates +such as 100 by default, benefiting from the improved throughput and lower +power usage it provides. Another advantage of this approach is that in +combination with the Full No HZ option, which disables ticks on running task +CPUs instead of just idle CPUs, the tick can be disabled at all times +regardless of how many tasks are running instead of being limited to just one +running task. Note that this option is NOT recommended for regular desktop +users. + + +Scalability and balancing. + +Unlike traditional approaches where balancing is a combination of CPU selection +at task wakeup and intermittent balancing based on a vast array of rules set +according to architecture, busyness calculations and special case management, +MuQSS indirectly balances on the fly at task wakeup and next task selection. +During initialisation, MuQSS creates a cache coherency ordered list of CPUs for +each logical CPU and uses this to aid task/CPU selection when CPUs are busy. +Additionally it selects any idle CPUs, if they are available, at any time over +busy CPUs according to the following preference: + + * Same thread, idle or busy cache, idle or busy threads + * Other core, same cache, idle or busy cache, idle threads. + * Same node, other CPU, idle cache, idle threads. + * Same node, other CPU, busy cache, idle threads. + * Other core, same cache, busy threads. + * Same node, other CPU, busy threads. + * Other node, other CPU, idle cache, idle threads. + * Other node, other CPU, busy cache, idle threads. + * Other node, other CPU, busy threads. + +Mux is therefore SMT, MC and Numa aware without the need for extra +intermittent balancing to maintain CPUs busy and make the most of cache +coherency. + + +Features + +As the initial prime target audience for MuQSS was the average desktop user, it +was designed to not need tweaking, tuning or have features set to obtain benefit +from it. Thus the number of knobs and features has been kept to an absolute +minimum and should not require extra user input for the vast majority of cases. +There are 3 optional tunables, and 2 extra scheduling policies. The rr_interval, +interactive, and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO +policies. In addition to this, MuQSS also uses sub-tick accounting. What MuQSS +does _not_ now feature is support for CGROUPS. The average user should neither +need to know what these are, nor should they need to be using them to have good +desktop behaviour. However since some applications refuse to work without +cgroups, one can enable them with MuQSS as a stub and the filesystem will be +created which will allow the applications to work. + +rr_interval: + + /proc/sys/kernel/rr_interval + +The value is in milliseconds, and the default value is set to 6. Valid values +are from 1 to 1000 Decreasing the value will decrease latencies at the cost of +decreasing throughput, while increasing it will improve throughput, but at the +cost of worsening latencies. It is based on the fact that humans can detect +jitter at approximately 7ms, so aiming for much lower latencies is pointless +under most circumstances. It is worth noting this fact when comparing the +latency performance of MuQSS to other schedulers. 
Worst case latencies being +higher than 7ms are far worse than average latencies not being in the +microsecond range. + +interactive: + + /proc/sys/kernel/interactive + +The value is a simple boolean of 1 for on and 0 for off and is set to on by +default. Disabling this will disable the near-determinism of MuQSS when +selecting the next task by not examining all CPUs for the earliest deadline +task, or which CPU to wake to, instead prioritising CPU balancing for improved +throughput. Latency will still be bound by rr_interval, but on a per-CPU basis +instead of across the whole system. + +Isochronous scheduling: + +Isochronous scheduling is a unique scheduling policy designed to provide +near-real-time performance to unprivileged (ie non-root) users without the +ability to starve the machine indefinitely. Isochronous tasks (which means +"same time") are set using, for example, the schedtool application like so: + + schedtool -I -e amarok + +This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works +is that it has a priority level between true realtime tasks and SCHED_NORMAL +which would allow them to preempt all normal tasks, in a SCHED_RR fashion (ie, +if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval +rate). However if ISO tasks run for more than a tunable finite amount of time, +they are then demoted back to SCHED_NORMAL scheduling. This finite amount of +time is the percentage of CPU available per CPU, configurable as a percentage in +the following "resource handling" tunable (as opposed to a scheduler tunable): + +iso_cpu: + + /proc/sys/kernel/iso_cpu + +and is set to 70% by default. It is calculated over a rolling 5 second average +Because it is the total CPU available, it means that on a multi CPU machine, it +is possible to have an ISO task running as realtime scheduling indefinitely on +just one CPU, as the other CPUs will be available. Setting this to 100 is the +equivalent of giving all users SCHED_RR access and setting it to 0 removes the +ability to run any pseudo-realtime tasks. + +A feature of MuQSS is that it detects when an application tries to obtain a +realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the +appropriate privileges to use those policies. When it detects this, it will +give the task SCHED_ISO policy instead. Thus it is transparent to the user. + + +Idleprio scheduling: + +Idleprio scheduling is a scheduling policy designed to give out CPU to a task +_only_ when the CPU would be otherwise idle. The idea behind this is to allow +ultra low priority tasks to be run in the background that have virtually no +effect on the foreground tasks. This is ideally suited to distributed computing +clients (like setiathome, folding, mprime etc) but can also be used to start a +video encode or so on without any slowdown of other tasks. To avoid this policy +from grabbing shared resources and holding them indefinitely, if it detects a +state where the task is waiting on I/O, the machine is about to suspend to ram +and so on, it will transiently schedule them as SCHED_NORMAL. Once a task has +been scheduled as IDLEPRIO, it cannot be put back to SCHED_NORMAL without +superuser privileges since it is effectively a lower scheduling policy. Tasks +can be set to start as SCHED_IDLEPRIO with the schedtool command like so: + +schedtool -D -e ./mprime + +Subtick accounting: -MuQSS is an 8 level skip list per runqueue variant of BFS. 
+It is surprisingly difficult to get accurate CPU accounting, and in many cases, +the accounting is done by simply determining what is happening at the precise +moment a timer tick fires off. This becomes increasingly inaccurate as the timer +tick frequency (HZ) is lowered. It is possible to create an application which +uses almost 100% CPU, yet by being descheduled at the right time, records zero +CPU usage. While the main problem with this is that there are possible security +implications, it is also difficult to determine how much CPU a task really does +use. Mux uses sub-tick accounting from the TSC clock to determine real CPU +usage. Thus, the amount of CPU reported as being used by MuQSS will more +accurately represent how much CPU the task itself is using (as is shown for +example by the 'time' application), so the reported values may be quite +different to other schedulers. When comparing throughput of MuQSS to other +designs, it is important to compare the actual completed work in terms of total +wall clock time taken and total work done, rather than the reported "cpu usage". -See sched-BFS.txt for some of the shared design details. +Symmetric MultiThreading (SMT) aware nice: -Documentation yet to be completed. +SMT, a.k.a. hyperthreading, is a very common feature on modern CPUs. While the +logical CPU count rises by adding thread units to each CPU core, allowing more +than one task to be run simultaneously on the same core, the disadvantage of it +is that the CPU power is shared between the tasks, not summating to the power +of two CPUs. The practical upshot of this is that two tasks running on +separate threads of the same core run significantly slower than if they had one +core each to run on. While smart CPU selection allows each task to have a core +to itself whenever available (as is done on MuQSS), it cannot offset the +slowdown that occurs when the cores are all loaded and only a thread is left. +Most of the time this is harmless as the CPU is effectively overloaded at this +point and the extra thread is of benefit. However when running a niced task in +the presence of an un-niced task (say nice 19 v nice 0), the nice task gets +precisely the same amount of CPU power as the unniced one. MuQSS has an +optional configuration feature known as SMT-NICE which selectively idles the +secondary niced thread for a period proportional to the nice difference, +allowing CPU distribution according to nice level to be maintained, at the +expense of a small amount of extra overhead. If this is configured in on a +machine without SMT threads, the overhead is minimal. -Con Kolivas Sun, 2nd October 2016 +Con Kolivas Sat, 29th October 2016 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1fc4c59..7767a5f 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2052,7 +2052,7 @@ config HOTPLUG_CPU config BOOTPARAM_HOTPLUG_CPU0 bool "Set default setting of cpu0_hotpluggable" default n - depends on HOTPLUG_CPU && !SCHED_MUQSS + depends on HOTPLUG_CPU ---help--- Set whether default state of cpu0_hotpluggable is on or off. @@ -2081,7 +2081,7 @@ config BOOTPARAM_HOTPLUG_CPU0 config DEBUG_HOTPLUG_CPU0 def_bool n prompt "Debug CPU0 hotplug" - depends on HOTPLUG_CPU && !SCHED_MUQSS + depends on HOTPLUG_CPU ---help--- Enabling this option offlines CPU0 (if CPU0 can be offlined) as soon as possible and boots up userspace with CPU0 offlined. 
User diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig index 5fa6ee2..824c48d 100644 --- a/arch/x86/configs/i386_defconfig +++ b/arch/x86/configs/i386_defconfig @@ -54,7 +54,7 @@ CONFIG_HIGHPTE=y CONFIG_X86_CHECK_BIOS_CORRUPTION=y # CONFIG_MTRR_SANITIZER is not set CONFIG_EFI=y -CONFIG_HZ_1000=y +CONFIG_HZ_100=y CONFIG_KEXEC=y CONFIG_CRASH_DUMP=y # CONFIG_COMPAT_VDSO is not set diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig index d28bdab..b8c4f66 100644 --- a/arch/x86/configs/x86_64_defconfig +++ b/arch/x86/configs/x86_64_defconfig @@ -52,7 +52,7 @@ CONFIG_NUMA=y CONFIG_X86_CHECK_BIOS_CORRUPTION=y # CONFIG_MTRR_SANITIZER is not set CONFIG_EFI=y -CONFIG_HZ_1000=y +CONFIG_HZ_100=y CONFIG_KEXEC=y CONFIG_CRASH_DUMP=y # CONFIG_COMPAT_VDSO is not set diff --git a/init/Kconfig b/init/Kconfig index 1323342..15e24c7 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -30,6 +30,7 @@ menu "General setup" config SCHED_MUQSS bool "MuQSS cpu scheduler" + select HIGH_RES_TIMERS ---help--- The Multiple Queue Skiplist Scheduler for excellent interactivity and responsiveness on the desktop and highly scalable deterministic @@ -349,7 +350,7 @@ choice # Kind of a stub config for the pure tick based cputime accounting config TICK_CPU_ACCOUNTING bool "Simple tick based cputime accounting" - depends on !S390 && !NO_HZ_FULL && !SCHED_MUQSS + depends on !S390 && !NO_HZ_FULL help This is the basic tick based cputime accounting that maintains statistics about user, system and idle time spent on per jiffies @@ -374,7 +375,6 @@ config VIRT_CPU_ACCOUNTING_GEN bool "Full dynticks CPU time accounting" depends on HAVE_CONTEXT_TRACKING depends on HAVE_VIRT_CPU_ACCOUNTING_GEN - depends on !SCHED_MUQSS select VIRT_CPU_ACCOUNTING select CONTEXT_TRACKING help @@ -552,7 +552,7 @@ config CONTEXT_TRACKING config CONTEXT_TRACKING_FORCE bool "Force context tracking" depends on CONTEXT_TRACKING - default y if !NO_HZ_FULL + default y if !NO_HZ_FULL && !SCHED_MUQSS help The major pre-requirement for full dynticks to work is to support the context tracking subsystem. But there are also @@ -710,7 +710,6 @@ config RCU_NOCB_CPU bool "Offload RCU callback processing from boot-selected CPUs" depends on TREE_RCU || PREEMPT_RCU depends on RCU_EXPERT || NO_HZ_FULL - depends on !SCHED_MUQSS default n help Use this option to reduce OS jitter for aggressive HPC or diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz index 2a202a8..ecde22d 100644 --- a/kernel/Kconfig.hz +++ b/kernel/Kconfig.hz @@ -5,6 +5,7 @@ choice prompt "Timer frequency" default HZ_250 + default HZ_100 if SCHED_MUQSS help Allows the configuration of the timer frequency. 
It is customary to have the timer interrupt run at 1000 Hz but 100 Hz may be more diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index a787aa9..77bdf980 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -18,14 +18,14 @@ endif ifdef CONFIG_SCHED_MUQSS obj-y += MuQSS.o clock.o else -obj-y += core.o loadavg.o clock.o cputime.o +obj-y += core.o loadavg.o clock.o obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o obj-$(CONFIG_SMP) += cpudeadline.o obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o endif -obj-y += wait.o swait.o completion.o idle.o +obj-y += wait.o swait.o completion.o idle.o cputime.o obj-$(CONFIG_SMP) += cpupri.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_CPU_FREQ) += cpufreq.o diff --git a/kernel/sched/MuQSS.c b/kernel/sched/MuQSS.c index 894800a..f66a3da 100644 --- a/kernel/sched/MuQSS.c +++ b/kernel/sched/MuQSS.c @@ -123,6 +123,7 @@ */ #define JIFFIES_TO_NS(TIME) ((TIME) * (1073741824 / HZ)) #define JIFFY_NS (1073741824 / HZ) +#define JIFFY_US (1048576 / HZ) #define NS_TO_JIFFIES(TIME) ((TIME) / JIFFY_NS) #define HALF_JIFFY_NS (1073741824 / HZ / 2) #define HALF_JIFFY_US (1048576 / HZ / 2) @@ -130,12 +131,13 @@ #define MS_TO_US(TIME) ((TIME) << 10) #define NS_TO_MS(TIME) ((TIME) >> 20) #define NS_TO_US(TIME) ((TIME) >> 10) +#define US_TO_NS(TIME) ((TIME) << 10) #define RESCHED_US (100) /* Reschedule if less than this many μs left */ void print_scheduler_version(void) { - printk(KERN_INFO "MuQSS CPU scheduler v0.116 by Con Kolivas.\n"); + printk(KERN_INFO "MuQSS CPU scheduler v0.120 by Con Kolivas.\n"); } /* @@ -171,6 +173,8 @@ static inline int timeslice(void) return MS_TO_US(rr_interval); } +static bool sched_smp_initialized __read_mostly; + /* * The global runqueue data that all CPUs work off. Contains either atomic * variables and a cpu bitmap set atomically. 
@@ -180,13 +184,11 @@ struct global_rq { atomic_t nr_running ____cacheline_aligned_in_smp; atomic_t nr_uninterruptible ____cacheline_aligned_in_smp; atomic64_t nr_switches ____cacheline_aligned_in_smp; - atomic_t qnr ____cacheline_aligned_in_smp; /* queued not running */ cpumask_t cpu_idle_map ____cacheline_aligned_in_smp; #else atomic_t nr_running ____cacheline_aligned; atomic_t nr_uninterruptible ____cacheline_aligned; atomic64_t nr_switches ____cacheline_aligned; - atomic_t qnr ____cacheline_aligned; /* queued not running */ #endif }; @@ -862,20 +864,24 @@ static inline bool rq_local(struct rq *rq); */ static void update_load_avg(struct rq *rq) { - /* rq clock can go backwards so skip update if that happens */ - if (likely(rq->clock > rq->load_update)) { - unsigned long us_interval = (rq->clock - rq->load_update) >> 10; - long load, curload = rq_load(rq); + unsigned long us_interval; + long load, curload; - load = rq->load_avg - (rq->load_avg * us_interval * 5 / 262144); - if (unlikely(load < 0)) - load = 0; - load += curload * curload * SCHED_CAPACITY_SCALE * us_interval * 5 / 262144; - rq->load_avg = load; - } else + if (unlikely(rq->niffies <= rq->load_update)) return; - rq->load_update = rq->clock; + us_interval = NS_TO_US(rq->niffies - rq->load_update); + curload = rq_load(rq); + load = rq->load_avg - (rq->load_avg * us_interval * 5 / 262144); + if (unlikely(load < 0)) + load = 0; + load += curload * curload * SCHED_CAPACITY_SCALE * us_interval * 5 / 262144; + /* If this CPU has all the load, make it ramp up quickly */ + if (curload > load && curload >= atomic_read(&grq.nr_running)) + load = curload; + rq->load_avg = load; + + rq->load_update = rq->niffies; if (likely(rq_local(rq))) cpufreq_trigger(rq->niffies, rq->load_avg); } @@ -995,26 +1001,6 @@ static inline int task_timeslice(struct task_struct *p) return (rr_interval * task_prio_ratio(p) / 128); } -/* - * qnr is the "queued but not running" count which is the total number of - * tasks on the global runqueue list waiting for cpu time but not actually - * currently running on a cpu. 
- */ -static inline void inc_qnr(void) -{ - atomic_inc(&grq.qnr); -} - -static inline void dec_qnr(void) -{ - atomic_dec(&grq.qnr); -} - -static inline int queued_notrunning(void) -{ - return atomic_read(&grq.qnr); -} - #ifdef CONFIG_SMP /* Entered with rq locked */ static inline void resched_if_idle(struct rq *rq) @@ -1382,7 +1368,6 @@ static void activate_task(struct task_struct *p, struct rq *rq) enqueue_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; atomic_inc(&grq.nr_running); - inc_qnr(); } /* @@ -1416,7 +1401,7 @@ void set_task_cpu(struct task_struct *p, unsigned int cpu) WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) || lockdep_is_held(&task_rq(p)->lock))); #endif - if (p->wake_cpu == cpu) + if (task_cpu(p) == cpu) return; trace_sched_migrate_task(p, cpu); perf_event_task_migrate(p); @@ -1467,7 +1452,6 @@ static inline void take_task(struct rq *rq, int cpu, struct task_struct *p) sched_info_queued(rq, p); } set_task_cpu(p, cpu); - dec_qnr(); } /* @@ -1480,7 +1464,6 @@ static inline void return_task(struct task_struct *p, struct rq *rq, if (deactivate) deactivate_task(p, rq); else { - inc_qnr(); #ifdef CONFIG_SMP /* * set_task_cpu was called on the running task that doesn't @@ -1831,8 +1814,6 @@ static int ttwu_remote(struct task_struct *p, int wake_flags) } #ifdef CONFIG_SMP -static bool sched_smp_initialized __read_mostly; - void sched_ttwu_pending(void) { struct rq *rq = this_rq(); @@ -2347,6 +2328,16 @@ int sysctl_schedstats(struct ctl_table *table, int write, static inline void init_schedstats(void) {} #endif /* CONFIG_SCHEDSTATS */ +static void update_cpu_clock_switch(struct rq *rq, struct task_struct *p); + +static void account_task_cpu(struct rq *rq, struct task_struct *p) +{ + update_clocks(rq); + /* This isn't really a context switch but accounting is the same */ + update_cpu_clock_switch(rq, p); + p->last_ran = rq->niffies; +} + /* * wake_up_new_task - wake up a newly created task for the first time. * @@ -2372,7 +2363,6 @@ void wake_up_new_task(struct task_struct *p) } double_rq_lock(rq, new_rq); - update_clocks(rq); rq_curr = rq->curr; /* @@ -2380,7 +2370,6 @@ void wake_up_new_task(struct task_struct *p) */ p->prio = rq_curr->normal_prio; - activate_task(p, rq); trace_sched_wakeup_new(p); /* @@ -2391,17 +2380,17 @@ void wake_up_new_task(struct task_struct *p) * modified within schedule() so it is always equal to * current->deadline. */ + account_task_cpu(rq, rq_curr); p->last_ran = rq_curr->last_ran; if (likely(rq_curr->policy != SCHED_FIFO)) { rq_curr->time_slice /= 2; - if (unlikely(rq_curr->time_slice < RESCHED_US)) { + if (rq_curr->time_slice < RESCHED_US) { /* * Forking task has run out of timeslice. Reschedule it and * start its child with a new time slice and deadline. The * child will end up running first because its deadline will * be slightly earlier. */ - rq_curr->time_slice = 0; __set_tsk_resched(rq_curr); time_slice_expired(p, new_rq); if (suitable_idle_cpus(p)) @@ -2424,6 +2413,7 @@ void wake_up_new_task(struct task_struct *p) time_slice_expired(p, new_rq); try_preempt(p, new_rq); } + activate_task(p, new_rq); double_rq_unlock(rq, new_rq); raw_spin_unlock_irqrestore(&p->pi_lock, flags); } @@ -2819,116 +2809,6 @@ DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat); EXPORT_PER_CPU_SYMBOL(kstat); EXPORT_PER_CPU_SYMBOL(kernel_cpustat); -#ifdef CONFIG_IRQ_TIME_ACCOUNTING - -/* - * There are no locks covering percpu hardirq/softirq time. - * They are only modified in account_system_vtime, on corresponding CPU - * with interrupts disabled. 
So, writes are safe. - * They are read and saved off onto struct rq in update_rq_clock(). - * This may result in other CPU reading this CPU's irq time and can - * race with irq/account_system_vtime on this CPU. We would either get old - * or new value with a side effect of accounting a slice of irq time to wrong - * task when irq is in progress while we read rq->clock. That is a worthy - * compromise in place of having locks on each irq in account_system_time. - */ -static DEFINE_PER_CPU(u64, cpu_hardirq_time); -static DEFINE_PER_CPU(u64, cpu_softirq_time); - -static DEFINE_PER_CPU(u64, irq_start_time); -static int sched_clock_irqtime; - -void enable_sched_clock_irqtime(void) -{ - sched_clock_irqtime = 1; -} - -void disable_sched_clock_irqtime(void) -{ - sched_clock_irqtime = 0; -} - -#ifndef CONFIG_64BIT -static DEFINE_PER_CPU(seqcount_t, irq_time_seq); - -static inline void irq_time_write_begin(void) -{ - __this_cpu_inc(irq_time_seq.sequence); - smp_wmb(); -} - -static inline void irq_time_write_end(void) -{ - smp_wmb(); - __this_cpu_inc(irq_time_seq.sequence); -} - -static inline u64 irq_time_read(int cpu) -{ - u64 irq_time; - unsigned seq; - - do { - seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu)); - irq_time = per_cpu(cpu_softirq_time, cpu) + - per_cpu(cpu_hardirq_time, cpu); - } while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq)); - - return irq_time; -} -#else /* CONFIG_64BIT */ -static inline void irq_time_write_begin(void) -{ -} - -static inline void irq_time_write_end(void) -{ -} - -static inline u64 irq_time_read(int cpu) -{ - return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); -} -#endif /* CONFIG_64BIT */ - -/* - * Called before incrementing preempt_count on {soft,}irq_enter - * and before decrementing preempt_count on {soft,}irq_exit. - */ -void irqtime_account_irq(struct task_struct *curr) -{ - unsigned long flags; - s64 delta; - int cpu; - - if (!sched_clock_irqtime) - return; - - local_irq_save(flags); - - cpu = smp_processor_id(); - delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time); - __this_cpu_add(irq_start_time, delta); - - irq_time_write_begin(); - /* - * We do not account for softirq time from ksoftirqd here. - * We want to continue accounting softirq time to ksoftirqd thread - * in that case, so as not to confuse scheduler with a special task - * that do not consume any time, but still wants to run. 
- */ - if (hardirq_count()) - __this_cpu_add(cpu_hardirq_time, delta); - else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) - __this_cpu_add(cpu_softirq_time, delta); - - irq_time_write_end(); - local_irq_restore(flags); -} -EXPORT_SYMBOL_GPL(irqtime_account_irq); - -#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ - #ifdef CONFIG_PARAVIRT static inline u64 steal_ticks(u64 steal) { @@ -2990,89 +2870,6 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) # define nsecs_to_cputime(__nsecs) nsecs_to_jiffies(__nsecs) #endif -#ifdef CONFIG_IRQ_TIME_ACCOUNTING -static void irqtime_account_hi_si(void) -{ - u64 *cpustat = kcpustat_this_cpu->cpustat; - u64 latest_ns; - - latest_ns = nsecs_to_cputime64(this_cpu_read(cpu_hardirq_time)); - if (latest_ns > cpustat[CPUTIME_IRQ]) - cpustat[CPUTIME_IRQ] += (__force u64)cputime_one_jiffy; - - latest_ns = nsecs_to_cputime64(this_cpu_read(cpu_softirq_time)); - if (latest_ns > cpustat[CPUTIME_SOFTIRQ]) - cpustat[CPUTIME_SOFTIRQ] += (__force u64)cputime_one_jiffy; -} -#else /* CONFIG_IRQ_TIME_ACCOUNTING */ - -#define sched_clock_irqtime (0) - -static inline void irqtime_account_hi_si(void) -{ -} -#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ - -static __always_inline bool steal_account_process_tick(void) -{ -#ifdef CONFIG_PARAVIRT - if (static_key_false(¶virt_steal_enabled)) { - u64 steal; - cputime_t steal_ct; - - steal = paravirt_steal_clock(smp_processor_id()); - steal -= this_rq()->prev_steal_time; - - /* - * cputime_t may be less precise than nsecs (eg: if it's - * based on jiffies). Lets cast the result to cputime - * granularity and account the rest on the next rounds. - */ - steal_ct = nsecs_to_cputime(steal); - this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct); - - account_steal_time(steal_ct); - return steal_ct; - } -#endif - return false; -} - -/* - * Accumulate raw cputime values of dead tasks (sig->[us]time) and live - * tasks (sum on group iteration) belonging to @tsk's group. - */ -void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times) -{ - struct signal_struct *sig = tsk->signal; - cputime_t utime, stime; - struct task_struct *t; - unsigned int seq, nextseq; - unsigned long flags; - - rcu_read_lock(); - /* Attempt a lockless read on the first round. */ - nextseq = 0; - do { - seq = nextseq; - flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq); - times->utime = sig->utime; - times->stime = sig->stime; - times->sum_exec_runtime = sig->sum_sched_runtime; - - for_each_thread(tsk, t) { - task_cputime(t, &utime, &stime); - times->utime += utime; - times->stime += stime; - times->sum_exec_runtime += task_sched_runtime(t); - } - /* If lockless access failed, take the lock. */ - nextseq = 1; - } while (need_seqretry(&sig->stats_lock, seq)); - done_seqretry_irqrestore(&sig->stats_lock, seq, flags); - rcu_read_unlock(); -} - /* * On each tick, add the number of nanoseconds to the unbanked variables and * once one tick's worth has accumulated, account it allowing for accurate @@ -3197,15 +2994,11 @@ static void pc_user_time(struct rq *rq, struct task_struct *p, unsigned long ns) * Bank in p->sched_time the ns elapsed since the last tick or switch. * CPU scheduler quota accounting is also performed here in microseconds. 
  */
-static void
-update_cpu_clock_tick(struct rq *rq, struct task_struct *p)
+static void update_cpu_clock_tick(struct rq *rq, struct task_struct *p)
 {
 	s64 account_ns = rq->niffies - p->last_ran;
 	struct task_struct *idle = rq->idle;
 
-	if (steal_account_process_tick())
-		goto ts_account;
-
 	/* Accurate tick timekeeping */
 	if (user_mode(get_irq_regs()))
 		pc_user_time(rq, p, account_ns);
@@ -3214,10 +3007,6 @@ update_cpu_clock_tick(struct rq *rq, struct task_struct *p)
 	} else
 		pc_idle_time(rq, idle, account_ns);
 
-	if (sched_clock_irqtime)
-		irqtime_account_hi_si();
-
-ts_account:
 	/* time_slice accounting is done in usecs to avoid overflow on 32bit */
 	if (p->policy != SCHED_FIFO && p != idle)
 		p->time_slice -= NS_TO_US(account_ns);
@@ -3230,8 +3019,7 @@ ts_account:
  * Bank in p->sched_time the ns elapsed since the last tick or switch.
  * CPU scheduler quota accounting is also performed here in microseconds.
  */
-static void
-update_cpu_clock_switch(struct rq *rq, struct task_struct *p)
+static void update_cpu_clock_switch(struct rq *rq, struct task_struct *p)
 {
 	s64 account_ns = rq->niffies - p->last_ran;
 	struct task_struct *idle = rq->idle;
@@ -3305,133 +3093,86 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	return ns;
 }
 
-/* Compatibility crap */
-void account_user_time(struct task_struct *p, cputime_t cputime,
-		       cputime_t cputime_scaled)
-{
-}
-
-void account_idle_time(cputime_t cputime)
+#ifdef CONFIG_HIGH_RES_TIMERS
+static inline int hrexpiry_enabled(struct rq *rq)
 {
+	if (unlikely(!cpu_active(cpu_of(rq)) || !sched_smp_initialized))
+		return 0;
+	return hrtimer_is_hres_active(&rq->hrexpiry_timer);
 }
 
 /*
- * Account guest cpu time to a process.
- * @p: the process that the cpu time gets accounted to
- * @cputime: the cpu time spent in virtual machine since the last update
- * @cputime_scaled: cputime scaled by cpu frequency
+ * Use HR-timers to deliver accurate preemption points.
  */
-static void account_guest_time(struct task_struct *p, cputime_t cputime,
-			       cputime_t cputime_scaled)
+static void hrexpiry_clear(struct rq *rq)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
-
-	/* Add guest time to process. */
-	p->utime += (__force u64)cputime;
-	p->utimescaled += (__force u64)cputime_scaled;
-	account_group_user_time(p, cputime);
-	p->gtime += (__force u64)cputime;
-
-	/* Add guest time to cpustat. */
-	if (task_nice(p) > 0) {
-		cpustat[CPUTIME_NICE] += (__force u64)cputime;
-		cpustat[CPUTIME_GUEST_NICE] += (__force u64)cputime;
-	} else {
-		cpustat[CPUTIME_USER] += (__force u64)cputime;
-		cpustat[CPUTIME_GUEST] += (__force u64)cputime;
-	}
+	if (!hrexpiry_enabled(rq))
+		return;
+	if (hrtimer_active(&rq->hrexpiry_timer))
+		hrtimer_cancel(&rq->hrexpiry_timer);
 }
 
 /*
- * Account system cpu time to a process and desired cpustat field
- * @p: the process that the cpu time gets accounted to
- * @cputime: the cpu time spent in kernel space since the last update
- * @cputime_scaled: cputime scaled by cpu frequency
- * @target_cputime64: pointer to cpustat field that has to be updated
+ * High-resolution time_slice expiry.
+ * Runs from hardirq context with interrupts disabled.
  */
-static inline
-void __account_system_time(struct task_struct *p, cputime_t cputime,
-			cputime_t cputime_scaled, cputime64_t *target_cputime64)
+static enum hrtimer_restart hrexpiry(struct hrtimer *timer)
 {
-	/* Add system time to process. */
-	p->stime += (__force u64)cputime;
-	p->stimescaled += (__force u64)cputime_scaled;
-	account_group_system_time(p, cputime);
+	struct rq *rq = container_of(timer, struct rq, hrexpiry_timer);
+	struct task_struct *p;
 
-	/* Add system time to cpustat. */
-	*target_cputime64 += (__force u64)cputime;
+	/* This can happen during CPU hotplug / resume */
+	if (unlikely(cpu_of(rq) != smp_processor_id()))
+		goto out;
 
-	/* Account for system time used */
-	acct_update_integrals(p);
+	/*
+	 * We're doing this without the runqueue lock but this should always
+	 * be run on the local CPU. Time slice should run out in __schedule
+	 * but we set it to zero here in case niffies is slightly less.
+	 */
+	p = rq->curr;
+	p->time_slice = 0;
+	__set_tsk_resched(p);
+out:
+	return HRTIMER_NORESTART;
 }
 
 /*
- * Account system cpu time to a process.
- * @p: the process that the cpu time gets accounted to
- * @hardirq_offset: the offset to subtract from hardirq_count()
- * @cputime: the cpu time spent in kernel space since the last update
- * @cputime_scaled: cputime scaled by cpu frequency
- * This is for guest only now.
+ * Called to set the hrexpiry timer state.
+ *
+ * called with irqs disabled from the local CPU only
  */
-void account_system_time(struct task_struct *p, int hardirq_offset,
-			 cputime_t cputime, cputime_t cputime_scaled)
+static void hrexpiry_start(struct rq *rq, u64 delay)
 {
+	if (!hrexpiry_enabled(rq))
+		return;
 
-	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
-		account_guest_time(p, cputime, cputime_scaled);
+	hrtimer_start(&rq->hrexpiry_timer, ns_to_ktime(delay),
+		      HRTIMER_MODE_REL_PINNED);
 }
 
-/*
- * Account for involuntary wait time.
- * @steal: the cpu time spent in involuntary wait
- */
-void account_steal_time(cputime_t cputime)
+static void init_rq_hrexpiry(struct rq *rq)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
-
-	cpustat[CPUTIME_STEAL] += (__force u64)cputime;
+	hrtimer_init(&rq->hrexpiry_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	rq->hrexpiry_timer.function = hrexpiry;
 }
 
-/*
- * Account for idle time.
- * @cputime: the cpu time spent in idle wait
- */
-static void account_idle_times(cputime_t cputime)
-{
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
-	struct rq *rq = this_rq();
-
-	if (atomic_read(&rq->nr_iowait) > 0)
-		cpustat[CPUTIME_IOWAIT] += (__force u64)cputime;
-	else
-		cpustat[CPUTIME_IDLE] += (__force u64)cputime;
-}
-
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-
-void account_process_tick(struct task_struct *p, int user_tick)
+static inline int rq_dither(struct rq *rq)
 {
+	if (!hrexpiry_enabled(rq))
+		return HALF_JIFFY_US;
+	return 0;
 }
-
-/*
- * Account multiple ticks of steal time.
- * @p: the process from which the cpu time has been stolen
- * @ticks: number of stolen ticks
- */
-void account_steal_ticks(unsigned long ticks)
+#else /* CONFIG_HIGH_RES_TIMERS */
+static inline void init_rq_hrexpiry(struct rq *rq)
 {
-	account_steal_time(jiffies_to_cputime(ticks));
 }
 
-/*
- * Account multiple ticks of idle time.
- * @ticks: number of stolen ticks
- */
-void account_idle_ticks(unsigned long ticks)
+static inline int rq_dither(struct rq *rq)
 {
-	account_idle_times(jiffies_to_cputime(ticks));
+	return HALF_JIFFY_US;
 }
-#endif
+#endif /* CONFIG_HIGH_RES_TIMERS */
 
 /*
  * Functions to test for when SCHED_ISO tasks have used their allocated
@@ -3510,6 +3251,8 @@ static void task_running_tick(struct rq *rq)
 	 * allowed to run into the 2nd half of the next tick if they will
 	 * run out of time slice in the interim. Otherwise, if they have
 	 * less than RESCHED_US μs of time slice left they will be rescheduled.
+	 * Dither is used as a backup for when hrexpiry is disabled or high res
+	 * timers not configured in.
 	 */
 	if (p->time_slice - rq->dither >= RESCHED_US)
 		return;
@@ -3519,6 +3262,60 @@ out_resched:
 	rq_unlock(rq);
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+/*
+ * We can stop the timer tick any time highres timers are active since
+ * we rely entirely on highres timeouts for task expiry rescheduling.
+ */
+static void sched_stop_tick(struct rq *rq, int cpu)
+{
+	if (!hrexpiry_enabled(rq))
+		return;
+	if (!tick_nohz_full_enabled())
+		return;
+	if (!tick_nohz_full_cpu(cpu))
+		return;
+	tick_nohz_dep_clear_cpu(cpu, TICK_DEP_BIT_SCHED);
+}
+
+static inline void sched_start_tick(struct rq *rq, int cpu)
+{
+	tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
+}
+
+/**
+ * scheduler_tick_max_deferment
+ *
+ * Keep at least one tick per second when a single
+ * active task is running.
+ *
+ * This makes sure that uptime continues to move forward, even
+ * with a very low granularity.
+ *
+ * Return: Maximum deferment in nanoseconds.
+ */
+u64 scheduler_tick_max_deferment(void)
+{
+	struct rq *rq = this_rq();
+	unsigned long next, now = READ_ONCE(jiffies);
+
+	next = rq->last_jiffy + HZ;
+
+	if (time_before_eq(next, now))
+		return 0;
+
+	return jiffies_to_nsecs(next - now);
+}
+#else
+static inline void sched_stop_tick(struct rq *rq, int cpu)
+{
+}
+
+static inline void sched_start_tick(struct rq *rq, int cpu)
+{
+}
+#endif
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -3539,6 +3336,7 @@ void scheduler_tick(void)
 	rq->last_scheduler_tick = rq->last_jiffy;
 	rq->last_tick = rq->clock;
 	perf_event_task_tick();
+	sched_stop_tick(rq, cpu);
 }
 
 #if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
@@ -3821,6 +3619,17 @@ static inline void schedule_debug(struct task_struct *prev)
  */
 static inline void set_rq_task(struct rq *rq, struct task_struct *p)
 {
+#ifdef CONFIG_HIGH_RES_TIMERS
+	if (p == rq->idle || p->policy == SCHED_FIFO)
+		hrexpiry_clear(rq);
+	else
+		hrexpiry_start(rq, US_TO_NS(p->time_slice));
+#endif /* CONFIG_HIGH_RES_TIMERS */
+	if (rq->clock - rq->last_tick > HALF_JIFFY_NS)
+		rq->dither = 0;
+	else
+		rq->dither = rq_dither(rq);
+
 	rq->rq_deadline = p->deadline;
 	rq->rq_prio = p->prio;
 #ifdef CONFIG_SMT_NICE
@@ -3858,9 +3667,6 @@ static void wake_smt_siblings(struct rq *this_rq)
 {
 	int other_cpu;
 
-	if (!queued_notrunning())
-		return;
-
 	for_each_cpu(other_cpu, &this_rq->thread_mask) {
 		struct rq *rq;
 
@@ -4005,10 +3811,6 @@ static void __sched notrace __schedule(bool preempt)
 	update_clocks(rq);
 	niffies = rq->niffies;
 	update_cpu_clock_switch(rq, prev);
-	if (rq->clock - rq->last_tick > HALF_JIFFY_NS)
-		rq->dither = 0;
-	else
-		rq->dither = HALF_JIFFY_US;
 
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
@@ -4018,19 +3820,12 @@ static void __sched notrace __schedule(bool preempt)
 		return_task(prev, rq, cpu, deactivate);
 	}
 
-	if (unlikely(!queued_notrunning())) {
-		next = idle;
-		schedstat_inc(rq, sched_goidle);
+	next = earliest_deadline_task(rq, cpu, idle);
+	if (likely(next->prio != PRIO_LIMIT))
+		clear_cpuidle_map(cpu);
+	else {
 		set_cpuidle_map(cpu);
 		update_load_avg(rq);
-	} else {
-		next = earliest_deadline_task(rq, cpu, idle);
-		if (likely(next->prio != PRIO_LIMIT))
-			clear_cpuidle_map(cpu);
-		else {
-			set_cpuidle_map(cpu);
-			update_load_avg(rq);
-		}
 	}
 
 	set_rq_task(rq, next);
@@ -5241,6 +5036,7 @@ SYSCALL_DEFINE0(sched_yield)
 
 	p = current;
 	rq = this_rq_lock();
+	time_slice_expired(p, rq);
 	schedstat_inc(task_rq(p), yld_count);
 
 	/*
@@ -5640,8 +5436,12 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 {
 	__do_set_cpus_allowed(p, new_mask);
 	if (needs_other_cpu(p, task_cpu(p))) {
+		struct rq *rq;
+
 		set_task_cpu(p, valid_task_cpu(p));
+		rq = __task_rq_lock(p);
 		resched_task(p);
+		__task_rq_unlock(rq);
 	}
 }
 
@@ -7473,6 +7273,8 @@ int sched_cpu_dying(unsigned int cpu)
 	}
 	bind_zero(cpu);
 	double_rq_unlock(rq, cpu_rq(0));
+	sched_start_tick(rq, cpu);
+	hrexpiry_clear(rq);
 	local_irq_restore(flags);
 
 	return 0;
@@ -7643,6 +7445,7 @@ void __init sched_init_smp(void)
 #else
 void __init sched_init_smp(void)
 {
+	sched_smp_initialized = true;
 }
 #endif /* CONFIG_SMP */
 
@@ -7696,7 +7499,6 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SMP
 	init_defrootdomain();
-	atomic_set(&grq.qnr, 0);
 	cpumask_clear(&grq.cpu_idle_map);
 #else
 	uprq = &per_cpu(runqueues, 0);
@@ -7729,6 +7531,7 @@ void __init sched_init(void)
 		rq->cpu = i;
 		rq_attach_root(rq, &def_root_domain);
 #endif
+		init_rq_hrexpiry(rq);
 		atomic_set(&rq->nr_iowait, 0);
 	}
 
@@ -7936,199 +7739,6 @@ void set_curr_task(int cpu, struct task_struct *p)
 
 #endif
 
-/*
- * Use precise platform statistics if available:
- */
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-
-#ifndef __ARCH_HAS_VTIME_TASK_SWITCH
-void vtime_common_task_switch(struct task_struct *prev)
-{
-	if (is_idle_task(prev))
-		vtime_account_idle(prev);
-	else
-		vtime_account_system(prev);
-
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-	vtime_account_user(prev);
-#endif
-	arch_vtime_task_switch(prev);
-}
-#endif
-
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
-
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
-	*ut = p->utime;
-	*st = p->stime;
-}
-EXPORT_SYMBOL_GPL(task_cputime_adjusted);
-
-void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
-	struct task_cputime cputime;
-
-	thread_group_cputime(p, &cputime);
-
-	*ut = cputime.utime;
-	*st = cputime.stime;
-}
-
-void vtime_account_system_irqsafe(struct task_struct *tsk)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	vtime_account_system(tsk);
-	local_irq_restore(flags);
-}
-EXPORT_SYMBOL_GPL(vtime_account_system_irqsafe);
-
-/*
- * Archs that account the whole time spent in the idle task
- * (outside irq) as idle time can rely on this and just implement
- * vtime_account_system() and vtime_account_idle(). Archs that
- * have other meaning of the idle time (s390 only includes the
- * time spent by the CPU when it's in low power mode) must override
- * vtime_account().
- */
-#ifndef __ARCH_HAS_VTIME_ACCOUNT
-void vtime_account_irq_enter(struct task_struct *tsk)
-{
-	if (!in_interrupt() && is_idle_task(tsk))
-		vtime_account_idle(tsk);
-	else
-		vtime_account_system(tsk);
-}
-EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
-#endif /* __ARCH_HAS_VTIME_ACCOUNT */
-
-#else /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
-/*
- * Perform (stime * rtime) / total, but avoid multiplication overflow by
- * losing precision when the numbers are big.
- */
-static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
-{
-	u64 scaled;
-
-	for (;;) {
-		/* Make sure "rtime" is the bigger of stime/rtime */
-		if (stime > rtime) {
-			u64 tmp = rtime; rtime = stime; stime = tmp;
-		}
-
-		/* Make sure 'total' fits in 32 bits */
-		if (total >> 32)
-			goto drop_precision;
-
-		/* Does rtime (and thus stime) fit in 32 bits? */
-		if (!(rtime >> 32))
-			break;
-		/* Can we just balance rtime/stime rather than dropping bits? */
-		if (stime >> 31)
-			goto drop_precision;
-		/* We can grow stime and shrink rtime and try to make them both fit */
-		stime <<= 1;
-		rtime >>= 1;
-		continue;
-drop_precision:
-		/* We drop from rtime, it has more bits than stime */
-		rtime >>= 1;
-		total >>= 1;
-	}
-
-	/*
-	 * Make sure gcc understands that this is a 32x32->64 multiply,
-	 * followed by a 64/32->64 divide.
-	 */
-	scaled = div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32)total);
-	return (__force cputime_t) scaled;
-}
-
-/*
- * Adjust tick based cputime random precision against scheduler
- * runtime accounting.
- */
-static void cputime_adjust(struct task_cputime *curr,
-			   struct prev_cputime *prev,
-			   cputime_t *ut, cputime_t *st)
-{
-	cputime_t rtime, stime, utime, total;
-
-	stime = curr->stime;
-	total = stime + curr->utime;
-
-	/*
-	 * Tick based cputime accounting depend on random scheduling
-	 * timeslices of a task to be interrupted or not by the timer.
-	 * Depending on these circumstances, the number of these interrupts
-	 * may be over or under-optimistic, matching the real user and system
-	 * cputime with a variable precision.
-	 *
-	 * Fix this by scaling these tick based values against the total
-	 * runtime accounted by the CFS scheduler.
-	 */
-	rtime = nsecs_to_cputime(curr->sum_exec_runtime);
-
-	/*
-	 * Update userspace visible utime/stime values only if actual execution
-	 * time is bigger than already exported. Note that can happen, that we
-	 * provided bigger values due to scaling inaccuracy on big numbers.
-	 */
-	if (prev->stime + prev->utime >= rtime)
-		goto out;
-
-	if (total) {
-		stime = scale_stime((__force u64)stime,
-				    (__force u64)rtime, (__force u64)total);
-		utime = rtime - stime;
-	} else {
-		stime = rtime;
-		utime = 0;
-	}
-
-	/*
-	 * If the tick based count grows faster than the scheduler one,
-	 * the result of the scaling may go backward.
-	 * Let's enforce monotonicity.
-	 */
-	prev->stime = max(prev->stime, stime);
-	prev->utime = max(prev->utime, utime);
-
-out:
-	*ut = prev->utime;
-	*st = prev->stime;
-}
-
-void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
-	struct task_cputime cputime = {
-		.sum_exec_runtime = tsk_seruntime(p),
-	};
-
-	task_cputime(p, &cputime.utime, &cputime.stime);
-	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
-}
-EXPORT_SYMBOL_GPL(task_cputime_adjusted);
-
-/*
- * Must be called with siglock held.
- */
-void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
-{
-	struct task_cputime cputime;
-
-	thread_group_cputime(p, &cputime);
-	cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
-}
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
-
 void init_idle_bootup_task(struct task_struct *idle)
 {}
 
diff --git a/kernel/sched/MuQSS.h b/kernel/sched/MuQSS.h
index 4e3115d..f9510d7 100644
--- a/kernel/sched/MuQSS.h
+++ b/kernel/sched/MuQSS.h
@@ -2,6 +2,7 @@
 #include
 #include
 #include
+#include "cpuacct.h"
 
 #ifndef MUQSS_SCHED_H
 #define MUQSS_SCHED_H
@@ -85,6 +86,10 @@ struct rq {
 	int iso_ticks;
 	bool iso_refractory;
 
+#ifdef CONFIG_HIGH_RES_TIMERS
+	struct hrtimer hrexpiry_timer;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 
 	/* latency stats */
@@ -244,6 +249,55 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
 }
 #endif
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+
+DECLARE_PER_CPU(u64, cpu_hardirq_time);
+DECLARE_PER_CPU(u64, cpu_softirq_time);
+
+#ifndef CONFIG_64BIT
+DECLARE_PER_CPU(seqcount_t, irq_time_seq);
+
+static inline void irq_time_write_begin(void)
+{
+	__this_cpu_inc(irq_time_seq.sequence);
+	smp_wmb();
+}
+
+static inline void irq_time_write_end(void)
+{
+	smp_wmb();
+	__this_cpu_inc(irq_time_seq.sequence);
+}
+
+static inline u64 irq_time_read(int cpu)
+{
+	u64 irq_time;
+	unsigned seq;
+
+	do {
+		seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+		irq_time = per_cpu(cpu_softirq_time, cpu) +
+			   per_cpu(cpu_hardirq_time, cpu);
+	} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq));
+
+	return irq_time;
+}
+#else /* CONFIG_64BIT */
+static inline void irq_time_write_begin(void)
+{
+}
+
+static inline void irq_time_write_end(void)
+{
+}
+
+static inline u64 irq_time_read(int cpu)
+{
+	return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
+}
+#endif /* CONFIG_64BIT */
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
 #ifdef CONFIG_CPU_FREQ
 DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a846cf8..f09077a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -4,7 +4,12 @@
 #include
 #include
 #include
+#ifdef CONFIG_SCHED_MUQSS
+#include "MuQSS.h"
+#include "stats.h"
+#else
 #include "sched.h"
+#endif
 #ifdef CONFIG_PARAVIRT
 #include
 #endif
@@ -671,7 +676,7 @@ out:
 void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 {
 	struct task_cputime cputime = {
-		.sum_exec_runtime = p->se.sum_exec_runtime,
+		.sum_exec_runtime = tsk_seruntime(p),
 	};
 
 	task_cputime(p, &cputime.utime, &cputime.stime);
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 10e18d2..4008d9f 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -89,7 +89,7 @@ config NO_HZ_IDLE
 config NO_HZ_FULL
 	bool "Full dynticks system (tickless)"
 	# NO_HZ_COMMON dependency
-	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS && !SCHED_MUQSS
+	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
 	# We need at least one periodic CPU for timekeeping
 	depends on SMP
 	depends on HAVE_CONTEXT_TRACKING
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 2c5bc77..b96deed 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -198,8 +198,13 @@ int clockevents_tick_resume(struct clock_event_device *dev)
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST
 
+#ifdef CONFIG_SCHED_MUQSS
+/* Limit min_delta to 100us */
+#define MIN_DELTA_LIMIT (NSEC_PER_SEC / 10000)
+#else
 /* Limit min_delta to a jiffie */
 #define MIN_DELTA_LIMIT (NSEC_PER_SEC / HZ)
+#endif
 
 /**
  * clockevents_increase_min_delta - raise minimum delta of a clock event device
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 6c6641c..cab7405 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1323,7 +1323,7 @@ config RCU_PERF_TEST
 
 config RCU_TORTURE_TEST
 	tristate "torture tests for RCU"
-	depends on DEBUG_KERNEL && !SCHED_MUQSS
+	depends on DEBUG_KERNEL
 	select TORTURE_TEST
 	select SRCU
 	select TASKS_RCU
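
For reference, the large block removed from MuQSS.c above drops MuQSS's private copy of
the tick-based cputime adjustment code in favour of the generic kernel/sched/cputime.c,
which this patch points at tsk_seruntime() instead of p->se.sum_exec_runtime. The heart
of that code is scale_stime(), which computes stime * rtime / total without overflowing
64 bits by shedding low-order bits only when an operand no longer fits in 32 bits. The
following is a minimal standalone userspace sketch of that loop; it is not part of the
patch, and the figures in main() are invented purely for illustration:

/*
 * Standalone sketch of the scale_stime() overflow-avoidance loop shown in
 * the removed code above. Not part of the patch; the values in main() are
 * made up for illustration only. Build with: gcc -O2 -o scale_stime scale_stime.c
 */
#include <stdio.h>
#include <stdint.h>

static uint64_t scale_stime(uint64_t stime, uint64_t rtime, uint64_t total)
{
	for (;;) {
		/* Make sure "rtime" is the bigger of stime/rtime */
		if (stime > rtime) {
			uint64_t tmp = rtime; rtime = stime; stime = tmp;
		}

		/* Make sure 'total' fits in 32 bits */
		if (total >> 32)
			goto drop_precision;

		/* Does rtime (and thus stime) fit in 32 bits? */
		if (!(rtime >> 32))
			break;

		/* Can we just balance rtime/stime rather than dropping bits? */
		if (stime >> 31)
			goto drop_precision;

		/* Grow stime and shrink rtime to try to make both fit */
		stime <<= 1;
		rtime >>= 1;
		continue;

drop_precision:
		/* Drop from rtime; it has more bits than stime */
		rtime >>= 1;
		total >>= 1;
	}

	/* 32x32->64 multiply followed by a 64/32->64 divide */
	return ((uint64_t)(uint32_t)stime * (uint64_t)(uint32_t)rtime) /
	       (uint32_t)total;
}

int main(void)
{
	/* Hypothetical totals: 3 s of stime, 7 s of utime, 9 s of real runtime */
	uint64_t stime = 3000000000ULL;
	uint64_t utime = 7000000000ULL;
	uint64_t rtime = 9000000000ULL;

	printf("scaled stime = %llu ns\n",
	       (unsigned long long)scale_stime(stime, rtime, stime + utime));
	return 0;
}

Run on its own, this prints 2700000000 ns, i.e. exactly the 3/10 share of the 9 s
runtime, since precision is only dropped once an operand exceeds 32 bits.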