StarPU Handbook
|
StarPU provides two ways of defining a scheduling policy, a basic monolithic way, and a modular way.
The basic monolithic way is directly connected with the core of StarPU, which means that the policy then has to handle all performance details, such as data prefetching, task performance model calibration, worker locking, etc. examples/scheduler/dummy_sched.c
is a trivial example which does not handle this, and thus e.g. does not achieve any data prefetching or smart scheduling.
The modular way allows to implement just one component, and reuse existing components to cope with all these details. examples/scheduler/dummy_modular_sched.c
is a trivial example very similar to dummy_sched.c
, but implemented as a component, which allows to assemble it with other components, and notably get data prefetching support for free, and task performance model calibration is properly performed, which allows to easily extend it into taking task duration into account, etc.
Make sure to have a look at the Scheduling Policy section, which provides a complete list of the functions available for writing advanced schedulers.
This includes getting an estimation for a task computation completion with starpu_task_expected_length(), for the required data transfers with starpu_task_expected_data_transfer_time_for(), for the required energy with starpu_task_expected_energy(), etc. Other useful functions include starpu_transfer_bandwidth(), starpu_transfer_latency(), starpu_transfer_predict(), ... The successors of a task can be obtained with starpu_task_get_task_succs(). One can also directly test the presence of a data handle with starpu_data_is_on_node(). Prefetches can be triggered by calling either starpu_prefetch_task_input_for(), starpu_idle_prefetch_task_input(), starpu_prefetch_task_input_for_prio(), or starpu_idle_prefetch_task_input_for_prio(). The _prio
versions allow to specify a priority for the transfer (instead of taking the task priority by default). These prefetches are only processed when there are no fetch data requests (i.e. a task is waiting for it) to process. The _idle
versions queue the transfers on the idle prefetch queue, which is only processed when there are no non-idle prefetch to process. starpu_get_prefetch_flag() is a convenient helper for checking the value of the STARPU_PREFETCH environment variable.
Usual functions can be used on tasks, for instance one can use the following to get the data size for a task.
Task queues can be implemented with the starpu_task_list functions.
Access to the hwloc
topology is available with starpu_worker_get_hwloc_obj().
A full example showing how to define a new scheduling policy is available in the StarPU sources in examples/scheduler/dummy_sched.c
.
The scheduler has to provide methods:
The idea is that when a task becomes ready for execution, the starpu_sched_policy::push_task method is called to give the ready task to the scheduler. When a worker is idle, the starpu_sched_policy::pop_task method is called to get a task from the scheduler. It is up to the scheduler to implement what is between. A simple eager scheduler is for instance to make starpu_sched_policy::push_task push the task to a global list, and make starpu_sched_policy::pop_task pop from this list. A scheduler can also use starpu_push_local_task() to directly push tasks to a per-worker queue, and then starpu does not even need to implement starpu_sched_policy::pop_task. If there are no ready tasks within the scheduler, it can just return NULL
, and the worker will sleep.
The starpu_sched_policy section provides the exact rules that govern the methods of the policy.
One can enumerate the workers with this iterator:
To provide synchronization between workers, a per-worker lock exists to protect the data structures of a given worker. It is acquired around scheduler methods, so that the scheduler does not need any additional mutex to protect its per-worker data.
In case the scheduler wants to access another scheduler's data, it should use starpu_worker_lock() and starpu_worker_unlock().
Calling
from a worker A
will however thus make worker A
wait for worker B
to complete its scheduling method. That may be a problem if that method takes a long time, because it is e.g. computing a heuristic or waiting for another mutex, or even cause deadlocks if worker B
is calling
at the same time. In such a case, worker B
must call starpu_worker_relax_on() and starpu_worker_relax_off() around the section which potentially blocks (and does not actually need protection). While a worker is in relaxed mode, e.g. between a pair of starpu_worker_relax_on() and starpu_worker_relax_off() calls, its state can be altered by other threads: for instance, worker A
can push tasks for worker B
. In consequence, worker B
must re-assess its state after
, such as taking possible new tasks pushed to its queue into account.
When the starpu_sched_policy::push_task method has pushed a task for another worker, one has to call starpu_wake_worker_relax_light() so that the worker wakes up and picks it. If the task was pushed on a shared queue, one may want to only wake one idle worker. An example doing this is available in src/sched_policies/eager_central_policy.c
.
A pointer to one data structure specific to the scheduler can be set with starpu_sched_ctx_set_policy_data() and fetched with starpu_sched_ctx_get_policy_data(). Per-worker data structures can then be store in it by allocating a STARPU_NMAXWORKERS -sized array of structures indexed by workers.
A variety of examples of advanced schedulers can be read in src/sched_policies
, for instance random_policy.c
, eager_central_policy.c
, work_stealing_policy.c
Code protected by if (_starpu_get_nsched_ctxs() > 1)
can be ignored, this is for scheduling contexts, which is an experimental feature.
StarPU's Modularized Schedulers are made of individual Scheduling Components Modularizedly assembled as a Scheduling Tree. Each Scheduling Component has an unique purpose, such as prioritizing tasks or mapping tasks over resources. A typical Scheduling Tree is shown below.
| starpu_push_task | | v Fifo_Component | ^ Push | | Can_Push v | Eager_Component | ^ | | v | --------><-------------------><--------- | ^ | ^ Push | | Can_Push Push | | Can_Push v | v | Fifo_Component Fifo_Component | ^ | ^ Pull | | Can_Pull Pull | | Can_Pull v | v | Worker_Component Worker_Component | | starpu_pop_task | | v v
When a task is pushed by StarPU in a Modularized Scheduler, the task moves from a Scheduling Component to an other, following the hierarchy of the Scheduling Tree, and is stored in one of the Scheduling Components of the strategy. When a worker wants to pop a task from the Modularized Scheduler, the corresponding Worker Component of the Scheduling Tree tries to pull a task from its parents, following the hierarchy, and gives it to the worker if it succeded to get one.
Each Scheduling Component must follow the following pre-defined Interface to be able to interact with other Scheduling Components.
The components also provide the following useful methods:
StarPU is currently shipped with the following four Scheduling Components :
Some rules must be followed to ensure the correctness of a Modularized Scheduler :
Most often, components do not need to take locks. This allows e.g. the push operation to be called in parallel when tasks get released in parallel from different workers which have completed different ancestor tasks.
When a component has internal information which needs to be kept coherent, the component can define its own lock at take it as it sees fit, e.g. to protect a task queue. This may however limit scalability of the scheduler. Conversely, since push and pull operations will be called concurrently from different workers, the component might prefer to use a central mutex to serialize all scheduling decisions to avoid pathological cases (all push calls decide to put their task on the same target)
The following code shows how to implement a Tree-Eager-Prefetching Scheduler.
starpu_sched_component_initialize_simple_scheduler() is a helper function which makes it very trivial to assemble a modular scheduler around a scheduling decision component as seen above (here, a dumb eager decision component). Most often a modular scheduler can be implemented that way.
A modular scheduler can also be constructed hierarchically with starpu_sched_component_composed_recipe_create().
That modular scheduler can also be built by hand in the following way:
Other modular scheduler examples can be seen in src/sched_policies/modular_*.c
For instance, modular-heft-prio
needs performance models, decides memory nodes, uses prioritized fifos above and below, and decides the best implementation.
If unsure on the result of the modular scheduler construction, you can run a simple application with FxT enabled (see Generating Traces With FxT), and open the generated file trace.html
in a web-browser.
At the moment, parallel tasks can be managed in modularized schedulers through combined workers: instead of connecting a scheduling component to a worker component, one can connect it to a combined worker component (i.e. a worker component created with a combined worker id). That component will handle creating task aliases for parallel execution and push them to the different workers components.
Each Scheduling Component is instantiated from a Generic Scheduling Component, which implements a generic version of the Interface. The generic implementation of Pull, Can_Pull and Can_Push functions are recursive calls to their parents (respectively to their children). However, as a Generic Scheduling Component do not know how much children it will have when it will be instantiated, it does not implement the Push function.
A Scheduling Component must implement all the functions of the Interface. It is so necessary to implement a Push function to instantiate a Scheduling Component. The implemented Push function is the "fingerprint" of a Scheduling Component. Depending on how functionalities or properties programmers want to give to the Scheduling Component they are implementing, it is possible to reimplement all the functions of the Interface. For example, a Flow-control Component reimplements the Pull and the Can_Push functions of the Interface, allowing to catch the generic recursive calls of these functions. The Pull function of a Flow-control Component can, for example, pop a task from the local storage queue of the Component, and give it to the calling Component which asks for it.
The Tree-Eager-Prefetching Scheduler shown in Section Implementing a Modularized Scheduler follows the previous assumptions :
starpu_push_task Pump | Area 1 | | v -----------------------Fifo_Component----------------------------- Pump | ^ Push | | Can_Push v | Area 2 Eager_Component | ^ | | v | --------><-------------------><--------- | ^ | ^ Push | | Can_Push Push | | Can_Push v | v | -----Fifo_Component-----------------------Fifo_Component---------- | ^ | ^ Pull | | Can_Pull Pull | | Can_Pull Area 3 v | v | Pump Pump Worker_Component Worker_Component
For performance reasons, most of the schedulers shipped with StarPU use simple list-scheduling heuristics, assuming that the application has already set priorities. This is why they do their scheduling between when tasks become available for execution and when a worker becomes idle, without looking at the task graph.
Other heuristics can however look at the task graph. Recording the task graph is expensive, so it is not available by default, the scheduling heuristic has to set _starpu_graph_record
to 1
from the initialization function, to make it available. Then the _starpu_graph*
functions can be used.
src/sched_policies/graph_test_policy.c
is an example of simple greedy policy which automatically computes priorities by bottom-up rank.
The idea is that while the application submits tasks, they are only pushed to a bag of tasks. When the application is finished with submitting tasks, it calls starpu_do_schedule() (or starpu_task_wait_for_all(), which calls starpu_do_schedule()), and the starpu_sched_policy::do_schedule method of the scheduler is called. This method calls _starpu_graph_compute_depths()
to compute the bottom-up ranks, and then uses these ranks to set priorities over tasks.
It then has two priority queues, one for CPUs, and one for GPUs, and uses a dumb heuristic based on the duration of the task over CPUs and GPUs to decide between the two queues. CPU workers can then pop from the CPU priority queue, and GPU workers from the GPU priority queue.
All the Online Performance Tools and Offline Performance Tools can be used to get information about how well the execution proceeded, and thus the overall quality of the execution.
Precise debugging can also be performed by using the STARPU_TASK_BREAK_ON_PUSH, STARPU_TASK_BREAK_ON_SCHED, STARPU_TASK_BREAK_ON_POP, and STARPU_TASK_BREAK_ON_EXEC environment variables. By setting the job_id of a task in these environment variables, StarPU will raise SIGTRAP
when the task is being scheduled, pushed, or popped by the scheduler. This means that when one notices that a task is being scheduled in a seemingly odd way, one can just reexecute the application in a debugger, with some of those variables set, and the execution will stop exactly at the scheduling points of this task, thus allowing to inspect the scheduler state, etc.