StarPU Handbook
Some libraries need to be initialized once for each concurrent instance that may run on the machine. For instance, a C++ computation class may not be thread-safe by itself, but several instantiated objects of that class can be used concurrently. This can be handled in StarPU by initializing one such object per worker. For instance, the libstarpufft example does the following to be able to use FFTW on CPUs.
Some global array stores the instantiated objects:
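For the FFTW case, this can be as simple as a per-worker array of plans (a sketch close to what libstarpufft does; STARPU_NMAXWORKERS is provided by StarPU):

    static fftw_plan plan_cpu[STARPU_NMAXWORKERS];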
At initialization time of libstarpu, the objects are initialized:
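A sketch of that initialization, here assuming one-dimensional complex-to-complex transforms of size n over buffers in and out (the actual libstarpufft code builds its plans differently):

    unsigned workerid;
    for (workerid = 0; workerid < starpu_worker_get_count(); workerid++)
    {
        if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
            /* One plan per CPU worker, so each worker gets its own FFTW state. */
            plan_cpu[workerid] = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    }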
And in the codelet body, they are used:
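A sketch of such a codelet body (fft_cpu is a hypothetical name), assuming two registered vectors, input and output:

    static void fft_cpu(void *descr[], void *_args)
    {
        int workerid = starpu_worker_get_id();
        /* Pick the plan that was created for this very worker. */
        fftw_plan plan = plan_cpu[workerid];
        fftw_complex *in = (fftw_complex *) STARPU_VECTOR_GET_PTR(descr[0]);
        fftw_complex *out = (fftw_complex *) STARPU_VECTOR_GET_PTR(descr[1]);
        fftw_execute_dft(plan, in, out);
    }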
This however is not sufficient for FFT on CUDA: initialization has to be done from the workers themselves. This can be done thanks to starpu_execute_on_each_worker(). For instance, libstarpufft does the following.
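A sketch of this pattern, assuming cuFFT plans stored in a similar per-worker array (plan_gpu, init_cuda_plan and the transform size are illustrative, not the actual libstarpufft code):

    static cufftHandle plan_gpu[STARPU_NMAXWORKERS];

    static void init_cuda_plan(void *arg)
    {
        int n = *(int *) arg;
        /* This runs on the CUDA worker thread itself, in its CUDA context. */
        cufftPlan1d(&plan_gpu[starpu_worker_get_id()], n, CUFFT_C2C, 1);
    }

    /* At initialization time, run the callback once on each CUDA worker. */
    int n = 1024;
    starpu_execute_on_each_worker(init_cuda_plan, &n, STARPU_CUDA);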
To add a new kind of device to the structure starpu_driver, one needs to:

- add a member to the union starpu_driver::id;
- modify the internal function _starpu_launch_drivers() to make sure the driver is not always launched;
- modify the function starpu_driver_run() so that it can handle another kind of architecture;
- write the new function _starpu_run_foobar() in the corresponding driver.

Graphics-oriented applications need to draw the result of their computations, typically on the very GPU where these happened. Technologies such as OpenGL/CUDA interoperability allow CUDA to work directly on the OpenGL buffers, making them immediately ready for drawing, by mapping OpenGL buffers, textures or renderbuffer objects into CUDA. CUDA however imposes some technical constraints: peer memcpy has to be disabled, and the thread that runs OpenGL has to be the one that runs the CUDA computations for that GPU.
To achieve this with StarPU, pass the option --disable-cuda-memcpy-peer to configure (TODO: make it dynamic). OpenGL/GLUT has to be initialized first, and the interoperability mode has to be enabled by using the field starpu_conf::cuda_opengl_interoperability. The driver loop has to be run by the application itself, by using the field starpu_conf::not_launched_drivers to prevent StarPU from running it in a separate thread, and by using starpu_driver_run() to run the loop. The examples gl_interop and gl_interop_idle show how this works in a simple case, where rendering is done in task callbacks. The former uses glutMainLoopEvent to make GLUT progress from the StarPU driver loop, while the latter uses glutIdleFunc to make StarPU progress from the GLUT main loop.
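A sketch of the initialization sequence under these constraints (GLUT window setup, task submission and cleanup are elided; the struct starpu_conf fields are the ones named above):

    struct starpu_conf conf;
    /* We will run the CUDA 0 driver ourselves, from the OpenGL thread. */
    struct starpu_driver drivers[1] = { { .type = STARPU_CUDA_WORKER, .id.cuda_id = 0 } };
    unsigned interop_devices[1] = { 0 };

    /* OpenGL/GLUT has to be initialized first. */
    glutInit(&argc, argv);

    starpu_conf_init(&conf);
    conf.cuda_opengl_interoperability = interop_devices;
    conf.n_cuda_opengl_interoperability = 1;
    conf.not_launched_drivers = drivers;
    conf.n_not_launched_drivers = 1;
    starpu_init(&conf);

    /* ... submit tasks ... */

    /* Run the driver loop from this thread; it returns once
       starpu_drivers_request_termination() has been called. */
    starpu_driver_run(&drivers[0]);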
Then, to use an OpenGL buffer as CUDA data, StarPU simply needs to be given the CUDA pointer at registration, for instance:
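(A sketch: resource is the CUDA graphics resource previously obtained from the OpenGL buffer, and the buffer is assumed to hold float4 elements.)

    unsigned workerid;
    float4 *output;
    size_t num_bytes;
    starpu_data_handle_t handle;

    /* Find a CUDA worker */
    for (workerid = 0; workerid < starpu_worker_get_count(); workerid++)
        if (starpu_worker_get_type(workerid) == STARPU_CUDA_WORKER)
            break;

    /* Build a CUDA pointer pointing at the OpenGL buffer */
    cudaGraphicsResourceGetMappedPointer((void **) &output, &num_bytes, resource);

    /* And register it to StarPU, on the memory node of that CUDA worker */
    starpu_vector_data_register(&handle, starpu_worker_get_memory_node(workerid),
                                (uintptr_t) output, num_bytes / sizeof(float4),
                                sizeof(float4));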
and display it e.g. in the callback function.
Some users had issues with MKL 11 and StarPU (versions 1.1rc1 and 1.0.5) on Linux: using 1 thread for MKL and doing all the parallelism with StarPU (no multithreaded tasks), setting the environment variable MKL_NUM_THREADS to 1, and using the threaded MKL library with iomp5. Using this configuration, StarPU only uses 1 core, no matter the value of STARPU_NCPU. The problem is actually a thread pinning issue with MKL.
The solution is to set the environment variable KMP_AFFINITY to disabled (see http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/optaps/common/optaps_openmp_thread_affinity.htm).
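For instance, in a Bourne-style shell, before running the application:

$ export KMP_AFFINITY=disabled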
When using StarPU on a NetBSD machine, if the topology discovery library hwloc is used, thread binding will fail. To prevent this problem, you should use at least version 1.7 of hwloc, and also issue the following call:
$ sysctl -w security.models.extensions.user_set_cpu_affinity=1
Or add the following line in the file /etc/sysctl.conf:
security.models.extensions.user_set_cpu_affinity=1
Yes, StarPU eating 100% of all CPUs even when idle is on purpose.
By default, StarPU uses active polling on task queues, so as to minimize wake-up latency for better overall performance.
If eating CPU time is a problem (e.g. an application running on a desktop), pass the option --enable-blocking-drivers to configure. This will add some overhead when putting CPU workers to sleep or waking them up, but avoids eating 100% of the CPU permanently.
If your application only partially uses StarPU, and you do not want to call starpu_init() / starpu_shutdown() at the beginning/end of each section, StarPU workers will poll for work between the sections. To avoid this behavior, you can "pause" StarPU with the starpu_pause() function. This will prevent the StarPU workers from accepting new work (tasks that are already in progress will not be frozen), and stop them from polling for more work.
Note that this does not prevent you from submitting new tasks, but they won't execute until starpu_resume() is called. Also note that StarPU must not be paused when you call starpu_shutdown(), and that this function pair works in a push/pull manner, i.e. you need to match the number of calls to these functions to clear their effect.
One way to use these functions could be:
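A sketch, alternating a StarPU phase with non-StarPU sections:

    starpu_init(NULL);
    starpu_pause();            /* Workers stop polling until we have work for them. */

    /* ... non-StarPU code ... */

    starpu_resume();
    /* ... submit StarPU tasks ... */
    starpu_task_wait_for_all();
    starpu_pause();

    /* ... more non-StarPU code ... */

    starpu_resume();           /* StarPU must not be paused at shutdown time. */
    starpu_shutdown();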
Yes, dedicating a whole CPU core to each GPU is on purpose.
Since GPU devices are way faster than CPUs, StarPU needs to react quickly when a task is finished, to feed the GPU with another task (StarPU actually submits a couple of tasks in advance so as to pipeline this, but refilling the pipeline still has to happen often enough), and it thus has to dedicate threads for this duty, which is very CPU-consuming. StarPU therefore dedicates one CPU core for driving each GPU by default.
Such dedication is also useful when a codelet is hybrid, i.e. while kernels are running on the GPU, the codelet can run some computation, which is then performed by the CPU core instead of it merely driving the GPU.
One can choose to dedicate only one thread for all the CUDA devices by setting the STARPU_CUDA_THREAD_PER_DEV environment variable to 1. The application should however use STARPU_CUDA_ASYNC on its CUDA codelets (asynchronous execution), otherwise the execution of a synchronous CUDA codelet will monopolize the thread, and the other CUDA devices will starve while it is executing.
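For instance:

$ export STARPU_CUDA_THREAD_PER_DEV=1

and, in the codelet definition (my_cuda_kernel being a hypothetical codelet function):

    static struct starpu_codelet cl =
    {
        .cuda_funcs = { my_cuda_kernel },
        /* Asynchronous execution, so the shared CUDA thread is not monopolized. */
        .cuda_flags = { STARPU_CUDA_ASYNC },
        .nbuffers = 1,
        .modes = { STARPU_RW },
    };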
First make sure that CUDA is properly running outside StarPU: build and run the following program with -lcudart:
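A minimal device-enumeration sketch (not the exact program shipped in the StarPU sources):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int n, i;
        cudaError_t err = cudaGetDeviceCount(&n);
        if (err != cudaSuccess)
        {
            fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            exit(1);
        }
        printf("found %d CUDA device(s)\n", n);
        for (i = 0; i < n; i++)
        {
            struct cudaDeviceProp props;
            err = cudaGetDeviceProperties(&props, i);
            if (err != cudaSuccess)
            {
                fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
                exit(1);
            }
            printf("CUDA%d: %s, %zu MiB\n", i, props.name, props.totalGlobalMem >> 20);
        }
        return 0;
    }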
If that program does not find your device, the problem is not at the StarPU level, but with the CUDA drivers; check the documentation of your CUDA setup.
First make sure that OpenCL is properly running outside StarPU: build and run the following program with -lOpenCL:
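A minimal platform/device enumeration sketch (again, not the exact program from the StarPU sources):

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_uint nplatforms, ndevices, i, j;
        cl_platform_id platforms[8];
        cl_device_id devices[8];
        char name[128];

        if (clGetPlatformIDs(8, platforms, &nplatforms) != CL_SUCCESS)
        {
            fprintf(stderr, "clGetPlatformIDs failed\n");
            exit(1);
        }
        if (nplatforms > 8) nplatforms = 8;
        for (i = 0; i < nplatforms; i++)
        {
            if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 8, devices, &ndevices) != CL_SUCCESS)
                continue;
            if (ndevices > 8) ndevices = 8;
            for (j = 0; j < ndevices; j++)
            {
                clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(name), name, NULL);
                printf("platform %u device %u: %s\n", i, j, name);
            }
        }
        return 0;
    }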
If that program does not find your device, the problem is not at the StarPU level, but with the OpenCL drivers; check the documentation of your OpenCL implementation.
The performance model file, used by StarPU to record the performance of codelets, seems to have been corrupted. Perhaps a previous run of StarPU stopped abruptly, and thus could not save it properly. You can have a look at the file to see whether you can fix it, but the simplest way is to just remove the file and run again; StarPU will then simply re-perform the calibration for the corresponding codelet.
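For instance, assuming the default location of the performance model files under $HOME/.starpu/sampling/codelets (an assumption; the directory may differ on your setup, and the_offending_codelet stands for the actual model name):

$ rm ~/.starpu/sampling/codelets/*/the_offending_codelet*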