The FieldTrip toolbox includes a compact stand-alone toolbox to facilitate distributed computing on a Linux cluster. The idea of the FieldTrip qsub toolbox is to provide you with an easy MATLAB interface to distribute your jobs, so that you do not have to go to the Linux command line and use the qsub command there. Besides the Torque cluster that we have at the Donders in Nijmegen, it also supports Linux clusters with other PBS versions, Sun Grid Engine (SGE), Oracle Grid Engine or SLURM as the batch queueing system.
You should start by adding the qsub toolbox to your MATLAB path:
>> addpath /home/common/matlab/fieldtrip/qsub/
To execute a few jobs in parallel, you will use qsubcellfun to submit a batch of jobs, or qsubfeval to submit a single job to the cluster. On the MATLAB command line, please have a look at
>> help qsubcellfun
This function is similar to the standard CELLFUN function and applies a function to each element of a cell array. Try the following:
>> qsubcellfun(@randn, {1,1,1,1}, 'memreq', 1024, 'timreq', 60)
submitting job irisim_mentat284_p7284_b6_j001... qstat job id 25618.dccn-l014.dccn.nl
submitting job irisim_mentat284_p7284_b6_j002... qstat job id 25619.dccn-l014.dccn.nl
submitting job irisim_mentat284_p7284_b6_j003... qstat job id 25620.dccn-l014.dccn.nl
submitting job irisim_mentat284_p7284_b6_j004... qstat job id 25621.dccn-l014.dccn.nl
job irisim_mentat284_p7284_b6_j001 returned, it required 0 seconds and 832.0 KB
job irisim_mentat284_p7284_b6_j002 returned, it required 0 seconds and 828.0 KB
job irisim_mentat284_p7284_b6_j003 returned, it required 0 seconds and 830.0 KB
job irisim_mentat284_p7284_b6_j004 returned, it required 0 seconds and 829.0 KB
computational time = 0.1 sec, elapsed = 1.0 sec, speedup 0.0 x
ans =
    [0.1194]    [0.3965]    [-0.2523]    [0.3803]
and compare it with
>> cellfun(@randn, {1,1,1,1})
ans =
   -2.2588    0.8622    0.3188   -1.3077
The difference in the output formats is due to the UniformOutput argument, which defaults to false in qsubcellfun and to true in CELLFUN.
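To see the effect of UniformOutput without submitting anything to the cluster, the sketch below uses only the standard CELLFUN. It shows that setting 'UniformOutput' to false yields a cell array in the same format as the qsubcellfun result above, and that cell2mat turns a cell array of scalars back into a numeric array. The variable names are illustrative, and sqrt is used instead of randn only to make the result reproducible.

```matlab
% Apply a function to each cell; sqrt makes the result deterministic.
x = {1, 4, 9, 16};

% Default for CELLFUN: UniformOutput is true, so the result is a numeric array.
a = cellfun(@sqrt, x);

% With UniformOutput set to false the result is a cell array,
% which is the output format that qsubcellfun uses by default.
c = cellfun(@sqrt, x, 'UniformOutput', false);

% A cell array of scalars can be concatenated back into a numeric array.
b = cell2mat(c);

disp(class(a));       % double
disp(class(c));       % cell
disp(isequal(a, b));  % 1
```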
The qsubcellfun command creates a number of temporary files in your working directory. STDIN.oXXX is the standard output, i.e. the output that MATLAB normally prints in the command window. STDIN.eXXX is an error message file. For the job to complete successfully, this file should be empty. All the temporary files are automatically deleted when the job is completed, or when it is terminated with Ctrl+C or with an error.
The execution of each job involves writing the input arguments to a file, submitting the job to Torque, starting MATLAB, reading the file, evaluating the function, writing the output arguments to a file, and at the end collecting the output arguments of all jobs and rearranging them. Starting MATLAB for each job imposes considerable overhead if the jobs are small, which is why qsubcellfun implements “stacking” to combine multiple MATLAB jobs into one job for the Linux cluster. If the jobs that you pass to qsubcellfun are small (less than 180 seconds), they will be stacked automatically. You can control this in detail with the “stack” option in qsubcellfun. For example:
>> qsubcellfun(@randn, {1,1,1,1}, 'memreq', 1024, 'timreq', 60, 'stack', 4);
stacking 4 matlab jobs in each qsub job
submitting job irisim_mentat284_p7284_b7_j001... qstat job id 25677.dccn-l014.dccn.nl
...
>> qsubcellfun(@randn, {1,1,1,1}, 'memreq', 1024, 'timreq', 60, 'stack', 1);
submitting job irisim_mentat284_p7284_b8_j001... qstat job id 25678.dccn-l014.dccn.nl
submitting job irisim_mentat284_p7284_b8_j002... qstat job id 25679.dccn-l014.dccn.nl
submitting job irisim_mentat284_p7284_b8_j003... qstat job id 25680.dccn-l014.dccn.nl
submitting job irisim_mentat284_p7284_b8_j004... qstat job id 25681.dccn-l014.dccn.nl
...
Note that the stacking implementation is not yet ideal, since with the default option it distributed the 4 jobs into 3+1, whereas 2+2 would be better.
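As an illustration of the 2+2 split, the following sketch divides a number of MATLAB jobs as evenly as possible over the qsub jobs. It is not the stacking code used by qsubcellfun; the variable names, the assumed stack size of 3 and the arithmetic are chosen only for this example.

```matlab
numjob = 4;   % number of MATLAB jobs to distribute
stack  = 3;   % assumed maximum number of MATLAB jobs per qsub job

% Number of qsub jobs needed when each holds at most 'stack' MATLAB jobs.
numbatch = ceil(numjob / stack);

% Give every batch floor(numjob/numbatch) jobs, then hand out the
% remainder one by one to the first batches.
batchsize = repmat(floor(numjob / numbatch), 1, numbatch);
remainder = mod(numjob, numbatch);
batchsize(1:remainder) = batchsize(1:remainder) + 1;

disp(batchsize);  % 2 2, rather than 3 1
```

With this scheme no batch is more than one job larger than any other, so the qsub jobs finish at roughly the same time when the individual MATLAB jobs take equally long.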
You will have noticed that you have to specify the time and memory requirements for the individual jobs using the 'timreq' and 'memreq' arguments to qsubcellfun. These time and memory requirements are passed to the batch queueing system, which uses them to find an appropriate execution host (i.e., one that has enough free memory) and to monitor the usage.
Do not set the requirements too tight, because if the job exceeds the requested resources, it will be killed. However, if you grossly overestimate them, your jobs will be scheduled in a “slow” queue, where only a few jobs can run simultaneously. The queueing and throttling policies on the number and the size of the jobs are there to prevent a few large jobs from a single user from blocking all computational resources of the cluster. So the best approach to get your jobs executed is to estimate the memory and time requirements as well as you can.
The help of qsubcellfun lists some suggestions on how to estimate the time and memory.
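One possible way to arrive at an estimate, sketched here under assumptions that are not taken from the qsubcellfun help, is to run a single representative job locally, time it with tic and toc, and look at the size of its input and output with whos. The stand-in function, the safety factors and the variable names are all illustrative. The memory figure only counts the variables shown and ignores whatever the function allocates internally, so treat it as a lower bound, and check the help of qsubcellfun for the units that 'timreq' and 'memreq' expect.

```matlab
% A small stand-in for one real job (hypothetical).
fun = @(n) sum(randn(n), 1);
arg = 500;

% Time a single local evaluation.
t = tic;
result = fun(arg);
elapsed = toc(t);

% Size in bytes of the variables that go into and come out of the job.
s = whos('arg', 'result');
numbytes = sum([s.bytes]);

% Pad both estimates generously; the factors are arbitrary choices.
timest = ceil(elapsed * 3) + 60;   % seconds
memest = numbytes * 4;             % bytes

fprintf('estimated time = %d seconds, estimated memory = %d bytes\n', timest, memest);
```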
You can check to which queue your job was submitted with the QSTAT command on the Linux command line:
bash-3.2$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
25502.dccn-l014           ...2782_b25_j033 irisim          00:31:59 C max2h2gb
25503.dccn-l014           ...2782_b25_j034 irisim          00:31:54 C max2h2gb
To check the processes on all cluster nodes, use the CLUSTER-QSTAT command on the Linux command-line (*).
bash-3.2$ cluster-qstat
dccn-l014.dccn.nl:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
41649.dccn-l014.     stavpel  max24h8g stavpel_mentat30   7266    --  --    8gb 13:56 R 05:27 mentat007/21
41650.dccn-l014.     stavpel  max24h8g stavpel_mentat30   7292    --  --    8gb 13:56 R 05:27 mentat007/22
41651.dccn-l014.     stavpel  max24h8g stavpel_mentat30   7538    --  --    8gb 13:56 R 05:27 mentat007/23
*) Note that this command is specific to the Donders Linux cluster.