Academic and Research Computing
QUICK STUDY #30
October 2001

Batch Processing for NIC Users

Why and Where to Use Batch Processing

Batch processing allows you to submit a long-running job from RCS and log out instead of waiting at the workstation for the job to run. DotCIO uses a queueing system, called DQS, that provides batch processing for numerically-intensive computing (NIC) on several server workstations and a Linux cluster.

To access either service, you must first submit the on-line Batch Service Access Request form. You will be notified by e-mail once access has been granted, and you can begin using either service the day after you receive that message. If you need help getting started with the batch services, please send e-mail to Mike Kupferschmid at kupfem@rpi.edu, or Mark Miller at millem@rpi.edu.

Preparing a Batch File

Use an editor to prepare a file containing the sequence of UNIX commands that you want to execute in batch, like the following example:

  #$ -cwd
  hostname
  date
  myprog < inputfile
  date

The first line tells DQS to set the current working directory to the directory from which you will submit the job. If your executable or data files are in some other directory, give full path names to them. If your program reads from standard input, you must redirect standard input from a file, as shown; without the redirection, the program will not read the next line of the batch file as its input. You can use any name for the batch file, but the examples below assume that it is called fyle.
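
As a sketch of the full-path advice above, the batch file below names both the program and its input by full path. PROG and INPUT are hypothetical names introduced for illustration, and their default values (with cat standing in for a real executable) are assumptions chosen only so that the sketch can be tried anywhere.

```shell
#$ -cwd
# PROG and INPUT are hypothetical full path names; replace the defaults,
# which exist only so that this sketch can be tried anywhere.
PROG=${PROG:-/bin/cat}                          # stand-in for your executable
INPUT=${INPUT:-${TMPDIR:-/tmp}/nic_inputfile}   # stand-in for your data file
[ -f "$INPUT" ] || printf 'first\nsecond\n' > "$INPUT"

date
# Redirect standard input from the data file by its full path name.
"$PROG" < "$INPUT"
date
```

Keeping the paths in variables at the top of the file means that only two lines change when the program or its data move to another directory.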

Picking Queues

Now find the batch queues that are suitable for your job, by entering at a UNIX prompt

  qstatus | more

By reading the qstatus listing you can see what queues are idle, what system type they run on, what CPU and memory limits they have, and other resources assigned to them. Queues are also assigned to groups for ease of selection. Pick a group having some idle queues whose CPU time, memory, and temporary disk space limits are large enough to run your job.

Submitting the Job

Next, submit the job. For example, if you picked the group named b-RS6K, you would enter

  qsub fyle -l group.eq.b-RS6K

In response, DQS will queue the job and assign it a job number. The string group.eq.b-RS6K in the example above is called a resource list. Resource lists can specify the values of resources other than the group name. For example, the resource list cpu_limit.eq.12,mem_limit.eq.512,arc.eq.RS6K specifies the same set of queues as group.eq.b-RS6K, as you can confirm from the qstatus output. Linux cluster users running parallel jobs on n nodes need to include qty.eq.n in the resource list.

While the job is waiting to run or is running, it will be listed by its job number and your user name in the output from qstatus. On the serial queues, you may have up to 2 jobs executing at the same time, and no more than 3 jobs in total, executing and waiting combined.
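
The resource-list syntax described above can be assembled mechanically. The following sketch defines a hypothetical helper, resource_list, that joins name=value arguments into the name.eq.value form separated by commas, and then prints a qsub command instead of running it. Only the .eq. form, the comma separator, and the resource names come from the examples above; the helper and its argument format are assumptions made for illustration.

```shell
# Join name=value arguments into a DQS-style resource list: a.eq.1,b.eq.2
resource_list() {
    list=""
    for pair in "$@"; do
        name=${pair%%=*}     # text before the first '='
        value=${pair#*=}     # text after the first '='
        list="${list:+$list,}$name.eq.$value"
    done
    printf '%s\n' "$list"
}

# Dry run: print the qsub command rather than submitting anything.
echo "qsub fyle -l $(resource_list cpu_limit=12 mem_limit=512 arc=RS6K)"
```

A parallel job on the Linux cluster would add an argument such as qty=4 to the same call, alongside whatever group or resources qstatus shows for the cluster queues.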

Getting the Output

When your job is done it disappears from the qstatus listing, and its standard output and standard error are written to files in the directory from which you submitted the job. For our example, these files are named fyle.ojjjjj.ppppp and fyle.ejjjjj.ppppp, where jjjjj is the job number and ppppp is a process number. If you need to see intermediate output, modify your program to write to some other file and close and reopen the file after each block of output that you want to see right away.
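
Because the process number in the file names is not known in advance, a wildcard is a convenient way to find the output of a particular job. The sketch below defines a hypothetical helper, job_files, that lists the output and error files for a given batch file name and job number; it relies only on the naming pattern described above and must be run in the directory from which the job was submitted.

```shell
# List the standard-output and standard-error files for one job.
# usage: job_files BATCHFILE JOBNUMBER   (run in the submit directory)
job_files() {
    for f in "$1".o"$2".* "$1".e"$2".*; do
        [ -e "$f" ] && printf '%s\n' "$f"   # skip patterns that matched nothing
    done
    return 0
}

job_files fyle 12345
```

If the job has not finished, or the job number is wrong, the helper simply prints nothing.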

Stopping a Job

If you realize after submitting a job that it will not do what you want, or if it seems to be running for longer than you expected, you can stop it. If the job number is 12345, you would enter

  qdel 12345

The qsub command told you the job number when you submitted the job, and the number is also displayed by qstatus.

Errors

Most error conditions that occur in using DQS result from transient network outages or delays, but some indicate that DQS system processes are stopped or malfunctioning. If you see the same symptoms repeatedly in attempts that are separated by 10 minutes or more, or if qdel fails to delete your job, please report the trouble to nic-support-l@lists.rpi.edu. It will help us to diagnose the problem if you include the following information: the date and time of the incident, the job number, the hostname of the machine, the symptoms, any error messages you receive either on the display or in the .e or .o files, the exact qsub command you used, and the contents of the DQS batch file you submitted (fyle in our examples above).
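
Some of the items requested above can be gathered with a few commands. The sketch below defines a hypothetical helper, dqs_report, that prints the host name, the job number, the qsub command, the batch file, and any .o or .e files for the job that are found in the current directory. The helper name and the layout of its output are assumptions; the date and time of the incident, the symptoms, and any messages that appeared only on the display still have to be added by hand.

```shell
# Print part of the information requested for a trouble report.
# usage: dqs_report JOBNUMBER BATCHFILE "QSUB COMMAND"
dqs_report() {
    echo "Report written: $(date)"
    echo "Host:           $(uname -n)"
    echo "Job number:     $1"
    echo "qsub command:   $3"
    echo "Batch file ($2):"
    cat "$2"
    # Append any .o/.e files for this job that exist in the current directory.
    for f in "$2".[oe]"$1".*; do
        [ -f "$f" ] && { printf '%s\n' "--- $f ---"; cat "$f"; }
    done
    return 0
}
```

Run it on the machine where the problem occurred, in the directory from which the job was submitted, for example as dqs_report 12345 fyle "qsub fyle -l group.eq.b-RS6K", and paste the result into your message.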

If your job results in a .o or .e file that has zero length, you probably do not have enough disk space to store the output. Estimate how much space the job will need and use the command fs lq to verify that you have enough. If you determine that you need to update your RCS account quota, you can find information on-line.
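
A quick way to spot zero-length output files is to test each one with the shell. The sketch below defines a hypothetical helper, empty_outputs, that prints the name of every empty file beginning with the batch file name followed by .o or .e in the current directory. The helper is an illustration only; it also matches any unrelated file whose name happens to start the same way.

```shell
# Report any output or error files for a batch file that are empty.
# usage: empty_outputs BATCHFILE   (run in the submit directory)
empty_outputs() {
    for f in "$1".[oe]*; do
        # -f: the file exists; ! -s: it has zero length
        if [ -f "$f" ] && [ ! -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

empty_outputs fyle
```

Any file name it prints is a prompt to check your quota with fs lq before resubmitting the job.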

Additional Information

For more information on any DQS command, you can read its documentation by using the UNIX man command. For example, to read more about qsub you would enter

  man qsub


About this document ...

Published by Academic and Research Computing, RPI, Troy, NY 12180

Send comments to consult@rpi.edu.