GNQS Job Submission

    Queen is not intended for computational work. This is because computational jobs and interactive work tend to place very different requirements on a system. So on the hive they are separated from one another: queen runs the interactive stuff (including job submission) and the drones do the grunt work of running the jobs.

    This means, though, that you have to use the job queue if you want to run computational jobs, because that's the only way to get access to the computational machines. We use a system called GNQS (Generic Network Queueing System). To use it you'll need to know how to use the qsub command, and you'll probably want to learn a few other commands as well (qstat and qdel, to name a couple). While you can find all you need to know in the man pages, this document should give you a gentler introduction to how they're used here.
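
    For example, to read the full documentation for qsub:

    % man qsub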


    The simple way to submit jobs

    The simplest way to submit a job is just to type qsub, enter the commands to run, and then hit ctrl-d (which signals end of input). For example, here a charmm job is submitted to the queue:

    % qsub
    source /prog/setup charmm
    
    charmm < file.inp > file.out
    
    

    After hitting ctrl-d the job will be submitted to the queue. The charmm output will go into file.out and any other output (e.g., errors, or output from the source command) will go into two files:

    STDIN.e123
    STDIN.o123
    

    The number at the end will be the job number of the submission. Generally errors will be in the .e123 file and other output in the .o123 file, but check both files if you're having problems.
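
    For example, if your job were number 123 you could inspect both files like this:

    % cat STDIN.o123
    % cat STDIN.e123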


    A bit more sophisticated

    Instead of typing in your commands on the spot like above, you can create a file that looks something like this:

    #!/bin/tcsh
    ###########
    #
    # This is a job script for a charmm job.
    #
    
    source /prog/setup charmm
    
    charmm < input-file > output-file
    
    

    If the above file were called charmm-script, you could submit it to the queue like this:

    % qsub charmm-script
    

    Putting the job in a script like this is useful if the job is a bit complex, or if you'll be running it (perhaps with minor variations) many times.
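
    For instance, one quick way to make such a variation (a hypothetical example; run2.inp and run2.out are made-up names) is to copy the script with sed, substituting the file names, and then submit the copy:

    % sed -e 's/input-file/run2.inp/' -e 's/output-file/run2.out/' charmm-script > charmm-script2
    % qsub charmm-script2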


    Specifying a queue

    In the above examples the jobs are submitted without worrying about which queue they'll be run in. There are two queues, though: long and short. The long queue is the default queue; unless you have a reason to choose the short queue this is where your jobs should go. There is essentially no time limit on the long queue (about a month, actually) and there are as many as six slots your job could run in.

    The short queue is for quick little jobs where you need the results as soon as possible. Short jobs actually work by stealing processor time from long jobs, which is why the short queue shouldn't be used without reason and why it has only three running slots. There's also a time limit of 8 hours: if your job won't finish in that time it shouldn't be in the short queue. Short jobs do run at a higher priority than long jobs, though, so if you need something quickly and the system is already loaded this is an option.

    To specify a queue when calling qsub you use the -q option, like so:

    % qsub -q short charmm-script
    

    More sophisticated yet

    In order to run your job most efficiently, sometimes you need to use the local disk of the drone your job runs on. Note, though, that running on the local disk usually isn't necessary; most jobs run fine reading their data from the file server. You can always access that disk via the path /local, but unless your data files are there as well you'll still be reading data over the network from the file server. If you're good with shell scripting you can probably come up with your own way to do this, but to help a bit I've written a couple of scripts: local-long and local-short. These scripts automate the process of copying your data to a local disk before running your job and copying it back when you're done. The local-long script submits jobs to the long queue and local-short submits jobs to the short queue.
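
    Conceptually, the process these scripts automate looks something like this (a rough sketch only, not the actual script; the /scratch/$user.results name is made up, and the real scripts handle directory naming, cleanup, and timing):

    #!/bin/tcsh
    #
    # Rough sketch of what local-long automates -- hypothetical.
    #
    set work = /local/$user.$$           # scratch area on the drone's local disk
    cp -r /scratch/data $work            # copy the input data to the local disk
    cd $work
    tcsh charmm-script                   # run the job against the local copy
    cp -r $work /scratch/$user.results   # copy data and results back (the real
                                         # script appends random characters)
    rm -rf $work                         # free the local disk for others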

    Using these scripts places a few demands on how you organize your files, though. Typically a job requires input data and produces output as well (often into multiple files). So the first thing to do is make sure that all your input data files are in the same directory (possibly organized into subdirectories). Ideally, only the files you need for the job should be in this directory, because it will be copied across the network at least twice. For this to work properly it's important that you don't reference any of your data files via an absolute path (whether in files or in symbolic links); otherwise the job will still access your files in their original location instead of the local copy.
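
    One quick way to spot symbolic links in your data directory (and check whether their targets are absolute) is with find:

    % find /scratch/data -type l -ls

    Any link whose target begins with a / should be recreated as a relative link before you submit the job.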

    So let's say that you have all your files in a directory named data, and that directory is in /scratch. You'll then need a script (just like the qsub script above) to run the job. Place that script file in the data directory itself (not in a subdirectory). Then, assuming your script is named charmm-script like the script above, type:

    % local-long charmm-script /scratch/data
    

    The local-long script will call qsub for you and copy the data directory (along with your job script) to a directory in /local on the drone selected by the queueing system to run your job. When the job completes it will copy the results (both data files and output files) into a directory in /scratch. That directory will be named after your username with a series of random characters appended. My username is nowan, so I would see a directory in /scratch like this:

    nowan.XXXXzPmHKq
    

    Your data will be removed from the /local disk, leaving the disk free for others to use. The qsub output file will be created in the directory where you started the job, and will start with STDIN, just as if you had used the simple method from above. As an added bonus, the output file will contain a breakdown of the time spent copying data and the time spent running the job.


    Checking the queue

    To confirm that your job is running after you submit it, use the qstat command. When run without any options it will check for any jobs of yours that are on queen. (Jobs stay on queen until a drone is chosen to run them.) That's not generally useful, though, because you can't see which jobs are actually running. To do that, use the -d option:

    % qstat -d
    

    This command will check the drones as well as queen for any jobs that you've submitted. You'll see output that looks something like this:

    Destination machine: drone1.med.jhmi.edu
    Destination machine: drone2.med.jhmi.edu
    Destination machine: drone3.med.jhmi.edu
    Request         I.D.   Owner    Queue                                        St
    -------------- ------- -------- --------                                     --
    STDIN             241  nowan    long-dest                                    R
    

    You can see that it checks drone1, drone2, and finally drone3. Since drone3 is running one of my jobs it's listed there. A number of characteristics are listed, but the most useful are the job number (241, in this example) and the queue (long-dest, which is simply an end-point of the long queue).

    You may also want to see what jobs others are running. To do that, add the -a option to see all jobs:

    % qstat -da
    

    This command will list any job that anyone is running, on any machine in the hive. If you're interested in seeing more information about the jobs that are running, try this for a long listing:

    % qstat -dal
    

    Check the man page for more options, or for an explanation of the long listing's output:

    % man qstat
    

    Deleting and killing jobs

    When you submit a job it doesn't immediately start running. It will sit on queen for a little bit while the queueing system decides which machine to send the job to. Sometimes all the drones are full, so the job stays on queen waiting for a drone to free up. When this happens the job is said to be waiting, as opposed to running. If a job is waiting on queen you can delete it using the qdel command:

    % qdel 241
    

    Here 241 is the job number (from qstat). Sometimes, though, you need to abort a job that has already begun to run: maybe you made a mistake and the job isn't doing what you want, or maybe the computation is no longer proceeding in a useful direction. To delete the running job from the qstat listing above:

    % qdel -k 241.queen@drone3
    

    The -k option is needed because the job is running: by default qdel will not remove a job that is already running, so it has to be killed instead. The job is also specified in a more verbose way than above, because the job is no longer on queen; it is on drone3. The first part (before the period) is the job number, just as above. The next part (after the period and before the '@') is the machine where the job originated; on the hive this will always be queen. The final portion is the machine where the job is currently running, which we found by running qstat above.
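
    To summarize the pieces of that job specification:

    241.queen@drone3
     |    |     |
     |    |     `--- machine where the job is currently running
     |    `--------- machine where the job originated (always queen on the hive)
     `-------------- job number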

    Document created by Jeremy Hankins; maintained by the Crystallography facility.

