
Cluster allegro

allegro is a hybrid x86 and GPU cluster consisting of roughly 1000 cores and about 3.5 TB of RAM, distributed across several node types.

The nodes are interconnected via InfiniBand for parallel computation jobs (MPI) and equipped with 15 TB of cluster-wide hard disk storage.

Job Management

Calculations are scheduled and automatically distributed via the TORQUE resource manager.


Getting started

For the impatient, you will find a quick run-through here.

Getting an account

You need an account at Freie Universität Berlin that has been enabled at the Department of Mathematics and Computer Science.

Logging In

The cluster is only reachable from within the Department of Mathematics and Computer Science. If you want to access the cluster from outside the department, please log in on one of our SSH remote login nodes and then jump to allegro, or use an SSH tunnel (see the example below).

To log in to the cluster, you need an SSH client of some sort. If you are using a Linux or Unix-based system, there is most likely one already available to you in a shell, and you can get to your account very quickly. For Microsoft Windows, we recommend PuTTY.
$ ssh <username>@allegro.imp.fu-berlin.de
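
If you are connecting from outside the department, one way is to jump through one of the SSH remote login nodes mentioned above. This is only a sketch: <remote-login-node> is a placeholder, not an actual hostname.
$ ssh -J <username>@<remote-login-node> <username>@allegro.imp.fu-berlin.de

Alternatively, an SSH tunnel forwards a local port to allegro's SSH port:
$ ssh -L 2222:allegro.imp.fu-berlin.de:22 <username>@<remote-login-node>
$ ssh -p 2222 <username>@localhost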

Submitting jobs

Save the following as job_script.sh and replace the <USER> and <EMAIL ADDRESS> placeholders.
#!/bin/bash
# Torque/PBS directives: job name, working directory, stdout/stderr files,
# mail address for notifications, and resource requests
# (walltime, nodes/cores, per-process memory)
#PBS -N testjob
#PBS -d /storage/mi/<USER>
#PBS -o testjob.out
#PBS -e testjob.err
#PBS -M <EMAIL ADDRESS>
#PBS -l walltime=00:01:00
#PBS -l nodes=1:ppn=1
#PBS -l pmem=10mb

# The job itself: print the execution host and the current date
hostname
date

Then, it's time to start the job via qsub (QueueSUBmit).
$ qsub job_script.sh

You can see your currently queued and running jobs with qstat (QueueSTATus).
$ qstat
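
Two commonly useful variants (standard Torque qstat options, shown here as a hint rather than an exhaustive list):
$ qstat -u $USER     # show only your own jobs
$ qstat -n           # additionally list the nodes allocated to each job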

Queue Limits

If you submit a job, the system uses the 'smallest possible' queue.

Short description of queue limits:

(long/short is runtime, big/small is memory)
Usage         Queue     Runtime     CPUs      Memory    Count
very little   'micro'   < 1 day     1 cpu     < 6G      750
small         'small'   < 2 days    1 cpu     --        759
short, big    'big'     < 2 days    --        --        400
long, small   'long'    > 2 days    1 cpu     < 3G      400
all others    'large'   > 2 days    > 1 cpu   --        10
All undeclared settings are assumed to be at the largest limit.
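
If automatic placement does not do what you want, a queue can also be requested explicitly with qsub's -q option. This is only a sketch using the queue names from the table above; the requested resources still have to fit the queue's limits:
$ qsub -q big job_script.sh
$ qsub -q long -l walltime=72:00:00,pmem=2gb job_script.sh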


Cook Book

Stdout/Stderr Munging
common, useful operations regarding stdout/stderr redirection (a small example follows after this list)
Selecting Node Classes
selecting node classes, required for consistent running time results.
Selecting Queues
selecting queues for short / long running jobs.
Job arrays
allow you to submit a sequence of similar job scripts that only differ by one environment variable (${PBS_ARRAYID}) (advanced)
Temporary Files and Cleanup
managing temporary files and cleaning up after your jobs (advanced)
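
As a small example of stdout/stderr munging (hedged; -j is a standard Torque qsub option): stderr can be merged into stdout so that a job writes a single log file:
#PBS -j oe
#PBS -o testjob.log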

Notes

  • The qstat command's output is not in real time. In particular, the Time Use (CPU time!) field is only updated once every several seconds.
  • Use qstat -f (optionally followed by a job ID) to get detailed info about a job.
  • NEW The stdout and stderr log files are created while the job runs. If they are written to $HOME, you can follow them with 'tail' etc.
  • Useful: the -t flag for array execution. Use PBS_ARRAYID=1 bash job_script.sh to simulate one task of an array job locally (see the sketch after this list).
  • If you specify -l nodes=1 then you will NOT get the node exclusively. Use -l nodes=1:ppn=24.
  • The ppn setting works as a multiplier on the pmem restriction!
  • If your job script calls another script, that script is not cached. Do not modify included scripts while jobs are running, or be aware of the side effects!
  • NEW Since September 12 you can run X11 programs (ssh -X allegro…) and see the window on your workstation.
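
A short sketch of the array-related notes above, using the job_script.sh example from earlier (the task range 1-10 is arbitrary):
$ qsub -t 1-10 job_script.sh          # submit 10 tasks; each sees its own ${PBS_ARRAYID}
$ PBS_ARRAYID=3 bash job_script.sh    # simulate a single task locally, without the scheduler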

File System Paths

The following file system locations are interesting:

/home/$username
Extra home directory for each user on allegro, on fast InfiniBand-connected hard drives.
/nfs/$normal-path
Things that are normally available through the network. For example: /nfs/group, /nfs/home/mi/..., …

Data can be copied from the /nfs/... paths to the home directory on allegro.

/data/scratch
Scratch directory for temporary data. It is a good idea to set your TMPDIR environment variable to /data/scratch/$USER/ after creating this directory.
/data/scratch.local
Local scratch directory for temporary data, in case you need the speed of a local disk -- beware of the limited space.
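
For example, to create your scratch directory and point TMPDIR at it (based on the recommendation above; add the export to your job script or shell profile if you want it permanently):
$ mkdir -p /data/scratch/$USER
$ export TMPDIR=/data/scratch/$USER/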

Cluster Queue Commands


The cluster queue is managed by the TORQUE cluster management system, which provides the following client commands (among others):

  • checkjob: Display information about a job, e.g. the reason for it being deferred.
  • qsub: Submit jobs to the work queue.
  • qstat: Display job queue status.
  • qdel: Remove/cancel jobs from the queue.
  • showq: Show queue status.
  • qnodes: Show information about available nodes.
  • qalter: Change attributes of queued or running jobs, such as time and memory consumption estimates.
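
For example, qalter can lower a job's requested walltime (a sketch; <jobid> is the ID printed by qsub when the job was submitted):
$ qalter -l walltime=02:00:00 <jobid>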

Cluster Resource Policy

  • The time limit that you give to your jobs is 'hard'. Jobs will be killed as soon as their requested walltime is used up, even if they are still running.

  • The memory limit that you give to your jobs is 'hard', too, same as with time. If you allow the job to have 1024 MB but it uses 1034 MB, it will be killed immediately. The message that you will receive for such an event reads like this: 'job violates resource utilization policies'.

  • Both values, walltime and memory, have no intrinsic limit; you can set them to whatever you want. There is, however, a policy that gives low-time and low-memory jobs a preference: jobs with walltime < 2 h and memory < 2 GB will be preferred.
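
To benefit from that preference, request limits below the thresholds explicitly, for example (a sketch; adjust the values to what your job actually needs):
$ qsub -l walltime=01:30:00,pmem=1gb job_script.sh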
