

CGAT Use Cases



How to grid-enable Cactus - Overview






Integrating CGAT into your Cactus Application


  • Requirements

The Grid Application Toolkit (GAT) must be installed on your target machine (available from the GridLab web pages).
$GAT_LOCATION must point to your (or the system-wide) GAT installation.
$GAT_ADAPTOR_PATH must point to a file listing the adaptors to be used ($GAT_LOCATION/lib/GAT/adaptor-list).
$GAT_CONFIG_FILE may point to a file listing advanced configuration options for the GAT ($GAT_LOCATION/share/GAT/gatrc).
For convenience, set $CACTUS_SRC to point to your Cactus source tree.
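
The requirements above can be checked before building anything. The following is a minimal sketch, not part of the GAT or Cactus: the function name check_gat_env is hypothetical, and it only assumes that $GAT_LOCATION is a directory and that the other two variables name regular files.

```shell
# Report any GAT-related variable that is unset or points at a missing path.
# check_gat_env is a hypothetical helper name, not part of the GAT or Cactus.
check_gat_env() {
    rc=0
    # Required: the installation directory and the adaptor list file.
    if [ -z "${GAT_LOCATION:-}" ] || [ ! -d "$GAT_LOCATION" ]; then
        echo "GAT_LOCATION is unset or not a directory" >&2
        rc=1
    fi
    if [ -z "${GAT_ADAPTOR_PATH:-}" ] || [ ! -f "$GAT_ADAPTOR_PATH" ]; then
        echo "GAT_ADAPTOR_PATH is unset or not a file" >&2
        rc=1
    fi
    # Optional: only checked when it has been set.
    if [ -n "${GAT_CONFIG_FILE:-}" ] && [ ! -f "$GAT_CONFIG_FILE" ]; then
        echo "GAT_CONFIG_FILE is set but not a file" >&2
        rc=1
    fi
    return "$rc"
}
```

Calling check_gat_env in the shell you build from prints one line per problem and returns a non-zero status if anything is missing.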

  • Download CGAT sources from cvs.gridlab.org (user readonly, pass anon)

mkdir -p $CACTUS_SRC/arrangements/CactusGAT
cd  $CACTUS_SRC/arrangements/CactusGAT
cvs -d :pserver:readonly@cvs.gridlab.org:/cvs/gridlab checkout -d CGAT wp-2/Codes/Thorns/CactusGAT/CGAT
  • Reconfigure and Recompile Cactus using an existing Cactus configuration

To use CGAT, the following thorns must be listed in your ThornList ($CACTUS_SRC/configs/<name of existing configuration>/ThornList):

  • IOUtil
  • IOBasic
  • IOASCII
  • IOHDF5Util
  • IOHDF5
  • CGAT
cd $CACTUS_SRC
gmake <name of existing configuration>-rebuild
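
Before starting the rebuild, it may be worth confirming that the ThornList really names all six thorns. The sketch below assumes the usual ThornList layout of one Arrangement/Thorn entry per line with '#' starting a comment; the function name missing_thorns is hypothetical.

```shell
# Print every thorn CGAT needs that is absent from the given ThornList.
# Assumes one "Arrangement/Thorn" entry per line; '#' starts a comment.
missing_thorns() {
    thornlist=$1
    for thorn in IOUtil IOBasic IOASCII IOHDF5Util IOHDF5 CGAT; do
        # Drop comments, then look for the thorn as the last path component,
        # so that IOHDF5Util is not mistaken for IOHDF5.
        if ! sed 's/#.*//' "$thornlist" |
            grep -Eq "(^|[/[:space:]])$thorn[[:space:]]*\$"; then
            echo "$thorn"
        fi
    done
}
```

For example, missing_thorns "$CACTUS_SRC/configs/<name of existing configuration>/ThornList" prints nothing when the list is complete.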

  • Useful Parameter File Options 


 
| Thorn  | Name                      | Type    | Description                                                      |
|--------|---------------------------|---------|------------------------------------------------------------------|
| IO     | checkpoint_file           | String  | name of checkpoint file to be written to disk                    |
| IO     | recover_file              | String  | name of checkpoint file on disc to recover from                  |
| IO     | out_dir                   | String  | directory where to dump output                                   |
| IO     | recover_and_remove        | Boolean | switch on/off removing checkpoint file after successful recover  |
| IO     | recover                   | String  | switch on/off/autoprobe recovery from checkpoint file            |
| IOHDF5 | checkpoint                | Boolean | switch on/off checkpoint in HDF5 format                          |
| CGAT   | replica_home_directory    | String  | your home directory in replica catalog                           |
| CGAT   | announce_checkpointfiles  | Boolean | switch on/off linking checkpoint files in replica catalog        |
| CGAT   | announce_output_directory | Boolean | switch on/off linking output directory in replica catalogue      |




Test using local GAT Adaptors


  • Environment Settings

export LD_LIBRARY_PATH=$GAT_LOCATION/lib:$LD_LIBRARY_PATH 
export GAT_ADAPTOR_PATH=$GAT_LOCATION/lib/GAT/adaptor-list

To test whether checkpointing through the GAT works, create a file named gatrc with the following content:

[resourcebroker_adaptor]
Adaptor=resourcebroker_adaptor
CheckPointAfterIterations=90

The CheckPointAfterIterations entry defines after how many iterations the GAT should checkpoint your simulation. Then point $GAT_CONFIG_FILE at this file:

export GAT_CONFIG_FILE=<location of your gatrc>
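
The two steps above (writing gatrc and exporting $GAT_CONFIG_FILE) can be combined in a small helper. This is only a sketch: the function name write_test_gatrc and its arguments are hypothetical, while the file content is the one shown above.

```shell
# Write the test gatrc into a directory and point the GAT at it.
# Usage: write_test_gatrc <directory> [iterations]   (defaults to 90)
write_test_gatrc() {
    dir=$1
    iterations=${2:-90}
    mkdir -p "$dir" || return 1
    cat > "$dir/gatrc" <<EOF
[resourcebroker_adaptor]
Adaptor=resourcebroker_adaptor
CheckPointAfterIterations=$iterations
EOF
    # Export in the calling shell so that Cactus inherits the setting.
    GAT_CONFIG_FILE=$dir/gatrc
    export GAT_CONFIG_FILE
}
```

Because the function exports the variable in the current shell, it has to be called directly (for example write_test_gatrc "$PWD") rather than from a separate script.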

  • Basic Parameter File Options (required)

IOHDF5::checkpoint  = "yes"
IO::checkpoint_file = "myFirstGATCheckpoint"
CGAT::replica_home_directory = "/home/gatuser"
CGAT::announce_checkpointfiles    = "yes"

  •  Expected Cactus Output

If everything works, you should see Cactus output similar to the following (details such as host names and job IDs will differ in your case):

INFO (CGAT): Invoked checkpoint call-back, checkpoint will be triggered at next iteration
INFO (CGAT): Updating replica at iteration 90.
INFO (CGAT): annihilate logical file
INFO (CGAT): <-- gsiftp://Ikarus/home/robert/TestSuite/./myFirstGATCheckpoint.it_90.h5
INFO (CGAT): --> /home/gatuser/GAT_JOBID:8e5a25e0-1c56-11d9-b031-000d60371fb6/myFirstGATCheckpoint
    90 |    1.552 |   0.03823166 |   0.94869581 |
INFO (IOHDF5): ---------------------------------------------------------
INFO (IOHDF5): Dumping termination checkpoint at iteration 90
INFO (IOHDF5): ---------------------------------------------------------
INFO (CGAT): Shutting down the GAT engine
--------------------------------------------------------------------------------
Done.

You should also find a checkpoint file called myFirstGATCheckpoint.it_90.h5 in your working directory.

In addition, check that the local (logical) file

/tmp/GAT/gat_logicalfilestore/home/gatuser/GAT_JOBID\:8e5a25e0-1c56-11d9-b031-000d60371fb6/myFirstGATCheckpoint

exists and that its content points to the correct (physical) checkpoint file:

gsiftp://Ikarus/home/robert/TestSuite/./myFirstGATCheckpoint.it_90.h5

The lengthy string (GAT_JOBID\:8e5a25e0-1c56-11d9-b031-000d60371fb6) is called the GAT_JOBID and uniquely identifies every GAT run. You can override the generation of a random string by setting the environment variable $GAT_JOBID to something more convenient.
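
The check described above can be scripted. The sketch below assumes that the local logical file is a plain text file containing the physical URL, which is what the description suggests but is not verified here; the function name check_logical_file is hypothetical.

```shell
# Succeed if the logical file exists and its content names the expected
# physical URL. Assumes the logical file stores the URL as plain text.
check_logical_file() {
    logical=$1
    physical=$2
    [ -f "$logical" ] || { echo "missing: $logical" >&2; return 1; }
    # -F treats the URL as a fixed string, so '.' and '/' are not patterns.
    grep -Fq -- "$physical" "$logical"
}
```

The first argument is the path below /tmp/GAT/gat_logicalfilestore shown above, and the second is the gsiftp:// URL of the checkpoint file.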




Test using remote GAT adaptors


  • Sources for remote GAT adaptors

Many remote adaptors are currently under development. Check the GridLab web pages for remote adaptors designed to access GridLab services.
The following remote adaptors from www.gridlab.org are currently available on the GridLab Testbed:

gridlab_util_gsoap_adaptor (needed to support adaptors using gSOAP)
gridlab_file_adaptor (remote file operations)
gridlab_logicalfile_adaptor (access to the GridLab replica service)
gridlab_advertservice_adaptor (access to the GridLab advertise service)
gridlab_monitoring_adaptor (access to the GridLab resource monitoring service)
gridlab_resource_adaptor (access to the GridLab resource management service)
gridlab_tracing_adaptor (access to the GridLab logging service)

Adaptors printed in bold face are currently being used by CGAT.


  • Build and list adaptors in $GAT_ADAPTOR_PATH

If you want to build remote adaptors on your own, check the Gridlab adaptor release page for further details.

List all local adaptors you intend to use (full path and name) in $GAT_ADAPTOR_PATH.
List all remote adaptors you intend to use (full path and name) in $GAT_ADAPTOR_PATH.

For our purposes, the following adaptor list would be reasonable:

# absolute path names to local adaptors
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libfileops_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libfilestream_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libadvertservice_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libendpoint_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libresourcebroker_adaptor.la
# absolute path names to remote adaptors
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_util_gsoap_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_logicalfile_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_advertservice_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_monitoring_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_resource_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_tracing_adaptor.la

This list allows the Cactus application to be checkpointed through GRMS (libgridlab_resource_adaptor), even if the application is running on internal cluster nodes (libgridlab_monitoring_adaptor). The checkpoint files and output are registered with the GridLab replica service (libgridlab_logicalfile_adaptor).
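
A typo in one of these long paths is easy to miss, so it can help to verify the list before starting Cactus. The following sketch assumes the file format shown above (one absolute path per line, '#' for comments); the function name missing_adaptors is hypothetical.

```shell
# Print each adaptor named in the list file that does not exist on disk.
# Blank lines and lines starting with '#' are skipped.
missing_adaptors() {
    list=$1
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ''|'#'*) continue ;;
        esac
        [ -f "$line" ] || echo "$line"
    done < "$list"
}
```

Running missing_adaptors "$GAT_ADAPTOR_PATH" prints nothing when every listed adaptor file is present.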


  • Gridlab Testbed


If you have access to the GridLab Testbed, you can take advantage of a preinstalled GAT, including remote adaptors, on most resources. To check which resources are properly configured with the GAT, see the testbed status on the GridLab web pages.

Run source /etc/gridlab.conf to set all necessary environment variables.

Integrate CGAT into your Cactus application as described above.

Start your simulation and initiate a checkpoint using either the GRMS command line client, which is part of the GridLab Resource Management System, or the Cactus Portal. To checkpoint your Cactus simulation through the Cactus Portal, you need a valid portal account.


  • Expected Cactus Output

Initializing the GAT Engine and accessing the Gridlab Replica Service for registering the output directory:

INFO (CGAT): Initializing the GAT engine
INFO (CGAT): Registering checkpoint capabilities with the GAT
INFO (CGAT): Announcing output directory to replica service
GSI plugin for gSOAP v2.4:  Established security context with: /O=Grid/O=GridLab/CN=litchi.zib.de
GSI plugin for gSOAP v2.4:  Established security context with: /O=Grid/O=GridLab/CN=litchi.zib.de
GSI plugin for gSOAP v2.4:  Established security context with: /O=Grid/O=GridLab/CN=litchi.zib.de
INFO (CGAT): Announcing output directory ...
INFO (CGAT): <-- gsiftp://rage1.man.poznan.pl/mnt/shared/people/glab040/output/
INFO (CGAT): --> /home/glab040/cactus/output




Remote execution using GRMS


  • Requirements

On the server side, the GridLab Resource Management Service (GRMS) must be installed on a server within your grid environment.
On the client side, you need the command line client from the GRMS distribution on every host you intend to use for job submission.

Both can be downloaded as source code together with examples; follow the included instructions.
For further reading, see the GRMS User's and Administrator's Guide.

  • Job description XML file

Job submission to the resource management system is handled using a job description in XML format.

  • Command line client usage (ws_client.sh)

ws_client.sh submit  <jobDescription.xml>
ws_client.sh migrate <jobId> [<jobDescription.xml>]
ws_client.sh cancel  <jobId>
ws_client.sh info    <jobId>

  • Job submission without queuing ( non-parallel )

This example shows how to submit a job to the resource management system without using a queuing system such as PBS on the target host. Simply create a file non-parallel.xml with the following content:

<grmsjob appid = "appid" persistent="true">
    <simplejob>
        <resource>
            <hostname>skirit.ics.muni.cz</hostname>
        </resource>
        <executable type="single" count="1">
            <file name="cactus_wavetoy_serial.sh" type="in">
                <url>file:////${HOME}/demo/scripts/cactus_wavetoy_serial.sh</url>
            </file>
            <stdout>
                <url>gsiftp://peyote.aei.mpg.de/${HOME}/run.out</url>
            </stdout>
            <stderr>
                <url>gsiftp://peyote.aei.mpg.de/${HOME}/run.err</url>
            </stderr>
        </executable>
    </simplejob>
</grmsjob >


In this case, we want to execute the shell script cactus_wavetoy_serial.sh on the remote host skirit.ics.muni.cz. Its stdout and stderr are copied to peyote.aei.mpg.de after successful job execution. Submit the job by calling the command line client:

[robert@Ikarus bin]$ ./ws_client.sh submit non-parallel.xml
- Your DN: /C=US/O=National Center for Supercomputing Applications/CN=Robert Engel
- Service URL: httpg://rage1.man.poznan.pl:8543/axis/services/grms
- Job submitted successfully, jobId=1097768271008_appid_5556

In this case, the GridLab Resource Management Service is running on rage1.man.poznan.pl, a server that is part of the GridLab Testbed. If the job description in non-parallel.xml contains no errors, a job ID is returned. You can use this job ID to query the status of your job, request a migration, or cancel the job.
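
When submission is scripted, the job ID has to be picked out of the client output. The sketch below assumes the output format of the transcript above (a line containing jobId=<value>); the function name extract_job_id is hypothetical.

```shell
# Read ws_client.sh output on stdin and print the value after "jobId=".
# Lines without a job ID are ignored.
extract_job_id() {
    sed -n 's/.*jobId=\([^[:space:]]*\).*/\1/p'
}

# Intended use (commented out because it needs a working GRMS client):
#   job_id=$(./ws_client.sh submit non-parallel.xml | extract_job_id)
#   ./ws_client.sh info "$job_id"
```

Given the sample output above, the function prints 1097768271008_appid_5556.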


  • Job submission using a queuing system on the target host ( parallel )

Now we want to submit a job to a queuing system on the target host through the Resource Management System. This requires only minor changes to the job description XML file:

<grmsjob appid = "appid" persistent="true">
    <simplejob>
        <resource>
            <hostname>skirit.ics.muni.cz</hostname>
            <localrmname>pbs</localrmname>
        </resource>
        <executable type="mpi" count="4">
            <file name="cactus_physics_parallel.sh" type="in">
                <url>file:////${HOME}/demo/scripts/cactus_physics_parallel.sh</url>
            </file>
            <stdout>
                <url>gsiftp://peyote.aei.mpg.de/${HOME}/run.out</url>
            </stdout>
            <stderr>
                <url>gsiftp://peyote.aei.mpg.de/${HOME}/run.err</url>
            </stderr>
        </executable>
    </simplejob>
</grmsjob >

Note: Apart from the script name, the only differences are the executable type, which is now "mpi", and the number of nodes to run on, which is set to "4". In addition, "pbs" is specified as the queuing system.
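
Since the two job descriptions differ in only a few places, they can be produced from one template. This is a sketch built solely from the elements used in the two examples; the function name make_job_xml and its argument order are hypothetical, and the output host peyote.aei.mpg.de is simply carried over from the examples.

```shell
# Print a GRMS job description on stdout.
# Usage: make_job_xml <host> <script> <type> <count> [localrmname]
make_job_xml() {
    host=$1; script=$2; type=$3; count=$4; localrm=${5:-}
    printf '%s\n' '<grmsjob appid="appid" persistent="true">'
    printf '%s\n' '    <simplejob>' '        <resource>'
    printf '%s\n' "            <hostname>$host</hostname>"
    # The queuing system is named only when the job should go through one.
    if [ -n "$localrm" ]; then
        printf '%s\n' "            <localrmname>$localrm</localrmname>"
    fi
    printf '%s\n' '        </resource>'
    printf '%s\n' "        <executable type=\"$type\" count=\"$count\">"
    printf '%s\n' "            <file name=\"$script\" type=\"in\">"
    # ${HOME} is written literally, exactly as in the examples.
    printf '%s\n' "                <url>file:////\${HOME}/demo/scripts/$script</url>"
    printf '%s\n' '            </file>'
    printf '%s\n' '            <stdout>'
    printf '%s\n' '                <url>gsiftp://peyote.aei.mpg.de/${HOME}/run.out</url>'
    printf '%s\n' '            </stdout>'
    printf '%s\n' '            <stderr>'
    printf '%s\n' '                <url>gsiftp://peyote.aei.mpg.de/${HOME}/run.err</url>'
    printf '%s\n' '            </stderr>'
    printf '%s\n' '        </executable>' '    </simplejob>' '</grmsjob>'
}
```

For instance, make_job_xml skirit.ics.muni.cz cactus_physics_parallel.sh mpi 4 pbs > parallel.xml yields a description equivalent to the parallel example, and omitting the last argument gives the form used for the non-parallel case.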


  • Note

By default, GRMS does not process environment settings (.bash_profile, /etc/gridlab.conf, etc.). If your executable needs certain environment settings to run (for example LD_LIBRARY_PATH), use a script (such as cactus_wavetoy_serial.sh) that sets up the environment and then starts the executable.
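
The core of such a wrapper script might look like the sketch below. It is not the script shipped on the testbed: the function name run_with_gat_env is hypothetical, and the only settings it applies are the ones named in this document (/etc/gridlab.conf, LD_LIBRARY_PATH, GAT_ADAPTOR_PATH).

```shell
# Run a command with the GAT libraries on the library search path.
# The body runs in a subshell, so the caller's environment stays untouched.
run_with_gat_env() (
    # Pick up site-wide settings when the testbed file is present.
    if [ -r /etc/gridlab.conf ]; then
        . /etc/gridlab.conf
    fi
    # Prepend the GAT libraries; keep any existing search path after them.
    LD_LIBRARY_PATH=$GAT_LOCATION/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
    GAT_ADAPTOR_PATH=${GAT_ADAPTOR_PATH:-$GAT_LOCATION/lib/GAT/adaptor-list}
    export LD_LIBRARY_PATH GAT_ADAPTOR_PATH
    "$@"
)
```

A wrapper would then end with a line such as run_with_gat_env "$CACTUS_LOCATION/cactus_wavetoy_serial" wavetoy.par, where wavetoy.par stands for your own parameter file.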

The option persistent="true" keeps all your job output in a working directory on the target host. By default, the working directory used for output is deleted after the job has finished, whether successfully or not.

If you intend to use the GridLab Resource Management System on the GridLab Testbed, first check that GRMS is up and running on the target host.


Troubleshooting


  • Working Cactus CGAT examples

 
If you have access to the GridLab Testbed, you can take advantage of a system-wide Cactus installation. The current state of the testbed can be checked on the GridLab web pages.

source /etc/gridlab.conf

In $CACTUS_LOCATION ($CACTUS_DEV_LOCATION) you will find several executables that are linked against the system-wide GAT installation:

cactus_wavetoy_serial (non-MPI version of a wavetoy demo)
cactus_wavetoy_parallel (parallel version of a wavetoy demo)
cactus_physics_parallel (parallel version of a head-on black hole collision)

Run $CACTUS_LOCATION/cactus_wavetoy_serial.sh to check that:

the system-wide GAT installation in $GAT_LOCATION is working properly
Cactus with CGAT could be built and linked against the GAT
GridLab resources such as the replica service can be contacted

  • Getting Help