maniaclab/failed-jobs-checker

ATLAS Failed Job Checker

Getting credentials

Email Lincoln Bryant ([email protected]) or Ilija Vukotic ([email protected]) to request read-only credentials for the Elasticsearch instance at UChicago.

Installation / Setup

This software requires the Elasticsearch and tabulate Python packages for Python 3.6, installed either as system packages or in a virtual environment.

CentOS 7

yum install python36-elasticsearch6
yum install python36-tabulate

Virtual Environment

python3.6 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Usage

Export the ES_USER and ES_PASS environment variables, set to the credentials you requested above, before running the script.
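As a rough illustration of how a script might pick up these variables and fail early when one is missing, consider the sketch below. It is not taken from check2.py; the function name read_es_credentials is hypothetical, and only the variable names ES_USER and ES_PASS come from this README.

```python
import os


def read_es_credentials(env=os.environ):
    """Return (user, password) read from ES_USER and ES_PASS.

    Hypothetical helper: raises RuntimeError naming any variable that is
    unset or empty, so the problem is visible before any query is made.
    """
    missing = [name for name in ("ES_USER", "ES_PASS") if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variable(s): " + ", ".join(missing)
        )
    return env["ES_USER"], env["ES_PASS"]
```

Passing a plain dictionary instead of os.environ makes the helper easy to exercise without touching the real environment.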

$ ./check2.py --help
usage: FailedJobs [-h] [-s SITE] [-i IDS [IDS ...]] [-l LAST] {site,node}

Query ATLAS Analytics Elasticsearch for failed jobs

positional arguments:
  {site,node}

optional arguments:
  -h, --help            show this help message and exit
  -s SITE, --site SITE  Site, e.g. 'MWT2', 'AGLT2', 'SWT2_CPB', etc
  -i IDS [IDS ...], --ids IDS [IDS ...]
                        space separated list of PanDA job IDs
  -l LAST, --last LAST  Maximum number of hours ago. Only relevant for 'site'
                        mode
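The help text above is consistent with an argparse parser along the following lines. This is a reconstruction from the usage string only, not the actual source of check2.py; in particular, the destination name mode and the use of type=int for --last are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that matches the documented usage (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="FailedJobs",
        description="Query ATLAS Analytics Elasticsearch for failed jobs",
    )
    # Positional mode selector; "mode" as the attribute name is an assumption.
    parser.add_argument("mode", choices=["site", "node"])
    parser.add_argument(
        "-s", "--site", help="Site, e.g. 'MWT2', 'AGLT2', 'SWT2_CPB', etc"
    )
    parser.add_argument(
        "-i", "--ids", nargs="+", help="space separated list of PanDA job IDs"
    )
    # type=int is an assumption; the help text only says "hours ago".
    parser.add_argument(
        "-l",
        "--last",
        type=int,
        help="Maximum number of hours ago. Only relevant for 'site' mode",
    )
    return parser
```

With this sketch, build_parser().parse_args(["site", "-s", "MWT2", "-l", "1"]) yields a namespace whose mode, site, and last attributes hold the values given on the command line.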

Site mode

This mode reports failures across the entire site within a given number of hours before the current time (set with -l/--last).
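A time-windowed site lookup of this kind could be expressed as an Elasticsearch query body like the one sketched below. The returned fields (pandaid, batchid, modificationhost, piloterrorcode) and the jobstatus value "failed" are taken from the example output in this README; the filter field names computingsite and modificationtime are assumptions about the index mapping, and build_site_query is a hypothetical helper, not part of check2.py.

```python
def build_site_query(site, hours):
    """Return a query body for failed jobs at `site` in the last `hours` hours.

    Sketch only: "computingsite" and "modificationtime" are assumed field
    names and may not match the real index mapping.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"jobstatus": "failed"}},
                    {"term": {"computingsite": site}},  # assumed field name
                    {
                        "range": {
                            # assumed field name; "now-Nh" is ES date math
                            "modificationtime": {"gte": f"now-{hours}h"}
                        }
                    },
                ]
            }
        },
        # Columns shown in the site-mode example output.
        "_source": ["pandaid", "batchid", "modificationhost", "piloterrorcode"],
    }
```

Using filter clauses rather than scored matches keeps the query a plain yes/no selection, which suits a listing of failed jobs.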

Site mode example

$ ./check2.py site -s MWT2 -l 1
pandaid                                          batchid                                  modificationhost                                 piloterrorcode
-----------------------------------------------  ---------------------------------------  ---------------------------------------------  ----------------
https://bigpanda.cern.ch/job?pandaid=6069497346  iut2-gk02.mwt2.org#9398499.0#1704436223  [email protected]                                     0
https://bigpanda.cern.ch/job?pandaid=6069648316  uiuc-gk02.mwt2.org#1418461.0#1704446683  [email protected]                                 1305
https://bigpanda.cern.ch/job?pandaid=6069652861  uct2-gk02.mwt2.org#8772333.0#1704447294  [email protected]                                 1305
https://bigpanda.cern.ch/job?pandaid=6069287664  iut2-gk02.mwt2.org#9395973.0#1704414228  [email protected]                                     0
https://bigpanda.cern.ch/job?pandaid=6069640354  iut2-gk02.mwt2.org#9400250.0#1704445812  [email protected]              1305
https://bigpanda.cern.ch/job?pandaid=6069633343  uct2-gk02.mwt2.org#8771860.0#1704445805  [email protected]              1305
https://bigpanda.cern.ch/job?pandaid=6069625491  uiuc-gk02.mwt2.org#1418224.0#1704444378  [email protected]                                  1305
https://bigpanda.cern.ch/job?pandaid=6069608909  uct2-gk02.mwt2.org#8771359.0#1704441618  [email protected]                                1305
https://bigpanda.cern.ch/job?pandaid=6069633964  uct2-gk.mwt2.org#8739758.0#1704434949    [email protected]                                 1098

Node mode

This mode attempts to locate the HTCondor EXECUTE_DIR directory and grep the stdout of each job running at the site for its Pilot ID.

Alternatively, PanDA job IDs can be supplied with the -i or --ids flag.
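The directory scan described above might look roughly like the following. This is a sketch under several assumptions: the stdout file is taken to be named _condor_stdout inside each job's scratch directory, the identifier is taken to appear as a "PandaID=<digits>" token (a guess at the log format), and find_panda_ids is a hypothetical helper rather than the function check2.py uses.

```python
import os
import re

# Hypothetical pattern: a guess at how the ID appears in the job's stdout.
PANDA_ID_RE = re.compile(r"PandaID[=:\s]+(\d+)")


def find_panda_ids(execute_dir, stdout_name="_condor_stdout"):
    """Collect IDs found in job stdout files beneath `execute_dir`.

    Walks the directory tree, reads every file named `stdout_name`
    (an assumed file name), and returns the matched IDs as a sorted list
    of strings. Unreadable files are skipped.
    """
    ids = set()
    for root, _dirs, files in os.walk(execute_dir):
        if stdout_name not in files:
            continue
        path = os.path.join(root, stdout_name)
        try:
            with open(path, errors="replace") as handle:
                for line in handle:
                    ids.update(PANDA_ID_RE.findall(line))
        except OSError:
            # Typically a permissions problem when not run with sudo.
            continue
    return sorted(ids)
```

The IDs returned by such a scan could then be passed to the same lookup that the -i/--ids flag uses.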

Supplying IDs Example

$ ./check2.py node -i 6069633343
pandaid                                          modificationhost                                 piloterrorcode  jobstatus
-----------------------------------------------  ---------------------------------------------  ----------------  -----------
https://bigpanda.cern.ch/job?pandaid=6069633343  [email protected]              1305  failed

Node Search Example

NOTE: This mode requires sudo because it needs to read files in the directories of batch system users.

(venv) [10:01] uct2-c578.mwt2.org:~/failed-jobs-py $ sudo ./check2.py node
pandaid                                          modificationhost               piloterrorcode  jobstatus
-----------------------------------------------  ---------------------------  ----------------  -----------
https://bigpanda.cern.ch/job?pandaid=6069660793  [email protected]              1098  failed
https://bigpanda.cern.ch/job?pandaid=6069661283  [email protected]              1305  failed
https://bigpanda.cern.ch/job?pandaid=6069687372  [email protected]               1305  failed
https://bigpanda.cern.ch/job?pandaid=6069630217  [email protected]                  0  failed
https://bigpanda.cern.ch/job?pandaid=6069115751  [email protected]                  0  failed
https://bigpanda.cern.ch/job?pandaid=6069687373  [email protected]                  0  failed
https://bigpanda.cern.ch/job?pandaid=6069700489  [email protected]                  0  failed
https://bigpanda.cern.ch/job?pandaid=6069659220  [email protected]              1098  failed
https://bigpanda.cern.ch/job?pandaid=6069112758  [email protected]               1305  failed
https://bigpanda.cern.ch/job?pandaid=6069242358  [email protected]              1098  failed
https://bigpanda.cern.ch/job?pandaid=6069661442  [email protected]                 0  failed
https://bigpanda.cern.ch/job?pandaid=6069630218  [email protected]                  0  failed
https://bigpanda.cern.ch/job?pandaid=6069630219  [email protected]                 0  failed
https://bigpanda.cern.ch/job?pandaid=6069115749  [email protected]                  0  failed
https://bigpanda.cern.ch/job?pandaid=6069661445  [email protected]                 0  failed
https://bigpanda.cern.ch/job?pandaid=6069661490  [email protected]                 0  failed
https://bigpanda.cern.ch/job?pandaid=6069242357  [email protected]               1098  failed
https://bigpanda.cern.ch/job?pandaid=6069242356  [email protected]               1098  failed
https://bigpanda.cern.ch/job?pandaid=6069112757  [email protected]               1305  failed
https://bigpanda.cern.ch/job?pandaid=6069660442  [email protected]              1098  failed
https://bigpanda.cern.ch/job?pandaid=6069661446  [email protected]                 0  failed
