This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

Releases: PanDAWMS/pilot

73.7

05 Dec 15:44

Rucio API

  • Rucio API is again used for file transfers
  • Now verifying that the replica is present at the destination after upload
  • Transfer time-outs are based on file size
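A size-based time-out can be sketched as below. This is only an illustration: the function name, the assumed minimum transfer rate, and the floor and ceiling values are hypothetical and are not taken from the pilot.

```python
def transfer_timeout(file_size_bytes,
                     min_rate_bytes_per_s=500_000,  # assumed worst-case rate
                     floor_s=600,                   # assumed fixed overhead
                     ceiling_s=6 * 3600):           # assumed upper bound
    """Return a time-out in seconds that grows with the file size."""
    # Fixed overhead plus the time the file would need at the assumed
    # worst-case rate, capped so that a huge file cannot block forever.
    estimate = floor_s + file_size_bytes // min_rate_bytes_per_s
    return min(ceiling_s, estimate)
```

With these assumed defaults a 1 GB file gets 2600 seconds, while very small files still get the 600 second floor.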

Support for file based input file lists in direct access jobs

  • LFNs are now updated to full TURLs in input file lists stored in a file (@filename in job params)
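As a rough sketch of the idea, the snippet below rewrites a list file in place. It assumes one LFN per line and an already resolved LFN-to-TURL mapping; the function name and the file layout are assumptions, not the pilot's actual format.

```python
def update_input_file_list(path, lfn_to_turl):
    """Replace each LFN in the list file with its full TURL, if known."""
    with open(path) as f:
        lines = f.read().splitlines()
    # Entries without a known TURL are left untouched.
    updated = [lfn_to_turl.get(line.strip(), line) for line in lines]
    with open(path, "w") as f:
        f.write("\n".join(updated) + "\n")
    return updated
```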

Added davs to list of allowed schemas for direct access mode

  • Previously, test jobs benchmarking the Dynafed at CERN failed because the pilot did not allow davs for direct access
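A schema check of this kind might look like the following sketch. The tuple contents are an assumption: only davs and https are named in these release notes, root is included purely for illustration, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

# Assumed allow-list; only "davs" and "https" are confirmed by the notes.
ALLOWED_DIRECT_ACCESS_SCHEMAS = ("root", "https", "davs")


def is_direct_access_allowed(turl):
    """Return True if the TURL uses a schema approved for direct access."""
    return urlsplit(turl).scheme in ALLOWED_DIRECT_ACCESS_SCHEMAS
```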

Event service updates

  • Now using different DDM protocol in case of stage-in retry
  • Fixed minor issues caused by failed event range update

Contributions from P. Nilsson, F. Berghaus, W. Guan, T. Javurek, A. Anisenkov.

73.6

19 Nov 10:07

New error code

  • ERR_BADXML = 1247, "Badly formed XML": set if PoolFileCatalog.xml contains illegal characters, as happened in a recent user job
  • Without it, such a file leads to an obscure failure, although rarely (roughly once a year)
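One way to detect such a file is to attempt a parse and map a parser error to the new code. The sketch below uses the standard library parser; the function name and return convention are hypothetical, and the pilot's actual check may differ.

```python
import xml.etree.ElementTree as ET

ERR_BADXML = 1247  # error code from the release notes


def check_catalog_xml(xml_text):
    """Return (error_code, message); (0, "") means the XML is well formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Illegal characters surface here as a parse error.
        return ERR_BADXML, "Badly formed XML: %s" % exc
    return 0, ""
```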

list_replicas() usage

  • Now always setting geoip in list_replicas() call as requested by Mario Lassnig/DDM team
  • Otherwise remote replicas are not handled correctly (the server will always reply with the external door)
  • Previously only set in direct access mode
  • Feature tested at OU and IN2P3, thanks to Horst Severini and Emmanouil Vamvakopoulos

Log file creation updates, requested by R. Walker

Scaling correction for max memory usage on UCORE, requested by R. Walker

  • Now taking schedconfig.corecount into account (job.corecount/PQ.corecount * PQ.maxrss)
  • UCORE queues have maxrss specified for MCORE, but value should be corrected for SCORE jobs
  • Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-463
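The correction is simple arithmetic, sketched below with the formula quoted above. The function name and the fallback for missing core counts are assumptions; the result is in whatever unit PQ.maxrss uses.

```python
def scaled_maxrss(job_corecount, pq_corecount, pq_maxrss):
    """Scale the queue's maxrss by job.corecount / PQ.corecount."""
    if job_corecount <= 0 or pq_corecount <= 0:
        # Assumed fallback: no scaling when a core count is missing.
        return pq_maxrss
    # Multiply first so that integer arithmetic stays exact.
    return job_corecount * pq_maxrss // pq_corecount
```

For example, a single-core job on a queue configured with corecount 8 and maxrss 16000 would be limited to 2000, while an 8-core job keeps the full 16000.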

Rucio mover update

  • The first RC versions used the API again, but testing left some questions open, so that change was postponed
  • Now sending traces with new rucio options (this change will not be necessary once the API is used again)

Support for new prodSourceLabel, requested by A. de Silva et al.

  • Pilot option -i ALRB will trigger download of job corresponding to prodSourceLabel=rc_alrb
  • For testing new ALRB asetup releases

Code updates by P. Nilsson, T. Javurek, P. Svirin.

73.5

19 Nov 10:02
  • Requested by Ivan Glushkov
  • Stopped setting PQ as default local/remoteSite in Rucio traces (now using default ddmendpoint from the job definition)
  • Requested by DDM team
  • Bug fix: Added missing string to integer conversion in os.killpg() call for orphaned processes
  • Problem introduced in previous pilot version, affected killing orphaned processes in looping jobs
  • Requested by Eygene Ryabinkin
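The class of bug is easy to reproduce: process group ids read from text (for instance from command output) are strings, while os.killpg() requires an integer. The sketch below shows the conversion; the function name, the input format, and the injectable killpg argument (used only to make the example testable) are assumptions.

```python
import os
import signal


def kill_orphan_groups(pgid_strings, killpg=os.killpg):
    """Send SIGKILL to each process group id given as a string.

    Returns the integer pgids that were signalled (POSIX only).
    """
    killed = []
    for raw in pgid_strings:
        try:
            pgid = int(raw.strip())  # os.killpg() rejects a str argument
        except ValueError:
            continue  # skip header lines or other non-numeric text
        if pgid <= 1:
            continue  # never signal our own group (0) or init's group (1)
        try:
            killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            continue  # the group is already gone
        killed.append(pgid)
    return killed
```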

73.4

04 Oct 08:16
  • Event service
    ES merge fix
    Objectstore stage-in verification

  • Replica priorities update (also done in Pilot 2)
    Optimization of priority sorting for list_replicas() output

  • Lingering orphaned processes (also done in Pilot 2)
    Some processes were found hanging even after pilot ended in looping jobs
    Process killer algorithm updated (use group kill for orphaned processes)
    Requested by P. McGuigan (UTA)

Code contributions from Alexey Anisenkov, Wen Guan, Paul Nilsson.

73.3

16 Aug 15:50
  • Improvement of the list_replicas() update in version 73.2

Fixed logic to properly consider priority of replica protocols for stage-in. The applied fix can be reverted once the Rucio server-side patch is delivered (which will return an already sorted list to the pilot)
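Client-side sorting by protocol priority could look like the sketch below. It assumes replicas are plain URLs and that the priority is an ordered list of schemas; both the function name and this data layout are illustrative, not the pilot's own structures.

```python
from urllib.parse import urlsplit


def sort_replicas_by_protocol(replica_urls, protocol_priority):
    """Order replica URLs by the position of their schema in the priority list."""
    rank = {scheme: i for i, scheme in enumerate(protocol_priority)}

    def key(url):
        # Schemas missing from the priority list sort last.
        return rank.get(urlsplit(url).scheme, len(rank))

    # sorted() is stable, so replicas with equal priority keep server order.
    return sorted(replica_urls, key=key)
```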

  • Minor update to rucio mover stage-in client

Replacing download() function with download_pfns() / download_dids()
Change discussed in Rucio ticket rucio/rucio#1378

  • Too large log files

JIRA ticket https://its.cern.ch/jira/browse/ADCSUPPORT-5078 reported that log files are once again too big, this time due to a new subdirectory, _joproxy15. The pilot now removes it as well.

Code contributions from Alexey Anisenkov, Tomas Javurek, Paul Nilsson.

73.2

09 Aug 14:53

list_replicas()

  • Both the pilot and the rucio server were calling list_replicas(), which is unnecessary. With the ongoing migration to rucio as sitemover, this has increased the load on the rucio servers, since rucio calls list_replicas() for each input file download. As a quick fix, the pilot now uses the --pfn option with rucio download, which prevents rucio from also calling list_replicas(). This bypasses useful features, including fallbacks, so a better solution is being implemented on the rucio side (adding a locally cached metadata file). It will require another pilot update within the next couple of weeks.

Wrong error message

  • The error message "Payload exceeded max allowed memory" was being overwritten by the error message for a kill signal, so the latter ended up on the monitor page for the failed job. This should now be fixed. Reported by R. Walker

73.1

09 Aug 14:54

A new pilot version has been released with a minor update:
In debug mode, the pilot now scans for the latest updated payload log file and sends its tail with each heartbeat (every five minutes). Requested by R. Walker.
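The scan-and-tail step might be sketched as follows. The glob pattern, the number of lines, and the function name are assumptions; how the tail is attached to the heartbeat is not shown.

```python
import glob
import os
from collections import deque


def tail_of_latest_log(workdir, pattern="*.log", n_lines=20):
    """Return (path, tail) for the most recently modified matching file."""
    candidates = glob.glob(os.path.join(workdir, pattern))
    if not candidates:
        return None, ""
    latest = max(candidates, key=os.path.getmtime)
    with open(latest, errors="replace") as f:
        # A bounded deque keeps only the last n_lines while streaming.
        tail = deque(f, maxlen=n_lines)
    return latest, "".join(tail)
```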

73.0

09 Aug 14:54

Containers

  • No formal container development in Pilot 1 - all container testing is done with Pilot 2
  • The new pilot instruction arriving with the job parameters (--containerimage) is removed from the job parameters if present (i.e. it is only acted on in Pilot 2)
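Stripping the instruction could be done with a small regular expression, as in the sketch below. The argument syntax is an assumption: the notes do not show how the image is passed, so both "--containerimage VALUE" and "--containerimage=VALUE" are accepted here. The function name is hypothetical.

```python
import re

# Assumed syntax: the option followed by "=" or whitespace and one value.
_CONTAINER_OPT = re.compile(r"(?:^|\s+)--containerimage(?:=|\s+)\S+")


def strip_container_image(job_params):
    """Remove the --containerimage instruction from the job parameters."""
    return _CONTAINER_OPT.sub("", job_params).strip()
```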

Pilot timing

  • Added on-the-fly measurement of CPU consumption time
  • Pilot now reports this timing in job updates

Tracing

  • Removed any present escape characters from stateReason
  • Now reporting localSite properly in traces

Google updates

  • Added https:// as approved protocol for direct access
  • Added escape character for &, needed for turls

LAPP debugging

  • Added detailed rucio output to log

Event service

  • Now using killpg instead of kill, to include child processes in time-outs
  • Now allowing ES merge jobs to select closest inputs
  • On-the-fly CPU consumption time also reported for ES jobs
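The difference between kill and killpg matters when the timed-out process has spawned children of its own. A minimal POSIX-only sketch, with a hypothetical helper name and not the event service code itself:

```python
import os
import signal
import subprocess
import sys


def run_with_group_timeout(cmd, timeout):
    """Run cmd; on time-out, kill its whole process group.

    Returns (exit_code, timed_out).
    """
    # start_new_session=True puts the child in its own process group,
    # so the group kill below cannot hit the caller.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout), False
    except subprocess.TimeoutExpired:
        # os.kill(proc.pid, ...) would leave grandchildren running;
        # os.killpg() signals every process in the group.
        try:
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the time-out and the kill
        return proc.wait(), True
```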

Contributions from W. Guan, M. Lassnig, N. Magini, P. Nilsson.

72.11

09 Aug 14:55

Rucio copytool update (from M. Lassnig):

  • Removed troublesome API fallback
  • Added -v option for more verbose output, requested by Stephane Jezequel

Google testing (from M. Lassnig):

  • Added https protocol to schemas used in replica resolution algorithm

Bug fix:

  • Changed logger.warning -> pUtil.tolog in detect_client_location(), reported by Javier Sanchez Martinez

72.10

09 Aug 14:55

The pilot has been updated for an issue seen (at least) at QMUL with metadata containing garbage data. Requested by R. Walker.