
Job Submission

./ProcessAna.py --mc --playlist minerva1 --ana_tool CCPi0AnaTool --outdir /minerva/data/users/oaltinok/CCPi0/MC/minerva1

Date       | Type | PlayList | Run | Subrun | ana_tool     | Other | Result
02/28/2014 | MC   | minerva1 |     |        | CCPi0AnaTool |       | Removed (No --dst)

Job Submission

./ProcessAna.py --mc --playlist minerva1 --dst --ana_tool CCPi0AnaTool --outdir /minerva/data/users/oaltinok/CCPi0/MC/minerva1

Date       | Type | PlayList | Run | Subrun | ana_tool     | Other | Result
02/28/2014 | MC   | minerva1 |     |        | CCPi0AnaTool | --dst | Empty Files

  • I am guessing that canceling the previous job submission affected the current one; it is strange.

Framework Modification

  • Package List:
    • Tools/CondorUtils – Removed
    • Tools/ProductionScripts
    • Tools/SystemTests – Installed
    • Ana/CCPi0

Job Submission

./ProcessAna.py --mc --run 10206 --subrun 1 --ana_tool CCPi0AnaTool --outdir /minerva/data/users/oaltinok/CCPi0/MC/test

Date       | Type | PlayList | Run   | Subrun | ana_tool     | Other | Result
02/27/2014 | MC   |          | 10206 | 1      | CCPi0AnaTool |       | Success

v2_02

  • Static memory allocation introduced AGAIN for performance
  • Creating objects for each CutNumber significantly slows down the whole package (see the sketch below)
  • Considering converting the package to a compiled package
  • Compiling may solve the performance issues
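A minimal sketch of the idea, using a hypothetical CutNumber class (the real package internals differ): the counters are allocated once for the whole job instead of being constructed for every event.

// Sketch only: statically allocated CutNumber counters reused for every event,
// instead of constructing CutNumber objects inside the event loop.
#include <string>

struct CutNumber {              // hypothetical stand-in for the package's CutNumber
    std::string label;
    long        nEvents;
};

static CutNumber cut_all    = {"All Events", 0};   // allocated once per job
static CutNumber cut_vertex = {"Vertex Cut", 0};
static CutNumber cut_muon   = {"Muon Cut",   0};

void countEvent(bool passesVertex, bool passesMuon)
{
    ++cut_all.nEvents;          // no per-event allocation, just increments
    if (passesVertex)               ++cut_vertex.nEvents;
    if (passesVertex && passesMuon) ++cut_muon.nEvents;
}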

Manual Job Submission

Instructions from Jeremy:

Remove Jobs

Anybody who can log in as nearonline@mnvonlinelogger.fnal.gov can remove the jobs from the queue.  You’d do:

[nearonline@mnvonlinelogger ~]$ . scripts/setup_nearline_software.sh

followed by:

[nearonline@mnvonlinelogger scripts]$ condor_rm <job_id>

or if you want to get rid of them all in one go:

[nearonline@mnvonlinelogger scripts]$ condor_rm nearonline

 

Manually Submit Jobs

Ages ago I wrote a script for manual DST submission, which I haven’t tried in forever, but I think should still work:

[nearonline@mnvonlinelogger ~]$ scripts/manual_dst_submit.py

Usage: manual_dst_submit.py -r <run> -s <subrun>

Options:
  -h, --help            show this help message and exit
  -r RUN, --run=RUN     Run number
  -s SUBRUN, --subrun=SUBRUN
                        Subrun number

 

That can probably be used to resubmit once the options file on mnvonlinelogger has been updated and the other machines have synchronized (if you wait for the automatic synchronizations, it’ll take about an hour, or you can run the scripts/nearline_software_sync.sh on each of the worker nodes mnvnearline1-4).  But be sure that the options file on mnvonlinelogger has been updated first.
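For example, a resubmission of run 10091, subrun 30 (illustrative numbers) might look like the sketch below; the ssh approach and the relative script paths are assumptions, and the sync script may live elsewhere on the worker nodes.

# Force the software sync on each worker node instead of waiting for the automatic one
# (assumes passwordless ssh as nearonline and the sync script under scripts/):
for node in mnvnearline1 mnvnearline2 mnvnearline3 mnvnearline4; do
    ssh nearonline@${node}.fnal.gov ./scripts/nearline_software_sync.sh
done

# Then resubmit the DST job for the run/subrun in question (example numbers):
scripts/manual_dst_submit.py -r 10091 -s 30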

Nearline Setup Script

  • Modified Script uploaded to CVS
  • [mnvsoft] / AnalysisFramework / Bootstrap / setup / setupSw.sh

Procedure:

  1. Download the latest Bootstrap to your local area
    1. Check  ControlRoomTools/nearline_scripts/README
  2. Copy the locally modified script to Bootstrap/setup/setupSw.sh
  3. Do not forget to update the release to match the latest version
    1. release = "<latest_version>"
  4. cvs update -A setupSw.sh
  5. cvs commit -m "Modified for Nearline Machine Job Submission" setupSw.sh
  6. cvs tag -F "<latest_version>" setupSw.sh

Nearline Job Submission Fail

  • The modified script did not work
  • The new jobsub_tools sets itself up via setup_minerva_products.sh
  • Applied the same test to the setup_minerva_products.sh step (a sketch of the idea is given below)
  • The nearline machines therefore no longer run that script
  • After that change the jobs are NOT crashing
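A minimal sketch of the idea, not the actual framework code: the same $BASH_SOURCE check used for the Condor setup (see the Nearline Setup Script entry below) guards the sourcing of setup_minerva_products.sh, so it only runs when the setup script lives under /grid/fermiapp; the path shown is illustrative.

# Sketch only: guard setup_minerva_products.sh the same way as the Condor setup,
# so the nearline machines (where $BASH_SOURCE is not under /grid/fermiapp) skip it.
if echo "$BASH_SOURCE" | grep -q "/grid/fermiapp" ; then
    source /grid/fermiapp/minerva/setup_minerva_products.sh   # illustrative path
fi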

Nearline Setup Script

  • We needed to modify the framework setup script on the nearline machines so that it does not use minerva_jobsub.
  • For this purpose I modified the setup.sh script on mnvonlinelogger.fnal.gov at /scratch/nearonline/mirror/mnvsoft/current/setup.sh -> it corresponds to v10r7p3
  • Here is the change I made:
    • Old:

if [ -e "/grid/fermiapp/minerva/condor/setup.minerva.condor.sh" ]; then
    echo 'Setting up MINERVA batch submission using minerva_jobsub'
    source /grid/fermiapp/minerva/condor/setup.minerva.condor.sh
fi

    • New:

echo "Condor Setup Location = $BASH_SOURCE"
if echo "$BASH_SOURCE" | grep -q "/grid/fermiapp" ; then
    echo 'Setting up MINERVA batch submission using minerva_jobsub'
    source /grid/fermiapp/minerva/condor/setup.minerva.condor.sh
else
    echo 'Setting up Nearline batch submission'
fi

  • I have implemented Jeremy's suggestion: the script checks whether the string "/grid/fermiapp" appears in $BASH_SOURCE.
    • true -> the setup script is running from /grid, so run minerva_jobsub
    • false -> you are on a Nearline machine, so do nothing and set the Nearline Condor settings later

  • I have tested it, but of course the real test will come while we are taking data.

v2_01

  • Package moved under NTuple_Analysis
  • New Class: CCInclusive
    • Specific to CCInclusive NTuples
  • Class: CutNumberList
    • Implemented a linked-list data structure for CutNumbers
      • Adding a new CutNumber is much easier now.
    • Performance issue: to get a specific CutNumber, the whole list has to be searched until the corresponding CutNumber is found
      • Will be addressed in the next version
  • New Feature: channelTag
    • Edit the "const string channelTag" under Libraries/Folder_List.h
    • Each channel creates its own CutTable.txt, and plots are generated under separate folders
      • Output/TextFiles/CutTable_channelTag.txt
      • Output/Plots/channelTag/*
  • Optimized CutNumberList Performance
    • Predefined pointers for default CutNumbers
    • No need to search for Default CutNumbers
  • Class: Muon
    • Muon class derived from the Particle base class
    • Muon inherits all Particle behavior and extends it with muon-specific parameters
  • Virtual Function: set_angleMuon
    • Virtual function in the Particle base class, redefined in Muon
    • No calculation is needed for the Muon subclass; angleMuon is simply set to zero (see the sketch below)
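A minimal sketch of the Particle/Muon relationship described above; everything except the set_angleMuon behavior is an assumption (the actual classes use pointer members and carry more parameters).

// Sketch only: Muon derived from Particle, with the virtual set_angleMuon
// redefined so that no calculation is done for the muon itself.
#include <TLorentzVector.h>

class Particle {
public:
    Particle() : angleMuon(0.0) {}
    virtual ~Particle() {}
    // Generic particles compute their angle with respect to the muon's direction.
    virtual void set_angleMuon(const TLorentzVector& muonP4)
    {
        angleMuon = p4.Angle(muonP4.Vect());   // angle wrt muon in rads
    }
protected:
    TLorentzVector p4;        // 4-momentum of the particle (Px, Py, Pz, E)
    double angleMuon;         // angle wrt muon in rads
};

class Muon : public Particle {
public:
    // Redefined for the Muon subclass: the angle of the muon wrt itself is zero,
    // so no calculation is needed.
    virtual void set_angleMuon(const TLorentzVector& /*muonP4*/)
    {
        angleMuon = 0.0;
    }
};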

watchdog.sh Script on minerva-rc

  • The watchdog.sh script is installed on minerva-rc under $HOME/bin/
  • It checks for any running process named RunControl.py and opens the Run Control GUI if none is running on minerva-rc (a sketch of such a script is shown below)
  • It writes a log file, $HOME/watchdog.log, recording the date and time whenever Run Control was restarted automatically
  • crontab was edited to run the script every minute.
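A minimal sketch of what such a watchdog script could look like; the installed script may differ, and the exact Run Control launch command on minerva-rc is an assumption.

#!/bin/bash
# Sketch only: restart the Run Control GUI if no RunControl.py process is found,
# and note the restart time in $HOME/watchdog.log.

export DISPLAY=:0                       # so the GUI can open from cron (see Hints below)

if ! pgrep -f RunControl.py > /dev/null ; then
    echo "$(date): RunControl.py not running, restarting Run Control" >> "$HOME/watchdog.log"
    RunControl.py &                     # launch command assumed; use the real launcher on minerva-rc
fi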

Hints:

  • make script executable using:
    • chmod +x watchdog.sh
  • In order to edit the crontab use:
    • crontab -e
  • The crontab environment is not the same as your interactive shell environment
  • dump local environment to a temp file:
    • env > localenv.output
  • Copy all the lines in “localenv.output” to the crontab (before your command)
  • For commands that open a GUI, use:
    • export DISPLAY=:0 (or whatever value localenv.output shows)
    • Be careful with remote connections: remote-session displays are different from the local display
  • To run a command every minute write the following line in crontab:
    • * * * * * . $HOME/bin/watchdog.sh

Nearline Machines Job Submission Check

Instructions from Jeremy:

One can look at the output of the DST jobs as they are being made via a kind of nasty process (that requires you to know something about the internals of Condor).  I’ll take the most-backed-up run, 10091/30, as an example.

(1) Determine which machine the job is running on. mnvonlinelogger is the head node, so log in there, then:

[nearonline@mnvonlinelogger ~]$ condor_q

-- Submitter: mnvonlinelogger.fnal.gov : <131.225.196.22:9651?CCBID=131.225.196.22:9618#24088> : mnvonlinelogger.fnal.gov
 ID       OWNER       SUBMITTED     RUN_TIME   ST PRI SIZE   CMD
 24247.0  nearonline  1/23 22:08   0+11:17:24  R  0   732.4  MV_00010091_0030_n
 24248.0  nearonline  1/23 22:30   0+10:55:45  R  0   732.4  MV_00010091_0031_n
 ... <other stuff I snipped>

[nearonline@mnvonlinelogger ~]$ condor_q -long 24247.0
... <snip>
RemoteHost = "slot3@mnvnearline2.fnal.gov"
... <snip>

 

(2) Log into the machine where the job is running and go to the /scratch/condor/execute directory, which is where Condor output is put locally on the mnvnearline machines while the job runs:

nearonline@mnvnearline2.fnal.gov$ ls /scratch/condor/execute/
dir_13423  dir_18214  dir_25252  dir_26960  dir_31331  dir_3963  dir_8161  dir_8546

(3) There is one directory for each job; the numbers are the process ID on that machine.  I usually just look in each directory until I find the one with the output files I'm looking for.  In this case, it's 'dir_8546':

nearonline@mnvnearline2.fnal.gov$ find /scratch/condor/execute/ -name '*10091_0030*'
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_RawData.dat
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_1.out
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_1.err
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347.joblog
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_LinjcDST.root
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_DST.root
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_Histos.root

 

So the log file we want to look at is

 

/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347.joblog

 

And sure enough, when I read it, I find one of these for each gate:

 

curl_easy_perform() failed: couldn't connect to server
sleeping 2 and retrying for status: 0
curl_easy_perform() failed: server returned nothing (no headers, no data)
sleeping 1 and retrying for status: 0
curl_easy_perform() failed: server returned nothing (no headers, no data)
sleeping 15 and retrying for status: 0
curl_easy_perform() failed: couldn't connect to server
sleeping 5 and retrying for status: 0
curl_easy_perform() failed: server returned nothing (no headers, no data)
sleeping 68 and retrying for status: 0
curl_easy_perform() failed: couldn't connect to server
sleeping 62 and retrying for status: 0
curl_easy_perform() failed: couldn't connect to server
sleeping 53 and retrying for status: 0
curl_easy_perform() failed: server returned nothing (no headers, no data)
sleeping 622 and retrying for status: 0
curl_easy_perform() failed: couldn't connect to server
sleeping 448 and retrying for status: 0
Exception:HTTP error: status: 0:<A0>*p^S
TracksPlotAlg                                   WARNING
TracksPlotAlg:: fill():: 'Infinite' value is skipped from the histogram

 

This seems to imply that the database containing the POT info is not responding.

-Jeremy

Nearline Machines Job Submission Problem

Problem: The nearline machines cannot run the jobs. For run 10128, subruns 1-30 all crash; subrun 31 worked magically, the others will not work.

  • First attempts (no solution):
    • Ran dispatcher_nearline.sh
    • condor_q works
    • Used minerva-rc to hard-restart the DAQ
  • Checked the log files under mnvonlinelogger.fnal.gov
    • /scratch/nearonline/var/job_dump
    • Found that the jobs crash at the stage where jobsub_tools is loaded
    • Checked the CVS entry for Tools/CondorUtils -> it was edited 3 h earlier with new options by schellma (Heidi Schellman)
    • Added her to the e-mail conversation, and it turned out her team had updated jobsub_tools that morning (the reason the nearline machines were trying to send jobs to the Fermilab grid instead of running them locally)
    • Heidi reverted the jobsub_tools version and the nearline processing started to work
  • Future Work:
    • The MINERvA software framework setup script under /grid/fermiapp must be edited for the nearline machines
    • Modify the framework setup script so that it does not do this Condor setup on the nearline machines.
  • Suggestions
    1. Follow the procedure in the post Nearline Machines Job Submission Check to locate the log files
    2. Try to compare the log files with those from a successful job submission

v1_4 & v1_5

  • Class Optimization
    • All variables declared as pointers
    • CutNumberList  handles CutTable.txt file
  • Particle Class Variables Updated:

 

// Monte Carlo (MC) and Reconstructed (Reco)
TLorentzVector* p4;          // 4-momentum of the particle (Px, Py, Pz, E)
double* angleBeam;           // Angle wrt beam in rads
double* angleMuon;           // Angle wrt muon in rads

// MC Only
int ind;                     // Index for MC truth information

// Reco Only
double pID;                  // Particle score from reconstructed values
double trackLength;          // Track length in [mm]

Run Control Automated Notifications - Jeremy e-mail

There are basically 2 different conditions for notices, both of which come from the online monitoring machine:

  1.  Jobs finish too fast or crash.  I set a quasi-arbitrary minimum time (10s) which jobs must stay alive to not generate a warning. Usually jobs which cause warnings correspond to subruns that were cancelled or skipped by the user, so unless there are a lot of them in a row, I generally ignore these.
  2. Condor problems with the nearline Condor queue.  The queue manager will send you a note if
    1. submission fails — always a problem
    2. the queue is full — always a problem unless you know otherwise, since we have far more capacity than we typically use when all four slave machines are running
    3. there are idle jobs in the queue — usually not a problem unless (2) is true also. (Jobs sometimes take a few minutes to get started, so occasionally they show up as idle; I ignore these unless the same job shows up in multiple consecutive warnings.)

Note that activating this new email list will require restarting the run control backend, so we will probably want to wait for some beam downtime to do it.