GMBrowser Update – Uses all gates now

  • Previously, the GMBrowser that shifters look at used only a fraction of the gates, because the early processing stages (particularly DecodeRawEvent) were slow.
  • Now that we have a faster version of DecodeRawEvent, we have modified GMBrowser to use all gates.
  • Modified the following parameters in NearlineCurrent.opts (Tools/DaqRecv/options on mnvonlinelogger) to be 100 percent; the old values are shown below, and the new file contents are sketched after this list:
    • PdstlPrescaler.PercentPass          = 25;
    • LinjcPrescaler.PercentPass          = 25;
    • NumibPrescaler.PercentPass       = 20;
  • Ran the “nearline_software_sync.sh” script on all Nearline Machines to get the update:
    • mnvnearline1
    • mnvnearline2
    • mnvnearline3
    • mnvnearline4
  • Informed Current Shifter about the update and started GMBrowser at Tufts UROC
    • We will monitor the behavior for some time before making this change permanent.
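For reference, a minimal sketch of what the modified prescaler block in NearlineCurrent.opts looks like after the change (only these three lines are shown; the rest of the options file is assumed unchanged):

    // Tools/DaqRecv/options/NearlineCurrent.opts on mnvonlinelogger
    // previously 25 / 25 / 20 percent; now every gate is passed to the nearline processing
    PdstlPrescaler.PercentPass = 100;
    LinjcPrescaler.PercentPass = 100;
    NumibPrescaler.PercentPass = 100;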

Nearline File Management Problems

We still have problems with nearline file management, and I have listed the ones I found. Here is the list of folders that need to be managed.

  1. Synchronize /scratch/nearonline/var/job_dump/ with /minerva/data/online_processing/swap_area/
  2. Synchronize /scratch/nearonline/var/gmplotter/plotter/ with /minerva/data/users/nearonline/gmbrowser/plotter/
  3. Synchronize /scratch/nearonline/var/gmplotter/www/ with /minerva/data/users/nearonline/gmbrowser/www/
  4. Copy Files from /scratch/nearonline/var/gmplotter/www to minerva@minerva-wbm.fnal.gov:/opt/if-wbm/htdoc/minerva/echecklist/gmb_hists

Here is the status of each item:

1) I modified the script to use the rsync command to synchronize /scratch/nearonline/var/job_dump/ with /minerva/data/online_processing/swap_area/. For now, we have a stable synchronization between the two folders; however, this method also copies the .log files, which is unnecessary.

2, 3) There is no “nearonline” user directory under /minerva/data/users. The setup script assigns the following: export NEARLINE_BLUEARC_GMPLOTTER_AREA=/minerva/data/users/nearonline/gmbrowser. There is a “nearonline” user under /minerva/app; however, we should not copy any data files to /minerva/app.

4) The e-Checklist works, so I conclude this item works. I did not check the details.

We should organize a plan to solve all the problems in nearline file management. I propose the following:

  • Let’s use the rsync command for items 1, 2, and 3 (a sketch is given after this list).
  • We need to create a folder “/minerva/data/users/nearonline” and let other systems know where we are copying the files.
  • If there is a folder I forgot to sync between nearline and bluearc, that folder also needs to be added to the script.
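A minimal sketch of what the rsync-based synchronization for items 1, 2, and 3 could look like (the paths are taken from the list above; the --exclude pattern addresses the unnecessary .log copies noted under item 1, and the exact script these lines would live in is an assumption):

    # sketch: sync the three nearline areas to BlueArc, skipping .log files for the dump area
    rsync -av --exclude='*.log' /scratch/nearonline/var/job_dump/ /minerva/data/online_processing/swap_area/
    rsync -av /scratch/nearonline/var/gmplotter/plotter/ /minerva/data/users/nearonline/gmbrowser/plotter/
    rsync -av /scratch/nearonline/var/gmplotter/www/     /minerva/data/users/nearonline/gmbrowser/www/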

Software Update

  • mnvonlinelogger updated
  • Slave Nodes will receive the update automatically
    • Updated Packages under cmtuser area
      • Tools/DaqRecv [croce_v3]
        • cvs co -r croce_v3 Tools/DaqRecv
    • Installed Packages under cmtuser area
      • Event/MinervaKernel [croce_v3]
        • This package is required by Event/MinervaEvent
        • getpack -u Event/MinervaKernel
      • Event/MinervaEvent [croce_v3]
        • cvs co -r croce_v3 Event/MinervaEvent
    • Built All Packages in the following order
      1. Tools/DaqRecv
      2. Event/MinervaKernel
      3. Event/MinervaEvent
    • Building Commands (a combined checkout-and-build sketch follows this list):
      • cmt config
      • cmt make
      • source setup.sh
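A combined sketch of the checkout-and-build sequence above, run from the cmtuser area on mnvonlinelogger (the cmtuser location itself is not spelled out above, so treat the working directory as an assumption):

    # sketch: check out the croce_v3 packages into the cmtuser area
    cvs co -r croce_v3 Tools/DaqRecv
    getpack -u Event/MinervaKernel        # croce_v3, required by Event/MinervaEvent
    cvs co -r croce_v3 Event/MinervaEvent

    # build in the order listed above, using the same commands in each package's cmt directory
    cd Tools/DaqRecv/cmt       && cmt config && cmt make && source setup.sh && cd -
    cd Event/MinervaKernel/cmt && cmt config && cmt make && source setup.sh && cd -
    cd Event/MinervaEvent/cmt  && cmt config && cmt make && source setup.sh && cd -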

Problem after restarting Nearline Machines

Problem:

The automated “nearline_bluearc_copy.sh” script on mnvnearline1 fails to copy necessary files from local_dump_area to online_processing/swap_area
(from /scratch/nearonline/var/job_dump to /minerva/data/online_processing/swap_area)

Investigations:

  • Investigating mnvnearline1:scripts/nearline_bluearc_copy.sh
  • Script runs automatically every 5 minutes.
  • Log file for the script: /scratch/var/nearonline/logs
  • Local copy from HEAD to following folders WORKS
    • NEARLINE_DUMP_AREA /scratch/nearonline/var/job_dump
    • NEARLINE_LOCAL_GMPLOTTER_LOCATION /scratch/nearonline/var/gmplotter
  • The problem is with the python script “filechecklist.py”
  • It does not generate the file list for files from the following folders:
    • $NEARLINE_DUMP_AREA
    • $NEARLINE_LOCAL_GMPLOTTER_LOCATION/plotter
  • It works for the following folder
    • $NEARLINE_LOCAL_GMPLOTTER_LOCATION/www
  • Since there is no file list generated by the python script “filechecklist.py”, NO files are copied to the swap_area

Temporary Solution:

  • I modified the script to use rsync command.
    • Now it synchronizes the local_dump_area and online_processing/swap_area
  • Inside the script, Jeremy notes that “using rsync for this stage incurs a lot of overhead on the BlueArc disk”; therefore, he wrote the more efficient “filechecklist.py” script for this task.

 Permanent Solution:

  • The Problem is confirmed.
    • The .fileindex under /scratch/nearonline/var/job_dump got corrupted, causing “filechecklist.py” to crash for that folder
  • Using rsync manually fixed the .fileindex
  • Software sync between mnvonlinelogger and mnvnearline1 updates the nearline_bluearc_copy.sh script to the original version
  • Now everything works as before. The nearline_bluearc_copy.sh script copies the changed files to the BlueArc area using “filechecklist.py”.

 

GMBrowser Problem

  • GMBrowser Live works, but the shifter cannot access previous runs and subruns
    • GMBrowser -r xx -s xxx does not work
  • The files are not copied automatically to /minerva/data/online_processing/swap_area
  • I copied the files manually (see the sketch after this list):
    • Connect to mnvnearline1 – it has the BlueArc /minerva/ mount
    • The necessary files are located in /scratch/nearonline/var/job_dump
  • This solved the issue for the files that had not been copied.
  • I checked the log file for nearline_bluearc_copy.sh script under /scratch/nearonline/var/logs
    • After some time, the automatic script seems to be working.
  • Currently we are not taking any runs; I will check the status again tomorrow.
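A minimal sketch of the manual copy, run on mnvnearline1 (the run/subrun pattern is illustrative; substitute the run and subrun the shifter needs):

    # on mnvnearline1, which has the BlueArc /minerva mount
    # copy the missing files for one run/subrun (pattern is an example, not the actual file names)
    cp /scratch/nearonline/var/job_dump/MV_00010091_0030_* /minerva/data/online_processing/swap_area/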

Major Update to RunControl Software (v6r1)

  1. Killed all processes
  2. Jeremy updated the following machines:
    • mnvonline0.fnal.gov
    • mnvonline1.fnal.gov
    • mnvonlinelogger.fnal.gov
    • minerva-rc.fnal.gov
  3. I updated the remaining Control Room Computers
    • minerva-evd.fnal.gov
    • minerva-bm.fnal.gov
    • minerva-om-02.fnal.gov
  4. Testing the Updates
    1. Successful Test on Control Room Computers
    2. Successful Test on Rochester UROC
    3. Successful Test on Tufts UROCs
  5. I updated the UROC_sw_manager.py script and notified UROC Users

Fermilab Power Failure

  • On Sunday at 03:30 am there was a power failure affecting the MINOS and MINERvA underground machines.
  • Control Room Computers lost network mount to /minerva/data/
    • GMBrowser needs /minerva/data mounted and it was not working
    • minerva-evd is used by UROCs to mount /minerva/data, so they were also affected.
    • Carrie opened a service ticket asking the Computer Division for help with the Control Room Computers.
    • The Computer Division resolved the incident, and all machines and UROCs are working properly.
  • mnvonlinebck1.fnal.gov machine is still down and we have no access to Veto Wall HV Monitoring.
  • The e-Checklist can be used from either of the following servers (minerva-wbm was down due to the power glitch):
    • http://minerva-wbm.fnal.gov/minerva/echecklist/mininfo.php
    • http://nusoft.fnal.gov/minerva/echecklist/mininfo.php

Nearonline Dispatcher Crash

  • Yesterday around 15:00, the nearonline dispatcher crashed, leaving no raw data under /scratch/nearonline/var/job_dump/
  • Manual job submission via manual_dst_submit.py did not work (no raw data).
  • Jeremy manually copied the raw data with the following command and then submitted the missing jobs (a resubmission sketch follows this list):
    • cd /scratch/nearonline/var/job_dump/
    • for ((i=22;i<=32;i++)); do cp /mnvonline0/work/data/rawdata/MV_00010504_00${i}_*_RawData.dat .; done
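A sketch of what the resubmission of the missing subruns could look like with manual_dst_submit.py (the exact invocation Jeremy used is not recorded here, so this is an assumption; the run number 10504 is taken from the file names above):

    # sketch: resubmit subruns 22-32 of run 10504 once the raw data is back in place
    # (manual_dst_submit.py lives in the scripts/ directory of nearonline@mnvonlinelogger)
    for ((i=22;i<=32;i++)); do
        ~/scripts/manual_dst_submit.py -r 10504 -s $i
    done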

Nearline: Warning Messages

  • The Nearline Machines use their own installation, called “frozen”.
  • Changes to the Framework do not affect them.
  • To use a modified version of a package, install that package into the cmtuser area and build it there.
  • I had modified the “frozen” installation by updating it; however, that was WRONG.
  • I reverted the update and installed Sim/GiGaCnv into the cmtuser area on mnvonlinelogger.
  • I updated Sim/GiGaCnv to the “no_magneticField_warning” version to get the latest modifications to the package.
  • Now the nearline machines use the local version of Sim/GiGaCnv, and this local version is updated so that it does not generate the warning messages (a checkout-and-build sketch follows this list).
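A minimal sketch of the checkout and build into the cmtuser area, following the same pattern as the software update above (the exact command used to fetch the “no_magneticField_warning” version of Sim/GiGaCnv is not recorded here, so this is an assumption):

    # sketch: put the tagged Sim/GiGaCnv into the cmtuser area on mnvonlinelogger and build it
    cvs co -r no_magneticField_warning Sim/GiGaCnv
    cd Sim/GiGaCnv/cmt
    cmt config
    cmt make
    source setup.sh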

Nearline: Warning Messages

  • Successfully ran a local test job using Minerba’s instructions
  • Test Job ran on file: /minerva/data2/rawdata/minerva/raw/numib/00/00/61/35/MV_00006135_0016_numib_v09_1310191557_RawData.dat
  • Warning messages displayed between two processes:
    • BuildRawEventAlg.initialize()
    • BuildRawEventAlg.execute()
  • Asked Minerba for the file locations of
    • GiGaGeo
    • EventLoopMgr

Nearline: Warning Messages

Warning Messages:

GiGaGeo              WARNING world():: Magnetic Field is not requested to be loaded
EventLoopMgr         WARNING Unable to locate service "EventSelector"
EventLoopMgr         WARNING No events will be processed from external input.

 

Instructions from Minerba to Run a Local Test Job

Get the packages (in v10r7p3):

Tools/DaqRecv

Tools/SystemTests

Ana/Histogramatron

To run one sample of the histograms from nearline, edit the option file at:

DaqRecv/options/Nearline.opts

You need to change the calibration files and the raw data file you want to test.

You can look at one file I used to test at:

/minerva/data/users/betan009/Minerva_v10r6p13/Tools/DaqRecv/options/Nearline_test.opts

I would start by looking at the algorithms that the Nearline runs and see which one uses the magnetic field.

Please let me know if you have any question.
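A minimal sketch of the checkout these instructions imply, assuming the usual getpack workflow inside a v10r7p3 user area (the exact commands Minerba used are not given, so this is an assumption):

    # sketch: fetch the three packages into a v10r7p3 cmtuser area
    getpack Tools/DaqRecv
    getpack Tools/SystemTests
    getpack Ana/Histogramatron
    # then edit Tools/DaqRecv/options/Nearline.opts to point at your calibration files
    # and the raw data file you want to test (see Nearline_test.opts above for an example)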

Manual Job Submission

Instructions from Jeremy:

Remove Jobs

Anybody who can log in as nearonline@mnvonlinelogger.fnal.gov can remove the jobs from the queue.  You’d do:

[nearonline@mnvonlinelogger ~]$ . scripts/setup_nearline_software.sh

followed by:

[nearonline@mnvonlinelogger scripts]$ condor_rm <job_id>

or if you want to get rid of them all in one go:

[nearonline@mnvonlinelogger scripts]$ condor_rm nearonline

 

Manually Submit Jobs

Ages ago I wrote a script for manual DST submission, which I haven’t tried in forever, but I think should still work:

[nearonline@mnvonlinelogger ~]$ scripts/manual_dst_submit.py

Usage: manual_dst_submit.py -r <run> -s <subrun>

Options:

-h, --help            show this help message and exit
-r RUN, --run=RUN     Run number
-s SUBRUN, --subrun=SUBRUN
                      Subrun number

 

That can probably be used to resubmit once the options file on mnvonlinelogger has been updated and the other machines have synchronized (if you wait for the automatic synchronizations, it’ll take about an hour, or you can run the scripts/nearline_software_sync.sh on each of the worker nodes mnvnearline1-4).  But be sure that the options file on mnvonlinelogger has been updated first.
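A minimal sketch of the remove-and-resubmit sequence described above (the job id and run/subrun numbers are illustrative only):

    # on nearonline@mnvonlinelogger
    . scripts/setup_nearline_software.sh
    condor_rm 24247.0                              # illustrative job id
    scripts/manual_dst_submit.py -r 10091 -s 30    # illustrative run/subrun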

Nearline Setup Script

  • Modified Script uploaded to CVS
  • [mnvsoft] / AnalysisFramework / Bootstrap / setup / setupSw.sh

Procedure:

  1. Download the latest Bootstrap to your local area
    1. Check  ControlRoomTools/nearline_scripts/README
  2. Copy the locally modified script to the Bootstrap/setup/setupSw.sh
  3. Do not forget to update the release to match the latest version
    1. release = "<latest_version>"
  4. cvs update -A setupSw.sh
  5. cvs commit -m "Modified for Nearline Machine Job Submission" setupSw.sh
  6. cvs tag -F "<latest_version>" setupSw.sh

Nearline Job Submission Fail

  • The modified script did not work.
  • The new jobsub_tools is set up via setup_minerva_products.sh.
  • Applied the same test to the setup_minerva_products.sh stage.
  • The Nearline Machines will not run that script.
  • After that solution, the jobs are NOT crashing.

Nearline Setup Script

  • We needed to modify the framework setup script on the nearline machines so that it does not use minerva_jobsub.
  • For this purpose I modified the setup.sh script located on mnvonlinelogger.fnal.gov at /scratch/nearonline/mirror/mnvsoft/current/setup.sh (it corresponds to v10r7p3).
  • Here is the change I made:
    • Old:

if [ -e "/grid/fermiapp/minerva/condor/setup.minerva.condor.sh" ]; then
    echo 'Setting up MINERVA batch submission using minerva_jobsub'
    source /grid/fermiapp/minerva/condor/setup.minerva.condor.sh
fi

New:

echo "Condor Setup Location = $BASH_SOURCE"

if echo "$BASH_SOURCE" | grep -q "/grid/fermiapp" ; then
    echo 'Setting up MINERVA batch submission using minerva_jobsub'
    source /grid/fermiapp/minerva/condor/setup.minerva.condor.sh
else
    echo 'Setting up Nearline batch submission'
fi

  • I have implemented Jeremy’s suggestion: the script checks whether the string “/grid/fermiapp” appears in $BASH_SOURCE.
    • true -> you are running the setup script from /grid, so run the minerva_jobsub setup.
    • false -> you are on a Nearline Machine, so do nothing and set the Nearline Condor settings later.

  • I have tested it, but of course the real test will come while we are taking data (a standalone check of the guard is sketched below).
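A minimal sketch of exercising the guard by hand with an illustrative path; this standalone check is my own illustration and is not part of the actual setup.sh:

    # sketch: test the /grid/fermiapp guard with an example value of $BASH_SOURCE
    TEST_SOURCE="/scratch/nearonline/mirror/mnvsoft/current/setup.sh"
    if echo "$TEST_SOURCE" | grep -q "/grid/fermiapp"; then
        echo "would set up MINERVA batch submission using minerva_jobsub"
    else
        echo "would set up Nearline batch submission"
    fi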

Nearline Machines Job Submission Check

Instructions from Jeremy:

One can look at the output of the DST jobs as they are being made via a kind of nasty process (that requires you to know something about the internals of Condor).  I’ll take the most-backed-up run, 10091/30, as an example.

(1) Determine which machine the job is running on. mnvonlinelogger is the head node, so log in there, then:

[nearonline@mnvonlinelogger ~]$ condor_q

-- Submitter: mnvonlinelogger.fnal.gov : <131.225.196.22:9651?CCBID=131.225.196.22:9618#24088> : mnvonlinelogger.fnal.gov

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
24247.0   nearonline      1/23 22:08   0+11:17:24 R  0   732.4 MV_00010091_0030_n
24248.0   nearonline      1/23 22:30   0+10:55:45 R  0   732.4 MV_00010091_0031_n
.... <other stuff I snipped>

[nearonline@mnvonlinelogger ~]$ condor_q -long 24247.0
... <snip>
RemoteHost = "slot3@mnvnearline2.fnal.gov"
... <snip>

 

(2) Log into the machine where the job is running and go to the /scratch/condor/execute directory, which is where Condor output is put locally on the mnvnearline machines while the job runs:

nearonline@mnvnearline2.fnal.gov$ ls /scratch/condor/execute/
dir_13423  dir_18214  dir_25252  dir_26960  dir_31331  dir_3963  dir_8161  dir_8546

(3) There is one directory for each job; the numbers are the process ID on that machine.  I usually just look in each directory until I find the one with the output files I’m looking for.  In this case, it’s ‘dir_8546’:

nearonline@mnvnearline2.fnal.gov$ find /scratch/condor/execute/ -name '*10091_0030*'
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_RawData.dat
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_1.out
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_1.err
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347.joblog
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_LinjcDST.root
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_DST.root
/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347_Histos.root

 

So the log file we want to look at is

 

/scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347.joblog

 

And sure enough, when I read it, I find one of these for each gate:

 

curl_easy_perform() failed: couldn't connect to server
sleeping 2 and retrying for status: 0
curl_easy_perform() failed: server returned nothing (no headers, no data)
sleeping 1 and retrying for status: 0
curl_easy_perform() failed: server returned nothing (no headers, no data)
sleeping 15 and retrying for status: 0
curl_easy_perform() failed: couldn't connect to server
sleeping 5 and retrying for status: 0
curl_easy_perform() failed: server returned nothing (no headers, no data)
sleeping 68 and retrying for status: 0
curl_easy_perform() failed: couldn't connect to server
sleeping 62 and retrying for status: 0
curl_easy_perform() failed: couldn't connect to server
sleeping 53 and retrying for status: 0
curl_easy_perform() failed: server returned nothing (no headers, no data)
sleeping 622 and retrying for status: 0
curl_easy_perform() failed: couldn't connect to server
sleeping 448 and retrying for status: 0
Exception:HTTP error: status: 0:<A0>*p^S
TracksPlotAlg                WARNING TracksPlotAlg:: fill():: 'Infinite' value is skipped from the histogram

 

This seems to imply that the database containing the POT info is not responding.

-Jeremy
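For quick reference, a minimal sketch condensing the three steps above into commands (the job id, host, and run/subrun are the illustrative values from Jeremy’s example; the grep is just one way of pulling the RemoteHost line out of the condor_q -long output):

    # step 1: on mnvonlinelogger (the head node), find the job and the machine it runs on
    condor_q
    condor_q -long 24247.0 | grep RemoteHost

    # steps 2 and 3: on that machine, locate the job's working directory and read its joblog
    ssh nearonline@mnvnearline2.fnal.gov
    find /scratch/condor/execute/ -name '*10091_0030*'
    less /scratch/condor/execute/dir_8546/MV_00010091_0030_numib_v09_1401240347.joblog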

Nearline Machines Job Submission Problem

Problem: The nearline machines cannot run the jobs. For run 10128, subruns 1-30 all crash; subrun 31 worked magically, and the others will not work.

  • First tries, without success:
    • I ran dispatcher_nearline.sh
    • condor_q works
    • Used minerva-rc to hard-restart the DAQ
  • Checked the Log Files under mnvonlinelogger.fnal.gov
    • /scratch/nearonline/var/job_dump
    • Found that the jobs crash at the stage where jobsub_tools is loaded
    • Checked the CVS entry for Tools/CondorUtils -> it was edited 3 hours ago with new options by schellma (Heidi Schellman)
    • Added her to the e-mail conversation, and it turned out that her team had updated jobsub_tools this morning (the reason why the nearline machines try to send the jobs to the Fermilab grid instead of running them themselves)
    • Heidi reverted the jobsub_tools version and the nearline processes started to work
  • Future Work:
    • The MINERvA Software Framework setup script under /grid/fermiapp must be edited for the nearline machines
    • Modify the framework setup script so as not to do this Condor setup on the nearline machines.
  • Suggestions
    1. Follow the procedure in the post “Nearline Machines Job Submission Check” to locate the log files
    2. Try to compare the log files with those from a successful job submission

Run Controls Automated Notifications – Jeremy e-mail

There are basically 2 different conditions for notices, both of which come from the online monitoring machine:

  1.  Jobs finish too fast or crash.  I set a quasi-arbitrary minimum time (10s) which jobs must stay alive to not generate a warning. Usually jobs which cause warnings correspond to subruns that were cancelled or skipped by the user, so unless there are a lot of them in a row, I generally ignore these.
  2. Condor problems with the nearline Condor queue.  The queue manager will send you a note if
    1. submission fails — always a problem
    2. the queue is full — always a problem unless you know otherwise, since we have far more capacity than we typically use when all four slave machines are running
    3. there are idle jobs in the queue — usually not a problem unless (2) is true also. (Jobs sometimes take a few minutes to get started, so occasionally they show up as idle; I ignore these unless the same job shows up in multiple consecutive warnings.)

Note that activating this new email list will require restarting the run control backend, so we will probably want to wait for some beam downtime to do it.