mnvnearline{1,2,3,4} Kernel Updates

  • Ed Simmonds completed Kernel Updates on mnvnearline{1,2,3,4}.
  • Used beam downtime to our advantage and stopped runcontrol during the updates.
  • All subruns right before downtime have all gates processed:
    • 16566/19
    • 16566/20
    • 16566/21
    • 16566/22 -> 24 gates only (I stopped the run on that subrun)
  • There were no failed jobs and no need for manual job submission.
  • I checked the next subrun (16567/1) via e-Checklist and it was OK.

Documentation Update

  • Added two new documentation pages to the Minerva OPS wiki:
  • Update Nearline Software
    • https://cdcvs.fnal.gov/redmine/projects/minerva-ops/wiki/Update_Nearline_Software
  • Install a NEW Framework Version on Nearline System
    • https://cdcvs.fnal.gov/redmine/projects/minerva-ops/wiki/Install_a_NEW_Framework_Version_on_Nearline_System

 

e-Checklist nusoft works again

  • Backup e-Checklist: http://nusoft.fnal.gov/minerva/echecklist/mininfo.php 
  • Modified scripts 
    • setup_nearline_software.sh 
    • nearline_bluearc_copy.sh 
  • nusoft uses NEARLINE_BLUEARC_GMPLOTTER_AREA, which needs to be under “/minerva/app” NOT under “/minerva/data”
    • NEARLINE_BLUEARC_GMPLOTTER_AREA=/minerva/app/users/nearonline/gmbrowser
  • Changes committed to CVS.
  • Manually synchronized nearline1 with the mnvonlinelogger

GMBrowser Update – Uses all gates now

  • Previously, the GMBrowser that shifters look at used only a fraction of the gates, because the early processing stages (particularly DecodeRawEvent) were slow.
  • Now that we have a faster version of DecodeRawEvent, we modified GMBrowser to use all gates.
  • Modified the following parameters in NearlineCurrent.opts (in Tools/DaqRecv/options on mnvonlinelogger) to be 100 percent:
    • PdstlPrescaler.PercentPass          = 25;
    • LinjcPrescaler.PercentPass          = 25;
    • NumibPrescaler.PercentPass       = 20;
  • Ran the “nearline_software_sync.sh” script on all nearline machines to pick up the update:
    • mnvnearline1
    • mnvnearline2
    • mnvnearline3
    • mnvnearline4
  • Informed the current shifter about the update and started GMBrowser at the Tufts UROC.
    • Will monitor the behavior for some time before making this change permanent.
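
The prescaler edit described in this entry could be scripted roughly as follows. This is a minimal sketch, assuming the target value is 100 as the text states and that each setting sits on its own line in the `Name.PercentPass = value;` form shown in the list. The function name `set_percent_pass` is hypothetical and is not part of the nearline scripts.

```shell
# Hypothetical helper: set every <Name>Prescaler.PercentPass in an options
# file to the given percentage. Assumes one setting per line, written as
#   SomePrescaler.PercentPass = 25;
set_percent_pass() {
    local opts_file="$1"
    local percent="$2"
    # Keep the left-hand side and the spacing around '=', replace only the
    # number. A backup of the original file is left next to it as <file>.bak.
    sed -E -i.bak \
        "s/^([[:space:]]*[A-Za-z]+Prescaler\.PercentPass[[:space:]]*=[[:space:]]*)[0-9]+;/\1${percent};/" \
        "$opts_file"
}
```

Usage would be along the lines of `set_percent_pass Tools/DaqRecv/options/NearlineCurrent.opts 100`, followed by the software sync on the nearline machines. Lines that do not match the `...Prescaler.PercentPass` pattern are left untouched.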

Nearline File Management Problems

We still have problems with nearline file management, and I have listed the ones I found. These are the folders that need to be managed:

  1. Synchronize /scratch/nearonline/var/job_dump/ with /minerva/data/online_processing/swap_area/
  2. Synchronize /scratch/nearonline/var/gmplotter/plotter/ with /minerva/data/users/nearonline/gmbrowser/plotter/
  3. Synchronize /scratch/nearonline/var/gmplotter/www/ with /minerva/data/users/nearonline/gmbrowser/www/
  4. Copy Files from /scratch/nearonline/var/gmplotter/www to minerva@minerva-wbm.fnal.gov:/opt/if-wbm/htdoc/minerva/echecklist/gmb_hists

Here is the status of each item:

1) I modified the script to use the rsync command to sync /scratch/nearonline/var/job_dump/ with /minerva/data/online_processing/swap_area/. For now, synchronization between the two folders is stable; however, this method also copies the .log files, which is unnecessary.

2,3) There is no “nearonline” user directory under /minerva/data/users, yet the setup script exports NEARLINE_BLUEARC_GMPLOTTER_AREA=/minerva/data/users/nearonline/gmbrowser anyway. There is a “nearonline” user under /minerva/app; however, we should not copy any data files to /minerva/app.

4) e-Checklist works, so I conclude that this step works. I did not check the details.

We should organize a plan to solve all the problems in nearline file management. I propose the following:

  • Let's use the rsync command for items 1, 2, and 3.
  • We need to create a folder “/minerva/data/users/nearonline” and let other systems know where we are copying the files.
  • If there is a folder I forgot to sync between nearline and bluearc, that folder also needs to be added to the script.
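
The rsync proposal for items 1, 2, and 3 might look like the sketch below. It is only an illustration under stated assumptions: the function names `sync_dir` and `sync_nearline_areas` and the `DRY_RUN` switch are hypothetical, the destination folders for items 2 and 3 assume that /minerva/data/users/nearonline has been created, and the `*.log` exclusion is applied to all three folders even though the unnecessary log files were only noted for job_dump.

```shell
# Hypothetical helper: mirror the contents of one directory into another.
# With DRY_RUN=1 the rsync command is printed instead of being executed.
sync_dir() {
    local src="${1%/}/"
    local dst="${2%/}/"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "rsync -a --exclude=*.log ${src} ${dst}"
    else
        # -a preserves permissions and timestamps; the trailing slashes make
        # rsync copy the directory contents; --exclude skips the .log files.
        rsync -a --exclude='*.log' "$src" "$dst"
    fi
}

# The three folder pairs from the list of folders to be managed.
sync_nearline_areas() {
    sync_dir /scratch/nearonline/var/job_dump /minerva/data/online_processing/swap_area
    sync_dir /scratch/nearonline/var/gmplotter/plotter /minerva/data/users/nearonline/gmbrowser/plotter
    sync_dir /scratch/nearonline/var/gmplotter/www /minerva/data/users/nearonline/gmbrowser/www
}
```

Running `DRY_RUN=1 sync_nearline_areas` first shows the three commands without touching BlueArc. Any additional folder that needs syncing would be one more `sync_dir` line.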

Software Update

  • mnvonlinelogger updated
  • Slave nodes will receive the update automatically
    • Updated Packages under cmtuser area
      • Tools/DaqRecv [croce_v3]
        • cvs co -r croce_v3 Tools/DaqRecv
    • Installed Packages under cmtuser area
      • Event/MinervaKernel [croce_v3]
        • This package is required for Event/MinervaEvent
        • getpack -u Event/MinervaKernel
      • Event/MinervaEvent [croce_v3]
        • cvs co -r croce_v3 Event/MinervaEvent
    • Built All Packages in the following order
      1. Tools/DaqRecv
      2. Event/MinervaKernel
      3. Event/MinervaEvent
    • Building Commands:
      • cmt config
      • cmt make
      • source setup.sh

Problem after restarting Nearline Machines

Problem:

The automated “nearline_bluearc_copy.sh” script on mnvnearline1 fails to copy necessary files from local_dump_area to online_processing/swap_area
(from /scratch/nearonline/var/job_dump to /minerva/data/online_processing/swap_area)

Investigations:

  • Investigating mnvnearline1:scripts/nearline_bluearc_copy.sh
  • Script runs automatically every 5 minutes.
  • Log file for the script: /scratch/nearonline/var/logs
  • Local copy from HEAD to the following folders WORKS
    • NEARLINE_DUMP_AREA /scratch/nearonline/var/job_dump
    • NEARLINE_LOCAL_GMPLOTTER_LOCATION /scratch/nearonline/var/gmplotter
  • The problem is with the python script “filechecklist.py”
  • It does not generate the file list for files from the following folders:
    • $NEARLINE_DUMP_AREA
    • $NEARLINE_LOCAL_GMPLOTTER_LOCATION/plotter
  • It works for the following folder
    • $NEARLINE_LOCAL_GMPLOTTER_LOCATION/www
  • Since the python script “filechecklist.py” generates no file list, NO files are copied to the swap_area

Temporary Solution:

  • I modified the script to use the rsync command.
    • Now it synchronizes the local_dump_area and online_processing/swap_area
  • Inside the script, Jeremy notes that “using rsync for this stage incurs a lot of overhead on the BlueArc disk”; therefore, he wrote a more efficient script, “filechecklist.py”, for this task

Permanent Solution:

  • The problem is confirmed.
    • The .fileindex under /scratch/nearonline/var/job_dump was corrupted, causing “filechecklist.py” to crash for that folder
  • Running rsync manually fixed the .fileindex
  • The software sync between mnvonlinelogger and mnvnearline1 restored the nearline_bluearc_copy.sh script to the original version
  • Now everything works as before. The nearline_bluearc_copy.sh script copies the changed files to the BlueArc area using “filechecklist.py”

 

GMBrowser Problem

  • GMBrowser Live works, but shifters cannot access previous runs and subruns
    • GMBrowser -r xx -s xxx does not work
  • The files are not copied automatically to the /minerva/data/online_processing/swap_area
  • I copied the files manually:
    • Connect to the mnvnearline1 – it has the BlueArc /minerva/ mount
    • The necessary files are located in /scratch/nearonline/var/job_dump
  • This solved the issue for the files that had not been copied.
  • I checked the log file for the nearline_bluearc_copy.sh script under /scratch/nearonline/var/logs
    • After some time, the automated script seems to be working.
  • We are not taking any runs at the moment; I will check the status again tomorrow

Major Update to RunControl Software (v6r1)

  1. Killed all processes
  2. Jeremy updated the following:
    • mnvonline0.fnal.gov
    • mnvonline1.fnal.gov
    • mnvonlinelogger.fnal.gov
    • minerva-rc.fnal.gov
  3. I updated the remaining Control Room Computers
    • minerva-evd.fnal.gov
    • minerva-bm.fnal.gov
    • minerva-om-02.fnal.gov
  4. Testing the Updates
    1. Successful Test on Control Room Computers
    2. Successful Test on Rochester UROC
    3. Successful Test on Tufts UROCs
  5. I updated the UROC_sw_manager.py script and notified UROC Users

Fermilab Power Failure

  • On Sunday at 03:30 am, there was a power failure affecting the MINOS and MINERvA underground machines
  • Control Room Computers lost the network mount to /minerva/data/
    • GMBrowser needs /minerva/data mounted, so it was not working
    • minerva-evd is used by UROCs to mount /minerva/data, so they were also affected.
    • Carrie opened a service ticket asking Computer Division for help with the Control Room Computers
    • Computer Division resolved the incident, and all machines and UROCs are working properly.
  • The mnvonlinebck1.fnal.gov machine is still down, and we have no access to Veto Wall HV Monitoring.
  • e-Checklist can be used from either of the following servers: (minerva-wbm was down due to the power glitch)
    • http://minerva-wbm.fnal.gov/minerva/echecklist/mininfo.php
    • http://nusoft.fnal.gov/minerva/echecklist/mininfo.php

Nearonline Dispatcher Crash

  • Yesterday around 15:00, the nearonline dispatcher crashed, leaving no raw data under: /scratch/nearonline/var/job_dump/
  • Manual job submission via manual_dst_submit.py did not work. (No raw data)
  • Jeremy manually copied the raw data with the following command and manually submitted the missing jobs:
    • cd /scratch/nearonline/var/job_dump/
    • for ((i=22;i<=32;i++)); do cp /mnvonline0/work/data/rawdata/MV_00010504_00${i}_*_RawData.dat .; done

Nearline: Warning Messages

  • The nearline machines use their own installation, called “frozen”
  • Changes to the Framework do not affect them.
  • To use a modified version of a package, install that specific package in the cmtuser area and build it.
  • I first modified the “frozen” installation by updating it. However, that is WRONG.
  • I reverted the update and installed Sim/GiGaCnv in the cmtuser area on mnvonlinelogger.
  • I updated Sim/GiGaCnv to the “no_magneticField_warning” version to get the latest modifications to the package.
  • Now the nearline machines use the local version of Sim/GiGaCnv, and this local version has been updated so that it does not generate the warning messages.

Nearline: Warning Messages

  • Successfully ran a local test job using Minerba’s instructions
  • Test job ran on file: /minerva/data2/rawdata/minerva/raw/numib/00/00/61/35/MV_00006135_0016_numib_v09_1310191557_RawData.dat
  • Warning messages are displayed between two processes:
    • BuildRawEventAlg.initialize()
    • BuildRawEventAlg.execute()
  • Asked Minerba for the file locations of
    • GiGaGeo
    • EventLoopMgr

Nearline: Warning Messages

Warning Messages:

GiGaGeo                      WARNING world():: Magnetic Field is not requested to be loaded

EventLoopMgr           WARNING Unable to locate service “EventSelector”

EventLoopMgr           WARNING No events will be processed from external input.

 

Instructions from Minerba to Run a Local Test Job

Get the packages: in v10r7p3

Tools/DaqRecv

Tools/SystemTests

Ana/Histogramatron

To run one sample of the histograms from nearline, edit the option file at:

DaqRecv/options/Nearline.opts

You need to change the calibration files and the raw data file you want to test.

You can look at one file I used for testing:

/minerva/data/users/betan009/Minerva_v10r6p13/Tools/DaqRecv/options/Nearline_test.opts

I would start by looking at the algorithms that the Nearline runs and see which one uses the magnetic field.

Please let me know if you have any questions.

Manual Job Submission

Instructions from Jeremy:

Remove Jobs

Anybody who can log in as nearonline@mnvonlinelogger.fnal.gov can remove the jobs from the queue.  You’d do:

[nearonline@mnvonlinelogger ~]$ . scripts/setup_nearline_software.sh

followed by:

[nearonline@mnvonlinelogger scripts]$ condor_rm <job_id>

or if you want to get rid of them all in one go:

[nearonline@mnvonlinelogger scripts]$ condor_rm nearonline

 

Manually Submit Jobs

Ages ago I wrote a script for manual DST submission, which I haven’t tried in forever, but I think should still work:

[nearonline@mnvonlinelogger ~]$ scripts/manual_dst_submit.py

Usage: manual_dst_submit.py -r <run> -s <subrun>

Options:
  -h, --help            show this help message and exit
  -r RUN, --run=RUN     Run number
  -s SUBRUN, --subrun=SUBRUN
                        Subrun number

 

That can probably be used to resubmit once the options file on mnvonlinelogger has been updated and the other machines have synchronized (if you wait for the automatic synchronizations, it’ll take about an hour, or you can run the scripts/nearline_software_sync.sh on each of the worker nodes mnvnearline1-4).  But be sure that the options file on mnvonlinelogger has been updated first.
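
When several consecutive subruns need to be resubmitted, the script can be called in a loop. The sketch below is a hypothetical wrapper (`resubmit_subruns` and `DRY_RUN` are not part of the nearline scripts). It assumes it is run from the nearonline home directory on mnvonlinelogger after sourcing scripts/setup_nearline_software.sh, and that the raw data for each subrun is already present in the job dump area.

```shell
# Hypothetical wrapper: resubmit subruns <first>..<last> of one run using
# scripts/manual_dst_submit.py. With DRY_RUN=1 the commands are only printed.
resubmit_subruns() {
    local run="$1"
    local first="$2"
    local last="$3"
    local subrun="$first"
    while [ "$subrun" -le "$last" ]; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scripts/manual_dst_submit.py -r ${run} -s ${subrun}"
        else
            # Stop at the first failed submission so it can be investigated.
            scripts/manual_dst_submit.py -r "$run" -s "$subrun" || return 1
        fi
        subrun=$((subrun + 1))
    done
}
```

For the dispatcher crash described earlier, `DRY_RUN=1 resubmit_subruns 10504 22 32` would list the eleven submissions for run 10504 before anything is sent to the queue.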