hadd with files on LPC EOS disk

There are a number of “don’t do this” rules when working with the FNAL LPC EOS disk. They are described here. For example, don’t merge root files directly on EOS: the EOS disk is mounted via FUSE, so heavy I/O can cause trouble. Instead, one should use the dedicated EOS or Xrootd commands.

Recently I had to merge root files (using hadd) from multiple directories on EOS, and it turned out to be not so straightforward using the EOS or Xrootd commands. So I wrote a short Python script, listed below.

#!/usr/bin/env python

directories = [
  # EOS directories containing the root files to merge
]

outfile = '/tmp/jiafu/ntuple.root'

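def call_cmd(cmd):
  # Run a command and return its stdout split on whitespace
  import shlex, subprocess
  # universal_newlines=True returns text (not bytes) in both Python 2 and 3
  p = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, universal_newlines=True)
  lines = p.communicate()[0].split()
  return lines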

def list_input_files(directories):
  all_lines = []
  for directory in directories:
    cmd = 'xrdfs root://cmseos.fnal.gov ls -u {0}'.format(directory)
    lines = call_cmd(cmd)
    lines = [line for line in lines if line.endswith('.root')]
    all_lines += lines
  return ' '.join(all_lines)

# Main
if __name__ == '__main__':
  infiles = list_input_files(directories)
  cmd = 'hadd -f {0} {1}'.format(outfile, infiles)
  lines = call_cmd(cmd)
  #print('\n'.join(lines))
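
The two steps of the script (keep only the `.root` entries, then build the hadd command) can also be sketched as small functions that are easy to check without touching EOS. This is only an illustration: the helper names `select_root_files` and `build_hadd_args` and the sample listing are hypothetical, not real files or output.

```python
def select_root_files(listing):
    """Keep only whitespace-separated entries that end with '.root'."""
    return [entry for entry in listing.split() if entry.endswith('.root')]


def build_hadd_args(outfile, infiles):
    """Build the hadd command as an argument list (no shell quoting needed)."""
    return ['hadd', '-f', outfile] + list(infiles)


# Hypothetical listing, written in the style of URL output from xrdfs
sample = """root://cmseos.fnal.gov//store/user/someone/dir1/ntuple_1.root
root://cmseos.fnal.gov//store/user/someone/dir1/log.tar.gz
root://cmseos.fnal.gov//store/user/someone/dir1/ntuple_2.root
"""

args = build_hadd_args('/tmp/ntuple.root', select_root_files(sample))
print(args)
```

Passing the list to `subprocess.check_call(args)` would run the merge without going through `shlex.split`.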

CRAB: Resubmit without the project directory

The CRAB project directory is the directory that is created when you make a new CRAB project (i.e. when you do crab submit). Sometimes you might remove the project directory too quickly, before realizing that you want to resubmit some of the jobs. But without the project directory, you cannot call crab resubmit.

If you know the “task name”, which looks like YYMMDD_HHMMSS:request_name, then it’s possible to recreate the project directory. The timestamp is the time when you call crab submit, whereas the request_name is config.General.requestName from your crab.py. If you don’t remember the task name, you can always check the Task Monitoring dashboard to find out.
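
To double-check that a task name has the YYMMDD_HHMMSS:request_name shape described above, it can be split and parsed with the standard library. This is a minimal sketch: the function `parse_task_name` and the example task name are made up for illustration and are not part of CRAB.

```python
from datetime import datetime


def parse_task_name(task_name):
    """Split 'YYMMDD_HHMMSS:request_name' into (datetime, request_name)."""
    timestamp, sep, request_name = task_name.partition(':')
    if not sep or not request_name:
        raise ValueError('expected YYMMDD_HHMMSS:request_name, got %r' % task_name)
    # strptime raises ValueError if the timestamp part is malformed
    submitted = datetime.strptime(timestamp, '%y%m%d_%H%M%S')
    return submitted, request_name


# Hypothetical task name, for illustration only
submitted, request_name = parse_task_name('180315_142530:my_request')
print(submitted, request_name)
```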

First, make an empty directory to be used as the CRAB project directory:
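
For instance, this step could be done from Python as follows. This is only a sketch: the helper name `make_project_dir` is an assumption, not part of CRAB, and any path passed to it is up to you.

```python
import os


def make_project_dir(path):
    """Create an empty directory; refuse to reuse a non-empty one."""
    if os.path.isdir(path):
        if os.listdir(path):
            raise ValueError('%s exists and is not empty' % path)
        return path
    os.makedirs(path)
    return path
```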


Then, do the following in Python:

from CRABClient.UserUtilities import config
from CRABClient.ClientUtilities import createCache

requestarea = PROJDIR
uniquerequestname = TASKNAME

host = 'cmsweb.cern.ch'
port = ''
voRole = ''
voGroup = ''
instance = 'prod'
originalConfig = config()
createCache(requestarea, host, port, uniquerequestname, voRole, voGroup, instance, originalConfig)

Please replace PROJDIR and TASKNAME in the above with the project directory and the task name, both given as Python strings.

Binary cross entropy in TensorFlow

In TensorFlow, the binary cross entropy loss function is implemented in a way that ensures stability and avoids overflow. The formulation can be found in the official doc, but it is not very easy to follow when written in pseudo-code. So I decided to type it in TeX (replacing the notation $z$ by $y$).

The logistic loss is

\[\begin{align*} \mathcal{L} &= - y \log(p) - (1 - y) \log(1-p) \\ &= - y \log(\operatorname{sigmoid}(x)) - (1 - y) \log(1-\operatorname{sigmoid}(x)) \\ &= - y \log \left(\frac{1}{1+e^{-x}} \right) - (1 - y) \log \left(1-\frac{1}{1+e^{-x}} \right) \\ &= - y \log \left(\frac{1}{1+e^{-x}} \right) - (1 - y) \log \left(\frac{e^{-x}}{1+e^{-x}} \right) \\ &= y \log({1+e^{-x}}) + (1 - y)\left[- \log(e^{-x}) + \log({1+e^{-x}}) \right] \\ &= y \log({1+e^{-x}}) + (1 - y)\left[x + \log({1+e^{-x}}) \right] \\ &= (1 - y)(x) + \log({1+e^{-x}}) \\ &= x - x \times y + \log({1+e^{-x}}) \end{align*}\]

For $x < 0$, to avoid overflow in $e^{-x}$, we reformulate the above

\[\begin{align*} \mathcal{L} &= x - x \times y + \log({1+e^{-x}}) \\ &= \log(e^{x}) - x \times y + \log({1+e^{-x}}) \\ &= - x \times y + \log(e^{x} \times ({1+e^{-x}})) \\ &= - x \times y + \log(1 + e^{x}) \end{align*}\]

Hence, to ensure stability and avoid overflow, the implementation uses this equivalent formulation

\[\begin{align*} \mathcal{L} &= \max(x,0) - x \times y + \log({1+e^{-|x|}}) \\ &= \operatorname{ReLU}(x) - x \times y + \log({1+e^{-|x|}}) \end{align*}\]

(To be clear, the last formulation combines $x - x \times y + \log({1+e^{-x}})$ for $x \geq 0$ and $- x \times y + \log(1 + e^{x})$ for $x < 0$ into a single expression.)
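
As a quick numerical check of the derivation, the two forms can be compared in plain Python. This is a sketch using only the standard `math` module, not TensorFlow's actual implementation; the function names `bce_naive` and `bce_stable` are made up here.

```python
import math


def bce_naive(x, y):
    """x - x*y + log(1 + exp(-x)); exp(-x) overflows for very negative x."""
    return x - x * y + math.log(1.0 + math.exp(-x))


def bce_stable(x, y):
    """max(x, 0) - x*y + log(1 + exp(-|x|)); the exponent is never positive."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


# The two forms agree for moderate logits
for x, y in [(2.0, 1.0), (-2.0, 0.0), (0.5, 0.0), (-3.0, 1.0)]:
    assert abs(bce_naive(x, y) - bce_stable(x, y)) < 1e-12

# For a very negative logit the naive form overflows, the stable one does not
try:
    bce_naive(-1000.0, 1.0)
except OverflowError:
    print('naive form overflowed')
print(bce_stable(-1000.0, 1.0))
```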

Python optimizations

The following links provide very useful tips to help speed up your Python code; some are useful even beyond Python:

Virtualenv issue in CMSSW_9_3_X

I ran into a strange issue related to Python virtualenv and pip in CMSSW_9_3_X, with Python version 2.7.11 and virtualenv version 15.1.0. Doing the following causes an error:

virtualenv venv
source venv/bin/activate
pip install -U pip

The error message reads:

Traceback (most recent call last):
  File "/tmp/venv/bin/pip", line 7, in <module>
    from pip._internal import main
ImportError: No module named _internal

Apparently it is due to the environment variable $PYTHONPATH not being set properly. I fixed it by patching the file venv/bin/activate. Here’s the patch file:

diff --git a/venv/bin/activate b/venv/bin/activate
index 03fa903..c104cf0 100644
--- a/venv/bin/activate
+++ b/venv/bin/activate
@@ -11,6 +11,11 @@ deactivate () {
         export PATH
         unset _OLD_VIRTUAL_PATH
+    if ! [ -z "${_OLD_PYTHONPATH+_}" ] ; then
+        export PYTHONPATH
+        unset _OLD_PYTHONPATH
+    fi
     if ! [ -z "${_OLD_VIRTUAL_PYTHONHOME+_}" ] ; then
         export PYTHONHOME
@@ -47,6 +52,10 @@ _OLD_VIRTUAL_PATH="$PATH"
 export PATH
 # unset PYTHONHOME if set
 if ! [ -z "${PYTHONHOME+_}" ] ; then

To apply, save the patch as mypatch.txt in the same directory where virtualenv venv was called. Then do:

patch -p1 < mypatch.txt

Now pip install -U pip should work.
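
To check whether an inherited $PYTHONPATH might be involved in an error like the one above, one can list the $PYTHONPATH entries that contain their own pip package. This is only a diagnostic sketch: the helper `find_pip_on_pythonpath` is hypothetical, and finding such an entry does not by itself prove it is the cause.

```python
import os


def find_pip_on_pythonpath(pythonpath):
    """Return PYTHONPATH entries that contain a 'pip' package directory."""
    hits = []
    for entry in pythonpath.split(os.pathsep):
        if entry and os.path.isdir(os.path.join(entry, 'pip')):
            hits.append(entry)
    return hits


# Inspect the current environment (prints nothing if no entry matches)
for entry in find_pip_on_pythonpath(os.environ.get('PYTHONPATH', '')):
    print(entry)
```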