Rule of 2 when using an unsafe language

I learned today from the Google Security Blog that Google follows the Rule of 2 when writing code in an unsafe language (C/C++). The Rule of 2 says that you should pick no more than 2 of:

  • untrustworthy inputs;
  • unsafe implementation language; and
  • high privilege

In other words, you should “always use a safe language, a sandbox, or not be processing untrustworthy inputs in the first place”.
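
As a minimal sketch of the rule as a checklist (the function name `satisfies_rule_of_2` and its boolean arguments are made up for illustration, not taken from the Google post), a design passes as long as at most two of the three risk factors are present:

```python
def satisfies_rule_of_2(untrustworthy_inputs, unsafe_language, high_privilege):
    """Return True if at most two of the three risk factors are present."""
    risk_factors = [untrustworthy_inputs, unsafe_language, high_privilege]
    return sum(risk_factors) <= 2

# C++ parser for untrusted data, but running inside a sandbox: acceptable.
print(satisfies_rule_of_2(True, True, False))
# C++ parser for untrusted data running with high privilege: not acceptable.
print(satisfies_rule_of_2(True, True, True))
```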

rule-of-2.png

I think this is relevant not only in programming, but also in life. In this internet age, when you read something, you should only read and internalize subjects that you are familiar with (“safe language”), not spread anything that could be misinformation (“unprivileged sandbox”), or not read from untrustworthy sources in the first place.


Graph execution in TensorFlow 2

TensorFlow 2 uses eager execution by default. However, as a Keras user, when I train a NN or run predictions, TensorFlow is actually running in graph execution mode. Graph execution generally offers better performance and is easier to run in parallel. Useful documentation about graph execution can be found at the following:

Although Keras uses graphs by default, it is possible to configure it to run eagerly. See Model training APIs. It is also possible to turn off tf.function everywhere in TensorFlow. See tf.config.run_functions_eagerly.
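
The following is a small sketch of these switches, assuming TensorFlow 2.4 or later is installed. The function `square_sum` and the one-layer model are placeholders made up for illustration. `run_eagerly=True` is the `compile` argument that makes Keras run eagerly, and `tf.config.run_functions_eagerly` toggles `tf.function` globally.

```python
import tensorflow as tf

@tf.function
def square_sum(x):
    # Traced into a graph on the first call, then reused on later calls.
    return tf.reduce_sum(x * x)

x = tf.constant([1.0, 2.0, 3.0])
graph_result = square_sum(x)            # graph execution

tf.config.run_functions_eagerly(True)   # turn off tf.function everywhere
eager_result = square_sum(x)            # same function, now run eagerly
tf.config.run_functions_eagerly(False)  # restore the default behaviour

# Ask Keras to run its training and prediction steps eagerly.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
model.compile(loss='mse', optimizer='adam', run_eagerly=True)
```

Running eagerly is mostly useful for debugging, since it allows stepping through the Python code line by line at the cost of speed.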

Note that while TensorFlow 1 also used graphs, the graphs in TensorFlow 2 are very different from those in TensorFlow 1. There is no longer any session.run, feed_dict, etc. See Migrate your TensorFlow 1 code to TensorFlow 2.


Simple fully-connected NN firmware using hls4ml

hls4ml (GitHub repo) is a toolkit that implements fast neural network inference in FPGAs using High-Level Synthesis (HLS) from Vivado. It converts NN models from popular ML libraries (e.g. Keras) into an HLS C++ project, which Vivado HLS then synthesizes into VHDL or Verilog code that can be used to generate the firmware.

The following is an example to convert a simple, 3-hidden-layer fully-connected NN, built with Keras, into HLS firmware using hls4ml. It is done inside a conda environment.

First, create a new environment (tf) and install tensorflow and hls4ml:

conda create -n tf python=3.6
conda activate tf
pip install -U pip
pip install -U "tensorflow>=2.4.0"  # quoted so that the shell does not treat >= as a redirect
pip install -U git+https://github.com/fastmachinelearning/hls4ml.git@v0.4.0

Create the Keras model. It consists of 3 fully-connected hidden layers with batch normalization and tanh activation. The output is a single linear regression node. Save the model as a JSON string, and store its weights in an HDF5 file.

import tensorflow as tf

def create_model(input_shape=(40,)):
  # Create a 3-hidden-layer fully-connected NN
  model = tf.keras.Sequential()
  model.add(tf.keras.layers.InputLayer(input_shape=input_shape))
  model.add(tf.keras.layers.Dense(30, kernel_initializer='glorot_uniform', use_bias=False, activation=None, name='dense'))
  model.add(tf.keras.layers.BatchNormalization(momentum=0.99, epsilon=1e-4, name='batch_normalization'))
  model.add(tf.keras.layers.Activation('tanh', name='activation'))
  model.add(tf.keras.layers.Dense(20, kernel_initializer='glorot_uniform', use_bias=False, activation=None, name='dense_1'))
  model.add(tf.keras.layers.BatchNormalization(momentum=0.99, epsilon=1e-4, name='batch_normalization_1'))
  model.add(tf.keras.layers.Activation('tanh', name='activation_1'))
  model.add(tf.keras.layers.Dense(10, kernel_initializer='glorot_uniform', use_bias=False, activation=None, name='dense_2'))
  model.add(tf.keras.layers.BatchNormalization(momentum=0.99, epsilon=1e-4, name='batch_normalization_2'))
  model.add(tf.keras.layers.Activation('tanh', name='activation_2'))
  model.add(tf.keras.layers.Dense(1, kernel_initializer='glorot_uniform', use_bias=False, activation=None, name='dense_3'))
  model.compile(loss='mse', optimizer='adam')
  model.summary()
  return model

def save_model(model, name=None):
  # Save as model.h5, model_weights.h5, and model.json
  if name is None:
    name = model.name
  model.save(name + '.h5')
  model.save_weights(name + '_weights.h5')
  with open(name + '.json', 'w') as outfile:
    outfile.write(model.to_json())
  return

if __name__ == '__main__':
  model = create_model()
  save_model(model, name='model')
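
As a sanity check on what model.summary() should report, the parameter counts can be worked out by hand. This is a sketch under the assumptions that each bias-free Dense layer contributes inputs × units weights and that each BatchNormalization layer holds four values per unit (trainable gamma and beta, plus the non-trainable moving mean and moving variance). The helper name `count_parameters` is made up for illustration.

```python
def count_parameters(input_size=40, hidden_sizes=(30, 20, 10), output_size=1):
    """Count the parameters of the bias-free Dense + BatchNormalization stack."""
    trainable = 0
    non_trainable = 0
    fan_in = input_size
    for units in hidden_sizes:
        trainable += fan_in * units   # Dense kernel (use_bias=False)
        trainable += 2 * units        # BatchNormalization gamma and beta
        non_trainable += 2 * units    # moving mean and moving variance
        fan_in = units
    trainable += fan_in * output_size  # output Dense kernel (use_bias=False)
    return trainable, non_trainable

trainable, non_trainable = count_parameters()
print(trainable, non_trainable, trainable + non_trainable)  # 2130 120 2250
```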

Prepare a YAML config file (keras-config.yml). Specify the Xilinx FPGA part number and clock period. Make sure the filenames are correct. The documentation can be found here.

KerasJson: model.json
KerasH5: model_weights.h5
OutputDir: my-hls-test
ProjectName: myproject
XilinxPart: xc7vx690tffg1927-2
ClockPeriod: 5ns

IOType: io_parallel # options: io_serial/io_parallel
HLSConfig:
  Model:
    Precision: ap_fixed<16,6>
    ReuseFactor: 1
    Strategy: Latency  # options: Latency/Resource
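
To get a feel for what Precision: ap_fixed<16,6> means, here is a rough pure-Python sketch. It is not part of hls4ml, and the function name `quantize_ap_fixed` is made up. It assumes the usual Xilinx convention that ap_fixed<W,I> has W total bits, of which I are integer bits including the sign, and that the default modes are truncation and wrap-around on overflow.

```python
import math

def quantize_ap_fixed(x, width=16, int_bits=6):
    """Approximate how a float is stored as ap_fixed<width,int_bits>."""
    frac_bits = width - int_bits
    # Truncation: drop the bits below the resolution (rounds toward -infinity).
    scaled = math.floor(x * 2 ** frac_bits)
    # Wrap-around: keep only the lowest `width` bits as a two's-complement number.
    scaled &= (1 << width) - 1
    if scaled >= 1 << (width - 1):
        scaled -= 1 << width
    return scaled / 2 ** frac_bits

print(quantize_ap_fixed(0.5))    # exactly representable
print(quantize_ap_fixed(1 / 3))  # truncated to a multiple of 2**-10
print(quantize_ap_fixed(32.0))   # out of range, wraps around to -32.0
```

With 10 fractional bits the resolution is 2**-10, and the representable range is roughly [-32, 32). Weights or activations outside that range would need more integer bits.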

Now, feed the config file to hls4ml. It is going to generate the project directory my-hls-test and write the firmware for you.

hls4ml convert -c keras-config.yml
hls4ml build -p my-hls-test -a  # this might take a while

# Alternatively, the last step can be done in the following way.
# The command-line options are shown at the top of build_prj.tcl.
#cd my-hls-test
#vivado_hls -f build_prj.tcl "reset=1 csim=1 synth=1 cosim=0 validation=0"

The report file can be found at my-hls-test/myproject_prj/solution1/syn/report/myproject_csynth.rpt (as I specified OutputDir: my-hls-test and ProjectName: myproject). The Verilog code can be found in my-hls-test/myproject_prj/solution1/syn/verilog/. Have fun!


Multithreading in CMSSW

Documentation about how to do multithreading in the CMSSW framework can be found at the following twikis:

The rule of thumb is that EDProducers and EDFilters should probably be Stream modules, while EDAnalyzers and OutputModules should probably be Global modules. One modules basically exist as a fallback to single-threaded processing.

According to the discussion in this pull request, one can use the Clang Static Analyzer within the CMSSW framework to check for thread safety. To do that, first check out the Utilities/StaticAnalyzers package.

git cms-addpkg Utilities/StaticAnalyzers

Then, call scram b with certain environment variables.

export USER_CXXFLAGS="-DEDM_ML_DEBUG -w"
export USER_LLVM_CHECKERS="-enable-checker threadsafety -enable-checker optional.ClassChecker -enable-checker cms -disable-checker cms.FunctionDumper"
scram b -k -j $(nproc) checker

The static analyzer results can be viewed in a web browser. See also: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideStaticAnalyzer.


Moving average in Batch Normalization

In TensorFlow/Keras Batch Normalization, the exponential moving averages of the population mean and variance are calculated as follows:

moving_mean = moving_mean * momentum + batch_mean * (1 - momentum)
moving_var = moving_var * momentum + batch_var * (1 - momentum)

where momentum is a number close to 1 (default is 0.99). In the actual code, the moving averages are updated in a more efficient way:

moving_mean -= (moving_mean - batch_mean) * (1 - momentum)
moving_var -= (moving_var - batch_var) * (1 - momentum)

They are equivalent as shown below ($\mu$ is the moving mean, $\mu_{B}$ is the batch mean, $\alpha$ is the momentum):

\[\begin{align} \mu &= \alpha\mu + (1 - \alpha) \mu_{B} \\ &= \mu - (1 - \alpha) \mu + (1 - \alpha) \mu_{B} \\ &= \mu - (1 - \alpha) (\mu - \mu_{B}) \end{align}\]

Hence, each update moves the moving average toward the new batch value by the difference between the existing value and the new value, multiplied by a factor of (1 - momentum). A lower value of momentum means that older values are forgotten sooner, which results in a faster-changing moving average.
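
A quick numerical check in plain Python (not the TensorFlow code itself; the function names are made up for illustration) shows that the two update forms agree and how momentum controls how fast the moving average catches up. It assumes the moving mean starts at 0 and that every batch has a mean of 1.

```python
def ema_update(moving, batch, momentum):
    # Weighted-sum form of the update.
    return moving * momentum + batch * (1 - momentum)

def ema_update_decay(moving, batch, momentum):
    # Decay form: subtract a fraction of the difference.
    return moving - (moving - batch) * (1 - momentum)

def track(batch_values, momentum, start=0.0):
    # Apply the decay-form update once per batch.
    moving = start
    for batch in batch_values:
        moving = ema_update_decay(moving, batch, momentum)
    return moving

batches = [1.0] * 100                  # 100 batches, each with batch mean 1.0
slow = track(batches, momentum=0.99)   # equals 1 - 0.99**100, about 0.63
fast = track(batches, momentum=0.9)    # equals 1 - 0.9**100, very close to 1
print(slow, fast)
```

With the default momentum of 0.99, the moving mean has covered only about 63% of the distance to the true value after 100 batches, so short training runs can leave the moving statistics far from the batch statistics.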