3.4. Framework Registration

Consider the following C++ classes and function:

class hits { ... };
class waveforms { ... };

hits find_hits(waveforms const& ws) { ... }

where the implementations of waveforms, hits, and find_hits are unspecified. Suppose a physicist would like to use the function find_hits to transform a data product labeled "Waveforms" to one labeled "GoodHits" for each spill with unlimited concurrency. This can be achieved with the following C++ registration stanza:

PHLEX_REGISTER_ALGORITHMS()   // <== Registration opener (w/o configuration object)
{
  products("GoodHits") =      // 1. Specification of output data product from find_hits
    transform(                // 2. Higher-order function
      "hit_finder",           // 3. Name assigned to HOF
      find_hits,              // 4. Algorithm/HOF operation
      concurrency::unlimited  // 5. Allowed CPU concurrency
    )
    .family(
      "Waveforms"_in("APA")   // 6. Specification of input data-product family (see text)
    );
}

The registration stanza is included in a C++ file that is compiled into a module, a compiled library that is dynamically loadable by Phlex.

A Python algorithm can be registered with its own companion C++ module or through the Python import helpers that make use of a pre-built, configurable Phlex module. For the sake of consistency and ease of understanding, the helpers have the same names and follow the same conventions as the C++ registration.

The stanza is introduced by an opener—e.g. PHLEX_REGISTER_ALGORITHMS()—followed by a registration block, a block of code between two curly braces that contains one or more registration statements. A registration statement is a programming statement that closely follows the equation described in Section 3.5 and is used to register an algorithm with the framework.

\[\ifamily{b}{\text{output}} = \text{HOF}(f_1,\ f_2,\ \dots)\ \ifamily{a}{\text{input}}\]

Specifically, in the registration stanza above, we have the following:

products(...)
  1. This is the equivalent of the output family \(\ifamily{b}{\text{output}}\), which is formed from specification(s) of the data product(s) created by the algorithm [DUNE 156]. One of the fields of the data-product specification is the data layer to which the data products will belong [DUNE 90]. Phlex does not require the output and input data layers to be the same.

transform(...)
  Fully specifying the mathematical expression \(\text{HOF}(f_1,\ f_2,\ \dots)\) requires several items:

  2. The HOF to be used,

  3. The name to assign to the configured HOF,

  4. The algorithm/HOF operator(s) to be used (i.e. \(f_1,\ f_2,\ \dots\)), and

  5. The maximum number of CPU threads the framework can use when invoking the algorithm [DUNE 152].

family(...)
  6. The specification of the input family \(\ifamily{a}{\text{input}}\) requires (a) the specification of data products that serve as input family elements [DUNE 65], and (b) the label of the data layer in which the input data products are found. In the registration code above, this is achieved by providing the expression "Waveforms"_in("APA"), which instructs the framework to create a family of waveforms that reside in APAs [1].

The set of information required by the framework for registering an algorithm largely depends on the HOF being used (see Section 3.5 for the specific interfaces). In general, however, the registration code will specify which data products are required/produced by the algorithm [DUNE 111] and the hardware resources required by the algorithm [DUNE 9]. Note that the input and output data-product specifications are matched with the corresponding types of the registered algorithm’s function signature. In other words:

  • "Waveforms" specifies a data product whose C++ type is that of the first (and, in this case, only) input parameter to find_hits (i.e. waveforms).

  • "GoodHits" specifies a data product whose C++ type is the hits return type of find_hits.

When executed, the above code creates a configured higher-order function, which serves as a node in the function-centric data-flow graph.
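As a rough illustration of what such a configured transform node does, the sketch below applies a one-parameter algorithm to each element of an input family, independent of Phlex. The member layout of waveforms and hits, the threshold logic inside find_hits, and the helper apply_transform are all assumptions invented for this example; none of them is part of the Phlex interface.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stand-ins for the unspecified classes in the text.
struct waveforms { std::vector<int> adc; };
struct hits { std::size_t count; };

// Toy algorithm (assumption): a "hit" is any sample above a fixed threshold.
hits find_hits(waveforms const& ws)
{
  auto n = std::count_if(ws.adc.begin(), ws.adc.end(), [](int v) { return v > 10; });
  return hits{static_cast<std::size_t>(n)};
}

// Hypothetical helper: applies f to every element of the input family and
// collects the results as the output family, one output per input.
template <typename F>
std::vector<hits> apply_transform(F f, std::vector<waveforms> const& input_family)
{
  std::vector<hits> output_family;
  output_family.reserve(input_family.size());
  std::transform(input_family.begin(), input_family.end(),
                 std::back_inserter(output_family), f);
  return output_family;
}
```

Calling apply_transform(find_hits, family) yields one hits object per waveforms object, which mirrors the element-wise relationship between the input and output families. The sketch runs sequentially; scheduling and concurrency are left to the framework.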

The registration block may contain any valid C++ code. For an algorithm to be executed, however, the block must contain a registration statement for it.

Important

A module must contain only one registration stanza. Note that multiple registration statements may be made in each stanza.

3.4.1. Algorithms with Multiple Input Data Products

The registration example given above in Section 3.4 creates an output family by applying a one-parameter algorithm find_hits to each element of the input family, as specified by family("Waveforms"_in("APA")). In many cases, however, the algorithm will require more than one data product. Consider another algorithm find_hits_subtract_pedestals, which forms hits by first subtracting pedestal values from the waveforms, both of which are presented to the algorithm as data products from the APA. The interface of the algorithm and its registration would look like:

class hits { ... };
class waveforms { ... };
class pedestals { ... };
hits find_hits_subtract_pedestals(waveforms const&, pedestals const&) {...}

PHLEX_REGISTER_ALGORITHMS(config)
{
  products("GoodHits") =
    transform("find_hits", find_hits_subtract_pedestals, concurrency::unlimited)
    .family("Waveforms"_in("APA"), "Pedestals"_in("APA"));
}

The elements of the input family are thus pairs of the data products labeled "Waveforms" and "Pedestals" in each APA. [2] In this case, the data cell for both data products is the same—i.e. for a given invocation of find_hits_subtract_pedestals, both data products will be associated with the same APA.

There are cases, however, where an algorithm needs to operate on data products from different data cells [DUNE 89].

Note

The number of arguments presented to the family(...) clause must match the number of input parameters to the registered algorithm. The order of the family(...) arguments also corresponds to the order of the algorithm’s input parameters.
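The correspondence between family arguments and algorithm parameters can be pictured with a small, framework-free sketch. The struct members, the toy pedestal subtraction, and the helper transform_pairs are assumptions made for illustration; the static assertions only show how a mismatch in order would surface at compile time in ordinary C++, not how Phlex performs its checks.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for the unspecified classes in the text.
struct waveforms { std::vector<int> adc; };
struct pedestals { int offset; };
struct hits { std::size_t count; };

// Toy algorithm (assumption): subtract the pedestal, then count samples above 10.
hits find_hits_subtract_pedestals(waveforms const& ws, pedestals const& ps)
{
  std::size_t n = 0;
  for (int v : ws.adc) {
    if (v - ps.offset > 10) { ++n; }
  }
  return hits{n};
}

// The argument order must follow the algorithm's parameter order.
static_assert(std::is_invocable_r_v<hits, decltype(&find_hits_subtract_pedestals),
                                    waveforms const&, pedestals const&>);
static_assert(!std::is_invocable_v<decltype(&find_hits_subtract_pedestals),
                                   pedestals const&, waveforms const&>);

// Hypothetical helper: element i of each vector belongs to the same APA, so
// the input family is the sequence of (waveforms, pedestals) pairs.
std::vector<hits> transform_pairs(std::vector<waveforms> const& ws,
                                  std::vector<pedestals> const& ps)
{
  assert(ws.size() == ps.size());
  std::vector<hits> out;
  out.reserve(ws.size());
  for (std::size_t i = 0; i < ws.size(); ++i) {
    out.push_back(find_hits_subtract_pedestals(ws[i], ps[i]));
  }
  return out;
}
```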

3.4.1.1. Data Products from Different Data Layers

Consider the operator \(\textit{make\_vertices}\) in Fig. 3.1 that requires two arguments: the \(\textit{GoodTracks}\) collection for each APA (data layer APA), and the detector \(\textit{Geometry}\) that applies for the entire job (data layer Job) [3]. This would be expressed in C++ as:

vertices make_vertices(tracks const&, geometry const&) { ... }

PHLEX_REGISTER_ALGORITHMS(config)
{
  products("Vertices") =
    transform("vertex_maker", make_vertices, concurrency::unlimited)
    .family("GoodTracks"_in("APA"), "Geometry"_in("Job"));
}

where the data layers are explicit in the family statement.

Phlex supports such use cases [DUNE 113], even if the specified data layers are unrelated to each other. For example, suppose an algorithm needed to access a data product from a Spill, and it also required a calibration offset provided from an external database table [DUNE 35]. Instead of providing a separate mechanism for handling calibration constants, a separate layer could be invented (e.g. Calibration) whose data cells correspond to intervals of validity. So long as a relation can be defined between specific Spill data cells and specific Calibration data cells, the framework can use that relation to form the input family of Spill-Calibration data-product pairs that are presented to the algorithm. How the relation between data cells is defined is referred to as data marshaling, and it is described further in the technical design (under preparation).
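One possible relation between Spill and Calibration data cells is an interval-of-validity lookup. The sketch below is only an illustration of that idea in plain C++: the alias calibration_cells, the function offset_for_spill, and the choice to key each interval by its first spill number are assumptions, and the actual marshaling mechanism is left to the technical design.

```cpp
#include <iterator>
#include <map>
#include <optional>

// Hypothetical calibration layer: each data cell is keyed by the first spill
// number of its interval of validity and holds the offset for that interval.
using calibration_cells = std::map<unsigned, double>;

// Returns the offset whose interval of validity contains `spill_number`, or
// std::nullopt if the spill precedes every known interval.
std::optional<double> offset_for_spill(calibration_cells const& cells, unsigned spill_number)
{
  auto it = cells.upper_bound(spill_number);  // first interval starting after the spill
  if (it == cells.begin()) { return std::nullopt; }
  return std::prev(it)->second;
}
```

With intervals starting at spills 100 and 200, spills 100 through 199 map to the first offset and spills from 200 onward map to the second, so each Spill data cell is paired with exactly one Calibration data cell.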

3.4.1.2. Data Products from Adjacent Data Cells

In some cases, it may be necessary to simultaneously access data products from adjacent data-product sets [DUNE 91], where adjacency is defined by the user [DUNE 92]. The notion of adjacency can be critical for (e.g.) time-windowed processing (see Section 3.5.7), where the details of the “next” time bin are needed to accurately calculate properties of the “current” time bin.

Supporting the processing of adjacent data cells is described further in the technical design (under preparation).
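Until that design is available, the idea can be pictured with a framework-free sketch in which adjacency simply means "the following element in time order". The type time_bin and the functions charge_change and process_adjacent are hypothetical names introduced for this example, and the adjacency rule shown is only one of many a user might define.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-bin data product: the charge summed within one time bin.
struct time_bin { double charge; };

// Toy algorithm (assumption): needs the "next" bin to characterize the "current" one.
double charge_change(time_bin const& current, time_bin const& next)
{
  return next.charge - current.charge;
}

// Adjacency is defined here as "the following element in time order"; the
// last bin has no successor, so it yields no result.
std::vector<double> process_adjacent(std::vector<time_bin> const& bins)
{
  std::vector<double> out;
  for (std::size_t i = 0; i + 1 < bins.size(); ++i) {
    out.push_back(charge_change(bins[i], bins[i + 1]));
  }
  return out;
}
```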

3.4.2. Accessing Configuration Information

Instead of hard-coding all pieces of registration information, it is desirable to specify a subset of such information through a program’s run-time configuration. To do this, an additional argument (e.g. config) is passed to the registration opener:

PHLEX_REGISTER_ALGORITHMS(config)
{
  auto selected_data_layer = config.get<std::string>("data_layer");

  products("GoodHits") =
    transform("hit_finder", find_hits, concurrency::unlimited)
    .family("Waveforms"_in(selected_data_layer));
}

Note

As discussed in the technical design (under preparation), the registration code will have access only to the configuration relevant to the algorithm being registered, and to certain framework-level configuration such as debug level, verbosity, or parallelization options.

Except for the specification of find_hits as the algorithm to be invoked, and transform as the HOF, all other pieces of information may be provided through the configuration.
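To make the get<T>(key) call shape concrete, the sketch below shows a minimal configuration object written in standard C++. The class simple_config, its variant-based storage, and its error behavior are assumptions for illustration only; the actual Phlex configuration type may differ in every respect other than the call shape used in the registration code above.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-in for a configuration object; only the
// get<T>(key) call shape is taken from the text.
class simple_config {
public:
  using value = std::variant<std::string, double, int>;

  explicit simple_config(std::map<std::string, value> entries)
    : entries_{std::move(entries)} {}

  template <typename T>
  T get(std::string const& key) const
  {
    auto it = entries_.find(key);
    if (it == entries_.end()) { throw std::out_of_range{"missing key: " + key}; }
    return std::get<T>(it->second);  // throws std::bad_variant_access on a type mismatch
  }

private:
  std::map<std::string, value> entries_;
};
```

Failing loudly on a missing key or a mismatched type, as done here, is one reasonable design choice because a misconfigured registration is then detected before any data are processed.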

3.4.3. Framework Dependence in Registration Code

Usually, classes like waveforms and hits and algorithms like find_hits are framework-independent (see Section 1.4). There may be scenarios, however, where dependence on framework interface is required, especially if framework-specific metadata types are used by the algorithm. In such cases, it is strongly encouraged to keep framework dependence within the module itself and, more specifically, within the registration stanza. This can often be achieved by registering closure objects that are generated by lambda expressions.

For example, suppose a physicist would like to create an algorithm find_hits_debug that reports the APA number when finding hits. By specifying a lambda expression that takes a phlex::handle<waveforms> object, the data product can be passed to the find_hits_debug function, along with the APA number from the metadata accessed through the handle:

hits find_hits_debug(waveforms const& ws, std::size_t apa_number) { ... }

PHLEX_REGISTER_ALGORITHMS(m)
{
  products("GoodHits") =
    transform(
      "hit_finder",
      [](phlex::handle<waveforms> ws) { return find_hits_debug(*ws, ws.id().number()); },
      concurrency::unlimited
    )
    .family("Waveforms"_in("APA"));
}

The lambda expression does depend on framework interface; the find_hits_debug function, however, retains its framework independence.
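The same separation can be exercised outside the framework with a mock handle. In the sketch below, mock_handle and cell_id are hypothetical stand-ins that imitate only the two operations used in the lambda above (dereferencing and id().number()); they are not the phlex::handle interface. The struct members and the body of find_hits_debug are likewise invented.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the unspecified classes in the text.
struct waveforms { std::vector<int> adc; };
struct hits { std::size_t count; std::size_t apa_number; };

// Framework-independent algorithm: sees only plain C++ types.
hits find_hits_debug(waveforms const& ws, std::size_t apa_number)
{
  return hits{ws.adc.size(), apa_number};
}

// Hypothetical metadata type exposing a cell number.
struct cell_id {
  std::size_t n;
  std::size_t number() const { return n; }
};

// Hypothetical stand-in for a framework handle (NOT the phlex::handle API):
// it exposes the data product via operator* and metadata via id().
template <typename T>
class mock_handle {
public:
  mock_handle(T const& product, cell_id id) : product_{&product}, id_{id} {}
  T const& operator*() const { return *product_; }
  cell_id id() const { return id_; }

private:
  T const* product_;
  cell_id id_;
};

// The adapter is the only code that touches the handle type.
auto const adapted_find_hits = [](mock_handle<waveforms> ws) {
  return find_hits_debug(*ws, ws.id().number());
};
```

Because find_hits_debug takes only a waveforms object and an integer, it can be unit-tested directly, without constructing any handle at all.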

3.4.4. Member Functions of Classes

In some cases, it may be necessary to register a class and its member functions with the framework. This is done by first creating an instance of the class by invoking make<T>(args...), where T is the user-defined type, and args... are the arguments presented to T’s constructor. For example, the find_hits algorithm author could have instead created a hit_finder class, whose constructor takes a parameter called sigma_threshold:

class hit_finder {
public:
  hit_finder(float sigma_threshold);
  hits find(waveforms const& ws) const;
  ...
};

PHLEX_REGISTER_ALGORITHMS(config)
{
  auto sigma_threshold = config.get<float>("sigma_threshold");
  auto selected_data_layer = config.get<std::string>("data_layer");

  products("GoodHits") =
    make<hit_finder>(sigma_threshold)  // <= Make framework-owned instance of hit_finder
      .transform("hit_finder", &hit_finder::find, concurrency::unlimited)
      .family("Waveforms"_in(selected_data_layer));
}

Note that the hit_finder instance created in the code above is owned by the framework. The hit_finder::find member function’s address is registered in the transform(...) clause, thus instructing the framework to invoke find, bound to the framework-owned hit_finder instance.
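What "invoke find, bound to the framework-owned instance" means can be sketched in standard C++ with std::invoke. The class bound_node below is a hypothetical owner invented for this example, and the body of hit_finder::find is a toy implementation; neither reflects how Phlex stores or schedules the instance.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the unspecified classes in the text.
struct waveforms { std::vector<float> samples; };
struct hits { std::size_t count; };

class hit_finder {
public:
  explicit hit_finder(float sigma_threshold) : threshold_{sigma_threshold} {}

  // Toy implementation (assumption): count samples above the threshold.
  hits find(waveforms const& ws) const
  {
    std::size_t n = 0;
    for (float s : ws.samples) {
      if (s > threshold_) { ++n; }
    }
    return hits{n};
  }

private:
  float threshold_;
};

// Hypothetical owner: holds the instance and a pointer to the member
// function, and invokes the member function bound to that instance.
template <typename T, typename R, typename Arg>
class bound_node {
public:
  bound_node(std::unique_ptr<T> instance, R (T::*fn)(Arg) const)
    : instance_{std::move(instance)}, fn_{fn} {}

  R operator()(Arg arg) const { return std::invoke(fn_, *instance_, arg); }

private:
  std::unique_ptr<T> instance_;
  R (T::*fn_)(Arg) const;
};
```

Since find is a const member function, a single owned instance can serve many invocations without modifying shared state, which fits the unlimited concurrency requested in the registration above.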

Note

Algorithm authors should first attempt to implement algorithms as free functions (see Section 2.4.1). Registering class instances and their member functions with the framework should only be considered when:

  • multiple processing steps must work together, relying on shared internal data, or

  • supporting legacy code that relies on object-oriented design.

3.4.5. Overloaded Functions

Phlex performs a substantial amount of type deduction through the transform(...) clause. This works well except in cases where the registered algorithms are overloaded functions. For example, suppose one wants to register C++’s overloaded std::sqrt(...) function with the framework. Simply specifying transform(..., std::sqrt) will fail at compile time as the compiler will not be able to determine which overload is desired.

Instead, the code author can use the following [4]:

transform(..., [](double x){ return std::sqrt(x); }, ...);

where the desired overload is selected based on the double argument to the lambda expression.
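The deduction problem and two ways around it can be reproduced without the framework. In the sketch below, apply_once is a hypothetical helper that deduces its callable type much as a type-deducing clause would, and half is an invented overload set. The cast-based option is shown with a user-defined function because forming pointers to standard-library functions such as std::sqrt is generally not permitted by the C++ standard.

```cpp
#include <cmath>

// Hypothetical overload set, standing in for any overloaded algorithm.
inline double half(double x) { return x / 2.0; }
inline int half(int x) { return x / 2; }

// Hypothetical helper that deduces the callable's type F from its argument.
template <typename F>
auto apply_once(F f, double x) { return f(x); }

// apply_once(half, 9.0);  // would not compile: F cannot be deduced from an overload set

// Option 1: a lambda fixes the overload through its parameter type.
inline double via_lambda(double x)
{
  return apply_once([](double v) { return std::sqrt(v); }, x);
}

// Option 2 (user-defined overloads only): select the overload with a cast.
inline double via_cast(double x)
{
  return apply_once(static_cast<double (*)(double)>(half), x);
}
```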

Footnotes

References