3.2. Data Organization
This section provides a conceptual overview of data products, data cells, data layers, and data-layer hierarchies, as defined in [DUNE 85] [DUNE 86] [DUNE 87] [DUNE 88]. In addition, we discuss data-product families and data-cell families. This section aims to establish a mental model for how all of these concepts facilitate scientific workflows without delving into implementation specifics.
Data products represent things like raw detector readouts, calibration information, and derived physics quantities [DUNE 40]. We call the things that data products represent conceptual data products. Data product types are the programming language representations of conceptual data products. A data layer is an experiment-defined level of aggregation of data products. Example data layers include Run, Subrun, Spill, and an interval of validity for some flavor of calibration. Each Phlex job includes a Job data layer at the top of the data-layer hierarchy. A data cell is a collection of data products associated with a data layer. A data-cell family is a set of data cells that are in the same data layer. The Job layer always contains a single data cell. Fig. 3.2 illustrates the relationships among these concepts.
![digraph {
fontname="Helvetica,Arial"
node [shape="plaintext" fontname="Helvetica,Arial"]
edge [arrowhead="none"]
subgraph cluster_categories {
label=<<b>Data layers</b>>
color=none
job_category [label=<Job>]
run_category [label=<Run> fontcolor="gray"]
spill_category [label=<Spill>]
apa_category [label=<APA>]
job_category -> run_category -> spill_category -> apa_category
}
node [shape="box" style="filled,rounded"]
subgraph cluster_job_family {
style="filled,rounded"
fillcolor="white"
job_family_label [label=<<i>Job family</i>> shape="plaintext" margin=0 style="rounded"]
job [label=<Job> fillcolor="lightyellow"]
}
run1 [label=<Run<sub>1</sub>> fillcolor="gray98" fontcolor="gray" color="gray"]
run2 [label=<Run<sub>2</sub>> fillcolor="gray98" fontcolor="gray" color="gray"]
subgraph cluster_spill_family {
style="filled,rounded"
fillcolor="lightcyan"
family_label_0 [label=<<i>Spill family</i>> shape="plaintext" margin=0 style="rounded"]
spill1 [label=<Spill<sub>1,1</sub>> fillcolor="lightblue"]
spill2 [label=<Spill<sub>1,2</sub>> fillcolor="lightblue"]
spill3 [label=<Spill<sub>2,1</sub>> fillcolor="lightblue"]
}
apa3 [label="..." shape="plaintext" margin=0 style="rounded"]
job -> {run1 run2}
run1 -> {spill1 spill2}
run2 -> spill3
spill3 -> apa3
node [style="rounded,filled" fillcolor="lightgreen"]
subgraph cluster_apa_family_1 {
style="filled,rounded"
fillcolor="#e6ffe6"
family_label_1 [label=<<i>APA family</i>> shape="plaintext" margin=0 style="rounded"]
apa11 [label=<APA<sub>1,1,1</sub>>]
apa1Dots [label="..." shape="plaintext" margin=0 style="rounded"]
apa1N [label=<APA<sub>1,1,<i>n</i></sub>>]
}
spill1 -> family_label_1 [style=invis]
spill1 -> apa11
spill1 -> apa1Dots [style=invis]
spill1 -> apa1N
subgraph cluster_apa_family_2 {
style="filled,rounded"
fillcolor="#e6ffe6"
apa21 [label=<APA<sub>1,2,1</sub>>]
apa2Dots [label="..." shape="plaintext" margin=0 style="rounded"]
apa2N [label=<APA<sub>1,2,<i>n</i></sub>>]
family_label_2 [label=<<i>APA family</i>> shape="plaintext" margin=0 style="rounded"]
}
spill2 -> apa21
spill2 -> apa2Dots [style=invis]
spill2 -> apa2N
spill2 -> family_label_2 [style=invis]
node [shape="plaintext" style="rounded" margin="0"]
subgraph cluster_product_sequence_1 {
style="filled,rounded"
fillcolor="#eee2ee"
sequence_label_1 [label=<<i>Waveforms family</i>> shape="plaintext" margin=0 style="rounded" fontname="Helvetica,Arial"]
t11 [label=<Waveforms<sub>1,1,1</sub>> fontsize=11.5 fontcolor="purple"]
t1Dots [label="..." shape="plaintext" margin=0 style="rounded" fontname="Helvetica,Arial"]
t1N [label=<Waveforms<sub>1,1,<i>n</i></sub>> fontsize=11.5 fontcolor="purple"]
}
edge [style=dotted]
family_label_1 -> sequence_label_1 [style=invis]
apa11 -> t11
apa1Dots -> t1Dots [style=invis]
apa1N -> t1N
subgraph cluster_product_sequence_2 {
style="filled,rounded"
fillcolor="#eee2ee"
sequence_label_2 [label=<<i>Waveforms family</i>> shape="plaintext" margin=0 style="rounded" fontname="Helvetica,Arial"]
t21 [label=<Waveforms<sub>1,2,1</sub>> fontsize=11.5 fontcolor="purple"]
t2Dots [label="..." shape="plaintext" margin=0 style="rounded" fontname="Helvetica,Arial"]
t2N [label=<Waveforms<sub>1,2,<i>n</i></sub>> fontsize=11.5 fontcolor="purple"]
}
family_label_2 -> sequence_label_2 [style=invis]
apa21 -> t21
apa2Dots -> t2Dots [style=invis]
apa2N -> t2N
}](../images/graphviz-5a6c7fa2bbb2a97c8a5b37f0fc8defa5127adf71.png)
Fig. 3.2 The data organization corresponding to part of Fig. 3.1. The framework-provided Job data layer and three different user-defined (not special to the Phlex framework) data layers are shown: Run, Spill, and APA. Rectangles with labels \(\textsf{Run}_i\), \(\textsf{Spill}_{i,j}\), and \(\textsf{APA}_{i,j,k}\) represent data cells. The pale green rectangles show two data-cell families; these are identified as families because they are the result of executing the \(\textit{unfold(into\_apas)}\) node shown in Fig. 3.1. A solid line from one data cell to another data cell represents a logical association between the two data cells. The bottom rectangle shows that \(\textsf{Waveforms}_{1,1,1}\) is in the data cell \(\textsf{APA}_{1,1,1}\), etc. Each pale purple rectangle indicates the data-product family created by unfolding each \(\textit{SimDepos}\) object as shown in Fig. 3.1.
In Fig. 3.2, the Run data layer exists, but because no algorithm in Fig. 3.1 requires any data products from a Run data cell, the framework does not create any data-cell families for the Run layer.
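The organization in Fig. 3.2 can be mimicked with a small Python model. This is only a sketch for building intuition, not the Phlex interface: the `DataCell` class, the `families` helper, and the choice to group cells by their nearest existing parent cell are all assumptions made for illustration, as is the choice of two APA cells per spill.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCell:
    """Hypothetical stand-in for a data cell (not a Phlex type)."""
    layer: str                            # name of the data layer, e.g. "Spill"
    index: tuple = ()                     # indices as in the figure, e.g. (1, 2)
    parent: Optional["DataCell"] = None   # nearest cell that actually exists


# The Job layer always holds exactly one data cell.
job = DataCell("Job")

# No Run cells are created, so each Spill cell hangs directly off the Job cell.
spills = [DataCell("Spill", idx, job) for idx in [(1, 1), (1, 2), (2, 1)]]

# Two APA cells under each of the first two spills (n = 2 is an arbitrary choice).
apas = [
    DataCell("APA", spill.index + (k,), spill)
    for spill in spills[:2]
    for k in (1, 2)
]


def families(cells, layer):
    """Group the cells of one layer by parent cell (an illustrative assumption)."""
    grouped = defaultdict(list)
    for cell in cells:
        if cell.layer == layer:
            grouped[cell.parent].append(cell)
    return dict(grouped)


all_cells = [job] + spills + apas
spill_families = families(all_cells, "Spill")   # one family of three spills
apa_families = families(all_cells, "APA")       # one family per unfolded spill
```

With this grouping, the three Spill cells form a single family and the APA cells form two families, matching the shaded regions in the figure.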
3.2.1. Data Products
Data products are entities that encapsulate processed or raw data, of all kinds, separate from the algorithms that create them [DUNE 110]. They serve as the primary medium for communication between algorithms, ensuring seamless data exchange across processing steps [DUNE 111]. They are associated with (rather than containing) metadata and provenance information that describe how the data products were created [DUNE 121]. They are not tied to specific hardware or algorithm implementations, ensuring independence and reproducibility [DUNE 63]. They are also not tied to any specific IO back end, but must support reading and writing with both ROOT [DUNE 74] and HDF5 [DUNE 141]. They enable the framework to present data produced by one algorithm to subsequent algorithms, supporting iterative and chained processing workflows [DUNE 20].
3.2.1.1. Structure and Representation
The in-memory layout of a data product is determined by its type in the specified programming language. Phlex does not require the in-memory representation of a data product to be the same as its persistent representation [DUNE 2]. In general, a single conceptual data product can be represented by multiple programming language types. This includes representing a single conceptual data product in multiple supported programming languages.
The framework provides the ability to determine the memory footprint of each data product [DUNE 154].
3.2.1.2. Defining Data Product Types
Data product types are not defined by the framework. Framework users are expected to define their own data product types [DUNE 85].
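As an illustration of a user-defined data product type whose in-memory and persistent representations differ, consider the following sketch. The `Waveforms` class, its fields, and the byte layout are hypothetical; they do not correspond to any type or format defined by Phlex, ROOT, or HDF5.

```python
import array
from dataclasses import dataclass


@dataclass(frozen=True)
class Waveforms:
    """Hypothetical user-defined data product type: ADC samples for one channel."""
    channel: int
    samples: tuple   # in-memory representation: an immutable tuple of ints


def to_persistent(product):
    """Pack a product into bytes; one possible persistent form (assumed layout)."""
    packed = array.array("h", product.samples)   # signed short integers
    return product.channel.to_bytes(4, "little") + packed.tobytes()


def from_persistent(blob):
    """Rebuild the in-memory type from the byte layout written above."""
    channel = int.from_bytes(blob[:4], "little")
    packed = array.array("h")
    packed.frombytes(blob[4:])
    return Waveforms(channel, tuple(packed))


original = Waveforms(channel=7, samples=(12, -3, 40))
restored = from_persistent(to_persistent(original))
```

The round trip recovers an equal `Waveforms` object even though the stored bytes look nothing like the in-memory tuple, which is the freedom the text above describes.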
3.2.2. Data Layers, Data Cells, and Families
As illustrated in Fig. 3.2, data products are organized into user-defined data cells, families, layers, and hierarchies, supporting varying levels of granularity [DUNE 86] [DUNE 87] [DUNE 88]. They can be unfolded into finer-grained units, enabling detailed analysis or reprocessing at different scales [DUNE 43]. This provides the ability to process data too large to fit into memory at one time [DUNE 25].
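The memory benefit of unfolding can be sketched with a Python generator. This is an analogy under stated assumptions rather than the Phlex unfold interface: the function names `unfold` and `fold_sum` and the `chunk_size` parameter are hypothetical.

```python
from typing import Iterable, Iterator, List


def unfold(samples: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield successive finer-grained units of at most chunk_size samples."""
    chunk = []
    for sample in samples:
        chunk.append(sample)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:            # emit the final, possibly shorter, unit
        yield chunk


def fold_sum(chunks: Iterable[List[int]]) -> int:
    """Combine per-unit results; only one unit is held in memory at a time."""
    total = 0
    for chunk in chunks:
        total += sum(chunk)
    return total


# range(10) stands in for an input too large to materialize all at once.
total = fold_sum(unfold(range(10), chunk_size=4))
```

Because `unfold` is lazy, the consumer never needs the full input in memory; each unit can be discarded once it has been processed.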
3.2.3. Data Product Management
The framework takes over management of the data products returned by an algorithm. Algorithms are given read-only access to input data products [DUNE 121] [DUNE 130]. An algorithm with read-only access to a data product must not mutate it. Data products that are intended to be written out are sent to the IO system as soon as they are created [DUNE 142]. Data products are removed from memory as soon as they are no longer needed for writing or as input to another algorithm [DUNE 142].
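One way to picture this lifecycle is a store that hands out read-only views and drops a product once its last consumer has finished. The sketch below is an assumption-laden toy: the `ProductStore` class and its methods are hypothetical, and counting consumers is just one possible bookkeeping scheme (a pending write could be counted as one more consumer). It does not describe how Phlex manages memory.

```python
from types import MappingProxyType


class ProductStore:
    """Hypothetical store that owns products and tracks remaining consumers."""

    def __init__(self):
        self._products = {}
        self._pending = {}

    def put(self, name, product, n_consumers):
        # Copy the dict and wrap it so that consumers cannot assign to it.
        self._products[name] = MappingProxyType(dict(product))
        self._pending[name] = n_consumers

    def get(self, name):
        return self._products[name]

    def release(self, name):
        # Called when a consumer is done; evict after the last consumer.
        self._pending[name] -= 1
        if self._pending[name] == 0:
            del self._products[name]
            del self._pending[name]

    def __contains__(self, name):
        return name in self._products


store = ProductStore()
store.put("waveforms", {"channel": 7, "adc_sum": 49}, n_consumers=2)

view = store.get("waveforms")
try:
    view["channel"] = 8          # attempted mutation through a read-only view
    mutated = True
except TypeError:
    mutated = False

store.release("waveforms")
still_resident = "waveforms" in store    # one consumer remains
store.release("waveforms")
evicted = "waveforms" not in store       # last consumer done, product dropped
```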
3.2.4. Data Product Identification
Each data product is associated with a specific set of metadata describing the algorithms and configurations used in its creation. These metadata make that creation reproducible [DUNE 122]. The metadata are stored along with the data in the framework output file, and the IO interface allows access to the metadata [DUNE 121].
The data products created by an algorithm are associated with metadata that identify the algorithm that created them. Such metadata include:
- the creator, the name of the algorithm that created the data product
- an identifier for the data cells with which the data product is associated (e.g. Spill, Run, Calibration Interval, or other experiment-defined layer)
- the processing phase, an identifier for the job in which the data product was created
- an individual name for the data product (which may be empty), to distinguish between multiple products of the same type created by the same algorithm.
In addition to these metadata, a data product is specified by its type.
The metadata are also used in data product lookup, to specify which data products are to be provided as inputs to an algorithm. The algorithms are configured to identify the inputs in which they are interested by selecting on any of the metadata defined above, as well as by the programming language types of their inputs.
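Lookup by metadata can be sketched as filtering a catalog of records. Everything named here is hypothetical: the `ProductMetadata` record, its field names, the `select` helper, and the example values such as the phase `"reco"` and the creator `"calibrate"` are illustrations of the metadata listed above, not the Phlex configuration interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductMetadata:
    """Hypothetical record of the identifying metadata for one data product."""
    creator: str      # name of the algorithm that created the product
    cell_id: tuple    # identifier of the associated data cell
    phase: str        # identifier of the job that created the product
    name: str         # individual product name; may be empty
    type_name: str    # programming language type of the product


def select(catalog, **criteria):
    """Return the entries whose fields equal every given criterion."""
    return [
        entry for entry in catalog
        if all(getattr(entry, key) == value for key, value in criteria.items())
    ]


catalog = [
    ProductMetadata("unfold_into_apas", ("APA", 1, 1, 1), "reco", "", "Waveforms"),
    ProductMetadata("unfold_into_apas", ("APA", 1, 1, 2), "reco", "", "Waveforms"),
    ProductMetadata("calibrate", ("Spill", 1, 1), "reco", "gains", "Calibration"),
]

# Select inputs by creator and type, as an algorithm's configuration might.
waveform_inputs = select(catalog, creator="unfold_into_apas", type_name="Waveforms")
gain_inputs = select(catalog, name="gains")
```

Any combination of fields can serve as a selection, which mirrors the statement that algorithms may select on any of the metadata as well as on the type of their inputs.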