SAMPIClyser package

Module contents

class sampiclyser.SAMPIC_Run_Decoder(run_dir_path: Path)

Bases: object

Decode and process a complete SAMPIC run.

Provides a one-pass, memory-efficient workflow for:
  1. Reading raw SAMPIC binary files from a run directory.

  2. Extracting and decoding header metadata.

  3. Streaming hit records in fixed-size chunks.

  4. Writing decoded hits and metadata to Feather, Parquet, or ROOT formats.

The class preserves metadata both as raw bytes (for Arrow/Parquet) and as native Python types (for ROOT), and supports arbitrarily large files without loading everything into memory.

Variables:
  • run_base_path (pathlib.Path) – Path to the directory containing all binary files for one run.

  • run_header (SampicHeader) – Parsed header metadata for the current file being processed.

  • run_files (list[pathlib.Path]) – List of all SAMPIC binary files in run_base_path, in sort order.

decode_data(limit_hits: int = 0, feather_path: Path | None = None, parquet_path: Path | None = None, root_path: Path | None = None, root_tree: str = 'sampic_hits', extra_header_bytes: int = 1, chunk_size: int = 65536, batch_size: int = 100000) → None

Decode hit records from SAMPIC run files and export to Feather, Parquet, and/or ROOT.

This method streams parsed hit-record dictionaries (via parse_hit_records), accumulates them in batches to build a pandas DataFrame, and then writes each batch to the specified output formats. It never holds all records in memory at once.

Parameters:
  • limit_hits (int, optional) – Maximum number of hit records to process across all run files. A value of 0 (default) means “no limit” (process all hits).

  • feather_path (pathlib.Path or None, optional) – If not None, path to write the DataFrame in Feather format.

  • parquet_path (pathlib.Path or None, optional) – If not None, path to write the DataFrame in Parquet format.

  • root_path (pathlib.Path or None, optional) – If not None, path to write the DataFrame to a ROOT file.

  • root_tree (str, optional) – Name of the TTree inside the ROOT file (default: “sampic_hits”).

  • extra_header_bytes (int, optional) – Number of bytes to include after the detected header boundary (default: 1 to capture the trailing newline).

  • chunk_size (int, optional) – Byte size for each memory-mapped file read chunk (default: 64 KiB).

  • batch_size (int, optional) – Number of records to collect before flushing to output (default: 100 000).

Raises:

ValueError – If header parsing fails, or if writing to any format encounters missing or mismatched branch/column schemas.

Notes

  • Feather and Parquet outputs preserve exact column dtypes by casting before writing.

  • ROOT output is written via uproot using dict-of-NumPy-arrays (or mktree + extend) to ensure correct branch types.

  • Each batch is written immediately; the final partial batch is flushed at the end.
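
The batch-and-flush behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: write_in_batches and the flush callback are hypothetical names, and the callback stands in for the Feather/Parquet/ROOT writers.

```python
from typing import Any, Callable, Dict, Iterable, List


def write_in_batches(
    records: Iterable[Dict[str, Any]],
    flush: Callable[[List[Dict[str, Any]]], None],
    batch_size: int = 100_000,
) -> int:
    """Collect records into batches and hand each full batch to ``flush``."""
    batch: List[Dict[str, Any]] = []
    total = 0
    for record in records:
        batch.append(record)
        total += 1
        if len(batch) >= batch_size:
            flush(batch)
            batch = []  # start a fresh list so the flushed one can be freed
    if batch:
        flush(batch)  # final partial batch
    return total


# Hypothetical stand-in for a writer: it only remembers the batch sizes.
flushed_sizes: List[int] = []
n = write_in_batches(
    ({"HITNumber": i} for i in range(7)),
    lambda b: flushed_sizes.append(len(b)),
    batch_size=3,
)
```

Because only one batch is alive at a time, peak memory is bounded by batch_size rather than by the size of the run.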

decode_sampic_header(header_bytes: bytes, keep_unparsed: bool = True) → SampicHeader

Parse raw header bytes into a SampicHeader instance.

The header consists of one or more lines; each line starts and ends with “===” and contains fields separated by “===”. Field syntax may vary (e.g. “key: value”, “key value”, or composite “part1 = x part2 = y”).

Parameters:
  • header_bytes (bytes) – Raw bytes of the header section, from file start up to the header end (inclusive of delimiters and any extra bytes).

  • keep_unparsed (bool, optional) – If True (default), any fields that are not recognized are stored in the extra dict of the returned SampicHeader; if False, they are discarded.

Returns:

SampicHeader – A dataclass containing all parsed header values and optionally any unrecognized fields in its extra attribute.

Raises:

ValueError – If the header_bytes cannot be decoded into valid text, or if required header fields are missing or malformed.

Notes

This method:
  1. Splits header_bytes on lines beginning/ending with “===”.

  2. For each field fragment, calls _parse_header_field.

  3. Collects any unparsed text in SampicHeader.extra.
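
The splitting step can be illustrated with a short sketch. It assumes an ASCII header and uses made-up header content; split_header_fields is a hypothetical helper, while the regular expression is the front_end_fpga_re pattern documented for this class.

```python
import re
from typing import List

# Pattern quoted from the class attribute documented for this decoder.
front_end_fpga_re = re.compile(
    r'^FRONT-END FPGA INDEX: (\d+) FIRMWARE VERSION (.+) BASELINE VALUE: ([\d\.]+)'
)


def split_header_fields(header_bytes: bytes) -> List[str]:
    """Return the non-empty field fragments found between '===' delimiters."""
    text = header_bytes.decode("ascii")  # assumption: the header is ASCII text
    fields: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("===") and line.endswith("===")):
            continue  # this simplified sketch ignores non-header lines
        fields.extend(part.strip() for part in line.split("===") if part.strip())
    return fields


# Made-up header content, for illustration only.
sample = (
    b"=== SOFTWARE VERSION: V1.0 === NUMBER OF CHANNELS: 64 ===\n"
    b"=== FRONT-END FPGA INDEX: 0 FIRMWARE VERSION V2.1 BASELINE VALUE: 0.100 ===\n"
)
fields = split_header_fields(sample)
match = front_end_fpga_re.match(fields[2])
```

A decoding failure raises UnicodeDecodeError, which is a subclass of ValueError, so it fits the error contract described above.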

front_end_fpga_re = re.compile('^FRONT-END FPGA INDEX: (\\d+) FIRMWARE VERSION (.+) BASELINE VALUE: ([\\d\\.]+)')

open_sampic_file_in_chunks_and_get_header(file_path: Path, extra_header_bytes: int, chunk_size: int = 65536, debug: bool = False) → Generator[Tuple[bytes, Generator[bytes, None, None]], None, None]

Memory-map a SAMPIC file, extract its header, and stream the remainder in chunks.

This context manager opens file_path in read-only mode, mmaps the entire file, and locates the header boundary as the last ‘=’ byte before the first 0x00. It returns the header (including extra_header_bytes) and a generator yielding the file body in chunk_size-byte blocks. On exit, both the file and the mmap are cleanly closed.

During this process, self.current_filesize is set to the size of the file.

Parameters:
  • file_path (pathlib.Path) – Path to the binary SAMPIC file to read.

  • extra_header_bytes (int) – Number of bytes to include after the header delimiter (=) in the returned header.

  • chunk_size (int, optional) – Size of each chunk (in bytes) produced by the body generator. Default is 64 KiB.

  • debug (bool, optional) – If True, print debugging information. Default is False.

Yields:
  • header_bytes (bytes) – The raw header bytes, from the file start up through the computed end.

  • body_gen (generator of bytes) – Generator yielding successive chunk_size-byte slices of the file body.

Raises:

ValueError – If the header delimiter cannot be located (i.e. no ‘=’ before the first 0x00), indicating a malformed file.
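
The header-boundary rule (last ‘=’ before the first 0x00) can be sketched with the standard mmap module. This is an illustrative sketch, not the method itself: open_with_header is a hypothetical name, and the demo file content is made up.

```python
import mmap
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Generator, Iterator, Tuple


@contextmanager
def open_with_header(
    file_path: Path, extra_header_bytes: int = 1, chunk_size: int = 65536
) -> Iterator[Tuple[bytes, Generator[bytes, None, None]]]:
    """Yield (header_bytes, body_generator); close file and mmap on exit."""
    with open(file_path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            first_nul = mm.find(b"\x00")
            # Search for '=' only in the region before the first 0x00 byte.
            last_eq = mm.rfind(b"=", 0, first_nul) if first_nul != -1 else -1
            if last_eq == -1:
                raise ValueError("no '=' found before the first 0x00 byte")
            header_end = last_eq + 1 + extra_header_bytes
            header = mm[:header_end]  # slicing an mmap copies into bytes

            def body() -> Generator[bytes, None, None]:
                for start in range(header_end, len(mm), chunk_size):
                    yield mm[start:start + chunk_size]

            yield header, body()


# Demo on a small temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.bin"
    path.write_bytes(b"=== KEY: value ===\n" + b"\x00\x01\x02\x03\x04")
    with open_with_header(path, extra_header_bytes=1, chunk_size=2) as (header, body_gen):
        chunks = list(body_gen)
```

The body generator must be consumed inside the with block, because the mmap is closed as soon as the context exits.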

parse_hit_records(limit_hits: int = 0, extra_header_bytes: int = 1, chunk_size: int = 65536) → Generator[Dict[str, Any], None, None]

Stream and decode hit records from all files in the run.

This generator method opens each SAMPIC binary file in turn, extracts its header (via open_sampic_file_in_chunks_and_get_header), checks for header consistency across files, then streams the body in fixed-size chunks, parsing out complete hit records until either the file ends or limit_hits is reached.

Parameters:
  • limit_hits (int, optional) – Maximum number of hit records to yield across all files. A value of 0 (default) means no limit (process all hits).

  • extra_header_bytes (int, optional) – Number of bytes to include after the header delimiter when extracting the header (default is 1 to include the newline).

  • chunk_size (int, optional) – Size in bytes of each data chunk read from the body (default is 64 KiB). Larger chunks may be more efficient but use more memory.

Yields:

record (dict) – A mapping from field names (str) to parsed values (int, float, bool, list, etc.) for each hit record.

Raises:

ValueError – If header parsing fails or a file’s header does not match the previously parsed header (mismatched run files).

Notes

  • Uses a rolling buffer to accumulate bytes from the stream until a full record can be parsed by try_parse_record.

  • After parsing each record, advances the buffer and continues until all records are yielded or limit_hits is reached.
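
The rolling-buffer pattern can be sketched as below. The record layout here (a 2-byte little-endian length followed by the payload) is purely hypothetical and chosen only to keep the example short; the real SAMPIC record format is not described in this section, and this try_parse_record is a stand-in with an assumed return shape.

```python
import struct
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple, Union


def try_parse_record(
    buf: Union[bytes, bytearray],
) -> Optional[Tuple[Dict[str, Any], int]]:
    """Hypothetical layout: 2-byte little-endian payload length, then payload."""
    if len(buf) < 2:
        return None
    (size,) = struct.unpack_from("<H", buf, 0)
    if len(buf) < 2 + size:
        return None  # record is still incomplete; wait for more bytes
    return {"payload": bytes(buf[2:2 + size])}, 2 + size


def iter_records(
    chunks: Iterable[bytes], limit_hits: int = 0
) -> Iterator[Dict[str, Any]]:
    """Yield complete records from a chunked byte stream."""
    buffer = bytearray()
    yielded = 0
    for chunk in chunks:
        buffer.extend(chunk)
        while True:
            parsed = try_parse_record(buffer)
            if parsed is None:
                break  # need the next chunk to complete a record
            record, consumed = parsed
            del buffer[:consumed]  # advance past the parsed record
            yield record
            yielded += 1
            if limit_hits and yielded >= limit_hits:
                return


stream = b"\x03\x00abc" + b"\x02\x00hi"
# Split in the middle of the first record to exercise the buffering.
records = list(iter_records([stream[:4], stream[4:]]))
limited = list(iter_records([stream], limit_hits=1))
```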

prepare_header_metadata() → Dict[bytes, bytes]

Pack run-header attributes into raw byte metadata for columnar files.

Generates a mapping of metadata keys to byte-encoded values suitable for Arrow/Parquet file schemas, preserving binary precision and type.

Returns:

metadata (dict of bytes → bytes) – Byte-to-byte mapping where:

  • Text fields (e.g. software_version) are ASCII-encoded.

  • timestamp is a little-endian 8-byte float (struct.pack('<d', …)).

  • num_channels and enabled_channels_mask are little-endian 4-byte unsigned ints (struct.pack('<I', …)).

  • Boolean flags (reduced_data_type, without_waveform, etc.) are stored as a single byte: b'\x00' for False, b'\x01' for True.

Notes

Keys are raw byte strings (e.g. b’software_version’), matching the Arrow metadata API expectations. This preserves full fidelity for programmatic reloading via decode_byte_metadata.
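
The encodings listed above can be illustrated with the standard struct module. This is a sketch with made-up values; pack_metadata is a hypothetical helper, and the single-byte boolean encoding (0x00 / 0x01) is an assumption.

```python
import struct
from typing import Dict


def pack_metadata(
    software_version: str,
    timestamp: float,
    num_channels: int,
    enabled_channels_mask: int,
    without_waveform: bool,
) -> Dict[bytes, bytes]:
    """Pack a few header fields into a bytes-to-bytes mapping."""
    return {
        b"software_version": software_version.encode("ascii"),
        b"timestamp": struct.pack("<d", timestamp),        # 8-byte LE float
        b"num_channels": struct.pack("<I", num_channels),  # 4-byte LE uint
        b"enabled_channels_mask": struct.pack("<I", enabled_channels_mask),
        # Assumption: one byte, 0x00 for False and 0x01 for True.
        b"without_waveform": b"\x01" if without_waveform else b"\x00",
    }


meta = pack_metadata("V1.0", 1700000000.5, 64, 0xFFFF0000, False)

# Round-trip two of the fields to show that no precision is lost.
(timestamp,) = struct.unpack("<d", meta[b"timestamp"])
(mask,) = struct.unpack("<I", meta[b"enabled_channels_mask"])
```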

prepare_root_header_metadata() → Dict[str, object]

Build a Python-native metadata dict for ROOT TTree output.

Collects all run-header fields into native Python types so they can be written directly as branches in a ROOT metadata TTree.

Returns:

metadata (dict of str → object) – Dictionary mapping metadata keys to Python values, including:

  • software_version : str

  • timestamp : datetime.datetime

  • sampic_mezzanine_board_version : str

  • num_channels : int

  • ctrl_fpga_firmware_version : str

  • sampling_frequency : str

  • enabled_channels_mask : int

  • reduced_data_type : bool

  • without_waveform : bool

  • tdc_like_files : bool

  • hit_number_format : str

  • unix_time_format : str

  • data_format : str

  • trigger_position_format : str

  • data_samples_format : str

  • inl_correction : bool

  • adc_correction : bool

Notes

All values are in their natural Python form (no byte-packing), ready for conversion to Awkward or NumPy arrays when writing via uproot.

timestamp_re = re.compile('^UnixTime = (.+) date = (.+) time = (.+ms)')

write_root_header(froot: WritableDirectory) → None

Embed run-header metadata into a ROOT file as a metadata TTree.

Converts the dict returned by prepare_root_header_metadata into Awkward arrays of strings and writes them as two branches (‘key’ and ‘value’) in a TTree named ‘metadata’. Existing metadata trees of the same name are overwritten.

Parameters:

froot (uproot.WritableDirectory) – An open ROOT file handle (from uproot.recreate or uproot.update) into which the metadata TTree will be written.

Returns:

None

Notes

  • Keys and values are both stored as variable-length strings using Awkward Arrays (ak.from_iter).

  • The resulting TTree will have two string branches:
    • key : metadata field names

    • value : metadata field values (all converted to str)

  • If a ‘metadata’ TTree already exists, it is replaced.

sampiclyser.check_time_ordering(file_path: Path, use_unix_time: bool = False, find_all: bool = False, batch_size: int = 100000, root_tree: str = 'sampic_hits') → List[Tuple[int, float, float]]

Verify that hit records in a SAMPIC output file are non-decreasing in time.

Streams through the file in memory-efficient batches, reconstructs or reads each hit’s timestamp, and checks for any out-of-order intervals.

Parameters:
  • file_path (pathlib.Path) – Path to the input data file (.parquet, .feather, or .root).

  • use_unix_time (bool, optional) – If True, use the ‘UnixTime’ column directly as the hit timestamp. Otherwise, reconstruct the timestamp with a custom algorithm (which must be implemented in _reconstruct_time). Default is False.

  • find_all (bool, optional) – If True, continue scanning the entire file and collect all out-of-order events; if False, stop at the first detection. Default is False.

  • batch_size (int, optional) – Number of rows to read per batch from open_hit_reader. Default is 100000.

  • root_tree (str, optional) – Name of the TTree inside a ROOT file (only used for .root). Default is “sampic_hits”.

Returns:

list of (hit_index, previous_time, current_time) – A list of tuples for each detected out-of-order event, where:

  • hit_index is the zero-based index of the later (out-of-order) hit.

  • previous_time is the timestamp of the immediately preceding hit.

  • current_time is the timestamp of the out-of-order hit.

If no violations are found, an empty list is returned.

Raises:

ValueError – If use_unix_time is False and no reconstruction algorithm is provided in _reconstruct_time.
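
The ordering check across batch boundaries can be sketched as follows. This is a minimal illustration on plain lists of timestamps; find_out_of_order is a hypothetical name and the file reading is left out.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def find_out_of_order(
    batches: Iterable[Sequence[float]], find_all: bool = False
) -> List[Tuple[int, float, float]]:
    """Return (hit_index, previous_time, current_time) for each violation."""
    violations: List[Tuple[int, float, float]] = []
    previous: Optional[float] = None  # carried across batch boundaries
    index = 0
    for batch in batches:
        for current in batch:
            if previous is not None and current < previous:
                violations.append((index, previous, current))
                if not find_all:
                    return violations  # stop at the first detection
            previous = current
            index += 1
    return violations


batches = [[1.0, 2.0, 1.5], [1.4, 3.0]]
first_only = find_out_of_order(batches)
everything = find_out_of_order(batches, find_all=True)
```

Carrying previous across batches matters: the second violation in the demo spans the boundary between the two batches.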

sampiclyser.get_channel_hits(file_path: Path, batch_size: int = 100000, root_tree: str = 'sampic_hits') → DataFrame

Compute per-channel hit counts by streaming only the ‘Channel’ column.

Supports Feather, Parquet, or ROOT (.root) files written by the Sampic decoder. Reads data in batches (to bound memory use) and tallies the number of rows (hits) observed on each channel.

Parameters:
  • file_path (pathlib.Path) – Path to the input data file. Must have suffix .feather, .parquet, or .root.

  • batch_size (int, optional) – Number of entries to read per iteration (default: 100000).

  • root_tree (str, optional) – Name of the TTree inside the ROOT file to read (only used if file_path is .root; default: “sampic_hits”).

Returns:

pandas.DataFrame – A DataFrame with two columns:

  • Channel (int): channel identifier

  • Hits (int): total number of hits on that channel

Rows are sorted by increasing Channel.

Raises:

ValueError – If the file suffix is not one of .feather, .parquet, or .root.
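
The batched tally can be sketched with collections.Counter. This illustration works on plain lists of channel numbers and returns sorted (Channel, Hits) pairs instead of a DataFrame; count_channel_hits is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def count_channel_hits(batches: Iterable[Sequence[int]]) -> List[Tuple[int, int]]:
    """Tally hits per channel over all batches, sorted by channel."""
    counts: Counter = Counter()
    for batch in batches:
        counts.update(batch)  # one increment per row in the batch
    return sorted(counts.items())


rows = count_channel_hits([[3, 1, 3], [1, 3, 0]])
```

Only the running counter is kept between batches, so memory use depends on the number of channels, not the number of hits.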

sampiclyser.get_file_metadata(file_path: Path) → dict[str, object]

Load metadata from a SAMPIC output file, selecting the appropriate reader.

This function examines the file extension of file_path and invokes the corresponding metadata decoder:

  • Parquet (.parquet, .pq): uses pyarrow.parquet metadata and decode_byte_metadata for byte-to-type conversion.

  • Feather (.feather): uses pyarrow.ipc schema metadata and decode_byte_metadata.

  • ROOT (.root): uses uproot to read a ‘metadata’ TTree via load_root_metadata.

Parameters:

file_path (pathlib.Path) – Path to the input file whose metadata to extract. Supported suffixes are .parquet, .pq, .feather, and .root.

Returns:

metadata (dict of str → object) – Dictionary of metadata fields mapped to native Python values, where each value may be one of:

  • str For textual fields (software versions, format strings).

  • int For numeric fields (e.g. num_channels, masks).

  • bool For flag fields (reduced_data_type, etc.).

  • datetime.datetime For timestamp fields.

Raises:

ValueError – If file_path has an unsupported suffix or if metadata loading fails for any reason.
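
The suffix-based dispatch can be sketched as a lookup table. This is only an illustration: select_reader and the reader labels are hypothetical, and lowercasing the suffix is an assumption not stated above.

```python
from pathlib import Path

# Supported suffixes mapped to a hypothetical reader label.
_READERS = {
    ".parquet": "parquet",
    ".pq": "parquet",
    ".feather": "feather",
    ".root": "root",
}


def select_reader(file_path: Path) -> str:
    """Return the reader label for file_path or raise ValueError."""
    suffix = file_path.suffix.lower()  # assumption: case-insensitive match
    try:
        return _READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file suffix: {suffix!r}") from None


reader = select_reader(Path("run_001.pq"))
```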

sampiclyser.plot_channel_hit_rate(file_path: Path, channel: int = 0, bin_size: float = 1.0, batch_size: int = 100000, plot_hits: bool = False, start_time: datetime | float | None = None, end_time: datetime | float | None = None, root_tree: str = 'sampic_hits', scale_factor: float = 1.0, label: str = 'PPS', log_y: bool = False, figsize: tuple[float, float] = (6, 4), rlabel: str = '(13 TeV)', is_data: bool = True, color='C0', title: str | None = None) → Figure

Plot the hit rate (or raw hits) of a single channel as a function of time from large data files.

Streams the “UnixTime” column in batches from a Feather, Parquet, or ROOT file, bins events into fixed-width time intervals, and renders a CMS-style time series.

Parameters:
  • file_path (pathlib.Path) – Path to the input data file; supported suffixes are .feather, .parquet, .pq, and .root.

  • channel (int, optional) – The SAMPIC channel to plot (default: 0).

  • bin_size (float, optional) – Width of each time bin in seconds; values below 0.1 are rounded up to 0.1 (default: 1.0).

  • batch_size (int, optional) – Number of entries to read per I/O batch (default: 100000).

  • plot_hits (bool, optional) – If True, plot the raw count per bin; otherwise plot the rate (count divided by bin_size) (default: False).

  • start_time (datetime.datetime, float, or None, optional) – Start of the time window for plotting, as a datetime or UNIX timestamp. If None, uses the file’s “start_of_run” metadata. Aligned to the nearest lower multiple of bin_size (default: None).

  • end_time (datetime.datetime, float, or None, optional) – End of the time window for plotting, as a datetime or UNIX timestamp. If None, determined from the data. Aligned to the nearest upper multiple of bin_size (default: None).

  • root_tree (str, optional) – Name of the TTree in a ROOT file (only used if file_path ends in .root; default: “sampic_hits”).

  • scale_factor (float, optional) – Multiplier applied to each bin’s count (e.g. to account for central trigger multiplicity) before plotting (default: 1.0).

  • label (str, optional) – Experiment label (default: “PPS”).

  • log_y (bool, optional) – If True, use a logarithmic y-axis (default: False).

  • figsize (tuple of float, optional) – Figure size in inches as (width, height) (default: (6, 4)).

  • rlabel (str, optional) – Additional right-hand label (e.g. collision energy) (default: “(13 TeV)”).

  • is_data (bool, optional) – If True, annotate plots as “Data”; if False, annotate as “Simulation” (default: True).

  • color (color spec, optional) – Matplotlib color for the line or bars (default: “C0”).

  • title (str or None, optional) – Main title for the figure; if None, no title is drawn (default: None).

Returns:

fig (matplotlib.figure.Figure) – Figure object containing the hit-rate (or hit-count) vs. time plot, styled according to CMS conventions.

Raises:

ValueError – If file_path has an unsupported suffix.

Notes

  • Time bins are computed as floor((t - t0)/bin_size) indices, then shifted back to absolute times for plotting.

  • X-axis tick formatting uses Matplotlib’s AutoDateLocator and AutoDateFormatter for sensible date/time labels across variable spans.
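
The binning rule floor((t - t0)/bin_size) can be sketched without any plotting. This is a minimal illustration on a list of timestamps; bin_hits is a hypothetical name, and the 0.1 s floor on bin_size follows the parameter description above.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def bin_hits(
    times: Iterable[float], t0: float, bin_size: float = 1.0, plot_hits: bool = False
) -> Dict[float, float]:
    """Map each bin's absolute start time to its hit count or hit rate."""
    bin_size = max(bin_size, 0.1)  # values below 0.1 are raised to 0.1
    counts = Counter(math.floor((t - t0) / bin_size) for t in times)
    scale = 1.0 if plot_hits else 1.0 / bin_size  # rate = count / bin_size
    # Shift bin indices back to absolute times.
    return {t0 + idx * bin_size: n * scale for idx, n in sorted(counts.items())}


times = [100.2, 100.7, 101.1, 103.9]
rates = bin_hits(times, t0=100.0, bin_size=2.0)
hits = bin_hits(times, t0=100.0, bin_size=2.0, plot_hits=True)
```

Bins with no hits are simply absent from the result in this sketch; a plotting routine would typically fill them with zero.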

sampiclyser.plot_channel_hits(df: DataFrame, first_channel: int, last_channel: int, label: str = 'PPS', log_y: bool = False, figsize: tuple[float, float] = (6, 4), rlabel: str = '(13 TeV)', is_data: bool = True, color='C0', title: str | None = None) → Figure

Draw a CMS-style bar histogram of hit counts per channel.

Parameters:
  • df (pandas.DataFrame) – Summary table with two columns:
    • Channel (int): channel indices

    • Hits (int): hit counts per channel

  • first_channel (int) – Lowest channel index to include on the x-axis.

  • last_channel (int) – Highest channel index to include on the x-axis.

  • label (str, optional) – Text label for the experiment (default: “PPS”).

  • log_y (bool, optional) – If True, use a logarithmic y-axis (default: False).

  • figsize (tuple of float, optional) – Figure size in inches as (width, height) (default: (6, 4)).

  • rlabel (str, optional) – Right-hand text label, typically collision energy (default: “(13 TeV)”).

  • is_data (bool, optional) – If True, annotate the plot as “Data”; if False, annotate as “Simulation” (default: True).

  • color (any, optional) – Matplotlib color spec for the bars (default: “C0”).

  • title (str or None, optional) – Main title displayed above the axes; if None, no title is shown.

Returns:

matplotlib.figure.Figure – The Figure object containing the histogram.

Raises:

ValueError – If last_channel is less than first_channel.

Notes

  • Channels missing from df are shown with zero hits.

  • In linear mode, y-axis tick labels are formatted in uppercase scientific notation (e.g. “4.0E6”).

  • The plot uses mplhep.style.*, with label and rlabel positioned according to the conventions of the selected style.

  • The is_data flag controls the “Data” vs. “Simulation” annotation.

sampiclyser.plot_channel_waveforms(file_path: Path, root_tree: str = 'sampic_hits', batch_size: int = 100000, first_hit: int = 0, num_hits: int = 10, channel_filter: list[int] | None = None, interpolation_method: str | None = 'sinc', interpolation_factor: int = 4, interpolation_parameter: int = 8, label: str = 'PPS', log_y: bool = False, figsize: tuple[float, float] = (6, 4), rlabel: str = '(13 TeV)', is_data: bool = True, title: str | None = None, file_name_id: str | None = None, cmap: str | None = None, time_scale: float = 1000000000, plot_sample_types: bool = True) → Figure

Plot multiple waveform hits from a SAMPIC data file in CMS style.

This function streams hit records from the specified data file (Parquet, Feather, or ROOT), applies optional interpolation and circular-buffer reordering, and draws each waveform with distinct coloring and markers. It then assembles CMS-standard annotations, a consolidated legend, and automatically generated titles.

Parameters:
  • file_path (pathlib.Path) – Path to the input file containing SAMPIC hit data. Supported formats: Parquet (.parquet, .pq), Feather (.feather), or ROOT (.root).

  • root_tree (str, default “sampic_hits”) – Name of the TTree inside a ROOT file to read.

  • batch_size (int, default 100000) – Number of hits to read per iteration when streaming.

  • first_hit (int, default 0) – Zero-based index of the first hit to plot (skips earlier hits).

  • num_hits (int, default 10) – Maximum number of hit waveforms to display.

  • channel_filter (list of int or None, optional) – If provided, only hits from these channel indices are plotted.

  • interpolation_method ({‘sinc’,’hann’,’hamming’,’lanczos’,’resample’,’resample_poly’}, optional) – Method for upsampling the waveform before plotting.

  • interpolation_factor (int, default 4) – Upsampling factor for interpolation.

  • interpolation_parameter (int, default 8) – Kernel/window size or filter parameter for the chosen interpolation.

  • label (str, default “PPS”) – Experiment label shown on the plot.

  • log_y (bool, default False) – If True, use a logarithmic scale for the y-axis.

  • figsize (tuple of float, default (6, 4)) – Figure size in inches.

  • rlabel (str, default “(13 TeV)”) – Right-hand label (e.g. collision energy) in the CMS annotation.

  • is_data (bool, default True) – If True, annotate as data; otherwise as simulation.

  • title (str or None, optional) – Custom plot title. If None, an automatic title is generated.

  • file_name_id (str or None, optional) – Identifier for the input file used in the auto-title; defaults to file name.

  • cmap (str or None, optional) – Name of a Matplotlib colormap for channel coloring; defaults to style cycle.

  • time_scale (float, default 1E9) – Multiplicative factor to apply to the time-axis before plotting.

  • plot_sample_types (bool, default True) – If True, plot each distinct sample type with a different symbol.

Returns:

fig (matplotlib.figure.Figure) – Figure object containing the selected waveforms of V vs time, styled according to CMS conventions.

Raises:
  • ValueError – If the input file format is unsupported, or if key columns are missing.

  • RuntimeError – If metadata cannot be extracted or plot configuration is invalid.

Notes

  • Uses open_hit_reader and select_waveforms to stream and filter hits.

  • Delegates single-waveform rendering to plot_waveform.

  • Finalizes annotations with finalize_waveform_legend and set_waveform_titles_and_labels.
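
As an illustration of the ‘sinc’ upsampling option, the sketch below applies truncated Whittaker-Shannon interpolation in pure Python. It is not the package's implementation: sinc_upsample is a hypothetical name, and treating interpolation_parameter as a kernel half-width in samples is an assumption.

```python
import math
from typing import List, Sequence


def sinc(u: float) -> float:
    """Normalized sinc: sin(pi*u) / (pi*u), with sinc(0) = 1."""
    return 1.0 if u == 0.0 else math.sin(math.pi * u) / (math.pi * u)


def sinc_upsample(
    samples: Sequence[float], factor: int = 4, half_width: int = 8
) -> List[float]:
    """Upsample by ``factor`` using a sinc kernel truncated to +-half_width."""
    n = len(samples)
    out: List[float] = []
    for k in range((n - 1) * factor + 1):
        t = k / factor  # position in units of the original sample spacing
        center = math.floor(t)
        lo = max(0, center - half_width + 1)
        hi = min(n, center + half_width + 1)
        out.append(sum(samples[i] * sinc(t - i) for i in range(lo, hi)))
    return out


samples = [0.0, 1.0, 0.5, -0.25]
upsampled = sinc_upsample(samples, factor=4)
```

Every factor-th output point coincides with an original sample (up to floating-point rounding), so the interpolated curve passes through the measured values.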

sampiclyser.plot_hit_rate(file_path: Path, bin_size: float = 1.0, batch_size: int = 100000, plot_hits: bool = False, start_time: datetime | float | None = None, end_time: datetime | float | None = None, root_tree: str = 'sampic_hits', scale_factor: float = 1.0, label: str = 'PPS', log_y: bool = False, figsize: tuple[float, float] = (6, 4), rlabel: str = '(13 TeV)', is_data: bool = True, color='C0', title: str | None = None) → Figure

Plot the hit rate (or raw hits) as a function of time from large data files.

Streams the “UnixTime” column in batches from a Feather, Parquet, or ROOT file, bins events into fixed-width time intervals, and renders a CMS-style time series.

Parameters:
  • file_path (pathlib.Path) – Path to the input data file; supported suffixes are .feather, .parquet, .pq, and .root.

  • bin_size (float, optional) – Width of each time bin in seconds; values below 0.1 are rounded up to 0.1 (default: 1.0).

  • batch_size (int, optional) – Number of entries to read per I/O batch (default: 100000).

  • plot_hits (bool, optional) – If True, plot the raw count per bin; otherwise plot the rate (count divided by bin_size) (default: False).

  • start_time (datetime.datetime, float, or None, optional) – Start of the time window for plotting, as a datetime or UNIX timestamp. If None, uses the file’s “start_of_run” metadata. Aligned to the nearest lower multiple of bin_size (default: None).

  • end_time (datetime.datetime, float, or None, optional) – End of the time window for plotting, as a datetime or UNIX timestamp. If None, determined from the data. Aligned to the nearest upper multiple of bin_size (default: None).

  • root_tree (str, optional) – Name of the TTree in a ROOT file (only used if file_path ends in .root; default: “sampic_hits”).

  • scale_factor (float, optional) – Multiplier applied to each bin’s count (e.g. to account for central trigger multiplicity) before plotting (default: 1.0).

  • label (str, optional) – Experiment label (default: “PPS”).

  • log_y (bool, optional) – If True, use a logarithmic y-axis (default: False).

  • figsize (tuple of float, optional) – Figure size in inches as (width, height) (default: (6, 4)).

  • rlabel (str, optional) – Additional right-hand label (e.g. collision energy) (default: “(13 TeV)”).

  • is_data (bool, optional) – If True, annotate plots as “Data”; if False, annotate as “Simulation” (default: True).

  • color (color spec, optional) – Matplotlib color for the line or bars (default: “C0”).

  • title (str or None, optional) – Main title for the figure; if None, no title is drawn (default: None).

Returns:

fig (matplotlib.figure.Figure) – Figure object containing the hit-rate (or hit-count) vs. time plot, styled according to CMS conventions.

Raises:

ValueError – If file_path has an unsupported suffix.

Notes

  • Time bins are computed as floor((t - t0)/bin_size) indices, then shifted back to absolute times for plotting.

  • X-axis tick formatting uses Matplotlib’s AutoDateLocator and AutoDateFormatter for sensible date/time labels across variable spans.
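
The alignment of start_time and end_time to multiples of bin_size can be sketched as below. This is an illustration only: align_window is a hypothetical name, and measuring the multiples from zero (rather than from the start of the run) is an assumption.

```python
import math
from typing import Tuple


def align_window(start: float, end: float, bin_size: float) -> Tuple[float, float, int]:
    """Snap start down and end up to multiples of bin_size; count the bins."""
    aligned_start = math.floor(start / bin_size) * bin_size
    aligned_end = math.ceil(end / bin_size) * bin_size
    n_bins = round((aligned_end - aligned_start) / bin_size)
    return aligned_start, aligned_end, n_bins


window = align_window(103.2, 117.1, 5.0)
```

Aligning both edges guarantees that the window covers a whole number of bins, so no partial bin distorts the first or last rate value.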

sampiclyser.plot_hitmap(summary_df: DataFrame, specs: Sequence[SensorSpec], layout: Tuple[int, int], figsize: Tuple[int, float] = (8, 6), cmap: str = 'viridis', log_z: bool = False, title: str | None = None, do_sampic_ch: bool = False, do_board_ch: bool = False, center_fontsize: int = 14, coordinates: str = 'local') → Figure

Draw a grid of sensor hitmaps with a shared color scale.

Each sensor’s hit counts are rendered according to its geometry (grid, grouped, or scatter) in a subplot arranged by layout. All subplots share the same color normalization (linear or logarithmic), and have equal aspect ratio to preserve pixel shapes.

Parameters:
  • summary_df (pandas.DataFrame) – DataFrame with columns “Channel” (int) and “Hits” (int).

  • specs (sequence of SensorSpec) – Specifications for each sensor to plot (one subplot per spec).

  • layout (tuple of int) – Subplot grid dimensions as (nrows, ncols).

  • figsize (tuple of float, optional) – Figure size in inches as (width, height) (default: (8, 6)).

  • cmap (str, optional) – Matplotlib colormap name to use for all sensors (default: “viridis”).

  • log_z (bool, optional) – If True, apply logarithmic normalization on the color (z) axis (default: False for linear scale).

  • title (str or None, optional) – Overall figure title; if None, no supertitle is drawn (default: None).

  • do_sampic_ch (bool, optional) – If True, annotate each pixel/group with its SAMPIC channel index (default: False).

  • do_board_ch (bool, optional) – If True, annotate each pixel/group with its board channel index (default: False).

  • center_fontsize (int, optional) – Font size for center annotations (default: 14).

  • coordinates ({‘local’, ‘global’}, optional) – Coordinate system for rendering (default: ‘local’):
    • ‘local’: use the sensor’s native coordinates.

    • ‘global’: apply global_rotation_units and global_flip from each spec to map to a common global frame.

Returns:

fig (matplotlib.figure.Figure) – The Figure object containing the arranged hitmap subplots.

Notes

  • Uses sampiclyser_style for style formatting.

  • A single colorbar is added to the first subplot, reflecting all panels.

  • Subplot aspect is set to ‘equal’ so that pixels are not distorted.

sampiclyser.reorder_hits(input_path: Path, output_feather_path: Path | None = None, output_parquet_path: Path | None = None, output_root_path: Path | None = None, root_tree: str = 'sampic_hits', use_unix_time: bool = False, batch_size: int = 100000, max_time_offset: float | None = None, schemaInfo: Dict[str, Tuple] = {'ADCCounterLatched': ('int32', DataType(int32), <class 'numpy.int32'>), 'Amplitude': ('float32', DataType(float), <class 'numpy.float32'>), 'Baseline': ('float32', DataType(float), <class 'numpy.float32'>), 'Cell': ('int32', DataType(int32), <class 'numpy.int32'>), 'Channel': ('int32', DataType(int32), <class 'numpy.int32'>), 'DataSample': (None, ListType(list<item: float>), <class 'numpy.float32'>, 64), 'DataSize': ('int32', DataType(int32), <class 'numpy.int32'>), 'FPGATimeStamp': ('uint64', DataType(uint64), <class 'numpy.float64'>), 'HITNumber': ('int32', DataType(int32), <class 'numpy.int32'>), 'OrderedCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'PhysicalCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'RawPeak': ('float32', DataType(float), <class 'numpy.float32'>), 'RawTOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'StartOfADCRamp': ('int32', DataType(int32), <class 'numpy.int32'>), 'TOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'Time': ('float64', DataType(double), <class 'numpy.float64'>), 'TimeStampA': ('int32', DataType(int32), <class 'numpy.int32'>), 'TimeStampB': ('int32', DataType(int32), <class 'numpy.int32'>), 'TriggerPosition': (None, ListType(list<item: int32>), <class 'numpy.int32'>, 64), 'UnixTime': ('float64', DataType(double), <class 'numpy.float64'>)}) → None

Stream and reorder hit records into non-decreasing time order, writing to a new file.

This function reads hit records from input_path (Feather, Parquet, or ROOT) in memory-efficient batches, reconstructs or reads each hit’s timestamp, and writes them in non-decreasing time order to each requested output path, in the same format as the input or a different one. If max_time_offset is provided, a sliding window (heap) handles out-of-order intervals of up to max_time_offset; otherwise all records are buffered and sorted before writing.

Parameters:
  • input_path (pathlib.Path) – Path to the input decoded SAMPIC data file (.parquet, .pq, .feather, or .root).

  • output_feather_path (pathlib.Path or None, optional) – If not None, path to write the reordered data in Feather format.

  • output_parquet_path (pathlib.Path or None, optional) – If not None, path to write the reordered data in Parquet format.

  • output_root_path (pathlib.Path or None, optional) – If not None, path to write the reordered data to a ROOT file. Writes a data TTree named root_tree and a metadata TTree.

  • root_tree (str, optional) – Name of the TTree inside the ROOT file (default: “sampic_hits”).

  • use_unix_time (bool, default False) – If True, use the ‘UnixTime’ column directly; otherwise use sampic_reconstruct_time_dict to compute timestamps.

  • batch_size (int, default 100000) – Number of rows/entries to read per batch from the input.

  • max_time_offset (float or None, optional) – Maximum lag (seconds) allowed before emitting buffered records. Records older than (current_max_time - max_time_offset) are written. If None, all records are buffered and sorted fully before writing.

  • schemaInfo (dict, default SAMPIC_Schema_Info) – Mapping of field names to type definitions used for ROOT dtype.

Returns:

None

Raises:
  • ValueError – If timestamp reconstruction fails or required columns are missing.

  • RuntimeError – If no output path is provided.

Notes

  • Metadata from the input file is propagated to all output formats.

  • Parquet and Feather writing use PyArrow writers with an explicit schema.

  • ROOT writing uses uproot to extend a TTree in streaming fashion.
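
The sliding-window reordering can be sketched with heapq. This is a minimal illustration on in-memory dictionaries, not the function's actual code: reorder is a hypothetical name, and the ‘Time’ key is used as the timestamp for simplicity.

```python
import heapq
from itertools import count
from typing import Any, Dict, Iterable, Iterator, List, Optional, Tuple


def reorder(
    records: Iterable[Dict[str, Any]],
    max_time_offset: Optional[float] = None,
    time_key: str = "Time",
) -> Iterator[Dict[str, Any]]:
    """Yield records in non-decreasing time using a bounded heap window."""
    heap: List[Tuple[float, int, Dict[str, Any]]] = []
    tie = count()  # tie-breaker so equal times never compare the dicts
    current_max = float("-inf")
    for rec in records:
        t = rec[time_key]
        heapq.heappush(heap, (t, next(tie), rec))
        current_max = max(current_max, t)
        if max_time_offset is not None:
            # Emit everything older than the allowed lag behind the newest hit.
            while heap and heap[0][0] < current_max - max_time_offset:
                yield heapq.heappop(heap)[2]
    while heap:  # flush the remaining window (or everything if no offset)
        yield heapq.heappop(heap)[2]


hits = [{"Time": t} for t in (1.0, 3.0, 2.0, 6.0, 5.0, 9.0)]
windowed = [r["Time"] for r in reorder(hits, max_time_offset=2.0)]
full_sort = [r["Time"] for r in reorder(hits)]
```

With a window, the output is only guaranteed to be sorted if no hit arrives later than max_time_offset behind the newest timestamp seen so far; passing None trades memory for a full sort.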

sampiclyser.set_mplhep_style(style: str = 'CMS')

Submodules

sampiclyser.sampic_decoder module

class sampiclyser.sampic_decoder.SAMPIC_Run_Decoder(run_dir_path: Path)

Bases: object

Decode and process a complete SAMPIC run.

Provides a one-pass, memory-efficient workflow for:
  1. Reading raw SAMPIC binary files from a run directory.

  2. Extracting and decoding header metadata.

  3. Streaming hit records in fixed-size chunks.

  4. Writing decoded hits and metadata to Feather, Parquet, or ROOT formats.

The class preserves metadata both as raw bytes (for Arrow/Parquet) and as native Python types (for ROOT), and supports arbitrarily large files without loading everything into memory.

Variables:
  • run_base_path (pathlib.Path) – Path to the directory containing all binary files for one run.

  • run_header (SampicHeader) – Parsed header metadata for the current file being processed.

  • run_files (list[pathlib.Path]) – List of all SAMPIC binary files in run_base_path, in sort order.

decode_data(limit_hits: int = 0, feather_path: Path | None = None, parquet_path: Path | None = None, root_path: Path | None = None, root_tree: str = 'sampic_hits', extra_header_bytes: int = 1, chunk_size: int = 65536, batch_size: int = 100000) None[source]

Decode hit records from SAMPIC run files and export to Feather, Parquet, and/or ROOT.

This method streams parsed hit-record dictionaries (via parse_hit_records), accumulates them in batches to build a pandas DataFrame, and then writes each batch to the specified output formats. It never holds all records in memory at once.

Parameters:
  • limit_hits (int, optional) – Maximum number of hit records to process across all run files. A value of 0 (default) means “no limit” (process all hits).

  • feather_path (pathlib.Path or None, optional) – If not None, path to write the DataFrame in Feather format.

  • parquet_path (pathlib.Path or None, optional) – If not None, path to write the DataFrame in Parquet format.

  • root_path (pathlib.Path or None, optional) – If not None, path to write the DataFrame to a ROOT file.

  • root_tree (str, optional) – Name of the TTree inside the ROOT file (default: “sampic_hits”).

  • extra_header_bytes (int, optional) – Number of bytes to include after the detected header boundary (default: 1 to capture the trailing newline).

  • chunk_size (int, optional) – Byte size for each memory-mapped file read chunk (default: 64 KiB).

  • batch_size (int, optional) – Number of records to collect before flushing to output (default: 100 000).

Raises:

ValueError – If header parsing fails, or if writing to any format encounters missing or mismatched branch/column schemas.

Notes

  • Feather and Parquet outputs preserve exact column dtypes by casting before writing.

  • ROOT output is written via uproot using dict-of-NumPy-arrays (or mktree + extend) to ensure correct branch types.

  • Each batch is written immediately; the final partial batch is flushed at the end.
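
The batching behaviour described in these notes can be sketched in a few lines. The names write_in_batches and flush are hypothetical; in the real method the flush step would write to Feather, Parquet, or ROOT, which is not shown here.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List


def write_in_batches(
    records: Iterable[Dict[str, Any]],
    flush: Callable[[List[Dict[str, Any]]], None],
    batch_size: int = 100_000,
    limit_hits: int = 0,
) -> int:
    """Collect records into batches, flush each one, return the total count."""
    if limit_hits > 0:                 # 0 means "no limit"
        records = islice(records, limit_hits)
    batch: List[Dict[str, Any]] = []
    total = 0
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:   # full batch: write it immediately
            flush(batch)
            total += len(batch)
            batch = []
    if batch:                          # final partial batch
        flush(batch)
        total += len(batch)
    return total


flushed_sizes: List[int] = []
fake_hits = ({"HITNumber": i} for i in range(10))
n_written = write_in_batches(
    fake_hits, lambda b: flushed_sizes.append(len(b)), batch_size=4
)

limited_sizes: List[int] = []
n_limited = write_in_batches(
    ({"HITNumber": i} for i in range(10)),
    lambda b: limited_sizes.append(len(b)),
    batch_size=4,
    limit_hits=5,
)
```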

decode_sampic_header(header_bytes: bytes, keep_unparsed: bool = True) SampicHeader[source]

Parse raw header bytes into a SampicHeader instance.

The header consists of one or more lines; each line starts and ends with “===” and contains fields separated by “===”. Field syntax may vary (e.g. “key: value”, “key value”, or composite “part1 = x part2 = y”).

Parameters:
  • header_bytes (bytes) – Raw bytes of the header section, from file start up to the header end (inclusive of delimiters and any extra bytes).

  • keep_unparsed (bool, optional) – If True (default), any fields that are not recognized are stored in the extra dict of the returned SampicHeader; if False, they are discarded.

Returns:

SampicHeader – A dataclass containing all parsed header values and optionally any unrecognized fields in its extra attribute.

Raises:

ValueError – If the header_bytes cannot be decoded into valid text, or if required header fields are missing or malformed.

Notes

This method:
  1. Splits header_bytes on lines beginning/ending with “===”.

  2. For each field fragment, calls _parse_header_field.

  3. Collects any unparsed text in SampicHeader.extra.
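
The steps above can be sketched as follows. The two regular expressions are the ones listed for this class; the sample header text, the helper names, and the shape of the returned dict are assumptions for illustration and do not reproduce _parse_header_field.

```python
import re

TIMESTAMP_RE = re.compile(r'^UnixTime = (.+) date = (.+) time = (.+ms)')
FRONT_END_FPGA_RE = re.compile(
    r'^FRONT-END FPGA INDEX: (\d+) FIRMWARE VERSION (.+) BASELINE VALUE: ([\d\.]+)'
)


def split_header_fields(header_bytes: bytes) -> list:
    """Split each '==='-delimited header line into stripped field fragments."""
    text = header_bytes.decode("ascii")
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("===") and line.endswith("===")):
            continue  # not a header line
        fields.extend(f.strip() for f in line.split("===") if f.strip())
    return fields


def parse_fields(fields: list, keep_unparsed: bool = True) -> dict:
    """Match each fragment against the known patterns (hypothetical helper)."""
    parsed = {"front_end": [], "extra": []}
    for field in fields:
        m = TIMESTAMP_RE.match(field)
        if m:
            parsed["unix_time"] = float(m.group(1))
            continue
        m = FRONT_END_FPGA_RE.match(field)
        if m:
            parsed["front_end"].append(
                (int(m.group(1)), m.group(2), float(m.group(3)))
            )
            continue
        if keep_unparsed:
            parsed["extra"].append(field)
    return parsed


# Made-up header text; real SAMPIC headers may differ in detail.
sample = (
    b"=== UnixTime = 1700000000.5 date = 2023.11.14 time = 22h.13min.20s.500ms"
    b" === UNKNOWN FIELD 42 ===\n"
    b"=== FRONT-END FPGA INDEX: 0 FIRMWARE VERSION V1.2 BASELINE VALUE: 0.350 ===\n"
)
parsed = parse_fields(split_header_fields(sample))
```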

front_end_fpga_re = re.compile('^FRONT-END FPGA INDEX: (\\d+) FIRMWARE VERSION (.+) BASELINE VALUE: ([\\d\\.]+)')

open_sampic_file_in_chunks_and_get_header(file_path: Path, extra_header_bytes: int, chunk_size: int = 65536, debug: bool = False) Generator[Tuple[bytes, Generator[bytes, None, None]], None, None][source]

Memory-map a SAMPIC file, extract its header, and stream the remainder in chunks.

This context manager opens file_path in read-only mode, mmaps the entire file, and locates the header boundary as the last ‘=’ byte before the first 0x00. It returns the header (including extra_header_bytes) and a generator yielding the file body in chunk_size-byte blocks. On exit, both the file and the mmap are cleanly closed.

During this process, self.current_filesize is set to the size of the file.

Parameters:
  • file_path (pathlib.Path) – Path to the binary SAMPIC file to read.

  • extra_header_bytes (int) – Number of bytes to include after the header delimiter (=) in the returned header.

  • chunk_size (int, optional) – Size of each chunk (in bytes) produced by the body generator. Default is 64 KiB.

  • debug (bool, optional) – If True, print debugging information. Default is False.

Yields:
  • header_bytes (bytes) – The raw header bytes, from the file start up through the computed end.

  • body_gen (generator of bytes) – Generator yielding successive chunk_size-byte slices of the file body.

Raises:

ValueError – If the header delimiter cannot be located (i.e. no ‘=’ before the first 0x00), indicating a malformed file.
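
A minimal sketch of the header-boundary search, using only the standard library. The function name open_with_header is hypothetical, and treating a file with no 0x00 byte as "search the whole file" is an assumption that the documentation above does not confirm.

```python
import mmap
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def open_with_header(file_path, extra_header_bytes=1, chunk_size=65536):
    """Yield (header_bytes, body_generator) for a memory-mapped file."""
    with open(file_path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            first_nul = mm.find(b"\x00")
            if first_nul == -1:
                first_nul = len(mm)  # assumption: no NUL means search everything
            last_eq = mm.rfind(b"=", 0, first_nul)
            if last_eq == -1:
                raise ValueError("no '=' found before the first 0x00 byte")
            header_end = last_eq + 1 + extra_header_bytes
            header = mm[:header_end]  # slicing an mmap returns bytes

            def body():
                # Successive chunk_size-byte slices of the remainder.
                for start in range(header_end, len(mm), chunk_size):
                    yield mm[start:start + chunk_size]

            yield header, body()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fake_run.bin"
    path.write_bytes(b"=== HEADER ===\n" + b"\x00\x01\x02\x03\x04")
    with open_with_header(path, extra_header_bytes=1, chunk_size=2) as (header, body_gen):
        chunks = list(body_gen)  # consume while the mmap is still open
```

The body generator must be consumed inside the with block, because the mmap is closed on exit.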

parse_hit_records(limit_hits: int = 0, extra_header_bytes: int = 1, chunk_size: int = 65536) Generator[Dict[str, Any], None, None][source]

Stream and decode hit records from all files in the run.

This generator method opens each SAMPIC binary file in turn, extracts its header (via open_sampic_file_in_chunks_and_get_header), checks for header consistency across files, then streams the body in fixed-size chunks, parsing out complete hit records until either the file ends or limit_hits is reached.

Parameters:
  • limit_hits (int, optional) – Maximum number of hit records to yield across all files. A value of 0 (default) means no limit (process all hits).

  • extra_header_bytes (int, optional) – Number of bytes to include after the header delimiter when extracting the header (default is 1 to include the newline).

  • chunk_size (int, optional) – Size in bytes of each data chunk read from the body (default is 64 KiB). Larger chunks may be more efficient but use more memory.

Yields:

record (dict) – A mapping from field names (str) to parsed values (int, float, bool, list, etc.) for each hit record.

Raises:

ValueError – If header parsing fails or a file’s header does not match the previously parsed header (mismatched run files).

Notes

  • Uses a rolling buffer to accumulate bytes from the stream until a full record can be parsed by try_parse_record.

  • After parsing each record, advances the buffer and continues until all records are yielded or limit_hits is reached.
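
The rolling-buffer idea can be illustrated with a toy record layout. The 12-byte format below (uint32 hit number followed by a float64 time) is invented for the example and is not the SAMPIC record format; this try_parse_record is likewise a stand-in for the real one.

```python
import struct

# Hypothetical layout: little-endian uint32 hit number + float64 time.
RECORD = struct.Struct("<Id")


def try_parse_record(buf):
    """Return (record, bytes_consumed), or None if the buffer is too short."""
    if len(buf) < RECORD.size:
        return None
    hit, t = RECORD.unpack_from(buf, 0)
    return {"HITNumber": hit, "Time": t}, RECORD.size


def stream_records(chunks, limit_hits=0):
    """Yield records from an iterable of byte chunks of arbitrary size."""
    buf = bytearray()
    n = 0
    for chunk in chunks:
        buf.extend(chunk)          # accumulate bytes from the stream
        while True:
            parsed = try_parse_record(buf)
            if parsed is None:
                break              # need more data for a full record
            rec, used = parsed
            del buf[:used]         # advance the buffer past the record
            yield rec
            n += 1
            if limit_hits and n >= limit_hits:
                return


payload = b"".join(RECORD.pack(i, i * 0.5) for i in range(3))
# Deliberately split at boundaries that do not align with records.
chunks_in = [payload[i:i + 5] for i in range(0, len(payload), 5)]
records = list(stream_records(chunks_in))
limited = list(stream_records(chunks_in, limit_hits=2))
```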

prepare_header_metadata() Dict[bytes, bytes][source]

Pack run-header attributes into raw byte metadata for columnar files.

Generates a mapping of metadata keys to byte-encoded values suitable for Arrow/Parquet file schemas, preserving binary precision and type.

Returns:

metadata (dict of bytes → bytes) – Byte-to-byte mapping where:

  • Text fields (e.g. software_version) are ASCII-encoded.

  • timestamp is a little-endian 8-byte float (struct.pack(‘<d’, …)).

  • num_channels and enabled_channels_mask are little-endian 4-byte unsigned ints (struct.pack(‘<I’, …)).

  • Boolean flags (reduced_data_type, without_waveform, etc.) are stored as a single byte: b'\x00' for False, b'\x01' for True.

Notes

Keys are raw byte strings (e.g. b’software_version’), matching the Arrow metadata API expectations. This preserves full fidelity for programmatic reloading via decode_byte_metadata.
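
The byte encodings listed above can be reproduced with struct. The sketch below dispatches on the Python type of each value, which is a simplification; pack_metadata and the sample values are hypothetical.

```python
import struct
from datetime import datetime, timezone


def pack_metadata(meta):
    """Encode a str -> value mapping as a bytes -> bytes mapping."""
    packed = {}
    for key, value in meta.items():
        bkey = key.encode("ascii")
        if isinstance(value, bool):        # test bool first: bool is an int subclass
            packed[bkey] = b"\x01" if value else b"\x00"
        elif isinstance(value, int):
            packed[bkey] = struct.pack("<I", value)   # little-endian uint32
        elif isinstance(value, datetime):
            packed[bkey] = struct.pack("<d", value.timestamp())  # float64 seconds
        elif isinstance(value, str):
            packed[bkey] = value.encode("ascii")
        else:
            raise ValueError(f"unsupported type for {key!r}: {type(value)}")
    return packed


example = {
    "software_version": "3.5.0",  # made-up value
    "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "num_channels": 64,
    "without_waveform": True,
}
packed = pack_metadata(example)
```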

prepare_root_header_metadata() Dict[str, object][source]

Build a Python-native metadata dict for ROOT TTree output.

Collects all run-header fields into native Python types so they can be written directly as branches in a ROOT metadata TTree.

Returns:

metadata (dict of str → object) – Dictionary mapping metadata keys to Python values, including:

  • software_version : str

  • timestamp : datetime.datetime

  • sampic_mezzanine_board_version : str

  • num_channels : int

  • ctrl_fpga_firmware_version : str

  • sampling_frequency : str

  • enabled_channels_mask : int

  • reduced_data_type : bool

  • without_waveform : bool

  • tdc_like_files : bool

  • hit_number_format : str

  • unix_time_format : str

  • data_format : str

  • trigger_position_format : str

  • data_samples_format : str

  • inl_correction : bool

  • adc_correction : bool

Notes

All values are in their natural Python form (no byte-packing), ready for conversion to Awkward or NumPy arrays when writing via uproot.

timestamp_re = re.compile('^UnixTime = (.+) date = (.+) time = (.+ms)')

write_root_header(froot: WritableDirectory) None[source]

Embed run-header metadata into a ROOT file as a metadata TTree.

Converts the dict returned by prepare_root_header_metadata into Awkward arrays of strings and writes them as two branches (‘key’ and ‘value’) in a TTree named ‘metadata’. Existing metadata trees of the same name are overwritten.

Parameters:

froot (uproot.WritableDirectory) – An open ROOT file handle (from uproot.recreate or uproot.update) into which the metadata TTree will be written.

Returns:

None

Notes

  • Keys and values are both stored as variable-length strings using Awkward Arrays (ak.from_iter).

  • The resulting TTree will have two string branches:
    • key : metadata field names

    • value : metadata field values (all converted to str)

  • If a ‘metadata’ TTree already exists, it is replaced.
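
The key/value layout of the metadata TTree amounts to two parallel columns of strings. The sketch below shows only that conversion step with a hypothetical helper; turning the lists into Awkward arrays and writing them with uproot is left out.

```python
def metadata_to_key_value_columns(meta):
    """Return parallel lists of keys and stringified values."""
    keys = list(meta.keys())
    values = [str(v) for v in meta.values()]  # every value is converted to str
    return keys, values


keys, values = metadata_to_key_value_columns(
    {"software_version": "3.5.0", "num_channels": 64, "inl_correction": False}
)
```

Because every value is stored as text, a reader of the ROOT file has to convert numbers and booleans back from their string form.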

class sampiclyser.sampic_decoder.SampicHeader(software_version: str = '', timestamp: datetime | None = None, sampic_mezzanine_board_version: str = '', num_channels: int = 0, ctrl_fpga_firmware_version: str = '', front_end_fpga_firmware_version: List[str] = <factory>, front_end_fpga_baseline: List[float] = <factory>, sampling_frequency: str = '', enabled_channels_mask: int = 0, reduced_data_type: bool = False, without_waveform: bool = False, tdc_like_files: bool = True, hit_number_format: str = '', unix_time_format: str = '', data_format: str = '', trigger_position_format: str = '', data_samples_format: str = '', inl_correction: bool = False, adc_correction: bool = False, extra: dict[str, str] = <factory>)[source]

Bases: object

Parsed header metadata from a SAMPIC file.

Variables:
  • software_version (str) – Version of the SAMPIC DAQ software.

  • timestamp (datetime.datetime) – Run start timestamp as a Python datetime.

  • sampic_mezzanine_board_version (str) – Version identifier of the mezzanine board.

  • num_channels (int) – Total number of channels in this run.

  • ctrl_fpga_firmware_version (str) – Version of the control FPGA firmware.

  • front_end_fpga_firmware_version (list of str) – Firmware versions for each front-end FPGA.

  • front_end_fpga_baseline (list of float) – Baseline values for each front-end FPGA, affecting all associated ADC channels.

  • sampling_frequency (str) – System data acquisition sampling frequency specification.

  • enabled_channels_mask (int) – Bitmask indicating which channels were enabled.

  • reduced_data_type (bool) – Whether reduced-data format was used.

  • without_waveform (bool) – Whether waveform data were omitted.

  • tdc_like_files (bool) – Whether files are in TDC-like format.

  • hit_number_format (str) – Format string for hit numbering.

  • unix_time_format (str) – Format string for Unix timestamps.

  • data_format (str) – Format string for data values.

  • trigger_position_format (str) – Format string for trigger-position values.

  • data_samples_format (str) – Format string for the data-sample values.

  • inl_correction (bool) – Whether INL correction was applied.

  • adc_correction (bool) – Whether ADC correction was applied.

  • extra (dict of str → str) – Any unrecognized header fields (key/value both decoded as ASCII).

adc_correction: bool = False
ctrl_fpga_firmware_version: str = ''
data_format: str = ''
data_samples_format: str = ''
enabled_channels_mask: int = 0
extra: dict[str, str]
front_end_fpga_baseline: List[float]
front_end_fpga_firmware_version: List[str]
hit_number_format: str = ''
inl_correction: bool = False
num_channels: int = 0
reduced_data_type: bool = False
sampic_mezzanine_board_version: str = ''
sampling_frequency: str = ''
software_version: str = ''
tdc_like_files: bool = True
timestamp: datetime | None = None
trigger_position_format: str = ''
unix_time_format: str = ''
without_waveform: bool = False
sampiclyser.sampic_decoder.build_empty_root_data_with_schema(schemaInfo: ~typing.Dict[str, ~typing.Tuple] = {'ADCCounterLatched': ('int32', DataType(int32), <class 'numpy.int32'>), 'Amplitude': ('float32', DataType(float), <class 'numpy.float32'>), 'Baseline': ('float32', DataType(float), <class 'numpy.float32'>), 'Cell': ('int32', DataType(int32), <class 'numpy.int32'>), 'Channel': ('int32', DataType(int32), <class 'numpy.int32'>), 'DataSample': (None, ListType(list<item: float>), <class 'numpy.float32'>, 64), 'DataSize': ('int32', DataType(int32), <class 'numpy.int32'>), 'FPGATimeStamp': ('uint64', DataType(uint64), <class 'numpy.float64'>), 'HITNumber': ('int32', DataType(int32), <class 'numpy.int32'>), 'OrderedCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'PhysicalCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'RawPeak': ('float32', DataType(float), <class 'numpy.float32'>), 'RawTOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'StartOfADCRamp': ('int32', DataType(int32), <class 'numpy.int32'>), 'TOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'Time': ('float64', DataType(double), <class 'numpy.float64'>), 'TimeStampA': ('int32', DataType(int32), <class 'numpy.int32'>), 'TimeStampB': ('int32', DataType(int32), <class 'numpy.int32'>), 'TriggerPosition': (None, ListType(list<item: int32>), <class 'numpy.int32'>, 64), 'UnixTime': ('float64', DataType(double), <class 'numpy.float64'>)}) Dict[str, ndarray][source]

Construct an empty data dictionary for ROOT branches based on schema.

Parameters:

schemaInfo (dict of str -> tuple) – Mapping of column names to schema tuples of the form: (pandas_dtype, pyarrow_type, numpy_dtype, [optional array size]). The numpy_dtype at index 2 and optional array size at index 3 are used.

Returns:

dict – Mapping of column names to empty numpy.ndarray with correct shape and dtype.

Notes

  • Scalar fields produce 1D arrays of length zero.

  • Fixed-size array fields (length provided in schemaInfo[3]) produce 2D arrays with shape (0, size).

sampiclyser.sampic_decoder.build_schema(metadata: ~typing.Dict[bytes, bytes] | None = None, schemaInfo: ~typing.Dict[str, ~typing.Tuple] = {'ADCCounterLatched': ('int32', DataType(int32), <class 'numpy.int32'>), 'Amplitude': ('float32', DataType(float), <class 'numpy.float32'>), 'Baseline': ('float32', DataType(float), <class 'numpy.float32'>), 'Cell': ('int32', DataType(int32), <class 'numpy.int32'>), 'Channel': ('int32', DataType(int32), <class 'numpy.int32'>), 'DataSample': (None, ListType(list<item: float>), <class 'numpy.float32'>, 64), 'DataSize': ('int32', DataType(int32), <class 'numpy.int32'>), 'FPGATimeStamp': ('uint64', DataType(uint64), <class 'numpy.float64'>), 'HITNumber': ('int32', DataType(int32), <class 'numpy.int32'>), 'OrderedCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'PhysicalCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'RawPeak': ('float32', DataType(float), <class 'numpy.float32'>), 'RawTOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'StartOfADCRamp': ('int32', DataType(int32), <class 'numpy.int32'>), 'TOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'Time': ('float64', DataType(double), <class 'numpy.float64'>), 'TimeStampA': ('int32', DataType(int32), <class 'numpy.int32'>), 'TimeStampB': ('int32', DataType(int32), <class 'numpy.int32'>), 'TriggerPosition': (None, ListType(list<item: int32>), <class 'numpy.int32'>, 64), 'UnixTime': ('float64', DataType(double), <class 'numpy.float64'>)}) Schema[source]

Construct a PyArrow Schema from predefined field types and optional metadata.

This utility builds a fully-specified Arrow Schema by iterating over a schemaInfo mapping of field names to type definitions. It includes only those fields whose PyArrow type is non-null, and attaches any provided metadata as key/value byte pairs.

Parameters:
  • metadata (dict of bytes->bytes, optional) – Key/value metadata to attach to the schema (e.g. from file header). If None, no metadata will be set on the schema.

  • schemaInfo (dict of str->tuple) –

    Mapping from field name to a tuple defining:
    • pandas dtype string (unused here)

    • PyArrow DataType (used)

    • NumPy dtype for ROOT output

    • optional list/array size for array fields

    Only entries where the PyArrow DataType is not None are used.

Returns:

pa.Schema – A PyArrow Schema containing all enabled fields and attached metadata.

Raises:

KeyError – If any key in schemaInfo is missing its PyArrow type entry.

sampiclyser.sampic_decoder.convert_df_with_schema(df: ~pandas.core.frame.DataFrame, schemaInfo: ~typing.Dict[str, ~typing.Tuple] = {'ADCCounterLatched': ('int32', DataType(int32), <class 'numpy.int32'>), 'Amplitude': ('float32', DataType(float), <class 'numpy.float32'>), 'Baseline': ('float32', DataType(float), <class 'numpy.float32'>), 'Cell': ('int32', DataType(int32), <class 'numpy.int32'>), 'Channel': ('int32', DataType(int32), <class 'numpy.int32'>), 'DataSample': (None, ListType(list<item: float>), <class 'numpy.float32'>, 64), 'DataSize': ('int32', DataType(int32), <class 'numpy.int32'>), 'FPGATimeStamp': ('uint64', DataType(uint64), <class 'numpy.float64'>), 'HITNumber': ('int32', DataType(int32), <class 'numpy.int32'>), 'OrderedCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'PhysicalCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'RawPeak': ('float32', DataType(float), <class 'numpy.float32'>), 'RawTOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'StartOfADCRamp': ('int32', DataType(int32), <class 'numpy.int32'>), 'TOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'Time': ('float64', DataType(double), <class 'numpy.float64'>), 'TimeStampA': ('int32', DataType(int32), <class 'numpy.int32'>), 'TimeStampB': ('int32', DataType(int32), <class 'numpy.int32'>), 'TriggerPosition': (None, ListType(list<item: int32>), <class 'numpy.int32'>, 64), 'UnixTime': ('float64', DataType(double), <class 'numpy.float64'>)}) DataFrame[source]

Cast DataFrame columns to types defined in the schema information.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame whose columns should be converted in-place.

  • schemaInfo (dict of str -> tuple) – Mapping of column names to schema tuples of the form: (pandas_dtype, pyarrow_type, numpy_dtype, [optional array size]). Only the pandas_dtype at index 0 is used for conversion when not None.

Returns:

pandas.DataFrame – The same DataFrame with its columns cast to the specified pandas dtypes.

Notes

  • Columns not present in schemaInfo or with None pandas dtype are left unchanged.

  • Conversion is done in-place; the returned DataFrame is the same object.

sampiclyser.sampic_decoder.get_root_data_with_schema(df: ~pandas.core.frame.DataFrame, schemaInfo: ~typing.Dict[str, ~typing.Tuple] = {'ADCCounterLatched': ('int32', DataType(int32), <class 'numpy.int32'>), 'Amplitude': ('float32', DataType(float), <class 'numpy.float32'>), 'Baseline': ('float32', DataType(float), <class 'numpy.float32'>), 'Cell': ('int32', DataType(int32), <class 'numpy.int32'>), 'Channel': ('int32', DataType(int32), <class 'numpy.int32'>), 'DataSample': (None, ListType(list<item: float>), <class 'numpy.float32'>, 64), 'DataSize': ('int32', DataType(int32), <class 'numpy.int32'>), 'FPGATimeStamp': ('uint64', DataType(uint64), <class 'numpy.float64'>), 'HITNumber': ('int32', DataType(int32), <class 'numpy.int32'>), 'OrderedCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'PhysicalCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'RawPeak': ('float32', DataType(float), <class 'numpy.float32'>), 'RawTOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'StartOfADCRamp': ('int32', DataType(int32), <class 'numpy.int32'>), 'TOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'Time': ('float64', DataType(double), <class 'numpy.float64'>), 'TimeStampA': ('int32', DataType(int32), <class 'numpy.int32'>), 'TimeStampB': ('int32', DataType(int32), <class 'numpy.int32'>), 'TriggerPosition': (None, ListType(list<item: int32>), <class 'numpy.int32'>, 64), 'UnixTime': ('float64', DataType(double), <class 'numpy.float64'>)}) Dict[str, ndarray][source]

Prepare a dictionary of numpy arrays from a DataFrame for ROOT writing.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing fields defined in the schema.

  • schemaInfo (dict of str -> tuple) – Mapping of column names to schema tuples of the form: (pandas_dtype, pyarrow_type, numpy_dtype, [optional array size]). The numpy_dtype at index 2 is used for array construction when not None.

Returns:

dict – Mapping of column names to numpy.ndarray with appropriate dtype.

Raises:

ValueError – If conversion of any column fails due to incompatible data or dtype.

Notes

  • Only columns with a non-None numpy dtype in schemaInfo are included.

  • If a ValueError occurs during conversion, the offending column names are printed for debugging.

sampiclyser.sampic_decoder.prepare_header_metadata_in_bytes(metadata: Dict[str, Any]) Dict[bytes, bytes][source]

Convert a metadata dictionary of Python values to a bytes-to-bytes mapping suitable for Arrow.

This utility encodes string, datetime, integer, and boolean metadata values into raw bytes for storage in Feather or Parquet file metadata. Unsupported keys are skipped with a warning.

Parameters:

metadata (dict of str -> object) – Mapping of metadata keys to Python values. Supported keys and types:

  • Text fields (ASCII strings): ‘software_version’, ‘sampic_mezzanine_board_version’, ‘ctrl_fpga_firmware_version’, ‘sampling_frequency’, ‘hit_number_format’, ‘unix_time_format’, ‘data_format’, ‘trigger_position_format’, ‘data_samples_format’

  • Timestamp field (datetime.datetime): ‘timestamp’

  • Integer fields (int): ‘num_channels’, ‘enabled_channels_mask’

  • Boolean flags (bool): ‘reduced_data_type’, ‘without_waveform’, ‘tdc_like_files’, ‘inl_correction’, ‘adc_correction’

Returns:

dict of bytes -> bytes – Mapping of ASCII-encoded keys to packed byte values:

  • String values → ASCII-encoded bytes

  • Timestamp → little-endian IEEE-754 double of POSIX seconds

  • Integers → little-endian uint32

  • Booleans → single byte 0x01 (True) or 0x00 (False)

Raises:

ValueError – If a value has an unexpected type for a supported key.

Notes

  • Keys not recognized in the supported list are skipped with a printed warning.

  • Use this mapping as the schema.metadata for PyArrow Schema or as the metadata argument in Feather/Parquet writers.

sampiclyser.sampic_tools module

class sampiclyser.sampic_tools.TimestampedRecord(timestamp: float, record: Any)[source]

Container for a hit record with an associated timestamp, sortable by time.

Variables:
  • timestamp (float) – The hit timestamp (seconds since epoch or reconstructed).

  • record (Any) – The full hit data (e.g., dict of field values). Not used for ordering.
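
One common way to get a container that sorts by time while ignoring its payload is a dataclass with order=True and a non-compared field. The class below is a sketch under that assumption (the name TimestampedRecordSketch is made up) and may differ from how TimestampedRecord is actually defined.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any


@dataclass(order=True)
class TimestampedRecordSketch:
    timestamp: float
    record: Any = field(compare=False)  # payload is excluded from ordering


heap = []
for ts, rec in [(2.0, {"HITNumber": 7}), (1.0, {"HITNumber": 9}), (2.0, {"HITNumber": 3})]:
    heapq.heappush(heap, TimestampedRecordSketch(ts, rec))

# Dicts are not orderable, but equal timestamps are fine because only
# the timestamp field takes part in comparisons.
popped = [heapq.heappop(heap).timestamp for _ in range(3)]
```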

record: Any
timestamp: float

sampiclyser.sampic_tools.apply_interpolation_method(x_orig: ndarray, y_orig: ndarray, period: float, interpolation_method: str | None = 'sinc', interpolation_factor: int = 4, interpolation_parameter: int = 8, offset: float | None = None) Tuple[ndarray, ndarray][source]

Interpolate a uniformly sampled waveform using various methods.

Parameters:
  • x_orig (ndarray, shape (N,)) – Original sample times (must be monotonically increasing and uniformly spaced by period).

  • y_orig (ndarray, shape (N,)) – Original sample values.

  • period (float) – Time interval between consecutive samples in x_orig.

  • interpolation_method ({‘sinc’, ‘hann’, ‘hamming’, ‘lanczos’, ‘resample’, ‘resample_poly’}, optional) – Which method to use. Default is ‘sinc’.
    • ‘sinc’: ideal sinc interpolation (no window)

    • ‘hann’: windowed-sinc with Hann window

    • ‘hamming’: windowed-sinc with Hamming window

    • ‘lanczos’: Lanczos kernel of order interpolation_parameter

    • ‘resample’: FFT-based resample via scipy.signal.resample

    • ‘resample_poly’: polyphase FIR via scipy.signal.resample_poly

  • interpolation_factor (int, optional) – Upsampling factor: number of output points = len(x_orig) * interpolation_factor. Must be ≥ 1. Default is 4.

  • interpolation_parameter (int, optional) – Secondary parameter for certain methods. Ignored by ‘sinc’ and ‘resample’. Default is 8.
    • For ‘hann’/‘hamming’, this is the half-width M of the windowed-sinc.

    • For ‘lanczos’, this is the Lanczos order a.

    • For ‘resample_poly’, this is the FIR window’s beta (in a Kaiser window).

  • offset (float or None, optional) – Baseline offset to subtract before interpolation, then add back afterward. If None, treated as zero. Default is None.

Returns:

  • x_fine (ndarray, shape (N * interpolation_factor,)) – Uniformly spaced output time axis from x_orig[0] to x_orig[-1].

  • y_fine (ndarray, same shape as x_fine) – Interpolated sample values.

Raises:

ValueError –

  • If x_orig and y_orig have mismatched lengths or are not 1-D.

  • If interpolation_factor < 1.

  • If interpolation_method is unrecognized.
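
For orientation, unwindowed sinc (Whittaker-Shannon) interpolation can be written in pure Python as below. This is a sketch of the general technique with a hypothetical function name, not the package's NumPy implementation; the offset handling follows the parameter description above.

```python
import math


def sinc(u):
    """Normalized sinc: sin(pi*u) / (pi*u), with sinc(0) = 1."""
    return 1.0 if u == 0.0 else math.sin(math.pi * u) / (math.pi * u)


def sinc_interpolate(x_orig, y_orig, period, interpolation_factor=4, offset=None):
    """Upsample uniformly spaced samples by summing shifted sinc kernels."""
    if len(x_orig) != len(y_orig) or len(x_orig) < 2:
        raise ValueError("x_orig and y_orig must have equal length >= 2")
    if interpolation_factor < 1:
        raise ValueError("interpolation_factor must be >= 1")
    base = 0.0 if offset is None else offset
    n_out = len(x_orig) * interpolation_factor
    step = (x_orig[-1] - x_orig[0]) / (n_out - 1)
    x_fine = [x_orig[0] + i * step for i in range(n_out)]
    # Subtract the baseline, interpolate, then add the baseline back.
    y_fine = [
        base + sum((y - base) * sinc((t - x) / period) for x, y in zip(x_orig, y_orig))
        for t in x_fine
    ]
    return x_fine, y_fine


period = 0.1
x = [i * period for i in range(8)]
y = [math.sin(2 * math.pi * 1.0 * t) for t in x]  # 1 Hz tone sampled at 10 Hz
x_fine, y_fine = sinc_interpolate(x, y, period, interpolation_factor=4)
```

The cost is O(N × N × interpolation_factor), which is acceptable for short waveforms such as a few dozen samples.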

Examples

>>> import numpy as np
>>> from sampiclyser.sampic_tools import apply_interpolation_method
>>> t = np.linspace(0, 1, 11)
>>> y = np.sin(2*np.pi*5*t)
>>> t_fine, y_fine = apply_interpolation_method(t, y, t[1]-t[0], interpolation_method='sinc')

sampiclyser.sampic_tools.check_time_ordering(file_path: Path, use_unix_time: bool = False, find_all: bool = False, batch_size: int = 100000, root_tree: str = 'sampic_hits') List[Tuple[int, float, float]][source]

Verify that hit records in a SAMPIC output file are non-decreasing in time.

Streams through the file in memory-efficient batches, reconstructs or reads each hit’s timestamp, and checks for any out-of-order intervals.

Parameters:
  • file_path (pathlib.Path) – Path to the input data file (.parquet, .feather, or .root).

  • use_unix_time (bool, optional) – If True, use the ‘UnixTime’ column directly as the hit timestamp. Otherwise, apply a custom reconstruction algorithm (which must be implemented in _reconstruct_time). Default is False.

  • find_all (bool, optional) – If True, continue scanning the entire file and collect all out-of-order events; if False, stop at the first detection. Default is False.

  • batch_size (int, optional) – Number of rows to read per batch from open_hit_reader. Default is 100000.

  • root_tree (str, optional) – Name of the TTree inside a ROOT file (only used for .root). Default is “sampic_hits”.

Returns:

list of (hit_index, previous_time, current_time) – A list of tuples, one for each detected out-of-order event. If no violations are found, an empty list is returned. Each tuple contains:

  • hit_index: the zero-based index of the later (out-of-order) hit.

  • previous_time: the timestamp of the immediately preceding hit.

  • current_time: the timestamp of the out-of-order hit.

Raises:

ValueError – If use_unix_time is False and no reconstruction algorithm is provided in _reconstruct_time.
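
The core of the ordering check can be sketched on plain lists of timestamps. The helper find_out_of_order is hypothetical and works on in-memory batches; reading batches from Parquet, Feather, or ROOT files and reconstructing timestamps are not shown.

```python
def find_out_of_order(batches, find_all=False):
    """Return (hit_index, previous_time, current_time) for each violation."""
    violations = []
    previous = None   # timestamp of the immediately preceding hit
    index = 0         # running hit index across all batches
    for batch in batches:
        for current in batch:
            if previous is not None and current < previous:
                violations.append((index, previous, current))
                if not find_all:
                    return violations  # stop at the first detection
            previous = current
            index += 1
    return violations


batches = [[1.0, 2.0, 1.5], [3.0, 2.5, 4.0]]
first_only = find_out_of_order(batches)
all_found = find_out_of_order(batches, find_all=True)
in_order = find_out_of_order([[1.0, 1.0, 2.0]])
```

Carrying previous across batch boundaries is what lets the check catch a violation between the last hit of one batch and the first hit of the next.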

sampiclyser.sampic_tools.decode_byte_metadata(byte_metadata: dict[bytes, bytes]) dict[str, object][source]

Decode raw byte-to-byte metadata into native Python types.

Parameters:

byte_metadata (dict of bytes → bytes) – Mapping of raw metadata keys and values as read from an Arrow or Parquet file. Keys and values are both byte strings.

Returns:

metadata (dict of str → object) – Decoded metadata where each key is ASCII-decoded, and each value is converted according to its semantic type:

  • str: Version/info fields such as software_version, sampic_mezzanine_board_version, ctrl_fpga_firmware_version, sampling_frequency, hit_number_format, etc.

  • datetime.datetime: The timestamp field, unpacked from a little-endian float64.

  • int: Numeric fields such as num_channels and enabled_channels_mask, unpacked from little-endian uint32.

  • bool: Flag fields such as reduced_data_type, without_waveform, tdc_like_files, inl_correction, and adc_correction, where a single zero byte means False and any other byte means True.

Raises:
  • KeyError – If a required metadata key is missing from the input dictionary.

  • struct.error – If unpacking a numeric or timestamp field fails due to incorrect byte length.

Notes

  • Entries whose keys decode to 'ARROW:schema' or 'pandas' are ignored.

  • Unrecognized keys are still included in the output under their ASCII-decoded name, with the byte value left unchanged.
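
A sketch of the decoding rules described above, using struct. The function name decode_metadata and the sample input are hypothetical, and interpreting the timestamp as UTC is an assumption that the documentation does not state.

```python
import struct
from datetime import datetime, timezone

TEXT_KEYS = {
    "software_version", "sampic_mezzanine_board_version",
    "ctrl_fpga_firmware_version", "sampling_frequency", "hit_number_format",
    "unix_time_format", "data_format", "trigger_position_format",
    "data_samples_format",
}
INT_KEYS = {"num_channels", "enabled_channels_mask"}
BOOL_KEYS = {
    "reduced_data_type", "without_waveform", "tdc_like_files",
    "inl_correction", "adc_correction",
}
SKIPPED_KEYS = {"ARROW:schema", "pandas"}


def decode_metadata(byte_metadata):
    """Decode a bytes -> bytes mapping into native Python values."""
    decoded = {}
    for bkey, bval in byte_metadata.items():
        key = bkey.decode("ascii")
        if key in SKIPPED_KEYS:
            continue
        if key == "timestamp":
            (seconds,) = struct.unpack("<d", bval)
            # Assumption: treat the stored POSIX seconds as UTC.
            decoded[key] = datetime.fromtimestamp(seconds, tz=timezone.utc)
        elif key in INT_KEYS:
            decoded[key] = struct.unpack("<I", bval)[0]
        elif key in BOOL_KEYS:
            decoded[key] = bval != b"\x00"  # zero byte is False, anything else True
        elif key in TEXT_KEYS:
            decoded[key] = bval.decode("ascii")
        else:
            decoded[key] = bval             # unrecognized: value left as bytes
    return decoded


raw = {
    b"software_version": b"3.5.0",          # made-up value
    b"timestamp": struct.pack("<d", 1704067200.0),
    b"num_channels": struct.pack("<I", 64),
    b"tdc_like_files": b"\x01",
    b"pandas": b"{}",
    b"custom": b"\xff",
}
decoded = decode_metadata(raw)
```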

sampiclyser.sampic_tools.extract_SAMPIC_time_and_record(batch: RecordBatch | Array, idx: int) Tuple[float, Dict[str, Any]][source]

Reconstruct SAMPIC hit timestamp and return full record from a batch.

Builds a dict of all fields for the specified hit and uses sampic_reconstruct_time_dict to compute its timestamp.

Parameters:
  • batch (RecordBatch or awkward.highlevel.Array) – A batch of hit records containing all fields required for reconstruction.

  • idx (int) – Zero-based index within the batch to extract.

Returns:

  • ts (float) – The reconstructed timestamp (in seconds) for the hit.

  • rec (dict) – Mapping of field names to their Python values for that hit.

Raises:
  • KeyError – If any required field is missing from the batch.

  • IndexError – If idx is out of bounds for the batch.

  • TypeError – If batch is neither a RecordBatch nor an Awkward Array.

  • ValueError – If sampic_reconstruct_time_dict(rec) fails or returns a non-numeric value.

sampiclyser.sampic_tools.extract_ts_SAMPIC(batch: RecordBatch | Array, idx: int) float[source]

Reconstruct a hit timestamp using SAMPIC-specific logic from a batch record.

This utility assembles all fields of a single hit record into a dictionary and applies the sampic_reconstruct_time_dict function to compute the timestamp. Supports both PyArrow RecordBatch and Awkward Array as batch inputs.

Parameters:
  • batch (RecordBatch or awkward.highlevel.Array) – A batch of hit records containing all necessary fields for time reconstruction.

  • idx (int) – Zero-based index of the hit within the batch whose timestamp to reconstruct.

Returns:

float – The reconstructed hit timestamp (in seconds) as computed by SAMPIC logic.

Raises:
  • KeyError – If a required field is missing from the batch.

  • TypeError – If batch is neither a RecordBatch nor an Awkward Array.

  • IndexError – If idx is out of range for the provided batch.

  • ValueError – If sampic_reconstruct_time_dict fails or returns an invalid value.

sampiclyser.sampic_tools.extract_ts_unix_time(batch: RecordBatch | Array, idx: int) float[source]

Extract a single hit timestamp from the ‘UnixTime’ column in a batch.

This utility handles both PyArrow RecordBatch and Awkward Array inputs, returning the timestamp as a native Python float.

Parameters:
  • batch (RecordBatch or awkward.highlevel.Array) – A batch of hit records containing a ‘UnixTime’ field.

  • idx (int) – Zero-based index of the hit within the batch from which to extract the timestamp.

Returns:

float – The Unix timestamp (in seconds) of the specified hit.

Raises:
  • KeyError – If the ‘UnixTime’ column is not present in the batch.

  • TypeError – If batch is not a supported type (neither RecordBatch nor Awkward Array).

  • IndexError – If idx is out of bounds for the batch.

sampiclyser.sampic_tools.extract_unix_time_and_record(batch: RecordBatch | Array, idx: int) Tuple[float, Dict[str, Any]][source]

Extract the UnixTime timestamp and full record from a batch at a given index.

This utility supports both PyArrow RecordBatch and Awkward Array inputs. It builds a dictionary of all fields for the specified hit and returns its UnixTime timestamp along with the record dict.

Parameters:
  • batch (RecordBatch or awkward.highlevel.Array) – A batch of hit records containing a ‘UnixTime’ field among others.

  • idx (int) – Zero-based index within the batch to extract.

Returns:

  • ts (float) – The UnixTime timestamp (in seconds) of the hit.

  • rec (dict) – Mapping of field names to their Python values for that hit.

Raises:
  • KeyError – If ‘UnixTime’ or any other field is missing from the batch.

  • IndexError – If idx is out of bounds for the batch.

  • TypeError – If batch is neither a RecordBatch nor an Awkward Array.

sampiclyser.sampic_tools.finalize_waveform_legend(ax: Axes, label_mode: Literal['channel', 'hit', 'both', 'none'], plot_sample_types: bool, plot_buffer_start: bool, explicit_labels: bool) None[source]

Clean up and draw one or two legends for waveform plots.

When only channel labels are shown (and not individual hits), collapses duplicate “Channel N” entries, sorts them numerically, and places them in the main legend. Optionally, if plot_sample_types is True and explicit_labels is False, a secondary legend is drawn for the sample-type markers (buffer-start, hit-samples, trigger).

Parameters:
  • ax (matplotlib.axes.Axes) – The axes containing plotted lines and scatters.

  • label_mode (Literal) – Controls how the waveforms are labelled:

      - ‘channel’: label waveforms with the channel number
      - ‘hit’: label waveforms with the hit id
      - ‘both’: label waveforms with both channel number and hit id
      - ‘none’: do not label waveforms

    Duplicate “Channel N” entries are merged and sorted only when label_mode is ‘channel’, i.e. when channel labels are shown without hit labels.

  • plot_sample_types (bool) – Whether the plot used separate markers for hit-samples and triggers. If True and explicit_labels is False, a second legend is drawn explaining those marker types.

  • plot_buffer_start (bool) – Whether the plot included a special “buffer start” marker. If so, that entry is included in the secondary legend.

  • explicit_labels (bool) – If True, assume that all scatter calls already set their own labels, and do not auto-generate a secondary legend for sample types.

Returns:

None – This function operates in-place on the Axes, adding one or two legends.

Raises:

IndexError – If legend handle/label extraction finds no entries when sorting.
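
The channel-only collapse described above can be sketched without Matplotlib. The helper below is hypothetical (its name and return shape are assumptions, not part of sampiclyser): it takes the parallel handle and label lists that Axes.get_legend_handles_labels() returns, keeps the first handle for each label, and sorts “Channel N” entries by N. The library's own implementation may differ.

```python
import re


def collapse_channel_labels(handles, labels):
    """Merge duplicate 'Channel N' legend entries and sort them by N.

    Hypothetical helper for illustration only; `handles` and `labels` are
    the parallel lists returned by Axes.get_legend_handles_labels().
    """
    first_handle = {}
    for handle, label in zip(handles, labels):
        # Keep only the first handle seen for each distinct label.
        first_handle.setdefault(label, handle)

    def channel_number(label):
        match = re.search(r"(\d+)\s*$", label)
        # Labels without a trailing number sort after all numbered ones.
        return int(match.group(1)) if match else float("inf")

    ordered = sorted(first_handle, key=channel_number)
    return [first_handle[label] for label in ordered], ordered
```

The two returned lists could then be passed to ax.legend(handles, labels) to draw the main legend.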

sampiclyser.sampic_tools.get_channel_hits(file_path: Path, batch_size: int = 100000, root_tree: str = 'sampic_hits') DataFrame[source]

Compute per-channel hit counts by streaming only the ‘Channel’ column.

Supports Feather, Parquet, or ROOT (.root) files written by the Sampic decoder. Reads data in batches (to bound memory use) and tallies the number of rows (hits) observed on each channel.

Parameters:
  • file_path (pathlib.Path) – Path to the input data file. Must have suffix .feather, .parquet, or .root.

  • batch_size (int, optional) – Number of entries to read per iteration (default: 100000).

  • root_tree (str, optional) – Name of the TTree inside the ROOT file to read (only used if file_path is .root; default: “sampic_hits”).

Returns:

pandas.DataFrame – A DataFrame with two columns:

  • Channel (int): channel identifier

  • Hits (int): total number of hits on that channel

Rows are sorted by increasing Channel.

Raises:

ValueError – If the file suffix is not one of .feather, .parquet, or .root.
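
The batch-wise tally can be illustrated with plain Python. This is a minimal sketch, assuming each batch has already been reduced to an iterable of channel numbers; the function name is hypothetical and the real function returns a pandas DataFrame rather than a list of tuples.

```python
from collections import Counter


def count_channel_hits(channel_batches):
    """Tally hits per channel from an iterable of batches.

    Each batch is an iterable of channel numbers (a stand-in for the
    'Channel' column of one batch). Only the running counts are kept in
    memory, never the batches themselves.
    """
    counts = Counter()
    for batch in channel_batches:
        counts.update(batch)
    # Sorted by increasing channel, mirroring the documented row order.
    return sorted(counts.items())
```

Because a generator can be passed as channel_batches, memory use stays bounded by the batch size plus one counter entry per channel.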

sampiclyser.sampic_tools.get_file_metadata(file_path: Path) dict[str, object][source]

Load metadata from a SAMPIC output file, selecting the appropriate reader.

This function examines the file extension of file_path and invokes the corresponding metadata decoder:

  • Parquet (.parquet, .pq): uses pyarrow.parquet metadata and decode_byte_metadata for byte-to-type conversion.

  • Feather (.feather): uses pyarrow.ipc schema metadata and decode_byte_metadata.

  • ROOT (.root): uses uproot to read a ‘metadata’ TTree via load_root_metadata.

Parameters:

file_path (pathlib.Path) – Path to the input file whose metadata to extract. Supported suffixes are .parquet, .pq, .feather, and .root.

Returns:

metadata (dict of str → object) – Dictionary of metadata fields mapped to native Python values, where each value may be one of:

  • str For textual fields (software versions, format strings).

  • int For numeric fields (e.g. num_channels, masks).

  • bool For flag fields (reduced_data_type, etc.).

  • datetime.datetime For timestamp fields.

Raises:

ValueError – If file_path has an unsupported suffix or if metadata loading fails for any reason.
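
The suffix-based selection can be sketched as a small lookup. The names below are hypothetical, and lower-casing the suffix is an assumption of this sketch; the text above does not say whether the library treats suffixes case-insensitively.

```python
from pathlib import Path

# Suffix -> reader family, following the list of supported suffixes above.
READER_BY_SUFFIX = {
    ".parquet": "parquet",
    ".pq": "parquet",
    ".feather": "feather",
    ".root": "root",
}


def pick_metadata_reader(file_path):
    """Return which reader family would handle `file_path` (illustrative)."""
    suffix = Path(file_path).suffix.lower()  # assumption: case-insensitive
    try:
        return READER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file suffix: {suffix!r}") from None
```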

sampiclyser.sampic_tools.get_period_from_file_metadata(metadata: dict[str, object]) float[source]

Compute the sampling period (seconds per sample) from file metadata.

Parameters:

metadata (dict of str → object) – Dictionary of file metadata. Must contain the key 'sampling_frequency' whose value is a string of the form "<freq> <unit>", where:

  • <freq> is a floating-point number (e.g. “5.0”)

  • <unit> is either “MS/s” (megasamples per second) or “kS/s” (kilosamples per second).

Returns:

period (float) – Time interval between consecutive samples, in seconds.

Raises:
  • KeyError – If the metadata dict does not include the “sampling_frequency” key.

  • RuntimeError – If the unit parsed from “sampling_frequency” is not one of “MS/s” or “kS/s”.

Examples

>>> meta = {"sampling_frequency": "5 MS/s"}
>>> get_period_from_file_metadata(meta)
2e-07
>>> meta = {"sampling_frequency": "10 kS/s"}
>>> get_period_from_file_metadata(meta)
0.0001
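
The conversion itself is short enough to sketch. The following is an illustrative re-implementation under the documented string format, not the library source; the function name and the unit table are assumptions.

```python
# Samples per second for each documented unit.
UNIT_TO_HZ = {"MS/s": 1e6, "kS/s": 1e3}


def period_from_metadata(metadata):
    """Return seconds per sample from a '<freq> <unit>' string."""
    # A missing key raises KeyError, as documented for the real function.
    freq_text, unit = metadata["sampling_frequency"].split()
    if unit not in UNIT_TO_HZ:
        raise RuntimeError(f"Unknown sampling-frequency unit: {unit!r}")
    return 1.0 / (float(freq_text) * UNIT_TO_HZ[unit])
```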

sampiclyser.sampic_tools.lanczos_interpolation(t_orig: ndarray, y_orig: ndarray, t_new: ndarray, a: int = 3) ndarray[source]

Interpolate a uniformly sampled signal using the Lanczos kernel.

The Lanczos filter uses a windowed sinc kernel of order a: L(x) = sinc(x) · sinc(x / a) for |x| ≤ a, zero otherwise. This yields near-ideal bandlimited interpolation with reduced ringing.

Parameters:
  • t_orig (ndarray of float, shape (N,)) – Original, uniformly spaced sample times.

  • y_orig (ndarray of float, shape (N,)) – Original sample values.

  • t_new (ndarray of float, shape (M,)) – Desired output times (must lie within the range of t_orig).

  • a (int, optional) – Lanczos order (kernel half-width in samples). Common choices are 2 or 3. Default is 3.

Returns:

y_new (ndarray of float, shape (M,)) – Interpolated values at t_new.

Raises:

ValueError – If t_orig and y_orig are not 1D arrays of equal length, or if t_new lies outside the range [t_orig.min(), t_orig.max()], or if a is not a positive integer.

Notes

  • Assumes uniform spacing in t_orig. If spacing varies, results will be invalid.

  • For each t_new[i], the kernel covers indices k from ⌈(t_new[i]-t_orig[0])/T⌉ - a + 1 to ⌊(t_new[i]-t_orig[0])/T⌋ + a, where T = t_orig[1] - t_orig[0].

  • Out-of-bounds sample indices are clipped to the valid range.

  • The kernel is defined as: L_n = sinc((t_new[i]-t_orig[n])/T) · sinc((t_new[i]-t_orig[n])/(a·T)), and y_new[i] = Σ_n y_orig[n] · L_n.

Examples

>>> import numpy as np
>>> t = np.linspace(0, 1, 11)
>>> y = np.sin(2*np.pi*5*t)
>>> t_fine = np.linspace(0, 1, 101)
>>> y_fine = lanczos_interpolation(t, y, t_fine, a=3)
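
To make the kernel sum concrete, here is a dependency-free sketch of Lanczos interpolation on plain lists. It is not the library implementation: the function names are hypothetical, the window is taken as floor(x) - a + 1 to floor(x) + a (a common convention that may differ slightly from the bounds quoted in the notes), and no input validation is done.

```python
import math


def sinc(x):
    """Normalized sinc, sin(pi x) / (pi x), with sinc(0) = 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)


def lanczos_kernel(x, a=3):
    """L(x) = sinc(x) * sinc(x / a) inside the window, zero outside."""
    if abs(x) >= a:  # the kernel is already zero at |x| == a
        return 0.0
    return sinc(x) * sinc(x / a)


def lanczos_sketch(t_orig, y_orig, t_new, a=3):
    """Interpolate uniformly sampled (t_orig, y_orig) at the times t_new."""
    step = t_orig[1] - t_orig[0]  # uniform sample spacing T
    last = len(t_orig) - 1
    result = []
    for t in t_new:
        x = (t - t_orig[0]) / step  # position in units of samples
        base = math.floor(x)
        total = 0.0
        for k in range(base - a + 1, base + a + 1):
            clipped = min(max(k, 0), last)  # clip out-of-bounds indices
            total += y_orig[clipped] * lanczos_kernel(x - k, a)
        result.append(total)
    return result
```

At the original sample times the kernel is 1 for the matching sample and (numerically) 0 for its neighbours, so the sketch reproduces y_orig there.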

sampiclyser.sampic_tools.load_root_metadata(file_path: str) dict[str, object][source]

Read metadata from a ‘metadata’ TTree in a ROOT file and decode to Python types.

Parameters:

file_path (str) – Filesystem path to the ROOT file containing a TTree named ‘metadata’ with two branches: ‘key’ and ‘value’. Both branches should contain strings.

Returns:

metadata (dict of str → object) – Dictionary mapping each metadata key to a Python value, converted as follows:

  • datetime.datetime If the key is ‘timestamp’, the string is parsed via datetime.datetime.fromisoformat.

  • int For ‘num_channels’ and ‘enabled_channels_mask’, the string is cast to int.

  • bool For flags (‘reduced_data_type’, ‘without_waveform’, ‘tdc_like_files’, ‘inl_correction’, ‘adc_correction’), the string ‘False’ → False, all other values → True.

  • str All other entries are left as Python strings.

Raises:
  • KeyError – If the TTree ‘metadata’ or the branches ‘key’/‘value’ are not found.

  • ValueError – If a timestamp string cannot be parsed by fromisoformat, or if an integer conversion fails.

Notes

  • This function uses uproot.open to read the ROOT file in read-only mode.

  • It expects the metadata tree to have exactly two branches, ‘key’ and ‘value’, both containing arrays of equal length.
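
The conversion rules listed under Returns can be sketched independently of ROOT. The function below is a hypothetical illustration that starts from two already-loaded lists of strings (the ‘key’ and ‘value’ branches); reading them with uproot is left out.

```python
from datetime import datetime

# Key groups taken from the conversion rules documented above.
INT_KEYS = {"num_channels", "enabled_channels_mask"}
BOOL_KEYS = {
    "reduced_data_type",
    "without_waveform",
    "tdc_like_files",
    "inl_correction",
    "adc_correction",
}


def decode_metadata_pairs(keys, values):
    """Convert parallel key/value string lists into native Python values."""
    decoded = {}
    for key, value in zip(keys, values):
        if key == "timestamp":
            decoded[key] = datetime.fromisoformat(value)  # may raise ValueError
        elif key in INT_KEYS:
            decoded[key] = int(value)  # may raise ValueError
        elif key in BOOL_KEYS:
            decoded[key] = value != "False"  # only 'False' maps to False
        else:
            decoded[key] = value  # everything else stays a string
    return decoded
```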

sampiclyser.sampic_tools.open_hit_reader(file_path: Path, cols: Sequence[str], batch_size: int = 100000, root_tree: str = 'sampic_hits') Iterator[RecordBatch | Array][source]

Stream selected columns from a SAMPIC output file in memory-efficient batches.

This function reads only the specified columns from large data files (Parquet, Feather, or ROOT) in fixed-size batches, yielding either Arrow RecordBatches (for Parquet/Feather) or dictionaries of NumPy arrays (for ROOT).

Parameters:
  • file_path (pathlib.Path) – Path to the input file. Supported extensions:

      - .parquet or .pq for Parquet
      - .feather for Arrow Feather IPC
      - .root for ROOT files containing a TTree named root_tree

  • cols (sequence of str) – Names of the columns (or ROOT branches) to read.

  • batch_size (int, optional) – Maximum number of rows/entries per yielded batch (default: 100000).

  • root_tree (str, optional) – Name of the ROOT TTree to read from (default: “sampic_hits”).

Yields:

RecordBatch or dict

  • For Parquet/Feather: pyarrow.RecordBatch containing the requested columns.

  • For ROOT: dict mapping branch names to NumPy arrays for each batch.

Raises:

ValueError – If the file extension is not among the supported types.

sampiclyser.sampic_tools.ordinal(n: int) str[source]

Convert an integer to its English ordinal string (e.g., 1 → “1st”). Adapted, with minor tweaks, from https://stackoverflow.com/a/20007730.

Parameters:

n (int) – The integer to convert.

Returns:

str – The integer followed by its ordinal suffix: “st” for numbers ending in 1, “nd” for numbers ending in 2, “rd” for numbers ending in 3, and “th” otherwise. Special cases 11, 12, and 13 all use “th”.

Notes

  • English ordinals use “th” for the teens (11, 12, 13), even though they end in 1-3.

  • For all other numbers, the suffix is chosen by the last digit: 1 → “st”, 2 → “nd”, 3 → “rd”, otherwise “th”.

  • This simple list-based lookup (with min(n % 10, 4)) is a common Python recipe.
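
The list-based lookup mentioned in the notes can be written as follows. This is a sketch of the common recipe, not necessarily the exact code in sampiclyser; the function name is chosen to avoid implying otherwise.

```python
def ordinal_sketch(n):
    """Return n followed by its English ordinal suffix."""
    if 11 <= (n % 100) <= 13:
        # The teens take 'th' even though they end in 1, 2 or 3.
        suffix = "th"
    else:
        # Index 0 and indices >= 4 both map to 'th'.
        suffix = ["th", "st", "nd", "rd", "th"][min(n % 10, 4)]
    return f"{n}{suffix}"
```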

sampiclyser.sampic_tools.plot_channel_hit_rate(file_path: Path, channel: int = 0, bin_size: float = 1.0, batch_size: int = 100000, plot_hits: bool = False, start_time: datetime | float | None = None, end_time: datetime | float | None = None, root_tree: str = 'sampic_hits', scale_factor: float = 1.0, label: str = 'PPS', log_y: bool = False, figsize: tuple[float, float] = (6, 4), rlabel: str = '(13 TeV)', is_data: bool = True, color='C0', title: str | None = None) Figure[source]

Plot the hit rate (or raw hits) as a function of time from large data files.

Streams the “UnixTime” column in batches from a Feather, Parquet, or ROOT file, bins events into fixed-width time intervals, and renders a CMS-style time series.

Parameters:
  • file_path (pathlib.Path) – Path to the input data file; supported suffixes are .feather, .parquet, .pq, and .root.

  • channel (int, optional) – The SAMPIC channel to plot (default: 0).

  • bin_size (float, optional) – Width of each time bin in seconds; values below 0.1 are rounded up to 0.1 (default: 1.0).

  • batch_size (int, optional) – Number of entries to read per I/O batch (default: 100000).

  • plot_hits (bool, optional) – If True, plot the raw count per bin; otherwise plot the rate (count divided by bin_size) (default: False).

  • start_time (datetime.datetime, float, or None, optional) – Start of the time window for plotting, as a datetime or UNIX timestamp. If None, uses the file’s “start_of_run” metadata. Aligned to the nearest lower multiple of bin_size (default: None).

  • end_time (datetime.datetime, float, or None, optional) – End of the time window for plotting, as a datetime or UNIX timestamp. If None, determined from the data. Aligned to the nearest upper multiple of bin_size (default: None).

  • root_tree (str, optional) – Name of the TTree in a ROOT file (only used if file_path ends in .root; default: “sampic_hits”).

  • scale_factor (float, optional) – Multiplier applied to each bin’s count (e.g. to account for central trigger multiplicity) before plotting (default: 1.0).

  • label (str, optional) – Experiment label (default: “PPS”).

  • log_y (bool, optional) – If True, use a logarithmic y-axis (default: False).

  • figsize (tuple of float, optional) – Figure size in inches as (width, height) (default: (6, 4)).

  • rlabel (str, optional) – Additional right-hand label (e.g. collision energy) (default: “(13 TeV)”).

  • is_data (bool, optional) – If True, annotate plots as “Data”; if False, annotate as “Simulation” (default: True).

  • color (color spec, optional) – Matplotlib color for the line or bars (default: “C0”).

  • title (str or None, optional) – Main title for the figure; if None, no title is drawn (default: None).

Returns:

fig (matplotlib.figure.Figure) – Figure object containing the hit-rate (or hit-count) vs. time plot, styled according to CMS conventions.

Raises:

ValueError – If file_path has an unsupported suffix.

Notes

  • Time bins are computed as floor((t - t0)/bin_size) indices, then shifted back to absolute times for plotting.

  • X-axis tick formatting uses Matplotlib’s AutoDateLocator and AutoDateFormatter for sensible date/time labels across variable spans.
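
The binning rule in the notes can be sketched on a plain list of timestamps. Everything below is illustrative: the function name is hypothetical, the earliest timestamp stands in for the “start_of_run” metadata when start_time is None, and empty bins are simply omitted from the result.

```python
import math
from collections import Counter

MIN_BIN_SIZE = 0.1  # smaller bin sizes are rounded up to this value


def bin_hit_times(times, bin_size=1.0, start_time=None,
                  plot_hits=False, scale_factor=1.0):
    """Map bin start time -> scaled hit count (or rate) for that bin."""
    bin_size = max(bin_size, MIN_BIN_SIZE)
    if start_time is None:
        start_time = min(times)  # assumption: stand-in for run metadata
    # Align the window start to the nearest lower multiple of bin_size.
    t0 = math.floor(start_time / bin_size) * bin_size

    counts = Counter()
    for t in times:
        index = math.floor((t - t0) / bin_size)
        if index >= 0:  # ignore hits before the window start
            counts[index] += 1

    binned = {}
    for index in sorted(counts):
        value = counts[index] * scale_factor
        if not plot_hits:
            value /= bin_size  # convert a count into a rate
        binned[t0 + index * bin_size] = value  # shift back to absolute time
    return binned
```

For example, with bin_size=2.0 the timestamps 10.2, 10.7, 11.1 and 13.9 fall into bins starting at 10.0 (three hits) and 12.0 (one hit), giving rates of 1.5 and 0.5 hits per second.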

sampiclyser.sampic_tools.plot_channel_hits(df: DataFrame, first_channel: int, last_channel: int, label: str = 'PPS', log_y: bool = False, figsize: tuple[float, float] = (6, 4), rlabel: str = '(13 TeV)', is_data: bool = True, color='C0', title: str | None = None) Figure[source]

Draw a CMS-style bar histogram of hit counts per channel.

Parameters:
  • df (pandas.DataFrame) – Summary table with two columns:

      - Channel (int): channel indices
      - Hits (int): hit counts per channel

  • first_channel (int) – Lowest channel index to include on the x-axis.

  • last_channel (int) – Highest channel index to include on the x-axis.

  • label (str, optional) – Text label for the experiment (default: “PPS”).

  • log_y (bool, optional) – If True, use a logarithmic y-axis (default: False).

  • figsize (tuple of float, optional) – Figure size in inches as (width, height) (default: (6, 4)).

  • rlabel (str, optional) – Right-hand text label, typically collision energy (default: “(13 TeV)”).

  • is_data (bool, optional) – If True, annotate the plot as “Data”; if False, annotate as “Simulation” (default: True).

  • color (any, optional) – Matplotlib color spec for the bars (default: “C0”).

  • title (str or None, optional) – Main title displayed above the axes; if None, no title is shown.

Returns:

matplotlib.figure.Figure – The Figure object containing the histogram.

Raises:

ValueError – If last_channel is less than first_channel.

Notes

  • Channels missing from df are shown with zero hits.

  • In linear mode, y-axis tick labels are formatted in uppercase scientific notation (e.g. “4.0E6”).

  • The plot uses mplhep.style.* with label and rlabel positioned according to respective styling conventions.

  • The is_data flag controls the “Data” vs. “Simulation” annotation.
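
The zero-filling of missing channels can be illustrated with a plain mapping in place of the DataFrame. The helper name is hypothetical, and the real function works on the Channel and Hits columns of df.

```python
def fill_channel_range(hits_by_channel, first_channel, last_channel):
    """Return (channels, hits) covering first_channel..last_channel inclusive.

    `hits_by_channel` maps channel number -> hit count; channels absent
    from the mapping are reported with zero hits.
    """
    if last_channel < first_channel:
        raise ValueError("last_channel must not be less than first_channel")
    channels = list(range(first_channel, last_channel + 1))
    hits = [hits_by_channel.get(channel, 0) for channel in channels]
    return channels, hits
```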

sampiclyser.sampic_tools.plot_channel_waveforms(file_path: Path, root_tree: str = 'sampic_hits', batch_size: int = 100000, first_hit: int = 0, num_hits: int = 10, channel_filter: list[int] | None = None, interpolation_method: str | None = 'sinc', interpolation_factor: int = 4, interpolation_parameter: int = 8, label: str = 'PPS', log_y: bool = False, figsize: tuple[float, float] = (6, 4), rlabel: str = '(13 TeV)', is_data: bool = True, title: str | None = None, file_name_id: str | None = None, cmap: str | None = None, time_scale: float = 1000000000, plot_sample_types: bool = True) Figure[source]

Plot multiple waveform hits from a SAMPIC data file in CMS style.

This function streams hit records from the specified data file (Parquet, Feather, or ROOT), applies optional interpolation and circular-buffer reordering, and draws each waveform with distinct coloring and markers. It then assembles CMS-standard annotations, a consolidated legend, and automatically generated titles.

Parameters:
  • file_path (pathlib.Path) – Path to the input file containing SAMPIC hit data. Supported formats: Parquet (.parquet, .pq), Feather (.feather), or ROOT (.root).

  • root_tree (str, default “sampic_hits”) – Name of the TTree inside a ROOT file to read.

  • batch_size (int, default 100000) – Number of hits to read per iteration when streaming.

  • first_hit (int, default 0) – Zero-based index of the first hit to plot (skips earlier hits).

  • num_hits (int, default 10) – Maximum number of hit waveforms to display.

  • channel_filter (list of int or None, optional) – If provided, only hits from these channel indices are plotted.

  • interpolation_method ({‘sinc’, ‘hann’, ‘hamming’, ‘lanczos’, ‘resample’, ‘resample_poly’}, optional) – Method for upsampling the waveform before plotting.

  • interpolation_factor (int, default 4) – Upsampling factor for interpolation.

  • interpolation_parameter (int, default 8) – Kernel/window size or filter parameter for the chosen interpolation.

  • label (str, default “PPS”) – Experiment label shown on the plot.

  • log_y (bool, default False) – If True, use a logarithmic scale for the y-axis.

  • figsize (tuple of float, default (6, 4)) – Figure size in inches.

  • rlabel (str, default “(13 TeV)”) – Right-hand label (e.g. collision energy) in the CMS annotation.

  • is_data (bool, default True) – If True, annotate as data; otherwise as simulation.

  • title (str or None, optional) – Custom plot title. If None, an automatic title is generated.

  • file_name_id (str or None, optional) – Identifier for the input file used in the auto-title; defaults to file name.

  • cmap (str or None, optional) – Name of a Matplotlib colormap for channel coloring; defaults to style cycle.

  • time_scale (float, default 1E9) – Multiplicative factor to apply to the time-axis before plotting.

  • plot_sample_types (bool, default True) – If True, plot the distinct sample types with different symbols.

Returns:

fig (matplotlib.figure.Figure) – Figure object containing the selected waveforms of V vs time, styled according to CMS conventions.

Raises:
  • ValueError – If the input file format is unsupported, or if key columns are missing.

  • RuntimeError – If metadata cannot be extracted or plot configuration is invalid.

Notes

  • Uses open_hit_reader and select_waveforms to stream and filter hits.

  • Delegates single-waveform rendering to plot_waveform.

  • Finalizes annotations with finalize_waveform_legend and set_waveform_titles_and_labels.

sampiclyser.sampic_tools.plot_hit_rate(file_path: Path, bin_size: float = 1.0, batch_size: int = 100000, plot_hits: bool = False, start_time: datetime | float | None = None, end_time: datetime | float | None = None, root_tree: str = 'sampic_hits', scale_factor: float = 1.0, label: str = 'PPS', log_y: bool = False, figsize: tuple[float, float] = (6, 4), rlabel: str = '(13 TeV)', is_data: bool = True, color='C0', title: str | None = None) Figure[source]

Plot the hit rate (or raw hits) as a function of time from large data files.

Streams the “UnixTime” column in batches from a Feather, Parquet, or ROOT file, bins events into fixed-width time intervals, and renders a CMS-style time series.

Parameters:
  • file_path (pathlib.Path) – Path to the input data file; supported suffixes are .feather, .parquet, .pq, and .root.

  • bin_size (float, optional) – Width of each time bin in seconds; values below 0.1 are rounded up to 0.1 (default: 1.0).

  • batch_size (int, optional) – Number of entries to read per I/O batch (default: 100000).

  • plot_hits (bool, optional) – If True, plot the raw count per bin; otherwise plot the rate (count divided by bin_size) (default: False).

  • start_time (datetime.datetime, float, or None, optional) – Start of the time window for plotting, as a datetime or UNIX timestamp. If None, uses the file’s “start_of_run” metadata. Aligned to the nearest lower multiple of bin_size (default: None).

  • end_time (datetime.datetime, float, or None, optional) – End of the time window for plotting, as a datetime or UNIX timestamp. If None, determined from the data. Aligned to the nearest upper multiple of bin_size (default: None).

  • root_tree (str, optional) – Name of the TTree in a ROOT file (only used if file_path ends in .root; default: “sampic_hits”).

  • scale_factor (float, optional) – Multiplier applied to each bin’s count (e.g. to account for central trigger multiplicity) before plotting (default: 1.0).

  • label (str, optional) – Experiment label (default: “PPS”).

  • log_y (bool, optional) – If True, use a logarithmic y-axis (default: False).

  • figsize (tuple of float, optional) – Figure size in inches as (width, height) (default: (6, 4)).

  • rlabel (str, optional) – Additional right-hand label (e.g. collision energy) (default: “(13 TeV)”).

  • is_data (bool, optional) – If True, annotate plots as “Data”; if False, annotate as “Simulation” (default: True).

  • color (color spec, optional) – Matplotlib color for the line or bars (default: “C0”).

  • title (str or None, optional) – Main title for the figure; if None, no title is drawn (default: None).

Returns:

fig (matplotlib.figure.Figure) – Figure object containing the hit-rate (or hit-count) vs. time plot, styled according to CMS conventions.

Raises:

ValueError – If file_path has an unsupported suffix.

Notes

  • Time bins are computed as floor((t - t0)/bin_size) indices, then shifted back to absolute times for plotting.

  • X-axis tick formatting uses Matplotlib’s AutoDateLocator and AutoDateFormatter for sensible date/time labels across variable spans.

sampiclyser.sampic_tools.plot_waveform(ax: Axes, hid: int, channel: int, baseline: float, samp_arr: ndarray, trig_arr: ndarray, period: float, color: Any, interp_kwargs: Dict[str, Any], label_mode: Literal['channel', 'hit', 'both', 'none'], reorder_circular_buffer: bool, reorder_samp_arr: bool, plot_sample_types: bool, plot_buffer_start: bool, explicit_labels: bool, time_scale: float) None[source]

Plot a single SAMPIC waveform on the given Axes, with optional interpolation, buffer reordering, and differentiated markers for sample types.

Parameters:
  • ax (matplotlib.axes.Axes) – The axes to draw on.

  • hid (int) – Hit index (used to label the waveform when label_mode is ‘hit’ or ‘both’).

  • channel (int) – SAMPIC channel number (used in the legend when label_mode is ‘channel’ or ‘both’).

  • baseline (float) – Baseline offset to add back to interpolated samples.

  • samp_arr (ndarray of float, shape (N,)) – Raw ADC sample values.

  • trig_arr (ndarray of {0,1}, shape (N,)) – Trigger markers, with a contiguous block of 1s (possibly wrapping).

  • period (float) – Time interval between samples in seconds.

  • color (any) – Matplotlib color spec (e.g. “C0”, RGB tuple, etc.).

  • interp_kwargs (dict) – Keyword arguments for apply_interpolation_method, including:

      - ‘interpolation_method’: {‘sinc’, ‘hann’, ‘hamming’, ‘lanczos’, ‘resample’, ‘resample_poly’}
      - ‘interpolation_factor’: int >= 1
      - ‘interpolation_parameter’: method-specific int

  • label_mode (Literal) – Controls how the waveforms are labelled:

      - ‘channel’: label waveforms with the channel number
      - ‘hit’: label waveforms with the hit id
      - ‘both’: label waveforms with both channel number and hit id
      - ‘none’: do not label waveforms

  • reorder_circular_buffer (bool) – If True, rotate trig_arr (and optionally samp_arr) so trigger block appears at the end.

  • reorder_samp_arr (bool) – If reorder_circular_buffer is True, also rotate samp_arr.

  • plot_sample_types (bool) – If True, use separate markers for non-trigger, trigger, and buffer-start samples. If False, plot all samples as dots.

  • plot_buffer_start (bool) – If plotting sample types, plot a distinct marker (‘>’) at the true buffer-start position (from reordering).

  • explicit_labels (bool) – If True, explicitly add labels for the different marker types.

  • time_scale (float) – Scale factor applied to the time axis before plotting.

Returns:

None

Raises:

ValueError – If array lengths differ, or if trig_arr is not 1D or contains no 1s. Passes through errors from apply_interpolation_method or reorder_circular_samples_with_trigger.

Notes

  • Marker sizes are squared values (s=marker_size**2) for clarity.

sampiclyser.sampic_tools.reorder_circular_samples_with_trigger(trig_arr: ndarray, samp_arr: ndarray, reorder_samples: bool) Tuple[ndarray, ndarray, ndarray][source]

Rotate a circular buffer so that the contiguous trigger block (1s) appears at the end.

Verify that all trigger markers (1s) in a circular array are contiguous, then rotate both the trigger array and optionally the associated sample array so that the block of 1s appears at the end of the array.

This is useful when interpreting a circular buffer of ADC samples where the trigger position, marked by one or more consecutive 1s, may wrap around the end of the buffer. After reordering, the data will be in true time order, with the trigger block at the end.

In circular context, a block of 1s may wrap from the end back to the start of the array (e.g. [1,0,0,1] is a valid 2-wide trigger block). This function verifies that all 1s form exactly one circularly-contiguous block, then performs a roll so that those 1s occupy the last positions in the array. The sample array is optionally rotated identically to preserve alignment.

Parameters:
  • trig_arr (ndarray of int (0 or 1), shape (N,)) – Circular array marking trigger positions. Must contain one or more contiguous 1s; all other entries must be 0.

  • samp_arr (ndarray, shape (N,)) – Sample values corresponding to each position in trig_arr.

  • reorder_samples (bool) – If True, rotate samp_arr identically to trig_arr; if False, leave samp_arr unchanged.

Returns:

  • trig_reordered (ndarray of int, shape (N,)) – The trigger array rotated so that its contiguous 1s occupy the final positions of the array.

  • samp_reordered (ndarray, shape (N,)) – The sample array, either rotated in lock-step (if reorder_samples) or returned unchanged.

  • start_indicator (ndarray of int (0 or 1), shape (N,)) – All zeros except a single 1 at the index where the original buffer start appears in the reordered buffer.

Raises:

ValueError – If trig_arr and samp_arr have different lengths, if trig_arr does not contain any 1s, or if the 1s in trig_arr are not contiguous.

Examples

>>> import numpy as np
>>> trig = np.array([0, 0, 1, 1, 0, 0])
>>> samp = np.arange(6)
>>> t_new, s_new, start_mask = reorder_circular_samples_with_trigger(trig, samp, reorder_samples=True)
>>> t_new
array([0, 0, 0, 0, 1, 1])
>>> s_new
array([4, 5, 0, 1, 2, 3])
>>> start_mask
array([0, 0, 1, 0, 0, 0])
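
The contiguity check and rotation can also be sketched on plain lists, which makes the wrap-around case easy to follow. This is an illustrative re-implementation with a hypothetical name, not the library code; it assumes trig contains only 0s and 1s.

```python
def reorder_circular_sketch(trig, samp, reorder_samples=True):
    """Rotate a circular buffer so the block of 1s in `trig` ends the list."""
    n = len(trig)
    if n != len(samp):
        raise ValueError("trig and samp must have the same length")
    ones = sum(trig)
    if ones == 0:
        raise ValueError("trig contains no 1s")

    # A single circular block has exactly one 0 -> 1 transition
    # (trig[i - 1] wraps to the last element when i == 0).
    rises = sum(1 for i in range(n) if trig[i] == 1 and trig[i - 1] == 0)
    if ones < n and rises != 1:
        raise ValueError("1s in trig are not circularly contiguous")

    if ones == n:
        shift = 0  # every entry is a trigger marker; nothing to rotate
    else:
        # Last index of the block: a 1 followed (circularly) by a 0.
        end = next(i for i in range(n)
                   if trig[i] == 1 and trig[(i + 1) % n] == 0)
        shift = n - 1 - end  # right-rotation that moves `end` to n - 1

    def roll(seq):
        seq = list(seq)
        return seq[-shift:] + seq[:-shift] if shift else seq

    start = [0] * n
    start[shift] = 1  # where original index 0 lands after rotation
    return roll(trig), roll(samp) if reorder_samples else list(samp), start
```

With trig = [1, 0, 0, 1] the block wraps around the end of the buffer; the sketch rotates it to [0, 0, 1, 1] and marks the original buffer start at the last position.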

sampiclyser.sampic_tools.reorder_hits(input_path: Path, output_feather_path: Path | None = None, output_parquet_path: Path | None = None, output_root_path: Path | None = None, root_tree: str = 'sampic_hits', use_unix_time: bool = False, batch_size: int = 100000, max_time_offset: float | None = None, schemaInfo: Dict[str, Tuple] = {'ADCCounterLatched': ('int32', DataType(int32), <class 'numpy.int32'>), 'Amplitude': ('float32', DataType(float), <class 'numpy.float32'>), 'Baseline': ('float32', DataType(float), <class 'numpy.float32'>), 'Cell': ('int32', DataType(int32), <class 'numpy.int32'>), 'Channel': ('int32', DataType(int32), <class 'numpy.int32'>), 'DataSample': (None, ListType(list<item: float>), <class 'numpy.float32'>, 64), 'DataSize': ('int32', DataType(int32), <class 'numpy.int32'>), 'FPGATimeStamp': ('uint64', DataType(uint64), <class 'numpy.float64'>), 'HITNumber': ('int32', DataType(int32), <class 'numpy.int32'>), 'OrderedCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'PhysicalCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'RawPeak': ('float32', DataType(float), <class 'numpy.float32'>), 'RawTOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'StartOfADCRamp': ('int32', DataType(int32), <class 'numpy.int32'>), 'TOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'Time': ('float64', DataType(double), <class 'numpy.float64'>), 'TimeStampA': ('int32', DataType(int32), <class 'numpy.int32'>), 'TimeStampB': ('int32', DataType(int32), <class 'numpy.int32'>), 'TriggerPosition': (None, ListType(list<item: int32>), <class 'numpy.int32'>, 64), 'UnixTime': ('float64', DataType(double), <class 'numpy.float64'>)}) None[source]

Stream and reorder hit records into non-decreasing time order, writing to a new file.

This function reads hit records from input_path (Feather, Parquet, or ROOT) in memory-efficient batches, reconstructs or reads each hit’s timestamp, and writes the hits in non-decreasing time order to each requested output file, which may be in the same format as the input or a different one. If max_time_offset is provided, a sliding-window heap handles hits that arrive out of order by up to max_time_offset seconds; otherwise all records are buffered and fully sorted before writing.

Parameters:
  • input_path (pathlib.Path) – Path to the input decoded SAMPIC data file (.parquet, .pq, .feather, or .root).

  • output_feather_path (pathlib.Path or None, optional) – If not None, path to write the reordered data in Feather format.

  • output_parquet_path (pathlib.Path or None, optional) – If not None, path to write the reordered data in Parquet format.

  • output_root_path (pathlib.Path or None, optional) – If not None, path to write the reordered data to a ROOT file. The file receives a data TTree named root_tree and a metadata TTree.

  • root_tree (str, optional) – Name of the TTree inside the ROOT file (default: “sampic_hits”).

  • use_unix_time (bool, default False) – If True, use the ‘UnixTime’ column directly; otherwise use sampic_reconstruct_time_dict to compute timestamps.

  • batch_size (int, default 100000) – Number of rows/entries to read per batch from the input.

  • max_time_offset (float or None, optional) – Maximum lag (seconds) allowed before emitting buffered records. Records older than (current_max_time - max_time_offset) are written. If None, all records are buffered and sorted fully before writing.

  • schemaInfo (dict, default SAMPIC_Schema_Info) – Mapping of field names to type definitions used for ROOT dtype.

Returns:

None

Raises:
  • ValueError – If timestamp reconstruction fails or required columns are missing.

  • RuntimeError – If no output path is provided.

Notes

  • Metadata from the input file is propagated to all output formats.

  • Parquet and Feather writing use PyArrow writers with an explicit schema.

  • ROOT writing uses uproot to extend a TTree in streaming fashion.
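
The sliding-window reordering can be sketched with the standard-library heapq module. The generator below is a hypothetical illustration operating on (timestamp, payload) pairs; batching, timestamp reconstruction and file output are omitted, and the library's own buffering details may differ.

```python
import heapq
from itertools import count


def reorder_stream(records, max_time_offset=None):
    """Yield (timestamp, payload) pairs in non-decreasing time order.

    Records older than (newest timestamp seen - max_time_offset) are
    emitted as soon as possible; with max_time_offset=None everything
    is buffered and sorted before anything is emitted.
    """
    heap = []
    tie_breaker = count()  # keeps equal timestamps in arrival order
    newest = float("-inf")
    for timestamp, payload in records:
        heapq.heappush(heap, (timestamp, next(tie_breaker), payload))
        newest = max(newest, timestamp)
        if max_time_offset is not None:
            # Flush everything that can no longer be overtaken.
            while heap and heap[0][0] < newest - max_time_offset:
                ts, _, item = heapq.heappop(heap)
                yield ts, item
    while heap:  # drain whatever is still buffered
        ts, _, item = heapq.heappop(heap)
        yield ts, item
```

The output is fully sorted only if no hit arrives more than max_time_offset seconds late, so the window trades memory use against tolerance for disorder.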

sampiclyser.sampic_tools.reprocess_data_files(input_path: Path, output_feather_path: Path | None = None, output_parquet_path: Path | None = None, output_root_path: Path | None = None, root_tree: str = 'sampic_hits', schemaInfo: Dict[str, Tuple] = {'ADCCounterLatched': ('int32', DataType(int32), <class 'numpy.int32'>), 'Amplitude': ('float32', DataType(float), <class 'numpy.float32'>), 'Baseline': ('float32', DataType(float), <class 'numpy.float32'>), 'Cell': ('int32', DataType(int32), <class 'numpy.int32'>), 'Channel': ('int32', DataType(int32), <class 'numpy.int32'>), 'DataSample': (None, ListType(list<item: float>), <class 'numpy.float32'>, 64), 'DataSize': ('int32', DataType(int32), <class 'numpy.int32'>), 'FPGATimeStamp': ('uint64', DataType(uint64), <class 'numpy.float64'>), 'HITNumber': ('int32', DataType(int32), <class 'numpy.int32'>), 'OrderedCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'PhysicalCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'RawPeak': ('float32', DataType(float), <class 'numpy.float32'>), 'RawTOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'StartOfADCRamp': ('int32', DataType(int32), <class 'numpy.int32'>), 'TOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'Time': ('float64', DataType(double), <class 'numpy.float64'>), 'TimeStampA': ('int32', DataType(int32), <class 'numpy.int32'>), 'TimeStampB': ('int32', DataType(int32), <class 'numpy.int32'>), 'TriggerPosition': (None, ListType(list<item: int32>), <class 'numpy.int32'>, 64), 'UnixTime': ('float64', DataType(double), <class 'numpy.float64'>)}, new_columns: List[str] | None = None, restrict_columns: List[str] | None = None) Iterator[Tuple[List[str], Schema, RecordBatchFileWriter | None, ParquetWriter | None, WritableTree | None]][source]

Context manager to open/prepare readers and writers for re-processing SAMPIC data.

Reads schema & metadata from input_path (Parquet, Feather, or ROOT), optionally restricts or extends the column list, then yields:

  1. columns: final list of column names

  2. schema: Arrow Schema with metadata embedded

  3. feather_writer: IPC writer for Feather (or None)

  4. parquet_writer: ParquetWriter instance (or None)

  5. root_tree_obj: Writable ROOT TTree (or None)

Upon exit, finalizes metadata and closes all writers.

Parameters:
  • input_path (pathlib.Path) – Path to an existing decoded SAMPIC input file. Supported extensions: .parquet/.pq, .feather, .root.

  • output_feather_path (pathlib.Path or None, optional) – Path for the output Feather file; if None, Feather writing is disabled.

  • output_parquet_path (pathlib.Path or None, optional) – Path for the output Parquet file; if None, Parquet writing is disabled.

  • output_root_path (pathlib.Path or None, optional) – Path for the output ROOT file; if None, ROOT writing is disabled.

  • root_tree (str) – Name of the TTree for ROOT I/O (default: “sampic_hits”).

  • schemaInfo – Mapping of column names to (pandas_dtype, pa_type, numpy_dtype, optional_size). Used to rebuild the Arrow schema if new_columns or restrict_columns is given.

  • new_columns (list of str) – Extra column names to append to those discovered in input_path.

  • restrict_columns (list of str) – If given, only the intersection of this list and the discovered columns is used.

Yields:
  • columns (list of str) – The final ordered list of column names for downstream reads/writes.

  • schema (pa.Schema) – Arrow schema for reading and writing data.

  • feather_writer (pa.ipc.RecordBatchFileWriter or None) – Open IPC writer for Feather, or None if disabled.

  • parquet_writer (pq.ParquetWriter or None) – Open ParquetWriter, or None if disabled.

  • root_tree_obj (uproot.writing.WritableTree or None) – Writable TTree for ROOT output, or None if disabled.

Raises:

ValueError – If input_path has an unsupported suffix.

Notes

  • Input metadata (from get_file_metadata) is embedded in the Arrow schema and, upon context exit, written into the ROOT metadata TTree if used.

  • All open writers are closed automatically in the finally block.

  • If both new_columns and restrict_columns are None, the input file’s native schema is preserved; otherwise a new schema is built via build_schema.
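The interplay of restrict_columns and new_columns can be illustrated with a small pure-Python sketch. It assumes that restriction is applied first (keeping the discovered order) and that new columns are appended afterwards without duplicates; the source does not state this order. The helper name resolve_columns and the column "Risetime" are hypothetical and not part of sampiclyser.

```python
from typing import List, Optional


def resolve_columns(
    discovered: List[str],
    new_columns: Optional[List[str]] = None,
    restrict_columns: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: combine discovered, restricted, and new column names."""
    columns = list(discovered)
    if restrict_columns is not None:
        # Assumption: keep only requested columns that exist, in discovered order.
        wanted = set(restrict_columns)
        columns = [name for name in columns if name in wanted]
    if new_columns is not None:
        # Assumption: append extra columns, skipping names already present.
        for name in new_columns:
            if name not in columns:
                columns.append(name)
    return columns


discovered = ["HITNumber", "Channel", "Baseline", "Time"]
final = resolve_columns(
    discovered,
    new_columns=["Risetime"],  # hypothetical derived column
    restrict_columns=["Time", "Channel", "Missing"],
)
print(final)  # → ['Channel', 'Time', 'Risetime']
```

Under these assumptions, a name in restrict_columns that is absent from the input ("Missing" here) is silently dropped, which matches the documented "intersection" wording.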

sampiclyser.sampic_tools.reprocess_noop(input_path: ~pathlib.Path, output_feather_path: ~pathlib.Path | None = None, output_parquet_path: ~pathlib.Path | None = None, output_root_path: ~pathlib.Path | None = None, root_tree: str = 'sampic_hits', batch_size: int = 100000, schemaInfo: ~typing.Dict[str, ~typing.Tuple] = {'ADCCounterLatched': ('int32', DataType(int32), <class 'numpy.int32'>), 'Amplitude': ('float32', DataType(float), <class 'numpy.float32'>), 'Baseline': ('float32', DataType(float), <class 'numpy.float32'>), 'Cell': ('int32', DataType(int32), <class 'numpy.int32'>), 'Channel': ('int32', DataType(int32), <class 'numpy.int32'>), 'DataSample': (None, ListType(list<item: float>), <class 'numpy.float32'>, 64), 'DataSize': ('int32', DataType(int32), <class 'numpy.int32'>), 'FPGATimeStamp': ('uint64', DataType(uint64), <class 'numpy.float64'>), 'HITNumber': ('int32', DataType(int32), <class 'numpy.int32'>), 'OrderedCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'PhysicalCell0Time': ('float64', DataType(double), <class 'numpy.float64'>), 'RawPeak': ('float32', DataType(float), <class 'numpy.float32'>), 'RawTOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'StartOfADCRamp': ('int32', DataType(int32), <class 'numpy.int32'>), 'TOTValue': ('int32', DataType(int32), <class 'numpy.int32'>), 'Time': ('float64', DataType(double), <class 'numpy.float64'>), 'TimeStampA': ('int32', DataType(int32), <class 'numpy.int32'>), 'TimeStampB': ('int32', DataType(int32), <class 'numpy.int32'>), 'TriggerPosition': (None, ListType(list<item: int32>), <class 'numpy.int32'>, 64), 'UnixTime': ('float64', DataType(double), <class 'numpy.float64'>)}) None[source]

A “no-op” reprocessor that copies SAMPIC data from one format to another in batches.

Reads the same columns from input_path (Parquet, Feather, or ROOT) and writes them unmodified to any of the enabled outputs (Feather, Parquet, or ROOT), streaming in batch_size chunks. Metadata is inherited and re-emitted automatically, so this can convert file formats or update compression, etc.

Parameters:
  • input_path (pathlib.Path) – Path to the input decoded SAMPIC data file. Supported suffixes: .parquet/.pq, .feather, or .root. For ROOT, root_tree must point to an existing TTree of hits.

  • output_feather_path (pathlib.Path or None, default None) – If provided, write output as an Arrow IPC/Feather file.

  • output_parquet_path (pathlib.Path or None, default None) – If provided, write output as a Parquet file.

  • output_root_path (pathlib.Path or None, default None) – If provided, write output as a ROOT file with the same TTree name plus a metadata TTree.

  • root_tree (str, default “sampic_hits”) – Name of the hit TTree for ROOT input/output.

  • batch_size (int, default 100000) – Number of rows/entries to read per batch from the input.

  • schemaInfo (dict, default SAMPIC_Schema_Info) – Mapping of field names to type definitions used for ROOT dtype.

Returns:

None

Raises:
  • RuntimeError – If no output path is specified.

  • ValueError – If required columns are missing or conversion fails.

Notes

  • This performs no transformation on the data itself—fields, datatypes, and ordering are preserved.

  • For ROOT output, array columns are converted via get_root_data_with_schema to fixed-length NumPy vectors.

  • Metadata from the input file is propagated to all output formats.

  • Parquet and Feather writing use PyArrow writers with an explicit schema.

sampiclyser.sampic_tools.sampic_reconstruct_time_dict(rec: dict) float[source]

Implements the custom SAMPIC time reconstruction logic, operating on a dict of recorded SAMPIC fields.

Parameters:

rec (dict) – Dictionary containing the SAMPIC fields required for time reconstruction. Required:

UnixTime -

Raises:

ValueError – If use_unix_time is False and no reconstruction algorithm is provided in _reconstruct_time.

sampiclyser.sampic_tools.select_waveforms(batches: Iterator[RecordBatch | Array], first_hit: int, num_hits: int, channel_filter: Set[int] | None = None) Iterator[Tuple[int, float, int, ndarray, ndarray]][source]

Flatten record batches or Awkward arrays into individual waveform records, with optional hit-index slicing and channel filtering.

Parameters:
  • batches (iterator of RecordBatch or awkward.highlevel.Array) – Stream of data blocks each containing the fields ‘HITNumber’, ‘Channel’, ‘Baseline’, ‘DataSize’, ‘TriggerPosition’, and ‘DataSample’.

  • first_hit (int) – Number of initial hits to skip before yielding.

  • num_hits (int) – Maximum number of hits to yield after skipping first_hit.

  • channel_filter (set of int or None, optional) – If provided, only waveforms whose channel index is in this set are yielded.

Yields:
  • hit_number (int) – The sequential hit number given by SAMPIC to this waveform.

  • channel (int) – SAMPIC channel index for this waveform.

  • baseline (float) – Baseline offset for the waveform.

  • n_samples (int) – Number of ADC samples in this waveform.

  • trigger_positions (ndarray of int) – 1D array of length n_samples, with 0/1 indicating trigger positions.

  • samples (ndarray of float) – 1D array of length n_samples containing the ADC values.

Raises:

ValueError – If a batch is missing any of the required fields.

Notes

  • Uses duck typing to detect a PyArrow RecordBatch (has .column()) vs. an Awkward Array (indexable by field name).

  • Stops iteration once num_hits waveforms have been yielded.
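The skip/limit/filter behaviour described above can be sketched on plain dictionaries rather than record batches. This is only an illustration: slice_hits is a hypothetical stand-in, and it assumes the channel filter is applied before first_hit is counted, which the source does not specify.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, Optional, Set


def slice_hits(
    records: Iterable[Dict],
    first_hit: int,
    num_hits: int,
    channel_filter: Optional[Set[int]] = None,
) -> Iterator[Dict]:
    """Hypothetical stand-in: filter by channel, skip first_hit, yield up to num_hits."""
    selected = (
        rec
        for rec in records
        if channel_filter is None or rec["Channel"] in channel_filter
    )
    # islice is lazy, so the input is not consumed past the last yielded record.
    return islice(selected, first_hit, first_hit + num_hits)


# Synthetic hits cycling through channels 0-3 (illustrative data only).
hits = [{"HITNumber": i, "Channel": i % 4} for i in range(20)]
picked = list(slice_hits(hits, first_hit=1, num_hits=3, channel_filter={0, 2}))
print([rec["HITNumber"] for rec in picked])  # → [2, 4, 6]
```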

sampiclyser.sampic_tools.set_mplhep_style(style: str = 'CMS')[source]
sampiclyser.sampic_tools.set_waveform_titles_and_labels(ax: Axes, file_path: Path, file_name_id: str | None = None, title: str | None = None, channel_filter: List[int] | None = None, first_hit: int = 0, hits_plotted: int = 1, time_scale: float = 1.0) None[source]

Set the main title and axis labels for a waveform plot.

If title is provided, it is used verbatim. Otherwise an automatic title is constructed from file_name_id, the range of hits plotted, and optionally the channel filter.

The x-axis label is set to “Time […]” with units chosen from the time_scale (s, ms, µs, ns, ps). The y-axis is always labeled “Voltage [V]”.

Parameters:
  • ax (matplotlib.axes.Axes) – The Axes object on which to set titles and labels.

  • file_path (pathlib.Path) – Path of the source data file; used to default file_name_id.

  • file_name_id (str or None, optional) – Short identifier for the file (e.g. filename without path). If None or empty, defaults to file_path.name.

  • title (str or None, optional) – If provided, this exact string is set as the plot title. If None, an automatic title is generated.

  • channel_filter (list of int or None, optional) – If plotting only a subset of channels, used to annotate the title:

    • Single-element list → “Channel N”

    • Multi-element list → “Selected Channel”

  • first_hit (int, default 0) – Index of the first hit plotted (0-based). Used in automatic title when title is None.

  • hits_plotted (int, default 1) – Number of hits actually drawn. Used in automatic title when title is None.

  • time_scale (float, default 1.0) – Factor applied to the “time” values before labeling and tick formatting. Must be one of [1, 1e3, 1e6, 1e9, 1e12], corresponding to seconds, milliseconds, microseconds, nanoseconds, and picoseconds.

Returns:

None

Raises:

RuntimeError – If time_scale is not one of the recognized values.
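The mapping from time_scale to the x-axis unit can be sketched as a simple lookup. The helper time_axis_label and the table TIME_UNITS are hypothetical names used for illustration; only the five scale values, their units, and the RuntimeError come from the description above.

```python
# Documented scale factors and the unit each one corresponds to.
TIME_UNITS = {1.0: "s", 1e3: "ms", 1e6: "µs", 1e9: "ns", 1e12: "ps"}


def time_axis_label(time_scale: float) -> str:
    """Hypothetical helper: build the x-axis label for a given time_scale."""
    try:
        unit = TIME_UNITS[float(time_scale)]
    except KeyError:
        # Mirrors the documented behaviour for unrecognized scale factors.
        raise RuntimeError(f"Unrecognized time_scale: {time_scale!r}") from None
    return f"Time [{unit}]"


print(time_axis_label(1e9))  # → Time [ns]
```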

Examples

>>> fig, ax = plt.subplots()
>>> set_waveform_titles_and_labels(
...     ax,
...     Path("/data/run123"),
...     file_name_id="run123",
...     title=None,
...     channel_filter=[2],
...     first_hit=5,
...     hits_plotted=10,
...     time_scale=1e6
... )
sampiclyser.sampic_tools.windowed_sinc_interpolation(t_orig: ndarray, y_orig: ndarray, t_new: ndarray, window: str = 'hann', M: int = 8) ndarray[source]

Band-limited interpolation of a uniformly sampled signal using a windowed sinc kernel.

Constructs a truncated sinc filter of half-width M samples, applies a smooth tapering window (Hann or Hamming) to reduce ringing, and convolves it with the input data to estimate values at new time points.

Parameters:
  • t_orig (ndarray of float, shape (N,)) – Original sample times (must be uniformly spaced).

  • y_orig (ndarray of float, shape (N,)) – Original sample values.

  • t_new (ndarray of float, shape (M_new,)) – Desired output times (must lie within the range of t_orig).

  • window ({‘hann’, ‘hamming’}, optional) – Type of tapering window to apply to the sinc kernel. Default is ‘hann’.

  • M (int, optional) – Half-width of the truncated sinc kernel, in number of original samples. Total kernel length will be 2*M + 1. Default is 8.

Returns:

y_new (ndarray of float, shape (M_new,)) – Interpolated values at t_new.

Raises:

ValueError – If t_orig is not at least two points, if window is not recognized, or if t_new contains values outside the range of t_orig.

Notes

  • This implementation assumes t_orig is uniformly spaced. If that is not the case, consider first resampling to a uniform grid or using a more general interpolator.

  • The Hann window is defined as $$ w[k] = 0.5 + 0.5\cos\bigl(2\pi k/(2M+1)\bigr),\quad k=-M\ldots M. $$

  • The Hamming window uses $$ w[k] = 0.54 + 0.46\cos\bigl(2\pi k/(2M+1)\bigr). $$

  • Each output sample y_new[i] is $$ \sum_{k=-M}^{M} y_{\!n}\,\text{sinc}\bigl((t_{\!new}-t_{\!orig,n})/T\bigr)\,w[k], $$ where \(T\) is the uniform sample spacing and \(n\) is chosen so that \(t_{\!orig,n}\) is the nearest original sample to each \(t_{\!new,i}\).

Examples

>>> import numpy as np
>>> t = np.linspace(0, 1, 11)
>>> y = np.sin(2*np.pi*5*t)
>>> t_fine = np.linspace(0, 1, 101)
>>> y_fine = windowed_sinc_interpolation(t, y, t_fine, window='hann', M=4)
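For readers who want to see the kernel arithmetic spelled out, here is a minimal pure-Python sketch of Hann-windowed sinc interpolation. It is not the package's implementation: sinc_interp is a hypothetical name, the sum is interpreted as running over the neighbours n + k of the nearest original sample n, and neighbours that fall outside the data are simply skipped.

```python
import math
from typing import List, Sequence


def sinc(x: float) -> float:
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)


def hann(k: int, M: int) -> float:
    """Hann taper for kernel offset k in -M..M."""
    return 0.5 + 0.5 * math.cos(2.0 * math.pi * k / (2 * M + 1))


def sinc_interp(
    t_orig: Sequence[float],
    y_orig: Sequence[float],
    t_new: Sequence[float],
    M: int = 8,
) -> List[float]:
    """Hypothetical sketch of windowed-sinc interpolation on a uniform grid."""
    if len(t_orig) < 2:
        raise ValueError("t_orig needs at least two points")
    T = t_orig[1] - t_orig[0]  # uniform sample spacing
    out = []
    for t in t_new:
        # Index of the nearest original sample, clamped to the valid range.
        n = min(max(int(round((t - t_orig[0]) / T)), 0), len(t_orig) - 1)
        acc = 0.0
        for k in range(-M, M + 1):
            j = n + k
            if 0 <= j < len(t_orig):  # assumption: skip neighbours outside the data
                acc += y_orig[j] * sinc((t - t_orig[j]) / T) * hann(k, M)
        out.append(acc)
    return out


t = [float(i) for i in range(11)]
y = [math.sin(0.3 * i) for i in range(11)]
# At the original sample times the kernel collapses to the sample itself.
max_err = max(abs(a - b) for a, b in zip(sinc_interp(t, y, t, M=4), y))
print(max_err < 1e-9)  # → True
```

Evaluating at the original sample times is a useful sanity check, because sinc vanishes at every non-zero integer offset and the window equals 1 at k = 0.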

sampiclyser.sensor_hitmaps module

class sampiclyser.sensor_hitmaps.SensorSpec(name: str, sampic_map: Dict[int, int], geometry: Tuple, cmap: str = 'viridis', global_rotation_units: int = 0, global_flip: bool = False)[source]

Specification for plotting a single sensor’s hitmap, including local-to-global coordinate mapping.

Supports three geometry types (grid, grouped, scatter) and optional global transformations (multiples of 90° rotations and an optional mirror flip).

Variables:
  • name (str) – Human-readable title for the sensor (used as the subplot title).

  • sampic_map (dict of int → int) – Mapping from SAMPIC channel indices to sensor channel identifiers.

  • geometry (tuple) –

    Defines the sensor’s layout and drawing method. The first element is the geometry type, one of:

    • “grid”: A regular n_rows × n_cols grid with 1:1 channel→pixel mapping: (“grid”, n_rows, n_cols, chan2coord), where chan2coord maps sensor channel → (row, col) coordinates.

    • “grouped”: Multiple pixels per channel on a grid: (“grouped”, chan2pixels, n_rows, n_cols), where chan2pixels maps sensor channel → list of (row, col) pixel coordinates.

    • “scatter”: Arbitrary pixel centers and sizes: (“scatter”, chan2coords, pixel_width, pixel_height), where chan2coords maps sensor channel → (x, y) center coordinates.

  • cmap (str) – Name of the Matplotlib colormap to use for this sensor (default: “viridis”).

  • global_rotation_units (int) – Number of 90° clockwise rotations to apply to the local coordinate system to obtain the global orientation (0-3).

  • global_flip (bool) – If True, mirror the coordinates horizontally before rotation to align the local layout with the global coordinate frame.
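As an illustration of how sampic_map and a “grid” geometry tuple fit together, the sketch below uses a stand-in dataclass that mirrors the documented fields so it runs without the package installed. SensorSpecSketch, the 2×2 layout, and the channel numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class SensorSpecSketch:
    """Stand-in mirroring the documented SensorSpec fields (not the real class)."""

    name: str
    sampic_map: Dict[int, int]
    geometry: Tuple
    cmap: str = "viridis"
    global_rotation_units: int = 0
    global_flip: bool = False


# Hypothetical 2x2 sensor: SAMPIC channels 0-3 read out sensor channels 10-13.
sampic_map = {0: 10, 1: 11, 2: 12, 3: 13}
chan2coord = {10: (0, 0), 11: (0, 1), 12: (1, 0), 13: (1, 1)}

grid_spec = SensorSpecSketch(
    name="Demo 2x2",
    sampic_map=sampic_map,
    geometry=("grid", 2, 2, chan2coord),  # ("grid", n_rows, n_cols, chan2coord)
)

# Resolve each SAMPIC channel to its (row, col) pixel through both mappings.
kind, n_rows, n_cols, coords = grid_spec.geometry
pixel_of = {ch: coords[sensor_ch] for ch, sensor_ch in grid_spec.sampic_map.items()}
print(pixel_of[2])  # → (1, 0)
```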

cmap: str = 'viridis'
geometry: Tuple
global_flip: bool = False
global_rotation_units: int = 0
name: str
sampic_map: Dict[int, int]
sampiclyser.sensor_hitmaps.convert_nrows_ncols_to_global(nrows: int, ncols: int, rotations: int) tuple[int, int][source]

Compute the global grid dimensions after applying quarter-turn rotations.

Parameters:
  • nrows (int) – Number of rows in the local (unrotated) grid.

  • ncols (int) – Number of columns in the local (unrotated) grid.

  • rotations (int) – Number of 90° clockwise rotations to apply. Only the parity of rotations modulo 2 affects the shape: even → no swap, odd → swap rows and columns.

Returns:

  • global_nrows (int) – Number of rows in the grid after rotation.

  • global_ncols (int) – Number of columns in the grid after rotation.

Examples

>>> convert_nrows_ncols_to_global(10, 5, 0)
(10, 5)
>>> convert_nrows_ncols_to_global(10, 5, 1)
(5, 10)
>>> convert_nrows_ncols_to_global(10, 5, 2)
(10, 5)
>>> convert_nrows_ncols_to_global(10, 5, 3)
(5, 10)
sampiclyser.sensor_hitmaps.convert_r_c_to_global(r_local: int, c_local: int, rotations: int, do_mirror: bool, nrows: int, ncols: int) tuple[int, int][source]

Map local (row, column) indices to global coordinates with optional mirroring and rotation.

Parameters:
  • r_local (int) – Row index in the sensor’s local coordinate system (0-based).

  • c_local (int) – Column index in the sensor’s local coordinate system (0-based).

  • rotations (int) – Number of 90° clockwise rotations to apply. Effective rotations are taken modulo 4 (i.e. 0, 1, 2, or 3).

  • do_mirror (bool) – If True, reflect the column coordinate horizontally before rotation.

  • nrows (int) – Total number of rows in the local grid.

  • ncols (int) – Total number of columns in the local grid.

Returns:

r_global, c_global (tuple of int) – The transformed (row, column) in the global coordinate system after applying mirroring and rotation.

Notes

  • Mirroring (if enabled) flips c_local to ncols - 1 - c_local.

  • Rotation is performed clockwise in 90° increments:

    • 0 → (r, c)

    • 1 → (c, nrows - 1 - r)

    • 2 → (nrows - 1 - r, ncols - 1 - c)

    • 3 → (ncols - 1 - c, r)

  • Inputs outside expected ranges (e.g. negative indices) are not checked, so passing invalid r_local/c_local may lead to unexpected results.
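The mirroring and rotation rules listed in the notes can be written out directly. The functions below are a sketch following those rules, with hypothetical names (global_shape, to_global) to make clear they are not the package's own functions.

```python
from typing import Tuple


def global_shape(nrows: int, ncols: int, rotations: int) -> Tuple[int, int]:
    """Grid shape after quarter turns: odd counts swap rows and columns."""
    return (ncols, nrows) if rotations % 2 else (nrows, ncols)


def to_global(
    r: int, c: int, rotations: int, do_mirror: bool, nrows: int, ncols: int
) -> Tuple[int, int]:
    """Mirror the column first (if requested), then rotate clockwise."""
    if do_mirror:
        c = ncols - 1 - c
    rot = rotations % 4
    if rot == 0:
        return r, c
    if rot == 1:
        return c, nrows - 1 - r
    if rot == 2:
        return nrows - 1 - r, ncols - 1 - c
    return ncols - 1 - c, r


print(global_shape(10, 5, 1))               # → (5, 10)
print(to_global(0, 0, 1, False, 10, 5))     # → (0, 9)
print(to_global(0, 0, 1, True, 10, 5))      # → (4, 9)
```

With one rotation, the local top-left pixel of a 10 × 5 grid lands in the top-right corner of the 5 × 10 global grid, as expected for a clockwise quarter turn.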

sampiclyser.sensor_hitmaps.plot_hitmap(summary_df: DataFrame, specs: Sequence[SensorSpec], layout: Tuple[int, int], figsize: Tuple[int, float] = (8, 6), cmap: str = 'viridis', log_z: bool = False, title: str | None = None, do_sampic_ch: bool = False, do_board_ch: bool = False, center_fontsize: int = 14, coordinates: str = 'local') Figure[source]

Draw a grid of sensor hitmaps with a shared color scale.

Each sensor’s hit counts are rendered according to its geometry (grid, grouped, or scatter) in a subplot arranged by layout. All subplots share the same color normalization (linear or logarithmic), and have equal aspect ratio to preserve pixel shapes.

Parameters:
  • summary_df (pandas.DataFrame) – DataFrame with columns “Channel” (int) and “Hits” (int).

  • specs (sequence of SensorSpec) – Specifications for each sensor to plot (one subplot per spec).

  • layout (tuple of int) – Subplot grid dimensions as (nrows, ncols).

  • figsize (tuple of float, optional) – Figure size in inches as (width, height) (default: (8, 6)).

  • cmap (str, optional) – Matplotlib colormap name to use for all sensors (default: “viridis”).

  • log_z (bool, optional) – If True, apply logarithmic normalization on the color (z) axis (default: False for linear scale).

  • title (str or None, optional) – Overall figure title; if None, no supertitle is drawn (default: None).

  • do_sampic_ch (bool, optional) – If True, annotate each pixel/group with its SAMPIC channel index (default: False).

  • do_board_ch (bool, optional) – If True, annotate each pixel/group with its board channel index (default: False).

  • center_fontsize (int, optional) – Font size for center annotations (default: 14).

  • coordinates ({‘local’, ‘global’}, optional) – Coordinate system for rendering (default: ‘local’):

    • ‘local’: use the sensor’s native coordinates.

    • ‘global’: apply global_rotation_units and global_flip from each spec to map to a common global frame.

Returns:

fig (matplotlib.figure.Figure) – The Figure object containing the arranged hitmap subplots.

Notes

  • Uses sampiclyser_style for style formatting.

  • A single colorbar is added to the first subplot, reflecting all panels.

  • Subplot aspect is set to ‘equal’ so that pixels are not distorted.
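One way to obtain a shared color scale is to compute a single (vmin, vmax) pair over all sensors before drawing. The sketch below shows that idea on plain dictionaries; shared_color_limits is a hypothetical helper, and the choice of lower bound (0 for linear, smallest positive count for logarithmic) is an assumption rather than the package's documented behaviour.

```python
from typing import Dict, Iterable, Tuple


def shared_color_limits(
    hits_by_channel: Dict[int, int],
    channels_per_sensor: Iterable[Iterable[int]],
    log_z: bool = False,
) -> Tuple[float, float]:
    """Hypothetical helper: one (vmin, vmax) pair covering every sensor."""
    counts = [
        hits_by_channel.get(ch, 0)
        for channels in channels_per_sensor
        for ch in channels
    ]
    if log_z:
        # A logarithmic scale needs a strictly positive lower bound.
        positive = [n for n in counts if n > 0]
        vmin = float(min(positive)) if positive else 1.0
    else:
        vmin = 0.0
    vmax = float(max(counts)) if counts else 1.0
    return vmin, max(vmax, vmin)


hits = {0: 5, 1: 120, 2: 0, 3: 40}   # illustrative hit counts per SAMPIC channel
sensors = [[0, 1], [2, 3]]           # channels belonging to each sensor
print(shared_color_limits(hits, sensors))              # → (0.0, 120.0)
print(shared_color_limits(hits, sensors, log_z=True))  # → (5.0, 120.0)
```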