mmif.utils package

Package containing utility modules for handling different types of source documents, and general implementation of common data structures and algorithms.

Submodules

mmif.utils.sequence_helper module

This module provides helpers for handling sequence labeling. Specifically, it provides

  • a generalized label re-mapper for “post-binning” of labels

  • conversion from a list of CLAMS annotations (with classification props) into a list of reals (scores by labels), which can be combined with the label re-mapper mentioned above

  • mmif.utils.sequence_helper.smooth_outlying_short_intervals(): a simple smoothing algorithm by trimming “short” outlier sequences

However, it DOES NOT provide

  • direct conversion between CLAMS annotations. For example, it does not directly handle stitching of TimePoint annotations into TimeFrame annotations.

  • support for multi-class scenarios, such as handling of competing subsequences or overlapping labels.

Some functions can use optional external libraries (e.g., numpy) for better performance. Hence, if you see a warning about missing optional packages, you might want to install them by running pip install mmif-python[seq].

mmif.utils.sequence_helper.smooth_outlying_short_intervals(scores: List[float], min_spseq_size: int, min_snseq_size: int, min_score: float = 0.5)[source]

Given a list of scores, a score threshold, and smoothing parameters, identify the intervals of “positive” scores by “trimming” the short positive sequences (“spseq”) and short negative sequences (“snseq”). To decide positivity, the first step is to binarize the scores against the min_score threshold. Given Sr as the “raw” input list of real-number scores, and min_score=0.5,

Sr: [0.3, 0.6, 0.2, 0.8, 0.2, 0.9, 0.8, 0.5, 0.1, 0.5, 0.8, 0.3, 1.0, 0.7, 0.5, 0.5, 0.5, 0.8, 0.3, 0.6]

the binarization is done by simply comparing each score to the threshold to get S, a list of binary scores

1.0 :                                     |                      
0.9 :                |                    |                      
0.8 :          |     |  |           |     |              |       
0.7 :          |     |  |           |     |  |           |       
0.6 :    |     |     |  |           |     |  |           |     | 
0.5 :----+-----+-----+--+--+-----+--+-----+--+--+--+--+--+-----+-
0.4 :    |     |     |  |  |     |  |     |  |  |  |  |  |     | 
0.3 : |  |     |     |  |  |     |  |  |  |  |  |  |  |  |  |  | 
0.2 : |  |  |  |  |  |  |  |     |  |  |  |  |  |  |  |  |  |  | 
0.1 : |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
0.0 +------------------------------------------------------------
raw :.3 .6 .2 .8 .2 .9 .8 .5 .1 .5 .8 .3 1. .7 .5 .5 .5 .8 .3 .6
 S  : 0  1  0  1  0  1  1  0  0  0  1  0  1  1  0  1  1  1  0  1 

Note that the size of a positive or negative sequence can be as small as 1.
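The binarization step can be sketched in plain Python. This is only an illustration, not the library's code: the name binarize is hypothetical, and the comparison follows the min_score parameter description (only scores strictly less than the threshold count as negative).

```python
from typing import List


def binarize(scores: List[float], min_score: float = 0.5) -> List[int]:
    """Map each raw score to 1 (positive) or 0 (negative)."""
    # Assumption: scores strictly below the threshold are negative,
    # so a score equal to the threshold counts as positive.
    return [1 if score >= min_score else 0 for score in scores]


raw = [0.3, 0.6, 0.2, 0.8, 0.9, 0.1]
binary = binarize(raw)
print(binary)  # [0, 1, 0, 1, 1, 0]
```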

Then, here are examples of smoothing a list of binary scores into intervals, by trimming “very short” (under the thresholds) positive or negative sequences:

Note

legends:

  • t is unit index (e.g. time index)

  • S is the list of binary scores (zeros and ones)

  • I is the list of intervals after smoothing

  1. with params min_spseq_size==1, min_snseq_size==4

    t: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    S: [0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    I: [0, 1--1--1--1--1--1--1--1--1--1--1--1, 0--0--0--0--0--0, 1]
    

    Explanation: min_snseq_size is used to smooth short sequences of negative predictions. In this example, zeros from t[7:10] are smoothed into a “one” I, while zeros from t[13:19] are kept as a “zero” I. Note that “short” snseqs at either end (here, t[0:1]) are never smoothed.

  2. with params min_spseq_size==4, min_snseq_size==2

    t: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    S: [0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    I: [0, 1--1--1--1--1--1, 0--0--0--0--0--0--0--0--0--0--0--0--0]
    

    Explanation: min_spseq_size is used to smooth short sequences of positive predictions. In this example, the spseqs of ones at both t[10:13] and t[19:20] are smoothed. Note that “short” spseqs at either end (here, t[19:20]) are always smoothed.

  3. with params min_spseq_size==4, min_snseq_size==4

    t: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    S: [0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    I: [0, 1--1--1--1--1--1--1--1--1--1--1--1--0--0--0--0--0--0--0]
    

    Explanation: When two threshold parameters are working together, the algorithm will prioritize the smoothing of the snseqs over the smoothing of the spseqs. Thus, in this example, the snseq t[7:10] gets first smoothed “up” before the spseq t[10:13] is smoothed “down”, resulting in a long final I.

  4. with params min_spseq_size==4, min_snseq_size==4

    t: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    S: [1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1]
    I: [1--1--1--1--1--1--1, 0--0--0--0, 1--1--1--1--1--1--1--1--1]
    

    Explanation: Since smoothing of snseqs is prioritized, short spseqs at the beginning or the end can be kept.

  5. with params min_spseq_size==1, min_snseq_size==1

    t: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    S: [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1]
    I: [0--0--0, 1--1--1--1, 0--0--0--0, 1--1--1--1, 0--0--0, 1--1]
    

    Explanation: When both width thresholds are set to 1, the algorithm essentially works in “stitching”-only mode.

Parameters:
  • scores – SORTED list of scores to be smoothed (sorted by the start, and then by the end, of the anchors). The score list is assumed to “exhaust” the entire time or space of the underlying document segment.

  • min_spseq_size – minimum size of a positive sequence not to be smoothed (greater or equal to)

  • min_snseq_size – minimum size of a negative sequence not to be smoothed (greater or equal to)

  • min_score – minimum threshold to use to discard low-scored units (strictly less than)

Returns:

list of tuples of start(inclusive)/end(exclusive) indices of the “positive” sequences. Negative sequences (regardless of their size) are not included in the output.
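The behaviour described above can be approximated with a short pure-Python sketch. It is not the library's implementation: the name smooth_sketch and the two-pass structure (first flip short interior negative runs, then drop short positive runs) are assumptions chosen because they reproduce the five worked examples above. Edge cases outside those examples may differ from the real function.

```python
from itertools import groupby
from typing import List, Tuple


def smooth_sketch(scores: List[float], min_spseq_size: int,
                  min_snseq_size: int, min_score: float = 0.5
                  ) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs (end exclusive) of positive intervals."""
    binary = [1 if s >= min_score else 0 for s in scores]

    # Run-length encode into [value, start, end) triples.
    runs = []
    pos = 0
    for value, group in groupby(binary):
        length = len(list(group))
        runs.append([value, pos, pos + length])
        pos += length

    # Pass 1 (assumed priority): flip short negative runs that are not
    # at either end of the sequence.
    last = len(runs) - 1
    for i, run in enumerate(runs):
        value, start, end = run
        if value == 0 and 0 < i < last and end - start < min_snseq_size:
            run[0] = 1

    # Merge positive runs that are now adjacent.
    merged: List[Tuple[int, int]] = []
    for value, start, end in runs:
        if value != 1:
            continue
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))

    # Pass 2: drop positive intervals that are still too short.
    return [(s, e) for s, e in merged if e - s >= min_spseq_size]


S = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(smooth_sketch(S, min_spseq_size=4, min_snseq_size=4))  # [(1, 13)]
```

The printed result corresponds to example 3: the interval covers t[1:13], and the trailing single positive at t[19:20] is dropped.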

mmif.utils.sequence_helper.validate_labelset(annotations: Iterable[Annotation]) List[str][source]

Simple check for a list of annotations to see if they have the same label set.

Raises:
  • AttributeError – if an element in the input list doesn’t have the labelset property

  • ValueError – if different labelset values are found

Returns:

a list of the common labelset value (list of label names)

mmif.utils.sequence_helper.build_label_remapper(src_labels: List[str], dst_labels: Dict[str, str | int | float | bool | None]) Dict[str, str | int | float | bool | None][source]

Build a label remapper dictionary from source and destination labels.

Parameters:
  • src_labels – a list of all labels on the source side

  • dst_labels – a dict from source labels to destination labels. Source labels not in this dict will be remapped to a negative label (-).

Returns:

a dict that exhaustively maps source labels to destination labels
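As a rough illustration of the mapping described above, the following sketch fills in the negative label for every source label that is absent from dst_labels. The function name build_remapper_sketch, the NEGATIVE_LABEL constant, and the example label names are all hypothetical; only the fallback to "-" comes from the parameter description.

```python
from typing import Dict, List

NEGATIVE_LABEL = "-"  # spelling taken from the parameter description


def build_remapper_sketch(src_labels: List[str],
                          dst_labels: Dict[str, str]) -> Dict[str, str]:
    """Map every source label to a destination label, defaulting to '-'."""
    return {label: dst_labels.get(label, NEGATIVE_LABEL) for label in src_labels}


# Hypothetical label names, for illustration only.
remapper = build_remapper_sketch(
    ["bars", "slate", "chyron", "credits"],
    {"slate": "slate", "chyron": "text", "credits": "text"},
)
print(remapper)
# {'bars': '-', 'slate': 'slate', 'chyron': 'text', 'credits': 'text'}
```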

mmif.utils.sequence_helper.build_score_lists(classifications: List[Dict], label_remapper: Dict, score_remap_op: Callable[[...], float] = max) Tuple[Dict[str, int], numpy.ndarray][source]

Build lists of scores indexed by the label names.

Parameters:
  • classifications – list of dictionaries of classification results, taken from input annotation objects

  • label_remapper – a dictionary that maps source label names to destination label names (formerly “postbin”)

  • score_remap_op – a function to remap the scores from multiple source labels binned to a destination label; common choices are max, min, or sum

Returns:

  1. a dictionary that maps label names to their index in the score list

  2. a 2-d numpy array of scores, whose rows are indexed by the label map dict (the first return value)
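The binning of scores can be sketched without numpy as follows. This is an assumption-laden illustration rather than the library's code: the name build_score_lists_sketch is hypothetical, nested lists stand in for the 2-d array, rows are ordered by first appearance of each destination label, and score_remap_op is applied to the list of source scores that fall into the same destination label.

```python
from typing import Callable, Dict, List, Tuple


def build_score_lists_sketch(
    classifications: List[Dict[str, float]],
    label_remapper: Dict[str, str],
    score_remap_op: Callable[[List[float]], float] = max,
) -> Tuple[Dict[str, int], List[List[float]]]:
    # Assign a row index to each destination label in first-seen order.
    label_index: Dict[str, int] = {}
    for dst in label_remapper.values():
        label_index.setdefault(dst, len(label_index))

    # One row per destination label, one column per classification.
    rows = [[0.0] * len(classifications) for _ in label_index]
    for col, classification in enumerate(classifications):
        binned: Dict[str, List[float]] = {}
        for src, score in classification.items():
            binned.setdefault(label_remapper[src], []).append(score)
        for dst, scores in binned.items():
            rows[label_index[dst]][col] = score_remap_op(scores)
    return label_index, rows


# Hypothetical labels and scores, for illustration only.
remapper = {"chyron": "text", "credits": "text", "bars": "-"}
classifications = [
    {"chyron": 0.7, "credits": 0.2, "bars": 0.1},
    {"chyron": 0.1, "credits": 0.3, "bars": 0.6},
]
label_index, rows = build_score_lists_sketch(classifications, remapper)
print(label_index)  # {'text': 0, '-': 1}
print(rows)         # [[0.7, 0.3], [0.1, 0.6]]
```

A row of this result is a per-unit score list for one label, which is the shape of input that smooth_outlying_short_intervals() expects.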

mmif.utils.text_document_helper module

mmif.utils.text_document_helper.slice_text(mmif_obj, start: int, end: int, unit: str = 'milliseconds') str[source]

Extracts text from tokens within a specified time range.

Parameters:
  • mmif_obj – MMIF object to search for tokens

  • start – start time point

  • end – end time point

  • unit – time unit for start and end parameters (default: “milliseconds”)

Returns:

space-separated string of token words found in the time range
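The idea of time-based text slicing can be illustrated with plain tuples standing in for time-aligned tokens. Everything here is hypothetical: the (word, start, end) tuple layout, the name slice_text_sketch, and the inclusion rule (a token is kept only when its whole span lies inside the range) are assumptions, and the real function works on an MMIF object instead.

```python
from typing import List, Tuple

# Hypothetical stand-in for a time-aligned token: (word, start_ms, end_ms).
Token = Tuple[str, int, int]


def slice_text_sketch(tokens: List[Token], start: int, end: int) -> str:
    """Join the words of tokens whose span lies entirely inside [start, end]."""
    selected = [
        word for word, t_start, t_end in tokens
        if start <= t_start and t_end <= end
    ]
    return " ".join(selected)


tokens = [("hello", 0, 400), ("world", 500, 900), ("again", 1000, 1400)]
print(slice_text_sketch(tokens, 0, 1000))  # hello world
```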

mmif.utils.timeunit_helper module

mmif.utils.timeunit_helper.convert(t: int | float | str, in_unit: str, out_unit: str, fps: float) int | float | str[source]

Converts time from one unit to another. Works with frames, seconds, milliseconds.

Parameters:
  • t – time value to convert

  • in_unit – input time unit, one of frames, seconds, milliseconds

  • out_unit – output time unit, one of frames, seconds, milliseconds

  • fps – frames per second

Returns:

converted time value
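The arithmetic behind such a conversion can be sketched as below. This is not the library's implementation: the name convert_sketch is hypothetical, string inputs are not handled, and rounding to the nearest frame is an assumption.

```python
from typing import Union

Number = Union[int, float]


def convert_sketch(t: Number, in_unit: str, out_unit: str, fps: float) -> Number:
    """Convert between frames, seconds, and milliseconds via seconds."""
    # Normalize the input value to seconds first.
    if in_unit == "frames":
        seconds = t / fps
    elif in_unit == "milliseconds":
        seconds = t / 1000
    elif in_unit == "seconds":
        seconds = float(t)
    else:
        raise ValueError(f"unsupported input unit: {in_unit}")

    # Then convert seconds to the requested output unit.
    if out_unit == "frames":
        return round(seconds * fps)  # assumption: nearest whole frame
    if out_unit == "milliseconds":
        return seconds * 1000
    if out_unit == "seconds":
        return seconds
    raise ValueError(f"unsupported output unit: {out_unit}")


print(convert_sketch(90, "frames", "seconds", fps=30))          # 3.0
print(convert_sketch(1500, "milliseconds", "frames", fps=30))   # 45
```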

mmif.utils.video_document_helper module

class mmif.utils.video_document_helper.SamplingMode(value)[source]

Bases: Enum

Determines how timepoints are selected from a TimeFrame.

REPRESENTATIVES = 'representatives'
SINGLE = 'single'
ALL = 'all'

mmif.utils.video_document_helper.capture(video_document: Document)[source]

Captures a video file using OpenCV and adds fps, frame count, and duration as properties to the document.

Parameters:

video_document – Document instance that holds a video document ("@type": ".../VideoDocument/...")

Returns:

OpenCV VideoCapture object

mmif.utils.video_document_helper.get_framerate(video_document: Document) float[source]

Gets the frame rate of a video document. It first checks the fps property of the document, then falls back to capturing the video.

Parameters:

video_document – Document instance that holds a video document ("@type": ".../VideoDocument/...")

Returns:

frames per second as a float, rounded to 2 decimal places

mmif.utils.video_document_helper.extract_frames_as_images(video_document: Document, framenums: Iterable[int], as_PIL: bool = False, record_ffmpeg_errors: bool = False)[source]

Extracts frames from a video document as a list of numpy.ndarray. Use the sample_frames() function first to get the list of frame numbers.

Parameters:
  • video_document – Document instance that holds a video document ("@type": ".../VideoDocument/...")

  • framenums – iterable integers representing the frame numbers to extract

  • as_PIL – return PIL.Image.Image instead of ndarray

  • record_ffmpeg_errors – if True, records and warns about FFmpeg stderr output during extraction

Returns:

frames as a list of ndarray or Image

mmif.utils.video_document_helper.get_mid_framenum(mmif: Mmif, time_frame: Annotation) int[source]

Deprecated: Use extract_frames_by_mode() instead.

mmif.utils.video_document_helper.extract_mid_frame(mmif: Mmif, time_frame: Annotation, as_PIL: bool = False)[source]

Deprecated: Use extract_frames_by_mode() instead.

Extracts the middle frame of a time interval annotation as a numpy ndarray.

Parameters:
  • mmif – Mmif instance

  • time_frame – Annotation instance that holds a time interval annotation ("@type": ".../TimeFrame/...")

  • as_PIL – return Image instead of ndarray

Returns:

frame as a numpy.ndarray or PIL.Image.Image

mmif.utils.video_document_helper.get_representative_framenums(mmif: Mmif, time_frame: Annotation) List[int][source]

Deprecated: Use extract_frames_by_mode() instead.

Calculates the representative frame numbers from an annotation. To pick the representative frames, it first looks up the representatives property of the TimeFrame annotation. If it is not found, it will calculate the number of the middle frame.

Parameters:
  • mmif – Mmif instance

  • time_frame – Annotation instance that holds a time interval annotation containing a representatives property ("@type": ".../TimeFrame/...")

Returns:

representative frame numbers as a list of integers

mmif.utils.video_document_helper.get_representative_framenum(mmif: Mmif, time_frame: Annotation) int[source]

Deprecated: Use extract_frames_by_mode() instead.

A thin wrapper around get_representative_framenums() to return a single representative frame number. Always returns the first frame number found.

mmif.utils.video_document_helper.extract_representative_frame(mmif: Mmif, time_frame: Annotation, as_PIL: bool = False, first_only: bool = True)[source]

Deprecated: Use extract_frames_by_mode() instead.

Extracts the representative frame of an annotation as a numpy ndarray or PIL Image.

Parameters:
  • mmif – Mmif instance

  • time_frame – Annotation instance that holds a time interval annotation ("@type": ".../TimeFrame/...")

  • as_PIL – return Image instead of ndarray

  • first_only – return the first representative frame only

Returns:

frame as a numpy.ndarray or PIL.Image.Image

mmif.utils.video_document_helper.extract_target_frames(mmif: Mmif, annotation: Annotation, min_timepoints: int = 0, max_timepoints: int = 9223372036854775807, fraction: float = 1.0, as_PIL: bool = False)[source]

Extracts frames corresponding to the timepoints listed in the targets property of an annotation. Selection of timepoints is based on minimum, maximum, and fraction of targets to include.

Parameters:
  • mmif – Mmif instance

  • annotation – Annotation instance containing a targets property

  • min_timepoints – minimum number of timepoints to include

  • max_timepoints – maximum number of timepoints to include

  • fraction – fraction of targets to include (ideally)

  • as_PIL – return Image instead of ndarray

Returns:

a tuple containing (list of frames, list of selected target IDs)

mmif.utils.video_document_helper.extract_frames_by_mode(mmif: Mmif, time_frame: Annotation, mode: SamplingMode | None = None, as_PIL: bool = False) List[source]

Extracts frames from a TimeFrame annotation based on a sampling mode. If mode is not specified, uses the context-level default (set via _sampling_mode context variable).

Parameters:
  • mmif – Mmif instance

  • time_frame – TimeFrame annotation to sample from

  • mode – SamplingMode, or None to use the context default

  • as_PIL – return PIL Images instead of ndarrays

Returns:

list of frames (may be empty for REPRESENTATIVES mode when no representatives exist)

mmif.utils.video_document_helper.sample_frames(start_frame: int, end_frame: int, sample_rate: float = 1) List[int][source]

Helper function to sample frames from a time interval. Can also be used as a “cutoff” function when used with start_frame==0 and sample_rate==1.

Parameters:
  • start_frame – start frame of the interval

  • end_frame – end frame of the interval

  • sample_rate – sampling rate (or step) to configure how often to take a frame, default is 1, meaning all consecutive frames are sampled

Returns:

list of frame numbers to extract
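The sampling described above can be sketched as follows. The name sample_frames_sketch is hypothetical, and two details are assumptions rather than documented behaviour: end_frame is treated as exclusive, and fractional positions produced by a non-integer sample_rate are rounded up to the next frame.

```python
import math
from typing import List


def sample_frames_sketch(start_frame: int, end_frame: int,
                         sample_rate: float = 1) -> List[int]:
    """Step through [start_frame, end_frame) and collect frame numbers."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    frame_nums = []
    position = float(start_frame)
    while position < end_frame:  # assumption: end_frame is exclusive
        frame_nums.append(math.ceil(position))
        position += sample_rate
    return frame_nums


print(sample_frames_sketch(0, 10, 3))  # [0, 3, 6, 9]
print(sample_frames_sketch(0, 5))      # [0, 1, 2, 3, 4]
```

The second call shows the “cutoff” use: with start_frame==0 and sample_rate==1, the result is simply every frame number before end_frame.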

mmif.utils.video_document_helper.get_annotation_property(mmif, annotation, prop_name)[source]

Deprecated since version 1.0.8: Will be removed in 2.0.0. Use mmif.serialize.annotation.Annotation.get_property() method instead.

Get a property value from an annotation. If the property is not found in the annotation, it will look up the metadata of the annotation’s parent view and return the value from there.

Parameters:
  • mmif – MMIF object containing the annotation

  • annotation – Annotation object to get property from

  • prop_name – name of the property to retrieve

Returns:

the property value

mmif.utils.video_document_helper.convert_timepoint(mmif: Mmif, timepoint: Annotation, out_unit: str) int | float | str[source]

Converts a time point included in an annotation to a different time unit. The input annotation must have timePoint property.

Parameters:
  • mmif – input MMIF to obtain fps and input timeunit

  • timepoint – Annotation instance with timePoint property

  • out_unit – time unit to which the point is converted (frames, seconds, milliseconds)

Returns:

frame number (integer) or second/millisecond (float) of input timepoint

mmif.utils.video_document_helper.convert_timeframe(mmif: Mmif, time_frame: Annotation, out_unit: str) Tuple[int | float | str, int | float | str][source]

Converts start and end points in a TimeFrame annotation to a different time unit.

Parameters:
  • mmif – Mmif instance

  • time_frame – Annotation instance that holds a time interval annotation ("@type": ".../TimeFrame/...")

  • out_unit – time unit to which the point is converted

Returns:

tuple of frame numbers, seconds/milliseconds, or ISO notation of TimeFrame’s start and end

mmif.utils.video_document_helper.framenum_to_second(video_doc: Document, frame: int)[source]

Converts a frame number to a second value.

mmif.utils.video_document_helper.framenum_to_millisecond(video_doc: Document, frame: int)[source]

Converts a frame number to a millisecond value.

mmif.utils.video_document_helper.second_to_framenum(video_doc: Document, second) int[source]

Converts a second value to a frame number.

mmif.utils.video_document_helper.millisecond_to_framenum(video_doc: Document, millisecond: float) int[source]

Converts a millisecond value to a frame number.

mmif.utils.workflow_helper module

mmif.utils.workflow_helper.group_views_by_app(views: ViewsList) List[List[Any]][source]

Groups views into app executions based on app and timestamp.

An “app execution” is a set of views produced by the same app at the exact same timestamp.
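The grouping rule can be illustrated with plain dictionaries standing in for views. This is a sketch under stated assumptions, not the library's code: the name group_views_sketch, the dict keys, and the example app URIs and timestamps are hypothetical, and groups are returned in order of first appearance.

```python
from typing import Dict, List, Tuple


def group_views_sketch(views: List[Dict[str, str]]) -> List[List[Dict[str, str]]]:
    """Group views that share the same (app, timestamp) pair."""
    groups: Dict[Tuple[str, str], List[Dict[str, str]]] = {}
    for view in views:
        key = (view["app"], view["timestamp"])
        groups.setdefault(key, []).append(view)
    # dicts preserve insertion order, so groups follow first appearance
    return list(groups.values())


# Hypothetical views, for illustration only.
views = [
    {"id": "v_0", "app": "http://apps.example.org/app-a/v1", "timestamp": "2024-01-01T00:00:00"},
    {"id": "v_1", "app": "http://apps.example.org/app-a/v1", "timestamp": "2024-01-01T00:00:00"},
    {"id": "v_2", "app": "http://apps.example.org/app-b/v2", "timestamp": "2024-01-01T00:05:00"},
]
grouped = group_views_sketch(views)
print([[v["id"] for v in group] for group in grouped])
# [['v_0', 'v_1'], ['v_2']]
```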

mmif.utils.workflow_helper.generate_param_hash(params: dict) str[source]

Generate MD5 hash from a parameter dictionary.

Parameters are sorted alphabetically, joined as key=value pairs, and hashed using MD5. This is not for security purposes, only for generating consistent identifiers.

Parameters:

params – Dictionary of parameters

Returns:

MD5 hash string (32 hex characters)
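A minimal sketch of the hashing steps described above follows. The name param_hash_sketch is hypothetical, and the "&" delimiter between key=value pairs is an assumption, since the text does not say how the pairs are joined. The digests it produces are therefore not guaranteed to match the library's.

```python
import hashlib


def param_hash_sketch(params: dict) -> str:
    """Hash a parameter dict in a way that ignores key order."""
    # Sort keys alphabetically and render each pair as key=value.
    pairs = [f"{key}={params[key]}" for key in sorted(params)]
    joined = "&".join(pairs)  # assumption: delimiter is not stated in the text
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical parameters, for illustration only.
digest = param_hash_sketch({"threshold": 0.5, "model": "base"})
print(len(digest))  # 32
```

Because the keys are sorted before joining, two dictionaries with the same contents in a different insertion order yield the same identifier.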

mmif.utils.workflow_helper.generate_workflow_identifier(mmif_input: str | Path | Mmif, return_param_dicts: Literal[True]) Tuple[str, List[dict]][source]
mmif.utils.workflow_helper.generate_workflow_identifier(mmif_input: str | Path | Mmif, return_param_dicts: Literal[False] = False) str

Generate a workflow identifier string from an MMIF file or object.

The identifier follows the storage directory structure format: app_name/version/param_hash/app_name2/version2/param_hash2/…

Uses view.metadata.parameters (raw user-passed values) for hashing to ensure reproducibility. Views with errors or warnings are excluded from the identifier; empty views are included.

Parameters:
  • mmif_input – Path to MMIF file (str or Path) or a Mmif object

  • return_param_dicts – If True, also return the parameter dictionaries

Returns:

Workflow identifier string, or tuple of (identifier, param_dicts) if return_param_dicts=True
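The identifier layout (app_name/version/param_hash repeated per app) can be sketched as below. The names workflow_identifier_sketch and _param_hash, the tuple input format, the "&" delimiter inside the hash, and the example app names are all assumptions for illustration; the real function reads these values from the views of an MMIF file.

```python
import hashlib
from typing import List, Tuple


def _param_hash(params: dict) -> str:
    # Assumed hashing scheme: sorted key=value pairs joined with "&".
    joined = "&".join(f"{key}={params[key]}" for key in sorted(params))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def workflow_identifier_sketch(executions: List[Tuple[str, str, dict]]) -> str:
    """Join (app_name, version, params) triples into a path-like identifier."""
    segments: List[str] = []
    for app_name, version, params in executions:
        segments.extend([app_name, version, _param_hash(params)])
    return "/".join(segments)


# Hypothetical app names, versions, and parameters.
identifier = workflow_identifier_sketch([
    ("app-a", "v1.0", {"threshold": 0.5}),
    ("app-b", "v2.1", {}),
])
print(identifier.count("/"))  # 5
```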

pydantic model mmif.utils.workflow_helper.SingleMmifStats[source]

Bases: BaseModel

Aggregated statistics for a single MMIF file.

JSON schema:
{
   "title": "SingleMmifStats",
   "description": "Aggregated statistics for a single MMIF file.",
   "type": "object",
   "properties": {
      "appCount": {
         "description": "Total number of app executions identified.",
         "title": "Appcount",
         "type": "integer"
      },
      "errorViews": {
         "description": "List of view IDs that contain errors.",
         "items": {
            "type": "string"
         },
         "title": "Errorviews",
         "type": "array"
      },
      "warningViews": {
         "description": "List of view IDs that contain warnings.",
         "items": {
            "type": "string"
         },
         "title": "Warningviews",
         "type": "array"
      },
      "emptyViews": {
         "description": "List of view IDs that contain no annotations.",
         "items": {
            "type": "string"
         },
         "title": "Emptyviews",
         "type": "array"
      },
      "annotationCountByType": {
         "additionalProperties": {
            "type": "integer"
         },
         "description": "Total annotation counts across the file.",
         "title": "Annotationcountbytype",
         "type": "object"
      }
   },
   "required": [
      "appCount"
   ]
}

Fields:
field annotation_count_by_type: Dict[str, int] [Optional] (alias 'annotationCountByType')

Total annotation counts across the file.

field app_count: int [Required] (alias 'appCount')

Total number of app executions identified.

field empty_views: List[str] [Optional] (alias 'emptyViews')

List of view IDs that contain no annotations.

field error_views: List[str] [Optional] (alias 'errorViews')

List of view IDs that contain errors.

field warning_views: List[str] [Optional] (alias 'warningViews')

List of view IDs that contain warnings.

pydantic model mmif.utils.workflow_helper.AppProfiling[source]

Bases: BaseModel

Profiling data for a single app execution.

JSON schema:
{
   "title": "AppProfiling",
   "description": "Profiling data for a single app execution.",
   "type": "object",
   "properties": {
      "runningTimeMS": {
         "anyOf": [
            {
               "type": "integer"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Execution time in milliseconds.",
         "title": "Runningtimems"
      }
   }
}

Fields:
field running_time_ms: int | None = None (alias 'runningTimeMS')

Execution time in milliseconds.

pydantic model mmif.utils.workflow_helper.AppExecution[source]

Bases: BaseModel

Represents a single execution of an app, which may produce multiple views.

JSON schema:
{
   "title": "AppExecution",
   "description": "Represents a single execution of an app, which may produce multiple views.",
   "type": "object",
   "properties": {
      "app": {
         "description": "The URI of the app.",
         "title": "App",
         "type": "string"
      },
      "viewIds": {
         "description": "List of view IDs generated by this execution.",
         "items": {
            "type": "string"
         },
         "title": "Viewids",
         "type": "array"
      },
      "appConfiguration": {
         "additionalProperties": true,
         "description": "Configuration parameters used for this execution.",
         "title": "Appconfiguration",
         "type": "object"
      },
      "appProfiling": {
         "$ref": "#/$defs/AppProfiling",
         "description": "Profiling data for this execution."
      },
      "annotationCountByType": {
         "additionalProperties": {
            "type": "integer"
         },
         "description": "Counts of annotations produced, grouped by type.",
         "title": "Annotationcountbytype",
         "type": "object"
      }
   },
   "$defs": {
      "AppProfiling": {
         "description": "Profiling data for a single app execution.",
         "properties": {
            "runningTimeMS": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Execution time in milliseconds.",
               "title": "Runningtimems"
            }
         },
         "title": "AppProfiling",
         "type": "object"
      }
   },
   "required": [
      "app",
      "viewIds"
   ]
}

Fields:
field annotation_count_by_type: Dict[str, int] [Optional] (alias 'annotationCountByType')

Counts of annotations produced, grouped by type.

field app: str [Required]

The URI of the app.

field app_configuration: Dict [Optional] (alias 'appConfiguration')

Configuration parameters used for this execution.

field app_profiling: AppProfiling [Optional] (alias 'appProfiling')

Profiling data for this execution.

field view_ids: List[str] [Required] (alias 'viewIds')

List of view IDs generated by this execution.

pydantic model mmif.utils.workflow_helper.SingleMmifDesc[source]

Bases: BaseModel

Description of a workflow extracted from a single MMIF file.

JSON schema:
{
   "title": "SingleMmifDesc",
   "description": "Description of a workflow extracted from a single MMIF file.",
   "type": "object",
   "properties": {
      "workflowId": {
         "description": "Unique identifier for the workflow structure.",
         "title": "Workflowid",
         "type": "string"
      },
      "stats": {
         "$ref": "#/$defs/SingleMmifStats",
         "description": "Statistics about the views and annotations."
      },
      "apps": {
         "description": "Sequence of app executions in the workflow.",
         "items": {
            "$ref": "#/$defs/AppExecution"
         },
         "title": "Apps",
         "type": "array"
      }
   },
   "$defs": {
      "AppExecution": {
         "description": "Represents a single execution of an app, which may produce multiple views.",
         "properties": {
            "app": {
               "description": "The URI of the app.",
               "title": "App",
               "type": "string"
            },
            "viewIds": {
               "description": "List of view IDs generated by this execution.",
               "items": {
                  "type": "string"
               },
               "title": "Viewids",
               "type": "array"
            },
            "appConfiguration": {
               "additionalProperties": true,
               "description": "Configuration parameters used for this execution.",
               "title": "Appconfiguration",
               "type": "object"
            },
            "appProfiling": {
               "$ref": "#/$defs/AppProfiling",
               "description": "Profiling data for this execution."
            },
            "annotationCountByType": {
               "additionalProperties": {
                  "type": "integer"
               },
               "description": "Counts of annotations produced, grouped by type.",
               "title": "Annotationcountbytype",
               "type": "object"
            }
         },
         "required": [
            "app",
            "viewIds"
         ],
         "title": "AppExecution",
         "type": "object"
      },
      "AppProfiling": {
         "description": "Profiling data for a single app execution.",
         "properties": {
            "runningTimeMS": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Execution time in milliseconds.",
               "title": "Runningtimems"
            }
         },
         "title": "AppProfiling",
         "type": "object"
      },
      "SingleMmifStats": {
         "description": "Aggregated statistics for a single MMIF file.",
         "properties": {
            "appCount": {
               "description": "Total number of app executions identified.",
               "title": "Appcount",
               "type": "integer"
            },
            "errorViews": {
               "description": "List of view IDs that contain errors.",
               "items": {
                  "type": "string"
               },
               "title": "Errorviews",
               "type": "array"
            },
            "warningViews": {
               "description": "List of view IDs that contain warnings.",
               "items": {
                  "type": "string"
               },
               "title": "Warningviews",
               "type": "array"
            },
            "emptyViews": {
               "description": "List of view IDs that contain no annotations.",
               "items": {
                  "type": "string"
               },
               "title": "Emptyviews",
               "type": "array"
            },
            "annotationCountByType": {
               "additionalProperties": {
                  "type": "integer"
               },
               "description": "Total annotation counts across the file.",
               "title": "Annotationcountbytype",
               "type": "object"
            }
         },
         "required": [
            "appCount"
         ],
         "title": "SingleMmifStats",
         "type": "object"
      }
   },
   "required": [
      "workflowId",
      "stats",
      "apps"
   ]
}

Fields:
field apps: List[AppExecution] [Required]

Sequence of app executions in the workflow.

field stats: SingleMmifStats [Required]

Statistics about the views and annotations.

field workflow_id: str [Required] (alias 'workflowId')

Unique identifier for the workflow structure.

mmif.utils.workflow_helper.describe_single_mmif(mmif_input: str | Path | Mmif) dict[source]

Reads an MMIF file or object and extracts the workflow specification from it.

This function provides an app-centric summarization of the workflow. The conceptual hierarchy is that a workflow is a sequence of apps, and each app execution can produce one or more views. This function groups views that share the same app and metadata.timestamp into a single logical “app execution”.

Note

For MMIF files generated by apps based on clams-python <= 1.3.3, all views are independently timestamped. This means that even if multiple views were generated by a single execution of an app, their metadata.timestamp values will be unique. As a result, the grouping logic will treat each view as a separate app execution. The change that aligns timestamps for views from a single app execution is implemented in clams-python PR #271.

The output is a serialized SingleMmifDesc object.

Parameters:

mmif_input – Path to MMIF file (str or Path) or a Mmif object

Returns:

A dictionary containing the workflow specification.

pydantic model mmif.utils.workflow_helper.AppProfilingStats[source]

Bases: BaseModel

Aggregated profiling statistics for an app across a workflow.

JSON schema:
{
   "title": "AppProfilingStats",
   "description": "Aggregated profiling statistics for an app across a workflow.",
   "type": "object",
   "properties": {
      "avgRunningTimeMS": {
         "anyOf": [
            {
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Average execution time in milliseconds.",
         "title": "Avgrunningtimems"
      },
      "minRunningTimeMS": {
         "anyOf": [
            {
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Minimum execution time in milliseconds.",
         "title": "Minrunningtimems"
      },
      "maxRunningTimeMS": {
         "anyOf": [
            {
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Maximum execution time in milliseconds.",
         "title": "Maxrunningtimems"
      },
      "stdevRunningTimeMS": {
         "anyOf": [
            {
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Standard deviation of execution time.",
         "title": "Stdevrunningtimems"
      }
   }
}

Fields:
field avg_running_time_ms: float | None = None (alias 'avgRunningTimeMS')

Average execution time in milliseconds.

field max_running_time_ms: float | None = None (alias 'maxRunningTimeMS')

Maximum execution time in milliseconds.

field min_running_time_ms: float | None = None (alias 'minRunningTimeMS')

Minimum execution time in milliseconds.

field stdev_running_time_ms: float | None = None (alias 'stdevRunningTimeMS')

Standard deviation of execution time.

pydantic model mmif.utils.workflow_helper.WorkflowAppExecution[source]

Bases: BaseModel

Aggregated information about an app’s usage within a specific workflow across multiple files.

JSON schema:
{
   "title": "WorkflowAppExecution",
   "description": "Aggregated information about an app's usage within a specific workflow across multiple files.",
   "type": "object",
   "properties": {
      "app": {
         "description": "The URI of the app.",
         "title": "App",
         "type": "string"
      },
      "appConfiguration": {
         "additionalProperties": true,
         "description": "Representative configuration (usually from the first occurrence).",
         "title": "Appconfiguration",
         "type": "object"
      },
      "appProfiling": {
         "$ref": "#/$defs/AppProfilingStats",
         "description": "Aggregated profiling statistics."
      }
   },
   "$defs": {
      "AppProfilingStats": {
         "description": "Aggregated profiling statistics for an app across a workflow.",
         "properties": {
            "avgRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Average execution time in milliseconds.",
               "title": "Avgrunningtimems"
            },
            "minRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Minimum execution time in milliseconds.",
               "title": "Minrunningtimems"
            },
            "maxRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Maximum execution time in milliseconds.",
               "title": "Maxrunningtimems"
            },
            "stdevRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Standard deviation of execution time.",
               "title": "Stdevrunningtimems"
            }
         },
         "title": "AppProfilingStats",
         "type": "object"
      }
   },
   "required": [
      "app"
   ]
}

Fields:
field app: str [Required]

The URI of the app.

field app_configuration: Dict [Optional] (alias 'appConfiguration')

Representative configuration (usually from the first occurrence).

field app_profiling: AppProfilingStats [Optional] (alias 'appProfiling')

Aggregated profiling statistics.

pydantic model mmif.utils.workflow_helper.WorkflowCollectionEntry[source]

Bases: BaseModel

Summary of a unique workflow found within a collection.

JSON schema:
{
   "title": "WorkflowCollectionEntry",
   "description": "Summary of a unique workflow found within a collection.",
   "type": "object",
   "properties": {
      "workflowId": {
         "description": "Unique identifier for the workflow.",
         "title": "Workflowid",
         "type": "string"
      },
      "mmifs": {
         "description": "List of filenames belonging to this workflow.",
         "items": {
            "type": "string"
         },
         "title": "Mmifs",
         "type": "array"
      },
      "mmifCount": {
         "description": "Number of MMIF files matching this workflow.",
         "title": "Mmifcount",
         "type": "integer"
      },
      "apps": {
         "description": "Sequence of apps in this workflow with aggregated stats.",
         "items": {
            "$ref": "#/$defs/WorkflowAppExecution"
         },
         "title": "Apps",
         "type": "array"
      }
   },
   "$defs": {
      "AppProfilingStats": {
         "description": "Aggregated profiling statistics for an app across a workflow.",
         "properties": {
            "avgRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Average execution time in milliseconds.",
               "title": "Avgrunningtimems"
            },
            "minRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Minimum execution time in milliseconds.",
               "title": "Minrunningtimems"
            },
            "maxRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Maximum execution time in milliseconds.",
               "title": "Maxrunningtimems"
            },
            "stdevRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Standard deviation of execution time.",
               "title": "Stdevrunningtimems"
            }
         },
         "title": "AppProfilingStats",
         "type": "object"
      },
      "WorkflowAppExecution": {
         "description": "Aggregated information about an app's usage within a specific workflow across multiple files.",
         "properties": {
            "app": {
               "description": "The URI of the app.",
               "title": "App",
               "type": "string"
            },
            "appConfiguration": {
               "additionalProperties": true,
               "description": "Representative configuration (usually from the first occurrence).",
               "title": "Appconfiguration",
               "type": "object"
            },
            "appProfiling": {
               "$ref": "#/$defs/AppProfilingStats",
               "description": "Aggregated profiling statistics."
            }
         },
         "required": [
            "app"
         ],
         "title": "WorkflowAppExecution",
         "type": "object"
      }
   },
   "required": [
      "workflowId",
      "mmifs",
      "mmifCount",
      "apps"
   ]
}

Fields:
field apps: List[WorkflowAppExecution] [Required]

Sequence of apps in this workflow with aggregated stats.

field mmif_count: int [Required] (alias 'mmifCount')

Number of MMIF files matching this workflow.

field mmifs: List[str] [Required]

List of filenames belonging to this workflow.

field workflow_id: str [Required] (alias 'workflowId')

Unique identifier for the workflow.
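All four fields of this model are required. A minimal sketch of checking a serialized entry against that list, using only the standard library, might look like the following. The helper missing_required_keys and the sample values (workflow identifier, filenames, app URI) are made up for illustration; in practice the pydantic model itself performs this validation.

```python
from typing import Any, Dict, List

# Required keys copied from the "required" list in the JSON schema above.
REQUIRED_KEYS = ("workflowId", "mmifs", "mmifCount", "apps")


def missing_required_keys(entry: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: list the required keys absent from a serialized entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


# Hand-written samples; none of these values come from a real collection.
complete_entry = {
    "workflowId": "example-workflow",
    "mmifs": ["a.mmif", "b.mmif"],
    "mmifCount": 2,
    "apps": [{"app": "http://example.org/app/v1"}],
}
incomplete_entry = {"workflowId": "example-workflow"}

print(missing_required_keys(complete_entry))
print(missing_required_keys(incomplete_entry))
```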

pydantic model mmif.utils.workflow_helper.MmifCountByStatus[source]

Bases: BaseModel

Breakdown of MMIF files in a collection by their processing status.

JSON schema:
{
   "title": "MmifCountByStatus",
   "description": "Breakdown of MMIF files in a collection by their processing status.",
   "type": "object",
   "properties": {
      "total": {
         "description": "Total number of MMIF files found.",
         "title": "Total",
         "type": "integer"
      },
      "successful": {
         "description": "Number of files processed without errors.",
         "title": "Successful",
         "type": "integer"
      },
      "withErrors": {
         "description": "Number of files containing error views.",
         "title": "Witherrors",
         "type": "integer"
      },
      "withWarnings": {
         "description": "Number of files containing warning views.",
         "title": "Withwarnings",
         "type": "integer"
      },
      "invalid": {
         "description": "Number of files that failed to parse as valid MMIF.",
         "title": "Invalid",
         "type": "integer"
      }
   },
   "required": [
      "total",
      "successful",
      "withErrors",
      "withWarnings",
      "invalid"
   ]
}

Fields:
field invalid: int [Required]

Number of files that failed to parse as valid MMIF.

field successful: int [Required]

Number of files processed without errors.

field total: int [Required]

Total number of MMIF files found.

field with_errors: int [Required] (alias 'withErrors')

Number of files containing error views.

field with_warnings: int [Required] (alias 'withWarnings')

Number of files containing warning views.
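One way these counts might be consumed is to derive a success rate from a serialized MmifCountByStatus dictionary. The sketch below is only an illustration: success_rate is a hypothetical helper, the sample numbers are invented, and it deliberately makes no assumption about whether the status categories overlap or sum to the total.

```python
from typing import Dict, Optional


def success_rate(counts: Dict[str, int]) -> Optional[float]:
    """Hypothetical helper: fraction of files processed without errors.

    Returns None for an empty collection to avoid dividing by zero.
    """
    total = counts["total"]
    if total == 0:
        return None
    return counts["successful"] / total


# Invented sample using the alias keys from the JSON schema above.
counts = {"total": 8, "successful": 6, "withErrors": 1, "withWarnings": 2, "invalid": 1}
rate = success_rate(counts)
print(rate)
```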

pydantic model mmif.utils.workflow_helper.CollectionMmifDesc[source]

Bases: BaseModel

Summary of a collection of MMIF files.

JSON schema:
{
   "title": "CollectionMmifDesc",
   "description": "Summary of a collection of MMIF files.",
   "type": "object",
   "properties": {
      "mmifCountByStatus": {
         "$ref": "#/$defs/MmifCountByStatus",
         "description": "Counts of MMIF files by status."
      },
      "workflows": {
         "description": "List of unique workflows identified in the collection.",
         "items": {
            "$ref": "#/$defs/WorkflowCollectionEntry"
         },
         "title": "Workflows",
         "type": "array"
      },
      "annotationCountByType": {
         "additionalProperties": {
            "type": "integer"
         },
         "description": "Total annotation counts across the entire collection.",
         "title": "Annotationcountbytype",
         "type": "object"
      }
   },
   "$defs": {
      "AppProfilingStats": {
         "description": "Aggregated profiling statistics for an app across a workflow.",
         "properties": {
            "avgRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Average execution time in milliseconds.",
               "title": "Avgrunningtimems"
            },
            "minRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Minimum execution time in milliseconds.",
               "title": "Minrunningtimems"
            },
            "maxRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Maximum execution time in milliseconds.",
               "title": "Maxrunningtimems"
            },
            "stdevRunningTimeMS": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Standard deviation of execution time.",
               "title": "Stdevrunningtimems"
            }
         },
         "title": "AppProfilingStats",
         "type": "object"
      },
      "MmifCountByStatus": {
         "description": "Breakdown of MMIF files in a collection by their processing status.",
         "properties": {
            "total": {
               "description": "Total number of MMIF files found.",
               "title": "Total",
               "type": "integer"
            },
            "successful": {
               "description": "Number of files processed without errors.",
               "title": "Successful",
               "type": "integer"
            },
            "withErrors": {
               "description": "Number of files containing error views.",
               "title": "Witherrors",
               "type": "integer"
            },
            "withWarnings": {
               "description": "Number of files containing warning views.",
               "title": "Withwarnings",
               "type": "integer"
            },
            "invalid": {
               "description": "Number of files that failed to parse as valid MMIF.",
               "title": "Invalid",
               "type": "integer"
            }
         },
         "required": [
            "total",
            "successful",
            "withErrors",
            "withWarnings",
            "invalid"
         ],
         "title": "MmifCountByStatus",
         "type": "object"
      },
      "WorkflowAppExecution": {
         "description": "Aggregated information about an app's usage within a specific workflow across multiple files.",
         "properties": {
            "app": {
               "description": "The URI of the app.",
               "title": "App",
               "type": "string"
            },
            "appConfiguration": {
               "additionalProperties": true,
               "description": "Representative configuration (usually from the first occurrence).",
               "title": "Appconfiguration",
               "type": "object"
            },
            "appProfiling": {
               "$ref": "#/$defs/AppProfilingStats",
               "description": "Aggregated profiling statistics."
            }
         },
         "required": [
            "app"
         ],
         "title": "WorkflowAppExecution",
         "type": "object"
      },
      "WorkflowCollectionEntry": {
         "description": "Summary of a unique workflow found within a collection.",
         "properties": {
            "workflowId": {
               "description": "Unique identifier for the workflow.",
               "title": "Workflowid",
               "type": "string"
            },
            "mmifs": {
               "description": "List of filenames belonging to this workflow.",
               "items": {
                  "type": "string"
               },
               "title": "Mmifs",
               "type": "array"
            },
            "mmifCount": {
               "description": "Number of MMIF files matching this workflow.",
               "title": "Mmifcount",
               "type": "integer"
            },
            "apps": {
               "description": "Sequence of apps in this workflow with aggregated stats.",
               "items": {
                  "$ref": "#/$defs/WorkflowAppExecution"
               },
               "title": "Apps",
               "type": "array"
            }
         },
         "required": [
            "workflowId",
            "mmifs",
            "mmifCount",
            "apps"
         ],
         "title": "WorkflowCollectionEntry",
         "type": "object"
      }
   },
   "required": [
      "mmifCountByStatus",
      "workflows"
   ]
}

Fields:
field annotation_count_by_type: Dict[str, int] [Optional] (alias 'annotationCountByType')

Total annotation counts across the entire collection.

field mmif_count_by_status: MmifCountByStatus [Required] (alias 'mmifCountByStatus')

Counts of MMIF files by status.

field workflows: List[WorkflowCollectionEntry] [Required]

List of unique workflows identified in the collection.

mmif.utils.workflow_helper.describe_mmif_collection(mmif_dir: str | Path) → dict[source]

Reads all MMIF files in a directory and extracts a summarized workflow specification.

This function provides an overview of a collection of MMIF files, aggregating statistics across multiple files.

The output is a serialized CollectionMmifDesc object.

Parameters:

mmif_dir – Path to the directory containing MMIF files.

Returns:

A dictionary containing the summarized collection specification.
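The following sketch shows one way the returned dictionary could be turned into a short text report. It assumes the dictionary is keyed by the aliases shown in the CollectionMmifDesc schema (mmifCountByStatus, workflows, and so on), which the documentation above does not state explicitly. The helper summarize_collection and the sample data are hypothetical; the real call is left as a comment so the block runs without mmif installed.

```python
from typing import Any, Dict, List


def summarize_collection(desc: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: render a collection description as report lines."""
    counts = desc["mmifCountByStatus"]
    lines = [f"{counts['successful']}/{counts['total']} files processed without errors"]
    for workflow in desc["workflows"]:
        # Each workflow lists its apps in sequence; join their URIs for display.
        apps = " -> ".join(step["app"] for step in workflow["apps"])
        lines.append(f"{workflow['workflowId']} ({workflow['mmifCount']} files): {apps}")
    return lines


# With mmif installed, the description would come from the documented function:
#   from mmif.utils.workflow_helper import describe_mmif_collection
#   desc = describe_mmif_collection("path/to/mmif_dir")
# Invented stand-in with the same shape (assuming alias keys):
desc = {
    "mmifCountByStatus": {
        "total": 3, "successful": 2, "withErrors": 1, "withWarnings": 0, "invalid": 0,
    },
    "workflows": [
        {
            "workflowId": "example-workflow",
            "mmifs": ["a.mmif", "b.mmif"],
            "mmifCount": 2,
            "apps": [
                {"app": "http://example.org/app-a/v1"},
                {"app": "http://example.org/app-b/v1"},
            ],
        }
    ],
    "annotationCountByType": {},
}

report = summarize_collection(desc)
for line in report:
    print(line)
```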