clams.app package

Core package providing classes for representing CLAMS apps.

class clams.app.ClamsApp[source]

Bases: ABC

An abstract class to define API’s for ClamsApps. A CLAMS app should inherit this class and then can be used with classes in restify to work as web applications.

_RAW_PARAMS_KEY = '#RAW#'
_abc_impl = <_abc._abc_data object>
abstract _annotate(mmif: Mmif, _raw_parameters=None, **refined_parameters) Mmif[source]

An abstract method to generate (or load if stored elsewhere) the app metadata at runtime. All CLAMS app must implement this.

This is where the bulk of your logic will go. A typical implementation of this method would be

  1. Create a new view (or views) by calling new_view() on the input mmif object.

  2. Call sign_view() with the input runtime parameters for the record.

  3. Call new_contain() on the new view object with any annotation properties specified by the configuration.

  4. Process the data and create Annotation objects and add them to the new view.

  5. While doing so, get help from DocumentTypes, AnnotationTypes classes to generate @type strings.

  6. Return the mmif object

Parameters:
  • mmif – An input MMIF object to annotate

  • runtime_params – An arbitrary set of k-v pairs to configure the app at runtime

Returns:

A Mmif object of the annotated output, ready for serialization

abstract _appmetadata() AppMetadata[source]

An abstract method to generate the app metadata.

Returns:

A Python object of the metadata, must be JSON-serializable

static _check_mmif_compatibility(target_specver, input_specver)[source]
static _cuda_device_name_concat(name, mem)[source]
static _cuda_memory_to_str(mem) str[source]
static _get_available_vram() int[source]

Get currently available VRAM in bytes (GPU-wide, across all processes).

Uses nvidia-smi to get actual available memory, not just current process.

Returns:

Available VRAM in bytes, or 0 if unavailable

_get_profile_path(param_hash: str) Path[source]

Get filesystem path for memory profile file.

Profile files are stored in a per-app directory under user’s cache.

Parameters:

param_hash – Hash of parameters from mmif.utils.cli.describe.generate_param_hash()

Returns:

Path to the profile file

_load_appmetadata() AppMetadata[source]

A private method to load the app metadata. This is called in __init__, (only once) and it uses three sources to load the metadata (in the order of priority):

  1. using a metadata.py file (recommended)

  2. using self._appmetadata() method (legacy, no longer recommended)

In any case, AppMetadata class must be useful.

For metadata specification, see https://clams.ai/clams-python/appmetadata.jsonschema.

static _profile_cuda_memory(func)[source]

Decorator for profiling CUDA memory usage and managing VRAM availability.

This decorator: 1. Checks VRAM requirements before execution (if conditions met) 2. Rejects requests if insufficient VRAM 3. Records peak memory usage after execution 4. Calls empty_cache() for cleanup

Parameters:

func – The function to wrap (typically _annotate)

Returns:

Decorated function that returns (result, cuda_profiler) where cuda_profiler is dict with “<GPU_NAME>, <GPU_TOTAL_MEMORY>” keys and dict values containing ‘available_before’ and ‘peak’ memory in bytes

_record_vram_usage(parameters: dict, peak_bytes: int) None[source]

Record peak memory usage to profile file.

Uses atomic write (temp + rename) to avoid corruption from concurrent writes. Only updates if new value is higher.

Profile files are JSON containing: - peak_bytes: Peak VRAM usage by the torch process - parameters: Original parameters for human readability

Parameters:
  • parameters – Request parameters (for hash and recording)

  • peak_bytes – Measured peak VRAM usage

_refine_params(**runtime_params: List[str])[source]

Method to “fill” the parameter dictionary with default values, when a key-value is not specified in the input. The input map is not really “filled” as a copy of it is returned with addition of default values. :param runtime_params: key-value pairs of runtime parameters :return: a copy of parameter map, with default values added :raises ValueError: when a value for a required parameter is not found in the input

annotate(mmif: str | dict | Mmif, **runtime_params: List[str]) str[source]

A public method to invoke the primary app function. It’s essentially a wrapper around _annotate() method where some common operations (that are invoked by keyword arguments) are implemented.

The input may be a raw MMIF (str, dict, or Mmif) or a JSON envelope wrapping both "parameters" and "mmif". Envelope detection and unwrapping happen here so every execution path (HTTP, CLI, direct Python API) is envelope-aware. When an envelope is given, its parameters are merged under runtime_params (explicitly-passed parameters take priority on key collision).

Parameters:
  • mmif – An input MMIF object, or a JSON envelope, to annotate

  • runtime_params – An arbitrary set of k-v pairs to configure the app at runtime

Returns:

Serialized JSON string of the output of the app

appmetadata(**kwargs: List[str]) str[source]

A public method to get metadata for this app as a string.

Returns:

Serialized JSON string of the metadata

get_configuration(**runtime_params)[source]
static open_document_location(document: str | ~mmif.serialize.annotation.Document, opener: ~typing.Any = <built-in function open>, **openerargs)[source]

A context-providing file opener. A user can provide their own opener class/method and parameters. By default, with will use python built-in open to open the location of the document.

Parameters:
  • document – A Document object that has location

  • opener – A Python class or method that can be used to open a file (e.g. PIL.Image for an image file)

  • openerargs – Parameters that are passed to the opener

Returns:

record_error(mmif: str | dict | Mmif, **runtime_conf: List[str]) Mmif[source]

A method to record an error instead of annotation results in the view this app generated. For logging purpose, the runtime parameters used when the error occurred must be passed as well.

Parameters:
  • mmif – input MMIF object

  • runtime_conf – parameters passed to annotate when the app encountered the error

Returns:

An output MMIF with a new view with the error encoded in the view metadata

set_error_view(mmif: str | dict | Mmif, **runtime_conf: List[str]) Mmif[source]

A method to record an error instead of annotation results in the view this app generated. For logging purpose, the runtime parameters used when the error occurred must be passed as well.

Parameters:
  • mmif – input MMIF object

  • runtime_conf – parameters passed to annotate when the app encountered the error

Returns:

An output MMIF with a new view with the error encoded in the view metadata

sign_view(view: View, runtime_conf: dict) None[source]

A method to “sign” a new view that this app creates at the beginning of annotation. Signing will populate the view metadata with information and configuration of this app. The parameters passed to the _annotate() must be passed to this method. This means all parameters for “common” configuration that are consumed in annotate() should not be recorded in the view metadata. :param view: a view to sign :param runtime_conf: runtime configuration of the app as k-v pairs

universal_parameters = [{'choices': None, 'default': False, 'description': 'The JSON body of the HTTP response will be re-formatted with 2-space indentation', 'multivalued': False, 'name': 'pretty', 'type': 'boolean'}, {'choices': None, 'default': True, 'description': 'The running time of the app will be recorded in the view metadata', 'multivalued': False, 'name': 'runningTime', 'type': 'boolean'}, {'choices': None, 'default': False, 'description': 'The hardware information (architecture, GPU and vRAM) will be recorded in the view metadata', 'multivalued': False, 'name': 'hwFetch', 'type': 'boolean'}, {'choices': ['representatives', 'single', 'all'], 'default': 'representatives', 'description': 'Sampling mode for TimeFrame annotations. Has no effect when the app does not process TimeFrames. "representatives" uses all representative timepoints if present, otherwise skips the TimeFrame. "single" uses the middle representative if present, otherwise extracts an image from the midpoint of the start/end interval (midpoint is calculated by floor division of the sum of start and end). "all" uses all target timepoints if present, otherwise extracts all images from the time interval.', 'multivalued': False, 'name': 'tfSamplingMode', 'type': 'string'}]
static validate_document_locations(mmif: str | Mmif) None[source]

Validate files encoded in the input MMIF.

Parameters:

mmif – An input MMIF with zero or more Document

Raises:

FileNotFoundError – When any of files is not found at its location

class clams.app.ClamsHFPromptableApp[source]

Bases: ClamsPromptableApp

Base class for promptable CLAMS apps backed by a local HuggingFace transformers model. Layers HF-specific inference plumbing on top of ClamsPromptableApp: model loading via clams.backends.hf.load_hf_model(), and a concrete generate() implementation that runs N independent prompts in one HF forward pass via the standard chat-template -> model.generate -> batch_decode pipeline.

Concrete subclasses declare the model class via MODEL_CLS plus a handful of optional dtype/padding hints, and the family of pinned model revisions via analyzer_versions in metadata.py. The SDK auto-derives a model runtime parameter (choices = keys of analyzer_versions), and the dev’s _annotate calls load_model() to (lazily) load the requested family member. Singleton families (one entry in analyzer_versions) eagerly pre-load in __init__ so single-model apps preserve warm-start semantics. Example:

class MyVLMCaptioner(ClamsHFPromptableApp):
    MODEL_CLS = AutoModelForImageTextToText
    DTYPE = torch.bfloat16
    PADDING_SIDE = 'left'

    # In metadata.py:
    #     analyzer_versions={
    #         "HuggingFaceTB/SmolVLM2-2.2B-Instruct": "482adb5",
    #     }
    # plus a call to
    # ClamsHFPromptableApp.inject_promptable_parameters(metadata).

    def _annotate(self, mmif, **parameters):
        self.load_model(parameters['model'])
        # ... self.generate(prompt, images=image_groups, ...)
        # ... self.response_to_grounded_textdocument(...)
        ...

Requires the [hf] extra (pip install clams-python[hf]).

DTYPE: Any | None = None

Torch dtype for the model (e.g. torch.bfloat16). When None, the model class’s own default is used (typically float32). Also used to cast pixel_values in generate().

MODEL_CLS: Any | None = None

transformers model class (e.g. AutoModelForImageTextToText, AutoModelForCausalLM). Subclasses MUST set this.

MODEL_KWARGS: dict | None = None

Extra kwargs forwarded to MODEL_CLS.from_pretrained().

PADDING_SIDE: str | None = None

Tokenizer padding side. Set to 'left' for decoder-only batched generation; leave None otherwise.

PROCESSOR_CLS: Any | None = None

transformers processor / tokenizer / feature-extractor class. Defaults to AutoProcessor (set by clams.backends.hf.load_hf_model() when None).

PROCESSOR_KWARGS: dict | None = None

Extra kwargs forwarded to PROCESSOR_CLS.from_pretrained().

_abc_impl = <_abc._abc_data object>
_model_cache: Dict[Tuple[str, str], Tuple[Any, Any, str]]

Per-(model_id, revision) cache of loaded (processor, model, device) triples. Populated by load_model(); survives for the lifetime of this app instance.

_refine_params(**runtime_params)[source]

Expand model from the raw HF id (org/name) to org/name@<revision> so the resolved revision lands in view.metadata.appConfiguration['model'].

static build_gen_kwargs(max_new_tokens: int = 512, temperature: float = 0.0, top_p: float = 1.0, top_k: int = 50, **_unused) dict[source]

Translate the SDK’s promptable-parameter values into HuggingFace model.generate() kwargs. Greedy decoding (do_sample=False) when temperature == 0.0; sampled decoding with the given top_p / top_k otherwise.

Subclasses MAY override to add model-specific generation kwargs (num_beams, repetition_penalty, custom stopping criteria, do_sample overrides, etc.). The base implementation accepts any extra keyword args and silently ignores them, so subclasses can pass through the full **parameters dict from _annotate without filtering.

generate(prompt: List[str], system_prompt: str = '', images: List[List[Any]] | None = None, audios: List[List[Any]] | None = None, prompt_mode: str = 'turn-taking', **generation_params) List[str][source]

Default implementation of the ClamsPromptableApp.generate() contract for HuggingFace transformers models. Runs N prompts in one forward pass; returns N decoded strings.

Each inner list of images / audios is the bundled content for one prompt. When both images and audios are given they must have the same outer length (multimodal pairs are stitched by index). When both are None, runs as a single text-only prompt.

The default body is the canonical HF chat-model pipeline: build_conversation() -> apply_chat_template -> model.generate -> batch_decode. Subclasses can customize finer-grained pieces via build_conversation() (model-specific message shape) and build_gen_kwargs() (model-specific generation kwargs) without touching this method.

static inject_promptable_parameters(metadata: AppMetadata) None[source]

Add the SDK-managed promptable parameters AND a model parameter derived from metadata.analyzer_versions to the app metadata. Overrides ClamsPromptableApp.inject_promptable_parameters() for HF apps; call this at the end of your app’s appmetadata() function in metadata.py if your app subclasses ClamsHFPromptableApp.

Parameters:

metadata – the AppMetadata instance being built. metadata.analyzer_versions MUST already be set to a non-empty Dict[str, str] (model id -> commit hash); this helper reads it to derive the model parameter’s choices.

Raises:

ValueError – if metadata.analyzer_versions is missing or empty.

load_model(model_id_or_with_rev: str) Tuple[Any, Any, str][source]

Load (or return cached) (processor, model, device) for the given model id. Accepts both refined (org/name@rev) and raw (org/name) forms; for raw form, the revision is looked up from self.metadata.analyzer_versions. Caches results per (model_id, revision) and updates self.processor, self.model, self.device to the loaded triple so subsequent generate() calls operate on it.

Parameters:

model_id_or_with_rev – HF model id, optionally with @<revision> suffix.

Returns:

(processor, model, device) tuple for the loaded model. Same references are also stored on self.

Raises:

KeyError – if a raw model id is passed and is not in analyzer_versions.

processor: Any

References to the currently-active loaded model. Set by load_model(); generate() and friends read from here. None until the first load_model call (or until __init__ eager-loads a singleton family).

class clams.app.ClamsPromptableApp[source]

Bases: ClamsApp

Base class for CLAMS apps that wrap a promptable model (an LLM or other multimodal model, local or remote). Standardizes the runtime parameter surface (prompt, generation hyperparameters, parallelism control) and provides helpers for building chat conversations and persisting model responses into MMIF.

The standardized parameters are listed in promptable_parameters and added to an app’s metadata via inject_promptable_parameters(). Promptable-app developers MUST call that helper at the end of their appmetadata() function in metadata.py. The reservation rule (these parameter names are SDK-managed and apps cannot redeclare them) is enforced implicitly via AppMetadata.add_parameter()’s existing duplicate-name check.

Inference is performed by generate(), which subclasses MUST implement. The base class provides:

_abc_impl = <_abc._abc_data object>
_build_single_turn(text, system_prompt, images, audios)[source]
_build_turn_taking(prompts, system_prompt, images, audios)[source]

Alternating user/assistant turns; one inference call. Even indices in prompts are user turns, odd indices are pre-written assistant exemplars. Images/audios (if any) are attached to the final user turn (the actual query).

_build_user_only(prompts, system_prompt, images, audios)[source]

N progressively-extending conversation prefixes, one per user turn. Assistant slots between users have content=None as placeholders for the caller’s successive generation results.

static _make_user_content(text, images=None, audios=None)[source]

Build the content list for a user-role message.

build_conversation(prompt: str | List[str] | List[dict], system_prompt: str = '', images: List[Any] | None = None, audios: List[Any] | None = None, prompt_mode: str = 'turn-taking') List[dict] | List[List[dict]][source]

Build a chat-template-compatible message list.

Parameters:
  • prompt – a plain string, a List[str] of prompt turns, or a pre-built List[dict] of role/content message objects (returned as-is; pass-through for advanced callers that constructed the conversation themselves).

  • system_prompt – if non-empty, prepended as a system-role message.

  • images – optional list of image inputs to include in the (final) user turn’s content. Each appears as a {'type': 'image', 'image': <input>} entry.

  • audios – optional list of audio inputs to include in the (final) user turn’s content. Each appears as a {'type': 'audio', 'audio': <input>} entry.

  • prompt_mode"turn-taking" (default) or "user-only". Only meaningful when prompt is a multi-element list; ignored otherwise. See promptable_parameters for semantics.

Returns:

  • For single-shot prompts (string or single-element list) and for multi-element turn-taking mode: a single List[dict] of role/content messages, ready to feed to a chat-template applier (e.g., processor.apply_chat_template).

  • For multi-element user-only mode: a List[List[dict]] of N progressively-extending conversation prefixes, one per user turn. Each prefix ends in a user turn; assistant turns between users are stored with content=None as placeholders for the caller to fill in with successive generation results.

Subclasses may override to access model-specific state (self.processor, self.tokenizer, etc.) during formatting; the base implementation is back-end-agnostic.

abstract generate(prompt: List[str], system_prompt: str = '', images: List[List[Any]] | None = None, audios: List[List[Any]] | None = None, prompt_mode: str = 'turn-taking', **generation_params) List[str][source]

Run N independent prompts in one inference call and return N outputs. Subclasses MUST implement this.

Each inner list of images / audios is the bundled multimodal content for ONE prompt – the model sees those items as one composite input and produces one output. The outer list spans N prompts processed in parallel (when the backend supports it; sequentially otherwise).

  • Single-prompt call: images=[[img1, img2]] -> one output (composite over the two bundled images).

  • Per-input broadcast: images=[[img1], [img2], [img3]] -> three outputs (one per image). Caller assembles the singleton-wrap shape.

  • Multimodal pair: images=[[img1]], audios=[[au1]] -> one output. When both images and audios are given they must have the same outer length; index i of each pairs into prompt i.

Parameters:
  • prompt – a List[str] of prompt turns. A single-element list is one-shot. A multi-element list is multi-turn and is assembled according to prompt_mode.

  • system_prompt – optional system-role text prepended to the conversation. Applies to every prompt in the batch.

  • images – optional List[List[Any]] – N groups, one per prompt; each inner list is the bundled images for that prompt.

  • audios – optional List[List[Any]] – N groups, one per prompt; each inner list is the bundled audio clips for that prompt.

  • prompt_mode"turn-taking" (default) or "user-only"; see promptable_parameters.

  • generation_params – any additional backend-specific generation kwargs (maxNewTokens, temperature, topP, topK, etc.).

Returns:

a List[str] with one entry per prompt in the batch. For prompt_mode='user-only' multi-turn, each prompt’s entry is the assistant’s final reply across its N user turns.

Return type:

List[str]

static inject_promptable_parameters(metadata: AppMetadata) None[source]

Add the SDK-managed promptable parameters to metadata. Call this at the end of your app’s appmetadata() function in metadata.py if your app subclasses ClamsPromptableApp.

The reservation rule is enforced implicitly: if the app had already called metadata.add_parameter('prompt', ...) (or any other promptable name) before this helper, the helper’s own add_parameter call will trip the existing duplicate-name ValueError in AppMetadata.add_parameter().

Parameters:

metadata – the AppMetadata instance being built

promptable_parameters = [{'description': 'User prompt(s) sent to the model. A single value runs as a one-shot generation. A multi-value list is interpreted as a multi-turn static prompt; see ``promptMode`` for how turns are assembled.', 'multivalued': True, 'name': 'prompt', 'type': 'string'}, {'default': '', 'description': 'Optional system-role text prepended to the conversation. Empty by default.', 'name': 'systemPrompt', 'type': 'string'}, {'choices': ['user-only', 'turn-taking'], 'default': 'turn-taking', 'description': 'How to interpret a multi-value ``prompt`` list. Has no effect when ``prompt`` has a single value. For semantics of each choice and worked examples, see https://clams.ai/clams-python/app-baseclasses.html#promptable-multiturn', 'name': 'promptMode', 'type': 'string'}, {'default': 512, 'description': "Maximum number of new tokens generated per inference call. Forwarded to the backend's ``generate``-equivalent. Larger values grow the KV cache linearly and increase GPU memory usage; reduce if VRAM is constrained.", 'name': 'maxNewTokens', 'type': 'integer'}, {'default': 0.0, 'description': 'Sampling temperature. The default ``0.0`` selects deterministic / greedy decoding for maximum reproducibility; override for sampled generation.', 'name': 'temperature', 'type': 'number'}, {'default': 1.0, 'description': 'Nucleus-sampling cumulative probability cutoff. Only meaningful when ``temperature`` is greater than 0.', 'name': 'topP', 'type': 'number'}, {'default': 50, 'description': 'Top-K sampling cutoff. Only meaningful when ``temperature`` is greater than 0.', 'name': 'topK', 'type': 'integer'}, {'default': 1, 'description': "Number of independent prompts the app runs in parallel (stacks into a single forward pass). The *size* of each prompt (how many images, how long the system/user text is, etc.) is NOT regulated by this parameter; that is each app's responsibility. Prompt count and per-prompt content size combine multiplicatively for GPU memory, so the two can blow up together. Catastrophic example: ``tfSamplingMode=all`` on a TimeFrame without ``targets`` expands that TF into one image per native-FPS frame (300 images for a 10-second TF at 30fps); ``parallelPrompts=4`` then runs 4 such prompts in one forward pass (~1200 images), guaranteed OOM. Keep at ``1`` on memory-tight setups; raise only when per-prompt content is small and bounded.", 'name': 'parallelPrompts', 'type': 'integer'}]

SDK-managed runtime parameters injected into every promptable app. These names are reserved; apps cannot redeclare them with customized specs.

response_to_grounded_textdocument(view: View, source: str, response: str, origins: List[str] | None = None, origination: str | None = None, reasoning_trace: str | None = None) Tuple[Any, Any][source]

Persist a single LLM text response into a view. Writes one TextDocument (containing the response) plus possible grounding via an Alignment annotation and origins / origination properties on the TD.

The two grounding link kinds are semantically distinct:

  • source is the coarse cross-modal grounding – the single annotation id that the response is anchored to. Written into the new Alignment (source -> td). Typical value: the parent TimeFrame for a captioning/OCR app.

  • origins are the finer derivation grounding – a list of annotation ids the response was specifically derived from (e.g. the TimePoints whose frames were fed to the model). Written into TextDocument.origins. See https://clams.ai/clams-vocabulary/Document for vocabulary semantics.

Parameters:
  • view – the View to write into. The caller is responsible for having called View.new_contain() for TextDocument and Alignment first if needed.

  • sourceid of the annotation to record as the cross-modal anchor of the response (see above).

  • response – the text generated by the model.

  • origins – optional list of ids of annotations the response was derived from. Must be paired with origination.

  • origination – nature of the derivation, written to TextDocument.origination. Accepted values per the vocabulary include 'derived', 'transcription', 'topologically-identical'. Must be paired with origins.

  • reasoning_trace – optional model-side reasoning trace (a chain-of-thought / scratchpad string, NOT a Python traceback). NOT YET SUPPORTED – passing a non-None value raises NotImplementedError. Storage convention is still being decided at clamsproject/clams-python#263.

Returns:

(TextDocument, Alignment) tuple of the new annotations.

Raises:

ValueError – if exactly one of origins / origination is set; they must be supplied together or both omitted.