mmif.serialize module

mmif.serialize.model module

The model module contains the classes used to represent an abstract MMIF object as a live Python object.

The MmifObject class or one of its derivatives is subclassed by all other classes defined in this SDK, except for MmifObjectEncoder.

These objects are generally instantiated from JSON, either as a string or as an already-loaded Python dictionary. This base class provides the core functionality for deserializing MMIF JSON data into live objects and serializing live objects into MMIF JSON data. Specialized behavior for the different components of MMIF is added in the subclasses.

class mmif.serialize.model.DataDict(mmif_obj: bytes | str | dict | None = None)[source]

Bases: MmifObject, Generic[T, S]

empty()[source]
get(key: T, default=None) S | None[source]
items()[source]
keys()[source]
update(other, overwrite)[source]
values()[source]
class mmif.serialize.model.DataList(mmif_obj: bytes | str | list | None = None)[source]

Bases: MmifObject, Generic[T]

The DataList class is an abstraction that represents the various lists found in a MMIF file, such as documents, subdocuments, views, and annotations.

Parameters:

mmif_obj (bytes | str | list | None) – the data that the list contains

append(value, overwrite)[source]
deserialize(mmif_json: str | list) None[source]

Passes the input data into the internal deserializer.

empty()[source]
get(key: str) T | None[source]

Standard dictionary-style get() method, albeit with no default parameter. Relies on the implementation of __getitem__.

Will return None if the key is not found.

Parameters:

key – the key to search for

Returns:

the value matching that key
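The relationship between get() and __getitem__ described above can be pictured with a small stand-in container. This is only a sketch under stated assumptions: TinyList and its internal dict are hypothetical and are not the SDK's DataList.

```python
from typing import Dict, Optional


class TinyList:
    """Hypothetical stand-in for a DataList-like container keyed by ID."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def __setitem__(self, key: str, value: str) -> None:
        self._items[key] = value

    def __getitem__(self, key: str) -> str:
        # Raises KeyError when the key is missing, like a plain dict.
        return self._items[key]

    def get(self, key: str) -> Optional[str]:
        # Delegate to __getitem__ and turn a miss into None.
        try:
            return self[key]
        except KeyError:
            return None


views = TinyList()
views["v_0"] = "first view"
found = views.get("v_0")
missing = views.get("v_9")
```

Here get() takes no default argument, so a caller that needs a fallback value has to check for None itself.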

class mmif.serialize.model.MmifObject(mmif_obj: bytes | str | dict | None = None)[source]

Bases: object

Abstract superclass for MMIF related key-value pair objects.

Any MMIF object can be initialized as an empty placeholder or an actual representation with a JSON formatted string or equivalent dict object argument.

This superclass has four specially designed instance variables, and these variable names cannot be used as attribute names for MMIF objects.

  1. _unnamed_attributes: Can only be either None or an empty dictionary. If it is set to None, the class won’t accept any Additional Attributes in the JSON schema sense. If it is a dict, users can add arbitrary key-value pairs to the class, EXCEPT for the reserved key names (see reserved_names).

  2. _attribute_classes: This is a dict from a key name to a specific Python class to use for deserializing the value. Note that a key name in this dict does NOT have to be a named attribute, but is recommended to be one.

  3. _required_attributes: This is a simple list of names of attributes that are required in the object. When serializing, an object will skip its empty (e.g. zero-length, or None) attributes unless they are in this list. Otherwise, the serialized JSON string would have empty representations (e.g. "", []).

  4. _exclude_from_diff: This is a simple list of names of attributes that should be excluded from the diff calculation in __eq__.

# TODO (krim @ 8/17/20): this dict (_attribute_classes), however, duplicates the type hints in the class definition. Maybe there is a better way to utilize type hints (e.g. getting them programmatically), but for now developers should be careful to add types to the hints as well as to this dict.

Also note that those special attributes MUST be set in __init__() before calling the super method; otherwise deserialization will not work.

In addition, a subclass that has one or more named attributes must set those attributes in __init__() before calling the super method. When serializing a MmifObject, all empty attributes will be ignored, so for optional named attributes, you must leave the values empty (len == 0), but NOT None. Any None-valued named attributes will cause issues with the current implementation.
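The rule that empty attributes are skipped unless they are listed as required can be illustrated on a plain dictionary. This is a minimal sketch, assuming emptiness means None or a zero-length value; serialize_fields is a hypothetical helper, not an SDK function.

```python
import json
from typing import Any, Dict, List


def serialize_fields(fields: Dict[str, Any], required: List[str]) -> str:
    # Keep a field when it is explicitly required or when it is non-empty.
    kept = {
        key: value
        for key, value in fields.items()
        if key in required or value not in (None, "", [], {})
    }
    return json.dumps(kept)


fields = {"id": "v_0", "annotations": [], "note": ""}
result = serialize_fields(fields, required=["annotations"])
```

The empty "annotations" list survives because it is required, while the empty "note" string is dropped.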

Parameters:

mmif_obj – JSON string or dict to initialize an object. If not given, an empty object will be initialized, sometimes with an ID value automatically generated, based on its parent object.

deserialize(mmif_json: str | dict) None[source]

Takes a JSON-formatted string or a simple dict that’s json-loaded from such a string as an input and populates object’s fields with the values specified in the input.

Parameters:

mmif_json – JSON-formatted string or dict from such a string that represents a MMIF object

disallow_additional_properties() None[source]

Call this method in __init__() to prevent the insertion of unnamed attributes after initialization.

static is_empty(obj) bool[source]

Returns True if obj is None or “empty”. Emptiness is first defined as having zero length; objects that lack a __len__ method need an additional check.
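One way to read this emptiness rule is sketched below. The treatment of objects without __len__ is an assumption made for illustration (they are considered non-empty here); the SDK's actual check may differ.

```python
from typing import Any


def is_empty(obj: Any) -> bool:
    if obj is None:
        return True
    if hasattr(obj, "__len__"):
        return len(obj) == 0
    # No __len__: this sketch assumes such objects are never empty.
    return False


checks = {
    "none": is_empty(None),
    "empty_list": is_empty([]),
    "text": is_empty("mmif"),
    "zero": is_empty(0),
}
```

Under this reading the integer 0 is not empty, because it has no length to measure.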

reserved_names: Set[str] = {'_attribute_classes', '_exclude_from_diff', '_required_attributes', '_unnamed_attributes', 'reserved_names'}[source]
serialize(pretty: bool = False) str[source]

Generates JSON representation of an object.

Parameters:

pretty – If True, returns string representation with indentation.

Returns:

JSON string of the object.
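The effect of a pretty flag can be pictured with the standard json module. This is a sketch only: the indent width of 2 and the to_json helper are assumptions, not the SDK's implementation.

```python
import json
from typing import Any, Dict


def to_json(data: Dict[str, Any], pretty: bool = False) -> str:
    # indent=None yields a single line; an integer indent yields one key per line.
    return json.dumps(data, indent=2 if pretty else None)


data = {"id": "v_0", "annotations": []}
compact = to_json(data)
indented = to_json(data, pretty=True)
```

Both strings decode to the same data; only the whitespace differs.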

set_additional_property(key: str, value: Any) None[source]

Method to set values in _unnamed_attributes.

Parameters:
  • key – the attribute name

  • value – the desired value

Returns:

None

Raises:

AttributeError if additional properties are disallowed by disallow_additional_properties()
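The interplay between set_additional_property() and disallow_additional_properties() can be sketched with a small stand-in class. TinyObject and its _extras attribute are hypothetical; the sketch assumes that disallowing is modeled by replacing the dict with None, as the _unnamed_attributes description suggests.

```python
from typing import Any, Dict, Optional


class TinyObject:
    """Hypothetical stand-in; not the SDK's MmifObject."""

    def __init__(self) -> None:
        # A dict means extra key-value pairs are accepted; None means they are not.
        self._extras: Optional[Dict[str, Any]] = {}

    def disallow_additional_properties(self) -> None:
        self._extras = None

    def set_additional_property(self, key: str, value: Any) -> None:
        if self._extras is None:
            raise AttributeError(f"additional properties are disallowed: {key}")
        self._extras[key] = value


open_obj = TinyObject()
open_obj.set_additional_property("color", "red")

closed_obj = TinyObject()
closed_obj.disallow_additional_properties()
try:
    closed_obj.set_additional_property("color", "red")
    rejected = False
except AttributeError:
    rejected = True
```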

class mmif.serialize.model.MmifObjectEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: JSONEncoder

Encoder class to define behaviors of de-/serialization

default(obj: MmifObject)[source]

Overrides default encoding behavior to prioritize MmifObject.serialize().
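The general pattern of overriding JSONEncoder.default() so that an object's own serialization wins can be sketched with the standard library. Span, to_dict, and SpanEncoder are hypothetical names used only for illustration; they are not part of the SDK.

```python
import json
from typing import Any, Dict


class Span:
    """Hypothetical object that knows how to describe itself as a dict."""

    def __init__(self, start: int, end: int) -> None:
        self.start = start
        self.end = end

    def to_dict(self) -> Dict[str, int]:
        return {"start": self.start, "end": self.end}


class SpanEncoder(json.JSONEncoder):
    def default(self, obj: Any) -> Any:
        # Prefer the object's own serialization when it offers one.
        if hasattr(obj, "to_dict"):
            return obj.to_dict()
        # Otherwise fall back to the stock behavior, which raises TypeError.
        return super().default(obj)


encoded = json.dumps({"span": Span(0, 5)}, cls=SpanEncoder)
```

json.dumps only calls default() for values it cannot encode natively, so plain dicts, lists, and strings never reach the override.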

mmif.serialize.mmif module

The mmif module contains the classes used to represent a full MMIF file as a live Python object.

See the specification docs and the JSON Schema file for more information.

class mmif.serialize.mmif.Mmif(mmif_obj: bytes | str | dict | None = None, *, validate: bool = True)[source]

Bases: MmifObject

MmifObject that represents a full MMIF file.

Parameters:
  • mmif_obj – the JSON data

  • validate – whether to validate the data against the MMIF JSON schema.

add_document(document: Document, overwrite=False) None[source]

Appends a Document object to the documents list.

Fails if there is already a document with the same ID in the MMIF object, unless overwrite is set to True.

Parameters:
  • document – the Document object to add

  • overwrite – if set to True, will overwrite an existing document with the same ID

Returns:

None

add_view(view: View, overwrite=False) None[source]

Appends a View object to the views list.

Fails if there is already a view with the same ID in the MMIF object, unless overwrite is set to True.

Parameters:
  • view – the View object to add

  • overwrite – if set to True, will overwrite an existing view with the same ID

Returns:

None

generate_capital_annotations()[source]

Automatically converts any “pending” temporary properties from Document objects to Annotation objects. The generated Annotation objects are then added to the last View in the views list.

See https://github.com/clamsproject/mmif-python/issues/226 for rationale behind this behavior and discussion.

get_alignments(at_type1: str | TypesBase, at_type2: str | TypesBase) Dict[str, List[Annotation]][source]

Finds views where alignments between two given annotation types occurred.

Returns:

a dict keyed by view IDs (str), with lists of alignment Annotation objects as values.

get_all_views_contain(at_types: TypesBase | str | List[str | TypesBase]) List[View][source]

Returns the list of all views in the MMIF if given types are present in that view’s ‘contains’ metadata.

Parameters:

at_types – a list of types or just a type to check for. When given more than one type, all types must be found.

Returns:

the list of views that contain the type
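The "all types must be found" rule can be sketched on plain dictionaries shaped like MMIF views. This is an illustration only: views_containing is a hypothetical helper, and the short type names stand in for the full type URIs a real MMIF file would use.

```python
from typing import List, Union


def views_containing(views: List[dict], at_types: Union[str, List[str]]) -> List[dict]:
    # Accept a single type or a list of types.
    wanted = [at_types] if isinstance(at_types, str) else list(at_types)
    # A view matches only when every requested type is in its contains metadata.
    return [
        view for view in views
        if all(t in view["metadata"]["contains"] for t in wanted)
    ]


views = [
    {"id": "v_0", "metadata": {"contains": {"TimeFrame": {}}}},
    {"id": "v_1", "metadata": {"contains": {"TimeFrame": {}, "Token": {}}}},
]
one_type = [v["id"] for v in views_containing(views, "TimeFrame")]
two_types = [v["id"] for v in views_containing(views, ["TimeFrame", "Token"])]
```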

get_document_by_id(doc_id: str) Document[source]

Finds a Document object with the given ID.

Parameters:

doc_id – the ID to search for

Returns:

a reference to the corresponding document, if it exists

Raises:

KeyError – if there is no corresponding document

get_document_location(m_type: DocumentTypes | str, path_only=False) str | None[source]

Method to get the location of the first document of the given type.

Parameters:

m_type – the type to search for

Returns:

the value of the location field in the corresponding document

get_documents_by_app(app_id: str) List[Document][source]

Method to get all Document objects, queried by the name of the app that generated them.

Parameters:

app_id – the app name to search for

Returns:

a list of documents matching the requested app name, or an empty list if the app is not found

get_documents_by_property(prop_key: str, prop_value: str) List[Document][source]

Method to retrieve documents by an arbitrary key-value pair in the document properties objects.

Parameters:
  • prop_key – the metadata key to search for

  • prop_value – the metadata value to match

Returns:

a list of documents matching the requested metadata key-value pair

get_documents_by_type(doc_type: str | DocumentTypes) List[Document][source]

Method to get all documents where the type matches a particular document type, which should be one of the CLAMS document types.

Parameters:

doc_type – the type of documents to search for; must be one of the Document types defined in the CLAMS vocabulary.

Returns:

a list of documents matching the requested type, or an empty list if none found.

get_documents_in_view(vid: str | None = None) List[Document][source]

Method to get all Document objects, queried by a view ID.

Parameters:

vid – the source view ID to search for

Returns:

a list of documents matching the requested source view ID, or an empty list if the view is not found

get_documents_locations(m_type: DocumentTypes | str, path_only=False) List[str | None][source]

This method returns the file paths of documents of given type. Only top-level documents have locations, so we only check them.

Parameters:

m_type – the type to search for

Returns:

a list of the values of the location fields in the corresponding documents

get_end(annotation: Annotation) int | float[source]

An alias to get_anchor_point method with start=False.

get_start(annotation: Annotation) int | float[source]

An alias to get_anchor_point method with start=True.

get_view_by_id(req_view_id: str) View[source]

Finds a View object with the given ID.

Parameters:

req_view_id – the ID to search for

Returns:

a reference to the corresponding view, if it exists

Raises:

Exception – if there is no corresponding view

get_view_contains(at_types: TypesBase | str | List[str | TypesBase]) View | None[source]

Returns the last view appended that contains the given types in its ‘contains’ metadata.

Parameters:

at_types – a list of types or just a type to check for. When given more than one type, all types must be found.

Returns:

the view, or None if the type is not found

get_views_contain(at_types: TypesBase | str | List[str | TypesBase]) List[View][source]

An alias to get_all_views_contain method.

get_views_for_document(doc_id: str) List[View][source]

Returns the list of all views that have annotations anchored on a particular document. Note that when the document is inside a view (generated during the pipeline’s running), doc_id must be prefixed with the view_id.
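The view-prefixed form of a document ID can be sketched as a simple split on the delimiter. This is a sketch under stated assumptions: split_long_id is a hypothetical helper, the example IDs are made up, and the ":" delimiter mirrors the id_delimiter class variable documented for Mmif.

```python
from typing import Optional, Tuple

ID_DELIMITER = ":"  # mirrors Mmif.id_delimiter as documented


def split_long_id(long_id: str) -> Tuple[Optional[str], str]:
    # A prefixed ID carries the view ID before the delimiter.
    if ID_DELIMITER in long_id:
        view_id, local_id = long_id.split(ID_DELIMITER, 1)
        return view_id, local_id
    # No delimiter: treat the ID as belonging to a top-level document.
    return None, long_id


in_view = split_long_id("v_0:td1")
top_level = split_long_id("m1")
```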

id_delimiter: ClassVar[str] = ':'[source]
new_view() View[source]

Creates an empty view with a new ID and appends it to the views list.

Returns:

a reference to the new View object

new_view_id() str[source]

Fetches an ID for a new view.

Returns:

the ID
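One plausible way to pick an unused view ID is sketched below. The counting strategy is an assumption made for illustration and may not match the SDK; next_view_id is a hypothetical helper, and the "v_" prefix mirrors the view_prefix class variable documented for Mmif.

```python
from typing import Iterable

VIEW_PREFIX = "v_"  # mirrors Mmif.view_prefix as documented


def next_view_id(existing_ids: Iterable[str]) -> str:
    taken = set(existing_ids)
    # Start from the number of existing views and skip forward until unused.
    index = len(taken)
    while f"{VIEW_PREFIX}{index}" in taken:
        index += 1
    return f"{VIEW_PREFIX}{index}"


first = next_view_id([])
after_gap = next_view_id(["v_0", "v_2"])
```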

sanitize()[source]

Sanitizes a Mmif object by running some safeguards. Concretely, it performs the following before returning the JSON string:

  1. validates the output using the built-in MMIF JSON schema

  2. removes non-existing annotation types from the contains metadata
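The second safeguard, dropping contains entries that have no matching annotation, can be sketched on a plain dictionary shaped like a MMIF view. prune_contains is a hypothetical helper and the short type names are placeholders for full type URIs; this is not the SDK's implementation.

```python
from typing import Any, Dict


def prune_contains(view: Dict[str, Any]) -> Dict[str, Any]:
    # Collect the @type of every annotation actually present in the view.
    present = {ann["@type"] for ann in view["annotations"]}
    contains = view["metadata"]["contains"]
    # Keep only the declared types that have at least one annotation.
    view["metadata"]["contains"] = {
        at_type: meta for at_type, meta in contains.items() if at_type in present
    }
    return view


view = {
    "id": "v_0",
    "metadata": {"contains": {"TimeFrame": {}, "Token": {}}},
    "annotations": [{"@type": "TimeFrame", "properties": {"id": "tf1"}}],
}
pruned = prune_contains(view)
```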

serialize(pretty: bool = False, sanitize: bool = False, autogenerate_capital_annotations=True) str[source]

Serializes the MMIF object to a JSON string.

Parameters:
  • sanitize – If True, performs some sanitization before returning the JSON string. See sanitize() for details.

  • autogenerate_capital_annotations – If True, automatically convert any “pending” temporary properties from Document objects to Annotation objects. See generate_capital_annotations() for details.

  • pretty – If True, returns string representation with indentation.

Returns:

JSON string of the MMIF object.

static validate(json_str: bytes | str | dict) None[source]

Validates a MMIF JSON object against the MMIF Schema. Note that this method operates before processing by MmifObject._load_str, so it expects @ and not _ for the JSON-LD @-keys.

Raises:

jsonschema.exceptions.ValidationError – if the input fails validation

Parameters:

json_str – a MMIF JSON dict or string

Returns:

None

view_prefix: ClassVar[str] = 'v_'[source]

mmif.serialize.view module

The view module contains the classes used to represent a MMIF view as a live Python object.

In MMIF, views are created by apps in a pipeline that are annotating data that was previously present in the MMIF file.

class mmif.serialize.view.Contain(mmif_obj: bytes | str | dict | None = None)[source]

Bases: DataDict[str, str]

Contain object that represents the metadata of a single annotation type in the contains metadata of a MMIF view.

class mmif.serialize.view.View(view_obj: bytes | str | dict | None = None)[source]

Bases: MmifObject

View object that represents a single view in a MMIF file.

A view is identified by an ID, and contains certain metadata, a list of annotations, and potentially a JSON-LD @context IRI.

If view_obj is not provided, an empty View will be generated.

Parameters:

view_obj – the JSON data that defines the view

add_annotation(annotation: Annotation, overwrite=False) Annotation[source]

Adds an annotation to the current view.

Fails if there is already an annotation with the same ID in the view, unless overwrite is set to True.

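Parameters:
  • annotation – the Annotation object to add

  • overwrite – if set to True, will overwrite an existing annotation with the same ID

Raises: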

KeyError – if overwrite is set to False and an annotation with the same ID exists in the view

Returns:

the same Annotation object passed in as annotation
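The overwrite behavior can be sketched with a dictionary keyed by ID. add_item and the dict-based store are hypothetical stand-ins used for illustration; they are not the SDK's View.

```python
from typing import Dict


def add_item(store: Dict[str, dict], item: dict, overwrite: bool = False) -> dict:
    item_id = item["id"]
    # Refuse to replace an existing entry unless the caller opts in.
    if item_id in store and not overwrite:
        raise KeyError(f"ID already exists: {item_id}")
    store[item_id] = item
    # Return the same object that was passed in.
    return item


store: Dict[str, dict] = {}
first = add_item(store, {"id": "a1", "label": "old"})
try:
    add_item(store, {"id": "a1", "label": "new"})
    duplicate_rejected = False
except KeyError:
    duplicate_rejected = True
replaced = add_item(store, {"id": "a1", "label": "new"}, overwrite=True)
```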

add_document(document: Document, overwrite=False) Annotation[source]

Appends a Document object to the annotations list.

Fails if there is already a document with the same ID in the annotations list, unless overwrite is set to True.

Parameters:
  • document – the Document object to add

  • overwrite – if set to True, will overwrite an existing document with the same ID

Returns:

None

get_annotation_by_id(ann_id) Annotation[source]
get_annotations(at_type: str | TypesBase | None = None, **properties) Generator[Annotation, None, None][source]

Looks for annotations in this view that match the given parameters.

Parameters:
  • at_type – @type of the annotations to look for. When this is None, any @type will match.

  • properties – properties of the annotations to look for. When given more than one property, all properties must match. Note that annotation type metadata are specified in the contains view metadata, not in individual annotation objects.
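The matching rule (optional @type filter, and every given property must match) can be sketched as a generator over plain dictionaries. find_annotations is a hypothetical helper and the short type names are placeholders; the sketch does not use the SDK's Annotation class.

```python
from typing import Any, Iterator, List, Optional


def find_annotations(annotations: List[dict], at_type: Optional[str] = None,
                     **properties: Any) -> Iterator[dict]:
    for ann in annotations:
        # When at_type is None, any @type matches.
        if at_type is not None and ann["@type"] != at_type:
            continue
        props = ann.get("properties", {})
        # Every requested property must be present with the requested value.
        if all(key in props and props[key] == value
               for key, value in properties.items()):
            yield ann


annotations = [
    {"@type": "TimeFrame", "properties": {"id": "tf1", "label": "speech"}},
    {"@type": "TimeFrame", "properties": {"id": "tf2", "label": "music"}},
    {"@type": "Token", "properties": {"id": "t1", "label": "speech"}},
]
speech_frames = [a["properties"]["id"]
                 for a in find_annotations(annotations, "TimeFrame", label="speech")]
any_speech = [a["properties"]["id"]
              for a in find_annotations(annotations, label="speech")]
```

Because the function is a generator, nothing is scanned until the caller iterates over the result.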

get_document_by_id(doc_id) Document[source]
get_documents() List[Document][source]
new_annotation(at_type: str | TypesBase, aid: str | None = None, overwrite=False, **properties) Annotation[source]

Generates a new mmif.serialize.annotation.Annotation object and adds it to the current view.

Fails if there is already an annotation with the same ID in the view, unless overwrite is set to True.

Parameters:
  • at_type – the desired @type of the annotation.

  • aid – the desired ID of the annotation. When not given, the mmif SDK tries to automatically generate an ID based on the Annotation type and existing annotations in the view.

  • overwrite – if set to True, will overwrite an existing annotation with the same ID.

Raises:

KeyError – if overwrite is set to False and an annotation with the same ID exists in the view.

Returns:

the generated mmif.serialize.annotation.Annotation

new_contain(at_type: str | TypesBase, **contains_metadata) Contain | None[source]

Adds a new element to the contains metadata.

Parameters:
  • at_type – the @type of the annotation type being added

  • contains_metadata – any metadata associated with the annotation type

Returns:

the generated Contain object

new_textdocument(text: str, lang: str = 'en', did: str | None = None, overwrite=False, **properties) Document[source]

Generates a new mmif.serialize.annotation.Document object, particularly typed as TextDocument and adds it to the current view.

Fails if there is already a text document with the same ID in the view, unless overwrite is set to True.

Parameters:
  • text – text content of the new document

  • lang – ISO 639-1 code of the language used in the new document

  • did – the desired ID of the document. When not given, the mmif SDK tries to automatically generate an ID based on the Annotation type and existing documents in the view.

  • overwrite – if set to True, will overwrite an existing document with the same ID

Raises:

KeyError – if overwrite is set to False and a document with the same ID exists in the view

Returns:

the generated mmif.serialize.annotation.Document

set_error(err_message: str, err_trace: str) None[source]
class mmif.serialize.view.ViewMetadata(viewmetadata_obj: bytes | str | dict | None = None)[source]

Bases: MmifObject

ViewMetadata object that represents the metadata object within a MMIF view.

Parameters:

viewmetadata_obj – the JSON data that defines the metadata

add_contain(contain: Contain, at_type: str | TypesBase)[source]
add_parameter(param_key, param_value)[source]
add_parameters(**runtime_params)[source]
add_warnings(*warnings: Warning)[source]
emtpy_warnings()[source]
get_parameter(param_key)[source]
new_contain(at_type: str | TypesBase, **contains_metadata) Contain | None[source]

Adds a new element to the contains dictionary.

Parameters:
  • at_type – the @type of the annotation type being added

  • contains_metadata – any metadata associated with the annotation type

Returns:

the generated Contain object

set_error(message: str, stack_trace: str)[source]

mmif.serialize.annotation module

The annotation module contains the classes used to represent a MMIF annotation as a live Python object.

In MMIF, annotations are created by apps in a pipeline as a part of a view. For documentation on how views are represented, see mmif.serialize.view.

class mmif.serialize.annotation.Annotation(anno_obj: bytes | str | dict | None = None)[source]

Bases: MmifObject

MmifObject that represents an annotation in a MMIF view.

add_property(name: str, value: str | int | float | bool | None | List[str | int | float | bool | None] | List[List[str | int | float | bool | None]] | Dict[str, str | int | float | bool | None] | Dict[str, List[str | int | float | bool | None]]) None[source]

Adds a property to the annotation’s properties.

Parameters:
  • name – the name of the property

  • value – the property’s desired value

Returns:

None

property at_type: TypesBase[source]
static check_prop_value_is_simple_enough(value: str | int | float | bool | None | List[str | int | float | bool | None] | List[List[str | int | float | bool | None]] | Dict[str, str | int | float | bool | None] | Dict[str, List[str | int | float | bool | None]]) bool[source]
get(prop_name: str) AnnotationProperties | str | int | float | bool | None | List[str | int | float | bool | None] | List[List[str | int | float | bool | None]] | Dict[str, str | int | float | bool | None] | Dict[str, List[str | int | float | bool | None]][source]

A special getter for Annotation properties. This is to allow for directly accessing properties without having to go through the properties object, or view-level annotation properties encoded in the view.metadata.contains dict. Note that the regular props will take the priority over the ephemeral props when there are conflicts.

get_property(prop_name: str) AnnotationProperties | str | int | float | bool | None | List[str | int | float | bool | None] | List[List[str | int | float | bool | None]] | Dict[str, str | int | float | bool | None] | Dict[str, List[str | int | float | bool | None]][source]

A special getter for Annotation properties. This is to allow for directly accessing properties without having to go through the properties object, or view-level annotation properties encoded in the view.metadata.contains dict. Note that the regular props will take the priority over the ephemeral props when there are conflicts.
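The lookup priority described for the Annotation getters (regular properties first, then view-level properties) can be pictured with collections.ChainMap. The property names and values below are made up for illustration, and ChainMap is a stand-in technique, not how the SDK is implemented.

```python
from collections import ChainMap

# Properties stored on the annotation itself.
regular = {"label": "speech", "start": 0}
# Properties declared once at the view level (in the contains metadata).
view_level = {"label": "unknown", "timeUnit": "milliseconds"}

# ChainMap searches its maps left to right, so regular properties win.
lookup = ChainMap(regular, view_level)
label = lookup["label"]
time_unit = lookup["timeUnit"]
```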

property id: str[source]
is_document()[source]
is_type(at_type: str | TypesBase) bool[source]

Check if the @type of this object matches.

property parent: str[source]
class mmif.serialize.annotation.AnnotationProperties(mmif_obj: bytes | str | dict | None = None)[source]

Bases: MmifObject, MutableMapping[str, T]

AnnotationProperties object that represents the properties object within a MMIF annotation.

Parameters:

mmif_obj – the JSON data that defines the properties

class mmif.serialize.annotation.Document(doc_obj: bytes | str | dict | None = None)[source]

Bases: Annotation

Document object that represents a single document in a MMIF file.

A document is identified by an ID, and contains certain attributes and potentially contains the contents of the document itself, metadata about how the document was created, and/or a list of subdocuments grouped together logically.

If doc_obj is not provided, an empty Document will be generated.

Parameters:

doc_obj – the JSON data that defines the document

add_property(name: str, value: str | int | float | bool | None | List[str | int | float | bool | None]) None[source]

Adds a property to the document’s properties.

Unlike the parent Annotation class, properties added to a Document object can be lost during serialization unless the document belongs somewhere in a Mmif object. This is because we want to keep Document objects as “read-only” as possible. Thus, if you want to add a property to a Document object,

  • add the document to a Mmif object (either in the documents list or in a view from the views list), or

  • directly write to Document.properties instead of using this method (which is not recommended).

With the former method, the SDK will record the added property as an Annotation object, separate from the original Document object. See Mmif.generate_capital_annotations() for more.

A few notes to keep in mind:

  1. You can’t overwrite an existing property of a Document object.

  2. A MMIF can have multiple Annotation objects with the same property name but different values. When this happens, the SDK will only keep the latest value (in order of appearances in views list) of the property, effectively overwriting the previous values.
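The "latest value wins" rule in the second note can be sketched by merging recorded properties in view order. This is an illustration only: resolve_properties is a hypothetical helper, and the (view ID, properties) pairs are a simplified stand-in for the Annotation objects the SDK records.

```python
from typing import Dict, List, Tuple


def resolve_properties(recorded: List[Tuple[str, Dict[str, str]]]) -> Dict[str, str]:
    resolved: Dict[str, str] = {}
    # Walk the views in order of appearance; later values overwrite earlier ones.
    for _view_id, props in recorded:
        resolved.update(props)
    return resolved


recorded = [
    ("v_0", {"author": "alice"}),
    ("v_1", {"author": "bob", "year": "1999"}),
]
resolved = resolve_properties(recorded)
```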

get(prop_name)[source]

A special getter for Document properties. The major difference from the super class’s Annotation.get() method is that Document class has one more set of “pending” properties, that are added after the Document object is created and will be serialized as a separate Annotation object of which @type = Annotation. The pending properties will take the priority over the regular properties when there are conflicts.

get_property(prop_name)[source]

A special getter for Document properties. The major difference from the super class’s Annotation.get() method is that Document class has one more set of “pending” properties, that are added after the Document object is created and will be serialized as a separate Annotation object of which @type = Annotation. The pending properties will take the priority over the regular properties when there are conflicts.

property location: str | None[source]

location property must be a legitimate URI. That is, should the document be a local file then the file:// scheme must be used. Returns None when no location is set.

location_address() str | None[source]

Retrieves the full address from the document location URI. Returns None when no location is set.

location_path(nonexist_ok=True) str | None[source]

Retrieves a path that’s resolved to a pathname in the local file system. To obtain the original value of the “path” part in the location string (before resolving), use properties.location_path_literal method. Returns None when no location is set.

Parameters:

nonexist_ok – if False, raise FileNotFoundError when the resolved path doesn’t exist

location_scheme() str | None[source]

Retrieves URI scheme of the document location. Returns None when no location is set.
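How a location URI breaks down into scheme, address, and path can be sketched with urllib.parse from the standard library. The example URIs are made up, and mapping "address" to host plus path is an assumption made for illustration; the SDK's own methods may compute these differently.

```python
from urllib.parse import urlparse

local = urlparse("file:///var/archive/video.mp4")
remote = urlparse("http://example.com/media/video.mp4")

# Scheme is the part before "://".
local_scheme = local.scheme
# For a file URI with an empty host, the path is the whole address.
local_path = local.path
# For a remote URI, host and path together identify the resource.
remote_address = remote.netloc + remote.path
```

A local file written without the file:// scheme would parse with an empty scheme, which is why the URI form matters here.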

property text_language: str[source]
property text_value: str[source]
class mmif.serialize.annotation.DocumentProperties(mmif_obj: bytes | str | dict | None = None)[source]

Bases: AnnotationProperties

DocumentProperties object that represents the properties object within a MMIF document.

Parameters:

mmif_obj – the JSON data that defines the properties

property location: str | None[source]

location property must be a legitimate URI. That is, should the document be a local file then the file:// scheme must be used. Returns None when no location is set.

location_address() str | None[source]

Retrieves the full address from the document location URI. Returns None when no location is set.

location_path() str | None[source]
location_path_literal() str | None[source]

Retrieves only path name of the document location (hostname is ignored). Returns None when no location is set.

location_path_resolved(nonexist_ok=True) str | None[source]

Retrieves only the path name of the document location (hostname is ignored), and then tries to resolve the path name in the local file system. This method should be used when the document scheme is file or empty. For other schemes, users should install the mmif-locdoc-<scheme> plugin.

Returns None when no location is set. Raises ValueError when no code is found to resolve the given location scheme.

location_scheme() str | None[source]

Retrieves URI scheme of the document location. Returns None when no location is set.

property text_language: str[source]
property text_value: str[source]
class mmif.serialize.annotation.Text(text_obj: bytes | str | dict | None = None)[source]

Bases: MmifObject

property lang: str[source]
property value: str[source]