mmif.utils.summarizer package

Package containing the code to generate a summary from a MMIF file.

mmif.utils.summarizer.argparser()[source]
mmif.utils.summarizer.pp_args(args)[source]
mmif.utils.summarizer.main()[source]

Submodules

mmif.utils.summarizer.config module

mmif.utils.summarizer.graph module

class mmif.utils.summarizer.graph.Graph(mmif: Any)[source]

Bases: object

Graph implementation for a MMIF document. Each node contains an annotation or document. Alignments are stored separately. Edges between nodes are created from the alignments and added to the Node.targets property. The first edge added to Node.targets is the document that the Node points to (if there is one).

The goal of the graph is to store all useful annotations and to provide simple ways to trace nodes all the way up to the primary data.

Variables:
  • mmif – the MMIF document that we are creating a graph for

  • documents – list of the top-level documents

  • nodes – dictionary of nodes, indexed on node identifier

  • alignments – list of <View, Annotation> pairs

  • token_idx – an instance of TokenIndex
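
A minimal sketch of the idea described above, assuming nothing about the real implementation: nodes are indexed on their identifier, alignments are kept in a separate list, and edges are derived from the alignments afterwards. The names ToyNode, ToyGraph, add_alignment and build_edges are hypothetical, and the edge direction (source to target only) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a graph node, not the real Node class."""
    identifier: str
    at_type: str
    targets: list = field(default_factory=list)


class ToyGraph:
    """Toy graph: nodes indexed on identifier, alignments stored separately."""

    def __init__(self):
        self.nodes = {}
        self.alignments = []

    def add_node(self, identifier, at_type):
        self.nodes[identifier] = ToyNode(identifier, at_type)

    def add_alignment(self, source_id, target_id):
        # Alignments are only recorded here, no edge is created yet.
        self.alignments.append((source_id, target_id))

    def build_edges(self):
        # Turn each stored alignment into an edge on the source node.
        for source_id, target_id in self.alignments:
            source = self.nodes.get(source_id)
            target = self.nodes.get(target_id)
            if source is not None and target is not None:
                source.targets.append(target)


graph = ToyGraph()
graph.add_node("v_1:tf1", "TimeFrame")
graph.add_node("v_2:td1", "TextDocument")
graph.add_alignment("v_2:td1", "v_1:tf1")
graph.build_edges()
print([t.identifier for t in graph.nodes["v_2:td1"].targets])
```

Keeping the alignments apart until all nodes exist means an edge can be created regardless of the order in which the annotations were read.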

add_node(view, annotation)[source]

Add an annotation as a node to the graph.

add_edge(view, alignment)[source]
get_node(node_id: str) → Node | None[source]

Return the Node instance from the node index.

get_nodes(short_at_type: str, view_id=None)[source]

Get all nodes for an annotation type, using the short form. If a view identifier is provided then only include nodes from that view.
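
The filtering described here could look roughly like the sketch below. It assumes nodes are plain dictionaries with at_type and view_id keys, which is an illustration only; select_nodes is a hypothetical name and not part of the package.

```python
def select_nodes(nodes, short_at_type, view_id=None):
    """Filter toy nodes on short type and, optionally, on view identifier."""
    selected = []
    for node in nodes:
        if node["at_type"] != short_at_type:
            continue
        # Only apply the view filter when a view identifier was given.
        if view_id is not None and node["view_id"] != view_id:
            continue
        selected.append(node)
    return selected


toy_nodes = [
    {"id": "v_1:tf1", "view_id": "v_1", "at_type": "TimeFrame"},
    {"id": "v_2:tf1", "view_id": "v_2", "at_type": "TimeFrame"},
    {"id": "v_2:ne1", "view_id": "v_2", "at_type": "NamedEntity"},
]
all_frames = select_nodes(toy_nodes, "TimeFrame")
v2_frames = select_nodes(toy_nodes, "TimeFrame", view_id="v_2")
print(len(all_frames), [n["id"] for n in v2_frames])
```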

statistics() → defaultdict[source]

Collect counts for node types in each view.
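
One way to collect such counts is a nested defaultdict keyed on view identifier and then on node type. The exact shape of the value returned by statistics() is not documented here, so the layout below is an assumption and count_node_types is a hypothetical helper.

```python
from collections import defaultdict


def count_node_types(pairs):
    """Count (view identifier, short type) pairs in a nested defaultdict."""
    stats = defaultdict(lambda: defaultdict(int))
    for view_id, short_type in pairs:
        stats[view_id][short_type] += 1
    return stats


pairs = [
    ("v_1", "TimeFrame"),
    ("v_1", "TimeFrame"),
    ("v_2", "NamedEntity"),
]
stats = count_node_types(pairs)
print({view: dict(counts) for view, counts in stats.items()})
```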

class mmif.utils.summarizer.graph.TokenIndex(tokens)[source]

Bases: object

The tokens are indexed on the identifier of the TextDocument that they occur in, and for each text document we have a list of <offsets, Node> pairs:

{'v_4:td1': [
    ((0, 5), <summarizer.graph.Node object at 0x1039996d0>),
    ((5, 6), <summarizer.graph.Node object at 0x103999850>),
    ...]
}

get_tokens_for_node(node: Node)[source]

Return all tokens included in the span of a node.
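
A rough sketch of a span lookup against an index of the shape shown above. Strings stand in for the Node instances, tokens_in_span is a hypothetical function, and the rule that a token must lie fully inside the span is an assumption.

```python
def tokens_in_span(index, doc_id, start, end):
    """Return the tokens whose offsets fall completely inside [start, end]."""
    return [
        token
        for (p1, p2), token in index.get(doc_id, [])
        if start <= p1 and p2 <= end
    ]


# Toy index: document identifier -> list of <offsets, token> pairs.
index = {
    "v_4:td1": [
        ((0, 5), "Hello"),
        ((5, 6), ","),
        ((7, 12), "world"),
    ]
}
print(tokens_in_span(index, "v_4:td1", 0, 6))
```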

pp(fname=None)[source]

mmif.utils.summarizer.nodes module

class mmif.utils.summarizer.nodes.Node(graph, view, annotation)[source]

Bases: object

add_local_anchors()[source]

Collect the anchors that are available on the annotation itself. These include the start and end offsets, the coordinates, the timePoint of a BoundingBox and any annotation with targets.

add_anchors_from_targets()[source]

Get start and end offsets or timePoints from the targets and add them to the anchors, but only if there were no anchors on the node already. This has two cases: one for TimeFrames and one for text intervals.

add_anchors_from_alignment(target: Any, debug=False)[source]
summary()[source]

The default summary is just the identifier; this should typically be overridden by subclasses.

has_label()[source]

Only TimeFrameNodes can have labels so this returns False.

pp(close=True)[source]
class mmif.utils.summarizer.nodes.TimeFrameNode(graph, view, annotation)[source]

Bases: Node

start()[source]
end()[source]
frame_type()[source]
has_label()[source]

Only TimeFrameNodes can have labels so this returns False.

representatives() → list[source]

Return a list of the representative TimePoints.

summary()[source]

The summary of a time frame just contains the identifier, start, end and frame type.

class mmif.utils.summarizer.nodes.EntityNode(graph, view, annotation)[source]

Bases: Node

start_in_video()[source]
end_in_video()[source]
summary()[source]

The summary for entities needs to include where in the video or image the entity occurs; it is not enough to just give the text document.

anchor() → dict[source]

The anchor is the position in the video that the entity is linked to. This anchor cannot be found in the document property because that points to a text document that was somehow derived from the video document. Some graph traversal is needed to get the anchor, but we know that the anchor is always a time frame or a bounding box.

anchor2()[source]

The anchor is the position in the video that the entity is linked to. This anchor cannot be found in the document property because that points to a text document that was somehow derived from the video document. Some graph traversal is needed to get the anchor, but we know that the anchor is always a time frame or a bounding box.

find_boundingbox_or_timeframe()[source]
class mmif.utils.summarizer.nodes.Nodes[source]

Bases: object

Factory class for Node creation. Use Node for creation unless a special class was registered for the kind of annotation we have.

node_classes = {'NamedEntity': <class 'mmif.utils.summarizer.nodes.EntityNode'>, 'TimeFrame': <class 'mmif.utils.summarizer.nodes.TimeFrameNode'>}
classmethod new(graph, view, annotation)[source]
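
The factory pattern described above can be sketched as a class-level registry with a fallback to the base class. The Sketch* names are hypothetical, and representing an annotation as a dictionary with a short "type" key is an assumption made for illustration.

```python
class SketchNode:
    """Hypothetical base node, used when no special class is registered."""

    def __init__(self, graph, view, annotation):
        self.graph = graph
        self.view = view
        self.annotation = annotation


class SketchEntityNode(SketchNode):
    pass


class SketchTimeFrameNode(SketchNode):
    pass


class SketchNodes:
    """Factory: look up a registered class, fall back to SketchNode."""

    node_classes = {
        "NamedEntity": SketchEntityNode,
        "TimeFrame": SketchTimeFrameNode,
    }

    @classmethod
    def new(cls, graph, view, annotation):
        node_class = cls.node_classes.get(annotation["type"], SketchNode)
        return node_class(graph, view, annotation)


entity = SketchNodes.new(None, None, {"type": "NamedEntity"})
token = SketchNodes.new(None, None, {"type": "Token"})
print(type(entity).__name__, type(token).__name__)
```

With this layout, supporting a new annotation type only requires a new subclass and one extra entry in node_classes.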

mmif.utils.summarizer.summary module

Main classes for the summarizer.

exception mmif.utils.summarizer.summary.SummaryException[source]

Bases: Exception

class mmif.utils.summarizer.summary.Summary(mmif_file)[source]

Bases: object

Implements the summary of a MMIF file.

Variables:
  • fname – name of the input mmif file

  • mmif – instance of mmif.serialize.Mmif

  • graph – instance of graph.Graph

  • documents – instance of Documents

  • views – instance of Views

  • transcript – instance of Transcript

  • timeframes – instance of TimeFrames

  • entities – instance of Entities

  • captions – instance of Captions

add_warning(warning: str)[source]
validate()[source]

Minimal validation of the input. Mostly a placeholder because all it does now is check how many video documents there are.

video_documents()[source]
to_dict()[source]
report(outfile=None)[source]
print_warnings()[source]
pp()[source]
class mmif.utils.summarizer.summary.Documents(summary: Summary)[source]

Bases: object

Contains a list of document summaries, which are dictionaries with just the id, type and location properties.

static summary(doc)[source]
pp()[source]
class mmif.utils.summarizer.summary.Annotations(summary)[source]

Bases: object

Contains a dictionary of Annotation object summaries, indexed on view identifiers.

get(item)[source]
get_all_annotations()[source]
class mmif.utils.summarizer.summary.Document(summary)[source]

Bases: object

Collects some document-level information, including MMIF version, size of the MMIF file and some information from the SWT document annotation.

class mmif.utils.summarizer.summary.Views(summary)[source]

Bases: object

Contains a list of view summaries, which are dictionaries with just the id, app and timestamp properties.

get_view_summary(view)[source]
pp()[source]
class mmif.utils.summarizer.summary.Transcript(summary)[source]

Bases: object

The transcript contains the string value from the first text document in the last ASR view. It issues a warning if there is more than one text document in the view.

static transcript_size(sentences)[source]
collect_targets(s_nodes)[source]

For each node (in this context a sentence node), collect all target nodes (which are tokens) and return them as a list of lists, with one list for each node.

create_sentences(t_nodes, sentence_size=12)[source]

If there is no sentence structure then we create it just by chopping the input into slices of some pre-determined length.
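
Chopping a token list into fixed-size slices can be sketched as below. The function name chop is hypothetical; the default of 12 mirrors the sentence_size default in the signature, and integers stand in for token nodes.

```python
def chop(tokens, sentence_size=12):
    """Split a list of tokens into consecutive slices of sentence_size."""
    return [
        tokens[i:i + sentence_size]
        for i in range(0, len(tokens), sentence_size)
    ]


tokens = list(range(30))
sentences = chop(tokens)
print([len(sentence) for sentence in sentences])
```

The last slice is simply shorter when the number of tokens is not a multiple of the slice size.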

class mmif.utils.summarizer.summary.TranscriptElement(identifier: str, sentence: list, transcript: CharacterList)[source]

Bases: object

Utility class to handle data associated with an element from a transcript. The element is created from a sentence, which is a list of Token Nodes. Initialization has the side effect of populating the full transcript, which is an instance of CharacterList and which is also accessed here.

as_json()[source]
class mmif.utils.summarizer.summary.Nodes(summary)[source]

Bases: object

Abstract class to store instances of subclasses of nodes.Node. The initialization methods of subclasses of Nodes can guard which nodes will be allowed in. For example, as of July 2022 the TimeFrames class only allowed time frames that had a frame type (thereby blocking the many timeframes from Kaldi).

Variables:
  • summary – an instance of Summary

  • graph – an instance of graph.Graph, taken from the summary

  • nodes – list of instances of subclasses of nodes.Node

add(node)[source]
get_nodes(**props)[source]

Return all the nodes that match the given properties.
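
Matching on arbitrary keyword properties might be implemented along these lines. This sketch assumes nodes are dictionaries and that every given property must match exactly; match_nodes is a hypothetical name.

```python
def match_nodes(nodes, **props):
    """Return the nodes for which every given property has the given value."""
    return [
        node for node in nodes
        if all(node.get(key) == value for key, value in props.items())
    ]


toy_nodes = [
    {"id": "v_1:tf1", "frameType": "slate", "view": "v_1"},
    {"id": "v_1:tf2", "frameType": "credits", "view": "v_1"},
    {"id": "v_2:tf1", "frameType": "slate", "view": "v_2"},
]
slates = match_nodes(toy_nodes, frameType="slate")
print([node["id"] for node in slates])
```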

class mmif.utils.summarizer.summary.TimeFrames(summary)[source]

Bases: Nodes

For now, we take only the TimeFrames that have a frame type, which rules out all the frames we got from Kaldi.

as_json()[source]
pp()[source]
class mmif.utils.summarizer.summary.TimeFrameStats(summary)[source]

Bases: object

class mmif.utils.summarizer.summary.Entities(summary)[source]

Bases: Nodes

This class collects instances of nodes.EntityNode.

Variables:
  • nodes_idx – maps entity texts to lists of instances of nodes.EntityNode

  • bins – an instance of Bins

as_json()[source]
pp()[source]
print_groups()[source]
class mmif.utils.summarizer.summary.Captions(summary)[source]

Bases: Nodes

as_json()[source]
class mmif.utils.summarizer.summary.Bins(summary)[source]

Bases: object

add_entity(text, entity)[source]

Add an entity instance to the appropriate bin.

mark_entities()[source]

Marks all entities with the bin that they occur in. This exports the grouping done with the bins to the entities, so that the bins never need to be touched again.

print_bins()[source]
class mmif.utils.summarizer.summary.Bin(node)[source]

Bases: object

add(node)[source]
print_nodes(i)[source]

mmif.utils.summarizer.utils module

Utility methods for the summarizer.

mmif.utils.summarizer.utils.compose_id(view_id, anno_id)[source]

Composes the view identifier with the annotation identifier.
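
Judging from identifiers such as 'v_4:td1' elsewhere in this package, the composed form appears to be the view identifier and the annotation identifier joined by a colon. The sketch below works under that assumption; join_ids is a hypothetical name.

```python
def join_ids(view_id, anno_id):
    """Join a view identifier and an annotation identifier with a colon."""
    return f"{view_id}:{anno_id}"


print(join_ids("v_4", "td1"))
```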

mmif.utils.summarizer.utils.type_name(annotation)[source]

Return the short name of the type.
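
As an illustration of what a short name can mean, the sketch below takes the last path segment of an annotation type URI and skips a trailing version segment such as v1. It works on plain strings rather than Annotation objects, the URIs are examples only, and short_name is a hypothetical function.

```python
import re


def short_name(at_type):
    """Return the last path segment of a type URI, ignoring a version suffix."""
    parts = at_type.rstrip("/").split("/")
    # Drop a trailing segment such as "v1" or "v5" if there is one.
    if len(parts) > 1 and re.fullmatch(r"v\d+", parts[-1]):
        parts = parts[:-1]
    return parts[-1]


print(short_name("http://mmif.clams.ai/vocabulary/TimeFrame/v1"))
print(short_name("http://vocab.lappsgrid.org/Token"))
```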

mmif.utils.summarizer.utils.get_transcript_view(views)[source]

Return the last Whisper or Kaldi view that is not a warnings view.

mmif.utils.summarizer.utils.get_captions_view(views)[source]

Return the last view created by the captioner.

mmif.utils.summarizer.utils.get_last_segmenter_view(views)[source]
mmif.utils.summarizer.utils.get_aligned_tokens(view)[source]

Get a list of tokens from an ASR view, where for each token we add a timeframe property which has the start and end points of the aligned timeframe.

mmif.utils.summarizer.utils.timestamp(milliseconds: int, format='hh:mm:ss')[source]
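
The default format suggests a conversion from milliseconds to an hh:mm:ss string. A minimal sketch of that conversion follows; format_ms is a hypothetical name, the sub-second remainder is discarded, and how the real function treats other format values is not documented here.

```python
def format_ms(milliseconds):
    """Convert a millisecond count into an hh:mm:ss string."""
    seconds = milliseconds // 1000  # discard the sub-second remainder
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(format_ms(3723004))
```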
class mmif.utils.summarizer.utils.AnnotationsIndex(view)[source]

Bases: object

Creates an index on the annotations list for a view, where each annotation type is indexed on its identifier. Tokens are special and get their own list.

get_annotations(at_type)[source]
class mmif.utils.summarizer.utils.CharacterList(n: int, char=' ')[source]

Bases: UserList

Auxiliary data structure to help print a list of tokens. It allows you to reconstruct a sentence from the text and character offsets of the tokens.

set_chars(text: str, start: int, end: int)[source]
getvalue(start: int, end: int)[source]
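
The idea of rebuilding a sentence from token texts and offsets can be sketched with a UserList of characters. CharBuffer is a hypothetical stand-in that mirrors the signatures above; the truncation of text that is longer than the span is an assumption.

```python
from collections import UserList


class CharBuffer(UserList):
    """A list of n filler characters that tokens are written into."""

    def __init__(self, n, char=" "):
        super().__init__([char] * n)

    def set_chars(self, text, start, end):
        # Write the token text at its offsets, never past the end offset.
        for i, character in enumerate(text[:end - start]):
            self.data[start + i] = character

    def getvalue(self, start, end):
        return "".join(self.data[start:end])


buffer = CharBuffer(11)
buffer.set_chars("Hello", 0, 5)
buffer.set_chars("world", 6, 11)
print(buffer.getvalue(0, 11))
```

Positions that no token covers keep the filler character, which is how the whitespace between tokens is restored.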
mmif.utils.summarizer.utils.xml_tag(tag, subtag, objs, props, indent='  ') → str[source]

Return an XML string for a list of instances of subtag, grouped under tag.

mmif.utils.summarizer.utils.xml_empty_tag(tag_name: str, indent: str, obj: dict, props: tuple) → str[source]

Return an XML tag as a string. Only properties from obj that are in the props tuple are printed.

mmif.utils.summarizer.utils.write_tag(s, tagname: str, indent: str, obj: dict, props: tuple)[source]

Write an XML tag to an instance of io.StringIO(). Only properties from obj that are in the props tuple are printed.
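
Writing a tag with a restricted set of attributes to a string buffer might look like the sketch below. write_empty_tag is a hypothetical function, the self-closing tag layout is an assumption, and quoteattr from the standard library is used for quoting.

```python
import io
from xml.sax.saxutils import quoteattr


def write_empty_tag(stream, tagname, indent, obj, props):
    """Write a self-closing tag, using only the properties listed in props."""
    attributes = " ".join(
        f"{prop}={quoteattr(str(obj[prop]))}"
        for prop in props
        if prop in obj
    )
    stream.write(f"{indent}<{tagname} {attributes}/>\n")


stream = io.StringIO()
frame = {"id": "v_1:tf1", "start": 0, "internal": "not printed"}
write_empty_tag(stream, "TimeFrame", "  ", frame, ("id", "start"))
print(stream.getvalue(), end="")
```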

mmif.utils.summarizer.utils.xml_attribute(attr)[source]

Return attr as an XML attribute.

mmif.utils.summarizer.utils.xml_data(text)[source]

Return text as XML data.
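
For reference, the Python standard library already offers helpers for both kinds of escaping. Whether xml_attribute and xml_data use them is not stated here, so the sketch below only shows what XML-safe output looks like.

```python
from xml.sax.saxutils import escape, quoteattr

# Character data: ampersands and angle brackets are escaped.
data = escape("Tom & Jerry <live>")

# Attribute value: escaped and wrapped in quotes.
attribute = quoteattr("A&B")

print(data)
print(attribute)
```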

mmif.utils.summarizer.utils.normalize_id(doc_ids: list, view: View, annotation: Annotation)[source]

Change identifiers to include the view identifier if it wasn’t included, do nothing otherwise. This applies to the Annotation id, target, source, document, targets and representatives properties. Note that timePoint is not included because the value is an integer and not an identifier.
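
The normalization described above can be sketched on a plain dictionary of annotation properties. Two assumptions are made: an identifier counts as already including the view identifier when it contains a colon, and identifiers of top-level documents (doc_ids) are left alone. Both function names are hypothetical.

```python
def add_view_prefix(identifier, view_id, doc_ids):
    """Prefix an identifier with the view identifier unless it is complete."""
    if identifier in doc_ids or ":" in identifier:
        return identifier
    return f"{view_id}:{identifier}"


def normalize_properties(properties, view_id, doc_ids):
    """Return a copy of the properties with normalized identifiers."""
    normalized = dict(properties)
    for prop in ("id", "target", "source", "document"):
        if prop in normalized:
            normalized[prop] = add_view_prefix(
                normalized[prop], view_id, doc_ids)
    for prop in ("targets", "representatives"):
        if prop in normalized:
            normalized[prop] = [
                add_view_prefix(identifier, view_id, doc_ids)
                for identifier in normalized[prop]
            ]
    return normalized


properties = {
    "id": "tf1",
    "document": "d1",
    "targets": ["tp1", "v_1:tp2"],
    "timePoint": 1000,
}
result = normalize_properties(properties, "v_2", ["d1"])
print(result)
```

The timePoint value passes through unchanged because it is an integer rather than an identifier.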

mmif.utils.summarizer.utils.get_annotations_from_view(view, annotation_type)[source]

Return all annotations from a view that match the short name of the annotation type.

mmif.utils.summarizer.utils.find_matching_tokens(tokens, ne)[source]