mmif.utils.summarizer package¶
Package containing the code to generate a summary from a MMIF file.
Submodules¶
mmif.utils.summarizer.config module¶
mmif.utils.summarizer.graph module¶
- class mmif.utils.summarizer.graph.Graph(mmif: Any)[source]¶
Bases: object
Graph implementation for a MMIF document. Each node contains an annotation or document. Alignments are stored separately. Edges between nodes are created from the alignments and added to the Node.targets property. The first edge added to Node.targets is the document that the Node points to (if there is one).
The goal for the graph is to store all useful annotations and to have simple ways to trace nodes all the way up to the primary data.
- Variables:
mmif – the MMIF document that we are creating a graph for
documents – list of the top-level documents
nodes – dictionary of nodes, indexed on node identifier
alignments – list of <View, Annotation> pairs
token_idx – an instance of TokenIndex
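The way alignments become edges, and how a node can then be traced back to the primary data, can be illustrated with a minimal sketch. Everything here is hypothetical: SketchNode, link_alignments and trace_to_root are illustration names rather than part of the package, and the sketch assumes that an alignment reduces to a (source, target) pair of node identifiers and that the edge is stored on the source node only.

```python
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    """Hypothetical stand-in for a graph node; not the package's Node class."""
    identifier: str
    targets: list = field(default_factory=list)


def link_alignments(nodes: dict, alignments: list) -> None:
    """Turn (source_id, target_id) pairs into edges on the source node.

    Assumption: alignments are stored apart from the nodes and are only
    used to fill in the targets list of each source node.
    """
    for source_id, target_id in alignments:
        source, target = nodes.get(source_id), nodes.get(target_id)
        if source is not None and target is not None:
            source.targets.append(target)


def trace_to_root(node: SketchNode) -> list:
    """Follow the first target of each node until a node has no targets."""
    path = [node.identifier]
    seen = {node.identifier}
    while node.targets:
        node = node.targets[0]
        if node.identifier in seen:  # guard against cycles
            break
        seen.add(node.identifier)
        path.append(node.identifier)
    return path


# Nodes indexed on identifier, with the alignments kept in a separate list.
nodes = {i: SketchNode(i) for i in ("video1", "v_1:tf1", "v_2:td1", "v_3:ne1")}
alignments = [("v_1:tf1", "video1"), ("v_2:td1", "v_1:tf1"), ("v_3:ne1", "v_2:td1")]
link_alignments(nodes, alignments)
path = trace_to_root(nodes["v_3:ne1"])
```

In this sketch the path runs from the entity through the text document and the time frame to the video document.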
- class mmif.utils.summarizer.graph.TokenIndex(tokens)[source]¶
Bases: object
The tokens are indexed on the identifier of the TextDocument that they occur in, and for each text document we have a list of <offsets, Node> pairs:
{'v_4:td1': [ ((0, 5), <summarizer.graph.Node object at 0x1039996d0>), ((5, 6), <summarizer.graph.Node object at 0x103999850>), ...] }
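A rough sketch of such an index follows. The class SketchTokenIndex, its tokens_in method and the (document_id, start, end, token) input shape are assumptions made for illustration; plain strings stand in for the Node objects.

```python
from bisect import bisect_left


class SketchTokenIndex:
    """Hypothetical index: text document id -> sorted list of (offsets, token)."""

    def __init__(self, tokens):
        # tokens: iterable of (document_id, start, end, token) tuples (assumed shape)
        self.index = {}
        for doc_id, start, end, token in tokens:
            self.index.setdefault(doc_id, []).append(((start, end), token))
        for pairs in self.index.values():
            pairs.sort(key=lambda pair: pair[0])

    def tokens_in(self, doc_id, start, end):
        """Return the tokens that lie fully inside the span [start, end)."""
        pairs = self.index.get(doc_id, [])
        starts = [offsets[0] for offsets, _ in pairs]
        first = bisect_left(starts, start)  # first token starting at or after start
        result = []
        for (tok_start, tok_end), token in pairs[first:]:
            if tok_start >= end:
                break
            if tok_end <= end:
                result.append(token)
        return result


idx = SketchTokenIndex([
    ("v_4:td1", 7, 12, "world"),
    ("v_4:td1", 0, 5, "Hello"),
    ("v_4:td1", 5, 6, ","),
])
found = idx.tokens_in("v_4:td1", 5, 12)
```

Keeping each list sorted on offsets is a design choice of the sketch; it allows a binary search for the first candidate token instead of a scan from the start of the document.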
mmif.utils.summarizer.nodes module¶
- class mmif.utils.summarizer.nodes.Node(graph, view, annotation)[source]¶
Bases: object
- add_local_anchors()[source]¶
Get the anchors that you can get from the annotation itself, which includes the start and end offsets, the coordinates, the timePoint of a BoundingBox and any annotation with targets.
- add_anchors_from_targets()[source]¶
Get start and end offsets or timePoints from the targets and add them to the anchors, but only if there were no anchors on the node already. This has two cases: one for TimeFrames and one for text intervals.
- class mmif.utils.summarizer.nodes.EntityNode(graph, view, annotation)[source]¶
Bases: Node
- summary()[source]¶
The summary for entities needs to include where in the video or image the entity occurs; it is not enough to just give the text document.
- anchor() → dict[source]¶
The anchor is the position in the video that the entity is linked to. This anchor cannot be found in the document property because that points to a text document that was somehow derived from the video document. Some graph traversal is needed to get the anchor, but we know that the anchor is always a time frame or a bounding box.
- anchor2()[source]¶
The anchor is the position in the video that the entity is linked to. This anchor cannot be found in the document property because that points to a text document that was somehow derived from the video document. Some graph traversal is needed to get the anchor, but we know that the anchor is always a time frame or a bounding box.
- class mmif.utils.summarizer.nodes.Nodes[source]¶
Bases: object
Factory class for Node creation. Use Node for creation unless a special class was registered for the kind of annotation we have.
- node_classes = {'NamedEntity': <class 'mmif.utils.summarizer.nodes.EntityNode'>, 'TimeFrame': <class 'mmif.utils.summarizer.nodes.TimeFrameNode'>}¶
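The registry lookup with a fallback to the base class can be sketched as below. The Sketch* classes and the get method are hypothetical, and the sketch assumes that the registry key is the short name of the annotation type.

```python
class SketchNode:
    """Hypothetical base node; only stores the short annotation type name."""

    def __init__(self, type_name: str):
        self.type_name = type_name


class SketchEntityNode(SketchNode):
    """Stand-in for a node class specialized for named entities."""


class SketchTimeFrameNode(SketchNode):
    """Stand-in for a node class specialized for time frames."""


class SketchNodes:
    """Factory that falls back to SketchNode for unregistered types."""

    node_classes = {
        "NamedEntity": SketchEntityNode,
        "TimeFrame": SketchTimeFrameNode,
    }

    @classmethod
    def get(cls, type_name: str) -> SketchNode:
        # Assumption: the lookup key is the short type name, e.g. 'TimeFrame'.
        node_class = cls.node_classes.get(type_name, SketchNode)
        return node_class(type_name)


entity = SketchNodes.get("NamedEntity")
token = SketchNodes.get("Token")
```

Supporting a new annotation type in this design only needs a new entry in node_classes; the callers of the factory stay unchanged.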
mmif.utils.summarizer.summary module¶
Main classes for the summarizer.
- class mmif.utils.summarizer.summary.Summary(mmif_file)[source]¶
Bases: object
Implements the summary of a MMIF file.
- Variables:
fname – name of the input mmif file
mmif – instance of mmif.serialize.Mmif
graph – instance of graph.Graph
documents – instance of Documents
views – instance of Views
transcript – instance of Transcript
timeframes – instance of TimeFrames
entities – instance of Entities
captions – instance of Captions
- class mmif.utils.summarizer.summary.Documents(summary: Summary)[source]¶
Bases: object
Contains a list of document summaries, which are dictionaries with just the id, type and location properties.
- class mmif.utils.summarizer.summary.Annotations(summary)[source]¶
Bases: object
Contains a dictionary of Annotation object summaries, indexed on view identifiers.
- class mmif.utils.summarizer.summary.Document(summary)[source]¶
Bases: object
Collects some document-level information, including MMIF version, size of the MMIF file and some information from the SWT document annotation.
- class mmif.utils.summarizer.summary.Views(summary)[source]¶
Bases: object
Contains a list of view summaries, which are dictionaries with just the id, app and timestamp properties.
- class mmif.utils.summarizer.summary.Transcript(summary)[source]¶
Bases: object
The transcript contains the string value from the first text document in the last ASR view. It issues a warning if there is more than one text document in the view.
- class mmif.utils.summarizer.summary.TranscriptElement(identifier: str, sentence: list, transcript: CharacterList)[source]¶
Bases: object
Utility class to handle data associated with an element from a transcript. An element is created from a sentence, which is a list of Token Nodes. Initialization has the side effect of populating the full transcript, which is an instance of CharacterList and is also accessed here.
- class mmif.utils.summarizer.summary.Nodes(summary)[source]¶
Bases: object
Abstract class to store instances of subclasses of graph.Node. The initialization methods of subclasses of Nodes can restrict which nodes are allowed in. For example, as of July 2022 the TimeFrames class only allowed time frames that had a frame type (thereby blocking the many time frames from Kaldi).
- Variables:
summary – an instance of Summary
graph – an instance of graph.Graph, taken from the summary
nodes – list of instances of subclasses of graph.Node
- class mmif.utils.summarizer.summary.TimeFrames(summary)[source]¶
Bases: Nodes
For now, we take only the TimeFrames that have a frame type, which rules out all the frames we got from Kaldi.
- class mmif.utils.summarizer.summary.Entities(summary)[source]¶
Bases: Nodes
This class collects instances of graph.EntityNode.
- Variables:
nodes_idx – maps entity texts to lists of instances of graph.EntityNode
bins – an instance of Bins
mmif.utils.summarizer.utils module¶
Utility methods for the summarizer.
- mmif.utils.summarizer.utils.compose_id(view_id, anno_id)[source]¶
Composes the view identifier with the annotation identifier.
- mmif.utils.summarizer.utils.get_transcript_view(views)[source]¶
Return the last Whisper or Kaldi view that is not a warnings view.
- mmif.utils.summarizer.utils.get_captions_view(views)[source]¶
Return the last view created by the captioner.
- mmif.utils.summarizer.utils.get_aligned_tokens(view)[source]¶
Get a list of tokens from an ASR view, where for each token we add a timeframe property that holds the start and end points of the aligned timeframe.
- class mmif.utils.summarizer.utils.AnnotationsIndex(view)[source]¶
Bases: object
Creates an index on the annotations list for a view, where each annotation type is indexed on its identifier. Tokens are special and get their own list.
- class mmif.utils.summarizer.utils.CharacterList(n: int, char=' ')[source]¶
Bases: UserList
Auxiliary data structure to help print a list of tokens. It allows you to reconstruct a sentence from the text and character offsets of the tokens.
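Reconstructing a sentence from token texts and offsets could look roughly like the sketch below. SketchCharacterList and its set_chars and getvalue methods are hypothetical names; only the constructor arguments (n and char) mirror the signature above.

```python
from collections import UserList


class SketchCharacterList(UserList):
    """Hypothetical stand-in: a fixed-size list of characters, blank by default."""

    def __init__(self, n: int, char: str = " "):
        super().__init__([char] * n)

    def set_chars(self, text: str, start: int, end: int) -> None:
        """Write text into positions start..end-1, one character per slot."""
        for offset, character in enumerate(text[:end - start]):
            self.data[start + offset] = character

    def getvalue(self, start: int, end: int) -> str:
        """Return the characters in the span [start, end) as a string."""
        return "".join(self.data[start:end])


# Tokens may arrive in any order; the offsets decide where the text lands.
chars = SketchCharacterList(12)
chars.set_chars("world", 7, 12)
chars.set_chars("Hello", 0, 5)
chars.set_chars(",", 5, 6)
sentence = chars.getvalue(0, 12)
```

Positions that no token covers keep the fill character, which is how the whitespace between tokens comes back without being stored explicitly.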
- mmif.utils.summarizer.utils.xml_tag(tag, subtag, objs, props, indent=' ') → str[source]¶
Return an XML string for a list of instances of subtag, grouped under tag.
- mmif.utils.summarizer.utils.xml_empty_tag(tag_name: str, indent: str, obj: dict, props: tuple) → str[source]¶
Return an XML tag as a string. Only properties from obj that are in the props tuple are printed.
- mmif.utils.summarizer.utils.write_tag(s, tagname: str, indent: str, obj: dict, props: tuple)[source]¶
Write an XML tag to an instance of io.StringIO(). Only properties from obj that are in the props tuple are printed.
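Writing a tag to a string buffer while filtering on props might look like the following sketch. sketch_write_tag is a hypothetical function that only borrows the parameter list from the signature above; the self-closing tag form, the attribute escaping with quoteattr and the skipping of properties missing from obj are all assumptions.

```python
import io
from xml.sax.saxutils import quoteattr


def sketch_write_tag(s, tagname: str, indent: str, obj: dict, props: tuple) -> None:
    """Write an empty XML tag for obj to the buffer s, using only props."""
    attrs = "".join(
        f" {prop}={quoteattr(str(obj[prop]))}"  # quoteattr adds quotes and escapes
        for prop in props
        if prop in obj  # assumption: properties missing from obj are skipped
    )
    s.write(f"{indent}<{tagname}{attrs}/>\n")


buffer = io.StringIO()
document = {"id": "d1", "type": "VideoDocument", "size": 3}
sketch_write_tag(buffer, "Document", "  ", document, ("id", "type"))
xml_line = buffer.getvalue()
```

The size property does not appear in the output because it is not listed in props.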
- mmif.utils.summarizer.utils.normalize_id(doc_ids: list, view: View, annotation: Annotation)[source]¶
Change identifiers to include the view identifier if it wasn’t included, do nothing otherwise. This applies to the Annotation id, target, source, document, targets and representatives properties. Note that timePoint is not included because the value is an integer and not an identifier.
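The normalization rule can be sketched on plain dictionaries. The sketch_* functions and ID_PROPS are hypothetical, and three points are assumptions: identifiers are joined with a colon (as in 'v_4:td1'), an identifier that already contains a colon counts as qualified, and identifiers of top-level documents listed in doc_ids are left alone.

```python
ID_PROPS = ("id", "target", "source", "document", "targets", "representatives")


def sketch_compose_id(view_id: str, anno_id: str) -> str:
    # Assumption: the view and annotation identifiers are joined with a colon.
    return f"{view_id}:{anno_id}"


def sketch_normalize(value, view_id: str, doc_ids: list):
    """Qualify an identifier, or each identifier in a list, with the view id."""
    if isinstance(value, list):
        return [sketch_normalize(v, view_id, doc_ids) for v in value]
    if not isinstance(value, str):
        return value  # not an identifier, returned as-is
    if ":" in value or value in doc_ids:
        return value  # already qualified, or a top-level document
    return sketch_compose_id(view_id, value)


def sketch_normalize_props(props: dict, view_id: str, doc_ids: list) -> dict:
    """Normalize only the identifier-valued properties; copy the rest as-is."""
    return {
        key: sketch_normalize(value, view_id, doc_ids) if key in ID_PROPS else value
        for key, value in props.items()
    }


props = {
    "id": "tf1",
    "targets": ["tp1", "v_1:tp2"],
    "document": "d1",
    "timePoint": 1000,
    "label": "slate",
}
normalized = sketch_normalize_props(props, "v_2", ["d1"])
```

The integer timePoint passes through unchanged because it is not among the identifier properties.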