ZedDist EPUB MultiMedia
Notes on multimedia accessibility and integration in EPUB
Full-presentation multimedia synchronization
Requirement: add synchronized audio to EPUB documents in a non-intrusive way
Assumption: text is the master media type
Overlaying audio information
Audio synchronization would be accomplished as a transparent SMIL overlay to an existing EPUB fileset.
Success criteria: Reading devices that do not support the SMIL information can ignore it. Reading devices that do not even recognize it would not be hampered by its presence (i.e. they could still read the book as a classic text-only EPUB).
The SMIL would be very basic in its format and would only include audio and text media elements. Timing options are limited to letting the audio duration control the presentation flow. Optional content would be limited to skippable structures. Nested structures would be minimal or nonexistent. In other words, we use a very limited subset of SMIL 3 in order to keep things simple -- hopefully even simpler than DAISY 2.02 SMIL. Future versions of EPUB could extend the supported SMIL subset if use cases arise.
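To make the intended subset concrete, here is a minimal sketch in Python that generates such an overlay: a flat sequence of text/audio pairs with no nesting. The file names, fragment ids and the `build_overlay` helper are illustrative assumptions, and the SMIL namespace declaration is left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_overlay(pairs):
    """Build a flat SMIL tree: one <par> per (text, audio clip) pair.

    pairs: list of (text_src, audio_src, clip_begin, clip_end) tuples.
    The helper and its input shape are hypothetical, not part of any spec.
    """
    smil = ET.Element("smil")
    body = ET.SubElement(smil, "body")
    seq = ET.SubElement(body, "seq")
    for text_src, audio_src, begin, end in pairs:
        # The audio clip duration drives the pace; the text is only referenced.
        par = ET.SubElement(seq, "par")
        ET.SubElement(par, "text", {"src": text_src})
        ET.SubElement(
            par, "audio", {"src": audio_src, "clipBegin": begin, "clipEnd": end}
        )
    return smil


# Hypothetical content document and audio file names.
overlay = build_overlay(
    [
        ("chapter1.html#heading1", "chapter1.mp3", "0s", "2.5s"),
        ("chapter1.html#para1", "chapter1.mp3", "2.5s", "9s"),
    ]
)
print(ET.tostring(overlay, encoding="unicode"))
```

Because the text documents are only referenced by URI, a reading device that ignores the SMIL file sees an unchanged text-only EPUB.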
Simply adding SMIL files to an EPUB fileset is not a complete solution. The following components of the DAISY and EPUB logic are impacted by the transparent overlay approach:
- spine
- DAISY uses OPF spine to list the sequential order of SMIL files. EPUB lists the sequential order of text content documents. This could be a non-problem: letting <spine> point to text documents as in EPUB is ok as long as there is a way for the SMIL-aware reader to infer which SMIL file corresponds to which content document. (See smilref below)
- ncx
- DAISY NCX traditionally points to SMIL fragments; in EPUB it points to text content document fragments. This could be a non-problem: letting NCX point to text documents as in EPUB is ok as long as there is a way for the SMIL-aware reader to infer which SMIL file corresponds to which content document. (See smilref below)
- smilref
- A user agent convenience to supplement indexing of SMIL and text fragments. This indexing helps user agents quickly find their place when going from the text to the SMIL. DAISY has previously used @smilref in the text document for this purpose; an out-of-line solution might be preferred to keep the text document untouched.
Solutions to inferring Text → SMIL associations
1. The correspondence between a text document and a SMIL document would be based on a 1-1 relationship rule. The association could be established in metadata (either in the OPF or in the content document head). This solution would not provide fragment-level association; it would only identify correspondence at the file level. Reading devices would have to react to "onClick" events by walking the SMIL file to find the correct fragment (this is done in DAISY 2.02 today when the DTBs do not contain "linkback").
2. A separate document contains a complete fragment-level mapping between nodes in the content document and SMIL nodes.
[Marisa] both solutions could co-exist and be useful
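The two lookup strategies can be sketched side by side. The Python below is an illustration under assumptions: the sample SMIL, its ids and both function names are made up for the example. The first function walks the SMIL file on demand, as solution 1 requires. The second precomputes the kind of fragment-level table that solution 2 would ship as a separate document.

```python
import xml.etree.ElementTree as ET

# Hypothetical overlay for one content document (namespace omitted for brevity).
SMIL = """<smil><body><seq>
  <par id="p1"><text src="ch1.html#h1"/><audio src="ch1.mp3" clipBegin="0s" clipEnd="2.5s"/></par>
  <par id="p2"><text src="ch1.html#para1"/><audio src="ch1.mp3" clipBegin="2.5s" clipEnd="9s"/></par>
</seq></body></smil>"""


def find_par_by_walking(root, text_ref):
    """Solution 1: scan the SMIL tree until a <text> points at text_ref."""
    for par in root.iter("par"):
        text = par.find("text")
        if text is not None and text.get("src") == text_ref:
            return par.get("id")
    return None


def build_fragment_map(root):
    """Solution 2: a complete text-fragment -> SMIL-node table, built once."""
    mapping = {}
    for par in root.iter("par"):
        text = par.find("text")
        if text is not None:
            mapping[text.get("src")] = par.get("id")
    return mapping


root = ET.fromstring(SMIL)
print(find_par_by_walking(root, "ch1.html#para1"))
print(build_fragment_map(root))
```

Walking costs a scan per lookup but needs no extra file. The precomputed table trades an additional resource in the fileset for constant-time lookups, which is one reason the two could co-exist.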
Container-level fallback
This is an alternate solution to the transparent overlay approach above. The OCF contains two versions (using an OPF and NCX for each):
- Text-only version
- Synchronized audio version, containing the additional SMIL and audio files
Both versions would still share common resources, such as a root-level images directory and styling information; and ideally also text documents. The latter depends on the fragment linking mechanism used by SMIL -- @id or XPath -- and on what is used already (e.g. does NCX require @id).
Related Question: Which features do we gain by including SMIL that we might want in the text-only version? E.g. skippability: don't just silence the page numbers in the audio rendition, also silence them in the text-only version (TTS rendition) and make them invisible on the page. Does this imply that there is a better place to define skippability than in the SMIL, such as in the NCX (already done to a certain point)?
[Marisa] not sure it's an alternate approach to having a transparent overlay -- they could live side by side. The transparent overlay approach adds more files to the manifest (and therefore the OPF file); wouldn't that make us want to use container-level fallback to specify which OPF to use?
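As an illustration of container-level fallback, the Python sketch below reads an OCF container.xml that lists two packages and picks one based on reader capability. The selection rule is purely an assumption made for the example (text-only package listed first, synchronized package listed last), as are the paths and the `choose_package` name. These notes do not define how a reader would tell the two versions apart.

```python
import xml.etree.ElementTree as ET

OCF_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

# Hypothetical container listing a text-only and a synchronized-audio package.
CONTAINER_XML = """<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="text/book.opf" media-type="application/oebps-package+xml"/>
    <rootfile full-path="sync/book.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""


def choose_package(container_xml, supports_smil):
    """Return the OPF path to open.

    Assumed convention (not from any spec): the first rootfile is the
    text-only version, the last one is the synchronized audio version.
    """
    root = ET.fromstring(container_xml)
    paths = [rf.get("full-path") for rf in root.iter(f"{{{OCF_NS}}}rootfile")]
    if not paths:
        raise ValueError("container lists no rootfile")
    return paths[-1] if supports_smil else paths[0]


print(choose_package(CONTAINER_XML, supports_smil=False))
print(choose_package(CONTAINER_XML, supports_smil=True))
```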
Accessibility solutions for timed media embedded in text content
Timed media embedded in text content exists as its own "island". An example is an HTML5 document containing a <video/> element.
In general, accessibility encompasses multi-modality, structured navigation, and user agent behavior.
The HTML5 WG has recognized a wide range of accessible media requirements, although navigation is not among them, and it is not currently known how many of these requirements will be accommodated by HTML5 language features. Based on minutes and other WG documents, it seems that at least captioning will be covered, but it is unclear what else is in scope (see the mailing list and the whatwg wiki).
Our question is how to make an embedded media element (in this example, <video/>) fully accessible at the fileset level. In this document, we will not concern ourselves with user agent behavior such as controlling which modalities are on/off or exposing navigation.
Multi-modality for video
- captions
- audio descriptions
Structured navigation for video
- high-level structures like segments (akin to chapters in text content)
- low-level structures like phrases
We could use the NCX to represent high-level structures. The target URI for an NCX navigation node would therefore be the video element inside the text document plus a time offset. It is important to reference the video element and not the raw media file itself. We need to investigate the latest work of the Media Fragments WG.
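One way to picture such a target is a URI whose fragment carries both the element id and an offset. The syntax used below (`document#elementId&t=seconds`) is a hypothetical combination invented for this sketch, as is the `parse_nav_target` name. It borrows the `t=` name that Media Fragments uses for the temporal dimension, but it is not a syntax defined by that work or by EPUB.

```python
from urllib.parse import urldefrag


def parse_nav_target(target):
    """Split a hypothetical NCX target into (document, element id, offset in seconds).

    Assumed form: "chapter3.html#video1&t=95.5". A missing offset means
    the start of the media element.
    """
    doc, fragment = urldefrag(target)
    element_id, _, rest = fragment.partition("&")
    offset = 0.0
    if rest.startswith("t="):
        offset = float(rest[2:])
    return doc, element_id, offset


print(parse_nav_target("chapter3.html#video1&t=95.5"))
print(parse_nav_target("chapter3.html#video1"))
```

Resolving the element id first and applying the offset second keeps the navigation node tied to the video element in the text document rather than to the raw media file.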
To represent phrase structure, we could use the caption format, although this has the limitation that if there are no captions in use, we don't know where the phrase boundaries are. Other phrase-based synchronization might be a requirement, depending on the scope of video accessibility in EPUB (e.g. sign language captions or audio overdub) and these media types would require knowledge of phrase boundaries. See ZedDist_Video_UseCases_Requirements for more about video in DAISY 4.
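If captions are available, phrase-level navigation reduces to a lookup over cue times. The Python sketch below assumes captions have already been parsed into (start, end, text) tuples sorted by start time. The cue data and the `phrase_at` helper are illustrative only and not tied to any particular caption format.

```python
from bisect import bisect_right

# Hypothetical caption cues: (start seconds, end seconds, text), sorted by start.
cues = [
    (0.0, 2.0, "Hello."),
    (2.0, 5.5, "Welcome to the tour."),
    (7.0, 9.0, "Let's begin."),
]


def phrase_at(cues, t):
    """Return the index of the cue covering time t, or None if t falls in a gap."""
    starts = [cue[0] for cue in cues]
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < cues[i][1]:
        return i
    return None


print(phrase_at(cues, 3.0))
print(phrase_at(cues, 6.0))
```

The gap between the second and third cue shows the limitation noted above: where no caption exists, this approach has no phrase boundary to offer.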
