
Specification needed for dictionaries and other reference works

Project:EPUB Maintenance
Category:feature request
Status:future consideration

One of the shortcomings of the EPUB format as it is now is the absence of a clear specification for reference works such as dictionaries. This lack (in my and others' opinion) is preventing publishers from creating dictionaries in the EPUB format, which in turn means that reading devices supporting EPUB usually don't support dictionary lookup.

In the MobileRead forums there has been some discussion about this issue, and there is even a proposal for extending the spec here.

I believe work on such a specification should be a priority for the next revision of the EPUB spec.


Moving to "future consideration", but there is much agreement in the comments, so creating a short-term informational document should be considered.



I would suggest you review Out-of-Line XML Islands, and then propose the use of such things as the Text Encoding Initiative (http://www.tei-c.org/index.xml) to these publishers. This is the purpose of XML Islands -- to allow for semantically richer formats than XHTML.

Sadly, the ability to use pure XML (as opposed to a highly constrained format like XHTML) in EPUB is something many overlook. Its existence in the specification allows publishers to use the wide variety of markup languages that have been developed for specialized purposes.


You are right, but I believe a standard specification is needed anyway. Otherwise each publisher might use a different format, and then reading systems would not be able to use a single method to look words up in a dictionary, for instance.

For a stand-alone book, a custom choice of out-of-line XML islands can be a good solution, but dictionaries should not be stand-alone: they should be searchable from "outside" (i.e., while reading another book), etc.


I rather suspect that defining a standard for a given subset of books is beyond our current scope of work. If we do dictionaries, then we have to do science textbooks, and poetry, and so on.

What might be more persuasive is an indexing scheme that applies to something like the ID attributes found on virtually all XHTML elements, and could exist in a file external to the dictionary (or other work) itself. This would require no modification to the OPS portion of EPUB, and could be considered an extension of the existing standard.
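
A minimal sketch of what such an external index could look like, assuming a JSON file that maps each headword to a `file#id` target. The JSON layout, the file name `entries.xhtml` and the ids are hypothetical and appear nowhere in EPUB; the point is only that the content document itself needs nothing beyond ordinary ID attributes.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical content document: plain XHTML whose entries carry id attributes.
ENTRIES_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="e1"><h2>aardvark</h2><p>A burrowing African mammal.</p></div>
    <div id="e2"><h2>abacus</h2><p>A frame with beads used for counting.</p></div>
  </body>
</html>"""

# Hypothetical external index, kept in a file separate from the content.
EXTERNAL_INDEX = json.loads(
    '{"aardvark": "entries.xhtml#e1", "abacus": "entries.xhtml#e2"}'
)

# Stand-in for the publication's files, keyed by href.
DOCUMENTS = {"entries.xhtml": ENTRIES_XHTML}


def look_up(term):
    """Resolve a term through the external index and return the entry text."""
    target = EXTERNAL_INDEX.get(term)
    if target is None:
        return None
    href, _, fragment = target.partition("#")
    root = ET.fromstring(DOCUMENTS[href])
    for element in root.iter():
        if element.get("id") == fragment:
            # Collapse whitespace so the entry reads as one line of text.
            return " ".join(" ".join(element.itertext()).split())
    return None


print(look_up("abacus"))
```

Because the index lives outside the content, it could in principle be regenerated or replaced without touching the dictionary itself.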


I think that dictionaries are a different class of book, and defining how dictionaries should be created does not imply the need to define formats for other kinds of books.

Dictionaries are a special case because their content needs a syntax that is machine-readable, not just searchable and displayable like other ePub books. Relying on ID attributes of XHTML elements could well be a way to do this. But unless this is defined as a standard syntax for dictionaries, ePub reading software is not going to be able to use any dictionaries other than ones created by the people writing the reading software, nor are dictionaries for ePub going to be created by the commercial dictionary publishers.

ePub needs to be a superset of existing formats. Mobipocket currently has a defined dictionary format, anyone can create a dictionary in that format, and several commercial dictionary companies have done so. These dictionaries can be used by Mobipocket reader software to look up words when reading other books, and the choice of dictionary is up to the user.




I propose that we define a metadata tag that would identify a book as a dictionary/reference (perhaps there is already a well-defined way to do it somewhere in Dublin Core). It would still have to be formatted as a regular EPUB XHTML book, but Reading Systems that wish to utilize it as a dictionary can build an index of all terms defined using the standard XHTML dt tag. I think this will cover the immediate need, and I do not think that we can do anything more in this cycle.
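
A minimal sketch of the indexing step on the Reading System side, assuming the content document is well-formed XHTML that a standard XML parser can read. The sample entries and the choice to fold case are illustrative assumptions, not part of the proposal.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Invented sample content; a real book would be read from the EPUB container.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <dl>
      <dt>EPUB</dt>
      <dd>An open e-book format.</dd>
      <dt>NCX</dt>
      <dd>The navigation control file used for the table of contents.</dd>
    </dl>
  </body>
</html>"""


def index_terms(xhtml):
    """Return a dict from lower-cased term to the term as written in a dt."""
    root = ET.fromstring(xhtml)
    terms = {}
    for dt in root.iter(XHTML_NS + "dt"):
        text = " ".join("".join(dt.itertext()).split())
        if text:
            # Keep the first spelling seen for each case-folded key.
            terms.setdefault(text.lower(), text)
    return terms


TERMS = index_terms(SAMPLE)
print(sorted(TERMS.values()))
```

Nothing here requires new markup: the index is derived entirely from dt elements already present in the book.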


Assigned to:Anonymous»

I'm taking this one as lead. If you want to discuss off the wiki, please email me at ben at prodigal dot ca.


I agree with Peter, and it's probably a better idea to put the burden on the reading system (creating and using an index out of dt tags) than on the content providers.

It should be easy to create new dictionaries and with pure XHTML + a new metadata element it would be really easy.


Yesterday I was at a seminar on feature structures and someone presented LMF (which stands for Lexical Markup Framework).

It's an ISO standard dedicated in part to creating machine-readable dictionaries (ISO code: ISO-24613:2008).

I remember from the presentation that it was created by some people from the TEI, but it's not currently compatible with it (it should be compatible with a future version of TEI). At the moment it is mostly used by linguists (that's why its Wikipedia page presents an example without any definition, but definitions seem to be part of the core package). The standard itself doesn't define a preferred representation but advises using XML (and gives a DTD).

Here is the "official" page about this work http://www.lexicalmarkupframework.org/

It seems we don't need to standardize on this issue if something powerful already exists.


Status:open» proposed resolution

As the discussion of supporting STM (science, technical and medical) publishers has also raised the question of standard XML-based extensions to EPUB, I view this as part of that issue, and therefore suggest we move this to future directions. This is obviously a far larger discussion: not just "how do we support reference works" but "how do we leverage XML to support specialized publishing of all kinds in a standardized way that will allow interoperability at the reading system level?"


I disagree. Based on Peter's suggestions, we don't need any new markup: XHTML is enough. A simple metadata property is good enough to support this.


I would like to second Hadrien: we should be conservative. XHTML is enough for simple dictionaries. More semantics is a good long-term direction, but that should be done across the board, not just for dictionaries, and it is a fairly sizable task.


Assigned to:» HGardeur

Essentially, we've agreed to break this into two issues:

1 - The need for supporting specialized publishing in EPUB. This will be opened as a separate issue, shortly.

2 - We need examples for how this would be implemented in the markup of an EPUB document using DT and DD elements.

I'm handing off this issue to Peter and Hadrien, and focusing on item 1, above.


Status:proposed resolution» open


We need:
1. An example using DT and DD
2. A metadata property to indicate that an EPUB can be indexed by a reading system


The proper semantic would be:





Multiple definition terms (dt) can be associated with a single definition description (dd), and a definition term can have multiple definition descriptions too.
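
A minimal sketch of this many-to-many grouping, using invented entries rather than the markup originally posted. It assumes the usual XHTML reading of a dl: consecutive dt elements share the dd elements that follow them, and a dt that comes after a dd starts a new group.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Invented sample: two terms sharing one description, one term with two.
DL_SAMPLE = """<dl xmlns="http://www.w3.org/1999/xhtml">
  <dt>color</dt>
  <dt>colour</dt>
  <dd>The property of an object that depends on the light it reflects.</dd>
  <dt>bank</dt>
  <dd>The land alongside a river.</dd>
  <dd>An institution that holds deposits.</dd>
</dl>"""


def group_entries(dl):
    """Return a list of (terms, descriptions) pairs for one dl element."""
    groups = []
    terms, descriptions = [], []
    for child in dl:
        text = " ".join("".join(child.itertext()).split())
        if child.tag == XHTML_NS + "dt":
            # A dt following one or more dd elements closes the previous group.
            if descriptions:
                groups.append((terms, descriptions))
                terms, descriptions = [], []
            terms.append(text)
        elif child.tag == XHTML_NS + "dd":
            descriptions.append(text)
    if terms or descriptions:
        groups.append((terms, descriptions))
    return groups


GROUPS = group_entries(ET.fromstring(DL_SAMPLE))
for terms, descriptions in GROUPS:
    print(terms, "->", descriptions)
```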


I think a specific dc:type should be sufficient. How about xhtml-dictionary or ops-dictionary?


I agree, Peter, but should we also allow books with a glossary to be marked as indexable, or do we limit this strictly to dictionaries?


OK, how about "ops-reference". If a book is marked as being of "ops-reference" type, it is guaranteed that all dl/dt/dd elements are used to define terms and not for something else.
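
A minimal sketch of how a Reading System might detect this hint, assuming the value is carried in a dc:type element in the OPF metadata. The package fragment below is an invented, incomplete example, and "ops-reference" is only the value floated in this thread, not a registered one.

```python
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Invented, incomplete package document; only the metadata matters here.
OPF_SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample Glossary</dc:title>
    <dc:language>en</dc:language>
    <dc:type>ops-reference</dc:type>
  </metadata>
</package>"""


def is_reference(opf_xml, marker="ops-reference"):
    """Return True if any dc:type element carries the proposed marker value."""
    root = ET.fromstring(opf_xml)
    return any(
        (element.text or "").strip() == marker
        for element in root.iter(DC_NS + "type")
    )


print(is_reference(OPF_SAMPLE))
```

A book without the marker would simply be treated as a regular publication, with no assumptions made about its dl/dt/dd elements.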


Sounds reasonable to me.

So, we have our example and the metadata. Any input from anyone else?


I remain very confused by this recommendation. As Peter is fond of saying, I'm the guy that has to put the words in the spec into code. What, exactly, are we proposing? It's not good enough to just say "this thing is a dictionary", since I have no idea what the specific implications of that are. Does it mean I can show it as regular book? May I? Am I forbidden to? Can I pull information out of the dictionary and display it while reading other content (say, in a pop-up window or dialog)? Or must I send a reader to the dictionary publication itself? If I place it in a pop-up, should I display attribution information? What styling must I apply? Given the snippet:

military slang A word programmers use, from fubar

and a CSS rule "dd *.langroot {font-weight: bold}", must this be applied? Must all inheritance be applied? What about margins, and parent margins? It seems like some people have an idea of what this element does, but that idea may not match everyone else's idea. I would need to see this fleshed out much more than it is now before I could support adding it to the spec.


And, of course, the html in that post got displayed as html. But imagine there was a span tag around "military slang" with a class of "langroot".


First of all: with the metadata, we're not saying this thing is a dictionary, but rather "a reading system might index the dt/dd in this book and use them when the user wants to look up a definition".
Such a publication should work like any other EPUB file, and it's entirely up to the reading system to decide how you look up a word and how the result is presented (redirecting to the indexed file, displaying it in a pop-up, etc.).

Styling is a different problem and will largely be dependent on how you decide to display the definition: it doesn't make sense to inherit @page CSS rules for something that will be displayed in a pop-up for example.


I agree with Hadrien. A book marked as "ops-reference" is still a book and can be read just like any other book, with styles applied in the usual manner. However, a Reading System can use this hint to create implicit links into such a book, using terms defined by dt elements as keys. The UI can also be made more user-friendly: for instance, when navigation into such a book happens through a term look-up, the Reading System could potentially first display the enclosing dl element for the given term using some specialized UI such as a pop-up box. The specification only defines the semantics of dl/dt/dd elements (or, better to say, defines a way to hint that normal XHTML semantics can be relied upon for these elements and that they are not just abused, say, for some sort of visual effect). Presentation of the book itself is not changed in any way.


I guess I am going to have to see a proposal before I can comment further. The fact that there may be reasonable answers to my questions (not all of which have been answered), or even multiple reasonable answers, just means we will have to include some subset of those answers in the spec. To make this metadata a useful feature (one that can be implemented in a consistent, though not necessarily identical, manner across a variety of Reading Systems), we are going to have to add details of how we intend for the information to be used. And I continue to fail to see how this can be considered during the errata phase, since it is neither a bug fix nor a clarification. I suppose it might be the subject of an informational document, with the possibility of adding it to the spec at the next revision. If we don't intend to resolve it during this phase, we should set it to "Future Consideration", or possibly "Accepted".


I do not see why we have to say how the information is going to be used. At least, we did not do that for other parts of the spec. For instance, we said that NCX is a table of contents, but we have not said much about how it is going to be used in terms of specific UI. We don't even mandate that it is used in any way, I think. We just need to define what the information means.

As for errata vs. future considerations, this was discussed during the call. The idea is to address this (pretty hot) issue now in a simple manner without changes to the spec and revisit it in the next cycle when a more comprehensive solution can be developed.


The NCX has an entire section of our spec devoted to it, even though it has its own entire spec document which defines all the terms, their uses and restrictions. So far, all I have seen for this item is that we will have a new metadata element, but no actual spec language. My concerns here are three-fold.

First, it is not clear to content creators what the ramifications of this element may be. For instance, say "Jane's House o' Shakespeare" publishes the Awesomely Annotated Othello. This document has brief essays on almost every work in the play, commissioned at considerable expense by the publisher. Amazingly, they actually used the semantically correct HTML elements for these references (dt and dd tags). When they go to make an eBook, they see this ops-reference metadata element and say to themselves "hey, this is a reference. I should set that!" A couple of months go by and they have their book on the Steampunkinator 1000 Hydraulic eReader, when their arch nemesis, "Fred's Den of Reprints" ("If it's not under copyright, we print it!") produces Othello, a strict reprint of some 100-year-old edition they found on the intertubes. While reading this horribly formatted tripe, the editor of the Awesomely Annotated edition taps on a character's name to see how lousy any possible descriptions might be. To their amazement, a wonderful essay appears in a pop-up window with detailed analysis of the part. Even more astonishing, it appears to be a word-for-word copy of their essay! Lacking, of course, any attribution to the original work. Imagine their surprise. And by surprise, I mean willingness to sue anyone even remotely involved in the infringement of their copyright.

Second, implementors have no guidance for how to use this element, nor are there any conformance requirements to ensure consistent use across platforms. For instance, as an implementor, it is not at all clear what, if any, formatting must remain in a pop up window containing this text. In books, we are pretty clear about required CSS support, allowed exceptions, rendering of html elements, etc. Do all those requirements apply when shown in a pop up? Some? None? By what standard do we look at an implementation and say they are conforming or not with regards to this element?

Third, it is entirely inappropriate for this working group to consider non-error modifications of the spec. It is simply not in our charter, and it provides a backdoor for adding features that do not go through the normal development process. At the very least, this introduces potential intellectual property conflicts. Our normal spec process requires members to certify that they do not have any IP that would infringe on the features of the spec, and if they do, it must be disclosed. I think we may even require a waiver when a conflict is found.

For all these reasons I think adding this element, at this point is not just inappropriate, but actually harmful. If we really can't wait until the next spec version, then we should consider an informational document, for which a precedent has already been set with font mangling. Such documents can be produced and approved rapidly, and they provide enough room for a fully considered feature proposal. As an informational document, it adds no implementation or conformance requirements to existing publications or reading systems. If at some future point we decide to add the full feature to the spec, we can. Or we can add the more substantive changes to the spec and rely on this element as a stop gap.


1. It's a good point and I believe that aside from this example, it makes sense to display the title and author for another good reason: a reading system might find a definition in multiple EPUB files. But once again this is purely useful if you decide to display the definition in a pop-up, in the margin etc. If you redirect the user to the source file, attributions shouldn't be too much trouble.
We should include some spec language to encourage reading systems to include this information, but since we don't know, and don't want to force, a single way of displaying the definition, we can't have requirements, only recommendations.

2. Same problem: we don't really know how the reading system might display the definition, and different ways of displaying it probably need different rules. The easiest way to handle this is to redirect the user to the EPUB file: in this case, it should behave like any other EPUB book. For other UI elements (pop-up, footer like on the Sony PRS-700, etc.) we can recommend certain properties (all CSS properties affecting text presentation), but we can't go much further.

3. We discussed this on the call and we're not adding anything to the spec: the semantics for dl/dt/dd already exist in XHTML/OPS, and dc:type is part of the DC namespace that we use in OPF. This is very similar to the font mangling issue; I agree with you on that point. We need to add this feature very quickly, and it requires minimal changes (none here, aside from the behavior expected from a reading system) to support it.

I'm open to further discussions about this but overall I agree that an informational document might be the best way to move forward at this point.


Hadrien, do you want to write an informational document? I think this issue should be moved to future directions; however, an informational document does not have to wait until the new iteration of the spec, since we don't need to add new features (nor is there any new IP, BTW).

Brady, how CSS would work for snippets of content is clearly not defined in the spec, but this is not a problem unique to reference materials. Non-linear content can be displayed in a pop-up (e.g. a footnote); this was extensively discussed during the "links to out-of-spine content" debate. Also, rich-text copy-and-paste requires applying CSS to a fragment of content. This problem has nothing to do with dictionaries/reference materials. There is no mandate that references are implemented in this or any other way. We can certainly look into that in the new iteration of the spec itself.


Do we have any reference document or basic example for these informational documents?
I can co-edit this document with someone else. Would you agree to work on this with me, Peter?


During one of our last conference calls, we mentioned that we need to work on out-of-line content (footnotes, margin notes, definitions, etc.). Should we create a new issue tagged as "future direction"?


Status:open» future consideration