
Clarify which external DTDs are allowed in EPUB.

Project: EPUB Maintenance
Component: General
Category: bug report
Priority: normal
Assigned: BDuga
Status: completed @ 2.0.1

This is based on the bug filed against EPUBCheck (issue 31: http://code.google.com/p/epubcheck/issues/detail?id=31). The author used the XHTML+MathML+SVG DTD and the EPUB could not be processed. My position is that external DTDs cannot be allowed in EPUB. EPUB is supposed to be a self-contained format, and XML parsing is not possible without the DTD in many cases. For instance, in this particular case the parser would not know which entities are declared in the DTD. It is not practical to carry every possible DTD inside an EPUB parser. I can see these DTDs as required: DTBook, NCX, OEB 1.2, OPF, SVG, XHTML. We need to discuss this and clarify the spec.

Description
Issue Id: 7
Resolution:

All DTDs and external entities (including, but not limited to, external DTD references) referenced by XML documents in the package manifest are considered part of the publication and thus must also be listed in the manifest. As an exception to that rule, certain DTDs of core document types do not need to be included. Reading Systems may identify these DTDs by the public identifier in the DOCTYPE of the document. The list of DTDs that do not have to be included in the manifest is:

- SVG 1.1 DTD: http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd
- XHTML 1.1 DTD: http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
- DTBook DTD: http://www.daisy.org/z3986/2005/dtbook-2005-2.dtd
- NCX DTD: http://www.daisy.org/z3986/2005/ncx-2005-1.dtd

Comments

#1

It's a little odd to add the MathML DTD, since it would be illegal to reference MathML from outside an inline island. The last paragraph of section 2.1 is particularly helpful in muddying this issue:

"As with any DTD referenced from the DOCTYPE declaration, OPS XHTML Content Documents must not reference the XHTML DTD unless such documents are valid with respect to that DTD (i.e. they do not include Inline XML Islands or inline SVG)."

Does that mean it *can* reference other DTDs? But why would we, since only XHTML+SVG and our extensions are allowed in an OPS XHTML content document? While you might want to use additional entities, those would *have* to be in an internal subset, since validating processors are not a requirement for Reading Systems. Seems like forbidding the external subset is correct.

#2

External DTDs have to be allowable in EPUB, or External XML Islands would basically be forced to well-formed status, which opens a potential nightmare of incompatibilities. I doubt this is what Peter means, of course, since any basic XML processor can handle DTDs.

I suspect the problem could be cleared up simply by re-wording section 2.1 as follows:

"OPS XHTML Content Documents must be well-formed XML documents. In addition, they may also be valid XML documents, if they reference the XHTML DTD in a standard DOCTYPE declaration. By the same token, OPS DTBook Content Documents must be well-formed XML documents, and may also be valid, should they reference the DTBook DTD in a standard DOCTYPE declaration.

No other DTD may be referenced in an OPS Content Document unless said document complies with the guidelines for an External XML Island."

With appropriate references through said text, of course, including a link to the XHTML DTD that we're using for validation.

#3

What I mean by external DTDs are DTDs outside of the EPUB package (e.g. referenced by an HTTP URL). I think non-blessed DTDs external to the EPUB file should be outlawed. A non-blessed DTD can be external to the XML document, but it has to be included in the EPUB, especially since HTTP-based DTDs can potentially change over time.

Even if we allow such DTDs, we need to specify how such files are processed. There are many cases where XML cannot be parsed without its DTD (e.g. entity declarations in the XHTML DTD, or implied attribute defaults). Should such DTDs be outlawed? We cannot require Reading Systems to pull a DTD down from a URL: bad for the offline case and bad for privacy.
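To make the parsing problem concrete, here is a minimal sketch (Python standard library; the document and entity are hypothetical examples, not from the spec) of what happens when a non-validating parser meets an entity that is declared only in an external DTD:

```python
# Sketch: a non-validating parser cannot resolve entities that are
# declared only in an external DTD, such as XHTML's &nbsp;.
import xml.etree.ElementTree as ET

doc = '<p>A&nbsp;non-breaking space.</p>'  # &nbsp; is declared in the XHTML DTD

try:
    ET.fromstring(doc)
    print("parsed")
except ET.ParseError as err:
    # Without access to the DTD, the parser has no idea what &nbsp; means.
    print("parse failed:", err)
```

The same document parses fine once the entity declaration is available, which is exactly why the DTD is part of the document's meaning, not an optional add-on.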

#4

Yes, I think DTDs (or any schema) must be part of the ePub. See OPF 2.3:

"The required manifest must provide a list of all the files that are part of the publication (e.g. Content Documents, style sheets, image files, any embedded font files, any included schemas)."

If a DTD is part of the publication, it must be in the manifest, and hence part of an OCF/EPUB. This could just be a clarification, perhaps with a mention in OPS as well as OPF. Or you could read that sentence as saying that only the schemas that are in the manifest need to be in the manifest (only an "included" schema needs to be included in the manifest). But that is a really bizarre interpretation.

#5

OH! Yes, then. I wholly agree with Peter's point. Let me take a stab at phrasing:

"OPS XHTML Content Documents must be well-formed XML documents. In addition, they may also be valid XML documents, if they reference the XHTML DTD in a standard DOCTYPE declaration. By the same token, OPS DTBook Content Documents must be well-formed XML documents, and may also be valid, should they reference the DTBook DTD in a standard DOCTYPE declaration. No other DTD may be referenced in an OPS Content Document unless said document complies with the guidelines for an External XML Island.

Any DTDs outside of OPS DTBook and XHTML which are referenced in an EPUB must be included in the EPUB package. System DTDs may not be referenced via an external network resource."

EXAMPLES FOLLOW:

Legal DTD Reference

<!DOCTYPE example SYSTEM "example.dtd">

Illegal DTD References

<!DOCTYPE example PUBLIC "-//EXAMPLE//Example DTD 1.0//EN" "http://www.example.com/example1.dtd">

#6

Interestingly, the XML spec may have given us a way out of this, and still allow us to be compliant to that specification.

See here:

http://www.w3.org/TR/xml/#sec-external-ent

"In addition to a system identifier, an external identifier may include a public identifier. An XML processor attempting to retrieve the entity's content may use any combination of the public and system identifiers as well as additional information outside the scope of this specification to try to generate an alternative URI reference. If the processor is unable to do so, it MUST use the URI reference specified in the system literal."

Note the interesting bit -- to try to generate an alternative URI reference. It does not specify what the alternative URI reference needs to lead to. So, a compliant XML processor, when looking for a DTD called "example.dtd" might autogenerate an in-memory DTD that specifies something like this:

<!ELEMENT example ANY>
<!ATTLIST example
  attribute1 CDATA #IMPLIED
  attribute2 CDATA #IMPLIED
>

In other words, we could simply reference that section of the XML specification, and then provide non-normative text suggesting a processor approach which both solves our externalized DTD problem and conforms to the XML spec.

 

#7

I'm not sure why we need the workaround - your solution in comment 5 seems fine. However, I wouldn't mention network resources, and would just leave the requirement that external DTD subsets must be in the package manifest. Is there a problem with that approach that requires the generated DTD hack?

#8

Since reading systems do not have a validation requirement, this seems like a problem for content creation tools and any tools that check the conformance of content documents (like epubcheck). For our core content documents it seems reasonable to limit the external DTD subset to the corresponding DTD (the OPF DTD for package documents, the SVG DTD for SVG documents, etc). What DTD this is for our core XHTML is trickier, but we could just pick the XHTML DTD. That solves the immediate problem with epubcheck.

We could, in addition, require that any DTD referenced in an out-of-line XML Island must be included in the package manifest. Or we could alter epubcheck to simply not perform XML validation on those islands, since we don't appear to require validity as defined in the XML spec for such islands. They just need to be valid "to their schema". If that schema doesn't happen to be a DTD, then using a validating parser as defined in the XML 1.1 spec seems like an error. The same could be said for our XHTML core media type - since a conforming XHTML content document does not have to be valid per the XML 1.1 spec, we should probably just run the parser in non-validating mode when processing that document.

#9

Another point to add to this discussion: in addition to being a problem for offline use and a privacy issue, resolving all external DTDs and entities per XML validating-parser rules is a security risk and should not be done on untrusted content. See http://www.securiteam.com/securitynews/6D0100A5PU.html
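For readers unfamiliar with the attack: the advisory describes XML eXternal Entity (XXE) injection. A rough sketch of the payload shape follows (the file path is the classic illustration, not anything from this issue); note that Python's standard-library parser, unlike a spec-compliant validating parser, refuses to expand the external entity:

```python
# Sketch of an XXE-style payload: a validating parser that follows the
# XML spec's entity-resolution rules would read /etc/passwd and
# substitute its contents for &secret;.
import xml.etree.ElementTree as ET

payload = '''<?xml version="1.0"?>
<!DOCTYPE doc [ <!ENTITY secret SYSTEM "file:///etc/passwd"> ]>
<doc>&secret;</doc>'''

try:
    ET.fromstring(payload)
    print("entity expanded (vulnerable behaviour)")
except ET.ParseError:
    # ElementTree does not fetch external entities, so parsing fails.
    print("external entity blocked")
```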

I am not sure I understand #6. The XML spec indeed allows a processor to define its own rules to resolve public and system identifiers into a URI. Is it suggested that we resolve all unknown identifiers into an empty file - or what? Where do these "example" elements and "attribute1" attributes come from? I can do it for a DTD which is known ahead of time, but for an unknown external DTD I have zero information on what I can autogenerate.

Two solutions seem satisfactory to me:

1. Restrict external DTDs and entities to a well-defined list (which we will provide in the spec). All other DTDs must be packaged together with the document. This makes the document self-contained and ensures that its meaning does not change with time.

2. All documents that reference DTDs and entities outside of a well-defined list must be standalone (as defined by the XML spec). This ensures that validating and non-validating parsers produce the same result on these documents. This way XML can be validated against a DTD if required, but a non-validating parser is guaranteed to produce the same result, so access to external DTDs can safely be disabled. Note that any non-standalone XML document can be algorithmically converted to an equivalent standalone one, so we do not lose the ability to package arbitrary XML in EPUB.
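As an illustration of option 2 (my own sketch, not proposed spec text): the same content expressed as a standalone document, with the one entity it needs copied into the internal subset so that no external DTD fetch is ever required.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE html [
  <!-- entity declaration moved into the internal subset -->
  <!ENTITY nbsp "&#160;">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p>A&nbsp;non-breaking space.</p></body>
</html>
```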

#10

All you have to do is generate a DOM for the document, and then producing a DTD that allows any content is simple. I assume any software is producing a DOM regardless, which is why this is such an easy out.

Problems with both your solutions:

Option 1: We lose the ability to handle arbitrary XML, and require extensive modification of legacy documents.

Option 2: We require the same extensive modification of legacy documents, and we mess up the authoring process mightily for people using a single referenced DTD to handle multiple documents.

#11

Producing a DOM isn't currently required, and we should avoid adding that requirement, especially as an implied one.

I don't understand the difference between auto-generating a DTD and reducing the requirement to well-formedness. If the approach did work (and I think it would, but only for some DTDs) it wouldn't be any different than ignoring the DTD altogether.

From an authoring perspective, I'd have to avoid creating documents that relied on the auto-generated DTDs, as they'd seem to work; but if any of the DTDs I rely on contained a set of character entities, those entities would be lost and the document would not be rendered correctly.

#12

Frankly, I'd just prefer to have external DTDs handled properly, but apparently, this is a big issue for people...

It would be different than ignoring the DTD, because even well-formed documents would be required to fetch the external DTD for entity resolution.

And yes, the problem of entity resolution is also a thorny one. Peter's "reduce it all to an internal subset" would solve that, but creates its own mess.

The simple fact of the matter is that unless we're compliant to the XML spec's model for resolving DTDs, we end up with a suboptimal solution, no matter what we do. This is why I am still strongly supportive of leaving the spec as is with some non-normative text to suggest processing when fetching external DTDs is problematic for limited implementations.

#13

Peter's "reduce it all to an internal subset" would solve that, but creates its own mess.

Please expand on that as I can see absolutely none at all.

The simple fact of the matter is that unless we're compliant to the XML spec's model for resolving DTDs, we end up with a suboptimal solution, no matter what we do.

As I have noted, there is a security problem in the XML spec. It is important to understand that the problem is in the spec itself - any validating parser that does exactly what the spec tells you to do is vulnerable and cannot be used on untrusted content. Any fix to that vulnerability involves breaking the XML spec in some respect. In addition, of course, accessing external resources is a big privacy issue and cannot be supported on devices without network access.

One option for implementations is to not use a validating parser at all, and instead use a non-validating parser and never resolve entities/DTDs outside of the package (totally compliant with the XML spec). But in that case, unless the XML document is marked standalone, it is possible to create content which would produce totally different results depending on the use of a validating vs. non-validating parser. If DTDs are important to you, the ability to use a validating parser (or any parser that reads that DTD, for that matter) has to be important as well. And having the same result every time is critical for EPUB - the intent always was that EPUB is fully self-contained. You simply cannot have it both ways without restricting XML documents to be standalone (and that's exactly why standalone XML documents were invented).

#14

Please see comment 10 for the mess standalone documents create.

An EPUB can be standalone with an externally-referenced DTD, depending on the system's requirements. As previously mentioned, many systems pre-process the EPUB package - meaning they can process the DTD prior to delivery. Other systems can process it in real time. Limited systems can either choose not to validate (i.e., being well-formed processors) and process it in real time, or use some sort of kludge to generate a DTD from the document tree they'll likely have to generate anyhow.

I am slowly coming back to the "this is an implementation detail" stance, though I agree that clarification in the spec is necessary. I am not going to support tossing out external DTD references for an abstract security concern (that is somebody's implementation detail) or somebody's unwillingness to handle external DTDs in real time (also an implementation detail).

#15

I have another question on the issue. I'm trying to understand the scope...

Does this affect anything other than XHTML with inline XML islands?

DTBook isn't allowed to have inline islands, and the specification calls out the DTBook DTD. Out-of-line XML islands have fallbacks that would generally be used by any reading system that wasn't expecting the content anyway. If a document claimed to have content of the type "application/docbook+xml", then a reading system that doesn't support DocBook isn't going to have to worry whether the DocBook DTD is internal or external, because that file isn't going to get parsed. If the reading system does support DocBook and won't have net access, it can include a copy of the DTD it supports.

So, then is the problem just the documents of type "application/xhtml+xml"?

#16

I see it as also affecting out-of-line XML islands. Not all XML applications pass a MIME type, especially with a custom schema. And the issue of net access is a red herring - we can't make rules presuming net access or the lack thereof. We have to cover everything, providing scaled fallbacks, just as we do with all other content.

This is why I proposed the hack I did -- it's ugly, but it solves the problem.

#17

Please see comment 10 for the mess standalone documents create.

When a class of documents whose processing today is not well-defined is eliminated, it does not sound like a mess to me. It sounds more like clean-up.

I am not going to support tossing out external DTD references for an abstract security concern (that is somebody's implementation detail) or somebody's unwillingness to handle external DTDs in realtme (also an implementation detail).

From that point of view every error in the spec is just implementation detail, even if the spec told you that 2 and 2 is 5. Cannot implement it that way - well, too bad! Never mind that it is not possible to implement it.

The security concern is not abstract at all - millions of dollars have been spent fixing that problem (and restricting external DTDs/entities in every case): http://www.google.com/search?hl=en&q=XXE+attack&aq=f&oq=&aqi=g-s1g-sx9

And of course it's a bit more than "unwillingness" to handle external DTDs when someone is reading a book stored on a flash card on a hand-held device that has no network connection. Should we outlaw such reading devices? Or outlaw file transfer by means of a flash card? How is that an "implementation detail"?

Specifications are effectively contracts between implementors and content authors. It does not help when the spec requires something that cannot be implemented.

#18

This is why I proposed the hack I did -- it's ugly, but it solves the problem

I do not see how it does. Suppose your implementation sees a resource with public ID "-//ADBE//DTD FOO 1.0//EN" and system ID "http://foo.adobe.com/bar.dtd". What is supposed to be autogenerated?

#19

When a class of documents for which processing today is not well-defined is eliminated, it does not sound like a mess to me. It sounds more like clean-up.

I'll be sure to let the people who will need to spend a few hundred grand on "cleanup" know that you're fully supportive of their efforts to do better.

We cannot abandon backwards-compatibility just because we find a bug in the spec.

 

From that point of view every error in the spec is just implementation detail, even if the spec told you that 2 and 2 is 5. Cannot implement it that way - well, too bad! Never mind that it is not possible to implement it.

Funny. I have a fully validating OPS parser running on my Blackberry. So "cannot implement it that way" appears to be incorrect. It sounds a lot more like "unwilling or unable to implement it that way."

And for the record, I proposed a solution that would scale from top-to-bottom -- you just don't like it. Please stop trying to paint me as being against the limited devices -- I'm trying to ensure that we have compatibility at all levels to the best of our ability. You're saying this:

"One class of Reading System, low end devices with no network connection, cannot support external DTDs, therefore, we should cripple everybody else -- including devices that can fetch external resources securely, Reading Systems that are used on robust devices, Reading Systems that pre-compile before delivering to limited devices, Reading Systems that are based on millions of pages authored from a centralized DTD that they fetch remotely in a large corporate environment, and everybody else."

I just can't buy into that. If EPUB's mission was to produce the latest Stephen King works on tiny reading devices with no regular net connection, I'd buy it. But it's not our mission, and never has been. We can't cripple the spec for one specific subclass of limited device implementation.

#20

Let's assume the document looked like this.

<bar>
  <foo class="bat">Some text here</foo>
</bar>

The generated DTD looks like this:

<!ELEMENT bar ANY>
<!ELEMENT foo ANY>
<!ATTLIST foo
class CDATA #IMPLIED>

It sucks for validation, but it solves the problem, and it's within the bounds of the XML spec, and doesn't break legacy documents. It also will let the marketplace decide what's preferable -- a crippled device that doesn't validate, or a robust implementation.
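A sketch of that generation step, in Python for concreteness (the function name and the ANY/#IMPLIED policy are mine, not anything agreed in this thread): walk the parsed tree once and emit a maximally permissive DTD covering every element and attribute actually seen.

```python
# Sketch: derive a maximally permissive DTD from a parsed document tree.
import xml.etree.ElementTree as ET

def generate_permissive_dtd(xml_text):
    root = ET.fromstring(xml_text)
    seen, attrs = [], {}
    for el in root.iter():                     # document order, root first
        if el.tag not in seen:
            seen.append(el.tag)
        for name in el.attrib:
            attrs.setdefault(el.tag, set()).add(name)
    lines = ['<!ELEMENT %s ANY>' % tag for tag in seen]
    for tag in seen:
        if tag in attrs:
            decls = '\n'.join('  %s CDATA #IMPLIED' % a for a in sorted(attrs[tag]))
            lines.append('<!ATTLIST %s\n%s>' % (tag, decls))
    return '\n'.join(lines)

print(generate_permissive_dtd('<bar><foo class="bat">Some text here</foo></bar>'))
```

For the document above this prints the `<!ELEMENT bar ANY>` / `<!ELEMENT foo ANY>` / `<!ATTLIST foo ...>` declarations shown in the comment. As noted later in the thread, this recovers element and attribute structure but cannot recover entity declarations from the real DTD.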

#21

The bug at issue is not one reported against a reading system. It is quite narrowly focused on epubcheck. Until such a time as DTD based validation is introduced as a requirement for reading systems, there is no problem with the current spec and low-powered, non-networked devices. There may be other bugs to report and if you see one, please open a new report for that specific bug. One simple solution would be to disable XML validation in epubcheck, since invalid XML documents (from a XML 1.1 spec perspective) can be conforming XHTML core document types.

#22

I'll be sure to let the people who will need to spend a few hundred grand on "cleanup" know that you're fully supportive of their efforts to do better.

We cannot abandon backwards-compatibility just because we find a bug in the spec.

According to the widely-held reading of the spec, all these files are invalid already. EPubCheck detects an error in them and they don't display the same in different viewers. People who authored such files came to us to help make the spec clearer, and we are telling them that their files just fall into this gray area which we cannot agree on ourselves. So we will keep the spec broken as it is, so they can keep spending hundreds of grand on creating such files, only to discover that everyone interprets them differently. Brilliant!

Errors are costlier than fixes.

I have a fully validating OPS parser running on my Blackberry. So "cannot implement it that way" appears to be incorrect. It sounds a lot more like "unwilling or unable to implement it that way."

If it validates and fetches external resources, it either does not implement the XML spec correctly or is vulnerable to an XXE attack. If you want to run software with a widely-known vulnerability, that's up to you, of course, but you cannot seriously suggest everyone do that just because our spec says so.

Now, normally when things are in quotes, they are quotations. Where did I say what you claim I said?

#23

Re: #20 - I see what you mean now, but what if the document looked like

<bar>&foo;</bar>

For such a file, unless foo is defined as the empty string, there is always going to be a difference between the online/fetching and offline/no-fetching cases. I do not think that is acceptable.

#24

The document in question has a doctype:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
  "http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg-flat.dtd">

So, that particular XHTML falls into the category of Out of Line XML islands. (Authored in a preferred vocabulary, but using extended modules.)

Would it be possible to limit the DTDs for preferred vocabularies, but leave external DTDs as an option for Out-of-Line XML islands?

(Edited because the DOCTYPE didn't show up)

#25

The fact that it has a doctype is irrelevant for being an out of line xml island. It's the MIME type (in the OPF) that matters. I haven't seen the epub in question, so I don't know what MIME type it used.

#26

Ah, OK, the OPF and OPS specs differ in their definitions of what an out-of-line XML island is; I'll add an issue.

#27

Can we stop bringing up XXE? It's a lark. By that logic, any system which includes items from any external resource is open to attack -- like, say, any page you visit on the Web that has ad links.

According to the widely-held reading of the spec all these files are invalid already.

That's an unsupportable and wholly subjective assertion.

We can make the spec clearer without crippling legacy documents based on our existing work. It seems to me that you just refuse to do so unless it means killing external resource references, which is something I simply cannot agree to do. It limits EPUB far too much for far too little gain.

#28

Here's the other problem I have with the whole "limit external DTD references" approach in general - it's removing a major feature from the spec. This is the maintenance working group; our charter is not to release v2 of EPUB, it's to clarify and correct errors. There is no error in what we're currently doing - it's wholly compliant to XML and EPUB. It could and should be clarified, yes. But removing a major feature seems too much like jumping the gun to the next revision of the standard.

And frankly, I wouldn't support killing legacy documents in a new revision of the standard, either.

#29

Can we stop bringing up XXE? It's a lark. By that logic, any system which includes items from any external resource is open to attack -- like, say, any page you visit on the Web that has ad links.

Certainly it would have been a security risk if browsers blindly resolved all external resources requested by pages, as a validating parser is required to do by the XML spec! Fortunately, they are very careful about what they download and what they block. Every browser I know of has addressed the XXE problem one way or another. Flash and Acrobat had to roll out fixes too. I personally know the guy who fixed it for the Microsoft libraries. For instance, compare Win IE's rendering of

http://www.sorotokin.com/peter/XXE.xml

and

http://peter.sorotokin.com/XXE.xml

Note that the file is exactly the same, but it renders differently. Many browsers simply went for a cheap solution and stopped resolving most external entities altogether ("removing a major feature" with only minor outcry from users). As I said, it is simply impossible to resolve all external DTDs and entities and be secure at the same time. The XML spec rules in this area are inherently insecure. That's why modern browsers don't follow these rules and at best resolve external entities selectively.

Anyway, it is clear that we are not converging, so it will have to be resolved by a wider group during one of the calls.

#30

Section 4.2.2 of the XML spec says precisely nothing about "blindly resolving" external entity references. It just says you have to resolve them. There is nothing to say you can't check the content via some sort of incoming filter to ensure it's safe...in much the same way browsers test various kinds of content today. And despite all the checking they do, certain sites are still referred to as malware sites by Google's search engine. Can it be that no external resource is 100% safe?

In summary:

I don't view this, or anything else you've presented, as a compelling reason to disenfranchise a large portion of our constituency for the sake of one portion of it. I have presented options to clarify the text; you've made it clear that nothing less than outlawing resolution of external entities will do. My solution is backwards-compatible and solves your problem, albeit inelegantly -- you've presented no solution that addresses my concerns about legacy documents.

As you've suggested, this will need to be discussed.

#31

Here is my summary:

The biggest original problem was stated in the description of the issue:

EPUB is supposed to be self-contained format and XML parsing is not possible without DTD in many cases. For instance in this particular case parser would not know which entities are declared in DTD.

My position is that the intent of the EPUB specs is to make EPUB a self-contained file that does not require external resources to render correctly, and thus arbitrary external (outside-of-package) DTDs and entities are not allowed - much like external images or stylesheets. However, I think this needs clarification, as it causes confusion (there are numerous cases in addition to this bug).

I do not see how Ben's proposal addresses this issue. "Faking" the DTD cannot get us the entities declared in the real DTD. Without DTD access, an XML file that uses such entities (e.g. HTML-style entities like &nbsp;) will render incorrectly.

Here are my two proposals. Proposal 1 matches my reading of the spec as it stands today. Proposal 2 (slightly modified from the original) ensures as much "backward compatibility" with potentially broken content as possible, without sacrificing the stability of EPUB rendering.

1. Restrict external DTDs and entities to a well-defined list (which we will provide in the spec). All other DTDs must be packaged together with the document. This makes the document self-contained and ensures that its meaning does not change with time.

2. All XML documents that reference DTDs and entities outside of a well-defined list must be standalone (as defined by the XML spec). This ensures that validating and non-validating parsers produce the same result on these documents. This way XML can be validated against a DTD if required, but a non-validating parser is guaranteed to produce the same result, so access to external DTDs is not required for correct rendering. Note that any non-standalone XML document can be algorithmically converted to an equivalent standalone one, so we do not lose the ability to package arbitrary XML in EPUB. While non-standalone XML documents are not allowed in conforming publications, Reading Systems still MUST process them.

#32

RE #31:

Of course, option 1 doesn't disallow the approach in option 2. That is, if you don't want to include a DTD in the manifest, you could just make the document standalone. So both of these could be (and essentially currently are) allowed.

#33

I have asked Ben, Peter, Garth, and Brady to join me on a conference call this Tuesday, September 1, 2009. My goal is to come to some common understanding regarding this controversial issue. I want to keep this to a small group and have therefore invited the people who I believe must be on the call. However, I want to be transparent, and if somebody feels they must join the call, they should contact me directly.

George Kerscher, Chair

#34

During the mini-conference call, the following text was proposed and agreed to in principle:

Schemas referenced by content documents must be included in the manifest including, but not limited to, external DTD references. As an exception to that rule, certain schemas of core document types do not need to be included in the manifest. Reading Systems may identify these schemas by the public identifier of the document or via the MIME type specified in the OPF. These mappings are listed below.

This text would presumably go in the manifest section of the package spec.

We still need to generate a list of schemas that can be assumed. Presumably our RELAX NG schema for XHTML content, the DTBook DTD, the SVG DTD, and the OPF DTD. Any others? Do we allow references to XHTML DTDs without including them in the manifest? We don't use those DTDs for validation in our spec, but I would bet a lot of content would suddenly become non-conformant if we didn't allow them.

#35

Assigned to: Anonymous » BDuga

#36

The proposed language with a list of schemas is:

Schemas referenced by content documents must be included in the manifest including, but not limited to, external DTD references. As an exception to that rule, certain schemas of core document types do not need to be included in the manifest. Reading Systems may identify these schemas by the public identifier in the DOCTYPE of the document or via the MIME type of the content document specified in the OPF. The core schemas are those for OPS 2.0 (as defined in the OPS 2.0 specification at http://www.idpf.org/2007/ops/OPS_2.0_final_spec.html#AppendixB ), the SVG 1.1 RELAX NG schema ( http://www.w3.org/Graphics/SVG/1.1/rng/svg11.rng ), the XHTML 1.1 DTD ( http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd ), the OPF 2.0 RELAX NG schema ( as defined in the OPF 2.0 specification at http://www.idpf.org/2007/opf/OPF_2.0_final_spec.html#AppendixA ) and the DTBook DTD ( http://www.daisy.org/z3986/2005/dtbook-2005-2.dtd ).

Please contribute to this thread with any comments.

#37

I find this language somewhat confusing. Do you call external DTDs "schemas"? Also, in many cases the schema(s) of an XML dialect are identified (but not referenced) through an XML namespace - but we do not require those to be included in the package, since namespace URIs are not supposed to be resolved and are not needed for correct document parsing.

I think we should go back to the narrower issue of external resources that an XML parser would need to parse the document, namely external entities and DTDs. I would like all URLs that can be used without including them in the package to be explicitly listed.

#38

Yes, I am referring to DTDs using the generic term "XML schema" (not to be confused with "XML Schema").

As for identifying the dialect through namespaces - OK. I don't say you should or shouldn't do that. I list two ways a Reading System could identify a schema (DTD or otherwise) for a document; there may very well be other ways. If the schema you find is one of those listed, you can't complain if it is missing from the package. For content creators: if you reference a schema (probably a DTD via DOCTYPE) and it is one of the ones listed, you don't have to include it in the package.

I have included URLs for all the types that have them. However, neither OPF nor OPS have a specific URL for their RELAX NG schemas.

#39

I think we should be as explicit as possible. I would just give the list of DTD references that are allowed (not RNGs - why would these ever need to be referenced?) and say that external entities not included in the package are simply disallowed.

#40

I'm not sure I understand the last comment. There is a list of DTD references. And it seems pretty clear that if you use another DTD (not one listed) it has to be in the package manifest. Is that language not specific enough? Do you have any specific changes to make that would clarify it? As for RNG, those are included for completeness' sake. If you had a Reading System that understood a vocabulary that could reference an RNG schema, then you would not need to add, say, the SVG RNG schema to the manifest. Adding ours is perhaps silly, and I could see removing them, since there is no URI to reference them by. It just felt odd not to include our schemas in a list of core types. Most Reading Systems will never have to worry about the references to RNG anyway.

#41

I would like to talk explicitly about DTDs, not schemas. DTDs are part of the XML standard, and resolving them is important for XML parsing. Schemas are add-ons, and an XML parser does not have to know about them while parsing (schemas cannot define entities or put implicit attributes on elements). When it comes to validation, schemas and DTDs are similar, but from an XML parsing standpoint they are very different.
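A small stdlib sketch of that distinction (the element names below are invented for illustration): a DTD - here an internal subset, so the example is self-contained, but an external DTD can do the same - declares an entity and a default attribute value, and the parser must apply both. No RELAX NG or XML Schema can change the parsed infoset this way.

```python
import xml.etree.ElementTree as ET

# A DTD can declare entities and default attribute values that affect
# what the parser produces; a schema cannot. Internal subset used here
# only to keep the example self-contained.
doc = """<?xml version="1.0"?>
<!DOCTYPE root [
  <!ELEMENT root (item)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item kind CDATA "widget">
  <!ENTITY co "Example Corp">
]>
<root><item>&co;</item></root>"""

item = ET.fromstring(doc).find("item")
print(item.text)         # "Example Corp" - entity expanded by the parser
print(item.get("kind"))  # "widget" - default attribute supplied by the DTD
```

If those declarations lived in an external DTD that the parser could not fetch, the entity expansion and the defaulted attribute would both be lost, which is why external DTDs matter for parsing and schemas do not.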

#42

Another problem is that the SVG DTD should be perfectly OK to reference.

#43

OK, we can add the SVG DTD. I am still not sure what you don't like about the proposed resolution. Does it not require enough to be included? Or does it require too much? Is it not explicit enough? It seems like anything that is referenced must be in the manifest, with a limited set of exceptions. Do you disagree with the general principle? Or with the chosen subset? If something does reference, say, the SVG RNG schema (using some unspecified mechanism), should we require that it be in the manifest? Or does a non-DTD schema never need to be included in the manifest?

#44

Here is specifically what I do not like:

1. "Schemas referenced by content documents..." - only DTDs are referenced by content documents, so why are we talking about schemas at all? (Even if we consider DTDs to be a subclass of schemas, which is not obvious to many people.) I would like to limit this to DTDs, because only DTDs are necessary for XML parsing (since only the DTD is defined in the XML spec); at the very least the focus should be on DTDs. I do not want to spend tons of time educating people about schemas and DTDs; I need clear spec language.
2. There is no well-defined way to reference RNGs or XML Schemas, and these schemas are not assigned well-defined URIs in most cases, so including them is misleading.
3. If a schema is referenced as a schema somehow (e.g. by an out-of-line custom XML island), then unlike a DTD it is not going to be "source"-referenced (i.e. referenced in a way necessary for parsing and displaying the content) but "hyperlink"-referenced (i.e. referenced as optional information), so there is no problem in it not being part of the package.
4. One can reference schema files as arbitrary XML content in a useless yet legal way (e.g. in the manifest, via a CSS stylesheet, or via the SVG tref element). Allowing certain external schemas to be referenced at all would mean that these references must be supported as well, or we would have to engage in a complex definition of what we mean by schema references.

#45

Here is how I would rewrite it:

DTDs and external entities (including, but not limited to, external DTD references) referenced by content documents must be included in the manifest. As an exception to that rule, certain DTDs of core document types do not need to be included in the manifest. Reading Systems may identify these DTDs by the public identifier in the DOCTYPE of the document. Here is the list of DTDs that do not have to be included in the manifest:
- SVG 1.1 DTD: http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd
- XHTML 1.1 DTD: http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
- DTBook DTD: http://www.daisy.org/z3986/2005/dtbook-2005-2.dtd
- NCX DTD: http://www.daisy.org/z3986/2005/ncx-2005-1.dtd
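The exception list above can be sketched as a checker, assuming a Reading System matches on the DOCTYPE public identifier as the resolution suggests. This is a non-normative illustration: the function and set names are invented, and the public identifiers shown are the conventional ones for these DTDs.

```python
import xml.parsers.expat

# Conventional public identifiers for the exempted DTDs (assumption:
# the Reading System matches on these rather than on the system URL).
EXEMPT_PUBLIC_IDS = {
    "-//W3C//DTD SVG 1.1//EN",
    "-//W3C//DTD XHTML 1.1//EN",
    "-//NISO//DTD dtbook 2005-2//EN",
    "-//NISO//DTD ncx 2005-1//EN",
}

def doctype_public_id(xml_text):
    """Return the public identifier from the DOCTYPE, or None."""
    seen = []
    def on_doctype(name, sysid, pubid, has_internal_subset):
        seen.append(pubid)
    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return seen[0] if seen else None

def dtd_must_be_in_manifest(xml_text):
    """True if the referenced DTD has to be listed in the manifest."""
    pubid = doctype_public_id(xml_text)
    return pubid is not None and pubid not in EXEMPT_PUBLIC_IDS

doc = """<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>t</title></head><body><p>ok</p></body>
</html>"""

print(doctype_public_id(doc))        # -//W3C//DTD XHTML 1.1//EN
print(dtd_must_be_in_manifest(doc))  # False: exempt, may be omitted
```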

#46

Status:open» proposed resolution

Updated with proposed resolution. Please review and submit comments.

#47

Status:proposed resolution» errata

No comments in 5 weeks. Moving to "errata".

#48

I see you omitted the XHTML 1.1 plus MathML 2.0 plus SVG 1.1 DTD from the list of allowed DTDs, which was specifically the DOCTYPE the user reporting the issue was using.

If I am not mistaken, the EPUB standard allows inline SVG code inside XHTML documents (does it?). But is it actually valid to use inline SVG in XHTML that has just an XHTML 1.1 DOCTYPE? My impression is that it is not. The W3C Validator does not validate such documents. Specifically, the following, with only an XHTML 1.1 DOCTYPE, does not validate:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:svg="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
	<head>
		<title></title>
	</head>
	<body>
		<svg:svg version="1.1">
			<svg:rect x="0" y="0" width="10" height="10"/>
		</svg:svg>
	</body>
</html>

While the following, with XHTML 1.1 plus MathML 2.0 plus SVG 1.1 DOCTYPE, does:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN" "http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:svg="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
	<head>
		<title></title>
	</head>
	<body>
		<svg:svg version="1.1">
			<svg:rect x="0" y="0" width="10" height="10"/>
		</svg:svg>
	</body>
</html>

If referencing the XHTML 1.1 plus MathML 2.0 plus SVG 1.1 DTD is the only valid way to use inline SVG inside XHTML (correct me if I am wrong), and such use is specifically covered in the Open Publication Structure 2.0 spec, would it not be logical to include that DTD in the list of core-document-type DTDs that do not need to be included in the manifest?

#49

It is not necessary to use a DTD to have a valid EPUB file. In general, multi-namespace documents should not use DTDs at all. If a DTD is important for some reason, it should be included in the EPUB.

#50

Please excuse me if I ask you to be explicit: can an EPUB whose content is XHTML with inline SVG then be valid even if the W3C Validator says those documents are not valid? The W3C Validator does mandate that a DOCTYPE be present specifying a DTD (and for XHTML + inline SVG, only one in particular seems to be valid).

My technical knowledge of the whole affair is slim. That is why I use validators.