It’s been several weeks since I last wrote in this series of entries examining what is in an OpenDocument Format file via some Python programming. In that last entry, I examined what was in the manifest file in the ODF zipfile. In a subsequent short entry, I noted that Rob Weir mentioned in his blog that MathML markup appears as separate entries in the manifest and the zipfile.
It’s time to look at the actual XML document content. Here it is in all its glory, with some formatting to make it easier to read. The line numbers at the beginning of the line are for reference: they are not part of the actual XML file.
content.xml
01 <?xml version="1.0" encoding="UTF-8"?>
02
03 <office:document-content
04     xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
05     xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
06     xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
07     xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
08     xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
09     xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
10     xmlns:xlink="http://www.w3.org/1999/xlink"
11     xmlns:dc="http://purl.org/dc/elements/1.1/"
12     xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
13     xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
14     xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0"
15     xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"
16     xmlns:math="http://www.w3.org/1998/Math/MathML"
17     xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0"
18     xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0"
19     xmlns:dom="http://www.w3.org/2001/xml-events"
20     xmlns:xform="http://www.w3.org/2002/xforms"
21     xmlns:xsd="http://www.w3.org/2001/XMLSchema"
22     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
23     office:version="1.0">
24
25   <office:scripts/>
26
27   <office:font-face-decls>
28     <style:font-face
29         style:name="Tahoma1"
30         svg:font-family="Tahoma"/>
31     <style:font-face
32         style:name="Arial"
33         svg:font-family="Arial"
34         style:font-pitch="variable"/>
35     <style:font-face
36         style:name="Tahoma"
37         svg:font-family="Tahoma"
38         style:font-pitch="variable"/>
39     <style:font-face
40         style:name="Times New Roman"
41         svg:font-family="'Times New Roman'"
42         style:font-family-generic="roman"
43         style:font-pitch="variable"/>
44   </office:font-face-decls>
45
46   <office:automatic-styles/>
47
48   <office:body>
49     <office:text>
50       <text:sequence-decls>
51         <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
52         <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
53         <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
54         <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
55       </text:sequence-decls>
56
57       <text:p text:style-name="Standard">x</text:p>
58
59     </office:text>
60   </office:body>
61
62 </office:document-content>
To get to the punchline, line 57 is the single ‘x’ I put in the original file. The Python function I used to extract the XML is
def print_odf_content(odf_file):
    """Print the contents of the document content component in an ODF file."""
    import zipfile
    odf_zipfile = zipfile.ZipFile(odf_file, 'r')
    # read() returns bytes; content.xml declares UTF-8, so decode it for printing
    content = odf_zipfile.read('content.xml').decode('utf-8')
    print('\nThe contents of the document content for "' + odf_file + '" is ')
    print(content)
    odf_zipfile.close()
    return content
Now, XML-wise, this is pretty dumb, and I hand-formatted the XML above to make it more legible. What was actually emitted when I printed it was one long line of text with no newlines or carriage returns. There are several ways of processing XML in Python and I plan to get to them in a future entry. Right now I want to focus on what the XML is, not so much how we process it. Later on we’ll look at some Python code to manipulate the document itself. If you want to go deep with this now, start with Robin Cover’s “XML and Python” page.
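If you would rather not format that one long line by hand, the standard library can do a rough job of it. This is only a minimal sketch: the function name pretty_print_xml and the cut-down sample string are my own stand-ins, not part of the original program, and the indentation it produces will not match the hand-formatted listing above.

```python
import xml.dom.minidom


def pretty_print_xml(xml_bytes):
    """Return an indented, multi-line rendering of a one-line XML document."""
    dom = xml.dom.minidom.parseString(xml_bytes)
    return dom.toprettyxml(indent="  ")


# A cut-down, one-line stand-in for content.xml (illustrative only).
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<office:document-content '
    b'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    b'xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    b'<office:body><office:text>'
    b'<text:p text:style-name="Standard">x</text:p>'
    b'</office:text></office:body>'
    b'</office:document-content>'
)

pretty = pretty_print_xml(sample)
print(pretty)
```

In practice you would pass in the bytes read from content.xml in the zipfile instead of the sample.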
Let’s try to work our way outwards from the ‘x’ on line 57. This is part of the text:p element where text is a namespace defined on line 6. I discussed namespaces in Part 4. Namespaces give you the flexibility to re-use simple names like p in multiple contexts and have them possibly mean different things. In our case, p is for paragraph, just as it is in HTML. This is described in the OpenDocument v1.0 Specification on page 67, section 4.1.2.
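To make the namespace idea concrete, here is a small sketch of how a namespace-aware parser sees the paragraph element. It uses xml.etree.ElementTree from the standard library on a cut-down sample of my own making; the prefix text disappears and the element name is qualified by the full namespace URI.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# A cut-down stand-in for content.xml (illustrative only).
sample = (
    '<office:document-content xmlns:office="%s" xmlns:text="%s">'
    '<office:body><office:text>'
    '<text:p text:style-name="Standard">x</text:p>'
    '</office:text></office:body>'
    '</office:document-content>' % (OFFICE_NS, TEXT_NS)
)

root = ET.fromstring(sample)

# ElementTree spells a qualified name as "{namespace-uri}local-name".
paragraphs = root.findall(".//{%s}p" % TEXT_NS)
for p in paragraphs:
    style = p.get("{%s}style-name" % TEXT_NS)
    print(p.tag, style, p.text)
```

A p element from some other namespace would carry a different URI in its tag and so would not be matched by this search.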
There is one attribute on this element called text:style-name and it has the value Standard. We don’t know what this is and, strangely enough, it is mentioned nowhere else in this file. The spec in section 4.1.3 says that this is the name of a paragraph style, which I would have guessed. It is defined in styles.xml. I’m not going to show that whole file, but the relevant section is
<style:default-style style:family="paragraph">
  <style:paragraph-properties
      fo:hyphenation-remain-char-count="2"
      fo:hyphenation-push-char-count="2"
      fo:hyphenation-ladder-count="no-limit"
      style:text-autospace="ideograph-alpha"
      style:punctuation-wrap="hanging"
      style:line-break="strict"
      style:tab-stop-distance="0.4925in"
      style:writing-mode="page"/>
  <style:text-properties
      style:use-window-font-color="true"
      style:font-name="Times New Roman"
      fo:font-size="12pt"
      fo:language="en"
      fo:country="US"
      style:font-name-asian="Arial"
      style:font-size-asian="12pt"
      style:language-asian="none"
      style:country-asian="none"
      style:font-name-complex="Tahoma"
      style:font-size-complex="12pt"
      style:language-complex="none"
      style:country-complex="none"
      fo:hyphenate="false"/>
</style:default-style>
<style:style style:name="Standard" style:family="paragraph" style:class="text"/>
Most of this is devoted to defining the default style information for paragraphs. These settings will hold unless they are overridden by a named style. For example, the Times New Roman font will be used at a 12 point size unless later overridden by another style or by a user choosing to reformat that part of the file. The last line defines Standard to be a text paragraph style. Since it doesn’t say anything else, it picks up all the default properties. Since it is a named style, applications that support ODF can go in and change it.
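The “defaults hold unless a named style overrides them” rule can be sketched in a few lines. This is a simplification under my own assumptions: the function name effective_text_properties is hypothetical, the sample is a trimmed version of the styles fragment wrapped in an office:styles element, and the sketch ignores other inheritance mechanisms a full implementation would need to handle.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
STYLE = "{%s}" % STYLE_NS
FO = "{%s}" % FO_NS

# Trimmed stand-in for the relevant part of styles.xml (illustrative only).
sample = """
<office:styles xmlns:office="%s" xmlns:style="%s" xmlns:fo="%s">
  <style:default-style style:family="paragraph">
    <style:text-properties style:font-name="Times New Roman" fo:font-size="12pt"/>
  </style:default-style>
  <style:style style:name="Standard" style:family="paragraph" style:class="text"/>
</office:styles>
""" % (OFFICE_NS, STYLE_NS, FO_NS)


def effective_text_properties(styles_root, name, family="paragraph"):
    """Start from the family's defaults, then overlay the named style."""
    props = {}
    for default in styles_root.iter(STYLE + "default-style"):
        if default.get(STYLE + "family") == family:
            for tp in default.findall(STYLE + "text-properties"):
                props.update(tp.attrib)
    for style in styles_root.iter(STYLE + "style"):
        if (style.get(STYLE + "name") == name
                and style.get(STYLE + "family") == family):
            # Anything the named style sets replaces the default value.
            for tp in style.findall(STYLE + "text-properties"):
                props.update(tp.attrib)
    return props


root = ET.fromstring(sample)
props = effective_text_properties(root, "Standard")
print(props)
```

Because Standard sets no properties of its own here, the result is exactly the default font name and size.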
Section 2.7 on page 52 of the specification defines the conventions for styles. In particular, it explains what an automatic style is. If I had taken my ‘x’ and changed the color to yellow from the default black, then an automatic style would have been generated for that single instance change. It lends consistency to how text (the ‘x’) is kept separate from the way it is to appear (that is, in yellow). On line 46 we have <office:automatic-styles/> which indicates that there are no automatic styles in this document.
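A quick way to check this programmatically is to count the children of the automatic-styles element. In this sketch the function name count_automatic_styles, the style name T1, and the yellow color value are my own illustrative choices; I have not shown what an application would actually write for the yellow ‘x’.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"

WRAPPER = (
    '<office:document-content xmlns:office="%s" xmlns:style="%s" xmlns:fo="%s">'
    '%%s</office:document-content>' % (OFFICE_NS, STYLE_NS, FO_NS)
)

# Our document: an empty automatic-styles element.
plain = WRAPPER % '<office:automatic-styles/>'

# Hypothetical document with one automatic style (names are illustrative).
colored = WRAPPER % (
    '<office:automatic-styles>'
    '<style:style style:name="T1" style:family="text">'
    '<style:text-properties fo:color="#ffff00"/>'
    '</style:style>'
    '</office:automatic-styles>'
)


def count_automatic_styles(xml_text):
    """Return how many child elements office:automatic-styles contains."""
    root = ET.fromstring(xml_text)
    auto = root.find("{%s}automatic-styles" % OFFICE_NS)
    return 0 if auto is None else len(auto)


print(count_automatic_styles(plain), count_automatic_styles(colored))
```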
It’s a very bad idea to make the same kind of formatting changes to several parts of a document without defining an explicit style and then applying that. For example, let’s say that I change text in 100 places so that it is in red, bold, italic. If I decide that I really want the text to be green, I have to make 100 changes. If I instead had defined a style and then applied that in the 100 instances, a simple color change within the style would have affected all relevant parts of the text at the same time. ODF uses automatic styles for internal consistency of separating formatting from text, but you should get in the habit of creating and applying styles yourself.
Moving upward to line 27, we have the start of the font-face-decls element. In section 2.6 on page 51 it says “A font face declaration provides information about the fonts used by the author of a document, so that these fonts or fonts that are very close to these fonts may be located on other systems.” Remember when I mentioned Times New Roman? That is a font from Microsoft Windows. The font-face-decls element tells me some information that will help me map the font to something else, perhaps on other systems like a Linux desktop or a mobile phone. Lines 39 through 43 essentially say that it is a font in the generic roman family and has variable pitch (versus a monospace font like Courier where all characters take up the same width).
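As a sketch of how an application might gather this information before choosing a substitute font, the following pulls each font face declaration into a plain dictionary. The dictionary keys are my own naming, and the sample is just the font-face-decls element from the listing wrapped so it parses on its own.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
SVG_NS = "urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
STYLE = "{%s}" % STYLE_NS
SVG = "{%s}" % SVG_NS

sample = """
<office:font-face-decls xmlns:office="%s" xmlns:style="%s" xmlns:svg="%s">
  <style:font-face style:name="Tahoma1" svg:font-family="Tahoma"/>
  <style:font-face style:name="Arial" svg:font-family="Arial"
      style:font-pitch="variable"/>
  <style:font-face style:name="Tahoma" svg:font-family="Tahoma"
      style:font-pitch="variable"/>
  <style:font-face style:name="Times New Roman"
      svg:font-family="'Times New Roman'"
      style:font-family-generic="roman" style:font-pitch="variable"/>
</office:font-face-decls>
""" % (OFFICE_NS, STYLE_NS, SVG_NS)

root = ET.fromstring(sample)

fonts = []
for face in root.iter(STYLE + "font-face"):
    fonts.append({
        "name": face.get(STYLE + "name"),
        "family": face.get(SVG + "font-family"),
        # Missing attributes come back as None.
        "generic": face.get(STYLE + "font-family-generic"),
        "pitch": face.get(STYLE + "font-pitch"),
    })

for font in fonts:
    print(font)
```

Only the Times New Roman entry carries a generic family, which is the hint a system without that font could use to pick another roman face.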
There are no scripts or macros in this document, so we have the empty element <office:scripts/> on line 25.
At the very top of the document are listed the namespaces that are used. Remember from last time that these are used to remove ambiguity from simple names. Many of these are defined for ODF itself to apply to elements and attributes internal to the OpenDocument structure. All the namespaces are described in Section 1.3, page 31, of the OpenDocument v1.0 Specification. Some important external ones to note are
dc for Dublin Core Metadata.
math for the W3C Mathematical Markup Language.
svg for the W3C Scalable Vector Graphics.
xform for the W3C XForms.
xlink for the W3C XML Linking Language.
This demonstrates good reuse of existing work and standards rather than unnecessary reinvention.
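The namespace declarations can be listed without reading them by eye. This sketch uses the start-ns events of xml.etree.ElementTree.iterparse on a cut-down root element of my own; the “external” test (the URI does not begin with the OASIS OpenDocument prefix) is a rough heuristic I made up for illustration. Note that it does not flag svg, because the listing declares that prefix with an OASIS “svg-compatible” URI rather than a W3C one.

```python
import io
import xml.etree.ElementTree as ET

ODF_PREFIX = "urn:oasis:names:tc:opendocument:"

# Cut-down root element carrying a few of the declarations (illustrative only).
sample = (
    b'<office:document-content '
    b'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    b'xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" '
    b'xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" '
    b'xmlns:xlink="http://www.w3.org/1999/xlink" '
    b'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    b'xmlns:math="http://www.w3.org/1998/Math/MathML"/>'
)

# Each start-ns event delivers one (prefix, uri) declaration.
namespaces = {}
for event, (prefix, uri) in ET.iterparse(io.BytesIO(sample), events=("start-ns",)):
    namespaces[prefix] = uri

# Rough heuristic: anything not under the OpenDocument URN is "external".
external = sorted(p for p, uri in namespaces.items()
                  if not uri.startswith(ODF_PREFIX))

for prefix in sorted(namespaces):
    print(prefix, namespaces[prefix])
print("external:", external)
```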
By jumping around a bit, I’ve shown you some of what makes up a very simple ODF document. Much of this is preamble to the actual document text. Let’s return to this before we finish this section.
48   <office:body>
49     <office:text>
50       <text:sequence-decls>
51         <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
52         <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
53         <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
54         <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
55       </text:sequence-decls>
56
57       <text:p text:style-name="Standard">x</text:p>
58
59     </office:text>
60   </office:body>
Our document has a body consisting of text, as opposed to the other current options of drawing, presentation, spreadsheet, chart, or images. Before we get to the actual content, we declare some variables for numbering our illustrations, tables, text objects, and drawings. The declarations used are pretty minimal; more sophisticated documents will do other things with these and other variables. Finally, and again, we see our ‘x’.
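The same structure can be read off with a few lines of Python. This is a sketch on a cut-down copy of the body; the variable names are mine, and it assumes the first child of office:body identifies the kind of document, as it does in our listing.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE = "{%s}" % OFFICE_NS
TEXT = "{%s}" % TEXT_NS

sample = """
<office:document-content xmlns:office="%s" xmlns:text="%s">
  <office:body>
    <office:text>
      <text:sequence-decls>
        <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
      </text:sequence-decls>
      <text:p text:style-name="Standard">x</text:p>
    </office:text>
  </office:body>
</office:document-content>
""" % (OFFICE_NS, TEXT_NS)

root = ET.fromstring(sample)

# The child of office:body tells us what kind of document this is.
body = root.find(OFFICE + "body")
kind = body[0].tag.split("}", 1)[1]

# The declared sequence variables, in document order.
sequences = [decl.get(TEXT + "name") for decl in root.iter(TEXT + "sequence-decl")]

# The paragraph text itself.
paragraph_texts = [p.text for p in root.iter(TEXT + "p")]

print(kind, sequences, paragraph_texts)
```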
This should give you an idea of the basic structure of a simple ODF text document. Much of this will be repeated for fancier documents but we’ll also start to see variations. That’s what we’ll look at next time: what happens when our document takes on some non-trivial content. After that, we’ll return to using Python to pull out and process some of the interesting bits.