Last time we got deeper into our OpenDocument Format text file and started using the documentation for the ODF standard to understand what should be in the document. We learned that there can be some variations in the internal structure between different documents. The information is saved in a zip file structure and this gives us the textual content, the styles and formatting, information about the document such as who created it, some data we might want to save between editing sessions in an ODF application, and a manifest of what we have in hand. If we have multimedia elements such as JPEG files, these will be included as well.
At the end of Part 3 I ran some code to print out what I had in my simple document (it only contains an ‘x’) and the compressed and uncompressed sizes of the components. Here is that listing again:
bob@bob-laptop:~/py-odf$ python drodf3b.py x-wmc.odt
The size of the file "x-wmc.odt" is 5172
The components in the ODF zip file are:

Name                     Compressed   Uncompressed
                         Size         Size
mimetype                       39           39
content.xml                   573         2120
styles.xml                   1415         6312
meta.xml                     1259         1259
settings.xml                  995         6208
META-INF/manifest.xml         209          684
                         --------     --------
Total                        4490        16622
Attentive readers will note that the numbers are the same as last time but the Python command line is not. Aside from the small change to the name of the Python file containing the instructions (I used drodf3b.py here), the prompt shows that I am no longer in Windows-land. This output was produced under Ubuntu Linux 6.06, Dapper Drake. It should not be lost on you, I hope, that it is important to be able to process ODF files on different platforms. My choice of Python for this exercise also aids portability.
Let’s begin with a simple verification of something we learned in Part 3: the mimetype component is supposed to contain the text
application/vnd.oasis.opendocument.text
Here’s a little Python function that does what I want:
def print_odf_mime_type(odf_file) :
    """Print the contents of the mimetype component in an ODF file."""
    import zipfile
    odf_zipfile = zipfile.ZipFile(odf_file, 'r')
    mime_type = odf_zipfile.read('mimetype')
    print '\nThe contents of mimetype for "' + odf_file + '" is ' + mime_type
    odf_zipfile.close()
    return mime_type
There’s only one line, the fifth, that does anything significantly different from what I’ve done before and that is
mime_type = odf_zipfile.read('mimetype')
This simple command pulls out the content from the mimetype component and puts it in the Python string mime_type. After that I can do anything I want with it. I don’t need the zip file any more if I’m only going to need the MIME type. When I run this code I see
The contents of mimetype for "x-wmc.odt" is application/vnd.oasis.opendocument.text
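The function above uses the Python 2 print statement. For readers on Python 3, here is a rough equivalent, offered as a sketch rather than a drop-in replacement. The names read_odf_mime_type and make_sample_package are my own inventions for illustration, and the sample package is a tiny stand-in built in memory, not the real x-wmc.odt. One difference worth noting: under Python 3, read hands back bytes, so the sketch decodes them into a string.

```python
import io
import zipfile

ODF_TEXT_MIME = "application/vnd.oasis.opendocument.text"


def read_odf_mime_type(odf_file):
    """Return the contents of the mimetype component as a string.

    odf_file may be a file name or an open binary file-like object.
    """
    # The with-statement closes the zip file even if read() raises.
    with zipfile.ZipFile(odf_file, "r") as odf_zipfile:
        # read() returns bytes in Python 3; a MIME type is plain ASCII.
        return odf_zipfile.read("mimetype").decode("ascii")


def make_sample_package():
    """Build a minimal stand-in package in memory (hypothetical test data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", ODF_TEXT_MIME)
    buffer.seek(0)
    return buffer


mime_type = read_odf_mime_type(make_sample_package())
print('The contents of mimetype is ' + mime_type)
```

With a real document on disk, the call would simply be read_odf_mime_type("x-wmc.odt").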
That was easy! I must warn you that it gets a bit tougher from here on. We’re going to start looking at XML, though we won’t be using Python to do anything too sophisticated with it yet. I’m going to use some terms before defining them, but please bear with me.
XML is an open standard that was developed by the W3C. The XML home page there contains links to the working groups that are developing the various aspects of this important interchange language. It also contains links to resources such as tutorials.
The manifest contains a listing of what is in the package. I only need to modify print_odf_mime_type slightly to get print_odf_manifest:
def print_odf_manifest(odf_file) :
    """Print the contents of the manifest component in an ODF file."""
    import zipfile
    odf_zipfile = zipfile.ZipFile(odf_file, 'r')
    manifest = odf_zipfile.read('META-INF/manifest.xml')
    print '\nThe contents of the manifest for "' + odf_file + '" is '
    print manifest
    odf_zipfile.close()
    return manifest
This produces
The contents of the manifest for "x-wmc.odt" is
<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
<manifest:file-entry manifest:media-type="application/vnd.oasis.opendocument.text"
manifest:full-path="/"/>
<manifest:file-entry manifest:media-type="" manifest:full-path="Pictures/"/>
<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="styles.xml"/>
<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="meta.xml"/>
<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="settings.xml"/>
</manifest:manifest>
This is the first real XML we have seen and it’s worth taking some time to go through it.
The first line identifies this as an XML file and tells us what character encoding we’re using (UTF-8). It’s important to remember that XML was designed to be able to process all the world’s languages and not just English.
The next line
<manifest:manifest xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
is an XML start-tag (see below) and declares a namespace. The latest W3C standard is Namespaces in XML 1.1 (Second Edition), which describes the motivation for the standard as follows:
We envision applications of Extensible Markup Language (XML) where a single XML document may contain elements and attributes (here referred to as a “markup vocabulary”) that are defined for and used by multiple software modules. One motivation for this is modularity: if such a markup vocabulary exists which is well-understood and for which there is useful software available, it is better to re-use this markup rather than re-invent it.
Such documents, containing multiple markup vocabularies, pose problems of recognition and collision. Software modules need to be able to recognize the elements and attributes which they are designed to process, even in the face of “collisions” occurring when markup intended for some other software package uses the same element name or attribute name.
These considerations require that document constructs should have names constructed so as to avoid clashes between names from different markup vocabularies. This specification describes a mechanism, XML namespaces, which accomplishes this by assigning expanded names to elements and attributes.
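In the XML for the manifest, manifest:manifest and manifest:file-entry are qualified element names, and manifest:media-type and manifest:full-path are qualified attribute names. Strictly speaking, the expanded name that the specification talks about is the pairing of the namespace URI with the local part, so manifest:file-entry expands to the URI urn:oasis:names:tc:opendocument:xmlns:manifest:1.0 together with file-entry.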
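To see what a namespace-aware parser makes of these prefixed names, here is a small sketch using xml.etree.ElementTree from the Python standard library (Python 3). The manifest is shortened to a single entry to keep things brief. The parser swaps the manifest: prefix for the full namespace URI, which ElementTree writes in its own {uri}local-name notation.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:manifest declaration in the listing.
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"

# A shortened copy of the manifest, as bytes like those zipfile's read() returns.
manifest_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
</manifest:manifest>"""

root = ET.fromstring(manifest_xml)
entry = root[0]  # the single file-entry child

# The prefix is gone; each name now carries the namespace URI itself.
print(root.tag)
print(entry.tag)
print(sorted(entry.attrib))

# Attributes are looked up by the same {uri}local-name form.
print(entry.get("{%s}full-path" % MANIFEST_NS))
```

The prefix itself is only shorthand. A document that bound a different prefix to the same URI would yield exactly the same names after parsing.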
Let me illustrate the need for namespaces with an example. Suppose we had two XML fragments
<address>
<street>123 Main Street</street>
<city>Metropolis</city>
<state>New York</state>
<zipcode>15123</zipcode>
</address>
and
<patientinfo>
<name>Stephen O'Brien</name>
<state>conscious</state>
</patientinfo>
If these were both present in the same XML document and we encountered state, how would we know which one we meant? What if we had this instead?
<patientinfo>
<name>Stephen O'Brien</name>
<address>
<street>123 Main Street</street>
<city>Metropolis</city>
<state>New York</state>
<zipcode>15123</zipcode>
</address>
<state>conscious</state>
</patientinfo>
Wouldn’t it be easier to figure this out if we saw address:state or patientinfo:state? Namespaces get rid of any chance of confusion and it is simply good practice to use them when designing XML vocabularies such as ODF.
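Here is a hedged sketch of how the combined record might look with namespaces applied, and how a program could then ask for each state without ambiguity. The two namespace URIs and the pat and addr prefixes are made up for this example; they do not come from any real vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented purely for illustration.
NAMESPACES = {
    "pat": "http://example.com/ns/patient",
    "addr": "http://example.com/ns/address",
}

record = """
<pat:patientinfo xmlns:pat="http://example.com/ns/patient"
                 xmlns:addr="http://example.com/ns/address">
  <pat:name>Stephen O'Brien</pat:name>
  <addr:address>
    <addr:street>123 Main Street</addr:street>
    <addr:city>Metropolis</addr:city>
    <addr:state>New York</addr:state>
    <addr:zipcode>15123</addr:zipcode>
  </addr:address>
  <pat:state>conscious</pat:state>
</pat:patientinfo>
"""

root = ET.fromstring(record)

# Each lookup names the vocabulary it wants, so the two states cannot collide.
home_state = root.find("addr:address/addr:state", NAMESPACES).text
condition = root.find("pat:state", NAMESPACES).text

print(home_state)
print(condition)
```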
For expository purposes, I’m going to strip away the namespace information so we can more easily see and discuss the XML. Note that if I were writing a program to process the XML, the namespace data would be extremely important but also very straightforward to handle.
<manifest>
<file-entry media-type="application/vnd.oasis.opendocument.text" full-path="/"/>
<file-entry media-type="" full-path="Pictures/"/>
<file-entry media-type="text/xml" full-path="content.xml"/>
<file-entry media-type="text/xml" full-path="styles.xml"/>
<file-entry media-type="text/xml" full-path="meta.xml"/>
<file-entry media-type="text/xml" full-path="settings.xml"/>
</manifest>
In Part 1 I spoke about how XML can look rather verbose but can be compressed very well. You can see at a glance how much smaller the manifest is once the namespace text has been removed. From our size table above, the manifest information starts out at 684 bytes but goes down to 209 once compressed. The decompression was completely transparent to us when the function print_odf_manifest executed the line
manifest = odf_zipfile.read('META-INF/manifest.xml')
That is, read did a lot of work and we could focus more on what we wanted to do rather than the low level details of how to do it. If someone comes along with a well designed Python library for processing ODF, what I’m showing you here will seem rather low level as well.
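To make that hidden work visible, here is a small Python 3 sketch that compresses some repetitive markup into an in-memory zip file and then reads it back. The markup is a made-up stand-in for a manifest, so the sizes will not match the 684 and 209 in the table. The compress_size and file_size attributes come from the ZipInfo object that zipfile keeps for each component.

```python
import io
import zipfile

# Hypothetical stand-in for a manifest: repetitive markup, much like real XML.
original = b'<manifest:file-entry manifest:media-type="text/xml"/>\n' * 12

# Write the component with deflate compression into an in-memory zip file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
    package.writestr("META-INF/manifest.xml", original)

# Read it back. read() decompresses without being asked.
buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as package:
    info = package.getinfo("META-INF/manifest.xml")
    restored = package.read("META-INF/manifest.xml")

print("compressed:", info.compress_size, "uncompressed:", info.file_size)
print("round trip intact:", restored == original)
```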
I want to focus first on one line in this elided manifest file:
<file-entry media-type="text/xml" full-path="content.xml"/>
This is a file-entry element. It has two attributes, called media-type and full-path. Each attribute name is followed by an equals sign (‘=’) and the value of the attribute in quotes. The media-type attribute value is a MIME type and tells us that the content of that component is XML. We can find this particular component in content.xml, according to the value of full-path.
The names for these components are not random and are described in the OASIS OpenDocument Format v1.0 specification that we looked at in Part 3.
There is more than one file-entry element. The very first one gives the MIME type for the overall package. I described styles.xml, meta.xml, and settings.xml in Part 3. I don’t have any images in my simple document, so although the manifest contains a holder for a Pictures subdirectory, there is nothing in there. That’s why no images showed up in the size table at the very beginning of this section.
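Since the manifest is a listing of what is in the package, one natural check is to compare it against what the zip file actually holds. The following Python 3 sketch does that under some assumptions: the function names are mine, the sample package is built in memory with placeholder media types and contents, and entries whose path ends in a slash (the package itself and the Pictures holder) are treated as non-files and skipped.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
FILE_ENTRY = "{%s}file-entry" % MANIFEST_NS
FULL_PATH = "{%s}full-path" % MANIFEST_NS


def manifest_paths(package):
    """Return the full-path values listed in the package's manifest."""
    root = ET.fromstring(package.read("META-INF/manifest.xml"))
    return [entry.get(FULL_PATH) for entry in root.iter(FILE_ENTRY)]


def missing_components(package):
    """List manifest paths that name a file absent from the zip."""
    present = set(package.namelist())
    # Paths ending in "/" describe the package or a directory, not a file.
    return [path for path in manifest_paths(package)
            if not path.endswith("/") and path not in present]


def make_sample_package(paths):
    """Build a hypothetical package whose manifest lists the given paths."""
    entries = "".join(
        # Placeholder media type; only the paths matter for this check.
        '<manifest:file-entry manifest:media-type="text/xml" '
        'manifest:full-path="%s"/>' % path
        for path in paths)
    manifest = ('<manifest:manifest xmlns:manifest="%s">%s</manifest:manifest>'
                % (MANIFEST_NS, entries))
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("META-INF/manifest.xml", manifest)
        package.writestr("content.xml", "<placeholder/>")  # the only real file
    buffer.seek(0)
    return zipfile.ZipFile(buffer, "r")


sample = make_sample_package(["/", "Pictures/", "content.xml", "styles.xml"])
print(manifest_paths(sample))
print(missing_components(sample))
```

In the sample, styles.xml is listed in the manifest but was never written to the zip, so it is the one path reported as missing.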
The file-entry elements are empty because they do not have any content, just attributes. Elements without content are terminated with />, as these are.
The manifest element does have content, namely all the file-entry elements. The content is all contained between the start-tag <manifest> and end-tag </manifest>.
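The difference between an element with content and an empty one is easy to check in code. This Python 3 sketch parses the namespace-free manifest from above with xml.etree.ElementTree and looks at what each element contains. It is meant only as an illustration of the vocabulary, since a real program would work with the namespaced form.

```python
import xml.etree.ElementTree as ET

stripped_manifest = """<manifest>
 <file-entry media-type="application/vnd.oasis.opendocument.text" full-path="/"/>
 <file-entry media-type="" full-path="Pictures/"/>
 <file-entry media-type="text/xml" full-path="content.xml"/>
 <file-entry media-type="text/xml" full-path="styles.xml"/>
 <file-entry media-type="text/xml" full-path="meta.xml"/>
 <file-entry media-type="text/xml" full-path="settings.xml"/>
</manifest>"""

manifest = ET.fromstring(stripped_manifest)

# The content of the manifest element is its child elements.
print(manifest.tag, "has", len(manifest), "child elements")

for entry in manifest:
    # An empty element has no child elements and no text, only attributes.
    is_empty = len(entry) == 0 and entry.text is None
    print(entry.tag, entry.attrib, "empty:", is_empty)
```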
An XML document carries all its information in the element content and the attribute values, as well as in the actual element and attribute names. Together with namespaces, we have a very powerful way of organizing information so that it can be exchanged and archived. With this flexibility comes a lot of choice about how best to design the model for this information.
This can occasionally lead to controversy and arguments about the best way to do something. For now we’re going to take ODF as is and just handle whatever it happens to be. If you decide to participate in the OASIS OpenDocument Format Technical Committee, you will have the chance to further develop the XML format for ODF itself.
We’re still searching for our elusive ‘x’, the only significant content to speak of in our document. I’m going to foreshadow Part 5 a little bit and let you know that it will be expressed as
<text:p text:style-name="Standard">x</text:p>
in content.xml. At this point you know enough to identify the namespace, the start-tag, the end-tag, the attribute name, the attribute value, and the content (x!). Expect to see more of these.
The content.xml component is 2120 bytes long in uncompressed form, so there will be a lot of other stuff we’ll have to wade through. Remember that we can deal with very general documents in ODF, so some of the extra material will be overhead for things we may or may not use.
Examining the namespaces in content.xml will be a good demonstration of the range of industry open standards that ODF uses. Next time we’ll dive into this component and spend as much time as we need to understand its contents, pun intended.
Also see: “Regarding use of the Python code in my blog entries”


arrrg. this odf thing is getting annoying; I feel like I’m back in kindergarten with all the really simple things being spelled out so slowly and the interesting stuff being glossed over. :( I still feel like something really interesting is about to show up, but like many of my teachers’ lectures, I may miss it because I stop reading out of boredom. I’m just skimming the code parts now; those are quite verbose enough, for the most part.
I’m deliberately doing this slowly. What are the interesting bits that you think I am glossing over?
I disagree with the first post. The pace is excellent for those of us not deep into xml and odf.
I’m a fan of Bob’s LARGE-TYPE writing style (and not only because I read at the 8th grade level).
What I noticed leading the Marketing Project for OpenOffice was that the population of Newbies is always renewing itself. With ODF we’re presently somewhere on the left side of the Early-Adopter section of the Rogers (Everett) Diffusion of Innovations curve. That’s, what, 5% done and 95% left to go…if that?
Bob’s audience is the 95% — some of whom are internal at IBM, who even read above the 8th grade level ;-) — and the 95% together are by far the more important segment of the general population to ODF because they aren’t yet informed enough to act in their local circumstance. These are not people who particularly care about the arcana of file format standards consortia or the latest cool Python script.
If one is able to read Bob’s code, then one could use Google, too, and learn enough about ODF to drown Godzilla in the Sea of Japan even before the little plastic Mitsubishi Zeros come strafing in…that is, to be explicit, quickly.