In this series of entries, I’m looking at how we can use Python to do some basic examination and processing of the contents of an OpenDocument Format, or ODF, word processing file. The document I’m looking at is almost as simple as it can be because it just contains the letter ‘x’. Because ODF can handle much more sophisticated documents, the structure is quite general and so there will be some overhead. At the end of Part 2, I ran some Python code to see what was in my document and this is what I saw:
The size of the file "x-wmc.odt" is 5172

The components in the zip file are:

mimetype
content.xml
styles.xml
meta.xml
settings.xml
META-INF/manifest.xml
I guessed that our ‘x’ was in content.xml but I haven’t verified that yet. In this part we’re going to find out how big these different components are. We know that the sizes of the parts as stored in the file add up to less than 5172 bytes. However, since this is a zip file, these parts are compressed and will expand to something that might be much bigger.
As I’ve been developing this, I’ve been writing and rewriting some Python code. I’m just using what comes with the standard Python installation on Microsoft Windows and not using anything specific to ODF. While I’ve casually explained the Python code, it’s not my intent to give a full Python tutorial here. For that, you can either go to the Python web site or else buy a good book. For the latter, I recommend Mark Lutz and David Ascher’s Learning Python, 2nd Edition.
Before I do any more Python coding, we really should take a break and find out what these sections are. For that we’ll turn directly to the official OASIS OpenDocument Format v1.0 specification.
A MIME type is a way of telling software what kind of information is to be expected in what follows. MIME is short for “Multipurpose Internet Mail Extensions” and is well described at Wikipedia. The MIME type allows the receiving application to decide what to do with information, and this may mean starting a different application entirely.
For example, a browser can handle HTML directly but it might start an office application to process a spreadsheet that is being downloaded. The MIME type is the key to understanding how to direct the information traffic. MIME is an Internet standard from the IETF.
The OASIS OpenDocument Format v1.0 specification Appendix C lists all the MIME types used in ODF documents. Since we are working with a word processing document, we should expect to see
application/vnd.oasis.opendocument.text
in the mimetype component of the ODF file. We’ll verify this later.
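As a preview of that check, here is a minimal sketch. It is written in Python 3 syntax, whereas the listings in this series use Python 2's print statement. The function names read_mimetype and is_text_document are made up for illustration, and the only assumption about the file is that it has a component named mimetype, as the listing above shows.

```python
import zipfile

# The MIME type we expect for an ODF word processing document.
ODF_TEXT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def read_mimetype(odf_file):
    """Return the contents of the 'mimetype' component as text."""
    with zipfile.ZipFile(odf_file, "r") as odf_zipfile:
        # read() returns bytes; the MIME type itself is plain ASCII.
        return odf_zipfile.read("mimetype").decode("ascii").strip()


def is_text_document(odf_file):
    """True if the package declares itself an ODF text document."""
    return read_mimetype(odf_file) == ODF_TEXT_MIMETYPE
```

Calling is_text_document("x-wmc.odt") should then tell us whether the guess about the MIME type holds for the sample file.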
Let’s look at those XML files in the ODF file structure. Section 2.1 (page 37) of the OASIS OpenDocument Format v1.0 specification briefly describes what the first four of these are. I’ll quote that directly:
- content.xml: “Document content and automatic styles used in the content.”
- styles.xml: “Styles used in the document content and automatic styles used in the styles themselves.”
- meta.xml: “Document meta information, such as the author or the time of the last save action.”
- settings.xml: “Application-specific settings, such as the window size or printer information.”
That certainly helps. All the text that we put in the document along with structural information such as headings and tables will be in content.xml. There may also be some formatting information in there but we should expect to see most of it in styles.xml.
It is extremely important to separate the text from the way it appears on the screen. This makes it much easier to change the general formatting efficiently, for example, by changing the font and size of all top level headings simultaneously. It also makes it easier to customize the formatting for different purposes, such as printing or different devices.
The meta.xml section has information about the document (that’s what meta means), but this information usually isn’t visible to people when they are reading the document itself. The OASIS description mentions author and time of last save, but we could also include some keywords that might help find important documents during a search.
The settings.xml section contains yet other information but this is more related to the software process by which the document was created or edited. If I use the same word processor every time, I might like to see the document in the same size window with the same toolbars present each time I open the file.
If you have configured a particular printer for working with that document, those settings might also be stored here. These add to the convenience of the user but also must be very customizable. Different applications can save different settings, but there are also some common settings that all applications can use.
While we usually think of word processors as being the applications that handle text documents, there can be others and there will be many, many new ones in the future, in large part because of the growing popularity of ODF itself, in my opinion. For example, we might have a language translation application that translates an English language document into a German language document. That translation application might store information in settings.xml.
While there may be some application-to-application variation in content.xml and styles.xml, there will probably be more in meta.xml and a lot more in settings.xml. So while our working document is 5172 bytes from the IBM application, it might be more or less in, say, OpenOffice. In fact, it is a bit bigger in the test document I created. Therefore, we shouldn’t get too concerned about the size of the file as long as it stays within reasonable limits.
We have one more section of the ODF file to describe and that is META-INF/manifest.xml. Section 17 describes this as well as the general structure of the document. A manifest is a listing of everything that is shipped.
For example, if you bought a new bicycle that required assembly, the manifest would tell you how many of each part you were given. With luck, the manifest also mentions a set of included instructions! Another use of the word manifest is for the list of passengers on a commercial airline flight.
Section 17 tells us something that wasn’t obvious by just using Python to look at what was in our file: there can be other objects and information that are associated with the document in addition to the MIME type and XML files above. For example, if our document contains some images, then the manifest would tell us that they were in the package. We would have seen them when we did the listing shown at the top of this blog entry, but we wouldn’t have known too much about them.
The manifest can also contain other information about how parts of our ODF document are encrypted. It might also include data so we can determine if the document has gotten corrupted somehow. As it turns out, some of the sections of our document are optional. The manifest tells us what we have in hand.
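To make the idea of "what we have in hand" concrete, here is a rough sketch (Python 3 syntax) that lists the entries recorded in the manifest. The function name list_manifest_entries is hypothetical. The namespace URI and the file-entry, full-path and media-type names are my reading of the package chapter of the specification rather than something checked against the sample file, so treat them as assumptions until we open the manifest ourselves.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumed manifest namespace, taken from the ODF package description.
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def list_manifest_entries(odf_file):
    """Return (full_path, media_type) pairs from META-INF/manifest.xml."""
    with zipfile.ZipFile(odf_file, "r") as odf_zipfile:
        manifest_xml = odf_zipfile.read("META-INF/manifest.xml")

    root = ET.fromstring(manifest_xml)
    entries = []
    # ElementTree spells namespaced names as "{namespace}localname".
    for entry in root.iter("{%s}file-entry" % MANIFEST_NS):
        full_path = entry.get("{%s}full-path" % MANIFEST_NS)
        media_type = entry.get("{%s}media-type" % MANIFEST_NS)
        entries.append((full_path, media_type))
    return entries
```

Comparing this list with the names the zip file itself reports would be one way to notice a part that is present but not declared, or declared but missing.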
The existence of the manifest strongly suggests how we should process the contents of an ODF document. First, open the manifest to see what was given to us and then look at the MIME type to see what kind of document we have. Take a look at settings.xml to get any general or application-specific configuration information, and then get to work! Now clearly this is too general to be immediately practical, but it gives us a rough plan of attack. To fix ideas, what steps would you carry out if you were going to do the following tasks?
- Take an ODF document and extract any JPEG images that are in it.
- Take an ODF document and print a list of all misspelled words.
- Take an ODF document and throw away all formatting except the basic heading and paragraph structure, printing what’s left on the screen.
- Iterate over a directory of ODF files and print how many of them were authored by a given person.
- Take an ODF word processing document and convert as much of it as you can to HTML. Don’t forget the images and tables!
With only a little bit more Python programming, we could do the first task, pulling out the JPEG images. The other work will require more knowledge of the exact structure of the document.
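Here is one way that first task might look, as a sketch in Python 3 syntax. The function name extract_jpeg_images and the output_dir parameter are made up for illustration. It assumes JPEG images can be recognized by a .jpg or .jpeg file extension, which is a shortcut: the media type recorded in the manifest would be the more reliable test.

```python
import os
import zipfile


def extract_jpeg_images(odf_file, output_dir):
    """Copy every .jpg/.jpeg component of the package into output_dir.

    Returns the list of component names that were extracted.
    """
    os.makedirs(output_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(odf_file, "r") as odf_zipfile:
        for info in odf_zipfile.infolist():
            # Assumption: the file extension identifies a JPEG image.
            if info.filename.lower().endswith((".jpg", ".jpeg")):
                # Drop any folder prefix so the image lands directly in output_dir.
                target = os.path.join(output_dir, os.path.basename(info.filename))
                with open(target, "wb") as out:
                    out.write(odf_zipfile.read(info.filename))
                extracted.append(info.filename)
    return extracted
```

Because only the base name is kept, two images with the same name in different folders of the package would overwrite each other. That is acceptable for a sketch but worth handling in real use.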
I promised that we would look at how big the different parts of our ODF file are. I wrote some more code to do this and this is what it produced:
C:\py-odf>python drodf3b.py x-wmc.odt

The size of the file "x-wmc.odt" is 5172

The components in the ODF zip file are:

  Name                        Compressed Uncompressed
                                    Size         Size

  mimetype                            39           39
  content.xml                        573         2120
  styles.xml                        1415         6312
  meta.xml                          1259         1259
  settings.xml                       995         6208
  META-INF/manifest.xml              209          684
                                --------     --------
  Total                             4490        16622
I got a bit fancier with the formatting and, in fact, most of the new code seems to be to get things lined up nicely. There are some blank lines in there as well to make things more readable. Here’s the full code that does this:
import sys
import os

def print_file_size(file) :
    """Print and return the size of the file in bytes. Also display the file name."""
    file_statistics = os.stat(file)
    file_size = file_statistics.st_size
    print '\nThe size of the file "' + file + '" is ' + str(file_size)
    return file_size

def print_odf_zipfile_info(odf_file) :
    """Print the names of the zip components in an ODF file."""
    import zipfile
    odf_zipfile = zipfile.ZipFile(odf_file, 'r')
    print '\nThe components in the ODF zip file are:\n'
    total_compressed_size = 0
    total_uncompressed_size = 0
    print "  %-25s %12s %12s" % ('Name', 'Compressed', 'Uncompressed')
    print "  %-25s %12s %12s\n" % (' ', 'Size', 'Size')
    for info in odf_zipfile.infolist() :
        total_compressed_size += info.compress_size
        total_uncompressed_size += info.file_size
        print "  %-25s %12d %12d" % (info.filename, info.compress_size, info.file_size)
    print "  %-25s %12s %12s" % (' ', '--------', '--------')
    print "  %-25s %12d %12d" % ('Total', total_compressed_size, total_uncompressed_size)
    odf_zipfile.close()
    return total_compressed_size

def dr_odf() :
    """Examine the ODF file given on the command line and print information about it."""
    if len(sys.argv) > 1 :
        odf_file = sys.argv[1]
        print_file_size(odf_file)
        print_odf_zipfile_info(odf_file)
    else :
        print "Sorry, you forgot to give an ODF file name."

dr_odf()
Almost all the changes are in print_odf_zipfile_info. Here are some general comments:
- The print command displays everything after it on its own line, but wherever you put a ‘\n’ you get an extra line. This is what gives an extra blank line in the display. Alternatively, I could have used print " " in various places, but that is more verbose.
- The print command also can handle some fancy formatting (strictly speaking, the % operator builds the formatted string and print then displays it). You can interpret "  %-25s %12d %12d" as meaning: first skip two spaces, then left justify text (‘s’ is for string) in a space 25 characters long, then print one blank space, then print a number (‘d’ is for decimal) right justified in a space 12 characters long, then another blank, then another number right justified in a space 12 characters long. The Python documentation tells you how to do this and there are some alternatives to this kind of output processing.
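To see the format string at work on its own, here is a small sketch in Python 3 syntax, where print is a function. It formats one row using the content.xml figures from the output above and, as one of the alternatives, builds the same row with str.format. The variable names are mine.

```python
# Two leading spaces, a 25-wide left-justified name, then two 12-wide numbers.
row_format = "  %-25s %12d %12d"
line = row_format % ("content.xml", 573, 2120)
print(line)

# 2 + 25 + 1 + 12 + 1 + 12 = 53 characters in every row,
# which is why the columns line up.
print(len(line))

# The same layout using str.format instead of the % operator.
alternative = "  {:<25} {:>12d} {:>12d}".format("content.xml", 573, 2120)
print(alternative == line)
```

Every row comes out the same width regardless of how long the name or the numbers are, provided they fit within their fields.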
The real work being done here is in the section
for info in odf_zipfile.infolist() :
    total_compressed_size += info.compress_size
    total_uncompressed_size += info.file_size
    print "  %-25s %12d %12d" % (info.filename, info.compress_size, info.file_size)
In Part 2 I just looked at the names in the zip file structure. Here I’m using a more general object that contains much more information about each part. It is called … info. I’m going to use three bits of data from this: info.compress_size, the number of bytes in the compressed document part; info.file_size, the number of bytes in the document part if it were not compressed; and info.filename, the name of the part of the ODF file.
I’m totalling up all the sizes of the compressed parts in total_compressed_size and the uncompressed sizes in total_uncompressed_size. There is a fancy print command to make it look beautiful. At the very end of the function I print out the totals.
There is some overhead to using a zip file structure and that is what accounts for the 5172 – 4490 = 682 extra bytes. We’re not going to worry about this further, but we will descend into some of the other parts next time, starting with the manifest.
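The subtraction can be done by the program itself. The sketch below (Python 3 syntax, with a made-up function name zip_overhead) compares the size of the file on disk with the sum of the compressed sizes. The difference is the space the zip format spends on bookkeeping: a header in front of each part, plus a directory of all the parts at the end of the file.

```python
import os
import zipfile


def zip_overhead(odf_file):
    """Return (file_size, total_compressed, overhead) in bytes.

    odf_file must be a path to a file on disk, since we ask the
    operating system for its size.
    """
    file_size = os.path.getsize(odf_file)
    with zipfile.ZipFile(odf_file, "r") as odf_zipfile:
        total_compressed = sum(info.compress_size
                               for info in odf_zipfile.infolist())
    return file_size, total_compressed, file_size - total_compressed
```

For the sample document, the figures printed earlier mean this would return 5172, 4490 and 682.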
Also see: “Regarding use of the Python code in my blog entries”


I like the way you are working your way into the ODF document in a progressive-disclosure style, with upgrades to your Python code as you confirm more and more.
An interesting feature of the mimetype component of the Zip file is that it is not compressed. It is also in a fixed place in the Zip organization (I believe it is the very first file) so that it can be discovered by inspection of the Zip binary without examining or decompressing anything in the Zip. This is an interesting hack (using supplemental assumptions that are not part of the Zip specification) that allows the Zip file to have its MIME-type “magic number” be discoverable.
It is a bit like #! comments at the front of Unix files, but grafted into a binary carrier.
In May I met someone who had been at PKware when they decided to maintain an open specification for the Zip format. Although it is not an open standard under your criteria, it is great that it has some de facto stability and a generous custodian.
Part 4 is at:
http://www.sutor.com/newsite/blog-open/?p=999
PAE
You can follow this series of entries via the Dr-ODF category.