Dr. ODF: Examining OpenDocument Format with Python, Part 1

There’s been so much talk over the last 18 months about ODF, the OpenDocument Format, that I decided to take a look inside a file in that standard format. In particular, I wanted to take a very simple word processor document and see how it was saved on disk. What information is in there and how is it represented? How does this change when the document becomes more complicated?

I also wanted to play with some of the basics of how I could access this information without using any of the usual applications for processing ODF, namely OpenOffice, IBM Workplace Managed Client, and KOffice. Therefore I decided to try to write some simple Python code to help in the examination. The programs I mentioned are written in C++ and I’m not allowing myself to look at how they do anything. I am allowing myself to use Python libraries but only if they are general purpose ones. That is, I would be cheating if I used any Python ODF code I might find on the web.

My primary source of information will be the OASIS specification for the OpenDocument Format itself. I can also use my pre-existing knowledge of the format which is mainly:

  1. The format contains several parts to represent the document.
  2. XML is used to represent the document data.
  3. The content and structure of the document (for example, the words and paragraphs) are separate from the formatting information (for example, whether a word is bold and green and in a big font).
  4. The parts are zipped to keep them together as well as reduce the amount of space they take on disk.

In regard to the last point, XML files often have a lot of “air,” or whitespace, and repeated text in them. For example, consider this XML fragment that lists some of Bob Dylan’s albums:

<albums>
  <album>Another Side of Bob Dylan</album>
  <album>Blonde on Blonde</album>
  <album>Blood on the Tracks</album>
  <album>Bringing It All Back Home</album>
  <album>Desire</album>
  <album>Good As I Been to You</album>
  <album>Highway 61 Revisited</album>
</albums>

We see that the <album> tag is present 7 times, as is </album>. We could get rid of a lot of the blanks between the tags and not change the basic content in this example, but I chose not to in order to improve readability. I could get crafty and create a non-XML file that somehow reduced the redundancy in the representation, but I really don’t want to go to that extra trouble to do something custom here.

Under Windows, the above album data takes up 284 bytes. By using WinZip, a common commercial utility, I can reduce this to 260 bytes. In fact, a lot of that is overhead for the overall file structure. The actual data only takes up 144 bytes in the WinZip archive file.
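The size comparison above can be approximated with Python’s standard zipfile module. This is only a sketch under a few assumptions: the member name albums.xml is made up for illustration, the lines are joined with Windows-style CRLF line endings and no trailing newline (which is what gives 284 bytes), and the compressed numbers will likely differ somewhat from the WinZip figures because the compressor and its settings are not the same.

```python
import io
import zipfile

# The album fragment from above, one entry per line.
ALBUM_LINES = [
    "<albums>",
    "  <album>Another Side of Bob Dylan</album>",
    "  <album>Blonde on Blonde</album>",
    "  <album>Blood on the Tracks</album>",
    "  <album>Bringing It All Back Home</album>",
    "  <album>Desire</album>",
    "  <album>Good As I Been to You</album>",
    "  <album>Highway 61 Revisited</album>",
    "</albums>",
]


def measure(lines, member_name="albums.xml"):
    """Return (raw size, compressed data size, whole archive size) in bytes.

    member_name is a hypothetical file name chosen for this sketch.
    """
    # Assumption: CRLF line endings and no trailing newline, as on Windows.
    raw = "\r\n".join(lines).encode("ascii")

    # Build the zip archive in memory rather than on disk.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, raw)

    # Reopen the archive to read back the compressed size of the member.
    with zipfile.ZipFile(buffer) as archive:
        compressed = archive.getinfo(member_name).compress_size

    return len(raw), compressed, len(buffer.getvalue())


raw_size, compressed_size, archive_size = measure(ALBUM_LINES)
print("raw XML:         ", raw_size, "bytes")
print("compressed data: ", compressed_size, "bytes")
print("whole archive:   ", archive_size, "bytes")
```

The gap between the compressed data and the whole archive is the bookkeeping the zip structure needs, such as headers and the directory of members.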

The XML album data is self-describing: we can tell by looking at it what the information is. This would have been even more obvious if I included extra fields for the years in which the albums were recorded, the producers, the tracks on each album, and so forth. So XML is used to represent the information in a way we can understand when we want to process it later. Zipping makes that take up less space.
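To show what “self-describing” buys us in practice, here is a minimal sketch that pulls the album titles back out using xml.etree.ElementTree from the Python standard library. The function name album_titles is my own invention for this example.

```python
import xml.etree.ElementTree as ET

ALBUMS_XML = """<albums>
  <album>Another Side of Bob Dylan</album>
  <album>Blonde on Blonde</album>
  <album>Blood on the Tracks</album>
  <album>Bringing It All Back Home</album>
  <album>Desire</album>
  <album>Good As I Been to You</album>
  <album>Highway 61 Revisited</album>
</albums>"""


def album_titles(xml_text):
    """Return the text of every <album> element directly under the root."""
    root = ET.fromstring(xml_text)
    return [album.text for album in root.findall("album")]


titles = album_titles(ALBUMS_XML)
for title in titles:
    print(title)
```

The whitespace between the tags is ignored here because we only ask for the text inside each <album> element.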

Later on we’ll see some other facets of XML and also see what other things are saved in the zip file structure that ODF uses. We’ll also see how to do some useful things with these when using the Python programming language.

I’m going to start with what now sounds like a modest goal: take a very simple ODF word processing file and just list the components in the zip structure. I’m not even going to look at the ODF standard to do this. Once I see what’s in there at a high level, I’ll refer to the specification to understand what the pieces are for. Then I’ll figure out how to descend into them to learn something interesting.
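As a rough idea of how little code the listing step might need, here is a sketch using only the general-purpose zipfile module. It is not the analysis itself. The function name list_parts is hypothetical, and so that the example runs without a real ODF file it builds a small stand-in archive in memory whose member names are invented and say nothing about what an actual ODF file contains.

```python
import io
import zipfile


def list_parts(source):
    """Return (name, uncompressed size, compressed size) for each zip member.

    source may be a file name or a file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return [
            (info.filename, info.file_size, info.compress_size)
            for info in archive.infolist()
        ]


def make_stand_in():
    """Build a tiny in-memory zip with invented member names (not real ODF)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("first-part.xml", "<doc>x</doc>")
        archive.writestr("second-part.xml", "<style/>")
    buffer.seek(0)
    return buffer


parts = list_parts(make_stand_in())
for name, size, packed in parts:
    print(name, size, packed)
```

Passing the path of a real word processing file to list_parts in place of the stand-in should list that file’s components in the same way.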

Though I have coded in Python for a few small projects in the last few years, I am by no means an expert programmer in that language. Most of my experience is with languages like C++, Pascal, PHP, and Lisp. What I’m trying to say is that I’m going to be figuring out what I need to do pretty much right before I describe it to you. This means that I may later go back and talk about a better way of doing something I did earlier. That’s ok because programming is a process of successive iteration. Software should get better and more efficient as we understand the problem more profoundly and get a better feel for what the language and its libraries can do for us.

I’m certainly not going to try to build the most comprehensive library to handle ODF in Python. We need such a thing and we need an open source version so many people can quickly use and improve it. What I’m going to do here will be just enough to get an overall idea of what is happening inside an OpenDocument File. My secondary goal is to explain some of the underlying technology and how it works together.

Our basic file will have only one character in it, the letter ‘x’. If you create a plain text file and put an ‘x’ in it, that file takes up 1 byte on disk. Any guesses on how big the ODF file is?

As we saw above, zipping any information adds some overhead. We’re not going to be able to have ‘x’ take up only one byte. So if you guessed “greater than 1,” you were right but not terribly adventurous. The goal of this exercise is to understand what takes up whatever extra room there is. In IBM Workplace Managed Client Productivity Editors v2.6, the ODF file size is 5172 bytes. Clearly we have some other things in there and we need to find out what.

If you have a different version of the software or a different ODF-supporting application, your number may be different. Standards sometimes allow optional structures or provide alternative ways of doing the same thing. There might also be some extra information in there for the application itself, such as how big a window was being used the last time we edited the document. Over time, we develop best practices to minimize these differences. For example, in the web services standards area, developing such best practices is the role of the Web Services Interoperability Organization.

Though I tried to remove extra information about who created the file and so forth, I suspect that there is still some of that metadata in there. We shall find out.

I’ll cover this first analysis, listing the components in the zip structure, in Part 2. Since the goal is to examine an ODF file, I’m going to refer to this as the “Dr. ODF” project.

I’m not sure where we’ll go after we have descended into the pieces, but I suspect a few options will present themselves.

