In Part 1 of what I am calling the Dr. ODF project for examining what’s in an OpenDocument Format word processing file, I laid out our first big milestone: understand what is in the zip file holding the components of the document. Our initial document is very simple because I’ve only put the letter ‘x’ in it and then saved it in the IBM Workplace Managed Client. I pointed out that other applications may make larger or smaller files, but at this point we don’t know why. Eventually we’ll find out and we’ll also look at slightly more complicated documents.
I’m going to use the Python programming language to do this exploration and I’m going to use whatever general purpose libraries I can find that will help us with the zip file structure that ODF uses as well as the XML itself. What I’ve restricted myself from doing is using any Python code that might already exist that handles ODF. That’s cheating!
Before we look at the zip file, I’m going to set an even more mundane initial task for myself: write enough Python code to take an ODF file and just print out how big the file is. This will get us warmed up and will also separate the newness of Python itself from the ODF-specific work we’ll do later.
In this project I’m going to be using Python under the Microsoft Windows operating system and I’ll be repeating a command line similar to the following over and over:
python drodf2.py x-wmc.odt
While it may look arcane, it is really pretty straightforward and the weird names are of my doing.
There are three components to this. The first is “python”, the name of the program which runs the Python language interpreter. This takes the Python code in the file “drodf2.py” and applies it to our ODF file “x-wmc.odt”. What we hope happens is that this prints out how big our ODF file is.
So “python” is the command and it is followed by two arguments. Computer people like to start counting at 0 instead of 1, and so argument 0 is “drodf2.py” and argument 1 is “x-wmc.odt”. These numbers will be important in a few moments.
The name of our project is Dr. ODF, so that is the “drodf”. This is part 2 of the discussion, so that’s why we have “drodf2”. The file extension “.py” is the standard way of identifying Python programming language files.
I chose the name “x-wmc.odt” because the ODF document simply contains the letter ‘x’ and it was created by the IBM Workplace Managed Client. The file extension “.odt” is the standard way of identifying OpenDocument Format word processing files.
Right now “x-wmc.odt” is pretty much a mystery to us except that we know there is an ‘x’ in there. When we run the Python code I’m about to show you we see
C:\py-odf>python drodf2.py x-wmc.odt
The size of the file "x-wmc.odt" is 5172
So other than the ‘x’ there are 5171 bytes that we’re trying to decipher.
Here is what’s in “drodf2.py”:
import sys
import os

def print_file_size(file) :
    """Print and return the size of the file in bytes. Also display the file name."""
    file_statistics = os.stat(file)
    file_size = file_statistics.st_size
    print 'The size of the file "' + file + '" is ' + str(file_size)
    return file_size

def dr_odf() :
    """Examine the ODF file given on the command line and print information about it."""
    odf_file = sys.argv[1]
    print_file_size(odf_file)

dr_odf()
The last line is what initiates all the work. It calls the function dr_odf to examine and print information about the ODF file we give it. It is this function that we will be successively refining and giving more capabilities. Right now it just does one thing of significance, and that is calling print_file_size. Let’s look at this from top to bottom.
Python allows you to save your work so that you can use it at different times and in different places. It uses the notion of module to save definitions of functions, variables, and also things you want to happen when the module is first executed. Our new module is called drodf2. It uses two other modules provided with the Python installation.
The sys module contains lots of handy things about the environment in which we are working. In our case, we’ll use a list in there, sys.argv, to get the name of our ODF file from the command line. The os module contains utilities for working with the operating system. In our case, we just need to get information about the ODF file, namely, how much space it takes up. These modules abstract away a lot of the details of whether we are running on, say, Linux or Windows, and make it much easier to write software that runs on different platforms.
A function encapsulates some sort of activity that you want to use over and over. In Python, you define a function with a line that starts with a def and ends with a colon. You can pass arguments to a function so that it can do a sequence of actions to different things. Our function print_file_size does what the name says, but it needs to know what file to tell us about. So we give it the argument file. I hope my naming choices are helping make more obvious what is going on here.
The first line of print_file_size just describes what it does. This is a simple function and the name gives away what it does, but you should always add such a description, which Python calls a docstring, to help you or others figure out your code later on.
The second line of print_file_size gets all the statistics about the file. We just care about the size, but we could find out when the file was first created or last changed, among other things. We save the size information in (what else?) file_size.
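To make the "among other things" a little more concrete, here is a minimal sketch of reading a second statistic alongside the size. The helper name describe_file and the dictionary keys are my own inventions for illustration and are not part of drodf2.py.

```python
import os
import time

def describe_file(path):
    """Return a couple of the statistics os.stat knows about a file.

    describe_file is a hypothetical helper, not part of drodf2.py.
    """
    file_statistics = os.stat(path)
    return {
        "size_in_bytes": file_statistics.st_size,
        # st_mtime is the last-changed time in seconds; ctime makes it readable.
        "last_modified": time.ctime(file_statistics.st_mtime),
    }
```

Calling describe_file("x-wmc.odt") in the directory holding our document should give back the same size that print_file_size reports, plus a readable date.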
The following line of print_file_size prints out what we want to know, with a little extra language to make it clear what information we are providing. The plus signs are not adding numbers together, they’re sticking together a bunch of strings, or sequences of characters, so that we can print them all at once. We use str to make our file size number into a string. There are some rather Python-y expressions on this line, but you learn these idioms just like you learn to speak or read a human language.
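If you are wondering why str is needed at all, this small sketch shows both cases. The two function names are hypothetical and exist only for this illustration. I have written it so that it also runs on newer Python versions where print is a function.

```python
def size_message(file_name, file_size):
    """Build the same sentence that print_file_size prints."""
    # str() turns the number into text so that + can join everything.
    return 'The size of the file "' + file_name + '" is ' + str(file_size)

def size_message_without_str(file_name, file_size):
    """Show what goes wrong if str() is left out."""
    try:
        return 'The size of the file "' + file_name + '" is ' + file_size
    except TypeError:
        # Python refuses to join a string and a number with +.
        return "TypeError"

print(size_message("x-wmc.odt", 5172))
```

The second function never gets as far as building the sentence: joining text and a number with a plus sign is an error, which is why the original line wraps file_size in str.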
I have plans for doing something with this file size number later, so our function concludes by sending it back to whoever called print_file_size in the first place.
The dr_odf function has no arguments because it is going to operate directly on the name of the ODF document originally given on the command line. For clarity, I assign that file name to the variable odf_file. Remember when I said that the ODF file name was argument #1 on the command line? We get it via sys.argv[1]. This is pretty arcane but you get used to it.
The name argv goes way back in computer programming and refers to the arguments that were originally passed to the program. The “v” in argv stands for “vector”, a list of values, and that list can hold any number of arguments. Our program takes just one argument for now, besides the name of the file containing the Python instructions to be run. Other programs might take no arguments or more than one.
Finally, I then have the function do the one thing we have defined so far, namely call print_file_size.
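To make the argument numbering concrete, here is a small sketch that labels each entry of an argv-style list with its position. The function describe_arguments and the list example_argv are hypothetical; example_argv is a stand-in for what I would expect sys.argv to hold for the command line used in this entry.

```python
def describe_arguments(argv):
    """Return one line of text for each entry in an argv-style list."""
    lines = []
    for position, value in enumerate(argv):
        lines.append("argument " + str(position) + " is " + value)
    return lines

# A stand-in for sys.argv; in a real run you would pass sys.argv itself.
example_argv = ["drodf2.py", "x-wmc.odt"]
for line in describe_arguments(example_argv):
    print(line)
```

Counting starts at 0, so the ODF file name sits at position 1, which is exactly why dr_odf asks for sys.argv[1].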
This works as expected but we’re missing some safeguards. What happens if I forget to tell it the name of the ODF file? What happens if I give it a name, but the file doesn’t exist? I’ll work some of those in as we go along, but for the sake of exposition I’ll keep things simple.
Let’s fix the first thing I mentioned: I’ll check that there is something given for argument #1.
def dr_odf() :
    """Examine the ODF file given on the command line and print information about it."""
    if len(sys.argv) > 1 :
        odf_file = sys.argv[1]
        print_file_size(odf_file)
    else :
        print "Sorry, you forgot to give an ODF file name."
If we put this edited definition in the Python file drodf2a.py and omit the file name from the command line, we’ll see
C:\py-odf>python drodf2a.py
Sorry, you forgot to give an ODF file name.
The new version of the function makes sure that the length (len) of the list of arguments is at least 2. If not, it tells us that we’re missing the file name.
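The second safeguard mentioned earlier, a file name that doesn’t point at a real file, could be handled in a similar spirit. This is only a sketch of one possible approach, not what the later versions of the code necessarily do: the helper check_odf_file_name and the wording of the second message are my own assumptions.

```python
import os

def check_odf_file_name(argv):
    """Return (file_name, problem); exactly one of the two is None.

    check_odf_file_name is a hypothetical helper, not part of drodf2a.py.
    """
    if len(argv) < 2:
        return None, "Sorry, you forgot to give an ODF file name."
    file_name = argv[1]
    # os.path.isfile is False for missing files and for directories.
    if not os.path.isfile(file_name):
        return None, 'Sorry, I cannot find the file "' + file_name + '".'
    return file_name, None
```

A caller would pass in sys.argv, print the problem if there is one, and otherwise carry on with the returned file name.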
Now let’s move on to the zip structure of our ODF document. Remember that means that we have aggregated (I prefer “schmooshed”) several independent things together and then compressed them. Here’s some updated code that extends what we did before:
def print_odf_zipfile_info(odf_file) :
    """Print the names of the zip components in an ODF file."""
    import zipfile
    odf_zipfile = zipfile.ZipFile(odf_file, 'r')
    print "The components in the ODF zip file are:"
    for name in odf_zipfile.namelist() :
        print "  " + name
    odf_zipfile.close()

def dr_odf() :
    """Examine the ODF file given on the command line and print information about it."""
    if len(sys.argv) > 1 :
        odf_file = sys.argv[1]
        print_file_size(odf_file)
        print_odf_zipfile_info(odf_file)
    else :
        print "Sorry, you forgot to give an ODF file name."
When we run this we see
The size of the file "x-wmc.odt" is 5172
The components in the ODF zip file are:
  mimetype
  content.xml
  styles.xml
  meta.xml
  settings.xml
  META-INF/manifest.xml
Now we’re getting somewhere! Let’s look at this output and then go back to how it was created.
There are six components to the file, five of which are labelled as being XML. These are the things that are taking up all that extra room beyond our ‘x’. If I had to guess right now, I would think that we would find our ‘x’ in content.xml. The rest needs a bit more interpretation, and so next time we’ll look at what the ODF standard itself has to say about these components.
The code I used to display this information adds one new function called print_odf_zipfile_info and then modifies dr_odf to call it. Our main function dr_odf now knows both how to show the file size as well as print out the basic contents of the zip structure.
The module zipfile contains the know-how to take a zip file and let you process it. We’re not going to be creating a new zip file or modifying an old one, so when we do
odf_zipfile = zipfile.ZipFile(odf_file, 'r')
we are creating an internal zip file object that we can only read. That’s why we use 'r'.
With this object in hand, we use namelist to give us a list of the names of the components stored in the zip structure. Lists are a special kind of collection in Python and there are many convenient functions and utilities available to work on them. In particular, we can use for to do something to the names one after another.
for name in odf_zipfile.namelist() :
    print "  " + name
I want this to be formatted nicely, so for each name I put it on its own line and print two blank spaces before it. The last line of print_odf_zipfile_info tells the system that we’re done with the file and so it can clean up.
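If you don’t have an ODF file handy, the same listing can be tried on a tiny zip built in memory. This is a sketch under a few assumptions: the function names are hypothetical, the two entries are placeholders rather than a valid ODF document, and it relies on the with statement, which newer Python versions support for zip files and which does the closing for us.

```python
import io
import zipfile

def list_zip_components(zip_source):
    """Return the component names in a zip file, in stored order."""
    # The with statement closes the zip file for us, even after an error.
    with zipfile.ZipFile(zip_source, 'r') as odf_zipfile:
        return odf_zipfile.namelist()

def make_tiny_zip():
    """Build a small zip in memory so there is something to list.

    The two entries are placeholders, not a valid ODF document.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as tiny_zip:
        tiny_zip.writestr("mimetype", "placeholder")
        tiny_zip.writestr("content.xml", "<placeholder/>")
    buffer.seek(0)
    return buffer

for name in list_zip_components(make_tiny_zip()):
    print("  " + name)
```

list_zip_components also accepts a file name such as "x-wmc.odt", since zipfile.ZipFile takes either a path or a file-like object.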
Next time we’ll embellish this by printing out how large the components are in the file in their compressed forms as well as how big they would be if they were not shrunk. Then we’ll see what the ODF spec has to say about the roles of the parts. We’ll conclude the next section by peeking inside two or three of the components.
Also see: “Regarding use of the Python code in my blog entries”


And again – for those wanting to follow on, here’s the url for part 3:
http://www.sutor.com/newsite/blog-open/?p=995
PAE
You can follow this series of entries via the Dr-ODF category.