This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Unported License, unless otherwise specified.

Dr. ODF: Examining OpenDocument Format with Python, Part 6

Before we go too much further, I want to show you a little bit of XML processing in Python. As we saw in Part 5, we can open up the content.xml component of our ODF file and retrieve the document. For a word processing document, this will include the text as well as style information for formatting. If there are special inclusions like math within the document, these will be held separately in the overall ODF file, listed in the manifest, and then linked to from the main document text. Overall, ODF is organized to logically separate the different structural and semantic aspects of a document.

In Part 5 I pulled out the XML for the content and then formatted it by hand to make it easier to read. I also added some line numbers. It would be nice if we could get the formatting, or “pretty printing,” done automatically. I’m going to do this by changing my function a bit and using the DOM, the W3C Document Object Model. Here is the new form of the Python code to read and then print the XML content:

def get_odf_component(odf_file, component_name):
  """Returns a named component from an ODF file."""
  import zipfile
  odf_zipfile = zipfile.ZipFile(odf_file, 'r')
  component = odf_zipfile.read(component_name)
  odf_zipfile.close()
  return component

def print_separator():
  print('\n-------------------------------------------------------------------------------')

def print_odf_content(odf_file):
  """Print the contents of the document content component in an ODF file."""
  from xml.dom.minidom import parseString
  content = get_odf_component(odf_file, 'content.xml')
  print_separator()
  print('The contents of the document content for "' + odf_file + '" is \n')
  # Parse the XML into a DOM, print it with two-space indents, then release it.
  dom = parseString(content)
  print(dom.toprettyxml('  '))
  dom.unlink()
  return content

I refactored, or rearranged, the code a bit to pull out two utility functions that can be used in many places. The function get_odf_component takes the name of an ODF file and the name of a component and then returns that component information. In the case of XML, this will be a string with the XML in it. In previous parts, I repeated this sort of thing in the functions that extracted the manifest, content, and styles. It makes sense to have one function that can handle all of this. I should add some error checking just in case the ODF file does not exist or if the component is not present in the ODF file, but I’ll skip that right now.
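For the curious, here is a minimal sketch of what that error checking might look like. It is not the code used in the rest of this series: the name get_odf_component_checked is hypothetical, it assumes Python 3 (where the exceptions are called FileNotFoundError and zipfile.BadZipFile), and returning None on failure is just one possible choice.

```python
import zipfile


def get_odf_component_checked(odf_file, component_name):
    """Return a named component from an ODF file, or None if unavailable.

    Hypothetical variant of get_odf_component with basic error checking.
    """
    try:
        # The "with" statement closes the zip file even if read() fails.
        with zipfile.ZipFile(odf_file, 'r') as odf_zipfile:
            return odf_zipfile.read(component_name)
    except FileNotFoundError:
        print('No such file: ' + odf_file)
    except zipfile.BadZipFile:
        print('Not a valid zip (ODF) file: ' + odf_file)
    except KeyError:
        # ZipFile.read raises KeyError when the archive has no such member.
        print('No component named "' + component_name + '" in ' + odf_file)
    return None
```

A caller would then test the result for None before trying to parse it.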

Since the output is now getting rather cluttered with all the ODF file statistics and XML, I’ve added a separator line between sections. The print_separator function does that.

An aside: I’ve often seen the words “separate” and “separator” misspelled, usually because people put in an “e” for the first “a”. When I was little my mom told me to remember “a rat” (as in the rodent) and that would help me remember the correct spelling. It’s worked so far.

The lines in print_odf_content that do the new work with the DOM are

  from xml.dom.minidom import parseString
  ...
  dom = parseString(content)
  print(dom.toprettyxml('  '))
  dom.unlink()

minidom is a lightweight implementation of the Document Object Model for when you only need to do a few things with XML. Since we’re just nicely printing the XML, it will do just fine. The first line imports, or brings in, the function parseString. This takes a string containing XML (for example, the content of an ODF file) and puts it in an internal form that you can then walk over and manipulate. That internal form is called dom in this example. The internal representation gets rid of all the angle brackets and puts the information into an object that reflects the element structure of the XML, including the inclusion of one element in another, plus some trickier things we haven’t seen yet. Let’s look at a simple example.

In Part 1 I used the simple XML example

<albums>
  <album>Another Side of Bob Dylan</album>
  <album>Blonde on Blonde</album>
  <album>Blood on the Tracks</album>
  <album>Bringing It All Back Home</album>
  <album>Desire</album>
  <album>Good As I Been to You</album>
  <album>Highway 61 Revisited</album>
</albums>

Our structure is a tree with a single root, the albums element. The root has 7 child elements, each of which is an album element whose value is the name of a Bob Dylan album. If I had gotten fancy, I could have added a more complicated element structure to include other information such as year of release, producer, record company, genre, and so on. All this data can be arranged in a tree. I can walk over the tree looking for different kinds of information, I can change it, and then I can write it out again.
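To make the tree a little more concrete, here is a small sketch that parses the album list with minidom and walks over it. The variable names are my own choices for illustration. One thing to be aware of: the line breaks and indentation between the album elements are kept in the tree as text nodes, so I filter for element nodes when counting children.

```python
from xml.dom.minidom import parseString

albums_xml = """<albums>
  <album>Another Side of Bob Dylan</album>
  <album>Blonde on Blonde</album>
  <album>Blood on the Tracks</album>
  <album>Bringing It All Back Home</album>
  <album>Desire</album>
  <album>Good As I Been to You</album>
  <album>Highway 61 Revisited</album>
</albums>"""

dom = parseString(albums_xml)
root = dom.documentElement          # the single root: <albums>

# Collect the text held inside each <album> element.
titles = [album.firstChild.data
          for album in root.getElementsByTagName('album')]

# Count only the children that are elements; the whitespace between
# them shows up in the tree as separate text nodes.
element_children = [node for node in root.childNodes
                    if node.nodeType == node.ELEMENT_NODE]

print(root.tagName)
print(len(element_children))
for title in titles:
    print(title)

dom.unlink()
```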

Generally, XML is a format used for exchanging information from one piece of software to another. An application reads in, or parses, the XML and puts it into some internal structure or object that it can efficiently process. At that point, the XML is out of the picture though something else might want to use it. The dom is our way of accessing the data in that internal form.

One of the things we can do with this internal representation is to write it right back out again. There are rules in the W3C XML standard about what you can do to add whitespace—blanks, tabs, and newlines—without affecting the meaning of the information. Within those rules we might want to add some extra spaces and lines to make the XML easier to read. This is called “prettyprinting.”

In our function, the expression dom.toprettyxml('  ') takes the internal representation and creates a string with XML in it. The indentation used at each level is two blanks at the beginning of a line, and these get repeated the deeper we go into the nested XML structure. I then simply print out the string. The output isn’t too bad, as we shall see, though very long lines can be created when there are a lot of attributes. I’m tempted to write a prettyprinter with other options such as line numbers and better handling of attributes, but that is somewhat outside the scope of this series.
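As a rough idea of how the line number option might go, here is a minimal sketch built on top of toprettyxml. The function name pretty_xml_with_line_numbers is hypothetical and it does nothing about long attribute lists. The exact layout of text inside elements can differ between Python versions, so treat the output as illustrative.

```python
from xml.dom.minidom import parseString


def pretty_xml_with_line_numbers(xml_text, indent='  '):
    """Return pretty-printed XML with a line number in front of each line."""
    dom = parseString(xml_text)
    try:
        pretty = dom.toprettyxml(indent)
    finally:
        dom.unlink()
    # Skip lines that contain only whitespace, then number what is left.
    lines = [line for line in pretty.splitlines() if line.strip()]
    width = len(str(len(lines)))
    return '\n'.join('%*d  %s' % (width, number, line)
                     for number, line in enumerate(lines, 1))


print(pretty_xml_with_line_numbers(
    '<albums><album>Desire</album><album>Blonde on Blonde</album></albums>'))
```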

The final line is dom.unlink(). This is nonintuitive to most people who haven’t done some serious coding. You can think of unlinking as erasing something or at least giving it back so it can be reused for something else. For general XML, the internal representation can be quite involved and is not exactly a simple tree structure: each node points to its children, but also back to its parent and across to its siblings, so the objects refer to one another in circles. By explicitly calling unlink I’m telling the Python interpreter that I’m not going to use dom any longer, and those internal references get broken. Most of the time the cleanup happens automatically but in this case it needs a little help.

Reclaiming memory automatically once nothing is using it any longer is called “garbage collection,” and it removes a great many kinds of programmer errors. The old but wonderful programming language Lisp has garbage collection and, early on, garbage collection was seen as one of the great advantages of Java over C++. It’s a good thing and a fascinating topic in its own right, if you are so inclined (which I am).
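If remembering to call unlink feels fragile, a minidom document can also be used in a with statement, which does the unlinking when the block ends. This is a small sketch assuming a reasonably recent Python 3; the sample XML is made up for the illustration.

```python
from xml.dom.minidom import parseString

xml_text = '<albums><album>Desire</album></albums>'

# Leaving the "with" block calls dom.unlink() for us.
with parseString(xml_text) as dom:
    pretty = dom.toprettyxml('  ')

print(pretty)
# After unlinking, the document no longer holds on to its child nodes.
print(len(dom.childNodes))
```

The pretty-printed string is an ordinary Python string, so it is still usable after the document itself has been unlinked.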

With this rewrite of print_odf_content, we’re ready to look again at the content of our ODF file that just contains the character ‘x’. It’s not quite as attractive as what I did by hand in the last part, but it will do.

-------------------------------------------------------------------------------
The contents of the document content for "x-wmc.odt" is 

<?xml version="1.0" ?>
<office:document-content office:version="1.0"
xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dom="http://www.w3.org/2001/xml-events"
xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"
xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0"
xmlns:math="http://www.w3.org/1998/Math/MathML"
xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0"
xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
xmlns:xform="http://www.w3.org/2002/xforms"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <office:scripts/>
  <office:font-face-decls>
    <style:font-face style:name="Tahoma1" svg:font-family="Tahoma"/>
    <style:font-face style:font-pitch="variable" style:name="Arial"
    svg:font-family="Arial"/>
    <style:font-face style:font-pitch="variable" style:name="Tahoma"
    svg:font-family="Tahoma"/>
    <style:font-face style:font-family-generic="roman" style:font-pitch="variable"
    style:name="Times New Roman" svg:font-family="'Times New Roman'"/>
  </office:font-face-decls>
  <office:automatic-styles/>
  <office:body>
    <office:text>
      <text:sequence-decls>
        <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
      </text:sequence-decls>
      <text:p text:style-name="Standard">
        x
      </text:p>
    </office:text>
  </office:body>
</office:document-content>

Confession: I broke up the list of attributes by hand onto multiple lines so it would look right in a browser, but that’s all I did by hand. Really.

Now let’s change our ODF file a bit and see what happens to the XML. I’m going to make my ‘x’ bold and increase the font size from the default of 12 points to an explicit 20 points.

<?xml version="1.0" ?>
<office:document-content office:version="1.0"
xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dom="http://www.w3.org/2001/xml-events"
xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"
xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0"
xmlns:math="http://www.w3.org/1998/Math/MathML"
xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0"
xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
xmlns:xform="http://www.w3.org/2002/xforms"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <office:scripts/>
  <office:font-face-decls>
    <style:font-face style:name="Tahoma1" svg:font-family="Tahoma"/>
    <style:font-face style:font-pitch="variable" style:name="Arial"
    svg:font-family="Arial"/>
    <style:font-face style:font-pitch="variable" style:name="Tahoma"
    svg:font-family="Tahoma"/>
    <style:font-face style:font-family-generic="roman" style:font-pitch="variable"
    style:name="Times New Roman" svg:font-family="'Times New Roman'"/>
  </office:font-face-decls>
  <office:automatic-styles>
    <style:style style:family="paragraph" style:name="P1"
    style:parent-style-name="Standard">
      <style:text-properties fo:font-size="20pt" fo:font-weight="bold"
      style:font-size-asian="20pt" style:font-size-complex="20pt"
      style:font-weight-asian="bold" style:font-weight-complex="bold"/>
    </style:style>
  </office:automatic-styles>
  <office:body>
    <office:text>
      <text:sequence-decls>
        <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
      </text:sequence-decls>
      <text:p text:style-name="P1">
        x
      </text:p>
    </office:text>
  </office:body>
</office:document-content>

There are two changes from the plain version above. The first is where we went from

      <text:p text:style-name="Standard">
        x
      </text:p>

to

      <text:p text:style-name="P1">
        x
      </text:p>

As I explained last time, Standard is a named style that just has the default paragraph formatting. In this case, we have a new style called P1. That leads to the second change: if we go up and look at the automatic styles, we went from

  <office:automatic-styles/>

that is, no automatic styles, to

  <office:automatic-styles>
    <style:style style:family="paragraph" style:name="P1"
    style:parent-style-name="Standard">
      <style:text-properties fo:font-size="20pt" fo:font-weight="bold"
      style:font-size-asian="20pt" style:font-size-complex="20pt"
      style:font-weight-asian="bold" style:font-weight-complex="bold"/>
    </style:style>
  </office:automatic-styles>

An automatic style is created when you simply make individual formatting changes. Here the ‘P’ in P1 is for “paragraph.” You can see that its parent is the Standard style we discussed previously. It inherits all the formatting properties from the parent. If we were to make a change to Standard, that change would be reflected in P1 unless P1 overrides the property itself.

If you look at the rest of the definition of P1, you will see how the font is made bold and the size is now 20 points. Incidentally, there are 72 points to an inch, so 20 points is a bit over a quarter of an inch. A point is an old typesetter’s measurement.
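Since the DOM gives us the element structure, we can also pull these values out in code rather than by eye. Here is a minimal sketch that looks up a style by name and reads its parent, font size, and font weight. The function name describe_style and the cut-down sample string are my own inventions for illustration; the namespace URIs are the ones shown in the output above.

```python
from xml.dom.minidom import parseString

STYLE_NS = 'urn:oasis:names:tc:opendocument:xmlns:style:1.0'
FO_NS = 'urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0'

# A cut-down stand-in for content.xml, holding only the automatic styles.
sample = (
    '<office:automatic-styles'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:style="' + STYLE_NS + '"'
    ' xmlns:fo="' + FO_NS + '">'
    '<style:style style:family="paragraph" style:name="P1"'
    ' style:parent-style-name="Standard">'
    '<style:text-properties fo:font-size="20pt" fo:font-weight="bold"/>'
    '</style:style>'
    '</office:automatic-styles>'
)


def describe_style(xml_text, style_name):
    """Return (parent, font size, font weight) for a named style, or None."""
    with parseString(xml_text) as dom:
        for style in dom.getElementsByTagNameNS(STYLE_NS, 'style'):
            if style.getAttributeNS(STYLE_NS, 'name') != style_name:
                continue
            parent = style.getAttributeNS(STYLE_NS, 'parent-style-name')
            size = weight = ''
            for props in style.getElementsByTagNameNS(STYLE_NS,
                                                      'text-properties'):
                size = props.getAttributeNS(FO_NS, 'font-size')
                weight = props.getAttributeNS(FO_NS, 'font-weight')
            return parent, size, weight
    return None


print(describe_style(sample, 'P1'))
```

Looking things up by namespace URI rather than by the style: or fo: prefix is deliberate, since the prefix is only a local abbreviation chosen by whoever wrote the file.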

The actual structure of the document near ‘x’ has changed very little, just a little change of style name. All the actual style information is kept somewhere else and so the document is kept nice and clean. Nice design, if you ask me.

Next time we’ll examine more serious changes to the document and see how they are reflected in the XML.

Parts: 1 2 3 4 5 6 7
Also see: “Regarding use of the Python code in my blog entries”
