Dr. ODF: Examining OpenDocument Format with Python, Part 7

Last time I looked at a slightly more advanced document, one that added a bit of formatting to the basic document containing only the letter ‘x’. In this part I’ll create a document with more structure. We’ll look at the XML in the ODF file and complete our basic examination of the structure of documents. In future parts of this series we’ll get more hardcore with Python to do interesting things.

Here is the new basic document we’ll look at:

[Image: sample document]

There are two level-one headings, each followed by a paragraph of two sentences. No special formatting is applied and only the default styles are used.

The relevant portion of the content.xml component of the ODF file is the following, as produced by the IBM Workplace Managed Client:

  <office:body>
    <office:text>
      <text:sequence-decls>
        <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
      </text:sequence-decls>
      <text:h text:outline-level="1" text:style-name="Heading 1">
        Chapter 1
      </text:h>
      <text:p text:style-name="Standard">
        It was a dark and rainy morning in Oslo as I walked across the park to
        the conference. I avoided getting splashed by cars driving through
        puddles, but my shoes were nevertheless squishy as I walked in the
        front door.
      </text:p>
      <text:h text:outline-level="1" text:style-name="Heading 1">
        Chapter 2
      </text:h>
      <text:p text:style-name="Standard">
        The sun was high and hot as I lay on the beach on Lake Ontario, thinking
        about the snow that would start falling before the Fall was over. I
        accidentally stepped in a puddle as I was walking to my car afterwards,
        so my shoes were squishy as I walked up the steps of the new porch.
      </text:p>
    </office:text>
  </office:body>

I did a little hand formatting so everything would fit nicely on the screen. The styles.xml component has the definitions we need to understand the formatting.

    <style:default-style style:family="paragraph">
      <style:paragraph-properties
          fo:hyphenation-ladder-count="no-limit"
          fo:hyphenation-push-char-count="2"
          fo:hyphenation-remain-char-count="2"
          style:line-break="strict"
          style:punctuation-wrap="hanging"
          style:tab-stop-distance="0.4925in"
          style:text-autospace="ideograph-alpha"
          style:writing-mode="page"/>
      <style:text-properties
          fo:country="US" fo:font-size="12pt"
          fo:hyphenate="false" fo:language="en"
          style:country-asian="none"
          style:country-complex="none"
          style:font-name="Times New Roman"
          style:font-name-asian="Arial1"
          style:font-name-complex="Tahoma"
          style:font-size-asian="12pt"
          style:font-size-complex="12pt"
          style:language-asian="none"
          style:language-complex="none"
          style:use-window-font-color="true"/>
    </style:default-style>

    <style:style style:class="text" style:family="paragraph"
        style:name="Standard"/>

    <style:style style:class="text" style:family="paragraph"
        style:name="Text body" style:parent-style-name="Standard">
      <style:paragraph-properties
          fo:margin-bottom="0.0835in" fo:margin-top="0in"/>
    </style:style>

    <style:style style:class="text" style:family="paragraph"
        style:name="Heading" style:next-style-name="Text body"
        style:parent-style-name="Standard">
      <style:paragraph-properties
          fo:keep-with-next="always" fo:margin-bottom="0.0835in"
          fo:margin-top="0.1665in"/>
      <style:text-properties
          fo:font-size="14pt" style:font-name="Arial"
          style:font-name-asian="SimSun"
          style:font-name-complex="Tahoma"
          style:font-size-asian="14pt"
          style:font-size-complex="14pt"/>
    </style:style>

    <style:style style:class="text" style:family="paragraph"
        style:name="Heading 1" style:next-style-name="Text body"
        style:parent-style-name="Heading">
      <style:text-properties
          fo:font-size="115%" fo:font-weight="bold"
          style:font-size-asian="115%"
          style:font-size-complex="115%"
          style:font-weight-asian="bold"
          style:font-weight-complex="bold"/>
    </style:style>

We can read this style information, in part, in the following way:

  • The default style for paragraphs has several settings, including that the font should be Times New Roman and the font size 12 points. (This is the same as last time.)
  • The style called Standard inherits all the default properties but does not add any of its own. (Also the same as last time for the simple case.)
  • There is a new style called Text body which is the same as Standard but it adds some settings about the space above and below the paragraph.
  • We have a new style called Heading which inherits from Standard. Unless otherwise specified, the style of the text following Heading text should have the style Text body. Heading text should always be kept with the text after it, for example, when the document is formatted to break across pages. The space above and below the Heading is set and non-zero. (And we note that the space below is about half the space above.)
  • The font for the Heading style is 14 point Arial (a sans-serif font, often used for headings when the text is roman).
  • We have yet another style called Heading 1, a first level heading, which inherits all the Heading properties but makes a few changes: the font is made bold and the font size is 115% that of Heading (and this happens to be 16.1 points).
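
The inheritance chain described in the list above can be followed mechanically. What follows is only a minimal sketch, assuming a trimmed copy of the styles shown earlier wrapped in a made-up `<styles>` root element so that it parses on its own. The function name `resolve_text_properties` is hypothetical, and the sketch assumes that every absolute font size is given in points and that a single default style applies to every style in the chain.

```python
import xml.etree.ElementTree as ET

STYLE_NS = 'urn:oasis:names:tc:opendocument:xmlns:style:1.0'
FO_NS = 'urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0'

# ElementTree spells a namespaced name as '{namespace-uri}local-name'.
STYLE = '{%s}style' % STYLE_NS
DEFAULT_STYLE = '{%s}default-style' % STYLE_NS
TEXT_PROPERTIES = '{%s}text-properties' % STYLE_NS
NAME = '{%s}name' % STYLE_NS
PARENT = '{%s}parent-style-name' % STYLE_NS
FONT_NAME = '{%s}font-name' % STYLE_NS
FONT_SIZE = '{%s}font-size' % FO_NS
FONT_WEIGHT = '{%s}font-weight' % FO_NS

# Trimmed sample; the <styles> wrapper is an assumption made for this sketch.
SAMPLE_STYLES = """<styles
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">
  <style:default-style style:family="paragraph">
    <style:text-properties fo:font-size="12pt"
        style:font-name="Times New Roman"/>
  </style:default-style>
  <style:style style:family="paragraph" style:name="Standard"/>
  <style:style style:family="paragraph" style:name="Heading"
      style:parent-style-name="Standard">
    <style:text-properties fo:font-size="14pt" style:font-name="Arial"/>
  </style:style>
  <style:style style:family="paragraph" style:name="Heading 1"
      style:parent-style-name="Heading">
    <style:text-properties fo:font-size="115%" fo:font-weight="bold"/>
  </style:style>
</styles>"""

def resolve_text_properties(root, style_name):
    """Merge text properties from the default style down to style_name."""
    styles = {}
    for style in root.findall(STYLE):
        styles[style.get(NAME)] = style

    # Collect the named style and then each of its ancestors.
    chain = []
    name = style_name
    while name is not None:
        chain.append(styles[name])
        name = styles[name].get(PARENT)
    default = root.find(DEFAULT_STYLE)
    if default is not None:
        chain.append(default)

    # Apply the most general settings first so children can override them.
    resolved = {}
    for style in reversed(chain):
        properties = style.find(TEXT_PROPERTIES)
        if properties is None:
            continue
        for key, value in properties.attrib.items():
            if key == FONT_SIZE and value.endswith('%'):
                # A percentage is relative to the inherited size ('14pt').
                inherited_points = float(resolved[FONT_SIZE][:-2])
                points = inherited_points * float(value[:-1]) / 100.0
                value = '%.1fpt' % points
            resolved[key] = value
    return resolved

root = ET.fromstring(SAMPLE_STYLES)
heading_1 = resolve_text_properties(root, 'Heading 1')
print(heading_1[FONT_NAME] + ' ' + heading_1[FONT_SIZE] + ' '
      + heading_1[FONT_WEIGHT])
```

For Heading 1 this prints `Arial 16.1pt bold`, which matches the hand calculation in the last bullet. A real styles.xml has more style families and more property elements than this sketch handles.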

When we look at the document again, we see only the Standard and Heading 1 styles. The rest are there for internal use, and some of them, such as Heading and Text body, are part of the catalog of styles used by the Workplace Managed Client.
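
One way to check which styles the content actually refers to is to collect every text:style-name attribute. This is a small sketch under the assumption that the body looks like the fragment shown earlier; the sample string and the function name `used_style_names` are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEXT_NS = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0'
STYLE_NAME = '{%s}style-name' % TEXT_NS

# A shortened stand-in for the body of content.xml.
SAMPLE_BODY = """<office:text
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:h text:outline-level="1" text:style-name="Heading 1">Chapter 1</text:h>
  <text:p text:style-name="Standard">It was a dark and rainy morning.</text:p>
  <text:h text:outline-level="1" text:style-name="Heading 1">Chapter 2</text:h>
  <text:p text:style-name="Standard">The sun was high and hot.</text:p>
</office:text>"""

def used_style_names(root):
    """Return the set of style names referenced anywhere under root."""
    names = set()
    for element in root.iter():
        name = element.get(STYLE_NAME)
        if name is not None:
            names.add(name)
    return names

names = used_style_names(ET.fromstring(SAMPLE_BODY))
print(', '.join(sorted(names)))
```

Only text:style-name is examined here. Other element families carry their style references in attributes from other namespaces, which this sketch ignores.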

The Standard style is attached to the text within the p element and the Heading 1 style is attached to the text within the h element. Could we, or should we, have kept the Heading 1 style on the text that we wanted to look like a heading but used a p element instead of an h element?

Yes we could have (for appearances) and no we shouldn’t have.

Semantically, a heading is different from regular paragraph text. If I am processing the document to produce a table of contents, I would want to include all the things that actually are headings, not things that merely look like headings. In addition, headings have special attributes such as their level in an outline and numbering information.
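
To make the table of contents point concrete, here is a rough sketch that lists only the real h elements. The sample deliberately includes a p element dressed up in the Heading 1 style to show that it is skipped. The function name `table_of_contents` is hypothetical, and falling back to level 1 when text:outline-level is missing is a choice made for this sketch.

```python
import xml.etree.ElementTree as ET

TEXT_NS = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0'
HEADING = '{%s}h' % TEXT_NS
OUTLINE_LEVEL = '{%s}outline-level' % TEXT_NS

SAMPLE_BODY = """<office:text
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:h text:outline-level="1" text:style-name="Heading 1">Chapter 1</text:h>
  <text:p text:style-name="Standard">It was a dark and rainy morning.</text:p>
  <text:p text:style-name="Heading 1">Only looks like a heading</text:p>
  <text:h text:outline-level="1" text:style-name="Heading 1">Chapter 2</text:h>
</office:text>"""

def table_of_contents(root):
    """Return (level, title) pairs for each heading element, in order."""
    entries = []
    for heading in root.iter(HEADING):
        # Assumption of this sketch: treat a missing level as level 1.
        level = int(heading.get(OUTLINE_LEVEL, '1'))
        title = ''.join(heading.itertext()).strip()
        entries.append((level, title))
    return entries

for level, title in table_of_contents(ET.fromstring(SAMPLE_BODY)):
    # Indent each entry according to its outline level.
    print('  ' * (level - 1) + title)
```

The paragraph that merely wears the Heading 1 style never shows up in the result, because the selection is made on the element name and not on the style.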

We use different element names and attributes to separate out the parts of the document structure according to their function, and we use styles to control how they look when formatted. If we were going to do textual analysis or searching on the document, we would just look at the structure and completely ignore the style information.
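
A search over the document could then work purely on the structure. The following sketch pulls the text out of the h and p elements, never looks at a style attribute, and reports which blocks contain a word. The names `plain_text_blocks` and `blocks_containing` are made up for illustration, and the sample is a shortened version of the content shown earlier.

```python
import xml.etree.ElementTree as ET

TEXT_NS = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0'
HEADING = '{%s}h' % TEXT_NS
PARAGRAPH = '{%s}p' % TEXT_NS

SAMPLE_BODY = """<office:text
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:h text:outline-level="1" text:style-name="Heading 1">
    Chapter 1
  </text:h>
  <text:p text:style-name="Standard">
    I avoided getting splashed by cars, but my shoes were
    nevertheless squishy as I walked in the front door.
  </text:p>
  <text:h text:outline-level="1" text:style-name="Heading 1">
    Chapter 2
  </text:h>
  <text:p text:style-name="Standard">
    The sun was high and hot as I lay on the beach on Lake Ontario.
  </text:p>
</office:text>"""

def plain_text_blocks(root):
    """Return the text of each heading and paragraph, in document order."""
    blocks = []
    for element in root.iter():
        if element.tag in (HEADING, PARAGRAPH):
            text = ''.join(element.itertext())
            # Collapse the line breaks and indentation of the XML layout.
            blocks.append(' '.join(text.split()))
    return blocks

def blocks_containing(root, word):
    """Return the blocks whose text contains word, ignoring case."""
    word = word.lower()
    return [block for block in plain_text_blocks(root)
            if word in block.lower()]

root = ET.fromstring(SAMPLE_BODY)
for block in blocks_containing(root, 'squishy'):
    print(block)
```

This simple version would visit a paragraph nested inside another paragraph-level element twice, once on its own and once as part of its parent, so a fuller version would need to decide how to treat nesting.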

At this point I could continue to add more document elements such as sections, lists, and footnotes, but I think you get the idea. Chapters 4 and 5 of the OpenDocument v1.0 Specification from page 66 to page 94 will tell you everything you need to know. Note, however, that you will need to comprehend something about how the RELAX NG schema language works in order to know exactly how required and optional attributes and content are specified.

While reading the specification is a good way of learning about the abstract document structure, using Python code as I have done here will show you the data generated by ODF-compliant applications. If you did not have the specification, you could start to reverse engineer the document structure in this way. Of course, you would never quite know if you were handling all possible things that might be in a document, so use the spec to make sure you are complete in your coverage.
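
As a hedged illustration of that reverse engineering approach, the sketch below tallies which elements occur in a piece of XML and which attributes appear on each. The function name `survey` and the small sample are assumptions made for this example. Keys are in the '{namespace}name' form that ElementTree uses.

```python
import collections
import xml.etree.ElementTree as ET

TEXT_NS = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0'

SAMPLE_BODY = """<office:text
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:sequence-decls>
    <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
    <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
  </text:sequence-decls>
  <text:h text:outline-level="1" text:style-name="Heading 1">Chapter 1</text:h>
  <text:p text:style-name="Standard">It was a dark and rainy morning.</text:p>
  <text:h text:outline-level="1" text:style-name="Heading 1">Chapter 2</text:h>
  <text:p text:style-name="Standard">The sun was high and hot.</text:p>
</office:text>"""

def survey(xml_text):
    """Count the elements in xml_text and record the attributes on each."""
    element_counts = collections.Counter()
    attributes_seen = collections.defaultdict(set)
    for element in ET.fromstring(xml_text).iter():
        element_counts[element.tag] += 1
        attributes_seen[element.tag].update(element.attrib.keys())
    return element_counts, attributes_seen

element_counts, attributes_seen = survey(SAMPLE_BODY)
for tag in sorted(element_counts):
    print('%3d  %s' % (element_counts[tag], tag))
    for attribute in sorted(attributes_seen[tag]):
        print('       ' + attribute)
```

Running a survey like this over many documents shows what an application really writes, but it can only report what happens to be present, which is why the specification remains the check for completeness.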

I’ve only looked at a word processing document, but the OpenDocument v1.0 Specification also includes descriptions of the formats for spreadsheets and presentations, plus all the sorts of things that can go into these general documents.

I’ll end this part with my current Python function definitions for looking at the ODF files. Next time we’ll briefly look at metadata and then start to get fancier with the XML processing. Note that I could continue to condense and remove some of the redundancies in the following code, but I’ll leave it as is for clarity and perhaps return to refactoring the code at a later time.

import sys
import os

def get_odf_component(odf_file, component_name) :
  """Returns a named component from an ODF file."""
  import zipfile
  odf_zipfile = zipfile.ZipFile(odf_file, 'r')
  component = odf_zipfile.read(component_name)
  odf_zipfile.close()
  return component

def print_separator() :
  print '\n-------------------------------------------------------------------------------'

def print_file_size(file) :
  """Print and return the size of the file in bytes. Also display the file name."""
  file_statistics = os.stat(file)
  file_size = file_statistics.st_size
  print_separator()
  print 'The size of the file "' + file + '" is ' + str(file_size)
  return file_size

def print_odf_zipfile_info(odf_file) :
  """Print the names of the zip components in an ODF file."""
  import zipfile
  odf_zipfile = zipfile.ZipFile(odf_file, 'r')
  print_separator()
  print 'The components in the ODF zip file are:\n'

  total_compressed_size = 0
  total_uncompressed_size = 0

  print "  %-25s %12s %12s" % ('Name', 'Compressed', 'Uncompressed')
  print "  %-25s %12s %12s\n" % (' ', 'Size', 'Size')

  for info in odf_zipfile.infolist() :
    total_compressed_size += info.compress_size
    total_uncompressed_size += info.file_size
    print "  %-25s %12d %12d" % (info.filename, info.compress_size, info.file_size)

  print "  %-25s %12s %12s" % (' ', '--------', '--------')
  print "  %-25s %12d %12d" % ('Total', total_compressed_size, total_uncompressed_size)

  odf_zipfile.close()
  return total_compressed_size

def print_odf_mime_type(odf_file) :
  """Print the contents of the mimetype component in an ODF file."""
  mime_type = get_odf_component(odf_file, 'mimetype')
  print_separator()
  print 'The contents of mimetype for "' + odf_file + '" is ' + mime_type
  return mime_type

def print_odf_manifest(odf_file) :
  """Print the contents of the manifest component in an ODF file."""
  from xml.dom.minidom import parseString
  manifest = get_odf_component(odf_file, 'META-INF/manifest.xml')
  print_separator()
  print 'The contents of the manifest for "' + odf_file + '" is \n'
  dom = parseString(manifest)
  print dom.toprettyxml('  ')
  dom.unlink()
  return manifest

def print_odf_content(odf_file) :
  """Print the contents of the document content component in an ODF file."""
  from xml.dom.minidom import parseString
  content = get_odf_component(odf_file, 'content.xml')
  print_separator()
  print 'The contents of the document content for "' + odf_file + '" is \n'
  dom = parseString(content)
  print dom.toprettyxml('  ')
  dom.unlink()
  return content

def print_odf_styles(odf_file) :
  """Print the contents of the document styles component in an ODF file."""
  from xml.dom.minidom import parseString
  styles = get_odf_component(odf_file, 'styles.xml')
  print_separator()
  print 'The contents of the document styles file for "' + odf_file + '" is \n'
  dom = parseString(styles)
  print dom.toprettyxml('  ')
  dom.unlink()
  return styles

def print_odf_metadata(odf_file) :
  """Print the metadata component in an ODF file."""
  from xml.dom.minidom import parseString
  metadata = get_odf_component(odf_file, 'meta.xml')
  print_separator()
  print 'The metadata for "' + odf_file + '" is \n'
  dom = parseString(metadata)
  print dom.toprettyxml('  ')
  dom.unlink()
  return metadata

def dr_odf() :
  """Examine the ODF file given on the command line and print information about it."""
  if len(sys.argv) > 1 :
    odf_file = sys.argv[1]
    print_file_size(odf_file)
    print_odf_zipfile_info(odf_file)

    print_odf_mime_type(odf_file)

    print_odf_manifest(odf_file)

    print_odf_content(odf_file)
    print_odf_styles(odf_file)
    print_odf_metadata(odf_file)
  else :
    print "Sorry, you forgot to give an ODF file name."

dr_odf()
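
The four functions that print the manifest, content, styles, and metadata differ only in the component name and the message. One possible condensed form is sketched below. It is not a replacement for the listing above, only an illustration: the names `pretty_odf_component`, `print_odf_xml_components`, and `make_sample_odf` are hypothetical, and the tiny in-memory archive stands in for a real ODF file so that the sketch runs on its own. The calls to print use a single parenthesized argument so that they behave the same way in older and newer versions of Python.

```python
import io
import zipfile
from xml.dom.minidom import parseString

XML_COMPONENTS = ('META-INF/manifest.xml', 'content.xml',
                  'styles.xml', 'meta.xml')

def pretty_odf_component(odf_file, component_name):
    """Return one XML component of an ODF file as indented text."""
    with zipfile.ZipFile(odf_file, 'r') as odf_zipfile:
        component = odf_zipfile.read(component_name)
    dom = parseString(component)
    try:
        return dom.toprettyxml('  ')
    finally:
        # Release the DOM tree, as the original functions do.
        dom.unlink()

def print_odf_xml_components(odf_file):
    """Print every XML component named in XML_COMPONENTS."""
    for component_name in XML_COMPONENTS:
        print('\nThe contents of "%s" is\n' % component_name)
        print(pretty_odf_component(odf_file, component_name))

def make_sample_odf():
    """Build a stand-in archive in memory (not a valid ODF document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        for component_name in XML_COMPONENTS:
            archive.writestr(component_name, '<doc><p>x</p></doc>')
    buffer.seek(0)
    return buffer

print_odf_xml_components(make_sample_odf())
```

With a real file, the call would be `print_odf_xml_components(sys.argv[1])`, since zipfile.ZipFile accepts either a file name or a file-like object.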

Parts: 1 2 3 4 5 6 7
Also see: “Regarding use of the Python code in my blog entries”

1 comment to Dr. ODF: Examining OpenDocument Format with Python, Part 7

  • Cinly Ooi

    Dear Bob,

    Is it possible to highlight changes in the XML code? Maybe something like ‘>’ in the left margin to indicate lines that have changed. As the XML code increases in length (and in future, complexity), it is becoming more and more difficult to find out quickly what has changed.

    Many thanks,
    Cinly