XML Toolbox: RELAX NG & trang

Fri, 21. May 2010

When handling RESTful APIs, for example, you may want to validate the response XML, which in most cases uses a custom format.

I typically use tools that are already installed on every Mac: I fire an HTTP GET request with curl and immediately check the response with xmllint, like this:

$ curl http://www.heise.de/newsticker/heise-atom.xml | xmllint --format --schema myschema.xsd -

But I just don’t like creating and editing W3C XML Schemas: the notorious angle brackets hurt my eyes, and the redundant element names bury the real content under tons of repetitive text. Nor do I like clicking through graphical schema editors and getting lost hunting for hidden settings and property dialogs.

A minimal, naive schema that validates the example Atom feed above (simply generated from the feed itself with trang, see below) looks like this as a W3C Schema:

Naive Atom W3C Schema

This is where RELAX NG comes in, especially its “compact form”, which is just what I like: a concise, BNF-ish syntax. It was designed by Murata Makoto and James Clark, who was Technical Lead of the XML Working Group back when XML was created and is the father of the famous expat parser.

The very same schema as above, written in RELAX NG, boils down to about ½ the lines and a fraction of the characters, without a single angle bracket:

default namespace = "http://www.w3.org/2005/Atom"

start =
  element feed {
    title,
    element subtitle { text },
    link+,
    updated,
    element author {
      element name { text }
    },
    id,
    element entry { title, link, id, updated }+
  }
title = element title { text }
link =
  element link {
    attribute href { xsd:anyURI },
    attribute rel { xsd:NCName }?
  }
updated = element updated { xsd:dateTime }
id = element id { xsd:anyURI }

And since libxml2, and therefore xmllint, supports RELAX NG, you can validate just as in the beginning by passing the schema in its regular (XML) syntax, while the schema itself stays much more editable:

$ curl http://www.heise.de/newsticker/heise-atom.xml | xmllint --format --relaxng myschema.rng -
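For use inside a test script rather than at the prompt, the same pipeline can be sketched in Python. This is only a minimal sketch under assumptions: the helper names `xmllint_command` and `validate_url` are made up for illustration, xmllint is expected on the PATH, and `--noout` is added so that only diagnostics are printed instead of the reformatted document.

```python
import subprocess
import urllib.request


def xmllint_command(schema_path, kind="relaxng"):
    """Build the xmllint argument list; the trailing "-" makes xmllint read stdin.

    kind is "relaxng" for an .rng schema or "schema" for a W3C .xsd schema.
    """
    if kind not in ("relaxng", "schema"):
        raise ValueError("kind must be 'relaxng' or 'schema'")
    # --noout suppresses the echoed document, so only diagnostics remain.
    return ["xmllint", "--noout", "--" + kind, schema_path, "-"]


def validate_url(url, schema_path, kind="relaxng"):
    """Fetch url and return True if xmllint exits with status 0 for the body."""
    with urllib.request.urlopen(url) as response:
        body = response.read()
    result = subprocess.run(xmllint_command(schema_path, kind), input=body)
    return result.returncode == 0
```

Calling `validate_url("http://www.heise.de/newsticker/heise-atom.xml", "myschema.rng")` would then mirror the command line above; xmllint still writes its error messages to stderr.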

trang

Trang is a schema converter for RELAX NG, written in Java, which I wrapped in a small shell script:

#!/bin/sh
# Run the trang jar that sits next to this script, passing all arguments through.
java -jar "$(dirname "$0")/trang-20090818/trang.jar" "$@"

Writing a new schema from scratch can be much more convenient if you have a bunch of XML files you can feed into trang:

$ trang *.xml myschema.rnc

then refine the resulting schema in compact form and finally turn it into the regular form:

$ trang myschema.rnc myschema.rng

Trang also serves me as a schema indenter by converting from compact to regular and back.
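The round trip used for re-indenting could be scripted roughly as follows. This is a sketch only: it assumes the wrapper script above is installed on the PATH under the name `trang`, and the function names are hypothetical. Note that the second step overwrites the original compact file.

```python
import subprocess


def reindent_commands(rnc_path, rng_path):
    """Return the two trang invocations for a compact -> regular -> compact round trip."""
    return [
        ["trang", rnc_path, rng_path],  # compact form to regular (XML) form
        ["trang", rng_path, rnc_path],  # and back, which rewrites the layout
    ]


def reindent(rnc_path, rng_path):
    """Run both conversions, stopping at the first failure."""
    for command in reindent_commands(rnc_path, rng_path):
        subprocess.run(command, check=True)
```

Trang picks the input and output format from the file extension, which is why no format flags appear here.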

BUT: trang converts RELAX NG into W3C XML Schema, but not vice versa.

Deep validation

Validating XML documents shouldn’t stop at elements and attributes; it should also leverage XML Schema Datatypes and apply, e.g., regular expressions:

element uuid {
  xsd:string {
    ## A UUID
    pattern =
      "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
  }
}
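To try out such a pattern before putting it into a schema, a quick check in Python can help. The sketch below assumes that this particular expression means the same thing in Python's `re` module as in XML Schema (the two regex dialects differ in general); `is_uuid` is a name invented for this example. An XSD pattern facet has to match the whole value, which `fullmatch` mirrors.

```python
import re

# Same expression as in the schema above. XSD pattern facets are implicitly
# anchored at both ends, so fullmatch() is used rather than search().
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_uuid(value):
    """Return True if value consists of exactly one UUID-shaped string."""
    return UUID_PATTERN.fullmatch(value) is not None
```

A value with extra characters before or after the UUID is rejected, just as the schema would reject it.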

or range constraints:

element year {
  xsd:unsignedShort { minInclusive = "1900" maxInclusive = "2100" }
}
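What the range facets accept can be approximated in a few lines of Python. This is a simplified sketch, not a full implementation of the datatype: it only accepts plain ASCII digits with optional surrounding whitespace, and `is_valid_year` is a hypothetical helper name.

```python
import re


def is_valid_year(text):
    """Approximate xsd:unsignedShort with minInclusive 1900 and maxInclusive 2100."""
    value = text.strip()  # surrounding whitespace is ignored for numeric types
    if not re.fullmatch(r"[0-9]+", value):
        return False  # signs, letters and empty strings are rejected here
    # The 1900..2100 window already lies inside unsignedShort's 0..65535 range.
    return 1900 <= int(value) <= 2100
```

Both bounds are inclusive, so 1900 and 2100 themselves are valid, while 1899 and 2101 are not.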

P.S.: For a more complete Atom RELAX NG schema see here or ask your search engine of choice.