XmlToXsd - A Better Schema Generator

7 Dec 2010 · CPOL · 2 min read
Build better schema for rapid data model prototyping.

Introduction

In line-of-business projects, you frequently need to create complex schemas. This article shows how to rapidly prototype a high-quality, maintainable schema from sample XML, so that derived object models generate cleanly on all platforms.

Background

There are many tools for creating and managing schemas, and most are "not fun". Starting from a blank screen in a complex tool is daunting, especially when you're writing in an abstract language like XML Schema (XSD).

It's far easier to create a sample and then use existing tools to generate the schema from it. The problem with these schema-generating tools is that they nest complex types, which causes three problems:

  • It's ugly and hard to maintain
  • Code generators build very ugly objects from this kind of schema
  • It does not follow general industry practice for XML Schema (for example, the generated schema depends on the Microsoft-specific msdata namespace)

By Example

I started my data modeling from a sample; in this case, I wanted to model Cub Scout pinewood derby race data (yes, I have an 8-year-old boy).

XML
<Derby>
    <Racers>
        <Group Name="Den7">
            <Cub Id="1" First="Johny" Last="Racer" Place="1"/>
            <Cub Id="2" First="Stan" Last="Lee" Place="3"/>
        </Group>
        ...

If I run xsd.exe (included in the .NET SDK) on that XML, it generates XSD like:

XML
<xs:schema id="Derby" xmlns="" 
    xmlns:xs="http://www.w3.org/2001/XMLSchema" 
    xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
  <xs:element name="Derby" 
       msdata:IsDataSet="true" 
       msdata:UseCurrentLocale="true">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="Racers">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Group" 
                   minOccurs="0" 
                   maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="Cub" 
                         minOccurs="0" 
                         maxOccurs="unbounded">
                      <xs:complexType>
                      ...

Notice all the nesting. When you then run xsd.exe on the generated derby.xsd to produce classes, you get objects with names like DerbyRacersGroupCub. Bleck!

The Better Schema

XML
<xs:schema xmlns="" 
        xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Derby" type="DerbyInfo" />
  <xs:complexType name="DerbyInfo">
    <xs:sequence>
      <xs:element name="Racers" type="RacersInfo" />
      <xs:element name="Races" type="RacesInfo" />
    </xs:sequence>
  </xs:complexType>
  ...

Improve Xml2Xsd

So I set out to solve all these problems and built a better/simpler generator.

Algorithm Overview

  • Open an XDocument for the sample XML.
  • Read all the elements and build a dictionary of XPaths. I used a dictionary, but a List<string> with Distinct() would have worked too.
  • Walk the collected XPaths and build the attributes and elements, making sure each new element references a named type instead of nesting it.
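
The first two steps can be sketched roughly as follows. This is a simplified illustration, not the article's RecurseAllXPaths: the names XPathCollector, Collect and Walk are made up for this example, and each path here includes the element's own name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XPathCollector
{
    // Returns the distinct element and attribute paths found in the sample,
    // e.g. "/Derby/Racers/Group/Cub" and "/Derby/Racers/Group/Cub/@Id".
    public static SortedSet<string> Collect(XDocument doc)
    {
        var paths = new SortedSet<string>(StringComparer.Ordinal);
        if (doc.Root != null)
            Walk(string.Empty, doc.Root, paths);
        return paths;
    }

    private static void Walk(string parentPath, XElement elem, ISet<string> paths)
    {
        var path = parentPath + "/" + elem.Name.LocalName;
        paths.Add(path); // a set ignores duplicates, so repeated siblings collapse

        // Record one path per attribute, skipping xmlns declarations
        foreach (var attr in elem.Attributes().Where(a => !a.IsNamespaceDeclaration))
            paths.Add(path + "/@" + attr.Name.LocalName);

        foreach (var child in elem.Elements())
            Walk(path, child, paths);
    }
}
```

A sorted set stands in for the dictionary here because only the keys matter at this stage; the article's dictionary later stores the generated schema node for each path.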

High Level Static Method

C#
public static XDocument Generate(XDocument content, string targetNamespace)
{
    xpaths.Clear();
    elements.Clear();
    recurseElements.Clear();

    RecurseAllXPaths(string.Empty, content.Elements().First());

    target = XNamespace.Get(targetNamespace);

    var compTypes = xpaths.Select(k => k.Key)
        .OrderBy(o => o)
        .Select(k => ComplexTypeElementFromXPath(k))
        .Where(q => null != q).ToArray();

    // The first one is our root element... it needs to be extracted and massaged
    compTypes[0] = compTypes.First().Element(xs + 
                     "sequence").Element(xs + "element");

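    // Warning: Namespaces are tricky/hinted here, be careful.
    // The root element must be xs:schema (in the XML Schema namespace).
    return new XDocument(new XElement(xs + "schema",
        // Make the target namespace the default namespace so that unprefixed
        // type references such as "DerbyInfo" resolve to the generated types
        new XAttribute("xmlns", targetNamespace),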
        // Why 'qualified'?
        // All "qualified" elements and
        // attributes are in the targetNamespace of the
        // schema and all "unqualified"
        // elements and attributes are in no namespace.
        //  All global elements and attributes are qualified.
        new XAttribute("elementFormDefault", "qualified"),

        // Specify the target namespace,
        // you will want this for schema validation
        new XAttribute("targetNamespace", targetNamespace),
                
        // hint to xDocument that we want
        // the xml schema namespace to be called 'xs'
        new XAttribute(XNamespace.Xmlns + "xs", 
                       "http://www.w3.org/2001/XMLSchema"),
                       compTypes));
}
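
The comment on targetNamespace above mentions schema validation. A minimal sketch of that check, assuming the generated schema and the sample are both available as XDocument instances, might look like this. SchemaCheck and Validate are hypothetical names for this example; XmlSchemaSet and the XDocument.Validate extension method come from System.Xml.Schema.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.Schema;

public static class SchemaCheck
{
    // Validates 'sample' against 'schema' and returns the validation messages
    // (an empty list means no validation errors were reported).
    public static List<string> Validate(XDocument sample, XDocument schema)
    {
        var set = new XmlSchemaSet();
        using (var reader = schema.CreateReader())
        {
            // Passing null uses the targetNamespace declared in the schema itself
            set.Add(null, reader);
        }

        var errors = new List<string>();
        sample.Validate(set, (sender, e) => errors.Add(e.Message));
        return errors;
    }
}
```

Running the original sample through a check like this after every generator change is a quick way to confirm that the flattened schema still describes the data it was derived from.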

Recurse All XPaths

For each element, check whether its XPath has been seen before, and track any element name that reappears at a different level (a recursively defined element).

C#
static void RecurseAllXPaths(string xpath, XElement elem)
{
    var missingXpath = !xpaths.ContainsKey(xpath);
    var lclName = elem.Name.LocalName;

    var hasLcl = elements.ContainsKey(lclName);

    // Check for recursion in the element name (same name different level)
    if (hasLcl && missingXpath)
        recurseElements.Add(lclName);
    else if (!hasLcl)
        elements.Add(lclName, true);

    // if it's not in the xpath, then add it.
    if (missingXpath)
        xpaths.Add(xpath, null);

    // add xpaths for all attributes
    elem.Attributes().ToList().ForEach(attr =>
        {
            var xpath1 = string.Format("{0}/@{1}", xpath, attr.Name);
            if (!xpaths.ContainsKey(xpath1))
                xpaths.Add(xpath1, null);
        });

    elem.Elements().ToList().ForEach(fe => RecurseAllXPaths(
        string.Format("{0}/{1}", xpath, lclName), fe));
}
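
The "same name, different level" check can also be written as a standalone query. The sketch below is an alternative illustration, not the article's code: RepeatedNameFinder, Find and PathOf are hypothetical names, and it flags any local name that occurs at more than one distinct path, which includes true recursion as well as unrelated elements that merely share a name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RepeatedNameFinder
{
    // Returns the local names that occur at more than one distinct path,
    // e.g. an <A> nested somewhere inside another <A>.
    public static HashSet<string> Find(XDocument doc)
    {
        var names = doc.Descendants()
            .Select(e => new { Name = e.Name.LocalName, Path = PathOf(e) })
            .Distinct()                 // anonymous types compare by value
            .GroupBy(x => x.Name)
            .Where(g => g.Count() > 1)  // same name, different paths
            .Select(g => g.Key);
        return new HashSet<string>(names);
    }

    // Builds a path such as "/A/B/A" from the root down to the element
    private static string PathOf(XElement e)
    {
        return "/" + string.Join("/",
            e.AncestorsAndSelf().Reverse().Select(a => a.Name.LocalName));
    }
}
```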

Generating Schema From XPaths

Now that we have a list of XPaths, we need to generate the appropriate schema for them.

C#
private static XElement ComplexTypeElementFromXPath(string xp)
{
    var parts = xp.Split('/');
    var last = parts.Last();
    var isAttr = last.StartsWith("@");
    var parent = ParentElementByXPath(parts);

    return (isAttr) ? BuildAttributeSchema(xp, last, parent) : 
        BuildElementSchema(xp, last, parent);
}

BuildAttributeSchema

C#
private static XElement BuildAttributeSchema(string k, 
               string last, XElement parent)
{
    var elem0 = new XElement(xs + "attribute",
        new XAttribute("name", last.TrimStart('@')),
        // Built-in types live in the XML Schema namespace, hence the xs: prefix
        new XAttribute("type", "xs:string"));
            
    if (null != parent)
        parent.Add(elem0);

    xpaths[k] = elem0;

    return null;
}

BuildElementSchema

This one is not as straightforward as BuildAttributeSchema; we have to make sure the parent node gets the appropriate type reference to the new element. It's a little hairy, but it works nicely.

C#
private static XElement BuildElementSchema(string k, 
               string last, XElement parent)
{
    XElement seqElem = null;
    if (null != parent)
    {
        seqElem = parent.Element(xs + "sequence");

        // Add a new sequence if one doesn't already exist
        // (parent is known to be non-null inside this branch)
        if (null == seqElem)
            // Note: add sequence to the start,
            //  because sequences need to come before any 
            //  attributes in XSD syntax
            parent.AddFirst(seqElem = new XElement(xs + "sequence"));
    }
    else
    {
        // In this case, there's no existing parent
        seqElem = new XElement(xs + "sequence");
    }

    var lastInfo = last + "Info";

    var elem0 = new XElement(xs + "element",
            new XAttribute("name", last),
            new XAttribute("type", lastInfo));
    seqElem.Add(elem0); // add the ref to the existing sequence

    return xpaths[k] = new XElement(xs + "complexType",
        new XAttribute("name", lastInfo));
}

Using the Code

  • Download the sample project
  • Build in VS2010 or Express
  • Press F5 in the solution to run it under the debugger
  • Open Derby.Xsd in bin/Debug to see the result

If you're still reading, I strongly recommend stepping through the project with F10/F11 to get into the details. Have fun!

Enhancements

  • Elements without children (a.k.a. value elements)
  • Derive data types from the contents of the sample XML (integer, boolean, DateTime, etc.)
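
Deriving a data type from sample values could start from something like the sketch below. It is only an illustration under stated assumptions: XsdTypeGuesser, Guess and GuessAll are hypothetical names, only a handful of built-in types are recognized, and the returned names assume the schema declares the xs prefix.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class XsdTypeGuesser
{
    // Guesses a built-in XSD type for a single sample value
    public static string Guess(string value)
    {
        var inv = CultureInfo.InvariantCulture;
        long l; decimal d; DateTime dt;

        if (value == "true" || value == "false")
            return "xs:boolean";
        if (long.TryParse(value, NumberStyles.AllowLeadingSign, inv, out l))
            return "xs:integer";
        if (decimal.TryParse(value,
                NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint, inv, out d))
            return "xs:decimal";
        if (DateTime.TryParseExact(value, "yyyy-MM-dd'T'HH:mm:ss", inv,
                DateTimeStyles.None, out dt))
            return "xs:dateTime";
        if (DateTime.TryParseExact(value, "yyyy-MM-dd", inv, DateTimeStyles.None, out dt))
            return "xs:date";
        return "xs:string";
    }

    // Picks one type for all sample values of an attribute or element;
    // falls back to xs:string when the samples disagree or there are none.
    public static string GuessAll(IEnumerable<string> values)
    {
        var kinds = values.Select(Guess).Distinct().ToList();
        return kinds.Count == 1 ? kinds[0] : "xs:string";
    }
}
```

Falling back to xs:string whenever the samples disagree keeps the generated schema permissive, which suits a prototype; a stricter version could widen xs:integer to xs:decimal instead.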

Future Improvements

  • Make recursively defined elements work

History

  • 12/04/2010 - Created.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Engineer, Big Company
United States
My professional career began as a developer fixing bugs on Microsoft Word97 and I've been fixing bad habits ever since. Now I do R&D work writing v1 line of business applications mostly in C#/.Net.

I've been an avid pilot/instructor for 13+ years, I've built two airplanes and mostly fly gliders now for fun. I commute in an all-electric 1986 BMW 325 conversion.

I'd like to get back to my academic roots of programming 3D analysis applications to organize complex systems.
