October 19, 2010 (Lecture 17)

Semi-Structured Data, e.g. XML-formed Data

We've spent most of the semester discussing ways of organizing large amounts of information so that we can quickly ask interesting questions and quickly get, hopefully, interesting answers. But, we've left open one very interesting question: How do we get the data that is the foundation for this process?

It might seem like we answered this question. Or, maybe like there is no real question. Our data has largely come from CSV files. But, these aren't really an answer.

They do contain data. But, there is a lot missing. For example, at first glance, there is nothing about a CSV file that tells us the domain of any field. Is it a string of characters? A number? An instance of some enumerated type? And, once we look deeper, we realize that even more is missing.

CSV files allow us to list fields. But, they don't give us any way of organizing those fields in a way that demonstrates their relationship to each other. For example, if each line of a file represents a student record, composed of student information and course information, how do we know which fields are properties of the student information and which are properties of the course information? All we've got is a single flat line composed of fields that are independent, so far as the representation can demonstrate.

We need a way of representing data that allows us to capture its structure. So-called semi-structured data is able to capture the relationships among the various elements as a tree with interior and leaf nodes. In so doing, we can represent one object as the root of a subtree, and we can represent its constituent parts as its children -- and possibly roots of their own subtrees. The Extensible Markup Language (XML) is the primary tool for describing such semi-structured data.
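
To make the contrast concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. The record layout (Record, Student, Course and their fields) and the sample values are invented for illustration; they are not taken from any real data set.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical flat record: nothing says which fields belong together.
CSV_LINE = "Pat Smith,12345,15,213\n"

# The same (made-up) record as a tree: Student and Course are interior
# nodes, and their fields are leaves beneath them.
XML_RECORD = """
<Record>
  <Student>
    <Name>Pat Smith</Name>
    <Id>12345</Id>
  </Student>
  <Course>
    <DeptId>15</DeptId>
    <CourseId>213</CourseId>
  </Course>
</Record>
"""

flat_fields = next(csv.reader(io.StringIO(CSV_LINE)))
root = ET.fromstring(XML_RECORD)

# Group leaf names under the interior node that owns them.
structure = {child.tag: [leaf.tag for leaf in child] for child in root}

print(flat_fields)
print(structure)
```

The CSV reader hands back four independent strings, while the parsed tree still knows which leaves sit under Student and which sit under Course.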

Beyond Storage

When many people think about XML documents (or other forms of semi-structured data), they think about their use within files. But, it is important to realize that the actual problem to be solved is much bigger and more general than this.

Semi-structured data is critical in representing data in transmission. For example, XML documents are often used to facilitate the electronic exchange of information between clients and Web services, or within other types of software systems. Representing semi-structured data is critically important to many types of information exchange.

So, here's the real deal. We need to realize that file storage is not entirely different than data exchange. In fact, the reading and writing of files is exactly one type of data exchange. There is no difference, in principle, between the exchange of an XML document between two systems in real time, and the exchange of XML data between two systems over time through the reading and writing of files -- or even the exchange of data between a system and itself, over time, through the reading and writing of files.

Semi-structured data, including XML documents, is critical to the exchange of data between systems, both in real time over data networks and when delayed over time and delivered via files.
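
The point that a file is just exchange delayed over time can be sketched in Python. This is only an illustration under assumptions: the element names are made up, a byte string stands in for the network, and a temporary file stands in for long-term storage.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are made up for illustration.
doc = ET.Element("Class")
ET.SubElement(doc, "DeptId").text = "15"
ET.SubElement(doc, "CourseId").text = "213"

# "Real time" exchange: serialize to bytes, as if sent over a network,
# and parse them on the receiving side.
wire_bytes = ET.tostring(doc, encoding="utf-8")
received = ET.fromstring(wire_bytes)

# Exchange "over time": write a file now, read it back later.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "class.xml")
    ET.ElementTree(doc).write(path, encoding="utf-8", xml_declaration=True)
    reloaded = ET.parse(path).getroot()


def leaves(elem):
    """Return (tag, text) pairs for the children of elem."""
    return [(child.tag, child.text) for child in elem]


print(leaves(received) == leaves(reloaded))
```

Both paths go through the same serialize-then-parse steps; only the medium in the middle differs.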

Limits of Semi-Structured Data

It is often very straightforward to record data in a semi-structured format as it is generated. But, it is difficult to search large semi-structured files quickly. And, they don't lend themselves to indexing, precisely because they are semi-structured and don't nicely organize data into blocks, etc. It is also difficult to edit and change semi-structured files over time for the very same reason.

So, we see that semi-structured documents are excellent for the transmission of information within and across systems -- but not a very good form for directly searching and otherwise studying the data. We generally get large volumes of data in semi-structured form, process it to create a database, and then use it in that form, possibly returning to semi-structured form to represent the answers to queries, etc.
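
The receive-as-XML, then load-into-a-database workflow might look like the following Python sketch. The enrollment data, the attribute names, and the table layout are all assumptions made for the example; an in-memory SQLite database stands in for whatever database a real system would use.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up enrollment data; the element and attribute names are assumptions.
XML_DATA = """
<Enrollments>
  <Enrollment student="12345" dept="15" course="213"/>
  <Enrollment student="12345" dept="15" course="415"/>
  <Enrollment student="67890" dept="21" course="127"/>
</Enrollments>
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE enrollment (student INTEGER, dept INTEGER, course INTEGER)")
# An index is easy to build once the data sits in rows and columns.
db.execute("CREATE INDEX by_student ON enrollment (student)")

# Walk the tree once and flatten each Enrollment element into a row.
rows = [
    (int(e.get("student")), int(e.get("dept")), int(e.get("course")))
    for e in ET.fromstring(XML_DATA).iter("Enrollment")
]
db.executemany("INSERT INTO enrollment VALUES (?, ?, ?)", rows)

courses = [
    course
    for (course,) in db.execute(
        "SELECT course FROM enrollment WHERE student = ? ORDER BY course", (12345,)
    )
]
print(courses)
```

The XML is read exactly once, front to back; every later question goes to the indexed table instead of back to the document.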

So, What Does XML Look Like, Anyway?

I think we are all familiar with HTML. XML is sort of like HTML, except that you get to invent your own tags for identifying elements of the document. And, you get to decide what attributes are valid within these tags for describing properties of the elements. Unlike old-school HTML, all XML tags must have closing tags, e.g., <B>...</B>, except that an element with no content can be written as a single self-closing tag, e.g. <DIV/>

Unlike HTML documents, which can appear to contain more than one "top-level" element, e.g. HEAD and BODY when the enclosing HTML tags are omitted, an XML document must have a single root element. All of the other elements within this document must be beneath this element. As a result, we can view XML documents as a flat representation of a tree. When one element is nested within another element, it is a child in the tree.
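
These rules are exactly what a parser enforces. The sketch below, with made-up element names, asks Python's xml.etree.ElementTree to parse three small strings and records which ones it accepts.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


single_root = "<Class><Teacher>Pat</Teacher><Room/></Class>"
two_roots = "<Head></Head><Body></Body>"    # no single root element
unclosed = "<Class><Teacher>Pat</Class>"    # Teacher is never closed

results = [is_well_formed(d) for d in (single_root, two_roots, unclosed)]
print(results)
```

Only the first string is accepted: it has one root, every tag is closed, and the empty Room element uses the self-closing form.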

Attributes of an element, for the most part, serve the same role within the tree as a leaf element. We can think of attributes as leaf-level child elements. The only way this breaks down is that we can actually have attributes that reference higher levels of the tree, whereas true leaf elements can't cause cycles.
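
As a rough illustration, the Python sketch below parses a fragment in which attributes play the leaf role and also point elsewhere in the tree. The attribute names id, taughtBy, and partOf are invented, and the lookup is a plain dictionary built by hand; ElementTree itself attaches no special meaning to these attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: "id", "taughtBy", and "partOf" are invented names.
DOC = """
<Class id="c1">
  <Teacher id="t1" name="Pat"/>
  <Section taughtBy="t1" partOf="c1">
    <Room>4623</Room>
  </Section>
</Class>
"""

root = ET.fromstring(DOC)
section = root.find("Section")

# Attributes arrive as a flat name -> string mapping; child elements are nodes.
attribute_names = sorted(section.attrib)
child_names = [child.tag for child in section]

# Resolve the references by hand: build an id -> element table, then look up.
by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
teacher = by_id[section.get("taughtBy")]
owner = by_id[section.get("partOf")]  # points back up at an ancestor

print(attribute_names, child_names, teacher.get("name"), owner.tag)
```

Following partOf from Section leads back up to the root, which is the kind of cycle that nesting alone can never produce.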

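In addition to the above, there is one technical detail. XML documents should begin with a magic line of metadata, the XML declaration, like the one below. (It is technically optional in XML 1.0, but, if present, it must be the very first thing in the document.)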

  <?xml version = "1.0" encoding = "utf-8" standalone = "yes" ?>
  

As you can probably guess, the line above indicates that the document is, in fact, XML and describes the character encoding used. XML documents can be tied to schemas, in one of a couple of forms, that describe the structure of the document. The standalone attribute indicates whether the document depends on such an external specification: "yes" means it does not, and "no" means it may.

Namespaces

Unlike HTML, the names of elements within an XML document are not part of the XML standard (although they may be part of the standard for whatever system is using XML to specify data). This means that, across many systems, especially when information is culled together and assembled, we can sometimes end up with the same element names meaning different things, because they come from different places. This is one form of the very common "namespace conflict" problem.

In order to address this, XML allows one to use an xmlns attribute to identify a namespace by a URI, typically a URL belonging to the organization that defines it, and to give it a nickname (prefix) for use within a document. This nickname can then be used to disambiguate the element names. This is done by prefixing the element name with the namespace's nickname. Below is an example:

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:complexType name = "Class">
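
To see what the prefix buys us, the Python sketch below parses a document in which two elements share the local name "element" but live in different namespaces. The inv prefix and its example.com URI are made up for the illustration; the xs URI is the real XML Schema namespace.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Two elements share the local name "element"; only one is in the XML Schema
# namespace. The "inv" prefix and its URI are made up for this example.
DOC = f"""
<root xmlns:xs="{XS}" xmlns:inv="http://example.com/inventory">
  <xs:element name="Class"/>
  <inv:element>widget</inv:element>
</root>
"""

root = ET.fromstring(DOC)

# ElementTree expands each prefix into its URI: {namespace-uri}local-name.
tags = [child.tag for child in root]

# Searches can use any prefix, as long as it maps to the same URI.
schema_elements = root.findall("s:element", {"s": XS})

print(tags)
print(len(schema_elements))
```

The nickname is only local shorthand. What actually distinguishes the two elements is the URI behind it, which is why the search can use a different prefix than the document does.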
  

Document Type Definition (DTD)

So far, we've seen that XML documents provide a really nice, really general way of representing the relationships within any tree-structured data. This is far better than, oh, the CSV files we've used thus far. But, they are missing something important.

They don't tell us anything about the domain of the data. What type is it? What are the valid ranges? The original solution to this problem was inherited from SGML, a markup language descended from IBM's 1960s Generalized Markup Language (GML). The solution is a companion document called a Document Type Definition (DTD). The DTD, which is not an XML document, has its own rules and syntax, but describes the domain of the data.

I don't want to go deeply into the SYNTAX of DTDs here. There are plenty of references on the Web. But, the example below will give you the flavor of it. The XML document is no longer "standalone". The next line in the XML document gives the name of the -external- DTD file. This DTD file references the root node, and describes its elements and attributes, and then their elements and attributes. It uses a regular-expression-like syntax to indicate how many instances of a particular element are expected.

The example below shows how to tie the DTD to an XML document, and gives an example of a DTD:

Top of XML Document:

  <?xml version = "1.0" encoding = "utf-8" standalone = "no"?>
  <!DOCTYPE Class SYSTEM "class.dtd">
  

DTD:

    <!DOCTYPE root-tag [
       <!ELEMENT element-name (components)>
       more elements
   ]>


   <!DOCTYPE Class [
     <!ELEMENT Person (#PCDATA)>
     <!ELEMENT Teacher (Person+)>
     <!ELEMENT ClassId 
   ]>
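
For a complete, parseable variation on this idea, the Python sketch below places the declarations in an internal DTD subset and asks the standard library's expat parser to report each one. The content models for Class and ClassId are assumptions made for the sketch; the notes do not specify them. Note also that expat is a non-validating parser: it reads the declarations but does not check the document against them, so the final comparison is only a hand-rolled name check.

```python
import xml.parsers.expat

# A small, complete document with an internal DTD subset. The content models
# for Class and ClassId are assumptions made for this sketch.
DOC = """<!DOCTYPE Class [
  <!ELEMENT Class (ClassId, Teacher)>
  <!ELEMENT ClassId (#PCDATA)>
  <!ELEMENT Teacher (Person+)>
  <!ELEMENT Person (#PCDATA)>
]>
<Class>
  <ClassId>15-213</ClassId>
  <Teacher><Person>Pat</Person><Person>Lee</Person></Teacher>
</Class>
"""

declared = []  # element names, in declaration order
seen = []      # element names, in document order


def on_element_decl(name, model):
    # model is expat's tuple encoding of the content model; only the name is kept.
    declared.append(name)


def on_start_element(name, attrs):
    seen.append(name)


parser = xml.parsers.expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.StartElementHandler = on_start_element
parser.Parse(DOC, True)

# Not validation: just a check that every element used was declared at all.
undeclared = sorted(set(seen) - set(declared))
print(declared)
print(undeclared)
```

Checking that Teacher really contains one or more Person elements, as (Person+) demands, requires a validating parser, which Python's standard library does not include.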
  

XML Schemas

XML Schemas are a more modern solution to essentially the same problem that could be solved via DTDs. They are much more expressive. And, even better, they, themselves, are XML documents. Again, the Web is full of many examples and a full specification.

But, please find below an illustrative example. Please note the tie-in to the namespace for schemas.

  <?xml version = "1.0" encoding = "utf-8" standalone = "no"?>
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:complexType name = "Class">
    <xs:sequence>
      <xs:element name = "DeptId">
        <xs:simpleType>
          <xs:restriction base = "xs:integer">
            <xs:minInclusive value = "1"/>
            <xs:maxInclusive value = "15"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name = "CourseId">
        <xs:simpleType>
          <xs:restriction base = "xs:integer">
            <xs:minInclusive value = "000"/>
            <xs:maxInclusive value = "999"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
  </xs:schema>
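
Because a schema is itself an XML document, ordinary XML tools can read it. The Python sketch below parses a well-formed schema carrying the same two integer ranges, each attached to an xs:element, pulls out the bounds, and checks a few values against them. This is a hand-rolled range check written for illustration, not a schema validator; real validation needs a dedicated library, which Python's standard library does not provide. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# A well-formed schema with the same two integer ranges as the notes.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="Class">
    <xs:sequence>
      <xs:element name="DeptId">
        <xs:simpleType>
          <xs:restriction base="xs:integer">
            <xs:minInclusive value="1"/>
            <xs:maxInclusive value="15"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name="CourseId">
        <xs:simpleType>
          <xs:restriction base="xs:integer">
            <xs:minInclusive value="000"/>
            <xs:maxInclusive value="999"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""


def integer_ranges(schema_text):
    """Map each declared element name to its (min, max) inclusive bounds."""
    ranges = {}
    for element in ET.fromstring(schema_text).findall(".//xs:element", NS):
        restriction = element.find("xs:simpleType/xs:restriction", NS)
        low = restriction.find("xs:minInclusive", NS).get("value")
        high = restriction.find("xs:maxInclusive", NS).get("value")
        ranges[element.get("name")] = (int(low), int(high))
    return ranges


def in_range(ranges, name, text):
    """Hand-rolled check of one value; not a substitute for a real validator."""
    low, high = ranges[name]
    return low <= int(text) <= high


RANGES = integer_ranges(SCHEMA)
print(RANGES)
print(in_range(RANGES, "DeptId", "15"), in_range(RANGES, "DeptId", "16"))
```

Note that xs:integer constrains the numeric value, not the number of digits, so the "000" lower bound is simply zero.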