How much abstraction is healthy for a schema/data model? - Part 2

In Part 1 I discussed Elements vs. Attributes and the document nature of business vs. the table nature of RDBMS. In this installment I'd like to shed some light on abstraction levels.
I'll be using a more interesting example than CRM: a court/case management system. When I was at law school, one of my professors asked me to look out of the window and tell him what I saw. So I replied: "Cars and roads, a park with trees, buildings and people entering and leaving them, and so on". "Wrong!" he replied, "you see subjects and objects".
From a legal view you can classify everything like that: subjects are actors on rights, while objects are attached to rights.
Interestingly, in object-oriented languages like Java or C# you find a similar "final" abstraction, where everything is an object that can be acted upon by calling its methods.
In data modeling the challenge is to find the right level of abstraction: too low and you duplicate information, too high and a system becomes hard to grasp and maintain.
Let's look at some examples. In a court you might be able to file a civil, criminal, administrative or inheritance case. Each filing consists of a number of documents. So when you collect the paperwork during your contextual enquiry, you end up with draft 1:
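A minimal sketch of such a first draft, assuming one top-level element per case type. The element names are illustrative assumptions, and the four fragments are alternatives, not one document:

<civil-case id="ci-123">
  <!-- plaintiff, defendant, lawyers ... -->
</civil-case>
<criminal-case id="cr-456">
  <!-- prosecution, defendant, lawyers ... -->
</criminal-case>
<administrative-case id="ad-789">
  <!-- ... -->
</administrative-case>
<inheritance-case id="in-321">
  <!-- ... -->
</inheritance-case>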

(I'll talk about the inner elements later.) The content will most likely be very similar across case types: plaintiff, defendant, the representing lawyers etc. So you end up writing a lot of duplicate definitions. And you need to add a completely new definition (and update your software) when the court adds "trade disputes" and, after the V landed, "alien matters" to the jurisdiction.
Of course, keeping the definitions separate has the advantage that you can be much more prescriptive. E.g. in a criminal case you could have an element "maximum-penalty", while in a civil case you would use "damages-sought". This makes data modeling as much a science as an art.
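A sketch of what "more prescriptive" could mean in a schema, assuming separate root elements per case type. The names criminal-case and civil-case and the chosen data types are assumptions for illustration only:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <!-- Only a criminal case may carry a maximum penalty -->
  <xs:element name="criminal-case">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="maximum-penalty" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="id" use="required" type="xs:ID"/>
    </xs:complexType>
  </xs:element>
  <!-- Only a civil case may carry the damages sought -->
  <xs:element name="civil-case">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="damages-sought" type="xs:decimal"/>
      </xs:sequence>
      <xs:attribute name="id" use="required" type="xs:ID"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

With definitions like these a validator would reject a civil case that contains a maximum penalty, which a single generic definition could not express as easily.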
To confuse matters more for the beginner: you can mix schemata, so you can mix the specialised information into a more generalised base schema. IBM uses this approach for IBM Connections, where the general base schema is ATOM and the missing elements and attributes are mixed in from a Connections-specific schema.
You find a similar approach in MS-Sharepoint, where a Sharepoint payload is wrapped in 2 layers of open standards, ATOM and OData, only to be proprietary at the very end.
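To illustrate the mix-in idea, here is a minimal sketch of an ATOM entry that carries court-specific elements from a second namespace. The namespace http://example.org/court, the court prefix and its element names are made up for this example; they are not taken from IBM Connections or Sharepoint:

<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:court="http://example.org/court">
  <!-- Generic part: plain ATOM elements -->
  <id>urn:example:case:ci-123</id>
  <title>Pan vs. Hook</title>
  <updated>2012-03-13T14:45:00+08:00</updated>
  <author>
    <name>Court Registry</name>
  </author>
  <!-- Specialised part: mixed in from the hypothetical court namespace -->
  <court:case-type>civil</court:case-type>
  <court:summary>Hook shall pay damages for trespassing</court:summary>
</entry>

A generic feed reader can still process the entry and simply ignores the elements it does not know, while a court-aware client picks up the extra information.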
When we abstract the case schema we would probably use something like:
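A minimal sketch of the abstracted form, assuming a single case element whose type attribute carries the kind of case. The civil values match the example used further down; the criminal case is an illustrative assumption:

<case id="ci-123" type="civil">
  <!-- plaintiff, defendant, lawyers ... -->
</case>
<case id="cr-456" type="criminal">
  <!-- prosecution, defendant, lawyers ... -->
</case>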

A little "fallacy" here: in the id field the case type is duplicated. While this is not in conformance with "the pure teachings", it is a practical compromise. In real life the case ID will be used as an isolated identifier "outside" of IT. Typically we find encoded information like year, type, running number, chamber etc.
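If the ID encodes information, a schema can at least make sure it follows the agreed format. A sketch, assuming IDs consist of a two-letter type prefix, a hyphen and a running number; the type name caseId and the prefixes are assumptions:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical ID format: type prefix, hyphen, running number, e.g. ci-123 -->
  <xs:simpleType name="caseId">
    <xs:restriction base="xs:ID">
      <xs:pattern value="(ci|cr|ad|in)-[0-9]+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

A real court would extend the pattern with year, chamber and whatever else its numbering scheme encodes.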
One could argue that a case is just a specific kind of document and push for further abstraction. Any information inside could then be expressed as an abstract item, too:

<document type="case" subtype="civil" id="ci-123">
  <content name="plaintiff" type="person">Peter Pan</content>
  <content name="defendant" type="person">Captain Hook</content>
</document>

Looks familiar? Presuming you could have more than one plaintiff, you could write:

<document form="civilcase">
  <noteinfo unid="AA12469B4BFC2099852567AE0055123F">
    <created>
      <datetime>20120313T143000,00+08</datetime>
    </created>
  </noteinfo>
  <item name="plaintiff">
    <text>Peter Pan</text>
    <text>Tinkerbell</text>
  </item>
  <item name="defendant">
    <text>Captain Hook</text>
  </item>
</document>

Yep - good ol' DXL! While this is a good format for a generalised information management system, it is IMHO too abstract for your use case. When you create forms and views, you actually demonstrate the intent to specialise. The beauty here: the general format of your persistence layer won't get in the way when you modify your application layer.
Of course this flexibility requires a little more care to make your application easy to understand for the next developer. Back to our example, time to peek inside. How should the content be structured there?
In a court case you have plaintiffs, defendants, lawyers, witnesses, jury, subject matter experts etc. So you could model them as individual elements or as participant elements with a type attribute. The latter would look like this:

<case id="ci-123" type="civil">
  <created>2012-03-13T14:45:00+08:00</created>
  <participant type="plaintiff" ref="some URI">
    <name>Peter Pan</name>
    <address type="primary">Lost Islands</address>
  </participant>
  <participant type="witness" for="plaintiff" ref="some URI">
    <name>Tinker Bell</name>
    <address>Lost Islands</address>
  </participant>
  <participant type="defendant" ref="some URI">
    <name>Captain Hook</name>
    <address>Crocodile Bay</address>
  </participant>
  <summary lang="en">Hook shall pay damages for trespassing</summary>
</case>
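
For comparison, a sketch of the first option, with individual elements per role instead of participant elements with a type attribute. The element names plaintiff, witness and defendant are assumptions derived from the type values above:

<case id="ci-123" type="civil">
  <created>2012-03-13T14:45:00+08:00</created>
  <plaintiff ref="some URI">
    <name>Peter Pan</name>
    <address type="primary">Lost Islands</address>
  </plaintiff>
  <witness for="plaintiff" ref="some URI">
    <name>Tinker Bell</name>
    <address>Lost Islands</address>
  </witness>
  <defendant ref="some URI">
    <name>Captain Hook</name>
    <address>Crocodile Bay</address>
  </defendant>
  <summary lang="en">Hook shall pay damages for trespassing</summary>
</case>

In this variant every new role needs a new element and therefore a schema change, while the participant variant only needs a new type value.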

A full case document would be much more complex, but this snippet is sufficient to demonstrate another important principle: denormalisation in the API. When persisting such data in an RDBMS, most likely there would be a table for people that captures individuals and an additional table that links people to cases, with the type of participant as an attribute. So a person's particulars need to be stored only once.
That has a number of issues: when a person moves (or changes name, e.g. by marriage), all court cases would get updated (RDBMS proponents would argue: this is a good thing!), which invalidates existing documents (go ask your lawyer). So a good API would keep that information completely inside the case document, together with a reference (the "some URI" above) to the current record, so the documents can be updated if needed and permitted, or the current data can at least be shown in the UI.
Expressed as a schema, our mini case would look a little like this:
Schema for a case

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="case">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="created"/>
        <xs:element maxOccurs="unbounded" ref="participant"/>
        <xs:element ref="summary"/>
      </xs:sequence>
      <xs:attribute name="id" use="required" type="xs:ID"/>
      <xs:attribute name="type" use="required" type="xs:NCName"/>
    </xs:complexType>
  </xs:element>
  <!-- xs:dateTime expects ISO 8601, e.g. 2012-03-13T14:45:00+08:00 -->
  <xs:element name="created" type="xs:dateTime"/>
  <xs:element name="participant">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element ref="address"/>
      </xs:sequence>
      <!-- "for" names a participant type such as "plaintiff", not an xs:ID in the document, so it is not an IDREF -->
      <xs:attribute name="for" type="xs:NCName"/>
      <xs:attribute name="ref" use="required" type="xs:anyURI"/>
      <xs:attribute name="type" use="required" type="xs:NCName"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="name" type="xs:string"/>
  <xs:element name="address">
    <xs:complexType mixed="true">
      <xs:attribute name="type" type="xs:NCName"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="summary">
    <xs:complexType mixed="true">
      <xs:attribute name="lang" use="required" type="xs:NCName"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

In Part 3 I will have a look at some of the fundamental schemata one should know.

Posted by Stephan H Wissel on 13 March 2012 | Comments (0) | categories: Software
