MIME is where Legacy Systems go to die

Your new system went live. Migration of current, active data went well. A decision was made not to move historic data and instead keep the old system around in "read-only" mode, just in case some information needs to be looked up. Over time your zoo of legacy systems grows. I'll outline a way to put them to rest.

The challenges

In all recent systems (that is, anything younger than 30 years), data is stored more or less normalized. A business document, like a contract, is split over multiple tables like customer, address, header, line items, item details, product etc.

Dumping this data as is (CSV reigns supreme here) only creates a data graveyard instead of the much-coveted data lake or data warehouse.

The issue gets aggravated by the prevalence of magic numbers and abbreviations that are only resolved inside the legacy system. So looking at one piece of data tells you squat. Only an old hand would be able to make sense of Status 82 or Flags x7D3z.

Access to meaningful information is confined to the user interface of the legacy application. It provides search and assembly of business-relevant context.

The solution approach

Solving this puzzle requires a three-step approach:

denormalize
transform
make accessible

Denormalize

The persistence format needs to be something that is closer to a document structure. Suitable formats would be XML, JSON or YAML. Probably owing to my age, I would make a case for XML. I would argue that <Status id="42">Universe questions answered</Status> is way more readable than JSON when you try to link the magic number 42 to its business meaning Universe questions answered.

To be very clear: you will massively duplicate data, and if any of the relations were to change, you would face a data nightmare. But our use case, a read-only archive, doesn't face any penalty for this duplication other than storage.
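A minimal sketch of the denormalize step, assuming a hypothetical legacy schema (the tables contract, customer, line_item and status_codes, and all sample values, are made up for illustration). It joins the normalized tables into one XML document and keeps the magic number next to its resolved meaning:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical legacy schema with a lookup table for the status magic numbers.
SCHEMA = """
CREATE TABLE status_codes (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE contract (id INTEGER PRIMARY KEY, customer_id INTEGER, status INTEGER);
CREATE TABLE line_item (contract_id INTEGER, position INTEGER, product TEXT, quantity INTEGER);
INSERT INTO status_codes VALUES (42, 'Universe questions answered');
INSERT INTO customer VALUES (7, 'Deep Thought Ltd');
INSERT INTO contract VALUES (1001, 7, 42);
INSERT INTO line_item VALUES (1001, 1, 'Towel', 2);
INSERT INTO line_item VALUES (1001, 2, 'Guide', 1);
"""

def denormalize_contract(db, contract_id):
    """Assemble one self-contained XML document for a single contract."""
    row = db.execute(
        "SELECT c.id, cu.id, cu.name, c.status, s.label "
        "FROM contract c "
        "JOIN customer cu ON cu.id = c.customer_id "
        "LEFT JOIN status_codes s ON s.id = c.status "
        "WHERE c.id = ?", (contract_id,)).fetchone()
    if row is None:
        raise KeyError(contract_id)
    cid, cust_id, cust_name, status_id, status_label = row
    root = ET.Element("Contract", id=str(cid))
    ET.SubElement(root, "Customer", id=str(cust_id)).text = cust_name
    # Keep the magic number AND its business meaning side by side.
    ET.SubElement(root, "Status", id=str(status_id)).text = status_label or "unknown"
    items = ET.SubElement(root, "LineItems")
    for position, product, quantity in db.execute(
            "SELECT position, product, quantity FROM line_item "
            "WHERE contract_id = ? ORDER BY position", (contract_id,)):
        item = ET.SubElement(items, "Item", position=str(position), quantity=str(quantity))
        item.text = product
    return ET.tostring(root, encoding="unicode")

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
contract_xml = denormalize_contract(db, 1001)
```

The customer name is copied into every contract document it belongs to. That is exactly the duplication described above, and it is harmless as long as the archive is never updated.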

Transform

While your friendly neighborhood geek would be perfectly fine staring at XML, normal mortal users will prefer something that's not only human readable, but easy to comprehend.

The reflexive approach for an IT department tasked with this would be the creation of another future legacy system to visualize the data.

A better solution I'd like to propose is an XSLT (for XML) or Mustache (for JSON) transformation into HTML. Adventurous souls could aim towards XSL-FO and a PDF/A result.

Now you have at least two files: the denormalized XML and the HTML (or PDF). Possibly you also have binary data like image files or office attachments that belong to that business record.
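To give an idea of the output, here is a rough sketch. Python's standard library ships no XSLT processor, so this swaps in a plain ElementTree walk instead of a real stylesheet; the function name render_html and the sample record are assumptions for illustration only:

```python
import html
import xml.etree.ElementTree as ET

def render_html(xml_text, title):
    """Render every element that carries text as one table row."""
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.iter():
        text = (element.text or "").strip()
        if not text:
            continue  # container elements carry no value of their own
        attrs = ", ".join(f"{k}={v}" for k, v in element.attrib.items())
        label = f"{element.tag} ({attrs})" if attrs else element.tag
        rows.append(f"<tr><th>{html.escape(label)}</th><td>{html.escape(text)}</td></tr>")
    return (f"<html><head><title>{html.escape(title)}</title></head><body>"
            f"<h1>{html.escape(title)}</h1><table>{''.join(rows)}</table></body></html>")

sample = '<Contract id="1001"><Status id="42">Universe questions answered</Status></Contract>'
page = render_html(sample, "Contract 1001")
```

A real XSLT stylesheet would give far more control over layout. The point here is only that the HTML is derived mechanically from the XML and never maintained by hand.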

Enter MIME. MIME is a container format, typically found in email systems. Nothing stops us from reusing its capabilities. A MIME file consists of a MIME header and one or more MIME parts, each of which declares its content by a type.

A MIME part can itself contain a MIME header and one or more MIME parts, ad nauseam or until stack overflow.

Using MIME we can compose a single file. The header would carry meta information: what legacy system it came from, when it was exported, where the documentation could be found etc. Thereafter would be MIME parts with HTML, XML, PDF and binary files.
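Composing such a file could look roughly like this, using the email package from Python's standard library. The X-Legacy-* header names, the archive@legacy.invalid address and the sample values are my assumptions, not an established convention:

```python
from datetime import datetime, timezone
from email.headerregistry import Address
from email.message import EmailMessage

def build_archive_message(system, record_id, docs_hint, html_body, xml_body, binaries=()):
    """Bundle one business record into a single MIME message."""
    msg = EmailMessage()
    # Standard mail headers make the file render nicely in a mail viewer.
    msg["From"] = Address(display_name=system, username="archive", domain="legacy.invalid")
    msg["Subject"] = f"{system} record {record_id}"
    msg["Date"] = datetime.now(timezone.utc)  # export timestamp
    # Hypothetical X- headers carrying the archive meta information.
    msg["X-Legacy-System"] = system
    msg["X-Legacy-Record-Id"] = str(record_id)
    msg["X-Legacy-Documentation"] = docs_hint
    # Plain-text fallback plus the HTML rendering as alternatives.
    msg.set_content(f"Archived record {record_id} from {system}.")
    msg.add_alternative(html_body, subtype="html")
    # The denormalized XML and any binaries travel as attachments.
    msg.add_attachment(xml_body.encode("utf-8"), maintype="application",
                       subtype="xml", filename=f"{record_id}.xml")
    for filename, maintype, subtype, data in binaries:
        msg.add_attachment(data, maintype=maintype, subtype=subtype, filename=filename)
    return msg

message = build_archive_message(
    system="ContractDB",
    record_id=1001,
    docs_hint="see the operations handbook, chapter on ContractDB",
    html_body="<html><body><h1>Contract 1001</h1></body></html>",
    xml_body='<Contract id="1001"/>',
    binaries=[("scan.png", "image", "png", b"\x89PNG placeholder bytes")],
)
raw_bytes = message.as_bytes()  # write these to a file, e.g. contract-1001.eml
```

The result is a multipart/mixed container whose first part is a multipart/alternative (plain text and HTML), followed by the XML and the binary attachment. That is the nesting of MIME parts described above.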

Make accessible

Just double-clicking on that file would open it in the standard mail viewer. With cleverly set header fields (e.g. From as name of the legacy system) it will render the HTML part into something directly readable and understandable. Thus the file is useful in itself. If automated further processing is required, the XML or binary parts can be harvested.

Now take that file and store it in an append-only object store (S3, anyone?). The final piece is: how to make the information findable?
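Harvesting the parts for automated processing might look like the sketch below, again with the standard email package. The header name X-Legacy-System and the sample message are assumptions carried along for illustration:

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

def harvest(raw):
    """Pull the meta information, HTML view, and attachments out of a MIME file."""
    msg = message_from_bytes(raw, policy=policy.default)
    html_part = msg.get_body(preferencelist=("html",))
    return {
        "system": msg.get("X-Legacy-System"),
        "html": html_part.get_content() if html_part is not None else None,
        # Maps each attachment's file name to its decoded content.
        "attachments": {part.get_filename(): part.get_content()
                        for part in msg.iter_attachments()},
    }

# Build a tiny sample in memory so the sketch is self-contained.
sample = EmailMessage()
sample["Subject"] = "ContractDB record 1001"
sample["X-Legacy-System"] = "ContractDB"
sample.set_content("Archived record 1001.")
sample.add_alternative("<h1>Contract 1001</h1>", subtype="html")
sample.add_attachment(b'<Contract id="1001"/>', maintype="application",
                      subtype="xml", filename="1001.xml")

harvested = harvest(sample.as_bytes())
```

In a real setup the bytes would come from the archived file in the object store rather than from a message built in memory.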

Users might search for a part, a customer, a time frame, or a combination of all sorts of criteria. These kinds of requirements can be covered by a full-text index and a search engine. Lucene is content-format aware, so queries can be very specific or very simple. The beauty of the approach: unlike records in a database, the MIME documents don't need to have a shared structure to be searchable by Lucene. Still, the indexing process can do some level of standardization, like merging Id, ID, id, recordid, account_id to be searchable as ID=.

Of course, the implementation needs to be planned carefully. Nevertheless, this can be used as a blueprint for the one archive system to rule them all.
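The standardization step can be sketched independently of the search engine. The following is only the field-name merging that would run before documents are handed to the indexer; the alias table is taken from the examples above, and the function name normalize_fields is hypothetical:

```python
# Assumed alias table, built from the examples in the text.
ID_ALIASES = {"id", "recordid", "account_id"}

def normalize_fields(fields):
    """Map (name, value) pairs to canonical field names, collecting all values."""
    normalized = {}
    for name, value in fields:
        lowered = name.strip().lower()
        # Every known spelling of an identifier ends up in the single ID field.
        key = "ID" if lowered in ID_ALIASES else lowered
        normalized.setdefault(key, []).append(value)
    return normalized

index_doc = normalize_fields([
    ("Id", "1001"),
    ("account_id", "7"),
    ("Status", "42"),
])
```

Here index_doc holds both identifiers under ID, so a query for ID=7 or ID=1001 would find the record regardless of which legacy system it came from or how that system spelled the column.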

As usual YMMV

Posted by Stephan H Wissel on 22 June 2018 | categories: Software Technology