
How We Store and Exchange Textual Information


Contents

  1. Abstract
  2. How We Store and Exchange Textual Information
    1. Dealing with Proprietary Hardware
    2. Dealing with Proprietary Software - Personal Scale
    3. Dealing with Proprietary Software - Industrial Scale
    4. Markup Languages
  3. The History of Markup Languages
    1. What is a Markup Language?
    2. 1960s GML
    3. 1980s Standard Generalized Markup Language (SGML)
    4. 1989 - Hypertext Markup Language (HTML)
    5. 1990s Extensible Markup Language (XML)
    6. Advantages over HTML
    7. Drawback for Interoperability
  4. NEXT

Abstract

Interoperability of digital information became an issue almost as soon as computers made their way into engineering offices. Organizations from around the world, including Owner/Operators, Constructors, Consulting Engineers, and Software Developers, have been working on this topic for many years. Many standards organizations worldwide are involved, some having been created just for this purpose.


How We Store and Exchange Textual Information

Human society has always had to find ways to manage, store, and retrieve information. The Library of Alexandria, which burned down in 48 BC (according to one story), is an example of both the best technology of its day for managing information in hard-copy form and a major limitation of doing so.

With the advent of computer-managed storage in the mid twentieth century, information managers have had to grapple with two problems:

  • Survival of information beyond the lifetime of proprietary hardware and software.
  • Moving a large amount of information between proprietary systems.

Dealing with Proprietary Hardware

A typical example of these questions is a help desk inquiry from the mid-1980s:

I have data I want to keep for decades. Should I invest in a good card reader, or should I transfer my data to these far more efficient but newfangled "floppy disks"?

Unfortunately, the best answer to this kind of question has always been rather labor intensive. That is, the only reliable way to keep digital information for decades is to upgrade your storage media every few years to whatever is the latest and greatest at the time. For personal use, in the 1980s it would have been 5 1/4" floppy disks. By the 1990s you would have had to copy your archive to 3 1/2" floppies. Then, sometime around 2000, the best storage medium became CDs, and a bit later, DVDs. At first everyone thought they would last for decades, but sometimes they didn't even last two years.

Now, nearing the end of the first decade of the twenty-first century, flash drives look like they will be readable for quite a while. But ask yourself: what is the likelihood of personal computers having USB ports in twenty years? Maybe they will, but whether in twenty years or forty, at some point you will still have to load up your thumb drives and copy them to some new medium; perhaps a three-dimensional, holographic memory block.

Dealing with Proprietary Software - Personal Scale

Unfortunately, even if you go through the exercise of transferring your archive every few years, how are you going to open the files twenty-five years from now? In the lifetime of your humble author (who is so old he can remember when an entire family had to make do with a single telephone), the word processor of choice has gone from WordStar, to WordPerfect, to Microsoft Word. (This would be a good place for a Mac vs PC joke if I could think of one!)

Working with Word 2002, current as this is being written, we can see that Word users can open the following word processor file formats:

  • Word 2.0
  • Word 5.1 for Mac
  • Word 6.0 (95)
  • WordPerfect 5.0
  • Works 2000

Where is my beloved WordStar? In addition to copies of all my data files, do I have to keep copies of all my old authoring software? And even if I do, what will I run it on? Do I also have to keep a working model of each vintage of personal computer? What if it breaks down?

So now, if I actually want to be able to retrieve my personal archives for decades (perhaps I am thinking that after I become a famous author, a publisher will give me a million dollar advance to write my memoirs), I will have to open each of my archived files every couple of years and somehow transfer the contents to whatever the new authoring software is.

This will remove the problem of having to keep old hardware and software around, but will introduce a new set of problems:

First, this solution will create an upper limit on how much information I can keep. Since it will take a certain amount of time to upgrade my archive each cycle, I will have less and less time each round to create new information. Eventually I will barely finish one upgrade before having to start over with the next technology.

Second, who's to say there will always be a clear and easy upgrade path from one authoring software to the next? For example, what if I have a large number of files authored with obscure CAD software? What if none of the current dominant players writes the appropriate conversions into its offerings?

Well, there is another option:

[Image: History_LongtermStorage.JPG (attachment missing)]

Fig 2 - Long Term Information Storage Using the Internet

(This is taken from a Slashdot discussion on the topic of long-term data storage.)

Dealing with Proprietary Software - Industrial Scale

If the problem of moving information between proprietary systems is daunting on a personal level, try to imagine what it is like for organizations that create large bodies of documentation. For instance, every model of aircraft you see today requires several million pages of documentation, which have to be revised and published every quarter (XML Handbook). The combined documentation libraries of the aircraft industry probably rival the size of the entire World Wide Web. Yet every few years the dominant hardware changes, and along with it, the software used.

Governments and law firms are in a similar situation.

Markup Languages

It is precisely these issues, the survival of information beyond the lifetime of proprietary hardware and moving large amounts of information between proprietary systems, that prompted Charles Goldfarb, Ed Mosher, and Ray Lorie at IBM to create the "Generalized Markup Language" (GML) in the late 1960s.

  • GML
  • SGML
  • HTML
  • XML

Except for GML (which became SGML), all of these markup languages are in wide use today. SGML is used for managing large bodies of textual information. HTML is the language of the World Wide Web, linking documents for human retrieval. XML is increasingly being used to manage large bodies of knowledge, including plant information with ISO 15926.

Most people will not need to know how markup languages are used to manage plant information, but a brief history of markup languages will be interesting for background information.

The History of Markup Languages

Markup languages have a long history in enabling computers to handle large bodies of text properly, without human intervention. When encoded with a markup language, the content of a body of text is separated from the format, or appearance of the text. This is an important concept in ISO 15926 where the goal is to embed enough context into the content that we do not need to see the format, or appearance, of the information to know what it means.

Key features of a markup language:

  • A standard format in which to store information that lasts many times longer than proprietary commercial software.
  • A means to transfer information between proprietary computer systems.

What is a Markup Language?

In the context of understanding ISO 15926, a "markup language" is a set of conventions for marking up text that are used together with the text to tell a computer the meaning of the text.

At a very simple level, punctuation, capitalization, and even the spaces between words can be considered markup. These features tell human readers when there is a break between ideas, when to pause, and where individual words start and stop. (If the reader thinks these are obvious necessities for understanding written text, there are numerous examples in the history of human societies where written material was in the style of scriptio continua.)

Another example is spoken words, where the volume and tone of voice can be considered markup. For instance, a given string of words may have a completely different meaning if it is yelled, spoken softly, or delivered in a condescending tone. Thus, the value of the message (that is, the actual words spoken) must be considered together with the tags (that is, the volume and tone of voice) to obtain the correct meaning.

We have seen this concept previously in this primer. In the section about context, we saw that the numerical value 1034 on its own had no meaning, but in the context of a particular spot on a particular data sheet, it meant the pressure of the seal flush of a centrifugal pump. Thus, the location of a value on a data sheet can be considered a sort of markup.

If the meaning of a piece of text is embedded in the text by means of a markup language, one can use the same body of text for different purposes without modifying the text manually. For example, consider a scientific journal that publishes papers both in a printed magazine format and on its website. In the magazine, footnotes might be grouped at the end of the article, while footnotes on the website might pop up in their own little window. The publisher could manually edit the text for each purpose, but this would be doing it the hard way. The easy way is to encode the text with a markup language that marks the beginning and end of each footnote and shows the correct anchor point in the manuscript. When the publishing software prepares the text for print, it will group the footnotes at the end of the article; when it prepares the text for the website, it will include the necessary HTML tags to create a popup window.
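For instance, the marked-up manuscript might look something like this (a hypothetical sketch; the tag names are invented for illustration and are not taken from any real publishing system):

<PARA>The seal flush pressure must be verified before
startup.<FOOTNOTE>See the vendor data sheet for the rated
pressure.</FOOTNOTE> Otherwise the seal may fail.</PARA>

The same <FOOTNOTE> element can be rendered as an endnote in the printed magazine or as a popup window on the website; the manuscript itself never changes.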

1960s GML

Generalized Markup Language (GML) was developed in 1969 by the team of Charles Goldfarb, Ed Mosher, and Ray Lorie. (Look at the initials formed by their last names--it's not a coincidence. In fact, Goldfarb invented the term "markup language" just to be able to use them!) Goldfarb, a lawyer at the time, had joined IBM to get some high-tech experience. He was assigned to a project to figure out how to merge case law research results into one document, compose it, and print it. At the time there were no systems that would do all three things, so the text to be printed had to be transferred from one proprietary system to another, all without losing its fidelity, or meaning.

GML was a set of macros that described the logical structure of the document, for instance, to declare some text to be a heading and other text to be a body paragraph.

Note that the issue of being able to transfer information between proprietary systems is the same issue that drives ISO 15926.


1980s Standard Generalized Markup Language (SGML)

SGML is a descendant of GML. SGML was originally intended for publishing databases and text. One of its first applications was publishing an early edition of the Oxford English Dictionary.

SGML is known as a metalanguage since it can be used to describe other markup languages.

In the field of publishing, historically, markup has meant the marks that an editor makes when reviewing a manuscript: for instance, marks to indicate that one phrase is to be rendered in bold face and another in italics. In the age of machine-readable text, the term has come to mean special formatting codes inserted in-line with the text to give direction to the computer that does the publishing.

As a metalanguage, SGML provides the means to declare which tags a document may use, which are required, and how to tell markup apart from text. With these declarations, you can use SGML to define new markup languages.
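For instance, a tiny markup language for our pump data sheets might be declared like this (a hypothetical sketch in SGML's document type definition syntax; the element names are invented for this primer):

<!ELEMENT DATASHEET - - (TITLE, CLIENT, TAGNO, SERVICE) >
<!ELEMENT TITLE     - - (#PCDATA) >
<!ELEMENT CLIENT    - - (#PCDATA) >
<!ELEMENT TAGNO     - - (#PCDATA) >
<!ELEMENT SERVICE   - - (#PCDATA) >

The first line says that a DATASHEET must contain a TITLE, a CLIENT, a TAGNO, and a SERVICE, in that order. The "- -" means both start and end tags are required, and #PCDATA simply means ordinary text. A program reading these declarations knows exactly which markup to expect and how to tell it apart from the text.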

The first working draft of SGML by the American National Standards Institute (ANSI) was published in 1980. By 1983 it was ready for prime time and was adopted by the US Internal Revenue Service and the US Department of Defense. The next year the International Organization for Standardization (ISO) got involved, and in 1986 SGML was issued as an international standard (ISO 8879:1986).

One feature of SGML that distinguished it from other markup languages of the time was its emphasis on descriptive markup rather than procedural markup. This means that the tags describe the text rather than telling a machine what to do with it. For instance, it was common to use a proprietary markup language that told proprietary publishing equipment to, say, print this in 10pt Times Roman and that in 12pt sans serif. But if the publisher wanted to process the text on different equipment, the tags would all have to be stripped out and new tags entered. SGML, however, simply said "this is body text" or "this is a footnote".
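The difference might look something like this (a hypothetical sketch; the procedural codes are invented in the style of early typesetting systems):

Procedural markup, tied to one typesetter:

.FONT TIMES 10
Seal flush pressure: 1034 kPa
.ENDFONT

Descriptive markup, portable between systems:

<BODY>Seal flush pressure: 1034 kPa</BODY>

The descriptive version leaves the choice of typeface and size to whatever system eventually renders the text, so the same file can move between publishing systems untouched.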

Of interest to the history of ISO 15926 are some of the reasons for using SGML:

  • In government and law, large bodies of text must be readable for decades. Therefore they must not be stored in any proprietary format that may go out of fashion in a few years. This is also one of the reasons to use ISO 15926; the life of a typical plant also spans several decades, during which time computer operating systems and text handling software go through many generations. The people dismantling the plant forty years later may not even remember the name of the software used by the engineers who designed it.
  • SGML was also used as a means to transfer texts from one system to another in a manner that preserved the intent of the formatting. Similarly, ISO 15926 can be used to transfer information about plant objects from one system to another in a manner that preserves the meaning of the attributes of the plant object.

Example: Continuing the pump data sheet example, here is what the information might look like encoded in SGML:

<TITLE>CENTRIFUGAL PUMP DATA SHEET</TITLE>
<BODY><B>Client: ABC Chemical Company</B></BODY>
<BODY>Tag No: P101</BODY>
<BODY>Service: Chemical Injection to D-101</BODY>
<BODY>...</BODY>
<BODY><B>Seal Flush</B></BODY>
<BODY>Pressure: 1034 kPa</BODY>

This shows the information on the data sheet as plain text. The title will likely be rendered in a larger font, and the two headings will be in bold face. The rest of the text is understandable by humans, but a computer could not read it to extract, for instance, the tag number of the pump, its attributes, or its relationship to D-101.


1989 - Hypertext Markup Language (HTML)

HTML is a descendant of SGML. HTML was invented by Tim Berners-Lee as a way to embed references to one document within another. Berners-Lee envisioned being able to open such a referenced document directly, without having to exit the first document.

Berners-Lee based HTML on SGML since SGML could already be implemented on any machine. As with SGML, the idea was to mark up text in a way that separated the message from the manner in which the message was displayed. For instance, <EM> some text </EM> meant that the enclosed text was to be somehow emphasized. A web browser intended to be read with the eyes might render the text slightly larger and in bold face, or perhaps underlined; a browser intended to be listened to might render it in a slightly louder tone.

HTML attracted mostly academic interest for the first year or two. But as the Internet became more widely known, organizations started to realize how HTML could open the Internet to average people. From the early 1990s, HTML became a battleground for various competing interests who added their own tags. One of the biggest issues was getting fine control over the appearance of text and images. The example above, <EM> some text </EM>, says that "some text" should be somehow emphasized, but in what way? Print publishers were used to tweaking text by adjusting point sizes, leading, and kerning, and were not happy trusting a browser's default handling of "emphasized" text.

The result today is that HTML has a great many tags for fine-tuning the appearance of text, but no tags to convey the meaning of the text--you still need a person to read the material.

Example: HTML defined a number of tags of its own, for instance:

  • P, for paragraph
  • H1 thru H6, for heading levels
  • OL, ordered lists
  • UL, unordered lists
  • LI, list items
  • A, to anchor references; its HREF attribute holds the reference to another document or object

Here is how we might use them to encode our pump data sheet in HTML:

<TITLE>CENTRIFUGAL PUMP DATA SHEET</TITLE>
<H1>Client: ABC Chemical Company</H1>
<P>Tag No: P101</P>
<P>Service: Chemical Injection to D-101</P>
<P>...</P>
<H2>Seal Flush</H2>
<UL>
<LI>Pressure: 1034 kPa</LI>
<LI>...</LI>
</UL>

Here we have used some of the new tags to lay out the data sheet a little more clearly. The title is the same, but now we can group the pump's attributes under headings. However, we are still formatting the text for human viewers. We have more tags to handle the appearance of the information, but nothing to tell a computer what the various bits of text mean.


1990s Extensible Markup Language (XML)

XML, also a descendant of SGML, is likewise a metalanguage, in that it can be used to define other markup languages. XML was intended to get back to the SGML roots without the SGML complexity. When its first draft was released in late 1996, its developers were not shy about proclaiming it the holy grail of computing, solving the problem of universal data interchange between dissimilar systems.

Since its introduction it has accomplished at least some of what was intended of it. For instance, most of our office documents are now stored in XML format. While some argue that the particular dialect of OpenOffice XML isn't the best formed in the world, it is still an order of magnitude better than the myriad of proprietary formats that preceded it. Now it is much easier for third parties to reverse-engineer documents in order to open them in different authoring software.

Of interest to the history of ISO 15926 are some of the implications of widespread use of XML in web publishing. Looking into our crystal ball, we can see applications written by webmasters that will allow untrained users to write in something that looks like Microsoft Word, then upload their fine prose (or poetry, or...) straight into the local content management system. And as XML-written documents displace documents written with proprietary software (and uploaded as inscrutable binary files), more and more data will be open, available to be searched and indexed, and therefore available for all.

The "X" in XML means "Extensible". We can use this feature to mark up information about plant objects in a way that will let a computer read it.

Example

<DATASHEET>
<TITLE>CENTRIFUGAL PUMP DATA SHEET</TITLE>
<CLIENT>ABC Chemical Company</CLIENT>
<TAG_NO>P101</TAG_NO>
<SERVICE>Chemical Injection</SERVICE><ASSOCIATED>D-101</ASSOCIATED>
<something>...</something>
<SEAL_FLUSH_PRESSURE>1034</SEAL_FLUSH_PRESSURE>
<SEAL_FLUSH_PRESSURE_UNITS>kPa</SEAL_FLUSH_PRESSURE_UNITS>
<something_else>...</something_else>
...
</DATASHEET>

Advantages over HTML

The example above shows how we can extend XML to include any kind of tags we wish. Right away you can see how we could then use a computer program to search the information and pull back the name of the pump, its associated equipment (D-101), and the fact that the seal flush pressure is 1034 kPa. Since XML is extensible, any organization can create its own tags for whatever it needs.
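As a minimal sketch of what such a program might look like (this one uses Python's standard xml.etree.ElementTree module; the tag names follow the invented example above, and the "..." placeholder elements are left out so the fragment is well formed):

import xml.etree.ElementTree as ET

# A well-formed version of the data sheet fragment above.
DATASHEET = """
<DATASHEET>
  <TITLE>CENTRIFUGAL PUMP DATA SHEET</TITLE>
  <CLIENT>ABC Chemical Company</CLIENT>
  <TAG_NO>P101</TAG_NO>
  <SERVICE>Chemical Injection</SERVICE><ASSOCIATED>D-101</ASSOCIATED>
  <SEAL_FLUSH_PRESSURE>1034</SEAL_FLUSH_PRESSURE>
  <SEAL_FLUSH_PRESSURE_UNITS>kPa</SEAL_FLUSH_PRESSURE_UNITS>
</DATASHEET>
"""

root = ET.fromstring(DATASHEET)

# Because each fact has its own tag, a program can pull fields out
# directly instead of guessing at text meant for human eyes.
tag_no = root.findtext("TAG_NO")
associated = root.findtext("ASSOCIATED")
pressure = root.findtext("SEAL_FLUSH_PRESSURE")
units = root.findtext("SEAL_FLUSH_PRESSURE_UNITS")

print(f"Pump {tag_no} is associated with {associated}; "
      f"seal flush pressure is {pressure} {units}")
# Prints: Pump P101 is associated with D-101; seal flush pressure is 1034 kPa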

Drawback for Interoperability

The chief drawback is agreement on the definition of terms. In order to get interoperability between systems, the owners of the systems have to agree on what the tags mean. As we have seen in previous sections, getting this agreement is not a trivial matter.
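For example (both sets of tags are invented for illustration), one organization might encode the pump number as:

<TAG_NO>P101</TAG_NO>

while another encodes the very same fact as:

<EquipmentNumber>P101</EquipmentNumber>

Both are perfectly good XML, but a program written for one will not understand the other until both sides agree on terms. Providing that shared, agreed vocabulary for plant objects is exactly the role of ISO 15926.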

Descendants

  • SOAP
  • XML-RPC



NEXT



