[[TracNav(TracNav/ISO15926Primer)]]

= How We Store and Exchange Textual Information =

----
[[PageOutline(2-4,Contents,inline)]]

== Abstract ==

Interoperability of digital information became an issue almost as soon as computers made their way into engineering offices. Many organizations from around the world have been working on this topic for many years: Owner/Operators, Constructors, Consulting Engineers, and Software Developers. Many standards organizations worldwide are involved, some having been created just for this purpose.

----

== How We Store and Exchange Textual Information ==

Human society has always had to find ways to manage, store, and retrieve information. The Library of Alexandria, which burned down in 48 BC [http://en.wikipedia.org/wiki/Library_of_Alexandria (according to one story)], is an example of both the best technology for managing information in hard-copy form, and a major limitation of doing so.

With the advent of computer-managed storage in the mid twentieth century, information managers have had to grapple with two problems:
 * Survival of information beyond the lifetime of proprietary hardware and software.
 * Moving a large amount of information between proprietary systems.

=== Dealing with Proprietary Hardware ===

A typical example of these types of questions is a help desk inquiry from the mid-1980s:

''I have data I want to keep for decades. Should I invest in a good card reader, or should I transfer my data to these far more efficient but newfangled "floppy disks"?''

Unfortunately, the best answer to this kind of question has always been rather labor intensive. That is, the only reliable way to keep digital information for decades is to upgrade your storage media every few years to whatever is the latest and greatest at the time. For personal use, in the 1980s it would have been 5 1/4" floppy disks. By the 1990s you would have had to copy your archive to 3 1/2" floppies.
Then, sometime around 2000, the best storage medium became CDs, and a bit later, DVDs. At first everyone thought they would last for decades, but sometimes they didn't even last two years:
 * [http://www.computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=storage&articleId=9123244&taxonomyId=19&intsrc=kc_top Restored DVD key to conviction in criminal case]

Now, nearing the end of the first decade of the twenty-first century, flash drives are looking like they will be readable for quite a while. But ask yourself: what is the likelihood of personal computers having USB ports in twenty years? Maybe they will, but whether in twenty years or forty, at some point you will still have to load up your thumb drives and copy them to some new medium; perhaps a three-dimensional, holographic memory block.

=== Dealing with Proprietary Software - Personal Scale ===

Unfortunately, even if you go through the exercise of transferring your archive every few years, how are you going to open the files twenty-five years from now? In the lifetime of your humble author (who is so old he can remember when an entire family had to make do with a single telephone), the word processor of choice has gone from !WordStar, to Word Perfect, to Microsoft Word. (This would be a good place for a Mac vs PC joke if I could think of one!)

Working with Word 2002, now, as this is being written, we can see that Word users can open the following word processor file formats:
 * Word 2.0
 * Word 5.1 for Mac
 * Word 6.0 (95)
 * Word Perfect 5.0
 * Works 2000

Where is my beloved !WordStar? In addition to copies of all my data files, do I have to keep copies of all my old authoring software? And even if I do, what will I run it on? Do I also have to keep a working model of each vintage of personal computer? What if it breaks down?
So now, if I actually want to be able to retrieve my personal archives for decades (perhaps I am thinking that after I become a famous author, a publisher will give me a million dollar advance to write my memoirs), I will have to open each of my archived files every couple of years and somehow transfer the contents to whatever the new authoring software is. This will remove the problem of having to keep old hardware and software around, but will introduce a new set of problems:

First, this solution will create an upper limit on how much information I can keep around. Since it will take a certain amount of time to upgrade my archive each cycle, I will have less and less time each round to create new information. Eventually I will just finish one upgrade when I will have to start over with new technology.

Second, who's to say there will always be a clear and easy upgrade path from one authoring software to the next? For example, what if I have a large number of files authored with obscure CAD software? What if none of the current set of dominant players wrote the appropriate conversions into their offerings?

Well, there is another option:

[[Image(History_LongtermStorage.JPG, 500px)]]

'''Fig 2 - Long Term Information Storage Using the Internet'''

''(This is taken from a Slashdot discussion on the topic of long-term data storage. [http://ask.slashdot.org/article.pl?sid=08/12/13/1434216 Here is the complete article.])''

=== Dealing with Proprietary Software - Industrial Scale ===

If the problem of moving information between proprietary systems is daunting on a personal level, try to imagine what it is like for organizations that create large bodies of documentation. For instance, every model of aircraft you see today requires several million pages of documentation which have to be revised and published every quarter.
[http://www.amazon.com/Charles-Goldfarbs-XML-Handbook-4th/dp/product-description/0130651982 (XML Handbook)]

The combined documentation libraries of the aircraft industry probably rival the size of the entire world wide web. Yet every few years the dominant hardware changes, and along with it, the software used. Governments and law firms are in a similar situation.

=== Markup Languages ===

It is precisely these issues, the survival of information beyond the lifetime of proprietary hardware, and moving a large amount of information between proprietary systems, that prompted Charles Goldfarb, with Ed Mosher and Ray Lorie at IBM, to create "Generalized Markup Language" (GML) in the late 1960s.
 * GML
 * SGML
 * HTML
 * XML

Except for GML (which became SGML), all of these markup languages are in wide use today. SGML is used for managing large bodies of textual information. HTML is the language of the World Wide Web, linking documents for human retrieval. XML is increasingly being used to manage large bodies of ''knowledge'', including plant information with ISO 15926.

Most people will not need to know how markup languages are used to manage plant information, but a brief history of markup languages will be interesting for background information.

== The History of Markup Languages ==

Markup languages have a long history in enabling computers to handle large bodies of text properly, without human intervention. When encoded with a markup language, the ''content'' of a body of text is separated from the ''format'', or appearance, of the text. This is an important concept in ISO 15926, where the goal is to embed enough ''context'' into the ''content'' that we do not need to see the format, or appearance, of the information to know what it means.

Key features of a markup language:
 * A standard format in which to store information that lasts many times longer than proprietary commercial software.
 * A means to transfer information between proprietary computer systems.

----

=== What is a Markup Language? ===

In the context of understanding ISO 15926, a "markup language" is a set of conventions for marking up text that are used together with the text to tell a computer the meaning of the text.

At a very simple level, punctuation, capitalization, and even the spaces between words themselves can be considered ''markup''. These features tell human readers when there is a break between ideas, when to pause, and where individual words start and stop. (If the reader thinks these are obvious necessities for understanding written text, there are numerous examples in the history of human societies where written material was in the style of [http://en.wikipedia.org/wiki/Scriptio_continua scriptio continua].)

Another example is spoken words, where the volume and tone of voice can be considered ''markup''. For instance, a given string of words may have a completely different meaning if they are yelled, spoken in a soft voice, or delivered with a condescending tone. Thus, the ''value'' of the message (that is, the actual words spoken) must be considered together with the ''tags'' (that is, the volume and tone of voice) to obtain the correct meaning.

We have seen this concept previously in this primer. In the section about ''context'', we saw that the numerical value ''1034'' on its own had no meaning, but in the context of a particular spot on a particular data sheet, it meant the pressure of the seal flush of a centrifugal pump. Thus, the location of a value on a data sheet can be considered a sort of ''markup''.

If the meaning of a piece of text is embedded in the text by means of a markup language, one can use the same body of text for different purposes without modifying the text manually. For example, consider a scientific journal that publishes papers both in a printed magazine format and on its website.
In the magazine, footnotes might be grouped at the end of the article, while footnotes on the website might pop up in their own little window. The publisher could manually edit the text for each purpose, but this would be doing it the hard way. The easy way is to encode the text with a markup language that ''marks'' the beginning and end of each footnote, and shows the correct anchor point in the manuscript. When the publishing software prepares the text for print, it will group the footnotes at the end of the article, but when it prepares the text for the website, it will include the necessary HTML tags to create a popup window.

== 1960s GML ==

Generalized Markup Language (GML) was developed in 1969 by the team of Charles Goldfarb, Ed Mosher, and Ray Lorie. (Look at the initials formed by their last names--it's not a coincidence. In fact Goldfarb invented the term "Markup Language" just to be able to use them!)

Goldfarb, a lawyer at the time, had joined IBM to get some high tech experience. He was assigned to a project to figure out how to merge case law research results together into one document, compose it, and print it. At the time there were no systems that would do all three things, so the text to be printed had to be transferred from one proprietary system to another, all without losing its fidelity, or meaning. GML was a set of macros that described the logical structure of the document, for instance, to declare some text to be a heading and other text to be a body paragraph.

Note that the issue of being able to transfer information between proprietary systems is the same issue that drives ISO 15926.

'''References'''
 * [http://www.sgmlsource.com/press/index.htm Charles Goldfarb's Press Online Kit]

=== 1980s Standard Generalized Markup Language (SGML) ===

SGML is a descendant of GML. SGML was originally intended for publishing databases and text. One of its first applications was publishing an early edition of the Oxford English Dictionary.
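GML's idea of marking up the logical structure of a document, which SGML carried forward, can be sketched like this (the element names here are made up for illustration; a real application would define its own):

{{{
<report>
  <title>Case Law Research Results</title>
  <section>
    <heading>Findings</heading>
    <para>Body text goes here. The tags say what each piece
    of text is, not how it should look on the page.</para>
  </section>
</report>
}}}

Because the tags describe structure rather than appearance, any system that understands the tag set can render, index, or transfer the document without stripping and re-entering formatting codes.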
SGML is known as a ''metalanguage'' since it can be used to describe other markup languages.

In the field of publishing, historically, ''markup'' has meant the marks that an editor makes when reviewing a manuscript, for instance, marks to indicate that one phrase is to be rendered in bold face and another in italics. In an age of machine-readable text, the term has come to mean special formatting codes inserted in-line with the text to give direction to the computer that does the publishing.

''Metalanguage'' means that SGML can be used to create other markup languages. SGML has the means to describe which markups are required and how to tell markups apart from text. Thus, you can use SGML to create other markup languages.

The first working draft of SGML by the American National Standards Institute (ANSI) was published in 1980. By 1983 it was ready for prime time and was adopted by the US Internal Revenue Service and the US Department of Defense. The next year the International Organization for Standardization (ISO) got involved, and in 1986 SGML was issued as an international standard (ISO 8879:1986).

One feature of SGML that distinguished it from other markup languages at the time is its emphasis on ''descriptive'' markup rather than ''procedural'' markup. This means that the tags ''describe'' the text rather than tell the computer ''what to do with it''. For instance, it was common to use a proprietary markup language which told proprietary publishing equipment to, say, print this in 10pt Times Roman, and that in 12pt sans serif. But if the publisher wanted to process the text on different equipment, the tags would all have to be stripped out and new tags entered. SGML, however, simply said "this is body text", or "this is a footnote".

Of interest to the history of ISO 15926 are some of the reasons for using SGML:
 * In government and law, large bodies of text must be readable for decades.
   Therefore the text must not be stored in any proprietary format that may go out of fashion in a few years. This is also one of the reasons to use ISO 15926; the life of a typical plant also spans several decades, during which time computer operating systems and text handling software go through many generations. The people dismantling the plant forty years later may not even remember the name of the software that was used by the engineers designing the plant.
 * SGML was also used as a means to transfer texts from one system to another in a manner that preserved the intent of the formatting. Similarly, ISO 15926 can be used to transfer information about plant objects from one system to another in a manner that preserves the meaning of the attributes of the plant object.

'''Example'''

Continuing the example of a pump data sheet, here is what the information might look like encoded in SGML:

{{{
<TagNo>P101</TagNo>
<Service>Chemical Injection to D-101</Service>
...