Changes between Version 17 and Version 18 of ISO15926Primer_History_ExchangeTextInformation

Timestamp:: 11/16/11 05:27:13 (12 years ago)
Author:: gordonrachar (IP: 75.156.216.35)
Comment:: --

ISO15926Primer_History_ExchangeTextInformation

v17	v18
3	3	= How We Store and Exchange Textual Information =
4	4
5		----
6		[[PageOutline(2-4,Contents,inline)]]
	5	The '''ISO 15926 Primer''' has been replaced with '''An Introduction to ISO 15926''', a free download from Fiatech.
7	6
8		== Abstract ==
	7	This page is out of date and has been deprecated.
9	8
10		One of the first uses of computers was to manage large bodies of written information. But as we have all personally experienced, hardware and software change every few years. Every time an organization changes its technology, its entire document collection has to be moved to the new system. Because of the immense size of some of these collections, rekeying is impossible.
	9	If you reached this page from a link in another web page please inform the webmaster.
11	10
12		From this need we now have well-developed technology for moving text in a way that preserves any embedded context, or meaning. One example is XML, which is used by many systems as a transport language. It is a marriage of the lowest common denominator, ASCII text files which virtually every computer system worldwide can read, with the sophistication of being able to embed complex definitions and relationships.
	11	For a peek at the new book and instructions on how to download a copy please follow this link.
13	12
14		ISO 15926 uses XML to transport information.
15
16		----
17
18		== How We Store and Exchange Textual Information ==
19
20		Human society has always had to find ways to manage, store, and retrieve information. The Library of Alexandria, which burned down in 48 BC [http://en.wikipedia.org/wiki/Library_of_Alexandria (according to one story)], is an example of both the best technology for managing information in hard-copy form, and a major limitation of doing so.
21
22		With the advent of computer-managed storage in the mid-twentieth century, information managers have had to grapple with two problems:
23
24		* Survival of information beyond the lifetime of proprietary hardware and software.
25		* Moving a large amount of information between proprietary systems.
26
27		=== Dealing with Proprietary Hardware ===
28		A typical example of these types of questions is a help desk inquiry from the mid-1980s:
29
30		''I have data I want to keep for decades. Should I invest in a good card reader, or should I transfer my data to these far more efficient but newfangled "floppy disks"?''
31
32		Unfortunately, the best answer to this kind of question has always been rather labor-intensive. That is, the only reliable way to keep digital information for decades is to upgrade your storage media every few years to whatever is the latest and greatest at the time. For personal use, in the 1980s it would have been 5 1/4" floppy disks. By the 1990s you would have had to copy your archive to 3 1/2" floppies. Then, sometime around 2000, the best storage medium became CDs, and a bit later, DVDs. At first everyone thought they would last for decades, but sometimes they didn't even last two years:
33
34		* [http://www.computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=storage&articleId=9123244&taxonomyId=19&intsrc=kc_top Restored DVD key to conviction in criminal case]
35
36		Now, nearing the end of the first decade of the twenty-first century, flash drives look like they will be readable for quite a while. But ask yourself how likely it is that personal computers will still have USB ports in twenty years. Maybe they will, but whether in twenty years or forty, at some point you will still have to load up your thumb drives and copy them to some new medium, perhaps a three-dimensional, holographic memory block.
37
38		=== Dealing with Proprietary Software - Personal Scale ===
39
40		Unfortunately, even if you go through the exercise of transferring your archive every few years, how are you going to open the files twenty-five years from now? In the lifetime of your humble author (who is so old he can remember when an entire family had to make do with a single telephone), the word processor of choice has gone from !WordStar, to Word Perfect, to Microsoft Word. (This would be a good place for a Mac vs PC joke if I could think of one!)
41
42		Working with Word 2002, now, as this is being written, we can see that Word users can open the following word processor file formats:
43
44		* Word 2.0
45		* Word 5.1 for Mac
46		* Word 6.0 (95)
47		* Word Perfect 5.0
48		* Works 2000
49
50		Where is my beloved !WordStar? In addition to copies of all my data files, do I have to keep copies of all my old authoring software? And even if I do, what will I run it on? Do I also have to keep a working model of each vintage of personal computer? What if it breaks down?
51
52		So now, if I actually want to be able to retrieve my personal archives for decades (perhaps I am thinking that after I become a famous author, a publisher will give me a million-dollar advance to write my memoirs), I will have to open each of my archived files every couple of years and somehow transfer the contents to whatever the new authoring software is.
53
54		This will remove the problem of having to keep old hardware and software around, but will introduce a new set of problems:
55
56		First, this solution will create an upper limit on how much information I can keep around. Since it will take a certain amount of time to upgrade my archive each cycle, I will have less and less time each round to create new information. Eventually I will just finish one upgrade when I will have to start over with new technology.
57
58		Second, who's to say there will always be a clear and easy upgrade path from one authoring software to the next? For example, what if I have a large number of files authored with obscure CAD software? What if none of the current dominant players wrote the appropriate conversions into their offerings?
59
60		Well, there is another option:
61
62		[[Image(History_LongtermStorage.JPG, 500px)]]
63
64		'''Fig 2 - Long Term Information Storage Using the Internet'''
65
66		''(This is taken from a Slashdot discussion on the topic of long-term data storage. [http://ask.slashdot.org/article.pl?sid=08/12/13/1434216 Here is the complete article.])''
67
68
69		=== Dealing with Proprietary Software - Industrial Scale ===
70
71		If the problem of moving information between proprietary systems is daunting on a personal level, try to imagine what it is like for organizations that create large bodies of documentation. For instance, every model of aircraft you see today requires several million pages of documentation, which have to be revised and published every quarter. [http://www.amazon.com/Charles-Goldfarbs-XML-Handbook-4th/dp/product-description/0130651982 (XML Handbook)] The combined documentation libraries of the aircraft industry probably rival the size of the entire World Wide Web. Yet every few years the dominant hardware changes, and along with it, the software used.
72
73		Governments and law firms are in a similar situation.
74
75		=== Markup Languages ===
76		It is precisely these issues, the survival of information beyond the lifetime of proprietary hardware, and moving a large amount of information between proprietary systems, that prompted Charles Goldfarb, Ed Mosher, and Ray Lorie at IBM to create "Generalized Markup Language" (GML) in the late 1960s.
77
78		* GML
79		* SGML
80		* HTML
81		* XML
82
83		Except for GML (which became SGML), all of these markup languages are in wide use today. SGML is used for managing large bodies of textual information. HTML is the language of the World Wide Web, linking documents for human retrieval. XML is increasingly being used to manage large bodies of ''knowledge'', including plant information with ISO 15926.
84
85		Most people will not need to know how markup languages are used to manage plant information, but a brief history of markup languages will be interesting for background information.
86
87		== The History of Markup Languages ==
88
89		Markup languages have a long history in enabling computers to handle large bodies of text properly, without human intervention. When encoded with a markup language, the ''content'' of a body of text is separated from the ''format'', or appearance of the text. This is an important concept in ISO 15926 where the goal is to embed enough ''context'' into the ''content'' that we do not need to see the format, or appearance, of the information to know what it means.
90
91		Key factors in a Markup Language:
92
93		* A standard format in which to store information that lasts many times longer than proprietary commercial software.
94		* A means to transfer information between proprietary computer systems.
95
96		----
97
98		=== What is a Markup Language? ===
99
100		In the context of understanding ISO 15926, a "markup language" is a set of conventions for marking up text that are used together with the text to tell a computer the meaning of the text.
101
102		At a very simple level, punctuation, capitalization, and even the spaces between words themselves can be considered ''markup''. These features tell human readers when there is a break between ideas, when to pause, and where individual words start and stop. (If the reader thinks these are obvious necessities for understanding written text, there are numerous examples in the history of human societies where written material was in the style of [http://en.wikipedia.org/wiki/Scriptio_continua scriptio continua]).
103
104		Another example is spoken words, where the volume and tone of voice can be considered ''markup''. For instance, a given string of words may have a completely different meaning if they are yelled, spoken in a soft voice, or with a condescending tone of voice. Thus, the ''value'' of the message (that is, the actual words spoken) must be considered together with the ''tags'' (that is, the volume and tone of voice) to obtain the correct meaning.
105
106		We have seen this concept previously in this primer. In the section about ''context'', we saw that the numerical value ''1034'' on its own had no meaning, but in the context of a particular spot on a particular data sheet, it meant the pressure of the seal flush of a centrifugal pump. Thus, the location of a value on a data sheet can be considered a sort of ''markup''.
107
108		If the meaning of a piece of text is embedded in the text by means of a markup language, one can use the same body of text for different purposes without modifying the text manually. For example, consider a scientific journal that publishes papers both in a printed magazine format and on its website. In the magazine, footnotes might be grouped at the end of the article, while footnotes on the website might pop up in their own little window. The publisher could manually edit the text for each purpose, but this would be doing it the hard way. The easy way is to encode the text with a markup language that ''marks'' the beginning and end of each footnote, and shows the correct anchor point in the manuscript. When the publishing software prepares the text for print, it will group the footnotes at the end of the article, but when it prepares the text for the website, it will include the necessary HTML tags to create a popup window.
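
The footnote example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `<fn>` tag, the sample sentence, and the two function names are invented here and do not come from any real publishing system. The point is that one marked-up source feeds both outputs without manual editing.

```python
import html
import re

# Hypothetical source text: <fn>...</fn> marks a footnote at its anchor point.
SOURCE = ("Pumps need seal flush<fn>See the data sheet.</fn> "
          "to run reliably<fn>Usually.</fn>.")

FOOTNOTE = re.compile(r"<fn>(.*?)</fn>")


def render_print(text):
    """Print layout: numbered markers in the body, notes grouped at the end."""
    notes = []

    def marker(match):
        notes.append(match.group(1))
        return "[%d]" % len(notes)

    body = FOOTNOTE.sub(marker, text)
    endnotes = ["[%d] %s" % (i, note) for i, note in enumerate(notes, 1)]
    return "\n".join([body] + endnotes)


def render_web(text):
    """Web layout: each note becomes an HTML element that shows it on hover."""
    count = [0]

    def popup(match):
        count[0] += 1
        note = html.escape(match.group(1), quote=True)
        return '<sup title="%s">%d</sup>' % (note, count[0])

    return FOOTNOTE.sub(popup, text)


print(render_print(SOURCE))
print(render_web(SOURCE))
```

Neither function changes `SOURCE`; the decision about where a footnote appears is made by the renderer, not by whoever typed the text.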
109
110		=== 1960s GML ===
111
112		Generalized Markup Language (GML) was developed in 1969 by the team of Charles Goldfarb, Ed Mosher, and Ray Lorie. (Look at the initials formed by their last names--it's not a coincidence. In fact, Goldfarb invented the term "Markup Language" just to be able to use them!) Goldfarb, a lawyer at the time, had joined IBM to get some high tech experience. He was assigned to a project to figure out how to merge case law research results together into one document, compose it, and print it. At the time there was no single system that would do all three of these things, so when text was to be printed, it had to be transferred from one proprietary system to another, all without losing its fidelity, or meaning.
113
114		GML was a set of macros that described the logical structure of the document, for instance, to declare some text to be a heading and other text to be a body paragraph.
115
116		Note that the issue of being able to transfer information between proprietary systems is one of the same issues that drives ISO 15926.
117
118		'''References'''
119		* [http://www.sgmlsource.com/press/index.htm Charles Goldfarb's Press Online Kit]
120
121
122		=== 1980s Standard Generalized Markup Language (SGML) ===
123
124		SGML is a descendant of GML. SGML was originally intended for publishing databases and text. One of its first applications was publishing an early edition of the Oxford English Dictionary.
125
126		SGML is known as a ''metalanguage'' since it can be used to describe other markup languages.
127
128		In the field of publishing, historically, ''markup'' has meant the marks that an editor makes when reviewing a manuscript. For instance, marks to indicate that one phrase is to be rendered in bold face and another in italics. In an age of machine-readable text, this term has now come to mean special formatting codes inserted in-line with the text to give direction to the computer that does the publishing.
129
130		''Metalanguage'' means that SGML has the means to describe which markups are required and how to tell markups from text. Thus, you can use SGML to create other markup languages.
131
132		The first working draft of SGML by the American National Standards Institute (ANSI) was published in 1980. By 1983 it was ready for prime time and was adopted by the US Internal Revenue Service and the US Department of Defense. The next year the International Organization for Standardization (ISO) became involved, and in 1986
133		issued SGML as an international standard (ISO 8879:1986).
134
135		One feature of SGML that distinguished it from other markup languages at the time is its emphasis on ''descriptive'' markup rather than ''procedural'' markup. This means that the tags ''describe'' the text rather than tell ''what to do with it''. For instance, it was common to use a proprietary markup language which told proprietary publishing equipment to, say, print this in 10pt Times Roman, and that in 12pt sans serif. But if the publisher wanted to process the text on different equipment, the tags would all have to be stripped out and new tags entered. SGML, however, simply said "this is body text", or "this is a footnote".
136
137		Of interest to the history of ISO 15926 are some of the reasons for using SGML:
138		* In government and law, large bodies of text must be readable for decades. Therefore they must not be stored in any proprietary format that may go out of fashion in a few years. This is also one of the reasons to use ISO 15926; the life of a typical plant also spans several decades, during which time computer operating systems and text handling software go through many generations. The people dismantling the plant forty years later may not even remember the name of the software that was used by the engineers designing the plant.
139
140		* SGML was also used as a means to transfer texts from one system to another in a manner that preserved the intent of the formatting. Similarly, ISO 15926 can be used to transfer information about plant objects from one system to another in a manner that preserves the meaning of the attributes of the plant object.
141
142		'''Example''': Continuing the example of a pump data sheet, here is what the information might look like encoded in SGML:
143
144		{{{
145		<TITLE>CENTRIFUGAL PUMP DATA SHEET</TITLE>
146		<BODY><B>Client: ABC Chemical Company</B></BODY>
147		<BODY>Tag No: P101</BODY>
148		<BODY>Service: Chemical Injection to D-101</BODY>
149		<BODY>...</BODY>
150		<BODY><B>Seal Flush</B></BODY>
151		<BODY>Pressure: 1034 kPa</BODY>
152		}}}
153
154		This script will render the information on the data sheet as plain text. The title will likely be a larger font, and the two headings will be in bold face. The rest of the text is understandable by humans, but you could not have a computer read it to extract, for instance, the tag number of the pump, the pump's attributes, or its relationship to D-101.
155
156		'''References'''
157
158		Good introductory material:
159		* [http://xml.coverpages.org/naggumWhat.html SGML: Erik Naggum's Brief Description] is one of the best places to start.
160		* [http://www.isgmlug.org/sgmlhelp/g-index.htm A Gentle Introduction to SGML] (in HTML) or [http://xml.coverpages.org/gentle.html A Gentle Introduction to SGML] (in plain text).
161		* [http://xml.coverpages.org/general.html#hist History of Generalized Markup and SGML]
162		* [http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language Wikipedia: SGML]
163		* [http://xml.coverpages.org/sgml.html SGML and XML as Markup Languages]
164
165		For more detailed information:
166		* [http://www.w3.org/MarkUp/SGML/ Overview of SGML Resources]
167
168		=== 1989 - Hypertext Markup Language (HTML) ===
169
170		HTML is a descendant of SGML. HTML was invented by Tim Berners-Lee as a way to embed references to a document within another document. He envisioned being able to open such a referenced document directly, without having to exit the first document.
171
172		Berners-Lee based HTML on SGML since SGML could already be implemented on any machine. As with SGML, the idea was to be able to mark up text in a way that separated the ''message'' from ''the manner in which the message was displayed''. For instance, ''<EM> some text </EM>'' meant that the enclosed text was to be somehow emphasized. Web browsers intended for human eyes might render the text slightly larger and bold face, or perhaps underlined. Alternatively, web browsers intended for human ears might render it in a slightly louder tone.
173
174
175		HTML attracted mostly academic interest for the first year or two. But as the Internet became more widely known, organizations started to realize how HTML could open the Internet to average people. From the early 1990s, HTML became a battleground for various competing interests who added their own tags. One of the biggest issues was getting fine control over the appearance of the text and images. The example above, ''<EM> some text </EM>'', says that "some text" should be somehow emphasized, but in what way? Print publishers were used to tweaking text by adjusting the point size, leading, and kerning, and were not happy trusting the default handling of "emphasized" text.
176
177		The result today is that HTML has a great many tags for fine-tuning the appearance of text, but no tags to convey the meaning of text--you still need a person to read the material.
178
179		'''Example''': HTML, being defined with SGML, introduced a number of tags of its own:
180
181		* P, for paragraph
182		* H1 through H6, for heading levels
183		* OL, ordered lists
184		* UL, unordered lists
185		* LI, list items
186		* A, to anchor references to other objects
187		* HREF, the attribute of A that holds the reference
188
189		Here is how we might use them to encode our pump data sheet in HTML:
190
191		{{{
192		<TITLE>CENTRIFUGAL PUMP DATA SHEET</TITLE>
193		<H1>Client: ABC Chemical Company</H1>
194		<P>Tag No: P101</P>
195		<P>Service: Chemical Injection to D-101</P>
196		<P>...</P>
197		<H2>Seal Flush</H2>
198		<UL>
199		<LI>Pressure: 1034 kPa</LI>
200		<LI>...</LI>
201		</UL>
202		}}}
203
204		In this script we have used some of the new tags to lay out the data sheet a little more nicely. The title is the same, but now we can group the pump's attributes under headings. However, we are still formatting the text for human viewers. We have more tags to handle the appearance of the information, but nothing to tell a computer what the various bits of text mean.
205
206		'''References'''
207
208		* [http://www.w3.org/People/Raggett/book4/ch02.html History from W3.Org]
209		* [http://infomesh.net/html/history/early/ The Early History of HTML]
210		* [http://www.yourhtmlsource.com/starthere/historyofhtml.html The History of HTML]
211		* [http://en.wikipedia.org/wiki/HTML Wikipedia: HTML]
212		* [http://www.livinginternet.com/w/ww_html.htm Hypertext Markup Language (HTML)]
213
214
215		=== 1990s Extensible Markup Language (XML) ===
216
217		XML, also a descendant of SGML, is also a metalanguage in that it can be used to define other markup languages. XML was intended to get back to the SGML roots without the SGML complexity. When it was released in its first draft in late 1996, its developers were not shy about proclaiming it to be the holy grail of computing, solving the problem of universal data interchange between dissimilar systems.
218
219		Since its introduction it has accomplished at least some of what was intended of it. For instance, today, near the end of the first decade of the twenty-first century, most of our Office documents are now stored in XML format. While some argue that the particular dialect of XML, Office Open XML, isn't the best formed in the world, it's still an order of magnitude better than the myriad proprietary formats that preceded it. Now it is much easier for third parties to reverse-engineer documents in order to open them in different authoring software.
220
221		Of interest to the history of ISO 15926 are some of the implications of widespread use of XML in web publishing. Looking into our crystal ball we can see applications written by webmasters that will allow untrained users to write in something that looks like Microsoft Word, then upload their fine prose (or poetry, or...) straight into the local content management system. And as XML-written documents displace documents written with proprietary software (and uploaded as inscrutable binary files), more and more data will be open, available to be searched and indexed, and therefore available for all.
222
223		'''Example''': Here is how we might mark up information about plant objects in a way that will let a computer determine what each data value means.
224
225		{{{
226		<DATASHEET>
227		<TITLE>CENTRIFUGAL PUMP DATA SHEET</TITLE>
228		<CLIENT>ABC Chemical Company</CLIENT>
229		<TAG_NO>P101</TAG_NO>
230		<SERVICE>Chemical Injection</SERVICE><ASSOCIATED>D-101</ASSOCIATED>
231		<something>...</something>
232		<SEAL_FLUSH_PRESSURE>1034</SEAL_FLUSH_PRESSURE>
233		<SEAL_FLUSH_PRESSURE_UNITS>kPa</SEAL_FLUSH_PRESSURE_UNITS>
234		<something_else>...</something_else>
235		...
236		</DATASHEET>
237		}}}
238
239		=== Advantages over HTML ===
240
241		The tags in the example above were created on the spot, showing how we can extend XML to include any kind of tags we wish. Right away you can see how we could then use a computer program to search the information and retrieve the tag number of the pump, its associated equipment (D-101), and the fact that the seal flush pressure was 1034 kPa. Since it is extensible, any organization can create its own tags for whatever it needs.
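
As a rough sketch of such a program, the Python below reads a trimmed copy of the data sheet with the standard library's `xml.etree.ElementTree` module. The elided placeholder elements from the example are left out, and the function name `summarize` and the dictionary keys are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the data sheet example; the "..." placeholders are omitted.
DATASHEET = """\
<DATASHEET>
<TITLE>CENTRIFUGAL PUMP DATA SHEET</TITLE>
<CLIENT>ABC Chemical Company</CLIENT>
<TAG_NO>P101</TAG_NO>
<SERVICE>Chemical Injection</SERVICE><ASSOCIATED>D-101</ASSOCIATED>
<SEAL_FLUSH_PRESSURE>1034</SEAL_FLUSH_PRESSURE>
<SEAL_FLUSH_PRESSURE_UNITS>kPa</SEAL_FLUSH_PRESSURE_UNITS>
</DATASHEET>
"""


def summarize(xml_text):
    """Look each value up by its tag name instead of by its position on a page."""
    root = ET.fromstring(xml_text)
    return {
        "tag_no": root.findtext("TAG_NO"),
        "associated": root.findtext("ASSOCIATED"),
        "seal_flush_pressure": float(root.findtext("SEAL_FLUSH_PRESSURE")),
        "units": root.findtext("SEAL_FLUSH_PRESSURE_UNITS"),
    }


pump = summarize(DATASHEET)
print(pump["tag_no"], "is associated with", pump["associated"])
print("Seal flush pressure:", pump["seal_flush_pressure"], pump["units"])
```

The same lookup is not possible with the SGML and HTML versions of the data sheet, because there the tag number and pressure are buried inside generic body or paragraph text.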
242
243		=== Drawback for Interoperability - Agreement on Terms ===
244
245		In order to get interoperability between systems, the owners of the systems have to agree on the definition of terms. As we have seen in previous sections, getting that agreement is not a trivial matter. (When we get to the section on how we store and exchange plant information, later in this primer, we will see how this issue is being resolved.)
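
To make the problem concrete, here is a small Python sketch in which two organizations describe the same pump with different tags. Everything in it is hypothetical: the second organization's tag names, the shared term names, and the `to_shared` function are invented for illustration, and this is not how ISO 15926 itself resolves the issue.

```python
import xml.etree.ElementTree as ET

# Two hypothetical organizations describing the same pump with their own tags.
ORG_A = ("<DATASHEET><TAG_NO>P101</TAG_NO>"
         "<SEAL_FLUSH_PRESSURE>1034</SEAL_FLUSH_PRESSURE></DATASHEET>")
ORG_B = ("<PumpSheet><EquipmentTag>P101</EquipmentTag>"
         "<FlushPress>1034</FlushPress></PumpSheet>")

# The "agreement on terms": each private tag mapped to one shared term.
# Without a table like this, neither system can interpret the other's file.
SHARED_TERMS = {
    "TAG_NO": "tag_number",
    "EquipmentTag": "tag_number",
    "SEAL_FLUSH_PRESSURE": "seal_flush_pressure",
    "FlushPress": "seal_flush_pressure",
}


def to_shared(xml_text):
    """Translate one organization's elements into the shared vocabulary."""
    root = ET.fromstring(xml_text)
    return {
        SHARED_TERMS[child.tag]: child.text
        for child in root
        if child.tag in SHARED_TERMS
    }


print(to_shared(ORG_A))
print(to_shared(ORG_B))
```

The code is trivial; the hard part is the contents of `SHARED_TERMS`, which only exists once the owners of both systems have agreed on what each term means.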
246
247
248		Descendants
249		* SOAP
250		* XML RPC
251
252		'''References'''
253		* [http://en.wikipedia.org/wiki/XML Wikipedia: XML]
254		* [http://www.w3.org/XML/hist2002 Development History]
255		* [http://www.w3.org/TR/WD-xml-961114.html Extensible Markup Language (XML)]
256		* [http://www.itwriting.com/xmlintro.php Introducing XML]
257		* [http://www.ibm.com/developerworks/library/x-xml2008prevw.html?ca=dgr-lnxw01XML-Future The future of XML]
258		* [http://www.extropia.com/tutorials/xml/index.html Introduction to XML for Web Developers]
259
260		== NEXT ==
261
262		* [wiki:ISO15926Primer_History_KnowUnderstandThings Primer: How We Know And Understand Things]
263
264		----
	13	* [wiki:ISO15926Primer An Introduction to ISO 15926]