“When messages are a major source of the information entering the system, corrections to a previously transmitted original message can be received before the original message itself. If entered on an earlier update cycle, the correction data can actually be 'corrected' during a later update cycle by the original incorrect message.” (Davis, 1967, p. 24).

2.25 “Errors will occur in every data collection system, so it is important to detect and correct as many of the errors as possible.” (Hillegass and Melick, 1967, p. 56).

“The primary purpose of a data communications system is to transmit useful information from one location to another. To be useful, the received copy of the transmitted data must constitute an accurate representation of the original input data, within the accuracy limits dictated by the application requirements and the necessary economic tradeoffs. Errors will occur in every data communications system. This basic truth must be kept in mind throughout the design of every system. Important criteria for evaluating the performance of any communications system are its degree of freedom from data errors, its probability of detecting the errors that do occur, and its efficiency in overcoming the effects of these errors.” (Reagan, 1966, p. 26).

“The form of the control established, as a result of the investigation, should be decided only after considering each situation in the light of the three control concepts mentioned earlier. Procedures, such as key verification, batch totals, sight verification, or printed listings should be used only when they meet the criteria of reasonableness, in light of the degree of control required and the cost of providing control in relation to the importance and volume of data involved. The objective is to establish appropriate control procedures. The manner in which this is done - i.e., the particular combination of control techniques used in a given set of circumstances - will be up to the ingenuity of the individual systems designer.” (Baker and Kane, 1966, pp. 99-100).

2.26 “Two basic types of codes are found suitable for the burst type errors. The first is the forward-acting Hagelbarger code which allows fairly simple data encoding and decoding with provisions for various degrees of error size correction and error size detection. These codes, however, involve up to 50 percent redundancy in the transmitted information. The second code type is the cyclic code of the Bose-Chaudhuri type which again is fairly simple to encode and can detect various error burst sizes with relatively low redundancy. This code type is relatively simple to decode for error detection but is too expensive to decode for error correction, and makes retransmission the only alternative.” (Hickey, 1966, p. 182).

2.27 “Research devoted to finding ways to further reduce the possibility of errors is progressing on many fronts. Bell Telephone Laboratories is approaching the problem from three angles: error detection only, error detection and correction with non-constant speed of end-to-end data transfer (during the correction cycle transmission stops), and error detection and correction with a constant speed of end-to-end data transfer (during the correction cycle transmission continues).” (Menkhaus, 1967, p. 35).

“There are two other potential 'error injectors' which should be given close attention, since more control can be exercised over these areas.
They are: the data collection, conversion and input devices, and the human being, or beings, who collect the data (or program a machine to do it) at the source. Bell estimates that the human will commit an average of 1,000 errors per million characters handled, the mechanical device will commit 100 per million, and the electronic component, 10 per million.

“Error detection and correction capability is a 'must' in the Met Life system and this is provided in several ways. The input documents have Honeywell's Orthocode format, which uses five rows of bar codes and several columns of correction codes that make defacement or incorrect reading virtually impossible; the control codes also help regenerate partially obliterated data. . . .

“Transmission errors are detected by using a dual pulse code that, in effect, transmits the signals for a message and also the components of those signals, providing a double check on accuracy. The paper tape reader, used to transmit data, is bi-directional; if a message contains a large number of errors, due possibly to transmission noise, the equipment in the head office detects those errors and automatically tells the transmitting machine to 'back up and start over'.” (Menkhaus, 1967, p. 35).

2.28 “Input interlocks - checks which verify that the correct types and amounts of data have been inserted, in the correct sequence, for each transaction. Such checks can detect many procedural errors committed by persons entering input data into the system.” (Hillegass and Melick, 1967, p. 56).

2.29 “Parity - addition of either a 'zero' or 'one' bit to each character code so that the total number of 'one' bits in every transmitted character code will be either odd or even. Character parity checking can detect most single-bit transmission errors, but it will not detect the loss of two bits or of an entire character.” (Hillegass and Melick, 1967, p. 56).

“Two of the most popular error detection and correction devices on the market - Tally's System 311 and Digitronics' D500 Series - use retransmission as a correction device. Both transmit blocks of characters and make appropriate checks for valid parity. If the parity generated at the transmitter checks with that which has been created from the received message by the receiver, the transmission continues. If the parity check fails, the last block is retransmitted and checked again for parity. This method avoids the disadvantages of transmitting the entire message twice and of having to compare the second message with the first for validity.” (Davenport, 1966, p. 31).

“Full error detection and correction is provided. The telephone line can be severed and reattached hours later without loss of data . . . Error detection is accomplished by a horizontal and vertical parity bit scheme similar to that employed on magnetic tape.” (Lynch, 1966, p. 119).

“A technique that has proven highly successful is to group the eight-level characters into blocks of eighty-four characters. One of the eighty-four characters represents a parity character, assuring that the summation of each of the 84 bits at each of eight levels is either always odd or always even.
For the block, there is now a vertical parity check (the character parity) and a horizontal parity check (the block parity character). This dual parity check will be invalidated only when an even number of characters within the block have an even number of hits, each at the same level. The probability of such an occurrence is so minute that we can state that the probability of an undetected error is negligible. In an 84-character block, constituting 672 bits, 83 + 8 = 91 bits are redundant. Thus, at the expense of adding redundancy of 13.5 per cent, we have assured error-free transmission. At least we know that we can detect errors with certainty. Now, let us see how we can utilize this knowledge to provide error-free data at high transmission rates.

One of the most straightforward techniques is to transmit data in blocks, automatically checking each block for horizontal and vertical parity at the receiving terminal. If the block parities check, the receiving terminal delivers the block and an acknowledgment character (ACK) is automatically transmitted back to the sending terminal. This releases the next block and the procedure is repeated. If the block parities do not check, the receiving terminal discards the block and a nonacknowledgment character (NACK) is returned to the sender. Then, the same block is retransmitted. This procedure requires that storage capacity for a minimum of one data block be provided at both sending and receiving terminals.” (Rider, 1967, p. 134).

2.30 “What then can we say that will summarize the position of the check digit? We can say that it is useful for control fields – that is, those fields we access by and sort on, customer number, employee number, etc. We can go further and say that it really matters only with certain control fields, not all. With control fields, the keys by which we find and access records, it is essential that they be correct if we are to find the correct record. If they are incorrect through juxtaposition or other errors in transcription, we will 1) not find the record, and 2) find and process the wrong record.

“One of the most novel uses of the check digit can be seen in the IBM 1287 optical scanner. The writer enters his control field followed by the check digit. If one of his characters is not clear, the machine looks at the check digit, carries out its arithmetic on the legible characters, and subtracts the result from the result that would give the check digit to establish the character in doubt. It then rebuilds this character.” (Rothery, 1967, p. 59).

2.31 “A hash total works in the following way. Most of our larger computers can consider alphabetic information as data. These data are added up, just as if they were numeric information, and a meaningless total produced. Since the high-speed electronics are very reliable, they should produce the same meaningless number every time the same data fields are summed. The transfer of information within the computer and to and from the various input/output units can be checked by recomputing this sum after every transmission and checking against the previous total. . . .

“Some computers have special instructions built into them to facilitate this check, whereas others accomplish it through programming. The file designer considers the hash total as a form of built-in audit. Whenever the file is updated, the hash totals are also updated. Whenever a tape is read, the totals are reconstituted as an error check. Whenever an error is found, the operation is repeated to determine if a random error has occurred. If the information is erroneous, an alarm is sounded and machine repair is scheduled. If information has been actually lost, then human assistance is usually required to reconstitute the file to its correct content.
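The character-and-block parity scheme quoted from Rider above (note 2.29), in which every character carries its own parity bit, one character per block carries a parity bit for each bit level, and the receiving terminal answers with ACK or NACK, can be outlined in a brief Python sketch. It is offered only as an illustration; the block size, the function names, and the ACK/NACK handling shown are assumptions, not details taken from the source.

    # Illustrative sketch of the dual-parity block check described in the Rider
    # quotation (note 2.29): every character carries its own (vertical) parity bit,
    # and one block parity character carries a (horizontal) parity bit per bit level.
    # Block size, names, and the ACK/NACK handling are assumptions for illustration.

    BLOCK_SIZE = 83      # data characters per block; an 84th character holds the block parity
    CHAR_BITS = 7        # information bits per character; an eighth bit is the character parity

    def parity_bit(bits):
        # the bit that makes the count of one-bits even
        return sum(bits) % 2

    def encode_block(chars):
        # append the block parity character (one even-parity bit per bit level),
        # then give every character, including that one, its own parity bit
        chars = chars + [[parity_bit(col) for col in zip(*chars)]]
        return [c + [parity_bit(c)] for c in chars]

    def check_block(block):
        # vertical check: every character must contain an even number of one-bits;
        # horizontal check: every bit level, summed over the block, must be even
        rows_ok = all(sum(c) % 2 == 0 for c in block)
        cols_ok = all(sum(col) % 2 == 0 for col in zip(*block))
        return rows_ok and cols_ok

    def respond(block):
        # receiving terminal: deliver the block and answer ACK, or discard it and
        # answer NACK so that the sending terminal retransmits the same block
        return "ACK" if check_block(block) else "NACK"

    data = [[1, 0, 1, 1, 0, 0, 1] for _ in range(BLOCK_SIZE)]
    block = encode_block(data)
    print(respond(block))      # ACK - the next block is released
    block[5][2] ^= 1           # a single noise hit on one bit
    print(respond(block))      # NACK - the same block would be retransmitted

A single disturbed bit upsets a character parity; the only patterns that pass unnoticed are the rare ones the quotation describes, in which the disturbed bits pair off both by character and by bit level.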
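The hash total of note 2.31, just above, can be sketched in the same spirit: the character codes of the fields are summed into a deliberately meaningless total that is recomputed after every transfer and compared with the stored total. The record layout and names below are assumptions for illustration.

    # Illustrative hash-total check (note 2.31): alphabetic fields are summed as if
    # they were numbers, and the meaningless total is recomputed after each transfer
    # and compared with the total kept for the file. The records shown are assumptions.

    def hash_total(records):
        # sum the character codes of every field of every record
        return sum(ord(ch) for rec in records for field in rec for ch in field)

    def verify_transfer(records, expected_total):
        # recompute the hash total after a transfer; a mismatch calls for a recheck
        return hash_total(records) == expected_total

    master = [("JONES", "ACCT-104", "CREDIT"), ("SMITH", "ACCT-203", "DEBIT")]
    stored = hash_total(master)             # kept with the file and updated with it

    # after the tape is reread (or the data retransmitted), the total is reconstituted
    copy = [("JONES", "ACCT-104", "CREDIT"), ("SMITH", "ACCT-203", "DEBIT")]
    print(verify_transfer(copy, stored))    # False would trigger a repeat of the operation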
Through a combination of hardware and programming the validity of large reference files can be maintained even though the file is subject to repeated usage.” (Patrick and Black, 1964, pp. 46-47).

2.32 “Message length - checks which involve a comparison of the number of characters received with the number specified for that particular type of transaction. Message length checks can detect many errors arising from both improper data entry and equipment or line malfunctions.” (Hillegass and Melick, 1967, p. 56).

2.33 “In general, many standard techniques such as check digits, hash totals, and format checks can be used to verify correct input and transmission. These checks are performed at the computer site. The nature and extent of the checks will depend on the capabilities of the computer associated with the response unit. One effective technique is to have the unit respond with a verbal repetition of the input data.” (Melick, 1966, p. 60).

2.34 “Philco has a contract for building what is called a Spelling-Corrector . . . It reads text and matches it against the dictionary to find out whether the words are spelled correctly.” (Gibbs and MacPhail, 1964, p. 102).

“Following keypunching, the information retrieval technician processes the data using a 1401 computer. The computer performs sequence checking, editing, autoproofing (each word of input is checked against a master list of correctly spelled words to determine accuracy - a mismatch is printed out for human analysis since it is either a misspelled or a new word), and checking for illegitimate characters. The data is now on tape; any necessary correction changes or updating can be made directly.” (Magnino, 1965, p. 204).

“Prior to constructing the name file, a 'legitimate name' list and a 'common error' name list are tabulated . . . The latter list is formed by taking character error information compiled by the instrumentation system and thresholding it so only errors with significant probabilities remain; i.e., 'e' for 'a'. These are then substituted one character at a time in the names of the 'legitimate name' list to create a 'common error' name list. Knowing the probability of error and the frequency of occurrence of the 'legitimate name' permits the frequency of occurrence for the 'common error' name to be calculated.” (Hennis, 1967, pp. 12-13).

2.35 “When a character recognition device errs in the course of reading meaningful English words it will usually result in a letter sequence that is itself not a valid word; i.e., a 'misspelling'.” (Cornew, 1968, p. 79).

2.36 “Several possibilities exist for using the information the additional constraints provide. A particularly obvious one is to use special purpose dictionaries, one for physics texts, one for chemistry, one for novels, etc., with appropriate word lists and probabilities in each. . . .

“Because of the tremendous amount of storage which would be required by such a 'word digram' method, an alternative might be to associate with each word its one or more parts of speech, and make use of conditional probabilities for the transition from one part of speech to another.” (Vossler and Branston, 1964, p. D2.4-7).

2.37 “In determining whether or not to adopt an EDC system, the costliness and consequences of any error must be weighed against the cost of installing the error detection system. For example, in a simple telegram or teleprinter message, in which all the information appears in word form, an error in one or two letters usually does not prevent a reader from understanding the message.
With training, the human mind can become an effective error detection and correction system; it can readily identify the letter in error and make corrections. Of course, the more unrelated the content of the message, the more difficult it is to detect a random mistake. In a list of unrelated numbers, for example, it is almost impossible to tell if one is incorrect.” (Gentle, 1965, p. 70).

2.38 In addition to examples cited in a previous report in this series, we note the following:

“In the scheme used by McElwain and Evens, undisturbed digrams or trigrams in the garbled message were used to locate a list of candidate words each containing the digram or trigram. These were then matched against the garbled sequence taking into account various possible errors, such as a missing or extra dash, which might have occurred in Morse Code transmission.” (Vossler and Branston, 1964, p. D2.4-1).

“Harmon, in addition to using digram frequencies to detect errors, made use of a confusion matrix to determine the probabilities of various letter substitutions as an aid to correcting these errors.” (Vossler and Branston, 1964, pp. D2.4-1 - D2.4-2).

“An interesting program written by McElwain and Evens was able to correct about 70% of the garbles in a message transmitted by Morse Code, when the received message contained garbling in 0-10% of the characters.” (Vossler and Branston, 1964, p. D2.4-1).

“The design of the spoken speech output modality for the reading machine of the Cognitive Information Processing Group already calls for a large, disc-stored dictionary . . . The possibility of a dual use of this dictionary for both correct spelling and correct pronunciation prompted this study.” (Cornew, 1968, p. 79).

“Our technique was first evaluated by a test performed on the 1000 most frequent words of English which, by usage, comprise 78% of the written language. For this, a computer program was written which first introduced into each of these words one randomly-selected, randomly-placed letter substitution error, then applied this technique to correct it. This resulted in the following overall statistics: 739 correct recoveries of the original word prior to any other; 241 incorrect recoveries in which another word appeared sooner; 20 cases where the misspelling created another valid word.” (Cornew, 1968, p. 83).

“In operation, the word consisting of all first choice characters is looked up. If found, it is assumed correct; if not, the second choice characters are substituted one at a time until a matching word is found in the dictionary or until all second choice substitutions have been tried. In the latter case a multiple error has occurred (or the word read correctly is not in the dictionary).” (Andrews, 1962, p. 302).

2.39 “There are a number of different techniques for handling spelling problems having to do with names in general and names that are homonyms. Present solutions to the handling of name files are far from perfect.” (Rothman, 1966, p. 13).

2.40 “The chief problem associated with ... large name files rests with the misspelling or misunderstanding of names at time of input and with possible variations in spelling at the time of search. In order to overcome such difficulties, various coding systems have been devised to permit filing and searching of large groups of names phonetically as well as alphabetically ... A Remington Rand Univac computer program capable of performing the phonetic coding of input names has been prepared.” (Becker and Hayes, 1963, p. 143).
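The lookup-with-substitution procedure quoted from Andrews in note 2.38 above lends itself to a direct sketch: the word built from the reader's first-choice characters is looked up, and if it is absent the second-choice character is tried at each position in turn. The dictionary and the recognizer's character choices below are assumptions for illustration.

    # Sketch of the Andrews procedure (note 2.38): look up the all-first-choice word;
    # if it is not in the dictionary, substitute the second-choice character one
    # position at a time until a dictionary word appears or the choices are exhausted.
    # The dictionary and the character choices shown are assumptions for illustration.

    DICTIONARY = {"ERROR", "DETECTION", "CORRECTION", "PARITY"}

    def correct(first_choices, second_choices):
        word = "".join(first_choices)
        if word in DICTIONARY:
            return word                      # assumed correct as read
        for i, alt in enumerate(second_choices):
            trial = word[:i] + alt + word[i + 1:]
            if trial in DICTIONARY:
                return trial                 # a single substitution recovered the word
        return None                          # multiple error, or the word is not in the dictionary

    # the reader's first choices spell "PAKITY"; its second choice for the third character is "R"
    print(correct(list("PAKITY"), list("PBRIIV")))    # PARITY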
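Note 2.40 mentions phonetic coding of names without naming a particular scheme; the following Soundex-style code is offered only as one classic example of the kind of coding described, under which similarly spelled and similarly sounding surnames file and search together.

    # A Soundex-style phonetic code, offered only as one example of the kind of
    # name coding note 2.40 describes; the source does not name a specific scheme.
    # Vowels and h, w, y are dropped; similar consonants share a digit, so variant
    # spellings such as Smith and Smyth receive the same code.

    CODES = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2", "q": "2", "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3",
             "l": "4",
             "m": "5", "n": "5",
             "r": "6"}

    def phonetic_code(name):
        # first letter of the name plus the first three consonant digits, padded with zeros
        name = name.lower()
        digits = []
        prev = CODES.get(name[0], "")
        for ch in name[1:]:
            digit = CODES.get(ch, "")
            if digit and digit != prev:
                digits.append(digit)
            prev = digit
        return (name[0].upper() + "".join(digits) + "000")[:4]

    print(phonetic_code("Smith"), phonetic_code("Smyth"))    # S530 S530 - filed together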
“A particular technique used in the MGH [Massachusetts General Hospital] system is probably worth mentioning; this is the technique for phonetic indexing reported by Bolt et al. The use described involves recognition of drug names that have been typed in, more or less phonetically, by doctors or nurses; in the longer view this is one aspect of a large effort that must be expended to free the man-machine interface from the need for letter-perfect information representation by the man. People just don't work that way, and systems must be developed that can tolerate normal human imprecision without disaster.” (Mills, 1967, p. 243).

2.41 “... The object of the study is to determine if we can replace garbled characters in names. The basic plan was to develop the empirical frequency of occurrence of sets of characters in names and use these statistics to replace a missing character.” (Carlson, 1966, p. 189).

“The specific effect on error reduction is impressive. If a scanner gives a 5% character error rate, the trigram replacement technique can correct approximately 95% of these errors. The remaining error is thus . . . 0.25% overall. . . .

“A technique like this may, indeed, reduce the cost of verifying the mass of data input coming from scanners [and] reduce the cost of verifying massive data conversion coming from conventional data input devices like keyboards, remote terminals, etc.” (Carlson, 1966, p. 191).

2.42 “The rules established for coding structures are integrated in the program so that the computer is able to take a fairly sophisticated look at the chemist's coding and the keypunch operator's work. It will not allow any atom to have too many or too few bonds, nor is a '7' bond code permissible with atoms for which ionic bonds are not 'legal'. Improper atom and bond codes and misplaced characters are recognized by the computer, as are various other types of errors.” (Waldo and DeBacker, 1959, p. 720).

2.43 “Extensive automatic verification of the file data was achieved by a variety of techniques. As an example, extracts were made of principal lines plus the sequence number of the record: specifically, all corporate name lines were extracted and sorted; any variations on a given name were altered to conform to the standard. Similarly, all law firm citations were checked against each other. All city-and-state fields are uniform. A zipcode-and-place-name abstract was made, with the resultant file being sorted by zip code: errors were easy to spot and correct, as with Des Moines appearing in the Philadelphia listing.” (North, 1968, p. 110).

Then there is the even more sophisticated case where: “An important input characteristic is that the data is not entirely developed for processing or retrieval purposes. It is thus necessary first to standardize and develop the data before manipulating it. Thus, to mention one descriptor, 'location', the desired machine input might be 'coordinate', 'city', and 'state', if a city is mentioned; and 'state' alone when no city is noted. However, inputs to the system might contain a coordinate and city without mention of a state. It is therefore necessary to develop the data and standardize before further processing commences.

“It is then possible to process the data against the existing file information ... The objective of the processing is to categorize the information with respect to all other information within the files ... To categorize the information, a substantial amount of retrieval and association of data is often required ...
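The trigram replacement technique of note 2.41 can be sketched as follows: a table of three-character frequencies is built from known-good names, and an unreadable character is replaced by the letter whose surrounding trigrams are most frequent. The small name file and the function names below are assumptions for illustration.

    # Illustrative trigram replacement (note 2.41): character-set frequencies from
    # known-good names are used to fill in a character the scanner could not read.
    # The tiny name file and the names used below are assumptions for illustration.

    from collections import Counter
    from string import ascii_uppercase

    def trigram_table(names):
        # count every three-character sequence (with padding) in the good names
        counts = Counter()
        for name in names:
            padded = " " + name.upper() + " "
            for i in range(len(padded) - 2):
                counts[padded[i:i + 3]] += 1
        return counts

    def replace_garbled(name, pos, counts):
        # fill position pos, marked unreadable, with the statistically likeliest letter
        padded = " " + name.upper() + " "
        i = pos + 1                          # allow for the leading pad character
        def score(letter):
            trial = padded[:i] + letter + padded[i + 1:]
            return sum(counts[trial[j:j + 3]] for j in range(i - 2, i + 1) if j >= 0)
        return max(ascii_uppercase, key=score)

    table = trigram_table(["JOHNSON", "JOHNSTON", "JONES", "JACKSON", "JOHANSEN"])
    print(replace_garbled("J?HNSON", 1, table))    # O - the garbled character is restored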
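The 'develop and standardize' step described by Gurk and Minker just above can be pictured as a small normalization pass over the 'location' descriptor, supplying the missing 'state' when only a coordinate and city are given. The lookup table and field names are assumptions for illustration.

    # Illustrative standardization of the 'location' descriptor (Gurk and Minker):
    # an input may carry a coordinate and city without a state, so the state is
    # developed from the city before processing. Table and field names are assumed.

    CITY_TO_STATE = {"DES MOINES": "IOWA", "PHILADELPHIA": "PENNSYLVANIA"}

    def standardize_location(record):
        # fill in the 'state' field when a known city is given without one
        city = (record.get("city") or "").upper()
        if city and not record.get("state"):
            record["state"] = CITY_TO_STATE.get(city)    # None if the city is unknown
        return record

    raw = {"coordinate": "41.6N 93.6W", "city": "Des Moines", "state": None}
    print(standardize_location(raw))
    # {'coordinate': '41.6N 93.6W', 'city': 'Des Moines', 'state': 'IOWA'}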
Many [data] contradictions are resolvable by the system.” (Gurk and Minker, 1961, pp. 263-264).

2.44 “A number of new developments are based on the need for serving clustered environments. A cluster is defined as a geographic area of about three miles in diameter. The basic concept is that within a cluster of stations and computers, it is possible to provide communication capabilities at low cost. Further, it is possible to provide communication paths between clusters, as well as inputs to and outputs from other arrangements as optional features, and still maintain economies within each cluster. This leads to a very adaptable system. It is expected to find wide application on university campuses, in hospitals, within industrial complexes, etc.” (Simms, 1968, p. 23).

2.45 “Among the key findings are the following:
• Relative cost-effectiveness between time sharing and batch processing is very sensitive to and varies widely with the precise man-machine conditions under which experimental comparisons are made.
• Time-sharing shows a tendency toward fewer man-hours and more computer time for experimental tasks than batch processing.
• The controversy is showing signs of narrowing down to a competition between conversationally interactive time-sharing versus fast-turnaround batch systems.
• Individual differences in user performance are generally much larger and are probably more economically important than time-sharing/batch-processing system differences.
• Users consistently and increasingly prefer interactive time-sharing or fast turnaround batch systems.” (Sackman, 1968, p. 350).

However, on at least some occasions, some clients of a multiple-access, time-shared system may be satisfied with, or actually prefer, operation in a batch or job-shop mode to extensive use of the conversational mode.

“Critics (see Patrick 1963, Emerson 1965, and MacDonald 1965) claim that the efficiency of time-sharing systems is questionable when compared to modern closed-shop methods, or with economical small computers.” (Sackman et al., 1968, p. 4).

Schatzoff et al. (1967) report on experimental comparisons of time-sharing operations (specifically, MIT's CTSS system) with batch processing as employed on IBM's IBSYS system.

“... One must consider the total spectrum of tasks to which a system will be applied, and their relative importance to the total computing load.” (Orchard-Hays, 1965, p. 239).

“... A major factor to be considered in the design of an operating system is the expected job mix.” (Morris et al., 1967, p. 74).

“In practice, a multiple system may contain both types of operation: a group of processors fed from a single queue, and many queues differentiated by the type of request being serviced by the attached processor group ...” (Scherr, 1965, p. 95).

2.46 “Normalization is a necessary preface to the merge or integration of our data. By merge, or integration, as I use the term here to represent the last stage in our processes, I am referring to a complex interfiling of segments of our data - the entries. In this 'interfiling,' we produce, for each article or book in our file, an entry which is a composite of information from our various sources. If one of our sources omits the name of the publisher of a book, but another gives it, the final entry will contain the publisher's name. If one source gives the volume of a journal in which an article appears, but not the month, and another gives the month, but not the volume, our final entry will contain both volume and month. And so on.” (Sawin, 1965, p. 17).
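The 'interfiling' Sawin describes can be illustrated by a minimal merge of partial entries, each gap in one source being filled from another; the field names and sample records below are assumptions for illustration.

    # Minimal sketch of the merge, or integration, described by Sawin (note 2.46):
    # entries for the same article or book from several sources are combined field
    # by field. The field names and the sample records are assumptions.

    def merge_entries(*entries):
        # the first source to supply a field wins; gaps are filled by later sources
        merged = {}
        for entry in entries:
            for field, value in entry.items():
                if value and not merged.get(field):
                    merged[field] = value
        return merged

    source_a = {"title": "On Data Errors", "journal": "Datamation", "volume": "13", "month": None}
    source_b = {"title": "On Data Errors", "journal": "Datamation", "volume": None, "month": "March"}

    print(merge_entries(source_a, source_b))
    # {'title': 'On Data Errors', 'journal': 'Datamation', 'volume': '13', 'month': 'March'}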
“Normalize. Each individual printed source, which has been copied letter by letter, has features of typographical format and style, some of which are of no significance, others of which are the means by which a person consulting the work distinguishes the several 'elements' of the item. The family of programs for normalizing the several files of data will insert appropriate information separators to distinguish and identify the elements of each item and rearrange it according to a selected canonical style, which for the Pilot Study is one which conforms generally to that of the Modern Language Association.” (Crosby, 1965, p. 43).

2.47 “Some degree of standardized processing and communication is at the heart of any information system, whether the system is the basis for mounting a major military effort, retrieving documents from a central library, updating the clerical and accounting records in a bank, assigning airline reservations, or maintaining a logistic inventory. There are two reasons for this. First, all information systems are formal schemes for handling the informational aspects of a formally specified venture. Second, the job to be done always lies embedded within some formal organizational structure.” (Bennett, 1964, p. 98).

“Formal organizing protocol exists relatively independently of an organization's purposes, origins, or methods. These established operating procedures of an organization impose constraints upon the available range of alternatives for individual behavior. In addition to such constraints upon the degrees of freedom within an organization as restrictions upon mode of dress, conduct, range of mobility, and style of performance, there are protocol constraints upon the format, mode, pattern, and sequence of information processing and information flow. It is this orderly constraint upon information processing and information flow that we call, for simplicity, the information system of an organization. The term 'system' implies little more than procedural restriction and orderliness. By 'information processing' we mean some actual change in the nature of data or documents. By 'information flow' we indicate a similar change in the location of these data or documents. Thus we may define an information system as simply that set of constraining specifications for the collection, storage, reduction, alteration, transfer, and display of organizational facts, opinions, and associated documentation which is established in order to manage, command if you will, and control the ultimate performance of an organization.

“With this in mind, it is possible to recognize the dangers associated with prematurely standardizing the information processing tools, the forms, the data codes, the message layouts, the procedures for message sequencing, the file structures, the calculations, and especially the data-summary forms essential for automation. Standardization of these details of a system is relatively simple and can be accomplished by almost anyone familiar with the design of automatic procedures. However, if the precise nature of the job and its organizational implications are not understood in detail, it is not possible to know the exact influence that these standards will have on the performance of the system.” (Bennett, 1964, pp. 99, 103).

2.48 “There is a need for design verification. That is, it is necessary to have some method for ensuring that the design is under control and that the nature of the resulting system can be predicted before the end of the design process. In command-and-control systems, the design cycle lasts from two to five years, the design evolving from a simple idea into complex organizations of hardware, software, computer programs, displays, human operations, training, and so forth. At all times during this cycle the design controller must be able to specify the status of the design, the impact that changes in the design will have on the command, and the probability that certain components of the system will work. Design verification is the process that gives the designer this control. The methods that . . .