Research and Development in the Computer and Information Sciences: Overall system design considerations; a selective literature review

volume of information units or reports to be received, processed, or stored can be gained through the use of filtering procedures to reduce the possible redundancies between items received. (Timing considerations are important in such procedures, as noted elsewhere, because we won't want a delayed and incorrect message to 'update' its own correction notice.)

"Secondly, input filtering procedures serve to reduce the total bulk of information to be processed or stored-both by elimination of duplicate items as such and by the compression of the quantitative amount of recording used to represent the original information unit or message within the system.

"A third technique of information control at input is directed to the control of redundancy within a single unit or report. Conversely, input filtering procedures of this type can be used to enhance the value of information to be stored. For example, in pictorial data processing, automatic boundary contrast enhancements or 'skeletonizations' may improve both subsequent human pattern perception and system storage efficiency. Another example is natural text processing, where systematic elimination of the 'little', 'common', and 'non-informing' words can significantly reduce the amount of text to be manipulated by the machine.” (Davis, 1967, p. 49).

2.12 In this area, R & D requirements for the future include the very severe problems of sifting and filtering enormous masses of remotely collected data. For example, "our ability to acquire data is so far ahead of our ability to interpret and manage it that there is some question as to just how far we can go toward realizing the promise of much of this remote sensing. Probably 90% of the data gathered to date have not been utilized, and, with large multisensor programs in the offing, we face the danger of ending up prostrate beneath a mountain of utterly useless films, tapes, and charts." (Parker and Wolff, 1965, p. 31).

2.13 "Purging because of redundancy is extremely difficult to accomplish by computer program except in the case of 100% duplication. Redundancy purging success is keyed to practices of standardization, normalization, field formatting, abbreviation conventions and the like. As a case in point, document handling systems universally have problems with respect to bibliographic citation conventions, transliterations of proper names, periodical title abbreviations, corporate author listing practices and the like." (Davis, 1967, p. 20).

See also Ebersole (1965), Penner (1965), and Sawin (1965), who points to some of the difficulties with respect to a bibliographic collection or file, as follows:

"1. Actual errors, such as incorrect spelling of words, incorrect report of pagination, in one or more of the duplicates. The error may be mechanically or humanly generated; the error may have been made in the source bibliog

raphy, or by project staff in transcription from source to paper tape. In any case, error is a factor in reducing the possibility of identity of duplicates.

"2. Variations among bibliographies both in style and content. A bibliographical citation gives several different kinds of information; that is, it contains several 'elements,' such as author of item, title, publication data, reviews and annotations. Each source bibliography more or less consistently employs one style for expressing information, but each style differs from every other in some or all of the following ways:

a. number of elements

b. sequence of elements

c. typographical details" (1965, p. 96).
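The dependence of redundancy purging on normalization can be illustrated with a short sketch. It assumes, hypothetically, that each citation has already been split into author, title, and year elements; the names citation_key and find_duplicates are invented for this illustration. It handles only differences of typographical detail, not the actual errors or differing element sets that Sawin lists.

```python
import re
import unicodedata


def _norm(text):
    """Strip accents, lowercase, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def citation_key(author, title, year):
    """Build a matching key from three citation elements (assumed layout)."""
    surname = _norm(author.split(",")[0])  # text before the first comma
    return (surname, _norm(title), re.sub(r"\D", "", year))


def find_duplicates(records):
    """Return (first_index, duplicate_index) pairs for records whose keys match."""
    seen = {}
    duplicates = []
    for index, record in enumerate(records):
        key = citation_key(*record)
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates
```

Two entries that differ only in capitalization, punctuation, or the form of the given name produce the same key. A misspelled title does not, which is why duplication short of 100% is so hard to purge by program.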

2.14 "File integrity can often be a significant motivation for mechanization. To insure file integrity in airline maintenance records, files have been republished monthly in cartridge roll-microfilm form, since mechanics would not properly insert update sheets in maintenance manuals. Freemont Rider's original concept for the microcard, which was a combination of a catalog card and document in one record, failed in part because of the lack of file integrity. Every librarian knows that if there wasn't a rod through the hole in the catalog card they would not be able to maintain the integrity of the card catalog." (Tauber, 1966, p. 277).

2.15 "Retirement of outmoded data is the only long-range effective means of maintaining an efficient system." (Miller et al., 1960, p. 54).

With respect to maintenance processes involving the deletion of obsolete items, there are substantial fact finding research requirements for large-scale documentary item systems in terms of establishing efficient but realistic criteria for "purging". Kessler comments on this point as follows: "It is not just a matter of throwing away 'bad' papers as 'good' ones come along. The scientific literature is unique in that its best examples may have a rather short life of utility. A worker in the field of photoelectricity need not ordinarily be referred to Einstein's original paper on the subject. The purging of the system must be based on criteria of operational relevance rather than intrinsic value. These criteria are largely unknown to us and represent another basic area in need of research and invention." (1960, pp. 9-10).

"Chronological cutoff is that device attempted most frequently in automated information systems. It is employed successfully in real-time systems such as aircraft or satellite tracking or airline reservations systems where the information is useless after very short time intervals and where it is so voluminous as to be prohibitive for future analyses . . .

"That purging which is done is primarily replaceData management or file management

ment.

systems are generally programmed so that upon proper identification of an item during the manual input process it may replace an item already in the system data bank. The purpose of replacement as a purging device is not volume control. It is for purposes of accuracy, reliability or timeliness controls." (Davis, 1967, p. 15).

"The reluctance to purge has been a leading reason for accentuating file storage hierarchy considerations. Multi-level deactivation of information is substituted for purging. Deactivation proceeds through allocating the material so specified first to slower random-access storage devices and then to sequentially-accessed storage devices with decreasing rates of access all on-line with the computer. As the last step of deactivation the information is stored in off-line stores . . .

"Automatic purging algorithms have been written for at least one military information system and for SDC's time-sharing system. . . In the military system... the purging program written allowed all dated units of information to be scanned and those prior to a prescribed date to be deleted and transcribed onto a magnetic tape for printing. The information thus nominated for purging was reviewed manually. If the programmed purge decision was overridden by a manual decision the falsely purged data then had to be re-entered into system files as would any newly received data." (Davis, 1967, pp. 16-18).

"Automatic purging algorithms have been explored for the past three years. The current scheme attempts to dynamically maintain a 10 percent disc vacancy factor by automatically deleting the oldest files first. User options are provided which permit automatic dumping of files on a backup, inactive file tape . . . prior to deletion." (Schwartz and Weissman, 1967, p. 267).

"The newer time-sharing systems contemplate a hierarchy of file storage, with 'percolation' algorithms replacing purging algorithms. Files will be in constant motion, some moving 'down' into higher-volume, slower-speed bulk store, while others move 'up' into lower-volume, higherspeed memory- all as a function of age and reference frequency." (Schwartz and Weissman, 1967, p. 267).

2.16 "Some computer-oriented statistics are provided to assist in monitoring the system with minimum cost or time. Such statistics are tape length and length of record, checks on dictionary code number assignment, frequency of additions or deletions to the dictionary, and checks to see that the correct inverted file was updated." (Smith and Jones, 1966, p. 190).

"Usage statistics as obsolescence criteria are commonly employed in scientific and technical information systems and reference data systems . . . "Usage statistics are also used in the deactivation process to organize file data in terms of its reference frequency. The Russian-to-English automated translation system at the Foreign Technology Divi

sion, Wright-Patterson AFB had its file system organized on this basis by IBM in the early 1960's. It was found from surveys of manual translators that the majority of vocabulary references were to less than one thousand words. These were isolated and located in the fastest-access memory: the rest of the dictionary was then relegated to lower priority locations . . ." (Davis, 1967, pp. 18-19).
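Organizing a file by reference frequency, as in the dictionary example above, amounts to ranking entries by usage count and placing the top of the ranking in the fastest store. The following is a minimal sketch under that reading. The function name split_by_usage and the idea of a single reference log are assumptions, not details of the Foreign Technology Division system.

```python
from collections import Counter


def split_by_usage(reference_log, fast_capacity):
    """Split entries into (fast, slow) sets by how often they are referenced.

    reference_log: iterable of entry identifiers, one per lookup.
    fast_capacity: number of entries the fast store can hold.
    """
    counts = Counter(reference_log)
    ranked = [entry for entry, _ in counts.most_common()]
    return set(ranked[:fast_capacity]), set(ranked[fast_capacity:])


fast, slow = split_by_usage(["the", "of", "the", "radar", "of", "the"], 2)
print(sorted(fast), sorted(slow))
```

Entries that never appear in the log are not placed at all. A fuller system would default them to the slowest level.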

"The network might show publications being permanently retained at a particular location. This would allow others in the network to dispose of little-used materials and still have access to a copy if the unexpected need arose.

"Such an archival' copy could, of course, be relocated to a relatively low-cost warehouse area for the mutual benefit of those agencies in the network. Statistics on frequency of usage might be very helpful in identifying inactive materials, and the network could also fill this need." (Brown et al., 1967, p. 66).

"Periodic reports to users on file activity may reveal possible misuse or tampering." (Petersen and Turn, 1967, p. 293).

2.17 "Accessibility. For a system output a measure of how readily the proper information was made available to the requesting user on the desired medium." (Davis, 1964, p. 469).

2.18 Consider also the following:

"The system study will consider that the document-retrieval problem lies primarily within the parameters of file integrity; activity and activity distribution; man-file interaction; the size, nature and organization of the file; its location and workplace layout; whether it is centralized or decentralized; access cycle time; and cost. Contributing factors are purging and update; archival considerations; indexing; type of response; peak-hour, peak-minute activity; permissable-error rates; and publishing urgency." (Tauber, 1966, p. 274).

Then there are questions of sequential decisionmaking and of time considerations generally. "Time consideration is explicitly, although informally, introduced by van Wijngaarden as 'the value of a text so far read'. Apart from other merits of van Wijngaarden's approach and his stressing the interaction between syntax and semantics, we would like to draw attention to the concept of 'value at time t', which seems to be a really basic concept in programming theory." (Caracciolo di Forino, 1965, p. 226). We note further that "T as the time a fact assertion is reported must be distinguished from the time of the fact history referred to by the assertion." (Travis, 1963, p. 334).

Avram et al. point more prosaically to practical problems in mechanized bibliographic reference data handling, as in the case of different types of searches on date: The case of requesting all works on, say, genetics, written since 1960 as against that of all works on genetics published since 1960 with respect to post-1960 reprints of pre-1960 original texts.

For the future, moreover, "In some instances, the search request would have to take into account which data has been used in the fixed field. For example, should one want a display of all the books in Hebrew published during a specific time frame, an adjustment would have to be made to the date in the search request to compensate for the adjustment made to the data at input time." (Avram et al., 1965, p. 42).

2.19 "Here you run into the phenomenon of the 'elastic ruler'. At the time when certain data were accumulated, the measurements were made with a standard inch or standard meter . . . whether researchers were using an inch standardized before a certain date, or one adopted later." (Birch, 1966, p. 165).

2.20 "Large libraries face the problem of converting records that exist in many languages. The most complete discussion of this problem to date is by Cain & Jolliffe of the British Museum. They suggest methods for encoding different languages and speculate on the extent to which certain transliterations could be done by machine. The possibility of storing certain exotic languages on videotapes is suggested as a way of handling the printing problem. At the Brasenose Conference at which this paper was presented, the authors analyzed the difficulties in bibliographic searching caused by transliteration of languages (this is the scheme most generally suggested by those in the data processing field)." (Markuson, 1967, p. 268).

2.21 "The question of integrity of information within an automated system is infrequently addressed." (Davis, 1967, p. 13).

"No adequate reference service exists that would allow users to determine easily whether or not records have the characteristics of quality and compatibility that are appropriate to their analytical requirements." (Dunn, 1967, p. 22).

2.22 "Controls through 'common sense' or logical checks . . . include the use of allowable numerical bounds such as checking bearings by assuming them to be bounded by 0° as a minimum and 360° as a maximum. They include consistency checks using redundant information fields such as social security number matched against aircraft type and aircraft speed. They also include current awareness checks such as matches of diplomat by name against reported location by city against known itinerary against known political views." (Davis, 1967, p. 36).
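A check against allowable numerical bounds, of the kind Davis describes for bearings, can be sketched as below. The bearing limits of 0 and 360 degrees come from the excerpt. The second field, its limits, and the names BOUNDS and out_of_bounds are assumptions added for illustration.

```python
# Allowable bounds per field. The bearing limits follow the excerpt;
# the speed field and its limits are hypothetical.
BOUNDS = {
    "bearing_deg": (0.0, 360.0),
    "speed_knots": (0.0, 2000.0),
}


def out_of_bounds(record):
    """Return the names of fields that are missing or outside their bounds."""
    errors = []
    for field, (low, high) in BOUNDS.items():
        value = record.get(field)
        if value is None or not (low <= value <= high):
            errors.append(field)
    return errors


print(out_of_bounds({"bearing_deg": 365.0, "speed_knots": 450.0}))
```

A record that fails such a check would be listed for review rather than silently corrected.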

"A quite different kind of work is involved in examing for internal consistency the reports from the more than 3 million establishments covered in the 1954 Censuses of Manufacturers and Business. If these reports were all complete and selfconsistent and if we were smart enough to foresee all the problems involved in classifying them, and if we made no errors in our office work, the job of getting out the Census reports would be laborious. but straightforward. Unfortunately, some of the reports do contain omissions, errors, and evidence of misunderstanding. By checking for such incon

sistencies we eliminate, for example, the large errors that would result when something has been improperly reported in pounds instead of in thousands of pounds. Perhaps one-third to one-half of the time our UNIVACS devote to processing these Censuses will be spent checking for such inconsistencies and eliminating them.

"Similar checking procedures are applied to the approximately 7,000 product lines for which we have reports. In a like manner we check to see whether such relationships as annual man hours and number of production workers, or value of shipments and cost of labor and materials, are within reasonable limts for the industry and area involved.

"For example, the computer might determine for an establishment classified as a jewelry repair shop, that employees' salaries amounted to less than 10 percent of total receipts. For this kind of service trade, expenditures for labor usually represent the major item of expenses and less than 10 percent for salaries is uncommonly low. Our computer would list this case for inspection, and a review of the report might result in a change in classification from 'jewelry repair shop to retail jewelry store', for example." (Hansen and McPherson, 1956, pp. 59-60).

2.23 "The use of logical systems for error control is in beginning primitive stages. Question-answering systems and inference-derivation programs may find their most value as error control procedures rather than as query programs or problem-solving programs." (Davis, 1967, p. 47).

"A theoretically significant result of introducing source indicators and reliability indicators to be carried along with fact assertions in an SFQA [question-answering] system is that they provide a basis for applying purifying programs to the fact assertions stored in the system-i.e., for resolving contradictions among different assertions, for culling out unreliable assertions, etc.

...

"Reliability information might indicate such things as: S's degree of confidence in his own report if S is a person; S's probable error if S is a measuring instrument; S's dependability as determined by whether later experience confirmed S's earlier reports; conditions under which S made its report, etc." (Travis, 1963, p. 333).

2.24 "Another interesting distinction can be made between files on the basis of their accuracy. A clean file is a collection of entries, each of which was precisely correct at the time of its inclusion in the file. On the other hand, a dirty file is a file that contains a significant portion of errors. A recirculating file is purged and cleansed as it cycles - a utility-company billing file is of this nature. After the file 'settles down,' the proportion of errors imbedded in the file is a function of the new activity applied to the file. The error rate is normalized with respect to the business cycle." (Patrick and Black, 1964, p. 39).

"When messages are a major source of the information entering the system corrections to a previously transmitted original message can be received before the original message itself. If entered on an earlier update cycle the correction data can actually be 'corrected' during a later update cycle by the original incorrect message." (Davis, 1967, p. 24).

2.25 "Errors will occur in every data collection system, so it is important to detect and correct as many of the errors as possible." (Hillegass and Melick, 1967, p. 56).

"The primary purpose of a data communications system is to transmit useful information from one location to another. To be useful, the received copy of the transmitted data must constitute an accurate representation of the original input data, within the accuracy limits dictated by the application requirements and the necessary economic tradeoffs. Errors will occur in every data communications system. This basic truth must be kept in mind throughout the design of every system. Important criteria for evaluating the performance of any communications system are its degree of freedom from data errors, its probability of detecting the errors that do occur, and its efficiency in overcoming the effects of these errors." (Reagan, 1966, p. 26). "The form of the control established, as a result of the investigation, should be decided only after considering each situation in the light of the three control concepts mentioned earlier. Procedures, such as key verification, batch totals, sight verification, or printed listings should be used only when they meet the criteria of reasonableness, in light of the degree of control required and the cost of providing control in relation to the importance and volume of data involved. The objective is to establish appropriate control procedures. The manner in which this is done-i.e., the particular combination of control techniques used in a given set circumstances - will be up to the ingenuity of the individual systems, designer." (Baker and Kane, 1966, pp. 99-100).

2.26 "Two basic types of codes are found suitable for the burst type errors. The first is the forward-acting Hagelbarger code which allows fairly simple data encoding and decoding with provisions for various degrees of error size correction and error size detection. These codes, however, involve up to 50 percent redundancy in the transmitted information. The second code type is the cyclic code of the Bose-Chaudhuri type which again is fairly simple to encode and can detect various error burst sizes with relatively low redundancy. This code type is relatively simple to decode for error detection but is too expensive to decode for error correction, and makes retransmission the only alternative." (Hickey, 1966, p. 182).

2.27 "Research devoted to finding ways to further reduce the possibility of errors is progressing on many fronts. Bell Telephone Laboratories is approaching the problem from three angles: error detection only, error detection and correction with a non-constant speed of end-to-end data transfer (during the correction cycle transmission stops), and error detection and correction with a constant speed of end-to-end data transfer (during the correction cycle transmission continues)." (Menkhaus, 1967, p. 35).

"There are two other potential 'error injectors' which should be given close attention, since more control can be exercised over these areas. They are: the data collection, conversion and input devices, and the human being, or beings, who collect the data (or program a machine to do it) at the source. Bell estimates that the human will commit an average of 1,000 errors per million characters handled, the mechanical device will commit 100 per million, and the electronic component, 10 per million.

"Error detection and correction capability is a 'must' in the Met Life system and this is provided in several ways. The input documents have Honeywell's Orthocode format, which uses five rows of bar codes and several columns of correction codes that make defacement or incorrect reading virtually impossible; the control codes also help regenerate partially obliterated data. . . .

"Transmission errors are detected by using a dual pulse code that, in effect, transmits the signals for a message and also the components of those signals, providing a double check on accuracy. The paper tape reader, used to transmit data, is bi-directional; if a message contains a large number of errors, due possibly to transmission noise, the equipment in the head office detects those errors and automatically tells the transmitting machine to 'back up and start over'." (Menkhaus, 1967, p. 35).

2.28 "Input interlocks - checks which verify that the correct types and amounts of data have been inserted, in the correct sequence, for each transaction. Such checks can detect many procedural errors committed by persons entering input data into the system." (Hillegass and Melick, 1967, p. 56).

2.29 "Parity - addition of either a 'zero' or 'one' bit to each character code so that the total number of 'one' bits in every transmitted character code will be either odd or even. Character parity checking can detect most single-bit transmission errors, but it will not detect the loss of two bits or of an entire character." (Hillegass and Melick, 1967, p. 56).
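Character parity as defined above can be sketched in a few lines. The sketch assumes a 7-bit character code with the parity bit placed in the eighth bit position; that layout and the function names add_parity and parity_ok are assumptions for illustration.

```python
def add_parity(code, even=True):
    """Return a 7-bit code with a parity bit added as bit 7 (assumed layout)."""
    ones = bin(code).count("1")
    parity_bit = ones % 2 if even else (ones + 1) % 2
    return code | (parity_bit << 7)


def parity_ok(word, even=True):
    """Check that the count of one-bits in the word has the agreed parity."""
    ones = bin(word).count("1")
    return ones % 2 == (0 if even else 1)


word = add_parity(0b1000001)          # the 7-bit code has two one-bits
print(parity_ok(word))                # intact character passes
print(parity_ok(word ^ 0b0000100))    # one bit changed: detected
print(parity_ok(word ^ 0b0000110))    # two bits changed: passes undetected
```

The last line shows the limitation noted in the excerpt: an error affecting two bits of the same character leaves the parity unchanged.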

"Two of the most popular error detection and correction devices on the market-Tally's System 311 and Digitronics' D500 Series - use retransmission as a correction device. Both transmit blocks of characters and make appropriate checks for valid parity. If the parity generated at the transmitter checks with that which has been created from the received message by the receiver, the transmission continues. If the parity check fails, the last block is retransmitted and checked again for parity. This method avoids the disadvantages of transmitting

the entire message twice and of having to compare the second message with the first for validity." (Davenport, 1966, p. 31).

"Full error detection and correction is provided. The telephone line can be severed and reattached hours later without loss of data. . . Error detection is accomplished by a horizontal and vertical parity bit scheme similar to that employed on magnetic tape.” (Lynch, 1966, p. 119).

"A technique that has proven highly successful is to group the eight-level characters into blocks. of eighty-four characters. One of the eighty-four characters represents a parity character, assuring that the summation of each of the 84 bits at each of eight levels is either always odd or always even. For the block, there is now a vertical parity check (the character parity) and a horizontal parity check (the block parity character). This dual parity check will be invalidated only when an even number of characters within the block have an even number of hits, each at the same level. The probability of such an occurrence is so minute that we can state that the probability of an undetected error is negligible. In an 84-character block, constituting 672 bits, 83+8=91 bits are redundant. Thus, at the expense of adding redundancy of 13.5 per cent, we have assured error-free transmission. At least we know that we can detect errors with certainty. Now, let us see how we can utilize this knowledge to provide error-free data at high transmission rates. One of the most straightforward techniques is to transmit data in blocks, automatically checking each block for horizontal and vertical parity at the receiving terminal. If the block parities check, the receiving terminal delivers the block and an acknowledgment character (ACK) is automatically transmitted back to the sending terminal. This releases the next block and the procedure is repeated. If the block parities do not check, the receiving terminal discards the block and a nonacknowledgment character (NACK) is returned to the sender. Then, the same block is retransmitted. This procedure requires that storage capacity for a minimum of one data block be provided at both sending and receiving terminals." (Rider, 1967, p. 134).

2.30 "What then can we say that will summarize the position of the check digit? We can say that it is useful for control fields - that is, those fields we access by and sort on, customer number, employee number, etc. We can go further and say that it really matters only with certain control fields, not all. With control fields, the keys by which we find and access records, it is essential that they be correct if we are to find the correct record. If they are incorrect through juxtaposition or other errors in transcription, we will 1) not find the record, and 2) find and process the wrong record.

...

"One of the most novel uses of the check digit can be seen in the IBM 1287 optical scanner. The writer enters his control field followed by the

check digit. If one of his characters is not clear, the machine looks at the check digit, carries out its arithmetic on the legible characters, and subtracts the result from the result that would give the check digit to establish the character in doubt. It then rebuilds this character." (Rothery, 1967, p. 59.)
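The rebuilding of a doubtful character from the check digit can be illustrated with a small sketch. The excerpt does not give the scanner's arithmetic, so the weighted modulo-10 scheme used here (weights 3, 1, 3, 1, ... from the left) is purely an assumption, as are the function names. The sketch also assumes that exactly one character is illegible, marked by None.

```python
def check_digit(digits):
    """Weighted modulo-10 check digit; weights 3,1,3,1,... (assumed scheme)."""
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(digits))
    return (10 - total % 10) % 10


def rebuild(field, check):
    """Recover the single illegible digit (None) in `field` from the check digit.

    Returns the digit, or None if no unique candidate exists.
    """
    position = field.index(None)
    candidates = [
        d for d in range(10)
        if check_digit(field[:position] + [d] + field[position + 1:]) == check
    ]
    return candidates[0] if len(candidates) == 1 else None


check = check_digit([4, 7, 1, 1, 5])
print(check, rebuild([4, None, 1, 1, 5], check))
```

Because both weights are coprime to 10, exactly one digit satisfies the check for any single doubtful position. With other weights the candidate might not be unique.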

2.31 "A hash total works in the following way. Most of our larger computers can consider alphabetic information as data. These data are added up, just as if they were numeric information, and a meaningless total produced. Since the high-speed electronics are very reliable, they should produce the same meaningless number every time the same data fields are summed. The transfer of information within the computer and to and from the various input/output units can be checked by recomputing this sum after every transmission and checking against the previous total. . . .

"Some computers have special instructions built into them to facilitate this check, whereas others accomplish it through programming. The file designer considers the hash total as a form of builtin audit. Whenever the file is updated, the hash totals are also updated. Whenever a tape is read, the totals are reconstituted as an error check. Whenever an error is found, the operation is repeated to determine if a random error has occurred. If the information is erroneous, an alarm is sounded and machine repair is scheduled. If information has been actually lost, then human assistance is usually required to reconstitute the file to its correct content. Through a combination of hardware and programming the validity of large reference files can be maintained even though the file is subject to repeated usage." (Patrick and Black, 1964, pp. 46–47).

2.32 "Message length - checks which involve a comparison of the number of characters as specified for that particular type of transaction. Message length checks can detect many errors arising from both improper data entry and equipment or line malfunctions." (Hillegass and Melick, 1967, p. 56).
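A message length check reduces to comparing the received character count with the count specified for the transaction type. In the sketch below the transaction types, their lengths, and the function name length_ok are hypothetical.

```python
# Specified message length per transaction type (hypothetical values).
EXPECTED_LENGTH = {"DEPOSIT": 24, "INQUIRY": 12}


def length_ok(kind, message):
    """True if the message has exactly the length specified for its type."""
    expected = EXPECTED_LENGTH.get(kind)
    return expected is not None and len(message) == expected


print(length_ok("INQUIRY", "ACCT00012345"))   # 12 characters
print(length_ok("INQUIRY", "ACCT0001234"))    # one character lost
```

An unknown transaction type is treated as a failure, since no length has been specified for it.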

2.33 "In general, many standard techniques such as check digits, hash totals, and format checks can be used to verify correct input and transmission. These checks are performed at the computer site. The nature and extent of the checks will depend on the capabilities of the computer associated with the response unit. One effective technique is to have the unit respond with a verbal repetition of the input data." (Melick, 1966, p. 60).

2.34 "Philco has a contract for building what is called a Spelling-Corrector . . . It reads text and matches it against the dictionary to find out whether the words are spelled correctly." (Gibbs and MacPhail, 1964, p. 102).
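Matching text against a dictionary, as the excerpt describes, can be sketched as below. This shows only the detection step. How the Philco device proposed corrections is not stated in the source, and the function name and the sample word list are assumptions.

```python
import re


def words_not_in_dictionary(text, dictionary):
    """Return the words of `text` that do not appear in `dictionary`.

    dictionary: a set of correctly spelled words in lowercase.
    """
    words = re.findall(r"[A-Za-z']+", text)
    return [word for word in words if word.lower() not in dictionary]


master_list = {"the", "file", "was", "purged"}
print(words_not_in_dictionary("The file was purgd", master_list))
```

A correctly spelled word that is absent from the list is reported as well, so the value of the check depends on the coverage of the master list.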

"Following keypunching, the information retrieval technician processes the data using a 1401 computer. The computer performs sequence checking, editing, autoproofing (each word of input is checked against a master list of correctly spelled words
