…make up the design-verification process range from analysis and simulation on paper to full-scale system testing.” (Jacobs, 1964, p. 44).

2.49 “Measurement of the system was a major area which was not initially recognized. It was necessary to develop the tools to gather data and introduce program changes to generate counts and parameters of importance. Future systems designers should give this area more attention in the design phase to permit more efficient data collection.” (Evans, 1967, p. 83.)

2.50 “[The user] is given several control statistics which tell him the amount of dispersion in each category, the amount of overlap of each category with every other category, and the discriminating power of the variables . . . These statistics are based on the sample of documents that he assigns to each category . . . Various users of an identical set of documents can thus derive their own structure of subjects from their individual points of view.” (Williams, 1965, p. 219).

2.51 “We will probably see a trend toward the concept of a computer as a collection of memories, buses and processors with distributed control of their assignments on a dynamic basis.” (Clippinger, 1965, p. 209).

“Both Dr. Gilbert C. McCann of Cal. Tech and Dr. Edward E. David, Jr., of Bell Telephone Laboratories stressed the need for hierarchies of computers interconnected in large systems to perform the many tasks of a time-sharing system.” (Commun. ACM 9, 645 (Aug. 1966).)

2.52 “Every part of the system should consist of a pool of functionally identical units (memories, processors and so on) that can operate independently and can be used interchangeably or simultaneously at all times . . .

“Moreover, the availability of duplicate units would simplify the problem of queuing and the allocation of time and space to users.” (Fano and Corbató, 1966, pp. 134-135).

“Time-sharing demands high system reliability and maintainability, encourages redundant, modular, system design, and emphasizes high-volume storage (both core and auxiliary) with highly parallel system operation.” (Gallenson and Weissman, 1965, p. 14).

“A properly organized multiple processor system provides great reliability (and the prospect of continuous operation) since a processor may be trivially added to or removed from the system. A processor undergoing repair or preventive maintenance merely lowers the capacity of the system, rather than rendering the system useless.” (Saltzer, 1966, p. 2).

“Greater modularity of the systems will mean easier, quicker diagnosis and replacement of faulty parts.” (Pyke, 1967, p. 162).

“To meet the requirements of flexibility of capacity and of reliability, the most natural form . . . is as a modular multiprocessor system arranged so that processors, memory modules and file storage units may be added, removed or replaced in accordance with changing requirements.” (Dennis and Van Horn, 1965, p. 4). See also notes 5.83, 5.84.

2.53 “The actual execution of data movement commands should be asynchronous with the main processing operation. It should be an excellent use of parallel processing capability.” (Opler, 1965, p. 276).

2.54 “Work currently in progress (at Western Data Processing Center, UCLA) includes: investigations of intra-job parallel processing which will attempt to produce quantitative evaluations of component utilization; the increase in complexity of the task of programming; and the feasibility of compilers which perform the analysis necessary to convert sequential programs into parallel-path programs.” (Dig. Computer Newsletter 16, No. 4, 21 (1964).)

2.55 “The motivation for encouraging the use of parallelism in a computation is not so much to make a particular computation run more efficiently as it is to relax constraints on the order in which parts of a computation are carried out. A multi-program scheduling algorithm should then be able to take advantage of this extra freedom to allocate system resources with greater efficiency.” (Dennis and Van Horn, 1965, pp. 19-20).

2.56 Amdahl remarks that “the principal motivations for multiplicity of components functioning in an on-line system are to provide increased capacity or increased availability or both.” (1965, p. 38). He notes further that “by pooling, the number of components provided need not be large enough to accommodate peak requirements occurring concurrently in each computer, but may instead accommodate a peak in one occurring at the same time as an average requirement in the other.” (Amdahl, 1965, pp. 38-39).
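Amdahl's pooling arithmetic can be made concrete. The sketch below is a hypothetical sizing example in present-day code, not drawn from the cited paper: when the peaks of two machines never coincide, a shared pool sized for one peak plus one average load replaces two peak-sized allocations.

```python
# Hypothetical sizing illustration of Amdahl's pooling remark (note 2.56).
# Two computers, each with an average and a (non-coincident) peak demand,
# expressed in interchangeable component units (e.g., memory modules).

average_demand = 6   # units each computer needs on average
peak_demand = 10     # units each computer needs at its peak

# Dedicated components: each computer must own enough for its own peak.
dedicated = 2 * peak_demand                     # 20 units

# Pooled components: if the peaks never coincide, the pool need only cover
# one computer's peak plus the other's average requirement.
pooled = peak_demand + average_demand           # 16 units

print(f"dedicated sizing: {dedicated} units")   # dedicated sizing: 20 units
print(f"pooled sizing:    {pooled} units")      # pooled sizing:    16 units
```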

2.57 “No large system is a static entity-it must be capable of expansion of capacity and alteration of function to meet new and unforeseen requirements.” (Dennis and Glaser, 1965, p. 5).

“Changing objectives, increased demands for use, added functions, improved algorithms and new technologies all call for flexible evolution of the system, both as a configuration of equipment and as a collection of programs.” (Dennis and Van Horn, 1965, p. 4).

“A design problem of a slightly different character, but one that deserves considerable emphasis, is the development of a system that is ‘open-ended’; i.e., one that is capable of expansion to handle new plants or offices, higher volumes of traffic, new applications, and other difficult-to-foresee developments associated with the growth of the business. The design and implementation of a data communications system is a major investment; proper planning at design time to provide for future growth will safeguard this investment.” (Reagan, 1966, p. 24).

2.58 “Reconfiguration is used for two prime purposes: to remove a unit from the system for service or because of malfunction, or to reconfigure the system either because of the malfunction of one of the units or to ‘partition’ the system so as to have two or more independent systems. In this last case, partitioning would be used either to debug a new system supervisor or perhaps to aid in the diagnostic analysis of a hardware malfunction where more than a single system component were needed.” (Glaser et al., 1965, p. 202.)

"Often, failure of a portion of the system to provide services can entail serious consequences to the system users. Thus severe reliability standards are placed on the system hardware. Many of these systems must be capable of providing service to a range in the number of users and must be able to grow as the system finds more users. Thus, one finds the need for modularity to meet these demands. Finally, as these systems are used, they must be capable of change so that they can be adapted to the ever changing and wide variety of requirements, problems, formats, codes and other characteristics of their users. As a result general-purpose stored program computers should be used wherever possible.” (Cohler and Rubenstein, 1964, p. 175).

2.59 “On-line systems are still in their early development stage, but now that systems are beginning to work, I think that it is obvious that more attention should be paid to the fail safe aspects of the problem.” (Huskey, 1965, p. 141).

"From our experience we have concluded that system reliability must provide for several levels of failure leading to the term 'fail-soft' rather than 'fail-safe’.” (Baruch, 1967, p. 147).

Related terms are “graceful degradation” and “high availability”, as follows:

“The military is becoming increasingly interested in multiprocessors organized to exhibit the property of graceful degradation. This means that when one of them fails, the others can recognize this and pick up the work load of the one that failed, continuing this process until all of them have failed." (Clippinger, 1965, p. 210).
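In present-day terms, the behavior Clippinger describes is a shrinking worker pool. The following sketch is a generic illustration (the structure and names are invented, not from the cited paper): surviving processors divide the failed unit's share of the load among themselves until none remain.

```python
# Generic sketch of graceful degradation: when a processor fails, the
# remaining processors pick up its share of the work load.

processors = {"p1", "p2", "p3"}
workload = 90                      # total work units to be carried

def shares() -> dict:
    """Divide the load evenly over the processors still alive."""
    alive = sorted(processors)
    return {p: workload / len(alive) for p in alive}

print(shares())                    # 30 units each on three processors

processors.discard("p2")           # p2 fails; the others recognize this
print(shares())                    # 45 units each: degraded but running

processors.discard("p1")
print(shares())                    # 90 units on the last survivor
```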

"The term 'high availability' (like its synonym 'fail safe' has now become a cliche, and lacks any precise meaning. It connotes a system characteristic which permits recovery from all hardware errors. Specifically, it appears to promise that critical system and user data will not be destroyed, that system and job restarts will be minimized and that critical jobs can most surely be executed, despite failing hardware. If this is so, then multiprocessing per se aids in only one of the three characteristics of high availability." (Witt, 1968, p. 699).

"The structure of a multi-computer system planned for high availability is principally determined by the permissible reconfiguration time and the ability to fail safely or softly. The multiplicity and modularity of system components should be chosen to provide the most economical realization of these requirements ...

“A multi-computer system which can perform the full set of tasks in the presence of a single malfunction is fail-safe. Such a system requires at least one more unit of each type of system component, with the interconnection circuitry to permit it to replace any of its type in any configuration . . .

“A multi-computer system which can perform a satisfactory subset of its tasks in the presence of a malfunction is fail-soft. The set of tasks which must still be performed to provide a satisfactory, though degraded, level of operation determines the minimum number of each component required after a failure of one of its type.” (Amdahl, 1965, p. 39).

“Systems are designed to provide either full service or graceful degradation in the face of failures that would normally cause operations to cease. A standby computer, extra mass storage devices, auxiliary power sources to protect against public utility failure, and extra peripherals and communication lines are sometimes used. Manual or automatic switching of spare peripherals between processors may also be provided." (Bonn, 1966, p. 1865).

2.60 "A third main feature of the communication system being described is high reliability. The emphasis here is not just on dependable hardware but on techniques to preserve the integrity of the data as it moves from entry device, through the temporary storage and data modes, over the transmission lines and eventually to computer tape or hard copy printer." (Hickey, 1966, p. 181.)

2.61 In addition to the examples cited in the discussion of client and system protection in the previous report in this series (on processing, storage, and output requirements, Section 2.2.4), we note the following:

“The primary objective of an evolving special-purpose time-sharing system is to provide a real service for people who are generally not computer programmers and furthermore depend on the system to perform their duties. Therefore the biggest operational problem is reliability. Because the data attached to a special-purpose system are important and also must be maintained for a long time, reliability is doubly crucial, since errors affecting the data base can not only interrupt users' current procedures but also jeopardize past work.” (Castleman, 1967, p. 17).

“If the system is designed to handle both special-purpose functions and programming development, then why is reliability a problem? It is a problem because in a real operating environment some new ‘dangerous’ programs cannot be tested on the system at the same time that service is in effect. As a result, new software must be checked out during off-hours, with two consequences. First, the system is not subjected to its usual daytime load during checkout time. It is a characteristic of time-shared programs that different ‘bugs’ may appear depending on the conditions of the overall system activity. For example, the ‘time-sharing bug’ of a program manipulating data incorrectly because another program processes the same data at virtually the same time would be unlikely on a lightly loaded system. Second, programmers must simulate at night their counterparts of laymen users. Unfortunately, these two types of people tend to use application programs differently and to make different types of errors; so program debugging is again limited. Therefore, because the same system is used for both service and development, programs checked as rigorously as possible can still cause system failures when they are installed during actual service hours.” (Castleman, 1967, p. 17).

“Protection of a disk system requires that no user be able to modify the system, purposely or inadvertently, thus preserving the integrity of the software. Also, a user must not be able to gain access to, or modify any other user's program or data. Protection in tape systems is accomplished: (1) by making the tape units holding the system records inaccessible to the user, (2) by making the input and output streams one-way (e.g., the input file cannot be backspaced), and (3) by placing a mark in the input stream which only the system can cross. In order to accomplish this, rather elaborate schemes have been devised both in hardware and software to prevent the user from accomplishing certain input-output manipulations. For example, in some hardware, unauthorized attempts at I/O manipulation will interrupt the computer.

“In disk-based systems, comparable protection devices must be employed. Since many different kinds of records (e.g., system input, user scratch area, translators, etc.) can exist in the same physical disk file, integrity protection requires that certain tracks, and not tape units, must be removed from the realm of user access and control. This is usually accomplished by partitioning schemes and central I/O software systems similar to those used in tape-based systems. The designer must be careful to preserve flexibility while guaranteeing protection.” (Rosin, 1966, p. 242).
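The track-partitioning scheme Rosin describes can be sketched in a few lines of present-day code. The names and track ranges below are invented for illustration: a central I/O routine validates every request against the partition assigned to the requesting user, so system tracks and other users' tracks are unreachable.

```python
# Hypothetical sketch of the track-partitioning idea described by Rosin:
# a central I/O layer checks every disk request against the requesting
# user's assigned track range before performing it.

# Partition table: user -> (first_track, last_track), both inclusive.
# System records live on tracks 0-99, which appear in no user's range.
PARTITIONS = {
    "userA": (100, 199),
    "userB": (200, 299),
}

def check_access(user: str, track: int) -> bool:
    """Return True only if `track` lies inside the user's own partition."""
    first, last = PARTITIONS.get(user, (1, 0))  # empty range if unknown user
    return first <= track <= last

# A request for a system track or another user's track is refused.
assert check_access("userA", 150)        # inside userA's partition
assert not check_access("userA", 50)     # system track: refused
assert not check_access("userA", 250)    # userB's track: refused
```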

2.62 “Duplex computers are specified with the spare and active computers sharing I/O devices and key data in storage, so that the spare computer can take over the job on demand.” (Aron, 1967, p. 54).

“It is far better to have the system running at half speed 5% of the time with no 100% failures than to have the system down 2½% of the time.” (Dantine, 1966, p. 409).

“Whenever possible, the two systems run in parallel under the supervision of the automatic recovery program. The operational system performs all required functions and monitors the back-up system. The back-up system constantly repeats a series of diagnostic tests on the computer, memory and other modules available to it and monitors the operational system. These tests are designed to maintain a high level of confidence in these modules so that should a respective counterpart in the operational system fail, the back-up unit can be safely substituted. The back-up system also has the capability of receiving instructions to perform tests on any of its elements and to execute these tests while continuing to monitor the operational system to confirm that the operational system has not hung up.” (Armstrong et al., 1967, p. 409).

“The second channel operates in parallel with the main channel, and the results of the two channels are compared. Both channels must independently arrive at the same answer or operation cannot proceed. The duplication philosophy provides for two independent access arms on the Disk Storage Unit, two core buffers, and redundant power supplies.” (Bowers et al., 1962, p. 109).
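The duplication philosophy Bowers describes reduces to “compute twice, compare, and stop on mismatch.” The sketch below is an invented software analogue of that hardware discipline, not code from the cited system:

```python
# Hypothetical sketch of duplicated-channel checking (Bowers et al.):
# run the same operation through two channels and refuse to proceed
# unless both independently arrive at the same answer.

def duplexed(operation, *args):
    """Execute `operation` twice and stop on disagreement."""
    main_result = operation(*args)
    second_result = operation(*args)  # in hardware, a physically separate channel
    if main_result != second_result:
        raise SystemError("channel mismatch: operation cannot proceed")
    return main_result

# Example use: an addition passed through both channels.
total = duplexed(lambda a, b: a + b, 2, 3)   # returns 5 when channels agree
```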

"Considerable effort has been continuously

“ directed toward practical use of massive triple modular redundancy (TMR) in which logic signals are handled in three identical channels and faults are masked by vote-taking elements distributed throughout the system.” (Avižienis, 1967, p. 735).
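In the two-state case, a vote-taking element of the kind Avižienis describes is a majority function of three channel outputs, and a voted triplet of channels, each correct with probability R, is correct with probability R^3 + 3R^2(1-R). The sketch below is a generic illustration, not the cited hardware:

```python
# Generic sketch of a TMR vote-taking element: three identical channels
# feed a majority vote, so any single faulty channel is masked.

def majority(a: int, b: int, c: int) -> int:
    """Two-out-of-three majority vote on binary signals."""
    return (a & b) | (a & c) | (b & c)

# A single bad channel (here, the third input) is outvoted.
assert majority(1, 1, 0) == 1
assert majority(0, 0, 1) == 0

def tmr_reliability(r: float) -> float:
    """Probability that a voted triplet is correct, given per-channel
    reliability r (a perfect voter is assumed)."""
    return r**3 + 3 * r**2 * (1 - r)

print(tmr_reliability(0.95))  # 0.99275: masking beats a single 0.95 channel
```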

“He must give consideration to 1) back-up power supplies that include the communications gear, 2) dual or split communication cables into his data center, 3) protection of the center and its gear from fire and other hazards, 4) insist that separate facilities via separate routes are used to connect locations on the MIS network, and 5) build extra capacity into the MIS hardware system.” (Dantine, 1966, p. 409).

2.63 "The large number of papers on votetaking redundancy can be traced back to the fundamental paper of Von Neuman where multipleline redundancy was first established as a mathematical reality for the provision of arbitrarily reliable systems." (Short, 1968, p. 4).

2.64 "A computer system contains protective redundancy if faults can be tolerated because of the use of additional components or programs, or the use of more time for the computational tasks. ..

“In the massive (masking) redundancy approach the effect of a faulty component, circuit, signal, subsystem, or system is masked instantaneously by permanently connected and concurrently operating replicas of the faulty element. The level at which replication occurs ranges from individual circuit components to entire self-contained systems.” (Avižienis, 1967, pp. 733-734).

2.65 “An increase in the reliability of systems is frequently obtained in the conventional manner by replicating the important parts several (usually three) times, and a majority vote . . . A technique of diagnosis performed by nonbinary matrices . . . requires, for the same effect, only one duplicated part. This effect is achieved by connecting the described circuit in a periodically changing way to the duplicated part. If one part is disturbed the circuit gives an alarm, localizes the failure and simultaneously switches to the remaining part, so that a fast repair under operating conditions (and without additional measuring instruments) is possible.” (Steinbuch and Piske, 1963, p. 859).

2.66 “Parameters of the model are as follows:

n = total number of modules in the system
m = number of unfailed modules needed for system survival
Pf = probability of failure of each module some time during the mission. This parameter thus includes both the mission duration and the module MTBF.
Pnd = probability of not detecting an occurred module failure
Ps = probability of system survival throughout the mission
PF = 1 - Ps = probability of system failure during the mission
n/m = redundancy factor in initial system.

“Depending upon the attainable Pf and Pnd, the theoretical reliability of a multi-module computing system may be degraded by adding more than a minimal amount of redundancy. For example, Pf = 0.025 . . . it is more reliable to have only one spare module rather than two or four, for a typical current-day Pnd such as 0.075. Even for a Pnd as low as 0.03 (a very difficult Pnd to achieve in a computer), the improvement obtained in system reliability by adding a second spare unit to the system is minor.” (Wyle and Burnett, 1967, pp. 746, 748).

“The probability of system failure . . . is: . . .” (Wyle and Burnett, 1967, p. 748).
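Wyle and Burnett's published formula does not survive in this copy, but their numerical claim can be reproduced under one natural reading of the model. The sketch below is that reconstruction, stated as an assumption rather than as the published equation: the system survives the mission if at most n - m modules fail and every failure that occurs is detected.

```python
# Reconstruction (an assumption, not the published formula) of an
# m-of-n reliability model with imperfect failure detection: the system
# survives if at most n - m modules fail during the mission and each
# failure that does occur is detected (probability 1 - Pnd per failure).
from math import comb

def p_system_failure(n: int, m: int, pf: float, pnd: float) -> float:
    p_survive = sum(
        comb(n, k) * pf**k * (1 - pf) ** (n - k) * (1 - pnd) ** k
        for k in range(n - m + 1)  # k = number of failed modules tolerated
    )
    return 1 - p_survive

# With Pf = 0.025 and Pnd = 0.075, one spare beats two (m = 1 needed),
# matching the quoted claim that extra redundancy can degrade reliability:
print(p_system_failure(2, 1, 0.025, 0.075))  # ~0.0043 (one spare)
print(p_system_failure(3, 1, 0.025, 0.075))  # ~0.0056 (two spares: worse)
```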

2.67 “One of the prime requisites for a reliable, dependable communications data processing system is that it employ features for insuring message protection and for knowing the disposition of every message in the system (message accountability) in case of equipment failures. The degree of message protection and accountability will vary from application to application.” (Probst, 1968, p. 21).

“Elaborate measures are called for to guarantee message protection. At any given moment, a switching center may be in the middle of processing many different messages in both directions. If a malfunction occurs in any storage or processing device, there must be enough information stored elsewhere in the center to analyze the situation, and to repeat whatever steps are necessary. This means that any item of information must be stored in at least two independent places, and that the updating of queue tables and other auxiliary data must be carefully synchronized so that operation can continue smoothly after correction of a malfunction. If it cannot be determined exactly where a transmission was interrupted, procedures should lean toward pessimism. Repetition of a part of a message is less grievous than a loss of part of it.” (Shafritz, 1964, p. N2.3-3).

"Reference copies are kept on magnetic tapes for protective accountability of each message. Random requests for retransmission are met by a computer search of the tape, withdrawal of the required messages and automatic reintroduction of the message into the communications system." (Jacobellis, 1964, p. N2.1-2).

"Every evening, the complete disc file inventory is pruned and saved on tape to be reloaded the fol. lowing day. This gives a 24-hour 'rollback'capability for catastrophic disc failures.” (Schwartz and Weissman, 1967, p. 267).

“It is necessary to provide means whereby the contents of the disc can be reinstated after they have been damaged by system failure. The most straightforward way of doing this is for the disc to be copied on to magnetic tape once or twice a day; re-writing the disc then puts the clock back, but users at least know where they are. Unfortunately, the copying of a large disc consumes a lot of computer time, and it seems essential to develop methods whereby files are copied on to magnetic tape only when they are created or modified. It would be nice to be able to consider the archive and recovery problems as independent, but reasons of efficiency demand that an attempt should be made to develop a satisfactory common system. We have, unfortunately, little experience in this area as yet, and are still groping our way." (Wilkes, 1967, p. 7).
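Wilkes's proposal to copy files “only when they are created or modified” is what would now be called an incremental dump. The sketch below is an invented illustration of that policy (names and structures are not from the cited paper): each cycle archives only the files whose modification time postdates the previous dump.

```python
# Hypothetical sketch of the incremental-copy policy Wilkes describes:
# instead of dumping the whole disc, each cycle archives only the files
# created or modified since the previous dump. (Names are invented.)

last_dump_time = 0.0    # time of the previous dump cycle
archive = {}            # tape image: filename -> archived contents

def incremental_dump(files: dict, mtimes: dict, now: float):
    """Copy to the archive only files modified since the last dump."""
    global last_dump_time
    for name, contents in files.items():
        if mtimes[name] > last_dump_time:   # created or modified since
            archive[name] = contents        # copy just this file to tape
    last_dump_time = now

files = {"a": "v1", "b": "v1"}
mtimes = {"a": 5.0, "b": 6.0}
incremental_dump(files, mtimes, now=10.0)   # first cycle copies both

files["a"] = "v2"; mtimes["a"] = 12.0       # only "a" changes afterwards
incremental_dump(files, mtimes, now=20.0)   # second cycle copies "a" alone
print(archive)                              # {'a': 'v2', 'b': 'v1'}
```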

“Our requirements, therefore, were threefold: security, retrieval, and storage. We investigated various means by which we could meet these requirements; and we decided on the use of microfilm, for two reasons. First, photographic copies of records, including those on microfilm, are acceptable as legal representations of documents. We could photograph our notebooks, store the film in a safe place, and destroy the books or, at least, move them to a larger storage area. Second, we found on the market equipment with which we could film the books and then, with a suitable indexing system, obtain quick retrieval of information from that film.” (Murrill, 1966, p. 52).

“The file system is designed with the presumption that there will be mishaps, so that an automatic file backup mechanism is provided. The backup procedures must be prepared for contingencies ranging from a dropped bit on a magnetic tape to a fire in the computer room.



“Specifically, the following contingencies are provided for:

“1. A user may discover that he has accidentally deleted a recent file and may wish to recover it.
“2. There may be a specific system mishap which causes a particular file to be no longer readable for some ‘inexplicable’ reason.
“3. There may be a total mishap. For example, the disk-memory read heads may irreversibly score the magnetic surfaces so that all disk-stored information is destroyed.

“The general backup mechanism is provided by the system rather than the individual user, for the more reliable the system becomes, the more the user is unable to justify the overhead (or bother) of trying to arrange for the unlikely contingency of a mishap. Thus an individual user needs insurance, and, in fact, this is what is provided.” (Corbató and Vyssotsky, 1965, p. 193).

“Program roll-back for corrective action must be routine or function oriented since it is impractical from a storage requirement point of view to provide corrective action for each instruction. The rollback must be to a point where initial conditions are available from sensors, prestored, or reconstitutable. Even intermittent memory malfunction during access becomes a persistent error since it is immediately rewritten in error. Thus, critical routines or high iteration rate real-time routines (for example, those which perform integration with respect to time) should be stored redundantly so that in the event of malfunction the redundantly stored routine is used to preclude routine malfunction or error buildup with time.” (Bujnoski, 1968, p. 33).
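Bujnoski's prescription, redundant storage of critical routines with fallback on malfunction, can be sketched as follows. The checksum scheme and names are invented for illustration, not taken from the cited report:

```python
# Hypothetical sketch of redundant storage for critical data: two copies
# are kept, each with a checksum, and a corrupted primary is replaced
# from its duplicate before use (see Bujnoski's remark above).
import zlib

def store(value: bytes):
    """Return a (copy, checksum) pair; two such pairs are kept."""
    return bytearray(value), zlib.crc32(value)

primary = store(b"critical routine or table image")
duplicate = store(b"critical routine or table image")

def fetch() -> bytes:
    """Return a verified copy, repairing the primary from the duplicate."""
    global primary
    data, check = primary
    if zlib.crc32(bytes(data)) != check:        # primary corrupted in place
        dup_data, dup_check = duplicate
        assert zlib.crc32(bytes(dup_data)) == dup_check
        primary = (bytearray(dup_data), dup_check)  # rewrite from duplicate
        data = dup_data
    return bytes(data)

primary[0][0] ^= 0xFF       # simulate an intermittent memory malfunction
print(fetch())              # still returns the intact image
```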

2.68 “Restart procedures should be designed into the system from the beginning, and the necessity for the system to spend time in copying vital information from one place to another should be cheerfully accepted. . . .

“Redundant information can be included in supervisor communication or data areas in order to enable errors caused by system failure to be corrected. Even a partial application of this idea could lead to important improvements in restart capability. A system will be judged as much by the efficiencies of its restart procedures as by the facilities that it provides. . . .

“Making it possible for the system to be restarted after a failure with as little loss as possible should be the constant preoccupation of the software designer.” (Wilkes and Needham, 1968, p. 320).

“Procedures must also be prescribed for work with the archive collection to prevent loss or contamination of the master records by tape erasure, statistical adjustment, aggregation or reclassification.” (Glaser et al., 1967, p. 19).

2.69 “Standby equipment costs should receive some consideration, particularly in a cold war situation: duplicate tapes, raw data or semi-processed data. Also consider the possible costs of transporting classified data elsewhere for computation: express, courier, messenger, Brink's service.” (Bush, 1956, p. 110).

“For companies in the middle range, the commercial underground vaults offer excellent facilities at low cost. Installations of this type are available in a number of states, including New York, Pennsylvania, Kansas, Missouri and California. In addition to maximum security, they provide pre-attack clerical services and post-attack conversion facilities. The usual storage charge ranges from $2 to $5 a cubic foot annually, depending on whether community or private storage is desired. ...

“The instructions should detail procedure for converting each vital record to useable form, as well as for utilizing the converted data to perform the desired emergency functions. The language should be as simple as possible and free of 'shop' terms, since inexperienced personnel will probably use the instructions in the postattack.” (Butler, 1962, pp. 65, 67.)

2.70 “The trend away from supporting records is a recent development that has not yet gained widespread acceptance. There is ample evidence, however, that their use will decline rapidly, if the cold war gets uncomfortably hot. Except for isolated areas in their operations, an increasing number of companies are electing to take a calculated risk in safeguarding basic records but not the supporting changes. For example, some of the insurance companies microfilm the basic in-force policy records annually and forego the changes that occur between duplicating cycles. This is a good business risk for two reasons: (1) supporting records are impractical for most emergency operations, and (2) a maximum one-year lag in the microfilm record would not seriously hamper emergency operations.” (Butler, 1962, p. 62.)

“Mass storage devices hold valuable records, and backup is needed in the event of destruction or nonreadability of a record(s). Usually the entire file is copied periodically, and a journal of transactions is kept. If necessary, the file can be reconstructed from an earlier copy plus the journal to date.” (Bonn, 1966, p. 1865).
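Bonn's copy-plus-journal scheme is the ancestor of modern snapshot-and-log recovery. The sketch below is a generic illustration (the record layout and names are invented): the current file is rebuilt by applying the journal of transactions, in order, to the last periodic copy.

```python
# Generic sketch of Bonn's recovery scheme: rebuild a file from its last
# periodic copy plus the journal of transactions kept since that copy.

# The "file" here is a simple key -> value mapping; each journal entry
# records one update or deletion. (Invented layout, for illustration.)
snapshot = {"acct1": 100, "acct2": 250}          # last periodic copy
journal = [
    ("set", "acct1", 120),    # transactions since the copy, in order
    ("set", "acct3", 75),
    ("del", "acct2", None),
]

def reconstruct(copy: dict, log: list) -> dict:
    """Apply journaled transactions, oldest first, to the earlier copy."""
    current = dict(copy)                          # never modify the copy
    for op, key, value in log:
        if op == "set":
            current[key] = value
        elif op == "del":
            current.pop(key, None)
    return current

print(reconstruct(snapshot, journal))  # {'acct1': 120, 'acct3': 75}
```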

2.71 “The life and stability of the (storage] medium under environmental conditions are other considerations to which a great deal of attention must be paid. How long will the medium last? How stable will it be under heat and humidity changes?” (Becker and Hayes, 1963, p. 284).

It must be noted that, in the present state of magnetic tape technology, the average accurate life of tape records is a matter of a few months only. The active master files are typically rewritten on new tapes regularly, as a part of normal updating and maintenance procedures. Special precautions must be undertaken, however, to assure the same for duplicate master tapes, wherever located.

"Security should also be considered in another
