
5. Provide OTC data processing managers with information needed to make informed decisions concerning acquisition of VS systems.

6. Shed light on possible areas of further exploration by Bell Laboratories personnel concerning VS performance factors.

The above goals were, as discussed earlier in this section, conditioned by the hardware restrictions of the system tested. In particular, no comparison of OS and VS throughput was attempted. Rather, the above goals emphasized in-depth probing of VS in its own right (goals 1 and 2), as well as the building of a foundation for a multi-faceted approach to the predictive question itself (goals 3 and 4). Goal 5 was included to meet the immediate needs of OTC data processing managers who must make dollars-and-cents decisions, based on such studies, concerning selection of hardware and software. Finally, any good study results in more new questions being asked than old questions being answered, and goal 6 reflected this fact. In total, these goals represent a rather typical example of goals for a general batch operating system evaluation project.

After these goals were set, the development of a working hypothesis about VS/2 was facilitated by comparison with OS/MVT (on which VS/2 is based). Clearly, the main difference between these systems is in the area of memory management, given VS/2's virtual memory capability. Hence the working hypothesis arrived at was a rather simple one, i.e., that, due to the overheads inherent in managing virtual memory, the paging rates associated with the system would be the main determinant of system overhead and would have a great effect on overall throughput. Implicit in this hypothesis was the assumption that paging would be rather costly (in CPU cycles) in VS/2, and hence that the system would degrade as soon as any significant amount of paging occurred. This hypothesis was later shown to be false, but in the process of designing experiments to test it, much was learned about the system.

Phase 2 of the project was dominated by the development of tools and experiments to test the working hypothesis as well as to answer other questions about the VS/2 system. One highlight of this phase was the development of a software monitor, appropriately named VSMON, which was used both to validate the VS/2 operating system and to measure its performance. The development of VSMON began in the first phase of the project, when specific tables used by the VS/2 Paging Supervisor to record the status of each 4K-byte page of main storage were discovered. These tables (the Page Vector Table and the Page Frame Table) also record the status of various logical queues in the VS/2 operating system. Hence it was decided that, in order to validate and measure VS/2 Paging Supervisor performance, a software monitor would be built to access those tables periodically. The result was a tool useful both in measuring the ownership of real memory by user and system programs and in checking for correct operation of the Paging Supervisor. It is worth noting that in one instance a potentially harmful 'bug' in VS/2 was found using the VSMON monitor; the affected benchmark experiment was rerun after the 'bug' was fixed. The measurements from the monitor have made possible (in combination with log file data and hardware monitor information) a much more complete picture of the VS/2 system than had heretofore been obtained.
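By way of illustration, the following Python sketch shows the general shape of such a table-sampling monitor: a routine periodically reads a frame-ownership table and accumulates counts by owner. The table layout, owner categories, and sampling interval here are invented for illustration and do not reflect the actual VS/2 Page Frame Table format or the VSMON implementation.

```python
# Illustrative sketch only: a periodic-sampling monitor in the spirit of VSMON.
# The frame-table layout, owner names, and interval are hypothetical.
import time
from collections import Counter

def read_page_frame_table():
    """Stand-in for reading the Paging Supervisor's frame table.
    Returns (frame_number, owner) pairs; in VS/2 this data would come from
    privileged system tables, not an ordinary application call."""
    return [(n, "SYSTEM" if n < 64 else "USER") for n in range(256)]  # dummy data

def sample_ownership(samples=5, interval_seconds=2.0):
    """Periodically sample the frame table and accumulate ownership counts."""
    totals = Counter()
    for _ in range(samples):
        for _, owner in read_page_frame_table():
            totals[owner] += 1
        time.sleep(interval_seconds)
    return totals

if __name__ == "__main__":
    counts = sample_ownership(samples=3, interval_seconds=0.1)
    total = sum(counts.values())
    for owner, n in counts.most_common():
        print(f"{owner:8s} {100.0 * n / total:5.1f}% of sampled 4K frames")
```

Averaging such samples over a run gives the per-owner share of real memory; comparing the sampled queue states against what the Paging Supervisor should be doing is what makes the same tool useful for validation.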

Phase 3 of the project was spread over almost two months, during which each of the five major experiments was tested and the VSMON monitor checked out. As a result of this extended 'dress rehearsal', the last phase of the benchmark came off without incident, each experiment operating as expected. As of the time of this writing, documentation of the project is in progress.

We believe that the methodology set forth in Section 1., and the case study presented in Section 2., provide a useful set of guidelines for those in the field who are anticipating general operating system evaluations.

REPORT ON FIPS TASK GROUP 13 WORKLOAD DEFINITION AND BENCHMARKING

David W. Lambert

The MITRE Corporation

ABSTRACT

Benchmark testing, or benchmarking, one of several methods for measuring the performance of computer systems, is the method used in the selection of computer systems and services by the Federal Government. However, present benchmarking techniques not only have a number of known technical deficiencies, but they also represent a significant expense to both the Federal Government and the computer manufacturers involved. Federal Information Processing Standards Task Group 13 has been established to provide a forum and central information exchange on benchmark programs, data, methodology, and problems. The program of work and preliminary findings of Task Group 13 are presented in this paper. The issue of application programs versus synthetic programs within the selection environment is discussed. Significant technical problem areas requiring continuing research and experimentation are identified.

INTRODUCTION

Earlier this year the National Bureau of Standards approved the formation of FIPS Task Group 13 to serve as an interagency forum and central information exchange on benchmark programs, data, methodology, and problems. The principal focus of Task Group 13 is to be on procedures and techniques to increase the technical validity and reduce the cost of benchmarking as used in the selection of computer systems and computer services by the Federal Government.

In May, invitations were issued to Federal Government Agencies, through the interagency Committee on Automatic Data Processing and the FIPS Coordinating and Advisory Committee, for nominees for membership to Task Group 13. Invitations for industry participation were also issued through CBEMA. In response to these invitations we have received 26 nominees from various Federal agencies and 6 from industry. These names are being submitted to the Assistant Secretary of Commerce for approval, and we hope to be able to start holding meetings in late January or early February of next year.

This morning I would like to cover the five main topics shown in the first vugraph (Figure 1). First, a brief review of the approved program of work for Task Group 13. Second, a review of some of the techniques and typical problems involved in the current selection procedures. Next, I will discuss some alternatives to the present methods used to develop benchmark programs and attempt to clarify a major issue in the selection community at this time: application programs or synthetic programs for benchmarking? As a fourth topic, I would like to review the general methodology of synthetic program development and some specific tools and techniques currently being used. I would like to conclude with an overview of some key technical problems in workload definition and benchmarking which require further research and experimentation.

FIGURE 1

PROGRAM OF WORK

The presently approved program of work is summarized in the next vugraph (Figure 2). The first three tasks should be self-explanatory. Task 4 is probably the most controversial at this time because it is not really clear what type of benchmark programs should be included in any sharing mechanism such as a library. I will discuss this issue later in my talk. As part of this task we plan to test and evaluate a number of available benchmark programs. If you have programs and/or a test facility and would like to participate in such an activity, we would be pleased to hear from you. In Task 5, initially we plan to capitalize upon the experiences of people who have been involved in previous selections and to prepare a preliminary set of guidelines to be made available to users and others faced with selection problems. Federal Guidelines and Standards will possibly come later. The bibliography (Task 6) is intended to include articles and reports covering all aspects of workload characterization, workload generation, benchmarking, comparative evaluation, etc., but limited to the selection environment. This task is well under way, with more than 75 articles having been identified at this time.

FIGURE 2

PROGRAM OF WORK

1. FUNCTION AS A FORUM AND CENTRAL INFORMATION EXCHANGE
2. REVIEW CURRENTLY USED SELECTION PROCEDURES AND TECHNIQUES
3. IDENTIFY AND EVALUATE NEW TECHNICAL APPROACHES
4. DEVELOP AND RECOMMEND A MECHANISM TO FACILITATE SHARING OF BENCHMARK PROGRAMS
5. PREPARE FEDERAL GUIDELINES AND STANDARDS
6. PREPARE A BIBLIOGRAPHY

FIGURE 3

STEPS IN CURRENT SELECTION PROCESS

1. USER DETERMINES TOTAL WORKLOAD REQUIREMENTS FOR PROJECTED LIFE CYCLE OF SYSTEM.
2. USER SELECTS TEST WORKLOAD, TYPICALLY WITH ASSISTANCE OF APPROPRIATE SELECTION AGENCY.
3. SELECTION AGENCY PREPARES AND VALIDATES BENCHMARK PROGRAMS, DATA AND PERFORMANCE REQUIREMENTS FOR RFP.
4. VENDOR MAKES NECESSARY MODIFICATIONS TO BENCHMARK PROGRAMS AND RUNS ON PROPOSED EQUIPMENT CONFIGURATIONS.
5. VENDOR RUNS BENCHMARK TEST FOR USER AND SELECTION AGENCY FOR PROPOSAL VALIDATION PURPOSES.
6. VENDOR RERUNS BENCHMARK TESTS AS PART OF POST-INSTALLATION ACCEPTANCE TESTING.

Step 1 is typically only one aspect of the requirements analysis and preliminary system design phase for any new system. For upgrades or replacements to an existing system, there are a variety of hardware and software monitors and other software aids that can be used to collect data on the existing system. Various programs and statistical analysis techniques can be used to analyze this data to identify the dominant characteristics of the present workload. For example, a number of workers are beginning to use a class of cluster algorithms to aid in this step. The major problems occur for conceptual systems or systems in which the workload is expected to change radically in the future: for example, when changing from a batch environment to a time-sharing environment. In these cases, it is extremely difficult to estimate the future workload, particularly for the later years in the projected life cycle of the system.
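As an illustration of the clustering step, the sketch below (Python, using k-means from scikit-learn) groups hypothetical per-job accounting records into workload classes. The job records, feature choices, and cluster count are invented for this example and are not drawn from any actual selection study.

```python
# Illustrative sketch: clustering per-job accounting records to find dominant
# workload classes.  The data and feature set are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

# Each row: (CPU seconds, I/O transfers in thousands, core used in KB) for one job.
jobs = np.array([
    [ 2.0,  1.5,  96.0],   # short edit/compile-style jobs
    [ 1.5,  2.0, 128.0],
    [45.0,  3.0, 256.0],   # long compute-bound jobs
    [50.0,  2.5, 240.0],
    [ 5.0, 40.0, 160.0],   # I/O-bound file-maintenance jobs
    [ 6.0, 35.0, 144.0],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(jobs)
for i, center in enumerate(kmeans.cluster_centers_):
    members = int(np.sum(kmeans.labels_ == i))
    print(f"class {i}: {members} jobs, mean CPU {center[0]:.1f}s, "
          f"I/O {center[1]:.1f}K transfers, core {center[2]:.0f}KB")
```

Each resulting class centroid can then stand in for a "dominant" job type when the test workload is assembled.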

In step 2, the general goal is to determine peak load periods or to compress long-term workloads, such as a 30-day workload, into a representative test workload that can be run in a reasonable length of time. In the case of an existing system, the peak period or representative mix can generally be identified or selected using resource utilization data. The test workload can also be verified by comparing its resource utilization data against the original data. For conceptual systems, of course, none of this experimental activity can take place.
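The two activities described above for an existing system can be sketched as follows; the utilization data, window length, job-class fractions, and 10 percent tolerance are all invented for illustration.

```python
# Illustrative sketch: pick a peak period from long-term utilization data and
# check that a compressed test mix resembles the full workload.
def peak_window(hourly_cpu, window=4):
    """Return the start index of the busiest `window`-hour span."""
    sums = [sum(hourly_cpu[i:i + window]) for i in range(len(hourly_cpu) - window + 1)]
    return max(range(len(sums)), key=sums.__getitem__)

def mix_matches(full_mix, test_mix, tolerance=0.10):
    """True when every job class's share in the test mix is within the tolerance
    of its share in the full workload."""
    return all(abs(full_mix[k] - test_mix.get(k, 0.0)) <= tolerance for k in full_mix)

hourly_cpu = [20, 25, 30, 70, 85, 90, 80, 40, 30, 25]      # % CPU busy per hour
print("peak 4-hour window starts at hour", peak_window(hourly_cpu))

full_mix = {"compile": 0.30, "sort": 0.25, "update": 0.45}  # job-class fractions
test_mix = {"compile": 0.33, "sort": 0.22, "update": 0.45}
print("test workload acceptable:", mix_matches(full_mix, test_mix))
```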

Within the selection environment, the principal goal in step 3 is to develop machine-independent benchmark programs. In current selection procedures, the benchmark programs are generally drawn from among application programs obtained from the user's installation. This has a number of drawbacks due to differences in job control techniques, programming languages, and data base structures from machine to machine. Also, in many cases the original programs were tailored to take advantage of particular features of the user's machine. Most of these problems have been satisfactorily solved for batch, multiprogrammed systems. However, a host of new problems is foreseen with the advent of general-purpose data management systems, time-sharing, and on-line transaction processing applications.

In step 4, the principal objective is to collect timing data for each configuration (the original plus augmentations to meet growth in workload) to be proposed in response to the RFP. Most problems and costs in this step are attributed to getting the benchmark programs to run on the vendor's own machine, because of language differences and data base structure differences. In my own experience, seemingly simple problems such as character codes and dimension statements have been the cause of much aggravation in trying to get a program written for one machine to run on another.

In steps 4, 5 and 6, there are major costs on the part of the vendors to maintain the benchmark facility, including equipment and personnel. There can also be significant scheduling problems if the vendor is involved in a number of procurements at the same time. These costs, and in fact all the vendor costs in preparing proposals and running benchmarks, are eventually reflected in the costs of the computers to the Federal Government.

BENCHMARK PROGRAM DEVELOPMENT

Most of the technical problems and costs in the current selection process are generally attributed to having to select, prepare, and run a new set of programs for each new procurement. What are some of the alternative approaches that could be addressed by Task Group 13? The first one is probably the most obvious: to develop tools and techniques to simplify the process of sanitizing a given user program and then translating it to other machines. This would include preprocessors to flag or convert machine-dependent job control cards, dimension statements, etc. I am not aware of any serious attempts or proposed attempts along this line; however, it seems to have some merit for consideration by Task Group 13.
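As a rough illustration of what such a flagging preprocessor might look like (no such tool exists in the work surveyed here), the sketch below scans source lines for a few patterns that tend to be machine-dependent. The patterns and example lines are hypothetical and far from complete.

```python
# Illustrative sketch of a "flagging preprocessor": scan benchmark source for
# constructs that tend to be machine-dependent.  The pattern list is hypothetical.
import re

MACHINE_DEPENDENT_PATTERNS = [
    (re.compile(r"^//"), "job control card (rewrite for target system)"),
    (re.compile(r"^\s*DIMENSION\b", re.IGNORECASE), "dimension statement (check sizes vs. word length)"),
    (re.compile(r"EBCDIC|ASCII|\bBCD\b", re.IGNORECASE), "character-code dependency"),
]

def flag_lines(source_lines):
    """Yield (line_number, reason, text) for lines that look machine-dependent."""
    for lineno, text in enumerate(source_lines, start=1):
        for pattern, reason in MACHINE_DEPENDENT_PATTERNS:
            if pattern.search(text):
                yield lineno, reason, text.rstrip()
                break

example = [
    "//STEP1 EXEC PGM=IEWL",
    "      DIMENSION A(10000)",
    "C     CONVERT EBCDIC CARD CODES TO INTERNAL FORM",
]
for lineno, reason, text in flag_lines(example):
    print(f"line {lineno}: {reason}: {text}")
```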

A second approach, which has been proposed but never implemented even on an experimental basis, is to develop and maintain a library of application benchmark programs. These programs would then have to be translated to other machines only once. For each new procurement, the user would select the mix of benchmarks from this library which best approximates his desired test workload. The library would probably have to be extensive, since it would have to contain programs for a great variety of engineering, scientific and business applications. One proposal that I am familiar with specified 20 Fortran and 20 Cobol programs, all parameterized to some extent to permit tailoring to a specific user's test workload. This approach seems to have been set aside by the selection community for a variety of reasons, the primary one probably being the cost to develop and maintain such a library. There is also some question as to the acceptance of this approach by users. I believe that if the costs and benefits were analyzed on a Government-wide basis, this approach might prove more favorable than it has appeared in the past.

The third approach, and the one receiving the most attention these days in the selection community, is the synthetic program, or synthetic job, approach. As most of you know, a synthetic program is a highly parameterized program which is designed to represent either (1) a real program or (2) the load placed on a system by a real program. The first form, a task-oriented program, is generally designed so that it can be adjusted to place functional demands on the system to be tested, such as compile, edit, sort, update, and calculate. The second form is a resource-oriented program which can be adjusted to use precisely specified amounts of computer resources such as CPU processing time and I/O transfers. Neither does any "useful" processing from an operational standpoint.
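A minimal sketch of a resource-oriented synthetic job is shown below: it does no useful work, but consumes an adjustable amount of CPU time and performs an adjustable number of I/O transfers. The parameter values and scratch-file name are illustrative stand-ins, not any synthetic actually used in a procurement.

```python
# Illustrative sketch of a resource-oriented synthetic job.
import os
import time

def burn_cpu(seconds):
    """Spin on arithmetic until roughly `seconds` of wall time have elapsed."""
    end = time.time() + seconds
    x = 1.0001
    while time.time() < end:
        x = (x * x) % 97.0
    return x

def do_io(transfers, block_bytes=4096, path="synthetic.scratch"):
    """Write and then read back `transfers` blocks to generate I/O activity."""
    block = b"\0" * block_bytes
    with open(path, "wb") as f:
        for _ in range(transfers):
            f.write(block)
            f.flush()
            os.fsync(f.fileno())
    with open(path, "rb") as f:
        while f.read(block_bytes):
            pass
    os.remove(path)

if __name__ == "__main__":
    # One "job step": the parameters would be set to match a real program's profile.
    burn_cpu(seconds=0.5)
    do_io(transfers=100)
```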

Typically, the test workload can be specified in terms of, say, 10 to 15 different types of task-oriented synthetics which, incidentally, may contain common parameterized modules such as tape read, compute, or print. With resource-oriented synthetics, there is normally one original program which is duplicated and adjusted to represent the individual programs in the test workload.
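The composition idea can be sketched as follows: each task-oriented job type is a sequence of calls to common parameterized modules, and the test workload is simply a count of how many copies of each job type to run. The module names, job types, and counts below are hypothetical.

```python
# Illustrative sketch: a test workload built from task-oriented synthetic jobs
# composed of common parameterized modules.
def tape_read(ctx, records):
    ctx["records"] = [f"record {i}" for i in range(records)]   # stand-in input module

def compute(ctx, passes):
    ctx["total"] = sum(len(r) for r in ctx.get("records", [])) * passes  # stand-in calculation

def print_report(ctx, lines):
    for _ in range(lines):
        pass                     # a real synthetic would format and write report lines

JOB_TYPES = {
    "update":  [(tape_read, {"records": 500}),  (compute, {"passes": 3}), (print_report, {"lines": 50})],
    "summary": [(tape_read, {"records": 2000}), (compute, {"passes": 1}), (print_report, {"lines": 10})],
}
WORKLOAD = {"update": 4, "summary": 2}   # copies of each job type in the test mix

def run_workload():
    for job_type, copies in WORKLOAD.items():
        for _ in range(copies):
            ctx = {}
            for module, params in JOB_TYPES[job_type]:
                module(ctx, **params)

run_workload()
```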

The motivation toward synthetics in the selection environment is of course the same as for an application benchmark library. The benchmark program or programs need only be translated and checked out once by each computer manufacturer, thus minimizing the costs involved in future procurements. The most serious technical question regarding synthetics seems to be their transferability from machine to machine, particularly the resource-oriented synthetics, since demands on resources are obviously dependent on the architecture of each machine. The task-oriented synthetics seem to offer the most promise within the selection environment primarily because they seem least system dependent. Also, user acceptance would be more likely if some means can be developed to express or map user-oriented programs in terms of task-oriented synthetic programs.

SYNTHETIC PROGRAM METHODOLOGY

The general methodology of the synthetic job approach is summarized in the next vugraph (Figure 4).

FIGURE 4

Accounting programs are commonly used for data collection, although there is increasing use of hardware monitors, trace programs and other software packages to get more detailed timing data on the real (and the synthetic) jobs in execution. As has been pointed out earlier in this meeting, it is often necessary to use special techniques or develop special programs to be able to make specific measurements for the problem at hand. For example, in an experimental program sponsored by the Air Force Directorate of ADPE Selection, we are using a Fortran frequency analysis program to obtain run time data on actual programs. This data is in turn used to set the parameters of a synthetic Fortran program, while accounting data (SMF on an IBM 370/155) is used to compare the synthetic job to the original user job. For TSO jobs, we use a system program called TS TRACE to collect terminal session data, use this data in developing a synthetic session, and compare the two sessions on the basis of SMF data.
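The comparison step can be illustrated with the following sketch, which checks synthetic-versus-original accounting totals field by field. The field names, values, and 10 percent tolerance are hypothetical and do not correspond to the actual SMF record content used in the experiments described above.

```python
# Illustrative sketch: compare accounting-style totals for an original job and
# its synthetic counterpart.  Fields and tolerance are hypothetical.
original  = {"cpu_seconds": 41.2, "io_transfers": 15300, "core_kbytes": 212}
synthetic = {"cpu_seconds": 39.8, "io_transfers": 16050, "core_kbytes": 208}

def compare(orig, synth, tolerance=0.10):
    ok = True
    for field, base in orig.items():
        diff = (synth[field] - base) / base
        flag = "OK" if abs(diff) <= tolerance else "ADJUST"
        if flag == "ADJUST":
            ok = False
        print(f"{field:14s} original {base:10.1f}  synthetic {synth[field]:10.1f}  {diff:+6.1%}  {flag}")
    return ok

if compare(original, synthetic):
    print("synthetic job is an acceptable surrogate for the original")
```

Fields flagged ADJUST would be fed back into the synthetic program's parameters, and the run repeated until the two profiles agree.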

SUMMARY OF TECHNICAL PROBLEMS

In this last vugraph (Figure 5), I have listed some of the major problems affecting computer performance measurement and evaluation in general and computer selection in particular.

The lack of common terminology and measures affects all areas of selection, including workload characterization, performance specification, and data collection and analysis. The semantics problems alone make communication difficult between users and selection agencies, and between selection agencies and vendors.
