
USING SMF AND TFLOW FOR PERFORMANCE ENHANCEMENT

J. M. Graves

U.S. Army Management Systems Support Agency

Agency Overview

Before I get too involved in this presentation of how we use SMF and TFLOW for performance enhancement, let me give you a few words of background on my Agency, USAMSSA, the U.S. Army Management Systems Support Agency. We were created several years ago to serve the Army Staff's data processing requirements. From our site in the Pentagon, we handle the full range of commercial and scientific applications, with a slight bias toward the commercial. Applications are processed via batch, TSO, and RJE on a 3-megabyte IBM 360/65 and a 1.5-megabyte IBM 360/50 using OS/MVT and HASP. As you might expect, we have a large number of peripheral devices attached to the two machines, some of which are shared. We have a number of operational conventions, but I need mention here only the two which have an impact on system performance. The first is what I call the SYSDATA convention, whereby all temporary disk data sets are created on permanently mounted work packs given the generic name of SYSDATA at system generation time. The second is that, of the 80 spindles on the two machines, only 41 are available for mounting our 150 private mountable packs; the remainder are used by permanently resident packs.

Problem Areas Defined

We have in the past made use of a number of tools to analyze system performance: two hardware monitors, DYNAPROBE and XRAY; a software monitor, CUE; and, last but not least, personal observation of the computer room. These tools are all excellent in defining those areas of system performance that need optimizing, e.g., the CPU is not 100% busy, the channels are not uniformly busy at a high level, certain disk packs show more head movement than others, and there are an excessive number of private pack mounts. However, none of the usual tools tells you what to do to eliminate or alleviate the bottlenecks so defined. Our approach to filling this information gap is to use free IBM software, SMF and TFLOW, to produce data for subsequent reduction into reports which at least point you in the right direction. SMF is a SYSGEN option, and TFLOW, which records everything that goes through the trace table, is available from your IBM FE as a PTF. After making the indicated system changes, we go back to the monitors to validate results.


Core Availability

The first area of system performance we examined was core availability, which, on most machines, is one of the most important constraints on the degree of multiprogramming, and therefore on the maximum utilization of the machine. The more tasks that are in core executing concurrently, the higher will be the utilization of the CPU, the channels, and the devices, provided of course that device allocation problems do not arise from the additional concurrently executing tasks. If allocation problems do occur as a result of an increased level of multiprogramming, a rigidly enforced resource-oriented class structure will help. If this approach fails to reduce allocation lockouts to an acceptable level, device procurement is about the only other avenue of approach. One way of justifying (or perhaps I should say rationalizing) the acquisition of additional peripheral devices to support the additional tasks now multiprogramming is to regard the CPU cost and the core cost per additional task as essentially zero, since these costs were a fixed item before you began system tuning. In summary, by making more core available, one should be able to support extra applications at no cost, or in the worst case, at the cost of enough new devices to lower allocation contention to an acceptable level. Short of bolting on more core, there are a number of approaches to increasing core availability.

Weighted Core Variance Report

In our computing environment, since we do not bill for resources consumed, there is no economic incentive for a user to request only that amount of core which he actually needs to execute his program. To the contrary, many considerations in the applications programmer's milieu influence his REGION request to be on the high side. First, he wants to avoid S80A abends (core required exceeds core requested). Second, he wants to provide enough core over and above the requirements of his program to provide a dump in the event of an abend. Next, he wants to provide room for future expansion of his program without the necessity of also changing his JCL. This consideration is particularly important to him if his JCL has been stored on the catalogued procedure library, SYS1.PROCLIB. Next, there is the certain knowledge that coding and keypunching effort can be saved by

specifying the REGION request at a higher level than the step EXEC statement, e.g., on the EXEC PROC statement or on the JOB statement. In either case, the REGION specified is applied to each step of a group of steps, and must therefore be large enough to enable the largest step of the group to execute. Obviously this practice can lead to gross core waste if there is a large variance between the REGION requirements of the largest and smallest steps. For example, we have experienced 300K compile, link, and go jobs when the 300K was required only for the go step.

One approach to this problem would have been to introduce a JCL scan to prohibit the use of the REGION parameter on the JOB statement or the EXEC PROC statement. I personally preferred this approach, but it was the consensus of our management that there was nothing wrong per se with specifying the REGION requirement at a higher JCL level than the step EXEC statement; only when such a specification results in wasted core should the user be criticized. Since there is no way to know how much core actually will be used until after it is used, an after-the-fact reporting system had to be developed to identify daily, on an exception basis, those users whose wasteful REGION requests were causing the greatest impact on core availability. This mechanism is the Weighted Core Variance Report. The idea of the report is to produce, for each job step, an index number whose magnitude is a measure of the degree of waste that step's REGION request has produced in the system. Acceptance of the report by management was contingent upon the repeatability of the computed index number.

From SMF type 4 step termination records, the variance between the requested core and the used core is computed, a 6K abend/growth allowance is subtracted, and the result is weighted by the amount of CPU time used by the step. This product is further weighted by dividing by the relative speed of the hierarchy of core where the step executed. Negative results were not considered, since a negative result means the programmer was able to come within 6K on his core request. CPU time was used rather than elapsed (thru) time since it was felt that this figure was more independent of job mix considerations. The weighted variances are listed in descending order, with a maximum of 20 job steps per administrative unit; it was felt that 20 job steps was about the maximum that a manager would attend to on one day. A minimum index value of 500 was established to eliminate reporting of trivial index numbers. The report is distributed daily to 40 or 50 management level personnel. Implementation of the report was met with widespread grousing, which has continued up to the present. This, I feel, is a strong indication that the report is serving its purpose.
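As a rough illustration of the index calculation just described, the sketch below computes a weighted core variance from SMF type 4 step data. The field names, the record representation, and the relative-speed values are assumptions made for the example; they are not USAMSSA's actual record layouts or report code.

# Hypothetical sketch of the Weighted Core Variance index described above.
# Field names and the relative-speed values are illustrative assumptions.
ABEND_GROWTH_ALLOWANCE_KB = 6            # 6K abend/growth allowance
MIN_REPORTABLE_INDEX = 500               # indices below this are considered trivial
RELATIVE_CORE_SPEED = {0: 1.0, 1: 2.0}   # per core hierarchy; placeholder values

def weighted_core_variance(step):
    """step: dict built from one SMF type 4 step termination record."""
    variance_kb = step["region_requested_kb"] - step["core_used_kb"] - ABEND_GROWTH_ALLOWANCE_KB
    if variance_kb <= 0:
        return 0.0                       # within 6K of actual use: not reported
    index = variance_kb * step["cpu_seconds"]            # weight by CPU time used
    return index / RELATIVE_CORE_SPEED[step["hierarchy"]]

def daily_report(steps, per_unit_limit=20):
    """Top offenders per administrative unit, largest index first."""
    by_unit = {}
    for step in steps:
        index = weighted_core_variance(step)
        if index >= MIN_REPORTABLE_INDEX:
            by_unit.setdefault(step["admin_unit"], []).append((index, step["jobname"]))
    return {unit: sorted(rows, reverse=True)[:per_unit_limit]
            for unit, rows in by_unit.items()}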

Access Method Module Utilization Report

Lest I give the impression that my idea of system tuning is to attack problem programming, let me describe the second of our approaches to achieving a higher level of core availability.

One of the most interesting design features of OS is the use of re-entrant modules resident in a shareable area of core designated the Link Pack Area. If these modules were not so located and designed, there would have to be multiple copies of the same module in core at the same time: one copy for each task that required its services. By having these modules core resident, the various tasks multiprogramming can share the one copy, thus effectively saving an amount of core equal to the size of the module multiplied by one less than the number of concurrently executing tasks. The higher the degree of multiprogramming and the larger the module in question, the greater is the benefit in core savings, or, looking at the situation from another point of view, the greater is the core space availability. This increase in availability can be anything but trivial on a machine with a healthy complement of core, since some access method modules are substantial in size: 2 to 4K for some BISAM and QISAM modules, for example. Our criterion for validating the RAM list is to select for link pack residency only those modules which show concurrent usage by two or more tasks more than 50% of the time. Unfortunately, we were unable to locate any free software that would give us a report of concurrent utilization of access methods. We then looked to see if we could generate such a report from the data we had on hand. Such data would have to provide a time stamp giving the date and time that each task began using each access method module, and another time stamp when the task stopped using the module. If we had this information, it would be a relatively simple matter to calculate, for each access method module, the length of time there were from zero to 15 concurrent users, and then relate the total time at each level of multiprogramming to the total measurement time as a percentage.
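The concurrency calculation itself is simple once a start and a stop timestamp are in hand for each use of a module by a task (the following paragraphs describe where those timestamps come from). The sketch below is only an illustration of the percentage-of-time-at-each-level computation and the 50% residency test; it is not the USAMSSA report program, and the interval representation is an assumption.

# Illustrative sketch: percentage of measurement time at each level of
# concurrent use of a shared resource, given (start, stop) intervals,
# one interval per using task.
from collections import defaultdict

def concurrency_profile(intervals, measure_start, measure_end):
    """intervals: list of (start, stop) times; returns {level: percent of time}."""
    events = []
    for start, stop in intervals:
        start, stop = max(start, measure_start), min(stop, measure_end)
        if stop > start:
            events.append((start, +1))      # one more concurrent user
            events.append((stop, -1))       # one fewer concurrent user
    events.sort()
    time_at_level = defaultdict(float)
    level, prev_time = 0, measure_start
    for time, delta in events:
        time_at_level[level] += time - prev_time
        level += delta
        prev_time = time
    time_at_level[level] += measure_end - prev_time
    total = measure_end - measure_start
    return {lvl: 100.0 * t / total for lvl, t in time_at_level.items() if t > 0}

def is_link_pack_candidate(profile):
    # Two or more concurrent users more than 50% of the time.
    return sum(pct for lvl, pct in profile.items() if lvl >= 2) > 50.0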

It became clear that the stop and start timestamps were already available in the form of the SMF type 4 Step Termination records and the SMF type 34 TSO Step Termination records. What was not available was the all-important list of access method module names invoked by each step. This is where TFLOW gets into the act. We had used TFLOW previously to get a more complete count of SVC requests than CUE supplies, and so were familiar with the fact that when SVC 8 (LOAD) is detected, the name of the module loaded is provided in a 32-byte TFLOW record. Therefore, in the TFLOW data we had the access method module name, but no time stamps or step identification. The problem was how to relate the TFLOW data and the SMF data. Fortunately, TFLOW gives the address of the Task Control Block (TCB) of the task requesting the access method. This information is provided to an optional user-written exit at execution time. By searching a short chain of control blocks from the TCB to the TCT to the JMR, we were able to locate the SMF JOBLOG field, consisting of the Jobname,

Reader On Date, and Reader On Time. This JOBLOG data is reproduced in every SMF record produced by the job in execution, and is unique to that job. Also located in the JMR is the current step number. By copying the JOBLOG and step number to a TFLOW user record also containing the access method module name, we have a means of matching the TFLOW user record to the SMF step termination record written sometime after the access method module is loaded. By suppressing the output of the standard TFLOW records, and recording only those user records that represent a load request for an access method module, the overhead attendant upon TFLOW's BSAM tape writing is kept to a minimum.
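A sketch of the matching step follows. The record layouts and field names are assumptions made for illustration; the real SMF and TFLOW records are fixed-format binary, and the actual report was not written in this language.

# Illustrative sketch: match TFLOW user records (access method module loads)
# to SMF type 4/34 step termination records on the JOBLOG key plus step number.
from collections import defaultdict

def joblog_key(record):
    # Jobname + reader-on date + reader-on time is unique to one job.
    return (record["jobname"], record["reader_on_date"], record["reader_on_time"])

def module_usage_intervals(tflow_user_records, smf_step_records):
    """Yield (module_name, step_start, step_end) for each module load in a step."""
    steps = defaultdict(dict)
    for step in smf_step_records:                        # SMF types 4 and 34
        steps[joblog_key(step)][step["step_number"]] = (step["start"], step["end"])
    for load in tflow_user_records:                      # one per SVC 8 for an AM module
        interval = steps.get(joblog_key(load), {}).get(load["step_number"])
        if interval is not None:
            yield (load["module_name"],) + interval

The resulting (module, start, stop) triples are what feed the concurrency profile sketched earlier.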

[Table: additions to and deletions from the standard IBM RAM list]

Some of these additions and deletions were surprising. The additional BISAM and QISAM modules indicated a much heavier use of ISAM than we had imagined. The deletion of the backward-read-on-tape module and the various unit record modules is surprising only in that we should have thought of it before but didn't: practically everyone at USAMSSA uses disk sort work files instead of the tape sort work files on which backward reading is done, and having HASP in the system precludes a number of unit record operations, since HASP uses the EXCP level of I/O.

What Increased Core Availability has meant to USAMSSA

Use of these reports, plus other systems tuning techniques, has enabled us to go from 12 to 14 initiators in our night batch setup on the model 65, and from 3 to 5 TSO swap regions plus an extra background initiator in our daytime setup. CPU utilization has gone from the 65% - 75% busy range to 90 - 100% busy. Except for weekends, we are able to sustain this level of activity around the clock on the model 65.

So much for core availability, the increase of which was intended to tap the unused capacity of our CPU. To a large degree, we have succeeded in this area. However, the increased degree of multiprogramming has tended to exaggerate whatever I/O imbalances previously existed. We wanted to be able to keep a continuous eye on the disk usage area, since we had noted that some packs were busier than others and that some packs were frequently mounted and dismounted. We did not want to run Boole & Babbage's CUE every day, all day, to get this information, for several reasons: systems overhead for one, and the uneasy feeling that

maybe the required information was already somewhere out in the SMF data set. We certainly were paying a considerable system overhead in having SMF in the first place; wouldn't it be nice if we got something more out of it than job accounting and the core availability uses I've already described?

Disk Multiprogramming Report

We developed a report, the Disk Concurrent Usage Report, which gives an approximation of the CUE device busy statistic. We started with the assumption that if there are a large number of concurrent users of a single disk pack, it is virtually certain that there will be a correspondingly high device busy statistic, and that probably a large component of this busy figure is seek time. Also, for private packs not permanently mounted, one can reasonably infer that a medium to high degree of concurrent use requires many operator interventions for mounts. Getting the measure of concurrent disk utilization follows the same approach as concurrent access method module utilization. Again, type 4 SMF records were used to supply the start and end time stamps for each step run in the system. Instead of having to go to TFLOW to pick up the name of the shared resource, this time we were able to find the volume serial number (VOL/SER) of each disk pack containing each data set on which end-of-volume processing occurred in the SMF type 14 and 15, or EOV, records. The type 4, 14, and 15 SMF records are matched to one another by the same common denominator as was used in the access method report: the SMF JOBLOG, consisting of Jobname, Reader On Date, and Reader On Time. Locating the step which processed the data set which came to end of volume is accomplished by "fitting" the record write time of the type 14 or 15 record into the step start and end times. Once we have the time a step starts to use a pack and the time it stops using the pack, it is a simple matter to resequence the combined data and compute the percentage of time there were varying numbers of concurrent users on the pack.
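As an illustration of the "fitting" step, the sketch below attributes an EOV record to the step whose start/end interval contains the record's write time, for the same JOBLOG key. The field names are assumptions for the example, not the actual SMF layouts.

# Illustrative sketch: attribute an SMF type 14/15 (EOV) record to the step
# that was running when it was written, by fitting the write time into the
# step's start/end interval for the same JOBLOG key.
def same_job(a, b):
    # Jobname + reader-on date + reader-on time identifies one job.
    return (a["jobname"], a["reader_on_date"], a["reader_on_time"]) == \
           (b["jobname"], b["reader_on_date"], b["reader_on_time"])

def pack_usage_interval(eov_record, smf_step_records):
    """Return (volser, step_start, step_end) for the step that used the pack."""
    for step in smf_step_records:                        # SMF type 4 records
        if same_job(step, eov_record) and step["start"] <= eov_record["write_time"] <= step["end"]:
            return eov_record["volser"], step["start"], step["end"]
    return None

The (pack, start, stop) intervals produced this way are then run through the same concurrency computation used for the access method report.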

We have used this report in several different ways. One way is to validate the permanently resident (PRESRES) list for private packs; that is, are all private packs that are permanently resident really busy enough to justify their mount status, and are there any nonresident packs which are so busy they would be good candidates for addition to the PRESRES list? Another way we've used this report is to check the activity of our SYSDATA, or work packs. Because OS bases its allocation of temporary data sets on the number of open DCB's per channel, one may wind up with considerably more activity on some work packs than on others. Our circumvention of this problem is to move a frequently allocated but low I/O activity data set from a relatively busy channel

to the channel containing the high activity SYSDATA work pack. This data set is typically a popular load module library. A big problem in USAMSSA as far as system tuning goes is the dynamic nature of the use of private data sets as systems come and go. A data set that is very active one month may be very inactive the next month. The only way to keep up with the effects that this produces in the system is by constant monitoring and shuffling of data sets.

Data Set Activity by Disk Pack Report

To aid us in getting a better idea of the nature, or quality, of the disk activity noted in the previous report, we produce a report which gives a history of the activity on each pack by data set name, the number of EOV's recorded, and the DDNAME of the data set where that information would be significant. We wanted to know which data sets were being used as load module libraries so that we would have good candidates for relocating to a channel containing an over-used SYSDATA pack. We also wanted to know which data sets were used by batch only, TSO only, and both batch and TSO, in order to have the best group of data sets placed on the 2314 spindles which are shared between our two computers. All of the above information is contained in the SMF type 14 and 15 records, with the exception of the TSO versus batch breakout, which we accomplish by analysis of our highly structured jobnames. Additionally, the number of mounts for the pack is assumed to be the number of dismounts, which are recorded in the type 19 SMF record. The number of IPL's and ZEOD's (operator Halt command) must be subtracted from this number if the pack was mounted at IPL or ZEOD time, since a type 19 record is also written at these times.
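The mount estimate just described reduces to simple arithmetic; a small sketch of the adjustment, with assumed inputs, follows.

# Illustrative sketch: estimate mounts for a pack from SMF type 19 (dismount)
# records. Type 19 records are also written at IPL and ZEOD for mounted packs,
# so those events are backed out of the count.
def estimated_mounts(type19_count, ipl_count, zeod_count, mounted_at_ipl_or_zeod):
    if mounted_at_ipl_or_zeod:
        return max(type19_count - ipl_count - zeod_count, 0)
    return type19_count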

We have also used this report to reorganize disk packs, particularly where one data set is the cause for most of the pack mounting. If the data set is very heavily used, we may try to find space for it on a permanently mounted pack.

Conclusion

We feel that the reports just described can be used as a vehicle for system tuning, in that they suggest courses of action by pointing out system anomalies. We can increase core availability by monitoring programmer core requests with the Weighted Core Variance Report and by modifying the standard IBM RAM list to accurately reflect USAMSSA's use of access methods. We can reduce disk I/O contention by monitoring disk utilization with the Disk Concurrent Usage Report and by monitoring data set activity with the Pack/DSN Activity Report. In a more passive sense, the reports can be used as a daily verification that the system stayed in tune, without the attendant inconvenience of running a hardware or software monitor. We find these reports to be useful. If you are interested in acquiring them, see me during the coffee break, or see Mert Batchelder and he will give you my address.

USACSC SOFTWARE COMPUTER SYSTEM PERFORMANCE MONITOR: SHERLOC

Philip Balcom and Gary Cranson

U.S. Army Computer Systems Command

1. ABSTRACT.

This technical paper describes the internal and external characteristics of SHERLOC, a 360 DOS software monitor written by the US Army Computer Systems Command. SHERLOC primarily displays 1) the number of times each core image module is loaded, and 2) CPU active and WAIT time, by core address, for the supervisor and each partition. The major advantage of SHERLOC over most, if not all, other software monitors is that it displays supervisor CPU time by core address. In this paper emphasis is placed on the concepts required for a knowledgeable system programmer to develop his own tailor-made monitor. Results of using SHERLOC since completion of its first phase in August 1973 are discussed. SHERLOC stands for Something to Help Everybody Reduce Load On Computers.

2. GENERAL DESCRIPTION OF SHERLOC.

SHERLOC is started as a normal DOS job in any 40K partition. During initialization, SHERLOC requests information such as the sampling rate from the operator. When the operator then presses the external interrupt key, SHERLOC begins sampling at the specified rate, accumulating, in core, all samples. When the operator presses the external interrupt key a second time, SHERLOC stops sampling and allows the operator to indicate whether he wants SHERLOC to terminate, or to record the gathered data on tape or printer, or a combination of the above. After satisfying these requests, if SHERLOC has not been told to terminate, the operator can press external interrupt to cause SHERLOC to continue sampling without zeroing its internal buckets.
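The essence of the sampling pass is accumulating each timer-driven sample into an in-core bucket keyed by the interrupted core address, split into CPU-active and WAIT time. The sketch below is only a conceptual model of that bookkeeping, with an assumed bucket size; the actual monitor is DOS assembler working from the old PSW, as described in the next section.

# Conceptual sketch of SHERLOC-style sample accumulation: count samples per
# core address bucket, separating CPU-active samples from WAIT samples.
from collections import defaultdict

BUCKET_SIZE = 256                        # bytes of core per bucket; assumed value

class SampleAccumulator:
    def __init__(self):
        self.active = defaultdict(int)   # bucket -> samples taken while CPU busy
        self.wait = 0                    # samples taken while CPU was in WAIT state

    def take_sample(self, interrupted_address, cpu_was_waiting):
        """Called once per elapsed timer interval while sampling is enabled."""
        if cpu_was_waiting:
            self.wait += 1
        else:
            self.active[interrupted_address // BUCKET_SIZE] += 1

    def report(self, interval_ms):
        """Approximate CPU time per bucket: sample count times the sampling interval."""
        return {bucket * BUCKET_SIZE: count * interval_ms
                for bucket, count in sorted(self.active.items())}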

A separate program named HOLMES exists to print data from the tape or to combine data from the tape by addition or subtraction before printing. Thus, in monitoring for two hours, if data is written to tape at the end of each hour, a printout for the last hour alone can be obtained by running HOLMES and indicating that the first data is to be subtracted from the second. Naturally, HOLMES stands for Handy Off-Line Means of Extending SHERLOC.
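A sketch of the HOLMES combination idea, under the assumption that each tape dump is a set of cumulative buckets keyed by address: subtracting the hour-one dump from the two-hour dump isolates the second hour, and adding dumps merges runs.

# Illustrative sketch of HOLMES-style combination of two SHERLOC data dumps.
def combine(first, second, operation="subtract"):
    keys = set(first) | set(second)
    if operation == "subtract":
        return {k: second.get(k, 0) - first.get(k, 0) for k in keys}
    return {k: second.get(k, 0) + first.get(k, 0) for k in keys}

# Example: last_hour_only = combine(dump_at_hour_1, dump_at_hour_2)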

3. SHERLOC PRINTOUT AND INTERNAL LOGIC.

SHERLOC consists of two modules: SECRIT and WATSON. SECRIT stands for SHERLOC's External Communication Routine, Initiator and Terminator. WATSON stands for Way of Accumulating and Taking Samples Obviously Needed.



A. SHERLOC Initialization (Figure 2). When SHERLOC is started by the operator (Box A1), SECRIT receives control and requests information such as sampling rate from the operator (Box B1). SECRIT then attains supervisor state and key of zero via a B-transient. SECRIT modifies the external new program status word (PSW) and the supervisor call (SVC) new PSW to point to WATSON. SECRIT makes all of DOS timer-interruptable by turning on the external interrupt bit in the operands of the supervisor's Set System Mask instructions and in all new PSW's except machine-check. SECRIT sets the Sample flag to don't-sample, which WATSON will later use to determine whether to take a sample. SECRIT then sets the operator-designated interval in the hardware timer (Box L1). SECRIT adjusts the DOS software clock so that the software clock minus the hardware timer will always give DOS the real time of day. SECRIT then WAITs on an Event Control Block (Box B2).


B. When the Timer Underflows. When the timer interval elapses, the hardware timer underflows (Box A3), causing an external interrupt. The external new PSW becomes the current PSW immediately upon timer underflow, since SECRIT has modified DOS (Box G1) so that external interrupts are always enabled. WATSON receives control (Box B3) because SECRIT changed the address in the external new PSW during initialization (Box F1). WATSON then checks the Sample flag (Box B3) previously set by SECRIT (Box H1). Since this flag initially indicates samples are not to be taken, WATSON sets another interval in the hardware timer (Box G3), adjusts the software clock, and loads the external old PSW (Box H3). This returns control to whatever was occurring when the previous interval elapsed. This loop (Boxes A3, B3, G3, H3), in which the interval elapses but no sample is taken, will continue until the external interrupt key is pressed (Box A5). When it is pressed, an external interrupt occurs, giving control to WATSON. WATSON inverts the Sample flag (Box B5) and then tests it (Box F5). At this time in our example, Sample flag = sample. Therefore, WATSON returns
