
that good results can be obtained with a random selection of laboratories.

A round robin that takes in a cross-section of typical laboratories goes beyond an evaluation of the procedure. The data that are collected reflect the merits of the procedure and also the performance of the participating laboratories. Poor results may be caused by deficiencies in the procedure or by failure to follow the procedure faithfully. Judgment of the procedure will be made on the data remaining after deleting absurd results. It will be shown that one deviant laboratory can easily account for a considerable fraction of the sum of the squared deviations used in evaluating the error.

Rejection of Results

Some guide or aid to judgment has long been needed in those difficult situations that accompany the rejection of results submitted by a laboratory. What is needed is some understandable criterion that is convincing even to the laboratory concerned. For example, suppose a round robin involves nine laboratories testing seven materials. All the laboratories measure the same property on all the materials. Imagine that one of the laboratories turns in the highest (or lowest) result for every one of the seven materials. This event cannot be ascribed to chance. If the ace and all diamonds up to and including the nine spot are removed from a deck of cards and shuffled, the laboratory concerned may be challenged to pick the ace when the nine cards are spread face down. All the laboratory has to do is succeed in this effort seven times in succession, the cards being reshuffled each time. It is not necessary to mention the odds against achieving this performance. Even if the laboratory representative succeeded two or three times in succession, many would suspect that the cards were marked on their backs. That is, everyone would soon conclude that there was something to be explained. And that is just the point. The laboratory should explain why it gets such extreme results. Short consideration can be given to the suggestion that these extreme results may be correct and that the other eight laboratories share a common error. Conceivably that may happen, but why go against the majority? It seems only reasonable to put the burden of proof on the single laboratory rather than on the other eight.

A general criterion for rejection of results could consist of assigning the ranks one to nine to the nine results reported by the nine laboratories on the first of the seven materials. If multiple tests have been made on the same material, the average of the results is used to represent the laboratory. The rank of one goes to the laboratory with the highest result, the rank of nine to the laboratory with the lowest result. If a tie exists, say two laboratories are tied for fifth place, assign the rank of 5.5 to the tied laboratories. If three are tied for fourth place, assign the middle rank of five to all three. This maintains the total of the ranks at 45 for each material. The average rank is 45/9, or 5.
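The tie-handling rule is easy to express in code. Here is a minimal Python sketch (the function name and sample values are illustrative, not from the paper) that ranks one material's results, highest first, giving tied laboratories the average of the ranks they span:

```python
def assign_ranks(results):
    """Rank laboratory results on one material: highest result gets rank 1.

    Tied laboratories receive the average of the ranks they would occupy,
    so the ranks always total L(L + 1)/2 (45 for nine laboratories).
    """
    order = sorted(range(len(results)), key=lambda i: -results[i])
    ranks = [0.0] * len(results)
    pos = 0
    while pos < len(order):
        # extend `end` over the run of laboratories tied at this value
        end = pos
        while end + 1 < len(order) and results[order[end + 1]] == results[order[pos]]:
            end += 1
        midrank = (pos + 1 + end + 1) / 2   # average of ranks pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = midrank
        pos = end + 1
    return ranks

# Two laboratories tied for fifth place share the rank 5.5:
print(assign_ranks([9.1, 8.8, 8.5, 8.2, 7.9, 7.9, 7.5, 7.0, 6.5]))
# -> [1.0, 2.0, 3.0, 4.0, 5.5, 5.5, 7.0, 8.0, 9.0]
```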

When the laboratories have ranks assigned for all seven materials, a score is given each laboratory by adding up its ranks. A score of seven is the minimum possible (highest every time), and a score of 63 is the maximum possible (meaning that the laboratory reported the lowest result on every material). The average score is 7 × 5, or 35, just midway between the minimum and maximum. If only random errors were involved, the rank a laboratory got on each material would be simply a matter of chance. To get an idea of the scores that turn up, shuffle nine cards (ace through nine of diamonds) to get them in random order, and then write the numbers opposite the letters A to I that identify the nine laboratories. Repeat this process until seven ranks have been entered against each letter. Sum the ranks and observe the scores. The outcome of such a simulated round robin is shown in Table I. This game was tried 1000 times with the aid of a computer. Examination of all 9000 scores shows that there were 22 scores of 16 or less and 21 scores of 54 or more, a total of 43 outlying scores.
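The card-shuffling experiment is simple to replay. A minimal Python sketch of the 1000 simulated round robins follows (the seed and variable names are the sketch's own; the paper's machine run is not reproduced exactly):

```python
import random

random.seed(1)  # any seed; counts should land near the expected 42.65
L, M, TRIALS = 9, 7, 1000

extreme = 0
for _ in range(TRIALS):
    scores = [0] * L
    for _ in range(M):                     # one shuffle per material
        ranks = list(range(1, L + 1))
        random.shuffle(ranks)
        for lab, rank in enumerate(ranks):
            scores[lab] += rank
    extreme += sum(1 for s in scores if s <= 16 or s >= 54)

print(extreme)  # the paper's run produced 43 such scores in 9000
```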

[Tables I and II, giving the simulated ranks and scores, are not recoverable from the scan.]

TABLE III.-APPROXIMATE 5 PER CENT PROBABILITY LIMITS FOR RANKING SCORES.

[The entries of Table III are not recoverable from the scan.]

Sum the ranks to get the score for each laboratory. The mean score is M(L + 1)/2. The entries are lower and upper limits that are included in the approximate 5 per cent critical region.

Exact enumeration gave 42.65 as the expected number of such scores. This is just about ½ of 1 per cent of the 9000 scores. There are nine scores per round robin, so the chance of any given round robin having one of these extreme scores is about nine times this ½ of 1 per cent, or about 5 per cent. Although doubling up would reduce the chance of finding a round robin with an extreme score, no such doubling up was found at this probability level.
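Spelled out, the arithmetic behind that 5 per cent figure is:

$$\frac{42.65}{9000} \approx 0.0047 \approx \tfrac{1}{2}\ \text{per cent}, \qquad 9 \times 0.0047 \approx 0.043, \ \text{or about 5 per cent}.$$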

Table II lists the scores for 20 of the 1000 simulated round robins. They were picked by taking every fiftieth, starting with number 50, of the 1000 computer-simulated round robins. Note the three extreme scores (54, 54, and 55) in three of the round robins, although only one was expected. Since there are only 45 of these 1000 round robins with an extreme score, it was unusual to get three round robins with extreme scores out of 20 selected in this manner. However, for all other sets of nine scores, the scores stay well within the range from 16 to 54. The scores cluster fairly closely around the average score of 35.

Table III lists the corresponding 5 per cent probability limits for various combinations of numbers of laboratories and materials. All of the results were obtained by direct enumeration of the actual probabilities of getting the indicated lower limit or less and the indicated upper limit or more. Because the scores go by units, it is not possible to have them correspond to the exact 5 per cent probability level. The tabulated scores in some instances correspond to a probability somewhat more than 5 per cent and in others to somewhat less than 5 per cent. The probability refers to the chance of obtaining a round robin with the indicated extreme score.

Combinations for large numbers of both laboratories and materials are not given. The arithmetic became heavy in this region, at least with a desk calculator. More important, there is the question as to how often one is justified in requesting such a large program. If there does seem to be a need to have many laboratories and many materials, the data may be divided on some reasonable basis. Thus many laboratories might be split into two or more groups, say, geographically, or even randomly. Materials could be split into groups on the basis of the magnitude of the property or some other distinctive characteristic.

More extensive tables and further details about this method are contained in a forthcoming paper, "A Rank Test for Outliers," by W. A. Thompson, Jr., and T. A. Willke.

Examples of Scoring Laboratories

The Plant Food Institute sends a monthly sample to a large number of laboratories. The ranks for nine laboratories for seven successive months are shown in Table IV. The choice of nine and seven was made to permit direct comparison with the machine scores shown in Table II. The scores appear very similar to those shown in Table II. Perhaps systematic errors do not persist over several months so there is no unusually low or high score. Even so, laboratory 7 was obtaining low ranks except for the first month, and laboratories 25 and 29 are generally credited with high ranks.

Table V also shows the ranks for another nine laboratories testing seven materials. Actually these laboratories tested 14 materials, but these were split into two groups of seven. The left-hand group of materials comprises those with low values of the property; the right-hand group includes the materials with high values. The tabulated limits of 16 and 54 are sharply exceeded in both groups. Laboratory 6 comes very close to a clean sweep of rank 9 every time. Laboratory 4 is almost always runner-up to laboratory 6. Laboratories 1 and 2 competed for ranks 1 and 2 in the first group but are in good positions in the second group. There is a pronounced tendency for a laboratory to maintain its position relative to the other laboratories. If this state of affairs cannot be remedied by individual corrective action, the situation may call for the use of reference samples to bring the laboratories into better agreement.

The ranks for 15 laboratories, all making determinations of the per cent of indigestible residues on the same seven protein materials, are shown in Table VI. The scores that are beyond the 5 per cent point are 23 and less, and 89 and more.

TABLE IV.-TOTAL NITROGEN DETERMINATION BY 9 LABORATORIES ON 7 SUCCESSIVE MONTHLY FERTILIZER SAMPLES.

[Table body not recoverable from the scan: monthly ranks (May through Nov.) for the nine laboratories. Recovered scores: 24, 27, 42, 29.5, 29, 45.5, 36.5, 45, 36.5; average, 35.0.]

TABLE V.-RANKING RESULTS OBTAINED BY 9 LABORATORIES TESTING 14 MATERIALS.

[Table body not recoverable from the scan.]

TABLE VI.-RANKING OF 15 COLLABORATIVE RESULTS FOR THE AMOUNT OF INDIGESTIBLE RESIDUES IN 7 PROTEIN MATERIALS.a

[Table body not recoverable from the scan.]

a Taken from Table I, Journal, Assn. of Official Agricultural Chemists, Vol. 42, p. 232, 1959.
b Critical limits for scores are 23 and 89 (Table III).


The lowest of 15 results reported by the 15 laboratories is given the rank 15. If a laboratory obtained the lowest result on every one of seven materials, its score would be 105. The score for laboratory 4 is 92. The individual ranks are 14, 13, 14, 15, 13, 14, and 9. Evidently this laboratory has a tendency to get lower results than most of the other laboratories. Except for the last material, this laboratory maintains a consistent position in the ranking scale. Evidently this laboratory follows some individual practice in a careful manner. The scores given in Table III should convince laboratory 4 that its string of low values cannot be ascribed to chance. It may be more appropriate to ask laboratory 4 to review its technique rather than to report adversely on the procedure.

The ranks listed in Table VII are interesting because in the left-hand group the collaborators used the method of their own preference in making the determinations. Not all the laboratories participated in both programs, but the same samples were used for both. One might have anticipated that some laboratories would maintain their positions relative to the others when the laboratories were invited to use any method they preferred. Laboratory 2 does reach the critical score of 9, and laboratories 4 and 12 approach the other limit of 41. It is more surprising to find that the laboratories show definite individuality when all were presumably following the same tentative procedure. The tentative procedure may be charged with an inflated error that is actually caused by laboratories that are highly individualistic in the way they conduct the test. The lessons to be drawn from a round robin might be immensely helpful to these collaborating laboratories. The author has encountered round-robin data in which the scores were nearly the worst possible: M, 2M, 3M, ..., LM. The conclusion here is that the description of the procedure does not specify properly some of the test conditions and equipment that influence the test result.

Exceptionally low or high scores support the supposition that the laboratory concerned is doing something uniquely different from the rest. It hardly seems just to the procedure under scrutiny to allow such uniquely different results to inflate the calculated error for the procedure. Extreme scores should prompt the laboratory concerned to review its conduct of the test.

Milton Friedman, "A Comparison of Alternative Tests of Significance for the Problem of m Rankings," Annals of Mathematical Statistics, Vol. 11, pp. 86-92, 1940.

TABLE VII.-RANKS AND SCORES OF LABORATORIES REPORTING PERCENTAGE OF TOTAL ALKALOIDS AS NICOTINE.

[Table body and footnote not recoverable from the scan.]

There are fields of work where judges undertake to rank materials in order of merit. Often the judges do not agree among themselves in such subjective tests. A statistical measure of the concordance of the judges has long been in the statistical literature. Fortunately quantitative measurements usually do manage to get the materials in the correct order no matter which laboratory tests the materials. It is not this ranking that has been the object of interest in this paper. Rather the materials may be regarded as ranking the laboratories. If only random errors are operative, the order of the laboratories should not persist from material to material and there should be no concordance whatever.

The goal in the development of a test procedure is to attain an absence of concordance. The ranking scheme is a simple arithmetical device to measure progress toward that goal. If the ranks depend only on chance, the expected sum of squares associated with the scores when L laboratories are ranked M times is ML(L − 1)(L + 1)/12. Denote this sum by S'. Systematic errors spread the scores over a wider range and give a larger sum of squares than S'. Denote the sum of the squared deviations of the observed individual scores from the mean score, M(L + 1)/2, by S. The ratio S/S' should be distributed approximately as χ²/f, where f is one less than the number of laboratories.

TABLE VIII.-PROBABILITY LIMITS FOR THE RATIO OF THE CALCULATED SUM OF SQUARES FOR SCORES TO THE EXPECTED SUM OF SQUARES, ML(L − 1)(L + 1)/12.

[Table body not recoverable from the scan.]

The maximum sum of squares is obtained from the scores M, 2M, 3M, ..., LM, which indicate perfect concordance. This sum of squares is equal to M²(L³ − L)/12. Dividing this quantity by the expected sum of squares, S', gives M as the maximum value the ratio S/S' can take.
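Since L(L − 1)(L + 1) = L³ − L, the maximum ratio follows in one line from the two sums of squares given above:

$$\frac{S_{\max}}{S'} = \frac{M^{2}(L^{3}-L)/12}{ML(L-1)(L+1)/12} = M.$$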

A ratio in the neighborhood of unity is desirable. Ratios less than unity are purely chance occurrences. Because the distribution of the scores is closely approximated by the normal distribution, the tabulated values for χ²/f may be used to obtain the approximate upper 5 per cent limit for values of the ratio.

TABLE IX.-RATIO OF OBSERVED SUM OF SQUARES S TO EXPECTED SUM OF SQUARES S'.

[Table body not recoverable from the scan.]

Values in excess of the tabulated limits in Table VIII indicate that systematic errors are producing some undesired concordance among the rankings.
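The score and ratio computations are small enough to script. A minimal Python sketch, assuming the ranks have already been assigned as described earlier (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def concordance_ratio(ranks):
    """ranks: (L, M) array; ranks[i, j] is laboratory i's rank on material j.

    Returns (scores, S, S_expected, ratio) following the definitions above.
    """
    L, M = ranks.shape
    scores = ranks.sum(axis=1)                   # score = sum of a lab's M ranks
    mean_score = M * (L + 1) / 2                 # mean score under pure chance
    S = ((scores - mean_score) ** 2).sum()       # observed sum of squares
    S_expected = M * L * (L - 1) * (L + 1) / 12  # chance expectation, S'
    return scores, S, S_expected, S / S_expected

# Example: three labs, two materials, perfect concordance -> ratio = M = 2
ranks = np.array([[1, 1], [2, 2], [3, 3]])
print(concordance_ratio(ranks))
```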

The ratio has been calculated for the scores given in the examples in the tables. Thus the random ranks assigned by the playing cards shown in Table I gave scores whose ratio is less than one. The machine-generated scores for the 20 round robins tabulated in Table II gave the following ratios:

First ten: 0.91, 1.29, 0.47, 1.38, 1.26, 1.06, 1.15, 0.85, 0.41, 0.98.

Second ten: 0.78, 1.57, 1.31, 1.03, 0.84, 1.30, 0.98, 0.39, 1.47, 0.91.

All of these ratios fall below the upper 5 per cent limit of 1.94 for nine laboratories.

The nitrogen results in Table IV and the data in Table VI gave acceptable ratios (Table IX). Notice that laboratory 4, singled out by the limits given in Table III as having a systematic error, contributed about one third of S. The sum of squares is reduced from 3030 to 1959.5 when this laboratory is dropped. With laboratory 4 included, the probability level for the ratio 1.54 is under 10 per cent.

The data in Tables V and VII yielded large values for all the ratios. Even the smallest of these is close to the tabulated value, 3.10, for a probability level of 0.1 per cent. These large ratios cannot be ascribed to one or two laboratories but are associated with a generally unsatisfactory state of affairs. Perhaps the test procedure is very sensitive to quite minor departures from the specified techniques for performing the test. Or perhaps the instructions are not specific enough. Whatever the reason, most of the laboratories are involved. The remedy here is to give the test procedure a thorough going over by a good laboratory. The effect of intentional deviations from stated conditions for conducting the test should be studied to discover whether this accounts for the scatter of the results. Usually any deviation is maintained over long periods, and this would account for a laboratory obtaining about the same rank on all the materials.

Summary

This method of ranking laboratories has advantages besides those of simplicity and ease of calculation. There is no need to be concerned that the precision may vary from one laboratory to another. Poor precision will tend to invite low or high individual ranks, but in equal proportions, so there is compensation. Differences in precision are fortunately rather small, or the usual analysis of variance that uses the actual values would run into statistical difficulties. Perhaps the most important advantage of the ranking procedure is that the variance of the scores is known a priori. The variance is given by M(L − 1)(L + 1)/12. Indeed, the complete theoretical distribution of the ranks can be obtained if desired. When laboratory averages, obtained from the numerical values, are used to consider the possible rejection of a laboratory, the suspect laboratory average is part of the data and may give such a large estimate for the laboratory variance component that the rejection level is rather generous. With ranks, the rejection levels can be set in advance of seeing any data.
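As a worked check for the nine-laboratory, seven-material case used throughout (an illustration, not taken from the original tables):

$$\operatorname{Var}(\text{score}) = \frac{M(L-1)(L+1)}{12} = \frac{7 \times 8 \times 10}{12} \approx 46.7, \qquad \sigma \approx 6.8,$$

and the critical scores of 16 and 54 in Table III then lie about 2.8σ on either side of the mean score of 35.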

Finally, the ranking criterion is intuitively meaningful quite apart from any knowledge of advanced statistical techniques.

Systematic errors that are largely responsible for the disagreements that arise among laboratories probably cannot be completely eliminated. Thus the ranking scores obtained for round robins will tend to cover a wider range than theory predicts. So, too, the ratio S/S' will tend to reach large values. Both the scores and the ratio S/S' provide a convenient measure for gaging improvement. The ratio reflects the general performance of all of the laboratories, whereas the limiting scores focus attention on the laboratories with the extreme scores.

Table III provides an objective criterion for singling out laboratories that have the most pronounced systematic errors. Table VIII provides a quick evaluation of the data as a whole. Once laboratories become convinced that they are deviating from the procedure, the resulting search for the source of the deviation should produce (1) a general improvement in the quality of testing, (2) a better estimate of the inherent quality of the test procedure, and (3) perhaps fewer procedures that appear to require the prop of expensive reference materials.

Acknowledgments:

The author is indebted to William A. Thompson, Jr., for devising a reiterative scheme of computation for Table III. Mary C. Croarkin and Thomas A. Willke checked and extended a small preliminary table obtained by the author by programming the computations on a computer.

The Interlaboratory Evaluation of Testing Methods

By JOHN MANDEL and T. W. LASHOF

Trained manpower and laboratory facilities can be used more effectively if improvements can be made in interlaboratory evaluation of testing methods. There are probably hundreds of these cooperative programs going on all the time under the aegis of the Society's 80 main technical committees. Too often the report of an interlaboratory program indicates the results are not useful because some variable was not adequately controlled or because there was some flaw in planning the program.

Planning interlaboratory test programs has occupied the attention of most if not all of the technical committees; in fact, two (D-11 on Rubber and D-13 on Textiles) have prepared recommended practices which have been published by ASTM (D 1421 and D 990, 1958 Book of ASTM Standards, Parts 9 and 10). Committee E-11 on Quality Control of Materials has the assignment to develop a general recommended practice for interlaboratory testing for use by all the committees. It is accordingly quite interested in the present paper, as indicated by the following statement by one who reviewed the paper for the committee:

"This paper gives a very complete treatment of the problem which almost every ASTM committee is constantly trying to solve ...(it is) a more comprehensive approach to the problem of designing and interpreting interlaboratory studies than has appeared in the literature up to now. Their (the authors') ideas are complex because the problem they are trying to solve is complex."-ED.

The various sources of variability in test methods are examined, and a new general scheme to account for them is proposed. The assumption is made that systematic differences exist between sets of measurements made by the same observer at different times or on different instruments or by different observers in the same or different laboratories and that these systematic differences are linear functions of the magnitude of the measurements. Hence, the proposed scheme is called "the linear model." The linear model leads to a simple design for round-robin tests but requires a new method of statistical analysis, geared to the practical objectives of a round robin. The design, analysis, and interpretation of a round robin in accordance with the linear model are presented, and the procedure is illustrated in terms of the data obtained in an interlaboratory study of the Bekk smoothness tester for paper. It is believed that this approach will overcome the "frustrations" that are often associated with the interpretation of round-robin test data.

IN THIS paper a new approach is presented for the analysis of interlaboratory studies of test methods. The various sources of variability in test methods are first reexamined and a new general scheme to account for them is proposed. This scheme leads to a simple design for round-robin tests but requires a new method of statistical analysis, geared to the practical objectives of a round robin. The theoretical details are dealt with in a companion paper (1). In the present article, the emphasis is on the application of the new concepts to ASTM committee studies of test methods. The procedure is illustrated in terms of the data obtained in an interlaboratory study of the Bekk smoothness tester for paper.

For much of the discussion in this paper, the consideration of different laboratories is not an absolute requirement. The word "laboratory" is used here to denote a set of measurements obtained under conditions controlled within the set but such that systematic differences may exist from one set to another. For example, different operators within the same laboratory may also show systematic differences. The same may be true for sets of measurements obtained even by the same operator at different times. Since the use of different laboratories, in the usual sense, is likely to result in the greatest number and severity of systematic differences, the practice of conducting interlaboratory round-robin programs for the study of test methods appears entirely justified.

NOTE.-Discussion of this paper is invited, either for publication or for the attention of the authors. Address all communications to ASTM Headquarters, 1916 Race St., Philadelphia 3, Pa.

The boldface numbers in parentheses refer to the list of references appended to this paper.

A New Approach: The Linear Model

We will assume that an interlaboratory study of a particular test method has been run in accordance with the schematic diagram shown in Table I; specifically, to each of a laboratories, b materials have been sent for test, and each laboratory has run each material n times. Let us suppose that the b materials cover most of the useful range of the test method under study for the type of material examined. Then the determinations made by the ith laboratory on the jth material constitute what will be denoted henceforth as the "i,j cell" (see Table I). Our reasons for using this scheme will become apparent as we develop the linear model.
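One plausible algebraic form for the linear model just described (the notation is an assumption, not taken from the paper): if x_j denotes the level of the property for material j, the kth determination in the i,j cell may be written

$$y_{ijk} = \alpha_i + \beta_i x_j + \varepsilon_{ijk}, \qquad k = 1, \ldots, n,$$

where α_i and β_i express laboratory i's systematic offset and scale, constant within the laboratory, and ε_ijk is random replication error. Systematic differences between laboratories are then linear functions of the magnitude of the measurements, as the abstract states.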

JOHN MANDEL, Statistician with the Division of Organic and Fibrous Materials, National Bureau of Standards, since 1947, has been engaged in research in statistical methodology, with special reference to applications in physical and chemical experimentation, and the development of test methods.

THEODORE W. LASHOF, Physicist in charge, Paper Physical Laboratory, National Bureau of Standards since 1954. Chairman of the Sampling and Conditioning Subcommittee of ASTM Committee D-6 on Paper and Paper Products and Vice-Chairman of the Precision Committee of the Technical Association of the Pulp and Paper Industry.
