Monday, June 1, 2009

Automating the finding of coefficients for the USL

I got to playing around with R more, and I've always found that to learn a language I need to solve problems with it. I'm sure most everybody else does the same thing. My goal was to write an R function that imports a CSV of performance data to gonkulate against.

In my case, I'm using the basic performance data from "Guerrilla Capacity Planning." I've created a CSV file with the number of procs and the resulting ray trace benchmark throughput from Table 5.1:




    C:\Users\auswipe\Desktop>cat raw_throughput.csv
    p,x
    1,20
    4,78
    8,130
    12,170
    16,190
    20,200
    24,210
    28,230
    32,260
    48,280
    64,310



Then I wrote an R function to crunch the numbers:




    uslCoefficients <- function(dataFile) {
      uslData <- read.csv(dataFile, header=TRUE)
      uslData$c <- uslData$x / uslData$x[1]
      usl <- nls(c ~ p/(1+sigma*(p-1)+kappa*p*(p-1)),
                 uslData,
                 algorithm="port",
                 start=c(sigma=0.0, kappa=0.0),
                 lower=c(0,0))
      sigma <- coef(usl)["sigma"]
      kappa <- coef(usl)["kappa"]
      return(list(sigma=sigma, kappa=kappa))
    }



The function uslCoefficients returns a list where I can reference "sigma" and "kappa" by named index:




    > uslCoef <- uslCoefficients("c:\\Users\\auswipe\\Desktop\\raw_throughput.csv")
    > uslCoef["sigma"]
    $sigma
        sigma 
    0.0497973 

    > uslCoef["kappa"]
    $kappa
           kappa 
    1.143404e-05 

    > uslCoef
    $sigma
        sigma 
    0.0497973 

    $kappa
           kappa 
    1.143404e-05 



R is pretty nifty. I doubt I'll ever make use of all the power that is available but it'll be better than writing my own stat routines.
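One handy follow-on once the coefficients are in hand: the USL predicts that relative capacity peaks at p* = sqrt((1 - sigma)/kappa), a standard result from "Guerrilla Capacity Planning." Here is a quick sketch using the coefficients gonkulated above (the uslPeak helper name is my own invention):

```r
# Point of maximum scalability under the USL: p* = sqrt((1 - sigma)/kappa).
uslPeak <- function(sigma, kappa) {
  sqrt((1 - sigma) / kappa)
}

# Plug in the fitted coefficients from the ray trace data.
pstar <- uslPeak(0.0497973, 1.143404e-05)
floor(pstar)   # about 288 processors before capacity turns over
```

Beyond p*, the coherency term dominates and adding processors actually reduces relative capacity.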

Using R to calculate coefficients of the Universal Scaling Law with Non-Linear Regression

Ooh! Doesn't that sound fancy?

Several weeks ago I purchased the eBook of "The Art of Capacity Planning" from O'Reilly. I've always thought that load testing and capacity planning went hand in hand. One is not a replacement for the other, but each can assist the other: load testing helps capacity planning by applying load to pseudo-production systems, and capacity planning helps load testing by verifying load test results against real-world systems.

I finished "The Art of Capacity Planning" and wanted to read more on the subject, so I picked up a copy of "Guerrilla Capacity Planning," which has a lot more math than "The Art of Capacity Planning." One of its concepts is the Universal Scaling Law, which builds on Amdahl's Law. Dr. Neil J. Gunther is a smart cookie. He even has a Ph.D. in Theoretical Physics, which makes him closer to Gordon Freeman than I'll ever be! (Side question: Do Ph.D.s in Theoretical Physics get crowbars at graduation?)
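To make the model concrete, here is the USL formula expressed as a small R function (the name usl_capacity is my own; the formula is the same one that shows up in the nls() call later in this post):

```r
# Universal Scaling Law: relative capacity C(p) for p processors, with a
# contention coefficient (sigma) and a coherency coefficient (kappa).
# (usl_capacity is just my name for this helper.)
usl_capacity <- function(p, sigma, kappa) {
  p / (1 + sigma * (p - 1) + kappa * p * (p - 1))
}

usl_capacity(1, 0.05, 1e-05)   # one processor is always the baseline: 1
usl_capacity(16, 0, 0)         # no contention or coherency cost: linear, 16
usl_capacity(16, 0.05, 1e-05)  # with penalties, well under 16
```

With kappa = 0 the formula reduces to Amdahl's Law, and with sigma = kappa = 0 it is perfectly linear scaling.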

Anyhoo, in section 5.6.1 one of the methods in the book is to use Excel to do second-degree polynomial regression to calculate the two necessary coefficients, sigma and kappa. But when I tried it in Excel I got a negative value for sigma, and one of the rules of the Universal Scaling Law is that the coefficients can never, ever, ever be negative. I just figured that I fat-fingered something and tried it again, and once again the numbers came out wrong.

I scratched my noggin trying to figure out where I'd erred, did some Googling, and came across this entry on Dr. Gunther's blog:

Negative Scalability Coefficients in Excel

Because Excel (and some other packages, like my TI-89) can't put a lower-bound constraint on the coefficients, you will get negative coefficients from time to time. But from reading the blog entry I saw that other people are using R with success.
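For what it's worth, the polynomial-regression route can also be done in R with lm(). If I have the algebra right, substituting x = p - 1 and y = p/C - 1 turns the USL into a quadratic through the origin, y = kappa*x^2 + (sigma + kappa)*x, so both coefficients fall straight out of the fit. A sketch with the Table 5.1 data (variable names are mine), keeping in mind that lm(), like Excel, has no lower-bound constraint, so negatives are still possible in general:

```r
# Table 5.1 data: processor counts and relative capacities.
p <- c(1, 4, 8, 12, 16, 20, 24, 28, 32, 48, 64)
C <- c(1.0, 3.9, 6.5, 8.5, 9.5, 10.0, 10.5, 11.5, 13.0, 14.0, 15.5)

# Linearize the USL: y = kappa*x^2 + (sigma + kappa)*x, through the origin.
x <- p - 1
y <- p / C - 1

fit <- lm(y ~ 0 + I(x^2) + x)        # "0 +" drops the intercept
kappa <- unname(coef(fit)["I(x^2)"])
sigma <- unname(coef(fit)["x"]) - kappa

sigma   # lands close to the nls estimate of ~0.05 for this data set
kappa
```

For this particular data set both coefficients come out positive, but the unconstrained fit is exactly why a solver that accepts lower bounds is the safer tool.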

This is the first time that I've ever messed around with R for statistical purposes. In the past I've written some stat routines (years ago!) in C# for comparing before/after load testing results.

Here is how I used R from start to finish to gonkulate the coefficients.

Using the data from Section 5.3 I did the following in R:

First I defined my p array, which in the book is the number of processors used for ray tracing:




    p <- c(1, 4, 8, 12, 16, 20, 24, 28, 32, 48, 64)



Then I defined my c array, which is the relative capacity for the number of processors used for ray tracing:




    c <- c(1.0, 3.9, 6.5, 8.5, 9.5, 10.0, 10.5, 11.5, 13.0, 14.0, 15.5)



I combined both arrays into a data frame for later use.




    df <- data.frame(p, c)



And when I check out the contents of df I get:




    df
        p    c
    1   1  1.0
    2   4  3.9
    3   8  6.5
    4  12  8.5
    5  16  9.5
    6  20 10.0
    7  24 10.5
    8  28 11.5
    9  32 13.0
    10 48 14.0
    11 64 15.5



I can now use a non-linear regression routine with my data frame that I entered above.




    usl <- nls(c ~ p/(1+sigma*(p-1)+kappa*p*(p-1)), df, algorithm="port", start=c(sigma=0.0, kappa=0.0), lower=c(0,0))



I can then access the coefficients by named index:




    sigma <- coef(usl)["sigma"]
    kappa <- coef(usl)["kappa"]

    sigma
        sigma 
    0.0497973 

    kappa
           kappa 
    1.143404e-05 



Huzzah!

I can now interpolate the relative capacity from the USL with the coefficients gonkulated above and add it to the data frame, df, that I defined earlier. I do have to note that I was a slackard and did not apply the significant-digit rules outlined in Chapter 3 of "Guerrilla Capacity Planning."




    df$proj_c <- p/(1 + sigma * (p - 1) + kappa * p * (p - 1))



There are the projected relative capacities. Yay!




    df
        p    c    proj_c
    1   1  1.0  1.000000
    2   4  3.9  3.479686
    3   8  6.5  5.929346
    4  12  8.5  7.745536
    5  16  9.5  9.144406
    6  20 10.0 10.253815
    7  24 10.5 11.154233
    8  28 11.5 11.898837
    9  32 13.0 12.524174
    10 48 14.0 14.259114
    11 64 15.5 15.298811



And here I will make a simple little graph of the actual versus projected relative capacity:




    plot(p, c)
    lines(p, df$proj_c)



And here is the graph that is generated:



Kinda nifty, eh?

I can see myself using R more in the future. I'd rather write routines for automagic analysis of data with R than write my own routines from the ground up.