Thursday, July 30, 2009

Hey! Let's take a transistion matrix and convert it into a series of linear equations to gonkulate the number of visits to various pages!

I've been working on learning the wonderful world of queuing theory. Two of the books that I've been using are Analyzing Computer Systems Performance: With Perl: PDQ and Performance by Design: Computer Capacity Planning By Example.

I've been learning how to apply Dr. Gunther's PDQ API, specifically, PDQ-R which is PDQ for the R programming language. Normally I'd use the Perl::PDQ module but I've found R to be quite handy for crunching numbers. If anything, I'd use perl to munge the data in preparation for crunching by routines in R.

I decided that I would apply PDQ-R to several of the case studies found in Performance by Design: Computer Capacity Planning By Example. In my first attempt to do so I found some numbers that didn't match between the results of PDQ-R and the Excel spreadsheet available as part of Performance by Design: Computer Capacity Planning By Example. I contacted Dr. Gunther and he found that there was a bug in PDQ 5.0.1 that he blogged about here. I guess that is my 15 mS of internet fame!

I got side tracked with some work related stuff and went back to the task of applying PDQ-R to some of the case studies found in Performance by Design: Computer Capacity Planning By Example.

The case study that I have started working on is in Chapter 8: Case Study IV: An E-Business Service.

The author provides us with a Customer Behavior Model Graph of a website:



From this graph we can create a set of linear equations that describes the number of visits to each page of the system presented in the CBMG (seen above). Taking a closer look at the graph we can see that there are probabilities of transitioning between one page to another. For example, the probability of moving from the Home Page (h) to the Login (g) is Phg. The probability of moving from the Login (g) page to the Search (s) page is Psg.

Using these probabilities we can create a set of linear equations to describe the number of visits to each page in the system based upon the number of visits into the entry of the system. In this case, there is only one way into the system and that is via the Entry (e) page.

We can look at the graph and manually derive the linear equations. For example, we can easily see that Vh = Peh*Ve. We can also see that Vg = Phg*Vh + Psg*Vs + Pvg*Vv. Doing this we eventually come up with eight linear equations that we can then solve. (The equations are listed on page 208 of Performance by Design: Computer Capacity Planning By Example).

The authors of the book provide us with all the probabilities in the form of a Matrix of Transition Probabilities on page 209. Here is that matrix:



(e) (h) (s) (v) (g) (c) (b) (x)
Entry (e) 0 1 0 0 0 0 0 0
Home (h) 0 0 0.7 0 0.1 0 0 0.2
Search (s) 0 0 0.4 0.2 0.15 0 0 0.25
View Bids (v) 0 0 0 0 0.65 0 0 0.35
Login (g) 0 0 0 0 0 0.3 0.6 0.1
Create Auction (c) 0 0 0 0 0 0 0 1
Place Bid (b) 0 0 0 0 0 0 0 1
Exit (x) 0 0 0 0 0 0 0 0


So we can take those values found in the transition matrix and eventually gonkulate all the visits to each page. Although the transition matrix is provided by the author it is easy enough to re-create this matrix from web logs for a site.

A number of years ago I did the same thing for a major e-Commerce site. At the time my goal was to use the transition matrix to create a universal virtual user that would emulate what real customers did in production. At the time we were using a number of different virtual users and it was a real PITA to increase the number of vusers in a scenario. I figured if I could generate the transition matrix and generate random walks via the transition matrix that my vusers would better represent the traffic of real customers. Unfortunately, I never got to finish the implementation of the UberVuser but I did get to the point of generating a transition matrix. And that sucker was huge! It wasn't a nice 8x8 matrix but rather a 538x538 matrix. Yeah... That's a lot of elements!

So, thinking of my previous experience and looking at the author's provided transition matrix got me to thinking. There's bound to be a way of backing the transition matrix into the set of linear equations that can easily be solved in R (or any other method that can solve linear equations).

I worked it out by hand (more about this below) and everything looked good but to really make this useful I needed to automate the routine. I sat down looked at the problem and thought that I was going to have to do a bunch of element manipulation when the solution hit me. It turned out to be so damned simple! I love it when a complicated problem can be easily solved. It just amuses the hell out of me.

This is how I backed out the equations by hand. I know that the transition matrix has from "from" pages down the rows and the "to" along the columns. So for example, if I wanted to figure out the equation for the Vs I would work down the column and use the coefficients of the transition matrix as multipliers for the V variables.

For example, with the 3rd column with solving for Vs:

The third column of the transition matrix is:



(s)
(e) 0
(h) 0.7
(s) 0.4
(v) 0
(g) 0
(c) 0
(b) 0
(x) 0


So to solve for Vs:

Vs = 0*Ve + 0.7*Vh + 0.4Vs + 0*Vv + 0*Vg + 0*Vc + 0*Vb + 0*Vx

Notice that we have a Vs on both sides of the equal sign. This corresponds to the loop on the Search node.

We can drop out the variables that are multiplied by zero.

Vs = 0.7*Vh + 0.4Vs

Repeat the steps for all the columns of the transition matrix and you end up with the equations to solve for. But, most packages want the equations setup in matrix form. So, let's get all the variables on the left side of the equal sign.

Vs - 0.4Vs - 0.7*Vh

Simplifies into:

0.6Vs - 0.7*Vh

If we do this with all the equations we can finally put it into matrix form. But we can't solve it yet. We need another vector to solve with. This vector will contain the input into the system. In this case the entry into the system is into the Entry page so the vertical vector is:

(e) 1
(h) 0
(s) 0
(v) 0
(g) 0
(c) 0
(b) 0
(x) 0

If we call our first matrix that we created from the transition matrix A and our input vertical vector B, we have the classic A*B = 0 that we can solve for.

Now we get into the R solution for the above. As I was working out the algorithm to setup the problem for calling the R solve() routine it hit me how simple the solution was!

Let's call our transition matrix supplied by the authors as M.

We can take transpose of M and let's call it A.

Take A and multiply by the scalar value of -1. This is the act of moving all the variables to the left side of the equals sign.

Now we combine variables we are solving for in each row of A by adding 1 to the trace of matrix A.

And there we have our final form of A. We can now solve for A*B = 0!

Here is my solution in R




# Define the column information for easy access by name

e <- 1;
h <- 2;
s <- 3;
v <- 4;
g <- 5;
c <- 6;
b <- 7;
x <- 8;

# M will be the supplied transition matrix

M <- matrix(c(0,1,0,0,0,0,0,0,0,0,0.7,0,0.1,0,0,0.2,0,0,0.4,0.2,0.15,0,0,0.25,0,0,0,0,0.65,0,0,0.35,0,0,0,0,0,0.3,0.6,0.1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0), ncol=8, nrow=8, byrow=TRUE);

# B is the vector that describes entry into the system
B <- matrix(c(1,0,0,0,0,0,0,0),nrow=8,ncol=1);

VisitsByTransitionMatrix <- function(M, B) {
A <- t(M);
A <- -1 * A;
for (i in 1:sqrt(length(A))) {
j <- i;
A[i,j] <- A[i,j] + 1;
};
return(solve(A,B));
};

Visits <- VisitsByTransitionMatrix(M, B);

> Visits
[,1]
[1,] 1.0000000
[2,] 1.0000000
[3,] 1.1666667
[4,] 0.2333333
[5,] 0.4266667
[6,] 0.1280000
[7,] 0.2560000
[8,] 1.0000000

> Visits[e]
[1] 1
> Visits[h]
[1] 1
> Visits[s]
[1] 1.166667
> Visits[v]
[1] 0.2333333
> Visits[g]
[1] 0.4266667
> Visits[c]
[1] 0.128
> Visits[b]
[1] 0.256
> Visits[x]
[1] 1


And there we go! Using this I could have taken that monster 538x538 matrix and easily figured out the number of hits for each page based upon the derived transition matrix and used the R routine for playing a lot of what-if scenarios. For example, what if we want to change some of the probabilities between nodes, how will that effect the visit to other pages? Now it is simple to crunch the numbers. Huzzah!

My next task is to apply PDQ-R to solving the rest of the case study. Later in the chapter there are service times associated with each page and that will determine the total service demand per page and ultimately the page response time for each page and the effect on internal components that will be modeled in my PDQ-R model.

I can really see where I could have used this modeling at my previous job at a major e-commerce site. It would have come in really stinkin' handy to be sure.