Saturday 29 May 2010

One of my smallest yet a challenging project using Citrix machines

Recently I worked on a project for a car insurance company. It was very exciting in terms of the domain as I have not worked in the Insurance domain before. So there was some learning in that end. It was one of the best clients I have worked with; right from the management to down below, everybody was very co-operative in helping us out with all the infrastructure and also about the domain / requirements itself along with a good sense of humor.

The core of the application consisted of calculating the insurances which the advisors in the company's branches would use to communicate with their customers.
The business drive came along when the management found out that the advisors in their branches were giving the most expensive deal (high cost to company) to the customers, thus greatly reducing the revenues of the company. They decided to bring in ThoughtWorks to come up with a pilot application with a set of business defined formulae to bring the revenues back on track. One of the challenges that we had was to deliver the application within a months time. From the initial estimates we figured out that it would take nearly twice of that for us to build a quality application with a clean code base. Further discussions revealed that the business was in a hurry to showcase the application to their teams in a national conference during that month. We came up with a strategy of coming up with a mocked up application with stubbed data and all the front end designs ready for the conference, and then continue with developing the application further in due course of time. This idea was bought well by the business from where our journey began.

The team consisted of just seven of us, the smallest team I have ever been on. Interestingly it was an exceptionally diverse team with each one of us from a different country. Being the lone QA I thought I could influence my ideas to the team, but the developers had stronger opinions of their own. So I had to set up a quality expectation meeting along with the business so that we were all on the same page and not stepping on each others toes.

I joined the project at the initiation phase which was right after the inception. I collaborated closely with the BA to analyze requirements and start writing the acceptance criteria for the stories so that the developers could start off once they were done with setting up the infrastructure.

Liaising with the product owner from the finance department was very helpful in terms of defining the formulae. I then went off and started creating the test data for the complex array of calculations. This greatly assisted me in my manual testing as well as for the automation tests. (Data driven testing.) It was all a cake walk until the product owner came back two weeks before the go live date asking us to change some calculations that were implemented right at the beginning. We had to go ahead implementing this change knowing that it is going to be a huge risk and might have a ripple down effect to the other calculations. But our automation tests around calculations were robust enough to catch any regression defects.

As part of the automation testing, I also started writing BDD tests in WatiN using C#. The developers later implemented them as part of their story development. This technique proves to be mutually beneficial, because as a QA I would get to define the tests with the test data. It would also give the developers a head start in implementing the tests. For this project keeping in mind that the business was not too specific with the quality requirements and the size of the application itself being small, we mostly stuck to the happy paths while defining the automation tests. There is always a compromise between the developer build times and the quality in terms of automation tests. So there should be a mutual understanding as to how much to automate and what to automate. I decided to run rest of the tests manually. I also find it a good practice to keep updating a manual regression test suite as the project evolves, which would act as a reference point at the time of deployments.

Now comes the most interestingly challenging part of the whole project. The application was going to be used by the branch officials through Citrix machines (dumb terminal / thin client). These terminals were connected to the central server on a 512 Kbps connection. The icing on the cake was that the phone lines which the branch advisors use, also share the same bandwidth. Thus it was very critical for us to keep performance, load and stress testing in mind since the beginning of the project and develop the application in such a way that it uses minimum bandwidth at all times.
The client had a "Model office" which they were using to simulate a production environment. But the drawback was that they had just four user terminals, so I wanted to use a testing tool for gaining full confidence in terms of load. The client was interested in an economical tool for this pilot project to test the performance of the application. So Jmeter was recommended as part of the testing approach.  I simulated thousands of users hitting the application server several times and doing a typical user journey thus creating enough initial load on the test server. At the same time I manually started using the terminals in the "Model office" to test the performance of the application, observing the refresh rate and response time. I was also observing statistics on the server in terms of memory leaks, %CPU utilization and Throughput. This gave us a fair idea and also a quick heads up of whether we were taking the right path. But we also needed to consider the difference in the configuration settings between the test server and the production server. So the % difference in the load was to be looked out, for getting a realistic view. We started deploying the application on to production, to monitor how it behaves. We gently started hitting it with Jmeter and saw that we were maxing out the server CPUs. The developers started optimizing the code and Javascript tweaks for IE6 improved the performance of the application.
Understanding how critical it is to go live, the client offered us dedicated servers. The client had an infrastructure in place with huge amounts of servers, CPUs and memory. And all they had to do is virtualize what ever they required for our application without altering their hardware. which I found was a very cost effective way.
We took advantage and beefed up to 4 servers with 4 CPUs on each, which gave us tremendous performance results but found that the CPUs were getting under utilized. So we ramped down to 4 servers and 2 CPUs on each server, which still gave us the required performance and CPU utilization was optimum. We got the statistics from the client and found out that the max Throughput that they would hit is about 12 requests/second. And till about 70 requests/second per server our application was giving an observed response time of about 2 seconds which was in agreeable terms with the customer.
The load balancers were sending the requests to a single server based on the IP address from which the request was coming from. So as Jmeter was running from a single machine, all the requests were being sent to a single server instead of getting load balanced. Also Jmeter itself was not able to handle huge amounts of load and was breaking off at 200 threads. The solution we came up with was  running Jmeter over several computers to achieve the required load on the servers. Later found out that Jmeter needs a "super computer" to run hundreds of threads.
Not having too many Listeners writing test results helped it a bit. But even better is asking Jmeter to write the files directly to a file. Even though I had the root URL in the HTTP Request header, after recording a script, I had to go back and change all the URLs to accommodate the dynamic multi user record generation. We had the user ids in the URL, so random user ids which were being generated by the Jmeter Generator, had to be replaced with the parameter in the URL. Used the gaussian random timer and ramp up periods to simulate closer to realistic conditions.
I also realized the importance of having a dedicated performance environment, because sharing environments caused our daily jobs to slow down by not being able to deploy builds on time, the application itself getting badly hit under heavy load etc.

The servers which were rendering the application on to the terminals were hosting IE6 browser. So we also had to overcome some design challenges. Our UX expert also suggested and came up with loads of design changes which were initially developed by a third party.

The final build was deployed on time to production and the client was very pleased by the application.

No comments:

Post a Comment