The art of debugging (part2)
OK.. I've finally finished up the greuling one week of support for my FCS application that I mentioned just a little in my last post here. It wasn't so bad really other than the fact that I had to work when everybody else wasn't. To put the app in a nutshell, it's a very high-traffic video phone system for a currency dealing game/training.
Dealers are very very short tempered people who's only goal in life is to make money. Anything that slows them down in making that cash is as good as dead. I've heard many a scary story from support teams for these guys, along with a funny one that a guy was so pissed he busted his keyboard over his knee and threw it across the room. He then told the support guy that his keyboard isn't working and to hurry up and get him a new one... ah well, getting off topic here.
So, anyways the app went really well but I found debugging a bit tough due to the high speed in which these people would be calling other banks (there are up to 16 banks in a session). The calls are usually finished within a minute, after selling or buying some currency, and then they are off to the next call. I tried running through the traces on the server side of FCS, but that wasn't going well as it was scrolling so fast at times that it just wasn't readable. Not only that, scrolling in flash sucks up so much CPU that it has an effect on the FCS app... that sucks.
It just so happens though that I had set up some remoting to keep track of who called who and if it got through or not (this app actually has a waiting option if one bank really needs to get a hold of another bank). So we could track later who was a busy bank calling everyone in their dog, and who wasn't. This was quite handy, but I had forgotten to put a time stamp in there so it was extremely tough to track down my only bug in the system.. cross calls. When a cross call happened in the beginning the main problem was that the conversations were so alike that the person who wasn't supposed to be in the call would be able to follow along with a conversation that wasn't being had with them. They could hear the other call and that wasn't good as it would be taken that they were buying/selling currency to/from that bank. So I had to debug this puppy right then and there or there would be a lot of banks short or long on money...
Read on for look at my debugging approach..
The main reason this bug made it into the production version of this app is that we didn't have enough people or time to properly test when so many people are calling others at the exact same time. Whether they would be sitting in somebody elses waiting list or just happened to catch them inbetween calls or whatever. The options seemed endless. But there was a method to my madness and I was definitely going to track it down.
Problem: Two people would be able to call the same person
Sub problem: and the person who didn't make it through to that person could hear the current conversation.
Frequency: Rare... hard to duplicate
At the very least I had to get rid of the sub-problem. The weirdest thing is that it was theoretically impossible to call the same person as I had set them as busy as soon as one person made it through.
So on to the debugging. I first wondered if there was something not making it to the server. It is at the server where I am setting whether they are busy or not and if a call can be made or if that person should go into that persons waiting list. At this time, all changes were sent from the client, unfortunately if two people accessed (read) a value before it was set then there was the possiblity that two people could call one person. The biggest thing I needed to know is the timing of when this happened, so I immediately changed up the code that makes the inserts into the DB for logging to show the time when each call was made. This helped. I found two different patterns. Due to being able to see the time at which a call was made, whether that person was on a call (or should have been on a call). As all FCS developers know, FCS's logging tools... well.. suck. You have to use flash to read them and not only that, they aren't a very readable format. By putting everything into a DB and then looking it later using queries, it's quite easy to pinpoint where something is going wrong. I strongly recommend getting into remoting from the FCS side. Very handy.
So the two patterns were 1. Two banks would call one bank at the exact same time. 2. Even though a bank was busy, somebody would be able to call them and that would cut off the first person. These patterns were figured out by checking the call number and time it was made all from the DB.
Running through my code, I realized one thing about FCS apps.. If there is anything that you want only one person to do or access, do it all server side. I had a check on the client side to see if the person they were calling was busy or not then make the call from the client side. If I were to rebuild this app from the beginning, I would definitely use all server side script that would tell the clients what to do and when. The second pattern was a bit weirder because I think it had to do with the busy list. When somebody became free they would check their busy list (stored in an SO array) and if somebody was in there they would call them. Supposedly the person who is waiting is not talking to somebody else, but I think with the frequency of calls that were made (1000's within 30 minutes) somewhere along the line, a list didn't get cleaned out properly. *sigh* There was no easy to way to fix this one yet so I concentrated on making sure that the person who didn't get through or was cut off weren't able to keep listening to the conversation. This was done with a little trickery of making sure that their feed was being sent using their bank number and call number combined. You might be thinking, well.. wouldn't the person who received two calls at the same time be connected to both calls? nope, I thought of that and made sure that before I published a feed to make sure to unpublish first. That stopped the feed that was going to the first person (or whoever was getting cut off).
In the end, I didn't solve the main problem, but did manage to reduce confusion as now the person who got cut off was staring at a frozen image, or nothing. They could hang up and call the next person.. a little peeved maybe, buy it was only training and not the real thing.
The idea behind this debugging story is that you can't be too careful in your coding. When it comes to FCS stuff, everything is live, you have to think of so many more things. When to cut off feeds, start them, send messages to whom and when from where and if that is supposed to get a response then do something with it. Do people double click? Do they move the mouse off the button after they click? All kinds of stuff. Once again, I'll go into the little bugs in another post... I've typed for too long :)
Posted by Graeme at December 7, 2003 04:41 PM