General Steps to Resolve a Complex Software Defect

Real Lessons in Software Engineering Series

Written by Fernando Zamora on Friday Nov 9th, 2012

In a perfect world the software product would be free of any bugs or defects. The software would use best practices of design. The developers would approach the design with engineering principles that take into account performance, scalability, flexibility, targeted loads, etc. In the real world there’s no software product that’s 100% free of defects. There are simply too many factors that affect the functionality of a complex system. These factors include:

  • Skill level of the developers
  • Technical maturity of the lead developer/architect
  • Integration with third party software components
  • Third party software at lower layers of the overall product (e.g. the web server application)
  • Hardware Infrastructure

Our targeted application type is line of business (LOB) applications

For this topic I will assume we are discussing line of business applications. Other types of systems, such as real-time or critical systems, may have different needs and the diagnostic approach must be adjusted. Careful consideration must be taken when analyzing those types of systems for obvious reasons (e.g. safety). The discussion for those types of systems is beyond the scope of this post. Most line of business (LOB) applications carry a significant amount of complexity, even in the simplest of features. The reason is that these types of applications have several layers and pieces of software at work.

Take a web application for instance. The application will have client-side software (the browser at a minimum), server-side software (usually composed of the custom server-side code and the web server), and finally the back-end database (Oracle, SQL, NoSQL, etc). Many times these components are hosted on different boxes. Even these layers themselves have their own sets of complexity. I’m using a web application as an example, but it can be any system that is composed of multiple components or frameworks.

Reasons for Flaws

Applications with this type of complexity inevitably ship with unintended flaws. The flaws are a result of many different factors. The reasons can range from improper implementation of a single component to faulty integration between components. It can be bad coding due to lack of developer skills. At times, it can be a proper implementation but a faulty third party component. In many cases the solution is intentionally designed with only a certain level of reliability due to the feasibility analysis. In other cases the factor can be a rushed timeline. The reasons for flaws are almost infinite. In any case, some defects will creep into the system.

In many cases these flaws are not noticeable in the testing environment. Many times they are identified quickly once the release is deployed into production. In many other cases, they don’t surface until the system has been in operation for several months or even years. So what causes these flaws to surface? There are many circumstances that can force them to the surface. Here are some of the ones I have personally observed:

  1. The system is now used by an organization with higher data-volume demands than it was normally used for
  2. The system is now shared by many more client users than originally anticipated
  3. The system has accumulated an amount of data that is past its threshold for optimal performance
  4. Certain network policies are enforced that have a negative impact on the application
  5. Users use the system in ways it was not originally designed for
  6. The number of users of the system grows, pushing the system’s performance capabilities

How to Recognize the Defect

Many of these flaws can be avoided by taking non-functional and performance requirements into consideration up front. I don’t intend to discuss those concepts today. My goal is to provide a way to diagnose and troubleshoot a system that is already in production and contains these flaws. This is a scenario where these requirements were not taken into consideration and as a result the system has a significant problem that needs to be addressed.

Many times bugs are straightforward, obvious, and require little research for the analysis and solution. Other times, the defects are bizarre and can’t be easily explained. The latter are the type of bug that I’m discussing here. To make matters worse, these defects normally surface on the production system, where the ability to analyze them is more restricted (you can’t simply introduce risks that will further affect operations). Because it is in production, the problem has an even higher severity level, since it may be slowing down operations. In many cases the flaw may even result in a complete halt of operations. As you already know, slowing down or halting operations equates to immediate and long term costs: immediate costs due to loss of productivity, and long term costs for your organization as a result of lost future business opportunities.

This discussion is based on real lessons

I’ve participated in solving these types of defects many times. Through these experiences, I’ve learned that the problem must be approached in a highly structured manner. You must approach the problem in a manner that will reduce the problem area at each step of the process, thereby corralling the root cause into a nice little cage. I know, I know, I am making it sound too simple. When you are faced with this type of problem for the first time, you may be totally baffled and have no idea where to begin. It may even seem impossible to diagnose when you first hear about the problem. What makes it even more difficult is that it’s happening in production. In reality it’s not that difficult to diagnose these types of issues. You just have to be disciplined in following a highly structured approach.

Avoid an ad hoc troubleshooting approach

One thing you should avoid when attempting to solve this type of problem is ad hoc experimentation. Even worse, you should avoid rash assumptions about the solution. For example, let’s say that on a given screen, users are complaining of timeouts. You examine the problem and find out that the code is fetching many times more records than it should. So immediately you say to yourself “that must be the problem”. You fix the issue and deploy the fix. Your managers are glad to hear the news and everything is back to normal, or so you think. Once the solution is deployed, your organization finds out from an angry customer that the issue still exists. Worse yet, that customer may start getting some real attention from his top management. When his top management gets the word, you can be sure that they will contact your top management. I think you get the picture about what happens next (hint: management oversight of the fix, hourly meetings, etc).

As a result of the incident, you have now lost a significant amount of credibility. I can’t argue that it’s not well-deserved. You should’ve tested the fix before deployment. Even if you had tested it, it’s not a good idea to approach troubleshooting in this manner, because that’s jumping directly to solutions. Sometimes this approach works, but in most cases it doesn’t. The problem with ad hoc experimentation is that you will lose track of what works and what doesn’t, because you haven’t reduced your problem area. It’s very likely that after several days you are back at square one with no clue of what’s wrong and no idea of what to try next. Beware of this approach because on the surface it appears to be faster than a structured approach. I find it similar to gambling: you can become very rich really fast, but it’s highly unlikely. The reality is that you will probably end up very broke. This is basically taking the “I’m feeling lucky” approach.

Try a structured approach instead

As it turns out, there is already an existing method that allows you to approach this problem systematically. It is called the scientific method.

Although the approach that I describe differs slightly from the textbook scientific method, it is very similar. In our case we are trying to find a solution through a series of small applications of the scientific method. On top of that, we have to formulate a plan that allows us to start with the entire system as the problem area and then carefully reduce the problem area into slices. Each time we narrow the problem down to a particular slice, we take that slice and cut it into smaller slices. Eventually this leads us to the root cause.

The first step is to record all the information regarding the observation. Once you record that information you can start thinking about possible theories for what may be causing the problem. You will want to ask the questions that cut the whole problem into big slices. The kinds of questions that you want to ask are the ones whose answers eliminate big portions of the problem area. For example, you may want to identify whether there is a bottleneck somewhere along the path the data travels: the client, the server, or the database. Once you determine that the problem is in one of those pieces you can set the others aside and focus solely on the faulty area.
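
As a minimal sketch of that first slicing question, the snippet below times each layer of one request and reports the slowest one. Everything in it is an assumption made for illustration: the `timed` helper, the `query_database`, `build_response` and `handle_request` functions, and the sleep calls that stand in for real work. In a real system you would wrap the actual calls and collect many samples before drawing any conclusion.

```python
import time
from contextlib import contextmanager

# Accumulated seconds per layer (hypothetical layer names).
timings = {}

@contextmanager
def timed(layer):
    """Add the wall-clock time spent inside the block to the layer's total."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[layer] = timings.get(layer, 0.0) + elapsed

def query_database():
    # Stand-in for the real database call; the sleep simulates a slow query.
    time.sleep(0.05)
    return list(range(1000))

def build_response(rows):
    # Stand-in for the server-side processing of the rows.
    time.sleep(0.005)
    return {"count": len(rows)}

def handle_request():
    with timed("database"):
        rows = query_database()
    with timed("server"):
        return build_response(rows)

def slowest_layer(samples):
    """Return the layer name with the largest accumulated time."""
    return max(samples, key=samples.get)

handle_request()
suspect = slowest_layer(timings)
print(f"Focus the next round of questions on: {suspect}")
```

The point is not the timing code itself but the question it answers: one measurement decides which slices can be set aside for now.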

You can repeat the same steps to further subdivide that area. You shouldn’t throw out any collected information though. That information is a record that can be analyzed later if necessary. Let’s say, for example, that the problem is occurring on the back-end. Once we analyze the issue, we discover that the problem is due to bad inline queries. You can go back to the bad queries and revisit the existing data if necessary. Because inline queries live in the application code, the fix will have to be made there, even though the problem was occurring in the back-end.

In this case let’s say, for example, that you have reduced the problem down to the database. At this point you need to start looking for ways to approach the analysis there. You may need to do some additional research and make some additional observations. You may want to look at the query itself and at the size of the data within the table. You may start developing some additional tests that you want to run. This would obviously require a significant amount of work and cooperation with an experienced DBA. Your findings may lead you to conclude it’s one of the following problems: data volume, inefficient queries, or lack of database fine tuning. At this point you can leverage the scientific method to reduce your problem area even more.
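
Here is a small sketch of what one such test could look like for the "inefficient queries" theory. It uses Python’s built-in `sqlite3` module and SQLite’s `EXPLAIN QUERY PLAN` purely as an assumption so the example is self-contained; your production database will have its own tooling, and the `orders` table, the `idx_orders_customer` index and the `query_plan` helper are all made up for illustration.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)

# Hypothetical table filled with enough rows to resemble accumulated data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 500, float(i)) for i in range(10_000)],
)

suspect_sql = "SELECT total FROM orders WHERE customer_id = ?"

# Observation: how does the database execute the suspect query today?
plan_before = query_plan(conn, suspect_sql, (42,))

# Test condition: add the index the theory says is missing, then observe again.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, suspect_sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

If the plan changes from a full table scan to an index search, the theory survives this round. A changed plan alone does not prove that the index is the root cause of the production symptom; it only tells you this slice is worth cutting further.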

Validating any conclusions

Commonly when diagnosing these types of issues, there may be several team members looking at the problem from different angles simultaneously. One of those members may pop up and announce that they’ve found the problem. When that happens, it is their responsibility to provide evidence and to prove that their findings and conclusions are accurate. I’ve seen many times where experienced developers are fully convinced of their conclusions, only to find out later that they were wrong. Therefore never accept conclusions without proof. You must also validate the proof itself, because a faulty test will provide false positives. Even more importantly, the developers supporting your effort must have a good understanding of your goals and your approach. This keeps them on track and keeps them from chasing tangential observations. Without that understanding, your progress will slow down and confusion will follow.

Once you have reduced the problem down to one possible root cause, you can use the steps below to finalize your analysis. They are the adjusted steps that allow you to test each individual possible cause.

  1. Keep a document of all your research that points you to this possible root cause
  2. Prove your diagnosis by reproducing the problem with the known variables
    If your diagnosis fails, double check your setup and go back to step 1.
  3. Apply the conditions that will resolve the issue (at this point you don’t necessarily have the real fix, just a test to prove what the culprit is)
  4. Repeat the steps that reproduced the bug and the issue should no longer exist
    If the issue still exists, re-verify that all of your conditions are set up correctly.
    If it still exists after that, go back to step 1.
  5. Analyze to determine possible solutions.

    Keep in mind that you were initially only trying to prove the cause. This does not mean you have actually fixed it. At this point you have only figured out how it happens and how to prevent it from happening. For example, let’s say that you have bad data. You tested it by putting bad data in the database and then by removing that data. The proper solution may involve a combination of changes that clean up the data, prevent bad data from creeping back into the system, and modify any affected areas or subsystems. This may require the opinions of several team members. You must consider time and resources, as well as effects on other areas and on the user experience. This is usually where you will need to seek guidance from management. It is your job to provide management with all the necessary facts for them to make a decision. You can also make recommendations. If a manager requests something different from your original options, or a modified version of one, it is your job to inform them of the risks involved and whether it’s even doable at all.

  6. Pick the most feasible solution.
  7. Develop the solution
  8. Apply the solution
  9. Retest under the original conditions through your normal software development life-cycle processes
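
Steps 2 through 4 above can be sketched in code. This is only an illustration under my own assumptions: the `Diagnosis` class, the `verify_cause` function and the three callbacks are hypothetical names, and the "bad data" scenario is a toy version of the example described in step 5.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Diagnosis:
    reproduced: bool  # step 2: the bug appears under the known variables
    resolved: bool    # step 4: the bug is gone once the test condition is applied

    @property
    def confirmed(self) -> bool:
        return self.reproduced and self.resolved

def verify_cause(
    set_up_conditions: Callable[[], None],
    bug_occurs: Callable[[], bool],
    apply_test_condition: Callable[[], None],
) -> Diagnosis:
    """Reproduce the bug, apply the condition that should remove it, recheck."""
    set_up_conditions()
    if not bug_occurs():
        # Could not reproduce: double check the setup, then go back to step 1.
        return Diagnosis(reproduced=False, resolved=False)
    apply_test_condition()  # step 3: a test condition, not the real fix
    return Diagnosis(reproduced=True, resolved=not bug_occurs())

# Toy scenario: a report fails when a row holds a non-numeric quantity.
records = [{"id": 1, "qty": "3"}, {"id": 2, "qty": "7"}]

def insert_bad_row():
    records.append({"id": 3, "qty": "n/a"})

def report_fails():
    try:
        sum(int(r["qty"]) for r in records)
        return False
    except ValueError:
        return True

def remove_bad_rows():
    records[:] = [r for r in records if r["qty"].isdigit()]

result = verify_cause(insert_bad_row, report_fails, remove_bad_rows)
print("cause confirmed:", result.confirmed)
```

A confirmed result only proves the cause. Removing the bad rows by hand is the test condition from step 3, and the real solution still has to be chosen in steps 5 through 9.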

Managing the process

There are several challenges in implementing this approach. One problem is that many, including management, may perceive this as a slow way of getting to the problem. Another problem is that you may not have adequate support from other technical members of the team. For example you may need information regarding the server itself or you may require information regarding the database, that may only be available to System Administrators and Database Administrators.

An even more common problem is the ego problem. A programmer or DBA may be offended that you are questioning their judgement or their skills during your research. This can make them apprehensive, or they may even refuse to provide the answers that you request from them. Another problem may be the customer’s unwillingness to cooperate due to their level of dissatisfaction with the product. If you happen to be the key champion in finding this issue you must find a positive way to acquire everyone’s buy-in for your approach. You must ensure that everyone on the team understands the objectives and the approach. You must make sure that your participants don’t deviate from the problem and start experimenting ad hoc at will. It is important to apply your soft skills here so that you can convince everyone in a tactful manner why it is important to go through the full process. You must make sure that they understand that in order to cross something off your list you must validate it. Also, they must know that if they run across any observation, they must run it by you before they spend any time pursuing it. This may sound like something of a control-freak approach. It is not that at all. It is only controlled in this manner because you, as the champion, are aware of what the rest of the team is doing, which means that you may prevent someone from going on an expedition to prove something that has already been proven or eliminated by someone else.
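
One lightweight way to support that coordination is a shared log of leads. The sketch below is my own illustration, not a prescribed tool: `InvestigationLog`, `Hypothesis`, `Status` and the names in the usage lines are all hypothetical. It captures two rules from this post: a lead is registered before anyone pursues it, and nothing is crossed off without evidence.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    OPEN = "open"
    CONFIRMED = "confirmed"
    ELIMINATED = "eliminated"

@dataclass
class Hypothesis:
    description: str
    owner: str
    status: Status = Status.OPEN
    evidence: list = field(default_factory=list)

class InvestigationLog:
    """Shared record of leads so nobody re-investigates a settled one."""

    def __init__(self):
        self._items = {}

    def claim(self, key, description, owner):
        """Register a lead; return the existing entry if it is already known."""
        if key not in self._items:
            self._items[key] = Hypothesis(description, owner)
        return self._items[key]

    def close(self, key, status, evidence):
        """Cross a lead off the list, which is only allowed with evidence."""
        if status is Status.OPEN:
            raise ValueError("close() needs CONFIRMED or ELIMINATED")
        if not evidence:
            raise ValueError("a conclusion without proof is not accepted")
        item = self._items[key]
        item.status = status
        item.evidence.append(evidence)
        return item

log = InvestigationLog()
log.claim("network-policy", "New firewall rule delays responses", owner="Ana")
log.close("network-policy", Status.ELIMINATED, "Same timeout with the rule disabled")

# A second team member checks in before starting and sees it is already settled.
lead = log.claim("network-policy", "Firewall might be the cause", owner="Raj")
print(lead.owner, lead.status.value, lead.evidence)
```

In practice a shared document or ticket list does the same job. What matters is that every lead has an owner, a status, and the proof behind that status.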

This approach does not guarantee you a speedy process. It doesn’t even guarantee you a solution. In many cases the ideal solution may not be feasible or adequate, and what you end up with may be a temporary work-around. What it does guarantee is the proper diagnosis of the problem.

This approach works. In your work you may have to adjust it but hopefully it helps you out the next time you are faced with a complex software defect.

One last thing I’d like to mention is that for this approach to work, the individual leading it must be a well-versed lead developer, meaning that they must have a solid understanding of the programming language in question as well as the architecture of the system.
