
Wednesday, July 28, 2010

Response: on "But how many test cases?" by James Christie

Somehow the question of how many test cases seems so important to "important" people, as if they are paid by the numbers instead of valued for the value they deliver. Somehow a false trust is derived from figures: people rely on numbers and assume that the number of test cases represents good quality. It seems such an obvious way of working.

James Christie's posting triggered me to answer on my own weblog.

I value James Christie's posting But how many test cases? for its content in relation to the context he defines. That context is lost when you translate the posting into numbers. Below you will find several attempts to make a good posting look bad by valuing it numerically.

Example 1:
Imagine that your blog posting is rated by the number of letters. In your posting you use about 5534 characters; telling the same story on Twitter, you would need over 40 tweets. Does this show value? It seems that one blog post provides more value than one tweet, although the tweet which pointed me to that blog was also very valuable. So is 1 more than 40?

Example 2:
Or what about the coverage of letters? You used every letter of the alphabet, so your coverage is 100%. Does this tell you anything about the quality of your posting? What about assigning numbers to it?

[The original post showed a table here with the count of each letter used in the posting.]

Impressive usage of the letter "e": based on the numbers it is by far the most used letter. Does this provide information? I don't think so. Perhaps the letter "e" should be used more often, perhaps in relation to other letters. Even that way of thinking is wrong: it tells you nothing about the context.
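To make the absurdity concrete, here is a minimal sketch in Python of how such a letter "metric" could be computed; the sample text is a stand-in for the full posting:

```python
from collections import Counter
import string

def letter_metrics(text):
    """Letter counts, alphabet 'coverage' and the most used letter --
    deliberately meaningless quality metrics, as argued above."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    coverage = len(counts) / 26 * 100  # % of the alphabet that appears
    return counts, coverage, counts.most_common(1)[0]

text = "But how many test cases do we really need?"  # stand-in sample
counts, coverage, top = letter_metrics(text)
print(f"coverage: {coverage:.0f}%  most used letter: {top}")
```

The numbers come out neatly, and they still say nothing about the quality of the text.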

Example 3:
What about visualizing the numbers? Below you see a snapshot of what remains if you look only at the numbers mentioned in his blog. I also left some noise in it; the (con)text is now removed.

[The original post showed an image here: only the numbers from the posting, stripped of their context.]

What is its value now? What information can be obtained? Perhaps the 100,000 mentioned in the text is impressive.


Conclusion
Is it correct to drive our testing by numbers? Is it useful to express coverage in terms of test cases executed? Is James's weblog, based on my examples, valid and good? The numbers are clear and proven, aren't they? Are you counting the time?


I compliment James on his blog. A lot of time is spent to "prove" something which is explained incorrectly; the wrong questions are asked.

Thursday, April 29, 2010

Is benchmarking test effort useful?

Recently another professional asked me about the information that is created during a test process. Which products are delivered? Which information is reported, and therefore collected?

I provided him some information with the remark: It depends.

I used some of the phases from TMap® to give him food for thought about the figures/products delivered per phase. This is just a short list which came to mind at that moment; there are items which can and should be added or removed.
Preparation
- Review results – provided to test team
- Overview of progress on preparation of functional designs/use cases – provided to test team/management/project team
- Bi-weekly progress reports – provided to management and project team
- E-mail/memo instead of escalation – to stakeholders
- Test plan – provided to project team and test team
Specification
- Progress on specification of test cases (including numbers, ownership, etc.) – provided to test team and project team
- Bi-weekly progress reports – provided to management and project team
- E-mail/memo instead of escalation – to stakeholders
Execution
- Issue report – provided to project team
- Progress report – provided to management and project team
- Checklists/test scripts – provided to test team
- Daily or ad-hoc progress report – provided to test team or stakeholders
- E-mail/memo instead of escalation – to stakeholders
Completion
- End report
- Issue database
- Delivered value

When I met the professional I asked him about the project he is on. It turned out he was collecting information, based on experience, about how much effort testing costs and how much result is delivered during a period of testing. Based on this information from several different sources he is working on some general figures to support estimating up front how much testing will cost.

This is a very noble idea. Everyone wants information about how much testing will cost in terms of people, money, resources and skills. Everyone wants to know what the results will be: number of test cases, number of checks, number of issues found, coverage of testing and more.

Support people who challenge themselves to think further where others have been, continued or stopped.
I hope the professional I’m talking about will succeed and share his information with us.

To support him I will share some thoughts of mine which might help. Somehow I doubt it is easy to collect information from several projects, compare it, and use the results as some kind of benchmark.

I believe this is hard because of differences in:
- Organizations: not only does the business differ, they also change in strategy etc.
- Projects: you have to make a distinction between development approaches
- People: teams are not staffed with the same people or the same commitment
- Technology: it looks like you can compare by counting function points. In my opinion there are still differences, as technology is used by humans, and they have different skills and approaches, for example in error handling.
- Test approach and techniques used: these are often adapted based on the items above. The result is numbers with similar names whose values cannot be compared. It would be comparing green apples with red apples: you know the amounts, but the taste may well differ.

Basically it is hard to compare the figures that are collected. What is the value of the number of test cases written in a certain period for a certain project? Can these figures be used to calculate the effort and money it will cost, perhaps adding a certain percentage to be sure?
Can figures about the number of cases, in combination with issues found, be used to tell how many test cases must be executed to provide trust in a system that has not even been built yet?
Can figures about how many testers found how many issues by executing a certain number of test cases be used to predict the needed resources?
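To make these questions concrete, here is a minimal sketch in Python of the naive calculation such benchmarking implies; every number in it is a hypothetical assumption, not a proven figure:

```python
# A naive benchmark-based estimate of the kind questioned above.
historical_hours_per_case = 1.5   # average from earlier projects (assumed)
expected_test_cases = 400         # scoped for the new project (assumed)
safety_margin = 0.25              # "a certain percentage to be sure"

estimate = expected_test_cases * historical_hours_per_case
estimate_with_margin = estimate * (1 + safety_margin)

print(f"naive estimate: {estimate:.0f} hrs, "
      f"with margin: {estimate_with_margin:.0f} hrs")
# The point of this post: the inputs come from different organizations,
# projects, people and technology, so the output inherits all that noise.
```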

Is benchmarking good? Perhaps some people need this kind of trust and use it as a substitute for the experience they are missing. Is the prediction valuable to use?

I wonder when you are ever finished with benchmarking: collecting figures and information to produce a fancy chart which people are supposed to trust.

Why should we spend time collecting? I value more the skills and experience I have and the ability to compare them to any new project. Sometimes experience expressed in numbers can be used, sometimes I can rely on it; at least it gives me direction. Should I share it with others? Yes, of course, in combination with the context and the lessons learned from it. Can others use it? Only if they are able to judge it against their particular situation and do not use it as commonly proven figures.

Sunday, July 26, 2009

What is the leading source?

Ever wondered which data you should rely on? Are we monitoring the test scripts or are we monitoring the issue database? Based on which information are you able to give advice about quality?

It seems obvious to claim that both sources are important and necessary. Both sources appear on the metrics dashboards that are often used, which means the information shown there is used to give detailed advice on whether or not to go into production.

But what do you do when they are not in sync? That should not happen? Of course it does!

Situation A:
All test cases are executed and all issues are closed. Based on which source is advice given? Does it matter? Yes: who guarantees that you found all issues? Which source gives an indication of quality? It seems that the list of pending issues provides sufficient arguments towards management to go live. In this case the issue database gets the benefit, supported by the number and type of tests executed. The issue database is the leading source.

Situation B:
All test cases are executed and some failed; for these, issues are registered and still open. It seems that test scripts and issue database are in sync, providing the same information.
Are you using the information from the issue database or from the test cases? In this situation the issue database tells nothing about what you tested; the test scripts should. If all important/critical tests are performed, then you might be able to give advice based on this information and identify the possible risks of going live with known issues. The test scripts are the leading source.

Situation C:
All test cases are performed and passed, only there are some issues still open in the issue database: issues found during testing but not related to any test case.
There are situations where test cases are based on designs which do not tell everything, so during testing new issues are found that are not related to any test case. Another situation can be that issues are found which also exist in production. In these cases the test scripts cannot be used as the source for advice, as according to them there are no issues. The issue database will be used, and based on the impact of the open issues advice will be given. The issue database is the leading source.

Situation D:
All issues are closed and not all test cases are executed. Under time pressure you are not able to execute all test cases; still, all mandatory cases are executed. Based on information about the test cases, advice can be given. The test scripts are now the leading source.

Situation E:
All issues are closed, only the test scripts show that there are issues pending. According to the issue database there is no risk anymore; according to the test scripts there is some risk, because although all cases are executed there are issues with pending status in the scripts. An approach would be to perform a recheck against the issue database. The question here is: does the issue database provide enough information to change the status in the test script, or are you executing that test case again? What is now the leading source? If the issue database contains test results which prove the issue is solved, and the time stamp corresponds with the time stamp in the test script, the issue database is the leading source. If insufficient information is available, the test script becomes the leading source.

Situation F:
Not all issues are closed and not all test cases are executed. This is a tricky one. It might be easy to give negative advice as not everything is done: test cases and issues are open. But what if those are not that important? What is the leading source then? I suggest that both sources can provide some information, only not enough to become the leading source. Another source should be consulted: a requirements list, a risk list, the manager, the business. In this case a combination of sources should be used.

Situation G:
No test cases are executed and some issues are still open. It seems odd to go live on this information, but what about a patch update? Sometimes tests are performed without being written down. In this case the only source you can initially rely on is the issue database, in combination with the knowledge of people. Let's say that the issue database is the leading source.

Of course there are other situations as well. The intention of this post is to show that the source that informs the decision differs per situation. You should be aware whether you are going into production based on the number of test cases or on the number of issues, with respect to their status.
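As a summary, here is a minimal sketch in Python of the decision logic in situations A–G; the boolean states and the mapping are my own simplification of the descriptions above:

```python
def leading_source(all_tests_executed, all_issues_closed,
                   issues_outside_test_cases=False,
                   scripts_show_pending=False):
    """Simplified mapping of situations A-G to a leading source."""
    if all_tests_executed and all_issues_closed:
        if scripts_show_pending:   # E: DB says closed, scripts say pending
            return "issue database if it proves closure, else test scripts"
        return "issue database"    # A
    if all_tests_executed:         # not all issues are closed
        if issues_outside_test_cases:
            return "issue database"  # C: issues not tied to any test case
        return "test scripts"        # B: failed cases with registered issues
    if all_issues_closed:
        return "test scripts"        # D: mandatory cases were executed
    return "combination of sources"  # F (and G leans on issue DB + people)

print(leading_source(True, True))    # A -> issue database
print(leading_source(True, False))   # B -> test scripts
print(leading_source(False, False))  # F -> combination of sources
```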

Monday, June 1, 2009

Different metrics

One of my hobbies is playing with MS Excel. I do this at home and on my assignments; in the past I learned a lot on Experts-Exchange.com.

The power of MS Excel is collecting data and transforming it into information. Initially I tried to create a dashboard which I could use in every project. Unfortunately this was barely possible: each time the challenge was to translate the provided data so that it produced the same overview of tables and charts, and the requests for information from management differed each time as well.

I noticed there is a difference between the information I need to control the test process and the information management needs to stay informed.

As a project is dynamic, and the data during the process is too, I always start simple and keep the information I provide dynamic as well.

Information for me would be:
- # issues found
- # issues found per system component
- # test cases/issues open for test
- # test cases/issues per location (what is the status and who should act)
- percentage ready (when possible in relation to the time left)
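A minimal sketch, in Python rather than Excel, of how such counts could be derived from an exported issue list; the field names and values are hypothetical:

```python
from collections import Counter

# Hypothetical export of the issue database (in Excel this would be a sheet).
issues = [
    {"id": 1, "component": "UI",  "location": "dev",  "status": "open"},
    {"id": 2, "component": "API", "location": "test", "status": "open"},
    {"id": 3, "component": "UI",  "location": "test", "status": "closed"},
]

total_found = len(issues)
per_component = Counter(i["component"] for i in issues)
open_per_location = Counter(i["location"] for i in issues
                            if i["status"] == "open")
pct_ready = sum(i["status"] == "closed" for i in issues) / total_found * 100

print(total_found, dict(per_component), dict(open_per_location),
      f"{pct_ready:.0f}% ready")
```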

Often the difficulty is that the information I need is stored in other applications. When possible I download it. The disadvantage here is that if the structure of the information changes, or objects change names, it becomes hard to compare. Therefore I always start simple and let the dashboard grow when necessary, or shrink when possible.

Personally I like tables with data, because they enable you to discuss how to understand the information. This is often a complex process: not only do you have to explain what information is there, you also have to explain that the data shown is correct. Often several different conclusions can be drawn from those tables.

The pitfall is that tables are sometimes hard to explain to management, because they like charts. The risk of charts is that the discussion behind them cannot be shown: information might be missing and people can be misled. In certain cases I like to keep it simple and fancy.

Here are some examples I have used over the years:
- The progress meter, which can be used per process and for a total. For a detailed explanation see: Peltier's Speedometer

Another chart I use is a traffic light to show how the testers feel about the quality; this is often supported by written remarks. Although it is not objective, it still shows the feeling and possible threats.

Another chart I often use shows the location of issues. It changes on a daily basis, and I use it in combination with the available time.

A chart managers often like is planned vs. actual. This is often hard to produce, as the planned number of test cases depends on the availability of testers and on whether they are dedicated to testing. Another remark concerns the origin of the planned figures, as planned is different from intended.


One of the powerful functions in MS Excel for obtaining data from unstructured sources is SUMPRODUCT. I often use it in preference to pivot tables, as it avoids workbooks growing by megabytes. Used wisely, it also enables you to adapt when data sources change.
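As an illustration, here is a conditional count of the kind SUMPRODUCT is good at, shown as an Excel formula in the comments with a rough Python equivalent below it; the ranges and field names are assumptions:

```python
# In Excel, a conditional count across two columns can look like:
#   =SUMPRODUCT((A2:A100="open")*(B2:B100="high"))
# which counts rows where status is "open" AND severity is "high".
# A rough Python equivalent over exported rows (field names assumed):
rows = [
    {"status": "open",   "severity": "high"},
    {"status": "open",   "severity": "low"},
    {"status": "closed", "severity": "high"},
]
open_and_high = sum(r["status"] == "open" and r["severity"] == "high"
                    for r in rows)
print(open_and_high)  # -> 1
```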
My advice is to start simple and avoid standard templates, as each project is different. It costs more time to adapt a template to the provided data and information than to start from scratch. Working this way you avoid maintaining unnecessary data and informing management with information nobody asked for.

Thursday, February 12, 2009

How many bugs do you need to make a metric?

Last year I attended a presentation by Markus Schumacher about metrics. His presentation can be summarized by his saying: "Metrics should change behavior".

I cannot forget this saying while I'm busy collecting and presenting data. Often I see presentations of figures about how many bugs we have already found during testing, and discussions about how many bugs we didn't find. A lot of time is spent making nice charts from this data.

Michael Bolton posted a very good blog posting about metrics: Meaningful Metrics

In his post he suggests creating metrics which raise questions, instead of metrics to be used as a control mechanism.

I don't think you can expect management to accept skipping control metrics from the start, and presenting no metrics at all is not appreciated either. What I often do is start simple, by presenting the number of issues found in a certain period and their severity. Although this won't change behavior or raise questions about quality as it should, it creates room to discuss what should be measured. Collecting data and creating metrics is a dynamic process too.

I would summarize this process in the following steps:
1. Ask what metrics are needed immediately; let management prioritize, because it costs time and effort;
2. Ask what their goals are for those metrics;
3. Define the goal of the selected metrics together with the customer, and define criteria for how to use them and how to monitor their benefits;
4. Add 1 or 2 other metrics you think are needed to meet their goal;
5. Once they are used to those metrics, you might even visualize them in charts;
6. After an iteration or a weekly meeting, check whether they met their goals;
7. Adapt your metrics based on requests; this can also mean adding or removing metrics;
8. Sometimes you have to dive into your collected data to see if questions can be defined based on historical data;
9. If people get punished because of the metrics, remove that information; interpreting metrics is a profession of its own, and the metrics we are dealing with can often barely be used even as indications;
10. Do this every time again; don't rely on the belief that if things were good enough last time, they should be good enough now.

Coming back to the title, "How many bugs do you need to create a metric?", I would say: none. You need at least one question to which you have to find an answer. The answer should lead to a change in behavior, make people move, or raise new questions.

Numbers of bugs can be used to create fancy charts. As long as those charts don't move people, I would say that charts presenting >1000 bugs in 10 categories over a period of 10 weeks are only useful if you want to practice with colors and chart types.

Sunday, February 8, 2009

To do or not to do, that is to measure

Somewhere there is a thin border between the metrics needed and the metrics offered. What I have seen over the years is that projects generate a huge amount of data. From experience I noticed that data which provided useful information in one project was not wanted in another.

Translating data into information is usually time consuming. This raises the question: should I spend time providing all the information I'm able to deliver? I would answer "NO".
Should I stop collecting data? Again I say "NO".

There are several tools available to provide information about the test process. Usually I end up creating some kind of cockpit in MS Excel. The benefit of this is the ability to combine data from several sources and turn it into information.

To get data from other tools into MS Excel, I export it into a format the spreadsheet can read. The advantage is that I am able to preserve historical data.

Instead of turning all data into information I start with some obvious items: those management needs to be informed about the process, and those I need to steer the process. I have learned that after I start presenting nice columns of data and charts, management asks for different information, or for the already presented information in more or less detail.

Because I preserve historical data, I'm able to reuse it instead of starting to measure from scratch.

The immediate benefit of not presenting all the information you are able to give is that you save time. Staying adaptive to new information requests gives you the benefit that things change because of the information you present.

I believe defining and creating metrics should lead to a change in behavior, attitude or trust, and we should beware of creating false trust. Keep asking what might change if metrics are collected and presented. If nothing changes, then do not create the metrics. Certainly avoid the attitude: "I do this because it is written down" or "I have always done this".

Tuesday, February 3, 2009

Classification of the "Unknown Unknowns"

In my previous post Schedule the Unknown Unknowns I talked about the good post and classification by Catherine Powell about possible problems from systems in the test schedule: Problem Types. She wrote a follow-up on this, Uncertain of Your Uncertainty, and asked for ideas on the topic. This triggered me to dive into the subject of "unknown unknowns".

Although a lot of articles can be found on the internet, the one by Roger Nunn, The "Unknown Unknowns" of Plain English, helped me get a bit of light in the darkness. He presents "Rumsfeld's categories: the states of knowledge during communication":

1. Known knowns:
* Things I assume you know - so I do not need to mention them.
2. Known unknowns:
* Things I know you do not know - so perhaps I will tell you, if I want you to know.
* Things you know I do not know - but do you want me to know?
3. Unknown unknowns:
* Things I do not know that you do not know or that you need to know - so I do not think of telling you.
* Things I am not aware of myself - although it might be useful for you to know.
* Things you do not tell me, because you are not aware of them yourself - although I might need to know.

I think the main topic in his article is "communication". He refers to Rumsfeld: "Rumsfeld's winning enigma uses combinations of only two very common words that would fit into anyone's list of plain English lexis."

An article by C. David Brown, Ph.D., PE, Testing Unmanned Systems, triggered me to think in two types of systems: manned and unmanned. Why is this important? Acknowledging the existence of unmanned systems can help you not to forget to ask those systems questions too. Although I have never worked with unmanned systems, I have worked on projects with modules which were autonomously working parts of the whole system, so there is a huge chance that they contain my "unknown unknowns".

C. David Brown writes that unmanned systems can generally be classified into three categories:
- remotely operated;
- fully autonomous;
- leader-follower.

Combining both stories, you have to deal with human communication and system communication to find out about the "unknown unknowns". I think it should be possible to capture those uncertainties in a time schedule by identifying which questions to ask, at what moment, to whom or what, and for which purpose. Of course this is obvious; still, it is hard to tell how to fill in those gaps.

What I often read on James Bach's and Michael Bolton's weblogs is to question the system and to question the business; to test the tester instead of certifying the tester. If I'm correct, these questions are raised to obtain mutual understanding and exchange knowledge. The pitfall is that questions are not asked, for one of the three reasons mentioned above under "unknown unknowns".

Perhaps the nature of the system, and its complexity, can help us here.
If a system or function runs fully autonomously, asking the system is much harder, and getting answers from it is harder still. Human interaction has to take place, which introduces another variant of those three unknown unknowns.

Perhaps you can continue in at least three directions:
1. Keep asking the system and the business, continue learning from each other, and start adapting your test process based on this uncertainty;
2. Define an approach where you are not measuring how much time has already been used, but how much time you need to get things Done (a Scrum approach could help);
3. Another option is reviewing. I think reviewing can decrease the level of unknown unknowns, as the individual expert is no longer the only person asking questions. With reviews you might also build up a system of numbers which helps you identify the areas where you have less knowledge, by the number of questions raised or major issues found.

For the first two options, I can imagine it will be hard to put this into your time schedule. If you use the last option, you might take some numbers from the Boehm curve and play with them. The (Dutch) book Reviews in de praktijk by J.J. Cannegieter, E. van Veenendaal, E. van de Vliet and M. van der Zwan defines a table with a simplified model of Boehm's law:

Phase – Simplified Boehm's law
Requirements – 1
Functional Specifications – 2
Technical Specification – 4
Realization – 8
Unit/Module test – 16
System/Acceptance test – 32
Production – 64

Now you can start playing with these figures; keep in mind that this will not lead to reliable numbers, but some figures can be more useful than no figures.

First: flip them around
Phase – Simplified Boehm's law
Requirements – 64
Functional Specifications – 32
Technical Specification – 16
Realization – 8
Unit/Module test – 4
System/Acceptance test – 2
Production – 1

Second: if 1 = 1 hour, then you might reserve 20% for unknown unknowns. In the requirements phase this gives an addition to the time schedule of 64 * 20% = 12.8 hrs.

I think you can use this kind of calculation because at the beginning of a project the chance of unknown unknowns is much higher than at the end.

In this situation you accept that every unknown unknown will have the same impact.
You can add a third step to it:
Third: add a classification based on the MoSCoW principle:
Must haves = 100%
Should haves = 80%
Could haves = 40%
Would haves = 20%

When having 10 M, 4 S, 25 C and 25 W, this leads in the requirements phase to:
10 * (64 * 20%) * 100% = 128 hrs
4 * (64 * 20%) * 80% = 40.96 hrs
25 * (64 * 20%) * 40% = 128 hrs
25 * (64 * 20%) * 20% = 64 hrs
Total: 360.96 hrs

Fourth: there are still a lot of assumptions here. The percentages should be adapted based on history and complexity: the initial 20% should be derived from historical figures, and the percentages used in the MoSCoW classification can be based on technical complexity combined with the available information. I assume that if a lot of questions are raised during properly executed reviews, the chance that something is missing is higher.

Fifth: the numbers from the simplified Boehm's law can be adapted based on how clearly people communicate. Perhaps the level of interaction between people can be used to minimize the chance that information is not shared at an early stage.
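Putting the first three steps together, here is a minimal sketch in Python; the phase weights are the flipped table above, while the 20% reserve and the MoSCoW counts are the illustrative assumptions from the example:

```python
# Flipped simplified Boehm's-law weights (1 unit = 1 hour).
PHASE_WEIGHT = {
    "Requirements": 64,
    "Functional specifications": 32,
    "Technical specification": 16,
    "Realization": 8,
    "Unit/module test": 4,
    "System/acceptance test": 2,
    "Production": 1,
}
UNKNOWN_RESERVE = 0.20  # step two: 20% for unknown unknowns (assumed)
MOSCOW = {"M": 1.00, "S": 0.80, "C": 0.40, "W": 0.20}  # step three

def unknown_unknowns_hours(phase, items):
    """Time reserve for a phase, given counts of MoSCoW-classified items."""
    base = PHASE_WEIGHT[phase] * UNKNOWN_RESERVE
    return sum(count * base * MOSCOW[cls] for cls, count in items.items())

# The example from the text: 10 M, 4 S, 25 C and 25 W in requirements.
print(unknown_unknowns_hours("Requirements",
                             {"M": 10, "S": 4, "C": 25, "W": 25}))
# -> 360.96
```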

Sunday, February 1, 2009

Schedule the Unknown Unknowns

Catherine Powell posted a very good and clear article about dealing with possible problems from systems in the test schedule: Problem Types.

She classifies them into:
- Fully understood problems;
- Partially understood problems that can be worked in parallel;
- Partially understood problems that are linear.

What I missed here is consideration of the risk of problems we are not yet aware of: the so-called unknown unknowns.

Donald Rumsfeld said: "There are known knowns. There are things we know that we know. There are known unknowns. That is to say, there are things that we now know we don’t know. But there are also unknown unknowns. There are things we do not know we don’t know."

I think Catherine's classification covers almost the whole of Rumsfeld's saying:
* Fully understood problems -> things we know that we know
* Partially understood problems -> the known unknowns, the things we now know we don't know

As she mentioned, this can be used to refine the time schedule: the known knowns are already in place, and the fully and partially understood problems have to be considered to estimate some additional time.

The unknown unknowns I still miss in the classification. If I understand her correctly, she advises to keep investigating and then add them to one of the classifications.

I can imagine that in the time schedule you already reserve time for this. For a time reservation you need some solid ground, and as we are talking about unknown unknowns, that is certainly missing. Perhaps you can justify that solid ground by investigating how complex the system is and how many questions you might still ask of the system. Based on this you can visualize the field of uncertainty.

Explicitly reserving time for this can help project management identify the risk that the schedule will overrun. The big challenge is not only to move the unknown unknowns towards the known unknowns section; it will also be a challenge to get this activity accepted by project management.