The importance of rehearsal

You have designed a study. Everyone seems to be buying in. Scheduling participants is working out and the mix looks good. What’s left to be done except just doing the sessions? Three things:

  1. Practice.
  2. Practice.
  3. Practice.

There are three rounds of practice that I do before I do a “real” session. Jeez, I can hear you say: why so much practice? Why would you, Dana, who have been doing usability testing for so many years, need to rehearse? I do it for a couple of reasons:

  • It gives me multiple opportunities to clarify the intent of the test, the tasks, and the data measures.
  • I can focus on observing the participant in each regular session because any kinks have been worked out.

Walk through the script and gather tools and materials
The first round of practice is to walk through my test plan and script. I read the script aloud even though I’m by myself. While I’m doing that, I do two things: adjust the wording to sound more natural, and gather the tools and materials I’ll need to run the sessions.

Do a dress rehearsal or dry run
For the second round of practice, I do a dry run of the now refined script with someone I know filling the role of the participant. We do everything you would normally do in a session, from greeting and filling out forms, to doing tasks, to closing the session. I might occasionally stop the session to adjust the script or to make notes about what to do differently next time. I might even ask the participant (usually a friend, neighbor, or colleague) questions about whether the test is making sense. It’s a combination of dress rehearsal and “logic and accuracy” test to get the sequence down and to make sure you’ve got all the necessary pieces.

Pilot the protocol
Finally, there’s the pilot test session. In this pilot, I work with a “real” participant – someone who was screened and scheduled along with all of the other participants. I conduct the session in the same way I intend to conduct all of the following sessions. The twist this time is that observers from the design team should be present. At the end of the session, I debrief with them about the protocol.

Don’t waste good participant data
There have been times when I’ve been rushed by a client or was just too cavalier about going into a usability test and did not rehearse. I paid for it by having rough sessions that I couldn’t use all the data from. Every time it’s a reminder that preparation and practice are as important to getting good data as a good test design is.

Are you doing “user testing” or “usability testing”?

Calling anything user testing just seems bad. Okay, contrary to the usual content on this blog – which I’ve tried to make about method and technique – this discussion is philosophical and political. If you feel it isn’t decent to talk about the politics of user research in public, then you should perhaps click away right now.

I know, talking about “users” opens up another whole discussion that we’re not going to have here, now. In this post, I want to focus on the difference between “usability testing” and “user testing” and why we should be specific.

When I say “usability test,” what I’m talking about is testing a design for how usable it is. Or rather, how unusable it is, because that’s what we can measure: how hard it is to use, how many errors people make, and how frustrated people feel when using it. Usability testing is about finding the issues that leave a design lacking. By observing usability test sessions, a team can learn what the issues are, make inferences about why they are happening, and then implement informed design solutions.

If someone says “user testing,” what does that mean? Let’s talk about the two words separately.

First, what’s a “user”? It is true that we ask people who use (or who might use) a design to take part in the study of how usable the design is, and some of us might refer to those people as “users” of the product.

Now, “testing” is about using some specified method for evaluating something. If you call it “user testing,” it sure sounds like you are evaluating users, even though what you probably mean to say is that you’re putting a design in front of users to see how they evaluate it. It’s shorthand, but I think it is the wrong shorthand.

If the point is to observe people interacting with a design to see where the flaws in the design are and why those elements aren’t successful, then you’re going beyond user testing. You’re at usability testing. That’s what I do as part of my user research practice. I try not to test the users in the process.

Should you test in a lab or in the field?

I haven’t been in a usability test lab for about a year. Ironically, while I was writing a book about usability testing, much of my work was field research to learn about particular audiences and their tasks.

And, though my usual position about labs is that exploratory usability testing is probably better done in the user’s environment, I’m excited about getting back into the lab.

Good reasons to test in a lab
I’m doing these upcoming tests in a lab facility because:

  • The testing is quantitative and summative. That is, I’m doing very specific counts of errors and failures that are strictly defined, so I want to control other aspects of the test, such as the computer setup.
  • I don’t want to interact much with the participants. I only want to direct participants when to start their tasks. Otherwise, I will intervene in the session only at prescribed points, so I will direct the session from a different room from where the participants are working.
  • I may have observers, but I won’t know until the last minute. Though I prefer it if observers arrive before the session starts and stay through a whole session, at a facility they can come and go because they can observe from a separate room.

Good reasons to test in the field
I recently did a usability study in the field. Why?

  • I wanted to learn about the user’s environment (rather than controlling it). In the exploratory study I’m thinking of, I got the best of both worlds: usability testing data in a realistic situation. I learned about lighting levels, surrounding noise, and what the participant’s desk setup was like. But I also got to observe relationships and interactions the participant had with others, typical interruptions (and recovery from those), and how the thing I was testing fit into the person’s work.
  • It was convenient for the participants. They don’t have to travel to the testing site, and the interruption to their typical day is minimized.
  • The sessions were informal enough that observers could be present in the room (after they had been properly trained). In fact, people from neighboring cubes often chimed in with comments or questions because they’d overheard what we were talking about. I took this to be a good thing: I learned about that communication dynamic, and those eavesdroppers often contributed information that was useful to my study.

In a future post, I’ll talk about what to look for in a lab facility if you’re renting one and how to find one.

Recruit based on demographics or behavior?

Recruiting for a usability test is hard. (I’ve said this before.) And it’s the most important thing to get right in a test. So how do you decide who to recruit?

Demographics don’t describe behavior
If you buy the argument of your marketing department, you will look at the demographics of the various segments and try to match their proportions. You’ll know the ages, incomes, education levels, ethnicities, and genders of your participants. But does knowing this help you predict behavior or performance? More importantly, with a sample of, say, eight participants, can you generalize discovered usability problems to the broader cohort?

Probably not. Here’s an example of why.

Though most video gamers are male, some are female. The problems and successes they have in using a game are similar. And there will be differences within the genders, too. Though most video gamers are young, there are a lot who aren’t. The problems they have in using a game are not likely to be due to differences in age if the participants have similar expertise with the platform and with the game (or similar games).

Behavior describes performance
Instead, the differences in behavior (interaction between the person and the technology) and performance (whether the human is successful in completing technology-mediated tasks) are much more likely to stem from differences in expertise.

Being younger or older doesn’t make you an expert at anything necessarily. Having a higher or lower household income doesn’t, either. You could argue that education level might, but it usually doesn’t unless there’s something in the test that is related to a particular domain that the educated person was specifically trained for.

You want people to be motivated to do the tasks you want them to do when they get into your test situation. Motivation can also make it easier or more difficult to find people. For example, if you want to test an online banking service or find out whether someone might sign up for a brokerage account online, the participants are more likely to fall into a “mature” category on the age scale than at the younger end or the very old end. That’s just because people in the mature range are more likely to have or want something like a mortgage than someone younger who isn’t in the market to buy a house, or someone older who would rather have a reverse mortgage. You might find some at either end, too. Above all, you want to see a range of people with different aptitudes and skill levels.

How do you recruit, then?

Minimize the demographics for small tests, focus on knowledge and proficiency
Skip the demographic questionnaire (or at least minimize it) and focus on what participants have done related to what you’re testing.

If you are doing a test of a Web site, you might care about what kinds of things participants do on the Internet and how often they do them. Also, when was the last time? For example, what’s the last thing they bought online? Purchasing at an e-commerce site, no matter how well designed the site is, involves complex interaction. It might be a reasonable proxy for searching, narrowing a search, going through a decision process, filling in online forms, handling error and information messages, understanding where in an online process they are, and so on. But it doesn’t matter how old participants are, how educated they are, or (usually) what their household income is.

If you’re testing how well text messaging works, you want to know whether people do it already and how much. If they don’t do texting, you might want some people in your study who have received messages but don’t send them. By asking what their recent experiences were related to what you want to test (without giving away your tasks), you can find out about motivation as well as expertise.
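To make this concrete, here is a minimal sketch of what a behavior-based screener might look like if you kept it as structured data plus a qualification rule. Everything in it (the questions, the answer options, and the recruiting buckets) is a hypothetical example for a texting study, not a standard instrument; I’m sketching it in Python only because that’s convenient.

# A minimal, hypothetical behavior-based screener for a texting study.
# The questions, options, and bucket names are made up for illustration.
SCREENER = [
    {
        "id": "send_frequency",
        "question": "In a typical week, how many text messages do you send?",
        "options": ["none", "1-5", "6-20", "more than 20"],
    },
    {
        "id": "last_received",
        "question": "When did you last receive a text message?",
        "options": ["today", "this week", "this month", "longer ago or never"],
    },
]

def classify(answers):
    """Place a respondent in a recruiting bucket based on behavior, or reject (None)."""
    sends = answers.get("send_frequency")
    received = answers.get("last_received")
    if sends in ("6-20", "more than 20"):
        return "frequent texter"  # has recent, repeated experience with the behavior
    if sends == "none" and received in ("today", "this week"):
        return "receives but does not send"  # the second group described above
    return None  # matches neither behavior profile; don't recruit

if __name__ == "__main__":
    for item in SCREENER:
        print(item["question"], "|", " / ".join(item["options"]))
    print(classify({"send_frequency": "none", "last_received": "today"}))
    # -> receives but does not send

Notice that nothing in this screener asks about age, income, or education; the buckets come entirely from what people have done recently.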

And this brings us to a discussion about “novice” versus “expert.” But that’s another post.

Does Geography Matter?

Today I’ve been writing for the new edition of Handbook of Usability Testing about setting up a test environment. Should you be in the lab or in the field? If you’re in the lab, what should the setup be like and why? These seemed like fairly easy questions to answer. But then I got to a question that I’ve been wondering about myself for years: Does geography matter?

Nielsen says it doesn’t

Jakob Nielsen’s April 30, 2007 Alertbox (http://www.useit.com/alertbox/user-test-locations.html) says that geography doesn’t matter (unless there are international considerations or a single industry dominates the location or a couple of other things). “You get the same insights regardless of where you conduct user testing, so there’s no reason to test in multiple cities. When a city is dominated by your own industry, however, you should definitely test elsewhere.”

I sent my question around to several usability testing experts. Jared Spool sent one of the most interesting responses, and nearly everyone had experience indicating that geography does matter.

Spool, Killam, and James say it does matter

“Remember,” Jared Spool says, “if you know everything [emphasis mine] there is to know about your users, their tasks, and their contexts, then you never need to test in the first place — all you need to do is be really smart and create a simple design. At that point, it boils down to a simple matter of programming.”

Bill Killam, of User-Centered Design, put it this way:

Performance and subjective preference and motivation are all linked, so any change in location that affects one or more of these can be a factor across all of them. But we usually find it appears only in subjective data – not as much in behavioral observations. Even local variations like testing within the client’s office versus a “neutral lab” sometimes have noticeable effects on things like projected responding. However, also consider regional differences in the use of or exposure to the product being tested. That will certainly affect results. Not to use too specific an example, but consider testing voting machines in the DC area versus a rural location. Or DC where paper and DREs [direct recording electronic voting machines] already exist versus NY where a full face ballot is used versus Oregon where all votes are by [mail].

Janice James contributed, “I’ve found that it IS important to test across multiple locations because I’ve found that the users do differ in terms of their experience level and exposure to product types, and technology, in general.”

Professor Spool and I continued the conversation by IM:

Dana: Okay, so it seems like your answer and Jakob’s article come from different assumptions. Jakob seems to assume that the field work is done. The team knows the context, etc. You seem to be saying that teams don’t always do the field work, first. By Nielsen’s parking meter example, the design team seems to have some background about the location.

Jared: Except teams always think they know everything.

Dana: I also think Jakob is assuming a fairly mature UX [user experience] group.

Jared: But, Jakob says, except for the few special cases discussed below, we’ve always identified the same usability findings, no matter where we tested. By now, we can clearly conclude that it’s a waste of money to do user testing in more than one city within a country. Good thing he wasn’t testing soda. Or pop. Or coke.

Dana: Yes, to your example, testing IA [information architecture] is a REALLY good reason to test in multiple locations. And the design team always will get some benefit from being on site – usually something that wasn’t predictable.

Dana: And with the audience for this book, I think it’s safe to assume that they won’t have done much (or any) field work before doing usability testing.

Jared: Right.

Jared: Testing in more than one locale is definitely a luxury.

Jared: I wouldn’t not test at all because you can’t get to more than one venue. Another approach is to make it work great for the local community and look to support and other feedback channels to hear if regional differences pop up. It’s the cross-your-fingers approach to design. It’s worked well through the centuries. Another approach is to look at other competitive/comparable designs for things that might be regional. If the designs have elements that seem different, is there a regional explanation?

Jared: Many design issues are just pure human behavior, independent of any cultural or regional issues.

Dana: I believe that.

Jared: Rolf [Molich] and Carolyn [Snyder] did a study where they tested people in two countries on the same sites. They found 80% of the problems were in common. They found regional biases. People in Europe didn’t understand the purpose of a gift registry (and found it to be quite vulgar). But, if you perfected the design for your local venue, you’d nail 80% of the problems found anywhere else, if you extrapolate their results. And that’s a pretty good hit rate for a small budget.

Dana: I agree.

Jared: My guess is that’s what Jakob was trying to say.

Dana: That’s possible.

Jared: It’s hard to say with his shield of impenetrable ego obscuring the real intent.

Dana: Do you mind if I clean up this thread and use it in a blog post?

Jared: Not at all.

Jared: You can even leave in the impenetrable ego comment.

Dana: Makes it more believable that it was a conversation with Jared Spool.

Jared: Remember, all elephants are tall and flat, except for the instances when they are long and skinny.

Dana: That’s right. Anyway, thanks for answering the email and for continuing the discussion. I appreciate it.

Jared: I’m saying his exceptions are the generalized case. And his generalized declaration is rarely executable.

Beware the Hawthorne Effect

In a clear and thoughtful article in the May 3, 2007 Journal of Usability Studies (JUS) put out by the Usability Professionals’ Association, Rich Macefield blasts the popular myths around the legendary Hawthorne effect. He goes on to explain very specifically how no interpretation of the Hawthorne effect applies to usability testing.

Popular myth – and Mayo’s (1933) original conclusion – says that human subjects in any kind of research will perform better just because they’re aware they’re being studied.

Several researchers have reviewed the original study that generated the finding, and they say that’s not what really happened. Parsons (1974) was the first to say that the improvement in performance of subjects in the original study was more likely due to feedback they got from the researchers about their performance and what they learned from getting that feedback.

Why it doesn’t apply to usability tests

Macefield convincingly demonstrates why the Hawthorne effect just doesn’t figure into well-designed, professionally executed usability tests:

  • The Hawthorne studies were longitudinal; most usability tests are not.
  • The subjects were experts; most participants in a usability test are novices at something because what they are using is new.
  • The metrics used in the Hawthorne studies were different from those used in most usability tests.
  • The subjects in the Hawthorne studies had horrible, boring jobs, so they may have been motivated to perform better because of the attention they got from researchers; in usability tests, it’s possible that participants experience being included as an unwanted interruption, or that they’re just doing the test to get paid.
  • The Hawthorne subjects may have thought that taking part in the study would improve their chances for raises or promotions; the days of usability test participants thinking that participating in studies might help them get jobs are probably over.

What about feedback and learning effects?

We want feedback to be part of a good user interface, don’t we? Yes. And we want people to learn from using an interface, don’t we? Again, yes. But, as Macefield says, let’s make sure that all the feedback and learning in a usability test comes from the UI and not from the researcher/moderator. Get at the cause of problems through qualitative data, such as the verbal protocol from participants’ thinking aloud, to see how they’re thinking about the problem.

Look at effects across tasks or functions

Macefield suggests that if you’re getting grief about the Hawthorne effect, add a control group to compare against and then look at performance across tasks. For example, you might expect that the test group (using an “improved” UI) would be more efficient or effective than a control group in all elements of a test. But it’s possible that the test group did better on one task while both groups had a similar level of problems on a different task. If this happens, it is unlikely that the moderator gave feedback or prompted learning to create the effect of improved performance, because that effect should be global across tasks and across groups.
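As a rough illustration of that cross-task logic, here is a sketch with made-up numbers (they are not Macefield’s data): two groups of eight participants, two tasks, and a comparison of success rates per task.

# Hypothetical success counts for a test group (improved UI) and a control group.
# Eight participants per group; the tasks and numbers are invented for illustration.
results = {
    # task: (successes in test group, successes in control group)
    "create account":  (7, 4),   # the improved UI seems to help here
    "change password": (5, 5),   # both groups struggle about equally here
}
N = 8  # participants per group (hypothetical)

for task, (test_ok, control_ok) in results.items():
    diff = (test_ok - control_ok) / N
    print(f"{task}: test {test_ok}/{N}, control {control_ok}/{N}, difference {diff:+.0%}")

# If the test group only outperforms the control group on some tasks, the gain is hard
# to attribute to moderator feedback or learning, which would be expected to lift the
# observed group's performance across all tasks.

In this invented example, the difference shows up only on one task, which is the pattern that argues against a moderator-induced effect.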

Macefield closes the article with a couple of pages that could be a lesson out of Defense Against the Dark Arts, setting out very specific ways to argue against any assertion that your findings might be “contaminated.” But don’t just zoom to the end of the piece. The value of the article is in knowing the whole story.

Moderating tips and techniques

Getting the right information from the participant can be difficult. As the moderator, you must attend to many things besides what the participant is doing and saying. Focusing on a few specific behaviors of your own will help you run a better test.

Focus your attention on what’s happening now

  • Quickly build rapport with the participant
  • Listen attentively
  • Be open to what might happen in a session – be ready to learn from the participant

Tips for being a better moderator

Be the neutral observer – avoid priming or teaching. If you’re too close to the product or the domain, you may train participants without realizing it by using keywords in your task scenarios or materials.

Observe at the expense of collecting data, if you must. It is difficult to take notes and to watch the participant at the same time. If things are happening quickly or you find yourself missing things the participant is saying or doing, just stop taking notes. Instead, listen and spend time between sessions making notes about what happened. Go through your recordings later if you need to, or ask observers to share their notes.

Play dumb – don’t answer questions. If participants perceive that you are an expert on the product, they may ask you questions about it or look for your approval of their actions. Instead, let them know that you are learning too, and that you’ll note their questions but won’t always be able to answer them.

Flex the script and test plan. Even after you pilot your test, you may have to adjust on the fly when participants do unpredictable things. That’s okay. You’re learning important things that fit into your aggregate patterns of use.

Practice and get feedback. Ask co-workers and observers to give you feedback about how you conduct sessions and how you ask questions.

Your own self-awareness is your best tool for moderating test sessions successfully. Following these guidelines should help you get valid, reliable data from your participants, even if your attention is slightly divided.

Why create a test design?

I get a lot of clients who are in a hurry. They get to a point in their product cycle where they’re supposed to have done some usability activity to exit the development phase they’re in, and now they find they have to scramble to pull it together. How long can it take to arrange and execute a discount usability test, anyway?

Well, to do a usability test right, it does take a few steps. How much time those steps take depends on your situation. Every step in the process is useful.

The steps of a usability test
Jeff Rubin and I think the process for conducting a usability test has these steps:

  1. Develop a test plan
  2. Set up the testing environment and plan logistics
  3. Find and select participants
  4. Prepare test materials
  5. Conduct the sessions
  6. Debrief participants and observers
  7. Analyze data and observations
  8. Create findings and recommendations

Notice that “develop a test plan” and “prepare test materials” are different steps.

It might seem like a shortcut to go directly to scripting the test session without designing the test. But the test plan is a necessary step.
Test plan or test design?
There’s a planning aspect to this deliverable. Why are you testing? Where will you test? What are the basic characteristics of the participants? What’s the timing for the test? For the tasks? What other logistics are involved in making this particular test happen? Do you need bogus data to play with, user IDs, or other props?

To some of us, a test design would be about experiment design. Will you test a hypothesis or is this an exploratory test? What are your research questions? What task scenarios will get you to the answers? Will you compare anything? If so, is it between subjects or within subjects? Will the moderator sit in the testing room or not? What data will you collect and what are you measuring?
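As a small illustration of the between-subjects versus within-subjects question, here is a sketch of how participants might be assigned in each design. The participant IDs and condition names are hypothetical, and the counterbalancing shown is the simplest possible one.

import itertools

participants = ["P1", "P2", "P3", "P4"]
conditions = ["old UI", "new UI"]  # hypothetical comparison

# Between subjects: each participant sees exactly one condition.
between = {p: conditions[i % len(conditions)] for i, p in enumerate(participants)}

# Within subjects: each participant sees both conditions; alternating the order
# (a simple counterbalance) keeps practice effects from favoring one design.
orders = list(itertools.permutations(conditions))
within = {p: list(orders[i % len(orders)]) for i, p in enumerate(participants)}

print(between)  # {'P1': 'old UI', 'P2': 'new UI', ...}
print(within)   # {'P1': ['old UI', 'new UI'], 'P2': ['new UI', 'old UI'], ...}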

It all goes together.

Why not just script the session without writing a plan?
Having a plan that you’ve thought through is always useful. You can use the test plan to get buy-in from stakeholders, too. As a representation of what the study will be, it’s like reviewing the blueprints and renderings before you give the building contractor approval to start building.

With a test plan, you also have a tool for documenting requirements (a frozen test environment, anyone?) for the test and a set of unambiguous details that define the scope of the test. Here, in a test plan, you define the approach to the research questions. In a session script, you operationalize the research questions. Writing a test plan helps you know what you’re going to collect data about and what you’re going to report on, as well as what the general content of the report will be.

Writing a test plan (or design, or whatever you want to call it) will give you a framework for the test in which a session script will fit. All the other deliverables of a usability test stem from the test plan. If you don’t have a plan, you risk using inappropriate participants and getting unreliable data.

Should you record sessions on video/audio?

Since the beginning of time, the accepted practice among professional usability practitioners has been to record sessions on video. It is something that we tend to do automatically.

There aren’t many obstacles to recording sessions these days. It really only takes a web camera and some relatively inexpensive recording software on the testing PC. (Of course, this assumes that you’re testing software or web sites that run on desktop or laptop computers.)

Recording is inexpensive
The software is pretty easy to use and it doesn’t cause issues with response times or otherwise fool with the user’s experience of using the software or website you’re testing. You get nice, bright colors, picture-in-picture, and you can capture it all digitally. For example, there’s Morae, by TechSmith. (In the interest of full disclosure: I own a license, and I have upgraded to the new version). With Morae, you can capture all sorts of nerdy bits. It’s a good tool.

Even if you decide to use a regular video camera rather than a web cam, or multiple cameras, that technology is cheaper and more accessible all the time. Storage media also is very inexpensive.


But should you record sessions?

Karl Fast on Boxes and Arrows (from August 2002) has a whole treatise on recording usability test sessions: http://www.boxesandarrows.com/view/recording_screen_activity_during_usability_testing. He called it “crucial.” I say Not.

Know why you’re recording
You may want the video recordings for reviewing later or for sharing with a research partner. You may want your boss to sit down and watch the recorded sessions as evidence. Most practitioners will say that they use video recordings as backup to their notes: you could always go back and review the recordings.

Most usability tests have fairly few participants. Say you’re doing a study with 5 to 8 participants. If your notes from so few sessions don’t help you analyze the data, you should work on making better data collection tools for yourself or make it a practice to write notes about what happened immediately following each session. Reviewing recordings is making work for yourself.

But do you actually review the recordings? Rarely. And do people who could not attend the sessions review the recordings later? Again, rarely.

Know how you’re storing recordings and control access to protect the privacy of participants
And let’s consider participant privacy and confidentiality. Digital recordings are easier than ever to manage and archive. However, the longer the recordings hang around your company, the more likely it is that they will a) get lost, b) fall into the wrong hands, or c) be misused in some way. A client once asked me if her company could review a tape of a participant because he was coming in for a job interview. I said absolutely not.

You ask participants to sign a recording waiver that sets out specific purposes of the recording. Someone has to make sure that the waiver is respected. That person is the usability specialist who recorded the session to begin with.

Manage recordings carefully
On the form that you ask study participants to sign giving their permission to record, you should also state in plain language

  • How the recording will be used
  • Who will use the recording
  • How long you (or your company) will store the recording
  • How the recording will be destroyed

But get it approved by your legal department, of course.

There are some good reasons to record sessions on video. There are a lot of good reasons not to. Should you?

Keeping a rolling list of issues throughout a study

Design teams are often in a hurry to get results from usability studies. How do you support them while giving good data and ensuring that the final findings are valid?

One thing I do is to start a list of observations or issues after the first two or three participants. I go over this list with my observers and get them to help me expand or clarify each item. Then we agree on which participants we saw have that particular problem.

As the sessions continue, I add to the list the number of each participant who had the issue and note any variations on each observation.

For example, in a study I’m working on this week, we noted on the first day of testing that

Participants talked about location, but scrolled past the map without interacting with it to get to the search results (the map may not look clickable)

I went back later and added the participant numbers for those we observed doing this:

Participants talked about location, but scrolled past the map without interacting with it to get to the search results (the map may not look clickable) PP, P1, P3

Today, I’ll add more participant numbers. At the end of the study, we’ll have a quick summary of the major issues with a good idea of how many participants had each problem.

There are three things that are “rolling” about the list. First, you’re adding participant numbers for each of the issues as you go along. Second, you’re refining the descriptions of the issues as you learn more from each new participant. Third, you’re adding issues to the list as new things come up (or as you notice things you missed before, or things that had seemed like one-off problems).
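If it helps to see the mechanics, here is a minimal sketch of a rolling issues list kept as structured data. The issue wording and participant IDs echo the example above; any format that lets you add IDs and refine descriptions as you go (a spreadsheet works just as well) would do.

# A rolling issues list: issue description -> participant IDs observed having it.
# Plain dicts keep insertion order in modern Python, so the list stays in the
# order issues were first noticed.
issues = {}

def note_issue(description, participant):
    """Add a participant ID to an issue, creating the issue if it's new."""
    observed = issues.setdefault(description, [])
    if participant not in observed:
        observed.append(participant)

# After the first day of testing (the example from this study):
map_issue = ("Talked about location, but scrolled past the map without "
             "interacting with it (the map may not look clickable)")
for pid in ["PP", "P1", "P3"]:
    note_issue(map_issue, pid)

# Quick summary for the debrief with observers:
for description, participants in issues.items():
    print(f"{description}  [{', '.join(participants)}]  ({len(participants)} participants)")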

I will still go back and tally all of the official data that I collected during each session, so there may be slight differences between these debriefing notes and the final report, but I have found that the rolling issues list and the final reports usually match pretty closely.

Doing the rolling list keeps your observers engaged and informed, helps you cross-check your data later, and gives designers and developers something to work from right away that is fairly reliable.