H. Rex Hartson, José C. Castillo, and John Kelso
Department of Computer Science
Jonathan Kamler
Department of Forestry
Virginia Tech
Blacksburg, VA 24061
Tel: +1-540-231-4857
Email: hartson@vt.edu
Wayne C. Neale
Eastman Kodak Company
901 Elmgrove Road
Rochester, NY 14653-5800
Tel: +1-716-726-0953
Email: neale@pixel.kodak.com
These barriers have led us to consider methods for remote usability evaluation wherein the evaluator, performing observation and analysis, is separated in space and/or time from the user. The network itself serves as a bridge to take interface evaluation to a broad range of networked users, in their natural work settings.
Several types of remote evaluation are defined and described in terms of their advantages and disadvantages to usability testing. The initial results of two case studies show potential for remote evaluation. Remote evaluation using video teleconferencing uses the network as a mechanism to transport video data in real time, so that the observer can evaluate user interfaces in remote locations as they are being used. Semi-instrumented remote evaluation is based on critical incident gathering by the user within the normal work context. Additionally, both methods can take advantage of automating data collection through questionnaires and instrumented applications.
These barriers to usability evaluation have led us to extend the concept of formative evaluation beyond the usability laboratory to methods for remote usability evaluation, typically employing the network itself as a bridge to take interface evaluation to a broad range of network users, in their natural work settings.
In the following sections we briefly survey several kinds of remote evaluation as distinguished by the methods used and data collected. In brief case studies, we then explore two approaches specific to our own work, concluding with a glimpse at possible future directions for this work.
Approaches based on remote questionnaires have the advantage that they capture remote user reactions while they are fresh, but they are limited to subjective data based on questions pre-written by developers or evaluators. Thus, many of the qualitative data normally acquired in laboratory testing, i.e., the data directly useful in identifying specific usability problems, are lost.
The semi-instrumented approach has potential for cost-effectiveness, since the user and the system gather the data and evaluators look only at data that relate to usability problems. However, its reliance on users with only minimal training to identify critical incidents during their performance of on-the-job tasks makes it a technique of as yet unproven effectiveness.
| Evaluation Process | Network/Automation Support |
|---|---|
| Defining protocol and method of data collection (distributed contributors) | Network and CSCW tools (email, WWW, groupware) |
| Document sharing (questionnaires, protocol, data, report, etc.) | Network and CSCW tools (email, WWW, groupware), desktop video conferencing |
| Subject profile questionnaire | Digital questionnaire |
| Training, instructions, set-up | Network, digital questionnaire, desktop video conferencing |
| Direct observation | Desktop video conferencing |
| Post-test questionnaires | Digital questionnaire |
| Debriefing | Desktop video conferencing |
| Distribution of results/final report | Network and CSCW tools (email, WWW, groupware), desktop video conferencing |
Four other participants' monitors were observed through desktop video conferencing software from a remote location (remote condition). Pilot tests were performed between Kodak (Rochester, NY) and Virginia Tech (Blacksburg, VA) to test the feasibility of the remote condition. In the remote condition for the main part of the study, the evaluator was located in a remote office within Kodak. The screen output was captured by a scan converter and a video digitization board, and the digital video was sent via the Internet to the remote evaluator by the desktop video conferencing software, at an average frame rate of fifteen frames per second. A telephone connection was used to transmit each user's voice in this condition.
The participants in the lab condition were given printed paper questionnaires while the participants in the remote condition were given on-line questionnaires. The content of the questions was identical. The on-line questionnaire, however, was designed to format and export the quantitative data in specific file formats (e.g., Microsoft Excel(TM), StatView(TM), Superanova(TM)) for statistical analysis. The qualitative data of the questionnaire (i.e., open-ended questions) were entered into text fields and saved as word processor files. The questionnaire was sent to the remote machine via the Internet, and the exported files were sent back to the remote evaluator.
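The export step might be sketched as follows. The function and field names here are our own illustration, not those of the actual questionnaire tool, which exported directly to the formats named above:

```python
import csv
import io

def export_responses(responses, quant_keys, qual_keys):
    """Split questionnaire responses into a tab-delimited table of
    quantitative answers (importable by a statistics package) and
    plain text of open-ended answers (for a word processor file)."""
    quant = io.StringIO()
    writer = csv.writer(quant, delimiter='\t')
    writer.writerow(['subject'] + quant_keys)
    for r in responses:
        writer.writerow([r['subject']] + [r[k] for k in quant_keys])
    # Open-ended answers are collected as plain text, one per line.
    qual_lines = [f"Subject {r['subject']} - {k}: {r[k]}"
                  for r in responses for k in qual_keys]
    return quant.getvalue(), '\n'.join(qual_lines)
```

Separating the two data streams at export time is what lets the quantitative file go straight into a statistics package while the open-ended responses are read as ordinary text.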
Instead of on-site visitation, travel can now be virtual, saving thousands of dollars. Furthermore, using digital questionnaires yielded efficiency gains in data collection and analysis. Many variables can affect the quality of digital video (e.g., software algorithms, camera, LAN and Internet bandwidth, network traffic, sample rate, processor speed, display size, and video board).
There were times when it was difficult in the remote condition to interpret what the user was doing on the screen. Most of the time the users' verbal comments, in addition to the video window, were sufficient to identify the problem. Mimicking the users' actions with the same application at the evaluator's computer also helped resolve the users' actions that were difficult to see in the desktop video conferencing software window.
Success of the semi-instrumented method of remote usability evaluation depends on the ability of typical users to recognize CIs effectively. This issue was explored in the case study and will be the object of further studies. The current laboratory-based formative usability evaluation practice of having CI identification done by a trained HCI evaluator might lead to skepticism about casting the user in that role. However, critical incidents in human factors experiments, before the CI technique was adapted for HCI [2], were originally identified by the subject during his or her own task performance [3]. Thus, we expect our users with minimal training to be able to recognize CIs, an expectation supported by our initial case study. Our users cannot be expected to be generally trained in HCI, but they can be given minimal training for the specific task of spotting CIs.
The goal of semi-instrumented remote evaluation is to gather qualitative data (e.g., CIs and verbal protocol), rather than quantitative data (e.g., timing or error rates). We consider qualitative data to be the most useful for identifying and fixing usability problems.
For comparison, Figure 2 depicts one view of the "traditional" laboratory-based formative evaluation process. While directly observing user task performance, evaluators produce a list of CIs (at point A) that are later analyzed by evaluators into a usability problem list (at point B).
In semi-instrumented remote evaluation, as shown in Figure 3, users themselves identify CIs during the normal course of on-the-job task performance.
Figure 2. One view of local formative evaluation
Figure 3. Semi-instrumented remote evaluation
Whenever usage difficulty is encountered, the user clicks on a "Report CI" button, a single added object consistently appearing on all screens of the application. The click activates the instrumentation routine (software outside the application) causing the system to store the incident indicator and data about its context (e.g., system, interface, task state information, and video clips or screen image sequences with audio), to be sent later to developers. This package of data, the CI and its context taken together, is called a contextualized critical incident (CCI). Thus, the semi-instrumented evaluation process is a user-triggered CI-reporting mechanism that produces sets of CCIs (at point A'), sent asynchronously via the network to remote evaluators to be later analyzed into a usability problem list (at point B').
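As a minimal sketch of such a user-triggered reporting mechanism (the function and field names are our own illustration, and the queue merely stands in for the asynchronous network channel to remote evaluators):

```python
import json
import queue
import time

# Stands in for the asynchronous network channel to remote evaluators.
outbound = queue.Queue()

def report_ci(task_state, screen_log, clock=time.time):
    """Illustrative handler for the "Report CI" button: bundle the
    incident indicator with data about its context into a CCI package,
    queued to be sent later to remote evaluators."""
    cci = {
        'timestamp': clock(),
        'task_state': task_state,           # system/interface/task state
        'screen_images': screen_log[-10:],  # recent screen captures for context
    }
    outbound.put(json.dumps(cci))
    return cci
```

In a real instrumentation routine the package would also carry video clips or screen image sequences with audio, as described above; the essential point is that the routine lives outside the application and is triggered by a single consistent control.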
The major working hypothesis of semi-instrumented remote evaluation is that it will provide approximately the same quantity and quality of qualitative data that can be obtained from formative laboratory-based usability evaluation.
1. initial case study to explore relevant issues and to find intuitive indications of the validity of our working hypotheses;
2. controlled laboratory-based experiments validating the hypotheses; and
3. field studies conducted in real work environments, accounting for work context factors (noise, interruptions, multi-thread task performance, etc.).
The primary goal of the first step, the case study, was to judge feasibility of the process. As a bottom line, we wanted to see whether this new method could provide approximately the same quantity and quality of qualitative data that can be obtained from laboratory-based formative evaluation. Specifically, we wished to see if the list of usability problems at B' in Figure 3 is a good approximation to the list at B in Figure 2. Finally, we wanted to appraise the likelihood of achieving all these goals in a way that is cost-effective for both developer (e.g., minimal resources for data collection and analysis) and user (e.g., minimal interference with work).
For the case study the goals stated in the previous section translate into two research questions related to the hypotheses of the semi-instrumented remote approach:
Question 1: Can user subjects identify CIs approximately as well as the expert subjects?
Question 2: Can expert subjects transform contextualized CIs from user subjects into a usability problem list approximately as well as they can produce a usability problem list in the normal lab-based approach?
2. Sessions of user subjects performing tasks and simultaneously identifying CIs were videotaped. The test application was a program for viewing and manipulating images from a digital still camera.
3. A panel of three expert subjects viewed the tapes to detect any CIs missed by the user subjects.
4. The tape sets were edited into sets of CCI packages.
5. Two expert subjects (different ones) were asked to convert the CCI sets into usability problem lists.
The results of Step 3 provided answers to Question 1 and results of Step 5 gave answers to Question 2.
In this case study, however, we wished to make direct comparisons of exactly the same CIs, between expert subjects in the "normal" case and user subjects in the remote case. This meant the experts had to view tapes of user subjects made in the remote evaluation case. Thus, the experts also saw the users identifying their own CIs. This aspect of user subject behavior could not be ignored by the expert subjects and could not be edited from the tapes without gaps or "glitches" in the tape, thus revealing the presence of CIs. We could have masked this effect somewhat by introducing additional "decoy" glitches, but we felt that line of reasoning would divert us from the goal at hand. Thus, we settled on using the expert subjects to judge user subject performance in CI identification. We felt this produced results essentially equivalent to a direct comparison, especially for the most important case of CIs not found by the user subjects. In future controlled experiments we plan to compare user and expert subject performance directly, without tapes, but this will require large numbers of subjects and CIs in order to establish statistical significance.
Two tapes were made simultaneously during each task performance. One video camera was used to record users in the normal way, including audio of user comments. The second tape captured screen activity via a scan converter connected to the computer monitor. The experimenters set up the hardware and software, and the subjects were in a controlled setting in a laboratory, with no interruptions.
To address Question 1, the panel of expert subjects viewed these unedited tapes on a pair of monitors, watching user subjects identifying their own CIs as they performed tasks, especially looking for CIs the user subjects failed to identify.
Anything considered by a user subject to be a CI was deemed so, by definition. Thus, we did not look for any "false positive" identifications.
Incidentally, we found that IDEAL [5], a tool designed to support laboratory-based formative user interface evaluation, was also useful in supporting evaluation methods research. IDEAL provided controls for marking, viewing, editing, and synchronizing the videotapes. IDEAL also supported marking tapes where CIs begin and end, allowing rapid wind/rewind to view and analyze a given CI. CIs could also be annotated in IDEAL by any of the case study participants.
As a substitute for entering text in a dialogue box (for example) to describe a CI, user subjects gave a verbal description that was captured on an audio track of the videotape. Verbal descriptions (which use a non-visual output channel) did not interfere, to the extent that typing would, with task performance during capture and observation during evaluation.
By informal experimentation we determined to use a 60 second video clip centered around the CI identification. This interval provided economical coverage for most of the data in this study. However, the tradeoff between bandwidth requirements to transmit clips on the network and richness of context will be the subject of a future study.
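Assuming the clip is split symmetrically around the identification (the actual pre/post split is one of the tradeoffs left to future study), the clip boundaries can be computed as:

```python
def clip_window(ci_time, clip_len=60.0, session_end=None):
    """Return (start, end) in seconds of a video clip centered on the
    moment the user identified the CI, clamped to the recorded session."""
    start = max(0.0, ci_time - clip_len / 2)
    end = start + clip_len
    # If the clip would run past the end of the recording, slide it back.
    if session_end is not None and end > session_end:
        end = session_end
        start = max(0.0, end - clip_len)
    return start, end
```

Clamping matters for CIs identified near the start or end of a session, where a naive centered window would reference video that does not exist.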
Question 2 was approximated by the following question: How easily can expert subjects turn CCI sets into usability problem lists? Expert subjects reviewed the CCIs found by user subjects and tried to transform each CCI into one or more specific usability problems, while the experimenters judged how well they did this. In the future, it would be possible to use multiple expert subjects for each user subject and/or to use different expert subjects to judge these results.
Users liked the idea of having an audible mechanism for reporting CIs, which we discovered in pilot testing needed to be distinguishable from all other system sounds. We used a "gong" sound, activated by pressing the space bar, which the users liked because it was a bit like "gonging out" the designer for bad parts of the interface. Users indicated a preference for using the same sound to report both positive and negative CIs, and in the end we found that users rarely, if ever, identified positive critical incidents anyway. Also, user subjects usually reserved gong ringing for cases where task performance was blocked and usually did not ring the gong for situations where they could perform the task but found it awkward or confusing. For example, most users had minor trouble rotating an image, but they all eventually figured it out and none signaled a CI associated with this part of the task. The experts did see this trouble with rotation as a CI, one of the few examples of CIs they recognized but users did not. There was also one case of a CI spotted by an experimenter and missed by both user and expert subjects.
To summarize the results for Question 1, there were very few CIs missed by the users but found by the experts, and the problems missed were ones of less importance.
The tape of the scan-converted screen provided the most valuable data for the experts and the experimenter. The camera on the user was occasionally useful for revealing when the user was struggling or frustrated (e.g., pounding on the gong key).
The principal problem exposed by the study was the need to prompt users continually for the verbal protocol that is so essential in the tape clips for establishing task context to the CIs. Even though users were asked up-front to give a running verbal commentary, they generally did not speak much without prompting. This problem is exacerbated in the case of remote evaluation, since the experimenter will not be present and the user must be self-actuated in this regard. Verbal protocol was essential for the experts to establish for each CI what the task was, what the user was trying to do, and why the user was having trouble.
A second major area where insight was gained involved the question of how to "package" CCIs in the most cost-effective way with respect to network transmission cost (e.g., bandwidth) and usefulness to the developers. The study indicates a need to examine the use of screen capture alone. For remote evaluation, CCI packages are sent over the network from users to developers. Existing digital screen capture programs can overlay audio and text (e.g., for task descriptions) on the screen image, requiring less storage and bandwidth to transmit than continuous video.
One problem with "automatically" packaging CCIs is that different kinds of CIs need different intervals. For example, it is not surprising that we found more pre-gong time is needed to establish clear context for goal-related problems than for action-level problems.
With regard to Question 2, when task information was not given (experts not told what user was trying to do), expert subjects had difficulty guessing what was happening and did not do well in identifying the usability problems associated with the CI. When CI clips were augmented with verbal protocol about intended task and context of where the user was in the task when the CI occurred, the expert subjects were generally able to identify the associated usability problems and design flaws that led to them.
We will also investigate the number of camera views that are necessary for quality observation. A sufficient base of research is needed to guide development of tools that would control and streamline remote data collection. The result of such work would reduce the cycle time for designing interfaces in the overall development process. This work is being continued at Virginia Tech and at Kodak.
Future studies will also focus on how much context to include in a CCI package for the semi-instrumented approach. Typically, users ring the gong some time after the beginning of a CI. The context of a CI continues to develop some time after the gong, as well. Thus, a CI and its context typically span both sides of the gong. We plan empirical studies to determine the best pre-gong and post-gong time intervals for CCI clips.
The design and evaluation of specific user training for identifying CIs is an important focus of our work. Cost-effective on-demand availability of this training to potentially large numbers of users requires a self-paced module. The subject matter is appropriate for multi-media with at least video and sound, delivered on the network (e.g., through the World-Wide Web).
Further, a clear need for a remote method for user prompting emerged as a lesson learned from the studies. As a result of this study we intend to investigate social, psychological, and organizational issues in prompting users at a distance to maintain verbalization. We expect to explore voice and/or dialogue boxes for conveying a request for information about the task and the problem, either when the CI button is pushed (e.g., in the semi-instrumented case) or if the user is inactive for too long.
An additional area of study will focus on the necessity for the user to pause and identify the CI, and the effect that this task interruption might have on usability of the application itself. We also plan to catalog which types of applications work best with remote evaluation, and possibly identify predictor characteristics for new applications. We also hope to determine which of the two methods we studied is best suited for collecting data from a given type of application.
The Montgomery County, Virginia, school system will serve as a near-term testbed for remote evaluation methods. We, in the Department of Computer Science at Virginia Tech, are working with the public schools to design, implement, and evaluate a software architecture, software tools, and courseware to construct a virtual physics laboratory to support broadly collaborative, highly interactive physical science education for middle school and high school students. The virtual laboratories will be accessed via the Internet and the Blacksburg Electronic Village, a densely interconnected community in Southwestern Virginia. Remote evaluation will be a necessity for this project.
In the longer term, we expect to be working with the U.S. Forest Service and the Bureau of Land Management, who are using an expert system for landscape architecture as part of a very large scale geographical information system. We look forward to the possibility of using remote evaluation methods with literally hundreds of remote users across the country.
2. del Galdo, E. M., Williges, R. C., Williges, B. H., and Wixon, D. R. An Evaluation of Critical Incidents for Software Documentation Design. In Proceedings of Thirtieth Annual Human Factors Society Conference Human Factors Society, Anaheim, CA, 1986, 19-23.
3. Fitts, P. M., and Jones, R. E. "Psychological Aspects of Instrument Display. I: Analysis of 270 'Pilot Error' Experiences in Reading and Interpreting Aircraft Instruments." In Selected Papers on Human Factors in the Design and Use of Control Systems, Sinaiko ed., Dover Publications, Inc., New York, 1947.
4. Hix, D., and Hartson, H. R. Developing User Interfaces: Ensuring Usability Through Product and Process. John Wiley & Sons, Inc., New York, 1993.
5. Hix, D., and Hartson, H. R. IDEAL: An Environment for User-Centered Development of User Interfaces. In Proceedings of EWHCI'94: Fourth East-West International Conference on Human-Computer Interaction (St. Petersburg, Russia, August 2-6), 1994, 195-211.
6. Nolan, P. R. Welcome to Vertical Research. Vertical Research, Inc., P.O. Box 1214, Brookline, MA 02146, USA. (1995). http://www.nolan.com/~pnolan/vertical.html.
7. Siochi, A. C., and Ehrich, R. W. Computer Analysis of User Interfaces Based on Repetition in Transcripts of User Sessions. ACM TOIS. 9, 4 (October 1991), 309-335.
8. Whiteside, J., Bennett, J., and Holtzblatt, K. "Usability Engineering: Our Experience and Evolution." Chapter 36 in Handbook of Human-Computer Interaction. Helander ed., Elsevier North-Holland, Amsterdam, 1988, 791-817.