John Bowers*, James Pycock**, Jon O'Brien***
This paper attempts to subject such claims to preliminary examination by giving a characterisation of what CVEs are like as environments for cooperative work and social interaction and seeing how ordinary conversational mechanisms are exploited or transformed in such environments. To these ends we employ empirical techniques derived from Conversation Analysis or CA [e.g. 14], which to our knowledge has not been attempted before in studying CVEs or other VR technologies, though CA has had some influence in HCI and CSCW in enabling detailed studies of interaction in the workplace [10], the impacts of new computer based technologies on 'talk at work' [8] as well as in motivating technical design choices [3] and assisting in the analysis of the design process itself [4]. Accordingly, we seek to add to this literature while extending it to the study of a novel setting (a work-related meeting being conducted in VR).
Finally, we are concerned to show how methods of interaction analysis might contribute to the evaluation and hence future requirements of CVEs. User-oriented evaluative studies of VR systems are still overwhelmingly dominated by investigations of such matters as motion sickness [13] and the characterisation of phenomena in terms of individual perceptual psychology [15]. We wish to study CVEs so as to extend the base in reference to which VR systems should be evaluated and developed. Through characterising the nature of social interaction in a CVE at least in a preliminary fashion, we hope to make a start on the task of exploring the worth of distributed VR technologies for the support of cooperative work and social interaction.
The MASSIVE system [9] that was used for the virtual meeting supports multi-user interaction between distributed sites allowing participants to communicate over graphical, textual and audio media. The graphical interface provides a navigable 3D view of the shared virtual world and of other participants represented as simple graphical embodiments. For the current meeting the 3D view was presented on screen rather than immersively. The audio interface allows real-time conversation. The text interface provides a 2D plan view of a world and allows the exchange of text messages. MASSIVE employs client-server information distribution. The hardware used was exclusively Silicon Graphics for the interface clients with a Sun Sparc 10/51 at Nottingham University in the UK running server software.
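The client-server distribution pattern described above can be sketched as follows. This is a minimal illustrative in-process model of the idea (clients publish embodiment state to a server, which relays it to peers); the class and method names are our own invention, not MASSIVE's actual implementation.

```python
# Hypothetical sketch of client-server state distribution: each client
# reports its embodiment's state to a central server, which relays it to
# every other connected client so their 3D views can be updated.

class Server:
    def __init__(self):
        self.clients = []

    def connect(self, client):
        self.clients.append(client)

    def broadcast(self, sender, state):
        # Relay a state update to every client except its originator.
        for client in self.clients:
            if client is not sender:
                client.receive(sender.name, state)

class Client:
    def __init__(self, name, server):
        self.name = name
        self.peers = {}      # last known state of the other embodiments
        self.server = server
        server.connect(self)

    def move(self, x, y, z):
        # A local movement is published so peers can update their views.
        self.server.broadcast(self, (x, y, z))

    def receive(self, peer_name, state):
        self.peers[peer_name] = state

server = Server()
sb = Client("SB", server)
cg = Client("CG", server)
sb.move(1.0, 0.0, 2.0)
print(cg.peers["SB"])   # (1.0, 0.0, 2.0)
```

A real system would of course distribute this over a network and handle interest management and failure, but the relay pattern is the same.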

The user-embodiments or 'blockies' are made of simple 3D box-polygons with one square 'eye' on one vertical surface and the user's name 'suspended' above the top surface. As well as identity, this design affords a rudimentary sense of 'face', 'front' and 'back' which according to Goffman [6] are features of the human body of basic significance to social interaction, enabling us to distinguish, for example, talking to someone's face from talking behind their back. Furthermore, the 3D view that a user has can relate to the blockie's face. Although different views are possible, the default is that one 'sees out of' the blockie's eye. This view, or one where one looks over the blockie's shoulders, is the one typically employed by users. In this way, what other participants can see and where they are looking is often available from an inspection of their bodily orientation and, accordingly, a sense of mutual awareness can be sustained and transformed by aligning the blockies or moving them around. For example, 'full face encounters' [6] can be brought about by two participants aligning their blockies to face each other. Indeed, it has been precisely a consideration of the minimal geometrical object necessary to sustain basic interactional relations between participants which has informed the design of the blockies [1]. Insisting that the embodiments should be geometrically simple (yet still have interactional potential) is necessary because of the extreme computational complexity of distributed VR systems based on current technology.
The blockies also support minimal gesturing. They have 'ears' which can be 'flapped' in different ways (left one raised, right raised etc.). They can recline (or 'sleep'). This can be used, for example, to denote that the participant the blockie corresponds to has currently left their local machine and is not available for interaction. Gestures are controlled through simple key sequences and the blockie is moved by clicking the mouse on the 3D view or by using the arrow keys. Finally, the blockies have a 'mouth' which opens when a user's speech exceeds a certain amplitude threshold.
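The amplitude-triggered mouth can be sketched in a few lines. The threshold value and function name below are invented for illustration; only the mechanism (mouth drawn open while speech amplitude exceeds a fixed threshold) comes from the description above.

```python
# Illustrative sketch of the blockie's amplitude-triggered mouth: the
# mouth is shown open whenever the user's current speech amplitude
# exceeds a fixed threshold. Threshold and scale are hypothetical.

MOUTH_THRESHOLD = 0.2  # assumed amplitude threshold on a 0.0-1.0 scale

def mouth_open(amplitude: float) -> bool:
    """Return True when the embodiment's mouth should be drawn open."""
    return amplitude > MOUTH_THRESHOLD

# Silence, background hiss, then speech, on a normalised amplitude scale:
samples = [0.0, 0.05, 0.6, 0.45, 0.1]
states = [mouth_open(a) for a in samples]
print(states)  # [False, False, True, True, False]
```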
The overall business of the virtual meeting is well described by the following quote from the minutes which one of the participants distributed afterwards: "The time spent in the conferencing software, approximately an hour and a half, can be split up into three distinct periods. To begin with there was a fifteen minute mingling session where people arrived for the meeting and chatted socially ... This time was also spent making sure everybody could communicate with everybody else. At a quarter past two SB called the meeting to order and everybody trooped off to the designated meeting room where a pre-prepared agenda was awaiting on a noticeboard and ... introductions were carried out. CG then gave a tutorial on how to use MASSIVE ... [The items on the agenda were then discussed.] The formal meeting finished at about two fifty five, and after that much more informal communication took place." While some of the clients needed to be restarted from time to time, the server software and network connections were reliable throughout.
Before we turn to some examples from our data, it is necessary to clarify the transcript conventions we have adopted. We employ an adapted version of the conventions devised by Jefferson and presented in a number of sources [e.g. 14]. Pauses and silences are notated by their length in seconds shown within round brackets: (1.2). Talk which receives more emphasis than the surrounding speech is underlined:

someone else's turn.

Parts of the transcript where we are unsure of what is said, but are able to guess, are notated with round brackets:

thanks lennart (how) very eloquent.

Where we are unable to guess, the round brackets are empty. Prolonged sounds are indicated by inserting colons, and concatenated speech, where the words are quickly run together, is notated by hyphens placed between the words:

an::d this wonderful VR system-MASSIVE-that-we're-using is wot i writ.

An audible in-breath is notated .hhh and hhh. denotes an audible out-breath. Overlapping speech is notated by means of square brackets positioned to indicate where the overlaps occur. Speech which is distorted by, say, some malfunctioning of the audio link is placed within curly brackets:

i'm john {b}owers from manchester {uni}versity.

Our comments within the transcript are enclosed within angled brackets: <laughter>. We shall discuss our conventions for transcribing the movements of the blockies later.
(ahh lennart can you hear us?)

or request

(go on dave)

to a specific, perhaps explicitly named, participant and (ii) turns which do not contain next-selecting components:

(i'll go next).

AB's turn itself contains no components which project who is to speak next after him. He introduces himself and then stops. At such moments, it is open either for the other participants to select themselves as next-to-speak or, for that matter, for AB to continue speaking, prolonging his current turn.




anyone else?

(Examples 3 and 4) or

someone else's turn

(Example 1), leaving it up to whoever is next to speak to select themselves. Such components are followed by quite lengthy silences. Indeed, in Example 1, after just over a second's silence, SB engages in some 'side talk', thanking LF in an ironically humorous fashion for his prior self-introduction. Another twelve seconds of silence follow before CG starts talking. In Example 2, SB and CG exchange some humorous remarks before SB utters

excellent

which can be heard as closing his exchange with CG. The three-second pause that follows is then broken by AB selecting himself. In all of these examples, speakers only self-select after quite lengthy silences and with much preparatory activity (e.g. mouth clicks, in-breaths, protracted sounds, stammerings and so forth) or, as in Example 1, after a minimal turn from CG

(i'll 'ave a go then)

which requires confirmation from SB

(yeah please do)

before CG continues.
This preparatory activity is often quite exaggerated, as in Example 2, where a mouth click and an audible in-breath are heard from AB and then, after a pause of just over two seconds, there follows a second mouth click and another audible in-breath running into a vocalization transcribed as

woahruh

before AB explicitly self-selects. The interactional significance of such preparatory activity is worth noting. Audible in-breaths and the rest do not explicitly or fully claim a turn at talk in and of themselves. They could be interrupted by another participant immediately launching into a turn without such activity. Accordingly, such preparatory activity displays a participant's readiness to contribute as next-to-speak, without disqualifying others. Indeed, in Example 2, even after he self-selects, AB pauses very briefly (notated by (.)) after

i'll go next

and for about six tenths of a second after

then if no one else is speaking.

These are further junctures where another participant could have self-selected and claimed the floor to introduce themselves ahead of AB. These features of the examples suggest that self-selected turns are managed with considerable care by speakers - a matter borne out by the fact that AB in Example 2 is explicitly attentive to the possibility that others may speak ahead of him. The exaggeration of preparatory components in the virtual meeting is, we suggest, a means for managing turn taking at moments which can be problematic where, for example, in the absence of explicit next-selection, a number of speakers could start to speak simultaneously. Indeed, Examples 5 and 6 suggest that simultaneous self-selected turns are problematic for the smooth conduct of the virtual meeting and that the presence of audio distortion makes them especially hard to manage.


Note again the presence of various artifacts (e.g. distortion) in JB's talk in Example 4. In the next example, this is particularly intense. SB does not hear DE's

eh:::m:. (0.5) yuh

as preparatory to a turn at all!

Interestingly, DE's first attempt to claim a self-selected turn at talk in this example manifests very similar preparatory features to those we have already seen in Examples 1 to 4. It is unsuccessful presumably because SB does not hear them as having this significance or as being any different from the artifacts, pops, crackles and other background noises that can be heard on the audio channel. DE, having the poorest audio connection, is doubly disadvantaged: first in that his speech is easily masked by others in overlap, secondly in that his routine attempts to anticipate the problems of overlap (e.g. protracting an ehm or uttering a preparatory yup) are not heard as such!
It is important to emphasise that the difficulties with turn taking we have noted cannot simply be reduced to problems with audio quality. We observe substantial silences in examples where there are no (Examples 1 and 2) or few (Examples 3 and 4) audio problems. Indeed the audio connections within the site accommodating SB, CG and AB were of good quality throughout the meeting, yet these provide some of the most notable silences in our transcripts. Hence we argue that the problems of self-selection are exacerbated by but not solely attributable to poor audio quality.
The lengthy silences before speaker-switches, we suggest, reflect the problems of managing self-selection with minimal embodiments which have restricted gestural abilities. In this regard, it is of interest that DE does not attempt to compensate for interactional difficulties by any form of virtual gesture or change of body orientation at any moment in Example 7. The embodiments are very rarely used concurrently to aid speakers in designing their own turns or in eliciting turns from others at such moments (in contrast to the use of gesture and body movement in ordinary co-present conversation, see [7, 11], or, for that matter, as reported in the videoconferencing literature, see [16]). This is a point we shall return to.



When conversations are technically mediated, technical failure can be inferred as a source of attributable silence. Indeed, in the data we have, technical failure (rather than some socially significant attribution like evasiveness or rudeness) is invariably first considered as accounting for an attributable silence. In Example 11, SB first asks a question about whether another meeting is intended. AB replies to this saying that a future meeting should use the DIVE system. SB continues by asking a question which is hearably directed at two of the current meeting's members who are also developers of DIVE. However, this receives no reply and after a one second silence, SB explicitly names (and aligns his embodiment in the virtual world so as to face) the two the question is addressed to. Again, a long silence follows and AB checks on LF's ability to hear. Another long silence follows, whereupon SB notices and brings to the attention of others that KJ has typed a message in the text window saying he cannot hear anything. The point of this example is that when next-to-speak has been explicitly selected and no reply is heard, this is interpreted here as arising from an inability to hear due to an audio failure.



We remarked above that participants seem to rarely use virtual body movements to aid the design of their own turns, even when constructing turns by means of talk alone is problematic (as in Example 7). Indeed, it is rather rare in the virtual meetings we have studied for people to complement their own turns at talk with any concurrent movement of their embodiment. This may not strike one as surprising as talking down a potentially troublesome audio channel may be difficult enough without having to engage in simultaneous mouse movements to get one's embodiment to move! However, it presents a stark contrast with ordinary talk where a whole array of body and facial movements, gazings and changes in overall deportment can accompany and aid the design of turns at talk [7, 11]. Example 14 transcribes the body movement from Example 1. Here, CG raises and lowers the ears on his embodiment. These gestures span his breaking of the long silence we have already noted and aid the construction of his self-selected turn. This is however the only example we have yet found of gesture being used to aid the design of a speaker's concurrent turn. (It also introduces our transcript conventions for showing movement. The beginning and end of the movement are shown underneath the concurrent talk and described in italics on a line after that.)

Rotational movements are transcribed with a symbol of the form *----*, the length and the position of the symbol corresponding to the analogous position in the talk above it. We transcribe translation movements (xyz-displacements) with ^-----^. A verbal description of each movement is given just below each movement-transcription. A period (.) at the beginning of a line is used to match up lines of transcribed movement where no body movement occurred with the corresponding line of talk.
It is notoriously difficult to adequately and clearly transcribe body movement [8, 11] but we hope our conventions will become clear as we now explicate this example. Early in the example, AC turns towards AB just while LF utters

fahlén.

He then stays facing AB for the rest of LF's turn. A 0.6 second pause follows, at the very start of which AC begins to turn back towards SB, continuing this movement over a brief

uh hum

from SB and stopping the movement when LF begins to hesitate

(er:)

in his next turn. For his part, reciprocally, AB turns towards AC, again beginning the movement at a hesitancy in LF's turn (the initial uhm:). Immediately once AC has finished turning back towards SB and started to move away from the group, AB also turns back towards SB, starting this movement during LF's

er:.

AB makes a further movement towards SB while (again) LF is hesitating and pausing with

er:m (1.2).

SB then makes a movement towards LF which starts during a 0.6 second pause in LF's talk. Following the start of SB's movement, LF continues talking with

and um:: i've been involved in these things for a long time.

When LF begins to utter

for a long time,

SB reciprocates AB's slightly earlier turn towards SB, before finally returning towards LF, initiating this movement again during a one second pause in LF's talk.

Thus, we see systematic ways in which participants try to resolve or anticipate turn-taking problems, the elementary coordination of body movements between participants, the coordination of movements with ongoing speech, the utilisation of the bodies to engage others and initiate talk, amongst other phenomena. While, of course, we have concentrated on data from just one virtual meeting and it must be acknowledged that more meetings and more examples are required to develop yet more convincing generalisations, it is fascinating that what we have noted so far in the virtual world is in some way familiar.
Though familiar, this is not to say that improvements should not be made to systems such as MASSIVE to enhance their abilities to support multi-party collaborative activity in a virtual environment. Our interaction analytic techniques have been able to highlight problems, some of which may be amenable to technical solution or assistance. Let us discuss three classes of possibilities.
(1) We noted that it is possible for many technical failures to pass unnoticed for some time, simply because those moments which make them clear (a sufferer of technical failures being selected as next-to-speak, yet not responding) may not have arisen in the conduct of the meeting. This suggests to us that CVEs should support local troubleshooting because it may be very hard to bring to the attention of others that one is experiencing a local failure. This has implications for the overall distributed architecture that a system might exploit. An architecture must be not only robust in the face of local failure but it must also support graceful distributed recovery from local failures. A local failure must be remediable at that site and not require initiation from a remote site where participants may be unaware of the failure. Additionally, it should be possible to bring about such recovery by means within the expertise of any participant. These are actually quite demanding requirements and we believe they follow from our observations of how the structure of technically-mediated social interaction may lead to problems identifying ongoing failures in distributed systems such as MASSIVE.
(2) The overall design of virtual worlds should be considered in terms of how they afford social interaction and not just in terms, say, of their navigability, capability for presenting masses of information, or their thrilling aesthetics. The kinds of objects that we insert into a virtual world should be selected and designed with social interaction in mind. For example, a meeting table may be a simple device for people to gather around while affording them means for coordinating their talk, views of each other and mutual bodily orientation. Indeed, such a device may have aided our participants in solving some (not all) of their turn-taking problems by suggesting a 'round the table' sequence for talk. Quite simple devices (e.g. a table as polygon on the base plane) may often be the most important from a social interactional standpoint yet their inclusion in the virtual world is easy to forget. Although the MASSIVE system can support a variety of 'meeting furniture', only a noticeboard was included on this occasion.
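The 'round the table' affordance suggested above can be made concrete with a small sketch. This is our own illustrative geometry, not anything implemented in MASSIVE: seats are placed at equal angles around a circular table, and the seating order doubles as a default sequence for self-selection.

```python
# Hypothetical sketch of a meeting table that affords a 'round the
# table' speaking sequence: equal-angle seat placement plus a simple
# next-speaker rule derived from the seating order.

import math

def seat_positions(n_seats, radius=2.0):
    """Place n_seats at equal angles around a circular table at the origin."""
    positions = []
    for i in range(n_seats):
        angle = 2 * math.pi * i / n_seats
        positions.append((radius * math.cos(angle), radius * math.sin(angle)))
    return positions

def next_speaker(seating_order, current):
    """Suggest the next speaker by going clockwise round the table."""
    i = seating_order.index(current)
    return seating_order[(i + 1) % len(seating_order)]

participants = ["SB", "CG", "AB", "LF"]
seats = dict(zip(participants, seat_positions(len(participants))))
print(next_speaker(participants, "LF"))  # SB
```

Such a rule would not replace conversational turn taking, of course; the point is only that the table's geometry makes one workable sequence visibly available to all.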
(3) While the blockies do afford various social interactional phenomena (and are not merely navigational aids or simple interface devices), it is worth reminding ourselves of the subtlety of interaction and participation which is much more readily possible with a real human body. The hands, the arms, the head, the neck, the torso permit a number of different orientations with respect to each other as well as with respect to co-interactants' bodies. In this way, we can glance without moving our heads, or turn our heads without moving the rest of our bodies. Importantly, the coordinated flexibility of our eyes and heads enable us to look around without turning our backs on anyone. This, together with all the other kinds of embodied distinctions which are available for investing with interactional significance, is not available to the blockies. The only way for a blockie to 'glance' is by changing its whole bodily orientation. Accordingly, the blockies are considerably constrained in just how they can display their attentiveness to others and also in just how they can gesture or engage in whole body movements to aid the design of their own turns or to partake in a finely co-ordinated stream of talk. What perhaps is remarkable is that the minimal embodiments offer any interactional affordances at all. Nevertheless, introducing articulations (in the physical sense!) to the embodiments does seem to be worthwhile: e.g. so that a 'looking-around' can be distinguished from a 'turning-away'. Current VR systems essentially treat action in a virtual world as a matter of navigation or object manipulation. However, in CVEs where participants are interacting with one another, perhaps one should consider the direct support of actions of a 'higher-order' than mere movement, actions of social interactional significance (like approaches, turnings, glances and maybe some under collaborative control like 'form a circle' and so forth). 
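A higher-order action such as 'form a circle' could be sketched as follows. The function below is purely illustrative (no such action exists in MASSIVE): for each participant it computes a position on a circle and a heading that turns the blockie to face the centre, so that the formation also yields mutual facing.

```python
# Illustrative sketch of a collaborative 'form a circle' action: each
# embodiment is assigned a target position on the circle and a heading
# (in radians) that faces it towards the circle's centre.

import math

def form_a_circle(names, radius=3.0):
    targets = {}
    for i, name in enumerate(names):
        angle = 2 * math.pi * i / len(names)
        x, y = radius * math.cos(angle), radius * math.sin(angle)
        heading = math.atan2(0 - y, 0 - x)  # turn to face the centre (0, 0)
        targets[name] = ((x, y), heading)
    return targets

targets = form_a_circle(["SB", "CG", "AB", "DE"])
```

A CVE supporting such actions would then animate each embodiment towards its target, rather than leaving every participant to steer there by hand.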
In future work, we wish to study whether such higher-order actions can be sensibly added to the repertoire of interaction techniques available to the blockies and participants who 'inhabit' them. In this way, we may also be able to help users employ gestures to aid the concurrent construction of their turns - something which is currently problematic.
Of course, adding any further complexity to the blockies has to be reckoned with in the light of technical issues such as computational and network-transmission performance. We feel, though, that a viable and systematic research strategy for developing useful CVEs is to incrementally add further sophistication to very simple embodiments as and when analysis reveals that it is called for in the support of social interaction. Interestingly, this goes against the grain of many VR research trajectories which are devoted towards photorealistic body renderings and whole body movement detection. But, unless the social interactional significance of the body is understood, such developments may be not only unduly computationally expensive (especially when one considers distributed collaborative VR systems) but also lacking in social scientific motivation.