Distributed Multimodal Interaction Protocol

Proceedings of the International Conference on Multimedia and Human Computer Interaction

Toronto, Ontario, Canada, July 18-19 2013

Paper No. 56


Lucas Stephenson, Anthony Whitehead

Carleton University, HCI Department, Faculty of Engineering and Design

1125 Colonel By Drive, Ottawa, Ontario, Canada

lucas.stephenson@carleton.ca; anthony.whitehead@carleton.ca

Abstract This paper introduces a novel networking protocol, the Distributed Multimodal Interaction Protocol (DMIP), designed to reduce development effort by providing a standard method to define and negotiate interaction modalities between networked, distributed systems, and by pushing application logic and layout to server-side services. To evaluate the protocol, software layers were written to implement the protocol specification. On the endpoint providing the service, an additional layer was created to implement common application functionality. This layer, or service application programming interface (API), was then used to create a variety of applications (section 4.2, Implemented Services). Provided the DMIP software implementation layers were available, minimal effort was required to create software that can interact with remote clients on arbitrary platforms using standardized mode channels. The results imply that DMIP can be used to create remote client interfaces for services using standardized modalities. Moreover, the development effort required to create the DMIP software layers need not be duplicated, and reference implementations are therefore made available.

Keywords: Mobile, Standardization, Accessibility, Distributed, Networking, Interaction Mode

  1. Introduction

The world is filled with computing devices, and the networking of these devices has allowed basic interconnectedness between them. However, using generic devices to directly provide human-centric modes of input and output is currently limited. Put simply, users cannot generally use one non-specific device to interface with another.

No current networking standard allows simple development of applications that use distributed transport of a large and expandable set of input and output modalities. Providing an open standard network protocol will enable more devices to inter-communicate by reducing costs associated with developing externally interactive devices.

An example that has often been demonstrated, but is not widely adopted due to the need for costly proprietary devices, is home automation (Gomez & Paradells, 2010). The concept is a simple one: use any device to control and manipulate the state of various devices in a home; automatically turn off a light downstairs when you fall asleep, pre-heat the oven on the way home from work, change the TV channel from your phone, monitor temperature and security systems from afar. These home automation tasks are all technically feasible, but because a standard for modal device-to-device communication is not available, they are not widely implemented or adopted. Home automation is only one of many scenarios that such a standard would enable.

The availability of an increasing number of input modes on personal devices such as mobile telephones, music players and video game controllers, as well as on specialized input devices such as gaze trackers, motion trackers and brain-computer interface (BCI) devices, means that users have an ever-increasing number of ways of interacting with technology. In addition to supporting many interaction modalities, user-focused interaction design should incorporate feedback to aid the user's understanding of the impact of their actions (Nielsen & Molich, 1990). This work demonstrates and evaluates a novel application-layer network protocol, the Distributed Multimodal Interaction Protocol (DMIP), which aims to provide a standardized method of negotiating and communicating both emerging and existing interaction modes between network-connected devices.

  2. Background

The basic premise of computer networking is to enable computing devices to intercommunicate. What is communicated has largely been computer-formatted data, not high-level abstractions of human knowledge and actions. Communicating inputs allows the direct manipulation of distinct devices, enabling human users to operate computers using their preferred input modalities and devices. Communicating outputs allows feedback and inter-device interactions to be conveyed. Efforts to aggregate some multimodal data exist, but are not sufficient as a general-purpose multimodal I/O standard. We examine some of these related technologies next.

    2.1. Related Technologies

Transporting user interactions over networks is a complex issue preventing the more pervasive spread of human control of computing environments. Several existing technologies attempt to simplify multimodal data and/or provide network connectivity. TUIO (Kaltenbrunner, 2009) is a spatial (touch, motion tracking) input library that aggregates these types of devices behind an API. ROSS (Wu, et al., 2011) provides a development toolkit for managing combinations of specific devices over a network. HCI2 (Shen, et al., 2011) is a tool for aggregating machine-local inputs so that they can be more uniformly processed. EMMA (Johnston, 2009) is an XML standard for encoding multimodal inputs. OpenNI (OpenNI, 2011) is a tool aimed at processing and aggregating motion-based inputs. StreamInput is a project aimed at providing a flexible input API.

TUIO is a network-based system that provides aggregate touch input data to allow easier development of a variety of touch and other related continuous-input systems (Kaltenbrunner, et al., 2005). While this technology is extensible (Kaltenbrunner, 2009) and does approximate some of the intentions of the proposed protocol, it is purely input communication and cannot negotiate the usage of modes; implementations have fixed message expressiveness. DMIP must provide methods of negotiating the use of any input and output mode, allowing devices to self-identify which remote devices they can interact with and how they can interact with them.

ROSS is a development toolkit (Wu, et al., 2011) for tangible media (Ishii & Ullmer, 1997) that provides a platform-independent method of consuming multimodal inputs. ROSS provides an explicit input schema for devices connected to the system and operates over a TCP/IP network (Wu, et al., 2011). While ROSS allows the development of applications that make use of specific devices, the need for an explicit input schema for each node limits the system's ability to aggregate signals. DMIP must negotiate and transfer specific input/output mode data, alleviating the need to develop code for differing combinations of devices.

HCI 2: A Software Framework for Multimodal Human-Computer Interaction Systems describes a modular platform that aims to support the development and research of multimodal systems (Shen, et al., 2011) (Shen & Pantic, 2009). This system provides a method of efficiently aggregating machine-local input signals so that they can be processed. It focuses on performance, is primarily intended as a research platform, and does not address data-type communication. Since DMIP is intended to communicate signals over a distributed network of devices, the two technologies could be combined: the local application would aggregate inputs and determine meaning, and DMIP would communicate that meaning over the network. DMIP must operate over a network and be extensible so as to provide for any communicable interaction mode.

EMMA is a W3C XML standard (Johnston, 2009) designed to alleviate the difficulty of aggregating, and augmenting through annotation, multimodal inputs by providing a standard for communicating them in XML format. EMMA is designed largely with speech as the focus; while DMIP can potentially support both audio streams and text, the overhead of XML means that ideas from EMMA would be most significant in the definition of a standardized voice DMIP channel.

OpenNI is intended to provide an aggregation of processed continuous (primarily 3D vision) inputs (Web-2). It allows applications to be developed based on processed inputs from various modes, but it does not provide a system for transporting these signals. A DMIP channel for actions/interpreted intentions could be created to provide a method of communicating processed signals.

StreamInput is a project to create a cross-platform API to support application development for new sensors. Like the systems above, it aggregates interaction controls but does not provide any transport of the resulting signals. The project is not yet released, but is mentioned here as it may provide insight into the channels that should be standardized for DMIP.

Solutions for distributed interaction collection and transmission are numerous; however, a standardized protocol is lacking, and in particular one oriented around the input modalities commonly used for interaction. DMIP is an effort to address this shortcoming. The lack of a common standard causes applications to be developed in “silos”: each new solution requires redevelopment from the ground up, so the majority of applications are based on proprietary technology, which limits device compatibility and increases software development costs. DMIP is designed to solve these problems by providing a reusable and extendable protocol that can describe client device abilities, negotiate their usage as input to a remote system, and provide feedback (output) from the system.

  3. DMIP Protocol

DMIP provides a lightweight layer for communicating categorized interaction mode data over a TCP/IP network. DMIP operates over TCP, and optionally UDP. Its purpose is to simplify the transport of interaction mode channels. The main benefit is that it simplifies the development of distributed, network-based applications: a developer can focus on behaviours and presentation rather than on networking, on interpreting a wide range of input/output types, or on particular platforms (OS/hardware). Moreover, it allows the device creator to respond to any DMIP-compliant client yet still remain in control of the interaction experience.

DMIP sessions are one-to-one: a client initiates a session with a service. Clients are independent of services, meaning a single client implementation of DMIP can connect to, and provide interactive experiences for, a variety of DMIP services. Below, in figure 1, is a high-level overview of the DMIP protocol in a system.

Fig. 1. High Level DMIP Protocol Workflow

    3.1. DMIP Clients and Services

DMIP clients are typically mobile or otherwise non-static devices, and provide a remote, user-centric interface for a service. Clients initiate DMIP sessions by contacting a known network endpoint over TCP; such endpoints can be conveniently advertised using the Service Location Protocol (SLP) (Guttman, et al., 1999). A session is initiated by the client providing a listing of the input modalities it supports. Because clients are independent of services, a single client implementation may be developed per hardware device/platform. However, client implementations can be augmented with automatic service location, among other options, and multiple clients for one platform are conceivable.
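For illustration, session initiation can be sketched as follows. This is a hypothetical sketch: the text-based "ABILITIES" message format and the class name are assumptions for exposition, not the actual DMIP wire format.

```java
import java.util.List;

// Hypothetical sketch of DMIP session initiation. The "ABILITIES" line
// listing the input modes the client implements is an illustrative
// assumption, not the actual DMIP wire format.
public class DmipHandshake {

    // Build the abilities announcement the client would send over its TCP
    // connection to the service endpoint.
    static String abilitiesMessage(List<String> modes) {
        return "ABILITIES " + String.join(",", modes);
    }

    public static void main(String[] args) {
        List<String> modes = List.of("SingleState2DButton", "Label2D", "Title");
        // In a real client this line would be written to the service's socket.
        System.out.println(abilitiesMessage(modes));
    }
}
```

A client advertising the three channels above would thus open its TCP connection and send this single announcement before any channel data flows.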

A DMIP service application is where developer efforts are intended to be focused. To develop a DMIP service, a developer creates client layouts and behaviours to support different modes of interaction with the hosted application.

    3.2. DMIP Channels

DMIP channels carry mode-formatted data and represent input and/or output data from either endpoint. As shown in figure 2, a client sends a listing of its abilities (the input modes it implements), and the service determines whether it can interact with all, or a subset, of the presented abilities and which channels it will use. Instances of the chosen channels are created, allowing mode-formatted data to flow. This stream continues until the connection times out or is explicitly closed.

Fig. 2 Channel Instance Construction and Lifecycle
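The negotiation step in figure 2 can be sketched as follows. The class and method names are illustrative assumptions, not the reference API: the service accepts the session only if the client's advertised abilities cover every required channel, then adds any optional channels the client also implements.

```java
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of DMIP channel negotiation on the service side.
public class ChannelNegotiation {

    // Returns the set of channels to instantiate, or empty if a required
    // channel is missing from the client's abilities.
    static Optional<Set<String>> negotiate(Set<String> clientAbilities,
                                           Set<String> required,
                                           Set<String> optional) {
        if (!clientAbilities.containsAll(required)) {
            return Optional.empty(); // client cannot support this service
        }
        Set<String> chosen = new TreeSet<>(required);
        // Add any optional channels the client also happens to implement.
        for (String channel : optional) {
            if (clientAbilities.contains(channel)) {
                chosen.add(channel);
            }
        }
        return Optional.of(chosen);
    }
}
```

Instances of the returned channel types would then be constructed, after which mode-formatted data flows as described above.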

A listing of the currently available standard mode channels is provided in table 1. This list is not exhaustive, and there are provisions for the extension of mode channels (Web-1). For the purposes of the evaluation, only the more commonly used input modalities were implemented.

Table 1. DMIP Standardized Channels

Mode Name            Description
SingleState2DButton  A button
Label2D              A label
Range1D              A range slider
Selection            A selection menu, with key-value pairs
Title                A title, representing the caption of the client application
Relative3DStream     32-bit, 3D data
Relative2DStream     32-bit, 2D data
Direction4           Direction data (up/down/left/right)
Chars                A single character, in various encodings
Image2DView          2D image data, in JPEG, PNG or BMP formats

In addition to mode channels, there are meta-channels that allow additional attributes to be attached to channel instances; these are also standardized and extendable. The current standardized meta-channels are TextLabel, Positionable3D and Interactable.
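The way a meta-channel decorates a channel instance can be sketched as follows; the ChannelInstance class and its attribute map are illustrative assumptions, not the reference API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of attaching a meta-channel attribute (e.g. a TextLabel caption)
// to a channel instance; class and field names are hypothetical.
public class MetaChannels {

    static class ChannelInstance {
        final String type;
        final Map<String, String> meta = new HashMap<>();

        ChannelInstance(String type) { this.type = type; }

        // Attach a meta-channel attribute such as TextLabel or Positionable3D.
        void attachMeta(String metaChannel, String value) {
            meta.put(metaChannel, value);
        }
    }

    public static void main(String[] args) {
        ChannelInstance back = new ChannelInstance("SingleState2DButton");
        back.attachMeta("TextLabel", "Previous"); // caption shown on the client
        System.out.println(back.meta.get("TextLabel"));
    }
}
```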

To evaluate DMIP’s ability to simplify distributed application development, several example implementation scenarios were created.

  4. Evaluation Implementation

The protocol was implemented to meet two goals: to evaluate the feasibility of an implementation platform, and to evaluate the ease of application development once the platform was made available. Channel types for standardization in the initial release of DMIP were selected through two methods: the native controls for the platforms currently used (.NET WinForms, Android) and an informal interaction survey (Web-3). Survey responses are classified and marked for potential inclusion.

    4.1. Implementation Platform

The first goal was addressed by implementing the protocol and creating a software library capable of sending, receiving and interpreting DMIP messages as well as managing the underlying TCP and UDP connections; this layer is named Connection. Next, Client and Service layers were added. The Client layer creates an instance of the Connection layer, provides the target address, and hosts threads for processing DMIP messages. The counterpart Service layer was created similarly; however, to support multiple clients and more complex applications, it additionally provides session handling, basic layout definition from XML-format files, basic channel compatibility checking, and helper methods to broadcast DMIP payloads to all connected clients. The Connection and Client layers were initially developed in C# on the .NET platform and then ported to Java for Android devices. The Service layer was also implemented on the .NET platform. Client applications using the Client layer were developed for .NET and Android devices; both implement a number of the DMIP standard channels based on platform capabilities. The channels implemented in each of the two client applications are listed in table 2.

Table 2. Implemented DMIP Channels in Client Applications

Channel              .NET  Android
SingleState2DButton   *      *
Label2D               *      *
Range1D               *
Selection             *      *
Title                 *      *
Relative3DStream      *      *
Relative2DStream      *
Direction4            *
Chars                 *
Image2DView           *      *

The implementation of these layers provides a usable DMIP platform and a programming reference. These implementations provide an API that can be used to implement and augment services and clients, or serve as examples for porting the layers to other platforms. The layers are further described, and available for download, on the web (Web-1) under a permissive MIT license.
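The layering described above can be sketched as follows, using hypothetical names rather than the reference API: a Connection layer that delivers framed DMIP messages, and a Client layer that owns a Connection and hosts a worker thread dispatching incoming messages to application callbacks. A real Connection layer would also manage the underlying TCP and UDP sockets.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Minimal sketch of the Connection/Client layering; names are hypothetical.
public class DmipLayers {

    static class Connection {
        final BlockingQueue<String> inbound = new LinkedBlockingQueue<>();

        // Called by the networking code when a complete DMIP message arrives.
        void deliver(String message) { inbound.add(message); }
    }

    static class Client {
        final Connection connection = new Connection();

        // Start the message-processing thread; messages go to the handler.
        Thread start(Consumer<String> handler) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) handler.accept(connection.inbound.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // clean shutdown
                }
            });
            worker.setDaemon(true);
            worker.start();
            return worker;
        }
    }
}
```

The Service layer would follow the same pattern while additionally tracking one session per connected client.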

    4.2. Implemented Services

Using the .NET Service layer implementation, a number of simple applications were created to determine whether DMIP provides a platform upon which distributed applications can be developed simply. These implementations are available online (Web-1). In this section, the PowerPoint controller application is described.

The service application is created as a .NET PowerPoint Add-In and makes use of Visual Studio Tools for Office (Web-3). An XML file defines a standard layout for all clients; it provides positioning, interaction and text label information for the relevant channel instances, and specifies which channel types are required or optionally used by connecting clients. The Label2D and SingleState2DButton channels are required, and enable the client to display forward/back buttons and to view the slide notes. Optionally, the service supports Direction4 to enable arrow-key navigation, Title to provide a title caption for the client application, and Image2DView to enable a view of the current, previous and next slides in the PowerPoint slide deck. When a compatible client connects, it is provided with an interface generated by the service. Below, in figure 3, is a view of a WinForms DMIP client connected to a DMIP service designed to allow control of a PowerPoint presentation.

Fig. 3. PowerPoint DMIP Service
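A layout file of the kind used by this service might look as follows. This is a hypothetical sketch: the element and attribute names are assumptions for illustration, not the reference schema, which is described with the implementation (Web-1).

```xml
<!-- Hypothetical layout for the PowerPoint controller service. -->
<layout title="PowerPoint Remote">
  <channel type="SingleState2DButton" required="true">
    <textLabel>Previous</textLabel>
    <position x="0.1" y="0.8" />
  </channel>
  <channel type="SingleState2DButton" required="true">
    <textLabel>Next</textLabel>
    <position x="0.6" y="0.8" />
  </channel>
  <channel type="Label2D" required="true">
    <textLabel>Slide notes</textLabel>
    <position x="0.1" y="0.5" />
  </channel>
  <channel type="Direction4" required="false" />
  <channel type="Title" required="false" />
  <channel type="Image2DView" required="false" />
</layout>
```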

The network sequence diagram below, in figure 4, shows the network messages used for the “Previous” button within the application.

Fig. 4. Network Sequence for Previous Button

  5. Discussion

DMIP currently provides a simplified development path that enables networked devices to utilize network-based services. Standardization of additional channels will make the resultant applications more appealing to end users.

Another potential improvement is security: while the underlying transport layer could provide connection encryption (Dierks & Rescorla, 2008), it might be valuable to encrypt specific channel instances and to handle security measures within the protocol itself, so that users can view the security information.
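One way the transport-level encryption mentioned above could be realized today in the Java client layer is by wrapping the DMIP TCP connection in TLS using the platform's standard socket factory; DMIP itself does not currently specify this, and the host and port below are placeholders.

```java
import java.io.IOException;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Sketch: establish a TLS-protected transport for a DMIP session using the
// JVM's default SSLSocketFactory. DMIP does not yet standardize this step.
public class TlsTransport {

    static SSLSocket open(String host, int port) throws IOException {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        SSLSocket socket = (SSLSocket) factory.createSocket(host, port);
        socket.startHandshake(); // complete the TLS handshake before DMIP traffic
        return socket;
    }
}
```

The DMIP Connection layer could then read and write its messages over this socket's streams unchanged.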

Developing software that makes use of distributed clients is much simplified by removing concerns about each client’s operating system and hardware abilities, and by simplifying network programming. DMIP can enable ubiquitous access to technology by enabling developers to simply create cross-platform interfaces for distributed devices. DMIP implementations and source code examples are available online (Web-1).

References

de Sales, L. M., Almeida, H. O. & Perkusich, A., 2008. On the performance of TCP, UDP and DCCP over 802.11 g networks. Ceará, Brazil, s.n.

Dierks, T. & Rescorla, E., 2008. The Transport Layer Security (TLS) Protocol, Version 1.2. [Online]
Available at:
https://tools.ietf.org/html/rfc5246
[Accessed 2 February 2013].

Gomez, C. & Paradells, J., 2010. Wireless home automation networks: A survey of architectures and technologies. Communications Magazine, IEEE, 48(6), pp. 92-101.

Guttman, E., Perkins, C. & Kempf, J., 1999. Service Templates and Service: Schemes. [Online]
Available at:
http://tools.ietf.org/html/rfc2609

Guttman, E., Perkins, C., Veizades, J. & Day, M., 1999. Service Location Protocol, Version 2. [Online]
Available at:
http://tools.ietf.org/html/rfc2608

Ishii, H. & Ullmer, B., 1997. Tangible bits: towards seamless interfaces between people, bits and atoms. Atlanta, Georgia, USA, s.n.

Johansson, N., Kihl, M. & Körner, U., 2003. TCP/IP over the Bluetooth Wireless Ad-hoc Network, Lund, Sweden: Department of Communication Systems, Lund University.

Johnston, M., 2009. Building multimodal applications with EMMA. Beijing, China, s.n., pp. 47-54.

Kaltenbrunner, M., 2009. reacTIVision and TUIO: A Tangible Tabletop Toolkit. Banff, Canada, s.n.

Kaltenbrunner, M., Bovermann, T., Bencina, R. & Costanza, E., 2005. TUIO - A Protocol for Table-Top Tangible User Interfaces. Vannes, France, s.n.

Nielsen, J. & Molich, R., 1990. Heuristic evaluation of user interfaces. Seattle, WA, ACM New York, NY, USA, pp. 249-256.

OpenNI, 2011. OpenNI: Introduction. [Online]
Available at:
http://openni.org/Documentation/Reference/introduction.html
[Accessed 14 September 2012].

Shen, J. & Pantic, M., 2009. A software framework for multimodal human-computer interaction systems. San Antonio, TX, USA, s.n.

Shen, J., Wenzhe, S. & Pantic, M., 2011. HCI 2 Workbench: A development tool for multimodal human-computer interaction systems. Santa Barbara, CA, s.n.

Stevens, W. R., 1994. TCP/IP Illustrated: the protocols. Boston, Massachusetts: Addison-Wesley Professional.

Wu, A., Jog, J., Mendenhall, S. & Mazalek, A., 2011. A Framework Interweaving Tangible Objects, Surfaces and Spaces. Orlando Florida, USA, s.n.

Web sites:

Web-1: http://iv.csit.carleton.ca/~dmip/ consulted 28 Feb. 2013

Web-2: http://openni.org/Documentation/Reference/introduction.html consulted 14 Sept. 2012

Web-3: http://msdn.microsoft.com/en-us/library/d2tx7z6d.aspx consulted 21 Jan. 2013
