Publications

We strongly believe in open source and in giving back to our community. We work directly with academic researchers and seek out new perspectives through our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), 2012.

Structuring E-Commerce Inventory

Karin Maugé, Khashayar Rohanimanesh, Jean-David Ruvini

Large e-commerce enterprises feature millions of items entered daily by a large variety of sellers. While some sellers provide rich, structured descriptions of their items, a vast majority of them provide unstructured natural language descriptions.

In this paper, we present a two-step method for structuring items into descriptive properties. The first step consists of unsupervised property discovery and extraction. The second step involves supervised property synonym discovery using a maximum-entropy-based clustering algorithm.

We evaluate our method on a year's worth of e-commerce data and show that it achieves excellent precision with good recall.

ACL 2012: 805-814

CIKM 2012: 596-604

Large-scale Item Categorization for e-Commerce

Dan Shen, Jean-David Ruvini, Badrul Sarwar

This paper studies the problem of leveraging computationally intensive classification algorithms for large-scale text categorization problems. We propose a hierarchical approach that decomposes the classification problem into a coarse-level task and a fine-level task.

A simple yet scalable classifier performs the coarse-level classification, while a more sophisticated model separates classes at the fine level. However, instead of relying on a human-defined hierarchy to decompose the problem, we use a graph algorithm to automatically discover groups of highly similar classes.

As an illustrative example, we apply our approach to real-world industrial data from eBay, a major e-commerce site, where the goal is to classify live items into a large taxonomy of categories.

In such an industrial setting, classification is very challenging due to the number of classes, the amount of training data, the size of the feature space, and the real-world requirements on response time. We demonstrate through extensive experimental evaluation that (1) the proposed hierarchical approach is superior to flat models, and (2) the data-driven extraction of latent groups works significantly better than the existing human-defined hierarchy.
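
The coarse-to-fine decomposition described in this abstract can be illustrated with a minimal Python sketch. The abstract does not say which graph algorithm discovers the class groups, so the sketch assumes one simple possibility: connected components of a graph that links two classes whenever their confusion rate reaches a threshold. The names `group_similar_classes`, `classify`, `confusion`, and `threshold`, and the toy classifiers, are hypothetical and are not taken from the paper.

```python
def group_similar_classes(confusion, threshold):
    """Group classes via connected components of a confusion graph.

    `confusion` maps an unordered class pair (a, b) to a confusion rate.
    Assumption: two classes are "highly similar" when rate >= threshold.
    """
    classes = sorted({c for pair in confusion for c in pair})
    neighbors = {c: set() for c in classes}
    for (a, b), rate in confusion.items():
        if a != b and rate >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    groups, seen = [], set()
    for start in classes:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


def classify(item, coarse, fine):
    """Two-stage prediction: a cheap coarse model picks the group,
    then that group's (possibly costlier) fine model picks the class."""
    group_index = coarse(item)
    return fine[group_index](item)


# Toy confusion rates between leaf categories (illustrative values only).
confusion = {
    ("phones", "phone cases"): 0.30,
    ("phone cases", "chargers"): 0.25,
    ("books", "magazines"): 0.40,
    ("books", "phones"): 0.01,
}
groups = group_similar_classes(confusion, threshold=0.2)

# Stand-in classifiers: simple keyword rules instead of trained models.
coarse = lambda item: 1 if "phone" in item else 0
fine = {
    0: lambda item: "books",
    1: lambda item: "phone cases" if "case" in item else "phones",
}
prediction = classify("leather phone case", coarse, fine)
```

In this sketch the expensive fine-level model only has to separate the few classes inside one group, which is the motivation the abstract gives for the hierarchical decomposition.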

SDM 2012

Multi-Skill Collaborative Teams based on Densest Subgraphs

Atish Das Sarma, Amita Gajewar

We consider the problem of identifying a team of skilled individuals for collaboration, in the presence of a social network, with the goal of maximizing the collaborative compatibility of the team. Each node in the social network is associated with skills, and edge weights specify the affinity between the respective nodes. We measure collaborative compatibility as the density of the subgraph induced by the selected nodes.

This problem is NP-hard even when the team requires individuals of only one skill. We present a 3-approximation algorithm for the single-skill team formulation problem. We show the same approximation can be extended to a special case of multiple skills.

Our problem generalizes the formulation studied by Lappas et al. [KDD ’09] who measure team compatibility in terms of diameter or spanning tree. The experimental results show that the density-based algorithms outperform the diameter-based objective on several metrics.
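
To make the density objective concrete, here is a minimal Python sketch. It assumes the standard densest-subgraph definition of density (total edge weight inside the team divided by team size); the paper's exact definition may differ. The peeling heuristic shown is the well-known greedy method for the unconstrained densest-subgraph problem, not the paper's 3-approximation algorithm, and it ignores skill requirements entirely. The names `induced_density`, `greedy_peel`, and `weights` are hypothetical.

```python
def induced_density(weights, team):
    """Total affinity among team members divided by team size.

    `weights` maps an undirected edge (u, v), stored once, to its affinity.
    """
    team = set(team)
    if not team:
        return 0.0
    total = sum(w for (u, v), w in weights.items() if u in team and v in team)
    return total / len(team)


def greedy_peel(weights, nodes):
    """Greedy peeling: repeatedly drop the node with the lowest weighted
    degree and remember the densest node set seen along the way."""
    remaining = set(nodes)
    best, best_density = set(remaining), induced_density(weights, remaining)
    while len(remaining) > 1:
        degree = {n: 0.0 for n in remaining}
        for (u, v), w in weights.items():
            if u in remaining and v in remaining:
                degree[u] += w
                degree[v] += w
        # Sorting first makes tie-breaking deterministic.
        drop = min(sorted(remaining), key=lambda n: degree[n])
        remaining.remove(drop)
        density = induced_density(weights, remaining)
        if density > best_density:
            best, best_density = set(remaining), density
    return best, best_density


# Toy affinity graph: a tight triangle plus one weakly attached node.
weights = {
    ("a", "b"): 1.0,
    ("b", "c"): 1.0,
    ("a", "c"): 1.0,
    ("c", "d"): 0.1,
}
team, density = greedy_peel(weights, ["a", "b", "c", "d"])
```

On this toy graph the weakly connected node is peeled first, leaving the triangle as the densest team. A skill-aware variant would additionally have to keep enough members with each required skill, which is where the formulation in the abstract becomes harder.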

Personal and Ubiquitous Computing, 2012

Live Mobile Collaboration for Video Production: Design, Guidelines, and Requirements

Marco de Sá, David Shamma, Elizabeth Churchill

Traditional cameras and video equipment are gradually losing the race to smartphones and small mobile devices that allow video, photo, and audio capture on the go. Users now quickly create movies and take photos whenever and wherever they go, particularly at concerts and events.

Still, in-situ media capture with such devices poses constraints for any user, especially amateurs. In this paper, we present the design and evaluation of a mobile video capture suite that allows for cooperative ad-hoc production. Our system relies on ad-hoc, in-situ collaboration, offering users the ability to switch between streams and cooperate with each other in order to capture better media with mobile devices.

Our main contribution is the description of our design process, focusing on the prototyping approach and the qualitative analysis that followed. Furthermore, we contribute the lessons and design guidelines that emerged, which apply to the in-situ design of rich collaborative video experiences, and we elicit functional and usability requirements for collaborative video production using mobile devices.

Journal of the American Society for Information Science and Technology (JASIST), 2012

Automatic Identification of Personal Insults on Social News Sites

Sara Owsley Sood, Elizabeth Churchill, Judd Antin

As online communities grow and the volume of user-generated content increases, the need for community management also rises. Community management has three main purposes: to create a positive experience for existing participants, to promote appropriate, socionormative behaviors, and to encourage potential participants to make contributions. Research indicates that the quality of content a potential participant sees on a site is highly influential; off-topic, negative comments with malicious intent can be a particularly strong barrier to participation or can set a tone that encourages similar contributions. A problem for community managers, therefore, is the detection and elimination of such undesirable content. As a community grows, this undertaking becomes more daunting. Can an automated system aid community managers in this task?

In this paper, we address this question through a machine learning approach to automatic detection of inappropriate negative user contributions. Our training corpus is a set of comments from a news commenting site that we tasked Amazon Mechanical Turk workers with labeling. Each comment is labeled for the presence of profanity, insults, and the object of the insults. Support vector machines trained on these data are combined with relevance and valence analysis systems in a multistep approach to the detection of inappropriate negative user contributions.

The system shows great potential for semiautomated community management.
