Publications

We strongly believe in open source and giving to our community. We work directly with researchers in academia and seek out new perspectives with our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.

Bayes-Nash Equilibria of the Generalized Second-Price Auction

Renato Gomes

We develop a Bayes–Nash analysis of the generalized second-price (GSP) auction, the multi-unit auction used by search engines to sell sponsored advertising positions. Our main result characterizes the efficient Bayes–Nash equilibrium of the GSP and provides a necessary and sufficient condition that guarantees existence of such an equilibrium. With only two positions, this condition requires that the click–through rate of the second position is sufficiently smaller than that of the first.

When an efficient equilibrium exists, we provide a necessary and sufficient condition for the auction revenue to decrease as click–through rates increase. Interestingly, under optimal reserve prices, revenue increases with the click–through rates of all positions. Further, we prove that no inefficient equilibrium of the GSP can be symmetric.

Our results are in sharp contrast with the previous literature that studied the GSP under complete information.
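
For readers unfamiliar with the mechanism, the basic GSP rule can be sketched as follows. This is a minimal illustration, not the paper's equilibrium analysis: it assumes bidders are ranked by bid alone, that there is no reserve price, and that each winner pays per click the bid of the bidder ranked just below. The function name `gsp_outcome` and the sample numbers are hypothetical.

```python
from typing import List, Tuple


def gsp_outcome(bids: List[float], ctrs: List[float]) -> List[Tuple[int, float, float]]:
    """Allocate positions under a plain generalized second-price rule.

    bids: one per-click bid per bidder.
    ctrs: click-through rate of each position, best position first.
    Returns (bidder index, price per click, expected payment) per position.
    """
    # Rank bidders by bid, highest first; ties are broken by bidder index.
    order = sorted(range(len(bids)), key=lambda i: (-bids[i], i))
    results = []
    for pos, ctr in enumerate(ctrs):
        if pos >= len(order):
            break  # more positions than bidders
        winner = order[pos]
        # The winner pays the next-highest bid per click (0 if nobody is below).
        price = bids[order[pos + 1]] if pos + 1 < len(order) else 0.0
        results.append((winner, price, price * ctr))
    return results


if __name__ == "__main__":
    # Three bidders compete for two positions (illustrative numbers only).
    for bidder, price, payment in gsp_outcome([3.0, 1.0, 2.0], [0.5, 0.25]):
        print(bidder, price, payment)
```

In the sample call, bidder 0 wins the first position and pays bidder 2's bid per click, while bidder 2 wins the second position and pays bidder 1's bid.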

Journal of the American Society for Information Science and Technology (JASIST), 2012

Automatic Identification of Personal Insults on Social News Sites

Sara Owsley Sood, Elizabeth Churchill, Judd Antin

As online communities grow and the volume of user-generated content increases, the need for community management also rises. Community management has three main purposes: to create a positive experience for existing participants, to promote appropriate, socionormative behaviors, and to encourage potential participants to make contributions. Research indicates that the quality of content a potential participant sees on a site is highly influential; off-topic, negative comments with malicious intent are a particularly strong boundary to participation or set the tone for encouraging similar contributions. A problem for community managers, therefore, is the detection and elimination of such undesirable content. As a community grows, this undertaking becomes more daunting. Can an automated system aid community managers in this task?

In this paper, we address this question through a machine learning approach to automatic detection of inappropriate negative user contributions. Our training corpus is a set of comments from a news commenting site that we tasked Amazon Mechanical Turk workers with labeling. Each comment is labeled for the presence of profanity, insults, and the object of the insults. Support vector machines trained on these data are combined with relevance and valence analysis systems in a multistep approach to the detection of inappropriate negative user contributions.

The system shows great potential for semiautomated community management.

in IEEE Large-scale Data Analysis and Visualization (LDAV) 2012

Visual Analysis of Massive Web Session Data

Zeqian Shen, Jishang Wei, Neel Sundaresan, Kwan-Liu Ma

Tracking and recording users’ browsing behaviors on the web down to individual mouse clicks can create massive web session logs. While such web session data contain valuable information about user behaviors, the ever-increasing data size poses a major challenge for analyzing and visualizing the data.

An efficient data analysis framework requires both powerful computational analysis and interactive visualization. Following the visual analytics mantra "Analyze first, show the important, zoom, filter and analyze further, details on demand", we introduce a two-tier visual analysis system, TrailExplorer2, to discover knowledge from massive log data.

The system supports a visual analysis process iterating between two steps: querying web sessions and visually analyzing the retrieved data. The query happens at the lower tier where terabytes of web session data are processed in a cluster.

At the upper tier, the extracted web sessions, now at a much smaller scale, are visualized on a personal computer for interactive exploration. Our system visualizes a sorted list of web sessions’ temporal patterns and enables data exploration at different levels of detail.

The query-visualization-exploration process iterates until a satisfactory conclusion is reached. We present two case studies of TrailExplorer2 using real-world session data from eBay to demonstrate the system's effectiveness.

in IEEE Visual Analytics Science and Technology (VAST) 2012

Visual Cluster Exploration of Web Clickstream Data

Jishang Wei, Zeqian Shen, Neel Sundaresan, Kwan-Liu Ma

Web clickstream data are routinely collected to study how users browse the web or use a service. It is clear that the ability to recognize and summarize user behavior patterns from such data is valuable to e-commerce companies. In this paper, we introduce a visual analytics system to explore the various user behavior patterns reflected by distinct clickstream clusters.

In a practical analysis scenario, the system first presents an overview of clickstream clusters using a Self-Organizing Map with Markov chain models.

The analyst can then interactively explore the clusters through an intuitive user interface, either obtaining a summary of a selected group of data or further refining the clustering result. We evaluated our system using two different datasets from eBay.

Analysts who were working on the same data have confirmed the system’s effectiveness in extracting user behavior patterns from complex datasets and enhancing their ability to reason.
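
To make the "Markov chain models" mentioned above concrete, the sketch below estimates first-order transition probabilities from a handful of clickstreams. It is only an illustration under stated assumptions: the abstract does not specify the order of the chain or how the models feed into the Self-Organizing Map, and the function name `transition_probabilities` and the page labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def transition_probabilities(sessions: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate first-order Markov transition probabilities from clickstreams.

    Each session is an ordered list of page (or action) labels.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for session in sessions:
        # Count every observed step from one page to the next.
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    # Normalize each row of counts so it sums to 1.
    return {
        state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for state, nexts in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical sessions; the labels are made up for illustration.
    sessions = [
        ["home", "search", "item"],
        ["home", "search", "search"],
        ["home", "item"],
    ]
    for state, row in transition_probabilities(sessions).items():
        print(state, row)
```

A clustering step could then compare sessions by how well each one is explained by a given transition table, although that part is not shown here.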

Proceedings of KDD’12, Beijing, China. August 2012

Bootstrapped Language Identification For Multi-Site Internet Domains

We present an algorithm for language identification, in particular of short documents, for the case of an Internet domain with sites in multiple countries with differing languages.

The algorithm is significantly faster than standard language identification methods, while providing state-of-the-art identification. We bootstrap the algorithm from language labels inferred from the site alone, a methodology suitable for any supervised language identification algorithm.

We demonstrate the bootstrapping and the algorithm on eBay email data and on Twitter status update data. The algorithm is deployed at eBay as part of the back-office development data repository.

STOC 2011 (Invited and accepted to SICOMP)

Distributed Verification and Hardness of Distributed Approximation

Atish Das Sarma, Stephan Holzer, Liah Kor, Amos Korman, Danupon Nanongkai, Gopal Pandurangan, David Peleg, Roger Wattenhofer

We study the verification problem in distributed networks, stated as follows. Let H be a subgraph of a network G where each vertex of G knows which edges incident on it are in H. We would like to verify whether H has some properties, e.g., whether it is a tree or whether it is connected (every node knows at the end of the process whether H has the specified property or not). We would like to perform this verification in a decentralized fashion via a distributed algorithm. The time complexity of verification is measured as the number of rounds of distributed communication.

In this paper we initiate a systematic study of distributed verification, and give almost tight lower bounds on the running time of distributed verification algorithms for many fundamental problems such as connectivity, spanning connected subgraph, and s-t cut verification.

We then show applications of these results in deriving strong unconditional time lower bounds on the hardness of distributed approximation for many classical optimization problems including minimum spanning tree, shortest paths, and minimum cut.

Many of these results are the first non-trivial lower bounds for both exact and approximate distributed computation and they resolve previous open questions. Moreover, our unconditional lower bound of approximating minimum spanning tree (MST) subsumes and improves upon the previous hardness of approximation bound of Elkin [STOC 2004] as well as the lower bound for (exact) MST computation of Peleg and Rubinovich [FOCS 1999]. Our result implies that there can be no distributed approximation algorithm for MST that is significantly faster than the current exact algorithm, for any approximation factor.

Our lower bound proofs show an interesting connection between communication complexity and distributed computing which turns out to be useful in establishing the time complexity of exact and approximate distributed computation of many problems.

PVLDB 2011 (Invited to VLDB Journal Special Issue)

Personalized Social Recommendations – Accurate or Private?

Atish Das Sarma, Ashwin Machanavajjhala, Aleksandra Korolova

With the recent surge of social networks such as Facebook, new forms of recommendations have become possible -- recommendations that rely on one's social connections in order to make personalized recommendations of ads, content, products, and people. Since recommendations may use sensitive information, it is speculated that these recommendations are associated with privacy risks. The main contribution of this work is in formalizing trade-offs between accuracy and privacy of personalized social recommendations.

We study whether "social recommendations", or recommendations that are solely based on a user's social network, can be made without disclosing sensitive links in the social graph. More precisely, we quantify the loss in utility when existing recommendation algorithms are modified to satisfy a strong notion of privacy, called differential privacy. We prove lower bounds on the minimum loss in utility for any recommendation algorithm that is differentially private.

We then adapt two privacy preserving algorithms from the differential privacy literature to the problem of social recommendations, and analyze their performance in comparison to our lower bounds, both analytically and experimentally.

We show that good private social recommendations are feasible only for a small subset of the users in the social network or for a lenient setting of privacy parameters.
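
As background on what a differentially private recommender can look like, here is a minimal sketch of the exponential mechanism, a standard tool from the differential privacy literature. The abstract does not say which two algorithms the authors adapt, so this is an assumption chosen for illustration only. The utility scores (for example, counts of mutual friends), the `sensitivity` default, and the function names are all hypothetical.

```python
import math
import random
from typing import Dict, Optional


def selection_probabilities(
    utilities: Dict[str, float], epsilon: float, sensitivity: float = 1.0
) -> Dict[str, float]:
    """Probability of recommending each candidate under the exponential mechanism.

    A candidate with utility u is chosen with probability proportional to
    exp(epsilon * u / (2 * sensitivity)).
    """
    top = max(utilities.values())
    # Subtracting the maximum utility avoids overflow and leaves the ratios unchanged.
    weights = {
        c: math.exp(epsilon * (u - top) / (2.0 * sensitivity))
        for c, u in utilities.items()
    }
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def recommend(
    utilities: Dict[str, float],
    epsilon: float,
    sensitivity: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one recommendation according to the probabilities above."""
    rng = rng or random.Random()
    probs = selection_probabilities(utilities, epsilon, sensitivity)
    candidates = list(probs)
    return rng.choices(candidates, weights=[probs[c] for c in candidates], k=1)[0]


if __name__ == "__main__":
    # Hypothetical utilities, e.g. number of mutual friends with each candidate.
    scores = {"alice": 5.0, "bob": 3.0, "carol": 0.0}
    print(selection_probabilities(scores, epsilon=1.0))
    print(recommend(scores, epsilon=1.0, rng=random.Random(0)))
```

A small `epsilon` pushes the distribution toward uniform (more privacy, less useful recommendations), while a large `epsilon` concentrates it on the highest-utility candidate, which mirrors the accuracy and privacy trade-off the abstract describes.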

PODC 2011

A tight unconditional lower bound on distributed random walk computation

Atish Das Sarma, Danupon Nanongkai, Gopal Pandurangan

No Information

Volume 18, Issue 1, 2011

Rat, rational, or seething cauldron of desire: designing the shopper, interactions

Elizabeth Churchill

In this article, I look at how our shopping experience, both online and offline, is designed.
