We strongly believe in open source and giving to our community. We work directly with researchers in academia and seek out new perspectives with our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.


Workshop at WSDM-2014

Data Design for Personalization: Current Challenges and Emerging Opportunities

Elizabeth Churchill, Atish Das Sarma

Personalization is central to most Internet experiences. Personalization is a data-driven process, whether the data are explicitly gathered (e.g., by asking people to fill out forms) or implicitly (e.g. through analysis of behavioral data).

It is clear that designing for effective personalization poses interesting engineering and computer science challenges. However, personalization is also a user experience issue. We believe that encouraging dialogue and collaboration between data mining experts, content providers, and user-focused researchers will offer gains in the area of personalization for search and for other domains.

This workshop is part of a larger effort we are developing: D2D: Data to Design - Design to Data.

Our vision is to provide a forum for researchers and practitioners in computer and systems sciences, data sciences, machine learning, information retrieval, interaction and interface design, and human computer interaction to interact.

Our goal is to explore issues surrounding content and presentation personalization across different devices, and to set an agenda for cross-discipline, collaborative engagement.


Arrival and Departure Dynamics in Social Networks

Shaomei Wu, Atish Das Sarma, Alex Fabrikant, Silvio Lattanzi, Andrew Tomkins

In this paper, we consider the natural arrival and departure of users in a social network, and ask whether the dynamics of arrival, which have been studied in some depth, also explain the dynamics of departure, which are not as well studied.

Through study of the DBLP co-authorship network and a large online social network, we show that the dynamics of departure behave differently from the dynamics of formation.

In particular, the probability of departure of a user with few friends may be understood most accurately as a function of the raw number of friends who are active. For users with more friends, however, the probability of departure is best predicted by the overall fraction of the user's neighborhood that is active, independent of size.

We then study global properties of the sub-graphs induced by active and inactive users, and show that active users tend to belong to a core that is densifying and is significantly denser than the inactive users. Further, the inactive set of users exhibit a higher density and lower conductance than the degree distribution alone can explain. These two aspects suggest that nodes at the fringe are more likely to depart and subsequent departure are correlated among neighboring nodes in tightly-knit communities.

American Economic Journal: Microeconomics 2013, 5(1): 1–27

Set-Asides and Subsidies in Auctions

Susan Athey, Dominic Coey, Jonathan Levin

Set-asides and subsidies are used extensively in government procurement and resource sales. We analyze these policies in an empirical model of US Forest Service timber auctions.

The model fits the data well both within the sample of unrestricted sales used for estimation, and when we predict (out-of-sample) outcomes for small business set-asides. Our estimates suggest that restricting entry substantially reduces efficiency and revenue, although it increases small business participation.

An alternative policy of subsidizing small bidders would increase revenue and small bidder profit, with little efficiency cost. We explain these findings by connecting to the theory of optimal auction design. (JEL D44, H57, L73, Q23)

Tutorial at CIKM-2013

Beyond Skylines and Top-k Queries: Representative Databases and e-Commerce Product Search

Atish Das Sarma, Ashwin Lall, Nish Parikh, Neel Sundaresan

Skyline queries have been a topic of intense study in the database area for over a decade now. Similarly, the top-k retrieval query has been heavily investigated by both the database as well as the web research communities. This tutorial will delve into the background of these two areas, and specifically focus on the recent challenges with respect to returning a small set of results to users, as well as requiring minimal intervention or input from them.

These are two main concerns with skylines and top-k respectively, and therefore have drawn a great deal of attention in the recent years with several interesting ideas being proposed in the research community. This tutorial will cover the current approaches to representative database selection. We will focus on both the theoretical models as well as the practical aspects from an industry standpoint.

The topics of covered in this tutorial will include identifying representative subsets of the skyline set, interaction based approaches, e-commerce product search, and leveraging aggregate user preference statistics.

In proceedings of The 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2013. 829-836. (Best Paper Award Winner)

Chelsea Won, and You Bought a T-shirt: Characterizing the Interplay Between Twitter and E-Commerce

Haipeng Zhang, Nish Parikh, Neel Sundaresan

The popularity of social media sites like Twitter and Facebook opens up interesting research opportunities for understanding the interplay of social media and e-commerce. Most research on online behavior, up until recently, has focused mostly on social media behaviors and e-commerce behaviors independently.

In our study we choose a particular global ecommerce platform (eBay) and a particular global social media platform (Twitter). We quantify the characteristics of the two individual trends as well as the correlations between them.

We provide evidences that about 5% of general eBay query streams show strong positive correlations with the corresponding Twitter mention streams, while the percentage jumps to around 25% for trending eBay query streams. Some categories of eBay queries, such as 'Video Games' and 'Sports', are more likely to have strong correlations.

We also discover that eBay trend lags Twitter for correlated pairs and the lag differs across categories. We show evidences that celebrities' popularities on Twitter correlate well with their relevant search and sales on eBay.

The correlations and lags provide predictive insights for future applications that might lead to instant merchandising opportunities for both sellers and e-commerce platforms.


On the Embeddability of Random Walk Distances

Adelbert Chang, Atish Das Sarma, Haitao Zheng, Ben Y.Zhao

Analysis of large graphs is critical to the ongoing growth of search engines and social networks. One class of queries centers around node affinity, often quantified by random-walk distances between node pairs, including hitting time, commute time, and personalized PageRank (PPR).

Despite the potential of these "metrics," they are rarely, if ever, used in practice, largely due to extremely high computational costs. In this paper, we investigate methods to scalably and efficiently compute random-walk distances, by "embedding" graphs and distances into points and distances in geometric coordinate spaces.

We show that while existing graph coordinate systems (GCS) can accurately estimate shortest path distances, they produce significant errors when embedding random-walk distances.

Based on our observations, we propose a new graph embedding system that explicitly accounts for per-node graph properties that affect random walk. Extensive experiments on a range of graphs show that our new approach can accurately estimate both symmetric and asymmetric random-walk distances.

Once a graph is embedded, our system can answer queries between any two nodes in 8 microseconds, orders of magnitude faster than existing methods. Finally, we show that our system produces estimates that can replace ground truth in applications with minimal impact on application output.

CIKM ’13 Proceedings of the 22nd ACM international conference on Conference on information & knowledge management Pages 1137-1146

On Segmentation of eCommerce Queries

Nish Parikh, Prasad Sriram, Mohammad AlHasan

In this paper, we present QSEGMENT, a real-life query segmentation system for eCommerce queries. QSEGMENT uses frequency data from the query log which we call buyers′ data and also frequency data from product titles what we call sellers′ data.

We exploit the taxonomical structure of the marketplace to build domain specific frequency models. Using such an approach, QSEGMENT performs better than previously described baselines for query segmentation.

Also, we perform a large scale evaluation by using an unsupervised IR metric which we refer to as user-intent-score. We discuss the overall architecture of QSEGMENT as well as various use cases and interesting observations around segmenting eCommerce queries.

Lightning Talk and Poster @ Extremely Large Databases XLDB 2013.

Building a Network of E-commerce Concepts

Sandip Gaikwad, Sanjay Ghatare, Nish Parikh, Rajendra Shinde

We present a method for developing a network of e-commerce concepts. We define concepts as collection of terms that represent product entities or commerce ideas that users are interested in. We start by looking at large corpora (Billions) of historical eBay buyer queries and seller item titles.

We approach the problem of concept extraction from corpora as a market-baskets problem by adapting statistical measures of support and confidence. The concept-centric meta-data extraction pipeline is built over a map-reduce framework. We constrain the concepts to be both popular and concise.

Evaluation of our algorithm shows that high precision concept sets can be automatically mined. The system mines the full spectrum of precise e-commerce concepts ranging all the way from "ipod nano" to "I'm not a plastic bag" and from "wakizashi sword" to "mastodon skeleton".

Once the concepts are detected, they are linked into a network using different metrics of semantic similarity between concepts. This leads to a rich network of e-commerce vocabulary. Such a network of concepts can be the basis of enabling powerful applications like e-commerce search and discover as well as automatic e-commerce taxonomy generation. We present details about the extraction platform, and algorithms for segmentation of short snippets of e-commerce text as well as detection and linking of concepts.