Publications

Publications
Publications
We strongly believe in open source and giving to our community. We work directly with researchers in academia and seek out new perspectives with our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.

Publications

Advances in Neural Information Processing Systems (NIPS), 2014

Parallel Feature Selection inspired by Group Testing

Yingbo Zhou, Utkarsh Porwal, Ce Zhang, Hung Q Ngo, Long Nguyen, Christopher Ré, Venu Govindaraju

This paper presents a parallel feature selection method for classification that scales up to very high dimensions and large data sizes. Our original method is inspired by group testing theory, under which the feature selection procedure consists of a collection of randomized tests to be performed in parallel. Each test corresponds to a subset of features, for which a scoring function may be applied to measure the relevance of the features in a classification task. We develop a general theory providing sufficient conditions under which true features are guaranteed to be correctly identified. Superior performance of our method is demonstrated on a challenging relation extraction task from a very large data set that have both redundant features and sample size in the order of millions. We present comprehensive comparisons with state-of-the-art feature selection methods on a range of data sets, for which our method exhibits competitive performance in terms of running time and accuracy. Moreover, it also yields substantial speedup when used as a pre-processing step for most other existing methods.

Proceedings of the Sixteenth ACM Conference on Economics and Computation (EC '15). ACM, New York, NY, USA (2015)

Canary in the e-Commerce Coal Mine: Detecting and Predicting Poor Experiences Using Buyer-to-Seller Messages

Dimitriy Masterov, Uwe Mayer, Steve Tadelis

Reputation and feedback systems in online marketplaces are often biased, making it difficult to ascertain the quality of sellers. We use post-transaction, buyer-to-seller message traffic to detect signals of unsatisfactory transactions on eBay. We posit that a message sent after the item was paid for serves as a reliable indicator that the buyer may be unhappy with that purchase, particularly when the message included words associated with a negative experience. The fraction of a seller's message traffic that was negative predicts whether a buyer who transacts with this seller will stop purchasing on eBay, implying that platforms can use these messages as an additional signal of seller quality.

ICCV, December, 2015

HD-CNN: Hierarchical Deep Convolutional Neural Network for Image Classification

Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, Yizhou Yu

In image classification, visual separability between different object categories is highly uneven, and some categories are more difficult to distinguish than others. Such difficult categories demand more dedicated classifiers. However, existing deep convolutional neural networks (CNN) are trained as flat N-way classifiers, and few efforts have been made to leverage the hierarchical structure of categories.

In this paper, we introduce hierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a category hierarchy. An HD-CNN separates easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers. During HD-CNN training, component-wise pretraining is followed by global finetuning with a multinomial logistic loss regularized by a coarse category consistency term.

In addition, conditional executions of fine category classifiers and layer parameter compression make HD-CNNs scalable for large-scale visual recognition. We achieve state-of-the-art results on both CIFAR100 and large-scale ImageNet 1000-class benchmark datasets. In our experiments, we build up three different HD-CNNs and they lower the top-1 error of the standard CNNs by 2.65%, 3.1% and 1.1%, respectively.

Computer Networks 79: 91-102 (2015)

PcapWT: An efficient packet extraction tool for large volume network traces

Young-Hwan Kim, Roberto Konow, Diego Dujovne, Thierry Turletti, Walid Dabbous, Gonzalo Navarro:

Network packet tracing has been used for many different purposes during the last few decades, such as network software debugging, networking performance analysis, forensic investigation, and so on. Meanwhile, the size of packet traces becomes larger, as the speed of network rapidly increases. Thus, to handle huge amounts of traces, we need not only more hardware resources, but also efficient software tools. However, traditional tools are inefficient at dealing with such big packet traces. In this paper, we propose pcapWT, an efficient packet extraction tool for large traces. PcapWT provides fast packet lookup by indexing an original trace using a wavelet tree structure. In addition,pcapWT supports multi-threading for avoiding synchronous I/O and blocking system calls used for file processing, and is particularly efficient on machines with SSD. PcapWTshows remarkable performance enhancements in comparison with traditional tools such as tcpdump and most recent tools such as pcapIndex in terms of index data size and packet extraction time. Our benchmark using large and complex traces shows thatpcapWT reduces the index data size down below 1% of the volume of the original traces. Moreover, packet extraction performance is 20% better than with pcapIndex. Furthermore, when a small amount of packets are retrieved, pcapWT is hundreds of times faster than tcpdump.

Keywords
Categories
Proceedings of NAACL-HLT 2015, pages 160–167, Denver, Colorado, May 31 – June 5, 2015. c 2015 Association for Computational Linguistics

Distributed Word Representations Improve NER for e-Commerce

This paper presents a case study of using distributed word representations, word2vec in particular, for improving performance of Named Entity Recognition for the e-Commerce domain. We also demonstrate that distributed word representations trained on a smaller amount of in-domain data are more effective than word vectors trained on very large amount of out-of-domain data, and that their combination gives the best results.

Keywords
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 805–814, Jeju, Republic of Korea, 8-14 July 2012. c 2012 Association for Computational Linguistics

Structuring E-Commerce Inventory

Large e-commerce enterprises feature millions of items entered daily by a large variety of sellers. While some sellers provide rich, structured descriptions of their items, a vast majority of them provide unstructured natural language descriptions. In the paper we present a 2 steps method for structuring items into descriptive properties. The first step consists in unsupervised property discovery and extraction. The second step involves supervised property synonym discovery using a maximum entropy based clustering algorithm. We evaluate our method on a year worth of ecommerce
data and show that it achieves excellent precision with good recall.

Keywords
40th International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015

Switching to and Combining Offline-Adapted Cluster Acoustic Models based on Unsupervised Segment Classification

The performance of automatic speech recognition system degrades significantly when the incoming audio differs from training data. Maximum likelihood linear regression has been widely used for unsupervised adaptation, usually in a multiple-pass recognition process. Here we present a novel adaptation framework for which the offline, supervised, high-quality adaptation is applied to clustered channel/speaker conditions that are defined with automatic and manual clustering of the training data. Upon online recognition, each speech segment is classified into one of the training clusters in an unsupervised way, and the corresponding top acoustic models are used for recognition. Recognition lattice outputs are combined. Experiments are performed on the Wall Street Journal data, and a 37.5% relative reduction of Word Error Rate is reported. The proposed approach is also compared with a general speaker adaptive training approach.

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Correcting Keyboard Layout Errors and Homoglyphs in Queries

Keyboard layout errors and homoglyphs in cross-language queries impact our ability to correctly interpret user information needs and offer relevant results. We present a machine learning approach to correcting these errors, based largely on character-level n-gram features. We demonstrate superior performance over rule-based methods, as well as a significant reduction in the number of queries that yield null search results.

Pages