When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors
March 03, 2017 Β· Declared Dead Β· π The Web Conference
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Aneesh Sharma, C. Seshadhri, Ashish Goel
arXiv ID
1703.01054
Category
cs.SI: Social & Info Networks
Cross-listed
cs.DC,
cs.DS
Citations
17
Venue
The Web Conference
Last Checked
4 months ago
Abstract
Finding similar user pairs is a fundamental task in social networks, with numerous applications in ranking and personalization tasks such as link prediction and tie strength detection. A common manifestation of user similarity is based upon network structure: each user is represented by a vector that represents the user's network connections, where pairwise cosine similarity among these vectors defines user similarity. The predominant task for user similarity applications is to discover all similar pairs that have a pairwise cosine similarity value larger than a given threshold $Ο$. In contrast to previous work where $Ο$ is assumed to be quite close to 1, we focus on recommendation applications where $Ο$ is small, but still meaningful. The all pairs cosine similarity problem is computationally challenging on networks with billions of edges, and especially so for settings with small $Ο$. To the best of our knowledge, there is no practical solution for computing all user pairs with, say $Ο= 0.2$ on large social networks, even using the power of distributed algorithms. Our work directly addresses this challenge by introducing a new algorithm --- WHIMP --- that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar. We provide a theoretical analysis of WHIMP, proving that it has near optimal communication costs while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP's scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Social & Info Networks
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Fake News Detection on Social Media: A Data Mining Perspective
R.I.P.
π»
Ghosted
Natural Scales in Geographical Patterns
R.I.P.
π»
Ghosted
Representation Learning on Graphs: Methods and Applications
R.I.P.
π»
Ghosted
The COVID-19 Social Media Infodemic
R.I.P.
π»
Ghosted
OSMnx: New Methods for Acquiring, Constructing, Analyzing, and Visualizing Complex Street Networks
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted