We have a complicated spam ecosystem at Facebook because we're one of the largest, and therefore most lucrative, platforms for spammers. We also need to scale our services for 1.4 billion people. But we're not the only organization facing these challenges. In fact, spammers often exploit multiple platforms simultaneously to maximize their return on investment.
This week hundreds of spam-fighting professionals gathered at Facebook for our Spam Fighting @Scale conference to discuss techniques for fighting spam, forge new cooperative relationships, and help each other anticipate new spam attacks.
The event featured talks from engineers at Facebook, Pinterest, Dropbox, Yelp, and LinkedIn who are focused on fighting spam and abusive content at scale on online platforms and services.
Several key considerations emerged from the discussions, including the following:
Below are selected videos and recaps from the event. We hope to see you at a future @Scale conference!
The day kicked off with remarks from Gregg Stefancik, Engineering Director at Facebook. Gregg stressed the importance of collaboration and shared technology for the broader security and spam fighting community.
Chris Walters, tech lead for Real-Time Platform at Pinterest, shared experiences from the history of fighting spam at Pinterest, how it influenced their strategy, and how it shapes technical decisions. As on many content-sharing platforms, the very characteristics that attract consumers to Pinterest also attract spammers, namely trusted user-generated content with wide distribution potential. Chris walked through the evolution of Pinterest's architecture to support both sync and async enforcement with a flexible DSL framework and execution engine.
Chris advised spam-fighting teams to build systems that teams want to use, exert scaled effort, and aim for faster execution with a tighter feedback loop.
Jieqi Yu and Isaac Fullinwider, members of the Site Integrity team at Facebook, discussed some of the ways Facebook leverages event-based processing and adopts methods that work regardless of the root cause of spam, whether that is fake accounts, malware, phishing, or something else. Their goal is to aggregate low confidence features into high confidence decisions.
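One simple way to picture turning several low confidence features into a higher confidence decision is a noisy-OR combination. The sketch below is only an illustration, not the method described in the talk: the function name, the example probabilities, and the assumption that the signals are independent are all made up for this example.

```python
from math import prod

def combine_weak_signals(probabilities):
    """Noisy-OR: probability that at least one signal is a true positive,
    assuming the signals are independent (an assumption for this sketch)."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Four hypothetical features, each only 50% confident on its own.
weak_signals = [0.5, 0.5, 0.5, 0.5]
combined = combine_weak_signals(weak_signals)
print(f"combined confidence: {combined:.4f}")
```

Real signals are rarely independent, so a production system would more likely learn the combination from labeled data than multiply probabilities directly.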
They shared insights about the advantages and challenges of counter-based and cluster-based aggregation techniques to identify anomalies on the Facebook platform. A counter-based strategy, they argued, has lower data storage requirements and can be conducted with a given number of observations. On the other hand, cluster-based aggregation makes it easier to remediate the first observed entities, supports arbitrary groupings, and enables retroactive analysis of new anomalies.
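The trade-off between the two approaches can be sketched with a toy event stream. This is a minimal illustration under assumed names (`events`, `THRESHOLD`, `counter_based`, `cluster_based`), not Facebook's implementation. The counter version stores one integer per feature and can only act on the actors it sees once the threshold is crossed, while the cluster version stores every member and can therefore reach back to the first observed entities.

```python
from collections import Counter, defaultdict

# Hypothetical event stream of (actor, feature) pairs, in arrival order.
events = [
    ("acct1", "url:bad.example"),
    ("acct2", "url:bad.example"),
    ("acct3", "url:bad.example"),
    ("acct4", "url:ok.example"),
]

THRESHOLD = 3  # assumed cutoff for calling a feature anomalous

def counter_based(events, threshold):
    """Store only a count per feature; flag actors seen at or after the threshold."""
    counts = Counter()
    flagged = []
    for actor, feature in events:
        counts[feature] += 1
        if counts[feature] >= threshold:
            flagged.append(actor)
    return flagged

def cluster_based(events, threshold):
    """Store every member per feature; return whole groups that reach the threshold."""
    clusters = defaultdict(list)
    for actor, feature in events:
        clusters[feature].append(actor)
    return {f: members for f, members in clusters.items() if len(members) >= threshold}

flagged_by_counter = counter_based(events, THRESHOLD)
flagged_by_cluster = cluster_based(events, THRESHOLD)
print(flagged_by_counter)
print(flagged_by_cluster)
```

In this sketch the counter flags only the third account, while the cluster returns all three accounts that shared the URL, at the cost of storing the membership lists.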
Marcin Pawlowski, a software engineer on the Site Integrity Infrastructure team at Facebook, presented several concepts for building spam-fighting systems: executing rules, aggregating data, and clustering. Effective rules are precise and lead to pure functions, while data aggregation and clustering favor approximate values and compactness.
Facebook uses Haxl, a Haskell library, to simplify access to remote data with cleaner data-fetching code. Clustering can then be used to aggregate around various features such as photos, text, requests, or code. These clusters are based on similarity and it is often sufficient to cluster features that are simply modifications of each other.
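To make the idea of clustering content that is "simply modifications of each other" concrete, here is a small sketch that groups near-duplicate text using character shingles and Jaccard similarity. The technique, the function names, and the 0.6 threshold are assumptions chosen for illustration; the talk did not specify which similarity measure Facebook uses.

```python
def shingles(text, k=3):
    """Set of overlapping k-character substrings of the whitespace-normalized text."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_texts(texts, threshold=0.6):
    """Greedy single pass: join the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative_shingles, members)
    for text in texts:
        sh = shingles(text)
        for representative, members in clusters:
            if jaccard(sh, representative) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((sh, [text]))
    return [members for _, members in clusters]

posts = [
    "Win a FREE phone now, click here",
    "Win a FREE phone now!! click here",
    "Lunch at noon tomorrow?",
]
groups = cluster_texts(posts)
print(groups)
```

Comparing every item against every cluster does not scale; a larger system would typically replace the inner loop with an approximate technique such as MinHash with locality-sensitive hashing, which fits the preference for approximate values and compactness.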
Above all, Marcin advised, when it comes to building spam-fighting infrastructure: simplify, experiment, and move fast (without breaking things).
Mark Hammel leads the Threat Infrastructure team at Facebook. He shared details about ThreatExchange, a platform created by Facebook for security professionals anywhere to share threat information more easily, learn from each other’s discoveries, and make their own systems safer. Launched in early 2015, the platform already supports more than 50 participating organizations and spans a wide range of industries including financial services, industrial manufacturing, pharmaceuticals, and technology.
ThreatExchange allows security professionals to share indicators (domains, SSL certs, URLs), malware (families, hashes, samples), or signatures (BRO, Snort, Yara, etc.). A core design component of ThreatExchange is data modeled in what mathematicians and computer scientists commonly call a graph. This design, the same one Facebook uses to represent your Facebook account and connections between friends, lends itself well to representing real-world interactions between threats like malware, bad domains, and spammy URLs.
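As a rough picture of why a graph suits this kind of data, the sketch below stores threat objects as nodes and their relationships as edges, then walks outward from one indicator to find everything connected to it. The `ThreatGraph` class, its methods, and the sample identifiers are hypothetical and exist only for this example; this is not the ThreatExchange API or its data schema.

```python
from collections import defaultdict, deque

class ThreatGraph:
    """Toy undirected graph of threat objects (hypothetical, for illustration only)."""

    def __init__(self):
        self.node_types = {}           # node id -> type label
        self.edges = defaultdict(set)  # node id -> set of neighbor ids

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def relate(self, a, b):
        """Record an undirected relationship between two nodes."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def connected(self, start):
        """Breadth-first search: every node reachable from start, including start."""
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in self.edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

graph = ThreatGraph()
graph.add_node("hash:abc123", "malware")
graph.add_node("domain:bad.example", "domain")
graph.add_node("url:http://bad.example/login", "url")
graph.add_node("domain:unrelated.example", "domain")
graph.relate("hash:abc123", "domain:bad.example")
graph.relate("domain:bad.example", "url:http://bad.example/login")

related = graph.connected("hash:abc123")
print(sorted(related))
```

Starting from a single malware hash, the traversal surfaces the domain it contacts and the URL hosted on that domain, while the unrelated domain stays out of the result.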
Feedback is a gift and lurking is OK, Mark said, encouraging anyone who's curious about ThreatExchange to sign up and tell us what's working and what isn't.
Sign up here: threatexchange.fb.com.
Jonathan Gross is an engineering manager protecting the Facebook platform from spam and abuse. He discussed some of the lessons we've learned while dealing with malicious apps, protecting social plugins from clickjacking attacks, and handling abuse of OAuth access tokens.
Jonathan pointed out the importance of understanding the economics of spam because attackers will abandon unprofitable campaigns. Facebook increases the cost of spamming by making developer accounts a scarce resource and requiring additional verification. At the same time, we are reducing the profitability of spam campaigns by improving detection latency.
Jonathan counseled attendees that product changes are often more effective than spam classifiers, and that they should anticipate that people will continue to be easy victims of social engineering.
Ted Hwa and Sayan Dasgupta from the Security Data Science team at LinkedIn presented best practices for finding spam in small-text fields, particularly with indirect channels such as profile data. Fighting spam is difficult within indirect channels because it's hard to detect before an event occurs and there is often not enough text to take advantage of traditional classifiers.
Ted and Sayan walked through a few examples of how they're leveraging small-text fields to identify spam within indirect channels, including a name scoring model to assess whether names are real, headline scoring to detect fake titles or company names, and account clustering to group together fake accounts coming from a single entity.
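A name scoring model can be approximated, very roughly, by asking how typical a name's character sequences are compared with known real names. The sketch below uses a character bigram model with add-one smoothing. It is an assumption about one plausible approach, not LinkedIn's model, and the six training names, the function names, and the `alphabet_size` value are invented for the example.

```python
import math
from collections import Counter

def train_bigram_model(names):
    """Count character transitions, with ^ and $ marking the start and end of a name."""
    pair_counts = Counter()
    context_counts = Counter()
    for name in names:
        padded = "^" + name.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts

def name_score(name, model, alphabet_size=28):
    """Average log-probability per transition with add-one smoothing.
    Higher (closer to zero) means the name looks more like the training names."""
    pair_counts, context_counts = model
    padded = "^" + name.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    total = sum(
        math.log((pair_counts[(a, b)] + 1) / (context_counts[a] + alphabet_size))
        for a, b in pairs
    )
    return total / len(pairs)

# Tiny, made-up training set; a real model would need a large corpus of names.
model = train_bigram_model(["maria", "marco", "anna", "john", "joanna", "martin"])
plausible = name_score("marian", model)
implausible = name_score("xqzvk", model)
print(plausible, implausible)
```

With so little training data the absolute scores mean little, but the ordering already separates a name-like string from a random one. A score like this is one weak feature and would normally be combined with other signals.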