DataSift Curation Engine Aims for Relevance in Real-time

As I have said many times previously, if 2009 was all about the hype of Real-time, the future is all about capturing Relevance in real-time. Datasift has partnered with Twitter to get the full Twitter firehose and is building a platform to enable curation and filtering in real-time.

An introductory video about Datasift was posted in their first blog post, which didn’t reveal much about how the platform works. Now, uber-geek Robert Scoble has posted a video of an extensive discussion with Datasift’s founder, Nick Halstead.

Robert Scoble with Datasift founder Nick Halstead

This post summarizes Datasift as discussed in that video, and concludes with my own thoughts.

The Basics

Twitter’s firehose at present has around 800 tweets/sec, or 70 million tweets/day. Datasift can filter this firehose using over 20 variables. Examples of these variables include:

  • Profile information like name, location, bio, number of follows, followers, lists, etc.
  • Text and language of tweets
  • Geo-location of tweets
  • Verified users
  • Source of tweets – web, Seesmic, TweetDeck, etc.
  • Number of Retweets
  • Whether tweet contains a hyperlink

Datasift is a rules-based engine that can filter this firehose using thousands of complex rules and provide a filtered stream in real-time within milliseconds. It is built using a Service Oriented Architecture and has an API.

The Rules

Rules can comprise any combination of filters using the above variables. Rules can be combined and merged, or added and subtracted, into a single new rule. Stream outputs from Datasift using such rules can become columns in Twitter clients like TweetDeck.

Here are a few examples of how rules can be used:

  • Show me tweets containing “google” from users who don’t have “social media” in their bio, and who have more than 500 followers.
  • Show me tweets from my curated Twitter list of tech brands that have more than 100 Retweets.
  • Show me tweets originating from within a radius of 5 miles from the location of XYZ Conference that don’t have swear words, irrespective of whether their tweets contain the hashtag for the conference.
  • Show me tweets originating from Starbucks shops around the world, of users who are “Verified Accounts”, irrespective of what they’re about.
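
To make the idea of combinable rules concrete, here is a minimal sketch in Python of the first example above. It is not Datasift's actual API or rule syntax, which the video does not show; the `Tweet` fields and the helper names (`text_contains`, `bio_contains`, `more_followers_than`, `all_of`, `negate`, `filter_stream`) are all assumptions made up for illustration. Each rule is simply a function that says yes or no to a tweet, and bigger rules are built by combining smaller ones.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Tweet:
    # Hypothetical subset of the variables listed earlier in this post.
    text: str
    bio: str
    followers: int


# A rule is any function that accepts a tweet and returns True or False.
Rule = Callable[[Tweet], bool]


def text_contains(word: str) -> Rule:
    """Rule: the tweet text contains `word` (case-insensitive)."""
    return lambda t: word.lower() in t.text.lower()


def bio_contains(phrase: str) -> Rule:
    """Rule: the author's bio contains `phrase` (case-insensitive)."""
    return lambda t: phrase.lower() in t.bio.lower()


def more_followers_than(n: int) -> Rule:
    """Rule: the author has strictly more than `n` followers."""
    return lambda t: t.followers > n


def all_of(*rules: Rule) -> Rule:
    """Merge several rules into one that passes only if all of them pass."""
    return lambda t: all(rule(t) for rule in rules)


def negate(rule: Rule) -> Rule:
    """Subtract a rule: pass only when the given rule does not."""
    return lambda t: not rule(t)


# "Tweets containing 'google' from users who don't have 'social media'
# in their bio, and who have more than 500 followers."
google_rule = all_of(
    text_contains("google"),
    negate(bio_contains("social media")),
    more_followers_than(500),
)


def filter_stream(tweets: Iterable[Tweet], rule: Rule) -> List[Tweet]:
    """Keep only the tweets that satisfy the rule, preserving order."""
    return [t for t in tweets if rule(t)]
```

Because every rule has the same shape, a rule written by one curator can be dropped into another curator's `all_of(...)` unchanged, which is roughly the kind of reuse the community website is aiming for.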

Datasift’s website is intended as a community website for curators and developers to collaboratively work on developing these rules. You can leverage rules created by others to avoid duplication of effort. Rules are classified with tags, and Datasift provides search, ranking and trending for easier discoverability of rules.

Partnerships for Influence Tracking and Sentiment Analysis

Datasift has partnered with PeerIndex and Klout to enable filtering using their influence and authority scores. It has also partnered with a firm for real-time sentiment analysis.

Thus, any of the above rules can be filtered further using such scores, and a stream of tweets with negative sentiment about a brand or product, combined with any other rules, can be monitored in real-time.

Alerts and Analytics

For esoteric rules that may provide a result infrequently, alerts can be set up. The example discussed is of any politicians from a Twitter list tweeting the word “scandal”. Developers can send these alerts as email, SMS, or notifications on smartphones.

The resulting streams from all rules applied by the engine are stored by Datasift. This data can be extracted, segmented, and analyzed later. For example, this can be used to track the performance of social media campaigns.

Relevance Filtering of Links

Datasift can use TweetMeme and other databases to check the links in tweets, and determine whether they are relevant to a specific topic. There are not many details on how this is achieved, but according to Nick, all sites are already classified into different subjects by TweetMeme and other such databases.

Blekko-style Twitter Search

Datasift has developed a prototype of Twitter search along the lines of Blekko’s slashtags. Thus, along with your query text, you can use filters such as “/nolinks” to get tweets without links, or “/California” to get tweets originating from CA.
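
As a rough illustration of how such a query might be taken apart, here is a small Python sketch. The prototype's real behavior is not documented in the video, so everything here is an assumption: the function names `parse_query` and `matches`, the tweet being a plain dictionary with `text` and `location` keys, and the choice to treat any slashtag other than "/nolinks" as a location name.

```python
from typing import List, Tuple


def parse_query(query: str) -> Tuple[str, List[str]]:
    """Split a query into its free-text part and its slashtags.

    Slashtags are returned lowercased and without the leading "/".
    """
    terms: List[str] = []
    tags: List[str] = []
    for token in query.split():
        if token.startswith("/") and len(token) > 1:
            tags.append(token[1:].lower())
        else:
            terms.append(token)
    return " ".join(terms), tags


def matches(tweet: dict, query: str) -> bool:
    """Return True if the tweet satisfies the text and every slashtag."""
    text, tags = parse_query(query)
    body = tweet["text"].lower()
    if text.lower() not in body:
        return False
    for tag in tags:
        if tag == "nolinks":
            # Assumed meaning: reject tweets that contain a URL.
            if "http://" in body or "https://" in body:
                return False
        elif tweet.get("location", "").lower() != tag:
            # Assumption: any other slashtag names a location.
            return False
    return True
```

For example, `parse_query("earthquake /nolinks /California")` separates the search text "earthquake" from the two filters, and `matches` then applies each filter in turn.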

RSS Feeds

Compared to the massive volume of the Twitter firehose, the volume of RSS is minimal. Datasift plans to have their own PubSubHubbub server. Developers and third parties can plug in any RSS feeds and use Datasift’s filtering rules to get an output feed.

Revenue Model

One option is free access to the stream with in-stream ads. Ads will be tailored and designed for the target form factor – desktop/mobile/tablet/etc.

The second option is selling data B2B to developers and brand companies, charged by volume of data consumed.

Prospective Partners

Datasift is seeking to work with startups like Flipboard, who are creating new ways for curated content consumption. This can also include any of the startups focusing on Relevance, such as TwitterTimes or Paper.li.

My Thoughts

When I compared approaches to filtering information for relevance, I had suggested that the service most likely to succeed would be the one that supports multiple approaches and platforms. We can easily see that Datasift supports all platforms and several approaches like crowdsourced filtering, influence filtering, location filtering, etc. It is easily the most powerful relevance filtering engine I have seen yet.

The market of end-users for curated real-time content is at present unknown. Startups involved in creating pleasant experiences for consuming content have yet to find a monetization strategy. The degree of Datasift’s success from an end-user perspective is largely dependent on:

  • The creativity of developers and curators to create compelling experiences, and
  • How the monetization strategies of presentation apps fare and how Datasift is able to work with them

Nevertheless, with the amount of content being created online growing exponentially, curation and filtering will eventually become necessities for any social media client. It is just a matter of time.

I also see a bright future on the B2B front. By partnering with influence and authority tracking companies, combined with sentiment analysis, Datasift may already be a compelling choice for brand monitoring and social media reputation tracking.

Lastly, thanks to Robert Scoble and Nick Halstead for the interesting interview.


The Evolution from Numbers to Relevance

Social media and Businesses on the web today are driven by the numbers game – of traffic, page views, and follower numbers. But the trend I foresee is:

The web is evolving from a numbers model to a relevance model.

Paradigm Shift: What is the Relevance Model?

Historically, monetization driven by CPC/CPM based advertising has led to websites and marketers focusing on page views and traffic. This is partly the cause of social media being spammed by internet marketers, ranking algorithms being gamed for traffic, and so on.

Numbers Model | Relevance Model
# of Followers | Context-driven Lists
# of Clicks | # of Interactions
# of Page Views | # of Returning Visitors
# of Ads Displayed | Time spent on site
# of Ads Clicked | # of Subscriptions Gained
Obnoxious Ads | Relevant Ads
Influence Management | Dynamic Social Graph
Sharing Orgy & Noise | Curation
Information Overload | Filtered, Relevant Information
Traffic Economy | Attention Economy
SEO and SMO | Personalization

 

The above table lists different attributes of this paradigm shift. The “Influence Management” entry links to a post by Mia Dand who describes how leveraging social media is often about using a handful of influencers (read: with large follower numbers) to spread your message. Contrast that with Dynamic Social Graphs as described by Robert Scoble, where influence is dynamically determined based on relevance and not just numbers.

The Facebook Kingdom was built on Relevance

The king of the social web, Facebook, was not built on numbers, but relevance.

Facebook has succeeded in garnering over 400 million users because it grew on a base of real-life friends who were relevant in the users’ social circle. Other networks have failed to challenge Facebook partly because they have tried to go the other way around – from numbers to relevance.

Prioritizing numbers over relevance is putting the cart before the horse.

Even as its explosive growth continues unabated, Facebook has not compromised on relevance. It knows that its success depends on users finding relevant content on Facebook and is willing to sacrifice advertising revenue to avoid becoming irrelevant.

I’ve touched upon various aspects of this ongoing theme while tracking the Google vs. Facebook race towards a relevant real-time. It’s becoming increasingly apparent that relevance wins over real-time.

While Facebook has never been in the numbers game, other networks like Digg are now moving from the numbers model to the relevance model.

Relevance vs. Real-Time in Location Check-ins

Consider the hottest trend of check-ins via location services, such as Foursquare or Gowalla.

When I check in at a restaurant, the real-time check-ins of my friends in other places are irrelevant. What is more important and relevant to me are the tips from my friends who have checked in at the same place where I am right now.

In all cases, my friends are relevant in real-time only if they are at the same location as me. My other friends NOT at the same location become irrelevant.

Relevance wins over real-time.

The Mobile View

While mobile internet access grows, the screen of mobile devices remains constrained by its form factor. This is a major factor driving this evolution. If the content on your screen is constrained by its display, it had better be relevant.

Lifestreaming and Aggregation

As I discussed extensively in my post on why Google Buzz should not simply be yet-another-aggregator, lifestreaming and aggregation have failed to take off and gain mainstream adoption. The reason is simple – lack of relevance.

Which is why it is personally heartening to see the champions of lifestreaming and aggregation turn their focus towards relevance and disaggregation.

Startups focusing on Relevance

Quite a few startups are hoping to capitalize on this trend:

  • my6sense – recently introduced an ‘Attention API’ allowing publishers to deliver relevant content to users
  • Cadmus – auto-filters Twitter/RSS streams by relevance
  • Knowmore – surfaces relevant stuff from Twitter/Facebook
  • TwitterTimes – personalized aggregation from Twitter
  • FeedTrace – personalized aggregation from Twitter
  • VictusMedia – ‘Intelligent Media Manager’
  • MixPanel – tracking what I’ll term “Relevance Analytics” for publishers
  • Cascaad – personalized news stream based on social graph from Twitter/Facebook

From Around the Web

Here are related posts that further elaborate on this evolution:

The Race Towards A Relevant Real-Time

I have written earlier about the advantage Google has over Facebook in achieving relevance in real-time. There have been many interesting developments since:

Looking at these developments together, it is clear that both Facebook and Google want to become indispensable by providing you with relevant information in real-time.

This race is becoming a war. Facebook COO Sheryl Sandberg mentioned a “shift from an information economy to a social economy”. Mike Arrington clearly understands this war, as he asked Google’s VP Marissa Mayer about moving from search to discovery of content via social networks like Twitter and Facebook. Quite predictably, Mayer talked about Google Social Search and Real-Time search in response.

My thoughts:

Facebook is on a quest for Searchability, Google is on a quest for Relevancy. Facebook has already pocketed relevancy with its social network, while Google has already pocketed search. Real-time is no longer a technological challenge.

Google is taking an algorithmic approach to social relevancy. Google’s personalized search results and enhancements to Google Suggest reveal its algorithmic approach to finding relevance in the absence of its social network. Because you do not have friends in Google, it is using your browsing patterns, Twitter follows, and search patterns of billions of its users to ascertain what may be socially relevant to you.

Steve Rubel predicted that Google will start promoting Google Profiles heavily. In the meantime, with personalized search results, Google has a stealth profile of everyone already.

Facebook keeping Friend Lists private may be to retain its exclusive access to your social graph, rather than a response to privacy criticism. Facebook must realize that if it hands over your social graph to public search engines, Google will have a huge lead in this race. I suspect that its retraction has more to do with continuing to be a strategic player in this race, and less with privacy. At the minimum, Facebook should expect significant financial benefits out of sharing Friend List information.

Twitter wants to be equated with real-time information. With Twitter opening up its data, expect Facebook apps to work with Twitter. Search engines are already integrating Twitter like mad. Whether you are a user, search engine, application or developer, Twitter wants to be your real-time channel. Kent asked why Real Time always equates to Twitter. Because that’s exactly what Twitter wants to be.


The Local Nature of Real-Time Means RSS Rules Forever

I wanted to make an observation about real-time and the Google Reader vs. Twitter war; Louis excellently summarizes the advantages of both in this post.

While real-time technology is removing all barriers to instant communication and information flow everywhere, there are geographical and biological limitations that it has not overcome yet. While announcements and press releases are being made from Silicon Valley, the half of the world that lives on the other side of the planet is sleeping.

Scoble, who doesn’t use Google Reader anymore and compares Techmeme with Twitter Lists, also noted:

If you don’t read tweets for eight hours, don’t worry, all the big stuff you missed will be on TechMeme.

My point exactly. Most of the world sleeps anywhere between 6-9 hours a day, and does many other things besides being on Twitter. When they wake up and want to get updated with the major tech news of the day, Twitter is of no help. This is not a limitation of Twitter, it’s just the local aspect of real-time.

When I observe who follows whom on Twitter, sure, there are millions of cases where people follow folks from around the world. But if someone were to make a statistical analysis of everyone on Twitter, I think it would be clear that the majority of follows are within their own country. The same may not be true of their Google Reader subscriptions or the links in their blogrolls on WordPress.com. This is why RSS will continue to rule, as long as the earth keeps rotating, we have nights and days, and need to sleep.

A real-life example encapsulating all this happened yesterday, when I felt earthquake tremors at home here in India. I tweeted about the earthquake from my personal Twitter account where I occasionally indulge in India-specific news, did not tweet from my tech-focused ScepticGeek account, and obviously did not bother to blog about it even on my personal blog.

There was no need for the rest of the world to know about those mild tremors, and it did not hit Techmeme. The Twitter feed of my personal account was filled with tweets about the earthquake, but this would have been “noise” to others. Those who follow me on Google Reader did not get any such noise.

This is the local nature of real-time. This is also why I agree with Mark Dykeman, who noted the difference between a reasonable time web and a real-time one.


Timeless vs Real Time

If I were a book, you will put me in a bookshelf after you’ve read me. Later, I’ll probably lie in an attic and find my way to a library. My life would span a few decades, or even more. If I’m exceptionally good, I’ll be a timeless classic.

If I were your personal diary, I will probably last your lifetime, even if you stop using me after a while. You’ll keep me under lock and key, and no one else will read me. You will always treasure me.

If I were a real greeting card, you must have looked at me fondly, caressed me as if I were precious. You may not look at me again for many years, but I’ll be stashed away in some drawer of “memories”. Some day, you will enjoy nostalgia going through that drawer.

If I were a photo from your childhood, I will be stuck in some family album. This family album will be a great source of joy during holidays when the whole family is together.


If I were a blog post, I will live for a few years at best. That is, unless my blog is hacked or accidentally wiped out. I will be happy if your children know the name of my blog.

If I were a JPEG, I’d be one among the millions on Facebook or Flickr. Some people you’ve never met in real life may look at me and write comments. If I offend the sensibilities or political opinions of the owners of such social networks, I may be deleted.

If I were an email, my life in your inbox will be a few hours. After you’ve read me, I will be deleted or archived, and forgotten forever.

If I were a status update on a social network, I’ll be real-time, one among many that flow like fallen leaves in your friends’ river of feeds. If I’m good, I might be “liked”, extending my life by a few more minutes.

If I were an IM or chat conversation, I am real-time. I exist for a few fleeting minutes. I am usually used just to say Hi, or pass a link. Nobody ever looks at me again, as I vanish from this universe usually without leaving a trace.

If I were a tweet, my value usually lasts a few minutes. I may be short, but I am real-time. If I am any good, I will be passed around, shared among people who don’t know much about each other beyond their 140 character bios.
