Big data is fool's gold for most.

A couple of days ago, I found myself in a mini-conversation with my friend Jeff Novich around Netflix recommendations. It started with his tweet:

The conversation went back and forth a bit, and eventually led me to make this statement:

I’ve been thinking about this conversation off and on since then.

Back in 2010, I took a consulting gig with bit.ly for two core reasons:

1. A chance to work with Hilary Mason (still one of my absolute favorite people in 'tech’).

2. A chance to dig into the 'big data’ that was passing through bit.ly.

Shortly after taking the gig, I distinctly remember having a conversation with Greg Battle where I was taking the stance that 'data is valuable’ and he, in his usual zen-like way, pointed out that “everyone has data and everyone can get data. It’s not as valuable as people think”.

I remember this conversation to this day because it was 100% against my beliefs at that moment in time, and yet I *knew* that it was at least 80% correct (and more likely, as is Greg’s norm, 100% correct).

Fast forward through my hacks around bit.ly data, my (mostly failed) efforts with knowabout.it, and countless conversations with others playing around with 'big data’ and I think I can safely say…Greg was of course right.

Data itself (big or small) is pretty useless. But it goes further than that, because really in hindsight, that’s pretty obvious.

Now the common trend is to think that 'big data’ combined with 'intelligent algorithms’ is the holy grail. But again, this is wrong and for many reasons.

First - Big data is extremely messy.

Working with it still requires a lot of scrubbing, cleaning, and standardizing…what that really translates into is 'dropping’ a lot of stuff that just doesn’t fit into your model…and that’s actually a HUGE problem. As the old saying goes, “if you go looking for trouble, you’re sure to find it.” If you’re forcing data into a model, then you aren’t really learning from the data; you are just using the data to confirm or break your model…and you can probably use it for either or both. Bottom line, it will only confirm what you already know to be true.
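
To make that 'dropping’ concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the records, the field names, and the `fits_model` rule are invented and are not taken from bit.ly or knowabout.it. It only shows how a routine cleaning step can quietly discard a large share of the data before any analysis happens.

```python
# Hypothetical click records; field names and values are invented for illustration.
records = [
    {"url": "http://example.com/a", "country": "US", "clicks": 12},
    {"url": "http://example.com/b", "country": None, "clicks": 3},
    {"url": "", "country": "DE", "clicks": 7},
    {"url": "http://example.com/c", "country": "US", "clicks": -1},
    {"url": "http://example.com/d", "country": "BR", "clicks": 40},
]


def fits_model(record):
    """Return True only for records that match what the (assumed) model expects."""
    return (
        bool(record["url"])
        and record["country"] is not None
        and record["clicks"] >= 0
    )


def clean(records):
    """Split records into those kept and those dropped by the cleaning step."""
    kept = [r for r in records if fits_model(r)]
    dropped = [r for r in records if not fits_model(r)]
    return kept, dropped


kept, dropped = clean(records)
dropped_share = len(dropped) / len(records)
print(f"kept {len(kept)}, dropped {len(dropped)} ({dropped_share:.0%} of the data)")
```

Whatever gets computed from `kept` can only describe the records that already looked the way the model expected; the ones in `dropped` never get a say.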

Second - Algorithms are all about finding the 'norm’, the 'standard’, the 'average’, or 'the extreme’…the beauty of math is that there is an answer, and usually just one correct one. But big data is messy because it’s generated by people…and people are unique and different. There is no one correct answer…so the data and algorithms alone really can’t be used to reveal it. At best, I believe they can reveal an average that you can base judgements on.

When I look back on my own experiences, I can say that the most successful feature I’ve built on 'big data’ was probably the 'quiet sources’ in the original knowabout.it product. People still randomly bring that up to me as something they really loved about the product.

Funny thing is, out of everything across the whole system it actually involved very little math or data. It was simply a logic solution. A different, human, way to look at the problem of 'catching what you miss’. It spoke to the 'why’ of what people miss.

This simple, logical, human approach to dealing with big data is what I think the actual secret to pulling value out of big data is right now.

And when you look around at the services that are 'winning’ share of mind right now, you’ll see this is how many of them are actually pulling it off…for example:

Digg - They actually have human editors/curators who are helping to pick what stories go on the homepage and in their emails. The algorithms and big data give suggestions, but the humans do the picking.

Reddit and Hacker News - Voting pushes things up and down so you could argue that it’s an algorithm, but I think it’s really the core set of contributors and the active authors who shamelessly work to get/request up-votes from their networks that determine what actually has a chance to tip.

Others are doing it in their own ways, but my point is that I don’t think any of these systems are truly just 'big data + algorithms’…and I’m not sure that, in our current state, anything would be any good if it really was.

All of which is a long-winded way of saying, “Data-shmata. It’s about the people, damn it.”

This is the personal blog of Kevin Marshall (a.k.a Falicon) where he often digs into side projects he's working on for digdownlabs.com and other random thoughts he's got on his mind.

Kevin has a day job as CTO of Veritonic and is spending nights & weekends hacking on Share Game Tape. You can also check out some of his open source code on GitHub or connect with him on Twitter @falicon or via email at kevin at falicon.com.

If you have comments, thoughts, or want to respond to something you see here I would encourage you to respond via a post on your own blog (and then let me know about the link via one of the routes mentioned above).