Getting distinct records out of Mongo and sorting them in random order.

It’s been a really long time since I’ve posted any actual code up here…so I figured I would share a little snippet that I’ve been using quite a bit in the knowabout.it processing stuff lately. One of the things that happens a lot on the backend is indexing links or users…to save on processing, I don’t want to just loop through every user ever time the process runs. I only want to loop through those users that have links waiting to be indexed. Solving that problem is actually pretty easy because Mongo has a ‘distinct’ command that you can use to get a distinct list of results (in this case, I get a distinct list of users that have links waiting to be indexed). The next issue that we run into though is that the distinct command relies on an index, and that index will cause the results to be sorted (in the order that you defined the index – so either ascending or descending). In a lot of cases this is fine, but in my case it ends up meaning that every time I want to index links for my users the people who are at the start of the alphabet get their work done first. And again, not a huge deal, but if the process takes a long time to run (which it does the more and more users we acquire) the longer and longer the users at the end have to wait for a turn (technically this isn’t entirely true as I do thread much of this work - but even with threading the initial set of threads are the people at the start of the alphabet). So anyway - what I wanted was a unique list of users from a mongo collection, sorted in a random order…and that’s basically what this chunk of code does:

use MongoDB; # define the array we are going to store usernames in my @accounts = (); # create a connection to Mongo my $conn = MongoDB::Connection->new('YOUR_CONNECTION_HOST’,{}); # state what mongo instance we want to work with my $db = $conn->my_test_mongo; # get a distinct list of users from our collection my $users = $db->run_command(['distinct’ => 'some_collection_that_has_users’, 'key’ => 'user’]); # run_command returns a Hash (where 'values’ contains the data we want to get at) while (my ($key, $value) = each(%$users)) {   if ($key =~ /values/) {     foreach my $val (@$value) {       # we are going to use a random number to sort the array (so it can be diff order every time)       $newval = rand() . “||” . $val;       push(@accounts, $newval);     }   } } # not really required, but let’s clean up any dangling connection stuff undef $users; undef $db; undef $conn; # now we have an array of users, sort it (which is based on random numbers) @accounts = sort(@accounts); # finally we can loop through our sorted array and do something with each user foreach my $account (@accounts) {   # we have to get rid of the stuff we used to sort the array (so we just have the user we want to process)   @tmp = split(/||/, $account);   print $tmp[1] . ’\n’; }
I’m sure there are lots of ways to do this better (and be more efficient), but for now it’s a really nice, lightweight way for me to get a list of records in a random order. Note: If you don’t need to use the 'distinct’ command, and all you want is a random set of records from Mongo is some random order, the recommneded way would be to assign a random number to each document in your collection…put an index on that, and just sort based on it when querying (and I actually do use this approach quite successfully for other parts of knowabout.it). The reason I needed the above code is because, logically, you can’t combine a 'distinct’ call with a 'random’ function…at least not while expecting reliable, unique results.

This post has received 43 loves.


ARCHIVE OF POSTS



This is the personal blog of Kevin Marshall (a.k.a Falicon) where he often digs into side projects he's working on for digdownlabs.com and other random thoughts he's got on his mind.

Kevin has a day job as CTO of Veritonic and is spending nights & weekends hacking on Share Game Tape. You can also check out some of his open source code on GitHub or connect with him on Twitter @falicon or via email at kevin at falicon.com.

If you have comments, thoughts, or want to respond to something you see here I would encourage you to respond via a post on your own blog (and then let me know about the link via one of the routes mentioned above).