
John Conwell (aka Turbo): Research in the visual exploration of data
I wanted to run a simple sanity check to make sure I didn't have any duplicate Twitter posts in my database. I didn't think I had any, but you never can be too sure, so I whipped up a simple MapReduce query to check. Right now I'm storing Twitter posts in MongoDb using the document schema described below:


Each document has two fields: a category field that holds the search term used to find the post, and a post field that stores the returned Twitter post. I wanted to make sure that I didn't have any duplicate posts within any given category (though it is allowed to have duplicate posts across categories).
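As a concrete illustration, a document following this schema might look something like the sketch below. The category and post / post_id fields come straight from the description; any other sub-fields of post are just assumptions about what a stored Twitter post could contain.

```javascript
// Hypothetical example document (field names beyond category, post, and
// post_id are assumptions, not the actual stored schema)
var doc = {
  category: "Red Wine",          // the search term used to find the post
  post: {
    post_id: 16201436807299072,  // Twitter's numeric status id
    text: "..."                  // rest of the returned Twitter post
  }
};
```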

My map function created a key by concatenating the category and the post.post_id fields together, and emitted this key with the value 1. Then my reduce function just counted how many values shared each key...Easy Peasy.
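In sketch form, the pair looked roughly like this (emit is supplied by MongoDb's MapReduce runtime; the "|" separator is my own illustration of the concatenation, not necessarily the exact key format):

```javascript
// Map: one emit per document, keyed on category + post id
var map = function () {
  emit(this.category + "|" + this.post.post_id, 1);
};

// Reduce: sum the 1s for each key; any key with a count > 1 is a duplicate
var reduce = function (key, values) {
  var count = 0;
  for (var i = 0; i < values.length; i++) {
    count += values[i];
  }
  return count;
};
```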

When I ran the MapReduce job on MongoDb, the results said that out of 270,000 posts, I had 5 duplicated posts. I took the returned key which has the category and post id and ran a MongoDb query to search for them...and it returned one document. Hmmm...odd. I did the same test on the other 4 reported duplicates and got the same result. Only one document returned for each category and post id query. Really odd.

Ok, to troubleshoot this further I changed my map function to pass back both the value 1 and the document object as its value, and then changed the reduce function to return not just the sum of all dups, but also the actual document objects that made up the dup.
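A sketch of that modified pair, assuming each value carries a count plus the list of contributing documents (in MongoDb the reduce output has to have the same shape as the mapped values, which is why the count/docs object appears in both functions):

```javascript
// Map: emit a count of 1 plus the source document itself
var map = function () {
  emit(this.category + "|" + this.post.post_id, { count: 1, docs: [this] });
};

// Reduce: merge counts and accumulate the documents behind each key,
// so duplicate keys reveal exactly which documents collided
var reduce = function (key, values) {
  var result = { count: 0, docs: [] };
  values.forEach(function (v) {
    result.count += v.count;
    result.docs = result.docs.concat(v.docs);
  });
  return result;
};
```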

This is where it really got odd. When I ran the MapReduce query again, I got the same results: 5 duplicate documents, as expected. Then I printed out the two documents that made up each duplicate pair, and what did I see? They had unique category / post_id combinations!

Red Wine | 16201436807299072
Red Wine | 16201436807299073

For all five duplicate pairs that MongoDb returned, each pair had the same category, but the post ids were off by 1.

I'm not sure what to think about this. It seems like it might be some kind of numeric precision or overflow error in JavaScript, a result of Twitter ids being so large. Should I change all my ids into strings when I store them in MongoDb? Maybe that would solve the issue. Also, since all my analytics are going to be MapReduce based, I'm not sure if I can trust the results or not.
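The suspected precision problem is easy to see outside of MongoDb. JavaScript stores every number as a 64-bit double, which can represent integers exactly only up to 2^53 (9007199254740992). Twitter ids like the pair above are larger than that, so two ids that differ by 1 can round to the very same double, and therefore produce the very same map key, while the string forms stay distinct:

```javascript
// Both ids from the duplicate pair above exceed 2^53, so they
// round to the same 64-bit double...
var a = 16201436807299072;
var b = 16201436807299073;
console.log(a === b);                                      // true

// ...which means the concatenated map keys collide too
console.log(("Red Wine|" + a) === ("Red Wine|" + b));      // true

// ...but compared as strings, the ids are still distinct
console.log("16201436807299072" === "16201436807299073");  // false
```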

My next step in troubleshooting this is to add a string field to each post to hold the string value of each id, then re-run the duplicate check to see if I get the same results.

Posted on Monday, May 23, 2011 9:36 AM

Comments on this post: MongoDb MapReduce Bug?


Copyright © John Conwell