mongodb - How to extract total count for values not in array type fields -
i storing complete message thread (with messages) single document. array field participants
holds participants user ids. every message has array field read_by
user ids read message.
example data:
db.threads_test.insert( { "subject" : "subject 1", "participants" : ["u1", "u2"], "messages" : [ {"message_id" : "m1", "message" : "msg 1", "read_by" : ["u1"]}, {"message_id" : "m2", "message" : "msg 2", "read_by" : ["u2"]} ]}); db.threads_test.insert( { "subject" : "subject 2", "participants" : ["u1", "u2"], "messages" : [ {"message_id" : "m3", "message" : "msg 3", "read_by" : ["u1"]}, {"message_id" : "m4", "message" : "msg 4", "read_by" : ["u1"]} ]}); db.threads_test.insert( { "subject" : "subject 3", "participants" : ["u1", "u3"], "messages" : [ {"message_id" : "m5", "message" : "msg 5", "read_by" : ["u1", "u3"]} ]});
i need found out how many unread threads , how many unread messages user have.
logic interpretation this:
- unread thread 1 user in
participants
array , has @ least 1 message user not inread_by
- unread message 1 belong thread has user in
participants
, not have user inread_by
expected unread counts per user:
u1: threds=1, messages=1 u2: threads=2, messages=3 u3: threads=0, messages=0
i've been checking aggregation framework not find solution.
mongo version 2.4.9
this not easy 1 aggregation framework , due many arrays, there ways can wrong. fortunately there sound logic pattern here when @ closely since comes down comparison point:
db.threads_test.aggregate([ // unwind arrays { "$unwind": "$messages" }, { "$unwind": "$messages.read_by" }, { "$unwind": "$participants" }, // group on distinct "message_id" comparing "particpant" , "read_by" { "$group": { "_id": { "_id": "$_id", "participant": "$participants", "message_id": "$messages.message_id" }, "unread": { "$min": { "$cond": [ { "$ne": [ "$participants", "$messages.read_by" ] }, 1, 0 ] } } }}, // sum of unread per thread { "$group": { "_id": { "_id": "$_id._id", "participant": "$_id.participant", }, "unread": { "$sum": "$unread" } }}, // sum per participant counting unread threads { "$group": { "_id": "$_id.participant", "threads": { "$sum": { "$cond": [ { "$ne": [ "$unread", 0 ] }, 1, 0 ] } }, "unread": { "$sum": "$unread" } }} ])
which gives result:
{ "_id" : "u2", "threads" : 2, "unread" : 3 } { "_id" : "u3", "threads" : 0, "unread" : 0 } { "_id" : "u1", "threads" : 1, "unread" : 1 }
the first $group
stage there critical. ater $unwind
being processed on each array there going lot of duplication invoved. fortunately each level of "thread" , "message" have own distinct "id" values. along distinct "participants" vital point.
when @ data in "unwound" form, should able see through duplication "key test" here compare "participant" , "read_by" values see if same. , "processing loops" ( except combinations not laid out ), need return "once" given message "participant" , "ready_by" "equal".
this explains "grouping" combination. "key" comprised of "thread", "participant" , "message_id" need $min
numerical result comparison "read_by". if @ least "one" of "read_by" matched count 1
otherwise it's 0
.
the next stages careful grouping totals. first total "unread" count per thread, count threads unread messages down final "participant" grouping key.
so though it's not "always" path finding solution, doing $unwind
operations @ start way visualize data can understand solution.
alternate method
as state, have mongodb 2.4 available this, , depending on size of collection processing $unwind
can cause lot of overhead. later versions have provision possibly problem.
i mentioned earlier "processing loops", can mapreduce
instead.
while aggregation framework preferred here, might need consider if size restriction:
db.threads_test.mapreduce( function () { var doc = this; doc.participants.foreach(function(participant) { doc.messages.foreach(function(message) { var obj = { threads: [], unread: 0 }; if ( message.read_by.indexof(participant) == -1 ) { obj.threads.push(doc._id.valueof()); obj.unread = 1; } emit(participant,obj); }) }) }, function (key,values) { var result = { "threads": [], "unread": 0 }; values.foreach(function(value) { value.threads.foreach(function(thread) { if ( result.threads.indexof(thread) == -1 ) result.threads.push(thread); }) result.unread += value.unread; }); return result; }, { "finalize": function(key,value) { value.threads = value.threads.length; return value; }, "out": { "inline": 1 } } )
same thing here really. each participant on thread each message, compared "read_by" list see if in there. emit "thread id" when message unread , result if "unread". emitted every message on thread in combination participant. "loop of loops".
the results "reduced" pulling distinct values "threads" , totalling unread messages participant.
since "threads" distinct list of "id" values, want length of list in end after reduction. "finalize" here , converts list numeric value of it's length.
same results not pretty restriction of mapreduce:
"results" : [ { "_id" : "u1", "value" : { "threads" : 1, "unread" : 1 } }, { "_id" : "u2", "value" : { "threads" : 2, "unread" : 3 } }, { "_id" : "u3", "value" : { "threads" : 0, "unread" : 0 } } ],
no matter 1 suits best, problem solution should clear both programitic brain , aggregation brain
Comments
Post a Comment