python - mongoDB `upsert` with multiple key values -


I'm pulling data from Amazon Mechanical Turk and saving it in a MongoDB collection.

I have multiple workers repeat each task, since a little redundancy helps me check the quality of the work.

Every time I pull data from Amazon using the boto AWS Python interface, I obtain a file containing all the completed HITs, and I want to insert them into the collection.

Here is the document I want to insert into the collection:

    mongo_doc = {
        'subj_id': data['subj_id'],
        'img_id': trial['img_id'],
        'data_list': trial['data_list'],
        'worker_id': worker_id,
        'worker_exp': worker_exp,
        'assignment_id': ass_id,
    }

  • img_id is the identifier of an image in a database of images.
  • subj_id is the identifier of a person in that image (there might be multiple per image).
  • data_list is the data I obtain from the AMT workers.
  • worker_id, worker_exp and assignment_id are variables describing the AMT worker and the assignment.

Successive pulls using boto contain the same data, but I don't want to have duplicate documents in the collection.

I am aware of two possible solutions, but neither works for me:

  1. I can search for the document in the collection and insert it only if it is not present. This would have a high computational cost.

  2. I can use upsert as a way to make sure that a document is inserted only if a certain key is not already contained. But all of the contained keys can be duplicated, since the task is repeated by multiple workers.

Note on part 2:

  • subj_id, img_id and data_list can be duplicated, since different workers annotate the same subject and image and can give the same data.
  • worker_id, worker_exp and assignment_id can be duplicated, since a worker annotates multiple images within the same assignment.
  • The only unique thing is the combination of all these fields.

Is there a way I can insert the mongo_doc only if it was not inserted previously?

As long as "all" you want to do here is "insert" items, you have a couple of choices:

  1. Create a "unique" index across all of the required fields and use insert. Simply put, when the same combination of values already exists, a "duplicate key" error is thrown. This stops the same thing being added twice and can alert you with an exception. It is possibly best used with the Bulk Operations API and the "unordered" flag for operations. The same "unordered" option is available for insert_many(), but I prefer the syntax of the Bulk API, as it allows better building and mixed operations:

    import pymongo

    # ordered=False: keep processing the remaining operations after an error
    bulk = pymongo.bulk.BulkOperationBuilder(collection, ordered=False)
    bulk.insert(document)
    result = bulk.execute()

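    If multiple operations are queued before .execute() is called, they are all sent to the server at once and there is only "one" response. With "unordered", all items are processed regardless of errors such as a "duplicate key", and the "result" contains a report of any failed items (PyMongo delivers that report through the details of a BulkWriteError).

    The obvious "cost" here is that creating a "unique" index on all of the fields will use a fair bit of space, as well as adding significant overhead to "write" operations, since the index information must be written along with the data.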

  2. Use the "upsert" functionality with $setOnInsert. This allows you to construct a query with "all of the required unique fields" in order to "search" for a document and see if one exists. The standard "upsert" behaviour is that when the document is not found, a "new" document is created.

    What $setOnInsert adds is that the fields "set" within that statement are applied only when an "upsert" occurs. On a regular "match", all of the assignments inside $setOnInsert are ignored:

    bulk = pymongo.bulk.BulkOperationBuilder(collection, ordered=True)
    bulk.find({
        "subj_id": data["subj_id"],
        "img_id": data["img_id"],
        "data_list": data["data_list"],
        "worker_id": data["worker_id"],
        "worker_exp": data["worker_exp"],
        "assignment_id": data["assignment_id"]
    }).upsert().update_one({
        "$setOnInsert": {
            # Just the "insert" fields or just the "data" object
            "subj_id": data["subj_id"],
            "img_id": data["img_id"],
            "data_list": data["data_list"],
            "worker_id": data["worker_id"],
            "worker_exp": data["worker_exp"],
            "assignment_id": data["assignment_id"]
        },
        "$set": {
            # Any other fields "if" you want to update on match
        }
    })
    result = bulk.execute()

    Depending on your needs, you can use $set or other operators for anything you "want" to update if the document is matched, or leave them out entirely so that only "inserts" occur when nothing is matched.

    What you cannot do, of course, is assign a value to a field inside $setOnInsert and also apply $inc or another operator to that same field. This produces a conflict from trying to modify the "same path" and will throw an error.

    In that case it is better to leave the $inc field "out" of the $setOnInsert block and let the operator do its work normally. An $inc of 1 assigns 1 anyway on the first commit. The same applies to $push and other operators.

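    The "cost" again lies in assigning an index, though here it does not "need" to be "unique", even if it probably should be. Without an index, the operations are "scanning the collection" for a possible match rather than using an index, which is more efficient. So an index is not "required", but the cost of its additional "writes" is generally outweighed by the cost of the "lookup" when no index is specified.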
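The point about not modifying the same path from two operators can be sketched with plain dictionaries. The counter field `times_seen` and both helper names are hypothetical, introduced only for illustration, and the sketch assumes you want to count how often an identical submission shows up. It only compares top-level field names, not nested paths.

```python
def counting_update(doc, counter="times_seen"):
    """Build an update document that inserts `doc` once and counts sightings.

    `counter` is a hypothetical field name. It is kept out of $setOnInsert
    so that no path is modified by two operators at once.
    """
    on_insert = {k: v for k, v in doc.items() if k != counter}
    return {
        "$setOnInsert": on_insert,  # applied only when the upsert inserts
        "$inc": {counter: 1},       # applied on insert and on every match
    }


def conflicting_paths(update):
    """Return top-level field names that appear under more than one operator."""
    seen, clashes = set(), set()
    for fields in update.values():
        for name in fields:
            (clashes if name in seen else seen).add(name)
    return clashes
```

Running `conflicting_paths` on an update before sending it is one cheap way to catch the "same path" mistake on the client side.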

The further advantage comes when this is coupled with "bulk" operations: since the "upsert" method with $setOnInsert does not throw a "duplicate key" error when all of the unique keys are in the query, it can be used with "ordered" writes in a batch, as demonstrated.

When a batch of operations is "ordered", the operations are processed in the "sequence" in which they were added. So if it is important to you that the "first" insert to occur is the one that is committed, this is preferable to "unordered", which, while quicker due to parallel execution, is of course not guaranteed to commit operations in the same order in which they were constructed.
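BulkOperationBuilder was removed in PyMongo 4.0, so on current drivers the same ordered batch of upserts would go through `Collection.bulk_write()` with `UpdateOne` requests. The following is a minimal sketch under that assumption, not the answer's own code: `upsert_spec` and `upsert_all` are hypothetical helper names, and `docs` is assumed to be a list of dictionaries shaped like `mongo_doc`.

```python
KEY_FIELDS = ("subj_id", "img_id", "data_list",
              "worker_id", "worker_exp", "assignment_id")


def upsert_spec(doc, key_fields=KEY_FIELDS):
    """Return (filter, update) that inserts `doc` only if no match exists."""
    query = {name: doc[name] for name in key_fields}
    # Nothing changes on a match; the fields are written only on insert.
    return query, {"$setOnInsert": dict(doc)}


def upsert_all(collection, docs):
    """Send all upserts as one ordered batch; return how many were inserted."""
    from pymongo import UpdateOne  # lazy import: upsert_spec needs no driver

    requests = [UpdateOne(*upsert_spec(doc), upsert=True) for doc in docs]
    if not requests:
        return 0  # bulk_write() rejects an empty request list
    result = collection.bulk_write(requests, ordered=True)
    return result.upserted_count
```

Calling `upsert_all(collection, docs)` twice with the same `docs` should insert on the first call and report zero upserts on the second.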
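For comparison, the unordered route from the first option could look roughly like this with `insert_many()`. It is a sketch rather than a definitive implementation: the choice of `UNIQUE_FIELDS` is an assumption (data_list is left out on the assumption that it holds an array, which would make the index multikey), and the helper names are hypothetical.

```python
UNIQUE_FIELDS = ("subj_id", "img_id", "worker_id", "assignment_id")  # assumed
DUPLICATE_KEY = 11000  # MongoDB's error code for a unique-index violation


def unique_index_spec(fields=UNIQUE_FIELDS):
    """Build (field, direction) pairs for a compound index; 1 is ascending."""
    return [(name, 1) for name in fields]


def non_duplicate_errors(details):
    """Pick out the write errors that are not duplicate-key errors."""
    return [err for err in details.get("writeErrors", [])
            if err.get("code") != DUPLICATE_KEY]


def insert_skipping_duplicates(collection, docs):
    """Insert `docs` unordered and return the number actually inserted."""
    from pymongo.errors import BulkWriteError  # lazy import for the helpers

    collection.create_index(unique_index_spec(), unique=True)
    try:
        return len(collection.insert_many(docs, ordered=False).inserted_ids)
    except BulkWriteError as exc:
        # Duplicates are expected on repeated pulls; anything else is re-raised.
        if non_duplicate_errors(exc.details):
            raise
        return exc.details["nInserted"]
```

Because the batch is unordered, a duplicate in the middle of `docs` does not stop the documents after it from being inserted.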

Either way, you have costs in maintaining "unique" items on multiple keys with either form. Possibly an alternative to look at, in order to "reduce" the index cost, is replacing the _id field of the document with the values you consider "unique".

Since that primary key is always "unique" and always "required", this minimizes the "cost" of writing "additional indexes" and may be an option to consider. The _id doesn't "need" to be an ObjectId, and since it can be a composite object, if you have another unique identifier it would be wise to use it that way, avoiding further unique duplication.
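A composite _id along those lines might be built as follows. This is only a sketch: which fields belong in `ID_FIELDS` is an assumption (data_list is left out here), and `with_composite_id` and `insert_once` are hypothetical helpers.

```python
ID_FIELDS = ("subj_id", "img_id", "worker_id", "assignment_id")  # assumed


def with_composite_id(doc, id_fields=ID_FIELDS):
    """Return a copy of `doc` whose _id is an embedded document of key fields.

    The fields are always added in the same order, because embedded
    documents that differ only in field order are not equal in MongoDB.
    """
    new_doc = dict(doc)
    new_doc["_id"] = {name: doc[name] for name in id_fields}
    return new_doc


def insert_once(collection, doc):
    """Insert `doc`; return False if the same composite _id already exists."""
    from pymongo.errors import DuplicateKeyError  # lazy import for the helper

    try:
        collection.insert_one(with_composite_id(doc))
        return True
    except DuplicateKeyError:
        return False
```

With this layout the mandatory _id index does the uniqueness check, so no additional unique index has to be maintained.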

