C# - Strategy for splitting a large JSON file
I'm trying to split very large JSON files into smaller files based on a given array. For example:
{ "headername1": "headerval1", "headername2": "headerval2", "headername3": [{ "element1name1": "element1value1" }, { "element2name1": "element2value1" }, { "element3name1": "element3value1" }, { "element4name1": "element4value1" }, { "element5name1": "element5value1" }, { "element6name1": "element6value1" }] }
...and so on down to { "elementnname1": "elementnvalue1" }, where n is a large number.

The user provides the name that represents the array to be split (in this example "headername3") and the number of array objects per file, e.g. 1,000,000.

This would result in n files, each containing the top name:value pairs (headername1, headername2) and 1,000,000 of the headername3 objects per file.
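For instance, if the requested count per file were 2, the first output file would look something like this (a sketch of the intended result, using the sample data above):

{
  "headername1": "headerval1",
  "headername2": "headerval2",
  "headername3": [
    { "element1name1": "element1value1" },
    { "element2name1": "element2value1" }
  ]
}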
I'm using the excellent Newtonsoft JSON.Net and understand that I need to be using a stream.

So far I have looked at reading in JToken objects to establish where PropertyName == "headername3" occurs as the tokens are read in, but what I would like to do then is to read in the entire JSON object for each object in the array, rather than having to continue parsing the JSON into JTokens.
Here's a snippet of the code I'm building so far:
using (StreamReader oSr = File.OpenText(strInput))
{
    using (var reader = new JsonTextReader(oSr))
    {
        while (reader.Read())
        {
            if (reader.TokenType == JsonToken.StartObject)
            {
                intObjectCount++;
            }
            else if (reader.TokenType == JsonToken.EndObject)
            {
                intObjectCount--;
                if (intObjectCount == 1)
                {
                    intArrayRecordCount++;
                    // Here I want to read the entire object record as an untyped JSON object
                    if (intArrayRecordCount % 1000000 == 0)
                    {
                        // Write these records out to a split file
                    }
                }
            }
        }
    }
}
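One way to read each array element as an untyped object, without parsing the rest of the file into JTokens, is JToken.Load(reader), which consumes an entire object once the reader is positioned on its StartObject token. A minimal, self-contained sketch (the method name and structure are my own, not from the question):

using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

static IEnumerable<JToken> ReadArrayRecords(string path, string arrayName)
{
    using (var sr = File.OpenText(path))
    using (var reader = new JsonTextReader(sr))
    {
        bool inTargetArray = false;
        while (reader.Read())
        {
            if (reader.Depth == 1 && reader.TokenType == JsonToken.PropertyName)
            {
                // Track whether we are inside the property to be split.
                inTargetArray = (string)reader.Value == arrayName;
            }
            else if (inTargetArray && reader.Depth == 2 && reader.TokenType == JsonToken.StartObject)
            {
                // JToken.Load() consumes the whole object and advances the reader past it.
                yield return JToken.Load(reader);
            }
        }
    }
}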
I don't know - and am not, in fact, concerned about - the structure of the JSON itself, and the objects within the array can be of varying structures. I am therefore not serializing to classes.
Is this the right approach? Is there a set of methods in the JSON.Net library I can use to perform such an operation?
Any help appreciated.
You can use JsonWriter.WriteToken(JsonReader reader, true) to stream individual array entries and their descendants from a JsonReader to a JsonWriter. You can also use JProperty.Load(JsonReader reader) and JProperty.WriteTo(JsonWriter writer) to read and write entire properties and their descendants.
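For illustration, here is a minimal sketch of the two techniques in isolation (my own example, assuming the reader is already positioned appropriately for each call):

using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

static void CopyBits(JsonReader reader, JsonWriter writer)
{
    // With the reader on a PropertyName token: load the entire property
    // (name, value, and all descendants) and write it back out.
    JProperty property = JProperty.Load(reader);
    property.WriteTo(writer);

    // With the reader on the first token of an array entry: stream the entry
    // and all of its descendants directly to the writer, without building a JToken.
    writer.WriteToken(reader, true);
}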
Using these methods, you can create a state machine that parses the JSON file, iterates through the root object, loads the "prefix" and "postfix" properties, splits the array property, and writes the prefix, array slice, and postfix properties out to the new file(s).
Here's a prototype implementation that takes a TextReader and a callback function to create sequential output TextWriter objects for the split files:
enum SplitState
{
    InPrefix,
    InSplitProperty,
    InSplitArray,
    InPostfix,
}

public static void SplitJson(TextReader textReader, string tokenName, long maxItems, Func<int, TextWriter> createStream, Formatting formatting)
{
    List<JProperty> prefixProperties = new List<JProperty>();
    List<JProperty> postfixProperties = new List<JProperty>();
    List<JsonWriter> writers = new List<JsonWriter>();

    SplitState state = SplitState.InPrefix;
    long count = 0;

    try
    {
        using (var reader = new JsonTextReader(textReader))
        {
            bool doRead = true;
            while (doRead ? reader.Read() : true)
            {
                doRead = true;
                if (reader.TokenType == JsonToken.Comment || reader.TokenType == JsonToken.None)
                    continue;
                if (reader.Depth == 0)
                {
                    if (reader.TokenType != JsonToken.StartObject && reader.TokenType != JsonToken.EndObject)
                        throw new JsonException("JSON root container is not an Object");
                }
                else if (reader.Depth == 1 && reader.TokenType == JsonToken.PropertyName)
                {
                    if ((string)reader.Value == tokenName)
                    {
                        state = SplitState.InSplitProperty;
                    }
                    else
                    {
                        if (state == SplitState.InSplitProperty)
                            state = SplitState.InPostfix;
                        var property = JProperty.Load(reader);
                        doRead = false; // JProperty.Load() will have already advanced the reader.
                        if (state == SplitState.InPrefix)
                        {
                            prefixProperties.Add(property);
                        }
                        else
                        {
                            postfixProperties.Add(property);
                        }
                    }
                }
                else if (reader.Depth == 1 && reader.TokenType == JsonToken.StartArray && state == SplitState.InSplitProperty)
                {
                    state = SplitState.InSplitArray;
                }
                else if (reader.Depth == 1 && reader.TokenType == JsonToken.EndArray && state == SplitState.InSplitArray)
                {
                    state = SplitState.InSplitProperty;
                }
                else if (state == SplitState.InSplitArray && reader.Depth == 2)
                {
                    if (count % maxItems == 0)
                    {
                        var writer = new JsonTextWriter(createStream(writers.Count)) { Formatting = formatting };
                        writers.Add(writer);
                        writer.WriteStartObject();
                        foreach (var property in prefixProperties)
                            property.WriteTo(writer);
                        writer.WritePropertyName(tokenName);
                        writer.WriteStartArray();
                    }
                    count++;
                    writers.Last().WriteToken(reader, true);
                }
                else
                {
                    throw new JsonException("Internal error");
                }
            }
        }
        foreach (var writer in writers)
            using (writer)
            {
                writer.WriteEndArray();
                foreach (var property in postfixProperties)
                    property.WriteTo(writer);
                writer.WriteEndObject();
            }
    }
    finally
    {
        // Make sure files are closed in the event of an exception.
        foreach (var writer in writers)
            using (writer)
            {
            }
    }
}
This method leaves all the files open until the end, in case "postfix" properties, appearing after the array property, need to be appended. Be aware that there is a limit of 16384 open files at one time, so if you need to create more split files than that, this won't work. If postfix properties are never encountered in practice, you can instead close each file before opening the next, and throw an exception in case postfix properties are found. Otherwise you may need to parse the large file in two passes, or close and reopen the split files to append to them.
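For the close-each-file-as-you-go alternative, the writer bookkeeping could be pulled into a small helper along these lines (a hypothetical sketch of my own, not part of the implementation above; it assumes the caller writes the start tokens and prefix properties after each Next(), and that postfix properties are treated as an error):

using System;
using System.IO;
using Newtonsoft.Json;

sealed class SequentialJsonFileWriter : IDisposable
{
    readonly Func<int, TextWriter> createStream;
    JsonTextWriter current;
    int index;

    public SequentialJsonFileWriter(Func<int, TextWriter> createStream)
    {
        this.createStream = createStream;
    }

    // Finish and close the current split file, then open the next one,
    // so that only one file is ever open at a time.
    public JsonTextWriter Next()
    {
        Dispose();
        current = new JsonTextWriter(createStream(index++));
        return current;
    }

    public void Dispose()
    {
        if (current != null)
        {
            current.WriteEndArray();   // close the array slice
            current.WriteEndObject();  // close the root object
            current.Close();
            current = null;
        }
    }
}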
Here is an example of how to use the method with an in-memory JSON string:
private static void TestSplitJson(string json, string tokenName)
{
    var builders = new List<StringBuilder>();
    using (var reader = new StringReader(json))
    {
        SplitJson(reader, tokenName, 2, i =>
        {
            builders.Add(new StringBuilder());
            return new StringWriter(builders.Last());
        }, Formatting.Indented);
    }
    foreach (var s in builders.Select(b => b.ToString()))
    {
        Console.WriteLine(s);
    }
}
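To write actual files instead of in-memory strings, the callback can create sequentially numbered files, for example (the paths and file names here are hypothetical):

using (var reader = File.OpenText(@"C:\data\input.json"))
{
    SplitJson(reader, "headername3", 1000000,
        i => File.CreateText(Path.Combine(@"C:\data", "split" + i + ".json")),
        Formatting.Indented);
}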
Prototype fiddle.