How do I use Dataflow to calculate the top unique value results per key?
I'm relatively new to Dataflow and its programming model, and I'm struggling with a problem that requires calculating the top 10 weeks in which a customer has the highest spend. I apologise if this seems like a silly question.
The data I have consists of customer IDs, which I use as the key, and a few million records, each containing a timestamp and a spend value.
I've created an extract method that looks like this (excluding logging and the date formatter). It receives a BigQuery table row and extracts the customer ID, the spend, and the timestamp as a week number:
    static class ExtractSpend extends DoFn<TableRow, KV<String, SpendByWeek>> {
      private static final long serialVersionUID = 0;

      @Override
      public void processElement(ProcessContext c) {
        // The input element is one BigQuery table row.
        TableRow row = c.element();
        String custId = (String) row.get("customerid");
        // dateTimeFormatter is defined elsewhere (omitted here, as noted above).
        LocalDateTime date =
            LocalDateTime.parse((String) row.get("timestamp"), dateTimeFormatter);
        // Week numbering follows the default locale's definition of a week.
        WeekFields weekFields = WeekFields.of(Locale.getDefault());
        int weekNumber = date.get(weekFields.weekOfWeekBasedYear());
        Double spend = (Double) row.get("spend");
        SpendByWeek spendByWeek = new SpendByWeek(weekNumber, spend.doubleValue());
        c.output(KV.of(custId, spendByWeek));
      }
    }

But I can't figure out how to take this output and group it in such a way that I can add up the spend values per customer ID and week, sort them, and output a PCollection<KV<String, List<Double>>> of each customer and their top 10 weekly spend values.
Would anyone be able to help me out with this, please?
If you want to accomplish this using grouping, you'd need to first group by customer ID and week to compute the sum, then move the week into the value and regroup by customer ID before computing the top. You can also do this using windowing rather than putting the week in the key. See the end for details on doing that.
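To make the two grouping stages concrete, here is a minimal plain-Java sketch of the same logic on an in-memory list, without the Dataflow SDK. The Spend class and the topWeeklySpend method are hypothetical names used only for illustration; in a pipeline, each stage would be a grouping or combining transform rather than a stream collector.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class TopWeeklySpend {
    /** Hypothetical stand-in for one parsed row: customer, week number, spend. */
    static class Spend {
        private final String customerId;
        private final int week;
        private final double amount;

        Spend(String customerId, int week, double amount) {
            this.customerId = customerId;
            this.week = week;
            this.amount = amount;
        }

        String getCustomerId() { return customerId; }
        int getWeek() { return week; }
        double getAmount() { return amount; }
    }

    static Map<String, List<Double>> topWeeklySpend(List<Spend> rows, int n) {
        // Stage 1: group by customer ID and week, summing the spend.
        Map<String, Map<Integer, Double>> perWeek = rows.stream().collect(
            Collectors.groupingBy(Spend::getCustomerId,
                Collectors.groupingBy(Spend::getWeek,
                    Collectors.summingDouble(Spend::getAmount))));

        // Stage 2: regroup by customer ID alone and keep the n largest weekly totals.
        Map<String, List<Double>> result = new TreeMap<>();
        perWeek.forEach((customer, weeks) -> result.put(customer,
            weeks.values().stream()
                .sorted(Comparator.reverseOrder())
                .limit(n)
                .collect(Collectors.toList())));
        return result;
    }

    public static void main(String[] args) {
        List<Spend> rows = Arrays.asList(
            new Spend("a", 1, 10.0), new Spend("a", 1, 5.0),
            new Spend("a", 2, 30.0), new Spend("a", 3, 2.0),
            new Spend("b", 1, 7.0));
        System.out.println(topWeeklySpend(rows, 2)); // prints {a=[30.0, 15.0], b=[7.0]}
    }
}
```

Note that the week number alone repeats every year, so data spanning more than one year would need the year in the grouping key as well.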
Once you've done that, you have a PCollection<KV<String, SpendByWeek>> in which each week occurs only once for a given key. You can determine the top SpendByWeek values for each user ID by defining a Comparator<SpendByWeek> that implements Serializable and using Top.perKey().
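A serializable comparator of that kind might look like the sketch below. The question does not show the fields of SpendByWeek, so the weekNumber and spend fields here are assumptions, and BySpend is a hypothetical name.

```java
import java.io.Serializable;
import java.util.Comparator;

/** Assumed shape of the value class; the real SpendByWeek may differ. */
class SpendByWeek implements Serializable {
    private static final long serialVersionUID = 0;
    final int weekNumber;
    final double spend;

    SpendByWeek(int weekNumber, double spend) {
        this.weekNumber = weekNumber;
        this.spend = spend;
    }
}

/** Orders by spend ascending, so the "largest" element is the highest spend. */
class BySpend implements Comparator<SpendByWeek>, Serializable {
    private static final long serialVersionUID = 0;

    @Override
    public int compare(SpendByWeek a, SpendByWeek b) {
        return Double.compare(a.spend, b.spend);
    }
}
```

With this in place, the call would presumably be Top.perKey(10, new BySpend()), which keeps the ten largest values per key according to the comparator. The comparator has to be serializable because it is shipped to the workers along with the transform.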
Computing spend-per-week-per-user with windows
As mentioned at the top, you can also use windowing to compute the spend per week.
- Write a DoFn similar to ExtractSpend that takes the TableRow and outputs individual rows of spend as KV<String, Double> keyed by the customer ID, emitting them with outputWithTimestamp.
- Then apply a windowing transformation, such as FixedWindows, to divide the events into windows of a specified size. In this case, you want FixedWindows.of(Duration.standardDays(7)) or CalendarWindows.weeks(...).
- Then apply a transform such as Sum.doublesPerKey().
At this point, you'll have a PCollection that contains per-week-window KV<String, Double> entries, where each entry is the total spend for that key in that week.
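The choice between the two window types matters for where each week begins. As far as I know, fixed windows with no offset are aligned to the Unix epoch, and 1 January 1970 was a Thursday, so seven-day fixed windows run Thursday to Thursday in UTC. The plain-Java sketch below reproduces only that arithmetic; it is not SDK code, and WeekWindows is a hypothetical name.

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.ZoneOffset;

class WeekWindows {
    static final long WEEK_MILLIS = 7L * 24 * 60 * 60 * 1000;

    /** Start of the epoch-aligned seven-day window that contains the timestamp. */
    static long windowStartMillis(long timestampMillis) {
        // floorMod keeps the result correct for timestamps before the epoch too.
        return timestampMillis - Math.floorMod(timestampMillis, WEEK_MILLIS);
    }

    /** Day of the week (UTC) on which that window starts. */
    static DayOfWeek windowStartDay(long timestampMillis) {
        return Instant.ofEpochMilli(windowStartMillis(timestampMillis))
            .atZone(ZoneOffset.UTC)
            .getDayOfWeek();
    }

    public static void main(String[] args) {
        long ts = Instant.parse("2016-03-09T12:00:00Z").toEpochMilli();
        System.out.println(Instant.ofEpochMilli(windowStartMillis(ts))); // prints 2016-03-03T00:00:00Z
        System.out.println(windowStartDay(ts)); // prints THURSDAY
    }
}
```

If the weekly totals should line up with calendar weeks instead, the calendar-based windows mentioned above are likely the better fit.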
- Then you can run a DoFn that takes the weekly windows and moves the window information into the value (so you have KV<String, SpendAndWeek>).
- Apply a Window.into to switch to GlobalWindows.
- And apply the Top.perKey operation described above.