How do I use Dataflow to calculate the top unique value results per key?
I'm relatively new to Dataflow and its programming model, and I'm struggling with a problem that requires calculating the top 10 weeks in which a customer has the highest spend. Apologies if this seems like a silly question.
The data I have consists of customer IDs to use as the key, and a few million records containing a timestamp and a spend value.
I've created an extract method that looks like this (excluding logging and the date formatter). It receives a BigQuery table row and extracts the customer ID, the spend, and the week number from the timestamp:
    static class ExtractSpend extends DoFn<TableRow, KV<String, SpendByWeek>> {
      private static final long serialVersionUID = 0;

      @Override
      public void processElement(ProcessContext c) {
        TableRow row = c.element();
        String custId = (String) row.get("customerId");
        LocalDateTime date = LocalDateTime.parse((String) row.get("timestamp"), dateTimeFormatter);
        WeekFields weekFields = WeekFields.of(Locale.getDefault());
        int weekNumber = date.get(weekFields.weekOfWeekBasedYear());
        Double spend = (Double) row.get("spend");
        SpendByWeek spendByWeek = new SpendByWeek(weekNumber, spend.doubleValue());
        c.output(KV.of(custId, spendByWeek));
      }
    }
But I can't figure out how to take this output and group it in such a way that I can add the spend values per customer ID and week, sort them, and output a PCollection<KV<String, List<Double>>> of each customer and their top 10 weekly spend values.
Would anyone be able to help me out with this, please?
If you want to accomplish this using grouping, you'd need to first group by customer ID and week and compute the sum, then move the week into the value and regroup by customer ID while computing the top values. You can also do this using windowing rather than putting the week in the key; see the end of this answer for details on doing that.
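To make the grouping version concrete, here is a rough sketch of the two passes. It assumes the Dataflow Java SDK style from your snippet, that rows is the PCollection<TableRow> read from BigQuery, and that SpendByWeek has the (weekNumber, spend) constructor shown in your code; the class and variable names are mine:

    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.transforms.Sum;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    // Pass 1: key each row by (customerId, weekNumber) so the sum is computed per customer-week.
    static class KeyByCustomerAndWeek extends DoFn<TableRow, KV<KV<String, Integer>, Double>> {
      private static final long serialVersionUID = 0;

      @Override
      public void processElement(ProcessContext c) {
        TableRow row = c.element();
        String custId = (String) row.get("customerId");
        LocalDateTime date = LocalDateTime.parse((String) row.get("timestamp"), dateTimeFormatter);
        int weekNumber = date.get(WeekFields.of(Locale.getDefault()).weekOfWeekBasedYear());
        Double spend = (Double) row.get("spend");
        c.output(KV.of(KV.of(custId, weekNumber), spend));
      }
    }

    // Pass 2: move the week back into the value so the key is just the customer id.
    static class MoveWeekIntoValue extends DoFn<KV<KV<String, Integer>, Double>, KV<String, SpendByWeek>> {
      private static final long serialVersionUID = 0;

      @Override
      public void processElement(ProcessContext c) {
        KV<String, Integer> custAndWeek = c.element().getKey();
        c.output(KV.of(custAndWeek.getKey(),
            new SpendByWeek(custAndWeek.getValue(), c.element().getValue())));
      }
    }

    PCollection<KV<String, SpendByWeek>> weeklySpend = rows
        .apply(ParDo.of(new KeyByCustomerAndWeek()))
        .apply(Sum.<KV<String, Integer>>doublesPerKey())
        .apply(ParDo.of(new MoveWeekIntoValue()));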
Once you've done that, you'll have a PCollection<KV<String, SpendByWeek>> in which each week occurs at most once for a given key. You can then determine the top SpendByWeek values for each given user ID by defining a Comparator<SpendByWeek> that implements Serializable and using Top.perKey().
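For example, a minimal sketch (the comparator name is mine, and it assumes SpendByWeek exposes a getSpend() accessor):

    import java.io.Serializable;
    import java.util.Comparator;
    import java.util.List;
    import com.google.cloud.dataflow.sdk.transforms.Top;

    // Orders SpendByWeek by spend amount; Top keeps the largest elements per this comparator.
    static class SpendComparator implements Comparator<SpendByWeek>, Serializable {
      private static final long serialVersionUID = 0;

      @Override
      public int compare(SpendByWeek a, SpendByWeek b) {
        return Double.compare(a.getSpend(), b.getSpend());  // assumes a getSpend() accessor
      }
    }

    // Keep the 10 highest-spend weeks for each customer.
    PCollection<KV<String, List<SpendByWeek>>> top10PerCustomer =
        weeklySpend.apply(Top.perKey(10, new SpendComparator()));

Note that the result value is a List<SpendByWeek> rather than a List<Double>; if you need exactly a PCollection<KV<String, List<Double>>>, a final ParDo can project out just the spend values.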
Computing spend-per-week-per-user with windows
As mentioned at the top, you can use windowing to compute the spend-per-week.
- Write a DoFn similar to ExtractSpend that takes a TableRow and outputs individual rows of spend keyed by customer ID, emitting each output with outputWithTimestamp.
- Then apply a windowing transformation, such as FixedWindows, to divide the events into windows of a specified size. In this case you want one-week windows, e.g. FixedWindows.of(Duration.standardDays(7)) or CalendarWindows.weeks(...).
- Then apply a transform such as Sum.doublesPerKey(), as in the sketch below.
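A sketch of those three steps, with the same assumptions as above (rows is the PCollection<TableRow> from BigQuery; Duration.standardDays(7) is used for the one-week window size; UTC is assumed when converting the parsed timestamp):

    import java.time.ZoneOffset;
    import org.joda.time.Duration;
    import org.joda.time.Instant;
    import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
    import com.google.cloud.dataflow.sdk.transforms.windowing.Window;

    // Step 1: emit spend keyed by customer id, attaching the record's event timestamp.
    static class ExtractTimestampedSpend extends DoFn<TableRow, KV<String, Double>> {
      private static final long serialVersionUID = 0;

      @Override
      public void processElement(ProcessContext c) {
        TableRow row = c.element();
        String custId = (String) row.get("customerId");
        Double spend = (Double) row.get("spend");
        LocalDateTime date = LocalDateTime.parse((String) row.get("timestamp"), dateTimeFormatter);
        Instant eventTime = new Instant(date.toInstant(ZoneOffset.UTC).toEpochMilli());  // UTC assumed
        c.outputWithTimestamp(KV.of(custId, spend), eventTime);
      }
    }

    // Steps 2 and 3: window into one-week windows, then sum spend per customer within each window.
    PCollection<KV<String, Double>> spendPerCustomerPerWeek = rows
        .apply(ParDo.of(new ExtractTimestampedSpend()))
        .apply(Window.<KV<String, Double>>into(FixedWindows.of(Duration.standardDays(7))))
        .apply(Sum.<String>doublesPerKey());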
At this point, you'll have a PCollection that contains, per week window, KV<String, Double> entries where each entry is the total spend for that key in that week.
- Then you can run a DoFn that takes the weekly windows and moves that information into the value (so you have KV<String, SpendByWeek>).
- Apply Window.into to switch to GlobalWindows.
- And apply the Top.perKey operation described above, as sketched below.
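Putting those last three bullets together might look roughly like this. It reuses SpendComparator from above and derives the week number from the element's timestamp, which after the windowed sum falls inside the corresponding week window (assuming the default end-of-window output timestamps):

    import org.joda.time.DateTimeZone;
    import com.google.cloud.dataflow.sdk.transforms.windowing.GlobalWindows;

    // Move the week information out of the window and into the value.
    static class AddWeekToValue extends DoFn<KV<String, Double>, KV<String, SpendByWeek>> {
      private static final long serialVersionUID = 0;

      @Override
      public void processElement(ProcessContext c) {
        // The element's timestamp lies inside its week window, so the week number
        // can be recovered from it (UTC assumed here).
        int weekNumber = c.timestamp().toDateTime(DateTimeZone.UTC).getWeekOfWeekyear();
        c.output(KV.of(c.element().getKey(),
            new SpendByWeek(weekNumber, c.element().getValue())));
      }
    }

    PCollection<KV<String, List<SpendByWeek>>> top10PerCustomer = spendPerCustomerPerWeek
        .apply(ParDo.of(new AddWeekToValue()))
        // Re-window into the global window so Top.perKey sees all weeks for a customer.
        .apply(Window.<KV<String, SpendByWeek>>into(new GlobalWindows()))
        .apply(Top.perKey(10, new SpendComparator()));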