How do I use Dataflow to calculate the top unique value results per key? -


i'm relatively new dataflow , programming model , struggling problem requires calculating top 10 weeks customer has highest spend. apologise if seems silly question.

the data have consists of customer ids use key , few million records containing timestamp , spend value.

i've created extract method looks (excluding logging , date formatter). receives bigquery table row extract customer id, spend , timestamp week number:

static class extractspend extends dofn<tablerow, kv<string, spendbyweek>> {     private static final long serialversionuid = 0;      @override     public void processelement(processcontext c) {         string custid = (string) row.get("customerid");         localdatetime date = localdatetime.parse((string) row.get("timestamp"), datetimeformatter);          weekfields weekfields = weekfields.of(locale.getdefault());         int weeknumber = date.get(weekfields.weekofweekbasedyear());          double spend = (double) row.get("spend");          spendbyweek spendbyweek = new spendbyweek(weeknumber, spend.doublevalue());         c.output(kv.of(custid, spendbyweek));     } } 

but can't figure out how take output , group in such way can add spend values per customer id , week, sort them , output pcollection<string, list<double>> of each customer , top 10 weekly spend values.

would able me out this, please?

if want accomplish using grouping, you'd need first group customer id , week compute sum move week value , regroup customer id computing top. can using windowing rather putting week in key. see end details on doing that.

once you've done that, have pcollection<kv<string, spendbyweek>> each week occurs once given key. can determine top spendbyweek each given user-id defining comparator<spendbyweek> implements serializable , using top.perkey().


computing spend-per-week-per-user windows

as mentioned @ top can use windowing computing spend-per-week.

  1. write dofn similar extractspend takes tablerow , outputs kv individual rows of spend keyed customer id , output outputs using outputwithtimestamp.
  2. then apply windowing transformation, such fixedwindows take divide events windows of specified size. in case, want fixedwindows.of(duration.standardweeks(1)) or calendarwindows.weeks(...).
  3. then apply transform such sum.doublesperkey().

at point, you'll have pcollection contains per-week-window kv<string, double> each entry total spend key in week.

  1. then can run dofn takes hourly windows , moves information value (so have kv<string, spendandweek>)
  2. apply window.into switch globalwindows
  3. and apply top.perkey operation described above.

Comments

Popular posts from this blog

javascript - Using jquery append to add option values into a select element not working -

Android soft keyboard reverts to default keyboard on orientation change -

Rendering JButton to get the JCheckBox behavior in a JTable by using images does not update my table -