python - How to get rows from a NumPy 2D array where a column value is the maximum within each group of another column?
This is a pretty common SQL query: select the lines with the maximum value in column x, grouped by group_id.
The result is, for every group_id, one (the first) line where the column x value is the maximum within the group.
I have a 2D NumPy array with many columns, but let's simplify it to (id, x, y):
import numpy as np
rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])
and I want to get:
[[1 22 1236]
 [2 23 1111]]
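I am able to do it through a cumbersome loop (assuming the rows are already grouped by id), like:
row_grouped_with_max = []
max_row = rows[0]
last_max = max_row[1]
last_row_group = max_row[0]
for row in rows:
    if row[0] != last_row_group:
        # New group: store the previous group's winner and start over.
        row_grouped_with_max.append(max_row)
        last_row_group = row[0]
        max_row = row
        last_max = row[1]
    elif last_max < row[1]:
        # Strictly greater, so the first row wins ties.
        max_row = row
        last_max = row[1]
# Store the winner of the final group.
row_grouped_with_max.append(max_row)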
How can I do this in a clean NumPy way?
It might not be very clean, but here's a vectorized way to solve it -
# Sort "rows" by id
sorted_rows = rows[np.argsort(rows[:,0])]

# Count of elements for each id
_,count = np.unique(sorted_rows[:,0],return_counts=True)

# Form a mask to fill in the elements from the x column
n1 = count.max()
n2 = len(count)
mask = np.arange(n1) < count[:,None]

# Form a 2D matrix of x values with one row for each unique id
id_2darray = np.empty((n2,n1))
id_2darray.fill(-np.inf)
id_2darray[mask] = sorted_rows[:,1]

# Per-id max indices, offset to index into sorted_rows
grp_max_idx = np.argmax(id_2darray,axis=1) + np.append([0],count.cumsum()[:-1])

# Finally, get the "maxed"-x rows
out = sorted_rows[grp_max_idx]
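To make the padding step above easier to follow, here is a small sketch of the intermediate values for the question's six-row sample. The names `x`, `padded` and `starts` are my own, chosen for illustration, and `np.full` is swapped in for the `np.empty` plus `fill` pair; treat it as an illustration under those assumptions, not as a replacement for the code above.

```python
import numpy as np

# Group sizes and x values for the question's sample, already sorted by id:
# id 1 has 2 rows, id 2 has 4 rows.
count = np.array([2, 4])
x = np.array([22, 11, 13, 10, 23, 23])

# mask[i, j] is True when group i has a j-th element.
mask = np.arange(count.max()) < count[:, None]

# Lay each group's x values out in its own row; unused slots stay at -inf,
# so they can never be picked by argmax.
padded = np.full((len(count), count.max()), -np.inf)
padded[mask] = x

# Position of the max inside each group, plus the offset where the group starts.
starts = np.append([0], count.cumsum()[:-1])
grp_max_idx = padded.argmax(axis=1) + starts

print(mask)
print(padded)
print(grp_max_idx)
```

Since `argmax` returns the first maximum in each row, ties within a group go to whichever row comes first in the sorted order.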
Sample input, output -
In [101]: rows
Out[101]:
array([[   2,   13, 1234],
       [   1,   22, 1236],
       [   2,   23, 1250],
       [   6,   12, 1345],
       [   4,   10,  290],
       [   2,   10, 1224],
       [   2,   23, 1111],
       [   4,   45,   99],
       [   1,   11, 1563],
       [   4,   23,   89]])

In [102]: out
Out[102]:
array([[   1,   22, 1236],
       [   2,   23, 1250],
       [   4,   45,   99],
       [   6,   12, 1345]])
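For comparison, a shorter variant is possible with `np.lexsort` and `np.unique(..., return_index=True)`. This is a different technique from the padded-matrix approach above, offered only as a sketch. It assumes the x column is numeric so it can be negated, and it relies on `np.lexsort` being a stable sort so that the first row wins ties. The names `order` and `first_idx` are my own.

```python
import numpy as np

rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])

# Sort by id ascending, then by x descending. np.lexsort uses the LAST key
# as the primary one; being stable, it keeps the original order among ties.
order = np.lexsort((-rows[:, 1], rows[:, 0]))
sorted_rows = rows[order]

# After this sort, the first row of each id is that group's max-x row.
# return_index gives the index of the first occurrence of each unique id.
_, first_idx = np.unique(sorted_rows[:, 0], return_index=True)
out = sorted_rows[first_idx]

print(out)
```

On the question's sample this picks `[1, 22, 1236]` and `[2, 23, 1111]`, which matches the requested output, including the tie between the two rows with x = 23.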