Python: Fastest way of parsing the first column of a large table into an array
So I have got 2 big tables to compare (9 columns, approx. 30 million rows).
#!/usr/bin/python
import sys
import csv

def compare(sam1, sam2, output):
    with open(sam1, "r") as s1, open(sam2, "r") as s2, open(output, "w") as out:
        reader1 = csv.reader(s1, delimiter="\t")
        reader2 = csv.reader(s2, delimiter="\t")
        writer = csv.writer(out, delimiter="\t")
        list = []
        for line in reader1:
            list.append(line[0])
        list = set(list)
        for line in reader2:
            for field in line:
                if field not in list:
                    writer.writerow(line)

if __name__ == '__main__':
    compare(sys.argv[1], sys.argv[2], sys.argv[3])
The first column contains the identifier of the rows, and I want to know which ones are in sam1.
So that's the code I'm working with, but it takes ages. Is there a way to speed it up?
I tried to speed it up by converting the list to a set, but that made no big difference.
Edit: it would be even better to get the whole lines out of the input table and write the lines with an exclusive id to the output file. How can I manage that in a quick way?
A few suggestions:
Rather than creating a list and turning it into a set, work with a set directly:
sam1_identifiers = set()
for line in reader1:
    sam1_identifiers.add(line[0])
This is more memory efficient, because you have a single set rather than a list and a set, and it might also make it a bit faster.
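As a minor stylistic alternative, the same set can also be built in a single expression with a set comprehension (a sketch on my part, assuming Python 2.7 or later):

sam1_identifiers = {line[0] for line in reader1}  # collect the first column of sam1 into a set

This behaves the same as the explicit loop above.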
Note that I've changed the variable name: list is the name of a Python builtin, so you shouldn't use it for your own variables. Since you want to find the identifiers that are in sam1, rather than nested if/for statements you can collect the ids of sam2 into a set as well, then compare the two sets and throw away the identifiers that are also found in sam2 from the set of ids of sam1:
sam2_identifiers = set()
for line in reader2:
    sam2_identifiers.add(line[0])
print(sam1_identifiers - sam2_identifiers)
Or even:
for line in reader2:
    sam1_identifiers.discard(line[0])
print(sam1_identifiers)
I suspect that's faster than the nested loops.
Perhaps I've missed something, but why do you go through every column of each line of sam2? Isn't it sufficient to look at line[0], the identifier, as you do for sam1?
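Putting the suggestions together, here is a minimal sketch of a revised compare() (my own illustration, not the asker's final code, assuming the identifier is column 0 in both files and that the goal from the edit is to write out the whole rows of sam2 whose identifier never appears in sam1):

#!/usr/bin/python
import sys
import csv

def compare(sam1, sam2, output):
    with open(sam1, "r") as s1, open(sam2, "r") as s2, open(output, "w") as out:
        reader1 = csv.reader(s1, delimiter="\t")
        reader2 = csv.reader(s2, delimiter="\t")
        writer = csv.writer(out, delimiter="\t")
        # build the set of identifiers seen in sam1 (first column only)
        sam1_identifiers = {line[0] for line in reader1}
        # stream sam2 once and keep only the rows whose identifier is unknown to sam1
        for line in reader2:
            if line[0] not in sam1_identifiers:
                writer.writerow(line)

if __name__ == '__main__':
    compare(sys.argv[1], sys.argv[2], sys.argv[3])

This makes a single pass over each file, does one set lookup per sam2 row instead of one per field, and never builds the 30-million-element list first.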