2017/07/02

How to do list intersection efficiently in python

Question

In previous blog, I have investigated how to do array initialization efficiently. This time, I would like to investigate how to do intersections in Python efficiently.
For those need to read code immediately, here is the gist for this blog

Methods

Before digging in the implementations, let’s define what intersection is. By intersection, I mean finding same elements between 2 given lists.
A = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
B = [2, 4, 6, 8, 10]
# c = a ^ b
# which is c = [2, 4, 6, 8]
How do we get c = [2, 4, 6, 8] efficiently?

M1: double loop

A double loop for finding same elements
def doubleLoop():
    C = [a for a in A for b in B if a == b]

def doubleLoop2():
    C = []
    for a in A:
        for b in B:
            if a == b:
                C.append(a)
Noted
  • list comprehension vs append

M2: Set intersection

Cast A and B to set and do intersection.
setA = set(A)
setB = set(B)

def setAnd():
    C = sorted(setA.intersection(setB))

def setAnd2():
    # No sort
    C = setA.intersection(setB)

def setAnd3():
    # Inline construct
    C = sorted(set(A).intersection(set(B)))

def setAnd4():
    # Inline construct and no sort
    C = set(A).intersection(set(B))
Noted:
  • setAnd ensures the ordering sames with output from doubleLoop or doubleLoop2 as set has no concept of ordering
  • setAnd2 demonstrates the meaning of no ordering
  • setAnd3 demonstrates how slow for casting them in a run time manner
  • setAnd4 is similar to `setAnd2

Result

Results of running each implementations for 1000000 times
# Double loop
# 1.4873290062
# Double loop2
# 1.70801305771
# Set AND
# 0.525309085846
# Set AND2
# 0.199652194977
# Set AND3
# 0.982024908066
# Set AND4
# 0.657478094101

Conclusion

  • Always use set to do intersections. No matter the results need to be sorted or not, it is always faster than the double loop way.
  • List comprehension is always faster than append. Check out more from my previous blog

Explanation

setAnd vs setAnd2

  • setAnd do an extra sort comparing to setAnd2
    • As set has no ordering, it need to be cast to list before which is an extra cost

setAnd vs setAnd3

A performance checks between using a pre-cast sets setA and setB and runtime-cast sets set(A) and set(B). Even though the runtime version is slower, it is still faster than double loop obviously.

setAnd3 vs setAnd4

Same as setAnd vs setAnd2

Gotcha

  • Set’s intersection works charm on unique elements inside involved lists. Benchmark here may not be applied to the multi-set one
  • Set has no concept of ordering. If ordering is a must for you, you need to do it yourself.

沒有留言:

張貼留言