Stuff Corner: Tricky inside the nltk implementation

In the previous assignment, we have been tried on using uni-gram and bi-gram powered by nltk.

Of cause, I can finish the assignment. However, Implementations for methods score_ngram and freq under class *CollocationFinder are quite tricky. Cost for calling them unwisely is O(n^2). Trust me. This is not a fun experience :D

Problems

Let me show you my algorithm before pointing out the reasons behind O(n^2)

# Cal uni-gram
# uc is an instance of class FreqDist
for gram in uc.items():
   unigram_scores.append({
        'words': gram[0],
        'score': uc.freq(gram[0])

    })

Well. Codes above are easy to understand and seems to flawless for computing the probability of an uni-gram. However, their cost is O(n^2).

Why?

After digging in to the source code of nltk, below function is the root cause for O(n^2).

def N(self):
        """
        Return the total number of sample outcomes that have been
        recorded by this FreqDist.  For the number of unique
        sample values (or bins) with counts greater than zero, use
        ``FreqDist.B()``.

        :rtype: int
        """
        return sum(self.values())

My codes call freq() for every possible uni-gram. Cost for this operation is O(n). Then,
N() is called by every freq(). Cost for N() is O(n). As a result, cost is O(n * n) = O(n^2).

How to solve?

Read codes below.

+N = uc_freq.N()

 # Calculate the probability for unigram
# uc_freq is an instance of FreqDist
-for gram in uc.items():
+for gram in uc_freq.items():
+    val = float(uc_freq[gram[0]]) / N
     unigram_scores.append({
         'words': gram[0],
-        'score': uc.freq(gram[0])
+        'score': round(val, 2)
     })

To summarise, we need to eliminate the side effect for N() during the for loop operations.

Therefore, I calculate the N outside the for loop operation and compute the probability myself instead of calling freq().

Although cost for N() is O(n) and the cost of for loop operation is still O(n), they are not multiplied. As a result, they are still running with cost n O(n) = O(n).

Stuff Corner

2014/10/31

Tricky inside the nltk implementation

沒有留言:

張貼留言

My other blogs

About me