Of cause, I can finish the assignment. However, Implementations for methods
score_ngram
and freq
under class *CollocationFinder
are quite tricky. Cost for calling them unwisely is O(n^2). Trust me. This is not a fun experience :D- Problems
Let me show you my algorithm before pointing out the reasons behind O(n^2)
# Cal uni-gram
# uc is an instance of class FreqDist
for gram in uc.items():
unigram_scores.append({
'words': gram[0],
'score': uc.freq(gram[0])
})
Well. Codes above are easy to understand and seems to flawless for computing the probability of an uni-gram. However, their cost is O(n^2).
- Why?
After digging in to the source code of
nltk
, below function is the root cause for O(n^2).def N(self):
"""
Return the total number of sample outcomes that have been
recorded by this FreqDist. For the number of unique
sample values (or bins) with counts greater than zero, use
``FreqDist.B()``.
:rtype: int
"""
return sum(self.values())
My codes call
freq()
for every possible uni-gram. Cost for this operation is O(n). Then,N()
is called by every freq()
. Cost for N()
is O(n). As a result, cost is O(n * n) = O(n^2).- How to solve?
Read codes below.
+N = uc_freq.N()
# Calculate the probability for unigram
# uc_freq is an instance of FreqDist
-for gram in uc.items():
+for gram in uc_freq.items():
+ val = float(uc_freq[gram[0]]) / N
unigram_scores.append({
'words': gram[0],
- 'score': uc.freq(gram[0])
+ 'score': round(val, 2)
})
To summarise, we need to eliminate the side effect for
N()
during the for loop operations.Therefore, I calculate the N outside the for loop operation and compute the probability myself instead of calling
freq()
.Although cost for
N()
is O(n) and the cost of for loop operation is still O(n), they are not multiplied. As a result, they are still running with cost n O(n) = O(n).