2014/10/31

Tricky parts inside the nltk implementation

In the previous assignment, we tried out uni-grams and bi-grams powered by nltk.
Of course, I could finish the assignment. However, the implementations of the methods score_ngram (under the *CollocationFinder classes) and freq (under FreqDist) are quite tricky. Calling them unwisely costs O(n^2). Trust me, this is not a fun experience :D
  • Problems
Let me show you my algorithm before explaining where the O(n^2) comes from.
# Calculate uni-gram scores
# uc is an instance of class FreqDist
for gram in uc.items():
    unigram_scores.append({
        'words': gram[0],
        'score': uc.freq(gram[0])
    })
Well, the code above is easy to understand and seems flawless for computing the probability of a uni-gram. However, its cost is O(n^2).
  • Why?
After digging into the source code of nltk, I found that the function below is the root cause of the O(n^2).
def N(self):
    """
    Return the total number of sample outcomes that have been
    recorded by this FreqDist. For the number of unique
    sample values (or bins) with counts greater than zero, use
    ``FreqDist.B()``.

    :rtype: int
    """
    return sum(self.values())
My code calls freq() once for every distinct uni-gram, so the loop makes O(n) calls. Each freq() call in turn calls N(), which sums over all the counts and costs O(n). As a result, the total cost is O(n * n) = O(n^2).
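To make the multiplication concrete, here is a minimal sketch that counts the work done. CountingDist is a hypothetical stand-in built on collections.Counter, not nltk's FreqDist; it only mimics the N() and freq() behaviour quoted above.

```python
from collections import Counter


class CountingDist(Counter):
    """Hypothetical stand-in for FreqDist that tallies the work done in N()."""

    visited = 0  # total number of counts summed across all N() calls

    def N(self):
        # Every call walks over all stored counts, just like sum(self.values()).
        CountingDist.visited += len(self)
        return sum(self.values())

    def freq(self, sample):
        # Mimics the pattern above: each freq() call triggers a full N().
        return self[sample] / self.N()


dist = CountingDist(str(i) for i in range(100))  # 100 distinct samples
scores = [dist.freq(word) for word in dist]      # 100 freq() calls
print(CountingDist.visited)  # 100 calls x 100 counts each = 10000
```

With 100 distinct samples, the loop ends up touching 10,000 counts, which is the n * n growth described above.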
  • How to solve?
Read the code below.
+N = uc_freq.N()

 # Calculate the probability for unigram
 # uc_freq is an instance of FreqDist
-for gram in uc.items():
+for gram in uc_freq.items():
+    val = float(uc_freq[gram[0]]) / N
     unigram_scores.append({
         'words': gram[0],
-        'score': uc.freq(gram[0])
+        'score': round(val, 2)
     })
To summarise, we need to avoid calling N() over and over inside the for loop.
Therefore, I calculate N once outside the for loop and compute the probability myself instead of calling freq().
Although N() still costs O(n) and the for loop still costs O(n), the two costs are added rather than multiplied. As a result, the total cost is O(n) + O(n) = O(n).
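Here is the same idea as a self-contained sketch. It uses collections.Counter as a stand-in for FreqDist (an assumption, so that it runs without nltk), and unigram_scores is a function name I made up for illustration.

```python
from collections import Counter


def unigram_scores(tokens):
    """Return relative frequencies, computing the total only once."""
    counts = Counter(tokens)      # stand-in for nltk's FreqDist
    total = sum(counts.values())  # O(n), done once outside the loop
    return [
        {'words': word, 'score': count / total}  # O(1) per uni-gram
        for word, count in counts.items()
    ]


scores = unigram_scores("a b a c a b".split())
print(scores[0])  # {'words': 'a', 'score': 0.5}
```

The total is summed once and each uni-gram then needs only a single division, so the whole pass stays linear.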

2014/10/17

Analyzing the trend of singing happy birthday in Mong Kok

Claim

The Umbrella Revolution is a memorable breakthrough among Hong Kong’s protests. Although I am focusing on (or trying to focus on) analyzing phenomena from this protest that do not involve the democracy-related issues, I love and support the Yellow Ribbon side.

Song of Happy birthday

This is a funny way for protesters to repel harassment from their opponents, and it has become a trend. Protesters sing the happy birthday song in reply to any insane acts from opponents. Here is an example showing how this song works in this protest. Although it is ridiculous, it keeps both sides at peace.

How did this trend diffuse?

Recap what we learnt from the lectures:
Processes of social diffusion of new practices:
1) New ideas and social practices are introduced by notable example (e.g. by very heavy advertising)
2) Initially, the rate of adoption is slow because new ways are unfamiliar, customs resist change, and results are uncertain
3) As early adopters convey more information about how to apply the new practices and their potential benefits, the innovation is adopted at an accelerating rate
4) After a period in which the new practices spread rapidly, the rate of diffusion slows down
5) The use of the innovation then either stabilizes or declines, depending upon its relative functional value
According to Facebook, the trend originated from here (around 1:10), in a video posted on Oct 4. As of 2014/10/17, it has 107,271 views and around 1,045 shares. These facts match point 1.
The graphs below show the trend for keywords like “生日” (birthday) and “示威” (protest) in Google Trends.
You can see that the trends for the keyword “生日” on Oct 4 and Oct 5 are more or less the same (points 2 and 3). Suddenly, there is a huge climb on Oct 6 (point 4). The trend decreases after Oct 6 (point 5).

Summary

After comparing the theory from the lecture notes with the facts, I understand the steps involved in social diffusion better.

2014/10/14

Shortcuts on the Mac OS Terminal

To save my life, I would like to document the shortcuts for the default Mac OS terminal.
  • Switch between terminal tabs
Command + Arrow left or right
  • Page Up / Down
fn + Arrow Up or Down
  • Jump to the Top or Bottom of the terminal
fn + Arrow Left or right
  • Move one line up or down
Command + Arrow Up or Arrow Down
  • Jump to the head of the input line
Ctrl + A
  • Jump to the end of the input line
Ctrl + E
  • Remove all the characters before the cursor
Ctrl + U
  • Remove all the characters after the cursor
Ctrl + K
  • Remove the word before the cursor
Ctrl + W
  • Navigate a command line word by word
Ctrl + Arrow Left or Right

2014/10/11

How to open the same PDF twice

To open the same PDF twice:
Open the PDF for the first time -> Go to the menu bar -> Click Window -> Click New Window
To show those two PDF windows together on the same screen:
Menu bar -> Window -> Tile -> Horizontally or Vertically

2014/10/03

Notes about Sentiment polarity

Recently, we have been going through the materials about sentiment polarity. This is a really interesting topic. Here are a few notes about it.
  • Dimensions
We need aspects to quantify a word’s characteristics. Three aspects are used, positive, objective and negative, each expressed as a number.
If we claim a word is positive, we believe it will be used to say something good. On the other hand, if we claim a word is negative, we believe it will be used to say something bad. However, if we claim a word is objective, we believe that it tells us nothing about any subjective attitude.
  • Graphical representation of Sentiment Polarity
There is a website which shows you how to represent a word in a 3-dimensional space.
Note that they use a triangular graph instead of a Cartesian coordinate system.

If you are not familiar with this type of presentation, here are my interpretations.
* The more objective a word is, the less subjective value it has
A fact is a fact. There is no good or bad in a fact, since it is objective. Actually, the triangular shape is telling you this behavior.
* The subjective value is either positive or negative, or neither of them (i.e. both of them are zero)
Although the objectiveness controls the level of the subjective value, the subjective value must be either positive or negative, or both of them are zero.
* What's next
After you categorize the words, you get a bunch of numbers. Then, you can do further analysis by using k-clustering, term weighting, etc.
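The relationship between the three scores can be sketched in a few lines. This assumes the positive, negative and objective scores of a word sum to 1, as in resources such as SentiWordNet. The function names and the placement of the triangle corners are hypothetical choices of mine, not taken from the website.

```python
def objectivity(positive, negative):
    """Objective score, assuming positive + negative + objective == 1."""
    if positive < 0 or negative < 0 or positive + negative > 1:
        raise ValueError("scores must be non-negative and sum to at most 1")
    return 1.0 - (positive + negative)


def triangle_point(positive, negative):
    """Map the three scores to a 2D point inside a triangle.

    Hypothetical layout: negative corner at (0, 0), positive corner at
    (1, 0), objective corner at (0.5, 1).
    """
    obj = objectivity(positive, negative)
    x = positive * 1.0 + negative * 0.0 + obj * 0.5
    y = obj * 1.0
    return (x, y)


print(triangle_point(0.0, 0.0))  # fully objective word: (0.5, 1.0)
print(triangle_point(1.0, 0.0))  # fully positive word: (1.0, 0.0)
```

A fully objective word sits at the objective corner, and the more subjective value a word carries, the further it moves towards the positive or negative corner.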
Update Note 1 (2014/10/11): Beautify the blog’s contents