In this project, I replicated the key results from the paper “A Novel Two-stage Framework for Extracting Opinionated Sentences from News Articles” by Rajkumar Pujari, Swara Desai, Niloy Ganguly, and Pawan Goyal.
The framework combines a Naive Bayes classifier with the Hyperlink-Induced Topic Search (HITS) algorithm to classify the sentences of a given corpus as facts or opinions.
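Since HITS is central to the second stage, here is a minimal sketch of its power iteration on a toy directed graph. The adjacency list below is purely illustrative; it is not the sentence graph the paper builds.

```python
# Hedged sketch of the HITS power iteration (toy graph, not the paper's).
def hits(adj, n_iter=50):
    """adj: dict mapping each node to the list of nodes it links to."""
    nodes = list(adj)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(n_iter):
        # Authority score: sum of hub scores of nodes pointing at n.
        auth = {n: sum(hub[m] for m in nodes if n in adj[m]) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of authority scores of nodes n points at.
        hub = {n: sum(auth[m] for m in adj[n]) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Toy graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
hub, auth = hits(graph)
```

In this toy graph, `c` receives the most links and so ends up with the highest authority score, while `a` links to the strongest authorities and gets the highest hub score.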
(Project’s Github Link)
Gaussian Naive Bayes
Gaussian Naive Bayes is a simple classification algorithm based on Bayes’ rule. The “naive” aspect of this classifier is that it assumes independence between every pair of features (hardly ever true in practice).
In the adjacent formula, y is the class variable and x_1, …, x_n are the features. P(y) is the probability of observing class y in the training set, and P(x_1, …, x_n | y) is the probability of observing that feature vector given class y. Note that factoring this likelihood into a product of the individual conditionals P(x_i | y) is only possible because of the naive independence assumption. The final equation shows how the classifier picks the class to predict.
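The formula referenced above appeared as an image in the original post; reconstructed in LaTeX, the standard Naive Bayes derivation is:

```latex
P(y \mid x_1, \ldots, x_n)
  = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}
% Under the naive independence assumption, the likelihood factorizes:
P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
% The classifier predicts the class maximizing the posterior:
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```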
Advantages: (1) works well in practice despite the naive assumption, (2) needs relatively little training data to estimate its parameters, (3) “can be extremely fast compared to more sophisticated methods” (source).
Disadvantages: (1) well-trained, more sophisticated models that are better suited to the data can outperform Naive Bayes, (2) it is “known to be a bad estimator, so the probability outputs are not taken too seriously” (source), (3) it can be particularly ineffective when there are significant dependencies between pairs of features.
Example real-world applications: (1) document classification, (2) spam filtering.
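As a concrete illustration, here is a minimal sketch of fitting scikit-learn’s GaussianNB on the Iris dataset. The dataset and split parameters are purely illustrative; this is not the data or code from the project above.

```python
# Minimal Gaussian Naive Bayes sketch with scikit-learn (toy data, not the
# project's corpus; the 70/30 split and random_state are arbitrary choices).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = GaussianNB()
clf.fit(X_train, y_train)           # estimates per-class feature means/variances
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

Training only estimates a mean and variance per feature per class, which is why the “can estimate parameters with little data” advantage holds.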
I’m evaluating the performance of a few classifiers on a data set using Python, and I’m recording the basics of the code I used here for future reference.
Pandas DataFrames are two-dimensional labeled data structures with named columns. It’s the perfect data structure to represent a data set, with each column representing a different feature.
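A quick sketch of the idea; the column names and values below are made up for illustration and are not from the actual data set.

```python
# Hypothetical dataset as a DataFrame: one column per feature, one row per
# example (toy values, not the real data).
import pandas as pd

df = pd.DataFrame({
    "sentence_length": [12, 7, 21],
    "adjective_count": [2, 0, 5],
    "label": ["opinion", "fact", "opinion"],
})

print(df.shape)                      # (3, 3): three rows, three columns
print(df["label"].value_counts())    # class distribution of the label column
```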
I tinkered with the Galago toolkit from the Lemur Project as part of Dr. Chris Clifton’s Web Information Search and Management course at Purdue University. It was great fun to develop an intuition for search engine indices.
Setting up the Galago Toolkit.
Corpora used: (1) Wiki Small, (2) CACM