Category Archives: Science and Technical

Fact/Opinion Classification using the Naive Bayes Classifier and the Iterative Hyperlink-Induced Topic Search Algorithm

In this project, I replicated the key results from the paper “A Novel Two-stage Framework for Extracting Opinionated Sentences from News Articles” by Rajkumar Pujari, Swara Desai, Niloy Ganguly, and Pawan Goyal.

The project entails using a combination of the Naive Bayes classifier and the Hyperlink-Induced Topic Search (HITS) algorithm to carry out fact/opinion classification of the sentences in a given corpus.
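For readers unfamiliar with HITS, here is a minimal sketch of the generic algorithm on a toy directed graph. It is not the paper's pipeline: how the paper builds a graph from sentences and combines it with Naive Bayes scores is not shown, and the function name `hits`, the edge list, and the iteration count are all assumptions made for illustration.

```python
import math


def hits(edges, num_nodes, iterations=50):
    """Generic HITS: returns (hub_scores, authority_scores) as lists."""
    hubs = [1.0] * num_nodes
    auths = [1.0] * num_nodes
    for _ in range(iterations):
        # Authority update: sum the hub scores of nodes that link to this node.
        new_auths = [0.0] * num_nodes
        for src, dst in edges:
            new_auths[dst] += hubs[src]
        norm = math.sqrt(sum(a * a for a in new_auths)) or 1.0
        auths = [a / norm for a in new_auths]

        # Hub update: sum the authority scores of nodes this node links to.
        new_hubs = [0.0] * num_nodes
        for src, dst in edges:
            new_hubs[src] += auths[dst]
        norm = math.sqrt(sum(h * h for h in new_hubs)) or 1.0
        hubs = [h / norm for h in new_hubs]
    return hubs, auths


# Toy graph (made up): node 2 is linked to by nodes 0, 1 and 3;
# node 0 additionally links to node 1.
edges = [(0, 2), (1, 2), (3, 2), (0, 1)]
hubs, auths = hits(edges, num_nodes=4)
print("hub scores:      ", [round(h, 3) for h in hubs])
print("authority scores:", [round(a, 3) for a in auths])
```

In this toy graph, the node with the most incoming links ends up with the highest authority score, and the node that points at the strongest authorities ends up with the highest hub score.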

(Project’s Github Link)

Choosing a Classifier from SKLearn

Gaussian Naive Bayes
A simple algorithm based on Bayes’ rule. The “naive” aspect of this classifier is that it assumes every pair of features is conditionally independent given the class (hardly ever true in practice).
In the naive Bayes formula, y is a given class variable and the x variables represent the features. P(y) is the probability of observing class y in the training set. P(x_vector | y) is the probability of observing the specific x_vector given the class y. Note that factoring P(x_vector | y) into the product of the individual conditionals P(x_i | y) is valid only because of our naive assumption. The final equation indicates how the classifier chooses the class to predict: it picks the y that maximizes P(y) times that product.
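Written out in standard notation, the rule described above takes the following form. This is a reconstruction from the description, assuming n features x_1, ..., x_n; the symbols n and \hat{y} (the predicted class) are introduced here.

```latex
\begin{align}
P(y \mid x_1, \dots, x_n)
  &= \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)} \\
P(x_1, \dots, x_n \mid y)
  &= \prod_{i=1}^{n} P(x_i \mid y)
  \quad \text{(naive assumption)} \\
\hat{y}
  &= \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
\end{align}
```

The denominator P(x_1, ..., x_n) does not depend on y, which is why it can be dropped when taking the argmax.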
Advantages-> (1) works well in practice despite the naive assumption, (2) can estimate the necessary parameters from a relatively small amount of training data, (3) “can be extremely fast compared to more sophisticated methods” (source)
Disadvantages-> (1) more sophisticated models that are well suited to the data and properly trained can outperform NB models, (2) “known to be a bad estimator, so the probability outputs are not taken too seriously” (source), (3) can be particularly ineffective compared to more sophisticated models when there are strong dependencies between features.
Example Real World Applications-> (1) document classification, (2) spam filtering
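To make the decision rule concrete, here is a minimal from-scratch sketch of Gaussian Naive Bayes in plain Python. It assumes each feature follows a per-class normal distribution; the helper names (`fit_gaussian_nb`, `log_gaussian`, `predict`), the toy data, and the small variance floor are all made up for illustration. scikit-learn's `GaussianNB` packages the same idea behind `fit` and `predict`, with more care than this sketch takes.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate P(y) plus a per-class mean and variance for every feature."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The 1e-9 floor keeps variances strictly positive; it is an
        # arbitrary choice for this sketch, not a tuned value.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log of the normal density, used as log P(x_i | y)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, row):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances)
        )
    return max(scores, key=scores.get)


# Toy data (made up): two well-separated classes with two features each.
X_train = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0],
           [4.0, 5.1], [4.2, 4.9], [3.9, 5.0]]
y_train = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X_train, y_train)
predictions = [predict(model, row) for row in [[1.1, 2.0], [4.1, 5.0]]]
print(predictions)
```

Working with log probabilities turns the product over features into a sum, which avoids numerical underflow when there are many features.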

Using Pandas for Evaluating Classifiers

I’m working on evaluating the performance of a few classifiers on a certain data set using Python. I’m recording the basics of the Python code I used here for future reference.

Pandas DataFrames are two-dimensional labeled data structures with columns that can hold different types. A DataFrame is a natural data structure for representing a data set, with each column representing a different feature.
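As a small illustration of that layout, the sketch below builds a DataFrame with one column per feature plus a label column, then splits it into the feature table and the label series that a classifier would typically consume. The column names and values are invented placeholders, not the data set from this post.

```python
import pandas as pd

# Made-up data set: two feature columns and one class-label column.
data = pd.DataFrame(
    {
        "feature_a": [1.0, 1.2, 4.0, 4.2],
        "feature_b": [2.1, 1.9, 5.1, 4.9],
        "label": [0, 0, 1, 1],
    }
)

# Select columns by name: a list of names gives a DataFrame,
# a single name gives a Series.
X = data[["feature_a", "feature_b"]]
y = data["label"]

# Per-class mean of every feature, indexed by label.
class_means = data.groupby("label").mean()

print(data.shape)   # (number of rows, number of columns)
print(data.dtypes)  # one dtype per column
print(class_means)
```

Keeping the features and the label in one DataFrame makes it easy to inspect the data by class before handing `X` and `y` to a classifier.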
