Kaggle Spooky Author Identification - 0.29, Highest Public LB score for a published kernel

My published kernel in the Kaggle contest Spooky Author Identification, with the highest public LB score (0.29, top 50) among published kernels in the contest. It uses simple engineered features such as punctuation counts, stop words, and GloVe sentence vectors. In addition, it creates stack features from TF-IDF and count vectors over words and characters: Multinomial Naive Bayes (MNB) is applied to each of the four combinations (TF-IDF + words + MNB, TF-IDF + chars + MNB, count + words + MNB, count + chars + MNB). Convolutional networks on Keras text-to-sequence input, neural networks on GloVe sentence vectors, and FastText are also used as stack features. XGBoost is the final model, taking the simple and stack features as input.
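The stacking idea above can be sketched roughly as follows. The toy corpus, labels, and hyperparameters are placeholders, out-of-fold MNB probabilities serve as stack features, and sklearn's GradientBoostingClassifier stands in for XGBoost to keep the sketch self-contained:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import GradientBoostingClassifier

# Toy corpus standing in for the Spooky Author training data (hypothetical).
texts = ["the raven croaked at midnight", "a dark and stormy night",
         "the old house creaked", "shadows moved along the wall",
         "ghostly whispers filled the hall", "the candle flickered and died"] * 5
labels = np.array([0, 1, 0, 1, 0, 1] * 5)

# One MNB stack feature per (vectorizer, analyzer) combination,
# generated out-of-fold so the final model never sees leaked labels.
stack_features = []
for Vec, analyzer in [(TfidfVectorizer, "word"), (TfidfVectorizer, "char"),
                      (CountVectorizer, "word"), (CountVectorizer, "char")]:
    ngrams = (1, 3) if analyzer == "char" else (1, 1)
    X = Vec(analyzer=analyzer, ngram_range=ngrams).fit_transform(texts)
    oof = cross_val_predict(MultinomialNB(), X, labels, cv=5,
                            method="predict_proba")
    stack_features.append(oof)

# Simple hand-crafted feature: punctuation count per document.
punct = np.array([[sum(c in ".,;!?" for c in t)] for t in texts])

# Final model trained on the concatenated simple + stack features.
X_stack = np.hstack(stack_features + [punct])
final = GradientBoostingClassifier(n_estimators=20).fit(X_stack, labels)
print(final.score(X_stack, labels))
```

In the real kernel the stack features would also include the ConvNet, NN, and FastText predictions, each produced out-of-fold in the same way.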

View Code

Quora Indian Answer Classifier

Hobby Project - an algorithm that classifies Quora answers as Indian or non-Indian based on writing style. The Indian-answer dataset is drawn from the novels of Chetan Bhagat, viz. Five Point Someone, The 3 Mistakes of My Life, and One Night @ the Call Center. The non-Indian-answer dataset is a collection of TOEFL and SAT essays.
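A style classifier of this kind could be sketched as below. The sample sentences and labels are hypothetical stand-ins for the novel excerpts and essays; character n-gram TF-IDF with logistic regression is one common choice for style-based classification, not necessarily the one used in the project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy samples; the real datasets are novel excerpts and essays.
indian = ["Yaar, the exam was tough but we managed somehow.",
          "He ordered chai and we talked about the placement."]
non_indian = ["The essay argues that standardized testing is flawed.",
              "In conclusion, the author presents a balanced view."]

texts = indian + non_indian
labels = [1, 1, 0, 0]  # 1 = Indian style, 0 = non-Indian style

# Character n-grams capture stylistic cues (word choice, punctuation habits)
# better than whole-word features for authorship-style tasks.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(),
)
clf.fit(texts, labels)
print(clf.predict(["Arre yaar, the chai at the canteen was great."]))
```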

View Code

Quora Chrome Extension

Hobby Project - a Quora Chrome extension that classifies Quora answers into genres such as information, stories, and world affairs for an improved reading experience.

View Code

Image Denoising using Edge Patch Based Dictionaries and BIRCH unsupervised clustering algorithm

A technique to speed up a nonlocal means (NLM) filter is implemented. In the original NLM filter, most of the computational time is spent computing distances to all patches in the search window. Here, a dictionary is built in which patches with similar photometric structures are clustered together. The dictionary is built only once, from high-resolution images of different scenes. Since the dictionary is well organized in terms of indexing its entries, it can be used to find similar patches very quickly for efficient NLM denoising. This yields a substantial reduction in computational cost compared with the original NLM method, especially when the search window is large, without much affecting the PSNR. Second, by building a dictionary of edge patches rather than intensity patches, the dictionary size can be reduced, further improving computational speed and memory requirements. The implemented method preclassifies similar patches using the same distance measure as the NLM method, and is shown to outperform other prefiltering-based fast NLM algorithms both computationally and qualitatively.
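The core idea of restricting the NLM average to a precomputed cluster can be sketched as follows. The synthetic image, patch size, cluster count, and filtering parameter `h` are all illustrative assumptions; the real pipeline builds the dictionary from separate high-resolution images and uses edge patches rather than raw intensity patches:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Synthetic 64x64 piecewise-constant image plus noise (stand-in for a photo).
img = np.clip(np.kron(rng.random((8, 8)), np.ones((8, 8)))
              + 0.1 * rng.standard_normal((64, 64)), 0, 1)

P = 5  # patch size (assumed)
# Extract all overlapping PxP patches, flattened to vectors.
coords = [(i, j) for i in range(64 - P + 1) for j in range(64 - P + 1)]
patches = np.array([img[i:i + P, j:j + P].ravel() for i, j in coords])

# Build the dictionary once: BIRCH groups photometrically similar patches.
dictionary = Birch(n_clusters=30).fit(patches)
cluster_of = dictionary.predict(patches)

def denoise_patch(idx, h=0.1):
    """NLM weighted average restricted to the query patch's BIRCH cluster,
    instead of scanning every patch in a large search window."""
    members = patches[cluster_of == cluster_of[idx]]
    d2 = ((members - patches[idx]) ** 2).mean(axis=1)  # same metric as NLM
    w = np.exp(-d2 / h ** 2)
    return (w[:, None] * members).sum(axis=0) / w.sum()

restored = denoise_patch(0)
print(restored.shape)
```

Because each query only touches one cluster rather than the full search window, the per-patch cost drops roughly in proportion to the cluster size.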

View Code

Improving the running time of network throughput estimation using segment trees

Linear networks with highly volatile bandwidths are networks whose links' capacity to carry data changes frequently due to many physical factors. Bandwidth volatility can also arise from bandwidth throttling by the Internet Service Provider or from network congestion. Consequently, it becomes extremely difficult to estimate the throughput of a particular flow. In a channel consisting of links of varying bandwidths, the link with the minimum bandwidth determines the throughput of the channel; estimating the throughput when link bandwidths are known but constantly changing with time is therefore the key challenge. This project simulates the currently used naive methods and a more efficient segment-tree-based method for throughput estimation.
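The bottleneck-link observation above maps directly onto a range-minimum query with point updates, which a segment tree answers in O(log n) instead of the naive O(n) rescan. A minimal sketch (the link bandwidths are hypothetical):

```python
class MinSegmentTree:
    """Range-minimum segment tree with point updates, O(log n) each."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [float("inf")] * (2 * self.n)
        self.tree[self.n:] = values
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def update(self, i, value):
        """Link i's bandwidth changed to `value`."""
        i += self.n
        self.tree[i] = value
        while i > 1:
            i //= 2
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        """Minimum bandwidth over links [lo, hi) = channel throughput."""
        res = float("inf")
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                res = min(res, self.tree[lo]); lo += 1
            if hi & 1:
                hi -= 1; res = min(res, self.tree[hi])
            lo //= 2
            hi //= 2
        return res

# Hypothetical link bandwidths along a linear network (Mbps).
bw = [100, 40, 75, 60, 90]
st = MinSegmentTree(bw)
print(st.query(0, 5))  # → 40, throughput of the whole channel
st.update(1, 80)       # link 1's bandwidth rose
print(st.query(0, 5))  # → 60, new bottleneck is link 3
```

With n links and m bandwidth changes, the naive rescan costs O(nm) overall while the segment tree costs O((n + m) log n), which is where the simulated speed-up comes from.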

View Code