Question

I will be putting the max bounty on this as I am struggling to learn these concepts! I am trying to use some ranking data in a logistic regression. I want to use machine learning to make a simple classifier as to whether a webpage is "good" or not. It's just a learning exercise so I don't expect great results; just hoping to learn the "process" and coding techniques.

I have put my data in a .csv as follows :

URL WebsiteText AlexaRank GooglePageRank

In my Test CSV we have :

URL WebsiteText AlexaRank GooglePageRank Label

Label is a binary classification indicating "good" with 1 or "bad" with 0.

I currently have my LR running using only the website text; which I run a TF-IDF on.

I have a two questions which I need help with. I'll be putting a max bounty on this question and awarding it to the best answer as this is something I'd like some good help with so I, and others, may learn.

  • How can I normalize my ranking data for AlexaRank? I have a set of 10,000 webpages, for which I have the Alexa rank of all of them; however they aren't ranked 1-10,000. They are ranked out of the entire Internet, so while http://www.google.com may be ranked #1, http://www.notasite.com may be ranked #83904803289480. How do I normalize this in Scikit learn in order to get the best possible results from my data?
  • I am running my Logistic Regression in this way; I am nearly sure I have done this incorrectly. I am trying to do the TF-IDF on the website text, then add the two other relevant columns and fit the Logistic Regression. I'd appreciate if someone could quickly verify that I am taking in the three columns I want to use in my LR correctly. Any and all feedback on how I can improve myself would also be appreciated here.

    loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')
    
    print "loading data.."
    traindata = list(np.array(p.read_table('train.tsv'))[:,2])#Reading WebsiteText column for TF-IDF.
    testdata = list(np.array(p.read_table('test.tsv'))[:,2])
    y = np.array(p.read_table('train.tsv'))[:,-1] #reading label
    
    tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode', analyzer='word',
    
    token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1,sublinear_tf=1)
    
    rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1, fit_intercept=True,    intercept_scaling=1.0, class_weight=None, random_state=None)
    
    X_all = traindata + testdata
    lentrain = len(traindata)
    
    print "fitting pipeline"
    tfv.fit(X_all)
    print "transforming data"
    X_all = tfv.transform(X_all)
    X = X_all[:lentrain]
    X_test = X_all[lentrain:]
    
    print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))
    
    #Add Two Integer Columns
    AlexaAndGoogleTrainData = list(np.array(p.read_table('train.tsv'))[2:,3])#Not sure if I am doing this correctly. Expecting it to contain AlexaRank and GooglePageRank columns.
    AlexaAndGoogleTestData = list(np.array(p.read_table('test.tsv'))[2:,3])
    AllAlexaAndGoogleInfo = AlexaAndGoogleTestData + AlexaAndGoogleTrainData
    
    #Add two columns to X.
    X = np.append(X, AllAlexaAndGoogleInfo, 1) #Think I have done this incorrectly.
    
    print "training on full data"
    rd.fit(X,y)
    pred = rd.predict_proba(X_test)[:,1]
    testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
    pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
    pred_df.to_csv('benchmark.csv')
        print "submission file created.."`
    

Thank you very much for all feedback - please post if you need any further information!

Was it helpful?

Solution

I guess sklearn.preprocessing.StandardScaler would be the first thing you want to try. StandardScaler transforms all of your features into Mean-0-Std-1 features.

  • This definitely gets rid of your first problem. AlexaRank will be guaranteed to be spread around 0 and bounded. (Yes, even massive AlexaRank values like 83904803289480 are transformed to small floating point numbers). Of course, the results will not be integers between 1 and 10000 but they will maintain same order as the original ranks. And in this case, keeping the rank bounded and normalized will help solve your second problem like follows.
  • In order to understand why normalization would help in LR, let's revisit the logit formulation of LR. enter image description here
    In your case, X1, X2, X3 are three TF-IDF features and X4, X5 are Alexa/Google rank related features. Now, the linear form of equation suggest that the coefficients represent the change in logit of y with one unit change in a variable. Think what happens when your X4 is kept fixed at a massive rank value, say 83904803289480. In that case, the Alexa Rank variable dominates your LR fit and a small change in TF-IDF value has almost no effect on the LR fit. Now one might think that the coefficient should be able to adjust to small/large values to account for differences between these features. Not in this case --- It's not only the magnitude of variables that matter but also their range. Alexa Rank definitely has a large range and should definitely dominate your LR fit in this case. Therefore, I guess normalizing all variables using StandardScaler to adjust their range will improve the fit.

Here is how you can scale the X matrix.

sc = proprocessing.StandardScaler().fit(X)
X = sc.transform(X)

Don't forget to use same scaler to transform X_test.

X_test = sc.transform(X_test)

Now you can use the fitting procedure etc.

rd.fit(X, y)
re.predict_proba(X_test)

Check this out for more on sklearn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html

Edit: Parsing and column merging part can be easily done using pandas, i.e., there is no need to convert the matrices into list and then append them. Moreover, pandas dataframes can be directly indexed by their column names.

AlexaAndGoogleTrainData = p.read_table('train.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AlexaAndGoogleTestData = p.read_table('test.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData.append(AlexaAndGoogleTrainData)

Note that we are passing header=0 argument to read_table to maintain original header names from tsv file. And also note how we can index using entire set of columns. Finally, you can stack this new matrix with X using numpy.hstack.

X = np.hstack((X, AllAlexaAndGoogleInfo))

hstack horizontally combined two multi-dimensional array-like structures provided their lengths are same.

OTHER TIPS

Regarding normalizing the numeric ranks either scikit StandardScaler or a logarithmic transform (or both) should work well enough.

For building up a working pipeline, I find my sanity greatly benefits from using the Pandas package and the sklearn.pipeline utilities. Here is a simple script that should do what you need.

First a couple of utlitlty classes I always seem to need. It would be nice to have something like these in sklearn.pipeline or sklearn.utilities.

from sklearn import base
class Columns(base.TransformerMixin, base.BaseEstimator):
    def __init__(self, columns):
        super(Columns, self).__init__()
        self.columns_ = columns
    def fit(self, *args, **kwargs):
        return self
    def transform(self, X, *args, **kwargs):
        return X[self.columns_]

class Text(base.TransformerMixin, base.BaseEstimator):
    def fit(self, *args, **kwargs):
        return self
    def transform(self, X, *args, **kwargs):
        return (X.apply("\t".join, axis=1, raw=False))

Now set up the pipeline. I used the SGDClassifier implementation of logistic regression since it tends to be more eficcient for high dimensional data like text classification also I usually find that hinge loss usually gives better results than logistic regression anyway.

from sklearn import linear_model as lin
from sklearn import  metrics
from sklearn.feature_extraction import text as txt
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing as prep
import numpy as np
from pandas.io import parsers
import pandas as pd

pipe = Pipeline([
    ('feat', FeatureUnion([
        ('txt', Pipeline([
            ('txtcols', Columns(["WebsiteText"])),
            ('totxt', Text()),
            ('vect', txt.TfidfVectorizer()),
            ])),
        ('num', Pipeline([
            ('numcols', Columns(["AlexaRank", "GooglePageRank"])),
            ('scale', prep.StandardScaler()),
            ])),
        ])),
    ('clf', lin.SGDClassifier(loss="log")),
    ])

Next train the model:

train=parsers.read_csv("train.csv")
pipe.fit(train, train.Label)

Finally evaluate on test data:

test=parsers.read_csv("test.csv")
tstlbl=np.array(test.Label)

print pipe.score(test, tstlbl)

pred = pipe.predict(test)
print metrics.confusion_matrix(tstlbl, pred)
print metrics.classification_report(tstlbl, pred)
print metrics.f1_score(tstlbl, pred)

prob = pipe.decision_function(test)
print metrics.roc_auc_score(tstlbl, prob)
print metrics.average_precision_score(tstlbl, prob)

You will probably not get very good results with everything using default setting like this, but it should give you a working baseline to work from. I can suggest some parameter settings that usually work for me if you like.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top