Feature Elimination Using SVM Weights

Specifically for SVMLight, but this feature elimination methodology can be used for any linear SVM.

Ori Cohen · Jun 30

Figure 1: a random example of accuracy based on the number of SVM features used.

While working on my M.Sc. thesis, circa 2005–2007, I had to calculate feature weights based on an SVM model.

This was before SKlearn, which started in 2007.

The idea was to iteratively remove redundant features, based on what the algorithm deemed the least influential, to see how much we could remove without sacrificing performance.

Today you can easily do this type of feature selection in SKlearn and other packages; however, if you insist on using SVMLight, you can use the code below.

Overall, the methodology is still valid today.
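For comparison, here is a minimal modern sketch using SKlearn's RFE (recursive feature elimination) with a linear SVM. The dataset and parameter values are illustrative assumptions, not from the original experiments, and RFE drops the features with the smallest weight magnitude, which is close in spirit to, though not identical with, the keep-the-extremes scheme described later in this post.

```python
# Illustrative sketch: recursive feature elimination with a linear SVM
# in scikit-learn (the dataset and parameters are made up for the example).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=256, random_state=0)

# step=0.5 removes half of the remaining features at each iteration,
# roughly mirroring the K = 50% schedule described later in the post.
selector = RFE(LinearSVC(max_iter=10000), n_features_to_select=16, step=0.5)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the surviving features
print(selector.ranking_)   # 1 = kept; larger = eliminated earlier
```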

At that time we had two options: SVMLight or LibSVM. I chose SVMLight.

Thorsten Joachims published a Perl script to calculate SVM weights, but since I was using Python, I rewrote his script in Python, and he graciously put a download link on his website.

You can find the original Perl script here: http://www.cs.cornell.edu/people/tj/svm_light/svm_light_faq.html

And the Python script here: http://www.cs.cornell.edu/people/tj/svm_light/svm2weight.py.txt

Using this script will get you all the features' weights.
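Running `python svm2weights.py svm_model` prints one `featureID : weight` line per feature, sorted by feature ID (the full script appears at the end of this post).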

This is incredibly useful later on. As you can see in the following pseudo-code, you can systematically eliminate features (a sketch of this loop follows the figure discussion below):

1. K = 50%
2. After training on all current features, select the K features with the highest SVM weights and the K with the lowest (most negative) SVM weights.
3. Retrain.
4. Measure accuracy on an unseen dataset.
5. Iterate: go to step 2. Stop when you have no more features to select from.
6. Pick the optimal 'elbow', as seen in Figure 1.

In this example, the point is where 128 features allow you to get the same accuracy as all the features.

You will notice that you can get a higher prediction accuracy with only a subset of your features.

This is the essence of feature selection.
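To make the procedure concrete, here is a minimal sketch of the elimination loop, using SKlearn's LinearSVC as a stand-in for SVMLight. The synthetic data, the split, and the exact K = 50% schedule (25% from each end of the weight ranking) are my illustrative assumptions, not the original thesis code.

```python
# Illustrative sketch of the elimination loop described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=256, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

kept = np.arange(X.shape[1])    # indices of the surviving features
history = []                    # (n_features, accuracy) per round
while True:
    clf = LinearSVC(max_iter=10000).fit(X_tr[:, kept], y_tr)        # steps 2-3: (re)train
    history.append((len(kept), clf.score(X_te[:, kept], y_te)))     # step 4: unseen data
    if len(kept) <= 2:          # step 5: stop when nothing is left to drop
        break
    w = clf.coef_.ravel()       # one SVM weight per surviving feature
    order = np.argsort(w)       # most negative ... most positive
    n = max(1, len(kept) // 4)  # K = 50% in total: 25% from each end
    kept = kept[np.r_[order[:n], order[-n:]]]  # keep the extremes, drop the middle

for n_feats, acc in history:    # plot these pairs to find the 'elbow' of Figure 1
    print(n_feats, acc)
```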

The full svm2weights.py script:

```python
# Compute the weight vector of a linear SVM based on the model file
# Original Perl Author: Thorsten Joachims (thorsten@joachims.org)
# Python Version: Dr. Ori Cohen (orioric@gmail.com)
# Call: python svm2weights.py svm_model

import sys
from operator import itemgetter

try:
    import psyco
    psyco.full()
except ImportError:
    print 'Psyco not installed, the program will just run slower'


def sortbyvalue(d, reverse=True):
    '''proposed in PEP 265, using itemgetter; sorts a dictionary by value'''
    return sorted(d.iteritems(), key=itemgetter(1), reverse=True)


def sortbykey(d, reverse=True):
    '''proposed in PEP 265, using itemgetter; sorts a dictionary by key'''
    return sorted(d.iteritems(), key=itemgetter(0), reverse=False)


def get_file():
    """
    Tries to extract a filename from the command line. If none is present,
    it assumes the file to be svm_model (the default SVMLight output).
    If the file exists, it returns it; otherwise it prints an error
    message and ends execution.
    """
    # Get the name of the model file and open it
    if len(sys.argv) < 2:
        # assume file to be svm_model (default SVMLight output)
        print "Assuming file as svm_model"
        filename = 'svm_model'
        # filename = sys.stdin.readline().strip()
    else:
        filename = sys.argv[1]
    try:
        f = open(filename, "r")
    except IOError:
        print "Error: The file '%s' was not found on this system." % filename
        sys.exit(0)
    return f


if __name__ == "__main__":
    f = get_file()
    i = 0
    lines = f.readlines()
    printOutput = True
    w = {}
    for line in lines:
        if i > 10:
            # Support-vector lines: "alpha feat:val feat:val ... # comment"
            features = line[:line.find('#') - 1]
            comments = line[line.find('#'):]
            alpha = features[:features.find(' ')]
            feat = features[features.find(' ') + 1:]
            for p in feat.split(' '):  # Changed the code here.
                a, v = p.split(':')
                if not (int(a) in w):
                    w[int(a)] = 0
            for p in feat.split(' '):
                a, v = p.split(':')
                w[int(a)] += float(alpha) * float(v)
        elif i == 1:
            # Line 2 of the model file holds the kernel type; 0 = linear
            if line.find('0') == -1:
                print 'Not a linear kernel!'
                printOutput = False
                break
        elif i == 10:
            # Line 11 should hold the threshold b; otherwise bail out
            if line.find('threshold b') == -1:
                print "Parsing error!"
                printOutput = False
                break
        i += 1
    f.close()

    # If you need to sort the features by value and not by feature ID,
    # use this line instead:
    # ws = sortbyvalue(w)
    ws = sortbykey(w)
    if printOutput:
        for (i, j) in ws:
            print i, ':', j
```
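The script above is Python 2 (note the print statements, dict.iteritems(), and the long-defunct psyco import). If you want the same weight computation under Python 3, here is a minimal sketch; it assumes the SVMLight model layout parsed above, i.e. 11 header lines followed by one 'alpha feat:val feat:val ... # comment' line per support vector.

```python
# Python 3 sketch of the same computation: accumulate
# w[feature] += alpha * value over all support-vector lines.
import sys

def svm_model_weights(path):
    w = {}
    with open(path) as f:
        lines = f.readlines()
    for line in lines[11:]:                       # skip the 11 header lines
        features = line.split('#')[0].strip()     # drop the trailing comment
        if not features:
            continue
        alpha, *pairs = features.split(' ')
        for pair in pairs:
            a, v = pair.split(':')
            w[int(a)] = w.get(int(a), 0.0) + float(alpha) * float(v)
    return dict(sorted(w.items()))                # sorted by feature ID

if __name__ == "__main__":
    for fid, weight in svm_model_weights(sys.argv[1]).items():
        print(fid, ':', weight)
```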

Dr. Ori Cohen has a Ph.D. in Computer Science with a focus on machine learning. He leads the research team at Zencity.io, trying to positively influence citizens' lives.
