Introduction to StanfordNLP: An Incredible State-of-the-Art NLP Library for 53 Languages (with Python code)

1. Download the CoreNLP package. Open your Linux terminal and type the following command:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip

2. Unzip the downloaded package:

unzip stanford-corenlp-full-2018-10-05.zip

3. Start the CoreNLP server:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

Note: CoreNLP requires Java 8 to run. Please make sure you have JDK and JRE 1.8.x installed.
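Before wiring up the Python client, it can help to confirm that the server from step 3 is actually reachable. Below is a minimal standard-library sketch that talks to the server’s HTTP endpoint directly; `build_annotate_url` and `annotate` are illustrative helper names (not part of any library), and the sketch assumes the server is listening on localhost:9000:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_annotate_url(host="localhost", port=9000,
                       annotators="tokenize,ssplit,pos"):
    # The CoreNLP server reads its pipeline settings from a JSON-encoded
    # "properties" query parameter.
    props = {"annotators": annotators, "outputFormat": "json"}
    return "http://%s:%d/?%s" % (host, port,
                                 urlencode({"properties": json.dumps(props)}))

def annotate(text, **kwargs):
    # POST raw UTF-8 text to a running server and parse the JSON reply.
    req = Request(build_annotate_url(**kwargs), data=text.encode("utf-8"))
    with urlopen(req, timeout=15) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

If the server is up, `annotate("Hello world.")` returns a dict with a `sentences` key; if it raises `URLError`, the java command above is most likely not running.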

Now, make sure that StanfordNLP knows where CoreNLP is present. For that, you have to export $CORENLP_HOME as the location of your folder. In my case, this folder was in home itself, so my path would look like:

export CORENLP_HOME=stanford-corenlp-full-2018-10-05/

After the above steps have been taken, you can start up the server and make requests in Python code.
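If you are working in a notebook or on a platform where exporting shell variables is awkward, the same variable can be set from inside Python before the client starts. A small sketch, assuming the unzipped folder sits directly in your home directory:

```python
import os
from pathlib import Path

# Point StanfordNLP at the unzipped CoreNLP folder; equivalent to the
# shell export above. Adjust the path if you unzipped it elsewhere.
corenlp_dir = Path.home() / "stanford-corenlp-full-2018-10-05"
os.environ["CORENLP_HOME"] = str(corenlp_dir)

print(os.environ["CORENLP_HOME"])
```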

Below is a comprehensive example of starting a server, making requests, and accessing data from the returned object.

from stanfordnlp.server import CoreNLPClient

# example text
print('---')
print('input text')
print('')
text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."
print(text)

# set up the client
print('---')
print('starting up Java Stanford CoreNLP Server...')
with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner','depparse','coref'], timeout=30000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    # get the first sentence
    sentence = ann.sentence[0]

    # get the dependency parse of the first sentence
    print('---')
    print('dependency parse of first sentence')
    dependency_parse = sentence.basicDependencies
    print(dependency_parse)

    # get the first token of the first sentence
    print('---')
    print('first token of first sentence')
    token = sentence.token[0]
    print(token)

    # get the part-of-speech tag
    print('---')
    print('part of speech tag of token')
    print(token.pos)

    # get the named entity tag
    print('---')
    print('named entity tag of token')
    print(token.ner)

    # get an entity mention from the first sentence
    print('---')
    print('first entity mention in sentence')
    print(sentence.mentions[0])

    # access the coref chain
    print('---')
    print('coref chains for the example')
    print(ann.corefChain)

    # Use tokensregex patterns to find who wrote a sentence.
    pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/'
    matches = client.tokensregex(text, pattern)
    # sentences contains a list with matches for each sentence.
    assert len(matches["sentences"]) == 3
    # length tells you whether or not there are any matches in this sentence
    assert matches["sentences"][1]["length"] == 1
    # You can access matches like most regex groups.
    matches["sentences"][1]["0"]["text"] == "Chris wrote a simple sentence"
    matches["sentences"][1]["0"]["1"]["text"] == "Chris"

    # Use semgrex patterns to directly find who wrote what.
    pattern = '{word:wrote} >nsubj {}=subject >dobj {}=object'
    matches = client.semgrex(text, pattern)
    # sentences contains a list with matches for each sentence.
    assert len(matches["sentences"]) == 3
    # length tells you whether or not there are any matches in this sentence
    assert matches["sentences"][1]["length"] == 1
    # You can access matches like most regex groups.
    matches["sentences"][1]["0"]["text"] == "wrote"
    matches["sentences"][1]["0"]["$subject"]["text"] == "Chris"
    matches["sentences"][1]["0"]["$object"]["text"] == "sentence"

What I like the most here is the ease of use and the increased accessibility this brings when it comes to using CoreNLP in Python.
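The tokensregex and semgrex replies used above are plain nested dictionaries, so once they come back from the server you can walk them with ordinary Python. Here is a short sketch against a hand-built sample reply; the dict below is illustrative (shaped like the response for the example text, as the assertions above suggest, but not captured from a live server):

```python
# Illustrative tokensregex-style reply: one entry per sentence, a "length"
# count, numbered string keys ("0", "1", ...) for each match in a sentence,
# and numbered keys again for each capture group inside a match.
matches = {
    "sentences": [
        {"length": 0},
        {"length": 1,
         "0": {"text": "Chris wrote a simple sentence",
               "1": {"text": "Chris"}}},
        {"length": 0},
    ]
}

hits = []
for i, sent in enumerate(matches["sentences"]):
    for m in range(sent["length"]):
        hit = sent[str(m)]
        hits.append((i, hit["text"], hit["1"]["text"]))

for i, full, who in hits:
    print("sentence %d: %r was written by %r" % (i, full, who))
```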

My Thoughts on using StanfordNLP – Pros and Cons

Exploring a newly launched library was certainly a challenge. There’s barely any documentation on StanfordNLP! Yet, it was quite an enjoyable learning experience.

A few things that excite me regarding the future of StanfordNLP:

- Its out-of-the-box support for multiple languages
- The fact that it is going to be an official Python interface for CoreNLP. This means it will only improve in functionality and ease of use going forward
- It is fairly fast (barring the huge memory footprint)
- Straightforward setup in Python

There are, however, a few chinks to iron out.

Below are my thoughts on where StanfordNLP could improve:

- The size of the language models is too large (English is 1.9 GB, Chinese ~1.8 GB)
- The library requires a lot of code to churn out features. Compare that to NLTK, where you can quickly script a prototype – this might not be possible for StanfordNLP
- It is currently missing visualization features, which are useful to have for functions like dependency parsing. StanfordNLP falls short here when compared with libraries like spaCy

Make sure you check out StanfordNLP’s official documentation.

End Notes

There is still a feature I haven’t tried out yet: StanfordNLP allows you to train models on your own annotated data using embeddings from word2vec/fastText. I’d like to explore it in the future and see how effective that functionality is. I will update the article whenever the library matures a bit.

Clearly, StanfordNLP is very much in the beta stage. It will only get better from here, so this is a really good time to start using it and get a head start over everyone else.

For now, the fact that such amazing toolkits (like CoreNLP) are coming to the Python ecosystem, and that research giants like Stanford are making an effort to open source their software, makes me optimistic about the future.
