Introduction to StanfordNLP: An NLP Library for 53 Languages (with Python code)

It is just a mapping between PoS tags and their meaning.

This helps in getting a better understanding of our document’s syntactic structure.

The output would be a data frame with three columns — word, pos and exp (explanation).

The explanation column gives us the most information about the text (and is hence quite useful).

Adding the explanation column makes it much easier to evaluate how accurate our processor is.

I like the fact that the tagger is on point for the majority of the words.

It even picks up the tense of a word and whether it is in base or plural form.
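To make the idea of the word / pos / exp table concrete, here is a minimal sketch of how such rows could be built. The helper name `extract_pos_rows` and the tiny `POS_EXPLANATIONS` mapping are my own illustrative assumptions, not part of StanfordNLP. The only library attributes it relies on are `doc.sentences`, `sentence.words`, `word.text` and `word.pos`.

```python
# Hypothetical, deliberately tiny mapping from Penn Treebank tags to their meaning.
# A real mapping would cover the full tag set.
POS_EXPLANATIONS = {
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "PRP": "personal pronoun",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
}


def extract_pos_rows(doc, explanations=POS_EXPLANATIONS):
    """Return one {'word', 'pos', 'exp'} dict per word in the parsed document.

    `doc` is assumed to expose `sentences`, each with `words`, and each word
    is assumed to carry `text` and `pos` attributes.
    """
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            rows.append({
                "word": word.text,
                "pos": word.pos,
                # Tags missing from the mapping are marked instead of raising.
                "exp": explanations.get(word.pos, "NA"),
            })
    return rows
```

Passing the returned list to `pandas.DataFrame(rows)` gives the three-column data frame described above. Unknown tags show up as "NA", which makes gaps in the mapping easy to spot.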

Dependency Extraction

Dependency extraction is another out-of-the-box feature of StanfordNLP.

You can simply call print_dependencies() on a sentence to get the dependency relations for all of its words:

doc.sentences[0].print_dependencies()

The library computes all of the above during a single run of the pipeline.
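If you want the relations as data rather than printed output, a small helper along these lines could work. This is a sketch under assumptions: `dependency_triples` is a name I made up, and I am assuming each word exposes `text`, `dependency_relation` and a `governor` that holds the 1-based position of its head in the sentence, with 0 marking the root.

```python
def dependency_triples(sentence):
    """Return (word, head word, relation) tuples for one parsed sentence.

    Assumes `word.governor` is the 1-based index of the head word within
    `sentence.words`, and that 0 means the word is the root.
    """
    words = sentence.words
    triples = []
    for word in words:
        head_index = int(word.governor)
        # Index 0 has no real head word, so label it explicitly.
        head_text = "ROOT" if head_index == 0 else words[head_index - 1].text
        triples.append((word.text, head_text, word.dependency_relation))
    return triples
```

Calling `dependency_triples(doc.sentences[0])` would then give a list that is easy to filter, for example to keep only subject relations.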

This should take no more than a few minutes on a GPU-enabled machine.

We have now figured out a way to perform basic text processing with StanfordNLP.

It’s time to take advantage of the fact that we can do the same for 51 other languages!

Implementing StanfordNLP on the Hindi Language

StanfordNLP really stands out in its performance and multilingual text parsing support.

Let’s dive deeper into the latter aspect.

Processing text in Hindi (Devanagari Script)

First, we have to download the Hindi language model (comparatively smaller!):

stanfordnlp.download('hi')

Now, take a piece of text in Hindi as our text document:

hindi_doc = nlp("""केंद्र की मोदी सरकार ने शुक्रवार को अपना अंतरिम बजट पेश किया. कार्यवाहक वित्त मंत्री पीयूष गोयल ने अपने बजट में किसान, मजदूर, करदाता, महिला वर्ग समेत हर किसी के लिए बंपर ऐलान किए. हालांकि, बजट के बाद भी टैक्स को लेकर काफी कन्फ्यूजन बना रहा. केंद्र सरकार के इस अंतरिम बजट क्या खास रहा और किसको क्या मिला, आसान भाषा में यहां समझें""")

This should be enough to generate all the tags.

Let’s check the tags for Hindi:

extract_pos(hindi_doc)

The PoS tagger works surprisingly well on the Hindi text as well.

Look at “अपना” for example.

The PoS tagger tags it as a pronoun — I, he, she — which is accurate.
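One detail worth spelling out: the `nlp` object used on the Hindi text has to be a pipeline loaded with the Hindi models. The snippet above does not show how it was created, so the following is only a sketch of one way to do it, assuming the models were downloaded to the default location. `build_pipeline` and `upos_tags` are hypothetical helper names; `stanfordnlp.Pipeline(lang=...)` is the library call.

```python
def build_pipeline(lang="hi"):
    """Create a StanfordNLP pipeline for the given language code.

    Assumes `stanfordnlp.download(lang)` has already been run and the models
    sit in the default directory. The import is kept inside the function so
    that defining this helper does not require the library.
    """
    import stanfordnlp

    return stanfordnlp.Pipeline(lang=lang)


def upos_tags(doc):
    """Return (word, universal PoS tag) pairs for every word in the document."""
    return [(word.text, word.upos) for s in doc.sentences for word in s.words]
```

With this, `nlp = build_pipeline("hi")` followed by `upos_tags(nlp(text))` gives a quick list of word and tag pairs to inspect.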

Using CoreNLP’s API for Text Analytics

CoreNLP is a time-tested, industry-grade NLP toolkit that is known for its performance and accuracy.

StanfordNLP takes three lines of code to start utilizing CoreNLP’s sophisticated API.

Literally, just three lines of code to set it up!

1. Download the CoreNLP package. Open your Linux terminal and type the following command:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip

2. Unzip the downloaded package:

unzip stanford-corenlp-full-2018-10-05.zip

3. Start the CoreNLP server:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

Note: CoreNLP requires Java 8 to run. Please make sure you have JDK and JRE 1.8.x installed.

Now, make sure that StanfordNLP knows where CoreNLP is present.

For that, you have to export $CORENLP_HOME as the location of your folder.

In my case, this folder was in my home directory, so my path looks like this:

export CORENLP_HOME=stanford-corenlp-full-2018-10-05/

After the above steps have been taken, you can start up the server and make requests in Python code.

Below is a comprehensive example of starting a server, making requests, and accessing data from the returned object.

a. Setting up the CoreNLPClient

b. Dependency Parsing and POS

c. Named Entity Recognition and Co-Reference Chains

What I like the most here is the ease of use and increased accessibility this brings when it comes to using CoreNLP in Python.
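The three steps listed above can be sketched roughly as follows. Treat this as an illustration under assumptions, not the exact code behind this walkthrough: the helper names are mine, the annotator list and the `timeout` and `memory` values are placeholders, and the attribute names (`sentence`, `token`, `word`, `pos`, `ner`, `corefChain`, `mention`) are those of the annotation object returned by `CoreNLPClient.annotate`. By default the client launches its own server from `$CORENLP_HOME`.

```python
def annotate_with_corenlp(text):
    """Start a CoreNLP server through the client and annotate `text`.

    Requires `$CORENLP_HOME` to point at the unzipped CoreNLP folder.
    The import is lazy so the helpers below work without a server.
    """
    from stanfordnlp.server import CoreNLPClient

    annotators = ["tokenize", "ssplit", "pos", "lemma", "ner", "depparse", "coref"]
    # timeout is in milliseconds; memory is the Java heap size (assumed values).
    with CoreNLPClient(annotators=annotators, timeout=30000, memory="4G") as client:
        return client.annotate(text)


def token_table(ann):
    """Return (word, PoS tag, named-entity tag) for every token."""
    return [(tok.word, tok.pos, tok.ner) for sent in ann.sentence for tok in sent.token]


def coref_mentions(ann):
    """Return one list of mention strings per co-reference chain."""
    chains = []
    for chain in ann.corefChain:
        mentions = []
        for m in chain.mention:
            # Each mention is a token span inside one sentence.
            tokens = ann.sentence[m.sentenceIndex].token[m.beginIndex:m.endIndex]
            mentions.append(" ".join(tok.word for tok in tokens))
        chains.append(mentions)
    return chains
```

A typical session would call `ann = annotate_with_corenlp("...")` once and then pass `ann` to `token_table` and `coref_mentions`. Keeping the extraction helpers separate from the server call makes them easy to reuse on stored annotations.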

My Thoughts on using StanfordNLP: Pros and Cons

A few things that excite me regarding the future of StanfordNLP:

- Its out-of-the-box support for multiple languages
- The fact that it is going to be an official Python interface for CoreNLP. This means it will only improve in functionality and ease of use going forward
- It is fairly fast (barring the huge memory footprint)
- Straightforward setup in Python

There are, however, a few kinks to iron out. Below are my thoughts on where StanfordNLP could improve:

- The size of the language models is too large (English is 1.9 GB, Chinese ~ 1.8 GB)
- The library requires a lot of code to churn out features. Compare that to NLTK, where you can quickly script a prototype; this might not be possible for StanfordNLP
- Visualization features are currently missing. They are useful for functions like dependency parsing, and StanfordNLP falls short here when compared with libraries like spaCy

Make sure you check out StanfordNLP’s official documentation.

End Notes

Clearly, StanfordNLP is very much in the beta stage.

It will only get better from here so this is a really good time to start using it — get a head start over everyone else.

For now, given that such amazing toolkits (CoreNLP) are coming to the Python ecosystem and research giants like Stanford are making an effort to open source their software, I am optimistic about the future.

Originally published at www.analyticsvidhya.com on February 3, 2019.
