Parsing XML, Named Entity Recognition in One-Shot

An XML parser is the piece of software that reads XML files and makes the information in those files available to applications. While reading an XML file, a parser checks the syntax of the format and reports any violations.

Our main goal today is not parsing XML but Named Entity Recognition (NER); the data we are going to use just happens to be stored in an XML format.

NER is the task of determining the identity of entities mentioned in text. For example, given the sentence “Paris is the capital of France.”, the idea is to determine that “Paris” refers to the city of Paris and not to Paris Hilton. The sentence implies that Paris is the capital of a country, which suggests that Paris is a city, not a person’s name.

The task of naming entities discovered in a document collection is extremely challenging. Thankfully, there is a collection of annotated training datasets for Named Entity Recognition in the NLP Interchange Format, contributed by the Data Science Group at UPB in Germany, and we are going to use one of them, called 500newsgoldstandard.xml, that can be found here.

The Data

To investigate the data, we use prettify() to print the parsed document with its nested, hierarchical structure made visible by indentation:

    from bs4 import BeautifulSoup

    # Parse the XML file; "html.parser" is the parser used throughout this post.
    with open("500newsgoldstandard.xml", "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")

    print(soup.prettify())

The data creators focused on recognizing three main classes of named entities: persons, places and organizations. As we can see from the first document:

Figure 1

The document has one sentence, “The U.S. Patent Office allows genes to be patented as soon as someone isolates the DNA by removing it from the cell, says ACLU attorney Sandra Park”, in which “ACLU” (an organization) and “Sandra Park” (a person’s name) are labeled as named entities because they sit between namedentityintext tags.

Data Pre-processing

Text pre-processing loops through each child of the textwithnamedentities element and labels each word “N” if it is part of a named entity and “C” otherwise. The following code returns a list of documents, each of which contains word and label pairs.

    from bs4 import Tag

    docs = []
    for elem in soup.find_all("document"):
        texts = []
        for child in elem.find("textwithnamedentities").children:
            # Only element nodes carry text parts; bare strings between tags are skipped.
            if isinstance(child, Tag):
                # "N" = inside a named entity, "C" = any other text.
                label = "N" if child.name == "namedentityintext" else "C"
                for w in child.text.split(" "):
                    if len(w) > 0:
                        texts.append((w, label))
        docs.append(texts)

We can investigate the first document again.

    docs[0]

Figure 2

POS Tags

We apply part-of-speech tagging to the tokens of each document.

    from nltk import pos_tag  # needs NLTK's POS tagger model (see nltk.download)

    data = []
    for doc in docs:
        tokens = [t for t, label in doc]
        tagged = pos_tag(tokens)
        # Merge (word, label) with (word, pos) into (word, pos, label).
        data.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

This gives us lists of tuples containing the individual word, its POS tag, and its label.

    data[0]

Figure 3

Conditional Random Field (CRF)

In Named Entity Recognition the input data is sequential: we predict variables that depend on each other as well as on other observed variables. In other words, we take the surrounding context into account when making a prediction for a data point; think again of “Paris is the capital of France” vs. “Paris Hilton”.

Therefore, we will define some features, such as word identity, word parts, lower/title/upper flags, word suffix, word shape and word POS tag. Some information from nearby words is used as well: these features of the previous word for words that are not at the beginning of a document, and these features of the next word for words that are not at the end of a document.

Conditional Random Field Python Library, Python-Crfsuite

Python-crfsuite is a Python binding to CRFsuite, which is an implementation of Conditional Random Fields (CRFs) for labeling sequential data. The library has been widely used for named entity recognition.

The following code is largely taken from python-crfsuite, for feature extraction as mentioned above:

Code listing: feature_crfsuite

Train the Model

Code listing: train_crfModel

Evaluate the Results

Code listing: evaluate_results

Figure 4

Not too shabby for a first attempt!

The Jupyter notebook can be found on GitHub. Enjoy the rest of the week.

References:

Introduction to Conditional Random Fields, blog.echen.me
Performing Sequence Labelling using CRF in Python, www.albertauyeung.com
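
The feature extraction listing did not survive in this copy of the post, so here is a minimal sketch of the feature set described in the Conditional Random Field section. It is modelled loosely on the feature functions in the python-crfsuite examples and is not the author's exact code; the names word2features, doc2features and doc2labels and the small sample document are assumptions made for illustration. It expects documents in the (word, pos, label) form built in the POS Tags section.

```python
def word2features(doc, i):
    """Return string features for the i-th (word, pos, label) tuple of doc."""
    word, pos = doc[i][0], doc[i][1]
    features = [
        "bias",
        "word.lower=" + word.lower(),             # word identity
        "word[-3:]=" + word[-3:],                 # suffixes (word parts)
        "word[-2:]=" + word[-2:],
        "word.isupper=%s" % word.isupper(),       # upper/title/digit flags
        "word.istitle=%s" % word.istitle(),
        "word.isdigit=%s" % word.isdigit(),
        "postag=" + pos,                          # POS tag
    ]
    if i > 0:
        # Not at the beginning of the document: add the previous word's features.
        prev_word, prev_pos = doc[i - 1][0], doc[i - 1][1]
        features.extend([
            "-1:word.lower=" + prev_word.lower(),
            "-1:word.istitle=%s" % prev_word.istitle(),
            "-1:word.isupper=%s" % prev_word.isupper(),
            "-1:postag=" + prev_pos,
        ])
    else:
        features.append("BOS")  # beginning-of-document marker
    if i < len(doc) - 1:
        # Not at the end of the document: add the next word's features.
        next_word, next_pos = doc[i + 1][0], doc[i + 1][1]
        features.extend([
            "+1:word.lower=" + next_word.lower(),
            "+1:word.istitle=%s" % next_word.istitle(),
            "+1:word.isupper=%s" % next_word.isupper(),
            "+1:postag=" + next_pos,
        ])
    else:
        features.append("EOS")  # end-of-document marker
    return features


def doc2features(doc):
    """Feature lists for every token in one document."""
    return [word2features(doc, i) for i in range(len(doc))]


def doc2labels(doc):
    """The 'N'/'C' labels for every token in one document."""
    return [label for _, _, label in doc]


# Hypothetical three-token document in the (word, pos, label) format.
sample = [("Sandra", "NNP", "N"), ("Park", "NNP", "N"), ("spoke", "VBD", "C")]
X = doc2features(sample)
y = doc2labels(sample)
```

Each document becomes one feature sequence and one label sequence of equal length, which is the shape a sequence labeller such as python-crfsuite trains on. Encoding neighbours with the -1: and +1: prefixes is what lets the model tell “Paris is” from “Paris Hilton”.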
