Clean your data with unsupervised machine learning

From sample-checking some of the results we know there are issues ranging from bad links and unreadable PDFs to items which have been read in successfully but whose content is complete garbage. The articles relate to company Modern Slavery returns from this database: https://www.modernslaveryregistry.org/

These now reside in a Pandas data frame with metadata on each item, such as the company name and year of publication, alongside the text which has been scraped from the return.

This is the starting point: the text data is the last column in the data frame.

Quick digression: Missingno

The Python missingno package is super-useful for visualising missing values in a data frame. The chart shows that all but one of the columns are complete. As we are not using that column for our analysis no further work is needed, but if there were gaps in other areas we would have to think about how best to handle them (for example removing those rows or trying to impute missing values). We can now focus on our text data…

Back to cleaning the text data

Scanning through the text in the data frame, there are clearly issues. For example, something has gone wrong with reading the PDF file for this item:

CMR644311ABP UKModern Slavery StatementSeptember 2017)ABP UKModern Slavery StatementSeptember 2017 ABP UKModern Slavery StatementSeptember 2017AABP UKModern Slavery StatementSeptember 2017bABP…

This one looks even worse:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!”#!)*%!+!,(-&*(#!./&0#&-*!!1!!2–34!5!5/6!!1!!7–8(&9%!!1!!7:;!<=>!.3–8(%&-*($9!!!!!!!!!!!!!!Academia Ltd..Modern Slavery Compliance State….

These are clearly beyond repair, but how do we separate them from the text files which have been read in correctly?

Machine Learning to the rescue

We could spend a huge amount of time trying to split out this corrupted information from the real data, but this is exactly where machine learning shines. After clustering the documents, inspecting a sample of the text in cluster 1 gives items like this:

It sets out the steps taken by 3M United Kingdom PLC ending 31 December 2016 to prevent modern slavery and human trafficking in its business and supply chains.

Cluster 1 is therefore part of the real data obtained from the Modern Slavery statements rather than poor-quality data. We would want to investigate why this cluster exists when we come to analyse the data further.

Clusters 4 & 5:

# locate clusters 4 and 5 and return the text column as an array
combined.loc[(combined['Cluster'] == 4) | (combined['Cluster'] == 5)].text.values

returns:

…UK MODERNSLAVERY ACTStatement for the financial year ending 31 December 2016lUK MODERNSLAVERY ACTStatement for the financial year ending 31 December 2016oUK MODERNSLAVERY ACTStatement for the financial year ending 31 December 2016yUK MODERNSLAVERY ACTStatement for the financial year ending 31 December 2016eUK MODERNSLAVERY ACTStatement for the financial year ending 31 December 2016eUK MODERNSLAVERY ACTStatement for the financial year ending 31 December 2016sUK MODERNSLAVERY ACTStatement for the financial year ending 31 December 2016 UK MODERNSLAVERY ACTStatement for the financial year ending 31 December 2016aUK MODERNSLAVERY ACTStatement for the financial year ending 31 December 2016n…

Here is our messy data.
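For reference, the Cluster labels used above could be produced with a step along these lines. This is a minimal sketch, assuming TF-IDF features and k-means; the number of clusters is chosen purely for illustration, while the combined data frame and its text column come from the steps above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# turn each scraped statement into a TF-IDF vector
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(combined['text'])

# group the documents into k clusters (k=6 here purely for illustration)
kmeans = KMeans(n_clusters=6, random_state=42)
combined['Cluster'] = kmeans.fit_predict(X)

# how many documents ended up in each cluster?
print(combined['Cluster'].value_counts())

Once every document has a cluster label, the clusters that turn out to contain nothing but corrupted text (clusters 4 and 5 above) can simply be filtered out of the data frame, leaving the statements that were read in correctly.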
