Data Science with Azure Databricks at Clifford Chance

Guest blog by Mirko Bernardoni (Fiume Ltd) and Lulu Wan (Clifford Chance)

With headquarters in London, Clifford Chance is a member of the “Magic Circle” of law firms and is one of the ten largest law firms in the world, measured both by number of lawyers and by revenue.

As a global law firm, we support clients at both the local and international level across Europe, Asia Pacific, the Americas, the Middle East and Africa.

Our global view, coupled with our sector approach, gives us a detailed understanding of our clients’ business, including the drivers and competitive landscapes.

To achieve our vision of becoming the global law firm of choice we must be the firm that creates the greatest value for our clients.

That means delivering service that is ever quicker, simpler, more efficient and more robust.

By investing in smart technology and applying our extensive legal expertise, we can continually improve value and outcomes for clients, making delivery more effective, every time.

Data Science and Legal

Artificial intelligence is growing at a phenomenal speed and is now set to transform the legal industry by mining documents, reviewing and creating contracts, raising red flags and performing due diligence.

We are enthusiastic early adopters of AI and other advanced technology tools to enable us to deliver a better service to our clients.

To ensure we are providing the best value to our clients, Clifford Chance created an internal Data Science Lab, organised like a startup inside the firm.

We work with, and as part of, the Innovation Lab and Best Delivery Hub at Clifford Chance, where we deliver initiatives that help lawyers in their daily work.

Applying data science to the lawyer’s work comes with many challenges.

These include handling lengthy documents, working with a specific domain language, analysing millions of documents and classifying them, extracting information and predicting statements and clauses.

For example, a simple document classification can become a complex exercise if we consider that our documents contain more than 5,000 words.

Data Science Lab process

The process that enables the Data Science Lab to work at full capacity can be summarised in four steps:

Idea management.

Every idea is catalogued with a specific workflow for managing all progression gates and stakeholder interactions efficiently.

This focuses us on embedding the idea in our existing business processes or creating a new product.

Data processing.

It is up to the Data Science Lab to collaborate with other teams to acquire data, seek the necessary approvals and transform it so that only the relevant data, with the right permissions and in the right format, reaches the data scientists.

Databricks with Apache Spark™ allows us to move the data to Azure efficiently; we run an on-premises instance to filter and obfuscate the data in line with our contracts and regulations.

Thanks to the unified data analytics platform, the entire data team — data engineers and data scientists — can fix minor bugs in our processes.
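
To give a flavour of what this filtering and obfuscation step can look like, here is a minimal PySpark sketch; the column names, the hashing choice and the paths are hypothetical and only illustrate the idea, not our actual schema or rules.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pre-cloud-obfuscation").getOrCreate()

# Hypothetical source: documents and metadata extracted on-premises.
docs = spark.read.parquet("/onprem/datalake/documents")

cleaned = (
    docs
    # Keep only records that our contracts allow us to copy to the cloud.
    .filter(F.col("cloud_transfer_allowed") == True)
    # Obfuscate identifying fields with a one-way hash before they leave the firm.
    .withColumn("client_id", F.sha2(F.col("client_id").cast("string"), 256))
    .withColumn("matter_id", F.sha2(F.col("matter_id").cast("string"), 256))
    # Drop free-text fields that must not be transferred at all.
    .drop("client_notes")
)

# Write the filtered, obfuscated subset to the landing area that is synced to Azure.
cleaned.write.mode("overwrite").parquet("/onprem/export/azure_landing/documents")
```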

Data science.

Without Databricks it would be incredibly expensive for us to conduct research.

The size of the team is small, but we are always looking to implement the latest academic research.

We need a platform that allows us to code in an efficient manner without considering all the infrastructure aspects.

Databricks provides a unified, collaborative environment for all our data scientists, while also ensuring that we can comply with the security standards as mandated by our organisation.

Operationalisation.

The Databricks platform is used to re-train the models and run the ETL process which moves data into production as necessary.

Again, in this case, unifying data engineering and data science was a big win for us.

It reduces the time to fix issues and bugs and helps us to better understand the data.
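
As an illustration of this kind of production ETL, the sketch below streams incoming documents from an Event Hubs/Kafka-compatible endpoint into a Delta table on Azure Data Lake; the endpoint, topic and paths are placeholders, and a real Event Hubs connection would also need SASL/SSL options that are omitted here.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stream: Event Hubs exposed through its Kafka-compatible endpoint.
# (A real connection also needs SASL/SSL options, omitted for brevity.)
incoming = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "my-eventhub-namespace.servicebus.windows.net:9093")
    .option("subscribe", "documents")
    .option("startingOffsets", "latest")
    .load()
)

documents = incoming.select(
    F.col("key").cast("string").alias("document_id"),
    F.col("value").cast("string").alias("document_text"),
    F.col("timestamp").alias("ingested_at"),
)

# Append new documents to the production Delta table that downstream models read from.
(
    documents.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/documents")
    .outputMode("append")
    .start("/mnt/datalake/delta/documents")
)
```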

Workflow process for the Data Science Lab

Data Science Lab toolkit

The Data Science Lab requirements for building our toolkit are:

- Maintain high standards of confidentiality
- Build products as quickly as possible
- Keep control of our models and personalisation
- Usable by a small team of four members with mixed skills and roles

These requirements drove us to automate all of our processes and choose the right platforms for development.

We had to unify data engineering and data science while reducing costs and time required to be operational.

We use a variety of third-party tools, including Azure Cloud, open-source and in-house built tools, for our data stack:

- Spark on-premises installation for applying the first level of governance to our data (such as defining what can be copied to the cloud)
- Kafka and Event Hub as our transport protocol for moving the data into Azure
- Databricks Unified Data Analytics Platform for any ETL transformation, iterative development and testing of our models
- MLflow to log model metadata, select the best models and hyperparameters, and deploy models
- Hyperopt for model tuning and optimisation at scale
- Azure Data Lake with Delta Lake for storing our datasets, enabling traceability and model storage

Data Science Lab data ingestion and elaboration architecture

An example use case: Document classification

Having the ability to automatically label documents speeds up many legal processes when thousands or millions of documents are involved.

To build our model, we worked with the EDGAR dataset, an online public database from the U.S. Securities and Exchange Commission (SEC).

EDGAR is the primary system for submissions by companies and others required to file information with the SEC.

The first step was to extract the documents from the filings, select entries similar in size to those in our use case (more than 5,000 words) and keep only the relevant text.

The process took multiple iterations to get a usable labelled dataset.

We started from more than 15 million files and selected only 28,445 for creating our models.
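
As a rough illustration of this selection step, the snippet below filters extracted filings by word count with PySpark; the input path, column names and threshold mirror the description above but are otherwise hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical table of extracted filing texts with their assigned labels.
filings = spark.read.parquet("/mnt/datalake/edgar/extracted_filings")

long_docs = (
    filings
    # Count words by splitting on whitespace.
    .withColumn("word_count", F.size(F.split(F.col("text"), r"\s+")))
    # Keep only documents comparable in length to our use case.
    .filter(F.col("word_count") > 5000)
    .select("accession_number", "label", "text")
)

long_docs.write.mode("overwrite").parquet("/mnt/datalake/edgar/long_documents")
```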

What was novel about our approach was applying chunk embedding, inspired by audio segmentation.

This entailed dividing a long document into chunks and mapping each chunk into a numeric space to obtain chunk embeddings.

For more details, you can read our published paper here: Long-length Legal Document Classification.
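
A minimal sketch of the chunking idea is shown below, assuming a pre-trained gensim Doc2Vec model is used to embed each chunk; the chunk size, file names and model path are illustrative, and the paper describes the exact configuration we used.

```python
from gensim.models.doc2vec import Doc2Vec

CHUNK_SIZE = 200  # illustrative chunk length, in tokens


def chunk_document(text, chunk_size=CHUNK_SIZE):
    """Split a long document into fixed-size chunks of tokens."""
    tokens = text.lower().split()
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def embed_chunks(text, doc2vec_model):
    """Map each chunk to a vector, giving a sequence of chunk embeddings."""
    return [doc2vec_model.infer_vector(chunk) for chunk in chunk_document(text)]


# Hypothetical usage: the resulting sequence of chunk embeddings becomes the
# input to the downstream BiLSTM-with-attention classifier.
model = Doc2Vec.load("doc2vec_edgar.model")
chunk_vectors = embed_chunks(open("example_filing.txt").read(), model)
```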

On top of a long short-term memory (LSTM) network, we employed an attention mechanism to enable our model to assign different scores to different parts of the whole document.

Throughout the model architecture, a set of hyperparameters, comprising embedding dimension, hidden size, batch size, learning rate and weight decay, plays a vital role in determining both the performance of the model and the time consumed in training it.

Model architecture

Even though we can narrow down candidate values for each hyperparameter to a limited range, the total number of combinations is still massive.

In this case, an exhaustive search over the hyperparameter space is unrealistic, but here Hyperopt makes life much easier.

All we need to do is construct the objective function and define the hyperparameter space.

Meanwhile, all the results generated during the training are stored in MLflow.

No model evaluations are lost.
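
Here is a minimal sketch of this setup, assuming the classifier is wrapped in a hypothetical train_and_evaluate function that returns a validation loss; the search ranges and max_evals are placeholders rather than the values we actually used.

```python
import mlflow
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

# Hypothetical search space over the hyperparameters mentioned above.
space = {
    "embedding_dim": hp.choice("embedding_dim", [100, 200, 300]),
    "hidden_size": hp.choice("hidden_size", [128, 256, 512]),
    "batch_size": hp.choice("batch_size", [16, 32, 64]),
    "learning_rate": hp.loguniform("learning_rate", -9, -4),
    "weight_decay": hp.loguniform("weight_decay", -10, -5),
}


def objective(params):
    # Log every evaluation to MLflow so no trial is lost.
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        val_loss = train_and_evaluate(**params)  # hypothetical training wrapper
        mlflow.log_metric("val_loss", val_loss)
    return {"loss": val_loss, "status": STATUS_OK}


with mlflow.start_run(run_name="document_classifier_tuning"):
    best = fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,
        max_evals=50,
        trials=Trials(),
    )
```

On Databricks, Hyperopt's SparkTrials can be used in place of Trials to distribute the trial evaluations across the cluster.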

t-SNE plot of projections of document embeddings, using Doc2Vec + BiLSTM

Conclusion

The Clifford Chance Data Science Lab team is able to deliver end-user applications and academic research with a small team and limited resources.

This has been achieved through automating processes and using a combination of Azure Cloud, Azure Databricks, MLflow and Hyperopt.

In the use case above, we achieved an F1 score greater than 0.98 on our document classification task with long-length documents.

This is assisting multiple projects where we are dealing with huge numbers of files that require classification.

Looking forward, we plan to further automate our processes to reduce the workload of managing product development.

We are continuing to optimise our processes to add alerting and monitoring.

We plan to produce more scientific papers and contribute to the MLflow and Hyperopt open-source projects in the near future so we can share our specific use cases.

