This post sets out a number of resources to get you started with deep learning, with a focus on natural language processing for legal applications.
A Bit of Background
Deep learning is a bit of a buzzword. Basically, it relates to recent advances in neural networks, and in particular to the number of layers that can be used in these networks. Each layer can be thought of as a mathematical operation. In many cases, it involves a multidimensional extension of drawing a line, y = ax + b, to separate a space into multiple parts.
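As a rough illustration of a layer as a mathematical operation, here is a minimal sketch in plain Python of y = ax + b extended to several dimensions, i.e. y = Wx + b. The names dense_layer, step, W and b and the example values are hypothetical. In practice a library such as NumPy or TensorFlow would do this with optimised array operations.

```python
def dense_layer(weights, bias, x):
    """Compute y = Wx + b for one input vector x (lists of floats)."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def step(values):
    """Report which side of each line the input falls on (1 or 0)."""
    return [1 if v > 0 else 0 for v in values]


# Hypothetical 2-input, 2-output layer: each row of W, together with its
# bias, defines one line that splits the 2-D input space in two.
W = [[1.0, -1.0],
     [0.5, 0.5]]
b = [0.0, -1.0]

y = dense_layer(W, b, [2.0, 1.0])
sides = step(y)
```

Stacking several such operations, each followed by a non-linear function like step, is what gives a network its layers.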
I find it strange that when I studied machine learning in 2003/4, neural networks had gone out of fashion. The craze then was for support vector machines. Neural networks were seen as a bit of a dead end. While there was nothing wrong theoretically, in practice it wasn’t possible to train a network with more than a couple of layers. This limited their application.
What changed?
Computers and software improved. Memory increased. Researchers realised they could co-opt the graphics processing units (GPUs) in the beefy graphics cards of hardcore gamers to perform matrix and vector multiplication. The Internet improved access to large-scale data sets and enabled the fast propagation of results. Software toolkits and standard libraries arrived. You could now program in Python for free rather than pay large licence fees for Matlab. Python made it easy to combine functionality from many different areas. Software became good at automatic differentiation and at incorporating advanced mathematical optimisation techniques. Google and Facebook poured money into the field. Etc.
This all led to researchers being able to build neural networks with more and more layers that could be trained efficiently. Hence, “deep” means more than two layers and “learning” refers to neural network approaches.
Deep Natural Language Processing
Deep learning has a number of different application areas. One big split is between image processing and natural language processing. The former has seen big success with the use of convolutional neural networks (CNNs), while natural language processing has tended to focus on recurrent neural networks (RNNs), which operate on sequences over time.
Image processing has also typically considered supervised learning problems. These are problems where you have a corpus of labelled data (e.g. ‘ImageX’ – ‘cat’) and you want a neural network to learn the classifications.
Natural language processing, on the other hand, tends to work with unsupervised learning problems. In this case, we have a large body of unlabelled data (see the data sources below) and we want to build models that provide some understanding of the data, e.g. models that capture in some way the syntactic or semantic properties of text.
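To give a feel for how unlabelled text can still provide a training signal, here is a minimal sketch that turns a sentence into (centre, context) word pairs, the kind of input used by skip-gram style word embedding models. The function name skipgram_pairs, the window parameter and the example sentence are illustrative assumptions, not taken from any particular library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) word pairs from a list of tokens."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


# No labels are needed: the neighbouring words act as the targets.
sentence = "the claim defines the invention".split()
pairs = skipgram_pairs(sentence, window=1)
```

A model trained to predict the context word from the centre word ends up with similar vectors for words that appear in similar contexts.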
That said, there are crossovers: several highly-cited papers apply CNNs to sentence structures, and document classification can be performed on the basis of a corpus of labelled documents.
Introductory Blog Posts
- http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
- http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- http://sebastianruder.com/word-embeddings-1/
- http://radimrehurek.com/gensim/tutorial.html
- https://www.tensorflow.org/tutorials/word2vec
- https://www.tensorflow.org/tutorials/recurrent
Courses
After you’ve read those blog articles, the next step is to dive into the free Udacity Deep Learning course. This is taught in collaboration with Google Brain and is a great introduction to Logistic Regression, Neural Networks, Data Wrangling, CNNs and a form of RNN called the Long Short-Term Memory network (LSTM). It includes a number of interactive Jupyter/IPython Notebooks, which follow a similar path to the TensorFlow tutorials.
Udacity Deep Learning Course – https://www.udacity.com/course/deep-learning--ud730
Their Data Science, GitHub, Programming and Web Development courses are also very good if you need to get up to speed quickly.
Once you’ve completed that, the next step is to work through the lecture notes and exercises for these Stanford and Oxford courses.
Stanford Deep Learning for Natural Language Processing – http://cs224d.stanford.edu/syllabus.html
Oxford Deep NLP (with special guests from DeepMind & Nvidia) – https://github.com/oxford-cs-deepnlp-2017/lectures
Data Sources
Once you’ve got your head around the theory, and have played around with some simple examples, the next step is to get building on some legal data. Here’s a selection of useful text sources with a patent slant:
USPTO bulk data – https://bulkdata.uspto.gov/ – download all the patents!
Some of this data will require cleaning / sorting / wrangling to access the text. There is an (experimental) USPTO project in Java to help with this. This can be found here: https://github.com/USPTO/PatentPublicData . I have also been working on some Python wrappers to access the XML in (zipped) situ – https://github.com/benhoyle/patentdata and https://github.com/benhoyle/patentmodels.
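Reading XML in zipped situ can be sketched with the standard library alone. This is a minimal sketch, not the code from the repositories above. It assumes each archive member is a single well-formed XML document, which may not hold for every USPTO bulk file. The names iter_xml_roots and element_text and the doc and claim tags are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_roots(zip_source):
    """Yield (member_name, root_element) for each XML file in a zip archive.

    zip_source may be a path or a binary file-like object. Members are
    read straight from the archive, so nothing is extracted to disk.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue
            with archive.open(name) as member:
                yield name, ET.parse(member).getroot()


def element_text(element):
    """Join all text inside an element, collapsing whitespace."""
    return " ".join("".join(element.itertext()).split())


# Tiny in-memory archive standing in for a downloaded bulk file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("doc1.xml", "<doc><claim>A widget.</claim></doc>")
    archive.writestr("readme.txt", "not XML")
buffer.seek(0)

texts = {name: element_text(root) for name, root in iter_xml_roots(buffer)}
```

Because iter_xml_roots is a generator, only one document is parsed at a time, which matters when an archive holds thousands of files.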
Wikipedia bulk data – https://dumps.wikimedia.org/enwiki/latest/ – download all the knowledge!
The file you probably want here is enwiki-latest-pages-articles.xml.bz2. This clocks in at 13 GB compressed and ~58 GB uncompressed. It is supplied as a single XML file. Again, I need to work on some Python helper functions to access the XML and return text.
(Note: this is the same format as recent USPTO grant data – a good XML parser that doesn’t read the whole file into memory would be useful.)
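One way to parse a file like this without reading it all into memory is incremental parsing with the standard library. The sketch below assumes the dump wraps each article in a page element containing title and text elements. These element names and the helpers local_name and iter_pages are assumptions for illustration, so check them against the real file.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(fileobj):
    """Yield (title, text) for each <page> without building the full tree."""
    context = ET.iterparse(fileobj, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root
    title, text = None, None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            root.clear()  # drop finished pages so the tree does not grow


# Small compressed sample standing in for the real multi-gigabyte dump.
sample = (
    b"<mediawiki><page><title>Patent</title>"
    b"<revision><text>A patent is a right.</text></revision></page>"
    b"<page><title>Claim</title>"
    b"<revision><text>A claim defines scope.</text></revision></page>"
    b"</mediawiki>"
)
compressed = io.BytesIO(bz2.compress(sample))

# bz2.open accepts a filename or an existing binary file object.
with bz2.open(compressed, "rb") as stream:
    pages = list(iter_pages(stream))
```

With a real dump you would pass the path of the .bz2 file to bz2.open and loop over iter_pages rather than collecting everything into a list.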
WordNet.
The easiest way to access this data is probably via the NLTK toolkit indicated below. However, you can download the data for WordNet 3 here – https://wordnet.princeton.edu/wordnet/download/current-version/.
BAILII – http://www.bailii.org/ – a free online database of British and Irish case law & legislation, European Union case law, Law Commission reports, and other law-related British and Irish material.
There is no bulk download option for this data – it is accessed as a series of HTML pages. It would not be too difficult to build a Python tool to bulk download various datasets.
UK Legislation – Legislation.gov.uk.
This data is available via a web interface. Unfortunately, there does not appear to be a bulk download option or an API for supplying machine readable data.
On the to-do list is a Python wrapper for supplying structured or unstructured versions of UK legislation from this site (e.g. possibly downloading with requests then parsing the returned HTML).
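The parsing half of such a wrapper might look like the minimal sketch below. It uses only the standard library and a literal HTML string so that it runs offline. In practice the HTML would come from something like requests.get(url).text. The class TextExtractor, the function html_to_text and the sample page are hypothetical and do not reflect the real structure of the site.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page content; a literal string keeps the sketch offline.
page = """<html><head><style>p {color: red}</style></head>
<body><h1>Example Act 2017</h1><p>1. This Act applies to widgets.</p>
</body></html>"""

plain = html_to_text(page)
```

A library such as Beautiful Soup (see the tools below) offers a more forgiving way to do the same job on messy real-world pages.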
European Patent Office Board of Appeal Case Law database – https://www.epo.org/law-practice/case-law-appeals/advanced-search.html.
Although there is no API or bulk download option as yet, it is possible to set up an RSS feed link based on search parameters. This RSS feed link can be processed to access links to each decision page. These pages can then be accessed and converted into text using a few Python functions (I have some scripts to do this that I will share soon).
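Processing the feed to get the decision links could be sketched as follows. This is not the script mentioned above, just a minimal illustration that assumes a standard RSS 2.0 layout with item, title and link elements. The function decision_links and the sample feed are hypothetical, and the real feed's fields may differ.

```python
import xml.etree.ElementTree as ET


def decision_links(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:
            items.append((title, link))
    return items


# Hypothetical feed content standing in for a downloaded search feed.
feed = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Search results</title>
<item><title>T 0001/00</title><link>https://example.org/t000001.html</link></item>
<item><title>T 0002/00</title><link>https://example.org/t000002.html</link></item>
</channel></rss>"""

links = decision_links(feed)
```

Each link can then be fetched and its HTML converted to text in a separate step.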
UK Intellectual Property Office Hearing Database – https://www.ipo.gov.uk/p-challenge-decision-results.htm.
Again, this is a human-accessible resource. However, the decisions are accessible by year in fairly easy-to-parse tables of data (I again have some scripts to do this that I will share soon).
Your Document / Case Management System.
Many law firms use some kind of document and/or case management system. If available online, there may be an API to access documents and data stored in these systems. Tools like Textract (see below) can be used to extract text from these documents. If available as some form of SQL database, you can often access the data using ODBC drivers.
Tools
Once you have some data the hard work begins. Ideally what you want is a nice text string per document or article. However, none of the data sources listed above enable you to access this easily. Hence, you need to start building some wrappers in Python to access and parse the data and return an output that can be easily processed by machine learning libraries. Here are some tools for doing this, and then to build your deep learning networks. For more details just Google the name.
NLTK
– brilliant for many natural language processing functions such as stemming, tokenisation, part of speech tagging and many more.
spaCy
– an advanced set of NLP functions.
Gensim
– another brilliant library for processing big document libraries – particularly good for lazy functions that do not store all the data in memory.
TensorFlow
– for building your neural networks.
Keras
– a wrapper for TensorFlow or Theano that allows rapid prototyping.
Scikit-Learn
– provides implementations for most of the major machine learning techniques, such as Bayesian inference, clustering, regression and more.
Beautiful Soup
– great for easy parsing of semi-structured data such as websites (HTML) or patent documents (XML).
Textract
– a very simple wrapper over a number of different Linux libraries to extract text from a large variety of files.
Pandas
– think of this as a command line Excel, great for manipulating large lists of data.
NumPy
– numerical analysis in Python, used, amongst other things, for multidimensional arrays.
Jupyter Notebooks
– great for prototyping and research, the engineer’s squared-paper notebook of the 21st century, plus they can be easily shared on GitHub.
Docker
– many modern toolkits require a bundle of libraries, so it can be easier to set up a Docker image (a form of virtualised container).
Flask
– for building web servers and APIs.
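The lazy style mentioned under Gensim above, where documents are never all held in memory at once, can be sketched with a plain Python iterable. This is a generic illustration rather than Gensim's own API. The class StreamedCorpus and the one-document-per-line layout are assumptions.

```python
import io


class StreamedCorpus:
    """Iterate over documents one at a time instead of loading them all.

    `open_source` is a zero-argument callable returning a fresh text
    stream with one document per line, so the corpus can be iterated
    more than once (several training passes, for example).
    """

    def __init__(self, open_source):
        self.open_source = open_source

    def __iter__(self):
        with self.open_source() as stream:
            for line in stream:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# A StringIO stands in for something like open("corpus.txt", encoding="utf-8").
raw = "The claim is novel.\n\nThe appeal is dismissed.\n"
corpus = StreamedCorpus(lambda: io.StringIO(raw))

first_pass = list(corpus)
second_pass = list(corpus)
```

Using a class with __iter__ rather than a bare generator means the data can be streamed again on every pass, which libraries that train over several epochs typically need.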