Building a Claim-Figure-Description Dataset

When working with neural network architectures, we need good datasets for training. The problem is that good datasets are rare. In this post I sketch out some ideas for building a dataset of smaller, linked portions of a patent specification. Such a dataset would be useful for training natural language processing models.

What are we doing?

We want to build some neural network models that draft patent specification text automatically.

In the field of natural language processing, neural network architectures have shown limited success in creating captions for images and in generating text for dialogue. The question is: can we get similar architectures to work on real-world data sources, such as the huge database of patent publications?

How do you draft a patent specification?

As a patent attorney, I often draft patent specifications as follows:

  1. Review invention disclosure.
  2. Draft independent patent claims.
  3. Draft dependent patent claims.
  4. Draft patent figures.
  5. Draft patent technical field and background.
  6. Draft patent detailed description.
  7. Draft abstract.

The invention disclosure may be supplied as a short text document, an academic paper, or a proposed standards specification. The main job of a patent attorney is to convert this into a set of patent claims that have broad coverage and are difficult to work around. The coverage may be limited by pre-existing published documents. These may be previous patent applications (e.g. filed by a company or its competitors), cited academic papers or published technical specifications.

Where is the data?

As many have commented, when working with neural networks we often need to frame our problem as “map X to Y”, where the neural network learns the mapping when presented with many examples. In the patent world, what can we use as our Xs and Ys?

  • If you work in a large company you may have access to internal reports and invention disclosures. However, these are rarely made public.
  • To obtain a patent, you need to publish the patent specification. This means we have multiple databases of millions of documents. This is a good source of training data.
  • Standards submissions and academic papers are also published. The problem is there is no structured dataset that explicitly links documents to patent specifications. The best we can do is a fuzzy match using inventor details and subject matter. However, this would likely be noisy and require cleaning by hand.
  • US provisional applications are occasionally made up of a “rough and ready” pre-filing document. These may be available as priority documents on later-filed patent applications. The problem here is that a human being would need to inspect each candidate case individually.

Claim > Figure > Description

At present, research models and datasets work with small amounts of text data. The COCO image database has one-sentence annotations for a range of pictures. Dialogue systems often use tweet- or text-message-length segments (i.e. 140-280 characters). A patent specification, in comparison, is monstrous (around 20-100 pages). Similarly, there may be 3 to 30 patent figures. Claims are better – these tend to be around 150 words (but can run to pages).

To experiment with a self-drafting system, it would be nice to have a dataset with examples as follows (a sketch of one such record is set out after the list):

  • Independent claim: one independent claim of one predefined category (e.g. system or method) with a word limit.
  • Figure: one figure that shows mainly the features of the independent claim.
  • Description: a handful of paragraphs (e.g. 1-5) that describe the Figure.
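
As a concrete sketch, one example in such a dataset could be stored as a simple record like the one below (all field names and values are my own illustrative inventions):

```python
# One hypothetical dataset example (field names and values are illustrative).
example = {
    "publication_number": "EP1234567",  # identifier for the source patent
    "claim": "A system (100) comprising: a processor (110); and a memory (120)...",
    "figure_path": "figures/EP1234567_fig1.tiff",  # the matching drawing
    "description_paragraphs": [
        "Figure 1 shows a system 100 comprising a processor 110.",
        "The processor 110 is coupled to a memory 120.",
    ],
}
```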

We could then play around with architectures to perform the following mappings:

  • Independent claim > Figure (i.e. task 4 above).
  • Independent claim + Figure > Description (i.e. task 6 above).

One problem is that this dataset does not naturally exist.

Another problem is that ideally we would like at least 10,000 examples. If you spent an hour collating each example, and did this for three hours a day, it would take you nearly a decade. (You may or may not also be world class in example collation.)

The long way

Because of the problems above it looks like we will need to automate the building of this dataset ourselves. How can we do this?

If I were to do this manually, I would:

  • Get a list of patent applications in a field I know (e.g. G06).
  • Choose a category – maybe start with apparatus/system.
  • Get the PDF of the patent application.
  • Look at the claims – extract an independent claim of the chosen category. Paste this into a spreadsheet.
  • Look at the Figures. Find the Figure that illustrates most of the claim features. Save this in a directory with a sensible name (e.g. linked to the claim).
  • Look at the detailed description. Copy and paste the passages that mention the Figure (e.g. all those paragraphs that describe the features in Figure X). This is often a continuous range.

The shorter way

There may be a way we can cheat a little. However, this might only work for granted European patents.

One bug-bear (sorry, enjoyable part) of being a European patent attorney is adding reference numerals to the claims to comply with Rule 43(7) EPC. Now where else can you find reference numerals? Why, in the Figures and in the description. Huzzah! A correlation.

So a rough plan for an algorithm would be as follows:

  1. Get a list of granted EP patents (this could comprise a search output).
  2. Define a claim category (e.g. based on a string pattern – [“apparatus”, “system”]).
  3. For each patent in the list:
    1. Fetch the claims using the EPO OPS “Fulltext Retrieval” API.
    2. Process the claims to locate the lowest number independent claim of the defined claim category (my PatentData Python library has some tools to do this).
    3. If a match is found:
      1. Save the claim.
      2. Extract reference numerals from the claim (this could be achieved by looking for text in parentheses or using a “NUM” part-of-speech tag from spaCy – see the sketch after this list).
      3. Fetch the description text using the EPO OPS “Fulltext Retrieval” API.
      4. Extract paragraphs from the description that contain the extracted reference numerals (likely with some threshold – e.g. consecutive paragraphs with greater than 2 or 3 inclusions).
      5. Save the paragraphs and the claim, together with an identifier (e.g. the published patent number).
      6. Determine a candidate Figure number from the extracted paragraphs (e.g. by looking for “FIG” followed by a number, such as with the regular expression FIG\w*\.?\s*(\d+) – again, see the sketch below).
      7. Fetch that Figure using the EPO OPS “Drawings” or images retrieval API.
        • Note that we can’t retrieve specific Figures, only specific sheets of drawings, and the sheet number will only match the Figure number in around half of cases.
        • We can either:
          • Retrieve all the Figures and then OCR these looking for a match with the Figure number and/or the reference numbers.
          • Start with the sheet equal to the Figure number, OCR it, then if there is no match, iterate up and down the sheets until a match is found.
          • See if we can retrieve a mosaic featuring all the Figures, OCR that and look for the sheet number preceding a Figure or reference numeral match.
      8. Save the Figure as something loadable (TIFF format is standard) with a name equal to the previous identifier.
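
Here is a minimal, standard-library-only sketch of the reference numeral extraction, paragraph filtering and Figure number steps above. The regular expressions are first guesses that would need tuning against real claims:

```python
import re

def extract_reference_numerals(claim_text):
    """Find reference numerals cited in parentheses, e.g. "a processor (110)"."""
    numerals = set()
    for match in re.findall(r"\(([\d,;\s]+)\)", claim_text):
        # Brackets such as "(10, 20)" may cite several numerals at once.
        for num in re.split(r"[,;]", match):
            if num.strip():
                numerals.add(num.strip())
    return numerals

def matching_paragraphs(paragraphs, numerals, threshold=2):
    """Keep paragraphs citing more than `threshold` of the claim's numerals."""
    return [p for p in paragraphs
            if sum(1 for n in numerals if re.search(r"\b%s\b" % n, p)) > threshold]

def candidate_figure_number(paragraphs):
    """Return the most frequently mentioned figure number, e.g. "FIG. 2" -> "2"."""
    counts = {}
    for para in paragraphs:
        for fig in re.findall(r"FIG\w*\.?\s*(\d+)", para, flags=re.IGNORECASE):
            counts[fig] = counts.get(fig, 0) + 1
    return max(counts, key=counts.get) if counts else None

claim = "A system (100) comprising: a processor (110); and a memory (120)."
print(extract_reference_numerals(claim))  # {'100', '110', '120'}
```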

The output from running this would be a set of triples similar to this: (claim_text, paragraph_list, figure_file_path).

We might want some way to clean any results – or at least view them easily so that a “gold standard” dataset can be built. This would lend itself to a Mechanical Turk exercise.

We could break down the text data further – the claim text into clauses or “features” (e.g. based on semi-colon placement) and the paragraphs into clauses or sentences.
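
For instance, a first stab at splitting a claim into features on semi-colons might look like this (a sketch only; real claims will throw up plenty of edge cases):

```python
import re

def claim_features(claim_text):
    """Split a claim body into features using semi-colon placement."""
    # Strip the preamble before "comprising", if present.
    body = re.split(r"comprising[:,]?", claim_text, maxsplit=1)[-1]
    features = [f.strip().rstrip(".") for f in body.split(";")]
    # Tidy the leading "and" on the final feature.
    return [re.sub(r"^and\s+", "", f) for f in features if f]

print(claim_features(
    "A system comprising: a processor; and a memory storing instructions."
))  # ['a processor', 'a memory storing instructions']
```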

The image data is black and white, so we could resize and resave each TIFF file as a binary matrix of a common size. We could also use any OCR data from the file.
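
A minimal sketch of that image step using Pillow and numpy (the 512×512 size is an arbitrary choice):

```python
import numpy as np
from PIL import Image

def figure_to_matrix(path, size=(512, 512)):
    """Load a TIFF drawing, resize it and return a binary numpy matrix."""
    img = Image.open(path).convert("L")  # greyscale
    img = img.resize(size)
    return (np.array(img) < 128).astype(np.uint8)  # 1 where the ink is dark
```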

What do we need to do?

We need to code up a script to run the algorithm above. If we are downloading large chunks of text and images we need to be careful of exceeding the EPO’s terms of use limits. We may need to code up some throttling and download monitoring. We might also want to carefully cache our requests, so that we don’t download the same data twice.
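
One way of doing this is a simple wrapper around requests, here sketched with the requests-cache library for caching; the one-second interval is a placeholder, so check the actual OPS fair-use terms:

```python
import time
import requests
import requests_cache

# Cache every GET to a local SQLite file so nothing is downloaded twice.
requests_cache.install_cache("epo_ops_cache")

MIN_INTERVAL = 1.0  # seconds between requests - a placeholder value
_last_request = 0.0

def throttled_get(url, **kwargs):
    """requests.get with a crude rate limit applied before every call."""
    global _last_request
    wait = MIN_INTERVAL - (time.time() - _last_request)
    if wait > 0:
        time.sleep(wait)
    response = requests.get(url, **kwargs)
    _last_request = time.time()
    return response
```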

Initially we could start with a smaller dataset of say 10 or 100 examples. Get that working. Then scale out to many more.

If the EPO OPS is too slow or our downloads are too large, we could use (i.e. buy access to) a bulk data collection. We might want to design our algorithm so that the processing may be performed independently of how the data is obtained.

Another Option

Another option is to use the front-page images of patent publications, which are widely available. The Figure published with the abstract is often the one that the patent examiner or patent drafter thinks best illustrates the invention. We could try to match this with an independent claim. The supplied figure image is, however, smaller. This may be a backup option if our main plan fails.

Wrapping Up

So. We now have a plan for building a dataset of claim text, description text and patent drawings. If the text data is broken down into clauses or sentences, this would not be a million miles away from the COCO dataset, but for patents. This would be a great resource for experimenting with self-drafting systems.


Your Patent Department in 2030

Natural Language Processing and Deep Learning have the potential to overhaul patent operations for large patent departments. Jobs that used to cost hundreds of dollars / pounds per hour may cost cents / pence. This post looks at where I would be investing research funds.

The Path to Automation

In law, the path to automation is typically as follows:

Qualified Legal Professional > Associate > Paralegal > Outsourcing > Automation

Work is standardised and commoditised as we move down the chain. Today we will be looking at the last stage in the chain: automation.

[Insert generic public domain image of future.]

Potential Applications

At a high level, here are some potential applications of deep learning models that have been trained on a large body of patent publications:

  • Invention Disclosure > Patent Specification and/or Claims (Drafting)
  • Patent Claims + Citation > Amended Claims (Amendment)
  • Patent Claims > Corpus > Citations (Patent Search)
  • Invention Disclosure > Citations (Patent Search)
  • Patent Specification + Claims > Cleaned Patent Specification + Claims (Proof Reading)
  • Figures > Patent Description (Drafting)
  • Claims > Figures and/or Patent Description (Drafting)
  • Product Description (e.g. Manual / Website) > Citation (Infringement)
  • Group of Patent Documents > Summary Clusters (Text or Image) (Landscaping)
  • Official Communication > Response Letter Text (Prosecution)

Caveat

I know there is a lot of hype out there and I don’t particularly want to be responsible for pouring oil on the flames of ignorance. I have tried to base these thoughts on widely reviewed research papers. The aim is to provide a piece of informed science fiction that acts as a guide to what may be. (I did originally call it “Your Patent Department 2020” :).

Many of the things discussed below are still a long way off and will require a lot of hard work. However, the same was said 10 years ago of many amazing technologies we now have in production (such as facial tagging, machine translation, virtual assistants, etc.).

Examples

Let’s dive into some examples.

Search

At the moment, patent drafting typically starts as follows: receive invention disclosure, commission search (in-house or external), receive search results, review by attorney, commission patent draft. This can take weeks.

Instead, imagine a world where your inventors submit an invention disclosure and within minutes or hours you receive a report that tells you the most relevant existing patent publication, highlights potentially novel and inventive features and tells you whether you should proceed with drafting or not.

The techniques already exist to do this. You can download all US patent publications onto a hard disk that costs $75. You can convert high-dimensionality documents into lower-dimensionality real vectors (see https://radimrehurek.com/gensim/wiki.html or https://explosion.ai/blog/deep-learning-formula-nlp). You can then compute distance metrics between your decomposed invention disclosure and the corpus of US patent publications. Results can be ranked. You can use a Long Short Term Memory (LSTM) decoder (see https://www.tensorflow.org/tutorials/seq2seq) on any difference vector to indicate novel and possibly inventive features. A neural network classifier trained on previous drafting decisions can provide a probability of proceeding based on the difference results.
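
As a toy illustration of the vector-space part of this, using scikit-learn for latent semantic analysis (the corpus loading and the LSTM decoder step are left out, and the two-line corpus is a stand-in):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["text of patent publication one", "text of patent publication two"]
disclosure = "text of the invention disclosure"

# High-dimensionality bag-of-words -> lower-dimensionality real vectors.
tfidf = TfidfVectorizer().fit_transform(corpus + [disclosure])
vectors = TruncatedSVD(n_components=2).fit_transform(tfidf)  # ~100-300 on real data

# Rank the corpus by similarity to the decomposed disclosure.
similarities = cosine_similarity(vectors[-1:], vectors[:-1])[0]
print(similarities.argsort()[::-1])  # indices of most similar publications first
```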

Drafting

A draft patent application in a complicated field such as computing or electronics may take a qualified patent attorney 20 hours to complete (including iterations with inventors). This process can take 4-6 weeks.

Now imagine a world where you can generate draft independent claims from your invention disclosure and cited prior art at the click of a button. This is not pie-in-the-sky science fiction. State of the art systems that combine natural language processing, reinforcement learning and deep learning can already generate fairly fluid document summaries (see https://metamind.io/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization). Seeding a summary based on located prior art, and the difference vector discussed above, would generate a short set of text with similar language to that art. Even if the process wasn’t able to generate a perfect claim off the bat, it could provide a rough first draft to an attorney who could quickly iterate a much improved version. The system could learn from this iteration (https://deepmind.com/blog/learning-through-human-feedback/) allowing it to improve over time.

Or another option: how about your patent figures are generated automatically based on your patent claims and then your detailed description is generated automatically based on your figures and the invention disclosure? Prototype systems already exist that perform both tasks (see https://arxiv.org/pdf/1605.05396.pdf and http://cs.stanford.edu/people/karpathy/deepimagesent/).

Prosecution

In the old days, patent prosecution involved receiving a letter from the patent office and a bundle of printed citations. These would be processed, stamped, filed, carried around on an internal mail wagon and placed on a desk. More letters would be written culminating in, say, a written response and a set of amendments.

From this, imagine that your patent office post is received electronically, then automatically filed and docketed. Citations are also automatically retrieved and filed. Objection categories are extracted automatically from the text of the office action, and the office action is categorised with a percentage indicating the chance of obtaining a granted patent. Additionally, the text of the citations is read and a score is generated indicating whether the citations remove novelty from your current claims (this is similar to the search process described above, only this time you know what documents you are comparing). If the score is lower than a given threshold, a set of amendment options is presented, along with percentage chances of success. You select an option, maybe iterate the amendment, and then the system generates your response letter. This includes inserting details of the office action you are replying to (specifically addressing each objection that is raised), automatically generating passages indicating basis in the text of your application, explaining the novel features, generating a problem-solution argument that has a basis in the text of your application, and providing pointers for why the novel features are not obvious. Again you iterate, then file online.

Parts of this are already in place at major law firms (e.g. electronic filing and docketing). I have played with systems that can extract the text from an office action PDF and automatically retrieve and file documents via our document management application programming interface. With a set of labelled training data, it is easy to build an objection classification system that takes as input a simple bag of words. Companies such as Lex Machina (see https://lexmachina.com/) already crunch legal data to provide chances of litigation success; parsing legal data from, say, the USPTO and EPO would enable you to build a classification system that maps the full text of your application, and bibliographic data, to a chance of prosecution success based on historic trends (e.g. in your field since the 1970s). Vector-space representations of documents allow distance measures in n-dimensional space to be calculated, and decoder systems can translate these into the language of your specification. The lecture here explains how to create a question answering system using natural language processing and deep learning (http://media.podcasts.ox.ac.uk/comlab/deep_learning_NLP/2017-01_deep_NLP_11_question_answering.mp4). You could adapt this to generate technical problems based on document text, where the answer is bound to the vector-space distance metric. Indeed, patent claim space is relatively restricted (a claim is, at heart, a long sentence, where amendments are often additional sub-phrases that are consistent with the language of the claimset); the nature of patent prosecution and added subject matter naturally produces a closed-form style of problem.
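
For example, a bag-of-words objection classifier is a few lines with scikit-learn; the passages and labels below are invented stand-ins for real labelled training data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented stand-in data: office action passages and their objection categories.
passages = [
    "The claims lack novelty over D1.",
    "Claim 1 does not involve an inventive step over D1 in view of D2.",
    "The amendment adds subject matter extending beyond the application as filed.",
    "The claims lack novelty having regard to D3.",
]
labels = ["novelty", "inventive step", "added matter", "novelty"]

classifier = make_pipeline(CountVectorizer(), LogisticRegression())
classifier.fit(passages, labels)
print(classifier.predict(["Claim 1 is not novel over D5."]))
```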


Imagining Reality is the First Stage to Getting There

There is no doubt that some of these scenarios will be devilishly hard to implement. It took nearly two decades to go from paper to fully online filing systems. However, prototypes of some of these solutions could be hacked up in a few months using existing technology. The low-hanging fruit alone offers the potential to shave hundreds of thousands of dollars from patent prosecution budgets.

I also hope that others are aiming to get there too. If you are, please get in touch!

Resources for (Legal) Deep Learning

This post sets out a number of resources to get you started with deep learning, with a focus on natural language processing for legal applications.

A Bit of Background

Deep learning is a bit of a buzz word. Basically, it relates to recent advances in neural networks. In particular, it relates to the number of layers that can be used in these networks. Each layer can be thought of as a mathematical operation. In many cases, it involves a multidimensional extension of drawing a line, y = ax + b, to separate a space into multiple parts.
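
In code, one layer is little more than that line-drawing operation in many dimensions, plus a non-linearity. A toy numpy illustration:

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])  # input vector
A = np.random.randn(4, 3)       # weights - the multidimensional "a"
b = np.random.randn(4)          # bias - the multidimensional "b"

y = np.maximum(0, A @ x + b)    # one layer: ReLU(Ax + b)
print(y)
```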

I find it strange that when I studied machine learning in 2003/4, neural networks had gone out of fashion. The craze then was for support vector machines. Neural networks were seen as a bit of a dead end. While there was nothing wrong with them theoretically, in practice it wasn’t possible to train a network with more than a couple of layers. This limited their application.

What changed?

Computers and software improved. Memory increased. Researchers realised they could co-opt the graphics processing units of hardcore gamers’ beefy graphics cards to perform matrix and vector multiplication. The Internet improved access to large-scale datasets and enabled the fast propagation of results. Software toolkits and standard libraries arrived. You could now program in Python for free rather than pay large licence fees for Matlab. Python made it easy to combine functionality from many different areas. Software became good at differentiation and incorporated advanced mathematical optimisation techniques. Google and Facebook poured money into the field. Etc.

This all led to researchers being able to build neural networks with more and more layers that could be trained efficiently. Hence, “deep” means more than two layers and “learning” refers to neural network approaches.

Deep Natural Language Processing

Deep learning has a number of different application areas. One big split is between image processing and natural language processing. The former has seen big success with the use of convolutional neural networks (CNNs), while natural language processing has tended to focus on recurrent neural networks (RNNs), which operate on sequences in time.

Image processing has also typically considered supervised learning problems. These are problems where you have a corpus of labelled data (e.g. ‘ImageX’ – ‘cat’) and you want a neural network to learn the classifications.

Natural language processing on the other hand tends to work with unsupervised learning problems. In this case, we have a large body of unlabelled data (see the data sources below) and we want to build models that provide some understanding of the data, e.g. that model in some way syntactic or semantic properties of text.

That said, there are crossovers – there are several highly cited papers that apply CNNs to sentence structures, and document classification can be performed on the basis of a corpus of labelled documents.

Introductory Blog Posts

A good place to start is with introductory blog posts and tutorials. I’m rather envious of the ability of these folks to write so clearly about such a complex topic.

Courses

After you’ve read those blog articles, a next step is to dive into the free Udacity Deep Learning course. This is taught in collaboration with Google Brain and is a great introduction to Logistic Regression, Neural Networks, Data Wrangling, CNNs and a form of RNN called Long Short-Term Memory (LSTM). It includes a number of interactive Jupyter/IPython Notebooks, which follow a similar path to the Tensorflow tutorials.

Udacity Deep Learning Course – https://www.udacity.com/course/deep-learning--ud730

Their Data Science, GitHub, Programming and Web Development courses are also very good if you need to get up to speed quickly.

Once you’ve completed that, a next step is to work through the lecture notes and exercises for these Stanford and Oxford courses.

Stanford Deep Learning for Natural Language Processing – http://cs224d.stanford.edu/syllabus.html

Oxford Deep NLP (with special guests from Deepmind & Nvidia) – https://github.com/oxford-cs-deepnlp-2017/lectures

Data Sources

Once you’ve got your head around the theory, and have played around with some simple examples, the next step is to get building on some legal data. Here’s a selection of useful text sources with a patent slant:

USPTO bulk data – https://bulkdata.uspto.gov/ – download all the patents!

Some of this data will require cleaning / sorting / wrangling to access the text. There is an (experimental) USPTO project in Java to help with this, which can be found here: https://github.com/USPTO/PatentPublicData. I have also been working on some Python wrappers to access the XML in (zipped) situ – https://github.com/benhoyle/patentdata and https://github.com/benhoyle/patentmodels.

Wikipedia bulk data – https://dumps.wikimedia.org/enwiki/latest/ – download all the knowledge!

The file you probably want here is enwiki-latest-pages-articles.xml.bz2. This clocks in at 13 GB compressed and ~58 GB uncompressed. It is supplied as a single XML file. Again I need to work on some Python helper functions to access the XML and return text.

 (Note: this is the same format as recent USPTO grant data – a good XML parser that doesn’t read the whole file into memory would be useful.)
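
As a starting point, lxml's iterparse can stream elements without loading the whole file. A sketch for the Wikipedia dump follows; the namespace string varies by dump version, so treat it as an assumption to check:

```python
import bz2
from lxml import etree

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # check against your dump

def iter_pages(path):
    """Yield (title, text) for each <page> in a compressed dump."""
    with bz2.open(path, "rb") as f:
        for _, elem in etree.iterparse(f, tag=NS + "page"):
            yield elem.findtext(NS + "title"), elem.findtext(".//" + NS + "text")
            # Free processed elements so memory use stays roughly flat.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
```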

WordNet.

The easiest way to access this data is probably via the NLTK toolkit indicated below. However, you can download the data for WordNet 3 here – https://wordnet.princeton.edu/wordnet/download/current-version/.

BAILII – http://www.bailii.org/ – a free online database of British and Irish case law & legislation, European Union case law, Law Commission reports, and other law-related British and Irish material.

There is no bulk download option for this data – it is accessed as a series of HTML pages. It would not be too difficult to build a Python tool to bulk download various datasets.
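
A first sketch with requests and Beautiful Soup (real use would need polite crawling with delays, plus per-database handling of the HTML layout):

```python
import requests
from bs4 import BeautifulSoup

def fetch_decision_text(url):
    """Fetch one BAILII decision page and return its visible text."""
    response = requests.get(url, headers={"User-Agent": "research-crawler"})
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator="\n")
```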

UK Legislation – Legislation.gov.uk.

This data is available via a web interface. Unfortunately, there does not appear to be a bulk download option or an API for supplying machine-readable data.

On the to-do list is a Python wrapper for supplying structured or unstructured versions of UK legislation from this site (e.g. possibly downloading with requests then parsing the returned HTML).

European Patent Office Board of Appeal Case Law database – https://www.epo.org/law-practice/case-law-appeals/advanced-search.html.

Although there is no API or bulk download option as of yet, it is possible to set up an RSS feed link based on search parameters. This RSS feed link can be processed to access links to each decision page. These pages can then be accessed and converted into text using a few Python functions (I have some scripts to do this I will share soon).
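
A sketch of that processing with the feedparser library; the feed URL is a placeholder to be generated from an advanced search on the site:

```python
import feedparser
import requests
from bs4 import BeautifulSoup

RSS_URL = "https://www.epo.org/your-saved-search-feed.rss"  # placeholder URL

for entry in feedparser.parse(RSS_URL).entries:
    page = requests.get(entry.link)  # each entry links to one decision page
    text = BeautifulSoup(page.text, "html.parser").get_text()
    print(entry.title, len(text))
```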

UK Intellectual Property Office Hearing Database – https://www.ipo.gov.uk/p-challenge-decision-results.htm.

Again, a human-accessible resource. However, the decisions are accessible by year in fairly easy-to-parse tables of data (I again have some scripts to do this that I will share soon).

Your Document / Case Management System.

Many law firms use some kind of document and/or case management system. If available online, there may be an API to access documents and data stored in these systems. Tools like Textract (see below) can be used to extract text from these documents. If available as some form of SQL database, you can often access the data using ODBC drivers.
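
For the SQL route, a short sketch with pyodbc; the DSN, table and column names are invented, so yours will differ:

```python
import pyodbc

# Invented connection details - ask IT for the real DSN and schema.
connection = pyodbc.connect("DSN=DocManagement;UID=user;PWD=password")
cursor = connection.cursor()
cursor.execute("SELECT doc_id, title FROM documents WHERE doc_type = ?", "letter")
for doc_id, title in cursor.fetchall():
    print(doc_id, title)
```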

Tools

Once you have some data the hard work begins. Ideally what you want is a nice text string per document or article. However, none of the data sources listed above enable you to access this easily. Hence, you need to start building some wrappers in Python to access and parse the data and return an output that can be easily processed by machine learning libraries. Here are some tools for doing this, and then to build your deep learning networks. For more details just Google the name.

NLTK – brilliant for many natural language processing functions such as stemming, tokenisation, part-of-speech tagging and many more.

SpaCy – an advanced set of NLP functions.

Gensim – another brilliant library for processing big document collections – particularly good for lazy functions that do not store all the data in memory.

Tensorflow – for building your neural networks.

Keras – a wrapper for Tensorflow or Theano that allows rapid prototyping.

Scikit-Learn – provides implementations of most of the major machine learning techniques, such as Bayesian inference, clustering, regression and more.

Beautiful Soup – great for easy parsing of semi-structured data such as websites (HTML) or patent documents (XML).

Textract – a very simple wrapper over a number of different Linux libraries to extract text from a large variety of files.

Pandas – think of this as command-line Excel; great for manipulating large lists of data.

Numpy – numerical analysis in Python; used, amongst other things, for multidimensional arrays.

Jupyter Notebooks – great for prototyping and research – the engineer’s squared-paper notebook of the 21st century – plus they can be easily shared on GitHub.

Docker – many modern toolkits require a bundle of libraries; it can be easier to set up a Docker image (a form of virtualised container).

Flask – for building web servers and APIs.

Now go build, share on GitHub and let me know what you come up with.