Building a Claim-Figure-Description Dataset

When working with neural network architectures we need good datasets for training. The problem is good datasets are rare. In this post I sketch out some ideas for building a dataset of smaller, linked portions of a patent specification. This dataset can be useful for training natural language processing models.

What are we doing?

We want to build some neural network models that draft patent specification text automatically.

In the field of natural language processing, neural network architectures have shown limited success in creating captions for images (kicked off by this paper) and text generation for dialogue (see here). The question is: can we get similar architectures to work on real-world data sources, such as the huge database of patent publications?

How do you draft a patent specification?

As a patent attorney, I often draft patent specifications as follows:

Review invention disclosure.
Draft independent patent claims.
Draft dependent patent claims.
Draft patent figures.
Draft patent technical field and background.
Draft patent detailed description.
Draft abstract.

The invention disclosure may be supplied as a short text document, an academic paper, or a proposed standards specification. The main job of a patent attorney is to convert this into a set of patent claims that have broad coverage and are difficult to work around. The coverage may be limited by pre-existing published documents. These may be previous patent applications (e.g. filed by a company or its competitors), cited academic papers or published technical specifications.

Where is the data?

As many have commented, when working with neural networks we often need to frame our problem as map X to Y, where the neural network learns the mapping when presented with many examples. In the patent world, what can we use as our Xs and Ys?

If you work in a large company you may have access to internal reports and invention disclosures. However, these are rarely made public.
To obtain a patent, you need to publish the patent specification. This means we have multiple databases of millions of documents. This is a good source of training data.
Standards submissions and academic papers are also published. The problem is there is no structured dataset that explicitly links documents to patent specifications. The best we can do is a fuzzy match using inventor details and subject matter. However, this would likely be noisy and require cleaning by hand.
US provisional applications are occasionally made up of a “rough and ready” pre-filing document. These may be available as priority documents on later-filed patent applications. The problem here is that a human being would need to inspect each candidate case individually.

Claim > Figure > Description

At present, the research models and datasets have small amounts of text data. The COCO image database has one-sentence annotations for a range of pictures. Dialogue systems often use tweet or text-message length text segments (i.e. 140-280 characters). A patent specification in comparison is monstrous (around 20-100 pages). Similarly there may be 3 to 30 patent figures. Claims are better – these tend to be around 150 words (but can be pages).

To experiment with a self-drafting system, it would be nice to have a dataset with examples as follows:

Independent claim: one independent claim of one predefined category (e.g. system or method) with a word limit.
Figure: one figure that shows mainly the features of the independent claim.
Description: a handful of paragraphs (e.g. 1-5) that describe the Figure.

We could then play around with architectures to perform the following mappings:

Independent claim > Figure (i.e. task 4 above).
Independent claim + Figure > Description (i.e. task 7 above).

One problem is this dataset does not naturally exist.

Another problem is that ideally we would like at least 10,000 examples. If you spent an hour collating each example, and did this for three hours a day, it would take you nearly a decade. (You may or may not also be world class in example collation.)

The long way

Because of the problems above it looks like we will need to automate the building of this dataset ourselves. How can we do this?

If I was to do this manually, I would:

Get a list of patent applications in a field I know (e.g. G06).
Choose a category – maybe start with apparatus/system.
Get the PDF of the patent application.
Look at the claims – extract an independent claim of the chosen category. Paste this into a spreadsheet.
Look at the Figures. Find the Figure that illustrated most of the claim features. Save this in a directory with a sensible name (e.g. linked to the claim).
Look at the detailed description. Copy and paste the passages that mention the Figure (e.g. all those paragraphs that describe the features in Figure X). This is often a continuous range.

The shorter way

There may be a way we can cheat a little. However, this might only work for granted European patents.

One ~~bug-bear~~ enjoyable part of being a European patent attorney is adding reference numerals to the claims to comply with Rule 43(7) EPC. Now where else can you find reference numerals? Why, in the Figures and in the claims. Huzzah! A correlation.

So a rough plan for an algorithm would be as follows:

Get a list of granted EP patents (this could comprise a search output).
Define a claim category (e.g. based a string pattern – [“apparatus”, “system”]).
For each patent in the list:
1. Fetch the claims using the EPO OPS “Fulltext Retrieval” API.
2. Process the claims to locate the lowest number independent claim of the defined claim category (my PatentData Python library has some tools to do this).
3. If a match is found:
  1. Save the claim.
  2. Extract reference numerals from the claim (this could be achieved by looking for text in parenthesis or using a “NUM” part of speech from spaCy).
  3. Fetch the description text using the EPO OPS “Fulltext Retrieval” API.
  4. Extract paragraphs from the description that contain the extracted reference numerals (likely with some threshold – e.g. consecutive paragraphs with greater than 2 or 3 inclusions).
  5. Save the paragraphs and the claim, together with an identifier (e.g. the published patent number).
  6. Determine a candidate Figure number from the extracted paragraphs (e.g. by looking for “FIG* [/d]”).
  7. Fetch that Figure using the EPO OPS “Drawings” or images retrieval API.
    - Now we can’t retrieve specific Figures, only specific sheets of drawings, and only in ~50% of cases will these match.
    - We can either:
      - Retrieve all the Figures and then OCR these looking for a match with the Figure number and/or the reference numbers.
      - Start with a sheet equal to the Figure number, OCR, then if there is no match, iterate up and down the Figures until a match is found.
      - See if we can retrieve a mosaic featuring all the Figures, OCR that and look for the sheet number preceding a Figure or reference numeral match.
  8. Save the Figure as something loadable (TIFF format is standard) with a name equal to the previous identifier.

The output from running this would be triple similar to this: (claim_text, paragraph_list, figure_file_path).

We might want some way to clean any results – or at least view them easily so that a “gold standard” dataset can be built. This would lend itself to a Mechanical Turk exercise.

We could break down the text data further – the claim text into clauses or “features” (e.g. based on semi-colon placement) and the paragraphs into clauses or sentences.

The image data is black and white, so we could resize and resave each TIFF file as a binary matrix of a common size. We could also use any OCR data from the file.

What do we need to do?

We need to code up a script to run the algorithm above. If we are downloading large chunks of text and images we need to be careful of exceeding the EPO’s terms of use limits. We may need to code up some throttling and download monitoring. We might also want to carefully cache our requests, so that we don’t download the same data twice.

Initially we could start with a smaller dataset of say 10 or 100 examples. Get that working. Then scale out to many more.

If the EPO OPS is too slow or our downloads are too large, we could use (i.e. buy access to) a bulk data collection. We might want to design our algorithm so that the processing may be performed independently of how the data is obtained.

Another Option

Another option is that front page images of patent publications are often available. The Figure published with the abstract is often that which the patent examiner or patent drafter thinks best illustrates the invention. We could try to match this with an independent claim. The figure image supplied though is smaller. This maybe a backup option if our main plan fails.

Wrapping Up

So. We now have a plan for building a dataset of claim text, description text and patent drawings. If the text data is broken down into clauses or sentences, this would not be a million miles away from the COCO dataset, but for patents. This would be a great resource for experimenting with self-drafting systems.

Your Patent Department in 2030

Natural Language Processing and Deep Learning have the potential to overhaul patent operations for large patent departments. Jobs that used to cost hundreds of dollars / pounds per hour may cost cents / pence. This post looks at where I would be investing research funds.

The Path to Automation

In law, the path to automation is typically as follows:

Qualified Legal Professional > Associate > Paralegal > Outsourcing > Automation

Work is standardised and commoditised as we move down the chain. Today we will be looking at the last stage in the chain: automation.

virtual-reality-1802469_640 — [Insert generic public domain image of future.]

Potential Applications

At a high level, here are some potential applications of deep learning models that have been trained on a large body of patent publications:

Invention Disclosure > Patent Specification +/ Claims (Drafting)
Patent Claims + Citation > Amended Claims (Amendment)
Patent Claims > Corpus > Citations (Patent Search)
Invention Disclosure > Citations (Patent Search)
Patent Specification + Claims > Cleaned Patent Specification + Claims (Proof Reading)
Figures > Patent Description (Drafting)
Claims > Figures +/ Patent Description (Drafting)
Product Description (e.g. Manual / Website) > Citation (Infringement)
Group of Patent Documents > Summary Clusters (Text or Image) (Landscaping)
Official Communication > Response Letter Text (Prosecution)

Caveat

I know there is a lot of hype out there and I don’t particularly want to be responsible for pouring oil on the flames of ignorance. I have tried to base these thoughts on widely reviewed research papers. The aim is to provide more a piece of informed science fiction and to act as a guide as to what may be. (I did originally call it “Your Patent Department 2020” :).

Many of these things discussed below are still a long way off, and will require a lot of hard work. However, the same was said 10 years ago of many amazing technologies we now have in production (such as facial tagging, machine translation, virtual assistants, etc.).

Examples

Let’s dive into some examples.

Search

At the moment, patent drafting typically starts as follows: receive invention disclosure, commission search (in-house or external), receive search results, review by attorney, commission patent draft. This can take weeks.

Instead, imagine a world where your inventors submit an invention disclosure and within minutes or hours you receive a report that tells you the most relevant existing patent publication, highlights potentially novel and inventive features and tells you whether you should proceed with drafting or not.

The techniques already exist to do this. You can download all US patent publications onto a hard disk that costs $75. You can convert high-dimensionality documents into lower-dimensionality real vectors (see https://radimrehurek.com/gensim/wiki.html or https://explosion.ai/blog/deep-learning-formula-nlp). You can then compute distance metrics between your decomposed invention disclosure and the corpus of US patent publications. Results can be ranked. You can use a Long Short Term Memory (LSTM) decoder (see https://www.tensorflow.org/tutorials/seq2seq) on any difference vector to indicate novel and possibly inventive features. A neural network classifier trained on previous drafting decisions can provide a probability of proceeding based on the difference results.

Drafting

A draft patent application in a complicated field such as computing or electronics may take a qualified patent attorney 20 hours to complete (including iterations with inventors). This process can take 4-6 weeks.

Now imagine a world where you can generate draft independent claims from your invention disclosure and cited prior art at the click of a button. This is not pie-in-the-sky science fiction. State of the art systems that combine natural language processing, reinforcement learning and deep learning can already generate fairly fluid document summaries (see https://metamind.io/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization). Seeding a summary based on located prior art, and the difference vector discussed above, would generate a short set of text with similar language to that art. Even if the process wasn’t able to generate a perfect claim off the bat, it could provide a rough first draft to an attorney who could quickly iterate a much improved version. The system could learn from this iteration (https://deepmind.com/blog/learning-through-human-feedback/) allowing it to improve over time.

Or another option: how about your patent figures are generated automatically based on your patent claims and then your detailed description is generated automatically based on your figures and the invention disclosure? Prototype systems already exist that perform both tasks (see https://arxiv.org/pdf/1605.05396.pdf and http://cs.stanford.edu/people/karpathy/deepimagesent/).

Prosecution

In the old days, patent prosecution involved receiving a letter from the patent office and a bundle of printed citations. These would be processed, stamped, filed, carried around on an internal mail wagon and placed on a desk. More letters would be written culminating in, say, a written response and a set of amendments.

From this, imagine that your patent office post is received electronically, then automatically filed and docketed. Citations are also automatically retrieved and filed. Objection categories are extracted automatically from the text of the office action and the office action is categorised with a percentage indicating the chance of obtaining a granted patent. Additionally, the text of the citations is read and a score is generated indicating whether the citations remove novelty from your current claims (this is similar to the search process described above, only this time you know what documents you are comparing). If the score is lower than a given threshold, a set of amendment options are presented, along with a percentage chances of success. You select an option, maybe iterate the amendment, and then the system generates your response letter. This includes inserting details of the office action you are replying to (specifically addressing each objection that is raised), automatically generating passages indicating basis in the text of your application, explains the novel features, generates a problem-solution that has a basis in the text of your application, and provides pointers for why the novel features are not obvious. Again you iterate then file online.

Parts of this are already in place at major law firms (e.g. electronically filing and docketing). I have played with systems that can extract the text from an office action PDF and automatically retrieve and file documents via our document management application programming interface. With a set of labelled training data, it is easy to build an objection classification system that takes as input a simple bag of words. Companies such as Lex Machina (see https://lexmachina.com/) already crunch legal data to provide chances of litigation success; parsing legal data from say the USPTO and EPO would enable you to build a classification system that maps the full text of your application, and bibliographic data, to a chance of prosecution success based on historic trends (e.g. in your field since the 1970s). Vector-space representations of documents allow distance measures in n-dimensional space to be calculated, and decoder systems can translate these into the language of your specification. The lecture here explains how to create a question answering system using natural language processing and deep learning (http://media.podcasts.ox.ac.uk/comlab/deep_learning_NLP/2017-01_deep_NLP_11_question_answering.mp4). You could adapt this to generate technical problems based on document text, where the answer is bound to the vector-space distance metric. Indeed, patent claim space is relatively restricted (it is, at heart, a long sentence, where amendments are often additional sub-phrases of the sentence that are consistent with the language of the claimset); the nature of patent prosecution and added subject matter, naturally produces a closed-form style problem.

Imagining Reality is the First Stage to Getting There

There is no doubt that some of these scenarios will be devilishly hard to implement. It took nearly two decades to go from paper to properly online filing systems. However, prototypes of some of these solutions could be hacked up in a few months using existing technology. The low hanging fruit alone offers the potential to shave hundreds of thousands of dollars from patent prosecution budgets.

I also hope that others are aiming to get there too. If you are please get in touch!

Patents: Another Language?

I have been playing with natural language processing.

Now I have a body of patent data (see here), I can do some interesting things. For example, most people would say that patents have a pretty specific terminology. I say: show me the data.

Taking all patent publications in 2001 as an example, I programmed a little routine that:

Extracted the text data of each patent publication;
Split the text data into words;
Filtered the words for non-words (e.g. punctuation etc.);
Applied a stemming algorithm (from 1979!); and
Recorded the frequency distribution of the results.

In total I counted 277493492 occurrences of 287455 unique word stems.

In common with most written material, 100 words accounted for 50% of the published material. Amazing when you think about it.

(Next time you get a drafting bill from a patent attorney, complain that half their work is shuffling 100 words around :)).

Here is the graph (click to zoom for full glory).

Cumulative Percentage of Top100 Words — Cumulative Percentage of Top 100 Words (click for full-size)

Patent Stopwords

There is more.

“Stopwords” are common words that are often filtered out when analysing documents. The Natural Language Tool Kit provides a set based on a general analysis of written English. These include words such as:

…’did’, ‘doing’, ‘a’, ‘an’, ‘the’, ‘and’, ‘but’, ‘if’, ‘or’, ‘because’, ‘as’, ‘until’, ‘while’, ‘of’, ‘at’, ‘by’, ‘for’…

In total there are 127 stopwords in this collection representing high-frequency content that has little lexical use.

I thought it would be interesting to compare these stopwords with the 127 most frequent in our frequency count.

Words that occurred frequently in (US) patent publications that do not comprise regular stopwords include:

said use first one form invent thi may second data claim wherein accord control signal present devic provid portion includ embodi compris method layer surfac system process exampl step ha shown connect posit prefer oper gener mean inform circuit imag unit time materi also end wa member line film side least select apparatu output element refer receiv describ direct base light section set show substrat contain display view valu part cell two plural group structur number optic electrod input result abov respect region memori plate case differ user

These words will be familiar to most patent professionals. The result of the stemming operation can be seen in certain words, e.g. “oper” – these should be treated as “oper*” – “operates”, “operating”, “operate” etc.. You can see that stemming is not perfect (“thi” may relate to “this”, which has been taken to be a plural form) but it is generally good enough. Without the stemming there would be many different variations of the same word in our counts.

Now this list of “patent stopwords” is useful. Firstly, these words are probably not useful for searching in isolation (we may move onto n-grams later). Secondly, they can be used as a dictionary of sorts for claim drafting. Thirdly, they could be used to distinguish patent text from non-patent text (e.g. as the basis for a feature vector for this classification).

The words that occur in patent specifications but also occur in “the real world” are also interesting:

the of a to and in is for be an as by with or are that from which at on can it have such each not when between other into through further more about than will so if then

These can be used as universal stopwords.

Further Fun

There are a number of paths for further analysis:

Extend across the whole US patent publication corpus from 2001 to 2014. I may need to optimise my code to do this!
Perform a similar analysis for different classification levels – e.g. do patents classified as G have a different vocabulary from those classified as H?
Look at infrequent or unique words – How many are there? Are they useful for searching clusters?