Can You Mine Patent [Big] Data at Home?

Possibly. Let’s give it a go.


Data

In my experience, no one has quite realised how amazing this resource is: Google hosts bulk downloads of patent and trademark data from the US Patent and Trademark Office.

Just think about this for a second.

Here you can download images of most commercial logos used between 1870(!) and the present day. Back in the day, doing image processing and machine learning, I would have given my right arm for such a data set.

Moreover, you get access (eventually) to the text of most US patent publications. Considering there are over 8 million of these, and considering that most new and exciting technologies are the subject of a patent application, this represents a treasure trove of information on human innovation.

Although we are limited to US-based patent publications, this is not a problem. The US is the world’s primary patent jurisdiction – many companies only patent in the US, and most inventions of importance (in modern times) will be protected there. At this point we are also not looking for precise legal data – the accuracy of these downloads is not ideal. Instead, we are looking at “Big Data” (buzzword cringe) – general patterns and statistical gists from “messy” and incomplete datasets.

Storage

I started with ten years’ worth of patent publications: 2001 to 2011. The data from 2001 onwards is pretty reliable; I have been told the OCR data from earlier patent publications is near useless.

An average year is around 60 GB of data (zipped!), so ten years comes to roughly 600 GB. Hence, we need a large hard drive.

You can pick up a 2TB external drive for about £60. I have heard they crash a lot, though, so you might want to get two and mirror the contents using rsync.

[Update: the rsync command I am using is:

rsync -ruv /media/EXTHDD1/'Patent Downloads' /media/EXTHDD2/'Patent Downloads'

where EXTHDD1 and EXTHDD2 are the two USB disk drives. The -r flag recurses into directories, -u skips files that are already newer on the destination, and -v gives verbose output.]


Download

I have an unlimited package on BT Infinity (hurray!). A great help in downloading the data is a little Firefox plugin called FlashGot. Install it, select the links of the files you want to download, right-click and choose the FlashGot selection option. This basically sets off a little wget-style script that fetches each of the links. I set it going just before bed – when I wake up the files are on my hard drive.
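
If you would rather script the downloads yourself, a minimal Python sketch along the same lines would do the job (the URL below is illustrative only – substitute the real links from the bulk-download page):

import urllib

# Illustrative URL only - use the real links from the bulk-download page
urls = [
	"http://example.com/patents/2001/20010607.ZIP",
]

for url in urls:
	filename = url.split('/')[-1]
	print "Fetching", filename
	urllib.urlretrieve(url, filename)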

The two sets of files that look the most useful are the 2001+ full-text archives and the 2001+ full-text-with-embedded-images archives. I went for ten years’ worth of the latter.


Data Structure

The structure of the downloaded data is as follows:

  • Directory: Patent Downloads
    • Directory: [Year e.g. 2001] – Size ~ 65GB
      • [ZIP File – one per week – name format is date e.g. 20010607.ZIP] – Size ~ 0.5GB
        • Directory: DTDS [Does what it says on the tin – maybe useful for validation, but we can generally ignore it for now]
        • Directory: ENTITIES [Ditto – XML entities]
        • Directories: UTIL[XXXX] [e.g. UTIL0002, UTIL0003 – these contain the data] – Size ~ 50-100MB
          • [ZIP Files – one per publication – name format is [Publication Number]-[Date].ZIP e.g. US20010002518A1-20010607.ZIP] – Size ~ 50-350KB
            • [XML File for the patent publication data – name format is [Publication Number]-[Date].XML e.g. US20010002518A1-20010607.XML] – Size ~ 100KB
            • [TIF Files for the drawings – name format is [Publication Number]-[Date]-D[XXXXX].TIF where XXXXX is the drawing number e.g. US20010002518A1-20010607-D00012.TIF] – Size ~ 20KB

[Update: this structure varies a little from 2004 onwards – there are a few extra layers of directories between the original zipped folder and the actual XML.]
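
Given that naming convention, pulling the publication number and date out of a file name is straightforward. A quick sketch (parse_name is my own little helper, and it assumes the convention above holds):

import os

def parse_name(filename):
	# Strip the path and the extension, then split on the hyphens
	base = os.path.splitext(os.path.basename(filename))[0]
	number, date = base.split('-')[:2]
	return number, date

print parse_name("US20010002518A1-20010607.ZIP")
# -> ('US20010002518A1', '20010607')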


ZIP Files & Python

Python is my programming language of choice. It is simple and powerful. Any speed disadvantage is not really felt in large-scale, overnight batch processing (and most modern machines are well up to the task).

Ideally I would like to work with the ZIP files directly, without unzipping the data. For one-level ZIP files (e.g. the 20010607.ZIP files above) we can use ‘zipfile’, a built-in Python module. For example, the following short script ‘walks’ through our ‘Patent Downloads’ directory above and prints out information about each first-level ZIP file.

import os
import zipfile
import logging
logging.basicConfig(filename="processing.log", format='%(asctime)s %(message)s')

exten = '.zip'
top = "/YOURPATH/Patent Downloads"

def print_zip(filename):
	print filename
	try:
		zip_file = zipfile.ZipFile(filename, "r")
		# list filenames

		for name in zip_file.namelist():
			print name,
		print

		# list file information
		for info in zip_file.infolist():
			print info.filename, info.date_time, info.file_size

	except Exception, ex:
		# Log the full traceback and carry on with the next file
		logging.exception("Exception opening file %s", filename)
		return

def step(ext, dirname, names):
	# Callback for os.path.walk: called once per directory visited,
	# with 'names' listing that directory's contents
	ext = ext.lower()

	for name in names:
		if name.lower().endswith(ext):
			print_zip(str(os.path.join(dirname, name)))

# Start the walk
os.path.walk(top, step, exten)

This code is based on that helpfully provided at PythonCentral.io. It lists all the files within each ZIP file, which gives us a first way in to the patent data files.

However, more work is needed. We come up against a problem when we hit the second level of ZIP files (e.g. US20010002518A1-20010607.ZIP): these sit inside the first-level archives, so we cannot simply point zipfile at them by path. We need to think of a way around this so we can actually access the XML.
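
One workaround to try: zipfile accepts any file-like object, not just a path, so we could read the inner ZIP’s bytes out of the outer archive and wrap them in a StringIO object. A rough, untested sketch (the inner path is hypothetical – check the real structure with print_zip above):

import zipfile
from StringIO import StringIO

def open_nested_zip(outer_path, inner_name):
	# Read the inner ZIP into memory and hand zipfile a file-like
	# object instead of a path - no unzipping to disk required
	outer = zipfile.ZipFile(outer_path, "r")
	return zipfile.ZipFile(StringIO(outer.read(inner_name)), "r")

inner = open_nested_zip("/YOURPATH/Patent Downloads/2001/20010607.ZIP",
	"UTIL0002/US20010002518A1-20010607.ZIP")
print inner.namelist()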

As a rough example of the scale we are talking about: a scan through 2001 to 2009 listing the second-level ZIP file names took about 2 minutes and created a plain-text document 121.9 MB in size.

Next Time

Unfortunately, this is all for now as my washing machine is leaking and the kids are screaming.

Next time, I will be looking into whether zip_open works for accessing second-level (nested) ZIP files, or whether we need to automate an unzip operation (if our hard drive can take it).

We will also get started on the XML processing within Python using either minidom or ElementTree.
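
As a taster, a minimal ElementTree sketch might look like this (the ‘invention-title’ tag is a guess – the actual element names depend on the DTD for that year, and this assumes the XML parses cleanly without the external entities):

import xml.etree.ElementTree as ET

# Parse a publication's XML file once extracted from its ZIP
tree = ET.parse("US20010002518A1-20010607.XML")
root = tree.getroot()
print root.tag

# 'invention-title' is a guess at the tag name for this DTD
for title in root.iter("invention-title"):
	print title.text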

Until then…
