Can You Mine Patent [Big] Data at Home?

Possibly. Let’s give it a go.

Big data - from DARPA

Data

In my experience, no one has quite realised how amazing this link is. It hosts (courtesy of Google) bulk downloads of patent and trademark data from the US Patent and Trademark Office.

Just think about this for a second.

Here you can download images of most commercial logos used between 1870(!) and the present day. Back in the day, doing image processing and machine learning, I would have given my right arm for such a data set.

Moreover, you get access (eventually) to the text of most US patent publications. Considering there are over 8 million of these, and considering that most new and exciting technologies are the subject of a patent application, this represents a treasure trove of information on human innovation.

Although we are limited to US-based patent publications, this is not a problem. The US is the world’s primary patent jurisdiction – many companies only patent in the US and most inventions of importance (in modern times) will be protected in the US. At this point we are also not looking at precise legal data – the accuracy of these downloads is not ideal. Instead, we are looking at “Big Data” (buzzword cringe) – general patterns and statistical gists from “messy” and incomplete datasets.

Storage

Initially, I started with 10 years’ worth of patent publications: 2001 to 2011. The data from 2001 onwards is pretty reliable; I have been told the OCR data from earlier patent publications is near useless.

An average year is around 60 GBytes of data (zipped!). Hence, we need a large hard drive.

You can pick up a 2TB external drive for about £60. I have heard they crash a lot. You might want to get two and mirror the contents using rsync.

[Update: command for rsync I am using is:

rsync -ruv /media/EXTHDD1/'Patent Downloads'/ /media/EXTHDD2/'Patent Downloads'

where EXTHDD1 and EXTHDD2 are the two USB disk drives.]

Flashgot

Download

I have an unlimited package on BT Infinity (hurray!). A great help in downloading the data is a little Firefox plugin called FlashGot. Install it, select the links of the files you want to download, right-click and choose “FlashGot Selection”. This basically sets off a little wget script that gets each of the links. I set it going just before bed – when I wake up the files are on my hard drive.

The two sets of files that look the most useful are the 2001+ full-text archives and the 2001+ full-text and embedded images. I went for 10 years’ worth of the latter.

Folders (cc: Shoplet Office Supplies)

Data Structure

The structure of the downloaded data is as follows:

  • Directory: Patent Downloads
    • Directory: [Year e.g. 2001] – Size ~ 65GB
      • [ZIP File – one per week – name format is date e.g. 20010607.ZIP] – Size ~ 0.5GB
        • Directory: DTDS [Does what it says on the tin – maybe useful for validation but we can generally ignore for now]
        • Directory: ENTITIES [Ditto – XML entities]
        • Directories: UTIL[XXXX] [e.g. UTIL0002, UTIL0003 – these contain the data] – Size ~ 50-100MB
          • [ZIP Files – one per publication – name format is [Publication Number]-[Date].ZIP e.g. US20010002518A1-20010607.ZIP] – Size ~ 50-350KB
            • [XML File for the patent publication data – name format is [Publication Number]-[Date].XML e.g. US20010002518A1-20010607.XML] – Size ~ 100KB
            • [TIF Files for the drawings – name format is [Publication Number]-[Date]-D[XXXXX].TIF where XXXXX is the drawing number e.g. US20010002518A1-20010607-D00012.TIF] – Size ~ 20KB

[Update: this structure varies a little for 2004+ – there are a few extra layers of directories between the original zipped folder and the actual XML.]
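
Because the file names carry the publication number, date and drawing number, they can be picked apart without opening anything. Here is a minimal sketch of that idea. The pattern is an assumption based only on the example names above (it may need loosening for other years or kind codes), and parse_name is a hypothetical helper name.

```python
import re
from datetime import datetime

# Pattern assumed from the example names in the listing above, e.g.
# US20010002518A1-20010607.XML or US20010002518A1-20010607-D00012.TIF
NAME_RE = re.compile(
    r"^(?P<pub>US\d+[A-Z]\d)-(?P<date>\d{8})"
    r"(?:-D(?P<drawing>\d{5}))?\.(?P<ext>ZIP|XML|TIF)$",
    re.IGNORECASE,
)

def parse_name(filename):
    """Return the parts of a publication file name, or None if it does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    drawing = match.group("drawing")
    return {
        "publication": match.group("pub").upper(),
        "date": datetime.strptime(match.group("date"), "%Y%m%d").date(),
        # Drawing number is only present for TIF files
        "drawing": int(drawing) if drawing is not None else None,
        "ext": match.group("ext").upper(),
    }
```

Calling parse_name("US20010002518A1-20010607-D00012.TIF") should give the publication number, a date of 7 June 2001 and drawing number 12; names that do not fit the assumed pattern return None, which is handy for spotting the 2004+ variations.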

The original ZIPs

ZIP Files & Python

Python is my programming language of choice. It is simple and powerful. Any speed disadvantage is not really felt for large scale, overnight batch processing (and most modern machines are well up to the task).

Ideally I would like to work with the ZIP files directly without unzipping the data. For one-level ZIP files (e.g. the 20010607.ZIP files above) we can use ‘zipfile’, a built-in Python module. For example, the following short script ‘walks’ through our ‘Patent Downloads’ directory above and prints out information about each first-level ZIP file.

import os
import zipfile
import logging
logging.basicConfig(filename="processing.log", format='%(asctime)s %(message)s')

exten = '.zip'
top = "/YOURPATH/Patent Downloads"

def print_zip(filename):
    print(filename)
    try:
        with zipfile.ZipFile(filename, "r") as zip_file:
            # list filenames
            for name in zip_file.namelist():
                print(name, end=' ')
            print()

            # list file information
            for info in zip_file.infolist():
                print(info.filename, info.date_time, info.file_size)

    except Exception:
        # Log the error (with traceback) and carry on with the next file
        logging.exception("Exception opening file %s", filename)

def step(ext, dirname, names):
    ext = ext.lower()

    for name in names:
        if name.lower().endswith(ext):
            print_zip(os.path.join(dirname, name))

# Start the walk (os.walk is the Python 3 replacement for os.path.walk)
for dirname, _subdirs, names in os.walk(top):
    step(exten, dirname, names)

This code is based on that helpfully provided at PythonCentral.io. It lists all the files in the ZIP file. Now we have a start at a way to access the patent data files.

However, more work is needed. We come up against a problem when we hit the second level of ZIP files (e.g. US20010002518A1-20010607.ZIP). These sit inside the outer archive rather than on disk, so they cannot simply be opened by path with zipfile in the same way. We need to think of a way around this so we can actually access the XML.

As a rough example of the scale we are talking about – a scan through 2001 to 2009 listing the second-level ZIP file names took about 2 minutes and created a plain-text document 121.9 MB long.

Next Time

Unfortunately, this is all for now as my washing machine is leaking and the kids are screaming.

Next time, I will be looking into whether zip_open works to access second-level (nested) ZIP files or whether we need to automate an unzip operation (if our hard drive can take it).
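
Another option that may be worth trying: zipfile.ZipFile accepts a file-like object as well as a path, so an inner ZIP can be read into memory and wrapped in io.BytesIO without unpacking anything to disk. The following is a rough sketch only, not something tested against the real downloads. iter_nested_xml is a hypothetical helper name, and it assumes the inner archives are small enough to hold in memory (which the ~50-350KB sizes above suggest).

```python
import io
import zipfile

def iter_nested_xml(outer_path):
    """Yield (inner_zip_name, xml_name, xml_bytes) for each XML in each nested ZIP."""
    with zipfile.ZipFile(outer_path) as outer:
        for inner_name in outer.namelist():
            if not inner_name.lower().endswith(".zip"):
                continue
            # Read the inner archive into memory and wrap it as a file-like object
            inner_bytes = io.BytesIO(outer.read(inner_name))
            with zipfile.ZipFile(inner_bytes) as inner:
                for member in inner.namelist():
                    if member.lower().endswith(".xml"):
                        yield inner_name, member, inner.read(member)
```

In use, something like `for zip_name, xml_name, data in iter_nested_xml("/YOURPATH/Patent Downloads/2001/20010607.ZIP")` would hand back the raw XML bytes for each publication, ready for whichever parser we settle on.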

We will also get started on the XML processing within Python using either minidom or ElementTree.

Until then…

Existential [Trainee] Priority Crisis

Every so often you get a case that needs to be filed on the last day of the one-year priority period. However, when this happens you need to know how long a year is.

“FOOL!” You may shout.

But no, does a one-year period include or exclude a day of a starting event? I.e. if you file a first application on 1 January 2013, do you have until 1 January 2014 INCLUSIVE to file a priority-claiming application? Or must the priority-claiming application be filed BY 1 January 2014 EXCLUSIVE, i.e. by 31 December 2013? Trainees may stumble here.

Confusingly, the patent legislation in Europe and the UK is not entirely helpful. To get an answer you need to go old skool: back to 1883 and the Paris Convention for the Protection of Industrial Property.

More precisely, Article 4, paragraph C, clause 2 averts your crisis:

C.

(1) The periods of priority referred to above shall be twelve months for patents and utility models, and six months for industrial designs and trademarks.

(2) These periods shall start from the date of filing of the first application; the day of filing shall not be included in the period.

Hurray! We can file on 1 January 2014.
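
For the arithmetically minded, the rule can be sketched in a few lines of Python. Since the day of filing is excluded, the last day of the period carries the same day-of-month as the filing date, twelve months on. priority_deadline is a hypothetical helper; clamping to the end of a short month (e.g. for a 29 February filing) is my assumption, and the sketch ignores any extension when the office is closed on the last day.

```python
import calendar
from datetime import date

def priority_deadline(first_filing, months=12):
    """Last day of a priority period that starts the day after first_filing."""
    total = first_filing.month - 1 + months
    year = first_filing.year + total // 12
    month = total % 12 + 1
    # Assumption: clamp the day for short months (e.g. 29 February -> 28 February)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(first_filing.day, last_day))
```

Here priority_deadline(date(2013, 1, 1)) gives 1 January 2014, matching the answer above; passing months=6 covers the designs and trademarks case.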

6 Quick Tips for Social Media Success

The link-bait title is only half tongue-in-cheek.

Last night I attended a great little seminar on improving business-to-business social media use run by Bath and Bristol Marketing Network [I cheated a little – it’s a network for “marketing professionals” rather than “marketing amateurs”]. The speakers were Noisy Little Monkey – a digital marketing agency [who I now respect even more knowing they have an office in Shepton Mallet].

The main points that filtered through my fatigued post-5.30pm brain were:

  1. Identify your audience.
  2. Use images/graphics as well as text.
  3. Plan, test, measure, evaluate, repeat.
  4. Social media is not about conversion.
  5. Identify the Twitter geeks who are going to push your content.
  6. Use editorial and event calendars to generate a content plan for a year.
Social Media Drives Growth!
CC: mkhmarketing

Here’s some more detail:

Identify your audience

  • Even better, categorise it.
  • Identify 5-10 groups and write a half-page “persona” for each group.
  • E.g. Michael Smith – manager of a software company – 45 – lives in Hereford with 2 kids.
  • Bear these “personas” in mind when writing content.

Use images/graphics as well as text

Plan, test, measure, evaluate, repeat

  • The tools are there – e.g. X Analytics, Twitter analysis tools like FollowerWonk etc. – build evidence and base strategy on it.
  • Prepare a monthly report that gives traffic/demographic/content statistics.
  • Systematically experiment with variations on format and content and use the above statistics to evaluate. E.g. What topics pique interest? Do images actually make a difference to engagement and sharing?

Social media is not about conversion

  • Sales come from phone calls, website visits, face-to-face encounters. Social media is the noise that pushes people into the sales funnel. It does work.
  • That said, the pressure to make pushy sales is removed.
  • Educating and entertaining become more important.

Identify the Twitter geeks who are going to push your content

  • As in most things, only 1-5% of a group actually drives conversations.
  • For example, on Twitter there are key individuals who are followed by many – if you are looking to get exposure, work out what they like and what makes them tick. Find out what their interests are and aim content at them for retweets, comments and blog conversations.
  • You can identify individuals using tools – you can sort by individuals who have a large number of followers in areas you operate in who are likely to retweet URLs.

Use editorial and event calendars to generate a content plan for a year

  • You might know when IP events are going to be held. You might know when technology events are to be held. You can plan your content (e.g. blog posts) around these.
  • Also you can find out magazine and newspaper editorial calendars (just google “magazine name” + “editorial calendar”) – you can have a yearly plan of when articles are published and fit blog articles into this.