Over Christmas I had a chance to experiment with the European Patent Office’s Online Patent Services (OPS). This is a web service / application programming interface (API) for accessing the EPO’s large patent databases. It has enormous potential.
To get to grips with the system I set myself a simple task: taking a text file of patent publication numbers (my cases), generate a pie chart of the resulting classifications. In true Blue Peter style, here is one I made earlier (it’s actually better in full SVG glory, but WordPress.com does not support the format):

Here is how to do it:
Step 1 – Get Input
Obtain a text file of publication numbers. Most patent management systems (e.g. Inprotech) will allow you to export to Excel. I copied and pasted from an Excel column into a text file, which resulted in a list of publication numbers separated by newline (“\n”) characters.
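For illustration, the input file (I saved mine as “cases.txt”, the filename used by the code later on) simply has one publication number per line, something like this (these numbers are made up):

EP1000000
EP2345678
WO2012/123456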
Step 2 – Register
Register for a free EPO OPS account here: http://www.epo.org/searching/free/ops.html. My account was approved about a day later.
Step 3 – Add an App
Set up an “app” at the EPO Developer Portal. After registering you will receive an email with a link to do this. Generally the link is something like: https://developers.epo.org/user/[your no.]/apps. You will be asked to log in.
Name the “app” something like “myapp” or “testing”. You will then have access to a key and a secret for this “app”. Make a note of these. I copied and pasted them into a “config.ini” file of the form:
[Login Parameters]
C_KEY="[Copied key value]"
C_SECRET="[Copied secret value]"
Step 4 – Read the Docs
Read the documentation, especially ‘OPS version 3.1 documentation – version 1.2.10’. Also see this document for a description of the XML Schema (it may be easier than looking at the schema itself).
Step 5 – Authenticate
Now onto some code. First we need to use that key and secret to authenticate ourselves using OAuth.
I first tried urllib2 in Python, but this was not rendering the POST payload correctly, so I reverted to urllib (with httplib providing the HTTPS connection), which worked. When using this approach I found it easier to store the host and authentication URL as variables in my “config.ini” file. Hence, this file now looked like:
[Login Parameters]
C_KEY="[Copied key value]"
C_SECRET="[Copied secret value]"

[URLs]
HOST=ops.epo.org
AUTH_URL=/3.1/auth/accesstoken
Although object-oriented purists will burn me at the stake, I created a little class wrapper to store the various parameters. This was initialised with the following code:
import ConfigParser
import urllib, urllib2
import httplib
import json
import base64
from xml.dom.minidom import Document, parseString
import logging
import time

class EPOops():
    def __init__(self, filename):
        #filename is the filename of the list of publication numbers

        #Load Settings
        parser = ConfigParser.SafeConfigParser()
        parser.read('config.ini')
        self.consumer_key = parser.get('Login Parameters', 'C_KEY')
        self.consumer_secret = parser.get('Login Parameters', 'C_SECRET')
        self.host = parser.get('URLs', 'HOST')
        self.auth_url = parser.get('URLs', 'AUTH_URL')

        #Set filename
        self.filename = filename

        #Initialise list for classification strings
        self.c_list = []

        #Initialise new dom document for classification XML
        self.save_doc = Document()
        root = self.save_doc.createElement('classifications')
        self.save_doc.appendChild(root)
The authentication method was then as follows:
    def authorise(self):
        b64string = base64.b64encode(":".join([self.consumer_key, self.consumer_secret]))
        logging.error(self.consumer_key + self.consumer_secret + "\n" + b64string)

        #urllib2 method was not working - returning an error that grant_type was missing
        #request = urllib2.Request(AUTH_URL)
        #request.add_header("Authorization", "Basic %s" % b64string)
        #request.add_header("Content-Type", "application/x-www-form-urlencoded")
        #result = urllib2.urlopen(request, data="grant_type=client_credentials")

        logging.error(self.host + ":" + self.auth_url)

        #Use urllib method instead - this works
        params = urllib.urlencode({'grant_type': 'client_credentials'})
        req = httplib.HTTPSConnection(self.host)
        req.putrequest("POST", self.auth_url)
        req.putheader("Host", self.host)
        req.putheader("User-Agent", "Python urllib")
        req.putheader("Authorization", "Basic %s" % b64string)
        req.putheader("Content-Type", "application/x-www-form-urlencoded;charset=UTF-8")
        req.putheader("Content-Length", "29")
        req.putheader("Accept-Encoding", "utf-8")
        req.endheaders()
        req.send(params)

        resp = req.getresponse()
        params = resp.read()
        logging.error(params)
        params_dict = json.loads(params)
        self.access_token = params_dict['access_token']
This results in an access token you can use to access the API for 20 minutes.
Step 6 – Get the Data
Once authentication is sorted, getting the data is pretty easy.
This time I used the newer urllib2 library. The URL was built as a concatenation of a static look-up string and the publication number as a variable.
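For example, with a (made-up) publication number EP1000000, the resulting URL is https://ops.epo.org/3.1/rest-services/published-data/publication/epodoc/EP1000000/biblio (the “/biblio” ending is explained below).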
The request uses an “Authorization” header with a “Bearer” value containing the access token. You also need to add some error handling for when your allotted 20 minutes runs out – I looked for an error message mentioning an invalid access token and re-performed the authentication if this was detected.
I was looking at “Biblio” data. This returned the classifications without the added overhead of the full-text and claims. The response is XML constructed according to the schema described in the Docs above.
The code for this is as follows:
    def get_data(self, number):
        #Build the request URL from the static look-up string, the publication number and the data type
        data_url = "/3.1/rest-services/published-data/publication/epodoc/"
        request_type = "/biblio"
        request = urllib2.Request("https://ops.epo.org" + data_url + number + request_type)
        request.add_header("Authorization", "Bearer %s" % self.access_token)

        try:
            resp = urllib2.urlopen(request)
        except urllib2.HTTPError, error:
            error_msg = error.read()
            #If the access token has expired, re-authenticate and retry
            if "invalid_access_token" in error_msg:
                self.authorise()
                #Refresh the Authorization header with the new token before retrying
                request.add_header("Authorization", "Bearer %s" % self.access_token)
                resp = urllib2.urlopen(request)

        #Read the returned XML (as a string) from resp
        XML_data = resp.read()
        return XML_data
Step 7 – Parse the XML
We now need to play around with the returned XML. Python offers a couple of libraries to do this, including Minidom and ElementTree. ElementTree is preferred for memory-management reasons, but I found the iter() / getiterator() methods to be a bit dodgy in the version I was using, so I fell back on Minidom.
As the “Biblio” data includes all publications (e.g. A1, A2, A3, B1 etc), I selected the first publication in the data for my purposes (otherwise there would be a duplication of classifications). To do this I selected the first “<exchange-document>” tag and its child tags.
As I was experimenting, I actually extracted the classification data as two separate types: text and XML. Text data for each classification, simply a string such as “G11B 27/ 00 A I”, can be found in the “<classification-ipcr>” tag. However, when looking at different levels of classification this single string was a bit cumbersome. I thus also dumped an XML tag – “<patent-classification>” – containing a structured form of the classification, with child tags for “<section>”, “<class>”, “<subclass>”, “<main-group>” and “<subgroup>”.
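For illustration, the structured classification element is roughly of this shape (simplified – other child elements and attributes are omitted; the values correspond to the example string above):

<patent-classification>
    <section>G</section>
    <class>11</class>
    <subclass>B</subclass>
    <main-group>27</main-group>
    <subgroup>00</subgroup>
</patent-classification>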
My function saved the text data in a list and the extracted XML in a new XML document. This allowed me to save these structures to disk, mainly so that I could pick up at a later date without continually hitting the EPO data servers.
The code is here:
    def extract_classification(self, xml_str):
        #Parse the returned XML string
        dom = parseString(xml_str)

        #Select first publication for classification extraction
        first_pub = dom.getElementsByTagName('exchange-document')[0]

        #Text form: childNodes[1] is the element holding the classification string
        #(childNodes[0] is a whitespace text node)
        self.c_list = self.c_list + [node.childNodes[1].childNodes[0].nodeValue for node in first_pub.getElementsByTagName('classification-ipcr')]

        #Structured form: copy each <patent-classification> element into our save document
        for node in first_pub.getElementsByTagName('patent-classification'):
            self.save_doc.firstChild.appendChild(node)
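For comparison, selecting the first exchange document with ElementTree (the alternative library mentioned above) would look roughly like the sketch below. Treat it as a sketch only: the namespace URI is an assumption that should be checked against the actual OPS response, and iter() is one of the methods I found unreliable, which is why this post sticks with Minidom.

import xml.etree.ElementTree as ET

#Assumed namespace for the exchange documents - check the actual OPS response
NS = "{http://www.epo.org/exchange}"

def first_exchange_document(xml_str):
    root = ET.fromstring(xml_str)
    #iter() walks the whole tree; return the first matching element
    for doc in root.iter(NS + "exchange-document"):
        return doc
    return None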
Step 8 – Wrap It All Up
The above code needed a bit of wrapping to load the publication numbers from the text file and to save the text list and XML containing the classifications. This is straightforward and shown below:
    def total_classifications(self):
        number_list = []

        #Get list of publication numbers
        with open("cases.txt", "r") as f:
            for line in f:
                number_list.append(line.replace("/", "")) #This gets rid of the slash in PCT publication numbers

        for number in number_list:
            XML_data = self.get_data(number.strip())
            #time.sleep(1) - might want this to be nice to EPO 🙂
            self.extract_classification(XML_data)

        #Save list to file
        with open("classification_list.txt", "wb") as f:
            f.write("\n".join(str(x) for x in self.c_list))

        #Save xmldoc to file
        with open("save_doc.xml", "wb") as f:
            self.save_doc.writexml(f)
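For completeness, the class can then be run with a few lines like these (a minimal sketch using the class and methods defined above):

if __name__ == "__main__":
    epo = EPOops("cases.txt")
    epo.authorise()
    epo.total_classifications()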
Step 9 – Counting
Once I had the XML data containing the classifications, I wrote a little script to count the various classifications at each level for charting. This involved parsing the XML and counting unique occurrences of the strings representing different levels of classification. For example, the “section” level has values such as “G” and “H”. The next level, “class”, was counted by looking at a string made up of “section” + “class”, e.g. “G11”. The code is here:
from xml.dom.minidom import parse
import logging, pickle, pygal
from pygal.style import CleanStyle

#create list of acceptable tags - tag_group - then do if child.tagName in tag_group

#initialise upper counting dict
upper_dict = {}

#initialise list of tags we are interested in
tags = ['section', 'class', 'subclass', 'main-group', 'subgroup']

with open("save_doc.xml", "r") as f:
    dom = parse(f)

#Get each patent-classification element
for node in dom.getElementsByTagName('patent-classification'):
    #Initialise classification string to nothing
    class_level_val = ""
    logging.error(node)

    #for each component of the classification
    for child in node.childNodes:
        logging.error(child)
        #Filter out "text nodes" with newlines
        if child.nodeType != 3 and len(child.childNodes) > 0:
            #Check for required tagNames - only works if element has a tagName
            if child.tagName in tags:
                #if no dict for selected component
                if child.tagName not in upper_dict:
                    #make one
                    upper_dict[child.tagName] = {}
                logging.error(child.childNodes)

                #Get current component value as concatenation of previous values
                class_level_val = class_level_val + child.childNodes[0].nodeValue

                #If value is in current component dict
                if class_level_val in upper_dict[child.tagName]:
                    #Increment
                    upper_dict[child.tagName][class_level_val] += 1
                else:
                    #Create a new entry
                    upper_dict[child.tagName][class_level_val] = 1

print upper_dict

#Need to save results
with open("results.pkl", "wb") as f:
    pickle.dump(upper_dict, f)
The last lines print the resulting dictionary and then save it in a file for later use. After looking at the results, it was clear that past the “class” level the data was not that useful for a high-level pie chart: there were many counts of ‘1’ and a few larger clusters.
Step 10 – Charting
I stumbled across Pygal a while ago. It is a simple little charting library that produces some nice-looking SVG charts. Another alternative is ‘matplotlib’.
The methods are straightforward. The code below puts a rim on the pie-chart with a breakdown of the class data.
#Draw pie chart
pie_chart = pygal.Pie(style=CleanStyle)
pie_chart.title = 'Classifications for Cases (in %)'

#Get names of different sections for pie-chart labels
sections = upper_dict['section']

#Get values from second level - class
classes = upper_dict['class']
class_values = classes.keys() #list of different class values

#Iterate over keys in our section results dictionary
for k in sections.keys():
    #check if key is in class key, if so add value to set for section
    #Initialise list to store values for each section
    count_values = []
    for class_value in class_values:
        if k in class_value:
            #class key - need to iterate from class keys
            #Add to list for k
            #append_tuple = (class_value, classes[class_value]) - doesn't work
            count_values.append(classes[class_value])
            #count_values.append(append_tuple)
    pie_chart.add(k, count_values)

pie_chart.render_to_file('class_graph.svg')
That’s it. We now have a file called “class_graph.svg” that we can open in our browser. The result is shown in the pie chart above, which shows the subject areas in which I work, mainly split between sections G and H. The complete code can be found on GitHub: https://github.com/benhoyle/EPOops.
Going Forward
The code is a bit hacky, but it is fairly easy to refine into a production-ready method. Options and possibilities are:
- Getting the data from a patent management system directly (e.g. via an SQL connection in Python).
- Adding the routine as a dynamic look-up on a patent attorney website – e.g. on a Django or Flask-based site (a rough Flask sketch is given after this list).
- Looking up classification names using the classification API.
- The make-up of a representative’s cases would change fairly slowly (e.g. once a week for an update). Hence, you could easily cache most of the data, requiring few look-ups of EPO data (the limit is 2.5GB/week for a free account).
- Doing other charting – for example you could plot countries on Pygal’s world map.
- Adapting the routine for applicants / representatives, using EPO OPS queries to retrieve the publication numbers or XML to process.
- Looking at more complex requests – for example, full-text data could be retrieved and imported into natural language processing libraries.
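As a flavour of the website option above, the look-up could be wrapped in a few lines of Flask. The sketch below assumes the EPOops class above has been saved in a module called epoops; it is only a starting point – you would want caching, rate-limiting and proper error handling before exposing anything like this:

from flask import Flask, jsonify
from epoops import EPOops  #assumes the class above is saved as epoops.py

app = Flask(__name__)
epo = EPOops("cases.txt")
epo.authorise()

@app.route("/classifications/<number>")
def classifications(number):
    #Reset the accumulating list so each request only returns its own results
    epo.c_list = []
    xml_data = epo.get_data(number)
    epo.extract_classification(xml_data)
    return jsonify(number=number, classifications=epo.c_list)

if __name__ == "__main__":
    app.run(debug=True)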