Archive for the ‘Automated Law’ Category

Finding a good patent attorney (or patent client) is a lot like dating.


Once upon a time, dates were centred around [the golf course / an elite educational establishment alumni group / the locker room / a City gentleman’s club]* (delete as appropriate).

Dates were also primarily a male affair, typically conducted among greying men in suits and ties.

However, we now live in the 21st century. We have at our disposal the data to make much better matches.

Finding Companies


There are several free public lists that you can use to find companies.

From these lists you can collate a large set of companies that may or may not require intellectual property services. I prefer a long CSV list with no fancy formatting.

Matching by Technology

Most companies specialise in particular areas of technology. Likewise, most patent attorneys have specific experience in certain technologies. A good technology match saves time and money.

One way to match by technology is to use the International Patent Classification.

If you have lots of time (or a work experience student, or a Mechanical Turk) you can take each company from your list, one by one, and perform a search on Espacenet. You can then look through the results and make a note of the classifications of the patent applications returned by the search.

If you have no time, but a geeky interest in Python, you can automate this using the excellent EPO Online Patent Services.


Through a few hacky functions (which can be found on GitHub; one step is sketched after this list), you can:

  • Iterate through a large list of companies / applicants;
  • Clean the company / applicant name to ensure relevant search results;
  • Process the search results to extract the classifications;
  • Process the search results to determine the patent agent of record;
  • Process the classifications to build up a technology profile for each company / applicant; and
  • Process the classifications to rank companies / applicants within a particular technology area.
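
As an illustration of the last two items, here is a minimal sketch of the profile-building step. It assumes you have already gathered IPC subclass strings from OPS search results for one applicant; the function and variable names are illustrative, not the actual GitHub code.

from collections import Counter

def technology_profile(classifications):
	# classifications: a list of IPC subclasses parsed from search results,
	# e.g. 'C08F' (organic macromolecular compounds), 'B04B' (centrifuges)
	return Counter(classifications)

profile = technology_profile(['C08F', 'C08F', 'B04B', 'C08F'])
print(profile.most_common(2))  # [('C08F', 3), ('B04B', 1)]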

For example, say you are a patent attorney with 20 years’ worth of experience in organic macromolecular compounds or centrifugal apparatus. Who might you look at helping? How about:

[Image: classification filter results, 2016-10-27]

Or say you wanted to know what technology areas Company X worked in? How about:

[Image: classification profile, 2016-10-26]

(* Quiz: any idea who this may be? Guesses in the comments…)

Or say you work for Company X and you wonder which patent attorneys work for your competitors or in a particular technology area. How about:

[Image: patent agent listing, 2016-10-25]

From here?

By improving matching, e.g. between companies and patent attorneys, we can open up legal services. As the potential of technology grows, legal service provision need not be limited to a small pool of ad-hoc connections. Companies can get a better price by looking outside of expensive traditional patent centres. Work product can be improved as those with the experience and passion for a particular area of technology can be matched with companies that feel the same.

In a previous post, we looked at some measures of patent attorney (or firm) success:

  • Low cost;
  • Minimal mistakes;
  • Timely actions; and
  • High legal success rate.

In this post, we will look at how we can measure these.


Legal Success 

Let’s start with legal success. For legal success rate we identified the following:

  • Case grants (with the caveat that the claims need to be of a good breadth);
  • Cases upheld on opposition (if defending);
  • Cases revoked on opposition (if opposing);
  • Oral hearings won; and
  • Court cases won.

When looking to measure these we come across the following problems:

  • It may be easy to obtain the grant of a severely limited patent claim (e.g. a long claim with many limiting features) but difficult to obtain the grant of a more valuable broader claim (e.g. a short claim with few limiting features).
  • Different technical fields may have different grant rates, e.g. a well-defined niche mechanical field may have higher grant rates than digital data processing fields (some “business method” areas have grant rates < 5 %).
  • Cases are often transferred between firms or in-house counsel. More difficult cases are normally assigned to outside counsel. A drafting attorney may not necessarily be a prosecuting attorney.
  • During opposition or an oral hearing, a claim set may be amended before the patent is maintained (e.g. based on newly cited art). Is this a “win”? Or a “loss”? If an opponent avoids infringement by forcing a limitation to a dependent claim, that may be a win. What if there are multiple opponents?
  • In court, certain claims may be held invalid, certain claims held infringed. How do you reconcile this with “wins” and “losses”?

One way to address some of the above problems is to use a heuristic that assigns a score based on a set of outcomes or outcome ranges. For example, we can categorise an outcome and assign each category of outcome a “success” score. To start this we can brainstorm possible outcomes of each legal event.

To deal with the problem of determining claim scope, we can start with crude proxies such as claim length. If claim length is measured as string length, (1 / claim_length) may be used as a scoring factor. As automated claim analysis develops this may be replaced or supplemented by claim feature or limiting phrase count.

Both these approaches could also be used together, e.g. outcomes may be categorised, assigned a score, then weighted by a measure of claim scope.

For example, in prosecution, we could have the following outcomes:

  • Application granted;
  • Application abandoned; and
  • Application refused.

Application refused is assigned the lowest or a negative score (e.g. -5). Abandoning an application is often a way to limit costs on cases that would otherwise be refused. However, applications may also be abandoned for strategic reasons. This category may be assigned the next lowest or a neutral score (e.g. 0). Getting an application granted is a “success” and so needs a positive score. It may be weighted by claim breadth (e.g. constant / claim_length for the shortest independent claim).
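
As a rough sketch, that prosecution heuristic might look like this in Python (the score values and the weighting constant are the illustrative ones from the text, not calibrated figures):

def prosecution_score(outcome, claim_length=None, constant=1000.0):
	# "granted" is weighted by claim breadth, using 1 / claim_length as a
	# crude proxy (a shorter independent claim is assumed to be broader)
	if outcome == "refused":
		return -5.0
	if outcome == "abandoned":
		return 0.0
	if outcome == "granted":
		return constant / claim_length
	raise ValueError("Unknown outcome: %s" % outcome)

print(prosecution_score("granted", claim_length=250))  # 4.0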

In opposition or other contentious proceedings we need to know whether the attorney is working for, or against, the patent owner. One option may be to set the sign of the score based on this information (e.g. a positive score for the patentee is a negative score for the opponent / challenger). Possible outcomes for opposition are:

  • Patent maintained (generally positive for patentee, and negative for opponent);
  • Patent refused (negative for patentee, positive for opponent).

A patent can be maintained with the claims as granted (a “good” result) or with amended claims (possibly good, possibly bad). As with prosecution we can capture this by weighting a score by the scope of the broadest maintained independent claim (e.g. claim_length_as_granted / claim_length_as_maintained).
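
Continuing the sketch, an opposition score could flip sign with the side represented and be weighted by how much the maintained claim was narrowed (again, the names and numbers are illustrative):

def opposition_score(maintained, for_patentee, len_granted=None, len_maintained=None):
	# Refusal: -1 from the patentee's side, +1 from the opponent's side.
	# Maintenance: weight by claim scope, so claims as granted score 1.0
	# and a narrowed (longer) claim scores proportionately less.
	if not maintained:
		base = -1.0
	else:
		base = float(len_granted) / len_maintained
	return base if for_patentee else -base

print(opposition_score(True, False, len_granted=200, len_maintained=400))  # -0.5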

Oral hearings (e.g. at the UK Intellectual Property Office or the European Patent Office) may be considered a “bonus” to a score or a separate metric, as any outcome would be taken into account by the above legal result.

For UK court cases, we again need to consider whether the attorney is working for or against the patentee. We could have the following outcomes:

  • Patent is valid (all claims or some claims);
  • Patent is invalid (all claims or some claims);
  • Patent is infringed (all claims or some claims);
  • Patent is not infringed (all claims or some claims);
  • Case is settled out of court.

A case that is settled out of court provides little information; it typically reflects a position where both sides have some ground. It is likely better for the patentee than having the patent found invalid, but not as good as having the patent found to be valid and infringed. Similarly, it may be better for a claimant than a patent being found valid but not infringed, but worse than the patent being found invalid and not infringed.

One option for scoring partial validity or infringement (e.g. some claims valid/invalid, some claims infringed/not infringed) is to determine a score for each claim individually. For example, dependent claims may be treated using the shallowest dependency – effectively considering a new independent claim comprising the features of the independent claim and the dependent claim. A final score may be computed by summing the individual scores.

So this could work as a framework to score legal success based on legal outcomes. These legal outcomes may be parsed from patent register data, claim data and/or court reports. There is thus scope for automation.

We still haven’t dealt with the issues of case transfers or different technical fields. One way to do this is to normalise or further weight scores developed using the above framework.

For technical fields, scores could be normalised based on average legal outcomes or scores for given classification groupings. There is a question of whether this data exists (I think it does for US art units; it may be buried in an EP report somewhere; I don’t think it exists for the UK). A proxy normalisation could be used where data is not available (e.g. based on internal average firm or company grant rates) or based on other public data, such as public hearing results.

Transferred cases could be taken into account by weighting by: time case held / time since case filing.

Timely Actions

These may be measured by looking at the dates of event actions. These are often stored in patent firm record systems, or are available in patent register data.

It is worth noting that there are many factors outside the control of an individual attorney. For example, instructions may always be received near a deadline for a particular client, or a company may prefer to keep a patent pending by using all available extensions. The hope is that, as a first crude measure, these should average out over a range of applicants or cases.

For official responses, a score could be assigned based on the difference between the official due date and the date the action was completed. This could be summed over all cases and normalised. This can be calculated from at least EP patent register data (and could possibly be scraped from UKIPO website data).
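
A minimal sketch of such a timeliness measure (the dates here are made up; real dates would come from register or record-system data):

from datetime import date

def days_to_spare(due_date, completed_date):
	# Positive if completed early, negative if late
	return (due_date - completed_date).days

actions = [
	(date(2016, 3, 1), date(2016, 2, 20)),  # 10 days early
	(date(2016, 6, 1), date(2016, 6, 3)),   # 2 days late
]
average = sum(days_to_spare(d, c) for d, c in actions) / float(len(actions))
print(average)  # 4.0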

For internal timeliness, benchmarks could be set, and a negative score assigned based on deviations from these. Example benchmarks could be:

  • Acknowledgements / initial short responses sent within 1 working day of receipt;
  • Office actions reported within 5 working days of receipt;
  • Small tasks or non-substantive work (e.g. updating a document based on comments, replying to questions etc.) performed within 5 working days of receipt / instruction; and
  • Substantive office-action and drafting work (e.g. reviews / draft responses) performed within 4 weeks of instruction.

Minimal Mistakes

This could be measured, across a set of cases, as a function of:

  • a number of official communications issued to correct deviations;
  • a number of requests to correct deficiencies (for cases where no official communication was issued); and/or
  • a number of newly-raised objections (e.g. following the filing of amended claims or other documents).

This information could be obtained by parsing document management system names (to determine communication type / requests), from patent record systems, online registers and/or by parsing examination communications.

Low cost

One issue with cost is that it is often relative: a complex technology may take more time to analyse, and a case with 50 claims will cost more to process than a case with 5. Different companies may also have different charging structures. Further, the costs of individual acts need to be taken in context – a patent office response may seem expensive in isolation but, if it allows grant of a broad claim, it may be better than a series of responses charged at a lower amount.

One proxy for cost is time, especially in a billable hours system. An attorney that obtains the same result in a shorter time would be deemed a better attorney. They would either cost less (if charged by the hour) or be able to do more (if working on a fixed fee basis).

In my post on pricing patent work, we discussed methods for estimating the time needed to perform a task. This involved considering a function of claim number and length, as well as citation number and length. One option for evaluating cost is to calculate the ratio: actual_time_spent / predicted_time_spent and then sum this over all cases.
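
A hedged sketch of that cost measure, assuming actual and predicted hours are available per case (the numbers here are invented):

def cost_ratio(actual_hours, predicted_hours):
	# Below 1.0 = faster than predicted; above 1.0 = over budget
	return actual_hours / predicted_hours

cases = [(4.5, 5.0), (7.0, 6.0), (3.0, 3.5)]  # (actual, predicted) per case
total = sum(cost_ratio(a, p) for a, p in cases)
print(total / len(cases))  # mean ratio across cases, ~0.97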

Another approach is to look at the average number of office actions issued in prosecution – a higher number would indicate a higher lifetime cost. This number could be normalised per classification grouping (e.g. to counter the fact that certain technologies tend to get more objections).

The time taken would need to be normalised by the legal success measures discussed above. Spending no time on any cases would typically lead to very high refusal rates, and so even though a time metric would be low, this would not be indicative of a good attorney. Similarly, doing twice the amount of work may lead to a (small?) increase in legal success but may not be practically affordable. It may be that metrics for legal success are divided by a time spent factor.

Patent billing or record systems often keep track of attorney time. This would be the first place to look for data extraction.

Final Thoughts

An interesting result of this delve into detail is that legal success and cost need to be evaluated together, but that these can be measured independently of timeliness and error, which in turn may be measured independently of each other. Indeed, timeliness and error avoidance may be seen as baseline competences, where deviations are to be minimised.

It would also seem possible, in theory at least, to determine these measures of success automatically, some from public data sources and others from existing internal data. Those that can be determined from public data sources raise the tantalising (and, for some, scary?) possibility of comparing patent firm performance; measures may be grouped by firm or attorney. It is hard to see how a legal ranking based on actual legal performance (as opposed to an ability to wine and dine legal publishers) would be bad for those paying for legal services.

It is also worth raising the old caveat that measurements are not the underlying thing (in a Kantian mode). There are many reasonable arguments about the dangers of metrics, e.g. from the UK health, railways or school systems. These include:

  • the burden of measurement (e.g. added bureaucracy);
  • modifying behaviour to enhance the metrics (e.g. at the cost of that which is not measured or difficult to measure);
  • complex behaviour is difficult to measure – any measurement is a necessarily simplified snapshot of one aspect; and
  • misuse by those in power (e.g. to discriminate or as an excuse or to provide backing for a particular point of view).

These, and more, need to be borne in mind when designing the measures. However, I believe the value of relatively objective measurement in an industry that is far too subjective is worth the risk.

What makes a good patent attorney? This is a question that has been on my mind for a while. The answer I normally get is: “well, you just kind of know, don’t you?” This isn’t very useful for anyone. The alternative is: “it depends”. Again, not very useful. Can we think of any way to at least try to answer the question? (Even if the answer is not perfect.)

The question begets another: “how do we measure success?”


For a company this may be:

  • the broadest, strongest patent (or patent portfolio) obtained at the lowest cost;
  • a patent or patent portfolio that covers their current and future products, and that reduces their UK tax bill; and/or
  • a patent or patent portfolio that gets the company what it asks for in negotiations with third parties.

For an in-house attorney or patent department this may be:

  • meeting annual metrics, including coming in on budget;
  • a good reputation with the board of directors or the C-suite; and/or
  • no surprises.

For an inventor this may be:

  • minimum disruption to daily work;
  • respect from peers in the technology field; and/or
  • recognition (monetary or otherwise) for their hard work.

For a patent firm this may be:

  • a large profit;
  • high rankings in established legal publications; and/or
  • a good reputation with other patent firms and prospective or current clients.

For a partner of a patent firm this may be:

  • a large share of the profit divided by time spent in the office; and/or
  • a low blood pressure reading.

As we can see, metrics of success may vary between stakeholders. However, there do appear to be semi-universal themes:

  1. Low cost (good for a company, possibly bad for patent attorneys);
  2. Minimal mistakes (good for everyone);
  3. Timely actions (good for everyone but sometimes hard for everyone); and
  4. High legal success rate (good for everyone).

High legal success rate (4) may include high numbers of:

  • Case grants (with the caveat that the claims need to be of a good breadth);
  • Cases upheld on opposition (if defending);
  • Cases revoked on opposition (if opposing);
  • Oral hearings won; and
  • Court cases won.

In a future post I will investigate further how these can be measured in practice. I add the caveat that this is not an exhaustive list. However, rather than do nothing out of fear of missing something, I feel it is better to do something, in full knowledge that I have missed things that can be added on iteration.

Cost is interesting, because here we see patent firms directly opposed to their clients. Clients (i.e. companies) typically wish to minimise costs, and patent firms wish to maximise profits, but patent firm profits are derived from client costs. For patent firms (as with normal companies), a client with a high profit margin is both an asset and a risk; the risk being that a patent firm of a similar calibre (e.g. with approximately equal metrics for 2-4 above) could pitch for work with a reduced (but still reasonable) profit margin. In real life there are barriers to switching firms, including the collective knowledge of the company, its products and portfolio, and social relationships and knowledge. However, everything has a price; if costs are too high and competing firms price this sunk knowledge into their charging, it is hard to argue against switching.

There is a flip side for patent firms. If they can maximise 2-4, they can rationalise higher charges; companies have a choice if they want to pay more for a firm that performs better.

On cost there is also a third option. If patent firms have comparable values for 2-4, and they wish to maintain a given profit margin, they can reduce costs through efficiencies. For most patent firms, costs are proportional to patent attorney time: reduce the time it takes to do a job and costs fall. The question is then: how do we reduce time spent on a matter while maintaining high quality, timeliness and success? This is where intelligence, automation and strategy can reap rewards.

In-house, the low-cost aim still applies, where departmental cost may be measured as the number of patent attorneys needed, or as outside-counsel spend, compared against a defined budget.

In private practice, and especially in the US, we often see an inverse of this measurement: a “good” patent attorney (from a patent firm perspective) is someone who maximises hourly billings and minimises write-downs, while anecdotally maintaining an adequate level for 2-4. One problem is that maximising hourly billings often leads to compromise on at least 2 and 3; large volumes of work, long hours, and high stress are not conducive to quality work. This is why I have an issue with hourly billing. As a baseline, some profit is required, otherwise the business would not be successful. Further, a baseline of profit can be set, e.g. allowing for a partner salary of X times the most junior rate, an investment level of Y%, a bonus pool for extra work performed, etc. However, beyond that, the level of profit is a factor to maximise subject to constraints, i.e. 1-4 above, where the constraints take priority. The best solution is to align profit with the constraints, such that maximising 1-4 maximises profit. That way everyone benefits. How we can do this will be the subject of a future post.

So, let’s return to our original question: what makes a good patent attorney?

From the above, we see it is a patent attorney that at least makes minimal mistakes, operates in a timely manner, has a high legal success rate and provides this at a low cost. In private practice, it is also a patent attorney that aligns profit with these measures.

One source of frustration with a time-based charging structure (“billable hours”) is that it is difficult to accurately estimate how long a piece of work will take. This post looks at ways we can address this. (Or at least puts down my thoughts on virtual paper.)

Many professional services are priced based on an hourly (or day) rate. This is true of private practice patent attorneys. Although there are critics, the persistence of the billable hour suggests it may be one of the least worst systems available.

Most day-to-day patent work in private practice consists of relatively small items of work. Here “small” means around £1k to £10k, as compared to the £1m cases or transactions of large law firms. These small items of work typically stretch over a few weeks or months.

When performing patent work, an unforeseen issue or an overly long publication can easily derail a cost estimate. For example, it is relatively easy to find that a few more hours are needed after looking into an examiner objection or a piece of prior art in more detail. This often presents a lose-lose situation for both attorney and client – the work needs to be done, so either the attorney has to cap their charges in line with the estimate or the client needs to pay above the estimate to complete the job. This is not just an issue for patent attorneys – try comparing any quote from a builder or plumber with the actual cost of the work.

This got me thinking about taxis. They have been around for a while, and recent services like Uber offer you a price on your phone that you then accept. This is a nice system for both customer and driver – the customer gets a set price and likewise the driver gets a fare proportional to her time. Could something like that work for patent work?

For taxi services, the underlying variable is miles (or kilometres depending on your Brexit stance). A cost is calculated by adding a mile-based rate to a basic charge, with minimum and cancellation charges.

For patent work, one underlying variable is words. Take an examination report (or “office action”). The amount of time it takes to respond to novelty and inventive step objections is typically proportional to the length of the patent specification in question, the number of claims and the number of prior art citations.

Now, we can use EPO OPS to retrieve the full text of a patent application, including description and claims. We can also retrieve details of citations and their relevance to the patent application (e.g. category ‘X’, ‘Y’ or ‘A’). I am working on parsing PDF documents such as examination reports to extract the text therein. In any case, this information can be quickly entered from a 5-minute parse of an examination report.

Wikipedia also tells me that an average reading rate for learning or comprehension is around 150-200 words per minute.

This suggests that we can automate a time estimate based on:

  • Words in the description of a published patent application – WPA (based on a need to read the patent application);
  • Number of claims – NCPA (applications with 100s of claims take a lot longer to work on);
  • Words in the claims – WCPA (claims with more words are likely to take more time to understand);
  • For each relevant citation (category ‘X’ or ‘Y’ – this could be a long sparse vector):
    • Words in the description of the citation – WCITx (as you need to read these to deal with novelty or inventive step objections); and
  • A base time estimate multiplied by the number of objections raised – BTo (possibly weighted by type):
    • E.g. x amount of time per clarity objection, y amount of time per novelty objection.

Even better we need not work out the relationship ourselves. We can create a numeric feature vector with the above information and let a machine learning system figure it out. This would work based on a database of stored invoicing data (where an actual time spent or time billed amount may be extracted to associate with the feature vector).
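
As an example, here is a minimal sketch using scikit-learn (the feature values and billed hours are invented; in practice they would come from OPS and invoicing records):

from sklearn.linear_model import LinearRegression

# Each row: [WPA, NCPA, WCPA, citation word count, number of objections]
X = [
	[8000, 15, 300, 6000, 4],
	[12000, 25, 450, 9000, 6],
	[5000, 10, 200, 3000, 2],
]
# Target: hours actually billed for each response (from invoicing data)
y = [6.5, 10.0, 4.0]

model = LinearRegression().fit(X, y)
print(model.predict([[9000, 20, 350, 7000, 5]]))  # estimated hours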

The result would be an automated system for pricing an examination report response based on publicly available data. We could host this on Heroku. By doing this we have just created a marketplace for patent responses – a single weighting could be used by patent firms to set their pricing.

Similar pricing models could also be applied to patent drafting. The cost of a draft may be estimated based on a set length, number of drawings, number of claims, number of independent claims, length of invention disclosure and length of known prior art. The variables for a response are similar for a European opposition or an appeal, just with different weights and expanded feature vectors to cover multiple parties.

This would yield a compromise between billable hours and fixed fees. For example, a variable yet fair fixed fee estimate may be generated automatically before the work is performed. The client gets predictability and consistency in pricing. The attorney gets paid in proportion to her efforts.

One thing I have been trying to do recently is to connect together a variety of information sources. This has inevitably involved Python.

Estonian Snake Pipe by Diego Delso, Wikimedia Commons, License CC-BY-SA 3.0

Due to the Windows-centric nature of business software, I have also needed to set up Python on a Windows machine. Although setting up Python is easy on a Linux machine, it is a little more involved for Windows (understatement). Here is how I did it.

  • First, download and install one of the Python Windows installers from here. As I am using several older modules I like to work with version 2.7 (the latest release is 2.7.8).
  • Second, if connecting to a Microsoft SQL database, install the Python ODBC module. I downloaded the 32-bit version for Python 2.7 from here.
  • Third, I want to install IPython as I find a notebook is the best way to experiment. This is a little long-winded. Download the ez_setup.py script as described and found here. I downloaded it into my Python directory. Next run the script from that directory (e.g. python ez_setup.py). Then add the Python scripts directory to your Environment Variables as per here. Then install IPython using the command: easy_install ipython[all].
  • Fourth, download a Windows installer for Numpy and Pandas from here. I downloaded the 32-bit versions for Python 2.7. Run the installers.

Doing this I can now run an IPython notebook (via the command: ipython notebook – this will open a window in your default browser). I found Pandas gave me an error on the initial import as dateutil was missing – this was fixed by running the command: easy_install python-dateutil.

Now the aim is to connect the European Patent Office’s databases of patent and legal information to internal SQL databases and possibly external web services such as the DueDil API.

As you may remember, a while back I posted some ideas for a patent workflow tool. It is taking a while, what with actual work and family commitments. However, I finally have a rough-and-ready prototype covering at least the initial review stage.

The application* is built in Flask. It generates an XML document containing the entered data. Fields are rendered based on the XML document (making use of XSLT). To avoid file system headaches, XML data is stored as string data in an SQLite3 database. The data is indexed using a hash masquerading as a key identifier. The key identifier can then be passed as a URL parameter to retrieve a particular XML document. Although nowhere near a fully working “thing”, the code is here if you want a look: https://github.com/benhoyle/attass .
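
For a flavour of the pattern, here is a simplified sketch (not the actual attass code – the routes, table layout and hashing choice are illustrative):

import hashlib
import sqlite3
from flask import Flask, request

app = Flask(__name__)

def get_db():
	# XML is stored as string data; the table is created on first use
	db = sqlite3.connect("xmldata.db")
	db.execute("CREATE TABLE IF NOT EXISTS docs (key TEXT PRIMARY KEY, xml TEXT)")
	return db

@app.route("/save", methods=["POST"])
def save():
	xml = request.form["xml"]
	# A hash masquerading as a key identifier
	key = hashlib.sha1(xml.encode("utf-8")).hexdigest()
	db = get_db()
	db.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)", (key, xml))
	db.commit()
	return key

@app.route("/load")
def load():
	# The key identifier is passed as a URL parameter, e.g. /load?key=abc123
	key = request.args.get("key", "")
	row = get_db().execute("SELECT xml FROM docs WHERE key = ?", (key,)).fetchone()
	return row[0] if row else ("Not found", 404)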

Initial Review: Process Overview in Pictures

First we enter our case reference:

Enter Case Reference

Then we enter the communication details and the objections raised:

Communication Overview

Then we briefly enter salient details of each objection raised in the communication. This can be used for reporting and as a reference for a later, more detailed review:

Enter First Objection

There is an option to enter further objections under the same category (see the lower checkbox). This adds an additional XML element and populates it with data from a template. Once submit is pressed, fields for a next objection will load:

Enter Second Objection

The result is then a populated XML document that forms the starting point for a response:

XML Document

Where from here?

I have similar review workflows in progress for novelty and inventive step. They follow a similar pattern: an XML template defines data entry and works the user through a number of review steps. I have JavaScript functions that break down a claim into features – I can use these as a front-end for my novelty review. Inventive step has a different process for each of the UK, Europe and the US, where each process incorporates the current practice from case law.

The aim is that objections entered in the initial review will be addressed through a detailed review and/or instructions. As we initially enter the objections, we do not have to worry about missing objections or approaching things in a less efficient order. The workflow also allows a response to be split into a number of modular processes. These are then ripe for outsourcing, e.g. to paralegal staff or trainees, allowing an attorney to concentrate their time on the “meat” of the objections and thus saving money for clients. The workflow also provides mental scaffolding that is perfect for trainees and/or sleep-deprived attorneys with young children/dogs.

I use a combination of Evernote, Remember the Milk and Trello to jot down ideas, plan and set out to-do lists. Currently pending are:

  • Map XML to more user-friendly form fields;
  • Sort the loading of existing data;
  • Sort the CSS for that tiny textarea;
  • Add some JavaScript time-savers to the front-end (e.g. that allow a user to click “same communication” for multiple objections);
  • Build an XSL file that transforms the result of the initial review to text for storage or reporting;
  • Work out how to use cloud storage APIs to automatically save a copy of the above to a document management system;
  • Add detailed review workflow, including bespoke processes for novelty, inventive step and patentability/excluded subject matter;
  • Add easy “report bug/suggest feature” reporting for iterative updates; and
  • Host on a £30 Raspberry Pi in the office.

* Aside: ‘web-site’/’web-app’/’app’/’application’ are all kind of the same thing. A “web-site” was traditionally a static site that hosted HTML documents. No-one really does that any more though; nearly all sites are built dynamically, making them more like a traditional client-server application (especially with JavaScript on the front-end and Python or similar on the back-end).

I have been playing with natural language processing.

Now I have a body of patent data (see here), I can do some interesting things. For example, most people would say that patents have a pretty specific terminology. I say: show me the data.

Taking all patent publications in 2001 as an example, I programmed a little routine (sketched after this list) that:

  • Extracted the text data of each patent publication;
  • Split the text data into words;
  • Filtered the words for non-words (e.g. punctuation etc.);
  • Applied a stemming algorithm (from 1979!); and
  • Recorded the frequency distribution of the results.
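
A condensed sketch of that routine, using the Porter stemmer from NLTK (here ‘text’ stands in for the extracted text of one publication):

import re
from collections import Counter
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
freq = Counter()

def count_stems(text):
	# Lowercase, keep alphabetic "words" only, stem, and count
	words = re.findall(r"[a-z]+", text.lower())
	freq.update(stemmer.stem(word) for word in words)

count_stems("A method of operating the device, the device comprising...")
print(freq.most_common(5))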

In total I counted 277,493,492 occurrences of 287,455 unique word stems.

In common with most written material, the top 100 words accounted for 50% of the published material. Amazing when you think about it.

(Next time you get a drafting bill from a patent attorney, complain that half their work is shuffling 100 words around :)).

Here is the graph (click to zoom for full glory).

Cumulative Percentage of Top 100 Words (click for full-size)

Patent Stopwords

There is more.

“Stopwords” are common words that are often filtered out when analysing documents. The Natural Language Tool Kit provides a set based on a general analysis of written English. These include words such as:

…’did’, ‘doing’, ‘a’, ‘an’, ‘the’, ‘and’,  ‘but’, ‘if’, ‘or’, ‘because’,  ‘as’,  ‘until’,  ‘while’,  ‘of’,  ‘at’,  ‘by’,  ‘for’…

In total there are 127 stopwords in this collection representing high-frequency content that has little lexical use.

I thought it would be interesting to compare these stopwords with the 127 most frequent in our frequency count.

Words that occurred frequently in (US) patent publications but that are not regular stopwords include:

said use first one form invent thi may second data claim wherein accord control signal present devic provid portion includ embodi compris method layer surfac system process exampl step ha shown connect posit prefer oper gener mean inform circuit imag unit time materi also end wa member line film side least select apparatu output element refer receiv describ direct base light section set show substrat contain display view valu part cell two plural group structur number optic electrod input result abov respect region memori plate case differ user

These words will be familiar to most patent professionals. The result of the stemming operation can be seen in certain words, e.g. “oper” – these should be treated as “oper*” – “operates”, “operating”, “operate”, etc. You can see that stemming is not perfect (“thi” may relate to “this”, which has been taken to be a plural form) but it is generally good enough. Without the stemming there would be many different variations of the same word in our counts.

Now this list of “patent stopwords” is useful. Firstly, these words are probably not useful for searching in isolation (we may move on to n-grams later). Secondly, they can be used as a dictionary of sorts for claim drafting. Thirdly, they could be used to distinguish patent text from non-patent text (e.g. as the basis for a feature vector for classification).

The words that occur in patent specifications but also occur in “the real world” are also interesting:

the of a to and in is for be an as by with or are that from which at on can it have such each not when between other into through further more about than will so if then

These can be used as universal stopwords.

Further Fun

There are a number of paths for further analysis:

  • Extend across the whole US patent publication corpus from 2001 to 2014. I may need to optimise my code to do this!
  • Perform a similar analysis for different classification levels – e.g. do patents classified as G have a different vocabulary from those classified as H?
  • Look at infrequent or unique words – How many are there? Are they useful for searching clusters?

Over Christmas I had a chance to experiment with the European Patent Office’s Online Patent Services. This is a web service / application programming interface (API) for accessing the large patent databases administered by the European Patent Office. It has enormous potential.

To get to grips with the system I set myself a simple task: taking a text file of patent publication numbers (my cases), generate a pie chart of the resulting classifications. In true Blue Peter-style, here is one I made earlier (it’s actually better in full SVG glory, but WordPress.com do not support the format):

Classifications for Cases (in %)

Here is how to do it:

Step 1 – Get Input

Obtain a text file of publication numbers. Most patent management systems (e.g. Inprotech) will allow you to export to Excel. I copied and pasted from an Excel column into a text file, which resulted in a list of publication numbers separated by new line (“\n”) elements.

Step 2 – Register

Register for a free EPO OPS account here: http://www.epo.org/searching/free/ops.html. My account was approved about a day later.

Step 3 – Add an App

Setup an “app” at the EPO Developer Portal. After registering you will receive an email with a link to do this. Generally the link is something like: https://developers.epo.org/user/[your no.]/apps. You will be asked to login.

Setup the “app” as something like “myapp” or “testing” etc. You will then have access to a key and a secret for this “app”. Make a note of these. I copied and pasted them into a “config.ini” file of the form:

[Login Parameters]
C_KEY="[Copied key value]"
C_SECRET="[Copied secret value]"

Step 4 – Read the Docs

Read the documentation. Especially ‘OPS version 3.1 documentation – version 1.2.10’. Also see this document for a description of the XML Schema (it may be easier than looking at the schema itself).

Step 5 – Authenticate

Now onto some code. First we need to use that key and secret to authenticate ourselves using OAuth.

I first tried urllib2 in Python, but this was not rendering the POST payload correctly, so I reverted to urllib (in combination with httplib), which worked. When using urllib I found it easier to store the host and authentication URL as variables in my “config.ini” file. Hence, this file now looked like:

[Login Parameters]
C_KEY="[Copied key value]"
C_SECRET="[Copied secret value]"

[URLs]
HOST=ops.epo.org
AUTH_URL=/3.1/auth/accesstoken

Although object-oriented-purists will burn me at the stake, I created a little class wrapper to store the various parameters. This was initialised with the following code:

import ConfigParser
import urllib, urllib2
import httplib
import json
import base64
from xml.dom.minidom import Document, parseString
import logging
import time

class EPOops():

	def __init__(self, filename):
		#filename is the filename of the list of publication numbers

		#Load Settings
		parser = ConfigParser.SafeConfigParser()
		parser.read('config.ini')
		self.consumer_key = parser.get('Login Parameters', 'C_KEY')
		self.consumer_secret = parser.get('Login Parameters', 'C_SECRET')
		self.host = parser.get('URLs', 'HOST')
		self.auth_url = parser.get('URLs', 'AUTH_URL')

		#Set filename
		self.filename = filename

		#Initialise list for classification strings
		self.c_list = []

		#Initialise new dom document for classification XML
		self.save_doc = Document()

		root = self.save_doc.createElement('classifications')
		self.save_doc.appendChild(root)

The authentication method was then as follows:

def authorise(self):
		b64string = base64.b64encode(":".join([self.consumer_key, self.consumer_secret]))
		logging.error(self.consumer_key + self.consumer_secret + "\n" + b64string)
		#urllib2 method was not working - returning an error that grant_type was missing
		#request = urllib2.Request(AUTH_URL)
		#request.add_header("Authorization", "Basic %s" % b64string)
		#request.add_header("Content-Type", "application/x-www-form-urlencoded")
		#result = urllib2.urlopen(request, data="grant_type=client_credentials")
		logging.error(self.host + ":" + self.auth_url)

		#Use urllib method instead - this works
		params = urllib.urlencode({'grant_type' : 'client_credentials'})
		req = httplib.HTTPSConnection(self.host)
		req.putrequest("POST", self.auth_url)
		req.putheader("Host", self.host)
		req.putheader("User-Agent", "Python urllib")
		req.putheader("Authorization", "Basic %s" % b64string)
		req.putheader("Content-Type" ,"application/x-www-form-urlencoded;charset=UTF-8")
		req.putheader("Content-Length", "29")
		req.putheader("Accept-Encoding", "utf-8")

		req.endheaders()
		req.send(params)

		resp = req.getresponse()
		params = resp.read()
		logging.error(params)
		params_dict = json.loads(params)
		self.access_token = params_dict['access_token']

This results in an access token you can use to access the API for 20 minutes.

Step 6 – Get the Data

Once authentication is sorted, getting the data is pretty easy.

This time I used the newer urllib2 library. The URL was built as a concatenation of a static look-up string and the publication number as a variable.

The request uses an “Authorization” header with a “Bearer” value containing the access token. You also need to add some error handling for when your allotted 20 minutes runs out – I looked for an error message mentioning an invalid access token and then re-performed the authentication if this was detected.

I was looking at “Biblio” data. This returned the classifications without the added overhead of the full-text and claims. The response is XML constructed according to the schema described in the Docs above.

The code for this is as follows:

def get_data(self, number):
		data_url = "/3.1/rest-services/published-data/publication/epodoc/"
		request_type = "/biblio"
		request = urllib2.Request("https://ops.epo.org" + data_url + number + request_type)
		request.add_header("Authorization", "Bearer %s" % self.access_token)
		try:
			resp = urllib2.urlopen(request)
		except urllib2.HTTPError, error:
			error_msg = error.read()
			if "invalid_access_token" in error_msg:
				self.authorise()
				resp = urllib2.urlopen(request)

		#parse returned XML in resp
		XML_data = resp.read()
		return XML_data

Step 7 – Parse the XML

We now need to play around with the returned XML. Python offers a couple of libraries to do this, including Minidom and ElementTree. ElementTree is preferred for memory-management reasons, but I found the iter() / getiterator() methods to be a bit dodgy in the version I was using, so I fell back on Minidom.

As the “Biblio” data includes all publications (e.g. A1, A2, A3, B1, etc.), I selected the first publication in the data for my purposes (otherwise there would be a duplication of classifications). To do this I selected the first “<exchange-document>” tag and its child tags.

As I was experimenting, I actually extracted the classification data as two separate types: text and XML. Text data for each classification, simply a string such as “G11B  27/    00            A I”, can be found in the “<classification-ipcr>” tag. However, when looking at different levels of classification this single string was a bit cumbersome. I thus also dumped an XML tag – “<patent-classification>” – containing a structured form of the classification, with child tags for “<section>”, “<class>”, “<subclass>”, “<main-group>” and “<subgroup>”.

My function saved the text data in a list and the extracted XML in a new XML string. This allowed me to save these structures to disk, mainly so that I could pick up at a later date without continually hitting the EPO data servers.

The code is here:

def extract_classification(self, xml_str):
		#extract the classification elements
		dom = parseString(xml_str)
		#Select first publication for classification extraction
		first_pub = dom.getElementsByTagName('exchange-document')[0]
		self.c_list = self.c_list + [node.childNodes[1].childNodes[0].nodeValue for node in first_pub.getElementsByTagName('classification-ipcr')]

		for node in first_pub.getElementsByTagName('patent-classification'):
			self.save_doc.firstChild.appendChild(node)

Step 8 – Wrap It All Up

The above code needed a bit of wrapping to load the publication numbers from the text file and to save the text list and XML containing the classifications. This is straightforward and shown below:

def total_classifications(self):
		number_list = []

		#Get list of publication numbers
		with open("cases.txt", "r") as f:
			for line in f:
				number_list.append(line.replace("/","")) #This gets rid of the slash in PCT publication numbers

		for number in number_list:
			XML_data = self.get_data(number.strip())
			#time.sleep(1) - might want this to be nice to EPO 🙂
			self.extract_classification(XML_data)

		#Save list to file
		with open("classification_list.txt", "wb") as f:
			f.write("\n".join(str(x) for x in self.c_list))

		#Save xmldoc to file
		with open("save_doc.xml", "wb") as f:
			self.save_doc.writexml(f)

Step 9 – Counting

Once I had the XML data containing the classifications, I wrote a little script to count the various classifications at each level for charting. This involved parsing the XML and counting unique occurrences of strings representing different levels of classification. For example, the “section” level has values such as “G” or “H”. The next level, “class”, was counted by looking at a string made up of “section” + “class”, e.g. “G11B”. The code is here:

from xml.dom.minidom import parse
import logging, pickle, pygal
from pygal.style import CleanStyle

#create list of acceptable tags - tag_group - then do if child.tagName in tag_group

#initialise upper counting dict
upper_dict = {}

#initialise list of tags we are interested in
tags = ['section', 'class', 'subclass', 'main-group', 'subgroup']

with open("save_doc.xml", "r") as f:
	dom = parse(f)

#Get each patent-classification element
for node in dom.getElementsByTagName('patent-classification'):
	#Initialise classification string to nothing
	class_level_val = ""
	logging.error(node)
	#for each component of the classification
	for child in node.childNodes:
		logging.error(child)
		#Filter out "text nodes" with newlines
		if child.nodeType != 3 and len(child.childNodes) > 0:  # 3 == TEXT_NODE

			#Check for required tagNames - only works if element has a tagName
			if child.tagName in tags:

				#if no dict for selected component
				if child.tagName not in upper_dict:
					#make one
					upper_dict[child.tagName] = {}
				logging.error(child.childNodes)

				#Get current component value as catenation of previous values
				class_level_val = class_level_val + child.childNodes[0].nodeValue

				#If value is in current component dict
				if class_level_val in upper_dict[child.tagName]:
					#Increment
					upper_dict[child.tagName][class_level_val] += 1
				else:
					#Create a new entry
					upper_dict[child.tagName][class_level_val] = 1

print upper_dict
#Need to save results
with open("results.pkl", "wb") as f:
	pickle.dump(upper_dict, f)

The last lines print the resulting dictionary and then save it in a file for later use. After looking at the results it was clear that past the “class” level the data was not that useful for a high-level pie-chart; there were many counts of ‘1’ and a few larger clusters.

Step 10 – Charting

I stumbled across Pygal a while ago. It is a simple little charting library that produces some nice-looking SVG charts. Another alternative is ‘matplotlib’.

The methods are straightforward. The code below puts a rim on the pie-chart with a breakdown of the class data.

#Draw pie chart
pie_chart = pygal.Pie(style=CleanStyle)
pie_chart.title = 'Classifications for Cases (in %)'

#Get names of different sections for pie-chart labels
sections = upper_dict['section']

#Get values from second level - class
classes = upper_dict['class']
class_values = classes.keys() #list of different class values

#Iterate over keys in our section results dictionary
for k in sections.keys():
	#check if key is in class key, if so add value to set for section

	#Initialise list to store values for each section
	count_values = []
	for class_value in class_values:
		if k in class_value: #class key - need to iterate from class keys
			#Add to list for k
			#append_tuple = (class_value, classes[class_value]) - doesn't work
			count_values.append(classes[class_value])
			#count_values.append(append_tuple)
	pie_chart.add(k, count_values)

pie_chart.render_to_file('class_graph.svg')

That’s it. We now have a file called “class_graph.svg” that we can open in a browser. The result is shown in the pie-chart above, which shows the subject-areas where I work – mainly split between G and H. The complete code can be found on GitHub: https://github.com/benhoyle/EPOops.

Going Forward

The code is a bit hacky, but it is fairly easy to refine into a production-ready method. Options and possibilities are:

  • Getting the data from a patent management system directly (e.g. via an SQL connection in Python).
  • Adding the routine as a dynamic look-up on a patent attorney website – e.g. on a Django or Flask-based site.
  • Looking up classification names using the classification API.
  • The make-up of a representative’s cases would change fairly slowly (e.g. once a week for an update). Hence, you could easily cache most of the data, requiring few look-ups of EPO data (the limit is 2.5GB/week for a free account).
  • Doing other charting – for example you could plot countries on Pygal’s world map.
  • Adapting for applicants / representatives, using EPO OPS queries to retrieve the publication numbers or XML to process.
  • Looking at more complex requests, full-text data could be retrieved and imported into natural language processing libraries.

Possibly. Let’s give it a go.

Big data - from DARPA

Data

In my experience, no one has quite realised how amazing this link is. It is a Google-hosted set of bulk downloads of patent and trademark data from the US Patent and Trademark Office.

Just think about this for a second.

Here you can download images of most commercial logos used between 1870(!) and the present day. Back in the day, when I was doing image processing and machine learning, I would have given my right arm for such a data set.

Moreover, you get access (eventually) to the text of most US patent publications. Considering there are over 8 million of these, and considering that most new and exciting technologies are the subject of a patent application, this represents a treasure trove of information on human innovation.

Although we are limited to US-based patent publications, this is not a problem. The US is the world’s primary patent jurisdiction – many companies only patent in the US, and most inventions of importance (in modern times) will be protected in the US. At this point we are also not looking at precise legal data – the accuracy of these downloads is not ideal. Instead, we are looking at “Big Data” (buzzword cringe) – general patterns and statistical gists from “messy” and incomplete datasets.

Storage

Initially, I started with 10 years’ worth of patent publications: 2001 to 2011. The data from 2001 onwards is pretty reliable; I have been told the OCR data from earlier patent publications is near useless.

An average year is around 60 GBytes of data (zipped!). Hence, we need a large hard drive.

You can pick up a 2TB external drive for about £60. I have heard they crash a lot. You might want to get two and mirror the contents using rsync.

[Update: command for rsync I am using is:

rsync -ruv /media/EXTHDD1/'Patent Downloads' /media/EXTHDD2/'Patent Downloads'

where EXTHDD1 and EXTHDD2 are the two USB disk drives.]

FlashGot

Download

I have an unlimited package on BT Infinity (hurray!). A great help in downloading the data is a little Firefox plugin called FlashGot. Install it, select the links of the files you want to download, right-click and choose the “FlashGot” selection option. This basically sets off a little wget script that gets each of the links. I set it going just before bed – when I wake up the files are on my hard drive.

The two sets of files that look the most useful are the 2001+ full-text archives and the 2001+ full-text and embedded images. I went for 10 years’ worth of the latter.

Folders (cc: Shoplet Office Supplies)

Data Structure

The structure of the downloaded data is as follows:

  • Directory: Patent Downloads
    • Directory: [Year e.g. 2001] – Size ~ 65GB
      • [ZIP File – one per week – name format is date e.g. 20010607.ZIP] – Size ~ 0.5GB
        • Directory: DTDS [Does what it says on the tin – maybe useful for validation but we can generally ignore for now]
        • Directory: ENTITIES [Ditto – XML entities]
        • Directories: UTIL[XXXX] [e.g. UTIL0002, UTIL0003 – these contain the data] – Size ~ 50-100MB
          • [ZIP Files – one per publication – name format is [Publication Number]-[Date].ZIP e.g. US20010002518A1-20010607.ZIP] – Size ~ 50-350KB
            • [XML File for the patent publication data – name format is [Publication Number]-[Date].XML e.g. US20010002518A1-20010607.XML] – Size ~ 100Kb
            • [TIF Files for the drawings –  name format is [Publication Number]-[Date]-D[XXXXX].TIF where XXXXX is the drawing number e.g. US20010002518A1-20010607-D00012.TIF] – Size ~20kB

[Update: this structure varies a little 2004+ – there are a few extra layers of directories between the original zipped folder and the actual XML.]

The original ZIPs

ZIP Files & Python

Python is my programming language of choice. It is simple and powerful. Any speed disadvantage is not really felt for large-scale, overnight batch processing (and most modern machines are well up to the task).

Ideally I would like to work with the ZIP files directly, without unzipping the data. For one-level ZIP files (e.g. the 20010607.ZIP files above) we can use ‘zipfile‘, a built-in Python module. For example, the following short script ‘walks‘ through our ‘Patent Downloads’ directory above and prints out information about each first-level ZIP file.

import os
import zipfile
import logging
logging.basicConfig(filename="processing.log", format='%(asctime)s %(message)s')

exten = '.zip'
top = "/YOURPATH/Patent Downloads"

def print_zip(filename):
	print filename
	try:
		zip_file = zipfile.ZipFile(filename, "r")
		# list filenames

		for name in zip_file.namelist():
			print name,
		print

		# list file information
		for info in zip_file.infolist():
			print info.filename, info.date_time, info.file_size

	except Exception, ex:
		#Log error
		logging.exception("Exception opening file %s", filename)
		return

def step(ext, dirname, names):
	ext = ext.lower()

	for name in names:
		if name.lower().endswith(ext):
			print_zip(str(os.path.join(dirname, name)))

# Start the walk
os.path.walk(top, step, exten)

This code is based on that helpfully provided at PythonCentral.io. It lists all the files in each ZIP file. Now we have the start of a way to access the patent data files.

However, more work is needed. We come up against a problem when we hit the second level of ZIP files (e.g. US20010002518A1-20010607.ZIP). These nested files cannot be opened directly with zipfile in the same way. We need to think of a way around this so we can actually access the XML.
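
One possible way around this is to read the bytes of the nested ZIP into memory and wrap them in a file-like object that zipfile accepts. An untested sketch (the paths are illustrative):

import zipfile
from StringIO import StringIO  # Python 2; io.BytesIO on Python 3

def open_nested_zip(outer_path, inner_name):
	# Read the inner ZIP (e.g. US20010002518A1-20010607.ZIP) out of the
	# outer archive without unzipping anything to disk
	outer = zipfile.ZipFile(outer_path, "r")
	inner_bytes = outer.read(inner_name)
	return zipfile.ZipFile(StringIO(inner_bytes))

# inner = open_nested_zip("20010607.ZIP", "UTIL0002/US20010002518A1-20010607.ZIP")
# xml_data = inner.read(inner.namelist()[0])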

As a rough example of the scale we are talking about – a scan through 2001 to 2009 listing the second-level ZIP file names took about 2 minutes and created a plain-text document 121.9 MB long.

Next Time

Unfortunately, this is all for now as my washing machine is leaking and the kids are screaming.

Next time, I will be looking into whether zip_open works to access second-level (nested) ZIP files or whether we need to automate an unzip operation (if our harddrive can take it).

We will also get started on the XML processing within Python using either minidom or ElementTree.

Until then…

Patent attorneys: we care about the independent claims. An independent claim is a paragraph of text that defines an invention. Each invention has a number of discrete features. Can I build a function to split a claim into its component features?

The answer is possibly. Here is one way I could go about doing it.

First I would start with a JavaScript file: claimAnalysis.js. I would link this to an HTML page: claimAnalysis.html. This HTML page would have a large text box into which the text of an independent claim can be copied and pasted.

On a keyup() or onchange() event I would then run the following algorithm:

  • Get the text from the text box as a string.
  • Set a character placemarker to 0.
  • From the placemarker, find the next character from the set [“,”, “:”, “;”, “-” or new line].
  • Store the characters from the placemarker up to the found character as a string in an array, then move the placemarker past the found character.
  • Repeat the last two steps until a “.” or the end of the text is reached.

From this we should have a rough breakdown of a claim into an array of feature strings. It will not be perfect, but it would make a good start.
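
As a quick mock-up of the splitting logic (in Python here for brevity – the real front-end would be JavaScript, and the delimiter set is the one listed above):

def split_claim(claim_text):
	features = []
	current = ""
	for char in claim_text:
		current += char
		if char in ",:;-\n":
			# Delimiter found: store the portion and move the placemarker on
			features.append(current.strip())
			current = ""
	if current.strip():
		features.append(current.strip())
	return features

print(split_claim("A method for doing something, comprising: receiving data; processing the data."))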

We can then show each located string portion in the array to a user. For example, with JavaScript we can add a table within a form containing input text boxes in rows. Each text box can contain a string portion. We can also add a checkbox to each portion or table row.

The user can then be offered “split” or “join” option buttons.

  • “Split” requires only one feature to be selected.
  • The user is told to place the cursor / select text in the box where they want the split to occur (using the selectionStart property?).
  • Two features are then created based on the cursor position or selected text.
  • “Join” requires more than one feature to be selected via the checkboxes.
  • All selected features are combined into one string portion in one text box, which replaces the previous text boxes (possibly by redrawing the table).

Once any splitting or joining is complete the user can confirm the features. A confirm button could use the POST method to input the features to a PHP script that saves them as XML on the server.

<claim><number>1</number><feature id="1">A method for doing something comprising:</feature>...</claim>