Key Insights from a Year of Working with LLMs (3/4): Technical Implementations

Introduction

This post continues our exploration of insights gained from a year of work with Large Language Models (LLMs). In our previous posts, we discussed general process insights and model-specific observations. Now, we turn our attention to the nuts and bolts of how to actually build LLM-powered applications.

The rapid evolution of LLM capabilities has opened up exciting possibilities for innovation across various industries, including law and business. However, translating these possibilities into practical, efficient, and robust applications is difficult. It feels a bit like the early web. We are always fighting the last war – our tools are optimised for web 2.0 or 3.0 but we’re dealing with web 4.0. There are no textbooks. Everyone is iterating in parallel, and the capabilities of the models are maturing at a similar rate to the tools being built to help speed up development. There are, though, some technologies that have adapted well to the new world of LLM API calls.

In this post, we’ll explore several key areas of technical implementation:

  1. Introduction
  2. Constructing a Robust Tech Stack for LLM-Powered Systems
    1. Local or Web-Based Models?
    2. Backend / Frontend Split
    3. Backend: FastAPI with Async Capabilities
    4. Data Validation and API I/O: Pydantic
    5. Database Interactions: SQLAlchemy or SQLModel
    6. Watch Out
    7. Application Structure: OOP and Separation of Concerns
    8. Frontend Options
    9. The Glue: Docker and Docker Compose
  3. Using Asynchronous Programming
    1. Async for LLM Apps
    2. The Async Advantage
    3. Async Task Queue
    4. Async LLM Clients
      1. OpenAI Async Client
      2. Anthropic Async Client
      3. Benefits of Using Async Clients
    5. Async Libraries for Python
    6. The Async Learning Curve
    7. Testing and Debugging Async Code
    8. The Async Ecosystem is Still Maturing
  4. Parallelisation for High-Quality Output Generation
    1. Why Parallelise?
    2. Parallelisation Strategies
      1. Simple Parallel Requests
      2. Parallel Variations for Quality
      3. Parallel Subtasks
    3. Considerations and Caveats
    4. Implementing Parallelisation
  5. Leveraging LLMs in the Development Process
    1. Setting Up Your LLM Workspace
    2. Leveraging LLMs for Feature Development
    3. Best Practices and Considerations
    4. Real-World Example: Developing a New API Endpoint
    5. The Limits of LLM Assistance
  6. Prompt Tips
    1. Keep it Simple
    2. Storage
    3. Versioning
    4. Make it Easy to Test
      1. Build a Test Dataset
    5. Prompt Logging
  7. Conclusion

Constructing a Robust Tech Stack for LLM-Powered Systems

When it comes to building LLM-powered applications, your choice of tech stack can make the difference between a smooth-running system and a nightmarish tangle of spaghetti code. After a year of trial and error, here’s what I’ve found works well – at least for now.

Local or Web-Based Models?

As discussed before, there really is no competition here. The web-based LLMs are far superior to locally hosted versions, despite the reservations about privacy and the need for web connectivity. You need a hefty machine with a large GPU (24GB+) to host even a mid-tier model (similar to the discussed “mini” model capabilities), and that will typically run slower than, or about the same as, a web-based API call.

Backend / Frontend Split

Like many web developers, I’ve come to structure projects with a backend/frontend split. The backend is Python and implements the calls to the LLMs, supplying data in a structured format. The frontend makes it all look pretty and deals with user interactions. You can build the frontend with Python, but JavaScript-based frameworks are more common in production.

Splitting between backend and frontend means that you can later scale these separately. Backend engineers don’t need to worry as much about user interface design and frontend engineers don’t need to worry about complex processing and LLM-wrangling. You can also more easily outsource frontend design to outside agencies, if you have a well-defined backend that does the clever stuff.

Backend: FastAPI with Async Capabilities

FastAPI has emerged as a go-to framework for building a web API to support LLM applications. It’s fast (as the name suggests), it’s simple, it’s built with async in mind, and it has excellent documentation.

Being the de facto industry standard means there is good support for guidance and advice, both from LLMs themselves and across the web.

The async capabilities are particularly useful when working with LLMs, as we’ll discuss in more detail later.

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def root():
    return {"message": "Hello, LLM World"}

Data Validation and API I/O: Pydantic

If you use FastAPI, you’ll be familiar with the use of Pydantic to define the input/output JSON structures of the API. Pydantic is a godsend when it comes to data validation and serialization. It integrates seamlessly with FastAPI and helps catch a multitude of sins before they propagate through your system.

from pydantic import BaseModel

class LLMRequest(BaseModel):
    prompt: str
    max_tokens: int = 100

@app.post("/generate")
async def generate_text(request: LLMRequest):
    # Your LLM logic here
    pass

Using data classes as structures to move data around interfaces is also becoming common in non-web-facing code. This can help when dealing with database objects, where you need to convert database-bound classes that require a session object into serialised versions that can be cached and passed between different microservices.
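
For example, with Pydantic 2.x you can build a plain, serialisable copy of an ORM object. A minimal sketch, where the UserRead model and the db_user ORM instance are illustrative:

from pydantic import BaseModel, ConfigDict

class UserRead(BaseModel):
    # Allows construction directly from ORM objects (SQLAlchemy/SQLModel)
    model_config = ConfigDict(from_attributes=True)

    id: int
    name: str
    email: str

# db_user is a SQLAlchemy/SQLModel object tied to a session;
# user_read is a plain, serialisable copy you can cache or return from an API
user_read = UserRead.model_validate(db_user)
payload = user_read.model_dump_json()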

Database Interactions: SQLAlchemy or SQLModel

For database interactions, SQLAlchemy has been the standard for years. However, SQLModel, which builds on SQLAlchemy and Pydantic, is worth keeping an eye on. It’s still maturing, but it offers a more intuitive API for those already familiar with Pydantic.

from typing import Optional

from sqlmodel import SQLModel, Field

class User(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    email: str

I’ve found that SQLAlchemy has a more mature set of documentation for async database access (for SQLAlchemy 2.0). SQLModel’s async documentation is a “Still to Come” holding page. Async database patterns are still a little patchy in terms of support and documentation – definitely not “mature”, more on the “experimental” side of things.
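
For reference, here is a minimal sketch of an async SQLAlchemy 2.0 setup that has worked for me; the connection string is a placeholder and assumes the asyncpg driver:

from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker

# Placeholder connection string - needs an async driver such as asyncpg
engine = create_async_engine("postgresql+asyncpg://user:password@localhost/myapp")
async_session = async_sessionmaker(engine, expire_on_commit=False)

async def get_session():
    # FastAPI dependency that yields one async session per request
    async with async_session() as session:
        yield session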

Watch Out

One problem I found is that lots of documentation, tutorials, and examples featuring FastAPI, SQLAlchemy/SQLModel, and Pydantic assume a simple non-async use case where all of your processing can be performed in your FastAPI route methods. This doesn’t work for LLM applications: if you are fetching anything from an LLM via a web API (e.g., via the OpenAI or Anthropic Python clients), it will take too long for synchronous web handling. FastAPI route handlers are designed to return data quickly – say, from a database in under 200ms – but LLM API calls can often take several seconds to complete.

Also, many materials only give you solutions that use the syntax for SQLAlchemy 1.x or Pydantic 1.x, both of which have now been supplanted by 2.x versions.

Application Structure: OOP and Separation of Concerns

When structuring your application, it’s beneficial to separate your business logic from your LLM interaction logic. This makes your code more maintainable and easier to test. Here’s a basic structure I’ve found useful:

my_llm_app/
├── api/
│   ├── routes/
│   └── models/
├── core/
│   ├── config.py
│   └── dependencies.py
├── services/
│   ├── llm_service.py
│   └── business_logic_service.py
├── utils/
└── main.py

As code grows in complexity, there is a need to balance separation of concerns with practical organisation. Real-world projects often have more complex structures than simple examples suggest. Here’s a more comprehensive approach based on actual project structures:

your_project/
├── api/
│   ├── crud/
│   ├── routes/
│   ├── schemas/
│   └── main.py
├── config/
│   ├── __init__.py
│   ├── settings.py
│   └── logging.py
├── database/
│   ├── __init__.py
│   └── db_engine.py
├── logic/
│   ├── chat_processor/
│   ├── database_manager/
│   └── vector_processor/
├── llm/
│   ├── __init__.py
│   ├── embeddings.py
│   ├── prompt_builder.py
│   └── chat_completion.py
├── models/
│   ├── graph/
│   ├── knowledge/
│   └── common.py
├── utils/
├── tests/
│   ├── fixtures/
│   ├── unit/
│   └── integration/
├── migrations/
├── notebooks/
├── docker/
│   ├── Dockerfile.dev
│   ├── Dockerfile.prod
│   └── docker-compose.yml
└── .env

Key aspects of this structure:

  1. API Layer: Separates routing, CRUD operations, and schema definitions.
  2. Config: Centralises configuration management, including environment-specific settings.
  3. Database: Houses database-related code, including code to setup the database and get a session object.
  4. Logic: Core business logic is divided into focused modules (e.g., chat processing, database management).
  5. LLM Integration: A dedicated directory for LLM-related functionality, keeping it separate from other business logic.
  6. Models: Defines data structures used throughout the application.
  7. Utils: For shared utility functions.
  8. Tests: Organised by test type (unit, integration) with a separate fixtures directory.
  9. Docker: Keeps all Docker-related files in one place.

This structure promotes:

  • Modularity: Each directory has a clear purpose, making it easier to locate and maintain code.
  • Scalability: New features can be added by creating new modules in the appropriate directories.
  • Separation of Concerns: LLM logic, business logic, and API handling are kept separate.
  • Testability: The structure facilitates comprehensive testing strategies.

Remember, this structure is flexible. Depending on your project’s needs, you might:

  • Add a services/ directory for external service integrations.
  • Include a frontend/ directory for UI components if it’s a full-stack application.
  • Create a scripts/ directory for maintenance or data processing scripts.

The key is to organise your code in a way that makes sense for your specific project and team. Start with a structure like this, but be prepared to adapt as your project evolves.

Frontend Options

For the frontend, your choice largely depends on your specific needs:

  • React: Great for complex, interactive UIs
  • Streamlit: Excellent for quick prototyping and data-focused applications
  • HTMX: A lightweight option if you want to stick with Python on the frontend

Personally, I’ve found Streamlit to be a lifesaver for rapid prototyping, while React offers the flexibility needed for more complex applications.
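
To give a flavour of the Streamlit end of that spectrum, here is a minimal sketch that fronts the /generate endpoint from the FastAPI example above; the URL and JSON shape are assumptions based on that example:

# streamlit_app.py - run with: streamlit run streamlit_app.py
import requests
import streamlit as st

st.title("LLM Playground")

prompt = st.text_area("Prompt")
if st.button("Generate") and prompt:
    # Assumes the FastAPI backend from earlier is running locally
    response = requests.post(
        "http://localhost:8000/generate",
        json={"prompt": prompt, "max_tokens": 100},
    )
    st.write(response.json())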

The Glue: Docker and Docker Compose

Finally, Docker and Docker Compose have been invaluable for ensuring consistency across development and production environments. They also make it easier to manage the various components of your stack.

version: '3'
services:
  web:
    build: .
    ports:
      - "8000:8000"
  db:
    image: postgres:13
    environment:
      POSTGRES_DB: myapp
      POSTGRES_PASSWORD: mypassword

In production, you’ll likely use an orchestration script to spin up the services on a cloud provider (like Azure or AWS), but it will still use your web container build.
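
For completeness, the web service in the compose file above might be built from a Dockerfile along these lines; this is a sketch, and the file layout (requirements.txt and main.py at the repository root) is an assumption:

# Dockerfile - minimal sketch for the FastAPI backend
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Serve the FastAPI app defined in main.py on port 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]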

Using Asynchronous Programming

If you’re building LLM-powered applications and you’re not using asynchronous programming, you’re in for a world of pain. Or at least a world of very slow, unresponsive applications. Let me explain why.

Async for LLM Apps

LLM API calls are slow. I’m talking 1-10 seconds slow. That might not sound like much, but in computer time, it’s an eternity. If you’re making these calls synchronously, your application is going to spend most of its time twiddling its thumbs, waiting for responses.

Async programming allows your application to do other useful work while it’s waiting for those sluggish LLM responses.

The Async Advantage

To illustrate the difference, let’s look at a simple example. Imagine we need to make three LLM API calls:

import asyncio
import time

async def fake_llm_call(call_id):
    await asyncio.sleep(2)  # Simulating a 2-second API call
    return f"Result from call {call_id}"

async def main_async():
    start = time.time()
    results = await asyncio.gather(
        fake_llm_call(1),
        fake_llm_call(2),
        fake_llm_call(3)
    )
    end = time.time()
    print(f"Async took {end - start} seconds")
    print(results)

asyncio.run(main_async())

This async version will take about 2 seconds to complete all three calls.

Now, let’s look at the synchronous equivalent:

def sync_fake_llm_call(call_id):
    time.sleep(2)  # Simulating a 2-second API call
    return f"Result from call {call_id}"

def main_sync():
    start = time.time()
    results = [
        sync_fake_llm_call(1),
        sync_fake_llm_call(2),
        sync_fake_llm_call(3)
    ]
    end = time.time()
    print(f"Sync took {end - start} seconds")
    print(results)

main_sync()

This synchronous version will take about 6 seconds.

That’s three times slower! In a real-world application with multiple API calls, the difference can be even more dramatic.

Async Task Queue

When building LLM-powered applications with FastAPI, you’ll often encounter scenarios where you need to handle long-running tasks without blocking the main application (i.e. waiting for the LLM client to return data). This is where an async task queue comes in handy. It allows you to offload time-consuming operations (like complex LLM generations) to background workers, improving the responsiveness of your API.

Here’s a basic example of how you can implement an async task queue in FastAPI:

from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel
from typing import Optional
import asyncio

app = FastAPI()

class Task(BaseModel):
    id: str
    status: str = "pending"
    result: Optional[str] = None

tasks = {}

async def process_task(task_id: str):
    # Simulate a long-running LLM task
    await asyncio.sleep(10)
    tasks[task_id].status = "completed"
    tasks[task_id].result = f"Result for task {task_id}"

@app.post("/tasks")
async def create_task(background_tasks: BackgroundTasks):
    task_id = str(len(tasks) + 1)
    task = Task(id=task_id)
    tasks[task_id] = task
    background_tasks.add_task(process_task, task_id)
    return {"task_id": task_id}

@app.get("/tasks/{task_id}")
async def get_task(task_id: str):
    return tasks.get(task_id, {"error": "Task not found"})

In this example, we’re using FastAPI’s BackgroundTasks to manage our task queue. When a client creates a new task, we immediately return a task ID and start processing the task in the background. The client can then poll the /tasks/{task_id} endpoint to check the status of their task.

This approach has several advantages:

  1. Responsiveness: Your API can quickly acknowledge task creation without waiting for the task to complete.
  2. Scalability: You can easily distribute tasks across multiple workers.
  3. Fault Tolerance: If a task fails, it doesn’t bring down your entire application.

For more complex scenarios, you might want to consider using a dedicated task queue system like Celery or RQ (Redis Queue). These provide additional features like task prioritisation, retries, and distributed task processing. However, for many LLM applications, FastAPI’s built-in background tasks are more than sufficient.

(A few years ago I would almost always have had to resort to the more complex Celery or Redis setups, which came with their own set of headaches – async BackgroundTasks have come on a lot and are now easier and quicker to implement for most initial use cases.)

Here’s an example of running such a task queue in a Jupyter Notebook using a Test Client.
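
A minimal sketch along those lines, reusing the app and endpoints defined above. One caveat, noted in the comments: under the test client, background tasks are executed as part of the request cycle, so the timing behaviour differs from a deployed server.

from fastapi.testclient import TestClient

# 'app' is the FastAPI application defined above
client = TestClient(app)

# Create a task - under TestClient the BackgroundTask runs as part of the
# request cycle, so this call only returns once process_task has finished
create_response = client.post("/tasks")
task_id = create_response.json()["task_id"]

# Poll the status endpoint - against a deployed server this would initially
# report "pending" and flip to "completed" once the background worker is done
status_response = client.get(f"/tasks/{task_id}")
print(status_response.json())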

Remember, when working with async task queues, it’s crucial to handle errors gracefully and provide clear feedback to the client about the task’s status. You might also want to implement task timeouts and cleanup mechanisms to prevent your task queue from growing indefinitely.

Async LLM Clients

Both OpenAI and Anthropic now have working async clients. These were patchy or undocumented a few months ago, but they are slowly gaining ground as part of the officially documented software development kits (SDKs). The async clients allow you to make non-blocking API calls to LLM services, which can significantly improve the performance of applications that need to make multiple LLM requests concurrently.

OpenAI Async Client

OpenAI’s Python library supports async operations out of the box. Here’s a basic example of how to use it:

from openai import AsyncOpenAI
import asyncio

async def main():
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ]
    )
    print(response.choices[0].message.content)

# Use this when running as a script
asyncio.run(main())
# Or use the following in an interactive environment (e.g. a Jupyter notebook)
loop = asyncio.get_event_loop()
if loop.is_running():
    loop.create_task(main())
else:
    loop.run_until_complete(main())

Key points about OpenAI’s async client:

  • It’s part of the official openai Python package.
  • You can create an AsyncOpenAI client to make async calls.
  • Most methods that make API requests have async versions.
  • It’s compatible with Python’s asyncio ecosystem.

Anthropic Async Client

Anthropic also provides async support in their official Python library. Here’s a basic example:

from anthropic import AsyncAnthropic
import asyncio

async def main():
    client = AsyncAnthropic()
    response = await client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=300,
        messages=[
            {
                "role": "user",
                "content": "What is the capital of Spain?",
            }
        ],
    )
    print(response.content[0].text)

# Use this when running as a script
asyncio.run(main())
# Or use the following in an interactive environment (e.g. a Jupyter notebook)
loop = asyncio.get_event_loop()
if loop.is_running():
    loop.create_task(main())
else:
    loop.run_until_complete(main())

Key points about Anthropic’s async client:

  • It’s part of the official anthropic Python package.
  • You can create an AsyncAnthropic client for async operations.
  • Most API methods have async counterparts.
  • It integrates well with Python’s async ecosystem.

Benefits of Using Async Clients

  1. Improved Performance: You can make multiple API calls concurrently, reducing overall wait time.
  2. Better Resource Utilisation: Your application can do other work while waiting for API responses.
  3. Scalability: Async clients are better suited for handling high volumes of requests.

Async Libraries for Python

Python’s asyncio library is your bread and butter for async programming. However, you’ll also want to get familiar with libraries like aiohttp for making asynchronous HTTP requests, which you might need if interacting with other web-API services.

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://python.org')
        print(html[:50])

# Use this when running as a script
asyncio.run(main())
# Or use the following in an interactive environment (e.g. a Jupyter notebook)
loop = asyncio.get_event_loop()
if loop.is_running():
    loop.create_task(main())
else:
    loop.run_until_complete(main())

The Async Learning Curve

Now, I won’t lie to you – async programming can be a bit of a mind-bender at first. Concepts like coroutines, event loops, and futures might make you feel like you’re learning a whole new language. And in a way, you are. You’re learning to think about programming in a fundamentally different way.

The good news is that once it clicks, it clicks. And the performance benefits for LLM applications are well worth the initial struggle.

I’ve found support has got a lot better over the last 12 months. Examples were pretty much non-existent a year ago. Now there are a few, and LLMs are up to date enough to offer a first (non-working) example to get you going.

Testing and Debugging Async Code

One word of warning: testing and debugging async code can be… interesting. Race conditions and timing-dependent bugs can be tricky to reproduce and fix. Tools like pytest-asyncio can help, but be prepared for some head-scratching debugging sessions.
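
For illustration, a minimal pytest-asyncio test, reusing the fake_llm_call coroutine from the earlier example:

import asyncio
import pytest

@pytest.mark.asyncio
async def test_fake_llm_calls_run_concurrently():
    # Three 2-second "LLM calls" should finish in roughly 2 seconds, not 6
    start = asyncio.get_running_loop().time()
    results = await asyncio.gather(fake_llm_call(1), fake_llm_call(2), fake_llm_call(3))
    elapsed = asyncio.get_running_loop().time() - start

    assert len(results) == 3
    assert elapsed < 3  # would be ~6 seconds if run sequentially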

The Async Ecosystem is Still Maturing

As of 2024, the async ecosystem in Python is still a bit rough around the edges. Documentation can be sparse, and best practices are still evolving. But don’t let that deter you – async is the future for LLM applications, and the sooner you get on board, the better off you’ll be.

Remember, the goal here isn’t to make your code more complex. It’s to make your LLM applications more responsive and efficient. And in a world where every millisecond counts, that’s not just nice to have – it’s essential.

Parallelisation for High-Quality Output Generation

When it comes to LLM-powered applications, sometimes you need to go fast, and sometimes you need to go deep. Parallelisation lets you do both – if you’re clever about it.

Why Parallelise?

First off, let’s talk about why you’d want to parallelise your LLM requests:

  1. Speed: Obviously, doing things in parallel is faster than doing them sequentially. If you’re making multiple independent LLM calls, why wait for one to finish before starting the next?
  2. Improved Output Quality: By running multiple variations of a prompt in parallel, you can generate diverse outputs and then select or combine the best results.
  3. Handling Complex Tasks: Some tasks require multiple LLM calls with interdependent results. Parallelisation can help manage these complex workflows more efficiently.

Parallelisation Strategies

Here are a few strategies you can employ:

Simple Parallel Requests

This is the most straightforward approach. If you have multiple independent tasks, just fire them off concurrently.

import asyncio
from openai import AsyncOpenAI

async def generate_text(client, prompt):
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    client = AsyncOpenAI()
    prompts = [
        "Write a short poem about Python",
        "Explain quantum computing in simple terms",
        "List 5 benefits of exercise"
    ]
    results = await asyncio.gather(*(generate_text(client, prompt) for prompt in prompts))
    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nResult: {result}\n")

# Use this when running as a script
asyncio.run(main())
# Or use the following in an interactive environment (e.g. a Jupyter notebook)
loop = asyncio.get_event_loop()
if loop.is_running():
    loop.create_task(main())
else:
    loop.run_until_complete(main())

Parallel Variations for Quality

Generate multiple variations of the same prompt in parallel, then select or combine the best results.

async def generate_variations(client, base_prompt, num_variations=3):
    tasks = []
    for i in range(num_variations):
        prompt = f"{base_prompt}\nVariation {i+1}:"
        tasks.append(generate_text(client, prompt))
    return await asyncio.gather(*tasks)

async def main():
    client = AsyncOpenAI()
    base_prompt = "Generate a catchy slogan for a new smartphone"
    variations = await generate_variations(client, base_prompt)
    print("Generated slogans:")
    for i, slogan in enumerate(variations, 1):
        print(f"{i}. {slogan.strip()}")

# Use this when running as a script
asyncio.run(main())
# Or use the following in an interactive environment (e.g. a Jupyter notebook)
loop = asyncio.get_event_loop()
if loop.is_running():
    loop.create_task(main())
else:
    loop.run_until_complete(main())
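
The selection step can itself be an LLM call. Here is a rough sketch, reusing generate_text from above, that asks the model to judge the parallel outputs; the judging prompt is just one possible approach:

async def pick_best(client, base_prompt, variations):
    # Ask the model to act as a judge over the parallel outputs
    numbered = "\n".join(f"{i}. {v.strip()}" for i, v in enumerate(variations, 1))
    judge_prompt = (
        f"Task: {base_prompt}\n\n"
        f"Candidate outputs:\n{numbered}\n\n"
        "Reply with the number of the best candidate and a one-line reason."
    )
    return await generate_text(client, judge_prompt)

# Usage (inside main above): best = await pick_best(client, base_prompt, variations)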

Parallel Subtasks

For complex tasks, break them down into subtasks that can be executed in parallel.

async def research_topic(client, topic):
    tasks = [
        generate_text(client, f"Provide a brief overview of {topic}"),
        generate_text(client, f"List 3 key points about {topic}"),
        generate_text(client, f"Suggest 2 potential applications of {topic}")
    ]
    overview, key_points, applications = await asyncio.gather(*tasks)
    return f"Overview: {overview}\n\nKey Points: {key_points}\n\nApplications: {applications}"

async def main():
    client = AsyncOpenAI()
    topic = "Machine Learning"
    result = await research_topic(client, topic)
    print(result)

# Use this when running as a script
asyncio.run(main())
# Or use the following in an interactive environment (e.g. a Jupyter notebook)
loop = asyncio.get_event_loop()
if loop.is_running():
    loop.create_task(main())
else:
    loop.run_until_complete(main())

Considerations and Caveats

While parallelisation can significantly boost your application’s performance, it’s not without its challenges:

  1. API Rate Limits: Most LLM providers have rate limits. Make sure your parallel requests don’t exceed these limits, or you’ll end up with errors instead of speed gains (a simple way to cap concurrency is shown in the sketch after this list). When working with legal applications, my number of “matters” is normally in the thousands while my rate limits are in the order of millions per minute, so it makes sense to speed things up by running requests in parallel.
  2. Cost Considerations: Remember, each parallel request costs money. Make sure the benefits outweigh the increased API usage costs.
  3. Result Consistency: When generating variations, you might get inconsistent results. You’ll need a strategy to reconcile or choose between different outputs.
  4. Complexity: Parallel code can be harder to debug and maintain. Make sure the performance gains justify the added complexity.
  5. Resource Management: Parallelisation can be resource-intensive. Monitor your application’s memory and CPU usage, especially if you’re running on constrained environments.
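
A simple way to respect rate limits while still parallelising is to cap concurrency with an asyncio.Semaphore. A minimal sketch, reusing generate_text from the examples above (the limit of 5 is arbitrary):

import asyncio

# Allow at most 5 requests in flight at any one time
semaphore = asyncio.Semaphore(5)

async def rate_limited_generate(client, prompt):
    async with semaphore:
        return await generate_text(client, prompt)

async def run_batch(client, prompts):
    # gather still schedules everything "at once", but the semaphore
    # ensures only 5 calls are actually active concurrently
    return await asyncio.gather(*(rate_limited_generate(client, p) for p in prompts))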

Implementing Parallelisation

In Python, the asyncio library is your friend for implementing parallelisation. For more complex scenarios, you might want to look into libraries like aiohttp for making async HTTP requests, or even consider distributed task queues like Celery for large-scale parallelisation.

Remember, parallelisation isn’t a magic bullet. It’s a powerful tool when used judiciously, but it requires careful thought and implementation. Start simple, measure your gains, and scale up as needed. And always, always test thoroughly – parallel bugs can be particularly sneaky!

Leveraging LLMs in the Development Process

As developers, we’re always on the lookout for tools that can streamline our workflow and boost productivity. LLMs like ChatGPT and Claude have emerged as powerful allies in the development process. Let’s dive into how you can effectively leverage these AI assistants in your coding projects.

Setting Up Your LLM Workspace

One approach that’s worked well is setting up a dedicated Claude Project for each coding project. Here’s how you can structure this:

  1. Initialise the Project: Create a new Claude Project for your coding project (roughly analogous to a GitHub repository).
  2. Load Key Materials: Populate the project with essential documents:
    • Style guides and quality guidelines
    • Project background and requirements
    • Key code files developed so far
    • Examples of similar implementations you admire
  3. Branch Management: Develop individual features on separate git branches, each with its own chat off the main project.
  4. Keep It Updated: As you progress, regularly update the project materials with the latest code files.

Leveraging LLMs for Feature Development

LLMs like Claude excel at quickly coding discrete features across 2-3 files, but they can struggle with integrating into complex existing systems. Here’s a strategy to make the most of their capabilities:

  1. Generate a Toy Version: Use Claude to create a standalone version of your feature.
  2. Implement as a Separate Package: Take the LLM-generated code and implement it on a new feature git branch as a separate Python package (i.e., folder) within the existing code repo.
  3. Iterate with Tests: Get the feature working independently, using tests to guide your iterations. Pop your tests in a separate test folder corresponding to your separate Python package.
  4. Manual Integration: Once your feature is working as desired, handle the integration into your main codebase yourself.

If you need to edit existing functions and classes:

  1. Find and Isolate Files and Tests: Identify the files that will be modified by your proposed update. This is generally the first step in bug fixing anyway. If in doubt, create a test that calls the method, function, or class you need to modify/fix/improve, then trace the call stack in an IDE debug mode (I use PyCharm Pro). Copy these files into the chat. Try to keep them below 3-5 files (good coding practice anyway).
  2. Use Test-Driven Development: Write new tests in your existing suite that match the new functionality you want to implement.
  3. Get LLM to Sketch and Summarise the Existing Interfaces: Define your updates schematically first – set out new method interfaces, variables, and passed data. Use these in your tests (which should be failing at this stage).
  4. Fix the Interfaces and Generate Internal Code: Then get the LLM to fill in the functional code. Copy into your git branch and run the tests. Iterate pasting back the errors until you get something working.
  5. Aim for 100% Coverage and Use It to Prevent Unintended Consequences: By generating tests as you code, you can ensure that other portions of your code are covered by tests. This helps when you modify anything (LLMs – and you – will get it wrong the first 1-3 times): run the tests, see what broke, work out why, fix, and repeat until everything passes.
  6. Use Git/GitHub Pull Requests and Code Review: This is an extra review stage to catch dodgy LLM solutions that still pass the tests before integrating into your production code.

Best Practices and Considerations

While LLMs can be incredibly helpful, it’s important to use them judiciously:

  1. Code Review: Always review LLM-generated code thoroughly. These models can produce impressive results, but they can also make mistakes or introduce subtle bugs.
  2. Avoid Integration Hallucinations: LLMs like Claude or Copilot may hallucinate existing code structures. Don’t rely on them for integrating with your existing codebase.
  3. Use for Ideation: LLMs are excellent for brainstorming and getting quick prototypes. Use them to explore different approaches to a problem. I’m not a front-end person, but in a few hours you can get a working React prototype you can show to frontend developers and say: something like this!
  4. Documentation and Comments: Ask the LLM to provide detailed comments and documentation for the code it generates. This can save you time and ensure better code understanding.
  5. Learning Tool: Use LLMs to explain complex code or concepts. They can be excellent teachers for new programming paradigms or libraries.

Real-World Example: Developing a New API Endpoint

Let’s say you’re adding a new API endpoint to your FastAPI application. Here’s how you might use Claude:

  1. Outline the Feature: Describe the new endpoint to Claude, including expected inputs and outputs.
  2. Generate Initial Code: Ask Claude to create a basic implementation, including the endpoint function, any necessary data models, and unit tests.
  3. Iterate and Refine: Use Claude to help refine the code, optimize performance, and enhance error handling.
  4. Documentation: Have Claude generate OpenAPI documentation for your new endpoint.
  5. Testing: Use Claude to suggest additional test cases and help implement them.
  6. Manual Integration: Once you’re satisfied with the standalone implementation, integrate it into your main application yourself.

The Limits of LLM Assistance

While LLMs are powerful tools, they’re not a replacement for human developers. They excel at certain tasks:

  • Quick prototyping
  • Explaining complex concepts
  • Generating boilerplate code
  • Suggesting optimisation techniques

But they struggle with:

  • Understanding the full context of large codebases
  • Making architectural decisions that affect the entire system
  • Ensuring code aligns with all business rules and requirements

Integrating LLMs like Claude into your development process can significantly boost productivity and spark creativity. However, you need to understand their strengths and limitations. Use them as powerful assistants, but remember that the final responsibility for the code and its integration lies with you, the developer.

By leveraging LLMs thoughtfully in your workflow, you can accelerate feature development, improve code quality, and free up more of your time for the complex problem-solving that human developers do best.

Prompt Tips

Best practice on using prompts has been developing at a steady pace. The cycle seems to be:

  1. fudge something to get it working and get around limitations;
  2. have major LLM providers improve the limitations so the fudge isn’t needed as much;
  3. find documented examples where someone has had a better idea;
  4. GOTO 1.

Here are some aspects of using prompts that work at the moment.

Keep it Simple

Keep prompts simple.

Initially, I’d have lots of fancy string concatenation methods to generate conditional strings based on the content of my data. While this is helpful in streamlining the prompts and leaving out extra information that can sometimes confuse the LLM, it does make it difficult to view and evaluate prompts.

A better approach is to treat each LLM interaction as a single action with a single prompt. The single prompt has variable slots and these are populated by your logic code prior to the request.

Storage

A nice tool to use is Jinja templates. Those familiar with Flask will be well acquainted with these for creating HTML templates, but they can also be used for generating plain text templates. And they have a long history of use and examples to help you.

Based on traditional Jinja template use, a good setup is to have a “prompt” or “prompt template” folder that has a number of text files (simple “.txt”).

# prompt_templates/patent_review.txt
You are a patent analysis assistant.

Patent Document:
Title: {{ patent_title }}
Filing Date: {{ filing_date }}
Inventors: {{ inventors }}
Technology Field: {{ tech_field }}

Please analyse the following patent content:
---
{{ patent_content }}
---

Focus your analysis on:
1. Novel features claimed
2. Scope of protection
3. Potential prior art considerations
4. Commercial implications

# prompt_templates/contract_summary.txt
You are a contract analysis assistant.

Contract Details:
Title: {{ contract_title }}
Parties: {{ parties }}
Date: {{ contract_date }}
Type: {{ contract_type }}

Please analyse the following contract:
---
{{ contract_content }}
---

Provide a summary covering:
1. Key terms and conditions
2. Obligations of each party
3. Important deadlines
4. Risk areas
5. Recommended amendments

# prompt_templates/legal_analysis.txt
You are a legal assistant helping analyse documents.

Document Information:
Title: {{ document_title }}
Type: {{ document_type }}
Author: {{ author }}
Date Created: {{ date_created }}
Jurisdiction: {{ jurisdiction }}
Reviewer: {{ reviewer_name }}

Please analyse the following document content for {{ analysis_type }}:
---
{{ document_content }}
---

Provide a detailed analysis covering:
1. Key legal points
2. Potential issues or risks
3. Recommended actions

Here’s a short Python example that uses these:

# Directory structure:
# my_project/
# ├── prompt_templates/
# │   ├── legal_analysis.txt
# │   ├── patent_review.txt
# │   └── contract_summary.txt
# ├── db_connector.py
# └── prompt_manager.py

# db_connector.py
import sqlite3
from dataclasses import dataclass
from typing import Optional

@dataclass
class Document:
    id: int
    title: str
    content: str
    doc_type: str
    date_created: str
    author: Optional[str]

class DatabaseConnector:
    def __init__(self, db_path: str):
        self.db_path = db_path
        
    def get_document(self, doc_id: int) -> Document:
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.cursor()
            cursor.execute("""
                SELECT id, title, content, doc_type, date_created, author 
                FROM documents 
                WHERE id = ?
            """, (doc_id,))
            row = cursor.fetchone()
            
            if not row:
                raise ValueError(f"Document with id {doc_id} not found")
                
            return Document(
                id=row[0],
                title=row[1],
                content=row[2],
                doc_type=row[3],
                date_created=row[4],
                author=row[5]
            )

# prompt_manager.py
from typing import Dict, Any
from jinja2 import Environment, FileSystemLoader
from anthropic import AsyncAnthropic

class PromptManager:
    def __init__(self, templates_dir: str, anthropic_api_key: str):
        self.env = Environment(
            loader=FileSystemLoader(templates_dir),
            trim_blocks=True,
            lstrip_blocks=True
        )
        self.client = AsyncAnthropic(api_key=anthropic_api_key)
        
    def load_template(self, template_name: str):
        """Load a Jinja2 template by name."""
        return self.env.get_template(f"{template_name}.txt")
        
    def render_prompt(self, template_name: str, variables: Dict[str, Any]) -> str:
        """Render a template with the provided variables."""
        template = self.load_template(template_name)
        return template.render(**variables)
        
    async def send_prompt(self, prompt: str, model: str = "claude-3-sonnet-20240229") -> str:
        """Send the rendered prompt to Claude and return the response text."""
        message = await self.client.messages.create(
            model=model,
            max_tokens=4000,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text

# Example usage
async def main():
    # Initialize managers
    db = DatabaseConnector("legal_docs.db")
    prompt_mgr = PromptManager(
        templates_dir="prompt_templates",
        anthropic_api_key="your-api-key"
    )
    
    # Get document from database
    doc = db.get_document(doc_id=123)
    
    # Prepare variables for template
    variables = {
        "document_title": doc.title,
        "document_content": doc.content,
        "document_type": doc.doc_type,
        "author": doc.author or "Unknown",
        "date_created": doc.date_created,
        "analysis_type": "legal_review",
        "jurisdiction": "UK",
        "reviewer_name": "John Smith"
    }
    
    # Render prompt from template
    prompt = prompt_mgr.render_prompt("legal_analysis", variables)
    
    # Send to Claude and get response
    response = await prompt_mgr.send_prompt(prompt)
    print(response)

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Versioning

You’ll endlessly tweak your prompts. This makes it difficult to keep track of changes.

Git is your first friend here. Make sure your template directory is added to git, so you can track changes over time. If you host your code on GitHub or Bitbucket, you can use the web interface to flick through different versions of the prompts across different commits.

You might also want to set up a manual or automated version numbering system (e.g. 0.1.0, 2.2.12) to reflect changes as you go along. This could be available programmatically and stored with your prompt in logs and test examples.
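
One lightweight option – a sketch rather than a full versioning system – is to record a manual version number together with the git commit hash of your prompt templates, and attach both to your logs and test results:

import subprocess

PROMPT_VERSION = "0.3.1"  # bumped manually when prompts change materially

def get_prompt_commit() -> str:
    # Short commit hash of the repository containing your prompt templates
    return (
        subprocess.check_output(["git", "rev-parse", "--short", "HEAD"])
        .decode()
        .strip()
    )

# Store these with your logs / test results so outputs can be traced
# back to the exact prompt text that produced them
prompt_metadata = {"version": PROMPT_VERSION, "commit": get_prompt_commit()}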

Make it Easy to Test

The Anthropic Workbench has got a lot better over the last year and provides a template for how to structure prompts.

The Workbench has a “Prompt” screen where you can compose and test prompts and an “Evaluate” screen where you can evaluate prompts over a number of runs. The Evaluate option provides a useful guide to how you can evaluate prompts over time.

Build a Test Dataset

Use a machine-learning style dataset with input and output pairs, where the output is a ground-truth example. Structure your input as per your prompt variables. The Anthropic Workbench allows you to upload examples in CSV format, so think about storing this data in a spreadsheet-style format (you can always store it in a database and write a script to convert it to CSV via pandas).

Each row of your test dataset should have column headings that reflect your prompt variables, then a column that represents an “ideal_output”, then a column “model_output” that can be used to store results. The Workbench also uses a scoring system so you might want to add a “score” column for recording a manually assigned score from 1 (Poor) to 5 (Excellent).

Data arranged in this way can then be uploaded and exported from the Anthropic Workbench, allowing you to quickly compare Anthropic models.

I’d also recommend storing the current version of your prompt, either by copying and pasting or (better) providing a link to the current prompt file and current version (e.g., GitHub link and commit hash).

If your data is arranged in this format you can also write little scripts to automate sending the examples off into APIs and record the results.
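
For example, something along these lines; the column names follow the layout described above, and the PromptManager is the sketch from the Storage section:

import pandas as pd

async def run_test_set(prompt_mgr, csv_path="test_dataset.csv"):
    df = pd.read_csv(csv_path)
    outputs = []
    for _, row in df.iterrows():
        # Every column except the bookkeeping ones is treated as a prompt variable
        variables = row.drop(
            labels=["ideal_output", "model_output", "score"], errors="ignore"
        ).to_dict()
        prompt = prompt_mgr.render_prompt("legal_analysis", variables)
        outputs.append(await prompt_mgr.send_prompt(prompt))
    df["model_output"] = outputs
    df.to_csv("test_dataset_results.csv", index=False)

# Usage: asyncio.run(run_test_set(prompt_mgr))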

Prompt Logging

API calls cost money. But you often need to run end-to-end tests through the LLM API. It can also be difficult to debug LLM applications – is the problem in your code and logic or in the prompt?

Proper logging of prompts can help you trace errors and find out exactly what you were sending to the LLM. The problem is that prompts are often long and verbose. Simple standard-output line logging falls down, and other logging gets drowned out by logged API calls and responses.

Another problem is that the actual LLM call can have a series of text and/or image user messages, but a human needs to read a simple flattened text version with the populated data. Also, images are transmitted as base64 encoded strings – you need to filter these out of your prompts, otherwise they will take over any logs with encoded-string rubbish.

There are several companies out there that propose to sell LLM logging solutions. But I find them complex and they often involve sending yet more data off into a nondescript third party cloud – not great if you are trying to secure the call chain.

If I was clever, I’d set up a separate Python logger for the LLM calls and have this log to a separate database.

But I’m not. So, as a compromise, I’ve found that a simple standalone JSON or MongoDB logging package I can reuse across projects works well. I pass in the messages that I send to the LLM client, flatten them with some simple code, and save the result as a cleaned string, together with some UUIDs of related objects/users/tasks so I can trace everything back to a code flow.
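
As a rough illustration of that compromise, here is a sketch of a small helper that flattens the message list, strips base64 image payloads, and appends a JSON record to a file; swapping the file for a MongoDB collection is straightforward:

import json
import uuid
from datetime import datetime, timezone

def flatten_messages(messages: list) -> str:
    """Flatten chat messages to readable text, dropping base64 image data."""
    lines = []
    for message in messages:
        content = message.get("content", "")
        if isinstance(content, list):
            # Multi-part messages: keep text parts, replace image parts with a marker
            parts = [
                part.get("text", "[image omitted]") if isinstance(part, dict) else str(part)
                for part in content
            ]
            content = "\n".join(parts)
        lines.append(f"{message.get('role', 'unknown')}: {content}")
    return "\n\n".join(lines)

def log_prompt(messages: list, task_id: str, path: str = "prompt_log.jsonl") -> None:
    # One JSON record per line, tagged with UUIDs so it can be traced back to a code flow
    record = {
        "id": str(uuid.uuid4()),
        "task_id": task_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": flatten_messages(messages),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")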

Conclusion

As we’ve explored in this post covering technical implementations of LLM-powered systems, building robust applications requires careful consideration of architecture, asynchronous programming, parallelisation strategies, and thoughtful prompt engineering. The rapid evolution of LLM capabilities means we’re often building tomorrow’s solutions with today’s tools, requiring flexibility and adaptability in our approach.

We’ve seen how a well-structured tech stack, centered around FastAPI and modern async programming patterns, can provide a solid foundation for LLM applications. The move towards asynchronous programming isn’t just a performance optimisation—it’s becoming essential for handling the inherently slow nature of LLM API calls effectively. Similarly, parallelisation strategies offer powerful ways to improve both speed and output quality, though they require careful management of resources and rate limits.

The integration of LLMs into the development process itself represents an interesting meta-level application of these technologies. While they can significantly accelerate certain aspects of development, success lies in understanding their limitations and using them judiciously as part of a broader development strategy.

Perhaps most importantly, we’ve learned that effective implementation isn’t just about the code—it’s about creating sustainable, maintainable systems. This means paying attention to prompt management, testing strategies, and logging systems that can help us understand and improve our LLM applications over time.

In our next and final post in this series, we’ll examine the persistent challenges and issues that remain in working with LLMs, including questions of reliability, evaluation, and the limitations we continue to encounter. Until then, remember that we’re all still pioneers in this rapidly evolving field—don’t be afraid to experiment, but always build with care and consideration for the future.

Key Insights from a Year of Working with LLMs (2/4)

  1. Introduction
  2. Model-Specific Observations
    1. The Arrival of Capable Vision Models
    2. GPT4o and Claude 3.5 Sonnet: Solid Foundations for Complex Tasks
      1. Evolution and Current Capabilities
      2. History
      3. Practical Applications
    3. The Unexpected Utility of Mini Models
      1. Key Mini Models
      2. Characteristics and Use Cases
      3. Potential Applications
      4. Limitations in Legal Work
    4. Long Contexts: Opportunities and Challenges
      1. Rapid Evolution
      2. Current State of Long Contexts
      3. Opportunities
      4. Challenges
      5. The Output Context Limitation
      6. Further Work
  3. Conclusion

Introduction

In the first post of this series, we explored general process insights gained from a year of intensive work with Large Language Models (LLMs). We discussed the importance of context, the power of iterative refinement, and the shift in creative processes that LLMs have brought about. These insights laid the groundwork for understanding how LLMs are changing our approach to complex tasks, particularly in legal and business contexts.

As we continue our exploration, this second post focuses on model-specific observations. Over the past year, we’ve seen notable advancements in LLM capabilities, from the introduction of powerful vision models to the emergence of highly capable “mini” models. We’ve also witnessed the expansion of context windows and the refinement of top-tier models like GPT-4 and Claude 3.5 Sonnet.

In this post, we’ll look into four key areas:

  1. The arrival of capable vision models and their potential applications
  2. GPT-4o and Claude 3.5 Sonnet as solid foundations for complex tasks
  3. The unexpected utility of mini models in specific use cases
  4. The opportunities and challenges presented by long context windows

These observations are valuable for anyone working with or planning to implement LLMs in their workflows. Understanding the strengths, limitations, and optimal use cases for different models can help you make more informed decisions about which tools to use and how to use them effectively.

Let’s dive into these model-specific insights and explore how they can shape our approach to leveraging LLMs in legal and business contexts.

Model-Specific Observations

The Arrival of Capable Vision Models

Over the past year, we’ve seen a notable advancement in LLM technology with the integration of vision capabilities (officially Large Vision Models – LVMs – but I’ll assume LLMs include LVMs). Models like GPT-4o (previously GPT-4 with Vision) and Claude 3.5 Sonnet can now process and understand visual information alongside text, opening up new possibilities for AI applications.

These models can analyse images, answer questions about visual content, and even count objects within images. Implementing vision capabilities requires some adjustments to classic text pipelines. For example, API requests need tweaking to provide image as well as text data, with the possibility of interleaving the two. Practically images are provided as base64 encoded strings or via URLs. For those interested in implementation details, both OpenAI and Anthropic provide comprehensive documentation.
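
As an illustration (one possible pattern rather than the only one), an OpenAI chat completion request that interleaves text with a base64-encoded image might look like this; the image path is a placeholder:

import base64
from openai import OpenAI

client = OpenAI()

# Placeholder path - read and base64-encode the image to send it inline
with open("diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what this figure shows."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)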

While many organisations are still focused on text-based LLM integration, vision models offer exciting possibilities:

  • Document Analysis: Extracting information from forms, diagrams, or handwritten notes.
  • Visual Compliance Checks: Identifying safety violations or non-compliant elements in images.
  • Enhanced Legal Research: Analysing visual evidence or diagrams in patent applications.

Looking ahead, technologies like ColPali, which combines vision and language models for document retrieval, show promise for improving how we search and analyse visual information in documents. Additionally, specialised models like YOLO (You Only Look Once) for object detection and SAM2 (Segment Anything Model 2) for image segmentation offer potential for more nuanced understanding of visual content, though their integration into business processes is still in its early stages. For example, we are missing mature real-time libraries for rapid image evaluations.

Despite their potential, vision models do present challenges, including increased computational requirements, new privacy considerations for visual data, and the need for human oversight to ensure accuracy and reliability.

The integration of vision capabilities into LLMs represents a significant step forward. As these technologies mature, we can expect to see innovative applications across multiple industries, including law and business. However, it’s worth noting that many organisations are still in the early stages of exploring these capabilities, with text-based applications remaining the primary focus for now.

GPT4o and Claude 3.5 Sonnet: Solid Foundations for Complex Tasks

Over the past year, we’ve witnessed significant advancements in large language models, with GPT4o and Claude 3.5 Sonnet emerging as robust foundations for complex tasks in various domains, including law and business.

Evolution and Current Capabilities

GPT4o (OpenAI)

  • Released on May 13, 2024, as part of the GPT-4 family
  • Key feature: Incorporation of vision functionality (multimodal capabilities)
  • Context window: 128,000 tokens
  • Performance: Faster than GPT-4 Turbo, generally more stable
  • Qualitative assessment: Capabilities between GPT-4 and GPT-4 Turbo, suitable for most GPT-4 use cases
  • Current pricing (as of October 2024): $2.50 / 1M input tokens, $10.00 / 1M output tokens

Claude 3.5 Sonnet (Anthropic)

  • Released on June 20, 2024, as an upgrade to the Claude 3 family
  • Context window: 200,000 tokens
  • Key improvements: Enhanced coding abilities, multistep workflows, and image interpretation
  • Notable feature: Introduction of Artifacts for real-time code rendering
  • Extended output capability: Up to 8,192 tokens (as of August 19, 2024)
  • Current pricing (as of October 2024): $1.25 / 1M input tokens, $5.00 / 1M output tokens

History

The development of these models represents a rapid evolution in AI capabilities. Here’s a brief timeline of key milestones:

OpenAI’s GPT Series:

  • March 15, 2022: GPT-3.5 released (as “text-davinci-002” and “code-davinci-002”)
  • March 14, 2023: GPT-4 released (with initial waitlist)
  • November 6, 2023: GPT-4 Turbo introduced, featuring 128K context window
  • May 13, 2024: GPT4o (Omni) released, incorporating vision functionality
  • September 2024: o1 Series Preview released (o1-preview, o1-mini)

Anthropic’s Claude Series:

  • March 2023: Claude 1 API launched
  • July 2023: Claude 2 released, with 100K token context window
  • November 2023: Claude 2.1 released, expanding context to 200K tokens
  • March 14, 2024: Claude 3 family (Haiku, Sonnet, Opus) released
  • June 20, 2024: Claude 3.5 Sonnet released, with improved coding and image interpretation
  • August 19, 2024: Extended output (8,192 tokens) for Claude 3.5 Sonnet made generally available

This timeline illustrates the rapid pace of development in large language models, with both OpenAI and Anthropic consistently increasing context sizes, improving performance, and introducing new capabilities such as vision processing and extended output generation.

The rapidly developing oligopoly is good for consumers and businesses, driving the conversion of venture capital cash into useful feature development.

Practical Applications

Both models (GPT4o and Claude 3.5 Sonnet) are effective for generating professional-level output. There is still a bias towards an American, online-marketing-influenced enthusiasm, but this can often be eliminated via careful prompting.

Both models are capable of generating working small code projects (with Claude 3.5 having the slight edge) and, via an iterative chat, of generating good enough content for standard Internet publishing (this blog series is generated in collaboration with Claude 3.5). Neither model is ready to be left alone to output content without careful human checking, though. Their fluency with words often hides a shallowness of information content. This can be addressed to a certain extent by providing high quality “first draft” information and good context. But “garbage in, garbage out” still reigns.

For example, both models are capable of writing competent-sounding business letters and emails. In terms of reasoning abilities, they can probably meet the level of an “average, half-awake, mid-tier business manager or graduate”, especially if all the content needed for the reasoning is available in the prompt and there are clearly defined action possibilities. They are likely able to act as a secretarial first-instance response generator, to be checked / changed / regenerated by a human in the loop.

Later on in the post series we’ll look in more detail at the current limitations of these models for legal applications.

While these models provide powerful capabilities, it’s crucial to remember that human oversight remains necessary, especially for high-stakes tasks. The value of GPT4o and Claude 3.5 Sonnet lies in augmenting human intelligence, offering rapid information processing and idea generation.

At present, model development has been rather quicker than the development capacity to really build systems around the models. Just last summer (August 2023) it appeared that progress had stalled with GPT-4, but Claude coming online and continued rapid model evolution mean we are coming to expect jumps in capability every 6 months or so. Indeed, Claude Opus 3.5 – Anthropic’s higher-specification model – is still to be released.

As we continue working with these models, we’re constantly discovering new applications, in turn pushing the boundaries of AI-assisted professional work. The rapid evolution of these models, as evidenced by their release timelines, suggests that we can expect further improvements and capabilities in the near future.

The Unexpected Utility of Mini Models

As discussed briefly in our first post, mini models have emerged as surprisingly capable tools for specific tasks within complex workflows. These models, designed for speed and efficiency, offer a balance between performance and cost that opens up new possibilities for AI integration in various fields.

Key Mini Models

GPT-4o mini (OpenAI)

  • Link: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
  • Release date: July 18, 2024
  • Context window: 128,000 tokens
  • Key features:
    • Scores 82% on MMLU (Massive Multitask Language Understanding)
    • Supports text and vision inputs
    • Knowledge cutoff up to October 2023
    • Supports up to 16,384 output tokens per request
    • Improved tokenizer for more cost-effective handling of non-English text
    • Strong performance in function calling and long-context tasks
  • Pricing:
    • Input: $0.15 per million tokens
    • Output: $0.60 per million tokens
    • Over 60% cheaper than GPT-3.5 Turbo

Claude 3 Haiku (Anthropic)

  • Link: https://www.anthropic.com/news/claude-3-haiku
  • Release date: March 14, 2024
  • Context window: 200,000 tokens
  • Key features:
    • Fastest model in the Claude 3 family, processing 21K tokens (~30 pages) per second for prompts under 32K tokens
    • Strong performance on industry benchmarks (MMLU score of 0.752)
    • State-of-the-art vision capabilities
    • Designed for rapid analysis of large datasets
    • Three times faster than peers for most workloads
    • Optimized for enterprise-grade security and robustness
  • Pricing:
    • Input: $0.25 per million tokens
    • Output: $1.25 per million tokens

Gemini 1.5 Flash (Google)

  • Link: https://developers.googleblog.com/en/gemini-15-flash-8b-is-now-generally-available-for-use/
  • Release date: May 24, 2024 (initial version gemini-1.5-flash-001)
  • Context window: 1 million tokens
  • Key features:
    • Optimized for narrower or high-frequency tasks where response time is crucial
    • Natively multimodal, supporting text, images, audio, and video inputs
    • Improved performance for tasks like chat, transcription, and long context language translation
    • Supports adding image, audio, video, and PDF files in text or chat prompts
    • JSON mode support (added August 30, 2024)
  • Pricing:
    • 50% lower price compared to the previous 1.5 Flash model
    • Described as having the “lowest cost per intelligence of any Gemini model”

Characteristics and Use Cases

These mini models are designed for real-time interaction, embodying a “System 1” thinking style – quick and generally accurate, but potentially prone to errors. Their cost-effectiveness for high-volume requests makes them particularly suitable for tasks that require rapid processing of large amounts of data.

While Claude 3.5 Sonnet and GPT4o cost around $1-1.5 per million tokens, these mini models come in at between 3% and 25% of that. GPT4o-mini is about 15c (10p-ish) for a million tokens and Gemini Flash is even cheaper at 3-4c (2-3p-ish).

One standout feature is their effectiveness in “needle in a haystack” information retrieval. Models like Gemini 1.5 Flash, with its massive 1 million token context window, can process entire movies or document libraries in a single pass, making it ideal for tasks that involve sifting through large volumes of unstructured data. For example, in litigation, if you needed to trawl through video footage looking for when a particular person came on screen, or to find when a particular event occurred, sticking it into Gemini Flash is a no-brainer. Even if it is wrong, it could save hours of human effort.

Potential Applications

  • Format translation (e.g., generating JSON from text) – see the sketch after this list
  • Non-mathematical analysis tasks
  • Real-time chat and customer service applications
  • Rapid document classification and information extraction
  • Initial data preprocessing for more complex AI pipelines
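
As a rough illustration of the format-translation use above, here is a minimal sketch using the standard OpenAI Python client and the gpt-4o-mini model in JSON mode (the prompt and field names are illustrative only):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_to_json(text: str) -> str:
    """Ask a mini model to turn free text into a small JSON record."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},  # JSON mode
        messages=[
            {"role": "system", "content": "Return a JSON object with keys 'parties', 'dates' and 'document_type'."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content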

Despite their impressive capabilities, we’ve found these mini models to have limited applications in generative legal work. The primary concerns are:

  1. Reliability: They are often too unreliable for even short content generation in legal contexts, where accuracy is paramount. It feels a bit like asking a teenager.
  2. Cost considerations: In legal practice, the budget often allows for using more robust, albeit more expensive, models. The potential cost savings from mini models are outweighed by the need for precision.
  3. Analysis time: Legal work often involves longer, more thorough analysis, which aligns better with the capabilities of more comprehensive models.

Er – no GPT-4o-mini – there are 5 ducks, you’ve missed a duckling.

However, these models may still find utility in “under-the-bonnet” grunt work tasks within legal workflows, such as initial document sorting or basic information extraction.

Mini models represent an intriguing development in the AI landscape. While they may not be suitable for all aspects of legal work, their speed, efficiency, and cost-effectiveness open up new possibilities for AI integration in various fields. As these models continue to evolve, we may see them taking on increasingly sophisticated roles in complex workflows, complementing rather than replacing their larger counterparts.

Long Contexts: Opportunities and Challenges

The past year has seen a remarkable evolution in the context handling capabilities of large language models. Just a year ago, most models were limited to around 8,000 tokens of input. Now, we’re seeing context windows ranging from 128,000 tokens (as in GPT4o) to 200,000 tokens (Claude 3.5 Sonnet), and even up to 1 million tokens in some mini models like Gemini 1.5 Flash.

Rapid Evolution

This rapid expansion of context windows has opened up new possibilities for AI applications:

  1. Whole Document Analysis: Models can now process entire documents, books, or even small databases in a single pass.
  2. Complex Multi-Step Reasoning: Longer contexts allow for more nuanced, multi-stage problem-solving within a single prompt.
  3. Enhanced Information Retrieval: The ability to search for and synthesize information across vast amounts of text has dramatically improved.

Current State of Long Contexts

  • GPT4o: 128,000 tokens
  • Claude 3.5 Sonnet: 200,000 tokens
  • Mini Models: 128,000 to 1 million tokens (e.g., Gemini 1.5 Flash)

This expansion allows for processing of hundreds of pages of text in a single API call, a capability that was unthinkable just a year ago. GPT4o and Claude 3.5 can also have a mixture of text and images, and Gemini 1.5 Flash can ingest whole videos.

Opportunities

  1. Legal Document Review: Analyzing entire contracts or legal cases in context.
  2. Research Synthesis: Combining and summarizing large volumes of academic papers or reports.
  3. Content Creation: Generating long-form content with consistent themes and references throughout.
  4. Data Analysis: Processing and interpreting large datasets in a single pass.
  5. Patent Prosecution Analysis: An average patent is around 32K-50K tokens. Thus you can theoretically include 4-8 items of prior art in a prompt. This is roughly the amount of cited art in a patent office action.
  6. Patent Family Analysis: You could fit a medium-sized family into Gemini’s context window to ask specific simple questions of the family (e.g., is a feature described? where?). I’d be a bit skeptical of the response but it would be good as a first stage.
  7. A combination of RAG and prompt: RAG was often a solution to the small 4K input context window. It might still be useful as part of a hybrid approach. This could include RAG vector search, BM25 search, and raw LLM review, followed by reranking and reference citing.

Challenges

  1. Computational Costs: Processing such large contexts can be computationally expensive and time-consuming.
  2. Relevance Dilution: With more context, models may struggle to focus on the most relevant information.
  3. Consistency Across Long Outputs: Maintaining coherence and consistency in very long generated texts can be challenging.
  4. Memory Limitations: Even with long input contexts, models can sometimes “forget” earlier parts of the input in their responses.

In practice, I’ve found that you can’t completely trust the output of the models. When modelling a prior art review, both models sometimes got confused between a “claim 1” under review and a “claim 1” as featured in a patent application dumped into the context. I’ve also had them miss simple answers to some questions, depending on the run – e.g., in around 1 in 5 runs.

It seems to have promise as part of a traceable and reviewable analysis trail, whereby a human checks the answer and the citation before confirming and moving on to the next stage of review.

The Output Context Limitation

While input contexts have grown dramatically, output contexts remain more limited, typically ranging from 4,000 to 8,000 tokens. This limitation presents a challenge for generating very long documents or responses. To address this, developers and users often need to implement iterative approaches (a sketch of the chunked approach follows the list):

  1. Chunked Generation: Breaking down large document generation into smaller, manageable pieces.
  2. Summarisation and Expansion: Generating a summary first, then expanding on each section iteratively.
  3. Context Sliding: Moving a “window” of context through a large document to generate or analyse it piece by piece.
  4. Keeping Track: While generating a small section of the document, keep the other portions of the document being generated, and any relevant information for the small section in the prompt. This is possible with the larger contexts.
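
As a minimal sketch of the chunked-generation and “keeping track” ideas above, assuming the standard OpenAI Python client and an illustrative outline structure:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def generate_long_document(outline: list[str], model: str = "gpt-4o") -> str:
    """Generate a long document section by section, feeding back what has been
    written so far so each new section stays consistent with the rest."""
    sections = []
    for heading in outline:
        written_so_far = "\n\n".join(sections)
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are drafting one section of a longer report."},
                {"role": "user", "content": (
                    f"Document so far:\n{written_so_far}\n\n"
                    f"Now write the next section titled '{heading}'. "
                    "Keep terminology consistent with the document so far."
                )},
            ],
        )
        sections.append(response.choices[0].message.content)
    return "\n\n".join(sections)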

Further Work

The advent of long context models represents a significant leap forward in AI capabilities. However, it also brings new challenges in terms of effective utilisation and management of these vast input spaces. Development on the model side has iterated so quickly that many developers have been caught out, having built systems for limited capabilities only to find the goalposts moving rapidly.

Conclusion

As we reflect on the developments in LLMs over the past year, it’s rather striking how quickly the landscape has evolved. From the emergence of capable vision models to the unexpected utility of mini models, and the expansion of context windows, we’ve seen a flurry of advancements that most of the technology world is struggling to keep up with.

The solid foundations provided by models like GPT4o and Claude 3.5 Sonnet offer glimpses of what might be possible in fields such as law and business. Yet, it’s important to remember that these tools, impressive as they may be, are not without their limitations. The challenges of consistency, reliability, and the need for human oversight remain ever-present.

The rise of mini models and the expansion of context windows present intriguing opportunities, particularly in tasks requiring rapid processing of vast amounts of information. However, one ought to approach these developments with a measure of caution, especially in fields where accuracy and nuance are paramount.

In the coming posts, we’ll delve deeper into the technical aspects of implementing these models and explore some of the persistent challenges that we’ve encountered. For now, we’re still very much in the early stages of understanding how best to harness these powerful tools.

Key Insights from a Year of Working with LLMs (1/4)

  1. Introduction
  2. Series Overview
  3. General Process Insights
    1. The crucial role of context
    2. Leveraging LLMs for information condensation
    3. Navigating document and thought hierarchies
    4. Starting small: The power of mini models in complex tasks
    5. The iterative path to quality
    6. Embracing a new creative process
  4. Conclusion

Introduction

The past year has seen significant developments in the field of Large Language Models (LLMs). From the release of GPT-4 to the emergence of open-source alternatives, the landscape has changed considerably, presenting new opportunities and challenges. As a patent attorney working at the intersection of law and technology, I’ve had the opportunity to observe and work with these advancements firsthand.

Over the last twelve months, I’ve explored the practical applications of LLMs, examining their potential to improve legal processes, support decision-making, and change how we interact with complex information. This experience has been a mix of insights, setbacks, and unexpected discoveries. What I’ve learned goes beyond technical knowledge; it’s a different approach to problem-solving, creativity, and human-machine collaboration.

This matters, particularly for those of us in legal technology. As LLMs become more capable, they’re not just enhancing our abilities; they’re prompting us to reconsider our approach to legal research, document analysis, and legal writing. There’s potential to automate routine tasks, find new insights in large document sets, and generate more nuanced legal arguments. However, like any tool, the value lies in understanding its capabilities, limitations, and how to use it effectively.

In this series of four posts, I’ll share key insights from a year of working intensively with LLMs. This first post focuses on general process insights, exploring how we can effectively integrate LLMs into our workflows and thinking processes. Subsequent posts will cover model-specific observations, technical implementation lessons, and persistent challenges and issues.

Whether you’re experienced with AI or just starting to explore LLMs in your practice, I hope you’ll find something of value in these observations.

(PS: Bear with me – some of these posts were written with help from Claude as I can then write 4 posts in the normal time it takes me to write 1.)

Series Overview

Over the next four weeks, I’ll be sharing insights gleaned from a year of intensive work with Large Language Models. Each post will focus on a different aspect of my experience:

  1. General Process Insights: This post explores how working with LLMs has changed our approach to problem-solving, creativity, and information processing. We’ll look at strategies for effective LLM integration into existing workflows and thought processes.
  2. Model-Specific Observations: The second post will look into the characteristics of different LLM models, including GPT-4, Claude 3.5 Sonnet, and various “mini” models. We’ll discuss their strengths, limitations, and optimal use cases.
  3. Technical Implementation Lessons: In the third post, we’ll get into the nuts and bolts of working with LLMs. This will cover topics like parallelization, asynchronous programming, and building robust tech stacks for LLM-powered applications.
  4. Persistent Challenges and Issues: The final post will address the ongoing challenges in working with LLMs, including issues of variability, evaluation difficulties, and limitations in certain types of tasks.

In this first post, we’ll focus on the general process insights. These are the broad lessons that apply across different models and applications, forming the foundation for effective work with LLMs. Let’s dive in.

General Process Insights

The crucial role of context

It’s easy to underestimate just how much implicit information we, as humans, bring to even the simplest tasks. When using LLMs, we need to be acutely aware of this context and find ways to make it explicit.

Here are some key observations about context:

  1. Implicit knowledge is vast: We often don’t realize how much background knowledge we’re using when we perform tasks. This can range from common-sense understanding to domain-specific expertise that we’ve internalized over years of practice.
  2. Context shapes interpretation: The same input can be interpreted differently depending on the context. For example, legal terms often have specific meanings that differ from their everyday usage.
  3. LLMs need explicit context: While LLMs have broad knowledge, they don’t have the same intuitive understanding of context that humans do. We need to provide this context explicitly.

To effectively provide context to LLMs, consider:

  • Defining the scope: Clearly outline the domain and specific area of focus for the task.
  • Providing background information: Include relevant facts, definitions, or principles that a human expert would typically keep in mind.
  • Specifying the desired outcome: Be explicit about what you want the LLM to produce or achieve.

For instance, when using an LLM for patent analysis, you might need to specify (a prompt sketch follows the list):

  • The specific area of technology
  • Relevant sections of patent law
  • Any particular conventions or standards in the field
  • The specific type of analysis you’re looking for (e.g., novelty, non-obviousness)
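
To make this concrete, here is a minimal sketch of one way to make that context explicit in a prompt template – the field names and wording are illustrative rather than a recommended formulation:

PATENT_ANALYSIS_PROMPT = """You are assisting a European patent attorney.

Technology area: {technology_area}
Relevant law: {legal_provisions}
Field conventions: {conventions}
Requested analysis: {analysis_type}

Claim under review:
---
{claim_text}
---

Prior art passage:
---
{prior_art_text}
---

Provide the requested analysis, citing the prior art passage for each point."""

prompt = PATENT_ANALYSIS_PROMPT.format(
    technology_area="household hair care appliances",
    legal_provisions="Article 54 EPC (novelty)",
    conventions="use reference numerals when citing features",
    analysis_type="novelty of claim 1 over the prior art passage",
    claim_text="...",       # claim text goes here
    prior_art_text="...",   # prior art text goes here
)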

By making context explicit, we not only improve the performance of LLMs but also often gain a deeper understanding of our own thought processes and the knowledge we bring to bear on complex tasks.

It’s worth noting that determining what context is necessary can be an iterative process. As you work with LLMs, you may discover gaps in the provided context that need to be filled to achieve optimal results.

Leveraging LLMs for information condensation

A particularly useful capability I’ve observed in LLMs over the past year is their proficiency in condensing large volumes of information into concise, high-quality outputs. This ability, coupled with the relatively low cost of LLM operations, opens up new possibilities for information processing and synthesis.

Key observations on LLM-based information condensation:

  1. Efficient summarisation: LLMs excel at distilling the essence of lengthy documents or multiple sources into clear, concise summaries.
  2. Cost-effective processing: The computational cost of running LLMs for condensation tasks is often lower than one might expect, especially when compared to the time and resources required for human-led summarisation.
  3. Scalability: This approach allows for the processing of much larger volumes of information than would be feasible with manual methods.

To leverage this capability effectively, consider the following approaches:

  • Iterative condensation: For very large datasets, use a multi-step process where initial summaries are further condensed in subsequent passes (see the sketch after this list).
  • Guided focus: Provide specific prompts or questions to direct the LLM’s attention to particular aspects of the information.
  • Cross-referencing: Use LLMs to identify and synthesise common themes or contradictions across multiple documents.
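
A minimal sketch of the iterative, guided condensation approach, assuming the standard OpenAI Python client and an illustrative mini model name:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def summarise(text: str, focus: str, model: str = "gpt-4o-mini") -> str:
    """One condensation pass, with a guided focus."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarise the text concisely."},
            {"role": "user", "content": f"Focus on: {focus}\n\n{text}"},
        ],
    )
    return response.choices[0].message.content

def iterative_condense(documents: list[str], focus: str) -> str:
    """First pass: summarise each document; second pass: condense the summaries."""
    first_pass = [summarise(doc, focus) for doc in documents]
    return summarise("\n\n".join(first_pass), focus)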

In legal practice, this capability can be particularly valuable. For instance:

  • Case law analysis: Condensing multiple relevant cases into a brief overview of key principles.
  • Contract review: Summarising lengthy contracts to highlight key terms, obligations, and potential issues.
  • Patent landscaping: Distilling large patent databases into concise overviews of technological trends or competitive positions.

It’s worth noting that while LLMs are powerful tools for information condensation, the output should always be verified by a knowledgeable human. LLMs can occasionally miss nuances or make errors, especially in complex or specialised fields.

By incorporating LLM-based condensation into our workflows, we can process and synthesise far more information than previously possible, potentially uncovering insights that might otherwise remain hidden in the sheer volume of available data.

Navigating document and thought hierarchies

Another insight I’ve gained over the past year is the importance of navigating hierarchies of documents and thoughts when working with LLMs. This is particularly relevant in legal contexts, where documents often have complex, nested structures.

Key observations on working with hierarchies:

  1. Structure mirrors thought: Document hierarchies often reflect the structures our brains use to organise complex information. For instance, the layout of legislation or legal arguments typically mirrors the logical progression of legal reasoning.
  2. Thought process hierarchies: By mapping out the hierarchy of thoughts or decision-making processes, we can guide LLMs to produce more coherent and logical outputs. This is essentially teaching the LLM to “think” in a structured manner similar to a legal professional. This is what OpenAI is doing at run-time with their o1 series of models.
  3. LLMs and structure: While LLMs can work with hierarchies, these structures are not hard-coded within their architectures – they are embedded in the billions of weight values across multiple transformer layers. We need to provide the structure explicitly, which often reveals insights into our own cognitive processes.
  4. Hard-coding helps: For specific tasks, hard-coding the document or thought structure into your prompts or preprocessing steps can significantly improve LLM performance.

Practical approaches to leverage hierarchies:

  • Document mapping: Create clear maps of document structures, especially for complex legal documents like contracts or legislation.
  • Hierarchical prompting: Structure your prompts to reflect the hierarchy of the document or thought process you’re working with.
  • Recursive analysis: Use LLMs to analyze documents at different levels of the hierarchy, then synthesize the results (a sketch follows this list).
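
As a minimal sketch of recursive, hierarchy-aware analysis – the Node structure and prompts are illustrative, and ask_llm stands in for any function that sends a prompt to a model and returns the text:

from dataclasses import dataclass, field

@dataclass
class Node:
    """A node in a document hierarchy, e.g. a contract section with sub-clauses."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)

def analyse_hierarchy(node: Node, ask_llm) -> str:
    """Analyse leaf clauses first, then synthesise upwards, mirroring the document structure."""
    if not node.children:
        return ask_llm(f"Analyse this clause:\n{node.title}\n{node.text}")
    child_analyses = [analyse_hierarchy(child, ask_llm) for child in node.children]
    return ask_llm(
        f"Synthesise an analysis of section '{node.title}' from these sub-analyses:\n\n"
        + "\n\n".join(child_analyses)
    )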

In legal applications, this approach can be particularly valuable:

  • Legislation analysis: Navigate through acts, sections, subsections, and clauses in a structured manner that mirrors legal interpretation processes.
  • Contract review: Analyze contracts by clearly delineating main sections, clauses, and subclauses, reflecting how legal professionals mentally organize contract information.
  • Legal reasoning: Structure legal arguments by explicitly laying out primary issues, sub-issues, and supporting points, mirroring the thought process of building a legal case.

By leveraging these hierarchies, we not only improve LLM performance but often gain deeper insights into the structure of our own knowledge and cognitive processes. This approach can lead to more precise, contextually aware analyses and outputs, particularly valuable in fields like law where structure and logical progression are crucial.

Starting small: The power of mini models in complex tasks

When working on complex projects with LLMs, such as content generation or intricate legal analysis, it’s often necessary to use more capable models like GPT-4 or Claude 3.5 Sonnet. However, an important insight I’ve gained is the value of identifying and isolating simpler subtasks within these complex projects where mini models can be effectively employed.

Mini models, including GPT-4o Mini, Claude 3.5 Haiku, and Gemini Flash, have proven surprisingly capable for well-defined, narrower tasks. These might include:

  1. Classification of document types or legal concepts
  2. Extracting specific information (dates, case numbers, party names)
  3. Formatting or standardising text
  4. Simple translations or paraphrasing

By breaking down larger, complex tasks into smaller, clearly defined subtasks, it becomes possible to leverage these mini models effectively. This approach offers several advantages:

  1. Efficiency: Mini models often process requests faster and at a lower cost.
  2. Focused Performance: They can be highly accurate within their specific, limited domains.
  3. Scalability: Easier to deploy and manage for high-volume, repetitive tasks.

For example, in a complex legal document analysis project, while the overarching analysis might require a more sophisticated model, tasks like initial document classification, key information extraction, or standardising citation formats could be handled efficiently by mini models.

This strategy of task decomposition not only optimises resource usage but also allows for more modular, maintainable systems. It’s a reminder that in the world of LLMs, bigger isn’t always better – the key is matching the right tool to the right task.

The iterative path to quality

Working with LLMs over the past year has reinforced the importance of iteration in producing high-quality outputs. The process is remarkably similar to making a croissant: add information, fold, refine, add more information, fold, refine. This iterative approach allows for a level of refinement that would be difficult to achieve in a single pass.

Key observations on the iterative process:

  1. Layered refinement: Each iteration adds a layer of nuance or precision to the output, much like the layers in a croissant.
  2. Rapid cycles: LLMs can perform these iterations much faster than humans, allowing for numerous refinement cycles in a short time.
  3. Evolving context: Each iteration can incorporate new context or feedback, leading to progressively more accurate or tailored results.

Practical approaches to leverage iteration:

  • Staged prompts: Break down complex tasks into a series of prompts, each building on the output of the previous one.
  • Feedback loops: Incorporate human feedback or automated evaluation metrics between iterations to guide the refinement process.
  • Parallel iterations: Run multiple iterative processes simultaneously with slightly different parameters, then synthesize the results.

In legal applications, this iterative approach can be particularly valuable:

  • Contract drafting: Progressively refine contract language through multiple passes, each focusing on different aspects (e.g., clarity, legal compliance, risk mitigation).
  • Legal research: Iteratively refine search queries and synthesize findings, gradually building a comprehensive understanding of a legal issue.
  • Argument construction: Build legal arguments layer by layer, starting with main points and progressively adding supporting evidence, counterarguments, and rebuttals.

It’s worth noting that while LLMs can perform iterations rapidly, the process still benefits from human oversight. The role of the legal professional shifts from performing each iteration manually to guiding the overall process and providing critical evaluation.

This iterative approach allows us to harness the speed and processing power of LLMs while still maintaining the nuanced understanding and quality control that are crucial in legal work. By embracing this process, we can produce higher quality outputs more efficiently than ever before.

Embracing a new creative process

Working with LLMs over the past year has led to some interesting shifts in how we approach creative and business processes. These changes can seem quite unusual at first, but they offer new possibilities for efficiency and quality in our work.

Key observations on the new creative process:

  1. Emphasis on refinement: The creative act is no longer just about the initial draft from mind to keyboard. A significant part of the process now involves editing and refining LLM-generated content.
  2. Perfection isn’t the starting point: With LLMs, it’s often more efficient to generate a rough output quickly and then refine it, rather than striving for perfection in the first draft.
  3. Iterative improvement: The process becomes more about guiding and shaping the output through multiple iterations, rather than creating everything from scratch.
  4. Hidden steps become visible: Steps in the creative process that were previously internal and hidden now become explicit when working with LLMs.

Practical approaches to this new process:

  • Rapid prototyping: Use LLMs to quickly generate multiple versions or ideas, then refine the most promising ones.
  • Structured refinement: Develop systematic approaches to reviewing and improving LLM outputs.
  • Collaborative creativity: View the process as a collaboration between human expertise and LLM capabilities, each contributing their strengths.

In legal and business contexts, this approach can be valuable for:

  • Document drafting: Quickly generate initial drafts of contracts, memos, or reports, then focus human expertise on refining and customizing.
  • Brainstorming: Use LLMs to generate a wide range of ideas or solutions, then apply human judgment to select and develop the best ones.
  • Analysis and research: Generate initial analyses or research summaries, then iteratively refine and expand based on human insight and additional prompts.

It’s important to note that this shift doesn’t diminish the value of human creativity and expertise. Instead, it allows us to focus our efforts on higher-level thinking, critical analysis, and nuanced refinement.

By embracing this new creative process, we can often produce higher quality work more efficiently. However, it does require a mindset shift, moving away from the idea of the solitary creator and towards a more collaborative, iterative approach to creativity and problem-solving.

Conclusion

As we’ve explored in this post, working with LLMs over the past year has led to several important insights about how we approach complex tasks and creative processes. We’ve seen the value of starting small with mini models for well-defined subtasks, the importance of externalising our thought processes, and the crucial role of context in guiding LLM outputs.

We’ve also discovered the power of LLMs in condensing large amounts of information and navigating complex document hierarchies, mirroring our own thought processes. The iterative nature of working with LLMs has emerged as a key factor in producing high-quality outputs, leading us to embrace a new creative process that emphasises refinement and collaboration between human and machine.

These insights have implications for how we approach legal and business tasks. By leveraging LLMs effectively, we can often work more efficiently, process larger volumes of information, and potentially uncover insights that might otherwise remain hidden.

However, it’s important to remember that these tools are just that – tools. They require thoughtful application, guided by human expertise and judgment. As we continue to integrate LLMs into our workflows, we’ll need to remain adaptable, continuously refining our processes to make the most of these powerful capabilities.

In the next post in this series, we’ll jump into specific observations about different LLM models, exploring their strengths, limitations, and optimal use cases. Until then, I encourage you to consider how these general insights might apply to your own work with LLMs, whether you’re just starting out or already deeply engaged with these technologies.

Can Long-Context Large Language Models Do Your Job?

In this post we test the abilities of long-context large language models for performing patent analysis. How do they compare with a patent partner charging £400-600 an hour?

Or have I cannibalised my job yet?

Or do we still need Retrieval Augmented Generation?

  1. What is a Long-Context Large Language Model?
    1. Large Language Models (LLMs)
    2. Long Context
  2. What Patent Task Shall We Test?
  3. Top Models – March 2024 Edition
  4. Can I use a Local Model?
  5. How much?
    1. GPT4 Turbo
    2. Claude 3
  6. First Run
    1. Prompts
    2. Getting the Text
    3. Simple Client Wrappers
    4. Results
      1. D1 – GPT4-Turbo
      2. D1 – Claude 3 Opus
      3. D1 – First Round Winner?
      4. D2 – GPT4-Turbo
      5. D2 – Claude 3 Opus
      6. D2 – First Round Winner?
  7. Repeatability
  8. Working on the Prompt
  9. Does Temperature Make a Difference?
    1. GPT4-Turbo and D1
      1. Temperature = 0.7
      2. Temperature = 0.1
    2. Claude 3 and D1
      1. Temperature = 0.7
      2. Temperature = 0.1
    3. GPT4-Turbo and D2
      1. Temperature = 0.7
      2. Temperature = 0.1
    4. Claude 3 and D2
      1. Temperature = 0.7
      2. Temperature = 0.1
  10. Failure Cases
    1. Missing or Modified Claim Features
    2. Making Up Claim Features
    3. Confusing Claim Features
  11. Conclusions and Observations
    1. How do the models compare with a patent partner charging £400-600 an hour?
    2. Have I cannibalised my job yet?
    3. Do we still need Retrieval Augmented Generation?
    4. What might be behind the variability?
    5. Model Comparison
  12. Further Work
    1. Vision
    2. Agent Personalities
    3. Whole File Wrapper Analysis
    4. “Harder” Technology

What is a Long-Context Large Language Model?

Large Language Models (LLMs)

Large Language Models (LLMs) are neural network architectures. They are normally based on a Transformer architecture that applies self-attention over a number of layers (typically dozens in the larger models). The more capable models have billions, if not trillions, of parameters (mostly weights in the neural networks). The most efficient way to access these models is through a web Application Programming Interface (API).

Long Context

LLMs have what is called a “context window”. This is a number of tokens that can be ingested by the LLM in order to produce an output. Tokens are roughly mapped to words (the Byte-Pair Encoding – BPE – tokeniser that is preferred by most models is described here – tokens are often beginnings of words, word bodies, and word endings).
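
You can count tokens yourself – a minimal sketch using OpenAI’s tiktoken library (the encoding name is the one used by the GPT-4-era models; other providers have their own tokenisers):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
text = "A hair care appliance comprising a body having an outer wall..."
tokens = encoding.encode(text)
print(f"{len(tokens)} tokens for {len(text.split())} words")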

Early LLMs had a context of ~512 tokens. This quickly grew to between 2000 and 4000 tokens for commercially available models in 2023. Context is restricted because the Transformer architecture performs its attention computations over the whole context; the longer the context, the larger the attention matrices and the more compute and memory involved (the cost of self-attention grows roughly quadratically with context length).

In late 2023/early 2024, a number of models with long context emerged. The context window for GPT3.5 quickly extended from 4k to 16k tokens, with 8k and 32k contexts available for the more capable GPT4 model, before a 128k context window was launched in November 2023 for the GPT4-Turbo model.

(Note: I’ve often found a lag between the “release” of models and their accessibility to Joe Public via the API – often a month or so.)

In January 2024, we saw research papers documenting input contexts of up to a million tokens. These appear to implement an approach called ring attention, which was described in a paper in October 2023. Anthropic released Claude 3 Opus in March 2024, which appeared comparable to GPT4 and had a stable long context of 200k tokens.

We thus seem to be entering a “long context” era, where whole documents (or sets of documents) can be ingested.

What Patent Task Shall We Test?

Let’s have a look at a staple of patent prosecution: novelty with respect to the prior art.

Let’s start reasonably easy with a mechanical-style invention. I’ve randomly picked WO2015/044644 A1 from the bucket of patent publications. It’s a Dyson application for a hair dryer (my tween/teenage girls are into hair these days). The prior art citations are pretty short.

  1. A hair care appliance comprising a body having an outer wall, a duct extending
    at least partially along the body within the outer wall, an interior passage
    extending about the duct for receiving a primary fluid flow, a primary fluid
    outlet for emitting the primary fluid flow from the body, wherein the primary
    fluid outlet is defined by the duct and an inner wall of the body, wherein at least
    one spacer is provided between the inner wall and the duct.
Claim 1

In the International phase we have three citations.

D1 and D2 are used to support a lack of novelty, so we’ll look at them.

Note: we will not be looking at whether the original claim is or is not novel from a legal perspective. I have purposely not looked into anything in detail, nor applied a legal analysis. Rather we are looking at how the language models compare with a European Examiner or Patent Attorney. The European Examiner may also be incorrect in their mapping. As we know, LLMs can also “hallucinate” (read: confabulate!).

Top Models – March 2024 Edition

There are two:

  • GPT4-turbo; and
  • Claude 3 Opus.

These are the “top” models from each of OpenAI and Anthropic. I have a fair bit of experience with GPT3.5-Turbo, and I’ve found anything less than the “top” model is not suitable for legal applications. It’s just too rubbish.

For the last year (since April 2023), GPT4 has been the king/queen, regularly coming 10-20% above other models in evaluations. Nothing has been close to beating it.

GPT4-turbo performs slightly worse than GPT4, but it’s the only model with a 128k token context. It is cheaper and quicker than GPT4. I’ve found it good at producing structured outputs (e.g., nice markdown headings etc.) and at following orders.

Claude 3 Opus has a 200k token context and is the new kid on the block. The Opus model is allegedly (from the metrics) at the level of GPT4.

It’s worth noting we are looking at the relatively bleeding edge of progress here.

  • GPT4-turbo was only released on 6 November 2023. On release it had certain issues that were only resolved with the 25 January 2024 update. We will use the 25 January 2024 version of the model. I’ve noticed this January model is better than the initially released model.

Can I use a Local Model?

Short answer: no.

Longer answer: not yet.

There are a couple of 1 million token models available. See here if you are interested. I tried to run one locally.

It needed 8.8TB of RAM. (My beefy laptop has 64GB RAM and 8GB VRAM – only some 8,700GB short.)

Progress though is super quick in the amateur LLM hacking sphere (it’s only big matrix multiplication in an implementation). So we might have an optimised large context model by the end of the year.

Also I’ve found the performance of the “best” open-source 7B parameter models (those that I can realistically run on my beefy computers) is still a long way from GPT4 – more GPT3.5-Turbo level, which I have found “not good enough” for any kind of legal analysis. I’ve also found open-source models trickier to control to get appropriate output (e.g., doing what you ask, keeping to task etc.).

How much?

You have to pay for API access to GPT4-Turbo and Claude 3. It’s not a lot though, being counted in pence for each query. I’ve found it’s worth paying £5-10 a month to do some experiments on the top models.

Here are some costings based on the patent example above, that has two short prior art documents.

The claim is around 100 tokens. The prior art documents (D1 and D2) are around 3000 and 6000 tokens. Throw in a bundle of tokens for the input prompts and you have around 9200 tokens input for two prior art documents.

On the output side, a useful table comparing a claim with the prior art is around 1500 tokens.

GPT4 Turbo

GPT4-Turbo has a current pricing of $10/1M tokens on the input and $30/1M tokens on the output. So we have about 10 cents ($0.092) on the input and about 5 cents on the output ($0.045). Around 15 cents in total (~12p). Or around 1s (!!!) of chargeable patent partner time.

Claude 3

The pricing for Claude is similar but a little more expensive – $15/1M on the input and $75/1M on the output (reflecting its alleged GPT4, rather than GPT4-Turbo, level).

So we have about 14 cents ($0.138) on the input and about 11 cents on the output ($0.1125). Around 25 cents in total (~20p). Or around 2s (!!!) of chargeable patent partner time.

These costs are peanuts compared to the amounts charged by attorneys and law firms. It opens up the possibility of statistical analysis, e.g. multiple iterations or passes through the same material.
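
The arithmetic is simple enough to wrap in a small helper – a sketch using the prices quoted above (which will no doubt date quickly):

def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_m: float, output_price_per_m: float) -> float:
    """Rough API cost for a single call, given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Figures from the worked example above
print(estimate_cost_usd(9200, 1500, 10, 30))   # GPT4-Turbo: ~$0.14
print(estimate_cost_usd(9200, 1500, 15, 75))   # Claude 3 Opus: ~$0.25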

First Run

For our experiments we will try to keep things as simple as possible, to observe behaviour “out-of-the-box”.

Prompts

For a system prompt I will use:

You are a patent law assistant.

You will help a patent attorney with patent prosecution.

Take an European Patent Law perspective (EP).

As our analysis prompt scaffold I will use:

Here is an independent patent claim for a patent application we are prosecuting:    
---
{}
---

Here is the text from a prior art document:
---
{}
---

Is the claim anticipated by the prior art document?
* Return your result with a markdown table with a feature mapping
* Cite paragraph numbers, sentence location, and/or page/line number to support your position
* Cite snippets of the text to demonstrate any mapping

The patent claim gets inserted in the first set of curly brackets and the prior art text gets inserted in the second set of curly brackets.

We will use the same prompts for both models. We will let the model choose the columns and arrangement of the table.

Getting the Text

To obtain the prior art text, you can use a PDF Reader to OCR the text then save as text files. I did this for both prior art publication PDFs as downloaded from EspaceNet.

  • You can also set up Tesseract via a Python library, but it needs system packages so can be fiddly and needs Linux (so I sometimes create a Docker container wrapper).
  • Python PDF readers are a little patchy in my experience. There are about four competing libraries, with projects folding and being forked all over the place. They can struggle on more complex PDFs. I think I use pyPDF. I say “I think” because you used to have to use pyPDF2, a fork of pyPDF, but then the projects were remerged, so pyPDF (v4) is a developed version of pyPDF2. Simples, no? (A minimal extraction sketch follows this list.)
  • You can also use EPO OPS to get the text data. But this is also a bit tricky to set up and parse.
  • It’s worth noting that the OCRed text is often very “noisy” – it’s not nicely formatted in any way, often has missing or misread characters, and the whitespace is all over the place. I’ve traditionally struggled with this prior to the LLM era.
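
For completeness, here is a minimal pypdf extraction sketch – the filename is illustrative, and you should expect to clean up the extracted text afterwards:

from pypdf import PdfReader

reader = PdfReader("prior_art_D1.pdf")  # illustrative filename
text = "\n".join(page.extract_text() or "" for page in reader.pages)

with open("prior_art_D1.txt", "w", encoding="utf-8") as f:
    f.write(text)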

The claim text I just copied and pasted from Google patents (correctness not guaranteed).

Simple Client Wrappers

Nothing fancy to get the results, just some short wrappers around the OpenAI and Anthropic Python clients:

# Assumed imports / client setup (not shown in the original post)
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()        # assumes OPENAI_API_KEY is set
anthropic_client = Anthropic()  # assumes ANTHROPIC_API_KEY is set

# SYSTEM_PROMPT, PROMPT_SCAFFOLD, OPENAI_MODEL and ANTHROPIC_MODEL are assumed
# to be defined elsewhere (the prompts and model names discussed above).

def compare_claim_with_prior_art_open_ai(claim: str, prior_art: str, system_msg: str = SYSTEM_PROMPT, model: str = OPENAI_MODEL):
    """Compare a claim with a prior art document using the OpenAI chat API."""
    completion = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": PROMPT_SCAFFOLD.format(claim, prior_art)}
        ],
        temperature=0.3
    )
    return completion.choices[0].message.content

def compare_claim_with_prior_art_anthropic(claim: str, prior_art: str, system_msg: str = SYSTEM_PROMPT, model: str = ANTHROPIC_MODEL):
    """Compare a claim with a prior art document using the Anthropic messages API."""
    message = anthropic_client.with_options(max_retries=5).messages.create(
        model=model,
        max_tokens=4000,
        temperature=0.3,
        system=system_msg,  # use the passed-in system message rather than the module constant
        messages=[
            {"role": "user", "content": PROMPT_SCAFFOLD.format(claim, prior_art)}
        ]
    )
    return message.content[0].text
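
These can then be called with the claim text and the OCRed prior art text along the lines of (file names illustrative):

claim_1 = open("claim_1.txt").read()       # pasted claim text
d1_text = open("prior_art_D1.txt").read()  # OCRed prior art text

print(compare_claim_with_prior_art_open_ai(claim_1, d1_text))
print(compare_claim_with_prior_art_anthropic(claim_1, d1_text))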

Results

(In the analysis below, click on the images if you need to make the text bigger. Tables in WordPress HTML don’t work as well.)

D1 – GPT4-Turbo

Here’s GPT4-Turbo first off the blocks with D1:

Let’s compare again with the EP Examiner:

Successes:

  • “hair care appliance” – yes, gets this and cites the same objects as the EP Examiner (actually does a better job of referencing but hey ho).
  • “spacer” – while GPT4-Turbo says this is not “explicitly mentioned”, it does cite the “struts 24”, which are the same features cited by the EP Examiner.

Differences:

  • “outer wall” – deemed to be not explicitly present – doesn’t make the jump made by the EP Examiner to find this feature implicit in the structure of the “hair dryer 2”.
  • “duct…within the outer wall” – GPT4-Turbo decides to cite an inner hot air passageway formed by the fan 3 and heater 4 – on a brief look this seems possibly valid in isolation. However, there is an argument that it’s the outer passageway 12 that better extends within the outer wall.
  • “interior passage” – GPT4-Turbo can’t find this explicitly mentioned. Interestingly, the EP Examiner doesn’t cite anything directly to anticipate this feature, so we can maybe assume it is meant to be implicit?
  • “primary fluid flow outlet” – GPT4-Turbo cites the “blower opening 7”, which is a fluid outlet.
  • “primary fluid flow outlet defined by the duct and an inner wall of the body” – GPT4-Turbo says this is implicit, saying it is defined by “inner structures”. It’s not the most convincing but, looking at the picture in Figure 1, it could be argued. I do think the EP Examiner’s “cold air nozzle” is a bit of a better fit. But you could possibly argue both?

We will discuss this in more detail in the next section, but for now let’s also look at Claude 3…

D1 – Claude 3 Opus

Now let’s see how new kid, Opus, performs:

Successes:

  • “hair care appliance” and “outer wall” – yes, gets these and cites the same objects as the EP Examiner (actually does a better job of referencing but hey ho).
  • “primary fluid outlet” – hedges its bets by referring to both the hot and cold air streams but slightly better matches the EP Examiner’s citation.

Differences:

  • “duct…within the outer wall” – Claude 3’s a bit more bullish than GPT4-Turbo, announcing this is not disclosed. I’d warrant that there’s more evidence for it being disclosed than not disclosed, so would side more with the EP Examiner than Claude.
  • “interior passage” – Again, whereas GPT4-Turbo was a little more tentative, Claude 3 appears more confident in saying this is not disclosed. I don’t necessarily trust its confidence but, as before, the EP Examiner is silent on what explicitly anticipates this feature.
  • “primary fluid flow outlet defined by the duct and an inner wall of the body” – Claude 3 here says it is not disclosed, but I don’t think this is entirely right.
  • “spacer” – Claude 3 says this isn’t disclosed and doesn’t mention the “struts 24”.

D1 – First Round Winner?

I’d say GPT4-Turbo won that round for D1.

It didn’t entirely match the EP Examiner’s mapping, but was pretty close.

Both models were roughly aligned and there was overlap in cited features.

I’d still say the EP Examiner did a better job.

Let’s move onto D2.

D2 – GPT4-Turbo

Here’s what the EP Examiner said about D2:

Helpful. Here’s the search report:

Also helpful.

Here’s Figure 1:

And here’s the results:

Successes:

  • “body having an outer wall” – yes, in isolation this can be argued.
  • “duct” – does say this appears to be present but does indicate the word “duct” is not explicitly used (CTRL-F says: “correct”).
  • “interior passage” – GPT4-Turbo cites the flow “through the casing to the grille”, where the casing is 12 in Figure 1 and the grille is 24 (using those numerals would help GPT4-Turbo!). This I think can be argued in isolation.
  • “primary fluid outlet” – lines 50 to 60 of column 2 do refer to a “blow opening” as quoted, and the “primary fluid flow” does go from the grille to the “blow opening”. Good work here.

Differences / Failures:

  • “A hair care appliance” has gone walkabout from the claim features.
  • “…defined by the duct and an inner wall” – GPT4-Turbo says this is not explicitly disclosed but does take a guess that it is implicitly disclosed. I would like some more detailed reasoning about what features could stand in for the duct and inner wall. But I’d also say GPT4-Turbo is not necessarily wrong. In Figure 1, there is an “air flow passage 33” between the “back shell 20” and the “reflector-shield 28”, which could be mapped to a “duct” and an “inner wall”?
  • “spacer” – GPT4-Turbo can’t find this. If you mapped the “air flow passage 33” to the “duct”, “spacers” may be implicit? A discussion on this and its merits would be useful. Checking D2, I see there is explicit disclosure of “spacer means” in line 55 of column 3. I’m surprised this is absent.

D2 – Claude 3 Opus

Successes:

  • “hair care appliance” and “outer wall” – yes, although I think GPT4-Turbo’s “back shell 20” is better.
  • “primary fluid outlet” – yes, I think element 36 and the front “grille” can be argued in isolation to be a “primary fluid outlet”.

Differences / Failures:

  • “duct” – Claude 3 does say this is present but the cited text isn’t amazingly useful, despite being from the document. It’s not clear what is meant to be the “duct”. However, it is true you could argue something within the back and front shell is a duct.
  • “interior passage” – similar to “duct” above. Claude 3 says it is present but the text passage provided, while from the document, doesn’t seem entirely relevant to the claim feature.
  • definition of “primary fluid outlet” – Claude 3’s reasoning here seems appropriate if you have the molded “multiple purpose element 36” as the “primary fluid outlet”, but there is maybe room to argue the “periphery openings 42” help define the “element 36”? Definitely room for a discussion about whether this feature is present.
  • “spacer” – as per GPT4-Turbo, Claude 3 says this is not present despite there being “spacer means” in line 55 of column 3.

D2 – First Round Winner?

GPT4-Turbo and Claude 3 both do a little less well on the twice-as-long D2.

They do have the disadvantage of not being able to use the figures (*yet*).

Their lack of discussion of the “air flow passage 33” formed from “openings 42” is a little worrying. As is their ignorance of the “spacer means” in line 55 of column 3.

Patent attorney and EP Examiner win here.

Repeatability

As I was running some tests (coding is iterative: you fail, then correct, then fail, then correct until it works), I noticed that there was a fair bit of variation in the mapping tables I was getting back from both models. This is interesting, as a human being would expect a mapping to be relatively stable – the claim features are either anticipated, or they are not.

Here’s GPT4-Turbo again on D1:

Here’s the previous run:

We can see the following issues:

  • In the first analysis GPT4-Turbo thought the “outer wall” was disclosed. In the second run, it said it was not explicitly mentioned.
  • Also note how we have slightly different “features” for each run, and differing columns and formats.
  • The mapping for the “duct” is also different, with differing levels of “confidence” on the presence and the possible implicit features.
  • On the first run, GPT4-Turbo thought the “interior passage” was “not explicitly mentioned” but on the second run thought it was implied by structures and provided paragraph references.
  • Different features are mapped to the “primary fluid outlet”.
  • It locates the “struts 24” on both runs but on the first run thinks they are “functionally similar”, while on the second run finds them to “serve a different purpose”.

Uh oh. We have quite a different mapping each time we perform the run.

Let’s look at running Claude 3 again:

As compared to the previous run:

Claude 3 seems slightly more consistent between runs. We can see that the columns have shifted around, and I don’t necessarily agree with the mapping content, but the mapping detail seems mostly conserved.

Let’s look at another run for Claude 3 on D2:

Here Claude 3 does much better than the first run. The citation column appears more relevant. And, party-time, it’s found and mentioned the “spacer means”. The “interior passage” mapping is better in my opinion, and is more reflective of what I would cite in isolation on a brief run-through.

Working on the Prompt

Maybe we can overcome some of these variability problems by working on the prompt.

It may be that the term “anticipated” is nudging the analysis in a certain, more US-centric, direction. Let’s try explicitly referencing Article 54 EPC, which is more consistent with us setting a “European Patent Law perspective” in the system prompt.

Also, let’s try shaping the mapping table: we can specify the columns we want filled in.

Here, then, is a revised prompt:

Here is an independent patent claim for a patent application we are prosecuting:
---
{}
---

Here is the text from a prior art document:
---
{}
---

Is the claim novel under Art.54 EPC when compared with the prior art document?
* Return your result with a markdown table with a feature mapping
* Cite paragraph numbers, sentence location, and/or page/line number to support your position
* Cite snippets of the text to demonstrate any mapping

Here is the start of the table:
| # | Feature Text | In prior art? Y/N | Where in prior art? | Any implicit disclosure? | Comments |
|---| --- | --- | --- | --- | --- |

Does that help?

In short – not really!

GPT4-Turbo seems to do a little worse with this new prompt. It appears more certain about the mapping – e.g. the “duct” is deemed not in the prior art (“N”), with no implicit disclosure and simply a statement that “The prior art does not explicitly describe a duct within the outer wall of the body”. This can be compared to the first run where this was deemed present “indirectly”.

GPT4-Turbo also introduces an error into the claim mapping, which we discuss later below.

Even though we specify more columns, the amount of text generated appears roughly the same. This means that, for both models, the reasoning per feature is a bit shorter, and the models tend towards more fixed statements of presence or, more often, non-presence.

Also, although our “In prior art? Y/N” column provides a nice single letter output we can parse into a structured “True” or “False”, it does seem to nudge the models into a more binary conclusion. For example, the comments tend to confirm the presence conclusion without additional detail, whereas when the model was able to pick the columns, there was a longer, more useful discussion of potentially relevant features.
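
As an aside, pulling that single-letter column out of the returned markdown table is straightforward – a minimal parsing sketch, assuming the table format requested in the revised prompt above:

def parse_presence_column(markdown_table: str) -> dict[str, bool]:
    """Map each claim feature to True/False from the 'In prior art? Y/N' column."""
    results = {}
    for row in markdown_table.strip().splitlines():
        cells = [c.strip() for c in row.strip("|").split("|")]
        # Skip non-table lines, the header row and the separator row
        if len(cells) < 3 or cells[0] in ("#", "") or set(cells[0]) <= {"-", " "}:
            continue
        feature, presence = cells[1], cells[2].upper()
        results[feature] = presence.startswith("Y")
    return results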

I had hoped that the “Any implicit disclosure” column would be a (sub) prompt for considering implicit disclosures a bit more creatively. This doesn’t seem to be the case for either model. Only Claude 3 uses it once in the D2 mapping (although it does use it there in the way I was hoping). I think we will ditch that column for now.

This little experiment suggests that keeping any mapping table as simple as possible helps improve performance. It also shows that LLM-wrangling is often as much of an art as a science.

Does Temperature Make a Difference?

Temperature is a hyperparameter that scales the logits output by the model prior to sampling the probabilities. This is a nice explanation. Or in English, it controls how “deterministic” or “random” the model output is. Values of around 0.1 / 0.2 should give pretty consistent output without much variation; values around and above 1 will be a lot more “creative”.
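
A minimal sketch of what temperature does mechanically, using numpy (the logits are illustrative):

import numpy as np

def sample_with_temperature(logits, temperature=0.7, rng=None):
    """Scale logits by 1/temperature, apply softmax, then sample a token index."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Lower temperature sharpens the distribution towards the most likely token;
# higher temperature flattens it, making less likely tokens more probable.
print(sample_with_temperature([2.0, 1.0, 0.5], temperature=0.1))
print(sample_with_temperature([2.0, 1.0, 0.5], temperature=1.5))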

I generally use a temperature of somewhere between 0.3 and 0.7. I have found that higher temperatures (around 0.7) are sometimes better for logical analysis where a bit of “thinking outside the obvious” is required.

Let’s go back to a three column table with “Claim Feature”, “Prior Art Mapping”, and “Cited Portions”. Let’s then run a round with a temperature of 0.7 and a temperature of 0.1. We will at least keep the prompt the same in both cases.

From the experiments above, it may be difficult to determine the effect of temperature over and above variability inherent in the generating of responses, but let’s have a look anyway.

(Those with a proper science degree look away.)

GPT4-Turbo and D1

Temperature = 0.7

Temperature = 0.1

There doesn’t actually seem to be that much difference between the mappings here, apart from that underlying variability discussed before.

It may be that with the temperature = 0.7 run, the model is freer to diverge from a binary “yes/no” mapping.

In the temperature = 0.1 run, GPT4-Turbo has actually done pretty well, matching the EP Examiner’s conclusions on all features apart from the last feature (but at least indicating what could be mapped).

Claude 3 and D1

Temperature = 0.7

Temperature = 0.1

Here we can again see that Claude 3 seems more consistent between runs. While there are some small differences, the two runs are very similar, with often word-for-word matches.

Claude 3 does well here, pretty much matching the EP Examiner’s objection in both cases.

GPT4-Turbo and D2

Temperature = 0.7

Temperature = 0.1

Here we can see the variation of the GPT4-Turbo. With one mapping, all the features are found in the prior art; with the other mapping, nearly all the features are not found in the prior art. Which to believe?!

Claude 3 and D2

Temperature = 0.7

Temperature = 0.1

Again Claude 3 seems much more consistent across runs. But I’m not that impressed with the reasoning – e.g. compare these to the “good” GPT4-Turbo run above.

So, in conclusion, temperature doesn’t seem to make a load of difference here. It is not a silver bullet, transforming “bad” mappings into “good”. The issues with performance and consistency appear to be model-based rather than hyperparameter-based.

Failure Cases

Missing or Modified Claim Features

With GPT4-Turbo there was a case where the claim features did not entirely sync up with the supplied claim text:

Here “[a] hair care appliance comprising” has gone walk-about.

Also GPT4-Turbo seems to paraphrase some of the claim features in the table.

The feature “an interior passage extending about the duct for receiving a primary fluid flow” becomes “interior passage for receiving a primary fluid flow“. Such paraphrasing by a trainee would give senior patent attorneys the heebie-jeebies.

Making Up Claim Features

This is an interesting failure case from GPT4-Turbo. It appears to get carried away and adds three extra claim features to the table. The claim doesn’t mention infra-red radiation anywhere…

As with the case below, it seems to be getting confused between the “claim 1” we are comparing and the “claim 1” of the prior art. It is interesting to note this occurs for the longer prior art document. It is a nice example of long document “drift”. I note how RAG offers a solution to this below.

I found similar behaviour from GPT3.5-turbo. GPT3.5-turbo was a lot worse: it often just made up the claim features, or decided to take them from the comparison text instead of the claim. Similar to if you gave the exercise to a 6-year-old.

Confusing Claim Features

Here Claude 3 does what at first sight looks like a good job. Until you realise the LLM is mapping what appears to be a claim from the prior art document onto the prior art document.

This may be an issue I thought we might see in long context models. In the prompt we put the claim we are comparing first. But then we have 6000 tokens from D2. It looks like this might cause the model to “forget” the specific text of the claim but “remember” that we are mapping some kind of “claim”, and so pick the nearest “claim” – claim 1 of D2.

Looking at the claim 1 of D2 this does appear to be the case:

In a hand held hair dryer having means for directing air flow toward hair to be dried, the improvement comprising, in combination:
a casing having a forward grille-like support member adapted to be faced toward the hair to be dried;
an infra-red, ring-shaped, radiator in said casing spaced rearwardly of the grille-like member;
a motor carried on the grille-like member and extending rearwardly thereof centrally of said ring shaped radiator;
shield means between the ring-shaped radiator and the motor for protecting the motor from the infrared radiation; radiation reflector means, including a portion spaced rearwardly of the ring-shaped radiator for directing reflected radiation toward and through the grille-like member;
a flat air propeller operatively associated with and driven by the motor and located spaced axially formed of the rearward portion of the reflector and rearward of the ring-shaped radiator, the propeller being operative to direct only a gentle flow of air through said grille toward the hair to be dried, to avoid destruction and disarray of the hairdo, but to move the top layers of hair sufficiently to permit radiation drying of the hair mass; and
means for introducing cooling air into the casing to cool portions of the casing and the motor.

Claim 1 of D2

It’s interesting to note that this was also a problem we found with GPT3.5-Turbo.

Conclusions and Observations

What have we found out?

  • Results are at a possibly-wrong, average-ability, science-graduate level.
  • Prompt crafting is an art – you can only learn by doing.
  • Temperature doesn’t matter that much.
  • Variability is a problem with GPT4-Turbo.
  • LLMs can get confused on longer material.

At the start we had three questions:

  1. How do the models compare with a patent partner charging £400-600 an hour?
  2. Have I cannibalised my job yet?
  3. Do we still need Retrieval Augmented Generation?

Let’s see if we can partially answer them.

How do the models compare with a patent partner charging £400-600 an hour?

Ignoring cost, we are not there yet. Patent attorneys can sigh in relief for maybe another year.

But considering the models cost the same as 1-2s of patent partner time, they didn’t do too bad at all.

One big problem is consistency.

GPT4-Turbo has some runs and some feature mappings that I am fairly happy with. The problem is I perform a further run with the same prompt and the same parameters and I get a quite different mapping. It is thus difficult to “trust” the results.

Another big problem is apparent confidence.

Both models frequently made quite confident statements on feature disclosure. “This feature is not disclosed”. However, on the next mapping run, or after tweaking the prompt, the feature was found to be disclosed. So the confident statements are more a feature of the output style than a reliable signal. The models don’t seem to do accurate shades of confidence out-of-the-box.

If you are a skeptical person like myself, you might not believe what you are told by human or machine (watch the slide into cynicism though). In which case, you’d want to see and review the evidence for any statement yourself before agreeing. If you treat LLMs in this manner, like a brand new graduate trainee, sometimes helpful, sometimes off, then you are well placed.

If you are a nice trusting human being that thinks that both human beings and machines are right in what they say, you will come to harm using LLMs. LLMs are particularly slippery because they provide the most likely, not the most factually correct, output. While the two are correlated, correlation does not necessarily equal truth (see: science).

Often, a clear binary mapping (“Yes – the feature is disclosed”) leads the model to later justify that sampling (“The feature is disclosed because the feature is disclosed”) rather than provide useful analysis. We had better performance when we were less explicit in requiring a binary mapping. However, this then leads to problems in parsing the results – is the feature disclosed or not?

Have I cannibalised my job yet?

Not quite.

But if I needed to quickly brainstorm mappings for knocking out a claim (e.g., in an opposition), I might run several iterations of this method and look at the results.

Or if I was drafting a claim, I could “stress test” novelty against known prior art by iterating (e.g. 10-20 times?) and looking at the probabilities of feature mappings.

If neither model can map a feature, then I would be more confident in the claim’s robustness during examination. These would be the features worth providing inventive step arguments for in the specification. But I would want to do a human review of everything as well.

While I do often disagree with many of the mappings, they tend not to be completely “wrong”. Rather, they are often just poorly argued or evidenced, miss something I would pick up on, or are inconsistent across the whole claim. So they sit at the level of a “quick and dirty opposition”, or a “frustrated examiner getting the case off their desk”.

If models do map a feature, even if I don’t agree with the mappings, they give me insight into possible arguments against my claim features. This might enable me to tweak the claim language to break these mappings.

Do we still need Retrieval Augmented Generation?

Surprisingly, I would say “yes”.

The issues with the claim feature extraction and the poorer performance on the longer document indicate that prompt length does make a difference, even for long-context models. Sometimes the model just gets distracted or goes off on one. Quite human-like.

Also I wasn’t amazingly impressed with the prior art citations. The variability in passages cited, the irrelevance of some passages, and the lack of citation of some obvious features reduced my confidence that the models were actually finding the “best”, most representative disclosure. The “black box” nature of a large single prompt makes it difficult to work out why a model has indicated a particular mapping.

RAG, in the most basic form as some kind of vector comparison, provides improved control and explainability. You can see that the embeddings indicate “similarity” (whether this is true “semantic similarity” is an open question – but all the examples I have run show there is some form of “common sense” relevance in the rankings). So you can understand that one passage is cited because it has a high similarity. I find this helps reduce the variability and gives me more confidence in the results.

You can also get better focus from RAG approaches. If you can identify a subset of relevant passages first, it then becomes easier to ask the models to map the contents of those passages. The models are less likely to get distracted. This though comes at the cost of holistic consistency.

RAG would also allow you to use GPT4 rather than GPT4-Turbo, by reducing the context length. GPT4 is still a little better in my experience.

What might be behind the variability?

The variability in the mappings, and the features that are mapped, even in this relatively simple mechanic case, might hint at a deeper truth about patent work: maybe there is no “right” answer.

Don’t tell the engineers and scientists, but maybe law is a social technology, where what matters is: does someone else (e.g., a figure in authority) believe your arguments?

Of course, you need something that cannot be easily argued to be “incorrect”. But LLMs seem to be good enough that they don’t suggest wildly wrong or incorrect mappings. At worst, they believe something is not there and assert that confidently, whereas a human might say, “I’m not sure”.

But.

There may just be an inherent ambiguity in mapping a description of one thing to another thing. Especially if the words are different, the product is different, the person writing it is different, the time is different, the breakfast is different. There might be several different ways of mapping something, with different correspondences having differing strengths and weaknesses in differing areas. Why else would you need to pay clever people to argue for you?

I have seen this sometimes in trainees. If you come from a position of having completed a lot of past papers for the patent exams, but worked on few real-world cases, you are more likely to think there is a clearly “right” answer. The feature *is* disclosed, or the feature is *not* disclosed. Binary fact. Bosh.

However, do lots of real-world cases and you often think the exams are trying to trick you. “What, there is a clearly defined feature that is clearly different?” 80-90% of cases often have at least one feature that is borderline disclosed – it is there if you interpret all these things this way, but it isn’t there if you take this interpretation. Real-life is more like the UK P6 exam. You need to pick a side and commit to it, but have emergency plans B-H if plan A fails. Most of the time for a Rule 161 EPC communication, you recommend just arguing your side on the interpretation. The Examiner 90% of the time won’t budge, but that doesn’t say that what you say is wrong, or that a court or another jurisdiction will always agree with the Examiner.

This offers up the interesting possibility that LLMs might be better at patent exams than the exercise above…

Model Comparison

I was impressed by Claude 3 Opus. While I think GPT4-Turbo still has the edge, especially at half the price, Claude 3 Opus gave it a run for its money. There wasn’t a big difference in quality.

Claude 3 Opus also had some properties that stood out over GPT4-Turbo:

  • It seemed more reliable on repeated runs. There was less variability between runs.
  • It has nearly double the token context length. You could stick in all the prior art documents cited on a case.

Interestingly both Claude 3 and GPT4-Turbo tended to fall down in similar ways. They would both miss pertinent features, or sometimes get distracted in long prompts.

Based on these experiments, I’d definitely look at setting up my systems to modularly use LLMs, so I could evaluate both GPT4-Turbo and Claude 3.

Setting up billing and API access for Anthropic was also super easy, OpenAI-level. I have also tried to access Azure and Google models. They are horrendously and needlessly complicated. Life is too short.

Further Work

Vision

I didn’t look at the vision capabilities in this test. But both GPT4-Turbo and Claude 3 Opus offer vision capabilities (using a Vision Transformer to tokenise the image). One issue is that GPT4-Turbo doesn’t offer vision with long context – it’s still limited to a small context prompt (or it was last time I looked at the vision API). The vision API also has strong “alpha” vibes that I’d like to see settle down.

But because you are all cool, here’s a sneak peek of GPT4-Turbo working just with the claim 1 text and Figure 1:

Claim Feature | Figure 1 (D1) | Reference Numeral in D1 | Match (Yes/No) | Notes
Hair care appliance | Hair care appliance (likely a hair dryer) | – | Yes | The figure depicts a hair care appliance.
Body having an outer wall | Visible outer wall | 1 | Yes | The body of the appliance with an outer wall is clearly shown.
Duct extending within the outer wall | Duct present | 4 | Yes | There is a duct extending along the body within the outer wall.
Interior passage for receiving fluid flow | Space around the duct | – | Yes | There appears to be an interior passage for airflow around the duct.
Primary fluid outlet | Outlet for emitting fluid flow | 7, 9, 13 | Yes | The end of the appliance acts as the fluid outlet.
Outlet defined by the duct and an inner wall | Defined by duct and inner wall | 13, 14 | Yes | The primary fluid outlet seems to be defined by the duct and the inner wall.
At least one spacer between the inner wall and the duct | Presence of spacer(s) | ? | No | It is unclear if spacers are present as they are not clearly depicted or labeled.

Pretty good!

A very similar analysis to the text, just from the image.

It’s definitely worth looking at integrating vision and text models. But how to do so is not obvious, especially how to efficiently combine vision and long-context input (there are some engineering challenges to getting the figures from a PDF, involving finding the TIFFs or chopping pages into JPEGs, that are boring and fiddly but essential).

Agent Personalities

We used fairly simple prompts in our example.

But we also commented on how often the law was a social language game.

Does your analysis of a claim differ if you are an examiner versus an attorney? Or if you are a judge versus an inventor? Or a patent manager versus a CEO?

It’s an open question. My first thought is: “yes, of course it does”. Which suggests that there may be mileage in performing our analysis from different perspectives and then integrating the results. With LLMs this is often as easy as stating in the user or system prompt – “YOU ARE A PATENT EXAMINER” – this nudges the context in a particular direction. It would be interesting to see whether that makes a material difference to the mapping output.
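
As a rough sketch of how that might look in code (the model name, personas, and prompt wording below are illustrative assumptions, not the exact prompts used in these experiments):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical personas - the wording is illustrative only.
PERSONAS = {
    "examiner": "YOU ARE A PATENT EXAMINER. Assess novelty strictly against the prior art.",
    "attorney": "YOU ARE A PATENT ATTORNEY. Argue for the broadest reasonable claim scope.",
}

def map_claim(persona: str, claim: str, prior_art: str) -> str:
    """Run the same feature-mapping request under a different persona."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # swap in whichever model you are testing
        messages=[
            {"role": "system", "content": PERSONAS[persona]},
            {
                "role": "user",
                "content": f"Map the features of this claim onto the prior art.\n\n"
                           f"CLAIM:\n{claim}\n\nPRIOR ART:\n{prior_art}",
            },
        ],
    )
    return response.choices[0].message.content
```

You could then compare, or aggregate, the per-persona mappings to see whether the perspective materially changes the analysis.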

Whole File Wrapper Analysis

In our analysis with two prior art documents, we had 10,000 tokens. These were short prior art documents and we saw there was some degradation with the longer document. But we are still only using 5-10% of the available prompt context.

It is technically possible to stick in all the citations from the search report (Xs, Ys, As) and go “ANALYSE!”. Whether you’d get anything useful or trustworthy is still an open question based on the present experiments. You could also get the text from the EPO prosecution ZIP or from the US File Wrapper.

I’d imagine this is where the commercial providers will go first as it’s the easiest to implement. The work is mainly in the infrastructure of getting the PDFs, extracting the text from the PDFs, then feeding it into a prompt. A team of developers at a Document Management company could build this in a week or so (I can do it in that timespan and I’m a self-taught coder). It would cost though – on my calculations around £10-15 on the API per query, so 10x+ that on charges to customers. If your query is rubbish (which it often is for the first 10 or so attempts), you’ve spent £10-15 on nothing. This is less of a no-brainer than 15p.

Looking at the results here, and from reading accounts on the web, I’d say there is a large risk of confusion in a whole file wrapper prompt, or “all the prior arts”. What happens when you have a claim 1 at the start, then 10 other claim 1s?

Most long-context models are tested using a rather hacky “needle in a haystack” metric. This involves inserting some text (often incongruous, inserted at random; machine learning engineers and proper scientists or linguists weep now) and seeing whether the query spots it and reports accordingly. GPT4-Turbo and Claude 3 Opus seem to pass this test. But finding something is an easier task than reasoning over large text portions (finding just involves configuring the attention to locate it over the whole input space, which is easy-ish; “reasoning” requires attention computations over multiple separated portions).

So I predict you’ll see a lot of expensive “solutions” from those that already manage data, but these may be ineffective unless you are clever. They would maybe work for simple questions, like “where is a spacer between a duct and an inner wall possibly described?”, but it would be difficult to trust the output without checking, or to know what exactly the black box was doing. I still feel RAG offers the better solution from an explainability perspective. Maybe there is a way to leverage the strengths of both?

“Harder” Technology

Actually, my experience is that there is not a big drop-off in performance as the perceived (human) difficulty of the subject matter increases.

My experiments with hardcore A/V coding, cryptography, and gene editing all show a similar level of performance to the mechanical example above – not perfect, but also not completely wrong. This is surprising, because we are used to seeing human performance degrade as subject matter gets harder. But it turns out words are words: train yourself to spin magic in them, and one area of words is just as easy as another area of words.

What is a patent? Asking Again in the Age of Machine Learning

Large Language Models (LLMs) and other Machine Learning (ML) approaches have made huge progress over the last 5-6 years. They are leading to existential questioning in professions that pride themselves on a mastery of language. This includes the field of patent law.

When new technologies arrive, they also allow us a different perspective. Let’s look.

  1. What is a patent?
    1. Claims
      1. Why claims?
    2. Description and Figures
  2. How do computers “see” claims?
    1. A Very Brief History of Patent Information
    2. From Physical to Digital
    3. What does this all mean for my claims?
  3. Comparing claims using computers
    1. Traditional Patent Searching
    2. How Patent Attorneys Compare Claims
      1. Construe the claim
      2. Split the Claim into Features
        1. A Short Aside on Segmentation
        2. Things or Events as Features
      3. Match Features
        1. What does it mean for a feature to match?
      4. Look at the number of matched features
    3. Can we automate?
      1. Fuzzy matching
      2. word2vec
      3. Transformers
  4. Settlers in a Brave New World

What is a patent?

At its heart, a patent is a description of a thing or a process.

It is made up primarily of two portions:

  • claims – these define the scope of legal protection.
  • description and figures – these provide the detailed background that supports and explains the features of the claims.

Claims

These are a set of numbered paragraphs. Each claim is a single sentence. A claimset is typically arranged in a hierarchy:

  • independent claims
    • These claims stand alone and do not refer to other claims.
    • They represent the broadest scope of protection.
    • They come in different types representing different types of protection. These relate to different infringing acts.
  • dependent claims
    • These claims refer to one or more other claims.
    • They ultimately depend on one of the other independent claims.
    • They offer additional limitations that act as fallback positions – if an independent claim is found to lack novelty or an inventive step, a combination of that same independent claim and one or more dependent claims may be found to provide novelty and an inventive step.

Why claims?

An independent claim seeks to provide a specification of a thing or a process so that a legal authority can decide whether an act infringes upon the claim. This typically means that another thing or process is deemed to fall within the specification of the thing or process in the claim.

Patents arose from legal decrees on monopolies. They started to become a legal concept in the 15th and 16th centuries. At first, the legal authority was a monarch or guild. So you can think of them as an attempt 500-odd years ago to describe a thing or process for some form of human negotiation.

A key point is that claims are inherently linguistic. The specification of a thing or a process is provided in a written form, in whatever language is used by the patent jurisdiction in question. So we are using words to specify a thing or a process in a way that allows for comparison with other things or processes.

Normally we want the specification to be as broad as possible – to cover as many different things or processes as possible so as to maximise a monopoly. But there is a tension with the requirements that a claim be novel and inventive (non-obvious). There is a dialectic process (examination) that refines the language. I want a monopoly for “a thing” (“1. A thing.”) but there are pre-existing “things” that are a problem for novelty.

So claims are not only compared with other things and processes when determining infringement, they are also compared with things and processes that were somehow available to the public prior to the filing of a patent application containing the claims.

Description and Figures

In a patent application there is also a written description and normally one or more figures. These are “extras” that help understanding and building up a context for any comparison of the claims.

If we are examining claims for novelty and inventive step, we are often comparing them with the description and figures of existing patent publications. This is because claims are typically more abstract than the written description, and the written description contains a lot more information. We are using the principle that the specific anticipates the general.

Figures are traditionally line diagrams. They started as engineering drawings and have since extended to more abstract diagrams, like flowcharts for processes and system diagrams for complicated information technology equipment.

How do computers “see” claims?

A Very Brief History of Patent Information

If we want to help ourselves compare claims, either for infringement or examination, it would be good to automate some of the process. Computers are a good tool for this job.

Patent applications used to be handwritten (as were all documents). If copies were to be made, these would also be handwritten.

Later, they were printed using mechanical printing presses. The process for this used to be the arrangement of the letters and characters in a frame to form pages of text, which were then inked and pressed onto paper. Illustrations were typically originally hand-drawn, and then reproduced using etchings or lithography.

As typewriters became common in the 20th century, patent specifications were typed from handwritten versions or as a patent attorney dictated. When I started in the profession in 2005, there were still “secretaries” that typed up letters and patent specifications.

Computers came rather late to the patent profession. It was only in the 1990s they started entering into the office and it was only in the 21st century that word processors finally replaced physical type and paper.

We still refer to “patent publications” and there is a well-trodden legal process for publication. This was because it used to take a lot of work to publish a patent specification. This seems strange in an age when anyone can publish anything in seconds at the click of a button.

From Physical to Digital

Computers are actually closer to their analogue cousins than we normally realise.

At a basic level, a text document of a set of patent claims comprises a sequence of character encodings. Each character encoding is a sequence of bits (values of 0 and 1). A character is selected from a set that includes lower case letters, upper case letters, and numbers. Normally there is a big dictionary of numbers associated with each character. You can think of a character as anything that is either printed or controls the printing. In the past, characters would be printed by selecting a particular block with a carving or engraving of the two-dimensional visual pattern that represents the character. If you imagine a box of blocks, where each block is numbered, that’s pretty much how character encoding works in a computer.

For example, the patent claim – “1. A thing.” is 049 046 032 065 032 116 104 105 110 103 046 in a decimal representation of the ASCII encoding. This can then be converted into its binary equivalent, e.g. 00110001 00101110 00100000 01000001 00100000 01110100 01101000 01101001 01101110 01100111 00101110. In an actual character sequence, there is typically no delimiting character (“space” is still just a character), so what you have is 0011000100101110001000000100000100100000011101000110100001101001011011100110011100101110. What bits relate to which character is determined based on fixed-length partitioning.
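
A quick sanity check of the above in Python (a minimal sketch: ord() gives the decimal code point, and :08b formats it as a fixed-length 8-bit binary string):

```python
claim = "1. A thing."

decimal = [ord(c) for c in claim]
binary = "".join(f"{ord(c):08b}" for c in claim)

print(decimal)  # [49, 46, 32, 65, 32, 116, 104, 105, 110, 103, 46]
print(binary)   # 00110001001011100010000001000001...
```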

Another hangover from mechanical printing and typing is that many of the control and spacing characters are digital versions of old mechanical commands. For example, “carriage return” doesn’t really make sense inside a computer; a computer doesn’t have a carriage. However, a typewriter has a carriage that pings forwards and backwards. Similarly, the “tab” character is a shortcut for those on typewriters having to type tables. Any actual text thus contains not only the references to the letters used to form the words, but also the control characters that dictate the whitespace and file structure.

A sequence of character encodings is typically referred to as a “string” (from the mental image of beads on a string). This may be stored or transmitted. Word processors store character encodings in a more complex digital wrapper. Microsoft Word rather silently shifted a decade ago from a proprietary wrapper to a more open extended mark-up (XML) format (which is why you have all those different options for saving Office files). A modern Word file is actually a zip file of XML files.

Things get more confusing when we consider the digital replacement for physical prints – PDF files. PDF files are different beasts from word processing files. They are concerned with defining the layout of elements within a displayed document. While both word processing documents and PDF files store strings of text somewhere underneath the wrapping, the wrapping is quite different.

What does this all mean for my claims?

It means that much of the linguistic structure we perceive in a written patent claim exists in our heads rather than in the digital medium.

The digital medium just stores sequences of character encodings. A digital representation of a patent claim does not even contain a machine representation of “words”.

This still confuses many people. They assume that “words” and even sometimes the semantic meaning exist “somewhere” in the computer. They assume that the computer has a concept of “words” and so can compute with “words”. This was false…until a few years ago.

Comparing claims using computers

Traditional Patent Searching

Patent searching can be thought of as a way of comparing a patent claim with a body of prior publication documents. You can see the limitations of traditional computer representations of text when you consider patent searching.

Most digital patent searching, at least that developed prior to 2020ish, is based on key word matching. This works because it does not need the computer to understand language. All it consists of is character sequence matching.

For example, if you are looking for a “thing”, you type in “thing”. This gets converted into a binary sequence of bits. The computer then searches through portions of encoded text looking for a matching binary sequence of bits. It’s a simple seek exercise. It’s also slow and fragile – “entity” or “widget” can pretty much have the same meaning but will not be located.

Now there are some tricks to speed up keyword matching on large documents. You can do a simple form of tokenisation by splitting character sequences on whitespace characters (e.g., a defined list of character encodings that define spaces, full stops, or line returns). These represent words 80-90% of the time, but there are lots of issues (compare splitting on ” ” and “.” for “A thing.” and “This is 3.3 ml.”). The resulting character sequences following the split can then be counted. This is called “indexing” the text. It has the power of reducing the text to a “bag of words” – the “index”. It turns out that lots of words are repeatedly used (e.g., “it”, “the”, “a” etc.). The bag of words, represented as a set of unique character sequences, thus has far fewer entries than the complete text. You can also ignore words that don’t help (called “stopwords”, normally chosen to exclude “a”, “the”, “there” and other high-frequency words). The “index” can thus be searched much more quickly for character sequence matches. (This ignores most of the very clever real-world optimisations for keyword searching in large databases but is roughly still how things work, so stay with me.)
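
A toy version of that indexing step might look like this (a sketch only – the stopword list is illustrative and real search engines do far more):

```python
from collections import Counter

STOPWORDS = {"a", "an", "the", "it", "is", "there"}  # illustrative only

def build_index(text: str) -> Counter:
    """Naive whitespace tokenisation, light punctuation stripping, then counting."""
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return Counter(t for t in tokens if t and t not in STOPWORDS)

print(build_index("A thing. This is 3.3 ml."))
# Counter({'thing': 1, 'this': 1, '3.3': 1, 'ml': 1})
```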

Now, key word searching is only a rough proxy for a claim comparison.

If you try to search the complete character sequence of the claim against all the patents ever published, it is very likely you will not find a match. This is because the longer the sequence of characters, the more unique it is. You would only find a match in Borges’ library. The Google PageRank claim is around 600 characters. You would need to find a string with 600 characters arranged identically. And you would not match against semantically identical descriptions in prior publications that just used a different punctuation character encoding somewhere amongst those 600 characters (don’t get me started on hyphens).

Multiple term key word searching typically involves picking multiple key words from the claim we wish to compare and doing a big AND query, looking for all those words to have matches with a body of text. Even more complex approaches such as “within 5” typically just perform a normal character match then look for another match in a subset of character encodings either side of the character match.

How Patent Attorneys Compare Claims

Patent attorneys learn their skill through repeatedly working with patents over many years, typically decades. It’s a rather unique and niche skill. But often it’s one that cannot be easily explained or formalised.

That’s why it’s always a useful exercise to imagine explaining what you do to a lay person. Your gran or a five-year-old.

When I was first training as a patent attorney, coming from a science and engineering background, I did think there was a “right” way of comparing patent claims and that it was just a matter of learning this and applying it. For the law, you quickly realise that this isn’t how things work. Training typically consisted of working with a skilled qualified attorney and watching how they did things. And then seeking to rationalise those things into a general scheme. It’s much more of a dark art. After working with many different attorneys, you realise there is lots of stylistic variation. You realise the courts often have an intuitive feel for what is right, and this is used to guide a rationalised logic process within the bounds of previous attempts. The rationalised logic is what you end up with (the court report), while the intuitive feeling often hides in plain sight.

Anyway, claim comparison is typically split into the following process:

  • Construe the claim
  • Split the claim into features
  • Match features
  • Look at the number of matched features

If all the features match in a way that is agreed by everyone then the claim covers an infringement or the claim is anticipated by prior art.

Construe the claim

“Construing” a claim is shorthand for interpreting the terms within the claim. Typically, it concentrates on areas of the claim that may be unclear, or are open to different interpretations. For example, a claim could have a typo or error, or a term might have multiple meanings that cover different groups of things.

Construing the claim is typically performed early on as it allows multiple parties to have a consistent interpretation of the text. It is thus needed before any matching is performed. It is often presented as an exercise that is “independent” of the later stages of the comparison. However, in practice, construction is performed with an eye on the comparison – if the infringement or prior art revolves around whether a particular feature is present (e.g., does it have a “wheel”?) then the terms that describe that feature have greater weight when construing (e.g., what is a “wheel”?).

Claim construction is something that is hard to translate to an automated analysis. It involves having parties to a disagreement agree on a set of premises or context. It thus naturally involves mental models within the minds of multiple groups of people, people that have a vested interest in an interpretation one way or another.

Where there is disagreement, the description and figures are typically used as an information source for resolving error and ambiguity. For example, if the description and figures clearly state that “tracked propulsion” is “not a wheel”, then it would be hard for a party to argue that “wheel” covers “tracked propulsion”. Similarly, if the claim refers to a “winjet” and the description consistently describes a “widget”, then it seems clear “winjet” is a typo and what was meant was “widget”.

Claim construction can also be seen as making the implicit, explicit. Certain terms in a claim may be deemed to have a minimum number of properties or attributes. These may be based on the “common general knowledge” as represented by sources such as textbooks or dictionaries. These can be taken as a “baseline” that is then modified by any explicit specification in the claim or description and figures. Again, if the parties agree that both objects of comparison have an X, there is little reason to go into this level of detail. It is mainly terms about which the comparison turns that undergo this analysis. These terms are typically those where there is the greatest difference between the parties and the strongest arguments. One of the roles of the courts or the patent examination bodies is to shape the argument so that points of agreement can be quickly admitted, and the differing points number a reasonable amount. (If there are lots of differences, and many of these, on the face of it, are supported, it is difficult to bring a case or find agreement within the authority; if there are no differences that are contested, the case is typically easy to bring to summary judgement.)

When construing the claim, prior decisions of the courts can also be brought to bear. If a higher court rules that using the phrase “X” has a particular interpretation, this can be applied in the present case.

Split the Claim into Features

What are claim “features”?

Here we can go back to our original split between “things” and “processes”.

“Things” are deemed to have a static instantiation (whether physical or digital). Things are deemed to be composed of other things: systems have different system components, physical mechanical devices have different parts, and chemical compositions have different molecular and/or atomic constituents.

“Processes” are a set of events that unfold in time, typically sequentially. They are often methods that involve a set of steps. Each step may be seen as a different action and/or configured state of things and matter.

When we are looking at claim “features”, we are looking to segment the text of the claim into sub-portions that we can consider individually. Psychologically, we are looking to “chunk” the content of the claim. We chunk because our working memories are limited. When comparing we need to hold one chunk in the working memory, and compare it with one or more other chunks. Our brains can hold a sequence of about three or four “chunks” in working memory at any one time, or hold two items for comparison. We decompose the claim into features as a way to work out if a match exists – we can say the whole matches if each of the parts match.

Now, we only need to break a claim into features because it is complex. If the claim was “1. A bicycle.”, we could likely hold the whole claim in our working memories and compare it with other entities. In this case, we might need to use the previous step of claim construction to determine what the minimum properties of a “bicycle” were. (Two wheels? Is a tricycle, a motorcycle, or a unicycle a “bicycle”?). Here we see that the definition of claim features can be a recursive process, where the depth of recursion into both explicit and implicit features depends on the level of disagreement between parties (and likelihood of collective agreement between different parties within an authority, such as between a primary examiner and senior examiner). Recursion can also be used to “zoom in” on a particular feature comparison, while then concluding on a match at a “zoomed out” level of the feature (e.g., this does match a bicycle because X is a first wheel and Y is a second wheel).

A Short Aside on Segmentation

Claim feature extraction is a form of semantic segmentation.

Segmentation in images made a huge leap in 2023 with the launch of Meta’s Segment Anything model. In images, segmentation is often an act of determining a context-dependent pixel boundary in two-dimensions.

For the sequence of characters that form a patent claim, we have a one-dimensional problem. We need to determine the “feature” breakpoints in the sequence of characters.

It turns out patent attorneys provide clues as to this semantic segmentation via the use of whitespace. Patent attorneys will often add whitespace such that the claim is partitioned into pseudo-features by way of the two-dimensional layout.

In the example above we see that commas, semi-colons, and new lines break the patent claim into five natural “features”.
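
A minimal sketch of that segmentation, using the WIPO apparatus claim discussed later in this post (and assuming the drafter’s new lines have survived):

```python
import re

claim = (
    "An apparatus, comprising:\n"
    "a plurality of printed pages;\n"
    "a binding configured to hold the printed pages together;\n"
    "a cover attached to the binding,\n"
    "characterized in that, the cover is detachable from the binding."
)

# Split on the new lines the drafter used as pseudo-feature boundaries.
features = [part.strip() for part in re.split(r"\n+", claim) if part.strip()]
for number, feature in enumerate(features, start=1):
    print(number, feature)
```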

It turns out there are a number of problems with the reliability of automated segmentation based on whitespace:

  • The text is often transformed when it is loaded into different systems, meaning that original white space may be lost or omitted. Fairly often new lines are stripped out, or stripped out then manually replaced.
  • There are many different encodings of many different forms of whitespace – there are multiple versions of the new line character, for example.
  • Real-world patent claims often have a multi-tier nested structure that requires more advanced recursive segmentation.

Things or Events as Features

Those familiar with patent law will realise that when someone refers to “claim features”, they are normally referring to portions of text within the claim that are indicated as separate sections by the author’s use of whitespace. Claim charts are tables that often have 5-10 rows, where each row is a feature that is a different portion of the claim text determined in this way. Claim charts normally are structured to fill up one page of A4, so we can easily get an idea of the feature matches.

However, we can ask a deeper question – what are those different whitespace-separated portions of the claim actually representing?

Or put another way – what do we mean by semantic segmentation of the claim text?

Let’s have a look at the simple WIPO claim example above. Using new lines we can split that into the following features:

  • [a]n apparatus (, comprising:)
  • a plurality of printed pages;
  • a binding configured to hold the printed pages together;
  • a cover attached to the binding,
  • characterized in that, the cover is detachable from the binding.

Looking closely, we see that actually those text portions are centred on different things. The claim defines an “apparatus”, which forms the top line. This apparatus has a number of components: pages, a binding, and a cover. We see that the middle three segments are based around definitions of each of these components. The last section then defines a characteristic of the apparatus in terms of the cover and binding components.

So for a “thing” claim, we see that our semantic focus for segmentation is “sub-things”. “Things” are made of interconnected “sub-things” and this pattern repeats recursively. We can look at different “things” or “sub-things” in isolation from their connections to focus on their individual properties. Things at each level are defined by the interconnection and inter-configuration of sub-things at a lower level.

Now in English grammar, we have a term for “things”: nouns. Nouns and noun-phrases are the terms we use to classify the location of “things” in text. So when we semantically segment a claim, we are doing this based on the noun content of the claim.
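
If you wanted to locate those “things” automatically, one sketch is to pull out the noun chunks with spaCy (assuming spaCy and its small English model are installed; the exact chunks vary by model version):

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

claim_text = (
    "An apparatus comprising a plurality of printed pages, a binding configured "
    "to hold the printed pages together, and a cover attached to the binding."
)

for chunk in nlp(claim_text).noun_chunks:
    print(chunk.text)
# e.g. "An apparatus", "a plurality", "printed pages", "a binding", "a cover", ...
```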

Method claims are slightly different. We no longer have a subdivision by static structural “things”. Rather we have a partition by time, or more precisely different sequences of actions within time. Take another claim example from WIPO:

If you were to ask a patent attorney to split that claim into “features”, they would likely choose each step – i.e. each clause starting on a new line and ending with a semi-colon and new line:

  • [a] process for producing fried rice (, comprising the steps of:)
  • turning the heat source on;
  • cooking rice in water over the heat source for a predetermined period;
  • placing a predetermined amount of oil in a pan;
  • cooking other ingredients and seasoning in the pan over the heat source;
  • placing the cooked rice in the pan; (and)
  • stirring consistently the rice and the other ingredients for a predetermined length of time over the heat source.

These steps are different actions in time, where time runs sequentially across the steps.

Now we can see that method claims also share certain aspects of the “thing” claims. We have several “things” that are acted on in the method, including: “fried rice”, “heat source”, “rice”, “oil”, “pan”, “ingredients”, “seasoning”, “cooked rice”, and “length of time”. We can also see that some of those “things” are actually different states of the same thing – we start with “rice”, which then becomes “cooked rice”, which is output by the method as “fried rice”.

Even though a method consisting of: “turning”, “cooking”, “placing”, “cooking”, “placing”, and “stirring” would be a valid patent claim, it would likely lack novelty. For example, the quite different method of cooking a chicken dinner below would fall within that method:

  • turning a chicken breast in flour;
  • cooking a set of potatoes in water;
  • placing the chicken breast and cooked potatoes on baking trays;
  • cooking the chicken breast and potatoes in the oven;
  • placing the cooked chicken breast and potatoes on a plate; and
  • stirring gravy to pour over the plate.

So we see that it is the things that are involved in each step that define the (sub) features of the step.

Match Features

Once we have identified features in the claim the next step is comparing each of those features. For infringement, we are comparing with a possibly infringing thing or process. For examination, we are comparing with a prior publication.

Splitting a claim into features lessens the cognitive load of the comparison. It also allows agreement on uncontentious aspects, focusing effort on key points of disagreement. Much of the time, there is only really one feature that may or may not differ. Often one missing feature is all you need to avoid infringement and/or argue for an inventive step.

Now, you might say that matching is easy, just like spot the difference.

Going back to an image analogy, visual features may be segmented portions of a two-dimensional extent. In spot the difference we compare two images that are scaled to the same dimensions. We are then looking for some form of visual equivalence in the pixel patterns in different portions of the image.

Words are harder though. We are dealing with at least one level of abstraction from physical reality. We are looking for a socially agreed correspondence between two sets of words.

The facts of the case determine what features will be in contention and which may be more easily matched. Different features will be relevant for different comparisons. Inventive step considerations still involve a feature matching exercise, but they involve different feature matches in different portions of prior art.

What does it mean for a feature to match?

We have our claim feature, which is set out in a portion of the claim text (our segmented portion).

Our first challenge is to identify what we are comparing with the claim. These can sometimes be fuzzy-edged items that need to be human-defined. Sometimes they are harder-edged and more unanimously agreed upon as “things” to compare. For infringement, the comparison may be based on a written description of a defined product, or a documented procedure. For examination, it is often a prior-published patent application.

Our second challenge is to find something in the comparison item that is relevant to the particular feature. There may be multiple candidates for a match. At an early stage this might be a general component of a product or thing, or a particular component of a particular embodiment of a patent application as set out in one or more figures.

Once we have something to compare, and have identified a rough candidate correspondence, the detailed analysis of the match then depends on whether we are looking at infringement or examination for novelty.

For infringement, we have a “match” if the language of the claim feature can be said to describe a corresponding feature in the potentially infringing product or process. At this stage we can ignore the nuances of the infringement type (e.g., use vs sale), as this normally only follows if we have a clear infringing product or process. To be more precise, we have a “match” if a legal authority agrees that the language of the claim feature covers a corresponding feature in the potentially infringing product or process. So there is also a social aspect.

For the examination of novelty, we have a “match” if a portion of a written description can be said to describe all the aspects of the claim feature. As claim features are typically at a higher level of abstraction, this can also be thought of as: would an abstracted version of the written description produce a summary that is identical to the claim feature?

A match is not necessarily boolean; if there is a particular point of interpretation or ambiguity there may be numerous options to decide. A decision is made based on reason (or reasons), sometimes with an appeal to previous cases (case law) or analogy or even public policy. If you asked 100 people, you might get X deciding one way and 100-X deciding the other.

Look at the number of matched features

This is normally the easy part. If we have iterated through our “matching” for each identified claim feature, and the set of claim features exhaustively cover all of the claim text, then we simply total up the number of deemed “matches”.

If all the features match, we have a matching product or process for infringement, or our claim is anticipated by the prior art.

If any feature does not match, then we do not have infringement (ignoring for now legal “in-filling” possibilities) and our claim has novelty, with the non-matching features being the “novel” features of the claim.

Any non-matching features may then be subject to a further analysis on the grounds of inventive step. If the non-matching feature is clearly found in another document, and a skilled person would seek out that other document and combine teachings with no effort, then the non-matching feature is said to lack an inventive step.

Can we automate?

Given the above, we can ask the valid question: can we automate the process?

The answer used to be “no”. The best we could do was to compare strings, and we’ve seen above that any difference in the surface form of the string (including synonyms or differences in spelling or whitespace) would throw out an analysis of even single words.

Fuzzy matching

Before 2010, natural language processing (NLP) engineers tried tinkering with a number of approaches to match words. This normally fell within the area of “fuzzy matching”. An approach used since the 60s is calculating the Levenshtein distance, a measure of the minimum number of single-character edits that change one word into another. This could catch small typos but still needed a rough string match. It failed with synonyms and irregular verbs.
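
For reference, the Levenshtein distance can be computed with a small dynamic-programming routine (a sketch; libraries such as rapidfuzz do this much faster):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]

print(levenshtein("widget", "winjet"))  # 2 - catches the typo
print(levenshtein("widget", "entity"))  # 6 - no help with synonyms
```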

word2vec

In the early 2010s though, things began to change. Techniques such as word2vec were developed. These allowed researchers to replace a string version of a word with a list of floating point numbers. These numbers represented latent properties of the use of the string in a large corpus of documents. Or put another way, we could compare words using numbers.

Early word vector approaches offered the possibility of comparing words with the same meaning but different string patterns. In particular, we found that the vectors representing the words had some cool geometric properties – as we moved within the high-dimensional vector space we saw natural transitions of meaning. Words with similar meanings (i.e., that were used in similar ways in large corpora) had vectors that were nearby in vector space.

So words such as “server” and “computer” might be matched in a claim based on their word2vec vectors despite there being no string match. I remember playing with this using the gensim library.
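
A sketch of what that looks like with gensim’s downloader and the pre-trained Google News vectors (a large download, and the exact similarity scores will vary):

```python
# pip install gensim
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # ~1.6 GB of pre-trained vectors

print(vectors.similarity("server", "computer"))  # relatively high
print(vectors.similarity("server", "banana"))    # much lower
```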

Transformers

We didn’t know it at the time, but word vectors were the beginning not the end of NLP magic.

In the early days, word embeddings allowed us to create numerical representations of word tokens for input into more complex neural network architectures. At first you could generate embeddings for a dictionary of 100,000 words, giving you a matrix of vector-size x 100k and you could then select your inputs based on a classic whitespace tokenisation of the text.

Quickly, people realised that you didn’t actually need word2vec as a separate process; you could learn that matrix of embeddings as part of your neural architecture. Sequence-to-sequence models were built on top of recurrent neural network architectures. Then in 2017, Attention is All You Need came along, which turbo-charged the transformer revolution. Fairly quickly, in 2018, we arrived at BERT, which was the encoder side of AIAYN and was built into many NLP pipelines as a magical classifying workhorse, and GPT, the foundation model that became the infamous ChatGPT. In 2023, we saw the public release of GPT4, which took language models from an interesting toy for making you sound like a pirate to a possible production language computer. In 2024, we are still struggling to get anywhere near the abilities of GPT4.

With large language models like BERT and GPT, you get embeddings of any text “for free” – they are the first stage of the model. We can thus now embed longer strings of text and convert them into vector representations. These vectors can then be compared using mathematics. STEM – 1; humanities – 0 (just don’t take a close look at society).

Settlers in a Brave New World

The power of word embeddings and large language models now opens up whole new avenues of “legal word processing” that were previously unimaginable. We’ve touched on using retrieval augmented generation here and here. We can apply the same approaches to patent documents and claims.

We now have a form of computer that takes a text input and produces a text output. We don’t quite know how it works but it seems to pass the Turing Test, while reasoning in a slightly stunted and alien manner.

This then provides the opportunity to automate the process described above, to arrive at automated infringement and novelty opinions. At scale. While we sleep. For pennies.

I’m excited.

RAG for Legal Documents

I’ve been working for a while with Retrieval Augmented Generation (RAG) systems. This ranges from simple vectorisation of text to more complex libraries such as LangChain or LlamaIndex. This post explores some of the nuances of applying RAG to legal documents, with tips for powerful production systems.

I don’t particularly like the acronym “RAG”; I’m not keen on acronyms, jargon, or buzzwords. But I can’t control language, and I don’t have a better suggestion, so hey-ho.

Why RAG?

I am already seeing many people use large language models (LLMs) like ChatGPT to answer legal questions.

Normally the answers are wrong in some ways. Sometimes they are wrong in dangerous ways. However, the answers are not completely wrong; they are wrong in a “nearly there” or “halfway there” kind of way.

This matches what I have read in other fields and professions, such as medicine. Around 80% of an answer is often there. 15% of the content is definitely wrong. 5% is dangerously wrong.

Often the low hanging fruit is found. But the law is applied in the wrong way based on the probabilities of Internet use, or important laws are missed.

RAG systems offer a way to improve the initial answers of LLMs by getting them to work with sources of information. We explored how this could work in this blog post. The present post develops that work with some more advanced techniques.

Naive Chunking

The main problem with naive RAG implementations is that the chunking is independent of document structure. Early transformer architectures like BERT were limited to around 512 tokens, meaning documents were “chunked” into batches of around 400 words or less (a token is a word or word part). Normally, text was extracted naively – just dumping any string content from larger structures into a single string, then chunking that single string based on token counts. In practice this makes for poor retrieval, as semantically continuous sections are broken apart mid-section and with disregard for meaning.
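
For illustration, the naive approach being criticised here is roughly this (a sketch, counting whitespace “words” rather than real tokens):

```python
def naive_chunk(text: str, max_words: int = 400) -> list[str]:
    """Fixed-size chunking that ignores document structure entirely."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```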

Document as a Tree

Now, a more sophisticated view of a document is as a tree structure. A tree is a particular instance of a graph. The tree has parent and child nodes, separated by edges. There are well known methods for building and navigating trees.

Now most electronic documents are in a tree form already. The webpage you are reading this on is a tree in the form of a complex Document Object Model (DOM). XML and JSON data structures are often parsed into nested dictionaries that can be represented as trees. Word documents are stored as XML under the hood. PDFs – well…

However, most electronic documents are designed for machines to parse and extract data not semantic meaning. For example, often they are arranged like (and built from) complex databases. Only a subset of information is semantically useful to human beings, which is why we need to render things as pretty webpages that hide much of the complexity.

RAG is concerned with the semantic content of documents rather than the syntactic contents. In fact, sticking XML or JSON into a RAG system as text tends to throw it off, as matches concentrate in similarities with the encoded syntax rather than the encoded semantic content. RAG encoders just need the semantic content in a manner similar to a human viewer. Also, the raw electronic data structure, when represented as a string, is very verbose, so it costs a lot in time and resources to encode.

Legal Documents

Legal documents are often long and complex. That’s why people pay lawyers a lot of money. But lawyers are still human, and humans can’t make sense of large streams of dense text (at least without an entertaining narrative or decent characterisation). Hence, lawyers use a variety of tools to help humans parse the documents. These include “sections” (with an arbitrary level of nesting) and “tables of contents”. These are all forms of tree structure. Normally, we represent them linearly with indents to represent levels in the tree hierarchy.

But we can also visualise the same data structure as more of a tree-like shape:

As humans, we know how to use these tree structures to quickly arrive at the section we need, by iterating over the tree. Often we approximate something similar to a breadth-first search – we look at the first level of topics, determine if one of them is relevant to our query, then look at the next set of titles within the relevant topic, repeating until we arrive at a relevant portion of text.

The tree structure of legal documents is also used when citing sources. In legal argument, we cite either the absolute reference assigned to the leaf nodes (e.g., “Section 4, Patents Act 1977”) or the trail down the tree (“Guidelines for Examination, A-II-1.2.1”).

In a particular area of law, like patent law, we often have a small number of “documents” but a large number of text chunks. In the area of case law, we have a large number of documents, each with around 100-200 paragraphs, often split via headings or subheadings. In the area of patent law, we have a very large number of patent publications (250 million odd), each with between 20 and 500+ paragraphs, split roughly into a number of canonical headings, some with subsections and most semantically chunked by embodiment.

RAG on Trees

Based on the above we have an intuition that using the tree structure of legal documents can provide for more meaningful vector search and source citation. How might we begin to build a useful search system?

Now LlamaIndex does have a tree index but this seems separate from their vector index. We really want to combine the power of freeform vector search, with the guidance of the tree structure for relevance and easy citation.

1. Parsing the Document

This is often a challenge in itself. Each document type needs a bespoke solution. Let’s take the EPO Guidelines for Examination as an example. We have the option of a PDF download or a series of web-pages.

a) PDF

Off-the-shelf PDF parsers are poor at extracting the tree structure – you get text chunks that split up some semantic sections or merge other sections.

A text chunk from a PDF with a naive off-the-shelf PDF parser

b) Web pages

Most web versions of these materials are spread across a series of web pages, with one page per leaf node of the tree. However, because this is 2024, the actual HTML for each page is a mess of Javascript and other front-end rendering functions, leaving us scratching our heads as to where the data is actually located.

What you see above…with what you get below

2. Building the Tree

If we write our own scripts per document source to handle data ingress (e.g., based on a PDF file path or a web URL), our next challenge is to parse the initial data structure into a useful semantic tree.

This often requires simplifying and validating the data structure we start with, placing everything into a common tree model.

LlamaIndex does have the data abstractions of Documents and Nodes. This is a good start. Documents here would be the high-level data source (e.g., Guidelines for Examination, Case Law Book, Statute, Rules etc.) and the Nodes would represent the different levels of the tree. We would thus need to work out how the levels of the tree are represented in the initially parsed data structure and convert that into Nodes and their relationships.

With a tree, the relationships are the relatively straightforward parent-child relationships. The tree can be represented as a Directed Acyclic Graph (DAG – no relation), where the direction normally indicates parent to child (as per the graphs above). LlamaIndex has a relationships property built into its Node data model, so we could use that. Or we could build our own simple model, as sketched below.
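For illustration, a minimal home-rolled node model might look something like this (the `TreeNode` class and its fields are my own sketch, not LlamaIndex’s data model):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TreeNode:
    """A node in a semantic document tree (e.g., a section of the Guidelines)."""
    node_id: str                      # e.g., "A-II-1.2.1"
    title: str                        # heading text as presented
    text: str = ""                    # body text (leaves) or a summary (parents)
    parent: Optional["TreeNode"] = None
    children: list["TreeNode"] = field(default_factory=list)
    embedding: Optional[list[float]] = None

    def add_child(self, child: "TreeNode") -> None:
        """Attach a child node and record the parent link."""
        child.parent = self
        self.children.append(child)

    def citation(self) -> str:
        """Build a citation trail by walking up the parent links."""
        parts = []
        node: Optional["TreeNode"] = self
        while node is not None:
            parts.append(node.node_id)
            node = node.parent
        return "-".join(reversed(parts))
```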

To leverage the good work of the LlamaIndex team and speed up development, we might create the Documents and Nodes ourselves, but then use the inbuilt vectorisation of those nodes. We can then build custom retrieval pipelines. However, there is merit to building our own version, in that it is typically easier to see what is going on and we are not beholden to a fast-moving library.

My experience has been that plain old cosine similarity via the dot product performs almost identically to more complex methods, and that using OpenAI’s API or SentenceTransformer for the embeddings also gives similar performance. What makes a big difference to performance is the logic of retrieval and how the prompts are generated, typically in an iterative manner.
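For reference, the cosine similarity in question is just a normalised dot product – a quick numpy sketch:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_k(query_vec: np.ndarray, node_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k most similar node embeddings (rows of node_vecs)."""
    norms = np.linalg.norm(node_vecs, axis=1) * np.linalg.norm(query_vec)
    scores = node_vecs @ query_vec / norms
    return np.argsort(scores)[::-1][:k]
```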

In terms of the text data stored with each node in the tree, we can start with the text of the titles as presented.

3. Querying

If we have a tree data structure whose nodes represent the semantic sections of our legal document, we can begin to have some fun with query strategies.

Unfortunately, it’s all a bit “wild west” at the moment in terms of querying strategies and algorithms. There are a lot of academic papers with some useful ideas and custom solutions but none of these are ready for a production implementation or have a stable library you can drop in. They will come in time, but I’m impatient!

a) Naive Start

A naive starting point is to flatten the tree and to perform a vector search over all the nodes. This would give us results that mix different layers of the tree.

One problem with this is that titles are short and they often do not contain all the information that is needed to determine whether the nodes below are relevant. This leads to noisy results at best and just noise at worst.

b) Breadth-First Vector Search

The next natural step is to combine a normal breadth-first search with the vector search. This would involve filtering nodes by level, performing a vector search on the filtered nodes, and then picking a branch to explore based on those results.

This again would suffer from the problem of the un-informative title text as discussed above.
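Setting that caveat aside, a minimal sketch of the breadth-first vector search might look like this (it assumes the `TreeNode` model and `cosine_similarity` helper sketched earlier, plus some `embed()` function that turns a string into a vector):

```python
def breadth_first_vector_search(root, query_text, embed, top_n=3, max_depth=5):
    """Walk the tree level by level, keeping only the most similar nodes at each level."""
    query_vec = embed(query_text)
    frontier = [root]
    for _ in range(max_depth):
        children = [child for node in frontier for child in node.children]
        if not children:
            break  # we have reached the leaf level
        # In practice you would pre-compute and cache the node embeddings.
        scored = [
            (cosine_similarity(query_vec, embed(child.title + " " + child.text)), child)
            for child in children
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [child for _, child in scored[:top_n]]
    return frontier
```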

c) Breadth-First Vector Search with Summary Embedding

One suggestion to address uninformative parent node text is to build a summary for every parent node, based on the contents of the child nodes. This can be built iteratively from the leaf nodes up.

For example, we iterate over the leaf nodes to get the lowest-level set of parent nodes, then iterate over those parent nodes and use a cheap LLM (like GPT3.5-turbo or a local 7B model) to generate a text summary at each parent. We then repeat, starting with those parent nodes, until we have processed the whole tree. The human-readable text summary can be encoded with the title to get the embedding and/or read by a human being.

A variation that uses the same algorithm generates an embedding for a parent node directly from the concatenated text content of the child nodes. This can then be searched in the breadth-first manner above and should be more relevant and accurate.

The summary encoding discussed here would only need to be run once after the tree has been built.
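A rough sketch of that bottom-up summarisation pass – done here recursively rather than level-by-level, to the same effect – assuming a hypothetical `summarise()` helper that wraps a call to a cheap LLM:

```python
def build_summaries(node, summarise, max_chars=8000):
    """Recursively build parent summaries from the leaves upwards (post-order traversal)."""
    if not node.children:
        return node.text  # leaf nodes keep their own text
    child_texts = [build_summaries(child, summarise, max_chars) for child in node.children]
    combined = "\n\n".join(child_texts)[:max_chars]  # crude guard on context length
    # summarise() is assumed to take a prompt string and return a short text summary.
    node.text = summarise(f"Summarise the following section '{node.title}':\n\n{combined}")
    return node.text
```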

d) Breadth-First Vector Search with Query-Summary Embedding

A further possibility to increase the accuracy of the method above is to create custom summaries for each query. These custom summaries may be used together with, or instead of, the generic summaries above. This is based on the intuition that, in a large section, only some parts may be relevant, so a generic summary might not capture all the ways the section is relevant.

One problem with this approach is that it would be more expensive. However, as cheaper or local models tend to be good at summarising, and embeddings cost peanuts, the cost may not be prohibitive (e.g., 1-50p per query via an API, or free but slower if we use local models).

This approach would implement a Heath-Robinson (Rube-Goldberg for our US cousins) version of attention, but in a manner that may be more “explainable”. A user is provided with feedback in the form of the custom summary, which shows which portions were deemed relevant.

The custom summary may also be used in the actual generation prompt. Normally, RAG systems dump the text of the most relevant nodes into a prompt together with a question to get an answer. Placing a custom summary of a section before specific text from that section might improve question-answering by setting up the context for the probabilistic generation.

e) Chain-of-thought or Iterative Querying

Another approach that could be used modularly with the above approaches is an iterative refinement of the nodes.

For example, if we perform a breadth-first search and select a particular section based on a generic and/or custom summary, we could combine the child node text (with a view to LLM context limits) and apply some kind of scoring based on whether the LLM thought the combined text was relevant. My experience is that LLMs aren’t very good at consistent scoring, especially over successive requests, but the higher-powered models are not too bad at ranking sections in order of relevance or explaining why a piece of text may or may not be relevant. This could then be built into a scoring system for selecting relevant portions of text or lower nodes. However, it does get expensive quickly.
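As a rough illustration of the ranking step, using the OpenAI chat API (the prompt wording and the `rank_sections` helper are my own sketch, not a tested recipe):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def rank_sections(query: str, sections: dict[str, str], model: str = "gpt-4") -> str:
    """Ask the LLM to rank candidate sections by relevance to the query, with reasons."""
    listing = "\n\n".join(f"[{ref}]\n{text[:1500]}" for ref, text in sections.items())
    prompt = (
        f"Question: {query}\n\n"
        f"Candidate sections:\n{listing}\n\n"
        "Rank the section references from most to least relevant to the question, "
        "one reference per line, with a one-sentence reason for each."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```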

4. Linking Between Trees

With the example of the EPO Guidelines above, we also see the possibility of linking between trees. Some of these links we might be able to parse from the extracted data. For example, we might be able to extract the links to associated articles and rules as the hyperlinks are within the <div> with the text that forms the link node.

We would need to store some metadata with each node representing a parsed form of the article or rule (e.g., maybe a flattened list of the string portions of a particular article or rule node and an integer version of the main part and/or subsequent part).

In PDFs and other unstructured text, we can maybe set up a “simple” regex rule to detect citations of articles and rules:
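Something along these lines, say – the exact pattern would need tuning per source:

```python
import re

# Matches e.g. "Article 123(2) EPC", "Art. 54 EPC", "Rule 43(7) EPC" - illustrative only.
CITATION_PATTERN = re.compile(
    r"\b(?P<kind>Articles?|Art\.|Rules?|R\.)\s*"
    r"(?P<number>\d+[a-z]?)"
    r"(?P<subsections>(?:\(\d+\))*)"
    r"(?:\s*(?P<law>EPC|PCT))?",
    re.IGNORECASE,
)


def extract_citations(text: str) -> list[dict[str, str]]:
    """Return a list of detected article/rule citations with their component parts."""
    return [match.groupdict() for match in CITATION_PATTERN.finditer(text)]
```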

Another possibility is to also extract cases cited by each node, storing the case as string, number, and hyperlink.

The Manual of Patent Practice has links to sections, cases, and paragraphs of cases within the text. These could be parsed by regex, HTML parsing, and/or LLM to generate metadata on links.

Vector Search with Legal Filtering

Once extracted, links between portions of statute, cases, and guidance could be used in the retrieval logic prior to prompt generation.

For example, a flattened search over all nodes in a set of guidance could be used to extract the linked articles across the top N results. This could then be combined with a further flattened search over the articles themselves. The articles could then be scored based on both searches. Top-scoring articles could then be used to refine vector search results and/or be included in a question-answering prompt.
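A sketch of that combined scoring, assuming a hypothetical `flat_search()` helper that returns sorted (node, score) pairs and that each guidance node carries a `linked_articles` list in its metadata:

```python
from collections import defaultdict


def score_articles(query_vec, guidance_nodes, article_nodes, flat_search, top_n=10,
                   guidance_weight=0.5, article_weight=0.5):
    """Combine scores from a guidance search and a direct article search."""
    scores = defaultdict(float)
    # 1. Search the guidance and credit the articles linked from the top results.
    for node, score in flat_search(query_vec, guidance_nodes)[:top_n]:
        for article_ref in node.metadata.get("linked_articles", []):
            scores[article_ref] += guidance_weight * score
    # 2. Search the articles directly and add those scores in.
    for node, score in flat_search(query_vec, article_nodes)[:top_n]:
        scores[node.node_id] += article_weight * score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```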

Improving Access to the Law

In my experience with tests, the strength of vector search is its ability to search brainlessly, without any pre-thought. The user does not need to work out which keywords to type; they can just dump a text description of their problem, poor spelling and all. This has huge power for increasing access to the law.

However, vector search is noisy, has weird quirks and biases, and out-of-the-box RAG is not “safe” enough to provide any coherent legal advice. But improvement seems to be within the realm of possibility. With some work.

I tried a few examples on D1 EQE exam questions. GPT4, the best-performing model, provides a half-correct answer (a score of 30-40% or 0%, depending on how harsh the Examiner was feeling).

Here is an example of asking GPT4 what articles of the EPC might be relevant to answer a D1 question:

Example without RAG

Unfortunately, GPT4 missed the main bit of the answer, which involved Article 123(2) EPC (added subject matter), as this required some next-level working from the initial facts.

When using a naive RAG implementation, I got an improved answer that did highlight the applicability of Article 123(2) EPC. The top 10 retrieved portions of the Guidelines and Case Law were vaguely relevant, but they needed more work on filtering. The poor PDF and web-page parsing did not help.

Improved initial working with RAG

This offers the possibility that, with a bit of tweaking and building, we could lift our 0-40% GPT4 result to a scraped pass of 51%. If GPT4 can pass the EQE, what does that mean for legal knowledge?

(In my mind possibly good things for industry and the public, possibly painful change for the patent profession.)

Future Work

The systems are not there yet. The out-of-the-box-in-a-hackathon-afternoon solutions are not good enough as of February 2024. But progress seems possible. We haven’t hit a technology “wall” yet with integrating the power of LLMs.

The changes will likely be as big as the Internet.

This is how I accessed the law when I started in the patent profession:

This is how we currently access the EPO Guidelines:

UK patent case law:

And UK patent legislation:

Even an implementation of the out-of-the-box approaches would be a big improvement over this.

But we haven’t yet really touched the possibilities of augmented law, when we start building systems that “reason” and process over large bodies of text. Exciting times!

Talking Legislation – Asking the Patents Act

We all are told that Large Language Models (LLMs) such as ChatGPT are prone to “hallucinations”. But did you know we can build systems that actively help to reduce or avoid this behaviour?

In this post, we’ll be looking at building a proof-of-concept legal Retrieval-Augmented Generation (RAG) system. In simple terms, it’s an LLM generative system that cites sources for its answers. We’ll look at applying it to some UK patent legislation.

(Caveat: I have again used GPT-4 to help with speeding up this blog post. The rubbish bits are its input.)

Scroll down to the bottom if you want to skip the implementation details and just look at the results.

If you just want to have a look at the code, you can find that here: https://github.com/Simibrum/talking_legislation

Introduction

The complex and often convoluted nature of legislation and legal texts makes them a challenging read for both laypeople and professionals alike. With the release of highly capable LLMs like GPT-4, more people have been using them to answer legal queries in a conversational manner. But there is a great risk attached – even capable LLMs are not immune to ‘hallucinations’ – spurious or inaccurate information.

What if we could build a system that not only converses with us but also cites its sources?

Enter Retrieval-Augmented Generation (RAG), a state-of-the-art technology that combines the best of both worlds: the text-generating capabilities of LLMs and the credibility of cited sources.

Challenges

Getting the Legislation

The first hurdle is obtaining the legislation in a format that’s both accurate and machine-readable.

Originally, the official version of a particular piece of legislation was the version physically printed by a particular authority (such as the Queen’s or King’s Printer). In the last 20 years, the law has mostly moved onto PDF versions of this printed legislation. While originally digital scans, most modern pieces of legislation are available as digitally generated PDFs.

PDF documents have problems though.

  • They are a nightmare to machine-read.
  • Older scanned legislation needs to be converted into text using Optical Character Recognition (OCR). This is slow and introduces errors.
  • Even if we have digital representations of the text within a PDF, these representations are structured for display rather than information extraction. This makes it exceedingly difficult to extract structured information that is properly ordered and labelled.

Building the RAG Architecture

Implementing a RAG system is no small feat; it involves complex machine learning models, a well-designed architecture, and considerable computational resources.

Building a Web Interface

The user experience is crucial. A web interface has to be intuitive while being capable of handling the often lengthy generation times that come with running complex models.

Solutions

Using XML from an Online Source

In the UK, we have the great resource: www.legislation.gov.uk.

Many lawyers use this to view up-to-date legislation. What many don’t know, though, is that it has a hidden XML data layer that provides all the information rendered within the website. This is a perfect machine-readable source.

Custom XML Parser

Even though we have a good source of machine-readable information, it doesn’t mean we have the information in a useful format for our RAG system.

Most current RAG systems expect “documents” to be provided as chunks of text (“strings” – very 1984). For legislation, the text of each section makes a good “document”. The problem is that the XML does not provide a clean portion of text as you see it on-screen:

Rather, the text is split up across different XML tags with useful accompanying metadata:

To convert the XML into a useful Python data structure, we need to build a custom XML parser. This turns the retrieved XML into text objects along with their metadata, making it easier to reference and cite the legislative sources. As with any markup processing, the excellent Beautiful Soup library is our friend. The final solution requires some recursive parsing of the structure. This always makes my head hurt and requires several attempts to get it working.
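The shape of that recursive parse, heavily simplified (the toy tag names below are illustrative – the real legislation.gov.uk schema is much more deeply nested):

```python
from bs4 import BeautifulSoup  # the "xml" parser below also needs lxml installed


def parse_element(element) -> dict:
    """Recursively convert an XML element into a nested dict of text and child elements."""
    direct_text = " ".join(
        s.strip() for s in element.find_all(string=True, recursive=False) if s.strip()
    )
    return {
        "tag": element.name,
        "text": direct_text,
        "children": [parse_element(child) for child in element.find_all(recursive=False)],
    }


# Toy example - the real XML carries a lot more metadata per element.
xml = "<Part><Section><Number>1</Number><Text>Patentable inventions.</Text></Section></Part>"
tree = parse_element(BeautifulSoup(xml, "xml").find("Part"))
```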

Langchain for Embeddings and RAG Architecture

This mini project provided a great excuse to check out the Langchain library in Python. I’d seen many use this on Twitter to quickly spin up proof-of-concept solutions around LLMs.

At first I was skeptical. The power of langchain is that it does a lot with a few lines of code, but this also means you are putting yourself in the hands of the coding gods (or community). Sometimes the abstractions are counter-productive and dangerous. However, in this case I wanted to get something up and running quickly for evaluation, so I was happy to take on the risks.

This is pretty bleeding edge in technology terms. I found a couple of excellent blog posts detailing how you can build a RAG system with langchain. Both are only from late August 2023!

The general outline of the system is as follows:

  • Configure a local data store as a cache for your generated embeddings.
  • Configure the model you want to use to generate the embeddings.
    • OpenAI embeddings are good if you have the API set up and are okay with the few pence it costs to generate them. The benefit of OpenAI embeddings is that you don’t need a GPU to run the embedding model (and so you can deploy into the cloud).
    • HuggingFace embeddings that implement the sentence-transformer model are a free alternative that work just as well and are very quick on a GPU machine. They are a bit slow though for a CPU deployment.
  • Configure an LLM that you want to use to answer a supplied query. I used the OpenAI Chat model with GPT3.5 for this project.
  • Configure a vector store based on the embedding model and a set of “documents”. This also provides built-in similarity functions.
  • And finally, configure a Retrieval Question-and-Answer model with the initialised LLM and the vector store.

You then simply provide the Retrieval Question-and-Answer model with a query string, wait a few seconds, then receive an answer from the LLM with a set of “documents” as sources.
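Pulling the outline above together, a minimal sketch with a late-2023 version of langchain (the import paths have since moved around; the `sections` dict here is a stand-in for the output of the XML parser):

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain.vectorstores import FAISS  # needs the faiss-cpu package

# Stand-in for the parsed Patents Act sections (section reference -> text).
sections = {
    "s.7": "Right to apply for and obtain a patent ...",
    "s.60": "Meaning of infringement ...",
}

# 1. Local data store acting as a cache for the generated embeddings.
store = LocalFileStore("./embedding_cache/")

# 2. Embedding model, wrapped in the cache so each section is only embedded once.
embeddings = CacheBackedEmbeddings.from_bytes_store(
    OpenAIEmbeddings(), store, namespace="patents-act-1977"
)

# 3. LLM used to answer the supplied query.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# 4. Vector store built from the "documents", with similarity search built in.
documents = [
    Document(page_content=text, metadata={"section": ref}) for ref, text in sections.items()
]
vector_store = FAISS.from_documents(documents, embeddings)

# 5. Retrieval question-and-answer chain that also returns its source documents.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
)

result = qa_chain({"query": "What counts as infringement of a patent?"})
print(result["result"])
print([doc.metadata["section"] for doc in result["source_documents"]])
```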

Web Interface

Now you can run the RAG system as a purely command-line application. But that’s a bit boring.

Instead, I now like to build web apps for my user interfaces. This means you can easily launch them later on the Internet and also take advantage of a whole range of open-source web technologies.

Many Python projects start with Flask to power a web interface. However, Flask is not great for asynchronous websites with lots of user interaction. LLM-based systems have the added problem of processing times measured in seconds, thanks to remote API calls (e.g., to OpenAI) and/or computationally intensive neural-network forward passes.

If you need a responsive website that can cope with long asynchronous calls, the best framework for me these days is React on the frontend and FastAPI on the backend. I hadn’t used React for a while so the project was a good excuse to refresh my skills. Being more of a backend person, I found having GPT-4 on call was very helpful. (But even the best “AI” struggles with the complexity of Javascript frontends!)

I also like to use Bootstrap as a base for styling. It enables you to create great-looking user interface components with little effort.

Docker

If you have a frontend and a backend (and possibly a task queue), you need to enter the realm of Docker and Docker Compose. This helps with managing what is in effect a whole network of interacting computers. It also means you can deploy easily.

WebSockets

I asked ChatGPT for some suggestions on how to manage slow backend processes:

I’d built systems with both async functionality and task queues, so thought I might experiment with WebSockets for this proof-of-concept. As ChatGPT says:

Or a case of building a TCP-like system on top of HTTP to overcome the downsides of HTTP’s statelessness! (I’m still scared by CORBA – useful: never.)

Anyway, the WebSockets implementation was a pretty simple setup. The React front-end App sets up a WebSocket connection when the user enters a query:

And this is received by an asynchronous backend endpoint within the FastAPI implementation:
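A simplified sketch of that endpoint – not the exact code, and `answer_query()` here is a placeholder for the retrieval chain above:

```python
from fastapi import FastAPI, WebSocket
from fastapi.concurrency import run_in_threadpool

app = FastAPI()


def answer_query(query: str) -> dict:
    """Placeholder for the retrieval question-and-answer chain built earlier."""
    return {"result": f"(answer to: {query})", "source_documents": []}


@app.websocket("/ws/query")
async def query_websocket(websocket: WebSocket):
    """Accept a WebSocket connection, receive a query, and send back the answer with sources."""
    await websocket.accept()
    query = await websocket.receive_text()
    # Run the slow, blocking RAG call in a thread pool so the event loop stays responsive.
    result = await run_in_threadpool(answer_query, query)
    await websocket.send_json(
        {
            "answer": result["result"],
            "sources": [doc.metadata for doc in result["source_documents"]],
        }
    )
    await websocket.close()
```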

Results and Observations

Here are some examples of running queries against the proof-of-concept system. I think it works really well – especially as I’m only running the “less able” GPT3.5 model. However, there are a few failure cases and these are interesting to review.

Infringement

Here’s a question on infringement. The vector search selects the right section of the legislation and GPT3.5 does a fair job of summarising the long detail.

We can compare this with a vanilla query to GPT3.5-turbo:

And to a vanilla query using GPT4:

Inventors

Here’s an example question regarding the inventors:

Again, the vector search finds us the right section and GPT-3.5 summarises it well. You’ll see GPT3.5 also integrates pertinent details from several relevant sections. You can also click through on the cited section and be taken to the actual legislation.

Here’s vanilla GPT3.5:

Failure Case – Crown Use

Here’s an interesting failure case – we ask a question about Crown Use. Here, the vector search is biased towards returning a shorter section (122) relating to the sale of forfeited items. We find that section 55, which relates to Crown Use, does not even feature in the top 4 returned sections (but would possibly be number 5, given that section 56 is the fourth entry).

Interestingly, this is a case where vanilla GPT3.5 actually performs better:

WebSocket Example

If you are interested in the dynamics of the WebSockets (I know all you lawyers are), here’s the console log as we create a websocket connection and fire off a query:

And here’s the backend log:

Future Work

There are a few avenues for future improvement:

  • Experiment with the more expensive GPT4 model for question answering.
  • Extend the number of returned sources.
  • Add an intermediate review stage (possibly using the cheaper GPT3.5).
  • Add some “agent-like” behaviour – e.g. before returning an answer, use an LLM to consider whether the question is well-formed or requires further information/content from the user.
  • Add the Patent Rules in tandem.
  • Use a conventional LLM query in parallel to steer output review (e.g., an ensemble approach would maybe resolve the “Crown Use” issue above).
  • Add an HTML parser and implement on the European Patent Convention (EPC).

Summary

In summary, then:

Positives

  • It seems to work really well!
  • The proof-of-concept uses the “lesser” GPT3.5-turbo model but often has good results.
  • The cited sources add a layer of trust and verifiability.
  • Vector search is not perfect but is much, much better than conventional keyword search (I’m glad it’s *finally* becoming a thing).
  • It’s cool being able to build systems like this for yourself – you get a glimpse of the future before it arrives. I’ve worked with information retrieval systems for decades and LLMs have definitely unlocked a whole cornucopia of useful solutions.

Negatives

  • Despite citing sources, LLMs can still misinterpret them.
  • The number of returned sources is a parameter that can significantly influence the system’s output.
  • Current vector search algorithms tend to focus more on (fuzzy) keyword matching rather than the utility of the returned information, leaving room for further refinement.

Given I could create a capable system in a couple of days, I’m sure we’ll see this approach everywhere within a year or so. Just think what you could do with a team of engineers and developers!

(If anyone is interested in building out a system, please feel free to get in touch via LinkedIn, Twitter, or GitHub using the links above.)

Building a Claim-Figure-Description Dataset

When working with neural network architectures, we need good datasets for training. The problem is that good datasets are rare. In this post I sketch out some ideas for building a dataset of smaller, linked portions of a patent specification. This dataset could be useful for training natural language processing models.

What are we doing?

We want to build some neural network models that draft patent specification text automatically.

In the field of natural language processing, neural network architectures have shown limited success in creating captions for images (kicked off by this paper) and text generation for dialogue (see here). The question is: can we get similar architectures to work on real-world data sources, such as the huge database of patent publications?

How do you draft a patent specification?

As a patent attorney, I often draft patent specifications as follows:

  1. Review invention disclosure.
  2. Draft independent patent claims.
  3. Draft dependent patent claims.
  4. Draft patent figures.
  5. Draft patent technical field and background.
  6. Draft patent detailed description.
  7. Draft abstract.

The invention disclosure may be supplied as a short text document, an academic paper, or a proposed standards specification. The main job of a patent attorney is to convert this into a set of patent claims that have broad coverage and are difficult to work around. The coverage may be limited by pre-existing published documents. These may be previous patent applications (e.g. filed by a company or its competitors), cited academic papers or published technical specifications.

Where is the data?

As many have commented, when working with neural networks we often need to frame our problem as mapping X to Y, where the neural network learns the mapping when presented with many examples. In the patent world, what can we use as our Xs and Ys?

  • If you work in a large company you may have access to internal reports and invention disclosures. However, these are rarely made public.
  • To obtain a patent, you need to publish the patent specification. This means we have multiple databases of millions of documents. This is a good source of training data.
  • Standards submissions and academic papers are also published. The problem is there is no structured dataset that explicitly links documents to patent specifications. The best we can do is a fuzzy match using inventor details and subject matter. However, this would likely be noisy and require cleaning by hand.
  • US provisional applications are occasionally made up of a “rough and ready” pre-filing document. These may be available as priority documents on later-filed patent applications. The problem here is that a human being would need to inspect each candidate case individually.

Claim > Figure > Description

At present, the research models and datasets involve small amounts of text data. The COCO image database has one-sentence annotations for a range of pictures. Dialogue systems often use tweet- or text-message-length text segments (i.e. 140-280 characters). A patent specification, in comparison, is monstrous (around 20-100 pages). Similarly, there may be 3 to 30 patent figures. Claims are better – these tend to be around 150 words (but can run to pages).

To experiment with a self-drafting system, it would be nice to have a dataset with examples as follows:

  • Independent claim: one independent claim of one predefined category (e.g. system or method) with a word limit.
  • Figure: one figure that shows mainly the features of the independent claim.
  • Description: a handful of paragraphs (e.g. 1-5) that describe the Figure.

We could then play around with architectures to perform the following mappings:

  • Independent claim > Figure (i.e. task 4 above).
  • Independent claim + Figure > Description (i.e. task 6 above).

One problem is this dataset does not naturally exist.

Another problem is that ideally we would like at least 10,000 examples. If you spent an hour collating each example, and did this for three hours a day, it would take you nearly a decade. (You may or may not also be world class in example collation.)

The long way

Because of the problems above it looks like we will need to automate the building of this dataset ourselves. How can we do this?

If I was to do this manually, I would:

  • Get a list of patent applications in a field I know (e.g. G06).
  • Choose a category – maybe start with apparatus/system.
  • Get the PDF of the patent application.
  • Look at the claims – extract an independent claim of the chosen category. Paste this into a spreadsheet.
  • Look at the Figures. Find the Figure that illustrates most of the claim features. Save this in a directory with a sensible name (e.g. linked to the claim).
  • Look at the detailed description. Copy and paste the passages that mention the Figure (e.g. all those paragraphs that describe the features in Figure X). This is often a continuous range.

The shorter way

There may be a way we can cheat a little. However, this might only work for granted European patents.

One bug-bear enjoyable part of being a European patent attorney is adding reference numerals to the claims to comply with Rule 43(7) EPC. Now where else can you find reference numerals? Why, in the Figures and in the description. Huzzah! A correlation.

So a rough plan for an algorithm would be as follows:

  1. Get a list of granted EP patents (this could comprise a search output).
  2. Define a claim category (e.g. based on a string pattern – [“apparatus”, “system”]).
  3. For each patent in the list:
    1. Fetch the claims using the EPO OPS “Fulltext Retrieval” API.
    2. Process the claims to locate the lowest number independent claim of the defined claim category (my PatentData Python library has some tools to do this).
    3. If a match is found:
      1. Save the claim.
      2. Extract reference numerals from the claim (this could be achieved by looking for text in parenthesis or using a “NUM” part of speech from spaCy).
      3. Fetch the description text using the EPO OPS “Fulltext Retrieval” API.
      4. Extract paragraphs from the description that contain the extracted reference numerals (likely with some threshold – e.g. consecutive paragraphs with greater than 2 or 3 inclusions).
      5. Save the paragraphs and the claim, together with an identifier (e.g. the published patent number).
      6. Determine a candidate Figure number from the extracted paragraphs (e.g. by looking for a pattern like “FIG\S* \d”).
      7. Fetch that Figure using the EPO OPS “Drawings” or images retrieval API.
        • Now we can’t retrieve specific Figures, only specific sheets of drawings, and only in ~50% of cases will these match.
        • We can either:
          • Retrieve all the Figures and then OCR these looking for a match with the Figure number and/or the reference numbers.
          • Start with a sheet equal to the Figure number, OCR, then if there is no match, iterate up and down the Figures until a match is found.
          • See if we can retrieve a mosaic featuring all the Figures, OCR that and look for the sheet number preceding a Figure or reference numeral match.
      8. Save the Figure as something loadable (TIFF format is standard) with a name equal to the previous identifier.

The output from running this would be a triple similar to this: (claim_text, paragraph_list, figure_file_path).
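A rough sketch of the reference-numeral extraction and paragraph-matching steps (the patterns and threshold are illustrative):

```python
import re

REF_NUMERAL = re.compile(r"\((\d{1,4}[a-z]?)\)")  # e.g. "(102)", "(10a)" in a claim


def extract_reference_numerals(claim_text: str) -> set[str]:
    """Pull candidate reference numerals out of the bracketed parts of a claim."""
    return set(REF_NUMERAL.findall(claim_text))


def matching_paragraphs(paragraphs: list[str], numerals: set[str], threshold: int = 2) -> list[str]:
    """Keep description paragraphs that mention at least `threshold` of the claim's numerals."""
    selected = []
    for paragraph in paragraphs:
        found = {num for num in numerals if re.search(rf"\b{re.escape(num)}\b", paragraph)}
        if len(found) >= threshold:
            selected.append(paragraph)
    return selected
```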

We might want some way to clean any results – or at least view them easily so that a “gold standard” dataset can be built. This would lend itself to a Mechanical Turk exercise.

We could break down the text data further – the claim text into clauses or “features” (e.g. based on semi-colon placement) and the paragraphs into clauses or sentences.

The image data is black and white, so we could resize and resave each TIFF file as a binary matrix of a common size. We could also use any OCR data from the file.

What do we need to do?

We need to code up a script to run the algorithm above. If we are downloading large chunks of text and images we need to be careful of exceeding the EPO’s terms of use limits. We may need to code up some throttling and download monitoring. We might also want to carefully cache our requests, so that we don’t download the same data twice.

Initially we could start with a smaller dataset of say 10 or 100 examples. Get that working. Then scale out to many more.

If the EPO OPS is too slow or our downloads are too large, we could use (i.e. buy access to) a bulk data collection. We might want to design our algorithm so that the processing may be performed independently of how the data is obtained.

Another Option

Another option is that front-page images of patent publications are often available. The Figure published with the abstract is often the one that the patent examiner or patent drafter thinks best illustrates the invention. We could try to match this with an independent claim. The figure image supplied, though, is smaller. This may be a backup option if our main plan fails.

Wrapping Up

So. We now have a plan for building a dataset of claim text, description text and patent drawings. If the text data is broken down into clauses or sentences, this would not be a million miles away from the COCO dataset, but for patents. This would be a great resource for experimenting with self-drafting systems.

Quick Post – Machine Readable Patents Act

I’ve finally found out how to access UK legislation in XML format – http://www.legislation.gov.uk/developer/uris – you just add /data.xml to the end of the statute URI!

E.g. – https://www.legislation.gov.uk/ukpga/1977/37/data.xml .

If anyone wants to play with the legislation you can use the requests and Beautiful Soup libraries in Python to parse the XML. If you want a bit more power you can use lxml.
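A minimal example of fetching and inspecting the XML (the CLML tag name `P1group` used in the sanity check is from memory, so worth verifying against the actual response):

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.legislation.gov.uk/ukpga/1977/37/data.xml"  # Patents Act 1977

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.content, "xml")  # the "xml" parser requires lxml

# Quick sanity check: count the section-level elements and peek at some text.
sections = soup.find_all("P1group")
print(f"Found {len(sections)} P1group elements")
if sections:
    print(sections[0].get_text(" ", strip=True)[:300])
```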

Patent Search as a Deep Learning Problem

This article will look into how the process of obtaining a patent could be automated using deep learning approaches. A possible pipeline for processing a patent application will be discussed. It will be shown how current state-of-the-art natural language processing techniques could be applied.

Brief Overview of Patent Prosecution

First, let’s briefly look at how a patent is obtained. A patent application is filed. The patent application includes a detailed description of the invention, a set of figures, and a set of patent claims. The patent claims define the proposed legal scope of protection. A patent application is searched and examined by a patent office. Relevant documents are located and cited against the patent application. If an applicant can show that their claimed invention is different from each citation, and that any differences are also not obvious over the group of citations, then they can obtain a granted patent. Often, patent claims will be amended by adding extra features to clearly show a difference over the citations.

Patent Data

For a deep learning practitioner the first question is always: what data do I have? If you are lucky enough to have labelled datasets then you can look at applying supervised learning approaches.

It turns out that the large public database of patent publications is such a dataset. All patent applications need to be published to proceed to grant. This will be seen as a serendipitous gift for future generations.

Search Process

In particular, a patent search report can be thought of as the following processes:

A patent search locates a set of citations based on the language of a particular claim.

Each located citation is labelled as being in one of three categories:

– X: relevant to the novelty of the patent claim.
– Y: relevant to the inventive step of the patent claim. (This typically means the citation is relevant in combination with another Y citation.)
– A: relevant to the background of the patent claim. (These documents are typically not cited in an examination report.)

In reality, these two processes often occur together. For our ends, we may wish to add a further category: N – not cited.

Problem Definition

Thinking as a data scientist, we have the following data records:

(Claim text, citation detailed description text, search classification)

This data may be retrieved (for free) from public patent databases. This may need some intelligent data wrangling. The first process may be subsumed into the second process by adding the “not cited” category. If we move to a slightly more mathematical notation, we have as data:

(c, d, s)

Where c and d are based on a (long) string of text and s is a label with 4 possible values. We then want to construct a model for:

P(s | c, d)

I.e. a probability model for the search classifications given the claim text and citation detailed description. If we have this we can do many cool things. For example, for a set c, we can iterate over a set of d and select the documents with the highest X and Y probabilities.

Representations for c and d

Machine learning algorithms operate on real-valued tensors (n×m-dimensional arrays). More than that, the framework for many discriminative models maps data in the form of a large tensor X to a set of labels in the form of a tensor Y. For example, each row in X and Y may relate to a different data sample. The question then becomes: how do we map (c, d, s) to (X, Y)?

Mapping s to Y is relatively easy. Each row of Y may be an integer value corresponding to one of the four labels (e.g. 0 to 3). In some cases, each row may need to represent the integer label as a “one hot” encoding, e.g. a value of [2] becomes [0, 0, 1, 0].

Mapping c and d to X is harder. There are two sub-problems: 1) how do we combine c and d? and 2) how do we represent each of c and d as sets of real numbers?

There is an emerging consensus on sub-problem 2). A great explanation may be found in Matthew Honnibal’s post Embed, Encode, Attend, Predict. Briefly summarised, we embed words from the text using a word embedding (e.g. based on Word2Vec or GloVe). This outputs a sequence of real-valued vectors, one per word (e.g. vectors of length ~300). We then encode this sequence of vectors into a document matrix, e.g. where each row of the matrix represents a sentence encoding. One common way to do this is to apply a bidirectional recurrent neural network (RNN – such as an LSTM or GRU), where the outputs of a forward and a backward network are concatenated. An attention mechanism is then applied to reduce the matrix to a vector. The vector then represents the document.

A simple way to address sub-problem 1) is to simply concatenate c and d (in a similar manner to the forward and backward passes of the RNN). A more advanced approach might use c as an input to the attention mechanism for the generation of the document vector for d.
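A minimal PyTorch sketch of the simple concatenation approach – layer sizes are arbitrary placeholders and, for brevity, it uses a learned embedding rather than pre-trained Word2Vec/GloVe vectors:

```python
import torch
import torch.nn as nn


class CitationClassifier(nn.Module):
    """Embed -> encode (BiLSTM) -> attend -> concatenate claim and citation -> predict X/Y/A/N."""

    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 128, num_labels: int = 4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attention = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(4 * hidden_dim, num_labels)  # claim vector + citation vector

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        """Reduce a batch of token id sequences to one attended document vector each."""
        embedded = self.embedding(token_ids)                      # (batch, seq, embed_dim)
        outputs, _ = self.encoder(embedded)                       # (batch, seq, 2 * hidden_dim)
        weights = torch.softmax(self.attention(outputs), dim=1)   # (batch, seq, 1)
        return (weights * outputs).sum(dim=1)                     # (batch, 2 * hidden_dim)

    def forward(self, claim_ids: torch.Tensor, citation_ids: torch.Tensor) -> torch.Tensor:
        claim_vec = self.encode(claim_ids)
        citation_vec = self.encode(citation_ids)
        combined = torch.cat([claim_vec, citation_vec], dim=-1)
        return self.classifier(combined)                          # logits over {X, Y, A, N}
```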

Obtain the Data

To get our initial data records – (Claim text, citation detailed description text, search classification) – we have several options. For a list of patent publications, we can obtain details of citation numbers and search classifications using the European Patent Office’s Open Patent Services RESTful API. We can also obtain a claim 1 for each publication. We can then use the citation numbers to look up the detailed descriptions, either using another call to the OPS API or using the USPTO bulk downloads.

I haven’t looked in detail at the USPTO examination datasets, but the information may be available there as well. I know that the citations are listed in the XML for a US grant (but without the search classifications). Most International (PCT / WO) publications include the search report, so at a push you could OCR and regex the search report text to extract a (claim number, citation number, search category) tuple.

Training

Once you have a dataset consisting of X and Y from c, d, s, the process then just becomes designing, training and evaluating different deep learning architectures. You can start with a simple feed forward network and work up in complexity.

I cannot guarantee your results will be great or useful, but hey if you don’t try you will never know!

What are you waiting for?