Built with R 4.2.1


This example looks at a set of patent applications based on an initial seed set.

Data Collection

Setup

Before starting, we’ll need to load the package, and point to a directory where we’d like things saved:

# install if needed: remotes::install_guthub("uva-bi-sdad/uspto")
library(uspto)
outDir <- "eda/seed_expand/"

We’ll start by identifying a small set of applications with a meteorology Cooperative Patent Classification (CPC) class, and that mention “global warming” or “climate change” in their abstracts:

seed_set <- uspto_search(
  'g01w.cpcl. AND ("global warming".ab. "climate change".ab.)',
  "US-PGPUB",
  outFile = paste0(outDir, "seed_set.csv")
)

This gives us 9 applications from between 2012-05-17 and 2022-06-09.

Examination Expansions

For some context around the examination of these applications, we’ll want to collect all of the other applications examined by the examiners of the initial set. Examiners are not included in applications, so we’ll need to identify them by the office actions associated with the applications.

Seed Office Actions

First, we need to know which examiners examined the seed set of applications:

office_actions <- download_office_actions(
  paste0("patentApplicationNumber:(", paste(
    sub("/", "", seed_set$applicationNumber, fixed = TRUE),
    collapse = " "
  ), ")"),
  paste0(outDir, "seed_set_office_actions.json.xz"),
  verbose = FALSE
)

From these 14 office actions, we can get a set of examiners:

examiners_ids <- unique(vapply(office_actions, function(oa) trimws(oa$examinerEmployeeNumber[[1]]), ""))

And now we have 6 examiners who examined at least one application in our seed set.

Examiner-Based Expansion

Next, we want to collect all of the other applications our identified examiners examined. Again, since examiners are not directly associated with applications, we’ll have to first collect all of their office actions:

all_office_actions <- download_office_actions(
  paste0("examinerEmployeeNumber:(", paste(examiners_ids, collapse = " "), ")"),
  paste0(outDir, "all_office_actions.json.xz"),
  verbose = FALSE
)
all_applications <- unique(vapply(all_office_actions, function(oa) oa$patentApplicationNumber[[1]], ""))

This gives us 3201 total applications.

Collect Application Data

There are at least three sources of information that might be of interest when considering the examination of a patent application: One we have in office actions, another similar to those are the full prosecution histories, and finally, the text of the applications themselves.

Prosecution Histories

Prosecution histories track when happens to an application between the submitter and the patent office. These are recorded in the Patent Examination Data System (PEDS):

library(jsonlite)
examinations_file <- paste0(outDir, "examination_records.json.xz")
if (file.exists(examinations_file)) {
  examination_records <- read_json(examinations_file)
} else {
  # have to break up the calls based on some limit on query length theoretically
  n <- length(all_applications)
  filters <- lapply(
    split(sub("/", "", all_applications, fixed = TRUE), sort(rep_len(seq_len(ceiling(n / 750)), n))),
    function(set) paste0("applId:(", paste(set, collapse = " "), ")")
  )
  examination_records <- unlist(lapply(filters, function(f) {
    download_peds(filters = list(f), verbose = FALSE)
  }), FALSE, FALSE)
  con <- xzfile(examinations_file)
  write_json(examination_records, con, auto_unbox = TRUE)
  close(con)
}

Application Text

Finally, we can collect the actual content of each application, based on the document IDs included in the examination records:

document_ids <- unique(unlist(lapply(examination_records, function(r) {
  r$patentCaseMetadata$patentPublicationIdentification$publicationNumber
})))
oriDir <- "eda/original/"
if (!dir.exists(oriDir)) oriDir <- "../eda/original/"
applications <- uspto_download(document_ids, outDir = paste0(oriDir, "applications"))

Analysis

We can start with a simple look at the text to get a feel for what sets our seed set apart:

# identify the applications that were part of the initial seed set
applications$seed <- applications$applicationNumber %in% seed_set$applicationNumber

# count up terms used within abstracts
library(lingmatch)
dtm <- lma_dtm(applications$abstractHtml, "function", dc.min = 2)

# identify the terms that most identify the seed set
term_seed_sim <- lma_simets(t(dtm), applications$seed, metric = "cosine")
term_seed_sim <- sort(term_seed_sim, TRUE)[1:45]
library(splot)
splot(
  term_seed_sim ~ names(term_seed_sim),
  type = "bar", sort = FALSE, title = FALSE, labx = FALSE, laby = "Similarity with Seed"
)

Examination Timing

Of more interest might be how long aspects of examination takes, and what features of the application might be associated with that time.

We can extract timing from the examination records:

examination_timing <- do.call(rbind, lapply(examination_records, function(r) {
  times <- extract_event_timing(r$prosecutionHistoryDataBag$prosecutionHistoryData)
  data.frame(
    Classification_Time = if (!is.null(times[[1]])) times[[1]]$days[[1]] else NA,
    Assignment_Time = if (!is.null(times[[2]])) times[[2]]$days[[1]] else NA,
    First_Action_Time = if (!is.null(times[[3]])) times[[3]]$days[[1]] else NA,
    Average_Action_Time = if (!is.null(times[[3]])) mean(times[[3]]$days) else NA,
    Number_of_Actions = if (!is.null(times[[3]])) nrow(times[[3]]) else 0,
    row.names = paste0(
      substring(r$patentCaseMetadata$applicationNumberText$value, 1, 2),
      "/",
      substring(r$patentCaseMetadata$applicationNumberText$value, 3)
    )
  )
}))
examination_timing <- examination_timing[applications$applicationNumber, ]

# look at only applications with extracted times
has_timings <- which(!is.na(examination_timing$First_Action_Time))
examination_timing <- examination_timing[has_timings, ]

Then look at the relationship between timings and actions:

cors <- lma_simets(t(examination_timing), metric = "pearson")
colnames(cors) <- seq_len(ncol(cors))
rownames(cors) <- paste0("(", colnames(cors), ") ", rownames(cors))
round(cors, 3)
#> 5 x 5 sparse Matrix of class "dtCMatrix" (unitriangular)
#>                             1      2     3     4 5
#> (1) Classification_Time I      .     .     .     .
#> (2) Assignment_Time     0.187  I     .     .     .
#> (3) First_Action_Time   0.151  0.161 I     .     .
#> (4) Average_Action_Time 0.138  0.156 0.855 I     .
#> (5) Number_of_Actions   0.141 -0.001 0.156 0.044 I

Or associate timing and number of actions with terms used in the application’s abstract:

# align applications with valid timings
term_time_sim <- lma_simets(t(dtm[has_timings, ]), t(examination_timing), metric = "pearson")
for (aspect in colnames(term_time_sim)) {
  terms <- term_time_sim[, aspect]
  terms <- sort(terms, TRUE)[1:45]
  splot(
    terms ~ names(terms),
    title = paste0("Terms Most Associated With ", gsub("_", " ", aspect, fixed = TRUE)),
    type = "bar", sort = FALSE, laby = "Pearson's r", labx = FALSE
  )
}

Art Unit Topics

We can also get a feel for what seems to characterize Group Art Units by extracting topics from the applications they review.

We can start by getting art unit information for each application:

artunits <- unlist(lapply(examination_records, function(r) {
  res <- r$patentCaseMetadata$groupArtUnitNumber$value
  names(res) <- sub("(\\d{2})", "\\1/", r$patentCaseMetadata$applicationNumberText$value, perl = TRUE)
  res
}))[applications$applicationNumber]

# number of applications assigned to each art unit
table(artunits)
#> artunits
#> 1734 2852 2857 2863 2864 2865 3623 
#>  533  808  500   24  844  274  133

Then, we could get the same sort of associated set of terms, as with the seed set:

gaus <- unique(artunits)
term_artunit <- lma_simets(
  t(dtm), vapply(artunits, "==", logical(length(gaus)), gaus),
  metric = "pearson"
)
colnames(term_artunit) <- gaus

# look at top 10 terms for each
get_top_terms <- function(loadings, n = 10) {
  kable(as.data.frame(vapply(colnames(loadings), function(col) {
    rownames(loadings)[order(loadings[, col], decreasing = TRUE)[seq_len(n)]]
  }, character(n))))
}
get_top_terms(term_artunit)
2865 2857 3623 2852 1734 2863 2864
line system business image gas sun data
corners malfunction user member stream tentative motion
reconstructs transformed users forming catalyst tolerances values
reserves set sales toner dioxide closure sensors
protrusions reference actions developing carbon golden based
imparted runtime customer developer containing radiations associated
constrained boosted service direction oxidation sub-assembly determining
leakage build-up task formed hydrogen aims signals
differential centrifugal score portion exhaust varies signal
prompting incipient scheduling apparatus removal uncorrected monitoring

# we could use this to roughly predict art unit in-sample:
predicted_art_unit <- gaus[max.col(dtm %*% term_artunit)]
round(table(artunits, predicted_art_unit) / as.numeric(table(artunits)), 4)
#>         predicted_art_unit
#> artunits   1734   2852   2857   2863   2864   2865   3623
#>     1734 0.9962 0.0019 0.0000 0.0000 0.0000 0.0000 0.0019
#>     2852 0.0062 0.6349 0.0594 0.0000 0.2438 0.0223 0.0334
#>     2857 0.0060 0.0320 0.5680 0.0060 0.3540 0.0140 0.0200
#>     2863 0.0000 0.0417 0.0417 0.7917 0.1250 0.0000 0.0000
#>     2864 0.0213 0.0284 0.0249 0.0000 0.8993 0.0095 0.0166
#>     2865 0.0073 0.0255 0.0584 0.0000 0.3212 0.5657 0.0219
#>     3623 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000

We could also extract general topics from this set of applications, then see how much they are represented in each art unit’s assigned applications:

# weigh terms
wdtm <- lma_weight(dtm, "tf-idf", FALSE)

# get 10 simple topics
## remotes::install_github("miserman/lusilab")
library(lusilab)
loadings <- taffyInf(wdtm, 10)
colnames(loadings) <- paste0("taffy_", seq_len(ncol(loadings)))
scores <- dtm %*% loadings
get_top_terms(loadings)
taffy_1 taffy_2 taffy_3 taffy_4 taffy_5 taffy_6 taffy_7 taffy_8 taffy_9 taffy_10
b image data gas signal battery information methods member method
slidably forming collected stream frequency pack exercise systems developer-carrying determining
announcement toner acquisition streams output charge mental disclosed insulates includes
sequential latent associated exhaust synchronous capacity wrist described bearing identifying
k forms structured acid delays voltage similarity load-switching developer using
withdrawn carrier interpret flue esr blocks advice pc accommodate comprises
orientation electrostatic collector lean digital lifespan credit utilize overlapped drilling-performance
counter-current recording collection p.sub.i rails discharging acquires answers image steps
s-selective transfer battery-powered p.sub.o unmodulated estimating schedule include developing calculating
physical member gas-volume permeate modulated state action problems photosensitive obtaining
# compare with kmeans
topics_km <- kmeans(wdtm, 10)
loadings_km <- t(topics_km$centers)
dimnames(loadings_km) <- list(colnames(dtm), paste0("km_", seq_len(ncol(loadings))))
scores_km <- dtm %*% loadings_km
get_top_terms(loadings_km)
km_1 km_2 km_3 km_4 km_5 km_6 km_7 km_8 km_9 km_10
brand mi seismic defect battery member torque gas data b
merchant mrow data exercise charge image steering stream signal orientation
vibration msub wavefields sales voltage developer renewable dioxide sensor absorption
version math receiver insolation capacity toner mi carbon system gas
personality mn wavefield lead state developing unit co.sub device liquid
code mstyle subsurface r.sub current portion relay catalyst plurality means
positioning mfrac synthetic failure-type soc unit vortex solution value co.sub
detailed g.sub volume information energy forming software fuel power core
features mspace domain model estimating transfer location-unknown exhaust unit chamber
options g.sub.n model types secondary belt longitude acid method exhaust
# compare with NMF
library(RcppML)
topics_nmf <- nmf(wdtm, 10, verbose = FALSE)
loadings_nmf <- t(topics_nmf$h)
dimnames(loadings_nmf) <- list(colnames(dtm), paste0("nmf_", seq_len(ncol(loadings_nmf))))
colnames(topics_nmf$w) <- colnames(loadings_nmf)
get_top_terms(loadings_nmf)
nmf_1 nmf_2 nmf_3 nmf_4 nmf_5 nmf_6 nmf_7 nmf_8 nmf_9 nmf_10
data signal gas unit image battery developer b mi mi
seismic sensor stream temperature member voltage member orientation mrow mrow
model circuit co.sub value toner charge developing core mn msub
plurality signals dioxide flow transfer capacity electrode means msub math
set power carbon pressure belt state supply physical math s
system output solution measurement forming current container absorption g.sub mfrac
information frequency absorbent relay portion soc developer-carrying processing mstyle coating
user module absorption sensor roller charging insulating s g.sub.n c
method device acid rate fixing power carrier processed sub-goal e
based current liquid medium photosensitive pack configured liquid mspace substrate
# see how similar topics are in terms of loadings
loadings_all <- cbind(loadings, loadings_km, loadings_nmf)
loading_cors <- lma_simets(t(loadings_all), metric = "pearson")
topic_names <- rownames(loading_cors)
scores_all <- dtm %*% loadings_all

display_matrix <- function(l, threshold = .1) {
  v <- as.data.frame(as.matrix(l))
  v[abs(v) < threshold] <- 0
  kable(v, digits = 3, format.args = list(zero.print = "."))
}

display_matrix(loading_cors[grep("km", topic_names), grep("taffy", topic_names)])
taffy_1 taffy_2 taffy_3 taffy_4 taffy_5 taffy_6 taffy_7 taffy_8 taffy_9 taffy_10
km_1 . . . . . . . . . .
km_2 . . . . . . . . . .
km_3 . . 0.256 . . . . . . 0.171
km_4 . . . . . . 0.255 . . .
km_5 . . . . . 0.789 . . . .
km_6 . 0.668 -0.162 . . . . -0.18 0.677 -0.261
km_7 . . . . . . . . . .
km_8 0.157 . -0.203 0.700 -0.123 . -0.141 . . .
km_9 . . 0.313 -0.156 0.247 . 0.140 . -0.131 0.160
km_10 0.655 . . . . . . . . .
display_matrix(loading_cors[grep("nmf", topic_names), grep("taffy", topic_names)])
taffy_1 taffy_2 taffy_3 taffy_4 taffy_5 taffy_6 taffy_7 taffy_8 taffy_9 taffy_10
nmf_1 . -0.144 0.577 -0.189 . . 0.188 0.132 -0.197 0.291
nmf_2 . . . -0.151 0.658 . . . . .
nmf_3 0.135 . -0.201 0.698 -0.127 . -0.140 . . .
nmf_4 . 0.191 . . . . . -0.107 . .
nmf_5 . 0.702 -0.159 . . . . -0.155 0.596 -0.230
nmf_6 . . . . . 0.798 . . . .
nmf_7 . 0.214 -0.105 . . . . . 0.510 -0.149
nmf_8 0.644 . . . . . . . . .
nmf_9 . . . . . . . . . .
nmf_10 . . . . . . . . . .
display_matrix(loading_cors[grep("nmf", topic_names), grep("km", topic_names)])
km_1 km_2 km_3 km_4 km_5 km_6 km_7 km_8 km_9 km_10
nmf_1 0.218 . 0.54 0.267 0.129 . . . 0.740 .
nmf_2 . . . . 0.164 . 0.115 . 0.631 .
nmf_3 . . . . . . . 0.968 0.110 .
nmf_4 . . . . 0.140 0.283 0.380 0.136 0.428 .
nmf_5 . . . . . 0.884 . . 0.126 .
nmf_6 . . . . 0.976 . . . 0.199 .
nmf_7 . . . . . 0.625 . . . .
nmf_8 . . . . . . . . 0.104 0.997
nmf_9 . 0.960 . . . . 0.259 . . .
nmf_10 . 0.965 . . . . 0.273 . . .

Now we can look at trends in these topics over time within art unit:

applications$year <- as.numeric(applications$datePublYear)
trends <- vapply(gaus, function(au) {
  su <- artunits == au
  lma_simets(t(scores_all[artunits == au, ]), applications[su, "year"], metric = "pearson")
}, numeric(ncol(scores_all)))
display_matrix(t(trends), .1)
taffy_1 taffy_2 taffy_3 taffy_4 taffy_5 taffy_6 taffy_7 taffy_8 taffy_9 taffy_10 km_1 km_2 km_3 km_4 km_5 km_6 km_7 km_8 km_9 km_10 nmf_1 nmf_2 nmf_3 nmf_4 nmf_5 nmf_6 nmf_7 nmf_8 nmf_9 nmf_10
2865 . . . -0.135 0.161 0.120 . . -0.153 . . . 0.107 0.105 0.146 . . . 0.184 0.113 0.113 0.175 . . . 0.135 . 0.111 . .
2857 . . 0.111 . . . . 0.149 -0.102 . . . 0.182 . . . . . . . 0.144 -0.105 . . . . . . . .
3623 0.112 -0.265 . . . -0.109 -0.154 0.159 -0.217 0.113 . 0.106 0.126 . . . 0.107 0.194 0.150 0.116 0.106 0.133 0.164 . . . . 0.116 0.102 .
2852 . -0.507 0.475 . 0.279 0.290 0.277 0.420 -0.425 0.494 0.164 0.136 0.256 0.232 0.252 -0.479 . . 0.342 . 0.454 0.243 . -0.167 -0.492 0.229 -0.248 . 0.108 .
1734 0.116 . . . . -0.103 . . . . . . . . . . . . 0.112 0.119 . 0.106 . . . . . 0.119 . .
2863 . -0.287 . -0.150 . 0.109 . 0.213 -0.207 0.271 0.143 -0.363 -0.264 -0.120 . -0.289 -0.236 -0.270 -0.124 -0.124 . . -0.239 -0.271 -0.254 . -0.250 -0.113 -0.336 -0.267
2864 . -0.154 0.221 . . . . 0.199 -0.196 . . . 0.246 0.103 . . . . 0.158 . 0.249 . . . . . . . . .
focal_unit <- names(which.max(colMeans(abs(trends))))
focal_topics <- names(which(apply(abs(trends), 1, max) > .4))
get_top_terms(loadings_all[, focal_topics])
taffy_2 taffy_3 taffy_8 taffy_9 taffy_10 km_6 nmf_1 nmf_5
image data methods member method member data image
forming collected systems developer-carrying determining image seismic member
toner acquisition disclosed insulates includes developer model toner
latent associated described bearing identifying toner plurality transfer
forms structured load-switching developer using developing set belt
carrier interpret pc accommodate comprises portion system forming
electrostatic collector utilize overlapped drilling-performance unit information portion
recording collection answers image steps forming user roller
transfer battery-powered include developing calculating transfer method fixing
member gas-volume problems photosensitive obtaining belt based photosensitive
splot(
  scores_all[, focal_topics] ~ applications$year,
  su = artunits == focal_unit, mv.scale = TRUE, sud = FALSE, myl = c(-2, 9),
  title = paste("Topic Over Time in Art Unit", focal_unit),
  lines = "spline", laby = "Topic Use (z-scored)", labx = FALSE
)