example-seed_expand.Rmd
Built with R 4.2.1
This example looks at a set of patent applications expanded from an initial seed set.
Before starting, we’ll need to load the package, and point to a directory where we’d like things saved:
# install if needed: remotes::install_github("uva-bi-sdad/uspto")
library(uspto)
outDir <- "eda/seed_expand/"
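If that output directory doesn't already exist, it may need to be created before the download functions can save files into it:
dir.create(outDir, recursive = TRUE, showWarnings = FALSE)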
We’ll start by identifying a small set of applications that have a meteorology Cooperative Patent Classification (CPC) class (G01W) and mention “global warming” or “climate change” in their abstracts:
seed_set <- uspto_search(
'g01w.cpcl. AND ("global warming".ab. "climate change".ab.)',
"US-PGPUB",
outFile = paste0(outDir, "seed_set.csv")
)
This gives us 9 applications dated between 2012-05-17 and 2022-06-09.
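As a quick sanity check (assuming, as in the code below, that the results come back as a data frame with an applicationNumber column), we can confirm what the search returned:
# number of applications in the seed set, and a peek at their IDs
nrow(seed_set)
head(seed_set$applicationNumber)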
For some context around the examination of these applications, we’ll want to collect all of the other applications examined by the same examiners. Examiners are not listed in the application records themselves, so we’ll need to identify them through the office actions associated with each application.
First, we need to know which examiners examined the seed set of applications:
office_actions <- download_office_actions(
paste0("patentApplicationNumber:(", paste(
sub("/", "", seed_set$applicationNumber, fixed = TRUE),
collapse = " "
), ")"),
paste0(outDir, "seed_set_office_actions.json.xz"),
verbose = FALSE
)
From these 14 office actions, we can get a set of examiners:
examiners_ids <- unique(vapply(office_actions, function(oa) trimws(oa$examinerEmployeeNumber[[1]]), ""))
And now we have 6 examiners who examined at least one application in our seed set.
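If we want to see how those 14 office actions are split among the 6 examiners, we can reuse the same accessor to tabulate them (just an optional check):
# number of seed office actions per examiner
table(vapply(office_actions, function(oa) trimws(oa$examinerEmployeeNumber[[1]]), ""))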
Next, we want to collect all of the other applications our identified examiners examined. Again, since examiners are not directly associated with applications, we’ll have to first collect all of their office actions:
all_office_actions <- download_office_actions(
paste0("examinerEmployeeNumber:(", paste(examiners_ids, collapse = " "), ")"),
paste0(outDir, "all_office_actions.json.xz"),
verbose = FALSE
)
all_applications <- unique(vapply(all_office_actions, function(oa) oa$patentApplicationNumber[[1]], ""))
This gives us 3201 total applications.
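For a rough sense of how that work is distributed, we might also count distinct applications per examiner, reusing the same fields as above (an optional summary):
# distinct applications examined by each examiner
examiner_apps <- tapply(
  vapply(all_office_actions, function(oa) oa$patentApplicationNumber[[1]], ""),
  vapply(all_office_actions, function(oa) trimws(oa$examinerEmployeeNumber[[1]]), ""),
  function(apps) length(unique(apps))
)
sort(examiner_apps, decreasing = TRUE)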
There are at least three sources of information that might be of interest when considering the examination of a patent application: the office actions we already have, the full prosecution histories (which cover similar ground), and the text of the applications themselves.
Prosecution histories track what happens to an application as it moves between the applicant and the patent office. These are recorded in the Patent Examination Data System (PEDS):
library(jsonlite)
examinations_file <- paste0(outDir, "examination_records.json.xz")
if (file.exists(examinations_file)) {
examination_records <- read_json(examinations_file)
} else {
# break the calls into batches, since there is presumably some limit on query length
n <- length(all_applications)
filters <- lapply(
split(sub("/", "", all_applications, fixed = TRUE), sort(rep_len(seq_len(ceiling(n / 750)), n))),
function(set) paste0("applId:(", paste(set, collapse = " "), ")")
)
examination_records <- unlist(lapply(filters, function(f) {
download_peds(filters = list(f), verbose = FALSE)
}), FALSE, FALSE)
con <- xzfile(examinations_file)
write_json(examination_records, con, auto_unbox = TRUE)
close(con)
}
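As a quick check, the number of returned records should be close to the number of applications we requested (some applications may not have PEDS records):
length(examination_records)
length(all_applications)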
Finally, we can collect the actual content of each application, based on the document IDs included in the examination records:
document_ids <- unique(unlist(lapply(examination_records, function(r) {
r$patentCaseMetadata$patentPublicationIdentification$publicationNumber
})))
oriDir <- "eda/original/"
if (!dir.exists(oriDir)) oriDir <- "../eda/original/"
applications <- uspto_download(document_ids, outDir = paste0(oriDir, "applications"))
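The downloaded applications include the full document text; in this example we’ll mainly use the application number, abstract, and publication year, so we can glance at just those columns (the same column names used in the code below):
str(applications[, c("applicationNumber", "abstractHtml", "datePublYear")], nchar.max = 60)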
We can start with a simple look at the text to get a feel for what sets our seed set apart:
# identify the applications that were part of the initial seed set
applications$seed <- applications$applicationNumber %in% seed_set$applicationNumber
# count up terms used within abstracts
library(lingmatch)
dtm <- lma_dtm(applications$abstractHtml, "function", dc.min = 2)
# identify the terms that most identify the seed set
term_seed_sim <- lma_simets(t(dtm), applications$seed, metric = "cosine")
term_seed_sim <- sort(term_seed_sim, TRUE)[1:45]
library(splot)
splot(
term_seed_sim ~ names(term_seed_sim),
type = "bar", sort = FALSE, title = FALSE, labx = FALSE, laby = "Similarity with Seed"
)
Of more interest might be how long aspects of examination take, and which features of an application might be associated with that time.
We can extract timing from the examination records:
examination_timing <- do.call(rbind, lapply(examination_records, function(r) {
times <- extract_event_timing(r$prosecutionHistoryDataBag$prosecutionHistoryData)
data.frame(
Classification_Time = if (!is.null(times[[1]])) times[[1]]$days[[1]] else NA,
Assignment_Time = if (!is.null(times[[2]])) times[[2]]$days[[1]] else NA,
First_Action_Time = if (!is.null(times[[3]])) times[[3]]$days[[1]] else NA,
Average_Action_Time = if (!is.null(times[[3]])) mean(times[[3]]$days) else NA,
Number_of_Actions = if (!is.null(times[[3]])) nrow(times[[3]]) else 0,
row.names = paste0(
substring(r$patentCaseMetadata$applicationNumberText$value, 1, 2),
"/",
substring(r$patentCaseMetadata$applicationNumberText$value, 3)
)
)
}))
examination_timing <- examination_timing[applications$applicationNumber, ]
# look at only applications with extracted times
has_timings <- which(!is.na(examination_timing$First_Action_Time))
examination_timing <- examination_timing[has_timings, ]
Then look at the relationship between timings and actions:
cors <- lma_simets(t(examination_timing), metric = "pearson")
colnames(cors) <- seq_len(ncol(cors))
rownames(cors) <- paste0("(", colnames(cors), ") ", rownames(cors))
round(cors, 3)
#> 5 x 5 sparse Matrix of class "dtCMatrix" (unitriangular)
#> 1 2 3 4 5
#> (1) Classification_Time I . . . .
#> (2) Assignment_Time 0.187 I . . .
#> (3) First_Action_Time 0.151 0.161 I . .
#> (4) Average_Action_Time 0.138 0.156 0.855 I .
#> (5) Number_of_Actions 0.141 -0.001 0.156 0.044 I
Or associate timing and number of actions with terms used in the application’s abstract:
# align applications with valid timings
term_time_sim <- lma_simets(t(dtm[has_timings, ]), t(examination_timing), metric = "pearson")
for (aspect in colnames(term_time_sim)) {
terms <- term_time_sim[, aspect]
terms <- sort(terms, TRUE)[1:45]
splot(
terms ~ names(terms),
title = paste0("Terms Most Associated With ", gsub("_", " ", aspect, fixed = TRUE)),
type = "bar", sort = FALSE, laby = "Pearson's r", labx = FALSE
)
}
We can also get a feel for what seems to characterize Group Art Units by extracting topics from the applications they review.
We can start by getting art unit information for each application:
artunits <- unlist(lapply(examination_records, function(r) {
res <- r$patentCaseMetadata$groupArtUnitNumber$value
names(res) <- sub("(\\d{2})", "\\1/", r$patentCaseMetadata$applicationNumberText$value, perl = TRUE)
res
}))[applications$applicationNumber]
# number of applications assigned to each art unit
table(artunits)
#> artunits
#> 1734 2852 2857 2863 2864 2865 3623
#> 533 808 500 24 844 274 133
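Since artunits is aligned to applications by application number, we might also check how many applications, if any, failed to match an examination record (these would show up as missing values):
# applications without an extracted art unit
sum(is.na(artunits))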
Then, we could get the same sort of associated terms as we did for the seed set:
gaus <- unique(artunits)
term_artunit <- lma_simets(
t(dtm), vapply(artunits, "==", logical(length(gaus)), gaus),
metric = "pearson"
)
colnames(term_artunit) <- gaus
# look at top 10 terms for each
library(knitr) # for kable()
get_top_terms <- function(loadings, n = 10) {
kable(as.data.frame(vapply(colnames(loadings), function(col) {
rownames(loadings)[order(loadings[, col], decreasing = TRUE)[seq_len(n)]]
}, character(n))))
}
get_top_terms(term_artunit)
2865 | 2857 | 3623 | 2852 | 1734 | 2863 | 2864 |
---|---|---|---|---|---|---|
line | system | business | image | gas | sun | data |
corners | malfunction | user | member | stream | tentative | motion |
reconstructs | transformed | users | forming | catalyst | tolerances | values |
reserves | set | sales | toner | dioxide | closure | sensors |
protrusions | reference | actions | developing | carbon | golden | based |
imparted | runtime | customer | developer | containing | radiations | associated |
constrained | boosted | service | direction | oxidation | sub-assembly | determining |
leakage | build-up | task | formed | hydrogen | aims | signals |
differential | centrifugal | score | portion | exhaust | varies | signal |
prompting | incipient | scheduling | apparatus | removal | uncorrected | monitoring |
# we could use this to roughly predict art unit in-sample:
predicted_art_unit <- gaus[max.col(dtm %*% term_artunit)]
round(table(artunits, predicted_art_unit) / as.numeric(table(artunits)), 4)
#> predicted_art_unit
#> artunits 1734 2852 2857 2863 2864 2865 3623
#> 1734 0.9962 0.0019 0.0000 0.0000 0.0000 0.0000 0.0019
#> 2852 0.0062 0.6349 0.0594 0.0000 0.2438 0.0223 0.0334
#> 2857 0.0060 0.0320 0.5680 0.0060 0.3540 0.0140 0.0200
#> 2863 0.0000 0.0417 0.0417 0.7917 0.1250 0.0000 0.0000
#> 2864 0.0213 0.0284 0.0249 0.0000 0.8993 0.0095 0.0166
#> 2865 0.0073 0.0255 0.0584 0.0000 0.3212 0.5657 0.0219
#> 3623 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000
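To summarize that table in a single number, we can look at raw in-sample accuracy (a rough figure, since the predictions reuse the data the term associations were computed from, and larger art units dominate):
# overall proportion of applications assigned to their actual art unit
mean(predicted_art_unit == artunits, na.rm = TRUE)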
We could also extract general topics from this set of applications, then see how much they are represented in each art unit’s assigned applications:
# weight terms with tf-idf
wdtm <- lma_weight(dtm, "tf-idf", FALSE)
# get 10 simple topics
## remotes::install_github("miserman/lusilab")
library(lusilab)
loadings <- taffyInf(wdtm, 10)
colnames(loadings) <- paste0("taffy_", seq_len(ncol(loadings)))
scores <- dtm %*% loadings
get_top_terms(loadings)
taffy_1 | taffy_2 | taffy_3 | taffy_4 | taffy_5 | taffy_6 | taffy_7 | taffy_8 | taffy_9 | taffy_10 |
---|---|---|---|---|---|---|---|---|---|
b | image | data | gas | signal | battery | information | methods | member | method |
slidably | forming | collected | stream | frequency | pack | exercise | systems | developer-carrying | determining |
announcement | toner | acquisition | streams | output | charge | mental | disclosed | insulates | includes |
sequential | latent | associated | exhaust | synchronous | capacity | wrist | described | bearing | identifying |
k | forms | structured | acid | delays | voltage | similarity | load-switching | developer | using |
withdrawn | carrier | interpret | flue | esr | blocks | advice | pc | accommodate | comprises |
orientation | electrostatic | collector | lean | digital | lifespan | credit | utilize | overlapped | drilling-performance |
counter-current | recording | collection | p.sub.i | rails | discharging | acquires | answers | image | steps |
s-selective | transfer | battery-powered | p.sub.o | unmodulated | estimating | schedule | include | developing | calculating |
physical | member | gas-volume | permeate | modulated | state | action | problems | photosensitive | obtaining |
# compare with kmeans
topics_km <- kmeans(wdtm, 10)
loadings_km <- t(topics_km$centers)
dimnames(loadings_km) <- list(colnames(dtm), paste0("km_", seq_len(ncol(loadings_km))))
scores_km <- dtm %*% loadings_km
get_top_terms(loadings_km)
km_1 | km_2 | km_3 | km_4 | km_5 | km_6 | km_7 | km_8 | km_9 | km_10 |
---|---|---|---|---|---|---|---|---|---|
brand | mi | seismic | defect | battery | member | torque | gas | data | b |
merchant | mrow | data | exercise | charge | image | steering | stream | signal | orientation |
vibration | msub | wavefields | sales | voltage | developer | renewable | dioxide | sensor | absorption |
version | math | receiver | insolation | capacity | toner | mi | carbon | system | gas |
personality | mn | wavefield | lead | state | developing | unit | co.sub | device | liquid |
code | mstyle | subsurface | r.sub | current | portion | relay | catalyst | plurality | means |
positioning | mfrac | synthetic | failure-type | soc | unit | vortex | solution | value | co.sub |
detailed | g.sub | volume | information | energy | forming | software | fuel | power | core |
features | mspace | domain | model | estimating | transfer | location-unknown | exhaust | unit | chamber |
options | g.sub.n | model | types | secondary | belt | longitude | acid | method | exhaust |
# compare with NMF
library(RcppML)
topics_nmf <- nmf(wdtm, 10, verbose = FALSE)
loadings_nmf <- t(topics_nmf$h)
dimnames(loadings_nmf) <- list(colnames(dtm), paste0("nmf_", seq_len(ncol(loadings_nmf))))
colnames(topics_nmf$w) <- colnames(loadings_nmf)
get_top_terms(loadings_nmf)
nmf_1 | nmf_2 | nmf_3 | nmf_4 | nmf_5 | nmf_6 | nmf_7 | nmf_8 | nmf_9 | nmf_10 |
---|---|---|---|---|---|---|---|---|---|
data | signal | gas | unit | image | battery | developer | b | mi | mi |
seismic | sensor | stream | temperature | member | voltage | member | orientation | mrow | mrow |
model | circuit | co.sub | value | toner | charge | developing | core | mn | msub |
plurality | signals | dioxide | flow | transfer | capacity | electrode | means | msub | math |
set | power | carbon | pressure | belt | state | supply | physical | math | s |
system | output | solution | measurement | forming | current | container | absorption | g.sub | mfrac |
information | frequency | absorbent | relay | portion | soc | developer-carrying | processing | mstyle | coating |
user | module | absorption | sensor | roller | charging | insulating | s | g.sub.n | c |
method | device | acid | rate | fixing | power | carrier | processed | sub-goal | e |
based | current | liquid | medium | photosensitive | pack | configured | liquid | mspace | substrate |
# see how similar topics are in terms of loadings
loadings_all <- cbind(loadings, loadings_km, loadings_nmf)
loading_cors <- lma_simets(t(loadings_all), metric = "pearson")
topic_names <- rownames(loading_cors)
scores_all <- dtm %*% loadings_all
display_matrix <- function(l, threshold = .1) {
v <- as.data.frame(as.matrix(l))
v[abs(v) < threshold] <- 0
kable(v, digits = 3, format.args = list(zero.print = "."))
}
display_matrix(loading_cors[grep("km", topic_names), grep("taffy", topic_names)])
| | taffy_1 | taffy_2 | taffy_3 | taffy_4 | taffy_5 | taffy_6 | taffy_7 | taffy_8 | taffy_9 | taffy_10 |
---|---|---|---|---|---|---|---|---|---|---|
km_1 | . | . | . | . | . | . | . | . | . | . |
km_2 | . | . | . | . | . | . | . | . | . | . |
km_3 | . | . | 0.256 | . | . | . | . | . | . | 0.171 |
km_4 | . | . | . | . | . | . | 0.255 | . | . | . |
km_5 | . | . | . | . | . | 0.789 | . | . | . | . |
km_6 | . | 0.668 | -0.162 | . | . | . | . | -0.18 | 0.677 | -0.261 |
km_7 | . | . | . | . | . | . | . | . | . | . |
km_8 | 0.157 | . | -0.203 | 0.700 | -0.123 | . | -0.141 | . | . | . |
km_9 | . | . | 0.313 | -0.156 | 0.247 | . | 0.140 | . | -0.131 | 0.160 |
km_10 | 0.655 | . | . | . | . | . | . | . | . | . |
display_matrix(loading_cors[grep("nmf", topic_names), grep("taffy", topic_names)])
| | taffy_1 | taffy_2 | taffy_3 | taffy_4 | taffy_5 | taffy_6 | taffy_7 | taffy_8 | taffy_9 | taffy_10 |
---|---|---|---|---|---|---|---|---|---|---|
nmf_1 | . | -0.144 | 0.577 | -0.189 | . | . | 0.188 | 0.132 | -0.197 | 0.291 |
nmf_2 | . | . | . | -0.151 | 0.658 | . | . | . | . | . |
nmf_3 | 0.135 | . | -0.201 | 0.698 | -0.127 | . | -0.140 | . | . | . |
nmf_4 | . | 0.191 | . | . | . | . | . | -0.107 | . | . |
nmf_5 | . | 0.702 | -0.159 | . | . | . | . | -0.155 | 0.596 | -0.230 |
nmf_6 | . | . | . | . | . | 0.798 | . | . | . | . |
nmf_7 | . | 0.214 | -0.105 | . | . | . | . | . | 0.510 | -0.149 |
nmf_8 | 0.644 | . | . | . | . | . | . | . | . | . |
nmf_9 | . | . | . | . | . | . | . | . | . | . |
nmf_10 | . | . | . | . | . | . | . | . | . | . |
display_matrix(loading_cors[grep("nmf", topic_names), grep("km", topic_names)])
| | km_1 | km_2 | km_3 | km_4 | km_5 | km_6 | km_7 | km_8 | km_9 | km_10 |
---|---|---|---|---|---|---|---|---|---|---|
nmf_1 | 0.218 | . | 0.54 | 0.267 | 0.129 | . | . | . | 0.740 | . |
nmf_2 | . | . | . | . | 0.164 | . | 0.115 | . | 0.631 | . |
nmf_3 | . | . | . | . | . | . | . | 0.968 | 0.110 | . |
nmf_4 | . | . | . | . | 0.140 | 0.283 | 0.380 | 0.136 | 0.428 | . |
nmf_5 | . | . | . | . | . | 0.884 | . | . | 0.126 | . |
nmf_6 | . | . | . | . | 0.976 | . | . | . | 0.199 | . |
nmf_7 | . | . | . | . | . | 0.625 | . | . | . | . |
nmf_8 | . | . | . | . | . | . | . | . | 0.104 | 0.997 |
nmf_9 | . | 0.960 | . | . | . | . | 0.259 | . | . | . |
nmf_10 | . | 0.965 | . | . | . | . | 0.273 | . | . | . |
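One compact way to read these similarity matrices is to pull out each topic's single closest counterpart; for example, matching each NMF topic to its nearest k-means topic (a small convenience step, not part of the main analysis):
# most similar k-means topic for each NMF topic, by loading correlation
nmf_km <- as.matrix(loading_cors[grep("nmf", topic_names), grep("km", topic_names)])
setNames(colnames(nmf_km)[apply(nmf_km, 1, which.max)], rownames(nmf_km))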
Now we can look at trends in these topics over time within each art unit:
applications$year <- as.numeric(applications$datePublYear)
trends <- vapply(gaus, function(au) {
su <- artunits == au
lma_simets(t(scores_all[su, ]), applications[su, "year"], metric = "pearson")
}, numeric(ncol(scores_all)))
display_matrix(t(trends), .1)
| | taffy_1 | taffy_2 | taffy_3 | taffy_4 | taffy_5 | taffy_6 | taffy_7 | taffy_8 | taffy_9 | taffy_10 | km_1 | km_2 | km_3 | km_4 | km_5 | km_6 | km_7 | km_8 | km_9 | km_10 | nmf_1 | nmf_2 | nmf_3 | nmf_4 | nmf_5 | nmf_6 | nmf_7 | nmf_8 | nmf_9 | nmf_10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2865 | . | . | . | -0.135 | 0.161 | 0.120 | . | . | -0.153 | . | . | . | 0.107 | 0.105 | 0.146 | . | . | . | 0.184 | 0.113 | 0.113 | 0.175 | . | . | . | 0.135 | . | 0.111 | . | . |
2857 | . | . | 0.111 | . | . | . | . | 0.149 | -0.102 | . | . | . | 0.182 | . | . | . | . | . | . | . | 0.144 | -0.105 | . | . | . | . | . | . | . | . |
3623 | 0.112 | -0.265 | . | . | . | -0.109 | -0.154 | 0.159 | -0.217 | 0.113 | . | 0.106 | 0.126 | . | . | . | 0.107 | 0.194 | 0.150 | 0.116 | 0.106 | 0.133 | 0.164 | . | . | . | . | 0.116 | 0.102 | . |
2852 | . | -0.507 | 0.475 | . | 0.279 | 0.290 | 0.277 | 0.420 | -0.425 | 0.494 | 0.164 | 0.136 | 0.256 | 0.232 | 0.252 | -0.479 | . | . | 0.342 | . | 0.454 | 0.243 | . | -0.167 | -0.492 | 0.229 | -0.248 | . | 0.108 | . |
1734 | 0.116 | . | . | . | . | -0.103 | . | . | . | . | . | . | . | . | . | . | . | . | 0.112 | 0.119 | . | 0.106 | . | . | . | . | . | 0.119 | . | . |
2863 | . | -0.287 | . | -0.150 | . | 0.109 | . | 0.213 | -0.207 | 0.271 | 0.143 | -0.363 | -0.264 | -0.120 | . | -0.289 | -0.236 | -0.270 | -0.124 | -0.124 | . | . | -0.239 | -0.271 | -0.254 | . | -0.250 | -0.113 | -0.336 | -0.267 |
2864 | . | -0.154 | 0.221 | . | . | . | . | 0.199 | -0.196 | . | . | . | 0.246 | 0.103 | . | . | . | . | 0.158 | . | 0.249 | . | . | . | . | . | . | . | . | . |
focal_unit <- names(which.max(colMeans(abs(trends))))
focal_topics <- names(which(apply(abs(trends), 1, max) > .4))
get_top_terms(loadings_all[, focal_topics])
taffy_2 | taffy_3 | taffy_8 | taffy_9 | taffy_10 | km_6 | nmf_1 | nmf_5 |
---|---|---|---|---|---|---|---|
image | data | methods | member | method | member | data | image |
forming | collected | systems | developer-carrying | determining | image | seismic | member |
toner | acquisition | disclosed | insulates | includes | developer | model | toner |
latent | associated | described | bearing | identifying | toner | plurality | transfer |
forms | structured | load-switching | developer | using | developing | set | belt |
carrier | interpret | pc | accommodate | comprises | portion | system | forming |
electrostatic | collector | utilize | overlapped | drilling-performance | unit | information | portion |
recording | collection | answers | image | steps | forming | user | roller |
transfer | battery-powered | include | developing | calculating | transfer | method | fixing |
member | gas-volume | problems | photosensitive | obtaining | belt | based | photosensitive |
splot(
scores_all[, focal_topics] ~ applications$year,
su = artunits == focal_unit, mv.scale = TRUE, sud = FALSE, myl = c(-2, 9),
title = paste("Topic Over Time in Art Unit", focal_unit),
lines = "spline", laby = "Topic Use (z-scored)", labx = FALSE
)