example-seed_expand.Rmd
Built with R 4.2.1
This example looks at a set of patent applications expanded from an initial seed set.
Before starting, we’ll need to load the package, and point to a directory where we’d like things saved:
# install if needed: remotes::install_github("uva-bi-sdad/uspto")
library(uspto)
outDir <- "eda/seed_expand/"
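If that output directory doesn't already exist, it may need to be created before the download functions can save files into it:
dir.create(outDir, recursive = TRUE, showWarnings = FALSE)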
We’ll start by identifying a small set of applications that have a meteorology Cooperative Patent Classification (CPC) class (G01W) and mention “global warming” or “climate change” in their abstracts:
seed_set <- uspto_search(
'g01w.cpcl. AND ("global warming".ab. "climate change".ab.)',
"US-PGPUB",
outFile = paste0(outDir, "seed_set.csv")
)
This gives us 9 applications dated between 2012-05-17 and 2022-06-09.
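As a quick sanity check (assuming, as in the code below, that the results come back as a data frame with an applicationNumber column), we can confirm what the search returned:
# number of applications in the seed set, and a peek at their IDs
nrow(seed_set)
head(seed_set$applicationNumber)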
For some context around the examination of these applications, we’ll want to collect all of the other applications examined by the same examiners. Examiners are not listed in the application records themselves, so we’ll need to identify them through the office actions associated with each application.
First, we need to know which examiners examined the seed set of applications:
office_actions <- download_office_actions(
paste0("patentApplicationNumber:(", paste(
sub("/", "", seed_set$applicationNumber, fixed = TRUE),
collapse = " "
), ")"),
paste0(outDir, "seed_set_office_actions.json.xz"),
verbose = FALSE
)
From these 14 office actions, we can get a set of examiners:
examiners_ids <- unique(vapply(office_actions, function(oa) trimws(oa$examinerEmployeeNumber[[1]]), ""))
And now we have 6 examiners who examined at least one application in our seed set.
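If we want to see how those 14 office actions are split among the 6 examiners, we can reuse the same accessor to tabulate them (just an optional check):
# number of seed office actions per examiner
table(vapply(office_actions, function(oa) trimws(oa$examinerEmployeeNumber[[1]]), ""))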
Next, we want to collect all of the other applications our identified examiners examined. Again, since examiners are not directly associated with applications, we’ll have to first collect all of their office actions:
all_office_actions <- download_office_actions(
paste0("examinerEmployeeNumber:(", paste(examiners_ids, collapse = " "), ")"),
paste0(outDir, "all_office_actions.json.xz"),
verbose = FALSE
)
all_applications <- unique(vapply(all_office_actions, function(oa) oa$patentApplicationNumber[[1]], ""))
This gives us 3201 total applications.
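For a rough sense of how that work is distributed, we might also count distinct applications per examiner, reusing the same fields as above (an optional summary):
# distinct applications examined by each examiner
examiner_apps <- tapply(
  vapply(all_office_actions, function(oa) oa$patentApplicationNumber[[1]], ""),
  vapply(all_office_actions, function(oa) trimws(oa$examinerEmployeeNumber[[1]]), ""),
  function(apps) length(unique(apps))
)
sort(examiner_apps, decreasing = TRUE)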
There are at least three sources of information that might be of interest when considering the examination of a patent application: the office actions we already have, the full prosecution histories (which cover similar ground), and the text of the applications themselves.
Prosecution histories track what happens to an application as it moves between the applicant and the patent office. These are recorded in the Patent Examination Data System (PEDS):
library(jsonlite)
examinations_file <- paste0(outDir, "examination_records.json.xz")
if (file.exists(examinations_file)) {
examination_records <- read_json(examinations_file)
} else {
# break the calls into batches, since there is presumably some limit on query length
n <- length(all_applications)
filters <- lapply(
split(sub("/", "", all_applications, fixed = TRUE), sort(rep_len(seq_len(ceiling(n / 750)), n))),
function(set) paste0("applId:(", paste(set, collapse = " "), ")")
)
examination_records <- unlist(lapply(filters, function(f) {
download_peds(filters = list(f), verbose = FALSE)
}), FALSE, FALSE)
con <- xzfile(examinations_file)
write_json(examination_records, con, auto_unbox = TRUE)
close(con)
}
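As a quick check, the number of returned records should be close to the number of applications we requested (some applications may not have PEDS records):
length(examination_records)
length(all_applications)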
Finally, we can collect the actual content of each application, based on the document IDs included in the examination records:
document_ids <- unique(unlist(lapply(examination_records, function(r) {
r$patentCaseMetadata$patentPublicationIdentification$publicationNumber
})))
oriDir <- "eda/original/"
if (!dir.exists(oriDir)) oriDir <- "../eda/original/"
applications <- uspto_download(document_ids, outDir = paste0(oriDir, "applications"))
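The downloaded applications include the full document text; in this example we’ll mainly use the application number, abstract, and publication year, so we can glance at just those columns (the same column names used in the code below):
str(applications[, c("applicationNumber", "abstractHtml", "datePublYear")], nchar.max = 60)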
We can start with a simple look at the text to get a feel for what sets our seed set apart:
# identify the applications that were part of the initial seed set
applications$seed <- applications$applicationNumber %in% seed_set$applicationNumber
# count up terms used within abstracts
library(lingmatch)
dtm <- lma_dtm(applications$abstractHtml, "function", dc.min = 2)
# identify the terms that most identify the seed set
term_seed_sim <- lma_simets(t(dtm), applications$seed, metric = "cosine")
term_seed_sim <- sort(term_seed_sim, TRUE)[1:45]
library(splot)
splot(
term_seed_sim ~ names(term_seed_sim),
type = "bar", sort = FALSE, title = FALSE, labx = FALSE, laby = "Similarity with Seed"
)
Of more interest might be how long aspects of examination take, and which features of an application might be associated with that time.
We can extract timing from the examination records:
examination_timing <- do.call(rbind, lapply(examination_records, function(r) {
times <- extract_event_timing(r$prosecutionHistoryDataBag$prosecutionHistoryData)
data.frame(
Classification_Time = if (!is.null(times[[1]])) times[[1]]$days[[1]] else NA,
Assignment_Time = if (!is.null(times[[2]])) times[[2]]$days[[1]] else NA,
First_Action_Time = if (!is.null(times[[3]])) times[[3]]$days[[1]] else NA,
Average_Action_Time = if (!is.null(times[[3]])) mean(times[[3]]$days) else NA,
Number_of_Actions = if (!is.null(times[[3]])) nrow(times[[3]]) else 0,
row.names = paste0(
substring(r$patentCaseMetadata$applicationNumberText$value, 1, 2),
"/",
substring(r$patentCaseMetadata$applicationNumberText$value, 3)
)
)
}))
examination_timing <- examination_timing[applications$applicationNumber, ]
# look at only applications with extracted times
has_timings <- which(!is.na(examination_timing$First_Action_Time))
examination_timing <- examination_timing[has_timings, ]
Then look at the relationship between timings and actions:
cors <- lma_simets(t(examination_timing), metric = "pearson")
colnames(cors) <- seq_len(ncol(cors))
rownames(cors) <- paste0("(", colnames(cors), ") ", rownames(cors))
round(cors, 3)
#> 5 x 5 sparse Matrix of class "dtCMatrix" (unitriangular)
#> 1 2 3 4 5
#> (1) Classification_Time I . . . .
#> (2) Assignment_Time 0.187 I . . .
#> (3) First_Action_Time 0.151 0.161 I . .
#> (4) Average_Action_Time 0.138 0.156 0.855 I .
#> (5) Number_of_Actions 0.141 -0.001 0.156 0.044 I
Or associate timing and number of actions with terms used in the application’s abstract:
# align applications with valid timings
term_time_sim <- lma_simets(t(dtm[has_timings, ]), t(examination_timing), metric = "pearson")
for (aspect in colnames(term_time_sim)) {
terms <- term_time_sim[, aspect]
terms <- sort(terms, TRUE)[1:45]
splot(
terms ~ names(terms),
title = paste0("Terms Most Associated With ", gsub("_", " ", aspect, fixed = TRUE)),
type = "bar", sort = FALSE, laby = "Pearson's r", labx = FALSE
)
}
We can also get a feel for what seems to characterize Group Art Units by extracting topics from the applications they review.
We can start by getting art unit information for each application:
artunits <- unlist(lapply(examination_records, function(r) {
res <- r$patentCaseMetadata$groupArtUnitNumber$value
names(res) <- sub("(\\d{2})", "\\1/", r$patentCaseMetadata$applicationNumberText$value, perl = TRUE)
res
}))[applications$applicationNumber]
# number of applications assigned to each art unit
table(artunits)
#> artunits
#> 1734 2852 2857 2863 2864 2865 3623
#> 533 808 500 24 844 274 133
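Since artunits is aligned to applications by application number, we might also check how many applications, if any, failed to match an examination record (these would show up as missing values):
# applications without an extracted art unit
sum(is.na(artunits))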
Then, we could get the same sort of associated terms as we did for the seed set:
gaus <- unique(artunits)
term_artunit <- lma_simets(
t(dtm), vapply(artunits, "==", logical(length(gaus)), gaus),
metric = "pearson"
)
colnames(term_artunit) <- gaus
# look at top 10 terms for each
library(knitr) # for kable()
get_top_terms <- function(loadings, n = 10) {
kable(as.data.frame(vapply(colnames(loadings), function(col) {
rownames(loadings)[order(loadings[, col], decreasing = TRUE)[seq_len(n)]]
}, character(n))))
}
get_top_terms(term_artunit)
2865 | 2857 | 3623 | 2852 | 1734 | 2863 | 2864 |
---|---|---|---|---|---|---|
line | system | business | image | gas | sun | data |
corners | malfunction | user | member | stream | tentative | motion |
reconstructs | transformed | users | forming | catalyst | tolerances | values |
reserves | set | sales | toner | dioxide | closure | sensors |
protrusions | reference | actions | developing | carbon | golden | based |
imparted | runtime | customer | developer | containing | radiations | associated |
constrained | boosted | service | direction | oxidation | sub-assembly | determining |
leakage | build-up | task | formed | hydrogen | aims | signals |
differential | centrifugal | score | portion | exhaust | varies | signal |
prompting | incipient | scheduling | apparatus | removal | uncorrected | monitoring |
# we could use this to roughly predict art unit in-sample:
predicted_art_unit <- gaus[max.col(dtm %*% term_artunit)]
round(table(artunits, predicted_art_unit) / as.numeric(table(artunits)), 4)
#> predicted_art_unit
#> artunits 1734 2852 2857 2863 2864 2865 3623
#> 1734 0.9962 0.0019 0.0000 0.0000 0.0000 0.0000 0.0019
#> 2852 0.0062 0.6349 0.0594 0.0000 0.2438 0.0223 0.0334
#> 2857 0.0060 0.0320 0.5680 0.0060 0.3540 0.0140 0.0200
#> 2863 0.0000 0.0417 0.0417 0.7917 0.1250 0.0000 0.0000
#> 2864 0.0213 0.0284 0.0249 0.0000 0.8993 0.0095 0.0166
#> 2865 0.0073 0.0255 0.0584 0.0000 0.3212 0.5657 0.0219
#> 3623 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000
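To summarize that table in a single number, we can look at raw in-sample accuracy (a rough figure, since the predictions reuse the data the term associations were computed from, and larger art units dominate):
# overall proportion of applications assigned to their actual art unit
mean(predicted_art_unit == artunits, na.rm = TRUE)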
We could also extract general topics from this set of applications, then see how much they are represented in each art unit’s assigned applications:
# weight terms with tf-idf
wdtm <- lma_weight(dtm, "tf-idf", FALSE)
# get 10 simple topics
## remotes::install_github("miserman/lusilab")
library(lusilab)
loadings <- taffyInf(wdtm, 10)
colnames(loadings) <- paste0("taffy_", seq_len(ncol(loadings)))
scores <- dtm %*% loadings
get_top_terms(loadings)
taffy_1 | taffy_2 | taffy_3 | taffy_4 | taffy_5 | taffy_6 | taffy_7 | taffy_8 | taffy_9 | taffy_10 |
---|---|---|---|---|---|---|---|---|---|
b | image | data | gas | signal | battery | information | methods | member | method |
slidably | forming | collected | stream | frequency | pack | exercise | systems | developer-carrying | determining |
announcement | toner | acquisition | streams | output | charge | mental | disclosed | insulates | includes |
sequential | latent | associated | exhaust | synchronous | capacity | wrist | described | bearing | identifying |
k | forms | structured | acid | delays | voltage | similarity | load-switching | developer | using |
withdrawn | carrier | interpret | flue | esr | blocks | advice | pc | accommodate | comprises |
orientation | electrostatic | collector | lean | digital | lifespan | credit | utilize | overlapped | drilling-performance |
counter-current | recording | collection | p.sub.i | rails | discharging | acquires | answers | image | steps |
s-selective | transfer | battery-powered | p.sub.o | unmodulated | estimating | schedule | include | developing | calculating |
physical | member | gas-volume | permeate | modulated | state | action | problems | photosensitive | obtaining |
# compare with kmeans
topics_km <- kmeans(wdtm, 10)
loadings_km <- t(topics_km$centers)
dimnames(loadings_km) <- list(colnames(dtm), paste0("km_", seq_len(ncol(loadings_km))))
scores_km <- dtm %*% loadings_km
get_top_terms(loadings_km)
km_1 | km_2 | km_3 | km_4 | km_5 | km_6 | km_7 | km_8 | km_9 | km_10 |
---|---|---|---|---|---|---|---|---|---|
brand | mi | seismic | defect | battery | member | torque | gas | data | b |
merchant | mrow | data | exercise | charge | image | steering | stream | signal | orientation |
vibration | msub | wavefields | sales | voltage | developer | renewable | dioxide | sensor | absorption |
version | math | receiver | insolation | capacity | toner | mi | carbon | system | gas |
personality | mn | wavefield | lead | state | developing | unit | co.sub | device | liquid |
code | mstyle | subsurface | r.sub | current | portion | relay | catalyst | plurality | means |
positioning | mfrac | synthetic | failure-type | soc | unit | vortex | solution | value | co.sub |
detailed | g.sub | volume | information | energy | forming | software | fuel | power | core |
features | mspace | domain | model | estimating | transfer | location-unknown | exhaust | unit | chamber |
options | g.sub.n | model | types | secondary | belt | longitude | acid | method | exhaust |
# compare with NMF
library(RcppML)
topics_nmf <- nmf(wdtm, 10, verbose = FALSE)
loadings_nmf <- t(topics_nmf$h)
dimnames(loadings_nmf) <- list(colnames(dtm), paste0("nmf_", seq_len(ncol(loadings_nmf))))
colnames(topics_nmf$w) <- colnames(loadings_nmf)
get_top_terms(loadings_nmf)
nmf_1 | nmf_2 | nmf_3 | nmf_4 | nmf_5 | nmf_6 | nmf_7 | nmf_8 | nmf_9 | nmf_10 |
---|---|---|---|---|---|---|---|---|---|
data | signal | gas | unit | image | battery | developer | b | mi | mi |
seismic | sensor | stream | temperature | member | voltage | member | orientation | mrow | mrow |
model | circuit | co.sub | value | toner | charge | developing | core | mn | msub |
plurality | signals | dioxide | flow | transfer | capacity | electrode | means | msub | math |
set | power | carbon | pressure | belt | state | supply | physical | math | s |
system | output | solution | measurement | forming | current | container | absorption | g.sub | mfrac |
information | frequency | absorbent | relay | portion | soc | developer-carrying | processing | mstyle | coating |
user | module | absorption | sensor | roller | charging | insulating | s | g.sub.n | c |
method | device | acid | rate | fixing | power | carrier | processed | sub-goal | e |
based | current | liquid | medium | photosensitive | pack | configured | liquid | mspace | substrate |
# see how similar topics are in terms of loadings
loadings_all <- cbind(loadings, loadings_km, loadings_nmf)
loading_cors <- lma_simets(t(loadings_all), metric = "pearson")
topic_names <- rownames(loading_cors)
scores_all <- dtm %*% loadings_all
display_matrix <- function(l, threshold = .1) {
v <- as.data.frame(as.matrix(l))
v[abs(v) < threshold] <- 0
kable(v, digits = 3, format.args = list(zero.print = "."))
}
display_matrix(loading_cors[grep("km", topic_names), grep("taffy", topic_names)])
| | taffy_1 | taffy_2 | taffy_3 | taffy_4 | taffy_5 | taffy_6 | taffy_7 | taffy_8 | taffy_9 | taffy_10 |
---|---|---|---|---|---|---|---|---|---|---|
km_1 | . | . | . | . | . | . | . | . | . | . |
km_2 | . | . | . | . | . | . | . | . | . | . |
km_3 | . | . | 0.256 | . | . | . | . | . | . | 0.171 |
km_4 | . | . | . | . | . | . | 0.255 | . | . | . |
km_5 | . | . | . | . | . | 0.789 | . | . | . | . |
km_6 | . | 0.668 | -0.162 | . | . | . | . | -0.18 | 0.677 | -0.261 |
km_7 | . | . | . | . | . | . | . | . | . | . |
km_8 | 0.157 | . | -0.203 | 0.700 | -0.123 | . | -0.141 | . | . | . |
km_9 | . | . | 0.313 | -0.156 | 0.247 | . | 0.140 | . | -0.131 | 0.160 |
km_10 | 0.655 | . | . | . | . | . | . | . | . | . |
display_matrix(loading_cors[grep("nmf", topic_names), grep("taffy", topic_names)])
| | taffy_1 | taffy_2 | taffy_3 | taffy_4 | taffy_5 | taffy_6 | taffy_7 | taffy_8 | taffy_9 | taffy_10 |
---|---|---|---|---|---|---|---|---|---|---|
nmf_1 | . | -0.144 | 0.577 | -0.189 | . | . | 0.188 | 0.132 | -0.197 | 0.291 |
nmf_2 | . | . | . | -0.151 | 0.658 | . | . | . | . | . |
nmf_3 | 0.135 | . | -0.201 | 0.698 | -0.127 | . | -0.140 | . | . | . |
nmf_4 | . | 0.191 | . | . | . | . | . | -0.107 | . | . |
nmf_5 | . | 0.702 | -0.159 | . | . | . | . | -0.155 | 0.596 | -0.230 |
nmf_6 | . | . | . | . | . | 0.798 | . | . | . | . |
nmf_7 | . | 0.214 | -0.105 | . | . | . | . | . | 0.510 | -0.149 |
nmf_8 | 0.644 | . | . | . | . | . | . | . | . | . |
nmf_9 | . | . | . | . | . | . | . | . | . | . |
nmf_10 | . | . | . | . | . | . | . | . | . | . |
display_matrix(loading_cors[grep("nmf", topic_names), grep("km", topic_names)])
| | km_1 | km_2 | km_3 | km_4 | km_5 | km_6 | km_7 | km_8 | km_9 | km_10 |
---|---|---|---|---|---|---|---|---|---|---|
nmf_1 | 0.218 | . | 0.54 | 0.267 | 0.129 | . | . | . | 0.740 | . |
nmf_2 | . | . | . | . | 0.164 | . | 0.115 | . | 0.631 | . |
nmf_3 | . | . | . | . | . | . | . | 0.968 | 0.110 | . |
nmf_4 | . | . | . | . | 0.140 | 0.283 | 0.380 | 0.136 | 0.428 | . |
nmf_5 | . | . | . | . | . | 0.884 | . | . | 0.126 | . |
nmf_6 | . | . | . | . | 0.976 | . | . | . | 0.199 | . |
nmf_7 | . | . | . | . | . | 0.625 | . | . | . | . |
nmf_8 | . | . | . | . | . | . | . | . | 0.104 | 0.997 |
nmf_9 | . | 0.960 | . | . | . | . | 0.259 | . | . | . |
nmf_10 | . | 0.965 | . | . | . | . | 0.273 | . | . | . |
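One compact way to read these similarity matrices is to pull out each topic's single closest counterpart; for example, matching each NMF topic to its nearest k-means topic (a small convenience step, not part of the main analysis):
# most similar k-means topic for each NMF topic, by loading correlation
nmf_km <- as.matrix(loading_cors[grep("nmf", topic_names), grep("km", topic_names)])
setNames(colnames(nmf_km)[apply(nmf_km, 1, which.max)], rownames(nmf_km))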
Now we can look at trends in these topics over time within each art unit:
applications$year <- as.numeric(applications$datePublYear)
trends <- vapply(gaus, function(au) {
su <- artunits == au
lma_simets(t(scores_all[su, ]), applications[su, "year"], metric = "pearson")
}, numeric(ncol(scores_all)))
display_matrix(t(trends), .1)
| | taffy_1 | taffy_2 | taffy_3 | taffy_4 | taffy_5 | taffy_6 | taffy_7 | taffy_8 | taffy_9 | taffy_10 | km_1 | km_2 | km_3 | km_4 | km_5 | km_6 | km_7 | km_8 | km_9 | km_10 | nmf_1 | nmf_2 | nmf_3 | nmf_4 | nmf_5 | nmf_6 | nmf_7 | nmf_8 | nmf_9 | nmf_10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2865 | . | . | . | -0.135 | 0.161 | 0.120 | . | . | -0.153 | . | . | . | 0.107 | 0.105 | 0.146 | . | . | . | 0.184 | 0.113 | 0.113 | 0.175 | . | . | . | 0.135 | . | 0.111 | . | . |
2857 | . | . | 0.111 | . | . | . | . | 0.149 | -0.102 | . | . | . | 0.182 | . | . | . | . | . | . | . | 0.144 | -0.105 | . | . | . | . | . | . | . | . |
3623 | 0.112 | -0.265 | . | . | . | -0.109 | -0.154 | 0.159 | -0.217 | 0.113 | . | 0.106 | 0.126 | . | . | . | 0.107 | 0.194 | 0.150 | 0.116 | 0.106 | 0.133 | 0.164 | . | . | . | . | 0.116 | 0.102 | . |
2852 | . | -0.507 | 0.475 | . | 0.279 | 0.290 | 0.277 | 0.420 | -0.425 | 0.494 | 0.164 | 0.136 | 0.256 | 0.232 | 0.252 | -0.479 | . | . | 0.342 | . | 0.454 | 0.243 | . | -0.167 | -0.492 | 0.229 | -0.248 | . | 0.108 | . |
1734 | 0.116 | . | . | . | . | -0.103 | . | . | . | . | . | . | . | . | . | . | . | . | 0.112 | 0.119 | . | 0.106 | . | . | . | . | . | 0.119 | . | . |
2863 | . | -0.287 | . | -0.150 | . | 0.109 | . | 0.213 | -0.207 | 0.271 | 0.143 | -0.363 | -0.264 | -0.120 | . | -0.289 | -0.236 | -0.270 | -0.124 | -0.124 | . | . | -0.239 | -0.271 | -0.254 | . | -0.250 | -0.113 | -0.336 | -0.267 |
2864 | . | -0.154 | 0.221 | . | . | . | . | 0.199 | -0.196 | . | . | . | 0.246 | 0.103 | . | . | . | . | 0.158 | . | 0.249 | . | . | . | . | . | . | . | . | . |
focal_unit <- names(which.max(colMeans(abs(trends))))
focal_topics <- names(which(apply(abs(trends), 1, max) > .4))
get_top_terms(loadings_all[, focal_topics])
taffy_2 | taffy_3 | taffy_8 | taffy_9 | taffy_10 | km_6 | nmf_1 | nmf_5 |
---|---|---|---|---|---|---|---|
image | data | methods | member | method | member | data | image |
forming | collected | systems | developer-carrying | determining | image | seismic | member |
toner | acquisition | disclosed | insulates | includes | developer | model | toner |
latent | associated | described | bearing | identifying | toner | plurality | transfer |
forms | structured | load-switching | developer | using | developing | set | belt |
carrier | interpret | pc | accommodate | comprises | portion | system | forming |
electrostatic | collector | utilize | overlapped | drilling-performance | unit | information | portion |
recording | collection | answers | image | steps | forming | user | roller |
transfer | battery-powered | include | developing | calculating | transfer | method | fixing |
member | gas-volume | problems | photosensitive | obtaining | belt | based | photosensitive |
splot(
scores_all[, focal_topics] ~ applications$year,
su = artunits == focal_unit, mv.scale = TRUE, sud = FALSE, myl = c(-2, 9),
title = paste("Topic Over Time in Art Unit", focal_unit),
lines = "spline", laby = "Topic Use (z-scored)", labx = FALSE
)