example-inventor_comparisons.Rmd
Built with R 4.2.1
This example explores the inventor-focused search set created by example_inventor_sexing.R.
The files it produces are in the eda/searches directory:
outDir <- "eda/searches/"
First, we can load in the original search results:
search_pharmaceutical <- read.csv(xzfile(paste0(outDir, "pharmaceutical.csv.xz")))
search_mechanical <- read.csv(xzfile(paste0(outDir, "mechanical.csv.xz")))
The pharmaceutical search returned 94,998 results, and the mechanical search returned 89,487 results.
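As a minimal sketch of the loading pattern above, the following round-trips a small data frame through an xz-compressed CSV and formats the row count the way the result counts are reported. The toy data frame and the temporary file are hypothetical stand-ins, not the actual search files.

```r
# hypothetical toy data standing in for a search result file
toy <- data.frame(
  applicationNumber = c("100", "101", "102"),
  title = c("a", "b", "c")
)
path <- tempfile(fileext = ".csv.xz")

# write through an xz connection; write.csv opens and closes it
write.csv(toy, xzfile(path), row.names = FALSE)

# read it back the same way the searches are loaded above
toy_read <- read.csv(xzfile(path))

# number of results, formatted with a thousands separator
n_results <- format(nrow(toy_read), big.mark = ",")
n_results
```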
The files are split into separate search sets, which we’ll combine:
inventors <- rbind(
# based on the search pharmaceutical.ab.
read.csv(xzfile(paste0(outDir, "inventors_pharmaceutical.csv.xz"))),
# based on the search F01?.cpcl.
read.csv(xzfile(paste0(outDir, "inventors_mechanical.csv.xz")))
)
# ensure our working set is from the original searches
inventors <- inventors[inventors$applicationNumber %in% c(
search_mechanical$applicationNumber, search_pharmaceutical$applicationNumber
), ]
This set includes information from 181,694 patent applications, with 245,824 inventors (going by unique full name, at least). Of those applications, 93,234 are from the pharmaceutical search, and 88,460 are from the mechanical search.
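Counts like these could be derived along the following lines. This is only a sketch on toy data: it assumes applicationNumber identifies an application and that a full name is the first and last name pasted together, and the toy names and numbers are made up.

```r
# hypothetical toy stand-ins for the two inventor files
inv_pharm <- data.frame(
  applicationNumber = c(1, 1, 2),
  firstName = c("Ana", "Bo", "Ana"),
  lastName = c("Li", "Ng", "Li"),
  search_set = "pharmaceutical"
)
inv_mech <- data.frame(
  applicationNumber = c(3, 9),
  firstName = c("Cy", "Di"),
  lastName = c("Ox", "Po"),
  search_set = "mechanical"
)
toy_inventors <- rbind(inv_pharm, inv_mech)

# keep only rows whose application appears in the (toy) original searches
original_numbers <- c(1, 2, 3)
toy_inventors <- toy_inventors[toy_inventors$applicationNumber %in% original_numbers, ]

# number of applications, and inventors by unique full name
n_applications <- length(unique(toy_inventors$applicationNumber))
n_inventors <- length(unique(paste(toy_inventors$firstName, toy_inventors$lastName)))

# applications per search set (one row per application)
per_set <- table(toy_inventors$search_set[!duplicated(toy_inventors$applicationNumber)])
```

Here application 9 is dropped by the filter, and "Ana Li" is counted once even though she appears on two applications.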
This set of applications is meant to represent technology fields that differ in their proportion of female inventors, where mechanical (such as machine tools and mechanical elements) has a particularly low proportion, and pharmaceutical (along with biotechnology and areas of chemistry) has a higher proportion (Intellectual Property Office, 2019; Miguelez et al., 2019). The field-specific subsets are only defined by their simple search terms, however, so we’ll want to see how representative of those fields these applications might actually be.
One way to get a feel for how representative our sets are might be to look at the distribution of classifications within each:
# make a subset that contains only one line per application
applications <- inventors[!duplicated(inventors$guid), ]
# look at the overall UPC class of each application between sets
overall_classes <- substring(applications$classification, 4, 6)
classes <- table(overall_classes, applications$search_set)
class_highlights <- rbind(
# most differing classes
classes[order(classes[, 2] - classes[, 1])[c(1:10, (1:10) + nrow(classes) - 10)], ],
# most overlapping classes
classes[order(abs(classes[, 2] - classes[, 1]) - rowSums(classes))[1:10], ]
)
class_highlights <- data.frame(
class = rownames(class_highlights),
criteria = rep(c("differing", "overlapping"), c(20, 10)),
class_highlights
)
## add descriptions
library(uspto)
class_info <- get_class_info(class_highlights$class, paste0(dirname(outDir), "/original/class_info"))
class_highlights$description <- vapply(class_info, "[[", "", "description")
library(knitr)
kable(
class_highlights,
row.names = FALSE,
col.names = c("Class", "Criteria", "Mechanical", "Pharmaceutical", "Description")
)
Class | Criteria | Mechanical | Pharmaceutical | Description |
---|---|---|---|---|
060 | differing | 20227 | 0 | POWER PLANTS |
123 | differing | 14319 | 1 | INTERNAL-COMBUSTION ENGINES |
415 | differing | 11596 | 1 | ROTARY KINETIC FLUID MOTORS OR PUMPS |
416 | differing | 6355 | 0 | FLUID REACTION SURFACES (I.E., IMPELLERS) |
701 | differing | 2029 | 1 | DATA PROCESSING: VEHICLES, NAVIGATION, AND RELATIVE LOCATION |
029 | differing | 1969 | 9 | METAL WORKING |
418 | differing | 1737 | 0 | ROTARY EXPANSIBLE CHAMBER DEVICES |
422 | differing | 1690 | 142 | CHEMICAL APPARATUS AND PROCESS DISINFECTING, DEODORIZING, PRESERVING, OR STERILIZING |
073 | differing | 1556 | 81 | MEASURING AND TESTING |
428 | differing | 1629 | 236 | STOCK MATERIAL OR MISCELLANEOUS ARTICLES |
604 | differing | 7 | 652 | SURGERY |
540 | differing | 0 | 700 | ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES |
536 | differing | 0 | 941 | ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES |
530 | differing | 2 | 1680 | CHEMISTRY: NATURAL RESINS OR DERIVATIVES; PEPTIDES OR PROTEINS; LIGNINS OR REACTION PRODUCTS THEREOF |
548 | differing | 0 | 2344 | ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES |
546 | differing | 0 | 2812 | ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES |
544 | differing | 2 | 2905 | ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES |
435 | differing | 22 | 3483 | CHEMISTRY: MOLECULAR BIOLOGY AND MICROBIOLOGY |
424 | differing | 6 | 24820 | DRUG, BIO-AFFECTING AND BODY TREATING COMPOSITIONS |
514 | differing | 17 | 45051 | DRUG, BIO-AFFECTING AND BODY TREATING COMPOSITIONS |
428 | overlapping | 1629 | 236 | STOCK MATERIAL OR MISCELLANEOUS ARTICLES |
700 | overlapping | 340 | 170 | DATA PROCESSING: GENERIC CONTROL SYSTEMS OR SPECIFIC APPLICATIONS |
702 | overlapping | 424 | 167 | DATA PROCESSING: MEASURING, CALIBRATING, OR TESTING |
422 | overlapping | 1690 | 142 | CHEMICAL APPARATUS AND PROCESS DISINFECTING, DEODORIZING, PRESERVING, OR STERILIZING |
210 | overlapping | 484 | 116 | LIQUID PURIFICATION OR SEPARATION |
436 | overlapping | 97 | 172 | CHEMISTRY: ANALYTICAL AND IMMUNOLOGICAL TESTING |
427 | overlapping | 678 | 91 | COATING PROCESSES |
073 | overlapping | 1556 | 81 | MEASURING AND TESTING |
264 | overlapping | 285 | 79 | PLASTIC AND NONMETALLIC ARTICLE SHAPING OR TREATING: PROCESSES |
340 | overlapping | 118 | 74 | COMMUNICATIONS: ELECTRICAL |
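The two ordering expressions used to pick these rows can be unpacked on a toy table. The signed difference sorts classes from most mechanical-heavy to most pharmaceutical-heavy. For the overlap criterion, note that for counts a and b, abs(b - a) - (a + b) equals -2 * min(a, b), so sorting it in ascending order ranks classes by their smaller count, largest first. The class labels and counts below are made up for illustration.

```r
# hypothetical counts: rows are classes, columns are the two search sets
toy_classes <- matrix(
  c(
    50, 0,
    0, 40,
    20, 18,
    3, 2,
    10, 30
  ),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D", "E"), c("mechanical", "pharmaceutical"))
)

# signed difference: most mechanical-heavy first, most pharmaceutical-heavy last
signed_diff <- toy_classes[, 2] - toy_classes[, 1]
by_difference <- names(sort(signed_diff))

# |difference| - total is -2 * the smaller of the two counts,
# so ascending order puts the class with the largest smaller count first
overlap_score <- abs(signed_diff) - rowSums(toy_classes)
by_overlap <- names(sort(overlap_score))

by_difference
by_overlap[1:3]
```

In the table above, this is why class 428 leads the overlapping rows: its smaller count (236) is the largest among all classes, even though its two counts are far apart.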
Some other application-level features might also affect the comparisons we want to make between sets, so we can check how balanced the sets are on those features:
applications$n_inventors <- tapply(inventors$guid, inventors$guid, length)[applications$guid]
summaries <- vapply(split(applications, applications$search_set), function(d) {
c(
"Proportion US" = mean(d$inventorCountry == "US"),
"Proportion California" = mean(d[d$inventorCountry == "US", "inventorState"] == "CA"),
"Mean Year" = mean(as.numeric(substring(d$date, 1, 4))),
"Proportion Utility" = mean(d$category == "Utility"),
"Proportion Small Business" = (1 - mean(d$business == "UNDISCOUNTED")),
"Proportion Inventor First" = mean(d$first_inventor == "true"),
"Mean Inventors Per Team" = mean(d$n_inventors),
"Mean Time To classification" = mean(d$time_initial_classificaiton, na.rm = TRUE),
"Mean Time To First Action" = mean(d$time_first_action, na.rm = TRUE),
"Mean Examination Rounds" = mean(d[d$examination_rounds != 0, "examination_rounds"]),
"Proportion Accepted" = mean(d$any_accepts),
"Proportion Patented" = mean(d$status == "Patented Case", na.rm = TRUE)
)
}, numeric(12))
summaries <- cbind(summaries, summaries[, 1] - summaries[, 2])
kable(summaries, digits = 3, col.names = c("Mechanical", "Pharmaceutical", "Mechanical - Pharmaceutical"))
Measure | Mechanical | Pharmaceutical | Mechanical - Pharmaceutical |
---|---|---|---|
Proportion US | 0.434 | 0.483 | -0.049 |
Proportion California | 0.066 | 0.261 | -0.195 |
Mean Year | 2013.115 | 2012.321 | 0.794 |
Proportion Utility | 1.000 | 1.000 | 0.000 |
Proportion Small Business | 0.111 | 0.363 | -0.251 |
Proportion Inventor First | 0.428 | 0.283 | 0.145 |
Mean Inventors Per Team | 2.710 | 4.130 | -1.420 |
Mean Time To classification | 40.703 | 19.406 | 21.297 |
Mean Time To First Action | 44.724 | 48.523 | -3.799 |
Mean Examination Rounds | 1.772 | 2.195 | -0.423 |
Proportion Accepted | 0.598 | 0.495 | 0.102 |
Proportion Patented | 0.581 | 0.370 | 0.211 |
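The summary code above combines three steps: counting inventor rows per application, reducing to one row per application, and summarizing within each search set. A small sketch with made-up rows shows how they fit together. One thing it makes visible is that, because the one-row-per-application subset keeps the first inventor row, inventor-level columns such as country describe only that first-listed row.

```r
# hypothetical inventor-level rows: three applications across two sets
toy <- data.frame(
  guid = c("a", "a", "a", "b", "c", "c"),
  search_set = c(
    "mechanical", "mechanical", "mechanical",
    "pharmaceutical", "pharmaceutical", "pharmaceutical"
  ),
  inventorCountry = c("US", "DE", "US", "US", "JP", "JP")
)

# one row per application (the first inventor row of each)
toy_apps <- toy[!duplicated(toy$guid), ]

# team size: count inventor rows per application, then look up by guid
team_size <- tapply(toy$guid, toy$guid, length)
toy_apps$n_inventors <- as.numeric(team_size[toy_apps$guid])

# per-set summaries; vapply checks each result is a numeric vector of length 2
toy_summaries <- vapply(split(toy_apps, toy_apps$search_set), function(d) {
  c(
    "Proportion US" = mean(d$inventorCountry == "US"),
    "Mean Inventors Per Team" = mean(d$n_inventors)
  )
}, numeric(2))
toy_summaries <- cbind(toy_summaries, Difference = toy_summaries[, 1] - toy_summaries[, 2])
toy_summaries
```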
The only inventor information included in USPTO data is name, country, state (in some countries), and city. We are particularly interested in differences in the sex distribution of inventors between technology areas, so we used that inventor information to assign sex. For comparison, we used 3 basic methods with several different sources:
- sex_in_country_wgnd and prob_fem_wgnd (with some additional sources; Raffo, 2021)
- prob_fem_skydeck (Rao, 2020)
- prob_fem_usssa (Social Security Administration, 2021)
- sex_usuk (Giordano et al., 2021)
- prob_fem_ssa, prob_fem_ipums, and prob_fem_napp (Mullen, 2021)
- prob_fem_fb (from leaked account details; Remy, 2021)
- prob_fem_fb_scraped (scraped from profiles; Tang et al., 2011)
- prob_fem_search (guess_sex.R)

First, we can just look at the correlation between sources:
prob_cols <- grep("prob_fem", colnames(inventors), fixed = TRUE, value = TRUE)
cors <- cor(inventors[, prob_cols], use = "pairwise.complete.obs")
rownames(cors) <- paste0("(", seq_along(prob_cols), ") ", prob_cols)
colnames(cors) <- paste0("(", seq_along(prob_cols), ")")
kable(cors, digits = 3)
Source | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) |
---|---|---|---|---|---|---|---|---|---|
(1) prob_fem_wgnd | 1.000 | 0.899 | 0.809 | 0.858 | 0.843 | 0.831 | 0.700 | 0.630 | 0.391 |
(2) prob_fem_fb | 0.899 | 1.000 | 0.829 | 0.856 | 0.845 | 0.834 | 0.730 | 0.667 | 0.404 |
(3) prob_fem_fb_scraped | 0.809 | 0.829 | 1.000 | 0.878 | 0.884 | 0.887 | 0.749 | 0.695 | 0.381 |
(4) prob_fem_skydeck | 0.858 | 0.856 | 0.878 | 1.000 | 0.976 | 0.960 | 0.783 | 0.705 | 0.388 |
(5) prob_fem_usssa | 0.843 | 0.845 | 0.884 | 0.976 | 1.000 | 0.982 | 0.794 | 0.717 | 0.389 |
(6) prob_fem_ssa | 0.831 | 0.834 | 0.887 | 0.960 | 0.982 | 1.000 | 0.790 | 0.726 | 0.391 |
(7) prob_fem_ipums | 0.700 | 0.730 | 0.749 | 0.783 | 0.794 | 0.790 | 1.000 | 0.784 | 0.360 |
(8) prob_fem_napp | 0.630 | 0.667 | 0.695 | 0.705 | 0.717 | 0.726 | 0.784 | 1.000 | 0.332 |
(9) prob_fem_search | 0.391 | 0.404 | 0.381 | 0.388 | 0.389 | 0.391 | 0.360 | 0.332 | 1.000 |
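Because the sources cover different names, use = "pairwise.complete.obs" computes each correlation from only the rows where both sources have a value, so different cells can rest on different numbers of inventors. A minimal sketch with made-up probabilities shows one way to count the rows behind each pair; the column names are hypothetical.

```r
# hypothetical probabilities with gaps in different rows
toy_probs <- data.frame(
  prob_fem_a = c(0.9, 0.1, 0.8, NA, 0.2),
  prob_fem_b = c(1.0, 0.0, NA, 0.7, 0.1),
  other = 1:5
)
toy_cols <- grep("prob_fem", colnames(toy_probs), fixed = TRUE, value = TRUE)

# each pair uses only the rows where both columns are observed
toy_cors <- cor(toy_probs[, toy_cols], use = "pairwise.complete.obs")

# number of rows each pairwise correlation is based on
observed <- (!is.na(as.matrix(toy_probs[, toy_cols]))) * 1
toy_n <- crossprod(observed)
toy_n
```

Here each column has four observed values, but the correlation between them is based on only the three rows they share.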
The World Gender Name Dictionary has the broadest coverage and widest range of sources, so we could also treat that as the best guess, and look at the accuracy of the others:
# convert probabilities to predictions
sex_predictions <- inventors[, prob_cols]
sex_predictions[is.na(sex_predictions)] <- .5
sex_predictions[sex_predictions == .5] <- "U"
sex_predictions[inventors[, prob_cols] > .5] <- "F"
sex_predictions[inventors[, prob_cols] < .5] <- "M"
inventors[, sub("prob_fem", "sex", prob_cols, fixed = TRUE)] <- sex_predictions
sex_cols <- grep("sex_", colnames(inventors), fixed = TRUE, value = TRUE)
# then get accuracy to WGND in country
prediction_summaries <- data.frame(
Accuracy = colMeans(vapply(
inventors[, sex_cols], "==", logical(nrow(inventors)), inventors$sex_in_country_wgnd
)),
Percent_Determinate = colMeans(vapply(inventors[, sex_cols], "!=", logical(nrow(inventors)), "U"))
)
kable(prediction_summaries[order(-prediction_summaries$Accuracy), ], digits = 3)
Source | Accuracy | Percent_Determinate |
---|---|---|
sex_in_country_wgnd | 1.000 | 0.829 |
sex_wgnd | 0.912 | 0.908 |
sex_fb | 0.871 | 0.915 |
sex_usuk | 0.864 | 0.765 |
sex_skydeck | 0.860 | 0.804 |
sex_usssa | 0.846 | 0.781 |
sex_ssa | 0.835 | 0.766 |
sex_fb_scraped | 0.819 | 0.751 |
sex_ipums | 0.788 | 0.736 |
sex_napp | 0.695 | 0.579 |
sex_search | 0.276 | 0.163 |
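As computed above, Accuracy is simple agreement over all rows. Assuming the reference column marks unknowns as "U" (as its Percent_Determinate value suggests), two unknowns count as a match, and a determinate guess against an unknown counts as a miss. A complementary view, sketched here on made-up values, restricts the comparison to rows where both sources made a call; the column names are hypothetical.

```r
# hypothetical reference labels and one probability source
toy <- data.frame(
  reference = c("F", "M", "U", "M", "F", "U", "F", "M"),
  prob_fem_x = c(0.9, 0.2, NA, 0.3, 0.4, 0.5, NA, 0.8)
)

# same thresholding as above: missing or exactly .5 is unknown
toy$sex_x <- ifelse(
  is.na(toy$prob_fem_x) | toy$prob_fem_x == 0.5, "U",
  ifelse(toy$prob_fem_x > 0.5, "F", "M")
)

# overall agreement, where two unknowns count as a match
accuracy_all <- mean(toy$sex_x == toy$reference)

# agreement only where both sources are determinate
both_known <- toy$sex_x != "U" & toy$reference != "U"
accuracy_known <- mean(toy$sex_x[both_known] == toy$reference[both_known])

c(all = accuracy_all, known = accuracy_known)
```

Separating the two makes it easier to tell a source that disagrees with the reference from one that simply covers fewer names.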
With the potential limitations of our sample and inventor sexing methods in mind, we can look at differences between sex-based inventor groups:
inventors$n_inventors <- tapply(inventors$guid, inventors$guid, length)[inventors$guid]
unique_inventors <- inventors[!duplicated(paste(
inventors$firstName, inventors$lastName, inventors$inventorCountry
)), ]
summaries_sex <- vapply(split(unique_inventors, unique_inventors$sex_in_country_wgnd), function(d) {
c(
"Pharmaceutical Set" = mean(d$search_set == "pharmaceutical"),
"Proportion US" = mean(d$inventorCountry == "US", na.rm = TRUE),
"Proportion California" = mean(d[
!is.na(d$inventorCountry) & d$inventorCountry == "US", "inventorState"
] == "CA"),
"Mean Year" = mean(as.numeric(substring(d$date, 1, 4))),
"Proportion Utility" = mean(d$category == "Utility"),
"Proportion Small Business" = (1 - mean(d$business == "UNDISCOUNTED")),
"Proportion Inventor First" = mean(d$first_inventor == "true"),
"Mean Inventors Per Team" = mean(d$n_inventors),
"Mean Time To classification" = mean(d$time_initial_classificaiton, na.rm = TRUE),
"Mean Time To First Action" = mean(d$time_first_action, na.rm = TRUE),
"Mean Examination Rounds" = mean(d[d$examination_rounds != 0, "examination_rounds"]),
"Proportion Accepted" = mean(d$any_accepts),
"Proportion Patented" = mean(d$status == "Patented Case", na.rm = TRUE)
)
}, numeric(13))
kable(data.frame(
Female = as.numeric(summaries_sex[, 1, drop = FALSE]),
Male = summaries_sex[, 2],
"Female - Male" = summaries_sex[, 1] - summaries_sex[, 2],
Unknown = summaries_sex[, 3],
check.names = FALSE
), digits = 3)
Measure | Female | Male | Female - Male | Unknown |
---|---|---|---|---|
Pharmaceutical Set | 0.849 | 0.510 | 0.339 | 0.728 |
Proportion US | 0.358 | 0.379 | -0.021 | 0.241 |
Proportion California | 0.219 | 0.154 | 0.065 | 0.216 |
Mean Year | 2012.358 | 2011.798 | 0.561 | 2013.694 |
Proportion Utility | 1.000 | 1.000 | 0.000 | 1.000 |
Proportion Small Business | 0.298 | 0.211 | 0.087 | 0.398 |
Proportion Inventor First | 0.348 | 0.336 | 0.012 | 0.449 |
Mean Inventors Per Team | 5.714 | 4.649 | 1.065 | 5.798 |
Mean Time To classification | 22.947 | 30.293 | -7.346 | 26.233 |
Mean Time To First Action | 49.346 | 47.750 | 1.596 | 46.309 |
Mean Examination Rounds | 2.267 | 2.088 | 0.179 | 2.075 |
Proportion Accepted | 0.521 | 0.564 | -0.043 | 0.531 |
Proportion Patented | 0.399 | 0.463 | -0.064 | 0.462 |
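When reading these values, it may help to keep in mind how unique_inventors was built: !duplicated keeps the first row for each name and country combination, so an inventor with several applications contributes only the application-level values of their first listed row. A small sketch with made-up names illustrates this.

```r
# hypothetical inventor rows; "Ana Li" in the US appears on two applications
toy <- data.frame(
  firstName = c("Ana", "Ana", "Bo", "Ana"),
  lastName = c("Li", "Li", "Ng", "Li"),
  inventorCountry = c("US", "US", "US", "CN"),
  search_set = c("pharmaceutical", "mechanical", "mechanical", "pharmaceutical")
)

# same key as above: first name, last name, and country
key <- paste(toy$firstName, toy$lastName, toy$inventorCountry)
toy_unique <- toy[!duplicated(key), ]

# how many rows each inventor key had before deduplication
rows_per_inventor <- table(key)

toy_unique
```

The US-based "Ana Li" is kept once, with the search set of her first row, while the same name in another country is treated as a different inventor.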