Built with R 4.2.1


This example explores the inventor-focused search set created by example_inventor_sexing.R.

The files it produces are in the eda/searches directory:

outDir <- "eda/searches/"

Load Data

First, we can load in the original search results:

search_pharmaceutical <- read.csv(xzfile(paste0(outDir, "pharmaceutical.csv.xz")))
search_mechanical <- read.csv(xzfile(paste0(outDir, "mechanical.csv.xz")))

The pharmaceutical search returned 94,998 results, and the mechanical search returned 89,487 results.
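
As a rough illustration of the xz-compressed CSV pattern used here, the sketch below writes a small made-up table to a temporary .csv.xz file and reads it back. The file name, the toy values, and the year column are assumptions for the example only; they are not part of the original search output.

```r
# write to a temporary location so nothing in eda/searches is touched
path <- file.path(tempdir(), "toy_search.csv.xz")

# toy stand-in for a search result table (invented values)
toy <- data.frame(
  applicationNumber = c("US1", "US2", "US3"),
  year = c(2010L, 2012L, 2015L)
)

# write.csv() and read.csv() both accept a connection,
# so xzfile() handles compression in each direction
write.csv(toy, xzfile(path), row.names = FALSE)
restored <- read.csv(xzfile(path))

# the number of rows is the number of search results
n_results <- nrow(restored)
```

The same nrow() call on search_pharmaceutical and search_mechanical is one way to get the result counts reported above.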

The inventor data are stored in one file per search set, which we’ll combine:

inventors <- rbind(
  # based on the search pharmaceutical.ab.
  read.csv(xzfile(paste0(outDir, "inventors_pharmaceutical.csv.xz"))),
  # based on the search F01?.cpcl.
  read.csv(xzfile(paste0(outDir, "inventors_mechanical.csv.xz")))
)

# ensure our working set is from the original searches
inventors <- inventors[inventors$applicationNumber %in% c(
  search_mechanical$applicationNumber, search_pharmaceutical$applicationNumber
), ]

This set includes information from 181,694 patent applications, with 245,824 inventors (counted by unique full name). Of those applications, 93,234 are from the pharmaceutical search, and 88,460 are from the mechanical search.
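
Counts like these could be computed along the following lines. This is a minimal sketch on a made-up table: the column names guid, firstName, lastName, and search_set appear in the code in this example, but the toy values are invented, and treating guid as the application identifier is an assumption based on how it is used to deduplicate applications.

```r
# toy stand-in for the combined inventors table (invented values)
toy <- data.frame(
  guid = c("a1", "a1", "a2", "a3", "a3"),
  firstName = c("Ann", "Bo", "Ann", "Cy", "Bo"),
  lastName = c("Lee", "Kim", "Lee", "Roy", "Kim"),
  search_set = c(
    "pharmaceutical", "pharmaceutical", "pharmaceutical",
    "mechanical", "mechanical"
  )
)

# number of distinct applications
n_applications <- length(unique(toy$guid))

# number of distinct inventors, going by full name only
n_inventors <- length(unique(paste(toy$firstName, toy$lastName)))

# applications per search set, keeping one row per application first
per_set <- table(toy$search_set[!duplicated(toy$guid)])
```

Counting by full name will merge different people who share a name and split one person whose name is recorded in different ways, so the inventor count is only approximate.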

Set Assessment

This set of applications is meant to represent technology fields that differ in their proportion of female inventors, where mechanical (such as machine tools and mechanical elements) has a particularly low proportion, and pharmaceutical (along with biotechnology and areas of chemistry) has a higher proportion (Intellectual Property Office, 2019; Miguelez et al., 2019). The field-specific subsets are only defined by their simple search terms, however, so we’ll want to see how representative of those fields these applications might actually be.

One way to get a feel for how representative our sets are might be to look at the distribution of classifications within each:

# make a subset that contains only one row per application
applications <- inventors[!duplicated(inventors$guid), ]

# look at the overall UPC class of each application between sets
overall_classes <- substring(applications$classification, 4, 6)
classes <- table(overall_classes, applications$search_set)

class_highlights <- rbind(
  # most differing classes
  classes[order(classes[, 2] - classes[, 1])[c(1:10, (1:10) + nrow(classes) - 10)], ],
  # most overlapping classes
  classes[order(abs(classes[, 2] - classes[, 1]) - rowSums(classes))[1:10], ]
)
class_highlights <- data.frame(
  class = rownames(class_highlights),
  criteria = rep(c("differing", "overlapping"), c(20, 10)),
  class_highlights
)

## add descriptions
library(uspto)
library(knitr) # for kable()
class_info <- get_class_info(class_highlights$class, paste0(dirname(outDir), "/original/class_info"))
class_highlights$description <- vapply(class_info, "[[", "", "description")
kable(
  class_highlights,
  row.names = FALSE,
  col.names = c("Class", "Criteria", "Mechanical", "Pharmaceutical", "Description")
)
| Class | Criteria | Mechanical | Pharmaceutical | Description |
|:------|:---------|-----------:|---------------:|:------------|
| 060 | differing | 20227 | 0 | POWER PLANTS |
| 123 | differing | 14319 | 1 | INTERNAL-COMBUSTION ENGINES |
| 415 | differing | 11596 | 1 | ROTARY KINETIC FLUID MOTORS OR PUMPS |
| 416 | differing | 6355 | 0 | FLUID REACTION SURFACES (I.E., IMPELLERS) |
| 701 | differing | 2029 | 1 | DATA PROCESSING: VEHICLES, NAVIGATION, AND RELATIVE LOCATION |
| 029 | differing | 1969 | 9 | METAL WORKING |
| 418 | differing | 1737 | 0 | ROTARY EXPANSIBLE CHAMBER DEVICES |
| 422 | differing | 1690 | 142 | CHEMICAL APPARATUS AND PROCESS DISINFECTING, DEODORIZING, PRESERVING, OR STERILIZING |
| 073 | differing | 1556 | 81 | MEASURING AND TESTING |
| 428 | differing | 1629 | 236 | STOCK MATERIAL OR MISCELLANEOUS ARTICLES |
| 604 | differing | 7 | 652 | SURGERY |
| 540 | differing | 0 | 700 | ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES |
| 536 | differing | 0 | 941 | ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES |
| 530 | differing | 2 | 1680 | CHEMISTRY: NATURAL RESINS OR DERIVATIVES; PEPTIDES OR PROTEINS; LIGNINS OR REACTION PRODUCTS THEREOF |
| 548 | differing | 0 | 2344 | ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES |
| 546 | differing | 0 | 2812 | ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES |
| 544 | differing | 2 | 2905 | ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES |
| 435 | differing | 22 | 3483 | CHEMISTRY: MOLECULAR BIOLOGY AND MICROBIOLOGY |
| 424 | differing | 6 | 24820 | DRUG, BIO-AFFECTING AND BODY TREATING COMPOSITIONS |
| 514 | differing | 17 | 45051 | DRUG, BIO-AFFECTING AND BODY TREATING COMPOSITIONS |
| 428 | overlapping | 1629 | 236 | STOCK MATERIAL OR MISCELLANEOUS ARTICLES |
| 700 | overlapping | 340 | 170 | DATA PROCESSING: GENERIC CONTROL SYSTEMS OR SPECIFIC APPLICATIONS |
| 702 | overlapping | 424 | 167 | DATA PROCESSING: MEASURING, CALIBRATING, OR TESTING |
| 422 | overlapping | 1690 | 142 | CHEMICAL APPARATUS AND PROCESS DISINFECTING, DEODORIZING, PRESERVING, OR STERILIZING |
| 210 | overlapping | 484 | 116 | LIQUID PURIFICATION OR SEPARATION |
| 436 | overlapping | 97 | 172 | CHEMISTRY: ANALYTICAL AND IMMUNOLOGICAL TESTING |
| 427 | overlapping | 678 | 91 | COATING PROCESSES |
| 073 | overlapping | 1556 | 81 | MEASURING AND TESTING |
| 264 | overlapping | 285 | 79 | PLASTIC AND NONMETALLIC ARTICLE SHAPING OR TREATING: PROCESSES |
| 340 | overlapping | 118 | 74 | COMMUNICATIONS: ELECTRICAL |
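
To make the two ordering criteria more concrete, here is a small sketch on invented counts. The class labels A to D and all values are hypothetical. It shows that the "overlapping" score, abs(difference) - rowSums(classes), is the same as -2 times the smaller of the two counts, so sorting it in ascending order puts classes that are common in both sets first.

```r
# toy class-by-set counts (invented values)
classes <- matrix(
  c(100, 0, 50, 40, 5, 60, 0, 200),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"), c("mechanical", "pharmaceutical"))
)

# signed difference: negative is mostly mechanical, positive mostly pharmaceutical
difference <- classes[, 2] - classes[, 1]

# most differing: the k lowest and k highest differences
k <- 1
ends <- c(seq_len(k), seq_len(k) + nrow(classes) - k)
most_differing <- rownames(classes)[order(difference)[ends]]

# overlap score as used in the ordering above
overlap_score <- abs(difference) - rowSums(classes)

# equivalent form: -2 * the smaller count in each row
same_as_min <- all.equal(
  unname(overlap_score),
  unname(-2 * pmin(classes[, 1], classes[, 2]))
)

# most overlapping: the lowest scores
most_overlapping <- names(sort(overlap_score))[1:2]
```

Because the score depends only on the smaller count, a class such as 428 can rank as both strongly differing and strongly overlapping, which is why it appears twice in the table.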

Other application-level features might also affect the comparisons we want to make between sets, so we can check how balanced the sets are on those features:

applications$n_inventors <- tapply(inventors$guid, inventors$guid, length)[applications$guid]
summaries <- vapply(split(applications, applications$search_set), function(d) {
  c(
    "Proportion US" = mean(d$inventorCountry == "US"),
    "Proportion California" = mean(d[d$inventorCountry == "US", "inventorState"] == "CA"),
    "Mean Year" = mean(as.numeric(substring(d$date, 1, 4))),
    "Proportion Utility" = mean(d$category == "Utility"),
    "Proportion Small Business" = (1 - mean(d$business == "UNDISCOUNTED")),
    "Proportion Inventor First" = mean(d$first_inventor == "true"),
    "Mean Inventors Per Team" = mean(d$n_inventors),
    "Mean Time To classification" = mean(d$time_initial_classificaiton, na.rm = TRUE),
    "Mean Time To First Action" = mean(d$time_first_action, na.rm = TRUE),
    "Mean Examination Rounds" = mean(d[d$examination_rounds != 0, "examination_rounds"]),
    "Proportion Accepted" = mean(d$any_accepts),
    "Proportion Patented" = mean(d$status == "Patented Case", na.rm = TRUE)
  )
}, numeric(12))
summaries <- cbind(summaries, summaries[, 1] - summaries[, 2])
kable(summaries, digits = 3, col.names = c("Mechanical", "Pharmaceutical", "Mechanical - Pharmaceutical"))
|                             | Mechanical | Pharmaceutical | Mechanical - Pharmaceutical |
|:----------------------------|-----------:|---------------:|----------------------------:|
| Proportion US               |      0.434 |          0.483 |                      -0.049 |
| Proportion California       |      0.066 |          0.261 |                      -0.195 |
| Mean Year                   |   2013.115 |       2012.321 |                       0.794 |
| Proportion Utility          |      1.000 |          1.000 |                       0.000 |
| Proportion Small Business   |      0.111 |          0.363 |                      -0.251 |
| Proportion Inventor First   |      0.428 |          0.283 |                       0.145 |
| Mean Inventors Per Team     |      2.710 |          4.130 |                      -1.420 |
| Mean Time To classification |     40.703 |         19.406 |                      21.297 |
| Mean Time To First Action   |     44.724 |         48.523 |                      -3.799 |
| Mean Examination Rounds     |      1.772 |          2.195 |                      -0.423 |
| Proportion Accepted         |      0.598 |          0.495 |                       0.102 |
| Proportion Patented         |      0.581 |          0.370 |                       0.211 |
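
The "Mean Inventors Per Team" row depends on the n_inventors column, which is built by indexing a named tapply() result with each application's guid. A minimal sketch of that pattern on invented data follows; it assumes guid is stored as a character vector, since indexing with a factor would use its integer codes rather than its labels.

```r
# toy inventor-level table: one row per inventor per application (invented values)
toy <- data.frame(
  guid = c("a1", "a1", "a2", "a3", "a3", "a3"),
  inventor = c("Ann", "Bo", "Ann", "Cy", "Bo", "Di"),
  stringsAsFactors = FALSE
)

# rows per application, returned as an array named by guid
team_sizes <- tapply(toy$guid, toy$guid, length)

# application-level table: first row of each application
applications <- toy[!duplicated(toy$guid), ]

# look up each application's team size by name
applications$n_inventors <- as.integer(team_sizes[applications$guid])
```

Looking values up by name rather than by position means the result does not depend on the order in which tapply() returns the groups.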

Inventor Sexing

The only inventor information included in USPTO data is name, country, state (in some countries), and city. We are particularly interested in differences in the sex distribution of inventors between technology areas, so we used that inventor information to assign sex. For comparison, we used 3 basic methods with several different sources.

First, we can just look at the correlation between sources:

prob_cols <- grep("prob_fem", colnames(inventors), fixed = TRUE, value = TRUE)
cors <- cor(inventors[, prob_cols], use = "pairwise.complete.obs")
rownames(cors) <- paste0("(", seq_along(prob_cols), ") ", prob_cols)
colnames(cors) <- paste0("(", seq_along(prob_cols), ")")
kable(cors, digits = 3)
|                         |   (1) |   (2) |   (3) |   (4) |   (5) |   (6) |   (7) |   (8) |   (9) |
|:------------------------|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| (1) prob_fem_wgnd       | 1.000 | 0.899 | 0.809 | 0.858 | 0.843 | 0.831 | 0.700 | 0.630 | 0.391 |
| (2) prob_fem_fb         | 0.899 | 1.000 | 0.829 | 0.856 | 0.845 | 0.834 | 0.730 | 0.667 | 0.404 |
| (3) prob_fem_fb_scraped | 0.809 | 0.829 | 1.000 | 0.878 | 0.884 | 0.887 | 0.749 | 0.695 | 0.381 |
| (4) prob_fem_skydeck    | 0.858 | 0.856 | 0.878 | 1.000 | 0.976 | 0.960 | 0.783 | 0.705 | 0.388 |
| (5) prob_fem_usssa      | 0.843 | 0.845 | 0.884 | 0.976 | 1.000 | 0.982 | 0.794 | 0.717 | 0.389 |
| (6) prob_fem_ssa        | 0.831 | 0.834 | 0.887 | 0.960 | 0.982 | 1.000 | 0.790 | 0.726 | 0.391 |
| (7) prob_fem_ipums      | 0.700 | 0.730 | 0.749 | 0.783 | 0.794 | 0.790 | 1.000 | 0.784 | 0.360 |
| (8) prob_fem_napp       | 0.630 | 0.667 | 0.695 | 0.705 | 0.717 | 0.726 | 0.784 | 1.000 | 0.332 |
| (9) prob_fem_search     | 0.391 | 0.404 | 0.381 | 0.388 | 0.389 | 0.391 | 0.360 | 0.332 | 1.000 |
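
Because each source covers a different subset of names, these correlations use use = "pairwise.complete.obs", so each cell is based on only the rows where both sources have a value. The sketch below, on invented probabilities with hypothetical column names, contrasts that with "complete.obs" and counts how many rows each pair shares.

```r
# toy probabilities from three hypothetical sources, with gaps in coverage
toy <- data.frame(
  source_a = c(0.9, 0.1, 0.8, NA, 0.2),
  source_b = c(0.8, 0.2, NA, 0.7, 0.1),
  source_c = c(NA, 0.3, 0.9, 0.6, 0.4)
)

# each pair uses every row where both of its columns are observed
pairwise <- cor(toy, use = "pairwise.complete.obs")

# every pair uses only rows with no missing value in any column
complete <- cor(toy, use = "complete.obs")

# number of rows each pair of sources has in common
observed <- (!is.na(toy)) + 0
n_pairs <- crossprod(observed)
```

In this toy case only two rows are complete across all three sources, so the "complete.obs" correlations are forced to 1 or -1, while the pairwise version uses three rows per pair. The trade-off is that different cells of a pairwise matrix can describe different subsets of inventors, which is worth keeping in mind when comparing sources with very different coverage.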

The World Gender Name Dictionary (WGND; Raffo, 2021) has the broadest coverage and the widest range of sources, so we could also treat it as the best guess and look at how well the others agree with it:

# convert probabilities to predictions
sex_predictions <- inventors[, prob_cols]
sex_predictions[is.na(sex_predictions)] <- .5
sex_predictions[sex_predictions == .5] <- "U"
sex_predictions[inventors[, prob_cols] > .5] <- "F"
sex_predictions[inventors[, prob_cols] < .5] <- "M"
inventors[, sub("prob_fem", "sex", prob_cols, fixed = TRUE)] <- sex_predictions
sex_cols <- grep("sex_", colnames(inventors), fixed = TRUE, value = TRUE)

# then compute agreement with the in-country WGND predictions
prediction_summaries <- data.frame(
  Accuracy = colMeans(vapply(
    inventors[, sex_cols], "==", logical(nrow(inventors)), inventors$sex_in_country_wgnd
  )),
  Percent_Determinate = colMeans(vapply(inventors[, sex_cols], "!=", logical(nrow(inventors)), "U"))
)
kable(prediction_summaries[order(-prediction_summaries$Accuracy), ], digits = 3)
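|                     | Accuracy | Percent_Determinate |
|:--------------------|---------:|--------------------:|
| sex_in_country_wgnd |    1.000 |               0.829 |
| sex_wgnd            |    0.912 |               0.908 |
| sex_fb              |    0.871 |               0.915 |
| sex_usuk            |    0.864 |               0.765 |
| sex_skydeck         |    0.860 |               0.804 |
| sex_usssa           |    0.846 |               0.781 |
| sex_ssa             |    0.835 |               0.766 |
| sex_fb_scraped      |    0.819 |               0.751 |
| sex_ipums           |    0.788 |               0.736 |
| sex_napp            |    0.695 |               0.579 |
| sex_search          |    0.276 |               0.163 |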
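
The conversion and scoring steps can be illustrated on a single vector. This is a sketch only: prob_to_sex() is a hypothetical helper, not a function from the original script, and the probabilities and reference labels are invented. It applies the same rule as above (over .5 is "F", under .5 is "M", exactly .5 or missing is "U") and then computes the two summary columns.

```r
# hypothetical helper: turn P(female) into a sex label
prob_to_sex <- function(prob_fem) {
  label <- rep("U", length(prob_fem)) # default: undetermined
  label[!is.na(prob_fem) & prob_fem > .5] <- "F"
  label[!is.na(prob_fem) & prob_fem < .5] <- "M"
  label
}

# invented probabilities from one source, and invented reference labels
probs <- c(0.93, 0.08, 0.5, NA, 0.51)
reference <- c("F", "M", "F", "U", "M")

predicted <- prob_to_sex(probs)

# share of labels matching the reference ("U" matching "U" counts as agreement)
accuracy <- mean(predicted == reference)

# share of names given a determinate label
determinate <- mean(predicted != "U")
```

Two things follow from this setup. Accuracy here is agreement with another name-based prediction rather than with known sex, and because undetermined labels take part in the comparison, a source can lose accuracy simply by covering fewer (or different) names than the reference.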

Sex Differences

With the potential limitations of our sample and inventor sexing methods in mind, we can look at differences between sex-based inventor groups:

inventors$n_inventors <- tapply(inventors$guid, inventors$guid, length)[inventors$guid]
unique_inventors <- inventors[!duplicated(paste(
  inventors$firstName, inventors$lastName, inventors$inventorCountry
)), ]
summaries_sex <- vapply(split(unique_inventors, unique_inventors$sex_in_country_wgnd), function(d) {
  c(
    "Pharmaceutical Set" = mean(d$search_set == "pharmaceutical"),
    "Proportion US" = mean(d$inventorCountry == "US", na.rm = TRUE),
    "Proportion California" = mean(d[
      !is.na(d$inventorCountry) & d$inventorCountry == "US", "inventorState"
    ] == "CA"),
    "Mean Year" = mean(as.numeric(substring(d$date, 1, 4))),
    "Proportion Utility" = mean(d$category == "Utility"),
    "Proportion Small Business" = (1 - mean(d$business == "UNDISCOUNTED")),
    "Proportion Inventor First" = mean(d$first_inventor == "true"),
    "Mean Inventors Per Team" = mean(d$n_inventors),
    "Mean Time To classification" = mean(d$time_initial_classificaiton, na.rm = TRUE),
    "Mean Time To First Action" = mean(d$time_first_action, na.rm = TRUE),
    "Mean Examination Rounds" = mean(d[d$examination_rounds != 0, "examination_rounds"]),
    "Proportion Accepted" = mean(d$any_accepts),
    "Proportion Patented" = mean(d$status == "Patented Case", na.rm = TRUE)
  )
}, numeric(13))
kable(data.frame(
  Female = summaries_sex[, 1],
  Male = summaries_sex[, 2],
  "Female - Male" = summaries_sex[, 1] - summaries_sex[, 2],
  Unknown = summaries_sex[, 3],
  check.names = FALSE
), digits = 3)
|                             |   Female |     Male | Female - Male |  Unknown |
|:----------------------------|---------:|---------:|--------------:|---------:|
| Pharmaceutical Set          |    0.849 |    0.510 |         0.339 |    0.728 |
| Proportion US               |    0.358 |    0.379 |        -0.021 |    0.241 |
| Proportion California       |    0.219 |    0.154 |         0.065 |    0.216 |
| Mean Year                   | 2012.358 | 2011.798 |         0.561 | 2013.694 |
| Proportion Utility          |    1.000 |    1.000 |         0.000 |    1.000 |
| Proportion Small Business   |    0.298 |    0.211 |         0.087 |    0.398 |
| Proportion Inventor First   |    0.348 |    0.336 |         0.012 |    0.449 |
| Mean Inventors Per Team     |    5.714 |    4.649 |         1.065 |    5.798 |
| Mean Time To classification |   22.947 |   30.293 |        -7.346 |   26.233 |
| Mean Time To First Action   |   49.346 |   47.750 |         1.596 |   46.309 |
| Mean Examination Rounds     |    2.267 |    2.088 |         0.179 |    2.075 |
| Proportion Accepted         |    0.521 |    0.564 |        -0.043 |    0.531 |
| Proportion Patented         |    0.399 |    0.463 |        -0.064 |    0.462 |

References

Giordano, R., Day, A., & Boyle, J. (2021). GenderInfer: This is a collection of functions to analyse gender differences. https://CRAN.R-project.org/package=GenderInfer
Intellectual Property Office. (2019). Gender profiles in worldwide patenting: An analysis of female inventorship. Intellectual Property Office. https://www.gov.uk/government/publications/gender-profiles-in-worldwide-patenting-an-analysis-of-female-inventorship-2019-edition
Miguelez, E., Toole, A., Myers, A., Breschi, S., Ferruci, E., Lissoni, F., Sterzi, V., Tarasconi, G., et al. (2019). Progress and potential: A profile of women inventors on US patents. U.S. Patent and Trademark Office, Office of the Chief Economist. https://www.uspto.gov/sites/default/files/documents/Progress-and-Potential-2019.pdf
Mullen, L. (2021). Gender: Predict gender from names using historical data. https://github.com/lmullen/gender
Raffo, J. (2021). World gender name dictionary (WGND 2.0) (DRAFT VERSION) [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/MSEGSJ
Rao, A. (2020). Gender by name data set [Data set]. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Gender+by+Name
Remy, P. (2021). Name dataset [Data set]. GitHub. https://github.com/philipperemy/name-dataset
Social Security Administration. (2021). Baby names [Data set]. U.S. Social Security Administration. https://www.ssa.gov/OACT/babynames/limits.html
Tang, C., Ross, K., Saxena, N., & Chen, R. (2011). What’s in a name: A study of names, gender inference, and gender behavior in facebook. International Conference on Database Systems for Advanced Applications, 344–356. https://nsaxena.engr.tamu.edu/wp-content/uploads/sites/238/2019/12/trsc-snsmw11.pdf