Compare Inventors Between Search Sets

Built with R 4.2.1

This example explores the inventor-focused search set created by example_inventor_sexing.R.

The files it produces are in the eda/searches directory:

outDir <- "eda/searches/"

Load Data

First, we can load in the original search results:

search_pharmaceutical <- read.csv(xzfile(paste0(outDir, "pharmaceutical.csv.xz")))
search_mechanical <- read.csv(xzfile(paste0(outDir, "mechanical.csv.xz")))

The pharmaceutical search returned 94,998 results, and the mechanical search returned 89,487 results.
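For readers unfamiliar with compressed CSVs in R, the following is a minimal round-trip sketch of the same read pattern. It assumes nothing about the real files: `toy_search`, `toy_path`, and the values are hypothetical, and the file is written to a temporary location. The result count of a set is then just the number of rows read back.

```r
# hypothetical stand-in for a search result; the values are made up
toy_search <- data.frame(
  applicationNumber = c("US1", "US2", "US3"),
  title = c("a", "b", "c")
)

# write to and read from an xz-compressed CSV in a temporary location
toy_path <- tempfile(fileext = ".csv.xz")
write.csv(toy_search, xzfile(toy_path), row.names = FALSE)
toy_read <- read.csv(xzfile(toy_path))

# number of results in the set
nrow(toy_read)
```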

The inventor files are split by search set, so we’ll combine them:

inventors <- rbind(
  # based on the search pharmaceutical.ab.
  read.csv(xzfile(paste0(outDir, "inventors_pharmaceutical.csv.xz"))),
  # based on the search F01?.cpcl.
  read.csv(xzfile(paste0(outDir, "inventors_mechanical.csv.xz")))
)

# ensure our working set is from the original searches
inventors <- inventors[inventors$applicationNumber %in% c(
  search_mechanical$applicationNumber, search_pharmaceutical$applicationNumber
), ]

This set includes information from 181,694 patent applications, with 245,824 inventors (going by unique full name, anyway). Of those applications, 93,234 are from the pharmaceutical search, and 88,460 are from the mechanical search.
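Counts like these could be derived along the following lines. This is only a sketch on made-up rows, not necessarily how the figures above were produced: the `toy_` objects are hypothetical, while the column names (`applicationNumber`, `firstName`, `lastName`, `search_set`) are the ones used elsewhere in this example.

```r
# hypothetical inventor rows: one row per inventor per application
toy_inventors <- data.frame(
  applicationNumber = c("US1", "US1", "US2", "US3", "US3"),
  firstName = c("Ada", "Grace", "Ada", "Alan", "Grace"),
  lastName = c("Lovelace", "Hopper", "Lovelace", "Turing", "Hopper"),
  search_set = c(
    "pharmaceutical", "pharmaceutical", "mechanical", "mechanical", "mechanical"
  )
)
# hypothetical application numbers from the original searches
toy_search_ids <- c("US1", "US2")

# keep only rows whose application appears in the original searches
toy_inventors <- toy_inventors[toy_inventors$applicationNumber %in% toy_search_ids, ]

# number of applications, and of inventors by unique full name
n_applications <- length(unique(toy_inventors$applicationNumber))
n_inventors <- length(unique(paste(toy_inventors$firstName, toy_inventors$lastName)))

# applications per search set, counting each application once
toy_applications <- toy_inventors[!duplicated(toy_inventors$applicationNumber), ]
set_counts <- table(toy_applications$search_set)
set_counts
```

Counting by full name treats different people who share a name as one inventor, and one person with spelling variants as several, which is why the inventor count is only approximate.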

Set Assessment

This set of applications is meant to represent technology fields that differ in their proportion of female inventors, where mechanical (such as machine tools and mechanical elements) has a particularly low proportion, and pharmaceutical (along with biotechnology and areas of chemistry) has a higher proportion (Intellectual Property Office, 2019; Miguelez et al., 2019). The field-specific subsets are only defined by their simple search terms, however, so we’ll want to see how representative of those fields these applications might actually be.

One way to get a feel for how representative our sets are might be to look at the distribution of classifications within each:

# make a subset that contains only one row per application
applications <- inventors[!duplicated(inventors$guid), ]

# look at the overall UPC class of each application between sets
overall_classes <- substring(applications$classification, 4, 6)
classes <- table(overall_classes, applications$search_set)

class_highlights <- rbind(
  # most differing classes
  classes[order(classes[, 2] - classes[, 1])[c(1:10, (1:10) + nrow(classes) - 10)], ],
  # most overlapping classes
  classes[order(abs(classes[, 2] - classes[, 1]) - rowSums(classes))[1:10], ]
)
class_highlights <- data.frame(
  class = rownames(class_highlights),
  criteria = rep(c("differing", "overlapping"), c(20, 10)),
  class_highlights
)

## add descriptions
library(uspto)
library(knitr) # provides kable()
class_info <- get_class_info(class_highlights$class, paste0(dirname(outDir), "/original/class_info"))
class_highlights$description <- vapply(class_info, "[[", "", "description")
kable(
  class_highlights,
  row.names = FALSE,
  col.names = c("Class", "Criteria", "Mechanical", "Pharmaceutical", "Description")
)

Class	Criteria	Mechanical	Pharmaceutical	Description
060	differing	20227	0	POWER PLANTS
123	differing	14319	1	INTERNAL-COMBUSTION ENGINES
415	differing	11596	1	ROTARY KINETIC FLUID MOTORS OR PUMPS
416	differing	6355	0	FLUID REACTION SURFACES (I.E., IMPELLERS)
701	differing	2029	1	DATA PROCESSING: VEHICLES, NAVIGATION, AND RELATIVE LOCATION
029	differing	1969	9	METAL WORKING
418	differing	1737	0	ROTARY EXPANSIBLE CHAMBER DEVICES
422	differing	1690	142	CHEMICAL APPARATUS AND PROCESS DISINFECTING, DEODORIZING, PRESERVING, OR STERILIZING
073	differing	1556	81	MEASURING AND TESTING
428	differing	1629	236	STOCK MATERIAL OR MISCELLANEOUS ARTICLES
604	differing	7	652	SURGERY
540	differing	0	700	ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES
536	differing	0	941	ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES
530	differing	2	1680	CHEMISTRY: NATURAL RESINS OR DERIVATIVES; PEPTIDES OR PROTEINS; LIGNINS OR REACTION PRODUCTS THEREOF
548	differing	0	2344	ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES
546	differing	0	2812	ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES
544	differing	2	2905	ORGANIC COMPOUNDS – PART OF THE CLASS 532-570 SERIES
435	differing	22	3483	CHEMISTRY: MOLECULAR BIOLOGY AND MICROBIOLOGY
424	differing	6	24820	DRUG, BIO-AFFECTING AND BODY TREATING COMPOSITIONS
514	differing	17	45051	DRUG, BIO-AFFECTING AND BODY TREATING COMPOSITIONS
428	overlapping	1629	236	STOCK MATERIAL OR MISCELLANEOUS ARTICLES
700	overlapping	340	170	DATA PROCESSING: GENERIC CONTROL SYSTEMS OR SPECIFIC APPLICATIONS
702	overlapping	424	167	DATA PROCESSING: MEASURING, CALIBRATING, OR TESTING
422	overlapping	1690	142	CHEMICAL APPARATUS AND PROCESS DISINFECTING, DEODORIZING, PRESERVING, OR STERILIZING
210	overlapping	484	116	LIQUID PURIFICATION OR SEPARATION
436	overlapping	97	172	CHEMISTRY: ANALYTICAL AND IMMUNOLOGICAL TESTING
427	overlapping	678	91	COATING PROCESSES
073	overlapping	1556	81	MEASURING AND TESTING
264	overlapping	285	79	PLASTIC AND NONMETALLIC ARTICLE SHAPING OR TREATING: PROCESSES
340	overlapping	118	74	COMMUNICATIONS: ELECTRICAL
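The two ordering criteria used to pick these rows can be checked on a small table. The sketch below uses hypothetical counts (classes "A" to "D" are made up). The "differing" order sorts by the pharmaceutical count minus the mechanical count, so the ends of that order are the most one-sided classes. The "overlapping" score, |b - a| - (a + b) for counts a and b, simplifies to -2 * min(a, b), so sorting it ascending ranks classes by the size of their smaller count.

```r
# hypothetical class counts, not taken from the search sets
classes <- matrix(
  c(500, 0, 120, 30, 2, 400, 80, 45),
  ncol = 2,
  dimnames = list(c("A", "B", "C", "D"), c("mechanical", "pharmaceutical"))
)

# differing: most mechanical-heavy first, most pharmaceutical-heavy last
differing_order <- rownames(classes)[order(classes[, 2] - classes[, 1])]

# overlapping: the score used above, and its simplified form
overlap_score <- abs(classes[, 2] - classes[, 1]) - rowSums(classes)
overlap_simple <- -2 * pmin(classes[, 1], classes[, 2])
overlapping_order <- rownames(classes)[order(overlap_score)]

differing_order
overlapping_order
```

Because the overlap score depends only on the smaller count, a class such as 428 can appear under both criteria: its difference between sets is large, but its smaller count is also among the largest.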

Some other application-level features might also affect the comparisons we want to make between sets, so we can check how balanced the sets happen to be on them:

applications$n_inventors <- tapply(inventors$guid, inventors$guid, length)[applications$guid]
summaries <- vapply(split(applications, applications$search_set), function(d) {
  c(
    "Proportion US" = mean(d$inventorCountry == "US"),
    "Proportion California" = mean(d[d$inventorCountry == "US", "inventorState"] == "CA"),
    "Mean Year" = mean(as.numeric(substring(d$date, 1, 4))),
    "Proportion Utility" = mean(d$category == "Utility"),
    "Proportion Small Business" = (1 - mean(d$business == "UNDISCOUNTED")),
    "Proportion Inventor First" = mean(d$first_inventor == "true"),
    "Mean Inventors Per Team" = mean(d$n_inventors),
    "Mean Time To classification" = mean(d$time_initial_classificaiton, na.rm = TRUE),
    "Mean Time To First Action" = mean(d$time_first_action, na.rm = TRUE),
    "Mean Examination Rounds" = mean(d[d$examination_rounds != 0, "examination_rounds"]),
    "Proportion Accepted" = mean(d$any_accepts),
    "Proportion Patented" = mean(d$status == "Patented Case", na.rm = TRUE)
  )
}, numeric(12))
summaries <- cbind(summaries, summaries[, 1] - summaries[, 2])
kable(summaries, digits = 3, col.names = c("Mechanical", "Pharmaceutical", "Mechanical - Pharmaceutical"))

	Mechanical	Pharmaceutical	Mechanical - Pharmaceutical
Proportion US	0.434	0.483	-0.049
Proportion California	0.066	0.261	-0.195
Mean Year	2013.115	2012.321	0.794
Proportion Utility	1.000	1.000	0.000
Proportion Small Business	0.111	0.363	-0.251
Proportion Inventor First	0.428	0.283	0.145
Mean Inventors Per Team	2.710	4.130	-1.420
Mean Time To classification	40.703	19.406	21.297
Mean Time To First Action	44.724	48.523	-3.799
Mean Examination Rounds	1.772	2.195	-0.423
Proportion Accepted	0.598	0.495	0.102
Proportion Patented	0.581	0.370	0.211
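The team-size column relies on `tapply` returning a named vector that can then be indexed by application identifier. A minimal sketch of that lookup, followed by a per-set summary, is below. The data and the `toy_` and `team_` names are hypothetical, and the `as.character` call is an added precaution that assumes nothing about how `guid` is stored: name-based indexing only works with character indices.

```r
# hypothetical inventor rows: one row per inventor per application
toy_inventors <- data.frame(
  guid = c("a", "b", "b", "c", "c", "c", "c"),
  search_set = c("mechanical", rep("pharmaceutical", 6))
)
toy_applications <- toy_inventors[!duplicated(toy_inventors$guid), ]

# count rows per application, then look each application up by name
team_sizes <- tapply(toy_inventors$guid, toy_inventors$guid, length)
toy_applications$n_inventors <- team_sizes[as.character(toy_applications$guid)]

# summarize within each search set
team_means <- vapply(
  split(toy_applications, toy_applications$search_set),
  function(d) mean(d$n_inventors),
  numeric(1)
)
team_means
```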

Inventor Sexing

The only inventor information included in USPTO data is name, country, state (in some countries), and city. We are particularly interested in looking at differences in the sex distribution of inventors between technology areas, so we used that inventor information to assign sex. For comparison, we used three basic methods with several different sources:

Historical or public name-sex datasets:
- sex_in_country_wgnd and prob_fem_wgnd (with some additional sources; Raffo, 2021)
- prob_fem_skydeck (Rao, 2020)
- prob_fem_usssa (Social Security Administration, 2021)
- sex_usuk (Giordano et al., 2021)
- prob_fem_ssa, prob_fem_ipums, and prob_fem_napp (Mullen, 2021)
Social media profiles:
- prob_fem_fb (from leaked account details; Remy, 2021)
- prob_fem_fb_scraped (scraped from profiles; Tang et al., 2011)
Search cues:
- prob_fem_search (guess_sex.R)

First, we can just look at the correlation between sources:

prob_cols <- grep("prob_fem", colnames(inventors), fixed = TRUE, value = TRUE)
cors <- cor(inventors[, prob_cols], use = "pairwise.complete.obs")
rownames(cors) <- paste0("(", seq_along(prob_cols), ") ", prob_cols)
colnames(cors) <- paste0("(", seq_along(prob_cols), ")")
kable(cors, digits = 3)

	(1)	(2)	(3)	(4)	(5)	(6)	(7)	(8)	(9)
(1) prob_fem_wgnd	1.000	0.899	0.809	0.858	0.843	0.831	0.700	0.630	0.391
(2) prob_fem_fb	0.899	1.000	0.829	0.856	0.845	0.834	0.730	0.667	0.404
(3) prob_fem_fb_scraped	0.809	0.829	1.000	0.878	0.884	0.887	0.749	0.695	0.381
(4) prob_fem_skydeck	0.858	0.856	0.878	1.000	0.976	0.960	0.783	0.705	0.388
(5) prob_fem_usssa	0.843	0.845	0.884	0.976	1.000	0.982	0.794	0.717	0.389
(6) prob_fem_ssa	0.831	0.834	0.887	0.960	0.982	1.000	0.790	0.726	0.391
(7) prob_fem_ipums	0.700	0.730	0.749	0.783	0.794	0.790	1.000	0.784	0.360
(8) prob_fem_napp	0.630	0.667	0.695	0.705	0.717	0.726	0.784	1.000	0.332
(9) prob_fem_search	0.391	0.404	0.381	0.388	0.389	0.391	0.360	0.332	1.000
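With `use = "pairwise.complete.obs"`, each correlation is computed only over the rows where both sources have a value, so different cells of the matrix can rest on different numbers of names. The sketch below illustrates this on made-up probabilities. The column names and values are hypothetical, and the count of shared rows is an added check rather than part of the analysis above.

```r
# hypothetical probabilities from three sources, with gaps in coverage
toy_probs <- data.frame(
  prob_fem_a = c(0.9, 0.1, 0.8, NA, 0.2),
  prob_fem_b = c(0.8, 0.2, NA, 0.6, 0.1),
  prob_fem_c = c(NA, 0.3, 0.7, 0.5, 0.4)
)

# correlations over the rows each pair of sources has in common
pairwise <- cor(toy_probs, use = "pairwise.complete.obs")

# number of rows each pair of sources has in common
available <- ifelse(is.na(as.matrix(toy_probs)), 0, 1)
shared <- crossprod(available)

round(pairwise, 3)
shared
```

A low correlation for a source with sparse coverage, such as the search-based estimate, may partly reflect the small or unusual subset of names it shares with the other sources.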

The World Gender Name Dictionary has the broadest coverage and the widest range of sources, so we could also treat it as the best guess, and look at the accuracy of the others against it:

# convert probabilities to predictions
sex_predictions <- inventors[, prob_cols]
sex_predictions[is.na(sex_predictions)] <- .5
sex_predictions[sex_predictions == .5] <- "U"
sex_predictions[inventors[, prob_cols] > .5] <- "F"
sex_predictions[inventors[, prob_cols] < .5] <- "M"
inventors[, sub("prob_fem", "sex", prob_cols, fixed = TRUE)] <- sex_predictions
sex_cols <- grep("sex_", colnames(inventors), fixed = TRUE, value = TRUE)

# then get agreement with the in-country WGND assignment
prediction_summaries <- data.frame(
  Accuracy = colMeans(vapply(
    inventors[, sex_cols], "==", logical(nrow(inventors)), inventors$sex_in_country_wgnd
  )),
  Percent_Determinate = colMeans(vapply(inventors[, sex_cols], "!=", logical(nrow(inventors)), "U"))
)
kable(prediction_summaries[order(-prediction_summaries$Accuracy), ], digits = 3)

	Accuracy	Percent_Determinate
sex_in_country_wgnd	1.000	0.829
sex_wgnd	0.912	0.908
sex_fb	0.871	0.915
sex_usuk	0.864	0.765
sex_skydeck	0.860	0.804
sex_usssa	0.846	0.781
sex_ssa	0.835	0.766
sex_fb_scraped	0.819	0.751
sex_ipums	0.788	0.736
sex_napp	0.695	0.579
sex_search	0.276	0.163
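The thresholding and the two summary columns can be traced on a handful of values. The sketch below uses a hypothetical probability vector and reference labels, and computes the labels with `ifelse` rather than the matrix assignment above. It also adds a third quantity, agreement among determinate predictions only, which is not part of the table and is included here as an assumption about what a reader might want to compare.

```r
# hypothetical probabilities from one source, and reference labels
toy_prob <- c(0.9, 0.2, NA, 0.5, 0.7)
toy_reference <- c("F", "M", "F", "M", "M")

# missing or exactly .5 is undetermined; otherwise threshold at .5
toy_label <- ifelse(
  is.na(toy_prob) | toy_prob == 0.5, "U",
  ifelse(toy_prob > 0.5, "F", "M")
)

# agreement over all rows, where undetermined predictions count as misses
agreement <- mean(toy_label == toy_reference)

# share of rows with a determinate prediction
determinate <- mean(toy_label != "U")

# agreement restricted to determinate predictions (not in the table above)
agreement_determinate <- mean((toy_label == toy_reference)[toy_label != "U"])

c(agreement, determinate, agreement_determinate)
```

Since agreement is taken over all rows, a source with low coverage is bounded by its coverage whenever the reference is determinate, which is worth keeping in mind when reading the Accuracy column next to Percent_Determinate.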

Sex Differences

With the potential limitations of our sample and inventor sexing methods in mind, we can look at differences between sex-based inventor groups:

inventors$n_inventors <- tapply(inventors$guid, inventors$guid, length)[inventors$guid]
unique_inventors <- inventors[!duplicated(paste(
  inventors$firstName, inventors$lastName, inventors$inventorCountry
)), ]
summaries_sex <- vapply(split(unique_inventors, unique_inventors$sex_in_country_wgnd), function(d) {
  c(
    "Pharmaceutical Set" = mean(d$search_set == "pharmaceutical"),
    "Proportion US" = mean(d$inventorCountry == "US", na.rm = TRUE),
    "Proportion California" = mean(d[
      !is.na(d$inventorCountry) & d$inventorCountry == "US", "inventorState"
    ] == "CA"),
    "Mean Year" = mean(as.numeric(substring(d$date, 1, 4))),
    "Proportion Utility" = mean(d$category == "Utility"),
    "Proportion Small Business" = (1 - mean(d$business == "UNDISCOUNTED")),
    "Proportion Inventor First" = mean(d$first_inventor == "true"),
    "Mean Inventors Per Team" = mean(d$n_inventors),
    "Mean Time To classification" = mean(d$time_initial_classificaiton, na.rm = TRUE),
    "Mean Time To First Action" = mean(d$time_first_action, na.rm = TRUE),
    "Mean Examination Rounds" = mean(d[d$examination_rounds != 0, "examination_rounds"]),
    "Proportion Accepted" = mean(d$any_accepts),
    "Proportion Patented" = mean(d$status == "Patented Case", na.rm = TRUE)
  )
}, numeric(13))
kable(data.frame(
  Female = as.numeric(summaries_sex[, 1, drop = FALSE]),
  Male = summaries_sex[, 2],
  "Female - Male" = summaries_sex[, 1] - summaries_sex[, 2],
  Unknown = summaries_sex[, 3],
  check.names = FALSE
), digits = 3)

	Female	Male	Female - Male	Unknown
Pharmaceutical Set	0.849	0.510	0.339	0.728
Proportion US	0.358	0.379	-0.021	0.241
Proportion California	0.219	0.154	0.065	0.216
Mean Year	2012.358	2011.798	0.561	2013.694
Proportion Utility	1.000	1.000	0.000	1.000
Proportion Small Business	0.298	0.211	0.087	0.398
Proportion Inventor First	0.348	0.336	0.012	0.449
Mean Inventors Per Team	5.714	4.649	1.065	5.798
Mean Time To classification	22.947	30.293	-7.346	26.233
Mean Time To First Action	49.346	47.750	1.596	46.309
Mean Examination Rounds	2.267	2.088	0.179	2.075
Proportion Accepted	0.521	0.564	-0.043	0.531
Proportion Patented	0.399	0.463	-0.064	0.462
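One thing to keep in mind when reading this table is that `duplicated` keeps the first row for each name and country combination, so an inventor with several applications contributes only the application-level values of the first one encountered. The sketch below shows that behavior on made-up rows. The `toy_` names are hypothetical, and the per-inventor share at the end is an alternative summary added for illustration, not something computed above.

```r
# hypothetical rows: one inventor appears on two applications
toy_inventors <- data.frame(
  firstName = c("Ada", "Ada", "Grace"),
  lastName = c("Lovelace", "Lovelace", "Hopper"),
  inventorCountry = c("GB", "GB", "US"),
  search_set = c("mechanical", "pharmaceutical", "pharmaceutical"),
  stringsAsFactors = FALSE
)

# the same key used above: first name, last name, and country
inventor_key <- paste(
  toy_inventors$firstName, toy_inventors$lastName, toy_inventors$inventorCountry
)

# the first row per key is kept, along with its application-level values
toy_unique <- toy_inventors[!duplicated(inventor_key), ]
toy_unique$search_set

# alternative: share of each inventor's rows that are pharmaceutical
pharma_share <- tapply(
  toy_inventors$search_set == "pharmaceutical", inventor_key, mean
)
pharma_share
```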

References

Giordano, R., Day, A., & Boyle, J. (2021). GenderInfer: This is a collection of functions to analyse gender differences. https://CRAN.R-project.org/package=GenderInfer

Intellectual Property Office. (2019). Gender profiles in worldwide patenting: An analysis of female inventorship. Intellectual Property Office. https://www.gov.uk/government/publications/gender-profiles-in-worldwide-patenting-an-analysis-of-female-inventorship-2019-edition

Miguelez, E., Toole, A., Myers, A., Breschi, S., Ferrucci, E., Lissoni, F., Sterzi, V., Tarasconi, G., et al. (2019). Progress and potential: A profile of women inventors on US patents. U.S. Patent and Trademark Office, Office of the Chief Economist. https://www.uspto.gov/sites/default/files/documents/Progress-and-Potential-2019.pdf

Mullen, L. (2021). Gender: Predict gender from names using historical data. https://github.com/lmullen/gender

Raffo, J. (2021). World gender name dictionary (WGND 2.0) (DRAFT VERSION) [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/MSEGSJ

Rao, A. (2020). Gender by name data set [Data set]. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Gender+by+Name

Remy, P. (2021). Name dataset [Data set]. GitHub. https://github.com/philipperemy/name-dataset

Social Security Administration. (2021). Baby names [Data set]. U.S. Social Security Administration. https://www.ssa.gov/OACT/babynames/limits.html

Tang, C., Ross, K., Saxena, N., & Chen, R. (2011). What’s in a name: A study of names, gender inference, and gender behavior in Facebook. International Conference on Database Systems for Advanced Applications, 344–356. https://nsaxena.engr.tamu.edu/wp-content/uploads/sites/238/2019/12/trsc-snsmw11.pdf