This script scrapes the data on PATENT COUNTS BY ORIGIN AND TYPE available on the website of the U.S. PATENT AND TRADEMARK OFFICE as a html table and organizes it in Pandas DataFrame

Author: Carlos Góes

www.carlosgoes.com

In [8]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

def uspto(url):
    # Request the URL
    r  = requests.get(url)

    # Parse it through BeautifulSoup        
    soup = BeautifulSoup(r.content, "html.parser")

    # Find the tables in the website         
    table=soup.find('table', class_="table")
    
    # Loops through all the rows
    for row in table.find_all('tr'):
        
        # Identify headers, store them in a list
            # build DataFrame as wide as the number
            # of headers
        headers = row.find_all('th')
        if len(headers) > 0:
            labels = []
            for header in headers:
                labels.append(header.get_text())
            new_table = pd.DataFrame(
                    columns=range(0,len(labels)),
                                 index = [0])

        # If row are not headers, proceed with the following
        else:
            
            # Identify columns
            columns = row.find_all('td')
            
            # Skip row if columns are merged
            if len(columns) != len(labels):
                continue
            
            # Loop through columns, store data in a list,
                # transform it into a Pandas series,
                # and append it to the Pandas DataFrame
            else:
                list = []
                for column in columns:
                    max = len(column.get_text()) - 1
                    list.append(column.get_text()[:max])
                list = pd.Series(list)
                new_table = new_table.append(list, ignore_index=True)
       
    new_table.columns = labels
    new_table = new_table.dropna()
    
    return new_table

Retrieves data from 2013-2015

In [9]:
years = ['13','14','15']

for year in years:    
    total = uspto("https://www.uspto.gov/web/offices/ac/ido/oeip/taf/st_co_" + year + ".htm")
    total['year'] = year

    if year == years[0]:
        complete = total.copy()
        
    else:
        complete = complete.append(total, ignore_index=True)   

Let's see the data

In [10]:
complete
Out[10]:
Code Design Plant Reissue State, Territory, or Country Statutory Invention Registration (SIR) Total Total (less SIRs) US or Foreign Utility year
0 AL 69 0 0 ALABAMA 1 NaN 572 US 503 13
1 AK 8 0 0 ALASKA 1 NaN 54 US 46 13
2 AZ 196 2 4 ARIZONA 0 NaN 2427 US 2225 13
3 AR 69 1 0 ARKANSAS 0 NaN 229 US 159 13
4 CA 2705 133 108 CALIFORNIA 0 NaN 39139 US 36193 13
5 CO 376 0 7 COLORADO 0 NaN 3176 US 2793 13
6 CT 202 1 4 CONNECTICUT 1 NaN 2375 US 2168 13
7 DE 20 0 3 DELAWARE 1 NaN 476 US 453 13
8 FL 731 54 23 FLORIDA 0 NaN 4761 US 3953 13
9 GA 334 21 1 GEORGIA 0 NaN 2861 US 2505 13
10 HI 19 1 1 HAWAII 0 NaN 146 US 125 13
11 ID 30 0 7 IDAHO 0 NaN 1029 US 992 13
12 IL 679 16 18 ILLINOIS 0 NaN 5357 US 4644 13
13 IN 244 0 2 INDIANA 0 NaN 2167 US 1921 13
14 IA 61 0 2 IOWA 0 NaN 996 US 933 13
15 KS 70 0 0 KANSAS 0 NaN 1088 US 1018 13
16 KY 69 1 2 KENTUCKY 0 NaN 625 US 553 13
17 LA 40 3 1 LOUISIANA 0 NaN 439 US 395 13
18 ME 9 0 0 MAINE 0 NaN 243 US 234 13
19 MD 134 1 6 MARYLAND 0 NaN 1889 US 1748 13
20 MA 336 2 18 MASSACHUSETTS 2 NaN 6765 US 6409 13
21 MI 487 27 7 MICHIGAN 0 NaN 5640 US 5119 13
22 MN 481 4 12 MINNESOTA 0 NaN 4789 US 4292 13
23 MS 15 2 0 MISSISSIPPI 0 NaN 178 US 161 13
24 MO 112 0 5 MISSOURI 0 NaN 1301 US 1184 13
25 MT 9 0 0 MONTANA 0 NaN 127 US 118 13
26 NE 34 0 0 NEBRASKA 0 NaN 341 US 307 13
27 NV 157 0 6 NEVADA 0 NaN 1003 US 840 13
28 NH 70 0 3 NEW HAMPSHIRE 0 NaN 895 US 822 13
29 NJ 521 5 14 NEW JERSEY 0 NaN 5257 US 4717 13
... ... ... ... ... ... ... ... ... ... ... ...
508 RUX 41 1 1 RUSSIAN FEDERATION NaN 483 NaN FOREIGN 440 15
509 BLX 1 0 0 SAINT BARTHELEMY NaN 1 NaN FOREIGN 0 15
510 SAX 4 0 0 SAUDI ARABIA NaN 364 NaN FOREIGN 360 15
511 RSX 0 0 0 SERBIA NaN 6 NaN FOREIGN 6 15
512 SCX 2 0 0 SEYCHELLES NaN 3 NaN FOREIGN 1 15
513 SGX 82 0 0 SINGAPORE NaN 1048 NaN FOREIGN 966 15
514 SKX 2 0 0 SLOVAKIA NaN 28 NaN FOREIGN 26 15
515 SIX 4 0 0 SLOVENIA NaN 42 NaN FOREIGN 38 15
516 ZAX 29 4 0 SOUTH AFRICA NaN 199 NaN FOREIGN 166 15
517 ESX 88 5 1 SPAIN NaN 912 NaN FOREIGN 818 15
518 LKX 3 0 0 SRI LANKA NaN 7 NaN FOREIGN 4 15
519 SEX 225 0 4 SWEDEN NaN 2862 NaN FOREIGN 2633 15
520 CHX 283 1 4 SWITZERLAND NaN 2841 NaN FOREIGN 2553 15
521 SYX 0 0 0 SYRIA NaN 2 NaN FOREIGN 2 15
522 TWX 877 1 7 TAIWAN NaN 12575 NaN FOREIGN 11690 15
523 THX 20 7 0 THAILAND NaN 116 NaN FOREIGN 89 15
524 TTX 1 0 0 TRINIDAD AND TOBAGO NaN 7 NaN FOREIGN 6 15
525 TNX 0 0 0 TUNISIA NaN 6 NaN FOREIGN 6 15
526 TRX 18 0 2 TURKEY NaN 136 NaN FOREIGN 116 15
527 TMX 0 0 0 TURKMENISTAN NaN 1 NaN FOREIGN 1 15
528 UAX 2 0 1 UKRAINE NaN 64 NaN FOREIGN 61 15
529 AEX 11 0 0 UNITED ARAB EMIRATES NaN 56 NaN FOREIGN 45 15
530 GBX 702 32 16 UNITED KINGDOM NaN 7167 NaN FOREIGN 6417 15
531 UYX 0 0 0 URUGUAY NaN 5 NaN FOREIGN 5 15
532 UZX 0 0 0 UZBEKISTAN NaN 1 NaN FOREIGN 1 15
533 VEX 8 0 0 VENEZUELA NaN 30 NaN FOREIGN 22 15
534 VNX 0 0 0 VIET NAM NaN 5 NaN FOREIGN 5 15
535 PSX 0 0 0 WEST BANK/GAZA NaN 1 NaN FOREIGN 1 15
536 11651 674 234 -- Subtotal -- NaN 169997 NaN FOREIGN 157438 15
537 ALL 25986 1074 512 TOTAL- All Regions, U.S. and Foreign NaN 325979 NaN ALL 298407 15

538 rows × 11 columns

In [11]:
foreign = complete[complete['US or Foreign'] == 'FOREIGN'][complete['Code'] != '']

foreign
C:\Users\Carlos\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  if __name__ == '__main__':
Out[11]:
Code Design Plant Reissue State, Territory, or Country Statutory Invention Registration (SIR) Total Total (less SIRs) US or Foreign Utility year
56 ALX 0 0 0 ALBANIA 0 NaN 1 FOREIGN 1 13
57 AGX 0 0 0 ANTIGUA AND BARBUDA 0 NaN 1 FOREIGN 1 13
58 ARX 5 0 0 ARGENTINA 0 NaN 80 FOREIGN 75 13
59 AMX 1 0 0 ARMENIA 0 NaN 4 FOREIGN 3 13
60 AWX 0 0 0 ARUBA 0 NaN 2 FOREIGN 2 13
61 AUX 269 17 7 AUSTRALIA 0 NaN 1924 FOREIGN 1631 13
62 ATX 117 0 10 AUSTRIA 0 NaN 1135 FOREIGN 1008 13
63 AZX 0 0 0 AZERBAIJAN 0 NaN 1 FOREIGN 1 13
64 BSX 1 0 0 THE BAHAMAS 0 NaN 6 FOREIGN 5 13
65 BHX 0 0 0 BAHRAIN 0 NaN 2 FOREIGN 2 13
66 BDX 0 0 0 BANGLADESH 0 NaN 3 FOREIGN 3 13
67 BYX 3 0 0 BELARUS 0 NaN 9 FOREIGN 6 13
68 BEX 71 13 2 BELGIUM 0 NaN 1148 FOREIGN 1062 13
69 BMX 0 0 0 BERMUDA 0 NaN 3 FOREIGN 3 13
70 BOX 0 0 0 BOLIVIA 0 NaN 2 FOREIGN 2 13
71 BRX 32 0 0 BRAZIL 0 NaN 286 FOREIGN 254 13
72 VGX 2 0 0 BRITISH VIRGIN ISLANDS 0 NaN 2 FOREIGN 0 13
73 BGX 1 0 0 BULGARIA 0 NaN 30 FOREIGN 29 13
74 CMX 0 0 0 CAMEROON 0 NaN 4 FOREIGN 4 13
75 CAX 701 4 20 CANADA 0 NaN 7272 FOREIGN 6547 13
76 KYX 3 0 0 CAYMAN ISLANDS 0 NaN 15 FOREIGN 12 13
77 CLX 1 2 0 CHILE 0 NaN 57 FOREIGN 54 13
78 HKX 216 0 0 CHINA, HONG KONG S.A.R. 0 NaN 756 FOREIGN 540 13
79 MOX 1 0 0 CHINA, MACAU S.A.R. 0 NaN 11 FOREIGN 10 13
80 CNX 661 1 7 CHINA, PEOPLE'S REPUBLIC OF 0 NaN 6597 FOREIGN 5928 13
81 COX 5 0 0 COLOMBIA 0 NaN 21 FOREIGN 16 13
82 CRX 0 0 0 COSTA RICA 0 NaN 9 FOREIGN 9 13
83 HRX 1 0 0 CROATIA 0 NaN 18 FOREIGN 17 13
84 CUX 0 0 0 CUBA 0 NaN 13 FOREIGN 13 13
85 CYX 0 0 0 CYPRUS 0 NaN 8 FOREIGN 8 13
... ... ... ... ... ... ... ... ... ... ... ...
506 QAX 1 0 1 QATAR NaN 8 NaN FOREIGN 6 15
507 ROX 3 0 0 ROMANIA NaN 74 NaN FOREIGN 71 15
508 RUX 41 1 1 RUSSIAN FEDERATION NaN 483 NaN FOREIGN 440 15
509 BLX 1 0 0 SAINT BARTHELEMY NaN 1 NaN FOREIGN 0 15
510 SAX 4 0 0 SAUDI ARABIA NaN 364 NaN FOREIGN 360 15
511 RSX 0 0 0 SERBIA NaN 6 NaN FOREIGN 6 15
512 SCX 2 0 0 SEYCHELLES NaN 3 NaN FOREIGN 1 15
513 SGX 82 0 0 SINGAPORE NaN 1048 NaN FOREIGN 966 15
514 SKX 2 0 0 SLOVAKIA NaN 28 NaN FOREIGN 26 15
515 SIX 4 0 0 SLOVENIA NaN 42 NaN FOREIGN 38 15
516 ZAX 29 4 0 SOUTH AFRICA NaN 199 NaN FOREIGN 166 15
517 ESX 88 5 1 SPAIN NaN 912 NaN FOREIGN 818 15
518 LKX 3 0 0 SRI LANKA NaN 7 NaN FOREIGN 4 15
519 SEX 225 0 4 SWEDEN NaN 2862 NaN FOREIGN 2633 15
520 CHX 283 1 4 SWITZERLAND NaN 2841 NaN FOREIGN 2553 15
521 SYX 0 0 0 SYRIA NaN 2 NaN FOREIGN 2 15
522 TWX 877 1 7 TAIWAN NaN 12575 NaN FOREIGN 11690 15
523 THX 20 7 0 THAILAND NaN 116 NaN FOREIGN 89 15
524 TTX 1 0 0 TRINIDAD AND TOBAGO NaN 7 NaN FOREIGN 6 15
525 TNX 0 0 0 TUNISIA NaN 6 NaN FOREIGN 6 15
526 TRX 18 0 2 TURKEY NaN 136 NaN FOREIGN 116 15
527 TMX 0 0 0 TURKMENISTAN NaN 1 NaN FOREIGN 1 15
528 UAX 2 0 1 UKRAINE NaN 64 NaN FOREIGN 61 15
529 AEX 11 0 0 UNITED ARAB EMIRATES NaN 56 NaN FOREIGN 45 15
530 GBX 702 32 16 UNITED KINGDOM NaN 7167 NaN FOREIGN 6417 15
531 UYX 0 0 0 URUGUAY NaN 5 NaN FOREIGN 5 15
532 UZX 0 0 0 UZBEKISTAN NaN 1 NaN FOREIGN 1 15
533 VEX 8 0 0 VENEZUELA NaN 30 NaN FOREIGN 22 15
534 VNX 0 0 0 VIET NAM NaN 5 NaN FOREIGN 5 15
535 PSX 0 0 0 WEST BANK/GAZA NaN 1 NaN FOREIGN 1 15

365 rows × 11 columns