Sad part of the story:
the best current commercial tools are better than the open source tools in many cases.
import pandas as pd
https://www.propublica.org/nerds/item/turning-pdfs-to-text-doc-dollars-guide
https://thomaslevine.com/!/parsing-pdfs/
list=pd.read_csv("pdf_examples/tabula-AFD-130118-015.csv")
list.head()
MAJCOM, FOA, Etc | Organizational Level | Finding Type | Quantity | Item(s) discovered | Location | |
---|---|---|---|---|---|---|
0 | ACC | Staff | Unprofessional | 1 | photo | Workplace Common Area |
1 | ACC | Staff | Unprofessional | 1 | newspaper with unprofessional cover | Workplace Common Area |
2 | ACC | Squadron | Unprofessional | 1 | magazine | Workplace Common Area |
3 | ACC | Squadron | Inappropriate/Offensive | 1 | Bumper sticker | Car |
4 | ACC | Squadron | Unprofessional | 6 | signs with unproffesional language | Workplace Common Area |
list.groupby(by="Location").count()
MAJCOM, FOA, Etc | Organizational Level | Finding Type | Quantity | Item(s) discovered | |
---|---|---|---|---|---|
Location | |||||
1inX1in deck of cards with \rnude drawing; 2inX2.5in post \rcard drawing depicting front of \rairplane with female drawing | 1 | 1 | 1 | 1 | 1 |
A-10 Ladder Doors | 1 | 1 | 1 | 1 | 1 |
Acft Dock desk | 1 | 1 | 1 | 1 | 1 |
Air Terminal Operations Bldg | 1 | 1 | 1 | 1 | 1 |
Aircraft | 2 | 2 | 2 | 2 | 2 |
Aircraft Parts Store | 1 | 1 | 1 | 1 | 1 |
Airfield Server/network drive | 1 | 1 | 1 | 1 | 1 |
Airman & Family Readiness \rCenter | 1 | 1 | 1 | 1 | 1 |
Airmen's common work area | 1 | 1 | 1 | 1 | 1 |
Although \rhumorous/functional…could be \rperceived as offensive | 3 | 3 | 3 | 3 | 3 |
Ammo Facility | 1 | 1 | 1 | 1 | 1 |
Anti-religious sentiment does \rnot promote a proper work \renvironment | 1 | 1 | 1 | 1 | 1 |
Auditorium | 2 | 2 | 2 | 2 | 2 |
Auto Hobby Shop | 2 | 2 | 2 | 2 | 2 |
Avionics Programs Office | 2 | 2 | 2 | 2 | 2 |
Avionics section | 2 | 2 | 2 | 2 | 2 |
Back of office door | 1 | 1 | 1 | 1 | 1 |
Bar | 1 | 1 | 1 | 1 | 1 |
Bar (class gift) | 1 | 1 | 1 | 1 | 1 |
Bar (visiting unit gift) | 1 | 1 | 1 | 1 | 1 |
Base Common Area | 1 | 1 | 1 | 1 | 1 |
Base Library | 1 | 1 | 1 | 1 | 1 |
Base operations men’s room | 1 | 1 | 1 | 1 | 1 |
Bathroom | 8 | 8 | 8 | 8 | 8 |
Bathroom (M) | 1 | 1 | 1 | 1 | 1 |
Bathroom (W) | 1 | 1 | 1 | 1 | 1 |
Bathroom Stall | 1 | 1 | 1 | 1 | 1 |
Bathroom stalls | 6 | 6 | 6 | 6 | 6 |
Bathroom wall | 1 | 1 | 1 | 1 | 1 |
Bias against mentally \rhandicapped people is not \rappropriate in the work place | 1 | 1 | 1 | 1 | 1 |
... | ... | ... | ... | ... | ... |
computer | 4 | 4 | 4 | 4 | 4 |
computer files | 32 | 32 | 32 | 32 | 32 |
computer room | 1 | 1 | 1 | 1 | 1 |
detrimental to good order and \rdiscipline | 2 | 2 | 2 | 2 | 2 |
explicit material | 6 | 6 | 6 | 6 | 6 |
explicit material/mild nudity | 1 | 1 | 1 | 1 | 1 |
explicit material/sexuality mild \rnudity | 1 | 1 | 1 | 1 | 1 |
explicit/mild nudity | 1 | 1 | 1 | 1 | 1 |
explicit/violent/vulgar material | 1 | 1 | 1 | 1 | 1 |
foyer | 1 | 1 | 1 | 1 | 1 |
inside the drawer of a common \ruse desk | 1 | 1 | 1 | 1 | 1 |
latrine | 2 | 2 | 2 | 2 | 2 |
member’s office | 2 | 2 | 2 | 2 | 2 |
nudity/inappropriate subject \rmatter | 1 | 1 | 1 | 1 | 1 |
office cubicles | 1 | 1 | 1 | 1 | 1 |
on latrine board | 1 | 1 | 1 | 1 | 1 |
potential for inappropriate \rcontent | 1 | 1 | 1 | 1 | 1 |
server | 5 | 5 | 5 | 5 | 5 |
sexually explicit item in \rcommon area | 1 | 1 | 1 | 1 | 1 |
sexually explicit/offensive | 1 | 1 | 1 | 1 | 1 |
sexually explicit/profane item in \rcommon area | 2 | 2 | 2 | 2 | 2 |
share drive | 8 | 8 | 8 | 8 | 8 |
shared drive | 77 | 77 | 77 | 77 | 77 |
shared drive/history micro film r | 2 | 2 | 2 | 2 | 2 |
shelf | 3 | 3 | 3 | 3 | 3 |
storage closet | 1 | 1 | 1 | 1 | 1 |
unprofessional comments | 2 | 2 | 2 | 2 | 2 |
vulgar | 1 | 1 | 1 | 1 | 1 |
vulgar or offensive language | 4 | 4 | 4 | 4 | 4 |
workspace | 1 | 1 | 1 | 1 | 1 |
577 rows × 5 columns
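The counts above are fragmented partly because the same place shows up with different capitalization, stray whitespace, or embedded \r characters left over from line wraps in the PDF. A minimal sketch of normalizing such values before grouping follows. The sample frame and the helper name normalize_location are hypothetical, made up here to imitate the kinds of values in the Location column; this is not the real CSV.

```python
import pandas as pd

# Hypothetical sample imitating the kinds of values seen in the Location column.
sample = pd.DataFrame({
    "Location": [
        "shared drive",
        "Shared Drive ",
        "Airman & Family Readiness \rCenter",
        "Airman & Family Readiness Center",
        "Bathroom",
    ]
})


def normalize_location(values: pd.Series) -> pd.Series:
    """Collapse whitespace runs (including carriage returns), trim, lowercase."""
    return (
        values.str.replace(r"\s+", " ", regex=True)
        .str.strip()
        .str.lower()
    )


sample["Location_clean"] = normalize_location(sample["Location"])
location_counts = sample.groupby("Location_clean").size()
print(location_counts)
```

Near-duplicates such as "share drive" and "shared drive" would still need a hand-written mapping; this only removes the purely mechanical differences.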
Hello *REGEX* my old friend,
I've come to talk with you once again
pdftotext
apt-get install poppler-utils
http://www.foolabs.com/xpdf/home.html
Implementations vary a lot. Results are generally better on Linux than on Mac.
cd pdf_examples/
/home/mljones/repositories/courses/databases-2015/pdf_examples
# does basic conversion
!pdftotext p5.pdf
!cat p5.txt
Keynote Talk The Mathematics of Causal Inference Judea Pearl Computer Science Department University of California Los Angeles Los Angeles, CA 90024, USA judea@cs.ucla.edu Abstract I will review concepts, principles, and mathematical tools that were found useful in applications involving causal and counterfactual relationships. This semantical framework, enriched with a few ideas from logic and graph theory, gives rise to a complete, coherent, and friendly calculus of causation that unifies the graphical and counterfactual approaches to causation and resolves many long-standing problems in several of the sciences. These include questions of causal effect estimation, policy analysis, and the integration of data from diverse studies. Of special interest to KDD researchers would be the following topics: 1. The Mediation Formula, and what it tells us about direct and indirect effects. 2. What mathematics can tell us about “external validity” or “generalizing from experiments” 3. What can graph theory tell us about recovering from sample-selection bias. Categories and Subject Descriptors: G.m [Mathematics of Computing]: Miscellaneous General Terms: Theory Bio Judea Pearl is a professor of computer science and statistics at the University of California, Los Angeles. He is a graduate of the Technion, Israel, and has joined the faculty of UCLA in 1970, where he currently directs the Cognitive Systems Laboratory and conducts research in artificial intelligence, causal inference and philosophy of science. He has authored three books: Heuristics (1984), Probabilistic Reasoning (1988), and Causality (2000;2009). A member of the National Academy of Engineering, and a Founding Fellow the American Association for Artificial Intelligence (AAAI), Judea Pearl is the recipient of the 2008 Benjamin Franklin Medal for Computer and Cognitive Science and this year’s David Rumelhart Prize from the Cognitive Science Society. Copyright is held by the author/owner(s). 
KDD’11, August 21–24, 2011, San Diego, California, USA. ACM 978-1-4503-0813-7/11/08. 5
!pdftotext -layout p5.pdf
!cat p5.txt
Keynote Talk The Mathematics of Causal Inference Judea Pearl Computer Science Department University of California Los Angeles Los Angeles, CA 90024, USA judea@cs.ucla.edu Abstract I will review concepts, principles, and mathematical tools that were found useful in applications involving causal and counterfactual relationships. This semantical framework, enriched with a few ideas from logic and graph theory, gives rise to a complete, coherent, and friendly calculus of causation that unifies the graphical and counterfactual approaches to causation and resolves many long-standing problems in several of the sciences. These include questions of causal effect estimation, policy analysis, and the integration of data from diverse studies. Of special interest to KDD researchers would be the following topics: 1. The Mediation Formula, and what it tells us about direct and indirect effects. 2. What mathematics can tell us about “external validity” or “generalizing from experiments” 3. What can graph theory tell us about recovering from sample-selection bias. Categories and Subject Descriptors: G.m [Mathematics of Computing]: Miscellaneous General Terms: Theory Bio Judea Pearl is a professor of computer science and statistics at the University of California, Los Angeles. He is a graduate of the Technion, Israel, and has joined the faculty of UCLA in 1970, where he currently directs the Cognitive Systems Laboratory and conducts research in artificial intelligence, causal inference and philosophy of science. He has authored three books: Heuristics (1984), Probabilistic Reasoning (1988), and Causality (2000;2009). A member of the National Academy of Engineering, and a Founding Fellow the American Association for Artificial Intelligence (AAAI), Judea Pearl is the recipient of the 2008 Benjamin Franklin Medal for Computer and Cognitive Science and this year’s David Rumelhart Prize from the Cognitive Science Society. Copyright is held by the author/owner(s). 
KDD’11, August 21–24, 2011, San Diego, California, USA. ACM 978-1-4503-0813-7/11/08. 5
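The same conversions can be driven from Python instead of the ! shell escape, which helps when looping over many PDFs. The sketch below assumes pdftotext is on the PATH; the helper names build_pdftotext_cmd and pdf_to_text are hypothetical, and only the -layout and -enc flags used elsewhere in these notes are passed.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_cmd(pdf_path, layout=False, encoding=None):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")           # keep the physical layout of the page
    if encoding:
        cmd.extend(["-enc", encoding])  # output text encoding, e.g. "UTF-8"
    cmd.append(str(pdf_path))
    return cmd


def pdf_to_text(pdf_path, layout=False, encoding="UTF-8"):
    """Run pdftotext and return the path of the .txt file it writes."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils or xpdf")
    subprocess.run(build_pdftotext_cmd(pdf_path, layout, encoding), check=True)
    # With no output argument, pdftotext writes next to the input (.pdf -> .txt).
    return Path(pdf_path).with_suffix(".txt")


print(build_pdftotext_cmd("p5.pdf", layout=True, encoding="UTF-8"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.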
Let's check out a yucky scanned then OCR'd table from our good friends at DARPA. (It doesn't work on Tabula, alas!)
!pdftotext 12-F-1039_1999-DARPA-Funding-List.pdf
!head 12-F-1039_1999-DARPA-Funding-List.txt
A 1 FY 2 1420 1999 1421 1422 1423 1424 1425 1426
-layout OR -fixed (and a number, say 2 or 10)
!pdftotext -layout 12-F-1039_1999-DARPA-Funding-List.pdf
!head 12-F-1039_1999-DARPA-Funding-List.txt
A B c D E F G 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE AMOUNT 2 1420 1999 MDA97292J 1029 GR20 CNRI INFORMATION MANAGEMENT 12/10/1998 $687,000.00 1421 MDA97292J1 029 GR22 CNRI COMMUNICATOR 4/22/1999 $400,000.00 1422 MDA97292J1 029 GR22 CNRI WEBINABOX 4122/1999 $360,000.00 1423 MDA97292J1 029 P00025 CNRI WEBINABOX 8/24/1999 $0.00 1424 MDA972931 0030 P00009 GEORGIATEC HIGH DEFINITION SYSTEMS (HDS) 1/29/1999 $1 ,210,694.00 1425 MDA9729320014 P00017 USDISPLAYC FLAT PANEL DISPLAYS 8116/1999 $5,794,000.00 1426 MDA97293C0016 P00043 SYSPLANCOR CHPS: Combat Hybrid Power Systems 1nt1999 $79,441.00
That looks like something we might be able to struggle with!
Let's try it!
Lots of ways of tackling it, but the easiest is probably pandas' read_table function.
# first, just make sure we are in control of the encoding
!pdftotext -layout -enc "UTF-8" 12-F-1039_1999-DARPA-Funding-List.pdf
darpa1999=pd.read_table("12-F-1039_1999-DARPA-Funding-List.txt", sep="\t", encoding="UTF-8", header=1)
darpa1999
1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE AMOUNT | |
---|---|
0 | 2 |
1 | 1420 1999 MDA97292J 1029 GR20 CNRI ... |
2 | 1421 MDA97292J1 029 GR22 CNRI ... |
3 | 1422 MDA97292J1 029 GR22 CNRI ... |
4 | 1423 MDA97292J1 029 P00025 CNRI ... |
5 | 1424 MDA972931 0030 P00009 GEORGI... |
6 | 1425 MDA9729320014 P00017 USDISP... |
7 | 1426 MDA97293C0016 P00043 SYSPLA... |
8 | 1427 MDA97294C0003 A00003 BELLAT... |
9 | 1428 MDA97294C0003 P00026 BELLAT... |
10 | 1429 MDA97294C0003 P00027 BELLAT... |
11 | 1430 MDA97294C0003 P00028 BELLAT... |
12 | 1431 MDA97294C0003 P00029 BELLAT... |
13 | 1432 MDA97294C0003 P00030 BELLAT... |
14 | 1433 MDA97294C0003 P00031 BELLAT... |
15 | 1434 MDA97294C0003 P00032 BELLAT... |
16 | 1435 MDA97294C0016 P00026 BDMFED... |
17 | 1436 MDA97294C0016 P00027 BDMFED... |
18 | 1437 MDA97294C0016 P00028 BDMFED... |
19 | 1438 MDA97294C0016 P00029 BDMFED... |
20 | 1439 MDA97294C0016 P00030 BDMFED... |
21 | 1440 MDA97294D0001 D003/P16 VRT ... |
22 | 1441 MDA97294D0001 0032/3 VRT ... |
23 | 1442 MDA97294D0001 003202 VALLEY... |
24 | 1443 MDA972951 0016 GR03 ARIZON... |
25 | 1444 MDA9729530027 P00014 BELLCO... |
26 | 1445 MDA9729530029 A00009 PLANAR... |
27 | 1446 MDA9729530029 GR0008 PLANAR... |
28 | 1447 MDA9729530036 GR06 ITNENE... |
29 | 1448 MDA9729530042 GR011 CRAYRE... |
... | ... |
488 | 1880 MDA97299F0028 D001 DIGITSY... |
489 | 1881 MDA97299F0029 DO DTAI ... |
490 | 1882 MDA97299F0030 BASIC BOOZALL... |
491 | 1883 MDA97299F0031 BASIC SCHAFER... |
492 | 1884 MDA97299F0032 DO BRADSON... |
493 | 1885 MDA97299F0033 DO SYSPLAN... |
494 | 1886 MDA97299F0033 P00001 SYSPLAN... |
495 | 1887 MDA97299F0034 BASIC DIGITSY... |
496 | 1888 MDA97299M0002 DO INFOSYS... |
497 | 1889 MDA97299M0003 DO SRC ... |
498 | A B c D ... |
499 | 1 FY CONTRACT NUMBER CONTRACT MOD PERFORME... |
500 | 2 |
501 | 1890 MDA97299M0004 DO ARDAK ... |
502 | 1891 MDA97299M0004 P00001 ARDAK ... |
503 | 1892 MDA97299M0004 P00002 ARDAK ... |
504 | 1893 MDA97299M0005 DO SHA ... |
505 | 1894 MDA97299M0005 P00001 SHA ... |
506 | 1895 MDA97299M0006 DO VISTARE... |
507 | 1896 MDA97299M0007 DO VISUALE... |
508 | 1897 MDA97299M0008 BASIC BLUE RI... |
509 | 1898 MDA97299M0009 DO QRI ... |
510 | 1899 MDA97299M001 0 DO PRAJAIN... |
511 | 1900 MDA97299M0011 BASIC lVI ... |
512 | 1901 MDA97299M0012 BASIC JERRYCO... |
513 | 1902 MDA97299M0013 DO DIAMOND... |
514 | 1903 MDA9769630014 P00007 SDLINC ... |
515 | 1904 ... |
516 | 1905 |
517 |
518 rows × 1 columns
You'll recall using sep="\t" or sep="|" to tell pd.read_csv to look for tabs or pipes.
The trick here is to look for spaces. Fortunately, we don't have to convert spaces to tabs. We just tell it that a run of two or more whitespace characters is the delimiter, using a standard regex: \s\s+ (a bare \s+ would also split on the single spaces inside values such as program titles).
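How many spaces the separator regex requires matters a great deal. The sketch below splits one line with Python's re module, once on any whitespace run and once on runs of two or more. The cell values are taken from the DARPA table, but the exact spacing between columns is an assumption about what -layout produces.

```python
import re

# Values from row 1424 of the DARPA table; the column spacing is assumed.
line = (
    "1424   MDA972931 0030   P00009   GEORGIATEC   "
    "HIGH DEFINITION SYSTEMS (HDS)   1/29/1999   $1 ,210,694.00"
)

loose = re.split(r"\s+", line)      # splits on every whitespace run
strict = re.split(r"\s{2,}", line)  # splits only on runs of two or more

print(len(loose), loose)
print(len(strict), strict)
```

The strict pattern keeps multi-word values such as the program title in one field, at the price of failing whenever the OCR leaves only a single space between two columns.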
darpa1999=pd.read_table("12-F-1039_1999-DARPA-Funding-List.txt", sep=r"\s\s+", encoding="UTF-8", header=0)
/home/mljones/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:648: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'. ParserWarning)
darpa1999
A | B | c | D | E | F | G | |
---|---|---|---|---|---|---|---|
0 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
1 | 2 | None | None | None | None | None | None |
2 | 1420 1999 | MDA97292J 1029 | GR20 | CNRI | INFORMATION MANAGEMENT | 12/10/1998 | $687,000.00 |
3 | 1421 | MDA97292J1 029 | GR22 | CNRI | COMMUNICATOR | 4/22/1999 | $400,000.00 |
4 | 1422 | MDA97292J1 029 | GR22 | CNRI | WEBINABOX | 4122/1999 | $360,000.00 |
5 | 1423 | MDA97292J1 029 | P00025 | CNRI | WEBINABOX | 8/24/1999 | $0.00 |
6 | 1424 | MDA972931 0030 | P00009 | GEORGIATEC | HIGH DEFINITION SYSTEMS (HDS) | 1/29/1999 | $1 ,210,694.00 |
7 | 1425 | MDA9729320014 | P00017 | USDISPLAYC | FLAT PANEL DISPLAYS | 8116/1999 | $5,794,000.00 |
8 | 1426 | MDA97293C0016 | P00043 | SYSPLANCOR | CHPS: Combat Hybrid Power Systems | 1nt1999 | $79,441.00 |
9 | 1427 | MDA97294C0003 | A00003 | BELLATLANT | NEXT GENERATION INTERNET | 8/28/1998 | $0.00 |
10 | 1428 | MDA97294C0003 | P00026 | BELLATLANT | NEXT GENERATION INTERNET | 1/2011999 | $332,197.00 |
11 | 1429 | MDA97294C0003 | P00027 | BELLATLANT | NEXT GENERATION INTERNET | 2/4/1999 | $94,750.00 |
12 | 1430 | MDA97294C0003 | P00028 | BELLATLANT | NEXT GENERATION INTERNET | 2/22/1999 | $450,000.00 |
13 | 1431 | MDA97294C0003 | P00029 | BELLATLANT | NEXT GENERATION INTERNET | 3/1/1999 | $254,750.00 |
14 | 1432 | MDA97294C0003 | P00030 | BELLATLANT | NEXT GENERATION INTERNET | 4/1 2/1999 | $0.00 |
15 | 1433 | MDA97294C0003 | P00031 | BELLATLANT | NEXT GENERATION INTERNET | 4/1 3/1999 | $254,750.00 |
16 | 1434 | MDA97294C0003 | P00032 | BELLATLANT | NEXT GENERATION INTERNET | 9/8/1999 | $254,750.00 |
17 | 1435 | MDA97294C0016 | P00026 | BDMFEDERAL | STOWACTD | 2/1 2/1999 | $117,000.00 |
18 | 1436 | MDA97294C0016 | P00027 | BDMFEDERAL | STOWACTD | 3/1/1999 | $273,000.00 |
19 | 1437 | MDA97294C0016 | P00028 | BDMFEDERAL | IMAGE UNDERSTANDING | 3/2911999 | $150,166.00 |
20 | 1438 | MDA97294C0016 | P00029 | BDMFEDERAL | STOWACTD | 5/27/1999 | $40,000.00 |
21 | 1439 | MDA97294C0016 | P00030 | BDMFEDERAL | STOWACTD | 911 /1999 | $55,930.00 |
22 | 1440 | MDA97294D0001 | D003/P16 | VRT | BADD | 12/9/1998 | $73,374.00 |
23 | 1441 | MDA97294D0001 | 0032/3 | VRT | AGILE INFO CONTROL ENVIRONMENT | 2/12/1999 | $100,095.00 |
24 | 1442 | MDA97294D0001 | 003202 | VALLEYELEC | AGILE INFO CONTROL ENVIRONMENT | 12/22/1998 | $100,095.00 |
25 | 1443 | MDA972951 0016 | GR03 | ARIZONASTA | VLSI PHOTONICS | 3/1 5/1999 | $149,984.00 |
26 | 1444 | MDA9729530027 | P00014 | BELLCORE | BROADBAND INFORMATION TECHNOLOGY | 1/4/1999 | $4,547,200.00 |
27 | 1445 | MDA9729530029 | A00009 | PLANARAMER | HIGH DEFINITION SYSTEMS (HDS) | 5/4/1999 | $0.00 |
28 | 1446 | MDA9729530029 | GR0008 | PLANARAMER | HIGH DEFINITION SYSTEMS (HDS) | 11/10/1998 | $7,570,137.00 |
29 | 1447 | MDA9729530036 | GR06 | ITNENERGYS | PHOTOVOLTAICS (VP) | 11/1 8/1998 | $558,900.00 |
... | ... | ... | ... | ... | ... | ... | ... |
488 | 1879 | M DA97299F0028 | DO | DIGITSYSIN | CONTRACT ADMINISTRATION | 7/14/1999 | $90,000.00 |
489 | 1880 | MDA97299F0028 | D001 | DIGITSYSIN | CONTRACTS MANAGEMENT | 6/30/1999 | $4,422.00 |
490 | 1881 | MDA97299F0029 | DO | DTAI | TECH INTEGRATION CENTER/TECH DEV CENTER | 8/4/1999 | $100,000.00 |
491 | 1882 | MDA97299F0030 | BASIC | BOOZALLEN | POLYMER MATERIALS (CONG ADD) | 5/15/1999 | $423,916.45 |
492 | 1883 | MDA97299F0031 | BASIC | SCHAFER | CEROS (FENCED) | 8/2/1999 | $59,972.00 |
493 | 1884 | MDA97299F0032 | DO | BRADSONCOR | ADVANCED SHIP/SENSOR SYSTEMS MRN-02 | 8/9/1999 | $43,425.18 |
494 | 1885 | MDA97299F0033 | DO | SYSPLANCOR | CONTRACTS MANAGEMENT | 8/30/1999 | $37,075.00 |
495 | 1886 | MDA97299F0033 | P00001 | SYSPLANCOR | CONTRACTS MANAGEMENT | 9/13/1999 | $0.00 |
496 | 1887 | MDA97299F0034 | BASIC | DIGITSYSIN | CONTRACTS MANAGEMENT | 8/31/1999 | $64,755.00 |
497 | 1888 | MDA97299M0002 | DO | INFOSYSLAB | ADVANCED GROUND SURVELLIANCE | 3/1211999 | $99,729.00 |
498 | 1889 | MDA97299M0003 | DO | SRC | ADVANCED MICROELECTRONICS | 4/14/1999 | $10,000.00 |
499 | A | B | c | D | E | F | G |
500 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
501 | 2 | None | None | None | None | None | None |
502 | 1890 | MDA97299M0004 | DO | ARDAK | BW MEDICAL DIAGNOSTICS | 3/30/1999 | $99,970.00 |
503 | 1891 | MDA97299M0004 | P00001 | ARDAK | BW MEDICAL DIAGNOSTICS | 5/26/1999 | $0.00 |
504 | 1892 | MDA97299M0004 | P00002 | ARDAK | BW MEDICAL DIAGNOSTICS | 8/4/1999 | $0.00 |
505 | 1893 | MDA97299M0005 | DO | SHA | SENSOR EMULATION | 5/4/1999 | $100,000.00 |
506 | 1894 | MDA97299M0005 | P00001 | SHA | SENSOR EMULATION | 5/12/1999 | $0.00 |
507 | 1895 | MDA97299M0006 | DO | VISTARESEA | UNDERSEA LITTORAL WARFARE | 4/12/1999 | $74,827.00 |
508 | 1896 | MDA97299M0007 | DO | VISUALEYES | COMBAT CASUALTY DIAGNOSTICS:ULTRASOUND | 5/3/1999 | $59,500.00 |
509 | 1897 | MDA97299M0008 | BASIC | BLUE RIDGE | OFFICE/PROGRAM SUPPORT (related to VTAX4) | 5/11/1999 | $48,566.00 |
510 | 1898 | MDA97299M0009 | DO | QRI | ADVANCED SIMULATION TECH | 6/29/1999 | $99,494.00 |
511 | 1899 | MDA97299M001 0 | DO | PRAJAINC | COUNTER MEASURES | 6/14/1999 | $80,460.00 |
512 | 1900 | MDA97299M0011 | BASIC | lVI | COUNTER MEASURES | 7/16/1999 | $90,000.00 |
513 | 1901 | MDA97299M0012 | BASIC | JERRYCOOKE | CONTRACT ADMINISTRATION | 5/3/1999 | $100,000.00 |
514 | 1902 | MDA97299M0013 | DO | DIAMONDBAC | TECH INTEGRATION CENTER/TECH DEV CENTER | 9/8/1999 | $50,000.00 |
515 | 1903 | MDA9769630014 | P00007 | SDLINC | SOLAR BLIND DETECTORS | 7/9/1999 | $0.00 |
516 | 1904 | FY SUBTOTAL: $340,495,021.94 | None | None | None | None | None |
517 | 1905 | None | None | None | None | None | None |
518 rows × 7 columns
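The dollar amounts in this output carry OCR noise, for example a stray space in $1 ,210,694.00, so they cannot be converted to numbers directly. A minimal sketch of a tolerant parser follows. The helper parse_amount is hypothetical; it only strips the dollar sign, commas, and whitespace, and makes no attempt to repair misread digits.

```python
import re


def parse_amount(text):
    """Turn an OCR'd dollar string into a float, or None if it is not money.

    Hypothetical helper: strips '$', commas and whitespace only.
    """
    if text is None:
        return None
    cleaned = re.sub(r"[\s$,]", "", text)
    if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
        return None
    return float(cleaned)


for raw in ["$687,000.00", "$1 ,210,694.00", "AMOUNT", None]:
    print(repr(raw), "->", parse_amount(raw))
```

Returning None for anything that does not look like money makes it easy to list the rows that still need a human look.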
darpa1999.columns=["Number", "CONTRACT_NUMBER", "CONTRACT_MOD", "PERFORMER","PROGRAM_TITLE","AWARD_DATE","AMOUNT"]
darpa1999
Number | CONTRACT_NUMBER | CONTRACT_MOD | PERFORMER | PROGRAM_TITLE | AWARD_DATE | AMOUNT | |
---|---|---|---|---|---|---|---|
0 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
1 | 2 | None | None | None | None | None | None |
2 | 1420 1999 | MDA97292J 1029 | GR20 | CNRI | INFORMATION MANAGEMENT | 12/10/1998 | $687,000.00 |
3 | 1421 | MDA97292J1 029 | GR22 | CNRI | COMMUNICATOR | 4/22/1999 | $400,000.00 |
4 | 1422 | MDA97292J1 029 | GR22 | CNRI | WEBINABOX | 4122/1999 | $360,000.00 |
5 | 1423 | MDA97292J1 029 | P00025 | CNRI | WEBINABOX | 8/24/1999 | $0.00 |
6 | 1424 | MDA972931 0030 | P00009 | GEORGIATEC | HIGH DEFINITION SYSTEMS (HDS) | 1/29/1999 | $1 ,210,694.00 |
7 | 1425 | MDA9729320014 | P00017 | USDISPLAYC | FLAT PANEL DISPLAYS | 8116/1999 | $5,794,000.00 |
8 | 1426 | MDA97293C0016 | P00043 | SYSPLANCOR | CHPS: Combat Hybrid Power Systems | 1nt1999 | $79,441.00 |
9 | 1427 | MDA97294C0003 | A00003 | BELLATLANT | NEXT GENERATION INTERNET | 8/28/1998 | $0.00 |
10 | 1428 | MDA97294C0003 | P00026 | BELLATLANT | NEXT GENERATION INTERNET | 1/2011999 | $332,197.00 |
11 | 1429 | MDA97294C0003 | P00027 | BELLATLANT | NEXT GENERATION INTERNET | 2/4/1999 | $94,750.00 |
12 | 1430 | MDA97294C0003 | P00028 | BELLATLANT | NEXT GENERATION INTERNET | 2/22/1999 | $450,000.00 |
13 | 1431 | MDA97294C0003 | P00029 | BELLATLANT | NEXT GENERATION INTERNET | 3/1/1999 | $254,750.00 |
14 | 1432 | MDA97294C0003 | P00030 | BELLATLANT | NEXT GENERATION INTERNET | 4/1 2/1999 | $0.00 |
15 | 1433 | MDA97294C0003 | P00031 | BELLATLANT | NEXT GENERATION INTERNET | 4/1 3/1999 | $254,750.00 |
16 | 1434 | MDA97294C0003 | P00032 | BELLATLANT | NEXT GENERATION INTERNET | 9/8/1999 | $254,750.00 |
17 | 1435 | MDA97294C0016 | P00026 | BDMFEDERAL | STOWACTD | 2/1 2/1999 | $117,000.00 |
18 | 1436 | MDA97294C0016 | P00027 | BDMFEDERAL | STOWACTD | 3/1/1999 | $273,000.00 |
19 | 1437 | MDA97294C0016 | P00028 | BDMFEDERAL | IMAGE UNDERSTANDING | 3/2911999 | $150,166.00 |
20 | 1438 | MDA97294C0016 | P00029 | BDMFEDERAL | STOWACTD | 5/27/1999 | $40,000.00 |
21 | 1439 | MDA97294C0016 | P00030 | BDMFEDERAL | STOWACTD | 911 /1999 | $55,930.00 |
22 | 1440 | MDA97294D0001 | D003/P16 | VRT | BADD | 12/9/1998 | $73,374.00 |
23 | 1441 | MDA97294D0001 | 0032/3 | VRT | AGILE INFO CONTROL ENVIRONMENT | 2/12/1999 | $100,095.00 |
24 | 1442 | MDA97294D0001 | 003202 | VALLEYELEC | AGILE INFO CONTROL ENVIRONMENT | 12/22/1998 | $100,095.00 |
25 | 1443 | MDA972951 0016 | GR03 | ARIZONASTA | VLSI PHOTONICS | 3/1 5/1999 | $149,984.00 |
26 | 1444 | MDA9729530027 | P00014 | BELLCORE | BROADBAND INFORMATION TECHNOLOGY | 1/4/1999 | $4,547,200.00 |
27 | 1445 | MDA9729530029 | A00009 | PLANARAMER | HIGH DEFINITION SYSTEMS (HDS) | 5/4/1999 | $0.00 |
28 | 1446 | MDA9729530029 | GR0008 | PLANARAMER | HIGH DEFINITION SYSTEMS (HDS) | 11/10/1998 | $7,570,137.00 |
29 | 1447 | MDA9729530036 | GR06 | ITNENERGYS | PHOTOVOLTAICS (VP) | 11/1 8/1998 | $558,900.00 |
... | ... | ... | ... | ... | ... | ... | ... |
488 | 1879 | M DA97299F0028 | DO | DIGITSYSIN | CONTRACT ADMINISTRATION | 7/14/1999 | $90,000.00 |
489 | 1880 | MDA97299F0028 | D001 | DIGITSYSIN | CONTRACTS MANAGEMENT | 6/30/1999 | $4,422.00 |
490 | 1881 | MDA97299F0029 | DO | DTAI | TECH INTEGRATION CENTER/TECH DEV CENTER | 8/4/1999 | $100,000.00 |
491 | 1882 | MDA97299F0030 | BASIC | BOOZALLEN | POLYMER MATERIALS (CONG ADD) | 5/15/1999 | $423,916.45 |
492 | 1883 | MDA97299F0031 | BASIC | SCHAFER | CEROS (FENCED) | 8/2/1999 | $59,972.00 |
493 | 1884 | MDA97299F0032 | DO | BRADSONCOR | ADVANCED SHIP/SENSOR SYSTEMS MRN-02 | 8/9/1999 | $43,425.18 |
494 | 1885 | MDA97299F0033 | DO | SYSPLANCOR | CONTRACTS MANAGEMENT | 8/30/1999 | $37,075.00 |
495 | 1886 | MDA97299F0033 | P00001 | SYSPLANCOR | CONTRACTS MANAGEMENT | 9/13/1999 | $0.00 |
496 | 1887 | MDA97299F0034 | BASIC | DIGITSYSIN | CONTRACTS MANAGEMENT | 8/31/1999 | $64,755.00 |
497 | 1888 | MDA97299M0002 | DO | INFOSYSLAB | ADVANCED GROUND SURVELLIANCE | 3/1211999 | $99,729.00 |
498 | 1889 | MDA97299M0003 | DO | SRC | ADVANCED MICROELECTRONICS | 4/14/1999 | $10,000.00 |
499 | A | B | c | D | E | F | G |
500 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
501 | 2 | None | None | None | None | None | None |
502 | 1890 | MDA97299M0004 | DO | ARDAK | BW MEDICAL DIAGNOSTICS | 3/30/1999 | $99,970.00 |
503 | 1891 | MDA97299M0004 | P00001 | ARDAK | BW MEDICAL DIAGNOSTICS | 5/26/1999 | $0.00 |
504 | 1892 | MDA97299M0004 | P00002 | ARDAK | BW MEDICAL DIAGNOSTICS | 8/4/1999 | $0.00 |
505 | 1893 | MDA97299M0005 | DO | SHA | SENSOR EMULATION | 5/4/1999 | $100,000.00 |
506 | 1894 | MDA97299M0005 | P00001 | SHA | SENSOR EMULATION | 5/12/1999 | $0.00 |
507 | 1895 | MDA97299M0006 | DO | VISTARESEA | UNDERSEA LITTORAL WARFARE | 4/12/1999 | $74,827.00 |
508 | 1896 | MDA97299M0007 | DO | VISUALEYES | COMBAT CASUALTY DIAGNOSTICS:ULTRASOUND | 5/3/1999 | $59,500.00 |
509 | 1897 | MDA97299M0008 | BASIC | BLUE RIDGE | OFFICE/PROGRAM SUPPORT (related to VTAX4) | 5/11/1999 | $48,566.00 |
510 | 1898 | MDA97299M0009 | DO | QRI | ADVANCED SIMULATION TECH | 6/29/1999 | $99,494.00 |
511 | 1899 | MDA97299M001 0 | DO | PRAJAINC | COUNTER MEASURES | 6/14/1999 | $80,460.00 |
512 | 1900 | MDA97299M0011 | BASIC | lVI | COUNTER MEASURES | 7/16/1999 | $90,000.00 |
513 | 1901 | MDA97299M0012 | BASIC | JERRYCOOKE | CONTRACT ADMINISTRATION | 5/3/1999 | $100,000.00 |
514 | 1902 | MDA97299M0013 | DO | DIAMONDBAC | TECH INTEGRATION CENTER/TECH DEV CENTER | 9/8/1999 | $50,000.00 |
515 | 1903 | MDA9769630014 | P00007 | SDLINC | SOLAR BLIND DETECTORS | 7/9/1999 | $0.00 |
516 | 1904 | FY SUBTOTAL: $340,495,021.94 | None | None | None | None | None |
517 | 1905 | None | None | None | None | None | None |
518 rows × 7 columns
darpa1999=darpa1999[2:]
A different problem! The column titles are repeated at the top of each sheet!
There are lots of ways to find and eliminate the unnecessary rows.
In many cases, repeated titles mean that you'll have column names showing up as values. Pick out just those rows and remove them to clean your data.
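One compact way to pick out such rows is isin with a list of the marker values, negated to keep everything else. The sketch below uses a hypothetical miniature frame, not the real table; the marker values "A" and "1 FY" are the ones that appear in the first column of the repeated header rows.

```python
import pandas as pd

# Hypothetical miniature of the parsed table: real rows with page headers mixed in.
df = pd.DataFrame({
    "Number": ["1420", "1421", "A", "1 FY", "1422"],
    "PERFORMER": ["CNRI", "CNRI", "D", "PROGRAM TITLE", "CNRI"],
})

# Values that only ever appear in the repeated header rows.
header_markers = ["A", "1 FY"]

is_header = df["Number"].isin(header_markers)
cleaned = df[~is_header].reset_index(drop=True)
print(cleaned)
```

Keeping the boolean mask in its own variable makes it easy to count or inspect the rows before dropping them.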
darpa1999[darpa1999["Number"]==("A")]
Number | CONTRACT_NUMBER | CONTRACT_MOD | PERFORMER | PROGRAM_TITLE | AWARD_DATE | AMOUNT | |
---|---|---|---|---|---|---|---|
49 | A | B | c | D | E | F | G |
99 | A | B | c | D | E | F | G |
149 | A | B | c | D | E | F | G |
199 | A | B | c | D | E | F | G |
249 | A | B | c | D | E | F | G |
299 | A | B | c | D | E | F | G |
349 | A | B | c | D | E | F | G |
399 | A | B | c | D | E | F | G |
449 | A | B | c | D | E | F | G |
499 | A | B | c | D | E | F | G |
darpa1999[darpa1999["Number"]==("1 FY")]
Number | CONTRACT_NUMBER | CONTRACT_MOD | PERFORMER | PROGRAM_TITLE | AWARD_DATE | AMOUNT | |
---|---|---|---|---|---|---|---|
50 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
100 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
150 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
200 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
250 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
300 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
350 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
400 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
500 | 1 FY | CONTRACT NUMBER CONTRACT MOD PERFORMER | PROGRAM TITLE | AWARD DATE | AMOUNT | None | None |
rows_to_include=(darpa1999["Number"]!="A") & (darpa1999["Number"]!="1 FY")
darpa1999=darpa1999[rows_to_include]
darpa1999
Number | CONTRACT_NUMBER | CONTRACT_MOD | PERFORMER | PROGRAM_TITLE | AWARD_DATE | AMOUNT | |
---|---|---|---|---|---|---|---|
2 | 1420 1999 | MDA97292J 1029 | GR20 | CNRI | INFORMATION MANAGEMENT | 12/10/1998 | $687,000.00 |
3 | 1421 | MDA97292J1 029 | GR22 | CNRI | COMMUNICATOR | 4/22/1999 | $400,000.00 |
4 | 1422 | MDA97292J1 029 | GR22 | CNRI | WEBINABOX | 4122/1999 | $360,000.00 |
5 | 1423 | MDA97292J1 029 | P00025 | CNRI | WEBINABOX | 8/24/1999 | $0.00 |
6 | 1424 | MDA972931 0030 | P00009 | GEORGIATEC | HIGH DEFINITION SYSTEMS (HDS) | 1/29/1999 | $1 ,210,694.00 |
7 | 1425 | MDA9729320014 | P00017 | USDISPLAYC | FLAT PANEL DISPLAYS | 8116/1999 | $5,794,000.00 |
8 | 1426 | MDA97293C0016 | P00043 | SYSPLANCOR | CHPS: Combat Hybrid Power Systems | 1nt1999 | $79,441.00 |
9 | 1427 | MDA97294C0003 | A00003 | BELLATLANT | NEXT GENERATION INTERNET | 8/28/1998 | $0.00 |
10 | 1428 | MDA97294C0003 | P00026 | BELLATLANT | NEXT GENERATION INTERNET | 1/2011999 | $332,197.00 |
11 | 1429 | MDA97294C0003 | P00027 | BELLATLANT | NEXT GENERATION INTERNET | 2/4/1999 | $94,750.00 |
12 | 1430 | MDA97294C0003 | P00028 | BELLATLANT | NEXT GENERATION INTERNET | 2/22/1999 | $450,000.00 |
13 | 1431 | MDA97294C0003 | P00029 | BELLATLANT | NEXT GENERATION INTERNET | 3/1/1999 | $254,750.00 |
14 | 1432 | MDA97294C0003 | P00030 | BELLATLANT | NEXT GENERATION INTERNET | 4/1 2/1999 | $0.00 |
15 | 1433 | MDA97294C0003 | P00031 | BELLATLANT | NEXT GENERATION INTERNET | 4/1 3/1999 | $254,750.00 |
16 | 1434 | MDA97294C0003 | P00032 | BELLATLANT | NEXT GENERATION INTERNET | 9/8/1999 | $254,750.00 |
17 | 1435 | MDA97294C0016 | P00026 | BDMFEDERAL | STOWACTD | 2/1 2/1999 | $117,000.00 |
18 | 1436 | MDA97294C0016 | P00027 | BDMFEDERAL | STOWACTD | 3/1/1999 | $273,000.00 |
19 | 1437 | MDA97294C0016 | P00028 | BDMFEDERAL | IMAGE UNDERSTANDING | 3/2911999 | $150,166.00 |
20 | 1438 | MDA97294C0016 | P00029 | BDMFEDERAL | STOWACTD | 5/27/1999 | $40,000.00 |
21 | 1439 | MDA97294C0016 | P00030 | BDMFEDERAL | STOWACTD | 911 /1999 | $55,930.00 |
22 | 1440 | MDA97294D0001 | D003/P16 | VRT | BADD | 12/9/1998 | $73,374.00 |
23 | 1441 | MDA97294D0001 | 0032/3 | VRT | AGILE INFO CONTROL ENVIRONMENT | 2/12/1999 | $100,095.00 |
24 | 1442 | MDA97294D0001 | 003202 | VALLEYELEC | AGILE INFO CONTROL ENVIRONMENT | 12/22/1998 | $100,095.00 |
25 | 1443 | MDA972951 0016 | GR03 | ARIZONASTA | VLSI PHOTONICS | 3/1 5/1999 | $149,984.00 |
26 | 1444 | MDA9729530027 | P00014 | BELLCORE | BROADBAND INFORMATION TECHNOLOGY | 1/4/1999 | $4,547,200.00 |
27 | 1445 | MDA9729530029 | A00009 | PLANARAMER | HIGH DEFINITION SYSTEMS (HDS) | 5/4/1999 | $0.00 |
28 | 1446 | MDA9729530029 | GR0008 | PLANARAMER | HIGH DEFINITION SYSTEMS (HDS) | 11/10/1998 | $7,570,137.00 |
29 | 1447 | MDA9729530036 | GR06 | ITNENERGYS | PHOTOVOLTAICS (VP) | 11/1 8/1998 | $558,900.00 |
30 | 1448 | MDA9729530042 | GR011 | CRAYRESEAR | SHOCC | 6nt1999 | $1 ,289,562.00 |
31 | 1449 | MDA97295C0004 | P00008 | UMASS | LARGE MILLIMETER TELESCOPE | 8/30/1999 | $1 ,151 ,500.00 |
... | ... | ... | ... | ... | ... | ... | ... |
486 | 1877 | MDA97299F0025 | BASIC | SYSPLANCOR | COUNTER UNDERGROUND FACILITIES | 6/25/1999 | $251 ,924.00 |
487 | 1878 | MDA97299F0027 | DO | ORIONSCSYS | COUNTER MEASURES | 6/11/1999 | $199,991 .00 |
488 | 1879 | M DA97299F0028 | DO | DIGITSYSIN | CONTRACT ADMINISTRATION | 7/14/1999 | $90,000.00 |
489 | 1880 | MDA97299F0028 | D001 | DIGITSYSIN | CONTRACTS MANAGEMENT | 6/30/1999 | $4,422.00 |
490 | 1881 | MDA97299F0029 | DO | DTAI | TECH INTEGRATION CENTER/TECH DEV CENTER | 8/4/1999 | $100,000.00 |
491 | 1882 | MDA97299F0030 | BASIC | BOOZALLEN | POLYMER MATERIALS (CONG ADD) | 5/15/1999 | $423,916.45 |
492 | 1883 | MDA97299F0031 | BASIC | SCHAFER | CEROS (FENCED) | 8/2/1999 | $59,972.00 |
493 | 1884 | MDA97299F0032 | DO | BRADSONCOR | ADVANCED SHIP/SENSOR SYSTEMS MRN-02 | 8/9/1999 | $43,425.18 |
494 | 1885 | MDA97299F0033 | DO | SYSPLANCOR | CONTRACTS MANAGEMENT | 8/30/1999 | $37,075.00 |
495 | 1886 | MDA97299F0033 | P00001 | SYSPLANCOR | CONTRACTS MANAGEMENT | 9/13/1999 | $0.00 |
496 | 1887 | MDA97299F0034 | BASIC | DIGITSYSIN | CONTRACTS MANAGEMENT | 8/31/1999 | $64,755.00 |
497 | 1888 | MDA97299M0002 | DO | INFOSYSLAB | ADVANCED GROUND SURVELLIANCE | 3/1211999 | $99,729.00 |
498 | 1889 | MDA97299M0003 | DO | SRC | ADVANCED MICROELECTRONICS | 4/14/1999 | $10,000.00 |
501 | 2 | None | None | None | None | None | None |
502 | 1890 | MDA97299M0004 | DO | ARDAK | BW MEDICAL DIAGNOSTICS | 3/30/1999 | $99,970.00 |
503 | 1891 | MDA97299M0004 | P00001 | ARDAK | BW MEDICAL DIAGNOSTICS | 5/26/1999 | $0.00 |
504 | 1892 | MDA97299M0004 | P00002 | ARDAK | BW MEDICAL DIAGNOSTICS | 8/4/1999 | $0.00 |
505 | 1893 | MDA97299M0005 | DO | SHA | SENSOR EMULATION | 5/4/1999 | $100,000.00 |
506 | 1894 | MDA97299M0005 | P00001 | SHA | SENSOR EMULATION | 5/12/1999 | $0.00 |
507 | 1895 | MDA97299M0006 | DO | VISTARESEA | UNDERSEA LITTORAL WARFARE | 4/12/1999 | $74,827.00 |
508 | 1896 | MDA97299M0007 | DO | VISUALEYES | COMBAT CASUALTY DIAGNOSTICS:ULTRASOUND | 5/3/1999 | $59,500.00 |
509 | 1897 | MDA97299M0008 | BASIC | BLUE RIDGE | OFFICE/PROGRAM SUPPORT (related to VTAX4) | 5/11/1999 | $48,566.00 |
510 | 1898 | MDA97299M0009 | DO | QRI | ADVANCED SIMULATION TECH | 6/29/1999 | $99,494.00 |
511 | 1899 | MDA97299M001 0 | DO | PRAJAINC | COUNTER MEASURES | 6/14/1999 | $80,460.00 |
512 | 1900 | MDA97299M0011 | BASIC | lVI | COUNTER MEASURES | 7/16/1999 | $90,000.00 |
513 | 1901 | MDA97299M0012 | BASIC | JERRYCOOKE | CONTRACT ADMINISTRATION | 5/3/1999 | $100,000.00 |
514 | 1902 | MDA97299M0013 | DO | DIAMONDBAC | TECH INTEGRATION CENTER/TECH DEV CENTER | 9/8/1999 | $50,000.00 |
515 | 1903 | MDA9769630014 | P00007 | SDLINC | SOLAR BLIND DETECTORS | 7/9/1999 | $0.00 |
516 | 1904 | FY SUBTOTAL: $340,495,021.94 | None | None | None | None | None |
517 | 1905 | None | None | None | None | None | None |
497 rows × 7 columns
darpa1999.loc[2, "Number"] = "1420"
darpa1999
Number | CONTRACT_NUMBER | CONTRACT_MOD | PERFORMER | PROGRAM_TITLE | AWARD_DATE | AMOUNT | |
---|---|---|---|---|---|---|---|
2 | 1420 | MDA97292J 1029 | GR20 | CNRI | INFORMATION MANAGEMENT | 12/10/1998 | $687,000.00 |
3 | 1421 | MDA97292J1 029 | GR22 | CNRI | COMMUNICATOR | 4/22/1999 | $400,000.00 |
4 | 1422 | MDA97292J1 029 | GR22 | CNRI | WEBINABOX | 4122/1999 | $360,000.00 |
5 | 1423 | MDA97292J1 029 | P00025 | CNRI | WEBINABOX | 8/24/1999 | $0.00 |
6 | 1424 | MDA972931 0030 | P00009 | GEORGIATEC | HIGH DEFINITION SYSTEMS (HDS) | 1/29/1999 | $1 ,210,694.00 |
7 | 1425 | MDA9729320014 | P00017 | USDISPLAYC | FLAT PANEL DISPLAYS | 8116/1999 | $5,794,000.00 |
8 | 1426 | MDA97293C0016 | P00043 | SYSPLANCOR | CHPS: Combat Hybrid Power Systems | 1nt1999 | $79,441.00 |
9 | 1427 | MDA97294C0003 | A00003 | BELLATLANT | NEXT GENERATION INTERNET | 8/28/1998 | $0.00 |
10 | 1428 | MDA97294C0003 | P00026 | BELLATLANT | NEXT GENERATION INTERNET | 1/2011999 | $332,197.00 |
11 | 1429 | MDA97294C0003 | P00027 | BELLATLANT | NEXT GENERATION INTERNET | 2/4/1999 | $94,750.00 |
12 | 1430 | MDA97294C0003 | P00028 | BELLATLANT | NEXT GENERATION INTERNET | 2/22/1999 | $450,000.00 |
13 | 1431 | MDA97294C0003 | P00029 | BELLATLANT | NEXT GENERATION INTERNET | 3/1/1999 | $254,750.00 |
14 | 1432 | MDA97294C0003 | P00030 | BELLATLANT | NEXT GENERATION INTERNET | 4/1 2/1999 | $0.00 |
15 | 1433 | MDA97294C0003 | P00031 | BELLATLANT | NEXT GENERATION INTERNET | 4/1 3/1999 | $254,750.00 |
16 | 1434 | MDA97294C0003 | P00032 | BELLATLANT | NEXT GENERATION INTERNET | 9/8/1999 | $254,750.00 |
17 | 1435 | MDA97294C0016 | P00026 | BDMFEDERAL | STOWACTD | 2/1 2/1999 | $117,000.00 |
18 | 1436 | MDA97294C0016 | P00027 | BDMFEDERAL | STOWACTD | 3/1/1999 | $273,000.00 |
19 | 1437 | MDA97294C0016 | P00028 | BDMFEDERAL | IMAGE UNDERSTANDING | 3/2911999 | $150,166.00 |
20 | 1438 | MDA97294C0016 | P00029 | BDMFEDERAL | STOWACTD | 5/27/1999 | $40,000.00 |
21 | 1439 | MDA97294C0016 | P00030 | BDMFEDERAL | STOWACTD | 911 /1999 | $55,930.00 |
22 | 1440 | MDA97294D0001 | D003/P16 | VRT | BADD | 12/9/1998 | $73,374.00 |
23 | 1441 | MDA97294D0001 | 0032/3 | VRT | AGILE INFO CONTROL ENVIRONMENT | 2/12/1999 | $100,095.00 |
24 | 1442 | MDA97294D0001 | 003202 | VALLEYELEC | AGILE INFO CONTROL ENVIRONMENT | 12/22/1998 | $100,095.00 |
25 | 1443 | MDA972951 0016 | GR03 | ARIZONASTA | VLSI PHOTONICS | 3/1 5/1999 | $149,984.00 |
26 | 1444 | MDA9729530027 | P00014 | BELLCORE | BROADBAND INFORMATION TECHNOLOGY | 1/4/1999 | $4,547,200.00 |
27 | 1445 | MDA9729530029 | A00009 | PLANARAMER | HIGH DEFINITION SYSTEMS (HDS) | 5/4/1999 | $0.00 |
28 | 1446 | MDA9729530029 | GR0008 | PLANARAMER | HIGH DEFINITION SYSTEMS (HDS) | 11/10/1998 | $7,570,137.00 |
29 | 1447 | MDA9729530036 | GR06 | ITNENERGYS | PHOTOVOLTAICS (VP) | 11/1 8/1998 | $558,900.00 |
30 | 1448 | MDA9729530042 | GR011 | CRAYRESEAR | SHOCC | 6nt1999 | $1 ,289,562.00 |
31 | 1449 | MDA97295C0004 | P00008 | UMASS | LARGE MILLIMETER TELESCOPE | 8/30/1999 | $1 ,151 ,500.00 |
... | ... | ... | ... | ... | ... | ... | ... |
486 | 1877 | MDA97299F0025 | BASIC | SYSPLANCOR | COUNTER UNDERGROUND FACILITIES | 6/25/1999 | $251 ,924.00 |
487 | 1878 | MDA97299F0027 | DO | ORIONSCSYS | COUNTER MEASURES | 6/11/1999 | $199,991 .00 |
488 | 1879 | M DA97299F0028 | DO | DIGITSYSIN | CONTRACT ADMINISTRATION | 7/14/1999 | $90,000.00 |
489 | 1880 | MDA97299F0028 | D001 | DIGITSYSIN | CONTRACTS MANAGEMENT | 6/30/1999 | $4,422.00 |
490 | 1881 | MDA97299F0029 | DO | DTAI | TECH INTEGRATION CENTER/TECH DEV CENTER | 8/4/1999 | $100,000.00 |
491 | 1882 | MDA97299F0030 | BASIC | BOOZALLEN | POLYMER MATERIALS (CONG ADD) | 5/15/1999 | $423,916.45 |
492 | 1883 | MDA97299F0031 | BASIC | SCHAFER | CEROS (FENCED) | 8/2/1999 | $59,972.00 |
493 | 1884 | MDA97299F0032 | DO | BRADSONCOR | ADVANCED SHIP/SENSOR SYSTEMS MRN-02 | 8/9/1999 | $43,425.18 |
494 | 1885 | MDA97299F0033 | DO | SYSPLANCOR | CONTRACTS MANAGEMENT | 8/30/1999 | $37,075.00 |
495 | 1886 | MDA97299F0033 | P00001 | SYSPLANCOR | CONTRACTS MANAGEMENT | 9/13/1999 | $0.00 |
496 | 1887 | MDA97299F0034 | BASIC | DIGITSYSIN | CONTRACTS MANAGEMENT | 8/31/1999 | $64,755.00 |
497 | 1888 | MDA97299M0002 | DO | INFOSYSLAB | ADVANCED GROUND SURVELLIANCE | 3/1211999 | $99,729.00 |
498 | 1889 | MDA97299M0003 | DO | SRC | ADVANCED MICROELECTRONICS | 4/14/1999 | $10,000.00 |
501 | 2 | None | None | None | None | None | None |
502 | 1890 | MDA97299M0004 | DO | ARDAK | BW MEDICAL DIAGNOSTICS | 3/30/1999 | $99,970.00 |
503 | 1891 | MDA97299M0004 | P00001 | ARDAK | BW MEDICAL DIAGNOSTICS | 5/26/1999 | $0.00 |
504 | 1892 | MDA97299M0004 | P00002 | ARDAK | BW MEDICAL DIAGNOSTICS | 8/4/1999 | $0.00 |
505 | 1893 | MDA97299M0005 | DO | SHA | SENSOR EMULATION | 5/4/1999 | $100,000.00 |
506 | 1894 | MDA97299M0005 | P00001 | SHA | SENSOR EMULATION | 5/12/1999 | $0.00 |
507 | 1895 | MDA97299M0006 | DO | VISTARESEA | UNDERSEA LITTORAL WARFARE | 4/12/1999 | $74,827.00 |
508 | 1896 | MDA97299M0007 | DO | VISUALEYES | COMBAT CASUALTY DIAGNOSTICS:ULTRASOUND | 5/3/1999 | $59,500.00 |
509 | 1897 | MDA97299M0008 | BASIC | BLUE RIDGE | OFFICE/PROGRAM SUPPORT (related to VTAX4) | 5/11/1999 | $48,566.00 |
510 | 1898 | MDA97299M0009 | DO | QRI | ADVANCED SIMULATION TECH | 6/29/1999 | $99,494.00 |
511 | 1899 | MDA97299M001 0 | DO | PRAJAINC | COUNTER MEASURES | 6/14/1999 | $80,460.00 |
512 | 1900 | MDA97299M0011 | BASIC | lVI | COUNTER MEASURES | 7/16/1999 | $90,000.00 |
513 | 1901 | MDA97299M0012 | BASIC | JERRYCOOKE | CONTRACT ADMINISTRATION | 5/3/1999 | $100,000.00 |
514 | 1902 | MDA97299M0013 | DO | DIAMONDBAC | TECH INTEGRATION CENTER/TECH DEV CENTER | 9/8/1999 | $50,000.00 |
515 | 1903 | MDA9769630014 | P00007 | SDLINC | SOLAR BLIND DETECTORS | 7/9/1999 | $0.00 |
516 | 1904 | FY SUBTOTAL: $340,495,021.94 | None | None | None | None | None |
517 | 1905 | None | None | None | None | None | None |
497 rows × 7 columns
Let's get rid of those yucky last two rows (the fiscal-year subtotal and an empty row).
darpa1999=darpa1999[:-2]
import re
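substitute:
re.sub("REGULAR EXPRESSION", <REPLACEMENT, "" if nothing>, <STRING_TO_BE_OPERATED_UPON>)
## `.map` applies a function to every value in a column; here the function is a `lambda` that runs `re.sub` on each AMOUNT string.
darpa1999["AMOUNT"].astype("str").map(lambda x: re.sub(r"[^\d.()]", "", x))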
2 687000.00 3 400000.00 4 360000.00 5 0.00 6 1210694.00 7 5794000.00 8 79441.00 9 0.00 10 332197.00 11 94750.00 12 450000.00 13 254750.00 14 0.00 15 254750.00 16 254750.00 17 117000.00 18 273000.00 19 150166.00 20 40000.00 21 55930.00 22 73374.00 23 100095.00 24 100095.00 25 149984.00 26 4547200.00 27 0.00 28 7570137.00 29 558900.00 30 1289562.00 31 1151500.00 ... 484 40000.00 485 117000.00 486 251924.00 487 199991.00 488 90000.00 489 4422.00 490 100000.00 491 423916.45 492 59972.00 493 43425.18 494 37075.00 495 0.00 496 64755.00 497 99729.00 498 10000.00 501 502 99970.00 503 0.00 504 0.00 505 100000.00 506 0.00 507 74827.00 508 59500.00 509 48566.00 510 99494.00 511 80460.00 512 90000.00 513 100000.00 514 50000.00 515 0.00 Name: AMOUNT, dtype: object
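To make the `map` step concrete, here is a minimal sketch on a made-up three-value Series. The sample strings are invented for illustration (they imitate the stray spaces OCR leaves in the amounts above) and `sample` and `cleaned` are hypothetical names:

```python
import re

import pandas as pd

# Invented sample values that mimic the OCR'd AMOUNT column.
sample = pd.Series(["$1 ,210,694.00", "($500,000.00)", "$0.00"])

# map() calls the lambda once per value; re.sub deletes every character
# that is not a digit, a period, or a parenthesis.
cleaned = sample.map(lambda x: re.sub(r"[^\d.()]", "", x))

print(cleaned.tolist())
# ['1210694.00', '(500000.00)', '0.00']
```

The result is still a column of strings; turning it into numbers is a separate step.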
## First get rid of everything that is not a digit, a period, or a `(`.
## Note: for non-Anglo-American sources, keep the commas and strip the periods instead.
darpa1999["AMOUNT"]=darpa1999["AMOUNT"].astype("str").map(lambda x: re.sub(r"[^\d.(]", "", x))
# [^\d.(] matches any single character except a digit, "." or "("
## For European-style numbers you'd use `re.sub(r"[^\d,(]", "", x)`, then swap the decimal comma for a period before converting.
/home/mljones/anaconda/lib/python2.7/site-packages/IPython/kernel/__main__.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
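The warning above appears because `darpa1999` was produced by slicing another DataFrame, so pandas cannot tell whether the assignment should reach the original. One common way to avoid it, sketched here on an invented stand-in table (`full` and `trimmed` are hypothetical names, not the notebook's data), is to take an explicit copy when trimming and to assign through `.loc`:

```python
import pandas as pd

# Invented stand-in for the contracts table; only the column name mirrors the notebook.
full = pd.DataFrame({"AMOUNT": ["$10.00", "$2,000.00", "FY SUBTOTAL", None]})

# .copy() makes the trimmed frame independent of `full`,
# so later assignments are unambiguous and raise no warning.
trimmed = full[:-2].copy()

# Assign with an explicit row and column indexer, as the warning suggests.
trimmed.loc[:, "AMOUNT"] = trimmed["AMOUNT"].str.replace(",", "", regex=False)
```

The original frame is left untouched, which is usually what you want when cleaning a trimmed subset.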
darpa1999["AMOUNT"].values
array(['687000.00', '400000.00', '360000.00', '0.00', '1210694.00', '5794000.00', '79441.00', '0.00', '332197.00', '94750.00', '450000.00', '254750.00', '0.00', '254750.00', '254750.00', '117000.00', '273000.00', '150166.00', '40000.00', '55930.00', '73374.00', '100095.00', '100095.00', '149984.00', '4547200.00', '0.00', '7570137.00', '558900.00', '1289562.00', '1151500.00', '498383.00', '2000000.00', '290000.00', '1516319.00', '210500.00', '210500.00', '0.00', '490071.00', '48000.00', '1321000.00', '423212.00', '3417.00', '5286.00', '0.00', '(109494.00', '65572.00', '(10000.00', '', '60000.00', '566045.00', '0.00', '1926000.00', '808440.00', '244629.00', '0.00', '250000.00', '0.00', '810734.00', '1000000.00', '2790100.00', '3497712.00', '300000.00', '0.00', '1606470.00', '874.00', '0.00', '0.00', '450000.00', '97000.00', '0.00', '50000.00', '333929.00', '97000.00', '0.00', '100000.00', '0.00', '500000.00', '0.00', '599878.00', '980000.00', '70000.00', '30000.00', '0.00', '2100000.00', '0.00', '0.00', '0.00', '921400.00', '100000.00', '1790000.00', '267500.00', '0.00', '91071.00', '40056.00', '0.00', '', '741000.00', '300000.00', '30000.00', '900000.00', '1950000.00', '499052.00', '10523.00', '35528.00', '49905.00', '52158.00', '51861.00', '800000.00', '0.00', '0.00', '365396.00', '500000.00', '400000.00', '300000.00', '0.00', '3474136.00', '3400000.00', '1380000.00', '162618.00', '0.00', '0.00', '150000.00', '15000.00', '45000.00', '0.00', '205000.00', '2897000.00', '503630.00', '0.00', '395320.00', '0.00', '6510000.00', '4600545.00', '(402700.00', '2700000.00', '898000.00', '16000.00', '650000.00', '0.00', '0.00', '9901798.00', '0.00', '20000.00', '', '55000.00', '750000.00', '0.00', '345372.00', '0.00', '209431.00', '500000.00', '290569.00', '1188000.00', '200000.00', '400795.00', '176482.00', '350000.00', '117420.00', '430533.00', '912000.00', '500000.00', '480975.00', '0.00', '105408.00', '386493.00', '107142.00', '392858.00', '199309.00', '2610132.00', 
'150000.00', '103950.00', '119950.00', '200000.00', '0.00', '2130000.00', '1000000.00', '9250000.00', '30950.00', '174798.00', '0.00', '0.00', '199899.00', '550000.00', '350000.00', '0.00', '200000.00', '149999.00', '100000.00', '258264.00', '327206.00', '93780.00', '', '102706.00', '0.00', '0.00', '2135000.00', '3750000.00', '100000.00', '0.00', '2200000.00', '144000.00', '4384000.00', '0.00', '1785000.00', '1700000.00', '0.00', '0.00', '56000.00', '750000.00', '1853441.00', '2036559.00', '750000.00', '20526.00', '184428.00', '256638.00', '321768.00', '204435.00', '43650.00', '350000.00', '735760.00', '1819008.00', '1500000.00', '0.00', '0.00', '7925688.00', '0.00', '800000.00', '650000.00', '186083.00', '190842.00', '(500000.00', '6565000.00', '0.00', '874010.00', '0.00', '518453.00', '0.00', '847010.00', '27000.00', '', '0.00', '3660000.00', '5600000.00', '5600000.00', '5600000.00', '5000000.00', '0.00', '5000000.00', '1427526.00', '243774.00', '0.00', '0.00', '1064928.00', '556226.00', '867000.00', '(800000.00', '200000.00', '769226.00', '1650000.00', '1915045.00', '1642515.00', '800000.00', '370416.00', '900000.00', '100000.00', '', '', '', '3333000.00', '', '4082000.00', '1515000.00', '687927.00', '116667.00', '583333.00', '143433.00', '349909.00', '2053443.00', '108700.00', '300000.00', '391300.00', '0.00', '6119332.00', '0.00', '1490998.00', '26658.00', '114281.00', '', '39415.00', '700000.00', '1293841.00', '0.00', '302233.00', '0.00', '199789.00', '247900.00', '0.00', '1200000.00', '124983.00', '0.00', '0.00', '294996.00', '244295.00', '0.00', '76655.00', '0.00', '0.00', '413864.00', '324929.00', '79167.00', '194148.00', '375000.00', '2000000.00', '(120000.00', '', '2000000.00', '0.00', '2586066.00', '100000.00', '100000.00', '50000.00', '100000.00', '156000.00', '0.00', '50000.00', '0.00', '100000.00', '450000.00', '50000.00', '250000.00', '850000.00', '200000.00', '380492.00', '155992.00', '130000.00', '', '210000.00', '355947.00', '80000.00', 
'32839.00', '79846.00', '97966.00', '392856.00', '70778.00', '24864.00', '395859.00', '124999.00', '1800000.00', '', '5883520.00', '1169500.00', '', '698000.00', '3075000.00', '3075000.00', '2311497.00', '500000.00', '200000.00', '8826140.00', '0.00', '0.00', '95524.00', '25000.00', '400000.00', '150000.00', '100000.00', '65000.00', '68000.00', '771805.00', '', '222649.00', '0.00', '242880.00', '50000.00', '3282970.00', '0.00', '365731.00', '(365731.00', '200000.00', '195867.00', '417065.00', '666262.00', '493980.00', '', '374990.00', '833281.00', '357869.00', '95943.00', '0.00', '100000.00', '299997.00', '500000.00', '0.00', '400000.00', '114000.00', '0.00', '1000000.00', '499927.00', '497.00', '218000.00', '176491.00', '0.00', '0.00', '273501.00', '0.00', '100000.00', '0.00', '594401.00', '0.00', '0.00', '0.00', '0.00', '0.00', '200000.00', '200000.00', '250000.00', '534702.00', '', '31569.00', '124928.00', '0.00', '168421.00', '80000.00', '60669.00', '140000.00', '40000.00', '20000.00', '82109.00', '60000.00', '104539.00', '70000.00', '', '', '59000.00', '312360.00', '290580.00', '332000.00', '320000.00', '78000.00', '285000.00', '0.00', '31697.37', '114950.00', '0.00', '393486.00', '210262.00', '300000.00', '329987.00', '500386.00', '210262.00', '129850.00', '119732.00', '119732.00', '0.00', '81772.38', '73096.56', '789358.00', '49959.00', '100000.00', '306000.00', '306000.00', '600000.00', '75000.00', '183000.00', '179993.00', '40000.00', '117000.00', '251924.00', '199991.00', '90000.00', '4422.00', '100000.00', '423916.45', '59972.00', '43425.18', '37075.00', '0.00', '64755.00', '99729.00', '10000.00', '', '99970.00', '0.00', '0.00', '100000.00', '0.00', '74827.00', '59500.00', '48566.00', '99494.00', '80460.00', '90000.00', '100000.00', '50000.00', '0.00'], dtype=object)
# Turn each "(" into a minus sign: in accounting notation, parentheses mark negative amounts.
darpa1999["AMOUNT"]=darpa1999["AMOUNT"].astype("str").map(lambda x: re.sub(r"\(", "-", x))
/home/mljones/anaconda/lib/python2.7/site-packages/IPython/kernel/__main__.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy app.launch_new_instance()
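# Finally, convert the strings into numbers.
# pd.to_numeric does the trick (it replaces the older convert_objects, which was deprecated and later removed from pandas);
# errors="coerce" turns strings that cannot be parsed, such as the empty ones, into NaN.
pd.to_numeric(darpa1999["AMOUNT"], errors="coerce")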
2 687000.00 3 400000.00 4 360000.00 5 0.00 6 1210694.00 7 5794000.00 8 79441.00 9 0.00 10 332197.00 11 94750.00 12 450000.00 13 254750.00 14 0.00 15 254750.00 16 254750.00 17 117000.00 18 273000.00 19 150166.00 20 40000.00 21 55930.00 22 73374.00 23 100095.00 24 100095.00 25 149984.00 26 4547200.00 27 0.00 28 7570137.00 29 558900.00 30 1289562.00 31 1151500.00 ... 484 40000.00 485 117000.00 486 251924.00 487 199991.00 488 90000.00 489 4422.00 490 100000.00 491 423916.45 492 59972.00 493 43425.18 494 37075.00 495 0.00 496 64755.00 497 99729.00 498 10000.00 501 NaN 502 99970.00 503 0.00 504 0.00 505 100000.00 506 0.00 507 74827.00 508 59500.00 509 48566.00 510 99494.00 511 80460.00 512 90000.00 513 100000.00 514 50000.00 515 0.00 Name: AMOUNT, dtype: float64
darpa1999["AMOUNT"]=pd.to_numeric(darpa1999["AMOUNT"], errors="coerce")
/home/mljones/anaconda/lib/python2.7/site-packages/IPython/kernel/__main__.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy if __name__ == '__main__':
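The cleaning steps used on AMOUNT (strip the junk, turn `(` into a minus sign, convert to a number) can also be folded into one small function. This is only a sketch: `parse_amount` and its `decimal` parameter are hypothetical names, and the European variant assumes a comma is the decimal mark.

```python
import re


def parse_amount(text, decimal="."):
    """Hypothetical helper: turn OCR'd currency text into a float, or None."""
    # Keep only digits, the decimal mark, and "(" (which flags a negative amount).
    kept = re.sub(r"[^\d(" + re.escape(decimal) + "]", "", str(text))
    # Accounting notation: an opening parenthesis means the amount is negative.
    kept = kept.replace("(", "-")
    # Normalise a decimal comma to the period that float() expects.
    kept = kept.replace(decimal, ".")
    try:
        return float(kept)
    except ValueError:
        # Empty or unparseable cells, such as the rows holding "None".
        return None


print(parse_amount("$1 ,210,694.00"))             # 1210694.0
print(parse_amount("($500,000.00)"))              # -500000.0
print(parse_amount("1.210.694,50", decimal=","))  # 1210694.5
```

A function like this can be handed straight to `.map`, in place of the three separate passes.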
darpa1999
Number | CONTRACT_NUMBER | CONTRACT_MOD | PERFORMER | PROGRAM_TITLE | AWARD_DATE | AMOUNT | |
---|---|---|---|---|---|---|---|
2 | 1420 | MDA97292J 1029 | GR20 | CNRI | INFORMATION MANAGEMENT | 12/10/1998 | 687000.00 |
3 | 1421 | MDA97292J1 029 | GR22 | CNRI | COMMUNICATOR | 4/22/1999 | 400000.00 |
4 | 1422 | MDA97292J1 029 | GR22 | CNRI | WEBINABOX | 4122/1999 | 360000.00 |
5 | 1423 | MDA97292J1 029 | P00025 | CNRI | WEBINABOX | 8/24/1999 | 0.00 |
6 | 1424 | MDA972931 0030 | P00009 | GEORGIATEC | HIGH DEFINITION SYSTEMS (HDS) | 1/29/1999 | 1210694.00 |
7 | 1425 | MDA9729320014 | P00017 | USDISPLAYC | FLAT PANEL DISPLAYS | 8116/1999 | 5794000.00 |
8 | 1426 | MDA97293C0016 | P00043 | SYSPLANCOR | CHPS: Combat Hybrid Power Systems | 1nt1999 | 79441.00 |
9 | 1427 | MDA97294C0003 | A00003 | BELLATLANT | NEXT GENERATION INTERNET | 8/28/1998 | 0.00 |
10 | 1428 | MDA97294C0003 | P00026 | BELLATLANT | NEXT GENERATION INTERNET | 1/2011999 | 332197.00 |
11 | 1429 | MDA97294C0003 | P00027 | BELLATLANT | NEXT GENERATION INTERNET | 2/4/1999 | 94750.00 |
12 | 1430 | MDA97294C0003 | P00028 | BELLATLANT | NEXT GENERATION INTERNET | 2/22/1999 | 450000.00 |
13 | 1431 | MDA97294C0003 | P00029 | BELLATLANT | NEXT GENERATION INTERNET | 3/1/1999 | 254750.00 |
14 | 1432 | MDA97294C0003 | P00030 | BELLATLANT | NEXT GENERATION INTERNET | 4/1 2/1999 | 0.00 |
15 | 1433 | MDA97294C0003 | P00031 | BELLATLANT | NEXT GENERATION INTERNET | 4/1 3/1999 | 254750.00 |
16 | 1434 | MDA97294C0003 | P00032 | BELLATLANT | NEXT GENERATION INTERNET | 9/8/1999 | 254750.00 |
17 | 1435 | MDA97294C0016 | P00026 | BDMFEDERAL | STOWACTD | 2/1 2/1999 | 117000.00 |
18 | 1436 | MDA97294C0016 | P00027 | BDMFEDERAL | STOWACTD | 3/1/1999 | 273000.00 |
19 | 1437 | MDA97294C0016 | P00028 | BDMFEDERAL | IMAGE UNDERSTANDING | 3/2911999 | 150166.00 |
20 | 1438 | MDA97294C0016 | P00029 | BDMFEDERAL | STOWACTD | 5/27/1999 | 40000.00 |
21 | 1439 | MDA97294C0016 | P00030 | BDMFEDERAL | STOWACTD | 911 /1999 | 55930.00 |
22 | 1440 | MDA97294D0001 | D003/P16 | VRT | BADD | 12/9/1998 | 73374.00 |
23 | 1441 | MDA97294D0001 | 0032/3 | VRT | AGILE INFO CONTROL ENVIRONMENT | 2/12/1999 | 100095.00 |
24 | 1442 | MDA97294D0001 | 003202 | VALLEYELEC | AGILE INFO CONTROL ENVIRONMENT | 12/22/1998 | 100095.00 |
25 | 1443 | MDA972951 0016 | GR03 | ARIZONASTA | VLSI PHOTONICS | 3/1 5/1999 | 149984.00 |
26 | 1444 | MDA9729530027 | P00014 | BELLCORE | BROADBAND INFORMATION TECHNOLOGY | 1/4/1999 | 4547200.00 |
27 | 1445 | MDA9729530029 | A00009 | PLANARAMER | HIGH DEFINITION SYSTEMS (HDS) | 5/4/1999 | 0.00 |
28 | 1446 | MDA9729530029 | GR0008 | PLANARAMER | HIGH DEFINITION SYSTEMS (HDS) | 11/10/1998 | 7570137.00 |
29 | 1447 | MDA9729530036 | GR06 | ITNENERGYS | PHOTOVOLTAICS (VP) | 11/1 8/1998 | 558900.00 |
30 | 1448 | MDA9729530042 | GR011 | CRAYRESEAR | SHOCC | 6nt1999 | 1289562.00 |
31 | 1449 | MDA97295C0004 | P00008 | UMASS | LARGE MILLIMETER TELESCOPE | 8/30/1999 | 1151500.00 |
... | ... | ... | ... | ... | ... | ... | ... |
484 | 1875 | MDA97299F0024 | BASIC | GRCI | OFFICE/PROGRAM SUPPORT (RELATED TO VSEE8) | 3/1711999 | 40000.00 |
485 | 1876 | MDA97299F0024 | BASIC | GRCI | WARFIGHTERS INTERNET | 3/17/1999 | 117000.00 |
486 | 1877 | MDA97299F0025 | BASIC | SYSPLANCOR | COUNTER UNDERGROUND FACILITIES | 6/25/1999 | 251924.00 |
487 | 1878 | MDA97299F0027 | DO | ORIONSCSYS | COUNTER MEASURES | 6/11/1999 | 199991.00 |
488 | 1879 | M DA97299F0028 | DO | DIGITSYSIN | CONTRACT ADMINISTRATION | 7/14/1999 | 90000.00 |
489 | 1880 | MDA97299F0028 | D001 | DIGITSYSIN | CONTRACTS MANAGEMENT | 6/30/1999 | 4422.00 |
490 | 1881 | MDA97299F0029 | DO | DTAI | TECH INTEGRATION CENTER/TECH DEV CENTER | 8/4/1999 | 100000.00 |
491 | 1882 | MDA97299F0030 | BASIC | BOOZALLEN | POLYMER MATERIALS (CONG ADD) | 5/15/1999 | 423916.45 |
492 | 1883 | MDA97299F0031 | BASIC | SCHAFER | CEROS (FENCED) | 8/2/1999 | 59972.00 |
493 | 1884 | MDA97299F0032 | DO | BRADSONCOR | ADVANCED SHIP/SENSOR SYSTEMS MRN-02 | 8/9/1999 | 43425.18 |
494 | 1885 | MDA97299F0033 | DO | SYSPLANCOR | CONTRACTS MANAGEMENT | 8/30/1999 | 37075.00 |
495 | 1886 | MDA97299F0033 | P00001 | SYSPLANCOR | CONTRACTS MANAGEMENT | 9/13/1999 | 0.00 |
496 | 1887 | MDA97299F0034 | BASIC | DIGITSYSIN | CONTRACTS MANAGEMENT | 8/31/1999 | 64755.00 |
497 | 1888 | MDA97299M0002 | DO | INFOSYSLAB | ADVANCED GROUND SURVELLIANCE | 3/1211999 | 99729.00 |
498 | 1889 | MDA97299M0003 | DO | SRC | ADVANCED MICROELECTRONICS | 4/14/1999 | 10000.00 |
501 | 2 | None | None | None | None | None | NaN |
502 | 1890 | MDA97299M0004 | DO | ARDAK | BW MEDICAL DIAGNOSTICS | 3/30/1999 | 99970.00 |
503 | 1891 | MDA97299M0004 | P00001 | ARDAK | BW MEDICAL DIAGNOSTICS | 5/26/1999 | 0.00 |
504 | 1892 | MDA97299M0004 | P00002 | ARDAK | BW MEDICAL DIAGNOSTICS | 8/4/1999 | 0.00 |
505 | 1893 | MDA97299M0005 | DO | SHA | SENSOR EMULATION | 5/4/1999 | 100000.00 |
506 | 1894 | MDA97299M0005 | P00001 | SHA | SENSOR EMULATION | 5/12/1999 | 0.00 |
507 | 1895 | MDA97299M0006 | DO | VISTARESEA | UNDERSEA LITTORAL WARFARE | 4/12/1999 | 74827.00 |
508 | 1896 | MDA97299M0007 | DO | VISUALEYES | COMBAT CASUALTY DIAGNOSTICS:ULTRASOUND | 5/3/1999 | 59500.00 |
509 | 1897 | MDA97299M0008 | BASIC | BLUE RIDGE | OFFICE/PROGRAM SUPPORT (related to VTAX4) | 5/11/1999 | 48566.00 |
510 | 1898 | MDA97299M0009 | DO | QRI | ADVANCED SIMULATION TECH | 6/29/1999 | 99494.00 |
511 | 1899 | MDA97299M001 0 | DO | PRAJAINC | COUNTER MEASURES | 6/14/1999 | 80460.00 |
512 | 1900 | MDA97299M0011 | BASIC | lVI | COUNTER MEASURES | 7/16/1999 | 90000.00 |
513 | 1901 | MDA97299M0012 | BASIC | JERRYCOOKE | CONTRACT ADMINISTRATION | 5/3/1999 | 100000.00 |
514 | 1902 | MDA97299M0013 | DO | DIAMONDBAC | TECH INTEGRATION CENTER/TECH DEV CENTER | 9/8/1999 | 50000.00 |
515 | 1903 | MDA9769630014 | P00007 | SDLINC | SOLAR BLIND DETECTORS | 7/9/1999 | 0.00 |
495 rows × 7 columns
# Finally, we can do some operations.
darpa1999.groupby(by="PROGRAM_TITLE")[["AMOUNT"]].sum()
AMOUNT | |
---|---|
PROGRAM_TITLE | |
3-D MICRO ELECTRONICS | NaN |
6/1711999 | NaN |
AA V: Advanced Air Vehicle | -500000.00 |
AAV: Advanced Air Vehicle | 10225000.00 |
ACMPNIP | 302233.00 |
ACTIVE NETWORKS | 50000.00 |
ACTIVE TEMPLATES | 329987.00 |
ADAPTIVE COMPUTING SYSTEMS | 850000.00 |
ADMINISTRATIVE SUPPORT | 290000.00 |
ADVANCED FLEXIBLE MANUFACTURING | 3400000.00 |
ADVANCED GROUND SURVELLIANCE | 124593.00 |
ADVANCED LITHOGRAPHY | 3030000.00 |
ADVANCED LOGISTICS TECHNOLOGY | 16518949.00 |
ADVANCED MICROELECTRONICS | 10000.00 |
ADVANCED NETWORKING TECHNOLOGY | 49959.00 |
ADVANCED SHIP/SENSOR SYSTEMS MRN-02 | 328425.18 |
ADVANCED SIMULATION TECH | 2947837.00 |
AG ILE INFO CONTROL ENVIRONMENT | 199899.00 |
AGENT ADMIN SUPPORT | 769226.00 |
AGILE INFO CONTROL ENVIRONMENT | 3872700.00 |
AIM : Advanced ISR Man'!9ement | 600000.00 |
AIRBORNE COMMS NODE | 16800000.00 |
AIRBORNE VIDEO SURVEILLANCE | 468486.00 |
AM3: Affordable Multi-Missile Manufacturing | 18795907.00 |
AMOUNT | NaN |
ANTS SEEDLINGS | 50000.00 |
APLA: SELF HEALING/TAGS/MGM | 20526.00 |
ARRMD: Affordable Rapid Response Missile Demonstrator | 500000.00 |
ART: Advanced Rotorcraft Technology | 1500000.00 |
BADD | 642161.00 |
... | ... |
SECURITY SUPPORT | 1321000.00 |
SECURITY SUPPORT . | 423212.00 |
SENSOR EMULATION | 332215.00 |
SHOCC | 1289562.00 |
SKYLINK | 100000.00 |
SLID: Small Low-Cost Interceptor Device | 1859000.00 |
SMART MATERIALS/ACTUATORS | 1633760.00 |
SMART MATERIALS/DEMOS | 5280951.00 |
SOLAR BLIND DETECTORS | 250000.00 |
STARLIGHT SUPPORT COSTS-ITO | 210500.00 |
STOWACTD | 831302.00 |
SUB STUDY (SUBMARINE PAYLOADS AND SENSORS | 6180000.00 |
SUO: SITUATION AWARENESS SYS (SAS) | 19138500.00 |
SURVIVABLTY LARGE SCALE INFO SYS | 119732.00 |
Seedlings SGT-02 | 242880.00 |
Seedlings TT-06 | 900000.00 |
Seedlings TT-07 | 200000.00 |
TACTICAL SENSORS | 290580.00 |
TECH INTEGRATION CENTER/TECH DEV CENTER | 150000.00 |
TMR: URBAN ROBOTICS | 332000.00 |
TRVS | 500000.00 |
UCAV: Unmanned Combat Air Vehicle | 10017493.00 |
UNDERSEA LITTORAL WARFARE | 1324827.00 |
VIRTUAL ELECTROMAGNETIC TEST RANGE | 498383.00 |
VLSI PHOTONICS | 149984.00 |
WARFIGHTERS INTERNET | 327000.00 |
WATER HAMMER | 2887441.00 |
WEBINABOX | 360000.00 |
lA INTEGRATED TESTBED (INFORMATION ASSURANCE) | 909090.00 |
lA INTEGRATED TESTBEDJINFORMATION ASSURANCE) | 79167.00 |
164 rows × 1 columns
darpa1999.groupby(by="PERFORMER")[["AMOUNT"]].sum()
AMOUNT | |
---|---|
PERFORMER | |
ALPHATECH | 885851.00 |
ALPINECONS | 16110000.00 |
APT I | 2809441.00 |
APTI | 124999.00 |
ARDAK | 99970.00 |
ARIZONASTA | 345851.00 |
ART I | 2290000.00 |
ART! | 210500.00 |
ARTI | 4009102.00 |
ATTTECH | 200000.00 |
AUBURNU | 395320.00 |
AWARD DATE | NaN |
BBN | 144000.00 |
BDMFEDERAL | 751046.00 |
BELLATLANT | 1641197.00 |
BELLCORE | 4547200.00 |
BLUE RIDGE | 48566.00 |
BOEING | 8082268.00 |
BOEINGDEFS | 4238010.00 |
BOEINGDESP | 5883520.00 |
BOEINGNAIN | 76655.00 |
BOOZALLEN | 3301594.45 |
BRADSONCOR | 885747.74 |
CALTECH | 256638.00 |
CENTRA | 1810672.00 |
CERIDIAN | 3797712.00 |
CFDRESCORP | 699919.00 |
CNRI | 5936135.00 |
COLOSTU | 100000.00 |
CRAYRESEAR | 1289562.00 |
... | ... |
TRACORAERO | 1447900.00 |
TRITECHINC | 222649.00 |
TRW | 5600000.00 |
UALABAMA | 900000.00 |
UARIZONA | 2790100.00 |
UCBERKELEY | 3260000.00 |
UCIRVINE | 50000.00 |
UCLA | 350000.00 |
UCLON | 156000.00 |
UCSANTABAR | 100000.00 |
UFLA | 321768.00 |
UILLURBCHA | 100000.00 |
UMASS | 1151500.00 |
UMINN | 1915045.00 |
UNIVNEWORL | 3474136.00 |
USCISI | 299997.00 |
USDISPLAYC | 5794000.00 |
UTAHSTU | 327618.00 |
UTEXAS | 350000.00 |
UVA | 365396.00 |
UWISCONSIN | 43650.00 |
VALLEYELEC | 100095.00 |
VANDERBILT | 204435.00 |
VEDAINC | 666262.00 |
VISTARESEA | 74827.00 |
VISUALEYES | 59500.00 |
VRT | 173469.00 |
WALCOFF | 20526.00 |
XEROXPARC | 1642515.00 |
lVI | 90000.00 |
152 rows × 1 columns
pdftotext
pdftotext handles one file at a time; it does not take wildcards.
To run it on every PDF in a directory from the Unix bash shell (Mac OS X, most Linux distributions):
for file in *.pdf; do pdftotext "$file" "$file.txt"; done
Run this in the shell, not in Python.
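If you would rather stay inside the notebook, the same loop can be driven from Python with `subprocess`. This is a sketch, not the method used above: `pdftotext_commands` and `convert_all` are hypothetical helpers, the `pdftotext` binary is assumed to be on the PATH, and the output is named `report.txt` rather than the shell loop's `report.pdf.txt`.

```python
import subprocess
from pathlib import Path


def pdftotext_commands(directory):
    """Build one pdftotext command per PDF in `directory`."""
    commands = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        # report.pdf is written to report.txt alongside the original.
        commands.append(["pdftotext", str(pdf), str(pdf.with_suffix(".txt"))])
    return commands


def convert_all(directory):
    """Run every command; assumes the pdftotext binary is installed."""
    for command in pdftotext_commands(directory):
        subprocess.run(command, check=True)
```

Building the command lists separately from running them makes it easy to inspect what will be executed before touching any files.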
Google Drive: for files under 2 MB or 10 pages.
Google probably has the best OCR out there, but there is no way to access it at scale.
If you've found a PDF online, you can always consult Google's OCR of it via the Google cache:
take your URL and prefix it with:
https://webcache.googleusercontent.com/search?q=cache:{{your URL}}
This doesn't always work, and the result is challenging HTML that reproduces the position of the text.
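The prefixing step is plain string concatenation. A tiny sketch, where `cache_url` is a hypothetical helper and the example address is made up:

```python
CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"


def cache_url(url):
    """Return the Google cache address for `url` by prefixing it."""
    return CACHE_PREFIX + url


print(cache_url("http://example.com/report.pdf"))
```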
Adobe Acrobat Pro
ABBYY FineReader
Locally, can be used on library machines.
Old form of Google tech: tesseract
Futzy: requires PDFs to be divided into individual pages, then rendered as TIFF.
Very Linux-y world of multiple dependencies and weird incompatibilities.
See https://apple.stackexchange.com/questions/128384/ocr-on-pdfs-in-os-x-with-free-open-source-tools
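The page-by-page routine can be scripted. The sketch below is one possible arrangement, not a recipe from the linked thread: it assumes poppler's `pdftoppm` and the `tesseract` command-line tool are both installed, and `split_command`, `ocr_command`, and `ocr_pdf` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def split_command(pdf_path, out_dir, dpi=300):
    """Command that renders each PDF page as a TIFF named out_dir/page-N."""
    return ["pdftoppm", "-tiff", "-r", str(dpi),
            str(pdf_path), str(Path(out_dir) / "page")]


def ocr_command(tiff_path):
    """Command that OCRs one image; tesseract adds .txt to the output base."""
    tiff_path = Path(tiff_path)
    return ["tesseract", str(tiff_path), str(tiff_path.with_suffix(""))]


def ocr_pdf(pdf_path, out_dir):
    """Split the PDF into page images, then OCR each page in order."""
    subprocess.run(split_command(pdf_path, out_dir), check=True)
    for tiff in sorted(Path(out_dir).glob("page-*.tif*")):
        subprocess.run(ocr_command(tiff), check=True)
```

Each page ends up as its own text file next to its image, so the pieces still need to be joined afterwards if you want one document.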
All major utilities honor the PDF encryption schemes.
For ebooks you "own" (i.e. have a license for), such as Kindle books, use the Calibre application and the de-DRM add-ons to extract your licensed text in a more open format.