cnpj_cpf is the column identifying the company (CNPJ) or individual (CPF) who received the payment made by the congressperson. An empty value should mean that the expense was made outside Brazil, with a company (or person) that has no Brazilian ID.
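As a quick way to see which kind of ID a row carries, the value can be classified by its length: a CPF has 11 digits and a CNPJ has 14. The sketch below is illustrative only; `id_kind` is a hypothetical helper that is not part of the original notebook, and it assumes the column holds unformatted digit strings, with missing values read as NaN.

```python
def id_kind(cnpj_cpf):
    """Classify an ID by length (hypothetical helper, not from the notebook)."""
    # pandas represents missing strings as float NaN, so any non-string
    # (or empty string) is treated as a missing ID.
    if not isinstance(cnpj_cpf, str) or not cnpj_cpf:
        return 'missing'
    if cnpj_cpf.isdigit() and len(cnpj_cpf) == 11:
        return 'cpf'
    if cnpj_cpf.isdigit() and len(cnpj_cpf) == 14:
        return 'cnpj'
    return 'unknown'


# Possible usage once the dataset is loaded:
# dataset['cnpj_cpf'].apply(id_kind).value_counts()
```

This only looks at the shape of the value; it says nothing about whether the check digits are correct.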
import numpy as np
import pandas as pd
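# Read ID-like columns as strings so leading zeros are preserved.
# The builtin str replaces np.str, which was removed from recent NumPy releases.
dataset = pd.read_csv('../data/2016-11-19-reimbursements.xz',
                      dtype={'applicant_id': str,
                             'cnpj_cpf': str,
                             'congressperson_id': str,
                             'subquota_number': str},
                      low_memory=False)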
dataset.shape
(1532491, 31)
from pycpfcnpj import cpfcnpj

def validate_cnpj_cpf(cnpj_or_cpf):
    # A missing ID counts as valid here: it is expected for expenses made abroad.
    # `or` short-circuits, so the validator is never called on a missing value.
    return pd.isnull(cnpj_or_cpf) or cpfcnpj.validate(cnpj_or_cpf)

dataset['valid_cnpj_cpf'] = dataset['cnpj_cpf'].apply(validate_cnpj_cpf)
A document_type of 2 means the expense was made abroad.
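keys = ['year',
        'applicant_id',
        'document_id',
        'total_net_value',
        'cnpj_cpf',
        'supplier',
        'document_type']
# Expenses not declared as made abroad whose CNPJ/CPF failed validation.
not_abroad = dataset['document_type'] != 2
dataset.loc[not_abroad & ~dataset['valid_cnpj_cpf'], keys]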
| | year | applicant_id | document_id | total_net_value | cnpj_cpf | supplier | document_type |
---|---|---|---|---|---|---|---|
53466 | 2009 | 1607 | 1748889 | 123.57 | 11111111111 | CAP HORN | 0 |
53467 | 2009 | 1607 | 1748896 | 100.25 | 11111111111 | CAP HORN | 0 |
53468 | 2009 | 1607 | 1748909 | 229.25 | 11111111111 | DENSKALDEDEKOK RESTAURANT | 0 |
53469 | 2009 | 1607 | 1748911 | 18.89 | 11111111111 | BELLA CENTER | 0 |
53470 | 2009 | 1607 | 1748915 | 581.85 | 11111111111 | FIRST HOTEL SKT. PETRI | 0 |
284494 | 2010 | 184 | 1987827 | 2974.63 | 11111111111 | AKA CENTRAL PARK - NEW YORK | 0 |
284495 | 2010 | 184 | 1987829 | 2974.63 | 11111111111 | AKA CENTRAL PARK - NEW YORK | 0 |
527753 | 2011 | 2329 | 2085477 | 190.74 | 00000000000000 | PREFEITURA MUNICIPAL DE FORTALEZA | 0 |
552301 | 2011 | 2387 | 2055025 | 372.72 | 00000000000 | TAM LINHAS AREAS S/A | 0 |
552775 | 2011 | 2387 | 2209688 | 290.91 | 00000000000 | CONDOMINIO CENTRO EMPRESARIAL IGUATEMI | 1 |
With 1,532,491 records in the dataset and just 10 carrying an invalid CNPJ/CPF, we can probably assume that the Chamber of Deputies validates this field in the tool congresspeople use to request reimbursements. These 10 records most likely slipped through a gap in that validation algorithm.
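One possible explanation, offered as an assumption rather than something the data confirms: every invalid value listed above is a single repeated digit, and such sequences satisfy the CPF check-digit arithmetic on their own. Validators usually reject them with a separate rule, so a tool that only checks the digits would let them through. The sketch below uses two hypothetical helpers, `cpf_check_digits_ok` and `is_repeated_digit`, to illustrate the idea for 11-digit CPF values; it is not the Chamber's implementation, nor that of pycpfcnpj.

```python
def cpf_check_digits_ok(cpf):
    """Check only the two mod-11 check digits of an 11-digit CPF string.

    Hypothetical helper: it deliberately skips the repeated-digit rule.
    """
    if len(cpf) != 11 or not cpf.isdigit():
        return False
    digits = [int(c) for c in cpf]
    for n in (9, 10):
        # Weights run from n + 1 down to 2 over the first n digits.
        total = sum(d * w for d, w in zip(digits[:n], range(n + 1, 1, -1)))
        # A remainder of 10 maps to check digit 0.
        expected = (total * 10) % 11 % 10
        if digits[n] != expected:
            return False
    return True


def is_repeated_digit(number):
    """True when every character in the string is the same."""
    return len(set(number)) == 1


for value in ('11111111111', '00000000000'):
    # Both values pass the arithmetic but are repeated-digit sequences.
    print(value, cpf_check_digits_ok(value), is_repeated_digit(value))
```

If this is what happened, combining both checks (`cpf_check_digits_ok(v) and not is_repeated_digit(v)`) would be enough to flag these records.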