This notebook assumes that you have a Jupyter notebook with CopyCat installed. To start a Jupyter notebook with CopyCat installed and your local directory mounted, run:
docker run --rm -ti -v ${PWD}:/home/jovyan -p 8888:8888 webis/chatnoir-copycat:1.0-jupyter
Then, we use the CopyCat CLI to deduplicate the run files submitted to Terabyte 2006.
You need:
PATH_TO_GOV2 should point to the GOV2 collection. I.e., !du -h -s $PATH_TO_GOV2 should show 81G and !ls $PATH_TO_GOV2 should list files GX000 to GX272.
PATH_TO_RUNS should contain all run files submitted to Terabyte 2006. I.e., !du -h -s $PATH_TO_RUNS should show 1.1G and !ls $PATH_TO_RUNS should list files input.AMRIMtp20006.gz to input.zetamerg.gz.
We use Anserini to index the GOV2 collection to deduplicate the run files (to speed up the index creation, we index only documents that are in the run files).
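The expectation that PATH_TO_GOV2 lists GX000 to GX272 can also be checked from Python instead of by eye. The following is a minimal sketch; the helper name missing_gov2_dirs is hypothetical and not part of CopyCat, and it only looks at entry names, not at their contents.

```python
import os
import re


def missing_gov2_dirs(path_to_gov2, first=0, last=272):
    """Return the expected GXnnn entries that are absent from path_to_gov2."""
    # Only consider entries that follow the GXnnn naming scheme.
    present = {
        name for name in os.listdir(path_to_gov2)
        if re.fullmatch(r"GX\d{3}", name)
    }
    expected = [f"GX{i:03d}" for i in range(first, last + 1)]
    return [name for name in expected if name not in present]
```

An empty result of missing_gov2_dirs(PATH_TO_GOV2) indicates that all entries from GX000 to GX272 are present.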
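# variables as discussed above
PATH_TO_GOV2='/mnt/ceph/storage/corpora/corpora-thirdparty/corpora-trec/corpus-trec-web/DOTGOV2/gov2-corpus'
PATH_TO_RUNS='terabyte-data'
# create list of ids to deduplicate
from trectools import TrecRun
# make sure the target directory for the filtered runs exists
!mkdir -p terabyte-runs-top50
RUNS = !ls $PATH_TO_RUNS
for run in RUNS:
    print('Process: ' + run)
    r = TrecRun(PATH_TO_RUNS + '/' + run)
    # keep only the entries up to rank 50
    r.run_data = r.run_data[r.run_data['rank'] <= 50]
    # write only runs with fewer than 8000 remaining entries
    if len(r.run_data) < 8000:
        r.run_data.to_csv('terabyte-runs-top50/' + r.get_runid(), sep='\t', index=False, header=False)
# from now on, work on the filtered runs
PATH_TO_RUNS = 'terabyte-runs-top50/'
RUNS = !ls $PATH_TO_RUNS
# collect the unique values of the fourth whitespace-separated column of all filtered runs
# (adjust $4 if the document ids are in a different column of your filtered runs)
!cat terabyte-runs-top50/*|awk '{print $4}'|sort -u > terabyte-runs-top50/allow-list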
Process: input.AMRIMtp20006.gz
Process: input.AMRIMtp5006.gz
Process: input.AMRIMtpm5006.gz
Process: input.arscDomAlog.gz
Process: input.arscDomAsrt.gz
Process: input.arscDomManL.gz
Process: input.arscDomManS.gz
Process: input.CoveoRun1.gz
Process: input.CWI06DISK1ah.gz
Process: input.CWI06DIST8ah.gz
Process: input.DCU05BASE.gz
Process: input.hedge0.gz
Process: input.hedge10.gz
Process: input.hedge30.gz
Process: input.hedge50.gz
Process: input.hedge5.gz
Process: input.humT06l.gz
Process: input.humT06xlc.gz
Process: input.humT06xle.gz
Process: input.humT06xl.gz
Process: input.humT06xlz.gz
Process: input.indri06AdmD.gz
Process: input.indri06AlceB.gz
Process: input.indri06AlceD.gz
Process: input.indri06Aql.gz
Process: input.indri06AtdnD.gz
Process: input.JuruMan.gz
Process: input.JuruTD.gz
Process: input.JuruT.gz
Process: input.JuruTWE.gz
Process: input.mg4jAdhocBBV.gz
Process: input.mg4jAdhocBV.gz
Process: input.mg4jAdhocBVV.gz
Process: input.mg4jAdhocV.gz
Process: input.mg4jAutoBBV.gz
Process: input.mg4jAutoBV.gz
Process: input.mg4jAutoBVV.gz
Process: input.mg4jAutoV.gz
Process: input.mpiircomb.gz
Process: input.mpiirdesc.gz
Process: input.mpiirmanual.gz
Process: input.mpiirtitle.gz
Process: input.MU06TBa1.gz
Process: input.MU06TBa2.gz
Process: input.MU06TBa5.gz
Process: input.MU06TBa6.gz
Process: input.p6tbadt.gz
Process: input.p6tbaxl.gz
Process: input.sabtb06aa1.gz
Process: input.sabtb06at1.gz
Process: input.sabtb06man1.gz
Process: input.THUADALL.gz
Process: input.THUADAO.gz
Process: input.THUADLMAO.gz
Process: input.THUADLMO.gz
Process: input.THUADOR.gz
Process: input.TWTB06AD01.gz
Process: input.TWTB06AD02.gz
Process: input.TWTB06AD03.gz
Process: input.TWTB06AD04.gz
Process: input.TWTB06AD05.gz
Process: input.UAmsT06a3SUM.gz
Process: input.UAmsT06aAnLM.gz
Process: input.UAmsT06aTDN.gz
Process: input.UAmsT06aTeLM.gz
Process: input.UAmsT06aTTDN.gz
Process: input.uogTB06QET1.gz
Process: input.uogTB06QET2.gz
Process: input.uogTB06S50L.gz
Process: input.uogTB06SS10L.gz
Process: input.uogTB06SSQL.gz
Process: input.uwmtFadDS.gz
Process: input.uwmtFadTPFB.gz
Process: input.uwmtFadTPRR.gz
Process: input.uwmtFmanual.gz
Process: input.zetabm.gz
Process: input.zetadir.gz
Process: input.zetaman.gz
Process: input.zetamerg2.gz
Process: input.zetamerg.gz
cat: terabyte-runs-top50/deduplication: Is a directory
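The rank cut-off and the collection of document ids can be illustrated on a toy run without trectools. This is a minimal sketch, assuming the standard six-column TREC run format (topic, Q0, document id, rank, score, tag); the function names and the docid_field parameter are hypothetical, and the column order of your own files may differ.

```python
def top_k_lines(run_lines, k=50):
    """Keep only the well-formed run lines whose rank is at most k."""
    kept = []
    for line in run_lines:
        fields = line.split()
        # Assumption: the rank is the fourth field of a six-column line.
        if len(fields) == 6 and int(fields[3]) <= k:
            kept.append(line)
    return kept


def allow_list(run_lines, docid_field=2):
    """Return the sorted, unique document ids of the given run lines."""
    # Assumption: the document id is the third field (index 2).
    return sorted({line.split()[docid_field] for line in run_lines})
```

Applying allow_list to the output of top_k_lines yields each document id once, no matter how many topics or runs retrieved it.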
# navigate to your Anserini installation to index the GOV2 corpus with the following command
# (the allow-list created above must be available under the path passed to -whitelist)
GOV2_INDEX='lucene-index.gov2.pos+docvectors+raw'
!target/appassembler/bin/IndexCollection -collection TrecwebCollection \
    -input $PATH_TO_GOV2 \
    -index lucene-index.gov2.pos+docvectors+raw \
    -whitelist allow-list \
    -generator JsoupGenerator \
    -threads 5 -storePositions -storeDocvectors -storeRawDocs
We have double-checked the preprocessing for the ClueWebs and CommonCrawls with many unit and integration tests. Since the GOV2 dataset is not covered by these tests, we have to double-check its preprocessing on some examples. (Please see here for an overview of preprocessing options.)
To verify the preprocessing, we check a few documents manually.
# Use CopyCat to check the preprocessing for a single document
!copy-cat \
    --retrieveDocId GX272-84-11390548 \
    --documents AnseriniIndex \
    --anseriniIndex $GOV2_INDEX \
    --keepStopwords False \
    --output a --input a
januari 1 2003 honor governor georg h ryan state hou room 207 springfield illinoi 62706 honor governor ryan i am plea submit report behalf illinoi deaf hard hear commiss activ calendar year 2002 deaf hard hear commiss execut agenc state dedic advoc public polici regul program design improv qualiti coordin exist servic individu hear loss well promot new servic whenev necessari respon assess need deaf hard hear commiss serv conduit inform deaf hard hear commun gener public legisl govern agenc servic provid organ privat entiti deaf hard hear commiss empow deaf hard hear individu affirm indisput right equal respect independ self suffici ncy access societi commiss structur membership eleven initi member deaf hard hear commiss were appoint novemb 1997 stagger four year term time write four member term have expir two member have resign result six vacanc requir reappoint replac staf commiss staf eight posit seven which current fill vacanc exist which ha en fill due administr order number 1 see attach chart graphic repres commiss organiz structur offic oper we continu meet need public provid follow servic advocaci public polici advocaci self advocaci individu case advocaci case manag iep advocaci public awar exhibit workshop train inform seri brochur honor governor georg h ryan illinoi deaf hard hear commiss annual report januari 1 2003 page 2 10 public servic announc inform referr consult technic assist resourc resourc directori lend librari site inform interpret directori commun quarterli newslett listserv websit commiss retreat commiss member staff held dai retreat march 15 2002 review progress toward object our five year strateg plan which implement two year ago outstand issu concern were identifi priorit order meet need public whom we serv most effici expedi mean possibl five year strateg plan revi five year strateg plan june 6 2002 ha been place sinc june 2000 includ intermedi rang long rang goal object goal object have been incorpor annual manag plan submit strateg 
manag unit director staff charg implement execut plan ensur maximum deliveri servic our constitu fund commiss began util line item budget fy01 when 685,000.00 appropri agenc oper previou commiss oper under lumpsum appropri our fund headcount increa eight 8 posit commiss creat posit system administr posit respon manag agenc local area network maintain our websit commiss expend approxim eighti four percent 84 fy01 appropri commiss step up effort increa public awar advoc improv public polici maxim deliveri servic deaf hard hear popul state illinoi commiss receiv appropri 726,600.00 fy02 commiss util d 624,000.00 86 appropri return 102,000 state treasuri current year appropri 688,400.00 repr slight decrea over prior year fund effort support governor effort amelior state budgetari crisi commiss identifi cost could trim function could perform more cost effect manner effort includ u seri brochur promot our program servic wide segment public util billboard adverti develop public servic announc honor governor georg h ryan illinoi deaf hard hear commiss annual report januari 1 2003 page 3 10 air televi radio near futur take advantag mass mail mass media commiss increa public awar our program servic effort improv qualiti life illinoi citizen who experi hear loss commiss expect expend approxim 90 current year appropri effect governor execut order number 1 2001 commiss fulli support action taken offset unforeseen declin revenu compli ani request measur assist mitig current budgetari situat end commiss direct reserv 16,900.00 from our fy02 appropri effectu alloc unu balanc equip line 11,200.00 5,700.00 from person servic line amount did result ani signif chang agenc oper more seriou consequ commiss hire freez impo order commiss ha had vacant posit entireti fy02 fy03 state term percentag vacanc repr 12.5 commiss fund headcount posit respon fulfil sever commiss mandat critic long rang mission our hope commiss can fill posit fy04 when econom situat allow audit commiss receiv 
biannual complianc audit conduct offic auditor gener two year period end june 30 2001 commiss receiv d two materi find base technic error first find result from inaccur incomplet report fix asset own commiss find doe indic ani asset unaccount rather commiss fail properli document asset central inventori system relat report file offic comptrol plan correct action ha now been implement wherebi all asset properli document specif personnel have been assign respon ongo asset track report function second find requir under commiss ha public result ed from commiss failur adopt formal agenc rule illinoi administr procedur act 5 ilc 100 f il requir rule secretari state index divi administr code perform rel legisl mandat 20 ilc 3932 develop program inform person who deaf hard hear public state local servic avail deaf hard hear make avail other inform valu famili profess citizen work involv person who deaf hard hear increa public awar avail commiss servic director public inform coordin program coordin conduct substanti number public appear present calendar year 200 2 staff made person contact approxim 5,000 individu through varieti speak engag presentatio n over 60 event due increa public awar honor governor georg h ryan illinoi deaf hard hear commiss annual report januari 1 2003 page 4 10 avail commiss program servic number incom telephon call fax mail correspond e mail inquiri person contact top 12,000 calendar year 2002 commiss now ha site world wide web which had approxim 30,000 visit 2002 internet ha proven excel wai deaf hard hear individu access inform take part mainstream cultur furthermor internet benefici non deaf individu learn about deaf relat issu unbia neutral environ keep our dedic improv accessibilit y advanc technolog commiss maintain comprehen dynam complet unrestrict websit all commiss program servic public notic avail site addit exten list link other site further inform interest parti mai visit www.idhhc.state.il.u find latest new relat hear loss commiss 
maintain e mail subscrib servic idhhc link which allow u commun inform about our offic activ those who subscrib list current we have over 100 subscrib we publish quarterli newslett idhhc insid detail commiss activ newslett develop distribut after each quarterli meet commiss avail print el ectron format newslett allow u inform public about commiss upcom meet activ event legisl updat new servic program commiss distribut nearli 4,000 newslett our subscrib 20 02 commiss develop seri brochur educ public about hear loss inform deaf hard hear citizen right avail servic titl seri follow about commiss new idhhc booklet assisit listen devic assisit technolog commun access real time translat cart estim popul individu hear loss state illinoi hear aid hear loss illinoi relai protect your ear tty commiss particip 67 event includ confer celebr fair awar dai throughout state where we util displai exhibit honor governor georg h ryan illinoi deaf hard hear commiss annual report januari 1 2003 page 5 10 2002 commiss host tent state fair where we made conta ct thousand fair visitor who had interest our program we hand out balloon children promot item our logo contact inform also we hand out thousand brochur other literatur from collabor agenc organ effort promot our respect program from event we rai tremend amount awar our agenc servic addit we made contact individu from around state regard issu concern home area feedback invalu commiss focu our attent statewid basi commiss distribut approxim 90 000 piec literatur relat deaf includ limit fact she et inform seri brochur interpret registri brochur strateg plan commiss minut agenda schedul effort magnet approxim commiss increa public awar commiss now ha promot pen kei chain bear comm ission logo contact inform 42,000 item were distribut throughout state increa visibl access our servic current under develop four 4 comprehen project slate distribut 2003 statewid directori deaf hard hear revamp interpret registri law enforc train manual 
empow manual deaf hard hear statewid directori comprehen directori list all state servic provid deaf hard hear individu directori list resourc regard statut state feder govern advocaci educ inform referr organ public religion other servic revamp interpret registri intend assist public easili locat interpret counti hour avail area experti law enforc train manual collabor result commiss longstand relationship illinoi state polic whom commiss provid deaf relat train cadet state polic academi new manual address issu face both deaf commun law enforc can u variou law enforc entiti refresh cour offic empow manual manual includ relev feder state disabl statut address access commun discrimin issu inform how protect individu right commiss ha compil comprehen lend librari over 1,000 deafnessrel video multi media book public librari consid state repositori inform avail wide rang subject matter regard hear loss ani interest individu organ mai check out materi free charg up thirti dai nearli 600 item were lent dure calendar year 2002 cooper public priva te agenc local state feder govern coordin program person who deaf hard hear commiss work tandem sister state agenc streamlin deliveri servic provid illinoi deaf hard hear citizen commiss fast becom recogn state clearingh issu relat hear loss although commiss establish deliv direct servic subject area honor governor georg h ryan illinoi deaf hard hear commiss annual report januari 1 2003 page 6 10 experti provid commiss creat synergi among all agenc servic provid relat hear loss commiss provid statewid train seminar larg number state agenc springfield mai 30 2001 present includ histori commiss public awar commun outreach avail deaf hard hear program were thirti two 32 state agenc repr workshop p lan underwai anoth session 2003 commiss began discuss depart hear aid under health insur plan offer develop depart insur present avail under most plan result individu hear loss commiss provid hear aid coverag similar vision insur plan 
insur pursu coverag illinoi inform packag public hear aid benefit larg out pocket expen hope rectifi situat care benefit provid under mani commiss receiv number formal complaint deal wide rang issu matter emploi discrimin access interpret issu other violat state feder law regularli brought commiss calendar year 2002 commiss successfulli resolv 36 complaint bring satisfactori closur matter commiss repr follow committ board central network deaf servic team offic mental health deaf social servic provid central illinoi hard hear late deafen issu committ highlight educ resourc advisori council illinoi allianc deaf hard hear illinoi interag agreement servic person who deaf blind illinoi school deaf advisori board john logan colleg interpret prepar program advisori board macmurrai colleg interpret train program newborn hear screen advisori committ offic mental health statewid deaf nd hard hear subcommitt provid technic assist train support start enhanc exist program servic person who deaf hard hear commiss provid inform present about program hard hear late deafen issu committ hear vision connect illinoi allianc deaf hard hear illinoi school deaf rehabilit counselor deaf center independ live statewid deaf servic coordin scott commun colleg interpret train program honor governor georg h ryan illinoi deaf hard hear commiss annual report januari 1 2003 page 7 10 commiss provid cross cultur train deaf awar train depart children famili servic depart emploi secur depart correct illinoi school deaf depart human servic illinoi state polic academi illinoi state librari depart central manag servic commiss provid deaf empow present center independ live statewid deaf servic coordin illinoi school deaf recommend legisl chang governor gener assembl follow evalu law affect person who deaf hard hear commiss plan reintroduc amend our enact statut deaf hard hear commiss act 20 ilc 3932 order clarifi legisl mandat elimin duplic expand duti assur servic provid more signif impact live deaf 
hard hear citizen upcom month commiss submit interpret certif licensur act propo illinoi gener assembl establish polici relat evalu certif licensur tra standard sign languag interpret monitor court u interpret provid from approv list serv resourc provid list qualifi interpret upon request legisl bodi public privat agenc person who deaf hard hear number issu relat interpret have been problemat make qualiti quantiti interpret paramount issu deaf hard hear communit y commiss conven task forc direct develop plan address minimum standard train evalu certif licensur interpret object increa pool interpret interpret task forc met approxim everi six week 18 month submit propo march 8 2001 commiss interpret task forc propo 37 recommend commiss commiss review discuss each recommend made interpret task forc commiss adopt 27 recommend modifi six 6 recommend reject four 4 recommend commiss staff instruct develop draft incorpor recommend from interpret task forc format wit h languag u interpret profess other state customari languag u other profess state illinoi june 7 2001 meet commiss review made chang approv draft public distr ibut also author town hall meet collect public comment regard draft honor governor georg h ryan illinoi deaf hard hear commiss annual report januari 1 2003 page 8 10 from august 21 2001 septemb 28 2001 commiss staff host 17 town hall meet around state were total 707 peopl attend 285 deaf 31 hard hear 391 hear total 206 provid comment 161 verbal 41 written 4 videotap special meet held novemb 9 2001 review all public comment review each section draft commiss met decemb 6 2001 final interpret certif licensur draft meet determin due event occur recent month would high prioriti gener assembl address licen need interpret interpret certif licensur draft submit offic attornei gener depart profess regul licensur review befor submit legislatur pro gram goal object 2003 evalu state program deliv servic deaf hard hear person determin effect make recommend public 
offici about futur financ support continu exist program nd establish new program monitor state fund program deliv servic person who deaf hard hear determin extent promi mandat servic deliv commiss ha formal evalu program servic under mandat due current staff shortag howev we have identifi defici base inform garner from town hall meet particip variou committ our collect experi commiss intend propo resolut address follow defici request addit staff fund state budget allow servic gap identifi servic specif direct hard hear late deafen individu who do u tradit servic program design deaf hard hear individu requir servic program promot util residu hear assist listen devic oral cu interpret other special accommod includ train inform commun strategi commun workshop educ about assist devic other avail servic resourc mani case hard hear individu need access inform ada ptation hear loss cope hear loss servic specif direct parent deaf hard hear children includ train deaf commun option educ option assist devic parent right avail servic resourc also includ establish support group parent outreach resourc center parent servic specif direct individu who deaf blind current comprehen statewid support servic program exist popul servic current avail through offic rehabilit servic chicago lighthou helen keller nation center limit fragment deaf blind individu need train how live independ how u avail technolog design honor governor georg h ryan illinoi deaf hard hear commiss annual report januari 1 2003 page 9 10 them educ regard servic resourc avail them advoc person assist servic specif direct individu hear loss need continu care nur home retir home includ establish access program awar commun issu encourag facil have more than deaf hard hear individu home train nd profess develop should provid inpati staff technic assist need make facil more access mental health servic deaf hard hear outsid chicago area includ inpati facil outpati servic contractu mental health servic need increa number 
mental health profess who profici sign languag have understand deaf relat mental health issu well need increa u avail technolog video telepsychiatri servic provid more access mental health servic throughout state lack deaf educ standard address legisl mandat p romot cooper among state local agenc provid educ program deaf hard hear individu assist technolog demonstr center mobil demonstr unit allow deaf hard hear individu experi variou technolog avail would remov commun barrier allow them live live more independ program allow interest individu borrow devic trial period provid train how u avail technolog branch region offic can provid servic area cannot effect serv central offic offic tailor provid local advocaci public awar program workshop train amelior problem face deaf hard hear peopl effect carri out strateg plan area honor governor georg h ryan illinoi deaf hard hear commiss annual report januari 1 2003 page 10 10 conclu commiss continu promot program servic enhanc qualiti life illinoi citizen who experi hear lo ss commiss ha increa our visibl public awar evidenc ever increa number contact we receiv commiss dedic promot equal respect independ access all individu hear loss end we endeavor provid effect effici imparti leadership educ advocaci servic elimin barrier deaf hard hear individu while our duti continu expand commi sion ha met need our constitu dimin ish resourc fund headcount time commiss intend further increa our program streamlin deliveri all state servic deaf hard hear popul through effort those hear loss enjoi more effect effici servic substanti cost save state illinoi respectfulli submit gerald l covel director cc honor jame pate philip honor emil jone senat minor leader jim harri secretari senat honor michael j madigan speaker hou honor tom cross hou republican leader anthoni rossi clerk hou repr jean wilkin director illinoi state librari patrick o'gradi director legisl research unit
The preprocessed GOV2 document GX272-84-11390548 looks good (stemming as expected, no problems with encoding, etc.).
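A few of these manual checks can be turned into simple heuristics when more than a handful of documents need inspection. The following is a minimal sketch; preprocessing_issues is a hypothetical helper, and the checks (no replacement characters, no leftover markup, lower-cased text on a single line) are assumptions derived from the example output above, not checks that CopyCat itself performs.

```python
import re


def preprocessing_issues(text):
    """Return a list of human-readable problems found in a preprocessed document."""
    issues = []
    # U+FFFD usually indicates bytes that could not be decoded.
    if "\ufffd" in text:
        issues.append("contains the Unicode replacement character")
    # Leftover markup suggests that HTML was not stripped completely.
    if re.search(r"<[a-zA-Z/][^>]*>", text):
        issues.append("contains markup that looks like an HTML tag")
    if text != text.lower():
        issues.append("contains upper-case characters")
    if "\n" in text or "  " in text:
        issues.append("contains line breaks or repeated spaces")
    return issues
```

An empty list means that none of the heuristics fired; it does not replace reading a few documents, since stemming quality cannot be judged this way.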
Now that we have double-checked that the document preprocessing works as expected for Terabyte 2006, we can deduplicate the run files and inspect the results.
# this helper function executes copycat on the passed run file with the double-checked document preprocessing
def deduplicate_run_file(run_name, ranks):
    input_file = PATH_TO_RUNS + '/' + run_name
    output_file = PATH_TO_RUNS + 'deduplication/' + run_name + 'deduplication-top' + str(ranks) + '.jsonl'
    !copy-cat \
        --output $output_file \
        --input $input_file \
        --similarities "s3" \
        --s3Threshold 0.8 \
        --threads 10 \
        --ranks $ranks \
        --documents AnseriniIndex \
        --keepStopwords True \
        --anseriniIndex $GOV2_INDEX
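To give an intuition for the --similarities "s3" and --s3Threshold 0.8 options, the following minimal sketch computes an S3-style score on two preprocessed texts. It assumes that S3 is the number of shared word 8-grams divided by the mean number of 8-grams of both documents, and that a pair counts as near-duplicate when the score reaches the threshold; the chunk length, the normalization, the comparison operator, and all function names are assumptions for illustration, and CopyCat's actual implementation may differ.

```python
def word_ngrams(text, n=8):
    """Return the set of word n-grams (as tuples) of a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def s3_score(text_a, text_b, n=8):
    """Shared n-grams divided by the mean number of n-grams of both texts."""
    chunks_a, chunks_b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not chunks_a or not chunks_b:
        # Texts shorter than n words have no chunks to compare.
        return 0.0
    shared = len(chunks_a & chunks_b)
    return shared / ((len(chunks_a) + len(chunks_b)) / 2)


def is_near_duplicate(text_a, text_b, threshold=0.8, n=8):
    """Assumed decision rule: the score must reach the threshold."""
    return s3_score(text_a, text_b, n) >= threshold
```

Under these assumptions, identical documents with at least eight words score 1.0, and a single changed word lowers the score by removing every 8-gram that contains it.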
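# Preprocess all runs
# (the listing also contains the 'deduplication' output directory; copy-cat fails on it with a FileNotFoundException that can be ignored)
RUNS = !ls $PATH_TO_RUNS|grep -v allow
for run in RUNS:
    print('Process: ' + run)
    for depth in [10, 50]:
        deduplicate_run_file(run, depth)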
Process: AMRIMtp20006 The specified output 'terabyte-runs-top50/deduplication/AMRIMtp20006deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/AMRIMtp20006deduplication-top50.jsonl' exists. Skip... Process: AMRIMtp5006 The specified output 'terabyte-runs-top50/deduplication/AMRIMtp5006deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/AMRIMtp5006deduplication-top50.jsonl' exists. Skip... Process: AMRIMtpm5006 The specified output 'terabyte-runs-top50/deduplication/AMRIMtpm5006deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/AMRIMtpm5006deduplication-top50.jsonl' exists. Skip... Process: arscDomAlog The specified output 'terabyte-runs-top50/deduplication/arscDomAlogdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/arscDomAlogdeduplication-top50.jsonl' exists. Skip... Process: arscDomAsrt The specified output 'terabyte-runs-top50/deduplication/arscDomAsrtdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/arscDomAsrtdeduplication-top50.jsonl' exists. Skip... Process: arscDomManL The specified output 'terabyte-runs-top50/deduplication/arscDomManLdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/arscDomManLdeduplication-top50.jsonl' exists. Skip... Process: arscDomManS The specified output 'terabyte-runs-top50/deduplication/arscDomManSdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/arscDomManSdeduplication-top50.jsonl' exists. Skip... Process: CoveoRun1 The specified output 'terabyte-runs-top50/deduplication/CoveoRun1deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/CoveoRun1deduplication-top50.jsonl' exists. Skip... 
Process: CWI06DISK1ah The specified output 'terabyte-runs-top50/deduplication/CWI06DISK1ahdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/CWI06DISK1ahdeduplication-top50.jsonl' exists. Skip... Process: CWI06DIST8ah The specified output 'terabyte-runs-top50/deduplication/CWI06DIST8ahdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/CWI06DIST8ahdeduplication-top50.jsonl' exists. Skip... Process: DCU05BASE The specified output 'terabyte-runs-top50/deduplication/DCU05BASEdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/DCU05BASEdeduplication-top50.jsonl' exists. Skip... Process: deduplication Exception in thread "main" java.io.FileNotFoundException: terabyte-runs-top50/deduplication (Is a directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at de.webis.trec_ndd.spark.RunLine.openRunFile(RunLine.java:112) at de.webis.copycat_cli.App.main(App.java:54) Exception in thread "main" java.io.FileNotFoundException: terabyte-runs-top50/deduplication (Is a directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at de.webis.trec_ndd.spark.RunLine.openRunFile(RunLine.java:112) at de.webis.copycat_cli.App.main(App.java:54) Process: hedge0 The specified output 'terabyte-runs-top50/deduplication/hedge0deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/hedge0deduplication-top50.jsonl' exists. Skip... Process: hedge10 The specified output 'terabyte-runs-top50/deduplication/hedge10deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/hedge10deduplication-top50.jsonl' exists. Skip... 
Process: hedge30 The specified output 'terabyte-runs-top50/deduplication/hedge30deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/hedge30deduplication-top50.jsonl' exists. Skip... Process: hedge5 The specified output 'terabyte-runs-top50/deduplication/hedge5deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/hedge5deduplication-top50.jsonl' exists. Skip... Process: hedge50 The specified output 'terabyte-runs-top50/deduplication/hedge50deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/hedge50deduplication-top50.jsonl' exists. Skip... Process: humT06l The specified output 'terabyte-runs-top50/deduplication/humT06ldeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/humT06ldeduplication-top50.jsonl' exists. Skip... Process: humT06xl The specified output 'terabyte-runs-top50/deduplication/humT06xldeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/humT06xldeduplication-top50.jsonl' exists. Skip... Process: humT06xlc The specified output 'terabyte-runs-top50/deduplication/humT06xlcdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/humT06xlcdeduplication-top50.jsonl' exists. Skip... Process: humT06xle The specified output 'terabyte-runs-top50/deduplication/humT06xlededuplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/humT06xlededuplication-top50.jsonl' exists. Skip... Process: humT06xlz The specified output 'terabyte-runs-top50/deduplication/humT06xlzdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/humT06xlzdeduplication-top50.jsonl' exists. Skip... Process: indri06AdmD The specified output 'terabyte-runs-top50/deduplication/indri06AdmDdeduplication-top10.jsonl' exists. Skip... 
The specified output 'terabyte-runs-top50/deduplication/indri06AdmDdeduplication-top50.jsonl' exists. Skip... Process: indri06AlceB The specified output 'terabyte-runs-top50/deduplication/indri06AlceBdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/indri06AlceBdeduplication-top50.jsonl' exists. Skip... Process: indri06AlceD The specified output 'terabyte-runs-top50/deduplication/indri06AlceDdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/indri06AlceDdeduplication-top50.jsonl' exists. Skip... Process: indri06Aql The specified output 'terabyte-runs-top50/deduplication/indri06Aqldeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/indri06Aqldeduplication-top50.jsonl' exists. Skip... Process: indri06AtdnD The specified output 'terabyte-runs-top50/deduplication/indri06AtdnDdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/indri06AtdnDdeduplication-top50.jsonl' exists. Skip... Process: JuruMan The specified output 'terabyte-runs-top50/deduplication/JuruMandeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/JuruMandeduplication-top50.jsonl' exists. Skip... Process: JuruT The specified output 'terabyte-runs-top50/deduplication/JuruTdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/JuruTdeduplication-top50.jsonl' exists. Skip... Process: JuruTD The specified output 'terabyte-runs-top50/deduplication/JuruTDdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/JuruTDdeduplication-top50.jsonl' exists. Skip... Process: JuruTWE The specified output 'terabyte-runs-top50/deduplication/JuruTWEdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/JuruTWEdeduplication-top50.jsonl' exists. Skip... 
Process: mg4jAdhocBBV The specified output 'terabyte-runs-top50/deduplication/mg4jAdhocBBVdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mg4jAdhocBBVdeduplication-top50.jsonl' exists. Skip... Process: mg4jAdhocBV The specified output 'terabyte-runs-top50/deduplication/mg4jAdhocBVdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mg4jAdhocBVdeduplication-top50.jsonl' exists. Skip... Process: mg4jAdhocBVV The specified output 'terabyte-runs-top50/deduplication/mg4jAdhocBVVdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mg4jAdhocBVVdeduplication-top50.jsonl' exists. Skip... Process: mg4jAdhocV The specified output 'terabyte-runs-top50/deduplication/mg4jAdhocVdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mg4jAdhocVdeduplication-top50.jsonl' exists. Skip... Process: mg4jAutoBBV The specified output 'terabyte-runs-top50/deduplication/mg4jAutoBBVdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mg4jAutoBBVdeduplication-top50.jsonl' exists. Skip... Process: mg4jAutoBV The specified output 'terabyte-runs-top50/deduplication/mg4jAutoBVdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mg4jAutoBVdeduplication-top50.jsonl' exists. Skip... Process: mg4jAutoBVV The specified output 'terabyte-runs-top50/deduplication/mg4jAutoBVVdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mg4jAutoBVVdeduplication-top50.jsonl' exists. Skip... Process: mg4jAutoV The specified output 'terabyte-runs-top50/deduplication/mg4jAutoVdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mg4jAutoVdeduplication-top50.jsonl' exists. Skip... 
Process: mpiircomb The specified output 'terabyte-runs-top50/deduplication/mpiircombdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mpiircombdeduplication-top50.jsonl' exists. Skip... Process: mpiirdesc The specified output 'terabyte-runs-top50/deduplication/mpiirdescdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mpiirdescdeduplication-top50.jsonl' exists. Skip... Process: mpiirmanual The specified output 'terabyte-runs-top50/deduplication/mpiirmanualdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mpiirmanualdeduplication-top50.jsonl' exists. Skip... Process: mpiirtitle The specified output 'terabyte-runs-top50/deduplication/mpiirtitlededuplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/mpiirtitlededuplication-top50.jsonl' exists. Skip... Process: MU06TBa1 The specified output 'terabyte-runs-top50/deduplication/MU06TBa1deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/MU06TBa1deduplication-top50.jsonl' exists. Skip... Process: MU06TBa2 The specified output 'terabyte-runs-top50/deduplication/MU06TBa2deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/MU06TBa2deduplication-top50.jsonl' exists. Skip... Process: MU06TBa5 The specified output 'terabyte-runs-top50/deduplication/MU06TBa5deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/MU06TBa5deduplication-top50.jsonl' exists. Skip... Process: MU06TBa6 The specified output 'terabyte-runs-top50/deduplication/MU06TBa6deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/MU06TBa6deduplication-top50.jsonl' exists. Skip... Process: p6tbadt The specified output 'terabyte-runs-top50/deduplication/p6tbadtdeduplication-top10.jsonl' exists. Skip... 
The specified output 'terabyte-runs-top50/deduplication/p6tbadtdeduplication-top50.jsonl' exists. Skip... Process: p6tbaxl The specified output 'terabyte-runs-top50/deduplication/p6tbaxldeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/p6tbaxldeduplication-top50.jsonl' exists. Skip... Process: sabtb06aa1 The specified output 'terabyte-runs-top50/deduplication/sabtb06aa1deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/sabtb06aa1deduplication-top50.jsonl' exists. Skip... Process: sabtb06at1 The specified output 'terabyte-runs-top50/deduplication/sabtb06at1deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/sabtb06at1deduplication-top50.jsonl' exists. Skip... Process: sabtb06man1 The specified output 'terabyte-runs-top50/deduplication/sabtb06man1deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/sabtb06man1deduplication-top50.jsonl' exists. Skip... Process: THUADALL The specified output 'terabyte-runs-top50/deduplication/THUADALLdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/THUADALLdeduplication-top50.jsonl' exists. Skip... Process: THUADAO The specified output 'terabyte-runs-top50/deduplication/THUADAOdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/THUADAOdeduplication-top50.jsonl' exists. Skip... Process: THUADLMAO The specified output 'terabyte-runs-top50/deduplication/THUADLMAOdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/THUADLMAOdeduplication-top50.jsonl' exists. Skip... Process: THUADLMO The specified output 'terabyte-runs-top50/deduplication/THUADLMOdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/THUADLMOdeduplication-top50.jsonl' exists. Skip... 
Process: THUADOR The specified output 'terabyte-runs-top50/deduplication/THUADORdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/THUADORdeduplication-top50.jsonl' exists. Skip... Process: TWTB06AD01 The specified output 'terabyte-runs-top50/deduplication/TWTB06AD01deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/TWTB06AD01deduplication-top50.jsonl' exists. Skip... Process: TWTB06AD02 The specified output 'terabyte-runs-top50/deduplication/TWTB06AD02deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/TWTB06AD02deduplication-top50.jsonl' exists. Skip... Process: TWTB06AD03 The specified output 'terabyte-runs-top50/deduplication/TWTB06AD03deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/TWTB06AD03deduplication-top50.jsonl' exists. Skip... Process: TWTB06AD04 The specified output 'terabyte-runs-top50/deduplication/TWTB06AD04deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/TWTB06AD04deduplication-top50.jsonl' exists. Skip... Process: TWTB06AD05 The specified output 'terabyte-runs-top50/deduplication/TWTB06AD05deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/TWTB06AD05deduplication-top50.jsonl' exists. Skip... Process: UAmsT06a3SUM The specified output 'terabyte-runs-top50/deduplication/UAmsT06a3SUMdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/UAmsT06a3SUMdeduplication-top50.jsonl' exists. Skip... Process: UAmsT06aAnLM The specified output 'terabyte-runs-top50/deduplication/UAmsT06aAnLMdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/UAmsT06aAnLMdeduplication-top50.jsonl' exists. Skip... 
Process: UAmsT06aTDN The specified output 'terabyte-runs-top50/deduplication/UAmsT06aTDNdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/UAmsT06aTDNdeduplication-top50.jsonl' exists. Skip... Process: UAmsT06aTeLM The specified output 'terabyte-runs-top50/deduplication/UAmsT06aTeLMdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/UAmsT06aTeLMdeduplication-top50.jsonl' exists. Skip... Process: UAmsT06aTTDN The specified output 'terabyte-runs-top50/deduplication/UAmsT06aTTDNdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/UAmsT06aTTDNdeduplication-top50.jsonl' exists. Skip... Process: uogTB06QET1 The specified output 'terabyte-runs-top50/deduplication/uogTB06QET1deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/uogTB06QET1deduplication-top50.jsonl' exists. Skip... Process: uogTB06QET2 The specified output 'terabyte-runs-top50/deduplication/uogTB06QET2deduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/uogTB06QET2deduplication-top50.jsonl' exists. Skip... Process: uogTB06S50L The specified output 'terabyte-runs-top50/deduplication/uogTB06S50Ldeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/uogTB06S50Ldeduplication-top50.jsonl' exists. Skip... Process: uogTB06SS10L The specified output 'terabyte-runs-top50/deduplication/uogTB06SS10Ldeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/uogTB06SS10Ldeduplication-top50.jsonl' exists. Skip... Process: uogTB06SSQL The specified output 'terabyte-runs-top50/deduplication/uogTB06SSQLdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/uogTB06SSQLdeduplication-top50.jsonl' exists. Skip... 
Process: uwmtFadDS The specified output 'terabyte-runs-top50/deduplication/uwmtFadDSdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/uwmtFadDSdeduplication-top50.jsonl' exists. Skip... Process: uwmtFadTPFB The specified output 'terabyte-runs-top50/deduplication/uwmtFadTPFBdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/uwmtFadTPFBdeduplication-top50.jsonl' exists. Skip... Process: uwmtFadTPRR The specified output 'terabyte-runs-top50/deduplication/uwmtFadTPRRdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/uwmtFadTPRRdeduplication-top50.jsonl' exists. Skip... Process: uwmtFmanual The specified output 'terabyte-runs-top50/deduplication/uwmtFmanualdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/uwmtFmanualdeduplication-top50.jsonl' exists. Skip... Process: zetabm The specified output 'terabyte-runs-top50/deduplication/zetabmdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/zetabmdeduplication-top50.jsonl' exists. Skip... Process: zetadir The specified output 'terabyte-runs-top50/deduplication/zetadirdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/zetadirdeduplication-top50.jsonl' exists. Skip... Process: zetaman The specified output 'terabyte-runs-top50/deduplication/zetamandeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/zetamandeduplication-top50.jsonl' exists. Skip... Process: zetamerg The specified output 'terabyte-runs-top50/deduplication/zetamergdeduplication-top10.jsonl' exists. Skip... The specified output 'terabyte-runs-top50/deduplication/zetamergdeduplication-top50.jsonl' exists. Skip... Process: zetamerg2 The specified output 'terabyte-runs-top50/deduplication/zetamerg2deduplication-top10.jsonl' exists. 
Skip... The specified output 'terabyte-runs-top50/deduplication/zetamerg2deduplication-top50.jsonl' exists. Skip...
import json
import pandas as pd

def eval_with_threshold(threshold, run_file_name):
    # Each line of the JSONL file holds the deduplication result for one topic.
    rows = []
    with open(run_file_name) as jsonl_file:
        for jsonl in jsonl_file:
            dedup_data = json.loads(jsonl)
            docs_to_remove = []
            # Collect the secondId of every pair whose s3 score reaches the threshold.
            for sim in dedup_data['similarities']:
                if sim['similarities']['s3'] >= threshold:
                    docs_to_remove += [sim['secondId']]
            # One row per topic; the set counts each removed document only once.
            rows += [{
                'topic': dedup_data['topic'],
                'duplicates': len(set(docs_to_remove)),
                'docs': dedup_data['docs'],
            }]
    return rows

def eval_runs_with_threshold(threshold, run_files):
    # Concatenate the per-topic rows of all run files into a single DataFrame.
    rows = []
    for r in run_files:
        rows += eval_with_threshold(threshold, r)
    return pd.DataFrame(rows)
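To make the counting logic concrete, here is a minimal sketch that runs the same per-topic count on a small, invented JSONL file. The field names `topic`, `docs`, `similarities`, `s3`, and `secondId` are the ones read by `eval_with_threshold`; the `firstId` field, the document ids, the topic values, the scores, and the helper name `count_duplicates` are assumptions made only for this illustration and are not taken from real CopyCat output.

```python
import json
import os
import tempfile

# Hypothetical records: ids, topics, and scores are invented for illustration.
# "firstId" is an assumed field name; the function never reads it.
records = [
    {
        "topic": "801",
        "docs": 10,
        "similarities": [
            {"firstId": "doc-a", "secondId": "doc-b", "similarities": {"s3": 0.95}},
            {"firstId": "doc-a", "secondId": "doc-c", "similarities": {"s3": 0.40}},
            {"firstId": "doc-d", "secondId": "doc-b", "similarities": {"s3": 0.90}},
        ],
    },
    {"topic": "802", "docs": 10, "similarities": []},
]


def count_duplicates(threshold, path):
    """Same counting as eval_with_threshold: one row per JSONL line."""
    rows = []
    with open(path) as jsonl_file:
        for line in jsonl_file:
            data = json.loads(line)
            # A set keeps each removed document once, even when it is the
            # second element of several pairs above the threshold.
            to_remove = {
                sim["secondId"]
                for sim in data["similarities"]
                if sim["similarities"]["s3"] >= threshold
            }
            rows.append({
                "topic": data["topic"],
                "duplicates": len(to_remove),
                "docs": data["docs"],
            })
    return rows


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example-deduplication-top10.jsonl")
    with open(path, "w") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
    rows = count_duplicates(0.82, path)

print(rows)
```

In this sketch `doc-b` appears in two pairs above the threshold but is counted once, and the pair with an s3 score of 0.40 is ignored, so the first topic reports a single duplicate.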
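# Collect all deduplication result files written by the previous step.
DEDUP_TARGET_DIR = PATH_TO_RUNS + 'deduplication/'
ALL_DIRS = !ls $DEDUP_TARGET_DIR
ALL_DIRS = [DEDUP_TARGET_DIR + i for i in ALL_DIRS if '.jsonl' in i]
df = eval_runs_with_threshold(0.82, ALL_DIRS)
# Keep only rows with at least 10 documents; copy to avoid pandas' SettingWithCopyWarning.
df = df[df['docs'] > 9].copy()
df['redundancy'] = df['duplicates'] / df['docs']
# Relabel so that all remaining rows fall into a single group.
df['docs'] = 10
df[['docs', 'redundancy']].groupby('docs').mean()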
docs | redundancy |
---|---|
10 | 0.126488 |
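The value in this table is a mean of per-row ratios: every row that passes the filter contributes its own `duplicates / docs`, and `groupby('docs').mean()` averages those ratios after all rows have been relabeled into one group. The following sketch, using invented counts and only the standard library, shows how that differs from pooling all duplicates and all documents first; the names `mean_of_ratios` and `pooled` are hypothetical.

```python
from statistics import mean

# Hypothetical rows, one per run and topic; the counts are invented.
rows = [
    {"duplicates": 1, "docs": 10},
    {"duplicates": 0, "docs": 10},
    {"duplicates": 5, "docs": 50},
]

# What the notebook computes: a ratio per row, then the mean of the ratios.
per_row = [r["duplicates"] / r["docs"] for r in rows]
mean_of_ratios = mean(per_row)

# Alternative for comparison: pool all duplicates and all documents first.
pooled = sum(r["duplicates"] for r in rows) / sum(r["docs"] for r in rows)

print(round(mean_of_ratios, 4), round(pooled, 4))
```

With the mean of ratios, every row has the same weight regardless of how many documents it covers, which is why rows of different sizes can shift the two numbers apart.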
import seaborn as sns
sns.catplot(data=df, x='docs', y='redundancy', kind='violin', hue='docs')
[Figure: violin plot of redundancy (y-axis) for docs = 10 (x-axis)]
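# Repeat the evaluation for the top-50 results.
DEDUP_TARGET_DIR = PATH_TO_RUNS + 'deduplication/'
ALL_DIRS = !ls $DEDUP_TARGET_DIR
ALL_DIRS = [DEDUP_TARGET_DIR + i for i in ALL_DIRS if '.jsonl' in i]
df = eval_runs_with_threshold(0.82, ALL_DIRS)
# Keep only rows with more than 45 documents; copy to avoid pandas' SettingWithCopyWarning.
df = df[df['docs'] > 45].copy()
df['redundancy'] = df['duplicates'] / df['docs']
# Relabel so that all remaining rows fall into a single group.
df['docs'] = 50
df[['docs', 'redundancy']].groupby('docs').mean()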
docs | redundancy |
---|---|
50 | 0.157529 |
import seaborn as sns
sns.catplot(data=df, x='docs', y='redundancy', kind='violin', hue='docs')
[Figure: violin plot of redundancy (y-axis) for docs = 50 (x-axis)]