Dask is a parallel computing library geared towards analytics. It is made up of two components: a dynamic task scheduler, and "big data" collections (such as parallel arrays and dataframes) built on top of it.
It is a young project, but it has several properties that make it very interesting.
The latest release is 2.9.0 (2019-12-06, just a few days ago!) and it can be installed with pip:
$ pip install "dask[complete]"
or with conda:
$ conda install dask
Let's work through a trivial example with dask.array to see how computation in Dask works.
import numpy as np
import dask.array as da
x = np.arange(1000)
y = da.from_array(x, chunks=100)
y
If we try to perform any operation on these arrays, it is not executed immediately:
op = y.mean()
op
Instead, Dask builds a graph with all the required operations and their dependencies, so that we can visualize it and reason about it. This graph is stored in ordinary Python data structures such as dictionaries, lists and tuples:
y.dask.dicts
{'array-e30dcabe0f2b3c7236d769ba2cbdb28b': {
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b', 0): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(0, 100, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b', 1): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(100, 200, None),)),
  ...
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b', 9): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(900, 1000, None),)),
  'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b': array([  0,   1,   2, ..., 997, 998, 999])}}
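The structure above can be boiled down to its essence: a dict mapping keys to either data or tasks, where a task is a tuple of a function and its arguments. The following is a toy sketch (not Dask's actual scheduler) of how such a graph can be evaluated; the key names are made up:

```python
# Minimal sketch of a Dask-style task graph stored in a plain dict.
# A value is either data or a task: a tuple (function, *args), where
# each argument may itself be a key in the graph.
from operator import add, getitem

data = list(range(1000))

graph = {
    "original": data,
    ("chunk", 0): (getitem, "original", slice(0, 500)),
    ("chunk", 1): (getitem, "original", slice(500, 1000)),
    ("sum", 0): (sum, ("chunk", 0)),
    ("sum", 1): (sum, ("chunk", 1)),
    "total": (add, ("sum", 0), ("sum", 1)),
}

def get(dsk, key):
    """Recursively evaluate `key` in the graph `dsk`."""
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        resolved = [get(dsk, a) if isinstance(a, (str, tuple)) and a in dsk else a
                    for a in args]
        return func(*resolved)
    return task  # plain data, nothing to compute

print(get(graph, "total"))  # 499500
```

A real scheduler walks the same kind of dict, but decides in which order (and on which worker) to run each task.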
And we can visualize it if we have the graphviz library installed:
op.visualize()
If we want to actually perform the operation, we have to call the .compute() method:
op.compute()
499.5
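Conceptually, the chunked mean is computed as a partial (sum, count) per chunk, which are then combined. A pure-Python sketch of that reduction (illustrative only, not Dask's implementation):

```python
# Split the data into chunks of 100, compute a partial (sum, count)
# per chunk, then combine the partials into the final mean.
data = list(range(1000))
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]

partials = [(sum(chunk), len(chunk)) for chunk in chunks]  # map step
total_sum = sum(s for s, _ in partials)                    # combine step
total_count = sum(n for _, n in partials)
print(total_sum / total_count)  # 499.5
```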
If we want to convert our original array back to a NumPy array, that is also done by calling compute():
y.compute()
array([  0,   1,   2, ..., 997, 998, 999])
Another of the data structures Dask provides is the DataFrame, which behaves in the same way as a pandas DataFrame.
To study how it works, let's download trip data from the New York taxis:
!du data/yellow*.csv -h -s
656M	data/yellow_tripdata_2019-01.csv
620M	data/yellow_tripdata_2019-02.csv
693M	data/yellow_tripdata_2019-03.csv
658M	data/yellow_tripdata_2019-04.csv
!du data/ -h -s
2,6G data/
Both dask.dataframe and dask.array use a thread-based scheduler by default. Instead, we are going to use a Client class, the same one we would use if we were on a cluster.
import dask.dataframe as dd
This Client class, when used locally, launches a scheduler that minimizes memory usage and takes advantage of all the CPU cores.
"The dask single-machine schedulers have logic to execute the graph in a way that minimizes memory footprint." http://dask.pydata.org/en/latest/custom-graphs.html?highlight=minimizes%20memory#related-projects
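To see why the execution order matters for memory, here is a toy comparison (not Dask's scheduler, just an illustration of the principle): materializing every chunk before reducing keeps them all alive at once, while loading, reducing and releasing one chunk at a time keeps the peak footprint at a single chunk.

```python
def naive(chunk_sizes):
    # Materialize every chunk first, then reduce:
    # peak memory is proportional to ALL the chunks.
    loaded = [list(range(n)) for n in chunk_sizes]
    return sum(sum(chunk) for chunk in loaded)

def streaming(chunk_sizes):
    # Load, reduce and release one chunk at a time:
    # peak memory is proportional to ONE chunk.
    total = 0
    for n in chunk_sizes:
        chunk = list(range(n))  # load
        total += sum(chunk)     # reduce
        del chunk               # release before the next load
    return total

print(naive([100, 200]) == streaming([100, 200]))  # True
```

Dask's schedulers apply this idea to arbitrary graphs: intermediates are released as soon as no pending task depends on them.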
The diagnostics server is available at http://127.0.0.1:8787/.
from dask.distributed import Client
client = Client()
client
And now we read all the .csv files at once, with a glob pattern, into the same Dask DataFrame:
df = dd.read_csv("data/yellow*.csv",
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])
It mimics the pandas API:
df.head()
 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7.0 | 0.5 | 0.5 | 1.65 | 0.0 | 0.3 | 9.95 | NaN |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14.0 | 0.5 | 0.5 | 1.00 | 0.0 | 0.3 | 16.30 | NaN |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0.0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 5.80 | NaN |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0.0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 7.55 | NaN |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0.0 | 2 | N | 193 | 193 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 55.55 | NaN |
df.dtypes
VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
RatecodeID                        int64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
dtype: object
Let's compute the length of the DataFrame:
# This operation blocks the interpreter for a few minutes
len(df)
29952851
As you can observe, memory usage stays contained and all the CPUs are busy. We can also do it asynchronously:
futures = client.submit(len, df)
futures
from distributed import progress
Now let's compute the mean trip distance as a function of the number of passengers. Just as when we were using dask.array, the operation is not performed immediately.
op = df.groupby(df.passenger_count).trip_distance.mean()
op
Dask Series Structure:
npartitions=1
    float64
        ...
Name: trip_distance, dtype: float64
Dask Name: truediv, 240 tasks
f2 = client.compute(op)
f2
client.compute stores the result on a single node, so it must be used with care. For large objects it is better to use client.persist.
progress(f2)
f2.result()
passenger_count
0    2.788861
1    2.894872
2    3.009092
3    2.970344
4    3.015722
5    2.959904
6    2.934012
7    2.186186
8    5.187980
9    3.607727
Name: trip_distance, dtype: float64
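The pattern behind this partitioned groupby-mean can be sketched in pure Python: each partition contributes per-group (sum, count) pairs, which are then combined into the group means. The mini-partitions below are made up for illustration:

```python
# Split-apply-combine across partitions, Dask-style but in pure Python.
from collections import defaultdict

# two tiny "partitions" of (passenger_count, trip_distance) rows
partitions = [
    [(1, 2.0), (2, 3.0), (1, 4.0)],
    [(2, 5.0), (1, 3.0)],
]

# map step: accumulate per-group sums and counts over every partition
sums, counts = defaultdict(float), defaultdict(int)
for part in partitions:
    for key, value in part:
        sums[key] += value
        counts[key] += 1

# reduce step: combine into the per-group means
means = {key: sums[key] / counts[key] for key in sums}
print(means)  # {1: 3.0, 2: 4.0}
```

Because only (sum, count) pairs travel between partitions, the full column never needs to fit in memory at once.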
In this case the visualization of the operation is already quite sizable:
op.visualize()
More operations we can perform:
df2 = df[(df.tip_amount > 0) & (df.fare_amount > 0)] # filter out bad rows
df2['tip_fraction'] = df2.tip_amount / df2.fare_amount # make new column
hour = df2.groupby(df2.tpep_pickup_datetime.dt.hour).tip_fraction.mean()
hour
Dask Series Structure:
npartitions=1
    float64
        ...
Name: tip_fraction, dtype: float64
Dask Name: truediv, 780 tasks
f_hour = client.compute(hour)
f_hour.result().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fc0102d6850>
import pandas as pd

payments = pd.DataFrame(
{
1: 'Credit Card',
2: 'Cash',
3: 'No Charge',
4: 'Dispute',
5: 'Unknown',
6: 'Voided trip'
}, index=["payment_name"]
).T
df2 = df.merge(payments, left_on='payment_type', right_index=True)
client.compute(df2.groupby(df2.payment_name).tip_amount.mean()).result()
payment_name
Cash           0.000358
Credit Card    2.896196
Dispute       -0.006623
No Charge      0.968446
Unknown        0.000000
Name: tip_amount, dtype: float64
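The merge above joins each row against a small lookup table; with the table as a plain dict, it amounts to one lookup per row. A self-contained sketch with made-up rows:

```python
# Joining trip rows against a small code -> name lookup table.
payments = {1: 'Credit Card', 2: 'Cash', 3: 'No Charge',
            4: 'Dispute', 5: 'Unknown', 6: 'Voided trip'}

rows = [(1, 2.5), (2, 0.0), (1, 3.0)]  # (payment_type, tip_amount)

joined = [(payments[ptype], tip) for ptype, tip in rows]
print(joined)  # [('Credit Card', 2.5), ('Cash', 0.0), ('Credit Card', 3.0)]
```

With Dask, the small table can be broadcast to every partition, so each partition performs this lookup independently.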
Not all operations can be executed by reading the data into memory only partially.
zero_tip = df2.tip_amount == 0
cash = df2.payment_name == 'Cash'
client.compute(dd.concat([zero_tip, cash], axis=1).corr()).result()
/home/juanlu/.pyenv/versions/3.8.0/envs/dask38/lib/python3.8/site-packages/dask/dataframe/multi.py:1054: UserWarning: Concatenating dataframes with unknown divisions.
We're assuming that the indexes of each dataframes are aligned. This assumption is not generally safe.
  warnings.warn(
 | tip_amount | payment_name
---|---|---
tip_amount | 1.000000 | 0.911077 |
payment_name | 0.911077 | 1.000000 |
What does this warning mean? In Dask, some operations are sensitive to the partitioning (http://dask.pydata.org/en/latest/dataframe-design.html#partitions), so we will have to reindex the DataFrame so that the partitions are aligned:
df2.passenger_count.resample('1d').compute()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-bab0b1ae8068> in <module>
----> 1 df2.passenger_count.resample('1d').compute()

~/.pyenv/versions/3.8.0/envs/dask38/lib/python3.8/site-packages/dask/dataframe/core.py in resample(self, rule, closed, label)
   2315         from .tseries.resample import Resampler
   2316
-> 2317         return Resampler(self, rule, closed=closed, label=label)
   2318
   2319     @derived_from(pd.DataFrame)

~/.pyenv/versions/3.8.0/envs/dask38/lib/python3.8/site-packages/dask/dataframe/tseries/resample.py in __init__(self, obj, rule, **kwargs)
    106                 "for more information."
    107             )
--> 108             raise ValueError(msg)
    109         self.obj = obj
    110         rule = pd.tseries.frequencies.to_offset(rule)

ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions for more information.
df2.npartitions
45
df2.divisions
(None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None)
df3 = df2.set_index('tpep_pickup_datetime')
df3.npartitions
45
df3.divisions
(Timestamp('2001-02-02 14:55:07'),
 Timestamp('2019-01-01 16:48:28'),
 Timestamp('2019-01-04 20:22:47'),
 ...
 Timestamp('2019-04-28 21:20:59'),
 Timestamp('2088-01-24 00:25:39'))
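The divisions are the sorted boundaries between partitions. Knowing them lets Dask route an index lookup to a single partition with a binary search, which can be sketched with the stdlib (the boundary values here are made up):

```python
# With known, sorted divisions, finding the partition that holds a
# given index value is a binary search over the boundaries.
from bisect import bisect_right

# boundaries for 3 partitions: [0, 10), [10, 20), [20, 30]
divisions = [0, 10, 20, 30]

def partition_for(value, divisions):
    """Index of the partition containing `value`."""
    i = bisect_right(divisions, value) - 1
    # the last boundary closes the final partition rather than opening a new one
    return min(i, len(divisions) - 2)

print(partition_for(5, divisions))   # 0
print(partition_for(25, divisions))  # 2
print(partition_for(30, divisions))  # 2
```

With unknown divisions (all None), no such routing is possible, which is why operations like resample refuse to run.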
daily_mean = df3.passenger_count.resample('1d').mean()
daily_mean
Dask Series Structure:
npartitions=45
2001-02-02    int64
2019-01-01      ...
              ...
2019-04-28      ...
2088-01-24      ...
Name: passenger_count, dtype: int64
Dask Name: resample, 1610 tasks
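Conceptually, resample('1d').mean() buckets each timestamp by calendar day and averages per bucket. A stdlib sketch with made-up records:

```python
# Daily resample-mean over (timestamp, value) records in pure Python.
from datetime import datetime
from collections import defaultdict

records = [
    (datetime(2019, 1, 1, 0, 46), 1),
    (datetime(2019, 1, 1, 23, 59), 3),
    (datetime(2019, 1, 2, 12, 0), 2),
]

buckets = defaultdict(list)
for ts, value in records:
    buckets[ts.date()].append(value)  # bucket by calendar day

daily_mean = {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}
print(daily_mean)
# {datetime.date(2019, 1, 1): 2.0, datetime.date(2019, 1, 2): 2.0}
```

Dask does the same per partition and then stitches the day buckets that straddle partition boundaries, which is exactly why it needs known divisions.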
# Don't run this locally!
# daily_mean.compute().plot()
The ecosystem around Dask is growing at great speed.
>>> print("Thank you very much!")
Thank you very much!