Dask

Distributed computing in Python à la Spark

Juan Luis Cano Rodríguez [email protected]

Mission Planning & Execution Engineer @ Satellogic

PyData Alicante 2019-12-16 @ CubeCut Software

Who am I?

  • Aeronautical engineer and self-taught Pythonista
  • Mission Planning & Execution Engineer at Satellogic
  • President of the Asociación Python España 🐍 🇪🇸
  • Contributor to scientific Python projects: NumPy, SciPy, conda, astropy, memory-profiler...
  • Adjunct professor of Python for Big Data at Instituto Empresa
  • Lover of pizza and hard rock 🤘

Outline

  1. State of the art
  2. Dask
    1. Introduction
    2. Lazy evaluation
    3. Operation graphs
    4. Demo
    5. Limitations
  3. Related projects
  4. Conclusions and future

The beginning of the end for PySpark?

1. State of the art

Your laptop: ~3.6 GHz

Clock speed

https://en.wikipedia.org/wiki/File:Clock_CPU_Scaling.jpg

Big Data

2. Dask

Dask is a parallel computing library geared towards analytics. It consists of two components:

  1. Dynamic task scheduling optimized for computation.
  2. "Big Data" collections such as arrays, DataFrames, and parallel lists, which mimic the way you work with NumPy, pandas, or Python iterators, but for objects that are larger than the available memory or that live in distributed environments. These collections run on top of the schedulers.

It is a young project, but it has several properties that make it very interesting, among them:

  • Familiar: Dask replicates the way you work with NumPy arrays and pandas DataFrames, so the transition is much easier than with other systems.
  • Flexible: It integrates well with other projects and provides tools to parallelize our own functions.
  • Native: It is pure Python; there are no antipatterns and no communication with other languages.
  • Scalable: Dask runs both on 1000-node clusters and on ordinary laptops, optimizing memory usage.
  • Friendly: It provides immediate feedback and plenty of diagnostic tools.
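
As an illustration of the "Flexible" point, here is a minimal sketch of parallelizing our own functions with dask.delayed. The functions inc and add are hypothetical helpers made up for this example; they are not part of the talk.

```python
from dask import delayed


def inc(x):
    # Hypothetical helper: stands in for any function of our own.
    return x + 1


def add(x, y):
    # Hypothetical helper: combines two intermediate results.
    return x + y


# Wrapping a function with delayed() records the call as a task
# instead of running it right away.
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)

# Nothing has run so far; compute() hands the task graph to a scheduler.
# The two inc calls do not depend on each other, so they may run in parallel.
result = total.compute()
print(result)  # 5
```

The same pattern applies to loops over files or parameters: build the delayed objects first, then call compute() once at the end.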

dask

Installation

The latest version is 2.9.0 (2019-12-06, just a few days ago!) and it can be installed with pip:

$ pip install "dask[complete]"

or with conda:

$ conda install dask

Lazy evaluation

Let's work through a trivial example with dask.array to see how computation works in Dask.

In [1]:
import numpy as np
import dask.array as da
In [2]:
x = np.arange(1000)
y = da.from_array(x, chunks=100)
In [3]:
y
Out[3]:
        Array          Chunk
Bytes   8.00 kB        800 B
Shape   (1000,)        (100,)
Count   11 Tasks       10 Chunks
Type    int64          numpy.ndarray

If we try to perform any operation on these arrays, it is not executed immediately:

In [4]:
op = y.mean()
op
Out[4]:
        Array          Chunk
Bytes   8 B            8 B
Shape   ()             ()
Count   25 Tasks       1 Chunks
Type    float64        numpy.ndarray

Instead, Dask builds a graph of all the required operations and their dependencies, so that we can visualize it and reason about it. This graph is stored in ordinary Python data structures such as dictionaries, lists, and tuples:

In [5]:
y.dask.dicts
Out[5]:
{'array-e30dcabe0f2b3c7236d769ba2cbdb28b': {('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   0): (<function _operator.getitem(a, b, /)>,
   'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b',
   (slice(0, 100, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   1): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(100, 200, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   2): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(200, 300, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   3): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(300, 400, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   4): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(400, 500, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   5): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(500, 600, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   6): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(600, 700, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   7): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(700, 800, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   8): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(800, 900, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   9): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(900, 1000, None),)),
  'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b': array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
          13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
          26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
          39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
          52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
          65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
          78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
          91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
         104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
         117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
         130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
         143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
         156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
         169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
         182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
         195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
         208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,
         221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233,
         234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246,
         247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259,
         260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272,
         273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285,
         286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298,
         299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311,
         312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324,
         325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337,
         338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350,
         351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363,
         364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376,
         377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
         390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402,
         403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415,
         416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428,
         429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441,
         442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454,
         455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467,
         468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480,
         481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493,
         494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506,
         507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519,
         520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532,
         533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545,
         546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558,
         559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571,
         572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584,
         585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597,
         598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610,
         611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623,
         624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636,
         637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649,
         650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662,
         663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675,
         676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688,
         689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701,
         702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714,
         715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727,
         728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740,
         741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753,
         754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766,
         767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779,
         780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792,
         793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805,
         806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818,
         819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831,
         832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844,
         845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857,
         858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870,
         871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883,
         884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896,
         897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909,
         910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922,
         923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935,
         936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948,
         949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961,
         962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974,
         975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987,
         988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999])}}
In [6]:
op.dask.dicts
Out[6]:
{'mean_agg-aggregate-f2f4df75bd5e24989d51d701807bd50b': {('mean_agg-aggregate-f2f4df75bd5e24989d51d701807bd50b',): (functools.partial(<function mean_agg at 0x7fc05ed82550>, dtype=dtype('float64'), axis=(0,), keepdims=False),
   [('mean_combine-partial-966a5ccef2c1ae4a1763289b49bd8fe8', 0),
    ('mean_combine-partial-966a5ccef2c1ae4a1763289b49bd8fe8', 1),
    ('mean_combine-partial-966a5ccef2c1ae4a1763289b49bd8fe8', 2)])},
 'mean_combine-partial-966a5ccef2c1ae4a1763289b49bd8fe8': {('mean_combine-partial-966a5ccef2c1ae4a1763289b49bd8fe8',
   0): (functools.partial(<function mean_combine at 0x7fc05ed824c0>, axis=(0,), keepdims=True),
   [('mean_chunk-34017827db9ad0848df29f76eadef2b2', 0),
    ('mean_chunk-34017827db9ad0848df29f76eadef2b2', 1),
    ('mean_chunk-34017827db9ad0848df29f76eadef2b2', 2),
    ('mean_chunk-34017827db9ad0848df29f76eadef2b2', 3)]),
  ('mean_combine-partial-966a5ccef2c1ae4a1763289b49bd8fe8',
   1): (functools.partial(<function mean_combine at 0x7fc05ed824c0>, axis=(0,), keepdims=True), [('mean_chunk-34017827db9ad0848df29f76eadef2b2',
     4),
    ('mean_chunk-34017827db9ad0848df29f76eadef2b2', 5),
    ('mean_chunk-34017827db9ad0848df29f76eadef2b2', 6),
    ('mean_chunk-34017827db9ad0848df29f76eadef2b2', 7)]),
  ('mean_combine-partial-966a5ccef2c1ae4a1763289b49bd8fe8',
   2): (functools.partial(<function mean_combine at 0x7fc05ed824c0>, axis=(0,), keepdims=True), [('mean_chunk-34017827db9ad0848df29f76eadef2b2',
     8),
    ('mean_chunk-34017827db9ad0848df29f76eadef2b2', 9)])},
 'mean_chunk-34017827db9ad0848df29f76eadef2b2': <dask.blockwise.Blockwise at 0x7fc05eccfa60>,
 'array-e30dcabe0f2b3c7236d769ba2cbdb28b': {('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   0): (<function _operator.getitem(a, b, /)>,
   'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b',
   (slice(0, 100, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   1): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(100, 200, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   2): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(200, 300, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   3): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(300, 400, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   4): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(400, 500, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   5): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(500, 600, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   6): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(600, 700, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   7): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(700, 800, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   8): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(800, 900, None),)),
  ('array-e30dcabe0f2b3c7236d769ba2cbdb28b',
   9): (<function _operator.getitem(a, b, /)>, 'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b', (slice(900, 1000, None),)),
  'array-original-e30dcabe0f2b3c7236d769ba2cbdb28b': array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
          13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
          26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
          39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
          52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
          65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
          78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
          91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
         104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
         117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
         130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
         143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
         156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
         169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
         182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
         195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
         208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,
         221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233,
         234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246,
         247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259,
         260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272,
         273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285,
         286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298,
         299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311,
         312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324,
         325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337,
         338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350,
         351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363,
         364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376,
         377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
         390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402,
         403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415,
         416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428,
         429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441,
         442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454,
         455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467,
         468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480,
         481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493,
         494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506,
         507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519,
         520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532,
         533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545,
         546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558,
         559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571,
         572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584,
         585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597,
         598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610,
         611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623,
         624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636,
         637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649,
         650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662,
         663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675,
         676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688,
         689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701,
         702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714,
         715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727,
         728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740,
         741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753,
         754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766,
         767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779,
         780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792,
         793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805,
         806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818,
         819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831,
         832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844,
         845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857,
         858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870,
         871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883,
         884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896,
         897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909,
         910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922,
         923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935,
         936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948,
         949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961,
         962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974,
         975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987,
         988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999])}}
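
Because the graph is just dictionaries and tuples, a tiny one can also be written by hand and passed to a scheduler. The sketch below uses the threaded scheduler's get function; the key names ("x", "y", "sum", "scaled") are arbitrary labels chosen for this example.

```python
from operator import add, mul

from dask.threaded import get

# A task graph is a dict: each key names a result, and a tuple of the form
# (function, *arguments) describes how to compute it. An argument that
# matches another key is replaced by that key's value before the call.
dsk = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "scaled": (mul, "sum", 10),
}

# Ask the scheduler for one key; it resolves the dependencies on its own.
result = get(dsk, "scaled")
print(result)  # 30
```

The long dictionaries printed above follow the same layout, only with generated key names and one getitem task per chunk.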

We can also visualize the graph of op if the graphviz library is installed:

In [7]:
op.visualize()
Out[7]:

To actually perform the operation, we have to call the .compute() method.

In [8]:
op.compute()
Out[8]:
499.5
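
The task names in the graph (mean_chunk, mean_combine, mean_agg) suggest a three-stage reduction. The following plain NumPy sketch shows the idea under that assumption; it is an illustration of the technique, not Dask's actual implementation.

```python
import numpy as np

x = np.arange(1000)
# Split the array into 10 chunks of 100 elements, as chunks=100 does.
chunks = [x[i:i + 100] for i in range(0, 1000, 100)]

# Stage 1 (per chunk): reduce each chunk independently to a (sum, count) pair.
partials = [(c.sum(), c.size) for c in chunks]

# Stage 2 (combine): merge the pairs; the order of merging does not matter,
# which is what lets a scheduler run stage 1 in parallel.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)

# Stage 3 (aggregate): a single final division yields the mean.
mean = total / count
print(mean)  # 499.5
```

Keeping sums and counts separate until the last step is what makes the result exact even when chunks have different sizes.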

Converting our original array to a NumPy array is also done by calling compute():

In [9]:
y.compute()
Out[9]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
       182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
       195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
       208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,
       221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233,
       234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246,
       247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259,
       260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272,
       273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285,
       286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298,
       299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311,
       312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324,
       325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337,
       338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350,
       351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363,
       364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376,
       377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
       390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402,
       403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415,
       416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428,
       429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441,
       442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454,
       455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467,
       468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480,
       481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493,
       494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506,
       507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519,
       520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532,
       533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545,
       546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558,
       559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571,
       572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584,
       585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597,
       598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610,
       611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623,
       624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636,
       637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649,
       650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662,
       663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675,
       676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688,
       689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701,
       702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714,
       715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727,
       728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740,
       741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753,
       754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766,
       767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779,
       780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792,
       793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805,
       806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818,
       819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831,
       832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844,
       845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857,
       858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870,
       871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883,
       884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896,
       897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909,
       910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922,
       923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935,
       936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948,
       949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961,
       962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974,
       975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987,
       988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999])

Demo: NYC taxis

Another data structure that Dask provides is the DataFrame, which behaves in the same way as a pandas DataFrame.

To see how it works, let's download trip data for New York taxis:

In [10]:
!cat data/raw_data_urls.txt
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv
In [11]:
!du data/yellow*.csv -h -s
656M	data/yellow_tripdata_2019-01.csv
620M	data/yellow_tripdata_2019-02.csv
693M	data/yellow_tripdata_2019-03.csv
658M	data/yellow_tripdata_2019-04.csv
In [12]:
!du data/ -h -s
2,6G	data/
In [13]:
!cat data/download_raw_data.sh
cat raw_data_urls.txt | xargs -n 1 -P 6 wget -c -P .

Both dask.dataframe and dask.array use a thread-based scheduler by default. Instead, we are going to use the Client class, the same one we would use on a cluster.

In [14]:
import dask.dataframe as dd

When used locally, this Client class launches a scheduler that minimizes memory usage and takes advantage of all CPU cores.

"The dask single-machine schedulers have logic to execute the graph in a way that minimizes memory footprint." http://dask.pydata.org/en/latest/custom-graphs.html?highlight=minimizes%20memory#related-projects

The diagnostics server is available at http://127.0.0.1:8787/.

In [15]:
from dask.distributed import Client

client = Client()
client
Out[15]:

Client

Cluster

  • Workers: 4
  • Cores: 4
  • Memory: 16.73 GB
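
For comparison, the single-machine schedulers can also be selected explicitly, without a Client, through the scheduler keyword of compute(). This is a small sketch on a toy array, not part of the demo itself.

```python
import dask.array as da

x = da.arange(1000, chunks=100)

# The same graph can be executed by different local schedulers.
# "threads" is the default for dask.array and dask.dataframe;
# "synchronous" runs everything in the calling thread, which helps debugging.
threaded = x.sum().compute(scheduler="threads")
single = x.sum().compute(scheduler="synchronous")
print(threaded, single)  # 499500 499500
```

Switching schedulers does not change the result, only where and how the tasks run.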

Now we read all the .csv files at once, selected with a glob pattern, into a single Dask DataFrame:

In [16]:
df = dd.read_csv("data/yellow*.csv",
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

It mimics the pandas API:

In [17]:
df.head()
Out[17]:
   VendorID  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count  trip_distance  RatecodeID  store_and_fwd_flag  PULocationID  DOLocationID  payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge
0         1   2019-01-01 00:46:40    2019-01-01 00:53:20                1            1.5           1                   N           151           239             1          7.0    0.5      0.5        1.65           0.0                    0.3          9.95                   NaN
1         1   2019-01-01 00:59:47    2019-01-01 01:18:59                1            2.6           1                   N           239           246             1         14.0    0.5      0.5        1.00           0.0                    0.3         16.30                   NaN
2         2   2018-12-21 13:48:30    2018-12-21 13:52:40                3            0.0           1                   N           236           236             1          4.5    0.5      0.5        0.00           0.0                    0.3          5.80                   NaN
3         2   2018-11-28 15:52:25    2018-11-28 15:55:45                5            0.0           1                   N           193           193             2          3.5    0.5      0.5        0.00           0.0                    0.3          7.55                   NaN
4         2   2018-11-28 15:56:57    2018-11-28 15:58:33                5            0.0           2                   N           193           193             2         52.0    0.0      0.5        0.00           0.0                    0.3         55.55                   NaN
In [18]:
df.dtypes
Out[18]:
VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
RatecodeID                        int64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
dtype: object

Let's compute the length of the DataFrame:

In [19]:
# This operation blocks the interpreter for a few minutes
len(df)
Out[19]:
29952851

As you can see, memory usage stays contained and all the CPUs are working.

len

We can also do it asynchronously:

In [20]:
futures = client.submit(len, df)
futures
Out[20]:
Future: len status: pending, key: len-ba53e73bd851b9acbce1db3147916a2f
In [21]:
from distributed import progress
In [22]:
progress(futures)

Now let's compute the mean trip distance as a function of the number of passengers. Just as when we used dask.array, the operation is not performed automatically.

In [23]:
op = df.groupby(df.passenger_count).trip_distance.mean()
op
Out[23]:
Dask Series Structure:
npartitions=1
    float64
        ...
Name: trip_distance, dtype: float64
Dask Name: truediv, 240 tasks
In [24]:
f2 = client.compute(op)
f2
Out[24]:
Future: finalize status: pending, key: finalize-f2fcbd32b7961de72012aa4fafe66321
The client.compute method stores the result on a single node, so it should be used with care. For large objects, it is better to use client.persist.
In [25]:
progress(f2)
In [26]:
f2.result()
Out[26]:
passenger_count
0    2.788861
1    2.894872
2    3.009092
3    2.970344
4    3.015722
5    2.959904
6    2.934012
7    2.186186
8    5.187980
9    3.607727
Name: trip_distance, dtype: float64

In this case, the visualization of the operation is already considerably large:

In [27]:
op.visualize()
Out[27]: