This notebook introduces the da.linalg.svd algorithms for the Singular Value Decomposition.
Starting the Dask Client is optional. It provides a dashboard that is useful for gaining insight into the computation.
The link to the dashboard becomes visible when you create the client below. We recommend keeping it open on one side of your screen while you use the notebook on the other. Arranging your windows this way can take some effort, but seeing both at the same time is very useful when learning.
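from dask.distributed import Client, progress
# processes=False runs the single worker as threads inside this process
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='2GB')
client  # displaying the client shows the dashboard link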
Client-8ad1cc86-0de1-11ed-a4f6-000d3a8f7959
Connection method: Cluster object | Cluster type: distributed.LocalCluster
Dashboard: http://10.1.1.64:8787/status
Workers: 1 | Total threads: 4 | Total memory: 1.86 GiB
Status: running | Using processes: False
Scheduler-ad8714da-ab0d-4c50-a4cb-ec68028ac54d
Comm: inproc://10.1.1.64/9462/1 | Started: Just now
Worker comm: inproc://10.1.1.64/9462/4 | Worker dashboard: http://10.1.1.64:39957/status | Nanny: None
Local directory: /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-we6_wppm
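In many applications the matrix is tall and skinny: it has many more rows than columns. In this case da.linalg.svd can use a specialized algorithm, provided the array is chunked along rows only, so that each chunk spans all of the columns.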
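import dask.array as da
# Tall-and-skinny array: 20 chunks of rows, each chunk spanning all 100 columns
X = da.random.random((200000, 100), chunks=(10000, 100)).persist()
import dask
# u, s and v are lazy Dask arrays; nothing is computed until .compute()
u, s, v = da.linalg.svd(X)
# Render the task graph of the three results
dask.visualize(u, s, v)
v.compute()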
array([[ 0.09994831,  0.10007229,  0.09997617, ...,  0.09995264,  0.09995591,  0.09984163],
       [ 0.05585195, -0.06184545, -0.04747733, ..., -0.06275421, -0.2061527 ,  0.18218227],
       [ 0.01277435,  0.00484692, -0.04387551, ...,  0.00444605,  0.12143905, -0.06438531],
       ...,
       [ 0.02933106,  0.00834248,  0.0103009 , ..., -0.06069817,  0.01291796,  0.12832988],
       [ 0.0901224 , -0.00492353, -0.00470015, ...,  0.14196305, -0.09734339, -0.05803211],
       [ 0.16619815,  0.14906927, -0.18081339, ..., -0.1346468 ,  0.12524437,  0.01322112]])
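The factors returned by da.linalg.svd can be sanity-checked on a smaller array. The sketch below is an illustration, not part of the original notebook: the array size, the chunking, and the names A, X_small, u_np, s_np and v_np are assumptions chosen so that it runs quickly. It checks that multiplying the factors back together recovers the input, that the columns of u are orthonormal, and that the singular values come back in descending order. As with numpy.linalg.svd, the rows of v are the right singular vectors.

```python
import numpy as np
import dask
import dask.array as da

# Small tall-and-skinny input (sizes are illustrative assumptions).
rng = np.random.default_rng(0)
A = rng.random((2000, 20))

# Chunk along rows only: every chunk spans all 20 columns.
X_small = da.from_array(A, chunks=(500, 20))

# Lazy factors, then a single compute call for all three.
u, s, v = da.linalg.svd(X_small)
u_np, s_np, v_np = dask.compute(u, s, v)

# Reconstruct the input: scale the columns of u by s, then multiply by v.
reconstruction = (u_np * s_np) @ v_np
print("max reconstruction error:", np.abs(reconstruction - A).max())

# The columns of u should be orthonormal.
print("u has orthonormal columns:", np.allclose(u_np.T @ u_np, np.eye(20)))

# Singular values should be sorted from largest to smallest.
print("s is descending:", bool(np.all(np.diff(s_np) <= 0)))
```

Because the entries are uniform on [0, 1), the dominant right singular vector is close to a constant vector, which is why the first row of v above is nearly constant at about 1/sqrt(100) = 0.1.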
When the array is also split into many chunks along its columns, we instead use an approximate randomized algorithm, da.linalg.svd_compressed, which computes only the k largest singular values and their singular vectors.
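import dask.array as da
# Square array chunked along both rows and columns (5 x 5 = 25 chunks)
X = da.random.random((10000, 10000), chunks=(2000, 2000)).persist()
import dask
# Approximate the 5 largest singular values and their vectors
u, s, v = da.linalg.svd_compressed(X, k=5)
# Render the task graph of the three results
dask.visualize(u, s, v)
v.compute()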
array([[ 0.00997573,  0.01011564,  0.00997098, ...,  0.00998413,  0.00994515,  0.0099736 ],
       [-0.00048293,  0.00247921, -0.00527027, ..., -0.00606932, -0.01272348,  0.00818248],
       [ 0.00145378, -0.00278148,  0.01078658, ..., -0.00428512, -0.00213905, -0.00738961],
       [-0.00984979, -0.00230993, -0.00277437, ...,  0.0056367 , -0.00199535, -0.01409744],
       [-0.00210629, -0.00320545, -0.00190336, ...,  0.01871436, -0.01494592, -0.00274385]])
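To get a feel for how close the randomized result is, one option is to compare it with an exact decomposition on a matrix small enough for NumPy. The sketch below is an illustration under stated assumptions rather than part of the original notebook: the test matrix A is constructed to have rank exactly 5, and the sizes, chunking, and variable names are hypothetical. With k equal to the true rank, the compressed SVD should recover the nonzero singular values almost exactly.

```python
import numpy as np
import dask
import dask.array as da

# Build a matrix of rank exactly 5 (an assumption made for this check).
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 800))

# Chunk along both rows and columns, as in the example above.
X_lr = da.from_array(A, chunks=(250, 200))

# Randomized, approximate SVD keeping k = 5 components.
u, s, v = da.linalg.svd_compressed(X_lr, k=5)
u_np, s_np, v_np = dask.compute(u, s, v)

# Exact singular values from NumPy for comparison.
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
print("largest relative difference in s:",
      np.max(np.abs(s_np - s_exact) / s_exact))

# Relative error of the rank-5 reconstruction in the Frobenius norm.
rel_err = np.linalg.norm((u_np * s_np) @ v_np - A) / np.linalg.norm(A)
print("relative reconstruction error:", rel_err)
```

On data without such a clear low-rank structure, like the uniform random matrix above, the trailing singular values are typically approximated less accurately than the leading one. The n_power_iter argument of svd_compressed trades extra passes over the data for better accuracy.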