In [ ]:

%install-location $cwd/swift-install
%install '.package(path: "$cwd/FastaiNotebook_00_load_data")' FastaiNotebook_00_load_data

Installing packages:
	.package(path: "/home/jupyter/notebooks/swift/FastaiNotebook_00_load_data")
		FastaiNotebook_00_load_data
With SwiftPM flags: []
Working in: /tmp/tmp602xg52k/swift-install
[1/2] Merging module Path
[2/3] Merging module NotebookExport
[3/4] Merging module Just
[4/5] Merging module FastaiNotebook_00_load_data
[5/6] Compiling jupyterInstalledPackages jupyterInstalledPackages.swift
[6/7] Merging module jupyterInstalledPackages
Initializing Swift...
Installation complete!

In [ ]:

//export
import Path
import TensorFlow

In [ ]:

import FastaiNotebook_00_load_data

Get some Tensors to play with¶

We can initialize a tensor in lots of different ways because in Swift, two functions with the same name can coexist as long as they don't have the same signature. Different argument labels give different signatures, so all of these are different init functions of Tensor:

In [ ]:

let zeros = Tensor<Float>(zeros: [1,4,5])
let ones  = Tensor<Float>(ones: [12,4,5])
let twos  = Tensor<Float>(repeating: 2.0, shape: [2,3,4,5])
let range = Tensor<Int32>(rangeFrom: 0, to: 32, stride: 1)
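
To see that label-based overloading in isolation, here is a minimal sketch in plain Swift. The `Grid` type and its initializers are made up for illustration and are not part of the TensorFlow module:

```swift
// A toy type (hypothetical, not part of TensorFlow) whose initializers
// differ only in their argument labels.
struct Grid {
    var values: [Float]

    // Full name: init(zeros:)
    init(zeros count: Int) {
        values = Array(repeating: 0, count: count)
    }

    // Full name: init(ones:)
    init(ones count: Int) {
        values = Array(repeating: 1, count: count)
    }

    // Full name: init(repeating:count:)
    init(repeating value: Float, count: Int) {
        values = Array(repeating: value, count: count)
    }
}

// The label at the call site picks the initializer.
let gridZeros = Grid(zeros: 3)
let gridOnes = Grid(ones: 3)
let gridTwos = Grid(repeating: 2, count: 3)
print(gridZeros.values, gridOnes.values, gridTwos.values)
```

The first two initializers take exactly the same parameter type, so only the labels `zeros:` and `ones:` tell them apart.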

Those are just some examples and there are many more! Here we grab random numbers:

In [ ]:

let xTrain = Tensor<Float>(randomNormal: [5, 784])
var weights = Tensor<Float>(randomNormal: [784, 10]) / sqrt(784)
print(weights[0])

[  0.015628124,   0.026147088, -0.0083014425,    0.06369277,   -0.06701048,  -0.026953917,
   0.014180449,  0.0067227157,   0.005665138,   0.010936922]

Building Matmul¶

Ok, now that we know how floating point types and arrays work, we can finally build our own matmul from scratch, using a few loops. We will take the two input matrices as single dimensional arrays so we can show manual indexing into them, the hard way:

In [ ]:

// a and b are the flattened array elements, aDims/bDims are the #rows/columns of the arrays.
func swiftMatmul(a: [Float], b: [Float], aDims: (Int,Int), bDims: (Int,Int)) -> [Float] {
    assert(aDims.1 == bDims.0, "matmul shape mismatch")
    
    var res = Array(repeating: Float(0.0), count: aDims.0 * bDims.1)
    for i in 0 ..< aDims.0 {
        for j in 0 ..< bDims.1 {
            for k in 0 ..< aDims.1 {
                res[i*bDims.1+j] += a[i*aDims.1+k] * b[k*bDims.1+j]
            }
        }
    }
    return res
}
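
The manual indexing relies on row-major layout: element (row, col) of a matrix with `cols` columns sits at flat index `row * cols + col`. As a small worked example (the `flatIndex` helper and the `small...` names are ours, introduced only for illustration), here is a 2x3 by 3x2 product you can check by hand:

```swift
// Row-major layout: element (row, col) of a matrix with `cols` columns
// lives at flat index row * cols + col.
func flatIndex(row: Int, col: Int, cols: Int) -> Int {
    return row * cols + col
}

let smallA: [Float] = [1, 2, 3,
                       4, 5, 6]       // 2x3
let smallB: [Float] = [ 7,  8,
                        9, 10,
                       11, 12]        // 3x2
let (smallADims, smallBDims) = ((2, 3), (3, 2))

var smallRes = [Float](repeating: 0, count: smallADims.0 * smallBDims.1)
for i in 0 ..< smallADims.0 {
    for j in 0 ..< smallBDims.1 {
        for k in 0 ..< smallADims.1 {
            smallRes[flatIndex(row: i, col: j, cols: smallBDims.1)] +=
                smallA[flatIndex(row: i, col: k, cols: smallADims.1)] *
                smallB[flatIndex(row: k, col: j, cols: smallBDims.1)]
        }
    }
}
// Top-left entry: 1*7 + 2*9 + 3*11 = 58
print(smallRes)
```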

To try this out, we extract the scalars out of our tensors (random stand-ins for the MNIST data here) as arrays.

In [ ]:

let flatA = xTrain[0..<5].scalars
let flatB = weights.scalars
let (aDims,bDims) = ((5, 784), (784, 10))

Now that we've got everything together, we can try it out!

In [ ]:

var resultArray = swiftMatmul(a: flatA, b: flatB, aDims: aDims, bDims: bDims)

In [ ]:

time(repeating: 100) {
    _ = swiftMatmul(a: flatA, b: flatB, aDims: aDims, bDims: bDims)
}

average: 0.12000220999999996 ms,   min: 0.110746 ms,   max: 0.149986 ms

Awesome, that is pretty fast - compare that to 835 ms with Python!

You might be wondering what that time(repeating:) builtin is. As you might guess, this is actually a Swift function - one that is using "trailing closure" syntax to specify the body of the timing block. Trailing closures are passed as arguments to the function, and in this case, the function was defined in our ✅00_load_data workbook. Let's take a look!
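
The definition itself is not reproduced in this notebook, so here is a rough sketch of what such a helper could look like. The name `timeSketch`, its return value, and the use of `DispatchTime` are assumptions; the real `time(repeating:)` in 00_load_data may differ in its details:

```swift
import Foundation

// Hypothetical sketch of a timing helper. The closure parameter comes last,
// which is what allows the trailing closure syntax at the call site.
func timeSketch(repeating count: Int = 1, _ body: () -> Void) -> [Double] {
    precondition(count > 0, "need at least one repetition")
    var samplesMs: [Double] = []
    for _ in 0 ..< count {
        let start = DispatchTime.now().uptimeNanoseconds
        body()
        let end = DispatchTime.now().uptimeNanoseconds
        samplesMs.append(Double(end - start) / 1e6)  // nanoseconds to ms
    }
    let average = samplesMs.reduce(0, +) / Double(count)
    print("average: \(average) ms,   min: \(samplesMs.min()!) ms,   max: \(samplesMs.max()!) ms")
    return samplesMs
}

// The braces after the call are the `body` argument (a trailing closure).
var calls = 0
let samples = timeSketch(repeating: 3) {
    calls += 1
}
```

Writing `timeSketch(repeating: 3, { calls += 1 })` would mean exactly the same thing; the trailing form just reads more like a built-in control structure.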

Getting the performance of C 💯¶

This performance is pretty great, but we can do better. Swift is a memory safe language (like Python), which means it has to do array bounds checks and some other stuff. Fortunately, Swift is a pragmatic language that allows you to drop through this to get peak performance - check out Jeremy's article High Performance Numeric Programming with Swift: Explorations and Reflections for a deep dive.

One thing you can do is use UnsafePointer (which is basically a raw C pointer) instead of using a bounds checked array. This isn't memory safe, but gives us about a 2x speedup in this case!

In [ ]:

// a and b are the flattened array elements, aDims/bDims are the #rows/columns of the arrays.
func swiftMatmulUnsafe(a: UnsafePointer<Float>, b: UnsafePointer<Float>, aDims: (Int,Int), bDims: (Int,Int)) -> [Float] {
    assert(aDims.1 == bDims.0, "matmul shape mismatch")
    
    var res = Array(repeating: Float(0.0), count: aDims.0 * bDims.1)
    res.withUnsafeMutableBufferPointer { res in 
        for i in 0 ..< aDims.0 {
            for j in 0 ..< bDims.1 {
                for k in 0 ..< aDims.1 {
                    res[i*bDims.1+j] += a[i*aDims.1+k] * b[k*bDims.1+j]
                }
            }
        }
    }
    return res
}

In [ ]:

time(repeating: 100) {
    _ = swiftMatmulUnsafe(a: flatA, b: flatB, aDims: aDims, bDims: bDims)
}

average: 0.059201270000000035 ms,   min: 0.052995 ms,   max: 0.084454 ms

One of the other cool things about this is that we can provide a nice idiomatic API to the caller of this, and keep all the unsafe shenanigans inside the implementation of this function.
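
As a sketch of that idea (the `safeMatmul` name and its shape checks are ours, not from the notebook), a wrapper can validate its inputs up front, take ordinary arrays, and confine all pointer access to its own body:

```swift
// Hypothetical wrapper: callers pass ordinary arrays and shapes; the
// pointer work never leaves this function.
func safeMatmul(_ a: [Float], _ b: [Float],
                aDims: (Int, Int), bDims: (Int, Int)) -> [Float] {
    // Validate once, so the unchecked indexing below stays in range.
    precondition(a.count == aDims.0 * aDims.1, "a does not match aDims")
    precondition(b.count == bDims.0 * bDims.1, "b does not match bDims")
    precondition(aDims.1 == bDims.0, "matmul shape mismatch")

    var res = [Float](repeating: 0, count: aDims.0 * bDims.1)
    a.withUnsafeBufferPointer { aBuf in
        b.withUnsafeBufferPointer { bBuf in
            res.withUnsafeMutableBufferPointer { out in
                for i in 0 ..< aDims.0 {
                    for j in 0 ..< bDims.1 {
                        var sum: Float = 0
                        for k in 0 ..< aDims.1 {
                            sum += aBuf[i * aDims.1 + k] * bBuf[k * bDims.1 + j]
                        }
                        out[i * bDims.1 + j] = sum
                    }
                }
            }
        }
    }
    return res
}

let product = safeMatmul([1, 2, 3, 4], [5, 6, 7, 8], aDims: (2, 2), bDims: (2, 2))
print(product)
```

The pointers are only valid inside the `withUnsafe...` closures, so nothing unsafe can escape to the caller.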

If you really want to fall down the rabbit hole, you can look at the implementation of UnsafePointer, which is of course written in Swift, wrapping LLVM pointer operations. This means you can literally get the performance of C code directly in Swift, while providing easy to use high level APIs!

Swift 💖 C APIs too: you get the full utility of the C ecosystem¶

Swift even lets you transparently work with C APIs, just like it does with Python. This can be used for both good and evil. For example, here we directly call the malloc function, dereference the uninitialized pointer, and print it out:

In [ ]:

import Glibc

let ptr : UnsafeMutableRawPointer = malloc(42)

print("☠️☠️ Uninitialized garbage =", ptr.load(as: UInt8.self))

free(ptr)

☠️☠️ Uninitialized garbage = 160

An UnsafeMutableRawPointer (implementation) isn't something you should use lightly, but when you work with C APIs, you'll see various types like that in the function signatures.

Calling malloc and free directly isn't recommended in Swift, but it is useful and important when you're working with C APIs that expect to get malloc'd memory, which comes up when you're writing a safe Swift wrapper for some existing code.
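
When you do need malloc'd memory, one way to keep it contained is to initialize it before reading and pair the allocation with a `defer`. This is a minimal sketch; `mallocRoundTrip` is a made-up name, and `Foundation` is imported so that `malloc` and `free` are available on both Linux and macOS:

```swift
import Foundation

// Hypothetical helper: allocate with malloc, initialize the memory before
// reading it, copy the values out, and free on every exit path.
func mallocRoundTrip(count: Int) -> [Int32] {
    guard let raw = malloc(count * MemoryLayout<Int32>.stride) else {
        fatalError("malloc failed")
    }
    defer { free(raw) }  // runs after the return value has been computed

    // Give the raw bytes a type, then initialize them.
    let typed = raw.bindMemory(to: Int32.self, capacity: count)
    typed.initialize(repeating: 7, count: count)

    // Array(...) copies the elements, so the result outlives the buffer.
    return Array(UnsafeBufferPointer(start: typed, count: count))
}

let roundTripped = mallocRoundTrip(count: 4)
print(roundTripped)
```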

Speaking of existing code, let's take a look at that Python interop we touched on before:

In [ ]:

import Python
let np = Python.import("numpy")
let pickle = Python.import("pickle")
let sys = Python.import("sys")

print("🐍list =    ", pickle.dumps([1, 2, 3]))
print("🐍ndarray = ", pickle.dumps(np.array([[1, 2], [3, 4]])))

Fatal error: 'try!' expression unexpectedly raised an error: Python exception: No module named 'numpy': file /swift-base/swift/stdlib/public/Python/Python.swift, line 683
Current stack trace:
0    libswiftCore.so                    0x00007ffb425a58b0 swift_reportError + 50
1    libswiftCore.so                    0x00007ffb42614aa0 _swift_stdlib_reportFatalErrorInFile + 115
2    libswiftCore.so                    0x00007ffb4253cace <unavailable> + 3738318
3    libswiftCore.so                    0x00007ffb4253cc47 <unavailable> + 3738695
4    libswiftCore.so                    0x00007ffb4230ac4d <unavailable> + 1436749
5    libswiftCore.so                    0x00007ffb42511a78 <unavailable> + 3562104
6    libswiftCore.so                    0x00007ffb42334795 <unavailable> + 1607573
7    libswiftPython.so                  0x00007ffb42f9862b <unavailable> + 67115

Current stack trace:
	frame #3: 0x00007ffae8008060 $__lldb_expr96`main at <Cell 13>:2:17

Of course this is all written in Swift as well. You can probably guess how this works now: PythonObject is a Swift struct that wraps a pointer to the Python interpreter's notion of a Python object.

@dynamicCallable
@dynamicMemberLookup
public struct PythonObject {
  var reference: PyReference
  ...
}

The @dynamicMemberLookup attribute allows it to dynamically handle all member lookups (like x.y) by calling into the PyObject_GetAttrString runtime call. Similarly, the @dynamicCallable attribute allows the type to intercept all calls to a PythonObject (like x()), which it implements using the PyObject_Call runtime call.
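
Both attributes are ordinary Swift features that you can try without Python. In this toy sketch (the `DynamicRecord` type is made up for illustration), member lookups are answered from a dictionary and calls simply sum their arguments, where PythonObject would instead forward to the Python runtime:

```swift
// A toy type (hypothetical) using the same two attributes as PythonObject.
@dynamicMemberLookup
@dynamicCallable
struct DynamicRecord {
    var fields: [String: Int]

    // `record.x` is rewritten by the compiler to `record[dynamicMember: "x"]`.
    subscript(dynamicMember name: String) -> Int {
        return fields[name] ?? 0
    }

    // `record(1, 2, 3)` is rewritten to
    // `record.dynamicallyCall(withArguments: [1, 2, 3])`.
    func dynamicallyCall(withArguments args: [Int]) -> Int {
        return args.reduce(0, +)
    }
}

let record = DynamicRecord(fields: ["x": 3])
let looked = record.x          // 3, found in the dictionary
let missing = record.y         // 0, the fallback for unknown names
let called = record(1, 2, 3)   // 6
print(looked, missing, called)
```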

Because Swift has such simple and transparent access to C, it allows building very nice first-class Swift APIs that talk directly to the lower level implementation, and these implementations can have very little overhead.

Working with Tensor¶

Let's get back into matmul and explore more of the Tensor type as provided by the TensorFlow module. You can see all the things Tensor can do in the official documentation.

Here are some highlights. We saw how you can get zeros or random data:

In [ ]:

var bias = Tensor<Float>(zeros: [10])

let m1 = Tensor<Float>(randomNormal: [5, 784])
let m2 = Tensor<Float>(randomNormal: [784, 10])

Tensors carry data and a shape.

In [ ]:

print("m1: ", m1.shape)
print("m2: ", m2.shape)

m1:  [5, 784]
m2:  [784, 10]

The Tensor type provides all the normal stuff you'd expect as methods, including arithmetic, convolutions, etc., with full support for broadcasting:

In [ ]:

let small = Tensor<Float>([[1, 2],
                           [3, 4]])

print("🔢2x2:\n", small)

🔢2x2:
 [[1.0, 2.0],
 [3.0, 4.0]]

MatMul Operator: In addition to using the global matmul(a, b) function, you can also use the a • b operator to matmul two things together. This is just like the @ operator in Python. You can type it with option-8 on a Mac or compose-.-= elsewhere. Or if you prefer, just use the matmul() function.

In [ ]:

print("⊞ matmul:\n",  matmul(small, small))
print("\n⊞ again:\n", small • small)

⊞ matmul:
 [[ 7.0, 10.0],
 [15.0, 22.0]]

⊞ again:
 [[ 7.0, 10.0],
 [15.0, 22.0]]

Reshaping works the way you'd expect:

In [ ]:

var m = Tensor([1.0, 2, 3, 4, 5, 6, 7, 8, 9]).reshaped(to: [3, 3])
print(m)

[[1.0, 2.0, 3.0],
 [4.0, 5.0, 6.0],
 [7.0, 8.0, 9.0]]

You have the basic mathematical functions:

In [ ]:

sqrt((m * m).sum())

Out[ ]:

16.881943016134134

Elementwise ops and comparisons¶

Standard math operators (+,-,*,/) are all element-wise, and there are a bunch of standard math functions like sqrt and pow. Here are some examples:

In [ ]:

var a = Tensor([10.0, 6, -4])
var b = Tensor([2.0, 8, 7])
(a,b)

Out[ ]:

▿ 2 elements
  - .0 : [10.0,  6.0, -4.0]
  - .1 : [2.0, 8.0, 7.0]

In [ ]:

print("add:  ", a + b)
print("mul:  ", a * b)
print("sqrt: ", sqrt(a))
print("pow:  ", pow(a, b))

add:   [12.0, 14.0,  3.0]
mul:   [ 20.0,  48.0, -28.0]
sqrt:  [3.1622776601683795,  2.449489742783178,               -nan]
pow:   [             100.0, 1679616.0000000002,           -16384.0]

Comparison operators (>,<,==,!=,...) in Swift are supposed to return a single Bool value, so they are true if all the elements of the tensors satisfy the comparison.

Elementwise versions have the . prefix, which is read as "pointwise": .>, .<, .==, etc. You can merge a tensor of bools into a single Bool with the any() and all() methods.

In [ ]:

a < b

Out[ ]:

false

In [ ]:

a .< b

Out[ ]:

[false,  true,  true]

In [ ]:

print((a .> 0).all())

false

In [ ]:

print((a .> 0).any())

true
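
For intuition, the same distinction can be reproduced on plain Swift arrays. This is only an analogy built from standard library calls, using the values of a and b from above, not how Tensor implements these operators:

```swift
let xs: [Double] = [10, 6, -4]
let ys: [Double] = [2, 8, 7]

// Like `a .< b`: one Bool per element.
let pointwiseLess = zip(xs, ys).map { $0 < $1 }

// Like `a < b`: true only if every element satisfies the comparison.
let allLess = pointwiseLess.allSatisfy { $0 }

// Like `(a .> 0).all()` and `(a .> 0).any()`.
let allPositive = xs.allSatisfy { $0 > 0 }
let anyPositive = xs.contains { $0 > 0 }

print(pointwiseLess, allLess, allPositive, anyPositive)
```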

Broadcasting¶

Broadcasting with a scalar works just like in Python:

In [ ]:

var a = Tensor([10.0, 6.0, -4.0])

In [ ]:

print(a+1)

[11.0,  7.0, -3.0]

In [ ]:

2 * m

Out[ ]:

[[ 2.0,  4.0,  6.0],
 [ 8.0, 10.0, 12.0],
 [14.0, 16.0, 18.0]]

Broadcasting a vector with a matrix¶

In [ ]:

let c = Tensor([10.0,20.0,30.0])

By default, broadcasting is done by adding dimensions of size 1 at the beginning of the smaller shape until both objects have the same number of dimensions.

In [ ]:

m + c

Out[ ]:

[[11.0, 22.0, 33.0],
 [14.0, 25.0, 36.0],
 [17.0, 28.0, 39.0]]

In [ ]:

c + m

Out[ ]:

[[11.0, 22.0, 33.0],
 [14.0, 25.0, 36.0],
 [17.0, 28.0, 39.0]]

To broadcast on the other dimensions, one has to use expandingShape to add the dimension.

In [ ]:

m + c.expandingShape(at: 1)

Out[ ]:

[[11.0, 12.0, 13.0],
 [24.0, 25.0, 26.0],
 [37.0, 38.0, 39.0]]

In [ ]:

c.expandingShape(at: 1)

Out[ ]:

[[10.0],
 [20.0],
 [30.0]]

Broadcasting rules¶

In [ ]:

print(c.expandingShape(at: 0).shape)

[1, 3]

In [ ]:

print(c.expandingShape(at: 1).shape)

[3, 1]

In [ ]:

c.expandingShape(at: 0) * c.expandingShape(at: 1)

Out[ ]:

[[100.0, 200.0, 300.0],
 [200.0, 400.0, 600.0],
 [300.0, 600.0, 900.0]]

In [ ]:

c.expandingShape(at: 0) .> c.expandingShape(at: 1)

Out[ ]:

[[false,  true,  true],
 [false, false,  true],
 [false, false, false]]
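
The shapes above can be predicted with a small function. This sketch assumes the usual NumPy-style rule (pad the shorter shape with leading 1s, then require each pair of dimensions to be equal or contain a 1), which is consistent with the outputs shown here; `broadcastShape` is our own helper, not a TensorFlow API:

```swift
// Returns the broadcast result shape, or nil if the shapes are incompatible.
// Assumes NumPy-style broadcasting rules.
func broadcastShape(_ lhs: [Int], _ rhs: [Int]) -> [Int]? {
    let rank = max(lhs.count, rhs.count)
    // Step 1: pad the shorter shape with 1s at the beginning.
    let l = Array(repeating: 1, count: rank - lhs.count) + lhs
    let r = Array(repeating: 1, count: rank - rhs.count) + rhs

    // Step 2: each pair of dimensions must match, or one of them must be 1.
    var result: [Int] = []
    for (x, y) in zip(l, r) {
        if x == y || y == 1 {
            result.append(x)
        } else if x == 1 {
            result.append(y)
        } else {
            return nil
        }
    }
    return result
}

let matrixPlusVector = broadcastShape([3, 3], [3])    // m + c
let rowTimesColumn = broadcastShape([1, 3], [3, 1])   // outer product
let incompatible = broadcastShape([2, 3], [4])
```

Here `[3]` is first padded to `[1, 3]`, which is why `m + c` adds c to every row, while `[1, 3]` against `[3, 1]` stretches both operands to `[3, 3]`.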

Matmul using `Tensor`¶

Coming back to our matmul algorithm, we can implement exactly what we had before by using subscripting into a tensor, instead of subscripting into an array. Let's see how that works:

In [ ]:

func tensorMatmul(_ a: Tensor<Float>, _ b: Tensor<Float>) -> Tensor<Float> {
    var res = Tensor<Float>(zeros: [a.shape[0], b.shape[1]])

    for i in 0 ..< a.shape[0] {
        for j in 0 ..< b.shape[1] {
            for k in 0 ..< a.shape[1] {
                res[i, j] += a[i, k] * b[k, j]
            }
        }
    }
    return res
}

_ = tensorMatmul(m1, m2)

In [ ]:

time { 
    let tmp = tensorMatmul(m1, m2)
    
    // Copy a scalar back to the host to force a GPU sync.
    _ = tmp[0, 0].scalar
}

average: 5100.505422 ms,   min: 5100.505422 ms,   max: 5100.505422 ms

What, what just happened?? We used to take about a tenth of a millisecond, now we're taking multiple seconds. It turns out that Tensors are very good at bulk data processing, but they are not good at doing one float at a time. Make sure to use the coarse-grained operations. We can make this faster by vectorizing each loop in turn.

Slides: Granularity of Tensor Operations.

Vectorize the inner loop into a multiply + sum¶

In [ ]:

func elementWiseMatmul(_ a:Tensor<Float>, _ b:Tensor<Float>) -> Tensor<Float>{
    let (ar, ac) = (a.shape[0], a.shape[1])
    let (br, bc) = (b.shape[0], b.shape[1])
    // The result has one row per row of a and one column per column of b.
    var res = Tensor<Float>(zeros: [ar, bc])
    
    for i in 0 ..< ar {
        let row = a[i]
        for j in 0 ..< bc {
            res[i, j] = (row * b.slice(lowerBounds: [0,j], upperBounds: [ac,j+1]).squeezingShape(at: 1)).sum()
        }
    }
    return res
}

_ = elementWiseMatmul(m1, m2)

In [ ]:

time { 
    let tmp = elementWiseMatmul(m1, m2)

    // Copy a scalar back to the host to force a GPU sync.
    _ = tmp[0, 0].scalar
}

average: 361.954593 ms,   min: 361.954593 ms,   max: 361.954593 ms

Vectorize the inner two loops with broadcasting¶

In [ ]:

func broadcastMatmult(_ a:Tensor<Float>, _ b:Tensor<Float>) -> Tensor<Float>{
    var res = Tensor<Float>(zeros: [a.shape[0], b.shape[1]])
    for i in 0..<a.shape[0] {
        res[i] = (a[i].expandingShape(at: 1) * b).sum(squeezingAxes: 0)
    }
    return res
}

_ = broadcastMatmult(m1, m2)

In [ ]:

time(repeating: 100) {
    let tmp = broadcastMatmult(m1, m2)

    // Copy a scalar back to the host to force a GPU sync.
    _ = tmp[0, 0].scalar
}

average: 0.70941021 ms,   min: 0.61099 ms,   max: 1.239529 ms

Vectorize the whole thing with one TensorFlow op¶

In [ ]:

time(repeating: 100) { _ = m1 • m2 }

average: 0.0204796 ms,   min: 0.018915 ms,   max: 0.040889 ms

Ok, now that we have matmul, we can continue to build out our framework.

To complete today's lesson, let's jump way way up the stack to see Workbook 11.

TensorFlow vectorizes, parallelizes, and scales¶

The reason that TensorFlow works in practice is that it can scale way up to large matrices. For example, let's try something a bit larger:

In [ ]:

func timeMatmulTensor(size: Int) {
    var matrix = Tensor<Float>(randomNormal: [size, size])
    print("\n\(size)x\(size):\n  ⏰", terminator: "")
    time(repeating: 10) { 
        let matrix = matrix • matrix 
        _ = matrix[0, 0].scalar
    }
}

timeMatmulTensor(size: 1)     // Tiny
timeMatmulTensor(size: 10)    // Bigger
timeMatmulTensor(size: 100)   // Even Bigger
timeMatmulTensor(size: 1000)  // Biggerest
timeMatmulTensor(size: 5000)  // Even Biggerest

1x1:
  ⏰average: 0.18814999999999998 ms,   min: 0.143182 ms,   max: 0.31991 ms

10x10:
  ⏰average: 0.1964562 ms,   min: 0.168568 ms,   max: 0.306457 ms

100x100:
  ⏰average: 0.18293589999999998 ms,   min: 0.155465 ms,   max: 0.289976 ms

1000x1000:
  ⏰average: 0.36083699999999996 ms,   min: 0.336118 ms,   max: 0.413469 ms

5000x5000:
  ⏰average: 20.0013405 ms,   min: 19.084045 ms,   max: 20.446258 ms

In contrast, our simple CPU implementation takes a lot longer to do the same work. For example:

In [ ]:

func timeMatmulSwift(size: Int, repetitions: Int = 10) {
    var matrix = Tensor<Float>(randomNormal: [size, size])
    let matrixFlatArray = matrix.scalars

    print("\n\(size)x\(size):\n  ⏰", terminator: "")
    time(repeating: repetitions) { 
       _ = swiftMatmulUnsafe(a: matrixFlatArray, b: matrixFlatArray, aDims: (size,size), bDims: (size,size))
    }
}

timeMatmulSwift(size: 1)     // Tiny
timeMatmulSwift(size: 10)    // Bigger
timeMatmulSwift(size: 100)   // Even Bigger
timeMatmulSwift(size: 1000, repetitions: 1)  // Biggerest

print("\n5000x5000: skipped, it takes tooo long!")

1x1:
  ⏰average: 0.0002532 ms,   min: 0.00022 ms,   max: 0.000383 ms

10x10:
  ⏰average: 0.0032415 ms,   min: 0.003208 ms,   max: 0.003293 ms

100x100:
  ⏰average: 1.4835639 ms,   min: 1.38322 ms,   max: 1.565449 ms

1000x1000:
  ⏰average: 1698.665057 ms,   min: 1698.665057 ms,   max: 1698.665057 ms

5000x5000: skipped, it takes tooo long!

Why is TensorFlow so so so much faster than our CPU implementation? Well, there are two reasons. The first is that it uses GPU hardware, which is much faster for math like this. The second is that there are a ton of tricks (involving memory hierarchies, cache blocking, and more) that make matrix multiplications go fast on CPUs and other hardware, and our simple loops use none of them.

For example, try using TensorFlow on the CPU to do the same computation as above:

In [ ]:

withDevice(.cpu) {
    timeMatmulTensor(size: 1)     // Tiny
    timeMatmulTensor(size: 10)    // Bigger
    timeMatmulTensor(size: 100)   // Even Bigger
    timeMatmulTensor(size: 1000)  // Biggerest
    timeMatmulTensor(size: 5000)  // Even Biggerest
}

1x1:
  ⏰average: 0.030079900000000003 ms,   min: 0.02664 ms,   max: 0.040804 ms

10x10:
  ⏰average: 0.0340013 ms,   min: 0.031712 ms,   max: 0.044604 ms

100x100:
  ⏰average: 0.115587 ms,   min: 0.092782 ms,   max: 0.133166 ms

1000x1000:
  ⏰average: 6.0490596 ms,   min: 5.553612 ms,   max: 6.476949 ms

5000x5000:
  ⏰average: 852.6646403999999 ms,   min: 651.327758 ms,   max: 927.424344 ms

This is a pretty big difference. On my hardware, Swift takes 2287ms to do a 1000x1000 multiply on the CPU, TensorFlow takes 6.7ms to do the same work on the CPU, and TensorFlow takes 0.49ms to do it on a GPU.
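
To give a flavor of the memory-hierarchy tricks mentioned earlier, here is one of the simplest: reordering the loops so that the innermost loop walks through b and the result contiguously instead of striding down a column of b. This is only an illustrative sketch (`matmulIKJ` is our name, and we make no claim about how much it helps on any given machine); optimized libraries go much further with blocking and vector instructions:

```swift
// Same arithmetic as the i-j-k version, but in i-k-j order: the inner loop
// now reads b[k, 0...] and updates res[i, 0...] at consecutive addresses.
func matmulIKJ(_ a: [Float], _ b: [Float],
               aDims: (Int, Int), bDims: (Int, Int)) -> [Float] {
    precondition(aDims.1 == bDims.0, "matmul shape mismatch")

    var res = [Float](repeating: 0, count: aDims.0 * bDims.1)
    for i in 0 ..< aDims.0 {
        for k in 0 ..< aDims.1 {
            let aik = a[i * aDims.1 + k]   // reused across the whole inner loop
            for j in 0 ..< bDims.1 {
                res[i * bDims.1 + j] += aik * b[k * bDims.1 + j]
            }
        }
    }
    return res
}

let reordered = matmulIKJ([1, 2, 3, 4], [5, 6, 7, 8], aDims: (2, 2), bDims: (2, 2))
print(reordered)
```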

Hardware Accelerators vs Flexibility¶

One of the big challenges with machine learning frameworks today is that they provide a fixed set of "ops" that you can use with high performance. There is a lot of work underway to fix this. The XLA compiler in TensorFlow is an important piece of this, which allows more flexibility in the programming model while still providing high performance by using compilers to target the hardware accelerator. If you're interested in the details, there is a great video by the creator of Halide explaining why this is challenging.

TensorFlow internals are undergoing significant changes (slide) including the introduction of the XLA compiler, and the introduction of MLIR compiler technology.

Tensor internals and Raw TensorFlow operations¶

TensorFlow provides hundreds of different operators, and they sort of grew organically over time. This means that there are some deprecated operators, they aren't particularly consistent, and there are other oddities. As such, the Tensor type provides a curated set of these operators as methods.

Whereas Int and Float are syntactic sugar for LLVM, and PythonObject is syntactic sugar for the Python interpreter, Tensor ends up being syntactic sugar for the TensorFlow operator set. You can dive in and see its implementation in Swift in the S4TF TensorFlow module, e.g.:

public struct Tensor<Scalar : TensorFlowScalar> : TensorProtocol {
  /// The underlying `TensorHandle`.
  /// - Note: `handle` is public to allow user defined ops, but should not
  /// normally be used otherwise.
  public let handle: TensorHandle<Scalar>
  ... 
}

Here we see the internal implementation details of Tensor, which stores a TensorHandle - the internal implementation detail of the TensorFlow Eager runtime.

Methods are defined on Tensor just like you'd expect, here is the basic addition operator, defined over all numeric tensors (i.e., not tensors of Bool):

extension Tensor : AdditiveArithmetic where Scalar : Numeric {
  /// Adds two tensors and produces their sum.
  /// - Note: `+` supports broadcasting.
  public static func + (lhs: Tensor, rhs: Tensor) -> Tensor {
    return Raw.add(lhs, rhs)
  }
}

But wait, what is this Raw thing?

Raw TensorFlow ops¶

TensorFlow has a database of the operators it defines, which gets encoded into a protocol buffer. From this protobuf, all of the operators automatically get a Raw operator (implemented in terms of a lower level #tfop primitive).

In [ ]:

// Explore the contents of the Raw namespace by typing Raw.<tab>
print(Raw.zerosLike(c))

// Raw.

[0.0, 0.0, 0.0]
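
The layering described here (a namespace of low-level ops, plus a curated type whose methods forward to it) can be mimicked in a few lines of plain Swift. Everything in this sketch is a stand-in: `ToyRaw` plays the role of the generated Raw namespace and `ToyTensor` the role of Tensor, with an array in place of the real TensorHandle:

```swift
// Hypothetical stand-in for the generated `Raw` namespace: a caseless enum
// used purely to group static functions.
enum ToyRaw {
    static func add(_ x: [Float], _ y: [Float]) -> [Float] {
        precondition(x.count == y.count, "shape mismatch")
        return zip(x, y).map { $0 + $1 }
    }

    static func zerosLike(_ x: [Float]) -> [Float] {
        return [Float](repeating: 0, count: x.count)
    }
}

// Hypothetical stand-in for `Tensor`: it stores a handle and exposes a
// curated, friendlier surface on top of the raw ops.
struct ToyTensor: Equatable {
    let handle: [Float]

    static func + (lhs: ToyTensor, rhs: ToyTensor) -> ToyTensor {
        return ToyTensor(handle: ToyRaw.add(lhs.handle, rhs.handle))
    }
}

let toySum = ToyTensor(handle: [1, 2, 3]) + ToyTensor(handle: [10, 20, 30])
let toyZeros = ToyRaw.zerosLike(toySum.handle)
print(toySum.handle, toyZeros)
```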

There is an entire tutorial on Raw operators on github/TensorFlow/swift. The key thing to know is that TensorFlow can do almost anything, so if there is no obvious method on Tensor to do what you need it is worth checking out the tutorial to see how to do this.

As one example, later parts of the tutorial need the ability to load files and decode JPEGs. Swift for TensorFlow doesn't have these as methods on StringTensor yet, but we can add them like this:

In [ ]:

//export
public extension StringTensor {
    // Read a file into a Tensor.
    init(readFile filename: String) {
        self.init(readFile: StringTensor(filename))
    }
    init(readFile filename: StringTensor) {
        self = Raw.readFile(filename: filename)
    }

    // Decode a StringTensor holding a JPEG file into a Tensor<UInt8>.
    func decodeJpeg(channels: Int = 0) -> Tensor<UInt8> {
        return Raw.decodeJpeg(contents: self, channels: Int64(channels), dctMethod: "") 
    }
}

Export¶

In [ ]:

import NotebookExport
let exporter = NotebookExport(Path.cwd/"01_matmul.ipynb")
print(exporter.export(usingPrefix: "FastaiNotebook_"))

success
