Quick Start¶
This guide walks through the key functionality of Hidet for tensor computation.
Optimize PyTorch model with Hidet¶
Note
torch.compile(...) requires PyTorch 2.0+.
The easiest way to use Hidet is to call torch.compile() with hidet as the backend:
model_opt = torch.compile(model, backend='hidet')
Next, we use the resnet18 model as an example to show how to optimize a PyTorch model with Hidet.
Tip
Because TF32 is enabled by default in torch's cuDNN backend, torch's results are slightly less precise. You can disable TF32 if you need full float32 precision (see also PyTorch TF32), as sketched below.
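A minimal sketch of disabling TF32 with the standard PyTorch flags (this is plain torch configuration, not part of hidet):
import torch
# make float32 matmuls and cuDNN convolutions use full float32 precision
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False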
import hidet
import torch
# take resnet18 as an example
x = torch.randn(1, 3, 224, 224).cuda()
model = torch.hub.load('pytorch/vision:v0.9.0', 'resnet18', pretrained=True, verbose=False)
model = model.cuda().eval()
# uncomment the following line to enable kernel tuning
# hidet.torch.dynamo_config.search_space(2)
# optimize the model with 'hidet' backend
model_opt = torch.compile(model, backend='hidet')
# run the optimized model
y1 = model_opt(x)
y2 = model(x)
# check the correctness
torch.testing.assert_close(actual=y1, expected=y2, rtol=1e-2, atol=1e-2)
# benchmark the performance
for name, model in [('eager', model), ('hidet', model_opt)]:
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start_event.record()
    for _ in range(100):
        y = model(x)
    end_event.record()
    torch.cuda.synchronize()
    print('{:>10}: {:.3f} ms'.format(name, start_event.elapsed_time(end_event) / 100.0))
eager: 1.250 ms
hidet: 2.137 ms
One operator can have multiple equivalent implementations (i.e., kernel programs) with different performance. We usually need to try different implementations for each concrete input shape to find the best one; this process is called kernel tuning. To enable kernel tuning, we can use the following config in hidet:
# 0 - no tuning, default kernel will be used
# 1 - tuning in a small search space
# 2 - tuning in a large search space, takes longer but achieves better performance
hidet.torch.dynamo_config.search_space(2)
When kernel tuning is enabled, hidet can achieve the following performance on an NVIDIA RTX 4090:
eager: 1.176 ms
hidet: 0.286 ms
Hidet provides several configurations to control the optimizations of the hidet backend, such as the following (a sketch of setting them is shown after this list):
Search Space: choose the search space for operator kernel tuning. A larger search space usually achieves better performance but takes longer to optimize.
Correctness Checking: print a correctness-checking report so you can see the numerical difference between the hidet-generated operator and the original PyTorch operator.
Other Configurations: you can also configure other optimizations of the hidet backend, such as automatically using a lower-precision data type (e.g., float16), or controlling how the reduction dimension of matrix multiplication and convolution operators is parallelized.
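A minimal sketch of setting these configurations via hidet.torch.dynamo_config; apart from search_space (used earlier in this guide), the commented-out method names below are assumptions for illustration and should be checked against the Optimize PyTorch Model tutorial:
import torch
import hidet
# tune each kernel in the large search space before compiling the model
hidet.torch.dynamo_config.search_space(2)
# the following calls are assumptions only; see the tutorial for the exact API in your version
# hidet.torch.dynamo_config.correctness_report()  # report numerical difference vs. PyTorch
# hidet.torch.dynamo_config.use_fp16(True)        # automatically use float16 where possible
model_opt = torch.compile(model, backend='hidet')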
See also
You can learn more about configuring hidet as a torch dynamo backend in the tutorial Optimize PyTorch Model.
In the remaining parts, we will show you the key components of Hidet.
Define tensors¶
Tip
Besides randn(), we can also use zeros(), ones(), full(), and empty() to create tensors with different initial values. We can use from_torch() to convert a PyTorch tensor to a Hidet tensor that shares the same memory, and asarray() to convert a Python list or numpy ndarray to a Hidet tensor.
A tensor is an n-dimensional array. Like other machine learning frameworks, Hidet uses Tensor as the core object for computation and manipulation.
The following code creates a randomly initialized tensor with hidet.randn().
a = hidet.randn([2, 3], device='cuda')
print(a)
Tensor(shape=(2, 3), dtype='float32', device='cuda:0')
[[-0.13 0.24 0.23]
[-1.3 0.61 -0.54]]
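A minimal sketch of the other constructors mentioned in the tip above, assuming they follow the same shape/device arguments as randn():
import torch
import hidet
z = hidet.zeros([2, 3], device='cuda')           # a 2x3 tensor filled with zeros
t = hidet.from_torch(torch.randn(2, 3).cuda())   # shares memory with the torch tensor
n = hidet.asarray([[1.0, 2.0, 3.0]])             # converts a python list (resides on cpu)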
Each Tensor has a dtype that defines the type of each tensor element, a device that tells which device the tensor resides on, and a shape that indicates the size of each dimension. The example above defines a float32 tensor on the cuda device with shape [2, 3].
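For instance, we can inspect these attributes on the tensor a defined above (the exact printed forms are assumptions and may vary):
print(a.dtype)   # float32
print(a.device)  # cuda device 0
print(a.shape)   # (2, 3)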
Run operators¶
Hidet provides a rich set of operators (e.g., matmul() and conv2d()) to compute and manipulate tensors. We can do a matrix multiplication as follows:
b = hidet.randn([3, 2], device='cuda')
c = hidet.randn([2], device='cuda')
d = hidet.ops.matmul(a, b)
d = d + c # 'd + c' is equivalent to 'hidet.ops.add(d, c)'
print(d)
Tensor(shape=(2, 2), dtype='float32', device='cuda:0')
[[-0.59 -1.01]
[ 0.28 -0.77]]
In this example, the operator is executed on the device at the time we call it, so this is an imperative style of execution. Imperative execution is intuitive and easy to debug, but it prevents some graph-level optimization opportunities and suffers from higher kernel dispatch latency.
In the next section, we introduce another way to execute operators.
Symbolic tensor and flow graph¶
In hidet, each tensor has an optional storage attribute that represents the block of memory that stores the contents of the tensor. If the storage attribute is None, the tensor is a symbolic tensor.
We can use hidet.symbol_like() or hidet.symbol() to create a symbolic tensor. Symbolic tensors are returned if any input tensor of an operator is symbolic. We can see how a symbolic tensor is computed via its trace attribute: a tuple (op, idx) where op is the operator that produces this tensor and idx is the index of this tensor in the operator's outputs.
def linear_bias(x, b, c):
    return hidet.ops.matmul(x, b) + c
x = hidet.symbol_like(a)
y = linear_bias(x, b, c)
assert x.trace is None
assert y.trace is not None
print('x:', x)
print('y:', y)
x: Tensor(shape=(2, 3), dtype='float32', device='cuda:0')
y: Tensor(shape=(2, 2), dtype='float32', device='cuda:0')
from (<hidet.graph.ops.arithmetic.AddOp object at 0x7f9ab84cb2b0>, 0)
We can use the trace attribute to construct the computation graph, starting from the symbolic output tensor(s). This is what the function hidet.trace_from() does. In hidet, we use hidet.graph.FlowGraph to represent the data flow graph (a.k.a., the computation graph).
graph: hidet.FlowGraph = hidet.trace_from(y)
print(graph)
Graph(x: float32[2, 3][cuda]){
c = Constant(float32[3, 2][cuda])
c_1 = Constant(float32[2][cuda])
x_1: float32[2, 2][cuda] = Matmul(x, c, require_prologue=False)
x_2: float32[2, 2][cuda] = Add(x_1, c_1)
return x_2
}
Optimize flow graph¶
Tip
We may configure the optimizations with PassContext. Potential configs:
Whether to use tensor cores.
Whether to use a low-precision data type (e.g., float16).
A sketch of this pattern is shown at the end of this section.
The flow graph is the basic unit of graph-level optimization in hidet. We can optimize a flow graph with hidet.graph.optimize(), which applies the predefined passes to the given flow graph.
In this example, the matrix multiplication and element-wise addition are fused into a single operator.
opt_graph: hidet.FlowGraph = hidet.graph.optimize(graph)
print(opt_graph)
Graph(x: float32[2, 3][cuda]){
c = Constant(float32[1, 3, 2][cuda])
c_1 = Constant(float32[2][cuda])
x_1: float32[2, 2][cuda] = FusedBatchMatmul(c, c_1, x, fused_graph=FlowGraph(Broadcast, BatchMatmul, Reshape, Add), anchor=1)
return x_1
}
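A minimal sketch of the PassContext pattern mentioned in the tip above, assuming it is used as a context manager around optimize(); the specific configuration methods for tensor cores and float16 vary by hidet version and are therefore only indicated in comments:
# run the optimization passes under a PassContext so that pass behavior can be configured
with hidet.graph.PassContext() as ctx:
    # configuration calls on ctx (e.g., enabling tensor cores or float16) would go here;
    # consult the hidet documentation for the methods available in your version
    opt_graph = hidet.graph.optimize(graph)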
Run flow graph¶
We can directly call the flow graph to run it:
y1 = opt_graph(a)
print(y1)
Tensor(shape=(2, 2), dtype='float32', device='cuda:0')
[[-0.59 -1.01]
[ 0.28 -0.77]]
For CUDA devices, a more efficient way is to create a CUDA graph that dispatches the kernels in the flow graph to the NVIDIA GPU.
cuda_graph = opt_graph.cuda_graph()
outputs = cuda_graph.run([a])
y2 = outputs[0]
print(y2)
Tensor(shape=(2, 2), dtype='float32', device='cuda:0')
[[-0.59 -1.01]
[ 0.28 -0.77]]
Summary¶
In this quick start guide, we walk through several important functionalities of hidet:
Define tensors.
Run operators imperatively.
Use symbolic tensors to create a computation graph (i.e., a flow graph).
Optimize and run the flow graph.
Next Step¶
It is time to learn how to use hidet in your project. A good start is to Optimize PyTorch Model and Optimize ONNX Model with Hidet.
Total running time of the script: (0 minutes 58.327 seconds)