hidet.cuda¶
Contents¶
Device Management
- Returns True if CUDA is available, False otherwise.
- Get the number of available CUDA devices.
- Get the current cuda device.
- Set the current cuda device.
- Get the properties of a CUDA device.
- Get the compute capability of a CUDA device.
- Synchronize the host thread with the device.
- Mark the start of a profiling range.
- Mark the end of a profiling range.
Memory Management
- Allocate memory on the current device.
- Allocate memory on the current device asynchronously.
- Allocate pinned host memory.
- Free memory on the current cuda device.
- Free memory on the current cuda device asynchronously.
- Free pinned host memory.
- Set the gpu memory to a given value.
- Set the gpu memory to a given value asynchronously.
- Copy gpu memory from one location to another.
- Copy gpu memory from one location to another asynchronously.
- Get the free and total memory on the current device in bytes.
Stream and Event
- A CUDA stream.
- An external CUDA stream created from a handle.
- A CUDA event.
- Get the current stream.
- Get the default stream.
- Set the current stream.
CUDA Graph
- Create a cuda graph to capture and replay the execution of a series of cuda kernels launched in a function.
Device Management¶
- hidet.cuda.available()[source]¶
Returns True if CUDA is available, False otherwise.
This function uses ctypes to check whether libcuda.so is available, instead of calling the CUDA runtime (cudart) directly.
- Returns:
ret – Whether CUDA is available.
- Return type:
bool
- hidet.cuda.device_count()[source]¶
Get the number of available CUDA devices.
- Returns:
count – The number of available CUDA devices.
- Return type:
int
- hidet.cuda.current_device()[source]¶
Get the current cuda device.
- Returns:
device_id – The ID of the cuda device.
- Return type:
int
- hidet.cuda.set_device(device_id)[source]¶
Set the current cuda device.
- Parameters:
device_id (int) – The ID of the cuda device.
- hidet.cuda.properties(device_id=0)[source]¶
Get the properties of a CUDA device.
- Parameters:
device_id (int) – The ID of the device.
- Returns:
prop – The properties of the device.
- Return type:
cudaDeviceProp
- hidet.cuda.compute_capability(device_id=0)[source]¶
Get the compute capability of a CUDA device.
- Parameters:
device_id (int) – The ID of the device to query.
- Returns:
(major, minor) – The compute capability of the device.
- Return type:
Tuple[int, int]
Memory Allocation¶
- hidet.cuda.malloc(num_bytes)[source]¶
Allocate memory on the current device.
- Parameters:
num_bytes (int) – The number of bytes to allocate.
- Returns:
addr – The address of the allocated memory.
- Return type:
int
- hidet.cuda.malloc_async(num_bytes, stream=None)[source]¶
Allocate memory on the current device asynchronously.
- Parameters:
num_bytes (int) – The number of bytes to allocate.
stream (Optional[Union[Stream, cudaStream_t, int]]) – The stream to use for the allocation. If None, the current stream is used.
- Returns:
addr – The address of the allocated memory. When the allocation fails due to insufficient memory, 0 is returned.
- Return type:
int
- hidet.cuda.malloc_host(num_bytes)[source]¶
Allocate pinned host memory.
- Parameters:
num_bytes (int) – The number of bytes to allocate.
- Returns:
addr – The address of the allocated memory.
- Return type:
int
- hidet.cuda.free(addr)[source]¶
Free memory on the current cuda device.
- Parameters:
addr (int) – The address of the memory to free. This must be the address of memory allocated with malloc() or malloc_async().
- Return type:
None
- hidet.cuda.free_async(addr, stream=None)[source]¶
Free memory on the current cuda device asynchronously.
- Parameters:
addr (int) – The address of the memory to free. This must be the address of memory allocated with malloc() or malloc_async().
stream (Union[Stream, cudaStream_t, int], optional) – The stream to use for the free. If None, the current stream is used.
- Return type:
None
- hidet.cuda.free_host(addr)[source]¶
Free pinned host memory.
- Parameters:
addr (int) – The address of the memory to free. This must be the address of memory allocated with malloc_host().
- Return type:
None
- hidet.cuda.memset(addr, value, num_bytes)[source]¶
Set the gpu memory to a given value.
- Parameters:
addr (int) – The start address of the memory region to set.
value (int) – The byte value to set the memory region to.
num_bytes (int) – The number of bytes to set.
- Return type:
None
- hidet.cuda.memset_async(addr, value, num_bytes, stream=None)[source]¶
Set the gpu memory to a given value asynchronously.
- Parameters:
addr (int) – The start address of the memory region to set.
value (int) – The byte value to set the memory region to.
num_bytes (int) – The number of bytes to set.
stream (Union[Stream, cudaStream_t, int], optional) – The stream to use for the memset. If None, the current stream is used.
- Return type:
None
- hidet.cuda.memcpy(dst, src, num_bytes)[source]¶
Copy gpu memory from one location to another.
- Parameters:
dst (int) – The destination address.
src (int) – The source address.
num_bytes (int) – The number of bytes to copy.
- Return type:
None
- hidet.cuda.memcpy_async(dst, src, num_bytes, stream=None)[source]¶
Copy gpu memory from one location to another asynchronously.
- Parameters:
dst (int) – The destination address.
src (int) – The source address.
num_bytes (int) – The number of bytes to copy.
stream (Union[Stream, cudaStream_t, int], optional) – The stream to use for the memcpy. If None, the current stream is used.
- Return type:
None
CUDA Stream and Event¶
- class hidet.cuda.Stream(device=None, blocking=False, priority=0, **kwargs)[source]¶
A CUDA stream.
- Parameters:
device (int or hidet.Device, optional) – The device on which to create the stream. If None, the current device will be used.
blocking (bool) – Whether to enable the implicit synchronization between this stream and the default stream. When enabled, any operation enqueued in the stream will wait for all previous operations in the default stream to complete before beginning execution.
priority (int) – The priority of the stream. The priority is a hint to the CUDA driver that it can use to reorder operations in the stream relative to other streams. The priority can be 0 (default priority) or -1 (high priority). By default, all streams are created with priority 0.
- device_id()[source]¶
Get the device ID of the stream.
- Returns:
device_id – The device ID of the stream.
- Return type:
int
- handle()[source]¶
Get the handle of the stream.
- Returns:
handle – The handle of the stream.
- Return type:
cudaStream_t
- class hidet.cuda.ExternalStream(handle, device_id=None)[source]¶
An external CUDA stream created from a handle.
- Parameters:
handle (int or cudaStream_t) – The handle of the stream.
device_id (int, optional) – The device ID of the stream. If None, the current device will be used.
- class hidet.cuda.Event(enable_timing=False, blocking=False)[source]¶
A CUDA event.
- Parameters:
enable_timing (bool) – When enabled, the event is able to record the time between itself and another event.
blocking (bool) – When enabled, we can use the synchronize() method to block the current host thread until the event completes.
- handle()[source]¶
Get the handle of the event.
- Returns:
handle – The handle of the event.
- Return type:
cudaEvent_t
- elapsed_time(start_event)[source]¶
Get the elapsed time between the start event and this event in milliseconds.
- Parameters:
start_event (Event) – The start event.
- Returns:
elapsed_time – The elapsed time in milliseconds.
- Return type:
float
- record(stream=None)[source]¶
Record the event in the given stream.
After the event is recorded:
- We can synchronize the event to block the current host thread until all the tasks before the event are completed, via Event.synchronize().
- We can get the elapsed time between this event and another event via Event.elapsed_time() (when enable_timing is True).
- We can make another stream wait for the event via Stream.wait_event().
- Parameters:
stream (Stream, optional) – The stream where the event is recorded.
- hidet.cuda.current_stream(device=None)[source]¶
Get the current stream.
- Parameters:
device (int or hidet.Device, optional) – The device on which to get the current stream. If None, the current device will be used.
- Returns:
stream – The current stream.
- Return type:
Stream
CUDA Graph¶
- class hidet.cuda.graph.CudaGraph(f_create_inputs, f_run, ref_objs)[source]¶
Create a cuda graph to capture and replay the execution of a series of cuda kernels launched in a function.
The graph is created by calling the constructor with the following arguments:
- Parameters:
f_create_inputs (Callable[[], List[Tensor]]) – A function that creates the input tensors of the graph. This function is called before f_run.
f_run (Callable[[List[Tensor]], List[Tensor]]) – A function that runs the graph. Only the cuda kernels launched in this function will be captured. Rerunning this function must launch the same cuda kernels in the same order. The input tensors of this function will be the output tensors of the f_create_inputs function.
ref_objs (Any) – The objects that should keep alive during the lifetime of the cuda graph. It may contain the weight tensors that are used in the graph.
- run(inputs=None)[source]¶
Run the cuda graph synchronously. If the inputs are provided, the inputs will be copied to the internal inputs of the cuda graph before running.