[Class Notes] GPU Acceleration: A PyCUDA Quick-Start Demo

Posted 2020-10-21 15:43:29




Official documentation: https://documen.tician.de/pycuda/


Quick install:
pip3 install pycuda


The install compiles from source; you will see `Building wheel for pycuda (setup.py)`:


Successfully built pycuda pytools
Installing collected packages: appdirs, dataclasses, pytools, MarkupSafe, mako, pycuda
Successfully installed MarkupSafe-1.1.1 appdirs-1.4.4 dataclasses-0.7 mako-1.1.3 pycuda-2020.1 pytools-2020.4.3

# -*- coding: utf-8 -*-
__author__ = u'东方老师 微信:dfy_88888'
__date__ = '2020/10/21 下午2:42'
__product__ = 'PyCharm'
__filename__ = 'pycuda_demo01'

import sys
from time import time
from functools import reduce

import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt

import pycuda
import pycuda.autoinit
import pycuda.driver as drv
from pycuda import gpuarray
from pycuda.elementwise import ElementwiseKernel
from pycuda.scan import InclusiveScanKernel
from pycuda.reduction import ReductionKernel


# PyCUDA exposes NVIDIA's CUDA parallel-computing API to Python

print(f'The version of PyCUDA: {pycuda.VERSION}')
print(f'The version of Python: {sys.version}')


def query_device():
    drv.init()
    print('CUDA device query (PyCUDA version) \n')
    print(f'Detected {drv.Device.count()} CUDA Capable device(s) \n')
    for i in range(drv.Device.count()):

        gpu_device = drv.Device(i)
        print(f'Device {i}: {gpu_device.name()}')
        compute_capability = float('%d.%d' % gpu_device.compute_capability())
        print(f'\t Compute Capability: {compute_capability}')
        print(f'\t Total Memory: {gpu_device.total_memory() // (1024 ** 2)} megabytes')

        # The following will give us all remaining device attributes as seen
        # in the original deviceQuery.
        # We set up a dictionary so that we can easily index
        # the values using a string descriptor.

        device_attributes_tuples = gpu_device.get_attributes().items()
        device_attributes = {}

        for k, v in device_attributes_tuples:
            device_attributes[str(k)] = v

        num_mp = device_attributes['MULTIPROCESSOR_COUNT']

        # Cores per multiprocessor is not reported by the GPU!
        # We must use a lookup table based on compute capability.
        # See the following:
        # http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
        # Note: the table only covers a few architectures; using .get() avoids
        # a KeyError on GPUs with a compute capability not listed here.
        cuda_cores_per_mp = {5.0: 128, 6.0: 64, 6.1: 128, 6.2: 128, 7.5: 128}.get(compute_capability)

        if cuda_cores_per_mp is not None:
            print(f'\t ({num_mp}) Multiprocessors, ({cuda_cores_per_mp}) CUDA Cores / Multiprocessor: '
                  f'{num_mp * cuda_cores_per_mp} CUDA Cores')

        # device_attributes.pop('MULTIPROCESSOR_COUNT')

        for k in device_attributes.keys():
            print(f'\t {k}: {device_attributes[k]}')
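The cores-per-multiprocessor table in `query_device` is worth looking at on its own: it is keyed by compute capability, covers only a handful of architectures, and any GPU outside the table needs a fallback. A pure-Python sketch (the table values are those from the listing above; returning `None` for unknown capabilities is my choice, not from the CUDA docs):

```python
# CUDA cores per streaming multiprocessor, keyed by compute capability
# (values from the demo above; the CUDA C Programming Guide has the full table)
CUDA_CORES_PER_MP = {5.0: 128, 6.0: 64, 6.1: 128, 6.2: 128, 7.5: 128}

def cores_per_multiprocessor(compute_capability):
    """Return CUDA cores per SM, or None for capabilities not in the table."""
    return CUDA_CORES_PER_MP.get(compute_capability)

print(cores_per_multiprocessor(6.1))  # 128
print(cores_per_multiprocessor(8.6))  # None (not covered by this table)
```

A plain dict indexing expression would raise `KeyError` on, say, a compute-capability-8.6 card; `.get()` degrades gracefully instead.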


# query_device()   # query the device

# Converting between NumPy arrays and gpuarrays
# The GPU has its own memory, distinct from host memory; it is called
# device memory.
#
# A NumPy array lives in the CPU environment (host side), while a gpuarray
# lives in the GPU environment (device side). The two frequently need to be
# converted into each other, i.e. data transferred between CPU and GPU.

host_data = np.array([1, 2, 3, 4, 5], dtype=np.float32)
device_data = gpuarray.to_gpu(host_data)
device_data_x2 = 2 * device_data
print("device-side data:", device_data_x2, type(device_data_x2))
host_data_x2 = device_data_x2.get()
print("host-side data:", host_data_x2, type(host_data_x2))


# Element-wise operations are inherently parallel; when performing them,
# gpuarray automatically uses the GPU's many cores to compute in parallel.
# When converting, specify the dtype explicitly whenever possible to avoid
# unnecessary performance loss.
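To see why the explicit dtype matters: NumPy defaults to float64, which doubles the number of bytes transferred to the device compared with float32 (and most consumer GPUs also run float32 arithmetic much faster). A minimal host-side sketch, using plain NumPy:

```python
import numpy as np

# NumPy defaults to float64: twice the bytes to copy over to the GPU
default_data = np.random.random(1_000_000)        # dtype is float64
explicit_data = default_data.astype(np.float32)   # half the transfer size

print(default_data.nbytes)   # 8000000 bytes
print(explicit_data.nbytes)  # 4000000 bytes
```

Passing `dtype=np.float32` up front (as the demo does) avoids both the extra transfer and a hidden conversion on the device side.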

# performance comparison
def simple_speed_test():
    host_data = np.float32(np.random.random(50000000))
    t1 = time()
    host_data_2x = host_data * np.float32(2)
    t2 = time()

    print(f'total time to compute on CPU: {t2 - t1}')

    device_data = gpuarray.to_gpu(host_data)

    t1 = time()
    device_data_2x = device_data * np.float32(2)
    t2 = time()
    # note: kernel launches are asynchronous, so t2 - t1 may under-report the
    # true GPU time; the .get() below blocks until the computation finishes

    from_device = device_data_2x.get()

    print(f'total time to compute on GPU: {t2 - t1}')
    print(f'Is the host computation the same as the GPU computation? : {np.allclose(from_device, host_data_2x)}')


# simple_speed_test()

# Python's built-in map function
print(list(map(lambda x: x + 10, [1, 2, 3, 4, 5])))
# ElementwiseKernel is very similar to map.
#
# ElementwiseKernel lets you define a custom element-wise kernel; the
# operation is written as embedded CUDA C code.
#
# A kernel here can simply be understood as a function that CUDA runs
# directly on the GPU.

gpu_2x_ker = ElementwiseKernel(
        arguments="float *in, float *out",
        operation="out[i] = 2 * in[i];",
        name="gpu_2x_ker"
    )
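Conceptually, the kernel above is the GPU analogue of a per-element map; a plain-NumPy reference implementation (useful for validating GPU results on the CPU, as the demo's `np.allclose` check does) might look like:

```python
import numpy as np

def cpu_2x(in_array):
    """CPU reference for gpu_2x_ker: out[i] = 2 * in[i] for every element."""
    return 2 * in_array.astype(np.float32)

sample = np.array([1.0, 2.5, -3.0], dtype=np.float32)
print(cpu_2x(sample))  # [ 2.   5.  -6.]
```

The CUDA C body `out[i] = 2 * in[i];` is implicitly wrapped in a loop over `i` that PyCUDA distributes across GPU threads; the NumPy version expresses the same computation as a vectorized CPU operation.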

def elementwise_kernel_example():
    host_data = np.float32(np.random.random(50000000))
    t1 = time()
    host_data_2x = host_data * np.float32(2)
    t2 = time()
    print(f'total time to compute on CPU: {t2 - t1}')

    device_data = gpuarray.to_gpu(host_data)
    # allocate memory for output
    device_data_2x = gpuarray.empty_like(device_data)

    t1 = time()
    gpu_2x_ker(device_data, device_data_2x)
    t2 = time()
    from_device = device_data_2x.get()
    print(f'total time to compute on GPU: {t2 - t1}')
    print(f'Is the host computation the same as the GPU computation? : {np.allclose(from_device, host_data_2x)}')

# In PyCUDA, the nvcc compiler usually compiles the GPU code during the
# program's first run, and PyCUDA then invokes the compiled kernel. That
# compilation time is extra overhead, so the first call is noticeably slower
# than the ones that follow. (Check your compiler with `nvcc -V`.)
for _ in range(30):
    elementwise_kernel_example()
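Because of that one-off compilation cost, a fairer benchmark separates the warm-up call from the measured ones. A minimal sketch of the pattern (the `timed_calls` helper is illustrative, not part of PyCUDA):

```python
from time import time

def timed_calls(fn, n_calls=5):
    """Time each call to fn separately. With a PyCUDA kernel, the first
    call also pays the one-off nvcc compilation cost, so it is usually
    the slowest; drop it (or call fn once beforehand) when benchmarking."""
    timings = []
    for _ in range(n_calls):
        t0 = time()
        fn()
        timings.append(time() - t0)
    return timings

# warm-up call first, then look at steady-state performance
warmup, *steady = timed_calls(lambda: sum(range(100_000)), n_calls=5)
```

This is exactly why the demo loops `elementwise_kernel_example()` 30 times: only the later iterations reflect the kernel's real throughput.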

# class pycuda.elementwise.ElementwiseKernel(arguments, operation, name="kernel", keep=False, options=[], preamble="")
# arguments: the argument list the kernel accepts
# operation: the embedded CUDA C code the kernel executes per element
# name: the name given to the kernel

# Once the program finishes, PyCUDA takes care of all cleanup and memory
# deallocation. That wraps up this quick introduction.
