cuDF: Accelerating Data Science and Analytics Applications on the GPU

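cuDF is a Python library developed by NVIDIA as part of the RAPIDS data science framework. RAPIDS builds on NVIDIA's CUDA technology to accelerate data science and analytics applications on the GPU.

cuDF provides a pandas-like DataFrame interface that makes data processing and analysis on the GPU more efficient. Its API closely matches the pandas API, but it is not a complete replacement for pandas: the two libraries overlap heavily yet differ in places. cuDF supports data structures and operations similar to those in pandas, such as indexing, filtering, concatenating, joining, and groupby.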
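As a rough illustration of how closely the API tracks pandas, the sketch below runs a filter, a groupby, and a join on made-up data. It imports cuDF under the alias `xdf` (a name chosen here) and falls back to pandas when cuDF is not installed, which only works because the calls used are spelled the same in both libraries. The `to_host` helper is also an illustrative addition, not part of either library.

```python
# Use cuDF when it is installed; otherwise fall back to pandas so the
# sketch still runs on a machine without an NVIDIA GPU.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf


def to_host(obj):
    """Return a pandas object: cuDF objects expose to_pandas(), pandas ones pass through."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


# Toy data, made up for illustration.
df = xdf.DataFrame({
    "state": ["NY", "NJ", "NY", "CT", "NJ", "NY"],
    "fine": [50, 35, 115, 65, 35, 50],
})

# Filtering with a boolean mask, written exactly as in pandas.
large_fines = df[df["fine"] >= 50]

# Grouping and aggregating.
fine_per_state = df.groupby("state")["fine"].sum().sort_index()

# Joining against a small lookup table.
names = xdf.DataFrame({
    "state": ["CT", "NJ", "NY"],
    "name": ["Connecticut", "New Jersey", "New York"],
})
joined = df.merge(names, on="state", how="left")

print(to_host(fine_per_state))
```

Moving results to the host with `to_pandas()` is only needed when CPU-only code has to consume them; keeping intermediate results on the GPU avoids extra copies.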

Advantages

cuDF builds on libcudf (a very fast C++/CUDA dataframe library) and the Apache Arrow columnar format to provide a GPU-accelerated pandas API.

It offers the following advantages.

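**High-performance computing:** cuDF can significantly speed up data processing and analysis tasks, especially data cleaning, transformation, and aggregation.
**pandas-like API:** cuDF provides an API very similar to that of pandas, which shortens the learning curve.
**Memory efficiency:** cuDF's efficient memory management lets it process data more efficiently on the GPU.
**Easy integration:** cuDF integrates seamlessly with other tools in the RAPIDS ecosystem (such as cuML and cuGraph), supporting complex data science and machine learning workflows.
**Support for multiple file formats:** cuDF supports popular data formats such as CSV, JSON, and Parquet, which makes it easy to fit into existing data processing pipelines.
**Scalability:** Through its integration with Dask, cuDF supports distributed data processing and can handle large datasets that exceed the memory of a single GPU.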
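To make the file-format point concrete, here is a minimal CSV round trip. It is a sketch under assumptions: the `xdf` alias, the pandas fallback, and the file name `sample.csv` are illustrative choices, and the data is made up.

```python
import os
import tempfile

# Prefer cuDF; fall back to pandas so the sketch also runs without a GPU.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

df = xdf.DataFrame({"id": [1, 2, 3], "label": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    # Write without the index so the file holds only the two data columns.
    df.to_csv(path, index=False)
    # Read the file back into a new DataFrame.
    loaded = xdf.read_csv(path)

print(loaded)
```

The same pattern applies to Parquet and JSON through `to_parquet`/`read_parquet` and `to_json`/`read_json`, which exist under those names in both libraries.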

Getting Started

Installing the Library

CUDA/GPU Requirements

CUDA 11.2+
NVIDIA driver 450.80.02+
Pascal architecture or newer (compute capability >= 6.0)
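The driver requirement can be checked programmatically. The sketch below compares a driver version string against the minimum listed above and, when `nvidia-smi` is available, queries the installed version with its `--query-gpu=driver_version` option. The function names are hypothetical helpers written for this example.

```python
import shutil
import subprocess

MIN_DRIVER = "450.80.02"  # minimum driver version listed in the requirements


def version_tuple(version):
    """Turn a dotted version such as '450.80.02' into (450, 80, 2)."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_is_supported(version, minimum=MIN_DRIVER):
    """Return True if `version` is at least `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)


def installed_driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    return lines[0].strip()  # first GPU is enough; all GPUs share one driver


version = installed_driver_version()
if version is None:
    print("No NVIDIA driver detected")
else:
    print(version, "OK" if driver_is_supported(version) else "too old")
```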

Installation

First, verify that the NVIDIA GPU is working properly.

!nvidia-smi

Then install cuDF directly with pip.

!pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com

Run the following to check that the installation succeeded.

import cudf  # this should run without any errors
cudf.__version__

Loading the Dataset

import pandas as pd

# read 5 columns data:
df = pd.read_parquet(
    "https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
)

# view a random sample of 10 rows:
df.sample(10)

Analysis with Standard pandas

Let's see how long the code takes to run with standard pandas.

%%time
(df[["Registration State", "Violation Description"]]
 .value_counts()
 .groupby("Registration State")
 .head(1)
 .sort_index()
 .reset_index()
)

%%time

(df
 .groupby(["Vehicle Body Type"])
 .agg({"Summons Number": "count"})
 .rename(columns={"Summons Number": "Count"})
 .sort_values(["Count"], ascending=False)
)

With pandas, the two queries take 4.36 s and 4.92 s respectively.
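The first query chains several steps, so a toy example may help. `value_counts()` counts each (state, violation) pair and sorts the counts in descending order, `groupby(...).head(1)` keeps the first (most frequent) pair for each state, and `sort_index().reset_index()` turns the result into a flat table. The data below is made up for illustration.

```python
import pandas as pd

# Made-up sample: NY has three "NO PARKING" tickets, NJ has two "EXPIRED METER".
toy = pd.DataFrame({
    "Registration State": ["NY", "NY", "NY", "NY", "NJ", "NJ", "NJ"],
    "Violation Description": [
        "NO PARKING", "NO PARKING", "NO PARKING", "EXPIRED METER",
        "EXPIRED METER", "EXPIRED METER", "NO PARKING",
    ],
})

top_per_state = (
    toy[["Registration State", "Violation Description"]]
    .value_counts()                   # count each (state, violation) pair
    .groupby("Registration State")    # group the counts by state
    .head(1)                          # keep the most frequent pair per state
    .sort_index()
    .reset_index()
)
print(top_per_state)
```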

Analysis with cuDF

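The cudf.pandas extension must be loaded before pandas is imported. In a notebook that has already imported pandas, restart the kernel first, then re-import pandas and reload the dataset so that df is created through the accelerated module.

%load_ext cudf.pandas
import pandas as pd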
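The `%load_ext` magic only works in IPython or Jupyter. For a plain Python script, cuDF offers two equivalents: running the script with `python -m cudf.pandas script.py`, or calling `cudf.pandas.install()` before pandas is imported. A minimal sketch of the second option follows. The try/except fallback to regular pandas and the `accelerated` flag are additions for illustration, so that the snippet also runs where cuDF is not installed.

```python
# Enable the accelerator before pandas is imported; importing pandas first
# would bind the name to the regular CPU-only module.
try:
    import cudf.pandas
    cudf.pandas.install()
    accelerated = True
except ImportError:
    # cuDF is not installed: continue with regular pandas.
    accelerated = False

import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": ["x", "y", "z"]})
counts = df.groupby("a")["b"].count()

print("cudf.pandas active:", accelerated)
print(counts)
```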

%%time
(df[["Registration State", "Violation Description"]]
 .value_counts()
 .groupby("Registration State")
 .head(1)
 .sort_index()
 .reset_index()
)

%%time

(df
 .groupby(["Vehicle Body Type"])
 .agg({"Summons Number": "count"})
 .rename(columns={"Summons Number": "Count"})
 .sort_values(["Count"], ascending=False)
)

**The same two queries now take 196 ms and 27.9 ms,** roughly 22 times and 176 times faster than the pandas timings of 4.36 s and 4.92 s.
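The speedup for each query is the pandas time divided by the cuDF time. A quick check using the timings reported in this article (the query labels are names chosen here, and real timings will vary with hardware):

```python
# Wall-clock timings reported in this article, in seconds.
pandas_seconds = {"top violation per state": 4.36, "count by body type": 4.92}
cudf_seconds = {"top violation per state": 0.196, "count by body type": 0.0279}

# Speedup = pandas time / cuDF time.
speedups = {
    name: pandas_seconds[name] / cudf_seconds[name]
    for name in pandas_seconds
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.0f}x faster")
```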

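Understanding Performance

cuDF provides profiling utilities that show which parts of the code run on the GPU and which run on the CPU, which helps in understanding performance.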

%%cudf.pandas.profile

small_df = pd.DataFrame({'a': [0, 1, 2], 'b': ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

axis = 0
for i in range(0, 2):
    small_df.min(axis=axis)
    axis = 1

counts = small_df.groupby("a").b.count()

%%cudf.pandas.line_profile

small_df = pd.DataFrame({'a': [0, 1, 2], 'b': ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

axis = 0
for i in range(0, 2):
    small_df.min(axis=axis)
    axis = 1

counts = small_df.groupby("a").b.count()

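Operations that cuDF does not support automatically fall back to standard pandas (on the CPU).

Because the data then has to be copied between the GPU and the CPU, such operations can be slow. For example, cuDF does not currently support the axis parameter of count(). The second call below therefore runs on the CPU and may be noticeably slower than the first.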

%%time
df.count()

%%time
df.count(axis=1)
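The fallback behavior can be pictured as a "try the fast path, otherwise use the slow path" dispatcher. The pure-Python sketch below is a hypothetical illustration of that idea only. It is not how cuDF is implemented, and every name in it (`FallbackProxy`, `fast_count`, `slow_count`) is invented for this example.

```python
class FallbackProxy:
    """Hypothetical illustration only; not cuDF's actual implementation."""

    def __init__(self, fast, slow):
        self.fast = fast  # stands in for the GPU code path
        self.slow = slow  # stands in for the pandas (CPU) code path
        self.log = []     # records which path served each call

    def __call__(self, *args, **kwargs):
        try:
            result = self.fast(*args, **kwargs)
            self.log.append("fast")
        except NotImplementedError:
            # A real accelerator would have to copy the data from GPU
            # memory to CPU memory here, which is where the cost comes from.
            result = self.slow(*args, **kwargs)
            self.log.append("slow")
        return result


def fast_count(rows, axis=0):
    """Pretend accelerated count that only supports axis=0 (per column)."""
    if axis != 0:
        raise NotImplementedError("axis=1 is not supported on the fast path")
    return [sum(value is not None for value in column) for column in zip(*rows)]


def slow_count(rows, axis=0):
    """Reference count of non-missing values per column (axis=0) or per row (axis=1)."""
    if axis == 0:
        return [sum(value is not None for value in column) for column in zip(*rows)]
    return [sum(value is not None for value in row) for row in rows]


count = FallbackProxy(fast_count, slow_count)
rows = [(1, "x"), (2, None), (None, "z")]

per_column = count(rows)       # served by the fast path
per_row = count(rows, axis=1)  # falls back to the slow path
print(per_column, per_row, count.log)
```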

Source: 程序员学长
