Saving data with compression in pandas and NumPy

Saving with NumPy

https://numpy.org/doc/stable/reference/generated/numpy.save.html

Saving arrays with NumPy is straightforward.

# 1. Save a single array
numpy.save(file, arr, allow_pickle=True)

# 2. Save multiple arrays
numpy.savez(file, *args, allow_pickle=True, **kwds)

# If no names are given, arrays are numbered arr_0, arr_1, ...
numpy.savez(file, a1, a2)  # arrays will be named: arr_0, arr_1 ...
# Give each array an explicit name via keyword arguments
numpy.savez(file, a1=a1, a2=a2)  # arrays will be named: a1, a2 ...

# Read an .npz file
npzfile = np.load('file.npz')
print(npzfile.files)

a1, a2 = [npzfile[key] for key in npzfile.files]

# 3. Save multiple arrays with compression; the usage is identical to numpy.savez().
# Compression ratios are typically 10x or better, at the cost of longer read and
# save times -- roughly by the same factor.
numpy.savez_compressed(file, *args, allow_pickle=True, **kwds)
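A minimal round trip through all three functions looks like this (the filenames here are arbitrary examples):

```python
import numpy as np

a1 = np.arange(1000)
a2 = np.random.rand(1000)

# Single array -> .npy
np.save("single.npy", a1)
assert np.array_equal(np.load("single.npy"), a1)

# Multiple named arrays -> .npz
np.savez("multi.npz", a1=a1, a2=a2)
npz = np.load("multi.npz")
print(npz.files)  # ['a1', 'a2']

# Compressed variant -- same call signature, smaller file
np.savez_compressed("multi_c.npz", a1=a1, a2=a2)
assert np.array_equal(np.load("multi_c.npz")["a2"], a2)
```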

Saving with pandas

https://pandas.pydata.org/docs/user_guide/io.html

Saving with pandas is considerably more involved: it supports many formats, the most common being CSV, XLSX, and HDF5.

Type   Data Description       Reader           Writer
text   CSV                    read_csv         to_csv
text   Fixed-Width Text File  read_fwf         NA
text   JSON                   read_json        to_json
text   HTML                   read_html        to_html
text   LaTeX                  NA               Styler.to_latex
text   XML                    read_xml         to_xml
text   Local clipboard        read_clipboard   to_clipboard
binary MS Excel               read_excel       to_excel
binary OpenDocument           read_excel       NA
binary HDF5 Format            read_hdf         to_hdf
binary Feather Format         read_feather     to_feather
binary Parquet Format         read_parquet     to_parquet
binary ORC Format             read_orc         to_orc
binary Stata                  read_stata       to_stata
binary SAS                    read_sas         NA
binary SPSS                   read_spss        NA
binary Python Pickle Format   read_pickle      to_pickle
SQL    SQL                    read_sql         to_sql
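Each Reader/Writer pair round-trips a DataFrame through the corresponding format. A small sketch with two of the text formats (filenames are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Write with the to_* methods ...
df.to_csv("demo.csv", index=False)
df.to_json("demo.json", orient="records")

# ... and read back with the matching read_* functions.
assert pd.read_csv("demo.csv").equals(df)
assert pd.read_json("demo.json", orient="records").equals(df)
```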

The script below compares the read/write speed and file size of several of these formats.

# %%
import numpy as np
import pandas as pd
import hashlib

import os
from rich.pretty import pprint

np.random.seed(42)

#%%
size = 10000

df = pd.DataFrame()
df["A"] = np.random.rand(size)
df["B"] = np.random.randint(0, 10000, size)
df["sin"] = df["A"].apply(lambda x: np.sin(x))
df["datetime"] = pd.date_range("20240101", periods=size, freq="s")
df["array"] = df["B"].apply(lambda x: np.sin(x) * np.array([1, 2, 3]))
df["string"] = df["array"].apply(lambda x: hashlib.md5(str(x).encode("utf8")).hexdigest())

print(df)
print(df.dtypes)

# %%

import timeit

XLSX = "test.xlsx"
CSV = "test.csv"
PKL = "test.pkl"
PKLZ = "test.pklz"
HDF = "test.hdf"
HDFZ = "test.hdfz"
FTH = "test.feather"
PQT = "test.parquet"

funcs = [
# xls
lambda x: df.to_excel(XLSX),
lambda x: pd.read_excel(XLSX, index_col=0),
# csv
lambda x: df.to_csv(CSV, mode="w"),
lambda x: pd.read_csv(CSV, index_col=0),
# pkl
lambda x: df.to_pickle(PKL),
lambda x: pd.read_pickle(PKL),
lambda x: df.to_pickle(PKLZ, compression="xz"),
lambda x: pd.read_pickle(PKLZ, compression="xz"),
# hdf
lambda x: df.to_hdf(HDF, key="test", mode="w"),
lambda x: pd.read_hdf(HDF, "test"),
lambda x: df.to_hdf(HDFZ, key="test", mode="w", complib="blosc"),
lambda x: pd.read_hdf(HDFZ, "test"),
# feather
lambda x: df.to_feather(FTH),
lambda x: pd.read_feather(FTH),
# parquet
lambda x: df.to_parquet(PQT),
lambda x: pd.read_parquet(PQT)
]


# %%
# Define the benchmark parameters
runs_number = 1
runs_result = []

for i, name in enumerate([XLSX, CSV, PKL, PKLZ, HDF, HDFZ, FTH, PQT]):
func = name.replace("test.", "")
w_speed = timeit.timeit(lambda: funcs[2 * i](df), number=runs_number)
r_speed = timeit.timeit(lambda: funcs[2 * i + 1](df), number=runs_number)
filesize = os.path.getsize(name) / 1024 / 1024
runs_result.append([func, w_speed, r_speed, filesize])

pprint(runs_result)

# %%
# Print the results

df_rst = pd.DataFrame(runs_result, columns=["func", "write [s]", "read [s]", "filesize [M]"])
pprint(df_rst)

# %%

df_rst.plot(x='func', y=['write [s]', 'read [s]', 'filesize [M]'], kind='bar', title=f'datasize = {size}')

# %%

The comparison shows that the feather and parquet formats generally offer better compression ratios and read/write speeds than the other formats.