Greenplum: Optimizing Performance for High-Volume Writes and Updates


Background

Greenplum offers three ways to load data:

  1. insert
  2. copy in
  3. external tables via gpfdist

In theory, the performance ordering is gpfdist > copy in > insert (batch) > insert.
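Here "insert (batch)" means sending many rows per statement rather than one INSERT per row; the simplest illustration is a multi-row VALUES statement, which is roughly what the pgjdbc option reWriteBatchedInserts=true rewrites a batched PreparedStatement into. A minimal sketch against the test table public.tbl created in section 2.1 (the sample values are placeholders):

-- one statement, many rows: amortizes parsing, planning, and network round trips
INSERT INTO public.tbl (crt_time, sensorid, info) VALUES
    (clock_timestamp(), 101, 'row-1'),
    (clock_timestamp(), 102, 'row-2'),
    (clock_timestamp(), 103, 'row-3');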

1 Installing the Required Tools

1.1 pgbench

Install pgbench with yum:

# set up the PostgreSQL 10 yum repository (install the repo RPM found under this directory), then install the packages
yum install -y https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7.9-x86_64/

yum install -y postgresql10
yum install -y postgresql10-server
yum install -y postgresql10-contrib

Configure the environment variables. The packages install to /usr/pgsql-10/bin by default, so add that directory to PATH:

vim /etc/profile

# add this line
export PATH=$PATH:/usr/pgsql-10/bin

source /etc/profile

Create a test database on Greenplum:

create database pgbenchdb;

Run the initialization command:

pgbench -i -s 5 pgbenchdb     # initialize; scale factor 5 creates 500,000 rows in pgbench_accounts

After it finishes, count the rows in pgbench_accounts; 500,000 rows means the load succeeded.
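For example, from psql connected to pgbenchdb:

select count(*) from pgbench_accounts;   -- expect 500,000 rows for -s 5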

2 Benchmark Tests

2.1 Test Preparation

1. Create the test table tbl in the pgbenchdb database:

CREATE TABLE public.tbl (
	id bigserial NOT NULL,
	crt_time timestamp NULL,
	sensorid int4 NULL,
	sensorloc point NULL,
	info text NULL
)
WITH (
	autovacuum_enabled=on,
	autovacuum_vacuum_threshold=1,
	autovacuum_vacuum_cost_delay=0
)
DISTRIBUTED BY (id);

2. Create insert.sql under /home/gpadmin; it inserts one row of random data into the table:

insert into public.tbl (crt_time,sensorid,info) values ( clock_timestamp(),trunc(random()*500000),substring(md5(random()::text),1,8) );

3. Create read.sql under /home/gpadmin; it reads one random row from the table:

select * from public.tbl where sensorid = trunc(random()*500000);

2.2 Running the Tests

2.2.1 Write Tests

Before each run, empty the table:

truncate table public.tbl;

1. Connection test: simulate 224 client connections, 8 threads, 8 transactions per client

pgbench -f /home/gpadmin/insert.sql -c 224 -C -j 8 -t 8 pgbenchdb

Result:

starting vacuum...end.
transaction type: /home/gpadmin/insert.sql
scaling factor: 1
query mode: simple
number of clients: 224
number of threads: 8
number of transactions per client: 8
number of transactions actually processed: 1792/1792
latency average = 2308.419 ms
tps = 97.036123 (including connections establishing)
tps = 97.929852 (excluding connections establishing)

2. 64 clients, 100k rows written

pgbench -f /home/gpadmin/insert.sql -c 64 -C -j 8 -t 1563 pgbenchdb

Result:

transaction type: /home/gpadmin/insert.sql
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 8
number of transactions per client: 1563
number of transactions actually processed: 98733/100032
latency average = 649.590 ms
tps = 98.523657 (including connections establishing)
tps = 100.359377 (excluding connections establishing)

3. 128 clients, 100k rows written

pgbench -f /home/gpadmin/insert.sql -c 128 -C -j 8 -t 782 pgbenchdb

Result:

[gpadmin@pnode3 ~]$ pgbench -f /home/gpadmin/insert.sql -c 128 -C -j 8 -t 782 pgbenchdb
starting vacuum...end.
transaction type: /home/gpadmin/insert.sql
scaling factor: 1
query mode: simple
number of clients: 128
number of threads: 8
number of transactions per client: 782
number of transactions actually processed: 100096/100096
latency average = 1316.215 ms
tps = 97.248556 (including connections establishing)
tps = 98.091034 (excluding connections establishing)

4. 64 clients, 1M rows written

pgbench -f /home/gpadmin/insert.sql -c 64 -C -j 8 -t 15625 pgbenchdb

Result:

transaction type: /home/gpadmin/insert.sql
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 8
number of transactions per client: 15625
number of transactions actually processed: 1000000/1000000
latency average = 769.468 ms
tps = 83.174383 (including connections establishing)
tps = 84.614156 (excluding connections establishing)

5. 128 clients, 1M rows written

pgbench -f /home/gpadmin/insert.sql -c 128 -C -j 8 -t 7813 pgbenchdb

Result:

transaction type: /home/gpadmin/insert.sql
scaling factor: 1
query mode: simple
number of clients: 128
number of threads: 8
number of transactions per client: 7813
number of transactions actually processed: 1000064/1000064
latency average = 1389.130 ms
tps = 92.144034 (including connections establishing)
tps = 92.931354 (excluding connections establishing)

2.2.2 Read Test

1. 64 clients, 100k queries

pgbench -f /home/gpadmin/read.sql -c 64 -C -j 8 -t 1563 pgbenchdb

Result:

transaction type: /home/gpadmin/read.sql
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 8
number of transactions per client: 1563
number of transactions actually processed: 100032/100032
latency average = 4188.179 ms
tps = 15.281105 (including connections establishing)
tps = 15.371656 (excluding connections establishing)

Read performance is far too poor (around 15 TPS), so no further read tests were run.

2.2.3 Update Test

3 Optimization

3.1 Write Optimization

1. The copy in approach

For a Java demo that uses the JDBC CopyManager to bulk-load and export Greenplum data, see: Greenplum copy in的Java工具类_CarsonBigData的博客-CSDN博客
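Under the hood, CopyManager streams rows into a COPY ... FROM STDIN statement, so the SQL-level equivalent is roughly the sketch below (the file path and column list are assumptions based on the test table above; a server-side COPY reads the file on the Greenplum master and needs superuser rights):

-- bulk-load a CSV file with COPY instead of row-by-row INSERTs
COPY public.tbl (crt_time, sensorid, info)
FROM '/tmp/tbl_data.csv'
WITH (FORMAT csv);

From a client, psql's \copy, or CopyManager.copyIn("COPY ... FROM STDIN", reader) in JDBC, streams the same data without needing file access on the master.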

2. The gpfdist approach

For loading through Spark, see: Greenplum——基于Greenplum-Spark Connector的Spark脚本开发及遇到的坑_CarsonBigData的博客-CSDN博客_greenplum spark connector
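Independently of Spark, a minimal gpfdist sketch looks like this (the host name etl-host, port 8081, and file name tbl_data.csv are placeholders): start gpfdist on the machine holding the data files, e.g. gpfdist -d /data/load -p 8081 &, then expose the files as a readable external table and insert-select into the target:

-- segments pull from gpfdist in parallel, bypassing the master
CREATE READABLE EXTERNAL TABLE public.ext_tbl (
    crt_time timestamp,
    sensorid int4,
    info text
)
LOCATION ('gpfdist://etl-host:8081/tbl_data.csv')
FORMAT 'CSV';

INSERT INTO public.tbl (crt_time, sensorid, info)
SELECT * FROM public.ext_tbl;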

3.2 Update Optimization

Approach: merge the incremental data and the target table's existing data into a temporary table, then overwrite the target table (the affected daily partition).

- tmp_incr_data: rows inserted or updated in the source business database
- tmp_update_data: the latest data that arrived today
- tmp_not_update_data: rows in the target table that do not need updating
- dest_table: the target table, partitioned by day
- id: the primary key column

1. Extract the day's rows that do not need updating into a temporary table:

CREATE TABLE tmp_not_update_data AS
SELECT a.*
FROM dest_table a
LEFT JOIN tmp_incr_data b
    ON a.id = b.id
WHERE a.data_date = '2022-07-28'   -- limit to the partition that will be rewritten
  AND b.id IS NULL;

2. Delete the day's data from the target table:

ALTER TABLE dest_table TRUNCATE PARTITION FOR (DATE '2022-07-28');   -- assumes dest_table is range-partitioned by data_date

3. Insert the non-updated rows back into the target table:

INSERT INTO dest_table
SELECT * FROM tmp_not_update_data;

4. Insert the day's updated rows into the target table:

INSERT INTO dest_table
SELECT * FROM tmp_update_data;

To be continued...

References:

Greenplum6 JDBC insert 性能百倍提升之踩坑之路_恋奴娇的博客-CSDN博客

HybridDB for PostgreSQL, Greenplum 写入性能优化实践_weixin_34391854的博客-CSDN博客

Greenplum6 JDBC insert性能媲美MySQL_Greenplum中文社区的博客-CSDN博客_greenplum写入性能

通过copyIn向PostgreSQL或Greenplum写入数据_陈序猿张的博客-CSDN博客

Greenplum基于pgbench的性能测试 – 腾讯云开发者社区-腾讯云

pgbench 使用介绍及示例 – 墨天轮

基于pgbench的GreenPlum压力测试_zyset的博客-CSDN博客_pgbench 官方压测结果

pgbench10安装记录(yum)_mingjie73的博客-CSDN博客_pgbench 安装
