有没有想过在一台小笔记本上处理和分析上亿行的数据? 网上有很多关于OnTime数据的测试报告,今天让我们也尝试下DuckDB的,如果你想在Windows或者Linux下做测试,可以告诉我你的结果吗?
硬件情况
2014 年购入的 MacBook Pro,
-
4 核 2.8G i7
-
16G 内存
-
1T SSD
数据来源
美国从 1987 年至今持续更新的的民航数据[1], 2 亿条多些。
准备工作
-
安装好 Jupyter Lab -
安装必要的包
pip install duckdb duckdb-engine ipython-sql ipython-autotime
-
下载 ontime 数据
wget --no-check-certificate --c
https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip
为了简化下工作,这里我下载了一个合并好的包,
wget --no-check-certificate -c
https://repo.databend.rs/t_ontime/t_ontime.csv.zip
为啥要转成 Parquet?
想较多了解 Parquet 的可以看看我之前的一篇文章什么是 Parquet 文件格式以及为什么要使用它。
或者直接看个最简单的对比
-
使用 DuckDB 查询 csv.gz
select count(*) from read_csv_auto('sample.csv.gz',delim='t')
count_star()
0 2008064
time: 7.82 s (started: 2022-04-11 16:02:55 +08:00) -
查询 Parquet 格式
select count(*) from 'ontime.parquet'
count_star()
0 202687655
time: 513 ms (started: 2022-04-11 16:05:18 +08:00)
转 Parquet
zip 文件大概 7G, encoding 为 ascii, 因为含有拉丁字符,需要做进一步转换, 转为 gzip 格式,方便 DuckDB 转为 Parquet(snappy 压缩, 11G) 格式,
unzip -p t_ontime.csv.zip|iconv -f latin1 -t utf-8|gzip >t_ontime.csv.gz
本文的查询都仅仅为 SQL 语句,因此可以直接在 DuckDB 命令行下执行,也可以在 Jupyter Notebook 里执行, 我为了做笔记方便,都放到了 Notebook 中。
-
Notebook 导入必要的包和进行必要的设置,
import warnings; warnings.filterwarnings("ignore")
%load_ext autotime
%load_ext sql
%sql duckdb:///
%config SqlMagic.autopandas=True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
%sql pragma threads=4
#导入必要的包
import pandas as pd
import numpy as np
-
字段与类型定义
columns = {
'Year':'UInt16',
'Quarter':'UInt8',
'Month':'UInt8',
'DayofMonth':'UInt8',
'DayOfWeek':'UInt8',
'FlightDate':'Date',
'Reporting_Airline':'String',
'DOT_ID_Reporting_Airline':'Int32',
'IATA_CODE_Reporting_Airline':'String',
'Tail_Number':'String',
'Flight_Number_Reporting_Airline':'String',
'OriginAirportID':'Int32',
'OriginAirportSeqID':'Int32',
'OriginCityMarketID':'Int32',
'Origin':'String',
'OriginCityName':'String',
'OriginState':'String',
'OriginStateFips':'String',
'OriginStateName':'String',
'OriginWac':'Int32',
'DestAirportID':'Int32',
'DestAirportSeqID':'Int32',
'DestCityMarketID':'Int32',
'Dest':'String',
'DestCityName':'String',
'DestState':'String',
'DestStateFips':'String',
'DestStateName':'String',
'DestWac':'Int32',
'CRSDepTime':'Int32',
'DepTime':'Int32',
'DepDelay':'Int32',
'DepDelayMinutes':'Int32',
'DepDel15':'Int32',
'DepartureDelayGroups':'String',
'DepTimeBlk':'String',
'TaxiOut':'Int32',
'WheelsOff':'Int32',
'WheelsOn':'Int32',
'TaxiIn':'Int32',
'CRSArrTime':'Int32',
'ArrTime':'Int32',
'ArrDelay':'Int32',
'ArrDelayMinutes':'Int32',
'ArrDel15':'Int32',
'ArrivalDelayGroups':'Int32',
'ArrTimeBlk':'String',
'Cancelled':'UInt8',
'CancellationCode':'String',
'Diverted':'UInt8',
'CRSElapsedTime':'Int32',
'ActualElapsedTime':'Int32',
'AirTime':'Int32',
'Flights':'Int32',
'Distance':'Int32',
'DistanceGroup':'UInt8',
'CarrierDelay':'Int32',
'WeatherDelay':'Int32',
'NASDelay':'Int32',
'SecurityDelay':'Int32',
'LateAircraftDelay':'Int32',
'FirstDepTime':'String',
'TotalAddGTime':'String',
'LongestAddGTime':'String',
'DivAirportLandings':'String',
'DivReachedDest':'String',
'DivActualElapsedTime':'String',
'DivArrDelay':'String',
'DivDistance':'String',
'Div1Airport':'String',
'Div1AirportID':'Int32',
'Div1AirportSeqID':'Int32',
'Div1WheelsOn':'String',
'Div1TotalGTime':'String',
'Div1LongestGTime':'String',
'Div1WheelsOff':'String',
'Div1TailNum':'String',
'Div2Airport':'String',
'Div2AirportID':'Int32',
'Div2AirportSeqID':'Int32',
'Div2WheelsOn':'String',
'Div2TotalGTime':'String',
'Div2LongestGTime':'String',
'Div2WheelsOff':'String',
'Div2TailNum':'String',
'Div3Airport':'String',
'Div3AirportID':'Int32',
'Div3AirportSeqID':'Int32',
'Div3WheelsOn':'String',
'Div3TotalGTime':'String',
'Div3LongestGTime':'String',
'Div3WheelsOff':'String',
'Div3TailNum':'String',
'Div4Airport':'String',
'Div4AirportID':'Int32',
'Div4AirportSeqID':'Int32',
'Div4WheelsOn':'String',
'Div4TotalGTime':'String',
'Div4LongestGTime':'String',
'Div4WheelsOff':'String',
'Div4TailNum':'String',
'Div5Airport':'String',
'Div5AirportID':'Int32',
'Div5AirportSeqID':'Int32',
'Div5WheelsOn':'String',
'Div5TotalGTime':'String',
'Div5LongestGTime':'String',
'Div5WheelsOff':'String',
'Div5TailNum':'String',
}
-
csv.gz to Parquet,
copy
(select * from read_csv_auto('t_ontime.csv.gz',delim='t', columns={columns}))
to 'ontime.parquet'
-
创建视图(为了后面 SQL 更简洁)
create or replace view ontime as select * from 'ontime.parquet'
SQL benchmark
-
Q0. 1.39s
SELECT avg(c1)
FROM
(
SELECT Year, Month, count(*) AS c1
FROM ontime
GROUP BY Year, Month
); -
Q1. 查询从 2000 年到 2008 年每天的航班数,860ms
SELECT
DayOfWeek, count(*) AS c
FROM ontime
WHERE Year >= 2000
AND Year <= 2008
GROUP BY DayOfWeek
ORDER BY c DESC; -
Q2. 查询从 2000 年到 2008 年每周延误超过 10 分钟的航班数。1.26s
SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE DepDelay>10
AND Year >= 2000 AND Year <= 2008
GROUP BY DayOfWeek
ORDER BY c DESC; -
Q3.查询 2000 年到 2008 年每个机场延误超过 10 分钟以上的次数 1.45s
SELECT Origin, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY Origin
ORDER BY c DESC
LIMIT 10; -
Q4. 查询 2007 年各航空公司延误超过 10 分钟以上的次数 572ms
SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*)
FROM ontime
WHERE DepDelay>10 AND Year=2007
GROUP BY Carrier
ORDER BY count(*) DESC; -
Q5. 查询 2007 年各航空公司延误超过 10 分钟以上的百分比 1.17s
SELECT qq.Carrier, c, c2, c*100/c2 as c3
FROM
(
SELECT
IATA_CODE_Reporting_Airline AS Carrier,
count(*) AS c
FROM ontime
WHERE DepDelay>10
AND Year=2007
GROUP BY Carrier
) q
JOIN
(
SELECT
IATA_CODE_Reporting_Airline AS Carrier,
count(*) AS c2
FROM ontime
WHERE Year=2007
GROUP BY Carrier
) qq on q.Carrier = qq.Carrier
ORDER BY c3 DESC; -
Q5 优化版本, 622ms
SELECT IATA_CODE_Reporting_Airline AS Carrier
, avg(cast(DepDelay>10 as Int8))*100 AS c3
FROM ontime
WHERE Year=2007
GROUP BY Carrier
ORDER BY c3 DESC; -
Q6. 同上一个查询一致,只是查询范围扩大到 2000 年到 2008 年 2.82s
SELECT q.Carrier, c, c2, c*100/c2 as c3
FROM
(
SELECT
IATA_CODE_Reporting_Airline AS Carrier,
count(*) AS c
FROM ontime
WHERE DepDelay>10
AND Year>=2000 AND Year<=2008
GROUP BY Carrier
) q
JOIN
(
SELECT
IATA_CODE_Reporting_Airline AS Carrier,
count(*) AS c2
FROM ontime
WHERE Year>=2000 AND Year<=2008
GROUP BY Carrier
) qq on q.Carrier=qq.Carrier
ORDER BY c3 DESC; -
Q6 优化版本 1.57s
SELECT IATA_CODE_Reporting_Airline AS Carrier
, avg(cast(DepDelay>10 as Int8))*100 AS c3
FROM ontime
WHERE Year>=2000 AND Year <=2008
GROUP BY Carrier
ORDER BY c3 DESC; -
Q7. 每年航班延误超过 10 分钟的百分比 1.54s
SELECT Year, avg(cast(DepDelay>10 as Int8))*100
FROM ontime
GROUP BY Year
ORDER BY Year; -
Q8. 每年更受人们喜爱的目的地 9.92s
SELECT DestCityName, count(distinct OriginCityName) AS u
FROM ontime
WHERE Year >= 2000 and Year <= 2010
GROUP BY DestCityName
ORDER BY u DESC LIMIT 10; -
Q8 优化下, 4.17s
SELECT DestCityName,count(OriginCityName) as u from
(
SELECT DestCityName, OriginCityName
FROM ontime
WHERE Year >= 2000 and Year <= 2010
GROUP BY DestCityName, OriginCityName)
GROUP BY DestCityName
ORDER BY u DESC LIMIT 10; -
Q9. 910ms
SELECT Year, count(*) AS c1
FROM ontime
GROUP BY Year; -
Q10. 9.12s
SELECT
min(Year), max(Year), IATA_CODE_Reporting_Airline AS Carrier, count(*) AS cnt,
sum(cast(ArrDelayMinutes>30 as Int8)) AS flights_delayed,
round(sum(cast(ArrDelayMinutes>30 as Int8))/count(*),2) AS rate
FROM ontime
WHERE
DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
AND FlightDate < '2010-01-01'
GROUP by Carrier
HAVING count(*)>100000 and max(Year)>1990
ORDER by rate DESC
LIMIT 1000;
这个速度你还满意吧?
有兴趣的可以尝试使用 ClickHouse 导入该数据,并做个对比测试,反正我的电脑,经过一番折腾后,风扇狂转,内存溢出,然后我放弃了。
从上面的效果看,如果配合Superset, 是不是可以做到单机轻松分析上亿数据(并且跨平台支持)。
参考资料
民航数据: https://transtats.bts.gov/
原文始发于微信公众号(alitrack):在旧笔记本上分析2亿OnTime数据
版权声明:本文内容由互联网用户自发贡献,该文观点仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 举报,一经查实,本站将立刻删除。
文章由极客之音整理,本文链接:https://www.bmabk.com/index.php/post/62797.html