芝加哥出租车载客情况分析(附clickhouse搭建过程)

clickhouse 配置过程记录

参考资料

  1. 安装ClickHousx64-官方Doc
  2. deb包下载地址

ubuntu 18.04 ×64 clickhouse安装流程

  1. 检查当前CPU是否支持SSE 4.2

    • $ grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
    • 检测后支持,继续下一步
  2. 执行以下命令开始安装

    • sudo apt-get install apt-transport-https ca-certificates dirmngr
    • sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E0C56BD4
    • echo "deb https://repo.clickhouse.tech/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
    • sudo apt-get update
    • sudo apt-get install -y clickhouse-server clickhouse-client
  3. 由于安装时间超过3小时,暂停安装,改采用下载deb包安装。

  4. 下载deb安装包

    • 下载3个文件 clickhouse-client,clickhouse-server,clickhouse-common-static
  5. 执行安装命令

    • sudo dpkg -i clickhouse-common-static_20.6.3.28_amd64.deb
    • sudo dpkg -i clickhouse-client_20.6.3.28_all.deb
    • sudo dpkg -i clickhouse-server_20.6.3.28_all.deb
  6. 启动

    • sudo service clickhouse-server start
    • clickhouse-client
  7. 启动失败,错误信息如下

    Code: 516. DB::Exception: Received from localhost:9000. DB::Exception:default: Authentication failed: password is incorrect or there is no user with such name.

  8. 重新执行命令

    • clickhouse-client --password
  9. 启动成功,输出语句

    ClickHouse client version 20.6.3.28 (official build).
    Password for user (default):
    Connecting to localhost:9000 as user default.
    Connected to ClickHouse server version 20.6.3 revision 54436.

  10. 创建数据库 BITest

    • CREATE DATABASE IF NOT EXISTS BITEST
  11. 创建TAXI_TRIPS 表

    CREATE TABLE BITEST.TAXI_TRIPS
    (
    Trip ID String,
    Taxi ID String,
    Trip Start Timestamp DateTime,
    Trip End Timestamp Nullable(DateTime),
    Trip Seconds Nullable(UInt64),
    Trip Miles Nullable(Float64),
    Pickup Census Tract Nullable(UInt64),
    Dropoff Census Tract Nullable(UInt64),
    Pickup Community Area Nullable(UInt64),
    Dropoff Community Area Nullable(UInt64),
    Fare Nullable(String),
    Tips Nullable(String),
    Tolls Nullable(String),
    Extras Nullable(String),
    Trip Total Nullable(String),
    Payment Type Nullable(String),
    Company Nullable(String),
    Pickup Centroid Latitude Nullable(Float64),
    Pickup Centroid Longitude Nullable(Float64),

Pickup Centroid Location Nullable(String),
Dropoff Centroid Latitude Nullable(Float64),
Dropoff Centroid Longitude Nullable(Float64),
Dropoff Centroid Location Nullable(String),
Community Areas Nullable(UInt64)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(toDate(Trip Start Timestamp))
ORDER BY Trip ID
SETTINGS index_granularity = 8192;

11 退出client ,执行导入数据的命令行
- clickhouse-client --password --query='INSERT INTO BITEST.TAXI_TRIPS FORMAT CSVWithNames' < Taxi_Trips.csv --date_time_input_format='best_effort' --max_insert_block_size=1024

  1. 执行报错,显示数据导入值不对,报错位置在13572634行。打开文件后发现最后五行数据值不合理。删除最后五行。

    • err.png
    • errordata.png
  2. 删除表中上一次导入的数据,重新导入csv文件,然后检证数据条数。

    • [图片上传失败...(image-d482ca-1598519252266)]
  1. (在执行第三个SQL时发现问题)在把字段Trip Start Timestamp作为DateTime类型导入时,凌晨的“00:00:00~00:00:59"的数据全部读入错误,导致计算时一天只有1~23小时的时间段。
  • err_output3.png
  1. 重新建表,数据引擎选为Log,用string类型读入时间。

CREATE TABLE BITEST.TAXI_TRIPS
(
Trip ID String,
Taxi ID String,
Trip Start Timestamp String,
Trip End Timestamp Nullable(String),
Trip Seconds Nullable(UInt64),
Trip Miles Nullable(Float64),
Pickup Census Tract Nullable(UInt64),
Dropoff Census Tract Nullable(UInt64),
Pickup Community Area Nullable(UInt64),
Dropoff Community Area Nullable(UInt64),
Fare Nullable(String),
Tips Nullable(String),
Tolls Nullable(String),
Extras Nullable(String),
Trip Total Nullable(String),
Payment Type Nullable(String),
Company Nullable(String),
Pickup Centroid Latitude Nullable(Float64),
Pickup Centroid Longitude Nullable(Float64),
Pickup Centroid Location Nullable(String),
Dropoff Centroid Latitude Nullable(Float64),
Dropoff Centroid Longitude Nullable(Float64),
Dropoff Centroid Location Nullable(String),
Community Areas Nullable(UInt64)
)
ENGINE = Log()

16.再次执行导入数据命令行
- clickhouse-client --password --query='INSERT INTO BITest1.TAXI_TRIPS FORMAT CSVWithNames' < Taxi_Trips.csv --max_insert_block_size=1024
17.再次检证条数和Trip Start Timestamp`字段,读入成功。

报表分析

  1. 问题一: ⾏⻋⾥程最多的出租⻋Top 100

    SELECT sum(Trip Miles) AS Miles , Taxi ID AS ID
    FROM BITEST.TAXI_TRIPS
    GROUP BY Taxi ID
    order by Miles DESC
    limit 100

  • output1.png
  1. 问题二:搭载乘客最多的出租车Top 100

    • 提供的数据中并没乘客数量字段,改为计算最多的出租车

    SELECT count(1) as num , Taxi ID AS ID
    FROM BITEST.TAXI_TRIPS
    GROUP BY Taxi ID
    order by num DESC
    limit 100
    -

    output2.png

  2. 问题三:分析出租⻋每⽇⾼峰期和低峰期 (⼩时级别)

    SELECT
    AVG(t1.num) AS avg_num,
    t1.hour
    FROM
    (
    SELECT
    substring(Trip Start Timestamp, 1, 10) AS date,
    if(substring(Trip Start Timestamp, 21, 2) = 'AM', concat(substring(Trip Start Timestamp, 12, 2), 'AM'), concat(substring(Trip Start Timestamp, 12, 2), 'PM')) AS hour,
    count(Taxi ID) AS num
    FROM BITEST.TAXI_TRIPS
    GROUP BY
    date,
    hour
    ) AS t1
    GROUP BY t1.hour
    ORDER BY avg_num ASC
    -

    output3.png

  3. 问题四:分析出出租⻋平均搭载时⻓。

    SELECT
    Taxi ID AS ID,
    avg(Trip Seconds) AS avg_time
    FROM BITEST.TAXI_TRIPS
    GROUP BY Taxi ID
    ORDER BY avg_time DESC
    LIMIT 1000
    -

    output4.png

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容