logo

木头lbj‘s Blog

首页/关于

dbt 无法连接Spark

2022-05-27/# BigData

dbt 简介

dbt ,Data Build Tool是一个开源命令行工具,用于数据仓库中的数据转换,专注于ETL中的T。是现代大数据架构中非常热门的组件之一。

使用

安装

官方提供了4种安装方式,分别为

  • Homebrew(Mac用户)
  • pip
  • Docker image
  • 源码安装

一般使用pip安装较方便快捷,最佳实践。dbt为每一类的数据(仓)库提供不了不同的adapter,选择对应的进行安装, 会自动将依赖dbt-core引入。这里是想使用dbt连接Spark,通过thrift协议。

pip instal dbt-spark[PyHive]

安装成功后便可以使用dbt-cli命令行,下面是初始化一个dbt project的过程。

[root@mutoulbj spark_demo]# dbt init 08:08:09 Running with dbt=1.1.0 Enter a name for your project (letters, digits, underscore): spark_demo The profile spark_demo already exists in /root/.dbt/profiles.yml. Continue and overwrite it? [y/N]: Y Which database would you like to use? [1] spark (Don't see the one you want? https://docs.getdbt.com/docs/available-adapters) Enter a number: 1 host (yourorg.sparkhost.com): mcdp1 [1] odbc [2] http [3] thrift Desired authentication method option (enter a number): 3 port [443]: 10000 schema (default schema that dbt will build objects in): dbt_example threads (1 or more) [1]: 1 08:08:35 Profile spark_demo written to /root/.dbt/profiles.yml using target's profile_template.yml and your supplied values. Run 'dbt debug' to validate the connection. 08:08:35 Your new dbt project "spark_demo" was created! For more information on how to configure the profiles.yml file, please consult the dbt documentation here: https://docs.getdbt.com/docs/configure-your-profile One more thing: Need help? Don't hesitate to reach out to us via GitHub issues or on Slack: https://community.getdbt.com/ Happy modeling!

生成的project目录结构如下:

[root@mutoulbj spark_demo]# tree . ├── logs │ └── dbt.log └── spark_demo ├── analyses ├── dbt_project.yml ├── logs │ └── dbt.log ├── macros ├── models │ └── example │ ├── my_first_dbt_model.sql │ ├── my_second_dbt_model.sql │ └── schema.yml ├── README.md ├── seeds ├── snapshots └── tests

问题与排查解决

使用dbt debug 命令可以验证配置是否正确,数据(仓)库是否可连接。这时候, 问题来了, 一直报如下错误:

[root@mutoulbj spark_demo]# dbt debug 06:33:04 Running with dbt=1.1.0 dbt version: 1.1.0 python version: 3.9.10 python path: /usr/local/bin/python3.9 os info: Linux-3.10.0-1160.42.2.el7.x86_64-x86_64-with-glibc2.17 Using profiles.yml file at /root/.dbt/profiles.yml Using dbt_project.yml file at /root/spark_demo/dbt_project.yml Configuration: profiles.yml file [OK found and valid] dbt_project.yml file [OK found and valid] Required dependencies: - git [OK found] Connection: host: 192.168.51.194 port: 10000 cluster: None endpoint: None schema: dbt_example organization: 0 Connection test: [ERROR] 1 check failed: dbt was unable to connect to the specified database. The database returned the following error: >Runtime Error Database Error failed to connect Check your database credentials and try again. For more information, visit: https://docs.getdbt.com/docs/configure-your-profile

从错误信息来看,在连接Spark的时候出错了,但没有具体信息。只有进行排查。

  • 使用beeline进行连接验证Spark Thift服务是否正常 ✔︎
  • 换个环境,在本机(Mac)上验证是否OK ✔︎
  • 重装dbt再尝试 ✘

通过以上排查,依然没有头绪,看源码,在dbt-spark/dbt/adapters/spark/connections.py找到相关的代码。

elif creds.method == SparkConnectionMethod.THRIFT: cls.validate_creds(creds, ["host", "port", "user", "schema"]) if creds.use_ssl: transport = build_ssl_transport( host=creds.host, port=creds.port, username=creds.user, auth=creds.auth, kerberos_service_name=creds.kerberos_service_name, ) conn = hive.connect(thrift_transport=transport) else: conn = hive.connect( host=creds.host, port=creds.port, username=creds.user, auth=creds.auth, kerberos_service_name=creds.kerberos_service_name, ) # noqa handle = PyhiveConnectionWrapper(conn)

可见在调用hive.connect()方法时,没有捕获异常并透出。要想知道具体原因,可以在服务器中进行调试。

[root@mutoulbj ~]# python3 Python 3.9.10 (main, May 27 2022, 10:27:15) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux Type "help", "copyright", "credits" or "license" for more information. >>> from pyhive import hive >>> conn = hive.connect(host='192.168.51.194',port=10000) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.9/site-packages/pyhive/hive.py", line 104, in connect return Connection(*args, **kwargs) File "/usr/local/lib/python3.9/site-packages/pyhive/hive.py", line 243, in __init__ self._transport.open() File "/usr/local/lib/python3.9/site-packages/thrift_sasl/__init__.py", line 84, in open raise TTransportException(type=TTransportException.NOT_OPEN, thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found'

Bingo,到这原因就清楚了,no mechanism available: No worthy mechs found,sasl相关的依赖包有缺失。

查看当前服务器已有的相关依赖:

[root@mutoulbj ~]# rpm -qa | grep cyrus cyrus-sasl-lib-2.1.26-24.el7_9.x86_64 cyrus-sasl-devel-2.1.26-24.el7_9.x86_64 cyrus-sasl-2.1.26-24.el7_9.x86_64

确实没有相关的加密算法库,安装并再次检查依赖:

[root@mutoulbj ~]# yum install cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-md5 cyrus-sasl-plain [root@mutoulbj ~]# rpm -qa | grep cyrus cyrus-sasl-gssapi-2.1.26-24.el7_9.x86_64 cyrus-sasl-lib-2.1.26-24.el7_9.x86_64 cyrus-sasl-plain-2.1.26-24.el7_9.x86_64 cyrus-sasl-devel-2.1.26-24.el7_9.x86_64 cyrus-sasl-2.1.26-24.el7_9.x86_64 cyrus-sasl-md5-2.1.26-24.el7_9.x86_64

再次运行dbt debug,成功。

[root@mutoulbj spark_demo]# dbt debug 08:08:51 Running with dbt=1.1.0 dbt version: 1.1.0 python version: 3.9.10 python path: /usr/local/bin/python3.9 os info: Linux-3.10.0-1160.42.2.el7.x86_64-x86_64-with-glibc2.17 Using profiles.yml file at /root/.dbt/profiles.yml Using dbt_project.yml file at /root/spark_demo/spark_demo/dbt_project.yml Configuration: profiles.yml file [OK found and valid] dbt_project.yml file [OK found and valid] Required dependencies: - git [OK found] Connection: host: mcdp1 port: 10000 cluster: None endpoint: None schema: dbt_example organization: 0 Connection test: [OK connection ok] All checks passed!

总结

不透出具体、友好的错误信息,排查会很困难。但是开源的最大好处是代码中的细节无处可藏,只要看代码,再加上一些调试总能找到具体问题所在。

dbt上手简单,配套生态已很完善,其他的待进一步测试。