airflow
Contents
- hdfs cluster
- hive
- airflow
hdfs cluster
tips:
- If the three nodes cannot communicate, check your ~/.ssh/authorized_keys and make sure the files below have the right permissions (no group or other write permission):
  -rw-r--r-- 1 hadoop hadoop 1.5K May 2 21:16 authorized_keys
  -rw------- 1 hadoop hadoop 3.2K Apr 30 00:06 id_rsa
  -rw-r--r-- 1 hadoop hadoop 766 Apr 30 00:06 id_rsa.pub
  -rw-r--r-- 1 hadoop hadoop 1.4K Apr 30 06:03 known_hosts
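If the permissions are off, a quick way to reset them (a sketch; assumes the hadoop user's home directory):

```bash
# Reset SSH file permissions to match the listing above.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa
chmod 644 ~/.ssh/authorized_keys ~/.ssh/id_rsa.pub ~/.ssh/known_hosts
```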
- Give as much memory as possible to the YARN memory properties; yarn.nodemanager.resource.memory-mb should be a little smaller than the node's physical memory.
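A sketch of the corresponding yarn-site.xml entries, assuming an 8 GB worker node (the values are examples only):

```xml
<!-- yarn-site.xml: example for an 8 GB node; leave headroom for the OS and daemons -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>6144</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6144</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
```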
- sync.sh: a helper script for keeping the Hadoop config in sync across the nodes.
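A minimal sketch of what such a script might look like, assuming the same $HADOOP_HOME on every node; the hostnames are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical sync.sh: push the Hadoop config directory from this node to the workers.
for host in node2 node3; do
  rsync -av --delete "$HADOOP_HOME/etc/hadoop/" "hadoop@${host}:${HADOOP_HOME}/etc/hadoop/"
done
```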
- Point the data directory properties to locations that have plenty of disk space.
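A sketch of the usual directory properties (which properties and paths to use here are assumptions; adjust to your disks):

```xml
<!-- core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/hadoop/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hadoop/datanode</value>
</property>
```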
hive
tips:
- You only need to install MySQL on the master node, but you need to install Hive on all nodes so there are no Hive dependency issues.
- Remember to start HiveServer2 ($HIVE_HOME/bin/hiveserver2) so you can connect to Hive through it (remember to use the metastore username & password).
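HiveServer2 is usually reached through Beeline; a hedged example, assuming it runs on the master node on the default port 10000 (host and credentials are placeholders):

```bash
# Connect to HiveServer2 with Beeline; replace host and credentials with your own.
beeline -u jdbc:hive2://master:10000 -n hiveuser -p hivepassword
```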
airflow
tips:
- All DAG files need to be in ~/airflow/dags/.
- For the MySQL database backend, set explicit_defaults_for_timestamp=1 under [mysqld] in my.cnf, the MySQL server configuration file (located at /usr/local/etc/my.cnf here); see the snippet after this list.
- Scale out with Celery (RabbitMQ as the broker). You need to install all related dependencies on all nodes; see the airflow.cfg sketch after this list.
- Use tmux to keep the different Airflow processes running in separate sessions.
- Use pyarrow.hdfs instead of the HDFS operator for HDFS access; see the sketch after this list.
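For the database-backend tip, the my.cnf change looks like this:

```ini
# /usr/local/etc/my.cnf (MySQL server configuration)
[mysqld]
explicit_defaults_for_timestamp = 1
```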
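For the Celery/RabbitMQ tip, a sketch of the relevant airflow.cfg settings; exact key names differ between Airflow versions, and the broker/backend URLs, hosts, and credentials are placeholders:

```ini
# airflow.cfg (sketch; key names as in Airflow 1.10)
[core]
executor = CeleryExecutor

[celery]
broker_url = amqp://airflow:airflow@master:5672//
result_backend = db+mysql://airflow:airflow@master:3306/airflow
```

The webserver and scheduler then run on the master and `airflow worker` runs on every node, each kept alive in its own tmux session.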
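For the pyarrow.hdfs tip, a minimal sketch of doing HDFS I/O from a plain PythonOperator instead of an HDFS-specific operator; the host, port, paths, and DAG settings are all assumptions, and it uses pyarrow's legacy hdfs API:

```python
from datetime import datetime

import pyarrow as pa
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def upload_to_hdfs():
    # Connect with pyarrow's (legacy) HDFS client; host, port, and user are placeholders.
    fs = pa.hdfs.connect(host="master", port=8020, user="hadoop")
    with open("/tmp/report.csv", "rb") as local, fs.open("/data/report.csv", "wb") as remote:
        remote.write(local.read())


dag = DAG(
    dag_id="pyarrow_hdfs_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

upload = PythonOperator(
    task_id="upload_to_hdfs",
    python_callable=upload_to_hdfs,
    dag=dag,
)
```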