hdfs cluster

hdfs cluster configuration

tips:

  • if the three nodes cannot communicate, check your ~/.ssh/authorized_keys and make sure the files below have the right permissions (no group or other write permission); the chmod commands after the listing will set them.

  • -rw-r--r-- 1 hadoop hadoop 1.5K May 2 21:16 authorized_keys
  • -rw------- 1 hadoop hadoop 3.2K Apr 30 00:06 id_rsa
  • -rw-r--r-- 1 hadoop hadoop 766 Apr 30 00:06 id_rsa.pub
  • -rw-r--r-- 1 hadoop hadoop 1.4K Apr 30 06:03 known_hosts
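A minimal sketch of the chmod commands that produce the permissions listed above; run it as the hadoop user on every node (sshd also rejects keys when the home directory or ~/.ssh is group-writable):

# sshd refuses keys whose files are group/other writable
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa
chmod 644 ~/.ssh/authorized_keys ~/.ssh/id_rsa.pub ~/.ssh/known_hosts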
  • give as much memory as possible to the properties below; yarn.nodemanager.resource.memory-mb should be a little smaller than your node's physical memory size (a memory check follows the snippet)
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>32768</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
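A quick sanity check for the values above: yarn.scheduler.maximum-allocation-mb only caps a single container request, while yarn.nodemanager.resource.memory-mb is the total the NodeManager can hand out, so keep the latter a bit below what free reports.

# total physical memory in MB; yarn.nodemanager.resource.memory-mb should stay somewhat below this
free -m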
  • sync.sh: the script below copies the Hadoop configuration to the worker nodes
# copy the local Hadoop configuration to every worker node
for node in node1 node2; do
  scp ~/hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/;
done
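After editing the configuration, a typical follow-up (assuming the script above is saved as ~/sync.sh) is to push it to the workers and restart the daemons from the master node:

# distribute the config, then restart YARN and HDFS so the new values take effect
bash ~/sync.sh
$HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh && $HADOOP_HOME/sbin/start-dfs.sh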
  • point these values to directories that have plenty of free space (a disk-space check follows the snippet)
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/vol/data/nameNode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/vol/data/dataNode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
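To confirm the volume really has the space you expect (/vol/data is just the example path above), check it on each node and, after restarting HDFS, from the NameNode's point of view:

# free space on the configured volume
df -h /vol/data
# configured capacity and per-DataNode usage as HDFS sees it
hdfs dfsadmin -report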

hive

configuration

tips:

  1. you only need to install MySQL on the master node, but Hive has to be installed on all nodes so that there are no Hive dependency issues.
  2. remember to start HiveServer2 ($HIVE_HOME/bin/hiveserver2) so that you can access Hive from a CLI client (remember to use the metastore username & password); a start/connect sketch follows this list.
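A minimal sketch of tip 2, run on the master node; the host name master, the default HiveServer2 port 10000, and the hadoop / hadoop_password credentials are placeholders for your own metastore username & password:

# start HiveServer2 in the background and keep its log
nohup $HIVE_HOME/bin/hiveserver2 > ~/hiveserver2.log 2>&1 &
# connect from a CLI client (Beeline ships with Hive)
$HIVE_HOME/bin/beeline -u jdbc:hive2://master:10000 -n hadoop -p hadoop_password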

airflow

configuration

tips:

  1. all DAG files need to be in ~/airflow/dags/
  2. for the MySQL database backend, set explicit_defaults_for_timestamp=1 under [mysqld] in my.cnf, the MySQL server configuration file located at /usr/local/etc/my.cnf (a quick way to verify the setting follows this list)
  3. scale out with Celery (with RabbitMQ as the broker); all related dependencies need to be installed on every node
  4. use tmux to start the different services in separate sessions (see the sketch after this list)
  5. use pyarrow.hdfs instead of the HdfsOperator
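A sketch covering tips 2-4 on an Airflow 1.x CeleryExecutor setup; the tmux session names are arbitrary, and on Airflow 2.x the worker command is airflow celery worker:

# verify the MySQL backend option from tip 2 is active
mysql -u root -p -e "SHOW VARIABLES LIKE 'explicit_defaults_for_timestamp';"
# run each Airflow service in its own detached tmux session (tips 3 and 4)
tmux new-session -d -s airflow-web 'airflow webserver'
tmux new-session -d -s airflow-sched 'airflow scheduler'
tmux new-session -d -s airflow-worker 'airflow worker'   # Celery worker; start one on every worker node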