[Memo] A note to read when the Misskey DB blows up
Preface
This note collects the recovery steps for when the DB (PostgreSQL) backing the Misskey server described in the note linked above goes down.
I have already had to run this recovery twice, so I decided to write the procedure down here.
Recovery procedure
Recovering the VM
If the whole VM hosting the DB has died, the first step is to rebuild the VM the DB runs on.
Rebuild the VM using the tf files and playbook from the original build.
ironawi@ironawi-ally:~$ cd terraform/kubernetes/
ironawi@ironawi-ally:~/terraform/kubernetes$ ls
kubernetes_node.tf modules output.tf sg-k8s-master.tf sg-k8s-worker.tf sg-misskey...tf terraform.tfstate terraform.tfstate.backup
ironawi@ironawi-ally:~/terraform/kubernetes$ terraform plan
...
Note: Objects have changed outside of Terraform
Terraform detected the following changes made outside of Terraform since the last "terraform apply" which may have affected this plan:
# module.worker_node2.aws_instance.main has been deleted
- resource "aws_instance" "main" {
id = "i-0891cba0a7c60c5fe"
- public_dns = "ec2-15-152-119-16.ap-northeast-3.compute.amazonaws.com" -> null
tags = {
"Name" = "worker_node2"
}
# (33 unchanged attributes hidden)
# (9 unchanged blocks hidden)
}
Unless you have made equivalent changes to your configuration, or ignored the relevant attributes using ignore_changes, the following plan may include actions to undo or respond to these changes.
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# module.worker_node2.aws_instance.main will be created
+ resource "aws_instance" "main" {
+ ami = "ami-0c1531991482a24e1"
+ arn = (known after apply)
+ associate_public_ip_address = (known after apply)
+ availability_zone = (known after apply)
+ cpu_core_count = (known after apply)
+ cpu_threads_per_core = (known after apply)
+ disable_api_stop = (known after apply)
+ disable_api_termination = (known after apply)
+ ebs_optimized = (known after apply)
+ get_password_data = false
+ host_id = (known after apply)
+ host_resource_group_arn = (known after apply)
+ iam_instance_profile = (known after apply)
+ id = (known after apply)
+ instance_initiated_shutdown_behavior = (known after apply)
+ instance_lifecycle = (known after apply)
+ instance_state = (known after apply)
+ instance_type = "t3.small"
+ ipv6_address_count = (known after apply)
+ ipv6_addresses = (known after apply)
+ key_name = "yaiwata-dev-northeast3"
+ monitoring = (known after apply)
+ outpost_arn = (known after apply)
+ password_data = (known after apply)
+ placement_group = (known after apply)
+ placement_partition_number = (known after apply)
+ primary_network_interface_id = (known after apply)
+ private_dns = (known after apply)
+ private_ip = (known after apply)
+ public_dns = (known after apply)
+ public_ip = (known after apply)
+ secondary_private_ips = (known after apply)
+ security_groups = (known after apply)
+ source_dest_check = true
+ spot_instance_request_id = (known after apply)
+ subnet_id = "subnet-0988129ff19aad0e4"
+ tags = {
+ "Name" = "worker_node2"
}
+ tags_all = {
+ "Name" = "worker_node2"
}
+ tenancy = (known after apply)
+ user_data = (known after apply)
+ user_data_base64 = (known after apply)
+ user_data_replace_on_change = false
+ vpc_security_group_ids = [
+ "sg-03bf3ff0d56c8f475",
+ "sg-0fc354d2e7824d1e6",
]
+ instance_market_options {
+ market_type = "spot"
+ spot_options {
+ instance_interruption_behavior = (known after apply)
+ max_price = "0.01"
+ spot_instance_type = (known after apply)
+ valid_until = (known after apply)
}
}
+ root_block_device {
+ delete_on_termination = true
+ device_name = (known after apply)
+ encrypted = (known after apply)
+ iops = (known after apply)
+ kms_key_id = (known after apply)
+ throughput = (known after apply)
+ volume_id = (known after apply)
+ volume_size = 30
+ volume_type = "gp3"
}
}
Plan: 1 to add, 0 to change, 0 to destroy.
Changes to Outputs:
~ worker_node2 = "ec2-15-152-119-16.ap-northeast-3.compute.amazonaws.com" -> (known after apply)
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Note: You didn't use the -out option to save this plan, so Terraform can't guarantee to take exactly these actions if you run "terraform apply" now.
First, run terraform plan to check the state. The output shows that one of the workers has disappeared, and that the missing worker will be recreated.
Having confirmed the state, run terraform apply to recreate the VM.
ironawi@ironawi-ally:~/terraform/kubernetes$ terraform apply
...
... (plan output identical to the terraform plan shown above) ...
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
module.worker_node2.aws_instance.main: Creating...
module.worker_node2.aws_instance.main: Still creating... [10s elapsed]
module.worker_node2.aws_instance.main: Provisioning with 'local-exec'...
module.worker_node2.aws_instance.main (local-exec): Executing: ["/bin/sh" "-c" "modules/ec2/../scripts/check_ssh_connection.sh <host name>"]
module.worker_node2.aws_instance.main (local-exec): checking ssh connection...
...
module.worker_node2.aws_instance.main (local-exec): ssh connection established!
module.worker_node2.aws_instance.main: Provisioning with 'local-exec'...
module.worker_node2.aws_instance.main (local-exec): Executing: ["/bin/sh" "-c" "ansible-playbook -i <host name>, modules/ec2/../ansible/setup_k8s.yaml"]
module.worker_node2.aws_instance.main (local-exec): PLAY [all] *********************************************************************
...
module.worker_node2.aws_instance.main: Creation complete after 1m19s [id=<instance id>]
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
Outputs:
master_node = "<host name>"
worker_node1 = "<host name>"
worker_node2 = "<host name>"
worker_node3 = "<host name>"
If the build completes without problems, the command finishes with output like the above.
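The check_ssh_connection.sh invoked by the local-exec provisioner above is not shown in this note. Purely as a guess at its shape (the user name ubuntu, the retry interval, and the ssh options are all assumptions):

```shell
# Hypothetical reconstruction of check_ssh_connection.sh; the real script's
# contents are not shown in this note, so every detail here is a guess.
check_ssh() {
  host="$1"
  echo "checking ssh connection..."
  # Retry until an ssh connection to the new instance succeeds
  until ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "ubuntu@$host" true 2>/dev/null; do
    sleep 5
  done
  echo "ssh connection established!"
}
```

Terraform blocks on this loop, so the ansible-playbook provisioner only starts once the instance is actually reachable.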
Cleaning up leftover resources and preparing for DB recovery
Because the VM died suddenly, the k8s resources tied to it are still hanging around.
SSH into the control plane node and clean up those resources.
First, stop the Misskey web server that accesses the DB.
Edit the Deployment, set replicas to 0, and save.
kubectl edit deploy -n misskey web-deployment
Confirm that the web server Pods are gone, then drain the node that belonged to the dead VM and delete it.
kubectl get po -n misskey
kubectl get node
kubectl drain --ignore-daemonsets --force <node name>
kubectl delete node <node name>
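The clean-up steps above (stop the web Deployment, drain, delete the node) can be sketched as one function. kubectl scale is a non-interactive alternative to editing replicas by hand; the Deployment and namespace names follow this note:

```shell
# Sketch of the node clean-up above, assuming the names used in this note.
cleanup_dead_node() {
  node="$1"
  # Non-interactive equivalent of editing replicas down to 0
  kubectl scale deploy -n misskey web-deployment --replicas=0
  # Evict whatever is left on the dead node, then remove it from the cluster
  kubectl drain --ignore-daemonsets --force "$node"
  kubectl delete node "$node"
}
```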
Joining the newly built k8s worker node to the cluster
Following the "creating the worker nodes" section of the page above, work through the steps up to restarting kubelet on the worker node.
Following the "creating a token" section of the page above, reissue a cluster-join token on the control plane node.
Running the displayed kubeadm join command on the worker node adds it to the k8s cluster.
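The token reissue can be done in one step: kubeadm can print the complete join command along with the fresh token (a sketch, assuming kubeadm is on the control plane node's PATH):

```shell
# On the control plane node: create a fresh bootstrap token and print the
# matching "kubeadm join ..." command to paste on the new worker.
make_join_command() {
  kubeadm token create --print-join-command
}
```

Run the printed command as-is on the worker node to complete the join.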
Starting postgres and restoring from the backup
Apply the DB manifest stored on the control plane node to restart the DB.
kubectl apply -f db.yaml
kubectl get po -n misskey
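Instead of polling kubectl get po by eye, kubectl wait can block until the db Pod reports Ready (a sketch; the Pod name db follows this note, and the timeout value is an arbitrary choice):

```shell
# Block until the db Pod is Ready, so the backup copy does not race
# against Pod startup. The 120s timeout is a placeholder value.
wait_for_db() {
  kubectl wait --for=condition=Ready pod/db -n misskey --timeout=120s
}
```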
Once the db Pod is confirmed up, copy the backup file into the Pod.
kubectl cp /k8s/misskey/backup/<backup file>.tar.gz misskey/db:/
Enter the Pod with kubectl exec, extract the tar.gz you copied in, and restore the DB from the dump it contains.
kubectl exec -it -n misskey db -- /bin/bash
tar zxf <backup file>.tar.gz
psql -U misskey-user misskey < tmp/backup/dump.sql
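The copy-and-restore steps above can be sketched as one non-interactive sequence, avoiding the interactive shell session (file names and paths follow this note; treat it as a sketch, not the exact procedure used):

```shell
# Sketch: copy the backup into the db Pod, unpack it, and feed the dump to psql.
restore_db() {
  backup="$1"   # file name of the tar.gz under /k8s/misskey/backup/
  kubectl cp "/k8s/misskey/backup/$backup" "misskey/db:/"
  kubectl exec -n misskey db -- tar zxf "/$backup"
  # The redirection must happen inside the Pod, hence the sh -c wrapper
  kubectl exec -n misskey db -- sh -c "psql -U misskey-user misskey < tmp/backup/dump.sql"
}
```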
Starting redis and restoring from the backup
If redis went down, just put the backed-up dump.rdb back into /k8s/misskey/redis/ and restart redis.
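That step can be sketched as follows. The backup path and the Pod label app=redis are assumptions not shown in this note; deleting the Pod so its controller recreates it is one way to "restart" redis:

```shell
# Sketch: put the saved dump.rdb back where the redis volume expects it,
# then restart redis by deleting its Pod (assumes a controller recreates it;
# the label app=redis is a guess).
restore_redis() {
  dump="$1"   # path to the backed-up dump.rdb
  cp "$dump" /k8s/misskey/redis/dump.rdb
  kubectl delete pod -n misskey -l app=redis
}
```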
Restarting Misskey
Once the DB recovery is done, bring the Misskey web server back up.
Just set the Deployment's replicas back to its original value.
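As with scaling down, kubectl scale is a non-interactive way to bring the web server back (the replica count of 1 here is a placeholder; use whatever the Deployment had originally):

```shell
# Scale the Misskey web Deployment back up; the default of 1 replica
# is a placeholder value, not necessarily what this setup used.
resume_web() {
  replicas="${1:-1}"
  kubectl scale deploy -n misskey web-deployment --replicas="$replicas"
}
```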