vRealize Log Insight /storage/core partition at 100% on nodes, causing the Cassandra DB to fail to start

vRLI deletes old buckets when available space on the /storage/core partition drops below 3%. Deletion follows a FIFO model, so this partition should never reach 100% because vRLI manages it. In some cases, however, /storage/core has already reached 100%, which takes the node offline.

When you check the file system free space, you see content similar to:
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 16G 2.4G 13G 16% /
udev 7.9G 112K 7.9G 1% /dev
tmpfs 7.9G 648K 7.9G 1% /dev/shm
/dev/sda1 128M 38M 84M 31% /boot
/dev/mapper/data-var 20G 7.3G 12G 39% /storage/var
/dev/mapper/data-core 483G 483G 0 100% /storage/core
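A quick way to check how close /storage/core is to full is to read the Use% column directly. A minimal sketch (partition_usage is a hypothetical helper name, not a vRLI tool):

```shell
# partition_usage: print the Use% of the filesystem backing a path as a bare
# number (no % sign). On a vRLI node you would pass /storage/core; a value
# near 100 means retention can no longer keep the 3% headroom described above.
partition_usage() {
  df -P "$1" | awk 'NR==2 { gsub(/%/, ""); print $5 }'
}

# On the appliance: partition_usage /storage/core
```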

Check whether the NFS archive is configured properly and working by reviewing the following points:
1. Is the NFS server reachable by Log Insight?
2. Is there sufficient space available?
3. Have permissions been properly configured to allow Log Insight to write and access the NFS server?
4. Is there sufficient end-to-end NFS throughput between the Log Insight appliance and the NFS server?
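The first three checks can be scripted from the appliance. A minimal sketch, where check_archive_dir is a hypothetical helper and the 1 GiB free-space floor is an illustrative threshold, not a vRLI requirement:

```shell
# check_archive_dir: sanity-check an archive mount point: is it writable, and
# does it have some free space? Pass the configured NFS mount path on a vRLI
# node. The default minimum of 1 GiB (in KiB) is an arbitrary example value.
check_archive_dir() {
  dir="$1"
  min_kb="${2:-1048576}"
  # Can we write to it?
  touch "$dir/.archive_write_test" 2>/dev/null || { echo "FAIL: not writable"; return 1; }
  rm -f "$dir/.archive_write_test"
  # Is there enough free space?
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 { print $4 }')
  [ "$avail_kb" -ge "$min_kb" ] || { echo "FAIL: only ${avail_kb} KiB free"; return 1; }
  echo "OK"
}
```

End-to-end throughput (point 4) cannot be inferred from a directory check and still needs to be measured separately.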

Note: Even when NFS is configured properly, that alone is not enough to guarantee problem-free archiving: if NFS becomes unavailable (even for more than 10 minutes) or the archive storage fills up, the same issue will occur. In general, Log Insight grinding to a halt is documented and expected behavior if NFS is not maintained.

Important: Before starting the steps below, be sure to validate the stored buckets.
1. Run the below command on both affected nodes.
Note: The command might take a while to finish running. Its results are saved to the /tmp/validate.txt file:
cd /usr/lib/loginsight/application/sbin
./validate-bucket --validate > /tmp/validate.txt
2. Once validation is complete, review /tmp/validate.txt and check for corrupted buckets. If there are any, follow Phase 1 of the action plan; if there are no corrupted buckets, follow Phase 2.
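To scan the report quickly, a grep over /tmp/validate.txt is usually enough. The pattern below assumes corrupted entries contain the word "corrupt"; the exact wording may differ between vRLI versions, so adjust it to match the actual report:

```shell
# list_corrupted: print report lines flagged as corrupt, or a message if none.
# The "corrupt" keyword is an assumption about the report format.
list_corrupted() {
  grep -i 'corrupt' "$1" || echo "no corrupted buckets reported"
}

# On the appliance: list_corrupted /tmp/validate.txt
```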
Phase 1:
1. Stop Log Insight:
/etc/init.d/loginsight stop
2. List the buckets and look for the GUIDs of the oldest buckets by timestamp:
/usr/lib/loginsight/application/sbin/bucket-index show
3. Run the following command to delete a bucket
/usr/lib/loginsight/application/sbin/bucket-index delete [BUCKET-ID]
4. Start Log Insight:
/etc/init.d/loginsight start
Note: The "./validate-bucket --validate" output reveals any corrupted buckets; remove those instead of looking for the oldest buckets.
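The Phase 1 sequence can be wrapped in one helper. In this sketch the service and bucket-index commands are passed as parameters purely so it can be dry-run off the appliance; on a real node they are /etc/init.d/loginsight and /usr/lib/loginsight/application/sbin/bucket-index:

```shell
# delete_bucket_offline: stop Log Insight, delete one bucket by ID, restart.
# $1 = service control command, $2 = bucket-index command, $3 = bucket ID.
delete_bucket_offline() {
  svc="$1"; idx="$2"; bucket_id="$3"
  "$svc" stop || return 1                 # step 1: stop Log Insight
  "$idx" delete "$bucket_id" || return 1  # step 3: delete the bucket
  "$svc" start                            # step 4: start Log Insight again
}
```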

Phase 2:
1. Deploy a new vRLI node following the steps outlined in the document below.

2. Once the new vRLI node is deployed, import the buckets from the old node into the new one without losing any data, using the steps below.
1) SSH to the Log Insight node you want to import events from and go to /storage/core/loginsight/cidata/store; this is where the data buckets live.
2) Run "service loginsight stop" to stop Log Insight. Ensure the service has stopped by running "service loginsight status".
3) Copy the buckets you want to import to the target Log Insight node; the destination directory must be the same, i.e. /storage/core/loginsight/cidata/store.
4) SSH to the Log Insight node that will be importing the events and stop the service by running "service loginsight stop". Ensure the service has stopped by running "service loginsight status".
5) Run "/usr/lib/loginsight/application/sbin/bucket-index add <bucket_id>".
6) Repeat the step above for all the copied buckets.
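After the bucket directories have been copied, the registration loop in steps 5 and 6 can be scripted. A sketch in which the bucket-index command is a parameter so the loop can be exercised outside the appliance; on the importing node it is /usr/lib/loginsight/application/sbin/bucket-index:

```shell
# import_buckets: register each copied bucket with bucket-index on the
# importing node. $1 = bucket-index command; remaining args = bucket IDs
# (the directory names copied into /storage/core/loginsight/cidata/store).
import_buckets() {
  idx="$1"; shift
  for id in "$@"; do
    "$idx" add "$id" || { echo "failed to add $id"; return 1; }
  done
}
```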

3. Ensure you complete these steps for both current worker nodes so that their buckets move to the new nodes.
4. Once the above is completed, join the two new nodes to the existing deployment by following the document below.

Clarification on the action plan:
The plan does not delete the buckets from the old nodes automatically; it only copies (imports) the old buckets into the new nodes. Afterwards you can either reconnect the old worker nodes, knowing that their data is preserved on the new nodes that replace them, or delete some buckets on the old worker nodes to bring them back online and then import the deleted buckets that you copied before the deletion.
