Troubleshooting with Network Monitor II: Quick Fixes
Network Monitor II is a powerful tool for observing traffic, diagnosing issues, and maintaining the health of networks of all sizes. When things go wrong, you need fast, methodical steps to identify and resolve the root cause. This article provides a structured troubleshooting workflow, common symptoms and quick fixes, configuration checks, and preventive measures to keep Network Monitor II running smoothly.
Quick troubleshooting workflow
- Reproduce the issue reliably. Document steps to trigger the problem and gather timestamps.
- Check basic system health. Verify CPU, memory, disk usage, and network connectivity on the monitoring host.
- Confirm service status. Ensure Network Monitor II core services/processes are running. Restart services if necessary and watch logs.
- Collect logs and captures. Pull Network Monitor II logs, agent logs, and packet captures around the incident window.
- Isolate scope. Determine whether the issue is local (single host/agent), segment-wide, or global.
- Apply targeted fixes. Use the symptom-specific quick fixes below.
- Validate and monitor. Confirm the fix resolved the issue and continue to monitor for recurrence.
- Document root cause and remediation. Record findings, the fix applied, and preventive steps.
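The "check basic system health" and "confirm service status" steps can be scripted as a quick snapshot. The following is a minimal sketch under stated assumptions, not a Network Monitor II feature: it assumes a Linux host with /proc/meminfo and systemd, reuses the `netmon2` unit name from the restart command in this article, and the 90% limits are illustrative values to adjust for your environment.

```shell
#!/usr/bin/env bash
# Quick health snapshot for the monitoring host (Linux assumed).
# The 90% limits and the netmon2 unit name are illustrative assumptions.

# Print the used-space percentage (digits only) for the filesystem holding $1.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Print used memory as a whole-number percentage.
# Reads /proc/meminfo by default; $1 can point at another file for testing.
mem_used_pct() {
    awk '/^MemTotal:/ { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' \
        "${1:-/proc/meminfo}"
}

# Succeed when value $1 is strictly below limit $2.
below_limit() {
    [ "$1" -lt "$2" ]
}

# Print one OK/WARN line per check; call this by hand on the monitoring host.
health_snapshot() {
    local disk mem
    disk="$(disk_used_pct /)"
    mem="$(mem_used_pct)"
    below_limit "$disk" 90 && echo "OK   disk ${disk}%" || echo "WARN disk ${disk}%"
    below_limit "$mem" 90 && echo "OK   mem  ${mem}%" || echo "WARN mem  ${mem}%"
    systemctl is-active --quiet netmon2 && echo "OK   netmon2 active" || echo "WARN netmon2 not active"
}
```

Sourcing the file and running `health_snapshot` prints one line per check, which is convenient to paste into the incident notes alongside the timestamps from step one.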
Common symptoms and quick fixes
Symptom: Monitoring data stopped updating
- Check service processes. If core services stopped, restart them gracefully. On many systems:
  sudo systemctl status netmon2
  sudo systemctl restart netmon2
- Verify database connectivity. Ensure the monitoring backend (SQL/NoSQL) is reachable and not full. Clear old data or increase disk space if necessary.
- Inspect agent connectivity. Confirm agents report heartbeat. If not, verify agent configuration and network reachability (firewalls, routing).
- Look at retention settings. Aggressive retention/rollup jobs can temporarily pause updates—ensure scheduled maintenance jobs have completed.
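To separate "the database or collector is down" from "the network path is blocked", a plain TCP reachability probe is often enough. The sketch below relies on bash's built-in /dev/tcp redirection and coreutils `timeout`; the host names and ports in the usage comments are placeholders, since the actual endpoints depend on your deployment.

```shell
#!/usr/bin/env bash
# Succeed when a TCP connection to host $1, port $2 opens within $3 seconds
# (default 3). Uses bash's /dev/tcp redirection, so no extra tools are needed.
port_open() {
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example with placeholder hosts and ports (substitute your own):
#   port_open db.example.internal 5432 && echo "database reachable"
#   port_open collector.example.internal 9000 || echo "collector unreachable"
```

A successful probe only shows that something accepts connections on that port; it does not prove the database is healthy or has free space, so pair it with the disk checks above.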
Symptom: High CPU or memory usage on Monitor host
- Identify the culprit process. Use top/htop to locate the highest consumers.
- Adjust collection frequency. Reduce polling or capture rates temporarily.
- Tune buffer sizes. Lower memory usage by decreasing buffer and cache sizes in Network Monitor II configuration.
- Scale out. Add a secondary monitoring node or offload storage/processing to separate servers.
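When top or htop is not convenient (for example over a slow SSH session), the same "identify the culprit" step can be done with `ps` and a small filter. This is a sketch that assumes the procps version of `ps` found on most Linux distributions; the 50% limit is an arbitrary example.

```shell
#!/usr/bin/env bash
# Print rows whose %CPU column (2nd field) exceeds the limit given as $1.
# Expects `ps -eo pid,%cpu,%mem,comm` style input on stdin; the header row is skipped.
# The command name is kept last so names containing spaces do not shift the columns.
over_cpu() {
    awk -v limit="$1" 'NR > 1 && $2 + 0 > limit + 0 { print $0 }'
}

# Usage on a live host (procps ps assumed; --sort is a procps option):
#   ps -eo pid,%cpu,%mem,comm --sort=-%cpu | over_cpu 50
```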
Symptom: Packet loss or incomplete captures
- Check NIC settings. Ensure the network card supports promiscuous mode and that offloads aren’t interfering (disable GRO/GSO/LRO if needed).
- Increase capture buffers. Expand pcap buffer sizes to avoid drops during bursts.
- Tighten capture filters. Restrict capture filters to only the required protocols to lower volume.
- Use hardware timestamping. If timing accuracy is important, enable NIC hardware timestamping to reduce jitter.
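Before changing NIC settings, it helps to confirm that the kernel is actually dropping packets on the capture interface. The sketch below reads the receive-drop counter from /proc/net/dev on Linux; the commented commands show standard `ethtool` and `tcpdump` options, with `eth0` and the sizes as placeholders. Network Monitor II may expose its own capture buffer settings, which this example does not cover.

```shell
#!/usr/bin/env bash
# Print the kernel's receive-drop counter for interface $1.
# Reads /proc/net/dev by default; $2 can point at a saved copy.
rx_drops() {
    tr ':' ' ' < "${2:-/proc/net/dev}" | awk -v ifc="$1" '$1 == ifc { print $5 }'
}

# Commands to review and run by hand (eth0 and the sizes are placeholders):
#   ethtool -k eth0                                  # show current offload settings
#   sudo ethtool -K eth0 gro off gso off lro off     # some NICs reject fixed features
#   sudo ethtool -G eth0 rx 4096                     # enlarge the RX ring, if supported
#   sudo tcpdump -i eth0 -B 65536 -w incident.pcap   # -B is in KiB, so 64 MiB here
```

Sampling `rx_drops eth0` before and after a traffic burst shows whether a change to buffers or offloads made a difference.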
Symptom: Alerts not firing or too many false alerts
- Verify alert rules. Ensure thresholds and conditions are correct and not inverted.
- Check notification channels. Test SMTP/webhook/Slack/Teams integrations and credentials.
- Rate-limit noisy sources. Apply suppression or aggregate rules for noisy devices.
- Use anomaly detection. Implement baseline-based alerts to reduce false positives from expected spikes.
Symptom: UI slow or unresponsive
- Inspect web service logs. Look for backend timeouts, database query slowdowns, or resource saturation.
- Enable caching. Activate UI-side caching for dashboards and templates.
- Paginate heavy views. Break large queries into paginated components to reduce load.
- Upgrade web server resources. Increase CPU, memory, or move to a dedicated UI node.
Configuration checks
- Confirm correct time synchronization (NTP/Chrony) across all nodes—misaligned clocks cause log and alert confusion.
- Ensure TLS certificates are valid for encrypted communications between agents, collectors, and UI.
- Validate access controls and API keys; expired or rotated keys commonly cause agent failures.
- Review firewall rules and network ACLs to confirm required ports (agent → collector, collector → DB, UI → collector) are open.
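The certificate check in particular is easy to automate. The sketch below uses `openssl x509 -checkend`, which exits non-zero when a certificate expires within the given number of seconds; the certificate path in the usage comment is a placeholder, because where Network Monitor II stores its certificates is not covered here.

```shell
#!/usr/bin/env bash
# Succeed when the PEM certificate in file $1 stays valid for at least $2 more days.
cert_valid_for_days() {
    openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# Usage (placeholder path):
#   cert_valid_for_days /etc/ssl/certs/collector.pem 14 || echo "renew collector cert soon"
#
# Time synchronization can be checked by hand with:
#   chronyc tracking      # when chrony is in use
#   timedatectl status    # on systemd hosts
```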
Log and capture analysis tips
- Use time-correlated logs: align agent, collector, and database logs by timestamp to track event flow.
- Search for common error strings (authentication failure, connection refused, disk full, OOM).
- For packet analysis, focus on packets immediately before and after the incident timestamp—look for retransmissions, RSTs, or ICMP unreachable messages.
- Use filters to isolate problematic protocols or IP ranges.
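The error-string search can be wrapped in a small helper so the same patterns are applied to every log. This is a sketch: the pattern list extends the strings named above with two well-known variants ("No space left on device" and the kernel's "oom-kill" messages), and the log paths in the usage comments are placeholders.

```shell
#!/usr/bin/env bash
# Print matching lines (with line numbers) for common failure strings.
# Accepts one or more log files; matching is case-insensitive.
scan_errors() {
    grep -Ein 'authentication failure|connection refused|disk full|no space left on device|out of memory|oom-kill' "$@"
}

# Usage (placeholder paths):
#   scan_errors /var/log/netmon2/collector.log /var/log/netmon2/agent.log
#
# For packet captures, a read filter narrows the view to resets and ICMP unreachables:
#   tcpdump -nn -r incident.pcap 'tcp[tcpflags] & tcp-rst != 0 or icmp[icmptype] == icmp-unreach'
```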
Preventive measures
- Implement regular health checks and synthetic transactions that simulate typical traffic.
- Set up capacity alerts for disk, CPU, memory, and database growth.
- Automate log rotation and archival to prevent storage exhaustion.
- Run periodic configuration audits and test disaster recovery procedures.
- Keep Network Monitor II and its dependencies patched to the latest stable releases.
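As one way to automate archival, old logs can be compressed on a schedule. The sketch below is a stand-in for a proper logrotate policy, which is usually the better long-term choice; the directory in the usage comment is a placeholder, and the helper assumes `find` and `gzip` are available.

```shell
#!/usr/bin/env bash
# Compress *.log files under directory $1 that were last modified
# more than $2 days ago. Files already compressed are left alone
# because they no longer match the *.log pattern.
archive_old_logs() {
    find "$1" -type f -name '*.log' -mtime +"$2" -exec gzip -- {} +
}

# Usage from cron or a systemd timer (placeholder path):
#   archive_old_logs /var/log/netmon2 7
```

Avoid pointing this at files a running service still writes to; compressing a log that is open for writing can lose data.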
When to escalate
- Repeated crashes or data corruption: open a support case with the vendor and provide logs and packet captures.
- Possible security incidents (unauthorized access, unusual outbound traffic)—follow incident response and isolate affected nodes.
- Performance issues that persist after tuning—consider architecture review for horizontal scaling.
Example debugging checklist (quick copy)
- Reproduce issue and note timestamps
- Check service status and restart if down
- Verify disk space and DB connectivity
- Confirm agent heartbeats and network reachability
- Collect logs and pcap for incident window
- Apply targeted fix (restart, config tweak, buffer increase)
- Validate fix and monitor for recurrence
- Document root cause and remediation
Troubleshooting with Network Monitor II becomes faster with a disciplined approach: reproduce reliably, collect evidence, apply minimal targeted fixes, validate, and document. Over time, the preventive steps above will reduce incidents and shorten recovery when problems do occur.