Skip to main content

Server Triage Checklist

First Commands

date -Is
hostnamectl
uptime
who

Capture time, host identity, load, and active users before changing anything.

Resource Snapshot

free -h
df -h
ps -eo pid,cmd,%cpu,%mem --sort=-%cpu | head

Service Snapshot

systemctl --failed
systemctl status nginx --no-pager
journalctl -p warning -n 100 --no-pager

Network Snapshot

ss -tulpn
ip addr
ip route

Safe Rule

Observe first, then change one thing at a time. Keep a terminal log when handling incidents.

What's Next

Server Environment Context

This lesson matters in server operations because Server Triage Checklist supports structured diagnosis, debug traces, and server triage without destructive guesses. On a workstation, a mistake may affect one project. On a server, the same mistake can interrupt users, hide evidence, weaken access control, or make recovery harder.

Use the commands in this lesson with three questions in mind:

  • What system state am I about to inspect or change?
  • What evidence should I capture before changing it?
  • How will I prove the server is healthier after the command runs?

Operational Runbook Pattern

Use this repeatable pattern when applying the lesson on a real host:

PhaseGoalBash Habit
IdentifyConfirm host, user, and scopehostname, id, pwd
InspectRead state before modifying itsystemctl status, ls -la, ss -tulpn
ChangeMake the smallest safe changeQuote paths and prefer explicit options
VerifyConfirm the intended resultCheck exit status, logs, and service health
RecordLeave a useful audit trailSave command output or ticket notes

Example session header:

printf 'time=%s host=%s user=%s cwd=%s
' "$(date -Is)" "$(hostname)" "$(id -un)" "$(pwd)"

Pre-Flight Checklist

Before running commands from this lesson on a production server, check:

  • You are connected to the intended host.
  • You know whether the command is read-only or state-changing.
  • You have a rollback or recovery path for state-changing work.
  • You understand whether sudo is required and why.
  • You have captured current service, disk, or network state if the work is risky.

Useful pre-flight commands:

hostnamectl 2>/dev/null || hostname
id
uptime
systemctl --failed 2>/dev/null || true

Production Safety Notes

RiskSafer Practice
Running on the wrong hostPrint hostname and environment name first
Accidentally expanding pathsQuote variables: "$path"
Losing evidenceCopy logs or capture journalctl output before cleanup
Silent failureUse set -euo pipefail in scripts and check exit codes interactively
Over-broad sudo usageRun the smallest command possible with elevated permissions

When a command can delete, overwrite, restart, reload, or reconfigure something, do a dry run or read-only inspection first.

Validation Commands

After applying the technique from this lesson, validate with commands appropriate to the changed area:

printf 'exit_status=%s
' "$?"
systemctl --failed 2>/dev/null || true
journalctl -p warning -n 50 --no-pager 2>/dev/null || true
df -h
ss -tulpn 2>/dev/null || true

For application-facing changes, add an endpoint or process check:

curl -fsS http://127.0.0.1:8080/health >/dev/null || true
ps -eo pid,cmd,%cpu,%mem --sort=-%cpu | head

Automation Example

The following template shows how to turn this lesson into a repeatable server check. Adapt names and commands before using it.

#!/usr/bin/env bash
set -euo pipefail

log() {
printf '%s INFO %s
' "$(date -Is)" "$*" >&2
}

die() {
printf '%s ERROR %s
' "$(date -Is)" "$*" >&2
exit 1
}

run_03_server_triage_checklist_check() {
log 'running Server Triage Checklist validation'
hostname >/dev/null
uptime >/dev/null
}

run_03_server_triage_checklist_check "$@"

Troubleshooting Flow

If the expected result does not appear, diagnose in this order:

  1. Confirm the command ran on the correct host and shell.
  2. Check whether the command failed with a non-zero exit status.
  3. Re-run the read-only inspection command with more explicit paths or options.
  4. Check recent logs for permission, path, DNS, disk, or service errors.
  5. Undo only the specific change you made, not unrelated user or system changes.

Useful debug commands:

set -x
# repeat the smallest failing command here
set +x
printf 'PATH=%s
' "$PATH"
type command 2>/dev/null || true

Practice Lab

Use a non-production VM, container, or temporary directory for practice:

  1. Capture a baseline using date -Is, hostname, uptime, and df -h.
  2. Apply the main command pattern from Server Triage Checklist to a safe test target.
  3. Intentionally trigger one harmless failure, such as a missing file or inactive service.
  4. Capture the error message and explain what Bash exit status it produced.
  5. Convert the manual check into a small script with logging and validation.

Review Questions

  • Which commands in Server Triage Checklist are read-only, and which can change server state?
  • What is the safest way to test the command before using it on production data?
  • What log, service, or health check proves the operation succeeded?
  • What rollback step would you use if the result is wrong?
  • Which parts of the process should be automated, and which should remain manual?

Field Notes

Server work rewards boring, explicit commands. Prefer commands that can be pasted into a runbook, reviewed by another operator, and repeated during an incident without relying on memory.

Keep lesson examples as starting points, not blind copy-paste snippets. Adjust paths, service names, package names, ports, and users to match the actual server environment.