All the general rules of fixing things apply, but there are also technology-specific considerations.
Before you ever panic or give up, run a simple web search for your problem, since others have probably already found the answer.
- To be more thorough, use search operators to narrow what you’re looking for.
- Look for similar model numbers or alternate-language implementations of the same thing.
- What worked for others may not work for you, but how you can adapt what worked matters more than exactly what they did.
Most of the industry is either “break-fix” (paid per fix) or “managed services” (paid to keep everything running).
Networked
Typically, it helps to triple-check that every networked component is offline, and preferably unplugged.
- A powered-off device can still register as connected (especially network switches).
Network Chains
As counter-intuitive as it sounds, move things around frequently to see if anything changes.
- Change out port locations, plug things into various locations, swap out hardware.
- Often, the programmer wrote code that worked for the original situation (e.g., check ports 1 and 2 on a two-port switch), and it was never updated for a later hardware release (e.g., check all available ports).
- Sometimes, the code may need something to change to escape an endlessly looping subroutine.
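As a minimal sketch of the hard-coded port problem above (the function names and the port-status dictionaries are made up for illustration, not taken from any real firmware):

```python
def link_up_hardcoded(port_status: dict[int, bool]) -> bool:
    """Written for the original two-port switch: only ports 1 and 2 are checked."""
    return port_status.get(1, False) or port_status.get(2, False)


def link_up_general(port_status: dict[int, bool]) -> bool:
    """Updated for later hardware: check every port the device reports."""
    return any(port_status.values())


# Hypothetical four-port switch with the cable plugged into port 3.
cable_in_port_3 = {1: False, 2: False, 3: True, 4: False}
# The same switch after moving the cable to port 1.
cable_in_port_1 = {1: True, 2: False, 3: False, 4: False}

old_code_sees_port_3 = link_up_hardcoded(cable_in_port_3)  # False
old_code_sees_port_1 = link_up_hardcoded(cable_in_port_1)  # True
new_code_sees_port_3 = link_up_general(cable_in_port_3)    # True
```

Moving the cable changes nothing about the hardware, but it changes what the old code can see, which is why swapping ports is worth trying.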
Your most difficult challenge will first be in making the problem reproducible, then in localizing it.
- If the issue keeps cropping up in the same area, split that area in half as many times as possible.
- Brainstorm different perspectives on how it could arise or how you could fix it (e.g., mind maps).
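Splitting the area in half repeatedly is a binary search. Here is a rough sketch, assuming that exactly one component is at fault, that the chain works with nothing enabled, and that you have some way to test the first n components together (the `works_up_to` function and the component names are hypothetical stand-ins for that test):

```python
from typing import Callable, Sequence


def find_first_bad(items: Sequence[str], works_up_to: Callable[[int], bool]) -> int:
    """Return the index of the item whose presence first breaks the chain.

    works_up_to(n) reports whether the system works with only the first
    n items enabled. Assumes it works with 0 items and fails with all of them.
    """
    low, high = 0, len(items)  # known to work with `low` items, fail with `high`
    while high - low > 1:
        mid = (low + high) // 2
        if works_up_to(mid):
            low = mid   # the fault is somewhere after the first `mid` items
        else:
            high = mid  # the fault is within the first `mid` items
    return low


# Hypothetical chain of components; "switch-b" is the broken one.
chain = ["modem", "router", "switch-a", "switch-b", "access-point", "printer"]


def works_up_to(n: int) -> bool:
    # Stand-in for a real test: the chain works until switch-b is included.
    return "switch-b" not in chain[:n]


culprit = chain[find_first_bad(chain, works_up_to)]
```

With six components, this needs only two or three tests instead of up to six, and the savings grow quickly as the chain gets longer.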
Perception
The ability to know exactly which details are significant can only come from experience.
- A network technician with 2 years of hands-on experience with that particular software or hardware is worth one with 6 or 10 years on anything else.
Repetition
One fortunate aspect of most computer troubleshooting (with the important exception of anything involving AI) is that the system is highly fine-tuned, so it’s unlikely that more than one thing broke at once.
Complications
Since computers are inherently complicated, do not do anything to make things more complicated. This is not easy for the types of people who use computers.
First, before anything else, ask what the simplest likely thing would be that could have failed. It’s usually not very interesting, which makes it easy to overlook.
- Check that everything has been rebooted.
- Log out and log back in to everything.
- Check the version numbers to be sure everything is up-to-date.
- Make sure the configurations are set correctly.
- Check the network to be sure it’s working correctly.
To avoid reference issues, don’t let a computer run updates or install anything while it’s busy with something else:
- The code on the computer is instructed to write information to Point A.
- After Point A has been designated, but before anything is written, a tech-savvy user changes Point A into Point B because they were trying to be efficient with something else.
- Computer writes to Point A.
- Computer later glitches out because everything that was relative to Point A is now only accurate relative to Point B.
- Worst-case scenario is that the tech-savvy user must do something far more dramatic, like reinstall the OS or extract data from a hard drive.
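The sequence above can be reproduced in miniature. This is only an illustrative sketch: the directory names `point_a` and `point_b`, the file `settings.cfg`, and the idea of an "installer" are assumptions made for the example, with a folder rename standing in for whatever the user changed.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    point_a = Path(root) / "point_a"
    point_b = Path(root) / "point_b"
    point_a.mkdir()

    # Step 1: the installer decides where it will write, but has not written yet.
    planned_target = point_a / "settings.cfg"

    # Step 2: meanwhile, the user turns Point A into Point B.
    point_a.rename(point_b)

    # Step 3: the installer writes to the location it planned earlier,
    # recreating Point A because it no longer exists.
    planned_target.parent.mkdir(exist_ok=True)
    planned_target.write_text("version=2\n")

    # Step 4: everything else now looks under Point B and finds nothing.
    found_where_expected = (point_b / "settings.cfg").exists()
    written_to_stale_path = planned_target.exists()
```

Nothing raised an error at any step, which is what makes this kind of failure hard to trace afterward.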
If you must roll back updates, turn off the auto-update features first, and make sure to roll back all the connected dependencies. Rolling back is like heart surgery, so only do it if you have no choice.
Try to avoid updating multiple systems at once, and reboot each time if possible.
Repairing
The best way to repair depends heavily on the domain.
It’s always important to have done some preventative work before you needed to repair it:
- Have the same or similar extra hardware available for replacement.
- Keep offline media of the current software versions available, or at least have another means to connect to the source of that software (e.g., mobile hotspot cellphone subscription).
- Keep ready access to the precise technical documentation that indicates how to reset or reinstall something.
The easiest preventative measure is to always keep multiple backups.
- If you’re pressed for storage space, space out the backup cycle as you go farther back (e.g., keep a copy for each week for the past month, a copy from every month for the past year, etc.).
- If you must manually run the backup, you should be spending more time saving backups than loading them.
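A thinning schedule like the one above can be sketched in a few lines. The exact policy here is an assumption for illustration (newest backup per ISO week for the last 28 days, newest per calendar month for the last 365 days), and `backups_to_keep` is a hypothetical helper, not part of any backup tool:

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates: list[date], today: date) -> set[date]:
    """Pick which backups survive under a thinning schedule.

    Assumed policy: the newest backup in each ISO week for the last 28 days,
    plus the newest backup in each calendar month for the last 365 days.
    """
    newest_per_bucket: dict[tuple, date] = {}
    for d in sorted(backup_dates):
        age = (today - d).days
        if age < 0 or age > 365:
            continue  # in the future or too old: not kept
        buckets = [("month", d.year, d.month)]
        if age <= 28:
            iso = d.isocalendar()
            buckets.append(("week", iso[0], iso[1]))
        for bucket in buckets:
            # Dates are processed oldest first, so later ones overwrite earlier ones.
            newest_per_bucket[bucket] = d
    return set(newest_per_bucket.values())


# Example: 400 daily backups, thinned down to a handful.
today = date(2024, 6, 30)
daily = [today - timedelta(days=n) for n in range(400)]
kept = backups_to_keep(daily, today)
to_delete = sorted(set(daily) - kept)
```

Review a list like `to_delete` by hand before wiring it to anything that actually removes files.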
For the most part, software fixes simply require having the software pre-downloaded for quick transfer, but it’s worth keeping some hardware available, just in case:
- USB drive loaded with a plethora of diagnostic apps, preferably two of them (one for Windows, and the other with a lightweight Linux distro on the drive)
- A wired USB keyboard, which is less trouble to set up
- A Bluetooth keyboard, for mobile devices
- Non-magnetized screwdriver set with Torx, flat, and cross-point drivers
- Antistatic mat or antistatic wrist strap
- Head-mounted magnifier or magnifying glass
- POST card for boot issues
- Loopback plug for network diagnosis
- Multimeter for testing circuitry
- Power supply tester
- Soldering iron with solder wire
Any laziness whatsoever in fixing bugs will magnify the problem, no matter how minor:
- If it re-emerges in 3 months, you’ll likely have forgotten about it in the interim.
- When someone else encounters it, they’ll go through the same journey you did.
- That bug may be unimportant right now, but will become much worse if the system ever scales.
At the same time, enough false positives will make the system insensitive to actual issues, so the logs and staff should aim to communicate only legitimate issues.
Hardware
Find a sufficient replacement that does the job.
- It can be an upgrade if the situation permits (e.g., keyboard, mouse), but make sure it’s compatible before getting it.
- Don’t worry too much about overkill (e.g., a newer model with more features) or reliability, since you can replace it again when it’s not urgent.
Software
Try to reinstall or reload the code.
- If you have access to the code, you may be able to change a reference, but don’t try rebuilding the code until after it’s back online and no longer urgent.
If anything depends on it, don’t upgrade it.
- Unless a dependency elsewhere has upgraded and dropped support for the current version, try to reinstall what existed.
- Avoid rolling out updates in the middle of a repair: software updates will frequently overwrite hundreds, maybe thousands of references, and you may need to debug new updates and features on top of addressing your current problem.
Data
Unfortunately, recovering data can be tremendously difficult.
- Look for proprietary software to recover the data, which may require decrypting the device’s memory.
- If the data is particularly secure or proprietary, you may need another piece of hardware that’s precisely the same type (e.g., a specific brand of disk drive).
- Sometimes, you’ll simply have to hack the solution by ripping out the data yourself, then find the protocol that translates the raw data into a usable format.
Distributed Systems
In the chain of various systems, all you need is one failure to make the entire thing fall apart:
- Wherever integers will do, avoid floats. With currency (e.g., storing cents as whole numbers), that’s the only way to keep money from randomly disappearing or appearing.
- Make sure the UUIDs are all interpreted as case-insensitive or with a shared convention.
- Watch for time-based discrepancies, such as asynchronous data flow or updating information that retroactively makes other information obsolete. This can be particularly nasty if a time server (e.g., NTP) shifts a few seconds or an hour for some reason. The solution is to use one centralized time source, while also recording which system’s clock produced each timestamp.
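The three points above each have a small defensive habit behind them. The sketch below uses only the Python standard library; the helper names `same_uuid` and `stamp`, and the `"clock"` field, are assumptions made for this example:

```python
import uuid
from datetime import datetime, timezone

# Money: floats accumulate rounding error, integer cents do not.
float_total = sum([0.10] * 3)  # three dimes as floats
cents_total = sum([10] * 3)    # three dimes as integer cents
float_is_exact = float_total == 0.30


# UUIDs: compare parsed values rather than raw strings, so letter case
# stops mattering.
def same_uuid(a: str, b: str) -> bool:
    return uuid.UUID(a) == uuid.UUID(b)


# Time: store timestamps in UTC and record which clock produced them.
def stamp(clock_source: str) -> dict[str, str]:
    return {
        "at": datetime.now(timezone.utc).isoformat(),
        "clock": clock_source,
    }


upper = "123E4567-E89B-12D3-A456-426614174000"
lower = "123e4567-e89b-12d3-a456-426614174000"
strings_match = upper == lower       # raw comparison disagrees
uuids_match = same_uuid(upper, lower)  # parsed comparison agrees
event = stamp("db-server-1")
```

Here `float_is_exact` comes out `False` while `cents_total` is exactly 30, and the two UUID strings only match once they are parsed.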
Pay attention to the slowest part of the system. Any new subsystem that resolves slowdown will likely become the new “slowest part of the system”.
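One rough way to find the slowest part is to time each stage separately. This sketch assumes the system can be broken into stages you can call one at a time; the stage names and the `time.sleep` calls are placeholders for real work:

```python
import time
from typing import Callable


def slowest_stage(stages: dict[str, Callable[[], None]]) -> tuple[str, dict[str, float]]:
    """Run each stage once and return the slowest stage's name plus all timings."""
    timings: dict[str, float] = {}
    for name, run in stages.items():
        start = time.perf_counter()
        run()
        timings[name] = time.perf_counter() - start
    return max(timings, key=timings.get), timings


# Hypothetical pipeline; the sleeps stand in for real processing time.
pipeline = {
    "parse": lambda: time.sleep(0.01),
    "query": lambda: time.sleep(0.05),
    "render": lambda: time.sleep(0.02),
}

bottleneck, timings = slowest_stage(pipeline)
```

After speeding up whatever `bottleneck` names, measure again, since a different stage will usually take its place.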
AI
If a model’s training data has been poisoned, you have several options:
- Start all over and retrain. This is technically the most obvious, but also the most time-consuming and potentially the most expensive.
- Train the entire model on fixed, predictable, safe data, which dilutes the poison. It’s not foolproof, but it’s technically the lowest-effort, and further exposure to good data will make the model fix itself over time.
- Delete and retrain the specific faulty nodes. If you can pull it off, this is ideal.
Postmortem
After fixing anything, always document the new rules and what happened. Otherwise, you’ve made life worse for someone else (including Future You).
The industry tends to not respect the people who repair the computers, so keep an eye on who you help, how much, and how much they pay you for the service relative to your investment.
Case Studies
Weird failures:
- Print this file, your printer will jam
- We can’t send email more than 500 miles
- Car allergic to vanilla ice cream
Impressive fixes: