Step 1: List what you're actually running
Every assessment starts with an honest inventory, and honest is the operative word. Catalog each AI system and use case: what it does, which model it uses, who owns it, what data it touches, where it sits in a real business process. Include the unofficial stuff: the staff quietly calling hosted models, the script someone wired to an API last spring. That is where the risk an org chart misses tends to live. If you cannot list a system, you cannot assess it, so treat discovery as the actual first step rather than a box to tick.
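One way to keep the inventory honest is to hold it as structured records and let a script point at the blanks. The sketch below is illustrative only: the `AISystem` fields mirror the questions above, while the system names, vendors, and the `sanctioned` flag are made-up assumptions, not a standard schema.

```python
from dataclasses import dataclass


@dataclass
class AISystem:
    """One row of the inventory (hypothetical schema)."""
    name: str
    purpose: str                # what it does
    model: str                  # which model it uses
    owner: str                  # who is accountable for it
    data_categories: list[str]  # what data it touches
    business_process: str       # where it sits in a real process
    sanctioned: bool = True     # False for the unofficial stuff


REQUIRED = ["purpose", "model", "owner", "data_categories", "business_process"]


def missing_fields(system: AISystem) -> list[str]:
    """Return the required fields that are still empty for this system."""
    return [name for name in REQUIRED if not getattr(system, name)]


# Example data is invented for illustration.
inventory = [
    AISystem("support-summarizer", "Summarize support tickets",
             "hosted LLM (vendor A)", "Support Ops",
             ["customer PII"], "ticket triage"),
    AISystem("spring-api-script", "Draft outreach emails",
             "hosted LLM (vendor B)", "", [], "sales outreach",
             sanctioned=False),
]

# Systems that cannot be assessed yet because the inventory entry is incomplete.
gaps = {s.name: missing_fields(s) for s in inventory if missing_fields(s)}
# Unofficial usage that discovery turned up.
unsanctioned = [s.name for s in inventory if not s.sanctioned]

print(gaps)
print(unsanctioned)
```

An entry with no owner or no data categories is not ready for classification, which is the point: the gaps list is the discovery backlog.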
Step 2: Sort systems by how much they can hurt you
Rank each use case by the harm it could do and the exposure it creates. Look at the sensitivity of the data, whether the output affects a person directly, the regulatory regime it falls under, and how autonomously the thing acts. The EU AI Act's risk tiers (unacceptable, high, limited, minimal) give you ready-made levels to anchor on, and the NIST AI Risk Management Framework gives you a structured process for mapping and measuring risk, so you do not have to invent a scale. What you want out of this step is a clean split: the systems that need heavy controls on one side, the ones that need a light touch on the other. Effort should land where the risk is, not spread evenly.
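As a rough sketch of that split, the four factors can be turned into a simple score. The weights and the threshold of 4 below are assumptions chosen for illustration; they are not taken from the EU AI Act or the NIST framework, and a real program would map to whichever levels it has adopted.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    data_sensitivity: int  # 0 = public, 1 = internal, 2 = sensitive or personal
    affects_person: bool   # output directly affects an individual
    regulated: bool        # falls under a specific regulatory regime
    autonomy: int          # 0 = suggests only, 1 = acts with approval, 2 = acts alone


def risk_score(uc: UseCase) -> int:
    """Illustrative additive score; the weights are assumptions."""
    return (uc.data_sensitivity + uc.autonomy
            + 2 * int(uc.affects_person) + 2 * int(uc.regulated))


def control_level(uc: UseCase, threshold: int = 4) -> str:
    """Split use cases into heavy-control and light-touch groups."""
    return "heavy" if risk_score(uc) >= threshold else "light"


# Example use cases are invented for illustration.
use_cases = [
    UseCase("internal-doc-summarizer", 1, False, False, 0),
    UseCase("loan-screening-agent", 2, True, True, 1),
]

# Highest-risk use cases first, so effort lands where the risk is.
ranked = sorted(use_cases, key=risk_score, reverse=True)
for uc in ranked:
    print(uc.name, risk_score(uc), control_level(uc))
```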
Step 3: Write down how each one breaks
For every higher-risk system, spell out the failure modes. Sensitive data leaving to an external model. Prompt injection steering an agent off its task. A model producing unsafe or biased output. Usage drifting past the purpose it was approved for. Be specific to the system in front of you. A generic checklist applied to everything is how real exposure gets missed, because the threat that matters for an agent with tool access is not the threat that matters for a summarizer.
Step 4: Test the controls, don't assume them
Against each threat, ask what currently stops it, and be ruthless. Is sensitive data actually handled before it reaches a model, or merely discouraged in a policy nobody reads? Is access checked, or assumed? Is there a usable trail of what the AI did, or just application logs that happen to exist? This is the uncomfortable step, and it is the one that earns the assessment. A rule nobody applies in practice is a rule that fails the first time it is tested, and most teams find their real exposure sitting right in that gap between written policy and what the systems actually do.
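The gap between written policy and live behavior can be made visible by recording, for each control, whether it is merely documented or actually enforced, and whether there is evidence to show for it. This is a minimal sketch: the `Control` fields, the status labels, and the example entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Control:
    threat: str
    description: str
    documented: bool    # written down in a policy
    enforced: bool      # technically applied on the live path
    evidence: str = ""  # pointer to proof that it works, if any


def control_status(c: Control) -> str:
    """Classify a control by what actually stands behind it."""
    if c.enforced and c.evidence:
        return "verified"
    if c.enforced:
        return "enforced, no evidence"
    if c.documented:
        return "paper only"
    return "missing"


# Example controls are invented for illustration.
controls = [
    Control("sensitive data sent to external model",
            "redaction before the model call", True, True,
            "redaction test run, ticket reference"),
    Control("unauthorized access to the assistant",
            "role check on each request", True, False),
    Control("no trail of agent actions",
            "structured audit log of tool calls", False, False),
]

# Anything that is not verified is where the real exposure tends to sit.
open_gaps = [(c.threat, control_status(c)) for c in controls
             if control_status(c) != "verified"]
for threat, status in open_gaps:
    print(f"{status}: {threat}")
```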
Step 5: Score what's left and decide
For each threat, weigh its likelihood and impact against the strength of the control that addresses it. What remains is the residual risk. Then make a call: accept it, add a control, or stop the use until it is safe. Push the high-impact threats with weak controls to the top, because they are the ones most likely to surface in production or in front of an auditor. Record the decision and the owner for each one. An assessment that ends in a list of worries helps nobody. An assessment that ends in accountable actions does.
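One common way to put numbers on this, shown here only as a sketch, is to multiply likelihood by impact and discount the result by control strength. The 1 to 5 scales, the 0 to 1 control strength, and the decision thresholds are assumptions for illustration; any scoring scheme should be calibrated to your own risk appetite.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    name: str
    likelihood: int          # 1 (rare) to 5 (almost certain)
    impact: int              # 1 (minor) to 5 (severe)
    control_strength: float  # 0.0 (no control) to 1.0 (fully effective)
    owner: str               # who is accountable for the decision


def residual_risk(t: Threat) -> float:
    """Inherent risk (likelihood x impact) reduced by control strength."""
    return t.likelihood * t.impact * (1.0 - t.control_strength)


def decision(t: Threat) -> str:
    """Map residual risk to an action; thresholds are illustrative."""
    r = residual_risk(t)
    if r >= 15:
        return "stop use"
    if r >= 6:
        return "add control"
    return "accept"


# Example threats are invented for illustration.
threats = [
    Threat("biased output in summarizer", 2, 2, 0.5, "Support Ops"),
    Threat("prompt injection on tool agent", 4, 5, 0.25, "Platform"),
    Threat("PII sent to external model", 3, 4, 0.5, "Security"),
]

# High-impact, weakly controlled threats rise to the top of the register.
register = sorted(threats, key=residual_risk, reverse=True)
for t in register:
    print(f"{residual_risk(t):5.1f}  {decision(t):12}  {t.owner:12}  {t.name}")
```

Each row carries both a decision and an owner, so the output is a list of accountable actions rather than a list of worries.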
Step 6: Evidence it, then keep it alive
Capture the whole thing as a record an auditor can walk through: systems, classification, threats, controls, residual risk, decisions. Then do not let it rot. AI systems change constantly as models get swapped, prompts evolve, and new use cases appear, so a one-time assessment goes stale fast. The strongest practice ties the review to what your AI is actually doing in production. When you can see how systems are being used day to day, your next assessment starts from real behavior rather than what you hoped they would do, and the review becomes a live control instead of an annual ritual.
Frequently asked questions
How often should an AI risk assessment be repeated?
Whenever a system materially changes, a model gets swapped, or a new use case appears, plus a regular sweep of the whole inventory, say quarterly. Tying the assessment to real usage evidence keeps it closer to live than a once-a-year ritual.
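Those triggers can be checked mechanically against the last recorded assessment. The sketch below assumes a hypothetical `Snapshot` record and a 90-day sweep interval standing in for "quarterly". It only covers model swaps, new use cases, and the periodic sweep; other material changes still need a human to flag them.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Snapshot:
    """State of a system at its last assessment (hypothetical record)."""
    model: str
    use_cases: frozenset[str]
    assessed_on: date


def reassessment_reasons(last: Snapshot, model_now: str,
                         use_cases_now: set[str], today: date,
                         sweep: timedelta = timedelta(days=90)) -> list[str]:
    """Return the reasons a reassessment is due; an empty list means none."""
    reasons = []
    if model_now != last.model:
        reasons.append("model swapped")
    new_cases = use_cases_now - last.use_cases
    if new_cases:
        reasons.append("new use case: " + ", ".join(sorted(new_cases)))
    if today - last.assessed_on >= sweep:
        reasons.append("periodic sweep due")
    return reasons


# Example values are invented for illustration.
last = Snapshot("model-v1", frozenset({"ticket summary"}), date(2024, 1, 10))

changed = reassessment_reasons(
    last, "model-v2", {"ticket summary", "refund drafting"}, date(2024, 2, 1))
stale = reassessment_reasons(
    last, "model-v1", {"ticket summary"}, date(2024, 4, 15))

print(changed)
print(stale)
```

Feeding `model_now` and `use_cases_now` from observed production usage, rather than from what teams report, is what keeps the check tied to real behavior.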
Which framework should systems be classified against?
The EU AI Act risk tiers and the NIST AI Risk Management Framework are the usual anchors. Use the Act's tiers as the classification levels and the NIST framework as the process around them, so that the high-risk systems that need strong controls are separated from the low-risk ones that do not.
What is the most common gap these assessments find?
The gap between written policy and what actually happens on a live model call. Plenty of organizations have rules on paper that nothing applies in practice, and that is exactly where exposure concentrates.