Hello, this is Hamamoto from TIMEWELL. This is the third article in our enterprise AI governance series, and we are finally getting into the on-the-ground work: when you actually audit AI, what do you look at, in what order, and with which tools.
Most companies have made it as far as “we are using AI internally.” The harder problem is what comes next. When the management committee asks, “Is our AI really safe?”, most owners go silent. Open the conventional checklists for SOC 2 or ISO 27001, and you will find almost nothing that evaluates what is happening inside the AI itself. In Grant Thornton’s 2026 AI Impact Survey, 46% of executives cited governance and compliance as a reason AI deployments are not progressing as expected. Without a way to look inside, you are left with three options: stop, ignore, or look the other way.
This article proposes a way out: the “3-layer review.” We split the audit into a data layer, a model layer, and an operations layer, and spell out what to check in each, which tools to use, and who is responsible. The earlier two installments—Enterprise AI Governance: SOC 2, ISO 27001, and ISO 42001 Audit Controls and Enterprise AI Governance: Compliance with Japan’s METI and MIC Guidelines—are the backdrop, so reading them alongside this piece will help the story connect.
Why an AI audit naturally splits into “three layers”
When people say “AI audit,” the conversation tends to drift toward usage guidelines: “Are we allowed to use ChatGPT?” “Is this Copilot configuration safe?” But the real question is one level deeper. You have to look at three time axes separately: what the model has been fed, how it has learned to behave, and how it keeps being used. That structure maps cleanly onto a “data layer, model layer, operations layer” split.
ISO/IEC 23894:2023 (Artificial intelligence — Guidance on risk management) was the first document to articulate this layered way of thinking at the international standard level. It is built on top of ISO 31000:2018 risk management, with AI-specific hazards layered on top: drift, hallucination, black-box behavior, adversarial inputs. Because 23894 itself is not certifiable, it is a “tool you use,” not a “certificate you obtain.” The de facto practice in 2026 is to combine it with the certifiable ISO/IEC 42001 (AI management system). To put it crudely: 42001 is the structural frame, and 23894 is the inspection mechanism that runs inside it.
A second reason for the three layers is that responsibility itself shifts. The data layer belongs to the data engineering function and data providers; the model layer to ML engineers and model providers; the operations layer to service developers and SREs. Try to cover all of it with a single checklist, and everyone simply says “that’s not my job,” and the audit stalls. AICPA and CPA Canada’s 2026 joint paper “Closing the AI Trust Gap: The Role of the CPA in AI Assurance” is best read exactly as a document that re-organizes “who is responsible, where, and how far.”
The 3-layer review is just an honest projection of this responsibility split and risk taxonomy. There is nothing exotic about it. The hard part is implementation: each layer needs its own calendar, its own tools, and its own participants, and many companies get stuck right there. From the next section, we will go through what to inspect in each layer.
Struggling with AI adoption?
We have prepared materials covering ZEROCK case studies and implementation methods.
Data-layer audit: provenance, bias, copyright, and PII in training data
The first line of defense is the data layer. The old saying still holds: feed it garbage, and it will return garbage. The arrival of generative AI did not change that principle. If anything, the scale of training data has outgrown what business judgment alone can grasp, and the damage when garbage sneaks in is bigger than ever.
The first artifact you should produce in a data-layer audit is a “datasheet” modeled on the Datasheets for Datasets template. Proposed by Gebru et al. in 2018, it is a documentation format for datasets that captures the motivation, collection process, preprocessing, recommended use, distribution conditions, and so on, all in one place. Google’s “Data Cards” format is essentially the same lineage. Whether the data was bought externally or collected in-house, you should always create one such document per dataset. That single discipline cuts downstream audit cost by more than half.
Three concrete technical tasks follow. The first is provenance tracking. Use Amazon SageMaker ML Lineage Tracking on AWS, or DVC (Data Version Control) in the open-source world, to record which data, through which version of preprocessing, ended up in which model—as a lineage graph. The second is PII scanning. Microsoft’s Presidio ships, as of 2025, with more than 50 recognizers covering names, addresses, credit card numbers, medical entities, and more. Patterns like Wealthsimple’s “PII Redact Gateway,” which scrubs data before it ever reaches an LLM, are increasingly common. The third is copyright screening. If you are training directly on public corpora such as Common Crawl, you need to document a policy for how you interpret license terms and opt-out signals (noai, noimageai, and so on), and keep operational logs.
Bias inspection sits across the data layer and the next, model, layer. The University of Chicago DSSG’s “Aequitas” is the de facto tool for quantifying representation gaps across protected attributes (gender, age, race, and others). Since Aequitas Flow v1.0, it has covered not only inspection but also mitigation—resampling and reweighting—end to end. Microsoft’s Fairlearn plays much the same role and feels more natural for ML developers in the Python ecosystem. I generally recommend starting with Aequitas, simply because its UI-driven audit reports are polished enough to drop into a board deck.
Document your data-layer audit as a four-part set: datasheet, lineage log, PII scan results, and bias report. Refreshing this set quarterly is a realistic cadence in 2026. The technical documentation that the EU AI Act will require for high-risk AI starting in August 2026 ultimately reduces to “can you produce this four-part set, in each required language?” If you build it once up front, half of your regulatory work is already done.
Model-layer audit: accuracy, bias, fairness, and robustness
After data comes the model itself. The protagonist here is a document called the “Model Card,” supported by an ecosystem of evaluation tools.
The Model Card was proposed by Mitchell et al. in 2019 as a kind of “nutrition facts label” for models. On the Hugging Face Hub it has become the de facto standard format, capturing intended use, training data, evaluation metrics, expected failure modes, and usage caveats on a single sheet. For internal models or for fine-tuned open-source models alike, you should produce one Model Card per model. If a Datasheet is the ID for the data side, the Model Card is the ID for the model side. Only when both exist can you audit the data-to-model linkage.
For accuracy evaluation, the metric you choose decides what conclusions you can draw. The “Community Evals” feature added to the Hugging Face Evaluate library in 2026, which lets the community run leaderboards, took benchmark transparency to another level. Even for in-house models, recording numbers per version using Evaluate’s standard metrics (accuracy, precision, recall, F1, BLEU, ROUGE, and so on) lets you go back later and answer “when did accuracy regress?”—and that turns out to matter more than you would expect. In the field, what trips teams up is rarely the variance in metric values; it is the metric definition itself drifting over time. Locking your metrics first and creating a rule for how they may change in review tends to be a higher priority than refining the evaluation.
For bias and fairness evaluation, you reuse Aequitas and Fairlearn from the data layer. At the model layer, however, the question becomes “how is the post-training output skewed?”, so you also compute prediction-difference metrics such as Demographic Parity, Equal Opportunity, and Predictive Parity, on top of input distributions. Because tradeoffs with economic value are unavoidable here, choosing the metric definition is safer when the management committee signs off before implementation begins. Without putting “how do business KPIs change as a result of becoming fairer?” on the same slide, the field never lands a decision.
Robustness evaluation has, by now, become a central theme of AI auditing. IBM’s Adversarial Robustness Toolbox (ART), Microsoft’s PyRIT—used internally to evaluate Copilot—and the commercial offering Giskard are the 2026 staples. Automating tests by attack category—prompt injection, data poisoning, model extraction, evasion attacks—and wiring pass/fail thresholds into CI is now realistic. Treating robustness as a binary “did it / didn’t we?” is the worst posture: when an incident hits, the audit team finds nothing on the shelves. Always preserve test items and results per version.
Operations-layer audit: prompts, output logs, drift detection, and incident handling
The third layer is operations—auditing AI that is actually running in production. If the data and model layers are “before-the-fact” inspections, the operations layer is both “in-flight” and “after-the-fact.”
The center of gravity is collecting and evaluating output logs. For LLM applications, you keep entire traces: input prompts, model responses, token usage, cost, latency, and any related RAG context. For 2026, the open-source options include Langfuse, Arize Phoenix (self-hostable under ELL2), and on the enterprise side Datadog LLM Observability, Weights & Biases Weave, and Braintrust. Langfuse was acquired by ClickHouse in January 2026 but kept its OpenTelemetry-native design and permissive license. Arize Phoenix is strong on drift detection and evaluation templates, especially for visualizing RAG quality. The combination I most often recommend is to use Langfuse during development and experimentation, and centralize production monitoring in Datadog LLM Observability.
Drift detection is the mechanism that quantifies changes in feature distributions. Jensen-Shannon Divergence (JSD) and Population Stability Index (PSI), as standardized in Fiddler, are the de facto industry standards; Evidently AI, WhyLabs, and Amazon SageMaker Model Monitor all carry the same family of metrics. A realistic starting point is to monitor PSI weekly, fire automatic alerts when thresholds are crossed, and have a human triage from there. In the early days of production, your thresholds can be coarse. Run for a quarter or two, watch the false-positive and false-negative balance, and tune.
Design the incident-handling flow in five steps: detection, isolation, analysis, correction, and re-evaluation. Detection lives in monitoring (Datadog, Phoenix, Langfuse); isolation in feature flags or staged rollouts; analysis in trace logs and data lineage; correction in retraining or prompt adjustments; and re-evaluation by feeding the issue back into the model-layer evaluation set. COSO’s February 2026 generative-AI control guidance maps this flow onto the two elements “Control Activities” and “Monitoring Activities.” For an internal audit team, the quickest win is to put the new COSO guidance side by side and rewrite your runbook against it.
The EU AI Act, starting on August 2, 2026, will mandate automatic logging of inputs, outputs, and metadata for high-risk AI systems. Penalties run up to €15M or 3% of worldwide turnover, whichever is greater. Japanese companies can fall under extraterritorial application, so there is no harm in starting on the log-storage design today.
Choosing AI audit tools: open source vs SaaS
Many companies trip up on tooling, so let me sort out the selection logic. Up front: you do not need a dedicated tool for each of the three layers. Some platforms cover two layers; others slice things up by use case.
On the open-source side, every layer has a defacto tool: Aequitas (bias audit), Fairlearn (fairness), Hugging Face Evaluate (metrics), IBM ART (adversarial robustness), Microsoft Presidio (PII), Langfuse (LLM tracing), Arize Phoenix (LLM drift and evaluation), Evidently AI (data drift), DVC + MLflow (lineage). The strengths are threefold: cost is low, vendor lock-in is weak, and the inspection logic lives in your own repository, which makes it easy to explain to auditors. The trade-off is that you carry the operational burden and the documentation of internal controls yourself. It assumes a real engineering capability.
On the SaaS side, Datadog LLM Observability, Arize AX, Fiddler, WhyLabs, Weights & Biases Weave, Braintrust, and Confident AI are the 2026 mainstream. The strength is that they hold third-party certifications such as SOC 2 Type II and ISO 27001 already, which you can recycle directly into compliance reporting. The trade-off is that the architecture assumes data leaves your perimeter, so you have to design separately for keeping sensitive prompts and business context from leaking outside.
The setup I most often recommend in the field is a hybrid: “OSS for development and experimentation, SaaS for production monitoring.” Concretely: run Langfuse + Aequitas + Hugging Face Evaluate inside your perimeter for development, and centralize production runs in Datadog LLM Observability or Arize AX for SOC reports. Sensitive prompts stay on the OSS side, while the controls needed for compliance reporting are concentrated on the SaaS side. The Big 4 AI audit services—PwC’s GL.ai and H2O.ai partnership, EY Helix, KPMG Ignite, Deloitte’s AI-driven analytics—appear to be designed around the same hybrid premise. KPMG announced that AI-led auditing will go into full operation in summer 2026, with a $200M-scale investment ahead of full implementation in 2027. Building the same level entirely in-house is not realistic, so the practical best practice for the time being is to fill the coverage with three pieces: OSS, SaaS, and external audit.
One last point on tool selection: where do you audit AI agents? Agents span multiple models, multiple tools, and multiple data sources, so unless their traces are unified into a single thread, they are not auditable. As covered in our companion piece KPI Monitoring and Operational Design for AI Agents, operations-layer auditing in the agent era is safest if you assume from day one a three-piece set: per-step traces, drift indicators, and a human review queue.
Implementation steps to make the audit program stick inside your company
Finally, here is a six-step implementation plan to “start the 3-layer review and keep it going.” In practice, the most common failure in the first three months is, “We did the inspection, but the result belongs to no one.” More than the steps themselves, the keys are where you place responsibility and at what cadence.
Step 1 is fixing the responsibility split. Put the data layer with data engineering, the model layer with the ML/AI function, the operations layer with SRE and service development, and create a cross-cutting function with internal audit and legal. As AICPA’s “Closing the AI Trust Gap” notes, CPAs and internal auditors work best as “the cross-cutting final-check function across all layers.” Step 2 is asset inventory. List every model, where it runs, what data it uses, and who it serves. Without this list, your audit scope blurs. Step 3 is documenting each layer. Three artifacts per model: datasheet, Model Card, and operations runbook.
Step 4 is running inspections and accumulating logs. The data layer runs quarterly, the model layer at every new version release, and the operations layer weekly to daily. The realistic toolset here is the OSS-plus-SaaS hybrid described above. Step 5 is executive reporting and risk acceptance. Inspection results come out as numbers, and anything past threshold goes onto the management committee’s agenda. Using the COSO 2026 framework, a quarterly report mapped against the five elements (control environment, risk assessment, control activities, information & communication, monitoring activities) is about the right granularity. Step 6 is connecting to external audit and third-party assessment. The EU AI Act mandates third-party audits for high-risk AI. In Japan too, in finance, healthcare, and hiring, the pressure for external audit will only grow. To avoid scrambling later, keep the artifacts produced in steps 1–5 in a “shareable outside the company” format from the start.
Once the AI audit machinery is in place, running it turns out to be surprisingly light. Conversely, if you leave the “we’re just running it for now” state untouched, six months later IT and legal will be squeezing each other in the middle, and AI use itself will halt. In my experience, eight out of ten companies that are seriously stuck are operating with “only one of the three layers running.” Some companies have built up the data layer and feel safe; others collect operations-layer logs but never look at the contents. That kind of one-lung operation is the most dangerous configuration. You do not need to launch all three layers at once. You can sequence them. But every layer needs an owner and a cadence.
At TIMEWELL, as the provider of the enterprise AI platform “ZEROCK,” we have built the 3-layer review—data, model, and operations—into the platform from the start. GraphRAG reference logs, prompt library version control, and trace storage on AWS in-Japan servers are designed precisely to satisfy the audit requirements described here. If you are wrestling with how to set up your AI audit machinery, please reach out. The next article in this series will cover the “incident response playbook” that runs alongside AI audit.
References
[^1]: ISO/IEC 23894:2023 Artificial intelligence — Guidance on risk management, https://www.iso.org/standard/77304.html [^2]: Journal of Accountancy, COSO creates audit-ready guidance for governing generative AI (February 2026), https://www.journalofaccountancy.com/news/2026/feb/coso-creates-audit-ready-guidance-for-governing-generative-ai/ [^3]: The CPA Journal, How to Audit AI (January 13, 2026), https://www.cpajournal.com/2026/01/13/how-to-audit-ai/ [^4]: EU Artificial Intelligence Act, Article 6: Classification Rules for High-Risk AI Systems, https://artificialintelligenceact.eu/article/6/ [^5]: Langfuse FAQ, Langfuse vs Arize AI / Phoenix, https://langfuse.com/faq/all/best-phoenix-arize-alternatives [^6]: Aequitas: The Bias and Fairness Audit Toolkit, University of Chicago DSSG, https://dssg.github.io/aequitas/ [^7]: Hugging Face Evaluate Documentation, https://huggingface.co/docs/evaluate/index [^8]: Amazon SageMaker ML Lineage Tracking, https://docs.aws.amazon.com/sagemaker/latest/dg/lineage-tracking.html [^9]: Datadog Blog, Observability in the AI age: Datadog's approach, https://www.datadoghq.com/blog/datadog-ai-innovation/
![AI Audit in Practice|A 3-Layer Review of Training Data, Models, and Operations with Implementation Steps [2026 Edition]](/images/columns/enterprise-ai-audit-practical-3-layer-review/cover.png)