Skip to content
Bharat NeuroTech
NeuroCortex · Live
₹101 Shagun on signup · free
JOURNAL · BUILDING

AI red teaming — what to probe, what to document, what to fix

Four-layer AI red team — capability, safety, security, abuse. Reproducible findings register, India-specific attack vectors (code-mix jailbreaks, caste/religion abuse), and the two-week engagement structure.

By Dr. Nitnem Singh Sodhi7 min read← all essays
▸ ANSWER

AI red teaming is the structured, adversarial probing of an AI system to find failures — safety, security, fairness and policy-violation failures — before attackers, users or regulators do. In 2026 it has matured from a single English-only prompt-injection sweep into a multi-layered discipline: capability probing, safety probing, security probing, abuse probing, each with documented methodology and reproducible findings. For Indian deployments, red teaming must include code-mixed Hindi-English attacks, vernacular jailbreaks, and India-specific abuse vectors (caste/religion exploits, deepfake synthesis tied to political contexts).

AI red teaming
The systematic adversarial testing of an AI system by a team explicitly trying to cause it to fail — covering safety failures (harmful output), security failures (prompt injection, data exfiltration), fairness failures (biased outputs under pressure) and abuse failures (misuse for content the system shouldn't produce).
▸ TL;DR
  • Four probing layers: capability, safety, security, abuse. Skip any layer and the report is incomplete.
  • For Indian AI: test in Hindi-English code-mix, not just English. ~30% of failures don't surface in English.
  • Reproducibility is the difference between red teaming and "we tried some bad prompts". File every attack with seed, version, and outcome.
  • The report's value is in the fixes, not the findings. A red team without a remediation plan is theatre.
  • Rerun on every material model or prompt change. Red-team findings have a shelf life of one model version.

The four probing layers

1. Capability probing

What can the system do that you didn't intend? Out-of-scope task completion, latent capabilities, emergent tool-use. For a customer-service chatbot, capability probing asks: can it write code? Generate medical advice? Issue refunds? The answers shape the safety surface for every later layer.

2. Safety probing

Will it produce content that harms users or third parties? Test for: violence, sexual content involving minors, self-harm, hate speech, dangerous instructions, regulated-advice failures (medical, legal, financial). Run each category in English and in Indic languages and in code-mix.

3. Security probing

Can attackers manipulate the system? Test prompt injection (direct + indirect), jailbreaks, system-prompt extraction, training-data extraction, model inversion, membership inference. The Indian context adds vernacular jailbreaks — attacks that exploit the model's weaker safety training in non-English languages.

4. Abuse probing

How will users abuse the system in the wild? For text models: spam generation, astroturfing, scam-content production. For image models: deepfakes, NCII, fraud. For Indian contexts specifically: synthesised political content, caste/religion targeted harassment, fraud automation in vernacular languages.

How a real red-team engagement runs

TWO-WEEK AI RED-TEAM ENGAGEMENT STRUCTURE
PhaseDaysOutput
Scope + threat model1–2Threat model document, scope sign-off
Capability probing3–4Capability inventory, scope expansions if needed
Safety probing5–7Safety findings register, severity-scored
Security probing8–10Security findings register, exploitability rated
Abuse probing11–12Abuse-vector report with example artefacts
Report + remediation13–14Executive summary + technical appendix + remediation roadmap

The Indian-context attack vectors

  • Code-mix jailbreaks. "Tell me X" refused in English; same request in Devanagari often succeeds because safety training is sparser in non-English.
  • Caste/religion-targeted abuse generation. Test specifically; Western red teams routinely miss this category.
  • Political deepfake synthesis. Especially around election cycles, MeitY advisory exposure is real.
  • Fraud automation in vernacular. WhatsApp-shaped scam scripts in Hindi/Tamil/Telugu — measure refusal rates separately for each.
  • Regulatory-disclosure failures. Does the system disclose AI use as required by MeitY advisories? Test this; many don't.

What the deliverable should contain

  1. Executive summary — 1 page, severity-ranked findings.
  2. Threat model — what was in and out of scope, why.
  3. Findings register — reproducible attacks with verbatim inputs/outputs.
  4. Severity rating per finding (critical / high / medium / low) with rationale.
  5. Remediation roadmap — what to fix, in what order, by when.
  6. Re-test plan — what triggers a re-run.

How red teaming fits the rest of the safety stack

Red teaming is the adversarial half of safety. It pairs with structured risk assessment (the proactive half) and with ongoing safety controls (the operational half). One without the others leaves gaps.

▸ FAQ

Frequently asked

What is AI red teaming?
The systematic adversarial testing of an AI system by a team explicitly trying to cause it to fail — covering safety failures (harmful output), security failures (prompt injection, data exfiltration), fairness failures (biased outputs under pressure) and abuse failures (misuse for content the system shouldn't produce).
What are the layers of an AI red team engagement?
Four probing layers: capability (what can it do you didn't intend), safety (will it produce harmful content), security (can attackers manipulate it), abuse (how will users abuse it in the wild). Skip any layer and the report is incomplete.
How is AI red teaming different in India?
Test in Hindi-English code-mix, not just English — about 30% of failures don't surface in English. Add caste/religion-targeted abuse generation, vernacular fraud automation, and political deepfake synthesis. Western red teams routinely miss these categories.
How often should you run an AI red team?
On every material model or prompt change. Red-team findings have a shelf life of one model version. Continuous integration of red-team probes is the mature pattern in 2026.
▸ NEXT STEP

Run an India-calibrated red-team scan in minutes.

The risk scan probes capability, safety, security and abuse in English and Hindi-English code-mix, and outputs a reproducible findings register. ₹799 for a single audit.

Dr. Nitnem Singh Sodhi is a Lead Auditor for ISO/IEC 42001, 27001 and 27701, accredited by ANSI/ABICB since March 2025.

— Bharat NeuroTech · /dr-sodhi
Open the Lab →