Convert Speech to Text in Python
A complete Python guide — from zero to a working transcript in under 5 minutes. Covers speaker diarization, vocabulary hints, error handling, and Australian data residency.
Python speech-to-text options compared
There are several ways to convert speech to text in Python. Which one you choose depends on accuracy requirements, data residency, and whether you can afford to send audio to a US server.
| Option | Accuracy | Setup | Data leaves AU? |
|---|---|---|---|
| Australian Transcription API | High (Whisper) | pip install requests | No |
| OpenAI Whisper (local) | High | pip install openai-whisper + GPU recommended | No |
| OpenAI Whisper API | High | pip install openai | Yes (US) |
| AssemblyAI | High | pip install assemblyai | Yes (US) |
| SpeechRecognition library | Medium (Google/Sphinx) | pip install SpeechRecognition | Yes (Google cloud) |
| AWS Transcribe | High | pip install boto3 | AU region available |
For Australian developers handling personal information — call centre recordings, medical consultations, legal proceedings — data residency matters. Sending audio to US infrastructure triggers APP 8 obligations under the Privacy Act 1988 (Cth). Australian Transcription runs entirely on AWS Sydney so those obligations never apply.
Prerequisites
- Python 3.8 or later
- The requests library (pip install requests)
- An Australian Transcription API key (sign up free, no credit card required; 90 minutes included)
- An audio file: MP3, WAV, OGG, FLAC, or M4A
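Because only the formats listed above are accepted, checking the file extension before uploading can save a wasted request. The sketch below is one way to do that. The helper name `is_supported_audio` and the extension set are illustrative assumptions derived from the list above, not part of the API, and a matching extension does not guarantee the contents are valid audio.

```python
from pathlib import Path

# Extensions derived from the supported-format list above (assumed mapping).
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".ogg", ".flac", ".m4a"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension matches a supported format.

    Hypothetical helper: it inspects the file name only, not the contents.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```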
Convert speech to text in Python in 5 minutes
The API is asynchronous. You submit a file to POST /api/v1/transcribe, receive a job_id, then poll GET /api/v1/jobs/{job_id} until complete. Here's a minimal working example:
```python
import time

import requests

API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def transcribe(audio_path: str) -> str:
    """Convert speech to text: submit the file, then poll for the result."""
    # Step 1: Submit the audio file
    with open(audio_path, "rb") as f:
        response = requests.post(
            f"{BASE_URL}/transcribe",
            headers=HEADERS,
            files={"file": f},
            data={"language": "en", "num_speakers": 2},
        )
    response.raise_for_status()
    job_id = response.json()["job_id"]
    print(f"Job submitted: {job_id}")

    # Step 2: Poll until complete (max 60 attempts, ~5 min)
    for _ in range(60):
        result = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=HEADERS)
        result.raise_for_status()
        data = result.json()
        status = data["status"]
        print(f"Status: {status}")
        if status == "completed":
            return data["transcription"]
        elif status == "failed":
            raise RuntimeError(f"Transcription failed: {data.get('error')}")
        time.sleep(5)
    raise TimeoutError(f"Job {job_id} did not complete in time")


if __name__ == "__main__":
    transcript = transcribe("recording.mp3")
    print("\nTranscript:")
    print(transcript)
```
Automatic speech recognition (ASR) with speaker labels
The API includes speaker diarization — it labels each segment with the speaker who said it. Pass num_speakers for better accuracy when you know the count:
```python
import time

import requests

API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def transcribe_with_speakers(audio_path: str, num_speakers: int = 2) -> dict:
    """Submit audio and return both transcription and speaker diarization."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/transcribe",
            headers=HEADERS,
            files={"file": f},
            data={"language": "en", "num_speakers": num_speakers},
        )
    resp.raise_for_status()
    job_id = resp.json()["job_id"]

    for _ in range(60):
        time.sleep(5)
        poll = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=HEADERS)
        poll.raise_for_status()  # surface HTTP errors instead of a KeyError on "status"
        data = poll.json()
        if data["status"] == "completed":
            return data
        elif data["status"] == "failed":
            raise RuntimeError("Transcription failed")
    raise TimeoutError("Timed out")


result = transcribe_with_speakers("meeting.mp3", num_speakers=3)
print("Full transcript:")
print(result["transcription"])
print("\nBy speaker:")
print(result["diarization"])
# Output:
# [Speaker 1]: Good morning everyone, let's get started.
# [Speaker 2]: Thanks for joining.
# [Speaker 3]: Happy to be here.
```
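If the diarization field arrives as text in the `[Speaker N]: utterance` form shown in the sample output above, it can be split into structured turns for further processing. The following is a minimal sketch under that assumption. `parse_diarization` is a hypothetical helper, not part of the API, and the real payload may use a different shape.

```python
import re
from typing import List, Tuple

# Matches lines such as "[Speaker 2]: Thanks for joining."
_TURN = re.compile(r"^\[(?P<speaker>[^\]]+)\]:\s*(?P<text>.*)$")


def parse_diarization(diarization: str) -> List[Tuple[str, str]]:
    """Split "[Speaker N]: text" lines into (speaker, text) pairs.

    Assumes one turn per line; lines that do not match are skipped.
    """
    turns = []
    for line in diarization.splitlines():
        match = _TURN.match(line.strip())
        if match:
            turns.append((match.group("speaker"), match.group("text")))
    return turns


sample = (
    "[Speaker 1]: Good morning everyone, let's get started.\n"
    "[Speaker 2]: Thanks for joining."
)
for speaker, text in parse_diarization(sample):
    print(f"{speaker} said: {text}")
```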
Improve Python speech recognition accuracy with vocab hints
The prompt parameter passes domain-specific terms to Whisper, reducing transcription errors on uncommon words, product names, and proper nouns:
```python
import requests

API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}

# Keep under 60 words: only the last ~200 tokens are used
MEDICAL_TERMS = "metformin, HbA1c, hypertension, dyslipidaemia, myocardial infarction, ECG"
LEGAL_TERMS = "indemnification, liquidated damages, Anton Piller, Mareva injunction, subrogation"
FINANCE_TERMS = "ACME Corp, KPIs, Q3 review, EBITDA, CRM, AML obligations"


def submit(audio_path: str, vocab: str) -> str:
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/transcribe",
            headers=HEADERS,
            files={"file": f},
            data={"language": "en", "prompt": vocab},
        )
    resp.raise_for_status()
    return resp.json()["job_id"]


job_id = submit("consultation.mp3", MEDICAL_TERMS)
print(f"Job submitted: {job_id}")
```
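When hints come from a longer glossary, the terms can be joined and trimmed to stay within the roughly 60-word guideline before submission. A rough sketch follows. `build_prompt` and its `max_words` default are illustrative assumptions, and counting whitespace-separated words only approximates how Whisper tokenizes the prompt.

```python
from typing import Iterable


def build_prompt(terms: Iterable[str], max_words: int = 60) -> str:
    """Join vocabulary terms with commas, stopping before max_words is exceeded.

    Hypothetical helper: blank entries and case-insensitive duplicates are
    dropped, and a multi-word term is kept or dropped as a whole.
    """
    kept = []
    seen = set()
    words_used = 0
    for term in terms:
        term = term.strip()
        if not term or term.lower() in seen:
            continue
        word_count = len(term.split())
        if words_used + word_count > max_words:
            break
        kept.append(term)
        seen.add(term.lower())
        words_used += word_count
    return ", ".join(kept)


print(build_prompt(["metformin", "HbA1c", "myocardial infarction", "metformin"]))
```

The returned string could then be passed as the `vocab` argument of `submit` above.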
Error handling and retry logic
Four error codes are worth handling explicitly when you transcribe audio files in Python:
- 429 Too Many Requests: rate limit hit. Read the Retry-After header and sleep before retrying. Limit is 10 req/min on /transcribe.
- 402 Payment Required: credit exhausted. Check GET /api/v1/credit and top up.
- 401 Unauthorized: invalid or missing API key.
- 400 Bad Request: invalid file format or parameters.
```python
import time

import requests

API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def submit_with_retry(audio_path: str, max_retries: int = 3) -> str:
    """Submit a transcription job with retry logic for rate limiting."""
    for attempt in range(max_retries):
        with open(audio_path, "rb") as f:
            response = requests.post(
                f"{BASE_URL}/transcribe",
                headers=HEADERS,
                files={"file": f},
                data={"language": "en"},
            )
        if response.status_code == 200:
            return response.json()["job_id"]
        elif response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", 10))
            print(f"Rate limited. Retrying in {retry_after}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(retry_after)
        elif response.status_code == 402:
            raise RuntimeError(
                "Insufficient credit. Top up at "
                "https://australiantranscription.com.au/billing"
            )
        elif response.status_code == 401:
            raise ValueError("Invalid API key. Check your X-API-Key header.")
        else:
            # Covers 400 and any other error status
            response.raise_for_status()
    raise RuntimeError(f"Failed after {max_retries} attempts.")


def poll_with_backoff(job_id: str, max_wait_seconds: int = 1800) -> dict:
    """Poll with exponential backoff, which is gentler on the API."""
    interval = 5.0
    elapsed = 0.0
    while elapsed < max_wait_seconds:
        time.sleep(interval)
        elapsed += interval
        resp = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=HEADERS)
        resp.raise_for_status()  # fail loudly on HTTP errors while polling
        data = resp.json()
        status = data["status"]
        if status == "completed":
            return data
        elif status == "failed":
            raise RuntimeError(f"Job failed: {data.get('error_message', 'unknown')}")
        print(f"Status: {status} ({int(elapsed)}s elapsed)")
        interval = min(interval * 1.2, 60.0)  # cap at 60s
    raise TimeoutError(f"Job {job_id} did not complete within {max_wait_seconds}s")
```
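The retry code above converts Retry-After with `int()`, which assumes the header carries a number of seconds. The HTTP specification also allows an HTTP-date in that header. Whether this API ever sends one is not stated here, so the sketch below is a defensive variant rather than a documented requirement. `parse_retry_after` and its `default` fallback are illustrative names, not part of the API.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], default: float = 10.0) -> float:
    """Return the number of seconds to wait for a Retry-After header value.

    Accepts delay-seconds ("30") or an HTTP-date; falls back to `default`
    when the header is missing or cannot be parsed.
    """
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return default
    if retry_at.tzinfo is None:
        # Treat a date without a zone as UTC.
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    delay = (retry_at - datetime.now(timezone.utc)).total_seconds()
    return max(delay, 0.0)
```

In `submit_with_retry`, the sleep could then become `time.sleep(parse_retry_after(response.headers.get("Retry-After")))`.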
Optimise performance for large audio files
For bulk jobs, submit every file up front and then poll the resulting jobs concurrently, rather than waiting for each transcript before sending the next file. Each job is independent:
```python
import threading
import time

import requests

API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def submit(path: str) -> str:
    with open(path, "rb") as f:
        r = requests.post(
            f"{BASE_URL}/transcribe",
            headers=HEADERS,
            files={"file": f},
            data={"language": "en"},
        )
    r.raise_for_status()
    return r.json()["job_id"]


def poll(job_id: str, results: dict, max_wait_seconds: int = 1800) -> None:
    """Poll one job and store its transcript (None on failure or timeout)."""
    interval = 5.0
    elapsed = 0.0
    results[job_id] = None  # overwritten on success
    while elapsed < max_wait_seconds:  # 30 min max
        time.sleep(interval)
        elapsed += interval
        resp = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=HEADERS)
        resp.raise_for_status()
        data = resp.json()
        if data["status"] == "completed":
            results[job_id] = data["transcription"]
            return
        elif data["status"] == "failed":
            return
        interval = min(interval * 1.2, 60.0)


# Submit up to 10 jobs (rate limit: 10/min on /transcribe)
files = ["call_1.mp3", "call_2.mp3", "call_3.mp3"]
job_ids = [submit(f) for f in files]

# Poll all jobs concurrently
results = {}
threads = [threading.Thread(target=poll, args=(jid, results)) for jid in job_ids]
for t in threads:
    t.start()
for t in threads:
    t.join()

for jid, transcript in results.items():
    print(f"{jid}: {transcript[:100] if transcript else 'FAILED'}...")
```
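With more than 10 files, back-to-back submissions would run into the limit of 10 requests per minute on /transcribe. One option is to pace submissions on the client with a sliding-window limiter. The sketch below is an illustration under assumptions: `SlidingWindowLimiter` is a hypothetical class, the server may count its window differently, and the injectable `clock` and `sleep` arguments exist only to make the logic testable. It is intended for use from a single thread.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of `period` seconds."""

    def __init__(
        self,
        max_calls: int = 10,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def wait(self) -> None:
        """Block until another call is allowed, then record it."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._calls[0]))
            now = self._clock()
            self._calls.popleft()
        self._calls.append(now)
```

Calling `limiter.wait()` immediately before each `submit(path)` in a submission loop would space requests so that no more than `max_calls` start within any `period`.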
Check your credit balance in Python
The /api/v1/credit endpoint returns your current AUD balance. Useful to check before submitting bulk jobs:
import requests
API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}
resp = requests.get(f"{BASE_URL}/credit", headers=HEADERS)
resp.raise_for_status()
credit = resp.json()
print(f"Balance: ${credit['balance_aud']} AUD")
print(f"Rate: ${credit['price_per_minute_aud']}/min")
print(f"Estimated remaining: {credit['estimated_remaining_minutes']} minutes")
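Before a bulk run, the balance can be compared with the expected cost of the batch. The sketch below assumes the `balance_aud` and `price_per_minute_aud` fields shown above are numeric (or numeric strings) and that billing is proportional to audio duration. `can_afford` is a hypothetical helper, and the actual billing rules (rounding, minimum charges) may differ.

```python
from decimal import Decimal
from typing import Iterable


def can_afford(credit: dict, durations_seconds: Iterable[float]) -> bool:
    """Return True if the balance covers the estimated cost of all files.

    `credit` is the parsed JSON from GET /credit; durations are in seconds.
    Assumes cost = total minutes * price_per_minute_aud, with no rounding.
    """
    balance = Decimal(str(credit["balance_aud"]))
    rate = Decimal(str(credit["price_per_minute_aud"]))
    total_minutes = Decimal(str(sum(durations_seconds))) / Decimal(60)
    return total_minutes * rate <= balance


# Made-up numbers for illustration only, not real pricing.
example_credit = {"balance_aud": 5.00, "price_per_minute_aud": 0.10}
print(can_afford(example_credit, [600, 1200]))  # 30 minutes at the example rate
```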
API reference summary
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/transcribe | POST | Submit audio. Returns job_id. |
| /api/v1/jobs/{job_id} | GET | Poll status. Returns transcription + diarization when completed. |
| /api/v1/jobs | GET | List all jobs (paginated). |
| /api/v1/credit | GET | AUD balance and usage summary. |
POST /api/v1/transcribe parameters
- file: audio file (multipart/form-data, required): MP3, WAV, OGG, FLAC, M4A
- language: ISO 639-1 code, default "en"
- prompt: comma-separated vocabulary hints (optional, ~60 words max)
- num_speakers: number of speakers, 1-10, default 2
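The parameter constraints above can be checked on the client before any upload. A small sketch, assuming the documented ranges are enforced exactly as listed: `build_form_data` is a hypothetical helper that returns the dict passed as `data=` to `requests.post`, and the server remains the authority on what it accepts.

```python
from typing import Dict, Optional, Union


def build_form_data(
    language: str = "en",
    num_speakers: int = 2,
    prompt: Optional[str] = None,
) -> Dict[str, Union[str, int]]:
    """Validate parameters and return the form fields for POST /transcribe."""
    if len(language) != 2 or not language.isalpha():
        raise ValueError(f"language must be a two-letter ISO 639-1 code, got {language!r}")
    if not 1 <= num_speakers <= 10:
        raise ValueError(f"num_speakers must be between 1 and 10, got {num_speakers}")
    data: Dict[str, Union[str, int]] = {
        "language": language.lower(),
        "num_speakers": num_speakers,
    }
    if prompt:
        # Word count is a rough proxy for the ~60 word guideline.
        if len(prompt.split()) > 60:
            raise ValueError("prompt should stay under about 60 words")
        data["prompt"] = prompt
    return data
```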
Frequently asked questions
How do I convert speech to text in Python?
Install requests, submit your audio to POST /api/v1/transcribe with your API key, then poll GET /api/v1/jobs/{job_id} until status == "completed". The whole setup takes under 5 minutes. See the example above.
What audio formats does the API support?
MP3, WAV, OGG, FLAC, and M4A.
How do I improve transcription accuracy?
Use the prompt parameter with comma-separated domain terms. Also set num_speakers to the exact count when you know it, since that improves diarization. Keep prompts under 60 words.
How do I handle rate limit errors?
On a 429 response, read the Retry-After header and sleep for that many seconds before retrying. The /transcribe endpoint allows 10 requests per minute per API key.
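How do I optimise performance for large audio files in Python?
Submit the files up front and poll the jobs concurrently, staying within the limit of 10 requests per minute on /transcribe. See the bulk example above.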
Does this send my audio data overseas?
No. Australian Transcription runs entirely on AWS Sydney.
Full API documentation is available at /docs.
Start converting speech to text in Python
90 minutes free. No credit card required. Australian data residency included.