What audio formats does the Australian Transcription API support?

The API accepts MP3, WAV, OGG, FLAC, and M4A files. Files are submitted as multipart/form-data.

How do I improve transcription accuracy in Python?

Pass a prompt parameter with comma-separated domain-specific terms, names, or abbreviations that appear in your audio. For example: 'ACME Corp, Dr. Nguyen, HbA1c'. Only the last ~200 tokens of the prompt are used, so keep it to 40-60 words. Also pass num_speakers if you know the exact count, because a correct speaker count improves diarization accuracy.

How do I handle errors and retries when transcribing audio in Python?

Handle 429 Too Many Requests by reading the Retry-After header and sleeping before retrying. Handle 402 Payment Required by checking your credit balance. Handle 401 Unauthorized by verifying your X-API-Key header. For polling, use exponential backoff rather than a fixed sleep interval to avoid hammering the status endpoint.

How do I optimise performance when transcribing large audio files in Python?

Submit jobs in parallel rather than sequentially: each job gets its own job_id, so you can poll them concurrently. Use Python's asyncio or a thread pool to submit multiple files simultaneously. Poll with exponential backoff (5s, 6s, 7.2s...) rather than hitting the status endpoint every second.

What is automatic speech recognition (ASR) in Python?

Automatic speech recognition (ASR) in Python means using a library or API to convert audio recordings into text programmatically. Options range from local libraries like SpeechRecognition (wraps CMU Sphinx and Google Web Speech) to cloud ASR APIs like Australian Transcription, AssemblyAI, and Deepgram that handle the heavy lifting server-side and return higher-accuracy transcripts.

Does converting speech to text in Python require data to leave Australia?

Not with Australian Transcription. All audio is processed on AWS infrastructure in Sydney (ap-southeast-2) and never leaves Australia. This means APP 8 cross-border disclosure obligations under the Privacy Act 1988 are never triggered — important for any application handling personal information in recordings.

Developer Guide

Convert Speech to Text in Python

Q: How do I convert speech to text in Python?

Use a speech-to-text API. Install the requests library, submit your audio file to POST /api/v1/transcribe with your API key, then poll GET /api/v1/jobs/{job_id} until the status is 'completed'. The response includes both a plain transcription and speaker-labelled diarization. The whole flow takes under 5 minutes to set up.

A complete Python guide — from zero to a working transcript in under 5 minutes. Covers speaker diarization, vocabulary hints, error handling, and Australian data residency.

Python speech-to-text options compared

There are several ways to convert speech to text in Python. Which one you choose depends on accuracy requirements, data residency, and whether you can afford to send audio to a US server.

Option	Accuracy	Setup	Data leaves AU?
Australian Transcription API	High (Whisper)	pip install requests	No
OpenAI Whisper (local)	High	pip install openai-whisper + GPU recommended	No
OpenAI Whisper API	High	pip install openai	Yes (US)
AssemblyAI	High	pip install assemblyai	Yes (US)
SpeechRecognition library	Medium (Google/Sphinx)	pip install SpeechRecognition	Yes (Google cloud)
AWS Transcribe	High	pip install boto3	AU region available

For Australian developers handling personal information — call centre recordings, medical consultations, legal proceedings — data residency matters. Sending audio to US infrastructure triggers APP 8 obligations under the Privacy Act 1988 (Cth). Australian Transcription runs entirely on AWS Sydney so those obligations never apply.

Prerequisites

Python 3.8 or later
requests library (pip install requests)
An Australian Transcription API key (sign up free, no credit card required — 90 minutes included)
An audio file: MP3, WAV, OGG, FLAC, or M4A
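
Before uploading, it can help to confirm that a file has one of the listed extensions. The sketch below is a minimal local check; is_supported_audio is a hypothetical helper name, and it only looks at the file extension, not the actual audio encoding.

```python
from pathlib import Path

# Extensions taken from the supported-format list above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".ogg", ".flac", ".m4a"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension is one of the listed formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_audio("recording.MP3"))  # True
print(is_supported_audio("notes.txt"))      # False
```

The API presumably still validates the upload itself, so treat this as an early, cheap filter rather than a guarantee.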

Convert speech to text in Python in 5 minutes

The API is asynchronous. You submit a file to POST /api/v1/transcribe, receive a job_id, then poll GET /api/v1/jobs/{job_id} until complete. Here's a minimal working example:

transcribe_basic.py

import time
import requests

API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def transcribe(audio_path: str) -> str:
    """Convert speech to text in Python — submit and poll."""

    # Step 1: Submit the audio file
    with open(audio_path, "rb") as f:
        response = requests.post(
            f"{BASE_URL}/transcribe",
            headers=HEADERS,
            files={"file": f},
            data={"language": "en", "num_speakers": 2},
        )
    response.raise_for_status()
    job_id = response.json()["job_id"]
    print(f"Job submitted: {job_id}")

    # Step 2: Poll until complete (max 60 attempts, ~5 min)
    for _ in range(60):
        result = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=HEADERS)
        result.raise_for_status()
        data = result.json()

        status = data["status"]
        print(f"Status: {status}")

        if status == "completed":
            return data["transcription"]
        elif status == "failed":
            raise RuntimeError(f"Transcription failed: {data.get('error')}")

        time.sleep(5)

    raise TimeoutError(f"Job {job_id} did not complete in time")


if __name__ == "__main__":
    transcript = transcribe("recording.mp3")
    print("\nTranscript:")
    print(transcript)

Automatic speech recognition (ASR) with speaker labels

The API includes speaker diarization — it labels each segment with the speaker who said it. Pass num_speakers for better accuracy when you know the count:

extract_speakers.py

import time
import requests

API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def transcribe_with_speakers(audio_path: str, num_speakers: int = 2) -> dict:
    """Submit audio and return both transcription and speaker diarization."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/transcribe",
            headers=HEADERS,
            files={"file": f},
            data={"language": "en", "num_speakers": num_speakers},
        )
    resp.raise_for_status()
    job_id = resp.json()["job_id"]

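    for _ in range(60):
        time.sleep(5)
        status_resp = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=HEADERS)
        status_resp.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
        data = status_resp.json()
        if data["status"] == "completed":
            return data
        elif data["status"] == "failed":
            raise RuntimeError(f"Transcription failed for job {job_id}")

    raise TimeoutError(f"Job {job_id} did not complete in time")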


result = transcribe_with_speakers("meeting.mp3", num_speakers=3)
print("Full transcript:")
print(result["transcription"])
print("\nBy speaker:")
print(result["diarization"])
# Output:
# [Speaker 1]: Good morning everyone, let's get started.
# [Speaker 2]: Thanks for joining.
# [Speaker 3]: Happy to be here.
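
If you want the utterances grouped per speaker, the labelled text can be parsed locally. This is a sketch that assumes the diarization field is a plain string with one "[Speaker N]: text" line per segment, as the sample output above suggests; the actual response shape may differ, and group_by_speaker is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed line format, based on the sample output above: "[Speaker 1]: some text"
LINE_PATTERN = re.compile(r"^\[(?P<speaker>[^\]]+)\]:\s*(?P<text>.*)$")


def group_by_speaker(diarization: str) -> dict:
    """Group diarized lines into {speaker label: [utterances]}."""
    grouped = defaultdict(list)
    for line in diarization.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:  # skip blank or unrecognised lines
            grouped[match.group("speaker")].append(match.group("text"))
    return dict(grouped)


sample = (
    "[Speaker 1]: Good morning everyone, let's get started.\n"
    "[Speaker 2]: Thanks for joining.\n"
    "[Speaker 1]: First item on the agenda.\n"
)
for speaker, utterances in group_by_speaker(sample).items():
    print(speaker, len(utterances))
```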

Improve Python speech recognition accuracy with vocab hints

The prompt parameter passes domain-specific terms to Whisper, reducing transcription errors on uncommon words, product names, and proper nouns:

transcribe_with_vocab.py

import requests

API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}

# Keep under 60 words — only the last ~200 tokens are used
MEDICAL_TERMS = "metformin, HbA1c, hypertension, dyslipidaemia, myocardial infarction, ECG"
LEGAL_TERMS = "indemnification, liquidated damages, Anton Piller, Mareva injunction, subrogation"
FINANCE_TERMS = "ACME Corp, KPIs, Q3 review, EBITDA, CRM, AML obligations"

def submit(audio_path: str, vocab: str) -> str:
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/transcribe",
            headers=HEADERS,
            files={"file": f},
            data={"language": "en", "prompt": vocab},
        )
    resp.raise_for_status()
    return resp.json()["job_id"]

job_id = submit("consultation.mp3", MEDICAL_TERMS)
print(f"Job submitted: {job_id}")
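
Vocabulary lists tend to grow over time, so it can be useful to assemble the prompt from a list and enforce the word budget in code. The helper below is a hypothetical sketch (build_prompt is not part of the API): it removes duplicates case-insensitively and stops before the 60-word guideline is exceeded, so put the most important terms first.

```python
def build_prompt(terms, max_words=60):
    """Join terms with commas, dropping duplicates and stopping before max_words."""
    kept, seen, word_count = [], set(), 0
    for term in terms:
        term = term.strip()
        if not term or term.lower() in seen:
            continue  # skip empty entries and case-insensitive repeats
        words = len(term.split())
        if word_count + words > max_words:
            break  # budget reached; later terms are left out
        kept.append(term)
        seen.add(term.lower())
        word_count += words
    return ", ".join(kept)


print(build_prompt(["metformin", "HbA1c", "Metformin", " ECG "]))
# metformin, HbA1c, ECG
```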

Error handling and retry logic

Four status codes are worth handling explicitly when you transcribe audio files in Python:

429 Too Many Requests: rate limit hit. Read the Retry-After header and sleep before retrying. Limit is 10 req/min on /transcribe.
402 Payment Required: credit exhausted. Check GET /api/v1/credit and top up.
401 Unauthorized: invalid or missing API key.
400 Bad Request: invalid file format or parameters.
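
In HTTP generally, a Retry-After header may carry either a number of seconds or an HTTP date. Which form this API sends is not stated here, so the sketch below handles both and falls back to a default wait. retry_after_seconds is a hypothetical helper, and the 10-second default is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, default=10.0):
    """Convert a Retry-After header value to a non-negative wait in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():  # delay-seconds form, e.g. "30"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


print(retry_after_seconds("30"))  # 30.0
print(retry_after_seconds(None))  # 10.0
```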

transcribe_robust.py

import time
import requests

API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def submit_with_retry(audio_path: str, max_retries: int = 3) -> str:
    """Submit a transcription job with retry logic for rate limiting."""
    for attempt in range(max_retries):
        with open(audio_path, "rb") as f:
            response = requests.post(
                f"{BASE_URL}/transcribe",
                headers=HEADERS,
                files={"file": f},
                data={"language": "en"},
            )

        if 200 <= response.status_code < 300:  # accept any 2xx so a 201/202 is not resubmitted
            return response.json()["job_id"]

        elif response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", 10))
            print(f"Rate limited. Retrying in {retry_after}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(retry_after)

        elif response.status_code == 402:
            raise RuntimeError(
                "Insufficient credit. Top up at "
                "https://australiantranscription.com.au/billing"
            )
        elif response.status_code == 401:
            raise ValueError("Invalid API key. Check your X-API-Key header.")
        else:
            response.raise_for_status()

    raise RuntimeError(f"Failed after {max_retries} attempts.")


def poll_with_backoff(job_id: str, max_wait_seconds: int = 1800) -> dict:
    """Poll with exponential backoff — gentler on the API."""
    interval = 5.0
    elapsed = 0.0

    while elapsed < max_wait_seconds:
        time.sleep(interval)
        elapsed += interval

        resp = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=HEADERS)
        resp.raise_for_status()  # fail loudly on HTTP errors instead of a KeyError below
        data = resp.json()
        status = data["status"]

        if status == "completed":
            return data
        elif status == "failed":
            raise RuntimeError(f"Job failed: {data.get('error_message', 'unknown')}")

        print(f"Status: {status} ({int(elapsed)}s elapsed)")
        interval = min(interval * 1.2, 60.0)  # cap at 60s

    raise TimeoutError(f"Job {job_id} did not complete within {max_wait_seconds}s")
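
To see what the polling schedule looks like without calling the API, the same arithmetic can be written as a generator. This is an illustrative sketch using the values from poll_with_backoff above (5s start, 1.2x growth, 60s cap, 1800s budget). backoff_intervals is a hypothetical name, and unlike the loop above it stops before the budget would be exceeded.

```python
def backoff_intervals(start=5.0, factor=1.2, cap=60.0, max_wait=1800.0):
    """Yield sleep intervals until their running total would exceed max_wait."""
    interval, elapsed = start, 0.0
    while elapsed + interval <= max_wait:
        yield interval
        elapsed += interval
        interval = min(interval * factor, cap)  # grow by 20%, never beyond the cap


schedule = list(backoff_intervals())
print([round(s, 2) for s in schedule[:4]])  # [5.0, 6.0, 7.2, 8.64]
```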

Optimise performance for large audio files

For bulk jobs, submit every file up front rather than waiting for each transcript before sending the next. Each job is independent, so you can poll them concurrently:

transcribe_bulk.py

import time
import threading
import requests

API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def submit(path: str) -> str:
    with open(path, "rb") as f:
        r = requests.post(
            f"{BASE_URL}/transcribe",
            headers=HEADERS,
            files={"file": f},
            data={"language": "en"},
        )
    r.raise_for_status()
    return r.json()["job_id"]


def poll(job_id: str, results: dict, max_wait_seconds: int = 1800):
    interval = 5.0
    elapsed = 0.0
    results[job_id] = None  # stays None if the job fails or times out
    while elapsed < max_wait_seconds:  # 30 min max
        time.sleep(interval)
        elapsed += interval
        data = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=HEADERS).json()
        if data["status"] == "completed":
            results[job_id] = data["transcription"]
            return
        elif data["status"] == "failed":
            return
        interval = min(interval * 1.2, 60.0)


# Submit up to 10 jobs (rate limit: 10/min on /transcribe)
files = ["call_1.mp3", "call_2.mp3", "call_3.mp3"]
job_ids = [submit(f) for f in files]

# Poll all jobs concurrently
results = {}
threads = [threading.Thread(target=poll, args=(jid, results)) for jid in job_ids]
for t in threads:
    t.start()
for t in threads:
    t.join()

for jid, transcript in results.items():
    preview = f"{transcript[:100]}..." if transcript else "FAILED"
    print(f"{jid}: {preview}")
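
The uploads themselves can also run in parallel. The sketch below uses the standard library's ThreadPoolExecutor and takes the submit function as a parameter, so it runs offline with a stand-in. submit_all and fake_submit are hypothetical names; in real use you would pass the submit() function defined above and keep each batch within the 10 requests per minute limit.

```python
from concurrent.futures import ThreadPoolExecutor


def submit_all(paths, submit_fn, max_workers=5):
    """Run submit_fn over paths in a thread pool; return {path: job_id or exception}."""
    outcomes = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(submit_fn, path) for path in paths}
        for path, future in futures.items():
            try:
                outcomes[path] = future.result()
            except Exception as exc:  # record the error so one bad file doesn't abort the batch
                outcomes[path] = exc
    return outcomes


def fake_submit(path):
    """Stand-in for a real submit function so this sketch runs without network access."""
    if not path.endswith(".mp3"):
        raise ValueError(f"unsupported file: {path}")
    return f"job-{path}"


print(submit_all(["call_1.mp3", "call_2.mp3"], fake_submit))
```

Returning exceptions alongside job IDs lets the caller decide per file whether to retry or skip.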

Check your credit balance in Python

The /api/v1/credit endpoint returns your current AUD balance. Useful to check before submitting bulk jobs:

check_credit.py

import requests

API_KEY = "sk_your_api_key_here"
BASE_URL = "https://api.icana.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}

resp = requests.get(f"{BASE_URL}/credit", headers=HEADERS)
resp.raise_for_status()
credit = resp.json()

print(f"Balance: ${credit['balance_aud']} AUD")
print(f"Rate: ${credit['price_per_minute_aud']}/min")
print(f"Estimated remaining: {credit['estimated_remaining_minutes']} minutes")
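
A small guard built on that response can stop a bulk run before it hits a 402. The sketch below uses the balance_aud and price_per_minute_aud fields shown above and converts them with float() because their exact types are not stated here. can_afford is a hypothetical helper, and the sample values are made up, not real pricing.

```python
def can_afford(credit: dict, required_minutes: float) -> bool:
    """Return True if the balance covers required_minutes at the quoted rate."""
    cost = required_minutes * float(credit["price_per_minute_aud"])
    return float(credit["balance_aud"]) >= cost


# Made-up numbers for illustration only.
sample_credit = {"balance_aud": "12.50", "price_per_minute_aud": "0.25"}
print(can_afford(sample_credit, 40))  # True  (40 * 0.25 = 10.00)
print(can_afford(sample_credit, 60))  # False (60 * 0.25 = 15.00)
```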

API reference summary

Endpoint	Method	Description
/api/v1/transcribe	POST	Submit audio. Returns `job_id`.
/api/v1/jobs/{job_id}	GET	Poll status. Returns `transcription` + `diarization` when `completed`.
/api/v1/jobs	GET	List all jobs (paginated).
/api/v1/credit	GET	AUD balance and usage summary.

POST /api/v1/transcribe parameters

file: audio file (multipart/form-data, required) — MP3, WAV, OGG, FLAC, M4A
language: ISO 639-1 code, default "en"
prompt: comma-separated vocabulary hints (optional, ~60 words max)
num_speakers: number of speakers 1-10, default 2
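
The parameter rules above can be checked locally before a request is sent. This is a sketch, not the API's own validation: build_form_data is a hypothetical helper, the two-letter check is a rough stand-in for "ISO 639-1 code", and the 60-word limit follows the guideline for prompt.

```python
def build_form_data(language="en", prompt=None, num_speakers=2):
    """Validate parameters and return the form fields for POST /api/v1/transcribe."""
    if len(language) != 2 or not language.isalpha():
        raise ValueError("language should be a two-letter ISO 639-1 code, e.g. 'en'")
    if not 1 <= num_speakers <= 10:
        raise ValueError("num_speakers must be between 1 and 10")
    data = {"language": language.lower(), "num_speakers": num_speakers}
    if prompt:
        if len(prompt.split()) > 60:
            raise ValueError("prompt should stay under roughly 60 words")
        data["prompt"] = prompt
    return data


print(build_form_data(prompt="metformin, HbA1c", num_speakers=3))
# {'language': 'en', 'num_speakers': 3, 'prompt': 'metformin, HbA1c'}
```

The returned dict can be passed as the data argument of requests.post alongside files, as in the examples above.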

Frequently asked questions

How do I convert speech to text in Python?

Use a speech-to-text API. Install requests, submit your audio to POST /api/v1/transcribe with your API key, then poll GET /api/v1/jobs/{job_id} until status == "completed". The whole setup takes under 5 minutes. See the example above.

What audio formats does the API support?

MP3, WAV, OGG, FLAC, and M4A. Files are submitted as multipart/form-data.

How do I improve transcription accuracy?

Pass the prompt parameter with comma-separated domain terms. Also set num_speakers to the exact count — it meaningfully improves diarization. Keep prompts under 60 words.

How do I handle rate limit errors?

On a 429 response, read the Retry-After header and sleep for that many seconds before retrying. The /transcribe endpoint allows 10 requests per minute per API key.

How do I optimise performance for large audio files in Python?

Submit jobs in parallel using threading or asyncio — each job is independent. Use exponential backoff when polling rather than a fixed interval. See the bulk transcription example above.

Does this send my audio data overseas?

No. All processing happens on AWS Sydney infrastructure. Your audio never leaves Australia, so APP 8 cross-border disclosure obligations under the Privacy Act 1988 are never triggered.

Full API documentation is available at /docs.

Start converting speech to text in Python

90 minutes free. No credit card required. Australian data residency included.
