Collect audio transcription and translation Datasets for Languages (Bonus for Same-Day Delivery)

Upwork

Remote

13 hours ago

No application

About

I need a freelancer (or small team) with experience in linguistic data collection, speech corpora, and translation datasets.

This project is urgent: I need you to start immediately. If you deliver everything within the same day (or faster), I will provide a bonus payment.

Please progressively upload results to Google Drive as you collect them, so I can begin working with partial data while you finish the rest.

Scope of Work

• Audio → Transcription (100+ pairs per language)
  • Audio recordings (WAV/MP3) of native speakers with matching transcripts
  • Sources: Common Voice, OpenSLR, GlobalPhone, etc.
• Text Transcription → English Translation (100+ pairs per language)
  • Parallel sentences (English ↔ Target Language)
  • Sources: Tatoeba, ManyThings, FLORES-200, OPUS (JW300, OpenSubtitles, Bible), UDHR, etc.
• 📌 Minimum: 100 audio–text pairs and 100 text–translation pairs for every language. Larger corpora welcome.

Target Languages

• Spanish: es-LA
• French: fr-FR / fr
• Portuguese (Brazil): pt-BR
• Portuguese (Portugal): pt-PT
• Romanian: ro-RO
• Italian: it-IT
• Indonesian: id-ID
• German: de-DE
• Dutch: nl-NL
• Haitian Creole: ht-HT
• Pashto: ps
• Chinese (Mandarin): zh-CN
• Dari (Persian): fa-AR
• Arabic (Modern Standard): ar
• Vietnamese: vi-VN
• Russian: ru-RU
• Swahili: sw
• Chinese (Cantonese): zh-HK
• Somali: so-SO
• Burmese: my-MM
• Nepali: ne-NP
• Kinyarwanda: rw-RW
• Tigrinya: ti-ET
• Turkish: tr-TR
• Wolof: wo
• Farsi (Persian): fa-IR
• Ukrainian: uk-UA
• Punjabi: pa-IN
• Arabic (Iraqi): ar-IQ
• Amharic: am-ET
• Hindi: hi-IN
• Korean: ko-KR
• Bengali: bn-IN
• Hmong: hm
• Khmer (Cambodian): km-KH
• Urdu: ur
• Gujarati: gu-IN
• Lingala: ln
• Polish: pl-PL
• Japanese: ja-JP
• Thai: th-TH
• Arabic (Egyptian): ar-EG
• Rohingya
• Q’eqchi
• Mam
• K’iche’

Suggested Dataset Sources

• Common Voice (Mozilla): audio + transcripts (Spanish, French, Portuguese, German, Kinyarwanda, Tigrinya, etc.)
• Tatoeba / ManyThings: English ↔ sentence pairs (Spanish, French, Italian, Portuguese, German, Dutch, Polish, Japanese, etc.)
• FLORES-200: multilingual text (Somali, Nepali, Wolof, etc.)
• OPUS (OpenSubtitles, JW300, Bible): massive parallel corpora across many languages
• UDHR Translations: useful for rarer languages like Amharic, K’iche’, Q’eqchi, Mam, Rohingya
• Community/academic projects: for limited-coverage languages (Hmong, Lingala, Haitian Creole, Wolof, etc.)

Requirements

• At least 100 audio–transcription pairs + 100 text–translation pairs per language
• Deliver progressively via Google Drive
• Document source + license for each dataset
• Priority to those who can finish within 24h (bonus if same-day)

Please include in your proposal:

• Ability to collect all data online
• Confirmation that data is usable for commercial use
• Your experience with corpora like Common Voice, OPUS, Tatoeba, etc.
• Confirmation of ability to start ASAP and progressively upload results
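
For freelancers who want to check a partial delivery against the requirements above before uploading, here is a minimal sketch in Python. It is only an illustration, not a required format: the manifest layout, the field names (`language`, `pair_type`, `source`, `license`), and the pair-type labels are all assumptions made for this example. Only the 100-pair minimum and the source + license requirement come from the posting itself.

```python
from collections import Counter

MIN_PAIRS = 100  # minimum per language and per pair type, as stated in the requirements
# Hypothetical labels for the two deliverable types (not defined in the posting).
PAIR_TYPES = ("audio_transcription", "text_translation")


def check_manifest(rows, languages):
    """Return a list of human-readable problems found in a delivery manifest.

    Each row is a dict with the assumed keys:
    language, pair_type, source, license.
    """
    counts = Counter()
    problems = []
    for i, row in enumerate(rows):
        counts[(row["language"], row["pair_type"])] += 1
        # Every entry must document where it came from and under which license.
        for field in ("source", "license"):
            if not (row.get(field) or "").strip():
                problems.append(f"row {i}: missing {field}")
    # Every target language needs at least MIN_PAIRS of each pair type.
    for lang in languages:
        for pair_type in PAIR_TYPES:
            n = counts[(lang, pair_type)]
            if n < MIN_PAIRS:
                problems.append(f"{lang}: {n} {pair_type} pairs, need {MIN_PAIRS}")
    return problems
```

An empty result means every listed language meets both minimums and every row names a source and a license. It does not confirm that a license actually permits commercial use; that still has to be checked per dataset.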