
Scrape Illinois SOS Website for Non-Profit Corporations & Directors (Excel Dataset)
Upwork
Remote
About
Project Overview

I need a skilled developer or data engineer to build a dataset of Illinois non-profit corporations that are likely condominium or homeowners’ associations. The goal is to create an Excel file of ~30,000–35,000 corporations, with up-to-date director/officer information pulled from state filings.

Scope of Work

1. Source Initial Corporate List (Bulk Download)

Use the OpenCorporates.com API (subscription available) to pull Illinois corporation records that are Non-Profit and whose names contain any of these codewords: "Condominium", "Condo", "Manor", "Apartments", "Association", "Associations", "Condominium Association", "Condo Association", "HOA", "HOMEOWNERS"

Extract the following fields and save as a master Excel list (~30–35k rows):
- Corporation Name
- Native Company Number (e.g., 51361989)

2. Cross-Check with Illinois Secretary of State (SOS) Website

Site: apps.ilsos.gov/businessentitysearch

For each corporation:
- Search by Native Company Number or Name
- Confirm Corporation Status (Active / Inactive)
- Retrieve the latest Filing PDF (Annual Report / Articles of Incorporation)
- Extract Director/Officer information:
  - Names
  - Titles (Director, President, Treasurer, etc.)
  - Addresses
  - Filing date

3. Technical Challenges (Must Handle)

- CAPTCHAs: the site uses CAPTCHAs that may require manual solving or integration with services like 2Captcha.
- Limited results per search: you must query by known IDs (from OpenCorporates) to bypass result limits.
- PDF parsing: PDFs may be text-based or scanned images (OCR required).

4. Deliverables

- Final Excel/CSV dataset containing at minimum:
  - Corporation Name
  - Native Company Number
  - Corporation Status (Active/Inactive)
  - Director/Officer Names
  - Titles
  - Addresses
  - Latest Filing Date
- Folder/archive with downloaded Filing PDFs (for verification)

Skills Required

- Web scraping (Playwright, Puppeteer, Selenium, or similar)
- PDF parsing (PyPDF2, pdfminer, or OCR with Tesseract if needed)
- Data cleaning & normalization (Python / Pandas)
- Experience handling CAPTCHAs and anti-bot systems
- Familiarity with U.S. corporate filings a plus

Additional Notes

- OpenCorporates subscription (~$250–300/month) will be provided or reimbursed.
- Proxy / CAPTCHA solving costs should be budgeted into your proposal.
- This project must be automated end-to-end, not done manually.
- The deliverable is a clean, structured Excel file, plus an extra bonus for supporting PDFs.

What to Include in Your Proposal

- Your relevant experience with scraping sites that use CAPTCHAs & PDF parsing
- Tools/libraries you prefer to use for scraping & OCR
- Estimated cost and timeline for delivery of the dataset
- Any questions or assumptions about the project
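
For reference, the name filter in step 1 could look roughly like the sketch below. It is only an illustration, not a required implementation: the codeword list is copied from this posting, but case-insensitive substring matching, the function names, and the "name" key on each record are assumptions (the real OpenCorporates field names may differ).

```python
import re
from typing import Optional

# Codewords copied from the posting. Case-insensitive matching is an assumption.
CODEWORDS = [
    "Condominium", "Condo", "Manor", "Apartments", "Association",
    "Associations", "Condominium Association", "Condo Association",
    "HOA", "HOMEOWNERS",
]

# Longer phrases go first so the reported match is the most specific one.
_PATTERN = re.compile(
    "|".join(re.escape(w) for w in sorted(CODEWORDS, key=len, reverse=True)),
    re.IGNORECASE,
)


def matched_codeword(name: str) -> Optional[str]:
    """Return the codeword text found in a corporation name, or None."""
    m = _PATTERN.search(name)
    return m.group(0) if m else None


def filter_candidates(records):
    """Keep records whose name contains a codeword.

    Each record is a dict with a hypothetical 'name' key.
    """
    return [r for r in records if matched_codeword(r.get("name", ""))]


if __name__ == "__main__":
    sample = [
        {"name": "Lakeview Condominium Association", "number": "00000001"},
        {"name": "Acme Food Bank", "number": "00000002"},
    ]
    for rec in filter_candidates(sample):
        print(rec["number"], rec["name"])
```

Plain substring matching also hits unrelated names ("HOA" matches inside "Shoals", for example), so applicants may want to ask whether whole-word matching is preferred.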