Scrape Illinois SOS Website for Non-Profit Corporations & Directors (Excel Dataset)

Upwork

Remote

•

7 hours ago

•

No application

About

Project Overview

I need a skilled developer or data engineer to build a dataset of Illinois non-profit corporations that are likely condominium or homeowners’ associations. The goal is to create an Excel file of ~30,000–35,000 corporations, with up-to-date director/officer information pulled from state filings.

Scope of Work

1. Source Initial Corporate List (Bulk Download)

Use the OpenCorporates.com API (subscription available) to pull Illinois corporation records that are Non-Profit and whose names contain any of these codewords: "Condominium", "Condo", "Manor", "Apartments", "Association", "Associations", "Condominium Association", "Condo Association", "HOA", "HOMEOWNERS"

Extract the following fields and save as a master Excel list (~30–35k rows):
- Corporation Name
- Native Company Number (e.g., 51361989)

2. Cross-Check with Illinois Secretary of State (SOS) Website

Site: apps.ilsos.gov/businessentitysearch

For each corporation:
- Search by Native Company Number or Name
- Confirm Corporation Status (Active / Inactive)
- Retrieve the latest Filing PDF (Annual Report / Articles of Incorporation)
- Extract Director/Officer information:
  - Names
  - Titles (Director, President, Treasurer, etc.)
  - Addresses
  - Filing date

3. Technical Challenges (Must Handle)

- CAPTCHAs: the site uses CAPTCHAs that may require manual solving or integration with services like 2Captcha.
- Limited results per search: must query by known IDs (from OpenCorporates) to bypass result limits.
- PDF parsing: PDFs may be text-based or scanned images (OCR required).

4. Deliverables

Final Excel/CSV dataset containing at minimum:
- Corporation Name
- Native Company Number
- Corporation Status (Active/Inactive)
- Director/Officer Names
- Titles
- Addresses
- Latest Filing Date

Folder/archive with downloaded Filing PDFs (for verification)

Skills Required

- Web scraping (Playwright, Puppeteer, Selenium, or similar)
- PDF parsing (PyPDF2, pdfminer, or OCR with Tesseract if needed)
- Data cleaning & normalization (Python / Pandas)
- Experience handling CAPTCHAs and anti-bot systems
- Familiarity with U.S. corporate filings a plus

Additional Notes

- OpenCorporates subscription (~$250–300/month) will be provided or reimbursed.
- Proxy / CAPTCHA solving costs should be budgeted into your proposal.
- This project must be automated end-to-end, not done manually.
- The deliverable is a clean, structured Excel file, with an extra bonus for supporting PDFs.

What to Include in Your Proposal

- Your relevant experience with scraping sites that use CAPTCHAs & PDF parsing
- Tools/libraries you prefer to use for scraping & OCR
- Estimated cost and timeline for delivery of the dataset
- Any questions or assumptions about the project
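As an illustration of the name filter in step 1, here is a minimal Python sketch of how the codeword match might be applied to a pulled list before it is saved as the master list. It is not a required approach. The record keys `name` and `company_number`, the function names, and the sample rows are hypothetical. Whole-word, case-insensitive matching is an assumption (a plain substring test would also flag names such as "Shoal" for "HOA"), and it means plural forms not in the list, such as "Condos", are not matched.

```python
import re

# Codewords copied from step 1 of the Scope of Work.
CODEWORDS = [
    "Condominium", "Condo", "Manor", "Apartments", "Association",
    "Associations", "Condominium Association", "Condo Association",
    "HOA", "HOMEOWNERS",
]

# Assumption: match whole words only, ignoring case. Longer phrases are
# listed first so "Condo Association" wins over the shorter "Condo".
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(w) for w in sorted(CODEWORDS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def matched_codewords(name):
    """Return the distinct codewords found in a name, lowercased, in order."""
    found = []
    for match in _PATTERN.finditer(name):
        word = match.group(0).lower()
        if word not in found:
            found.append(word)
    return found


def filter_candidates(records):
    """Keep records whose name contains a codeword, one per company number.

    `records` is an iterable of dicts with the hypothetical keys
    'name' and 'company_number'.
    """
    kept = {}
    for rec in records:
        if matched_codewords(rec["name"]):
            # Keep the first record seen for each company number.
            kept.setdefault(rec["company_number"], rec)
    return list(kept.values())


# Made-up sample rows for illustration only.
SAMPLE = [
    {"name": "Lakeview Condo Association", "company_number": "00000001"},
    {"name": "Shoal Creek Bakery NFP", "company_number": "00000002"},
    {"name": "SHOAL CREEK HOMEOWNERS ASSN", "company_number": "00000003"},
    {"name": "Lakeview Condo Association", "company_number": "00000001"},
]

candidates = filter_candidates(SAMPLE)
```

With the sample rows above, `candidates` holds the two association records; the bakery is dropped and the duplicate company number is collapsed. The kept rows could then be written to Excel with whatever tool the proposal prefers.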


