Web scraper with experience in Scrapy/Python and handling large CSVs (e.g., IRS data)

Upwork

Remote

22 hours ago

No application

About

I need a web scraper with experience in Scrapy/Python and handling large CSVs (e.g., IRS data) to:

1. Source seed lists of immigrant-focused nonprofits in these 10 cities (50,000+ total URLs):
• Use IRS EO BMF CSVs (free downloads): filter NTEE_CODE = "P84" (Ethnic/Immigrant Centers) and CITY for each:
  o Atlanta: https://www.irs.gov/pub/irs-soi/eo_ga.csv
  o Chicago: https://www.irs.gov/pub/irs-soi/eo_il.csv
  o Dallas/Houston: https://www.irs.gov/pub/irs-soi/eo_tx.csv
  o Los Angeles/San Francisco: https://www.irs.gov/pub/irs-soi/eo_ca.csv
  o Miami-Dade: https://www.irs.gov/pub/irs-soi/eo_fl.csv
  o New York City: https://www.irs.gov/pub/irs-soi/eo_ny.csv
  o Phoenix: https://www.irs.gov/pub/irs-soi/eo_az.csv
  o Washington DC/VA: https://www.irs.gov/pub/irs-soi/eo_dc.csv + https://www.irs.gov/pub/irs-soi/eo_va.csv
• Extract the WEBSITE column and keep only .org/.net domains.
• Deduplicate, aiming for ~5,000+ per city.
• Expand with immigrant directories (e.g., https://www.immigrationadvocates.org/nonprofit/legaldirectory/search?state=[STATE]).

2. Scrape all URLs for keywords (case-insensitive) that I will supply upon hiring. This is for public nonprofit websites only; no personal/sensitive data.

3. Exclude domains (block 100%) that are social media, earned media, paid media, or .gov.

4. Deliver by EOD Friday, November 7th, 2025:
• Excel file (sorted by city and mentions): URL | City | Keyword Found | Page Title | Snippet (50 words)
• Summary report: Total URLs scraped, Total mentions, % in materials (if identifiable) with keywords, Top 10 URLs with most mentions.
• 10 sample PDFs with keywords highlighted (save as PDF).

Must respect robots.txt, delay 1 second per page, and use Scrapy or Python.

Include in your proposal: “I will source from IRS EO BMF + P84 filter” so I know you read this.

Opportunity for more work as the project unfolds!
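For applicants scoping the work, here is a minimal sketch of the seed filtering in step 1, the domain exclusion in step 3, and the 50-word snippet column. It is one possible approach under stated assumptions, not a required implementation. The column names NTEE_CODE, CITY, and WEBSITE are taken from this posting and may differ in the actual IRS files. The helper names (`host_of`, `is_allowed`, `seed_hosts`, `find_mentions`), the sample rows, and the blocklist entry are hypothetical. `ROBOTSTXT_OBEY` and `DOWNLOAD_DELAY` are the Scrapy settings that correspond to the robots.txt and 1-second-delay requirements.

```python
import csv
import io
import re
from urllib.parse import urlparse

# Column names as written in the posting (assumption: the real files may differ).
NTEE_COL, CITY_COL, SITE_COL = "NTEE_CODE", "CITY", "WEBSITE"
ALLOWED_TLDS = (".org", ".net")
# Hypothetical placeholder; the real social/earned/paid media list is not in the posting.
BLOCKED_DOMAINS = {"example-news.org"}

# Scrapy settings matching "respect robots.txt, delay 1 second per page".
SCRAPY_SETTINGS = {"ROBOTSTXT_OBEY": True, "DOWNLOAD_DELAY": 1.0}


def host_of(url):
    """Return the lowercase hostname without a leading 'www.', or '' if unusable."""
    url = url.strip()
    if not url:
        return ""
    if "://" not in url:
        url = "http://" + url  # bare domains need a scheme for urlparse
    try:
        host = (urlparse(url).hostname or "").lower()
    except ValueError:
        return ""
    return host[4:] if host.startswith("www.") else host


def is_allowed(host):
    """Keep only .org/.net hosts (this also drops .gov) that are not blocklisted."""
    if not host.endswith(ALLOWED_TLDS):
        return False
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def seed_hosts(lines, cities, ntee="P84"):
    """Stream CSV lines and return deduplicated (CITY, host) pairs for one NTEE code."""
    wanted = {c.upper() for c in cities}
    seen, seeds = set(), []
    for row in csv.DictReader(lines):  # reads row by row, so large files are fine
        city = (row.get(CITY_COL) or "").strip().upper()
        code = (row.get(NTEE_COL) or "").strip().upper()
        if code != ntee or city not in wanted:
            continue
        host = host_of(row.get(SITE_COL) or "")
        if host and is_allowed(host) and host not in seen:
            seen.add(host)
            seeds.append((city, host))
    return seeds


def find_mentions(text, keywords, width=50):
    """Return (keyword, snippet) pairs; each snippet is up to `width` words near the first hit."""
    words = text.split()
    hits = []
    for kw in keywords:
        match = re.search(re.escape(kw), text, flags=re.IGNORECASE)
        if match is None:
            continue
        idx = len(text[:match.start()].split())  # approximate word index of the hit
        start = max(0, idx - width // 2)
        hits.append((kw, " ".join(words[start:start + width])))
    return hits


# Made-up rows for illustration only; not real IRS data.
SAMPLE = """NTEE_CODE,CITY,WEBSITE
P84,ATLANTA,www.example-center.org
P84,Atlanta,https://example-center.org/about
P84,ATLANTA,https://example-agency.gov
P84,ATLANTA,http://example-shop.com
A20,ATLANTA,example-arts.org
P84,SAVANNAH,example-coast.org
P84,ATLANTA,example-news.org
P84,ATLANTA,Example-Help.NET
"""

print(seed_hosts(io.StringIO(SAMPLE), ["Atlanta"]))
print(find_mentions("We offer Citizenship classes weekly.", ["citizenship", "visa"]))
```

With a downloaded state file, `seed_hosts` could be given the open file handle directly (for example `open("eo_ga.csv", newline="")`), so the whole CSV never has to sit in memory. Deduplicating by hostname rather than full URL is a design choice here; deduplicating by URL would keep multiple pages per organization.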