Downloads itemized receipts from a Kroger account with Puppeteer and categorizes them using the Kroger API and custom data cleanup scripts.
Install the prerequisites:

```sh
brew install 1password-cli
yarn install
```
At the moment, the output of both methods has to be manually copied/pasted into a JSON file, such as `src/data/receipts.json`.
This experimental script logs into Kroger and scrapes all the receipt data automatically.
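The core of that kind of script looks roughly like the sketch below. This is a hedged example rather than the actual implementation: the login URL, the selectors, and the credential source (e.g. the 1Password CLI installed above, or environment variables as shown here) are assumptions you will need to adjust.

```js
// Minimal sketch of a Puppeteer login + purchases visit. Selectors and URLs
// are assumptions; the real site may differ and may trigger bot protection.
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // Log in (credentials pulled from the environment here for simplicity).
  await page.goto("https://www.kroger.com/signin", { waitUntil: "networkidle2" });
  await page.type('input[type="email"]', process.env.KROGER_EMAIL);
  await page.type('input[type="password"]', process.env.KROGER_PASSWORD);
  await Promise.all([
    page.click('button[type="submit"]'),
    page.waitForNavigation({ waitUntil: "networkidle2" }),
  ]);

  // From here the real script walks /mypurchases and collects each receipt.
  await page.goto("https://www.kroger.com/mypurchases", { waitUntil: "networkidle2" });

  await browser.close();
})();
```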
If you run into issues with the automated scripts above, you can fall back to a more manual process to collect the same data: log in, navigate to the "/mypurchases" page, open the Chrome DevTools console, and paste in the code from the `scrape*.md` files.
Since you're driving a regular copy of Chrome yourself, this process is the least likely to be flagged as a bot, and unlike the automated scripts it lets you intervene if it runs into issues. We'll use `scrape2b.md` as an example.
This script can get the full list of purchases, but it often fails after collecting a few itemized receipts. I believe this is because bot protection is being triggered, but I haven't found an automated way to bypass it.
If it fails when fetching a batch of receipts, you can do the following:
- Scroll/click around the Kroger interface a bit so that you get reflagged as a human. Make sure you stay on the "/mypurchases" page, though; some of the tabs at the top keep you on the same page, so those are safe to click.
- In the console, manually rerun the batch that failed and the ones after it. For example, if it fails while grabbing the fourth batch, you'll need to run `processBatch(batches, 3)` and then the subsequent batches (see the sketch after this list).
- Repeat the steps above if additional batches fail.
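Because `processBatch` is already defined by the pasted code, rerunning the remaining batches from the console can be a simple loop like the one below. This is a sketch only; the starting index, the delay, and the assumption that `processBatch` returns a promise all depend on the code in `scrape2b.md`.

```js
// Hypothetical recovery loop: rerun batch 3 (the fourth batch) and everything
// after it, pausing between batches to look a little less bot-like.
for (let i = 3; i < batches.length; i++) {
  await processBatch(batches, i);
  await new Promise((resolve) => setTimeout(resolve, 5000)); // short pause
}
```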
This script pulls a simplified array of products out of `receipts.json` and exports it to `products.json`.
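A minimal sketch of that export step is below; the file paths and the receipt/item field names are assumptions and should be adjusted to match whatever your scraped `receipts.json` actually contains.

```js
// Flatten the scraped receipts into a simple product list for the next step.
const fs = require("fs");

const receipts = JSON.parse(fs.readFileSync("src/data/receipts.json", "utf8"));

// Assumes each receipt has an `items` array whose entries carry a UPC and a
// description; adjust the field names to match your data.
const products = receipts.flatMap((receipt) =>
  (receipt.items || []).map((item) => ({
    upc: item.upc,
    description: item.description,
  }))
);

fs.writeFileSync("src/data/products.json", JSON.stringify(products, null, 2));
```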
This script loads `products.json` and uses the Kroger API to request information about each product. The categorized products are saved to `src/data/categories.json`.
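Roughly, that lookup could look like the sketch below. It assumes the public Kroger Products API with a client-credentials OAuth token and the `product.compact` scope; the exact filter parameter for a UPC lookup and the shape of the response are assumptions that may need adjusting.

```js
// Sketch: look up each product via the Kroger API and save its categories.
// Endpoint paths, scopes, and the UPC filter parameter are assumptions.
const fs = require("fs");

async function getToken() {
  const res = await fetch("https://api.kroger.com/v1/connect/oauth2/token", {
    method: "POST",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
      Authorization:
        "Basic " +
        Buffer.from(
          `${process.env.KROGER_CLIENT_ID}:${process.env.KROGER_CLIENT_SECRET}`
        ).toString("base64"),
    },
    body: "grant_type=client_credentials&scope=product.compact",
  });
  return (await res.json()).access_token;
}

(async () => {
  const token = await getToken();
  const products = JSON.parse(fs.readFileSync("src/data/products.json", "utf8"));
  const categorized = [];

  for (const product of products) {
    // Searching by UPC via filter.term is an assumption; the API may expect a
    // different filter (or a product-details call) for exact UPC lookups.
    const res = await fetch(
      `https://api.kroger.com/v1/products?filter.term=${product.upc}`,
      { headers: { Authorization: `Bearer ${token}` } }
    );
    const { data } = await res.json();
    categorized.push({
      upc: product.upc,
      description: product.description,
      categories: data?.[0]?.categories ?? [],
    });
  }

  fs.writeFileSync(
    "src/data/categories.json",
    JSON.stringify(categorized, null, 2)
  );
})();
```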
This script matches the products from `products.json` back to `receipts.json`. The easiest approach is to loop through `categories.json` and assign each item's category based on its `upc` key.
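For example (a sketch; the field names follow the assumptions used in the earlier steps):

```js
// Build a UPC -> categories map from categories.json and fold it back into
// the receipt items.
const fs = require("fs");

const categories = JSON.parse(fs.readFileSync("src/data/categories.json", "utf8"));
const receipts = JSON.parse(fs.readFileSync("src/data/receipts.json", "utf8"));

const categoriesByUpc = new Map(categories.map((c) => [c.upc, c.categories]));

for (const receipt of receipts) {
  for (const item of receipt.items || []) {
    item.categories = categoriesByUpc.get(item.upc) ?? [];
  }
}

fs.writeFileSync("src/data/receipts.json", JSON.stringify(receipts, null, 2));
```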
The `src/dev/docs` folder contains markdown files that explain how the process works.
Run `yarn dev bot` to get a feel for how bot detectors like Akamai's Bot Manager (used by Kroger) detect bots and identify your device by its browser fingerprint.
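As a quick illustration of the kind of signals these detectors look at, you can compare a few well-known fingerprinting properties between a regular Chrome window and a Puppeteer-driven one (this list is far from exhaustive):

```js
// Run in the DevTools console; these are some of the classic signals
// fingerprinting scripts inspect.
console.log({
  webdriver: navigator.webdriver, // true in default Puppeteer/headless runs
  userAgent: navigator.userAgent,
  languages: navigator.languages,
  plugins: navigator.plugins.length, // often 0 in headless Chrome
  hardwareConcurrency: navigator.hardwareConcurrency,
});
```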
- How to scrape the web without getting blocked (Zyte.com)
- Detecting Headless Chrome’s Puppeteer Extra Stealth Plugin with JavaScript Browser Fingerprinting
- How To Make Puppeteer Undetectable
- Can a website detect when you are using Selenium with chromedriver?
- How to set User-Agent header with Puppeteer JS and not fail
- THE LAB #22: Scraping Akamai protected websites
- THE LAB #30: How to bypass Akamai protected website when nothing else works