Fix the url and gallery fetcher returning empty #279

Merged
Vylpes merged 3 commits from feature/277-url-returns-empty into hotfix/2.4.2 2026-06-08 20:49:09 +01:00
Owner
#277
Fix the url and gallery fetcher returning empty
All checks were successful
Test / build (push) Successful in 38s
6c129e494c
Smithy-bot approved these changes 2026-06-08 20:20:42 +01:00
Dismissed
Smithy-bot left a comment

Code Review

Branch: feature/277-url-returns-emptyhotfix/2.4.2
Issue: #277
Scope: 7 files, +88 / −205 lines
Tests: All 19 pass on the PR branch


Summary

This PR fixes gallery handling by reading gallery_data and media_metadata from the Reddit JSON API instead of scraping gallery HTML with htmlparser2. That removes a fragile second HTTP request, drops a dependency, and should directly fix the empty Url / Gallery bug.


Root cause (pre-existing)

The old code had a logic bug on top of the brittle scraping approach:

if (randomData.url.includes("/gallery")) {
    const galleryImage = await ImageHelper.FetchImageFromRedditGallery(`https://reddit.com${randomData.permalink}`);

    if (!galleryImage) {
        // ... error path
    }

    url = galleryImage[0];
    gallery = galleryImage;

FetchImageFromRedditGallery returns string[]. An empty array [] is truthy in JavaScript, so !galleryImage is false when scraping fails. The code then sets url = undefined and gallery = [] — matching issue #277.

The new approach avoids that path entirely.


What works well

  1. Correct data source — Gallery image URLs belong in the listing JSON (gallery_data, media_metadata, is_gallery), not in scraped HTML.
  2. Performance — One request instead of two per gallery post.
  3. Simpler dependency tree — Removing htmlparser2 and its DOM subtree is a clear win.
  4. HTML entity decodingconvertRedditUrl correctly turns & into & in preview URLs.
  5. Tests updated — Gallery tests now assert real metadata URLs and entity decoding; obsolete imageHelper tests are removed with the dead code.
  6. Typed contractIFetchResult extended with the Reddit fields the code actually uses.

Concerns and suggestions

1. Fallback may still return a non-image URL (minor)

When gallery metadata is missing, the PR falls back to randomData.url and returns success:

if (gallery.length == 0) {
    gallery = [randomData.url];
}

For galleries that is often something like https://i.redd.it/gallery/cr8xudsnkgua1, which is not a direct image. That is better than Url: undefined and Gallery: [], but consumers expecting embeddable image URLs may still fail in rare metadata-missing cases.

Suggestion: Acceptable for this hotfix. If you want stricter behavior later, only fall back for non-gallery posts, or keep returning NoImageResultsFound when is_gallery is true but metadata yields no URLs.

2. Only metadata.s.u is used (minor)

getGalleryImageUrls reads only the s (source/preview) entry. Gallery items that are videos or lack s.u are skipped. If every item is skipped, you hit the fallback above.

Suggestion: Fine for a rabbit-image bot focused on still images. No change needed unless video galleries matter.

Filter changed from url.includes("/gallery/") to is_gallery == true. That aligns with the API. Legacy posts without is_gallery would be excluded, but that field has been standard on Reddit gallery posts for years.

4. Helper functions live in index.ts (nit)

convertRedditUrl and getGalleryImageUrls are reasonable as module-private helpers. Extracting them would only help if you want isolated unit tests; not required here.

5. Missing edge-case test (nit)

Worth adding a test where one gallery item has metadata and another does not, to document partial-gallery behavior:

// gallery_data has 2 items, only one has media_metadata entry
// expect Gallery length === 1

Pre-existing issues (not introduced by this PR)

  • Image filter still only matches .jpg, .png, or galleries — .gif / .webp single-image posts are still excluded.
  • Old gallery test mocked FetchImageFromRedditGallery with a string instead of string[]; that test is gone with the helper, so this is moot.

Verdict

Approve with minor notes. The fix is sound, targets the real failure mode, simplifies the codebase, and tests pass. The metadata-based approach is the right long-term design for Reddit galleries.

The only behavioral change to be aware of: gallery posts with missing metadata now succeed with a gallery slug URL instead of erroring. That is a reasonable hotfix tradeoff and still better than returning empty values.


Suggested test plan before merge

  • Unit tests pass (yarn test)
  • Manual check against a live subreddit known to have gallery posts (e.g. r/rabbits) — confirm Url and Gallery are populated with preview.redd.it URLs
  • Confirm a normal .jpg post still works unchanged
  • Verify CLI / JSON output consumers handle the new gallery URL format
## Code Review **Branch:** `feature/277-url-returns-empty` → `hotfix/2.4.2` **Issue:** #277 **Scope:** 7 files, +88 / −205 lines **Tests:** All 19 pass on the PR branch --- ### Summary This PR fixes gallery handling by reading `gallery_data` and `media_metadata` from the Reddit JSON API instead of scraping gallery HTML with `htmlparser2`. That removes a fragile second HTTP request, drops a dependency, and should directly fix the empty `Url` / `Gallery` bug. --- ### Root cause (pre-existing) The old code had a logic bug on top of the brittle scraping approach: ```typescript if (randomData.url.includes("/gallery")) { const galleryImage = await ImageHelper.FetchImageFromRedditGallery(`https://reddit.com${randomData.permalink}`); if (!galleryImage) { // ... error path } url = galleryImage[0]; gallery = galleryImage; ``` `FetchImageFromRedditGallery` returns `string[]`. An empty array `[]` is truthy in JavaScript, so `!galleryImage` is `false` when scraping fails. The code then sets `url = undefined` and `gallery = []` — matching issue #277. The new approach avoids that path entirely. --- ### What works well 1. **Correct data source** — Gallery image URLs belong in the listing JSON (`gallery_data`, `media_metadata`, `is_gallery`), not in scraped HTML. 2. **Performance** — One request instead of two per gallery post. 3. **Simpler dependency tree** — Removing `htmlparser2` and its DOM subtree is a clear win. 4. **HTML entity decoding** — `convertRedditUrl` correctly turns `&` into `&` in preview URLs. 5. **Tests updated** — Gallery tests now assert real metadata URLs and entity decoding; obsolete `imageHelper` tests are removed with the dead code. 6. **Typed contract** — `IFetchResult` extended with the Reddit fields the code actually uses. --- ### Concerns and suggestions #### 1. Fallback may still return a non-image URL (minor) When gallery metadata is missing, the PR falls back to `randomData.url` and returns success: ```typescript if (gallery.length == 0) { gallery = [randomData.url]; } ``` For galleries that is often something like `https://i.redd.it/gallery/cr8xudsnkgua1`, which is not a direct image. That is better than `Url: undefined` and `Gallery: []`, but consumers expecting embeddable image URLs may still fail in rare metadata-missing cases. **Suggestion:** Acceptable for this hotfix. If you want stricter behavior later, only fall back for non-gallery posts, or keep returning `NoImageResultsFound` when `is_gallery` is true but metadata yields no URLs. #### 2. Only `metadata.s.u` is used (minor) `getGalleryImageUrls` reads only the `s` (source/preview) entry. Gallery items that are videos or lack `s.u` are skipped. If every item is skipped, you hit the fallback above. **Suggestion:** Fine for a rabbit-image bot focused on still images. No change needed unless video galleries matter. #### 3. Gallery detection tightened — generally good Filter changed from `url.includes("/gallery/")` to `is_gallery == true`. That aligns with the API. Legacy posts without `is_gallery` would be excluded, but that field has been standard on Reddit gallery posts for years. #### 4. Helper functions live in `index.ts` (nit) `convertRedditUrl` and `getGalleryImageUrls` are reasonable as module-private helpers. Extracting them would only help if you want isolated unit tests; not required here. #### 5. Missing edge-case test (nit) Worth adding a test where one gallery item has metadata and another does not, to document partial-gallery behavior: ```typescript // gallery_data has 2 items, only one has media_metadata entry // expect Gallery length === 1 ``` --- ### Pre-existing issues (not introduced by this PR) - Image filter still only matches `.jpg`, `.png`, or galleries — `.gif` / `.webp` single-image posts are still excluded. - Old gallery test mocked `FetchImageFromRedditGallery` with a string instead of `string[]`; that test is gone with the helper, so this is moot. --- ### Verdict **Approve with minor notes.** The fix is sound, targets the real failure mode, simplifies the codebase, and tests pass. The metadata-based approach is the right long-term design for Reddit galleries. The only behavioral change to be aware of: gallery posts with missing metadata now succeed with a gallery slug URL instead of erroring. That is a reasonable hotfix tradeoff and still better than returning empty values. --- ### Suggested test plan before merge - [x] Unit tests pass (`yarn test`) - [ ] Manual check against a live subreddit known to have gallery posts (e.g. `r/rabbits`) — confirm `Url` and `Gallery` are populated with `preview.redd.it` URLs - [ ] Confirm a normal `.jpg` post still works unchanged - [ ] Verify CLI / JSON output consumers handle the new gallery URL format
Member

Test results for review findings

Validated on branch feature/277-url-returns-empty using yarn test and yarn start.


Unit tests

yarn test
# 3 suites, 19 tests — all passed

Finding 1: Root cause (![] is falsy) — CONFIRMED

![] = false
g[0] = undefined

When the old HTML scraper returned [], the if (!galleryImage) guard did not trigger, producing Url: undefined and Gallery: []. This matches #277.


Mocked Reddit listing with gallery_data + media_metadata:

{
  "Url": "https://preview.redd.it/image-1.jpg?width=640&crop=smart&auto=webp&s=1",
  "Gallery": [
    "https://preview.redd.it/image-1.jpg?width=640&crop=smart&auto=webp&s=1",
    "https://preview.redd.it/image-2.jpg?width=640&crop=smart&auto=webp&s=2"
  ]
}
  • Url and Gallery are populated (no empty values)
  • & correctly decoded to & in URLs
  • All gallery entries are preview.redd.it image URLs

Finding 3: Normal .jpg post unchanged — CONFIRMED (yarn start --json)

{
  "Url": "https://i.redd.it/testbunny.jpg",
  "Gallery": ["https://i.redd.it/testbunny.jpg"]
}

Finding 4: Fallback when metadata missing — CONFIRMED (yarn start --json)

Gallery post with is_gallery: true but empty media_metadata:

{
  "Url": "https://i.redd.it/gallery/cr8xudsnkgua1",
  "Gallery": ["https://i.redd.it/gallery/cr8xudsnkgua1"]
}

Returns success (not an error), but the URL is a gallery slug — not a direct image. As noted in the review, this is acceptable for the hotfix but worth knowing for consumers.


2 gallery items, only 1 has media_metadata:

{
  "Url": "https://preview.redd.it/partial.jpg?width=640&s=1",
  "Gallery": ["https://preview.redd.it/partial.jpg?width=640&s=1"]
}

Missing items are skipped; valid items are still returned.


Live Reddit (yarn start --json) — BLOCKED from test runner

yarn start --json
# Failed to fetch result from Reddit (exit 1)
# got-cjs: Response code 403 (Blocked)

Reddit blocks unauthenticated requests from this environment. Live verification against r/rabbits should be done manually by @VylpesTester from a network Reddit allows.


Verdict after testing

All review findings that can be exercised locally are confirmed. The PR fix resolves the empty Url/Gallery bug; gallery posts now return proper preview.redd.it URLs via JSON metadata. No regressions observed for standard .jpg posts.

Remaining manual step: one live yarn start / yarn start --json run against r/rabbits from an unblocked network.

## Test results for review findings Validated on branch `feature/277-url-returns-empty` using `yarn test` and `yarn start`. --- ### Unit tests ``` yarn test # 3 suites, 19 tests — all passed ``` --- ### Finding 1: Root cause (`![]` is falsy) — **CONFIRMED** ``` ![] = false g[0] = undefined ``` When the old HTML scraper returned `[]`, the `if (!galleryImage)` guard did not trigger, producing `Url: undefined` and `Gallery: []`. This matches #277. --- ### Finding 2: Gallery metadata extraction — **CONFIRMED** (`yarn start --json`) Mocked Reddit listing with `gallery_data` + `media_metadata`: ```json { "Url": "https://preview.redd.it/image-1.jpg?width=640&crop=smart&auto=webp&s=1", "Gallery": [ "https://preview.redd.it/image-1.jpg?width=640&crop=smart&auto=webp&s=1", "https://preview.redd.it/image-2.jpg?width=640&crop=smart&auto=webp&s=2" ] } ``` - `Url` and `Gallery` are populated (no empty values) - `&` correctly decoded to `&` in URLs - All gallery entries are `preview.redd.it` image URLs --- ### Finding 3: Normal `.jpg` post unchanged — **CONFIRMED** (`yarn start --json`) ```json { "Url": "https://i.redd.it/testbunny.jpg", "Gallery": ["https://i.redd.it/testbunny.jpg"] } ``` --- ### Finding 4: Fallback when metadata missing — **CONFIRMED** (`yarn start --json`) Gallery post with `is_gallery: true` but empty `media_metadata`: ```json { "Url": "https://i.redd.it/gallery/cr8xudsnkgua1", "Gallery": ["https://i.redd.it/gallery/cr8xudsnkgua1"] } ``` Returns success (not an error), but the URL is a gallery slug — not a direct image. As noted in the review, this is acceptable for the hotfix but worth knowing for consumers. --- ### Finding 5: Partial gallery (one item missing metadata) — **CONFIRMED** (`yarn start --json`) 2 gallery items, only 1 has `media_metadata`: ```json { "Url": "https://preview.redd.it/partial.jpg?width=640&s=1", "Gallery": ["https://preview.redd.it/partial.jpg?width=640&s=1"] } ``` Missing items are skipped; valid items are still returned. --- ### Live Reddit (`yarn start --json`) — **BLOCKED from test runner** ``` yarn start --json # Failed to fetch result from Reddit (exit 1) # got-cjs: Response code 403 (Blocked) ``` Reddit blocks unauthenticated requests from this environment. Live verification against `r/rabbits` should be done manually by @VylpesTester from a network Reddit allows. --- ### Verdict after testing All review findings that can be exercised locally are **confirmed**. The PR fix resolves the empty `Url`/`Gallery` bug; gallery posts now return proper `preview.redd.it` URLs via JSON metadata. No regressions observed for standard `.jpg` posts. **Remaining manual step:** one live `yarn start` / `yarn start --json` run against `r/rabbits` from an unblocked network.
Fix Reddit fetch failures by using old.reddit.com with session cookies.
All checks were successful
Test / build (push) Successful in 1m1s
931834e22a
Reddit now returns 403 for unauthenticated reddit.com JSON requests, so bootstrap a session via old.reddit.com before fetching listings.

Co-authored-by: Cursor <cursoragent@cursor.com>
Vylpes dismissed Smithy-bot's review 2026-06-08 20:39:23 +01:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

Extract Reddit and gallery logic into helpers with unit tests.
All checks were successful
Test / build (push) Successful in 35s
37d68349fb
Move listing fetch and gallery URL parsing out of index.ts into dedicated helper classes to keep the entry point focused on orchestration.

Co-authored-by: Cursor <cursoragent@cursor.com>
Vylpes merged commit 1ee07ba332 into hotfix/2.4.2 2026-06-08 20:49:09 +01:00
Vylpes deleted branch feature/277-url-returns-empty 2026-06-08 20:49:10 +01:00
Sign in to join this conversation.
No description provided.