This repository was archived by the owner on Feb 2, 2026. It is now read-only.

Add more advanced (poach_session based) conflict resolution for poached data #40

@bitspook


Context:

Entropy (the app) has "poachers": scrapers that go out and gather (poach)
content from different sources. For example, they go to meetup.com to collect
the events being organized in Chandigarh, and they check the local filesystem
(e.g. the `./events/` directory) to see whether new events have been added
there.

Problem:

Performing an update operation on poached data is not possible, because we
can't tell which records have been deleted at the source.

Example:

- We collect 4 blog posts from a directory =posts= using the local poacher and
  keep them in the database.
- By the next time the poacher is run, the user has
  - changed the content of 2 blog posts
  - deleted 1
  - left 1 intact
  - added 1 new post
- If we perform an =upsert= operation, we can update the existing posts and
  create the new one, but we can't tell which post (if any) was deleted.
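
The example above can be reproduced in a few lines. This is only a sketch: the =posts= table, its =slug= and =content= columns, the =upsert_posts= helper, and the use of SQLite are assumptions made for illustration, not Entropy's actual schema or code.

```python
import sqlite3


def upsert_posts(conn, posts):
    """Upsert {slug: content} into a hypothetical `posts` table."""
    conn.executemany(
        "INSERT INTO posts (slug, content) VALUES (?, ?) "
        "ON CONFLICT(slug) DO UPDATE SET content = excluded.content",
        list(posts.items()),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (slug TEXT PRIMARY KEY, content TEXT)")

# First poach: 4 posts found in the directory.
upsert_posts(conn, {"a": "v1", "b": "v1", "c": "v1", "d": "v1"})

# Second poach: a and b changed, c deleted at the source, d intact, e new.
upsert_posts(conn, {"a": "v2", "b": "v2", "d": "v1", "e": "v1"})

slugs = [row[0] for row in conn.execute("SELECT slug FROM posts ORDER BY slug")]
print(slugs)  # ['a', 'b', 'c', 'd', 'e'], the deleted post "c" is still there
```

The upsert never sees =c= in the second run, so nothing tells the database that the row should go away.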

Right now, for the sake of simplicity (of implementation), we delete all
previously poached data when we re-poach, e.g. all the groups/events collected
from local (i.e. the fs) are deleted before the next poach is performed.
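
For comparison, the current wipe-and-re-poach behavior might look roughly like the sketch below. The =events= table, its =source= column, and the =repoach_by_wiping= helper are hypothetical names chosen for the example; the real implementation may differ.

```python
import sqlite3


def repoach_by_wiping(conn, source, events):
    """Delete every row poached from `source`, then insert the fresh rows."""
    with conn:  # the delete and the inserts commit (or roll back) together
        conn.execute("DELETE FROM events WHERE source = ?", (source,))
        conn.executemany(
            "INSERT INTO events (source, slug, title) VALUES (?, ?, ?)",
            [(source, slug, title) for slug, title in events.items()],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "source TEXT, slug TEXT, title TEXT, PRIMARY KEY (source, slug))"
)

repoach_by_wiping(conn, "meetup", {"m1": "Meetup talk"})
repoach_by_wiping(conn, "local", {"e1": "Old", "e2": "Old"})
# e1 was removed from the filesystem, e2 changed, e3 is new.
repoach_by_wiping(conn, "local", {"e2": "New", "e3": "New"})

rows = conn.execute(
    "SELECT source, slug, title FROM events ORDER BY source, slug"
).fetchall()
print(rows)
```

Deletions are handled correctly, but only because every row from that source is thrown away and rewritten on each run, including rows that did not change.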

This ticket proposes a better behavior:

At its root, the problem we are trying to solve is identifying the deleted data.

  1. Keep a poach_session column in all the tables which store poached data
    • poach_session is a separate sequence per source, i.e. local data has its
      own poach_session sequence, the meetup poacher has a different one, and
      so on
  2. On the next poach, increment the version by 1 (hereafter referred to as
    current_poach_session)
  3. On conflict, update the row with the new data and set its poach_session to
    current_poach_session
  4. After the poacher is done, all the rows still at version
    current_poach_session - 1 were not seen in this run, so they have been
    deleted at the original source. These can be safely deleted.
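
The four steps could be sketched as follows. Everything here is an assumption for illustration: the =poach_sessions= counter table, the =posts= schema, the =poach= helper, and SQLite itself are not taken from Entropy. The sketch also deletes rows with poach_session lower than the current one rather than exactly one less; the two are equivalent when every run finishes its cleanup, and the looser check would also catch rows left behind by an interrupted run.

```python
import sqlite3

SCHEMA = """
CREATE TABLE poach_sessions (source TEXT PRIMARY KEY, session INTEGER NOT NULL);
CREATE TABLE posts (
    source TEXT NOT NULL,
    slug TEXT NOT NULL,
    content TEXT NOT NULL,
    poach_session INTEGER NOT NULL,  -- step 1: session column on poached data
    PRIMARY KEY (source, slug)
);
"""


def poach(conn, source, posts):
    """Run one poach for `source`; return the slugs removed as stale."""
    with conn:
        # Step 2: bump this source's own session counter (first run gets 1).
        conn.execute(
            "INSERT INTO poach_sessions (source, session) VALUES (?, 1) "
            "ON CONFLICT(source) DO UPDATE SET session = session + 1",
            (source,),
        )
        (session,) = conn.execute(
            "SELECT session FROM poach_sessions WHERE source = ?", (source,)
        ).fetchone()
        # Step 3: upsert, stamping every row seen in this run.
        conn.executemany(
            "INSERT INTO posts (source, slug, content, poach_session) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT(source, slug) DO UPDATE SET "
            "content = excluded.content, "
            "poach_session = excluded.poach_session",
            [(source, slug, body, session) for slug, body in posts.items()],
        )
        # Step 4: rows not stamped in this run were not found at the source.
        stale = [
            row[0]
            for row in conn.execute(
                "SELECT slug FROM posts "
                "WHERE source = ? AND poach_session < ? ORDER BY slug",
                (source, session),
            )
        ]
        conn.execute(
            "DELETE FROM posts WHERE source = ? AND poach_session < ?",
            (source, session),
        )
    return stale


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

first = poach(conn, "local", {"a": "v1", "b": "v1", "c": "v1", "d": "v1"})
second = poach(conn, "local", {"a": "v2", "b": "v2", "d": "v1", "e": "v1"})
remaining = [r[0] for r in conn.execute("SELECT slug FROM posts ORDER BY slug")]
print(first, second, remaining)  # [] ['c'] ['a', 'b', 'd', 'e']
```

Note that the unchanged post =d= is still re-stamped in step 3; that stamp is what distinguishes "seen but identical" from "missing at the source".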

Labels: debt (Technical Debt)