This repository was archived by the owner on Feb 2, 2026. It is now read-only.
Context:
Entropy (the app) has a thing called "poachers": scrapers which go out
and gather/poach content from different sources. For example, they go to meetup.com to
collect the events being organized in Chandigarh, they check the local
filesystem (e.g. the `./events/` directory) to see if new events have been added
there, etc.
Problem:
Updating poached data in place is not possible because we
can't tell which records have been deleted at the source.
Example:
- We collect 4 blog posts from a directory =posts= using the local poacher and keep
  them in the database.
- The next time the poacher is run, the user has
  - changed the content of 2 blog posts
  - deleted 1
  - left 1 intact
  - added 1 new post
- If we perform an =upsert= operation, we can update the existing posts and create
  the new one, but we can't tell which post (if any) was deleted.
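The example above can be reproduced with a small sketch. The table layout, the `upsert` helper, and the file names are made up for illustration and are not Entropy's actual schema. It uses SQLite's `ON CONFLICT ... DO UPDATE` clause (SQLite 3.24 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per poached blog post, keyed by its path.
conn.execute("CREATE TABLE posts (path TEXT PRIMARY KEY, body TEXT NOT NULL)")


def upsert(conn, posts):
    """Insert new posts and overwrite the body of posts that already exist."""
    conn.executemany(
        "INSERT INTO posts (path, body) VALUES (?, ?) "
        "ON CONFLICT(path) DO UPDATE SET body = excluded.body",
        list(posts.items()),
    )


# First poach: 4 posts.
upsert(conn, {"a.md": "v1", "b.md": "v1", "c.md": "v1", "d.md": "v1"})
# Second poach: a and b changed, c deleted at the source, d intact, e added.
upsert(conn, {"a.md": "v2", "b.md": "v2", "d.md": "v1", "e.md": "v1"})

stored = dict(conn.execute("SELECT path, body FROM posts"))
# "c.md" is still stored, and nothing in the table marks it as stale.
print(sorted(stored))
```

After the second run the table holds five rows, including the post that no longer exists at the source.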
Right now, for the sake of simplicity (of implementation), we delete all
previously poached data when we re-poach, e.g. all the groups/events collected
from local (i.e. fs) are deleted before the next poach is performed.
This ticket proposes a better behavior:
At its root, the problem we are trying to solve is identifying the deleted data.
- Keep a `poach_session` column in all the tables which store poached data.
- `poach_session` is a separate sequence per source, i.e. local data has its own `poach_session` sequence, the meetup poacher has a different one, and so on.
- On the next poach, increment the session by 1 (hereafter referred to as `current_poach_session`).
- Insert new rows with `current_poach_session`; on conflict, update the row with the new data and set its `poach_session` to `current_poach_session`.
- After the poacher is done, all the rows still at `current_poach_session - 1`
  have been deleted at the original source. These can be safely deleted.
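A minimal sketch of the proposed flow, assuming SQLite and a single `posts` table. The `poach_sessions` counter table, the column names, and the `poach` function are assumptions made for illustration, not part of Entropy. The cleanup deletes rows with `poach_session < current` rather than exactly `current - 1`; the two are equivalent when every run finishes its cleanup, and the broader condition also catches rows left behind by an interrupted run.

```python
import sqlite3

# Hypothetical schema: a per-source session counter plus one poached-data table.
SCHEMA = """
CREATE TABLE poach_sessions (
    source  TEXT PRIMARY KEY,
    session INTEGER NOT NULL
);
CREATE TABLE posts (
    source        TEXT NOT NULL,
    external_id   TEXT NOT NULL,
    content       TEXT NOT NULL,
    poach_session INTEGER NOT NULL,
    PRIMARY KEY (source, external_id)
);
"""


def poach(conn, source, records):
    """Store `records` (external_id -> content) for `source`.

    Returns (current_poach_session, number_of_rows_deleted).
    """
    with conn:  # run all steps in a single transaction
        # Start at 1 for a new source, otherwise bump that source's sequence.
        conn.execute(
            "INSERT INTO poach_sessions (source, session) VALUES (?, 1) "
            "ON CONFLICT(source) DO UPDATE SET session = session + 1",
            (source,),
        )
        (current,) = conn.execute(
            "SELECT session FROM poach_sessions WHERE source = ?", (source,)
        ).fetchone()

        # Upsert: every row seen in this run ends up stamped with `current`.
        conn.executemany(
            "INSERT INTO posts (source, external_id, content, poach_session) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT(source, external_id) DO UPDATE SET "
            "content = excluded.content, poach_session = excluded.poach_session",
            [(source, eid, content, current) for eid, content in records.items()],
        )

        # Rows of this source not stamped in this run are gone at the origin.
        deleted = conn.execute(
            "DELETE FROM posts WHERE source = ? AND poach_session < ?",
            (source, current),
        ).rowcount
    return current, deleted


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = poach(conn, "local", {"a": "v1", "b": "v1", "c": "v1", "d": "v1"})
second = poach(conn, "local", {"a": "v2", "b": "v2", "d": "v1", "e": "v1"})
other = poach(conn, "meetup", {"m": "v1"})
```

Because the session counter and the cleanup are both scoped to `source`, poaching `meetup` never touches rows collected by the `local` poacher.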