python-att-mobility-pdf-extractor

Extract voice, data, and SMS records from AT&T Mobility Usage PDF reports into structured CSV files — automatically, page by page, with no manual copying.

Status: April 2026

⚡ Quick Navigation: The Problem | How it Works | The Debug Step | Output | Quick Start | 📩 Get in Touch

The problem

A PDF lands on your desk. Fifty-one pages. Hundreds of rows. Three types of records — voice calls, data sessions, SMS messages — mixed across the pages, each with its own set of columns.

Someone needs that data in a spreadsheet. Or a database. Or a case management system.

The obvious move is to open the PDF, select the table, and paste it somewhere. Except it does not paste cleanly. The columns bleed into each other. The multi-line rows collapse. The cell location coordinates — [17657/30852:-82.94528:39.95389:134] — end up split across three cells.

So someone copies it row by row. Or writes a fragile script that breaks the moment the layout shifts by a few pixels.

This project solves that problem properly.

Context

AT&T Mobility Usage reports are lawful intercept documents — PDFs generated by AT&T's SCAMP system in response to court orders or law enforcement requests. They contain the full call detail records (CDRs) for a given number over a given period: every call made and received, every data session opened, every SMS sent or received — with timestamps, durations, IMEI, IMSI, and cell tower location.

These documents are used in criminal investigations, civil litigation, and compliance workflows. The data inside them needs to be ingested into analysis tools, mapped against timelines, or cross-referenced with other records.

Doing that manually, page by page, is not a workflow. It is a bottleneck.

How it works

The PDF contains real text — not a scanned image — laid out as a fixed-width table across landscape pages. pdfplumber is used to extract the data spatially, using the position of known anchor words to locate the header block and table area on each page.

program.py
    │
    ├── INSPECTION
    │     ├── pdfplumber_to_fitz()   — convert coordinates between pdfplumber and PyMuPDF
    │     └── draw_section_areas()   — render coloured boxes onto a debug PDF
    │
    └── EXTRACTION
          ├── normalize_table()      — merge split rows into single clean rows
          ├── extract_data()         — iterate all pages, detect section type, accumulate records
          └── write_csv()            — write voice / data / SMS records to separate CSV files

Detecting the section type

Each page belongs to one of three sections — Voice, Data, or SMS. The section type is identified from the page header, which contains a line like Voice Usage For: (614)404-6348. The header area is cropped spatially using the position of the word Run as the top anchor and Number: as the bottom anchor.

if 'Voice Usage For:' in header_text:
    section_type = 'voice'
elif 'Data Usage For:' in header_text:
    section_type = 'data'
elif 'SMS Usage For:' in header_text:
    section_type = 'sms'

Extracting the table

The table area is cropped from Item (the first column header) down to just above the AT&T Proprietary footer. Within that crop, pdfplumber extracts the table using a text-based strategy — no borders required.

table = table_crop.extract_table({
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
})

Normalising split rows

Some records span two visual lines in the PDF — the IMEI appears on a second line below the main row, for example. The raw table reflects this as separate rows where the first cell is empty. normalize_table() detects these and merges them column by column into a single clean row.

The debug step

Running extract_data() with inspect=True produces a debug PDF (sections_inspection.pdf) showing exactly which areas are being used for the header and table extraction on the first page. Adjust the anchor logic until the boxes sit correctly, then run the full extraction.

Output

Three CSV files, one per section:

output/
├── voice_usage.csv
├── data_usage.csv
└── sms_usage.csv

See full CSV outputs in the output/ folder.

Voice records

Item	Conn. Date	Conn. Time	Seizure Time	Originating Number	Terminating Number	Elapsed Time	Number Dialed	IMEI	IMSI	Description	Cell Location
1	05/11/12	12:24A	0:11	16144046348	16144047261	0:00	16144047261	358920045328070	310410493829733	M2M_DIR	[17657/30475:-82.90778:39.93111:134]

Data records

Item	Conn. Date	Conn. Time	Originating Number	Elapsed Time	Bytes Up	Bytes Down	IMEI	IMSI	Access Pt	Description	Cell Location
1	05/11/12	12:20A	16144046348	14:54	242280	487774 9	3589200453280704	310410493829733	phone	MOBILE_DATA	[17657/30851:-82.94528:39.95389:19]

SMS records

Item	Conn. Date	Conn. Time	Originating Number	Terminating Number	IMEI	IMSI	Description	Cell Location
1	05/11/12	09:51A	1852894	16144046348	3589200453280704	310410493829733	IN	[17657/30452:-82.82917:39.95417:134]

Quick Start

git clone https://github.com/hasff/python-att-mobility-pdf-extractor.git
cd python-att-mobility-pdf-extractor
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt

Place your AT&T Mobility Usage PDF in the input/ folder, rename it to att_original.pdf, and run:

python program.py

The extracted records will be saved to the output/ folder as three CSV files.

To run in inspection mode and verify the detected areas visually:

# in program.py __main__ block
data = extract_data(FILE, inspect=True)

This saves a sections_inspection.pdf to output/ with the header and table areas highlighted.

Need this for your organisation?

CDR analysis is a common bottleneck in investigations and compliance workflows. The records are all there — timestamps, cell towers, device identifiers — but locked in a format that does not connect to anything.

I build PDF data extraction pipelines for:

legal teams and investigators processing carrier records for litigation or case analysis
compliance and audit functions handling telecommunications data
anyone who currently copies CDR data out of PDFs by hand

📩 Contact: hugoferro.business(at)gmail.com

🌐 Courses and professional tools: https://hasff.github.io/site/

Further Learning

The pdfplumber techniques used in this project — coordinate-based spatial cropping, text strategy table extraction, rotation handling, and the relationship between pdfplumber and PyMuPDF coordinate spaces — are covered in depth in my course:

Python PDF Manipulation: From Beginner to Winner (PyMuPDF)

The repository is fully usable on its own. The course provides the deeper understanding behind the decisions made here.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
docs		docs
input		input
output		output
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
program.py		program.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

python-att-mobility-pdf-extractor

⚡ Quick Navigation: The Problem | How it Works | The Debug Step | Output | Quick Start | 📩 Get in Touch

The problem

Context

How it works

Detecting the section type

Extracting the table

Normalising split rows

The debug step

Output

Voice records

Data records

SMS records

Quick Start

Need this for your organisation?

Further Learning

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

python-att-mobility-pdf-extractor

⚡ Quick Navigation: The Problem | How it Works | The Debug Step | Output | Quick Start | 📩 Get in Touch

The problem

Context

How it works

Detecting the section type

Extracting the table

Normalising split rows

The debug step

Output

Voice records

Data records

SMS records

Quick Start

Need this for your organisation?

Further Learning

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages