Skip to content

Throw Warning When Reading File with Blank Lines#7707

Open
Asa-Henry wants to merge 7 commits intoRdatatable:masterfrom
Asa-Henry:fix-3339
Open

Throw Warning When Reading File with Blank Lines#7707
Asa-Henry wants to merge 7 commits intoRdatatable:masterfrom
Asa-Henry:fix-3339

Conversation

@Asa-Henry
Copy link
Copy Markdown

Adds a check to detect when blank lines are present in the file passed to fread, but the user did not specify to skip blank lines. This change throws a warning to let the user know they may want to consider setting the blank.lines.skip to TRUE.

…kip' is greater than 0 when blank lines are present, so I also check if blank lines should be skipped so I can throw a warning to let the user know.
@jangorecki
Copy link
Copy Markdown
Member

PR could have meaningful title

@Asa-Henry Asa-Henry changed the title Fix #3339 Throw Warning When Reading File with Blank Lines Apr 14, 2026
@tdhock
Copy link
Copy Markdown
Member

tdhock commented Apr 14, 2026

please add a test case

@aitap
Copy link
Copy Markdown
Member

aitap commented Apr 14, 2026

This is a good start.

Can you run R CMD build . in the Git checkout directory and then R CMD check data.table_1.18.99.tar.gz? You'll need a working compiler for this (Rtools on Windows, highly specific compiler versions and libraries on macOS, feel free to ask otherwise) and all of data.table's dependencies installed (which you can do using something like source('.ci/ci.R'); dcf.dependencies(which = 'all') |> install.packages(); this is needed only once). Generally, data.table is required to pass the check with nothing worse than a few NOTEs. (There are some shorthands for this process that need a little more familiarity with R. You can learn about them if you want to by reading .dev/cc.R. See also ?data.table::test.data.table.)

You can see that the tests don't pass. I think that tests 1578.1, 1578.2, 1578.6, 1883 will need to be updated for the new warning, possibly 1867.07 and 1867.11 too. Unfortunately, other tests show a regression:

  • tests 1840.2, 1867.05, 1867.06, 1867.09, 2251.07, 2251.09 produce the "blank line" warning despite there are no blank lines in the input:
    write('a,b\n ab,cd,ce\nabc,def \n hj,kli  ', f<-tempfile())
    fread(f)
    fread("A\nCol1,Col2\n1,3,5\n2,4,6\n")
    fread("Some header\ninfo\nCol1,Col2,Col3\n1,3,5\n2,4,6\n")
    fread("Heading text\nA,B,C,D,E\n1,3,5\n2,4,6\n")
    lines = c(
    "12223, University",
    "12227, bridge, Sky",
    "12828, Sunset",
    "13801, Ground",
    "14853, Tranceamerica",
    "14854, San Francisco",
    "15595, shibuya, Shrine",
    "16126, fog, San Francisco",
    "16520, California, ocean, summer, golden gate, beach, San Francisco",
    "")
    text = paste(lines, collapse="\n")
    fread(text)
    text = paste(lines[c(1:5, 9L, 6:8, 10L)], collapse="\n")
    fread(text)
  • test 1856.3 warns about the blank lines despite they don't pose a real problem:
    fread("A,B\n\n\n")

Any ideas how the spurious warnings can be fixed? If you need to debug data.table::fread while it runs, you can run R -d $your_debugger or call Sys.getpid() and attach to the running process.

@Asa-Henry
Copy link
Copy Markdown
Author

Thanks for this comprehensive response @aitap! This made it easier to understand what went wrong with my changes!

In response to your question: it appears I introduced confusion between a file with intermittent blank lines and a file which has the header and the data separated by a blank line. For issues such as 1840.2, looks like I over looked how spaces come into play when parsing...

I picked up on that topSkip, prevStart, and topStart are used to find chunks of lines, but topSkip was greater than zero for the example case presented in the issue #3339, but I don't know enough about freadMain to know what all goes into dealing with blank lines. Is there any insight for how freadMain handles files or maybe that I'm working in the wrong place?

… accomodate situation where the header and data are separated by a blank line.
…y a blank line. In the case of each line separated by a blank line, 'prevStart' is always NULL because each line could be a possible header.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 99.03%. Comparing base (8364344) to head (8a8056d).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/fread.c 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7707      +/-   ##
==========================================
- Coverage   99.04%   99.03%   -0.01%     
==========================================
  Files          87       87              
  Lines       17037    17039       +2     
==========================================
+ Hits        16874    16875       +1     
- Misses        163      164       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Asa-Henry
Copy link
Copy Markdown
Author

What is the problem with the 'atime performance tests'? It's saying that fix-3339 is an invalid reference: I wondering if I made the pull request from the wrong place?

@MichaelChirico
Copy link
Copy Markdown
Member

What is the problem with the 'atime performance tests'? It's saying that fix-3339 is an invalid reference: I wondering if I made the pull request from the wrong place?

That CI doesn't work on PRs from forks -- you can ignore it

@Asa-Henry
Copy link
Copy Markdown
Author

May I get some help writing a test? I'm trying to make a test using the input shown in the initial issue #3339. The following is what I have:

input = "x y\n\n1 a\n\n2 b\n\n3 c"
test(
  1578.10,
  fread(input),
  data.table(V1=3, V2=c),
  warning="The rows in this file appear to be separated by blank lines. This resulted in most rows being skipped. If this was not the intended outcome, please consider setting 'blank.lines.skip' to TRUE."
)

But then test 1578.1 fails. If I don't add my test, then all tests pass.

@aitap
Copy link
Copy Markdown
Member

aitap commented Apr 21, 2026

Congratulations on figuring this out! (Sorry I didn't find the time to help.)

Your test should work after you replace data.table(V1=3, V2=c) (which results in V2 being a list containing the function base::c) with data.table(V1=3L, V2="c") (R also distinguishes integers and double precision floats). This should also help increase coverage.

…nted out. Causes an error in test 1578.1?
@aitap
Copy link
Copy Markdown
Member

aitap commented Apr 22, 2026

Oh, right. Test numbers are ordinary doubles, so test() cannot distinguish between 1578.1 that it had already seen and the new 1578.10. On the other hand, test numbers must be strictly monotonically rising, so the test fails before running any code ("Test id 1578.1 is not in increasing order"). The minimal change that would fix that would be to renumber the new test 1578.91.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants