Parsing through a large file using a pattern match, but it doesn't seem to match every case shown in the text

StackOverflow https://stackoverflow.com/questions/23326442

Question

I am trying to parse a large, highly structured file and pull out only the information I want to work with, identified by a tag at the beginning of each line. The number of items I extracted was far smaller than expected, and it seems that some items are being skipped, but I can't figure out why. The data is formatted as follows:

Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
reviews: total: 2 downloaded: 2 avg rating: 5
2000-7-28 cutomer: A2JW67OY8U6HHK rating: 5 votes: 10 helpful: 9
2003-12-14 cutomer: A2VE83MZF98ITY rating: 5 votes: 6 helpful: 5

Every item lists every field, even when that field has no entries (for example, similar: 0). There are over 500,000 Id numbers; however, when I pattern match to find Id, only around 58,000 are reported. I simply look for a line starting with "Id" and increment a counter. Here is the code:

import java.util.*;
import java.io.*;
import java.util.regex.*;

public class metaData4 {
    public static void main(String[] args) throws Exception {
        File a = new File(args[0]);
        Scanner doc = new Scanner(a);
        String pattern = "Id.*";
        int sum = 0;
        while (doc.hasNextLine()) {
            String data = doc.nextLine();
            if (data.matches(pattern)) {
                sum++;
            }
        }
        System.out.println(sum);
    }
}

The link to the data I am using (warning: this is a large text file!): http://snap.stanford.edu/data/bigdata/amazon/amazon-meta.txt.gz

EDIT: To make the problem clearer, I am building a HashMap with the ASIN as the key and the "similar" list as the value. ASIN and Id appear the same number of times, and I used Id as the line to pattern match because the number that follows it shows clearly how many occurrences there should be. Running the code above returns the correct number of occurrences of Id on a smaller text file taken from the link above, but not on the original file.
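For reference, the map described in the edit could be built roughly as follows. This is only a minimal sketch: the class name SimilarIndex, the method build, and the second record in the sample (ASIN 0000000000) are hypothetical, and it assumes each "similar:" line holds a count followed by whitespace-separated ASINs, as in the record shown above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps each ASIN to the ASINs on its "similar:" line.
final class SimilarIndex {

    static Map<String, List<String>> build(BufferedReader reader) throws IOException {
        Map<String, List<String>> similarByAsin = new LinkedHashMap<>();
        String currentAsin = null;
        String line;
        while ((line = reader.readLine()) != null) {
            // trim() so the tags are found whether or not the line is indented
            String trimmed = line.trim();
            if (trimmed.startsWith("ASIN:")) {
                currentAsin = trimmed.substring("ASIN:".length()).trim();
                similarByAsin.put(currentAsin, new ArrayList<>());
            } else if (trimmed.startsWith("similar:") && currentAsin != null) {
                // Assumed layout: "similar:", the count, then the ASINs themselves
                String[] fields = trimmed.split("\\s+");
                int firstAsin = Math.min(2, fields.length);
                similarByAsin.get(currentAsin)
                        .addAll(Arrays.asList(fields).subList(firstAsin, fields.length));
            }
        }
        return similarByAsin;
    }

    public static void main(String[] args) throws IOException {
        // First record is taken from the sample above; the second is made up
        String sample = "Id: 1\n"
                + "ASIN: 0827229534\n"
                + "similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X\n"
                + "Id: 2\n"
                + "ASIN: 0000000000\n"
                + "similar: 0\n";
        Map<String, List<String>> index = build(new BufferedReader(new StringReader(sample)));
        System.out.println(index);
    }
}
```

An item with similar: 0 ends up with an empty list rather than being left out of the map, so the number of keys can still be compared against the number of Id lines.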


Solution

This is actually not a problem with the pattern matching at all. The pattern matches as it is supposed to; the problem lies with the Scanner, or at least with the way the text file was written, because the Scanner stops before it reaches the end of the file. I have found one other case where the same thing happened, and the answers to that problem can be found here: Java scanner not going through entire file
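A minimal sketch of one way to work around this follows. It reads the file with a BufferedReader instead of a Scanner. It assumes, without confirmation from the source, that the Scanner stops early because the file contains bytes that cannot be decoded in the charset the Scanner uses. A Scanner treats an IOException from its underlying source as end of input (the exception can be retrieved afterwards with ioException()), so hasNextLine() would simply return false at that point. The class name IdCounter, the method countIdLines, and the choice of ISO-8859-1 are assumptions for illustration. ISO-8859-1 is used because it maps every byte to a character, so decoding cannot fail.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical replacement for the Scanner loop in the question.
final class IdCounter {

    // Counts the lines that start with "Id:", decoding the file with the given charset.
    static long countIdLines(Path file, Charset charset) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Slightly narrower than the original "Id.*" pattern: requires the colon
                if (line.startsWith("Id:")) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: ISO-8859-1 is good enough for counting ASCII tags,
        // even if it is not the file's real encoding
        System.out.println(countIdLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1));
    }
}
```

Unlike the Scanner, this reader does not hide I/O errors: if decoding or reading fails, the IOException propagates instead of looking like a normal end of file. To confirm the diagnosis on the original code, doc.ioException() can be printed after the loop; a non-null value means the Scanner stopped because of an error rather than because the file ended.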

Licensed under: CC-BY-SA with attribution