Question

Using the Java SDK, I am creating a load job for just a single record with a fairly complicated schema. When I monitor the status of the load job, it takes a surprisingly long time (perhaps because it is working out the schema), and then reports:

11:21:06.975 [main] INFO  xxx.GoogleBigQuery - Job status (21694ms) create_scans_1384744805079_172221126: DONE

11:24:50.618 [main] ERROR xxx.GoogleBigQuery - Job create_scans_1384744805079_172221126  caused error (invalid) with message
Too many errors encountered. Limit is: 0.
11:24:50.810 [main] ERROR xxx.GoogleBigQuery - {
  "message" : "Too many errors encountered. Limit is: 0.",
  "reason" : "invalid"
}

BTW - how do I tell the job that it can have more than zero errors using Java?

This load job does not appear in the list of recent jobs in the console, and as far as I can see, none of the Java objects contains any more details about the actual errors encountered. So how can I programmatically find out what is going wrong? All I can find is:

    if (err != null) {
        log.error("Job {} caused error ({}) with message\n{}",
                jobID, err.getReason(), err.getMessage());
        try {
            // toPrettyString() serializes the error as formatted JSON
            log.error(err.toPrettyString());
        } catch (IOException e) {
            log.error("Could not serialize error", e);
        }
    }

In general I am having a difficult time finding good documentation for some of these things, and am working them out by trial and error and from short snippets of code found here and in older groups. If there is a better source of information than the getting started guides, I would appreciate any pointers to it. The Javadoc does not really help, and I cannot find any complete examples of loading, querying, testing for errors, cataloging errors, and so on.

This job is submitted with a single NEWLINE_DELIMITED_JSON record, supplied to the job via:

InputStream dummy = getClass().getResourceAsStream("/googlebigquery/xxx.record");
final InputStreamContent jsonIn = new InputStreamContent("application/octet-stream", dummy);
createTableJob = bigQuery.jobs().insert(projectId, loadJob, jsonIn).execute();

My authentication and so on seems to work correctly, as separate Java code to list the projects and the datasets in the project works correctly. So I just need help working out what the actual error is: does it not like the schema (I have records nested within records, for instance), or does it think there is an error in the data I am submitting?

Thanks in advance for any help. The job number cited above is an actual failed load job if that helps any Google staffers who might read this.

Solution

It sounds like you have a couple of questions, so I'll try to address them all.

First, the way to get the status of the job that failed is to call jobs().get(projectId, jobId), which returns a Job object whose status contains an errorResult object with the error that caused the job to fail (e.g. "too many errors"). The errors list on the job status holds all of the errors on the job, which should tell you which lines hit errors.
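For instance, a minimal sketch, assuming bigQuery is your authenticated client and projectId/jobId identify the failed job:

    import java.util.List;
    import com.google.api.services.bigquery.model.ErrorProto;
    import com.google.api.services.bigquery.model.Job;

    // Fetch the job and inspect its status for errors.
    Job failed = bigQuery.jobs().get(projectId, jobId).execute();

    // The fatal error that caused the job to fail (null if it succeeded).
    ErrorProto errorResult = failed.getStatus().getErrorResult();
    if (errorResult != null) {
        log.error("Job failed: {} ({})", errorResult.getMessage(), errorResult.getReason());
    }

    // The full list of errors, including the individual lines that failed to parse.
    List<ErrorProto> errors = failed.getStatus().getErrors();
    if (errors != null) {
        for (ErrorProto e : errors) {
            log.error("{} at {}: {}", e.getReason(), e.getLocation(), e.getMessage());
        }
    }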

Note that if you have the job id, it may be easier to use bq to look up the job: run bq show -j <job_id> to get the job error information. If you add --format=prettyjson, it will print out all of the information in the job.
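For example, with the job id from your log output (the -j flag tells bq that the id refers to a job rather than a table):

    bq show --format=prettyjson -j create_scans_1384744805079_172221126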

Another hint: consider supplying your own job id when you create the job. Then, even if there is an error starting the job (i.e. the insert() call fails, perhaps due to a network error), you can look up the job to see what actually happened.
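A sketch of how that might look, assuming loadJob is the Job you are about to insert:

    import com.google.api.services.bigquery.model.JobReference;

    // Choose your own unique job id up front, so the job can be looked up
    // even if the insert() call itself fails (e.g. due to a network error).
    JobReference jobRef = new JobReference()
            .setProjectId(projectId)
            .setJobId("create_scans_" + System.currentTimeMillis());
    loadJob.setJobReference(jobRef);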

To tell BigQuery that some errors are allowed during import, you can use the maxBadRecords setting in the load job. See https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#getMaxBadRecords().
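For example, assuming loadConfig is the JobConfigurationLoad in your job's configuration:

    // Allow up to 10 bad records before the load is failed;
    // the default is 0, which is why a single bad record kills the job.
    loadConfig.setMaxBadRecords(10);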

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow