Question

I've been given the task of building a prototype for an app. I don't have any code yet, as the solution concepts I've come up with seem stinky at best...

The problem:

The solution consists of various Azure projects that operate on lots of data stored in Azure SQL databases. Almost every action that happens creates a gzipped log file in blob storage, so that's one .gz file per log entry.

There should also be a small desktop (WPF) app that is able to read, filter, and sort these log files.

I have absolutely zero influence on how the logging is done, so this is something that cannot be changed to solve this problem.

Possible solutions that I've come up with (conceptually):

1:

  • connect to the blob storage
  • open the container
  • read/download blobs (with the filter applied)
  • decompress the .gz files
  • read and display (a rough sketch of this follows below)
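Conceptually, I picture the client side of this looking something like the following untested sketch (using the classic Microsoft.WindowsAzure.StorageClient library; the container name and the date+severity blob prefix are just my assumptions):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class LogViewerSketch
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage connection string>");
        var container = account.CreateCloudBlobClient().GetContainerReference("logs");

        // "Filtering" = only listing blobs under a date+severity prefix,
        // e.g. 20121107-error/ (this naming convention is an assumption).
        var blobs = container
            .ListBlobs(new BlobRequestOptions { UseFlatBlobListing = true })
            .OfType<CloudBlob>()
            .Where(b => b.Name.StartsWith("20121107-error/"));

        foreach (var blob in blobs)
        {
            using (var compressed = new MemoryStream())
            {
                blob.DownloadToStream(compressed);          // one round-trip per log entry
                compressed.Position = 0;

                using (var gzip = new GZipStream(compressed, CompressionMode.Decompress))
                using (var reader = new StreamReader(gzip))
                {
                    Console.WriteLine(reader.ReadToEnd());  // parse/display instead
                }
            }
        }
    }
}
```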

The problem with this is that, depending on the filter, it could mean a whole lot of data to download (which is slow) and process (which will also not be very snappy). I really can't see this being a usable application.

2:

  • create a web role which will run a WCF or REST service
  • the service will take the filter params and other stuff and return a single xml/json file with the data, the processing will be done on the cloud
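Again just conceptually, the service could be as simple as this untested ASP.NET Web API sketch, where LogEntry and the LogBlobReader helper are made-up placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Web.Http;

public class LogEntry
{
    public DateTime Timestamp { get; set; }
    public string Severity { get; set; }
    public string Message { get; set; }
}

public static class LogBlobReader
{
    // Hypothetical helper: would run the option-1 steps against blob storage.
    public static IEnumerable<LogEntry> Read(DateTime date, string severity)
    {
        return new List<LogEntry>();
    }
}

public class LogsController : ApiController
{
    // GET api/logs?date=2012-11-07&severity=error
    public IEnumerable<LogEntry> Get(DateTime date, string severity)
    {
        // The role does the listing/downloading/gunzipping server-side
        // (the same steps as in option 1) and returns plain JSON/XML.
        return LogBlobReader.Read(date, severity);
    }
}
```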

With this approach, will I run into problems decompressing these files if there are a lot of them (will it take up extra space on the storage/compute instance where the service is running)?

EDIT: What I mean by filter is limiting the results by date and severity (info, warning, error). The .gz files are saved in a structure that makes this quite easy, and I will not be filtering by looking into the files themselves.

3:

  • some other elegant and simple solution that I don't know of

I'd also need some way of making the app update the displayed logs in real time, which I suppose would need to be done with repeated requests to the blob storage/service.
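I imagine in WPF that would just be a timer re-running the current filter, something like this rough sketch (the interval is picked arbitrarily):

```csharp
using System;
using System.Windows;
using System.Windows.Threading;

public partial class MainWindow : Window
{
    private readonly DispatcherTimer _timer;

    public MainWindow()
    {
        InitializeComponent();

        // Poll every 10 seconds; the interval is a guess.
        _timer = new DispatcherTimer { Interval = TimeSpan.FromSeconds(10) };
        _timer.Tick += (s, e) => RefreshLogs();
        _timer.Start();
    }

    private void RefreshLogs()
    {
        // Re-run the current filter against the service/blob storage
        // and rebind the results.
    }
}
```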


This is not one of those "give me code" questions; I am looking for advice on best practices, or similar solutions that worked for similar problems. I also know this could be one of those "no one right answer" questions, as people have different approaches to problems. But I have some time to build a prototype, so I will be trying out different things, and I will select as the right answer the one that showed a solution that worked, or the one that steered me in the right direction, even if it takes some time before I actually build something and test it out.


Solution

As I understand it, you have a set of log files in Azure Blob storage that are formatted in a particular way (gzipped) and you want to display them.

How big are these files? Are you displaying every single piece of information in the log file?

Assuming that since these are log files, they are static and historical... meaning that once the log/gzip file is created it cannot be changed (you are not updating the gzip file once it is out on Blob storage); only new files can be created.

One Solution


Why not create a worker role/job process that periodically goes out, scans the blob storage, and builds a persisted "database" that you can display from? The nice thing about this is that you are not putting the unzipping/business logic for extracting the log files into the WPF app or UI.

1. I would have the worker role scan the log files in Azure Blob storage.
2. Have some kind of mechanism to track which ones were processed and a current "state", maybe the UTC date of the last gzip file.
3. Do all the unzipping/extracting of the log files in the worker role.
4. Have the worker role place the content in a SQL database, Azure Table Storage, or a distributed cache for access.
5. Access can be done by a REST service (ASP.NET Web API/Node.js etc.).
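A rough, untested sketch of that worker role loop, using the classic StorageClient library (the container name, the checkpoint mechanism, and the persistence helpers are placeholders):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

public class LogImporter : RoleEntryPoint
{
    public override void Run()
    {
        var account = CloudStorageAccount.Parse("<storage connection string>");
        var container = account.CreateCloudBlobClient().GetContainerReference("logs");

        while (true)
        {
            DateTime checkpoint = LoadCheckpoint(); // step 2: last processed UTC date

            // Step 1: scan blob storage; skip files we've already imported.
            var newBlobs = container
                .ListBlobs(new BlobRequestOptions { UseFlatBlobListing = true })
                .OfType<CloudBlob>()
                .Where(b => b.Properties.LastModifiedUtc > checkpoint);

            foreach (var blob in newBlobs)
            {
                using (var ms = new MemoryStream())
                {
                    blob.DownloadToStream(ms);
                    ms.Position = 0;

                    // Step 3: unzip/extract in the worker role, not in the UI.
                    using (var gzip = new GZipStream(ms, CompressionMode.Decompress))
                    using (var reader = new StreamReader(gzip))
                    {
                        SaveEntry(blob.Name, reader.ReadToEnd()); // step 4
                    }
                }
            }

            SaveCheckpoint(DateTime.UtcNow);
            Thread.Sleep(TimeSpan.FromMinutes(1)); // "periodically"
        }
    }

    // Hypothetical persistence helpers: SQL, Table Storage, cache, ...
    private static DateTime LoadCheckpoint() { return DateTime.MinValue; }
    private static void SaveCheckpoint(DateTime utc) { /* persist the state */ }
    private static void SaveEntry(string blobName, string content) { /* insert */ }
}
```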

You can add more things if you need to scale this out, for example run this as a job that re-does all of the log files from a given time (refresh all). I don't know the size of your data, so I am not sure whether that is feasible.

The nice thing about this is that if you need to scale your job (overnight), you can spin up 2, 3, or 6 worker roles, extract the content, and pass the results to a Service Bus or Storage Queue that feeds the inserts into SQL, a cache, etc. for access; a sketch follows below.
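For example, a minimal untested sketch of the hand-off with an Azure Storage Queue (the queue name and message format are assumptions):

```csharp
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class QueueFanOutSketch
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage connection string>");
        var queue = account.CreateCloudQueueClient().GetQueueReference("log-blobs");
        queue.CreateIfNotExist();

        // Producer: one role lists the blobs and enqueues each name once.
        queue.AddMessage(new CloudQueueMessage("20121107-error/entry-123.gz"));

        // Consumer: each of the 2, 3, or 6 worker roles drains the queue.
        var msg = queue.GetMessage();
        if (msg != null)
        {
            ProcessBlob(msg.AsString);  // download + gunzip + insert, as above
            queue.DeleteMessage(msg);   // delete only after the insert succeeded
        }
    }

    private static void ProcessBlob(string blobName) { /* hypothetical */ }
}
```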

Other Tips

Simply storing the blobs isn't sufficient. The metadata you want to filter on should be stored somewhere else where it's easy to filter and retrieve all the metadata. So I think you should split this into two problems:

A. How do I efficiently list all "gzips" with their metadata, and how can I apply a filter on these gzips in order to show them in my client application?

Solutions

  • Blobs: Listing blobs is slow and filtering is not possible (you could group in a container per month or week or user or ... but that's not filtering).
  • Table Storage: Very fast, but searching is slow (only the PartitionKey and RowKey are indexed)
  • SQL Azure: You could create a table with a list of "gzips" together with some other metadata (like user that created the gzip, when, total size, ...). Using a stored procedure with a few good indexes you can make search very fast, but SQL Azure isn't the most scalable solution
  • Lucene.NET: There's an AzureDirectory for Windows Azure which makes it possible to use Lucene.NET in your application. This is a super fast search engine that allows you to index your 'documents' (metadata) and this would be perfect to filter and return a list of "gzips"

Update: Since you only filter on date and severity, you should review the Blob and Table options:

  • Blobs: You can create a container per date+severity (20121107-low, 20121107-medium, 20121107-high ...). Assuming you don't have too many blobs per date+severity, you can simply list the blobs directly from the container. The only issue you might have here is that a user will want to see all items with a high severity from the last week (7 days). This means you'll need to list the blobs in 7 containers.
  • Tables: Even though you say Table Storage or a DB isn't an option, do consider Table Storage. Using partition and row keys you can easily filter in a very scalable way (you can also use CompareTo to get a range of items, for example all records between 1 and 7 November); see the sketch after this list. Duplicating data is perfectly acceptable in Table Storage. You could include some data from the gzip in the Table Storage entity in order to show it in your WPF application (the most essential information you want to show after filtering). This means you'll only need to process the blob when the user opens/double-clicks the record in the WPF application.
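To make the Tables option concrete, here is an untested sketch against the classic StorageClient library; the table name, the entity's properties, and the key layout are my assumptions, not something prescribed by Table Storage:

```csharp
using System;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Severity as PartitionKey, a sortable timestamp as RowKey, so a date range
// becomes a RowKey range query via CompareTo.
public class LogIndexEntity : TableServiceEntity
{
    public LogIndexEntity() { }

    public LogIndexEntity(string severity, DateTime utc, string blobName)
    {
        PartitionKey = severity; // "info" / "warning" / "error"
        RowKey = utc.ToString("yyyyMMddHHmmssfff") + "-" + Guid.NewGuid();
        BlobName = blobName;
    }

    public string BlobName { get; set; }
    public string Snippet { get; set; }  // the duplicated "most essential information"
}

class TableIndexSketch
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage connection string>");
        var tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("LogIndex");
        var ctx = tables.GetDataServiceContext();

        // Index one gzip (done by whatever scans/imports the blobs).
        ctx.AddObject("LogIndex", new LogIndexEntity(
            "error", DateTime.UtcNow, "20121107-error/entry-123.gz"));
        ctx.SaveChanges();

        // All "error" records between 1 and 7 November 2012:
        var results = ctx.CreateQuery<LogIndexEntity>("LogIndex")
            .Where(e => e.PartitionKey == "error"
                     && e.RowKey.CompareTo("20121101") >= 0
                     && e.RowKey.CompareTo("20121108") < 0)
            .ToList();

        Console.WriteLine(results.Count);
    }
}
```

Because severity is the partition key, the date-range scan stays inside a single partition, which is what keeps this filter fast and scalable.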

B. How do I display a "gzip" in my application (after double-clicking a search result, for example)?

Solutions

  • Connect to the storage account from the WPF application, download the file, unzip it and display it. This means that you'll need to store the storage account credentials in the WPF application (or use a SAS or a container policy; see the sketch after this list), and if you decide to change something in how files are stored in the backend, you'll also need to change the WPF application.
  • Connect to a Web Role. This Web Role gets the blob from blob storage, unzips it and sends it over the wire (or sends it compressed in order to speed up the transfer). If something changes in how you store files, you only need to update the Web Role.
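For the first option, issuing a short-lived read-only SAS could look like this untested sketch (classic StorageClient library; blob and container names assumed), so the account key never ships with the WPF app:

```csharp
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class SasSketch
{
    static void Main()
    {
        var blob = CloudStorageAccount.Parse("<storage connection string>")
            .CreateCloudBlobClient()
            .GetContainerReference("logs")
            .GetBlobReference("20121107-error/entry-123.gz");

        // Read-only, valid for 10 minutes; the storage key stays on the server.
        string sas = blob.GetSharedAccessSignature(new SharedAccessPolicy
        {
            Permissions = SharedAccessPermissions.Read,
            SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(10)
        });

        // Hand this URL to the WPF app; it downloads and gunzips locally.
        Console.WriteLine(blob.Uri.AbsoluteUri + sas);
    }
}
```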