Question

I am attempting to manually scrape tabular information from a website for importing into a Drupal site.

The data is in the following format:

Opening Balances of Banks/Discount Houses   76991.16
Rediscounted Bills                          0
Standing Lending Facility (Net)             0
Standing Deposit Facility (Net)             522078.9
Repo                                        0
Reverse Repo                                0
OMO Sales/Under-Writing by MMDs             0
OMO Repayment                               0

Pasting that into a spreadsheet, I can create a CSV file for importing into Drupal. The CSV generates as follows:

Opening Balances of Banks/Discount Houses,76991.16
,
Rediscounted Bills,0
,
Standing Lending Facility (Net),0
,
Standing Deposit Facility (Net),522078.9
,
Repo,0
,
Reverse Repo,0
,
OMO Sales/Under-Writing by MMDs,0
,
OMO Repayment,0

My problem is that the CSV is laid out the wrong way. The values in the first column should be the CSV headers, so they should all appear on the first line of the file rather than down the left-hand side. Each following line should then hold one record, with its values in the same order as the headers.

How can I generate a CSV file in the correct layout?

There are hundreds of lines of data to import, so a manual approach is not feasible.

UPDATE: Two full records:

Date                                        Financial Data As At 5/8/2014
Opening Balances of Banks/Discount Houses   76991.16
Rediscounted Bills                          0
Standing Lending Facility (Net)             0
Standing Deposit Facility (Net)             522078.9
Repo                                        0
Reverse Repo                                0
OMO Sales/Under-Writing by MMDs             0
OMO Repayment                               0
Primary Market Sales (e.g NTBs, FGN Bonds)  0
Primary Market Repayment                    0
CRR (Debit/Credit)                          0
Net Foreign Exchange Auction (WDAS)         0
Statutory Allocations (FAAC, VAT,etc)       0
Joint Venture Cash Call Payment             0
Net Clearing (Lagos/Abuja)                  0
NDIC Premium (Debit/Credit)                 0
Other Major (Debit/Credit)                  0
Date                                        Financial Data As At 5/7/2014
Opening Balances of Banks/Discount Houses   98357.49
Rediscounted Bills                          0
Standing Lending Facility (Net)             475
Standing Deposit Facility (Net)             483157.7
Repo                                        0
Reverse Repo                                0
OMO Sales/Under-Writing by MMDs             0
OMO Repayment                               237451.43
Primary Market Sales (e.g NTBs, FGN Bonds)  157177.87
Primary Market Repayment                    157057.31
CRR (Debit/Credit)                          0
Net Foreign Exchange Auction (WDAS)         0
Statutory Allocations (FAAC, VAT,etc)       0
Joint Venture Cash Call Payment             0
Net Clearing (Lagos/Abuja)                  0
NDIC Premium (Debit/Credit)                 0
Other Major (Debit/Credit)                  0

Solution

TextDistil will do this for you. (Disclosure: I'm the author.) Assuming that you want to generate one row per record, with a column for each of the fields you've described, the easiest way to do it is as follows.

Note that you should not include the quotes when pasting patterns into TextDistil.

  • Cut and paste the lines from your example into the input window
  • Use CTRL-N to add a 'replace text' recipe with match of "financial data as at" to clean up the column values
  • Add an "insert text at beginning of line" to insert "!" before all lines starting with "Other major". This step is only done to make the next one easier.
  • Add a "join lines after" recipe with match of "^[^!]". The first '^' matches the start of a line and the section inside the brackets matches anything that isn't an exclamation mark. The net effect is that this pattern matches all the lines that don't start with an exclamation mark. Since this is a 'join lines after' operation, all the lines that match the pattern will have the following line joined to them. So all the lines for a single record are now joined into one.
  • Add a 'select text (matching only)' recipe with matching expression "\d[\d./]*" and "," as the joining string. This matches all the numbers and dates you have.
  • At this stage you should see only two lines in the output window, each of which corresponds to a record:

5/8/2014,76991.16,0,0,522078.9,0,0,0,0,0,0,0,0,0,0,0,0,0
5/7/2014,98357.49,0,475,483157.7,0,0,0,237451.43,157177.87,157057.31,0,0,0,0,0,0,0

You may find that the 'all' view is useful - it allows you to see both the final output and the input and output of the recipe you are adding.

[Screenshots: first recipe, second recipe, third recipe, final recipe]
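For readers without TextDistil, the same reshaping can be sketched in plain Python. This is only a minimal sketch, not the tool's own method. It assumes that each line is a label followed by at least two spaces and then the value, that every record starts with a "Date" line, and that the "Financial Data As At" prefix should be dropped. The function name `records_to_csv` and the shortened sample are made up for illustration.

```python
import csv
import io
import re


def records_to_csv(text, first_label="Date"):
    """Turn 'label   value' lines into a CSV with one row per record."""
    header, rows, current = [], [], None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumption: label and value are separated by two or more spaces.
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
        if len(parts) != 2:
            continue  # not a label/value line
        label, value = parts
        if label == first_label:
            # A "Date" line opens a new record.
            current = {}
            rows.append(current)
            value = value.replace("Financial Data As At", "").strip()
        if current is None:
            continue  # ignore anything before the first record
        current[label] = value
        if label not in header:
            header.append(label)
    out = io.StringIO()
    # csv.DictWriter quotes fields that contain commas.
    writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


sample = """\
Date                                        Financial Data As At 5/8/2014
Opening Balances of Banks/Discount Houses   76991.16
Primary Market Sales (e.g NTBs, FGN Bonds)  0
Date                                        Financial Data As At 5/7/2014
Opening Balances of Banks/Discount Houses   98357.49
Primary Market Sales (e.g NTBs, FGN Bonds)  157177.87
"""
print(records_to_csv(sample))
```

Unlike the TextDistil output above, this sketch also writes the header line, which is what the question asks for. If the pasted text uses tabs or single spaces between label and value, the split pattern would need to change.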

Other tips

CSV seems easy, but it's not. Just imagine a comma in the middle of your first column, and bang: the row splits into one field too many.

Now imagine it with two commas.

:-)
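To make the comma problem concrete, here is a small Python sketch (the answer itself does not commit to a language, so Python is just a choice here). It takes one label from the question's data that already contains a comma and compares naive joining with the standard `csv` module.

```python
import csv
import io

# One label/value pair from the question; the label itself contains a comma.
row = ["Primary Market Sales (e.g NTBs, FGN Bonds)", "157177.87"]

# Naive approach: join with commas and hope for the best.
naive = ",".join(row)

# csv.writer wraps any field that contains a comma in double quotes.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
quoted = buf.getvalue().rstrip("\n")

# Reading the naive line back splits the label into two fields.
print(next(csv.reader([naive])))
# Reading the quoted line back recovers the original two fields.
print(next(csv.reader([quoted])))
```

The naive line parses back as three fields instead of two, while the line produced by `csv.writer` round-trips unchanged.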

I don't know what language you're going to use for this (PHP, maybe), but I think you will need to write a small program to

  • parse the HTML (at least the HTML table)
  • get the data from each column
  • encode the data as CSV, escaping where necessary

That's because HTML makes it clear what is inside a cell and what is not. Copying and pasting manually will get you the contents, but without that structure, and you'll end up having problems with corner cases.

For good CSV libraries in PHP, take a look at https://stackoverflow.com/questions/3087287/is-there-a-popular-and-or-robust-php-csv-library
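The three steps above could look roughly like this in Python, using only the standard library. Treat it as a sketch under assumptions: the markup of the real page is unknown, so the sample table below is invented, and the names `TableParser` and `html_table_to_csv` are hypothetical. It does not handle nested tables.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    # csv.writer escapes cells that contain commas or quotes.
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Invented sample markup; the real page will differ.
html = """
<table>
  <tr><td>Repo</td><td>0</td></tr>
  <tr><td>Statutory Allocations (FAAC, VAT,etc)</td><td>0</td></tr>
</table>
"""
print(html_table_to_csv(html))
```

This keeps the table's own orientation (one label/value pair per line); turning labels into headers would still need a separate transposition step.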

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow