ソートされた文字列をカウントするアルゴリズム（Homebrew＆＃8220; uniq -c＆＃8221;）

https://stackoverflow.com/questions/633881

10-07-2019
|

質問

次のソートされたデータがあります：

AAA
AAA
TCG
TTT
TTT
TTT

各文字列の出現回数をカウントしたい：

AAA 2
TCG 1
TTT 3

uniq -c を使用してこれを実行できることはわかっていますが、ここでは、所有しているC ++コード全体に対して追加の処理を行う必要があります。

このコンストラクトにこだわっています（「pgras」の提案に従って修正されました）：

#include <iostream>
#include <vector>
#include <fstream>
#include <sstream>
using namespace std;


int main  ( int arg_count, char *arg_vec[] ) {
    if (arg_count !=2 ) {
        cerr << "expected one argument" << endl;
        return EXIT_FAILURE;
    }

    string line;
    ifstream myfile (arg_vec[1]);


    if (myfile.is_open())
    {
        int count;
        string lastTag = "";

        while (getline(myfile,line) )
        {
            stringstream ss(line);
            string Tag;

            ss >> Tag; // read first column
            //cout << Tag << endl; 

            if (Tag != lastTag) {
               lastTag = Tag;
               count = 0;
            }
            else {
                count++;
            }

             cout << lastTag << " " << count << endl;
        }
        cout << lastTag << " " << count << endl;
        myfile.close();

    }
    else {cout << "Unable to open file";}
    return 0;
}

この誤った結果が出力されます：

AAA 0
AAA 1
TCT 0
TTT 0
TTT 1
TTT 2
TTT 2

解決

タグがlastTagと異なる場合はカウンターをリセットし、同じ場合はインクリメントする必要があります...タグが異なる場合は、関連付けられたカウント値で前のタグを処理できます（カウントをリセットする前）...

他のヒント

印刷したいだけなら、アルゴリズムは大丈夫です。別の関数に渡す場合は、たとえばSTLマップを使用できます。

map<string, int> dict;

while(getline(myfile,line)) {
          string Tag;
          stringstream ss(line);
          ss >> Tag;
          if (dict.count(Tag) == 0) 
               dict[Tag] = 1;
           else
               dict[Tag]++;
}

次のようなものを使用します：

#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <map>
#include <iterator>


std::ostream& operator << ( std::ostream& out, const std::pair< std::string, size_t >& rhs )
{
    out << rhs.first << ", " << rhs.second;
    return out;
}

int main() 
{
    std::ifstream inp( "mysorted_data.txt" );
    std::string str;
    std::map < std::string, size_t > words_count;
    while ( inp >> str )
    {
        words_count[str]++;
    }

    std::copy( 
        words_count.begin(), 
        words_count.end(), 
        std::ostream_iterator< std::pair< std::string, size_t > >( std::cout, "\n" ) );

    return 0;
}

データが実際に長さ3（または、より一般的な長さ N で N は非常に小さい）のDNAストリングで構成されていると仮定すると、テーブルサイズが4 ^Nで、次のハッシュ関数を持つ特殊なハッシュテーブルであるq-gramテーブル：

// Disregard error codes.
int char2dna_lookup[] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x0  – 0xF
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x10 – 0x1F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x20 – 0x2F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x30 – 0x3F
    0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, // A    – P
    0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // Q    – 0x5F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x60 – 0x6F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x70 – 0x7F
}

unsigned int hash(string const& dna) {
    unsigned int ret = 0;

    for (unsigned int i = 0; i < dna.length(); ++i)
        ret = ret * 4 + char2dna_lookup[dna[i]];

    return ret;
}

非常に効率的に配列のインデックスを作成できるようになりました。

ifstream ifs("data.txt");
string line;

if (not ifs >> line)
    exit(1);

unsigned* frequencies = new unsigned int[line.length()];

frequencies[hash(line)] = 1;

while (ifs >> line)
    ++frequencies[hash(line)];

// Print the frequencies …

delete[] frequencies;

または、このようなタスクには SeqAn などのライブラリを使用します。

これを置き換えるだけでいいと思う

        if (Tag != lastTag) {
           lastTag = Tag;
           count = 0;
        }
        else {
            count++;
        }

        cout << lastTag << " " << count << endl;

これ：

        if (Tag != lastTag) {
            if (lastTag != "") {  // don't print initial empty tag
                cout << lastTag << " " << count << endl;
            }
            lastTag = Tag;
            count = 1; // count current
          } else {
            count++;
        }

コードは構文的にわずかに壊れているように見えますが（ ifstream など）、全体的なアルゴリズムは健全だと思います。行を読み取り、行が前のものと同じになるたびにカウンターを増分します。考慮すべき境界条件があるかもしれませんが（入力が1行しかない場合はどうでしょうか）、テスト中にそれらをキャッチします。

タグを取得するためだけにstringstreamを使用するのは少しやり過ぎだと思われます。おそらくstring :: substrを使用します。それはさておき、あなたのコードで何が間違っていると思いますか？何を改善したいですか？

編集：次に、呼吸のために投票権を取得します...

#include <map>
#include <string>
#include <algorithm>
#include <iterator>
#include <iostream>

class Counter
{   private: std::map<std::string,int>&   m_count;
    public:  Counter(std::map<std::string,int>& data) :m_count(data){}
        void operator()(std::string const& word)
        {
            m_count[word]++;
        }};
class Printer
{   private: std::ostream& m_out;
    public:  Printer(std::ostream& out) :m_out(out) {}
        void operator()(std::map<std::string,int>::value_type const& data)
        {
            m_out << data.first << " = " << data.second << "\n";
        }};

int main()
{
    std::map<std::string,int>       count;

    for_each(std::istream_iterator<std::string>(std::cin),
             std::istream_iterator<std::string>(),
             Counter(count)
            );

    for_each(count.begin(),count.end(),
             Printer(std::cout)
            );
}

ライセンス： CC-BY-SA と帰属

所属していません StackOverflow