어떤 데이터 구조를 사용 하시겠습니까 : Treemap 또는 Hashmap? (자바)

https://stackoverflow.com/questions/302371

08-07-2019
|

문제

설명 | 텍스트 파일을 읽고 텍스트에서 단어가 발생하는 횟수와 함께 알파벳 순서로 각각의 고유 단어를 인쇄하는 Java 프로그램.

프로그램은 변수 유형을 선언해야합니다 Map<String, Integer> 단어와 해당 발생 빈도를 저장합니다. 그래도 어떤 콘크리트 유형? TreeMap<String, Number> 또는 HashMap<String, Number> ?

입력은 소문자로 변환되어야합니다.

단어는 다음과 같은 문자를 포함하지 않습니다. \t\t\n]f.,!?:;\"()'

예제 출력 |

 Word            Frequency
  a                 1
  and               5
  appearances       1
  as                1
         .
         .
         .

비고 | 나는 Perl에서 약 2 줄의 코드로 우아한 솔루션을 보았습니다. 그러나 나는 그것을 Java에서보고 싶다.

편집 : 아, 예, 이러한 구조 중 하나 (Java)를 사용하여 구현을 보여주는 것이 도움이됩니다.

해결책

트리 맵 단순히 "알파벳 순서"요구 사항 때문에 나에게는 쉬운 것 같습니다. 해시 맵은 당신이 그것을 반복 할 때 주문하지 않습니다. Treemap은 자연스러운 키 순서로 반복합니다.

편집 : Konrad의 의견은 "해시 맵 사용, 정렬"을 제안했을 것입니다. 처음에는 반복적이지만, 복제로 인해 K <= N 키가 끝날 때까지 우리는 반복을 가질 것이기 때문에 좋습니다. 우리는 우리가 갈 때 정리하는 것을 유지하는 것이 작지만 불안정한 히트를 취하는 것보다 키가 적을 때까지 끝날 때까지 값 비싼 비트 (정렬)를 절약 할 수 있습니다.

그렇게 말하면서, 나는 지금 내 대답을 고수하고 있습니다. 가장 간단합니다 목표 달성 방법. 우리는 OP가 특히 성능에 대해 걱정하고 있다는 것을 알지 못하지만 문제는 그가 우아함과 간결함에 대해 걱정하고 있음을 의미합니다. Treemap을 사용하면이 간단한 간단한 것이 나에게 호소합니다. 성능이 실제로 문제라면 Treemap 또는 Hashmap보다 공격하는 더 나은 방법이있을 수 있습니다 :)

다른 팁

Treemap은 이미 당신을 위해 정렬되어 있기 때문에 해시 맵을 이겼습니다.

그러나보다 적절한 데이터 구조 인 백을 사용하는 것을 고려할 수 있습니다. 보다커먼즈 컬렉션 - 그리고 트리 백 수업:

이것은 최적화 된 내부 구조와 API를 가지고 있습니다.

bag.add("big")
bag.add("small")
bag.add("big")
int count = bag.getCount("big")

편집 : Hashmap vs Treemap 성능의 문제는 Jon -Hashmap과 정렬이 더 빠를 수 있지만 (시도해보십시오!) Treebag는 더 쉽습니다. 가방도 마찬가지입니다. 트리 백뿐만 아니라 해시 백도 있습니다. 구현 (돌연변이 정수 사용)을 기반으로 백은 정수의 동등한 일반 맵보다 성능이 뛰어나야합니다. 확실히 알 수있는 유일한 방법은 성능 질문과 마찬가지로 테스트하는 것입니다.

나는 "Treemap Look-Up Takes"라고 말하는 사람들이 꽤 있습니다. O(n log n)"!! 어떻게 오세요?

나는 그것이 어떻게 구현되었는지 모르겠지만 내 머리 속에는 O(log n).

이것은 나무에서의 조회를 할 수 있기 때문입니다. O(log n). 항목을 삽입 할 때마다 트리 전체를 정렬하지 않습니다. 그것은 나무를 사용한다는 모든 아이디어입니다!

따라서 원래 질문으로 돌아가서 비교를위한 수치는 다음과 같습니다.

해시 맵 접근 : O(n + k log k) 평균 사례, 최악의 경우는 훨씬 클 수 있습니다

Treemap 접근법 : O(k + n log k) 최악의 경우

여기서 n = 텍스트의 단어 수, k = 텍스트의 고유 한 단어 수.

해시 맵이 훨씬 빠르야합니다. 항목을 결국 배열하려는 방식에 따라 컨테이너를 선택해서는 안됩니다. 끝에 (단어, 주파수) 페어의 목록을 정렬하십시오. 일반적으로 파일의 단어보다 정렬되는 쌍이 줄어들므로 해시 맵을 사용한 점근 (및 실제) 성능이 더 좋습니다.

당신은 할당 할 수 없습니다 TreeMap<String,Number> 유형의 변수로 Map<String,Integer>. Double, Long, 등 등을 "넣을 수 있습니다 TreeMap<String,Number>. 내가 a에서 값을 "받을 때 Map<String,Integer>, 그것은 an이어야합니다 Integer.

I18N 문제, 메모리 제약 조건 및 오류 처리를 완전히 무시하면 다음과 같습니다.

class Counter {

  public static void main(String... argv)
    throws Exception
  {
    FileChannel fc = new FileInputStream(argv[0]).getChannel();
    ByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
    CharBuffer cb = Charset.defaultCharset().decode(bb);
    Pattern p = Pattern.compile("[^ \t\r\n\f.,!?:;\"()']+");
    Map<String, Integer> counts = new TreeMap<String, Integer>();
    Matcher m = p.matcher(cb);
    while (m.find()) {
      String word = m.group();
      Integer count = counts.get(word);
      count = (count == null) ? 1 : count + 1;
      counts.put(word, count);
    }
    fc.close();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      System.out.printf("%s: %d%n", e.getKey(), e.getValue());
    }
  }

}

"키가 이미 존재하면 해시 맵과 동일한 성능이 있습니다." - 그것은 단지 명백한 잘못입니다. 해시 맵에는 O (1) 삽입 및 Treemap O (N log n)가 있습니다. 테이블에 있는지 확인하기 위해 최소한 N 로그 N 검사가 필요합니다!

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.ObjectInputStream.GetField;
import java.util.Iterator;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class TreeMapExample {

    public static void main (String args[]){
        Map<String,Integer> tm = new TreeMap<String,Integer>();
        try {

            FileInputStream fis = new FileInputStream("Test.txt");
            DataInputStream in = new DataInputStream(fis);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String line;
            int countValue = 1;
            while((line = br.readLine())!= null ){
                line = line.replaceAll("[-+.^:;,()\"\\[\\]]","");
                StringTokenizer st = new StringTokenizer(line, " ");    
                while(st.hasMoreTokens()){
                    String nextElement = (String) st.nextElement();

                    if(tm.size()>0 && tm.containsKey(nextElement)){
                        int val = 0;
                        if(tm.get(nextElement)!= null){
                        val = (Integer) tm.get(nextElement);
                        val = val+1;
                        }
                        tm.put(nextElement, val);
                    }else{
                    tm.put(nextElement, 1);
                    }

                }
            }
            for(Map.Entry<String,Integer> entry : tm.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
            }

        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

}

이런 식으로, 제 생각에는 더 나은 사용 해시 백 ~에서 아파치 커먼즈 컬렉션 또는 해시 모티 세트 ~에서 구아바 또는 해시 백 ~에서 일식 컬렉션 (형식 GS 컬렉션) 또는 다음 수업 :

    Order    |  Guava           |   Apache  | Eclipse(GS) | JDK analog
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Not define   | HashMultiset     |   HashBag | HashBag     | HashMap<String, Integer>
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Sorted       | TreeMultiset     |   TreeBag | TreeBag     | TreeMap<String, Integer>
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Linked       |LinkedHashMultiset|     -     |     -       | LinkedHashMap<String, Integere>
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Concurrent & | ConcurrentHash-  |Synchroniz-|Synchroniz-  | Collections.synchronizedMap(
not define   | Multiset         |   edBag   | edBag       |       HashMap<String, Integer>)
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Concurrent   |         -        |Synchroniz-|Synchroniz-  | Collections.synchronizedSorted-
and sorted   |                  |edSortedBag| edSortedBag |       Map(TreeMap<>))
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Immutable and| ImmutableMultiset|Unmodifiab-|Unmodifiab-  | Collections.unmodifiableMap(
not define   |                  |   leBag   | leBag       | HashMap<String, Integer>)
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Immutable and| ImmutableSorted- |Unmodifiab-|Unmodifiab-  | Collections.unmodifiableSorted-
sorted       | Multiset         |leSortedBag| leSortedBag | Map(TreeMap<String, Integer>))
────────────────────────────────────────────────────────────────────────

예 :

1. Apache에서 동기화 된 소식을 사용합니다:

    // Parse text to separate words
    String INPUT_TEXT = "Hello World! Hello All! Hi World!";
    // Create Multiset
    Bag bag = SynchronizedSortedBag.synchronizedBag(new TreeBag(Arrays.asList(INPUT_TEXT.split(" "))));

    // Print count words
    System.out.println(bag); // print [1:All!,2:Hello,1:Hi,2:World!]- in natural (alphabet) order
    // Print all unique words
    System.out.println(bag.uniqueSet());    // print [All!, Hello, Hi, World!]- in natural (alphabet) order


    // Print count occurrences of words
    System.out.println("Hello = " + bag.getCount("Hello"));    // print 2
    System.out.println("World = " + bag.getCount("World!"));    // print 2
    System.out.println("All = " + bag.getCount("All!"));    // print 1
    System.out.println("Hi = " + bag.getCount("Hi"));    // print 1
    System.out.println("Empty = " + bag.getCount("Empty"));    // print 0

    // Print count all words
    System.out.println(bag.size());    //print 6

    // Print count unique words
    System.out.println(bag.uniqueSet().size());    //print 4

2. Eclipse (GC)의 Treebag 사용:

    // Parse text to separate words
    String INPUT_TEXT = "Hello World! Hello All! Hi World!";
    // Create Multiset
    MutableSortedBag<String> bag =  TreeBag.newBag(Arrays.asList(INPUT_TEXT.split(" ")));

    // Print count words
    System.out.println(bag); // print [All!, Hello, Hello, Hi, World!, World!]- in natural order
    // Print all unique words
    System.out.println(bag.toSortedSet());    // print [All!, Hello, Hi, World!]- in natural order

    // Print count occurrences of words
    System.out.println("Hello = " + bag.occurrencesOf("Hello"));    // print 2
    System.out.println("World = " + bag.occurrencesOf("World!"));    // print 2
    System.out.println("All = " + bag.occurrencesOf("All!"));    // print 1
    System.out.println("Hi = " + bag.occurrencesOf("Hi"));    // print 1
    System.out.println("Empty = " + bag.occurrencesOf("Empty"));    // print 0

    // Print count all words
    System.out.println(bag.size());    //print 6

    // Print count unique words
    System.out.println(bag.toSet().size());    //print 4

3. Guava의 Linkedhashmultiset 사용:

    // Parse text to separate words
    String INPUT_TEXT = "Hello World! Hello All! Hi World!";
    // Create Multiset
    Multiset<String> multiset = LinkedHashMultiset.create(Arrays.asList(INPUT_TEXT.split(" ")));

    // Print count words
    System.out.println(multiset); // print [Hello x 2, World! x 2, All!, Hi]- in predictable iteration order
    // Print all unique words
    System.out.println(multiset.elementSet());    // print [Hello, World!, All!, Hi] - in predictable iteration order

    // Print count occurrences of words
    System.out.println("Hello = " + multiset.count("Hello"));    // print 2
    System.out.println("World = " + multiset.count("World!"));    // print 2
    System.out.println("All = " + multiset.count("All!"));    // print 1
    System.out.println("Hi = " + multiset.count("Hi"));    // print 1
    System.out.println("Empty = " + multiset.count("Empty"));    // print 0

    // Print count all words
    System.out.println(multiset.size());    //print 6

    // Print count unique words
    System.out.println(multiset.elementSet().size());    //print 4

내 GitHub 프로젝트에서 찾을 수있는 더 많은 예제입니다

나는 확실히 treemap을 선택할 것입니다 :

Treemap은 삽입시 새 키를 자동으로 정렬합니다. 이후에는 정렬이 필요하지 않습니다.
키가 이미 존재하면 해시 맵과 동일한 성능이 있습니다.

TreeSet은 내부적으로 Treemap을 사용하므로 Treemap을 직접 사용하지 않겠습니까?

속도 요구 사항에 따라 트리. 그러나 Treemap이 충분히 빠르면 그 중 하나를 구현할 필요는 없습니다.

데이터 구조에 대한 첨가 또는 삭제 빈도를 고려하십시오. Treemap은 높으면 이상적이지 않을 것입니다. 기존 항목 NLN 검색 외에도 빈번한 재조정도 겪습니다.

반면에 해시 구조는 메모리에 약간 화려합니다 (오버 할당). 그 총알을 물릴 수 있다면 해시 구조를 위해 가서 필요할 때 정렬하십시오.

다음은 텍스트 파일을 읽고 키를 기반으로 정렬 한 다음 값을 정렬하기위한 Java 예입니다. 파일에서 단어 발생 수에 따라

public class SortFileWords {

    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        ValueCompare vc = new ValueCompare(map);
        TreeMap<String, Integer> sorted_map = new TreeMap<String, Integer>(map);
        List<String> list = new ArrayList<>();
        Scanner sc;
        try {
            sc = new Scanner(new File("c:\\ReadMe1.txt"));
            while (sc.hasNext()) {
                list.add(sc.next());
            }
            sc.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

        for (String s : list) {
            if (map.containsKey(s)) {
                map.put(s, map.get(s) + 1);
            } else
                map.put(s, 1);
        }

        System.out.println("Unsorted map: " + map);
        sorted_map.putAll(map);
        System.out.println("Sorted map on keys: " + sorted_map);

        TreeMap<String, Integer> sorted_value_map = new TreeMap<>(vc);
        sorted_value_map.putAll(map);
        System.out.println("Sorted map on values: " + sorted_value_map);
    }
}

class ValueCompare implements Comparator<String> {

    Map<String, Integer> map;

    public ValueCompare(Map<String, Integer> map) {
        this.map = map;
    }

    @Override
    public int compare(String s1, String s2) {
        if (map.get(s1) >= map.get(s2))
            return -1;
        else
            return 1;
    }
}

사용하지 않는 이유 트리 셋?

정의에 따라 "중복 요소가없는 컬렉션"입니다.

문제 설명에서 세트가 필요한 것처럼 들립니다. 어떤 키와 값이 함께 매핑되는지 알 수 없습니다.

이 클래스는 Treemap 인스턴스로 뒷받침되는 설정 인터페이스를 구현합니다. 이 클래스는 정렬 된 세트가 오름차순 요소 순서로, 요소의 자연 순서 (비교 가능) 또는 사용되는 생성자에 따라 세트 생성 시간에 제공된 비교기에 따라 정렬 된 요소 순서가 될 것을 보장합니다.

기본적으로 요구 사항에 따라 다릅니다. 때때로 해시 맵은 때때로 트리 맵이 좋습니다. 그러나 해시 맵은 오버 헤드를 정렬하기위한 제약 조건 만 사용하는 것이 좋습니다.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow