Is it possible to develop a powerful web search engine using Erlang, Mnesia & Yaws?

https://stackoverflow.com/questions/195809

10-07-2019

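Question

I am thinking of developing a web search engine using Erlang, Mnesia & Yaws. Is it possible to build a powerful and very fast web search engine with this software? What would it take to accomplish this, and how do I start?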

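Solution

Erlang is well suited to building a powerful web crawler. Let me take you through my simple crawler.

Step 1. I create a simple parallelism module, which I call mapreduce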

-module(mapreduce).
-export([compute/2]).
%%=====================================================================
%% usage example
%% Module = string
%% Function = tokens
%% List_of_arg_lists = [["file\r\nfile","\r\n"],["muzaaya_joshua","_"]]
%% Ans = [["file","file"],["muzaaya","joshua"]]
%% (results are gathered as they arrive, so their order may differ)
%% Job being done by two processes
%% i.e no. of processes spawned = length(List_of_arg_lists)

compute({Module,Function},List_of_arg_lists)->
    S = self(),
    Ref = erlang:make_ref(),
    PJob = fun(Arg_list) -> erlang:apply(Module,Function,Arg_list) end,
    Spawn_job = fun(Arg_list) -> 
                    spawn(fun() -> execute(S,Ref,PJob,Arg_list) end)
                end,
    lists:foreach(Spawn_job,List_of_arg_lists),
    gather(length(List_of_arg_lists),Ref,[]).
   
gather(0, _, L) -> L;
gather(N, Ref, L) ->
    receive
        {Ref,{'EXIT',_}} -> gather(N-1,Ref,L);
        {Ref, Result} -> gather(N-1, Ref, [Result|L])
    end.
    
execute(Parent,Ref,Fun,Arg)->
    Parent ! {Ref,(catch Fun(Arg))}.
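
Because gather/3 collects results in arrival order, callers of compute/2 cannot match a result to its input by position. A minimal sketch of an order-preserving variant follows. The module name pmap_sketch and its functions are hypothetical and not part of the original answer; each element gets its own reference, and results are collected in input order.

```erlang
-module(pmap_sketch).
-export([pmap/2]).

%% Hypothetical order-preserving variant of compute/2:
%% one process per element, one unique reference per process.
pmap(Fun, List) ->
    Parent = self(),
    Refs = [spawn_worker(Parent, Fun, X) || X <- List],
    %% Waiting on each reference in turn restores the input order.
    [collect(Ref) || Ref <- Refs].

spawn_worker(Parent, Fun, X) ->
    Ref = make_ref(),
    spawn(fun() -> Parent ! {Ref, (catch Fun(X))} end),
    Ref.

collect(Ref) ->
    receive
        {Ref, {'EXIT', Reason}} -> {error, Reason};
        {Ref, Result} -> {ok, Result}
    end.
```

Unlike compute/2, failed jobs are kept as {error, Reason} entries instead of being dropped, so the result list always has the same length as the input list.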

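Step 2. The HTTP client

One would normally use either the inets httpc module built into Erlang, or ibrowse. However, for memory management and speed (keeping the memory footprint as low as possible), a good Erlang programmer might choose to use curl. By passing the curl command line to os:cmd/1, one gets the output directly in the calling Erlang function. It is better still to make curl write its output to files, and have another process in our application read and parse those files.

Command: curl "http://www.erlang.org" -o "/downloaded_sites/erlang/file1.html"

In Erlang: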

os:cmd("curl \"http://www.erlang.org\" -o \"/downloaded_sites/erlang/file1.html\"").
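
When the URL and output path come from crawled data rather than literals, they need to be quoted before reaching the shell. The following is a minimal sketch, assuming a POSIX shell behind os:cmd/1; the module curl_cmd and its functions are hypothetical names introduced for illustration.

```erlang
-module(curl_cmd).
-export([shell_quote/1, command/2, fetch/2]).

%% Wrap a string in single quotes for a POSIX shell.
%% An embedded single quote is written as '\'' (close, escaped quote, reopen).
shell_quote(Str) ->
    "'" ++ lists:flatmap(fun($') -> "'\\''"; (C) -> [C] end, Str) ++ "'".

%% Build the curl command line; -s only silences the progress meter.
command(Url, OutPath) ->
    "curl -s " ++ shell_quote(Url) ++ " -o " ++ shell_quote(OutPath).

%% Run the command. The output directory is assumed to exist already.
fetch(Url, OutPath) ->
    os:cmd(command(Url, OutPath)).
```

Single quotes are used because the shell performs no expansion inside them, so characters such as &, ? and $ in a URL pass through unchanged.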

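So you can spawn many processes. Remember to escape the URL as well as the output file path when you execute that command. A separate process has the job of watching the directory of downloaded pages. It reads and parses these pages, and may then delete them after parsing, save them in a different location or, even better, archive them using the zip module.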

folder_check()->
    spawn(fun() -> check_and_report() end),
    ok.

-define(CHECK_INTERVAL,5).

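check_and_report()->
    %% avoid using
    %% file:list_dir/1
    %% if files are many, memory !!!
    %% note: "ls" counts entries in the current working directory;
    %% trim the output because some "wc" versions pad it with spaces
    case string:trim(os:cmd("ls | wc -l")) of
        "0" -> ok;
        _ -> ?MODULE:new_files_found()
    end,
    timer:sleep(timer:seconds(?CHECK_INTERVAL)),
    %% keep checking
    check_and_report().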

new_files_found()->
    %% inform our parser to pick files
    %% once it parses a file, it has to
    %% delete it or save it
    %% somewhere else
    gen_server:cast(?MODULE,files_detected).
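
The parser that receives files_detected is not shown in the answer. A rough sketch of what it might do with the download directory is below. The module page_drain, the Dir argument and the Handler callback are all assumptions made for illustration. It uses file:list_dir/1 for brevity, which the comments above caution against for very large directories.

```erlang
-module(page_drain).
-export([drain/2]).

%% Read every regular file in Dir, pass it to Handler(Path, Binary),
%% then delete it. Returns the number of files handled.
drain(Dir, Handler) ->
    {ok, Names} = file:list_dir(Dir),
    lists:foldl(
      fun(Name, Count) ->
              Path = filename:join(Dir, Name),
              case file:read_file(Path) of
                  {ok, Bin} ->
                      Handler(Path, Bin),
                      ok = file:delete(Path),
                      Count + 1;
                  {error, _Reason} ->
                      %% e.g. a sub-directory or a file that vanished
                      Count
              end
      end, 0, Names).
```

A gen_server could call drain/2 from its handle_cast(files_detected, State) clause; archiving with the zip module instead of deleting would replace the file:delete/1 call.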

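Step 3. The HTML parser.
It is better to use mochiweb's HTML parser together with XPath. This will help you parse the page, pick out the HTML tags you care about and extract their contents. In the examples below, I focused only on the keywords, description and title in the markup.

Testing the module in the shell... awesome results!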

2> spider_bot:parse_url("http://erlang.org").
[[[],[],
  {"keywords",
   "erlang, functional, programming, fault-tolerant, distributed, multi-platform, portable, software, multi-core, smp, concurrency "},
  {"description","open-source erlang official website"}],
 {title,"erlang programming language, official website"}]

3> spider_bot:parse_url("http://facebook.com").
[[{"description",
   " facebook is a social utility that connects people with friends and others who work, study and live around them. people use facebook to keep up with friends, upload an unlimited number of photos, post links and videos, and learn more about the people they meet."},
  {"robots","noodp,noydir"},
    [],[],[],[]],
 {title,"incompatible browser | facebook"}]

4> spider_bot:parse_url("http://python.org").
[[{"description",
   "      home page for python, an interpreted, interactive, object-oriented, extensible\n      programming language. it provides an extraordinary combination of clarity and\n      versatility, and is free and comprehensively ported."},
  {"keywords",
   "python programming language object oriented web free source"},
  []],
 {title,"python programming language – official website"}]

5> spider_bot:parse_url("http://www.house.gov/").
[[[],[],[],
  {"description",
   "home page of the united states house of representatives"},
  {"description",
   "home page of the united states house of representatives"},
  [],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
  [],[],[]|...],
 {title,"united states house of representatives, 111th congress, 2nd session"}]
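
The spider_bot module itself is not shown. As a rough illustration of the same idea, here is a sketch that pulls the title and named meta tags out of a page with the standard re module. This is a regex stand-in, not the mochiweb parser the answer recommends, and the module name meta_sketch is hypothetical. It only recognises meta tags written as name="..." followed directly by content="...".

```erlang
-module(meta_sketch).
-export([parse/1]).

%% Returns {Title | undefined, [{MetaName, Content}]}.
parse(Html) ->
    {title(Html), metas(Html)}.

title(Html) ->
    case re:run(Html, "<title[^>]*>(.*?)</title>",
                [caseless, dotall, {capture, [1], list}]) of
        {match, [Title]} -> string:trim(Title);
        nomatch -> undefined
    end.

metas(Html) ->
    Pattern = "<meta\\s+name=\"([^\"]+)\"\\s+content=\"([^\"]*)\"",
    case re:run(Html, Pattern,
                [global, caseless, {capture, all_but_first, list}]) of
        {match, Pairs} -> [{string:lowercase(N), C} || [N, C] <- Pairs];
        nomatch -> []
    end.
```

Regular expressions break down quickly on real-world markup (reordered attributes, single quotes, comments), which is a good reason to prefer a real HTML parser in the actual crawler.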

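You can now see that we can index the pages against their keywords, combined with a good schedule of page revisits. Another challenge was how to make a crawler (something that moves around the entire web, from domain to domain), but that one is easy. It is possible by parsing an HTML file for its href attributes. Make the HTML parser extract all the href values; you might then need a few regular expressions here and there to get the links right under a given domain.

Running the crawler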

7> spider_connect:conn2("http://erlang.org").        

        Links: ["http://www.erlang.org/index.html",
                "http://www.erlang.org/rss.xml",
                "http://erlang.org/index.html","http://erlang.org/about.html",
                "http://erlang.org/download.html",
                "http://erlang.org/links.html","http://erlang.org/faq.html",
                "http://erlang.org/eep.html",
                "http://erlang.org/starting.html",
                "http://erlang.org/doc.html",
                "http://erlang.org/examples.html",
                "http://erlang.org/user.html",
                "http://erlang.org/mirrors.html",
                "http://www.pragprog.com/titles/jaerlang/programming-erlang",
                "http://oreilly.com/catalog/9780596518189",
                "http://erlang.org/download.html",
                "http://www.erlang-factory.com/conference/ErlangUserConference2010/speakers",
                "http://erlang.org/download/otp_src_R14B.readme",
                "http://erlang.org/download.html",
                "https://www.erlang-factory.com/conference/ErlangUserConference2010/register",
                "http://www.erlang-factory.com/conference/ErlangUserConference2010/submit_talk",
                "http://www.erlang.org/workshop/2010/",
                "http://erlangcamp.com","http://manning.com/logan",
                "http://erlangcamp.com","http://twitter.com/erlangcamp",
                "http://www.erlang-factory.com/conference/London2010/speakers/joearmstrong/",
                "http://www.erlang-factory.com/conference/London2010/speakers/RobertVirding/",
                "http://www.erlang-factory.com/conference/London2010/speakers/MartinOdersky/",
                "http://www.erlang-factory.com/",
                "http://erlang.org/download/otp_src_R14A.readme",
                "http://erlang.org/download.html",
                "http://www.erlang-factory.com/conference/London2010",
                "http://github.com/erlang/otp",
                "http://erlang.org/download.html",
                "http://erlang.org/doc/man/erl_nif.html",
                "http://github.com/erlang/otp",
                "http://erlang.org/download.html",
                "http://www.erlang-factory.com/conference/ErlangUserConference2009",
                "http://erlang.org/doc/efficiency_guide/drivers.html",
                "http://erlang.org/download.html",
                "http://erlang.org/workshop/2009/index.html",
                "http://groups.google.com/group/erlang-programming",
                "http://www.erlang.org/eeps/eep-0010.html",
                "http://erlang.org/download/otp_src_R13B.readme",
                "http://erlang.org/download.html",
                "http://oreilly.com/catalog/9780596518189",
                "http://www.erlang-factory.com",
                "http://www.manning.com/logan",
                "http://www.erlang.se/euc/08/index.html",
                "http://erlang.org/download/otp_src_R12B-5.readme",
                "http://erlang.org/download.html",
                "http://erlang.org/workshop/2008/index.html",
                "http://www.erlang-exchange.com",
                "http://erlang.org/doc/highlights.html",
                "http://www.erlang.se/euc/07/",
                "http://www.erlang.se/workshop/2007/",
                "http://erlang.org/eep.html",
                "http://erlang.org/download/otp_src_R11B-5.readme",
                "http://pragmaticprogrammer.com/titles/jaerlang/index.html",
                "http://erlang.org/project/test_server",
                "http://erlang.org/download-stats/",
                "http://erlang.org/user.html#smtp_client-1.0",
                "http://erlang.org/user.html#xmlrpc-1.13",
                "http://erlang.org/EPLICENSE",
                "http://erlang.org/project/megaco/",
                "http://www.erlang-consulting.com/training_fs.html",
                "http://erlang.org/old_news.html"]
ok
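
spider_connect is likewise not shown. The following sketch illustrates the link-extraction step under stated assumptions: it uses a regular expression rather than an HTML parser, only handles double-quoted href values, and relies on uri_string:resolve/2, which is available in recent OTP releases (OTP 22.3 or later, as far as I know). The module name link_sketch is hypothetical.

```erlang
-module(link_sketch).
-export([links/2]).

%% Extract href values from Html, resolve them against BaseUrl,
%% and return the unique http/https links in sorted order.
links(Html, BaseUrl) ->
    case re:run(Html, "href\\s*=\\s*\"([^\"]+)\"",
                [global, caseless, {capture, all_but_first, list}]) of
        {match, Hits} ->
            Resolved = [uri_string:resolve(Href, BaseUrl) || [Href] <- Hits],
            %% is_http/1 also filters out error tuples from resolve/2
            lists:usort([U || U <- Resolved, is_http(U)]);
        nomatch ->
            []
    end.

is_http("http://" ++ _) -> true;
is_http("https://" ++ _) -> true;
is_http(_) -> false.
```

Deduplicating with lists:usort/1 matters in practice: the output above lists http://erlang.org/download.html many times, and a crawler should queue it only once.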

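Storage: this is one of the most important concepts for a search engine. It is a big mistake to store search engine data in an RDBMS such as MySQL, Oracle or MS SQL. Such systems are very complex, and the applications that interface with them employ heuristic algorithms. This brings us to key-value stores, of which my two favourites are Couchbase Server and Riak. These are great distributed, cloud-friendly data stores. Another important parameter is caching. Caching is achieved using, say, Memcached, which both of the storage systems mentioned above support. Storage systems for search engines ought to be schemaless DBMSs that focus on availability rather than consistency. Read more about search engines here: http://en.wikipedia.org/wiki/Web_search_engine

Other answers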
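
To make the key-value idea concrete, here is a small sketch of a keyword index laid out as keyword -> set of URLs. An in-memory Erlang map stands in for a real store such as Riak or Couchbase, and the module name kw_index and its functions are hypothetical.

```erlang
-module(kw_index).
-export([new/0, add_page/3, lookup/2]).

%% The index maps a normalised keyword to an ordered set of URLs.
new() -> #{}.

%% Record that Url was found under each keyword in Keywords.
add_page(Url, Keywords, Index) ->
    lists:foldl(
      fun(Raw, Acc) ->
              Key = normalise(Raw),
              maps:update_with(
                Key,
                fun(Urls) -> ordsets:add_element(Url, Urls) end,
                [Url],
                Acc)
      end, Index, Keywords).

%% Return the URLs indexed under Keyword, or [] if there are none.
lookup(Keyword, Index) ->
    maps:get(normalise(Keyword), Index, []).

normalise(Str) ->
    string:lowercase(string:trim(Str)).
```

In a real deployment each map key would become a key in the store and the URL set its value, which is why a schemaless key-value store fits this workload.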

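As far as I know, Powerset's natural language processing search engine was developed using Erlang.

Did you look at CouchDB (which is written in Erlang as well) as a possible tool to help you solve a few of the problems along the way?

I would recommend CouchDB instead of Mnesia.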

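Mnesia doesn't have map-reduce, CouchDB does (correction: see comments)
Mnesia is statically typed, CouchDB is a document database (and pages are documents, i.e. a better fit to the information model in my opinion)
Mnesia is primarily intended to be a memory-resident database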
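
For comparison, this is roughly what storing pages in Mnesia looks like: a table has a fixed record layout declared up front, and ram_copies keeps it memory-resident. This is a minimal single-node sketch; the module name mnesia_pages and the page record are assumptions made for illustration.

```erlang
-module(mnesia_pages).
-export([init/0, put_page/3, get_page/1]).

%% Every row in the table has exactly these fields; url is the key.
-record(page, {url, title, keywords}).

%% Start Mnesia and create an in-memory table on the local node.
init() ->
    ok = mnesia:start(),
    case mnesia:create_table(page,
                             [{ram_copies, [node()]},
                              {attributes, record_info(fields, page)}]) of
        {atomic, ok} -> ok;
        {aborted, {already_exists, page}} -> ok
    end.

put_page(Url, Title, Keywords) ->
    Write = fun() ->
                    mnesia:write(#page{url = Url, title = Title,
                                       keywords = Keywords})
            end,
    {atomic, ok} = mnesia:transaction(Write),
    ok.

%% Returns [{Title, Keywords}] for a stored URL, or [] if absent.
get_page(Url) ->
    {atomic, Rows} = mnesia:transaction(fun() -> mnesia:read(page, Url) end),
    [{T, K} || #page{title = T, keywords = K} <- Rows].
```

Adding a new field to page means transforming the table, whereas a document database accepts pages of any shape.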

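Yaws is pretty good. You should also consider MochiWeb.

You won't go wrong with Erlang.

In the 'rdbms' contrib there is an implementation of the Porter stemming algorithm. It was never integrated into 'rdbms', so it is basically just sitting there. We have used it internally, and it worked quite well, at least for datasets that weren't huge (I haven't tested it on huge data volumes).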