Question

Get a text (HTML) file instead of .jpg when using WebClient.DownloadFile()

I'm trying to download the .jpg from this URL:

http://1.bp.blogspot.com/_pK6J3MTn5co/S6kuH3aqbeI/AAAAAAAACUY/06axvmjU91k/s1600-h/avengers02_B&W_UL.jpg

Using this code:

private void TEST_button1_Click(object sender, EventArgs e)
{
    WebClient MyDownloader = new WebClient();
    MyDownloader.DownloadFile(@"http://1.bp.blogspot.com/_pK6J3MTn5co/S6kuH3aqbeI/AAAAAAAACUY/06axvmjU91k/s1600-h/avengers02_B&W_UL.jpg", @"c:\test.jpg");
}

However, when I run this, I end up with a file called test.jpg that contains HTML markup:

<html>
<head>
<title>avengers02_B&amp;W_UL.jpg (image)</title>
<script type="text/javascript">
<!--
if (top.location != self.location) top.location = self.location;
// -->
</script>
</head>
<body bgcolor="#ffffff" text="#000000">
<img src="http://1.bp.blogspot.com/_pK6J3MTn5co/S6kuH3aqbeI/AAAAAAAACUY/06axvmjU91k/s1600/avengers02_B%26W_UL.jpg" alt="[avengers02_B&amp;W_UL.jpg]" border=0>
</body>
</html>

How can I download the actual .jpg?

Any help is greatly appreciated - thank you!


Solution

There is a way to do it: first download the HTML content as a string and extract the real image URL from it, then download the file from that URL.

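 // Requires: using System.Net; using System.Text.RegularExpressions;
 using (WebClient client = new WebClient())
 {
     var path = @"http://1.bp.blogspot.com/_pK6J3MTn5co/S6kuH3aqbeI/AAAAAAAACUY/06axvmjU91k/s1600-h/avengers02_B&W_UL.jpg";

     // The URL returns an HTML wrapper page, so fetch it as text first.
     var content = client.DownloadString(path);
     // Capture the src attribute value of the first <img> tag (single- or double-quoted).
     Regex regex = new Regex(@"(?<=<img\s+[^>]*?src=(?<q>['""]))(?<url>.+?)(?=\k<q>)");
     var match = regex.Match(content);
     if (match.Success)
     {
         // Decode HTML entities (e.g. &amp;) before using the attribute value as a URL.
         client.DownloadFile(WebUtility.HtmlDecode(match.Groups["url"].Value), @"e:\test1.jpg");
     }
 }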

Other tips

If the server returns HTML for your request to a particular URL, there is not much you can do to force it to return something else at that URL.

What you can do is parse the response with HtmlAgilityPack, find the URL of the actual image, and fetch it in a second request.
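
The HtmlAgilityPack approach could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced and that the first img element on the wrapper page is the one you want, which holds for the HTML shown in the question but is not guaranteed in general. The class and method names (ImageUrlExtractor, FindFirstImageUrl, DownloadImageFromPage) are hypothetical.

```csharp
using System;
using System.Net;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

static class ImageUrlExtractor
{
    // Returns the src of the first <img> element that has one, or null if none is found.
    public static string FindFirstImageUrl(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode img = doc.DocumentNode.SelectSingleNode("//img[@src]");
        if (img == null)
        {
            return null;
        }

        // Attribute values may still contain entities such as &amp;, so decode them.
        return WebUtility.HtmlDecode(img.GetAttributeValue("src", string.Empty));
    }

    // Fetches the wrapper page, then downloads the image it points to.
    // Returns false when no <img> element was found.
    public static bool DownloadImageFromPage(string pageUrl, string targetPath)
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString(pageUrl);
            string imageUrl = FindFirstImageUrl(html);
            if (string.IsNullOrEmpty(imageUrl))
            {
                return false;
            }

            client.DownloadFile(imageUrl, targetPath);
            return true;
        }
    }
}
```

Calling ImageUrlExtractor.DownloadImageFromPage with the URL from the question and a path such as c:\test.jpg would then issue the two requests in sequence.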

Clicking that link causes two downloads: first a page of HTML (mislabelled with a .jpg suffix), then the image referenced in that HTML.

So perhaps you need to fetch the URL from the img tag in the HTML returned by the first request?

http://1.bp.blogspot.com/_pK6J3MTn5co/S6kuH3aqbeI/AAAAAAAACUY/06axvmjU91k/s1600/avengers02_B%26W_UL.jpg
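
One way to confirm that a response is really HTML mislabelled as .jpg is to look at the Content-Type header before saving anything. The sketch below assumes the server sends an accurate Content-Type, which is not confirmed anywhere in this thread. ResponseInspector, IsImageContentType and TryDownloadImage are hypothetical names.

```csharp
using System;
using System.IO;
using System.Net;

static class ResponseInspector
{
    // True when a Content-Type header value denotes an image, e.g. "image/jpeg".
    public static bool IsImageContentType(string contentType)
    {
        return contentType != null &&
               contentType.TrimStart().StartsWith("image/", StringComparison.OrdinalIgnoreCase);
    }

    // Downloads the URL into memory and writes it to disk only if the
    // server reports an image content type. Returns false otherwise.
    public static bool TryDownloadImage(string url, string targetPath)
    {
        using (var client = new WebClient())
        {
            byte[] data = client.DownloadData(url);

            // ResponseHeaders is populated once the request has completed.
            string contentType = client.ResponseHeaders["Content-Type"];
            if (!IsImageContentType(contentType))
            {
                return false;
            }

            File.WriteAllBytes(targetPath, data);
            return true;
        }
    }
}
```

If TryDownloadImage returns false for the s1600-h URL, that is the cue to parse the returned page for the real image URL instead.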

I'm guessing that removing -h from the original URL might point to the actual file that you're after.

Here's hoping that you have permission to scrape these files...
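
The guess above about dropping -h can be tried directly. This is a minimal sketch, assuming the only difference between the wrapper-page URL and the image URL is the -h suffix on the size segment (s1600-h versus s1600), as the two URLs in this thread suggest. BloggerUrl and ToDirectImageUrl are hypothetical names, and whether the server accepts the unescaped & in the file name (the page itself uses %26) is untested here.

```csharp
using System;
using System.Text.RegularExpressions;

static class BloggerUrl
{
    // Rewrites a path segment such as "/s1600-h/" to "/s1600/".
    // URLs without such a segment are returned unchanged.
    public static string ToDirectImageUrl(string pageUrl)
    {
        return Regex.Replace(pageUrl, @"/(s\d+)-h/", "/$1/");
    }
}
```

The result can be passed to WebClient.DownloadFile in place of the original URL. Parsing the wrapper page is the safer route if this URL pattern ever changes.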

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow