I have been trying to get sizeable results fetching image URLs from an image site (pixiv). (Input links are of the artwork kind. For example:
will work with this PHP.) Retrieving the relevant links via pattern matching is no problem, but the links, even when correctly formatted, return 403s, as the site is configured to thwart outside access (probably to preserve bandwidth).
I did stumble across an option to pass a valid request header in order to get things to work: https://www.reddit.com/r/Rlanguage/comments/ytgtun/im_trying_to_use_downloadfile_but_i_get_a_403/?rdt=55917
However, so far this does not seem to work (the original example was in R; I'm using PHP to try to replicate the behavior).
My code so far looks like this (the main focus is on the PHP side; the rest is just JS to ease things should I get it to work):
<!DOCTYPE html>
<html>
<head>
<title>Image Retrieval</title>
</head>
<body>
<form method="post" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF'], ENT_QUOTES); ?>">
<label for="url">Enter the URL:</label>
<input type="text" id="url" name="url">
<button type="submit">Submit</button>
</form>
<?php
if ($_SERVER["REQUEST_METHOD"] == "POST") {
$url = $_POST["url"];
$options = [
'http' => [
'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0\r\n" .
"Referer: https://accounts.pixiv.net\r\n",
],
];
$context = stream_context_create($options);
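$html = file_get_contents($url, false, $context);
if ($html === false) {
echo '<p>Could not fetch the page.</p>'; //request failed (e.g. 403 or network error)
} else {
$pattern = '/image" href="(.*?)"/'; //find downscaled master img (always in jpg format)
//$pattern = '/"original":"(.*?)"/'; //find original image (usually only works when logged in)
if (preg_match($pattern, $html, $matches)) {
$imageUrl = htmlspecialchars($matches[1], ENT_QUOTES); //escape before echoing into HTML
echo '<p>Image Link: <a id="image-link" href="' . $imageUrl . '">' . $imageUrl . '</a></p>';
} else {
echo '<p>No image link found.</p>'; //pattern did not match the fetched page
}
}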
}
?>
<script>
var imageLink = document.getElementById("image-link");
if (imageLink) {
window.location.href = imageLink.href;
}
</script>
<!-- Autofill if a query string exists -->
<script>
var urlParams = new URLSearchParams(window.location.search);
var pixivUrl = urlParams.get('pixivurl');
if (pixivUrl) {
var urlInput = document.getElementById('url');
if (urlInput) {
urlInput.value = pixivUrl;
}
var form = document.querySelector('form');
if (form) {
form.submit();
}
}
</script>
</body>
</html>
I'm fairly certain something specific is needed to pass a request header properly, but I have never had to use that feature, so I'm at a bit of a loss.
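To help narrow this down, here is a minimal sketch of how the status code of the PHP-side request could be inspected. The helper names (`buildHttpOptions`, `parseStatusCode`, `fetchWithStatus`) are hypothetical, and the `Referer` value in the usage comment is only an assumption about what the site might expect, not something I have confirmed. Note that headers set in a stream context apply only to the request PHP itself makes; the JS redirect to the image link is a separate request issued by the browser.

```php
<?php
declare(strict_types=1);

/**
 * Build the options array for stream_context_create().
 * 'header' accepts an array of "Name: value" strings, and
 * 'ignore_errors' => true returns the body even on 4xx/5xx responses.
 */
function buildHttpOptions(string $referer, string $userAgent): array
{
    return [
        'http' => [
            'method'        => 'GET',
            'header'        => [
                "User-Agent: $userAgent",
                "Referer: $referer",
            ],
            'ignore_errors' => true,
        ],
    ];
}

/**
 * Return the status code of the final response, or null if none is found.
 * After redirects several status lines can appear; the last one wins.
 */
function parseStatusCode(array $responseHeaders): ?int
{
    $status = null;
    foreach ($responseHeaders as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

/**
 * Fetch a URL and report the status code alongside the body.
 */
function fetchWithStatus(string $url, array $options): array
{
    $context = stream_context_create($options);
    $body = @file_get_contents($url, false, $context);
    // The HTTP wrapper fills $http_response_header in the local scope.
    $headers = $http_response_header ?? [];
    return [
        'status' => parseStatusCode($headers),
        'body'   => $body === false ? null : $body,
    ];
}

// Usage (assumed Referer value; replace $url with an actual artwork or image link):
// $result = fetchWithStatus($url, buildHttpOptions('https://www.pixiv.net/', 'Mozilla/5.0'));
// echo $result['status'];
```

Printing the status for both the artwork page and the extracted image link should show which of the two requests is actually being refused.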
Thanks in advance