Lecture
libcurl for data exchange
libcurl is a library of functions that lets you exchange data with servers over a variety of protocols. libcurl currently supports the http, https, ftp, gopher, telnet, dict, file, and ldap protocols. It can also work with HTTPS certificates, send requests to HTTP servers with the POST and PUT methods, upload files over HTTP and FTP (the latter can also be done with the FTP module), and use proxy servers, cookies, and user authentication.
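The exact protocol list depends on how libcurl was built on your machine. As a minimal sketch, assuming the curl extension is enabled, you can ask the library itself with curl_version():

```php
<?php
// List the protocols supported by the libcurl build that this PHP uses.
$info = curl_version();            // array describing the installed libcurl
$protocols = $info['protocols'];   // list of protocol names such as http, ftp, ...

echo 'libcurl ' . $info['version'] . ' supports: ' . implode(', ', $protocols) . PHP_EOL;
```

The output differs from one installation to another, so treat the list above as typical rather than guaranteed.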
They only understand us through a protocol.
Any server, a web server included, can respond to the data sent to it, but only if it understands what it has been sent. To make that possible, the data is laid out according to certain rules. Such a set of rules is called a protocol. libcurl lays the data out according to the protocol itself; below we only look at how to hand the data over to libcurl.
One, two, off we go.
The library has very few functions, and one of them sets the gears in motion in every script that uses curl. curl_init initializes a cURL session. In other words, this function starts the curl mechanism and returns a handle to the mechanism it created.
For those who have forgotten, let me remind you what a resource (handle) is.
Handle (Resource)
A handle (descriptor) is a pointer, a reference to an external resource.
Imagine a garage full of cars, each getting a few liters of oil poured into its engine. With a screech of brakes, a Ferrari flies in through the gate and pulls into the nearest service bay. The garage owner shouts to a worker, "Pour in 5 liters of oil," and jabs his thumb at the red Ferrari.
In this case the worker receives two pieces of data: a number (liters of oil) and a pointer to the car, that is, an indication of which of all the objects around needs something added.
Note that the worker gets from the boss not the car itself, but only a handle (pointer) to the car he has to work on. In PHP such a handle has the resource data type (since PHP 8.0, curl_init returns a CurlHandle object instead, which plays the same role).
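To make the garage analogy concrete, here is a small sketch, assuming the curl extension is loaded. The setAddress() helper is hypothetical and exists only to show that a function receives the handle, not a copy of the session behind it.

```php
<?php
$ch = curl_init();   // start the mechanism; no URL yet

// PHP 8.0+ returns a CurlHandle object, older versions return a resource.
$isHandle = is_resource($ch) || $ch instanceof CurlHandle;
var_dump($isHandle);

// Hypothetical helper: the "worker" gets only the handle and configures
// the session that the caller created.
function setAddress($handle, string $url): bool
{
    return curl_setopt($handle, CURLOPT_URL, $url);
}

$wasSet = setAddress($ch, 'http://php.su');
var_dump($wasSet);
```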
The curl_init function can also take the URL right away, that is, the address of the server we are going to talk to. You may leave it out and set it later. Once the mechanism is initialized, you can send the request immediately and, finally, free the memory the mechanism occupies.
Here is what we end up with:
<?PHP
$ch = curl_init('http://php.su');
curl_exec($ch); // execute the curl request: contact the php.su server
curl_close($ch);
?>
The result of this code is the contents of the php.su main page, printed directly. Printing the result of a request straight to the browser is not always what you want, and changing that takes only a couple of settings. Let's find out how.
Configuring communication
curl_setopt - sets an option for a cURL session
There are many options, a great many. What each one is responsible for you can, of course, look up in the curl_setopt reference. For now we will look at a couple of the main ones and find out how to work with them.
But first I want to return to the HTTP protocol for a moment and to what its set of rules consists of. Let's see how the browser talks to our server: what it sends and what it receives. I use the LiveHTTPHeaders browser extension for this. Here is what happens when the browser talks to the server:
The browser's conversation with the server turns out to be not too convoluted. Let's look:
GET /index.php HTTP/1.1                    Give me the index.php page
Host: php.su                               from php.su
User-Agent: Mozilla/5.0                    And here I am!
Accept: text/html                          I understand only text and HTML.
Accept-Language: ru,en-us;                 And speak Russian, or I won't understand you.
Accept-Charset: windows-1251,utf-8;        With spices, please.
Connection: keep-alive                     I'm waiting for your answer,
Keep-Alive: 300                            but my patience is short.
Cookie: lastvisit=1243232518;              I came by yesterday and booked a table. Remember me? No? Never mind, you asked me to remind you that I came by at this time. Remember now? Fine. Where is my table?
HTTP/1.1 200 OK                            Data received, status 200.
Date: Mon, 25 May 2009 06:33:05 GMT
Server: Apache                             You are being served by Apache Server Ltd.
X-Powered-By: PHP/5.2.6                    Head chef: PHP 5.2.6
Transfer-Encoding: chunked                 This is the first portion, a second will follow
Connection: close                          Here you go, sign for it, and don't wait for anything more
Content-Type: text/html; charset=cp1251    Your pizza, in Russian, with mushrooms
Immediately after the response headers comes the response itself, that is, the HTML of the page. Meanwhile, we have witnessed my browser talking to the php.su server. And if the browser can do it, so can our program.
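Since the response headers are just lines of "Name: value" text, a program can take them apart with ordinary string functions. Here is a rough sketch; parseHeaders() is a hypothetical helper, and the sample text is copied from the response above. Note that it keeps only the last value when a header repeats.

```php
<?php
// Turn a raw block of response headers into a status line plus
// a name => value array. parseHeaders() is a hypothetical helper.
function parseHeaders(string $raw): array
{
    $lines   = preg_split('/\r\n|\n/', trim($raw));
    $status  = array_shift($lines);          // e.g. "HTTP/1.1 200 OK"
    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue;                        // skip anything that is not "Name: value"
        }
        [$name, $value] = explode(':', $line, 2);
        // Header names are case-insensitive, so store them in lower case.
        $headers[strtolower(trim($name))] = trim($value);
    }
    return ['status' => $status, 'headers' => $headers];
}

$raw = "HTTP/1.1 200 OK\r\n"
     . "Date: Mon, 25 May 2009 06:33:05 GMT\r\n"
     . "Server: Apache\r\n"
     . "X-Powered-By: PHP/5.2.6\r\n"
     . "Content-Type: text/html; charset=cp1251\r\n";

$parsed = parseHeaders($raw);
echo $parsed['status'], PHP_EOL;
echo $parsed['headers']['content-type'], PHP_EOL;
```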
Ordering data with the options we need
After we ran our simple code
<?PHP
$ch = curl_init('http://php.su');
curl_exec($ch); // execute the curl request: contact the php.su server
curl_close($ch);
?>
we saw in the browser only the HTML received from the server. Perhaps we would also like to look at the headers the server sent. What if I made up everything written above?
To do this, set the "show headers" option:
CURLOPT_HEADER: if this option is set to a non-zero value, the result will include the received headers.
Of course, options have to be set before the request itself is sent to the server. So we end up with this:
<?PHP
$ch = curl_init('http://php.su');
curl_setopt ($ch, CURLOPT_HEADER, true);
curl_exec($ch); // execute the curl request
curl_close($ch);
?>
Perhaps we also want to get the content into a variable rather than print it straight to the browser. To do this, we add one more setting next to the others: curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
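Here is a minimal sketch of capturing the result in a variable and checking for failure. To keep it runnable without a network, it reads a temporary local file through the file protocol mentioned at the start; the file name and contents are made up for the example, and the 'file://' . $path form assumes a Unix-like path.

```php
<?php
// Stand-in for a remote page: a local temp file served through file://
$path = tempnam(sys_get_temp_dir(), 'curl');
file_put_contents($path, '<html>hello</html>');

$ch = curl_init('file://' . $path);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the data instead of printing it
$html = curl_exec($ch);                        // string on success, false on failure

if ($html === false) {
    echo 'Request failed: ' . curl_error($ch) . PHP_EOL;
} else {
    echo 'Received ' . strlen($html) . ' bytes' . PHP_EOL;
}
unlink($path);                                 // remove the temporary file
```

With a real site you would pass its address to curl_init instead, exactly as in the examples above.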
Though the documentation seems to be wrong about the options...
Note that the documentation says the name of the option must be a string. But what we pass to curl_setopt is not a string at all (there are no quotes). We pass a constant: a predefined name whose value cannot be changed. These constants are defined by the library itself, and here is the interesting part: their values are not strings at all but numbers. So if anyone finds out why the documentation says "string" instead of a number, let me know.
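You can check the claim about numeric values yourself; the current PHP manual also declares the second argument of curl_setopt as int $option. A small sketch, assuming the curl extension is loaded:

```php
<?php
// The option names are predefined integer constants, not strings.
var_dump(is_int(CURLOPT_HEADER));
var_dump(is_int(CURLOPT_RETURNTRANSFER));

// Because they are plain integers, they can serve as array keys,
// which is the form curl_setopt_array() expects.
$ch = curl_init();
$ok = curl_setopt_array($ch, [
    CURLOPT_HEADER         => true,
    CURLOPT_RETURNTRANSFER => true,
]);
var_dump($ok);   // true when every option was accepted
```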
In the meantime, let's try to log in to the forum. To imitate a regular browser, we need to see what the browser says to the server, what the server answers, and what the browser does next.
We are a user
First of all, to keep all this plausible, let's see what happens when my browser logs in to the forum. For this I again use the Firefox extension called LiveHTTPHeaders, and here is what I see.
http://php.su/forum/loginout.php
POST /forum/loginout.php HTTP/1.1
Host: php.su
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.10) Gecko/2009042523 Ubuntu/8.10 (intrepid) Firefox/3.0.10
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: ru,en-us;q=0.7,en;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://php.su/forum/loginout.php
Content-Type: application/x-www-form-urlencoded
Content-Length: 71
action=login&imembername=valenok&ipassword=ne_skaju&submit=%C2%F5%EE%E4
HTTP/1.x 302 Found
Date: Tue, 26 May 2009 14:09:09 GMT
Server: Apache
X-Powered-By: PHP/5.2.6
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: lastvisit=1243346949; expires=Wed, 26-May-2010 14:09:09 GMT; path=/
Set-Cookie: exbbn=19; expires=Wed, 26-May-2010 14:09:09 GMT; path=/
Set-Cookie: exbbp=1234567525d2b72bcb01cd2ffe123456; expires=Wed, 26-May-2010 14:09:09 GMT; path=/
Set-Cookie: PHPSESSID=123456789e4eef401e4539060010cc0f;
Set-Cookie: lastvisit=1243346949; expires=Wed, 26-May-2010 14:09:09 GMT; path=/
Location: index.php
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 26
Connection: close
Content-Type: text/html; charset=cp1251
The request differs only slightly from the previous one. Let's look at the differences.
POST /forum/loginout.php HTTP/1.1                  This time we do not just ask for the content of the page, we send our own data to the server.
Referer: http://php.su/forum/loginout.php          Where we are sending the data from (the page the browser was just on).
Content-Type: application/x-www-form-urlencoded    The type of the data (form data).
Content-Length: 71                                 The length of the data being sent.
action=login&imembername=valenok&ipassword=ne_skaju&submit=%C2%F5%EE%E4    And the data itself. Notice that all the data from the form is sent.
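The request body above is nothing more than the form fields URL-encoded and joined with "&". As a sketch of how such a string could be assembled instead of typed by hand, PHP's http_build_query() can be used. The submit value is written here as the raw windows-1251 bytes that stand behind %C2%F5%EE%E4 in the capture; the field names are taken from the capture as well.

```php
<?php
// Form fields as the login form sends them. The submit value holds the raw
// windows-1251 bytes that appear as %C2%F5%EE%E4 in the captured request.
$fields = [
    'action'      => 'login',
    'imembername' => 'valenok',
    'ipassword'   => 'ne_skaju',
    'submit'      => "\xC2\xF5\xEE\xE4",
];

// http_build_query() URL-encodes each value and joins the pairs with "&".
$body = http_build_query($fields, '', '&');

echo $body, PHP_EOL;
echo 'Content-Length: ', strlen($body), PHP_EOL;
```

The resulting string is 71 bytes long, which matches the Content-Length header in the captured request.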
The response, however, differs a bit more from the previous one. New headers appear in it. We are interested first of all in Set-Cookie and Location. The rest play no special role here, and what they mean can be found on Wikipedia.
Set-Cookie: as you can see, there are a lot of them, and you may already be familiar with this header. Its job is to stick a name tag on you so that the server can later recognize you by it and say: oh yes, that's you. Of course, for that you have to show up with this sticker every time you contact the server.
Location: index.php redirects the browser to another page after login.
Well, shall we now try to play the part of the browser?
I'm a browser too
Let's work through it in practice. The code is as follows.
<?PHP
$ch = curl_init('http://php.su/forum/loginout.php');
# /forum/loginout.php HTTP/1.1
curl_setopt($ch, CURLOPT_POST, 1);
# POST /forum/..
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (blah blah blah..) ");
# User-Agent
$headers = array
(
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language: ru,en-us;q=0.7,en;q=0.3',
    'Accept-Encoding: deflate',
    'Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7'
);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
# Add headers to our request so that it looks like the real thing
curl_setopt($ch, CURLOPT_REFERER, "http://php.su/forum/loginout.php");
# Fake the value that says where the data came from.
curl_setopt($ch, CURLOPT_POSTFIELDS, 'action=login&imembername=valenok&ipassword=ne_skaju&submit=%C2%F5%EE%E4');
# The POST data.
# Clever libcurl adds these headers by itself:
# Content-Type: application/x-www-form-urlencoded and Content-Length: 71
curl_setopt($ch, CURLOPT_COOKIEJAR, "my_cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, "my_cookies.txt");
# Options for handling the cookies set by the forum.
# We will look at them in more detail below.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
# Suppress output to the browser. Let the function return the data instead of printing it
$result = curl_exec($ch); // execute the curl request
curl_close($ch);
?>
I marked in the code which option is responsible for what. If something is not entirely clear, you can always look in the documentation. Still, I will say a few words about the cookiejar and cookiefile options.
When the server gives us a cookie, that is, a sticker saying "you are so-and-so", it later looks at this sticker and recognizes us. For that, of course, we have to approach the server with the sticker in plain view. libcurl can save the sticker to a file for us if we name the file in the cookiejar option, and it can send the cookies back, that is, show up with the sticker, if we name the file where the sticker was saved in the cookiefile option. And since the whole point of logging in was for the server to remember us the next time we contact it, all we really needed from the login was to receive the cookies. So this is how we do it.
<?PHP
$ch = curl_init('http://php.su/forum/loginout.php');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'action=login&imembername=valenok&ipassword=ne_skaju&submit=%C2%F5%EE%E4');
curl_setopt($ch, CURLOPT_COOKIEJAR, "my_cookies.txt");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_exec($ch);
curl_close($ch);
?>
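Once the sticker is saved in my_cookies.txt, a later request can show it to the server. Below is a sketch of such a follow-up request. fetchWithCookies() is a hypothetical helper name, and which pages of the forum actually require the cookies is an assumption.

```php
<?php
// Fetch a page while presenting cookies saved earlier and storing new ones.
// fetchWithCookies() is a hypothetical helper, not part of libcurl.
function fetchWithCookies(string $url, string $cookieFile)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);  // read cookies from this file and send them
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);   // write cookies back when the handle is freed
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return the page instead of printing it
    $page = curl_exec($ch);                              // false on failure
    unset($ch);  // free the handle (curl_close($ch) before PHP 8.0); this also flushes the cookie jar
    return $page;
}
```

After the login code above has run, a call such as fetchWithCookies('http://php.su/forum/index.php', 'my_cookies.txt') should then be answered as for a logged-in user, provided the server accepted the login.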
Here I cut a few steps. The server does not actually check where the login data came from, so there is no point in setting the referer. It does not care which browser is logging in either. We remove both.
We will not send the headers that say what we want to receive in response either. We still get the cookies and the redirect in the response. Besides, hardly any programmer takes these headers into account anyway. Like sheep, they have decided that they know best what the user needs, and that's that.
The nobody option is also set, which says that we do not need all the HTML, only the headers. In our case there is in fact no body in the response, but that is only our case; in general a script may both perform the login and complain about errors in the same response. Be aware that the PHP manual says CURLOPT_NOBODY switches the request method to HEAD, so check that your POST data is still sent when you combine the two.
I also removed the assignment of the result to a variable. Why remember the result of the request? We already know that it succeeds. If we did not know, we could check, for example, for the Location: index.php redirect and decide from that whether the login succeeded. But haven't we forgotten anything?
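Such a check could be sketched as follows. It assumes the response headers were captured as text (for example with CURLOPT_HEADER and CURLOPT_RETURNTRANSFER, as shown earlier) and that a failed login does not redirect to index.php; loginSucceeded() is a hypothetical helper and the sample responses are shortened.

```php
<?php
// Decide from raw response headers whether the login redirected to
// index.php. loginSucceeded() is a hypothetical helper.
function loginSucceeded(string $rawHeaders): bool
{
    // Status line such as "HTTP/1.1 302 Found" at the very start
    $isRedirect = preg_match('#^HTTP/\S+\s+302\b#', $rawHeaders) === 1;
    // A "Location: index.php" header on a line of its own
    $toIndex = preg_match('/^Location:\s*index\.php\s*$/mi', $rawHeaders) === 1;
    return $isRedirect && $toIndex;
}

$ok  = "HTTP/1.1 302 Found\r\nSet-Cookie: exbbn=19; path=/\r\nLocation: index.php\r\n\r\n";
$bad = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=cp1251\r\n\r\n";

var_dump(loginSucceeded($ok));
var_dump(loginSucceeded($bad));
```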
A couple of recommendations
In fact, we have already covered everything you are likely to need. You can now log in somewhere and submit a form (POST data) with the curl library while posing as a browser, and remind the server who you are from last time with the help of cookies.
I will only add that sometimes you do not need the page at all, only its headers, as in our login example. You also do not need to create a new instance of the curl mechanism in memory for every request. We initialize it once, then just change the options and the URL. I think this is quite enough to start exchanging data with other sites successfully, but if it is not, we can go on to a few harder cases.
Multiple parallel requests and curl_multi_init
If we needed the contents of, say, three pages, we would probably send a request to the first, get the result, send a request to the second, and only then to the third. But here is the miracle: this library can send requests to all three sources in parallel, so the whole thing takes roughly as long as the slowest single request.
For the code below, $data might look like this:
<?PHP
// GET
$data = Array
(
'http://yandex.ru',
'http://php.su',
'http://google.com'
);
// POST
$data = Array
(
Array('url' => 'http://yandex.ru/login.php', 'post' => 'a=b&c=d'),
Array('url' => 'http://php.su/index.php', 'post' => 'a=b&c=d'),
Array('url' => 'http://google.com/search.py', 'post' => 'a=b&c=d')
);
?>
The function itself:
<?PHP
function multiCurl($data, $options = array())
{
    $curls = array();
    // Array of handles. The library creates many instances of its
    // mechanism, but they will work in parallel
    $result = array();
    // Array with the contents of the requested pages, which our function returns.
    $mh = curl_multi_init();
    // The multi handle. This is the thing responsible for making many
    // requests run in parallel.
    foreach ($data as $id => $d) {
        $curls[$id] = curl_init();
        // For each URL we create a separate curl mechanism to send the request
        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        // If $d is an array (as in the POST case), take the URL out of it;
        // if it is not an array but already a link, use the link directly
        curl_setopt($curls[$id], CURLOPT_URL, $url);
        curl_setopt($curls[$id], CURLOPT_HEADER, 0);
        curl_setopt($curls[$id], CURLOPT_RETURNTRANSFER, 1);
        // If there is POST data, that is, the request is sent with POST,
        // set the flag and add the data itself
        if (is_array($d) && !empty($d['post']))
        {
            curl_setopt($curls[$id], CURLOPT_POST, 1);
            curl_setopt($curls[$id], CURLOPT_POSTFIELDS, $d['post']);
        }
        // If extra $options were given, apply them;
        // see the documentation of curl_setopt_array
        if (count($options) > 0) curl_setopt_array($curls[$id], $options);
        // Add the current mechanism to those working in parallel
        curl_multi_add_handle($mh, $curls[$id]);
    }
    // Number of transfers still running.
    $running = null;
    // curl_multi_exec writes the number of unfinished transfers into
    // $running. While there are any, we keep executing the requests.
    // curl_multi_select waits for activity so the loop does not spin the CPU.
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0) curl_multi_select($mh);
    } while ($running > 0);
    // Collect the results from all the mechanisms and detach the mechanisms themselves
    foreach ($curls as $id => $c)
    {
        $result[$id] = curl_multi_getcontent($c);
        curl_multi_remove_handle($mh, $c);
    }
    // Free the memory used by the multi handle
    curl_multi_close($mh);
    // Return the data collected from all the transfers.
    return $result;
}
?>
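The function above returns whatever each transfer produced but does not say which transfers failed. Here is a compact sketch of the same technique with a per-transfer error check through curl_multi_info_read(). To stay runnable without a network it reads two temporary local files over file:// (Unix-like paths assumed); the file contents and the 'first'/'second' keys are made up for the example.

```php
<?php
// Two local files stand in for remote pages (file:// needs no network).
$paths = [];
foreach (['first' => 'page one', 'second' => 'page two'] as $id => $text) {
    $paths[$id] = tempnam(sys_get_temp_dir(), 'curl');
    file_put_contents($paths[$id], $text);
}

$mh = curl_multi_init();
$handles = [];
foreach ($paths as $id => $path) {
    $handles[$id] = curl_init('file://' . $path);
    curl_setopt($handles[$id], CURLOPT_RETURNTRANSFER, 1);
    curl_multi_add_handle($mh, $handles[$id]);
}

// Drive all transfers; wait for activity instead of spinning.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh, 0.5);
    }
} while ($running > 0 && $status === CURLM_OK);

// Each finished transfer leaves one message holding its CURLE_* result code.
$results = [];
while (($info = curl_multi_info_read($mh)) !== false) {
    $id = array_search($info['handle'], $handles, true);   // map the handle back to its key
    $results[$id] = ($info['result'] === CURLE_OK)
        ? curl_multi_getcontent($info['handle'])
        : 'error: ' . curl_strerror($info['result']);
    curl_multi_remove_handle($mh, $info['handle']);
}
curl_multi_close($mh);
array_map('unlink', $paths);   // remove the temporary files

print_r($results);
```

The order of the entries in $results follows the order in which the transfers finished, so look results up by key rather than by position.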