Notes of an Angry Admin: 2015

Setting up an additional virtual Wi-Fi in MikroTik

The task: bring up two Wi-Fi access points on a MikroTik RB951Ui-2HnD. The first (point1) is password-protected. The second (point2) has no password and admits clients by MAC address. point1 runs in the 192.168.88.0/24 network, point2 in 192.168.89.0/24. The gateway and DNS are 192.168.x.1 in each network.

Django query: grouping objects by day with a count

from django.db import connection
from django.db.models import Count

truncate_date = connection.ops.date_trunc_sql('day', 'created')
qs = NewsContent.objects.extra({'day':truncate_date})
news_report = qs.values('day').annotate(Count('pk')).order_by('day')

and the output is:

[{'pk__count': 110, 'day': u'2015-11-21'}, {'pk__count': 83, 'day': u'2015-11-22'}]
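
For checking the query result against a small list, the same aggregation can be sketched in plain Python. This is only an illustration: `count_per_day` and the sample timestamps are made up here and are not part of the Django code above.

```python
from collections import Counter
from datetime import datetime


def count_per_day(timestamps):
    """Group datetimes by calendar day and count them, sorted by day."""
    counts = Counter(ts.date().isoformat() for ts in timestamps)
    return [{"day": day, "pk__count": n} for day, n in sorted(counts.items())]


# Hypothetical sample data, not taken from a real table
created = [
    datetime(2015, 11, 21, 9, 30),
    datetime(2015, 11, 21, 18, 5),
    datetime(2015, 11, 22, 7, 45),
]
report = count_per_day(created)
print(report)
```

The result has the same shape as the queryset output: one dict per day, ordered by day.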

Python, Selenium, PhantomJS: working with Facebook

By default PhantomJS flatly refused to work with Facebook, even though webdriver.Firefox() works fine. Googling plus trial and error showed that the problem is in the SSL handling. So, a working snippet:

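from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Present PhantomJS as a desktop browser
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = \
    ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87")
# Force TLSv1 for the SSL handshake
wd = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ssl-protocol=tlsv1'])
wd.set_window_size(1120, 550)
wd.get("https://www.facebook.com/")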

Reading a script-modified DOM in Selenium

Simple sites can be parsed with curl+lxml or something similar. Dynamic sites that are generated by client-side scripts can be parsed through Selenium, taking the page code from driver.page_source. But there are even more complex sites (Facebook, for example) that keep changing the DOM after the page has loaded and rendered, and page_source no longer helps. To solve this, run a script:

from io import StringIO
from lxml import etree
outerhtml = wd.execute_script("return document.documentElement.outerHTML")
tree = etree.parse(StringIO(outerhtml), etree.HTMLParser())
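
Once the DOM is a plain string, any HTML parser can take it from there. A minimal sketch with the standard library only, for machines where lxml is not installed: the `LinkCollector` class and the sample markup are invented for illustration, and `outerhtml` here is a hard-coded stand-in for the value returned by `execute_script`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Stand-in for the string returned by wd.execute_script(...)
outerhtml = (
    '<html><body><a href="/one">1</a>'
    '<div><a href="/two">2</a></div><a>no href</a></body></html>'
)
collector = LinkCollector()
collector.feed(outerhtml)
print(collector.links)
```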

Batch renaming file extensions in bash

ls *.mp4* | xargs -I {} sh -c 'mv "$1" "${1%.*}.mp4"' - {}
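
The same rename can be sketched in Python, which sidesteps shell quoting altogether. This is an illustration, not a drop-in replacement: `normalize_mp4` is a made-up name, and the sketch additionally skips files that already end in `.mp4` or whose target name is taken.

```python
import tempfile
from pathlib import Path


def normalize_mp4(directory):
    """Give files matching *.mp4* (e.g. clip.mp4_) a clean .mp4 extension."""
    renamed = []
    for path in sorted(Path(directory).glob("*.mp4*")):
        # Same rule as ${1%.*}.mp4: drop the last suffix, then append .mp4
        target = path.with_name(path.stem + ".mp4")
        if target != path and not target.exists():
            path.rename(target)
            renamed.append(target.name)
    return renamed


# Try it on throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("clip.mp4_", "talk.mp4~", "ok.mp4"):
        (Path(tmp) / name).touch()
    renamed = normalize_mp4(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
print(renamed)
```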

A weblancer.net scraper in python+lxml

A colleague asked me to help find a bug in his script. It did its parsing with BS (BeautifulSoup), which I hate. I rewrote it with lxml.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
from lxml import etree
from lxml import html as lxml_html
import urllib2
from io import StringIO

_base_url = "https://www.weblancer.net/projects/?page={}"
def main():
    parser = lxml_html.HTMLParser(encoding='windows-1251')
    page = urllib2.urlopen(_base_url.format(1))
    data = page.read()
    tree = etree.parse(StringIO(data.decode("windows-1251", errors="ignore")), parser)

    last_page_url = tree.xpath("//ul[@class='pagination']/li[last()]/a/@href")[0]
    last_page = re.compile(r"(\d+)").search(last_page_url).group(1)

    for i in range(1, int(last_page)+1):  # pages are numbered from 1, as in the first request
        page = urllib2.urlopen(_base_url.format(i))
        data = page.read()
        tree = etree.parse(StringIO(data.decode("windows-1251", errors="ignore")), parser)
        search_results = tree.xpath("//div[@class='container-fluid cols_table show_visited']/div[@class='row']")
        for sr in search_results:
            title = sr.xpath("./div[@class='col-sm-7']/a[@class='title']/text()")[0].strip()
            categories = sr.xpath("./div[@class='col-sm-7']/div[@class='text-muted']/a[@class='text-muted']/text()")[0].strip()
            try:
                price = sr.xpath("./div[@class='col-sm-2 amount title']/text()")[0].strip()
            except IndexError:
                price = ''

            try:
                application = sr.xpath("./div[@class='col-sm-3 text-right text-nowrap hidden-xs']/text()")[0].strip()
            except IndexError:
                application = ''
            print title, categories, application, price

if __name__ == "__main__":
    main()
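
The script above takes the first run of digits in the pagination link as the last page number. A slightly stricter variant, sketched here in Python 3, anchors the pattern to the `page=` parameter so that digits elsewhere in the URL are ignored. The helper names are invented, and the href format `/projects/?page=N` is an assumption based on `_base_url`.

```python
import re

_PAGE_RE = re.compile(r"[?&]page=(\d+)")


def last_page_from_href(href):
    """Return the page number from a pagination link, or 1 if there is none."""
    match = _PAGE_RE.search(href)
    return int(match.group(1)) if match else 1


def page_urls(base_url, last_page):
    """Build the list of page URLs, numbered from 1 to last_page inclusive."""
    return [base_url.format(n) for n in range(1, last_page + 1)]


urls = page_urls("https://www.weblancer.net/projects/?page={}",
                 last_page_from_href("/projects/?page=3"))
print(urls)
```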

Removing duplicate model objects from a list (Django, Python)

objects = [obj1, obj2, obj3,]

uniq = []
seen = set()
for obj in objects:
    if obj.pk not in seen:
        uniq.append(obj)
        seen.add(obj.pk)
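
The loop generalizes to any key, not only `pk`. A small sketch: `unique_by` is a made-up helper, and the `Row` namedtuple is a stand-in for a Django model instance.

```python
from collections import namedtuple


def unique_by(items, key):
    """Keep the first item for each key value, preserving the original order."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result


Row = namedtuple("Row", "pk title")  # stand-in for a model instance
rows = [Row(1, "a"), Row(2, "b"), Row(1, "a again"), Row(3, "c")]
uniq = unique_by(rows, key=lambda r: r.pk)
print([r.pk for r in uniq])
```

Unlike `set(objects)`, this keeps the order of first appearance and does not depend on how the objects hash.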

xfce and scrollbar width

By default the scrollbar is very narrow, maybe 5 pixels or so, which is not very convenient.

The fix is as follows:

Create or edit the file "~/.gtkrc-2.0"

Add:

style "myscrollbar"
{
     GtkScrollbar::slider-width=20
}
class "GtkScrollbar" style "myscrollbar"

How to bring the network manager icon back to the xfce panel

Open /etc/xdg/autostart/nm-applet.desktop:

# sudo su
# nano /etc/xdg/autostart/nm-applet.desktop

Find the line with Exec.

Change the entry:

nm-applet
to
dbus-launch nm-applet
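
If the same edit has to be repeated on several machines, it can be scripted. A minimal sketch that works on the file contents as text: `wrap_exec_with_dbus` is an invented name, and the sample entry below is a shortened, hypothetical desktop file, not the real one.

```python
def wrap_exec_with_dbus(desktop_text):
    """Prefix the Exec command with dbus-launch unless it is already there."""
    lines = []
    for line in desktop_text.splitlines():
        if line.startswith("Exec=") and "dbus-launch" not in line:
            command = line[len("Exec="):]
            line = "Exec=dbus-launch " + command
        lines.append(line)
    return "\n".join(lines) + "\n"


# Hypothetical, shortened desktop entry
sample = "[Desktop Entry]\nName=Network\nExec=nm-applet\nTerminal=false\n"
patched = wrap_exec_with_dbus(sample)
print(patched)
```

Running the function twice leaves the text unchanged, so it is safe to apply more than once.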

Upgrading MySQL 5.5 to MySQL 5.6 on Ubuntu 14.04 LTS

sudo apt-get remove mysql-server
sudo apt-get autoremove
sudo apt-get install mysql-client-5.6 mysql-client-core-5.6
sudo apt-get install mysql-server-5.6

SSH connection glitches

Dropped connections, or sometimes a complete inability to connect to certain hosts, are cured by fixing the client config in /etc/ssh/ssh_config

Host *
SendEnv LANG LC_*
HashKnownHosts yes
GSSAPIAuthentication yes
GSSAPIDelegateCredentials no
Ciphers aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc
HostKeyAlgorithms ssh-rsa,ssh-dss
MACs hmac-md5,hmac-sha1,hmac-ripemd160

=IMPORTXML("https://api.privatbank.ua/p24api/pubinfo?exchange&coursid=5"; 
"//exchangerates/row[3]/exchangerate/@sale")

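The same lookup can be sketched in Python with the standard library. The sample XML below is a guess at the shape implied by the XPath (an `exchangerates` root, `row` children, an `exchangerate` element with a `sale` attribute); the `ccy` attribute and all values are placeholders, not real rates.

```python
import xml.etree.ElementTree as ET

# Placeholder document; only the element names and @sale come from the XPath
SAMPLE = """<exchangerates>
  <row><exchangerate ccy="AAA" sale="1.1"/></row>
  <row><exchangerate ccy="BBB" sale="2.2"/></row>
  <row><exchangerate ccy="CCC" sale="3.3"/></row>
</exchangerates>"""


def third_row_sale(xml_text):
    """Return the sale attribute of the third row, or None if it is missing."""
    root = ET.fromstring(xml_text)
    node = root.find("row[3]/exchangerate")  # positions are 1-based, as in XPath
    return None if node is None else node.get("sale")


print(third_row_sale(SAMPLE))
```
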
pycurl

Before installing pycurl, install: sudo apt-get install libcurl4-openssl-dev

and one more parser

#!/usr/bin/env python

import re
import helpers
import hashlib
from lxml import etree
from lxml import html as lxml_html
from urlparse import urlparse

_base_url = "http://efccnigeria.org/efcc/index.php/wanted?limitstart={}"
_base_site_name = urlparse(_base_url)

def cleanhtml(raw_html):
    # Strip anything that looks like an HTML tag
    cleanr = re.compile(r'<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

def parse_links(div_blog, offset):
    # offset is the limitstart value of the page; renamed from iter to avoid shadowing the builtin
    if offset == 0:
        div_blog = div_blog[:-2]
    else:
        div_blog = div_blog[:-1]

    links = []
    for db in div_blog:
        left_div = db.getchildren()[0].\
                      getchildren()[0].\
                      getchildren()[0].\
                      getchildren()[0].\
                      getchildren()[0].\
                      getchildren()[0].\
                      attrib.get('href')
        right_div = db.getchildren()[1].\
                      getchildren()[0].\
                      getchildren()[0].\
                      getchildren()[0].\
                      getchildren()[0].\
                      getchildren()[0].\
                      attrib.get('href')
        links.append("%s://%s%s" % (_base_site_name[0], _base_site_name[1], left_div))
        links.append("%s://%s%s" % (_base_site_name[0], _base_site_name[1], right_div))
    return links

def build_document(content):
    name = content[0].getchildren()[0].text.lstrip().rstrip()
    description = etree.tostring(content[2].getchildren()[0], encoding="utf-8")
    description_clean = cleanhtml(description)
    description_clean = re.sub('\&\#13\;', '', description_clean)
    description_clean = re.sub('\xc2\xa0', '', description_clean)
    description_clean = "\n".join([ll.rstrip() for ll in description_clean.splitlines() if ll.strip()]) # strip blank lines
    description_clean = description_clean.split("\n")
    index = description_clean.index(' <!--')
    description_clean = description_clean[:index]
    description_clean = ' '.join(description_clean)

    entity = {
        "_meta": {
            "id": hashlib.sha224((re.sub("[^a-zA-Z0-9]", "", name + description_clean.decode("utf8", errors="ignore")))).hexdigest(),
            "entity_type": "person"
        },
        "name": name,
        "types": ["pep"],
        "fields": [
            {"name": "Description", "value": description_clean.decode("utf8", errors="ignore")}
        ]
    }
    helpers.emit(entity)

def main():
    parser = lxml_html.HTMLParser(encoding='utf-8')

    tree = etree.parse(_base_url.format(0), parser)
    last_page_url = tree.findall("//li[@class='pagination-end']/a")[0].attrib.get('href')
    rx_sequence=re.compile(r"start=([0-9]+)")
    last_page = rx_sequence.search(last_page_url).group(1)

    for i in range(0, int(last_page)+1, 20):
        tree = etree.parse(_base_url.format(i), parser)
        div_blog = tree.findall("//div[@class='blog']/div")
        links = parse_links(div_blog, i)
        for link in links:
            tree = etree.parse(link, parser)
            xpath = "/html/body/div[2]/div[2]/div/div[5]/div/div/div[2]/div[2]/div/div/div[1]"
            person_info = tree.xpath(xpath)[0].getchildren()
            build_document(person_info)

if __name__ == "__main__":
    main()
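
The absolute links are assembled above with `"%s://%s%s"` formatting. In Python 3 the same job can be left to `urllib.parse.urljoin`, which also copes with hrefs that are already absolute. A sketch only: `absolute_links` is an invented helper and the hrefs below are hypothetical, not real pages of the site.

```python
from urllib.parse import urljoin

BASE = "http://efccnigeria.org/efcc/index.php/wanted?limitstart=0"


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, skipping missing (None or empty) ones."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Hypothetical hrefs: a site-absolute path, a missing one, a full URL
links = absolute_links(
    BASE,
    ["/efcc/index.php/wanted/example-1", None, "http://example.org/x"],
)
print(links)
```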

Parser

#!/usr/bin/env python

import re
import helpers
import hashlib
from lxml import etree
from lxml import html as lxml_html

_base_url = "http://guernseyregistry.com/article/4036/Disqualified-Directors"

def check_children(element):
    if element.getchildren():
        return element.getchildren()[0].text
    return element.text

def build_document(member):
    date_of_disqualification, \
    applicant_for_disqualification, \
    name_of_disqualified_director, \
    period_of_disqualification, \
    end_of_disqualification_period =  [check_children(member[i]) for i in range(0, 5)]

    entity = {
        "_meta": {
            "id": hashlib.sha224((re.sub("[^a-zA-Z0-9]", "", name_of_disqualified_director + date_of_disqualification))).hexdigest(),
            "entity_type": "person"
        },
        "name": name_of_disqualified_director,
        "types": ["pep"],
        "fields": [
            {"name": "Date of disqualification", "value": date_of_disqualification},
            {"name": "Applicant for disqualification", "value": applicant_for_disqualification},
            {"name": "Period of disqualification", "value": period_of_disqualification},
            {"name": "End of disqualification period", "value": end_of_disqualification_period}
        ]
    }
    helpers.emit(entity)

def main():
    parser = lxml_html.HTMLParser(encoding='utf-8')
    tree = etree.parse(_base_url, parser)
    table = tree.findall("//table[@summary='disqual directors']/tbody/tr")
    for tr in table[1:]:
        build_document(tr.getchildren())

if __name__ == "__main__":
    main()
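
The entity id above is a SHA-224 hash of the name and date with everything except ASCII letters and digits removed. The sketch below shows the same idea in Python 3, where the hash input must be bytes. `make_id` is a made-up name, the sample person is invented, and treating a missing cell (`None`) as an empty string is an assumption; the original would raise a TypeError on `None`.

```python
import hashlib
import re


def make_id(*parts):
    """Hash the alphanumeric characters of the given parts into a stable id."""
    joined = "".join(p or "" for p in parts)  # assumption: None counts as empty
    normalized = re.sub(r"[^a-zA-Z0-9]", "", joined)
    return hashlib.sha224(normalized.encode("utf-8")).hexdigest()


# Invented sample values; punctuation and spacing do not affect the id
a = make_id("John Doe", "01/02/2015")
b = make_id("John  Doe", "01-02-2015")
print(a == b)
```

Because punctuation and whitespace are stripped before hashing, small formatting changes on the source page do not produce a new id for the same person.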
