
jlddk


    

Overview

This package contains a collection of robots written in Python.  It can be installed from the PyPI package with easy_install or pip:

[sudo] easy_install jlddk

Related projects

  • jldaws : collection of Amazon AWS scripts
  • jldzeromq : collection of ZeroMQ scripts

Robots

jldwebscraper

This robot scrapes a specified web page for anchor links.  All links are extracted and written to stdout.
The output format on stdout is either plain text (one link per line) or JSON, e.g.:

{   "extract_status": "ok",
    "hrefs": [...],
    "source_url":"source page url", 
    "etag": "etag header of source page", 
    "last-modified": "field from page header",
    "http_status": 200 
}

jldwebscraper [-p polling_interval] [-j] [-e] -su source_page_url [-cp check_path]

OR jldwebscraper @config.txt

where '-j' enables JSON output format on stdout
where '-e' propagates errors to stdout
where '-cp' specifies a filesystem path - if the path exists, the operation proceeds or else it doesn't.
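
A downstream consumer of the JSON output could look roughly like the sketch below. It assumes each JSON object arrives as a single line of text, which the description above does not confirm; the function name extract_links and the sample URLs are illustrative only.

```python
import json


def extract_links(json_line):
    """Return the hrefs from one JSON message, or [] if extraction failed."""
    msg = json.loads(json_line)
    if msg.get("extract_status") != "ok":
        return []
    return list(msg.get("hrefs", []))


# Sample message shaped like the documented output (values are made up).
sample = (
    '{"extract_status": "ok", '
    '"hrefs": ["http://example.com/a.pdf", "http://example.com/b.html"], '
    '"source_url": "http://example.com/", "http_status": 200}'
)

# Keep only the links that point at PDF files.
pdf_links = [href for href in extract_links(sample) if href.endswith(".pdf")]
print(pdf_links)
```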



jldfilter

Filters a stdin stream through a specified Python module, outputs on stdout.  Logging is produced on stderr.

jldfilter [-m module] [-f function] [-a args] [-lc path] [-ll level]

OR jldfilter @config.txt

Where '-m' is a python module name with a callable "function" specified through '-f'.
Where '-a' specifies arguments to be passed to function.

Example:  

    jldfilter -m string -f upper

    Will transform the stdin input to all caps and write the result to stdout.

Any module.function accessible in the system through the default Python interpreter can be used.  If the module.function returns None, then nothing is sent to stdout.
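
A custom filter function might look like the sketch below. It assumes the function is called once per input line with that line as a string, which is suggested by the example above but not stated; the module name mylinefilter and the function name keep_errors are hypothetical.

```python
# mylinefilter.py (hypothetical module name, must be importable by Python)


def keep_errors(line):
    """Pass through lines mentioning 'error' in upper case, drop the rest.

    Returning None relies on the documented behaviour that nothing is
    sent to stdout in that case.
    """
    if "error" in line.lower():
        return line.upper()
    return None
```

With such a module on the Python path, the invocation would presumably be: jldfilter -m mylinefilter -f keep_errors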


Common Options

-ll
Log Level: [ debug, info, warning, error]
-lc
Log configuration file path
@path
Path to a configuration file containing one option per line
-p
Integer Polling Interval (seconds)
-cp
"check path" : gate operation based on the existence of the filesystem path

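The '@path' convention resembles the standard argparse feature fromfile_prefix_chars. Whether jlddk relies on it is an assumption; the sketch below only shows how that stock feature behaves, with stock argparse expecting each option and each value on its own line. The option names mirror the list above, and the default values are illustrative.

```python
import argparse
import os
import tempfile


def build_parser():
    """Build a parser that also reads arguments from '@file' references."""
    parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
    parser.add_argument("-p", dest="polling_interval", type=int, default=30)
    parser.add_argument("-ll", dest="log_level", default="info",
                        choices=["debug", "info", "warning", "error"])
    parser.add_argument("-cp", dest="check_path", default=None)
    return parser


# Write a throwaway configuration file, one argument per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("-p\n10\n-ll\ndebug\n")
    config_path = fh.name

args = build_parser().parse_args(["@" + config_path])
os.remove(config_path)
print(args.polling_interval, args.log_level)
```
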




jldwebdl


Downloads files from the web.  The source_path directory is inspected at most once every polling_interval seconds for text files, each containing a URL to download into dest_path.
Once a file is downloaded, the source_path text file pointing to it is deleted.  When a file has finished downloading and has been written to dest_path, a JSON string describing it is sent on stdout.  Additionally, the script implements a "stdin --> stdout pass-through" for easier chaining through pipes.

{
         "dest_filename": dest_filename
         ,"src_filename": src_file
         ,"url": url
         ,"http_code": http_code
         ,"headers": headers
}


Note that an atomic write is used to save the file: a temporary file path is used to write to disk and once it's done, a rename operation (guaranteed to be atomic) is performed.
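
The write-then-rename pattern described above can be sketched as follows. This is a generic illustration, not the code used by jldwebdl; the function name atomic_write is hypothetical, and os.replace is used here as the rename step.

```python
import os
import tempfile


def atomic_write(dest_path, data):
    """Write bytes to dest_path so readers never observe a partial file."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temporary file lives in the destination directory so that the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```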

The source_path directory is not traversed for nested directories. Only files in the root of the directory are processed.

The filename used to write to the destination path will be the base name of the URL if a period '.' is present or else a unique name will be generated. The association between the URL and the created file will be provided in the output JSON object on stdout.
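
One possible reading of this naming rule is sketched below. It assumes the period is looked for in the base name of the URL path (not in the host name) and uses a UUID as the generated unique name; both choices are assumptions, and dest_filename is a hypothetical helper.

```python
import posixpath
import uuid
from urllib.parse import urlparse


def dest_filename(url):
    """Pick a destination file name for a URL."""
    base = posixpath.basename(urlparse(url).path)
    if "." in base:
        # The base name looks like a file name, so keep it.
        return base
    # Otherwise fall back to a generated unique name.
    return uuid.uuid4().hex


print(dest_filename("http://example.com/files/report.pdf"))
```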

The '-dfe' (delete on fetch error) option deletes the source_path file when its URL can't be fetched. Use with caution, as a URL may be only temporarily unavailable.

A maximum of batch_size files will be processed per polling interval.

jldwebdl -sp source_path -dp dest_path [-p polling_interval] [-bs batch_size] [-dfe] [-cp check_path]

where '-cp' specifies a filesystem path - if the path exists, the operation proceeds or else it doesn't.

NOTE: Make sure that the source_path files can be deleted by the jldwebdl process or else the same files might get downloaded repeatedly.



jldjsoncat


Streams file(s) by encapsulating them line-by-line in JSON objects, outputs to stdout.

jldjsoncat -sp source_path [-mp move_path] [-ds] [-bs batch_size] [-cp check_path]

Where '-sp' specifies the source path.
Where '-mp' specifies the move path i.e. when a file is finished streaming, it is moved to the specified path.
Where '-ds' means "delete source when finished".
Where '-bs' specifies the maximum number of files to process per polling interval.
Where '-cp' specifies a filesystem path - if the path exists, the operation proceeds or else it doesn't.

Note that '-mp' and '-ds' are mutually exclusive.

When a file begins/ends, the following JSON string is sent on stdout:

{ "sp": "$source_file_path", "code": "[begin|end]" }

For each line of the file, the following JSON string is sent on stdout:

{ "code":"line", "line": "$line_contents" }
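
A consumer can regroup the lines per source file from this message stream. The sketch below assumes one JSON object per line of input and that files are streamed one after another without interleaving; regroup is a hypothetical helper.

```python
import json


def regroup(json_lines):
    """Rebuild {source_path: [lines]} from begin/line/end messages."""
    files = {}
    current = None
    for raw in json_lines:
        msg = json.loads(raw)
        code = msg.get("code")
        if code == "begin":
            current = msg["sp"]
            files[current] = []
        elif code == "line" and current is not None:
            files[current].append(msg["line"])
        elif code == "end":
            current = None
    return files


# Sample stream shaped like the documented messages (contents made up).
stream = [
    '{"sp": "/tmp/in/a.txt", "code": "begin"}',
    '{"code": "line", "line": "first"}',
    '{"code": "line", "line": "second"}',
    '{"sp": "/tmp/in/a.txt", "code": "end"}',
]
print(regroup(stream))
```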



jldcomp

Compares two directories (primary path against compare path).

jldcomp -pp primary_path -zp compare_path [-p polling_interval] [-sf status_filename] [-wsf] [-cp path/to/check] [-jbn] [-tp topic_name] [-exts ext ext ...]

pp : primary path
zp : compare path
sf : status file name (in primary path)
wsf: wait for 'ok' status in status file
cp : check if this path exists before proceeding
jbn: just compare the basenames of the files
tp : topic name for the output JSON object
exts: just compare files of the following extension(s)

If the option 'wsf' is used, the robot will wait for either:
  • 'ok' string on the first line of the status file
  • 'code':'ok' in the JSON object contained in the status file
The difference between the primary and compare paths will be output on stdout as a JSON object.  E.g.

{"pp-cp": ["file1", "file4"], "pp": "/tmp/jlddk_test", "cp": "/tmp/jlddk_test2", "common": ["file3", "file2"], "cp-pp": ["file10"] [, "topic": topic_name]}

In the example above, the primary path contained the files 'file1' and 'file4', which were missing from the compare path, while 'file10' was found only in the compare path.  The files common to both paths were 'file3' and 'file2'.
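
The three lists in the output correspond to plain set operations on the two directory listings. A minimal sketch, assuming the listings have already been obtained (for instance with os.listdir); compare_listings is a hypothetical helper and the sorting is only there to make the output stable.

```python
def compare_listings(primary_names, compare_names):
    """Compute the differences and intersection of two file listings."""
    primary, compare = set(primary_names), set(compare_names)
    return {
        "pp-cp": sorted(primary - compare),   # only in the primary path
        "cp-pp": sorted(compare - primary),   # only in the compare path
        "common": sorted(primary & compare),  # present in both paths
    }


# Same file names as in the example output above.
result = compare_listings(
    ["file1", "file2", "file3", "file4"],
    ["file2", "file3", "file10"],
)
print(result)
```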



jldinotify

Notifies of changes to a path; outputs JSON on stdout.

jldinotify -sp path_source  OR  jldinotify @configfile.txt

Example output from a 'touch' on a file:

{"event_name": "IN_CREATE", "path_name": "/tmp/tests3/file4", "path_source": "/tmp/tests3", "path_base": "file4", "path_mask": 256}
{"event_name": "IN_OPEN", "path_name": "/tmp/tests3/file4", "path_source": "/tmp/tests3", "path_base": "file4", "path_mask": 32}
{"event_name": "IN_ATTRIB", "path_name": "/tmp/tests3/file4", "path_source": "/tmp/tests3", "path_base": "file4", "path_mask": 4}
{"event_name": "IN_CLOSE_WRITE", "path_name": "/tmp/tests3/file4", "path_source": "/tmp/tests3", "path_base": "file4", "path_mask": 8}
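
Since a single 'touch' produces several events, a consumer will often react only to one of them. The sketch below keeps the IN_CLOSE_WRITE events, assuming one JSON object per input line; completed_writes is a hypothetical helper.

```python
import json


def completed_writes(json_lines):
    """Yield path_name for every IN_CLOSE_WRITE event in the stream."""
    for raw in json_lines:
        event = json.loads(raw)
        if event.get("event_name") == "IN_CLOSE_WRITE":
            yield event["path_name"]


# Two of the events from the example output above.
events = [
    '{"event_name": "IN_CREATE", "path_name": "/tmp/tests3/file4", '
    '"path_source": "/tmp/tests3", "path_base": "file4", "path_mask": 256}',
    '{"event_name": "IN_CLOSE_WRITE", "path_name": "/tmp/tests3/file4", '
    '"path_source": "/tmp/tests3", "path_base": "file4", "path_mask": 8}',
]
print(list(completed_writes(events)))
```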



jldfetcher

Fetches web pages from specifications received on stdin.

jldfetcher [-dp dest_path]

If dest_path is specified, then the resulting filename will be: dest_path/basename(url).
If dest_path is not specified, then the second string on stdin must be the destination path.

Hence, the script either waits for (on stdin):

url \n

OR

url  dest_path  \n
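
The two input forms can be parsed along the lines of the sketch below. It assumes the fields are separated by whitespace and that a second field is ignored when a default destination directory is given; neither detail is confirmed above, and parse_request is a hypothetical helper.

```python
import os
import posixpath
from urllib.parse import urlparse


def parse_request(line, default_dest_dir=None):
    """Split one stdin line into (url, destination file path)."""
    parts = line.split()
    if not parts:
        raise ValueError("empty request line")
    url = parts[0]
    if default_dest_dir is not None:
        # dest_path/basename(url), as described above.
        base = posixpath.basename(urlparse(url).path)
        return url, os.path.join(default_dest_dir, base)
    if len(parts) < 2:
        raise ValueError("a destination path is required on stdin")
    return url, parts[1]


print(parse_request("http://example.com/docs/page.html /tmp/out/page.html\n"))
```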



jldfilelist

Lists files from a filesystem path, filters list through include or exclude list of extensions.

jldfilelist -sp source_path [-ee ext1 ... ] [-ie ext1 ...] [-p polling_interval]

where 'ee' is a list of file extensions to exclude from output list
where 'ie' is a list of file extensions to include in the output list
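
The include/exclude filtering can be sketched as below. It assumes extensions are given without the leading period and compared case-insensitively, which the description does not specify; filter_by_extension is a hypothetical helper.

```python
import os


def filter_by_extension(names, include=None, exclude=None):
    """Keep the names allowed by the include and exclude extension lists."""
    kept = []
    for name in names:
        ext = os.path.splitext(name)[1].lstrip(".").lower()
        if include and ext not in include:
            continue
        if exclude and ext in exclude:
            continue
        kept.append(name)
    return kept


names = ["a.txt", "b.json", "c.TXT", "d"]
print(filter_by_extension(names, include=["txt"]))
print(filter_by_extension(names, exclude=["txt"]))
```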



jldpclean

Periodically inspects which process(es) should be killed given specifications.

jldpclean [-p polling_interval] [-u username] [-pr prefix] [-ppid ppid] [-f]

One can specify processes to target based on: username, parent process id (ppid) and command prefix.
The prefix parameter applies to all components of the command line which started any given process.  Example follows:

python /some/path/$prefix_rest_of_command
$prefix /some/other/parameters

Use the '-f' option to actually perform the kill operation; without it, the script only reports what it would have done.
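
The prefix rule can be read as "some component of the command line has a base name starting with the prefix". That reading is inferred from the two example command lines above and is an assumption; matches_prefix is a hypothetical helper that takes the command line as a list of strings.

```python
import os


def matches_prefix(cmdline, prefix):
    """Return True if any command-line component's base name starts with prefix."""
    return any(os.path.basename(part).startswith(prefix) for part in cmdline)


print(matches_prefix(["python", "/some/path/jldwebdl_worker"], "jld"))
print(matches_prefix(["python", "/some/path/other_script"], "jld"))
```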


jldostr

Periodically sends a string on stdout.

Implements the "stdin --> stdout pass-through" functionality.

jldostr -ostr 'some_string' [-p polling_interval]


jldstatsubdirs

Periodically outputs status information about the sub-directories of a path on stdout, in JSON format.

Implements the "stdin --> stdout pass-through" functionality.

jldstatsubdirs -tp topic -sp path [-p polling_interval] [-ll loglevel] [-a]

The option '-a' can be used to periodically send the update even when there is no change.

Example output on stdout:

{"topic": "path_status", "path": "/tmp/test_dst/dir1", "mtime": 1330377240.5986614}
{"topic": "path_status", "path": "/tmp/test_dst/dir2", "mtime": 1330376973.9186623}
{"topic": "path_status", "path": "/tmp/test_dst/dir3", "mtime": 1330377282.6386614}
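
Messages of this shape can be produced from the modification time of each sub-directory. The sketch below is a generic illustration using os.stat, not the script's actual code; subdir_status is a hypothetical helper and the temporary directory only serves the demonstration.

```python
import json
import os
import tempfile


def subdir_status(path, topic):
    """Return one status message per immediate sub-directory of path."""
    messages = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            messages.append(
                {"topic": topic, "path": full, "mtime": os.stat(full).st_mtime}
            )
    return messages


# Demonstration on a throwaway directory holding one sub-directory and one file.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "dir1"))
open(os.path.join(root, "notadir.txt"), "w").close()
for msg in subdir_status(root, "path_status"):
    print(json.dumps(msg))
```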


jlddebouncer

Debounces {key:value} pairs found through JSON encoded stdin stream.  Outputs on stdout changes in the form:

{ "topic": $otopic, "key": $key, "value":$value }

Command-Line:  

jlddebouncer [-ukv] [-hin hysteresis] [-to timeout] [-p polling_interval] [-ll loglevel] -itp input_topic -otp output_topic -key field_name_for_key -value field_name_for_value

ukv: use the key & value field names for the output message.
hin: hysteresis for {key:value} pairs, in polling_interval cycles.
to: timeout for declaring {key:value} stale (polling_interval cycles).
itp: topic field value on the input side.
otp: topic name to use for generated messages on stdout.
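
The hysteresis idea can be illustrated with the sketch below, in which a new value for a key is reported only after it has been observed for a given number of consecutive cycles. This interpretation is an assumption, the timeout handling is left out, and the Debouncer class is hypothetical rather than the script's implementation.

```python
class Debouncer:
    """Report a key's value only once it has been stable for N cycles."""

    def __init__(self, hysteresis):
        self.hysteresis = hysteresis
        self.stable = {}   # key -> last reported value
        self.pending = {}  # key -> (candidate value, cycles observed)

    def observe(self, key, value):
        """Feed one observation; return the value when a change is confirmed."""
        if key in self.stable and self.stable[key] == value:
            # Back to the known value: forget any pending candidate.
            self.pending.pop(key, None)
            return None
        candidate, count = self.pending.get(key, (value, 0))
        if candidate != value:
            # A different candidate restarts the count.
            candidate, count = value, 0
        count += 1
        if count >= self.hysteresis:
            self.stable[key] = value
            self.pending.pop(key, None)
            return value
        self.pending[key] = (candidate, count)
        return None


debouncer = Debouncer(hysteresis=2)
results = [debouncer.observe("door", state)
           for state in ["open", "open", "open", "closed", "open"]]
print(results)
```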



jldrun


Executes a local script periodically.

jldrun -m module_name -f function_name [-a positional arguments] [-p polling_interval] [-ll loglevel]




Note on the 'check path' option

Certain scripts support the optional '-cp' option.  This option is used to gate operations of the script.  If the filesystem path exists, then the operation is allowed to proceed normally or else operation is skipped for this interval.  The path existence is checked on each polling interval.
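
The gating behaviour described above amounts to a check at the top of each polling cycle. A minimal sketch, with a hypothetical poll function and a max_cycles parameter added only so the loop can terminate in a demonstration:

```python
import os
import time


def poll(operation, polling_interval, check_path=None, max_cycles=None):
    """Run operation once per cycle, skipping cycles where check_path is absent.

    Returns the number of cycles in which the operation actually ran.
    """
    cycle = 0
    ran = 0
    while max_cycles is None or cycle < max_cycles:
        # The path is re-checked on every cycle, as described above.
        if check_path is None or os.path.exists(check_path):
            operation()
            ran += 1
        cycle += 1
        time.sleep(polling_interval)
    return ran


# With a missing check path, every cycle is skipped.
print(poll(lambda: None, 0, check_path="/definitely/not/here", max_cycles=3))
```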

This option is useful in conjunction with a manager such as jldleader from the jldaws package.




In the works


