This series explores real-life examples of advanced Yara use, seeking to generalize each of them into an abstract problem. It also explores general approaches to solving such problems, while looking at the pros and cons of possible solutions. The material provided here should be educational for those who are new to Yara, but should also suit very experienced Yara users as it uncovers some fundamental issues.
In my third post, we looked at using Yara’s native hexadecimal pattern definition features to create fast rules with fewer false positives and no alarming nested loops.
Yara was originally designed for quick and easy malware recognition. In most cases it is as simple as detecting a unique string or a binary pattern (or a combination of them) inside a file; add some header checks or a file size limit and you are good to go.
But Yara is also a powerful general-purpose search tool. Based on the Aho-Corasick algorithm, it can scan for thousands of patterns within one pass over a file, which makes it so effective when scanning large volumes of data with multiple rules during digital forensics or threat hunting.
On the other hand, Yara’s condition checking mechanism wasn’t really designed to deal with an enormous number of matches where a special computational check, such as distance measurement, is applied to each or all combinations. While a stack-based Yara virtual machine (VM) seems Turing-complete, a Yara condition check isn’t. This means you have limited control over the process of condition checks.
However, if you run Yara in your own controlled environment, you may combine it with other tools and break free of the VM restraints, having full control over the condition validation process. Who knows, maybe this is the red pill you are looking for?
Let’s go back to the start and find a use for the very first rule we explored in this series:
rule three_body_problem {
meta:
description = "Simple rule to detect 3 patterns inside a file."
strings:
$x = { 11 11 11 11 }
$y = { 22 22 22 22 }
$z = { 33 33 33 33 }
condition:
all of them
}
Its simplicity and speed of scanning are tempting, but let’s combine it with additional external condition checks. All you need is to pass Yara scan results to another tool, be it another binary, a Python script, or maybe an AWK text processor program.
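As a minimal sketch of the "pass the results to another tool" idea, the following Python snippet groups the text printed by yara -s into per-file lists of offsets and pattern names. It assumes the usual output layout (a verdict line with the rule and file name, followed by match lines such as 0x10:$x: 11 11 11 11); the function name parse_yara_output is hypothetical and only serves this illustration.

```python
import re

# Assumed layout of a match line printed by `yara -s`: "0x1f:$x: 11 11 11 11"
MATCH_LINE = re.compile(r"^(0x[0-9a-fA-F]+):(\$\w*):")

def parse_yara_output(lines):
    """Group `yara -s` output into {verdict line: [(offset, pattern name), ...]}."""
    results = {}
    current = None  # the verdict line the following match lines belong to
    for line in lines:
        line = line.rstrip("\n")
        m = MATCH_LINE.match(line)
        if m:
            if current is not None:
                # int() with base 16 accepts the "0x" prefix
                results[current].append((int(m.group(1), 16), m.group(2)))
        elif line:
            current = line
            results.setdefault(current, [])
    return results

sample = [
    "three_body_problem ./samples/sample_good1.dat\n",
    "0x0:$x: 11 11 11 11\n",
    "0x4:$y: 22 22 22 22\n",
    "0x8:$z: 33 33 33 33\n",
]
print(parse_yara_output(sample))
```

In practice the lines would come from a pipe, for example by passing sys.stdin to the function instead of the hard-coded sample.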
Let’s use AWK, a never-ageing classic, for our task. Note that the script relies on GNU AWK (gawk) extensions such as strtonum(), asort() and asorti():
BEGIN { # runs before any input is read: set the parameters
    FS=":"; # split match lines such as "0x10:$x: 11 11 11 11" into offset, pattern name and data
    M=16;   # max distance between the first and the last pattern
    N=3;    # the minimal count of patterns for the condition
    f="";   # the verdict line with the file name (none read yet)
}
/^0x[0-9a-f]*:\$/ { a[strtonum($1)]=$2; } # populate yara matches ("a" array: offset -> pattern name)
function condition() { # the main condition check, uses "o" array (sliding window of matches)
    L = length(o);
    D = o[L]-o[1];
    return L==N && D>=0 && D<=M;
}
function process() {
    n = asorti(a,b,"@ind_num_asc"); # sort the a array's indices numerically and put them in the b array
    for(k=1; k<=n; k++) { # walk the offsets in ascending order ("for(k in b)" has no guaranteed order)
        c[a[b[k]]]=b[k]; # save the offset in the "c" dictionary indexed by pattern names
        asort(c,o,"@val_num_asc"); # sort the offsets of the patterns seen so far
        if(condition()) {
            if(f) { print f; f=""; } # print the filename just once
            for(t in c) { printf " "t"="c[t]; } # print the patterns and their offsets
            print "";
            delete c;
        }
    }
    delete a; # reset the accumulator arrays so that
    delete c; # offsets do not leak into the next file
}
!/^0x[0-9a-f]*:\$/ { # a verdict line: yara reports a new detected file
    process(); # check the matches collected for the previous file
    f=$0;      # save the new verdict line with the file name
}
END { # the final output line (the last detected file)
    process();
}
The following shows how it would look if you want to use it in a single-line command combined with a Yara call:
yara -r -s ./rule.yara ./samples | awk -F: 'BEGIN { M=16; N=3; f=""; }
/^0x[0-9a-f]*:\$/ { a[strtonum($1)]=$2; }
function condition() { L = length(o); D = o[L]-o[1]; return L==N && D>=0 && D<=M; }
function process() { n = asorti(a,b,"@ind_num_asc"); for(k=1; k<=n; k++) {
c[a[b[k]]]=b[k]; asort(c,o,"@val_num_asc"); if(condition()) { if(f) { print f; f=""; }
for(t in c) { printf " "t"="c[t]; }
print ""; delete c; } } delete a; delete c; }
!/^0x[0-9a-f]*:\$/ { process(); f=$0; }
END { process(); }'
The code above was tested with multiple files and can handle over one million matches within a few seconds. It removes Yara’s hard limit of 200 bytes per pattern, which we observed in the previous approach.
In addition, this solution is easy to modify to include more than three patterns while still measuring the distance between them: the ‘three’ in three_body_problem becomes just a number defined in the global variable N=3.
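To illustrate how the pattern count turns into a plain parameter, here is a rough Python sketch of the same sliding-window check. It is an illustration rather than a port of the tested AWK code: the function name find_clusters and its arguments are hypothetical, and it assumes the rule defines exactly n patterns and that the matches of a single file are passed in as (offset, pattern name) pairs.

```python
def find_clusters(matches, n=3, max_dist=16):
    """Return groups of n distinct patterns whose offsets span at most max_dist bytes.

    matches: iterable of (offset, pattern_name) pairs found in one file.
    """
    latest = {}    # most recent offset seen for each pattern name
    clusters = []
    for offset, name in sorted(matches):  # walk the matches by ascending offset
        latest[name] = offset
        spread = max(latest.values()) - min(latest.values())
        if len(latest) == n and spread <= max_dist:
            clusters.append(dict(latest))
            latest.clear()  # start a fresh window after a hit
    return clusters

# Three patterns packed into 8 bytes, preceded by a stray early match of $x
print(find_clusters([(0, "$x"), (100, "$x"), (104, "$y"), (108, "$z")]))
```

Raising n to 4 or 5 requires no other change, provided the Yara rule itself lists that many patterns.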
Conclusions
Yara offers rich capabilities in detecting patterns and their combination. Without a doubt, it will remain a top tool for malware detection, threat hunting, and artifact search in the future.
As we have demonstrated in this series, there are multiple approaches to solving a problem. Each approach has its pros and cons and may or may not be suitable to address your case. You should explore solutions carefully to make sure you don’t miss any results while avoiding false detections.
When you apply Yara for something less conventional, such as scanning very large files, or data full of many potential matches, you may need to be ready for potential side effects that impact Yara’s performance. My advice is to stay creative and look for your elegant solution!
If you would like to further sharpen your Yara skills, check out our online Yara course for APT hunters.
Appendix: Yara test set file generator
Should you like to try out the proposed solution or keep looking for a better one, feel free to use the following Python script, which generates initial test files with some special cases mentioned in this series. Note that a good solution must detect all sample_good* files and no sample_bad* files.
X = b"\x11\x11\x11\x11"
Y = b"\x22\x22\x22\x22"
Z = b"\x33\x33\x33\x33"
J = b" "  # Junk or blank space

# The "with" statement closes each file, so no explicit f.close() is needed
with open("sample_good1.dat", "wb") as f:
    f.write(X + Y + Z)
with open("sample_good2.dat", "wb") as f:
    f.write(Y + X + J*100 + X + Y + Z)
with open("sample_good3.dat", "wb") as f:
    f.write(Z + J*4 + Y + X)
with open("sample_good4.dat", "wb") as f:
    f.write((X+J)*90 + J*20 + X + J*4 + Y + J*4 + Z)
with open("sample_good5.dat", "wb") as f:
    f.write(X + J*4 + Y + J*4 + Z +
            J*100 + X + J*4 + Y + Z +
            J*100 + X + Y + Z)
with open("sample_good6.dat", "wb") as f:
    f.write((X+J)*100 + J*2 + Y + J*4 + Z)
with open("sample_bad1.dat", "wb") as f:
    f.write(X + 16*J + Y + 16*J + Z)
with open("sample_bad2.dat", "wb") as f:
    f.write(X + 8*J + Y)
with open("sample_bad3.dat", "wb") as f:
    f.write(X + 8*J + Y + 8*J + Z)
with open("sample_bad4.dat", "wb") as f:
    f.write((X + 8*J + Y + 8*J + Z + 8*J) * 1000)
with open("sample_bad5.dat", "wb") as f:
    f.write((X*100 + 8*J + Y + 8*J + Z*100) * 1278)
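To check a candidate solution against this test set, a small helper along the following lines may be handy. It is only a sketch: the function name score_detections is hypothetical, and it assumes you have already collected the list of files your solution reported. It relies solely on the sample_good*/sample_bad* naming convention described above.

```python
import os

def score_detections(all_files, detected_files):
    """Return (missed good samples, falsely detected bad samples) as sorted name lists."""
    names = {os.path.basename(p) for p in all_files}
    detected = {os.path.basename(p) for p in detected_files}
    # A good sample that was not reported is a miss
    missed = sorted(n for n in names if n.startswith("sample_good") and n not in detected)
    # A bad sample that was reported is a false detection
    false_hits = sorted(n for n in names if n.startswith("sample_bad") and n in detected)
    return missed, false_hits

all_files = ["./samples/sample_good1.dat", "./samples/sample_good2.dat", "./samples/sample_bad1.dat"]
detected = ["./samples/sample_good1.dat", "./samples/sample_bad1.dat"]
print(score_detections(all_files, detected))
```

A solution passes the test set when both returned lists are empty.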
Vitaly Kamluk is Director of the Global Research and Analysis Team for Kaspersky Asia Pacific.