
#1 2009-12-11 06:07:23

Xyne
Moderator/TU
Registered: 2008-08-03
Posts: 6,534
Website

A package detection script to help rebuild the local database.

edit: This script is old. See the posts below for updates that work with current systems.

At this point this is a quick "proof of principle" script inspired by this thread.

This will download file lists for the specified repos and then check them against files on your system. It will then print a percentage for each package that indicates how many of that package's files were found on your system. Note that it ignores directories as that would lead to loads of partial false positives.
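
The matching step described above can be sketched in Bash roughly as follows. This is only an illustration under stated assumptions, not the script itself: match_percent is a hypothetical helper that reads relative paths on stdin (as in a package's "files" list), skips the %FILES% header and directories, and prints the integer percentage of paths found under a given root.

```shell
#!/bin/bash
# Hypothetical helper: print the percentage of listed files that exist
# under the given root. Paths are read from stdin, one per line, relative
# to the root (as in a pacman "files" list).
match_percent() {
  local root=${1:-/}
  local total=0 found=0 path
  while IFS= read -r path
  do
    # Skip the section header, blank lines and directories.
    [[ -z $path || $path == '%FILES%' || $path == */ ]] && continue
    total=$((total + 1))
    if [[ -f ${root%/}/$path ]]
    then
      found=$((found + 1))
    fi
  done
  # Avoid division by zero for packages that list no regular files.
  if [[ $total -eq 0 ]]
  then
    echo 0
  else
    echo $((100 * found / total))
  fi
}

# Demo against a throwaway directory tree instead of the real system.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/usr/bin"
touch "$demo_root/usr/bin/foo"
printf '%s\n' '%FILES%' 'usr/' 'usr/bin/' 'usr/bin/foo' 'usr/bin/bar' |
  match_percent "$demo_root"
rm -rf "$demo_root"
```

In the demo only one of the two regular files exists, so the helper prints 50. Directories are left out of the count for the reason given above: most of them exist on any system.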

The idea is that it should provide a decent starting point for rebuilding the local package database. You can safely run it on your system to see the output as it doesn't change anything.

The script depends on Perl and curl. You should change the "$url" variable to a local mirror (I've used archlinux.org along with "`arch`" as a relatively failsafe default). You can also add or remove repos from the "@repos" array.

Again, this is just something that I threw together to see if it works and it's very much a "hands-on" script right now. I might flesh it out and try to add more features later. Note that it cannot determine which packages were installed as dependencies (although I posted a script somewhere on the forum that could explicitly install only top-level packages, which could probably be merged into this if it goes anywhere). It is also limited to repos that contain <repo>.files.tar.gz.

#!/usr/bin/perl
use strict;
use warnings;

use File::Temp qw/tempdir/;

my $url = 'ftp://ftp.archlinux.org/$repo/os/' . `arch`;
chomp $url;
my @repos = qw/core extra community/;

my $tmpdir = tempdir(CLEANUP=>1);

foreach my $repo (@repos)
{
  my $files_url = $url;
  $files_url =~ s/\/\$repo\//\/$repo\//;
  $files_url .= '/' . $repo .'.files.tar.gz';
  `cd "$tmpdir" && curl "$files_url" | bsdtar -xf-`;
}
opendir(my $dh, $tmpdir) or die;
my @pkgs = readdir($dh);
closedir($dh);

my $l = 0;
foreach my $pkg (@pkgs)
{
  my $i = length($pkg);
  $l = $i if $i > $l;
}

foreach my $pkg (sort @pkgs)
{
  next if ($pkg eq '.' or $pkg eq '..');
  my @files = ();
  if (open(my $fh, '<', $tmpdir .'/'. $pkg .'/files'))
  {
    while (defined(my $line = <$fh>))
    {
      chomp $line;
      next if $line eq '%FILES%' or substr($line,-1) eq '/';
      push @files, '/' . $line;
    }
    close($fh);
    my $n = scalar @files;
    next if $n == 0;
    my $i = 0;
    foreach my $file (@files)
    {
      $i++ if -f $file;
    }
    printf("%-${l}s %3d%%\n", $pkg, 100*$i/$n);
  }
  else
  {
    print "error: failed to open $tmpdir/$pkg/files\n";
  }
}

Example output:

perl-xml-xpath-1.13-4                                0%
perl-xmms-0.12-4                                     0%
perl-xyne-arch-0.95-1                              100%
perl-xyne-common-0.05-3                            100%
perl-yaml-0.70-1                                     0%

Last edited by Xyne (2020-09-02 16:30:31)


My Arch Linux Stuff | Forum Etiquette | Community Ethos - Arch is not for everyone


#2 2009-12-11 06:42:20

tavianator
Member
From: Waterloo, ON, Canada
Registered: 2007-08-21
Posts: 858
Website

Re: A package detection script to help rebuild the local database.

Way to pick a language I don't know... I should probably learn perl anyway though.  I'll test this and maybe hack on it a bit tomorrow.


#3 2009-12-11 21:03:01

tavianator
Member
From: Waterloo, ON, Canada
Registered: 2007-08-21
Posts: 858
Website

Re: A package detection script to help rebuild the local database.

Works quite well.  I set it to only return packages with a >= 90% match, and there were only 7 false positives and 8 false negatives out of 1128 packages (excluding AUR packages).
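
A cutoff like this can also be applied without touching the script, by filtering its output. The following is a minimal sketch, assuming the "<package> <NN>%" output format shown in the first post; filter_by_percent is a hypothetical name, not part of the script.

```shell
#!/bin/bash
# Hypothetical filter: keep lines of the form "<package> <NN>%" whose
# percentage (the last field) is at least the given threshold.
filter_by_percent() {
  local threshold=$1
  awk -v min="$threshold" '{
    pct = $NF
    sub(/%$/, "", pct)          # drop the trailing percent sign
    if (pct + 0 >= min) print   # force a numeric comparison
  }'
}

# Demo with two lines in the script's output format.
printf '%s\n' 'perl-xmms-0.12-4        0%' 'perl-xyne-arch-0.95-1 100%' |
  filter_by_percent 90
```

With a threshold of 90 the demo prints only the perl-xyne-arch line.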


#4 2009-12-11 23:43:34

Xyne
Moderator/TU
Registered: 2008-08-03
Posts: 6,534
Website

Re: A package detection script to help rebuild the local database.

Thanks for the feedback, tavianator.

Which false positives did it detect? Were they variants of installed packages?

I think the false negatives are due to packages which manipulate their own files during or after installation. They should still show up in the list though, albeit with a lower percentage. I don't think there's any way to work around that.



#5 2009-12-12 00:15:55

tavianator
Member
From: Waterloo, ON, Canada
Registered: 2007-08-21
Posts: 858
Website

Re: A package detection script to help rebuild the local database.

The false positives were gimp-devel, id3lib-rcc, libxft-lcd, links, taglib-rcc, ttf-freefont, and vncviewer-jar.  So yeah, variants of installed packages.  It also does a good job of detecting the official versions of git packages I have installed, which is expected but still cool.


#6 2020-08-26 23:03:07

TheAmigo
Member
Registered: 2008-04-08
Posts: 65

Re: A package detection script to help rebuild the local database.

I found this thread from the wiki and gave the script a go.  As tends to happen, things have changed in the past 10 years so it didn't work as-is, but still pretty close.

Due to a minor typo (missing a trailing / from an rsync command), I accidentally deleted many files under /var... including /var/log and /var/cache.  Running pacman (or yay) with --overwrite '*' has worked pretty well so far at getting pacman's db built back up.  But I can't remember all the packages I have installed.  This script fills in that gap.

The only two things that prevented the script from running were:
1) I don't have 'arch' installed... is that even a thing? it's nearly impossible to search for (AUR has >1600 matches and the one named 'arch' isn't it)
2) ftp.archlinux.org is no more

Making those two updates was easy, but while I was at it, I added a few extra features:
1) reads first mirror from /etc/pacman.d/mirrorlist
2) optional command-line override of mirrorlist file or directly specifying a URL
3) defaults to showing only packages with >= 1% file match (with optional command-line override)
4) added a -l arg to list just package names (without version and build) to make mass-reinstalling easier
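
For readers who prefer shell, the mirror lookup from item 1 can be sketched in Bash like this. It is an illustration only: first_mirror and expand_mirror are hypothetical helper names, and the demo uses a made-up mirrorlist with example hostnames rather than the real /etc/pacman.d/mirrorlist.

```shell
#!/bin/bash
# Hypothetical helper: print the first uncommented "Server = ..." URL
# found in a mirrorlist file.
first_mirror() {
  local mirrorlist=${1:-/etc/pacman.d/mirrorlist}
  sed -n 's/^[[:space:]]*Server[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*/\1/p' \
    "$mirrorlist" | head -n 1
}

# Hypothetical helper: expand the literal $repo and $arch placeholders
# that mirrorlist URLs contain.
expand_mirror() {
  local url=$1 repo=$2
  local arch=${3:-$(uname -m)}
  url=${url//\$repo/$repo}
  url=${url//\$arch/$arch}
  echo "$url"
}

# Demo with a temporary mirrorlist (hostnames are placeholders).
demo_list=$(mktemp)
printf '%s\n' \
  '#Server = https://commented.example/$repo/os/$arch' \
  'Server = https://mirror.example/$repo/os/$arch' > "$demo_list"
expand_mirror "$(first_mirror "$demo_list")" core x86_64
rm -f "$demo_list"
```

The demo skips the commented line and prints https://mirror.example/core/os/x86_64.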

For the benefit of anyone unfortunate enough to end up in the same situation as me, here's the updated code:

#!/usr/bin/perl
use strict;
use warnings;

use File::Temp qw/tempdir/;
use POSIX qw/uname/;
use Getopt::Long;

my $arch = (uname())[4];
my $mirrorlist = '/etc/pacman.d/mirrorlist';
my ($url, $list, $packages);
my $pct = 1;

GetOptions(
	'list'      => \$list,
	'mirrors=s' => \$mirrorlist,
	'url=s'     => \$url,
	'pct=i'     => \$pct,
) or die "
Usage: $0 [-l] [-m mirrorlist|-u URL] [-p PCT]

-m mirrorlist  Specify file in which to find a mirror (def: /etc/pacman.d/mirrorlist)
-u URL         Use this URL instead of reading the mirrorlist file
-l             Output only package names, not versions or percentages
-p PCT         Only include packages with >=PCT% match in output
";

$url ||= findMirror($mirrorlist);
$url =~ s/\$arch/$arch/;

my @repos = qw/core extra community/;
my $tmpdir = tempdir(CLEANUP=>1);

# Download list of files in packages
foreach my $repo (@repos) {
	my $files_url = $url;
	$files_url =~ s/\/\$repo\//\/$repo\//;
	$files_url .= '/' . $repo .'.files.tar.gz';
	chdir $tmpdir;
	`curl "$files_url" | bsdtar -xf-`;
}
opendir(my $dh, $tmpdir) or die;
while (my $pkg = readdir($dh)) {
	next if $pkg eq '.' || $pkg eq '..';
	$packages->{$pkg} = findFiles($pkg);
}
closedir($dh);

# Show results
# Find longest package name for output formatting
my $longest = longestKey($packages);
for my $pkg (sort keys %$packages) {
	next if $packages->{$pkg}{pct} < $pct;
	if ($list) {
		print "$packages->{$pkg}{name}\n";
	} else {
		printf("%-${longest}s %3d%%\n", $pkg, $packages->{$pkg}{pct});
	}
	}
}

# Check each package to see if its files exist
sub findFiles {
	my ($pkg) = @_;
	my @files = ();
	my $pct = 0;
	my $name = '';

	# Get the list of files and see which ones exist in the filesystem
	if (open(my $fh, '<', "$tmpdir/$pkg/files")) {
		while (defined(my $line = <$fh>)) {
			chomp $line;
			next if $line eq '%FILES%' or substr($line,-1) eq '/';
			push @files, "/$line";
		}
		close($fh);
		my $n = scalar @files;
		if ($n) {
			my $i = 0;
			foreach my $file (@files) {
				$i++ if -f $file;
			}
			$pct = int(100*$i/$n);
		}
	} else {
		print "Warning: failed to open $tmpdir/$pkg/files: $!\n";
	}

	# Find the package name in the desc file
	if (open(my $fh, '<', "$tmpdir/$pkg/desc")) {
		# Slurp the whole desc file and take the line that follows %NAME%
		local $/;
		my $desc = <$fh> // '';
		$name = $1 if $desc =~ /\%NAME\%\s*\R(.*)/;
		close($fh);
	} else {
		print "Warning: failed to open $tmpdir/$pkg/desc: $!\n";
	}
	return {pct => $pct, name => $name};
}

# Returns the first uncommented mirror from the mirrorlist file
sub findMirror {
	my ($filename) = @_;
	my $url;

	open(my $fh, '<', $filename) || die "failed to open $filename $!\n";
	while (<$fh>) {
		next unless /^\s*Server\s*=\s*(\S+)/i;
		$url = $1;
		last;
	}
	close($fh);
	return $url || die "No mirror found in $filename, exiting.\n";
}

# Return the longest key name in the given hashref
sub longestKey {
	my ($ref) = @_;
	my $longest = 0;
	for my $key (keys %$ref) {
		my $len = length($key);
		$longest = $len if $len > $longest;
	}
	return $longest;
}

P.S. Do I get an achievement for reviving a thread >10years old?


#7 2020-08-27 16:24:52

eschwartz
Trusted User/Bug Wrangler
Registered: 2014-08-08
Posts: 3,610

Re: A package detection script to help rebuild the local database.

The /usr/bin/arch program would have been installed by coreutils, but it's disabled by default, see https://github.com/coreutils/coreutils/ … ams.sh#L18

It's exactly equivalent to uname -m and should never be used as it's not portable whereas uname -m is a POSIX requirement. But it used to be installed on a number of different Linux distros so people likely assumed it worked. Of course using perl builtins is better than shelling out. :D (My perl knowledge is approximately nonexistent, maybe this was not an option in 2009.)

P.S. Do I get an achievement for reviving a thread >10years old?

Well, it's definitely pretty rare to do so productively. :D Nice work.


Managing AUR repos The Right Way -- aurpublish (now a standalone tool)


#8 2020-09-02 16:28:37

Xyne
Moderator/TU
Registered: 2008-08-03
Posts: 6,534
Website

Re: A package detection script to help rebuild the local database.

This also reminds me that I haven't touched Perl for over 10 years :P
I'll edit the first post to point to the updated script.

edit
Here's a variant in Bash that can use the Arch Linux Archive to retrieve dated file databases for better matches if you know the approximate date of your last system upgrade. This is important for packages that install to versioned directories (e.g. /usr/lib/foo-xx.xx) because all paths will mismatch if the versions mismatch, resulting in a false negative.
It also includes some command line options inspired by TheAmigo's improved version of the Perl script (percent threshold, only print package names).

#!/bin/bash
set -eu

#------------------------------------------------------------------------------#
#                                Configuration                                 #
#------------------------------------------------------------------------------#

# Default values for command-line arguments.

# Default repos to check. The matching file databases must be present in the
# local sync database or available on the Arch Linux Archive:
# https://archive.archlinux.org/repos/
REPOS=(core extra community)

# The date to use when retrieving databases from the Arch Linux Archive, in the
# format yyyy/mm/dd. If no date is given, then today's date will be used.
DATE=

# Set the system root.
ROOT=/

# The pacman sync db directory to check for existing file databases.
SYNC_DB_DIR=/var/lib/pacman/sync/

# Set match percentage threshold (matching files / total files) for a package to
# be considered a match and printed.
MATCH_THRESHOLD=50

# If true, only print matching package names.
QUIET=false



#------------------------------------------------------------------------------#
#                                  Functions                                   #
#------------------------------------------------------------------------------#

function check_files_db()
{
  local files_db=$1
  local entry
  local pkg
  local pkgname

  bsdtar -tf "$files_db" | while read entry
  do
    if [[ $entry =~ /files$ ]]
    then
      pkg=${entry%/*}
      pkgname=${pkg%-*-*}
      # Feed bsdtar's output to the loop via process substitution so that the
      # loop runs in the current shell and the counters persist after it.
      local match=0
      local number=0
      local path
      local pct
      while read path
      do
        # Skip the %FILES% header, blank lines and directories.
        if [[ -z $path || $path == '%FILES%' || $path =~ /$ ]]
        then
          continue
        fi
        ((number+=1))
        if [[ -e ${ROOT%/}/$path ]]
        then
          ((match+=1))
        fi
      done < <(bsdtar -Oxf "$files_db" "$entry")
      if [[ $match -gt 0 ]]
      then
        pct=$(bc -l <<< "scale=0; 100 * $match / $number")
        if [[ $pct -ge $MATCH_THRESHOLD ]]
        then
          if $QUIET
          then
            echo "$pkgname"
          else
            echo "$pkg ${pct}%"
          fi
        fi
      fi
    fi
  done
}

# Usage: get_files_db <repo> [<yyyy/mm/dd>]
function get_files_db()
{
  local repo=$1
  local date=${2:-$(date +'%Y/%m/%d')}
  local arch=$(uname -m)
  wget -N "https://archive.archlinux.org/repos/$date/$repo/os/$arch/$repo.files"
}



function display_help()
{
  cat <<HELP
ABOUT

This is a tool to help recover the local pacman database. It will check
installed files against paths in the pacman file databases to estimate which
packages are probably installed on the system. The matches are based entirely on
matching file paths.

For each package, the percent of matching files is calculated. If it meets the
threshold, the package is printed to stdout.

The list of matching packages is only a starting point for
recovering the local database. Some packages include the same paths so the user
will have to determine which of the matches, if any, are likely to be installed
on the system. Also note that no information about the install reason can be
determined by matching paths. It is up to the user to figure out which packages
were installed explicitly and which were installed as dependencies.

Also note that if the installed version of a package does not match the version
in the file database, then all paths that contain the version number
(e.g. /usr/lib/foo-xx.xx/...) will fail to match. It is therefore important to
determine which file database matches the installed files.

Local file databases in the sync directory (/var/lib/pacman/sync by default)
will be consulted first if they exist as they are likely to correspond to the
installed version if everything was synchronized together (pacman -Syu; pacman
-Fy). This can be disabled by passing an empty path to the "-s" option.

If the local file databases do not exist, or the check is disabled, an attempt
will be made to download the file database from the Arch Linux Archive
(https://archive.archlinux.org/repos/). A date may be given in the format
yyyy/mm/dd to select a specific date, otherwise the current date will be used.
If you know the date of your last system upgrade (even approximately), pass it
with the "-d" option.


USAGE

  ${0} [options] [repo repo ...]

If no repos are given, the following will be checked: ${REPOS[@]}


OPTIONS

  -d yyyy/mm/dd
    Set the date for retrieving databases from the Arch Linux Archive.
    Default: today's date

  -r /path/to/root
    The system root in which to match packages.
    Default: "/".

  -s /path/to/sync_db/dir
    The path to the pacman sync database directory to query local file
    databases. These are normally downloaded with "pacman -Fy". See the notes
    above about matching database package versions to installed versions. To
    disable local checks, set this to an empty string.
    Default: /var/lib/pacman/sync

  -t <percent>
    The percent of files per package that must exist on the local system for the
    package to match.
    Default: 50

  -q
    Only print package names. This can be used to create an install list that
    can be piped to pacman. E.g.
    ${0} -q | pacman -S -

HELP
  exit "$1"
}



#------------------------------------------------------------------------------#
#                                     Main                                     #
#------------------------------------------------------------------------------#

while getopts 'hd:r:s:t:q' flag
do
  case "$flag" in
    d) DATE=$OPTARG ;;
    r) ROOT=$OPTARG ;;
    s) SYNC_DB_DIR=$OPTARG ;;
    t) MATCH_THRESHOLD=$OPTARG ;;
    q) QUIET=true ;;
    h) display_help 0 ;;
    *) display_help 1 ;;
  esac
done

shift $((OPTIND - 1))
if [[ $# -gt 0 ]]
then
  REPOS=("$@")
fi

for repo in "${REPOS[@]}"
do
  files_db=
  if [[ ! -z $SYNC_DB_DIR ]]
  then
    # Check for local file databases first.
    files_db=${SYNC_DB_DIR%/}/${repo}.files
  fi
  if [[ -z $files_db || ! -e $files_db ]]
  then
    get_files_db "$repo" "$DATE"
    files_db=${repo}.files
  fi
  if $QUIET
  then
    check_files_db "$files_db"
  else
    date -r "$files_db"  +"$files_db %F %R %Z"
    echo '--------------------'
    check_files_db "$files_db"
    echo ''
  fi
done
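
Two small pieces of the script above can be tried in isolation: building the Archive URL for a given date, and reducing a database entry name to a bare package name. The sketch below mirrors the URL layout and the parameter expansion used in the script; archive_files_url and strip_version are hypothetical helper names introduced only for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: build the Arch Linux Archive URL of a repo's file
# database for a given date (yyyy/mm/dd). Defaults to today's date and
# the machine architecture reported by uname -m.
archive_files_url() {
  local repo=$1
  local date=${2:-$(date +'%Y/%m/%d')}
  local arch=${3:-$(uname -m)}
  echo "https://archive.archlinux.org/repos/$date/$repo/os/$arch/$repo.files"
}

# Hypothetical helper: strip the trailing "-<pkgver>-<pkgrel>" from a
# database entry name, leaving only the package name.
strip_version() {
  local pkg=$1
  echo "${pkg%-*-*}"
}

archive_files_url core 2020/08/01 x86_64
strip_version perl-xyne-arch-0.95-1
```

The demo prints https://archive.archlinux.org/repos/2020/08/01/core/os/x86_64/core.files followed by perl-xyne-arch. The suffix stripping works because neither pkgver nor pkgrel may contain a hyphen, so the last two hyphen-separated fields are always the version and release.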

To anyone browsing this thread 10 years from now, greetings from 2020. Hopefully life is back to normal and society is mostly intact if you're searching the internet to repair a pacman database. Then again, maybe there was a total collapse and you're reading this off a charred server disk to restore the local database after a gamma radiation pulse from yet another member of your group getting cabin fever and opening the bunker door. I hope that it's not the system running your air and water filtration. At least the Arch Linux Archive is pretty reliable so that's probably still up and running. Without the internet though, you'll need a rad suit and an ethernet cable to connect to it. Good luck and congrats on surviving!

Last edited by Xyne (2020-09-02 19:36:40)


