I'm going to go out on a limb here and tell you to let your synthesizer optimize it. Other than that, you can run a minimizer (e.g. espresso) on your table and then code the result in VHDL.
My guess is that this is what you should do when targeting an FPGA:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity bit_count is
        port (
            a,b,c,d: in  std_logic;
            x,y,z:   out std_logic
        );
    end entity;

    architecture lut of bit_count is
        subtype lutin  is std_logic_vector (3 downto 0);
        subtype lutout is std_logic_vector (2 downto 0);
        type lut is array (natural range 0 to 15) of lutout;
        -- number of '1' bits in each 4-bit index, 0 to 15
        constant bitcount: lut := (
            "000", "001", "001", "010",
            "001", "010", "010", "011",
            "001", "010", "010", "011",
            "010", "011", "011", "100"
        );
        signal temp: std_logic_vector (2 downto 0);
    begin
        -- index the table with a&b&c&d (a is the most significant bit)
        temp <= bitcount( TO_INTEGER ( unsigned (lutin'(a&b&c&d) ) ) );
        (x, y, z) <= lutout'(temp(2), temp(1), temp(0));
    end architecture;

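Another way to leave the optimization to the synthesizer is to describe the count behaviorally and let the tool derive the table itself. The following is only a sketch under assumptions: the architecture name `behave` is made up for illustration, it relies on the `bit_count` entity declared above, and whether a given synthesizer reduces it to the same logic as the lookup table is something I have not checked.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- hypothetical third architecture for the bit_count entity above
architecture behave of bit_count is
begin
    count: process (a, b, c, d)
        variable inputs: std_logic_vector (3 downto 0);
        variable total:  unsigned (2 downto 0);
    begin
        inputs := a & b & c & d;
        total  := (others => '0');
        -- add one for every input bit that is '1'
        for i in inputs'range loop
            if inputs(i) = '1' then
                total := total + 1;
            end if;
        end loop;
        x <= total(2);
        y <= total(1);
        z <= total(0);
    end process;
end architecture;
```

The loop has fixed bounds, so it describes purely combinational logic. The input order does not matter here because a bit count is symmetric in its inputs.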
Failing that, I think hand-optimizing it as a ROM is likely to come close in terms of gate count:

    --  0000   0001   0010   0011
    -- "000", "001", "001", "010",
    --  0100   0101   0110   0111
    -- "001", "010", "010", "011",
    --  1000   1001   1010   1011
    -- "001", "010", "010", "011",
    --  1100   1101   1110   1111
    -- "010", "011", "011", "100"

    -- output          input
    -----------------------
    -- bit 0 is true   0001 0010 0100 0111 1000 1011 1101 1110
    -- bit 1           0011 0101 0110 0111 1001 1010 1011 1100 1101 1110
    -- bit 2           1111

    architecture rom of bit_count is
        signal t0,t1,t2:    std_logic;
        signal t4,t7,t8:    std_logic;
        signal t11,t13,t14: std_logic;
        signal t15:         std_logic;
    begin
        -- terms (numbered with a as the least significant bit; the
        -- bit order does not matter because a bit count is symmetric)
        t0  <= not a and not b and not c and not d;
        t1  <=     a and not b and not c and not d;
        t2  <= not a and     b and not c and not d;
        -- t3  <=     a and     b and not c and not d;
        t4  <= not a and not b and     c and not d;
        -- t5  <=     a and not b and     c and not d;
        -- t6  <= not a and     b and     c and not d;
        t7  <=     a and     b and     c and not d;
        t8  <= not a and not b and not c and     d;
        -- t9  <=     a and not b and not c and     d;
        -- t10 <= not a and     b and not c and     d;
        t11 <=     a and     b and not c and     d;
        -- t12 <= not a and not b and     c and     d;
        t13 <=     a and not b and     c and     d;
        t14 <= not a and     b and     c and     d;
        t15 <=     a and     b and     c and     d;

        -- outputs
        x <= t15;                                        -- count = 4
        y <= not ( t0 or t1 or t2 or t4 or t8 or t15 );  -- count = 2 or 3
        z <= t1 or t2 or t4 or t7 or t8 or t11 or t13 or t14;  -- count = 1 or 3
    end architecture;

It should be fewer gates than your chained multiplexers and a bit flatter (faster).
The two architectures have been analyzed but not simulated. It's easy to make errors when hand-coding at the gate level.
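Since neither architecture has been simulated, an exhaustive check is cheap: there are only 16 input combinations. Below is a minimal self-checking testbench sketch. The names `bit_count_tb`, `sim`, `dut_lut` and `dut_rom` are made up for illustration, and it assumes the entity and both architectures have been compiled into library `work`.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity bit_count_tb is
end entity;

architecture sim of bit_count_tb is
    signal inp: std_logic_vector (3 downto 0) := (others => '0');
    signal x_lut, y_lut, z_lut: std_logic;
    signal x_rom, y_rom, z_rom: std_logic;
begin
    -- one instance of each architecture, driven by the same inputs
    dut_lut: entity work.bit_count(lut)
        port map (a => inp(3), b => inp(2), c => inp(1), d => inp(0),
                  x => x_lut, y => y_lut, z => z_lut);
    dut_rom: entity work.bit_count(rom)
        port map (a => inp(3), b => inp(2), c => inp(1), d => inp(0),
                  x => x_rom, y => y_rom, z => z_rom);

    stimulus: process
        variable expected: natural;
        variable got_lut, got_rom: std_logic_vector (2 downto 0);
    begin
        for i in 0 to 15 loop
            inp <= std_logic_vector(to_unsigned(i, 4));
            wait for 10 ns;
            -- reference model: count the '1' bits of the applied input
            expected := 0;
            for k in inp'range loop
                if inp(k) = '1' then
                    expected := expected + 1;
                end if;
            end loop;
            got_lut := x_lut & y_lut & z_lut;
            got_rom := x_rom & y_rom & z_rom;
            assert to_integer(unsigned(got_lut)) = expected
                report "lut mismatch for input " & integer'image(i)
                severity error;
            assert to_integer(unsigned(got_rom)) = expected
                report "rom mismatch for input " & integer'image(i)
                severity error;
        end loop;
        report "bit_count_tb finished";
        wait;  -- stop the process so the simulation can end
    end process;
end architecture;
```

Any table entry or product term that disagrees with the actual bit count shows up as an assertion error naming the offending input value.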